
An Introduction to Bayesian Data Analysis for Cognitive Science

Bruno Nicenboim, Daniel Schad, and Shravan Vasishth

2023-02-18

Preface

This book is intended to be a relatively gentle introduction to carrying out Bayesian data
analysis and cognitive modeling using the probabilistic programming language Stan
(Carpenter et al. 2017), and the front-end to Stan called brms (Bürkner 2019). Our target
audience is cognitive scientists (e.g., linguists and psychologists) who carry out planned
behavioral experiments, and who are interested in learning the Bayesian data analysis
methodology from the ground up and in a principled manner. Our aim is to make Bayesian
statistics a standard part of the data analysis toolkit for experimental linguistics,
psycholinguistics, psychology, and related disciplines.

Many excellent introductory textbooks already exist for Bayesian data analysis. Why write yet
another book? Our text is different from other attempts in two respects. First, our main focus is
on showing how to analyze data from planned experiments involving repeated measures; this
type of experimental data involves unique complexities. We provide many examples of data
sets involving time measurements (e.g., self-paced reading, eye-tracking-while-reading, voice
onset time), event-related potentials, pupil sizes, accuracies (e.g., recall tasks, yes-no
questions), categorical answers (e.g., picture naming), choice-reaction time (e.g., Stroop task,
motion detection task), etc. Second, from the very outset, we stress a particular workflow that
has as its centerpiece simulating data; we aim to teach a philosophy that involves thinking
hard about the assumed underlying generative process, even before the data are collected.
The data analysis approach that we hope to teach through this book involves a cycle of prior
predictive and posterior predictive checks, and model validation using simulated data. We try
to inculcate a sense of how inferences can be drawn from the posterior distribution of
theoretically interesting parameters without resorting to binary decisions like “significant” or
“not-significant”. We are hopeful that this will set a new standard for reporting and interpreting
results of data analyses in a more nuanced manner, and lead to more measured claims in the
published literature.

Please report typos, errors, or suggestions for improvement at https://github.com/vasishth/bayescogsci/issues.

Why read this book, and what is its target audience?

A commonly-held belief in psychology, psycholinguistics, and other areas is that statistical data analysis is secondary to the science, and should be quick and easy. For example, a senior mathematical psychologist once told the last author of this book: "if you need to run anything more complicated than a paired t-test, you are asking the wrong question." The most colorful version of this sentiment was expressed by a former editor-in-chief of the Journal of Memory and Language. The gist of the tweet was that statistical analysis should be like going to the toilet, and as a scientist, one should not be expected to invest too much time into studying statistics. If one really believes that statistics should be like going to the toilet—quick and dirty—then one should not be surprised if the end-result turns out to be crap.

The target audience for this book is students and researchers who want to treat statistics as
an equal partner in their scientific work. We expect that the reader is willing to take the time to
both understand and to run the computational analyses.

Any rigorous introduction to Bayesian data analysis requires at least a passive knowledge of
probability theory, calculus, and linear algebra. However, we do not require that the reader has
this background when they start the book. Instead, the relevant ideas are introduced informally
and just in time, as soon as they are needed. The reader is never required to have an active
ability to solve probability problems, to solve integrals or compute derivatives, or to carry out
relatively complex matrix computations (such as inverting matrices) by hand.

What we do expect is familiarity with arithmetic, basic set theory and elementary probability
theory (e.g., sum and product rules, conditional probability), simple matrix operations like
addition and multiplication, and simple algebraic operations. A quick look through chapter 1 of
Gill (2006) before starting this book is highly recommended. We also presuppose that, when
the need arises, the reader is willing to look up concepts that they might have forgotten
(e.g., logarithms).

We also expect that the reader already knows and/or is willing to learn enough of the
programming language R (R Core Team 2019) to reproduce the examples presented and to
carry out the exercises. If the reader is completely unfamiliar with R, before starting this book they should first consult books like R for Data Science and Efficient R Programming.

We also assume that the reader has encountered simple linear modeling, and linear mixed
models (Bates, Mächler, et al. 2015a; Baayen, Davidson, and Bates 2008). What this means
in practice is that the reader should have used the lm() and lmer() functions in R. A
passing acquaintance with basic statistical concepts, like the correlation between two
variables, is also taken for granted.

This book is not appropriate for complete beginners to data analysis. Newcomers to data
analysis should start with a freely available textbook like Kerns (2014), and then read our
introduction to frequentist data analysis, which is also available freely online (Vasishth et al.
2021). This latter book will prepare the reader well for the material presented here.

Developing the right mindset for this book

One very important characteristic that the reader should bring to this book is a can-do spirit.
There will be many places where the going will get tough, and the reader will have to play
around with the material, or refresh their understanding of arithmetic or middle-school algebra.
The basic principles of such a can-do spirit are nicely summarized in the book by Burger and
Starbird (2012); also see Levy (2021). Although we cannot summarize the insights from these
books in a few words, inspired by the Burger and Starbird (2012) book, here is a short
enumeration of the kind of mindset the reader will need to cultivate:

Spend time on the basic, apparently easy material; make sure you understand it deeply.
Look for gaps in your understanding. Reading different presentations of the same material
(in different books or articles) can yield new insights.
Let mistakes and errors be your teacher. We instinctively recoil from our mistakes, but
errors are ultimately our friends; they have the potential to teach us more than our correct
answers can. In this sense, a correct solution can be less interesting than an incorrect
one.
When you are intimidated by some exercise or problem, give up and admit defeat
immediately. This relaxes the mind; you’ve already given up, there’s nothing more to do.
Then, after a while, try to solve a simpler version of the problem. Sometimes, it is useful
to break the problem down to smaller parts, each of which may be easier to solve.
Create your own questions. Don’t wait to be asked questions; develop your own problems
and then try to solve them.
Don’t expect to understand everything in the first pass. Just mentally note the gaps in
your understanding, and return to them later and work on these gaps.
Step back periodically to try to sketch out a broader picture of what you are learning.
Writing down what you know, without looking up anything, is one helpful way to achieve
this. Don’t wait for the teacher to give you bullet-point summaries of what you should have
learned; develop such summaries yourself.
Develop the art of finding information. When confronted with something you don't know, or with some obscure error message, use Google to find some answers.

As instructors, we have noticed over the years that students with such a mindset generally do
very well. Some students already have that spirit, but others need to explicitly develop it. We
firmly believe that everyone can develop such a mindset; but one may have to work on
acquiring it.

In any case, such an attitude is absolutely necessary for a book of this sort.

How to read this book

The chapters in this book are intended to be read in sequence, but during the first pass
through the book, the reader should feel free to completely skip the boxes. These boxes
provide a more formal development (useful to transition to more advanced textbooks like
Gelman et al. 2014), or deal with tangential aspects of the topics presented in the chapter.

Here are some suggested paths through this book, depending on the reader’s goals:

For a short course for complete beginners, read chapters 1 to 5. We usually cover these five chapters in a five-day summer school course that we teach annually. Most of the material in these chapters is also covered in a free four-week course available online: https://open.hpi.de/courses/bayesian-statistics2023.
For a course that focuses on regression models with the R package brms, read chapters 1 to 9 and, optionally, 15.
For an advanced course that focuses on complex models involving Stan, read chapters 10 to 20.

Some conventions used in this book

We adopt the following conventions:

All distribution names are lower-case unless they are also a proper name (e.g., Poisson,
Bernoulli).
The univariate normal distribution is parameterized by the mean and standard deviation
(not variance).
The code for figures is provided only in some cases, where we consider it to be
pedagogically useful. In other cases, the code remains hidden, but it can be found in the
web version of the book. Notice that all the R code from the book can be extracted from
the Rmd source files for each chapter, which are released with the book.

Online materials

The entire book, including all data and source code, is available online for free at https://vasishth.github.io/bayescogsci/book. The solutions to exercises will be made available on request.

Software needed

Before you start, please install:

R and RStudio, or any other Integrated Development Environment that you prefer, such as Visual Studio Code or Emacs Speaks Statistics.
The R package rstan. At the time of writing this book, the CRAN version of rstan lags behind the latest developments in Stan, so it is recommended to install rstan from https://mc-stan.org/r-packages/ as indicated in https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started.
The R packages dplyr, purrr, tidyr, extraDistr, brms, hypr, and lme4 are used in many chapters of the book and can be installed the usual way: install.packages(c("dplyr", "purrr", "tidyr", "extraDistr", "brms", "hypr", "lme4")).
The following R packages are optional: tictoc, rootSolve, SHELF, cmdstanr, and SBC.
Some packages, such as intoo, barsurf, bivariate, SIN, and rethinking, may require manual installation from archived or GitHub versions.
The data and Stan models used in this book can be installed using remotes::install_github("bnicenboim/bcogsci"). This command uses the function install_github from the package remotes (thus this package should be in the system as well).
In every R session, load these packages, and set the options shown below for Stan.
library(MASS)
## be careful to load dplyr after MASS
library(dplyr)
library(tidyr)
library(purrr)
library(extraDistr)
library(ggplot2)
library(loo)
library(bridgesampling)
library(brms)
library(bayesplot)
library(tictoc)
library(hypr)
library(bcogsci)
library(lme4)
library(rstan)
# This package is optional, see https://mc-stan.org/cmdstanr/:
library(cmdstanr)
# This package is optional, see https://hyunjimoon.github.io/SBC/:
library(SBC)
library(SHELF)
library(rootSolve)
## Save compiled models:
rstan_options(auto_write = FALSE)
## Parallelize the chains using all the cores:
options(mc.cores = parallel::detectCores())
# To solve some conflicts between packages:
select <- dplyr::select
extract <- rstan::extract
Acknowledgments

We are grateful to the many generations of students at the University of Potsdam, various
summer schools at ESSLLI, the LOT winter school, other short courses we have taught at
various institutions, and the annual summer school on Statistical Methods for Linguistics and Psychology (SMLP) held at Potsdam, Germany. The participants in these courses helped us considerably in improving the material presented here. A special thanks to Anna Laurinavichyute, Paula Lissón, and Himanshu Yadav for co-teaching the Bayesian courses
at SMLP. We are also grateful to members of Vasishth lab, especially Dorothea Pregla, for
comments on earlier drafts of this book. We would also like to thank Christian Robert
(otherwise known as Xi’an), Robin Ryder, Nicolas Chopin, Michael Betancourt, Andrew
Gelman, and the Stan developers (especially Bob Carpenter and Paul-Christian Bürkner) for
their advice; to Pavel Logačev for his feedback, and Athanassios Protopapas, Patricia
Mirabile, Masataka Ogawa, Alex Swiderski, Andrew Ellis, Jakub Szewczyk, Chi Hou Pau, Alec
Shaw, Patrick Wen, Riccardo Fusaroli, Abdulrahman Dallak, Elizabeth Pankratz, Jean-Pierre
Haeberly, Chris Hammill, Florian Wickelmaier, Ole Seeth, Jules Bouton, Siqi Zheng, Michael
Gaunt, Benjamin Senst, Chris Moreh, Richard Hatcher, and Noelia Stetie for catching typos,
unclear passages, and errors in the book. Thanks also go to Jeremy Oakley and other
statisticians at the School of Mathematics and Statistics, University of Sheffield, UK, for helpful
discussions, and ideas for exercises that were inspired from the MSc program taught online at
Sheffield.

This book would have been impossible to write without the following software: R (Version
4.2.2; R Core Team 2019) and the R-packages afex (Singmann et al. 2020), barsurf (Version
0.7.0; Spurdle 2020a), bayesplot (Version 1.9.0; Gabry and Mahr 2019), bcogsci (Version
0.0.0.9000; Nicenboim, Schad, and Vasishth 2020), bibtex (Version 0.5.0; Francois 2017),
bivariate (Version 0.7.0; Spurdle 2020b), bookdown (Version 0.28; Xie 2019a), bridgesampling
(Version 1.1.2; Gronau, Singmann, and Wagenmakers 2020), brms (Version 2.17.0; Bürkner
2019), citr (Aust 2019), cmdstanr (Version 0.5.3; Gabry and Češnovar 2021), cowplot (Version
1.1.1; Wilke 2020), digest (Version 0.6.31; Antoine Lucas et al. 2021), dplyr (Version 1.1.0;
Wickham, François, et al. 2019), DT (Version 0.24; Xie, Cheng, and Tan 2019), extraDistr
(Version 1.9.1; Wolodzko 2019), forcats (Version 1.0.0; Wickham 2019a), gdtools (Gohel et al.
2019), ggplot2 (Version 3.4.0; Wickham, Chang, et al. 2019), gridExtra (Version 2.3; Auguie
2017), htmlwidgets (Version 1.5.4; Vaidyanathan et al. 2018), hypr (Version 0.2.3; Schad et al.
2019; Rabe, Vasishth, Hohenstein, Kliegl, and Schad 2020a), intoo (Version 0.4.0; Spurdle
and Bode 2020), kableExtra (Version 1.3.4; Zhu 2019), knitr (Version 1.42; Xie 2019b), lme4
(Version 1.1.31; Bates, Mächler, et al. 2015b), loo (Version 2.5.1; Vehtari, Gelman, and Gabry
2017a; Yao et al. 2017), MASS (Version 7.3.58.2; Ripley 2019), Matrix (Version 1.5.3; Bates
and Maechler 2019), miniUI (Version 0.1.1.1; Cheng 2018), papaja (Version 0.1.1; Aust and
Barth 2020), pdftools (Version 3.3.2; Ooms 2021), purrr (Version 1.0.1; Henry and Wickham
2019), Rcpp (Version 1.0.10; Eddelbuettel et al. 2019), readr (Version 2.1.4; Wickham, Hester,
and Francois 2018), RefManageR (Version 1.4.0; McLean 2017), remotes (Version 2.4.2;
Hester et al. 2021), rethinking (Version 2.21; McElreath 2021), rmarkdown (Version 2.20;
Allaire et al. 2019), rootSolve (Version 1.8.2.3; Soetaert and Herman 2009), rstan (Version
2.26.13; Guo, Gabry, and Goodrich 2019), SBC (Version 0.1.1.9000; Kim et al. 2022), servr
(Version 0.24; Xie 2019c), SHELF (Version 1.8.0; Oakley 2021), SIN (Version 0.6; Drton
2013), StanHeaders (Version 2.26.13; Goodrich et al. 2019), stringr (Version 1.5.0; Wickham
2019b), texPreview (Sidi and Polhamus 2020), tibble (Version 3.1.8; Müller and Wickham
2020), tictoc (Version 1.0.1; Izrailev 2014), tidyr (Version 1.2.1; Wickham and Henry 2019),
tidyverse (Version 1.3.2; Wickham, Averick, et al. 2019), tinylabels (Version 0.2.3; Barth 2022),
and webshot (Version 0.5.3; Chang 2018).

Bruno Nicenboim (Tilburg, The Netherlands), Daniel Schad (Potsdam, Germany), Shravan
Vasishth (Potsdam, Germany)

References

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2019. rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.

Antoine Lucas, Dirk Eddelbuettel with contributions by, Jarek Tuszynski, Henrik Bengtsson, Simon Urbanek, Mario Frasca, Bryan Lewis, Murray Stokely, et al. 2021. digest: Create Compact Hash Digests of R Objects. https://CRAN.R-project.org/package=digest.

Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra.

Aust, Frederik. 2019. citr: RStudio Add-in to Insert Markdown Citations. https://CRAN.R-project.org/package=citr.

Aust, Frederik, and Marius Barth. 2020. papaja: Create APA Manuscripts with R Markdown. https://github.com/crsh/papaja.

Baayen, R Harald, Douglas J Davidson, and Douglas M Bates. 2008. "Mixed-Effects Modeling with Crossed Random Effects for Subjects and Items." Journal of Memory and Language 59 (4): 390–412.

Barth, Marius. 2022. tinylabels: Lightweight Variable Labels. https://CRAN.R-project.org/package=tinylabels.

Bates, Douglas M, and Martin Maechler. 2019. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015a. "Fitting Linear Mixed-Effects Models Using lme4." Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. "Fitting Linear Mixed-Effects Models Using lme4." Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Burger, Edward B, and Michael Starbird. 2012. The 5 Elements of Effective Thinking. Princeton University Press.

Bürkner, Paul-Christian. 2019. brms: Bayesian Regression Models Using "Stan". https://CRAN.R-project.org/package=brms.

Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael J. Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. "Stan: A Probabilistic Programming Language." Journal of Statistical Software 76 (1).

Chang, Winston. 2018. webshot: Take Screenshots of Web Pages. https://CRAN.R-project.org/package=webshot.

Cheng, Joe. 2018. miniUI: Shiny UI Widgets for Small Screens. https://CRAN.R-project.org/package=miniUI.

Drton, Mathias. 2013. SIN: A Sinful Approach to Selection of Gaussian Graphical Markov Models. https://CRAN.R-project.org/package=SIN.

Eddelbuettel, Dirk, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Douglas M Bates, and John Chambers. 2019. Rcpp: Seamless R and C++ Integration. https://CRAN.R-project.org/package=Rcpp.

Francois, Romain. 2017. bibtex: BibTeX Parser. https://CRAN.R-project.org/package=bibtex.

Gabry, Jonah, and Rok Češnovar. 2021. cmdstanr: R Interface to "CmdStan".

Gabry, Jonah, and Tristan Mahr. 2019. bayesplot: Plotting for Bayesian Models. https://CRAN.R-project.org/package=bayesplot.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman & Hall/CRC Press.

Gill, Jeff. 2006. Essential Mathematics for Political and Social Research. Cambridge: Cambridge University Press.

Gohel, David, Hadley Wickham, Lionel Henry, and Jeroen Ooms. 2019. gdtools: Utilities for Graphical Rendering. https://CRAN.R-project.org/package=gdtools.

Goodrich, Ben, Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Michael Betancourt, Marcus Brubaker, et al. 2019. StanHeaders: C++ Header Files for Stan. https://CRAN.R-project.org/package=StanHeaders.

Gronau, Quentin F., Henrik Singmann, and Eric-Jan Wagenmakers. 2020. "bridgesampling: An R Package for Estimating Normalizing Constants." Journal of Statistical Software 92 (10): 1–29. https://doi.org/10.18637/jss.v092.i10.

Guo, Jiqiang, Jonah Gabry, and Ben Goodrich. 2019. rstan: R Interface to Stan. https://CRAN.R-project.org/package=rstan.

Henry, Lionel, and Hadley Wickham. 2019. purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Hester, Jim, Gábor Csárdi, Hadley Wickham, Winston Chang, Martin Morgan, and Dan Tenenbaum. 2021. remotes: R Package Installation from Remote Repositories, Including 'GitHub'. https://CRAN.R-project.org/package=remotes.

Izrailev, Sergei. 2014. tictoc: Functions for Timing R Scripts, as Well as Implementations of Stack and List Structures. https://CRAN.R-project.org/package=tictoc.

Kerns, G.J. 2014. Introduction to Probability and Statistics Using R. Second Edition.

Kim, Shinyoung, Hyunji Moon, Martin Modrák, and Teemu Säilynoja. 2022. SBC: Simulation Based Calibration for rstan/cmdstanr Models.

Levy, Dan. 2021. Maxims for Thinking Analytically: The Wisdom of Legendary Harvard Professor Richard Zeckhauser. Dan Levy.

McElreath, Richard. 2021. rethinking: Statistical Rethinking Book Package.

McLean, Mathew William. 2017. "RefManageR: Import and Manage BibTeX and BibLaTeX References in R." The Journal of Open Source Software. https://doi.org/10.21105/joss.00338.

Müller, Kirill, and Hadley Wickham. 2020. tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.

Nicenboim, Bruno, Daniel J. Schad, and Shravan Vasishth. 2020. bcogsci: Data and Models for the Book "An Introduction to Bayesian Data Analysis for Cognitive Science".

Oakley, Jeremy. 2021. SHELF: Tools to Support the Sheffield Elicitation Framework. https://CRAN.R-project.org/package=SHELF.

Ooms, Jeroen. 2021. pdftools: Text Extraction, Rendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.

Rabe, Maximilian M., Shravan Vasishth, Sven Hohenstein, Reinhold Kliegl, and Daniel J. Schad. 2020a. "hypr: An R Package for Hypothesis-Driven Contrast Coding." The Journal of Open Source Software. https://doi.org/10.21105/joss.02134.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Ripley, Brian. 2019. MASS: Support Functions and Datasets for Venables and Ripley's MASS. https://CRAN.R-project.org/package=MASS.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. "How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial." Journal of Memory and Language 110. https://doi.org/10.1016/j.jml.2019.104038.

Sidi, Jonathan, and Daniel Polhamus. 2020. texPreview: Compile and Preview Snippets of "LaTeX". https://CRAN.R-project.org/package=texPreview.

Singmann, Henrik, Ben Bolker, Jake Westfall, Frederik Aust, and Mattan S. Ben-Shachar. 2020. afex: Analysis of Factorial Experiments. https://CRAN.R-project.org/package=afex.

Soetaert, Karline, and Peter M.J. Herman. 2009. A Practical Guide to Ecological Modelling. Using R as a Simulation Platform. Springer.

Spurdle, Abby. 2020a. barsurf: Heatmap-Related Plots and Smooth Multiband Color Interpolation. https://CRAN.R-project.org/package=barsurf.

Spurdle, Abby. 2020b. bivariate: Bivariate Probability Distributions. https://CRAN.R-project.org/package=bivariate.

Spurdle, Abby, and Emil Bode. 2020. intoo: Minimal Language-Like Extensions. https://CRAN.R-project.org/package=intoo.

Vaidyanathan, Ramnath, Yihui Xie, JJ Allaire, Joe Cheng, and Kenton Russell. 2018. htmlwidgets: HTML Widgets for R. https://CRAN.R-project.org/package=htmlwidgets.

Vasishth, Shravan, Daniel J. Schad, Audrey Bürki, and Reinhold Kliegl. 2021. Linear Mixed Models for Linguistics and Psychology: A Comprehensive Introduction. CRC Press. https://vasishth.github.io/Freq_CogSci/.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017a. "Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC." Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Wickham, Hadley. 2019a. forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.

Wickham, Hadley. 2019b. stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D'Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. "Welcome to the tidyverse." Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Lionel Henry. 2019. tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2018. readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.

Wilke, Claus O. 2020. cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2'. https://CRAN.R-project.org/package=cowplot.

Wolodzko, Tymoteusz. 2019. extraDistr: Additional Univariate and Multivariate Distributions. https://CRAN.R-project.org/package=extraDistr.

Xie, Yihui. 2019a. bookdown: Authoring Books and Technical Documents with R Markdown. https://CRAN.R-project.org/package=bookdown.

Xie, Yihui. 2019b. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.

Xie, Yihui. 2019c. servr: A Simple HTTP Server to Serve Static Files or Dynamic Documents. https://CRAN.R-project.org/package=servr.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2019. DT: A Wrapper of the JavaScript Library 'DataTables'. https://CRAN.R-project.org/package=DT.

Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2017. "Using Stacking to Average Bayesian Predictive Distributions." Bayesian Analysis. https://doi.org/10.1214/17-BA1091.

Zhu, Hao. 2019. kableExtra: Construct Complex Table with 'kable' and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

Chapter 1 Introduction

The central idea we will explore in this book is: given some data, how to use Bayes’ theorem
to quantify uncertainty about our belief regarding a scientific question of interest. Before we
get into the details of the underlying theory and its application, some familiarity with the
following topics needs to be in place: the basic concepts behind probability, the concept of
random variables, probability distributions, and the concept of likelihood. We therefore turn to
these topics first.

1.1 Probability

Informally, we all understand what the term probability means. We routinely talk about things
like the probability of it raining today. However, there are two distinct ways to think about
probability. One can think of the probability of an event with reference to the frequency with
which it might occur in repeated observations. Such a conception of probability is easy to
imagine in cases where an event can, at least in principle, occur repeatedly. An example
would be obtaining a 6 when tossing a die again and again. However, this frequentist view of
probability is difficult to justify when talking about certain one-of-a-kind events, such as
earthquakes. In such situations, probability is expressing our uncertainty about the event
happening. Moreover, we could even be uncertain about exactly how probable the event in
question is; for example, we might say something like “I am 90% certain that the probability of
an earthquake happening in the next year is between 10 and 40%”. In this book, we will be
particularly interested in quantifying uncertainty in this way: we will always want to know how
unsure we are of the estimate we are interested in.

Both the frequency-based and the uncertain-belief perspective have their place in statistical
inference, and depending on the situation, we are going to rely on both ways of thinking.
Regardless of these differences in perspective, the probability of an event happening is
defined to be constrained in the following way. The statements below are not formal
statements of the axioms of probability theory; for more details (and more precise
formulations), see Ross (2002) or Kolmogorov (1933).

The probability of an event must lie between 0 and 1, where 0 means that the event is
impossible and cannot happen, and 1 means that the event is certain to happen.
For any two mutually exclusive events, the probability that one or the other occurs is the
sum of their individual probabilities.
Two events are independent if and only if the probability of both events happening is
equal to the product of the probabilities of each event happening.
The probabilities of all possible events in the entire sample space must sum up to 1.

The above definitions are based on the axiomatic definition of probability by Kolmogorov
(1933).

In the context of data analysis, we will talk about probability in the following way. Consider
some data that we might have collected. This could be discrete 0,1 responses in a question-
response accuracy task, or continuous measurements of reading times in milliseconds from an
eyetracking study, etc. In all such cases, we will say that the data are being generated from a random variable, which we will designate with a capital letter such as Y.

The actually observed data will be distinguished from the random variable that generated them by using lower-case y. We can call y an instance of Y; every new set of data will be slightly different due to random variability.

We can summarize the above informal concepts relating to random variables very compactly if
we re-state them in mathematical form. A mathematical statement has the advantage not only
of brevity but also of reducing ambiguity.

Somewhat more formally (following Blitzstein and Hwang 2014), we define a random variable Y as a function from a sample space of possible outcomes S to the real number system:

Y : S → ℝ

The random variable associates to each outcome ω ∈ S exactly one number Y(ω) = y. Suppose that y represents all the possible values that the random variable generates; these values are taken to belong to the support of Y: y ∈ S_Y.
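To make the definition concrete, here is a minimal R sketch of a random variable for a single coin toss; the representation below (a named vector plus sample()) is our own illustration, not code from the book:

## a toy random variable: Y maps each outcome in S to a real number
Y <- c(tails = 0, heads = 1)
omega <- sample(c("tails", "heads"), size = 1) # draw one outcome omega from S
Y[omega]                                       # Y(omega) = y, either 0 or 1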

As a concrete example, consider an experiment where we ask subjects to respond to 10 questions that can each have either a correct or an incorrect answer. We will say that the number of correct responses from a subject is generated from a random variable Y. Because only discrete responses are possible (the number of correct responses can be 0, 1, 2, …, 10), this is an example of a discrete random variable.

This random variable will be assumed to have a parameter θ that represents the probability of
producing a correct response. In statistics, given some observed data, typically our goal is to
obtain an estimate of this parameter’s true (unknown) value.
This discrete random variable Y has associated with it a function called a probability mass function or PMF. This function, which is written p(y), gives us the probability of obtaining each of these 11 possible outcomes (from 0 correct responses to 10). We are using lower-case p(⋅) here, and this is distinct from P(⋅), which we will use to talk about probabilities.

We will write that this PMF p(y) depends on, or is conditional on, a particular fixed but unknown value for θ; the PMF will be written p(y|θ).

In frequentist approaches to data analysis, only the observed data y are used to draw
inferences about θ. A typical question that we ask in the frequentist paradigm is: does θ have a particular value θ₀? One can obtain estimates of the unknown value of θ from the observed data y, and then draw inferences about how different (or, more precisely, how far away) this estimate is from the hypothesized θ₀. This is the essence of null hypothesis significance testing. The conclusions from such a procedure are framed in terms of either rejecting the hypothesis that θ has value θ₀, or failing to reject this hypothesis. Here, rejecting the null hypothesis is the primary goal of the statistical hypothesis test.

Bayesian data analysis begins with a different question. What is common to the frequentist
paradigm is the assumption that the data are generated from a random variable Y and that
there is a function p(y|θ) that depends on the parameter θ. Where the Bayesian approach diverges from the frequentist one is that an important goal is to express our uncertainty about θ. In other words, we treat the parameter θ itself as a random variable, which means that we assign a probability distribution p(θ) to this random variable. This distribution p(θ) is called the prior distribution on θ; such a distribution could express our belief about the probability of correct responses, before we observe the data y.

In later chapters, we will spend some time trying to understand how such a prior distribution
can be defined for a range of different research problems.

Given such a prior distribution and some data y, the end-product of a Bayesian data analysis
is what is called the posterior distribution of the parameter (or parameters) given the data: p(θ|y). This posterior distribution is the probability distribution of θ after conditioning on y, i.e., after the data have been observed and are therefore known. All our statistical inference is based on this posterior distribution of θ; we can even carry out hypothesis tests that are analogous (but not identical) to the likelihood-ratio-based frequentist hypothesis tests.

We already mentioned conditional probability above when discussing the probability of the data given some parameter θ, which we wrote as the PMF p(y|θ). Conditional probability is an important concept in Bayesian data analysis, not least because it allows us to derive Bayes' theorem. Let's look at the definition of conditional probability next.
1.2 Conditional probability

Suppose that A stands for some discrete event; an example would be “the streets are wet.”
Suppose also that B stands for some other discrete event; an example is “it has been raining.”
We can talk about the probability of the streets being wet given that it has been raining; or
more generally, the probability of A given that B has happened.

This kind of statement is written as Prob(A|B) or, more simply, P(A|B). This is the conditional probability of event A given B. Conditional probability is defined as follows:

P(A|B) = P(A, B) / P(B), where P(B) > 0

The conditional probability of A given B is thus defined to be the joint probability of A and B, divided by the probability of B. We can rearrange the above equation so that we can talk about the joint probability of both events A and B happening. This joint probability can be computed by first taking P(B), the probability that event B (it has been raining) happens, and multiplying this by the probability that A happens conditional on B, i.e., the probability that the streets are wet given that it has been raining. This multiplication gives us P(A, B), the joint probability of A and B, i.e., that it has been raining and that the streets are wet. We will write the above description as: P(A, B) = P(A|B)P(B).

Now, since the probability of A and B happening is the same as the probability of B and A happening, i.e., since P(B, A) = P(A, B), we can expand each of the two terms:

P(A, B) = P(A|B)P(B) and P(B, A) = P(B|A)P(A)

Equating the two expansions, we get:

P(A|B)P(B) = P(B|A)P(A)

Dividing both sides by P(B):

P(A|B) = P(B|A)P(A) / P(B)

The above statement is Bayes’ rule, and is the basis for all the statistical inference we will do
in this book.
1.3 The law of total probability

Related to the above discussion of conditional probability is the law of total probability.
Suppose that we have n distinct events A₁, …, Aₙ that are pairwise disjoint and that together make up the entire sample space S; see Figure 1.1. Then P(B), the probability of an event B, will be the sum of the probabilities P(B ∩ Aᵢ), i.e., the sum of the joint probabilities of B and each Aᵢ occurring (the symbol ∩ is the "and" operator used in set theory).

Formally:

P(B) = ∑ᵢ₌₁ⁿ P(B ∩ Aᵢ)

Because of the conditional probability rule, we can rewrite this as:

P(B) = ∑ᵢ₌₁ⁿ P(B|Aᵢ)P(Aᵢ)

Thus, the probability of B is the sum of the conditional probabilities P(B|Aᵢ), each weighted by the probability P(Aᵢ). We will see the law of total probability in action below when we talk about marginal likelihood.

FIGURE 1.1: An illustration of the law of total probability.
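As a toy numeric illustration of how conditional probability and the law of total probability fit together with Bayes' rule, consider the rain example in R; all the probabilities below are invented for illustration:

p_rain <- 0.3               # P(A1): it has been raining; A2 is "no rain"
p_wet_given_rain <- 0.9     # P(B | A1): streets wet given rain
p_wet_given_no_rain <- 0.1  # P(B | A2): streets wet given no rain
## law of total probability: P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) = 0.34
p_wet <- p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)
## Bayes' rule: P(A1|B) = P(B|A1)P(A1)/P(B) = 0.27/0.34, approximately 0.794
p_rain_given_wet <- p_wet_given_rain * p_rain / p_wet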

For now, this is all the probability theory we need to know!


The next sections expand on the idea of a random variable, the probability distributions
associated with the random variable, what it means to specify a prior distribution on a
parameter, and how the prior and data can be used to derive the posterior distribution of θ.

To make the discussion concrete, we will use an example of a discrete random variable, the
binomial. After discussing this discrete random variable, we present another example, this
time involving a continuous random variable, the normal random variable.

The binomial and normal cases serve as the canonical examples that we will need in the initial
stages of this book. We will introduce other random variables as needed: in particular, we will
need the uniform and beta distributions. In other textbooks, you will encounter distributions
like the Poisson, gamma, and the exponential. The most commonly used distributions and
their properties are discussed in most textbooks on statistics (see Further Readings at the end
of this chapter).

1.4 Discrete random variables: An example using the binomial distribution

Consider the following sentence:

“It’s raining, I’m going to take the ….”

Suppose that our research goal is to estimate the probability, call it θ, of the word “umbrella”
appearing in this sentence, versus any other word. If the sentence is completed with the word
“umbrella”, we will refer to it as a success; any other completion will be referred to as a failure.
This is an example of a binomial random variable: given n trials, there can be only two
possible outcomes in each trial, a success or a failure, and there is some true unknown
probability θ of success that we want to estimate. When the number of trials is n = 1, the random variable is said to have a Bernoulli distribution; that is, the Bernoulli distribution is the binomial distribution with n = 1.

One way to empirically estimate this probability of success is to carry out a cloze task. In a
cloze task, subjects are asked to complete a fragment of the original sentence, such as “It’s
raining, I’m going to take the …”. The predictability or cloze probability of “umbrella” is then
calculated as the proportion of times that the target word “umbrella” was produced as an
answer by subjects.
Assume for simplicity that 10 subjects are asked to complete the above sentence; each subject does this task only once. This gives us independent responses from 10 trials that are either coded as a success ("umbrella" was produced) or as a failure (some other word was produced). We can sum up the number of successes to calculate how many of the 10 trials had "umbrella" as a response. For example, if 8 instances of "umbrella" are produced in 10 trials, we would estimate the cloze probability of producing "umbrella" to be 8/10.

We can repeatedly generate simulated sequences of the number of successes in R (later on we will demonstrate how to generate such random sequences of simulated data). Here is a case where we run the same experiment 20 times (n = 20 below), with sample size size = 10 each time.

rbinom(n = 20, size = 10, prob = 0.5)

## [1] 7 3 4 7 5 2 5 5 5 4 5 8 4 4 8 4 2 5 10 9

The number of successes in each of the 20 simulated experiments above is being generated
by a discrete random variable Y with a probability distribution p(y|θ) called the binomial
distribution.

For discrete random variables such as the binomial, the probability distribution p(y|θ) is called a probability mass function (PMF). The PMF defines the probability of each possible outcome. In the above example, with n = 10 trials, there are 11 possible outcomes: y = 0, 1, 2, …, 10 successes. Which of these outcomes is most probable depends on the parameter θ in the binomial distribution that represents the probability of success.

The left-hand side plot in Figure 1.2 shows an example of a binomial PMF with 10 trials, with the parameter θ fixed at 0.5. Setting θ to 0.5 leads to a PMF where the most probable outcome is 5 successes out of 10. If we had set θ to, say, 0.1, then the most probable outcome would be 1 success out of 10; and if we had set θ to 0.9, then the most probable outcome would be 9 successes out of 10.

FIGURE 1.2: Probability mass functions of a binomial distribution assuming 10 trials, with
50%, 10%, and 90% probability of success.
The probability mass function for the binomial is written as follows:

Binomial(k|n, θ) = (n choose k) θᵏ (1 − θ)ⁿ⁻ᵏ

Here, n represents the total number of trials, k the number of successes (this could range from 0 to 10), and θ the probability of success. The term (n choose k), pronounced n-choose-k, represents the number of ways in which one can choose k successes out of n trials. For example, 1 success out of 10 can occur in 10 possible ways: the very first trial could be a 1, the second trial could be a 1, etc. The term (n choose k) expands to n!/(k!(n − k)!). In R, it is computed using the function choose(n, k), with n and k representing positive integer values.
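As a quick sanity check of the n-choose-k term (our own example, using the numbers from the text):

choose(10, 1)                                 ## [1] 10; ten ways to place one success
choose(10, 7)                                 ## [1] 120
factorial(10) / (factorial(7) * factorial(3)) ## [1] 120; the same value via n!/(k!(n-k)!)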

When we want to express the fact that the data are assumed to be generated from a binomial random variable, we will write Y ∼ Binomial(n, θ). If the data are generated from a random variable that has some other probability distribution f(θ), we will write Y ∼ f(θ). In this book, we use f(⋅) synonymously with p(⋅) to represent a probability distribution.

1.4.1 The mean and variance of the binomial distribution

It is possible to analytically compute the mean (expectation) and variance of the PMF associated with the binomial random variable Y.

The expectation of a discrete random variable Y with probability mass function f(y) is defined as:

E[Y] = ∑ y ⋅ f(y)

As a simple example, suppose that we toss a fair coin once. The possible outcomes are Tails (represented as 0) and Heads (represented as 1), each with equal probability, 0.5. The expectation is:

E[Y] = ∑ y ⋅ f(y) = 0 ⋅ 0.5 + 1 ⋅ 0.5 = 0.5

The expectation has the interpretation that if we were to do the experiment a large number of times and calculate the sample mean of the observations, in the long run we would approach the value 0.5. Another way to look at the above definition is that the expectation gives us the weighted mean of the possible outcomes, weighted by the respective probabilities of each outcome.

Without getting into the details of how these are derived mathematically (Kerns 2014), we just state here that the mean of Y (the expectation E[Y]) and the variance of Y (written Var(Y)) of a binomial distribution with parameter θ and n trials are E[Y] = nθ and Var(Y) = nθ(1 − θ).
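These formulas are easy to verify by simulation; in the sketch below, the sample size and seed are our own choices:

set.seed(123)
y <- rbinom(n = 100000, size = 10, prob = 0.5)
mean(y) # close to n * theta = 5
var(y)  # close to n * theta * (1 - theta) = 2.5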

In the binomial example above, n is a fixed number because we decide on the total number of trials before running the experiment. In the PMF (n choose k) θᵏ (1 − θ)ⁿ⁻ᵏ, θ is also a fixed value; the only variable in a PMF is k. In real experimental situations, we never know the true value of θ, but θ can be estimated from the data. From the observed data, we can compute the estimate of θ, θ̂ = k/n. The quantity θ̂ is the observed proportion of successes, and is called the maximum likelihood estimate of the true (but unknown) parameter θ. Once we have estimated θ in this way, we can also obtain an estimate of the variance by computing nθ̂(1 − θ̂). These estimates are then used for statistical inference.
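For example, with k = 7 successes in n = 10 trials, the point estimates can be computed as follows (a minimal sketch applying the formulas just given):

k <- 7
n <- 10
(theta_hat <- k / n)            ## [1] 0.7; the MLE of theta
n * theta_hat * (1 - theta_hat) ## [1] 2.1; the estimated variance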

What does the term "maximum likelihood estimate" mean? In order to understand this term, it is necessary to first understand what a likelihood function is. Recall that in the discussion above, the PMF p(k|n, θ) assumes that θ and n are fixed, and k will vary from 0 to 10 when the experiment is repeated multiple times.

The likelihood function refers to the PMF p(k|n, θ), treated as a function of θ. Once we have observed a particular value for k, this value is now fixed, along with n. Once k and n are fixed, the function p(k|n, θ) only depends on θ. Thus, the likelihood function is the same function as the PMF, but it assumes that the data are fixed and only the parameter θ varies (from 0 to 1). The likelihood function is written L(θ|k, n), or simply L(θ).

For example, suppose that we record n = 10 trials, and observe k = 7 successes. The likelihood function is:

L(θ|k = 7, n = 10) = (10 choose 7) θ⁷ (1 − θ)¹⁰⁻⁷

If we now plot the likelihood function for all possible values of θ ranging from 0 to 1, we get the plot shown in Figure 1.3.

FIGURE 1.3: The likelihood function for 7 successes out of 10.

What is important about this plot is that it shows that, given the data, the maximum point is at 0.7, which corresponds to the estimate computed using the formula shown above: k/n = 7/10. Thus, the maximum likelihood estimate (MLE) gives us the most likely value of the parameter θ, given the data. In the binomial, the proportion of successes k/n can be shown to be the maximum likelihood estimate of the parameter θ (e.g., see p. 339-340 of Miller and Miller 2004).
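The likelihood function shown in Figure 1.3 can be recomputed in a few lines of R; the book's plotting code is hidden, so the grid-based sketch below is our own:

theta <- seq(0, 1, by = 0.001)            # a grid of possible values of theta
lik <- dbinom(7, size = 10, prob = theta) # L(theta | k = 7, n = 10)
theta[which.max(lik)]                     ## [1] 0.7; the maximum is at k/n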

A crucial point: the “most likely” value of the parameter is with respect to the data at hand. The
goal is to estimate an unknown parameter value from the data. This parameter value is
chosen such that the probability (discrete case) or probability density (continuous case) of
getting the sample values (i.e., the data) is a maximum. This parameter value is the maximum
likelihood estimate (MLE).

This MLE from a particular sample of data need not invariably give us an accurate estimate of θ. For example, if we run our experiment for 10 trials and get 1 success out of 10, the MLE is 0.10. We could have just happened to observe only one success out of ten by chance, even if the true θ were 0.7. If we were to repeatedly run the experiment with increasing sample sizes, the MLE would converge to the true value of the parameter as the sample size increases. Figure 1.4 illustrates this point. The key point here is that with a smaller sample size, the MLE from a particular data set may or may not point to the true value.

FIGURE 1.4: The plot shows the estimate of the mean proportion of successes sampled from
a binomial distribution with true probability of success 0.7, with increasing sample sizes. As
the sample size increases, the estimate converges to the true value of 0.7.
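A simulation along the lines of Figure 1.4 can be sketched as follows; the particular sample sizes and seed are our own choices, not the book's hidden code:

set.seed(42)
n_sizes <- c(10, 100, 1000, 10000, 100000)
## one simulated experiment per sample size, with true theta = 0.7:
mle <- sapply(n_sizes, function(n) rbinom(1, size = n, prob = 0.7) / n)
round(mle, 3) # the estimates approach 0.7 as the sample size grows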

1.4.2 What information does a probability distribution provide?

In Bayesian data analysis, we will constantly be asking the question: what information does a
probability distribution give us? In particular, we will treat each parameter θ as a random
variable; this will raise questions like: “what is the probability that the parameter θ lies between
two values a and b”; and “what is the range over which we can be 95% certain that the true
value of the parameter lies”? In order to be able to answer questions like these, we need to
know what information we can obtain once we have decided on a probability distribution that is
assumed to have generated the data, and how to extract this information using R. We
therefore discuss the different kinds of information we can obtain from a probability
distribution. For now we focus only on the binomial random variable introduced above.
1.4.2.1 Compute the probability of a particular outcome (discrete case only)

The binomial distribution shown in Figure 1.2 already shows the probability of each possible outcome under different values of θ. In R, there is a built-in function that allows us to calculate the probability of k successes out of n, given a particular value of k (this number constitutes our data), the number of trials n, and a particular value of θ; this is the dbinom function. For example, the probability of 5 successes out of 10 when θ is 0.5 is:

dbinom(5, size = 10, prob = 0.5)

## [1] 0.246

The probabilities of 5 successes out of 10 when θ is 0.1 or 0.9 can be computed by replacing 0.5 above with each of these values. One can do this by giving dbinom a vector of probabilities:

dbinom(5, size = 10, prob = c(0.1, 0.9))

## [1] 0.00149 0.00149

The probability of a particular outcome like k = 5 successes is only computable in the discrete case. In the continuous case, the probability of obtaining a particular point value will always be zero (we discuss this when we turn to continuous probability distributions below).

1.4.2.2 Compute the cumulative probability of k or less (more) than k successes

Using the dbinom function, we can compute the cumulative probability of obtaining 1 or less, 2 or less successes, etc. This is done through a simple summation procedure:
## the cumulative probability of obtaining
## 0, 1, or 2 successes out of 10,
## with theta = 0.5:
dbinom(0, size = 10, prob = 0.5) +
  dbinom(1, size = 10, prob = 0.5) +
  dbinom(2, size = 10, prob = 0.5)

## [1] 0.0547

Mathematically, we could write the above summation as:

∑ₖ₌₀² (n choose k) θᵏ (1 − θ)ⁿ⁻ᵏ

An alternative to the cumbersome addition in the R code above is this more compact
statement, which is identical to the above mathematical expression:


sum(dbinom(0:2, size = 10, prob = 0.5))

## [1] 0.0547

R has a built-in function called pbinom that does this summation for us. If we want to know the probability of 2 or fewer successes, as in the above example, we can write:

pbinom(2, size = 10, prob = 0.5, lower.tail = TRUE)

## [1] 0.0547

The specification lower.tail = TRUE (the default value) ensures that the summation includes k = 2 and all values smaller than 2 (which lie in the lower tail of the distribution in Figure 1.2). If we want to know the probability of obtaining 3 or more successes out of 10, we can set lower.tail to FALSE:
pbinom(2, size = 10, prob = 0.5, lower.tail = FALSE)

## [1] 0.945


## equivalently:
## sum(dbinom(3:10,size = 10, prob = 0.5))

The cumulative distribution function or CDF can be plotted by computing the cumulative
probabilities for any value k or less than k, where k ranges from 0 to 10 in our running
example. The CDF is shown in Figure 1.5.


FIGURE 1.5: The cumulative distribution function for a binomial distribution assuming 10 trials,
with 50% probability of success.
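A minimal sketch of how such a CDF plot could be produced (the book's own plotting code is hidden; base R graphics are our choice here):

k <- 0:10
plot(k, pbinom(k, size = 10, prob = 0.5),
     type = "s", xlab = "Possible outcomes k",
     ylab = "Probab. of k or less successes")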
1.4.2.3 Compute the inverse of the cumulative distribution function (the quantile function)

We can also find out the value of the variable k (the quantile) such that the probability of obtaining k or less than k successes is some specific probability value p. If we switch the x and y axes of Figure 1.5, we obtain another very useful function, the inverse of the CDF.

The inverse of the CDF (known as the quantile function in R because it returns the quantile, i.e., the value k) is available in R as the function qbinom. The usage is as follows: to find the value k of the outcome such that the probability of obtaining k or fewer successes is 0.37, type:

qbinom(0.37, size = 10, prob = 0.5)

## [1] 4

One can visualize the inverse CDF of the binomial as in Figure 1.6.


FIGURE 1.6: The inverse CDF for the binomial(size=10,prob=0.5).
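Analogously, a rough sketch of the inverse CDF plot, using qbinom over a grid of probabilities (again our own base R code, not the book's):

p <- seq(0.001, 0.999, by = 0.001)
plot(p, qbinom(p, size = 10, prob = 0.5),
     type = "s", xlab = "probability", ylab = "quantile")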


1.4.2.4 Generate simulated data from a Binomial(n, θ) distribution

We can generate simulated data from a binomial distribution by specifying the number of trials
and the probability of success θ. In R, we do this as follows:


rbinom(n = 1, size = 10, prob = 0.5)

## [1] 7

The above code generates the number of successes in an experiment with 10 trials. If we run the above code repeatedly, we will get a different number of successes each time.

As mentioned earlier, if there is only one trial, then instead of the binomial distribution we have a Bernoulli distribution. For example, if we have 10 observations from a Bernoulli distribution, where the probability of success is 0.5, we can simulate data as follows using the function rbern from the package extraDistr.

rbern(n = 10, prob = 0.5)

## [1] 0 1 1 0 1 1 0 0 1 1

The above kind of output can also be generated by using the rbinom function: rbinom(n = 10, size = 1, prob = 0.5). When the data are generated using the rbinom function in this way, one can calculate the number of successes by just summing up the vector, or by computing its mean and multiplying by the number of trials, here 10:

(y <- rbinom(n = 10, size = 1, prob = 0.5))

## [1] 0 1 1 1 0 1 0 0 0 0

mean(y) * 10

## [1] 4


sum(y)

## [1] 4

1.5 Continuous random variables: An example using the normal distribution

We will now revisit the idea of the random variable using a continuous distribution. Imagine that we have a vector of reading time data y measured in milliseconds and coming from a normal distribution. The normal distribution is defined in terms of two parameters: the location, its mean μ, which determines its center, and the scale, its standard deviation σ, which determines how much spread there is around this center point.

The probability density function (PDF) of the normal distribution is defined as follows:

$$Normal(y \mid \mu, \sigma) = f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$

Here, μ is some true, unknown mean, and σ² is some true, unknown variance of the normal distribution that the reading times have been sampled from. There is a built-in function in R that computes the above function once we specify the mean μ and the standard deviation σ (in R, this parameter is specified in terms of the standard deviation rather than the variance).

Figure 1.7 visualizes the normal distribution for particular values of μ and σ, as a PDF (using
dnorm ), a CDF (using pnorm ), and the inverse CDF (using qnorm ). It should be clear from
the figure that these are three different ways of looking at the same information.
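A small sketch makes this concrete: dnorm returns the height of the density, and pnorm and qnorm undo each other (the values below assume μ = 500 and σ = 100):

dnorm(600, mean = 500, sd = 100)  ## density at y = 600

## [1] 0.00242

pnorm(600, mean = 500, sd = 100)  ## CDF: P(Y < 600)

## [1] 0.841

qnorm(pnorm(600, mean = 500, sd = 100),
      mean = 500, sd = 100)       ## inverse CDF recovers 600

## [1] 600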
FIGURE 1.7: The PDF, CDF, and inverse CDF for the Normal(μ = 500, σ = 100).
As in the discrete example, the PDF, CDF, and inverse of the CDF allow us to ask questions
like:

What is the probability of observing values between a and b from a normal distribution with mean μ and standard deviation σ? Using the above example, we can ask what the probability is of observing values between 200 and 700 ms:


pnorm(700, mean = 500, sd = 100) - pnorm(200, mean = 500, sd = 100)

## [1] 0.976

The probability of any point value in a PDF is always 0. This is because the probability in a
continuous probability distribution is the area under the curve, and the area at any point on the
x-axis is always 0. The implication here is that it is only meaningful to ask about probabilities
between two different point values; e.g., the probability that Y lies between a and b, or
P (a < Y < b) . Notice that P (a < Y < b) is the same statement as P (a ≤ Y ≤ b) .

What is the quantile q such that the probability of observing q or a value smaller than q is some probability p? For example, we can work out the quantile q such that the probability of observing q or some value less than it is 0.975, in the Normal(500, 100) distribution. Formally, we would write this as P(Y < q) = 0.975.


qnorm(0.975, mean = 500, sd = 100)

## [1] 696
The above output says that the quantile value q such that Prob(Y < q) = 0.975 is q = 696.

Generate simulated data. Given a vector of n independent and identically distributed data points y, i.e., given that each data point is generated independently from Y ∼ Normal(μ, σ) for some values of the parameters, the sample mean and standard deviation4 are:

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

$$sd(y) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}}$$

For example, we can generate 10 data points using the rnorm function, and then use the
simulated data to compute the mean and standard deviation:


y <- rnorm(10, mean = 500, sd = 100)

mean(y)

## [1] 532


sd(y)

## [1] 80.6

Again, the sample mean and sample standard deviation computed from a particular (simulated
or real) data set need not necessarily be close to the true values of the respective parameters.
Especially when sample size is small, one can end up with mis-estimates of the mean and
standard deviation.

Incidentally, simulated data can be used to generate all kinds of statistics. For example, we
can compute the lower and upper quantiles such that 95% of the simulated data are contained
within these quantiles:

quantile(y, probs = c(0.025, 0.975))

## 2.5% 97.5%
## 427 673

Later on, we will be using samples to produce summary statistics like the ones shown above.

1.5.1 An important distinction: probability vs. density in a continuous random variable

In continuous distributions like the normal discussed above, it is important to understand that
the probability density function or PDF, p(y|μ, σ) defines a mapping from the y values (the
possible values that the data can have) to a quantity called the density of each possible value.
We can see this function in action when we use dnorm to compute, say, the density value
corresponding to y = 1 and y = 2 in the standard normal distribution.


## density:
dnorm(1, mean = 0, sd = 1)

## [1] 0.242


dnorm(2, mean = 0, sd = 1)

## [1] 0.054

If the density at a particular point value like 1 is high compared to some other value – 2 in the
above example – then this point value 1 has a higher likelihood than 2 in the standard normal
distribution.

The quantities computed for the values 1 and 2 are not the probabilities of observing 1 or 2 in this distribution. As mentioned earlier, probability in a continuous distribution is defined as the area
under the curve, and this area will always be zero at any point value (because there are
infinitely many different possible values). If we want to know the probability of obtaining values between a lower bound a and an upper bound b, i.e., P(a < Y < b) where these are two distinct
values, we must use the cumulative distribution function or CDF: in R, for the normal
distribution, this is the pnorm function. For example, the probability of observing a value
between +2 and -2 in a normal distribution with mean 0 and standard deviation 1 is:


pnorm(2, mean = 0, sd = 1) - pnorm(-2, mean = 0, sd = 1)

## [1] 0.954

The situation is different in discrete random variables. These have a probability mass function
(PMF) associated with them—an example is the binomial distribution that we saw earlier.
There, the PMF maps the possible y values to the probabilities of those values occurring. That
is why, in the binomial distribution, the probability of observing exactly 2 successes when
sampling from a Binomial(n = 10, θ = 0.5) can be computed using either dbinom or
pbinom :


dbinom(2, size = 10, prob = 0.5)

## [1] 0.0439


pbinom(2, size = 10, prob = 0.5) - pbinom(1, size = 10, prob = 0.5)

## [1] 0.0439

In the second line of code above, we are computing the cumulative probability of observing two or fewer successes, minus the probability of observing one or fewer successes. This gives us the probability of observing exactly two successes. The dbinom function gives us this same information directly.
1.5.2 Truncating a normal distribution

In the above discussion, the support of the normal distribution ranges from minus infinity to plus infinity. One can define PDFs with a more limited support; an example would be a normal distribution whose PDF f(x) is truncated at a lower bound of 0 to allow only positive values. In such a case, the area under the curve over the range minus infinity to zero ($\int_{-\infty}^{0} f(x)\, dx$) will be 0, because that range lies outside the support of the truncated normal distribution. Also, if one truncates a probability density function like the standard normal (Normal(0, 1)) at 0, then in order to make the area between zero and plus infinity sum up to 1, we would have to multiply the truncated distribution f(x) by some factor k such that the following integral sums to 1:

$$k\int_{0}^{\infty} f(x)\, dx = 1 \quad (1.1)$$

Clearly, this factor is $k = \frac{1}{\int_{0}^{\infty} f(x)\, dx}$. For the standard normal, this integral is easy to compute

using R; we just calculate the complement of the cumulative distribution (CCDF):


pnorm(0, mean = 0, sd = 1, lower.tail = FALSE)

## [1] 0.5


## alternatively:
1 - pnorm(0, mean = 0, sd = 1, lower.tail = TRUE)

## [1] 0.5

Also, if we had truncated the distribution at 0 to the right instead of the left (allowing only negative values), we would have to find the factor k in the same way as above, except that we would have to find k such that:

$$k\int_{-\infty}^{0} f(x)\, dx = 1$$

For the standard normal case, in R, this factor would require us to use the CDF:

pnorm(0, mean = 0, sd = 1, lower.tail = TRUE)

## [1] 0.5

Later in this book, we will be using such truncated distributions when doing Bayesian
modeling, and when we use them, we will want to multiply the truncated distribution by the
factor k to ensure that it is still a proper PDF that sums to 1. Truncated normal distributions
will be discussed in more detail in Box 4.1.
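As a concrete sketch of this idea (a hand-rolled illustration, not the extraDistr implementation), a normal density truncated from below at 0 is just the ordinary density rescaled by the factor k:

## density of a Normal(mu, sigma) truncated at 0 from below:
dnorm_trunc0 <- function(x, mu, sigma) {
  k <- 1 / pnorm(0, mean = mu, sd = sigma, lower.tail = FALSE)
  ifelse(x < 0, 0, k * dnorm(x, mean = mu, sd = sigma))
}
## the truncated density integrates to 1, as a proper PDF should:
integrate(dnorm_trunc0, lower = 0, upper = Inf, mu = 0, sigma = 1)$value

## [1] 1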

1.6 Bivariate and multivariate distributions

So far, we have only discussed univariate distributions; these are distributions that involve
only one variable. For example, when we talk about data generated from a Binomial
distribution, or from a Normal distribution, these are univariate distributions.

It is also possible to specify distributions with two or more dimensions. Some examples will
make it clear what a bivariate (or more generally, multivariate) distribution is.

1.6.1 Example 1: Discrete bivariate distributions

Starting with the discrete case, consider the discrete bivariate distribution shown below. These
are data from an experiment where, inter alia, in each trial a Likert acceptability rating and a
question-response accuracy were recorded (the data are from a study by Laurinavichyute
(2020), used with permission here). Load the data by loading the R package bcogsci .

library(bcogsci)
data("df_discreteagrmt")

Figure 1.8 shows the joint probability mass function of two random variables X and Y. The
random variable X consists of 7 possible values (this is the 1-7 Likert response scale), and the
random variable Y is question-response accuracy, with 0 representing an incorrect response,
and 1 representing a correct response. One can also display Figure 1.8 as a table; see Table
1.1.
FIGURE 1.8: Example of a discrete bivariate distribution. In these data, in every trial, two pieces of information were collected: Likert responses and yes-no question responses. The random variable X represents Likert scale responses on a scale of 1-7, and the random variable Y represents 0, 1 (incorrect, correct) responses to comprehension questions.
TABLE 1.1: The joint PMF for two random variables X and Y.

x = 1 x = 2 x = 3 x = 4 x = 5 x = 6 x = 7

y = 0 0.018 0.023 0.04 0.043 0.063 0.049 0.055

y = 1 0.031 0.053 0.086 0.096 0.147 0.153 0.142

For each possible pair of values of X and Y, we have a joint probability pX,Y (x, y) . Given such
a bivariate distribution, there are two useful quantities we can compute: the marginal
distributions (pX and pY ), and the conditional distributions (pX|Y and pY |X ). Table 1.1 shows
the joint probability mass function pX,Y (x, y) .

1.6.1.1 Marginal distributions

The marginal distribution $p_Y$ is defined as follows ($S_X$ is the support of X, i.e., all the possible values of X):

$$p_Y(y) = \sum_{x \in S_X} p_{X,Y}(x, y)$$

Similarly, the marginal distribution $p_X$ is defined as:

$$p_X(x) = \sum_{y \in S_Y} p_{X,Y}(x, y)$$

pY is computed by summing up the rows, and pX by summing up the columns. We can see why this is called the marginal distribution: the result appears in the margins of the table. In the code below, the object probs contains the bivariate PMF shown in Table 1.1.
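A minimal sketch of this object, reconstructed from the rounded values in Table 1.1 as a matrix with rows y = 0, 1 and columns x = 1, ..., 7 (because the table entries are rounded to three decimals, marginals computed from this sketch will match the output below only up to rounding):

probs <- matrix(c(0.018, 0.023, 0.040, 0.043, 0.063, 0.049, 0.055,
                  0.031, 0.053, 0.086, 0.096, 0.147, 0.153, 0.142),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("y=0", "y=1"), paste0("x=", 1:7)))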


# P(Y)
(PY <- rowSums(probs))

## y=0 y=1
## 0.291 0.709


sum(PY) ## sums to 1

## [1] 1


# P(X)
(PX <- colSums(probs))

## x=1 x=2 x=3 x=4 x=5 x=6 x=7
## 0.0491 0.0766 0.1257 0.1394 0.2102 0.2020 0.1969


sum(PX) ## sums to 1
## [1] 1

The marginal probabilities sum to 1, as they should. Table 1.2 shows the marginal
probabilities.

TABLE 1.2: The joint PMF for two random variables X and Y, along with the marginal
distributions of X and Y.

x = 1 x = 2 x = 3 x = 4 x = 5 x = 6 x = 7 P (Y )

y = 0 0.018 0.023 0.04 0.043 0.063 0.049 0.055 0.291

y = 1 0.031 0.053 0.086 0.096 0.147 0.153 0.142 0.709

P (X) 0.049 0.077 0.126 0.139 0.21 0.202 0.197

To compute the marginal distribution of X, one sums over all the values of Y; and to compute the marginal distribution of Y, one sums over all the values of X. We say that we are marginalizing out the random variable that we are summing over. One can also visualize the two marginal distributions using barplots (Figure 1.9).

FIGURE 1.9: The marginal distributions of the random variables X and Y, presented as barplots.
1.6.1.2 Conditional distributions

For computing conditional distributions, recall that conditional probability (see section 1.2) is defined as:

$$p_{X \mid Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}$$

and

$$p_{Y \mid X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$$

The conditional distribution of a random variable X, given that Y = y, where y is some specific (fixed) value, is:

$$p_{X \mid Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} \quad \text{provided } p_Y(y) = P(Y = y) > 0$$

As an example, let's consider how $p_{X \mid Y}$ would be computed. The possible values of y are 0 and 1, and so we have to find the conditional distribution (defined above) for each of these values; i.e., we have to find $p_{X \mid Y}(x \mid y = 0)$ and $p_{X \mid Y}(x \mid y = 1)$.

Let's do the calculation for $p_{X \mid Y}(x \mid y = 0)$:

$$p_{X \mid Y}(1 \mid 0) = \frac{p_{X,Y}(1, 0)}{p_Y(0)} = \frac{0.018}{0.291} = 0.062$$

This conditional probability value will occupy the cell X=1, Y=0 in Table 1.3 summarizing the
conditional probability distribution pX|Y . In this way, one can fill in the entire table, which will
then represent the conditional distributions pX|Y =0 and pX|Y =1 . The reader may want to take
a few minutes to complete Table 1.3.

TABLE 1.3: A table for listing conditional distributions of X given Y.

x = 1 x = 2 x = 3 x = 4 x = 5 x = 6 x = 7

pX|Y (x|y=0) 0.062

pX|Y (x|y=1)

Similarly, one can construct a table that shows pY |X .
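Both conditional tables can be computed in one step each; this sketch assumes the probs matrix, PY, and PX defined above (sweep divides each row or column of probs by the corresponding marginal):

## P(X | Y): divide each row (a value of Y) by the marginal P(Y)
sweep(probs, MARGIN = 1, STATS = PY, FUN = "/")
## P(Y | X): divide each column (a value of X) by the marginal P(X)
sweep(probs, MARGIN = 2, STATS = PX, FUN = "/")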


1.6.1.3 Covariance and correlation

Here, we briefly define the covariance and correlation of two discrete random variables. For detailed examples and discussion, see the references at the end of the chapter. Informally, if there is a high probability that large values of a random variable X are associated with large values of another random variable Y, we will say that the covariance between the two random variables X and Y, written Cov(X, Y), is positive.

The covariance of two (discrete) random variables X and Y is defined as follows, where E[⋅] refers to the expectation of a random variable:

$$Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$

It is possible to show that this is equivalent to:

$$Cov(X, Y) = E[XY] - E[X]E[Y]$$

The expectation E[XY] is defined to be:

$$E[XY] = \sum_x \sum_y x\, y\, f_{X,Y}(x, y)$$

If the standard deviations of the two random variables are $\sigma_X$ and $\sigma_Y$, the correlation between the two random variables, $\rho_{XY}$, is defined as:

$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$
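As an illustration, the covariance and correlation of the discrete example above can be computed directly from the joint PMF; this sketch assumes the probs, PX, and PY objects defined earlier:

x_vals <- 1:7
y_vals <- 0:1
EX <- sum(x_vals * PX)
EY <- sum(y_vals * PY)
## E[XY]: sum x * y * p(x, y) over all cells;
## outer(y_vals, x_vals) matches the row/column layout of probs
EXY <- sum(outer(y_vals, x_vals) * probs)
cov_xy <- EXY - EX * EY
## standard deviations via Var(X) = E[X^2] - E[X]^2:
sd_x <- sqrt(sum(x_vals^2 * PX) - EX^2)
sd_y <- sqrt(sum(y_vals^2 * PY) - EY^2)
(rho_xy <- cov_xy / (sd_x * sd_y))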

1.6.2 Example 2: Continuous bivariate distributions

Consider now the continuous bivariate case; this time, we will use simulated data. Consider two normal random variables X and Y, each of which comes from, for example, a Normal(0, 1) distribution, with some correlation $\rho_{X,Y}$ between the two random variables.

A bivariate distribution for two random variables X and Y, each of which comes from a normal distribution, is expressed in terms of the means and standard deviations of each of the two distributions, and the correlation $\rho_{XY}$ between them. The standard deviations and correlation are expressed in a special form of a 2 × 2 matrix called a variance-covariance matrix $\Sigma$. If $\rho_{XY}$ is the correlation between the two random variables, and $\sigma_X$ and $\sigma_Y$ the respective standard deviations, the variance-covariance matrix is written as:

$$\Sigma = \begin{pmatrix} \sigma_X^2 & \rho_{XY}\sigma_X\sigma_Y \\ \rho_{XY}\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}$$

The off-diagonals of this matrix contain the covariance between X and Y.

The joint distribution of X and Y is defined as follows:

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathcal{N}_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma\right)$$

The joint PDF is written with reference to the two variables, $f_{X,Y}(x, y)$. It has the property that the area under the curve sums to 1; this sum-to-1 property is the same idea that we encountered in the univariate cases (the normal and binomial distributions), except that we are talking about a bivariate distribution here.

Formally, we would write the area under the curve as a double integral: we are summing up the volume under the curve for both X and Y (hence the two integrals):

$$\iint_{S_{X,Y}} f_{X,Y}(x, y)\, dx\, dy = 1$$

Here, the terms dx and dy express the fact that we are summing the area under the curve along the X axis and the Y axis.

The joint CDF would be written as follows. The equation below gives us the probability of observing a value like (u, v) or some value smaller than that (i.e., some (u′, v′) such that u′ < u and v′ < v):

$$F_{X,Y}(u, v) = P(X < u, Y < v) = \int_{-\infty}^{u}\int_{-\infty}^{v} f_{X,Y}(x, y)\, dy\, dx \quad \text{for } (x, y) \in \mathbb{R}^2$$

Just as in the discrete case, the marginal distributions can be derived by marginalizing out the other random variable:

$$f_X(x) = \int_{S_Y} f_{X,Y}(x, y)\, dy \qquad f_Y(y) = \int_{S_X} f_{X,Y}(x, y)\, dx$$

Here, $S_X$ and $S_Y$ are the respective supports, and the integral sign ∫ is the continuous equivalent of the summation sign ∑ in the discrete case. Luckily, we will never have to compute such integrals ourselves; but it is important to appreciate how a marginal distribution arises from a bivariate distribution: by integrating out or marginalizing out the other random variable.
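To see marginalization in action, here is a hand-coded sketch of the standard bivariate normal density (an illustration, not code from the book's packages); integrating out y numerically at some point x recovers the univariate standard normal density:

## density of a standard bivariate normal with correlation rho:
dbivnorm <- function(x, y, rho) {
  1 / (2 * pi * sqrt(1 - rho^2)) *
    exp(-(x^2 - 2 * rho * x * y + y^2) / (2 * (1 - rho^2)))
}
## marginalize out y at the point x = 1:
integrate(function(y) dbivnorm(x = 1, y = y, rho = 0.6),
          lower = -Inf, upper = Inf)$value

## [1] 0.242

## compare with the analytical marginal, a Normal(0, 1) density:
dnorm(1)

## [1] 0.242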
A visualization will help. The figures below show a bivariate distribution with zero correlation
(Figure 1.10), a negative (Figure 1.11) and a positive correlation (Figure 1.12).
FIGURE 1.10: A bivariate normal distribution with zero correlation. Shown are four plots: the top-right plot shows the three-dimensional bivariate density, the top-left plot the contour plot of the distribution (seen from above). The lower plots show the cumulative distribution function from two views, as a three-dimensional plot and as a contour plot.
FIGURE 1.11: A bivariate normal distribution with a negative correlation of -0.6. Shown are four plots: the top-right plot shows the three-dimensional bivariate density, the top-left plot the contour plot of the distribution (seen from above). The lower plots show the cumulative distribution function from two views, as a three-dimensional plot and as a contour plot.
FIGURE 1.12: A bivariate normal distribution with a positive correlation of 0.6. Shown are four plots: the top-right plot shows the three-dimensional bivariate density, the top-left plot the contour plot of the distribution (seen from above). The lower plots show the cumulative distribution function from two views, as a three-dimensional plot and as a contour plot.
In this book, we will make use of such multivariate distributions a lot, and it will soon become
important to know how to generate simulated bivariate or multivariate data that is correlated.
So let’s look at that next.

1.6.3 Generate simulated bivariate (multivariate) data

Suppose we want to generate 100 pairs of correlated data, with correlation ρ = 0.6 . The two
random variables have mean 0, and standard deviations 5 and 10 respectively.

Here is how we would generate such data. First, define a variance-covariance matrix; then,
use the multivariate analog of the rnorm() function, mvrnorm() , from the MASS package to
generate 100 data points.

library(MASS)
## define a variance-covariance matrix:
Sigma <- matrix(c(5^2, 5 * 10 * .6, 5 * 10 * .6, 10^2),
                byrow = FALSE, ncol = 2)
## generate data:
u <- mvrnorm(n = 100,
             mu = c(0, 0),
             Sigma = Sigma)
head(u, n = 3)

## [,1] [,2]
## [1,] -5.32 -9.66
## [2,] 1.73 5.14
## [3,] 2.07 3.16

Figure 1.13 confirms that the simulated data are positively correlated.
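One can also check this numerically; the sample correlation should be near the true value of 0.6, although it will vary from one simulated data set to the next:

cor(u[, 1], u[, 2])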

FIGURE 1.13: The relationship between two positively correlated random variables, generated by simulating data using the R function mvrnorm from the MASS library.
One final useful fact about the variance-covariance matrix—one that we will need later—is that
it can be decomposed into the component standard deviations and an underlying correlation
matrix. For example, consider the matrix above:


Sigma

## [,1] [,2]
## [1,] 25 30
## [2,] 30 100

One can decompose the matrix as follows. The matrix can be seen as the product of a diagonal matrix of the standard deviations and the correlation matrix:

$$\begin{pmatrix} 5 & 0 \\ 0 & 10 \end{pmatrix}\begin{pmatrix} 1.0 & 0.6 \\ 0.6 & 1.0 \end{pmatrix}\begin{pmatrix} 5 & 0 \\ 0 & 10 \end{pmatrix}$$

One can reassemble the variance-covariance matrix by pre-multiplying and post-multiplying the correlation matrix with the diagonal matrix containing the standard deviations:5

$$\begin{pmatrix} 5 & 0 \\ 0 & 10 \end{pmatrix}\begin{pmatrix} 1.0 & 0.6 \\ 0.6 & 1.0 \end{pmatrix}\begin{pmatrix} 5 & 0 \\ 0 & 10 \end{pmatrix} = \begin{pmatrix} 25 & 30 \\ 30 & 100 \end{pmatrix}$$

Using R (the symbol %*% is the matrix multiplication operator in R ):


## sds:
(sds <- c(5, 10))

## [1] 5 10


## diagonal matrix:
(sd_diag <- diag(sds))
## [,1] [,2]
## [1,] 5 0
## [2,] 0 10


## correlation matrix:
(corrmatrix <- matrix(c(1, 0.6, 0.6, 1), ncol = 2))

## [,1] [,2]
## [1,] 1.0 0.6
## [2,] 0.6 1.0


sd_diag %*% corrmatrix %*% sd_diag

## [,1] [,2]
## [1,] 25 30
## [2,] 30 100

1.7 An important concept: The marginal likelihood (integrating out a parameter)

Here, we introduce a concept that will turn up many times in this book. The concept we
unpack here is called “integrating out a parameter”. We will need this when we encounter
Bayes’ rule in the next chapter, and when we use Bayes factors later in the book (chapter 15).

Integrating out a parameter refers to the following situation. The example used here discusses
the binomial distribution, but the approach is generally applicable for any distribution.

Suppose we have a binomial random variable Y with PMF p(Y). Suppose also that this PMF is defined in terms of a parameter θ that can have only three possible values, 0.1, 0.5, and 0.9, each with equal probability. In other words, the probability that θ is 0.1, 0.5, or 0.9 is 1/3 each.
We stick with our earlier example of n = 10 trials and k = 7 successes. The likelihood function then is:

$$p(k = 7, n = 10 \mid \theta) = \binom{10}{7}\theta^{7}(1-\theta)^{3}$$

There is a related concept of marginal likelihood, which we can write here as p(k = 7, n = 10). The marginal likelihood is the likelihood computed by "marginalizing" out the parameter θ: for each possible value that the parameter θ can have, we compute the likelihood at that value and multiply that likelihood with the probability/density of that θ value occurring. Then we sum up each of the products computed in this way. Mathematically, this means that we carry out the following operation.

In our example, there are three possible values of θ; call them $\theta_1 = 0.1$, $\theta_2 = 0.5$, and $\theta_3 = 0.9$. Each has probability 1/3, so $p(\theta_1) = p(\theta_2) = p(\theta_3) = 1/3$. Given this information, we can compute the marginal likelihood as follows:

$$\begin{aligned}
p(k = 7, n = 10) = & \binom{10}{7}\theta_1^{7}(1-\theta_1)^{3} \times p(\theta_1) \\
& + \binom{10}{7}\theta_2^{7}(1-\theta_2)^{3} \times p(\theta_2) \\
& + \binom{10}{7}\theta_3^{7}(1-\theta_3)^{3} \times p(\theta_3)
\end{aligned}$$

Writing in the θ values and their probabilities, we get:

$$\begin{aligned}
p(k = 7, n = 10) = & \binom{10}{7}0.1^{7}(1-0.1)^{3} \times \frac{1}{3} \\
& + \binom{10}{7}0.5^{7}(1-0.5)^{3} \times \frac{1}{3} \\
& + \binom{10}{7}0.9^{7}(1-0.9)^{3} \times \frac{1}{3}
\end{aligned}$$

Taking the common factors ($\frac{1}{3}$ and $\binom{10}{7}$) out:

$$p(k = 7, n = 10) = \frac{1}{3}\binom{10}{7}\left[0.1^{7}(1-0.1)^{3} + 0.5^{7}(1-0.5)^{3} + 0.9^{7}(1-0.9)^{3}\right] = 0.058$$

Thus, a marginal likelihood is a kind of weighted sum of the likelihood, weighted by the prior probabilities of the possible values of the parameter.6
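The same computation can be done in one line of R, under the three-value assumption above:

theta_vals <- c(0.1, 0.5, 0.9)
sum(dbinom(7, size = 10, prob = theta_vals) * (1 / 3))

## [1] 0.0582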
The above example was contrived, because we stated that the parameter θ has only three
possible values. In reality, because the parameter θ can have all possible values between 0
and 1, the summation has to be done over a continuous space [0, 1] . The way this summation
is expressed in mathematics is through the integral symbol:

$$p(k = 7, n = 10) = \int_{0}^{1} \binom{10}{7}\theta^{7}(1-\theta)^{3}\, d\theta$$

This statement is computing something similar to what we computed above with the three
discrete parameter values, except that the summation is being done over a continuous space
ranging from 0 to 1. We say that the parameter θ has been integrated out, or marginalized.
Integrating out a parameter will be a very common operation in this book, but we will never
have to do the calculation ourselves. For the above case, we can compute the integral in R:


BinLik <- function(theta) {
  choose(10, 7) * theta^7 * (1 - theta)^3
}
integrate(BinLik, lower = 0, upper = 1)$value

## [1] 0.0909

The value that is output by the integrate function above is the marginal likelihood.

This completes our discussion of random variables and probability distributions. We now
summarize what we have learned so far.

1.8 Summary of useful R functions relating to distributions

Table 1.4 summarizes the different functions relating to PMFs and PDFs, using the binomial
and normal as examples.
TABLE 1.4: Important R functions relating to random variables.

Discrete Continuous

Example: Binomial(y|n, θ) Normal(y|μ, σ)

Likelihood function dbinom dnorm

Prob Y=y dbinom always 0

Prob Y ≥ y, Y ≤ y, y1 < Y < y2 pbinom pnorm

Inverse CDF qbinom qnorm

Generate simulated data rbinom rnorm

Later on, we will use other distributions, such as the Uniform, Beta, etc., and each of these has its own set of d-p-q-r functions in R. One can look up these different distributions in, for example, Blitzstein and Hwang (2014).

1.9 Summary

This chapter briefly reviewed some very basic concepts in probability theory, univariate
discrete and continuous random variables, and bivariate distributions. An important set of
functions we encountered are the d-p-q-r family of functions for different distributions; these
are very useful for understanding the properties of commonly used distributions, visualizing
distributions, and for simulating data. Distributions will play a central role in this book; for
example, knowing how to visualize distributions will be important for deciding on prior
distributions for parameters. Other important ideas we learned about were marginal and
conditional probability, marginal likelihood, and how to define multivariate distributions; these
concepts will play an important role in Bayesian statistics.

1.10 Further reading

A quick review of the mathematical foundations needed for statistics is available in the short
book by Fox (2009), as well as Gill (2006). Morin (2016) and Blitzstein and Hwang (2014) are
accessible introductions to probability theory. Ross (2002) is a more advanced treatment
which discusses random variable theory and illustrates applications of probability theory. A
good formal introduction to mathematical statistics (covering classical frequentist theory) is
Miller and Miller (2004). The freely available book by Kerns (2014) introduces frequentist and
Bayesian statistics from the ground up in a very comprehensive and systematic manner; the
source code for the book is available from https://github.com/gjkerns/IPSUR. The open-access
book, Probability and Statistics: a simulation-based introduction, by Bob Carpenter is also
worth studying: https://github.com/bob-carpenter/prob-stats. A thorough introduction to the
matrix algebra needed for statistics, with examples using R, is provided in Fieller (2016).
Commonly used probability distributions are presented in detail in Miller and Miller (2004),
Blitzstein and Hwang (2014), and Ross (2002).

1.11 Exercises

Exercise 1.1 Practice using the pnorm function - Part 1

Given a normal distribution with mean 500 and standard deviation 100, use the pnorm
function to calculate the probability of obtaining values between 200 and 800 from this
distribution.

Exercise 1.2 Practice using the pnorm function - Part 2

Calculate the following probabilities. Given a normal distribution with mean 800 and standard
deviation 150, what is the probability of obtaining:

a score of 700 or less
a score of 900 or more
a score of 800 or more

Exercise 1.3 Practice using the pnorm function - Part 3

Given a normal distribution with mean 600 and standard deviation 200, what is the probability
of obtaining:

a score of 550 or less.
a score between 300 and 800.
a score of 900 or more.

Exercise 1.4 Practice using the qnorm function - Part 1

Consider a normal distribution with mean 1 and standard deviation 1. Compute the lower and
upper boundaries such that:

the area (the probability) to the left of the lower boundary is 0.10.
the area (the probability) to the left of the upper boundary is 0.90.
Exercise 1.5 Practice using the qnorm function - Part 2

Given a normal distribution with mean 650 and standard deviation 125, there exist two quantiles, the lower quantile q1 and the upper quantile q2, that are equidistant from the mean 650, such that the area under the curve of the normal between q1 and q2 is 80%. Find q1 and q2.

Exercise 1.6 Practice getting summaries from samples - Part 1

Given data that is generated as follows:


data_gen1 <- rnorm(1000, 300, 200)

Calculate the mean, the variance, and the lower quantile q1 and upper quantile q2 that are equidistant from the mean and such that the probability of observing values between them is 80%.

Exercise 1.7 Practice getting summaries from samples - Part 2.

This time we generate the data with a truncated normal distribution from the package extraDistr. The details of this distribution will be discussed later in section 4.1 and in Box 4.1, but for now we can treat it as an unknown generative process:


data_gen1 <- rtnorm(1000, 300, 200, a = 0)

Using the sample data, calculate the mean, variance, and the lower quantile q1 and the upper
quantile q2, such that the probability of observing values between these two quantiles is 80%.

Exercise 1.8 Practice with a variance-covariance matrix for a bivariate distribution.

Suppose that you have a bivariate distribution where one of the two random variables comes
from a normal distribution with mean μX = 600 and standard deviation σX = 100 , and the
other from a normal distribution with mean μY = 400 and standard deviation σY = 50 . The
correlation ρXY between the two random variables is 0.4 . Write down the variance-covariance
matrix of this bivariate distribution as a matrix (with numerical values, not mathematical
symbols), and then use it to generate 100 pairs of simulated data points. Plot the simulated
data such that the relationship between the random variables X and Y is clear. Generate two
sets of new data (100 pairs of data points each) with correlation −0.4 and 0, and plot these
alongside the plot for the data with correlation 0.4 .
References

Blitzstein, Joseph K., and Jessica Hwang. 2014. Introduction to Probability. Chapman & Hall/CRC.

Fieller, Nick. 2016. Basics of Matrix Algebra for Statistics with R. Boca Raton, FL: CRC Press.

Fox, John. 2009. A Mathematical Primer for Social Statistics. Vol. 159. Sage.

Gill, Jeff. 2006. Essential Mathematics for Political and Social Research. Cambridge: Cambridge University Press.

Kerns, G. J. 2014. Introduction to Probability and Statistics Using R. Second Edition.

Kolmogorov, Andreĭ Nikolaevich. 1933. Foundations of the Theory of Probability: Second English Edition. Courier Dover Publications.

Laurinavichyute, Anna. 2020. "Similarity-Based Interference and Faulty Encoding Accounts of Sentence Processing." Dissertation, University of Potsdam.

Miller, I., and M. Miller. 2004. John E. Freund’s Mathematical Statistics with Applications.
Upper Saddle River, NJ: Prentice Hall.

Morin, David J. 2016. Probability: For the Enthusiastic Beginner. Createspace Independent
Publishing Platform.

Ross, Sheldon. 2002. A First Course in Probability. Pearson Education.

Steyer, Rolf, and Werner Nagel. 2017. Probability and Conditional Expectation: Fundamentals
for the Empirical Sciences. Vol. 5. John Wiley & Sons.

1. Here, we use Y , but we could have used any letter, such as X, Z, . . . . Later on, in some
situations we will use Greek letters like θ, μ, σ to represent a random variable.↩

2. The actual formal definition of random variable is more complex, and it is based on
measure theory. A more rigorous definition can be found in, for example, Steyer and
Nagel (2017).↩

3. A notational aside: In frequentist treatments, the PMF would be written p(y; θ), i.e., with a semi-colon rather than the conditional distribution marked by the vertical bar. The semi-colon is intended to indicate that in the frequentist paradigm, the parameters are fixed point values; by contrast, in the Bayesian paradigm, parameters are random variables. This has the consequence that for the Bayesian, the distribution of y, p(y), is really a conditional distribution, conditional on a random variable, here θ. For the frequentist, p(y) requires some point value for θ, but it cannot be a conditional distribution because θ is not a random variable. We define conditional distributions later in this section.↩
4. R will compute the standard deviation by dividing by n − 1 , not n ; this is because dividing
by n gives a biased estimate (chapter 10 of Miller and Miller 2004). This is not an
important detail for our purposes, and in any case for large n it doesn’t really matter
whether one divides by n or n − 1 .↩

5. There is a built-in convenience function, sdcor2cov in the SIN package that does this
calculation, taking the vector of standard deviations (not the diagonal matrix) and the
correlation matrix to yield the variance-covariance matrix: sdcor2cov(stddev = sds, corr
= corrmatrix) .↩

6. Where does the above formula come from? It falls out from the law of total probability
discussed above!↩

Chapter 2 Introduction to Bayesian data analysis
Before we can start analyzing realistic data sets using Bayes’ rule, it is important to
understand the application of Bayes’ rule in one of the simplest of cases, data involving the
binomial likelihood. This simple case is important to understand because it encapsulates the
essence of the Bayesian approach to data analysis, and because it allows us to analytically
work out the posterior distribution of the parameter of interest, using just a pen and paper.
This simple case also helps us to appreciate a crucial point: The posterior distribution of a
parameter is a compromise between the prior and the likelihood. This important insight will
play a central role in the realistic data analysis situations we will cover in the remainder of this
book.

2.1 Bayes’ rule

Recall Bayes’ rule: When A and B are observable discrete events (such as “it has been
raining” or “the streets are wet”), we can state the rule as follows:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} \quad (2.1)$$

Given a vector of data y, Bayes’ rule allows us to work out the posterior distributions of the
parameters of interest, which we can represent as the vector of parameters Θ . This
computation is achieved by rewriting (2.1) as (2.2). What is different here is that Bayes’ rule is
written in terms of probability distributions. Here, p(⋅) is a probability density function
(continuous case) or a probability mass function (discrete case).

$$p(\Theta \mid y) = \frac{p(y \mid \Theta) \times p(\Theta)}{p(y)} \quad (2.2)$$

The above statement can be rewritten in words as follows:

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Marginal Likelihood}}$$
The terms here have the following meaning. We elaborate on each point with an example
below.

The Posterior, p(Θ|y) , is the probability distribution of the parameters conditional on the
data.

The Likelihood, p(y|Θ), is as described in chapter 1: it is the PMF (discrete case) or the PDF (continuous case) expressed as a function of Θ.

The Prior, p(Θ) , is the initial probability distribution of the parameter(s), before seeing the
data.

The Marginal Likelihood, p(y) , was introduced in chapter 1 and standardizes the
posterior distribution to ensure that the area under the curve of the distribution sums to 1,
that is, it ensures that the posterior is a valid probability distribution.

An example will clarify all these terms, as we explain below.

2.2 Deriving the posterior using Bayes' rule: An analytical example

Recall our cloze probability example earlier. Subjects are shown sentences like

“It’s raining. I’m going to take the …”

Suppose that 100 subjects are asked to complete the sentence. If 80 out of 100 subjects complete the sentence with "umbrella," the estimated cloze probability or predictability (given the preceding context) would be $\frac{80}{100} = 0.8$. This is the maximum likelihood estimate of the probability of producing this word; we will designate the estimate with a "hat" on the parameter name: $\hat{\theta} = 0.8$. In the frequentist paradigm, $\hat{\theta} = 0.8$ is an estimate of an unknown point value θ "out there in nature."

A crucial point to notice here is that the proportion 0.80 that we estimated above from the data
can vary from one data set to another, and the variability in the estimate will be influenced by
the sample size. For example, assuming that the true value of the θ parameter is in fact 0.80,
if we repeatedly carry out the above experiment with say 10 participants, we will get some
variability in the estimated proportion. Let’s check this by carrying out 100 simulated
experiments and computing the variability of the estimated means under repeated sampling:

estimated_means <- rbinom(n = 100, size = 10, prob = 0.80) / 10
sd(estimated_means)

## [1] 0.145

The repeated runs of the (simulated) experiment are the sole underlying cause of the variability (shown by the output of the sd(estimated_means) command above) in the estimated proportion; the parameter θ = 0.80 itself is invariant here (we are repeatedly estimating this point value).

However, consider now an alternative radical idea: what if we treat θ as a random variable?
That is, suppose now that θ has a PDF associated with it. This PDF would now represent our
belief about possible values of θ, even before we have seen any data. For example, if at the
outset of the experiment, we believe that all possible values between 0 and 1 are equally
likely, we could represent that belief by stating that θ ∼ Uniform(0, 1) . The radical new idea
here is that we now have a way to represent our prior belief or knowledge about plausible
values of the parameter.

Now, if we were to run our simulated experiments again and again, there would be two
sources of variability in the estimate of the parameter: the data as well as the uncertainty
associated with θ.

theta <- runif(100, min = 0, max = 1)
estimated_means <- rbinom(n = 100, size = 10, prob = theta) / 10
sd(estimated_means)

## [1] 0.331

The higher standard deviation is now coming from the uncertainty associated with the θ parameter. To see this, assume a "tighter" PDF for θ, say θ ∼ Uniform(0.3, 0.8); then the variability in the estimated means would again be smaller, but not as small as when we assumed that θ was a point value:

theta <- runif(100, min = 0.3, max = 0.8)
estimated_means <- rbinom(n = 100, size = 10, prob = theta) / 10
sd(estimated_means)

## [1] 0.209

In other words, the greater the uncertainty associated with the parameter θ, the greater the
variability in the data.

The Bayesian approach to parameter estimation makes this radical departure from the
standard frequentist assumption that θ is a point value; in the Bayesian approach, θ is a
random variable with a probability density/mass function associated with it. This PDF is called
a prior distribution, and represents our prior belief or prior knowledge about possible values of
this parameter. Once we obtain data, these data serve to modify our prior belief about this
distribution; this updated probability density function of the parameter is called the posterior
distribution. These ideas are unpacked in the sections below.

2.2.1 Choosing a likelihood

Under the assumptions we have set up above, the responses follow a binomial distribution,
and so the PMF can be written as follows.

$$p(k \mid n, \theta) = \binom{n}{k}\theta^{k}(1-\theta)^{n-k} \quad (2.3)$$

where k indicates the number of times "umbrella" is given as an answer, and n the total number of answers given. Here, k can be any whole number from 0 to n; in our running example, from 0 to 100.

In a particular experiment that we carry out, if we collect 100 data points (n = 100 ) and it
turns out that k = 80 , these data are now a fixed quantity. The only variable now in the PMF
above is θ:

$$p(k = 80 \mid n = 100, \theta) = \binom{100}{80}\theta^{80}(1-\theta)^{20}$$

The above function is now a continuous function of the value θ, which has possible values ranging from 0 to 1. Compare this to the PMF of the binomial, which treats θ as a fixed value and defines a discrete distribution over the n + 1 possible discrete values k that we can observe (the possible numbers of successes).
Recall that the PMF and the likelihood are the same function seen from different points of
view. The only difference between the two is what is considered to be fixed and what is
varying. The PMF treats data as varying from experiment to experiment and θ as fixed,
whereas the likelihood function treats the data as fixed and the parameter θ as varying.

We now turn our attention back to our main goal, which is to find out, using Bayes’ rule, the
posterior distribution of θ given our data: p(θ|n, k) . In order to use Bayes’ rule to calculate this
posterior distribution, we need to define a prior distribution over the parameter θ. In doing so,
we are explicitly expressing our prior uncertainty about plausible values of θ.

2.2.2 Choosing a prior for θ

For the choice of prior for θ in the binomial distribution, we need to assume that the parameter θ is a random variable that has a PDF whose range lies within [0, 1], the range over which θ can vary (this is because θ represents a probability). The beta distribution, which is a PDF for a continuous random variable, is commonly used as a prior for parameters representing probabilities. One reason for this choice is that its PDF ranges over the interval [0, 1]. The other reason for this choice is that it makes the Bayes' rule calculation remarkably easy.

The beta distribution has the following PDF.

$$p(\theta \mid a, b) = \frac{1}{B(a, b)}\theta^{a-1}(1-\theta)^{b-1} \quad (2.4)$$

The term $B(a, b)$ expands to $\int_{0}^{1} \theta^{a-1}(1-\theta)^{b-1}\, d\theta$, and is a normalizing constant that ensures that the area under the curve sums to one. In some textbooks, you may see the PDF of the beta distribution with the normalizing constant $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}$ (the expression Γ(n) is defined as (n − 1)!):

$$p(\theta \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}$$

These two statements of the beta distribution are identical because $B(a, b)$ can be shown to be equal to $\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ (Ross 2002).
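This identity is easy to check numerically in R; for example, for a = b = 4:

a <- 4; b <- 4
beta(a, b)

## [1] 0.00714

gamma(a) * gamma(b) / gamma(a + b)

## [1] 0.00714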

The beta distribution’s parameters a and b can be interpreted as expressing our prior beliefs
about the probability of success; a represents the number of “successes”, in our case,
answers that are “umbrella” and b the number of failures, the answers that are not “umbrella”.
Figure 2.1 shows the different beta distribution shapes given different values of a and b.
FIGURE 2.1: Examples of beta distributions with different parameters. (The panels show Beta(a, b) for (a, b) = (1, 1), (4, 4), (7, 7), (10, 10), (10, 1), (7, 4), (4, 7), and (1, 10).)


As in the binomial and normal distributions that we saw in chapter 1, one can analytically
derive the formulas for the expectation and variance of the beta distribution. These are:

$$E[X] = \frac{a}{a+b} \qquad Var(X) = \frac{a \times b}{(a+b)^2(a+b+1)} \quad (2.5)$$
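These formulas are straightforward to evaluate in R; for instance, for the values a = 4 and b = 4 used just below:

a <- 4; b <- 4
a / (a + b)                          ## expectation

## [1] 0.5

(a * b) / ((a + b)^2 * (a + b + 1))  ## variance

## [1] 0.0278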

As an example, choosing a = 4 and b = 4 would mean that the answer “umbrella” is as likely
as a different answer, but we are relatively unsure about this. We could express our
uncertainty by computing the region over which we are 95% certain that the value of the
parameter lies; this is the 95% credible interval. For this, we would use the qbeta function in
R; the parameters a and b are called shape1 and shape2 in R .

qbeta(c(0.025, 0.975), shape1 = 4, shape2 = 4)

## [1] 0.184 0.816

The credible interval chosen above is an equal-tailed interval: the area below the lower bound
and above the upper bound is the same (0.025 in the above case). One could define
alternative intervals; for example, in a distribution with only one mode (one peak; a unimodal
distribution), one could choose to use the narrowest interval that contains the mode. This is
called the highest posterior density interval (HDI). In skewed posterior distributions, the equal-
tailed credible interval and the HDI will not be identical, because the HDI will have unequal tail
probabilities. Some authors, such as Kruschke (2014), prefer to report the HDI. We will use
the equal-tailed interval in this book, simply because this is the standard output in Stan and
brms .

If we were to choose a = 10 and b = 10 , we would still be assuming that a priori the answer
“umbrella” is just as likely as some other answer, but now our prior uncertainty about this
mean is lower, as the 95% credible interval computed below shows.


qbeta(c(0.025, 0.975), shape1 = 10, shape2 = 10)

## [1] 0.289 0.711

In Figure 2.1, we can see also the difference in uncertainty in these two examples graphically.

Which prior should we choose? In a real data analysis problem, the choice of prior would
depend on what prior knowledge we want to bring into the analysis (see chapter 6). If we don’t
have much prior information, we could use a = b = 1 ; this gives us a uniform prior (i.e.,
Uniform(0, 1) ). This kind of prior goes by various names, such as flat, non-informative prior,
or uninformative prior. By contrast, if we have a lot of prior knowledge and/or a strong belief
(e.g., based on a particular theory’s predictions, or prior data) that θ has a particular range of
plausible values, we can use a different set of a, b values to reflect our belief about the
parameter. Generally speaking, the larger our parameters a and b, the narrower the spread of
the distribution; i.e., the lower our uncertainty about the mean value of the parameter.

We will discuss prior specification in detail later in chapter 6. For the moment, just for
illustration, we choose the values a = 4 and b = 4 for the beta prior. Then, our prior for θ is
the following beta PDF:
$$p(\theta) = \frac{1}{B(4, 4)}\theta^{3}(1-\theta)^{3}$$

Having chosen a likelihood, and having defined a prior on θ, we are ready to carry out our first
Bayesian analysis to derive a posterior distribution for θ.

2.2.3 Using Bayes’ rule to compute the posterior p(θ|n, k)

Having specified the likelihood and the prior, we will now use Bayes' rule to calculate p(θ|n, k). Using Bayes' rule simply involves substituting the likelihood and the prior we defined above into the equation we saw earlier:

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Marginal Likelihood}}$$

Replacing the terms for likelihood and prior in this equation:

$$p(\theta \mid n = 100, k = 80) = \frac{\left[\binom{100}{80}\theta^{80}(1-\theta)^{20}\right] \times \left[\frac{1}{B(4,4)}\theta^{3}(1-\theta)^{3}\right]}{p(k = 80)} \quad (2.6)$$

where $p(k = 80)$ is $\int_{0}^{1} p(k = 80 \mid n = 100, \theta)\, p(\theta)\, d\theta$. This term will be a constant once the number of successes k is known; this is the marginal likelihood we encountered in chapter 1.
In fact, once k is known, there are several constant values in the above equation; they are
constants because none of them depend on the parameter of interest, θ. We can collect all of
these together:

$$p(\theta \mid n = 100, k = 80) = \left[\frac{\binom{100}{80}}{B(4,4) \times p(k = 80)}\right]\left[\theta^{80}(1-\theta)^{20} \times \theta^{3}(1-\theta)^{3}\right] \quad (2.7)$$

The first term, in square brackets, $\frac{\binom{100}{80}}{B(4,4) \times p(k=80)}$, is all the constants collected together, and is the normalizing constant we have seen before; it makes the posterior distribution p(θ|n = 100, k = 80) sum to one. Since it is a constant, we can ignore it for now and focus on the two other terms in the equation. Because we are ignoring the constant, we will now say that the posterior is proportional to the right-hand side.

$$p(\theta \mid n = 100, k = 80) \propto \left[\theta^{80}(1-\theta)^{20} \times \theta^{3}(1-\theta)^{3}\right] \quad (2.8)$$

A common way of writing the above equation is:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

Resolving the right-hand side now simply involves adding up the exponents! In this example, computing the posterior really does boil down to this simple addition operation on the exponents.

$$p(\theta \mid n = 100, k = 80) \propto \theta^{80+3}(1-\theta)^{20+3} = \theta^{83}(1-\theta)^{23} \quad (2.9)$$

The expression on the right-hand side corresponds to a beta distribution with parameters a = 84 and b = 24. This becomes evident if we rewrite the right-hand side such that it represents the core part of a beta PDF (see equation (2.4)). All that is missing is a normalizing constant which would make the area under the curve sum to one.

$$\theta^{83}(1-\theta)^{23} = \theta^{84-1}(1-\theta)^{24-1}$$

This core part of any PDF or PMF is called the kernel of that distribution. Without a normalizing constant, the area under the curve will not sum to one. Let's check this:

PostFun <- function(theta) {
  theta^83 * (1 - theta)^23
}
(AUC <- integrate(PostFun, lower = 0, upper = 1)$value)

## [1] 8.31e-26

So the area under the curve (AUC) is not 1; the posterior that we computed above is not a proper probability distribution. What we have just done above is to compute the following integral:

$$\int_{0}^{1} \theta^{83}(1-\theta)^{23}\, d\theta$$

We can use this integral to figure out what the normalizing constant is. Basically, we want to know what the constant k is such that the area under the curve sums to 1:

$$k\int_{0}^{1} \theta^{83}(1-\theta)^{23}\, d\theta = 1$$

We know what $\int_{0}^{1} \theta^{83}(1-\theta)^{23}\, d\theta$ is; we just computed that value (called AUC in the R code above). So, the normalizing constant is:

$$k = \frac{1}{\int_{0}^{1} \theta^{83}(1-\theta)^{23}\, d\theta} = \frac{1}{AUC}$$

So, all that is needed to make the kernel $\theta^{83}(1-\theta)^{23}$ into a proper probability distribution is to include a normalizing constant, which, according to the definition of the beta distribution (equation (2.4)), would be $\frac{1}{B(84, 24)}$; the term $B(84, 24)$ is in fact the integral we computed above.

So, what we have is the distribution of θ given the data, expressed as a PDF:

$$p(\theta \mid n = 100, k = 80) = \frac{1}{B(84, 24)}\theta^{84-1}(1-\theta)^{24-1}$$

Now, this function will sum to one:

PostFun <- function(theta) {
  theta^83 * (1 - theta)^23 / AUC
}
integrate(PostFun, lower = 0, upper = 1)$value

## [1] 1

2.2.4 Summary of the procedure

To summarize, we started with data (n = 100, k = 80) and a binomial likelihood, multiplied it with the prior θ ∼ Beta(4, 4), and obtained the posterior p(θ|n, k) ∼ Beta(84, 24). The constants were ignored when carrying out the multiplication; we say that we computed the posterior up to proportionality. Finally, we showed how, in this simple example, the posterior can be rescaled to become a probability distribution, by including a proportionality constant.

The above example is a case of a conjugate analysis: the posterior on the parameter has the
same form (belongs to the same family of probability distributions) as the prior. The above
combination of likelihood and prior is called the beta-binomial conjugate case. There are
several other such combinations of Likelihoods and Priors that yield a posterior that has a
PDF that belongs to the same family as the PDF on the prior; some examples will appear in
the exercises.

Formally, conjugacy is defined as follows: Given the likelihood p(y|θ), if the prior p(θ) results in a posterior p(θ|y) that has the same form as p(θ), then we call p(θ) a conjugate prior.

For the beta-binomial conjugate case, we can derive a very general relationship between the likelihood, prior, and posterior. Given the binomial likelihood up to proportionality (ignoring the constant), $\theta^{k}(1-\theta)^{n-k}$, and given the prior, also up to proportionality, $\theta^{a-1}(1-\theta)^{b-1}$, their product will be:

$$\theta^{k}(1-\theta)^{n-k}\, \theta^{a-1}(1-\theta)^{b-1} = \theta^{a+k-1}(1-\theta)^{b+n-k-1}$$

Thus, given a Binomial(n, k|θ) likelihood, and a Beta(a, b) prior on θ, the posterior will be Beta(a + k, b + n − k).
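For the running example (a = b = 4, n = 100, k = 80), this update rule immediately yields the posterior parameters derived earlier:

a <- 4; b <- 4; n <- 100; k <- 80
c(a + k, b + n - k)

## [1] 84 24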

2.2.5 Visualizing the prior, likelihood, and posterior

We established in the example above that the posterior is a beta distribution with parameters
a = 84 , and b = 24 . We visualize the likelihood, prior, and the posterior alongside each other
in Figure 2.2.

FIGURE 2.2: The (scaled) likelihood, prior, and posterior in the beta-binomial conjugate example. The likelihood is scaled to integrate to 1 to make it easier to compare to the prior and posterior distributions.
We can summarize the posterior distribution either graphically as we did above, or summarize
it by computing the mean and the variance. The mean gives us an estimate of the cloze
probability of producing “umbrella” in that sentence (given the model, i.e., given the likelihood
and prior):

$$E[\hat{\theta}] = \frac{84}{84 + 24} = 0.78 \quad (2.10)$$

$$var[\hat{\theta}] = \frac{84 \times 24}{(84 + 24)^2(84 + 24 + 1)} = 0.0016 \quad (2.11)$$

We could also display the 95% credible interval, the range over which we are 95% certain the
true value of θ lies, given the data and model.


qbeta(c(0.025, 0.975), shape1 = 84, shape2 = 24)

## [1] 0.695 0.851

Typically, we would summarize the results of a Bayesian analysis by displaying the posterior
distribution of the parameter (or parameters) graphically, along with the above summary
statistics: the mean, the standard deviation or variance, and the 95% credible interval. You will
see many examples of such summaries later.

2.2.6 The posterior distribution is a compromise between the prior and the likelihood

Just for the sake of illustration, let’s take four different beta priors, each reflecting increasing
certainty.

Beta(a = 2, b = 2)

Beta(a = 3, b = 3)

Beta(a = 6, b = 6)

Beta(a = 21, b = 21)

Each prior reflects a belief that θ = 0.5 , with varying degrees of (un)certainty. Given the
general formula we developed above for the beta-binomial case, we just need to plug in the
likelihood and the prior to get the posterior:
$$p(\theta \mid n, k) \propto p(k \mid n, \theta)\, p(\theta)$$

The four corresponding posterior distributions would be:

$$p(\theta \mid k, n) \propto [\theta^{80}(1-\theta)^{20}][\theta^{2-1}(1-\theta)^{2-1}] = \theta^{82-1}(1-\theta)^{22-1}$$

$$p(\theta \mid k, n) \propto [\theta^{80}(1-\theta)^{20}][\theta^{3-1}(1-\theta)^{3-1}] = \theta^{83-1}(1-\theta)^{23-1}$$

$$p(\theta \mid k, n) \propto [\theta^{80}(1-\theta)^{20}][\theta^{6-1}(1-\theta)^{6-1}] = \theta^{86-1}(1-\theta)^{26-1}$$

$$p(\theta \mid k, n) \propto [\theta^{80}(1-\theta)^{20}][\theta^{21-1}(1-\theta)^{21-1}] = \theta^{101-1}(1-\theta)^{41-1}$$

We can visualize each of these triplets of priors, likelihoods and posteriors; see Figure 2.3.

FIGURE 2.3: The (scaled) likelihood, prior, and posterior in the beta-binomial conjugate example, for different uncertainties in the prior (panels show the priors Beta(2, 2), Beta(3, 3), Beta(6, 6), and Beta(21, 21)). The likelihood is scaled to integrate to 1 to make its comparison easier.

Given some data and given a likelihood function, the tighter the prior, the greater the extent to
which the posterior orients itself towards the prior. In general, we can say the following about
the likelihood-prior-posterior relationship:

The posterior distribution of a parameter is a compromise between the prior and the
likelihood.
For a given set of data, the greater the certainty in the prior, the more heavily will the
posterior be influenced by the prior mean.
Conversely, for a given set of data, the greater the uncertainty in the prior, the more
heavily will the posterior be influenced by the likelihood.
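A quick computation shows this compromise numerically: with k = 80 and n = 100, the posterior mean under a Beta(a, a) prior is (a + 80)/(2a + 100), which moves away from the sample mean 0.8 and toward the prior mean 0.5 as the prior becomes tighter:

a_prior <- c(2, 3, 6, 21)
(80 + a_prior) / (2 * a_prior + 100)

## [1] 0.788 0.783 0.768 0.711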

Another important observation emerges if we increase the sample size from 100 to, say,
1000000 . Suppose we still get a sample mean of 0.8 here, so that k = 800000 . Now, the
posterior mean will be influenced almost entirely by the sample mean. This is because, in the
general form for the posterior Beta(a + k, b + n − k) that we computed above, the n and k

become very large relative to the a, b values, and dominate in determining the posterior mean.

Whenever we do a Bayesian analysis, it is good practice to check whether the parameter you
are interested in estimating is sensitive to the prior specification. Such an investigation is
called a sensitivity analysis. Later in this book, we will see many examples of sensitivity
analyses in realistic data-analysis settings.

2.2.7 Incremental knowledge gain using prior knowledge

In the above example, we used an artificial example where we asked 100 subjects to
complete the sentence shown at the beginning of the chapter, and then we counted the
number of times that they produced “umbrella” vs. some other word as a continuation. Given
80 instances of “umbrella”, and using a Beta(4, 4) prior, we derived the posterior to be
Beta(84, 24) . We could now use this posterior as our prior for the next study. Suppose that
we were to carry out a second experiment, again with 100 subjects, and this time 60 produced
“umbrella”. We could now use our new prior (Beta(84, 24)) to obtain an updated posterior. We
have a = 84, b = 24, n = 100, k = 60 . This gives us as posterior:
Beta(a + k, b + n − k) = Beta(84 + 60, 24 + 100 − 60) = Beta(144, 64) .

Now, if we were to pool all our data that we have from the two experiments, then we would
have as data n = 200, k = 140 . Suppose that we keep our initial prior of a = 4, b = 4 . Then,
our posterior would be Beta(4 + 140, 4 + 200 − 140) = Beta(144, 64). This is exactly the same posterior that we got when we first analyzed the first 100 subjects’ data, derived the posterior, and then used that posterior as a prior for the next 100 subjects’ data.
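The equivalence of the sequential and the pooled analysis can be confirmed in a few lines of R (a sketch):

# Sequential updating: Beta(4, 4) prior; experiment 1 (k = 80, n = 100),
# then experiment 2 (k = 60, n = 100):
a1 <- 4 + 80; b1 <- 4 + 100 - 80    # Beta(84, 24)
a2 <- a1 + 60; b2 <- b1 + 100 - 60  # Beta(144, 64)
# Pooled analysis: Beta(4, 4) prior and all the data at once (k = 140, n = 200):
a_pool <- 4 + 140; b_pool <- 4 + 200 - 140
c(a2, b2)          # 144 64
c(a_pool, b_pool)  # 144 64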

This toy example illustrates an important point that has great practical importance for cognitive
science. One can incrementally gain information about a research question by using
information from previous studies and deriving a posterior, and then use that posterior as a
prior. For practical examples from psycholinguistics showing how information can be pooled
from previous studies, see Jäger, Engelmann, and Vasishth (2017) and Nicenboim, Roettger,
and Vasishth (2018). Vasishth and Engelmann (2022) illustrates an example of how the
posterior from a previous study or collection of studies can be used to compute the posterior
derived from new data.

2.3 Summary

In this chapter, we learned how to use Bayes’ rule in the specific case of a binomial likelihood,
and a beta prior on the θ parameter in the likelihood function. Our goal in any Bayesian
analysis will follow the path we took in this simple example: decide on an appropriate
likelihood function, decide on priors for all the parameters involved in the likelihood function,
and use this model (i.e., the likelihood and the priors) to derive the posterior distribution of
each parameter. Then we draw inferences about our research question based on the posterior
distribution of the parameter.

In the example discussed in this chapter, Bayesian analysis was easy. This was because we
considered the simple conjugate case of the beta-binomial. In realistic data-analysis settings,
our likelihood function will be very complex, and many parameters will be involved. Multiplying
the likelihood function and the priors will become mathematically difficult or impossible. For
such situations, we use computational methods to obtain samples from the posterior
distributions of the parameters.

2.4 Further reading

Accessible introductions to conjugate Bayesian analysis are Lynch (2007), and Lunn et al.
(2012). Somewhat more demanding discussions of conjugate analysis are in Lee (2012),
Carlin and Louis (2008), Christensen et al. (2011), O’Hagan and Forster (2004) and Bernardo
and Smith (2009).

2.5 Exercises

Exercise 2.1 Deriving Bayes’ rule

Let A and B be two observable events. P (A) is the probability that A occurs, and P (B) is
the probability that B occurs. P (A|B) is the conditional probability that A occurs given that B

has happened. P (A, B) is the joint probability of A and B both occurring.


You are given the definition of conditional probability:

P(A|B) = P(A, B) / P(B), where P(B) > 0

Using the above definition, and using the fact that P (A, B) = P (B, A) (i.e., the probability of
A and B both occurring is the same as the probability of B and A both occurring), derive an
expression for P (B|A) . Show the steps clearly in the derivation.

Exercise 2.2 Conjugate forms 1

Computing the general form of a PDF for a posterior

Suppose you are given data k consisting of the number of successes, coming from a
Binomial(n, θ) distribution. Given k successes in n trials coming from a binomial distribution,
we define a Beta(a, b) prior on the parameter θ.

Write down the Beta distribution that represents the posterior, in terms of a, b, n, and k.

Practical application

We ask a subject 10 yes/no questions, and the subject returns 0 correct answers. We
assume a binomial likelihood function for these data. Also assume a Beta(1, 1) prior on the
parameter θ, which represents the probability of success. Use the result you derived above to
write down the posterior distribution of the θ parameter.

Exercise 2.3 Conjugate forms 2

Suppose that we perform n independent trials until we get a success (e.g., a heads in a coin
toss). For coin tosses, the possible outcomes could be H, TH, TTH, and so on. The probability of success in
each trial is θ. Then, the Geometric random variable, call it X , gives us the probability of
getting a success in n trials as follows:

Prob(X = n) = θ(1 − θ)^{n−1}

where n = 1, 2, … .

Let the prior on θ be Beta(a, b) , a beta distribution with parameters a,b. The posterior
distribution is a beta distribution with parameters a* and b*. Determine these parameters in
terms of a, b, and n .

Exercise 2.4 Conjugate forms 3


The Gamma distribution is defined in terms of the parameters a, b: Ga(a,b). If there is a
random variable Y (where y ≥ 0 ) that has a Gamma distribution as a PDF (
Y ∼ Gamma(a, b) ), then:

Ga(y|a, b) = (b^a y^{a−1} exp{−by}) / Γ(a)

Suppose that we have data x1 , … , xn , with sample size n that is exponentially distributed.
The exponential likelihood function is:

p(x_1, …, x_n | λ) = λ^n exp{−λ ∑_{i=1}^{n} x_i}

It turns out that if we assume a Ga(a,b) prior distribution for λ and the above Exponential
likelihood, the posterior distribution of λ is a Gamma distribution. In other words, the
Gamma(a,b) prior on the λ parameter in the Exponential distribution will be written:

Ga(λ|a, b) = (b^a λ^{a−1} exp{−bλ}) / Γ(a)

Find the parameters a and b of the posterior distribution.

Exercise 2.5 Conjugate forms 4

Computing the posterior

This is a contrived example. Suppose we are modeling the number of times that a speaker
says the word “I” per day. This could be of interest if we are studying, for example, how self-
oriented a speaker is. The number of times x that the word is uttered over a particular time period (here, one day) can be modeled by a Poisson distribution (x = 0, 1, 2, …):

f(x ∣ θ) = exp(−θ) θ^x / x!

where the rate θ is unknown, and the numbers of utterances of the target word on each day
are independent given θ.

We are told that the prior mean of θ is 100 and prior variance for θ is 225. This information is
based on the results of previous studies on the topic. We will use the Gamma(a,b) density
(see previous question) as a prior for θ because this is a conjugate prior to the Poisson
distribution.

a. First, visualize the prior, a Gamma density prior for θ based on the above information. [Hint: we know that for a Gamma density with parameters a, b, the mean is a/b and the variance is a/b^2. Since we are given values for the mean and variance, we can solve for a, b, which gives us the Gamma density.]

b. Next, derive the posterior distribution of the parameter θ up to proportionality, and write
down the posterior distribution in terms of the parameters of a Gamma distribution.

Practical application

Suppose we know that the number of “I” utterances from a particular individual is
115, 97, 79, 131 . Use the result you derived above to obtain the posterior distribution. In other
words, write down the parameters of the Gamma distribution (call them a∗, b∗ ) representing
the posterior distribution of θ.

Plot the prior and the posterior distributions alongside each other.

Now suppose you get one new data point: 200. Using the posterior Gamma(a∗, b∗) as your
prior, write down the updated posterior (in terms of the updated parameters of the Gamma
distribution) given this new data point. Add the updated posterior to the plot you made above.

Exercise 2.6 The posterior mean is a weighted mean of the prior mean and the MLE
(Poisson-Gamma conjugate case)

The number of times an event happens per unit time can be modeled using a Poisson
distribution, whose PMF is:

f(x ∣ θ) = exp(−θ) θ^x / x!

Suppose that we define a Gamma(a,b) prior for the rate parameter θ. It is a fact (see
exercises above) that the posterior of the θ parameter is a Gamma(a∗, b∗) distribution,
where a∗ and b∗ are the updated parameters given the data: θ ∼ Gamma(a∗, b∗) .

Prove that the posterior mean is a weighted mean of the prior mean and the maximum likelihood estimate (mean) of the Poisson-distributed data, x̄ = (∑_{i=1}^{n} x_i)/n. Hint: the mean of a Gamma distribution is a/b.

Specifically, what you have to prove is that:

a∗/b∗ = (a/b) × w1/(w1 + w2) + x̄ × w2/(w1 + w2)    (2.12)

where w1 = 1 and w2 = n/b.
Given equation (2.12), show that as n increases (as sample size goes up), the maximum
likelihood estimate x̄ dominates in determining the posterior mean, and when n gets
smaller and smaller, the prior mean dominates in determining the posterior mean.

Finally, given that the variance of a Gamma distribution is a/b^2, show that as n increases, the posterior variance will get smaller and smaller (the uncertainty on the posterior will go down).

References

Bernardo, José M, and Adrian FM Smith. 2009. Bayesian Theory. Vol. 405. John Wiley &
Sons.

Carlin, Bradley P, and Thomas A Louis. 2008. Bayesian Methods for Data Analysis. CRC
Press.

Christensen, Ronald, Wesley Johnson, Adam Branscum, and Timothy Hanson. 2011.
“Bayesian Ideas and Data Analysis.” CRC Press.

Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference in Sentence Comprehension: Literature review and Bayesian meta-analysis.” Journal of Memory and Language 94: 316–39. https://doi.org/10.1016/j.jml.2017.01.004.

Kruschke, John. 2014. Doing Bayesian Data Analysis: A tutorial with R, JAGS, and Stan.
Academic Press.

Lee, Peter M. 2012. Bayesian Statistics: An Introduction. John Wiley & Sons.

Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012.
The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.

Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for
Social Scientists. New York, NY: Springer.

Nicenboim, Bruno, Timo B. Roettger, and Shravan Vasishth. 2018. “Using Meta-Analysis for Evidence Synthesis: The case of incomplete neutralization in German.” Journal of Phonetics 70: 39–55. https://doi.org/10.1016/j.wocn.2018.06.001.

O’Hagan, Antony, and Jonathan Forster. 2004. “Kendall’s Advanced Theory of Statistics, Vol.
2B: Bayesian Inference.” Wiley.

Ross, Sheldon. 2002. A First Course in Probability. Pearson Education.


Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive
Process: A Computational Approach. Cambridge, UK: Cambridge University Press.
https://books.google.de/books?id=6KZKzgEACAAJ.

Chapter 3 Computational Bayesian data analysis

In the previous chapter, we learned how to analytically derive the posterior distribution of the
parameters in our model. In practice, however, this is possible for only a very limited number
of cases. Although the numerator of the Bayes rule, the unnormalized posterior, is easy to
calculate (by multiplying the probability density/mass functions analytically), the denominator,
the marginal likelihood, requires us to carry out an integration; see Equation (3.1).

p(Θ|y) = p(y|Θ) ⋅ p(Θ) / p(y) = p(y|Θ) ⋅ p(Θ) / ∫_Θ p(y|Θ) ⋅ p(Θ) dΘ    (3.1)

Unless we are dealing with conjugate distributions, the solution will be extremely hard to
derive or there will be no analytical solution. This was the major bottleneck of Bayesian
analysis in the past, and required Bayesian practitioners to program an approximation method
by themselves before they could even begin the Bayesian analysis. Fortunately, many of the
probabilistic programming languages freely available today (see the next section for a listing)
allow us to define our models without having to acquire expert knowledge about the relevant
numerical techniques.

3.1 Deriving the posterior through sampling

Let’s say that we want to derive the posterior of the model from section 2.2, that is, the
posterior distribution of the cloze probability of “umbrella”, θ, given the following data: a word
(e.g., “umbrella”) was answered 80 out of 100 times, and assuming a binomial distribution as
the likelihood function, and Beta(a = 4, b = 4) as a prior distribution for the cloze probability.
If we can obtain samples from the posterior distribution of θ instead of an analytically derived posterior, then, given enough samples, we will have a good approximation of the posterior distribution. Obtaining samples from the posterior will be the only viable option in the
models that we will discuss in this book. By “obtaining samples”, we are talking about a
situation analogous to when we use rbinom() or rnorm() to obtain samples from a
particular distribution. For more details about sampling algorithms, see the further readings
suggested in section 3.10.

Thanks to probabilistic programming languages, it will be relatively straightforward to get


these samples, and we will discuss how we will do it in more detail in the next section. For
now let’s assume that we used some probabilistic programming language to obtain 20000
samples from the posterior distribution of the cloze probability, θ: 0.828, 0.813, 0.786, 0.81,
0.806, 0.737, 0.771, 0.722, 0.763, 0.77, 0.778, 0.829, 0.736, 0.838, 0.776, 0.816, 0.743, 0.73,
0.701, 0.764, … Figure 3.1 shows that the approximation of the posterior looks quite similar to
the analytically derived posterior. The difference between the analytically computed and
approximated mean and variance are −0.0004 and 0.000005 respectively.

[Figure 3.1 here: histogram of the posterior samples of θ (x-axis: theta, roughly 0.6–0.9; y-axis: density), with the analytical posterior density overlaid.]

FIGURE 3.1: Histogram of the samples of θ from the posterior distribution generated via
sampling. The black line shows the density plot of the analytically derived posterior.
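In this particular conjugate case, we do not actually need a probabilistic programming language to see how this works: the posterior is known to be Beta(84, 24), so rbeta() can stand in for the sampler. A minimal sketch:

samples <- rbeta(20000, shape1 = 84, shape2 = 24)
# Compare the sample-based approximation against the analytical posterior:
mean(samples) - 84 / (84 + 24)                            # difference in means
var(samples) - (84 * 24) / ((84 + 24)^2 * (84 + 24 + 1))  # difference in variances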
3.2 Bayesian Regression Models using Stan: brms

The surge in popularity of Bayesian statistics is closely tied to the increase in computing
power and the appearance of probabilistic programming languages, such as WinBUGS (Lunn
et al. 2000), JAGS (Plummer 2016), PyMC3 (Salvatier, Wiecki, and Fonnesbeck 2016), Turing
(Ge, Xu, and Ghahramani 2018), and Stan (Carpenter et al. 2017); for a historical review, see
Plummer (2022).

These probabilistic programming languages allow the user to define models without having to
deal (for the most part) with the complexities of the sampling process. However, they require
learning a new language since the user has to fully specify the statistical model using a
particular syntax.7 Furthermore, some knowledge of the sampling process is needed to
correctly parameterize the models and to avoid convergence issues (these topics will be
covered in detail in chapter 11).

There are some alternatives that allow Bayesian inference in R without having to fully specify
the model “by hand”. The packages rstanarm (Goodrich et al. 2018) and brms (Bürkner 2019)
provide Bayesian equivalents of many popular R model-fitting functions, such as (g)lmer
(Bates, Mächler, et al. 2015b) and many others; both rstanarm and brms use Stan as the
back-end for estimation and sampling. The package R-INLA (Lindgren and Rue 2015) allows
for fitting a limited selection of likelihood functions and priors in comparison to rstanarm and
brms (R-INLA can fit models that can be expressed as latent Gaussian models). This package
uses the integrated nested Laplace approximation (INLA) method for approximating Bayesian
inference rather than a sampling algorithm as it is used by the other probabilistic languages
listed. Another alternative is JASP (JASP Team 2019), which provides a graphical user
interface for both frequentist and Bayesian modeling, and is intended to be an open-source
alternative to SPSS.

We will focus on brms in this part of the book. This is because it can be useful for a smooth
transition from frequentist models to their Bayesian equivalents. The package brms is not
only powerful enough to satisfy the statistical needs of many cognitive scientists, it has the
added benefit that the Stan code can be inspected (with the brms functions
make_stancode() and make_standata() ), allowing the user to customize their models or learn
from the code produced internally by brms to eventually transition to writing the models
entirely in Stan. We revisit the models of this and the following chapters in the introduction to
Stan in chapter 10.
3.2.1 A simple linear model: A single subject pressing a button repeatedly (a finger tapping task)

We’ll use the following example of a finger tapping task (for a review, see Hubel et al. 2013) to
illustrate the basic steps for fitting a model. Suppose that a subject first sees a blank screen.
Then, after a certain amount of time (say 200 ms), the subject sees a cross in the middle of a
screen, and as soon as they see the cross, they tap on the space bar as fast as they can until
the experiment is over (361 trials). The dependent measure here is the time it takes in
milliseconds from one press of the space bar to the next one. The data in each trial are
therefore finger tapping times in milliseconds. Suppose that the research question is: how long
does it take for this particular subject to press a key? (As an aside, notice that the data are not independent and identically distributed, but we are ignoring that detail for now; we will address that issue later in the book.)

Let’s model the data with the following assumptions:

1. There is a true (unknown) underlying time, μ ms, that the subject needs to press the
space bar.
2. There is some noise in this process.
3. The noise is normally distributed (this assumption is questionable given that finger tapping times, like response times in general, are usually skewed; we will fix this assumption later).8

This means that the likelihood for each observation n will be:

tn ∼ Normal(μ, σ) (3.2)

where n = 1, … , N , and t is the dependent variable (finger tapping times in milliseconds).


The variable N indexes the total number of data points. The symbol μ indicates the location of
the normal distribution function; the location parameter shifts the distribution left or right on the
horizontal axis. For the normal distribution, the location is also the mean of the distribution.
The symbol σ indicates the scale of the distribution; as the scale decreases, the distribution
gets narrower. This compression approaches a spike (all the probability mass gets concentrated near one point) as the scale parameter approaches zero. For the normal
distribution, the scale is also its standard deviation.

The reader may have encountered the model shown in Equation (3.2) in the form shown in
Equation (3.3):

tn = μ + εn, where εn ∼ Normal(0, σ), i.i.d.    (3.3)

When the model is written in this way, it should be understood as meaning that each data point tn has some variability around a mean value μ, and that variability has standard deviation σ. The term “iid” (independent and identically distributed) implies that each data point tn is independently generated (is not correlated with any of the other data points), and is coming from the same distribution (namely, Normal(μ, σ)).
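Read generatively, Equations (3.2) and (3.3) tell us how to simulate such data: draw N independent samples from a normal distribution. A sketch with made-up parameter values (the values of mu and sigma below are hypothetical, chosen only for illustration):

N <- 361
mu <- 170    # hypothetical true mean (ms)
sigma <- 25  # hypothetical true standard deviation (ms)
t <- rnorm(N, mean = mu, sd = sigma)
mean(t)      # the sample mean estimates mu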

For a frequentist model that will give us the maximum likelihood estimate (the sample mean)
of the time it takes to press the space bar, this would be enough information to write the
formula in R , t ~ 1 , and plug it into the function lm() together with the data: lm(t ~ 1,
data) . The meaning of the 1 here is that lm will estimate the intercept in the model, in our

case μ . If the reader is completely unfamiliar with linear models, the references in section 4.5
will be helpful.

For a Bayesian linear model, we will also need to define priors for the two parameters of our
model. Let’s say that we know for sure that the time it takes to press a key will be positive and
lower than a minute (or 60000 ms), but we don’t want to make a commitment regarding which
values are more likely. We encode what we know about the noise in the task in σ: we know
that this parameter must be positive and we’ll assume that any value below 2000 ms is
equally likely. These priors are in general strongly discouraged: A flat (or very wide) prior will
almost never be the best approximation of what we know. Prior specification will be discussed
in detail in chapter 6.

In this case, even if we know very little about the task, we know that pressing the spacebar will
take at most a couple of seconds. We’ll use flat priors in this section for pedagogical
purposes; the next sections will already show more realistic uses of priors.

μ ∼ Uniform(0, 60000)
σ ∼ Uniform(0, 2000)    (3.4)

First, load the data frame df_spacebar from the bcogsci package:


data("df_spacebar")
df_spacebar
## # A tibble: 361 × 2
## t trial
## <int> <int>
## 1 141 1
## 2 138 2

## 3 128 3
## # … with 358 more rows

It is always a good idea to plot the data before doing anything else; see Figure 3.2. As we
suspected, the data look a bit skewed, but we ignore this for the moment.


ggplot(df_spacebar, aes(t)) +
geom_density() +
xlab("Finger tapping times") +
ggtitle("Button-press data")

[Figure 3.2 here: density plot titled “Button-press data” (x-axis: finger tapping times, roughly 100–400 ms; y-axis: density).]

FIGURE 3.2: Visualizing the button-press data.


3.2.1.1 Specifying the model in brms

Fit the model defined by equations (3.2) and (3.4) with brms in the following way.


fit_press <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(uniform(0, 60000), class = Intercept, lb = 0, ub = 60000),
    prior(uniform(0, 2000), class = sigma, lb = 0, ub = 2000)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000
)

The brms code has some differences from a model fit with lm . At this beginning stage, we’ll
focus on the following options:

1. The term family = gaussian() makes it explicit that the underlying likelihood function is
a normal distribution (Gaussian and normal are synonyms). This detail is implicit in lm .
Other linking functions are possible, exactly as in the glm function. The default for brms
that corresponds to the lm function is gaussian() .
2. The term prior takes as argument a vector of priors. Although this specification of priors
is optional, the researcher should always explicitly specify each prior. Otherwise, brms
will define priors by default, which may or may not be appropriate for the research area. In
cases where the distribution has a restricted coverage, that is, not every value is valid
(e.g., smaller than 0 or larger than 60000 are not valid for the intercept), we need to set
lower and upper boundaries with lb and ub .9
3. The term chains refers to the number of independent runs for sampling (by default four).
4. The term iter refers to the number of iterations that the sampler makes to sample from
the posterior distribution of each parameter (by default 2000 ).
5. The term warmup refers to the number of iterations from the start of sampling that are
eventually discarded (by default half of iter ).
The last three options, chains , iter , warmup determine the behavior of the sampling
algorithm: the No-U-Turn Sampler (NUTS; Hoffman and Gelman 2014) extension of
Hamiltonian Monte Carlo (Duane et al. 1987; Neal 2011). We will discuss sampling in a bit
more depth in chapter 10, but the basic process is explained next.

3.2.1.2 Sampling and convergence in a nutshell

The code specification starts four chains independently from each other. Each chain
“searches” for samples of the posterior distribution in a multidimensional space, where each
parameter corresponds to a dimension. The shape of this space is determined by the priors
and the likelihood. The chains start at random locations, and in each iteration they take one
sample each. When sampling begins, the samples may or may not belong to the posterior
distributions of the parameters. Eventually, the chains end up in the vicinity of the posterior
distribution, and from that point onward the samples will belong to the posterior.

Thus, when sampling begins, the samples from the different chains can be far from each
other, but at some point they will “converge” and start delivering samples from the posterior
distributions. Although there are no guarantees that the number of iterations we run the chains
for will be sufficient for obtaining samples from the posteriors, the default values of brms (and
Stan) are in many cases sufficient to achieve convergence. When the default number of
iterations do not suffice, brms (actually, Stan) will print out warnings, with suggestions for
fixing the convergence problems. If all the chains converge to the same distribution, by
removing the “warmup” samples, we make sure that we do not get samples from the initial
path to the posterior distributions. The default in brms is that half of the total number of
iterations in each chain (which default to 2000) will count as “warmup”. So, if one runs a model
with four chains and the default number of iterations, we will obtain a total of 4000 samples
from the four chains, after discarding the warmup iterations.

Figure 3.3(a) shows the path of the chains from the warmup phase onwards. Such plots are
called trace plots. The warmup is shown only for illustration purposes; generally, one should
only inspect the chains after the point where convergence has (presumably) been achieved
(i.e., after the dashed line). After convergence has occurred, a visual diagnostic check is that
chains should look like a “fat hairy caterpillar.” Compare the trace plot of our model in Figure
3.3(a) with the trace plot of a model that did not converge, shown in Figure 3.3(b).

Trace plots are not always diagnostic as regards convergence. The trace plots might look fine,
but the model may not have converged. Fortunately, Stan automatically runs several
diagnostics with the information from the chains, and if there are no warnings after fitting the
model and the trace plots look fine, we can be reasonably sure that the model converged, and
assume that our samples are from the true posterior distribution. However, it is necessary to
run more than one chain (preferably four), with a couple of thousands of iterations (at least) in
order for the diagnostics to work.

[Figure 3.3 here: two sets of trace plots (panels a and b), each showing the sample values of mu and sigma across iteration number for four chains, with the warm-up phase marked. Panel a: a model that converged; panel b: a model that did not converge.]

FIGURE 3.3: a. Trace plots of our brms model for the button-pressing data. All the chains start from initial values above 200 and are outside of the plot. b. Trace plots of a model that did not converge. We can diagnose the non-convergence by observing that the chains do not overlap—each chain seems to be sampling from a different distribution.

3.2.1.3 Output of brms

Once the model has been fit (and assuming that we got no warning messages about
convergence problems), we can print out the samples of the posterior distributions of each of
the parameters using as_draws_df() (which stores metadata about the chains) or with
as.data.frame() :

as_draws_df(fit_press) %>%
  head(3)

## # A draws_df: 3 iterations, 1 chains, and 4 variables
##   b_Intercept sigma lprior  lp__
## 1         167    27    -19 -1686
## 2         171    24    -19 -1684
## 3         171    24    -19 -1684
## # ... hidden reserved variables {'.chain', '.iteration', '.draw'}

The term b_Intercept in the brms output corresponds to our μ. We can ignore the last two columns: lp__ is not really part of the posterior; it’s the log-density of the unnormalized posterior for each iteration (lp__ will be discussed later in Box 10.1). lprior is the log-density of the (joint) prior distribution, and it is there for compatibility with the package priorsense (https://github.com/n-kall/priorsense).

Plot the density and trace plot of each parameter after the warmup (Figure 3.4).
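Assuming the default plot() method for brms fits, this is a one-liner:

# Density and trace plots of the post-warmup draws of all the parameters:
plot(fit_press)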

[Figure 3.4 here: density plots (left) and trace plots (right) of b_Intercept and sigma for four chains, post-warmup iterations.]

FIGURE 3.4: Density and trace plots of our brms model for the button-pressing data.

Printing the object with the brms fit provides a nice, if somewhat verbose, summary:

fit_press
# posterior_summary(fit_press) is also useful

##  Family: gaussian
##   Links: mu = identity; sigma = identity
## Formula: t ~ 1
##    Data: df_spacebar (Number of observations: 361)
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   168.63      1.26   166.18   171.06 1.00     3234     2568
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    25.02      0.95    23.25    26.95 1.00     2809     2222
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The Estimate is just the mean of the posterior samples, Est.Error is the standard deviation
of the posterior and the CIs mark the lower and upper bounds of the 95% credible intervals (to
distinguish credible intervals from frequentist confidence intervals, the former will be
abbreviated as CrIs):


as_draws_df(fit_press)$b_Intercept %>% mean()

## [1] 169

as_draws_df(fit_press)$b_Intercept %>% sd()

## [1] 1.26


as_draws_df(fit_press)$b_Intercept %>%
quantile(c(0.025, .975))

## 2.5% 97.5%
## 166 171

Furthermore, the summary provides the Rhat, Bulk_ESS, and Tail_ESS of each parameter. R-hat compares the between- and within-chain estimates of each parameter. R-hat is larger than 1 when chains have not mixed well; one can only rely on the model if the R-hats for all the parameters are less than 1.05 (warnings will appear otherwise). Bulk ESS (bulk effective sample size) is a measure of sampling efficiency in the bulk of the posterior distribution, that is, the effective sample size for the mean and median estimates, whereas tail ESS (tail effective sample size) indicates the sampling efficiency at the tails of the distribution, that is, the minimum of the effective sample sizes for the 5% and 95% quantiles. The effective sample size is generally smaller than the number of post-warmup samples, because the samples from the chains are not independent (they are correlated to some extent), and carry less information about the posterior distribution in comparison to independent samples. In some cases, however, such as with the Intercept here, the effective sample size is actually larger than the number of post-warmup samples. This can happen for parameters with a normally distributed posterior (in the unconstrained space, see Box (thm:target)) and low dependence on the other parameters (Vehtari, Gelman, Simpson, Carpenter, and Bürkner 2019). A very low effective sample size indicates sampling problems (and is accompanied by warnings); it generally appears together with chains that are not properly mixed. As a rule of thumb, Vehtari, Gelman, Simpson, Carpenter, and Bürkner (2019) suggest that a minimum effective sample size of 400 is required for statistical summaries.
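These diagnostics can also be extracted programmatically; a sketch, assuming the posterior package (the draws objects returned by as_draws_df() are designed to work with it):

library(posterior)
# Per-parameter convergence diagnostics from the posterior draws:
as_draws_df(fit_press) %>%
  summarise_draws("rhat", "ess_bulk", "ess_tail")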

We see that we can fit our model without problems, and we get some posterior distributions
for our parameters. However, we should ask ourselves the following questions:

1. What information are the priors encoding? Do the priors make sense?
2. Does the likelihood assumed in the model make sense for the data?
We’ll try to answer these questions by looking at the prior and posterior predictive
distributions, and by doing sensitivity analyses. This is explained in the following sections.

3.3 Prior predictive distribution

We had defined the following priors for our linear model:

μ ∼ Uniform(0, 60000)
σ ∼ Uniform(0, 2000)    (3.5)

These priors encode assumptions about the kind of data we would expect to see in a future
study. To understand these assumptions, we are going to generate data from the model; such
data, which is generated entirely by the prior distributions, is called the prior predictive
distribution. Generating prior predictive distributions repeatedly helps us to check whether the
priors make sense. What we want to know here is, do the priors generate realistic-looking
data?

Formally, we want to know the density p(⋅) of data points ypred_1, …, ypred_N from a data set ypred of length N, given a vector of priors Θ and our likelihood p(⋅|Θ); in our example, Θ = ⟨μ, σ⟩. The prior predictive density is written as follows:

p(ypred) = p(ypred_1, …, ypred_N) = ∫ p(ypred_1|Θ) ⋅ p(ypred_2|Θ) ⋯ p(ypred_N|Θ) ⋅ p(Θ) dΘ

In essence, the vector of parameters is integrated out. This yields the probability distribution of
possible data sets given the priors and the likelihood, before any observations are taken into
account.

The integration can be carried out computationally by generating samples from the prior
distribution.

Here is one way to generate prior predictive distributions:

Repeat the following many times:

1. Take one sample from each of the priors.


2. Plug those samples into the probability density/mass function used as the likelihood in the model to generate a data set ypred_1, …, ypred_N.

Each sample is an imaginary or potential data set.

Create a function that does this:



normal_predictive_distribution <-
  function(mu_samples, sigma_samples, N_obs) {
    # empty data frame with headers:
    df_pred <- tibble(trialn = numeric(0),
                      t_pred = numeric(0),
                      iter = numeric(0))
    # i iterates from 1 to the length of mu_samples,
    # which we assume is identical to
    # the length of the sigma_samples:
    for (i in seq_along(mu_samples)) {
      mu <- mu_samples[i]
      sigma <- sigma_samples[i]
      df_pred <- bind_rows(df_pred,
                           tibble(trialn = seq_len(N_obs), # 1, 2,... N_obs
                                  t_pred = rnorm(N_obs, mu, sigma),
                                  iter = i))
    }
    df_pred
  }

The following code produces 1000 samples of the prior predictive distribution of the model
that we defined in section 3.2.1. This means that it will produce 361000 predicted values (361
predicted observations for each of the 1000 simulations). Although this approach works, it’s
quite slow (a couple of seconds). See Box 3.1 for a more efficient version of this function.
Section 3.7.2 will show that it’s possible to use brms to sample from the priors, ignoring the
t in the data by setting sample_prior = "only" . However, since brms still depends on

Stan’s sampler, which uses Hamiltonian Monte Carlo, the prior sampling process can also fail
to converge, especially when one uses very uninformative priors, like the ones used in this
example. In contrast, our function above, which uses rnorm() , cannot have convergence
issues and will always produce multiple sets of prior predictive data ypred_1, …, ypred_N.

N_samples <- 1000
N_obs <- nrow(df_spacebar)
mu_samples <- runif(N_samples, 0, 60000)
sigma_samples <- runif(N_samples, 0, 2000)
tic()
prior_pred <- normal_predictive_distribution(
  mu_samples = mu_samples,
  sigma_samples = sigma_samples,
  N_obs = N_obs
)
toc()
## 2.48 sec elapsed


prior_pred

## # A tibble: 361,000 × 3
##   trialn t_pred  iter
##    <dbl>  <dbl> <dbl>
## 1      1 16710.     1
## 2      2 16686.     1
## 3      3 17245.     1
## # … with 360,997 more rows

Box 3.1 A more efficient function for generating prior predictive distribution

A more efficient function can be created in the following way using the map2_dfr()
function from the purrr package. This function yields an approximately 10-fold increase
in speed. Although the distributions should be the same with both functions, the specific
numbers in the tables won’t be, due to the randomness in the process of sampling.
The purrr function map2_dfr() (which works similarly to the base R function lapply()
and Map() ) essentially runs a for-loop, and builds a data frame with the output. It iterates
over the values of two vectors (or lists) simultaneously, here, mu_samples and
sigma_samples and, in each iteration, it applies a function to each value of the two

vectors, here, mu and sigma . The output of each function is a data frame (or tibble in
this case) with N_obs observations which is bound in a larger data frame at the end of
the loop. Each of these data frames bound together represents an iteration in the
simulation, and we identify the iterations by setting .id = "iter" .

Although this method for generating prior predictive distributions is a bit involved, it
presents an advantage in comparison to the more straightforward use of predict() (or
posterior_predict() , which can also generate prior predictions) together with setting
sample_prior = "only" in the brms model (as we will do in section 3.7.2). Namely, here

we don’t depend on Stan’s sampler, and that means that no matter the number of
iterations in our simulation or how uninformative our priors, there will never be any
convergence problems.

library(purrr)
# Define the function:
normal_predictive_distribution <- function(mu_samples,
                                           sigma_samples,
                                           N_obs) {
  map2_dfr(mu_samples, sigma_samples, function(mu, sigma) {
    tibble(trialn = seq_len(N_obs),
           t_pred = rnorm(N_obs, mu, sigma))
  }, .id = "iter") %>%
    # .id is always a string and
    # needs to be converted to a number
    mutate(iter = as.numeric(iter))
}
# Test the timing:
tic()
prior_pred <- normal_predictive_distribution(
  mu_samples = mu_samples,
  sigma_samples = sigma_samples,
  N_obs = N_obs
)
toc()
## 0.596 sec elapsed

Figure 3.5 shows the first 18 samples of the prior predictive distribution (i.e., 18 independently
generated prior predicted data sets) with the code below.

prior_pred %>%
  filter(iter <= 18) %>%
  ggplot(aes(t_pred)) +
  geom_histogram(aes(y = after_stat(density))) +
  xlab("predicted t (ms)") +
  theme(axis.text.x =
          element_text(angle = 40, vjust = 1, hjust = 1, size = 14)) +
  scale_y_continuous(limits = c(0, 0.0005),
                     breaks = c(0, 0.00025, 0.0005),
                     name = "density") +
  facet_wrap(~iter, ncol = 3)

[Figure 3.5 here: a grid of 18 histograms (x-axis: predicted t in ms, up to 60000; y-axis: density), one panel per prior predictive data set.]

FIGURE 3.5: Eighteen samples from the prior predictive distribution of the model defined in
section 3.2.1.

The prior predictive distribution in Figure 3.5 shows prior data sets that are not realistic: apart from the fact that the simulated finger tapping time distributions are symmetrical (we know from prior experience with such data that they are generally right-skewed), some data sets contain finger tapping times that are unrealistically long. Worse yet, if we inspect enough samples from the prior predicted data, it becomes clear that a few data sets even contain negative finger tapping times.

We can also look at the distribution of summary statistics in the prior predictive data. Even if
we don’t know beforehand what the data should look like, it’s very likely that we have some
expectations for possible mean, minimum, or maximum values. For example, in the button-
pressing example, it seems reasonable to assume that average finger tapping times are between 200 and 600 ms; finger tapping times are very unlikely to be below 50 ms (given the delays in keyboards), and even long lapses of attention won’t be greater than a couple of seconds.10 Three distributions of summary statistics are shown in Figure 3.6.

[Figure 3.6 here: three density panels (mean_t, min_t, max_t) of the prior predictive summary statistics, over t from 0 to 60000 ms.]

FIGURE 3.6: The prior predictive distributions of the mean, minimum, and maximum values of
the button-pressing model defined in section 3.2.1.
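The summary statistics underlying Figure 3.6 can be computed by summarizing the prior predictive data by iteration; a sketch using dplyr (the plotting code is omitted):

# One mean, minimum, and maximum per simulated data set:
prior_pred %>%
  group_by(iter) %>%
  summarize(mean_t = mean(t_pred),
            min_t = min(t_pred),
            max_t = max(t_pred))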

Figure 3.6 shows that we used much less prior information than what we could have: Our
priors were encoding the information that any mean between 0 and 60000 ms is equally likely.
It seems clear that a value close to 0 or to 60000 ms would be extremely surprising. This wide
range of mean values occurs because of the uniform prior on μ . Similarly, maximum values
are quite “uniform”, spanning a much wider range than what one would expect. Finally, in the
distribution of minimum values, negative finger tapping times occur. This might seem
surprising (our prior for μ excluded negative values), but the reason that negative values
appear is that the prior is interpreted together with the likelihood (Gelman, Simpson, and
Betancourt 2017), and the likelihood is a normal distribution, which will allow for negative
samples even if the location parameter μ has a positive value.

To summarize the above discussion, the priors used in the example are clearly not very
realistic given what we might know about finger tapping times for such a button pressing task.
This raises the question: what priors should we have chosen? In the next section, we consider
this question.

3.4 The influence of priors: sensitivity analysis

For most cases that we will encounter in this book, there are four main classes of priors that
we can choose from. In the Bayesian community, there is no fixed nomenclature for classifying
different kinds of priors. For this book, we have chosen specific names for each type of prior,
but this is just a convention that we follow for consistency. There are also other classes of
prior that we do not discuss in this book. An example is improper priors such as
Uniform(−∞, +∞) , which are not proper probability distributions because the area under
the curve does not sum to 1.

When thinking about priors, the reader should not get hung up on what precisely the name is
for a particular type of prior; they should rather focus on what that prior means in the context
of the research problem.

3.4.1 Flat, uninformative priors

One option is to choose priors that are as uninformative as possible. The idea behind this
approach is to let the data “speak for itself” and to not bias the statistical inference with
“subjective” priors. There are several issues with this approach: First, the prior is as subjective
as the likelihood, and in fact, different choices of likelihood might have a much stronger impact
on the posterior than different choices of priors. Second, uninformative priors are in general
unrealistic because they give equal weight to all values within the support of the prior
distribution, ignoring the fact that usually there is some minimal information about the
parameters of interest. Usually, at the very least, the order of magnitude is known (response times or finger tapping times will be in milliseconds and not days, EEG signals will be on the order of microvolts and not volts, etc.). Third, uninformative priors make the sampling slower and might
lead to convergence problems. Unless there is a large amount of data, it would be wise to
avoid such priors. Fourth, it is not always clear which parameterization of a given distribution
the flat priors should be assigned to. For example, the Normal distribution is sometimes defined based on its standard deviation (σ), variance (σ^2), or precision (1/σ^2): a flat prior for the standard deviation is not flat for the precision of the distribution. Although it is sometimes possible to find an uninformative prior that is invariant under a change of parameters (also called Jeffreys priors; Jaynes 2003, sec. 6.15; Jeffreys 1939, Chapter 3), this is not always the
case. Finally, if Bayes factors need to be computed, uninformative priors can lead to very
misleading conclusions (chapter 15).

In the button-pressing example discussed in this chapter, an example of a flat, uninformative prior would be μ ∼ Uniform(−10^{20}, 10^{20}). On the millisecond scale, this is a very strange
) . On the millisecond scale, this is a very strange
prior to use for a parameter representing mean button-pressing time: it allows for impossibly
large positive values, and it also allows negative button-pressing times, which is of course
impossible. It is technically possible to use such a prior, but it wouldn’t make much sense.

3.4.2 Regularizing priors

If there does not exist much prior information (and if this information cannot be worked out
through reasoning about the problem), and there is enough data (what “enough” means here
will presently become clear when we look at specific examples), it is fine to use regularizing
priors. These are priors that down-weight extreme values (that is, they provide regularization),
they are usually not very informative, and mostly let the likelihood dominate in determining the
posteriors. These priors are theory-neutral; that is, they usually do not bias the parameters to
values supported by any prior belief or theory. The idea behind this type of prior is to help to
stabilize computation. These priors are sometimes called weakly informative or mildly
informative priors in the Bayesian literature. For many applications, they perform well, but, as discussed in chapter 15, they tend to be problematic if Bayes factors need to be computed.

In the button-pressing example, an example of a regularizing prior would be μ ∼ Normal+(0, 1000). This is a Normal distribution prior truncated at 0 ms, which allows a relatively constrained range of positive values for button-pressing times (roughly, up to 2000 ms or so). This is a regularizing prior because it rules out negative button-pressing times and down-weights extreme values over 2000 ms.
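The claim that this prior allows values roughly up to 2000 ms can be checked directly: for a normal distribution with location 0 truncated at zero (a half-normal), P(X < x) = 2Φ(x/σ) − 1, so its 95% quantile equals the 97.5% quantile of the untruncated distribution:

# 95% of the prior mass of Normal+(0, 1000) lies below:
qnorm(0.975, mean = 0, sd = 1000)  # roughly 1960 ms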

3.4.3 Principled priors

The idea here is to have priors that encode all (or most of) the theory-neutral information that
the researcher has. Since one generally knows what one’s data do and do not look like, it is
possible to build priors that truly reflect the properties of potential data sets, using prior
predictive checks. In this book, many examples of this class of priors will come up.

In the button-pressing data, an example of a principled prior would be μ ∼ Normal+(250, 100). This prior is not overly restrictive, but represents a guess about plausible button-pressing times. Prior predictive checks using principled priors should produce realistic distributions of the dependent variable.

3.4.4 Informative priors

There are cases where a lot of prior knowledge exists. In general, unless there are very good
reasons for having relatively informative priors (see chapter 15), it is not a good idea to let the
priors have too much influence on the posterior. An example where informative priors would
be important is when investigating a language-impaired population from which we can’t get
many subjects, but a lot of previously published papers exist on the research topic.

In the button-pressing data, an informative prior could be based on a meta-analysis of previously published or existing data, or the result of prior elicitation from an expert (or multiple experts) on the topic under investigation. An example of an informative prior would be μ ∼ Normal+(200, 20). This prior will have some influence on the posterior for μ, especially when one has relatively sparse data.

These four options constitute a continuum. The uniform prior from the last model (section
3.2.1) falls between flat, uninformative and regularizing priors. In practical data analysis
situations, we are mostly going to choose priors that fall between regularizing and principled.
Informative priors, in the sense defined above, will be used only relatively rarely; but they
become more important to consider when doing Bayes factor analyses (chapter 15).

3.5 Revisiting the button-pressing example with different priors

What would happen if even wider priors were used for the model defined previously (in section 3.2.1)? Suppose that every mean between −10^{6} and 10^{6} ms is assumed to be equally likely. This prior is clearly unrealistic and actually makes no sense at all: we are not expecting negative finger tapping times. Regarding the standard deviation, one could assume that any value between 0 and 10^{6} is equally likely.11 The likelihood remains unchanged.

μ ∼ Uniform(−10^{6}, 10^{6})
σ ∼ Uniform(0, 10^{6})    (3.6)


# The default settings are used when they are not set explicitly:
# 4 chains, with half of the iterations (set as 3000) as warmup.
fit_press_unif <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(uniform(-10^6, 10^6),
          class = Intercept,
          lb = -10^6,
          ub = 10^6),
    prior(uniform(0, 10^6),
          class = sigma,
          lb = 0,
          ub = 10^6)
  ),
  iter = 3000,
  control = list(adapt_delta = .99,
                 max_treedepth = 15)
)

Even with these extremely unrealistic priors, which require us to change the adapt_delta and
max_treedepth default values to achieve convergence, the output of the model is virtually

identical to the previous one (see Figure 3.7).


fit_press_unif

## ...
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   168.65      1.34   166.02   171.28 1.00     3939     3345
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    25.00      0.98    23.13    27.04 1.01      568      470
##
## ...

[Figure 3.7 here: overlaid posterior densities of b_Intercept and sigma (x-axis: finger tapping times in ms) for the models fit_press and fit_press_unif.]

FIGURE 3.7: Comparison of the posterior distributions from the model with extremely
unrealistic priors, fit_press_unif , against the previous model with more “realistically”
bounded uniform distributions (but still not recommended), fit_press .
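A plot like Figure 3.7 can be put together by binding the posterior draws of the two models; a sketch for the intercept (the sigma panel is analogous):

draws_both <- bind_rows(
  as_draws_df(fit_press) %>% mutate(model = "fit_press"),
  as_draws_df(fit_press_unif) %>% mutate(model = "fit_press_unif")
)
ggplot(draws_both, aes(b_Intercept, linetype = model)) +
  geom_density() +
  xlab("Finger tapping times (ms)")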

Next, consider what happens if very informative priors are used. Assume that mean values
very close to 400 ms are the most likely, and that the standard deviation of the finger tapping
times is very close to 100 ms. Given that this is a model of button-pressing times, such an
informative prior seems wrong—200 ms seems like a more realistic mean button-pressing
time, not 400 ms. You can check this by doing an experiment yourself and looking at the
recorded times; a software like Linger (http://tedlab.mit.edu/~dr/Linger/) makes it easy to set
up such an experiment.

The Normal+ notation indicates a normal distribution truncated at zero such that only positive
values are allowed (Box 4.1 discusses this type of distribution in detail). Even though the prior
for the standard deviation is restricted to be positive, we are not required to add lb = 0 to
the prior, and it is automatically taken into account by brms .

μ ∼ Normal(400, 10)
σ ∼ Normal+(100, 10)    (3.7)


fit_press_inf <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(normal(400, 10), class = Intercept),
    # `brms` knows that SDs need to be bounded
    # to exclude values below zero:
    prior(normal(100, 10), class = sigma)
  )
)

Despite these unrealistic but informative priors, the likelihood mostly dominates and the new
posterior means and credible intervals are just a couple of milliseconds away from the
previous estimates:


fit_press_inf

## ...
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   172.89      1.38   170.22   175.66 1.00     2452     2659
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    26.05      1.03    24.08    28.14 1.00     2791     2418
##
## ...

As a final example of a sensitivity analysis, choose some principled priors. Assuming that we
have some prior experience with previous similar experiments, suppose the mean reaction
time is expected to be around 200 ms, with a 95% probability of the mean ranging from 0 to
400 ms. This uncertainty is perhaps unreasonably large, but one might want to allow a bit
more uncertainty than one really thinks is reasonable (this kind of conservativity in allowing
somewhat more uncertainty is sometimes called Cromwell’s rule in Bayesian statistics; see
O’Hagan and Forster 2004, sec. 3.19). In such a case, one can decide on the prior
Normal(200, 100) . Given that the experiment involves only one subject and the task is very
simple, one might not expect the residual standard deviation σ to be very large: as an example, one can settle on a location of 50 ms for a truncated normal distribution, but still allow for relatively large uncertainty: Normal+(50, 50). The prior specifications are summarized below.

μ ∼ Normal(200, 100)
σ ∼ Normal+(50, 50)
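One can quickly verify that Normal(200, 100) indeed places about 95% of its mass between 0 and 400 ms:

qnorm(c(0.025, 0.975), mean = 200, sd = 100)  # approximately 4 and 396 ms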

Why are these priors principled? The designation “principled” here largely depends on our
domain knowledge. Chapter 6 discusses how one can use domain knowledge when specifying
priors.

One can achieve a better understanding of what a particular set of priors imply by visualizing
the priors graphically, and carrying out prior predictive checks. These steps are skipped here,
but these issues will be discussed in detail in chapters 6 and 7. These chapters will give more
detailed information about choosing priors and on developing a principled workflow for
Bayesian data analysis.

fit_press_prin <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(normal(200, 100), class = Intercept),
    prior(normal(50, 50), class = sigma)
  )
)

The new estimates are virtually the same as before:


fit_press_prin

## ...
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   168.70      1.32   166.07   171.24 1.00     3533     2548
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    25.01      0.94    23.24    26.94 1.00     3171     2484
##
## ...

The above examples of using different priors should not be misunderstood to mean that priors
never matter. When there is enough data, the likelihood will dominate in determining the
posterior distributions. What constitutes “enough” data is also a function of the complexity of
the model; as a general rule, more complex models require more data.

Even in cases where there is enough data and the likelihood dominates in determining the posteriors, regularizing, principled priors (i.e., priors that are more consistent with our a priori beliefs about the data) will in general speed up model convergence.

In order to determine the extent to which the posterior is influenced by the priors, it is a good
practice to carry out a sensitivity analysis: try different priors and either verify that the posterior
doesn’t change drastically, or report how the posterior is affected by some specific priors (for
examples from psycholinguistics, see Vasishth et al. 2013; Vasishth and Engelmann 2022).
Chapter 15 will demonstrate that sensitivity analysis becomes crucial for reporting Bayes
factors; even in cases where the choice of priors does not affect the posterior distribution, it
generally affects the Bayes factor.

3.6 Posterior predictive distribution

The posterior predictive distribution is a collection of data sets generated from the model (the
likelihood and the priors). Having obtained the posterior distributions of the parameters after
taking into account the data, the posterior distributions can be used to generate future data
from the model. In other words, given the posterior distributions of the parameters of the
model, the posterior predictive distribution gives us some indication of what future data might
look like.

Once the posterior distributions p(Θ ∣ y) are available, the predictions based on these
distributions can be generated by integrating out the parameters:

p(ypred ∣ y) = ∫_Θ p(ypred, Θ ∣ y) dΘ = ∫_Θ p(ypred ∣ Θ, y) p(Θ ∣ y) dΘ

Assuming that past and future observations are conditionally independent given Θ , i.e.,
p(ypred ∣ Θ, y) = p(ypred ∣ Θ) , the above equation can be written as:

p(ypred ∣ y) = ∫_Θ p(ypred ∣ Θ) p(Θ ∣ y) dΘ    (3.8)

In Equation (3.8), we are conditioning ypred only on y; we do not condition on what we don’t know (Θ), since the unknown parameters have been integrated out. This posterior predictive
distribution has important differences from predictions obtained with the frequentist approach.
The frequentist approach gives a point estimate of each predicted observation given the
maximum likelihood estimate of Θ (a point value), whereas the Bayesian approach gives a
distribution of values for each predicted observation. As with the prior predictive distribution,
the integration can be carried out computationally by generating samples from the posterior
predictive distribution. The same function that we created before,
normal_predictive_distribution() , can be used here. The only difference is that instead of

sampling mu and sigma from the priors, the samples come from the posterior.

N_obs <- nrow(df_spacebar)
mu_samples <- as_draws_df(fit_press)$b_Intercept
sigma_samples <- as_draws_df(fit_press)$sigma
normal_predictive_distribution(
  mu_samples = mu_samples,
  sigma_samples = sigma_samples,
  N_obs = N_obs
)

## # A tibble: 1,444,000 × 3
##    iter trialn t_pred
##   <dbl>  <int>  <dbl>
## 1     1      1   132.
## 2     1      2   128.
## 3     1      3   174.
## # … with 1,443,997 more rows

The brms function posterior_predict() is a convenient function that delivers samples from
the posterior predictive distribution. The command posterior_predict(fit_press) yields the
predicted finger tapping times in a matrix, with the samples as rows and the observations
(data points) as columns. (Bear in mind that if a model is fit with sample_prior = "only",
the dependent variable is ignored and posterior_predict() will yield samples from the prior
predictive distribution.)
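
For example (a small sketch, using the fit_press model from earlier in this chapter):

post_pred <- posterior_predict(fit_press)
## One row per posterior draw, one column per observation;
## with the settings used here: 4000 draws by 361 observations.
dim(post_pred)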

The posterior predictive distribution can be used to examine the “descriptive adequacy” of the
model under consideration (Gelman et al. 2014, Chapter 6; Shiffrin et al. 2008). Examining the
posterior predictive distribution to establish descriptive adequacy is called posterior predictive
checks. The goal here is to establish that the posterior predictive data look more or less
similar to the observed data. Achieving descriptive adequacy means that the current data
could have been generated by the model. Although passing a test of descriptive adequacy is
not strong evidence in favor of a model, a major failure in descriptive adequacy can be
interpreted as strong evidence against a model (Shiffrin et al. 2008). For this reason,
comparing the descriptive adequacy of different models is not enough to differentiate between
their relative performance. When doing model comparison, it is important to consider the
criteria that Roberts and Pashler (2000) define. Although Roberts and Pashler (2000) are
more interested in process models and not necessarily Bayesian models, their criteria are
important for any kind of model comparison. Their main point is that it is not enough to have a
good fit to the data for a model to be convincing. One should check that the range of
predictions that the model makes is reasonably constrained; if a model can capture any
possible outcome, then the model fit to a particular data set is not so informative. In the
Bayesian modeling context, although posterior predictive checking is important, it is only a
sanity check to assess whether the model behavior is reasonable (for more on this point, see
chapter 7).

In many cases, one can simply use the plot functions from brms (which act as wrappers for
bayesplot functions). For example, the plotting function pp_check() takes as arguments the
model, the number of predicted data sets, and the type of visualization, and it can display
different visualizations of posterior predictive checks. In these types of plots, the observed data
are plotted as y and predicted data as y_rep. Below, we use pp_check() to investigate how
well the observed distribution of finger tapping times fits our model, based on some number (11
and 100) of samples from the posterior predictive distribution (that is, simulated data sets); see
Figures 3.8 and 3.9.

pp_check(fit_press, ndraws = 11, type = "hist")

FIGURE 3.8: Histograms of eleven samples from the posterior predictive distribution of the
model fit_press (y_rep).
pp_check(fit_press, ndraws = 100, type = "dens_overlay")

FIGURE 3.9: A posterior predictive check that shows the fit of the model fit_press in
comparison to data sets from the posterior predictive distribution using an overlay of density
plots.

The data are slightly skewed and contain no values smaller than 100 ms, but the predictive
distributions are centered and symmetrical; see Figures 3.8 and 3.9. This posterior predictive
check shows a slight mismatch between the observed and predicted data. Can we build a
better model? We'll come back to this issue in the next section.

3.7 The influence of the likelihood

Finger tapping times (and response times in general) are not usually normally distributed. A
more realistic distribution is the log-normal. A random variable (such as time) that is log-
normally distributed takes only positive real values and is right-skewed. Although other
distributions can also produce data with such properties, the log-normal will turn out to be a
pretty reasonable distribution for finger tapping times and response times.
3.7.1 The log-normal likelihood

If y is log-normally distributed, this means that log(y) is normally distributed.12 The log-
normal distribution is also defined using the parameters location, μ , and scale, σ, but these
are on the log ms scale; they correspond to the mean and standard deviation of the logarithm
of the data y, log(y) , which will be normally distributed. Thus, when we model some data y

using the log-normal likelihood, the parameters μ and σ are on a different scale than the data
y . Equation (3.9) shows the relationship between the log-normal and the normal.

log(y) ∼ Normal(μ, σ)
y ∼ LogNormal(μ, σ)    (3.9)

We can obtain samples from the log-normal distribution, using the normal distribution by first
setting an auxiliary variable z, so that z = log(y) . This means that z ∼ Normal(μ, σ) . Then
we can just use exp(z) as samples from the LogNormal(μ, σ) , since
exp(z) = exp(log(y)) = y . The code below produces Figure 3.10.

mu <- 6
sigma <- 0.5
N <- 500000
# Generate N random samples from a log-normal distribution
sl <- rlnorm(N, mu, sigma)
ggplot(tibble(samples = sl), aes(samples)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50) +
  ggtitle("Log-normal distribution\n") +
  coord_cartesian(xlim = c(0, 2000))
# Generate N random samples from a normal distribution,
# and then exponentiate them
sn <- exp(rnorm(N, mu, sigma))
ggplot(tibble(samples = sn), aes(samples)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50) +
  ggtitle("Exponentiated samples from\na normal distribution") +
  coord_cartesian(xlim = c(0, 2000))
FIGURE 3.10: Two log-normal distributions with the same parameters, generated either by
sampling directly from a log-normal distribution or by exponentiating samples from a normal
distribution.

3.7.2 Using a log-normal likelihood to fit data from a single subject pressing a button repeatedly

If we assume that finger tapping times are log-normally distributed, the likelihood function
changes as follows:

t_n ∼ LogNormal(μ, σ)

But now the scale of the priors needs to change! Let’s start with uniform priors for ease of
exposition, even though, as we mentioned earlier, these are not really appropriate here. (More
realistic priors are discussed below.)

μ ∼ Uniform(0, 11)
σ ∼ Uniform(0, 1)    (3.10)
Because the parameters are on a different scale than the dependent variable, their
interpretation changes and is more complex than if we were dealing with a linear model that
assumes a normal likelihood (location and scale do not coincide with the mean and standard
deviation of the log-normal); a short computation illustrating these formulas follows after this
list:

The location μ: In our previous linear model, μ represented the mean (in a normal
distribution, the mean happens to be identical to the median and the mode). But now, the
mean needs to be calculated by computing exp(μ + σ²/2). In other words, in the log-normal,
the mean is dependent on both μ and σ. The median is just exp(μ). Notice that the prior of μ
is not on the milliseconds scale, but rather on the log milliseconds scale.

The scale σ: This is the standard deviation of the normal distribution of log(y). The standard
deviation of a log-normal distribution with location μ and scale σ will be
exp(μ + σ²/2) × √(exp(σ²) − 1). Unlike the normal distribution, the spread of the log-normal
distribution depends on both μ and σ.
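
As a quick illustration of these formulas, take μ = 6 and σ = 0.5 (values also used below); the
median, mean, and standard deviation on the millisecond scale can be computed as follows
(the values in the comments are approximate):

mu <- 6
sigma <- 0.5
exp(mu)                                         # median: approx. 403 ms
exp(mu + sigma^2 / 2)                           # mean: approx. 457 ms
exp(mu + sigma^2 / 2) * sqrt(exp(sigma^2) - 1)  # sd: approx. 244 ms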

To understand the meaning of the priors on the millisecond scale, both the priors and the
likelihood need to be taken into account. Generating a prior predictive distribution will help in
interpreting the priors. This distribution can be generated by just exponentiating the samples
produced by normal_predictive_distribution() (or, alternatively, edit the function by
replacing rnorm() with rlnorm() ).

N_samples <- 1000
N_obs <- nrow(df_spacebar)
mu_samples <- runif(N_samples, 0, 11)
sigma_samples <- runif(N_samples, 0, 1)
prior_pred_ln <- normal_predictive_distribution(
  mu_samples = mu_samples,
  sigma_samples = sigma_samples,
  N_obs = N_obs
) %>%
  mutate(t_pred = exp(t_pred))

Next, plot the distribution of some representative statistics; see Figure 3.11.

FIGURE 3.11: The prior predictive distribution of the mean, median, minimum, and maximum
value of the log-normal model with priors defined in Equation (3.10), that is
μ ∼ Uniform(0, 11) and σ ∼ Uniform(0, 1) . The x-axis is log-transformed.
We cannot generate negative values any more, since exp(x) > 0 for any finite real number x.
These priors might work in the sense that the model might converge, but it would be better to
have regularizing priors for the model. An example of regularizing priors:

μ ∼ Normal(6, 1.5)
σ ∼ Normal+(0, 1)    (3.11)

The prior for σ here is a truncated distribution, and although its location is zero, this is not its
mean. We can calculate its approximate mean from a large number of random samples of the
prior distribution using the function rtnorm() from the package extraDistr . In this function,
we have to set the parameter a = 0 to express the fact that the normal distribution is
truncated from the left at 0. (Box 4.1 discusses this type of distribution in detail):


mean(rtnorm(100000, 0, 1, a = 0))

## [1] 0.795
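
As a sanity check, this simulation-based estimate agrees with the analytic mean of a
half-normal distribution with scale σ, which is σ × √(2/π); here σ = 1:

sqrt(2 / pi)

## [1] 0.7978846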
Even before generating the prior predictive distributions, we can calculate the values within
which we are 95% sure that the expected median of the observations will lie. We do this by
looking at what happens at two standard deviations away from the mean of the prior on μ, that
is, 6 − 2 × 1.5 and 6 + 2 × 1.5, and exponentiating these values:

c(lower = exp(6 - 2 * 1.5),
  higher = exp(6 + 2 * 1.5))

##  lower higher
##   20.1 8103.1

This means that the prior for μ is still not too informative (these are medians; the actual values
generated by the log-normal distribution can be much more spread out). Next, plot the
distribution of some representative statistics of the prior predictive distributions. brms allows
one to sample from the priors, ignoring the observed data t, by setting sample_prior = "only"
in the brm() function.

If we want to use brms to generate prior predictive data in this manner even before we have
any data, we do need (at least for now) to have some non-NA values as the dependent
variable t. Setting sample_prior = "only" will ignore the data, but we still need to add a
data frame in the data specification of the brm() function: in this case, we add a vector of
ones as "data". The family is specified as lognormal(); recall that in the first example, the
family was gaussian().

df_spacebar_ref <- df_spacebar %>%
  mutate(t = rep(1, n()))
fit_prior_press_ln <- brm(t ~ 1,
  data = df_spacebar_ref,
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = sigma)
  ),
  sample_prior = "only",
  control = list(adapt_delta = .9)
)
To avoid warnings when simulating the data, the adapt_delta parameter's default value was
increased from 0.8 to 0.9. Since Stan samples from the prior distributions in the same way that
it samples from the posterior distribution, one should not ignore warnings; always ensure that
the model converged. In that respect, the custom function normal_predictive_distribution()
defined in section 3.3 has the advantage that it will always yield independent samples from the
prior distribution and will not experience any convergence problems, because it relies solely on
the rnorm() function in R.

Plot the prior predictive distribution of means with the following code (the figure is not
produced here, to conserve space). In a prior predictive distribution, we generally want to
ignore the data; this requires setting prefix = "ppd" in pp_check() .

pp_check(fit_prior_press_ln, type = "stat", stat = "mean",
         prefix = "ppd") +
  coord_cartesian(xlim = c(0.001, 300000)) +
  scale_x_continuous("Finger tapping times [ms]",
    trans = "log",
    breaks = c(0.001, 1, 100, 1000, 10000, 100000),
    labels = c("0.001", "1", "100", "1000", "10000", "100000")
  ) +
  ggtitle("Prior predictive distribution of means")
To plot the distributions of minimum and maximum values, just replace mean with min and
max, respectively. The distributions of the three statistics are displayed in Figure 3.12.

p1 <- pp_check(fit_prior_press_ln, type = "stat", stat = "mean",
               prefix = "ppd") +
  coord_cartesian(xlim = c(0.001, 300000)) +
  scale_x_continuous("Finger tapping times [ms]",
    trans = "log",
    breaks = c(0.001, 1, 100, 1000, 10000, 100000),
    labels = c("0.001", "1", "100", "1000", "10000", "100000")
  ) +
  ggtitle("Prior predictive distribution of means")
p2 <- pp_check(fit_prior_press_ln, type = "stat", stat = "min",
               prefix = "ppd") +
  coord_cartesian(xlim = c(0.001, 300000)) +
  scale_x_continuous("Finger tapping times [ms]",
    trans = "log",
    breaks = c(0.001, 1, 100, 1000, 10000, 100000),
    labels = c("0.001", "1", "100", "1000", "10000", "100000")
  ) +
  ggtitle("Prior predictive distribution of minimum values")
p3 <- pp_check(fit_prior_press_ln, type = "stat", stat = "max",
               prefix = "ppd") +
  coord_cartesian(xlim = c(0.001, 300000)) +
  scale_x_continuous("Finger tapping times [ms]",
    trans = "log",
    breaks = c(0.001, 1, 100, 1000, 10000, 100000),
    labels = c("0.001", "1", "100", "1000", "10000", "100000")
  ) +
  ggtitle("Prior predictive distribution of maximum values")
plot_grid(p1, p2, p3, nrow = 3, ncol = 1)

FIGURE 3.12: The prior predictive distributions of the mean, maximum, and minimum values
of the log-normal model with priors defined in equation (3.11). The prior predictive distributions
are labeled ypred . The x-axis shows values back-transformed from the log-scale.
Figure 3.12 shows that the priors used here are still quite uninformative. The tails of the prior
predictive distributions that correspond to our normal priors, shown in Figure 3.12, reach even
more extreme values than the prior predictive distributions generated by the uniform priors
(shown in Figure 3.11). The new priors are still far from representing our prior knowledge. We
could run more iterations of choosing priors and generating prior predictive distributions until
we have priors that generate realistic data. However, given that the bulk of the distributions of
the mean, maximum, and minimum values lies roughly in the correct order of magnitude, these
priors are going to be acceptable. In general, summary statistics (e.g., mean, median, min,
max) can be used to test whether the priors are in a plausible range. This can be done by
defining, for the particular research problem under study, the extreme data that would be very
implausible to ever observe (e.g., reading times at a word longer than one minute) and
choosing priors such that such extreme finger tapping times occur only very rarely in the prior
predictive distribution.
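
A sketch of such a check, using the prior_pred_ln data frame generated above (the one-minute
threshold is purely illustrative):

prior_pred_ln %>%
  group_by(iter) %>%
  summarize(mean_t = mean(t_pred)) %>%
  summarize(p_implausible = mean(mean_t > 60000))

This returns the proportion of simulated data sets whose mean finger tapping time exceeds one
minute; such proportions should be close to zero for reasonable priors.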
Next, fit the model; recall that both the distribution family and prior change in comparison to
the previous example.

fit_press_ln <- brm(t ~ 1,
  data = df_spacebar,
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = sigma)
  )
)

When we look at the summary of the posterior, the parameters are on the log-scale:


fit_press_ln

## ...
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     5.12      0.01     5.10     5.13 1.00     3994     2616
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.13      0.00     0.13     0.15 1.00     3247     2641
##
## ...

If the research goal is to find out how long it takes to press the space bar in milliseconds, we
need to transform the μ (or Intercept in the model) to milliseconds. Because the median of
the log-normal distribution is exp(μ) , the following returns the estimates in milliseconds:


estimate_ms <- exp(as_draws_df(fit_press_ln)$b_Intercept)


To display the mean and 95% credible interval of these samples, type:


c(mean = mean(estimate_ms), quantile(estimate_ms, probs = c(.025, .975)))

##  mean  2.5% 97.5%
##   167   165   169
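
If the mean (rather than the median) in milliseconds is of interest, the formula exp(μ + σ²/2)
given earlier can be applied to the posterior samples; a sketch (given the summary above, the
result will be only slightly larger than the median, at around 168 ms):

draws_ln <- as_draws_df(fit_press_ln)
mean_ms <- exp(draws_ln$b_Intercept + draws_ln$sigma^2 / 2)
c(mean = mean(mean_ms), quantile(mean_ms, probs = c(.025, .975)))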

Next, check whether the predicted data sets look similar to the observed data set. See Figure
3.13; compare this with the earlier Figure 3.9.

pp_check(fit_press_ln, ndraws = 100)

FIGURE 3.13: The posterior predictive distribution of fit_press_ln.

The key question is: Are the posterior predicted data now more similar to the observed data,
compared to the case where we had a Normal likelihood? According to Figure 3.13, it seems
so, but it’s not easy to tell.
Another way to examine the extent to which the predicted data look similar to the observed
data is to look at the distribution of some summary statistic. As with prior predictive
distributions, examine the distribution of representative summary statistics for the data sets
generated by different models. However, in contrast with what occurs with prior predictive
distributions, at this point we have a clear reference, our observations; this means that we can
compare the summary statistics with the observed statistics from our data. We suspect that the
normal distribution will generate finger tapping times that are too fast (since it's symmetrical)
and that the log-normal distribution may capture the long tail better than the normal model.
Based on this supposition, compute the distribution of minimum and maximum values for the
posterior predictive distributions, and compare them with the minimum and maximum values,
respectively, in the data. The function pp_check() implements this when stat is specified as
either "min" or "max", for both fit_press and fit_press_ln; an example is shown below.
The plots are shown in Figures 3.14 and 3.15.


pp_check(fit_press, type = "stat", stat = "min")
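
The remaining panels of Figures 3.14 and 3.15 are generated analogously:

pp_check(fit_press_ln, type = "stat", stat = "min")
pp_check(fit_press, type = "stat", stat = "max")
pp_check(fit_press_ln, type = "stat", stat = "max")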

FIGURE 3.14: The distributions of minimum values in a posterior predictive check, using the
normal and log-normal probability density functions. The minimum in the data is 110 ms.

FIGURE 3.15: The distributions of maximum values in a posterior predictive check using the
normal and log-normal. The maximum in the data is 409 ms.
Figure 3.14 shows that the log-normal likelihood does a slightly better job since the minimum
value is contained in the bulk of the log-normal distribution and in the tail of the normal one.
Figure 3.15 shows that both models are unable to capture the maximum value of the observed
data. One explanation for this is that the log-normal-ish observations in our data are being
generated by the task of pressing as fast as possible, whereas the observations with long
finger tapping times are being generated by lapses of attention. This would mean that two
probability distributions are mixed here; modeling this process involves more complex tools
that we will take up in chapter 19.
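
To build some intuition for this idea, one can simulate such a two-component process; a toy
sketch (all the numbers below are invented for illustration, not estimated from the data):

n <- 1000
lapse <- rbinom(n, size = 1, prob = 0.05)  # roughly 5% attentional lapses
t_mix <- ifelse(lapse == 1,
  rlnorm(n, meanlog = 6, sdlog = 0.2),     # slow lapse process
  rlnorm(n, meanlog = 5.1, sdlog = 0.13))  # fast tapping process
max(t_mix)  # the lapses stretch the right tail far beyond the fast process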

This completes our introduction to brms . We are now ready to learn about more regression
models.

3.8 List of the most important commands

Here is a list of the most important commands we learned in this chapter.

The core brms function for fitting models, for generating prior predictive and posterior
predictive data:

fit_press <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(uniform(0, 60000), class = Intercept, lb = 0, ub = 60000),
    prior(uniform(0, 2000), class = sigma, lb = 0, ub = 2000)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000
  ## uncomment for prior predictive:
  ## sample_prior = "only",
  ## uncomment when dealing with divergent transitions:
  ## control = list(adapt_delta = .9)
)
Extract samples from fitted model:


as_draws_df(fit_press)

Basic plot of posteriors:

plot(fit_press)

Plot prior predictive/posterior predictive data:

## Posterior predictive check:
pp_check(fit_press, ndraws = 100, type = "dens_overlay")
## Plot posterior predictive distribution of statistical summaries:
pp_check(fit_press, ndraws = 100, type = "stat", stat = "mean")
## Plot prior predictive distribution of statistical summaries:
pp_check(fit_press, ndraws = 100, type = "stat", stat = "mean",
         prefix = "ppd")

3.9 Summary

This chapter showed how to fit and interpret a Bayesian model with a normal likelihood. We
looked at the effect of priors by investigating prior predictive distributions and by carrying out a
sensitivity analysis. We also looked at the fit of the posterior, by inspecting the posterior
predictive distribution (which gives us some idea about the descriptive adequacy of the
model). We also showed how to fit a Bayesian model with a log-normal likelihood, and how to
compare the predictive accuracy of different models.
3.10 Further reading

Sampling algorithms are discussed in detail in Gamerman and Lopes (2006). Also helpful are
the sections on sampling from the short open-source book by Bob Carpenter, Probability and
Statistics: a simulation-based introduction (https://github.com/bob-carpenter/prob-stats), and
the sections on sampling algorithms in Lambert (2018) and Lynch (2007). Introductory linear
modeling theory is covered in Dobson and Barnett (2011); more advanced treatments are in
Montgomery, Peck, and Vining (2012) and Seber and Lee (2003). Generalized linear models
are covered in detail in McCullagh and Nelder (2019). The reader may also benefit from our
own freely available online lecture notes on linear modeling: https://github.com/vasishth/LM.

3.11 Exercises

Exercise 3.1 A simple linear model.

a. Fit the model fit_press with just a few iterations, say 50 iterations (set warmup to the
default of 25, and use four chains). Does the model converge?

b. Using normal distributions, choose priors that better represent your assumptions/beliefs
about finger tapping times. To think about a reasonable set of priors for μ and σ, you
should come up with your own subjective assessment of what a reasonable range of
values for μ might be and how much variability there might be. There is no correct
answer here; we'll discuss priors in depth in chapter 6. Fit this model to the data. Do the
posterior distributions change?

Exercise 3.2 Revisiting the button-pressing example with different priors.

a. Can you come up with very informative priors that influence the posterior in a noticeable
way (use normal distributions for priors, not uniform priors)? Again, there are no correct
answers here; you may have to try several different priors before you can noticeably
influence the posterior.
b. Generate prior predictive distributions based on this prior and plot them.

c. Generate posterior predictive distributions based on this prior and plot them.

Exercise 3.3 Posterior predictive checks with a log-normal model.

a. For the log-normal model fit_press_ln , change the prior of σ so that it is a log-normal
distribution with location (μ) of −2 and scale (σ) of 0.5 . What does such a prior imply
about your belief regarding button-pressing times in milliseconds? Is it a good prior?
Generate and plot prior predictive distributions. Do the new estimates change compared
to earlier models when you fit the model?
b. For the log-normal model, what is the mean (rather than the median) time that it takes to
press the space bar, and what is the standard deviation of the finger tapping times in
milliseconds?

Exercise 3.4 A skew normal distribution.

Would it make sense to use a "skew normal distribution" instead of the log-normal? The skew
normal distribution has three parameters: location ξ (the lower-case version of the Greek letter
Ξ, xi), scale ω (omega), and shape α. The distribution is right-skewed if α > 0, left-skewed if
α < 0, and identical to the regular normal distribution if α = 0. For fitting this in brms, one
needs to change the family and set it to skew_normal(), and add a prior of class = alpha
(location remains class = Intercept and scale, class = sigma).

a. Fit this model with a prior that assigns approximately 95% of the prior probability of
alpha to be between 0 and 10.
b. Generate posterior predictive distributions and compare the posterior distribution of
summary statistics of the skew normal with the normal and log-normal.

References

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear
Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.
https://doi.org/10.18637/jss.v067.i01.

Bürkner, Paul-Christian. 2019. brms: Bayesian Regression Models Using "Stan".
https://CRAN.R-project.org/package=brms.

Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael J.
Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A
Probabilistic Programming Language.” Journal of Statistical Software 76 (1). Columbia Univ.,
New York, NY (United States); Harvard Univ., Cambridge, MA (United States).

Dobson, Annette J, and Adrian Barnett. 2011. An Introduction to Generalized Linear Models.
CRC press.

Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. “Hybrid Monte
Carlo.” Physics Letters B 195 (2): 216–22. https://doi.org/10.1016/0370-2693(87)91197-X.

Gamerman, Dani, and Hedibert F Lopes. 2006. Markov chain Monte Carlo: Stochastic
simulation for Bayesian inference. CRC Press.
Ge, Hong, Kai Xu, and Zoubin Ghahramani. 2018. “Turing: A Language for Flexible
Probabilistic Inference.” In Proceedings of Machine Learning Research, edited by Amos
Storkey and Fernando Perez-Cruz, 84:1682–90. Playa Blanca, Lanzarote, Canary Islands:
PMLR. http://proceedings.mlr.press/v84/ge18b.html.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B.
Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC
Press.

Gelman, Andrew, Daniel Simpson, and Michael J. Betancourt. 2017. “The Prior Can Often
Only Be Understood in the Context of the Likelihood.” Entropy 19 (10): 555.
https://doi.org/10.3390/e19100555.

Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2018. “Rstanarm: Bayesian
Applied Regression Modeling via Stan.” http://mc-stan.org/.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively
Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15
(1): 1593–1623. http://dl.acm.org/citation.cfm?id=2627435.2638586.

Hubel, Kerry A, Bruce Reed, E William Yund, Timothy J Herron, and David L Woods. 2013.
“Computerized Measures of Finger Tapping: Effects of Hand Dominance, Age, and Sex.”
Perceptual and Motor Skills 116 (3). SAGE Publications Sage CA: Los Angeles, CA: 929–52.

JASP Team. 2019. “JASP (Version 0.11.1)[Computer software].” https://jasp-stats.org/.

Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge university press.

Jeffreys, Harold. 1939. Theory of Probability. Oxford: Clarendon Press.

Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. London, UK: Sage.

Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of
Statistical Software 63 (1): 1–25.

Luce, R Duncan. 1991. Response Times: Their Role in Inferring Elementary Mental
Organization. Oxford University Press.

Lunn, D.J., A. Thomas, N. Best, and D. Spiegelhalter. 2000. “WinBUGS-A Bayesian Modelling
Framework: Concepts, Structure, and Extensibility.” Statistics and Computing 10 (4). Springer:
325–37.

Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for
Social Scientists. New York, NY: Springer.
McCullagh, Peter, and J.A. Nelder. 2019. Generalized Linear Models. Second Edition. Boca
Raton, Florida: Chapman; Hall/CRC.

Montgomery, D. C., E. A. Peck, and G. G. Vining. 2012. An Introduction to Linear Regression
Analysis. 5th ed. Hoboken, NJ: Wiley.

Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain
Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Taylor
& Francis. https://doi.org/10.1201/b10905-10.

O’Hagan, Antony, and Jonathan Forster. 2004. “Kendall’s Advanced Theory of Statistics, Vol.
2B: Bayesian Inference.” Wiley.

Plummer, Martyn. 2016. “JAGS Version 4.2.0 User Manual.”

Plummer, Martyn. 2022. “Simulation-Based Bayesian Analysis.” Annual Reviews.

Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on
Theory Testing.” Psychological Review 107 (2): 358–67.

Salvatier, John, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. “Probabilistic
Programming in Python Using PyMC3.” PeerJ Computer Science 2 (April). PeerJ: e55.
https://doi.org/10.7717/peerj-cs.55.

Seber, George A. F., and Allen J. Lee. 2003. Linear Regression Analysis. 2nd Edition.
Hoboken, NJ: John Wiley; Sons.

Shiffrin, Richard, Michael D. Lee, Woojae Kim, and Eric-Jan Wagenmakers. 2008. “A Survey
of Model Evaluation Approaches with a Tutorial on Hierarchical Bayesian Methods.” Cognitive
Science: A Multidisciplinary Journal 32 (8): 1248–84.
https://doi.org/10.1080/03640210802414826.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese
Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public
Library of Science: 1–14.

Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive
Process: A Computational Approach. Cambridge, UK: Cambridge University Press.
https://books.google.de/books?id=6KZKzgEACAAJ.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner.
2019. “Rank-Normalization, Folding, and Localization: An Improved R̂ for Assessing
Convergence of MCMC.” arXiv Preprint arXiv:1903.08008.

7. The Python package PyMC3 and the Julia library Turing are recent exceptions since they
are fully integrated into their respective languages.↩

8. We refer to the time it takes for a subject to respond or react to a stimulus as response time
(response and reaction times are often used interchangeably, cf. Luce 1991). In this case,
however, there are no stimuli, and the subject only taps the space bar.↩

9. The problem here is that although the parameter for the intercept is assigned a uniform
distribution bounded between 0 and 60000 ms, the sampler might start sampling from an
initial value outside this range. The sampler can start from an initial value that is outside
the 0-60000 range because the initial value is chosen randomly (unless the user specifies
an initial value explicitly). ↩

10. We’ll see later how to generate prior predictive distributions of statistics such as mean,
minimum, or maximum value in section 3.7.2 using brms and pp_check() .↩

11. Even though, in theory, one could use even wider priors, in practice, these are the widest
priors that achieve convergence.↩

12. More precisely, logₑ(y) or ln(y), but we'll write it as just log().↩

Chapter 4 Bayesian regression models

We generally run experiments because we are interested in the relationship between two or
more variables. A regression will tell us how our dependent variable, also called the response
or outcome variable (e.g., pupil size, response times, accuracy, etc.) is affected by one or
many independent variables, predictors, or explanatory variables. Predictors can be
categorical (e.g., male or female), ordinal (first, second, third, etc.), or continuous (e.g., age).
In this chapter we focus on simple regression models with different likelihood functions.

4.1 A first linear regression: Does attentional load affect pupil size?

Let us look at the effect of cognitive processing on human pupil size to illustrate the use of
Bayesian linear regression models. Although pupil size is mostly related to the amount of light
that reaches the retina or the distance to a perceived object, pupil sizes are also
systematically influenced by cognitive processing: Increased cognitive load leads to an
increase in the pupil size (for a review, see Mathot 2018).

For this example, we’ll use the data from one subject’s pupil size of the control experiment by
Wahn et al. (2016), averaged by trial. The data are available from df_pupil in the package
bcogsci . In this experiment, the subject covertly tracks between zero and five objects among

several randomly moving objects on a computer screen. This task is called multiple object
tracking (or MOT; see Pylyshyn and Storm 1988). First, several objects appear on the screen,
and a subset of the objects are indicated as “targets” at the beginning. Then, the objects start
moving randomly across the screen and become indistinguishable. After several seconds, the
objects stop moving and the subject needs to indicate which objects were the targets. See
Figure 4.1. Our research goal is to examine how the number of moving objects being tracked
(that is, the attentional load) affects pupil size.
FIGURE 4.1: Flow of events in a trial where two objects need to be tracked. Adapted from
Blumberg, Peterson, and Parasuraman (2015); licensed under CC BY 4.0
(https://creativecommons.org/licenses/by/4.0/).

4.1.1 Likelihood and priors

We will model pupil size as normally distributed, because we are not expecting a skew, and
we have no further information available about the distribution of pupil sizes. (Given the units
used here, pupil sizes cannot be of size zero or negative, so we know for sure that this choice
is not exactly right.) For simplicity, assume a linear relationship between load and the pupil
size.

Let’s summarize our assumptions:

1. There is some average pupil size represented by α.
2. The increase of attentional load has a linear relationship with pupil size, determined by β.
3. There is some noise in this process, that is, variability around the true pupil size. This
variability is represented by the scale σ.
4. The noise is normally distributed.

The generative probability density function will be as follows:

p_size_n ∼ Normal(α + c_load_n ⋅ β, σ)

where n indicates the observation number, with n = 1, …, N.


This means that the formula in brms will be p_size ~ 1 + c_load , where 1 represents the
intercept, α , which doesn’t depend on the predictor, and c_load is the predictor that is
multiplied by β. The prefix c_ will generally indicate that a predictor (in this case load) is
centered (i.e., the mean of all the values is subtracted from each value). If load is centered,
the intercept represents the pupil size at the average load in the experiment (because at the
average load, the centered load is zero, yielding α + 0 ⋅ β ). If the load had not been centered
(i.e., starts with no load, then one, two, etc.), then the intercept would represent the pupil size
when there is no load. Although we could fit a frequentist model with
lm(p_size ~ 1 + c_load, data = df_pupil), when we fit a Bayesian model, we have to specify
priors for each of the parameters.
parameters.

For setting plausible priors, some research needs to be done to find some information about
pupil sizes. Although we might know that pupil diameters range between 2 to 4 mm in bright
light to 4 to 8 mm in the dark (Spector 1990), this experiment was conducted with the Eyelink-
II eyetracker which measures the pupils in arbitrary units (Hayes and Petrov 2016). If this is
our first-ever analysis of pupil size, before setting up the priors, we’ll need to look at some
measures of pupil size. (If we had analyzed this type of data before, we could also look at
estimates from previous experiments). Fortunately, we have some measurements of the same
subject with no attentional load for the first 100 ms, measured every 10 ms, in the data frame
df_pupil_pilot from the package bcogsci : This will give us some idea about the order of

magnitude of our dependent variable.

data("df_pupil_pilot")
df_pupil_pilot$p_size %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     852     856     862     861     866     868

With this information we can set a regularizing prior for α . Center the prior around 1000 to be
in the right order of magnitude.13 Since we don’t know how much pupil sizes are going to vary
by load yet, we include a rather wide prior by defining it as a normal distribution and setting its
standard deviation as 500 .

α ∼ Normal(1000, 500)
Given that our predictor load is centered, with the prior for α , we are saying that we suspect
that the average pupil size for the average load in the experiment will be in a 95% credible
interval limited by approximately 1000 ± 2 ⋅ 500 = [0, 2000] units. We can calculate this with
more precision in R using the qnorm function:


qnorm(c(.025, .975), mean = 1000, sd = 500)

## [1] 20 1980

We know that the measurements of the pilot data are strongly correlated because they were
taken 10 milliseconds apart. For this reason, they won’t give us a realistic estimate of how
much the pupil size can vary. Accordingly, set up quite an uninformative prior for σ that
encodes this lack of precise information: σ is surely larger than zero and has to be in the order
of magnitude of the pupil size with no load.

σ ∼ Normal+ (0, 1000)

With this prior for σ, we are saying that we expect that the standard deviation of the pupil
sizes should be in the following 95% credible interval.


qtnorm(c(.025, .975), mean = 0, sd = 1000, a = 0)

## [1] 31.3 2241.4

In order to compute the 95% credible interval, we used qtnorm from the extraDistr
package rather than qnorm() . As mentioned earlier, the relevant command specification is
qtnorm(..., a = 0) ; recall that a = 0 indicates a truncated normal distribution, truncated at

the left by zero.

The mean of Normal+ , a normal distribution truncated at zero so as to allow for only positive
values, does not coincide with its location indicated with the parameter μ (and neither does
the standard deviation coincide with the scale, σ); see Box 4.1.

samples <- rtnorm(20000, mean = 0, sd = 1000, a = 0)
c(mean = mean(samples), sd = sd(samples))

## mean   sd
##  794  595

We still need to set a prior for β, the change in pupil size produced by the attentional load.
Given that pupil size changes are not easily perceptible (we don’t usually observe changes in
pupil size in our day-to-day life), we expect them to be much smaller than the pupil size (which
we assume has mean 1000 units), so we use the following prior:

β ∼ Normal(0, 100)

With the prior of β, we are saying that we don’t really know if the attentional load will increase
or even decrease the pupil size (it is centered at zero), but we do know that one unit of load
(that is one more object to track) will potentially change the pupil size in a way that is
consistent with the following 95% credible interval.


qnorm(c(.025, .975), mean = 0, sd = 100)

## [1] -196 196

That is, we don’t expect changes in size that increase or decrease the pupil size more than
200 units for one unit increase in load.

The priors we have specified here are relatively uninformative; as mentioned earlier, this is
because we are considering the situation where we don’t have much prior experience with
pupil size studies. In other settings, we might have more prior knowledge and experience; in
that case, one could use somewhat more principled priors. We will return to this point in the
chapter on priors (chapter 6) and on the Bayesian workflow (chapter 7).

Box 4.1 Truncated distributions

Any distribution can be truncated. For a continuous distribution, the truncated version of
the original distribution will have non-zero probability density values for a continuous
subset of the original coverage. To make this more concrete, in our previous example, the
normal distribution has coverage for values between minus infinity to plus infinity, and our
truncated version N ormal+ has coverage between zero and plus infinity: all negative
values have a density of zero. Let’s see how we can generalize this to be able to
understand any truncation of any continuous distribution. (For the discrete case, we can
simply replace the integral with a sum, and replace PDF with PMF).

From the axiomatic definitions of probability, we know that the area below a PDF, f(x),
must be equal to one (section 1.1). More formally, this means that the integral of f
evaluated over −∞ < X < ∞ should be equal to one:

∫_{−∞}^{∞} f(x) dx = 1

But if the distribution is truncated, f is going to be evaluated on some subset of its possible
values, f(a < X < b); in the specific case of Normal+, for example, a = 0 and b = ∞. In the
general case, this means that the integral of the PDF evaluated for a < X < b will be lower
than one, unless a = −∞ and b = +∞:

∫_a^b f(x) dx < 1

We want to ensure that we build a new PDF for the truncated distribution so that, even
though it has less coverage than the non-truncated version, it still integrates to one. To
achieve this, we divide the "unnormalized" PDF by the total area of f(a < X < b) (recall
the discussion surrounding Equation (1.1)):

f_[a,b](x) = f(x) / ∫_a^b f(x) dx

The denominator of the previous equation is the difference between the CDF evaluated at
X = b and the CDF evaluated at X = a; this can be written as F(b) − F(a):

f_[a,b](x) = f(x) / (F(b) − F(a))    (4.1)

For the specific case where f(x) is Normal(x | 0, σ) and we want the PDF of
Normal+(x | 0, σ), the bounds will be a = 0 and b = ∞:

Normal+(x | 0, σ) = Normal(x | 0, σ) / (1/2)

because F(X = b = ∞) = 1 and F(X = a = 0) = 1/2.

You can verify this in R (this is valid for any value of sd ).



dnorm(1, mean = 0) * 2 == dtnorm(1, mean = 0, a = 0)

## [1] TRUE

Unless the truncation of the normal distribution is symmetrical, the mean μ of the truncated
normal does not coincide with the mean of the parent (untruncated) normal distribution; call
this mean of the parent distribution μ̂. For any type of truncation, the standard deviation of
the truncated distribution σ does not coincide with the standard deviation of the parent
distribution; call this latter standard deviation σ̂. Confusingly enough, the family of truncated
functions *tnorm keeps the argument names of the family of functions *norm, the terms
mean and sd. So, when defining a truncated normal distribution like
dtnorm(mean = 300, sd = 100, a = 0, b = Inf), the mean and sd refer to the mean μ̂ and
standard deviation σ̂ of the untruncated parent distribution.

Sometimes one needs to model observed data as coming from a truncated normal
distribution. An example would be a vector of observed standard deviations; perhaps one
wants to use these estimates to work out a truncated normal prior. In order to derive such
an empirically motivated prior, we have to work out what mean and standard deviation we
need to use in a truncated normal distribution. We could compute the mean and standard
deviation from the observed vector of standard deviations, and then use the procedure
shown below to work out the mean and standard deviation that we would need to put into
the truncated normal distribution. This approach is used in chapter 6, section 6.1.4 for
working out a prior based on standard deviation estimates from existing data.

The mean and standard deviation of the parent distribution of a truncated normal (μ̂ and σ̂)
with boundaries a and b, given the mean μ and standard deviation σ of the truncated
normal, are computed as follows (Johnson, Kotz, and Balakrishnan 1995). Here, ϕ(X) is the
PDF of the standard normal (i.e., Normal(μ = 0, σ = 1)) evaluated at X, and Φ(X) is the
CDF of the standard normal evaluated at X.

First, define two terms α and β for convenience:

α = (a − μ̂)/σ̂        β = (b − μ̂)/σ̂

Then, the mean μ of the truncated distribution can be computed as follows based on the
parameters of the parent distribution:

μ = μ̂ − σ̂ ⋅ (ϕ(β) − ϕ(α)) / (Φ(β) − Φ(α))    (4.2)

The variance σ² of the truncated distribution is:

σ² = σ̂² × (1 − (β ϕ(β) − α ϕ(α)) / (Φ(β) − Φ(α)) − ((ϕ(α) − ϕ(β)) / (Φ(β) − Φ(α)))²)    (4.3)

Equations (4.2) and (4.3) have two variables, so if one is given the values for the truncated
distribution, μ and σ, one can solve (using algebra) for the mean and standard deviation of
the untruncated distribution, μ̂ and σ̂.

For example, suppose that a = 0 and b = 500, and that the mean and standard deviation of
the untruncated parent distribution are μ̂ = 300 and σ̂ = 200. We can simulate such a
situation and estimate the mean and standard deviation of the truncated distribution:

x <- rtnorm(10000000, mean = 300, sd = 200, a = 0, b = 500)
## the mean and sd of the truncated distribution
## using simulation:
mean(x)

## [1] 271

sd(x)

## [1] 129

These simulated values are identical to the values computed using equations (4.2) and
(4.3):

a <- 0
b <- 500
bar_x <- 300
bar_sigma <- 200
alpha <- (a - bar_x) / bar_sigma
beta <- (b - bar_x) / bar_sigma
term1 <- (dnorm(beta) - dnorm(alpha)) /
  (pnorm(beta) - pnorm(alpha))
term2 <- (beta * dnorm(beta) - alpha * dnorm(alpha)) /
  (pnorm(beta) - pnorm(alpha))
## the mean and sd of the truncated distribution
## computed analytically:
(mu <- bar_x - bar_sigma * term1)

## [1] 271

(sigma <- sqrt(bar_sigma^2 * (1 - term2 - term1^2)))

## [1] 129

The equations for the mean and variance of the truncated distribution (μ and σ) can also be
used to work out the mean and variance of the parent untruncated distribution (μ̂ and σ̂), if
one has estimates for μ and σ (from data).

Suppose that we have observed data with mean μ = 271 and σ = 129. We want to assume
that the data are coming from a truncated normal which has lower bound 0 and upper
bound 500. What are the mean and standard deviation of the parent distribution, μ̂ and σ̂?

To answer this question, first rewrite the equations as follows:

μ − μ̂ + σ̂ ⋅ (ϕ(β) − ϕ(α)) / (Φ(β) − Φ(α)) = 0    (4.4)

The variance σ² of the truncated distribution gives:

σ² − σ̂² × (1 − (β ϕ(β) − α ϕ(α)) / (Φ(β) − Φ(α)) − ((ϕ(α) − ϕ(β)) / (Φ(β) − Φ(α)))²) = 0    (4.5)

Next, solve for μ̂ and σ̂, given the observed mean and standard deviation of the truncated
distribution and the known boundaries (a and b).

Define the system of equations according to the specifications of multiroot() from the
package rootSolve: x for the unknowns (μ̂ and σ̂), and parms for the known parameters:
a, b, and the mean and standard deviation of the truncated normal.

eq_system <- function(x, parms) {
  mu_hat <- x[1]
  sigma_hat <- x[2]
  alpha <- (parms["a"] - mu_hat) / sigma_hat
  beta <- (parms["b"] - mu_hat) / sigma_hat
  c(
    F1 = parms["mu"] - mu_hat + sigma_hat *
      (dnorm(beta) - dnorm(alpha)) / (pnorm(beta) - pnorm(alpha)),
    F2 = parms["sigma"] -
      sigma_hat *
        sqrt(1 - (beta * dnorm(beta) - alpha * dnorm(alpha)) /
               (pnorm(beta) - pnorm(alpha)) -
               ((dnorm(beta) - dnorm(alpha)) /
                  (pnorm(beta) - pnorm(alpha)))^2)
  )
}
Solving the two equations using multiroot() from the package rootSolve gives us the mean
and standard deviation μ̂ and σ̂ of the parent normal distribution. (Notice that x is a required
parameter of the previous function so that it works with multiroot(); however, outside of
the function, the variable x is a vector containing the samples of the truncated normal
distribution generated with rtnorm().)

soln <- multiroot(f = eq_system, start = c(1, 1),
                  parms = c(a = 0, b = 500,
                            mu = mean(x), sigma = sd(x)))
soln$root

## [1] 300 200

4.1.2 The brms model

Before fitting the brms model of the effect of load on pupil size, load the data and center the
predictor load :


data("df_pupil")
(df_pupil <- df_pupil %>%
mutate(c_load = load - mean(load)))

## # A tibble: 41 × 5
##    subj trial  load p_size c_load
##   <int> <int> <int>  <dbl>  <dbl>
## 1   701     1     2  1021. -0.439
## 2   701     2     1   951. -1.44
## 3   701     3     5  1064.  2.56
## # … with 38 more rows

Now fit the brms model:

fit_pupil <- brm(p_size ~ 1 + c_load,
  data = df_pupil,
  family = gaussian(),
  prior = c(
    prior(normal(1000, 500), class = Intercept),
    prior(normal(0, 1000), class = sigma),
    prior(normal(0, 100), class = b, coef = c_load)
  )
)

The only difference from our previous models is that we now have a predictor in the formula
and in the priors. Priors for predictors are indicated with class = b , and the specific predictor
with coef = c_load . If we want to set the same priors to different predictors we can omit the
argument coef . Even if we drop the 1 from the formula, brm() will fit the same model as
when we specify 1 explicitly. If we really want to remove the intercept, this must be indicated
with 0 +... or -1 +... . Also see the Box 4.2 for more details about the treatment of the
intercepts by brms . The priors are normal distributions for the intercept (α) and the slope (β),
and a truncated normal distribution for the scale parameter σ, which coincides with the
standard deviation (because the likelihood is a normal distribution). brms will automatically
truncate the prior specification for σ and allow only positive values.

Next, inspect the output of the model. The posteriors and trace plots are shown in Figure 4.2;
this figure is generated by typing:

plot(fit_pupil)

FIGURE 4.2: The posterior distributions of the parameters in the brms model fit_pupil,
along with the corresponding trace plots.
fit_pupil

## ...
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   701.38     20.28   662.06   741.42 1.00     3851     2949
## c_load       33.90     11.68    11.43    57.01 1.00     3637     2645
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma   128.59     15.47   102.82   162.79 1.00     3256     2823
##
## ...
In the next section, we discuss how one can communicate the relevant information from the
model.

Box 4.2 Intercepts in brms

When we set up a prior for the intercept in brms , we actually set a prior for an intercept
assuming that all the predictors are centered. This means that when predictors are not
centered (and only then), there is a mismatch between the interpretation of the intercept
as returned in the output of brms and the interpretation of the intercept with respect to its
prior specification. In this case, only the intercept in the output corresponds to the formula
in the brms call, that is, the intercept in the output corresponds to the non-centered
model. However, as we show below, when the intercept is much larger than the effects
that we are considering in the formula (what we generally call β), this discrepancy hardly
matters.

The reason for this mismatch when our predictors are uncentered is that brms increases
sampling efficiency by automatically centering all the predictors internally (that is the
population-level design matrix X is internally centered around its column means when
brms fits a model). This did not matter in our previous examples because we centered

our predictor (or we had no predictor), but it might matter if we want to have uncentered
predictors. In the design we are discussing, a non-centered predictor of load will mean
that the intercept, α , has a straightforward interpretation: the α is the mean pupil size
when there is no attention load. This is in contrast with the centered version presented
before, where the intercept α represents the pupil size for the average load of 2.44
( c_load is equal to 0). The difference between the non-centered model (below) and the
centered version presented before is depicted in Figure 4.3.

Suppose that we are quite sure that the prior values for the no-load condition (i.e., load is
non-centered) fall between 400 and 1200 units. In that case, the following prior could be
set for α: α ∼ Normal(800, 200). In this case, the model becomes:

prior_nc <- c(
  prior(normal(800, 200), class = b, coef = Intercept),
  prior(normal(0, 1000), class = sigma),
  prior(normal(0, 100), class = b, coef = load)
)
fit_pupil_non_centered <- brm(p_size ~ 0 + Intercept + load,
  data = df_pupil,
  family = gaussian(),
  prior = prior_nc
)

FIGURE 4.3: Regression lines for the non-centered and centered linear regressions. The
intercept (α) represented by a circle is positioned differently depending on the centering,
whereas the slope (β) represented by a vertical dashed line has the same magnitude in
both models.

When the predictor is non-centered as shown above, the regular centered intercept is
removed by adding 0 to the formula, and by replacing the intercept with the “actual”
intercept we want to set priors on with Intercept . The word Intercept is a reserved
word; we cannot name any predictor with this name. This new parameter is also of class
b , so its prior needs to be defined accordingly. Once we use 0 + Intercept + .. , the

intercept is not calculated with predictors that are automatically centered any more.

The output below shows that, as expected, although the posterior for the intercept has
changed noticeably, the posterior for the effect of load remains virtually unchanged.

posterior_summary(fit_pupil_non_centered,
                  variable = c("b_Intercept", "b_load"))

##             Estimate Est.Error  Q2.5 Q97.5
## b_Intercept    623.6      34.9 554.8 691.7
## b_load          32.5      12.0   8.4  55.5

Notice the following potential pitfall. A model like the one below will fit a non-centered load
predictor, but will assign a prior of Normal(800, 200) to the intercept of a model that
assumes a centered predictor, α_centered, and not the current intercept, α.

fit_pupil_wrong <- brm(p_size ~ 1 + load,
  data = df_pupil,
  family = gaussian(),
  prior = prior_nc
)

What does it mean to set a prior to α_centered in a model that doesn't include α_centered?

The fitted (expected) values of the non-centered model and the centered one are identical;
that is, the values of the response distribution without the residual error are identical for
both models:

α + load_n ⋅ β = α_centered + (load_n − mean(load)) ⋅ β    (4.6)

The left side of Equation (4.6) refers to the expected values based on our current
non-centered model, and the right side refers to the expected values based on the centered
model. We can re-arrange terms to understand what the effect is of a prior on α_centered in
our model that doesn't include α_centered.
α + load_n ⋅ β = α_centered + load_n ⋅ β − mean(load) ⋅ β
α = α_centered − mean(load) ⋅ β
α + mean(load) ⋅ β = α_centered

That means that in the centered model, we are actually setting our prior on
α + mean(load) ⋅ β. When β is very small (or the means of our predictors are very small
because they might be "almost" centered), and the prior for α is very wide, we might hardly
notice the difference between setting a prior on α_centered or on our actual α in a
non-centered model (especially if the likelihood dominates anyway). But it is important to
pay attention to what the parameters that we are setting priors on represent.
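
This relationship can be verified numerically with the two models fitted above; a quick
sketch (the two quantities should agree up to MCMC error, at around 701 units here):

draws_nc <- as_draws_df(fit_pupil_non_centered)
## alpha + mean(load) * beta from the non-centered fit:
mean(draws_nc$b_Intercept + mean(df_pupil$load) * draws_nc$b_load)
## alpha_centered from the centered fit:
mean(as_draws_df(fit_pupil)$b_Intercept)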

To sum up, brms automatically centers all predictors for posterior estimation, and the
prior of the intercept is applied to the centered version of the model during model fitting.
However, when the predictors specified in the formula are not centered, then brms uses
the equations shown before to return in the output the posterior of the intercept for the
non-centered predictors.14

In our example analyses with brms in this book, we will always center our predictors.

4.1.3 How to communicate the results?

We want to answer the research question “What is the effect of attentional load on the
subject’s pupil size?” To answer this question, we’ll need to examine what happens with the
posterior distribution of β, which is printed out as c_load in the summary of brms . The
summary of the posterior tells us that the most likely values of β will be around the mean of
the posterior, 33.9, and we can be 95% certain that the value of β, given the model and the
data, lies between 11.43 and 57.01.

The model tells us that as attentional load increases, the pupil size of the subject becomes
larger. If we want to determine how likely it is that the pupil size increased rather than
decreased, we can examine the proportion of samples above zero. (The intercept and the
slopes are always preceded by b_ in brms . One can see all the names of parameters being
estimated with variables() .)


mean(as_draws_df(fit_pupil)$b_c_load > 0)

## [1] 0.998
This high probability does not mean that the effect of load is non-zero. It means instead that it's much more likely that the effect is positive rather than negative. In order to claim that the effect is likely to be non-zero, we would have to compare this model with an alternative model that assumes that the effect of load is 0. We'll come back to this issue when we discuss model comparison in chapter 14.

4.1.4 Descriptive adequacy

Our model converged and we obtained a posterior distribution. However, there is no guarantee that our model is good enough to represent our data. We can use posterior predictive checks to check the descriptive adequacy of the model.

Sometimes it’s useful to customize the posterior predictive check to visualize the fit of our
model. We iterate over the different loads (e.g., 0 to 4), and we show the posterior predictive
distributions based on 100 simulations for each load together with the observed pupil sizes in
Figure 4.4. We don’t have enough data to derive a strong conclusion: both the predictive
distributions and our data look very widely spread out, and it’s hard to tell if the distribution of
the observations could have been generated by our model. For now we can say that it doesn’t
look too bad.


for (l in 0:4) {
  df_sub_pupil <- filter(df_pupil, load == l)
  p <- pp_check(fit_pupil,
    type = "dens_overlay",
    ndraws = 100,
    newdata = df_sub_pupil
  ) +
    geom_point(data = df_sub_pupil, aes(x = p_size, y = 0.0001)) +
    ggtitle(paste("load: ", l)) +
    coord_cartesian(xlim = c(400, 1000))
  print(p)
}
FIGURE 4.4: The plot shows 100 posterior predicted distributions with the label yrep, the distribution of the pupil size data in black with the label y, and the observed pupil sizes as black dots, for the five levels of attentional load.
In Figure 4.5, we look instead at the distribution of a summary statistic, such as mean pupil
size by load:


for (l in 0:4) {
  df_sub_pupil <- filter(df_pupil, load == l)
  p <- pp_check(fit_pupil,
    type = "stat",
    ndraws = 1000,
    newdata = df_sub_pupil,
    stat = "mean"
  ) +
    geom_point(data = df_sub_pupil, aes(x = p_size, y = 0.1)) +
    ggtitle(paste("load: ", l)) +
    coord_cartesian(xlim = c(400, 1000))
  print(p)
}

FIGURE 4.5: Distributions of posterior predicted means in gray, and observed pupil size means in black lines, by load.
Figure 4.5 shows that the observed means for no load and for a load of one fall in the tails of the distributions. Although our model predicts a monotonic increase of pupil size, the data might be indicating that the relevant difference is simply between no load and some load. However, given the uncertainty in the posterior predictive distributions, and given that the observed means are contained somewhere in the predicted distributions, it could be the case that with this model we are overinterpreting noise.
4.2 Log-normal model: Does trial affect finger tapping
times?

Let us revisit the small experiment from section 3.2.1, where a subject repeatedly taps the
space bar as fast as possible. Suppose that we want to know whether the subject tended to
speed up (a practice effect) or slow down (a fatigue effect) while pressing the space bar. We’ll
use the same data set df_spacebar as before, and we’ll center the column trial :


df_spacebar <- df_spacebar %>%
  mutate(c_trial = trial - mean(trial))

4.2.1 Likelihood and priors for the log-normal model

If we assume that finger tapping times are log-normally distributed, the likelihood becomes:

t_n ∼ LogNormal(α + c_trial_n ⋅ β, σ)    (4.7)

where n = 1, …, N, and t is the dependent variable (finger tapping times in milliseconds). The variable N represents the total number of data points.

Use the same priors as in section 3.7.2 for α (which is equivalent to μ in the previous model)
and for σ.

α ∼ Normal(6, 1.5)

σ ∼ Normal+(0, 1)

We still need a prior for β. Effects are multiplicative rather than additive when we assume a log-normal likelihood, and that means that we need to take α into account in order to interpret β; for details, see Box 4.3. We are going to try to understand how all our priors interact by generating some prior predictive distributions. We start with the following prior centered at zero, a prior agnostic regarding the direction of the effect, which allows for a slowdown (β > 0) and a speedup (β < 0):

β ∼ Normal(0, 1)

Here is our first attempt at a prior predictive distribution:

# Ignore the dependent variable,
# use a vector of ones as a placeholder.
df_spacebar_ref <- df_spacebar %>%
  mutate(t = rep(1, n()))
fit_prior_press_trial <- brm(t ~ 1 + c_trial,
  data = df_spacebar_ref,
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = sigma),
    prior(normal(0, 1), class = b, coef = c_trial)
  ),
  sample_prior = "only",
  control = list(adapt_delta = .9)
)

In order to understand the type of data that we are assuming a priori with the prior of the
parameter β, plot the median difference between the finger tapping times at adjacent trials. As
the prior of β gets wider, larger differences are observed between adjacent trials. The
objective of the prior predictive checks is to calibrate the prior of β to obtain a plausible range
of differences. We are going to plot a distribution of medians because they are less affected
by the variance in the prior predicted distribution than the distribution of mean differences;
distributions of means will have much more spread. To make the distribution of means more
realistic, we would also need to find a more accurate prior for the scale σ. (Recall that the
mean of log-normal distributed values depend on both the location, μ and the scale, σ, of the
distribution.) To plot the median effect, first define a function that calculates the difference
between adjacent trials, and then apply the median to the result. We use that function in
pp_check and show the result in Figure 4.6. As expected, the median effect is centered on

zero (as is our prior), but we see that the distribution of possible medians for the effect is too
widely spread out and includes values that are too extreme.

median_diff <- function(x) {
  median(x - lag(x), na.rm = TRUE)
}
pp_check(fit_prior_press_trial,
  type = "stat",
  stat = "median_diff",
  # show only prior predictive distributions
  prefix = "ppd",
  # each bin has a width of 500 ms
  binwidth = 500) +
  # cut the top of the plot to improve its scale
  coord_cartesian(ylim = c(0, 50))

FIGURE 4.6: The prior predictive distribution of the median effect of the model defined in section 4.2 with β ∼ Normal(0, 1).

Repeat the same procedure with β ∼ Normal(0, 0.01); the resulting prior predictive distribution is shown in Figure 4.7. The prior predictive distribution shows us that the prior is still quite vague; it is, however, at least in the right order of magnitude.
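The code for this second prior predictive check is not shown here in the original text; a minimal sketch, mirroring the previous call (the object name fit_prior_press_trial_2 and the finer bin width are our own choices), might look as follows:

fit_prior_press_trial_2 <- brm(t ~ 1 + c_trial,
  data = df_spacebar_ref,
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = sigma),
    # the only change: a much narrower prior for the slope
    prior(normal(0, .01), class = b, coef = c_trial)
  ),
  sample_prior = "only",
  control = list(adapt_delta = .9)
)
pp_check(fit_prior_press_trial_2,
  type = "stat",
  stat = "median_diff",
  prefix = "ppd",
  # a finer bin width, since the scale is now much smaller
  binwidth = 50)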

FIGURE 4.7: The prior predictive distribution of the median difference in finger tapping times between adjacent trials, based on the model defined in section 4.2 with β ∼ Normal(0, 0.01).
Prior selection might look daunting and can be a lot of work. However, this work is usually
done only the first time we start working with an experimental paradigm; besides, priors can
be informed by the estimates from previous experiments (even maximum likelihood estimates
from frequentist models can be useful). We will generally use very similar (or identical) priors for analyses dealing with the same type of task. When in doubt, a sensitivity analysis (see section 3.4) can tell us whether the posterior distribution is unintentionally strongly dependent on our prior selection. We will return to the issue of prior selection in chapter 6.

Box 4.3 Understanding the Log-normal likelihood

It is important to understand what we are assuming with our log-normal likelihood. Formally, if a random variable Z is normally distributed with mean μ and variance σ², then the transformed random variable Y = exp(Z) is log-normally distributed and has density:

LogNormal(y|μ, σ) = f(y) = (1 / (y √(2πσ²))) ⋅ exp(−(log(y) − μ)² / (2σ²))

As explained in section 3.7.1, the model from Equation (4.7) is equivalent to the following:

log(t_n) ∼ Normal(α + c_trial_n ⋅ β, σ)    (4.8)

The family of normal distributions is closed under linear transformations: that is, if X is normally distributed with mean μ and standard deviation σ, then (for any real numbers a and b), aX + b is also normally distributed, with mean aμ + b (and standard deviation √(a²σ²) = |a|σ).

This means that, assuming Z ∼ Normal(α, σ), Equation (4.8) can be re-written as follows:

log(t_n) = Z + c_trial_n ⋅ β    (4.9)

Exponentiate both sides, use the property of exponents that exp(x + y) is equal to exp(x) ⋅ exp(y), and set Y = exp(Z):

t_n = exp(Z + c_trial_n ⋅ β)

t_n = exp(Z) ⋅ exp(c_trial_n ⋅ β)

t_n = Y ⋅ exp(c_trial_n ⋅ β)

The last equation has two terms being multiplied. The first one, Y, tells us that we are assuming that finger tapping times are log-normally distributed with a median of exp(α); the second term, exp(c_trial_n ⋅ β), tells us that the effect of trial number is multiplicative and grows or decays exponentially with the trial number. This has two important consequences:

1. Different values of the intercept, α , given the same β, will affect the difference in
finger tapping or response times for two adjacent trials (compare this with what
happens with an additive model, such as when a normal likelihood is used); see
Figure 4.8. This is because, unlike in the additive case, the intercept doesn’t cancel
out:

Additive case:

(α + trial_n ⋅ β) − (α + trial_{n−1} ⋅ β)
= α − α + (trial_n − trial_{n−1}) ⋅ β
= (trial_n − trial_{n−1}) ⋅ β

Multiplicative case:

exp(α) ⋅ exp(trial_n ⋅ β) − exp(α) ⋅ exp(trial_{n−1} ⋅ β)
= exp(α)(exp(trial_n ⋅ β) − exp(trial_{n−1} ⋅ β))
≠ (exp(trial_n) − exp(trial_{n−1})) ⋅ exp(β)


FIGURE 4.8: The fitted values of the difference in response time between two adjacent trials, when β = 0.01 and α lies between 0.1 and 15. The graph shows how changes in the intercept lead to changes in the difference in response times between trials, even if β is fixed.
2. As the trial number increases, the same value of β will have a very different impact on
the original scale of the dependent variable: Any (fixed) negative value for β will lead
to exponential decay and any (fixed) positive value will lead to exponential growth;
see Figure 4.9.

FIGURE 4.9: The fitted values of the dependent variable (response times in ms) as a function of trial number, when (A) β = −0.01, exponential decay, and when (B) β = 0.01, exponential growth.

Does exponential growth or decay make sense in this particular example? We need to consider that if they do make sense, they will be an approximation valid for a specific range of values; at some point we would expect a ceiling or a floor effect: response times cannot truly be 0 milliseconds, or take several minutes. However, in our specific model, exponential growth or decay by trial is probably a bad approximation: We would predict that our subject takes an extremely long (if β > 0) or extremely short (if β < 0) time to press the space bar after a relatively low number of trials. This doesn't mean that the likelihood is wrong by itself, but it does mean that we at least need to put a cap on the growth or decay of our experimental manipulation. We can do this if the exponential growth or decay is a function of, for example, log-transformed trial numbers:

t_n ∼ LogNormal(α + c_log_trial_n ⋅ β, σ)
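In brms, this would amount to creating a centered log-transformed trial predictor before fitting; a minimal sketch (the column name c_log_trial is our own choice):

df_spacebar <- df_spacebar %>%
  mutate(c_log_trial = log(trial) - mean(log(trial)))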

FIGURE 4.10: Fitted values of the dependent variable (times in ms) as a function of the natural logarithm of the trial number, when (A) β = −0.01, exponential decay, and when (B) β = 0.01, exponential growth.

Log-normal distributions everywhere

The normal distribution is most often assumed to describe the random variation that occurs in the data from many scientific disciplines. However, most measurements actually show skewed distributions. Limpert, Stahel, and Abbt (2001) discuss the log-normal distribution in scientific disciplines and how diverse types of data, from the lengths of latent periods of infectious diseases to the distribution of mineral resources in the Earth's crust, including even body height (the quintessential example of a normal distribution), closely fit the log-normal distribution.
Limpert, Stahel, and Abbt (2001) point out that because a random variable that results
from multiplying many independent variables has an approximate log-normal distribution,
the most basic indicator of the importance of the log-normal distribution may be very
general: Chemistry and physics are fundamental in life, and the prevailing operation in the
laws of these disciplines is multiplication rather than addition.

Furthermore, at many physiological and anatomical levels in the brain, the distribution of
numerous parameters is in fact strongly skewed with a heavy tail, suggesting that skewed
(typically log-normal) distributions are fundamental to structural and functional brain
organization. This might be explained given that the majority of interactions in highly
interconnected systems, especially in biological systems, are multiplicative and synergistic
rather than additive (Buzsáki and Mizuseki 2014).

Does the log-normal distribution make sense for response times? It has long been noticed that the log-normal distribution often provides a good fit to response time distributions
(Brée 1975; Ulrich and Miller 1994). One advantage of assuming log-normally distributed
response times (but, in fact, this is true for many skewed distributions) is that it entails that
the standard deviation of the reaction time distribution will increase with the mean, as has
been observed in empirical distributions of response times (Wagenmakers, Grasman, and
Molenaar 2005). Interestingly, it turns out that log-normal response times are also easily
generated by certain process models. Ulrich and Miller (1993) show, for example, that
models in which response times are determined by a series of processes cascading
activation from an input level to an output level (usually passing through a number of
intervening processing levels along the way) can generate log-normally distributed
response times.

4.2.2 The brms model

We are now relatively satisfied with the priors for our model, and we can fit the model of the effect of trial on button-pressing times using brms. We need to specify that the family is lognormal().

fit_press_trial <- brm(t ~ 1 + c_trial,
  data = df_spacebar,
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = sigma),
    prior(normal(0, .01), class = b, coef = c_trial)
  )
)

Instead of printing out the complete output from the model, look at the estimates from the posteriors for the parameters α, β, and σ. These parameters are on the log scale:


posterior_summary(fit_press_trial,
  variable = c("b_Intercept",
               "b_c_trial",
               "sigma"))

## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 5.118501 0.0063317 5.106380 5.130839
## b_c_trial 0.000523 0.0000627 0.000401 0.000647
## sigma 0.123411 0.0046241 0.114477 0.132991

The posterior distributions can be plotted to obtain a graphical summary of all the parameters
in the model (Figure 4.11):


plot(fit_press_trial)
FIGURE 4.11: Posterior distributions of the model of the effect of trial on button-pressing, along with the corresponding trace plots.
Next, we turn to the question of what we can report as our results, and what we can conclude
from the data.

4.2.3 How to communicate the results?

As shown above, the first step is to summarize the posteriors in a table or graphically (or both). If the research question relates to the effect estimated by the model, the posterior of β can be summarized in the following way: β̂ = 0.00052, 95% CrI = [0.0004, 0.00065].

The effect is easier to interpret in milliseconds. We can transform the estimates back to the
millisecond scale from the log scale, but we need to take into account that the scale is not
linear, and that the effect between two button presses will differ depending on where we are in
the experiment.

We will have a certain estimate if we consider the difference between response times in a trial
at the middle of the experiment (when the centered trial number is zero) and the previous one
(when the centered trial number is minus one).

alpha_samples <- as_draws_df(fit_press_trial)$b_Intercept
beta_samples <- as_draws_df(fit_press_trial)$b_c_trial
effect_middle_ms <- exp(alpha_samples) -
  exp(alpha_samples - 1 * beta_samples)
## ms effect in the middle of the expt
## (mean trial vs. mean trial - 1)
c(mean = mean(effect_middle_ms),
  quantile(effect_middle_ms, c(0.025, 0.975)))

## mean 2.5% 97.5%
## 0.0874 0.0669 0.1080

We will obtain a different estimate if we consider the difference between the second and the first trial:


first_trial <- min(df_spacebar$c_trial)
second_trial <- min(df_spacebar$c_trial) + 1
effect_beginning_ms <-
  exp(alpha_samples + second_trial * beta_samples) -
  exp(alpha_samples + first_trial * beta_samples)
## ms effect from first to second trial:
c(mean = mean(effect_beginning_ms),
  quantile(effect_beginning_ms, c(0.025, 0.975)))

## mean 2.5% 97.5%
## 0.0795 0.0623 0.0962

So far we converted the estimates to obtain median effects; that's why we used exp(⋅). If we want to obtain mean effects, we need to take σ into account, since we need to calculate exp(⋅ + σ²/2).
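As a sketch, this mean effect in the middle of the experiment could be computed by hand, reusing the alpha_samples and beta_samples defined above:

sigma_samples <- as_draws_df(fit_press_trial)$sigma
# the mean of a log-normal is exp(mu + sigma^2 / 2):
mean_middle_ms <- exp(alpha_samples + sigma_samples^2 / 2) -
  exp(alpha_samples - 1 * beta_samples + sigma_samples^2 / 2)
c(mean = mean(mean_middle_ms),
  quantile(mean_middle_ms, c(0.025, 0.975)))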
However, we can also use the built-in function fitted(), which calculates mean effects. Consider again the difference between the second and the first trial, this time using fitted().
First, define for which observations we want to obtain the fitted values in millisecond scale. If
we are interested in the difference between the second and first trial, create a data frame with
their centered versions.


newdata_1 <- data.frame(c_trial = c(first_trial, second_trial))

Second, use fitted() on the brms object, including the new data, and setting the summary parameter to FALSE. The first column contains the posterior samples transformed into milliseconds for the first trial, and the second column those for the second trial.


beginning <- fitted(fit_press_trial,
  newdata = newdata_1,
  summary = FALSE)
head(beginning, 3)

## [,1] [,2]
## [1,] 153 153
## [2,] 154 155
## [3,] 155 155

Last, calculate the difference between trials, and report mean and 95% quantiles.


effect_beginning_ms <- beginning[, 2] - beginning[, 1]
c(mean = mean(effect_beginning_ms),
  quantile(effect_beginning_ms, c(0.025, 0.975)))

## mean 2.5% 97.5%
## 0.0801 0.0628 0.0970

Given that σ is much smaller than μ, σ doesn't have a large influence on the mean effects, and the mean and 95% CrI of the mean and median effects are quite similar.
We see that no matter how we calculate the trial effect, there is a slowdown. When reporting
the results of these analyses, one should present the posterior mean and a credible interval,
and then reason about whether the observed estimates are consistent with the prediction from
the theory being investigated. The 95% credible interval used here is just a convention
adopted from standard practice in psychology and related areas.

The practical relevance of the effect for the research question can be important too. For
example, only after 100 button presses do we see a barely noticeable slowdown:


effect_100 <-
  exp(alpha_samples + 100 * beta_samples) -
  exp(alpha_samples)
c(mean = mean(effect_100),
  quantile(effect_100, c(0.025, 0.975)))

## mean 2.5% 97.5%
## 8.98 6.83 11.16

We need to consider whether the estimated mean effect and our uncertainty about it have any scientific relevance. Such relevance can be established by considering the previous literature, predictions from a quantitative model, or other expert domain knowledge. Sometimes, a quantitative meta-analysis is helpful; for examples, see Bürki, Alario, and Vasishth (2022), Cox et al. (2022), Bürki et al. (2020), Jäger, Engelmann, and Vasishth (2017), Mahowald et al. (2016), Nicenboim, Roettger, and Vasishth (2018), and Vasishth et al. (2013). We will discuss meta-analysis later in the book, in chapter 13.

Sometimes, researchers are only interested in establishing that an effect is present or absent;
the magnitude and uncertainty of the estimate is of secondary interest. Here, the goal is to
argue that there is evidence of a slowdown. The word evidence has a special meaning in
statistics (Royall 1997), and in null hypothesis significance testing, a likelihood ratio test is the
standard way to argue that one has evidence for an effect. In the Bayesian data analysis
context, in order to answer such a question, a Bayes factor analysis must be carried out. We’ll
come back to this issue in the model comparison chapters 14-16.
4.2.4 Descriptive adequacy

We look now at the predictions of the model. Since we now know that trial effects are very small, let's examine the predictions of the model for differences in response times between 100 button presses. As with the prior predictive checks, we define a function, median_diff100(), that calculates the median difference between a trial n and a trial n + 100. This time we'll compare the observed median difference against the range of predicted differences based on the model and the data, rather than only on the model as we did for the prior predictions. Below we use virtually the same code that we used for plotting prior predictive checks, but since we now use the fitted model, we'll obtain posterior predictive checks; this is displayed in Figure 4.12.


median_diff100 <- function(x) median(x - lag(x, 100), na.rm = TRUE)
pp_check(fit_press_trial,
  type = "stat",
  stat = "median_diff100")

FIGURE 4.12: The posterior predictive distribution of the median difference in response times between a trial n and a trial n + 100, based on the model fit_press_trial and the observed data.

From Figure 4.12, we can conclude that the model's predictions for differences in response times between trials are reasonable.
4.3 Logistic regression: Does set size affect free
recall?

In this section, we will learn how the principles we have learned so far can naturally extend to
generalized linear models (GLMs). We focus on one special case of GLMs that has wide
application in linguistics and psychology, logistic regression.

As an example data set, we look at a study investigating the capacity level of working memory.
The data are a subset of a data set created by Oberauer (2019). Each subject was presented with word lists of varying lengths (2, 4, 6, and 8 elements), and then was asked to recall a word
given its position on the list; see Figure 4.13. We will focus on the data from one subject.
FIGURE 4.13: The flow of events in a trial with memory set size 4 and free recall. Adapted
from Oberauer (2019); licensed under CC BY 4.0
(https://creativecommons.org/licenses/by/4.0/).
It is well-established that as the number of items to be held in working memory increases,
performance, that is accuracy, decreases (see Oberauer and Kliegl 2001, among others). We
will investigate this claim with data from only one subject.

The data can be found in df_recall in the package bcogsci . The code below loads the
data, centers the predictor set_size , and briefly explores the data set.

data("df_recall")
df_recall <- df_recall %>%
mutate(c_set_size = set_size - mean(set_size))
# Set sizes in the data set:
df_recall$set_size %>%

unique() %>% sort()

## [1] 2 4 6 8


# Trials by set size
df_recall %>%
  group_by(set_size) %>%
  count()

## # A tibble: 4 × 2
## # Groups: set_size [4]
## set_size n
## <int> <int>
## 1 2 23
## 2 4 23
## 3 6 23
## # … with 1 more row

Here, the column correct records the incorrect vs. correct responses with 0 vs 1 , and
the column c_set_size records the centered memory set size; these latter scores have
continuous values -3, -1, 1, and 3. These continuous values are centered versions of 2, 4, 6,
and 8.
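We can quickly verify this (a small check of our own, not in the original code):

df_recall$c_set_size %>%
  unique() %>% sort()

## [1] -3 -1  1  3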


df_recall
## # A tibble: 92 × 8
## subj set_size correct trial session block tested c_set_size
## <chr> <int> <int> <int> <int> <int> <int> <dbl>
## 1 10 4 1 1 1 1 2 -1
## 2 10 8 0 4 1 1 8 3
## 3 10 2 1 9 1 1 2 -3
## # … with 89 more rows

We want to model the trial-by-trial accuracy and examine whether the probability of recalling a
word is related to the number of words in the set that the subject needs to remember.

4.3.1 The likelihood for the logistic regression model

Recall that the Bernoulli likelihood generates a 0 or 1 response with a particular probability θ.
For example, one can generate simulated data for 10 trials, with a 50% probability of getting a
1 using rbern from the package extraDistr .


rbern(10, prob = 0.5)

## [1] 0 1 1 1 1 1 0 1 0 1

We can therefore define each dependent value correct_n in the data as being generated from a Bernoulli random variable with probability of success θ_n. Here, n = 1, …, N indexes the trial, correct_n is the dependent variable (0 indicates an incorrect recall and 1 a correct recall), and θ_n is the probability of correctly recalling a probe in a given trial n.

correct_n ∼ Bernoulli(θ_n)    (4.10)

Since θ_n is bounded to be between 0 and 1 (it is a probability), we cannot just fit a regression model using the normal (or log-normal) likelihood as we did in the preceding examples. Such a model would be inappropriate because it would assume that the data range from −∞ to +∞ (or from 0 to +∞), rather than being limited to zeros and ones.

The generalized linear modeling framework solves this problem by defining a link function g(⋅) that connects the linear model to the quantity to be estimated (here, the probabilities θ_n). The link function used for 0, 1 responses is called the logit link, and is defined as follows:
η_n = g(θ_n) = log(θ_n / (1 − θ_n))

The term θ_n / (1 − θ_n) is called the odds.15 The logit link function is therefore a log-odds; it maps probability values ranging from [0, 1] to real numbers ranging from −∞ to +∞. Figure 4.14 shows the logit link function, η = g(θ), and the inverse logit, θ = g⁻¹(η), which is called the logistic function; the relevance of this logistic function will become clear in a moment.

FIGURE 4.14: The logit and inverse logit (logistic) function.

The linear model is now fit not to the 0,1 responses as the dependent variable, but to η_n, i.e., log-odds, as the dependent variable:

η_n = log(θ_n / (1 − θ_n)) = α + c_set_size_n ⋅ β

Unlike linear models, the model is defined so that there is no residual error term (ε) in this model. Once η_n is estimated, one can solve the above equation for θ_n (in other words, we compute the inverse of the logit function and obtain the estimates on the probability scale). This gives the above-mentioned logistic regression function:

θ_n = g⁻¹(η_n) = exp(η_n) / (1 + exp(η_n)) = 1 / (1 + exp(−η_n))

The last equality in the equation above arises by dividing both the numerator and the denominator by exp(η_n).

In summary, the generalized linear model with the logit link fits the following Bernoulli likelihood:

correct_n ∼ Bernoulli(θ_n)    (4.11)

The model is fit on the log-odds scale, η_n = α + c_set_size_n ⋅ β. Once η_n has been estimated, the inverse logit or the logistic function is used to compute the probability estimates θ_n = exp(η_n) / (1 + exp(η_n)). An example of this calculation will be shown in the next section.

4.3.2 Priors for the logistic regression

In order to decide on priors for α and β, we need to take into account that these parameters
do not represent probabilities or proportions, but log-odds, the x-axis in Figure 4.14 (right-
hand side figure). As shown in the figure, the relationship between log-odds and probabilities
is not linear.

There are two functions in R that implement the logit and inverse logit functions: qlogis(p)
for the logit function and plogis(x) for the inverse logit or logistic function.
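For example:

qlogis(.5) # probability to log-odds
## [1] 0
plogis(0)  # log-odds to probability
## [1] 0.5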

Now we need to set priors for α and β. Given that we centered our predictor, the intercept, α, represents the log-odds of correctly recalling one word in a random position for the average set size of five (since (2 + 4 + 6 + 8)/4 = 5), which, incidentally, was not presented in the experiment. This is one case where the intercept doesn't have a clear interpretation if we leave the predictor uncentered: With a non-centered set size, the intercept would be the log-odds of recalling one word in a set of zero words, which obviously makes no sense.

The prior for α will depend on how difficult the recall task is. We could assume that the probability of recalling a word for an average set size is centered at .5 (a 50/50 chance) with a great deal of uncertainty. The R command qlogis(.5) tells us that .5 corresponds to zero in log-odds space. How do we include a great deal of uncertainty? We could look at Figure 4.14, and decide on a standard deviation of 4 in a normal distribution centered at zero:

α ∼ Normal(0, 4)

Let’s plot this prior in log-odds and in probability scale by drawing random samples.

samples_logodds <- tibble(alpha = rnorm(100000, 0, 4))
samples_prob <- tibble(p = plogis(rnorm(100000, 0, 4)))
ggplot(samples_logodds, aes(alpha)) +
  geom_density()
ggplot(samples_prob, aes(p)) +
  geom_density()

FIGURE 4.15: The prior α ∼ Normal(0, 4) in log-odds and in probability space.

Figure 4.15 shows that our prior assigns more probability mass to extreme probabilities of
recall than to intermediate values. Clearly, this is not what we intended.

We could try several values for the standard deviation of the prior until we find a prior that makes sense for us. Reducing the standard deviation to 1.5 seems to make sense, as shown in Figure 4.16.

α ∼ Normal(0, 1.5)
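The code producing this figure is not shown at this point in the original text; it can presumably be obtained by adapting the earlier snippet, replacing the standard deviation of 4 with 1.5:

samples_logodds <- tibble(alpha = rnorm(100000, 0, 1.5))
samples_prob <- tibble(p = plogis(rnorm(100000, 0, 1.5)))
ggplot(samples_logodds, aes(alpha)) +
  geom_density()
ggplot(samples_prob, aes(p)) +
  geom_density()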
FIGURE 4.16: The prior α ∼ Normal(0, 1.5) in log-odds and in probability space.
We need to decide now on the prior for β, the effect in log-odds of increasing the set size. We
could choose a normal distribution centered on zero, reflecting our lack of any commitment
regarding the direction of the effect. Let’s get some intuitions regarding different possible
standard deviations for this prior, by testing the following distributions as priors:

a. β ∼ Normal(0, 1)

b. β ∼ Normal(0, .5)

c. β ∼ Normal(0, .1)

d. β ∼ Normal(0, .01)

e. β ∼ Normal(0, .001)

In principle, we could produce the prior predictive distributions using brms with sample_prior
= "only" and then predict() . However, as mentioned before, brms also uses Stan’s

Hamiltonian sampler for sampling from the priors, and this can lead to convergence problems
when the priors are too uninformative (as in this case). We solve this issue by performing prior
predictive checks directly in R using the r* family of functions (e.g., rnorm() , rbinom() ,
etc.) together with loops. This method is not as simple as using the convenient functions
provided by brms , but it is very flexible and can be very powerful. We show the prior
predictive distributions in Figure 4.17; for the details on the implementation in R, see Box 4.4.

Box 4.4 Prior predictive checks in R

The following function is an edited version of the earlier normal_predictive_distribution from Box 3.1 in section 3.3; it has been edited to make it compatible with logistic regression and dependent on set size.
As we did before, our custom function uses the purrr function map2_dfr() , which runs
an efficient for-loop, iterating over two vectors (here alpha_samples and beta_samples ),
and builds a data frame with the output.


logistic_model_pred <- function(alpha_samples,
                                beta_samples,
                                set_size,
                                N_obs) {
  map2_dfr(alpha_samples, beta_samples,
    function(alpha, beta) {
      tibble(
        set_size = set_size,
        # center size:
        c_set_size = set_size - mean(set_size),
        # change the likelihood:
        # Notice the use of a link function
        # for alpha and beta
        theta = plogis(alpha + c_set_size * beta),
        correct_pred = rbernoulli(N_obs, p = theta)
      )
    },
    .id = "iter"
  ) %>%
    # .id is always a string and needs
    # to be converted to a number
    mutate(iter = as.numeric(iter))
}

Let’s assume 800 observations with 200 observation for each set size:


N_obs <- 800
set_size <- rep(c(2, 4, 6, 8), 200)


Now, iterate over plausible standard deviations of β with the purrr function map_dfr() ,
which iterates over one vector (here sds_beta ), and also builds a data frame with the
output.


alpha_samples <- rnorm(1000, 0, 1.5)
sds_beta <- c(1, 0.5, 0.1, 0.01, 0.001)
prior_pred <- map_dfr(sds_beta, function(sd) {
  beta_samples <- rnorm(1000, 0, sd)
  logistic_model_pred(
    alpha_samples = alpha_samples,
    beta_samples = beta_samples,
    set_size = set_size,
    N_obs = N_obs
  ) %>%
    mutate(prior_beta_sd = sd)
})

## Warning: `rbernoulli()` was deprecated in purrr 1.0.0.

Calculate the accuracy for each one of the priors we want to examine, for each iteration,
and for each set size.


mean_accuracy <-
prior_pred %>%
group_by(prior_beta_sd, iter, set_size) %>%
summarize(accuracy = mean(correct_pred)) %>%
mutate(prior = paste0("Normal(0, ", prior_beta_sd, ")"))

Plot the accuracy in Figure 4.17 as follows.

mean_accuracy %>%
ggplot(aes(accuracy)) +
geom_histogram() +
facet_grid(set_size ~ prior) +
scale_x_continuous(breaks = c(0, .5, 1))

It’s sometimes more useful to look at the predicted differences in accuracy between set
sizes. We calculate them as follows, and plot them in Figure 4.18.


diff_accuracy <- mean_accuracy %>%
  arrange(set_size) %>%
  group_by(iter, prior_beta_sd) %>%
  mutate(diff_accuracy = accuracy - lag(accuracy)) %>%
  mutate(diffsize = paste(set_size, "-", lag(set_size))) %>%
  filter(set_size > 2)


diff_accuracy %>%
  ggplot(aes(diff_accuracy)) +
  geom_histogram() +
  facet_grid(diffsize ~ prior) +
  scale_x_continuous(breaks = c(-.5, 0, .5))

Figure 4.17 shows that, as expected, the priors are centered at zero. We see that the
distribution of possible accuracies for the prior that has a standard deviation of 1 is
problematic: There is too much probability concentrated near 0 and 1 for set sizes of 2 and 8.
It’s hard to tell the differences between the other priors, and it might be more useful to look at
the predicted differences in accuracy between set sizes in Figure 4.18.
FIGURE 4.17: The prior predictive distributions of mean accuracy of the model defined in section 4.3, for different set sizes and different priors for β.

FIGURE 4.18: The prior predictive distributions of differences in mean accuracy between set sizes of the model defined in section 4.3, for different priors for β.
If we are not sure whether the increase of set size could produce something between a null
effect and a relatively large effect, we can choose the prior with a standard deviation of 0.1 .
Under this reasoning, we settle on the following priors:

α ∼ Normal(0, 1.5)

β ∼ Normal(0, 0.1)

4.3.3 The brms model

Having decided on the likelihood, the link function, and the priors, the model can now be fit
using brms . We need to specify that the family is bernoulli() , and the link is logit .


fit_recall <- brm(correct ~ 1 + c_set_size,
  data = df_recall,
  family = bernoulli(link = logit),
  prior = c(
    prior(normal(0, 1.5), class = Intercept),
    prior(normal(0, .1), class = b, coef = c_set_size)
  )
)

Next, look at the summary of the posteriors of each of the parameters. Keep in mind that the
parameters are in log-odds space:


posterior_summary(fit_recall,
variable = c("b_Intercept", "b_c_set_size"))

## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 1.912 0.3003 1.352 2.5505
## b_c_set_size -0.182 0.0826 -0.343 -0.0215

Inspecting b_c_set_size , we see that increasing the set size has a detrimental effect on
recall, as we suspected.

Plot the posteriors as well (Figure 4.19):


plot(fit_recall)

FIGURE 4.19: The posterior distributions of the parameters in the brms model fit_recall, along with the corresponding trace plots.

Next, we turn to the question of what we can report as our results, and what we can conclude
from the data.

4.3.4 How to communicate the results?

Here, we are in a situation analogous to the one we saw earlier with the log-normal model. If we want to talk about the effect estimated by the model in log-odds space, we summarize the posterior of β in the following way: β̂ = −0.182, 95% CrI = [−0.343, −0.022].
However, the effect is easier to understand in proportions rather than in log-odds. Let’s look at
the average accuracy for the task first:


alpha_samples <- as_draws_df(fit_recall)$b_Intercept
av_accuracy <- plogis(alpha_samples)
c(mean = mean(av_accuracy), quantile(av_accuracy, c(0.025, 0.975)))

## mean 2.5% 97.5%
## 0.868 0.795 0.928

As before, to transform the effect of our manipulation to an easier to interpret scale (i.e.,
proportions), we need to take into account that the scale is not linear, and that the effect of
increasing the set size depends on the average accuracy, and the set size that we start from.

We can do the following calculation, similar to what we did for the trial effects experiment, to
find out the decrease in accuracy in proportions or probability scale:


beta_samples <- as_draws_df(fit_recall)$b_c_set_size
effect_middle <- plogis(alpha_samples) -
  plogis(alpha_samples - beta_samples)
c(mean = mean(effect_middle),
  quantile(effect_middle, c(0.025, 0.975)))

## mean 2.5% 97.5%
## -0.01893 -0.03800 -0.00226

Notice the interpretation here: if we increase the set size from the average set size minus one to the average set size, we get a reduction in the accuracy of recall of −0.019, 95% CrI = [−0.038, −0.002]. Recall that the average set size, 5, was not presented to the subject! We could alternatively look at the decrease in accuracy from a set size of 2 to 4:

four <- 4 - mean(df_recall$set_size)
two <- 2 - mean(df_recall$set_size)
effect_4m2 <-
  plogis(alpha_samples + four * beta_samples) -
  plogis(alpha_samples + two * beta_samples)
c(mean = mean(effect_4m2),
  quantile(effect_4m2, c(0.025, 0.975)))

## mean 2.5% 97.5%
## -0.02959 -0.05469 -0.00445

We can also back-transform to the probability scale using the function fitted() rather than plogis(). One advantage is that this will work regardless of the type of link function; in this section we only discussed the logit link, but other link functions can be used in generalized linear models (e.g., the probit link; see Dobson and Barnett (2011)).
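As a sketch of that flexibility (not part of the original analysis), the same model could be fit with a probit link by changing the family argument. Note that the priors below are simply carried over for illustration and would need to be recalibrated, since effects on the probit scale are roughly 1.6 times smaller than on the logit scale:

fit_recall_probit <- brm(correct ~ 1 + c_set_size,
  data = df_recall,
  family = bernoulli(link = "probit"),
  prior = c(
    # illustrative only; these were chosen for the logit scale
    prior(normal(0, 1.5), class = Intercept),
    prior(normal(0, .1), class = b, coef = c_set_size)
  )
)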

Since the set size is the only predictor and it is centered, for estimating the average accuracy,
we can consider an imaginary observation where the c_set_size is zero. (If there were more
centered predictors, we would need to set all of them to zero). Now we can use the summary
argument provided by fitted() .


fitted(fit_recall,
newdata = data.frame(c_set_size = 0),
summary = TRUE)[,c("Estimate", "Q2.5","Q97.5")]

## Estimate Q2.5 Q97.5
## 0.868 0.795 0.928

For estimating the difference in accuracy from the average set size minus one to the average
set size, and from a set size of two to four, first, define newdata with these set sizes.

new_sets <- data.frame(c_set_size = c(0, -1, four, two))
set_sizes <- fitted(fit_recall,
newdata = new_sets,
summary = FALSE)

Then calculate the appropriate differences considering that column one of set_sizes
corresponds to the average set size, column two to the average set size minus one and so
forth.


effect_middle <- set_sizes[, 1] - set_sizes[, 2]
effect_4m2 <- set_sizes[, 3] - set_sizes[, 4]

Finally, calculate the summaries.


c(mean = mean(effect_middle), quantile(effect_middle,
  c(0.025, 0.975)))

## mean 2.5% 97.5%
## -0.01893 -0.03800 -0.00226


c(mean = mean(effect_4m2), quantile(effect_4m2,
  c(0.025, 0.975)))

## mean 2.5% 97.5%
## -0.02959 -0.05469 -0.00445

As expected, we get exactly the same values with fitted() as when we calculate them "by hand."
4.3.5 Descriptive adequacy

One potentially useful aspect of posterior distributions is that we can also make predictions for conditions not presented in the actual experiment, such as set sizes that weren't tested. We could then carry out another experiment to investigate whether our model's predictions hold. To make predictions for other set sizes, we extend our data set, adding rows with set sizes of 3, 5, and 7. To be consistent with the data of the other set sizes in the experiment, we add 23 trials of each new set size (this is the number of trials per set size in the data set). Something important to notice is that we need to center our predictor based on the original mean set size. This is because we want to maintain our interpretation of the intercept. We extend the data as follows, and we summarize the data and plot it in Figure 4.20.


df_recall_ext <- df_recall %>%
  bind_rows(tibble(
    set_size = rep(c(3, 5, 7), 23),
    c_set_size = set_size -
      mean(df_recall$set_size),
    correct = 0
  ))
# nicer labels for the facets:
set_size <- paste("set size", 2:8) %>%
  setNames(-3:3)
pp_check(fit_recall,
  type = "stat_grouped",
  stat = "mean",
  group = "c_set_size",
  newdata = df_recall_ext,
  facet_args = list(
    ncol = 1, scales = "fixed",
    labeller = as_labeller(set_size)
  ),
  binwidth = 0.02
)
FIGURE 4.20: The distributions of posterior predicted mean accuracies for tested set sizes (2, 4, 6, and 8) and untested ones (3, 5, and 7) are labeled with yrep. The observed mean accuracies, y, are only relevant for the tested set sizes (2, 4, 6, and 8); the "observed" accuracies of the untested set sizes are represented as 0.
We could now gather new data in an experiment that also shows set sizes of 3, 5, and 7.
These data would be held out from the model fit_recall , since the model was fit when
those data were not available. Verifying that the new observations fit in our already generated
posterior predictive distribution would be a way to test genuine predictions from our model.

Having seen how we can fit simple regression models, we turn to hierarchical models in the
next chapter.

4.4 Summary

In this chapter, we learned how to fit simple linear regression models and to fit and interpret
models with a log-normal likelihood and logistic regression models. We investigated the prior
specification for the models, using prior predictive checks, and the descriptive adequacy of the
models using posterior predictive checks.
4.5 Further reading

Linear regression is discussed in several classic textbooks; these have largely a frequentist
orientation, but the basic theory of linear modeling presented there can easily be extended to
the Bayesian framework. An accessible textbook is by Dobson and Barnett (2011). Other
useful textbooks on linear modeling are Harrell Jr (2015), Faraway (2016), Fox (2015), and
Montgomery, Peck, and Vining (2012).

4.6 Exercises

Exercise 4.1 A simple linear regression: Power posing and testosterone.

Load the following data set:


data("df_powerpose")
head(df_powerpose)

## id hptreat female age testm1 testm2
## 2 29 High Male 19 38.7 62.4
## 3 30 Low Female 20 32.8 29.2
## 4 31 High Female 20 32.3 27.5
## 5 32 Low Female 18 18.0 28.7
## 7 34 Low Female 21 73.6 44.7
## 8 35 High Female 20 80.7 105.5

The data set, which was originally published in Carney, Cuddy, and Yap (2010) but released in
modified form by Fosse (2016), shows the testosterone levels of 39 different individuals,
before and after treatment, where treatment refers to each individual being assigned to a high
power pose or a low power pose. In the original paper by Carney, Cuddy, and Yap (2010), the
unit given for testosterone measurement (estimated from saliva samples) was picograms per
milliliter (pg/ml). One picogram per milliliter is 0.001 nanogram per milliliter (ng/ml).

The research hypothesis is that on average, assigning a subject a high power pose vs. a low
power pose will lead to higher testosterone levels after treatment. Assuming that you know
nothing about normal ranges of testosterone using salivary measurement, choose an
appropriate Cauchy prior (e.g., Cauchy(0, 2.5) ) for the target parameter(s).

Investigate this claim using a linear model and the default priors of brms . You’ll need to
estimate the effect of a new variable that encodes the change in testosterone.

Exercise 4.2 Another linear regression model: Revisiting attentional load effect on pupil size.

Here, we revisit the analysis shown in the chapter, on how attentional load affects pupil size.

a. Our priors for this experiment were quite arbitrary. What do the prior predictive distributions look like? Do they make sense?
b. Is our posterior distribution sensitive to the priors that we selected? Perform a sensitivity analysis to find out whether the posterior is affected by our choice of prior for σ.
c. Our data set also includes a column that indicates the trial number. Could it be that trial also has an effect on the pupil size? As in lm, we indicate another main effect with a + sign. How would you communicate the new results?

Exercise 4.3 Log-normal model: Revisiting the effect of trial on finger tapping times.

We continue considering the effect of trial on finger tapping times.

a. Estimate the slowdown in milliseconds between the last two times the subject pressed the
space bar in the experiment.
b. How would you change your model (keeping the log-normal likelihood) so that it includes
centered log-transformed trial numbers or square-root-transformed trial numbers (instead
of centered trial numbers)? Does the effect in milliseconds change?

Exercise 4.4 Logistic regression: Revisiting the effect of set size on free recall.

Our data set also includes a column coded as tested that indicates the position of the cued word. (In Figure 4.13, tested would be 3.) Could it be that position also has an effect on recall accuracy? How would you incorporate this in the model? (We indicate another main effect with a + sign.)

Exercise 4.5 Red is the sexiest color.

Load the following data set:


data("df_red")
head(df_red)
## risk age red pink redorpink
## 8 0 19 0 0 0
## 9 0 25 0 0 0
## 10 0 20 0 0 0
## 11 0 20 0 0 0
## 14 0 20 0 0 0
## 15 0 18 0 0 0

The data set is from a study (Beall and Tracy 2013) that contains information about the color
of the clothing worn (red, pink, or red or pink) when the subject (female) is at risk of becoming
pregnant (is ovulating, self-reported). The broader issue being investigated is whether women
wear red more often when they are ovulating (in order to attract a mate). Using logistic
regressions, fit three different models to investigate whether being ovulating increases the
probability of wearing (a) red, (b) pink, or (c) either pink or red. Use priors that are reasonable
(in your opinion).

References

Beall, Alec T., and Jessica L. Tracy. 2013. “Women Are More Likely to Wear Red or Pink at
Peak Fertility.” Psychological Science 24 (9). Sage Publications Sage CA: Los Angeles, CA:
1837–41.

Blumberg, Eric J., Matthew S. Peterson, and Raja Parasuraman. 2015. “Enhancing Multiple
Object Tracking Performance with Noninvasive Brain Stimulation: A Causal Role for the
Anterior Intraparietal Sulcus.” Frontiers in Systems Neuroscience 9: 3.
https://doi.org/10.3389/fnsys.2015.00003.

Brée, David S. 1975. “The Distribution of Problem-Solving Times: An Examination of the Stages Model.” British Journal of Mathematical and Statistical Psychology 28 (2): 177–200. https://doi.org/10/cnx3q7.

Buzsáki, György, and Kenji Mizuseki. 2014. “The Log-Dynamic Brain: How Skewed
Distributions Affect Network Operations.” Nature Reviews Neuroscience 15 (4): 264–78.
https://doi.org/10.1038/nrn3687.

Bürki, Audrey, Francois-Xavier Alario, and Shravan Vasishth. 2022. “When Words Collide:
Bayesian Meta-Analyses of Distractor and Target Properties in the Picture-Word Interference
Paradigm.” Quarterly Journal of Experimental Psychology.
Bürki, Audrey, Shereen Elbuy, Sylvain Madec, and Shravan Vasishth. 2020. “What Did We
Learn from Forty Years of Research on Semantic Interference? A Bayesian Meta-Analysis.”
Journal of Memory and Language. https://doi.org/10.1016/j.jml.2020.104125.

Carney, Dana R, Amy JC Cuddy, and Andy J Yap. 2010. “Power Posing: Brief Nonverbal
Displays Affect Neuroendocrine Levels and Risk Tolerance.” Psychological Science 21 (10).
Sage Publications Sage CA: Los Angeles, CA: 1363–8.

Cox, Christopher Martin Mikkelsen, Tamar Keren-Portnoy, Andreas Roepstorff, and Riccardo
Fusaroli. 2022. “A Bayesian Meta-Analysis of Infants’ Ability to Perceive Audio–Visual
Congruence for Speech.” Infancy 27 (1). Wiley Online Library: 67–96.

Dobson, Annette J, and Adrian Barnett. 2011. An Introduction to Generalized Linear Models.
CRC press.

Faraway, Julian J. 2016. Extending the Linear Model with R: Generalized Linear, Mixed
Effects and Nonparametric Regression Models. Chapman; Hall/CRC.

Fosse, Nathan E. 2016. “Replication Data for ‘Power Posing: Brief Nonverbal Displays Affect
Neuroendocrine Levels and Risk Tolerance’ by Carney, Cuddy, Yap (2010).” Harvard
Dataverse. https://doi.org/10.7910/DVN/FMEGS6.

Fox, John. 2015. Applied Regression Analysis and Generalized Linear Models. Sage
Publications.

Harrell Jr, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models,
Logistic and Ordinal Regression, and Survival Analysis. New York, NY: Springer.

Hayes, Taylor R., and Alexander A. Petrov. 2016. “Mapping and Correcting the Influence of
Gaze Position on Pupil Size Measurements.” Behavior Research Methods 48 (2): 510–27.
https://doi.org/10.3758/s13428-015-0588-x.

Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference
in Sentence Comprehension: Literature review and Bayesian meta-analysis.” Journal of
Memory and Language 94: 316–39. https://doi.org/https://doi.org/10.1016/j.jml.2017.01.004.

Johnson, Norman L, Samuel Kotz, and Narayanaswamy Balakrishnan. 1995. Continuous Univariate Distributions. Vol. 289. John Wiley & Sons.

Limpert, Eckhard, Werner A. Stahel, and Markus Abbt. 2001. “Log-Normal Distributions Across
the Sciences: Keys and Clues.” BioScience 51 (5): 341. https://doi.org/10.1641/0006-
3568(2001)051[0341:LNDATS]2.0.CO;2.
Mahowald, Kyle, Ariel James, Richard Futrell, and Edward Gibson. 2016. “A Meta-Analysis of
Syntactic Priming in Language Production.” Journal of Memory and Language 91. Elsevier: 5–
27.

Mathot, Sebastiaan. 2018. “Pupillometry: Psychology, Physiology, and Function.” Journal of Cognition 1 (1): 16. https://doi.org/10.5334/joc.18.

Montgomery, D. C., E. A. Peck, and G. G. Vining. 2012. An Introduction to Linear Regression Analysis. 5th ed. Hoboken, NJ: Wiley.

Nicenboim, Bruno, Timo B. Roettger, and Shravan Vasishth. 2018. “Using Meta-Analysis for
Evidence Synthesis: The case of incomplete neutralization in German.” Journal of Phonetics
70: 39–55. https://doi.org/https://doi.org/10.1016/j.wocn.2018.06.001.

Oberauer, Klaus. 2019. “Working Memory Capacity Limits Memory for Bindings.” Journal of
Cognition 2 (1): 40. https://doi.org/10.5334/joc.86.

Oberauer, Klaus, and Reinhold Kliegl. 2001. “Beyond Resources: Formal Models of
Complexity Effects and Age Differences in Working Memory.” European Journal of Cognitive
Psychology 13 (1-2). Routledge: 187–215. https://doi.org/10.1080/09541440042000278.

Pylyshyn, Zenon W., and Ron W. Storm. 1988. “Tracking Multiple Independent Targets:
Evidence for a Parallel Tracking Mechanism.” Spatial Vision 3 (3): 179–97.
https://doi.org/10.1163/156856888X00122.

Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. New York: Chapman; Hall,
CRC Press.

Spector, Robert H. 1990. “The Pupils.” In Clinical Methods: The History, Physical, and
Laboratory Examinations, edited by H. Kenneth Walker, W. Dallas Hall, and J. Willis Hurst, 3rd
ed. Boston: Butterworths.

Ulrich, Rolf, and Jeff Miller. 1993. “Information Processing Models Generating Lognormally
Distributed Reaction Times.” Journal of Mathematical Psychology 37 (4): 513–25.
https://doi.org/10.1006/jmps.1993.1032.

Ulrich, Rolf, and Jeff Miller. 1994. “Effects of Truncation on Reaction Time Analysis.” Journal of
Experimental Psychology: General 123 (1): 34–80. https://doi.org/10/b8tsnh.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese
Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public
Library of Science: 1–14.
Wagenmakers, Eric-Jan, Raoul P. P. P. Grasman, and Peter C. M. Molenaar. 2005. “On the
Relation Between the Mean and the Variance of a Diffusion Model Response Time
Distribution.” Journal of Mathematical Psychology 49 (3): 195–204.
https://doi.org/10.1016/j.jmp.2005.02.003.

Wahn, Basil, Daniel P. Ferris, W. David Hairston, and Peter König. 2016. “Pupil Sizes Scale
with Attentional Load and Task Experience in a Multiple Object Tracking Task.” PLOS ONE 11
(12): e0168087. https://doi.org/10.1371/journal.pone.0168087.

13. The average pupil size will probably be higher than 800, since this measurement was with no load, but, in any case, the exact number won’t matter; any mean for the prior between 500 and 1500 would be fine if the standard deviation is large.

14. These transformations are visible when checking the generated Stan code using make_stancode .

15. Odds are defined to be the ratio of the probability of success to the probability of failure. For example, the odds of obtaining a one in a fair six-sided die are $\frac{1/6}{1 - 1/6} = 1/5$. The odds of obtaining a heads in a fair coin are $1/1$. Do not confuse this technical term with the day-to-day usage of the word “odds” to mean probability.
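In R, this definition could be written as a one-line helper (a sketch, not part of the text):

# Odds of an event with probability p: p / (1 - p)
odds <- function(p) p / (1 - p)
odds(1 / 6) # 0.2, i.e., odds of 1/5
odds(1 / 2) # 1, i.e., odds of 1/1, the fair coin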

Chapter 5 Bayesian hierarchical models

Usually, experimental data in cognitive science contain “clusters”. These are natural groups
that contain observations that are more similar within the clusters than between them. The
most common examples of clusters in experimental designs are subjects and experimental
items (e.g., words, pictures, objects that are presented to the subjects). These clusters arise
because we have multiple (repeated) observations for each subject, and for each item. If we
want to incorporate this grouping structure in our analysis, we generally use a hierarchical
model (also called a multilevel or mixed model; Pinheiro and Bates 2000). This kind of
clustering and hierarchical modeling arises as a consequence of the idea of exchangeability.

5.1 Exchangeability and hierarchical models

Exchangeability is the Bayesian analog of the phrase “independent and identically distributed”
that appears regularly in classical (i.e., frequentist) statistics. Some connections and
differences between exchangeability and the frequentist concept of independent and
identically distributed (iid) are detailed in Box 5.1.

Informally, the idea of exchangeability is as follows. Suppose we assign a numerical index to each of the levels of a group (e.g., to each subject). When the levels are exchangeable, we
can reassign the indices arbitrarily and lose no information; that is, the joint distribution will
remain the same, because we don’t have any different prior information for each cluster (here,
for each subject). In hierarchical models, we treat the levels of the group as exchangeable,
and observations within each level in the group as also exchangeable. We generally include
predictors at the level of the observations; these are the predictors that correspond to the experimental manipulations (e.g., attentional load, trial number, cloze probability, etc.). We may also include predictors at the group level; these are predictors that indicate characteristics of the levels in the group (e.g., the working memory capacity score of each subject). Then the conditional distributions given these explanatory variables would be exchangeable; that is, our predictors incorporate all the information that is not exchangeable, and when we factor the predictors out, the observations or units in the group are exchangeable. This is the reason why the item number is an appropriate cluster, but the trial number is not: in the first case, if we permute the numbering of the items, there is no loss of information because the indices are exchangeable; all the information about the items is incorporated as predictors in the model. In the second case, the numbering of the trials includes information that will be lost if we treat the trials as exchangeable. For example, consider the case where, as trial numbers increase, subjects get more experienced or fatigued. Even if we are not interested in the specific cluster-level
estimates, hierarchical models allow us to generalize to the underlying population (subjects,
items) from which the clusters in the sample were drawn. For more on exchangeability, consult
the further reading at the end of the chapter.

Exchangeability is important in Bayesian statistics because of a theorem called the Representation Theorem (de Finetti 1931). This theorem states that if a sequence of random
variables is exchangeable, then the prior distributions on the parameters in a model are a
necessary consequence; priors are not merely an arbitrary addition to the frequentist modeling
approach that we are familiar with.

Furthermore, exchangeability has been shown (Bernardo and Smith 2009) to be mathematically equivalent to assuming a hierarchical structure in the model. The argument goes as follows. Suppose that the parameters for each level in a group are μi , where the levels are labeled i = 1, … , I . An example of groups is subjects. Suppose also that the data yn , where n = 1, … , N , are observations from these subjects (e.g., pupil size measurements,
IQ scores, or any other approximately normally distributed outcome). The data are assumed to
be generated as

yn ∼ Normal(μsubj[n] , σ)

The notation subj[n] , which roughly follows Gelman and Hill (2007), identifies the subject index. Suppose that 20 subjects respond 50 times each. If the data are ordered by subject id, then subj[1] to subj[50] correspond to the subject with id i = 1, subj[51] to subj[100] correspond to the subject with id i = 2, and so forth.

We can code this representation in a straightforward way in R:


N_subj <- 20
N_obs_subj <- 50
N <- N_subj * N_obs_subj
df <- tibble(row = 1:N,
subj = rep(c(1:N_subj), each = N_obs_subj))

df
## # A tibble: 1,000 × 2
## row subj
## <int> <int>
## 1 1 1
## 2 2 1

## 3 3 1
## # … with 997 more rows


# Example:

c(df$subj[1], df$subj[2], df$subj[51])

## [1] 1 1 2

If the data yn are exchangeable, the parameters μi are also exchangeable. The fact that the
μi are exchangeable can be shown (Bernardo and Smith 2009) to be mathematically
equivalent to assuming that they come from a common distribution, for example:

μi ∼ Normal(μ, τ )

To make this more concrete, assume some completely arbitrary true values for the
parameters, and generate observations based on a hierarchical process in R.


mu <- 100
tau <- 15
sigma <- 4
mu_i <- rnorm(N_subj, mu, tau)

df_h <- mutate(df, y = rnorm(N, mu_i[subj], sigma))


df_h
## # A tibble: 1,000 × 3
## row subj y
## <int> <int> <dbl>
## 1 1 1 74.8
## 2 2 1 74.2

## 3 3 1 74.2
## # … with 997 more rows

The parameters μ and τ , called hyperparameters, are unknown and have prior distributions
(hyperpriors) defined for them. This fact leads to a hierarchical relationship between the
parameters: there is a common parameter μ for all the levels of a group, and the parameters
μi are assumed to be generated as a function of this common parameter μ . Here, τ represents between-group variability, and σ represents within-group variability. The three parameters have priors defined for them. The first two priors below are called hyperpriors.

p(μ)

p(τ )

p(σ)

In such a model, information about μi comes from two sources:

a. from each of the observed yn corresponding to the respective μsubj[n] parameter, and

b. from the parameters μ and τ that led to all the other yk (where k ≠ n ) being generated.

This is illustrated in Figure 5.1.

Fit this model in brms in the following way. Intercept corresponds to μ , sigma to σ, and
sd to τ . For now the prior distributions are arbitrary.


fit_h <- brm(y ~ 1 + (1 | subj), df_h,
  prior =
    c(prior(normal(50, 200), class = Intercept),
      prior(normal(2, 5), class = sigma),
      prior(normal(10, 20), class = sd)),
  # increase iterations to avoid convergence issues
  iter = 4000,
  warmup = 1000)

fit_h

## ...
## Group-Level Effects:
## ~subj (Number of levels: 20)

## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS


## sd(Intercept) 15.18 2.68 11.02 21.32 1.01 563
## Tail_ESS
## sd(Intercept) 971
##
## Population-Level Effects:

## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS


## Intercept 92.08 3.44 85.22 98.55 1.01 486 788
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS

## sigma 4.08 0.09 3.91 4.26 1.00 2375 3141


##
## ...

In this output, Intercept corresponds to the posterior of μ , sigma to σ, and sd(Intercept) to τ . There is more information in the brms object: we can also get the posteriors for each level of our group. However, rather than estimating μi , brms estimates the adjustments to μ , ui , named r_subj[i,Intercept] , so that μi = μ + ui . See the code below.


# Extract the posterior estimates of u_i


u_i_post <- as_draws_df(fit_h) %>%
select(starts_with("r_subj"))

# Extract the posterior estimate of mu


mu_post <- as_draws_df(fit_h)$b_Intercept
# Build the posterior estimate of mu_i
mu_i_post <- mu_post + u_i_post
colMeans(mu_i_post) %>% unname()
## [1] 72.7 115.5 96.7 85.6 62.5 96.8 103.8 89.5 118.4 81.7 68.1
## [12] 98.4 93.2 83.8 85.4 101.7 102.5 98.6 92.6 87.9


# Compare with true values


mu_i

## [1] 72.5 115.3 97.1 86.0 62.5 97.0 104.0 88.5 119.3 82.6 67.8
## [12] 98.6 92.4 84.2 85.0 102.1 102.9 98.2 92.3 88.9

FIGURE 5.1: A directed acyclic graph illustrating a hierarchical model (partial pooling).

There are two other configurations possible that do not involve this hierarchical structure and
which represent two alternative, extreme scenarios.

One of these two configurations is called the complete pooling model. Here, the data yn are assumed to be generated from a single distribution:

yn ∼ Normal(μ, σ).

This model is an intercept-only regression, similar to what we saw in chapter 3.

Generate fake observations in a vector y based on arbitrary true values in R in the following
way.

sigma <- 4
mu <- 100
df_cp <- mutate(df, y = rnorm(N, mu, sigma))
df_cp

## # A tibble: 1,000 × 3
## row subj y
## <int> <int> <dbl>
## 1 1 1 94.1
## 2 2 1 100.
## 3 3 1 95.3

## # … with 997 more rows

Fit it in brms .


fit_cp <- brm(y ~ 1, df_cp,
  prior =
    c(prior(normal(50, 200), class = Intercept),
      prior(normal(2, 5), class = sigma)))


fit_cp

## ...
## Population-Level Effects:

## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS


## Intercept 99.91 0.13 99.66 100.16 1.00 3341 2321
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS

## sigma 4.03 0.09 3.86 4.21 1.00 3192 2651


##
## ...
The configuration of the complete pooling model is illustrated in Figure 5.2.

FIGURE 5.2: A directed acyclic graph illustrating a complete pooling model.

The other configuration is called the no pooling model; here, each yn is assumed to be
generated from an independent distribution:

yn ∼ Normal(μsubj[n] , σ)

Generate fake observations from the no pooling model in R with arbitrary true values.


sigma <- 4
mu_i <- c(156, 178, 95, 183, 147, 191, 67, 153, 129, 119, 195,
150, 172, 97, 110, 115, 78, 126, 175, 80)
df_np <- mutate(df, y = rnorm(N, mu_i[subj], sigma))

df_np
## # A tibble: 1,000 × 3
## row subj y
## <int> <int> <dbl>
## 1 1 1 159.
## 2 2 1 158.

## 3 3 1 157.
## # … with 997 more rows

Fit it in brms . By using the formula 0 + factor(subj) , we remove the common intercept and
force the model to estimate one intercept for each level of subj . The column subj is
converted to a factor so that brms does not interpret it as a number.


fit_np <- brm(y ~ 0 + factor(subj), df_np,
  prior =
    c(prior(normal(0, 200), class = b),
      prior(normal(2, 5), class = sigma)))

The summary now shows the 20 estimates of μi as b_factorsubj and σ. (We ignore lp__ and lprior .)


fit_np %>% posterior_summary()


## Estimate Est.Error Q2.5 Q97.5
## b_factorsubj1 156.49 0.5626 155.39 157.63
## b_factorsubj2 177.81 0.5557 176.72 178.90
## b_factorsubj3 94.34 0.5505 93.27 95.43
## b_factorsubj4 182.62 0.5568 181.51 183.69

## b_factorsubj5 147.01 0.5521 145.90 148.08


## b_factorsubj6 191.90 0.5520 190.80 192.95
## b_factorsubj7 66.37 0.5579 65.29 67.45
## b_factorsubj8 152.22 0.5806 151.06 153.37
## b_factorsubj9 129.59 0.5469 128.51 130.67
## b_factorsubj10 118.55 0.5374 117.48 119.61

## b_factorsubj11 195.33 0.5406 194.27 196.38


## b_factorsubj12 149.13 0.5509 148.07 150.22
## b_factorsubj13 171.14 0.5634 170.00 172.24
## b_factorsubj14 97.30 0.5677 96.17 98.41
## b_factorsubj15 110.73 0.5762 109.59 111.89

## b_factorsubj16 115.22 0.5605 114.13 116.29


## b_factorsubj17 77.74 0.5589 76.66 78.85
## b_factorsubj18 126.08 0.5448 124.99 127.16
## b_factorsubj19 174.36 0.5655 173.23 175.48
## b_factorsubj20 80.50 0.5484 79.44 81.57

## sigma 3.94 0.0898 3.77 4.13


## lprior -131.51 0.0111 -131.53 -131.49
## lp__ -2920.89 3.1840 -2927.92 -2915.52

Unlike the hierarchical model, now there is no common distribution that generates the μi parameters. This is illustrated in Figure 5.3.

FIGURE 5.3: A directed acyclic graph illustrating a no pooling model.


The hierarchical model lies between these two extremes and for this reason is sometimes
called a partial pooling model. One way that the hierarchical model is often described is that
the estimates μi “borrow strength” from the parameter μ (which represents the grand mean in
the above example).

An important practical consequence of partial pooling is the idea of “borrowing strength from the mean”: if we have very sparse data from a particular member i of a group (e.g., missing data from a particular subject), the estimate μi of that particular group member is determined by the parameter μ . In other words, when the data are sparse for group member i , the posterior estimate μi is determined largely by the prior p(μ) . In this sense, even the frequentist hierarchical modeling software in R, lmer from the package lme4 , is essentially Bayesian in formulation (except of course that there is no prior as such on μ ).
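For comparison, the frequentist analog of the partial pooling model fit above could be fit with lmer (a minimal sketch, assuming lme4 is installed):

library(lme4)
# Frequentist analog of fit_h: varying intercepts by subject
fit_h_freq <- lmer(y ~ 1 + (1 | subj), data = df_h)
summary(fit_h_freq)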

So far we have focused on the structure of μ , the location parameter of the likelihood. We could also have partial pooling, complete pooling, or no pooling with respect to σ, the scale parameter of the likelihood. More generally, any parameter of a likelihood can have any of these kinds of pooling.

In the coming sections, we will look at each of these models in more detail, using realistic examples.

Box 5.1 Finitely exchangeable random variables

Formally, we say that the random variables Y1 , … , YN are finitely exchangeable if, for
any set of particular outcomes of an experiment y1 , … , yN , the probability p(y1 , … , yN )

that we assign to these outcomes is unaffected by permuting the labels given to the
variables. In other words, for any permutation π(n) , where n = 1, … , N ( π is a function
that takes as input the positive integer n and returns another positive integer; e.g., the
function takes a subject indexed as 1, and returns index 3), we can reasonably assume
that p(y1 , … , yN ) = p(yπ(1) , … , yπ(N ) ) . A simple example is a coin tossed twice.
Suppose the first coin toss is Y1 = 1 , a heads, and the second coin toss is Y2 = 0 , a tails.
If we are willing to assume that the probability of getting one heads is unaffected by
whether it appears in the first or the second toss, i.e.,
p(Y1 = 1, Y2 = 0) = p(Y1 = 0, Y2 = 1) , then we assume that the indices are
exchangeable.

Some important connections and differences between exchangeability and the frequentist
concept of independent and identically distributed (iid):
If the data are exchangeable, they are not necessarily iid. For example, suppose
you have a box with one black ball and two red balls in it. Your task is to repeatedly
draw a ball at random. Suppose that in your first draw, you draw one ball and get the
black ball. The probability of getting a black ball in the next two draws is now 0.
However, if in your first draw you had retrieved a red ball, then there is a non-zero
probability of drawing a black ball in the next two draws. The outcome in the first draw
affects the probability of subsequent draws; they are not independent. But the
sequence of random variables is exchangeable. To see this, consider the following: If
a red ball is drawn, count it as a 0, and if a black ball is drawn, then count it as 1.
Then, the three possible outcomes and the probabilities are

;
1, 0, 0 P (X1 = 1, X2 = 0, X3 = 0) =
1

3
× 1 × 1 =
1

3
2 1 1
0, 1, 0 P (X1 = 0, X2 = 1, X3 = 0) = × × 1 =
3 2 3
2 1 1
0, 0, 1 P (X1 = 0, X2 = 0, X3 = 1) = × × 1 =
3 2 3

The random variables X1 , X2 , X3 can be permuted and the joint probability distribution (technically, the PMF) is the same in each case. (A small simulation sketch of this urn example appears right after this box.)

If the data are exchangeable, then they are identically distributed. For example, in the box containing one black ball and two red balls, suppose we count the draw of a black ball as a 1, and the draw of a red ball as a 0. Then the probability $P(X_1 = 1) = \frac{1}{3}$ and $P(X_1 = 0) = \frac{2}{3}$; this is also true for $X_2$ and $X_3$. That is, these random variables are identically distributed.

If the data are iid in the standard frequentist sense, then they are exchangeable. For example, suppose you have i = 1, … , n instances of a random variable X whose PDF is f (x) . Suppose also that the Xi are iid. The joint PDF (this can be discrete or continuous, i.e., a PMF or PDF) is

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f(x_1) \cdot \dots \cdot f(x_n)$$

Because the terms on the right-hand side can be permuted, the labels can be permuted on any of the $x_i$. This means that $X_1, \ldots, X_n$ are exchangeable.
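Here is the simulation sketch of the urn example from the first point above (not from the text): we permute one black ball (coded 1) and two red balls (coded 0) many times, and check that the black ball lands in each of the three positions with probability 1/3.

# Each column is one simulated sequence of three draws without replacement
set.seed(123)
draws <- replicate(100000, sample(c(1, 0, 0)))
# Estimated probability of drawing the black ball in each position:
rowMeans(draws) # all three are approximately 1/3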
5.2 A hierarchical model with a normal likelihood: The
N400 effect

Event-related potentials (ERPs) allow scientists to observe electrophysiological responses in the brain, measured by means of electroencephalography (EEG), that are time-locked to a specific event (i.e., the presentation of the stimuli). A very robust ERP effect in the study of language is the N400. Words with low predictability are accompanied by an N400 effect in comparison with highly predictable words: a relative negativity that peaks around 300-500 ms after word onset over central parietal scalp sites (first reported by Kutas and Hillyard 1980 for semantic anomalies, and in 1984 for low-predictability words; for a review, see Kutas and Federmeier 2011). The N400 is illustrated in Figure 5.4.


FIGURE 5.4: Typical ERP for the grand average across the N400 spatial window (central
parietal electrodes: Cz, CP1, CP2, P3, Pz, P4, POz) for high and low predictability nouns
(specifically from the constraining context of the experiment reported in Nicenboim, Vasishth,
and Rösler 2020a). The x-axis indicates time in seconds and the y-axis indicates voltage in
microvolts (unlike many EEG/ERP plots, the negative polarity is plotted downwards).

For example, in (1) below, the continuation ‘paint’ has higher predictability than the
continuation ‘dog’, and thus we would expect a more negative signal, that is, an N400 effect,
in ‘dog’ in (b) in comparison with ‘paint’ in (a). It is often the case that predictability is
measured with a cloze task (see section 1.4).

1. Example from Kutas and Hillyard (1984)


a. Don’t touch the wet paint.
b. Don’t touch the wet dog.

The EEG data are typically recorded from tens of electrodes every couple of milliseconds, but for our purposes (i.e., for learning about Bayesian hierarchical models), we can safely ignore the complexity of the data. A common way to simplify the high-dimensional EEG data when we are dealing with the N400 is to focus on the average amplitude of the EEG signal in the typical spatio-temporal window of the N400 (for example, see Frank et al. 2015).

For this example, we are going to focus on the N400 effect for critical nouns from a subset of
the data of Nieuwland et al. (2018). Nieuwland et al. (2018) presented a replication attempt of
an original experiment of DeLong, Urbach, and Kutas (2005) with sentences like (2).

2. Example from DeLong, Urbach, and Kutas (2005)


a. The day was breezy so the boy went outside to fly a kite.
b. The day was breezy so the boy went outside to fly an airplane.

We’ll ignore the goals of the original experiment (DeLong, Urbach, and Kutas 2005) and of its replication (Nieuwland et al. 2018), and focus on the N400 at the final nouns in the experimental stimuli. In example (2), the final noun ‘kite’ has higher predictability than ‘airplane’, and thus we would expect a more negative signal in ‘airplane’ in (b) in comparison with ‘kite’ in (a).

To speed up computation, we restrict the data set of Nieuwland et al. (2018) to the subjects
from the Edinburgh lab. This subset of the data can be found in df_eeg in the bcogsci
package. Center the cloze probability before using it as a predictor.


data("df_eeg")
(df_eeg <- df_eeg %>%
mutate(c_cloze = cloze - mean(cloze)))
## # A tibble: 2,863 × 7
## subj cloze item n400 cloze_ans N c_cloze
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 7.08 0 44 -0.476
## 2 1 0.03 2 -0.68 1 44 -0.446

## 3 1 1 3 1.39 44 44 0.524
## # … with 2,860 more rows


# Number of subjects

df_eeg %>%
distinct(subj) %>%
count()

## # A tibble: 1 × 1
## n
## <int>
## 1 37

One convenient aspect of using averages of EEG data is that they are roughly normally
distributed. This allows us to use the normal likelihood. Figure 5.5 shows the distribution of the
data.

df_eeg %>% ggplot(aes(n400)) +
geom_histogram(
binwidth = 4,
colour = "gray",

alpha = .5,
aes(y = after_stat(density))
) +
stat_function(fun = dnorm, args = list(
mean = mean(df_eeg$n400),
sd = sd(df_eeg$n400)

)) +
xlab("Average voltage in microvolts for
the N400 spatiotemporal window")


FIGURE 5.5: Histogram of the N400 averages for every trial; overlaid is a density plot of a normal distribution.
5.2.1 Complete pooling model (Mcp )

We’ll start with the simplest model, which is basically the linear regression we encountered in the preceding chapter.

5.2.1.1 Model assumptions

This model, call it Mcp , makes the following assumptions.

1. The EEG averages for the N400 spatiotemporal window are normally distributed.
2. Observations are independent.
3. There is a linear relationship between cloze and the EEG signal for the trial.

This model is incorrect for these data due to assumption (2) being violated.

With the last assumption, we are saying that the difference in the average signal when we
compare nouns with cloze probability of 0 and 0.1 is the same as the difference in the signal
when we compare nouns with cloze values of 0.1 and 0.2 (or 0.9 and 1). This is just an
assumption, and it may not necessarily be the case in the actual data. This means that we are
going to get a posterior for β conditional on the assumption that the linear relationship holds.
Even if it approximately holds, we still don’t know how much we deviate from this assumption.

We can now decide on a likelihood and priors.

5.2.1.2 Likelihood and priors

A normal likelihood seems reasonable for these data:

$$\text{signal}_n \sim \mathit{Normal}(\alpha + \text{c\_cloze}_n \cdot \beta, \sigma) \tag{5.1}$$

where n = 1, … , N , and signal is the dependent variable (average signal in the N400
spatiotemporal window in microvolts). The variable N represents the total number of data
points.

As always, we need to rely on our previous knowledge and domain expertise to decide on priors. We know that ERPs (signals time-locked to a stimulus) have mean amplitudes of a couple of microvolts; this is easy to see in any plot in the EEG literature. This means that we don’t expect the effect of our manipulation to exceed, say, 10 μV. As before, a priori we’ll assume that effects can be negative or positive. We can quantify our prior knowledge regarding plausible values of β as normally distributed, centered at zero with a standard deviation of 10 μV. (A smaller value such as 5 μV would also have been reasonable, since it would entail that 95% of the prior probability mass lies between −10 μV and 10 μV.)

If the signal for each ERP is baselined, that is, if the mean signal of a time window preceding the time window of interest is subtracted from the time window of interest, then the mean signal will be relatively close to 0. Since we know that the ERPs were baselined in this study, we expect the grand mean of our signal to be relatively close to zero. Our prior for α is then also normally distributed, centered at zero with a standard deviation of 10 μV.

The standard deviation of our signal distribution is harder to guess. We know that EEG signals
are quite noisy, and that the standard deviation must be higher than zero. Our prior for σ is a
truncated normal distribution with location zero and scale 50. Recall that since we truncate the
distribution, the parameters location and scale do not correspond to the mean and standard
deviation of the new distribution; see Box 4.1.

We can draw random samples from this truncated distribution and calculate their mean and
standard deviation:


samples <- rtnorm(20000, mean = 0, sd = 50, a = 0)


c(mean = mean(samples), sd = sd(samples))

## mean sd
## 39.6 29.8

So we are essentially saying that we assume a priori that we will find the true standard
deviation of the signal in the following interval with 95% probability:


quantile(samples, probs = c(0.025, .975))

## 2.5% 97.5%

## 1.63 111.08

# Analytically:
# c(qtnorm(.025, 0, 50, a = 0), qtnorm(.975, 0, 50, a = 0))

To sum up, we are going to use the following priors:

α ∼ Normal(0, 10)

β ∼ Normal(0, 10)

σ ∼ Normal+ (0, 50)

A model such as Mcp is sometimes called a fixed-effects model: all the parameters are fixed in the sense that they do not vary from subject to subject or from item to item. A similar frequentist
model would correspond to fitting a simple linear model using the lm function: lm(n400 ~ 1 +
cloze, data = df_eeg) .

We fit this model in brms as follows (the default family is gaussian() so we can omit it). As
with the lm function in R, by default an intercept is fitted and thus n400 ~ c_cloze is
equivalent to n400 ~ 1 + c_cloze :


fit_N400_cp <- brm(n400 ~ c_cloze,

prior =
c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = c_cloze),
prior(normal(0, 50), class = sigma)

),
data = df_eeg
)

For now, check the summary, and plot the posteriors of the model (Figure 5.6).


fit_N400_cp
## ...
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 3.66 0.22 3.22 4.08 1.00 3753 2742
## c_cloze 2.27 0.55 1.22 3.33 1.00 3614 2917

##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 11.82 0.15 11.52 12.13 1.00 4339 3010
##
## ...


plot(fit_N400_cp)


FIGURE 5.6: Posterior distributions of the complete pooling model, fit_N400_cp .


5.2.2 No pooling model (Mnp )

One of the assumptions of the previous model is clearly wrong: observations are not
independent, they are clustered by subject (and also by the specific item, but we’ll ignore this
until section 5.2.4). It is reasonable to assume that EEG signals are more similar within
subjects than between them. The following model assumes that the subjects are completely independent of one another.16

5.2.2.1 Model assumptions

1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Every subject’s model is fit independently of the other subjects; the subjects have no
parameters in common (an exception is the standard deviation, σ; this is the same for all
subjects in Equation (5.2)).
3. There is a linear relationship between cloze and the EEG signal for the trial.

What likelihood and priors can we choose here?

5.2.2.2 Likelihood and priors

The likelihood is a normal distribution as before:

$$\text{signal}_n \sim \mathit{Normal}(\alpha_{subj[n]} + \text{c\_cloze}_n \cdot \beta_{subj[n]}, \sigma) \tag{5.2}$$

As before, n represents each observation, that is, the n th row in the data frame, which has N rows, and now the index i identifies the subject. The notation subj[n] , which roughly follows Gelman and Hill (2007), identifies the subject index; for example, if subj[10] = 3 , then the 10 th row of the data frame is from subject 3.

We define the priors as follows:

αi ∼ Normal(0, 10)

βi ∼ Normal(0, 10)

σ ∼ Normal+ (0, 50)

In brms , such a model can be fit by removing the common intercept with the formula n400 ~
0 + factor(subj) + c_cloze:factor(subj) .

This formula forces the model to estimate one intercept and one slope for each level of subj .17 The by-subject intercepts are indicated with factor(subj) and the by-subject slopes with c_cloze:factor(subj) . It’s very important to specify that subject should be
treated as a factor and not as a number; we don’t assume that subject number 3 will show 3
times more positive (or negative) average signal than subject number 1! The model fits 37
independent intercepts and 37 independent slopes. By setting a prior to class = b and
omitting coef , we are essentially setting identical priors to all the intercepts and slopes of the
model. The parameters are independent from each other; it is only our previous knowledge (or
prior beliefs) about their possible values (encoded in the priors) that is identical. We can set
different priors to each intercept and slope, but that will mean setting 74 priors!


fit_N400_np <- brm(n400 ~ 0 + factor(subj) + c_cloze:factor(subj),
  prior =
    c(
      prior(normal(0, 10), class = b),
      prior(normal(0, 50), class = sigma)
    ),
  data = df_eeg
)

For this model, printing a summary means printing the 75 parameters (α1,...,37 , β1,...,37 , and σ).
We could do this as always by printing out the model results: just type fit_N400_np .

It may be easier to understand the output of the model by plotting β1,..,37 using bayesplot .
( brms also includes a wrapper for this function called stanplot .) We can take a look at the
internal names that brms gives to the parameters with variables(fit_N400_np) ; they are
b_factorsubj , then the subject index and then :c_cloze . The code below changes the

subject labels back to their original numerical indices and plots them in Figure 5.7. The
subjects are ordered by the magnitude of their mean effects.

The model Mnp does not estimate a unique population-level effect; instead, there is a different effect estimated for each subject. However, given the posterior means from each subject, it is still possible to calculate the average of these estimates, $\hat{\beta}_{1,\ldots,I}$, where $I$ is the total number of subjects:

# parameter name of beta by subject:
ind_effects_np <- paste0(
"b_factorsubj",
unique(df_eeg$subj), ":c_cloze"
)

beta_across_subj <- as.data.frame(fit_N400_np) %>%


#removes the meta data from the object
select(all_of(ind_effects_np)) %>%
rowMeans()

# Calculate the average of these estimates

(grand_av_beta <- tibble(


mean = mean(beta_across_subj),
lq = quantile(beta_across_subj, c(.025)),
hq = quantile(beta_across_subj, c(.975))
))

## # A tibble: 1 × 3
## mean lq hq
## <dbl> <dbl> <dbl>
## 1 2.17 1.17 3.16

In Figure 5.7, the 95% credible interval of this overall mean effect is plotted as two vertical
lines together with the effect of cloze probability for each subject (ordered by effect size).
Here, rather than using a plotting function from brms , we can extract the summary of by-
subject effects, reorder them by magnitude, and then plot the summary with a custom plot
using ggplot2 .

# make a table of beta's by subject
beta_by_subj <- posterior_summary(fit_N400_np,
variable = ind_effects_np
) %>%
as.data.frame() %>%

mutate(subject = 1:n()) %>%


## reorder plot by magnitude of mean:
arrange(Estimate) %>%
mutate(subject = factor(subject, levels = subject))

The code below generates Figure 5.7.


ggplot(
beta_by_subj,

aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subject)


) +
geom_point() +
geom_errorbarh() +
geom_vline(xintercept = grand_av_beta$mean) +

geom_vline(xintercept = grand_av_beta$lq, linetype = "dashed") +


geom_vline(xintercept = grand_av_beta$hq, linetype = "dashed") +
xlab("By-subject effect of cloze probability in microvolts")

FIGURE 5.7: 95% credible intervals of the effect of cloze probability for each subject
according to the no pooling model, fit_N400_np . The solid vertical line represents the mean
over all the subjects; and the broken vertical lines mark the 95% credible interval for this
mean.
5.2.3 Varying intercepts and varying slopes model (Mv )

One major problem with the no-pooling model is that we completely ignore the fact that the
subjects were doing the same experiment. We fit each subject’s data ignoring the information
available in the other subjects’ data. The no-pooling model is very likely to overfit the
individual subjects’ data; we are likely to ignore the generalities of the data and we may end
up overinterpreting noisy estimates from each subject’s data. The model can be modified to
explicitly assume that the subjects have an overall effect common to all the subjects, with the
individual subjects deviating from this common effect.

In the model that we fit next, we will assume that there is an overall effect that is common to
the subjects and, importantly, that all subjects’ parameters originate from one common
(normal) distribution. This model specification will result in the estimation of posteriors for each
subject being also influenced by what we know about all the subjects together. We begin with
a hierarchical model with uncorrelated varying intercepts and slopes. The analogous
frequentist model can be fit using lmer from the package lme4 , using (1+c_cloze||subj)
or, equivalently, (c_cloze||subj) for the by-subject random effects.

5.2.3.1 Model assumptions

1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Each subject deviates to some extent (this is made precise below) from the grand mean
and from the mean effect of predictability. This implies that there is some between-subject
variability in the individual-level intercept and slope adjustments by subject.
3. There is a linear relationship between cloze and the EEG signal.

5.2.3.2 Likelihood and priors

The likelihood now incorporates the assumption that both the intercept and slope are adjusted
by subject.

$$\text{signal}_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + \text{c\_cloze}_n \cdot (\beta + u_{subj[n],2}), \sigma)$$

$$\begin{aligned}
\alpha &\sim \mathit{Normal}(0, 10) \\
\beta &\sim \mathit{Normal}(0, 10) \\
u_1 &\sim \mathit{Normal}(0, \tau_{u_1}) \\
u_2 &\sim \mathit{Normal}(0, \tau_{u_2}) \\
\tau_{u_1} &\sim \mathit{Normal}_+(0, 20) \\
\tau_{u_2} &\sim \mathit{Normal}_+(0, 20) \\
\sigma &\sim \mathit{Normal}_+(0, 50)
\end{aligned}$$

In this model each subject has their own intercept adjustment, $u_{subj,1}$, and slope adjustment, $u_{subj,2}$.18 If $u_{subj,1}$ is positive, the subject will have a more positive EEG signal than the grand mean average. If $u_{subj,2}$ is positive, the subject will have a more positive EEG response to a change of one unit in c_cloze than the overall mean effect (i.e., there will be a more positive effect of cloze probability on the N400). The parameters u are sometimes called random effects, and thus a model with fixed effects (α and β) and random effects is called a mixed model. However, random effects have different meanings in different contexts. To avoid ambiguity, brms calls these random-effects parameters group-level effects. Since we are estimating α and u at the same time, and we assume that the average of the u’s is 0 (since they are assumed to be normally distributed with mean 0), what is common between the subjects, the grand mean, is estimated as the intercept α , and the deviations of individual subjects’ means from this grand mean are the adjustments $u_1$. Similarly, the mean effect of cloze is estimated as β, and the deviations of individual subjects’ mean effects of cloze from β are the adjustments $u_2$. The standard deviations of these two adjustment terms, $\tau_{u_1}$ and $\tau_{u_2}$ respectively, represent between-subject variability; see Box 5.2.
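To make the role of these adjustments concrete, here is a minimal sketch of how data could be generated under this model; the true values are arbitrary and only for illustration:

set.seed(123)
N_subj <- 37
alpha <- 3
beta <- 2
tau_u1 <- 2
tau_u2 <- 1
sigma <- 12
# By-subject adjustments to the intercept and to the slope:
u_1 <- rnorm(N_subj, 0, tau_u1)
u_2 <- rnorm(N_subj, 0, tau_u2)
# One simulated observation for each row of the real data frame:
df_sim <- df_eeg %>%
  mutate(signal = rnorm(n(),
                        mean = alpha + u_1[subj] + c_cloze * (beta + u_2[subj]),
                        sd = sigma))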

Thus, the model Mv has three standard deviations: σ, $\tau_{u_1}$, and $\tau_{u_2}$. In statistics, it is conventional to talk about variances (the squares of these standard deviations); for this reason, these standard deviations are also (confusingly) called variance components. The variance components $\tau_{u_1}$ and $\tau_{u_2}$ characterize between-subject variability, and the variance component σ characterizes within-subject variability.

The by-subject adjustments u1 and u2 are parameters in the model, and therefore have priors
defined on them. By contrast, in the frequentist lmer model, the adjustments u1 and u2 are
not parameters; they are called conditional modes; see Bates, Mächler, et al. (2015b).

Parameters that appear in the prior specifications for other parameters, such as $\tau_u$, are often called hyperparameters,19 and the priors on such hyperparameters are called hyperpriors. Thus, the parameter $u_1$ has $\mathit{Normal}(0, \tau_{u_1})$ as a prior; $\tau_{u_1}$ is a hyperparameter, and the hyperprior on $\tau_{u_1}$ is $\mathit{Normal}_+(0, 20)$.20
We know that in general, in EEG experiments, the standard deviations for the by-subject
adjustments are smaller than the standard deviation of the observations (which is the within-
subjects standard deviation). That is, usually the between-subject variability in the intercepts
and slopes is smaller than the within-subjects variability in the data. For this reason, reducing
the scale of the truncated normal distribution to 20 (in comparison to 50 ) seems reasonable
for the priors of the τ parameters. As always, we can do a sensitivity analysis to verify that our
priors are reasonably uninformative (if we intended them to be uninformative).

Box 5.2 Some important (and sometimes confusing) points:

Why does u have a mean of 0?

Because we want u to capture only differences between subjects. We could achieve the same by assuming the following relationship between the likelihood and the intercept and slope:

$$\text{signal}_n \sim \mathit{Normal}(\alpha_{subj[n]} + \beta_{subj[n]} \cdot \text{c\_cloze}_n, \sigma)$$

$$\alpha_i \sim \mathit{Normal}(\alpha, \tau_{u_1})$$

$$\beta_i \sim \mathit{Normal}(\beta, \tau_{u_2})$$

In fact, this is another common way to write the model.
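A quick simulation sketch (with arbitrary values, not from the text) shows that the two parameterizations generate the same distribution of by-subject intercepts:

set.seed(42)
alpha <- 100
tau_u1 <- 15
# Parameterization 1: alpha_i ~ Normal(alpha, tau_u1)
alpha_i <- rnorm(100000, alpha, tau_u1)
# Parameterization 2: alpha + u_i, with u_i ~ Normal(0, tau_u1)
u_i <- rnorm(100000, 0, tau_u1)
# Both have (approximately) the same mean and standard deviation:
c(mean(alpha_i), mean(alpha + u_i))
c(sd(alpha_i), sd(alpha + u_i))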

Why do the adjustments u have a normal distribution?

Mostly by convention, the adjustments u are assumed to come from a normal distribution.
Another reason is that if we don’t know anything about the distribution besides its mean
and variance, the normal distribution is the most conservative assumption (see chapter 9
of McElreath 2020).

For now, we are assuming that there is no relationship (no correlation) between the by-subject
intercept and slope adjustments u1 and u2 ; this lack of correlation is indicated in brms using
the double pipe || . The double pipe is also used in the same way in lmer from the
package lme4 (in fact brms bases its syntax on that of the lme4 package).

In brms , we need to specify hyperpriors for τu1 and τu2 ; these are called sd in brms , to
distinguish these standard deviations from the standard deviation of the residuals σ. As with
the population-level effects, the by-subject intercept adjustments are implicitly fit for the
group-level effects and thus (c_cloze || subj) is equivalent to (1 + c_cloze || subj) . If
we don’t want an intercept we need to explicitly indicate it with (0 + c_cloze || subj) or (-1
+ c_cloze || subj) . Such a removal of the intercept is not normally done.

prior_v <-
  c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b, coef = c_cloze),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 20), class = sd, coef = Intercept, group = subj),
    prior(normal(0, 20), class = sd, coef = c_cloze, group = subj)
  )
fit_N400_v <- brm(n400 ~ c_cloze + (c_cloze || subj),
  prior = prior_v,
  data = df_eeg)

When we print a brms fit, we first see the summaries of the posteriors of the standard
deviation of the by-group intercept and slopes, τu1 and τu2 as sd(Intercept) and
sd(c_cloze) , and then, as with previous models, the population-level effects, α and β as
Intercept and c_cloze , and the scale of the likelihood, σ , as sigma . The full summary can
be printed out by typing:


fit_N400_v

Because the above command will result in some pages of output, it is easier to understand the
summary graphically (Figure 5.8). Rather than the wrapper plot() , we use the original
function of the package bayesplot , mcmc_dens() , to only show density plots. We extract the
first 5 parameters of the model with variables(fit_N400_v)[1:5] .


mcmc_dens(fit_N400_v, pars = variables(fit_N400_v)[1:5])



FIGURE 5.8: Posterior distributions of the parameters in the model fit_N400_v .


Because we estimated how the population-level effect of cloze is adjusted for each subject,
we could examine how each subject is being affected by the manipulation. For this we do the
following, and we plot it in Figure 5.9. These are adjustments, u1,1 , u1,... , u1,37 , and not the
effect of the manipulation by subject, β + [u1,1 , u1,... , u1,37 ] . The code below produces Figure
5.9.

# make a table of u_2s
ind_effects_v <- paste0("r_subj[", unique(df_eeg$subj),
",c_cloze]")
u_2_v <- posterior_summary(fit_N400_v, variable = ind_effects_v) %>%
as_tibble() %>%

mutate(subj = 1:n()) %>%


## reorder plot by magnitude of mean:
arrange(Estimate) %>%
mutate(subj = factor(subj, levels = subj))
# We plot:
ggplot(

u_2_v,
aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subj)
) +
geom_point() +
geom_errorbarh() +

xlab("By-subject adjustment to the slope in microvolts")



FIGURE 5.9: 95% credible intervals of the adjustments to the effect of cloze probability for each subject ($u_{1,2}, \ldots, u_{37,2}$) according to the varying intercepts and varying slopes model, fit_N400_v . To obtain the effect of cloze probability for each subject, we would need to add the estimate of β to each adjustment.
There is an important difference between the no-pooling model and the varying intercepts and
slopes model we just fit. The no-pooling model fits each individual subject’s intercept and
slope independently for each subject. By contrast, the varying intercepts and slopes model
takes all the subjects’ data into account in order to compute the fixed effects α and β; and the
model “shrinks” (Pinheiro and Bates 2000) the by-subject intercept and slope adjustments
towards the fixed effects estimates. In Figure 5.10, we can see the shrinkage of the estimates
in the varying intercepts model by comparing them with the estimates of the no pooling model
(Mnp ).

# Extract parameter estimates from the no pooling model:
par_np <- posterior_summary(fit_N400_np, variable = ind_effects_np) %>%
as_tibble() %>%
mutate(

model = "No pooling",


subj = unique(df_eeg$subj)
)
# For the hierarchical model, the code is more complicated
# because we want the effect (beta) + adjustment.
# Extract the overall group level effect:

beta <- c(as_draws_df(fit_N400_v)$b_c_cloze)


# Extract the individual adjustments:
ind_effects_v <- paste0("r_subj[", unique(df_eeg$subj), ",c_cloze]")
adjustment <- as_draws_matrix(fit_N400_v, variable = ind_effects_v)
# Get the by subject effects in a data frame where each adjustment

# is in each column.
# Remove all the draws meta data by using as.data.frame
by_subj_effect <- as.data.frame(beta + adjustment)
# Summarize them by getting a table with the mean and the
# quantiles for each column and then binding them.

par_h <- lapply(by_subj_effect, function(x) {


tibble(
Estimate = mean(x),
Q2.5 = quantile(x, .025),
Q97.5 = quantile(x, .975)
)

}) %>%
bind_rows() %>%
# Add a column to identify that the model,
# and one with the subject labels:
mutate(

model = "Hierarchical",
subj = unique(df_eeg$subj)
)
# The mean and 95% CI of both models in one data frame:
by_subj_df <- bind_rows(par_h, par_np) %>%
arrange(Estimate) %>%
mutate(subj = factor(subj, levels = unique(.data$subj)))


ggplot(
by_subj_df,
aes(
ymin = Q2.5, ymax = Q97.5, x = subj, y = Estimate, color = model,
shape = model

)
) +
geom_errorbar(position = position_dodge(1)) +
geom_point(position = position_dodge(1)) +
# We'll also add the mean and 95% CrI of the overall difference
# to the plot:

geom_hline(
yintercept =
posterior_summary(fit_N400_v,
variable = "b_c_cloze")[, "Estimate"]
) +

geom_hline(
yintercept =
posterior_summary(fit_N400_v,
variable = "b_c_cloze")[, "Q2.5"],
linetype = "dotted", linewidth = .5

) +
geom_hline(
yintercept =
posterior_summary(fit_N400_v,
variable = "b_c_cloze")[, "Q97.5"],
linetype = "dotted", linewidth = .5

) +
xlab("N400 effect of predictability") +
coord_flip()

FIGURE 5.10: This plot compares the estimates of the effect of cloze probability for each subject between (i) the no pooling model, fit_N400_np , and (ii) the hierarchical (varying intercepts and varying slopes) model, fit_N400_v .
5.2.4 Correlated varying intercepts and varying slopes model (Mh)

The model Mv allowed for differences in intercepts (mean voltage) and slopes (effects of
cloze) across subjects, but it has the implicit assumption that these varying intercepts and
varying slopes are independent. It is in principle possible that subjects showing more negative
voltage may also show stronger effects (or weaker effects). Next, we fit a model that allows a
correlation between the intercepts and slopes. We model the correlation between varying
intercepts and slopes by defining a variance-covariance matrix Σ between the by-subject
varying intercepts and slopes, and by assuming that both adjustments (intercept and slope)
come from a multivariate (in this case, a bivariate) normal distribution.

In Mh , we model the EEG data with the following assumptions:

1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Some aspects of the mean signal voltage and of the effect of predictability depend on the
subject, and these two might be correlated, i.e., we assume group-level intercepts and
slopes, and allow a correlation between them by-subject.
3. There is a linear relationship between cloze and the EEG signal for the trial.

The likelihood remains identical to the model Mv , which assumes no correlation between
group-level intercepts and slopes (section 5.2.3):

signaln ∼ Normal(α + usubj[n],1 + c_clozen ⋅ (β + usubj[n],2 ), σ)

The correlation is indicated in the priors on the adjustments for intercept u1 and slopes u2 .

Priors:

$$\begin{aligned}
\alpha &\sim \mathit{Normal}(0, 10) \\
\beta &\sim \mathit{Normal}(0, 10) \\
\sigma &\sim \mathit{Normal}_+(0, 50)
\end{aligned}$$

$$\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_u \right)$$

In this model, a bivariate normal distribution generates the varying intercepts and varying slopes u ; this is an $n \times 2$ matrix. The variance-covariance matrix $\Sigma_u$ defines the standard deviations of the varying intercepts and varying slopes, and the correlation between them. Recall from section 1.6.2 that the diagonals of the variance-covariance matrix contain the variances of the correlated random variables, and the off-diagonals contain the covariances. In this example, the covariance $Cov(u_1, u_2)$ between the two variables $u_1$ and $u_2$ is defined as the product of their correlation $\rho_u$ and their standard deviations $\tau_{u_1}$ and $\tau_{u_2}$. In other words, $Cov(u_1, u_2) = \rho_u \tau_{u_1} \tau_{u_2}$.

$$\Sigma_u = \begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix}$$
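To see concretely what $\Sigma_u$ encodes, here is a sketch (with made-up values for the τ parameters and $\rho_u$) that generates correlated by-subject adjustments using mvrnorm() from the MASS package:

library(MASS)
tau_u1 <- 2
tau_u2 <- 1
rho_u <- .4
Sigma_u <- matrix(c(tau_u1^2, rho_u * tau_u1 * tau_u2,
                    rho_u * tau_u1 * tau_u2, tau_u2^2),
                  nrow = 2)
# Each row: (intercept adjustment, slope adjustment) for one subject
u <- mvrnorm(n = 37, mu = c(0, 0), Sigma = Sigma_u)
cor(u)[1, 2] # sample correlation, roughly rho_u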

In order to specify a prior for $\Sigma_u$, we need priors for the standard deviations, $\tau_{u_1}$ and $\tau_{u_2}$, and also for their correlation, $\rho_u$. We can use the same priors for the τ parameters as before. For the correlation parameter $\rho_u$ (and the correlation matrix more generally), we use the LKJ prior. The basic idea of the LKJ prior on the correlation matrix is that as its parameter, η (eta), increases, it will favor correlations closer to zero.21 At η = 1, the LKJ correlation distribution is uninformative (similar to $Beta(1, 1)$); at η < 1, it favors extreme correlations (similar to $Beta(a < 1, b < 1)$). We set η = 2 so that we don’t favor extreme correlations, and we still represent our lack of knowledge through the wide spread of the prior between −1 and 1. Thus, η = 2 gives us a regularizing, relatively uninformative or mildly informative prior.

Figure 5.11 shows a visualization of different parameterizations of the LKJ prior.

FIGURE 5.11: Visualization of the LKJ correlation distribution prior with four different values of the η parameter (η = 1, 2, 4, and 0.9).
$$\begin{aligned}
\tau_{u_1} &\sim \mathit{Normal}_+(0, 20) \\
\tau_{u_2} &\sim \mathit{Normal}_+(0, 20) \\
\rho_u &\sim \mathit{LKJcorr}(2)
\end{aligned}$$
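For a 2 × 2 correlation matrix, the LKJ prior implies a density on the single correlation ρ that is proportional to $(1 - \rho^2)^{\eta - 1}$; the following sketch (not from the text) plots this unnormalized density for η = 2:

rho <- seq(-1, 1, by = .01)
eta <- 2
dens <- (1 - rho^2)^(eta - 1)
plot(rho, dens, type = "l",
     xlab = "rho", ylab = "unnormalized density")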

In our brms model, we allow a correlation between the by-subject intercepts and slopes by
using a single pipe | instead of the double pipe || that we used previously. This
convention follows that in the frequentist lmer function. As before, the varying intercepts are
implicitly fit.

Because we have a new parameter, the correlation ρu , we need to add a new prior for this correlation; in brms , this is achieved by adding a prior for the parameter type cor .


prior_h <- c(
  prior(normal(0, 10), class = Intercept),
  prior(normal(0, 10), class = b, coef = c_cloze),
  prior(normal(0, 50), class = sigma),
  prior(normal(0, 20),
    class = sd, coef = Intercept,
    group = subj
  ),
  prior(normal(0, 20),
    class = sd, coef = c_cloze,
    group = subj
  ),
  prior(lkj(2), class = cor, group = subj)
)
fit_N400_h <- brm(n400 ~ c_cloze + (c_cloze | subj),
  prior = prior_h,
  data = df_eeg)

The estimates do not change much in comparison with the varying intercepts and slopes model, probably because the estimation of the correlation is quite poor (i.e., there is a lot of uncertainty). As before, we show the estimates graphically (Figure 5.12). One can access the complete summary as always with fit_N400_h .

plot(fit_N400_h, N = 6)


FIGURE 5.12: The posteriors of the parameters in the model fit_N400_h .

We are now half-way to what is sometimes called the “maximal” hierarchical model (Barr et al. 2013). This usually refers to a model with all the by-participant and by-item group-level variance components allowed by the experimental design, and a full variance-covariance matrix for all the group-level parameters. Not all variance components are allowed by the experimental design: in particular, between-group manipulations cannot have variance components. For example, even if we assume that the working memory capacity of the subjects might affect the N400, we cannot measure how working memory affects the subjects differently.

When we refer to a full variance-covariance matrix, we mean a variance-covariance matrix where all the elements (variances and covariances) are non-zero. In our previous model, for example, the variance-covariance matrix $\Sigma_u$ was full because no element was zero. If we assume no correlation between the group-level intercepts and slopes, the off-diagonal elements of the matrix would be zero, and this would make the model identical to Mv as defined in section 5.2.3; if we additionally assume that the bottom right element ($\tau_{u_2}^2$) is zero, the model would turn into a varying intercepts model (in the brms formula, n400 ~ c_cloze + (1 | subj) ); and if we assume that the matrix contains only zeros, the model would turn into a complete pooling model, Mcp , as defined in section 5.2.1.

As we will see in section 5.2.6 and in chapter 7, “maximal” is a misnomer for Bayesian
models, since this mostly refers to limitations of the popular frequentist package for fitting
models, lme4 .

The next section spells out a model with full variance-covariance matrix for both subjects and
items-level effects.

5.2.5 By-subjects and by-items correlated varying intercepts and varying slopes model (Msih)

Our new model, Msih , will allow for differences in intercepts (mean voltage) and slopes (effects
of predictability) across subjects and across items. In typical Latin square designs, subjects
and items are said to be crossed random effects—each subject sees exactly one instance of
each item. Here we assume a possible correlation between varying intercepts and slopes by
subjects, and another one by items.

In Msih , we model the EEG data with the following assumptions:

1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Some aspects of the mean signal voltage and of the effect of predictability depend on the
subject, i.e., we assume group-level intercepts, and slopes, and a correlation between
them by-subject.
3. Some aspects of the mean signal voltage and of the effect of predictability depend on the
item, i.e., we assume group-level intercepts, and slopes, and a correlation between them
by-item.

4. There is a linear relationship between cloze and the EEG signal for the trial.

Likelihood:

$$\text{signal}_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + \text{c\_cloze}_n \cdot (\beta + u_{subj[n],2} + w_{item[n],2}), \sigma)$$


Priors:

$$\begin{aligned}
\alpha &\sim \mathit{Normal}(0, 10) \\
\beta &\sim \mathit{Normal}(0, 10) \\
\sigma &\sim \mathit{Normal}_+(0, 50)
\end{aligned}$$

$$\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_u \right) \qquad \begin{pmatrix} w_{j,1} \\ w_{j,2} \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_w \right)$$

We have added the index j, which represents each item, as we did with subjects; item[n] indicates the item that corresponds to the observation in the n -th row of the data frame.

We have hyperparameters and hyperpriors as before:

$$\Sigma_u = \begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix} \qquad \Sigma_w = \begin{pmatrix} \tau_{w_1}^2 & \rho_w \tau_{w_1} \tau_{w_2} \\ \rho_w \tau_{w_1} \tau_{w_2} & \tau_{w_2}^2 \end{pmatrix}$$

$$\begin{aligned}
\tau_{u_1} &\sim \mathit{Normal}_+(0, 20) \\
\tau_{u_2} &\sim \mathit{Normal}_+(0, 20) \\
\rho_u &\sim \mathit{LKJcorr}(2) \\
\tau_{w_1} &\sim \mathit{Normal}_+(0, 20) \\
\tau_{w_2} &\sim \mathit{Normal}_+(0, 20) \\
\rho_w &\sim \mathit{LKJcorr}(2)
\end{aligned}$$

We set identical priors for the by-item group-level effects as for the by-subject group-level effects, but only because we don’t have any differentiated prior information about subject-level vs. item-level variation. Bear in mind, however, that the estimation for items is completely independent of the estimation for subjects. Although we wrote many more equations than before, the brms model is quite straightforward to extend:

prior_sih_full <-
  c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b, coef = c_cloze),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 20),
      class = sd, coef = Intercept,
      group = subj
    ),
    prior(normal(0, 20),
      class = sd, coef = c_cloze,
      group = subj
    ),
    # the group label must match the grouping variable in the formula:
    prior(lkj(2), class = cor, group = subj),
    prior(normal(0, 20),
      class = sd, coef = Intercept,
      group = item
    ),
    prior(normal(0, 20),
      class = sd, coef = c_cloze,
      group = item
    ),
    prior(lkj(2), class = cor, group = item)
  )
fit_N400_sih <- brm(n400 ~ c_cloze + (c_cloze | subj) +
                      (c_cloze | item),
  prior = prior_sih_full,
  data = df_eeg)

We can also simplify the call to brms by assigning the same priors to the by-subject and by-item parameters:

prior_sih <-
  c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 20), class = sd),
    prior(lkj(2), class = cor)
  )
fit_N400_sih <- brm(n400 ~ c_cloze + (c_cloze | subj) +
                      (c_cloze | item),
  prior = prior_sih,
  data = df_eeg)

We have new group-level effects in the summary, but again the estimate of the effect of cloze
remains virtually unchanged (Figure 5.13).


fit_N400_sih


plot(fit_N400_sih, N = 9)

FIGURE 5.13: The posterior distributions of the parameters in the model fit_N400_sih.

5.2.6 Beyond the maximal model: Distributional regression models

We can use posterior predictive checks to verify that our last model can capture the entire
signal distribution. This is shown in Figure 5.14.


pp_check(fit_N400_sih, ndraws = 50, type = "dens_overlay")

FIGURE 5.14: Overlay of densities from the posterior predictive distributions of the model fit_N400_sih.

However, we know that in ERP studies, large levels of impedance between the recording
electrodes and the skin tissue increase the noise in the recordings (Picton et al. 2000). Given
that skin tissue is different between subjects, it could be the case that the level of noise varies
by subject. It might be a good idea to verify that our model is good enough for capturing the
by-subject variability. The code below produces Figure 5.15.


ppc_dens_overlay_grouped(df_eeg$n400,
  yrep = posterior_predict(fit_N400_sih, ndraws = 100),
  group = df_eeg$subj
) +
  xlab("Signal in the N400 spatiotemporal window")
FIGURE 5.15: The plot shows 100 predicted distributions (labeled y_rep) and the distribution of the average signal data (labeled y) as density plots for the 37 subjects that participated in the experiment.
Figure 5.15 hints that we might be misfitting some subjects: Some of the by-subject observed
distributions of the EEG signal averages look much tighter than their corresponding posterior
predictive distributions (e.g., subjects 3, 5, 9, 10, 14), whereas some other by-subject
observed distributions look wider (e.g., subjects 25, 26, 27). Another approach to examine
whether we misfit the by-subject noise level is to plot posterior distributions of the standard
deviations and compare them with the observed standard deviation. This is achieved in the
following code, which groups the data by subject, and shows the distribution of standard
deviations. The result is shown in Figure 5.16. It is clear now that, for some subjects, the
observed standard deviation lies outside the distribution of predictive standard deviations.


pp_check(fit_N400_sih,
  type = "stat_grouped",
  ndraws = 1000,
  group = "subj",
  stat = "sd"
)
FIGURE 5.16: Distribution of posterior predicted standard deviations in gray, and observed standard deviations by subject in black lines.
Why is our “maximal” hierarchical model misfitting the by-subject distribution of the data? This is because maximal models are, in general and implicitly, models with the maximal group-level effect structure for the location parameter (e.g., the mean, μ, in a normal model). Other parameters (e.g., scale or shape parameters) are estimated as auxiliary parameters, and are assumed to be constant across observations and clusters. This assumption is so common that researchers may not be aware that it is just an assumption. In the Bayesian framework, it is easy to change such default assumptions if necessary. Changing the assumption that all subjects have the same residual standard deviation leads to the distributional regression model. Such models can be fit in brms; see also the brms vignette, https://cran.r-project.org/web/packages/brms/vignettes/brms_distreg.html.

We are going to change our previous likelihood so that the scale, σ, also has a group-level effect structure. We exponentiate the linear predictor of σ to make sure that negative adjustments do not cause σ to become negative.

signal_n ∼ Normal(α + u_{subj[n],1} + w_{item[n],1} + c_cloze_n ⋅ (β + u_{subj[n],2} + w_{item[n],2}), σ_n)

σ_n = exp(σ_α + σ_{u_{subj[n]}})

We just need to add priors to our new parameters (these replace the old prior for σ). We set the prior on the intercept of the standard deviation, σ_α, to be similar to our previous prior for σ. For the variance component of σ, τ_{σ_u}, we set a rather uninformative hyperprior. Recall that everything is exponentiated when it goes into the likelihood; that is why we use log(50) rather than 50 in the prior for σ_α.

σ_α ∼ Normal(0, log(50))
σ_u ∼ Normal(0, τ_{σ_u})
τ_{σ_u} ∼ Normal+(0, 5)
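To get an intuition for what the prior on σ_α implies on the scale of the data, we can exponentiate draws from it. The following one-liner is our own check, not part of the original analysis:

# Implied prior on the residual standard deviation:
# exponentiate draws of sigma_alpha ~ Normal(0, log(50)).
quantile(exp(rnorm(1e5, mean = 0, sd = log(50))), c(.025, .5, .975))

This shows that the prior is only weakly informative: it concentrates around exp(0) = 1 but allows residual standard deviations in the thousands.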

This model can be fit in brms using the internal function brmsformula() (or its shorter alias bf()). This is a powerful function that extends the formulas we have used so far, allowing us to set up a hierarchical regression on any parameter of a model. Here, it allows us to give the parameter σ a by-subject hierarchical structure. We also need to set new priors; these priors are identified by dpar = sigma.

prior_s <- c(
  prior(normal(0, 10), class = Intercept),
  prior(normal(0, 10), class = b),
  prior(normal(0, 20), class = sd),
  prior(lkj(2), class = cor),
  prior(normal(0, log(50)), class = Intercept, dpar = sigma),
  prior(normal(0, 5), class = sd, group = subj, dpar = sigma)
)
fit_N400_s <- brm(brmsformula(
  n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
  sigma ~ 1 + (1 | subj)),
  prior = prior_s, data = df_eeg
)

Inspect the output below; notice that our estimate for the effect of cloze remains very similar to
that of the model fit_N400_sih .

Compare the two models’ estimates:


posterior_summary(fit_N400_sih, variable = "b_c_cloze")

##           Estimate Est.Error  Q2.5 Q97.5
## b_c_cloze     2.31     0.678 0.969  3.64


posterior_summary(fit_N400_s, variable = "b_c_cloze")

##           Estimate Est.Error Q2.5 Q97.5
## b_c_cloze     2.29     0.657 1.01  3.56
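We can also inspect how much the residual standard deviation is estimated to vary across subjects. A minimal sketch is shown below; the parameter name follows brms's naming convention for group-level effects on distributional parameters, and is our assumption rather than part of the original text:

# Posterior summary of the by-subject variability of sigma
# (tau_sigma_u in the notation above):
posterior_summary(fit_N400_s, variable = "sd_subj__sigma_Intercept")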
Nonetheless, Figure 5.17 shows that the fit of the model with respect to the by-subject
variability is much better than before. Furthermore, Figure 5.18 shows that the observed
standard deviations for each subject are well inside the posterior predictive distributions. The
code below produces Figure 5.17.


ppc_dens_overlay_grouped(df_eeg$n400,
  yrep = posterior_predict(fit_N400_s, ndraws = 100),
  group = df_eeg$subj
) +
  xlab("Signal in the N400 spatiotemporal window")
FIGURE 5.17: The gray density plots show 100 predicted distributions from a model that includes a hierarchical structure for σ. The black density plots show the distribution of the average signal data for the 37 subjects in the experiment.
FIGURE 5.18: The gray lines show the distributions of posterior predicted standard deviations from a model that includes a hierarchical structure for σ; the observed standard deviations by subject are shown as black vertical lines.
The model fit_N400_s raises the question: how much structure should we add to our
statistical model? Should we also assume that σ can vary by items, and also by our
experimental manipulation? Should we also have a maximal model for σ? Unfortunately, there
are no clear answers that apply to every situation. The amount of complexity that we can
introduce in a statistical model depends on (i) the answers we are looking for (we should
include parameters that represent what we want to estimate), (ii) the size of the data at hand
(more complex models require more data), (iii) our computing power (as the complexity
increases models take increasingly long to converge and require more computer power to
finish the computations in a feasible time frame), and (iv) our domain knowledge.

Whether certain effects should be included in a model also depends on whether they are
known to impact posterior inference or statistical testing (e.g., via Bayes factors). For
example, it is well known that estimating group-level effects for the location parameter can
have a strong influence on the test statistics for the corresponding population-level effect (Barr
et al. 2013; Schad et al. 2021). Given that population-level effects are often what researchers
care about, it is therefore important to consider group-level effects for the location parameter.
However, to our knowledge, it is not clear whether estimating group-level effects for the
standard deviation of the likelihood has an impact on inferences for the fixed effects. Maybe
there is one, but it is not widely known–statistical research would have to be conducted via
simulations to assess whether such an influence can take place. The point here is that for
some effects, it’s crucial to include them in the model, because they are known to affect the
inferences that we want to draw from the data. Other model components may (presumably) be
less decisive. Which ones these are remains an open question for research.

Ultimately, all models are approximations (that's in the best case; often, they are plainly wrong) and we need to think carefully about which aspects of our data we have to account for and which aspects we can abstract away from.

In the context of cognitive modeling, James L. McClelland (2009a) argues that models should not focus on every single detail of the process they intend to explain. For a model to be understood, it needs to be simple enough. However, James L. McClelland (2009a) warns us that
one must bear in mind that oversimplification does have an impact on what we can conclude
from our analysis: A simplification can limit the phenomena that a model addresses, or can
even lead to incorrect predictions. There is a continuum between purely statistical models
(e.g., a linear regression) and computational cognitive models. For example, we can define
“hybrid” models such as the linear ballistic accumulator (Brown and Heathcote 2008; and see
Nicenboim 2018 for an implementation in Stan), where a great deal of cognitive detail is
sacrificed for tractability. The conclusions of James L. McClelland (2009a) apply to any type of
model in cognitive science: “Simplification is essential, but it comes at a cost, and real
understanding depends in part on understanding the effects of the simplification”.

5.3 A hierarchical log-normal model: The Stroop effect

Next, using data from Ebersole et al. (2016), we illustrate some of the issues that arise with a
log-normal likelihood in a hierarchical model. The data are from a Stroop task (Stroop 1935;
for a review, see MacLeod 1991). We will analyze a subset of the data of 3337 subjects that
participated in one variant of the Stroop task; this was part of a battery of tasks run in
Ebersole et al. (2016).

For this variant of the Stroop task, subjects were presented with one word at the center of the
screen (“red”, “blue”, or “green”). The word was written in either red, blue, or green color. In
one third of the trials, the word matched the color of the text (“congruent” condition); and in the
rest of the trials it did not match (“incongruent” condition). Subjects were instructed to only pay
attention to the color that the word was written in, and press 1 if the color was red, 2 if it
was blue, and 3 if it was green. In the incongruent condition, it is difficult to identify the color
when it mismatches the word that is written on the screen. For example, it is hard to respond
that the color is blue if the word written on the screen is green but the color it is presented in is
blue; naming the color blue here is difficult in comparison to a baseline condition (the
congruent condition), in which the word green appears in the color green. This increased
difficulty in the incongruent condition is called the Stroop effect; the effect is extremely robust
across variations in the task.

This task yields two measures: the accuracy of the decision made, and the time it took to
respond. For the Stroop task, accuracy is usually almost at ceiling; to simplify the model, we
will ignore accuracy. For a cognitive model that incorporates accuracy and response times into
a model to analyze these Stroop data, see Nicenboim (2018).
5.3.1 A correlated varying intercept varying slopes log-normal model

If our theory only focuses on the difference between the response times for the “congruent” vs.
“incongruent” condition, we can ignore the actual color presented and the word that was
written. We can simply focus on whether a trial was congruent or incongruent. Define a
predictor c_cond to represent these two conditions. For simplicity, we will also assume that
all subjects share the same variance (as we saw in section 5.2.6, changing this assumption
leads to distributional regression models).

The above assumptions mean that we are going to fit the data with the following likelihood.
The likelihood function is identical to the one that we fit in section 5.2.4, except that here the
location and scale are embedded in a log-normal likelihood rather than a normal likelihood.
Equation (5.3) states that we are dealing with a hierarchical model with by-subjects varying
intercepts and varying slopes model:

rt_n ∼ LogNormal(α + u_{subj[n],1} + c_cond_n ⋅ (β + u_{subj[n],2}), σ)   (5.3)

In chapter 8, we will discuss the sum-contrast coding of the two conditions ( c_cond ). For
now, it suffices to say that we assign a +1 to c_cond for the “incongruent” condition, and a
-1 for the “congruent” condition (i.e., a sum-contrast coding). Under this contrast coding, if

the posterior mean of the parameter β turns out to be positive, that would mean that the model
predicts that the incongruent condition has slower reaction times than the congruent one. This
is because on average the location of the log-normal likelihood for each condition will be as
follows. In Equation (5.4), μincongruent refers to the location of the incongruent condition, and
μcongruent to the location of the congruent condition.

μ_incongruent = α + 1 ⋅ β
μ_congruent = α + (−1) ⋅ β   (5.4)

We could have chosen to do the opposite contrast coding assignment: −1 for the incongruent condition, and +1 for the congruent condition. In that case, if the posterior mean of the parameter β turned out to be positive, that would mean that the incongruent condition has faster reaction times than the congruent condition. Given that the Stroop effect is very robust, we do not expect such an outcome. In order to make the β parameter easier to interpret, we have chosen the contrast coding where a positive sign on the mean of β implies that the incongruent condition has slower reaction times.
As always, we need priors for all the parameters in our model. For the population-level
parameters (or fixed effects), we use the same priors as we did when we were fitting a
regression with a log-normal likelihood in section 3.7.2.

α ∼ Normal(6, 1.5)

β ∼ Normal(0, 0.01)

σ ∼ Normal+ (0, 1)

Here, β represents, on the log scale, the change in the intercept α as a function of the
experimental manipulation. In this model, β will probably be larger in magnitude than for the
model that examined the difference in pressing the spacebar for two consecutive trials in
section 3.7.2. We might need to examine the prior for β with predictive distributions, but we
will delay this for now.

In contrast to our previous models, the intercept α is not the grand mean of the location. This
is because the conditions were not balanced in the experiment (one-third of the conditions
were congruent and two-thirds incongruent). The intercept could be interpreted here as the
time (in log-scale) it takes to respond if we ignore the experimental manipulation.
Next, we turn our attention to the prior specification for the group-level parameters (or random
effects). If we assume a possible correlation between by-subject intercepts and slopes, our
model will have the following structure. In particular, we have to define priors for the
parameters in the variance-covariance matrix Σu .

\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix} ∼ \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, Σ_u\right)

Σ_u = \begin{pmatrix} τ_{u_1}^2 & ρ_u τ_{u_1} τ_{u_2} \\ ρ_u τ_{u_1} τ_{u_2} & τ_{u_2}^2 \end{pmatrix}

In practice, this means that we need priors for the by-subject standard deviations and
correlations. For the variance components, we will set a similar prior as for σ. We don’t expect
the by-group adjustments to the intercept and slope to have more variance than the within-
subject variance, so this prior will be quite conservative because it allows for a large range of
prior uncertainty. We assign the same prior for the correlations as we did in section 5.2.5.

τ_{u_1} ∼ Normal+(0, 1)
τ_{u_2} ∼ Normal+(0, 1)
ρ_u ∼ LKJcorr(2)

We are now ready to fit the model. To speed up computation, we subset 50 subjects of the
original data set; both the subsetted data and the original data set can be found in the
package bcogsci . If we were analyzing these data for publication in a journal article or the
like, we would obviously not subset the data.

We restrict ourselves to the correct trials only, and add a c_cond predictor, sum-coded as
described earlier.


data("df_stroop")

(df_stroop <- df_stroop %>%


mutate(c_cond = if_else(condition == "Incongruent", 1, -1)))

## # A tibble: 3,058 × 5
##    subj trial condition      RT c_cond
##   <dbl> <int> <chr>       <int>  <dbl>
## 1     1     0 Congruent    1484     -1
## 2     1     1 Incongruent  1316      1
## 3     1     2 Incongruent   628      1
## # … with 3,055 more rows

Fit the model.


fit_stroop <- brm(RT ~ c_cond + (c_cond | subj),
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, .01), class = b),
    prior(normal(0, 1), class = sigma),
    prior(normal(0, 1), class = sd),
    prior(lkj(2), class = cor)
  ),
  data = df_stroop
)

We will focus on β (but you can verify that there is nothing surprising in the other parameters
in the model fit_stroop ).

posterior_summary(fit_stroop, variable = "b_c_cond")

##          Estimate Est.Error   Q2.5  Q97.5
## b_c_cond   0.0271    0.0054 0.0165 0.0374

As shown in Figure 5.19, if we overlay the density plots for the prior and posterior distributions
of β, it becomes evident that the prior might have been too restrictive: the posterior is
relatively far from the prior, and the prior strongly down-weights the values that the posterior is
centered around. Such a strong discrepancy between the prior and posterior can be
investigated with a sensitivity analysis.


sample_b_post <- as_draws_df(fit_stroop)$b_c_cond
# We generate samples from the prior as well:
N <- length(sample_b_post)
sample_b_prior <- rnorm(N, 0, .01)
samples <- tibble(
  sample = c(sample_b_post, sample_b_prior),
  distribution = c(rep("posterior", N), rep("prior", N))
)
ggplot(samples, aes(x = sample, fill = distribution)) +
  geom_density(alpha = .5)
FIGURE 5.19: The discrepancy between the prior and the posterior distributions for the slope parameter in the model fit_stroop.

5.3.1.1 Sensitivity analysis

Here, the discrepancy evident in Figure 5.19 is investigated with a sensitivity analysis. We will
examine what happens for the following priors for β. In the models we fit below, all the other
parameters have the same priors as in the model fit_stroop ; we vary only the priors for β.
The different priors are:

β ∼ Normal(0, 0.05)

β ∼ Normal(0, 0.1)

β ∼ Normal(0, 1)

β ∼ Normal(0, 2)
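One way to automate such a sensitivity analysis is sketched below; the names prior_sds and stroop_fits are our own, and we assume the same data and remaining priors as in fit_stroop. (Each call to brm() refits the model from scratch, so this can take a while.)

# Refit the Stroop model under different prior standard deviations
# for the slope beta; all other priors stay as in fit_stroop.
prior_sds <- c(0.05, 0.1, 1, 2)
stroop_fits <- lapply(prior_sds, function(s) {
  brm(RT ~ c_cond + (c_cond | subj),
    family = lognormal(),
    prior = c(
      prior(normal(6, 1.5), class = Intercept),
      set_prior(paste0("normal(0, ", s, ")"), class = "b"),
      prior(normal(0, 1), class = sigma),
      prior(normal(0, 1), class = sd),
      prior(lkj(2), class = cor)
    ),
    data = df_stroop
  )
})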

We can summarize the estimates of β given different priors as shown in Table 5.1.
TABLE 5.1: The summary (mean and 95% credible interval) for the posterior distribution of the slope in the model fit_stroop, given different priors on the slope parameter.

prior              Estimate    Q2.5   Q97.5
Normal(0, 0.001)      0.001  −0.001   0.003
Normal(0, 0.01)       0.027   0.016   0.037
Normal(0, 0.05)       0.037   0.025   0.049
Normal(0, 0.1)        0.037   0.025   0.049
Normal(0, 1)          0.037   0.025   0.049
Normal(0, 2)          0.038   0.026   0.050

It might be easier to see how much the posterior difference between conditions changes
depending on the prior. In order to answer this question, we need to remember that the
median difference between conditions (MedianRTdif f ) can be calculated as the difference
between the exponents of each condition’s medians:

MedianRT_diff = MedianRT_incongruent − MedianRT_congruent
MedianRT_diff = exp(α + β) − exp(α − β)   (5.5)

Equation (5.5) gives us the posterior distributions of the median difference between conditions for the different models. We calculate the median difference rather than the mean difference because the mean depends on the parameter σ, but the median doesn't: the mean of a log-normal distribution is exp(μ + σ²/2), and the median is simply exp(μ); see also section 3.7.2.
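A minimal sketch of how this back-transformation can be computed from the posterior draws is shown below for fit_stroop; the variable names draws and median_diff are our own:

# Posterior of the median difference (in ms) via equation (5.5):
draws <- as_draws_df(fit_stroop)
median_diff <- exp(draws$b_Intercept + draws$b_c_cond) -
  exp(draws$b_Intercept - draws$b_c_cond)
c(mean = mean(median_diff), quantile(median_diff, c(.025, .975)))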

Table 5.2 summarizes the posterior distributions under different priors using the means of the
difference in medians, along with 95% credible intervals. It’s important to realize that the use
of mean to summarize the posterior distribution is orthogonal to our use of the median to
summarize the response times by condition: In the first case, we use the median to
summarize a group of observations, and in the second case, we use the mean to summarize a
group of samples from the posterior–we could have summarized the samples from the
posterior with its median instead of the mean.
TABLE 5.2: A summary, under a range of priors, of the posterior distributions of the mean difference between the two conditions, back-transformed to the millisecond scale.

prior              mean diff (ms)    Q2.5   Q97.5
Normal(0, 0.001)             0.64   −1.54    2.86
Normal(0, 0.01)             30.08   17.67   41.75
Normal(0, 0.05)             41.48   27.48   55.34
Normal(0, 0.1)              41.80   27.93   55.87
Normal(0, 1)                42.16   28.70   55.56
Normal(0, 2)                42.06   28.21   56.27

Table 5.2 shows us that the posterior changes substantially when we use wider priors. It
seems that the posterior is relatively unaffected when we use priors with a standard deviation
larger than 0.05 . However, if we assume a priori that the effect of the manipulation must be
small, we will end up obtaining a posterior that is consistent with that belief. When we include
less information about the possible effect sizes by using a less informative prior, we allow the
data to influence the posterior more. A sensitivity analysis is always an important component
of a good-quality Bayesian analysis.

Which analysis should one report after carrying out a sensitivity analysis? In the above
example, the priors ranging from Normal(0, 0.05) to Normal(0, 2) show rather similar
posterior distributions for the mean difference. The most common approach in Bayesian
analysis is to report the results of such relatively uninformative priors (e.g., one could report
the posterior associated with the Normal(0, 2) here), because this kind of prior allows for a
broader range of possible effects and is relatively agnostic. However, if there is a good reason
to use a prior from a previous analysis, then of course it makes sense to report the analysis
with the informative prior alongside an analysis with an uninformative prior. Reporting only
informative priors in a Bayesian analysis is generally not a good idea. The issue is
transparency: the reader should know what the posterior looks like for both an informative and
an uninformative prior.

Another situation where posterior distributions associated with multiple priors should be
reported is when one is carrying out an adversarial sensitivity analysis (Spiegelhalter, Abrams,
and Myles 2004): one can take a group of agnostic, enthusiastic, and adversarial or skeptical
priors that, respectively, reflect a non-committal a priori position, an informed position based
on the researcher’s prior beliefs, and an adversarial position based on a scientific opponent’s
beliefs. In such a situation, analyses using all three priors can be reported, so that the reader
can determine how different prior beliefs influence the posterior. For an example of such an
adversarial analysis, see Vasishth and Engelmann (2022). Finally, when carrying out
hypothesis testing using Bayes factors, the choice of the prior on the parameter of interest
becomes critically important; in that situation, it is very important to report a sensitivity
analysis, showing the Bayes factor as well as a summary of the posterior distributions (Schad
et al. 2021); we return to this point in chapter 15, which covers Bayes factors.

5.4 Why fitting a Bayesian hierarchical model is worth the effort

Carrying out Bayesian data analysis clearly requires much more effort than fitting a frequentist
model: we have to define priors, verify that our model works, and decide how to interpret the
results. By comparison, fitting a linear mixed model using lme4 consists of only a single line
of code. But there is a hidden cost to the relatively high speed furnished by functions such
as lmer. First, a model fit using lmer or the like makes many assumptions, but they are
hidden from the user. This is not a problem for the knowledgeable modeler, but very
dangerous for the naive user. A second conceptual problem is that the way frequentist models
are typically used is to answer a binary question: is the effect “significant” or not? If a result is
significant, the paper is considered worth publishing; if not, it is not. Although frequentist
models can quickly answer the question that the null hypothesis test poses, the frequentist
test answers the wrong question. For discussion, see Vasishth and Nicenboim (2016).

Nevertheless, it is natural to ask why one should bother to go through all the trouble of fitting a
Bayesian model. An important reason is the flexibility in model specification. The approach we
have presented here can be used to extend essentially any parameter of any model. This
includes popular uses, such as logistic and Poisson regressions, and also useful models that
are relatively rarely used in cognitive science, such as multi-logistic regression (e.g., accuracy
in some task with more than two answers), ordered logistic (e.g., ratings, Bürkner and Vuorre
2018), models with a shifted log-normal distribution (see exercise 12.1 and chapter 20 which
deals with a log-normal race model, and see Nicenboim, Logačev, et al. 2016; Rouder 2005),
and distributional regression models (as shown in 5.2.6). By contrast, a frequentist model,
although easy to fit quickly, forces the user to use an inflexible canned model, which may not
necessarily make sense for their data.

This flexibility also allows us to go beyond the statistical models discussed before, and to develop complex hierarchical computational process models that are tailored to specific phenomena. An example is computational cognitive models, which can be extended hierarchically in a straightforward way; see Lee (2011b) and Lee and Wagenmakers (2014). This is because, as we have seen with distributional regression models in section 5.2.6, any parameter can have a group-level effect structure. Some examples of hierarchical computational cognitive models in psycholinguistics are Logačev and Vasishth (2016), Nicenboim and Vasishth (2018), Vasishth et al. (2017), Vasishth, Jaeger, and Nicenboim (2017), Lissón et al. (2021), Logačev and Dokudan (2021), Paape et al. (2021), Yadav, Smith, and Vasishth (2021a), and Yadav, Smith, and Vasishth (2021b). The hierarchical Bayesian modeling approach can even be extended to process models that cannot be expressed as a likelihood function, although in such cases one may have to write one's own sampler; for an example from psycholinguistics, see Yadav, Paape, et al. (2022).²² We discuss and implement in Stan some relatively simple computational cognitive models in chapters 17-20.

5.5 Summary

This chapter presents two very commonly used classes of hierarchical model: those with
normal and log-normal likelihoods. We saw several common variants of such models: varying
intercepts, varying intercepts and varying slopes with or without a correlation parameter, and
crossed random effects for subjects and items. We also experienced the flexibility of the Stan
modeling framework through the example of a model that assumes a different residual
standard deviation for each subject.

5.6 Further reading

Chapter 5 of Gelman et al. (2014) provides a rather technical but complete treatment of
exchangeability in Bayesian hierarchical models. Bernardo and Smith (2009) is a brief but
useful article explaining exchangeability, and Lunn et al. (2012) also has a helpful discussion
that we have drawn on in this chapter. Gelman and Hill (2007) is a comprehensive treatment
of hierarchical modeling, although it uses WinBUGS. Yarkoni (2020) discusses the importance
of modeling variability in variables that researchers clearly intend to generalize over (e.g.,
stimuli, tasks, or research sites), and how under-specification of group-level effects imposes
strong constraints on the generalizability of results. Sorensen, Hohenstein, and Vasishth
(2016) provides an introduction, using Stan, to the Laird-Ware style matrix formulation (Laird
and Ware 1982) of hierarchical models; this formulation has the advantage of flexibility and
efficiency when specifying models in Stan syntax.
5.7 Exercises

Exercise 5.1 A hierarchical model (normal likelihood) of cognitive load on pupil size.

As in section 4.1, we focus on the effect of cognitive load on pupil size, but this time we look at
all the subjects of Wahn et al. (2016):


data("df_pupil_complete")
df_pupil_complete

## # A tibble: 2,228 × 4
##    subj trial  load p_size
##   <int> <int> <int>  <dbl>
## 1   701     1     2  1021.
## 2   701     2     1   951.
## 3   701     3     5  1064.
## # … with 2,225 more rows

You should now be able to fit a "maximal" model (correlated varying intercepts and slopes for subjects) assuming a normal likelihood. Base your priors on the priors discussed in section 4.1.

a. Examine the effect of load on pupil size, and the average pupil size. What do you
conclude?
b. Do a sensitivity analysis for the prior on the intercept (α). What is the estimate of the
effect (β) under different priors?
c. Is the effect of load consistent across subjects? Investigate this visually.

Exercise 5.2 Are subject relatives easier to process than object relatives (log-normal
likelihood)?

We begin with a classic question from the psycholinguistics literature: Are subject relatives
easier to process than object relatives? The data come from Experiment 1 in a paper by
Grodner and Gibson (2005).

Scientific question: Is there a subject relative advantage in reading?


Grodner and Gibson (2005) investigate an old claim in psycholinguistics that object relative
clause (ORC) sentences are more difficult to process than subject relative clause (SRC)
sentences. One explanation for this predicted difference is that the distance between the
relative clause verb (sent in the example below) and the head noun phrase of the relative
clause (reporter in the example below) is longer in ORC vs. SRC. Examples are shown below.
The relative clause is shown in square brackets.

(1a) The reporter [who the photographer sent to the editor] was hoping for a good story.
(ORC)

(1b) The reporter [who sent the photographer to the editor] was hoping for a good story. (SRC)

The underlying explanation has to do with memory processes: Shorter linguistic dependencies
are easier to process due to either reduced interference or decay, or both. For implemented
computational models that spell this point out, see Lewis and Vasishth (2005) and Engelmann,
Jäger, and Vasishth (2020).

In the Grodner and Gibson data, the dependent measure is reading time at the relative clause
verb, (e.g., sent) of different sentences with either ORC or SRC. The dependent variable is in
milliseconds and was measured in a self-paced reading task. Self-paced reading is a task
where subjects read a sentence or a short text word-by-word or phrase-by-phrase, pressing a
button to get each word or phrase displayed; the preceding word disappears every time the
button is pressed. In 6.1, we provide a more detailed explanation of this experimental method.

For this experiment, we are expecting longer reading times at the relative clause verbs of
ORC sentences in comparison to the relative clause verb of SRC sentences.


data("df_gg05_rc")
df_gg05_rc

## # A tibble: 672 × 7
##    subj  item condition    RT residRT qcorrect experiment
##   <int> <int> <chr>     <int>   <dbl>    <int> <chr>
## 1     1     1 objgap      320   -21.4        0 tedrg3
## 2     1     2 subjgap     424    74.7        1 tedrg2
## 3     1     3 objgap      309   -40.3        0 tedrg3
## # … with 669 more rows
You should use sum coding for the predictor. Here, object relative clauses ("objgaps") are coded +1, and subject relative clauses are coded −1.


df_gg05_rc <- df_gg05_rc %>%
  mutate(c_cond = if_else(condition == "objgap", 1, -1))

You should now be able to fit a "maximal" model (correlated varying intercepts and slopes for subjects and for items) assuming a log-normal likelihood.

a. Examine the effect of relative clause type (the predictor c_cond) on reading times RT (β).
b. Estimate the median difference between relative clause types in milliseconds, and report the mean and 95% CI.
c. Do a sensitivity analysis. What is the estimate of the effect (β) under different priors? What is the difference in milliseconds between conditions under different priors?

Exercise 5.3 Relative clause processing in Mandarin Chinese

Load the following two data sets:


data("df_gibsonwu")
data("df_gibsonwu2")

The data are taken from two experiments that investigate (inter alia) the effect of relative
clause type on reading time in Chinese. The data are from Gibson and Wu (2013) and
Vasishth et al. (2013) respectively. The second data set is a direct replication attempt of the
Gibson and Wu (2013) experiment.

Chinese relative clauses are interesting theoretically because they are prenominal: the relative
clause appears before the head noun. For example, the English relative clauses shown above
would appear in the following order in Mandarin. The square brackets mark the relative
clause, and REL refers to the Chinese equivalent of the English relative pronoun who.

(2a) [The photographer sent to the editor] REL the reporter was hoping for a good story.
(ORC)

(2b) [sent the photographer to the editor] REL the reporter was hoping for a good story. (SRC)
As discussed in Gibson and Wu (2013), the consequence of Chinese relative clauses being prenominal is that the distance between the verb in the relative clause and the head noun is larger in subject relatives than in object relatives. Hsiao and Gibson (2003) were the first to suggest that the larger distance in subject relatives leads to longer reading time at the head noun.
Under this view, the prediction is that subject relatives are harder to process than object
relatives. If this is true, this is interesting and surprising because in most other languages that
have been studied, subject relatives are easier to process than object relatives; so Chinese
will be a very unusual exception cross-linguistically.

The data provided are for the critical region (the head noun; here, reporter). The experiment
method is self-paced reading, so we have reading times in milliseconds. The second data set
is a direct replication attempt of the first data set, which is from Gibson and Wu (2013).

The research question is whether the difference in reading times between object and subject relative clauses is negative. For the first data set (df_gibsonwu), investigate this
question by fitting two “maximal” hierarchical models (correlated varying intercept and slopes
for subjects and items). The dependent variable in both models is the raw reading time in
milliseconds. The first model should use the normal likelihood in the model; the second model
should use the log-normal likelihood. In both models, use ±0.5 sum coding to model the effect
of relative clause type. You will need to decide on appropriate priors for the various
parameters.

a. Plot the posterior predictive distributions from the two models. What is the difference in
the posterior predictive distributions of the two models; and why is there a difference?
b. Examine the posterior distributions of the effect estimates (in milliseconds) in the two
models. Why are these different?
c. Given the posterior predictive distributions you plotted above, why is the log-normal
likelihood model better for carrying out inference and hypothesis testing?

Next, work out a normal approximation of the log-normal model’s posterior distribution for the
relative clause effect that you obtained from the above data analysis. Then use that normal
approximation as an informative prior for the slope parameter when fitting a hierarchical model
to the second data set. This is an example of incrementally building up knowledge by
successively using a previous study’s posterior as a prior for the next study; this is essentially
equivalent to pooling both data sets (check that pooling the data and using a Normal(0,1) prior
for the effect of interest, with a log-normal likelihood, gives you approximately the same
posterior as the informative-prior model fit above).

Exercise 5.4 Agreement attraction in comprehension

Load the following data:



data("df_dillonE1")
dillonE1 <- df_dillonE1
head(dillonE1)

##         subj       item   rt int     expt
## 49 dillonE11 dillonE119 2918 low dillonE1
## 56 dillonE11 dillonE119 1338 low dillonE1
## 63 dillonE11 dillonE119  424 low dillonE1
## 70 dillonE11 dillonE119  186 low dillonE1
## 77 dillonE11 dillonE119  195 low dillonE1
## 84 dillonE11 dillonE119 1218 low dillonE1

The data are taken from an experiment that investigates (inter alia) the effect of number similarity between a noun and the auxiliary verb in sentences like the following. There are two levels to a factor called Int(erference): low and high.

(3a) low: The key to the cabinet are on the table
(3b) high: The key to the cabinets are on the table

Here, in (3b), the auxiliary verb are is predicted to be read faster than in (3a), because the
plural marking on the noun cabinets leads the reader to think that the sentence is
grammatical. (Both sentences are ungrammatical.) This phenomenon, where the high
condition is read faster than the low condition, is called agreement attraction.

The data provided are for the critical region (the auxiliary verb are). The experiment method is
eye-tracking; we have total reading times in milliseconds.

The research question is whether the difference in reading times between high and low
conditions is negative.

First, using a log-normal likelihood, fit a hierarchical model with correlated varying
intercept and slopes for subjects and items. You will need to decide on the priors for the
model.
By simply looking at the posterior distribution of the slope parameter β, what would you
conclude about the theoretical claim relating to agreement attraction?

Exercise 5.5 Attentional blink (Bernoulli likelihood)


The attentional blink (AB; first described by Raymond, Shapiro, and Arnell 1992; though it has
been noticed before e.g., Broadbent and Broadbent 1987) refers to a temporary reduction in
the accuracy of detecting a probe (e.g., a letter “X”) presented closely after a target that has
been detected (e.g., a white letter). We will focus on the experimental condition of Experiment
2 of Raymond, Shapiro, and Arnell (1992). Subjects are presented with letters in rapid serial
visual presentation (RSVP) at the center of the screen at a constant rate and are required to
identify the only white letter (target) in the stream of black letters, and then to report whether
the letter X (probe) occurred in the subsequent letter stream. The AB is defined as having
occurred when the target is reported correctly but the report of the probe is inaccurate at a
short lag or target-probe interval.

The data set df_ab is a subset of the data of this paradigm from a replication conducted by
Grassi et al. (2021). In this subset, the probe was always present and the target was correctly
identified. We want to find out how the lag affects the accuracy of the identification of the
probe.


data("df_ab")

df_ab

## # A tibble: 2,101 × 4
##    subj probe_correct trial   lag
##   <int>         <int> <int> <int>
## 1     1             0     2     5
## 2     1             1     4     4
## 3     1             1     8     6
## # … with 2,098 more rows

Fit a logistic regression assuming a linear relationship between lag and accuracy
( probe_correct ). Assume a hierarchical structure with correlated varying intercept and slopes
for subjects. You will need to decide on the priors for this model.

a. How is the accuracy of the probe identification affected by the lag? Estimate this in log-
odds and percentages.
b. Is the linear relationship justified? Use posterior predictive checks to verify this.
c. Can you think about a better relationship between lag and accuracy? Fit a new model and
use posterior predictive checks to verify if the fit improved.
Exercise 5.6 Is there a Stroop effect in accuracy?

Instead of the response times of the correct answers, we want to find out whether accuracy also changes by condition in the Stroop task. Fit the Stroop data with a hierarchical logistic regression (i.e., a Bernoulli likelihood with a logit link). Use the complete data set, df_stroop_complete, which also includes incorrect answers, and subset it by selecting the first 50 subjects.

a. Fit the model.
b. Report the Stroop effect in log-odds and accuracy.

Exercise 5.7 Distributional regression for the Stroop effect.

We will relax some of the assumptions of the model of Stroop presented in section 5.3. We will
no longer assume that all subjects share the same variance component, and, in addition, we’ll
investigate whether the experimental manipulation affects the scale of the response times. A
reasonable hypothesis could be that the incongruent condition is noisier than the congruent
one.

Assume the following likelihood, and fit the model with sensible priors (recall that our initial
prior for β wasn’t reasonable). (Priors for all the sigma parameters require us to set dpar =
sigma ).

rt_n ∼ LogNormal(α + u_{subj[n],1} + c_cond_n ⋅ (β + u_{subj[n],2}), σ_n)

σ_n = exp(σ_α + σ_{u_{subj[n],1}} + c_cond_n ⋅ (σ_β + σ_{u_{subj[n],2}}))

In this likelihood, σ_n has both population- and group-level parameters: σ_α and σ_β are the intercept and slope of the population-level effects respectively, and σ_{u_{subj[n],1}} and σ_{u_{subj[n],2}} are the intercept and slope of the group-level effects.

a. Is our hypothesis reasonable in light of the results?
b. Why is the intercept for the scale negative?
c. What's the posterior estimate of the scale for congruent and incongruent conditions?

Exercise 5.8 The grammaticality illusion

Load the following two data sets:

data("df_english")
english <- df_english
data("df_dutch")
dutch <- df_dutch

In an offline accuracy rating study on English double center-embedding constructions, Gibson and Thomas (1999) found that grammatical constructions (e.g., example 4a below) were no less acceptable than ungrammatical constructions (e.g., example 4b) where a middle verb phrase (e.g., was cleaning every week) was missing.

(4a) The apartment that the maid who the service had sent over was cleaning every week was
well decorated.

(4b) *The apartment that the maid who the service had sent over — was well decorated

Based on these results from English, Gibson and Thomas (1999) proposed that working-
memory overload leads the comprehender to forget the prediction of the upcoming verb
phrase (VP), which reduces working-memory load. This came to be known as the VP-
forgetting hypothesis. The prediction is that in the word immediately following the final verb,
the grammatical condition (which is coded as +1 in the data frames) should be harder to read
than the ungrammatical condition (which is coded as -1).

The design shown above is set up to test this hypothesis using self-paced reading for English
(Vasishth et al. 2011), and for Dutch (Frank, Trompenaars, and Vasishth 2015). The data
provided are for the critical region (the noun phrase, labeled NP1, following the final verb); this
is the region for which the theory predicts differences between the two conditions. We have
reading times in log milliseconds.

a. First, fit a linear model with a full hierarchical structure by subjects and by items for the English data. Because we have log milliseconds data, we can simply use the normal likelihood (not the log-normal). What scale will the parameters be in, milliseconds or log milliseconds?
b. Second, using the posterior for the effect of interest from the English data, derive a prior distribution for the effect in the Dutch data. Then fit two linear mixed models: (i) one model with relatively uninformative priors for β (for example, Normal(0, 1)), and (ii) one model with the prior for β you derived from the English data. Do the posterior distributions of the Dutch data's effect show any important differences given the two priors? If yes, why; if not, why not?
c. Finally, just by looking at the English and Dutch posteriors, what can we say about the
VP-forgetting hypothesis? Are the posteriors of the effect from these two languages
consistent with the hypothesis?

References

Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects
Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and
Language 68 (3). Elsevier: 255–78.

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear
Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.
https://doi.org/10.18637/jss.v067.i01.

Bernardo, José M, and Adrian FM Smith. 2009. Bayesian Theory. Vol. 405. John Wiley &
Sons.

Broadbent, Donald E., and Margaret H. P. Broadbent. 1987. “From Detection to Identification:
Response to Multiple Targets in Rapid Serial Visual Presentation.” Perception &
Psychophysics 42 (2): 105–13. https://doi.org/10.3758/BF03210498.

Brown, Scott D., and Andrew Heathcote. 2008. “The Simplest Complete Model of Choice
Response Time: Linear Ballistic Accumulation.” Cognitive Psychology 57 (3): 153–78.
https://doi.org/10.1016/j.cogpsych.2007.12.002.

Bürkner, Paul-Christian, and Matti Vuorre. 2018. “Ordinal Regression Models in Psychological
Research: A Tutorial.” PsyArXiv Preprints.

de Finetti, Bruno. 1931. “Funzione Caratteristica Di Un Fenomeno Aleatorio.” Atti Della Reale Accademia Nazionale Dei Lincei, Serie 6. Memorie, Classe Di Scienze Fisiche, Matematiche E Naturali 4: 251–99.

DeLong, Katherine A, Thomas P Urbach, and Marta Kutas. 2005. “Probabilistic Word Pre-
Activation During Language Comprehension Inferred from Electrical Brain Activity.” Nature
Neuroscience 8 (8): 1117–21. https://doi.org/10.1038/nn1504.

Ebersole, Charles R., Olivia E. Atherton, Aimee L. Belanger, Hayley M. Skulborstad, Jill M. Allen, Jonathan B. Banks, Erica Baranski, et al. 2016. “Many Labs 3: Evaluating Participant Pool Quality Across the Academic Semester via Replication.” Journal of Experimental Social Psychology 67: 68–82. https://doi.org/10.1016/j.jesp.2015.10.012.

Engelmann, Felix, Lena A. Jäger, and Shravan Vasishth. 2020. “The Effect of Prominence and
Cue Association in Retrieval Processes: A Computational Account.” Cognitive Science 43 (12):
e12800. https://doi.org/10.1111/cogs.12800.
Frank, Stefan L., Leun J. Otten, Giulia Galli, and Gabriella Vigliocco. 2015. “The ERP
Response to the Amount of Information Conveyed by Words in Sentences.” Brain and
Language 140: 1–11. https://doi.org/10.1016/j.bandl.2014.10.006.

Frank, Stefan L., Thijs Trompenaars, and Shravan Vasishth. 2015. “Cross-Linguistic
Differences in Processing Double-Embedded Relative Clauses: Working-Memory Constraints
or Language Statistics?” Cognitive Science 40: 554–78. https://doi.org/10.1111/cogs.12247.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B.
Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC
Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge University Press.

Gibson, Edward, and James Thomas. 1999. “Memory Limitations and Structural Forgetting:
The Perception of Complex Ungrammatical Sentences as Grammatical.” Language and
Cognitive Processes 14(3): 225–48.

Gibson, Edward, and H-H Iris Wu. 2013. “Processing Chinese Relative Clauses in Context.”
Language and Cognitive Processes 28 (1-2). Taylor & Francis: 125–55.

Grassi, Massimo, Camilla Crotti, David Giofrè, Ingrid Boedker, and Enrico Toffalini. 2021. “Two
Replications of Raymond, Shapiro, and Arnell (1992), the Attentional Blink.” Behavior
Research Methods 53 (2): 656–68. https://doi.org/10.3758/s13428-020-01457-6.

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic
Input.” Cognitive Science 29: 261–90.

Hsiao, Fanny Pai-Fang, and Edward Gibson. 2003. “Processing Relative Clauses in Chinese.”
Cognition 90: 3–27.

Kutas, Marta, and Kara D. Federmeier. 2011. “Thirty Years and Counting: Finding Meaning in
the N400 Componentof the Event-Related Brain Potential (ERP).” Annual Review of
Psychology 62 (1): 621–47. https://doi.org/10.1146/annurev.psych.093008.131123.

Kutas, Marta, and Steven A Hillyard. 1980. “Reading Senseless Sentences: Brain Potentials
Reflect Semantic Incongruity.” Science 207 (4427): 203–5.
https://doi.org/10.1126/science.7350657.

Kutas, Marta, and Steven A Hillyard. 1984. “Brain Potentials During Reading Reflect Word
Expectancy and Semantic Association.” Nature 307 (5947): 161–63.
https://doi.org/10.1038/307161a0.
Laird, Nan M, and James H Ware. 1982. “Random-Effects Models for Longitudinal Data.”
Biometrics. JSTOR, 963–74.

Lee, Michael D., ed. 2011a. “Special Issue on Hierarchical Bayesian Models.” Journal of
Mathematical Psychology 55 (1). https://www.sciencedirect.com/journal/journal-of-
mathematical-psychology/vol/55/issue/1.

Lee, Michael D. 2011b. “How Cognitive Modeling Can Benefit from Hierarchical Bayesian Models.” Journal of Mathematical Psychology 55 (1). Elsevier BV: 1–7. https://doi.org/10.1016/j.jmp.2010.08.013.

Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical
Course. Cambridge University Press.

Lewis, Richard L., and Shravan Vasishth. 2005. “An Activation-Based Model of Sentence
Processing as Skilled Memory Retrieval.” Cognitive Science 29: 1–45.

Lissón, Paula, Dorothea Pregla, Bruno Nicenboim, Dario Paape, Mick van het Nederend,
Frank Burchert, Nicole Stadie, David Caplan, and Shravan Vasishth. 2021. “A Computational
Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia.”
Cognitive Science 45 (4): e12956. https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.12956.

Logačev, Pavel, and Noyan Dokudan. 2021. “A Multinomial Processing Tree Model of RC
Attachment.” In Proceedings of the Workshop on Cognitive Modeling and Computational
Linguistics, 39–47. Online: Association for Computational Linguistics.
https://www.aclweb.org/anthology/2021.cmcl-1.4.

Logačev, Pavel, and Shravan Vasishth. 2016. “A Multiple-Channel Model of Task-Dependent Ambiguity Resolution in Sentence Comprehension.” Cognitive Science 40 (2): 266–98. https://doi.org/10.1111/cogs.12228.

Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012.
The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.

MacLeod, Colin M. 1991. “Half a Century of Research on the Stroop Effect: An Integrative
Review.” Psychological Bulletin 109 (2). American Psychological Association: 163.

McClelland, James L. 2009a. “The Place of Modeling in Cognitive Science.” Topics in Cognitive Science 1 (1): 11–38. https://doi.org/10.1111/j.1756-8765.2008.01003.x.

McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and
Stan. Boca Raton, Florida: Chapman; Hall/CRC.
Nicenboim, Bruno. 2018. “The Implementation of a Model of Choice: The (Truncated) Linear
Ballistic Accumulator.” In StanCon. Aalto University, Helsinki, Finland.
https://doi.org/10.5281/zenodo.1465990.

Nicenboim, Bruno, Pavel Logačev, Carolina Gattei, and Shravan Vasishth. 2016. “When High-
Capacity Readers Slow down and Low-Capacity Readers Speed up: Working Memory and
Locality Effects.” Frontiers in Psychology 7 (280). https://doi.org/10.3389/fpsyg.2016.00280.

Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” Journal of Memory and Language 99: 1–34. https://doi.org/10.1016/j.jml.2017.08.004.

Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020a. “Are Words Pre-Activated
Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian
Random-Effects Meta-Analysis Using Publicly Available Data.” Neuropsychologia 142.
https://doi.org/10.1016/j.neuropsychologia.2020.107427.

Nieuwland, Mante S, Stephen Politzer-Ahles, Evelien Heyselaar, Katrien Segaert, Emily Darley, Nina Kazanina, Sarah Von Grebmer Zu Wolfsthurn, et al. 2018. “Large-Scale Replication Study Reveals a Limit on Probabilistic Prediction in Language Comprehension.” eLife 7. https://doi.org/10.7554/eLife.33468.

Paape, Dario, Serine Avetisyan, Sol Lago, and Shravan Vasishth. 2021. “Modeling Misretrieval and Feature Substitution in Agreement Attraction: A Computational Evaluation.” Cognitive Science 45 (8). https://doi.org/10.1111/cogs.13019.

Picton, T.W., S. Bentin, P. Berg, E. Donchin, S.A. Hillyard, R. Johnson JR., G.A. Miller, et al.
2000. “Guidelines for Using Human Event-Related Potentials to Study Cognition: Recording
Standards and Publication Criteria.” Psychophysiology 37 (2): 127–52.
https://doi.org/10.1111/1469-8986.3720127.

Pinheiro, José C, and Douglas M Bates. 2000. Mixed-Effects Models in S and S-PLUS. New
York: Springer-Verlag.

Raymond, Jane E, Kimron L Shapiro, and Karen M Arnell. 1992. “Temporary Suppression of
Visual Processing in an RSVP Task: An Attentional Blink?” Journal of Experimental
Psychology: Human Perception and Performance 18 (3). American Psychological Association:
849.

Rouder, Jeffrey N. 2005. “Are Unshifted Distributional Models Appropriate for Response
Time?” Psychometrika 70 (2). Springer Science + Business Media: 377–81.
https://doi.org/10.1007/s11336-005-1297-7.
Schad, Daniel J., Bruno Nicenboim, Paul-Christian Bürkner, Michael J. Betancourt, and
Shravan Vasishth. 2021. “Workflow Techniques for the Robust Use of Bayes Factors.”

Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed
Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.”
Quantitative Methods for Psychology 12 (3): 175–200.

Spiegelhalter, David J, Keith R Abrams, and Jonathan P Myles. 2004. Bayesian Approaches to
Clinical Trials and Health-Care Evaluation. Vol. 13. John Wiley & Sons.

Stroop, J Ridley. 1935. “Studies of Interference in Serial Verbal Reactions.” Journal of Experimental Psychology 18 (6). Psychological Review Company: 643.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese
Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public
Library of Science: 1–14.

Vasishth, Shravan, Nicolas Chopin, Robin Ryder, and Bruno Nicenboim. 2017. “Modelling
Dependency Completion in Sentence Comprehension as a Bayesian Hierarchical Mixture
Process: A Case Study Involving Chinese Relative Clauses.” In Proceedings of Cognitive
Science Conference. London, UK. https://arxiv.org/abs/1702.00564v2.

Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive
Process: A Computational Approach. Cambridge, UK: Cambridge University Press.
https://books.google.de/books?id=6KZKzgEACAAJ.

Vasishth, Shravan, and Bruno Nicenboim. 2016. “Statistical Methods for Linguistic Research:
Foundational Ideas – Part I.” Language and Linguistics Compass 10 (8): 349–69.

Vasishth, Shravan, Katja Suckow, Richard L. Lewis, and Sabine Kern. 2011. “Short-Term
Forgetting in Sentence Comprehension: Crosslinguistic Evidence from Head-Final Structures.”
Language and Cognitive Processes 25: 533–67.

Vasishth, S., L. A. Jaeger, and B. Nicenboim. 2017. “Feature overwriting as a finite mixture
process: Evidence from comprehension data.” In Proceedings of MathPsych/ICCM
Conference. Warwick, UK. https://arxiv.org/abs/1703.04081.

Wahn, Basil, Daniel P. Ferris, W. David Hairston, and Peter König. 2016. “Pupil Sizes Scale
with Attentional Load and Task Experience in a Multiple Object Tracking Task.” PLOS ONE 11
(12): e0168087. https://doi.org/10.1371/journal.pone.0168087.
Yadav, Himanshu, Dario Paape, Garrett Smith, Brian W. Dillon, and Shravan Vasishth. 2022.
“Individual Differences in Cue Weighting in Sentence Comprehension: An evaluation using
Approximate Bayesian Computation.” Open Mind.
https://doi.org/10.1162/opmi_a_00052.

Yadav, Himanshu, Garrett Smith, and Shravan Vasishth. 2021a. “Feature Encoding Modulates
Cue-Based Retrieval: Modeling Interference Effects in Both Grammatical and Ungrammatical
Sentences.” Proceedings of the Cognitive Science Conference.

Yadav, Himanshu, Garrett Smith, and Shravan Vasishth. 2021b. “Is Similarity-Based
Interference Caused by Lossy Compression or Cue-Based Retrieval? A Computational
Evaluation.” Proceedings of the International Conference on Cognitive Modeling.

Yarkoni, Tal. 2020. “The Generalizability Crisis.” Behavioral and Brain Sciences. Cambridge
University Press, 1–37. https://doi.org/10.1017/S0140525X20001685.

16. For simplicity, we assume that they share the same standard deviation.↩

17. If we don’t remove the intercept, that is, if we use the formula n400 ~ 1 + factor(subj) +
    c_cloze:factor(subj), then with factor(subj) we are going to estimate the deviation between
    the first subject and each of the other subjects.↩

18. The intercept adjustment is often called u0 in statistics books, where the intercept might
    be called α (or sometimes also β0), and thus u1 refers to the adjustment to the slope.
    However, in this book, we start the indexing with 1 to be consistent with the Stan
    language.↩

19. Another source of confusion here is that hyperparameters is also used in the machine
learning literature with a different meaning.↩

20. One could in theory keep going deeper and deeper, defining hyper-hyperpriors etc., but
the model would quickly become impossible to fit.↩

21. This is because an LKJ correlation distribution with a large η corresponds to a correlation
    matrix with values close to zero in the lower and upper triangles.↩

22. Most of the papers mentioned above provide example code using Stan or brms.↩

Chapter 6 The Art and Science of Prior Elicitation
Nothing strikes fear into the heart of the newcomer to Bayesian methods more than the idea of
specifying priors for the parameters in a model. On the face of it, this concern seems like a
valid one; how can one know what the plausible parameter values are in a model before one
has even seen the data?
In reality, this worry is purely a consequence of the way we are normally taught to carry out
data analysis, especially in areas like psychology and linguistics. Model fitting is considered to
be a black-box activity, with the primary concern being whether the effect of interest is
“significant” or “non-significant.” As a consequence of the training that we receive, we learn to
focus on one thing (the p-value) and we learn to ignore the estimates that we obtain from the
model; it becomes irrelevant whether the effect of interest has a mean value of 500 ms (in a
reading study, say) or 10 ms; all that matters is whether it is a significant effect or not. In fact,
the way many scientists summarize the literature in their field is by classifying studies into two
bins: significant and non-significant. There are obvious problems with this classification
method; for example, p = 0.051 might be counted as “marginally” significant, but p = 0.049
is never counted as marginally non-significant. But there will usually not be any important
difference between these two borderline values. Real-life examples of such a binary
classification approach are seen in Phillips, Wagers, and Lau (2011) and Hammerly, Staub,
and Dillon (2019). Because the focus is on significance, we never develop a sense of what the
estimates of an effect are likely to be in a future study. This is why, when faced with a prior-
distribution specification problem, we are misled into feeling like we know nothing about the
quantitative estimates relating to a problem we are studying.

Prior specification has a lot in common with something that physicists call a Fermi problem. As
Von Baeyer (1988) describes it: “A Fermi problem has a characteristic profile: Upon first
hearing it, one doesn’t have even the remotest notion what the answer might be. And one
feels certain that too little information exists to find a solution. Yet, when the problem is broken
down into subproblems, each one answerable without the help of experts or reference books,
an estimate can be made.” Fermi problems in the physics context are situations where one
needs ballpark (approximate) estimates of physical quantities in order to proceed with a
calculation. The name comes from a physicist, Enrico Fermi; he developed the ability to carry
out fairly accurate back-of-the-envelope calculations when working out approximate numerical
values needed for a particular computation. Von Baeyer (1988) puts it well: “Prudent physicists
—those who want to avoid false leads and dead ends—operate according to a long-standing
principle: Never start a lengthy calculation until you know the range of values within which the
answer is likely to fall (and, equally important, the range within which the answer is unlikely to
fall).” As in physics, so in data analysis: as Bayesians, we need to acquire the ability to work
out plausible ranges of values for parameters. This is a learnable skill, and improves with
practice. With time and practice, we can learn to become prudent data analysts.

As Spiegelhalter, Abrams, and Myles (2004) point out, there is no one “correct” prior
distribution. One consequence of this fact is that a good Bayesian analysis always takes a
range of prior specifications into account; this is called a sensitivity analysis. We have already
seen examples of this, but more examples will be provided in this and later chapters.

Prior specification requires the estimation of probabilities. Human beings are not good at
estimating probabilities, because they are susceptible to several kinds of biases (Kadane and
Wolfson 1998; Spiegelhalter, Abrams, and Myles 2004). We list the most important ones that
are relevant to cognitive science applications:

Availability bias: Events that are more salient to the researcher are given higher
probability, and events that are less salient are given lower probability.
Adjustment and anchoring bias: The initial assessment of the probability of an event can
influence one’s subsequent judgements. An example is credible intervals: a researcher’s
estimate of the credible interval will tend to be influenced by their initial assessment.
Overconfidence: When eliciting credible intervals from oneself, there is a tendency to
specify too tight an interval.
Hindsight bias: If one relies on the data to come up with a prior for the analysis of that
very same data set, one’s assessment is likely to be biased.

Although training can improve the natural tendency to be biased in these different ways, one
must recognize that bias is inevitable when eliciting priors, either from oneself or from other
experts; it follows that one should always define “a community of priors” (Kass and
Greenhouse 1989): one should consider the effect of informed as well as skeptical or agnostic
(uninformative) priors on the posterior distribution of interest. Incidentally, bias is not unique to
Bayesian statistics; the same problems arise in frequentist data analysis. Even in frequentist
analyses, the researcher always interprets the data in the light of their prior beliefs; the data
never really “speak for themselves.” For example, the researcher might remove “outliers”
based on a belief that certain values are implausible; or the researcher will choose a particular
likelihood based on their belief about the underlying generative process. All these are
subjective decisions made by the researcher, and can dramatically impact the outcome of the
analyses.
The great advantage that Bayesian methods have is that they allow us to formally take a
range of (competing) prior beliefs into account when interpreting the data. We illustrate this
point in the present chapter.

6.1 Eliciting priors from oneself for a self-paced reading study: A simple example

In section 3.5, we have already encountered a sensitivity analysis; there, several priors were
used to investigate how the posterior is affected. Here is another example of a sensitivity
analysis; the problem here is how to elicit priors from oneself for a particular research
problem.

6.1.1 An example: English relative clauses

We will work out priors from first principles for a commonly-used experiment design in
psycholinguistics. As an example, consider English subject vs. object relative clause
processing differences. Relative clauses are sentences like (1a) and (1b):

(1a) The reporter [who the photographer sent to the editor] was hoping for a good story.
(ORC)

(1b) The reporter [who sent the photographer to the editor] was hoping for a good story. (SRC)

Sentence (1a) is an object relative clause (ORC): the noun reporter is modified by a relative
clause (demarcated in square brackets), and the noun reporter is the object of the verb sent.
Sentence (1b) is a subject relative clause (SRC): the noun reporter is modified by a relative
clause (demarcated in square brackets), but this time the noun reporter is the subject of the
verb sent. Many theories in sentence processing predict that the reading time at the verb sent
will be shorter in English subject vs. object relatives; one explanation is that the dependency
distance between reporter and sent is shorter in subject vs. object relatives (Grodner and
Gibson 2005).

The experimental method we consider here is self-paced reading.23 The self-paced reading
method is commonly used in psycholinguistics as a cheaper and faster substitute for
eye-tracking during reading. The subject is seated in front of a computer screen and is initially
shown a series of broken lines that mask words from a complete sentence. The subject then
unmasks the first word (or phrase) by pressing the space bar. Upon pressing the space bar
again, the second word/phrase is unmasked and the first word/phrase is masked again; see
Figure 6.1. The time in milliseconds that elapses between these two space-bar presses counts
as the reading time for the first word/phrase. In this way, the reading time for each successive
word/phrase in the sentence is recorded. Usually, at the end of each trial, the subject is also
asked a yes/no question about the sentence. This is intended to ensure that the subject is
adequately attending to the meaning of the sentence.


FIGURE 6.1: A moving window self-paced reading task for the sentence “The king is dead.”
Words are unmasked one by one after each press of the space bar.

A classic example of self-paced reading data appeared in Exercise 5.2. A hierarchical model
that we could fit to such data would be the following. In chapter 5, we showed that for reading-
time data, the log-normal likelihood is generally a better choice than a normal likelihood. In the
present chapter, in order to make it easier for the reader to get started with thinking about
priors, we use the normal likelihood instead of the log-normal. In real-life data analysis, the
normal likelihood would be a very poor choice for reading-time data.

The model below has varying intercepts and varying slopes for subjects and for items, but
assumes no correlation between the varying intercepts and slopes. The correlation is removed
in order to compare the posteriors to the estimates from the corresponding frequentist lme4
model. In the model shown below, we use “default” priors that the brm function assumes for
all the parameters. We are only using default priors here as a starting point; in practice, we will
never use default priors for a reported analysis. In the model output below, for brevity we will
only display the summary of the posterior distribution for the slope parameter, which
represents the difference between the two condition means.

data("df_gg05_rc")
df_gg05_rc <- df_gg05_rc %>%
  mutate(c_cond = if_else(condition == "objgap", 1 / 2, -1 / 2))

fit_gg05 <- brm(RT ~ c_cond + (1 + c_cond || subj) +
                  (1 + c_cond || item), df_gg05_rc)

(default_b <- posterior_summary(fit_gg05,
                                variable = "b_c_cond"))

##          Estimate Est.Error Q2.5 Q97.5
## b_c_cond      103      35.9 34.2   173

The estimates from this model are remarkably similar to those from a frequentist linear mixed
model (Bates, Mächler, et al. 2015a):

fit_lmer <- lmer(RT ~ c_cond + (1 + c_cond || subj) +
                   (1 + c_cond || item), df_gg05_rc)
b <- summary(fit_lmer)$coefficients["c_cond", "Estimate"]
SE <- summary(fit_lmer)$coefficients["c_cond", "Std. Error"]
## estimate of the slope and
## lower and upper bounds of the 95% CI:
(lmer_b <- c(b, b - (2 * SE), b + (2 * SE)))

## [1] 102.3  29.9 174.7

The similarity between the estimates from the Bayesian and frequentist models is due to the
fact that default priors, being relatively uninformative, don’t influence the posterior much. This
leads to the likelihood dominating in determining the posteriors. In general, such uninformative
priors on the parameters will show a similar lack of influence on the posterior (Spiegelhalter,
Abrams, and Myles 2004). We can quickly establish this in the above example by using
another uninformative prior:

fit_gg05_unif <- brm(RT ~ c_cond + (1 + c_cond || subj) +
                       (1 + c_cond || item),
                     prior = c(
                       prior(uniform(-2000, 2000), class = Intercept,
                             lb = -2000, ub = 2000),
                       prior(uniform(-2000, 2000), class = b,
                             lb = -2000, ub = 2000),
                       prior(normal(0, 500), class = sd),
                       prior(normal(0, 500), class = sigma)
                     ), df_gg05_rc)

(uniform_b <- posterior_summary(fit_gg05_unif,
                                variable = c("b_c_cond")))

##          Estimate Est.Error Q2.5 Q97.5
## b_c_cond      102      38.7 23.1   178

As shown in Table 6.1, the means of the posteriors from this versus the other two model
estimates shown above all look very similar.

TABLE 6.1: Estimates of the mean difference (with 95% confidence/credible intervals)
between two conditions in a hierarchical model of English relative clause data from Grodner
and Gibson, 2005, using (a) the frequentist hierarchical model, (b) a Bayesian model using
default priors from the brm function, and (c) a Bayesian model with uniform priors.

model          mean  lower  upper
Frequentist     102     30    175
Default prior   103     34    173
Uniform         102     23    178

It is tempting for the newcomer to Bayesian statistics to conclude from Table 6.1 that default
priors used in brms, or uniform priors, are good enough for fitting models. This conclusion
would in general be incorrect. There are many reasons why a sensitivity analysis, which
includes regularizing, relatively informative priors, is necessary in Bayesian modeling. First,
relatively informative, regularizing priors must be considered in many cases to avoid
convergence problems (an example is finite mixture models, presented in chapter 19). In fact,
in many cases the frequentist model fit in lme4 will return estimates, such as ±1 correlation
estimates between varying intercepts and varying slopes, that actually represent
convergence failures (Bates, Kliegl, et al. 2015; Matuschek et al. 2017). In Bayesian models,
unless we use regularizing priors that are at least mildly informative, we will generally face
similar convergence problems. Second, when computing Bayes factors, sensitivity analyses
using increasingly informative priors are vital; see chapter 15 for extensive discussion of this
point. Third, one of the greatest advantages of Bayesian models is that one can formally take
into account conflicting or competing prior beliefs in the model, by eliciting informative priors
from competing experts. Although such a use of informative priors is still rare in cognitive
science, it can be of great value when trying to interpret a statistical analysis.

Given the importance of regularizing, informative priors, we consider next some informative
priors that we could use in the given model. We unpack the process by which we could work
these priors out from existing information in the literature.

Initially, when trying to work out some alternative priors for some parameters of interest, we
might think that we know absolutely nothing about the seven parameters in this model. But, as
in Fermi problems, we actually know more than we realize.

Let’s think about the parameters in the relative clause example one by one. For ease of
exposition, we begin by writing out the model in mathematical form. Here, n is the row id in
the data frame, and the variable c_cond is a sum-coded (±0.5) predictor.

RT_n ∼ Normal(α + u_{subj[n],1} + w_{item[n],1} + c_cond_n ⋅ (β + u_{subj[n],2} + w_{item[n],2}), σ)

where

u_1 ∼ Normal(0, τ_{u1})
u_2 ∼ Normal(0, τ_{u2})
w_1 ∼ Normal(0, τ_{w1})
w_2 ∼ Normal(0, τ_{w2})

The parameters that we need to define priors for are the following: α, β, τ_{u1}, τ_{u2},
τ_{w1}, τ_{w2}, and σ.

6.1.2 Eliciting a prior for the intercept

We will proceed from first principles. Let’s begin with the intercept, α; under the sum-contrast
coding used here, it represents the grand mean reading time in the data set.
Ask yourself: What is the absolute minimum possible reading time? The answer is 0 ms;
reading time cannot be negative. You have already eliminated half the real-number line as
impossible values! Thus, one cannot really say that one knows nothing about the plausible
values of mean reading times. Having eliminated half the real-number line, now ask yourself:
what is a reasonable upper bound on reading time for an English ditransitive verb? Even after
taking into account variations in word length and frequency, one minute (60 seconds) seems
like too long; even 30 seconds seems unreasonably long to spend on a single word. As a first
attempt at an approximation, somewhere between 2500 and 3000 ms might constitute a
reasonable upper bound, with 3000 ms being less likely than 2500 ms.

Now consider what an approximate average reading time for a verb might be. One can arrive
at such a ballpark number by asking oneself how fast one can read an abstract that has, say,
500 words in it. Suppose that we estimate that we can read 500 words in 120 seconds (two
minutes). Then, 120/500 = 0.24 seconds is the time we would spend per word on average;
this is 240 ms per word. Maybe two minutes for 500 words was too optimistic? Let’s adjust the
mean to 300 ms, instead of 240 ms. Such intuition-based judgments can be a valuable
starting point for an analysis, as Fermi showed repeatedly in his work (Von Baeyer 1988). If
one is uncomfortable consulting one’s intuition about average reading times, or even as a
sanity check to independently validate one’s own intuitions, one can look up a review article
on reading that gives empirical estimates (e.g., Rayner 1998).

One could express the above guesses as a normal distribution truncated at 0 ms on the ms
scale, with mean 300 ms and standard deviation 1000 ms. An essential step in such an
estimation procedure is to plot one’s assumed prior distribution graphically to see if it seems
reasonable: Figure 6.2 shows a graphical summary of this prior.

FIGURE 6.2: A truncated normal distribution representing a prior distribution on mean reading
times.
Once we plot the prior, one might conclude that the prior distribution is a bit too widely spread
out to represent mean reading time per word. But for estimating the posterior distribution, it
will rarely be harmful to allow a broader range of values than we strictly consider plausible (the
situation is different when it comes to Bayes factors analyses, as we will see later—there,
widely spread out priors for a parameter of interest can have a dramatic impact on the Bayes
factor test for whether that parameter is zero or not).

Another way to obtain a better feel for what plausible distributions of word reading times might
be to just plot some existing data from published work. Figure 6.3 shows the distribution of
mean reading times from ten published studies.

FIGURE 6.3: The distribution of mean reading times from ten self-paced reading studies.
Although our truncated normal distribution, Normal+(300, 1000), seems like a pretty wild
guess, it actually is not terribly unreasonable given what we observe in these ten published
self-paced reading studies. As shown in Figure 6.3, the distributions of mean reading times in
these different self-paced reading studies from different languages (English, Persian, Dutch,
Hindi, German, Spanish) fall within the prior distribution. The means range from a minimum
value of 464 ms to a maximum value of 751 ms. These values easily lie within the 95%
credible interval for a Normal+(300, 1000): [40, 2458] ms. These 10 studies are not about
relative clauses; but that doesn’t matter, because we are just trying to come up with a prior
distribution on average reading times for a word. We just want an approximate idea of the
range of plausible mean reading times.
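As a quick check, one can simulate from this truncated prior and confirm the 95% credible
interval quoted above; the sketch below is ours, not code from the text, and assumes the
extraDistr package for the truncated normal:

library(extraDistr)
## Draw samples from a Normal(300, 1000) truncated from below at 0:
prior_samples <- rtnorm(100000, mean = 300, sd = 1000, a = 0)
## The implied 95% interval; roughly [40, 2458] ms, as stated above:
quantile(prior_samples, probs = c(0.025, 0.975))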

The above prior specification for the intercept can (and must!) be evaluated in the context of
the model using prior predictive checks. We have already encountered prior predictive checks
in previous chapters; we will revisit them in detail in chapter 7. In the above data set on
English relative clauses, one could check what the prior on the intercept implies in terms of
the data generated by the model (see chapter 5 for examples). As stressed repeatedly
throughout this book, sensitivity analysis is an integral component of Bayesian methodology. A
sensitivity analysis should be used to work out what the impact is of a range of priors on the
posterior distribution.
6.1.3 Eliciting a prior for the slope

Having come up with some potential priors for the intercept, consider next the prior
specification for the effect of relative clause type on reading time; this is the slope β in the
model above. Recall that c_cond is ±0.5 sum coded.

Theory suggests (see Grodner and Gibson 2005 for a review) that subject relatives in English
should be easier to process than object relatives, at the relative clause verb. This means that
a priori, we expect the difference between object and subject relatives to be positive in sign.
What would be a reasonable mean for this effect? We can look at previous research to obtain
some ballpark estimates.

For example, Just and Carpenter (1992) carried out a self-paced reading study on English
subject and object relatives, and their Figure 2 (p. 130) shows that the difference between the
two relative clause types at the relative clause verb ranges from about 10 ms to 100 ms
(depending on working memory capacity differences in different groups of subjects). This is
already a good starting point, but we can look at some other published data to gain more
confidence about the approximate difference between the conditions.

For example, Reali and Christiansen (2007) investigated subject and object relatives in four
self-paced reading studies; in their design, the noun phrase inside the relative clause was
always a pronoun, and they carried out analyses on the verb plus pronoun, not just the verb
as in Grodner and Gibson (2005). We can still use the estimates from this study, because
including a pronoun like “I”, “you”, or “they” in a verb region is not going to increase reading
times dramatically. The hypothesis in Reali and Christiansen (2007) was that because object
relatives containing a pronoun occur more frequently in corpora than subject relatives with a
pronoun, the relative clause verb should be processed faster in object relatives than subject
relatives (this is the opposite to the prediction for the reading times at the relative clause verb
discussed in Grodner and Gibson 2005). The authors report comparisons for the pronoun and
relative clause verb taken together (i.e., pronoun+verb in object relatives and verb+pronoun in
subject relatives). In experiment 1, they report a −57 ms difference between object and
subject relatives, with a 95% confidence interval ranging from −104 to −10 ms. In a second
experiment, they report a difference of −53.5 ms with a 95% confidence interval ranging from
−79 to −28 ms; in a third experiment, the difference was −32 ms [−48, −16]; and in a fourth
experiment, −43 ms [−84, −2]. This range of values gives us a good ballpark estimate of the
magnitude of the effect.

Yet another study involving English relative clauses is by Fedorenko, Gibson, and Rohde
(2006). In this self-paced reading study, Fedorenko and colleagues compared reading times
within the entire relative clause phrase (the relative pronoun and the noun+verb sequence
inside the relative clause). Their data show that object relatives are harder to process than
subject relatives; the difference in means is 460 ms, with a confidence interval [299, 621] ms.
This difference is much larger than in the studies mentioned above, but this is because of the
long region of interest considered—it is well-known that the longer the reading/reaction time,
the larger the standard deviation and therefore the larger the potential difference between
means (Wagenmakers and Brown 2007).

One can also look at adjacent, related phenomena in sentence processing to get a feel for
what the relative clause processing time difference should be. Research on similarity-based
interference is closely related to relative clause processing differences: in both types of
phenomenon, the assumption is that intervening nouns can increase processing difficulty. So
let’s look at some reading studies on similarity-based interference.

In a recent study (Jäger, Engelmann, and Vasishth 2017), we investigated the estimates from
about 80 reading studies on interference. Interference here refers to the difficulty experienced
by the comprehender during sentence comprehension (e.g., in reading studies) when they
need to retrieve a particular word from working memory but other words with similar features
hinder retrieval. The meta-analysis in Jäger, Engelmann, and Vasishth (2017) suggests that
the effect sizes for interference effects range from at most −50 to 50 ms, depending on the
phenomenon (some kinds of interference cause speed-ups, others cause slow-downs; see the
discussion in Engelmann, Jäger, and Vasishth 2020).

Given that the Grodner and Gibson (2005) design can be seen as falling within the broader
class of interference effects (Lewis and Vasishth 2005; Vasishth et al. 2019; Vasishth and
Engelmann 2022), it is reasonable to choose informative priors that reflect this observed range
of interference effects in the literature.

The above discussion gives us some empirical basis for assuming that the object minus
subject relative clause difference in the Grodner and Gibson (2005) study on English could
range from 10 to at most 100 ms or so. Although we expect the effect to be positive, perhaps
we don’t want to pre-judge this before we see the data. For this reason, we could decide on a
Normal(0, 50) prior on the slope parameter in the model. This prior implies that we are 95%
certain that the value of the effect lies between −100 and +100 ms. This prior is specifically
for the millisecond scale, and specifically for the case where the critical region is one word
(the relative clause verb in English).
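The implied 95% range of a Normal(0, 50) prior can be verified in one line (a quick sketch):

## 95% of the prior mass lies within roughly -98 to +98 ms:
qnorm(c(0.025, 0.975), mean = 0, sd = 50)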

In this particular example, it makes sense to assume that large effects like 100 ms are
unlikely; this is so even if we do occasionally see estimates that are even higher than 100 ms
in published data. For example, in Gordon, Hendrick, and Johnson (2001), their experiments
1-4 have very large OR-SR differences at the relative clause verb: 450 ms, 250 ms, 500 ms,
and 200 ms, respectively, with an approximate SE of 50 ms. The number of subjects in the
four experiments were 44, 48, 48, and 68, respectively. Given the other estimates mentioned
above, we would be unwilling to take such large effects seriously because a major reason for
observing overly large estimates in a one-word region of interest would be publication bias
coupled with Type M error (Gelman and Carlin 2014). Published studies in psycholinguistics
are often underpowered, which leads to exaggerated estimates being published (Type M
error). Because big-news effects are encouraged in major journals, overestimates tend to get
published preferentially.24 In recent work (Vasishth, Yadav, et al. 2022), we have shown that
even under the optimistic assumption that effect size is approximately 50 ms, achieving 80%
power in English relative clause studies would require at least 120 subjects (if one takes the
uncertainty of the effect estimate into account, many more subjects would be needed).

Of course, if our experiment is designed so that the critical region constitutes several words,
as in the Fedorenko, Gibson, and Rohde (2006) study, then one would have to choose a prior
with a larger mean and standard deviation.

Box 6.1 The scale of the parameter must be taken into account when eliciting a prior

A related, important issue to consider when defining priors is the scale on which the
parameter is defined. For example, if we were analyzing the Grodner and Gibson (2005)
experiment using a log-normal likelihood, then the intercept and slope are on the log
millisecond scale. Vague priors on the intercept and slope parameters can imply rather
strange priors on the millisecond scale. For example, suppose we assume that the
intercept on the log ms scale has the prior Normal(0, 10) and the slope has the prior
Normal(0, 1). On the millisecond scale, these priors on the intercept and slope imply a very
broad range of reading time differences between the two conditions, ranging from a very
large negative value to a very large positive value, which obviously makes little sense:
intercept <- rnorm(100000, mean = 0, sd = 10)
slope <- rnorm(100000, mean = 0, sd = 1)
effect <- exp(intercept + slope / 2) -
  exp(intercept - slope / 2)
quantile(effect, prob = c(0.025, 0.975))

##     2.5%    97.5%
## -8751704  7399342
In this connection, it may be useful to revisit the discussion in section 4.3.2, where we
discussed the effect of prior specification on the log-odds scale and what that implies on
the probability scale.

Box 6.2 Cromwell’s rule

A frequently asked question from newcomers to Bayes is: what if I define too restrictive a
prior? Wouldn’t that bias the posterior distribution? This concern is also raised quite often
by critics of Bayesian methods. The key point here is that a good Bayesian analysis
always involves a sensitivity analysis, and also includes prior and posterior predictive
checks under different priors. One should reject the priors that make no sense in the
particular research problem we are working on, or which unreasonably bias the posterior.
As one gains experience with Bayesian modeling, these concerns will recede as we come
to understand how useful and important priors are for interpreting the data. Chapter 7 will
elaborate on developing a sensible workflow for understanding and interpreting the results
of a Bayesian analysis.

As an extreme example of an overly specific prior, if one were to define a Normal(0, 10)
prior for the α and/or β parameters on the millisecond scale for the Grodner and Gibson
(2005) example above, that would definitely bias the posterior for the parameters. Let’s
check this. Try running this code (the output of the code is suppressed here to conserve
space). In this model, the correlation between the varying intercepts and varying slopes
for subjects and for items are not included; this is only done in order to keep the model
simple.

restrictive_priors <- c(
  prior(normal(0, 10), class = Intercept),
  prior(normal(0, 10), class = b),
  prior(normal(0, 500), class = sd),
  prior(normal(0, 500), class = sigma)
)

fit_restrictive <- brm(RT ~ c_cond + (c_cond || subj) +
                         (c_cond || item),
                       prior = restrictive_priors,
                       # Increase the iterations to avoid warnings
                       iter = 4000,
                       df_gg05_rc)

summary(fit_restrictive)

If you run the above code, you will see that the overly specific (and extremely
unreasonable) priors on the intercept and slope will dominate in determining the posterior;
such priors obviously make no sense. If there is ever any doubt about the implications of a
prior, prior and posterior predictive checks should be used to investigate the implications.

Here, an important Bayesian principle is Cromwell’s rule (Lindley 1991; Jackman 2009):
we should generally allow for some uncertainty in our priors. A prior like Normal(0, 10) or
Normal+(0, 10) on the millisecond scale is clearly overly restrictive given what we’ve
established about plausible values of the relative clause effect from existing data. A more
reasonable but still quite tight prior would be Normal(0, 50). In the spirit of Cromwell’s
rule, just to be conservative, we can consider allowing (in a sensitivity analysis) larger
possible effect sizes by adopting a prior such as Normal(0, 75), and we allow the effect
to be negative, even if theory suggests otherwise.

Although there are no fixed rules for deciding on a prior, a sensitivity analysis will quickly
establish whether the prior or priors chosen are biasing the posterior. One critical thing to
remember related to Cromwell’s rule is that if we categorically rule out a range of values a
priori for a parameter by giving that range a probability of 0, the posterior will also never
include that range of values, no matter what the data show. For example, in the Reali and
Christiansen (2007) experiments, if we had used a truncated prior like Normal+(0, 50),
the posterior can never show the observed negative sign on the effects as reported in the
paper. As a general rule, therefore, one should allow the effect to vary in both directions,
positive and negative. Sometimes unidirectional priors are justified; in those cases, it is of
course legitimate to use them. An example is the prior on standard deviations (which
cannot be negative).

6.1.4 Eliciting priors for the variance components

Having defined the priors for the intercept and the slope, we are left with prior specifications
for the variance component parameters. At least in psycholinguistics, the residual standard
deviation is usually the largest source of variance; the by-subject intercepts’ standard
deviation is usually the next-largest value, and if experimental items are designed to have
minimal variance, then these are usually the smallest components. Here again, we can look at
some previous data to get a sense of what the priors should look like.

For example, we could use the estimates for the variance components from existing studies.
Figure 6.4 shows the empirical distributions from 10 published studies. There are four classes
of variance component: the subject and item intercept standard deviations, the standard
deviations of slopes, and the standard deviations of the residuals. In each case, we can
compute the estimated means and standard deviations of each type of variance component,
and then use these to define normal distributions truncated at 0. The empirically estimated
distributions of the variance components are shown in Figure 6.4. The estimated means and
standard deviations of each type of variance component are as follows:

Subject intercept SDs: estimated mean: 165, estimated standard deviation (sd): 55.
Item intercept SDs: mean: 49, sd: 52.
Slope SDs: mean: 39, sd: 58.
Residual SDs: mean: 392, sd: 140.

The largest standard deviations are those from the subject intercepts and the residual
standard deviation, so these are the ones we will focus on. We can and should orient our
prior for the group-level (also known as random-effects) variance components to subsume
these larger values.

FIGURE 6.4: Histograms of empirical distributions of the different variance components from
ten published studies. The y-axis shows counts rather than density in order to make it clear
that we are working with only a few data sets.
We can now use the equations (4.2) and (4.3) shown in Box 4.1 to work out the means and
standard deviations of a corresponding truncated normal distribution. As an example, we could
assume a prior distribution truncated at 0 from below, and at 1000 ms from above. That is,
a = 0, and b = 1000.

We can write a function that takes the estimated mean and standard deviation of the truncated
distribution, and returns the location and scale of the corresponding parent (untruncated)
distribution (see Box 4.1).
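The body of this function is not shown here; the sketch below is one possible reconstruction,
assuming the truncation bounds a = 0 and b = 1000 given above and a simple numerical
search with optim:

compute_meansd_parent <- function(mean_trunc, sd_trunc, a = 0, b = 1000) {
  ## Mean and sd of a Normal(mu, sigma) truncated to [a, b]:
  trunc_moments <- function(mu, sigma) {
    alpha <- (a - mu) / sigma
    beta <- (b - mu) / sigma
    Z <- pnorm(beta) - pnorm(alpha)
    m <- mu + sigma * (dnorm(alpha) - dnorm(beta)) / Z
    v <- sigma^2 * (1 + (alpha * dnorm(alpha) - beta * dnorm(beta)) / Z
                    - ((dnorm(alpha) - dnorm(beta)) / Z)^2)
    c(m, sqrt(v))
  }
  ## Find the parent location and scale whose truncated moments
  ## match the observed truncated mean and sd:
  loss <- function(par) {
    if (par[2] <= 0) return(1e10)  # rule out non-positive scales
    sum((trunc_moments(par[1], par[2]) - c(mean_trunc, sd_trunc))^2)
  }
  fit <- optim(c(mean_trunc, sd_trunc), loss)
  paste("location:", round(fit$par[1]), "scale:", round(fit$par[2]))
}

For the values used in this chapter, the bounds barely affect the moments, which is why the
returned location and scale end up close to the truncated mean and sd.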

The largest variance component among the group-level effects (that is, all variance
components other than the residual standard deviation) is the by-subjects intercept. One can
compute the location and scale of the parent distribution that would generate the observed
mean and standard deviation of the subject-level estimates:

## Subject intercept SDs:
compute_meansd_parent(mean_trunc = intsubjmean, sd_trunc = intsubjsd)

## [1] "location: 165 scale: 55"

The corresponding truncated distribution is shown in Figure 6.5.


FIGURE 6.5: A truncated normal distribution (with location 165 and scale 55) representing an
empirically derived prior distribution for the parameter for the by-subjects intercept adjustment
in a hierarchical model.

The prior shown in Figure 6.5 looks a bit too restrictive; it could well happen that in a future
study the by-subject intercept standard deviation is closer to 500 ms. Taking Cromwell’s rule
into account, one could widen the scale parameter of the truncated normal to, say, 200. The
result is shown in Figure 6.6.

FIGURE 6.6: A truncated normal distribution (with location 165 and scale 200) representing an
empirically derived prior distribution for the parameter for the by-subjects intercept adjustment
in a hierarchical model taking Cromwell’s rule into account.
Figure 6.6 does not look too unreasonable as an informative prior for this variance component.
This prior will also serve us well for all the other group-level effects (the random intercept for
items, and the random slopes for subject and item), which will have smaller values.
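The truncated densities shown in Figures 6.5 and 6.6 can be drawn directly; a minimal
sketch, assuming the extraDistr package for the dtnorm function:

library(extraDistr)
## Density of a Normal(165, 200) truncated from below at 0 (cf. Figure 6.6):
curve(dtnorm(x, mean = 165, sd = 200, a = 0),
      from = 0, to = 1000,
      xlab = "standard deviation (ms)", ylab = "density")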

Finally, the prior for the residual standard deviation is going to have to allow a broader range
of larger values:

compute_meansd_parent(mean_trunc = resmean, sd_trunc = ressd)

## [1] "location: 391 scale: 142"

Figure 6.7 shows a plausible informative prior derived from the empirical estimates.

FIGURE 6.7: A truncated normal distribution representing an empirically derived prior
distribution for the parameter for the residual standard deviation in a hierarchical model.
We stress again that Cromwell’s rule should generally be kept in mind: it is usually better to
allow a little more uncertainty than seems warranted than to use too tight a prior. An overly
tight prior will ensure that the posterior is entirely driven by the prior. Again, prior predictive
checks should be an integral part of the process of establishing a sensible set of priors for the
variance components. This point about prior predictive checks will be elaborated on with
examples in chapter 7.

We now apply the relatively informative priors we came up with above to analyze the Grodner
and Gibson (2005) data. Applying Cromwell’s rule, we allow for a bit more uncertainty than our
existing empirical data suggest.

Specifically, we could choose the following informative priors for the Grodner and Gibson
(2005) data (a sketch of how to specify them in brms follows this list):

The intercept: α ∼ Normal(500, 100)
The slope: β ∼ Normal(50, 50)
All group-level variance components: τ_u, τ_w ∼ Normal+(165, 200)
The residual standard deviation: σ ∼ Normal+(391, 200)
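As a rough sketch, these priors could be passed to brm as shown below; the call is our
illustration rather than code from the text. In brms, priors of class sd and sigma are
automatically truncated at zero, so the normal() priors act as Normal+; setting
sample_prior = "only" makes the model sample from the priors alone, which is what is
needed for the prior predictive checks discussed next.

informative_priors <- c(
  prior(normal(500, 100), class = Intercept),
  prior(normal(50, 50), class = b),
  ## truncated at 0 by brms:
  prior(normal(165, 200), class = sd),
  prior(normal(391, 200), class = sigma)
)
fit_gg05_prior <- brm(RT ~ c_cond + (1 + c_cond || subj) +
                        (1 + c_cond || item),
                      prior = informative_priors,
                      sample_prior = "only",
                      data = df_gg05_rc)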

The first step is to check whether the prior predictive distribution makes sense. Figure 6.8
shows that the prior predictive distributions are not too implausible, although they could be
improved further. One big problem is the normal distribution assumed in the model; a log-
normal distribution captures the shape of the distribution of the Grodner and Gibson (2005)
data better than a normal distribution. The discrepancy between the Grodner and Gibson
(2005) data and our prior predictive distribution implies that we might be using the wrong
likelihood. Another problem is that the reading times in the prior predictive distribution can be
negative—this is also a consequence of our using the wrong likelihood. As an exercise, fit a
model with a log-normal likelihood and informative priors based on previous data. When using
a log-normal likelihood, the prior for the slope parameter obviously has to be on the log scale.
Therefore, we will need to define an informative prior on the log scale for the slope parameter.
For example, consider the following prior on the slope: Normal(0.12, 0.04). Here is how to
interpret this on the millisecond scale: assuming a mean reading time of 6 log ms, this prior
roughly corresponds to an effect size on the millisecond scale with a 95% credible interval
ranging from 16 ms to 81 ms. Review section 3.7.2 if you have forgotten how this
transformation was done.
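The back-transformation can be verified in a couple of lines (a sketch; the intercept value of
6 log ms is the assumption stated above):

## 95% interval of the Normal(0.12, 0.04) prior on the log scale:
b <- qnorm(c(0.025, 0.975), mean = 0.12, sd = 0.04)
## Implied effect on the millisecond scale; roughly 16 to 81 ms:
exp(6 + b / 2) - exp(6 - b / 2)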

For now, because our running example uses a normal likelihood on reading times in
milliseconds, we can retain these priors.


FIGURE 6.8: Prior predictive distributions from the model (using a normal likelihood) to be
used for the Grodner and Gibson data analysis. The panels show eight prior predictive
distributions.

The sensitivity analysis could then be displayed, showing the posteriors under different prior
settings. Figures 6.9 and 6.10 show the posteriors under two distinct sets of priors.
[Figure 6.9: posterior medians and 80% intervals for b_Intercept, b_c_cond,
sd_item__Intercept, sd_item__c_cond, sd_subj__Intercept, sd_subj__c_cond, and sigma,
under uniform priors.]

FIGURE 6.9: Posterior distributions of parameters for the English relative clause data, using
uniform priors (Uniform(0, 2000)) on the intercept and slope.
[Figure 6.10: posterior medians and 80% intervals for the same parameters, under
informative priors.]

FIGURE 6.10: Posterior distributions of parameters for the English relative clause data, using
relatively informative priors on the intercept and slope.
What can one do if one doesn’t know absolutely anything about one’s research problem? An
example is the power posing data that we encountered in chapter 4, in an exercise in section
4.6. There, we investigated the change in testosterone levels after the subject was asked to
adopt either a high power pose or a low power pose (a between-subjects design). Not being
experts in this domain, we may find ourselves stumped for priors. In such a situation, it could
be defensible to use uninformative priors like Cauchy(0, 2.5), at least initially. However, as
discussed in a later chapter, if one is committed to doing a Bayes factor analysis, then we are
obliged to think carefully about plausible a priori values of the effect. This would require
consulting one or more experts or reading the literature on the topic to obtain ballpark
estimates. An exercise at the end of this chapter will elaborate on this idea. We turn next to
the topic of eliciting priors from experts.

6.2 Eliciting priors from experts

It can happen that one is working on a research problem where either our own prior
knowledge is lacking, or we need to incorporate a range of competing prior beliefs into the
analysis. In such situations, it becomes important to elicit priors from experts other than
oneself. Although informal elicitation can be a perfectly legitimate approach, there does exist a
well-developed methodology for systematically eliciting priors in Bayesian statistics (O’Hagan
et al. 2006).

The particular method developed by O’Hagan and colleagues comes with an R package called
SHELF, which stands for the Sheffield Elicitation Framework; the method was developed by
statisticians at the University of Sheffield, UK. SHELF is available from
http://www.tonyohagan.co.uk/shelf/. This framework comes with a detailed set of instructions
and a fixed procedure for eliciting distributions. It also provides detailed guidance on
documenting the elicitation process, thereby allowing a full record of the elicitation process to
be created. Creating such a record is important because the elicitation procedure needs to be
transparent to a third party reading the final report on the data analysis.

The SHELF procedure works as follows. There is a facilitator and an expert (or a group of
experts; we will consider the single expert case here, but one can easily extend the approach
to multiple experts).

A pre-elicitation form is filled out by the facilitator in consultation with the expert. This form
sets the stage for the elicitation exercise and records some background information, such
as the nature of the expertise of the assessor.
Then, an elicitation method is chosen. Simple methods are the most effective. One good
approach is the quartile method. The expert first decides on a lower and upper limit of
possible values for the quantity to be estimated. Because the lower and upper bounds are
elicited before the median, this minimizes the effects of the “anchoring and adjustment
heuristic” (O’Hagan et al. 2006), whereby experts tend to anchor their subsequent
estimates of quartiles based on their first judgement of the median. Following this, a
median value is decided on, and lower and upper quartiles are elicited. The SHELF
package has functions to display these quartiles graphically, allowing the expert to adjust
them at this stage if necessary. It is important for the expert to confirm that, in their
judgement, the four partitioned regions that result have equal probability.
The elicited distribution is then displayed as a density plot (several choices of probability
density functions are available, but we will usually use the normal or the truncated normal
in this chapter); this graphical summary serves to give feedback to the expert. The
parameters of the distribution are also displayed. Once the expert agrees to the final
density, the parameters can be considered the expert’s judgement regarding the prior
distribution of the quantity being estimated. One can consult multiple experts and either combine their
judgements into one prior, or consider each expert’s prior separately in a sensitivity
analysis.

When eliciting priors from more than one expert, one can elicit the priors separately and then
use the priors separately in a sensitivity analysis. This approach takes each individual expert’s
opinion in interpreting the data and can be a valuable sensitivity analysis (for an example from
psycholinguistics, see the discussion surrounding Table 2.2 on p. 47 in Vasishth and
Engelmann 2022). Alternatively, one can pool the priors together (see Spiegelhalter, Abrams,
and Myles 2004 for discussion) and create a single consensus prior; this would amount to an
average of the differing opinions about prior distributions. A third approach is to elicit a
consensus prior by bringing all the experts together and eliciting a prior from the group in a
single setting. Of course, these approaches are not mutually exclusive. One of the hallmark
properties of Bayesian analysis is that the posterior distribution of the parameter of interest
can be investigated in light of differing prior beliefs and the data (and of course the model).
Box 6.3 illustrates a simple elicitation procedure involving two experts; the example is adapted
from the SHELF package’s vignette.

Box 6.3 Example: prior elicitation using SHELF

An example of prior elicitation using SHELF is shown below. This example is adapted from
the SHELF vignette.
Suppose that two experts are consulted separately. The question asked of the experts is
what they think the plausible values of a parameter X are. The parameter X can be seen
as a percentage; so, it ranges from 0 to 100.

Step 1: Elicit quartiles and median from each expert.

Expert A states that P(X < 30) = 0.25, P(X < 40) = 0.5, P(X < 50) = 0.75.
Expert B states that P(X < 20) = 0.25, P(X < 25) = 0.5, P(X < 35) = 0.75.

Step 2: Fit the implied distributions for each expert’s judgements and plot the
distributions, along with a pooled distribution (the linear pool in the figure).

library(SHELF)

elicited <- matrix(c(30, 20, 0.25,
                     40, 25, 0.5,
                     50, 35, 0.75),
                   nrow = 3, ncol = 3, byrow = TRUE)
dist_2expr <- fitdist(vals = elicited[, 1:2],
                      probs = elicited[, 3],
                      lower = 0, upper = 100)
plotfit(dist_2expr, lp = TRUE, returnPlot = TRUE) +
  scale_color_grey()

FIGURE 6.11: Visualizing priors elicited from two experts for a parameter X representing
a percentage ranging from 0 to 100.
Step 3: Then bring the two experts together and elicit a consensus distribution.

Suppose that the experts agree that P(X < 25) = 0.25, P(X < 30) = 0.5,
P(X < 40) = 0.75. The consensus distribution is then:

elicited <- matrix(c(25, 0.25,
                     30, 0.5,
                     40, 0.75),
                   nrow = 3, ncol = 2, byrow = TRUE)
dist_cons <- fitdist(vals = elicited[, 1],
                     probs = elicited[, 2],
                     lower = 0, upper = 100)
plotfit(dist_cons, ql = 0.05, qu = 0.95, returnPlot = TRUE)
[Figure 6.12: the fitted consensus density; the best-fitting distribution is
Log T(3.43, 0.311), df = 3.]

FIGURE 6.12: Visualizing a consensus prior from two experts for a parameter X
representing a percentage ranging from 0 to 100.


Step 4: Give feedback to the experts by showing them the 5th and 95th percentiles, and
check that these bounds match their beliefs. If not, then repeat the above steps.

feedback(dist_cons, quantiles = c(0.05, 0.95))

## $fitted.quantiles
##      normal     t gamma lognormal logt beta hist
## 0.05   12.5  7.49  16.2      17.3 14.8 15.2    5
## 0.95   50.4 55.10  53.2      55.3 64.1 51.2   88
##      mirrorgamma mirrorlognormal mirrorlogt
## 0.05        10.5            9.18       2.08
## 0.95        49.1           48.60      52.10
##
## $fitted.probabilities
##    elicited normal     t gamma lognormal  logt  beta hist
## 25     0.25  0.288 0.289 0.279     0.274 0.275 0.283 0.25
## 30     0.50  0.451 0.453 0.461     0.466 0.469 0.456 0.50
## 40     0.75  0.772 0.774 0.769     0.767 0.768 0.770 0.75
##    mirrorgamma mirrorlognormal mirrorlogt
## 25       0.292           0.295      0.296
## 30       0.446           0.444      0.447
## 40       0.772           0.773      0.775

6.3 Deriving priors from meta-analyses

Meta-analysis has been used widely in clinical research (Higgins and Green 2008; Sutton et
al. 2012; DerSimonian and Laird 1986; Normand 1999), but it is used relatively rarely in
psychology and linguistics. Random-effects meta-analysis (discussed in a later chapter in
detail) is an especially useful tool in cognitive science.

Meta-analysis is not a magic bullet; this is because of publication bias—usually only
(supposedly) newsworthy results are published, leading to a skewed picture of the effects. As
a consequence, meta-analysis will yield biased estimates; but these estimates can still tell us
something about what we know so far from published studies, if only that the studies are too
noisy to be interpretable. Nevertheless, some prior information is better than no information.
As long as one remains aware of the limitations of meta-analysis, one can still use it
effectively to study one’s research questions.

We begin with observed effects y_n (e.g., estimated difference between two conditions) and
their estimated standard errors (SEs); the SEs serve as an indication of the precision of the
estimate, with larger SEs implying a low-precision estimate. Once we have collected the
observed estimates (e.g., from published studies), we can define an assumed underlying
generative process whereby each study n = 1, …, N has an unknown true mean ζ_n:

y_n ∼ Normal(ζ_n, SE_n)

A further assumption is that each unknown true mean ζ_n in each study is generated from a
distribution that has some true overall mean ζ, and standard deviation τ. The standard
deviation τ reflects between-study variation, which could be due to different subjects being
used in each study, different lab protocols, different methods, different languages being
studied, etc.

ζ_n ∼ Normal(ζ, τ)

This kind of meta-analysis is actually the familiar hierarchical model we have already
encountered in chapter 5. As in hierarchical models, hyperpriors have to be defined for ζ and
τ. A useful application of this kind of meta-analysis is to derive a posterior distribution for ζ
based on the available evidence; this posterior can be used (e.g., with a normal
approximation) as a prior for a future study.

A simple example is the published data on Chinese relative clauses; the data are a selection
from Vasishth (2015); two estimates for which standard errors could not be computed from the
reported statistical summaries have been removed. Table 6.2 shows the mean difference
between the object and subject relative, along with the standard error, that was derived from
published reading studies on Chinese relatives.
TABLE 6.2: The difference between object and subject relative clause reading times (effect),
along with their standard errors (SE), from different published reading studies on Chinese
relative clauses.

study.id  study                     y     se
       1  Hsiao et al 03         50.0   25.0
       4  Vas. et al 13, E2      82.6   41.2
       5  Vas. et al 13, E3    −109.4   54.8
       6  Jaeg. et al 15, E1     55.6   65.1
       7  Jaeg. et al 15, E2     81.9   36.3
       9  Wu 09                  50.0   23.0
      10  Qiao et al 11, E1     −70.0   42.0
      11  Qiao et al 11, E2       6.2   19.9
      12  Lin & Garn. 11, E1   −100.0   30.0
      14  Chen et al 08          75.0   35.5
      15  C Lin & Bev. 06       100.0   80.0

Suppose that we want to do a new study investigating the difference between object and
subject relative clauses, and suppose that in the sensitivity analysis, one of the priors we want
is an empirically justifiable informative prior. Of course, the sensitivity analysis will also contain
uninformative priors; we have seen examples of such priors in the previous chapters.

We can derive an empirically justified prior by conducting a group-level effects meta-analysis.
We postpone discussion of how exactly to fit such a model to chapter 13; here, we simply
report the posterior distribution of the overall effect ζ based on the prior data, ignoring the
details of model fitting.
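For readers who want a preview of chapter 13, such a random-effects meta-analysis can be
expressed in brms with a one-line formula; the sketch below is our illustration (the data frame
name df_chineseRC and the vague priors are assumptions, not taken from the text):

## y | se(se) tells brm that each observed effect y comes with a known
## standard error; (1 | study_id) gives each study its own true mean.
fit_ma <- brm(y | se(se) ~ 1 + (1 | study_id),
              data = df_chineseRC,
              prior = c(prior(normal(0, 100), class = Intercept),
                        prior(normal(0, 100), class = sd)))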

FIGURE 6.13: The posterior distribution of the difference between object and subject relative
clause processing in Chinese relative clause data, computed from a random-effects meta-
analysis using published Chinese relative clause data from reading studies.
The posterior distribution of ζ is shown in Figure 6.13. What we can derive from this posterior
distribution of ζ is a normal approximation that represents what we know so far about Chinese
relatives, based on the available data. The key here is the word “available”; almost certainly
there exist studies that were inconclusive and were therefore never published. The published
record is always biased because of the nature of the publication game in science (only
supposedly newsworthy results get published).

The mean of the posterior is 20 ms, and the width of the 95% credible interval is
23 − 2 = 21 ms. Since the 95% credible interval has a width that is approximately four times
the standard deviation (assuming a normal distribution), we can work out the standard
deviation by dividing the width by four: 5.25. Given these estimates, we could use a normal
distribution with mean 20 and standard deviation 5.25 as an informative prior in a sensitivity
analysis. Notice that if the credible interval were to range from −10 to 24 ms, the width would
be 34 ms, not 14 (we have noticed that sometimes people are confused by the negative sign).

As an example, we will analyze a data set from Gibson and Wu (2013) that was not part of the
above meta-analysis, and we will use the meta-analysis posterior as an informative prior. First,
we load the data and sum-code the predictor:

data("df_gibsonwu")
df_gibsonwu <- df_gibsonwu %>%
  mutate(c_cond = if_else(type == "obj-ext", 1 / 2, -1 / 2))

Because we will now use a log-normal likelihood for the reading time data, we need to work
out what the meta-analysis posterior of ζ corresponds to on the log scale. The grand mean
reading time of the Gibson and Wu (2013) data on the log scale is 6.1. In order to arrive at
approximately the mean difference of 20 ms, the log-scale value of the mean difference would
be 0.041, with a 95% credible interval [−0.079, 0.145], which implies a standard deviation of
(0.145 − (−0.079))/4 = 0.056; in the analyses below we use a somewhat wider (and therefore
less informative) standard deviation of 0.2. The calculations are shown below (see Box 4.3 in
chapter 4 for an explanation of a model with a log-normal likelihood).

int <- mean(log(df_gibsonwu$rt))
b <- 0.041
exp(int + b / 2) - exp(int - b / 2)

## [1] 17.6

lower <- -0.079
exp(int + lower / 2) - exp(int - lower / 2)

## [1] -33.9

upper <- 0.145
exp(int + upper / 2) - exp(int - upper / 2)

## [1] 62.2
As always, we will do a sensitivity analysis, using uninformative priors on the slope parameter
(Normal(0, 1)), as well as the meta-analysis prior.

## uninformative priors on the parameters of interest
## and on the variance components:
fit_gibsonwu_log <- brm(rt ~ c_cond +
    (c_cond | subj) +
    (c_cond | item),
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = b),
    prior(normal(0, 1), class = sigma),
    prior(normal(0, 1), class = sd),
    prior(lkj(2), class = cor)
  ),
  data = df_gibsonwu
)

## meta-analysis priors:
fit_gibsonwu_ma <- brm(rt ~ c_cond +
    (c_cond | subj) +
    (c_cond | item),
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0.041, 0.2), class = b),
    prior(normal(0, 1), class = sigma),
    prior(normal(0, 1), class = sd),
    prior(lkj(2), class = cor)
  ),
  data = df_gibsonwu
)
A summary of the posteriors (means and 95% credible intervals) under the Normal(0, 1) and
the meta-analysis prior is shown in Table 6.3. In this particular case, the posteriors are not
strongly influenced by the two different priors. The differences between the two posteriors are
small, but these differences could in principle lead to different outcomes in a Bayes factor
analysis.

TABLE 6.3: A summary of the posteriors under a relatively uninformative prior and an
informative prior based on a meta-analysis, for the Chinese relative clause data from Gibson
and Wu, 2013.

Priors              Mean    Lower   Upper
Normal(0, 1)        −0.07   −0.18   0.04
Normal(0.041, 0.2)  −0.06   −0.17   0.04

6.4 Using previous experiments’ posteriors as priors for a new study

In a situation where we are attempting to replicate a previous study’s results, we can derive an
informative prior for the analysis of the replication attempt by figuring out a prior based on the
previous study’s posterior distribution. In the previous chapter, we encountered this in one of
the exercises: Given data on Chinese relatives (Gibson and Wu 2013), we want to replicate
the effect with a new data set that has the same design but different subjects. The data from
the replication attempt are from Vasishth et al. (2013).

The first data set from Gibson and Wu (2013) was analyzed in the previous section using
uninformative priors. We can extract the mean and standard deviation of the posterior, and
use that to derive an informative prior for the replication attempt.
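One way to do this extraction is shown in the following sketch; the as_draws_df() function comes from the posterior package and works on brms model objects, and b_c_cond is the name brms assigns to the slope of c_cond:

draws <- as_draws_df(fit_gibsonwu_log)
mean(draws$b_c_cond) # posterior mean of the slope
sd(draws$b_c_cond)   # posterior standard deviation of the slope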

Now, for the replication study, we can use this posterior (with a normal approximation) if we
want to build on what we learned from the original Gibson and Wu (2013) study. As usual, we
will do a sensitivity analysis: one model is fit with an uninformative prior on the parameter of
interest, Normal(0, 1), as in the preceding section; another model is fit with the informative
directional prior Normal(−0.071, 0.209); and, for good measure, we also include a model with a
prior derived from the meta-analysis in the preceding section (the posterior of the ζ
parameter).
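As an illustration, the replication fit with the prior based on the original study’s posterior could be specified as follows (a sketch; df_vasishth13 is a placeholder name for the replication data from Vasishth et al. 2013, assumed to be preprocessed like df_gibsonwu above):

fit_gibsonwu_repl <- brm(rt ~ c_cond + (c_cond | subj) + (c_cond | item),
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(-0.071, 0.209), class = b),
    prior(normal(0, 1), class = sigma),
    prior(normal(0, 1), class = sd),
    prior(lkj(2), class = cor)
  ),
  data = df_vasishth13
)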
TABLE 6.4: A summary of the posteriors under an uninformative prior (Normal(0, 1)), a prior
based on previous data, and a meta-analysis prior, for data from a replication attempt of
Gibson and Wu, 2013.

Priors              Mean    Lower   Upper
Normal(0, 1)        −0.08   −0.21   0.04
Normal(−0.07, 0.2)  −0.08   −0.20   0.03
Normal(0.041, 0.2)  −0.07   −0.19   0.06

Table 6.4 summarizes the different posteriors under the three prior specifications. Again, in this
case, the differences in the posteriors are small, but in a Bayes factor analysis, the outcomes
under these different priors could be different.

6.5 Summary

Working out appropriate priors for one’s research problem is essentially a Fermi problem. One
can use several different strategies for working out priors: introspection, a literature review,
computing statistics from existing data, conducting a meta-analysis, using posteriors from
existing data as priors for a new, closely related study, or formally eliciting priors from domain
experts. If a prior specification is too vague, this can lead to slow convergence or other
convergence problems, and can bias Bayes factors towards the null hypothesis; and if
a prior is too informative, this can bias the posterior. This inherent potential for bias in prior
specification should be formally investigated using sensitivity analyses (with a collection of
uninformative, skeptical, and informative priors of various types), and prior and posterior
predictive checks. Although prior specification seems like a daunting task to the beginning
student of Bayes, with time and experience one can develop a very well-informed set of priors
for one’s research problems.

6.6 Further reading

For interesting (and amusing) examples of Fermi solutions to questions, see
https://what-if.xkcd.com/84/. Two important books, Mahajan (2010) and Mahajan (2014), unpack the art of
approximation in mathematics and other disciplines; the approach presented in these books is
closely related to the art of Fermi-style approximation. Levy (2021) is an important book that
develops the analytical skill needed to figure out what your “tacit knowledge” about a particular
problem is. Tetlock and Gardner (2015) explains how experts deploy existing knowledge to
derive probabilistic predictions (predictions that come with a certain amount of uncertainty)
about real-world problems—this skill is closely related to prior (self-)elicitation. An excellent
presentation of prior elicitation is in O’Hagan et al. (2006). Useful discussions about priors are
provided in Lunn et al. (2012); Spiegelhalter, Abrams, and Myles (2004); Gelman, Simpson,
and Betancourt (2017); and Simpson et al. (2017). The Stan website also includes some
guidelines: prior distributions for rstanarm models at
https://mc-stan.org/rstanarm/articles/priors.html, and prior choice recommendations at
https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations. Browne and Draper
(2006) and Gelman (2006) discuss prior specifications in hierarchical models.

References

Bates, Douglas M, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2015.
“Parsimonious Mixed Models.” arXiv preprint arXiv:1506.04967.

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015a. “Fitting Linear
Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.
https://doi.org/10.18637/jss.v067.i01.

Browne, William J, and David Draper. 2006. “A Comparison of Bayesian and Likelihood-Based
Methods for Fitting Multilevel Models.” Bayesian Analysis 1 (3). International Society for
Bayesian Analysis: 473–514.

DerSimonian, Rebecca, and Nan Laird. 1986. “Meta-Analysis in Clinical Trials.” Controlled
Clinical Trials 7 (3). Elsevier: 177–88.

Engelmann, Felix, Lena A. Jäger, and Shravan Vasishth. 2020. “The Effect of Prominence and
Cue Association in Retrieval Processes: A Computational Account.” Cognitive Science 43 (12):
e12800. https://doi.org/10.1111/cogs.12800.

Fedorenko, Evelina, Edward Gibson, and Douglas Rohde. 2006. “The Nature of Working
Memory Capacity in Sentence Comprehension: Evidence Against Domain-Specific Working
Memory Resources.” Journal of Memory and Language 54 (4). Elsevier: 541–53.

Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models
(Comment on Article by Browne and Draper).” Bayesian Analysis 1 (3). International Society
for Bayesian Analysis: 515–34.
Gelman, Andrew, and John B. Carlin. 2014. “Beyond Power Calculations: Assessing Type S
(Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6). SAGE
Publications: 641–51.

Gelman, Andrew, Daniel Simpson, and Michael J. Betancourt. 2017. “The Prior Can Often
Only Be Understood in the Context of the Likelihood.” Entropy 19 (10): 555.
https://doi.org/10.3390/e19100555.

Gibson, Edward, and H-H Iris Wu. 2013. “Processing Chinese Relative Clauses in Context.”
Language and Cognitive Processes 28 (1-2). Taylor & Francis: 125–55.

Gordon, P. C., Randall Hendrick, and Marcus Johnson. 2001. “Memory Interference During
Language Processing.” Journal of Experimental Psychology: Learning, Memory, and Cognition
27(6): 1411–23.

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic
Input.” Cognitive Science 29: 261–90.

Hammerly, Christopher, Adrian Staub, and Brian Dillon. 2019. “The Grammaticality Asymmetry
in Agreement Attraction Reflects Response Bias: Experimental and Modeling Evidence.”
Cognitive Psychology 110: 70–104.

Higgins, Julian, and Sally Green. 2008. Cochrane Handbook for Systematic Reviews of
Interventions. New York: Wiley-Blackwell.

Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Vol. 846. John Wiley &
Sons.

Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference
in Sentence Comprehension: Literature Review and Bayesian Meta-Analysis.” Journal of
Memory and Language 94: 316–39. https://doi.org/10.1016/j.jml.2017.01.004.

Jäger, Lena A., Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth. 2020.
“Interference Patterns in Subject-Verb Agreement and Reflexives Revisited: A Large-Sample
Study.” Journal of Memory and Language 111. https://doi.org/10.1016/j.jml.2019.104063.

Just, Marcel A., and Patricia A. Carpenter. 1992. “A Capacity Theory of Comprehension:
Individual Differences in Working Memory.” Psychological Review 99(1): 122–49.

Kadane, Joseph, and Lara J Wolfson. 1998. “Experiences in Elicitation: [Read Before the
Royal Statistical Society at a Meeting on ‘Elicitation’ on Wednesday, April 16th, 1997, the
President, Professor A. F. M. Smith in the Chair].” Journal of the Royal Statistical Society:
Series D (The Statistician) 47 (1). Wiley Online Library: 3–19.
Kass, Robert E, and Joel B Greenhouse. 1989. “[Investigating Therapies of Potentially Great
Benefit: ECMO]: Comment: A Bayesian Perspective.” Statistical Science 4 (4). JSTOR: 310–17.

Levy, Dan. 2021. Maxims for Thinking Analytically: The Wisdom of Legendary Harvard
Professor Richard Zeckhauser. Dan Levy.

Lewis, Richard L., and Shravan Vasishth. 2005. “An Activation-Based Model of Sentence
Processing as Skilled Memory Retrieval.” Cognitive Science 29: 1–45.

Lindley, Dennis V. 1991. Making Decisions. Second. John Wiley & Sons.

Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012.
The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.

Mahajan, Sanjoy. 2010. Street-Fighting Mathematics: The Art of Educated Guessing and
Opportunistic Problem Solving. Cambridge, MA: The MIT Press.

Mahajan, Sanjoy. 2014. The Art of Insight in Science and Engineering: Mastering Complexity.
Cambridge, MA: The MIT Press.

Matuschek, Hannes, Reinhold Kliegl, Shravan Vasishth, R. Harald Baayen, and Douglas M
Bates. 2017. “Balancing Type I Error and Power in Linear Mixed Models.” Journal of Memory
and Language 94: 305–15. https://doi.org/10.1016/j.jml.2017.01.001.

Nicenboim, Bruno, Shravan Vasishth, Felix Engelmann, and Katja Suckow. 2018. “Exploratory
and Confirmatory Analyses in Sentence Processing: A Case Study of Number Interference in
German.” Cognitive Science 42 (S4). https://doi.org/10.1111/cogs.12589.

Normand, S.L.T. 1999. “Tutorial in Biostatistics. Meta-Analysis: Formulating, Evaluating,
Combining, and Reporting.” Statistics in Medicine 18 (3): 321–59.

O’Hagan, Anthony, Caitlin E Buck, Alireza Daneshkhah, J Richard Eiser, Paul H Garthwaite,
David J Jenkinson, Jeremy E Oakley, and Tim Rakow. 2006. Uncertain Judgements: Eliciting
Experts’ Probabilities. John Wiley & Sons.

Phillips, Colin, Matthew W. Wagers, and Ellen F. Lau. 2011. “Grammatical Illusions and
Selective Fallibility in Real-Time Language Comprehension.” In Experiments at the Interfaces,
37:147–80. Emerald Bingley, UK.

Rayner, K. 1998. “Eye Movements in Reading and Information Processing: 20 Years of
Research.” Psychological Bulletin 124 (3): 372–422.
Reali, Florencia, and Morten H Christiansen. 2007. “Processing of Relative Clauses Is Made
Easier by Frequency of Occurrence.” Journal of Memory and Language 57 (1). Elsevier: 1–23.

Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G. Martins, and Sigrunn H. Sørbye.
2017. “Penalising Model Component Complexity: A Principled, Practical Approach to
Constructing Priors.” Statistical Science 32 (1): 1–28. https://doi.org/10.1214/16-STS576.

Spiegelhalter, David J, Keith R Abrams, and Jonathan P Myles. 2004. Bayesian Approaches to
Clinical Trials and Health-Care Evaluation. Vol. 13. John Wiley & Sons.

Sutton, Alexander J, Nicky J Welton, Nicola Cooper, Keith R Abrams, and AE Ades. 2012.
Evidence Synthesis for Decision Making in Healthcare. Vol. 132. John Wiley & Sons.

Tetlock, Philip, and Dan Gardner. 2015. Superforecasting: The Art and Science of Prediction.
Crown Publishers.

Vasishth, Shravan. 2015. “A Meta-Analysis of Relative Clause Processing in Mandarin
Chinese Using Bias Modelling.” Master’s thesis, Sheffield, UK: School of Mathematics and
Statistics, University of Sheffield.
http://www.ling.uni-potsdam.de/~vasishth/pdfs/VasishthMScStatistics.pdf.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese
Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public
Library of Science: 1–14.

Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive
Process: A Computational Approach. Cambridge, UK: Cambridge University Press.
https://books.google.de/books?id=6KZKzgEACAAJ.

Vasishth, Shravan, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. 2018a. “The
Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of
Memory and Language 103: 151–75. https://doi.org/10.1016/j.jml.2018.07.004.

Vasishth, Shravan, Bruno Nicenboim, Felix Engelmann, and Frank Burchert. 2019.
“Computational Models of Retrieval Processes in Sentence Processing.” Trends in Cognitive
Sciences 23: 968–82. https://doi.org/10.1016/j.tics.2019.09.003.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample
Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.”
Computational Brain and Behavior.

Von Baeyer, Hans Christian. 1988. “How Fermi Would Have Fixed It.” The Sciences 28 (5).
Blackwell Publishing Ltd Oxford, UK: 2–4.
Wagenmakers, Eric-Jan, and Scott Brown. 2007. “On the Linear Relation Between the Mean
and the Standard Deviation of a Response Time Distribution.” Psychological Review 114 (3).
American Psychological Association: 830.

23. This discussion reuses text from Vasishth, Yadav, et al. (2022).↩

24. See Vasishth et al. (2013); Jäger et al. (2020); Nicenboim, Vasishth, et al. (2018); Vasishth,
Mertzen, Jäger, et al. (2018a) for detailed discussion of this point in the context of
psycholinguistics.↩

Chapter 7 Workflow

Although modern Bayesian analysis tools (such as brms) greatly facilitate Bayesian
computations, the model specification is still (as it should be) the responsibility of the user. In
the previous chapters (e.g., chapter 3), we have outlined some of the steps needed to arrive at
a useful and robust analysis. In this chapter, we bring these ideas together to spell out a
principled approach to developing a workflow. This chapter is based on a recent introduction
of a principled Bayesian workflow to cognitive science (Schad, Betancourt, and Vasishth 2019;
for a revised published version, see Schad, Betancourt, and Vasishth 2020).

Much research has been carried out in recent years to develop tools to ensure robust
Bayesian data analyses (e.g., Gabry et al. 2017; Talts et al. 2018). One of the most recent
end-products of this research has been the formulation of a principled Bayesian workflow for
conducting a probabilistic analysis (Betancourt 2018; Schad, Betancourt, and Vasishth 2019).
This workflow provides an initial coherent set of steps to take for a robust analysis, leaving
room for further improvements and methodological developments. At an abstract level, parts
of this workflow can be applied to any kind of data analysis, be it frequentist or Bayesian, be it
based on sampling or on analytic procedures.

Here, we introduce parts of this principled Bayesian workflow by illustrating its use with
experimental data from a reading-time experiment. Some parts of this principled Bayesian
workflow are specifically recommended when using advanced/non-standard models.

A number of questions should be asked when fitting a model, and several checks should be
performed to validate a probabilistic model. Before going into the details of this discussion, we
first treat the process of model building, and how different traditions have yielded different
approaches to these questions.

7.1 Model building

One strategy for model building is to start with a minimal model that captures just the
phenomenon of interest but not much other structure in the data. For example, this could be a
linear model with just the factor or covariate of main interest. For this model, we perform a
number of checks described in detail in the following sections. If the model passes all checks
and does not show signs of inadequacy, then it can be applied in practice and we can be
confident that the model provides reasonably robust inferences on our scientific question.
However, if the model shows signs of trouble on one or more of these checks, then the model
may need to be improved. Alternatively, we may need to be more modest with respect to our
scientific question. For example, in a repeated measures data set, we may be interested in
estimating the correlation parameter between by-group adjustments (their random effects
correlation) based on a sample of 30 subjects. If model analysis reveals that our sample size
is not sufficiently large to estimate the correlation reliably, then we may need to either increase
our sample size, or give up on our plan.

During the model building process, we make use of an aspirational model MA : we mentally
imagine a model with all the possible details that the phenomenon and measurement process
contain; i.e., we imagine a model that one would fit if there were no limitations in resources,
time, mathematical and computational tools, subjects, and so forth. It would contain all
systematic effects that might influence the measurement process, such as influences of time
or heterogeneity across individuals. The aspirational model should be taken to guide and
inform model development; such a procedure prevents random walks in model space. The
model has to consider both the latent phenomenon of interest as well as the
environment and experiment used to probe it.

In contrast to the aspirational model, the initial model M1 may only contain enough structure
to incorporate the phenomenon of core scientific interest, but none of the additional
aspects/structures relevant for the modeling or measurement. The additional, initially left-out
structures, which reflect the difference between the initial (M1 ) and the aspirational model (
MA ), can then be probed for using specifically designed summary statistics. These summary
statistics can thus inform model expansion from the initial model M1 into the direction of the
aspirational model MA . If the initial model proves inadequate, then the aspirational model
and the associated summary statistics guide model development. If the expanded model is
still not adequate, then another cycle of model development is conducted.

The range of prior and posterior predictive checks discussed in the following sections serves
as a basis for a principled approach to model expansion. The notion of expansion is critical here.
If an expanded model does not prove more adequate, one can always fall back to the previous
model version.

Some researchers suggest an alternative strategy of fitting models with all the group-level
variance components (e.g., by-participant and by-items) allowed by the experimental design
and a full variance covariance matrix for all the group-level parameters (as we did in section
5.2.5). This type of model is sometimes called a “maximal” model (e.g., Barr et al. 2013).
However, this model is maximal within the scope of a linear regression. In section 5.2.6, for
example, we saw distributional models, which are more complex than the so-called maximal
models. However, a maximal model can provide an alternative starting point for the principled
Bayesian workflow. In this case, the focus does not lie so much on model expansion. Instead,
for maximal models the workflow can be used to specify priors encoding domain expertise
(possibly to ensure computational faithfulness and model sensitivity), and to ensure model
adequacy. Some steps in the principled Bayesian workflow (e.g., the computationally more
demanding steps of assessing computational faithfulness and model sensitivity) may even be
performed only for models coded in Stan or only once for a given research program, where
similar designs are repeatedly used. We will explain this in more detail below.

In the maximal model, “maximal” refers to maximal specification of the variance components
within the scope of the linear regression approximation, not maximal with respect to the actual
data generating process. Models that are bound by the linear regression structure cannot
capture effects such as selection bias in the data, dynamical changes in processes across
time, or measurement error. Importantly, the “maximal” models are not the aspirational model,
which is an image of the true data generating process.

Finally, sometimes the results from the Bayesian workflow will show that our experimental
design or data is not sufficient to answer our scientific question at hand. In this case, ambition
needs to be reduced, or new data needs to be collected, possibly with a different experimental
design more sensitive to the phenomenon of interest.

One important development in open science practices is pre-registration of experimental
analyses before the data are collected (Chambers 2019). This can be done using online-
platforms such as the Open Science Foundation or AsPredicted (but see Szollosi et al. 2020).
What information can or should one document in preregistration of the Bayesian workflow? If
one plans on using the maximal model for analysis, then this maximal model, including
contrast coding (Schad et al. 2020) and population- and group-level effects (also known as
fixed and random effects), should be described. In the case of incremental model building, if a model
isn’t a good fit to the data, then any resulting inference will be limited if not useless, so a rigid
preregistration is useless unless one knows exactly what the model is. Thus, the deeper issue
with preregistration is that a model cannot be confirmed until the phenomenon and experiment
are all extremely well understood. One practical possibility is to describe the initial and the
aspirational model, and the incremental strategy used to probe the initial model to move more
towards the aspirational model. This can also include delineation of summary statistics that
one plans to use for probing the tested models. Even if it is difficult to spell out the aspirational
model fully, it can be useful to preregister the initial model, summary statistics, and the
principles one intends to apply in model selection. Although the maximal modeling approach
clearly reflects confirmatory hypothesis testing, the incremental model building strategy
towards the aspirational model may be seen as lying at the boundary between confirmatory
and exploratory, and becomes more confirmatory the more clearly the aspirational model can
be spelled out a priori.

7.2 Principled questions on a model

What characterizes a useful probabilistic model? A useful probabilistic model should be
consistent with domain expertise. Moreover, a useful probabilistic model should be rich
enough to capture the structure of the true data generating process needed to answer
scientific questions. When very complex or non-standard models are developed, there are two
additional requirements that must be met (we will briefly touch upon these in the present
chapter): it is key for the model to allow accurate posterior approximation, and the model must
capture enough of the experimental design to give useful answers to our questions.

So what can we do to ensure that our probabilistic model has these properties? In the
following, we will outline a number of analysis steps to take and questions to ask in order to
improve these properties for our model.

In a first step, we will use prior predictive checks to investigate whether our model is
consistent with our domain expertise. Moreover, posterior predictive checks assess model
adequacy for the given data set; that is, they investigate whether the model captures the
relevant structure of the true data generating process. We will also briefly discuss two
additional steps that are computationally very expensive and can be used, e.g., when coding
advanced/non-standard models, but which are also part of the principled workflow: this
includes investigating computational faithfulness by studying whether posterior estimation is
accurate, and it includes studying model sensitivity, that is, whether we can recover model
parameters with the given design and model.

7.2.1 Prior predictive checks: Checking consistency with domain expertise

The first key question for checking the model is whether the model and the distributions of
prior parameters are consistent with domain expertise. Prior distributions can be selected
based on prior research or plausibility. However, for complex models it is often difficult to know
which prior distributions should be chosen, and what consequences distributions of prior
model parameters have for expected data. A viable solution is to use prior distributions to
simulate hypothetical data from the model and to check whether the simulated data are
plausible and consistent with domain expertise. This approach is often much easier to judge
compared to assessing prior distributions in complex models directly.

In practice, this approach can be implemented by the following steps:

1. Take the prior p(Θ) and randomly draw a parameter set Θpred from it: Θpred ∼ p(Θ)

2. Use this parameter set Θpred to simulate hypothetical data ypred from the model:
ypred ∼ p(y ∣ Θpred )

To assess whether prior model predictions are consistent with domain expertise, it is useful to
compute summary statistics of the simulated data t(ypred ) . The distribution of these summary
statistics can be visualized using, for example, histograms (see Figure 7.1). This can quickly
reveal whether the data falls in an expected range, or whether a substantial amount of
extreme data points are expected a priori. For example, in a study using self-paced reading
times, extreme values may be considered to be reading times smaller than 50 ms or larger
than 2000 ms. Reading times for a word larger than 2000 ms are not impossible, but would be
implausible and largely inconsistent with domain expertise. Experience with reading studies
shows that a small number of observations may actually take extreme values. However, if we
observe a large number of extreme data points in the hypothetical data, and if these are
inconsistent with domain expertise, the priors or the model should be adjusted so that they
yield hypothetical data within the range of reasonable values.


FIGURE 7.1: Prior predictive checks. a) In a first step, define a summary statistic that one
wants to investigate. b) Second, define extremity thresholds (shaded areas), beyond which
one does not expect a lot of data to be observed. c) Third, simulate prior model predictions for
the data (histogram) and compare them with the extreme values (shaded areas).

Choosing good summary statistics is more an art than a science. However, the choice of
summary statistics will be crucial, as they provide key markers of what we want the model to
account for in the data. They should thus be carefully chosen and designed based on the
expectations that we have about the true data generating process and about the kinds of
structures and effects we expect the data may exhibit. Interestingly, summary statistics can
also be used to critique the model: if someone wants to criticize an analysis, then they can
formalize that criticism into a summary statistic they expect to show undesired behavior. Such
criticism can serve as a very constructive way to write reviews in a peer-review setting. Here,
we will show some examples of useful summary statistics below when discussing data
analysis for a concrete example data set.

Choosing good priors will be particularly relevant in cases where the likelihood is not
sufficiently informed by the data (see Figure 7.2, in particular g-i). In hierarchical models, for
example, this often occurs in cases where a “maximal” model is fitted for a small data set that
does not constrain estimation of all group-level effects variance and covariance parameters.25

In such situations, the prior used in a Bayesian analysis (or a more informative prior rather
than a relatively uninformative one) should incorporate just enough domain expertise to
suppress extreme, although not impossible, parameter values. This may allow the model to be fit, as the
posterior is now sufficiently constrained. Thus, introducing prior information in Bayesian
computation allows us to fit and interpret models that cannot be validly estimated using
frequentist tools.

A welcome side-effect of incorporating more domain expertise (into what still constitutes
weakly informative priors) is thus more concentrated prior distributions, which can facilitate
Bayesian computation. This allows more complex models to be estimated; that is, using prior
knowledge can make it possible to fit models that could otherwise not be estimated using the
available tools. In other words, incorporating prior knowledge can allow us to get closer to the
aspirational model in the iterative model building procedure. Moreover, more informative priors
also lead to faster convergence of MCMC algorithms.

FIGURE 7.2: The role of priors for informative and uninformative data. a)-c) When the data
provides good information via the likelihood (b), then a flat uninformative prior (a) is sufficient
to obtain a concentrated posterior (c). d)-f) When the data does not sufficiently constrain the
parameters through the likelihood (e), then using a flat uninformative prior (d) also leaves the
posterior (f) widely spread out. g)-i) When the data does not constrain the parameter through
the likelihood (h), then including domain expertise through an informative prior (g) can help to
constrain the posterior (i) to reasonable values.
Incorporating more domain expertise into the prior also has crucial consequences for
Bayesian modeling when computing Bayes factors (see chapter 15, on Bayes factors).

Importantly, this first step simulates from the prior predictive distribution, which specifies how
the prior interacts with the likelihood. Mathematically, it computes an average (the integral)
over different possible (prior) parameter values. The prior predictive distribution is (also see
chapter 3):
p(ypred ) = ∫ p(ypred , Θ) dΘ = ∫ p(ypred ∣ Θ)p(Θ) dΘ

= ∫ likelihood(ypred ∣ Θ) ⋅ prior(Θ) dΘ

As a concrete example, suppose we assume that our likelihood is a normal distribution with
mean μ and standard deviation σ. Suppose that we now define the following priors on the
parameters: μ ∼ Normal(0, 1), and σ ∼ Uniform(1, 2). We can generate the prior
predictive distribution using the following steps (implemented in the R sketch below):

Do the following 100000 times:
- Take one sample m from a Normal(0, 1) distribution
- Take one sample s from a Uniform(1, 2) distribution
- Generate and save a data point from Normal(m, s)

The generated data constitute the prior predictive distribution.
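A minimal R implementation of these steps (vectorized rather than written as an explicit loop):

n_sims <- 100000
m <- rnorm(n_sims, mean = 0, sd = 1)      # mu ~ Normal(0, 1)
s <- runif(n_sims, min = 1, max = 2)      # sigma ~ Uniform(1, 2)
y_pred <- rnorm(n_sims, mean = m, sd = s) # one data point per draw
hist(y_pred) # the prior predictive distribution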

More complex generative processes involving repeated measures data can also be defined.

7.2.2 Computational faithfulness: Testing for correct posterior approximations

Approximations of posterior expectations can be inaccurate. For example, a computer
program that is designed to sample from a posterior can be erroneous. This could involve an
error in the specification of the likelihood (e.g., caused by an error in the R syntax formula), or
insufficient sampling of the full density of the posterior. The sampler may be biased, sampling
parameter values that are larger or smaller than the true posterior, or the variance of the
posterior samples may be larger or smaller than the true posterior uncertainty. However,
posterior sampling from simple and standard models should work properly in most cases.
Thus, we think that in many applications, a further check of computational faithfulness may be
asking for too much, and might need to be performed only once for a given research program,
where different experiments are rather similar to each other. However, checking computational
faithfulness can become an important issue when dealing with more advanced/non-standard
models (such as those discussed in the later chapters of this book). Here, errors in the
specification of the likelihood can occur more easily.

Given that posterior approximations can be inaccurate, it is important to design a procedure to
test whether the posterior approximation of choice is indeed accurate, e.g., that the software
used to implement the sampling works without errors for the specific problem at hand. This
checking can be performed using simulation-based calibration (SBC; Talts et al. 2018; Schad,
Betancourt, and Vasishth 2019). This is a very simulation-intensive procedure, which can take
a long time to run for considerably complex models and larger data sets. We do not discuss
SBC in detail here, but refer the reader to its later treatment in chapter 18, where SBC is
applied for models coded in Stan directly, as well as to the description in Schad, Betancourt,
and Vasishth (2019).

Assuming that our posterior computations are accurate and faithful, we can take a next step,
namely looking at the sensitivity of the model analyses.

7.2.3 Model sensitivity

What can we realistically expect from the posterior of a model, and how can we check whether
these expectations are justified for the current setup? First, we might expect that the posterior
recovers the true parameters generating the data without bias. That is, when we simulate
hypothetical data based on a true parameter value, we may expect that the posterior mean is
close to the true value. However, for a given model, experimental design, and data set, this
expectation may or may not be justified. Indeed, parameter estimation for some, e.g., non-
linear, models may be biased, such that the true value of the parameter can practically not be
recovered from the data. At the same time, we might expect from the posterior that it is highly
informative with respect to the parameters that generated the data. That is, we may hope for
small posterior uncertainty (a small posterior standard deviation) relative to our prior
knowledge. However, posterior certainty may sometimes be low. Some experimental designs,
models, or data sets may yield highly uninformative estimates, where uncertainty is not
reduced compared to our prior information. This can be the case when we have very little
data, or when the experimental design does not allow us to constrain certain model
parameters; i.e., the model is too complex for the experimental design.

To study model sensitivity, one can investigate two questions about the model:

1. How well does the estimated posterior mean match the true simulating parameter?
2. How much is uncertainty reduced from the prior to the posterior?

To investigate these questions, it is again possible to perform extensive simulation studies.
This is crucial to do for complex, non-standard, or cognitive models, but may be less important
for simpler and more standard models. Indeed, the same set of simulations can be used that
are also used in SBC. Therefore, both analyses can be usefully applied in tandem. Again, here
we skip the details of how these computations can be implemented, and refer the interested
reader to Schad, Betancourt, and Vasishth (2019).
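As a toy illustration of the two questions above, consider the following sketch, which uses a simple normal model rather than the hierarchical models of this chapter: simulate data with a known parameter value, fit the model, and inspect the posterior.

set.seed(42)
true_mu <- 0.5 # known, data-generating value
df_sim <- data.frame(y = rnorm(100, mean = true_mu, sd = 1))
fit_sim <- brm(y ~ 1,
  data = df_sim,
  prior = prior(normal(0, 1), class = Intercept)
)
fixef(fit_sim)
# Question 1: is the posterior mean of the intercept close to true_mu?
# Question 2: is the posterior SD clearly smaller than the prior SD of 1?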
7.2.4 Posterior predictive checks: Does the model adequately capture the data?

“All models are wrong but some are useful.” (Box 1979, 2). We know that our model probably
does not fully capture the true data generating process, which is noisily reflected in the
observed data. Our question therefore is whether our model is close enough to the true
process that has generated the data, and whether the model is useful for informing our
scientific question. To compare the model to the true data generating process (i.e., to the
data), we can simulate data from the model and compare the simulated to the real data. This
can be formulated via a posterior predictive distribution (see chapter 3): the model is fit to the
data, and the estimated posterior model parameters are used to simulate new data.

Mathematically, the posterior predictive distribution is written:

p(ypred ∣ y) = ∫ p(ypred ∣ Θ)p(Θ ∣ y) dΘ

Here, the observed data y is used to infer the posterior distribution over model parameters,
p(Θ ∣ y) . This is combined with the model or likelihood function, p(ypred ∣ Θ) , to yield new,
now simulated, data, ypred . The integral ∫ dΘ indicates averaging across different possible
values for the posterior model parameters (Θ).

As mentioned in chapter 3, we can’t evaluate this integral exactly: Θ can be a vector of many
parameters, making this a very complicated integral with no analytical solution. However, we
can approximate it using sampling. We can now use each of the posterior samples as
parameters to simulate new data from the model. This procedure then approximates the
integral and yields an approximation to the posterior predictive distribution.

To summarize, in the posterior predictive distribution, the model is fit to the data, and the
estimated posterior model parameters are used to simulate new data. Critically, the question
then is how close the simulated data is to the observed data.
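In brms, this sampling-based approximation is available through the posterior_predict() function; the following sketch uses the fit_gibsonwu_log model from the previous chapter:

y_pred <- posterior_predict(fit_gibsonwu_log)
dim(y_pred) # number of posterior draws x number of observations
# each row of y_pred is one simulated data set, generated from
# one posterior draw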

One approach is to use features of the data that we care about, and to test how well the model
can capture these features. Indeed, we had already defined summary statistics in the prior
predictive checks. We can now compute these summary statistics for the data simulated from
the posterior predictive distribution. This will yield a distribution for each summary statistic. In
addition, we compute the summary statistic for the observed data, and can now check whether
the data falls within the distribution of the model predictions (cf. Figure 7.3a), or whether the
model predictions are far from the observed data (see Figure 7.3b). If the observed data is
similar to the posterior-predicted data, then this supports model adequacy. If we observe a
large discrepancy, then this indicates that our model likely is missing some important structure
of the true process that has generated the data, and that we have to use our domain expertise
to further improve the model. Alternatively, a large discrepancy can be due to the data being
an extreme observation, which was nevertheless generated by the process captured in our
model. In general, we can’t discriminate between these two possibilities. Consequently, we
have to use our best judgement as to which possibility is more relevant, in particular changing
the model only if the discrepancy is consistent with a known missing model feature.


FIGURE 7.3: Posterior predictive checks. Compare posterior model predictions (histogram)
with observed data (vertical line) for a specific summary statistic, t(y). a) This displays a case
where the observed summary statistic (vertical line) lies within the posterior model predictions
(histogram). b) This displays a case where the summary statistic of the observed data (vertical
line) lies clearly outside of what the model predicts a posteriori (histogram).

7.3 Exemplary data analysis

We perform an exemplary analysis of a data set from Gibson and Wu (2013), a data set that
we have already encountered in previous chapters. The methodology they used is the familiar
self-paced reading method. Gibson and Wu (2013)
collected self-paced reading data using Chinese relative clauses. Relative clauses are
sentences like: The student who praised the teacher was very happy. Here, the head noun,
student, is modified by the relative clause who praised the teacher, and the head noun is the
subject of the relative clause as well: the student praised the teacher. Such relative clauses are called
subject relatives. By contrast, one can also have object relative clauses, where the head noun
is modified by a relative clause which takes the head noun as an object. An example is: The
student whom the teacher praised was very happy. Here, the teacher praised the student.
Gibson and Wu (2013) were interested in testing the hypothesis that Chinese shows an object
relative (OR) processing advantage compared to the corresponding subject relative (SR). The
theoretical reason for this processing advantage is that, in Chinese, the distance (which can
be approximated by counting the number of words intervening) between the relative clause
verb (praised) and the head noun is shorter in ORs than in SRs. This prediction arises because,
unlike English, the relative clause appears before the head noun in Chinese; see Gibson and
Wu (2013) for a detailed explanation.

Their experimental design had one factor with two levels: (i) object relative sentences, and (ii)
subject relative sentences. We use sum coding (−1, +1) for this factor, which we call so, an
abbreviation for subject-object. Following Gibson and Wu (2013), we analyze reading time on
the target word, which was the head noun of the relative clause. As mentioned above, in
Chinese, the head noun appears after the relative clause. By the time the subject reads the
head noun, they already know whether they are reading a subject or an object relative.
Because the distance between the relative clause verb and the head noun is shorter in
Chinese object relatives compared to subject relatives, reading the head noun is expected to
be easier in object relatives.

The data set contains reading time measurements in milliseconds from 37 subjects and from
15 items (there were 16 items originally, but one item was removed during analysis). The
design is a classic repeated measures Latin square design.

7.3.1 Prior predictive checks

The first step in Bayesian data analysis is to specify the statistical model and the priors for the
model parameters. As a statistical model, we use what is called the maximal model (Barr et al.
2013) for the design. Such a model includes population-level effects for the intercept and the
slope (coded using sum contrast coding: +1 for object relatives, and −1 for subject relatives),
correlated group-level intercepts and slopes for subjects, and correlated group-level intercepts
and slopes for items. We define the likelihood as follows:

rtn ∼ LogNormal(α + usubj[n],1 + witem[n],1 + son ⋅ (β + usubj[n],2 + witem[n],2 ), σ)

In brms syntax, the model would be specified as follows:

rt ~ so + (so | subj) + (so | item), family = lognormal()

Because we assume a possible correlation between group-level (or random) effects, the
adjustments to the intercept and slope for subjects and items have multivariate (in this case,
bivariate) normal distributions with means zero, as in Equation (7.1).
(ui,1 , ui,2 )⊤ ∼ N ((0, 0)⊤ , Σu )
(wj,1 , wj,2 )⊤ ∼ N ((0, 0)⊤ , Σw )        (7.1)

One possible standard setup for (relatively) uninformative priors which is sometimes used in
reading studies (e.g., Paape, Nicenboim, and Vasishth 2017; Vasishth, Mertzen, Jäger, et al.
2018b) is as follows:

α ∼ Normal(0, 10)

β ∼ Normal(0, 1)

σ ∼ Normal+ (0, 1)

τ{1,2} ∼ Normal+ (0, 1)

ρ ∼ LKJ(2)

We define these priors in brms syntax as follows:

priors <- c(
  prior(normal(0, 10), class = Intercept),
  prior(normal(0, 1), class = b, coef = so),
  prior(normal(0, 1), class = sd),
  prior(normal(0, 1), class = sigma),
  prior(lkj(2), class = cor)
)

For the intercept (α) we use a normal distribution with mean 0 and standard deviation 10 . This
is on the log scale as we assume a log-normal distribution of reading times. That is, this
approach assumes a priori that the intercept for reading times varies between 0 seconds and
(one standard deviation) exp(10) = 22026 ms (i.e., 22 sec) or (two standard deviations)
exp(20) = 485165195 ms (i.e., 135 hours). Going from seconds to hours within one standard
deviation shows how uninformative this prior is.

Moreover, for the effect of linguistic manipulations on reading times (β), one common standard
prior is to assume a mean of 0 and a standard deviation of 1 (also on the log scale). The prior
on the effect size on log scale is a multiplicative factor, that is, the prediction for the effect size
depends on the intercept. For an intercept of exp(6) = 403 ms, a variation to one standard
deviation above multiplies the base effect by 2.71 , increasing the mean from 403 to
exp(6) × exp(1) = 1097 . Likewise a variation to one standard deviation below multiplies the
base effect by 1/2.71 , decreasing the mean from 403 to exp(6) × exp(−1) = 148 . This
effect size is strongly changed when assuming a different intercept: for a slightly smaller value
for the intercept of exp(5) = 148 ms, the expected condition difference is reduced to 37%
(349 ms), and for a slightly larger value for the intercept of exp(7) = 1097 ms, the condition
difference is enhanced to 272% (2578 ms). Also see Box 4.3 in chapter 4 for an explanation
about the non-linear behavior of the log-normal model. Even though it seems Normal(0, 1) is
not entirely appropriate as a prior for the difference between object-relative and subject-
relative sentences (i.e., the slope), we use it for illustrative purposes. We use the same
Normal(0, 1) prior for the τ parameter and σ. Finally, for the group-level effects correlation
between the intercept and the slope, we use an LKJ prior (Lewandowski, Kurowicka, and Joe
2009) with a relatively uninformative/regularizing prior parameter value of 2 (for visualization
of the prior see Figure 7.4).
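The multiplicative arithmetic above is easy to verify directly in R:

exp(6)                  # baseline: ~403 ms
exp(6) * exp(1)         # one SD above: ~1097 ms
exp(6) * exp(-1)        # one SD below: ~148 ms
exp(5 + 1) - exp(5 - 1) # condition difference for intercept 5: ~349 ms
exp(7 + 1) - exp(7 - 1) # condition difference for intercept 7: ~2578 ms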


FIGURE 7.4: The shape of the LKJ prior with the parameter set to 2. This is the prior density
for the group-level effects correlation parameter, here used as a prior for the correlation
between the effect size (so) and the intercept. The shape shows that correlation estimates
close to zero are expected, and that very strong positive correlations (close to 1) or negative
correlations (close to -1) are increasingly unlikely. Thus, correlation estimates are regularized
towards values of zero.

For the prior predictive checks, we use these priors to draw random parameter sets from the
distributions, and to simulate hypothetical data using the statistical model. We load the data
and code the predictor variable so . We next use brms to simulate prior predictive data from
the hierarchical model.

Hide
data("df_gibsonwu")
df_gibsonwu <- df_gibsonwu %>%
mutate(so = ifelse(type == "obj-ext", 1, -1))
fit_prior_gibsonwu <- brm(rt ~ so + (so | subj) + (so | item),
data = df_gibsonwu,

family = lognormal(),
prior = priors,
sample_prior = "only"
)

Based on the simulated data we can now perform prior predictive checks: we compute
summary statistics, and plot the distributions of the summary statistic across simulated data
sets. First, we visualize the distribution of the simulated data. For a single data set, this could
be visualized as a histogram. Here, we have a large number of simulated data sets, and thus
a large number of histograms. We represent this uncertainty: for each bin, we plot the median
as well as quantiles showing where 10%-90%, 20%-80%, 30%-70%, and 40%-60% of the
histograms lie (for R code, see Schad, Betancourt, and Vasishth 2019). For the current prior
data simulations, this shows (see Figure 7.5) that most of the hypothetical reading times are
close to zero or larger than 2000 ms. It is immediately clear that the data predicted by this
prior follows a very implausible distribution: it looks exponential; we would expect a log-normal
distribution for reading times. Most data points take on extreme values.

FIGURE 7.5: Prior predictive checks for a high-variance prior. Multivariate summary statistic:
the distribution of histograms of reading times shows that very short and also very long
reading times are expected too frequently under the uninformative prior. Values larger than
2000 ms are shown as 2000 ms.
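A simpler, built-in alternative to the custom histogram-of-histograms visualization above is to plot histograms of a few individual prior predictive data sets (a sketch):

pp_check(fit_prior_gibsonwu, type = "hist", ndraws = 11, prefix = "ppd")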
As an additional summary statistic, in Figure 7.6 we take a look at the mean per simulated
data set, and also at the standard deviation. We create functions that compute the mean and
the standard deviation as summary statistics; the functions collapse all values over 2000 ms,
making the figure more readable. Then we use these summary statistics to visualize prior
predictive checks using the pp_check() function, where we enter the prior predictions
( fit_prior_gibsonwu ) and the relevant summary statistic ( mean_2000 and sd_2000 ).

mean_2000 <- function(x) {
  tmp <- mean(x)
  tmp[tmp > 2000] <- 2000
  tmp
}
sd_2000 <- function(x) {
  tmp <- sd(x)
  tmp[tmp > 2000] <- 2000
  tmp
}
fig_pri_1b <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "mean_2000",
  prefix = "ppd") +
  labs(x = "Mean RT [ms]") +
  theme(legend.position = "none")
fig_pri_1b
fig_pri_1c <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "sd_2000",
  prefix = "ppd") +
  labs(x = "Standard Deviation RT [ms]") +
  theme(legend.position = "none")
fig_pri_1c

FIGURE 7.6: Prior predictive distribution of average reading times shows that extremely large
reading times of more than 2000 ms are too frequently expected. Distribution of standard
deviations of reading times shows that very large standard deviations are too frequently
expected in the priors. Values larger than 2000 are plotted at a value of 2000 for visualization.
The results, displayed in Figure 7.6, show that the mean (or the standard deviation) varies
across a wide range, with a substantial number of data sets having a mean (or the standard
deviation) larger than 2000 ms. Again, this reveals a highly implausible assumption about the
intercept parameter.

Moreover, we also plot the difference between object relative and subject relative sentences
as a measure of effect size, with the following code. We first compute the summary statistic,
here the difference in reading times between object versus subject relative sentences (i.e.,
based on the variable so ). Then we set difference values larger than 2000 ms to 2000 ms,
and values smaller than −2000 ms to −2000 ms, because these represent implausibly
large/small values; this makes the plot easier to read. We plot these prior predictive data using
pp_check by supplying the prior samples ( fit_prior_gibsonwu ) and the summary statistic
( effsize_2000 ).

effsize_2000 <- function(x) {
  tmp <- mean(x[df_gibsonwu$so == +1]) -
    mean(x[df_gibsonwu$so == -1])
  tmp[tmp > +2000] <- +2000
  tmp[tmp < -2000] <- -2000
  tmp
}
fig_pri_1d <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "effsize_2000",
  prefix = "ppd") +
  labs(x = "Object - Subject [S-O RT]") +
  theme(legend.position = "none")
fig_pri_1d

FIGURE 7.7: Prior predictive distribution of differences in reading times between object minus
subject relatives shows that very large effect sizes are far too frequently expected. Values
larger than 2000 (or smaller than -2000) are plotted at a value of 2000 (or -2000) for
visualization.
The results in Figure 7.7 show that our priors commonly assume differences in reading times
between conditions of more than 2000 ms, which are larger than we would expect for a
psycholinguistic manipulation of the kind investigated here. More specifically, given that we
model reading times using a log-normal distribution, the expected effect size depends on the
value for the intercept. To take an extreme example, for an intercept of exp(1) = 2.7 ms and
an effect size in log space of 1 (i.e., one standard deviation of the prior for the effect size),
expected reading times for the two conditions are exp(1 − 1) = 1 ms and exp(1 + 1) = 7

ms. By contrast, for an intercept of exp(10) = 22026 ms the corresponding reading times for
the two conditions would be exp(10 − 1) = 8103 ms and exp(10 + 1) = 59874 ms.

This implies highly variable expectations for the effect size, including the possibility of very
large effect sizes. If there is good reason to believe that the effect size is likely to be relatively
small, priors with smaller expected effect sizes may be more reasonable. In chapter 6 we
discussed different methods for working out ballpark estimates of the range of variation in
expected effect sizes in reading studies on relative clause processing.

It is also useful to look at individual-level differences in the effect of object versus subject
relatives. Figure 7.8 shows the subject with the largest (absolute) difference in reading times
between object versus subject relatives.

Here, we first assign the prior simulated reading time to the data frame df_gibsonwu (terming
it rtfake ). Then we group the data frame by subject and by experimental condition ( so ),
average prior predictive reading times for each subject and condition, and then compute the
difference in reading times between subject versus object relative sentences for each subject
(the subset where so == 1 minus the subset where so == -1 ). Now, we can take the
absolute value of this difference ( abs(tmp$dif) ), and take the maximum or standard
deviation across all subjects. Again, we set values larger than 2000 ms to a value of 2000 ms
for better visibility.

Hide
effsize_max_2000 <- function(x) {
  df_gibsonwu$rtfake <- x
  tmp <- df_gibsonwu %>%
    group_by(subj, so) %>%
    summarize(rtfake = mean(rtfake)) %>%
    # calculates the difference between conditions for each subject:
    summarize(dif = rtfake[so == 1] - rtfake[so == -1])
  effect_size_max <- max(abs(tmp$dif), na.rm = TRUE)
  effect_size_max[effect_size_max > 2000] <- 2000
  effect_size_max
}

effsize_sd_2000 <- function(x) {
  df_gibsonwu$rtfake <- x
  tmp <- df_gibsonwu %>%
    group_by(subj, so) %>%
    summarize(rtfake = mean(rtfake)) %>%
    summarize(dif = rtfake[so == 1] - rtfake[so == -1])
  # standard deviation of the by-subject effect:
  effect_size_SD <- sd(tmp$dif, na.rm = TRUE)
  effect_size_SD[effect_size_SD > 2000] <- 2000
  effect_size_SD
}

fig_pri_1e <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "effsize_max_2000",
  prefix = "ppd"
) +
  labs(x = "Max Effect Size [S-O RT]") +
  theme(legend.position = "none")
fig_pri_1e

fig_pri_1f <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "effsize_sd_2000",
  prefix = "ppd"
) +
  labs(x = "SD Effect Size [S-O RT]") +
  theme(legend.position = "none")
fig_pri_1f

FIGURE 7.8: Left: maximal prior predictive effect size (object minus subject relatives) across
subjects; right: standard deviation of the effect size (object minus subject relatives) across
subjects. Both show that far too many extreme values are expected. Values > 2000 or < -2000
are plotted at a value of 2000 or -2000 for visualization.

The prior simulations in Figure 7.8 show common maximal effect sizes of larger than 2000 ms,
which is more than we would expect for observed data; similarly, the variance in hypothetical
effect sizes is large, with many SDs larger than 2000 ms, and thus again takes many values
that are inconsistent with our domain expertise about reading experiments.

7.3.2 Adjusting priors

Based on these analyses of prior predictive data, we can next use our domain expertise to
refine our priors and adjust them to values for which we expect more plausible prior predictive
hypothetical data as captured in the summary statistics.

First, we adapt the intercept; recall that in chapter 6 we made a first attempt at coming up with
priors for the intercept; now we take our reasoning one step further. Given our prior knowledge
about mean reading times (see the discussion in the previous chapter), we could choose a
normal distribution in log-space with a mean of 6. This corresponds to an expected grand
average reading time of exp(6) = 403 ms. For the standard deviation, we use a value of
SD = 0.6. For these prior values, we expect a strongly reduced mean reading time and a strongly
reduced residual standard deviation in the simulated hypothetical data. Moreover, implausibly
small or large reading times should no longer be expected. For a visualization of the prior
distribution of the intercept parameter in log-space and in ms-space, see Figure 7.9a+b. Other
values for the standard deviation that are close to 0.6 (e.g., SD = 0.5 or 0.7) may yield similar
results. Our goal is not to specify a precise value, but rather to use prior parameter values that
are qualitatively in line with our domain expertise about expected observed reading time data,
and that do not produce highly implausible hypothetical data.
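A quick way to see what this prior implies on the millisecond scale is to draw samples from it and exponentiate; a minimal sketch:

# Sketch: implied prior distribution of the (median) grand mean in ms,
# under the Normal(6, 0.6) prior on the log scale:
intercept_log <- rnorm(100000, mean = 6, sd = 0.6)
quantile(exp(intercept_log), probs = c(0.025, 0.5, 0.975))
# roughly 125, 403, and 1300 ms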


FIGURE 7.9: Prior distribution in log-space and in ms-space for a toy example of a linear
regression. a) Displays the prior distribution of the intercept in log-space. b) Displays the prior
distribution of the intercept in ms-space. c) Displays the prior distribution of the effect size in
log-space. d) Displays the prior distribution of the effect size in ms-space.

Next, for the effect of object minus subject relative sentences, we define a normally distributed
prior with mean 0 and a much smaller standard deviation of 0.05 . Again, we do not have
precise information on the specific value for the standard deviation, but as we saw in chapter
6, we have some understanding of the range of variation seen in reading studies involving
relative clauses. We expect a generally smaller effect size (see the meta-analysis in chapter
6), and we can check through prior predictive checks (data simulation and investigation of
summary statistics) whether this yields a plausible pattern of expected results. Figures 7.9c+d
show expected effects in log-scale and in ms-scale for a simple linear regression example.
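To get a feel for what the Normal(0, 0.05) prior on the slope implies in milliseconds, the arithmetic can again be done by hand; a sketch, assuming a log-scale intercept of 6 and the ±1 sum coding of so used here:

# Condition difference in ms implied by a slope of 1 SD (0.05)
# and 2 SD (0.1) of the prior:
exp(6 + 0.05) - exp(6 - 0.05)  # approx. 40 ms
exp(6 + 0.10) - exp(6 - 0.10)  # approx. 81 ms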

In addition, we assume a much smaller value of 0.1 for the standard deviations that govern how
the intercept and the slope vary across subjects and across items, and a smaller residual
standard deviation of 0.5. Our expectation for the correlation between group-level effects is
unchanged.
In summary, we settle on the following prior specification:


priors2 <- c(
  prior(normal(6, 0.6), class = Intercept),
  prior(normal(0, 0.05), class = b, coef = so),
  prior(normal(0, 0.1), class = sd),
  prior(normal(0, 0.5), class = sigma),
  prior(lkj(2), class = cor)
)

7.3.2.1 Prior predictive checks after increasing the informativity of the priors

We have now adjusted the priors, increasing their informativity. These priors are still not very
informative, but they are somewhat principled, since they incorporate some of the theory-neutral
information that we have. Based on this new set of now weakly informative priors, we can
again perform prior predictive checks as we did before. The new prior predictive checks are
shown in Figure 7.10.
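A sketch of this step is shown below; we assume here that the prior-only fit is obtained with the brms argument sample_prior = "only" and the same model formula that is fit to the data later in this chapter (the object name fit_prior_gibsonwu2 is ours):

fit_prior_gibsonwu2 <- brm(rt ~ so + (1 + so | subj) + (1 + so | item),
  data = df_gibsonwu,
  family = lognormal(),
  prior = priors2,
  sample_prior = "only"
)
# repeat, e.g., the effect size check with the adjusted priors:
pp_check(fit_prior_gibsonwu2,
  type = "stat", stat = "effsize_2000", prefix = "ppd"
)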

FIGURE 7.10: Prior predictive checks after adjusting the priors. The figures show prior
predictive distributions. a) Histograms of reading times. Shaded areas correspond to 10-90
percent, 20-80 percent, 30-70 percent, and 40-60 percent quantiles across histograms. This
now provides a much more reasonable range of expectations. b)-f) Prior predictive
distributions. b) Average reading times now span a more reasonable range of values. c)
Differences in reading times between object minus subject relatives; the values are now much
more constrained without too many extreme values. d) Standard deviations of reading times;
in contrast to the relatively uninformative priors, values are in a reasonable range. e) Maximal
effect size (object - subject relatives) across subjects; again, prior expectations are now much
more reasonable compared to the uninformative prior. f) The standard deviation of effect size
(object minus subject relative reading times) across subjects; this no longer shows a
dominance of extreme values. a)-f) Values > 2000 or < -2000 are plotted at 2000 or
-2000 for visualization.
Figure 7.10a shows that the distribution over histograms of the data now looks much more
reasonable, i.e., more like what we would expect for a histogram of observed data. Very small
reading times are now rare, and no longer heavily inflated. Moreover, extremely large reading
times (larger than 2000 ms) are now rather unlikely.

We also take a look at the hypothetical average reading times (Figure 7.10b), and find that our
expectations are now much more reasonable. We expect average reading times of around
exp(6) = 403 ms. Most of the expected average reading times lie between 50 ms and 1500
ms, and only relatively few extreme values beyond these numbers are observed. The standard
deviations of reading times are also in a much more reasonable range (see Figure 7.10d), with
only very few values larger than the extreme value of 2000 ms.

As a next step, we look at the expected effect size (object minus subject relatives) in the
hypothetical data (Figure 7.10c). Extreme values larger than 2000 ms or smaller than −2000 ms
are now very rare, and most of the absolute values of expected effect sizes are smaller than
200 ms. More specifically, we also check the maximal effect size among all subjects (Figure
7.10e). Most of the distribution lies below 1000 ms, reflecting a more plausible range of
expected values. Likewise, the standard deviation of the psycholinguistically interesting effect
size now rarely takes values larger than 500 ms (Figure 7.10f), reflecting more realistic a priori
assumptions than our initial (relatively) uninformative priors.

7.3.3 Computational faithfulness and model sensitivity

The next formal steps in the principled Bayesian workflow are to investigate computational
faithfulness (using simulation-based calibration, SBC) and model sensitivity. These allow the
researcher to determine whether the posterior is estimated accurately for the given problem.
Moreover, model sensitivity can be used to test whether parameter estimates are unbiased and
whether anything can be learned by sampling data using the given design. Computational
faithfulness (i.e., accurate posterior estimation) and model sensitivity need to be checked for
non-standard and more complex models, but for simpler/standard models they may be checked
only once for a research program in which experimental designs and models are similar across
studies. These steps are computationally very expensive, and can take a very long time to run
for realistic data sets and models. For details on how to implement these steps, we refer the
interested reader to Schad, Betancourt, and Vasishth (2019) and Betancourt (2018). We discuss
a simplified version inspired by SBC for more advanced custom models implemented directly in
Stan in chapter 18.

7.3.4 Posterior predictive checks: Model adequacy

Having examined the prior predictive data in detail, we can now take the observed data and
perform posterior inference on it. We start by fitting a maximal brms model to the observed
data.

fit_gibsonwu <- brm(rt ~ so + (1 + so | subj) + (1 + so | item),
data = df_gibsonwu,
family = lognormal(),
prior = priors2
)

One could examine the posteriors from this model; we skip this step for brevity, but the reader
should run the above code and examine the posterior summaries by typing:


fit_gibsonwu

Figure 7.11 shows the posterior distribution for the slope parameter, which estimates the
difference in reading times between object minus subject relative sentences.


mcmc_hist(fit_gibsonwu, pars = c("b_so")) +
  labs(
    x = "Object - subject relatives",
    title = "Posterior distribution"
  )


FIGURE 7.11: Posterior distribution for the slope parameter, estimating the difference in
reading times between object relative minus subject relative sentences.

postgw <- as_draws_df(fit_gibsonwu)
mean(postgw$b_so < 0)

## [1] 0.882

Figure 7.11 shows that reading times in object relative sentences tend to be slightly faster
than in subject relative sentences (P(β < 0) = 0.88); this is as predicted by Gibson and Wu
(2013). However, given the wide 95% credible interval, it is difficult to rule out the possibility
that there is effectively no difference in reading time between the two conditions without doing
model comparison (with Bayes factors or cross-validation).
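The credible interval itself can be computed from the posterior draws already extracted above:

# 95% credible interval for the effect, using the draws in postgw:
quantile(postgw$b_so, probs = c(0.025, 0.975))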

To assess model adequacy, we perform posterior predictive checks. We simulate data based
on posterior samples of parameters. This then allows us to investigate the simulated data by
computing the summary statistics that we used in the prior predictive checks, and by
comparing model predictions with the observed data.
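For example, a single panel of Figure 7.12 can be reproduced with the same pp_check machinery used for the prior predictive checks; dropping prefix = "ppd" yields a posterior (rather than prior) predictive check. A sketch, reusing the effsize_2000 statistic defined earlier (the object name fig_post_c is ours):

fig_post_c <- pp_check(fit_gibsonwu,
  type = "stat",
  stat = "effsize_2000"
) +
  labs(x = "Object - Subject [O-S RT]")
fig_post_c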

FIGURE 7.12: Posterior predictive checks for weakly informative priors. Distributions are over
posterior predictive simulated data. a) Histograms of reading times. 10-90 percent, 20-80
percent, 30-70 percent, and 40-60 percent quantiles across histograms are shown as shaded
areas; the median is shown as a dotted line and the observed data as a solid line. For
illustration, values > 2000 are plotted as 2000; modeling was done on the original data. b)-f)
Grey shows the posterior predictive distributions, the dark line shows the relevant estimate
from the observed data. b) Average reading times. c) Differences in reading times between
object minus subject relatives. d) Standard deviations of reading times. e) Maximal effect size
(object - subject relatives) across subjects. f) The standard deviation of effect size across
subjects.
The results from these analyses show that the log-normal distribution (see Figure 7.12a)
provides a reasonable approximation to the distribution of the data. However, although the fit
looks reasonable, the model's predictions still deviate systematically from the data. This
deviation suggests that a constant offset may be needed in addition to the log-normal
distribution. This can be implemented in brms by replacing the family specification family =
lognormal() with the shifted version family = shifted_lognormal() , and motivates another
round of model validation (see exercise 12.1, and also chapter 20, which deals with a log-
normal race model using Stan; see Nicenboim, Logačev, et al. 2016; Rouder 2005 for a
discussion of shifted log-normal models).
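A minimal sketch of this variant is shown below (the object name is ours; the prior for the shift parameter is left at the brms default here, and a full analysis would again run through the entire workflow):

fit_gibsonwu_shift <- brm(rt ~ so + (1 + so | subj) + (1 + so | item),
  data = df_gibsonwu,
  family = shifted_lognormal(),
  prior = priors2
)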

Next, for the other summary statistics, we first look at the distribution of means. The posterior
predictive means capture the mean reading time in the observed data (i.e., the vertical line in
Figure 7.12b); the observed mean is not captured perfectly, but it still lies within the model's
predictive distribution. For the standard deviation, we can see that the posterior predictive
distribution shows too little variation, and that the model thus does not capture the standard
deviation of the data well (Figure 7.12d). Figure 7.12c shows the effect size of object minus
subject relative sentences predicted by the model (histogram) and observed in the data
(vertical line). Here, posterior model predictions for the effect are in line with the empirical data,
although the model mostly predicts smaller effect sizes than the data show. For the largest
effect among all subjects (Figure 7.12e), the model captures the data reasonably well. For the
standard deviation of the effect across subjects (Figure 7.12f), the variation in the model is
again somewhat too small relative to the variation in the data. The lack of agreement between
data and posterior predictive distributions might be due to a mismatch between the true
distribution of the data and the log-normal likelihood that we assumed. This could be checked
by running the model again using a shifted log-normal instead of a log-normal likelihood.

7.4 Summary

In this chapter, we have introduced key questions to ask about a model and the inference
process as discussed by Betancourt (2018) and by Schad, Betancourt, and Vasishth (2019),
and have applied this to a data set from an experiment involving a typical repeated measures
experimental design used in cognitive psychology and psycholinguistics. Prior predictive
checks using analyses of simulated prior data suggest that, compared to previous applications
in reading experiments (e.g., Nicenboim and Vasishth 2018), far more informative priors can
and should be used. We demonstrated that including such additional domain knowledge into
the priors leads to more plausible expected data. Moreover, incorporating more informative
priors should also speed up the sampling process. These more informative priors, however,
may not alter posterior inferences much for the present design. Posterior predictive checks
showed only weak support for our statistical model, as the model only partially recovered the
tested summary statistics. This may reflect a misfit of the likelihood, i.e., that the data may be
better explained by a shifted log-normal distribution rather than a log-normal
distribution. For inference on whether reading times differ between Chinese object versus
subject relative sentences, a Bayes factor analysis would be needed to compare a null model
assuming no effect to an alternative model assuming a difference between object versus
subject relative sentences. See Vasishth, Yadav, et al. (2022) for more discussion.

In summary, this analysis provides an example and tutorial for using a principled Bayesian
workflow (Betancourt 2018; Schad, Betancourt, and Vasishth 2019) in cognitive science
experiments. The workflow reveals useful information about which (weakly informative) priors
to use, and performs checks of the used inference procedures and the statistical model. The
workflow provides a robust foundation for using a statistical model to answer scientific
questions, and will be useful for researchers developing analysis plans as part of pre-
registrations, registered reports, or simply as preparatory design analyses prior to conducting
an experiment.

7.5 Further reading

Some important articles relating to developing a principled Bayesian workflow are by
Betancourt (2018), Gabry et al. (2017), Gelman et al. (2020), and Talts et al. (2018). The
stantargets R package provides tools for a systematic, efficient, and reproducible workflow
(Landau 2021). Also recommended is the article on reproducible workflows by Wilson et al.
(2017).

References

Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects
Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and
Language 68 (3). Elsevier: 255–78.

Betancourt, Michael J. 2018. “Towards a Principled Bayesian Workflow.”
https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html.

Box, George EP. 1979. “Robustness in the Strategy of Scientific Model Building.” In
Robustness in Statistics, 201–36. Elsevier.

Chambers, Chris. 2019. The Seven Deadly Sins of Psychology: A Manifesto for Reforming the
Culture of Scientific Practice. Princeton University Press.
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael J. Betancourt, and Andrew Gelman. 2017.
“Visualization in Bayesian Workflow.” arXiv Preprint arXiv:1709.01449.

Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling
Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020.
“Bayesian Workflow.” arXiv Preprint arXiv:2011.01808.

Gibson, Edward, and H-H Iris Wu. 2013. “Processing Chinese Relative Clauses in Context.”
Language and Cognitive Processes 28 (1-2). Taylor & Francis: 125–55.

Landau, William Michael. 2021. “The Stantargets R Package: A Workflow Framework for
Efficient Reproducible Stan-Powered Bayesian Data Analysis Pipelines.” Journal of Open
Source Software 6 (60): 3193. https://doi.org/10.21105/joss.03193.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random
Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate
Analysis 100 (9): 1989–2001.

Nicenboim, Bruno, Pavel Logačev, Carolina Gattei, and Shravan Vasishth. 2016. “When High-
Capacity Readers Slow down and Low-Capacity Readers Speed up: Working Memory and
Locality Effects.” Frontiers in Psychology 7 (280). https://doi.org/10.3389/fpsyg.2016.00280.

Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence
Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” Journal
of Memory and Language 99: 1–34. https://doi.org/10.1016/j.jml.2017.08.004.

Paape, Dario, Bruno Nicenboim, and Shravan Vasishth. 2017. “Does Antecedent Complexity
Affect Ellipsis Processing? An Empirical Investigation.” Glossa: A Journal of General
Linguistics 2 (1).

Rouder, Jeffrey N. 2005. “Are Unshifted Distributional Models Appropriate for Response
Time?” Psychometrika 70 (2). Springer Science + Business Media: 377–81.
https://doi.org/10.1007/s11336-005-1297-7.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled
Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1). American
Psychological Association: 103–26.

Schad, Daniel J., Michael Betancourt, and Shravan Vasishth. 2019. “Toward a Principled
Bayesian Workflow in Cognitive Science.” arXiv. https://doi.org/10.48550/ARXIV.1904.12765.
Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to
Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and
Language 110. https://doi.org/10.1016/j.jml.2019.104038.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2020. “How to
Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and
Language 110. Elsevier: 104038.

Szollosi, Aba, David Kellen, Danielle J Navarro, Richard Shiffrin, Iris van Rooij, Trisha Van
Zandt, and Chris Donkin. 2020. “Is Preregistration Worthwhile?” Trends in Cognitive Sciences
24 (2). Elsevier: 94–95.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018.
“Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint
arXiv:1804.06788.

Vasishth, Shravan, Daniela Mertzen, Lena A Jäger, and Andrew Gelman. 2018b. “The
Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of
Memory and Language 103: 151–75.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample
Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.”
Computational Brain and Behavior.

Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K
Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Computational Biology 13
(6). Public Library of Science San Francisco, CA USA: e1005510.

25. In frequentist methods (such as the lmer function in the lme4 package), this problem
    manifests itself as convergence problems in the optimizer, which indicate that the
    likelihood is too flat and that the parameter estimates are not constrained by the data.↩

Chapter 8 Contrast coding

Whenever one uses a categorical factor as a predictor in a Bayesian regression model, for
example when estimating the difference in a dependent variable between two or three
experimental conditions, it becomes necessary to code the discrete factor levels into numeric
predictor variables. This coding is termed contrast coding. For example, in the previous
chapter (section 5.3), we coded two experimental conditions as −1 and +1 , i.e., implementing
a sum contrast. Those contrasts are the values that we assign to predictor variables to encode
specific comparisons between factor levels and to create predictor terms to estimate these
comparisons in any type of regression, including Bayesian regressions.

Contrast coding in Bayesian models works more or less the same way as in frequentist
models, and the same principles and tools can be used in both cases. This chapter will
introduce contrast coding in the context of Bayesian models. The descriptions are in large
parts taken from Schad et al. (2020) (which is published under a CC-BY 4.0 license:
https://creativecommons.org/licenses/by/4.0/) and adapted for the current context.

Consider a situation where we want to estimate differences in a dependent variable between
three factor levels. An example could be differences in response times between three levels of
word class (noun, verb, adjective). We might be interested in whether word class influences
response times. In frequentist statistics, one way to approach this question would be to run an
ANOVA and compute an omnibus F -test for whether word class explains response times. A
Bayesian equivalent to the frequentist omnibus F -test is Bayesian model comparison (i.e.,
Bayes factors), where we might compare an alternative model including word class as a
predictor term with a null model lacking this predictor. We will discuss such Bayesian model
comparison using Bayes factors in chapter 15. However, if based on such omnibus
approaches we find support for an influence of word class on response times, it remains
unclear where this effect actually comes from, i.e., whether it originated from the nouns, verbs,
or adjectives. This is problematic for inference because scientists typically have specific
expectations about which groups differ from each other. In this chapter, we will show how to
estimate specific comparisons directly in a Bayesian linear model. This gives the researcher a
lot of control over Bayesian analyses. Specifically, we show how planned comparisons
between specific conditions (groups) or clusters of conditions, are implemented as contrasts.
This is a very effective way to align expectations with the statistical model. In Bayesian
models, any specific comparisons can also be computed after the model is fit. Nevertheless,
coding a priori expectations into contrasts for model fitting will make it much more
straightforward to estimate certain comparisons between experimental conditions, and will
allow us to perform Bayesian model comparisons using Bayes factors to provide evidence for or
against very specific hypotheses.

For this and the next chapter, although knowledge of the matrix formulation of the linear model
is not necessary, for a deeper understanding of contrast coding some exposure to the matrix
formulation is desirable. We discuss the matrix formulation in the frequentist textbook
(Vasishth et al. 2021).

8.1 Basic concepts illustrated using a two-level factor

We first consider the simplest case: suppose we want to compare the means of a dependent
variable (DV) such as response times between two groups of subjects. A simulated data set is
available in the package bcogsci as the data set df_contrasts1 . The simulations assumed
longer response times in condition F1 ( μ1 = 0.8 sec) than F2 ( μ2 = 0.4 sec). The data from
the 10 simulated subjects are aggregated and summary statistics are computed for the two
groups.


data("df_contrasts1")
df_contrasts1

## # A tibble: 10 × 3
##   F        DV    id
##   <fct> <dbl> <int>
## 1 F1    0.636     1
## 2 F1    0.841     2
## 3 F1    0.555     3
## # … with 7 more rows

TABLE 8.1: Summary statistics per condition for the simulated data.

Factor   N data   Est. means   Std. dev.   Std. errors
F1       5        0.8          0.2         0.1
F2       5        0.4          0.2         0.1


FIGURE 8.1: Means and standard errors of the simulated dependent variable (e.g., response
times in seconds) in two conditions F1 and F2 .

The results, displayed in Figure 8.1 and in Table 8.1, show that the assumed true condition
means are exactly realized with the simulated data. The numbers are exact because the
mvrnorm() function used here (see ?df_contrasts1 ) ensures that the data are generated so
that the sample mean yields the true means for each level. In real data sets, of course, the
sample means will vary from experiment to experiment.

A simple Bayesian linear model of DV on F yields a straightforward estimate of the difference
between the group means. We use relatively uninformative priors. The estimates for the
population-level effects are presented below using the function fixef() :

fit_F <- brm(DV ~ 1 + F,
  data = df_contrasts1,
  family = gaussian(),
  prior = c(
    prior(normal(0, 2), class = Intercept),
    prior(normal(0, 2), class = sigma),
    prior(normal(0, 1), class = b)
  )
)


fixef(fit_F)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept     0.80      0.11  0.58  1.00
## FF2          -0.39      0.15 -0.69 -0.09

Comparing the means for each condition with the coefficients ( Estimate ) reveals that (i) the
intercept (0.8) is the mean for condition F1, μ̂1; and (ii) the slope ( FF2 : −0.4) is the
difference between the estimated means for the two groups, μ̂2 − μ̂1 (Bolker 2018):

Intercept = μ̂1 = estimated mean for F1
Slope (FF2) = μ̂2 − μ̂1 = estimated mean for F2 − estimated mean for F1     (8.1)

The new information is the 95 % credible interval for the difference between the two groups.

8.1.1 Default contrast coding: Treatment contrasts

How does the function brm arrive at these particular values for the intercept and slope? That
is, why does the intercept assess the mean of condition F1, and how do we know that the slope
measures the difference in means between F2 and F1? This result is a consequence of the
default contrast coding of the factor F . R assigns treatment contrasts to factors and orders
their levels alphabetically. The alphabetically first factor level (here: F1 ) is coded in R by
default as 0 and the second level (here: F2 ) is coded as 1. This becomes clear when we
inspect the current contrast attribute of the factor using the contrasts() command:

contrasts(df_contrasts1$F)

## F2
## F1 0
## F2 1

Why does this contrast coding yield these particular regression coefficients? Let’s take a look
at the regression equation. Let α represent the intercept, and β1 the slope. Then, the simple
regression above expresses the belief that the expected response time ŷ (or E[Y]) is a linear
function of the factor F .

E[Y] = α + β1 ⋅ x

So, if x = 0 (condition F1 ), the expectation is α + β1 ⋅ 0 = α; and if x = 1 (condition F2 ),
the expectation is α + β1 ⋅ 1 = α + β1.

Expressing the above in terms of the estimated coefficients:

estimated value for F1 = μ̂1 = α̂ = Intercept
estimated value for F2 = μ̂2 = α̂ + β̂1 = Intercept + Slope (FF2)

It is useful to think of such unstandardized regression coefficients as difference scores; they
express the increase in the dependent variable y associated with a change in the independent
variable x of 1 unit, such as going from 0 to 1 in this example. The difference between
condition means is 0.4 − 0.8 = −0.4, which is the estimated regression coefficient β̂1. The
sign of the slope is negative because we have chosen to subtract the larger mean F1 score
from the smaller mean F2 score.
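This reading of the coefficients can be verified numerically from the fitted model; a small sketch using the posterior means:

# Recover the two condition means from the coefficients of fit_F:
b <- fixef(fit_F)[, "Estimate"]
c(F1 = unname(b["Intercept"]),
  F2 = unname(b["Intercept"] + b["FF2"]))  # approx. 0.8 and 0.4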

8.1.2 Defining comparisons

The analysis of the regression equation demonstrates that in the treatment contrast the
intercept assesses the average response in the baseline condition, whereas the slope
estimates the difference between condition means. However, these are just verbal
descriptions of what each coefficient assesses. Is it also possible to formally write down what
each coefficient assesses?

From the perspective of parameter estimation, the slope represents the effect of main interest,
so we consider this first. The treatment contrast specifies that the slope β1 estimates the
difference in means between the two levels of the factor F . This can formally be written as:
β1 = μF2 − μF1

or equivalently:

β1 = −1 ⋅ μF1 + 1 ⋅ μF2

The ±1 weights in the parameter estimation directly express which means are compared by
the treatment contrast.

The intercept in the treatment contrast estimates a quantity that is usually of little interest: it
estimates the mean in condition F1 . Formally, the parameter α estimates the following
quantity:

α = μF1

or equivalently:

α = 1 ⋅ μF1 + 0 ⋅ μF2.

The fact that the intercept term formally estimates the mean of condition F1 is in line with our
previous derivation (see equation (8.1)).

In R, factor levels are ordered alphabetically and by default the first level is used as the
baseline in treatment contrasts. Obviously, this default mapping will depend on the levels’
alphabetical ordering. If a different baseline condition is desired, it is possible to re-order the
levels. Here is one way of re-ordering the levels:


df_contrasts1$Fb <- factor(df_contrasts1$F,
  levels = c("F2", "F1")
)
contrasts(df_contrasts1$Fb)

##    F1
## F2  0
## F1  1

This re-ordering did not change any data associated with the factor, only one of its attributes.
With this new contrast attribute, a simple Bayesian model yields the following result.

fit_Fb <- brm(DV ~ 1 + Fb,
  data = df_contrasts1,
  family = gaussian(),
  prior = c(
    prior(normal(0, 2), class = Intercept),
    prior(normal(0, 2), class = sigma),
    prior(normal(0, 1), class = b)
  )
)


fixef(fit_Fb)

##           Estimate Est.Error Q2.5 Q97.5
## Intercept     0.40      0.11 0.17  0.64
## FbF1          0.39      0.16 0.05  0.69

The model now estimates different quantities. The intercept now codes the mean of condition
F2 , and the slope measures the difference in means of F1 minus F2 . This represents an
alternative coding of the treatment contrast.

These model posteriors do not provide evidence for the hypothesis that the effect of factor F
is different from zero. If the research focus is on such hypothesis testing, Bayesian hypothesis
tests can be carried out using Bayes factors, by comparing a model containing a contrast of
interest with a model lacking this contrast. We will discuss details of Bayesian hypothesis
testing based on Bayes factors in chapter 15.

8.1.3 Sum contrasts

Treatment contrasts are only one of many options. It is also possible to use sum contrasts,
which code one of the conditions as −1 and the other as +1 , effectively centering the effects
at the grand mean (GM, i.e., the mean of the two group means). Here, we rescale the contrast
to values of −0.5 and +0.5 , which makes the estimated treatment effect the same as for
treatment coding and easier to interpret.
To define this contrast in a linear regression, one way is to use the contrasts function
(another way is to define a column containing +0.5 and −0.5 for the corresponding levels of
the factor).
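A sketch of this second option (the column name Fsum is ours):

# Alternative to setting the factor's contrasts: a numeric predictor
# column coding F1 as -0.5 and F2 as +0.5.
df_contrasts1$Fsum <- ifelse(df_contrasts1$F == "F1", -0.5, +0.5)
# One would then fit DV ~ 1 + Fsum in the brm() call.

In what follows, we use the contrasts() route: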


contrasts(df_contrasts1$F) <- c(-0.5, +0.5)

fit_mSum <- brm(DV ~ 1 + F,
  data = df_contrasts1,
  family = gaussian(),
  prior = c(
    prior(normal(0, 2), class = Intercept),
    prior(normal(0, 2), class = sigma),
    prior(normal(0, 1), class = b)
  )
)


fixef(fit_mSum)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept     0.60      0.08  0.44  0.76
## F1           -0.39      0.16 -0.71 -0.07

Here, the slope ( F1 ) again codes the difference between the groups associated with the first
and second factor levels. It has the same value as in the treatment contrast. One important
difference from the treatment contrast is that the intercept now represents the estimate of the
average of the condition means for F1 and F2 , that is, the grand mean. For the scaled sum
contrast:

Intercept = (μ̂1 + μ̂2)/2 = estimated mean of F1 and F2
Slope (F1) = μ̂2 − μ̂1 = estimated mean of F2 − estimated mean of F1

Why does the intercept assess the grand mean and why does the slope estimate the group
difference? This is the result of rescaling the sum contrast. The first factor level ( F1 ) was
coded as −0.5, and the second factor level ( F2 ) as +0.5:

contrasts(df_contrasts1$F)

## [,1]
## F1 -0.5
## F2 0.5

Look again at the regression equation to better understand what computations are performed.
Again, α represents the intercept, β1 represents the slope, and the predictor variable x
represents the factor F . The regression equation is written as:

E[Y] = α + β1 ⋅ x

The group of F1 subjects is coded as −0.5, so the estimated response time for the F1 group is
α + β1 ⋅ x1 = 0.6 + (−0.4) ⋅ (−0.5) = 0.8. By contrast, the F2 group is coded as +0.5.
By implication, the mean of the F2 group must be
α + β1 ⋅ x1 = 0.6 + (−0.4) ⋅ 0.5 = 0.4. Expressed in terms of the estimated coefficients:

estimated value for F1 = μ̂1 = α̂ − 0.5 ⋅ β̂1 = Intercept − 0.5 ⋅ Slope (F1)
estimated value for F2 = μ̂2 = α̂ + 0.5 ⋅ β̂1 = Intercept + 0.5 ⋅ Slope (F1)

The unstandardized regression coefficient is a difference score: taking a step of one unit on
the predictor variable x, e.g., from −0.5 to +0.5, reflecting a step from condition F1 to F2 ,
changes the dependent variable from 0.8 (for condition F1 ) to 0.4 (condition F2 ). This
reflects a difference of 0.4 − 0.8 = −0.4; this is again the estimated regression coefficient β̂1.
Moreover, as mentioned above, the intercept now assesses the grand mean of conditions F1
and F2 : it lies in the middle between the condition means for F1 and F2 .

So far we gave verbal statements about what is estimated by the intercept and the slope in the
case of the scaled sum contrast. It is possible to write these statements down as formal
parameter estimates. In sum contrasts, the slope parameter β1 assesses the following quantity:

β1 = −1 ⋅ μF1 + 1 ⋅ μF2

This estimates the same quantity as the slope in the treatment contrast. The intercept now
assesses a different quantity: the average of the two conditions F1 and F2 :

α = 1/2 ⋅ μF1 + 1/2 ⋅ μF2 = (μF1 + μF2)/2

In balanced data, i.e., in data sets where there are no missing data points, the average of the
two conditions F1 and F2 is the grand mean. In unbalanced data sets, where there are
missing values, this average is the weighted grand mean. To illustrate this point, consider an
example with fully balanced data and two equal group sizes of 5 subjects for each group F1
and F2 . Here, the grand mean is also the mean across all subjects. Next, consider a highly
simplified unbalanced data set, where in condition F1 two observations of the dependent
variable are available with values of 2 and 3, and where in condition F2 only one observation
of the dependent variable is available with a value of 4. In this data set, the mean across all
subjects is (2 + 3 + 4)/3 = 9/3 = 3. However, the (weighted) grand mean as assessed in the
intercept in a model using sum contrasts for factor F would first compute the mean for each
group separately (i.e., (2 + 3)/2 = 2.5, and 4), and then compute the mean across conditions:
(2.5 + 4)/2 = 6.5/2 = 3.25. The grand mean of 3.25 is different from the mean across subjects of 3.
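The toy example can be reproduced in a few lines of R:

y_F1 <- c(2, 3)
y_F2 <- 4
mean(c(y_F1, y_F2))              # mean across subjects: 3
mean(c(mean(y_F1), mean(y_F2)))  # weighted grand mean: 3.25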

To summarize, treatment contrasts and sum contrasts are two possible ways to parameterize
the difference between two groups; they generally estimate different quantities. Treatment
contrasts compare one or more means against a baseline condition, whereas sum contrasts
compare a condition’s mean to the grand mean (which in the two-group case also implies
estimating the difference between the two group means). One question that comes up here is:
how does one know or how can one formally derive what quantities are estimated by a given
set of contrasts? (In the context of Bayes factors, the question would be: what hypothesis test
does the contrast coding encode?) This question will be discussed in detail below for the
general case of any arbitrary contrast coding.

8.1.4 Cell means parameterization and posterior comparisons

One alternative option is to use what is called the cell means parameterization (this coding is
also called “one-hot encoding” in the context of machine learning). In this approach, one does
not estimate an intercept term, and then differences between factor levels. Instead, each free
parameter is used to simply estimate the mean of one of the factor levels. As a consequence,
no comparisons between condition means are estimated, but simply the mean of each
experimental condition is estimated. Cell means parameterization is specified by explicitly
removing the intercept term (which is added automatically in brms ) by adding a −1 in the
regression formula:

fit_mCM <- brm(DV ~ -1 + F,
  data = df_contrasts1,
  family = gaussian(),
  prior = c(
    prior(normal(0, 2), class = sigma),
    prior(normal(0, 2), class = b)
  )
)


fixef(fit_mCM)

##     Estimate Est.Error Q2.5 Q97.5
## FF1      0.8      0.11 0.57  1.02
## FF2      0.4      0.11 0.16  0.62

Now, the regression coefficients (see the column labeled Estimate ) estimate the mean of the
first factor level (0.8) and the mean of the second factor level (0.4). This cell means
parameterization usually does not allow us to make inferences about the hypotheses of
interest using Bayes factors, as these hypotheses usually relate to differences between
conditions rather than to whether each condition differs from zero.

The cell means parameterization provides a good example demonstrating an advantage of
Bayesian data analysis: in Bayesian models, it is possible to use the posterior samples to
compute new estimates that were not directly contained in the fitted model. To implement this,
we first extract the posterior samples from the brm model object:


df_postSamp <- as_draws_df(fit_mCM)

In a second step, we can then compute comparisons from these posterior samples. For
example, we can compute the difference between conditions F2 and F1 . To do so, we simply
take the posterior samples for each condition, and compute their difference.

df_postSamp$b_dif <- df_postSamp$b_FF2 - df_postSamp$b_FF1

This provides a posterior sample of the difference between conditions. It is possible to
investigate this posterior sample by looking at its mean and 95% credible interval:


c(
  Estimate = mean(df_postSamp$b_dif),
  quantile(df_postSamp$b_dif, p = c(0.025, 0.975))
)

## Estimate     2.5%    97.5%
##  -0.4025  -0.7192  -0.0963

The above summary provides the same estimate (roughly −0.4 ) that we obtained previously
when using the treatment contrast or the scaled sum contrast. Thus, Bayesian models provide
a lot of flexibility in computing new comparisons post-hoc from the posterior samples and in
obtaining their posterior distributions. However, what these posterior computations do not
provide directly are inferences on null hypotheses. That is, just by looking at the credible
intervals, we cannot make inferences about whether a null hypothesis can be rejected; an
explicit hypothesis test is needed to answer such a question (see chapter 15).

8.2 The hypothesis matrix illustrated with a three-level factor

Consider an example with the three word classes nouns, verbs, and adjectives. We load
simulated data from a lexical decision task with response times as dependent variable. The
research question is: do response times differ as a function of the between-subject factor word
class with three levels: nouns, verbs, and adjectives? Here, just to illustrate the case of a
three-level factor, we make the arbitrary assumption that nouns have the longest response
times and adjectives the shortest. Word class is specified as a between-subject
factor. In cognitive science experiments, word class will usually vary within subjects and
between items. Because the within- or between-subjects status of an effect is independent of
its contrast coding, we assume the manipulation to be between subjects for ease of
exposition. The concepts presented here extend to repeated measures designs that are often
analyzed using hierarchical Bayesian (linear mixed) models.

Load and display the simulated data.


data("df_contrasts2")
head(df_contrasts2)

## # A tibble: 6 × 3

## F DV id
## <fct> <int> <int>
## 1 nouns 476 1
## 2 nouns 517 2
## 3 nouns 491 3

## # … with 3 more rows

TABLE 8.2: Summary statistics per condition for the simulated data.

Factor       N data   Est. means   Std. dev.   Std. errors
adjectives   4        400.2        19.9         9.9
nouns        4        500.0        20.0        10.0
verbs        4        450.2        20.0        10.0

As shown in Table 8.2, the estimated means reflect our assumptions about the true means in
the data simulation: Response times are longest for nouns and shortest for adjectives. In the
following sections, we use this data set to illustrate sum contrasts. Furthermore, we will use an
additional data set to illustrate repeated, Helmert, polynomial, and custom contrasts. In
practice, usually only one set of contrasts is selected when the expected pattern of means is
formulated during the design of the experiment.
8.2.1 Sum contrasts

We begin with sum contrasts. Suppose that the expectation is that nouns are responded to
slower and adjectives are responded to faster than the grand mean response time. Then, the
research question could be: By how much do nouns differ from the grand mean and by how
much do adjectives differ from the grand mean? And are the responses slower or faster than
the grand mean? We want to estimate the following two quantities:

β1 = μ1 − (μ1 + μ2 + μ3)/3 = μ1 − GM

and

β2 = μ2 − (μ1 + μ2 + μ3)/3 = μ2 − GM

β1 can also be written as:

β1 = μ1 − (μ1 + μ2 + μ3)/3
⇔ β1 = 2/3 ⋅ μ1 − 1/3 ⋅ μ2 − 1/3 ⋅ μ3

Here, the weights 2/3, −1/3, −1/3 are informative about how to combine the condition
means to estimate the linear model coefficient.

β2 can be rewritten as:

β2 = μ2 − (μ1 + μ2 + μ3)/3
⇔ β2 = −1/3 ⋅ μ1 + 2/3 ⋅ μ2 − 1/3 ⋅ μ3

Here, the weights are −1/3, 2/3, −1/3, and they again indicate how to combine the condition
means for estimating the regression coefficient.

8.2.2 The hypothesis matrix

The weights of the condition means are not only useful for defining parameter estimates. They
also provide the starting step in a very powerful method which allows the researcher to
generate the contrasts that are needed to estimate these comparisons in a linear model. That
is, what we did so far is to explain some kinds of different contrast codings that exist and what
comparisons are estimated by these contrasts. That is, if a certain data set is given and the
goal is to estimate certain comparisons, then the procedure would be to check whether any of
the contrasts that we encountered estimate these comparisons of interest. Sometimes it
suffices to use one of these existing contrasts. At other times, our research questions may not
correspond exactly to any of the contrasts in the default set of standard contrasts provided in
R. For these cases, or for more complex designs, it is very useful to know how contrast
matrices are created. Indeed, a relatively simple procedure exists in which we write our
comparisons formally, extract the weights of the condition means from the comparisons, and
then automatically generate the correct contrast matrix that we need in order to estimate these
comparisons in a linear model. Using this powerful method, it is not necessary to find a match
to a contrast matrix provided by the family of functions in R starting with the prefix contr .
Instead, it is possible to simply define the comparisons that one wants to estimate, and to
obtain the correct contrast matrix for these in an automatic procedure. Here, for pedagogical
reasons, we show some examples of how to apply this procedure in cases where the
comparisons do correspond to some of the existing contrasts.

Defining a custom contrast matrix involves four steps:

1. Write down the comparisons to be estimated
2. Extract the weights and write them into what we will call a hypothesis matrix (which can
   also be viewed as a comparison matrix)
3. Apply the generalized matrix inverse to the hypothesis matrix to create the contrast matrix
4. Assign the contrast matrix to the factor and run the (Bayesian) model

The term hypothesis matrix is used here because contrast coding is often done to carry out an
explicit hypothesis test; but one could of course use contrast coding just to compute the
estimates of an effect and their uncertainty, without doing a hypothesis test.

Let us apply this four-step procedure to our example of the sum contrast. The first step, writing
down the estimated comparisons, is shown above. The second step involves writing down the
weights that each comparison gives to condition means. The weights for the first comparison
are wH01=c(+2/3, -1/3, -1/3) , and the weights for the second comparison are wH02=c(-1/3,
+2/3, -1/3) .

Before writing these into a hypothesis matrix, we also define the estimated quantity for the
intercept term. The intercept parameter estimates the mean across all conditions:

α = (μ1 + μ2 + μ3)/3 = 1/3 ⋅ μ1 + 1/3 ⋅ μ2 + 1/3 ⋅ μ3
This estimate has weights of 1/3 for all condition means. The weights from all three model
parameters that were defined are now combined and written into a matrix that we refer to as
the hypothesis matrix ( Hc ):


HcSum <- rbind(
  cH00 = c(adjectives = 1 / 3, nouns = 1 / 3, verbs = 1 / 3),
  cH01 = c(adjectives = +2 / 3, nouns = -1 / 3, verbs = -1 / 3),
  cH02 = c(adjectives = -1 / 3, nouns = +2 / 3, verbs = -1 / 3)
)
fractions(t(HcSum))

##            cH00 cH01 cH02
## adjectives 1/3  2/3  -1/3
## nouns      1/3  -1/3 2/3
## verbs      1/3  -1/3 -1/3

Each set of weights is first entered as a row into the matrix (command rbind() ).26 We switch
rows and columns of the matrix for easier readability using the command t() (this
transposes the matrix). The command fractions() from the MASS package turns the decimals
into fractions to improve readability.

Now that the condition weights have been written into the hypothesis matrix, the third step of
the procedure is implemented: a matrix operation called the generalized matrix inverse27 is
used to obtain the contrast matrix that is needed to estimate these comparisons in a linear
model.

Use the function ginv() from the MASS package for this next step. Define a function
ginv2() for nicer formatting of the output.28


ginv2 <- function(x) { # define a function to make the output nicer
  fractions(provideDimnames(ginv(x),
    base = dimnames(x)[2:1]
  ))
}
Applying the generalized inverse to the hypothesis matrix results in the new matrix XcSum .
This is the contrast matrix Xc that estimates exactly those comparisons that were specified
earlier:


(XcSum <- ginv2(HcSum))

##            cH00 cH01 cH02
## adjectives 1    1    0
## nouns      1    0    1
## verbs      1    -1   -1

This contrast matrix corresponds exactly to the sum contrasts described above. In the case of
the sum contrast, the contrast matrix looks very different from the hypothesis matrix. The
contrast matrix in sum contrasts codes with +1 the condition that is to be compared to the
grand mean. The condition that is never compared to the grand mean is coded as −1 . Without
knowing the relationship between the hypothesis matrix and the contrast matrix, the meaning
of the coefficients is completely opaque.

To verify this custom-made contrast matrix, it is compared to the sum contrast matrix as
generated by the R function contr.sum() in the stats package. The resulting contrast
matrix is identical to the result when adding the intercept term, a column of ones, to the
contrast matrix:


fractions(cbind(1, contr.sum(3)))

##   [,1] [,2] [,3]
## 1 1    1    0
## 2 1    0    1
## 3 1    -1   -1

In order to estimate model parameters, step four in our procedure involves assigning sum
contrasts to the factor F in our example data, and fitting a (Bayesian) linear model. This
allows us to estimate the regression coefficients associated with each contrast. We compare
these to the data shown above (Table 8.2) to test whether the regression coefficients actually
correspond to the differences of condition means, as intended. To define the contrast, it is
necessary to remove the intercept column from the contrast matrix, as the intercept is
automatically added by the modeling function brm() .


contrasts(df_contrasts2$F) <- XcSum[, 2:3]

fit_Sum <- brm(DV ~ 1 + F,
  data = df_contrasts2,
  family = gaussian(),
  prior = c(
    prior(normal(500, 100), class = Intercept),
    prior(normal(0, 100), class = sigma),
    prior(normal(0, 100), class = b)
  )
)


fixef(fit_Sum)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    450.5      6.88 437.2 464.6
## FcH01        -49.4      9.92 -69.9 -30.2
## FcH02         49.1      9.83  29.1  69.3

The (Bayesian) linear model regression coefficients show the grand mean response time of
450 ms in the intercept. Remember that the first regression coefficient FcH01 was designed
to estimate the extent to which adjectives are responded to faster than the grand mean. The
regression coefficient FcH01 ( Estimate ) of −50 reflects the difference between adjectives
(400 ms) and the grand mean of 450 ms. The second estimate of interest tells us the extent to
which response times for nouns differ from the grand mean. The fact that the second
regression coefficient FcH02 is close to 50 indicates that response times for nouns (500 ms)
are slower than the grand mean of 450 ms. Thus, nouns are estimated to have 50 ms longer
response times than the grand mean, whereas response times for adjectives are 50 ms faster
than the grand mean.
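These relationships can be verified numerically using the rows of the contrast matrix (adjectives: (1, 0); nouns: (0, 1); verbs: (−1, −1)) and the posterior means; a sketch:

b <- fixef(fit_Sum)[, "Estimate"]
c(adjectives = unname(b["Intercept"] + b["FcH01"]),
  nouns = unname(b["Intercept"] + b["FcH02"]),
  verbs = unname(b["Intercept"] - b["FcH01"] - b["FcH02"]))
# approx. 400, 500, and 450 ms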
We have now not only derived contrasts, parameter estimates, and comparisons for the sum
contrast, we have also used a powerful and highly general procedure that is used to generate
contrasts for many kinds of different comparisons and experimental designs.

8.2.3 Generating contrasts: The hypr package

To work with the four-step procedure, i.e., to flexibly design contrasts to estimate specific
comparisons, we have developed the R package hypr (Rabe, Vasishth, Hohenstein, Kliegl,
and Schad 2020b). This package allows the researcher to specify the desired comparisons,
and based on these comparisons, it automatically generates contrast matrices that allow us to
estimate these comparisons in linear models. The functions available in this package thus
considerably simplify the implementation of the four-step procedure outlined above. Because
hypr was originally written with the frequentist framework in mind, the comparisons are
expressed as null hypotheses. In the Bayesian framework, these should be treated as
comparisons between (bundles of) condition means.
To illustrate the functionality of the hypr package, we will use the two comparisons that we
defined and analyzed in the previous section:

β1 = μ1 − (μ1 + μ2 + μ3)/3 = μ1 − GM

and

β2 = μ2 − (μ1 + μ2 + μ3)/3 = μ2 − GM

These estimates are effectively comparisons between condition means or between bundles of
condition means. That is, μ1 is compared to the grand mean and μ2 is compared to the GM.
These two comparisons can be directly entered into R using the hypr() function from the
hypr package. To do so, we use some labels to indicate factor levels; e.g., adjectives ,
nouns , and verbs can represent factor levels μ1, μ2, and μ3. The first comparison specifies
that μ1 is compared to (μ1 + μ2 + μ3)/3. This can be written as a formula in R: adjectives ~
(adjectives + nouns + verbs)/3 . The second comparison is that μ2 is compared to
(μ1 + μ2 + μ3)/3, which can be written as nouns ~ (adjectives + nouns + verbs)/3 .

HcSum <- hypr(
  b1 = adjectives ~ (adjectives + nouns + verbs) / 3,
  b2 = nouns ~ (adjectives + nouns + verbs) / 3,
  levels = c("adjectives", "nouns", "verbs")
)
HcSum

## hypr object containing 2 null hypotheses:
## H0.b1: 0 = (2*adjectives - nouns - verbs)/3
## H0.b2: 0 = (2*nouns - adjectives - verbs)/3
##
## Call:
## hypr(b1 = ~2/3 * adjectives - 1/3 * nouns - 1/3 * verbs, b2 = ~2/3 *
##     nouns - 1/3 * adjectives - 1/3 * verbs, levels = c("adjectives",
##     "nouns", "verbs"))
##
## Hypothesis matrix (transposed):
##            b1   b2
## adjectives 2/3  -1/3
## nouns      -1/3 2/3
## verbs      -1/3 -1/3
##
## Contrast matrix:
##            b1 b2
## adjectives 1  0
## nouns      0  1
## verbs      -1 -1

The results show that the comparisons between condition means have been re-written into a
form where 0 is coded on the left side of the equation, and the condition means together with
associated weights are written on the right side of the equation. This presentation makes it
easy to see the weights of the condition means to code a certain comparison. The next part of
the results shows the hypothesis matrix, which contains the weights from the condition means.
Thus, hypr takes comparisons between condition means as input, and automatically extracts
the corresponding weights and encodes them into the hypothesis matrix. hypr moreover
applies the generalized matrix inverse to obtain the contrast matrix from the hypothesis matrix.
The different steps correspond exactly to the steps we had carried out manually in the
preceding section. hypr automatically performs these steps for us. We can now extract the
contrast matrix by a simple function call:


contr.hypothesis(HcSum)

## b1 b2
## adjectives 1 0
## nouns 0 1
## verbs -1 -1
## attr(,"class")
## [1] "hypr_cmat" "matrix" "array"

We can assign this contrast to our factor as we did before.


contrasts(df_contrasts2$F) <- contr.hypothesis(HcSum)

Now, we could again run the same model. However, since the contrast matrix is now the same
as used before, the results would also be exactly the same, and we therefore skip the model
fitting for brevity.

The hypr package can be used to create contrasts for Bayesian models, where the focus
lies on the estimation of contrasts that code comparisons between condition means or bundles
of condition means (of course, one can also use contrast coding for carrying out hypothesis
tests using the Bayes factor; see chapter 15). Thus, the comparison that one specifies implies
the estimation of a difference between condition means or bundles of condition means. We see
this in the output of the hypr() function (see the first section of the results): these lines
formulate the comparison in a way that also illustrates the estimation of model parameters.
That is, the comparison (expressed in the hypr package's syntax) μ1 ~ (μ1 + μ2 + μ3)/3
corresponds to a parameter estimate of b1 = 2/3*m1 - 1/3*m2 - 1/3*m3, where m1 to m3 are the
means for each of the conditions. The resulting contrasts will then allow us to estimate the
specified differences between condition means or bundles of condition means.
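
To make this mapping concrete, we can verify it by hand. The following sketch uses
hypothetical condition means of 400, 500, and 450 ms (values that will reappear in section
8.5): applying the b1 weights from the hypothesis matrix to these means recovers exactly
μ1 − GM.

# a quick sanity check with hypothetical condition means (in ms):
m <- c(adjectives = 400, nouns = 500, verbs = 450)
w_b1 <- c(2 / 3, -1 / 3, -1 / 3)   # the b1 row of the hypothesis matrix
sum(w_b1 * m)                      # -50
unname(m["adjectives"] - mean(m))  # -50: the same value as mu1 - GM
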
8.3 Other types of contrasts: illustration with a factor with four levels

Here, we introduce repeated difference, Helmert, and polynomial contrasts. For these, it may
be instructive to consider an experiment with one between-subject factor with four levels. We
load a corresponding data set, which contains simulated data about response times with a
four-level between-subject factor. The sample sizes for each level and the means and
standard errors are shown in Table 8.3, and the means and standard errors are also shown
graphically in Figure 8.2.


data("df_contrasts3")

## [1] 20


FIGURE 8.2: The means and error bars (showing standard errors) for a simulated data set
with one between-subjects factor with four levels.
TABLE 8.3: Summary statistics per condition for the simulated data.

Factor   N data   Est. means   Std. dev.   Std. errors
F1            5         10.0        10.0           4.5
F2            5         20.0        10.0           4.5
F3            5         10.0        10.0           4.5
F4            5         40.0        10.0           4.5

We assume that the four factor levels F1 to F4 reflect levels of word frequency, including the
levels low , medium-low , medium-high , and high frequency words, and that the dependent
variable reflects response time.

Qualitatively, the simulated pattern of results is similar to empirically observed values for word
frequency effects on single fixation durations in eye tracking (Heister, Würzner, and Kliegl
2012).

8.3.1 Repeated contrasts

Arguably, the most popular contrast psychologists and psycholinguists are interested in is the
comparison between neighboring levels of a factor. This type of contrast is called the repeated
contrast. In our example, our research question might be whether frequency level 2 leads to
slower response times than frequency level 1, whether frequency level 3 leads to slower
response times than frequency level 2, and whether frequency level 4 leads to slower response
times than frequency level 3.

Repeated contrasts are used to implement these comparisons. Consider first how to derive
the contrast matrix for repeated contrasts, starting out by specifying the comparisons that are
to be estimated. Importantly, this again applies the general strategy of how to translate (any)
comparisons between groups or conditions into a set of contrasts, yielding a powerful tool of
great value in many research settings. We follow the four-step procedure outlined above.

The first step is to specify our comparisons, and to write them down in a way such that their
weights can be extracted easily. For a four-level factor, the three comparisons are:

β2−1 = −1 ⋅ μ1 + 1 ⋅ μ2 + 0 ⋅ μ3 + 0 ⋅ μ4

β3−2 = 0 ⋅ μ1 − 1 ⋅ μ2 + 1 ⋅ μ3 + 0 ⋅ μ4

β4−3 = 0 ⋅ μ1 + 0 ⋅ μ2 − 1 ⋅ μ3 + 1 ⋅ μ4
Here, the μx are the mean response times in condition x. Each regression coefficient gives
weights to the different condition means. For example, the first estimate (β2−1) estimates the
difference between the condition mean for F2 (μ2) minus the condition mean for F1 (μ1), but
ignores the condition means for F3 and F4 (μ3, μ4): μ1 has a weight of −1, μ2 has a weight of
+1, and μ3 and μ4 have weights of 0.

We can write these comparisons into hypr:


HcRep <- hypr(
  c2vs1 = F2 ~ F1,
  c3vs2 = F3 ~ F2,
  c4vs3 = F4 ~ F3,
  levels = c("F1", "F2", "F3", "F4")
)
HcRep

## hypr object containing 3 null hypotheses:
## H0.c2vs1: 0 = F2 - F1
## H0.c3vs2: 0 = F3 - F2
## H0.c4vs3: 0 = F4 - F3
##
## Call:
## hypr(c2vs1 = ~F2 - F1, c3vs2 = ~F3 - F2, c4vs3 = ~F4 - F3, levels = c("F1",
##     "F2", "F3", "F4"))
##
## Hypothesis matrix (transposed):
##    c2vs1 c3vs2 c4vs3
## F1 -1     0     0
## F2  1    -1     0
## F3  0     1    -1
## F4  0     0     1
##
## Contrast matrix:
##    c2vs1 c3vs2 c4vs3
## F1 -3/4  -1/2  -1/4
## F2  1/4  -1/2  -1/4
## F3  1/4   1/2  -1/4
## F4  1/4   1/2   3/4

The hypothesis matrix shows exactly the weights that we had written down above. Moreover,
we see the contrast matrix. In the case of the repeated contrast, the contrast matrix again
looks very different from the hypothesis matrix. In this case, the contrast matrix looks a lot less
intuitive than the hypothesis matrix, and if one did not know the associated hypothesis matrix,
it would be unclear what the contrast matrix actually tests. To verify this custom-made
contrast matrix, we compare it to the repeated contrast matrix as generated by the R function
contr.sdif() in the MASS package (Ripley 2019). The resulting contrast matrix is identical to
our result:


fractions(contr.sdif(4))
## 2-1 3-2 4-3
## 1 -3/4 -1/2 -1/4
## 2 1/4 -1/2 -1/4
## 3 1/4 1/2 -1/4
## 4 1/4 1/2 3/4

We can thus use either approach ( hypr() or contr.sdif() ) to obtain the contrast matrix in
this case. Next, we apply the repeated contrasts to the factor F in the example data and run a
linear model. This allows us to estimate the regression coefficients associated with each
contrast. These are compared to the data in Figure 8.2 to test whether the regression
coefficients actually correspond to the differences between successive condition means, as
intended.

contrasts(df_contrasts3$F) <- contr.hypothesis(HcRep)

fit_Rep <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Rep)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    20.01      2.40  15.4 24.72
## Fc2vs1       10.09      6.87  -3.3 23.11
## Fc3vs2       -9.67      6.80 -22.9  3.74
## Fc4vs3       29.29      6.74  15.7 42.49

The results show that as expected, the regression coefficients reflect the differences that were
of interest: the regression coefficient ( Estimate ) Fc2vs1 has a value of 10 , which exactly
corresponds to the difference between the condition mean for F2 (20) minus the condition
mean for F1 (10), i.e., 20 − 10 = 10 . Likewise, the regression coefficient Fc3vs2 has a
value of −10 , which corresponds to the difference between the condition mean for F3 (10)
minus the condition mean for F2 (20), i.e., 10 − 20 = −10 . Finally, the regression coefficient
Fc4vs3 has a value of roughly 30 , which reflects the difference between condition F4 (40)
minus condition F3 (10), i.e., 40 − 10 = 30 . Thus, the regression coefficients estimate
differences between successive or neighboring condition means.

8.3.2 Helmert contrasts

Another common contrast is the Helmert contrast. In a Helmert contrast for our four-level
factor, the first contrast compares level F1 versus F2 . The second contrast compares level
F3 to the average of the first two, i.e., F3 ~ (F1+F2)/2 . The third contrast then compares
level F4 to the average of the first three. We can code this contrast in hypr :

HcHel <- hypr(
  b1 = F2 ~ F1,
  b2 = F3 ~ (F1 + F2) / 2,
  b3 = F4 ~ (F1 + F2 + F3) / 3,
  levels = c("F1", "F2", "F3", "F4")
)
HcHel

## hypr object containing 3 null hypotheses:
## H0.b1: 0 = F2 - F1
## H0.b2: 0 = (2*F3 - F1 - F2)/2
## H0.b3: 0 = (3*F4 - F1 - F2 - F3)/3
##
## Call:
## hypr(b1 = ~F2 - F1, b2 = ~F3 - 1/2 * F1 - 1/2 * F2, b3 = ~F4 -
##     1/3 * F1 - 1/3 * F2 - 1/3 * F3, levels = c("F1", "F2", "F3",
##     "F4"))
##
## Hypothesis matrix (transposed):
##    b1 b2   b3
## F1 -1 -1/2 -1/3
## F2  1 -1/2 -1/3
## F3  0  1   -1/3
## F4  0  0    1
##
## Contrast matrix:
##    b1   b2   b3
## F1 -1/2 -1/3 -1/4
## F2  1/2 -1/3 -1/4
## F3  0    2/3 -1/4
## F4  0    0    3/4

The classical Helmert contrast coded by the function contr.helmert() yields a similar but
slightly different result:

contr.helmert(4)

##   [,1] [,2] [,3]
## 1   -1   -1   -1
## 2    1   -1   -1
## 3    0    2   -1
## 4    0    0    3

These contrasts are scaled versions of our custom Helmert contrast: the first column of our
custom Helmert contrast has to be multiplied by 2, the second column by 3, and the third
column by 4 to obtain the classical version. Therefore, we suggest that our custom Helmert
contrast defined using the hypr function is more appropriate and intuitive to use. Probably the
only reason the classical Helmert contrast uses these scaled differences is that the rescaling
yields a simpler contrast matrix, which consists of integers rather than fractions. The intuitive
estimates from our custom Helmert contrast seem much more relevant in Bayesian approaches
today.
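
This scaling relationship can be checked directly in R; the following sketch multiplies the
columns of our custom Helmert contrast by 2, 3, and 4, which should recover the classical
contr.helmert(4) matrix (up to row and column names):

Xc <- contr.hypothesis(HcHel)
# multiply column k by k + 1 to recover the classical Helmert coding:
sweep(Xc, 2, c(2, 3, 4), `*`)
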

contrasts(df_contrasts3$F) <- contr.hypothesis(HcHel)

fit_Hel <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Hel)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.99      2.38  15.37  24.5
## Fb1           9.79      6.88  -3.84  23.5
## Fb2          -4.88      5.87 -16.33   6.7
## Fb3          26.29      5.51  15.31  37.4

When we fit the Bayesian model using our custom Helmert contrast, we can see that the
estimates reflect the comparisons outlined above. The first estimate Fb1 has a value of
roughly 10, reflecting the difference between conditions F2 and F1. The second estimate
Fb2 has a value of roughly −5, which reflects the difference between condition F3 (10) and
the average of the first two conditions ((10 + 20)/2 = 15). The estimate Fb3 reflects the
difference between F4 (40) minus the average of the first three, which is
(10 + 20 + 10)/3 = 13.3, and is thus 40 − 13.3 = 26.7.

Box 8.1 Treatment contrast with intercept as the grand mean.

Above, we have introduced the treatment contrast, where each contrast compares one
condition to a baseline condition. We have discussed that the intercept in the treatment
contrast estimates the condition mean for the baseline condition. There are some
applications where this behavior may seem sub-optimal. This can be the case in
experimental designs with multiple factors, where we may want to use centered contrasts
(this is discussed below). Moreover, the contrast coding of the population-level (or fixed)
effects also defines what the group-level (or random) effects assess. If the intercept
assesses the grand mean - rather than the baseline condition - in hierarchical models,
then the group-level intercepts reflect the grand mean variance, rather than the variance
in the baseline condition.

It is possible to design a treatment contrast where the intercept reflects the grand mean.
We implement this using the hypr package. The trick is to add the intercept explicitly as
a comparison of the average of all four condition means:

HcTrGM <- hypr(
  b0 = ~ (F1 + F2 + F3 + F4) / 4,
  b1 = F2 ~ F1,
  b2 = F3 ~ F1,
  b3 = F4 ~ F1,
  levels = c("F1", "F2", "F3", "F4")
)
HcTrGM

## hypr object containing 4 null hypotheses:
## H0.b0: 0 = (F1 + F2 + F3 + F4)/4 (Intercept)
## H0.b1: 0 = F2 - F1
## H0.b2: 0 = F3 - F1
## H0.b3: 0 = F4 - F1
##
## Call:
## hypr(b0 = ~1/4 * F1 + 1/4 * F2 + 1/4 * F3 + 1/4 * F4, b1 = ~F2 -
##     F1, b2 = ~F3 - F1, b3 = ~F4 - F1, levels = c("F1", "F2",
##     "F3", "F4"))
##
## Hypothesis matrix (transposed):
##    b0  b1 b2 b3
## F1 1/4 -1 -1 -1
## F2 1/4  1  0  0
## F3 1/4  0  1  0
## F4 1/4  0  0  1
##
## Contrast matrix:
##    b0 b1   b2   b3
## F1 1  -1/4 -1/4 -1/4
## F2 1   3/4 -1/4 -1/4
## F3 1  -1/4  3/4 -1/4
## F4 1  -1/4 -1/4  3/4

The hypothesis matrix now explicitly codes the intercept as the first column, where all
hypothesis weights are equal and sum up to one. This is coding the intercept. The other
hypothesis weights are as expected for the treatment contrast. The contrast matrix now
looks very different compared to the standard treatment contrast. Next, we fit a model with
this adapted treatment contrast. The function contr.hypothesis automatically removes
the intercept that is encoded in HcTrGM , since this is automatically added by brms .

contrasts(df_contrasts3$F) <- contr.hypothesis(HcTrGM)

fit_TrGM <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_TrGM)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    20.03      2.51  14.97  24.9
## Fb1           9.45      6.90  -4.04  22.7
## Fb2          -0.41      6.72 -14.28  12.8
## Fb3          29.33      6.82  15.81  42.5

The results show that the coefficients reflect comparisons of each condition F2 , F3, and
F4 to the baseline condition F1 . The intercept now captures the grand mean across all
four conditions of 20 .
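
As a cross-check, the posterior samples of the four condition means can be reconstructed
from this coding by hand, following the rows of the contrast matrix (a sketch; b_Fb1 etc.
are the column names that brms assigns to the posterior draws):

draws <- as.data.frame(fit_TrGM)
# row F1 of the contrast matrix is (1, -1/4, -1/4, -1/4):
F1 <- draws$b_Intercept - (draws$b_Fb1 + draws$b_Fb2 + draws$b_Fb3) / 4
F2 <- F1 + draws$b_Fb1  # each slope is the difference to baseline F1
F3 <- F1 + draws$b_Fb2
F4 <- F1 + draws$b_Fb3
round(c(F1 = mean(F1), F2 = mean(F2), F3 = mean(F3), F4 = mean(F4),
        GM = mean((F1 + F2 + F3 + F4) / 4)))

The average of the four reconstructed condition means indeed reproduces the intercept
estimate of about 20.
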

8.3.3 Contrasts in linear regression analysis: The design or model matrix

We have discussed how different contrasts are created from the hypothesis matrix. What we
have not treated in detail is how exactly contrasts are used in a linear model. Here, we will see
that the contrasts for a factor in a linear model are just the same thing as continuous numeric
predictors (i.e., covariates) in a linear/multiple regression analysis. That is, contrasts are the
way to encode discrete factor levels into numeric predictor variables to use in linear/multiple
regression analysis, by encoding which differences between factor levels are estimated. The
contrast matrix Xc that we have looked at so far has one entry (row) for each experimental
condition. For use in a linear model, the contrast matrix is coded into a design or model matrix
X , where each individual data point has one row. The design matrix X can be extracted using
the function model.matrix() :

# contrast matrix:
(contrasts(df_contrasts3$F) <- contr.hypothesis(HcRep))

##    c2vs1 c3vs2 c4vs3
## F1 -0.75  -0.5 -0.25
## F2  0.25  -0.5 -0.25
## F3  0.25   0.5 -0.25
## F4  0.25   0.5  0.75
## attr(,"class")
## [1] "hypr_cmat" "matrix"    "array"

# design/model matrix:
covars <- model.matrix(~ 1 + F, df_contrasts3)
(covars <- as.data.frame(covars))

##    (Intercept) Fc2vs1 Fc3vs2 Fc4vs3
## 1            1  -0.75   -0.5  -0.25
## 2            1  -0.75   -0.5  -0.25
## 3            1  -0.75   -0.5  -0.25
## 4            1  -0.75   -0.5  -0.25
## 5            1  -0.75   -0.5  -0.25
## 6            1   0.25   -0.5  -0.25
## 7            1   0.25   -0.5  -0.25
## 8            1   0.25   -0.5  -0.25
## 9            1   0.25   -0.5  -0.25
## 10           1   0.25   -0.5  -0.25
## 11           1   0.25    0.5  -0.25
## 12           1   0.25    0.5  -0.25
## 13           1   0.25    0.5  -0.25
## 14           1   0.25    0.5  -0.25
## 15           1   0.25    0.5  -0.25
## 16           1   0.25    0.5   0.75
## 17           1   0.25    0.5   0.75
## 18           1   0.25    0.5   0.75
## 19           1   0.25    0.5   0.75
## 20           1   0.25    0.5   0.75

For each of the 20 subjects, four numbers are stored in this model matrix: the value 1 for the
intercept, plus the values of the three predictor variables used to predict response times in
the task. Indeed, this matrix is exactly the design matrix X commonly used in multiple
regression analysis, where each column represents one numeric predictor variable (covariate),
and the first column codes the intercept term.

To further illustrate this, the covariates are extracted from this design matrix and stored
separately as numeric predictor variables in the data frame:


df_contrasts3[, c("Fc2vs1", "Fc3vs2", "Fc4vs3")] <- covars[, 2:4]

They are now used as numeric predictor variables in a multiple regression analysis:

fit_m3 <- brm(DV ~ 1 + Fc2vs1 + Fc3vs2 + Fc4vs3,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_m3)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    20.04      2.50  15.07 25.11
## Fc2vs1        9.61      6.94  -4.38 22.91
## Fc3vs2       -9.41      6.76 -22.66  3.93
## Fc4vs3       29.30      6.64  16.44 42.36

The results show that the regression coefficients are the same as in the contrast-based
analysis shown in the previous section (on repeated contrasts). This demonstrates that
contrasts serve to code discrete factor levels into a linear/multiple regression analysis by
numerically encoding comparisons between specific condition means.

8.3.4 Polynomial contrasts

Polynomial contrasts are another option for analyzing factors. Suppose that we expect a linear
trend across conditions, where the response increases by a constant magnitude with each
successive factor level. This could be the expectation when four levels of a factor reflect
decreasing levels of word frequency (i.e., four factor levels: high, medium-high, medium-low,
and low word frequency), where one expects the fastest response for high frequency words,
and successively slower responses for lower word frequencies. The effect for each individual
level of a factor (e.g., as coded via a repeated contrast) may not be strong enough for
detecting it in the statistical model. Specifying a linear trend in a polynomial contrast (see
effect F.L below) allows us to pool the whole increase (across all four factor levels) into a
single coefficient for the linear trend, increasing statistical sensitivity for estimating/detecting
the increase. Such a specification constrains the estimate to one interpretable parameter, e.g.,
a linear increase across factor levels. The larger the number of factor levels, the more
parsimonious are polynomial contrasts compared to contrast-based specifications as
introduced in the previous sections. Going beyond a linear trend, one may also have
expectations about quadratic trends (see the estimate for F.Q below). For example, one may
expect an increase only among very low frequency words, but no difference between high and
medium-high frequency words.

Here is an example for how to code polynomial contrasts for a four-level factor. In this case,
one can estimate a linear ( F.L ), a quadratic ( F.Q ), and a cubic ( F.C ) trend. If more factor
levels are present, then higher order trends can be estimated.
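
Before fitting the model, it can be instructive to verify two properties of the polynomial
contrast matrix itself: its columns are centered and mutually orthogonal (in fact,
orthonormal). A brief sketch:

Xpol <- contr.poly(4)
colSums(Xpol)               # (numerically) zero: each column is centered
round(crossprod(Xpol), 10)  # identity matrix: columns are orthonormal
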

Xpol <- contr.poly(4)
(contrasts(df_contrasts3$F) <- Xpol)

##          .L   .Q     .C
## [1,] -0.671  0.5 -0.224
## [2,] -0.224 -0.5  0.671
## [3,]  0.224 -0.5 -0.671
## [4,]  0.671  0.5  0.224

fit_Pol <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Pol)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 19.97 2.53 14.97 24.9
## F.L 17.76 4.87 7.95 27.3
## F.Q 9.88 4.88 0.26 19.8
## F.C 13.23 4.89 3.65 22.7

In this example, condition means increase across factor levels in a linear fashion, but there
may also be quadratic and cubic trends.

8.3.5 An alternative to contrasts: monotonic effects

An alternative to specifying contrasts to estimate specific comparisons between factor levels is
monotonic effects (https://paul-buerkner.github.io/brms/articles/brms_monotonic.html; Bürkner
and Charpentier 2020). This approach simply assumes that the dependent variable increases (or
decreases) in a monotonic fashion across levels of an ordered factor. In this kind of analysis,
one does not define contrasts specifying differences between (groups of) factor levels.
Instead, one estimates one parameter which captures the average increase (or decrease) in
the dependent variable associated with two neighboring factor levels. Moreover, one estimates
the percentages of the overall increase (or decrease) that are associated with each of the
differences between neighboring factor levels (i.e., similar to simple difference contrasts, but
measured in percentage increase, and assuming monotonicity, i.e., that the same increase or
decrease is present for all simple differences).

To implement a monotonic analysis, we first code the factor F as being an ordered factor
(i.e., ordered=TRUE ). Then, we specify that we want to estimate a monotonic effect of F using
the notation mo(F) in our call to brms :

df_contrasts3$F <- factor(df_contrasts3$F, ordered = TRUE)
fit_mo <- brm(DV ~ 1 + mo(F),
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

summary(fit_mo)

##  Family: gaussian
##   Links: mu = identity; sigma = identity
## Formula: DV ~ 1 + mo(F)
##    Data: df_contrasts3 (Number of observations: 20)
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     9.45      4.24     1.23    17.62 1.00     2028     2319
## moF           9.46      2.45     4.23    14.14 1.00     2006     1865
##
## Simplex Parameters:
##         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## moF1[1]     0.20      0.14     0.01     0.52 1.00     1927     1548
## moF1[2]     0.11      0.11     0.00     0.42 1.00     2804     2106
## moF1[3]     0.69      0.17     0.29     0.94 1.00     2110     2053
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    11.75      2.30     8.29    17.22 1.00     2197     2505
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The results show that there is an overall positive population-level effect of the factor F, with an
estimate (moF) of 9.46, reflecting an average increase in the dependent variable of 9.46
with each level of F. The model summary also shows estimates for the simplex parameters,
which represent the ratios of the overall increase associated with F that can be attributed to
each of the differences between neighboring factor levels. The results show that most of the
increase is associated with moF1[3], i.e., with the last difference, reflecting the difference
between F3 and F4, whereas the other two differences (moF1[1], reflecting the difference
between F1 and F2, and moF1[2], reflecting the difference between F2 and F3) are smaller.
Comparing conditional effects between a model using polynomial contrasts and a model
assuming monotonic effects makes it clear that the current model “forces” the effects to
increase in a monotonic fashion; see Figure 8.3.

ppoly <- conditional_effects(fit_Pol)
pmon <- conditional_effects(fit_mo)
plot(ppoly, plot = FALSE)[[1]] + ggtitle("Polynomial contrasts")
plot(pmon, plot = FALSE)[[1]] + ggtitle("Monotonic effects")

FIGURE 8.3: Conditional effects using the polynomial contrasts on the left side vs. assuming
monotonic effects on the right side.

This is regardless of the information provided in the data; see the posterior predictive checks
in Figure 8.4.

pp_check(fit_Pol, type = "violin_grouped",
         group = "F", y_draw = "points") +
  theme(legend.position = "bottom") +
  coord_cartesian(ylim = c(-55, 105)) +
  ggtitle("Polynomial contrasts")

pp_check(fit_mo, type = "violin_grouped",
         group = "F", y_draw = "points") +
  theme(legend.position = "bottom") +
  coord_cartesian(ylim = c(-55, 105)) +
  ggtitle("Monotonic effects")

FIGURE 8.4: Posterior predictive distributions by condition using the polynomial contrasts on
the left side vs. assuming monotonic effects on the right side.
The monotonicity assumption is violated in the current data set, since the mean is larger in
condition F2 than in condition F3 . The monotonic model thus assumes this (negative)
difference is due to chance; see Figure 8.4.

Estimating such monotonic effects provides an alternative to the contrast coding we treat in
the rest of this chapter. It may be relevant when the specific differences between factor levels
are not of interest, but when instead the goal is to estimate the overall monotonic effect of a
factor and when this overall effect is not well approximated by a simple linear trend.
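
For completeness, here is a sketch of how the implied condition means can be reconstructed
from the monotonic fit by hand. It assumes brms's parameterization of monotonic effects
(Bürkner and Charpentier 2020), in which the predicted mean at level x is the intercept plus
bsp_moF times D times the sum of the first x simplex parameters, with D = 3 steps between
our four levels; conditional_effects(fit_mo) computes the same fitted values directly.

# a sketch, assuming brms's monotonic parameterization:
draws <- as.data.frame(fit_mo)
zeta <- as.matrix(draws[, paste0("simo_moF1[", 1:3, "]")])
D <- 3  # number of differences between the 4 levels
mu <- cbind(F1 = draws$b_Intercept,  # intercept = lowest level
            F2 = draws$b_Intercept + draws$bsp_moF * D * zeta[, 1],
            F3 = draws$b_Intercept + draws$bsp_moF * D * rowSums(zeta[, 1:2]),
            F4 = draws$b_Intercept + draws$bsp_moF * D * rowSums(zeta))
round(colMeans(mu))
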

8.4 What makes a good set of contrasts?

For a factor with I levels, one can make only I − 1 comparisons within a single model. For
example, in a design with one factor with two levels, only one comparison is possible
(between the two factor levels). The reason for this is that the intercept is also estimated.
More generally, if we have one factor with I1 levels and another factor with I2 levels, then the
total number of conditions is I1 × I2 = ν (not I1 + I2), which implies a maximum of ν − 1
contrasts.

For example, in a design with one factor with three levels, A, B, and C, in principle one could
make three comparisons (A vs. B, A vs. C, B vs. C). However, after defining an intercept,
only two means can be compared. Therefore, for a factor with three levels, we define two
comparisons within one statistical model.

One critical precondition for contrasts is that they implement different comparisons that are not
collinear, that is, that none of the contrasts can be generated from the other contrasts by linear
combination. For example, the contrast c1 = c(1,2,3) can be generated from the contrast
c2 = c(3,4,5) simply by computing c2 - 2 . Therefore, contrasts c1 and c2 cannot be used

simultaneously. That is, each contrast needs to encode some independent information about
the data.
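
This can be verified directly in R: the sketch below shows that c2 is just a shifted copy of
c1, and that a model matrix containing an intercept and both vectors is rank-deficient:

c1 <- c(1, 2, 3)
c2 <- c(3, 4, 5)
all.equal(c2 - 2, c1)             # TRUE: c2 is a shifted copy of c1
X <- cbind(intercept = 1, c1, c2)
qr(X)$rank                        # 2, not 3: columns are linearly dependent
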

There are (at least) two criteria to decide what a good contrast is. First, orthogonal contrasts
have advantages as they estimate mutually independent comparisons in the data (see Dobson
and Barnett 2011, sec. 6.2.5, p. 91 for a detailed explanation of orthogonality). Second, it is
crucial that contrasts are defined in a way such that they answer the research questions. One
way to accomplish this second point is to use the hypothesis matrix to generate contrasts
(e.g., via the hypr package), as this ensures that one uses contrasts that exactly estimate
the comparisons of interest in a given study.

8.4.1 Centered contrasts

Contrasts are often constrained to be centered, such that the individual contrast coefficients
ci for the different factor levels i sum to 0: ∑i ci = 0, where the sum runs over the I factor
levels. This has advantages when estimating interactions with other factors or covariates (we
discuss interactions between factors in the next chapter). All contrasts discussed here are
centered except for the treatment contrast, in which the contrast coefficients for each contrast
do not sum to zero:


colSums(contr.treatment(4))

## 2 3 4
## 1 1 1

Other contrasts, such as repeated contrasts, are centered and the contrast coefficients for
each contrast sum to 0:


colSums(contr.sdif(4))
## 2-1 3-2 4-3
## 0 0 0

The contrast coefficients mentioned above appear in the contrast matrix. The weights in the
hypothesis matrix are always centered. This is also true for the treatment contrast. The reason
is that they code comparisons between conditions or bundles of conditions. The only
exception are the weights for the intercept, which are all the same and together always sum to
1 in the hypothesis matrix. This is done to ensure that when applying the generalized matrix
inverse, the intercept results in a constant term with values of 1 in the contrast matrix. An
important question concerns whether (or when) the intercept needs to be considered in the
generalized matrix inversion, and whether (or when) it can be ignored. This question is closely
related to orthogonal contrasts, a concept we turn to below.

8.4.2 Orthogonal contrasts

Two centered contrasts c1 and c2 are orthogonal to each other if the following condition
applies (here, i indexes the i-th cell of the vector representing the contrast):

∑i c1,i ⋅ c2,i = 0

Whether contrasts are orthogonal can often be determined easily by computing the correlation
between two contrasts. Orthogonal contrasts have a correlation of 0. Contrasts are therefore
just a special case for the general case of predictors in regression models, where two numeric
predictor variables are orthogonal if they are uncorrelated.
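
As a small illustration, consider two centered contrasts for a four-level factor (these are two
of the sum contrasts used in the 2 × 2 example below):

c1 <- c(1, 1, -1, -1)
c2 <- c(1, -1, 1, -1)
sum(c1 * c2)  # 0: the contrasts are orthogonal
cor(c1, c2)   # 0: equivalently, they are uncorrelated
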

For example, coding two factors in a 2 × 2 design (we return to this case in a section on
designs with two factors below) using sum contrasts, these sum contrasts and their interaction
are orthogonal to each other:

(Xsum <- cbind(
  F1 = c(1, 1, -1, -1),
  F2 = c(1, -1, 1, -1),
  F1xF2 = c(1, -1, -1, 1)
))

##      F1 F2 F1xF2
## [1,]  1  1     1
## [2,]  1 -1    -1
## [3,] -1  1    -1
## [4,] -1 -1     1


cor(Xsum)

## F1 F2 F1xF2
## F1 1 0 0
## F2 0 1 0
## F1xF2 0 0 1

The correlations between the different contrasts (i.e., the off-diagonals) are exactly 0. Notice
that sum contrasts coding one multi-level factor are not orthogonal to each other:

cor(contr.sum(4))

##      [,1] [,2] [,3]
## [1,]  1.0  0.5  0.5
## [2,]  0.5  1.0  0.5
## [3,]  0.5  0.5  1.0

Here, the correlations between individual contrasts, which appear in the off-diagonals, deviate
from 0, indicating non-orthogonality. The same is also true for treatment and repeated
contrasts:


cor(contr.sdif(4))
## 2-1 3-2 4-3
## 2-1 1.000 0.577 0.333
## 3-2 0.577 1.000 0.577
## 4-3 0.333 0.577 1.000


cor(contr.treatment(4))

## 2 3 4
## 2 1.000 -0.333 -0.333
## 3 -0.333 1.000 -0.333
## 4 -0.333 -0.333 1.000

Orthogonality of contrasts plays a critical role when computing the generalized inverse. In the
inversion operation, orthogonal contrasts are converted independently from each other. That
is, the presence or absence of another orthogonal contrast does not change the resulting
weights. In fact, for orthogonal contrasts, applying the generalized matrix inverse to the
hypothesis matrix simply furnishes a scaled version of the hypothesis matrix in the contrast
matrix (for mathematical details see Schad et al. 2020).

In Bayesian models, scaling is always important, since we need to interpret the scale in order
to define priors or interpret posteriors. Therefore, when working with contrasts in Bayesian
models, the generalized matrix inverse is always a good procedure to use.
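
The following sketch illustrates this scaling property, using the orthogonal hypothesis matrix
of a 2 × 2 design and MASS::ginv() (note that ginv() drops row and column names):

library(MASS)  # for ginv() and fractions(), if not already loaded
Hc <- rbind(A   = c(1, 1, -1, -1),
            B   = c(1, -1, 1, -1),
            AxB = c(1, -1, -1, 1)) / 4  # orthogonal, centered hypothesis matrix
fractions(ginv(Hc))  # the contrast matrix is just 4 * t(Hc), a scaled version
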

8.4.3 The role of the intercept in non-centered contrasts

A related question concerns whether the intercept needs to be considered when computing
the generalized inverse for a contrast. This is of key importance when using the generalized
matrix inverse to define contrasts: the resulting contrast matrix and also the definition of
estimates can completely change between a situation where the intercept is explicitly
considered or not considered, and can thus change the resulting estimates in possibly
unintended ways. Thus, if the definition of the intercept is incorrect, the estimates of slopes
may also be wrong.

More specifically, it turns out that considering the intercept is necessary for contrasts that are
not centered. This is the case for treatment contrasts; e.g., for the treatment contrast for two
factor levels, c1vs0 = c(0, 1): ∑i ci = 0 + 1 = 1. One can actually show that the formula to
determine whether contrasts are centered (i.e., ∑i ci = 0) is the same formula as the formula
to test whether a contrast is “orthogonal to the intercept”. Remember that for the intercept, all
contrast coefficients are equal to one: c1,i = 1 (here, c1,i indicates the vector of contrast
coefficients associated with the intercept). We enter these contrast coefficient values into the
formula testing whether a contrast is orthogonal to the intercept (here, c2,i indicates the
vector of contrast coefficients associated with some contrast for which we want to test
whether it is “orthogonal to the intercept”): ∑i c1,i ⋅ c2,i = ∑i 1 ⋅ c2,i = ∑i c2,i = 0. The
resulting formula, ∑i c2,i = 0, is exactly the formula for whether a contrast is centered.
Because of this analogy, treatment contrasts can be viewed as “not orthogonal to the
intercept.” This means that the intercept needs to be considered when computing the
generalized inverse for treatment contrasts. As we have discussed above, when the intercept
is included in the hypothesis matrix, the weights for this intercept term should sum to one, as
this yields a column of ones for the intercept term in the contrast matrix.

We can see that considering the intercept makes a difference for the treatment contrast. First,
we define the comparisons involved in a treatment contrast, where two experimental
conditions b and c are each compared to a baseline condition a ( b~a and c~a ). In
addition, we explicitly code the intercept term, which involves a comparison of the baseline to
0 ( a~0 ). We take a look at the resulting contrast matrix:

hypr(int = a ~ 0, b1 = b ~ a, b2 = c ~ a)

## hypr object containing 3 null hypotheses:
## H0.int: 0 = a (Intercept)
## H0.b1: 0 = b - a
## H0.b2: 0 = c - a
##
## Call:
## hypr(int = ~a, b1 = ~b - a, b2 = ~c - a, levels = c("a", "b",
##     "c"))
##
## Hypothesis matrix (transposed):
##   int b1 b2
## a  1  -1 -1
## b  0   1  0
## c  0   0  1
##
## Contrast matrix:
##   int b1 b2
## a  1   0  0
## b  1   1  0
## c  1   0  1


contr.treatment(c("a", "b", "c"))

## b c
## a 0 0
## b 1 0
## c 0 1

This shows a contrast matrix that we know from the treatment contrast. The intercept is coded
as a column of ones, and each of the comparisons is coded as a 1 in the condition which is
compared to the baseline, and a 0 in the other conditions. The point here is that this gives us
the contrast matrix that is expected and known for the treatment contrast.
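
The same contrast matrix can be obtained by applying the generalized inverse by hand, which
provides a useful cross-check (a sketch using MASS::ginv(); ginv() drops the names):

library(MASS)  # if not already loaded
# hypothesis matrix with an explicit intercept row (weights sum to 1):
HcTr <- rbind(int = c(1, 0, 0),
              b1  = c(-1, 1, 0),
              b2  = c(-1, 0, 1))
ginv(HcTr)  # columns: a constant intercept term plus the treatment contrast
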

We can also ignore the intercept in the specification of the comparisons:

hypr(b1 = m1 ~ m0, b2 = m2 ~ m0)

## hypr object containing 2 null hypotheses:
## H0.b1: 0 = m1 - m0
## H0.b2: 0 = m2 - m0
##
## Call:
## hypr(b1 = ~m1 - m0, b2 = ~m2 - m0, levels = c("m0", "m1", "m2"
## ))
##
## Hypothesis matrix (transposed):
##    b1 b2
## m0 -1 -1
## m1  1  0
## m2  0  1
##
## Contrast matrix:
##    b1   b2
## m0 -1/3 -1/3
## m1  2/3 -1/3
## m2 -1/3  2/3

Notice that the resulting contrast matrix now looks very different from the contrast matrix that
we know from the treatment contrast. Indeed, this contrast also estimates a reasonable set of
quantities. It again estimates whether the condition mean m1 differs from the baseline and
whether m2 differs from baseline. However, the intercept now estimates the average
dependent variable across all three conditions (i.e., the grand mean). This can be seen by
explicitly adding a comparison of the average of all three conditions to 0:

hypr(int = (m0 + m1 + m2) / 3 ~ 0, b1 = m1 ~ m0, b2 = m2 ~ m0)

## hypr object containing 3 null hypotheses:
## H0.int: 0 = (m0 + m1 + m2)/3 (Intercept)
## H0.b1: 0 = m1 - m0
## H0.b2: 0 = m2 - m0
##
## Call:
## hypr(int = ~1/3 * m0 + 1/3 * m1 + 1/3 * m2, b1 = ~m1 - m0, b2 = ~m2 -
##     m0, levels = c("m0", "m1", "m2"))
##
## Hypothesis matrix (transposed):
##    int b1 b2
## m0 1/3 -1 -1
## m1 1/3  1  0
## m2 1/3  0  1
##
## Contrast matrix:
##    int b1   b2
## m0  1  -1/3 -1/3
## m1  1   2/3 -1/3
## m2  1  -1/3  2/3

The last two columns of the resulting contrast matrix are now the same as when the intercept
was ignored, which confirms that the two columns encode the same comparison.

8.5 Computing condition means from estimated contrasts

As mentioned earlier, one advantage of Bayesian modeling is that based on the posterior
samples, it is possible to very flexibly compute new comparisons and estimates. Above (see
section 8.1.4), we had discussed the case where the Bayesian model estimated the condition
means instead of contrasts by removing the intercept from the brms model (the formula in
brms was: DV ~ -1 + F ). This allowed us to get posterior samples from each condition

mean, and then to compute any possible comparison between condition means by subtracting
the corresponding samples.
Importantly, posterior samples for the condition means can also be obtained after fitting a
model with contrasts. We illustrate this here for the case of sum contrasts. Let’s use our above
example of a design where we assess response times (in milliseconds, DV ) for three different
word classes adjectives, nouns, and verbs, that is, for a 3-level factor F . In the above
example, factor F was coded using a sum contrast, where the first contrast coded the
difference of adjectives from the grand mean, and the second contrast coded the difference of
nouns from the grand mean. This was the corresponding contrast matrix:

contrasts(df_contrasts2$F) <- contr.hypothesis(HcSum)
contrasts(df_contrasts2$F)

## b1 b2

## adjectives 1 0
## nouns 0 1
## verbs -1 -1

We had estimated a brms model for these data. The posterior estimates show results for the
intercept (which is estimated to be 450 ms) and for our two coded comparisons. The effect
FcH01 codes our first comparison, of adjectives against the grand mean, and shows that
response times for adjectives are about 50 ms shorter than the grand mean. Moreover, the
effect FcH02 codes our second comparison, of nouns against the grand mean, and shows
that response times for nouns are about 50 ms longer than the grand mean.

fixef(fit_Sum)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    450.5      6.88 437.2 464.6
## FcH01        -49.4      9.92 -69.9 -30.2
## FcH02         49.1      9.83  29.1  69.3

Of course, other comparisons might be of interest to us as well. For example, we might be
interested in estimating how strongly response times for verbs differ from the grand mean.
To do so, one possible first step is to obtain the posteriors for the response times in each of
the three conditions. How can this be done? The first step is to again extract the posterior
samples from the model:


df_postSamp_Sum <- as_draws_df(fit_Sum)

We can see the samples for our first contrast ( b_FcH01 ) and for our second contrast
( b_FcH02 ). How can we now compute the posterior samples for each of the condition means,
i.e., for adjectives, nouns, and verbs? For this, we need to take another look at the contrast
matrix.


contrasts(df_contrasts2$F)

## b1 b2
## adjectives 1 0
## nouns 0 1
## verbs -1 -1

It tells us how the condition means are computed. For adjectives (see the first row of the
contrast matrix), the condition mean is computed by taking 1 times the coefficient for b1
(i.e., FcH01) and 0 times the coefficient for b2 (i.e., FcH02). The contrast matrix does not
show the intercept term, which is implicitly added, so we also have to add the estimates for
the intercept. The condition mean for adjectives is therefore computed as
b_adjectives <- b_Intercept + b_FcH01:


df_postSamp_Sum$b_adjectives <-
df_postSamp_Sum$b_Intercept + df_postSamp_Sum$b_FcH01

Similarly, we can obtain the posterior samples for the response times for nouns. The
computation can be seen from the second row of the contrast matrix, which shows that the
contrast b1 (i.e., FcH01) has weight 0, whereas the contrast b2 (i.e., FcH02) has weight 1.
Adding the intercept thus gives:

df_postSamp_Sum$b_nouns <-
df_postSamp_Sum$b_Intercept + df_postSamp_Sum$b_FcH02

Finally, we want to obtain posterior samples for the average response times for verbs. For
verbs, the third row of the contrast matrix shows a weight of −1 for both contrasts. Thus, the
contrasts b1 (i.e., FcH01) and b2 (i.e., FcH02) have to be subtracted from the intercept:


df_postSamp_Sum$b_verbs <-
df_postSamp_Sum$b_Intercept - df_postSamp_Sum$b_FcH01 -
df_postSamp_Sum$b_FcH02

This yields posterior samples for the mean response times for verbs.

We can now look at the posterior means and 95% credible intervals for adjectives, nouns, and
verbs by computing the means and quantiles across all computed samples.

postTab <- df_postSamp_Sum %>%
  # remove the meta-data:
  as.data.frame() %>%
  select(b_adjectives, b_nouns, b_verbs) %>%
  # transform from wide to long with tidyr:
  pivot_longer(
    cols = everything(),
    names_to = "condition",
    values_to = "samp"
  ) %>%
  group_by(condition) %>%
  summarize(
    post_mean = round(mean(samp)),
    `2.5%` = round(quantile(samp, p = 0.025)),
    `97.5%` = round(quantile(samp, p = 0.975))
  )

postTab

## # A tibble: 3 × 4
##   condition    post_mean `2.5%` `97.5%`
##   <chr>            <dbl>  <dbl>   <dbl>
## 1 b_adjectives       401    376     426
## 2 b_nouns            500    477     524
## 3 b_verbs            451    427     475

The results show that as expected the posterior mean for adjectives is 400 ms, for nouns it is
500 ms, and for verbs, the posterior mean is 450 ms. Moreover, we have now posterior
credible intervals for each of these estimates.

In fact, brms has a very convenient built-in function that allows us to compute these
condition means automatically (robust = FALSE shows the posterior mean; by default, brms
shows the posterior median). Notice that you need to add [] after the function call;
otherwise, brms will plot the results.

conditional_effects(fit_Sum, robust = FALSE)[]

## $F
##            F  DV cond__  effect1__ estimate__ se__ lower__ upper__
## 1 adjectives 450      1 adjectives        401 12.2     376     426
## 2      nouns 450      1      nouns        500 11.7     477     524
## 3      verbs 450      1      verbs        451 12.1     427     475

The same function allows us to visualize the effects, as shown in Figure 8.5.


conditional_effects(fit_Sum, robust = FALSE)



FIGURE 8.5: Estimated condition means, computed from a brms model fitted with a sum
contrast.
Coming back to our hand-crafted computations, the posterior samples can be used to
compute additional comparisons. For example, we might be interested in how much response
times for verbs differ from the grand mean. This can be computed based on the samples for
the condition means: we first compute the grand mean from the three condition means, b_GM
<- (b_adjectives + b_nouns + b_verbs)/3 , and then we compare this to the estimate for

verbs.

df_postSamp_Sum <- df_postSamp_Sum %>%
  mutate(GM = (b_adjectives + b_nouns + b_verbs) / 3,
         b_FcH03 = b_verbs - GM)
c(post_mean = mean(df_postSamp_Sum$b_FcH03),
  quantile(df_postSamp_Sum$b_FcH03, p = c(0.025, 0.975)))

## post_mean      2.5%     97.5%
##     0.273   -18.628    20.336

The results show that reading times for verbs are essentially the same as the grand mean,
with a posterior mean estimate for the difference of nearly 0 ms, and with a 95% credible
interval ranging between −20 and +20 ms.

The key message here is that based on the contrast matrix, it is possible to compute posterior
samples for the condition means, and then to compute any arbitrary further comparisons or
contrasts. We want to stress again that just obtaining the posterior distribution of a comparison
does not allow us to argue that we have evidence for the effect; to argue that we have
evidence for an effect being present/absent, we need Bayes factors. But the approach we
outline above does allow us to obtain posterior means and credible intervals for arbitrary
comparisons.
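
Rather than assembling each condition mean separately, the same computation can be done in
one step with matrix algebra: the implied condition means are the posterior draws of the
coefficients multiplied by the transposed design matrix (the contrast matrix augmented with a
column of ones for the intercept). A compact sketch:

# condition means for all draws at once: mu = beta %*% t(X)
X <- cbind(Intercept = 1, contrasts(df_contrasts2$F))
beta <- as.matrix(as.data.frame(df_postSamp_Sum)[
  , c("b_Intercept", "b_FcH01", "b_FcH02")])
mu <- beta %*% t(X)  # one column per condition
round(colMeans(mu))  # posterior means for adjectives, nouns, and verbs
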

We briefly show how to compute posterior samples for condition means for one more example
contrast, namely for repeated contrasts. Here, the contrast matrix is:

contrasts(df_contrasts3$F) <- contr.hypothesis(HcRep)
contrasts(df_contrasts3$F)

##    c2vs1 c3vs2 c4vs3
## F1 -0.75  -0.5 -0.25
## F2  0.25  -0.5 -0.25
## F3  0.25   0.5 -0.25
## F4  0.25   0.5  0.75

The model estimates were:

fixef(fit_Rep)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    20.01      2.40  15.4 24.72
## Fc2vs1       10.09      6.87  -3.3 23.11
## Fc3vs2       -9.67      6.80 -22.9  3.74
## Fc4vs3       29.29      6.74  15.7 42.49

We first obtain the posterior samples for the contrasts:

df_postSamp_Rep <- as.data.frame(fit_Rep)

Then we compute the posterior samples for condition F1 . First, we have to add the intercept.
Then, we can see in the contrast matrix that to compute the condition mean for F1 , we have
to add up all contrasts, using the weights c(-3/4, -1/2, -1/4) for each of the three contrasts
(see first row of the contrast matrix). Thus, the posterior samples are computed as follows:

df_postSamp_Rep <- df_postSamp_Rep %>%
  mutate(b_F1 = b_Intercept +
           -3 / 4 * b_Fc2vs1 +
           -1 / 2 * b_Fc3vs2 +
           -1 / 4 * b_Fc4vs3)

The other condition means are computed correspondingly:

df_postSamp_Rep <- df_postSamp_Rep %>%
  mutate(b_F2 = b_Intercept +
           1 / 4 * b_Fc2vs1 +
           -1 / 2 * b_Fc3vs2 +
           -1 / 4 * b_Fc4vs3,
         b_F3 = b_Intercept +
           1 / 4 * b_Fc2vs1 +
           1 / 2 * b_Fc3vs2 +
           -1 / 4 * b_Fc4vs3,
         b_F4 = b_Intercept +
           1 / 4 * b_Fc2vs1 +
           1 / 2 * b_Fc3vs2 +
           3 / 4 * b_Fc4vs3)

Now we can look at the posterior means and credible intervals:

postTab <- df_postSamp_Rep %>%
  select(b_F1, b_F2, b_F3, b_F4) %>%
  pivot_longer(
    cols = everything(),
    names_to = "condition",
    values_to = "samp"
  ) %>%
  group_by(condition) %>%
  summarize(
    post_mean = round(mean(samp)),
    `2.5%` = round(quantile(samp, p = 0.025)),
    `97.5%` = round(quantile(samp, p = 0.975))
  )

print(postTab, n = 4)

## # A tibble: 4 × 4
##   condition post_mean `2.5%` `97.5%`
##   <chr>         <dbl>  <dbl>   <dbl>
## 1 b_F1             10      0      20
## 2 b_F2             20     10      30
## 3 b_F3             10      1      20
## 4 b_F4             40     30      49

We can verify that the brms function returns the same values:

conditional_effects(fit_Rep, robust = FALSE)[]

## $F
##    F DV cond__ effect1__ estimate__ se__ lower__ upper__
## 1 F1 20      1        F1       9.96 4.93   0.252    19.7
## 2 F2 20      1        F2      20.05 4.90  10.390    29.9
## 3 F3 20      1        F3      10.38 4.76   0.966    19.8
## 4 F4 20      1        F4      39.67 4.79  30.012    49.2

The posterior means reflect exactly the means in the data (for comparison see Figure 8.2 and
Table 8.3). We now have posterior samples for each of the conditions and can compute
posterior credible intervals as well as new comparisons between conditions.
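
As one further illustration of such a new comparison (a sketch that is not part of the original
analysis), we can estimate how condition F4 differs from the average of the other three
conditions, using the samples just computed:

df_postSamp_Rep <- df_postSamp_Rep %>%
  mutate(b_F4vsRest = b_F4 - (b_F1 + b_F2 + b_F3) / 3)
c(post_mean = mean(df_postSamp_Rep$b_F4vsRest),
  quantile(df_postSamp_Rep$b_F4vsRest, p = c(0.025, 0.975)))
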

8.6 Summary

Contrasts in Bayesian models work in exactly the same way as in frequentist models.
Contrasts provide a way to tell the model how to code factors into numeric covariates. That is,
they provide a way to define which comparisons between which condition means or bundles of
condition means should be estimated in the Bayesian model. There are a number of default
contrasts, like treatment contrasts, sum contrasts, repeated contrasts, or Helmert contrasts,
that estimate specific comparisons between condition means. A much more powerful
procedure is to use the generalized matrix inverse, e.g., as implemented in the hypr
package, to derive contrasts automatically after specifying the comparisons that a contrast
should estimate. We have seen that in Bayesian models, it is quite straightforward to compute
posterior samples for new contrasts post-hoc, after the model is fit. However, specifying
precise contrasts is still of key importance when doing model comparisons (via Bayes factors)
to answer the question of whether the data provide evidence for an effect of interest. If the
effect of interest relates to a factor, then it has to be defined using contrast coding.

8.7 Further reading

A good discussion on contrast coding appears in Chapter 15 of Baguley (2012). A book-length
treatment is by Rosenthal, Rosnow, and Rubin (2000). A brief discussion on contrast coding
appears in Venables and Ripley (2002).

8.8 Exercises

Exercise 8.1 Contrast coding for a four-condition design

Load the following data. These data are from Experiment 1 in a set of reading studies on
Persian (Safavi, Husain, and Vasishth 2016). This is a self-paced reading study on particle-
verb constructions, with a 2 × 2 design: distance (short, long) and predictability (predictable,
unpredictable). The data are from a critical region in the sentence. All the data from the Safavi,
Husain, and Vasishth (2016) paper are available from
https://github.com/vasishth/SafaviEtAl2016.


library(bcogsci)

data("df_persianE1")
dat1 <- df_persianE1
head(dat1)

##     subj item   rt distance   predability
## 60     4    6  568    short   predictable
## 94     4   17  517     long unpredictable
## 146    4   22  675    short   predictable
## 185    4    5  575     long unpredictable
## 215    4    3  581     long   predictable
## 285    4    7 1171     long   predictable

The four conditions are:

Distance=short and Predictability=unpredictable
Distance=short and Predictability=predictable
Distance=long and Predictability=unpredictable
Distance=long and Predictability=predictable

The researcher wants to do the following sets of comparisons between condition means:

Compare the condition labeled Distance=short and Predictability=unpredictable with each of
the following conditions:

Distance=short and Predictability=predictable
Distance=long and Predictability=unpredictable
Distance=long and Predictability=predictable

Questions:

Which contrast coding is needed for such a comparison?
First, define the relevant contrast coding. Hint: You can do it by creating a condition
column labeled a, b, c, d and then using a built-in contrast coding function.
Then, use the hypr library function to confirm that your contrast coding actually does the
comparison you need.
Fit a simple linear model with the above contrast coding and display the slopes, which
constitute the relevant comparisons.
Now, compute each of the four conditions’ means and check that the slopes from the
linear model correspond to the relevant differences between means that you obtained
from the data.

Exercise 8.2 Helmert coding for a four-condition design.

Load the following data:


library(bcogsci)

data("df_polarity")
head(df_polarity)

##   subject item condition times value
## 1       1    6         f   SFD   328
## 2       1   24         f   SFD   206
## 3       1   35         e   SFD   315
## 4       1   17         e   SFD   265
## 5       1   34         d   SFD   252
## 6       1    7         a   SFD   156

The data come from an eyetracking study in German reported in Vasishth et al. (2008). The
experiment is a reading study involving six conditions. The sentences shown here are in
English, but the original design involved German sentences. In German, the word durchaus
(certainly) is a positive polarity item: in the constructions used in this experiment, durchaus
cannot have a c-commanding element that is a negative polarity item licensor. Here are the
conditions:

Negative polarity items
a. Grammatical: No man who had a beard was ever thrifty.
b. Ungrammatical (Intrusive NPI licensor): A man who had no beard was ever thrifty.
c. Ungrammatical: A man who had a beard was ever thrifty.
Positive polarity items
d. Ungrammatical: No man who had a beard was certainly thrifty.
e. Grammatical (Intrusive NPI licensor): A man who had no beard was certainly thrifty.
f. Grammatical: A man who had a beard was certainly thrifty.

We will focus only on re-reading time in this data set. Subset the data so that we only have re-
reading times in the data frame:

dat2 <- subset(df_polarity, times == "RRT")
head(dat2)

##      subject item condition times value
## 6365       1   20         b   RRT   240
## 6366       1    3         c   RRT  1866
## 6367       1   13         a   RRT   530
## 6368       1   19         a   RRT   269
## 6369       1   27         c   RRT   845
## 6370       1   26         b   RRT   635

The comparisons we are interested in are:

What is the difference in reading time between negative polarity items and positive
polarity items?
Within negative polarity items, what is the difference between grammatical and
ungrammatical conditions?
Within negative polarity items, what is the difference between the two ungrammatical
conditions?
Within positive polarity items, what is the difference between grammatical and
ungrammatical conditions?
Within positive polarity items, what is the difference between the two grammatical
conditions?
Use the hypr package to specify the comparisons specified above, and then extract the
contrast matrix. Finally, specify the contrasts to the condition column in the data frame. Fit a
linear model using this contrast specification, and then check that the estimates from the
model match the mean differences between the conditions being compared.

Exercise 8.3 Number of possible comparisons in a single model.

How many comparisons can one make in a single model when there is a single factor with
four levels? Why can we not code four comparisons in a single model?
How many comparisons can one code in a model where there are two factors, one with
three levels and one with two levels?
How about a model for a 2 × 2 × 3 design?

References

Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral
Sciences. Macmillan International Higher Education.

Bolker, Ben. 2018. https://github.com/bbolker/mixedmodels-misc/blob/master/notes/contrasts.rmd.

Bürkner, Paul-Christian, and Emmanuel Charpentier. 2020. “Modelling Monotonic Effects of
Ordinal Predictors in Bayesian Regression Models.” British Journal of Mathematical and
Statistical Psychology. Wiley Online Library.

Dobson, Annette J, and Adrian Barnett. 2011. An Introduction to Generalized Linear Models.
CRC press.

Friendly, Michael, John Fox, and Phil Chalmers. 2020. Matlib: Matrix Functions for Teaching
and Learning Linear Algebra and Multivariate Statistics.
https://CRAN.R-project.org/package=matlib.

Heister, Julian, Kay-Michael Würzner, and Reinhold Kliegl. 2012. “Analysing Large Datasets of
Eye Movements During Reading.” Visual Word Recognition 2: 102–30.

Rabe, Maximilian M., Shravan Vasishth, Sven Hohenstein, Reinhold Kliegl, and Daniel J
Schad. 2020b. “Hypr: An R Package for Hypothesis-Driven Contrast Coding.” Journal of Open
Source Software 5 (48): 2134.

Ripley, Brian. 2019. MASS: Support Functions and Datasets for Venables and Ripley’s MASS.
https://CRAN.R-project.org/package=MASS.

Rosenthal, Robert, Ralph L Rosnow, and Donald B Rubin. 2000. Contrasts and Effect Sizes in
Behavioral Research: A Correlational Approach. Cambridge University Press.

Safavi, Molood Sadat, Samar Husain, and Shravan Vasishth. 2016. “Dependency Resolution
Difficulty Increases with Distance in Persian Separable Complex Predicates: Implications for
Expectation and Memory-Based Accounts.” Frontiers in Psychology 7 (403).

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to
Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and
Language 110. https://doi.org/10.1016/j.jml.2019.104038.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2020. “How to
Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and
Language 110: 104038.

Vasishth, Shravan, Sven Bruessow, Richard L. Lewis, and Heiner Drenhaus. 2008.
“Processing Polarity: How the Ungrammatical Intrudes on the Grammatical.” Cognitive
Science 32 (4): 685–712.

Vasishth, Shravan, Daniel J. Schad, Audrey Bürki, and Reinhold Kliegl. 2021. Linear Mixed
Models for Linguistics and Psychology: A Comprehensive Introduction. CRC Press.
https://vasishth.github.io/Freq_CogSci/.

Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S-PLUS. New
York: Springer.

26. The reason for this is that mathematically, individual comparisons in the hypothesis matrix
are coded as rows rather than as columns (see Schad et al. 2020).↩

27. At this point, there is no need to understand in detail what this means. We refer the
interested reader to Schad et al. (2020). For a quick overview, we recommend a vignette
explaining the generalized inverse in the matlib package (Friendly, Fox, and Chalmers
2020).↩

28. The function fractions() from the MASS package is used to make the output more easily
readable, and the function ginv2() is used to keep row and column names.↩

Chapter 9 Contrast coding for designs with two predictor variables

Chapter 8 provides a basic introduction into contrast coding in situations where there is one
predictor variable, i.e., one factor, for which its effect can be estimated using a specified
contrast matrix. Here, we will investigate how contrast coding generalizes to situations where
there is more than one predictor variable. This could either be a situation where two factors
are present or where one factor is paired with a continuous predictor variable, i.e., a covariate.
We first discuss contrast coding for the case of two factors (for 2 × 2 designs; section 9.1)
and then go on to investigate situations where one predictor is a factor and the other predictor
is a covariate (section 9.2). Moreover, one problem in the analysis of interactions occurs in
situations where the model is not linear, but has some non-linear link function, such as e.g., in
logistic models or when assuming a log-normally distributed dependent variable. In these
situations, the model makes predictions for each condition (i.e., design cell) at the latent level
of the linear model. Sometimes it is important to translate these model predictions to the level
of the observations (e.g., to probabilities in a logistic regression model). We will discuss how
this can be implemented in section 9.3. We begin by treating contrast coding in a factorial
2 × 2 design.

9.1 Contrast coding in a factorial 2 × 2 design

In chapter 8 in section 8.3, we used a data set with one 4-level factor. Here, we assume that
the same four means come from an A(2) × B(2) between-subject-factor design rather than
an F(4) between-subject-factor design. Load the simulated data and show summary statistics
in Table 9.1 and in Figure 9.1. The means and standard deviations are exactly the same as in
Figure 8.2 and in Table 8.3.

FIGURE 9.1: Means and error bars (showing standard errors) for a simulated data set with a
two-by-two between-subjects factorial design.
TABLE 9.1: Summary statistics per condition for the simulated data.

Factor A  Factor B  N data  Means  Std. dev.  Std. errors
A1        B1        5       10     10         4.5
A1        B2        5       20     10         4.5
A2        B1        5       10     10         4.5
A2        B2        5       40     10         4.5

In order to carry out a 2 × 2 ANOVA-type (main effects and interaction) analysis, one needs
sum contrasts in the linear model. (This is true for factors with two levels, but does not
generalize to factors with more levels.) The results of such an analysis are shown in Table 9.2.

# define sum contrasts:
contrasts(df_contrasts4$A) <- contr.sum(2)
contrasts(df_contrasts4$B) <- contr.sum(2)

# Bayesian LM
fit_AB.sum <- brm(DV ~ 1 + A * B,
  data = df_contrasts4,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

TABLE 9.2: Bayesian linear model of a 2 x 2 design with sum contrasts.

Predictor  Estimate  Est. Error  2.5%    97.5%
Intercept  20.05     2.53        15.15   24.98
A1         −4.97     2.46        −9.77   −0.25
B1         −10.03    2.55        −15.20  −4.91
A1:B1      4.96      2.38        0.23    9.72

Next, we reproduce the A(2) × B(2) ANOVA with contrasts specified for the corresponding one-way F(4) ANOVA, that is, by treating the 2 × 2 = 4 condition means as four levels of a single factor F. In other words, we go back to the data frame simulated for the analysis of repeated contrasts (see chapter 8, section 8.3). We first define weights for the condition means according to our hypotheses, invert this matrix, and use it as the contrast matrix for factor F. We define weights of 1/4 and −1/4. We do so because (a) we want to compare the mean of two conditions to the mean of two other conditions (e.g., factor A compares (F1 + F2)/2 to (F3 + F4)/2). Moreover, (b) we want coefficients to code half the difference between condition means, reflecting sum contrasts. Together (a + b), this yields weights of 1/2 ⋅ 1/2 = 1/4. The resulting contrast matrix contains contrast coefficients of +1 or −1, showing that we successfully implemented sum contrasts. The results are identical to the previous models.

t(fractions(HcInt <- rbind(
  A   = c(F1 = 1 / 4, F2 = 1 / 4, F3 = -1 / 4, F4 = -1 / 4),
  B   = c(F1 = 1 / 4, F2 = -1 / 4, F3 = 1 / 4, F4 = -1 / 4),
  AxB = c(F1 = 1 / 4, F2 = -1 / 4, F3 = -1 / 4, F4 = 1 / 4)
)))

## A B AxB
## F1 1/4 1/4 1/4
## F2 1/4 -1/4 -1/4
## F3 -1/4 1/4 -1/4
## F4 -1/4 -1/4 1/4

(XcInt <- ginv2(HcInt))

##     A  B AxB
## F1  1  1   1
## F2  1 -1  -1
## F3 -1  1  -1
## F4 -1 -1   1

contrasts(df_contrasts3$F) <- XcInt

fit_F4.sum <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_F4.sum)

##           Estimate Est.Error   Q2.5  Q97.5
## Intercept    19.99      2.54  14.97  25.10
## FA           -4.93      2.47  -9.78   0.04
## FB           -9.96      2.57 -14.92  -4.77
## FAxB          5.01      2.46   0.15   9.92

This shows that it is possible to specify the contrasts not only for each factor separately (e.g., here in the 2 × 2 design); one can also pool all experimental conditions (or design cells) into one large factor (here, factor F with 4 levels) and specify the contrasts for the main effects and for the interaction in the resulting single large contrast matrix simultaneously.

In this approach, it can again be very useful to apply the hypr package to construct contrasts
for a 2 × 2 design. The first parameter estimates the main effect A , i.e., it compares the
average of F1 and F2 to the average of F3 and F4 . The second parameter estimates the
main effect B , i.e., it compares the average of F1 and F3 to the average of F2 and F4 . We
code direct differences between the averages, i.e., we implement scaled sum contrasts
instead of sum contrasts. This is shown below: the contrast matrix contains coefficients of
+1/2 and −1/2 instead of +1 and −1 . The interaction term estimates the difference
between differences, i.e., the difference between F1 − F2 and F3 − F4 .

hAxB <- hypr(
A = (F1 + F2) / 2 ~ (F3 + F4) / 2,
B = (F1 + F3) / 2 ~ (F2 + F4) / 2,
AxB = (F1 - F2) ~ (F3 - F4)
)

hAxB

## hypr object containing 3 null hypotheses:
## H0.A: 0 = (F1 + F2 - F3 - F4)/2
## H0.B: 0 = (F1 + F3 - F2 - F4)/2
## H0.AxB: 0 = F1 - F2 - F3 + F4
##
## Call:
## hypr(A = ~1/2 * F1 + 1/2 * F2 - 1/2 * F3 - 1/2 * F4, B = ~1/2 *
##     F1 + 1/2 * F3 - 1/2 * F2 - 1/2 * F4, AxB = ~F1 - F2 - F3 +
##     F4, levels = c("F1", "F2", "F3", "F4"))
##
## Hypothesis matrix (transposed):
##    A    B    AxB
## F1  1/2  1/2   1
## F2  1/2 -1/2  -1
## F3 -1/2  1/2  -1
## F4 -1/2 -1/2   1
##
## Contrast matrix:
##    A    B    AxB
## F1  1/2  1/2  1/4
## F2  1/2 -1/2 -1/4
## F3 -1/2  1/2 -1/4
## F4 -1/2 -1/2  1/4

contrasts(df_contrasts3$F) <- contr.hypothesis(hAxB)

fit_F4hypr <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_F4hypr)

##           Estimate Est.Error   Q2.5  Q97.5
## Intercept    20.00      2.40  15.18  24.76
## FA           -9.81      4.91 -19.64  -0.22
## FB          -19.91      4.87 -29.77 -10.18
## FAxB         19.27      9.54   0.44  37.17

The results show that the estimates for the main effects are twice as large as those from the sum contrasts; this is the result of the scaling that we applied. That is, the main effects now directly estimate the difference between averages, and the interaction estimates the difference between differences. Nevertheless, if adequate priors are used, both contrast codings would lead to the same hypothesis tests if one were doing hypothesis testing using Bayes factors. Thus, the hypr package can be used to code hypotheses in a 2 × 2 design.

An alternative way to code main effects and interactions is to use the ifelse command in R.
For example, if we want to use ±1 sum contrasts in the above example, we can specify the
contrasts for the main effects as vectors:

A <- ifelse(df_contrasts3$F %in% c("F1", "F2"), -1, 1)
B <- ifelse(df_contrasts3$F %in% c("F1", "F3"), -1, 1)

Now, defining the interaction is simply a matter of multiplying the two vectors:

AxB <- A * B

An alternative is to use ±1/2 coding when using this approach:

A <- ifelse(df_contrasts3$F %in% c("F1", "F2"), -1 / 2, 1 / 2)
B <- ifelse(df_contrasts3$F %in% c("F1", "F3"), -1 / 2, 1 / 2)
## interaction:
(AxB <- A * B)

## [1] 0.25 0.25 0.25 0.25 0.25 -0.25 -0.25 -0.25 -0.25 -0.25 -0.25
## [12] -0.25 -0.25 -0.25 -0.25 0.25 0.25 0.25 0.25 0.25

Now the main effects and interaction can be directly interpreted as differences between averages and as differences between differences. If one wants the interaction term to be on the same scale as the main effects, it would need to be multiplied by 2; however, its interpretation would then no longer be a straightforward difference between differences, but a scaled variant of this.

## rescale:
(AxB <- A * B * 2)

## [1] 0.5 0.5 0.5 0.5 0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5
## [14] -0.5 -0.5 0.5 0.5 0.5 0.5 0.5

This kind of vector-based contrast coding is convenient for more complex designs, such as
2 × 2 × 2 factorial designs.
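As a sketch of how this extends, consider a hypothetical, fully balanced 2 × 2 × 2 design (the data frame and factor names below are invented purely for illustration): the three main-effect vectors and all their products yield the full set of ANOVA-style predictors.

# hypothetical balanced 2 x 2 x 2 design (illustration only):
df8 <- expand.grid(A = c("A1", "A2"),
                   B = c("B1", "B2"),
                   C = c("C1", "C2"))
A <- ifelse(df8$A == "A1", -1, 1)
B <- ifelse(df8$B == "B1", -1, 1)
C <- ifelse(df8$C == "C1", -1, 1)
# two-way interactions are products of the main-effect vectors:
AxB <- A * B
AxC <- A * C
BxC <- B * C
# the three-way interaction is the triple product:
AxBxC <- A * B * C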

9.1.1 Nested effects

One can estimate effects that do not correspond directly to the main effects and interaction of the traditional ANOVA. For example, in a 2 × 2 experimental design where factor A codes word frequency (low/high) and factor B codes part of speech (noun/verb), one can estimate the effect of word frequency within nouns and the effect of word frequency within verbs. Formally, the comparisons of A within B1 and of A within B2 are nested within the levels of B. Said differently, simple effects of factor A are estimated for each of the levels of factor B. In this version, we estimate the main effect of part of speech (B; as in traditional ANOVA). Instead of also estimating the second main effect (word frequency, A) and the interaction, we estimate (1) whether the two levels of word frequency, A, differ for the first level of B (i.e., nouns), and (2) whether the two levels of word frequency, A, differ for the second level of B (i.e., verbs). In other words, we estimate whether there are differences for A in each of the levels of B. Often researchers have hypotheses about these differences, and not about the interaction.

t(fractions(HcNes <- rbind(
  B    = c(F1 = 1 / 2, F2 = -1 / 2, F3 = 1 / 2, F4 = -1 / 2),
  B1xA = c(F1 = -1, F2 = 0, F3 = 1, F4 = 0),
  B2xA = c(F1 = 0, F2 = -1, F3 = 0, F4 = 1)
)))

##    B    B1xA B2xA
## F1  1/2   -1    0
## F2 -1/2    0   -1
## F3  1/2    1    0
## F4 -1/2    0    1

(XcNes <- ginv2(HcNes))

##    B    B1xA B2xA
## F1  1/2 -1/2    0
## F2 -1/2    0 -1/2
## F3  1/2  1/2    0
## F4 -1/2    0  1/2

contrasts(df_contrasts3$F) <- XcNes


fit_Nest <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Nest)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.97      2.44  15.13  24.8
## FB          -19.86      5.04 -29.65 -10.2
## FB1xA         0.01      6.76 -13.27  13.4
## FB2xA        19.67      6.70   6.39  33.2

The regression coefficients estimate the grand mean, the difference for the main effect of part
of speech (B) and the two differences (for A ; i.e., simple main effects) within the two levels
(noun and verb) of part of speech (B).

These custom nested contrasts’ columns are scaled versions of the corresponding hypothesis
matrix. This is the case because the columns are orthogonal. It illustrates the advantage of
orthogonal contrasts for the interpretation of regression coefficients: the underlying
comparisons being estimated are already clear from the contrast matrix.
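This orthogonality can be checked numerically: for orthogonal columns, the cross-product of the contrast matrix with itself is diagonal (a quick check, assuming XcNes as defined above):

# for orthogonal columns, all off-diagonal entries of t(XcNes) %*% XcNes
# are zero; crossprod() computes exactly this product:
zapsmall(crossprod(XcNes))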

There is also a built-in R formula specification for nested designs. The order of factors in the formula from left to right specifies a top-down order of nesting within levels; i.e., here factor A (word frequency) is nested within levels of the factor B (part of speech). This yields the same result as our previous result based on custom nested contrasts:

contrasts(df_contrasts4$A) <- c(-0.5, +0.5)
contrasts(df_contrasts4$B) <- c(+0.5, -0.5)
fit_Nest2 <- brm(DV ~ 1 + B / A,
  data = df_contrasts4,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Nest2)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.97      2.50  15.02  24.9
## B1          -19.86      4.85 -29.69 -10.5
## BB1:A1        0.13      6.74 -13.32  13.6
## BB2:A1       19.51      6.75   6.38  32.8

In cases such as these, where the comparisons of A are nested within levels of B, it is necessary to include the effect of B (part of speech) in the model, even if one is only interested in the effect of A (word frequency) within levels of B (part of speech). Leaving out factor B in this case would increase posterior uncertainty for fully balanced data, and can lead to biases in parameter estimation when the data are not fully balanced.

Again, we show how nested contrasts can be easily implemented using hypr:

hNest <- hypr(
B = (F1 + F3) / 2 ~ (F2 + F4) / 2,
B1xA = F3 ~ F1,
B2xA = F4 ~ F2
)

hNest

## hypr object containing 3 null hypotheses:
## H0.B: 0 = (F1 + F3 - F2 - F4)/2
## H0.B1xA: 0 = F3 - F1
## H0.B2xA: 0 = F4 - F2
##
## Call:
## hypr(B = ~1/2 * F1 + 1/2 * F3 - 1/2 * F2 - 1/2 * F4, B1xA = ~F3 -
##     F1, B2xA = ~F4 - F2, levels = c("F1", "F2", "F3", "F4"))
##
## Hypothesis matrix (transposed):
##    B    B1xA B2xA
## F1  1/2   -1    0
## F2 -1/2    0   -1
## F3  1/2    1    0
## F4 -1/2    0    1
##
## Contrast matrix:
##    B    B1xA B2xA
## F1  1/2 -1/2    0
## F2 -1/2    0 -1/2
## F3  1/2  1/2    0
## F4 -1/2    0  1/2

contrasts(df_contrasts3$F) <- contr.hypothesis(hNest)

fit_NestHypr <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_NestHypr)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.89      2.47  15.03  24.8
## FB          -19.78      4.82 -29.11 -10.1
## FB1xA        -0.03      6.91 -13.70  13.5
## FB2xA        19.60      6.98   5.93  33.8

Of course, we can also ask the reverse question: Are there differences for part of speech (B)
in the levels of word frequency (A; in addition to estimating the main effect of word frequency,
A )? That is, do nouns differ from verbs for low-frequency words (BA1 ) and do nouns differ
from verbs for high-frequency words (BA2 )?

hNest2 <- hypr(
  A = (F1 + F2) / 2 ~ (F3 + F4) / 2,
  A1xB = F2 ~ F1,
  A2xB = F4 ~ F3
)
hNest2
## hypr object containing 3 null hypotheses:
## H0.A: 0 = (F1 + F2 - F3 - F4)/2
## H0.A1xB: 0 = F2 - F1
## H0.A2xB: 0 = F4 - F3
##
## Call:
## hypr(A = ~1/2 * F1 + 1/2 * F2 - 1/2 * F3 - 1/2 * F4, A1xB = ~F2 -
##     F1, A2xB = ~F4 - F3, levels = c("F1", "F2", "F3", "F4"))
##
## Hypothesis matrix (transposed):
##    A    A1xB A2xB
## F1  1/2   -1    0
## F2  1/2    1    0
## F3 -1/2    0   -1
## F4 -1/2    0    1
##
## Contrast matrix:
##    A    A1xB A2xB
## F1  1/2 -1/2    0
## F2  1/2  1/2    0
## F3 -1/2    0 -1/2
## F4 -1/2    0  1/2

contrasts(df_contrasts3$F) <- contr.hypothesis(hNest2)

fit_Nest2Hypr <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Nest2Hypr)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.92      2.50  15.03 24.91
## FA           -9.74      5.08 -19.63  0.42
## FA1xB         9.78      6.97  -4.58 23.32
## FA2xB        29.35      7.20  14.97 44.09

Regression coefficients estimate the grand mean, the difference for the main effect of word
frequency (A) and the two part of speech effects (for B ; i.e., simple main effects) within levels
of word frequency (A).

9.1.2 Interactions between contrasts

In a 2 × 2 experimental design, the results from sum contrasts are equivalent to the typical ANOVA results that we see in frequentist analyses. This means that sum contrasts assess the main effects and the interaction. One interesting question that arises here is: what would happen in a 2 × 2 design if we had used treatment contrasts instead of sum contrasts? Is it still possible to meaningfully interpret the results from the treatment contrasts in a simple 2 × 2 design?

This leads us to a very important principle in interpreting results from contrasts: When
interactions between contrasts are included in a model, then the results for one contrast
actually depend on the specification of the other contrast(s) in the analysis! This may be
counter-intuitive at first, but it is very important and essential to keep in mind when interpreting
results from contrasts. How does this work in detail?

The general rule to remember is that the effect of one contrast measures its effect at the location 0 of the other contrast(s) in the analysis. This can be seen in the regression equation of a 2 × 2 design with factors A and B:

E[Y] = α + β_A · A + β_B · B + β_{A×B} · (A × B)

If we set the predictor B to zero, then the equation simplifies to:

E[Y] = α + β_A · A

Thus, now we can see the “pure” effect of A .

What does that mean practically? Let us consider an example in which we use two treatment contrasts in a 2 × 2 design. Here are the results from the linear model:

contrasts(df_contrasts4$A) <- c(0, 1)
contrasts(df_contrasts4$B) <- c(0, 1)

fit_treatm <- brm(DV ~ 1 + B * A,
  data = df_contrasts4,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_treatm)
##           Estimate Est.Error   Q2.5 Q97.5
## Intercept     9.88      5.00  -0.15  19.7
## B1           10.17      7.00  -3.96  24.4
## A1            0.27      7.05 -13.81  14.3
## B1:A1        19.52     10.01  -1.11  39.4

Let’s take a look at the effect of factor A. How can we interpret what this measures? This effect actually estimates the effect of factor A at the “location” where factor B is coded as 0. Factor B is coded as a treatment contrast, that is, it codes a zero at its baseline condition, which is B1. Thus, the effect of factor A estimates the effect of A nested within the baseline condition of B, i.e., a simple effect. Let us look at the data presented in Figure 9.1 to see what this nested effect should be. Figure 9.1 shows that the effect of factor A nested in B1 is 0. If we now compare this to the results from the linear model, it is indeed clear that the effect of factor A is estimated as exactly 0. As expected, when factor B is coded as a treatment contrast, the effect of factor A estimates the effect of A nested within the baseline level of factor B.

Next, consider the effect of factor B . According to the same logic, this effect estimates the
effect of factor B at the “location” where factor A is 0. Factor A is also coded as a treatment
contrast, that is, it codes its baseline condition A1 as 0. The effect of factor B estimates the
effect of B nested within the baseline condition of A . Figure 9.1 shows that this effect should
be 10 .

How do we know what the “location” is where a contrast applies? For the treatment contrasts discussed here, it is possible to reason this through because all contrasts are coded as 0 or 1. How can one derive the “location” in general? What we can do is to look at the comparisons that are estimated when using the treatment contrasts in the presence of an interaction between them (or, in case we use Bayes factors, which hypotheses are tested), by using the generalized matrix inverse. We go back to the default treatment contrasts. Then we extract the contrast matrix from the design matrix:

contrasts(df_contrasts4$A) <- contr.treatment(2)
contrasts(df_contrasts4$B) <- contr.treatment(2)
XcTr <- df_contrasts4 %>%
  group_by(A, B) %>%
  summarise() %>%
  model.matrix(~ 1 + A * B, .) %>%
  as.data.frame() %>%
  as.matrix()
rownames(XcTr) <- c("A1_B1", "A1_B2", "A2_B1", "A2_B2")
XcTr

## (Intercept) A2 B2 A2:B2
## A1_B1 1 0 0 0
## A1_B2 1 0 1 0
## A2_B1 1 1 0 0
## A2_B2 1 1 1 1

This shows the treatment contrasts for factors A and B, and their interaction. We can now assign this contrast matrix to a hypr object. hypr automatically converts the contrast matrix into a hypothesis matrix, such that we can read from the hypothesis matrix which comparisons are being estimated by the different contrasts.

htr <- hypr() # initialize empty hypr object
cmat(htr) <- XcTr # assign contrast matrix to hypr object
htr # look at the resulting hypothesis matrix
## hypr object containing 4 null hypotheses:
## H0.(Intercept): 0 = A1_B1 (Intercept)
## H0.A2: 0 = -A1_B1 + A2_B1
## H0.B2: 0 = -A1_B1 + A1_B2
## H0.A2:B2: 0 = A1_B1 - A1_B2 - A2_B1 + A2_B2
##
## Call:
## hypr(`(Intercept)` = ~A1_B1, A2 = ~-A1_B1 + A2_B1, B2 = ~-A1_B1 +
##     A1_B2, `A2:B2` = ~A1_B1 - A1_B2 - A2_B1 + A2_B2, levels = c("A1_B1",
##     "A1_B2", "A2_B1", "A2_B2"))
##
## Hypothesis matrix (transposed):
##       (Intercept) A2 B2 A2:B2
## A1_B1           1 -1 -1     1
## A1_B2           0  0  1    -1
## A2_B1           0  1  0    -1
## A2_B2           0  0  0     1
##
## Contrast matrix:
##       (Intercept) A2 B2 A2:B2
## A1_B1           1  0  0     0
## A1_B2           1  0  1     0
## A2_B1           1  1  0     0
## A2_B2           1  1  1     1

The same result is obtained by applying the generalized inverse to the contrast matrix (this is
what hypr does as well). An important fact is that when we apply the generalized inverse to the
contrast matrix, we obtain the corresponding hypothesis matrix (for details, see Schad et al.
2020).

t(ginv2(XcTr))

##       (Intercept) A2 B2 A2:B2
## A1_B1           1 -1 -1     1
## A1_B2           0  0  1    -1
## A2_B1           0  1  0    -1
## A2_B2           0  0  0     1

As discussed above, the effect of factor A estimates its effect nested within the baseline level
of factor B . Likewise, the effect of factor B estimates its effect nested within the baseline level
of factor A .

How does this work for sum contrasts? They do not have a baseline condition that is coded as
0 . In sum contrasts, the average of the contrast coefficients is 0. Therefore, effects estimate
the average effect across factor levels, i.e., they estimate a main effect. This is what is
typically also tested in standard ANOVA. Let’s look at the example shown in Table 9.2: given
that factor B has a sum contrast, the main effect of factor A is estimated as the average
across levels of factor B . Figure 9.1 shows that the effect of factor A in level B1 is
10 − 10 = 0 , and in level B2 it is 20 − 40 = −20 . The average effect across both levels is
(0 − 20)/2 = −10 . Due to the sum contrast coding, we have to divide this by two, yielding an
expected effect of −10/2 = −5 . This is exactly what the effect of factor A measures (see
Table 9.2, Estimate for A1 ).

Similarly, factor B estimates its effect at the location 0 of factor A. Again, 0 is exactly the mean of the contrast coefficients from factor A, which is coded as a sum contrast. Therefore, factor B estimates the effect of B averaged across factor levels of A, i.e., the main effect of B. For factor level A1, factor B has an effect of 10 − 20 = −10. For factor level A2, factor B has an effect of 10 − 40 = −30. The average effect is (−10 − 30)/2 = −20, which again needs to be divided by 2 due to the sum contrast. This yields exactly the estimate of −10 that is also reported in Table 9.2 (Estimate for B1).
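The arithmetic in the last two paragraphs can be verified directly in R from the four cell means in Table 9.1 (this is just a check on the hand computation, not a model fit):

# cell means from Table 9.1:
m <- c(A1_B1 = 10, A1_B2 = 20, A2_B1 = 10, A2_B2 = 40)
# simple effects of A within each level of B:
A_in_B1 <- unname(m["A1_B1"] - m["A2_B1"])  # 10 - 10 = 0
A_in_B2 <- unname(m["A1_B2"] - m["A2_B2"])  # 20 - 40 = -20
# main effect of A: average of the two simple effects, halved again
# because the sum contrast codes half the difference between means:
(A_main <- (A_in_B1 + A_in_B2) / 2 / 2)  # -5
# main effect of B, computed analogously:
B_in_A1 <- unname(m["A1_B1"] - m["A1_B2"])  # -10
B_in_A2 <- unname(m["A2_B1"] - m["A2_B2"])  # -30
(B_main <- (B_in_A1 + B_in_A2) / 2 / 2)  # -10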

Again, we look at the hypothesis matrix for the main effects and the interaction:

contrasts(df_contrasts4$A) <- contr.sum(2)
contrasts(df_contrasts4$B) <- contr.sum(2)
XcSum <- df_contrasts4 %>%
  group_by(A, B) %>%
  summarise() %>%
  model.matrix(~ 1 + A * B, .) %>%
  as.data.frame() %>%
  as.matrix()
rownames(XcSum) <- c("A1_B1", "A1_B2", "A2_B1", "A2_B2")

hsum <- hypr() # initialize empty hypr object
cmat(hsum) <- XcSum # assign contrast matrix to hypr object
hsum # look at the resulting hypothesis matrix
## hypr object containing 4 null hypotheses:
## H0.(Intercept): 0 = (A1_B1 + A1_B2 + A2_B1 + A2_B2)/4 (Intercept)
## H0.A1: 0 = (A1_B1 + A1_B2 - A2_B1 - A2_B2)/4
## H0.B1: 0 = (A1_B1 - A1_B2 + A2_B1 - A2_B2)/4
## H0.A1:B1: 0 = (A1_B1 - A1_B2 - A2_B1 + A2_B2)/4
##
## Call:
## hypr(`(Intercept)` = ~1/4 * A1_B1 + 1/4 * A1_B2 + 1/4 * A2_B1 +
##     1/4 * A2_B2, A1 = ~1/4 * A1_B1 + 1/4 * A1_B2 - 1/4 * A2_B1 -
##     1/4 * A2_B2, B1 = ~1/4 * A1_B1 - 1/4 * A1_B2 + 1/4 * A2_B1 -
##     1/4 * A2_B2, `A1:B1` = ~1/4 * A1_B1 - 1/4 * A1_B2 - 1/4 *
##     A2_B1 + 1/4 * A2_B2, levels = c("A1_B1", "A1_B2", "A2_B1",
##     "A2_B2"))
##
## Hypothesis matrix (transposed):
##       (Intercept)   A1   B1 A1:B1
## A1_B1         1/4  1/4  1/4   1/4
## A1_B2         1/4  1/4 -1/4  -1/4
## A2_B1         1/4 -1/4  1/4  -1/4
## A2_B2         1/4 -1/4 -1/4   1/4
##
## Contrast matrix:
##       (Intercept) A1 B1 A1:B1
## A1_B1           1  1  1     1
## A1_B2           1  1 -1    -1
## A2_B1           1 -1  1    -1
## A2_B2           1 -1 -1     1

This shows that the effects no longer compute nested comparisons; rather, each estimates its effect averaged across the conditions of the other factor. The averaging involves weights of 1/2. Moreover, the regression coefficients in the sum contrast measure half the distance between conditions, leading to weights of 1/2 ⋅ 1/2 = 1/4.

The general rule to remember from these examples is that when interactions between contrasts are estimated, what the effect of a factor estimates depends on the contrast coding of the other factors in the design! The effect of a factor estimates the effect nested within the location zero of the other contrast(s) in the analysis. If another contrast is centered, such that zero is the average of that contrast’s coefficients, then the contrast of interest estimates the average or main effect, averaged across the levels of the other factor. Importantly, this property holds only when the interaction between two contrasts is included in the model. If the interaction is omitted and only the main effects are estimated, then there is no such “action at a distance”.
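One way to see this “action at a distance” concretely is to refit the 2 × 2 example with and without the interaction term. The sketch below uses lm() rather than brm() purely because it is fast; the contrast logic is identical, and we do not show its output here:

# quick check of the "action at a distance" using lm():
contrasts(df_contrasts4$A) <- contr.treatment(2)
contrasts(df_contrasts4$B) <- contr.treatment(2)
# with the interaction, A2 estimates the simple effect of A within B1:
coef(lm(DV ~ 1 + A * B, data = df_contrasts4))
# without the interaction, A2 estimates an A difference averaged over
# the levels of B; there is no "action at a distance":
coef(lm(DV ~ 1 + A + B, data = df_contrasts4))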

This may be a very surprising result for interactions of contrasts. However, it is also essential
to interpreting contrast coefficients involved in interactions. It is particularly relevant for the
analysis of the default treatment contrast, where the main effects estimate nested effects
rather than average effects.

9.2 One factor and one covariate

9.2.1 Estimating a group difference and controlling for a covariate

In this section we treat the case where there are again two predictor variables for one
dependent variable, but where one predictor variable is a discrete factor, and the other is a
continuous covariate. Let’s assume we have measured some response time (RT), e.g., in a
lexical decision task. We want to predict the response time based on each subject’s IQ, and
we expect that higher IQ leads to shorter response times. Moreover, we have two groups of 30 subjects each. These are coded as factor F, with factor levels F1 and F2. We assume
that these two groups have obtained different training programs to optimize their response
times on the task. Group F1 obtained a control training, whereas group F2 obtained training
to improve lexical decisions. We want to estimate in how far the training for better lexical
decisions in group F2 leads to shorter response times compared to the control group F1 . This
is our main question of interest here, i.e., in how far the training program in F2 leads to faster
response times compared to the control group F1 . We load the data, which is a simulated
data set.

data("df_contrasts5")

Our main effect of interest is the factor F . We want to estimate its effect on response times
and code it using scaled sum contrasts, such that negative parameter estimates would yield
support for our hypothesis that response times are faster in the training group F2 :

(contrasts(df_contrasts5$F) <- c(-0.5, +0.5))

## [1] -0.5 0.5

We run a brms model to estimate the effect of factor F , i.e., how strongly the response times
in the two groups differ from each other.

fit_RT_F <- brm(RT ~ 1 + F,
  data = df_contrasts5,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_F)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    212.5      5.16 202.6 222.9
## F1           -24.1     10.06 -43.6  -3.3



FIGURE 9.2: Means and error bars (showing standard errors) for a simulated data set of response times for two groups of subjects, who obtained either a training in lexical decisions (F2) or a control training (F1).
We find (see model estimates and data shown in Figure 9.2) that response times in group F2

are roughly 25 ms faster than in group F1 (Estimate of −24 ). This suggests that as expected,
the training program that group F2 obtained seems to be successful in speeding up response
times. Recall that one cannot just look at the 95% credible interval and check whether zero is
outside the interval to declare that we have found an effect. To make a discovery claim, we
need to run a Bayes factor analysis on this data set to directly test this hypothesis, and this
may or may not provide evidence for a difference in response times between groups.

Let’s assume we have allocated subjects to the two groups randomly. Let’s say that we also
measured the IQ of each person using an IQ test. We did so because we expected that IQ
could have a strong influence on response times, and we wanted to control for this influence.
We now can check whether the two groups had the same average IQ.

df_contrasts5 %>%
group_by(F) %>%
summarize(M.IQ = mean(IQ))

## # A tibble: 2 × 2
## F M.IQ
## <fct> <dbl>
## 1 F1 85
## 2 F2 115
Group F2 not only obtained additional training and had faster response times; group F2 also had a higher IQ on average (mean of 115) than group F1 (mean IQ = 85). Thus, the random allocation of subjects to the two groups seems to have created, by chance, a difference in IQs. Now we can ask the question: why might response times in group F2 be faster than in group F1? Is this because of the training program in F2? Or is this simply because the average IQ in group F2 was higher than in group F1? To investigate this question, we add both predictor variables simultaneously in a brms model. Before we enter the continuous IQ variable, we center it by subtracting its mean. Centering covariates is generally good practice. Moreover, it is often important to z-transform the covariate, i.e., to not only subtract the mean but also to divide by its standard deviation (this can be done as follows: df_contrasts5$IQ.s <- scale(df_contrasts5$IQ)). The reason why this is often important is that the sampler doesn't work well if predictors have different scales. For the simple models we use here, the sampler works without z-transformation. However, for more realistic and more complex models, z-transformation of covariates is often very important.
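A minimal sketch of the z-transformation (here, as.numeric() is used only to drop the matrix attributes that scale() attaches to its output):

df_contrasts5$IQ.s <- as.numeric(scale(df_contrasts5$IQ))
# after z-scoring, the covariate has mean 0 and standard deviation 1:
c(mean = mean(df_contrasts5$IQ.s), sd = sd(df_contrasts5$IQ.s))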

df_contrasts5$IQ.c <- df_contrasts5$IQ - mean(df_contrasts5$IQ)

fit_RT_F_IQ <- brm(RT ~ 1 + F + IQ.c,
  data = df_contrasts5,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_F_IQ)

##           Estimate Est.Error   Q2.5  Q97.5
## Intercept   212.46      4.91 202.88 221.87
## F1            6.23     13.34 -20.32  32.26
## IQ.c         -1.05      0.32  -1.67  -0.42


The results from the brms model now show that the difference in response times between
groups (i.e., factor F ) is not estimated to be −25 ms any more, but instead, the estimate is
about +7 ms, and the 95% credible interval spans the range −20 to 33 . Thus, there doesn’t
seem to be much reason to believe any more that the groups would differ. At the same time,
we see that the predictor variable IQ shows a negative effect (Estimate = −1 with 95%
credible interval: −1.7 to −0.4 ), suggesting that–as expected–response times seem to be
faster in subjects with higher IQ.


FIGURE 9.3: Response times as a function of individual IQ for two groups with a lexical
decision training (F 2) versus a control training (F 1). Points indicate individual subjects, and
lines with error bands indicate linear regression lines.

This result can also be seen in Figure 9.3, which shows that response times decrease with
increasing IQ, as suggested by the brms model. However, the heights of the two regression
lines do not differ from each other, consistent with the observation in the brms model that the
effect of factor F did not seem to differ from zero. That is, factor F in the brms model
estimates the difference in height of the regression line between both groups. That the height
does not differ and the effect of F is estimated to be close to zero suggests that in fact group
F2 showed faster response times not because of their additional training program. Instead,
they had faster response times simply because their IQ was by chance higher on average
compared to the control group F1. This analysis is the Bayesian equivalent of the frequentist “analysis of covariance” (ANCOVA), where it is possible to estimate a group difference after “controlling for” the influence of a covariate.

We can also see in Figure 9.3 that the two regression lines for the two groups are exactly
parallel to each other. That is, the influence of IQ on response times seems to be exactly the
same in both groups. This is actually an assumption in the ANCOVA analysis that needs to be
checked in the data. That is, if we want to estimate the difference between groups after
controlling for a covariate (here IQ), we have to investigate whether the influence of the
covariate is the same in both groups. We can estimate this by including an interaction term
between the factor and the covariate in the brms model:

fit_RT_FxIQ <- brm(RT ~ 1 + F * IQ.c,
  data = df_contrasts5,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_FxIQ)

##           Estimate Est.Error   Q2.5  Q97.5
## Intercept   212.34      7.12 198.31 226.53
## F1            6.39     13.70 -20.59  33.63
## IQ.c         -1.06      0.33  -1.71  -0.41
## F1:IQ.c       0.01      0.67  -1.33   1.28

The estimate for the interaction (the term F1:IQ.c ) is very small here (close to 0) and the
95% credible intervals clearly overlap with zero, showing that the two regression lines are
estimated to be very similar, or parallel, to each other. If this is the case, then it is possible to
correct for IQ when estimating the group difference.

9.2.2 Estimating differences in slopes

We now take a look at a different data set.

data("df_contrasts6")
levels(df_contrasts6$F) <- c("simple", "complex")

This again contains data from response times (RT) in two groups. Let’s assume the two
groups have performed two different response time tasks, where one simple RT task doesn’t
rely on much cognitive processing (group “simple”), whereas the other task is more complex
and depends on complex cognitive operations (group “complex”). We therefore expect that
RTs in the simple task should be independent of IQ, whereas in the complex task, individuals
with a high IQ should be faster in responding compared to individuals with low IQ. Thus, our
primary hypothesis of interest states that the influence of IQ on RT differs between conditions.
This means that we are interested in the difference between slopes. A slope in a linear regression assesses how strongly the dependent variable (here, RT) changes with an increase of one unit on the covariate (here, IQ); it thus assesses how “steep” the regression line is. Our hypothesis thus states that the regression lines differ between groups.


FIGURE 9.4: Response times as a function of individual IQ for two groups performing a simple
versus a complex task. Points indicate individual subjects’ responses, and lines with error
bands show the fitted linear regression lines.

The results, displayed in Figure 9.4, suggest that the data seem to support our hypothesis. For
the subjects performing the complex task, response times seem to decrease with increasing
IQ, whereas for subjects performing the simple task, response times seem to be independent
of IQ. As stated before, our primary hypothesis relates to the difference in slopes. Statistically
speaking, this is assessed in the interaction between the factor and the covariate. Thus, we
run a brms model where the interaction is included. Importantly, we first use scaled sum
contrasts for the group effect, and again center the covariate IQ.

contrasts(df_contrasts6$F) <- c(-0.5, +0.5)
df_contrasts6$IQ.c <- df_contrasts6$IQ - mean(df_contrasts6$IQ)
fit_RT_FxIQ2 <- brm(RT ~ 1 + F * IQ.c,
  data = df_contrasts6,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_FxIQ2)

##           Estimate Est.Error   Q2.5  Q97.5
## Intercept   209.92      4.84 200.51 219.55
## F1           19.33      9.70  -0.59  38.76
## IQ.c         -0.81      0.33  -1.45  -0.15
## F1:IQ.c      -1.59      0.69  -2.92  -0.24

We can see that the main effect of IQ (term IQ.c ) is negative (−0.8) with 95% credible
intervals −1.5 to −0.2 , suggesting that overall response times decrease with increasing IQ.
This is qualified by the interaction term, which is estimated to be negative (−1.6), with 95%
credible intervals −2.9 to −0.3 . This suggests that the slope in the complex group (which was
coded as +0.5 in the scaled sum contrast) is more negative than the slope in the simple
group (which was coded as −0.5 in the scaled sum contrast). Thus, the interaction assesses
the difference between slopes.

We can also run a model, where the nested slopes are estimated, i.e., the slope of IQ in the
simple group and the slope of IQ in the complex group. This can be implemented by using the
nested coding that we learned about in the previous section:

fit_RT_FnIQ2 <- brm(RT ~ 1 + F / IQ.c,
  data = df_contrasts6,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_FnIQ2)

##               Estimate Est.Error   Q2.5  Q97.5
## Intercept       209.85      4.65 200.78 218.97
## F1               19.31      9.47   0.86  38.05
## Fsimple:IQ.c      0.01      0.45  -0.83   0.89
## Fcomplex:IQ.c    -1.60      0.47  -2.54  -0.69

Now we see that the slope of IQ in the simple group ( Fsimple:IQ.c ) is estimated to be 0, with
credible intervals clearly including zero. By contrast, the slope in the complex group
( Fcomplex:IQ.c ) is estimated as −1.6 (95% CrI = [−2.5, −0.7] ). This is consistent with our
hypothesis that high IQ speeds up response times for the complex but not for the simple task.
(To obtain evidence for this effect, we need Bayes factors, see chapter 15.) We can also see
from the nested analysis that the difference in slopes between conditions is
−1.6 − 0.0 = −1.6 . This is exactly the value for the interaction term that we estimated in the
previous model, demonstrating that interaction terms assess the difference between slopes;
i.e., they estimate in how far the regression lines in the two conditions are parallel, with an
estimate of 0 indicating perfectly parallel lines.

Notice that we can compute posterior samples for the nested slopes from the model with the
interaction. That is, we can take the model that estimates main effects and the interaction, and
compute posterior samples for the slope of IQ in the simple task and the slope of IQ in the
complex task. First, we extract the posterior samples from the model.

df_postSamp_RT_FxIQ2 <- as_draws_df(fit_RT_FxIQ2)

Then, we take a look at the contrast coefficients for the group factor:

contrasts(df_contrasts6$F)

## [,1]
## simple -0.5
## complex 0.5

They show a value of −0.5 for the simple group. Thus, to compute the slope for the simple
group we have to take the overall slope for IQ.c and subtract −0.5 times the estimate for the
interaction:

df_postSamp_RT_FxIQ2 <- df_postSamp_RT_FxIQ2 %>%
  mutate(b_IQ.c_simple = b_IQ.c - 0.5 * `b_F1:IQ.c`)

Likewise, to estimate the slope for the complex group we have to take the overall slope for
IQ.c and add +0.5 times the estimate for the interaction:

df_postSamp_RT_FxIQ2 <- df_postSamp_RT_FxIQ2 %>%
  mutate(b_IQ.c_complex = b_IQ.c + 0.5 * `b_F1:IQ.c`)

c(
IQc.c_simple = mean(df_postSamp_RT_FxIQ2$b_IQ.c_simple),
IQc.c_complex = mean(df_postSamp_RT_FxIQ2$b_IQ.c_complex)
)

## IQc.c_simple IQc.c_complex
## -0.00879 -1.60173
The results show that the posterior means for the slope of IQ.c are 0 and −1.6 for the simple
and the complex groups, as we had found above in the nested analysis.
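Because full posterior samples for both nested slopes are now available, posterior 95% credible intervals can be computed directly from them:

# 95% credible intervals for the slope of IQ.c in each group:
quantile(df_postSamp_RT_FxIQ2$b_IQ.c_simple, probs = c(0.025, 0.975))
quantile(df_postSamp_RT_FxIQ2$b_IQ.c_complex, probs = c(0.025, 0.975))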

In most situations, one should center covariates before including them in a model. If covariates are not centered, then the other effects in the model (here, the effect of the factor) cannot be interpreted as main effects any more.

One can also do analyses with interactions between a covariate and a factor, but by using
different contrast codings. For example, if we use treatment contrasts for the factor, then the
main effect of IQ.c assesses not the average slope of IQ.c across conditions, but instead
the nested slope of IQ.c within the baseline group of the treatment contrast. The interaction
still assesses the difference in slopes between groups. In a situation where there are more
than two groups, when one estimates the interaction of contrasts with a covariate, then the
contrasts define which slopes are compared with each other in the interaction terms. For
example, when using sum contrasts in an example where the influence of IQ is measured on
response times for nouns, verbs, and adjectives, then there are two interaction terms: these
assess (1) whether the slope of IQ for nouns is different from the average slope across
conditions, and (2) whether the slope of IQ for verbs is different from the average slope across
conditions. If one uses repeated contrasts in a situation where the influence of IQ on response
times is estimated for word frequency conditions “low”, “medium-low”, “medium-high”, and
“high”, then there are three interaction terms (one for each contrast). The first interaction term
estimates the difference in slopes between “low” and “medium-low” word frequencies, the
second interaction term estimates the difference in slopes between “medium-low” and
“medium-high” word frequencies, and the third interaction term estimates the difference in
slopes between “medium-high” and “high” word frequency conditions. Thus, the logic of how
contrasts specify certain comparisons between conditions extends directly to the situation
where differences in slopes are estimated.
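As a sketch of such a specification (the data frame and factor below are hypothetical, invented only for illustration), MASS::contr.sdif() implements the repeated, i.e., successive differences, contrast:

# hypothetical four-level word frequency factor crossed with a centered
# IQ covariate (made-up data; not a data set from this book):
library(MASS) # provides contr.sdif()
freq_levels <- c("low", "medium-low", "medium-high", "high")
df_freq <- data.frame(freq = factor(rep(freq_levels, each = 10),
                                    levels = freq_levels),
                      IQ = rnorm(40, mean = 100, sd = 15))
contrasts(df_freq$freq) <- contr.sdif(4)
df_freq$IQ.c <- df_freq$IQ - mean(df_freq$IQ)
# In a model such as RT ~ 1 + freq * IQ.c, the three interaction terms
# would estimate the differences in IQ slopes between neighboring
# frequency conditions (low vs. medium-low, and so on).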

9.3 Interactions in generalized linear models (with non-linear link functions) and non-linear models

Next, we look at generalized linear models, where a linear predictor is passed through a non-
linear link function to predict the dependent variable. Examples for generalized linear models
include logistic regression models and models assuming a Poisson distribution. Even though a
log-normal model is a linear model on a log-transformed dependent variable, the same
techniques apply to this type of model since the logarithm transform is not linear. Here, we
treat an example with a logistic model in a 2 × 2 factorial between-subject design. The logistic
model has the following non-linear link function, called the logistic function: P(y = 1 | η) = 1 / (1 + exp(−η)), where η is the latent linear predictor. For example, in our 2 × 2 factorial design with main effects A and B and their interaction, η is computed as a linear combination of the intercept plus the main effects and their interaction:

η = α + β_A · x_A + β_B · x_B + β_{A×B} · x_{A×B}

Thus, there is a latent level of linear predictions (η), which are then passed through a non-
linear link function to predict the probability that the observed data is a success (P (y = 1) ).
We will use this logistic model to analyze an example data set where the dependent variable
is dichotomous, coded as either a 1 (indicating success) or a 0 (indicating failure).
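The logistic function and its inverse (the logit) are available in R as plogis() and qlogis(); a quick numerical check illustrates the link:

# plogis() computes 1 / (1 + exp(-eta)); qlogis() is its inverse:
eta <- c(-2, 0, 2)
plogis(eta)          # approx. 0.12, 0.50, 0.88
qlogis(plogis(eta))  # recovers -2, 0, 2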

We load a simulated data set where the dependent variable codes whether a subject
performed a task successfully (pDV = 1 ) or not (pDV = 0 ). Moreover, the data set has two
between-subject factors A and B. The means for each of the four conditions are shown in
Table 9.3.

data("df_contrasts7")

TABLE 9.3: Summary statistics per condition for the simulated data.

Factor A  Factor B  N data  Means
A1        B1        50      0.2
A1        B2        50      0.5
A2        B1        50      0.2
A2        B2        50      0.8

To analyze these data, we use scaled sum contrasts, as we had done above for the 2 × 2 design with response times as the dependent variable; this allows us to interpret the coefficients directly as main effects. Next, we fit a brms model. The model specification is the same as for the model with response times, with two differences: first, the family argument is now specified as family = bernoulli(link = "logit") to indicate the logistic model; second, we do not specify a prior for sigma, since there is no residual standard deviation in a logistic model.

contrasts(df_contrasts7$A) <- c(-0.5, +0.5)
contrasts(df_contrasts7$B) <- c(-0.5, +0.5)
fit_pDV_AB.sum <- brm(pDV ~ 1 + A * B,
  data = df_contrasts7,
  family = bernoulli(link = "logit"),
  prior = c(
    prior(normal(0, 3), class = Intercept),
    prior(normal(0, 3), class = b)
  )
)

fixef(fit_pDV_AB.sum)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    -0.36      0.17 -0.69 -0.03
## A1            0.71      0.34  0.05  1.37
## B1            2.10      0.34  1.44  2.78
## A1:B1         1.35      0.66  0.08  2.66

The results from this analysis show that the estimates for the two main effects (A1 and B1 ) as
well as the interaction (A1 : B1 ) are positive and the 95% credible intervals do not include
zero. If we want to make a discovery claim, we would need to perform Bayes factor analyses
to investigate the evidence that there is for each of the effects.

Next, we discuss how we can obtain model predictions for each of the four experimental conditions in this generalized linear model. To obtain such predictions, we first take a look at the contrast matrix: we simultaneously have contrasts for the two main effects and the interaction, with coefficients of ±1/2 for the main effects and ±1/4 for the interaction (compare the contrast matrix produced by hypr in section 9.1).

We obtain the posterior samples for the estimates from the model:

df_postSamp_pDV <- as.data.frame(fit_pDV_AB.sum)


From these, we can compute the posterior samples for the linear predictions for each group. We see in the contrast matrix how we have to combine the posterior samples for the intercept, main effects, and interaction to obtain latent linear predictions for each condition. The first condition (design cell A1B1) has a weight of 1 for the intercept, and then weights of −0.5 (for the main effect of A), −0.5 (for the main effect of B), and 0.25 (for the interaction). The posterior samples for the other conditions are computed accordingly.

df_postSamp_pDV <- df_postSamp_pDV %>%
  mutate(A1_B1 = 1 * b_Intercept - 0.5 * b_A1 - 0.5 * b_B1 +
           0.25 * `b_A1:B1`,
         A1_B2 = 1 * b_Intercept - 0.5 * b_A1 + 0.5 * b_B1 -
           0.25 * `b_A1:B1`,
         A2_B1 = 1 * b_Intercept + 0.5 * b_A1 - 0.5 * b_B1 -
           0.25 * `b_A1:B1`,
         A2_B2 = 1 * b_Intercept + 0.5 * b_A1 + 0.5 * b_B1 +
           0.25 * `b_A1:B1`)

Now, we have computed posterior samples for estimates of the latent linear predictor η for
each experimental condition. We can look at the posterior means:

df_postSamp_pDV %>%
select(A1_B1, A1_B2, A2_B1, A2_B2) %>%
colMeans()

##    A1_B1    A1_B2    A2_B1    A2_B2
## -1.42458  0.00117 -1.39496  1.38267

This shows that these values are not on the probability scale. Instead, they are on the (log-
odds) scale of the latent linear predictor η. For presentation and interpretation of the results, it
might be much more informative to look at the condition means in terms of the probabilities of
success in each of the four conditions. Given that we have the linear predictions for each
condition, this can be easily computed by sending all posterior samples for the linear
predictions through the link function. Applying the logistic function ( plogis() in R) transforms
the linear predictors to the probability scale:29
df_postSamp_pDV <- df_postSamp_pDV %>%
  mutate(p_A1_B1 = plogis(A1_B1),
         p_A1_B2 = plogis(A1_B2),
         p_A2_B1 = plogis(A2_B1),
         p_A2_B2 = plogis(A2_B2))

Now, we have posterior samples for each condition on the probability scale. We can take a
look at the posterior means, and see that these closely correspond to the probabilities in the
data that we have seen above in Table 9.3.

df_postSamp_pDV %>%
select(p_A1_B1, p_A1_B2, p_A2_B1, p_A2_B2) %>%
colMeans()

## p_A1_B1 p_A1_B2 p_A2_B1 p_A2_B2
##   0.200   0.500   0.205   0.794

Of course, the advantage is that we now have posterior samples for these conditions
available, and can compute posterior 95% credible intervals (also see Figure 9.5). Rather than
do it manually, the function conditional_effects() can do this for us. By default, all main
effects and two-way interactions estimated in the model are shown (this can be changed by
including, for example, effects = "A:B" ).

conditional_effects(fit_pDV_AB.sum, robust = FALSE)[]


## $A
##    A   pDV  B cond__ effect1__ estimate__   se__ lower__ upper__
## 1 A1 0.425 B1      1        A1      0.200 0.0556   0.105   0.318
## 2 A2 0.425 B1      1        A2      0.205 0.0561   0.106   0.325
##
## $B
##    B   pDV  A cond__ effect1__ estimate__   se__ lower__ upper__
## 1 B1 0.425 A1      1        B1        0.2 0.0556   0.105   0.318
## 2 B2 0.425 A1      1        B2        0.5 0.0684   0.362   0.636
##
## $`A:B`
##    A  B   pDV cond__ effect1__ effect2__ estimate__   se__ lower__
## 1 A1 B1 0.425      1        A1        B1      0.200 0.0556   0.105
## 2 A1 B2 0.425      1        A1        B2      0.500 0.0684   0.362
## 3 A2 B1 0.425      1        A2        B1      0.205 0.0561   0.106
## 4 A2 B2 0.425      1        A2        B2      0.794 0.0564   0.674
##   upper__
## 1   0.318
## 2   0.636
## 3   0.325
## 4   0.890

We plot the two-way interaction using brms by embedding the conditional_effects() call in plot(.)[[1]]. This allows us to select the first (and here, the only) ggplot2 element and to customize it.

plot(conditional_effects(fit_pDV_AB.sum,
                         effects = "A:B",
                         robust = FALSE),
     plot = FALSE)[[1]] +
  labs(y = "Success probability")

FIGURE 9.5: Means and 95 percent posterior credible intervals for a simulated data set of
successful task performance in a 2 × 2 design.

9.4 Summary

To summarize, we have seen interesting results for contrasts in the context of 2 × 2 designs,
where depending on the contrast coding, the factors estimated nested effects (treatment
contrasts) or main effects (sum contrasts). We also saw that it is possible to code contrasts for
a 2 × 2 design, by creating one factor comprising all design cells, and by specifying all effects
of interest in one large contrast matrix. In designs with one factor and one covariate it is
possible to control group-differences for differences in the covariate (ANCOVA), or to estimate
in how far regression slopes are parallel in different experimental conditions. Last, in
generalized linear models with non-linear link functions it is possible to obtain posterior
samples not only on the latent scale of linear predictors, but also on the scale of the response.

9.5 Further reading

Analysis of variance is discussed in detail in Maxwell, Delaney, and Kelley (2017). A practical
book on ANOVA using R is Faraway (2002).

9.6 Exercises

Exercise 9.1 ANOVA coding for a four-condition design.


Load the following data. These data are from Experiment 1 in a set of reading studies on
Persian (Safavi, Husain, and Vasishth 2016); we encountered these data in the preceding
chapter’s exercises.

library(bcogsci)
data("df_persianE1")
dat1 <- df_persianE1
head(dat1)

##     subj item   rt distance   predability
## 60     4    6  568    short   predictable
## 94     4   17  517     long unpredictable
## 146    4   22  675    short   predictable
## 185    4    5  575     long unpredictable
## 215    4    3  581     long   predictable
## 285    4    7 1171     long   predictable

The four conditions are:

Distance=short and Predictability=unpredictable
Distance=short and Predictability=predictable
Distance=long and Predictability=unpredictable
Distance=long and Predictability=predictable

For the data given above, define an ANOVA-style contrast coding, and compute main effects
and interactions. Check with hypr what the estimated comparisons are with an ANOVA
coding.

Exercise 9.2 ANOVA and nested comparisons in a 2 × 2 × 2 design

Load the following data set. This is a 2 × 2 × 2 design from Jäger et al. (2020), with the
factors Grammaticality (grammatical vs. ungrammatical), Dependency (Agreement
vs. Reflexives), and Interference (Interference vs. no interference). The experiment is a
replication attempt of Experiment 1 reported in Dillon et al. (2013).

library(bcogsci)
data("df_dillonrep")

The grammatical conditions are a,b,e,f. The rest of the conditions are ungrammatical.
The agreement conditions are a,b,c,d. The other conditions are reflexives.
The interference conditions are a,d,e,h, and the others are the no-interference conditions.

The dependent measure of interest is TFT (total fixation time, in milliseconds).

Using a linear model, do a main effects and interactions ANOVA contrast coding, and obtain
an estimate of the main effects of Grammaticality, Dependency, and Interference, and all
interactions. You may find it easier to code the contrasts coding the main effects as +1, -1,
using ifelse() in R to code vectors corresponding to each main effect. This will make the
specification of the interactions easy.

The researchers had a further research hypothesis: in ungrammatical sentences only, agreement would show an interference effect but reflexives would not. In grammatical sentences, both agreement and reflexives are expected to show interference effects. This kind of research question can be answered with nested contrast coding.

To carry out the relevant nested contrasts, define contrasts that estimate the effects of

grammaticality
dependency type
the interaction between grammaticality and dependency type
reflexives interference within grammatical conditions
agreement interference within grammatical conditions
reflexives interference within ungrammatical conditions
agreement interference within ungrammatical conditions

Do the estimates match expectations? Check this by computing the condition means and
checking that the estimates from the models match the relevant differences between
conditions or clusters of conditions.

References

Dillon, Brian, Alan Mishler, Shayne Sloggett, and Colin Phillips. 2013. “Contrasting Intrusion
Profiles for Agreement and Anaphora: Experimental and Modeling Evidence.” Journal of
Memory and Language 69 (2). Elsevier: 85–103.

Faraway, Julian James. 2002. Practical Regression and ANOVA using R. Vol. 168. Citeseer.
Jäger, Lena A, Daniela Mertzen, Julie A Van Dyke, and Shravan Vasishth. 2020. “Interference
Patterns in Subject-Verb Agreement and Reflexives Revisited: A Large-Sample Study.” Journal
of Memory and Language 111. Elsevier: 104063.

Maxwell, Scott E, Harold D Delaney, and Ken Kelley. 2017. Designing Experiments and
Analyzing Data: A Model Comparison Perspective. New York, NY: Routledge.

Safavi, Molood Sadat, Samar Husain, and Shravan Vasishth. 2016. “Dependency Resolution
Difficulty Increases with Distance in Persian Separable Complex Predicates: Implications for
Expectation and Memory-Based Accounts.” Frontiers in Psychology 7 (403).

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to
Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and
Language 110. https://doi.org/10.1016/j.jml.2019.104038.

———. 2020. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and Language 110. Elsevier: 104038.

29. The same (with lower precision) can be achieved using 1/(1 + exp(-.)).

Chapter 10 Introduction to the probabilistic programming language Stan
Stan is a probabilistic programming language for statistical inference written in C++ that can
be accessed through several interfaces (e.g., R, Python, Matlab, etc.). Stan uses an advanced
dynamic Hamiltonian Monte Carlo algorithm (Betancourt 2016) based on a variant of the No-
U-Turn sampler (known as NUTS: Hoffman and Gelman 2014), which is, in general, more
efficient than the traditional Gibbs sampler used in other probabilistic languages such as
(Win)BUGS (Lunn et al. 2000) and JAGS (Plummer 2016). In this part of the book, we will
focus on the package rstan (Guo, Gabry, and Goodrich 2019) that integrates Stan
(Carpenter et al. 2017) with R (R Core Team 2019).

In order to understand how to fit a model in Stan and the difficulties we might face, a minimal
understanding of the Stan sampling algorithm is needed. Stan takes advantage of the fact that
the shape of the posterior distribution is completely determined by the priors and the likelihood
we have defined; the phrase completely determined means that we know the unnormalized
posterior distribution, which is the upper part of the Bayes rule (abbreviated below as up ).
This is because the denominator, or marginal likelihood, “only” constitutes a normalizing
constant:

p(Θ|y) = p(y|Θ) ⋅ p(Θ) / p(y)    (10.1)

up(Θ|y) = p(y|Θ) ⋅ p(Θ) (10.2)

Thus the unnormalized posterior is proportional to the posterior distribution:

up(Θ|y) ∝ p(Θ|y) (10.3)

(The notation up(⋅) that we are using here is not standard in statistics; we are using it only for
pedagogical convenience.)

The Stan sampler uses Hamiltonian dynamics and treats the vector of parameters, Θ (that
could range from a vector containing a couple of parameters, e.g., < μ, σ > , to a vector of
hundreds of parameters in hierarchical models), as the position of a frictionless particle that
glides on the negative logarithm of the unnormalized posterior. That means that high
probability places are valleys and low probability places are peaks in this space.30 However,
Stan doesn’t just let the particle glide until the bottom of this space. If we let that happen, we
would find the mode of the posterior distribution, rather than samples. Stan uses a complex
algorithm to determine the weight of the particle and the momentum that we apply to it, as well
as when to stop the particle trajectory to take a sample. Because we need to know the speed
of this particle, Stan needs to be able to calculate the derivative of the log unnormalized
posterior with respect to the parameters (recall that speed is the first derivative of position).
This means that if the parameter space is differentiable (is relatively smooth, and does not
have any break or angle) and if the parameters of the Stan algorithm are well adjusted–as
should happen in the warm-up period–these samples are going to represent samples of the
true posterior distribution. Bear in mind that the geometry of the posterior has a big influence
on whether the algorithm will converge (fast) or not: If the space is very flat, because there
isn’t much data and the priors are not informative, then the particle may need to glide for a
long time before it gets to a high probability area; if there are several valleys (multimodality)
the particle may never leave the vicinity of one of them; and if the space is funnel shaped, the
particle may never explore the funnel. One of the reasons for the difficulties in exploring
complicated spaces is that the continuous path of the “particle” is discretized and divided into
steps, and the step size is optimized for the entire posterior space. In spaces that are too
complex, such as a funnel, a step size might be too small to explore the wide part of the
funnel, but too large to explore the narrow part; we will deal with this problem in section
11.1.2.
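
To give a flavor of the mechanics, here is a minimal sketch in R of a single "leapfrog" step, the discretized update that Hamiltonian samplers use (this is for intuition only, not Stan's actual implementation; the function name leapfrog and the user-supplied gradient grad_U are our own):

leapfrog <- function(theta, r, eps, grad_U) {
  # theta: position (the parameter vector); r: momentum;
  # eps: step size; grad_U: gradient of the negative log
  # unnormalized posterior.
  r <- r - eps / 2 * grad_U(theta)  # half-step for the momentum
  theta <- theta + eps * r          # full step for the position
  r <- r - eps / 2 * grad_U(theta)  # another half-step for the momentum
  list(theta = theta, r = r)
}

Too large a value of eps makes this discretization unstable in narrow regions of the posterior, while too small a value makes exploring wide regions slow; this is exactly the tension described above for funnel-shaped spaces.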

Although our following example assumes a vector of two parameters and thus a simple
geometry, real world examples can easily have hundreds of parameters defining an
unnormalized posterior space with hundreds of dimensions.

One question that might arise here is the following: Given that we already know the shape of
the posterior, why do we need samples? After all, the posterior is just the unnormalized
posterior multiplied by some number, the normalizing constant.

To make this discussion concrete, let’s say that we have a subject that participates in a
memory test, and in each trial we get a noisy score from their true working memory score. We
assume that at each trial, the score is elicited with normally distributed noise. If we want to
estimate the score and how much the noise makes it vary from trial to trial, we are assuming a
normal likelihood and we want to estimate its mean and standard deviation.

We will use simulated data produced by a normal distribution with a true mean of 3 and a true
standard deviation of 10:

Y <- rnorm(n = 100, mean = 3, sd = 10)
head(Y)

## [1] 11.26 -18.74 -11.88 -8.62 -12.89 7.20

As always, given our prior knowledge, we decide on priors. In this case, we use a log-normal
prior for the standard deviation, σ, since it can only be positive, but except for that, the prior
distributions are quite arbitrary in this example.

$$\begin{aligned}
\mu &\sim \mathit{Normal}(0, 20) \\
\sigma &\sim \mathit{LogNormal}(3, 1)
\end{aligned} \tag{10.4}$$

The unnormalized posterior will be the product of the likelihood of each data point times the
prior for each parameter:

$$up(\mu, \sigma \mid y) = \prod_{n=1}^{100} \mathit{Normal}(y_n \mid \mu, \sigma) \cdot \mathit{Normal}(\mu \mid 0, 20) \cdot \mathit{LogNormal}(\sigma \mid 3, 1) \tag{10.5}$$

where $y = \langle 11.258, -18.741, \ldots \rangle$

We can also define the unnormalized posterior, up(⋅) , as a function in R:


up <- function(y, mu, sigma) {
  # Prior densities for mu and sigma, times the likelihood
  # of each data point:
  dnorm(x = mu, mean = 0, sd = 20) *
    dlnorm(x = sigma, meanlog = 3, sdlog = 1) *
    prod(dnorm(x = y, mean = mu, sd = sigma))
}

For example, if we want to know the unnormalized posterior density for the vector of parameters $\langle \mu, \sigma \rangle = \langle 0, 5 \rangle$, we do the following:


up(y = Y, mu = 0, sigma = 5)

## [1] 3e-194
The shape of the unnormalized posterior density is completely defined and it will look like
Figure 10.1.


FIGURE 10.1: The unnormalized posterior defined by Equation (10.5)

Why is the shape of the unnormalized posterior density not enough? The main reason is that unless we already know which probability distribution we are dealing with (e.g., normal, Bernoulli, etc.) or we can easily integrate it (which can only be done in simpler cases), we cannot do much with the analytical form of the unnormalized posterior: We cannot calculate credible intervals or know how likely it is that the true score is above or below zero, and even the mean of the posterior is impossible to calculate. This is because the unnormalized posterior distribution represents only the general shape of the posterior distribution. With just the shape of an unknown distribution, we can only answer the following question: What is the most (or least) likely value of the vector of parameters? We can answer this question by searching for the highest (or lowest) place in that shape. This leads us to the maximum a posteriori (MAP) estimate, which is the Bayesian counterpart of the maximum likelihood estimate (MLE). (However, if we can recognize the shape as a known distribution, we are in a different situation. In that case, we might already know the formulas for the expectation, variance, etc. This is what we did in chapter 2, but this is an unusual situation in realistic analyses.) As we mentioned before, if we want to get posterior density values, we need the denominator of the Bayes rule (or marginal likelihood), p(y), which requires integrating the unnormalized posterior. Even this is not too useful if we want to communicate findings: almost every summary statistic requires us to solve more integrals, and except for a handful of cases, these integrals might not have an analytical solution.

As we mentioned before, the only summary that we can get with an unnormalized posterior shape (that we don't recognize as a familiar distribution) is its mode (or the MAP: the highest point in Figure 10.1, or the lowest point of the negative log unnormalized posterior). However, even calculating the mode is not always trivial. In simple cases such as this one, one can calculate it analytically, but in more complex cases relatively complicated algorithms are needed.
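
For instance, in our example the MAP can be found numerically with a general-purpose optimizer; the following is only a sketch (the function name neg_log_up is our own, and we work on the log scale to avoid the underflow seen above):

neg_log_up <- function(par, y) {
  # par[1] is mu, par[2] is sigma; return the negative log
  # unnormalized posterior.
  -(dnorm(x = par[1], mean = 0, sd = 20, log = TRUE) +
      dlnorm(x = par[2], meanlog = 3, sdlog = 1, log = TRUE) +
      sum(dnorm(x = y, mean = par[1], sd = par[2], log = TRUE)))
}
# L-BFGS-B lets us constrain sigma to be positive:
optim(par = c(0, 5), fn = neg_log_up, y = Y,
      method = "L-BFGS-B", lower = c(-Inf, 0.01))$par

The result should be close to the true values used in the simulation (3 and 10).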

If we want to be able to calculate summary statistics of the posterior distribution (mean, quantiles, etc.), we are going to need samples from this distribution. This is because with enough samples of a probability distribution, we can achieve very good approximations of summary statistics. Stan will take care of returning samples from the posterior distribution, if the log unnormalized posterior distribution is differentiable and can be expressed as follows:31

$$\log(up(\Theta \mid y)) = \sum_n \log(p(y_n \mid \Theta)) + \sum_q \log(p(\Theta_q)) \tag{10.6}$$

where n indicates each data point and q each parameter. In our case, this corresponds to the following:
following:

$$\log(up(\mu, \sigma \mid y)) = \sum_{n=1}^{100} \log(\mathit{Normal}(y_n \mid \mu, \sigma)) + \log(\mathit{Normal}(\mu \mid 0, 20)) + \log(\mathit{LogNormal}(\sigma \mid 3, 1)) \tag{10.7}$$

In the following sections, we’ll see how we can implement this model and many others in Stan.

10.1 Stan syntax

A Stan program is usually saved as a .stan file and accessed through R (or other interfaces), and it is organized into a sequence of optional and obligatory blocks, which must be written in order. The Stan language is different from R and is loosely based on C++; one important aspect to pay attention to is that every statement ends in a semi-colon, ; . Blocks ( {} ) do not end in semi-colons. Some functions in Stan are written in the same way as in R (e.g., mean , sum , max , min ). But some are different; when in doubt, the Stan documentation can be extremely helpful. In addition, the package rstan provides the function lookup() to look up translations of functions. For example, in section 4.3, we saw that the R function plogis() is needed to convert from log-odds to probability space. If we need it in a Stan program, we can look for it in the following way:


lookup(plogis)

## StanFunction Arguments ReturnType
## 227 inv_logit (T x) R
## 260 log_inv_logit (T x) R
## 261 logistic_cdf (reals y, reals mu, reals sigma) real
## 262 logistic_lccdf (reals y , reals mu, reals sigma) real
## 263 logistic_lcdf (reals y , reals mu, reals sigma) real

There are three columns in the output of this call. The first one indicates Stan function names,
the second one their arguments with their type, and the third one the type they return. Unlike
R, Stan is strict with the type of the variables.32 In order to decide on which function to use, it
is necessary to look at the Stan documentation and find the function that matches our specific
needs (for plogis , the corresponding function would be inv_logit() ).

Another important difference with R is that every variable needs to be declared at the
beginning of a block with its type (real, integer, vector, matrix, etc.). The next two sections
exemplify these details through basic Stan programs.

10.2 A first simple example with Stan: Normal likelihood

Let’s fit a Stan model to estimate the simple example given at the introduction of this chapter,
where we simulate data in R from a normal distribution with a true mean of 3 and a true
standard deviation of 10 :


Y <- rnorm(n = 100, mean = 3, sd = 10)


As mentioned earlier, Stan code is organized in blocks. The first block indicates what
constitutes “data” for the model:


data {
  int<lower = 1> N; // Total number of trials
  vector[N] y;      // Score in each trial
}

The variable of type int (integer) represents the number of trials. In addition to the type, some constraints can be indicated with lower and upper . In this case, N can't be smaller than 1 . These constraints serve as a sanity check; if they are not satisfied, we get an error and the model won't run. The data are stored in a vector of length N ; unlike in R, vectors (and matrices and arrays) need to be declared with their dimensions. Comments are indicated with // rather than # .

The next block indicates the parameters of the model:


parameters {
  real mu;
  real<lower = 0> sigma;
}

The two parameters are real numbers, and sigma is constrained to be positive.

Finally, we indicate the prior distributions and likelihood functions in the model block:


model {
  // Priors:
  target += normal_lpdf(mu | 0, 20);
  target += lognormal_lpdf(sigma | 3, 1);
  // Likelihood:
  for(i in 1:N)
    target += normal_lpdf(y[i] | mu, sigma);
}
The variable target is a reserved word in Stan; every statement with target += adds terms to the unnormalized log posterior density. We do this because adding a term on the log scale amounts to multiplying by the corresponding term in the numerator of the unnormalized posterior. As explained earlier, Stan uses the shape of the unnormalized posterior to sample from the actual posterior distribution. See Box 10.1 for a more detailed explanation, and see Box 10.2 for alternative notations.

Box 10.1 What does target do?

We can exemplify how target works with one hypothetical iteration of the sampler.

In every iteration where the sampler explores the posterior space, mu and sigma
acquire different values (this is where the Stan algorithm stops the movement of the
particle in the Hamiltonian space). Say that in an iteration, mu = 1.77 and sigma =
10.703 . Then the following happens in the model block:

1. At the beginning of the iteration, target is zero.

2. The transformations that the sampler automatically does are taken into account. In our case, although sigma is constrained to be positive in our model, inside Stan's sampler it is transformed to an "unconstrained" space amenable to Hamiltonian Monte Carlo. That is, Stan samples from an auxiliary parameter that ranges from minus infinity to infinity, which is equivalent to log(sigma) . This auxiliary parameter is then exponentiated when it is incorporated into our model. Because of the mismatch between the constrained parameter space that we defined and the unconstrained space that it is converted to by Stan, an adjustment to the unnormalized posterior is required and added automatically. The reasons for this requirement are somewhat complex and will be discussed in chapter 12. In this particular case, this adjustment (which is the log absolute value of the Jacobian determinant) is equivalent to adding log(sigma) = 2.371 to target .

3. After target += normal_lpdf(mu | 0, 20); the log of the density of Normal(0, 20) is
evaluated at a given sample of mu (specifically 1.77) and this is added to target . In
R, this would be dnorm(x = 1.77, mean = 0, sd = 20, log = TRUE) , which is equal to
-3.919 . Thus, target should be -3.919 + 2.371 = -1.548 .

4. After target += lognormal_lpdf(sigma | 3, 1) , we add the log of the density of LogNormal(3, 1) evaluated at 10.703 to the previous value of the target. In R, this would be dlnorm(x = 10.703, meanlog = 3, sdlog = 1, log = TRUE) , which is equal to -3.488 . Thus, target should be updated to -1.548 + -3.488 = -5.036 .

5. After each iteration of the for-loop in the model block, we add to the target the log density of Normal(1.77, 10.703) evaluated at each of the values of Y. In R, this would amount to adding sum(dnorm(Y, 1.77, 10.703, log = TRUE)) (which is equal to -375.269 ) to the current value of target : -5.036 + -375.269 = -380.305 .

This means that for the coordinates <mu = 1.77, sigma = 10.703>, the height of the unnormalized posterior would be the value $\exp(\texttt{target}) = \exp(-380.305) = 6.852 \times 10^{-166}$. Incidentally, the value of target is returned as lp__ (log probability) in an object storing a fitted model with Stan.

It is possible to expose the value of target by printing target() inside a Stan model. The value of target after each iteration is named lp__ in the Stan object. This can be useful for troubleshooting a problematic model.
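
The arithmetic of the steps above can be reproduced in R (a sketch using the hypothetical draw from this example):

adj <- log(10.703)  # Jacobian adjustment for the transformed sigma (step 2)
lp_mu <- dnorm(1.77, mean = 0, sd = 20, log = TRUE)            # step 3
lp_sigma <- dlnorm(10.703, meanlog = 3, sdlog = 1, log = TRUE) # step 4
ll <- sum(dnorm(Y, mean = 1.77, sd = 10.703, log = TRUE))      # step 5
adj + lp_mu + lp_sigma + ll  # the value of target, i.e., lp__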

Box 10.2 Explicitly incrementing the log probability function ( target ) vs. using the
sampling ( ~ ) notation.

In this book we specify priors and likelihoods by explicitly incrementing the log-probability
function using the following syntax:

target += pdf_name_lpdf(parameter | ...)

However, Stan also allows for specifying priors and likelihoods with the so-called sampling notation, using the following code:

parameter ~ pdf_name(...)

Confusingly enough, a sampling statement does not perform any actual sampling; it is meant to be a notational convenience.

The following two lines of code lead to the same behavior in Stan with respect to
parameter estimation. There is, nonetheless, an important difference between them.

y ~ normal(mu, sigma);

target += normal_lpdf(y | mu, sigma);

The important difference is that the sampling notation (the notation with the ∼ ) will drop normalizing constants. Consider the following formula that corresponds to the log-transformed PDF of a normal distribution:

$$-\log(\sigma) - \frac{\log(2\pi)}{2} - \frac{(y - \mu)^2}{2\sigma^2}$$

If one uses the sampling notation, Stan will ignore the terms that don't contain parameters, such as $-\frac{\log(2\pi)}{2}$; depending on whether $\sigma$, $\mu$, and $y$ are data or parameters, Stan will either ignore the remaining terms or not. For example, if $\sigma$ and $\mu$ are data and $y$ is a parameter, then $-\log(\sigma)$ is a constant term that can be ignored, but not $-\frac{(y - \mu)^2}{2\sigma^2}$, because it contains the parameter $y$. Dropping constant terms does not affect parameter estimation, because it shifts the unnormalized posterior by the same amount everywhere in the parameter space. To make this more concrete, the whole plot in Figure 10.1 would move up or down by some constant amount, and this won't affect the Hamiltonian dynamics that determine how we sample from the posterior.

The advantage of the sampling notation is that it can be faster (when many terms are ignored), but the disadvantages are that (i) it is not compatible with the calculation of Bayes factors with bridge sampling (see section 15.4 in chapter 15), or with the calculation of the log-likelihood for cross-validation (see chapter 16), (ii) it misleads us into thinking that Stan is actually sampling the left term in the sampling statement, e.g., drawing y from a normal distribution in the previous example, when in fact at each step the log-probability ( target ) is incremented based on the parameter values determined by Hamiltonian dynamics (as explained before), and (iii) it makes it less straightforward to transition to more complex models where the sampling notation cannot be used (as in, for example, mixture models in chapter 19).

If one is not going to use Bayes factor with bridge sampling or cross-validation, the same
speed advantage of the sampling notation can also be achieved by incrementing the log-
probability with log-unnormalized probability density or mass functions (functions ending
with _lupdf or _lupmf ). The previous example would be translated into the following:

target += normal_lupdf(y | mu, sigma);

We didn’t use curly brackets with the for-loop; this is a common practice if the for-loop has
only one line, but brackets can be added and are obligatory if the for-loop spans several lines.

It’s also possible to avoid the for-loop since many functions are vectorized in Stan:

model {
  // Priors:
  target += normal_lpdf(mu | 0, 20);
  target += lognormal_lpdf(sigma | 3, 1);
  // Likelihood:
  target += normal_lpdf(y | mu, sigma);
}

The for-loop and vectorized versions give us the same output: the for-loop version evaluates the log-likelihood at each value of y and adds it to target . The vectorized version does not create a vector of log-likelihoods; instead, it sums up the log-likelihood evaluated at each element of y and then adds that sum to target .

The complete model looks like this:


data {
  int<lower = 1> N; // Total number of trials
  vector[N] y;      // Score in each trial
}
parameters {
  real mu;
  real<lower = 0> sigma;
}
model {
  // Priors:
  target += normal_lpdf(mu | 0, 20);
  target += lognormal_lpdf(sigma | 3, 1);
  // Likelihood:
  target += normal_lpdf(y | mu, sigma);
}

You can save the above code as normal.stan . Alternatively, you can use the version stored
in the package bcogsci . (Typing ?stan-normal in the R console provides some
documentation for the model.) You can access the code of the models of this book by using
system.file("stan_models", "name_of_the_model.stan", package = "bcogsci") .

normal <- system.file("stan_models",
                      "normal.stan",
                      package = "bcogsci")

This command just points to a text file that the package bcogsci stores on your computer.
You can open it to read the code (with any text editor, or readLines() in R). You’ll need to
compile this code and run it with stan() .

Stan requires the data to be in a list object in R. Below, we fit the model with the default
number of chains and iterations.


Y <- rnorm(n = 100, mean = 3, sd = 10)
lst_score_data <- list(y = Y, N = length(Y))
# Fit the model with the default values of number of
# chains and iterations: chains = 4, iter = 2000
fit_score <- stan(normal, data = lst_score_data)
# alternatively:
# stan("normal.stan", data = lst_score_data)

Inspect how well the chains mixed in Figure 10.2. The chains for each parameter should look
like a “fat hairy caterpillar” (Lunn et al. 2012); see section 3.2.1.2 for a brief discussion about
convergence.


traceplot(fit_score, pars = c("mu", "sigma"))


FIGURE 10.2: Traceplots of mu and sigma from the model fit_score .


We can see a summary of the posterior by either printing out the model fit, or by plotting it.
The summary displayed by the function print includes means, standard deviations ( sd ),
quantiles, Monte Carlo standard errors for the mean of the posterior ( se_mean ), split Rhats,
and effective sample sizes ( n_eff ). The summaries are computed after removing the
warmup and merging together all chains. The se_mean is unrelated to the se of an estimate
in the parallel frequentist model. Similarly to a large effective sample size, small Monte Carlo
standard errors indicate an “efficient” sampling procedure: with a large value of n_eff and a
small value for se_mean we can be relatively sure of the reliability of the mean of the
posterior. However, what constitutes a large or small se_mean is harder to define (see
Vehtari, Gelman, Simpson, Carpenter, and Bürkner 2019 for a more extensive discussion).33


print(fit_score, pars = c("mu", "sigma"))

## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
##        mean se_mean   sd 2.5%  25%   50%   75% 97.5% n_eff Rhat
## mu     2.76    0.02 1.04 0.72 2.08  2.75  3.43  4.83  2787    1
## sigma 10.43    0.01 0.77 9.07 9.91 10.38 10.91 12.05  2864    1
##
## Samples were drawn using NUTS(diag_e) at Thu Feb 16 17:28:01 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
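
All of these summaries are just functions of the posterior samples; for example, the mean and the 95% credible interval of mu printed above can be reproduced by hand:

post_mu <- rstan::extract(fit_score)$mu
mean(post_mu)
quantile(post_mu, probs = c(0.025, 0.975))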

After transforming the stanfit object into a data frame, it's possible to produce summary plots such as the one shown in Figure 10.3. The package bayesplot (Gabry and Mahr 2019) is a wrapper around ggplot2 (Wickham, Chang, et al. 2019) and has several convenient functions to plot the samples. The bayesplot functions for posterior summaries start with mcmc_ :


df_fit_score <- as.data.frame(fit_score)


mcmc_hist(df_fit_score, pars = c("mu", "sigma"))

FIGURE 10.3: Histograms of the samples of the posterior distributions of mu and sigma
from the model fit_score .
There are also several ways to get the samples for other summaries or customized plots,
depending on whether we want a list, a data frame, or an array.


# extract from rstan is sometimes overwritten by
# a tidyverse version; we make sure that it's the right one:
rstan::extract(fit_score) %>%
  str()
## List of 3
##  $ mu   : num [1:4000(1d)] 1.77 3.71 2.43 1.06 2.09 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
##  $ sigma: num [1:4000(1d)] 10.7 11.81 9.93 11.06 11.07 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
##  $ lp__ : num [1:4000(1d)] -380 -382 -380 -381 -380 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL


as.data.frame(fit_score) %>%
  str(list.len = 5)

## 'data.frame': 4000 obs. of 3 variables:
##  $ mu   : num 1.85 3.36 1.88 3.21 4.29 ...
##  $ sigma: num 9.59 8.82 11.51 9.31 11.47 ...
##  $ lp__ : num -381 -383 -381 -381 -382 ...


as.array(fit_score) %>%
  str()

## num [1:1000, 1:4, 1:3] 1.85 3.36 1.88 3.21 4.29 ...
## - attr(*, "dimnames")=List of 3
##  ..$ iterations: NULL
##  ..$ chains    : chr [1:4] "chain:1" "chain:2" "chain:3" "chain:4"
##  ..$ parameters: chr [1:3] "mu" "sigma" "lp__"

Box 10.3 An alternative R interface to Stan: cmdstanr


At the time of writing this, there are two major nuisances with rstan : (i) the R code interfaces directly with C++, creating installation problems on many systems; (ii) rstan releases lag behind Stan language releases considerably, preventing the user from taking advantage of the latest features of Stan. The package cmdstanr (https://mc-stan.org/cmdstanr/) is a lightweight interface to Stan for R that solves these problems. The downside (at the moment of writing this) is that, being lightweight, some functionality of rstan is lost, such as looking up functions with lookup() , exposing functions with expose_stan_functions() , as well as using the fitted model with the bridgesampling package to generate Bayes factors. Furthermore, the package cmdstanr is currently under development and the application programming interface (API) might still change. However, the user interested in an easy (and painless) installation and the latest features of Stan might find it useful.

Once cmdstanr is installed, we can use it as follows:

First, create a new CmdStanModel object from a file containing a Stan program using cmdstan_model() :


normal <- system.file("stan_models",
                      "normal.stan",
                      package = "bcogsci")
normal_mod <- cmdstan_model(normal)

The object normal_mod is an R6 reference object (https://r6.r-lib.org/). This class of object behaves similarly to objects in object-oriented programming languages such as Python. Methods are accessed using $ (rather than . as in Python).

To sample, use the $sample() method. The data argument accepts a list (as we used in
stan() from rstan ). However, many of the arguments of $sample have different
names than the ones used in stan() from the rstan package:

lst_score_data <- list(y = Y, N = length(Y))
fit_normal_cmd <- normal_mod$sample(
  data = lst_score_data,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000)

To show the posterior summary, access the method $summary() of the object
fit_normal_cmd .


fit_normal_cmd$summary()

Access the samples of fit_normal_cmd using $draws() .


fit_normal_cmd$draws(variables = "mu")

The vignettes at https://mc-stan.org/cmdstanr/ show more use cases, and how the samples can be transformed into other formats (data frame, matrix, etc.) with the package posterior (https://mc-stan.org/posterior/).

10.3 Another simple example: Cloze probability with Stan with the binomial likelihood

Let’s fit a Stan model ( binomial_cloze.stan ) to estimate the cloze probability of a word given
its context: that is, what is the probability of an upcoming word given its previous context; the
model that is detailed in 2.2 and was fit in 3.1. We want to estimate the cloze probability of
“umbrella”, θ, given the following data: “umbrella” was answered 80 out of 100 trials. We
assume a binomial distribution as the likelihood function, and Beta(a = 4, b = 4) as a prior
distribution for the cloze probability.

data {
  int<lower = 1> N;            // Total number of answers
  int<lower = 0, upper = N> k; // Number of times umbrella was answered
}
parameters {
  // theta is a probability, it has to be constrained between 0 and 1
  real<lower = 0, upper = 1> theta;
}
model {
  // Prior on theta:
  target += beta_lpdf(theta | 4, 4);
  // Likelihood:
  target += binomial_lpmf(k | N, theta);
}

There is only one parameter in this model, the cloze probability, represented by the parameter theta , which is a real number constrained between 0 and 1. Another difference between this and the previous example is that the likelihood function ends with _lpmf rather than with _lpdf . This is because Stan differentiates between distributions of continuous variables, i.e., probability density functions (PDFs), and distributions of discrete variables, i.e., probability mass functions (PMFs).


lst_cloze_data <- list(k = 80, N = 100)
binomial_cloze <- system.file("stan_models",
                              "binomial_cloze.stan",
                              package = "bcogsci")
fit_cloze <- stan(binomial_cloze, data = lst_cloze_data)

Print the summary of the posterior distribution of θ below, and show its posterior distribution
graphically (see Figure 10.4):


print(fit_cloze, pars = c("theta"))


## mean 2.5% 97.5% n_eff Rhat
## theta 0.78 0.7 0.85 1482 1
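
Since the beta distribution is a conjugate prior for the binomial likelihood, this particular posterior is available in closed form, Beta(4 + 80, 4 + 20), which lets us verify the sampler's output analytically:

a_post <- 4 + 80  # prior a plus successes
b_post <- 4 + 20  # prior b plus failures
a_post / (a_post + b_post)             # posterior mean: ~0.78
qbeta(c(0.025, 0.975), a_post, b_post) # 95% CrI: ~[0.70, 0.85]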


df_fit_cloze <- as.data.frame(fit_cloze)


mcmc_dens(df_fit_cloze, pars = "theta") +
geom_vline(xintercept = mean(df_fit_cloze$theta))


FIGURE 10.4: The posterior distribution of the cloze probability of umbrella; parameter θ.

10.4 Regression models in Stan

In the following sections, we will revisit and expand on some of the examples that were fit with
brms in chapter 4.
10.4.1 A first linear regression in Stan: Does attentional load
affect pupil size?

As in section 4.1, we focus on the effect of cognitive load on one subject’s pupil size with a
subset of the data of Wahn et al. (2016). We use the following likelihood and priors. For details
about our decision on priors and likelihood, see section 4.1.

$$\begin{aligned}
p\_size_n &\sim \mathit{Normal}(\alpha + c\_load_n \cdot \beta, \sigma) \\
\alpha &\sim \mathit{Normal}(1000, 500) \\
\beta &\sim \mathit{Normal}(0, 100) \\
\sigma &\sim \mathit{Normal}_+(0, 1000)
\end{aligned}$$

The Stan model pupil_model.stan follows:


data {
  int<lower=1> N;
  vector[N] p_size;
  vector[N] c_load;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  // priors:
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta | 0, 100);
  target += normal_lpdf(sigma | 0, 1000)
    - normal_lccdf(0 | 0, 1000);
  // likelihood
  target += normal_lpdf(p_size | alpha + c_load * beta, sigma);
}

Because we are fitting a regression, we use the location (μ) of the likelihood function to
regress p_size with the following equation alpha + c_load * beta , where both p_size
and c_load are vectors defined in the data block. The following line accumulates the log-
likelihood of every observation:

target += normal_lpdf(p_size | alpha + c_load * beta, sigma);

This is equivalent to and slightly faster than the following lines:

for(n in 1:N)
target += normal_lpdf(p_size[n] | alpha + c_load[n] * beta, sigma);

A statement that requires some explanation is the following:

target += normal_lpdf(sigma | 0, 1000)


- normal_lccdf(0 | 0, 1000);

As in our original example in section 4.1, we are assuming a truncated normal distribution as a prior for σ. Not only are we setting a lower boundary on the parameter with lower = 0 , but we are also "correcting" its prior distribution by subtracting normal_lccdf(0 | 0, 1000) , where lccdf stands for log complement of a cumulative distribution function. Once we add a lower boundary, the probability mass under the remaining half of the "regular" normal distribution should be one, that is, when we integrate from zero (rather than from minus infinity) to infinity. As discussed in Box 4.1, we need to normalize the PDF by dividing it by the difference of its CDF evaluated at the new boundaries (a = 0 and b = ∞ in our case):

$$f_{[a,b]}(x) = \frac{f(x)}{F(b) - F(a)} \tag{10.8}$$

where f is a PDF and F a CDF.

This equation in log-space is:

$$\log(f_{[a,b]}(x)) = \log(f(x)) - \log(F(b) - F(a)) \tag{10.9}$$

In Stan, $\log(f(x))$ corresponds to normal_lpdf(x | ...) , and $\log(F(x))$ to normal_lcdf(x | ...) . Because in our example $b = \infty$, $F(b) = 1$, we are dealing with the complement of the log CDF evaluated at $a = 0$, that is, $\log(1 - F(0))$; that is why we use normal_lccdf(0 | ...) (notice the double c ; this indicates the complement of the CDF).
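
A quick numerical check in R (a sketch; the function name dtnorm0 is our own) shows what this correction does for the Normal+(0, 1000) prior: since the truncation point coincides with the mean, 1 − F(0) = 0.5, and the correction simply doubles the density on the positive reals.

# Truncated normal density following Equation (10.8), with a = 0:
dtnorm0 <- function(x, sd) {
  dnorm(x, mean = 0, sd = sd) / (1 - pnorm(0, mean = 0, sd = sd))
}
# It integrates to (approximately) one over the positive reals:
integrate(dtnorm0, lower = 0, upper = 1e4, sd = 1000)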

To be able to fit the model, Stan requires the data to be input as a list: First, load the data and center the predictor load in the data frame, then create a list, and finally fit the model.

df_pupil <- df_pupil %>%
mutate(c_load = load - mean(load))


ls_pupil <- list(
  p_size = df_pupil$p_size,
  c_load = df_pupil$c_load,
  N = nrow(df_pupil)
)
pupil_model <- system.file("stan_models",
                           "pupil_model.stan",
                           package = "bcogsci")
fit_pupil <- stan(pupil_model, data = ls_pupil)

Check the traceplots (Figure 10.5).


traceplot(fit_pupil, pars = c("alpha", "beta", "sigma"))


FIGURE 10.5: Traceplots of alpha , beta , and sigma from the model fit_pupil .

Examine some summaries of the marginal posterior distributions of the parameters of interest:


print(fit_pupil, pars = c("alpha", "beta", "sigma"))


## mean 2.5% 97.5% n_eff Rhat
## alpha 701.9 661.7 742 3693 1
## beta 33.4 10.3 57 4192 1
## sigma 128.4 102.5 162 3419 1

Plot the posterior distributions (Figure 10.6).


df_fit_pupil <- as.data.frame(fit_pupil)


mcmc_hist(fit_pupil, pars = c("alpha", "beta", "sigma"))


FIGURE 10.6: Histograms of the posterior samples of alpha , beta , and sigma from the
model fit_pupil .

To determine the probability that the posterior for beta is larger than zero given the model and
data, examine the proportion of samples above zero:

# We are using df_fit_pupil and not the "raw" Stanfit object.
mean(df_fit_pupil$beta > 0)

## [1] 0.997

To generate prior or posterior predictive distributions, we can create our own functions in R
with the purrr function map_dfr (or a for-loop) as we did in section 4.2 with the function
lognormal_model_pred() . Alternatively, we can use the generated quantities block in our

model:


data {
  int<lower = 1> N;
  vector[N] c_load;
  int<lower = 0, upper = 1> onlyprior;
  vector[N] p_size;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  // priors including all constants
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta | 0, 100);
  target += normal_lpdf(sigma | 0, 1000)
    - normal_lccdf(0 | 0, 1000);
  if (!onlyprior)
    target += normal_lpdf(p_size | alpha + c_load * beta, sigma);
}
generated quantities {
  array[N] real p_size_pred;
  p_size_pred = normal_rng(alpha + c_load * beta, sigma);
}
For most of the probability functions, there is a matching pseudorandom number generator (PRNG) with the suffix _rng . Here we are using the vectorized function normal_rng . Once p_size_pred is declared as an array of size N , the following statement generates N predictions (for each iteration of the sampler):

p_size_pred = normal_rng(alpha + c_load * beta, sigma);

At the moment not all the PRNG functions are vectorized, and the ones that are allow only for arrays and, confusingly enough, not vectors. We define an array by writing array , the length of each dimension between brackets, then the type, and finally the name of the variable. For example, to define an array of real numbers with three dimensions of lengths 6, 7, and 10, we write array[6, 7, 10] real var .34 Vectors and matrices are also valid types for an array. See Box 10.4 for more about the difference between arrays and vectors, and other algebra types. We also included a data variable called onlyprior ; this is an integer that can only be set to 1 (TRUE) or 0 (FALSE). When onlyprior = 1 , the likelihood is omitted from the model, p_size is ignored, and p_size_pred is the prior predictive distribution. When onlyprior = 0 , the likelihood is incorporated in the model (as it is in the original code pupil_model.stan ) using p_size , and p_size_pred is the posterior predictive distribution.

If we want posterior predictive distributions, we fit the model to the data and set onlyprior = 0 ; if we want prior predictive distributions, we sample from the priors and set onlyprior = 1 . Then we use bayesplot functions to visualize predictive checks.

For posterior predictive checks, we would do the following:


ls_pupil <- list(
  onlyprior = 0,
  p_size = df_pupil$p_size,
  c_load = df_pupil$c_load,
  N = nrow(df_pupil)
)
pupil_gen <- system.file("stan_models",
                         "pupil_gen.stan",
                         package = "bcogsci")
fit_pupil <- stan(file = pupil_gen, data = ls_pupil)
Store the predicted pupil sizes in yrep_pupil . This variable contains an $N_{samples} \times N_{observations}$ matrix, that is, each row of the matrix is a draw from the posterior predictive distribution, i.e., a vector with one element for each of the data points in y.


yrep_pupil <- extract(fit_pupil)$p_size_pred


dim(yrep_pupil)

## [1] 4000 41
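
Any summary of the predictive distribution can be computed directly from this matrix; for example, the 95% interval of the predicted mean pupil size across the 4000 draws:

pred_means <- rowMeans(yrep_pupil)
quantile(pred_means, probs = c(0.025, 0.975))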

The predictive check functions in bayesplot (starting with ppc_ ) require a vector with the observations as their first argument and a matrix with the predictive distribution as their second argument. As an example, in Figure 10.7 we use an overlay of densities and we draw only 50 elements (that is, 50 predicted data sets).


ppc_dens_overlay(df_pupil$p_size, yrep = yrep_pupil[1:50, ])



FIGURE 10.7: A posterior predictive check showing 50 predicted density plots from the model
fit_pupil against the observed data.

For prior predictive distributions, we simply set onlyprior = 1 . The observations ( p_size )
are ignored by the model, but are required by the data block in Stan. If we haven’t collected
data yet, we could include a vector of zeros.


ls_pupil_prior <- list(
  onlyprior = 1,
  p_size = df_pupil$p_size,
  # or: p_size = rep(0, nrow(df_pupil)),
  c_load = df_pupil$c_load,
  N = nrow(df_pupil)
)
prior_pupil <- stan(pupil_gen, data = ls_pupil_prior,
                    control = list(adapt_delta = 0.9))

To avoid divergent transitions, we increase the adapt_delta parameter's default value from 0.8 to 0.9 . It is important to highlight that we cannot safely ignore the warnings of the above model, even if we are not fitting data. This is because in practice one is still sampling a density using Hamiltonian Monte Carlo, and thus the prior sampling process can break in the same ways as the posterior sampling process. Prior predictive distributions such as the one shown in Figure 10.8 can be plotted with ppd_dens_overlay() (and in general with functions starting with ppd_ , which don't require an argument with data).


yrep_prior_pupil <- extract(prior_pupil)$p_size_pred


ppd_dens_overlay(yrep_prior_pupil[1:50, ])

FIGURE 10.8: Prior predictive distribution showing 50 predicted density plots from the model
fit_pupil .

Box 10.4 Matrix, vector, or array in Stan?

Stan contains three basic linear algebra types, vector , row_vector , and matrix . But
Stan also allows for building arrays of any dimension from any type of element (integer,
real, etc.). This means that there are several ways to define one-dimensional N-sized
containers of real numbers,

array[N] real a;
vector[N] a;

row_vector[N] a;

as well as two-dimensional N1 × N2-sized containers of real numbers:

array[N1, N2] real m;


matrix[N1, N2] m;

array[N1] vector[N2] b;
array[N1] row_vector[N2] b;
These distinctions affect either what we can do with these variables or the speed of our model, and sometimes the types are interchangeable. Matrix algebra is only defined for (row) vectors and matrices, that is, we cannot multiply arrays. The following line requires all the one-dimensional containers ( p_size and c_load ) to be defined as vectors (or row_vectors):

vector[N] mu = alpha + c_load * beta;

Many "vectorized" operations are also valid for arrays; that is, normal_lpdf accepts (row) vectors (as we did in our code) or arrays, as in the next example. There is of course no point in converting a vector to an array as follows, but this shows that Stan allows both types of one-dimensional containers.

array[N] real mu = to_array_1d(alpha + c_load * beta);
target += normal_lpdf(p_size | mu, sigma);

By contrast, the outcome of “vectorized” pseudorandom number generator ( _rng )


functions can only be stored in an array. The following example shows the only way to
vectorize this type of function:

array[N] real p_size_pred = normal_rng(alpha + c_load * beta,

sigma);

Alternatively, one can always use a for-loop, and it won’t matter if p_size_pred is an
array or a vector:

vector[N] p_size_pred;
for(n in 1:N)

p_size_pred[n] = normal_rng(alpha + c_load[n] * beta, sigma);

See also Stan’s manual section on matrices, vector, and arrays (Stan Development Team
2021).

10.4.2 Interactions in Stan: Does attentional load interact with trial number affecting pupil size?

We’ll expand the previous model to also include the effect of (centered) trial and its interaction
with cognitive load on one subject’s pupil size. Our new likelihood will look as follows:

$$p\_size_n \sim \mathit{Normal}(\alpha + c\_load_n \cdot \beta_1 + c\_trial_n \cdot \beta_2 + c\_load_n \cdot c\_trial_n \cdot \beta_3, \sigma)$$

Define priors for all the new βs. Since we don’t have more information about the new
predictors, they are sampled from identical prior distributions:

$$\begin{aligned}
\alpha &\sim \mathit{Normal}(1000, 500) \\
\beta_1 &\sim \mathit{Normal}(0, 100) \\
\beta_2 &\sim \mathit{Normal}(0, 100) \\
\beta_3 &\sim \mathit{Normal}(0, 100) \\
\sigma &\sim \mathit{Normal}_+(0, 1000)
\end{aligned}$$

The following Stan model, pupil_int1.stan , is the direct translation of the new priors and
likelihood.

data {
  int<lower = 1> N;
  vector[N] c_load;
  vector[N] c_trial;
  vector[N] p_size;
}
parameters {
  real alpha;
  real beta1;
  real beta2;
  real beta3;
  real<lower = 0> sigma;
}
model {
  // priors including all constants
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta1 | 0, 100);
  target += normal_lpdf(beta2 | 0, 100);
  target += normal_lpdf(beta3 | 0, 100);
  target += normal_lpdf(sigma | 0, 1000)
    - normal_lccdf(0 | 0, 1000);
  target += normal_lpdf(p_size | alpha + c_load * beta1 +
                        c_trial * beta2 +
                        c_load .* c_trial * beta3, sigma);
}

When there are matrices or vectors involved, * indicates matrix multiplication whereas .*
indicates element-wise multiplication; in R %*% indicates matrix multiplication whereas *
indicates element-wise multiplication.
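
As a reminder of the R side of this distinction:

a <- c(1, 2, 3)
b <- c(4, 5, 6)
a * b       # element-wise product: 4 10 18
t(a) %*% b  # matrix (inner) product: 32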

There is, however, an alternative notation that can simplify our code. In the following likelihood, p_size is a vector of N observations (in this case 41), X is the model matrix with dimension $N \times N_{pred}$ (in this case $41 \times 3$), and $\beta$ a vector of $N_{pred}$ (in this case, 3) rows. Assuming that $\beta$ is a vector, we indicate with one line that each parameter $\beta_n$ is sampled from identical prior distributions.

$$\begin{aligned}
p\_size &\sim \mathit{Normal}(\alpha + X \cdot \beta, \sigma) \\
\beta &\sim \mathit{Normal}(0, 100) \\
\sigma &\sim \mathit{Normal}_+(0, 1000)
\end{aligned}$$

The translation into Stan code is the following:


data {
  int<lower = 1> N;
  int<lower = 0> K; // number of predictors
  matrix[N, K] X;   // model matrix
  vector[N] p_size;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower = 0> sigma;
}
model {
  // priors including all constants
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta | 0, 100);
  target += normal_lpdf(sigma | 0, 1000)
    - normal_lccdf(0 | 0, 1000);
  target += normal_lpdf(p_size | alpha + X * beta, sigma);
}

For some likelihood functions, Stan provides a more efficient implementation of the linear
regression than the one manually written in the previous code. It’s critical to understand that,
in general, a more efficient implementation should not only be faster, but should also achieve
the same number of effective samples (or more) than a less efficient implementation (and
should also show convergence). In this case, we can achieve that using _glm functions. We
can replace the last line with the following statement (the order of the arguments is
important):35

target += normal_id_glm_lpdf(p_size | X, alpha, beta, sigma);


The most optimized model, pupil_int.stan , includes this last statement. We prepare the data as follows: First, create centered versions of trial ( c_trial ) and load ( c_load ); then use the function model.matrix() to create the X matrix, which contains our predictors in its columns and omits the intercept with 0 + .


df_pupil <- df_pupil %>%
  mutate(
    c_trial = trial - mean(trial),
    c_load = load - mean(load)
  )
X <- model.matrix(~ 0 + c_load * c_trial, df_pupil)
ls_pupil_X <- list(
  p_size = df_pupil$p_size,
  X = X,
  K = ncol(X),
  N = nrow(df_pupil)
)
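
To see what model.matrix() produces here, a toy illustration with made-up values may help: the resulting matrix has one column per main effect plus one for their element-wise product.

toy <- data.frame(c_load = c(-1, 0, 1), c_trial = c(-2, 0, 2))
model.matrix(~ 0 + c_load * c_trial, toy)
# Columns: c_load, c_trial, and c_load:c_trial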


pupil_int <- system.file("stan_models",
                         "pupil_int.stan",
                         package = "bcogsci")
fit_pupil_int <- stan(pupil_int, data = ls_pupil_X)


print(fit_pupil_int, pars = c("alpha", "beta", "sigma"))


## mean 2.5% 97.5% n_eff Rhat
## alpha 699.50 667.05 731.87 4326 1
## beta[1] 31.27 12.49 50.33 5018 1
## beta[2] -5.81 -8.48 -3.14 4356 1
## beta[3] -1.81 -3.43 -0.20 5080 1
## sigma 104.39 82.74 132.66 3682 1

In Figure 10.9, we plot the 95% CrI of the parameters of interest. We use regex_pars , rather than pars , because we want to capture beta[1] , beta[2] , and beta[3] ; regex_pars uses regular expressions to select the parameters (for information about regular expressions in R, see https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html).
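
For instance, the pattern "beta" matches every parameter name that contains that string:

grep("beta", c("alpha", "beta[1]", "beta[2]", "beta[3]", "sigma"),
     value = TRUE)

## [1] "beta[1]" "beta[2]" "beta[3]"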


df_fit_pupil_int <- as.data.frame(fit_pupil_int)
mcmc_intervals(fit_pupil_int,
               regex_pars = "beta",
               prob_outer = .95,
               prob = .8,
               point_est = "mean")

FIGURE 10.9: 95% CrI of the effect of load, beta[1] , the effect of trial beta[2] , and their
interaction beta[3] .

10.4.3 Logistic regression in Stan: Do set size and trial affect free recall?

We revisit and expand on the analysis, presented in section 4.3, of a subset of the data of Oberauer (2019). In this example, we will investigate whether the length of a list and trial number affect the probability of correctly recalling a word.

As in section 4.3, we assume a Bernoulli likelihood with a logit link function, and the following
priors (recall that the logistic function is the inverse of the logit).

$$\begin{aligned}
correct_n &\sim \mathit{Bernoulli}(\mathit{logistic}(\alpha + X \cdot \beta)) \\
\alpha &\sim \mathit{Normal}(0, 1.5) \\
\beta &\sim \mathit{Normal}(0, 0.1)
\end{aligned}$$

where $\beta$ is a vector of size $K = 2$, $\{\beta_0, \beta_1\}$. Below, in recall.stan , we present the most efficient way to code this in Stan.

data {
  int<lower = 1> N;
  int<lower = 0> K; // number of predictors
  matrix[N, K] X;   // model matrix
  array[N] int correct;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  // priors including all constants
  target += normal_lpdf(alpha | 0, 1.5);
  target += normal_lpdf(beta | 0, .1);
  target += bernoulli_logit_glm_lpmf(correct | X, alpha, beta);
}

The dependent variable, correct , is an array of integers rather than a vector; this is because
vectors are always composed of real numbers, but the Bernoulli likelihood only accepts the
integers 1 or 0. As in the previous example, we are taking advantage of the _glm functions. A
less efficient but more transparent option would be to replace the last statement with:

target += bernoulli_logit_lpmf(correct | alpha + X * beta);

We might want to use bernoulli_logit_lpmf if we want to define a non-linear relationship between the predictors that falls outside the generalized linear model framework. One example would be the following:

target += bernoulli_logit_lpmf(correct| alpha + exp(X * beta));

Another more flexible possibility when we want to indicate a Bernoulli likelihood is to use
bernoulli_lpmf and add the link manually. The last statement of recall.stan would

become the following:

target += bernoulli_lpmf(correct| inv_logit(alpha + X * beta));


The function bernoulli_lpmf can be useful if one wants to try other link functions; see
exercise 10.4.

Finally, the most transparent form (but less efficient) would be the following for-loop:

for (n in 1:N)
  target += bernoulli_lpmf(correct[n] | inv_logit(alpha + X[n] * beta));

To fit the model recall.stan , prepare the data by centering set size and trial number first:


df_recall <- df_recall %>%
  mutate(c_set_size = set_size - mean(set_size),
         c_trial = trial - mean(trial))

Next, we create the model matrix, X , and prepare the data as a list. As in section 10.4.2, we
exclude the intercept from the matrix X using 0 +... . This is because the Stan code that
we are using already takes into account that the first column in the model matrix is going to be
a vector of ones.


X <- model.matrix(~ 0 + c_set_size * c_trial, df_recall)
ls_recall <- list(
  correct = df_recall$correct,
  X = X,
  K = ncol(X),
  N = nrow(df_recall)
)


recall <- system.file("stan_models",
                      "recall.stan",
                      package = "bcogsci")
fit_recall <- stan(recall, data = ls_recall)

After fitting the model we can print and plot summaries of the posterior distribution.

print(fit_recall, pars = c("alpha", "beta"))

## mean 2.5% 97.5% n_eff Rhat
## alpha 1.99 1.39 2.65 3516 1
## beta[1] -0.19 -0.35 -0.02 4014 1
## beta[2] -0.02 -0.09 0.05 3366 1
## beta[3] 0.00 -0.03 0.04 3886 1

In Figure 10.10, we plot the 95% CrI of the parameters of interest.


df_fit_recall <- as.data.frame(fit_recall)
mcmc_intervals(df_fit_recall,
               regex_pars = "beta",
               prob_outer = .95,
               prob = .8,
               point_est = "mean")



FIGURE 10.10: 95% credible intervals of the beta parameters of fit_recall model.
As we did in section 4.3.4, we might want to communicate the posterior in proportions rather than in log-odds (as seen in the parameters beta ). We can do this in R by manipulating the data frame df_fit_recall , or by extracting the samples with extract(fit_recall) . Another alternative, presented here, is to use the generated quantities block. To make the code more compact, we declare the type of each variable and store its content in the same line in recall_prop.stan .


generated quantities {
real average_accuracy = inv_logit(alpha);
vector[K] change_acc = inv_logit(alpha) - inv_logit(alpha - beta);
}

Recall that due to the non-linearity of the scale, the effects depend on the average accuracy,
and the set size and trial that we start from (in this case we are examining the change of one
unit from the average set size and the average trial).
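
We can check this transformation by hand with the posterior means reported above (alpha ≈ 1.99, beta[1] ≈ −0.19); plogis() is R's counterpart of Stan's inv_logit() :

plogis(1.99)                           # average accuracy: ~0.88
plogis(1.99) - plogis(1.99 - (-0.19))  # change for set size: ~ -0.02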


recall_prop <- system.file("stan_models",
                           "recall_prop.stan",
                           package = "bcogsci")
fit_recall <- stan(recall_prop, data = ls_recall)

The plot in Figure 10.11 now shows how the average accuracy deteriorates when the subject
is exposed to a set size larger than the average by one, a trial further than the middle one,
and the interaction of both.

df_fit_recall <- as.data.frame(fit_recall) %>%
  rename(
    set_size = `change_acc[1]`,
    trial = `change_acc[2]`,
    interaction = `change_acc[3]`
  )
mcmc_intervals(df_fit_recall,
               pars = c("set_size", "trial", "interaction"),
               prob_outer = .95,
               prob = .8,
               point_est = "mean") +
  xlab("Change in accuracy")


FIGURE 10.11: Effect of set size, trial, and their interaction on the average accuracy of recall.

The plot in Figure 10.11 shows that our model estimates that, by increasing the set size by one unit, the recall accuracy of the single subject deteriorates by 2%. In contrast, there is hardly any trial effect or interaction between trial and set size.
10.5 Summary

This chapter introduced basic Stan syntax for fitting some standard linear models. Example code covered the normal, binomial, Bernoulli, and log-normal likelihoods. We also saw how to express regression models using the model matrix in Stan syntax.

10.6 Further reading

For further reading on the Hamiltonian Monte Carlo algorithm, see the rather technical review
of Betancourt (2017), or the more conceptual introduction provided by Monnahan, Thorson,
and Branch (2017). A useful article with example R code is Neal (2011). A detailed walk-
through on its implementation is also provided in Chapter 41 of MacKay (2003). The Stan
documentation (Stan Development Team 2021), consisting of a User’s Guide and the
Language Reference Manual are important starting points for going deeper into Stan
programming: https://mc-stan.org/users/documentation/.

10.7 Exercises

Exercise 10.1 A very simple model.

In this exercise we revisit the model from 3.2.1. Assume the following:

1. There is a true underlying time, μ , that the subject needs to press the space bar.
2. There is some noise in this process.
3. The noise is normally distributed (this assumption is questionable given that response
times are generally skewed; we fix this assumption later).

That is, the likelihood for each observation n will be:

$$t_n \sim \mathit{Normal}(\mu, \sigma)$$

a. Decide on appropriate priors and fit this model in Stan. Data can be found in
df_spacebar .

b. Change the likelihood for a log-normal distribution and change the priors. Fit the model in
Stan.

Exercise 10.2 Incorrect Stan model.


We want to fit both response times and accuracy with the same model. We simulate the data
as follows:


N <- 500
df_sim <- tibble(
rt = rlnorm(N, mean = 6, sd = .5),
correct = rbern(N, prob = .85)
)

We build the following model:


data {
int<lower = 1> N;
vector[N] rt;
array[N] int correct;

}
parameters {
real<lower = 0> sigma;
real theta;
}

model {
target += normal_lpdf(mu | 0, 20);
target += lognormal_lpdf(sigma | 3, 1)
for(n in 1:N)
target += lognormal_lpdf(rt[n] | mu, sigma);
target += bernoulli_lpdf(correct[n] | theta);

Why does this model not work?

ls_sim <- list(
rt = df_sim$rt,
correct = df_sim$correct
)
incorrect <- system.file("stan_models",

"incorrect.stan",
package = "bcogsci")
fit_sim <- stan(incorrect, data = ls_sim)

## Error in stanc(file = file, model_code = model_code, model_name = model_name, : 0
##
## Syntax error in 'string', line 13, column 2 to column 5, parsing error:
##
## Ill-formed expression. Expression followed by ";" expected after "target +=".

Try to make it run. (Hint: There are several problems.)

Exercise 10.3 Using Stan documentation.

Edit the simple example with Stan from section 10.2, and replace the normal distribution with a
skew normal distribution. (Don’t forget to add a prior to the new parameter, and check the Stan
documentation or a statistics textbook for more information about the distribution).

Fit the following data:


Y <- rnorm(1000, mean = 3, sd = 10)

Does the estimate of the new parameter make sense?

Exercise 10.4 The probit link function as an alternative to the logit function.

The probit link function is the inverse of the CDF of the standard normal distribution ($\mathit{Normal}(0, 1)$). Since the CDF of the standard normal is usually denoted with the Greek letter $\Phi$ (Phi), the probit is denoted as $\Phi^{-1}$. Refit the model presented in section 10.4.3 changing the logit link function for the probit link (that is, transforming the regression to a constrained space using Phi() in Stan).
You will probably see the following as the model runs; this is because the probit link is less
numerically stable (i.e., under- and overflows) than the logit link in Stan. Don’t worry, it is good
enough for this exercise.

Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can't start sampling from this initial value.

a. Do the results of the coefficients α and β change?


b. Do the results in probability space change?

Exercise 10.5 Examining the position of the queued word on recall.

Refit the model presented in section 10.4.3 and examine whether set size, trial effects, the position of the queued word ( tested in the data set), and their interaction affect free recall. (Tip: You can do this exercise without changing the Stan code.)

How does the accuracy change from position one to position two?

Exercise 10.6 The conjunction fallacy.

Paolacci, Chandler, and Ipeirotis (2010) examined whether the results of some classic
experiments differ between a university pool population and subjects recruited from
Mechanical Turk. We’ll examine whether the results of the conjunction fallacy experiment (or
Linda problem: Tversky and Kahneman 1983) are replicated for both groups.


data("df_fallacy")
df_fallacy

## # A tibble: 268 × 2
## source answer
## <chr> <int>
## 1 mturk 1
## 2 mturk 1

## 3 mturk 1
## # … with 265 more rows
The conjunction fallacy shows that people often fail to regard a combination of events as less
probable than a single event in the combination (Tversky and Kahneman 1983):

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a
student, she was deeply concerned with issues of discrimination and social justice, and also
participated in anti-nuclear demonstrations.

Which is more probable?

a. Linda is a bank teller.


b. Linda is a bank teller and is active in the feminist movement.

The majority of those asked chose option (b) even though it's less probable ($Pr(a \wedge b) \leq Pr(b)$). The data set is named df_fallacy ; it indicates option (a) with 0 and option (b) with 1. Fit a logistic regression in Stan and report:

a. The estimated overall probability of answering (b) ignoring the group.


b. The estimated overall probability of answering (b) for each group.

References

Betancourt, Michael J. 2016. “Identifying the Optimal Integration Time in Hamiltonian Monte
Carlo.”

Betancourt, Michael J. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.”

Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael J.
Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A
Probabilistic Programming Language.” Journal of Statistical Software 76 (1). Columbia Univ.,
New York, NY (United States); Harvard Univ., Cambridge, MA (United States).

Gabry, Jonah, and Tristan Mahr. 2019. bayesplot: Plotting for Bayesian Models.
https://CRAN.R-project.org/package=bayesplot.

Guo, Jiqiang, Jonah Gabry, and Ben Goodrich. 2019. rstan: R Interface to Stan.
https://CRAN.R-project.org/package=rstan.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively
Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15
(1): 1593–1623. http://dl.acm.org/citation.cfm?id=2627435.2638586.

Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012.
The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.
Lunn, D.J., A. Thomas, N. Best, and D. Spiegelhalter. 2000. “WinBUGS-A Bayesian Modelling
Framework: Concepts, Structure, and Extensibility.” Statistics and Computing 10 (4). Springer:
325–37.

MacKay, David JC. 2003. Information Theory, Inference and Learning Algorithms. Cambridge,
UK: Cambridge University Press.

Monnahan, Cole C., James T. Thorson, and Trevor A. Branch. 2017. “Faster Estimation of
Bayesian Models in Ecology Using Hamiltonian Monte Carlo.” Edited by Robert B. O’Hara.
Methods in Ecology and Evolution 8 (3): 339–48. https://doi.org/10.1111/2041-210X.12681.

Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain
Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Taylor
& Francis. https://doi.org/10.1201/b10905-10.

Oberauer, Klaus. 2019. “Working Memory Capacity Limits Memory for Bindings.” Journal of
Cognition 2 (1): 40. https://doi.org/10.5334/joc.86.

Paolacci, Gabriele, Jesse Chandler, and Panagiotis G Ipeirotis. 2010. “Running Experiments
on Amazon Mechanical Turk.” Judgment and Decision Making 5 (5): 411–19.

Plummer, Martin. 2016. “JAGS Version 4.2.0 User Manual.”

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna,
Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Stan Development Team. 2021. “Stan Modeling Language Users Guide and Reference
Manual, Version 2.27.” https://mc-stan.org.

Tversky, Amos, and Daniel Kahneman. 1983. “Extensional Versus Intuitive Reasoning: The
Conjunction Fallacy in Probability Judgment.” Psychological Review 90 (4). American
Psychological Association: 293.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2019. "Rank-Normalization, Folding, and Localization: An Improved R̂ for Assessing Convergence of MCMC."

Wahn, Basil, Daniel P. Ferris, W. David Hairston, and Peter König. 2016. “Pupil Sizes Scale
with Attentional Load and Task Experience in a Multiple Object Tracking Task.” PLOS ONE 11
(12): e0168087. https://doi.org/10.1371/journal.pone.0168087.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi,
Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. ggplot2: Create Elegant Data Visualisations
Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
30. In the specific case of a model with two parameters, e.g., ⟨μ, σ⟩, the physical analogy works quite well: the ⟨x, y⟩ coordinates of the particle would be determined by ⟨μ, σ⟩, and its z coordinate would be established by the height of the unnormalized posterior.↩

31. Incidentally, this log(up(⋅)) is the variable target in a Stan model and lp__ in its output; see Box 10.1.↩

32. In this output, there are some types that are new to the R user (but are also used in C++): reals indicates any of real, real[], vector, or row_vector. A return type R with an input type T indicates that the type of the output of the function is the same as the type of the argument.↩

33. We simplify the output of print in the text after this call, by actually calling summary(fit,
pars = pars, probs = c(0.025, 0.975))$summary .↩

34. The notation for arrays has changed in recent versions; the previous notation would have been real[6, 7, 10] var.↩

35. An extra boost in efficiency can be obtained in regular regressions where X and Y are
data (rather than parameters as in cases of missing data or measurement error), since
this function can be executed on a GPU.↩

Chapter 11 Complex models and reparameterization
Now that we know how to fit simple regression models using Stan syntax, we can turn to more complex cases, such as the hierarchical models that we fit with brms in chapter 5. Fitting such models in Stan allows us a great deal of flexibility. However, a price to be paid when using Stan is that we need to think about how exactly we code the model. In some cases, two pieces of computer code that are mathematically equivalent might behave differently due to the computer's limitations; in this chapter, we will learn some of the more common techniques needed to optimize the model's behavior. In particular, we will learn how to deal with convergence problems using what is called the non-centered reparameterization.

11.1 Hierarchical models with Stan

In the following sections, we will revisit and expand on some of the examples from chapter 5.

11.1.1 Varying intercept model with Stan

Recall that in section 5.2, we fit models to investigate the effect of cloze probability on EEG
averages in the N400 spatiotemporal time window. For our first model, we’ll make the
(implausible) assumption that only the average signal varies across subjects, but all subjects
share the same effect of cloze probability. This means that the likelihood incorporates the
assumption that the intercept, α , is adjusted with the term ui for each subject.

signal_n ∼ Normal(α + u_{subj[n]} + c_cloze_n ⋅ β, σ)

α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
u ∼ Normal(0, τ_u)
τ_u ∼ Normal+(0, 20)
σ ∼ Normal+(0, 50)


Here n represents each observation, the nth row in the data frame, and subj[n] is the subject that corresponds to observation n. We present the mathematical notation of the likelihood with "multiple indexing" [see the Stan users guide, available from mc-stan.org]: the index of u is provided by the vector subj.

Before we discuss the Stan implementation, let's see what the vector μ, the location of the normal likelihood, looks like. There are 2863 observations; that means that μ = {μ_1, μ_2, …, μ_2863}. We have 37 subjects, which means that u = {u_1, u_2, …, u_37}. The following equation shows that the use of multiple indexing allows us to have a vector of adjustments with only 37 different elements, but with a total length of 2863. In the equation below, the multiplication operator ∘ is the Hadamard product (Fieller 2016): when we write X ∘ B, both X and B have the same dimensions m × n, and the cells in location [i, j] (where i = 1, …, m, and j = 1, …, n) of X and B are multiplied to give a matrix that also has dimensions m × n.
$$
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_{2863} \end{pmatrix}
= \begin{pmatrix} \alpha \\ \alpha \\ \vdots \\ \alpha \end{pmatrix}
+ \begin{pmatrix} u_{subj[1]} \\ u_{subj[2]} \\ \vdots \\ u_{subj[2863]} \end{pmatrix}
+ \begin{pmatrix} c\_cloze_1 \\ c\_cloze_2 \\ \vdots \\ c\_cloze_{2863} \end{pmatrix}
\circ \begin{pmatrix} \beta \\ \beta \\ \vdots \\ \beta \end{pmatrix}
$$

Filling in the subject indices (the first observations belong to subject 1, later ones to subjects 2, 3, …, 37) and the centered cloze values of this data set:

$$
= \begin{pmatrix} \alpha \\ \alpha \\ \vdots \\ \alpha \end{pmatrix}
+ \begin{pmatrix} u_{1} \\ u_{1} \\ \vdots \\ u_{37} \end{pmatrix}
+ \begin{pmatrix} -0.476 \\ -0.446 \\ \vdots \\ 0.494 \end{pmatrix}
\circ \begin{pmatrix} \beta \\ \beta \\ \vdots \\ \beta \end{pmatrix}
$$
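To see concretely how multiple indexing works, here is a minimal R sketch (with made-up values, not the actual df_eeg data): three subjects and six observations.

# Toy example of multiple indexing: 3 subjects, 6 observations.
u <- c(0.5, -0.3, 0.1)        # one adjustment per subject
subj <- c(1, 1, 2, 3, 3, 2)   # subject id of each observation
c_cloze <- c(-0.4, 0.2, 0.1, -0.1, 0.3, -0.2)
alpha <- 2
beta <- 1
u[subj]  # the 3 adjustments expanded into one value per observation

## [1]  0.5  0.5 -0.3  0.1  0.1 -0.3

# The vector mu of the likelihood, built as in the equation above:
(mu <- alpha + u[subj] + c_cloze * beta)

## [1] 2.1 2.7 1.8 2.0 2.4 1.5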

In this model, each subject has their own intercept adjustment ui , with i indexing the subjects.
If ui is positive, the subject has a more positive EEG signal than the average over all the
subjects; if ui is negative, then the subject has a more negative EEG signal than the average;
and if ui is 0, then the subject has the same EEG signal as the average. As we discussed in
section 5.2.3, since we are estimating α and u at the same time and we assume that the
average of the u’s is 0 (since it is assumed to be normally distributed with a mean of 0),
whatever the subjects have in common “goes” to α , and u only “absorbs” the differences
between subjects through the variance component τu .

This model is implemented in the file hierarchical1.stan, available in the bcogsci package:

data {
  int<lower = 1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  // The following line creates an array of integers:
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  real<lower = 0> tau_u;
  real alpha;
  real beta;
  vector[N_subj] u;
}
model {
  target += normal_lpdf(alpha | 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50) -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(u | 0, tau_u);
  target += normal_lpdf(signal | alpha + u[subj] +
                                 c_cloze * beta, sigma);
}

In the Stan code above, we use

array [N] int<lower = 1, upper = N_subj> subj;

to define a one-dimensional array of N elements that contains integers (bounded between 1 and N_subj). As explained in Box 10.4, the difference between vectors and one-dimensional arrays is that vectors can only contain real numbers and can be used with matrix algebra functions, whereas arrays can contain any type but can't be used with matrix algebra functions. We use normal_lpdf rather than normal_glm_lpdf since at the moment there is no efficient likelihood implementation of hierarchical generalized linear models.
The following code centers the predictor cloze and stores the data required by the Stan model
in a list. Because we are using subj as a vector of indices, we need to be careful to have
integers starting from 1 and ending in N_subj without skipping any number (but the order of
the subject ids won’t matter).36

data("df_eeg")
df_eeg <- df_eeg %>%
  mutate(c_cloze = cloze - mean(cloze))
ls_eeg <- list(
  N = nrow(df_eeg),
  signal = df_eeg$n400,
  c_cloze = df_eeg$c_cloze,
  subj = df_eeg$subj,
  N_subj = max(df_eeg$subj)
)

Fit the model:

hierarchical1 <- system.file("stan_models",
                             "hierarchical1.stan",
                             package = "bcogsci")
fit_eeg1 <- stan(hierarchical1, data = ls_eeg)

Summary of the model:

print(fit_eeg1, pars = c("alpha", "beta", "sigma", "tau_u"))

##        mean  2.5% 97.5% n_eff Rhat
## alpha  3.66  2.85  4.53  1683    1
## beta   2.31  1.26  3.34  5614    1
## sigma 11.64 11.33 11.94  4651    1
## tau_u  2.18  1.52  3.06  2101    1

11.1.2 Uncorrelated varying intercept and slopes model with Stan

In the following model, we relax the strong assumption that every subject will be affected
equally by the manipulation. For ease of exposition, we start by assuming that (as we did in
section 5.2.3) the adjustments for the intercept and slope are not correlated.

signal_n ∼ Normal(α + u_{subj[n],1} + c_cloze_n ⋅ (β + u_{subj[n],2}), σ)   (11.1)

α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
u_1 ∼ Normal(0, τ_{u_1})
u_2 ∼ Normal(0, τ_{u_2})   (11.2)
τ_{u_1} ∼ Normal+(0, 20)
τ_{u_2} ∼ Normal+(0, 20)
σ ∼ Normal+(0, 50)

We implement this in Stan in hierarchical2.stan :

data {
  int<lower = 1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2] tau_u;
  real alpha;
  real beta;
  matrix[N_subj, 2] u;
}
model {
  target += normal_lpdf(alpha | 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50) -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u[1] | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(tau_u[2] | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(u[, 1] | 0, tau_u[1]);
  target += normal_lpdf(u[, 2] | 0, tau_u[2]);
  target += normal_lpdf(signal | alpha + u[subj, 1] +
                                 c_cloze .* (beta + u[subj, 2]), sigma);
}

In the previous model, we assign the same prior distribution to both tau_u[1] and tau_u[2], and thus in principle we could have written the two statements in one (we multiply by 2 because there are two PDFs that need to be corrected for the truncation):

target += normal_lpdf(tau_u | 0, 20) -
  2 * normal_lccdf(0 | 0, 20);

Fit the model as follows:


hierarchical2 <- system.file("stan_models",
                             "hierarchical2.stan",
                             package = "bcogsci")
fit_eeg2 <- stan(hierarchical2, data = ls_eeg)

## Warning: There were 3 chains where the estimated
## Bayesian Fraction of Missing Information was low. See
## https://mc-stan.org/misc/warnings.html#bfmi-low

## Warning: Examine the pairs() plot to diagnose sampling
## problems

## Warning: Bulk Effective Samples Size (ESS) is too low,
## indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess

## Warning: Tail Effective Samples Size (ESS) is too low,
## indicating posterior variances and tail quantiles may be
## unreliable. Running the chains for more iterations may
## help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

We see that there are warnings. As we increase the complexity and the number of
parameters, the sampler has a harder time exploring the parameter space.

Show the summary of the model below:

print(fit_eeg2, pars = c("alpha", "beta", "tau_u", "sigma"))

##           mean  2.5% 97.5% n_eff Rhat
## alpha     3.66  2.81  4.50  1601 1.00
## beta      2.34  1.09  3.55  4734 1.00
## tau_u[1]  2.19  1.56  3.04  3371 1.00
## tau_u[2]  1.77  0.32  3.47   187 1.01
## sigma    11.62 11.32 11.93  7954 1.00

We see that tau_u[2] has a low number of effective samples ( n_eff ).

The traceplots are displayed in Figure 11.1:

traceplot(fit_eeg2, pars = c("alpha", "beta", "tau_u", "sigma"))

FIGURE 11.1: Traceplots of alpha , beta , tau_u , and sigma from the model fit_eeg2 .

Figure 11.1 shows that the chains of the parameter tau_u[2] are not mixing properly. This parameter is especially problematic because there is not enough data from each subject to estimate it accurately: its estimated mean is quite small (in comparison with sigma), it's bounded by zero, and there is a dependency between this parameter and u[, 2]. This makes the exploration by the sampler quite hard.

Pairs plots can be useful to uncover pathologies in the sampling, since we can visualize
correlations between samples, which are in general problematic. The following code creates a
pair plot where we see the samples of tau_u[2] against some of the adjustments to the
slope u ; see Figure 11.2.

pairs(fit_eeg2, pars = c("tau_u[2]", "u[1,2]", "u[2,2]", "u[3,2]"))


FIGURE 11.2: Pair plots showing relatively strong correlations (funnel-shaped clouds of samples) between the samples of τ2 and some of the by-subject adjustments to the slope.

Compare with tau_u[1] plotted against the by-subject adjustments to the intercept. In Figure
11.3, instead of funnels we see blobs, indicating no strong correlation between the
parameters.

pairs(fit_eeg2, pars = c("tau_u[1]", "u[1,1]", "u[2,1]", "u[3,1]"))


FIGURE 11.3: Pair plots showing no strong correlation (blob-shaped clouds of samples)
between the samples of τ1 and some of the by-subject adjustments to the intercept.

In contrast to Figure 11.3, which shows blob-shaped clouds of samples, Figure 11.2 shows a relatively strong correlation between the samples of some of the parameters of the model, in particular τ2 and the samples of u_{i,2}. This strong correlation is hindering the exploration of the sampler, leading to the warnings we saw in Stan. However, the problem that the sampler faces is, in fact, more serious than what our initial plots show. Stan samples in an unconstrained space where all the parameters can range from minus infinity to plus infinity, and then transforms the parameters back to the constrained space that we specified where, for example, a standard deviation parameter is restricted to be positive. This means that Stan is actually sampling from an auxiliary parameter equivalent to log(tau_u[2]) rather than from tau_u[2]. We can use mcmc_pairs to see the actual funnel; see Figure 11.4.

mcmc_pairs(as.array(fit_eeg2),
           pars = c("tau_u[2]", "u[1,2]"),
           transform = list(`tau_u[2]` = "log"))


FIGURE 11.4: Pair plots showing a strong correlation (funnel-shaped cloud of samples) between the samples of τ2 and one of the by-subject adjustments to the slope (u_{1,2}).

At the neck of the funnel, tau_u[2] is close to zero (and log(tau_u[2]) is a negative number), and thus the adjustment u is constrained to be near 0. This is a problem because a step size that's optimized to work well in the broad part of the funnel will fail to work appropriately in the neck of the funnel, and vice versa; see also Neal's funnel (Neal 2003) and the optimization chapter of Stan's manual (Stan Development Team 2021, sec. 22.7). There are two options: we might just remove the by-subject varying slope, since it's not giving us much information anyway, or we can alleviate this problem by reparameterizing the model. In general, this is the trickiest and probably most annoying part of modeling. A model can be theoretically and mathematically sound, but still fail to converge. The best advice for solving this type of problem is to start small with simulated data where we know the true values of the parameters, and to increase the complexity of the models gradually. Although in this example the problem was clearly in the parameterization of tau_u[2], in many cases the biggest hurdle is to identify where the problem lies. Fortunately, the issue with tau_u[2] is a common problem, which is easy to solve by using a non-centered parameterization (Papaspiliopoulos, Roberts, and Sköld 2007). The following Box explains the specific reparameterization we use for the improved version of our Stan code.

Box 11.1 A simple non-centered parameterization

The sampler can explore the parameter space more easily if its step size is appropriate for
all the parameters. This is achieved when there are no strong correlations between
parameters. We want to assume the following.

u_2 ∼ Normal(0, τ_{u_2})

where u_2 is the column vector of u_{i,2}'s. The index i refers to the subject id.

We can transform u_2 to z-scores as follows. This amounts to centering the parameter, so we can call this the centered parameterization:

$$
z_{u_2} = \frac{u_2 - 0}{\tau_{u_2}}
$$

where

z_{u_2} ∼ Normal(0, 1)

Now z_{u_2} is easier to sample because it doesn't depend on other parameters (in particular, it is no longer conditional on τ_{u_2}) and its scale is 1. Once we have sampled this centered parameter, we can derive the actual parameter we care about by carrying out the reverse operation, which is called a non-centered parameterization:

u_2 = z_{u_2} ⋅ τ_{u_2}
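A quick simulation in R (our own sketch, not part of the book's code) confirms that the reverse operation recovers the intended distribution:

z_u2 <- rnorm(100000, 0, 1)  # the auxiliary standard-normal parameter
tau_u2 <- 2                  # a hypothetical scale
u2 <- z_u2 * tau_u2          # the reverse operation
c(mean(u2), sd(u2))          # approximately 0 and 2, i.e., Normal(0, tau_u2)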

A question that might be raised here is whether using a non-centered parameterization is


always a good idea. Betancourt and Girolami (2013) point out that the extremeness of the
correlation depends on the amount of data, and the efficacy of the parameterization
depends on the relative strength of the data. When there is enough data, this
parameterization is unnecessary and it may be more efficient to use the centered
parameterization. However, cases where there is enough data to render this
parameterization useless are also cases where the partial pooling of the hierarchical
models isn’t needed in the first place. Although data from conventional lab experiments in
psychology, psycholinguistics, and related areas seem to benefit from the non-centered
parameterization, the jury is still out for larger data sets with thousands of subjects from
crowdsourcing websites.
From a mathematical point of view, the following model is equivalent to the one described in Equations (11.1) and (11.2). However, as discussed previously, the computational implementation of the "new" model is more efficient. The following model includes the reparameterization of both adjustments u_1 and u_2; although the reparameterization of u_1 is not strictly necessary (we didn't see any problems in the trace plots), it won't hurt either, and the Stan code will be simpler.

signal_n ∼ Normal(α + u_{subj[n],1} + c_cloze_n ⋅ (β + u_{subj[n],2}), σ)   (11.3)

α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
z_{u_1} ∼ Normal(0, 1)
z_{u_2} ∼ Normal(0, 1)
τ_{u_1} ∼ Normal+(0, 20)   (11.4)
τ_{u_2} ∼ Normal+(0, 20)
σ ∼ Normal+(0, 50)
u_1 = z_{u_1} ⋅ τ_{u_1}
u_2 = z_{u_2} ⋅ τ_{u_2}

The following Stan code ( hierarchical3.stan ) uses the previous parameterization, and introduces some new Stan functions: to_vector() converts a matrix into a long column vector (in column-major order, that is, concatenating the columns from left to right); and std_normal_lpdf() implements the log PDF of a standard normal distribution, that is, a normal distribution with location 0 and scale 1. This function is just a more efficient version of normal_lpdf(... | 0, 1). We also introduce a new optional block called transformed parameters. With each iteration of the sampler, the values of the parameters (i.e., alpha, beta, sigma, tau_u, and z_u) are available in the transformed parameters block, and we can derive new auxiliary variables based on them. In this case, we use z_u and tau_u to obtain u, which then becomes available in the model block.

data {
  int<lower = 1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2] tau_u;
  real alpha;
  real beta;
  matrix[N_subj, 2] z_u;
}
transformed parameters {
  matrix[N_subj, 2] u;
  u[, 1] = z_u[, 1] * tau_u[1];
  u[, 2] = z_u[, 2] * tau_u[2];
}
model {
  target += normal_lpdf(alpha | 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50) -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u[1] | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(tau_u[2] | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += std_normal_lpdf(to_vector(z_u));
  target += normal_lpdf(signal | alpha + u[subj, 1] +
                                 c_cloze .* (beta + u[subj, 2]), sigma);
}

By reparameterizing the model, we can also optimize it further: we convert the matrix z_u into a long column vector, allowing us to use a single call to std_normal_lpdf. Fit the model named hierarchical3.stan.

hierarchical3 <- system.file("stan_models",
                             "hierarchical3.stan",
                             package = "bcogsci")
fit_eeg3 <- stan(hierarchical3, data = ls_eeg)

Verify that the model worked as expected by printing its summary and traceplots; see Figure
11.5.

print(fit_eeg3, pars = c("alpha", "beta", "tau_u", "sigma"))

##           mean  2.5% 97.5% n_eff Rhat
## alpha     3.64  2.83  4.48  1234    1
## beta      2.31  1.06  3.54  3334    1
## tau_u[1]  2.19  1.54  3.01  1226    1
## tau_u[2]  1.76  0.17  3.55  1019    1
## sigma    11.62 11.32 11.92  4924    1

traceplot(fit_eeg3, pars = c("alpha", "beta", "tau_u", "sigma"))



FIGURE 11.5: Traceplots of alpha , beta , tau_u , and sigma from the model fit_eeg3 .
Although the samples of tau_u[2] are still correlated with the adjustments for the slope, u[,2], these latter parameters are not the ones explored by the sampler; the auxiliary parameters, z_u, are the relevant ones. The plots in Figures 11.6 and 11.7 show that although log(tau_u[2]) and u[1,2] are still correlated, log(tau_u[2]) and z_u[1,2] are not.

mcmc_pairs(as.array(fit_eeg3),
           pars = c("tau_u[2]", "u[1,2]"),
           transform = list(`tau_u[2]` = "log"))

FIGURE 11.6: Pair plots showing a clear correlation (funnel-shaped cloud of samples) between the samples of τ2 and one of the by-subject adjustments to the slope.
mcmc_pairs(as.array(fit_eeg3),
           pars = c("tau_u[2]", "z_u[1,2]"),
           transform = list(`tau_u[2]` = "log"))

FIGURE 11.7: Pair plots showing no clear correlation between the samples of τ2 and one of the by-subject auxiliary parameters (z_u) used to build the adjustments to the slope.

11.1.3 Correlated varying intercept varying slopes model

For the model with correlated varying intercepts and slopes, the likelihood remains identical to
the model without a correlation between group-level intercepts and slopes. Priors and
hyperpriors change to reflect the potential correlation between by-subject adjustments to
intercepts and slopes:

signaln ∼ Normal(α + usubj[n],1 + c_clozen ⋅ (β + usubj[n],2 ), σ)

The correlation is indicated in the priors on the adjustments for the vector of by-subject intercepts u_1 and the vector of by-subject slopes u_2.

Priors:

α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
σ ∼ Normal+(0, 50)

$$
\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_u\right),
\quad \text{where } i = \{1, \ldots, N_{subj}\}
$$

$$
\Sigma_u = \begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix}
$$

τ_{u_1} ∼ Normal+(0, 20)
τ_{u_2} ∼ Normal+(0, 20)
ρ_u ∼ LKJcorr(2)

As a first attempt, we write this model following the mathematical notation as closely as possible. We'll see that this will be problematic in terms of efficient sampling and convergence. In this Stan model ( hierarchical_corr.stan ), we use some new functions and types:

corr_matrix[n] M; defines a (square) matrix of n rows and n columns called M, symmetrical around a diagonal of ones.
rep_row_vector(X, n) creates a row vector with n columns filled with X.
quad_form_diag(M, V) creates a quadratic form using the column vector V as a diagonal matrix (a matrix with all zeros except for its diagonal); this function corresponds in Stan to diag_matrix(V) * M * diag_matrix(V) and in R to diag(V) %*% M %*% diag(V). This computes a variance-covariance matrix from the vector of standard deviations, V, and the correlation matrix, M (recall the generation of multivariate data in section 1.6.3); see the small R check after this list.
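As a quick sanity check (ours, with made-up numbers), the R correspondence mentioned in the last bullet point indeed turns a vector of standard deviations and a correlation matrix into a variance-covariance matrix:

tau <- c(2, 0.5)                          # hypothetical standard deviations
rho <- matrix(c(1, .3, .3, 1), ncol = 2)  # hypothetical correlation matrix
diag(tau) %*% rho %*% diag(tau)           # the implied variance-covariance matrix

##      [,1] [,2]
## [1,]  4.0 0.30
## [2,]  0.3 0.25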

data {
  int<lower = 1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2] tau_u;
  real alpha;
  real beta;
  matrix[N_subj, 2] u;
  corr_matrix[2] rho_u;
}
model {
  target += normal_lpdf(alpha | 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50) -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u[1] | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(tau_u[2] | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += lkj_corr_lpdf(rho_u | 2);
  for(i in 1:N_subj)
    target += multi_normal_lpdf(u[i, ] |
                                rep_row_vector(0, 2),
                                quad_form_diag(rho_u, tau_u));
  target += normal_lpdf(signal | alpha + u[subj, 1] +
                                 c_cloze .* (beta + u[subj, 2]), sigma);
}

Problematic aspects of the first model presented in section 11.1.2 (before the
reparameterization), that is, dependencies between parameters, are also present here. Fit the
model as follows:

hierarchical_corr <- system.file("stan_models",
                                 "hierarchical_corr.stan",
                                 package = "bcogsci")
fit_eeg_corr <- stan(hierarchical_corr, data = ls_eeg)

## Warning: There were 37 divergent transitions after
## warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate
## them.

## Warning: There were 1 chains where the estimated
## Bayesian Fraction of Missing Information was low. See
## https://mc-stan.org/misc/warnings.html#bfmi-low

## Warning: Examine the pairs() plot to diagnose sampling
## problems

## Warning: Bulk Effective Samples Size (ESS) is too low,
## indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess

## Warning: Tail Effective Samples Size (ESS) is too low,
## indicating posterior variances and tail quantiles may be
## unreliable. Running the chains for more iterations may
## help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

As we expected, there are warnings and bad mixing of the chains for tau_u[2] ; see also the
traceplots in Figure 11.8.

print(fit_eeg_corr, pars = c("alpha", "beta", "tau_u", "sigma"))

##           mean  2.5% 97.5% n_eff Rhat
## alpha     3.65  2.79  4.53   867 1.00
## beta      2.32  1.14  3.55  1972 1.00
## tau_u[1]  2.19  1.56  2.96  1914 1.00
## tau_u[2]  1.78  0.46  3.49   138 1.02
## sigma    11.62 11.33 11.92  4853 1.00

traceplot(fit_eeg_corr, pars = c("alpha", "beta", "tau_u", "sigma"))


FIGURE 11.8: Traceplots of alpha , beta , tau_u , and sigma from the model
fit_eeg_corr .

The problem (which can also be discovered in a pairs plot) is the same one that we saw
before: There is a strong correlation between tau_u[2] (in fact, log(tau_u[2]) , which is the
parameter dimension that the sampler considers) and u , creating a funnel.
The solution to this problem is the reparameterization of this model. The non-centered
parameterization for this type of model is called the Cholesky factorization (Fieller 2016). The
mathematics and the intuition behind this parameterization are explained in Box 11.2.

Box 11.2 Cholesky factorization

First, some definitions that we will need below. A matrix is square if the number of rows and columns is identical. A square matrix A is symmetric if Aᵀ = A, i.e., if transposing the matrix gives the matrix back. Suppose that A is a known matrix of real numbers. If x is a vector of variables with length p (a p × 1 matrix), then xᵀAx is called a quadratic form in x (xᵀAx will be a scalar, 1 × 1). If xᵀAx > 0 for all x, then A is a positive definite matrix. If xᵀAx ≥ 0 for all x, then A is positive semi-definite.

We encountered correlation matrices first in section 1.6.3. A correlation matrix is always symmetric, has ones along the diagonal, and real values ranging between −1 and 1 on the off-diagonals. Given a correlation matrix ρ_u that is positive definite or semi-definite, we can decompose it into a lower triangular matrix L_u such that L_u L_uᵀ = ρ_u. The matrix L_u is called the Cholesky factor of ρ_u. Intuitively, you can think of L_u as the matrix equivalent of the square root of ρ_u. More details on the Cholesky factorization can be found in Gentle (2007).

$$
L_u = \begin{pmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{pmatrix}
$$

For a model without a correlation between adjustments for the intercept and slope, we
assumed that adjustments u1 and u2 were generated by two independent normal
distributions. But now we want to allow the possibility that the adjustments can have a
non-zero correlation. We can use the Cholesky factorization to generate correlated
random variables in the following way.

1. Generate uncorrelated vectors, z_{u_1} and z_{u_2}, for each vector of adjustments u_1 and u_2, as sampled from Normal(0, 1):

z_{u_1} ∼ Normal(0, 1)
z_{u_2} ∼ Normal(0, 1)

2. By multiplying the Cholesky factor with the z's, generate a matrix that contains two row vectors of correlated variables (with standard deviation 1):

$$
L_u \cdot z_u = \begin{pmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{pmatrix}
\begin{pmatrix} z_{u_1,1} & z_{u_1,2} & \ldots & z_{u_1,N_{subj}} \\ z_{u_2,1} & z_{u_2,2} & \ldots & z_{u_2,N_{subj}} \end{pmatrix}
= \begin{pmatrix} l_{11} z_{u_1,1} & \ldots & l_{11} z_{u_1,N_{subj}} \\ l_{21} z_{u_1,1} + l_{22} z_{u_2,1} & \ldots & l_{21} z_{u_1,N_{subj}} + l_{22} z_{u_2,N_{subj}} \end{pmatrix}
$$

A very informal explanation of why this works is that we are making the variable that corresponds to the slope a function of a scaled version of the intercept.

3. The last step is to scale the previous matrix to the desired standard deviations. We define the diagonal matrix diag_matrix(τ_u) as before:

$$
\begin{pmatrix} \tau_{u_1} & 0 \\ 0 & \tau_{u_2} \end{pmatrix}
$$

Then pre-multiply it by the correlated variables with standard deviation 1 from before:

$$
u = \text{diag\_matrix}(\tau_u) \cdot L_u \cdot z_u =
\begin{pmatrix} \tau_{u_1} & 0 \\ 0 & \tau_{u_2} \end{pmatrix}
\begin{pmatrix} l_{11} z_{u_1,1} & \ldots \\ l_{21} z_{u_1,1} + l_{22} z_{u_2,1} & \ldots \end{pmatrix}
= \begin{pmatrix} \tau_{u_1} l_{11} z_{u_1,1} & \tau_{u_1} l_{11} z_{u_1,2} & \ldots \\ \tau_{u_2} (l_{21} z_{u_1,1} + l_{22} z_{u_2,1}) & \tau_{u_2} (l_{21} z_{u_1,2} + l_{22} z_{u_2,2}) & \ldots \end{pmatrix}
$$

It might be helpful to see how one would implement this in R:

Let’s assume a correlation of 0.8 .

rho <- .8
# Correlation matrix
(rho_u <- matrix(c(1, rho, rho, 1), ncol = 2))

##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0

# Cholesky factor:
# (Transpose it so that it looks the same as in Stan)
(L_u <- t(chol(rho_u)))

##      [,1] [,2]
## [1,]  1.0  0.0
## [2,]  0.8  0.6

# Verify that we recover rho_u.
# Recall that %*% indicates matrix multiplication.
L_u %*% t(L_u)

##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0

1. Generate uncorrelated z's from a standard normal distribution, assuming only 10 subjects.

N_subj <- 10
(z_u1 <- rnorm(N_subj, 0, 1))

##  [1]  0.21876 -1.70185  1.12707  0.27846  0.83054  0.52538
##  [7] -0.00591 -0.20405 -0.52281 -0.59610

(z_u2 <- rnorm(N_subj, 0, 1))

## [1] -0.755  1.409 -0.639  0.160  0.442 -2.245 -0.822 -0.996
## [9]  0.292 -0.237


2. Create the matrix of correlated parameters.

First, create a matrix with the uncorrelated parameters:

# matrix z_u
(z_u <- matrix(c(z_u1, z_u2), ncol = N_subj, byrow = TRUE))

##        [,1]  [,2]   [,3]  [,4]  [,5]   [,6]     [,7]   [,8]
## [1,]  0.219 -1.70  1.127 0.278 0.831  0.525 -0.00591 -0.204
## [2,] -0.755  1.41 -0.639 0.160 0.442 -2.245 -0.82191 -0.996
##        [,9]  [,10]
## [1,] -0.523 -0.596
## [2,]  0.292 -0.237

Then, generate the correlated parameters by pre-multiplying the z_u matrix with L_u.

L_u %*% z_u

##        [,1]   [,2]  [,3]  [,4]  [,5]   [,6]     [,7]   [,8]   [,9]
## [1,]  0.219 -1.702 1.127 0.278 0.831  0.525 -0.00591 -0.204 -0.523
## [2,] -0.278 -0.516 0.518 0.319 0.930 -0.927 -0.49788 -0.761 -0.243
##       [,10]
## [1,] -0.596
## [2,] -0.619

3. Use the following diagonal matrix to scale the z_u.

tau_u1 <- 0.2
tau_u2 <- 0.01
(diag_matrix_tau <- diag(c(tau_u1, tau_u2)))

##      [,1] [,2]
## [1,]  0.2 0.00
## [2,]  0.0 0.01

4. Finally, generate the adjustments for each subject u:

(u <- diag_matrix_tau %*% L_u %*% z_u)

##          [,1]     [,2]    [,3]    [,4]   [,5]     [,6]
## [1,]  0.04375 -0.34037 0.22541 0.05569 0.1661  0.10508
## [2,] -0.00278 -0.00516 0.00518 0.00319 0.0093 -0.00927
##          [,7]     [,8]     [,9]    [,10]
## [1,] -0.00118 -0.04081 -0.10456 -0.11922
## [2,] -0.00498 -0.00761 -0.00243 -0.00619

# The rows are correlated (only roughly ~.8, given just 10 subjects):
cor(u[1, ], u[2, ])

## [1] 0.549

# The variance components can be recovered (approximately) as well:
sd(u[1, ])

## [1] 0.162

sd(u[2, ])

## [1] 0.00603

The reparameterization of the model, which allows for a correlation between adjustments for the intercepts and slopes, in hierarchical_corr2.stan is shown below. The code implements the following new types and functions:

cholesky_factor_corr[2] L_u defines L_u as a lower triangular (2 × 2) matrix which has to be the Cholesky factor of a correlation matrix.
diag_pre_multiply(tau_u, L_u) makes a diagonal matrix out of the vector tau_u and multiplies it by L_u.
to_vector(z_u) makes a long vector out of the matrix z_u.
L_u ~ lkj_corr_cholesky(2) is the Cholesky factor associated with the LKJ correlation distribution. It implies that L_u * L_u' = rho_u ~ lkj_corr(2.0). The symbol ' indicates transposition (in R, this corresponds to the function t()).

data {
  int<lower = 1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2] tau_u;
  real alpha;
  real beta;
  matrix[2, N_subj] z_u;
  cholesky_factor_corr[2] L_u;
}
transformed parameters {
  matrix[N_subj, 2] u;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
}
model {
  target += normal_lpdf(alpha | 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50) -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u[1] | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += normal_lpdf(tau_u[2] | 0, 20) -
    normal_lccdf(0 | 0, 20);
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += std_normal_lpdf(to_vector(z_u));
  target += normal_lpdf(signal | alpha + u[subj, 1] +
                                 c_cloze .* (beta + u[subj, 2]), sigma);
}
generated quantities {
  corr_matrix[2] rho_u = L_u * L_u';
  vector[N_subj] effect_by_subj;
  for(i in 1:N_subj)
    effect_by_subj[i] = beta + u[i, 2];
}

In this Stan model, we also created effect_by_subj in the generated quantities block. This allows us to plot or to summarize by-subject effects of cloze probability.

One can recover the correlation parameter by adding, in the generated quantities section, a 2 × 2 matrix rho_u, defined as rho_u = L_u * L_u';.

Fit the new model:

hierarchical_corr2 <- system.file("stan_models",
                                  "hierarchical_corr2.stan",
                                  package = "bcogsci")
fit_eeg_corr2 <- stan(hierarchical_corr2, data = ls_eeg)

The Cholesky matrix has some elements which are always zero or one, and thus the variance within and between chains (and therefore Rhat) is not defined for them. However, the rest of the parameters of the model have an adequate effective sample size (more than 10% of the total number of post-warmup samples), Rhats are close to one, and the chains are mixing well; see also the traceplots in Figure 11.9.

print(fit_eeg_corr2,
      pars = c("alpha", "beta", "tau_u", "rho_u", "sigma", "L_u"))

##             mean  2.5% 97.5% n_eff Rhat
## alpha       3.65  2.78  4.43  1282    1
## beta        2.33  1.12  3.58  4331    1
## tau_u[1]    2.18  1.52  2.96  1563    1
## tau_u[2]    1.65  0.10  3.48  1095    1
## rho_u[1,1]  1.00  1.00  1.00   NaN  NaN
## rho_u[1,2]  0.16 -0.57  0.75  3535    1
## rho_u[2,1]  0.16 -0.57  0.75  3535    1
## rho_u[2,2]  1.00  1.00  1.00   133    1
## sigma      11.62 11.31 11.94  5090    1
## L_u[1,1]    1.00  1.00  1.00   NaN  NaN
## L_u[1,2]    0.00  0.00  0.00   NaN  NaN
## L_u[2,1]    0.16 -0.57  0.75  3535    1
## L_u[2,2]    0.92  0.64  1.00  2041    1

traceplot(fit_eeg_corr2,
          pars = c("alpha", "beta", "tau_u", "L_u[2,1]", "L_u[2,2]", "sigma"))

FIGURE 11.9: Traceplots of alpha, beta, tau_u, L_u, and sigma from the model fit_eeg_corr2.

Is there a correlation between the by-subject intercept and slope?

Let’s visualize some of the posteriors with the following code (see Figure 11.10):

mcmc_hist(as.data.frame(fit_eeg_corr2),
          pars = c("beta", "rho_u[1,2]"))


FIGURE 11.10: Histograms of the samples of the posteriors of beta and rho_u[1,2] from
the model fit_eeg_corr2 .
Figure 11.10 shows that the posterior distribution of rho_u[1,2] is widely spread out between −1 and +1. One can't really learn from these data whether the by-subject intercepts and slopes are correlated. The broad spread of the posterior indicates that we don't have enough data to estimate this parameter with high enough precision: the posterior is basically just reflecting the prior specification (the LKJcorr prior with parameter η = 2).

11.1.4 By-subject and by-items correlated varying intercept varying slopes model

We extend the previous model by adding by-items intercepts and slopes, and priors and
hyperpriors that reflect the potential correlation between by-items adjustments to intercepts
and slopes:

signal_n ∼ Normal(α + u_{subj[n],1} + w_{item[n],1} + c_cloze_n ⋅ (β + u_{subj[n],2} + w_{item[n],2}), σ)

The correlation is indicated in the priors on the adjustments for the vectors representing the
varying intercepts u1 and varying slopes u2 for subjects, and the varying intercepts w1 and
varying slopes w2 for items.

Priors:

α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
σ ∼ Normal+(0, 50)

$$
\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_u\right)
\qquad
\begin{pmatrix} w_{i,1} \\ w_{i,2} \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_w\right)
$$

$$
\Sigma_u = \begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix}
\qquad
\Sigma_w = \begin{pmatrix} \tau_{w_1}^2 & \rho_w \tau_{w_1} \tau_{w_2} \\ \rho_w \tau_{w_1} \tau_{w_2} & \tau_{w_2}^2 \end{pmatrix}
$$

τ_{u_1} ∼ Normal+(0, 20)
τ_{u_2} ∼ Normal+(0, 20)
ρ_u ∼ LKJcorr(2)
τ_{w_1} ∼ Normal+(0, 20)
τ_{w_2} ∼ Normal+(0, 20)
ρ_w ∼ LKJcorr(2)

The translation to Stan ( hierarchical_corr_by.stan ) looks as follows:

data {
  int<lower = 1> N;
  vector[N] signal;
  int<lower = 1> N_subj;
  int<lower = 1> N_item;
  vector[N] c_cloze;
  array[N] int<lower = 1, upper = N_subj> subj;
  array[N] int<lower = 1, upper = N_item> item;
}
parameters {
  real<lower = 0> sigma;
  vector<lower = 0>[2] tau_u;
  vector<lower = 0>[2] tau_w;
  real alpha;
  real beta;
  matrix[2, N_subj] z_u;
  matrix[2, N_item] z_w;
  cholesky_factor_corr[2] L_u;
  cholesky_factor_corr[2] L_w;
}
transformed parameters {
  matrix[N_subj, 2] u;
  matrix[N_item, 2] w;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
  w = (diag_pre_multiply(tau_w, L_w) * z_w)';
}
model {
  target += normal_lpdf(alpha | 0, 10);
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 50) -
    normal_lccdf(0 | 0, 50);
  target += normal_lpdf(tau_u | 0, 20) -
    2 * normal_lccdf(0 | 0, 20);
  target += normal_lpdf(tau_w | 0, 20) -
    2 * normal_lccdf(0 | 0, 20);
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += lkj_corr_cholesky_lpdf(L_w | 2);
  target += std_normal_lpdf(to_vector(z_u));
  target += std_normal_lpdf(to_vector(z_w));
  target += normal_lpdf(signal | alpha + u[subj, 1] + w[item, 1] +
                                 c_cloze .* (beta + u[subj, 2] + w[item, 2]),
                                 sigma);
}
generated quantities {
  corr_matrix[2] rho_u = L_u * L_u';
  corr_matrix[2] rho_w = L_w * L_w';
}

Add item as a number to the data and store it in a list:

df_eeg <- df_eeg %>%
  mutate(item = as.numeric(as.factor(item)))
ls_eeg <- list(
  N = nrow(df_eeg),
  signal = df_eeg$n400,
  c_cloze = df_eeg$c_cloze,
  subj = df_eeg$subj,
  item = df_eeg$item,
  N_subj = max(df_eeg$subj),
  N_item = max(df_eeg$item)
)

Fit the model:

hierarchical_corr_by <- system.file("stan_models",
                                    "hierarchical_corr_by.stan",
                                    package = "bcogsci")
fit_eeg_corr_by <- stan(hierarchical_corr_by, data = ls_eeg)

Print the summary:

print(fit_eeg_corr_by,
      pars = c("alpha", "beta", "sigma", "tau_u", "tau_w",
               "rho_u", "rho_w"))

##             mean  2.5% 97.5% n_eff Rhat
## alpha       3.67  2.79  4.55  1974    1
## beta        2.29  0.90  3.63  4030    1
## sigma      11.49 11.19 11.80  5975    1
## tau_u[1]    2.19  1.55  3.01  1601    1
## tau_u[2]    1.49  0.09  3.34  1067    1
## tau_w[1]    1.51  0.83  2.18  1663    1
## tau_w[2]    2.27  0.26  4.21   935    1
## rho_u[1,1]  1.00  1.00  1.00   NaN  NaN
## rho_u[1,2]  0.14 -0.60  0.77  3872    1
## rho_u[2,1]  0.14 -0.60  0.77  3872    1
## rho_u[2,2]  1.00  1.00  1.00   629    1
## rho_w[1,1]  1.00  1.00  1.00   NaN  NaN
## rho_w[1,2] -0.41 -0.89  0.31  2143    1
## rho_w[2,1] -0.41 -0.89  0.31  2143    1
## rho_w[2,2]  1.00  1.00  1.00   180    1

The correlations of interest are rho_u[1,2] and rho_w[1,2] ; the summary above shows that
the data are far too sparse to get tight estimates of these parameters: both posteriors are
widely spread out.

This completes our review of hierarchical models and their implementation in Stan. The importance of coding a hierarchical model directly in Stan rather than using brms is that this increases the flexibility of the type of models that we can fit. In fact, we will see in chapters 17-20 that the same "machinery" can be used to have hierarchical parameters in cognitive models.

11.2 Summary

In this chapter, we learned to fit the four standard types of hierarchical models that we encountered in earlier chapters:

The by-subjects varying intercepts model.
The by-subjects varying intercepts and varying slopes model without any correlation.
The by-subjects varying intercepts and varying slopes model with correlation.
The hierarchical model, with a full variance-covariance matrix for both subjects and items.

We also learned about some important and powerful tools for making the Stan models more
efficient at sampling: the non-centered parameterization and the Cholesky factorization. One
important takeaway was that if data are sparse, the posteriors will just reflect the priors. We
saw examples of this situation when investigating the posteriors of the correlation parameters.

11.3 Further reading

Gelman and Hill (2007) provides a comprehensive introduction to Bayesian hierarchical models, although that edition does not use Stan but rather WinBUGS. Sorensen, Hohenstein, and Vasishth (2016) is a short tutorial on hierarchical modeling using Stan, especially tailored for psychologists and linguists.

11.4 Exercises

Exercise 11.1 A log-normal model in Stan.

Refit the Stroop example from section 5.3 in Stan ( df_stroop ).

Assume the following likelihood and priors:

rt_n ∼ LogNormal(α + u_{subj[n],1} + c_cond_n ⋅ (β + u_{subj[n],2}), σ)

α ∼ Normal(6, 1.5)
β ∼ Normal(0, 0.1)
σ ∼ Normal+(0, 1)
τ_{u_1} ∼ Normal+(0, 1)
τ_{u_2} ∼ Normal+(0, 1)
ρ_u ∼ LKJcorr(2)

Exercise 11.2 A by-subjects and by-items hierarchical model with a log-normal likelihood.

Revisit the question "Are subject relatives easier to process than object relatives?" Fit the model from exercise 5.2 using Stan.

Exercise 11.3 A hierarchical logistic regression with Stan.

Revisit the question "Is there a Stroop effect in accuracy?" Fit the model from exercise 5.6 using Stan.

Exercise 11.4 A distributional regression model of the effect of cloze probability on the N400.

In section 5.2.6, we saw how to fit a distributional regression model. We might want to extend
this approach to Stan. Fit the EEG data to a hierarchical model with by-subject and by-items
varying intercept and slopes, and in addition assume that the residual standard deviation (the
scale of the normal likelihood) can vary by subject.

signal_n ∼ Normal(α + u_{subj[n],1} + w_{item[n],1} + c_cloze_n ⋅ (β + u_{subj[n],2} + w_{item[n],2}), σ_n)

σ_n = exp(α_σ + u_{σ,subj[n]})

α_σ ∼ Normal(0, log(50))
u_σ ∼ Normal(0, τ_{u_σ})
τ_{u_σ} ∼ Normal+(0, 5)

To fit this model, take into account that sigma is now a vector, and it is a transformed parameter which depends on two parameters: alpha_sigma and the vector with N_subj elements u_sigma. In addition, u_sigma depends on the hyperparameter tau_u_sigma (τ_{u_σ}). (Using a non-centered parameterization for u_sigma speeds up the model fit considerably.)

References

Betancourt, Michael J., and Mark Girolami. 2013. “Hamiltonian Monte Carlo for Hierarchical
Models.”

Fieller, Nick. 2016. Basics of Matrix Algebra for Statistics with R. Boca Raton, FL: CRC Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge University Press.

Gentle, James E. 2007. “Matrix Algebra: Theory, Computations, and Applications in Statistics.”
Springer Texts in Statistics 10. New York, NY: Springer.

Neal, Radford M. 2003. “Slice Sampling.” Annals of Statistics 31 (3). The Institute of
Mathematical Statistics: 705–67. https://doi.org/10.1214/aos/1056562461.

Papaspiliopoulos, Omiros, Gareth O. Roberts, and Martin Sköld. 2007. “A General Framework
for the Parametrization of Hierarchical Models.” Statistical Science 22 (1). The Institute of
Mathematical Statistics: 59–73. https://doi.org/10.1214/088342307000000014.
Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed
Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.”
Quantitative Methods for Psychology 12 (3): 175–200.

Stan Development Team. 2021. “Stan Modeling Language Users Guide and Reference
Manual, Version 2.27.” https://mc-stan.org.

36. If this is not the case, we can do as.numeric(as.factor(df_eeg$subj)) to transform a vector where some numbers are skipped into a vector with consecutive numbers. For example, both as.numeric(as.factor(c(1, 3, 4, 7, 9))) and as.numeric(as.factor(paste0("subj", c(1, 3, 4, 7, 9)))) will give as output 1 2 3 4 5.↩

Chapter 12 Custom distributions in Stan

Stan includes a large number of distributions, but what happens if we need a distribution that
is not provided? In many cases, we can simply build a custom distribution by combining the
ever-growing number of functions available in the Stan language.

12.1 A change of variables with the reciprocal normal distribution

In previous chapters, when faced with response times, we assumed a log-normal distribution.
The log-normal distribution moves the inferences relating to the location parameter into a
multiplicative frame (see Box 4.3). Another alternative, however, is to assume a “reciprocal”-
normal distribution (rec-normal) of response times. That is, we may want to assume that the
reciprocal of the response times are normally distributed. This can happen if the Box-Cox
variance stabilizing transform procedure suggests that a reciprocal transformation is needed
(Box and Cox 1964).

1/y ∼ Normal(μ, σ)  ⟺  y ∼ RecNormal(μ, σ)   (12.1)

An interesting aspect of the rec-normal is that it affords an interpretation of the location parameter in terms of rate or speed rather than time. Analogously to the case of the log-normal, neither the location μ nor the scale σ is on the same scale as the dependent variable y. These parameters are on the scale of the transformed dependent variable (here, 1/y) that is normally distributed.

By setting μ = 0.002 and σ = 0.0004, we generate data that look right-skewed and not unlike a distribution of response times; the code below shows some summary statistics of the generated data.

N <- 100
mu <- .002
sigma <- .0004
rt <- 1 / rnorm(N, mu, sigma)
c(mean(rt), sd(rt), min(rt), max(rt))

## [1] 520 112 360 926

Figure 12.1 shows the distribution generated with the code shown above.


FIGURE 12.1: Distribution of synthetic data with a rec-normal distribution with parameters
μ = .002 and σ = .0004 .

We can fit the rec-normal distribution to the response times in a simple model with a normal
likelihood, by storing the reciprocal of the response times ( 1/RT ) in the vector variable
recRT (rather than storing the “raw” response times):
model {
  // priors go here
  target += normal_lpdf(recRT | mu, sigma);
}

One issue here is that the parameters of the likelihood, mu and sigma, are going to be very far away from the unit scale (0.002 and 0.0004, respectively). Due to the way Stan's sampler is built, parameters that are too small (much smaller than 1) or too large (much larger than 1) can cause convergence problems. A straightforward solution is to fit the normal distribution to parameters expressed in reciprocal seconds rather than reciprocal milliseconds; this makes the parameter values 1000 times larger (2 and 0.4, respectively). We can do this using the transformed parameters block. Although we use mu and sigma to fit the data, the priors are defined on the parameters mu_s and sigma_s (μ_s and σ_s).

Unless one can rely on previous estimates, finding good priors for μs and σs is not trivial. To
define appropriate priors, we would need to start with relatively arbitrary priors, inspect the
prior predictive distributions, adjust the priors, and repeat the inspection of prior predictive
distributions until these distributions start to look realistic. In the interest of conserving space,
we skip this iterative process here, and assign the following priors:

μs ∼ Normal(2, 2)

σs ∼ Normal+ (0.4, 0.2)
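To give a flavor of one iteration of the process we just skipped, the following minimal R sketch (our own addition, not code from the bcogsci package) draws one pair of values from these priors and simulates the response times they imply:

# One round of a prior predictive check for the priors above:
mu_s <- rnorm(1, 2, 2)
# Truncate the sigma_s prior at zero by rejection sampling:
sigma_s <- -1
while (sigma_s <= 0) sigma_s <- rnorm(1, 0.4, 0.2)
# Response times implied by these draws, under the rec-normal:
rt_pred <- 1 / rnorm(1000, mu_s / 1000, sigma_s / 1000)
summary(rt_pred)
# Implausible (e.g., negative or astronomically large) predicted RTs
# would prompt adjusting the priors and repeating this step.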

data {
  int<lower = 1> N;
  vector[N] recRT;
}
parameters {
  real mu_s;
  real<lower = 0> sigma_s;
}
transformed parameters {
  real mu = mu_s / 1000;
  real sigma = sigma_s / 1000;
}
model {
  target += normal_lpdf(mu_s | 2, 2);
  target += normal_lpdf(sigma_s | 0.4, 0.2) -
    normal_lccdf(0 | 0.4, 0.2);
  target += normal_lpdf(recRT | mu, sigma);
}

Fit and display the summary of the previous model:

normal_recrt <- system.file("stan_models",
                            "normal_recrt.stan",
                            package = "bcogsci")
fit_rec <- stan(normal_recrt,
                data = list(N = N, recRT = 1 / rt))

print(fit_rec, pars = c("mu", "sigma"), digits = 4)

##         mean   2.5%  97.5% n_eff Rhat
## mu    0.0020 0.0019 0.0021  2992    1
## sigma 0.0004 0.0003 0.0005  3134    1

Is a rec-normal likelihood more appropriate than a log-normal likelihood? As things stand, we cannot compare the models with these two likelihoods. This is because the dependent variables are different: we cannot compare reciprocal response times with untransformed response times on the millisecond scale. Model comparison with the Bayes factor or cross-validation can only compare models with the same dependent variables; see chapters 14-16.

If we do want to compare the reciprocal-normal likelihood and the log-normal likelihood, we have to set up the models with the two likelihoods in such a way that the dependent measure is on the raw millisecond scale in each model. This means that for the reciprocal-normal likelihood, the model will receive as data raw reading times in milliseconds, and these will be treated as a transformed random variable from reciprocal reading times. This approach is discussed next, but requires knowledge of the Jacobian adjustment (see Box 12.1).

Box 12.1 Understanding the Jacobian adjustment in change of variables.

Some background knowledge is necessary to understand Jacobians. Suppose that X is a continuous random variable with PDF f_X(x). Then, the relationship between the PDF f_X(x) and the CDF F_X(x = a), where a is some specific instance of X, is:

$$
F_X(a) = \int_{-\infty}^{a} f_X(x)\, dx
$$

This also means that if we differentiate F_X(a), we get back f_X(x); this fact is a consequence of the fundamental theorem of calculus. The fundamental theorem states the following: Let f be a continuous real-valued function defined on a closed interval [a, b]. Let F be the function defined, for all x in [a, b], by

$$
F(c) = \int_{a}^{c} f(x)\, dx
$$

Then, F is continuous on [a, b], differentiable on the open interval (a, b), and

$$
\frac{d(F(x))}{dx} = f(x)
$$

for all x in (a, b).


With the above background, we now explain the Jacobian in the case of univariate
distributions. In this book, we don’t need to know the Jacobian in the case of a
multivariate distribution, but see section 6.7 in Ross (2002) for more on that topic.

Suppose you have a continuous random variable X that has a particular PDF associated
with it: X ∼ fX (x) . Now, if this random variable is transformed such that Y = g(X) , the
question arises: what is the PDF of Y ?

Here, we have a situation where we have transformed a random variable X to Y ; this is


called a change of variables.

There is a theorem in statistics which states the following (the statement of the theorem,
and its proof, are adapted from Ross 2002):

Let X be a continuous random variable with probability density function fX . Suppose that
g(x) is a strict monotone (increasing or decreasing) function, differentiable and (thus
continuous) function of x. Then the random variable Y defined by Y = g(X) has a
probability density function defined by

−1 d −1
fX (g (y)) ∣ g (y)∣ if y = g(x) for some x
dx
fY (y) = {
0 if y ≠ g(x) for all x.

where g
−1
(y) is defined to be equal to the value of x such that g(x) = y .

The proof of this theorem goes as follows. Suppose that y = g(x) for some x. Then, the cumulative distribution function of Y is F_Y(y). This CDF tells us the probability P(Y ≤ y); but since Y = g(X), we can write this probability as P(g(X) ≤ y). So:

F_Y(y) = P(g(X) ≤ y)

Now, consider the term P(g(X) ≤ y). In particular, consider the term g(X); applying the inverse of the function g(⋅) to g(X) gives us back X: g^{-1}(g(X)) = X. Applying the inverse to both sides of the ≤ sign in P(g(X) ≤ y) gives us:

P(X ≤ g^{-1}(y))

But the above expression is the probability that the CDF of X gives us (recall that g^{-1}(y) = x):

P(X ≤ g^{-1}(y)) = F_X(g^{-1}(y))

Now, we know (see the background above) that the PDF f_Y(y) can be derived from the CDF F_Y(y): f_Y(y) = d(F_Y(y))/dy. Differentiating the CDF yields:

$$
f_Y(y) = f_X(g^{-1}(y)) \frac{d(g^{-1}(y))}{dy}
$$

If you have forgotten your calculus (or didn't study it in school), it is not at all obvious how the above differentiation comes about. Here, we are using the fact that:

$$
F_Y(y) = F_X(g^{-1}(y)) = \int f_X(g^{-1}(y))\, dy
$$

Recall (the fundamental theorem of calculus) that differentiating the integral ∫ f_X(g^{-1}(y)) dy will give us f_X(g^{-1}(y)).

We now show how to carry out the following differentiation:

$$
\frac{d(F_Y(y))}{dy} = \frac{d}{dy}\left(F_X(g^{-1}(y))\right)
$$

We use something called the chain rule from calculus (Salas, Etgen, and Hille 2003). To make things typographically easier to follow, rewrite w(y) = g^{-1}(y). Then, let

u = w(y)

Differentiating this function gives us (w′(y) is just a shorthand way of writing the derivative of the function w(⋅)):

$$
\frac{du}{dy} = w'(y)
$$

Also, let:

z = F_X(u)

Differentiating this function gives us:

$$
\frac{dz}{du} = F'_X(u) = f_X(u)
$$

By the chain rule:

$$
\frac{du}{dy} \times \frac{dz}{du} = w'(y) f_X(u) = \frac{d}{dy}(g^{-1}(y))\, f_X(g^{-1}(y))
$$

That is how we obtain the result:

$$
f_Y(y) = f_X(g^{-1}(y)) \frac{d(g^{-1}(y))}{dy}
$$

Notice that d(g^{-1}(y))/dy (this is the rate of change of the function) has to be non-negative because g^{-1}(y) is non-decreasing. That is why we take the absolute value of the derivative:

$$
f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d(g^{-1}(y))}{dy} \right|
$$

Next, consider some examples.

Example 1:

Suppose that we have reciprocal reading times (call this random variable X) in 1/milliseconds, which are assumed to come from some PDF f_X(x) (say, a normal distribution). Suppose we transform this random variable to be on the milliseconds scale:

$$Y = g(X) = \frac{1}{X}$$

The question now is: what is the PDF f_Y(y) of the transformed random variable Y?

The inverse of g(x) is $x = g^{-1}(y) = \frac{1}{y}$, and its derivative is $\frac{d(g^{-1}(y))}{dy} = -\frac{1}{y^2}$.

Using the theorem proved above, the PDF of Y is:

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d(g^{-1}(y))}{dy} \right| = f_X(g^{-1}(y)) \left| -\frac{1}{y^2} \right|$$

$$f_Y(y) = f_X(g^{-1}(y)) \, \frac{1}{y^2}$$

Rewriting this in terms of f_X, which is the Normal distribution:

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma\, y^2} \exp\left(-\frac{1}{2}\left(\frac{1/y - \mu}{\sigma}\right)^2\right)$$
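We can check this result numerically in R (a minimal sketch; the parameter values are arbitrary and chosen so that X is virtually never negative): simulate X on the reciprocal scale and compare the empirical density of Y = 1/X with the derived f_Y(y).

mu <- 2      # arbitrary values on the reciprocal scale
sigma <- 0.4
x <- rnorm(100000, mu, sigma) # X: reciprocal reading times
y <- 1 / x                    # Y: reading times
# The derived density of Y:
f_Y <- function(y) dnorm(1 / y, mu, sigma) / y^2
# Overlay the derived density on the empirical one:
hist(y[y > 0 & y < 2], freq = FALSE, breaks = 100, xlab = "y", main = "")
curve(f_Y, from = 0.2, to = 2, add = TRUE)

The histogram and the curve should coincide, up to sampling noise.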

Example 2

Suppose that we have a continuous random variable X with PDF f_X(x). Suppose now that we transform this random variable to Y = log(X). What is the PDF of Y?

Here, y = g(x) = log(x), and the inverse is x = g^{-1}(y) = exp(y). The derivative of the inverse is $\frac{d(g^{-1}(y))}{dy} = \exp(y)$.

Using the theorem above, the PDF of Y is:

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d(g^{-1}(y))}{dy} \right| = f_X(g^{-1}(y)) \left| \exp(y) \right|$$

$$f_Y(y) = f_X(g^{-1}(y)) \exp(y)$$

Example 3

Suppose that we have a continuous random variable X with PDF f_X(x), the normal distribution. Suppose now that we transform this random variable to Y = exp(X). What is the PDF of Y?

Here, y = g(x) = exp(x) and x = g^{-1}(y) = log(y). The derivative of g^{-1}(y) is $\frac{d(g^{-1}(y))}{dy} = \frac{1}{y}$.

Using the theorem,

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d(g^{-1}(y))}{dy} \right| = f_X(g^{-1}(y)) \, \frac{1}{y}$$

Rewriting the PDF of Y in terms of f_X(⋅):

$$f_Y(y) = f_X(\log(y)) \, \frac{1}{y} = \frac{1}{y\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\left(\frac{\log y - \mu}{\sigma}\right)^2\right)$$
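This last expression is precisely the PDF of the log-normal distribution. As a quick sanity check (a sketch with arbitrary parameter values), the derived density matches R's built-in dlnorm():

y <- c(0.5, 1, 2, 10)
mu <- 0
sigma <- 1
# The derived density: f_X(log(y)) * 1/y
dnorm(log(y), mu, sigma) / y
# R's built-in log-normal density:
dlnorm(y, meanlog = mu, sdlog = sigma)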

12.1.1 Scaling a probability density with the Jacobian adjustment

To work with the original dependent variable ( RT rather than 1/RT ) we need to do a change
of variables: If the random variable X represents the reciprocal reading times (1/RT ), we can
transform this random variable to a new one, Y = 1/X = 1/(1/RT ) = RT , yielding a
transformed random variable Y which represents reading time.

This change of variables requires an adjustment to the (unnormalized log) posterior probability to account for the distortion caused by the transform.37 The probability must be scaled by a Jacobian adjustment, which, in a univariate case such as this one, is the absolute value of the derivative of the transform; see Box 12.1.38

$$p(RT_n \mid \mu, \sigma) = \text{Normal}(1/RT_n \mid \mu, \sigma) \cdot \left| \frac{d}{dRT_n} 1/RT_n \right| = \text{Normal}(1/RT_n \mid \mu, \sigma) \cdot \left| -1/RT_n^2 \right| = \text{Normal}(1/RT_n \mid \mu, \sigma) \cdot 1/RT_n^2$$
If we omit the Jacobian adjustment, we are essentially fitting an incorrect likelihood (this will
be shown later in this chapter). The discrepancy between the correct and incorrect posterior
would depend on how large the Jacobian adjustment for our model is.

Because Stan works in log-space, rather than multiplying by the Jacobian adjustment, we add its logarithm to the log-probability density ($\log(1/RT_n^2) = -2 \cdot \log(RT_n)$). The log likelihood (of μ and σ) with its adjustment based on an individual observation, RT_n, would be the following:

$$\log \mathcal{L} = \log(\text{Normal}(1/RT_n \mid \mu, \sigma)) - 2 \cdot \log(RT_n)$$

We obtain the log-likelihood based on all the N observations by summing the log-likelihood of individual observations.

$$\log \mathcal{L} = \sum_{n=1}^{N} \log(\text{Normal}(1/RT_n \mid \mu, \sigma)) - \sum_{n=1}^{N} 2 \cdot \log(RT_n)$$
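The same computation can be sketched directly in R (the parameter values and the data below are arbitrary):

mu <- 0.002
sigma <- 0.0004
rt <- c(350, 420, 510) # hypothetical reading times in ms
# Adjusted log-likelihood of mu and sigma given rt:
sum(dnorm(1 / rt, mu, sigma, log = TRUE)) - sum(2 * log(rt))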

In Stan, this summing up is done as follows. The function normal_lpdf applied to a vector already returns the sum of the individual log-probability densities. The Jacobian adjustment 2 * log(RT), which returns a vector of values, has to be summed up and added in manually (the term therefore becomes -sum(2 * log(RT))).

target += normal_lpdf(1 ./ RT | mu, sigma)
          - sum(2 * log(RT));

Before fitting the model with the change of variables, we are going to truncate the distribution.
We didn’t encounter negative values in our synthetic data, but this was because the
distribution was not too spread out; that is, the scale was much smaller than the location (
σ << μ ). However, in principle we could end up generating negative values. For this reason,
we truncate the underlying normal distribution (for more details, see Box 4.1). The reciprocal
truncated normal distribution has been argued to be an appropriate model of response times,
neural inter-spike intervals, and latency distributions of saccades in a simple optimality model
in which reward is maximized to yield an optimal response rate (Harris et al. 2014; Harris and
Waddington 2012).

Because we have N observations, the truncation consists of adding - N * normal_lccdf(0 | mu, sigma) to target; also see section 10.4.1. The complete model, recnormal_rt.stan, including the truncation, is as follows:

data {
  int<lower = 1> N;
  vector[N] RT;
}
parameters {
  real mu_s;
  real<lower = 0> sigma_s;
}
transformed parameters {
  real mu = mu_s / 1000;
  real sigma = sigma_s / 1000;
}
model {
  target += normal_lpdf(mu_s | 2, 2);
  target += normal_lpdf(sigma_s | 0.4, 0.2)
            - normal_lccdf(0 | 0.4, 0.2);
  target += normal_lpdf(1 ./ RT | mu, sigma)
            - N * normal_lccdf(0 | mu, sigma)
            - sum(2 * log(RT));
}

Next, generate data from a reciprocal truncated normal distribution:


N <- 100
mu <- .002
sigma <- .0004
rt <- 1 / rtnorm(N, mu, sigma, a = 0)

Fit the model to the data:

recnormal_rt <- system.file("stan_models",
"recnormal_rt.stan",
package = "bcogsci"
)
fit_rec <- stan(recnormal_rt, data = list(N = N, RT = rt))

Print the posterior summary:

print(fit_rec, pars = c("mu", "sigma"), digits = 4)

##         mean   2.5%  97.5% n_eff Rhat
## mu    0.0020 0.0020 0.0021  3014    1
## sigma 0.0004 0.0003 0.0004  3187    1

We get the same results as before, but now we could potentially compare this model to one
that assumes a log-normal likelihood, for example.

An important question is the following: Does every transformation of a random variable require a Jacobian adjustment? Essentially, if we assign a distribution to a random variable and then transform it, this does not require a Jacobian adjustment. This is the case when, in Stan syntax, we only have the random variable at the left of the pipe | in a PDF or PMF.

Alternatively, if we apply a transformation to a random variable first, and then assign a distribution to the transformed random variable afterwards, we have a change of variables. This requires a Jacobian adjustment. This is the case when we have a transformation of a random variable at the left of the pipe | in the PDF or PMF. This is the reason why 1 ./ RT requires a Jacobian adjustment, but not mu_s, mu, sigma_s, or sigma in the previous model.

As a last step, we can encapsulate our new distribution in a function, and also create a
random number generator function. This is done in a block called functions . To create a
function, we need to specify the type of every argument and the type that the function returns.

As a simple example, if we wanted a function to center a vector, we could write it as follows:


functions {
  vector center(vector x) {
    vector[num_elements(x)] centered;
    centered = x - mean(x);
    return centered;
  }
}
data {
  ...
}
parameters {
  ...
}
model {
  ...
}

We want to create a log(PDF) function, similar to the native Stan functions that end in _lpdf. Our function will take as arguments a vector RT, and real numbers mu and sigma. In _lpdf functions, (some of) the arguments (e.g., RT, mu, and sigma) can be vectorized, but the output of the function is always a real number: the sum of the log(PDF) evaluated at every value of the random variable at the left of the pipe. As we show below, to do that we just move the right-hand side of target in our original model inside our new function. See the further reading at the end of this chapter for more about Stan functions.

For our custom random number generator function (which should end with _rng), we have the added complication that the function is truncated. We opt for a less-than-optimally efficient implementation for the sake of clarity: we simply generate random values from a normal distribution and take the reciprocal; if the resulting number is above zero, it is returned, otherwise a new number is drawn. A more efficient but more complex implementation can be found in section 18.10 of the Stan user guide (Stan Development Team 2021).
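The same rejection scheme can be sketched in R as follows (recnormal_rng_r is a hypothetical helper name, and the parameter values in the last line are arbitrary):

recnormal_rng_r <- function(mu, sigma) {
  pred_rt <- 0
  # Keep drawing until a positive value is obtained:
  while (pred_rt <= 0) {
    pred_rt <- 1 / rnorm(1, mu, sigma)
  }
  pred_rt
}
recnormal_rng_r(mu = 0.002, sigma = 0.0004) # one simulated RT in ms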

The complete code can be found in recnormal_rt_f.stan , and is also shown below:

functions {
  real recnormal_lpdf(vector RT, real mu, real sigma){
    real lpdf;
    lpdf = normal_lpdf(1 ./ RT | mu, sigma)
           - num_elements(RT) * normal_lccdf(0 | mu, sigma)
           - sum(2 * log(RT));
    return lpdf;
  }
  real recnormal_rng(real mu, real sigma){
    real pred_rt = 0;
    while (pred_rt <= 0)
      pred_rt = 1 / normal_rng(mu, sigma);
    return pred_rt;
  }
}
data {
  int<lower = 1> N;
  vector[N] RT;
}
parameters {
  real mu_s;
  real<lower = 0> sigma_s;
}
transformed parameters {
  real mu = mu_s / 1000;
  real sigma = sigma_s / 1000;
}
model {
  target += normal_lpdf(mu_s | 2, 2);
  target += normal_lpdf(sigma_s | 0.4, 0.2)
            - normal_lccdf(0 | 0.4, 0.2);
  target += recnormal_lpdf(RT | mu, sigma);
}
generated quantities {
  array[N] real rt_pred;
  for (n in 1:N)
    rt_pred[n] = recnormal_rng(mu, sigma);
}

Fit the model to the simulated data:

recnormal_rt_f <- system.file("stan_models",
                              "recnormal_rt_f.stan",
                              package = "bcogsci")
fit_rec_f <- stan(recnormal_rt_f, data = list(N = N, RT = rt))

Print the summary:

print(fit_rec_f, pars = c("mu", "sigma"), digits = 4)

##         mean   2.5%  97.5% n_eff  Rhat
## mu    0.0020 0.0020 0.0021  2831 1.002
## sigma 0.0004 0.0003 0.0004  3559 0.999

12.2 Validation of a computed posterior distribution

The model converged, but did it work? At the very least we expect that the simulated data that we used to test the model show a distribution similar to the posterior predictive distributions. This is because we are fitting the data with the same function that we used to generate the data. Figure 12.2 shows that the simulated data and the posterior predictive distributions are similar.

ppc_dens_overlay(rt, yrep = extract(fit_rec_f)$rt_pred[1:500, ]) +
  coord_cartesian(xlim = c(0, 2000))
FIGURE 12.2: A posterior predictive check of fit_rec in comparison with the simulated data.

Our synthetic data set was generated by first defining a ground truth vector of parameters {μ, σ}. We would expect that the true values of the parameters μ and σ should be well inside the posterior distribution of our model. We investigate this by plotting the true values of the parameters together with their posterior distributions using the function mcmc_recover_hist from bayesplot, as shown below in Figure 12.3.

post_rec <- as.data.frame(fit_rec_f) %>%
  select("mu", "sigma")
mcmc_recover_hist(post_rec, true = c(mu, sigma))

FIGURE 12.3: Posterior distributions of the parameters of fit_rec together with their true values.

Even though this approach can capture serious misspecifications in our models, it has three important shortcomings.

First, it’s unclear what we should conclude if the true value of a parameter is not in the bulk of
its posterior distribution. Even in a well-specified model, if we inspect enough parameters we
will find that some of the marginal posterior distributions are relatively far away from the true
values. In fact, a well-specified model should also be well-calibrated, and that means that its
credible intervals should also behave like frequentist confidence intervals (Cook, Gelman, and
Rubin 2006). Frequentist confidence intervals apply to the long-term success rate of the
model; that is, how often in the long run, the true value of the parameter will be inside the
interval. This means that in, say, 5% of the cases, we should expect that a true value is
outside the 95% of the CrI of its corresponding posterior distribution (Cook, Gelman, and
Rubin 2006).

The second shortcoming is that if the posteriors that we obtain are very wide, they might end up containing the true values, leading us to believe that the model is working correctly even if the model is highly biased.

The third shortcoming is that regardless of how a successful recovery of parameters is defined, we might be able to recover a posterior based on data generated from some parts of the parameter space, but not based on data generated from other parts of the parameter space (Talts et al. 2018). However, inspecting the entire parameter space is infeasible.

An alternative to our approach of plotting the true values of the parameters together with their
posterior distributions is simulation-based calibration (Talts et al. 2018). This is discussed
next.

12.2.1 The simulation-based calibration procedure

Talts et al. (2018) suggest the following procedure for validating a model, checking that the Bayesian computation is faithful in the sense that it is not biased: we want to rule out that our model overestimates or underestimates the parameters, or has the incorrect precision, given the data and priors.

1. Generate true values (i.e., a ground truth) for the parameters of the model from the prior:

$$\tilde{\Theta} \sim p(\Theta)$$

Here, p(Θ) represents the joint prior distribution of the vector of parameters. In our previous example, the vector of parameters is {μ, σ}, and the joint prior distribution consists of two independent distributions, a normal and a truncated normal. Crucially, the prior distributions should be meaningful; that is, one should use priors that are not too uninformative. The prior space plays a crucial role because it determines the parameter space where we will verify the correctness or faithfulness of the model.

Assuming 120 samples (from the priors), this step for our model would be as follows:

N_sim <- 120
mu_tilde <- rnorm(N_sim, 2, 2) / 1000
sigma_tilde <- rtnorm(N_sim, 0.4, 0.2, a = 0) / 1000

2. Generate multiple (N_sim) data sets based on the probability density function used as the likelihood:

$$\tilde{D} \sim p(D \mid \tilde{\Theta})$$

The previous equation indicates that each data set $\tilde{D}_n$ is sampled from each of the generated parameters, $\tilde{\Theta}_n$. Following our previous example, use the reciprocal truncated normal distribution to generate data sets of response times:

## Create a placeholder for all the simulated data sets:
rt_tilde <- vector(mode = "list", length = N_sim)
# Number of observations
N <- 100
for (n in 1:N_sim) {
  rt_tilde[[n]] <- 1 / rtnorm(N,
                              mu_tilde[n],
                              sigma_tilde[n],
                              a = 0)
}

Now, rt_tilde, which corresponds to $\tilde{D}$, consists of a list with 120 vectors of 100 observations each. Each list element represents a simulated data set.

3. Fit the same Bayesian model to each of the generated data sets.

After fitting the same model to each data set, we should compare the recovered posterior distribution for each parameter of each simulated data set, that is, $p(\Theta \mid \tilde{D}_n)$, with the parameters that were used to generate the data, $\tilde{\Theta}_n$. This comparison is represented by ... in the code below (that is, it is not yet implemented in the code shown below). This raises the question of how to compare the posterior distribution with the ground truth values of the parameters.

for (n in 1:N_sim) {
  fit <- stan(recnormal_rt,
              data = list(N = N,
                          RT = rt_tilde[[n]]))
  ...
}

Talts et al. (2018) show that this procedure (steps 1-3) also defines a natural condition for assessing whether the computed posterior distributions match the exact posterior distributions: the average over the ensemble of posteriors (the recovered posteriors of each simulated data set) should be the same as the prior.

Integrating the posteriors over the joint distribution of the ground truths and the simulated data sets, we obtain the expectation of the posterior over all possible generated data sets, also called the data-averaged posterior, which is identical to the prior distribution.

$$\int p(\Theta \mid \tilde{D}) \, p(\tilde{D} \mid \tilde{\Theta}) \, p(\tilde{\Theta}) \, d\tilde{D} \, d\tilde{\Theta} = p(\Theta)$$

Any mismatch between the data averaged posterior and the prior distribution indicates some
problem in our model (Cook, Gelman, and Rubin 2006).39

The next two steps describe a well-defined comparison of the exact posterior and prior
distribution.

4. Generate an ensemble of rank statistics of prior samples relative to corresponding posterior samples.

Talts et al. (2018) demonstrate that the match between the exact posterior and prior can be evaluated by examining whether, for each one-dimensional parameter of Θ (e.g., {μ, σ}), the rank statistics of the prior sample relative to S posterior samples are uniformly distributed across [0, S]. In other words, for each sample from the prior (e.g., $\tilde{\mu}_n$), we calculate the number of samples of the posterior that are smaller than the corresponding value of the parameter sampled, and we examine its distribution. A mismatch between the exact posterior and our posterior would show up as a non-uniform distribution.
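A toy simulation in R (independent of any particular model) illustrates why uniform ranks are expected in the ideal case, where the posterior samples come from the same distribution as the ground truth:

set.seed(123)
S <- 99 # number of posterior samples per simulation
ranks <- replicate(1000, {
  theta_tilde <- rnorm(1) # "prior" draw, i.e., the ground truth
  post <- rnorm(S)        # ideal posterior samples: same distribution
  sum(post < theta_tilde) # rank statistic, between 0 and S
})
hist(ranks, breaks = 10)  # approximately uniform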
As a next step, we examine this new distribution visually using a histogram with S + 1 possible ranks (from 0 to S). There are two issues that we need to take into account (Talts et al. 2018):

1. Regardless of our model, histograms will deviate from uniformity if the posterior samples are dependent. This can be solved by thinning the samples of the posterior, that is, removing a number of intermediate samples (Talts et al. 2018 recommend thinning by between 6 and 10).

2. To reduce the noise in the histogram, we should have bins of equal size. If S + 1 is divisible by a large power of 2, e.g., 1024, we will be able to re-bin the histogram easily with bins of equal size. We complete the code shown in step 3 by generating an ensemble of rank statistics as well.

N_s <- 1024 * 6 # Total samples from the posterior
# Thin by 6, and remove one extra sample so that S + 1 = 1024
thinner <- seq(from = 1, to = N_s, by = 6)[-1]
# Placeholders for the ranks
rank_mu <- rep(NA, N_sim)
rank_sigma <- rep(NA, N_sim)
# Loop over the simulations
for (n in 1:N_sim) {
  message("Fit number ", n)
  # Fit is only stored temporarily
  fit <- stan(recnormal_rt,
              data = list(N = N,
                          RT = rt_tilde[[n]]),
              warmup = 1500,
              iter = 1500 + N_s / 4,
              control = list(adapt_delta = .999,
                             max_treedepth = 14))
  # Break the loop if there are divergent transitions
  if (get_num_divergent(fit)) break
  post <- extract(fit)
  # Number of samples of the posterior that are smaller than
  # the corresponding value of the parameter sampled:
  rank_mu[n] <- sum(post$mu[thinner] < mu_tilde[n])
  rank_sigma[n] <- sum(post$sigma[thinner] < sigma_tilde[n])
}
df_rank <- tibble(sim = rep(1:N_sim, 2),
                  variable = rep(c("mu", "sigma"), each = N_sim),
                  rank = c(rank_mu, rank_sigma))

5. Graphically assess the correctness of the model using rank histograms.

As a last step, we use a histogram for each parameter to identify deviations from uniformity. If the posterior estimates are correct, then each of the B bins has a probability of 1/B that a simulation (i.e., an individual rank) falls into it: Binomial(N_sim, 1/B). This allows us to complement the histograms with confidence intervals, indicating where the variation expected from a uniform histogram should be.

Use the code below to build the plot shown in Figure 12.4. We can conclude that the implementation of our model (in the parameter space determined by the priors) is correct.

B <- 16
bin_size <- 1024 / 16
ci_l <- qbinom(0.005, size = N_sim, prob = 1 / B)
ci_u <- qbinom(0.995, size = N_sim, prob = 1 / B)
ggplot(df_rank, aes(x = rank)) +
  geom_histogram(breaks = seq(0, to = 1023, length.out = B + 1),
                 closed = "left", colour = "black") +
  scale_y_continuous("count") +
  geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
  facet_wrap(~variable, scales = "free_y")

FIGURE 12.4: Rank histograms of μ and σ from the reciprocal truncated normal distribution. The dashed lines represent the 99% confidence interval.

Next, we consider an example where simulation-based calibration can reveal a problem.

12.2.2 Simulation-based calibration revealing a problem

Let’s assume that we made an error in the model implementation. We’ll fit an incorrect model:
rather than the complete likelihood including the normal PDF and the truncation and Jacobian
adjustments, we will fit only the normal PDF evaluated at the reciprocal of the response times,
that is target += normal_lpdf(1 ./ RT | mu, sigma); .

Assessing the correctness of the model by looking at the recovery of the parameters is misleading in this case, as is evident from Figure 12.5. One could conclude that the model is fine based on this plot, since the true values of the parameters are inside their posterior distributions.

FIGURE 12.5: Posterior distributions of the parameters of an incorrect model (truncation and Jacobian adjustments missing) together with their true values.
However, the rank histograms produced by the simulation-based calibration procedure in Figure 12.6 show very tall bins at the left and at the right of each histogram, exceeding the 99% CI. This is a very clear indication that there is a problem with the model specification: our incorrect model is overestimating μ and underestimating σ. This example illustrates the importance of simulation-based calibration.

FIGURE 12.6: Rank histograms of μ and σ from the incorrect implementation of the reciprocal truncated normal distribution. The dashed lines represent the 99% confidence interval.

12.2.3 Issues and limitations of simulation-based calibration

In the previous sections, we have used only rank histograms. Even though histograms are a very intuitive means of visualization for assessing model correctness, they might not be sensitive enough to small deviations (Talts et al. 2018), and other visualizations might be better suited. Another limitation of histograms is that they are sensitive to the number of bins (Säilynoja, Bürkner, and Vehtari 2022). One could bin the histograms multiple times, but this approach is difficult to interpret when there are many parameters, and can be vulnerable to multiple testing biases (Talts et al. 2018). One alternative to rank histograms is visualizations based on the empirical cumulative density function of the ranks. This is briefly presented in Box 12.2.

Box 12.2 Different rank visualizations and the SBC package.

Implementing the simulation-based calibration algorithm "by hand" introduces a new source of potential errors. Fortunately, the R package SBC (Kim et al. 2022) provides tools to validate a Stan model (or any sampling algorithm) by allowing us to run simulation-based calibrations easily. The package is in active development at the moment40 and can be installed with the following command.

remotes::install_github("hyunjimoon/SBC")

One of the main advantages of this package is that it provides several ways to visualize the results of the simulation-based calibration procedure; see https://hyunjimoon.github.io/SBC/. Figure 12.7 shows rank histograms produced by SBC for a correct model and for several different incorrect models. An alternative to rank histograms is to use an empirical cumulative distribution function (ECDF)-based method, as proposed by Säilynoja, Bürkner, and Vehtari (2022). The idea behind this method is that if the ranks produced by the simulation-based calibration algorithm are uniform, the ECDF of the ranks should be close to the CDF of a uniform distribution. Figure 12.8 shows the difference between the ECDF of the ranks and the CDF of a uniform distribution together with 95% confidence bands (this is the default in the SBC package) for a correct model and different incorrect ones.

FIGURE 12.7: Rank histograms produced by the R package SBC showing the outcome one would expect for a correct model and for several different incorrect ones, together with 95% confidence bands (these are the default bands in the package).
FIGURE 12.8: Difference between the perfectly uniform CDF and the empirical cumulative distribution function (ECDF) of the ranks produced by the SBC R package together with 95% confidence bands. The figure shows the outcome one would expect for a correct model and for several different incorrect ones.

Even though simulation-based calibration is a comprehensive approach to test model correctness, it has several drawbacks, regardless of the means of visualization:

i. This approach requires fitting a model many times, which can be too time intensive for
complex models. It’s worth bearing in mind that code with errors can sometimes show
clear “catastrophic” failures in the recovery of the parameters. For this reason, one could
start the verification of model correctness with a simple recovery of parameters (see Talts
et al. 2018; Gelman et al. 2020), and then proceed with the simulation-based calibration
procedure for at least some suspect aspects of the model (e.g., when working with a
custom likelihood).

ii. Another major limitation of simulation-based calibration is that it is concerned exclusively with the computational aspects of the analysis and offers no guarantee for any single observation. A complete Bayesian workflow should include prior and posterior predictive checks; see chapter 7 and Schad, Betancourt, and Vasishth (2020).
iii. Finally, it’s important to apply simulation-based calibration only after appropriate priors are
set. This is important because only the priors determine the parameter space that will be
inspected. Weak priors that may be good enough for estimation may nevertheless lead to
a parameter space that is unrealistically large since, during the calibration stage, there
are no independent data to constrain the space.

12.3 Another custom distribution: Re-implementing the exponential distribution manually

There are cases when one needs a distribution that is not included in Stan. Many times one
can find the PDF (or a CDF) derived in a paper, and one needs to implement it manually by
writing the log(PDF) in the Stan language. Even though the exponential distribution is included
in Stan, we demonstrate how we would include it step-by-step as if it weren’t available. This
example extends what is demonstrated in Stan’s user guide (Stan Development Team 2021,
ch. 21).

The exponential distribution is used to model waiting times until a certain event. Its PDF is the following:

$$f(x \mid \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0, \\ 0 & x < 0. \end{cases}$$

The parameter λ is often called the rate parameter and must be positive. A higher value for the rate parameter leads to shorter waiting times on average. The mean of the distribution is 1/λ. The exponential distribution has the key property of being memoryless. What this means is explained in Ross (2002) as follows: Suppose that X is a random variable with an exponential distribution as a PDF; the random variable represents the lifetime of an item (e.g., X could represent the time it takes for a radioactive particle to completely decay). If the item is t time-units old, then the remaining life s of that item has the same probability distribution as the life of a new item. Mathematically, this amounts to stating that

$$P(X > s + t \mid X > t) = P(X > s)$$

The implication of the memoryless property is that we do not need to know the age of an item to know what the distribution of its remaining life is.
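This property is easy to verify numerically with R's built-in exponential CDF (the values of λ, s, and t below are arbitrary):

lambda <- 1 / 200
s <- 100
t <- 1000
# P(X > s + t | X > t) = P(X > s + t) / P(X > t):
pexp(s + t, rate = lambda, lower.tail = FALSE) /
  pexp(t, rate = lambda, lower.tail = FALSE)
# P(X > s):
pexp(s, rate = lambda, lower.tail = FALSE)

Both expressions evaluate to exp(−λs), as the memoryless property requires.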
To give another example of memorylessness, the conditional probability that a certain event will happen in the next 100 ms is the same regardless of whether we have already been waiting 1000 ms, 10 ms, or 0 ms. Although the exponential distribution is not commonly used for modeling response times in cognitive science, it has been used in the past (Ashby and Townsend 1980; Ashby 1982). We focus on this distribution because of its simple analytical form.

As a first step, we make our own version of the PDF in R, and then check that it integrates to 1 to verify that this is actually a proper distribution (and that we haven't introduced a typo in the formula).

To avoid underflow, that is, getting a zero instead of a very small number, we'll work on the log scale. This will also be useful when we implement this function in Stan. The log(PDF) is

$$\log(f(x \mid \lambda)) = \log(\lambda) - \lambda x$$

where x > 0.

Implement this function in R with the same arguments as the d* family of functions: if log =
TRUE the output is a log density; call this new function dexp2 .

dexp2 <- function(x, lambda = 1, log = FALSE) {
  log_density <- log(lambda) - lambda * x
  if (log == FALSE) {
    exp(log_density)
  } else {
    log_density
  }
}

Verify that this function integrates to 1 for some point values of the parameter (here, λ = 1 and λ = 20):

dexp2_l1 <- function(x) dexp2(x, 1)
integrate(dexp2_l1, lower = 0, upper = Inf)

## 1 with absolute error < 0.000057

dexp2_l20 <- function(x) dexp2(x, 20)
integrate(dexp2_l20, lower = 0, upper = Inf)

## 1 with absolute error < 0.00000001

To test our function, we'll also need to generate random values from the exponential distribution. If the quantile function of the distribution exists, inverse transform sampling is a relatively straightforward way to get pseudo-random numbers sampled from a target distribution (for an accessible introduction to inverse sampling, see Lynch 2007). Given a target distribution with a PDF f, and a quantile function $F^{-1}$ (the inverse of the CDF), the inverse transform sampling method consists of the following:

1. Sample one number u from Uniform(0, 1). Let $u = F(z) = \int_L^z f(x) \, dx$ (here, L is the lower bound of the PDF f).
2. Then $z = F^{-1}(u)$ is a draw from f(x).

In this case, the quantile function (the inverse of the CDF) is the following:

$$-\log(1 - p)/\lambda$$

Here is how one can derive this inverse of the CDF. First, consider the fact that the CDF of the exponential distribution is as follows. The term q is some quantile of the distribution:

$$F(q) = \int_0^q \lambda \exp(-\lambda x) \, dx$$

We can solve this integral by using the so-called u-substitution method (Salas, Etgen, and Hille 2003). First, define

$$u = -\lambda x$$

Then, the derivative du/dx is:

$$\frac{du}{dx} = -\lambda$$

This implies that $du = -\lambda \, dx$, or that $-du = \lambda \, dx$.

In the CDF, replace the term $\lambda \, dx$ with $-du$:

$$F(q) = \int_0^q \lambda \exp(-\lambda x) \, dx = \int_0^q (-\exp(-\lambda x) \, du)$$

Rewriting $-\lambda x$ as u, the CDF simplifies to:

$$F(q) = \int_0^q (-\exp(u) \, du)$$

We know from calculus that the integral of $-\exp(u)$ is $-\exp(u)$. So, the integral becomes:

$$F(q) = \left[ -\exp(u) \right]_0^q$$

Replacing u with $-\lambda x$, we get:

$$F(q) = \left[ -\exp(-\lambda x) \right]_0^q = 1 - \exp(-\lambda q)$$

Thus, we know that the CDF is $F(q) = 1 - \exp(-\lambda q) = p$, where p is the probability of observing the quantile q or some value smaller than q. To derive the inverse of the CDF, solve the equation below for q:

$$p = 1 - e^{-\lambda q}$$

$$p - 1 = -e^{-\lambda q}$$

$$1 - p = e^{-\lambda q}$$

$$\log(1 - p) = -\lambda q$$

$$-\log(1 - p)/\lambda = q$$
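We can verify this expression against R's built-in quantile function qexp() (the values of p and λ are arbitrary):

p <- c(0.1, 0.5, 0.9)
lambda <- 1 / 200
-log(1 - p) / lambda
qexp(p, rate = lambda)

Both lines return the same quantiles.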

Write the quantile function (the inverse of the CDF) and the random number generator for the exponential distribution in R. To differentiate them from the built-in functions in R, we will call them qexp2 and rexp2:

qexp2 <- function(p, lambda = 1) {
  -log(1 - p) / lambda
}
rexp2 <- function(n, lambda = 1) {
  u <- runif(n, 0, 1)
  qexp2(u, lambda)
}
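As a quick sanity check (a sketch with an arbitrary rate), the mean of draws from rexp2() should be close to 1/λ:

set.seed(42)
mean(rexp2(100000, lambda = 1 / 200)) # should be close to 200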

The functions that we would use in a Stan model are relatively faithful to the R code, but follow Stan conventions: The function exp_lpdf returns the sum of the log(PDF) evaluated at each value of x; this is analogous to doing sum(dexp2(x, lambda, log = TRUE)). The function exp_rng implements a non-vectorized version of rexp2 and uses the auxiliary function exp_icdf (icdf stands for inverse CDF), which is similar to qexp2.

functions {
  real exp_lpdf(vector x, real lambda){
    vector[num_elements(x)] lpdf = log(lambda) - lambda * x;
    return sum(lpdf);
  }
  real exp_icdf(real p, real lambda){
    return - log(1 - p) / lambda;
  }
  real exp_rng(real lambda){
    real u = uniform_rng(0, 1);
    return exp_icdf(u, lambda);
  }
}

We are now ready to generate synthetic data and fit the distribution in Stan. Generate 1000
observations.


N <- 1000
lambda <- 1 / 200
rt <- rexp2(N, lambda)

Use exponential.stan, which includes the functions block shown before and the following obligatory blocks:

data {
  int<lower = 1> N;
  vector[N] RT;
}
parameters {
  real<lower = 0> lambda;
}
model {
  target += normal_lpdf(lambda | 0, .1) -
    normal_lccdf(0 | 0, .1);
  target += exp_lpdf(RT | lambda);
}
generated quantities {
  array[N] real rt_pred;
  for (n in 1:N)
    rt_pred[n] = exp_rng(lambda);
}
Fit the data with Stan.

exponential <- system.file("stan_models",
                           "exponential.stan",
                           package = "bcogsci")
fit_exp <- stan(exponential, data = list(N = N, RT = rt))

Print the summary:

print(fit_exp, pars = c("lambda"))

##        mean 2.5% 97.5% n_eff Rhat
## lambda 0.01    0  0.01  1505    1
Carry out a quick check first, verifying that the true value of the parameter λ is reasonably
inside its posterior distribution. This is shown in Figure 12.9(a). Figure 12.9(b) shows the
results of the simulation-based calibration procedure, and shows that our implementation was
correct.

# Plot a
post_exp <- as.data.frame(fit_exp) %>%
  select("lambda")
mcmc_recover_hist(post_exp, true = lambda) +
  # make it similar to plot b:
  ggtitle("a") +
  xlab("lambda") +
  theme(plot.title = element_text(hjust = 0),
        legend.position = "none",
        strip.text.x = element_blank())
# Plot b
# (df_rank_lambda is assumed to contain the SBC ranks for lambda,
# obtained with the procedure of section 12.2.1)
B <- 16
bin_size <- 1024 / 16
ci_l <- qbinom(0.005, size = N_sim, prob = 1 / B)
ci_u <- qbinom(0.995, size = N_sim, prob = 1 / B)
ggplot(df_rank_lambda, aes(x = rank)) +
  geom_histogram(breaks = seq(0, to = 1023, length.out = B + 1),
                 closed = "left", colour = "black") +
  scale_y_continuous("count") +
  geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
  ggtitle("b") +
  theme(plot.title = element_text(hjust = 0))

FIGURE 12.9: a) Posterior distribution of the parameter λ of fit_exp together with its true value as a black line. b) Rank histogram of λ from our hand-made implementation of the exponential distribution. The dashed lines represent the 99% confidence interval.

12.4 Summary

In this chapter, we learned how to create a PDF that is not provided by Stan. We learned how
to do this by doing a change of variables (with its Jacobian adjustment) and by building the
distribution from scratch in a function ending in _lpdf . We also learned how to verify the
correctness of the new functions by recovering the true values of the parameters and by using
simulation-based calibration.

12.5 Further reading

Jacobian adjustments in Bayesian models are an ongoing source of confusion, and there are several blog posts and case studies that try to shed light on them:

https://rstudio-pubs-static.s3.amazonaws.com/486816_440106f76c944734a7d4c84761e37388.html
https://betanalpha.github.io/assets/case_studies/probability_theory.html#42_probability_density_functions
https://jsocolar.github.io/jacobians/#fn1
https://mc-stan.org/documentation/case-studies/mle-params.html

A complete tutorial on simulation-based calibration using the SBC package was given at the online event StanConnect 2021, and it is available at https://www.martinmodrak.cz/post/2021-sbc_tutorial/. The ideas that predate this technique can be found in Cook, Gelman, and Rubin (2006) and were extended in Talts et al. (2018). The use of ECDF-based visualizations is discussed in Säilynoja, Bürkner, and Vehtari (2022). The role of simulation-based calibration in the Bayesian workflow is discussed in Schad, Betancourt, and Vasishth (2020). Examples of the use of simulation-based calibration to validate novel models used in cognitive science are Hartmann, Johannsen, and Klauer (2020) and Bürkner and Charpentier (2020). The extension of this procedure to validate Bayes factors is discussed in Schad et al. (2021).

Custom functions in general and custom probability functions are treated in chapters 18 and 19 of the Stan user's guide (Stan Development Team 2021).

12.6 Exercises

Exercise 12.1 Fitting a shifted log-normal distribution.

A random variable Y has a shifted log-normal distribution with shift ψ, location μ, and scale σ, if Z = Y − ψ and Z ∼ LogNormal(μ, σ).

1. Implement a shifted_lognormal_lpdf function in Stan with three parameters, mu, sigma, and psi. Tip: One can use the regular log-normal distribution and apply a change of variables. In this case the Jacobian adjustment would be

$$\left| \frac{d}{dY} (Y - \psi) \right| = 1$$

which in log-space is conveniently zero.

2. Verify the correctness of the model by recovering the true values (your choice) of the parameters of the model and by using simulation-based calibration. In order to use simulation-based calibration, you will need to decide on sensible priors; assume that $\psi \sim \text{Normal}_+(100, 50)$, and choose priors for μ and σ so that the prior predictive distributions are adequate for response times.

Exercise 12.2 Fitting a Wald distribution.

The Wald distribution (or inverse Gaussian distribution) and its variants have been proposed as another useful distribution for response times (see for example Heathcote 2004).

The probability density function of the Wald distribution is the following:

$$f(x; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi x^3}} \exp\left(-\frac{\lambda(x - \mu)^2}{2\mu^2 x}\right)$$
1. Implement this distribution in Stan as wald_lpdf . In order to do this, you will need to
derive the logarithm of the PDF presented above. You can adapt the code of the following
R function.

dwald <- function(x, lambda, mu, log = FALSE) {
  log_density <- 0.5 * log(lambda / (2 * pi)) -
    1.5 * log(x) -
    0.5 * lambda * ((x - mu) / (mu * sqrt(x)))^2
  if (log == FALSE) {
    exp(log_density)
  } else {
    log_density
  }
}

2. Verify the correctness of the model by recovering the true values (your choice) of the parameters of the model and by using simulation-based calibration. As with the previous exercise, you will need to decide on sensible priors by deriving prior predictive distributions that are adequate for response times.

References

Ashby, F Gregory. 1982. “Testing the Assumptions of Exponential, Additive Reaction Time
Models.” Memory & Cognition 10 (2). Springer: 125–34.

Ashby, F Gregory, and James T Townsend. 1980. “Decomposing the Reaction Time
Distribution: Pure Insertion and Selective Influence Revisited.” Journal of Mathematical
Psychology 21 (2). Elsevier: 93–123.

Box, George E.P., and David R. Cox. 1964. “An Analysis of Transformations.” Journal of the
Royal Statistical Society. Series B (Methodological). JSTOR, 211–52.

Bürkner, Paul-Christian, and Emmanuel Charpentier. 2020. "Modelling Monotonic Effects of Ordinal Predictors in Bayesian Regression Models." British Journal of Mathematical and Statistical Psychology. Wiley Online Library.

Cook, Samantha R, Andrew Gelman, and Donald B Rubin. 2006. "Validation of Software for Bayesian Models Using Posterior Quantiles." Journal of Computational and Graphical Statistics 15 (3). Taylor & Francis: 675–92. https://doi.org/10.1198/106186006X136976.

Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling
Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020.
“Bayesian Workflow.” arXiv Preprint arXiv:2011.01808.

Harris, Christopher M., and Jonathan Waddington. 2012. “On the Convergence of Time
Interval Moments: Caveat Sciscitator.” Journal of Neuroscience Methods 205 (2): 345–56.
https://doi.org/https://doi.org/10.1016/j.jneumeth.2012.01.017.

Harris, Christopher M., Jonathan Waddington, Valerio Biscione, and Sean Manzi. 2014.
“Manual Choice Reaction Times in the Rate-Domain.” Frontiers in Human Neuroscience 8:
418. https://doi.org/10.3389/fnhum.2014.00418.

Hartmann, Raphael, Lea Johannsen, and Karl Christoph Klauer. 2020. “rtmpt: An R Package
for Fitting Response-Time Extended Multinomial Processing Tree Models.” Behavior Research
Methods 52 (3). Springer: 1313–38.

Heathcote, Andrew. 2004. “Fitting Wald and ex-Wald Distributions to Response Time Data: An
Example Using Functions for the S-Plus Package.” Behavior Research Methods, Instruments,
& Computers 36 (4). Springer: 678–94.

Kim, Shinyoung, Hyunji Moon, Martin Modrák, and Teemu Säilynoja. 2022. SBC: Simulation
Based Calibration for Rstan/Cmdstanr Models.

Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for
Social Scientists. New York, NY: Springer.

Ross, Sheldon. 2002. A First Course in Probability. Pearson Education.

Salas, Saturnino L, Garret J Etgen, and Einar Hille. 2003. Calculus: One and Several
Variables. Ninth. John Wiley & Sons.

Säilynoja, Teemu, Paul-Christian Bürkner, and Aki Vehtari. 2022. “Graphical Test for Discrete
Uniformity and Its Applications in Goodness-of-Fit Evaluation and Multiple Sample
Comparison.” Statistics and Computing 32 (2). Springer: 1–21.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled
Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1). American
Psychological Association: 103–26.

Schad, Daniel J., Bruno Nicenboim, Paul-Christian Bürkner, Michael J. Betancourt, and Shravan Vasishth. 2021. "Workflow Techniques for the Robust Use of Bayes Factors."

Stan Development Team. 2021. “Stan Modeling Language Users Guide and Reference
Manual, Version 2.27.” https://mc-stan.org.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018.
“Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint
arXiv:1804.06788.

Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Yes, but Did It Work?:
Evaluating Variational Inference.” In International Conference on Machine Learning, 5581–90.
PMLR.

37. Not every transformation is valid: univariate changes of variables must be monotonic and differentiable; multivariate changes of variables must be injective and differentiable.

38. In the multivariate case, it is equal to the absolute determinant of the Jacobian (the matrix of all its first-order partial derivatives) of the transform; see section 6.7 of Ross (2002).

39. We are working under the assumption that Stan, which yields the posterior approximation, works correctly. In principle, if we assume that our model is correct, we can also use simulation-based calibration to examine whether our approximation to the posterior distribution is correct (that is, whether Stan's sampler, or any posterior approximation method, works correctly); for example, see Yao et al. (2018). In any case, no mismatch between the data-averaged posterior and the corresponding prior distribution of a particular parameter means that we have support for the correctness of both the model and the approximation to the posterior.

40. Even though the package is already fully functional, function names and arguments might change.

Chapter 13 Meta-analysis and measurement error models
In this chapter, we introduce two relatively underutilized models that are potentially very
important for cognitive science: meta-analysis and measurement-error models.

Meta-analysis can be very informative when carrying out systematic reviews, and measurement-error models are able to take into account uncertainty in one's dependent or independent variable (or both). What's common to these two classes of model is that they both assume that the n-th measured data point y_n has an unknown true value of a parameter, say ζ_n (pronounced zeta en), that is measured with some uncertainty that can be represented by the standard error SE_n of the measurement y_n:

$$y_n \sim \text{Normal}(\zeta_n, SE_n)$$

In both classes of model, the goal is to obtain a posterior distribution of a latent parameter ζ, which is assumed to generate the ζ_n, with some standard deviation τ. The parameter τ quantifies the noise in the measurement process or the between-study variability in a meta-analysis.

$$\zeta_n \sim \text{Normal}(\zeta, \tau)$$

The main parameter of interest is usually ζ, but the posterior distributions of τ and ζ_n can also be informative. The above model specification should remind you of the hierarchical models we saw in earlier chapters.

13.1 Meta-analysis

Once a number of studies have accumulated on a particular topic, it can be very informative to synthesize the data. Here is a commonly used approach: a random-effects meta-analysis.

13.1.1 A meta-analysis of similarity-based interference in sentence comprehension

The model is set up as follows. For each study n , let effectn be the effect of interest, and let
SEn be the standard error of the effect. A concrete example of a recent meta-analysis is the
effect of similarity-based interference in sentence comprehension (Jäger, Engelmann, and
Vasishth 2017); when two nouns are more similar to each other, there is greater processing
difficulty (i.e., longer reading times in milliseconds) when an attempt is made to retrieve one of
the nouns to complete a linguistic dependency (such as a subject-verb dependency). The
estimate of the effect and its standard error is the information we have from each study n .

First, load the data, and add an id variable that identifies each experiment.

data("df_sbi")
(df_sbi <- df_sbi %>%
   mutate(study_id = 1:n()))

## # A tibble: 12 × 4
##   publication      effect    SE study_id
##   <chr>             <int> <int>    <int>
## 1 VanDyke07E1LoSem     13    30        1
## 2 VanDyke07E2LoSem     37    21        2
## 3 VanDyke07E3LoSem     20    11        3
## # … with 9 more rows

The effect sizes and standard errors were estimated from published summary statistics in the respective articles. In some cases, this involved a certain amount of guesswork; the details are documented in the online material accompanying Jäger, Engelmann, and Vasishth (2017).

We begin with the assumption that there is a true (unknown) effect ζ_n that lies behind each of these studies. Each of the observed effects has an uncertainty associated with it, SE_n. We can therefore assume that each observed effect, effect_n, is generated as follows:

$$\mathit{effect}_n \sim \text{Normal}(\zeta_n, SE_n)$$


Each study is assumed to have a different true effect ζn because each study will have been
carried out under different conditions: in a different lab with different protocols and workflows,
with different subjects, with different languages, with slightly different experimental designs,
etc.

Further, each of the true underlying effects ζn has behind it some true unknown value ζ . The
parameter ζ represents the underlying effect of similarity-based interference across
experiments. Our goal is to obtain the posterior distribution of this overall effect.

We can write the above statement as follows:

$$\zeta_n \sim \text{Normal}(\zeta, \tau)$$

τ is the between-study standard deviation; this expresses the assumption that there will be some variability between the true effects ζ_n.

To summarize the model:

effectn is the observed effect (in this example, in milliseconds) in the n -th study.
ζn is the true (unknown) effect in each study.
ζ is the true (unknown) effect of the experimental manipulation, namely, the similarity-
based interference effect.
Each SEn is estimated from the standard error available from study n .
The parameter τ represents between-study standard deviation.

We can construct a hierarchical model as follows:

$$\begin{aligned}
\mathit{effect}_n &\sim \text{Normal}(\zeta_n, SE_n) \quad n = 1, \dots, N_{studies} \\
\zeta_n &\sim \text{Normal}(\zeta, \tau) \\
\zeta &\sim \text{Normal}(0, 100) \\
\tau &\sim \text{Normal}_+(0, 100)
\end{aligned} \tag{13.1}$$

The priors are based on domain knowledge; it seems reasonable to allow the effect to range a priori from −200 to +200 ms with probability 95%. Of course, a sensitivity analysis is necessary (but skipped here).

This model can be implemented in brms in a relatively straightforward way as shown below.
We show the Stan version later in the chapter (section 13.1.1.2); the Stan version presents
some interesting challenges that can be useful for the reader interested in deepening their
Stan modeling knowledge.
13.1.1.1 brms version of the meta-analysis model

First, define the priors:

priors <- c(prior(normal(0, 100), class = Intercept),
            prior(normal(0, 100), class = sd))

Fit the model as follows. Because of our relatively uninformative priors and the few data
points, the models of this chapter require us to tune the control parameter, increasing
adapt_delta and max_treedepth .

fit_sbi <- brm(effect | resp_se(`SE`, sigma = FALSE) ~ 1
               + (1 | study_id),
               data = df_sbi,
               prior = priors,
               control = list(adapt_delta = .99,
                              max_treedepth = 10))

The posteriors of ζ and τ are summarized below as Intercept and sd(Intercept).

fit_sbi

## ...
## Group-Level Effects:
## ~study_id (Number of levels: 12)
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)    11.82      7.59     0.81    29.17 1.00      962     1543
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    13.37      6.19     2.91    27.78 1.00     1460     1633
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.00      0.00     0.00     0.00   NA       NA       NA
##
## ...

The sigma parameter does not play any role in this model, but appears in the brms output
anyway. In the model specification, sigma was explicitly removed by writing sigma = FALSE .
For this reason, we can ignore that parameter in the model summary output above. Box 13.1
explains what happens if we set sigma = TRUE .

As theory predicts, the overall effect from these studies has a positive sign.

One advantage of such a meta-analysis is that the posterior can now be used as an
informative prior for a future study. This is especially important when doing an analysis using
Bayes factors. But this meta-analysis posterior could also be used as an informative prior in a
future experiment; that would allow the researcher to build on what is known so far from
published studies.

Box 13.1 What happens if we set sigma = TRUE ?

If we set sigma = TRUE, we won't be able to get estimates for ζ_n, since they are handled implicitly. The model presented formally in equation (13.1) is equivalent to the one in (13.2). A critical difference is that ζ_n does not appear any more.

$$\begin{aligned}
\mathit{effect}_n &\sim \text{Normal}\left(\zeta, \sqrt{\tau^2 + SE_n^2}\right) \\
\zeta &\sim \text{Normal}(0, 100) \\
\tau &\sim \text{Normal}_+(0, 100)
\end{aligned} \tag{13.2}$$

This works because of the following property of normally distributed random variables:

If X and Y are two independent random variables, and

$$\begin{aligned}
X &\sim \text{Normal}(\mu_X, \sigma_X) \\
Y &\sim \text{Normal}(\mu_Y, \sigma_Y)
\end{aligned} \tag{13.3}$$

then Z, the sum of these two random variables, is:

$$Z = X + Y \tag{13.4}$$

The distribution of Z has the following form:

$$Z \sim \text{Normal}\left(\mu_X + \mu_Y, \sqrt{\sigma_X^2 + \sigma_Y^2}\right) \tag{13.5}$$

In our case, let

$$\begin{aligned}
U_n &\sim \text{Normal}(0, SE_n) \\
\zeta_n &\sim \text{Normal}(\zeta, \tau)
\end{aligned} \tag{13.6}$$

Analogous to equations (13.4) and (13.5), effect_n can be expressed as a sum of two independent random variables:

$$\mathit{effect}_n = U_n + \zeta_n$$

The distribution of effect_n will be

$$\mathit{effect}_n \sim \text{Normal}\left(\zeta, \sqrt{SE_n^2 + \tau^2}\right) \tag{13.7}$$
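A short simulation in R illustrates this property (the values of ζ, τ, and SE_n are arbitrary):

zeta <- 13
tau <- 12
SE_n <- 20
zeta_n <- rnorm(100000, zeta, tau)
effect_n <- rnorm(100000, zeta_n, SE_n)
# The marginal standard deviation of effect_n ...
sd(effect_n)
# ... should be close to:
sqrt(SE_n^2 + tau^2)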

We can fit this in brms as follows. In this model specification, one should not include the + (1 | study_id), and the prior for τ should now be specified for sigma.

priors2 <- c(prior(normal(0, 100), class = Intercept),
             prior(normal(0, 100), class = sigma))
fit_sbi_sigma <- brm(effect | resp_se(`SE`, sigma = TRUE) ~ 1,
                     data = df_sbi,
                     prior = priors2,
                     control = list(adapt_delta = .99,
                                    max_treedepth = 10))

There are slight differences with fit_sbi due to the different parameterization and the sampling process, but the results are very similar:

posterior_summary(fit_sbi_sigma,
                  variable = c("b_Intercept", "sigma"))

##             Estimate Est.Error  Q2.5 Q97.5
## b_Intercept     13.4      6.07 3.028  27.6
## sigma           11.6      7.63 0.626  29.5

If we are not interested in the underlying effects in each study, this parameterization of the meta-analysis can be faster and more robust (i.e., it has fewer potential convergence issues). A major drawback is that we can no longer display a forest plot as we do in Figure 13.1.

Another interesting by-product of a random-effects meta-analysis is the possibility of displaying a forest plot (Figure 13.1). A forest plot shows the meta-analytic estimate (the parameter b_Intercept in brms) alongside the original estimates effect_n (and their SE_n) and the posterior distributions of the ζ_n for each study (we reconstruct these estimates by adding b_Intercept to the parameters starting with r_ in brms). The original estimates are the ones fed to the model as data, and the posterior distributions of the ζ_n are calculated, as in previous hierarchical models, after the information from all studies is pooled together. The ζ_n estimates are shrunken estimates of each study's (unknown) true effect, shrunken towards the grand mean ζ, and weighted by the standard error observed in each study n. The ζ_n for a particular study is shrunk more towards the grand mean ζ when the study's standard error is large (i.e., when the estimate is very imprecise). The code below shows how to build a forest plot step by step.
First, change the format of the data so that it looks like the output of brms:

df_sbi <- df_sbi %>%
  mutate(Q2.5 = effect - 2 * SE,
         Q97.5 = effect + 2 * SE,
         Estimate = effect,
         type = "original")

Extract the meta-analytical estimate:

df_Intercept <- posterior_summary(fit_sbi,
                                  variable = c("b_Intercept")) %>%
  as.data.frame() %>%
  mutate(publication = "M.A. estimate", type = "")

For the pooled estimated effect (or fitted value) of the individual studies, we need the sum of the meta-analytical estimate (intercept) and each of the by-study adjustments. Obtain this with the fitted() function:

df_model <- fitted(fit_sbi) %>%
  # Convert matrix to data frame:
  as.data.frame() %>%
  # Add a column to identify the estimates,
  # and another column to identify the publication:
  mutate(type = "adjusted",
         publication = df_sbi$publication)

Bind the observed effects, the meta-analytical estimate, and the fitted values of the studies together, and plot the data:

# Bind the original estimates, the adjusted estimates,
# and the meta-analysis estimate:
bind_rows(df_sbi, df_model, df_Intercept) %>%
  # Plot:
  ggplot(aes(x = Estimate,
             y = publication,
             xmin = Q2.5,
             xmax = Q97.5,
             color = type)) +
  geom_point(position = position_dodge(.5)) +
  geom_errorbarh(position = position_dodge(.5)) +
  # Add the meta-analytic estimate and Credible Interval:
  geom_vline(xintercept = df_Intercept$Q2.5,
             linetype = "dashed", alpha = .3) +
  geom_vline(xintercept = df_Intercept$Q97.5,
             linetype = "dashed", alpha = .3) +
  geom_vline(xintercept = df_Intercept$Estimate,
             linetype = "dashed", alpha = .5) +
  scale_color_discrete(breaks = c("adjusted", "original"))
FIGURE 13.1: Forest plot showing the original and the adjusted estimates computed from each study from the random-effects meta-analysis. The error bars on the original estimates show 95% confidence intervals, and those on the adjusted estimates show 95% credible intervals.
It is important to keep in mind that a meta-analysis is always going to yield biased estimates as long as we have publication bias: if a field has a tendency to allow only "big news" studies to be published, then the literature that will appear in the public domain will be biased, and any meta-analysis based on such information will be biased. Despite this limitation, a meta-analysis is still a useful way to synthesize the known evidence; one just has to remember that the estimate from the meta-analysis is almost certain to be biased.

13.1.1.2 Stan version of the meta-analysis model

Even though brms can handle meta-analyses, fitting them in Stan allows for more flexibility, which might be necessary in some cases. As a first attempt, we could build a model that closely follows the formal specification given in equation (13.1).

data {
  int<lower=1> N;
  vector[N] effect;
  vector[N] SE;
  vector[N] study_id;
}
parameters {
  real zeta;
  real<lower = 0> tau;
  vector[N] zeta_n;
}
model {
  target += normal_lpdf(effect | zeta_n, SE);
  target += normal_lpdf(zeta_n | zeta, tau);
  target += normal_lpdf(zeta | 0, 100);
  target += normal_lpdf(tau | 0, 100)
            - normal_lccdf(0 | 0, 100);
}

Fit the model as follows:

ma0 <- system.file("stan_models",
                   "meta-analysis0.stan",
                   package = "bcogsci")
ls_sbi <- list(N = nrow(df_sbi),
               effect = df_sbi$effect,
               SE = df_sbi$SE,
               study_id = df_sbi$study_id)
fit_sbi0 <- stan(ma0,
                 data = ls_sbi,
                 control = list(adapt_delta = .999,
                                max_treedepth = 12))
## Warning: There were 1 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.

## Warning: There were 1 chains where the estimated Bayesian Fraction of
## Missing Information was low. See
## https://mc-stan.org/misc/warnings.html#bfmi-low

## Warning: Examine the pairs() plot to diagnose sampling problems

## Warning: Bulk Effective Samples Size (ESS) is too low, indicating
## posterior means and medians may be unreliable. Running the chains
## for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess

## Warning: Tail Effective Samples Size (ESS) is too low, indicating
## posterior variances and tail quantiles may be unreliable. Running
## the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

We see that there are warnings. As discussed in section 11.1.2, we can use pairs plots to
uncover pathologies in the sampling. Here, we see that the samples of zeta and tau are
highly correlated:


pairs(fit_sbi0, pars = c("zeta", "tau"))



We face a similar problem to the one in section 11.1.2: the sampler cannot properly explore
the neck of the funnel-shaped space because of the strong correlation between the
parameters. The solution is, as in section 11.1.2, a non-centered parameterization. Re-write
(13.1) as follows:

z_n ∼ Normal(0, 1)

ζ_n = z_n · τ + ζ

effect_n ∼ Normal(ζ_n, SE_n)     (13.8)

ζ ∼ Normal(0, 100)

τ ∼ Normal+(0, 100)

This works because if X ∼ Normal(a, b) and Y ∼ Normal(0, 1), then X = a + Y · b. You
can re-visit section 11.1.2 for more details.
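
As a quick sanity check of this location-scale equivalence, here is a small simulation; this
is our own sketch (the values a = 13 and b = 11 are arbitrary), not part of the book's
model code:

# Our own sanity check (not from the book): samples of a + Y * b, with
# Y ~ Normal(0, 1), match direct samples from Normal(a, b).
# The values a = 13 and b = 11 are arbitrary.
set.seed(123)
Y <- rnorm(100000)
x_noncentered <- 13 + Y * 11
x_direct <- rnorm(100000, mean = 13, sd = 11)
c(mean(x_noncentered), mean(x_direct))
c(sd(x_noncentered), sd(x_direct))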

Translate equation (13.8) into Stan code as follows in meta-analysis1.stan:

data {
  int<lower=1> N;
  vector[N] effect;
  vector[N] SE;
  vector[N] study_id;
}
parameters {
  real zeta;
  real<lower = 0> tau;
  vector[N] z;
}
transformed parameters {
  vector[N] zeta_n = z * tau + zeta;
}
model {
  target += normal_lpdf(effect | zeta_n, SE);
  target += std_normal_lpdf(z);
  target += normal_lpdf(zeta | 0, 100);
  target += normal_lpdf(tau | 0, 100)
    - normal_lccdf(0 | 0, 100);
}

The model converges with values virtually identical to those of the brms model.

ma1 <- system.file("stan_models",
                   "meta-analysis1.stan",
                   package = "bcogsci")
fit_sbi1 <- stan(ma1,
                 data = ls_sbi,
                 control = list(adapt_delta = .999,
                                max_treedepth = 12))

print(fit_sbi1, pars = c("zeta", "tau"))

##      mean 2.5% 97.5% n_eff Rhat
## zeta 13.3 2.76  27.9  1132    1
## tau  11.5 0.65  29.5   829    1

We can also reparameterize the model slightly differently: if we set U_n ∼ Normal(0, SE_n),
then

effect_n = U_n + ζ_n

Then, given that ζ_n ∼ Normal(ζ, τ),

effect_n ∼ Normal(ζ, √(SE_n² + τ²))     (13.9)

See Box 13.1 if it’s not clear why this reparameterization works.

This is equivalent to the brms model where sigma = TRUE. As with brms, we lose the
possibility of estimating the posterior of the true effect of the individual studies.
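
To see that the two parameterizations imply the same distribution for effect_n, consider the
following small simulation; this is our own sketch, with arbitrary values for ζ, τ, and SE:

set.seed(42)
zeta <- 13; tau <- 11; SE <- 20  # arbitrary illustrative values
# Hierarchical version: first sample the true study effects ...
zeta_n <- rnorm(100000, mean = zeta, sd = tau)
# ... then the observed effects given the true effects:
effect_hier <- rnorm(100000, mean = zeta_n, sd = SE)
# Marginalized version, as in equation (13.9):
effect_marg <- rnorm(100000, mean = zeta, sd = sqrt(SE^2 + tau^2))
c(sd(effect_hier), sd(effect_marg))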

Write this in Stan as follows; this code is available in the file meta-analysis2.stan within the
bcogsci package:

data {
  int<lower=1> N;
  vector[N] effect;
  vector[N] SE;
  vector[N] study_id;
}
parameters {
  real zeta;
  real<lower = 0> tau;
}
model {
  target += normal_lpdf(effect | zeta, sqrt(square(SE) + square(tau)));
  target += normal_lpdf(zeta | 0, 100);
  target += normal_lpdf(tau | 0, 100)
    - normal_lccdf(0 | 0, 100);
}

Fit the model:

ma2 <- system.file("stan_models",
                   "meta-analysis2.stan",
                   package = "bcogsci")
fit_sbi2 <- stan(ma2,
                 data = ls_sbi,
                 control = list(adapt_delta = .9))

print(fit_sbi2, pars = c("zeta", "tau"))

##      mean 2.5% 97.5% n_eff Rhat
## zeta 13.1 1.82  27.5   844    1
## tau  11.8 0.58  30.1  1098    1

This summary could be reported in an article by displaying the posterior means and 95%
credible intervals of the parameters.
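
For instance, these numbers can be extracted from the Stan fit as follows; this is just one
possible sketch (any function that summarizes the posterior draws would do):

# Posterior means and 95% credible intervals for reporting:
summary(fit_sbi2,
        pars = c("zeta", "tau"),
        probs = c(0.025, 0.975))$summary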

13.2 Measurement-error models

Measurement error models deal with the situation where some predictor or the dependent
variable, or both, are observed with measurement error. This measurement error could arise
because a variable is an average (i.e., its standard error can also be estimated), or because
we know that our measurement is noisy due to limitations of our equipment (e.g., delays in the
signal from the keyboard to the motherboard, impedance in the electrodes in an EEG system,
etc.).
13.2.1 Accounting for measurement error in individual
differences in working memory capacity and reading fluency

As a motivating example, consider the following data from Nicenboim, Vasishth, et al. (2018).
For each subject, we have the partial-credit unit (PCU) scores of an operation span task as a
measure of their working memory capacity (Conway et al. 2005) along with their standard
error. In addition, the reading fluency of each subject is calculated from a separate set of data
based on the mean reading speeds (character/second) in a rapid automatized naming task
(RAN, Denckla and Rudel 1976); the standard error of the reading speed is also available.

Of interest here is the extent of the association between working memory capacity (measured
as PCU) and reading fluency (measured as reading speed in 50 characters per second). We
avoid making any causal claims: it could be that our measure of working memory capacity
really affects reading fluency, it could be the other way around, or a third variable (or
several) could affect both reading fluency and working memory capacity. A treatment of
causality in Bayesian models can be found in chapters 5 and 6 of McElreath (2020).

data("df_indiv")
df_indiv

## # A tibble: 100 × 5
##    subj mean_rspeed se_rspeed mean_pcu se_pcu
##   <dbl>       <dbl>     <dbl>    <dbl>  <dbl>
## 1     1      0.0521   0.00113    0.738 0.0648
## 2     2      0.0479   0.00121    0.292 0.0315
## 3     3      0.0601   0.00117    0.408 0.0900
## # … with 97 more rows

At first glance, we see a relationship between mean PCU scores and mean reading speed;
see Figure 13.2. However, this relationship seems to be driven by two extreme data points
in the top left corner of the plot.

df_indiv <- df_indiv %>%
  mutate(c_mean_pcu = mean_pcu - mean(mean_pcu))
ggplot(df_indiv, aes(x = c_mean_pcu, y = mean_rspeed)) +
  geom_point() +
  geom_smooth(method = "lm")


FIGURE 13.2: The relationship between (centered) mean PCU scores and mean reading
speed.

A simple linear model shows a somewhat weak association between mean reading speed and
centered mean PCU. The priors are relatively arbitrary but they are in the right order of
magnitude given that reading speeds are quite short and well below 1.

priors <- c(
  prior(normal(0, 0.5), class = Intercept),
  prior(normal(0, 0.5), class = b),
  prior(normal(0, 0.5), class = sigma)
)
fit_indiv <- brm(mean_rspeed ~ c_mean_pcu,
                 data = df_indiv,
                 family = gaussian(),
                 prior = priors)


fit_indiv

## ...
## Population-Level Effects:
##            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept      0.06      0.00     0.05     0.06 1.00     5429     3209
## c_mean_pcu    -0.01      0.01    -0.03     0.00 1.00     2376     2288
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.01      0.00     0.01     0.01 1.00     2177     2084
##
## ...

# Proportion of samples below zero:
(Pb <- mean(as_draws_df(fit_indiv)$b_c_mean_pcu < 0))

## [1] 0.942

Figure 13.3 shows the posterior distribution of the slope in this model. Most of the probability
mass is negative (94.2%), suggesting that a better PCU score is associated with slower
rather than faster reading speed; that is, that a larger working memory capacity is associated
with less reading fluency. This is not a very intuitive result, and it could be driven by the two
extreme data points. Rather than removing these data points, we'll examine what happens
when the uncertainty of the measurements is taken into account.

mcmc_plot(fit_indiv,
          variable = "^b_c",
          regex = TRUE,
          type = "hist")


FIGURE 13.3: The posterior distribution of the slope in the linear model, modeling the effect of
centered mean PCU on mean reading speed (the unit is 50 characters per second).
Taking this measurement uncertainty into account is important: in many practical research
problems, researchers will often take average measurements like these and examine the
correlation between them. However, each of those data points is measured with some error
(uncertainty), and this error is ignored when we take the averaged values. Ignoring this
uncertainty leads to over-enthusiastic inferences. A measurement-error model solves this
issue by taking the uncertainty into account.

The measurement error model is stated as follows. There is assumed to be a true unobserved
value y_n,TRUE for the dependent variable, and a true unobserved value x_n,TRUE for the
predictor, where n indexes the observation number. The observed values y_n and the
predictor values x_n are assumed to be generated with some error:

y_n ∼ Normal(y_n,TRUE, SE_y)

x_n ∼ Normal(x_n,TRUE, SE_x)

The regression is fit to the (unknown) true values of the dependent and independent variables:

y_n,TRUE ∼ Normal(α + β · x_n,TRUE, σ)     (13.10)

In addition, there is also an unknown standard deviation (standard error) of the latent unknown
means that generate the underlying PCU means. That is, we assume that each of the observed
centered PCU scores is normally distributed with an underlying mean χ and a standard
deviation τ. This is very similar to the meta-analysis situation we saw earlier:
ζ_n ∼ Normal(ζ, τ), where ζ_n was the true latent mean of each study, ζ was the
(unknown) true value of the parameter, and τ was the between-study variability.

x_n,TRUE ∼ Normal(χ, τ)

The goal of the modeling is to obtain posterior distributions for the intercept α and slope β
(and the residual error standard deviation σ).

We need to decide on priors for all the parameters now. We use relatively vague priors, which
can still be considered regularizing priors based on our knowledge of the order of magnitude
of the measurements. In situations where not much is known about a research question, one
could use such vague priors.

α ∼ Normal(0, 0.5)

β ∼ Normal(0, 0.5)

χ ∼ Normal(0, 0.5)     (13.11)

σ ∼ Normal+(0, 0.5)

τ ∼ Normal+(0, 0.5)
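
To make the generative process concrete, the following simulation sketches how data could
arise under this model. This is our own illustration, not the code used to generate the
df_indiv data; all parameter values are arbitrary choices in roughly the right order of
magnitude:

set.seed(1)
N <- 100
alpha <- 0.05; beta <- -0.01; sigma <- 0.01  # regression parameters
chi <- 0; tau <- 0.1                         # latent predictor mean and sd
SE_x <- 0.05; SE_y <- 0.005                  # measurement error SEs
# True (latent) values:
x_true <- rnorm(N, mean = chi, sd = tau)
y_true <- rnorm(N, mean = alpha + beta * x_true, sd = sigma)
# Observed values, generated with measurement error:
x_obs <- rnorm(N, mean = x_true, sd = SE_x)
y_obs <- rnorm(N, mean = y_true, sd = SE_y)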


13.2.1.1 The brms version of the measurement error model

In brms, the model specification would be as follows:

priors_me <- c(
  prior(normal(0, 0.5), class = Intercept),
  prior(normal(0, 0.5), class = b),
  prior(normal(0, 0.5), class = meanme),
  prior(normal(0, 0.5), class = sdme),
  prior(normal(0, 0.5), class = sigma)
)

Here, the parameters with class meanme and sdme refer to the unknown mean and standard
deviation (standard error) of the latent unknown means that generate the underlying PCU
means, χ and τ in (13.11). Once we decide on the priors, we use resp_se(.) with sigma =
TRUE (i.e., we don't estimate y_n,TRUE explicitly), and we use me(c_mean_pcu, se_pcu) to
indicate that the predictor c_mean_pcu is measured with error and that se_pcu is its SE.

fit_indiv_me <- brm(mean_rspeed | resp_se(se_rspeed, sigma = TRUE) ~
                      me(c_mean_pcu, se_pcu),
                    data = df_indiv,
                    family = gaussian(),
                    prior = priors_me)

fit_indiv_me

## ...
## Population-Level Effects:
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept              0.05      0.00     0.05     0.06 1.00     4089     2684
## mec_mean_pcuse_pcu    -0.00      0.01    -0.01     0.01 1.00     6359     3315
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.01      0.00     0.01     0.01 1.00     6813     3025
##
## ...

# Proportion of samples below zero.
# Parameter names can be found out with `variables(fit_indiv_me)`:
(Pb_me <- mean(as_draws_df(fit_indiv_me)$bsp_mec_mean_pcuse_pcu < 0))

## [1] 0.622

The posterior for the slope is plotted in Figure 13.4; this figure shows that the association
between PCU scores and reading speed is much weaker once measurement error is taken
into account: the posterior is much more uncertain (much more widely distributed) than in the
simple linear model we fit above (compare Figure 13.4 with 13.3), and the direction of the
association is now unclear, with 62% of the probability mass below zero, rather than 94%.

FIGURE 13.4: The posterior distribution of the slope in the measurement error model,
modeling the effect of centered mean PCU on mean reading speed.
Figure 13.5 visualizes the main reason why we find no clear association in the measurement
error analysis: the two points in the top left part of the plot that were driving the effect have
very large SEs for the measurement of reading speed. The code to produce the figure appears
below; it overlays 250 regression lines, corresponding to different samples from the posterior
distribution, on the measurements of reading speed and PCU.

df_reg <- as_draws_df(fit_indiv_me) %>%
  select(alpha = b_Intercept, beta = bsp_mec_mean_pcuse_pcu) %>%
  slice(1:250)
ggplot(df_indiv, aes(x = c_mean_pcu, y = mean_rspeed)) +
  geom_point() +
  geom_errorbarh(aes(xmin = c_mean_pcu - 2 * se_pcu,
                     xmax = c_mean_pcu + 2 * se_pcu),
                 alpha = .5, linetype = "dotted") +
  geom_errorbar(aes(ymin = mean_rspeed - 2 * se_rspeed,
                    ymax = mean_rspeed + 2 * se_rspeed),
                alpha = .5, linetype = "dotted") +
  geom_abline(aes(intercept = alpha, slope = beta),
              data = df_reg,
              alpha = .05)


FIGURE 13.5: The relationship between centered mean PCU scores and mean reading speed
accounting for measurement error. The error bars represent two standard errors. The
regression lines are produced with 250 samples of the intercept and slope from the posterior
distribution.
Of course, the conclusion here cannot be that there is no association between PCU scores
and reading speed. In order to argue for the absence of an effect, we would need to use
Bayes factors (see chapter 15) or cross-validation (see chapter 16).

13.2.1.2 The Stan version of the measurement error model

As with the meta-analysis, the main difficulty in fitting measurement error models directly in
Stan is that we need to reparameterize them to avoid correlations between samples of
different parameters. We need to make the following two changes to the parameterization of
our model presented in equation (13.11).

1. Sample from an auxiliary parameter z_n rather than directly from x_n,TRUE, as we did in
(13.8):

z_n ∼ Normal(0, 1)

x_n,TRUE = z_n · τ + χ

x_n ∼ Normal(x_n,TRUE, SE_x)

2. Don't model y_n,TRUE explicitly as in (13.10); rather, take into account the SE and the
variation on y_n,TRUE in the following way:

y_n ∼ Normal(α + β · x_n,TRUE, √(SE_y² + σ²))

We are now ready to write this in Stan; the code is in the model called me.stan:

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] SE_x;
  vector[N] y;
  vector[N] SE_y;
}
parameters {
  real alpha;
  real beta;
  real chi;
  real<lower = 0> sigma;
  real<lower = 0> tau;
  vector[N] z;
}
transformed parameters {
  vector[N] x_true = z * tau + chi;
}
model {
  target += normal_lpdf(x | x_true, SE_x);
  target += normal_lpdf(y | alpha + beta * x_true,
                        sqrt(square(SE_y) + square(sigma)));
  target += std_normal_lpdf(z);
  target += normal_lpdf(alpha | 0, 0.5);
  target += normal_lpdf(beta | 0, 0.5);
  target += normal_lpdf(chi | 0, 0.5);
  target += normal_lpdf(sigma | 0, 0.5)
    - normal_lccdf(0 | 0, 0.5);
  target += normal_lpdf(tau | 0, 0.5)
    - normal_lccdf(0 | 0, 0.5);
}

Fit the model:

me <- system.file("stan_models",
                  "me.stan",
                  package = "bcogsci")
ls_me <- list(N = nrow(df_indiv),
              y = df_indiv$mean_rspeed,
              SE_y = df_indiv$se_rspeed,
              x = df_indiv$c_mean_pcu,
              SE_x = df_indiv$se_pcu)
fit_indiv_me_stan <- stan(me, data = ls_me)

print(fit_indiv_me_stan, pars = c("alpha", "beta", "sigma"))

##       mean  2.5% 97.5% n_eff Rhat
## alpha 0.05  0.05  0.06  3719    1
## beta  0.00 -0.01  0.01  6488    1
## sigma 0.01  0.01  0.01  6378    1

The posterior distributions are similar to those that we obtained with brms.

13.3 Summary

This chapter introduced two statistical tools that are potentially of great relevance to cognitive
science: random-effects meta-analysis and measurement error models. Despite the inherent
limitations of meta-analysis, it should be used routinely to accumulate knowledge through
systematic evidence synthesis. Measurement error models can also prevent the
over-enthusiastic conclusions that are often drawn from noisy data.

13.4 Further reading

For some examples of Bayesian meta-analyses in psycholinguistics, see Vasishth et al.


(2013), Jäger, Engelmann, and Vasishth (2017), Nicenboim, Roettger, and Vasishth (2018),
Nicenboim, Vasishth, and Rösler (2020a), Bürki et al. (2020), Cox et al. (2022), and Bürki,
Alario, and Vasishth (2022). A frequentist meta-analysis of priming effects in psycholinguistics
appears in Mahowald et al. (2016). Sutton et al. (2012) and Higgins and Green (2008) are two
useful general introductions that discuss systematic reviews, meta-analysis, and evidence
synthesis; these two references are from medicine, where meta-analysis is more widely used
than in cognitive science. A potentially important article for meta-analysis introduces a
methodology for modeling bias, to adjust for different kinds of bias in the data (Turner et al.
2008).

13.5 Exercises

Exercise 13.1 A meta-analysis of picture-word interference data

Load the following data set:

data("df_buerki")
head(df_buerki)

##                  study   d    se study_id
## 1 Collina 2013 Exp.1 a  24 13.09        1
## 2 Collina 2013 Exp.1 b -25 17.00        2
## 3   Collina 2013 Exp.2  46 22.79        3
## 4     Mahon 2007 Exp.1  17 12.24        4
## 5     Mahon 2007 Exp.2  57 13.96        5
## 6    Mahon 2007 Exp. 4  17  8.01        6


df_buerki <- subset(df_buerki, se > 0.60)

The data are from Bürki et al. (2020). We have a summary of the effect estimates (d) and
standard errors (se) of the estimates from 162 published experiments on a phenomenon
called semantic picture-word interference. We removed an implausibly low SE in the code
above, but the results don't change regardless of whether we keep it or not, because we
have data from a lot of studies.
In this experimental paradigm, subjects are asked to name a picture while ignoring a distractor
word (which is either related or unrelated to the picture). The word can be printed on the
picture itself, or presented auditorily. The dependent measure is the response latency, or time
interval between the presentation of the picture and the onset of the vocal response. Theory
says that distractors that come from the same semantic category as the picture to be named
lead to a slower response than distractors from a different semantic category.

Carry out a random-effects meta-analysis using brms and display the posterior distribution of
the effect, along with the posterior of the between-study standard deviation.

Choose Normal(0, 100) priors for the intercept and between-study sd parameters. You can
also try more vague priors (sensitivity analysis). Examples would be:

Normal(0, 200)

Normal(0, 400)

Exercise 13.2 Measurement error model for English VOT data

Load the following data:

data("df_VOTenglish")
head(df_VOTenglish)

##   subject meanVOT seVOT meanvdur sevdur
## 1     F01   108.1  4.56      171   11.7
## 2     F02    92.5  4.62      189   12.7
## 3     F03    82.6  3.13      171   10.0
## 4     F04    88.3  3.21      168   11.8
## 5     F05    94.6  3.67      166   15.0
## 6     F06    75.9  3.70      176   12.9

You are given mean voice onset time (VOT) data (with SEs) in milliseconds for English, along
with mean vowel durations (with SEs) in milliseconds. Fit a measurement-error model
investigating the effect of mean vowel duration on mean VOT duration. First plot the
relationship between the two variables; does it look like there is an association between the
two?
Then use brms with measurement error included in both the dependent and independent
variables. Do a sensitivity analysis to check the influence of the priors on the posteriors of the
relevant parameters.

References

Bürki, Audrey, Francois-Xavier Alario, and Shravan Vasishth. 2022. “When Words Collide:
Bayesian Meta-Analyses of Distractor and Target Properties in the Picture-Word Interference
Paradigm.” Quarterly Journal of Experimental Psychology.

Bürki, Audrey, Shereen Elbuy, Sylvain Madec, and Shravan Vasishth. 2020. “What Did We
Learn from Forty Years of Research on Semantic Interference? A Bayesian Meta-Analysis.”
Journal of Memory and Language. https://doi.org/10.1016/j.jml.2020.104125.

Conway, Andrew RA, Michael J Kane, Michael F Bunting, D Zach Hambrick, Oliver Wilhelm,
and Randall W Engle. 2005. “Working Memory Span Tasks: A Methodological Review and
User’s Guide.” Psychonomic Bulletin & Review 12 (5). Springer: 769–86.

Cox, Christopher Martin Mikkelsen, Tamar Keren-Portnoy, Andreas Roepstorff, and Riccardo
Fusaroli. 2022. “A Bayesian Meta-Analysis of Infants’ Ability to Perceive Audio–Visual
Congruence for Speech.” Infancy 27 (1). Wiley Online Library: 67–96.

Denckla, Martha Bridge, and Rita G Rudel. 1976. “Rapid ‘Automatized’ Naming (RAN): Dyslexia
Differentiated from Other Learning Disabilities.” Neuropsychologia 14 (4). Elsevier: 471–79.

Higgins, Julian, and Sally Green. 2008. Cochrane Handbook for Systematic Reviews of
Interventions. New York: Wiley-Blackwell.

Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference
in Sentence Comprehension: Literature review and Bayesian meta-analysis.” Journal of
Memory and Language 94: 316–39. https://doi.org/10.1016/j.jml.2017.01.004.

Mahowald, Kyle, Ariel James, Richard Futrell, and Edward Gibson. 2016. “A Meta-Analysis of
Syntactic Priming in Language Production.” Journal of Memory and Language 91. Elsevier: 5–
27.

McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and
Stan. Boca Raton, Florida: Chapman; Hall/CRC.

Nicenboim, Bruno, Timo B. Roettger, and Shravan Vasishth. 2018. “Using Meta-Analysis for
Evidence Synthesis: The case of incomplete neutralization in German.” Journal of Phonetics
70: 39–55. https://doi.org/10.1016/j.wocn.2018.06.001.
Nicenboim, Bruno, Shravan Vasishth, Felix Engelmann, and Katja Suckow. 2018. “Exploratory
and Confirmatory Analyses in Sentence Processing: A case study of number interference in
German.” Cognitive Science 42 (S4). https://doi.org/10.1111/cogs.12589.

Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020a. “Are Words Pre-Activated
Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian
Random-Effects Meta-Analysis Using Publicly Available Data.” Neuropsychologia 142.
https://doi.org/10.1016/j.neuropsychologia.2020.107427.

Sutton, Alexander J, Nicky J Welton, Nicola Cooper, Keith R Abrams, and AE Ades. 2012.
Evidence Synthesis for Decision Making in Healthcare. Vol. 132. John Wiley & Sons.

Turner, R.M., D.J. Spiegelhalter, G. Smith, and S.G. Thompson. 2008. “Bias Modelling in
Evidence Synthesis.” Journal of the Royal Statistical Society: Series A (Statistics in Society)
172 (1). Wiley Online Library: 21–47.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese
Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public
Library of Science: 1–14.

Chapter 14 Introduction to model comparison
A key goal of cognitive science is to decide which theory under consideration accounts for the
experimental data better. This can be accomplished by implementing the theories (or some
aspects of them) as Bayesian models and comparing their predictive power. Thus, model
comparison and hypothesis testing are closely related ideas. There are two Bayesian
perspectives on model comparison: a prior predictive perspective based on the Bayes factor
using marginal likelihoods, and a posterior predictive perspective based on cross-validation.
The main difference between the prior predictive approach (Bayes factor) and
the posterior predictive approach (cross-validation) is the following: The Bayes factor
examines how well the model (prior and likelihood) explains the experimental data. By
contrast, the posterior predictive approach assesses model predictions for held-out data after
seeing most of the data.

That is, the predictive accuracy of the Bayes factor is only based on its prior predictive
distribution. In Bayes factor analyses, the prior model predictions are used to evaluate the
support that the data give to the model. By contrast, in cross-validation, the model is fit to a
large subset of the data (i.e., the training data). The posterior distributions of the parameters
of this fitted model are then used to make predictions for held-out or validation data, and
model fit is assessed on this subset of the data. Typically, this process is repeated several
times, until the entire data set is assessed as held-out data. This attempts to assess whether
the model will generalize to truly new, unobserved data. Of course, the held-out data is usually
not “truly new” because it is part of the data that was collected, but at least it is data that the
model has not been exposed to. That is, the predictive accuracy of cross-validation methods is
based on how well the posterior predictive distribution that is fit to most of the data (i.e., the
training data) characterizes out-of-sample data (i.e., the test or held-out data).

The prior predictive distribution is obviously highly sensitive to the priors: it evaluates the
probability of the observed data under prior assumptions. By contrast, the posterior predictive
distribution is less dependent on the priors because the priors are combined with the
likelihood (and are thus less influential, given sufficient data) before making predictions for
held-out validation data.
Jaynes (2003, chap. 20) compares these two perspectives to “a cruel realist” and “a fair
judge”. According to Jaynes, Bayes factor adopts the posture of a cruel realist, who “judge[s]
each model taking into account the prior information we actually have pertaining to it; that is,
we penalize a model if we do not have the best possible prior information about its
parameters, although that is not really a fault of the model itself.” By contrast, cross-validation
adopts the posture of a scrupulously fair judge, “who insists that fairness in comparing models
requires that each is delivering the best performance of which it is capable, by giving each the
best possible prior probability for its parameters (similarly, in Olympic games we might
consider it unfair to judge two athletes by their performance when one of them is sick or
injured; the fair judge might prefer to compare them when both are doing their absolute best).”

Regardless of whether we use Bayes factor or cross-validation or any other method for model
comparison, there are several important points that one should keep in mind:

1. Although the objective of model comparison might ultimately be to find out which of the
models under consideration generalizes better, this generalization can only be done well
within the range of the observed data (see Vehtari and Lampinen 2002; Vehtari and
Ojanen 2012). That is, if one hypothesis, implemented as the model M1, is shown to be
superior to a second hypothesis, implemented as the model M2, according to Bayes
factor and/or cross-validation when evaluated with a young western university student
population, this doesn't mean that M1 will be superior to M2 when it is evaluated with a
broader population (and in fact it seems that many times it won't; see Henrich, Heine, and
Norenzayan 2010). However, if we can't generalize even within the range of the observed
data (e.g., university students in the northern part of the western hemisphere), there is no
hope of generalizing outside of that range (e.g., non-university students). Navarro (2019)
argues that one of the most important functions of a model is to encourage directed
exploration of new territory; our view is that this makes sense only if historical data are also
accounted for (this is analogous to regression testing in software development: existing
functionality/empirical coverage should not be lost when the model is extended to cover
new data). In practice, what this means for us is that evaluating a model's performance
should be carried out using historical benchmark data in addition to any new data one
has; just using isolated pockets of new data to evaluate a model is not convincing. For an
example from psycholinguistics of model evaluation using historical benchmark data, see
Nicenboim, Vasishth, and Rösler (2020b).

2. Model comparison can provide a quantitative way to evaluate models, but this cannot
replace understanding the qualitative patterns in the data (see, e.g., Navarro 2019). A
model can provide a good fit by behaving in a way that contradicts our substantive
knowledge. For example, Lissón et al. (2021) examine two computational models of
sentence comprehension. One of the models yielded higher predictive accuracy when the
parameter that is related to the probability of correctly comprehending a sentence was
higher for impaired subjects (individuals with aphasia) than for the control population. This
contradicts domain knowledge—impaired subjects are generally observed to show worse
performance than unimpaired control subjects—and led to a re-evaluation of the model.
3. Model comparison is based on finding the most “useful model” for characterizing our data,
but neither the Bayes factor nor cross-validation (nor any other method that we are aware
of) guarantees selecting the model closest to the truth (even with enough data). This is
related to our previous point: a model that's closest to the true data-generating process is
not guaranteed to produce the best (prior or posterior) predictions, and a model with a
clearly wrong data-generating process is not guaranteed to produce poor (prior or
posterior) predictions. See Wang and Gelman (2014) for an example with cross-validation,
and Navarro (2019) for a toy example with Bayes factor.

4. One should also check that the precision (the uncertainty) of the data being modeled is
high; if an effect is being modeled that has high uncertainty (the posterior distribution of
the target parameter is widely spread out), then any measure of model fit can be
uninformative because we don’t have accurate estimates of the effect of interest. In the
Bayesian context, this implies that the prior predictive and posterior predictive
distributions of the effects generated by the model should be theoretically plausible and
reasonably constrained, and the target parameter of interest should have as high
precision as possible; this implies that we need to have sufficient data if we want to obtain
precise estimates of the parameter of interest. Later in this part of the book, we will
discuss the adverse impact of imprecision in the data on model comparison (see section
15.5.2). We will show that, in the face of low precision, we generally won’t learn much
from model comparison.

5. When comparing a null model with an alternative model, it is important to be clear about
what the null model specification is. For example, in section 5.2.4, we encountered the
correlated varying intercepts and varying slopes model for the Stroop effect. The brms
formula for the full model was:

n400 ~ 1 + c_cloze + (1 + c_cloze | subj)

If we want to test the null hypothesis that centered cloze has no effect on the dependent
variable, one null model is:

n400 ~ 1 + (1 + c_cloze | subj) (Model M0a)


In model M0a , by-subject variability is allowed; just the fixed effect of centered cloze is
assumed to be zero. This is called a nested model comparison, because the null model is
subsumed in the full model.

An alternative null model could remove only the varying slopes:

n400 ~ 1 + c_cloze + (1 | subj) (Model M0b)

Model M0b , which is also nested inside the full model, is testing a different null hypothesis
than M0a above: is the between-subject variability in the centered cloze effect zero?

Yet another possibility is to remove both the fixed and random effects of centered cloze:

n400 ~ 1 + (1 | subj) (Model M0c)

Model M0c is also nested inside the full model, but it now has two parameters missing
instead of one. Usually, it is best to compare models by removing one parameter; otherwise,
one cannot be sure which parameter was responsible for our rejecting or accepting the null
hypothesis. A sketch of how such models could be specified appears below.
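
For concreteness, here is a sketch of how the full model and the null model M0a could be
specified in brms; the data frame name df is a placeholder, and priors and other arguments
are omitted for brevity:

# Hypothetical sketch: df is a placeholder data frame with columns
# n400, c_cloze, and subj.
fit_full <- brm(n400 ~ 1 + c_cloze + (1 + c_cloze | subj),
                data = df)
fit_M0a <- brm(n400 ~ 1 + (1 + c_cloze | subj),
               data = df)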

Box 14.1 Credible intervals should not be used to reject a null hypothesis

Researchers often incorrectly use credible intervals for null hypothesis testing, that is, to
test whether a parameter β is zero or not. A common approach is to check whether zero is
included in the 95% credible interval for the parameter β; if it is, then the null hypothesis
that the effect is zero is accepted; and if zero is outside the interval, then the null is
rejected. For example, in a tutorial paper that two of the authors of this book wrote
(Nicenboim and Vasishth 2016), we incorrectly suggest that the credible interval can be
used to reject the hypothesis that the β is zero. This is not the correct approach.

The problem with this approach is that it is a heuristic that will work in some cases and
might be misleading in others (for an example, see Vasishth, Yadav, et al. 2022).
Unfortunately, when this heuristic will work and when it will mislead is in fact not well defined.

Why is the credible-interval approach only a heuristic? One line of (incorrect) reasoning
that justifies looking at the overlap between credible intervals and zero is based on the
fact that the most likely values of β lie within the 95% credible interval.41 This entails that if
zero is outside the interval, it must have a low probability density. This is true, but it's
meaningless: regardless of where zero lies (or any point value), zero will have a
probability mass of exactly zero, since we are dealing with a continuous distribution. The
lack of overlap doesn't tell us how much posterior probability the null model has.
A partial solution could be to look at a probability interval close to zero rather than zero
(e.g., an interval of, say, −2 to 2 ms in a response time experiment), so that we obtain a
non-zero probability mass. While the lack of overlap would be slightly more informative,
excluding a small interval can be problematic when the prior probability mass of that
interval is very small to begin with (as was the case with the regularizing priors we
assigned to our parameters). Rouder, Haaf, and Vandekerckhove (2018) show that if prior
probability mass is added to the point value zero using a spike-and-slab prior (or if
probability mass is added to the small interval close to zero if one considers that
equivalent to the null model), looking at whether zero is in the 95% credible interval is
analogous to the Bayes factor. Unfortunately, the spike-and-slab prior cannot be
incorporated in Stan, because it relies on a discrete parameter. However, other
programming tools (like PyMC3, JAGS, or Turing) can be used if such a prior needs to be
fit; see further readings.

Rather than looking at the overlap of the 95% credible interval, we might be tempted to
conclude that there is evidence for an effect because the probability that a parameter is
positive is high, that is P (β > 0) >> 0.5 . However, the same logic from the previous
paragraph renders this meaningless. Given that the probability mass of a point value,
P (β = 0) , is zero, what we can conclude from P (β > 0) >> 0.5 is that β is very likely
to be positive rather than negative, but we can’t make any assertions about whether β is
exactly zero.

As we saw, the main problem with these heuristics is that they ignore that the null model is
a separate hypothesis. In many situations, the null hypothesis may not be of interest, and
it might be perfectly fine to base our conclusions on credible intervals or P (β > 0) . The
problem arises when these heuristics are used to provide evidence in favor or against the
null hypothesis. If one wants to argue about the evidence in favor of or against a null
hypothesis, Bayes factors or cross-validation will be needed. These are discussed in the
next two chapters.

How can credible intervals be used sensibly? The region of practical equivalence (ROPE)
approach (Spiegelhalter, Freedman, and Parmar 1994; Freedman, Lowe, and Macaskill
1984; and, more recently, Kruschke and Liddell 2018; Kruschke 2014) is a reasonable
alternative to hypothesis testing and arguing for or against a null. This approach is related
to the spike-and-slab discussion above. In the ROPE approach, one can define a range of
values for a target parameter that is predicted before the data are seen. Of course, there
has to be a principled justification for choosing this range a priori; an example of a
principled justification would be the prior predictions of a computational model. Then, the
overlap (or lack thereof) between this predicted range and the observed credible interval
can be used to infer whether one has estimates consistent (or partly consistent) with the
predicted range. Here, we are not ruling out any null hypothesis, and we are not using the
credible interval to make a decision like “the null hypothesis is true/false.”
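
As a minimal sketch of the ROPE idea (our own code, with simulated stand-in draws; in a
real analysis, the draws would come from as_draws_df() applied to a fitted model, and the
range would have to be justified before seeing the data):

# Stand-in for posterior samples of the target parameter:
draws_beta <- rnorm(4000, mean = 30, sd = 15)
# Posterior probability mass inside a predicted range of 10-50 ms:
mean(draws_beta > 10 & draws_beta < 50)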

14.1 Further reading

Roberts and Pashler (2000) and Pitt and Myung (2002) argue for the need to go beyond “a
good fit” (that is, a good posterior predictive check in the context of Bayesian data analysis),
and for model comparison with a focus on measuring the generalizability of a model.
Navarro (2019) deals with the problematic aspects of model selection in the context of the
psychological literature and cognitive modeling. Fabian Dablander's blog post,
https://fabiandablander.com/r/Law-of-Practice.html, shows a very clear comparison between
Bayes factor and PSIS-LOO-CV. Rodriguez, Williams, and Rast (2021) provide JAGS code
for fitting models with spike-and-slab priors. Fabian Dablander has a comprehensive blog post
on how to implement a Gibbs sampler in R when using such a prior:
https://fabiandablander.com/r/Spike-and-Slab.html.

References

Freedman, Laurence S., D. Lowe, and P. Macaskill. 1984. “Stopping Rules for Clinical Trials
Incorporating Clinical Opinion.” Biometrics 40 (3): 575–86.

Henrich, Joseph, Steven J. Heine, and Ara Norenzayan. 2010. “The Weirdest People in the
World?” Behavioral and Brain Sciences 33 (2-3). Cambridge University Press: 61–83.
https://doi.org/10.1017/S0140525X0999152X.

Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge university press.

Kruschke, John. 2014. Doing Bayesian Data Analysis: A tutorial with R, JAGS, and Stan.
Academic Press.

Kruschke, John, and Torrin M Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing,
Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic
Bulletin & Review 25 (1). Springer: 178–206.

Lissón, Paula, Dorothea Pregla, Bruno Nicenboim, Dario Paape, Mick van het Nederend,
Frank Burchert, Nicole Stadie, David Caplan, and Shravan Vasishth. 2021. “A Computational
Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia.”
Cognitive Science 45 (4): e12956. https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.12956.
Navarro, Danielle J. 2019. “Between the Devil and the Deep Blue Sea: Tensions Between
Scientific Judgement and Statistical Model Selection.” Computational Brain & Behavior 2 (1):
28–34. https://doi.org/10.1007/s42113-018-0019-z.

Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical methods for linguistic research:
Foundational Ideas - Part II.” Language and Linguistics Compass 10 (11): 591–613.
https://doi.org/10.1111/lnc3.12207.

Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020b. “Are Words Pre-Activated
Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian
Random-Effects Meta-Analysis Using Publicly Available Data.” Neuropsychologia 142.
https://doi.org/10.1016/j.neuropsychologia.2020.107427.

Pitt, Mark A., and In Jae Myung. 2002. “When a Good Fit Can Be Bad.” Trends in Cognitive
Sciences 6 (10): 421–25. https://doi.org/10.1016/S1364-6613(02)01964-2.

Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on
Theory Testing.” Psychological Review 107 (2): 358–67.

Rodriguez, Josue E, Donald R Williams, and Philippe Rast. 2021. “Who Is and Is Not
‘Average’? Random Effects Selection with Spike-and-Slab Priors.” PsyArXiv.

Rouder, Jeffrey N., Julia M Haaf, and Joachim Vandekerckhove. 2018. “Bayesian Inference for
Psychology, Part IV: Parameter Estimation and Bayes Factors.” Psychonomic Bulletin &
Review 25 (1): 102–13.

Spiegelhalter, David J, Laurence S. Freedman, and Mahesh KB Parmar. 1994. “Bayesian


Approaches to Randomized Trials.” Journal of the Royal Statistical Society. Series A (Statistics
in Society) 157 (3): 357–416.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample
Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.”
Computational Brain and Behavior.

Vehtari, Aki, and Jouko Lampinen. 2002. “Bayesian Model Assessment and Comparison Using
Cross-Validation Predictive Densities.” Neural Computation 14 (10): 2439–68.
https://doi.org/10.1162/08997660260293292.

Vehtari, Aki, and Janne Ojanen. 2012. “A Survey of Bayesian Predictive Methods for Model
Assessment, Selection and Comparison.” Statistical Surveys 6 (0). Institute of Mathematical
Statistics: 142–228. https://doi.org/10.1214/12-ss102.
Wang, Wei, and Andrew Gelman. 2014. “Difficulty of Selecting Among Multilevel Models Using
Predictive Accuracy.” Statistics at Its Interface 7: 1–8.

41. This is also strictly true only in a highest density interval (HDI), that is, a credible interval
where all the points within the interval have a higher probability density than points
outside the interval. However, when posterior distributions are symmetrical, these
intervals are virtually identical to the equal-tail intervals we use in this book.↩

Chapter 15 Bayes factors

This chapter is based on a longer manuscript available on arXiv: Schad et al. (2021).
Bayesian approaches provide tools for different aspects of data analysis. A key contribution of
Bayesian data analysis to cognitive science is that it provides probabilistic ways to quantify the
evidence that data provide in support of one model or another. Models provide ways to
implement scientific hypotheses; as a consequence, model comparison and hypothesis testing
are closely related. There are two kinds of hypotheses: point hypotheses, which state that a
model parameter has a specific point value, such as zero. By contrast, range hypotheses
specify that a parameter exists and is needed to explain the data, but they do not specify the
parameter value, which can be estimated from data. Bayesian hypothesis testing
of range hypotheses is implemented using Bayes factors (Rouder, Haaf, and Vandekerckhove
2018; Schönbrodt and Wagenmakers 2018; Wagenmakers et al. 2010; Kass and Raftery
1995; Gronau et al. 2017a; Jeffreys 1939), which quantify evidence in favor of one statistical
(or computational) model over another. Point hypotheses are the norm in frequentist
hypothesis testing, and can be implemented in Bayesian analyses using posterior density
ratios. This chapter will focus on Bayes factors as the way to compare models and to obtain
evidence about (range) hypotheses.

There are subtleties associated with Bayes factors that are not widely appreciated. For
example, the results of Bayes factor analyses are highly sensitive to and crucially depend on
prior assumptions about model parameters (we will illustrate this below), which can vary
between experiments/research problems and even differ subjectively between different
researchers. Many authors use or recommend so-called default prior distributions, where the
prior parameters are fixed, and are independent of the scientific problem in question
(Hammerly, Staub, and Dillon 2019; Navarro 2015). However, default priors result in an overly
simplistic perspective on Bayesian hypothesis testing, and can be misleading. For this reason,
even though leading experts in the use of Bayes factor, such as Rouder et al. (2009), often
provide default priors for computing Bayes factors, they also make it clear that: “simply put,
principled inference is a thoughtful process that cannot be performed by rigid adherence to
defaults” (Rouder et al. 2009, 235). However, this observation does not seem to have had
much impact on how Bayes factors are used in fields like psychology and psycholinguistics;
the use of default priors when computing Bayes factor seems to be widespread.
Given the key influence of priors on Bayes factors, defining priors becomes a central issue
when using Bayes factors. The priors determine which models will be compared.

In this chapter, we demonstrate how Bayes factors should be used in practical settings in
cognitive science. In doing so, we demonstrate the strength of this approach and some
important pitfalls that researchers should be aware of.

15.1 Hypothesis testing using the Bayes factor

15.1.1 Marginal likelihood

Bayes’ rule can be written with reference to a specific statistical model M1:

p(Θ ∣ y, M1) = p(y ∣ Θ, M1) p(Θ ∣ M1) / p(y ∣ M1)

Here, y refers to the data and Θ is a vector of parameters; for example, this vector could
include the intercept, slope, and variance component in a linear regression model.

The denominator p(y ∣ M1 ) is the marginal likelihood, and is a single number that gives us
the likelihood of the observed data y given the model M1 (and only in the discrete case, it
gives us the probability of the observed data y given the model; see section 1.7). Because in
general it’s not a probability, it should be interpreted relative to another marginal likelihood
(evaluated at the same y).

In frequentist statistics, it’s also common to quantify evidence for the model by determining the
maximum likelihood, that is, the likelihood of the data given the best-fitting model parameter.
Thus, the data is used twice: once for fitting the parameter, and then for evaluating the
likelihood. Importantly, this inference completely hinges upon this best-fitting parameter to be
a meaningful value that represents well what we know about the parameter, and doesn’t take
the uncertainty of the estimates into account. Bayesian inference quantifies the uncertainty
that is associated with a parameter, that is, one accepts that the knowledge about the
parameter value is uncertain. Computing the marginal likelihood entails computing the
likelihood given all plausible values for the model parameter.

One difficulty with Bayes’ rule as shown above is that the marginal likelihood p(y ∣ M1) in
the denominator cannot be easily computed:

p(Θ ∣ y, M1) = p(y ∣ Θ, M1) p(Θ ∣ M1) / p(y ∣ M1)

The marginal likelihood does not depend on the model parameters Θ; the parameters are
“marginalized” or integrated out:

p(y ∣ M1) = ∫ p(y ∣ Θ, M1) p(Θ ∣ M1) dΘ     (15.1)

The likelihood is evaluated for every possible parameter value, weighted by the prior
plausibility of the parameter values. The product p(y ∣ Θ, M1 )p(Θ ∣ M1 ) is then summed
up (that is what the integral does).

For this reason, the prior is as important as the likelihood. Equation (15.1) also looks almost
identical to the prior predictive distribution from section 3.3 (that is, the predictions that the
model makes before seeing any data). The prior predictive distribution is repeated below for
convenience:

p(y_pred) = p(y_pred_1, …, y_pred_N) = ∫ p(y_pred_1 ∣ Θ) · p(y_pred_2 ∣ Θ) ⋯ p(y_pred_N ∣ Θ) p(Θ) dΘ

However, while the prior predictive distribution describes possible observations, the marginal
likelihood is evaluated on the actually observed data.

Let’s compute the Bayes factor for a very simple example case. We assume a study where we
assess the number of “successes” observed in a fixed number of trials. For example, suppose
that we have 80 “successes” out of 100 trials. A simple model of this data can be built by
assuming, as we did in section 1.4, that the data are distributed according to a binomial
distribution. In a binomial distribution, n independent experiments are performed, where the
result of each experiment is either a “success” or “no success” with probability θ. The binomial
distribution is the probability distribution of the number of successes k (number of “success”
responses) in such a sample of experiments; X denotes the random variable representing
the number of successes.

Suppose now that we have prior information about the probability parameter θ. As we
explained in section 2.2, a typical prior distribution for θ is a beta distribution. The beta
distribution defines a probability distribution on the interval [0, 1] , which is the interval on
which the probability θ is defined. It has two parameters a and b, which determine the shape
of the distribution. The prior parameters a and b can be interpreted as the a priori number of
“successes” versus “failures.” These could be based on previous evidence, or on the
researcher’s beliefs, drawing on their domain knowledge (O’Hagan et al. 2006).
Here, to illustrate the calculation of the Bayes factor, we assume that the parameters of the
beta distribution are a = 4 and b = 2. As mentioned above, these parameters can be
interpreted as representing “success” (4 prior observations representing success) and “no
success” (2 prior observations representing “no success”). The resulting prior distribution is
visualized in Figure 15.1. A Beta(a = 4, b = 2) prior on θ amounts to a regularizing prior with
some, but no clear, prior evidence for more than 50% success.
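
Figure 15.1 can be reproduced with a few lines of code; the following is our own sketch,
not necessarily the code used for the figure:

# Plot the Beta(4, 2) prior density over theta in [0, 1]:
curve(dbeta(x, shape1 = 4, shape2 = 2),
      from = 0, to = 1,
      xlab = "theta", ylab = "density")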


FIGURE 15.1: Beta distribution with parameters a = 4 and b = 2.

To compute the marginal likelihood, equation (15.1) shows that we need to multiply the
likelihood with the prior. The marginal likelihood is then the area under the curve, that is, the
likelihood averaged across all possible values for the model parameter (the probability of
success).

Based on this data, likelihood, and prior we can calculate the marginal likelihood, that is, this
area under the curve, in the following way using R:42

# First we multiply the likelihood with the prior:
plik1 <- function(theta) {
  dbinom(x = 80, size = 100, prob = theta) *
    dbeta(x = theta, shape1 = 4, shape2 = 2)
}
# Then we integrate (compute the area under the curve):
(MargLik1 <- integrate(f = plik1, lower = 0, upper = 1)$value)

## [1] 0.02
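
As a cross-check (our own addition), a simple grid approximation of the same integral gives
essentially the same value:

theta_grid <- seq(0, 1, by = 0.001)
# Riemann-sum approximation of the integral:
sum(plik1(theta_grid)) * 0.001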

One would prefer a model that gives a higher marginal likelihood, i.e., a higher likelihood of
observing the data after integrating out the influence of the model parameter(s) (here: θ). A
model will yield a high marginal likelihood if it makes a high proportion of good predictions
(i.e., model 2 in Figure 15.2; the figure is adapted from Bishop 2006). Model predictions are
normalized, that is, the total probability that models assign to different expected data patterns
is the same for all models. Models that are too flexible (model 3 in Figure 15.2) will divide their
prior predictive probability density across all of their predictions. Such models can predict
many different outcomes. Thus, they likely can also predict the actually observed outcome.
However, due to the normalization, they cannot predict it with high probability, because they
also predict all kinds of other outcomes. This is true both for models with priors that are too
wide and for models with too many parameters. Bayesian model comparison automatically
penalizes such complex models; this penalty is called the “Occam factor” (MacKay 2003).
FIGURE 15.2: Shown are the schematic marginal likelihoods that each of three models
assigns to different possible data sets. The total probability each model assigns to the data is
equal to one, i.e., the areas under the curves of all three models are the same. Model 1
(black), the low complexity model, assigns all the probability to a narrow range of possible
data, and can predict these possible data sets with high likelihood. Model 3 (light grey)
assigns its probability to a large range of different possible outcomes, but predicts each
individual observed data set with low likelihood (high complexity model). Model 2 (dark grey)
takes an intermediate position (intermediate complexity). The vertical dashed line (dark grey)
illustrates where the actual empirically observed data fall. The data most support model 2,
since this model predicts the data with highest likelihood. The figure is closely based on
Figure 3.13 in Bishop (2006).
By contrast, good models (Figure 15.2, model 2) will make very specific predictions, where the
specific predictions are consistent with the observed data. Here, all the predictive probability
density is located at the “location” where the observed data fall, and little probability density is
located at other places, providing good support for the model. Of course, specific predictions
can also be wrong, when expectations differ from what the observed data actually look like
(Figure 15.2, model 1).

Having a natural Occam factor is good for posterior inference, i.e., for assessing how much
(continuous) evidence there is for one model or another. However, it doesn't necessarily imply
good decision making or hypothesis testing, i.e., making discrete decisions about which
model explains the data best, or about which model to base further actions on.

Here, we provide two examples of more flexible models. First, the following model assumes
the same likelihood and the same distribution function for the prior. However, we assume a
flat, uninformative prior, with prior parameters a = 1 and b = 1 (i.e., only one prior “success”
and one prior “failure”), which provides more prior spread than the first model. Again, we can
formulate our model as multiplying the likelihood with the prior, and integrate out the influence
of the parameter θ:

plik2 <- function(theta) {
  dbinom(x = 80, size = 100, prob = theta) *
    dbeta(x = theta, shape1 = 1, shape2 = 1)
}
(MargLik2 <- integrate(f = plik2, lower = 0, upper = 1)$value)

## [1] 0.0099

We can see that this second model is more flexible: due to the more spread-out prior, it is
compatible with a larger range of possible observed data patterns. However, when we
integrate out the θ parameter to obtain the marginal likelihood, we can see that this flexibility
also comes with a cost: the model has a smaller marginal likelihood (0.0099) than the first
model (0.02). Thus, on average (averaged across all possible values of θ) the second model
performs worse in explaining the specific data that we observed compared to the first model,
and has less support from the data.

A model might be more “complex” because it has a more spread-out prior, or alternatively
because it has a more complex likelihood function, which uses a larger number of parameters
to explain the same data. Here we implement a third model, which assumes a more complex
likelihood by using a beta-binomial distribution. The beta-binomial distribution is similar to the
binomial distribution, with one important difference: In the binomial distribution the probability
of success θ is fixed across trials. In the beta-binomial distribution, the probability of success
is fixed for each trial, but is drawn from a beta distribution across trials. Thus, θ can differ
between trials. In the beta-binomial distribution, we thus assume that the likelihood function is
a combination of a binomial distribution and a beta distribution of the probability θ, which
yields:

p(X = k ∣ a, b) = C(n, k) · B(k + a, n − k + b) / B(a, b)

where C(n, k) is the binomial coefficient and B(·, ·) is the beta function.
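
As a sanity check (our own addition), this closed-form expression matches the beta-binomial
density function dbbinom() from the extraDistr package, which is used below:

library(extraDistr)
a <- 4; b <- 2  # arbitrary example values
# These two expressions should give the same number:
dbbinom(x = 80, size = 100, alpha = a, beta = b)
choose(100, 80) * beta(80 + a, 100 - 80 + b) / beta(a, b)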

What is important here is that this more complex distribution has two parameters (a and b,
rather than one, θ) to explain the same data. We assume log-normally distributed priors for the
a and b parameters, with location zero and scale 100. The likelihood of this combined beta-
binomial distribution is given by the R function dbbinom() in the package extraDistr. We
can now write down the likelihood times the priors (given as log-normal densities, dlnorm()),
and integrate out the influence of the two free model parameters a and b using numerical
integration (applying integrate() twice):

plik3 <- function(a, b) {
  dbbinom(x = 80, size = 100, alpha = a, beta = b) *
    dlnorm(x = a, meanlog = 0, sdlog = 100) *
    dlnorm(x = b, meanlog = 0, sdlog = 100)
}
# Compute the marginal likelihood by applying integrate twice:
f <- function(b) {
  integrate(function(a) plik3(a, b), lower = 0, upper = Inf)$value
}
# integrate requires a vectorized function:
(MargLik3 <- integrate(Vectorize(f), lower = 0, upper = Inf)$value)

## [1] 0.00000707

The results show that this third model has an even smaller marginal likelihood compared to
the first two (0.00000707). With its two parameters a and b, this third model has a lot of
flexibility to explain a lot of different patterns of observed empirical results. However, again,
this increased flexibility comes at a cost, and the simple pattern of observed data does not
seem to require such complex model assumptions. The small value for the marginal likelihood
indicates that this complex model has less support from the data.

That is, for this present simple example case, we would prefer model 1 over the other two,
since it has the largest marginal likelihood (0.02), and we would prefer model 2 over model 3,
since the marginal likelihood of model 2 (0.0099) is larger than that of model 3 (0.00000707).
The decision about which model is preferred is based on comparing the marginal likelihoods.

15.1.2 Bayes factor

The Bayes factor is a measure of relative evidence: it compares the predictive performance of
one model against that of another. This comparison is a ratio of marginal likelihoods:

\[
BF_{12} = \frac{P(y \mid M_1)}{P(y \mid M_2)}
\]

BF12 indicates the extent to which the data are more likely under M1 over M2 , or in other
words, which of the two models is more likely to have generated the data, or the relative
evidence that we have for M1 over M2 . Values larger than one indicate evidence in favor of
M1 , smaller than one indicate evidence in favor of M2 , and values close to one indicate that
the evidence is inconclusive. This model comparison does not depend on a specific parameter
value. Instead, all possible prior parameter values are taken into account simultaneously. This
is in contrast with the likelihood ratio test, as explained in Box 15.1.

Box 15.1 The likelihood ratio test vs. the Bayes factor.

The likelihood ratio test is a very similar, but frequentist, approach to model comparison
and hypothesis testing, which also compares the likelihood for the data given two different
models. We show this here to highlight the similarities and differences between frequentist
and Bayesian hypothesis testing. In contrast to the Bayes factor, the likelihood ratio test
depends on the “best” (i.e., the maximum likelihood) estimate for the model parameter(s),
that is, the model parameter θ occurs on the right side of the semi-colon in the equation
for each likelihood. (An aside: we do not use a conditional statement, i.e., the vertical bar,
when talking about likelihood in the frequentist context; instead, we use a semi-colon. This
is because the statement f (y ∣ θ) is a conditional statement, implying that θ has a
probability density function associated with it; in the frequentist framework, parameters
cannot have a pdf associated with them, they are assumed to have fixed, point values.)

\[
LikRat = \frac{P(y; \hat{\theta}_1, M_1)}{P(y; \hat{\theta}_2, M_2)}
\]

That means that in the likelihood ratio test, each model is tested on its ability to explain
the data using this “best” estimate for the model parameter (here, the maximum likelihood
estimate \(\hat{\theta}\)). That is, the likelihood ratio test reduces the full range of possible
parameter values to a point value, leading to overfitting the model to the maximum likelihood
estimate (MLE). If the MLE badly misestimates the true value of the parameter, due to
Type M/S error (Gelman and Carlin 2014), we could end up with a “significant” effect that
is just a consequence of this misestimation (it will not be consistently replicable; see
Vasishth, Mertzen, Jäger, et al. (2018a) for an example). By contrast, the Bayes factor
involves range hypotheses, which are implemented via integrals over the model
parameter; that is, it uses marginal likelihoods that are averaged across all possible
posterior values of the model parameter(s). Thus, if, due to Type M error, the best point
estimate (the MLE) for the model parameter(s) is not very representative of the possible
values for the model parameter(s), then Bayes factors will be superior to the likelihood
ratio test. An additional difference, of course, is that Bayes factors rely on priors for
estimating each model’s parameter(s), whereas the frequentist likelihood ratio test does
not (and cannot) consider priors in the estimation of the best-fitting model parameter(s).
As we show in this chapter, this has far-reaching consequences for Bayes factor-based
model comparison; for a more extensive exposition, see Schad et al. (2022) and Vasishth,
Yadav, et al. (2022).

For the Bayes factor, a scale (see Table 15.1) has been proposed to interpret Bayes factors
according to the strength of evidence in favor of one model (corresponding to some
hypothesis) over another (Jeffreys 1939); but this scale should not be regarded as a hard and
fast rule with clear boundaries.

TABLE 15.1: The Bayes factor scale as proposed by Jeffreys (1939). This scale should not be
regarded as a hard and fast rule.

BF12          Interpretation
> 100         Extreme evidence for M1.
30-100        Very strong evidence for M1.
10-30         Strong evidence for M1.
3-10          Moderate evidence for M1.
1-3           Anecdotal evidence for M1.
1             No evidence.
1/3-1         Anecdotal evidence for M2.
1/10-1/3      Moderate evidence for M2.
1/30-1/10     Strong evidence for M2.
1/100-1/30    Very strong evidence for M2.
< 1/100       Extreme evidence for M2.

So if we go back to our previous example, we can calculate BF12, BF13, and BF23. The
subscript represents the order in which the models are compared; for example, BF21 is simply
1/BF12.

\[
BF_{12} = \frac{\text{marginal likelihood model 1}}{\text{marginal likelihood model 2}} = \frac{MargLik_1}{MargLik_2} = 2
\]

\[
BF_{13} = \frac{MargLik_1}{MargLik_3} = 2825.4
\]

\[
BF_{32} = \frac{MargLik_3}{MargLik_2} = \frac{1}{BF_{23}} = \frac{1}{1399.9} = 0.001
\]

However, if we want to know, given the data y, what the probability for model M1 is, or how
much more probable model M1 is than model M2 , then we need the prior odds, that is, we
need to specify how probable M1 is compared to M2 a priori.

\[
\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(M_1)}{p(M_2)} \times \frac{P(y \mid M_1)}{P(y \mid M_2)}
\]

\[
\text{Posterior odds}_{12} = \text{Prior odds}_{12} \times BF_{12}
\]

The Bayes factor tells us, given the data and the priors, by how much we need to update our
relative belief between the two models. However, the Bayes factor alone cannot tell us
which one of the models is the most probable. Given our priors for the models and the
Bayes factor, we can calculate the odds between the models.

Here we compute posterior model probabilities for the case where we compare two models
against each other. However, posterior model probabilities can also be computed for the more
general case, where more than two models are considered:

\[
p(M_1 \mid y) = \frac{p(y \mid M_1)\, p(M_1)}{\sum_n p(y \mid M_n)\, p(M_n)}
\]

For simplicity, we mostly constrain ourselves to two models. (However, the sensitivity analyses
we carry out below compare more than two models.)
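As a brief illustration (our own sketch, not part of the original analysis), the posterior
probabilities of the three toy models above can be computed under the assumption that all
three models are equally probable a priori:

# marginal likelihoods of the three toy models (MargLik1 computed earlier)
margliks <- c(MargLik1, MargLik2, MargLik3)
prior_M <- rep(1 / 3, 3) # equal prior model probabilities
# posterior model probabilities:
margliks * prior_M / sum(margliks * prior_M)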

Bayes factors (and posterior model probabilities) tell us how much evidence the data (and
priors) provide in favor of one model or another. That is, they allow us to perform inferences
on the model space, i.e., to learn how much each hypothesis is consistent with the data.

A completely different issue, however, is the question of how to perform (discrete) decisions
based on continuous evidence. The question here is: which hypothesis should one choose to
maximize utility? While Bayes factors have a clear rationale and justification in terms of the
(continuous) evidence they provide, there is not a clear and direct mapping from inferences to
how to perform decisions based on them. To derive decisions based on posterior model
probabilities, utility functions are needed. Indeed, the utility of different possible actions (i.e., to
accept and act based on one hypothesis or another) can differ quite dramatically in different
situations. For example, for a researcher trying to implement a life-saving therapy, erroneously
rejecting this new therapy could have high negative utility, whereas erroneously adopting the
new therapy may have few negative consequences. By contrast, erroneously claiming a new
discovery in fundamental research may have bad consequences (low utility), whereas
erroneously failing to make a discovery claim may be less problematic if further evidence can
be accumulated. Thus, Bayesian evidence (in the form of Bayes factors or posterior model
probabilities) must be combined with utility functions in order to perform decisions based on
them. For example, this could imply specifying the utility of a true discovery (\(U_{TD}\)) and
the utility of a false discovery (\(U_{FD}\)). Calibration (i.e., simulations) can then be used to derive
decisions that maximize overall utility (see Schad et al. 2022).

The question now is how do we extend this method to models that we care about, i.e., that
represent more realistic data analysis situations. In cognitive science, we typically fit fairly
complex hierarchical models with many variance components. The major problem is that we
won’t be able to calculate the marginal likelihood for hierarchical models (or any other
complex model) either analytically or just using the R functions shown above. There are two
very useful methods for calculating the Bayes factor for complex models: the Savage–Dickey
density ratio method (Dickey, Lientz, and others 1970; Wagenmakers et al. 2010) and bridge
sampling (Bennett 1976; Meng and Wong 1996). The Savage–Dickey density ratio method is
a straightforward way to compute the Bayes factor, but it is limited to nested models. The
current implementation of the Savage–Dickey method in brms can be unstable, especially in
cases where the posterior is far away from zero. Bridge sampling is a much more powerful
method, but it requires many more effective samples than what is normally required for
parameter estimation. We will use bridge sampling from the bridgesampling package
(Gronau et al. 2017b; Gronau, Singmann, and Wagenmakers 2017) with the function
bayes_factor() to calculate the Bayes factor in the first examples.

15.2 Examining the N400 effect with Bayes factor

In section 5.2 we estimated the effect of cloze probability on the N400 average signal. This
yielded a posterior credible interval for the effect of cloze probability. It is certainly possible to
check whether e.g., the 95% posterior credible interval overlaps with zero or not. However,
such estimation cannot really answer the following question: How much evidence do we have
in support for an effect? A 95% credible interval that doesn’t overlap with zero, or a high
probability mass away from zero may hint that the predictor may be needed to explain the
data, but it is not really answering how much evidence we have in favor of an effect (for
discussion, see Royall 1997; Wagenmakers et al. 2020; Rouder, Haaf, and Vandekerckhove
2018).

This is a very important point, and is often overlooked in the literature. Many papers misuse
95% posterior credible intervals to argue that there is evidence for or against an effect. In the
past, we have also misused posterior credible intervals in this way (and even recommended
this incorrect interpretation in, for example, Nicenboim and Vasishth 2016).
The reason why the 95% posterior credible interval does not answer the question about
evidence for the alternative model M1 or the null model M0 is that we do not explicitly
consider and quantify the possibility that the parameter estimate is zero: we do not quantify
the likelihood of the data under the assumption that the effect is absent; see also Box 14.1.
The Bayes factor answers this question about the evidence in favor of an effect by explicitly
conducting a model comparison. We will compare a model that assumes the presence of an
effect, with a null model that assumes no effect.

As we saw before, the Bayes factor is highly sensitive to the priors. In the example presented
above, both models are identical except for the effect of interest, β, and so the prior on this
parameter will play a major role in the calculation of the Bayes factor.

Next, we will run a hierarchical model which includes random intercepts and slopes by items
and by subjects. We will use regularizing priors on all the parameters; this speeds up
computation and implies realistic expectations about the parameters. However, the prior on β
will be crucial for the calculation of the Bayes factor.

One possible way we can build a good prior for the parameter β estimating the influence of
cloze probability here is the following (see chapter 6 for an extended discussion about prior
selection). The reasoning below is based on domain knowledge; but there is room for
differences of opinion here. In a realistic data analysis situation, we would carry out a
sensitivity analysis using a range of priors to determine the extent of influence of the priors.

1. One may want to be agnostic regarding the direction of the effect; that means that we will
center the prior of β on zero by specifying that the mean of the prior distribution is zero.
However, we are still not sure about the variance of the prior on β.
2. One would need to know a bit about the variation on the dependent variable that we are
analyzing. After re-analyzing the data from a couple of EEG experiments available from
osf.io, we can say that for N400 averages, the standard deviation of the signal is between
8-15 microvolts (Nicenboim, Vasishth, and Rösler 2020c).
3. Based on published estimates of effects in psycholinguistics, we can conclude that they
are generally rather small, often representing between 5%-30% of the standard deviation
of the dependent variable.
4. The effect of noun predictability on the N400 is one of the most reliable and strongest
effects in neurolinguistics (together with the P600, which might be even stronger), and the
slope β represents the average change in voltage when moving from a cloze probability
of zero to one, that is, the strongest possible prediction effect.
An additional and highly recommended way to obtain good priors (Schad, Betancourt, and
Vasishth 2020, also see chapter 7, which presents a principled Bayesian workflow) is to
perform prior predictive checks. Here, the idea is to simulate data from the model and the
priors, and then to analyze the simulated data using summary statistics. For example, it would
be possible to compute the summary statistic of the difference in the N400 between high
versus low cloze probability. The simulations would yield a distribution of differences.
Arguably, this distribution of differences, that is, the data analyses of the simulated data, are
much easier to judge for plausibility than the prior parameters specifying prior distributions.
That is, we might find it easier to judge whether a difference in voltage between high and low
cloze probability is plausible rather than judging the parameters of the model. For reasons of
brevity, we skip this step here.

Instead, we will start with the prior β ∼ Normal(0, 5) (since 5 microvolts is roughly 30% of
15, which is the upper bound of the expected standard deviation of the EEG signal).

priors1 <- c(
  prior(normal(2, 5), class = Intercept),
  prior(normal(0, 5), class = b),
  prior(normal(10, 5), class = sigma),
  prior(normal(0, 2), class = sd),
  prior(lkj(4), class = cor)
)

We load the data set on N400 amplitudes, which has data on cloze probabilities (Nieuwland et
al. 2018). We mean-center the cloze probability measure to make the intercept and the
random intercepts easier to interpret (i.e., after centering, they represent the grand mean and
the average variability around the grand mean across subjects or items).

data(df_eeg)
df_eeg <- df_eeg %>% mutate(c_cloze = cloze - mean(cloze))

We will need a large number of effective samples to obtain stable estimates of the Bayes
factor with bridge sampling; for this reason, a large number of sampling iterations
( iter = 20000 ) is specified. We also set adapt_delta = 0.9 , because with the default
setting the sampler produces warnings. For Bayes factor analyses, it is necessary to set the
argument save_pars = save_pars(all = TRUE) . This setting is a precondition for later
performing bridge sampling to compute the Bayes factor.

fit_N400_h_linear <- brm(n400 ~ c_cloze +
    (c_cloze | subj) + (c_cloze | item),
  prior = priors1,
  warmup = 2000,
  iter = 20000,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_pars = save_pars(all = TRUE),
  data = df_eeg
)

Next, take a look at the population-level (or fixed) effects from the Bayesian modeling.

fixef(fit_N400_h_linear)

##           Estimate Est.Error Q2.5 Q97.5
## Intercept     3.65      0.45 2.76  4.53
## c_cloze       2.33      0.65 1.05  3.59

We can now take a look at the estimates and at the credible intervals. The effect of cloze
probability ( c_cloze ) is 2.33, with a 95% credible interval ranging from 1.05 to 3.59. While
this provides an initial hint that highly probable words may elicit a stronger N400 response
than low-probability words, just looking at the posterior gives us no way to quantify the
evidence for the question of whether this effect is different from zero. Model comparison is
needed to answer this question.

To this end, we run the model again, now without the parameter of interest, i.e., the null
model. This is a model where our prior for β is that it is exactly zero.

fit_N400_h_null <- brm(n400 ~ 1 +
    (c_cloze | subj) + (c_cloze | item),
  prior = priors1[priors1$class != "b", ],
  warmup = 2000,
  iter = 20000,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_pars = save_pars(all = TRUE),
  data = df_eeg
)

Now everything is ready to compute the log marginal likelihood, that is, the likelihood of the
data given the model, after integrating out the model parameters. In the toy examples shown
above, we used the R function integrate() to perform this integration. This is not possible
for the more realistic and more complex models considered here, because the integrals that
have to be solved are too high-dimensional for such simple numerical routines. Instead, a
standard approach to approximating the marginal likelihood of realistic, complex models is
bridge sampling (Gronau et al. 2017b; Gronau, Singmann, and Wagenmakers 2017). We
perform this integration using the function bridge_sampler() for each of the two models:

margLogLik_linear <- bridge_sampler(fit_N400_h_linear, silent = TRUE)
margLogLik_null <- bridge_sampler(fit_N400_h_null, silent = TRUE)

This gives us the marginal log likelihoods for each of the models. From these, we can
compute the Bayes factor; the function bayes_factor() provides a convenient way to do this.

(BF_ln <- bayes_factor(margLogLik_linear, margLogLik_null))

## Estimated Bayes factor in favor of x1 over x2: 50.96782

Alternatively, the Bayes factor can be computed manually. First, we compute the difference in
marginal log likelihoods; then we transform this difference to the likelihood scale using
exp() . A difference on the log scale corresponds to a ratio on the original scale,
exp(a - b) = exp(a) / exp(b) , and this ratio is the Bayes factor. Note that the values
exp(ml1) and exp(ml2) themselves are too small to be represented accurately by R.
For numerical reasons, it is therefore important to take the difference first and to compute the
exponential afterwards, i.e., exp(margLogLik_linear$logml - margLogLik_null$logml) ,
which yields the same result as the bayes_factor() command.

exp(margLogLik_linear$logml - margLogLik_null$logml)

## [1] 51

The Bayes factor is quite large in this example, and furnishes strong support for the
alternative model, which includes a coefficient representing the effect of cloze probability.
Under the criteria shown in Table 15.1, a Bayes factor of about 51 constitutes very strong
evidence for an effect of cloze probability.

In this example, there was good prior information about the model parameter β. However,
what happens if we are not sure about the prior for the model parameter? It might happen that
we compare the null model with a very “bad” alternative model, because our prior for β is not
appropriate.

For example, assuming that we do not know much about N400 effects, or that we do not want
to make strong assumptions, we might be inclined to use an uninformative prior. For example,
these could look as follows (where all the priors except for b remain unchanged):

priors_vague <- c(
  prior(normal(2, 5), class = Intercept),
  prior(normal(0, 500), class = b),
  prior(normal(10, 5), class = sigma),
  prior(normal(0, 2), class = sd),
  prior(lkj(4), class = cor)
)
We can use these uninformative priors in the Bayesian model:

fit_N400_h_linear_vague <- brm(n400 ~ c_cloze +
    (c_cloze | subj) + (c_cloze | item),
  prior = priors_vague,
  warmup = 2000,
  iter = 20000,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_pars = save_pars(all = TRUE),
  data = df_eeg
)

Interestingly, we can still estimate the effect of cloze probability fairly well:

posterior_summary(fit_N400_h_linear_vague, variable = "b_c_cloze")

##           Estimate Est.Error Q2.5 Q97.5
## b_c_cloze     2.37     0.652 1.07  3.64

Next, we again perform the bridge sampling for the alternative model.

margLogLik_linear_vague <- bridge_sampler(fit_N400_h_linear_vague,
  silent = TRUE
)

We compute the Bayes factor for the alternative over the null model, BF10 :

(BF_lnVague <- bayes_factor(margLogLik_linear_vague, margLogLik_null))

## Estimated Bayes factor in favor of x1 over x2: 0.56000

This is easier to read as the evidence for the null model over the alternative:

1 / BF_lnVague[[1]]

## [1] 1.79

The result is inconclusive: there is no evidence in favor of or against the effect of cloze
probability. The reason for that is that priors are never uninformative when it comes to Bayes
factors. The wide prior specifies that both very small and very large effect sizes are possible
(with some considerable probability), but there is relatively little evidence in the data for such
large effect sizes.

The above example is related to a criticism of Bayes factors by Uri Simonsohn, that Bayes
factors can provide evidence in favor of the null and against a very specific alternative model,
when the researchers only know the direction of the effect (see https://datacolada.org/78a).
This can happen when an uninformative prior is used.

One way to overcome this problem is to actually try to learn about the effect size under
investigation. This can be done by first running an exploratory experiment and analysis without
computing any Bayes factor, and then using the posterior distribution derived from this first
experiment to calibrate the priors for the next, confirmatory experiment, where we do use the
Bayes factor (see Verhagen and Wagenmakers 2014 for a Bayes factor test calibrated to
investigate replication success).

Another possibility is to examine a lot of different alternative models, where each model uses
different prior assumptions. This way, it’s possible to investigate the extent to which the Bayes
factor results depend on, or are sensitive to, the prior assumptions. This is an instance of a
sensitivity analysis. Recall that the model is the likelihood and the priors. We can therefore
compare models that only differ in the prior (for an example involving EEG and predictability
effects, see Nicenboim, Vasishth, and Rösler 2020c).

15.2.1 Sensitivity analysis

Here, we perform a sensitivity analysis by examining Bayes factors for several models. Each
model has the same likelihood but a different prior for β. For all of the priors we assume a
normal distribution with a mean of zero. Assuming a mean of zero asserts that we do not
make any assumption a priori that the effect differs from zero. If the effect should differ from
zero, we want the data to tell us that. What differs between the different priors is their standard
deviation. That is, what differs is the amount of uncertainty about the effect size that we allow
for in the prior. A large standard deviation allows for very large effect sizes, whereas a small
standard deviation asserts that we expect the effect not to be very large. Although a model
with a wide prior (i.e., large standard deviation) also allocates prior probability to small effect
sizes, it allocates much less probability to small effect sizes compared to a model with a
narrow prior. Thus, if the effect size is in reality small, then a model with a narrow prior (small
standard deviation) will have a better chance of detecting the effect.

Next, we try out a range of standard deviations, ranging from 1 to a much wider prior with a
standard deviation of 100. In practice, for the experimental method discussed here, it would
not be a good idea to specify very large standard deviations such as 100 microvolts, since
they imply unrealistically large effect sizes. However, we include such a large value here just
for illustration. Such a sensitivity analysis takes a very long time: here, we are running 11
models, where each model involves a large number of iterations to obtain stable Bayes factor
estimates.

prior_sd <- c(1, 1.5, 2, 2.5, 5, 8, 10, 20, 40, 50, 100)
BF <- c()
for (i in 1:length(prior_sd)) {
  psd <- prior_sd[i]
  # for each prior, fit the model:
  fit <- brm(n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
    prior =
      c(
        prior(normal(2, 5), class = Intercept),
        set_prior(paste0("normal(0,", psd, ")"), class = "b"),
        prior(normal(10, 5), class = sigma),
        prior(normal(0, 2), class = sd),
        prior(lkj(4), class = cor)
      ),
    warmup = 2000,
    iter = 20000,
    cores = 4,
    control = list(adapt_delta = 0.9),
    save_pars = save_pars(all = TRUE),
    data = df_eeg
  )
  # for each model, run a bridge sampler:
  lml_linear_beta <- bridge_sampler(fit, silent = TRUE)
  # store the Bayes factor against the null model fitted earlier:
  BF <- c(BF, bayes_factor(lml_linear_beta, margLogLik_null)$bf)
}
BFs <- tibble(beta_sd = prior_sd, BF)

For each model, we run bridge sampling and we compute the Bayes factor of the model
against our baseline or null model, which does not contain a population-level effect of cloze
probability (BF10 ). Next, we need a way to visualize all the Bayes factors. We plot them in
Figure 15.3 as a function of the prior width.
[Figure: Bayes factors (BF10, log scale from 1/100 to 100) as a function of the normal prior
width (SD, 0 to 100); values above 1 indicate evidence in favor of M1, values below 1
evidence in favor of M0.]

FIGURE 15.3: Prior sensitivity analysis for the Bayes factor.


This figure clearly shows that the Bayes factor provides evidence for the alternative model;
that is, it provides evidence that the fixed effect of cloze probability is needed to explain the
data. This can be seen from the fact that the Bayes factor is quite large across a range of
different values for the prior standard deviation. The Bayes factor is largest for a prior
standard deviation of 2.5, suggesting a rather small size of the effect of cloze probability. If we
assume gigantic effect sizes a priori (e.g., standard deviations of 50 or 100), then the
evidence for the alternative model is weaker. Conceptually, the data do not fully support such
big effect sizes, and start to favor the null model relatively more when such big effect sizes
are tested against the null. Overall, we can conclude that the data provide evidence for a not
too large but robust influence of cloze probability on the N400 amplitude.

15.2.2 Non-nested models

One important advantage of Bayes factors is that they can be used to compare models that
are not nested. In nested models, the simpler model is a special case of the more complex
and general model. For example, our previous model of cloze probability was a general
model, allowing different influences of cloze probability on the N400. We compared this to a
simpler, more specific null model, where the influence of cloze probability was not included,
which means that the regression coefficient (fixed effect) for cloze probability was assumed to
be set to zero. Such nested models can also be compared using frequentist methods such as
the likelihood ratio test (ANOVA).

By contrast, the Bayes factor also makes it possible to compare non-nested models. An
example of a non-nested model would be a case where we log-transform the cloze probability
variable before using it as a predictor. A model with log cloze probability as a predictor is not a
special case of a model with linear cloze probability as predictor. These are just different,
alternative models. With Bayes factors, we can compare these non-nested models with each
other to determine which receives more evidence from the data.

To do so, we first log-transform the cloze probability variable. Some cloze probabilities in the
data set are equal to zero. This creates a problem when taking logs, since the log of zero is
minus infinity, a value that we cannot use. We overcome this problem by “smoothing” the
cloze probabilities. We use additive smoothing (also called Laplace or Lidstone smoothing;
Lidstone 1920; Chen and Goodman 1999) with pseudocounts set to one; this means that the
smoothed probability is calculated as the number of responses consistent with the target word
plus one, divided by the total number of responses plus two.

df_eeg <- df_eeg %>%
  mutate(
    scloze = (cloze_ans + 1) / (N + 2),
    c_logscloze = log(scloze) - mean(log(scloze))
  )

Next, we center the predictor variable, and we scale it to the same standard deviation as the
linear cloze probabilities. To implement this scaling, first divide the centered smoothed log
cloze probability variable by its standard deviation (effectively creating z-scaled values). As a
next step, multiply the z-scaled values by the standard deviation of the non-transformed cloze
probability variable. This way, both predictors (log cloze and cloze) have the same standard
deviation. We therefore expect them to have a similar impact on the N400. As a result of this
transformation, the same priors can be used for both variables (given that we currently have
no specific information about the effect of log cloze probability versus linear cloze probability):

df_eeg <- df_eeg %>%
  mutate(c_logscloze = scale(c_logscloze) * sd(c_cloze))

Then, we run a linear mixed-effects model with log cloze probability instead of linear cloze
probability, and again carry out bridge sampling.

fit_N400_h_log <- brm(n400 ~ c_logscloze +
    (c_logscloze | subj) + (c_logscloze | item),
  prior = priors1,
  warmup = 2000,
  iter = 20000,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_pars = save_pars(all = TRUE),
  data = df_eeg
)

margLogLik_log <- bridge_sampler(fit_N400_h_log, silent = TRUE)

Next, compare the linear and the log model to each other using Bayes factors.

(BF_log_lin <- bayes_factor(margLogLik_log, margLogLik_linear))

## Estimated Bayes factor in favor of x1 over x2: 6.04762

The results show a Bayes factor of about 6 in favor of the log model over the linear model.
This provides some evidence that log cloze probability is a better predictor of N400
amplitudes than linear cloze probability. Importantly, this analysis demonstrates that model
comparisons using Bayes factors are not limited to nested models, but can also be used for
non-nested models.
15.3 The influence of the priors on Bayes factors: beyond the effect of interest

We saw above that the width (or standard deviation) of the prior distribution for the effect of
interest had a strong impact on the results from Bayes factor analyses. Thus, one question is
whether only the prior for the effect of interest is important, or whether priors for other model
parameters can also impact the resulting Bayes factors in an analysis. It turns out that priors
for other model parameters can also be important and impact Bayes factors, especially when
there are non-linear components in the model, such as in generalized linear mixed effects
models. We investigate this issue using a simulated data set on a variable that has a
Bernoulli distribution: in each trial, subjects either perform successfully on a task ( pDV = 1 )
or not ( pDV = 0 ). The simulated data are from a factorial experimental design with one
between-subject factor F with two levels (F1 and F2); Table 15.2 shows the success
probabilities for each of the experimental conditions.

data("df_BF")
str(df_BF)

## tibble [100 × 3] (S3: tbl_df/tbl/data.frame)
## $ F  : Factor w/ 2 levels "F1","F2": 1 1 1 1 1 1 1 1 1 1 ...
## $ pDV: int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
## $ id : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...

TABLE 15.2: Summary statistics per condition for the simulated data.

Factor   N    Mean
F1       50   0.98
F2       50   0.70

Our question now is whether there is evidence for a difference in success probabilities
between groups F1 and F2 . As contrasts for the factor F , we use scaled sum coding
(−0.5, +0.5) .

contrasts(df_BF$F) <- c(-0.5, +0.5)

Next, we specify our priors. For the difference between groups (F1 versus F2), we define a
normally distributed prior with a mean of 0 and a standard deviation of 0.5. Thus, we do not
specify a direction of the difference a priori, and we do not assume very large effect sizes.
We then run two logistic brms models, one with the group factor F included and one
without it, and compute Bayes factors using bridge sampling to obtain the evidence that the
data provide for the alternative hypothesis that a group difference exists between levels F1
and F2.

So far, we have only specified the prior for the effect size. The question we are asking now is
whether priors on other model parameters can impact the Bayes factor computations for
testing the group effect. Specifically, can the prior for the intercept influence the Bayes factor
for the group difference? The results show that yes, such an influence can take place in some
situations. Let's look at this in more detail. Assume that we compare two different priors for
the intercept. We specify each as a normal distribution with a standard deviation of 0.1, thus
specifying relatively high a priori certainty about where the intercept of the data will fall. The
only difference is the prior mean (on the latent logistic scale): in one case it is set to 0,
corresponding to a prior mean probability of 0.5; in the other case, we specify a prior mean of
2, corresponding to a prior mean probability of 0.88. When we look at the data (see Table
15.2), we see that the prior mean of 0 (i.e., a prior probability for the intercept of 0.5) is not
very compatible with the data, whereas the prior mean of 2 (i.e., a prior probability for the
intercept of 0.88) is quite closely aligned with the actual data.
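The mapping between the latent logistic scale and the probability scale can be verified
directly with the inverse logit function in R:

round(plogis(c(0, 2)), 3)

## [1] 0.500 0.881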

We now compute Bayes factors for the group difference (F1 versus F2) using these different
priors for the intercept. Thus, we first fit a null (M0) and an alternative (M1) model under the
assumption of a false prior belief (mean = 0), and perform bridge sampling for these models:

# set priors
priors_logit1 <- c(
  prior(normal(0, 0.1), class = Intercept),
  prior(normal(0, 0.5), class = b)
)
# Bayesian GLM: M0
fit_pDV_H0 <- brm(pDV ~ 1,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit1[-2, ],
  save_pars = save_pars(all = TRUE)
)
# Bayesian GLM: M1
fit_pDV_H1 <- brm(pDV ~ 1 + F,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit1,
  save_pars = save_pars(all = TRUE)
)
# bridge sampling
mLL_binom_H0 <- bridge_sampler(fit_pDV_H0, silent = TRUE)
mLL_binom_H1 <- bridge_sampler(fit_pDV_H1, silent = TRUE)

Next, we prepare the Bayes factor computation a second time, by running the null (M0) and
the alternative (M1) model again, now assuming a more realistic prior for the intercept (prior
mean = 2).

priors_logit2 <- c(
  prior(normal(2, 0.1), class = Intercept),
  prior(normal(0, 0.5), class = b)
)
# Bayesian GLM: M0
fit_pDV_H0_2 <- brm(pDV ~ 1,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit2[-2, ],
  save_pars = save_pars(all = TRUE)
)
# Bayesian GLM: M1
fit_pDV_H1_2 <- brm(pDV ~ 1 + F,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit2,
  save_pars = save_pars(all = TRUE)
)
# bridge sampling
mLL_binom_H0_2 <- bridge_sampler(fit_pDV_H0_2, silent = TRUE)
mLL_binom_H1_2 <- bridge_sampler(fit_pDV_H1_2, silent = TRUE)

Based on these models and bridge samples, we can now compute the Bayes factors in
support for M1 (i.e., in support of a group-difference between F1 and F2 ). We can do so for
the unrealistic prior for the intercept (prior mean of 0) and the more realistic prior for the
intercept (prior mean of 2).

(BF_binom_H1_H0 <- bayes_factor(mLL_binom_H1, mLL_binom_H0))

## Estimated Bayes factor in favor of x1 over x2: 7.19449

(BF_binom_H1_H0_2 <- bayes_factor(mLL_binom_H1_2, mLL_binom_H0_2))

## Estimated Bayes factor in favor of x1 over x2: 29.60257

The results show that with the realistic prior for the intercept (prior mean = 2), the evidence
for M1 is quite strong, with a Bayes factor of BF10 = 29.6. With the unrealistic prior for the
intercept (prior mean = 0), by contrast, the evidence for M1 is much reduced, BF10 = 7.2,
and now only moderate.

Thus, when performing Bayes factor analyses, not only can the priors for the effect of interest
(here the group difference) impact the results, under certain circumstances priors for other
model parameters can too, such as the prior mean for the intercept here. Such an influence
will not always be strong, and can sometimes be negligible. There may be many situations,
where the exact specification of the intercept does not have much of an effect on the Bayes
factor for a group difference. However, such influences can in principle occur, especially in
models with non-linear components. Therefore, it is very important to be careful in specifying
realistic priors for all model parameters, also including the intercept. A good way to judge
whether prior assumptions are realistic and plausible is prior predictive checks, where we
simulate data based on the priors and the model and judge whether the simulated data is
plausible and realistic.
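A minimal sketch of such a prior predictive check in brms (our own illustration, not code from
the original analysis; setting sample_prior = "only" makes brms ignore the likelihood and
sample from the priors alone):

fit_pDV_prior <- brm(pDV ~ 1 + F,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit2,
  sample_prior = "only"
)
# compare data simulated from the priors with the observed data:
pp_check(fit_pDV_prior, type = "bars")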

15.4 Bayes factor in Stan

The package bridgesampling allows for a straightforward calculation of Bayes factors for
Stan models as well. All the limitations and caveats of the Bayes factor discussed in this
chapter apply to Stan code as much as they apply to brms code. Importantly, the sampling
notation ( ~ ) should not be used, because it allows Stan to drop constant terms of the log
density, which are needed for computing marginal likelihoods; see Box 10.2.
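As a minimal illustration of the difference (our own sketch; these statements would appear
inside a Stan model block):

// drops constant terms of the log density; unsuitable for bridge sampling:
// beta2 ~ normal(0, 100);
// keeps the full normalized log density, as required for marginal likelihoods:
target += normal_lpdf(beta2 | 0, 100);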

An advantage of using Stan in comparison with brms is Stan’s flexibility. We revisit the model
implemented before in section 10.4.2. We want to assess the evidence for a positive effect of
attentional load on pupil size against a similar model that assumes no effect. To do this,
assume the following likelihood:

\[
p\_size_n \sim \mathit{Normal}(\alpha + c\_load_n \cdot \beta_1 + c\_trial_n \cdot \beta_2 + c\_load_n \cdot c\_trial_n \cdot \beta_3,\ \sigma)
\]

Define priors for all the \(\beta\)s as before, with the difference that \(\beta_1\) can only have positive values:

\[
\begin{aligned}
\alpha &\sim \mathit{Normal}(1000, 500) \\
\beta_1 &\sim \mathit{Normal}_+(0, 100) \\
\beta_2 &\sim \mathit{Normal}(0, 100) \\
\beta_3 &\sim \mathit{Normal}(0, 100) \\
\sigma &\sim \mathit{Normal}_+(0, 1000)
\end{aligned}
\]

The following Stan model is the direct translation of the new priors and likelihood.

data {
  int<lower = 1> N;
  vector[N] c_load;
  vector[N] c_trial;
  vector[N] p_size;
}
parameters {
  real alpha;
  real<lower = 0> beta1;
  real beta2;
  real beta3;
  real<lower = 0> sigma;
}
model {
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta1 | 0, 100) -
    normal_lccdf(0 | 0, 100);
  target += normal_lpdf(beta2 | 0, 100);
  target += normal_lpdf(beta3 | 0, 100);
  target += normal_lpdf(sigma | 0, 1000) -
    normal_lccdf(0 | 0, 1000);
  target += normal_lpdf(p_size | alpha + c_load * beta1 +
                                 c_trial * beta2 +
                                 c_load .* c_trial * beta3, sigma);
}

Fit the model with 20000 iterations to ensure that the Bayes factor is stable, and increase the
adapt_delta parameter to avoid warnings:

data("df_pupil")
df_pupil <- df_pupil %>%
  mutate(
    c_load = load - mean(load),
    c_trial = trial - mean(trial)
  )
ls_pupil <- list(
  p_size = df_pupil$p_size,
  c_load = df_pupil$c_load,
  c_trial = df_pupil$c_trial,
  N = nrow(df_pupil)
)
pupil_pos <- system.file("stan_models",
  "pupil_pos.stan",
  package = "bcogsci"
)
fit_pupil_int_pos <- stan(
  file = pupil_pos,
  data = ls_pupil,
  warmup = 1000,
  iter = 20000,
  control = list(adapt_delta = .95)
)

The null model that we defined has β1 = 0 and is written in Stan as follows:

data {
  int<lower = 1> N;
  vector[N] c_load;
  vector[N] c_trial;
  vector[N] p_size;
}
parameters {
  real alpha;
  real beta2;
  real beta3;
  real<lower = 0> sigma;
}
model {
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta2 | 0, 100);
  target += normal_lpdf(beta3 | 0, 100);
  target += normal_lpdf(sigma | 0, 1000) -
    normal_lccdf(0 | 0, 1000);
  target += normal_lpdf(p_size | alpha + c_trial * beta2 +
                                 c_load .* c_trial * beta3, sigma);
}

pupil_null <- system.file("stan_models",
  "pupil_null.stan",
  package = "bcogsci"
)
fit_pupil_int_null <- stan(
  file = pupil_null,
  data = ls_pupil,
  warmup = 1000,
  iter = 20000
)

Compare the models with bridge sampling:


lml_pupil <- bridge_sampler(fit_pupil_int_pos, silent = TRUE)
lml_pupil_null <- bridge_sampler(fit_pupil_int_null, silent = TRUE)
BF_att <- bridgesampling::bf(lml_pupil, lml_pupil_null)

BF_att

## Estimated Bayes factor in favor of lml_pupil over lml_pupil_null: 25.17274

We find that the data are 25.173 times more likely under a model that assumes a positive
effect of load than under a model that assumes no effect.

15.5 Bayes factors in theory and in practice

15.5.1 Bayes factors in theory: Stability and accuracy

One question that we can ask here is how stable and accurate the estimates of Bayes factors
are. Importantly, the bridge sampling algorithm needs a lot of posterior samples to obtain
stable estimates of the Bayes factor. Running bridge sampling based on too small an
effective sample size (related to the number of posterior samples) will yield unstable estimates
of the Bayes factor, such that repeated computations will yield radically different Bayes factor
values. Moreover, even if the Bayes factor is approximated in a stable way, it is unclear
whether this approximate Bayes factor is equal to the true Bayes factor, or whether there is
bias in the computation such that the approximate Bayes factor has a wrong value. We show
this below.

15.5.1.1 Instability due to the effective number of posterior samples

The number of iterations, which in turn affects the total number of posterior samples, can
have a strong impact on the robustness of the results of the bridge sampling algorithm (i.e.,
on the resulting Bayes factor), and there are no good theoretical guarantees that bridge
sampling will yield accurate estimates of Bayes factors. In the analyses presented above, we
set the number of iterations to a very large number ( iter = 20000 ). The sensitivity analysis
therefore took a considerable amount of time. Indeed, the results from this analysis were
stable, as shown below.

Running the same analysis with fewer iterations will induce some instability in the Bayes
factor estimates based on the bridge sampling, such that running the same analysis twice
would yield different results for the Bayes factor. Moreover, bridge sampling in itself may be
unstable and may return different results for different runs on the same posterior samples
(just because of different starting values). This is very concerning, as the results reported in a
paper might not be stable if the effective sample size is not large enough. Indeed, the default
number of iterations in brms is set as iter = 2000 (and the default number of warmup
iterations is warmup = 1000 ). These defaults were not set to support bridge sampling, i.e.,
they were not defined for the computation of densities to support Bayes factors. Instead, they
are valid for posterior inference on expectations (e.g., posterior means) for models that are
not too complex. However, when using these defaults for the estimation of densities and the
computation of Bayes factors, instabilities can arise.

As an illustration, we perform the same sensitivity analysis again, now using the default
number of 2000 iterations in brms . The posterior sampling process now runs much more
quickly. Moreover, we check the stability of the Bayes factors in the sensitivity analyses by
repeating both sensitivity analyses (with 20000 iterations and with the default number of 2000
iterations) a second time, to see whether the results for the Bayes factors are stable.
[Figure: BF10 (log scale, from 1/1000 to 1000) as a function of the normal prior width (SD,
0 to 100), for four sensitivity analyses: two runs with the default number of iterations (2000)
and two runs with many iterations (20000); values above 1 indicate evidence in favor of M1,
values below 1 evidence in favor of M0.]

FIGURE 15.4: The effect of the number of samples on a prior sensitivity analysis for the Bayes
factor. Grey lines show 20 runs with default number of iterations (2000).
The results displayed in Figure 15.4 show that the resulting Bayes factors are highly unstable
when the number of iterations is low. They clearly deviate from the Bayes factors estimated
with 20000 iterations, resulting in very unstable estimates. By contrast, the analyses using
20000 iterations provide nearly the same results in both analyses. The two lines lie virtually
directly on top of each other; the points are jittered horizontally for better visibility.

This result demonstrates that it is necessary to use a large number of iterations when
computing Bayes factors using brms and bridge_sampler() . In practice, one should
compute the sensitivity analysis (or at least one of the models or priors) twice (as we did here)
to make sure that the results are stable and sufficiently similar, in order to provide a good
basis for reporting results.
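A quick version of such a stability check (our own sketch) is to run the bridge sampler twice
on the same fitted model and compare the resulting log marginal likelihoods:

# two independent bridge-sampling runs on the same posterior samples
lml_run1 <- bridge_sampler(fit_N400_h_linear, silent = TRUE)
lml_run2 <- bridge_sampler(fit_N400_h_linear, silent = TRUE)
# with a sufficient effective sample size, these should be nearly identical:
c(lml_run1$logml, lml_run2$logml)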

By contrast, Bayes factors based on the Savage-Dickey method (as implemented in brms )
can be unstable even when using a large number of posterior samples. This problem can arise
especially when the posterior is very far from zero, and thus very large or very small Bayes
factors are obtained. Because of this instability of the Savage-Dickey method in brms , it is a
good idea to use bridge sampling, and to check the stability of the estimates.

15.5.1.2 Inaccuracy of Bayes factor estimates: Does the estimate approximate the true Bayes factor well?

An important point about approximate estimates of Bayes factors using bridge sampling is that
there are no strong guarantees for their accuracy. That is, even if we can show that the
approximated Bayes factor estimate using bridge sampling is stable (i.e., when using sufficient
effective samples, see the analyses above), even then it remains unclear whether the Bayes
factor estimate actually is close to the true Bayes factor. In principle, it could very well be that
the stably estimated Bayes factors based on bridge sampling are in fact biased, i.e., that they
are not close to the correct (true) Bayes factor, but that the estimation exhibits bias and yields
a different value. The technique of simulation-based calibration (SBC; Talts et al. 2018; Schad,
Betancourt, and Vasishth 2020) can be used to investigate this question (SBC is also
discussed in section 12.2 in chapter 12). We ask and investigate this question next (for
details, see Schad et al. 2022).

In the SBC approach, the priors are used to simulate data. Then, posterior inference is done
on the simulated data, and the posterior can be compared to the prior. If the posteriors are
equal to the priors, then this supports accurate computations. Applied to Bayes factor
analyses, one defines a prior on the hypothesis space, i.e., one defines the prior probabilities
for a null and an alternative model, specifying how likely each model is a priori. From these
priors, one can randomly draw one hypothesis (model), e.g., nsim = 500 times. Thus, in
each of 500 draws one randomly chooses one model (either M0 or M1 ), with the
probabilities given by the model priors. For each draw, one first samples model parameters
from their prior distributions, and then uses these sampled model parameters to simulate data.
For each simulated data set, one can then compute marginal likelihoods and Bayes factor
estimates using posterior samples and bridge sampling, and one can then compute the
posterior probabilities for each hypothesis (i.e., how likely each model is a posteriori). As the
last, and critical step in SBC, one can then compare the posterior model probabilities to the
prior model probabilities. A key result in SBC is that if the computation of marginal likelihoods
and posterior model probabilities is performed accurately (without bias) by the bridge sampling
procedure; that is, if the Bayes factor estimate is close to the true Bayes factor, then the
posterior model probabilities should be the same as the prior model probabilities.
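The critical computation in the last step, turning log marginal likelihoods into a posterior
model probability, is small enough to sketch here (our own illustration; the function name
post_prob_M0 is hypothetical, and the computation is kept on the log scale for numerical
stability; the bridgesampling package provides a similar helper, post_prob() ):

post_prob_M0 <- function(logml0, logml1, prior_M0 = 0.5) {
  # unnormalized log posterior probabilities of M0 and M1
  log_unnorm <- c(logml0 + log(prior_M0), logml1 + log(1 - prior_M0))
  # normalize with the log-sum-exp trick to avoid underflow
  m <- max(log_unnorm)
  exp(log_unnorm[1] - (m + log(sum(exp(log_unnorm - m)))))
}
# e.g., with the N400 models fitted earlier in this chapter:
# post_prob_M0(margLogLik_null$logml, margLogLik_linear$logml)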

Here, we perform this SBC approach. Across the 500 simulations, we systematically vary the
prior model probability from zero to one. For each of the 500 simulations we sample a model
(hypothesis) from the model prior, then sample parameters from the priors over parameters,
use the sampled parameters to simulate fake data, fit the null and the alternative model on the
simulated data, perform bridge sampling for each model, compute the Bayes factor estimate
between them, and compute posterior model probabilities. If the bridge sampling works
accurately, then the posterior model probabilities should be the same as the prior model
probabilities. Given that we varied the prior model probabilities from zero to one, the posterior
model probabilities should also vary from zero to one. In Figure 15.5, we plot the posterior
model probabilities as a function of the prior model probabilities. If the posterior probabilities
are the same as the priors, then the local regression line and all points should lie on the
diagonal.

[Figure: posterior probability for M0 (in percent) plotted against the prior probability for M0
(0 to 100 percent), with a local regression line and the diagonal for reference.]

FIGURE 15.5: The posterior probabilities for M0 are plotted as a function of prior probabilities
for M0. If the approximation of the Bayes factor using bridge sampling is unbiased, then the
data should be aligned along the diagonal (see dashed black line). The thick black line is a
prediction from a local regression analysis. The points are average posterior probabilities as a
function of a priori selected hypotheses for 50 simulation runs each. Error bars represent 95
percent confidence intervals.

The results of this analysis in Figure 15.5 show that the local regression line is very close to
the diagonal, and that the data points (each summarizing results from 50 simulations, with
means and confidence intervals) also lie close to the diagonal. This importantly demonstrates
that the estimated posterior model probabilities are close to their a priori values. This result
shows that posterior model probabilities, which are based on the Bayes factor estimates from
the bridge sampling, are unbiased for a large range of different a priori model probabilities.

This result is very important as it shows one example case where the Bayes factor
approximation is accurate. Importantly, however, of course this demonstration is valid only for
this one specific application case, i.e., with a particular data set, particular models, specific
priors for the parameters, and a specific comparison between nested models. Strictly
speaking, if one wants to be sure that the Bayes factor estimate is accurate for a particular
data analysis, then such a SBC validation analysis would have to be computed for every data
analysis. For details, including code, on how to perform such an SBC, see Schad et al. (2022).
However, the fact that the SBC yields such promising results for this first application case also
gives some hope that bridge sampling may be accurate for other, comparable data analysis
situations.

Based on these results on the average theoretical performance of Bayes factor estimation, we
next turn to a different issue: how Bayes factors depend on and vary with varying data, leading
to bad performance in individual cases despite good average performance.

15.5.2 Bayes factors in practice: Variability with the data

15.5.2.1 Variation associated with the data (subjects, items, and residual noise)

A second, and very different, source limiting the robustness of Bayes factor estimates derives
from the variability that is observed in the data, i.e., among subjects, items, and residual
noise. Repeating an experiment in a replication attempt, using different subjects and items,
will lead to a different outcome of the statistical analysis every time a new replication run is
conducted. This limit to robustness is well known in frequentist analyses as the “dance of
p-values” (Cumming 2014), where over repeated replication attempts, p-values are not
consistently significant across studies. Instead, the results yield highly different p-values each
time a study is re-run. This can also be observed when simulating data from some known
truth and re-running analyses on simulated data sets.

This same type of variability should also be present in Bayesian analyses (also see
https://daniellakens.blogspot.com/2016/07/dance-of-bayes-factors.html). Here we show this
type of variability in Bayes factor analyses by looking at a new example data analysis: We look
at research on sentence comprehension, and specifically on effects of cue-based retrieval
interference (Lewis and Vasishth 2005; Van Dyke and McElree 2011).
15.5.2.2 Example: Facilitatory interference effects

In the following, we will look at experimental studies that investigated cognitive mechanisms
underlying a well-studied phenomenon in sentence comprehension. The example we consider
here is the agreement attraction configuration below, where the ungrammatical sentence (2)
seems more grammatical than the equally ungrammatical sentence (1):

1. The key to the cabinet are in the kitchen.


2. The key to the cabinets are in the kitchen.

Both sentences are ungrammatical because the subject (“key”) does not agree with the verb in
number (“are”). Sentences such as (2) are often found to have shorter reading times at (or just
following) the verb (“are”) compared to (1) (for a meta-analysis see Jäger, Engelmann, and
Vasishth 2017). Such shorter reading times are sometimes referred to as “facilitatory
interference” (Dillon 2011); facilitatory here does not necessarily mean that processing is
easier, it just means that reading times at the relevant word are shorter in (2) vs. (1). One
proposal explaining the shorter reading times is that the attractor word (here, cabinets) agrees
locally in number with the verb, leading to an illusion of grammaticality. This is an interesting
phenomenon because the plural versus singular feature of the attractor noun (“cabinet/s”) is
not the subject, and therefore, under the rules of English grammar, is not supposed to agree
with the number marking on the verb. That agreement attraction effects are consistently
observed indicates that some non-compositional processes are taking place.

An account of agreement attraction effects in language processing that is based on a full
computational implementation (in the ACT-R framework; Anderson et al. 2004) explains such
agreement attraction effects in ungrammatical sentences as a result of retrieval-
based working memory mechanisms (Engelmann, Jäger, and Vasishth 2020; cf. Hammerly,
Staub, and Dillon 2019; and Yadav, Smith, et al. 2022). Agreement attraction in ungrammatical
sentences has been investigated many times in similar experimental setups with different
dependent measures such as self-paced reading and eye-tracking. It is generally believed to
be a robust empirical phenomenon, and we choose it for analysis here for that reason.

Here, we look at a self-paced reading study on agreement attraction in Spanish by Lago et al.
(2015). We estimate a population-level effect for the experimental condition agreement
attraction ( x ; i.e., sentence type), against a null model where the population-level effect of
sentence type is excluded. For the agreement attraction effect of sentence type, we use sum
contrast coding (i.e., -1 and +1). We run a hierarchical model with the following formula in brms: rt ~ 1 + x + (1 + x | subj) + (1 + x | item), where rt is reading time, we have random variation associated with subjects and with items, and we assume that reading times follow a log-normal distribution: family = lognormal().
First, load the data:


data("df_lagoE1")

head(df_lagoE1)

## subj item rt int x expt

## 2 S1 I1 588 low -1 lagoE1


## 22 S1 I10 682 high 1 lagoE1
## 77 S1 I13 226 low -1 lagoE1
## 92 S1 I14 580 high 1 lagoE1
## 136 S1 I17 549 low -1 lagoE1
## 153 S1 I18 458 high 1 lagoE1

As a next step, determine priors for the analysis of these data.

15.5.2.3 Determine priors using meta-analysis

One good way to obtain priors for Bayesian analyses, and specifically for Bayes factor
analyses, is to use results from meta-analyses on the subject. Here, we take the prior for the
experimental manipulation of agreement attraction from a published meta-analysis (Jäger,
Engelmann, and Vasishth 2017).43

The mean effect size (the difference in reading time between the two experimental conditions) in the meta-analysis is −22 milliseconds (ms), with 95% CI = [−36, −9] (Jäger, Engelmann, and Vasishth 2017, Table 4). This means that the target word (i.e., the verb) in sentences such as (2) is on average read 22 ms faster than in sentences such as (1). The effect size is measured on the millisecond scale, assuming a normal distribution of effect sizes across studies.

However, individual reading times usually do not follow a normal distribution. Instead, a better
assumption about the distribution of reading times is a log-normal distribution. This is what we
will assume in the brms model. Therefore, to use the prior from the meta-analysis in the
Bayesian analysis, we have to transform the prior values from the millisecond scale to log
millisecond scale.
We have performed this transformation in Schad et al. (2022). Based on these calculations,
the prior for the experimental factor of interference effects is set to a normal distribution with
mean = −0.03 and standard deviation = 0.009 . For the other model parameters, we use
principled priors.
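As a rough illustration of where these values come from (a back-of-the-envelope sketch, not the exact calculation in Schad et al. 2022; the assumed intercept of 6 on the log-ms scale is our own guess), note that with ±1 sum contrasts the difference between condition means is exp(α + β) − exp(α − β) ≈ 2 exp(α) β for small β:

eff_ms <- -22        # meta-analytic mean effect (ms)
ci_ms <- c(-36, -9)  # its 95% CI (ms)
alpha <- 6           # assumed grand mean on the log-ms scale (exp(6) =~ 403 ms)
# Solve 2 * exp(alpha) * beta = effect for beta:
b_mean <- eff_ms / (2 * exp(alpha))
b_sd <- (diff(ci_ms) / (2 * 1.96)) / (2 * exp(alpha))
round(c(b_mean, b_sd), 3)

## [1] -0.027  0.009

These values are close to the (rounded) prior mean and standard deviation used below.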


priors <- c(
  prior(normal(6, 0.5), class = Intercept),
  prior(normal(-0.03, 0.009), class = b),
  prior(normal(0, 0.5), class = sd),
  prior(normal(0, 1), class = sigma),
  prior(lkj(2), class = cor)
)

15.5.2.4 Running a hierarchical Bayesian analysis

Next, run a brms model on the data. We use a large number of iterations ( iter = 10000 )
with bridge sampling to estimate the Bayes factor of the “full” model, which includes a
population-level effect for the experimental condition agreement attraction ( x ; i.e., sentence
type). As mentioned above, for the agreement attraction effect of sentence type, we use sum
contrast coding (i.e., −1 and +1 ).

We first show the population-level effects from the posterior analyses:


fixef(m1_lagoE1)

## Estimate Est.Error Q2.5 Q97.5
## Intercept 6.02 0.06 5.90 6.13
## x -0.03 0.01 -0.04 -0.01

They show that for the population-level effect x , capturing the agreement attraction effect,
the 95% credible interval does not overlap with zero. This indicates that there is some hint that
the effect may have the expected negative direction, reflecting shorter reading times in the
plural condition. As mentioned earlier, this does not provide a direct test of the hypothesis that
the effect exists and is not zero. This is not tested here, because we did not specify the null
hypothesis of zero effect explicitly. We can, however, draw inferences about this null
hypothesis by using the Bayes factor.

Estimate Bayes factors between a full model, where the effect of agreement attraction is
included, and a null model, where the effect of agreement attraction is absent, using the
command bayes_factor(lml_m1_lagoE1, lml_m0_lagoE1) . The function computes the Bayes
factor BF10 , that is, the evidence of the alternative over the null.
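A sketch of this computation, storing the result in the object h_lagoE1 that is inspected next:

h_lagoE1 <- bayes_factor(lml_m1_lagoE1, lml_m0_lagoE1)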


h_lagoE1$bf

## [1] 6.29

The output shows a Bayes factor of 6, suggesting that there is some support for the
alternative model, which includes the population-level effect of agreement attraction. That is,
this provides evidence for the alternative hypothesis that there is a difference between the
experimental conditions, i.e., a facilitatory effect in the plural condition of the size derived from
the meta-analysis.

The bayes_factor command should be run several times to check the stability of the Bayes
factor calculation.
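Because bridge sampling is stochastic, a stability check amounts to recomputing the log marginal likelihoods and the Bayes factor; a sketch (assuming the model objects from above):

lml_m1_rep <- bridge_sampler(m1_lagoE1, silent = TRUE)
lml_m0_rep <- bridge_sampler(m0_lagoE1, silent = TRUE)
# Should be essentially unchanged across repeated runs:
bayes_factor(lml_m1_rep, lml_m0_rep)$bf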

15.5.2.5 Variability of the Bayes factor: Posterior simulations

One way to investigate how variable the outcome of Bayes factor analyses can be (given that the Bayes factor is computed in a stable and accurate way) is to run posterior simulations
based on a fitted model. That is, one can assume that the truth is approximately known (as
approximated by the posterior model fit), and that based on this “truth” several data sets are
simulated. Computing the Bayes factor analysis again on the simulated data can provide
some insight into how variable the Bayes factor will be in a situation where the “true” data
generating process is always the same, and where variations in Bayes factor results have to
be attributed to random noise in subjects, items, residual variation, and to uncertainty about
the precise true parameter values.

We can take the Bayesian hierarchical model fitted to the data from Lago et al. (2015), and
run posterior predictive simulations. In these simulations, one takes posterior samples for the
model parameters (i.e., from p(Θ ∣ y)), and for each posterior sample of the model parameters, one can simulate new data ỹ from the model p(ỹ ∣ Θ).


pred_lagoE1 <- posterior_predict(m1_lagoE1)

The question that we are interested in now is how much information is contained in these posterior-simulated data. That is, we can run Bayesian models on the simulated data
and compute Bayes factors to test whether in the simulated data there is evidence for
agreement attraction effects. Of great interest to us is then the question of how variable the
results of these Bayes factor analyses will be across different simulated replications of the
same study.

We now perform this analysis for 50 different data sets simulated from the posterior predictive
distribution. For each of these data sets, we can proceed in exactly the same way as we did
for the real observed experimental data. That is, we again fit the same brms model 50 times,
now to the simulated data, and using the same prior as before. For each simulated data set,
we use bridge sampling to compute the Bayes factor of the alternative model compared to a
null model where the agreement attraction effect (population-level effect predictor of sentence
type, x ) is set to 0. For each simulated posterior predictive data set, we store the resulting
Bayes factor. We again use the prior from the meta-analysis.
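A sketch of this simulation loop follows (the object names bfs, est, lo, and hi are our own; the latter three store the estimate and 95% credible interval of x for the plots shown below):

n_sim <- 50
bfs <- est <- lo <- hi <- numeric(n_sim)
for (i in seq_len(n_sim)) {
  df_sim <- df_lagoE1
  # Each row of pred_lagoE1 is one simulated data set:
  df_sim$rt <- pred_lagoE1[i, ]
  # Refit the full and the null model to the simulated data:
  m1_sim <- update(m1_lagoE1, newdata = df_sim)
  m0_sim <- update(m0_lagoE1, newdata = df_sim)
  bfs[i] <- bayes_factor(bridge_sampler(m1_sim, silent = TRUE),
                         bridge_sampler(m0_sim, silent = TRUE))$bf
  est[i] <- fixef(m1_sim)["x", "Estimate"]
  lo[i] <- fixef(m1_sim)["x", "Q2.5"]
  hi[i] <- fixef(m1_sim)["x", "Q97.5"]
}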

15.5.2.6 Visualize distribution of Bayes factors

We can now visualize the distribution of Bayes factors (BF10 ) across posterior predictive
distributions by plotting a histogram. Values larger than one in this histogram indicate
evidence for the alternative model (M1) that agreement attraction effects exist (i.e., the
sentence type effect is different from zero), and Bayes factor values smaller than one indicate
evidence for the null model (M0) that no agreement attraction effect exists (i.e., the difference
in reading times between experimental conditions is zero).
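Assuming the Bayes factors from the loop sketched above are stored in bfs, the histogram could be produced along the following lines:

ggplot(data.frame(bf = bfs), aes(x = bf)) +
  geom_histogram(bins = 20) +
  scale_x_log10(breaks = c(1 / 3, 1, 3, 10),
                labels = c("1/3", "1", "3", "10")) +
  geom_vline(xintercept = 1) +                        # equal evidence
  geom_vline(xintercept = h_lagoE1$bf, linetype = "dashed") +
  labs(x = "BF10", y = "Count")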
FIGURE 15.6: Left panel: A histogram of Bayes factors (BF10) of the alternative model over
the null model in 50 simulated data sets. The vertical solid black line shows equal evidence for
both hypotheses; the dashed line shows the Bayes factor computed from the empirical data;
the horizontal error bar shows 95 percent of all Bayes factors. Right panel: Estimates of the
facilitatory effect of retrieval interference and 95 percent credible intervals across all
simulations (solid lines) and the empirically observed data (dashed line).
The results show that the Bayes factors are quite variable. Although all data sets are simulated from the same posterior predictive distribution, the Bayes factor results range from moderate evidence for the null model (BF10 < 1/3) to strong evidence for the alternative model (BF10 > 10). The bulk of the simulated data sets provide
moderate or anecdotal evidence for the alternative model. That is, much like the “dance of p-
values” (Cumming 2014), this analysis reveals a “dance of the Bayes factors” with simulated
repetitions of the same study. The variability in these results shows that a typical cognitive or
psycholinguistic data set is not necessarily highly informative for drawing firm conclusions
about the hypotheses in question.

What is driving these differences in the Bayes factors between simulated data sets? One
obvious reason why the outcomes may be so different is that the difference in reading times
between the two sentence types, that is, the experimental effect that we wish to make
inferences about, may vary based on the noise and uncertainty in the posterior predictive
simulations. It is therefore interesting to plot the Bayes factors from these simulated data sets as
a function of the difference in simulated reading times between the two sentence types as
estimated in the Bayesian model. That is, we extract the estimated mean difference in reading
times at the verb between plural and singular attractor conditions from the population-level
effects of the Bayesian model, and plot the Bayes factor as a function of this difference
(together with 95% credible intervals).
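Assuming the per-simulation estimates and credible intervals were stored as in the loop sketched earlier, such a plot could be drawn as follows:

df_sims <- data.frame(bf = bfs, est = est, lo = lo, hi = hi)
ggplot(df_sims, aes(x = est, y = bf)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lo, xmax = hi)) +
  scale_y_log10(breaks = c(1 / 3, 1, 3, 10),
                labels = c("1/3", "1", "3", "10")) +
  labs(x = "Effect estimate [95% CrI]", y = "BF10")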
FIGURE 15.7: The Bayes factor (BF10) as a function of the estimate (with 95 percent credible
intervals) of the facilitatory effect of retrieval interference across 50 simulated data sets. The
prior is from a meta-analysis.
The results (displayed in Figure 15.7) show that the mean difference in reading times between experimental conditions varies dramatically across posterior predictive simulations. This indicates that the experimental data and design contain a limited amount of information about the effect of interest. Of course, if the data are noisy, Bayes factor analyses based on the simulated data cannot be stable across simulations either. Accordingly, as is clear from Figure 15.7, the difference in mean reading times between experimental conditions is indeed a major driving force behind the Bayes factor calculations (other model parameters don't show a close association; Schad et al. 2022).

In Figure 15.7, as the difference between reading times becomes more negative, that is, the
faster the plural noun condition (i.e., “cabinets” in the example; sentence 2) is read compared
to the singular noun condition (i.e., “cabinet”; example sentence 1), the larger the Bayes factor
BF10 becomes, indicating that the evidence in favor of the alternative model increases. By
contrast, when the difference between reading times becomes less negative, i.e., the plural
condition (sentence 2) is not read much faster than the singular condition (sentence 1), then
the Bayes factor BF10 decreases to values smaller than 1. Importantly, this behavior occurs because we are using an informative prior from the meta-analysis, where the prior mean for the agreement attraction effect is not centered at zero but has a negative value (i.e., a prior mean of −0.03). Therefore, differences in reading times that are less negative / more positive than this prior mean are more in line with a null model of no effect. This also leads to the striking observation that the 95% credible intervals are quite consistent, and none of them overlaps with zero, whereas the Bayes factor results are far more variable. This should alarm researchers who use the 95% credible interval to decide whether an effect is present or not, i.e., to make a discovery claim.

Computing Bayes factors for such a prior with a non-zero mean asks the very specific
question of whether the data provide more evidence for the effect size obtained from the
meta-analysis compared to the absence of any effect.

The important lesson to learn from this analysis is that Bayes factors can be quite variable for different data sets assessing the same phenomenon. Individual data sets in the cognitive sciences often do not contain a lot of information about the phenomenon of interest, even when, as is the case here with agreement attraction, the phenomenon is thought to be relatively robust. For a more detailed investigation of how Bayes factors can vary with data, in both simulated and real replication studies, we refer the reader to Schad et al. (2022) and Vasishth, Yadav, et al. (2022).

15.5.3 A cautionary note about Bayes factors

Just like frequentist p-values (Wasserstein and Lazar 2016), Bayes factors are easy to misuse and misinterpret, and they have the potential to mislead the scientist if used in an automated manner. A recent article (Tendeiro 2022) reviews many of the misuses of Bayes factor analyses in psychology and related areas. As discussed in this chapter, Bayes factors (and Bayesian analysis in general) require a great deal of thought; there is no substitute for sensitivity analyses and the development of sensible priors. Using default priors and deriving black-and-white conclusions from Bayes factor analyses is never a good idea.

15.6 Summary

Bayes factors are a very important tool in Bayesian data analysis. They allow the researcher to quantify the evidence in favor of certain effects in the data by comparing a full model, which contains a parameter corresponding to the effect of interest, with a null model that does not contain that parameter. We saw that Bayes factor analyses are highly sensitive to the priors specified for the parameters; this is true for the parameter corresponding to the effect of interest, but sometimes also for priors relating to other parameters in the model, such as the intercept. It is therefore very important to perform prior predictive checks to select good and plausible priors. Moreover, sensitivity analyses, where Bayes factors are investigated for differing prior assumptions, should be standardly reported in any analysis involving Bayes factors. We studied theoretical aspects of Bayes factors and saw that bridge sampling requires a very large effective sample size in order to obtain stable results for approximate Bayes factors. Therefore, one should always perform a Bayes factor analysis at least twice to ensure that the results are stable. Bridge sampling comes with no strong guarantees concerning its accuracy, and we saw that simulation-based calibration can be used to evaluate the accuracy of Bayes factor estimates. Last, we learned that Bayes factors can vary strongly with the data. In the cognitive sciences, the data are often noisy, even for relatively robust effects, due to small effect sizes and limited sample sizes; therefore, the resulting Bayes factors can also vary strongly with the data. As a consequence, only large effect sizes, large-sample studies, and/or replication studies can lead to reliable inferences from empirical data in the cognitive sciences.

One topic that was not discussed in detail in this chapter is data aggregation. In repeated
measures data, null hypothesis Bayes factor analyses can be performed on the raw data, i.e.,
without aggregation, by using Bayesian hierarchical models. In an alternative approach, the
data are first aggregated by taking the mean per subject and condition, before running null
hypothesis Bayes factor analyses on the aggregated data. Importantly, inferences and Bayes factors based on aggregated data can be biased when either (i) item variability is present in addition to subject variability, or (ii) the sphericity assumption (inherent in repeated measures ANOVA) is violated (Schad, Nicenboim, and Vasishth 2022). In these cases, aggregated analyses provide biased results and should not be used. By contrast, non-aggregated analyses remain robust in these cases and yield accurate Bayes factor estimates.

Another issue not discussed here is sample size determination using Bayes factors when
planning a study. Wang and Gelfand (2002) is an important paper in this connection; also see
Vasishth, Yadav, et al. (2022) for an example involving a psycholinguistic experiment design.

15.7 Further reading

A detailed explanation on how bridge sampling works can be found in Gronau et al. (2017b),
and more details about the bridgesampling package can be found in Gronau, Singmann, and
Wagenmakers (2017). Wagenmakers et al. (2010) provides a complete tutorial and the
mathematical proof of the Savage-Dickey method; also see O’Hagan and Forster (2004). For
a Bayes Factor Test calibrated to investigate replication success, see Verhagen and
Wagenmakers (2014). A special issue on hierarchical modeling and Bayes factors appears in
the journal Computational Brain and Behavior in response to an article by van Doorn et al.
(2021). Kruschke and Liddell (2018) discuss alternatives to Bayes factors for hypothesis
testing. An argument against null hypothesis testing with Bayes Factors appears in this blog
post by Andrew Gelman: https://statmodeling.stat.columbia.edu/2019/09/10/i-hate-bayes-factors-when-theyre-used-for-null-hypothesis-significance-testing/. An argument in favor of null hypothesis testing with the Bayes factor as an approximation (but assuming realistic effects) appears in: https://statmodeling.stat.columbia.edu/2018/03/10/incorporating-bayes-factor-understanding-scientific-information-replication-crisis/. A visualization of the distinction
between Bayes factor and k-fold cross-validation is in a blog post by Fabian Dablander,
https://tinyurl.com/47n5cte4. Decision theory, which was only mentioned in passing in this
chapter, is discussed in Parmigiani and Inoue (2009). Hypothesis testing in its different flavors
is discussed in Robert (2022).

15.8 Exercises

Exercise 15.1 Is there evidence for differences in the effect of cloze probability among the
subjects?

Use the Bayes factor to compare the log cloze probability model that we examined in section 15.2.2 with a similar model that incorporates the strong assumption of no difference between subjects for the effect of cloze (τ_{u_2} = 0).

Exercise 15.2 Is there evidence for the claim that English subject relative clauses are easier to process than object relative clauses?

Consider again the reading time data coming from Experiment 1 of Grodner and Gibson
(2005) presented in exercise 5.2. Try to quantify the evidence against the null model (no
population-level reading times difference between SRC and ORC) relative to the following
alternative models:

a. β ∼ Normal(0, 1)

b. β ∼ Normal(0, .1)

c. β ∼ Normal(0, .01)

d. β ∼ Normal+ (0, 1)

e. β ∼ Normal+ (0, .1)

f. β ∼ Normal+ (0, .01)


(A Normal+(·) prior can be set in brms by defining a lower boundary of 0, with the argument lb = 0.)

What are the Bayes factors in favor of the alternative models a-f, compared to the null model?

Exercise 15.3 Is there evidence for the claim that sentences with subject relative clauses are
easier to comprehend?

Consider now the question response accuracy of the data of Experiment 1 of Grodner and
Gibson (2005).

a. Compare a model that assumes that RC type affects question accuracy at the population level and by subjects and by items with a null model that assumes that no population-level effect is present.
b. Compare a model that assumes that RC type affects question accuracy at the population level and by subjects and by items with another null model that assumes that neither population-level nor group-level effects are present, that is, no by-subject or by-item effects. What is the meaning of the results of this Bayes factor analysis?

Assume that for the effect of RC on question accuracy, β ∼ Normal(0, .1) is a reasonable
prior, and that for all the variance components, the same prior, τ ∼ Normal+ (0, 1) , is a
reasonable prior.

Exercise 15.4 Bayes factor and bounded parameters using Stan.

Re-fit the data of a single subject pressing a button repeatedly from section 4.2 ( data("df_spacebar") ), coding the model in Stan.

Start by assuming the following likelihood and priors:

rt_n ∼ LogNormal(α + c_trial_n ⋅ β, σ)

α ∼ Normal(6, 1.5)

β ∼ Normal+ (0, .1)

σ ∼ Normal+ (0, 1)

Use the Bayes factor to answer the following questions:

a. Is there evidence for any effect of trial number in comparison with no effect?
b. Is there evidence for a positive effect of trial number (as the subject reads further, they slow down) in comparison with no effect?
c. Is there evidence for a negative effect of trial number (as the subject reads further, they speed up) in comparison with no effect?
d. Is there evidence for a positive effect of trial number in comparison with a negative effect?
d. Is there evidence for a positive effect of trial number in comparison with a negative effect?
(Expect very large Bayes factors in this exercise.)

References

Anderson, John R., Dan Bothell, Michael D. Byrne, Scott Douglass, Christian Lebiere, and
Yulin Qin. 2004. “An Integrated Theory of the Mind.” Psychological Review 111 (4): 1036–60.

Bennett, Charles H. 1976. “Efficient Estimation of Free Energy Differences from Monte Carlo
Data.” Journal of Computational Physics 22 (2): 245–68. https://doi.org/10.1016/0021-9991(76)90078-4.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.

Chen, Stanley F, and Joshua Goodman. 1999. “An Empirical Study of Smoothing Techniques
for Language Modeling.” Computer Speech & Language 13 (4): 359–94.
https://doi.org/10.1006/csla.1999.0128.

Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–
29.

Dickey, James M, BP Lientz, and others. 1970. “The Weighted Likelihood Ratio, Sharp
Hypotheses About Chances, the Order of a Markov Chain.” The Annals of Mathematical
Statistics 41 (1). Institute of Mathematical Statistics: 214–26.

Dillon, Brian William. 2011. “Structured Access in Sentence Comprehension.” PhD thesis.

Engelmann, Felix, Lena A. Jäger, and Shravan Vasishth. 2020. “The Effect of Prominence and
Cue Association in Retrieval Processes: A Computational Account.” Cognitive Science 43 (12):
e12800. https://doi.org/10.1111/cogs.12800.

Gelman, Andrew, and John B. Carlin. 2014. “Beyond Power Calculations: Assessing Type S
(Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6). SAGE
Publications: 641–51.

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic
Input.” Cognitive Science 29: 261–90.

Gronau, Quentin F., Alexandra Sarafoglou, Dora Matzke, Alexander Ly, Udo Boehm, Maarten
Marsman, David S Leslie, Jonathan J Forster, Eric-Jan Wagenmakers, and Helen
Steingroever. 2017a. “A Tutorial on Bridge Sampling.” Journal of Mathematical Psychology 81.
Elsevier: 80–97.
———. 2017b. “A Tutorial on Bridge Sampling.” Journal of Mathematical Psychology 81: 80–97. https://doi.org/10.1016/j.jmp.2017.09.005.

Gronau, Quentin F., Henrik Singmann, and Eric-Jan Wagenmakers. 2017. “Bridgesampling: An R Package for Estimating Normalizing Constants.” arXiv. http://arxiv.org/abs/1710.08162.

Hammerly, Christopher, Adrian Staub, and Brian Dillon. 2019. “The Grammaticality Asymmetry
in Agreement Attraction Reflects Response Bias: Experimental and Modeling Evidence.”
Cognitive Psychology 110: 70–104.

Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference
in Sentence Comprehension: Literature review and Bayesian meta-analysis.” Journal of
Memory and Language 94: 316–39. https://doi.org/10.1016/j.jml.2017.01.004.

Jeffreys, Harold. 1939. Theory of Probability. Oxford: Clarendon Press.

Kass, Robert E, and Adrian E Raftery. 1995. “Bayes Factors.” Journal of the American
Statistical Association 90 (430). Taylor & Francis: 773–95.

Kruschke, John, and Torrin M Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing,
Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic
Bulletin & Review 25 (1). Springer: 178–206.

Lago, Sol, Diego Shalom, Mariano Sigman, Ellen F Lau, and Colin Phillips. 2015. “Agreement
Processes in Spanish Comprehension.” Journal of Memory and Language 82: 133–49.

Lewis, Richard L., and Shravan Vasishth. 2005. “An Activation-Based Model of Sentence
Processing as Skilled Memory Retrieval.” Cognitive Science 29: 1–45.

Lidstone, George James. 1920. “Note on the General Case of the Bayes-Laplace Formula for
Inductive or a Posteriori Probabilities.” Transactions of the Faculty of Actuaries 8 (182-192):
13.

MacKay, David JC. 2003. Information Theory, Inference and Learning Algorithms. Cambridge,
UK: Cambridge University Press.

Meng, Xiao-li, and Wing Hung Wong. 1996. “Simulating Ratios of Normalizing Constants via a
Simple Identity: A Theoretical Exploration.” Statistica Sinica, 831–60.
Navarro, Daniel. 2015. Learning Statistics with R. https://learningstatisticswithr.com.

Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical methods for linguistic research:
Foundational Ideas - Part II.” Language and Linguistics Compass 10 (11): 591–613.
https://doi.org/10.1111/lnc3.12207.

Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020c. “Are Words Pre-Activated
Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian
Random-Effects Meta-Analysis Using Publicly Available Data.” Neuropsychologia, 107427.

Nieuwland, Mante S, Stephen Politzer-Ahles, Evelien Heyselaar, Katrien Segaert, Emily Darley, Nina Kazanina, Sarah Von Grebmer Zu Wolfsthurn, et al. 2018. “Large-Scale Replication Study Reveals a Limit on Probabilistic Prediction in Language Comprehension.” eLife 7. https://doi.org/10.7554/eLife.33468.

O’Hagan, Anthony, Caitlin E Buck, Alireza Daneshkhah, J Richard Eiser, Paul H Garthwaite,
David J Jenkinson, Jeremy E Oakley, and Tim Rakow. 2006. Uncertain Judgements: Eliciting
Experts’ Probabilities. John Wiley & Sons.

O’Hagan, Antony, and Jonathan Forster. 2004. “Kendall’s Advanced Theory of Statistics, Vol.
2B: Bayesian Inference.” Wiley.

Parmigiani, Giovanni, and Lurdes Inoue. 2009. Decision Theory: Principles and Approaches.
John Wiley & Sons.

Robert, Christian P. 2022. “50 Shades of Bayesian Testing of Hypotheses.” arXiv Preprint
arXiv:2206.06659.

Rouder, Jeffrey N., Julia M Haaf, and Joachim Vandekerckhove. 2018. “Bayesian Inference for
Psychology, Part IV: Parameter Estimation and Bayes Factors.” Psychonomic Bulletin &
Review 25 (1): 102–13.

Rouder, Jeffrey N., Paul L Speckman, Dongchu Sun, Richard D Morey, and Geoffrey Iverson.
2009. “Bayesian T Tests for Accepting and Rejecting the Null Hypothesis.” Psychonomic
Bulletin & Review 16 (2): 225–37.

Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. New York: Chapman; Hall,
CRC Press.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled
Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1). American
Psychological Association: 103–26.
Schad, Daniel J., Bruno Nicenboim, Paul-Christian Bürkner, Michael J. Betancourt, and
Shravan Vasishth. 2021. “Workflow Techniques for the Robust Use of Bayes Factors.”

Schad, Daniel J, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, and Shravan
Vasishth. 2022. “Workflow Techniques for the Robust Use of Bayes Factors.” Psychological
Methods. American Psychological Association.

Schad, Daniel J, Bruno Nicenboim, and Shravan Vasishth. 2022. “Data Aggregation Can Lead
to Biased Inferences in Bayesian Linear Mixed Models.” arXiv Preprint arXiv:2203.02361.

Schönbrodt, Felix D, and Eric-Jan Wagenmakers. 2018. “Bayes Factor Design Analysis:
Planning for Compelling Evidence.” Psychonomic Bulletin & Review 25 (1): 128–42.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018.
“Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint
arXiv:1804.06788.

Tendeiro, Jorge N., and Henk A. L. Kiers. 2022. “Diagnosing the Use of the Bayes Factor in Applied Research.”

van Doorn, Johnny, Frederik Aust, Julia M Haaf, Angelika Stefan, and Eric-Jan Wagenmakers.
2021. “Bayes Factors for Mixed Models.” Computational Brain and Behavior.
https://doi.org/10.1007/s42113-021-00113-2.

Van Dyke, Julie A, and Brian McElree. 2011. “Cue-Dependent Interference in Comprehension.” Journal of Memory and Language 65 (3). Elsevier: 247–63.

Vasishth, Shravan, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. 2018a. “The
Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of
Memory and Language 103: 151–75. https://doi.org/10.1016/j.jml.2018.07.004.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample
Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.”
Computational Brain and Behavior.

Verhagen, Josine, and Eric-Jan Wagenmakers. 2014. “Bayesian Tests to Quantify the Result
of a Replication Attempt.” Journal of Experimental Psychology: General 143 (4): 1457–75.
https://doi.org/10.1037/a0036731.

Wagenmakers, Eric-Jan, Michael D. Lee, Jeffrey N. Rouder, and Richard D. Morey. 2020. “The
Principle of Predictive Irrelevance or Why Intervals Should Not Be Used for Model
Comparison Featuring a Point Null Hypothesis.” In The Theory of Statistics in Psychology:
Applications, Use, and Misunderstandings, edited by Craig W. Gruber, 111–29. Cham:
Springer International Publishing. https://doi.org/10.1007/978-3-030-48043-1_8.
Wagenmakers, Eric-Jan, Tom Lodewyckx, Himanshu Kuriyal, and Raoul Grasman. 2010.
“Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method.”
Cognitive Psychology 60 (3). Elsevier: 158–89.

Wang, Fei, and Alan E Gelfand. 2002. “A Simulation-Based Approach to Bayesian Sample
Size Determination for Performance Under a Given Model and for Separating Models.”
Statistical Science. JSTOR, 193–208.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values:
Context, Process, and Purpose.” The American Statistician 70 (2). Taylor & Francis: 129–33.

Yadav, Himanshu, Garrett Smith, Sebastian Reich, and Shravan Vasishth. 2022. “Number
Feature Distortion Modulates Cue-Based Retrieval in Reading.” Journal of Memory and
Language.

42. Given that the posterior is analytically available for beta-distributed priors for the binomial
distribution, we could alternatively compute the posterior first, and then integrate out the
probability θ.↩

43. This meta-analysis already includes the data that we want to make inference about; thus,
this meta-analysis estimate is not really the right estimate to use, since it involves using
the data twice. We ignore this detail here because our goal is simply to illustrate the
approach.↩

Chapter 16 Cross-validation

A popular way to evaluate and compare models is to investigate their ability to make
predictions for “out-of-sample data”, that is, to use what we learned from the observed data to
predict future or unseen observations. Cross-validation is used to test which of the models
under consideration is/are able to learn the most from our data in order to make better
predictions. However, in cognitive science, our objective will rarely be to predict future
observations, but rather to compare how well different models fare in accounting for the
observed data.

The objective of cross-validation is to avoid over-optimistic predictions; such over-optimistic predictions would arise if we were to use the data to estimate the parameters of our model, and then use these estimates to predict the same data. That amounts to using the data twice. The basic idea behind cross-validation is that the models are fit with a large subset of the data, the training set, and are then used to predict a smaller part of the data, the held-out set. In order to treat the entire data set as a held-out set, and to evaluate the predictive accuracy using every observation, one changes what constitutes the training set and the held-out set. This ensures that the predictions of the model are tested over the entire data set.

16.1 The expected log predictive density of a model

In order to compare the quality of the posterior predictions of two models, a utility function or a
scoring rule is used (see Gneiting and Raftery 2007 for a review on scoring rules). The
logarithmic score rule (Good 1952), shown in equation (16.1), has been proposed as a
reasonable way to assess the posterior predictive distribution of a candidate model M1 given
the data y. This approach is reasonable because it takes into account the uncertainty of the predictions (compare this with the mean squared error). If new observations are well accounted for by the posterior predictive distribution, then the density of the posterior predictive distribution is high and so is its logarithm.

u(M1 , ypred ) = log p(ypred |y, M1 ) (16.1)

Unlike the Bayes factor, the prior is absent from equation (16.1). However, the prior does have a role here: The posterior predictive distribution is based on the posterior distribution p(Θ ∣ y) (where Θ is a vector of all the parameters of the model), which, according to Bayes' rule, depends on both priors and likelihood together. Recall equation (3.8) in section 3.6, repeated here for convenience:

p(ypred ∣ y) = ∫_Θ p(ypred ∣ Θ) p(Θ ∣ y) dΘ   (16.2)

In equation (16.2), we are implicitly conditioning on the model under consideration:

p(ypred ∣ y, M1) = ∫_Θ p(ypred ∣ Θ, M1) p(Θ ∣ y, M1) dΘ   (16.3)

The predicted data, ypred, are unknown to the utility function, so the utility function as presented in equation (16.1) cannot be evaluated. For this reason, we marginalize over all possible future data (calculating E[log p(ypred ∣ y, M1)]); this expression is called the expected log predictive density of model M1:

elpd = u(M1) = ∫_{ypred} pt(ypred) log p(ypred ∣ y, M1) dypred   (16.4)

where pt is the true data generating distribution. If we consider a set of models, the model with
the highest elpd is the model with the predictions that are the closest to those of the true data
generating process.44 The intuition behind equation (16.4) is that we are evaluating the predictive distribution of M1 over all possible future data, weighted by how likely the future data are according to their true distribution. This means that observations that are very likely according to the true model will have a higher weight than unlikely ones.

But we don’t know the true data-generating distribution, pt ! If we knew it, we wouldn’t be
looking for the best model, since pt is the best model.

We can use the observed data distribution as a proxy for the true data generating distribution.
So instead of weighting the predictive distribution by the true density of all possible future
data, we just use the N observations that we have. We can do that because our observations
are presumed to be samples from the true distribution of the data: Under this assumption,
observations with higher likelihood according to the true distribution of the data will also be
more commonly obtained. This means that instead of integrating, we sum the log posterior predictive density of the observations, giving each observation the same weight; this is valid because observations that are more common will already appear more often (see also Box 16.1). This quantity is called the log pointwise predictive density or lpd (defined without the 1/N in Vehtari, Gelman, and Gabry 2017b):

lpd = (1/N) ∑_{n=1}^{N} log p(y_n ∣ y, M1)   (16.5)
The lpd is an overestimate of elpd for actual future data, because the parameters of the
posterior predictive distribution are estimated with the same observations that we are
considering out-of-sample. Incidentally, this also explains why posterior predictive checks are
generally optimistic and good fits cannot be taken too seriously. But they do serve the purpose
of identifying very strong model misspecifications.45
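For a fitted brms model, the lpd (as in equation (16.5), but without the 1/N) can be computed from the pointwise log-likelihood matrix; a minimal sketch, where fit stands for any brmsfit object and matrixStats is used for a numerically stable log-mean-exp:

library(matrixStats)
ll <- log_lik(fit)  # S posterior draws x N observations
# For each observation, average the likelihood over draws on the log scale:
lpd_n <- colLogSumExps(ll) - log(nrow(ll))
lpd <- sum(lpd_n)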

However, we can obtain a more conservative estimate of the predictive performance of a model using cross-validation (Geisser and Eddy 1979). This is explained next. (As an aside, we mention here that there are also other alternatives to cross-validation; these are presented in Vehtari and Ojanen 2012.)

Box 16.1 How do we get rid of the integral in the approximation of elpd?

As an example, imagine that there are N observations in an experiment. Suppose also that the true generative process (which is always unknown to us) is a Beta distribution:

pt(y) = Beta(y ∣ 1, 3)

Set N and observe some simulated data y:


N <- 10000
y_data <- rbeta(N, 1, 3)
head(y_data)

## [1] 0.222 0.466 0.509 0.117 0.614 0.362

Let's say that we fit the Bayesian model M1, and somehow, after getting the posterior distribution, we are able to derive the analytical form of its posterior predictive distribution for the model:

p(ypred ∣ y, M1) = Beta(ypred ∣ 2, 2)

This distribution will tell us how likely different future observations will be, and it also
entails that our future observations will be bounded by 0 and 1. (Any observation outside
this range will have a probability density of zero).

Imagine that we could know the true distribution of the data, pt , which is conveniently
close to our posterior predictive distribution. This means that equation (16.4), repeated
below, is simple enough, and we know all its terms:
elpd = u(M1) = ∫_{ypred} pt(ypred) log p(ypred ∣ y, M1) dypred

We can compute this quantity in R. Notice that we don’t introduce the data at any point.
However, the data had to be used when p , the posterior predictive distribution, was
derived; we skipped that step here.


# True distribution:
p_t <- function(y) dbeta(y, 1, 3)
# Predictive distribution:
p <- function(y) dbeta(y, 2, 2)
# Integration:
integrand <- function(y) p_t(y) * log(p(y))
integrate(f = integrand, lower = 0, upper = 1)

## -0.375 with absolute error < 0.00000068

Because we will never know p_t, this integral can be approximated using the data, y_data. It is possible to approximate the integration without any reference to p_t (see equation (16.5)):


1/N * sum(log(p(y_data)))

## [1] -0.356

The main problem with this approach is that we are using y_data twice: once to derive p, the posterior predictive distribution, and once for the approximation of elpd. We'll see
that cross-validation approaches rely on deriving the posterior predictive distribution with
part of the data, and estimating the approximation to elpd with unseen data. (Don’t worry
that we don’t know the analytical form of the posterior predictive distribution: we saw that
we could generate samples from that distribution based on the likelihood and posterior
samples.)
16.2 K-fold and leave-one-out cross-validation

The basic idea of K-fold cross-validation (K-fold-CV) is to split the N observations of our data into K subsets, such that each subset is used as a validation (or held-out) set, Dk, while the remaining set (the training set), D−k, is used for estimating the parameters and approximating pt, the true data distribution. The leave-one-out cross-validation (LOO-CV) method represents a special case of K-fold-CV where the training set excludes only one observation (K = N). We estimate elpd as follows:

ˆelpd = (1/N) ∑_{n=1}^{N} log p(y_n ∣ D∖n, M1)   (16.6)

In equation (16.6), each observation, yn, belongs to a certain "validation" fold, Dk, and the predictive accuracy of yn is evaluated based on a posterior predictive model trained on the set D∖n, which is the complete data set excluding the validation fold that contains the n-th observation. This means that the posterior predictive distribution is used to evaluate yn, even though the posterior predictive distribution was derived without having information from that n-th observation (in other words, the model was trained without that observation, on the subset of the data D−k). In K-fold-CV, several observations are held out in the same (validation) fold. This means that the held-out observations are split among K folds, and D∖n, the data used to derive the posterior predictive distribution, contains only a proportion of the observations; this proportion is (1 − 1/K). By contrast, in leave-one-out cross-validation, the held-out data set includes only one observation. That is, D∖n contains the entire data set except for one data point, yn, with n = 1, …, N. Box 16.2 explains the algorithm in detail.

Vehtari, Gelman, and Gabry (2017b) define the expected log pointwise predictive density of the observation yn as follows:

ˆelpd_n = log p(yn ∣ D∖n, M1)

This quantity indicates the predictive accuracy of the model M1 for a single observation; it is reported by the package loo and also by brms. In addition, the loo package uses the sum of the expected log pointwise predictive densities, ∑ ˆelpd_n (equation (16.6) without the 1/N), as a measure of predictive accuracy (this is referred to as elpd_loo or elpd_kfold by the loo and brms packages). For model comparison, the difference between the ∑ ˆelpd_n of competing models can be computed, including the standard deviation of the sampling distribution of the difference. It is important to notice that we are calculating an approximation to the expectation that we actually want to compute, elpd, and thus we always need to consider its inherent randomness (Vehtari, Simpson, et al. 2019).
Unlike what is common with information criterion methods (such as the Akaike Information Criterion, AIC, and the Deviance Information Criterion, DIC), a higher ˆelpd means higher predictive accuracy. An alternative to using ˆelpd is to examine −2 × ˆelpd, which is equivalent to deviance, and is called the LOO Information Criterion (LOOIC) (see section 22 of Vehtari 2022).

The approximation to the true data generating distribution is worse when fewer observations are used, and thus ideally we would set K = N, that is, compute LOO-CV rather than K-fold-CV. The main advantage of LOO-CV is its robustness, since the training set is as similar as possible to the observed data, and the same observations are never used simultaneously for training and for evaluating the predictions. A major disadvantage is the computational burden (Vehtari and Ojanen 2012), since we need to fit a model as many times as the number of observations. The package loo provides an approximation to LOO-CV, Pareto smoothed importance sampling leave-one-out (PSIS-LOO; Vehtari and Gelman 2015; Vehtari, Gelman, and Gabry 2017b), which, as we show next, is relatively straightforward to use with brms and with Stan models (see https://mc-stan.org/loo/articles/loo2-with-rstan.html). However, in some cases its estimates can be unreliable; this is indicated by the estimated shape parameter ˆk of the generalized Pareto distribution. In those cases, where one or several pointwise predictive densities have associated large ˆk values (larger than 0.5 or 0.7; see https://mc-stan.org/loo/reference/pareto-k-diagnostic.html), either (i) the problematic observations can be refit with exact LOO-CV, (ii) one can try some additional computations using the existing posterior sample based on the moment matching approximation (see https://mc-stan.org/loo/articles/loo2-moment-matching.html and Paananen et al. 2021), or (iii) one can abandon PSIS-LOO-CV and use K-fold-CV, with K typically set to 10.

One of the main disadvantages of cross-validation (at least in comparison with the Bayes factor) is that the numerical difference in predictive accuracy is hard to interpret. As a rule of thumb, it has been suggested that if the elpd difference (elpd_diff in the loo package) is less than 4, the difference is small, and if it is larger than 4, one should compare that difference to its standard error (se_diff) (see section 16 of Vehtari 2022).

Box 16.2 The cross-validation algorithm

Here we spell out the Bayesian cross-validation algorithm in detail:

1. Split the data pseudo-randomly into K held-out or validation sets Dk (where k = 1, …, K) that are a fraction of the original data, and K training sets, D−k. The length of the held-out data vector Dk is approximately 1/K-th the size of the full data set. It is common to use K = 10 for K-fold-CV. For LOO-CV, K is set to the number of observations.
2. Fit K models using each of the K training sets, and obtain posterior distributions p−k(Θ) = p(Θ ∣ D−k), where Θ is the vector of model parameters.
3. Each posterior distribution p(Θ ∣ D−k) is used to compute the predictive accuracy for each held-out data point yn in the vector Dk:

ˆelpd_n = log p(yn ∣ D−k)

4. Given that the posterior distribution p(Θ ∣ D−k) is summarized by S samples, the log predictive density for each data point yn in a data vector Dk can be approximated as follows:

ˆelpd_n = log( (1/S) ∑_{s=1}^{S} p(yn ∣ Θ^{k,s}) )   (16.7)

where Θ^{k,s} corresponds to sample s of the posterior of the model fit to the training set D−k.
5. We obtain the elpd_kfold (or elpd_loo) for all the held-out data points by summing up the ˆelpd_n:

elpd_kfold = ∑_{n=1}^{N} ˆelpd_n   (16.8)

We can also compute the standard deviation of the sampling distribution (the standard error) by multiplying the standard deviation (the square root of the variance) of the N components by √N. Letting ˆELPD be the vector (ˆelpd_1, …, ˆelpd_N), we can write:

se(ˆelpd) = √(N Var(ˆELPD))   (16.9)

The difference between the elpd_kfold of two competing models, M1 and M2, is a measure of relative predictive performance. We can also compute the standard error of their difference using the formula discussed in Vehtari, Gelman, and Gabry (2017b):

se(ˆelpd_M1 − ˆelpd_M2) = √(N Var(ˆELPD_M1 − ˆELPD_M2))   (16.10)
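The functions loo() and kfold() used below automate all of these steps; purely as an illustration of the algorithm in this box, a hand-rolled K-fold computation might look as follows (a sketch; fit stands for any brmsfit, and the fold assignment here is random rather than stratified):

K <- 10
d <- fit$data
folds <- sample(rep(1:K, length.out = nrow(d)))  # step 1
elpd_n <- numeric(nrow(d))
for (k in 1:K) {
  # Step 2: refit the model on the training set D_{-k}.
  fit_k <- update(fit, newdata = d[folds != k, ])
  # Steps 3-4: log predictive density of each held-out observation,
  # averaging the likelihood over the S posterior draws.
  ll_k <- log_lik(fit_k, newdata = d[folds == k, ],
                  allow_new_levels = TRUE)
  elpd_n[folds == k] <- matrixStats::colLogSumExps(ll_k) - log(nrow(ll_k))
}
elpd_kfold <- sum(elpd_n)                        # equation (16.8)
se_elpd <- sqrt(length(elpd_n) * var(elpd_n))    # equation (16.9)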
16.3 Testing the N400 effect using cross-validation

As we did in section 15.2 with the Bayes factor, we revisit section 5.2, where we estimated the
effect of cloze probability on the N400 average signal. We consider two models here, a model
that includes the effect of cloze probability, such as fit_N400_sih from section 5.2.5, and a
null model.

We can verify that the formula that we fit corresponds to a hierarchical model that includes an effect of cloze probability as follows:


formula(fit_N400_sih)

## n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item)

In contrast to the situation with the Bayes factor, priors are less critical for cross-validation. Priors matter for cross-validation only to the extent that they affect parameter estimation: As we saw previously, very narrow priors can bias the posterior, and unrealistically wide priors can lead to convergence problems. The number of samples is also less critical than with the Bayes factor; most of the uncertainty in the estimates of the ˆelpd is due to the number of observations. However, a very small number of samples can affect the ˆelpd, because the posterior estimation will be affected by the small sample size. We update our previous formula to define a null model as follows:


fit_N400_sih_null <- update(fit_N400_sih, ~ . - c_cloze)

16.3.1 Cross-validation with PSIS-LOO

Estimating elpd using PSIS-LOO is very straightforward with brms, which uses the package loo as a back-end. There is no need to refit the model, and loo takes care of applying the PSIS approximation to derive estimates and standard errors.

(loo_sih <- loo(fit_N400_sih))

##
## Computed from 4000 by 2863 log-likelihood matrix
##

## Estimate SE
## elpd_loo -11092.9 46.7
## p_loo 81.5 2.8
## looic 22185.7 93.4
## ------
## Monte Carlo SE of elpd_loo is 0.1.

##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.


(loo_sih_null <- loo(fit_N400_sih_null))

##
## Computed from 4000 by 2863 log-likelihood matrix
##
## Estimate SE
## elpd_loo -11095.5 46.5

## p_loo 89.5 3.0


## looic 22191.0 93.1
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## All Pareto k estimates are good (k < 0.5).

## See help('pareto-k-diagnostic') for details.

The function loo reports three quantities with their standard error:

1. elpd_loo is the sum of pointwise predictive accuracy (a larger, less negative number
indicates better predictions).
2. p_loo is an estimate of effective complexity of the model; asymptotically and under
certain regularity conditions, p_loo can be interpreted as the effective number of
parameters. If p_loo is larger than the number of data points or parameters, this may
indicate a severe model misspecification.
3. looic is simply -2*elpd_loo , the elpd on the deviance scale. This is called the
information criterion, and is mainly provided for historical reasons: other information
criteria like the AIC (Akaike Information Criterion) and the DIC (Deviance Information
Criterion) are commonly used in model selection (Venables and Ripley 2002; Lunn et al.
2012).

It's important to bear in mind that the PSIS-LOO approximation to LOO can only be trusted if the Pareto k estimates (ˆk) are smaller than 0.7. To compare the models, we need to take a look at the difference between their elpd_loo values and the standard error of that difference:


loo_compare(loo_sih, loo_sih_null)

## elpd_diff se_diff
## fit_N400_sih 0.0 0.0
## fit_N400_sih_null -2.6 2.5

Although the model that includes cloze probability as a predictor has higher predictive accuracy, the difference is smaller than 4 and smaller than two standard errors. This means that from the perspective of LOO-CV, both models are almost indistinguishable! In fact, the same happens if we compare the model with logarithmic predictability to the linear or null model; see exercise 16.1.

We could also check whether the alternative model is making good predictions for some range of values by examining the difference in pointwise predictive accuracy as a function of, for example, cloze probability. In the following plot, we subtract the predictive accuracy of the null model from the accuracy of the alternative model; we can then interpret larger differences as an advantage for the alternative model. However, we see that as far as posterior predictive accuracy goes, both models are quite similar. Figure 16.1 shows that the difference in predictive accuracy is symmetrical with respect to zero; as we go further from the mean cloze probability (which is around 0.5), the differences in predictions are larger, but they span both positive and negative values.
The following code stores the difference in predictive accuracy of the models in a variable and
plots it in Figure 16.1.


df_eeg <- mutate(df_eeg,
                 diff_elpd = loo_sih$pointwise[, "elpd_loo"] -
                   loo_sih_null$pointwise[, "elpd_loo"])
ggplot(df_eeg, aes(x = cloze, y = diff_elpd)) +
  geom_point(alpha = .4, position = position_jitter(w = .001, h = 0))


FIGURE 16.1: The difference in predictive accuracy between a model including the effect of
cloze and a null model. A larger (more positive) difference indicates an advantage for the
model that includes the effect of cloze.

This is unsettling because the effect of cloze probability on the N400 has been replicated in
numerous studies. We would expect to see that, similar to the Bayes factor, cross-validation
techniques will also show that a model that includes cloze probability as a predictor is superior
to a model without it. Before we discuss why we don’t see a large difference, let us check what
K-fold-CV yields.
16.3.2 Cross-validation with K-fold

Estimating elpd using K-fold-CV has the advantage of omitting one layer of approximations: the elpd based on PSIS-LOO-CV is an approximation of the elpd based on exact LOO-CV (and we saw that any cross-validation approach gives an approximation to the true elpd). This means that we don't need to worry about ˆk. However, K-fold-CV also uses a reduced training set in comparison with LOO-CV, worsening the approximation to the true generating process pt.

Because we divide our data into folds, we need to think about the way we split the data: We could do it randomly, but we would then take the risk that, in some of the training sets, observations from a given subject would be completely absent. This would lead to large differences in predictive accuracy between folds. We can avoid that by using stratification: we split the observations into groups, ensuring that relative category frequencies are approximately preserved. We do this with the kfold() function, available in the package brms, by setting folds = "stratified" and group = "subj"; by default, K is set to 10, but that can be changed.

kfold_sih <- kfold(fit_N400_sih,
                   folds = "stratified",
                   group = "subj")
kfold_sih_null <- kfold(fit_N400_sih_null,
                        folds = "stratified",
                        group = "subj")

Running K-fold CV takes some time since each model is refit K times. We can now inspect the
elpd values:


kfold_sih
##
## Based on 10-fold cross-validation
##
## Estimate SE
## elpd_kfold -11097.4 46.6

## p_kfold 85.7 3.5


## kfoldic 22194.8 93.1


kfold_sih_null

##
## Based on 10-fold cross-validation

##
## Estimate SE
## elpd_kfold -11098.7 46.5
## p_kfold 93.3 3.7
## kfoldic 22197.5 93.0

Compare the two models using loo_compare (this function is used for both PSIS-LOO-CV and K-fold-CV):


loo_compare(kfold_sih, kfold_sih_null)

## elpd_diff se_diff
## fit_N400_sih 0.0 0.0
## fit_N400_sih_null -1.3 4.1

We see that, in this case, the results with K-fold-CV and PSIS-LOO-CV are quite similar: We
can’t really distinguish between the two models.
16.3.3 Leave-one-group-out cross-validation

An alternative to splitting the observations randomly using stratification is to treat naturally occurring clusters as folds; this is leave-one-group-out cross-validation (LOGO-CV). By doing this, we can interpret the output of cross-validation as the capacity of the models to generalize to unseen clusters. We implement LOGO-CV with subjects.

logo_sih <- kfold(fit_N400_sih, group = "subj")
logo_sih_null <- kfold(fit_N400_sih_null, group = "subj")

Running LOGO CV with subjects takes some time since each model is refit as many times as
we have subjects, in this case 37 times. We can now inspect the elpd estimates and evaluate
which model generalizes better to unseen subjects.

We compare the models using loo_compare .


loo_compare(logo_sih, logo_sih_null)

## elpd_diff se_diff
## fit_N400_sih 0.0 0.0
## fit_N400_sih_null -1.5 2.3

As before with PSIS-LOO-CV and K-fold-CV, we cannot distinguish between the two models.
16.4 Comparing different likelihoods with cross-validation

We now compare two models with different likelihoods. In section 3.7.2 of chapter 3, we saw that a log-normal distribution was a more appropriate likelihood than a normal distribution for response time data. This was because response times are bounded by zero and right skewed, unlike the symmetrical normal distribution. Now we'll use PSIS-LOO-CV to compare the predictive accuracy of the Stroop model from section 5.3 in chapter 5, which assumed a log-normal likelihood to fit the response times of correct responses, with a similar model that assumes a normal likelihood.

Load the data from bcogsci , create a sum coded predictor (see chapter 8 for more details),
and fit the model as in section 5.3.

data("df_stroop")
df_stroop <- df_stroop %>%
  mutate(c_cond = if_else(condition == "Incongruent", 1, -1))


fit_stroop_log <- brm(RT ~ c_cond + (c_cond | subj),


family = lognormal(),
prior =
c(
prior(normal(6, 1.5), class = Intercept),

prior(normal(0, 1), class = b),


prior(normal(0, 1), class = sigma),
prior(normal(0, 1), class = sd),
prior(lkj(2), class = cor)
),
data = df_stroop

Calculate the elpd_loo for the original model with the log-normal likelihood:

loo_stroop_log <- loo(fit_stroop_log)


loo_stroop_log

##
## Computed from 4000 by 3058 log-likelihood matrix
##
##           Estimate    SE
## elpd_loo  -19858.7  93.6
## p_loo         60.5   4.1
## looic      39717.4 187.1
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     3057  100.0%  1069
## (0.5, 0.7]    (ok)          1    0.0%   203
## (0.7, 1]      (bad)         0    0.0%  <NA>
## (1, Inf)      (very bad)    0    0.0%  <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.

The summary shows that all the Pareto k̂ estimates are ok.

Now fit a similar model where we assume that the likelihood is a normal distribution. It's
important now to change the priors, since they are on a different scale (namely, in
milliseconds). We choose reasonable but wide priors. We can do a sensitivity analysis if we
are unsure about the priors. However, unlike what happened with the Bayes factor in chapter
15, priors are going to affect cross-validation-based model comparison only insofar as they
have a noticeable effect on the posterior distribution.

fit_stroop_normal <- brm(RT ~ c_cond + (c_cond | subj),
                         family = gaussian(),
                         prior =
                           c(
                             prior(normal(400, 600), class = Intercept),
                             prior(normal(0, 100), class = b),
                             prior(normal(0, 300), class = sigma),
                             prior(normal(0, 300), class = sd),
                             prior(lkj(2), class = cor)
                           ),
                         data = df_stroop)

If we try to obtain the elpd based on PSIS-LOO-CV, we'll find several large k̂ values.


loo_stroop_normal <- loo(fit_stroop_normal)


loo_stroop_normal

##
## Computed from 4000 by 3058 log-likelihood matrix
##
##           Estimate    SE
## elpd_loo  -21588.9 480.5
## p_loo        117.5  67.2
## looic      43177.8 961.1
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
##                          Count Pct.   Min. n_eff
## (-Inf, 0.5]   (good)     3054  99.9%  1305
## (0.5, 0.7]    (ok)          1   0.0%    84
## (0.7, 1]      (bad)         2   0.1%    43
## (1, Inf)      (very bad)    1   0.0%     3
## See help('pareto-k-diagnostic') for details.

We can try the moment matching approximation for problematic observations by setting
moment_match = TRUE in the loo() call (this also requires fitting the model with save_pars =
save_pars(all = TRUE) ). In this particular case, this approximation won't solve our problem,
so we skip that step here.
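For completeness, a minimal sketch of what that could look like is shown below (not run here);
it assumes that update() is used to refit the model while saving all parameters:

# Sketch (not run): refit storing all parameters, then use
# moment matching for the problematic observations.
fit_stroop_normal_mm <- update(fit_stroop_normal,
                               save_pars = save_pars(all = TRUE))
loo(fit_stroop_normal_mm, moment_match = TRUE)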

Alternatively, we can use exact LOO-CV (rather than its approximation) for the problematic
observations. By setting reloo = TRUE , we re-fit the model for the 3 problematic observations
with k̂ values over 0.7, using exact LOO-CV.


loo_stroop_normal <- loo(fit_stroop_normal, reloo = TRUE)


loo_stroop_normal

##
## Computed from 4000 by 3058 log-likelihood matrix
##
##           Estimate     SE
## elpd_loo  -21702.3  589.5
## p_loo        230.7  180.2
## looic      43404.7 1178.9
## ------
## Monte Carlo SE of elpd_loo is 0.3.
##
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     3057  100.0%     1
## (0.5, 0.7]    (ok)          1    0.0%    77
## (0.7, 1]      (bad)         0    0.0%  <NA>
## (1, Inf)      (very bad)    0    0.0%  <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.

We are ready to compare the models.


loo_compare(loo_stroop_log, loo_stroop_normal)

##                   elpd_diff se_diff
## fit_stroop_log        0.0     0.0
## fit_stroop_normal -1843.6   533.8

Here cross-validation shows a clear advantage for the model with the log-normal likelihood.
We visualize the pointwise predictive accuracy in Figure 16.2.

df_stroop <- mutate(df_stroop,
                    diff_elpd = loo_stroop_log$pointwise[, "elpd_loo"] -
                      loo_stroop_normal$pointwise[, "elpd_loo"])
ggplot(df_stroop, aes(x = RT, y = diff_elpd)) +
  geom_point(alpha = .4) +
  xlab("RT (ms)")


FIGURE 16.2: The difference in predictive accuracy between a Stroop model with a log-
normal likelihood and a model with a normal likelihood. A larger (more positive) difference
indicates an advantage for the model with the log-normal likelihood.

Figure 16.2 shows that, at first glance, the advantage of the log-normal likelihood seems to lie
in its ability to capture extremely slow observations.

Figure 16.3 zooms in to visualize the pointwise predictive accuracy for observations with
response times smaller than 2 seconds.

ggplot(df_stroop, aes(x = RT, y = diff_elpd)) +
  geom_point(alpha = .3) +
  xlab("RT (ms)") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_cartesian(xlim = c(0, 2000), ylim = c(-10, 10))


FIGURE 16.3: The difference in predictive accuracy between a Stroop model with a log-normal
likelihood and a model with a normal likelihood, for observations with response times smaller
than 2 seconds. A larger (more positive) difference indicates an advantage for the model with
the log-normal likelihood.

Figure 16.3 suggests that the advantage of the log-normal likelihood also lies in being able
to account for most of the observations in the data set, which occur at around 500 ms.

16.5 Issues with cross-validation

Sivula, Magnusson, and Vehtari (2020) analyzed the behavior of the uncertainty estimate of
the elpd in typical situations. Although they focus on LOO-CV, the consequences are the
same for K-fold-CV (and cross-validation in a non-Bayesian context). Sivula, Magnusson, and
Vehtari (2020) identified three cases where the uncertainty estimates can perform badly:

1. The models make very similar predictions.
2. The number of observations is small.
3. The models are misspecified with outliers (influential extreme values) in the data.

When the models make similar predictions (as is the case with the nested models that we saw
in earlier model comparisons) and when there is not much difference in the predictive
performance of the models, the uncertainty estimates will behave badly. In these situations,
cross-validation is not very useful for separating very small effect sizes from zero effect sizes.
In addition, small differences in predictive performance cannot reliably be detected by
cross-validation if the number of observations is small (say, around 100 observations).
However, if the predictions are very similar, Sivula, Magnusson, and Vehtari (2020) show that
the same problems persist even with a larger data set.

One of the issues that cross-validation methods face when they are used to compare nested
models lies in the way that the exact elpd is approximated: In cross-validation approximations,
out-of-sample observations are used, which are not part of the data to which the model was fit.
Every time we evaluate the predictive accuracy of an observation, we ignore modeling
assumptions. One of the weaknesses of cross-validation is the high variance in the
approximation of the integral over the unknown true data distribution, p_t (Vehtari and Ojanen
2012, sec. 4).

Cross-validation methods are sometimes criticized because when a lot of data are available,
they will give undue preference to the complex model in comparison to a true simpler model
(Gronau and Wagenmakers 2018). This might be true for toy examples where we can have
unlimited observations and we compare a “wrong” model with the true model.46 However, the
problems that we face in practice are often very different: This is because the true model is
unknown and very likely not under consideration in our comparison (see Navarro 2019). In our
experience, we are very far from the asymptotic behavior of cross-validation whereby it gives
undue preference to a more complex model in comparison to a true simpler model. The main
weakness of cross-validation lies in its lack of assumptions, which prevents it from selecting a
more complex model rather than a simple one when there is only a modest gain in predictions
(Vehtari, Simpson, et al. 2019).

An alternative to the cross-validation approach discussed here for nested models is the
projection predictive method (Piironen, Paasiniemi, and Vehtari 2020). However, this approach
(which is less general, since it is valid only for generalized linear models) has a somewhat
different objective. In the projection predictive method, we first build the most complete
predictive model, the reference model, and then we look for a simpler model that gives
predictions as similar as possible to those of the reference model. The idea is that, for a given
complexity (number of predictors), the model with the smallest predictive discrepancy from the
reference model should be selected. See https://github.com/stan-dev/projpred for an
implementation of this approach; a hypothetical sketch of its workflow is shown below. This
approach thus focuses on model simplification rather than on model comparison.
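As a rough, hypothetical sketch (assuming a brms reference model fit_ref with several
population-level predictors; fit_ref is not an object fit in this chapter), the projpred
workflow could look like this:

library(projpred)
# Search over submodels of increasing size, using cross-validation:
vs <- cv_varsel(fit_ref)
# Predictive performance as a function of submodel size:
plot(vs, stats = "elpd")
# Suggested number of predictors to retain:
suggest_size(vs)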

For models that are badly misspecified, the bias in the uncertainty makes their comparison
unreliable as well. In this case, posterior predictive checks and possible model refinements
are worth considering before carrying out model comparison.

If there are a large number of observations and the models under consideration are different
enough from each other, the differences in predictive accuracy will dwarf the variance in the
estimate of elpd , and cross-validation can be very useful (see also Piironen and Vehtari
2017). An example of this situation appeared in section 16.4. When models are very different,
one advantage of cross-validation methods in comparison with the Bayes factor is that the
selection of priors is less critical in cross-validation. It is sometimes hard to decide on priors
that encode our knowledge for one model, and this difficulty is exacerbated when we want to
assign comparable prior information to models with a different number of parameters that
might be on a different scale. Given that cross-validation methods are less sensitive to prior
specification, different models can be compared on the same footing. See Nicenboim and
Vasishth (2018) for an example from psycholinguistics where K-fold-CV does help in
distinguishing between models.

16.6 Cross-validation in Stan

We can also use PSIS-LOO-CV and K-fold-CV with our Stan models, but we should be careful
to store the appropriate log-likelihood in the generated quantities block.

16.6.1 PSIS-LOO-CV in Stan

As explained earlier, PSIS-LOO (as implemented in the package loo ) approximates the
likelihood of the held-out data based on the observed data: it's faster (because only one model
is fit), and it requires only a minimal modification of the Stan code we use to fit a model. By
default, Stan only saves the sum of the log-likelihoods of the observations (in the parameter
lp__ ). If we want to store the log-likelihood of each observation, we have to do this in the
generated quantities block.

We revisit the model implemented in section 10.4.2, which was evaluated using the Bayes
factor in chapter 15. Now, we want to compare the predictive performance of a model that
assumes an effect of attentional load on pupil size against a similar model that assumes no
effect. To do this, we assume the following likelihood:

p_size_n ∼ Normal(α + c_load_n ⋅ β1 + c_trial_n ⋅ β2 + c_load_n ⋅ c_trial_n ⋅ β3, σ)

Define priors for all the β's as before:

α ∼ Normal(1000, 500)
β1, β2, β3 ∼ Normal(0, 100)
σ ∼ Normal+(0, 1000)

Prepare the data as in section 10.4.2:

df_pupil <- df_pupil %>%
  mutate(c_load = load - mean(load),
         c_trial = trial - mean(trial))
ls_pupil <- list(
  c_load = df_pupil$c_load,
  c_trial = df_pupil$c_trial,
  p_size = df_pupil$p_size,
  N = nrow(df_pupil)
)

Add a generated quantities block to the model shown below (it is also possible to run this
block in a stand-alone file with the rstan function gqs() ). If we use the variable name
log_lik in the Stan code, the loo package will know where to find the log-likelihood of the
observations.

We code the effects as beta1 , beta2 , and beta3 to make it easier to compare this model
with the one used in the Bayes factor chapter; in this case, we could have used a vector or an
array instead. This is the model pupil_model_cv.stan shown below:

data {
  int<lower = 1> N;
  vector[N] c_load;
  vector[N] c_trial;
  vector[N] p_size;
}
parameters {
  real alpha;
  real<lower = 0> beta1;
  real beta2;
  real beta3;
  real<lower = 0> sigma;
}
model {
  // priors including all constants
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta1 | 0, 100);
  target += normal_lpdf(beta2 | 0, 100);
  target += normal_lpdf(beta3 | 0, 100);
  target += normal_lpdf(sigma | 0, 1000)
    - normal_lccdf(0 | 0, 1000);
  target += normal_lpdf(p_size | alpha + c_load * beta1 +
                        c_trial * beta2 +
                        c_load .* c_trial * beta3, sigma);
}
generated quantities{
  array[N] real log_lik;
  for (n in 1:N){
    log_lik[n] = normal_lpdf(p_size[n] | alpha + c_load[n] * beta1 +
                             c_trial[n] * beta2 +
                             c_load[n] * c_trial[n] * beta3,
                             sigma);
  }
}
For the null model, just omit the term with beta1 in both the model block and the generated
quantities block. This is the model pupil_null.stan shown below:

data {
  int<lower = 1> N;
  vector[N] c_load;
  vector[N] c_trial;
  vector[N] p_size;
}
parameters {
  real alpha;
  real beta2;
  real beta3;
  real<lower = 0> sigma;
}
model {
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta2 | 0, 100);
  target += normal_lpdf(beta3 | 0, 100);
  target += normal_lpdf(sigma | 0, 1000)
    - normal_lccdf(0 | 0, 1000);
  target += normal_lpdf(p_size | alpha + c_trial * beta2 +
                        c_load .* c_trial * beta3, sigma);
}
generated quantities{
  array[N] real log_lik;
  for (n in 1:N){
    log_lik[n] = normal_lpdf(p_size[n] | alpha + c_trial[n] * beta2 +
                             c_load[n] * c_trial[n] * beta3,
                             sigma);
  }
}

The models can be found in the bcogsci package:

pupil_model_cv <- system.file("stan_models",
                              "pupil_model_cv.stan",
                              package = "bcogsci")
pupil_null <- system.file("stan_models",
                          "pupil_null.stan",
                          package = "bcogsci")

Fit the models:

fit_pupil_int_pos_ll <- stan(
  file = pupil_model_cv,
  iter = 3000,
  data = ls_pupil
)
fit_pupil_int_null_ll <- stan(
  file = pupil_null,
  iter = 3000,
  data = ls_pupil
)

Show the summary of the predictive accuracy of the models using the function loo . Unlike
brms , the package loo is configured to warn the user if k̂ > 0.5 (rather than k̂ > 0.7). In
practice, however, PSIS-LOO-CV has good performance for values of k̂ up to 0.7, and the
pointwise elpd values with these associated k̂ values can still be used (Vehtari, Gelman, and
Gabry 2017b).


(loo_pos <- loo(fit_pupil_int_pos_ll))

##
## Computed from 6000 by 41 log-likelihood matrix
##
##          Estimate   SE
## elpd_loo   -258.7  5.6
## p_loo         3.4  1.0
## looic       517.4 11.2
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.


(loo_null <- loo(fit_pupil_int_null_ll))

##
## Computed from 6000 by 41 log-likelihood matrix
##
##          Estimate   SE
## elpd_loo   -258.2  5.6
## p_loo         3.2  1.0
## looic       516.3 11.2
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.


loo_compare(loo_pos, loo_null)
## elpd_diff se_diff
## model2 0.0 0.0
## model1 -0.5 0.4

As happened with the cloze probability effect earlier in this chapter, we cannot decide which
model has better predictive accuracy according to PSIS-LOO.

16.6.2 K-fold-CV in Stan

If we want to use K-fold-CV (or LOGO-CV) in Stan (as opposed to PSIS-LOO), we need to be
careful to store the log-likelihood of the held-out data, since we evaluate our model with only
this subset of the data. The following example closely follows the vignette
https://cran.r-project.org/web/packages/loo/vignettes/loo2-elpd.html.

The steps taken are as follows:

1. Split the data into 10 folds.

Since there is only one subject, we don't need to stratify (using kfold_split_stratified() );
we use kfold_split_random() from the loo package.

df_pupil$fold <- kfold_split_random(K = 10, N = nrow(df_pupil))
# Show number of obs for each fold:
df_pupil %>%
  group_by(fold) %>%
  count() %>%
  print(n = 10)
## # A tibble: 10 × 2
## # Groups:   fold [10]
##     fold     n
##    <int> <int>
##  1     1     4
##  2     2     4
##  3     3     4
##  4     4     4
##  5     5     4
##  6     6     4
##  7     7     4
##  8     8     4
##  9     9     4
## 10    10     5

2. Fit and extract the log pointwise predictive densities for each fold.

Compile the alternative and the null models first with stan_model() , and prepare two matrices
to store the predictive densities for the held-out data. Each matrix has as many rows as
post-warmup iterations we'll produce (3000/2 = 1500 post-warmup iterations per chain, times
4 chains, i.e., 6000) and as many columns as observations in the data set.

pupil_stanmodel <- stan_model(pupil_model_cv)
pupil_null_stanmodel <- stan_model(pupil_null)
log_pd_kfold <- matrix(nrow = 6000, ncol = nrow(df_pupil))
log_pd_null_kfold <- matrix(nrow = 6000, ncol = nrow(df_pupil))

Next, loop over the 10 folds. Each iteration carries out the following steps. First, fit each model
(i.e., the alternative and null model) to all the observations except the ones belonging to the
held-out fold using sampling() ; this uses the already-compiled models. Second, compute the
log pointwise predictive densities for the held-out fold with gqs() . This function produces
generated quantities based on samples from a posterior (passed in the draws argument),
and it ignores all the blocks except generated quantities .47 Finally, store the predictive
density for the observations of the held-out fold in a matrix by extracting the log-likelihood of
the held-out data. The output of this loop is a matrix of the log pointwise predictive densities of
all the observations.
# Loop over the folds
for (k in 1:10) {
  # Training set for fold k
  df_pupil_train <- df_pupil %>%
    filter(fold != k)
  ls_pupil_train <- list(
    c_load = df_pupil_train$c_load,
    c_trial = df_pupil_train$c_trial,
    p_size = df_pupil_train$p_size,
    N = nrow(df_pupil_train)
  )
  # Held-out set for fold k
  df_pupil_ho <- df_pupil %>%
    filter(fold == k)
  ls_pupil_ho <- list(
    c_load = df_pupil_ho$c_load,
    c_trial = df_pupil_ho$c_trial,
    p_size = df_pupil_ho$p_size,
    N = nrow(df_pupil_ho)
  )
  # Train the models
  fit_train <- sampling(pupil_stanmodel,
                        iter = 3000,
                        data = ls_pupil_train)
  fit_null_train <- sampling(pupil_null_stanmodel,
                             iter = 3000,
                             data = ls_pupil_train)
  # Generated quantities based on the posterior from the training set
  # and the data from the held-out set
  gq_ho <- gqs(pupil_stanmodel,
               draws = as.matrix(fit_train),
               data = ls_pupil_ho)
  gq_null_ho <- gqs(pupil_null_stanmodel,
                    draws = as.matrix(fit_null_train),
                    data = ls_pupil_ho)
  # Extract the log-likelihood, which represents
  # the pointwise predictive density
  log_pd_kfold[, df_pupil$fold == k] <-
    extract_log_lik(gq_ho)
  log_pd_null_kfold[, df_pupil$fold == k] <-
    extract_log_lik(gq_null_ho)
}

3. Compute the K-fold elpd.

Now we evaluate the predictive performance of the two models on the 10 folds using elpd() .


(elpd_pupil_kfold <- elpd(log_pd_kfold))

##
## Computed from 6000 by 41 log-likelihood matrix using the generic elpd function
##
##      Estimate   SE
## elpd   -259.7  6.0
## ic      519.5 12.0


(elpd_pupil_null_kfold <- elpd(log_pd_null_kfold))

##
## Computed from 6000 by 41 log-likelihood matrix using the generic elpd function
##
##      Estimate   SE
## elpd   -259.6  6.0
## ic      519.1 12.1

4. Compare the elpd estimates.


loo_compare(elpd_pupil_kfold, elpd_pupil_null_kfold)
## elpd_diff se_diff
## model2 0.0 0.0
## model1 -0.2 0.5

As with PSIS-LOO, we cannot decide which model has better predictive accuracy according to
K-fold-CV.

16.7 Summary

In this chapter, we learned how to use K-fold cross-validation and leave-one-out
cross-validation, using both the built-in functionality in brms and Stan in conjunction with the
loo package. We saw an example of model comparison where cross-validation helped
distinguish between the two models (log-normal vs. normal likelihood), and another example
where no important differences were found between the models being compared (the N400
data with cloze probability as predictor). In general, cross-validation will be helpful when
comparing rather different models (for an example from psycholinguistics, see Nicenboim and
Vasishth 2018); when the models are highly similar, it will be difficult to distinguish between
them. In particular, for typical psychology and linguistics data sets, it will be difficult to get
conclusive results from model comparisons using cross-validation that aim to find evidence for
the presence of a population-level (or fixed) effect, if the effect is very small and/or the data
are relatively sparse (this is often the case, especially in psycholinguistic data). In such cases,
if the aim is to find evidence for a theoretical claim, other model comparison methods like
Bayes factors might be more meaningful.

16.8 Further reading

A technical discussion about cross-validation methods can be found in chapter 7 of Gelman et
al. (2014). For a discussion of the advantages and disadvantages of (leave-one-out)
cross-validation, see Gronau and Wagenmakers (2018), Vehtari, Simpson, et al. (2019), and
Gronau and Wagenmakers (2019). A LOO glossary from the loo package can be found at
https://mc-stan.org/loo/reference/loo-glossary.html. Cross-validation is still an active area of
research, and there are multiple websites and blog posts on this topic: Aki Vehtari, the creator
of the loo package, has a comprehensive FAQ about cross-validation at
https://avehtari.github.io/modelselection/CV-FAQ.html; Andrew Gelman's blog also discusses
the situations where cross-validation can be applied:
https://statmodeling.stat.columbia.edu/2018/08/03/loo-cross-validation-approaches-valid/.

16.9 Exercises

Exercise 16.1 Predictive accuracy of the linear and the logarithm effect of cloze probability.

Is there a difference in predictive accuracy between the model that incorporates a linear effect
of cloze probability and one that incorporates log-transformed cloze probabilities?

Exercise 16.2 Log-normal model

Use PSIS-LOO to compare a model of the Stroop data like the one in section 11.1 with a
model that assumes no population-level effect:

a. in brms .
b. in Stan.

Exercise 16.3 Log-normal vs rec-normal model in Stan

In section 12.1, we proposed a reciprocal truncated normal distribution (rec-normal) for
response time data, as an alternative to the log-normal distribution. The log-likelihood (as a
function of μ and σ) of an individual observation, RT_n, under the rec-normal distribution would
be the following:

log L_n = log(Normal(1/RT_n | μ, σ)) − 2 ⋅ log(RT_n)

As explained in section 12.1, we obtain the log-likelihood based on all N observations by
summing the log-likelihoods of the individual observations:

log L = ∑_{n=1}^{N} log(Normal(1/RT_n | μ, σ)) − ∑_{n=1}^{N} 2 ⋅ log(RT_n)
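As a small illustration (a minimal sketch with made-up values of μ and σ on the reciprocal
scale; the truncation constant is omitted here, just as in the equations above), this
log-likelihood can be computed in R as follows:

rec_normal_loglik <- function(rt, mu, sigma) {
  # log Normal(1/RT | mu, sigma) - 2 * log(RT), summed over observations:
  sum(dnorm(1 / rt, mean = mu, sd = sigma, log = TRUE) - 2 * log(rt))
}
rec_normal_loglik(rt = c(350, 480, 620), mu = 0.002, sigma = 0.0005)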

Since both of these models assume right-skewed data with only positive values, the question
that we are interested in here is whether we can really distinguish between them. Investigate
this in the following way:

a. Generate data (N = 100 and N = 1000) with a rec-normal distribution (e.g., rt = 1 /
   rtnorm(N, mu, sigma, a = 0) ).
b. Generate data (N = 100 and N = 1000) with a log-normal distribution.

Fit a rec-normal and a log-normal model using Stan to each of the four data sets, and use
PSIS-LOO to compare the models.

What do you conclude?

References

Bernardo, José M, and Adrian FM Smith. 2009. Bayesian Theory. Vol. 405. John Wiley &
Sons.

Geisser, Seymour, and William F Eddy. 1979. “A Predictive Approach to Model Selection.”
Journal of the American Statistical Association 74 (365). Taylor & Francis Group: 153–60.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B.
Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC
Press.

Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and
Estimation.” Journal of the American Statistical Association 102 (477). Taylor & Francis: 359–
78. https://doi.org/10.1198/016214506000001437.

Good, I. J. 1952. “Rational Decisions.” Journal of the Royal Statistical Society. Series B
(Methodological) 14 (1). [Royal Statistical Society, Wiley]: 107–14.
http://www.jstor.org/stable/2984087.

Gronau, Quentin F., and Eric-Jan Wagenmakers. 2018. “Limitations of Bayesian Leave-One-
Out Cross-Validation for Model Selection.” Computational Brain & Behavior.
https://doi.org/10.1007/s42113-018-0011-7.


Gronau, Quentin F., and Eric-Jan Wagenmakers. 2019. “Rejoinder: More Limitations of
Bayesian Leave-One-Out Cross-Validation.” Computational Brain & Behavior 2 (1). Springer:
35–47.

Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012.
The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.

Navarro, Danielle J. 2019. “Between the Devil and the Deep Blue Sea: Tensions Between
Scientific Judgement and Statistical Model Selection.” Computational Brain & Behavior 2 (1):
28–34. https://doi.org/10.1007/s42113-018-0019-z.
Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence
Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” Journal
of Memory and Language 99: 1–34. https://doi.org/10.1016/j.jml.2017.08.004.

Paananen, Topi, Juho Piironen, Paul-Christian Bürkner, and Aki Vehtari. 2021. “Implicitly
Adaptive Importance Sampling.” Statistics and Computing 31 (2). Springer Science; Business
Media LLC. https://doi.org/10.1007/s11222-020-09982-2.

Piironen, Juho, Markus Paasiniemi, and Aki Vehtari. 2020. “Projective inference in high-
dimensional problems: Prediction and feature selection.” Electronic Journal of Statistics 14 (1).
Institute of Mathematical Statistics; Bernoulli Society: 2155–97. https://doi.org/10.1214/20-
EJS1711.

Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model
Selection.” Statistics and Computing 27 (3): 711–35. https://doi.org/10.1007/s11222-016-9649-
y.

Sivula, Tuomas, Måns Magnusson, and Aki Vehtari. 2020. “Uncertainty in Bayesian Leave-
One-Out Cross-Validation Based Model Comparison.”

Vehtari, Aki. 2022. “Cross-validation FAQ.”


https://web.archive.org/web/20221219223947/https://avehtari.github.io/modelselection/CV-
FAQ.html.

Vehtari, Aki, and Andrew Gelman. 2015. “Pareto Smoothed Importance Sampling.” arXiv
Preprint arXiv:1507.02646.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017b. “Practical Bayesian Model Evaluation
Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.
https://doi.org/10.1007/s11222-016-9696-4.

Vehtari, Aki, and Janne Ojanen. 2012. “A Survey of Bayesian Predictive Methods for Model
Assessment, Selection and Comparison.” Statistical Surveys 6 (0). Institute of Mathematical
Statistics: 142–228. https://doi.org/10.1214/12-ss102.

Vehtari, Aki, Daniel P. Simpson, Yuling Yao, and Andrew Gelman. 2019. “Limitations of
‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection’.” Computational
Brain & Behavior 2 (1): 22–27. https://doi.org/10.1007/s42113-018-0020-6.

Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S-PLUS. New
York: Springer.
44. Maximizing the elpd in (16.4) is also equivalent to minimizing the Kullback–Leibler (KL)
divergence from the true data-generating distribution p_t(y_pred) to the posterior predictive
distribution of the candidate model M1.↩

45. The double use of the data is also a problem when one relies on information criteria like
the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).↩

46. If the true model is under consideration among the models being compared, we are in an
M-closed scenario. However, this is rarely realistic. The most common case is an M-open
scenario (Bernardo and Smith 2009), where the true model is not included in the set of
models being compared.↩

47. The reader using cmdstanr rather than rstan might find a cryptic error here. This is
because cmdstanr expects the parameters not to change. A workaround can be found at
https://discourse.mc-stan.org/t/generated-quantities-returns-error-mismatch-between-model-and-fitted-parameters-csv-file/17869/15↩
Chapter 17 Introduction to computational cognitive modeling
Until this point in the book, we have been discussing models that specify a generative process
for the observed data. This generative process could be as simple as Y ∼ Normal(μ, σ), or
it could be an elaborate hierarchical model that incorporates multiple variance components.
Usually, in these kinds of models, what is of interest is a parameter that represents a so-called
"effect" of interest. Examples that we encountered in the present book are: the effect of word
frequency on reading time; the effect of relative clause type on reading time; and the effect of
attentional load on pupil size.

One characteristic common to the models seen so far is that no underlying latent cognitive
process is specified that elaborates on the generative process that produces the observed
dependent variable. For example, in a logistic regression, the correct or incorrect response
(say, to a yes/no comprehension question) could be the result of a cascade of alternative
steps taken unconsciously (or consciously) by the subject as they generate a response. To
make this concrete, a subject could give a yes/no response to a comprehension question after
seeing a target sentence by probabilistically processing the sentence deeply or superficially;
once the deep/shallow path is taken, the subject might end up giving either a correct or
incorrect answer (the latter by misinterpreting the meaning of the sentence). What is observed
in the data is a correct or incorrect response, but the reason that that particular response was
given could be underlyingly due to deep or superficial processing.

In this book, we use the phrase “computational cognitive modeling” to refer to generative
models that specify latent (unobserved and, usually, unobservable) processes that result in a
behavioral or other kind of response. Cognitive modeling as presented in this section goes
beyond estimates of “effects” in the sense discussed above; the principal goal is to explain
and understand how a particular cognitive process unfolds.

Unpacking the latent cognitive process that produces a response has a long history in
cognitive science. For example, in sentence processing research, early models like the classic
garden-path model (Frazier 1979) seek to spell out the steps that occur when the human
sentence processing system (the parser) attempts to build syntactic structure incrementally
when faced with a temporarily ambiguous sentence. To make this concrete, consider the
sentence: “While Mary bathed the baby happily played in the living room.” Compared to the
unambiguous baseline sentence “While Mary bathed, the baby happily played in the living
room,” the garden-path model assumes that in the ambiguous sentence the parser initially
connects the noun phrase "the baby" as the grammatical object of the verb "bathed." It is only
when the verb "played" is encountered that the parser reassigns "the baby" as the grammatical
subject of the verb "played," leading to the correct parse, in which Mary is the one doing the
bathing (of herself), rather than bathing the baby. This process of reassigning the grammatical
role of "the baby" is computationally costly and is called reanalysis in sentence processing.
The slowdown observed (e.g., in reading studies) at the verb "played" is often called the
garden-path effect.

This kind of paper-pencil model implicitly posits an overly simplistic and deterministic parsing
process: Although this is never spelled out in the garden-path model, there is no assumption
that the parser could misparse the sentence only probabilistically when it encounters "the
baby"; misparses are implicitly assumed to happen every single time such a temporarily
ambiguous sentence is encountered. Such a model does not explicitly allow alternative
parsing constraints to come into play; by contrast, a computational model allows the empirical
consequences of multiple parsing constraints to be considered quantitatively (e.g., Jurafsky
1996; Paape and Vasishth 2022).

Although this kind of simple paper-pencil model is an excellent start towards modeling the
latent process of sentence comprehension, just stopping with such a description has several
disadvantages. First, no quantitative predictions can be derived; a corollary is that a slowdown
at the verb “played” (due to reanalysis) of 10 ms or 500 ms would both be equally consistent
with the predictions of the model—the model cannot say anything about how much time the
reanalysis process would take. This is a problem for model evaluation, not least because
overly large effect sizes observed in data could just be Type M error and therefore very
misleading (Gelman and Carlin 2014; Vasishth, Mertzen, Jäger, et al. 2018a). Second, such
paper-pencil models encourage an excessive (in fact, exclusive) focus on the average effect
size (here, the garden-path effect); the variability among individuals (which would affect the
standard error of the estimated effect from data) plays no role in determining whether the
model’s prediction is consistent with the data. Another problem with such verbally stated
models is that it is often not clear what the exact assumptions are. This makes it difficult to
establish whether the model’s predictions are consistent with observed patterns in the data.

The absence of quantitative predictions, and the inability to quantitatively investigate
individual-level variation, are two major drawbacks of paper-pencil models. As Roberts and
Pashler (2000) have discussed at length, a good fit of a model to data is not merely about the
sign of the predicted effect being correct; the model must be able to commit a priori to the
uncertainty of the predicted effect, and the estimated effect from the data and its uncertainty
need to be compared to the predictions of the model. A good fit requires a tightly constrained
quantitative prediction derived from the model that is then validated by comparing the
prediction with the data; this point has also been eloquently made by the psychologist Meehl
(1997). Meehl suggests that the model should make risky numerical predictions (by which he
means tightly constrained quantitative predictions), which should then be compared with the
observed effect size and its confidence interval; this is essentially the Roberts and Pashler
(2000) criterion for a good fit.

For example, if a computational implementation of the garden-path model were to exist, one
could have derived (through prior specifications on the parameters of the model) prior
predictive distributions of the garden-path effect, and compared these predictions to the
estimates of the effect from a statistical model fit to the data (for an example from
psycholinguistics, see Vasishth and Engelmann 2022). Using parametric variation, one could
even investigate the implications of the model for individual-level differences (e.g., Yadav,
Paape, et al. 2022); such implications of models are impossible to derive unless the model is
implemented computationally. In the absence of an implemented model, one is reduced to
classifying subjects into groups (e.g., by working memory capacity measures) and
investigating average group-level effects (e.g., Caplan and Waters 1999). This makes the
question a binary one: are there individual differences or are there none? The right question
about individual differences is a quantitative one (Haaf and Rouder 2019).

There are many different classes of computational cognitive models. For example, Newell
(1990) pioneered a cognitive architectures approach, where a model of a particular cognitive
process (like sentence processing) occurs within a broader computational framework that
defines very general constraints on human information processing. Examples of cognitive
architectures are SOAR (Laird 2019), the CAPS family of models (Just, Carpenter, and Varma
1999), and ACT-R (Anderson et al. 2004). Other approaches include connectionist models
(e.g., McClelland and Rumelhart 1989) and dynamical systems-based models (e.g., Port and
Van Gelder 1995; Tabor and Tanenhaus 1999; Beer 2000; Rabe et al. 2021).

In this book, we focus on Bayesian cognitive models (Lee and Wagenmakers 2014); these are
distinct from models that assume that human cognitive processes involve Bayesian inference
(e.g., Feldman 2017). The type of model we discuss here has the characteristic that the
underlying generative process spells out the latent, probabilistically occurring sub-processes.
The latent processes are spelled out by specifying a Bayesian model that allows different
events to happen probabilistically in each trial. An example is multinomial processing tree
models, which specify a sequence of possible latent sub-processes. Another example is a
hierarchical finite mixture process which specifies that, in some proportion of trials, the
observed response comes from one distribution, and in another proportion from a different
distribution. A third example is the assumption that the observed response (e.g., reading time)
is the result of an unobserved (latent) race process in the cognitive system. Probabilistic
programming languages like Stan allow us to implement such latent process models, allowing
for hierarchical structure (individual-level variability).

This part of the book introduces these three types of cognitive models using Stan. In many
cases, a great deal of cognitive detail is sacrificed for tractability, but this is a characteristic
shared by all computational models—by definition, a model is a simplification of the underlying
process being modeled (James L. McClelland 2009b).

The broader lesson to learn from this section is that it is possible to specify an underlying
generative process for the data that reflects theoretical assumptions in a particular research
area. The gain is that: (i) the assumptions of the underlying theory, and their consequences,
become transparent (Epstein 2008); (ii) one can derive quantitative predictions that, as
Roberts and Pashler (2000) point out, are vital for model evaluation; (iii) it becomes possible
(at least in principle) to eliminate competing theoretical proposals through quantitative model
comparison using benchmark data (Nicenboim and Vasishth 2018; Lissón et al. 2021; Lissón
et al. 2022; Yadav, Smith, et al. 2022); and (iv) the implications of models for individual-level
differences can be investigated (Yadav, Paape, et al. 2022).

17.1 Further reading

General textbooks on computational modeling for cognitive science are Busemeyer and
Diederich (2010), and Farrell and Lewandowsky (2018). The textbook by Lee and
Wagenmakers (2014) focuses on relatively simple computational cognitive models
implemented in a Bayesian framework (using the BUGS language). A good free textbook on
computational modeling for cognitive science is Blokpoel and Rooij (2021).
The entire special issue (Lee 2011a) on hierarchical Bayesian modeling in the Journal of
Mathematical Psychology is highly recommended (in particular, see the article by Lee 2011b).
Wilson and Collins (2019) discuss good practices in the computational modeling of behavioral
data using examples from reinforcement learning. Haines et al. (2020) discuss how generative
models produce higher test-retest reliability and more theoretically informative parameter
estimates than do traditional methods. For an overview of the different modeling approaches
in cognitive science and the relationships between them, see James L. McClelland (2009b).
Luce (1991) is a classic book that focuses on modeling response times.
References

Anderson, John R., Dan Bothell, Michael D. Byrne, Scott Douglass, Christian Lebiere, and
Yulin Qin. 2004. “An Integrated Theory of the Mind.” Psychological Review 111 (4): 1036–60.

Beer, Randall D. 2000. “Dynamical Approaches to Cognitive Science.” Trends in Cognitive


Sciences 4 (3). Elsevier: 91–99.

Blokpoel, Mark, and Iris van Rooij. 2021. Theoretical Modeling for Cognitive Science and
Psychology.

Busemeyer, Jerome R, and Adele Diederich. 2010. Cognitive Modeling. Sage.

Caplan, D., and G. S. Waters. 1999. “Verbal Working Memory and Sentence Comprehension.”
Behavioral and Brain Science 22: 77–94.

Epstein, Joshua M. 2008. “Why Model?” Journal of Artificial Societies and Social Simulation 11
(4): 12.

Farrell, Simon, and Stephan Lewandowsky. 2018. Computational Modeling of Cognition and
Behavior. Cambridge University Press.

Feldman, Jacob. 2017. “What Are the ‘True’ Statistics of the Environment?” Cognitive Science
41 (7). Wiley Online Library: 1871–1903.

Frazier, Lyn. 1979. “On Comprehending Sentences: Syntactic Parsing Strategies.” PhD thesis,
Amherst: University of Massachusetts.

Gelman, Andrew, and John B. Carlin. 2014. “Beyond Power Calculations: Assessing Type S
(Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6). SAGE
Publications: 641–51.

Haaf, Julia M., and Jeffrey N. Rouder. 2019. “Some Do and Some Don’t? Accounting for
Variability of Individual Difference Structures.” Psychonomic Bulletin & Review 26 (3).
Springer: 772–89.

Haines, Nathaniel, Peter D Kvam, Louis H Irving, Colin Smith, Theodore P Beauchaine, Mark
A Pitt, Woo-Young Ahn, and Brandon Turner. 2020. “Learning from the Reliability Paradox:
How Theoretically Informed Generative Models Can Advance the Social, Behavioral, and
Brain Sciences.” Unpublished. PsyArXiv.

Jurafsky, Daniel. 1996. “A Probabilistic Model of Lexical and Syntactic Access and
Disambiguation.” Cognition 20: 137–94.
Just, M.A., P.A. Carpenter, and S. Varma. 1999. “Computational Modeling of High-Level
Cognition and Brain Function.” Human Brain Mapping 8: 128–36.

Laird, John E. 2019. The Soar Cognitive Architecture. MIT press.

Lee, Michael D., ed. 2011a. “Special Issue on Hierarchical Bayesian Models.” Journal of
Mathematical Psychology 55 (1). https://www.sciencedirect.com/journal/journal-of-
mathematical-psychology/vol/55/issue/1.


Lee, Michael D. 2011b. “How Cognitive Modeling Can Benefit from Hierarchical Bayesian
Models.” Journal of Mathematical Psychology 55 (1). Elsevier BV: 1–7.
https://doi.org/10.1016/j.jmp.2010.08.013.

Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical
Course. Cambridge University Press.

Lissón, Paula, Dario Paape, Dorothea Pregla, Frank Burchert, Nicole Stadie, and Shravan
Vasishth. 2022. “Similarity-Based Interference in Sentence Comprehension in Aphasia: A
Computational Evaluation of Two Models of Cue-Based Retrieval.”

Lissón, Paula, Dorothea Pregla, Bruno Nicenboim, Dario Paape, Mick van het Nederend,
Frank Burchert, Nicole Stadie, David Caplan, and Shravan Vasishth. 2021. “A Computational
Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia.”
Cognitive Science 45 (4): e12956. https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.12956.

Luce, R Duncan. 1991. Response Times: Their Role in Inferring Elementary Mental
Organization. Oxford University Press.

McClelland, James L. 2009b. “The Place of Modeling in Cognitive Science.” Topics in


Cognitive Science 1 (1). Wiley Online Library: 11–38.

McClelland, James L, and David E Rumelhart. 1989. Explorations in Parallel Distributed


Processing: A Handbook of Models, Programs, and Exercises. MIT Press.

Meehl, Paul E. 1997. “The Problem Is Epistemology, Not Statistics: Replace Significance Tests
by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions.” In What If
There Were No Significance Tests?, edited by L.L. Harlow, S.A. Mulaik, and J. H. Steiger.
Mahwah, New Jersey: Erlbaum.

Newell, Allen. 1990. Unified Theories of Cognition. Cambridge: Harvard University Press.
Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence
Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” Journal
of Memory and Language 99: 1–34. https://doi.org/10.1016/j.jml.2017.08.004.

Paape, Dario, and Shravan Vasishth. 2022. “Estimating the True Cost of Garden-Pathing: A
Computational Model of Latent Cognitive Processes.” Cognitive Science 46 (8): e13186.

Port, Robert F, and Timothy Van Gelder. 1995. Mind as Motion: Explorations in the Dynamics
of Cognition. MIT Press.

Rabe, Maximilian M., Johan Chandra, André Krügel, Stefan A. Seelig, Shravan Vasishth, and
Ralf Engbert. 2021. “A Bayesian Approach to Dynamical Modeling of Eye-Movement Control in
Reading of Normal, Mirrored, and Scrambled Texts.” Psychological Review.
https://doi.org/10.1037/rev0000268.

Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on
Theory Testing.” Psychological Review 107 (2): 358–67.

Tabor, Whitney, and Michael K Tanenhaus. 1999. “Dynamical Models of Sentence


Processing.” Cognitive Science 23 (4). Wiley Online Library: 491–515.

Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive


Process: A Computational Approach. Cambridge, UK: Cambridge University Press.
https://books.google.de/books?id=6KZKzgEACAAJ.

Vasishth, Shravan, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. 2018a. “The
Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of
Memory and Language 103: 151–75. https://doi.org/10.1016/j.jml.2018.07.004.

Wilson, Robert C, and Anne GE Collins. 2019. “Ten Simple Rules for the Computational
Modeling of Behavioral Data.” Edited by Timothy E Behrens. eLife 8 (November). eLife
Sciences Publications, Ltd: e49547. https://doi.org/10.7554/eLife.49547.

Yadav, Himanshu, Dario Paape, Garrett Smith, Brian W. Dillon, and Shravan Vasishth. 2022.
“Individual Differences in Cue Weighting in Sentence Comprehension: An Evaluation Using
Approximate Bayesian Computation.” Open Mind.
https://doi.org/10.1162/opmi_a_00052.

Yadav, Himanshu, Garrett Smith, Sebastian Reich, and Shravan Vasishth. 2022. “Number
Feature Distortion Modulates Cue-Based Retrieval in Reading.” Journal of Memory and
Language.
Chapter 18 Multinomial processing trees

In this chapter, we introduce a widely-used cognitive model that can be implemented in Stan,
the multinomial processing tree. This model is useful in situations where the behavioral
response from the subject is one of several possible categorical outcomes. As an example, we
will look into a word production task, where we ask individuals with aphasia (a language
impairment that is usually due to a cerebrovascular accident or head trauma, Damasio 1992),
to name the object shown in a picture, e.g., a picture of a cat. The participant (hereafter,
subject) could produce the correct name (“cat”), a semantically and phonologically related but
incorrect name (“rat”), a semantically unrelated but phonologically related word (“hat”), or a
non-word (“cag”). The researcher may have a theory about how each possible outcome ends
up being probabilistically produced. Such a theoretical process model can be expressed as a
multinomial processing tree. Before we dive into multinomial processing trees, we discuss the
distributions that generalize the binomial and Bernoulli distribution for modeling more than two
possible outcomes.

18.1 Modeling multiple categorical responses

One way to model categorical responses is using multinomial or categorical distributions. The
categorical responses could be “yes” or “no”; “blue”, “red” or “yellow”; “true”, “false”, or “I don’t
know”; or more complicated categories. Crucially, each observed response can be coded as
belonging to only one category. The multinomial and the categorical distribution represent two
ways of characterizing the underlying generative process for such data.

The multinomial distribution is the generalization of the binomial distribution for more than two
possible outcomes. Recall how the binomial works: in order to randomly generate the number
of successes in an observation consisting of 10 trials, with probability of success 0.5, one
can type:

rbinom(1, size = 10, prob = 0.5)

## [1] 4

It is possible to repeatedly generate multiple observations as follows. Suppose five simulated
observations are needed, each with 10 trials:

rbinom(5, size = 10, prob = 0.5)

## [1] 6 5 7 7 2

Now, suppose that there are N = 3 possible answers to a question (yes, no, don’t know), and
suppose that the probabilities of producing each answer are:

P(yes) = 0.1
P(no) = 0.1
P(don’t know) = 0.8

The probabilities must sum to 1, because those are the only three possible outcomes. Given
such a situation, it is possible to simulate a single experiment with 10 trials, where each of the
three possibilities appears a certain number of times. We do this with rmnom from the
extraDistr package (one could have used rmultinom() in R equally well, but the output
would look a bit different):

(random_sample <- rmnom(1, size = 10, prob = c(0.1, 0.1, 0.8)))

##      [,1] [,2] [,3]
## [1,]    0    2    8

The above call returns the result of the random sample: 0 cases of the first answer type, 2
cases of the second, and 8 cases of the third.

Analogously to the binomial function shown above, five observations can be simulated, each
having 10 trials:

rmnom(5, size = 10, prob = c(0.1, 0.1, 0.8))

##      [,1] [,2] [,3]
## [1,]    1    2    7
## [2,]    3    0    7
## [3,]    1    2    7
## [4,]    1    1    8
## [5,]    3    1    6

The categorical distribution is the generalization of the Bernoulli distribution for more than two
possible outcomes, and it is the special case of the multinomial distribution when we have only
one trial. Recall that the Bernoulli distribution can be used as follows. If we carry out a coin
toss (each coin toss counts as a single trial), we will either get a heads or a tails:

rbern(5, prob = 0.5)

## [1] 1 1 0 1 0

## equivalent rbinom command:
rbinom(5, size = 1, prob = 0.5)

## [1] 0 0 1 1 1

Thus, what the Bernoulli is to the Binomial, the Categorical is to the Multinomial. For example,
one can simulate five observations, each of which will give one of the three responses with the
given probabilities. We do this with rcat from the extraDistr package.

rcat(5, prob = c(0.1, 0.1, 0.8), labels = c("yes", "no", "dontknow"))

## [1] dontknow yes      dontknow dontknow dontknow
## Levels: yes no dontknow

The above is analogous to using the multinomial with size = 1 (a single trial in each
experiment). In the output below, the rmnom function shows which of the three categories is
produced.

rmnom(5, size = 1, prob = c(0.1, 0.1, 0.8))

##      [,1] [,2] [,3]
## [1,]    0    0    1
## [2,]    0    1    0
## [3,]    1    0    0
## [4,]    0    0    1
## [5,]    0    0    1

With these distributions as background, consider now a simulated situation where multiple
responses are possible.

18.1.1 A model for multiple responses using the multinomial likelihood

Impaired picture naming (anomia) is common in most cases of aphasia. It is assessed as part
of most comprehensive aphasia test batteries, since picture naming accuracy is relatively easy
to obtain and is a reliable test score; in addition, the types of errors that are committed can
provide useful information for diagnosis.

In this simulated experiment, the responses are categorized as shown in Table 18.1.
TABLE 18.1: Categorization of responses for the simulated experiment.

Category    Description                                                         Example
Correct     The response matches the target.                                    cat
Neologism   The response is not a word, but it has a phonological relation     cag
            to the target.
Formal      The response is a word with only a phonological relation to        hat
            the target.
Mixed       The response is a word with both a semantic and a phonological    rat
            relation to the target.
NR          All other responses, including omissions, descriptions,            –
            non-nouns, etc.

First, generate data assuming a multinomial distribution. The outcomes will be determined by
a vector θ (called true_theta below in the R code) that indicates the probability of each
outcome:

(true_theta <- tibble(theta_NR = .2,
                      theta_Neologism = .1,
                      theta_Formal = .2,
                      theta_Mixed = .08,
                      theta_Correct =
                        1 - (theta_NR + theta_Neologism +
                             theta_Formal + theta_Mixed)))

## # A tibble: 1 × 5
##   theta_NR theta_Neologism theta_Formal theta_Mixed theta_Correct
##      <dbl>           <dbl>        <dbl>       <dbl>         <dbl>
## 1      0.2             0.1          0.2        0.08          0.42

## The probabilities must sum to 1:
sum(true_theta)

## [1] 1

Given this vector of probabilities θ, generate values assuming a multinomial distribution of
responses in 100 trials:

N_trials <- 100
(ans_mn <- rmultinom(1, N_trials, true_theta))

##                 [,1]
## theta_NR          23
## theta_Neologism    7
## theta_Formal      18
## theta_Mixed        6
## theta_Correct     46

Now, we’ll try to recover the probability of each answer with a model with the following
likelihood:

ans ∼ Multinomial(θ)

where θ = {θ_nr, θ_neol, θ_formal, θ_mix, θ_corr}.

A common prior for a multinomial likelihood is the Dirichlet distribution, which extends the Beta
distribution to cases where more than two categories are available.

θ ∼ Dirichlet(α)

The Dirichlet distribution has a parameter α, called the concentration parameter, which is a
vector of the same length as θ. If we set α = {2, 2, 2, 2, 2}, this is analogous to a Beta(2, 2)
prior; that is, the intuition behind this concentration parameter is that the prior probability
distribution of the vector θ corresponds to having seen two outcomes of each category in the
past.
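To build some intuition about this prior, a minimal sketch: we can draw samples from it with
rdirichlet() from extraDistr (the package used above for rmnom() and rcat() ); each
row below is one draw of the vector θ, and each row sums to one.

# Three draws from a Dirichlet(2, 2, 2, 2, 2) prior:
rdirichlet(3, alpha = rep(2, 5))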

A Stan model assuming a multinomial likelihood and a Dirichlet prior is shown below. Since the
elements of θ should sum to one, we declare this vector, theta , as a simplex . The simplex
type ensures that its elements sum to one and also constrains them to be non-negative. In
order to generate the vector α that contains the value two five times, we use
rep_vector(2, 5) (which is similar to rep(2, 5) in R).
data {
  int<lower = 1> N_trials;
  array[5] int<lower = 0, upper = N_trials> ans;
}
parameters {
  simplex[5] theta;
}
model {
  target += dirichlet_lpdf(theta | rep_vector(2, 5));
  target += multinomial_lpmf(ans | theta);
}
generated quantities{
  array[5] int pred_ans = multinomial_rng(theta, N_trials);
}

Fit the model:

# Create a list:
# c(ans_mn) makes a vector out of the matrix ans_mn
data_mn <- list(N_trials = N_trials,
                ans = c(ans_mn))

str(data_mn)

## List of 2
##  $ N_trials: num 100
##  $ ans     : int [1:5] 23 7 18 6 46

multinom <- system.file("stan_models",
                        "multinom.stan",
                        package = "bcogsci")
fit_mn <- stan(multinom, data = data_mn)

Print the posteriors:

print(fit_mn, pars = c("theta"))

##          mean 2.5% 97.5% n_eff Rhat
## theta[1] 0.23 0.16  0.31  3909    1
## theta[2] 0.08 0.04  0.14  4612    1
## theta[3] 0.18 0.11  0.26  4198    1
## theta[4] 0.07 0.03  0.13  4639    1
## theta[5] 0.44 0.35  0.53  4818    1

Next, use mcmc_recover_hist in the code below to confirm that the posterior distributions of
the elements of θ are close to the true values that were set up when simulating the data. See
Figure 18.1.

as.data.frame(fit_mn) %>%
  select(starts_with("theta")) %>%
  mcmc_recover_hist(true = unlist(true_theta)) +
  coord_cartesian(xlim = c(0, 1))



FIGURE 18.1: Posterior distributions and true means of theta for the multinomial model
defined in multinom.stan .
We evaluate here whether our model is able to “recover” the true values of its parameters. By
“recover,” we mean that the true values are somewhere inside the posterior distribution of the
model.

The frequentist properties of Bayesian models guarantee that if we simulate data several
times, 95% of the true values should be inside the 95% credible intervals generated by a
well-calibrated model. Furthermore, if the true values of some parameters are consistently
well above or below their posterior distributions, it may mean that there is some problem with
the model specification. We follow Cook, Gelman, and Rubin (2006) here, and for now we are
only going to verify that our model is roughly correct. A more principled (and computationally
demanding) approach uses simulation-based calibration (SBC), introduced in section 12.2 of
chapter 12 (also see Talts et al. 2018; Schad, Betancourt, and Vasishth 2020).
18.1.2 A model for multiple responses using the categorical
distribution

Using the same information as above, we can model each response one at a time, instead of aggregating them. Using the categorical distribution gives us more flexibility to define what happens in every trial. However, we are not using this additional flexibility for now, and hence the next model and the previous one are equivalent.

data {
  int<lower = 1> N_obs;
  array[N_obs] int<lower = 1, upper = 5> w_ans;
}
parameters {
  simplex[5] theta;
}
model {
  target += dirichlet_lpdf(theta | rep_vector(2, 5));
  for(n in 1:N_obs)
    target += categorical_lpmf(w_ans[n] | theta);
}
generated quantities{
  array[N_obs] int pred_w_ans;
  for(n in 1:N_obs)
    pred_w_ans[n] = categorical_rng(theta);
}

Given the same set of probabilities θ as above, generate 100 individual observations:

N_obs <- 100
ans_cat <- rcat(N_obs, prob = as.matrix(true_theta))

These individual responses are the format in which Stan expects to see the data. The data fed into the Stan model are defined as a list as usual:

data_cat <- list(N_obs = N_obs,
                 w_ans = ans_cat)
str(data_cat)

## List of 2
##  $ N_obs: num 100
##  $ w_ans: num [1:100] 1 3 3 3 1 1 2 3 2 5 ...

Fitting the Stan model ( categorical.stan ) should yield approximately the same θ as with the
multinomial likelihood defined in the model multinom.stan .

categorical <- system.file("stan_models",
                           "categorical.stan",
                           package = "bcogsci")
fit_cat <- stan(categorical, data = data_cat)


print(fit_cat, pars = c("theta"))

##          mean 2.5% 97.5% n_eff Rhat
## theta[1] 0.19 0.13  0.27  4177    1
## theta[2] 0.08 0.04  0.14  4857    1
## theta[3] 0.27 0.19  0.36  4853    1
## theta[4] 0.06 0.02  0.10  4428    1
## theta[5] 0.40 0.31  0.49  4410    1

The above models estimate the posterior distribution of the probability of each possible response. If we had some experimental manipulation, we could even fit regressions to these parameters. This is called a multinomial logistic regression or categorical regression; see the further reading section for some examples.
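
As a hedged sketch of what such a regression could look like with brms (the data frame df_resp and the predictor condition are hypothetical placeholders, not data from this chapter):

# Hypothetical categorical regression in brms: each non-reference
# response category gets its own set of coefficients.
fit_cat_reg <- brm(w_ans ~ condition,
                   family = categorical(),
                   data = df_resp)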
18.2 Modeling picture naming abilities in aphasia with
MPT models

Multinomial processing tree (MPT) modeling is a method that estimates latent variables with a psychological interpretation given categorical data (a review is provided in Batchelder and Riefer 1999). In other words, an MPT model is just one way to model categorical responses following a multinomial or categorical distribution. MPT models assume that the observed response categories result from a sequence of underlying cognitive events which are represented as a binary branching tree. Each binary branching is associated with a parameter that represents the probability of going down either branch. Every successive node is assumed to be independent of the preceding node, allowing us to use the product rule from probability theory to compute the probability of going down a particular path. The leaves of the binary branching tree are the observed responses in the data. The goal is to derive posterior distributions of the latent probability parameters specified for the binary branchings in the model.

Walker, Hickok, and Fridriksson (2018) created an MPT model that specifies a set of possible
internal errors that lead to the various possible response types during a picture naming trial for
aphasic patients. Here we’ll explore a simplification of the original model.

The model assumes that when an attempt is made to produce a word, errors in production can
arise at the whole word level (lexical level) or the segmental level (phonological level).
Semantic errors are assumed to arise from the lexical substitutions, and neologism errors from
phonological substitutions. Real word responses that are phonologically related to the correct
target word can arise from substitutions at the lexical or phonological level.

The task for the subject is to view a picture and name the object represented in the picture.
When an attempt is made to retrieve the word from memory, the following possible steps can
unfold (this is a simplified version of the original model):

Either the subject will make some lexical selection, or fail to make a lexical selection,
returning a non-response (NR). The probability of making some lexical selection is a, so
the probability of a non-response is 1 − a , as these are only two possibilities at this initial
stage of the binary branching tree. Example: the subject sees the picture of a cat, and
either produces the response “I don’t know”, or starts the process of producing a word.
If a lexical selection is made, the target word is selected with probability t, or some other
word is chosen with probability 1 − t .
Once a word is selected, either its phonological representation is selected with probability
f , or some other (incorrect) phonological representation is selected with probability 1 − f .
Once a word is selected, there can be a phonological change that leads to a real, formally
related word with probability c, or a neologism with probability 1 − c. Example: the subject produces either a formally related word "hat," or a neologism like "cag."

The end result of walking down this tree is that the subject produces either a non-response ("I don't know" or silence), a correct response, a formally related word, a mixed word, or a neologism. There is more than one way to produce a neologism or a related word; the posterior distributions of the latent parameters will determine the probability of each possible path.

[Tree structure: Attempt branches into NR (1 − a) and Lexical Selection (a); Lexical Selection branches into a competitor (1 − t) and the target (t); each then branches into incorrect (1 − f) and correct (f) phonological retrieval; correct retrieval yields Mixed (after a competitor) or Correct (after the target), and incorrect retrieval yields Formal (c) or Neologism (1 − c).]

FIGURE 18.2: Representation of a simplification of the MPT used in Walker, Hickok, and
Fridriksson (2018).

TABLE 18.2: Psychological interpretation of the parameters of the MPT model.

Param.  Description
a       Probability of initiating an attempt
t       Probability of selecting a target word over competitors
f       Probability of retrieving correct phonemes
c       Probability of a phoneme change in the target word creating a real word

18.2.1 Calculation of the probabilities in the MPT branches

By navigating through the branches of the MPT (Figure 18.2), we can calculate the
probabilities of the five responses (the categorical outcomes), based on the four underlying
parameters assumed in the MPT:

P(NR | a, t, f, c) = 1 − a
P(Neologism | a, t, f, c) = a · (1 − t) · (1 − f) · (1 − c) + a · t · (1 − f) · (1 − c)
P(Formal | a, t, f, c) = a · (1 − t) · (1 − f) · c + a · t · (1 − f) · c
P(Mixed | a, t, f, c) = a · (1 − t) · f
P(Correct | a, t, f, c) = a · t · f

Given that

P(NR | a, t, f, c) + P(Neologism | a, t, f, c) + P(Formal | a, t, f, c) + P(Mixed | a, t, f, c) + P(Correct | a, t, f, c) = 1

there is no need to characterize every outcome: we can always calculate any one of the responses as one minus the sum of the other responses.
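
As a quick arithmetic sanity check of ours (with illustrative parameter values only), the five outcome probabilities computed from the tree do indeed sum to one:

# illustrative values for the four latent parameters:
av <- .75; tv <- .9; fv <- .8; cv <- .1
p_nr <- 1 - av
p_neol <- av * (1 - tv) * (1 - fv) * (1 - cv) + av * tv * (1 - fv) * (1 - cv)
p_formal <- av * (1 - tv) * (1 - fv) * cv + av * tv * (1 - fv) * cv
p_mixed <- av * (1 - tv) * fv
p_correct <- av * tv * fv # product rule along one path
p_nr + p_neol + p_formal + p_mixed + p_correct

## [1] 1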

18.2.2 A simple MPT model

First, simulate 200 trials assuming no variability between items and subjects. It is convenient
to define functions to compute each outcome’s probability, based on the previous MPT. One
needs to assign “true values” to the underlying parameters of the MPT; these values are only
for illustration. Ideally, one should simulate data using parameter values that are realistic; that
is, one should use values based on the literature.
# The probabilities of the different answers:
Pr_NR <- function(a, t, f, c)
  1 - a
Pr_Neologism <- function(a, t, f, c)
  a * (1 - t) * (1 - f) * (1 - c) + a * t * (1 - f) * (1 - c)
Pr_Formal <- function(a, t, f, c)
  a * (1 - t) * (1 - f) * c + a * t * (1 - f) * c
Pr_Mixed <- function(a, t, f, c)
  a * (1 - t) * f
Pr_Correct <- function(a, t, f, c)
  a * t * f

# The true underlying values for simulated data:
a_true <- .75
t_true <- .9
f_true <- .8
c_true <- .1
# The probability of the different answers:
Theta <- tibble(NR = Pr_NR(a_true, t_true, f_true, c_true),
                Neologism = Pr_Neologism(a_true, t_true, f_true, c_true),
                Formal = Pr_Formal(a_true, t_true, f_true, c_true),
                Mixed = Pr_Mixed(a_true, t_true, f_true, c_true),
                Correct = Pr_Correct(a_true, t_true, f_true, c_true))
N_trials <- 200
(ans <- rmultinom(1, N_trials, c(Theta)))

##           [,1]
## NR          49
## Neologism   26
## Formal       5
## Mixed       17
## Correct    103
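
As a quick sanity check of ours, the observed proportions should be close to the probabilities in Theta:

# proportions of each response type in the simulated counts:
c(ans) / N_trials

## [1] 0.245 0.130 0.025 0.085 0.515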

The above data can be modeled in Stan as discussed below (see mpt_mnm.stan ). The
probabilities of the different categories go into the transformed parameters section because
they are derived from the probability parameters in the model. The data are modeled as
coming from a multinomial likelihood. If priors are not specified, then a Beta distribution with
a = 1 and b = 1 (a Uniform(0,1) distribution) is assumed for the parameters a, t, f , and c.
Unlike θ, the values of these parameters are independent of each other and they do not sum
to one. For this reason, we should not use a Dirichlet prior here.
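
A minimal sketch of ours makes this concrete: independent Beta(2, 2) draws are not constrained to form a simplex, which is exactly what we want for a, t, f, and c.

# independent draws; their sum is unconstrained, unlike a Dirichlet draw:
draws <- rbeta(4, shape1 = 2, shape2 = 2)
sum(draws) # generally differs from 1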

We define the following model:

θ_nr = 1 − a
θ_neol. = a · (1 − t) · (1 − f) · (1 − c) + a · t · (1 − f) · (1 − c)
θ_formal = a · (1 − t) · (1 − f) · c + a · t · (1 − f) · c
θ_mix = a · (1 − t) · f
θ_corr = a · t · f
θ = {θ_nr, θ_neol., θ_formal, θ_mix, θ_corr}
ans ∼ Multinomial(θ)
a, t, f, c ∼ Beta(2, 2)

This translates to the following code:

data {
  int<lower = 1> N_trials;
  array[5] int<lower = 0, upper = N_trials> ans;
}
parameters {
  real<lower = 0, upper = 1> a;
  real<lower = 0, upper = 1> t;
  real<lower = 0, upper = 1> f;
  real<lower = 0, upper = 1> c;
}
transformed parameters {
  simplex[5] theta;
  theta[1] = 1 - a; //Pr_NR
  theta[2] = a * (1 - t) * (1 - f) * (1 - c) + a * t * (1 - f) * (1 - c); //Pr_Neologism
  theta[3] = a * (1 - t) * (1 - f) * c + a * t * (1 - f) * c; //Pr_Formal
  theta[4] = a * (1 - t) * f; //Pr_Mixed
  theta[5] = a * t * f; //Pr_Correct
}
model {
  target += beta_lpdf(a | 2, 2);
  target += beta_lpdf(t | 2, 2);
  target += beta_lpdf(f | 2, 2);
  target += beta_lpdf(c | 2, 2);
  target += multinomial_lpmf(ans | theta);
}
generated quantities{
  array[5] int pred_ans;
  pred_ans = multinomial_rng(theta, N_trials);
}

Fit the model:

data_sMPT <- list(N_trials = N_trials,
                  ans = c(ans))

mpt_mnm <- system.file("stan_models",
                       "mpt_mnm.stan",
                       package = "bcogsci")
fit_sMPT <- stan(mpt_mnm, data = data_sMPT)

Print out a summary of the posteriors of the parameters of interest:


print(fit_sMPT, pars = c("a", "t", "f", "c"))

##   mean 2.5% 97.5% n_eff Rhat
## a 0.75 0.69  0.81  4741    1
## t 0.85 0.78  0.90  4770    1
## f 0.79 0.72  0.85  4806    1
## c 0.20 0.09  0.34  4054    1

What the model gives us is posterior distributions of each of the parameters a, t, f, c. From these we can derive the probabilities of producing the different observed responses, and the posterior predictive distributions, which could be used for model evaluation.

An important sanity check in modeling is checking whether the model can in principle recover
the true parameters that generated the data; see Figure 18.3.

as.data.frame(fit_sMPT) %>%
  select(c("a", "t", "f", "c")) %>%
  mcmc_recover_hist(true = c(a_true, t_true, f_true, c_true)) +
  coord_cartesian(xlim = c(0, 1))

FIGURE 18.3: Posterior distributions and true values of the parameters of the simple MPT model (mpt_mnm.stan).

The above figure shows that the model can indeed recover the true parameters fairly accurately.

The posterior distributions of the θ parameters can also be summarized:


print(fit_sMPT, pars = c("theta"))

##          mean 2.5% 97.5% n_eff Rhat
## theta[1] 0.25 0.19  0.31  4741    1
## theta[2] 0.13 0.09  0.18  4738    1
## theta[3] 0.03 0.01  0.06  4088    1
## theta[4] 0.09 0.06  0.13  4552    1
## theta[5] 0.50 0.43  0.57  4870    1


These posteriors tell us the probability of producing each of the possible responses. This
model might be useful for estimating the latent parameters, a, t, f , c, but without further
constraints it is unfalsifiable.

Recall that for the multinomial likelihood in section 18.1.1, we had a simplex of size five, which
means that we had four free parameters (since the fifth can be deduced based on the others).
With five possible answers we can always estimate a vector of probabilities θ that fits the
data, in the same way that with two possible answers (e.g., zeros and ones), we can always
estimate a single probability θ that fits the data (using a Bernoulli or Binomial likelihood). The
MPT that we present here is just reparameterizing the vector θ of the multinomial likelihood
(with the same number of free parameters). This means that it will always achieve a perfect fit;
see exercise 18.2. This doesn’t mean that this MPT model is “useless”: Under the assumption
that the model is meaningful, one can estimate its latent parameters and this estimation can
have theoretical implications. If we want to be able to falsify this model, we’ll need to constrain
it more, as we suggest below.

18.2.3 An MPT model assuming by-item variability

The use of aggregated data implies the assumption that the estimated parameters do not vary
too much between subjects and items. If this assumption is incorrect, the analysis of
aggregated data may lead to erroneous conclusions: reliance on aggregated data in the
presence of parameter heterogeneity may lead to biased parameter estimates and the
underestimation of credible intervals.

If it is known that f is affected by the phonological complexity of the individual word (e.g., cat
is easier to produce than umbrella), the previous model does not have a way to include that
information.

Simulated data can be generated taking into account the complexity of the items. Assume
here for simplicity that the complexity of items is scaled and centered; i.e., mean complexity is
represented by 0, and the standard deviation is assumed to be 1. We will assume a
regression model that determines the parameter, f , as a function of the phonological
complexity of each trial.

One important detail is that f is a probability and needs to be bounded between 0 and 1. To
make sure that this property is met, the computation of f for each item will be converted to
probability space using the logistic function. This is achieved as follows.

Suppose that f′ is a linear function of complexity. For example, two parameters α_f and β_f (intercept and slope, respectively) could determine how f is affected by complexity:

f′_j = α_f + complexity_j · β_f

The parameters α_f and β_f are defined in an unconstrained log-odds space (they can be any real number). The model that is fit then yields an f′_j value for each item j in log-odds space. The log-odds value f′_j can be converted to a probability value f_true by applying the logistic function (or the inverse logit, logit⁻¹) to f′_j. Recall from the generalized linear model discussed earlier that if we have a model in log-odds space:

log(p_j / (1 − p_j)) = α + β · x_j = μ_j

then we can recover the probability p_j by solving for p_j:

p_j = exp(μ_j) / (1 + exp(μ_j)) = 1 / (1 + exp(−μ_j))

The above is the logistic or inverse logit function: it takes as input μ_j and returns the corresponding probability p_j. The plogis function in R carries out the calculation shown above.
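
A one-line round trip (a sketch of ours) shows the correspondence between qlogis (probability to log-odds) and plogis (its inverse):

p <- .8
mu <- qlogis(p) # log-odds: log(.8 / .2) = 1.386...
plogis(mu)      # recovers .8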

N_obs <- 50
complexity <- rnorm(N_obs) # by default mean = 0, sd = 1
## choose some hypothetical values:
alpha_f <- .3
# the negative sign indicates that
# increased complexity will lead to a reduced value of f:
beta_f <- -.3
# f' as a linear function of complexity:
f_prime <- alpha_f + complexity * beta_f
head(f_prime)

## [1]  0.468  0.369 -0.168  0.279  0.261 -0.215

## probabilities f for each item:
f_true <- plogis(f_prime)
head(f_true)

## [1] 0.615 0.591 0.458 0.569 0.565 0.447

This change in our assumptions entails that the probability of each response changes depending on the item associated with each observation. In R, the parameter theta now has to be a matrix; in Stan, we will code it as an array of simplexes, i.e., an array of vectors of non-negative values that each sum to 1.

We continue with the functions defined in 18.2.2, and the same values for a_true , t_true ,
and c_true as defined in that section. Since most of the equations depend on f , and f is
a vector now, the outcomes are automatically vectors. But this is not the case for theta_NR_v ,
and thus we need to repeat the value.

theta_NR_v <- rep(Pr_NR(a_true, t_true, f_true, c_true), N_obs)
theta_Neologism_v <- Pr_Neologism(a_true, t_true, f_true, c_true)
theta_Formal_v <- Pr_Formal(a_true, t_true, f_true, c_true)
theta_Mixed_v <- Pr_Mixed(a_true, t_true, f_true, c_true)
theta_Correct_v <- Pr_Correct(a_true, t_true, f_true, c_true)
theta_item <- matrix(c(theta_NR_v,
                       theta_Neologism_v,
                       theta_Formal_v,
                       theta_Mixed_v,
                       theta_Correct_v),
                     ncol = 5)
dim(theta_item)

## [1] 50  5

head(theta_item, n = 3)

##      [,1]  [,2]   [,3]   [,4]  [,5]
## [1,] 0.25 0.260 0.0289 0.0461 0.415
## [2,] 0.25 0.276 0.0307 0.0443 0.399
## [3,] 0.25 0.366 0.0406 0.0344 0.309

Store this in a data frame:

sim_data_cx <- tibble(item = 1:N_obs,
                      complexity = complexity,
                      w_ans = c(rcat(N_obs, theta_item)))
sim_data_cx

## # A tibble: 50 × 3
##    item complexity w_ans
##   <int>      <dbl> <dbl>
## 1     1     -0.560     5
## 2     2     -0.230     2
## 3     3      1.56      2
## # … with 47 more rows

The following model (saved in mpt_cat.stan ) is essentially doing the same thing as the
previous model but instead of fitting a multinomial to the summary of all the trials, it is fitting a
categorical distribution to each individual observation. (This is analogous to the difference
between the Bernoulli and Binomial distributions).
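
To see the analogy concretely, here is a hedged sketch of ours: tabulating many categorical draws for a single item produces counts that have the same distribution as one multinomial draw.

# 100 categorical draws with item 1's probabilities, aggregated into counts:
draws <- rcat(100, prob = theta_item[1, ])
table(factor(draws, levels = 1:5))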

This is still not an appropriate model for the generative process that we are assuming in this
section, because it still ignores the effect of complexity. But it is a good start.

data {
  int<lower = 1> N_obs;
  array[N_obs] int<lower = 1, upper = 5> w_ans;
}
parameters {
  real<lower = 0, upper = 1> a;
  real<lower = 0, upper = 1> t;
  real<lower = 0, upper = 1> f;
  real<lower = 0, upper = 1> c;
}
transformed parameters {
  array[N_obs] simplex[5] theta;
  for(n in 1:N_obs){
    //Pr_NR:
    theta[n, 1] = 1 - a;
    //Pr_Neologism:
    theta[n, 2] = a * (1 - t) * (1 - f) * (1 - c) + a * t * (1 - f) * (1 - c);
    //Pr_Formal:
    theta[n, 3] = a * (1 - t) * (1 - f) * c + a * t * (1 - f) * c;
    //Pr_Mixed:
    theta[n, 4] = a * (1 - t) * f;
    //Pr_Correct:
    theta[n, 5] = a * t * f;
  }
}
model {
  target += beta_lpdf(a | 2, 2);
  target += beta_lpdf(t | 2, 2);
  target += beta_lpdf(f | 2, 2);
  target += beta_lpdf(c | 2, 2);
  for(n in 1:N_obs)
    target += categorical_lpmf(w_ans[n] | theta[n]);
}
generated quantities{
  array[N_obs] int pred_w_ans;
  for(n in 1:N_obs)
    pred_w_ans[n] = categorical_rng(theta[n]);
}

An important aspect of the previous model is that theta is declared as array[N_obs] simplex[5] theta . This means that theta is an array of simplexes and thus has two dimensions: each element of the array (of length N_obs ) is a simplex and sums to one. That's why we iterate over the N_obs observations. However, one limitation of the previous model is that the latent parameters a , t , f , c are declared as real and they do not vary in each iteration of the loop. Before moving to the next section, you might want to do exercise 18.3, where you are asked to edit the previous chunk of code to incorporate the fact that f is now a transformed parameter that depends on the trial information and two new parameters.

18.2.4 A hierarchical MPT

The previous model doesn’t take into account that subjects might vary (and neither does the
modification to this model that is suggested in exercise 18.3). Let’s focus on taking into
account the differences between subjects.

Different subjects might not be equally motivated to do the task. This can be accounted for by
adding hierarchical structure to the parameter a , the probability of initiating an attempt.
Begin by simulating some data that incorporates by-subject variability.

First, define the number of items and subjects, and the number of observations:

N_item <- 20
N_subj <- 30
N_obs <- N_item * N_subj

Then, generate a vector for subjects and for items. Assume here that each subject sees each
item.

subj <- rep(1:N_subj, each = N_item)
item <- rep(1:N_item, times = N_subj)

A vector representing complexity is created for the number of items we have, and this vector is
repeated as many times as there are subjects:


complexity <- rep(rnorm(N_item), times = N_subj)

Next, create a data frame with all the above information:

(exp_sim <- tibble(subj = subj,
                   item = item,
                   complexity = complexity))

## # A tibble: 600 × 3
##    subj  item complexity
##   <int> <int>      <dbl>
## 1     1     1     -0.560
## 2     1     2     -0.230
## 3     1     3      1.56
## # … with 597 more rows

To create subject-level variability in the data, a between-subject standard deviation needs to be defined. This standard deviation represents the deviations of subjects about the grand mean. We are defining this adjustment in log-odds space.

# New parameters, in log-odds space:
tau_u_a <- 1.1
## generate subject adjustments in log-odds space:
u_a <- rnorm(N_subj, 0, tau_u_a)
str(u_a)

## num [1:30] -1.175 -0.24 -1.129 -0.802 -0.688 ...


Given the fixed a_true probability value of 0.75, the subject-level values for individual a_true can be derived by (a) first converting the overall a_true value to log-odds space, (b) adding the by-subject adjustment to this converted overall value, and (c) then converting back to probability space using the logistic or inverse logit ( plogis ) function. Essentially we generate data assuming the following:


a′_{h,n} = α_a + u_{a,subj[n]}
a_{h,n} = logit⁻¹(a′_{h,n})

where u_{a,subj[n]} is a vector with the same length as the total number of observations. The meaning of this notation was explained in section 11.1.

This is done in R as follows:

a_true <- .75 # as before
## convert the intercept to log-odds space:
alpha_a <- qlogis(a_true)
## a_h' in log-odds space:
a_h_prime <- alpha_a + u_a[subj]
## convert back to probability space:
a_true_h <- plogis(a_h_prime)
str(a_true_h)

## num [1:600] 0.481 0.481 0.481 0.481 0.481 ...

What this achieves mathematically is adding varying intercepts by subjects to alpha_a , and
then the values adjusted by subject are saved in probability space.

As before, f_true is computed as a function of complexity:

alpha_f <- .3; beta_f <- -.3
f_true <- plogis(alpha_f + complexity * beta_f)

We continue with the same probability functions and the rest of the true values remain the
same as well.
t_true <- .9; c_true <- .1
Pr_NR <- function(a, t, f, c)
  1 - a
Pr_Neologism <- function(a, t, f, c)
  a * (1 - t) * (1 - f) * (1 - c) + a * t * (1 - f) * (1 - c)
Pr_Formal <- function(a, t, f, c)
  a * (1 - t) * (1 - f) * c + a * t * (1 - f) * c
Pr_Mixed <- function(a, t, f, c)
  a * (1 - t) * f
Pr_Correct <- function(a, t, f, c)
  a * t * f

Now, we can define the probabilities of different outcomes:

# Aux. parameters that define the probabilities:
theta_NR_v_h <- Pr_NR(a_true_h, t_true, f_true, c_true)
theta_Neologism_v_h <- Pr_Neologism(a_true_h, t_true, f_true, c_true)
theta_Formal_v_h <- Pr_Formal(a_true_h, t_true, f_true, c_true)
theta_Mixed_v_h <- Pr_Mixed(a_true_h, t_true, f_true, c_true)
theta_Correct_v_h <- Pr_Correct(a_true_h, t_true, f_true, c_true)
theta_h <- matrix(
  c(theta_NR_v_h,
    theta_Neologism_v_h,
    theta_Formal_v_h,
    theta_Mixed_v_h,
    theta_Correct_v_h),
  ncol = 5)
dim(theta_h)

## [1] 600   5

The probability specifications shown above can now generate the simulated data:

(sim_data_h <- mutate(exp_sim,
                      w_ans = rcat(N_obs, theta_h)))

## # A tibble: 600 × 4
##    subj  item complexity w_ans
##   <int> <int>      <dbl> <dbl>
## 1     1     1     -0.560     2
## 2     1     2     -0.230     1
## 3     1     3      1.56      1
## # … with 597 more rows

Next, define the following model; we omit the steps with f′ and a′ and directly apply the logistic function to a regression. The parameters t, c do not vary by item or subject and therefore do not have the subscript n. We start by defining relatively weak priors for all the parameters in the following model. (See how we decided on the priors of α and β in the logistic regression example of section 4.3.2).

α_a, α_f ∼ Normal(0, 1.5)
β_f ∼ Normal(0, 1)
t, c ∼ Beta(2, 2)
τ_{u_a} ∼ Normal+(0, 1)
u_a ∼ Normal(0, τ_{u_a})
a_n = logit⁻¹(α_a + u_{a,subj[n]})
f_n = logit⁻¹(α_f + complexity_n · β_f)
θ_{n,nr} = 1 − a_n
θ_{n,neol.} = a_n · (1 − t) · (1 − f_n) · (1 − c) + a_n · t · (1 − f_n) · (1 − c)
θ_{n,formal} = a_n · (1 − t) · (1 − f_n) · c + a_n · t · (1 − f_n) · c
θ_{n,mix} = a_n · (1 − t) · f_n
θ_{n,corr} = a_n · t · f_n
θ_n = {θ_{n,nr}, θ_{n,neol.}, θ_{n,formal}, θ_{n,mix}, θ_{n,corr}}
ans_n ∼ Categorical(θ_n)

The corresponding Stan model mpt_h.stan will look like this:

data {
  int<lower = 1> N_obs;
  array[N_obs] int<lower = 1, upper = 5> w_ans;
  array[N_obs] real complexity;
  int<lower = 1> N_subj;
  array[N_obs] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0, upper = 1> t;
  real<lower = 0, upper = 1> c;
  real alpha_a;
  real<lower = 0> tau_u_a;
  vector[N_subj] u_a;
  real alpha_f;
  real beta_f;
}
transformed parameters {
  array[N_obs] simplex[5] theta;
  for (n in 1:N_obs){
    real a = inv_logit(alpha_a + u_a[subj[n]]);
    real f = inv_logit(alpha_f + complexity[n] * beta_f);
    //Pr_NR
    theta[n, 1] = 1 - a;
    //Pr_Neologism
    theta[n, 2] = a * (1 - t) * (1 - f) * (1 - c) +
      a * t * (1 - f) * (1 - c);
    //Pr_Formal
    theta[n, 3] = a * (1 - t) * (1 - f) * c
      + a * t * (1 - f) * c;
    //Pr_Mixed
    theta[n, 4] = a * (1 - t) * f;
    //Pr_Correct
    theta[n, 5] = a * t * f;
  }
}
model {
  target += beta_lpdf(t | 2, 2);
  target += beta_lpdf(c | 2, 2);
  target += normal_lpdf(alpha_a | 0, 1.5);
  target += normal_lpdf(alpha_f | 0, 1.5);
  target += normal_lpdf(beta_f | 0, 1);
  target += normal_lpdf(u_a | 0, tau_u_a);
  target += normal_lpdf(tau_u_a | 0, 1) - normal_lccdf(0 | 0, 1);
  for(n in 1:N_obs)
    target += categorical_lpmf(w_ans[n] | theta[n]);
}
generated quantities{
  array[N_obs] int<lower = 1, upper = 5> pred_w_ans;
  for(n in 1:N_obs)
    pred_w_ans[n] = categorical_rng(theta[n]);
}

For ease of exposition, we are not using the non-centered parameterization discussed
previously in section 11.1.2. We could also apply it here; that will speed up and improve the
convergence of the model. See Exercise 18.4.

It would be a good idea to plot prior predictive distributions for this model, but we skip this step
here. Next, fit the model to the simulated data, by first defining the data as a list:

sim_list_h <- list(N_obs = nrow(sim_data_h),
                   w_ans = sim_data_h$w_ans,
                   N_subj = max(sim_data_h$subj),
                   subj = sim_data_h$subj,
                   complexity = sim_data_h$complexity)

mpt_h <- system.file("stan_models",
                     "mpt_h.stan",
                     package = "bcogsci")
fit_mpt_h <- stan(mpt_h, data = sim_list_h)

Print out a summary of the posterior:

print(fit_mpt_h,
      pars = c("t", "c", "tau_u_a", "alpha_a", "alpha_f", "beta_f"))

##          mean  2.5% 97.5% n_eff Rhat
## t        0.91  0.87  0.94  4219    1
## c        0.11  0.07  0.16  4346    1
## tau_u_a  0.99  0.66  1.41  2050    1
## alpha_a  1.02  0.61  1.41  1408    1
## alpha_f  0.25  0.05  0.45  4338    1
## beta_f  -0.22 -0.43 -0.01  3861    1

If we had fit this to real data, we would now conclude that:

i. given the value of beta_f , complexity has an adverse effect on the probability of
retrieving the correct phonemes, and

ii. given the posterior distribution of tau_u_a , there is a great deal of variation in the
subjects’ probability of initiating an attempt at each trial. Furthermore, if we had some
expectation about t and c based on previous research, we could conclude that our results
are in line (or not) with previous findings.

One could inspect how one unit of complexity affects the probability of retrieving the correct phonemes (f). We first derive the value of f for an item of zero complexity (that is, logit⁻¹(α_f + 0 · β_f)) and then the value of f for an item with a complexity of one (logit⁻¹(α_f + 1 · β_f)). We are interested in summarizing the difference between the two:

as.data.frame(fit_mpt_h) %>%
  select(alpha_f, beta_f) %>%
  mutate(f_0 = plogis(alpha_f),
         f_1 = plogis(alpha_f + beta_f),
         diff_f = f_1 - f_0) %>%
  summarize(Estimate = mean(diff_f),
            `2.5%` = quantile(diff_f, 0.025),
            `97.5%` = quantile(diff_f, 0.975))

##   Estimate   2.5%    97.5%
## 1  -0.0543 -0.107 -0.00366

One further interesting step could be to develop a competing model that assumes a different latent process, and then compare the performance of the MPT with this competing model, using Bayes factors or K-fold-CV (or both).

Since we generated the data based on known latent parameters, we also plot the posteriors
together with the true values of the parameters in Figure 18.4. This is something that we can
only do with simulated data.

as.data.frame(fit_mpt_h) %>%
  select(tau_u_a, alpha_a, t, alpha_f, beta_f, c) %>%
  mcmc_recover_hist(true = c(tau_u_a,
                             qlogis(a_true),
                             t_true, alpha_f,
                             beta_f, c_true))

FIGURE 18.4: Posterior of the hierarchical MPT with true values as vertical lines (model
mpt_h.stan ).

If everything is correctly defined in the model, we should be able to generate posterior predictive data based on our estimates that looks quite similar to the simulated data; see Figure 18.5. The error bars in yrep include 90% of the probability mass of the predictive distribution (this is a default of ppc_bars() ). In a well-calibrated model, the data (y, here the proportion of answers of each type) should be inside the error bars in 90% of the cases.

gen_data <- rstan::extract(fit_mpt_h)$pred_w_ans
ppc_bars(sim_list_h$w_ans, gen_data) +
  ggtitle("Hierarchical model")


FIGURE 18.5: A posterior predictive check for aggregated data in the hierarchical MPT model.

It is also useful to look at the individual subjects’ posteriors; these are shown in Figure 18.6.

ppc_bars_grouped(sim_list_h$w_ans,
                 gen_data, group = subj) +
  ggtitle("By-subject plot for the hierarchical model")

FIGURE 18.6: Individual subjects in the hierarchical MPT model.


But what about the first non-hierarchical MPT model ( mpt_cat.stan )?

mpt_cat <- system.file("stan_models",
                       "mpt_cat.stan",
                       package = "bcogsci")
fit_sh <- stan(mpt_cat, data = sim_list_h)

The fit to the aggregated data looks great (Figure 18.7).

gen_data_sMPT <- rstan::extract(fit_sh)$pred_w_ans
ppc_bars(sim_list_h$w_ans, gen_data_sMPT) +
  ggtitle("Non-hierarchical model")


FIGURE 18.7: Posterior predictive check for aggregated data in a non-hierarchical MPT model
(mpt_cat.stan).

However, in the non-hierarchical model, the fit to individual subjects is not as good (Figure
18.8): The error bars of the predicted distribution do not include the observed proportion of
answers for many of the subjects.

ppc_bars_grouped(sim_list_h$w_ans, gen_data_sMPT, group = subj) +
  ggtitle("By-subject plot for the non-hierarchical model")

FIGURE 18.8: Individual subjects in the non-hierarchical MPT model (mpt_cat.stan).


The hierarchical model does a better job of modeling individual-level variability.
18.3 Summary

In this chapter, we learned to fit increasingly complex MPTs to model categorical responses by
assuming underlying latent events that might or might not happen with a certain probability.
We started with a simple model and ended with a hierarchical model. We saw how to generate
simulated data and investigate parameter recovery to verify that the models were correctly
implemented. Furthermore, we showed how one can carry out posterior predictive checks to
evaluate the fit of the models, and how to interpret the posteriors of the parameters. MPTs
have a lot of potential as stand-alone models or as parts of larger cognitive models (as in, for
example, Paape and Vasishth 2022; Nicenboim and Vasishth 2016; and Klauer and Kellen
2018).

18.4 Further reading

Koster and McElreath (2017) present a tutorial on multinomial logistic regression/categorical


regression in the context of behavioral ecology and anthropology. Another tutorial on MPTs is
presented by Matzke et al. (2015). For the complete implementation of an MPT relating to
aphasia, see Walker, Hickok, and Fridriksson (2018). Some examples of cognitive models
using MPTs are Lee et al. (2020) and Smith and Batchelder (2010). An example of using MPTs
to model garden-pathing processes in sentence processing appears in Paape and Vasishth
(2022).

18.5 Exercises

Exercise 18.1 Modeling multiple categorical responses.

a. Re-fit the model presented in section 18.1.2, adding the assumption that you have more
information about the probability of giving a correct response in the task. Assume that you
know that subjects’ answers have around 60% accuracy. Encode this information in the
priors with two different degrees of certainty. (Hint: 1. As with the Beta distribution, you
can increase the pseudo-counts to increase the amount of information and reduce the
“width” of the distribution; compare Beta(9, 1) with Beta(900, 100) . 2. You’ll need to use
a column vector for the Dirichlet concentration parameters. [.., .., ] is a row_vector
that can be transposed and converted into a column vector by adding the transposition
symbol ' after the right bracket.)
b. What is the difference between the multinomial and categorical parameterizations?
c. What can we learn about impaired picture naming from the models in sections 18.1.1 and
18.1.2?

Exercise 18.2 An alternative MPT to model the picture recognition task.

Build any alternative tree with four parameters w, x, y, z to fit the data generated in 18.2.2. Compare the posterior distribution of the auxiliary vector theta (that goes in the multinomial_lpmf ) with the one derived in section 18.2.2.

Exercise 18.3 A simple MPT model that incorporates phonological complexity in the picture
recognition task.

Edit the Stan code mpt_cat.stan from bcogsci presented in section 18.2.3 to incorporate the fact that f is now a transformed parameter that depends on the trial information and two new parameters, α_f and β_f. The rest of the latent parameters do not need to vary by trial.

f′_j = α_f + complexity_j · β_f
f_j = logit⁻¹(f′_j)

The inverse logit or logistic function is called inv_logit in Stan. Fit the model to the data of
18.2.3 and report the posterior distributions of the latent parameters.

Exercise 18.4 A more hierarchical MPT.

Modify the hierarchical MPT presented in section 18.2.4 so that all the parameters are
affected by individual differences. Simulate data and fit it. How well can you recover the
parameters? You should use the non-centered parametrization for the by-subject adjustments.
(Hint: Convergence will be reached much faster if you don’t assume that the adjustment
parameters are correlated as in 11.1.2, but you could also assume a correlation between all
(or some of) the adjustments by using the Cholesky factorization discussed in section 11.1.3.)

Exercise 18.5 Advanced: Multinomial processing trees.

The data set df_source_monitoring in bcogsci contains data from the package
psychotools coming from a source-monitoring experiment (Batchelder and Riefer 1990)

performed by Wickelmaier and Zeileis (2018).


In this type of experiment, subjects study items from (at least) two different sources, A and B.
After the presentation of the study items, subjects are required to classify each item as coming
from source A, B, or as new: N (that is, a distractor). In their version of the experiment,
Wickelmaier and Zeileis used two different A-B pairs: Half of the subjects had to read items
either quietly (source A = think) or aloud (source B = say). The other half had to write items
down (source A = write) or read them aloud (source B = say).

experiment : write-say or think-say.
age : Age of the respondent in years.
gender : Gender of the respondent.
subj : Subject id.
source : Item source, a, b, or N (new).
a , b , N : Number of responses for each type of stimuli.

Fit a multinomial processing tree following Figures 18.9, 18.10, and 18.11 to investigate whether experiment type, age, and/or gender affect the different processes assumed in the model. As in Batchelder and Riefer (1990), assume that a = g (for identifiability) and that discriminability is equal for both sources (d1 = d2).



FIGURE 18.9: Multinomial processing tree for the source A items from the source monitoring
paradigm (Batchelder and Riefer, 1990). D1 stands for the detectability of source A, d1 stands
for the source discriminabilities for source A items, b stands for the bias for responding “old” to
a nondetected item, a stands for guessing that a detected but nondiscriminated item belongs
to source A, and g stands for guessing that the item is a source A item.


FIGURE 18.10: Multinomial processing tree for the source B items from the source monitoring paradigm (Batchelder and Riefer, 1990). D2 stands for the detectability of source B items, d2 stands for the source discriminabilities for source B, b stands for the bias for responding "old" to a nondetected item, a stands for guessing that a detected but nondiscriminated item belongs to source A, and g stands for guessing that the item is a source A item.



FIGURE 18.11: Multinomial processing tree for the new items in the source monitoring
paradigm (Batchelder and Riefer, 1990). b stands for the bias for responding “old” to a
nondetected item, a stands for guessing that a detected but nondiscriminated item belongs to
source A, and g stands for guessing that the item is a source A item.
Notice the following:

The data are aggregated at the level of source, so you should use multinomial_lpmf for every row of the data set rather than categorical_lpmf() .
In contrast to the previous example, source determines three different trees; this means that the parameter theta has to be defined in relation to the item source.
All the predictors are between subjects; this means that only a by-subject intercept adjustment (for every latent process) is possible.

If you want some basis to start with, you can have a look at the incomplete code in
source.stan , by typing the following in R:


cat(readLines(system.file("stan_models",
"source.stan",
package = "bcogsci")),
sep = "\n")

References

Batchelder, William H, and David M Riefer. 1990. “Multinomial Processing Models of Source
Monitoring.” Psychological Review 97 (4). American Psychological Association: 548.

Batchelder, William H, and David M Riefer. 1999. “Theoretical and Empirical Review of
Multinomial Process Tree Modeling.” Psychonomic Bulletin & Review 6 (1). Springer: 57–86.

Cook, Samantha R, Andrew Gelman, and Donald B Rubin. 2006. “Validation of Software for
Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical
Statistics 15 (3). Taylor & Francis: 675–92. https://doi.org/10.1198/106186006X136976.

Damasio, Antonio R. 1992. “Aphasia.” New England Journal of Medicine 326 (8). Mass
Medical Soc: 531–39.

Klauer, Karl Christoph, and David Kellen. 2018. “RT-MPTs: Process Models for Response-
Time Distributions Based on Multinomial Processing Trees with Applications to Recognition
Memory.” Journal of Mathematical Psychology 82: 111–30.
https://doi.org/https://doi.org/10.1016/j.jmp.2017.12.003.
Koster, Jeremy, and Richard McElreath. 2017. “Multinomial Analysis of Behavior: Statistical
Methods.” Behavioral Ecology and Sociobiology 71 (9): 138. https://doi.org/10.1007/s00265-
017-2363-8.

Lee, Michael D., Jason R Bock, Isaiah Cushman, and William R Shankle. 2020. “An
Application of Multinomial Processing Tree Models and Bayesian Methods to Understanding
Memory Impairment.” Journal of Mathematical Psychology 95. Elsevier: 102328.

Matzke, Dora, Conor V. Dolan, William H. Batchelder, and Eric-Jan Wagenmakers. 2015.
“Bayesian Estimation of Multinomial Processing Tree Models with Heterogeneity in
Participants and Items.” Psychometrika 80 (1): 205–35. https://doi.org/10.1007/s11336-013-
9374-9.

Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical methods for linguistic research:
Foundational Ideas - Part II.” Language and Linguistics Compass 10 (11): 591–613.
https://doi.org/10.1111/lnc3.12207.

Paape, Dario, and Shravan Vasishth. 2022. “Estimating the True Cost of Garden-Pathing: A
Computational Model of Latent Cognitive Processes.” Cognitive Science 46 (8): e13186.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled
Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1). American
Psychological Association: 103–26.

Smith, Jared B, and William H Batchelder. 2010. “Beta-MPT: Multinomial Processing Tree
Models for Addressing Individual Differences.” Journal of Mathematical Psychology 54 (1).
Elsevier: 167–83.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018.
“Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint
arXiv:1804.06788.

Walker, Grant M, Gregory Hickok, and Julius Fridriksson. 2018. “A Cognitive Psychometric
Model for Assessment of Picture Naming Abilities in Aphasia.” Psychological Assessment 6.
American Psychological Association: 809–26. https://doi.org/10.1037/pas0000529.

Wickelmaier, Florian, and Achim Zeileis. 2018. “Using Recursive Partitioning to Account for
Parameter Heterogeneity in Multinomial Processing Tree Models.” Behavior Research
Methods 50 (3). Springer: 1217–33.

Chapter 19 Mixture models

Mixture models integrate multiple data generating processes into a single model. This is
especially useful in cases where the data alone don’t allow us to fully identify which
observations belong to which process. Mixture models are important in cognitive science
because many theories of cognition assume that the behavior of subjects in certain tasks is
determined by an interplay of different cognitive processes (e.g., response times in
schizophrenia in Levy et al. 1993; retrieval from memory in sentence processing in McElree
2000; Nicenboim and Vasishth 2018; fast choices in Ollman 1966; Dutilh et al. 2011). It is
important to stress that a mixture distribution of observations is an assumption of the latent
process developing trial by trial based on a given theory—it doesn’t necessarily represent the
true generative process. The role of Bayesian modeling is to help us understand the extent to
which this assumption is well-founded, by using posterior predictive checks and comparing
different models.

We focus here on the case where we have only two components; each component represents a distinct cognitive process based on the domain knowledge of the researcher. The vector z serves as an indicator variable that tells us which of the mixture components an observation y_n belongs to (n = 1, …, N, where N is the number of data points). We assume two components, and thus each z_n can be either 0 or 1 (this will allow us to generate z_n with a Bernoulli distribution). We also assume two different generative processes, p_1 and p_2, which generate different distributions of the observations based on vectors of parameters indicated by Θ_1 and Θ_2, respectively. These two processes occur with probability θ and 1 − θ, and each observation is generated as follows:

z_n ∼ Bernoulli(θ)
y_n ∼ p_1(Θ_1) if z_n = 1, or p_2(Θ_2) if z_n = 0    (19.1)

We focus on only two components because this type of model is already hard to fit, and, as we show in this chapter, it requires plenty of prior information to be able to sample from the posterior in most applied situations. However, if the number of components in the mixture is finite and also determined by the researcher, the approach presented here can in principle be extended to a larger number of mixture components by replacing the Bernoulli distribution with a categorical one.
In order to fit this model, we need to estimate the posterior of each of the parameters
contained in the vectors Θ1 and Θ2 (intercepts, slopes, group-level effects, etc.), the
probability θ, and the indicator variable that corresponds to each observation zn . One issue
that presents itself here is that zn must be a discrete parameter, and Stan only allows
continuous parameters. This is because Stan’s algorithm requires the derivatives of the (log)
posterior distribution with respect to all parameters, and discrete parameters are not
differentiable (since they have “breaks”). In probabilistic programming languages like
WinBUGS and JAGS (Lunn et al. 2012; Plummer 2016), discrete parameters are possible to
use; but not in Stan. In Stan, we can circumvent this issue by marginalizing out the indicator
variable z.48 If p1 appears in the mixture with probability θ, and p2 with probability 1 − θ , then
the joint likelihood is defined as a function of Θ (which concatenates the mixing probability, θ,
and the parameters of the p1 and p2 , Θ1 and Θ2 ), and importantly zn “disappears”:

p(yn |Θ) = θ ⋅ p1 (yn |Θ1 ) + (1 − θ) ⋅ p2 (yn |Θ2 )

The intuition behind this formula is that each likelihood function, p1 , p2 is weighted by its
probability of being the relevant generative process. For our purposes, it suffices to say that
marginalization works; the reader interested in the mathematics behind marginalization is
directed to the further reading section at the end of the chapter.49
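
As a hedged numerical sketch of ours with two log-normal components (parameter values are purely illustrative), the marginalized density of one observation is just the weighted sum of the two component densities:

# density of one observation under a two-component log-normal mixture
theta <- .8 # mixing probability
y <- 300    # one observation (e.g., a response time in ms)
theta * dlnorm(y, meanlog = 5.8, sdlog = .4) +
  (1 - theta) * dlnorm(y, meanlog = 5.2, sdlog = .5)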

Even though Stan cannot fit a model with the discrete indicator of the latent class z that we
used in equation (19.1), this equation will prove very useful when we want to generate
synthetic data.

In the following sections, we model a well-known phenomenon (i.e., the speed-accuracy trade-
off) assuming an underlying finite mixture process. We start from the verbal description of the
model, and then implement the model in Stan step by step.

19.1 A mixture model of the speed-accuracy trade-off: The fast-guess model account

When we are faced with multiple choices that require an immediate decision, we can speed up
the decision at the expense of accuracy and become more accurate at the expense of speed;
this is called the speed-accuracy trade-off (Wickelgren 1977). The most popular class of
models that can incorporate both response times and accuracy, and give an account for the
speed-accuracy trade-off is the class of sequential sampling models, which include the drift
diffusion model (Ratcliff 1978), the linear ballistic accumulator (Brown and Heathcote 2008),
and the log-normal race model (Heathcote and Love 2012; Rouder et al. 2015), which we
discuss in chapter 20; for a review see Ratcliff et al. (2016).

However, an alternative model that has been proposed in the past is Ollman’s simple fast-
guess model (Ollman 1966; Yellott 1967, 1971).50 Although it has mostly fallen out of favor
(but see Dutilh et al. 2011 for a more modern variant of this model), it presents a very simple
framework using finite mixture modeling that can also account for the speed-accuracy trade-
off. In the next sections, we’ll use this model to exemplify the use of finite mixtures to
represent different cognitive processes.

19.1.1 The global motion detection task

One way to examine the behavior of human and primate subjects when faced with two-
alternative forced choices is the detection of the global motion of a random dot kinematogram
(Britten et al. 1993). In this task, a subject sees a number of random dots on the screen. A
proportion of dots move in a single direction (e.g., right) and the rest move in random
directions. The subject’s goal is to estimate the overall direction of the movement. One of the
reasons for the popularity of this task is that it permits the fine-tuning of the difficulty of trials
(Dutilh et al. 2019): The task is harder when the proportion of dots that move coherently (the
level of coherence) is lower; see Figure 19.1.

FIGURE 19.1: Three levels of difficulty of the global motion detection task. The figures show a
consistent movement to the right with three levels of coherence (10%, 50%, and 100%). The
subjects see the dots moving in the direction indicated by the arrows. The subjects do not see
the arrows and all the dots look identical in the actual task. Adapted from Han et al. (2018);
licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
Ollman's (1966) fast-guess model assumes that the behavior in this task (and in any other choice task) is governed by two distinct cognitive processes: (i) a guessing mode, and (ii) a task-engaged mode. In the guessing mode, responses are fast and accuracy is at chance level. In the task-engaged mode, responses are slower and accuracy approaches 100%. This means that intermediate values of response times and accuracy can only be achieved by mixing responses from the two modes. Further assumptions of this model are that response times depend on the difficulty of the choice, and that the probability of being in one of the two modes depends on the speed incentives given in the instructions.

To simplify matters, we ignore the possibility that the accuracy of the choice is also affected by
the difficulty of the choice. Also, we ignore the possibility that subjects might be biased to one
specific response in the guessing mode, but see exercise 19.3.

19.1.1.1 The data set

We implement the assumptions behind Ollman’s fast-guess model and examine its fit to data
of a global motion detection task from Dutilh et al. (2019).

The data set from Dutilh et al. (2019) contains approximately 2800 trials for each of the 20 subjects participating in a global motion detection task and can be found in df_dots in the bcogsci package. There were two levels of coherence, yielding hard and easy trials ( diff ), and the trials were done under instructions that emphasized either accuracy or speed ( emphasis ). More information about the data set can be found by accessing the documentation for the data set (by typing ?df_dots in the R console, assuming that the bcogsci package is installed).

data("df_dots")
df_dots

## # A tibble: 56,097 × 12
##    subj diff  emphasis    rt   acc fix_dur stim  resp  trial block
##   <int> <chr> <chr>    <dbl> <int>   <dbl> <chr> <chr> <int> <int>
## 1     1 easy  speed      482     1   0.738 R     R         1     6
## 2     1 hard  speed      602     1   0.784 R     R         2     6
## 3     1 hard  speed      381     1   0.651 R     R         3     6
##   block_trial bias
##         <int> <chr>
## 1           1 no
## 2           2 no
## 3           3 no
## # … with 56,094 more rows

We might think that if the fast-guess model were true, we would see a bimodal distribution when we plot a histogram of the data. Unfortunately, when two similar distributions are mixed, we won't necessarily see any apparent bimodality; see Figure 19.2.

ggplot(df_dots, aes(rt)) +
  geom_histogram()

FIGURE 19.2: Distribution of response times in the data of the global motion detection task in
Dutilh et al. (2019).
However, Figure 19.3 reveals that incorrect responses were generally faster, and this was
especially true when the instructions emphasized accuracy.

ggplot(df_dots, aes(x = factor(acc), y = rt)) +
  geom_point(position = position_jitter(width = .4, height = 0),
             alpha = .5) +
  facet_wrap(diff ~ emphasis) +
  xlab("Accuracy") +
  ylab("Response time")

FIGURE 19.3: The distribution of response times by accuracy in the data of the global motion
detection task in Dutilh et al. (2019).

19.1.2 A very simple implementation of the fast-guess model

The description of the model makes it clear that an ideal subject who never guesses has a
response time that depends only on the difficulty of the trial. As we did in previous chapters,
we assume that response times are log-normally distributed, and for simplicity we start by
modeling the behavior of a single subject:

rt_n ∼ LogNormal(α + β · x_n, σ)

In the previous equation, x is larger for difficult trials. If we center x, α represents the average log-transformed response time for a subject engaged in the task, and β is the effect of trial difficulty on log-response time. We assume a non-deterministic process, with a noise parameter σ. See also Box 4.3 for more information about log-normally distributed response times.

Alternatively, a subject that guesses in every trial would show a response time distribution that
is independent of the difficulty of the trial:
rt_n ∼ LogNormal(γ, σ2)

Here γ represents the average log-transformed response time when a subject only guesses. We assume that responses from the guessing mode might have a different noise component than those from the task-engaged mode.

The fast-guess model makes the assumption that during a task, a single subject would behave
in these two ways: They would be engaged in the task a proportion of the trials and would
guess on the rest of the trials. This means that for a single subject, there is an underlying
probability of being engaged in the task, ptask , that determines whether they are actually
choosing (z = 1 ) or guessing (z = 0 ):

z_n ∼ Bernoulli(p_task)

The value of the parameter z in every trial determines the behavior of the subject. This means
that the distribution that we observe is a mixture of the two distributions presented before:

rt_n ∼ LogNormal(α + β · x_n, σ) if z_n = 1, or LogNormal(γ, σ2) if z_n = 0    (19.2)

In order to have a Bayesian implementation, we also need to define some priors. We use
priors that encode what we know about reaction time experiments. These priors are slightly
more informative than the ones that we used in section 4.2, but they still can be considered
regularizing priors. One can verify this by performing prior predictive checks. As we increase
the complexity of our models, it’s worth spending some time designing more realistic priors.
These will speed up computation and in some cases they will be crucial for solving
convergence problems.

α ∼ Normal(6, 1)
β ∼ Normal(0, .1)
σ ∼ Normal+(.5, .2)
γ ∼ Normal(6, 1)
σ2 ∼ Normal+(.5, .2)

For now, we give all values of the probability of an engaged response equal prior density; we achieve this by setting the following prior on p_task:

p_task ∼ Beta(1, 1)

This represents a flat, uninformative prior over the probability parameter p_task.
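
As a minimal prior predictive sketch of ours (assuming rtnorm and rbern from the extraDistr package), one can repeatedly draw a parameter set from these priors, generate a data set, and check that the implied response times are plausible:

# draw one parameter set from the priors and simulate one data set
N_pp <- 1000
x_pp <- rep(c(-.5, .5), each = N_pp / 2)
alpha_pp <- rnorm(1, 6, 1)
beta_pp <- rnorm(1, 0, .1)
sigma_pp <- rtnorm(1, .5, .2, a = 0)  # truncated at zero
gamma_pp <- rnorm(1, 6, 1)
sigma2_pp <- rtnorm(1, .5, .2, a = 0)
p_task_pp <- rbeta(1, 1, 1)
z_pp <- rbern(N_pp, prob = p_task_pp)
rt_pp <- if_else(z_pp == 1,
                 rlnorm(N_pp, alpha_pp + beta_pp * x_pp, sigma_pp),
                 rlnorm(N_pp, gamma_pp, sigma2_pp))
summary(rt_pp)

Repeating this many times and plotting the resulting distributions would constitute a full prior predictive check.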

Before we fit our model to the real data, we generate synthetic data to make sure that our
model is working as expected.
We first define the number of observations, predictors, and fixed point values for each of the
parameters. We assume 1000 observations and two levels of difficulty, coded −0.5 (easy)
and 0.5 (hard). The point values chosen for the parameters are relatively realistic (based on
our previous experience on reaction time experiments). Although in the priors we try to encode
the range of possible values for the parameters, in this simulation we assume only one
instance of this possible range:

N <- 1000
# level of difficulty:
x <- c(rep(-.5, N/2), rep(.5, N/2))
# Parameters true values:
alpha <- 5.8
beta <- 0.05
sigma <- .4
sigma2 <- .5
gamma <- 5.2
p_task <- .8
# Median time
c("engaged" = exp(alpha), "guessing" = exp(gamma))

## engaged guessing
##     330      181

To generate a mixture of response times, we use the indicator of the latent class, z .

z <- rbern(n = N, prob = p_task)
rt <- if_else(z == 1,
              rlnorm(N,
                     meanlog = alpha + beta * x,
                     sdlog = sigma),
              rlnorm(N,
                     meanlog = gamma,
                     sdlog = sigma2))
df_dots_simdata1 <- tibble(trial = 1:N, x = x, rt = rt)

We verify that our simulated data are realistic, that is, in the same range as the original data; see Figure 19.4.

ggplot(df_dots_simdata1, aes(rt)) +
  geom_histogram()


FIGURE 19.4: Response times in the simulated data ( df_dots_simdata1 ) that follows the fast-
guess model.
To implement the mixture model defined in equation (19.2) in Stan, the discrete parameter $z$ needs to be marginalized out:

$$p(rt_n \mid \Theta) = p_{task} \cdot \mathit{LogNormal}(rt_n \mid \alpha + \beta \cdot x_n, \sigma) + (1 - p_{task}) \cdot \mathit{LogNormal}(rt_n \mid \gamma, \sigma_2)$$

In addition, Stan requires the likelihood to be defined in log-space:

$$\log(p(rt_n \mid \Theta)) = \log\big(p_{task} \cdot \mathit{LogNormal}(rt_n \mid \alpha + \beta \cdot x_n, \sigma) + (1 - p_{task}) \cdot \mathit{LogNormal}(rt_n \mid \gamma, \sigma_2)\big)$$

A “naive” implementation in Stan would look like the following (recall that _lpdf functions
provide log-transformed densities):

target += log(
p_task * exp(lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma)) +
(1-p_task) * exp(lognormal_lpdf(rt[n] | gamma, sigma2)));

However, we need to take into account that $\log(A \pm B)$ can be numerically unstable (i.e., prone to underflow/overflow) when $A$ is much larger than $B$, or vice versa. Stan provides several functions to deal with different special cases of logarithms of sums and differences. Here we need log_sum_exp(x, y), which corresponds to log(exp(x) + exp(y)), and log1m(x), which corresponds to log(1 - x).

First, we need to take into account that the first summand of the logarithm, p_task * exp(lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma)), corresponds to exp(x), and the second one, (1-p_task) * exp(lognormal_lpdf(rt[n] | gamma, sigma2)), to exp(y) in log_sum_exp(x, y). This means that we need to first apply the logarithm to each of them to use them as arguments of log_sum_exp(x, y):

target += log_sum_exp(
  log(p_task) + lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma),
  log(1 - p_task) + lognormal_lpdf(rt[n] | gamma, sigma2));

Now we can just replace log(1 - p_task) with the more stable log1m(p_task):

target += log_sum_exp(
  log(p_task) + lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma),
  log1m(p_task) + lognormal_lpdf(rt[n] | gamma, sigma2));
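To see concretely why the naive sum fails, here is a small R sketch; it is our own illustration, and the function log_sum_exp below is hypothetical R code mirroring Stan's built-in:

# The log-sum-exp trick: factor out the larger term before exponentiating.
log_sum_exp <- function(x, y) {
  m <- max(x, y)
  m + log(exp(x - m) + exp(y - m))
}
x <- -1000 # a very small log-density
y <- -1001 # another very small log-density
log(exp(x) + exp(y)) # naive version underflows to -Inf
log_sum_exp(x, y)    # stable: approximately -999.69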

The complete model ( mixture_rt.stan ) is shown below:

data {
  int<lower = 1> N;
  vector[N] x;
  vector[N] rt;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
  real gamma; // guessing
  real<lower = 0> sigma2;
  real<lower = 0, upper = 1> p_task;
}
model {
  // priors for the task component
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .1);
  target += normal_lpdf(sigma | .5, .2)
    - normal_lccdf(0 | .5, .2);
  // priors for the guessing component
  target += normal_lpdf(gamma | 6, 1);
  target += normal_lpdf(sigma2 | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += beta_lpdf(p_task | 1, 1);
  // likelihood
  for(n in 1:N)
    target += log_sum_exp(log(p_task) +
                          lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma),
                          log1m(p_task) +
                          lognormal_lpdf(rt[n] | gamma, sigma2));
}

Call the Stan model mixture_rt.stan , and fit it to the simulated data. First, we set up the
simulated data as a list structure:

ls_dots_simdata <- list(N = N, rt = rt, x = x)

Then fit the model:

mixture_rt <- system.file("stan_models",
                          "mixture_rt.stan",
                          package = "bcogsci")
fit_mix_rt <- stan(mixture_rt, data = ls_dots_simdata)

## Warning: The largest R-hat is 1.74, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat

##
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating
## posterior means and medians may be unreliable. Running the chains for
## more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess

##
## Warning: Tail Effective Samples Size (ESS) is too low, indicating
## posterior variances and tail quantiles may be unreliable. Running the
## chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

There are a lot of warnings, the Rhats are too large, and the number of effective samples is too low:

print(fit_mix_rt)
##              mean     2.5%    97.5% n_eff Rhat
## alpha        5.51     4.84     5.88     2 2.44
## beta         0.02    -0.14     0.13     6 1.24
## sigma        0.45     0.33     0.62     3 1.82
## gamma        5.55     4.94     5.87     2 2.30
## sigma2       0.46     0.35     0.60     3 2.03
## p_task       0.47     0.09     0.88     3 2.03
## lp__     -6373.97 -6378.52 -6370.88    13 1.11

The traceplots show clearly that the chains aren’t mixing; see Figure 19.5.

traceplot(fit_mix_rt)


FIGURE 19.5: Traceplots from the model mixture_rt.stan fit to simulated data.

The problem with this model is that the mixture components (i.e., the fast-guesses and the
engaged mode) are underlyingly exchangeable (see Box 5.1) and thus the posterior is multi-
modal and the model does not converge. Each chain doesn’t know how each component was
identified by the rest of the chains. However, we do have information that can identify the
components: According to the theoretical model, we know that the average response in the
engaged mode, represented by α , should be slower than the average response in the
guessing mode, γ.
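The label switching at the root of this non-convergence can be illustrated directly in R (our own sketch, using the simulated data and true values from above): swapping the two components and replacing $p_{task}$ with $1 - p_{task}$ yields exactly the same likelihood.

# Log-likelihood of a two-component log-normal mixture:
mix_loglik <- function(rt, mu1, s1, mu2, s2, p) {
  sum(log(p * dlnorm(rt, mu1, s1) + (1 - p) * dlnorm(rt, mu2, s2)))
}
mix_loglik(rt, 5.8, .4, 5.2, .5, .8)
mix_loglik(rt, 5.2, .5, 5.8, .4, .2) # identical: the labels carry no information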

Even though the theoretical model assumes that guesses are faster than engaged responses, this is not explicit in our computational model. That is, our model lacks some of the theoretical information that we have, namely, that the distribution of engaged response times should be slower than the distribution of guessing times. This can be encoded with a strong prior for $\gamma$, whose distribution we truncate at an upper bound given by the value of $\alpha$:

$$\gamma \sim \mathit{Normal}(6, 1), \text{ for } \gamma < \alpha$$

This would be enough to make the model converge.

Another, softer constraint that we could add to our implementation is the assumption that subjects are generally more likely to be trying to do the task than just guessing. If this assumption is correct, we also improve the accuracy of our estimate of the posterior of the model. (The opposite is also true: if subjects are not trying to do the task, this assumption will be unwarranted and our prior information will lead us further from the “true” values of the parameters.) The following prior has the probability density concentrated near 1.

$$p_{task} \sim \mathit{Beta}(8, 2)$$

Plotting this prior confirms where most of the probability mass lies; see Figure 19.6.

plot(function(x) dbeta(x, 8, 2), ylab = "density", xlab = "probability")



FIGURE 19.6: A density plot for the prior on $p_{task}$, Beta(8, 2).

The Stan code for this model is shown below as mixture_rt2.stan .

data {
  int<lower = 1> N;
  vector[N] x;
  vector[N] rt;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
  real<upper = alpha> gamma;
  real<lower = 0> sigma2;
  real<lower = 0, upper = 1> p_task;
}
model {
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .3);
  target += normal_lpdf(sigma | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(gamma | 6, 1) -
    normal_lcdf(alpha | 6, 1);
  target += normal_lpdf(sigma2 | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += beta_lpdf(p_task | 8, 2);
  for(n in 1:N)
    target += log_sum_exp(log(p_task) +
                          lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma),
                          log1m(p_task) +
                          lognormal_lpdf(rt[n] | gamma, sigma2));
}

Once we change the upper bound of gamma in the parameters block, we also need to truncate the distribution in the model block by correcting the PDF with its CDF. This correction uses the CDF because we are truncating the distribution at the right-hand side (recall that earlier we used the complement of the CDF when we truncated a distribution at the left-hand side); see Box 4.1.
target += normal_lpdf(gamma | 6, 1) -
normal_lcdf(alpha | 6, 1);
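We can verify numerically in R that this correction yields a proper (truncated) density; the sketch below is our own illustration, not code from the book's packages:

# Log-density of Normal(6, 1) truncated from above at an upper bound,
# written as in the Stan code: log PDF minus log CDF at the bound.
upper <- 5.8 # an illustrative value for alpha
dgamma_trunc <- function(gamma) {
  dnorm(gamma, 6, 1, log = TRUE) - pnorm(upper, 6, 1, log.p = TRUE)
}
# The truncated density integrates to (approximately) one on (-Inf, upper]:
integrate(function(g) exp(dgamma_trunc(g)), -Inf, upper)$value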

Fit this model (call it mixture_rt2.stan ) to the same simulated data set that we used before:

mixture_rt2 <- system.file("stan_models",
                           "mixture_rt2.stan",
                           package = "bcogsci")
fit_mix_rt2 <- stan(mixture_rt2, data = ls_dots_simdata)

Now the summaries and traceplots look fine; see Figure 19.7.

print(fit_mix_rt2)

##              mean     2.5%    97.5% n_eff Rhat
## alpha        5.78     5.72     5.85   994 1.01
## beta         0.02    -0.04     0.08  2452 1.00
## sigma        0.38     0.34     0.42  1037 1.01
## gamma        5.07     4.61     5.48   729 1.01
## sigma2       0.45     0.25     0.61   790 1.00
## p_task       0.81     0.57     0.95   789 1.01
## lp__     -6331.84 -6335.92 -6329.39  1405 1.00

traceplot(fit_mix_rt2)

FIGURE 19.7: Traceplots from the model mixture_rt2.stan fit to simulated data.

19.1.3 A multivariate implementation of the fast-guess model

A problem with the previous implementation of the fast-guess model is that we ignore the
accuracy information in the data. We can implement a version that is closer to the verbal
description of the model: In particular, we also want to model the fact that accuracy is at
chance level in the fast-guessing mode and that accuracy approaches 100% during the task-
engaged mode.

This means that the mixture affects two pairs of distributions:

$$z_n \sim \mathit{Bernoulli}(p_{task})$$

The response time distribution:

$$rt_n \sim \begin{cases} \mathit{LogNormal}(\alpha + \beta \cdot x_n, \sigma), & \text{if } z_n = 1 \\ \mathit{LogNormal}(\gamma, \sigma_2), & \text{if } z_n = 0 \end{cases} \tag{19.3}$$

and an accuracy distribution:

$$acc_n \sim \begin{cases} \mathit{Bernoulli}(p_{correct}), & \text{if } z_n = 1 \\ \mathit{Bernoulli}(0.5), & \text{if } z_n = 0 \end{cases} \tag{19.4}$$

We have a new parameter $p_{correct}$, which represents the probability of giving a correct answer in the engaged mode. The verbal description says that it is close to 100%, and here we have the freedom to choose whatever prior we believe represents values close to 100% accuracy. We translate this belief into a prior as follows; our prior choice is relatively informative but does not impose a hard constraint; if a subject consistently shows relatively low (or high) accuracy, $p_{correct}$ will change accordingly:

$$p_{correct} \sim \mathit{Beta}(995, 5)$$
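To get a feel for what this prior implies, we can inspect its mean and the bulk of its mass in R (our own quick check, not code from the book's packages):

# Mean and 95% interval implied by Beta(995, 5):
995 / (995 + 5)
qbeta(c(.025, .975), shape1 = 995, shape2 = 5)
# mean 0.995; roughly 95% of the prior mass lies above 0.99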

In our simulated data, we assume that the global motion detection task is done by a very
accurate subject, with an accuracy of 99.9%.

First, simulate reaction times, as done earlier:

N <- 1000
x <- c(rep(-.5, N / 2), rep(.5, N / 2)) # difficulty
alpha <- 5.8
beta <- 0.05
sigma <- 0.4
sigma2 <- 0.5
gamma <- 5.2  # fast-guess location
p_task <- 0.8 # probability of being on task
z <- rbern(n = N, prob = p_task)
rt <- if_else(z == 1,
              rlnorm(N,
                     meanlog = alpha + beta * x,
                     sdlog = sigma),
              rlnorm(N,
                     meanlog = gamma,
                     sdlog = sigma2))

Simulate accuracy and include both reaction times and accuracy in the simulated data set:

p_correct <- 0.999
acc <- ifelse(z, rbern(n = N, p_correct),
              rbern(n = N, 0.5))
df_dots_simdata3 <- tibble(trial = 1:N,
                           x = x,
                           rt = rt,
                           acc = acc) %>%
  mutate(diff = if_else(x == 0.5, "hard", "easy"))

Plot the simulated data in Figure 19.8. This time we can see the effect of task difficulty on the
simulated response times and accuracy:

ggplot(df_dots_simdata3, aes(x = factor(acc), y = rt)) +
  geom_point(position = position_jitter(width = .4, height = 0),
             alpha = .5) +
  facet_wrap(diff ~ .) +
  xlab("Accuracy") +
  ylab("Response time")

FIGURE 19.8: Response times by accuracy accounting for task difficulty in the simulated data
( df_dots_simdata3 ) that follows the fast-guess model.
Next, we need to marginalize out the discrete parameter $z_n$ from both pairs of distributions.

$$\begin{aligned} p(rt_n, acc_n \mid \Theta) = \; & p_{task} \cdot \mathit{LogNormal}(rt_n \mid \alpha + \beta \cdot x_n, \sigma) \cdot \mathit{Bernoulli}(acc_n \mid p_{correct}) \; + \\ & (1 - p_{task}) \cdot \mathit{LogNormal}(rt_n \mid \gamma, \sigma_2) \cdot \mathit{Bernoulli}(acc_n \mid 0.5) \end{aligned}$$

In log-space:

$$\begin{aligned} \log(p(rt_n, acc_n \mid \Theta)) = \log\big( & \exp\big(\log(p_{task}) + \log(\mathit{LogNormal}(rt_n \mid \alpha + \beta \cdot x_n, \sigma)) + \log(\mathit{Bernoulli}(acc_n \mid p_{correct}))\big) \; + \\ & \exp\big(\log(1 - p_{task}) + \log(\mathit{LogNormal}(rt_n \mid \gamma, \sigma_2)) + \log(\mathit{Bernoulli}(acc_n \mid 0.5))\big)\big) \end{aligned}$$

Our model translates to the following Stan code ( mixture_rtacc.stan ):

data {
  int<lower = 1> N;
  vector[N] x;
  vector[N] rt;
  array[N] int acc;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
  real<upper = alpha> gamma;
  real<lower = 0> sigma2;
  real<lower = 0, upper = 1> p_correct;
  real<lower = 0, upper = 1> p_task;
}
model {
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .3);
  target += normal_lpdf(sigma | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(gamma | 6, 1) -
    normal_lcdf(alpha | 6, 1);
  target += normal_lpdf(sigma2 | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += beta_lpdf(p_correct | 995, 5);
  target += beta_lpdf(p_task | 8, 2);
  for(n in 1:N)
    target += log_sum_exp(log(p_task) +
                          lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma) +
                          bernoulli_lpmf(acc[n] | p_correct),
                          log1m(p_task) +
                          lognormal_lpdf(rt[n] | gamma, sigma2) +
                          bernoulli_lpmf(acc[n] | .5));
}

Next, set up the data in list format:

ls_dots_simdata <- list(N = N,
                        rt = rt,
                        x = x,
                        acc = acc)

Then fit the model:

mixture_rtacc <- system.file("stan_models",
                             "mixture_rtacc.stan",
                             package = "bcogsci")
fit_mix_rtacc <- stan(mixture_rtacc, data = ls_dots_simdata)

We see that our model can be fit to both response times and accuracy, and its parameter estimates have sensible values (given the fixed parameter values we used to generate our simulated data).

print(fit_mix_rtacc)

##                 mean     2.5%    97.5% n_eff Rhat
## alpha           5.79     5.76     5.82  4238    1
## beta            0.02    -0.04     0.08  5445    1
## sigma           0.38     0.36     0.41  5539    1
## gamma           5.17     5.07     5.27  3742    1
## sigma2          0.50     0.43     0.57  3999    1
## p_correct       0.99     0.99     1.00  4608    1
## p_task          0.80     0.76     0.84  5089    1
## lp__        -6607.59 -6612.33 -6604.88  1864    1

We will evaluate the recovery of the parameters more carefully when we deal with the
hierarchical version of the fast-guess model in section 19.1.5. Before we extend this model
hierarchically, let us also take into account the instructions given to the subjects.
19.1.4 An implementation of the fast-guess model that takes instructions into account

The actual global motion detection experiment that we started with has another manipulation that can help us to better evaluate the fast-guess model. In some trials, the instructions emphasized accuracy (e.g., “Be as accurate as possible.”) and in others speed (e.g., “Be as fast as possible.”). The fast-guess model also assumes that the probability of being in one of the two states depends on the speed incentives given in the instructions. This entails that $p_{task}$ now depends on the instructions $x_2$, where we encode a speed incentive with $-0.5$ and an accuracy incentive with $0.5$. Essentially, we need to fit the following regression:

$$\alpha_{task} + x_2 \cdot \beta_{task}$$

As we did with MPT models in the previous chapter (in section 18.2.3), we need to bound the
previous regression between 0 and 1; we achieve this using the logistic or inverse logit
function:

$$p_{task} = logit^{-1}(\alpha_{task} + x_2 \cdot \beta_{task})$$

This means that we need to interpret $\alpha_{task} + x_2 \cdot \beta_{task}$ in log-odds space, which has the range $(-\infty, \infty)$, rather than in probability space $[0, 1]$; also see section 18.2.3.
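As a quick illustration of this mapping (our own R sketch with made-up values, not the model's estimates), qlogis() and plogis() implement the logit and its inverse:

alpha_task <- qlogis(.85)    # logit: probability to log-odds; approx 1.73
plogis(alpha_task + .5 * .5) # accuracy emphasis (x2 = .5): approx 0.88
plogis(alpha_task - .5 * .5) # speed emphasis (x2 = -.5): approx 0.82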

The likelihood (defined before in section 19.1.3) remains the same:

$$z_n \sim \mathit{Bernoulli}(p_{task})$$

A response time distribution is defined:

$$rt_n \sim \begin{cases} \mathit{LogNormal}(\alpha + \beta \cdot x_n, \sigma), & \text{if } z_n = 1 \\ \mathit{LogNormal}(\gamma, \sigma_2), & \text{if } z_n = 0 \end{cases}$$

and an accuracy distribution is defined as well:

$$acc_n \sim \begin{cases} \mathit{Bernoulli}(p_{correct}), & \text{if } z_n = 1 \\ \mathit{Bernoulli}(0.5), & \text{if } z_n = 0 \end{cases}$$

The only further change in our model is that rather than a prior on $p_{task}$, we now need priors for $\alpha_{task}$ and $\beta_{task}$, which are on the log-odds scale.

For $\beta_{task}$, we assume an effect that can be rather large, and we won't assume a direction a priori (for now):

$$\beta_{task} \sim \mathit{Normal}(0, 1)$$
This means that the subject could be affected by the instructions in the expected way, with an increased probability of being task-engaged, leading to better accuracy, when the instructions emphasize accuracy ($\beta_{task} > 0$). Alternatively, the subject might behave in an unexpected way, with a decreased probability of being task-engaged, leading to worse accuracy, when the instructions emphasize accuracy ($\beta_{task} < 0$). The latter situation, $\beta_{task} < 0$, could represent the instructions being misunderstood. It's certainly possible to include a prior that encodes the expected direction of the effect instead: $\mathit{Normal}_+(0, 1)$. Unless there is a compelling reason to constrain the prior in this way, following Cromwell's rule (Box 6.2), we leave open the possibility of the $\beta_{task}$ parameter having negative values.

How can we choose a prior for $\alpha_{task}$ that encodes the same information that we previously had in $p_{task}$? One possibility is to create an auxiliary parameter $p_{btask}$, which represents the baseline probability of being engaged in the task, give it the same prior that we used in the previous section, and then transform it to an unconstrained space for our regression with the logit function:

$$p_{btask} \sim \mathit{Beta}(8, 2)$$

$$\alpha_{task} = logit(p_{btask})$$

To verify that our priors make sense, in Figure 19.9 we plot the difference in prior predicted
probability of being engaged in the task under the two emphasis conditions:

Ns <- 1000 # number of samples for the plot
# Priors:
p_btask <- rbeta(n = Ns, shape1 = 8, shape2 = 2)
beta_task <- rnorm(n = Ns, mean = 0, sd = 1)
# Predicted probability of being engaged:
p_task_easy <- plogis(qlogis(p_btask) + 0.5 * beta_task)
p_task_hard <- plogis(qlogis(p_btask) + -0.5 * beta_task)
# Predicted difference:
diff_p_pred <- tibble(diff = p_task_easy - p_task_hard)

diff_p_pred %>%
  ggplot(aes(diff)) +
  geom_histogram()

FIGURE 19.9: The difference in prior predicted probability of being engaged in the task under
the two emphasis conditions for the simulated data ( diff_p_pred ) that follows the fast-guess
model.
Figure 19.9 shows that we are predicting a priori that the difference in $p_{task}$ will tend to be smaller than $\pm 0.3$, which seems to make sense intuitively. If we had more information about the likely range of variation, we could of course have adapted the prior to reflect that belief.

We are ready to generate a new data set, by deciding on some fixed values for $\beta_{task}$ and $p_{btask}$:

N <- 1000
x <- c(rep(-0.5, N / 2), rep(0.5, N / 2)) # difficulty
x2 <- rep(c(-0.5, 0.5), N / 2)            # instructions
# Verify that the predictors are crossed:
predictors <- tibble(x, x2)
xtabs(~ x + x2, predictors)

##       x2
## x      -0.5 0.5
##   -0.5  250 250
##   0.5   250 250

alpha <- 5.8
beta <- 0.05
sigma <- 0.4
sigma2 <- 0.5
gamma <- 5.2
p_correct <- 0.999
# New true values:
p_btask <- 0.85
beta_task <- 0.5
# Generate data:
alpha_task <- qlogis(p_btask)
p_task <- plogis(alpha_task + x2 * beta_task)
z <- rbern(N, prob = p_task)
rt <- ifelse(z,
             rlnorm(N, meanlog = alpha + beta * x, sdlog = sigma),
             rlnorm(N, meanlog = gamma, sdlog = sigma2))
acc <- ifelse(z, rbern(N, p_correct),
              rbern(N, 0.5))
df_dots_simdata4 <- tibble(trial = 1:N,
                           x = x,
                           rt = rt,
                           acc = acc,
                           x2 = x2) %>%
  mutate(diff = if_else(x == 0.5, "hard", "easy"),
         emphasis = ifelse(x2 == 0.5, "accuracy", "speed"))

We can generate a plot now where both the difficulty of the task and the instructions are
manipulated; see Figure 19.10.

ggplot(df_dots_simdata4, aes(x = factor(acc), y = rt)) +
  geom_point(position = position_jitter(width = .4, height = 0),
             alpha = .5) +
  facet_wrap(diff ~ emphasis) +
  xlab("Accuracy") +
  ylab("Response time")


FIGURE 19.10: Response times and accuracy by the difficulty of the task and the instructions
type for the simulated data ( df_dots_simdata4 ) that follows the fast-guess model.

In the Stan implementation, log_inv_logit(x) applies the logistic (or inverse logit) function to x to transform it into a probability and then applies the logarithm; log1m_inv_logit(x) applies the logistic function to x, and then applies the logarithm to its complement $(1 - p)$. We do this because rather than having p_task in probability space, we have lodds_task in log-odds space:

real lodds_task = logit(p_btask) + x2[n] * beta_task;

The parameter lodds_task estimates the mixing probabilities in log-odds:


target += log_sum_exp(log_inv_logit(lodds_task) + ...,
log1m_inv_logit(lodds_task) + ...)

We also add a generated quantities block that can be used for further (prior or posterior) predictive checks. In this block we do use z as an indicator of the latent class (task-engaged mode or fast-guessing mode), since we do not estimate z, but rather generate it based on the posteriors of the parameters.

We use the dummy variable onlyprior to indicate whether we use the data or only sample from the priors. One can always do the predictive checks in R, transforming the code that we wrote for the simulation into a function, and writing the priors in R. However, it can be simpler to take advantage of Stan's output format and rewrite the code in Stan. One downside of this is that the stanfit object that stores the model output can become too large for the memory of the computer.

data {
  int<lower = 1> N;
  vector[N] x;
  vector[N] rt;
  array[N] int acc;
  vector[N] x2; // speed or accuracy emphasis
  int<lower = 0, upper = 1> onlyprior;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
  real<upper = alpha> gamma;
  real<lower = 0> sigma2;
  real<lower = 0, upper = 1> p_correct;
  real<lower = 0, upper = 1> p_btask;
  real beta_task;
}
model {
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .1);
  target += normal_lpdf(sigma | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(gamma | 6, 1) -
    normal_lcdf(alpha | 6, 1);
  target += normal_lpdf(sigma2 | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(beta_task | 0, 1);
  target += beta_lpdf(p_correct | 995, 5);
  target += beta_lpdf(p_btask | 8, 2);
  if(onlyprior != 1)
    for(n in 1:N){
      real lodds_task = logit(p_btask) + x2[n] * beta_task;
      target += log_sum_exp(log_inv_logit(lodds_task) +
                            lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma) +
                            bernoulli_lpmf(acc[n] | p_correct),
                            log1m_inv_logit(lodds_task) +
                            lognormal_lpdf(rt[n] | gamma, sigma2) +
                            bernoulli_lpmf(acc[n] | .5));
    }
}
generated quantities {
  array[N] real rt_pred;
  array[N] real acc_pred;
  array[N] int z;
  for(n in 1:N){
    real lodds_task = logit(p_btask) + x2[n] * beta_task;
    z[n] = bernoulli_rng(inv_logit(lodds_task));
    if(z[n] == 1){
      rt_pred[n] = lognormal_rng(alpha + x[n] * beta, sigma);
      acc_pred[n] = bernoulli_rng(p_correct);
    } else {
      rt_pred[n] = lognormal_rng(gamma, sigma2);
      acc_pred[n] = bernoulli_rng(.5);
    }
  }
}

We save the code as mixture_rtacc2.stan , and before fitting it to the simulated data, we
perform prior predictive checks.

19.1.4.1 Prior predictive checks

Generate prior predictive distributions by setting onlyprior to 1.

ls_dots_simdata <- list(N = N,
                        rt = rt,
                        x = x,
                        x2 = x2,
                        acc = acc,
                        onlyprior = 1)
mixture_rtacc2 <- system.file("stan_models",
                              "mixture_rtacc2.stan",
                              package = "bcogsci")

fit_mix_rtacc2_priors <- stan(mixture_rtacc2,
                              data = ls_dots_simdata,
                              chains = 1, iter = 2000)

We plot prior predictive distributions of response times with the function ppd_dens_overlay(); see Figure 19.11.

rt_pred <- extract(fit_mix_rtacc2_priors)$rt_pred
ppd_dens_overlay(rt_pred[1:100, ]) +
  coord_cartesian(xlim = c(0, 2000))
FIGURE 19.11: Prior predictive distributions of response times (using mixture_rtacc2.stan ) that follow the fast-guess model.
Some of the predictive data sets contain responses that are too large, and some of them have too much probability mass close to zero, but there is nothing clearly wrong in the prior predictive distributions (considering that the model hasn't “seen” the data yet).

If we want to plot the prior predicted distribution of differences in response time conditional on task difficulty, we need to define a new function. Then we use the bayesplot function ppd_stat(), whose stat argument takes any summary function; see Figure 19.12.

meanrt_diff <- function(rt) {
  mean(rt[x == .5]) -
    mean(rt[x == -.5])
}
ppd_stat(rt_pred, stat = meanrt_diff)

FIGURE 19.12: Prior predicted distribution (using mixture_rtacc2.stan ) of differences in response time, conditional on task difficulty.
We find that the range of response times looks reasonable. There are, however, always more checks that could be done; examples include plotting other summary statistics, or predictions conditioned on other aspects of the data.

19.1.4.2 Fit to simulated data

Fit the model to data, by setting onlyprior = 0 :

ls_dots_simdata <- list(N = N,
                        rt = rt,
                        x = x,
                        x2 = x2,
                        acc = acc,
                        onlyprior = 0)

fit_mix_rtacc2 <- stan(mixture_rtacc2, data = ls_dots_simdata)

print(fit_mix_rtacc2,
      pars = c("alpha", "beta", "sigma", "gamma", "sigma2",
               "p_correct", "p_btask", "beta_task"))

##            mean  2.5% 97.5% n_eff Rhat
## alpha      5.82  5.78  5.85  4197    1
## beta       0.05  0.00  0.11  5863    1
## sigma      0.40  0.38  0.42  4873    1
## gamma      5.12  5.00  5.24  3469    1
## sigma2     0.46  0.38  0.54  3806    1
## p_correct  0.99  0.99  1.00  3680    1
## p_btask    0.86  0.83  0.89  4006    1
## beta_task  0.62  0.17  1.10  5952    1

We see that we fit the model without problems. Before we evaluate the recovery of the
parameters more carefully, we implement a hierarchical version of the fast-guess model.

19.1.5 A hierarchical implementation of the fast-guess model

So far we have evaluated the behavior of one simulated subject. We discussed before (in the
context of distributional regression models, in section 5.2.6, and in the MPT modeling chapter
18) that, in principle, every parameter in a model can be made hierarchical. However, this
doesn’t guarantee that we’ll learn anything from the data for those parameters, or that our
model will converge. A safe approach here is to start simple, using simulated data. Despite the
fact that convergence with simulated data does not guarantee the convergence of the same
model with real data, in our experience the reverse is in general true.

For our hierarchical version, we assume that both response times and the effect of task
difficulty vary by subject, and that different subjects have different guessing times. This entails
the following change to the response time distribution:

$$rt_n \sim \begin{cases} \mathit{LogNormal}(\alpha + u_{subj[n],1} + x_n \cdot (\beta + u_{subj[n],2}),\ \sigma), & \text{if } z_n = 1 \\ \mathit{LogNormal}(\gamma + u_{subj[n],3},\ \sigma_2), & \text{if } z_n = 0 \end{cases}$$
We assume that the three vectors of $u$ (the adjustments to the intercept and slope of the task-engaged distribution, and the adjustment to the guessing time distribution) follow a multivariate normal distribution centered on zero. For simplicity, and for lack of any prior knowledge about this experimental design and method, we assume the same (weakly informative) prior distribution for the three variance components, and the same regularizing LKJ prior for the three correlations between the adjustments ($\rho_{u_{1,2}}$, $\rho_{u_{1,3}}$, $\rho_{u_{2,3}}$):

$$u \sim \mathcal{N}(0, \Sigma_u)$$

$$\tau_{u_{1..3}} \sim \mathit{Normal}_+(0, 0.5)$$

$$\rho_u \sim \mathit{LKJcorr}(2)$$

Before we fit the model to the real data set, we simulate data again; this time we simulate 100 trials for each of 20 subjects.

# Build the fake stimuli:
N_subj <- 20
N_trials <- 100
# True parameter values:
alpha <- 5.8
beta <- 0.05
sigma <- 0.4
sigma2 <- 0.5
gamma <- 5.2
beta_task <- 0.1
p_btask <- 0.85
alpha_task <- qlogis(p_btask)
p_correct <- 0.999
tau_u <- c(0.2, 0.005, 0.3)
rho <- 0.3

## Build the data set here:
N <- N_subj * N_trials
stimuli <- tibble(x = rep(c(rep(-.5, N_trials / 2),
                            rep(.5, N_trials / 2)), N_subj),
                  x2 = rep(rep(c(-.5, .5), N_trials / 2), N_subj),
                  subj = rep(1:N_subj, each = N_trials),
                  trial = rep(1:N_trials, N_subj))
stimuli

## # A tibble: 2,000 × 4
##       x    x2  subj trial
##   <dbl> <dbl> <int> <int>
## 1  -0.5  -0.5     1     1
## 2  -0.5   0.5     1     2
## 3  -0.5  -0.5     1     3
## # … with 1,997 more rows

# Build the correlation matrix for the adjustments u:
Cor_u <- matrix(rep(rho, 9), nrow = 3)
diag(Cor_u) <- 1
Cor_u

##      [,1] [,2] [,3]
## [1,]  1.0  0.3  0.3
## [2,]  0.3  1.0  0.3
## [3,]  0.3  0.3  1.0

# Variance-covariance matrix for subjects:
Sigma_u <- diag(tau_u, 3, 3) %*% Cor_u %*% diag(tau_u, 3, 3)
# Create the correlated adjustments:
u <- mvrnorm(n = N_subj, c(0, 0, 0), Sigma_u)
# Check whether they are correctly correlated.
# (There will be some random variation, but if you increase
# the number of observations and the value of the correlation,
# you'll obtain a more exact value below.)
cor(u)

##       [,1]  [,2]  [,3]
## [1,] 1.000 0.514 0.314
## [2,] 0.514 1.000 0.245
## [3,] 0.314 0.245 1.000

# Check the SDs:
sd(u[, 1]); sd(u[, 2]); sd(u[, 3])

## [1] 0.221
## [1] 0.00586
## [1] 0.295

# Create the simulated data:
df_dots_simdata <- stimuli %>%
  mutate(z = rbern(N, prob = plogis(alpha_task + x2 * beta_task)),
         rt = ifelse(z,
                     rlnorm(N, meanlog = alpha + u[subj, 1] +
                              (beta + u[subj, 2]) * x,
                            sdlog = sigma),
                     rlnorm(N, meanlog = gamma + u[subj, 3],
                            sdlog = sigma2)),
         acc = ifelse(z, rbern(n = N, p_correct),
                      rbern(n = N, 0.5)),
         diff = if_else(x == 0.5, "hard", "easy"),
         emphasis = ifelse(x2 == 0.5, "accuracy", "speed"))

Verify that the distribution of the simulated response times, conditional on the simulated accuracy and the experimental manipulations, makes sense; see the plot in Figure 19.13.

ggplot(df_dots_simdata, aes(x = factor(acc), y = rt)) +
  geom_point(position = position_jitter(width = .4,
                                        height = 0), alpha = .5) +
  facet_wrap(diff ~ emphasis) +
  xlab("Accuracy") +
  ylab("Response time")

FIGURE 19.13: The distribution of response times conditional on the simulated accuracy and
the experimental manipulations for the simulated hierarchical data ( df_dots_simdata ) that
follows the fast-guess model.
We implement the model in Stan as follows in mixture_h.stan . The hierarchical extension
uses the Cholesky factorization for the group-level effects (as we did in section 11.1.3).

data {
  int<lower = 1> N;
  vector[N] x;
  vector[N] rt;
  array[N] int acc;
  vector[N] x2;
  int<lower = 1> N_subj;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
  real<upper = alpha> gamma;
  real<lower = 0> sigma2;
  real<lower = 0, upper = 1> p_correct;
  real<lower = 0, upper = 1> p_btask;
  real beta_task;
  vector<lower = 0>[3] tau_u;
  matrix[3, N_subj] z_u;
  cholesky_factor_corr[3] L_u;
}
transformed parameters {
  matrix[N_subj, 3] u;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
}
model {
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .1);
  target += normal_lpdf(sigma | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(gamma | 6, 1) -
    normal_lcdf(alpha | 6, 1);
  target += normal_lpdf(sigma2 | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(tau_u | 0, .5)
    - 3 * normal_lccdf(0 | 0, .5);
  target += normal_lpdf(beta_task | 0, 1);
  target += beta_lpdf(p_correct | 995, 5);
  target += beta_lpdf(p_btask | 8, 2);
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += std_normal_lpdf(to_vector(z_u));
  for(n in 1:N){
    real lodds_task = logit(p_btask) + x2[n] * beta_task;
    target += log_sum_exp(log_inv_logit(lodds_task) +
                          lognormal_lpdf(rt[n] | alpha + u[subj[n], 1] +
                                         x[n] * (beta + u[subj[n], 2]), sigma) +
                          bernoulli_lpmf(acc[n] | p_correct),
                          log1m_inv_logit(lodds_task) +
                          lognormal_lpdf(rt[n] | gamma + u[subj[n], 3], sigma2) +
                          bernoulli_lpmf(acc[n] | .5));
  }
}
generated quantities {
  corr_matrix[3] rho_u = L_u * L_u';
}

Save the model code and fit it to the simulated data:

ls_dots_simdata <- list(N = N,
                        rt = df_dots_simdata$rt,
                        x = df_dots_simdata$x,
                        x2 = df_dots_simdata$x2,
                        acc = df_dots_simdata$acc,
                        subj = df_dots_simdata$subj,
                        N_subj = N_subj)
mixture_h <- system.file("stan_models",
                         "mixture_h.stan",
                         package = "bcogsci")
fit_mix_h <- stan(file = mixture_h,
                  data = ls_dots_simdata,
                  iter = 3000,
                  control = list(adapt_delta = .9))

Print the posterior summary:

print(fit_mix_h, pars = c("alpha", "beta", "sigma", "gamma", "sigma2",
                          "p_correct", "p_btask", "beta_task", "tau_u",
                          "rho_u[1,2]", "rho_u[1,3]", "rho_u[2,3]"))


## mean 2.5% 97.5% n_eff Rhat
## alpha 5.73 5.65 5.82 821 1.00
## beta 0.05 0.00 0.09 6061 1.00
## sigma 0.40 0.38 0.42 8705 1.00
## gamma 5.16 4.93 5.38 2383 1.00

## sigma2 0.55 0.49 0.61 8497 1.00


## p_correct 1.00 0.99 1.00 8264 1.00
## p_btask 0.85 0.83 0.88 8341 1.00
## beta_task 0.07 -0.24 0.39 9331 1.00
## tau_u[1] 0.19 0.13 0.28 1549 1.00
## tau_u[2] 0.06 0.00 0.14 1499 1.00

## tau_u[3] 0.46 0.31 0.67 2519 1.00


## rho_u[1,2] 0.04 -0.60 0.64 7077 1.00
## rho_u[1,3] 0.04 -0.40 0.46 2717 1.00
## rho_u[2,3] -0.08 -0.71 0.58 615 1.01

We see that we can fit the hierarchical extension of our model to simulated data. Next we’ll
evaluate whether we can recover the true values of the parameters.

19.1.5.1 Recovery of the parameters

By “recovering” the true values of the parameters, we mean that the true values are
somewhere inside the bulk of the posterior distribution of the model.

We use mcmc_recover_hist() to compare the posterior distributions of the relevant parameters of the model with their true values in Figure 19.14.

df_fit_mix_h <- fit_mix_h %>%
  as.data.frame() %>%
  select(c("alpha", "beta", "sigma", "gamma", "sigma2",
           "p_correct", "p_btask", "beta_task", "tau_u[1]",
           "tau_u[2]", "tau_u[3]", "rho_u[1,2]", "rho_u[1,3]",
           "rho_u[2,3]"))
true_values <- c(alpha, beta, sigma, gamma, sigma2,
                 p_correct, p_btask, beta_task, tau_u[1],
                 tau_u[2], tau_u[3], rep(rho, 3))
mcmc_recover_hist(df_fit_mix_h, true_values)


FIGURE 19.14: Posterior distributions of the main parameters of the mixture model
fit_mix_h together with their true values.

The model seems to be underestimating the probability of subjects being correct ( p_correct ) and overestimating the probability of being engaged in the task ( p_btask ). However, the numerical differences are very small. We can be relatively certain that the model is not seriously misspecified. As mentioned in previous chapters, a more principled (and computationally demanding) approach uses simulation-based calibration, introduced in section 12.2 of chapter 12 (also see Talts et al. 2018; Schad, Betancourt, and Vasishth 2020).
19.1.5.2 Fitting the model to real data

After verifying that our model works as expected, we are ready to fit it to real data. We code
the predictors x and x2 as we did for the simulated data:

df_dots <- df_dots %>%
  mutate(x = if_else(diff == "easy", -.5, .5),
         x2 = if_else(emphasis == "accuracy", .5, -.5))

The main obstacle now is that fitting the entire data set takes around 12 hours! We'll sample 600 observations from each subject as follows:

df_dots_data_short <- df_dots %>%
  group_by(subj) %>%
  sample_n(600)

ls_dots_data_short <-
  list(N = nrow(df_dots_data_short),
       rt = df_dots_data_short$rt,
       x = df_dots_data_short$x,
       x2 = df_dots_data_short$x2,
       acc = df_dots_data_short$acc,
       subj = as.numeric(df_dots_data_short$subj),
       N_subj = length(unique(df_dots_data_short$subj)))
fit_mix_data <- stan(file = mixture_h,
                     data = ls_dots_data_short,
                     chains = 4,
                     iter = 2000,
                     control = list(adapt_delta = .9,
                                    max_treedepth = 12))
## Warning: There were 1 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
##
## Warning: Examine the pairs() plot to diagnose sampling problems

##
## Warning: The largest R-hat is NA, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
##
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating

## posterior means and medians may be unreliable. Running the chains for
## more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
##
## Warning: Tail Effective Samples Size (ESS) is too low, indicating

## posterior variances and tail quantiles may be unreliable. Running the


## chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

The model has not converged at all!

print(fit_mix_data,
      pars = c("alpha", "beta", "sigma", "gamma", "sigma2",
               "p_correct", "p_btask", "beta_task", "tau_u",
               "rho_u[1,2]", "rho_u[1,3]", "rho_u[2,3]"))
##             mean  2.5% 97.5% n_eff  Rhat
## alpha       6.36  6.19  6.48     2  2.30
## beta        0.10  0.03  0.15     2  3.05
## sigma       0.26  0.22  0.29     2 11.81
## gamma       6.18  6.04  6.34     2  2.33
## sigma2      0.28  0.19  0.42     2 18.82
## p_correct   0.93  0.86  1.00     2 16.27
## p_btask     0.85  0.68  0.94     2  7.95
## beta_task   1.83 -3.56  5.48     2 15.12
## tau_u[1]    0.14  0.10  0.20    18  1.08
## tau_u[2]    0.04  0.01  0.07     6  1.23
## tau_u[3]    0.26  0.11  0.69     2  3.58
## rho_u[1,2]  0.37 -0.14  0.78  3412  1.00
## rho_u[1,3]  0.20 -0.34  0.62    32  1.08
## rho_u[2,3]  0.17 -0.72  0.71     5  1.34

The traceplots in Figure 19.15 show that the chains are not mixing at all. It seems that the
posterior is multimodal, and there are at least two combinations of parameters that would fit
the data equally well.

traceplot(fit_mix_data)

FIGURE 19.15: Traceplots from the hierarchical model ( mixture_h.stan ) fit to (a subset) of
the real data. The traceplot shows clearly that the posterior has at least two modes.
What should we do now? It can be a good idea to back off and simplify the model. Once the
simplified model converges, we can think about adding further complexity. The verbal
description of our model says that the accuracy in the task-engaged mode should be close to
100%. To simplify the model, we’ll assume that it’s exactly 100%. This entails the following:

$$p_{correct} = 1$$

We adapt our Stan code in mixture_h2.stan , reflecting the assumption that p_correct has a
fixed value; this parameter is now in a block called transformed data . There, we assign to
p_correct the value of 1 .

data {
  int<lower = 1> N;
  vector[N] x;
  vector[N] rt;
  array[N] int acc;
  vector[N] x2;
  int<lower = 1> N_subj;
  array[N] int<lower = 1, upper = N_subj> subj;
}
transformed data {
  real p_correct = 1;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
  real<upper = alpha> gamma; // guessing
  real<lower = 0> sigma2;
  real<lower = 0, upper = 1> p_btask;
  real beta_task;
  vector<lower = 0>[3] tau_u;
  matrix[3, N_subj] z_u;
  cholesky_factor_corr[3] L_u;
}
transformed parameters {
  matrix[N_subj, 3] u;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
}
model {
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .1);
  target += normal_lpdf(sigma | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(gamma | 6, 1) -
    normal_lcdf(alpha | 6, 1);
  target += normal_lpdf(sigma2 | .5, .2)
    - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(tau_u | 0, .5)
    - 3 * normal_lccdf(0 | 0, .5);
  target += normal_lpdf(beta_task | 0, 1);
  target += beta_lpdf(p_btask | 8, 2);
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += std_normal_lpdf(to_vector(z_u));
  for(n in 1:N){
    real lodds_task = logit(p_btask) + x2[n] * beta_task;
    target += log_sum_exp(log_inv_logit(lodds_task) +
                          lognormal_lpdf(rt[n] | alpha + u[subj[n], 1] +
                                         x[n] * (beta + u[subj[n], 2]), sigma) +
                          bernoulli_lpmf(acc[n] | p_correct),
                          log1m_inv_logit(lodds_task) +
                          lognormal_lpdf(rt[n] | gamma + u[subj[n], 3], sigma2) +
                          bernoulli_lpmf(acc[n] | .5));
  }
}
generated quantities {
  corr_matrix[3] rho_u = L_u * L_u';
}

Fit the model again to the same data:

mixture_h2 <- system.file("stan_models",
                          "mixture_h2.stan",
                          package = "bcogsci")
fit_mixs_data <- stan(file = mixture_h2,
                      data = ls_dots_data_short,
                      chains = 4,
                      iter = 2000,
                      control = list(adapt_delta = .9,
                                     max_treedepth = 12))

The model has now converged:

print(fit_mixs_data,
      pars = c("alpha", "beta", "sigma", "gamma", "sigma2",
               "p_btask", "beta_task", "tau_u",
               "rho_u[1,2]", "rho_u[1,3]", "rho_u[2,3]"))

##             mean  2.5% 97.5% n_eff Rhat
## alpha       6.34  6.29  6.40   667    1
## beta        0.10  0.07  0.12  1841    1
## sigma       0.22  0.21  0.23  4727    1
## gamma       6.29  6.21  6.35  1254    1
## sigma2      0.43  0.42  0.44  4386    1
## p_btask     0.69  0.68  0.70  4686    1
## beta_task   0.88  0.75  1.02  4608    1
## tau_u[1]    0.14  0.10  0.20   942    1
## tau_u[2]    0.05  0.03  0.08  1737    1
## tau_u[3]    0.19  0.13  0.27  1552    1
## rho_u[1,2]  0.40 -0.07  0.75  2194    1
## rho_u[1,3]  0.15 -0.28  0.53  1390    1
## rho_u[2,3] -0.03 -0.48  0.44  1071    1

The traceplots in Figure 19.16 show that this time the chains are mixing well.

traceplot(fit_mixs_data)

FIGURE 19.16: Traceplots from the simplified hierarchical model ( mixture_h2.stan , assuming that p_correct = 1) fit to (a subset of) the real data. The traceplot shows that the chains are mixing well.
What can we say about the fit of the model now?

Under the assumptions that we have made (e.g., that there are two processing modes,
response times are affected by the difficulty of the task in the task-engaged mode, accuracy is
not affected by the difficulty of the task and is perfect at the task-engaged mode, etc.), we can
look at the parameters and conclude the following:

The instructions seemed to have a strong effect on the processing mode of the subjects ( beta_task is relatively high), and in the expected direction (emphasis on accuracy led to a higher probability of being in the task-engaged mode).
The guessing mode seemed to be much noisier than the task-engaged mode (compare sigma with sigma2 ).
Slower subjects seemed to show a stronger effect of the experimental manipulation ( rho_u[1,2] is mostly positive).

If we want to know whether our model achieves descriptive adequacy, we need to look at the posterior predictive distributions of the model. However, by using posterior predictive checks, we won't be able to conclude that our model is not overfitting. Our success in fitting the fast-guess model to real data does not entail that the model is a good account of the data; it just means that it's flexible enough to fit the data. One further step could be to develop a competing model, and then to compare the performance of the models using Bayes factors or cross-validation.

19.1.5.3 Posterior predictive checks

For the posterior predictive checks, we can write the generated quantities block in a new file, mixture_h_gen.stan . The advantage is that we can generate as many observations as needed after estimating the parameters. There is no model block in this Stan program. We use the gqs() function in the rstan library, which allows us to use the posterior draws from a previously fitted model to generate posterior predicted data.


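The contents of mixture_h_gen.stan were not reproduced in the extracted text above; the following is a plausible sketch (our reconstruction, not necessarily the exact file shipped in the bcogsci package). As far as we know, gqs() requires the data and parameters declarations to match the fitted model (here, mixture_h2.stan ), with the predictive simulation moved to generated quantities:

data {
  int<lower = 1> N;
  vector[N] x;
  vector[N] rt;
  array[N] int acc;
  vector[N] x2;
  int<lower = 1> N_subj;
  array[N] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
  real<upper = alpha> gamma;
  real<lower = 0> sigma2;
  real<lower = 0, upper = 1> p_btask;
  real beta_task;
  vector<lower = 0>[3] tau_u;
  matrix[3, N_subj] z_u;
  cholesky_factor_corr[3] L_u;
}
transformed parameters {
  matrix[N_subj, 3] u = (diag_pre_multiply(tau_u, L_u) * z_u)';
}
generated quantities {
  array[N] real rt_pred;
  array[N] real acc_pred;
  for(n in 1:N){
    real lodds_task = logit(p_btask) + x2[n] * beta_task;
    int z = bernoulli_rng(inv_logit(lodds_task));
    if(z == 1){
      rt_pred[n] = lognormal_rng(alpha + u[subj[n], 1] +
                                 x[n] * (beta + u[subj[n], 2]), sigma);
      acc_pred[n] = bernoulli_rng(1); // p_correct is fixed at 1
    } else {
      rt_pred[n] = lognormal_rng(gamma + u[subj[n], 3], sigma2);
      acc_pred[n] = bernoulli_rng(.5);
    }
  }
}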
Generate responses from 500 simulated experiments as follows:


mixture_h_gen <- system.file("stan_models",
                             "mixture_h_gen.stan",
                             package = "bcogsci")
gen_model <- stan_model(mixture_h_gen)
# The argument `drop` of the matrix needs to be set to FALSE,
# otherwise R will simplify the matrix into a vector.
# The two commas in the line below are not a mistake!
draws_par <- as.matrix(fit_mixs_data)[1:500, , drop = FALSE]
gen_mix_data <- gqs(gen_model,
                    data = ls_dots_data_short,
                    draws = draws_par)

First, take a look at the general distribution of response times generated by the posterior
predictive model and by our real data (Figure 19.17).

rt_pred <- extract(gen_mix_data)$rt_pred
ppc_dens_overlay(y = ls_dots_data_short$rt, yrep = rt_pred[1:100, ]) +
  coord_cartesian(xlim = c(0, 2000))



FIGURE 19.17: The posterior predictive distribution of the hierarchical fast-guess model (using
mixture_h_gen.stan ) in comparison with the observed response times.

We see that the distribution of the observed response times is narrower than the predictive
distribution. We are generating response times that are more spread out than the real data.

Next, examine the effect of the experimental manipulation in Figure 19.18: The posterior
predictive check reveals that the model underestimates the observed effect of the
experimental manipulation: the observed difference between response times is well outside
the bulk of the predictive distribution.

meanrt_diff <- function(rt) {
  mean(rt[ls_dots_data_short$x == .5]) -
    mean(rt[ls_dots_data_short$x == -.5])
}
ppc_stat(ls_dots_data_short$rt,
         yrep = rt_pred,
         stat = meanrt_diff)

FIGURE 19.18: Posterior predictive distribution (using mixture_h_gen.stan ) of the difference in response time due to the experimental manipulation. The vertical bar shows the observed difference in the data.
Another important posterior predictive check includes comparing the fit of the model using a
quantile probability plot, which is presented in the next chapter.

We also look at some instances of the predictive distribution. Figure 19.19 shows a simulated
data set in black overlaid onto the real observations in gray. As we noticed in Figure 19.17, the
model is predicting less variability than what we find in the data, especially when the emphasis
is on accuracy.

acc_pred <- extract(gen_mix_data)$acc_pred
df_dots_pred <-
  tibble(rt = ls_dots_data_short$rt,
         acc = ls_dots_data_short$acc,
         difficulty = ifelse(ls_dots_data_short$x == .5,
                             "hard", "easy"),
         emphasis = ifelse(ls_dots_data_short$x2 == -.5,
                           "speed", "accuracy"),
         acc_pred1 = acc_pred[1, ],
         rt_pred1 = rt_pred[1, ])


FIGURE 19.19: A simulated (posterior predictive) data set in black overlaid onto the
observations in gray (based on mixture_h_gen.stan ).

If we would like to compare this model with a competing one using cross-validation, we would need to calculate the point-wise log-likelihood in the generated quantities block:
generated quantities {
  array[N] real log_lik;
  for(n in 1:N){
    real lodds_task = logit(p_btask) + x2[n] * beta_task;
    log_lik[n] = log_sum_exp(log_inv_logit(lodds_task) +
                             lognormal_lpdf(rt[n] | alpha + u[subj[n], 1] +
                                            x[n] * (beta + u[subj[n], 2]), sigma) +
                             bernoulli_lpmf(acc[n] | p_correct),
                             log1m_inv_logit(lodds_task) +
                             lognormal_lpdf(rt[n] | gamma + u[subj[n], 3], sigma2) +
                             bernoulli_lpmf(acc[n] | .5));
  }
}
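With log_lik stored in the fitted object, approximate leave-one-out cross-validation could then proceed along the following lines; this is our own sketch using the loo package, and fit_with_loglik is a hypothetical stanfit object from a model that includes the block above:

library(loo)
# Extract the pointwise log-likelihood, keeping the chain structure:
log_lik <- extract_log_lik(fit_with_loglik, merge_chains = FALSE)
# Relative effective sample sizes, needed by PSIS-LOO:
r_eff <- relative_eff(exp(log_lik))
loo(log_lik, r_eff = r_eff)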

It is important to bear in mind that we can only compare models with the same dependent variable(s). That is, we would need to compare this model with another one fit to the same dependent variables, and also on the same scale: accuracy ( 0 or 1 ) and response time in milliseconds. This means that we cannot compare our fast-guess model with, for example, an accuracy-only model. It also means that to compare our fast-guess model with a model based on left/right choices (known as stimulus coding, see section 20.1.1) and response times, we would need to reparameterize one of the two models; see exercise 20.3 in chapter 20.

To conclude, the fast-guess model shows a relatively decent fit to the data and is able to
account for the speed-accuracy trade-off. The model shows some inaccuracies that could lead
to its revision and improvement. To what extent the inaccuracies are acceptable or not
depends on (i) the empirical finding that we want to account for (for example, we can already
assume that the model will struggle to fit data sets that show slow errors); and (ii) its
comparison with a competing account.

19.2 Summary

In this chapter, we learned to fit increasingly complex two-component mixture models using
Stan, starting with a simple model and ending with a fully hierarchical model. We saw how to
evaluate model fit using the usual prior and posterior predictive checks, and to investigate
parameter recovery. Such mixture models are notoriously difficult to fit, but they have a lot of
potential in cognitive science applications, especially in developing computational models of
different kinds of cognitive processes.

19.3 Further reading

The reader interested in a deeper understanding of marginalization is referred to Pullin, Gurrin, and Vukcevic (2021). Betancourt discusses problems of identification in Bayesian mixture models in a case study (https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html). An in-depth treatment of the fast-guess model and other mixture models of response times is provided in Chapter 7 of Luce (1991).

19.4 Exercises

Exercise 19.1 Changes in the true values.

Change the true value of p_correct to 0.5 and 0.1, and generate data for the non-
hierarchical model. Can you recover the value of this parameter without changing the model
mixture_rtacc2.stan ? Perform posterior predictive checks.

Exercise 19.2 RTs in schizophrenic patients and controls.

Response times for schizophrenic patients in a simple visual tracking experiment show more variability than those for non-schizophrenic controls; see Figure 19.20. It has been argued that at least some of this extra variability arises from an attentional lapse that delays some responses. We'll use the data examined in the analysis of Belin and Rubin (1990) ( df_schizophrenia in the bcogsci package) to investigate some potential models:

M1. Both schizophrenics and controls show attentional lapses, but the lapses are more common in schizophrenics. Other than that, there is no difference in the latent response times and the lapses of attention.
M2. Only schizophrenic patients show attentional lapses. Other than that, there is no difference in the latent response times.
M3. There are no (meaningful numbers of) lapses of attention in either group.

1. Fit the three models.
2. Carry out posterior predictive checks for each model; can they account for the data?
3. Carry out model comparison (with Bayes factor and cross-validation).
ggplot(df_schizophrenia, aes(rt)) +
  geom_histogram() +
  facet_grid(rows = vars(factor(patient,
                                labels = c("control", "schizophrenic"))))


FIGURE 19.20: The distribution of response times for control and schizophrenic patients in
df_schizophrenia .

Exercise 19.3 Advanced: Guessing bias in the model.

In the original model, it was assumed that subjects might have a bias (a preference) toward one of the two answers when they were in the guessing mode. To fit this model we need to change the dependent variable and add more information; now we not only care about whether the participant answered correctly or not, but also about which answer they gave (left or right).

Implement a unique bias for all the subjects. Fit the new model to (a subset of) the data.
Implement a hierarchical bias; that is, there is a common bias, but every subject has its own adjustment. Fit the new model to (a subset of) the data.
References

Belin, TR, and DB Rubin. 1990. “Analysis of a Finite Mixture Model with Variance
Components.” In Proceedings of the Social Statistics Section, 211–15.

Britten, Kenneth H., Michael N. Shadlen, William T. Newsome, and J. Anthony Movshon. 1993. “Responses of Neurons in Macaque MT to Stochastic Motion Signals.” Visual Neuroscience 10 (6). Cambridge University Press: 1157–69. https://doi.org/10.1017/S0952523800010269.

Brown, Scott D., and Andrew Heathcote. 2008. “The Simplest Complete Model of Choice
Response Time: Linear Ballistic Accumulation.” Cognitive Psychology 57 (3): 153–78.
https://doi.org/10.1016/j.cogpsych.2007.12.002.

Dutilh, Gilles, Jeffrey Annis, Scott D Brown, Peter Cassey, Nathan J Evans, Raoul PPP
Grasman, Guy E Hawkins, et al. 2019. “The Quality of Response Time Data Inference: A
Blinded, Collaborative Assessment of the Validity of Cognitive Models.” Psychonomic Bulletin
& Review 26 (4). Springer: 1051–69.

Dutilh, Gilles, Eric-Jan Wagenmakers, Ingmar Visser, and Han L. J. van der Maas. 2011. “A
Phase Transition Model for the Speed-Accuracy Trade-Off in Response Time Experiments.”
Cognitive Science 35 (2): 211–50. https://doi.org/10.1111/j.1551-6709.2010.01147.x.

Han, Ding, Jana Wegrzyn, Hua Bi, Ruihua Wei, Bin Zhang, and Xiaorong Li. 2018. “Practice
Makes the Deficiency of Global Motion Detection in People with Pattern-Related Visual Stress
More Apparent.” PLOS ONE 13 (2). Public Library of Science: 1–13.
https://doi.org/10.1371/journal.pone.0193215.

Heathcote, Andrew, and Jonathon Love. 2012. “Linear Deterministic Accumulator Models of
Simple Choice.” Frontiers in Psychology 3: 292. https://doi.org/10.3389/fpsyg.2012.00292.

Levy, Deborah L, Philip S Holzman, Steven Matthysse, and Nancy R Mendell. 1993. “Eye
Tracking Dysfunction and Schizophrenia: A Critical Perspective.” Schizophrenia Bulletin 19
(3). Oxford University Press: 461–536.

Luce, R Duncan. 1991. Response Times: Their Role in Inferring Elementary Mental
Organization. Oxford University Press.

Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012.
The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.

McElree, Brian. 2000. “Sentence Comprehension Is Mediated by Content-Addressable Memory Structures.” Journal of Psycholinguistic Research 29 (2). Springer: 111–23.
Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence
Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” Journal
of Memory and Language 99: 1–34. https://doi.org/10.1016/j.jml.2017.08.004.

Ollman, Robert. 1966. “Fast Guesses in Choice Reaction Time.” Psychonomic Science 6 (4).
Springer: 155–56.

Plummer, Martyn. 2016. “JAGS Version 4.2.0 User Manual.”

Pullin, Jeffrey, Lyle Gurrin, and Damjan Vukcevic. 2021. “Statistical Models of Repeated
Categorical Ratings: The R Package Rater.”

Ratcliff, Roger. 1978. “A Theory of Memory Retrieval.” Psychological Review 85 (2). American
Psychological Association: 59.

Ratcliff, Roger, Philip L. Smith, Scott D. Brown, and Gail McKoon. 2016. “Diffusion Decision
Model: Current Issues and History.” Trends in Cognitive Sciences 20 (4): 260–81.
https://doi.org/https://doi.org/10.1016/j.tics.2016.01.007.

Rouder, Jeffrey N., Jordan M. Province, Richard D. Morey, Pablo Gomez, and Andrew
Heathcote. 2015. “The Lognormal Race: A Cognitive-Process Model of Choice and Latency
with Desirable Psychometric Properties.” Psychometrika 80 (2): 491–513.
https://doi.org/10.1007/s11336-013-9396-3.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled
Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1). American
Psychological Association: 103–26.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018.
“Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint
arXiv:1804.06788.

Wickelgren, Wayne A. 1977. “Speed-Accuracy Tradeoff and Information Processing Dynamics.” Acta Psychologica 41 (1): 67–85.

Yackulic, Charles B., Michael Dodrill, Maria Dzul, Jamie S. Sanderlin, and Janice A. Reid.
2020. “A Need for Speed in Bayesian Population Models: A Practical Guide to Marginalizing
and Recovering Discrete Latent States.” Ecological Applications 30 (5): e02112.
https://doi.org/10.1002/eap.2112.

Yellott, John I. 1967. “Correction for Guessing in Choice Reaction Time.” Psychonomic Science 8 (8): 321–22. https://doi.org/10.3758/BF03331682.

Yellott, John I. 1971. “Correction for Fast Guessing and the Speed-Accuracy Tradeoff in Choice Reaction Time.” Journal of Mathematical Psychology 8 (2): 159–99. https://doi.org/10.1016/0022-2496(71)90011-3.

48. See section 1.6.1.1 in chapter 1 for a review on the concept of marginalization.↩

49. As mentioned above, other probabilistic languages that do not rely on Hamiltonian dynamics (exclusively) are able to deal with this. However, even when sampling discrete parameters is possible, marginalization is more efficient (Yackulic et al. 2020): when z_n is omitted we are fitting a model with n fewer parameters.↩

50. Ollman’s original model was meant to be relevant only for means; Yellott (1967, 1971)
generalized it to a distributional form.↩

Chapter 20 A simple accumulator model to account for choice response time
As mentioned in chapter 19, the most popular class of cognitive-process models that can
incorporate both response times and accuracy is that of sequential sampling models (for a
review, see Ratcliff et al. 2016). This class of model includes, among others, the drift diffusion model
(Ratcliff 1978), the linear ballistic accumulator (Brown and Heathcote 2008), and the log-
normal race model (Heathcote and Love 2012; Rouder et al. 2015). We discuss the log-normal
race model in the current chapter. Sequential sampling or evidence-accumulation models are
based on the idea that decisions are made by gathering evidence from the environment (e.g.,
the computer screen in many experiments) until sufficient evidence is gathered and a
threshold of evidence is reached. The log-normal race model seems to be the simplest
sequential sampling model that can account for the joint distribution of response times and
response choice or accuracy (Heathcote and Love 2012; Rouder et al. 2015).

This model belongs to the subclass of race models, where the evidence for each response
grows gradually in time in separate racing accumulators, until a threshold is reached. A
response is made when one of these accumulators first reaches the threshold, and wins the
race against the other accumulators. This model is sometimes referred to as deterministic (or
non-stochastic, and ballistic), since the noise only affects the rate of accumulation of evidence
before each race starts, but once the accumulator starts accumulating evidence, the rate is
fixed. This means that a given accumulator can be faster or slower in different trials (or
between choices) but its rate of accumulation will be fixed during a trial (or within choices).
Brown and Heathcote (2005) claim that even though it is clear that a range of factors might
cause within-choice noise, the behavioral effects might sometimes be small enough to ignore
(this is in contrast to models such as the drift diffusion model, where both types of noise are
present).

The main advantages of the log-normal race model in comparison with other sequential
sampling models are that: (i) the log-normal race model is very simple, making it easy to
extend hierarchically; (ii) it is relatively easy to avoid convergence issues; and (iii) it is
straightforward to model more than two choices. This specific model is presented next for
pedagogical purposes because it is relatively easy to derive its likelihood given some
reasonable assumptions. However, even though the log-normal race is a “legitimate” cognitive
model (see Further Readings for examples), the majority of the literature fits choice response
times with the linear ballistic accumulator and/or the drift diffusion model, which provide more
flexibility to the modeler.

The next section explains how the log-normal race model is implemented, using data from a
lexical decision task.

20.1 Modeling a lexical decision task

In a lexical decision task, a subject is presented with a string of letters on the screen and they
need to decide whether the string is a word or a non-word; see Figure 20.1. In the example
developed below, a subset of 600 words and 600 non-words from 20 subjects (600 × 2 × 20
data points) are used from the data of the British Lexicon project (Keuleers et al. 2012). The
data are stored as the object df_blp in the package bcogsci . In this data set, the lexicality
of the string (word or non-word) is indicated in the column lex . The goal is to investigate
how word frequency, shown in the column freq (frequency is counted per million words
using the British National Corpus), affects the lexical decision task as quantified by accuracy
and response time. For more details about the data set, type ?df_blp on the R command
line after loading the library bcogsci .

FIGURE 20.1: Two trials in a lexical decision task. For the first trial, rurble , the correct
answer would be to press the key on a keyboard or a response console that is mapped to the
“non-word” response, for the second trial, monkey , the correct answer would be to press the
key that is mapped to the “word” response.

data("df_blp")
df_blp

## # A tibble: 24,000 × 8
## subj block lex trial string acc rt freq
## <dbl> <dbl> <chr> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 1 57 non-word 28263 paybods 1 591 0
## 2 1 53 non-word 26414 lunned 1 621 0
## 3 1 49 non-word 24333 pertrax 1 575 0

## # … with 23,997 more rows

The following code chunk adds 0.01 (which corresponds to a word that appears only once in
the corpus) to avoid word frequencies of zero, and then log-transforms the frequencies to
compress their range of values (see Brysbaert, Mandera, and Keuleers 2018 for a more in-
depth treatment of word frequencies) and centers them. It also creates a new variable that
sum-codes the lexicality of each given string (either a word, 0.5 , or a non-word, −0.5 ).

df_blp <- df_blp %>%
  mutate(lfreq = log(freq + 0.01),
         c_lfreq = lfreq - mean(lfreq),
         c_lex = ifelse(lex == "word", 0.5, -0.5))

If one wants to study the effect of frequency on words, the “traditional” way to analyze these
data would be to fit response times and choice data in two separate models restricted to
words, ignoring non-words: one model fit on the response times of correct responses, and a
second model on the accuracy. These two models are fit below.

To fit the response times model, subset the correct responses given to strings that are words:

df_blp_word_c <- df_blp %>%
  filter(acc == 1, lex == "word")


Fit a hierarchical model with a log-normal likelihood and log-transformed frequency as a
predictor (using brms here) and relatively weak priors.

fit_rt_word_c <- brm(rt ~ c_lfreq + (c_lfreq | subj),
                     data = df_blp_word_c,
                     family = lognormal,
                     prior = c(prior(normal(6, 1.5), class = Intercept),
                               prior(normal(0, 1), class = b),
                               prior(normal(0, 1), class = sigma),
                               prior(normal(0, 1), class = sd),
                               prior(lkj(2), class = cor)),
                     iter = 3000)

Show the estimate of the effect of log-frequency on the log-ms scale.

posterior_summary(fit_rt_word_c, variable = "b_c_lfreq")

## Estimate Est.Error Q2.5 Q97.5
## b_c_lfreq -0.0379 0.00267 -0.0431 -0.0326

To fit the accuracy model, subset the responses given to strings that are words:

df_blp_word <- df_blp %>%
  filter(lex == "word")

Fit a hierarchical model with a Bernoulli likelihood (and logit link) using log-transformed
frequency as a predictor (using brms ) and relatively weak priors:

fit_acc_word <- brm(acc ~ c_lfreq + (c_lfreq | subj),
                    data = df_blp_word,
                    family = bernoulli(link = logit),
                    prior = c(prior(normal(0, 1.5), class = Intercept),
                              prior(normal(0, 1), class = b),
                              prior(normal(0, 1), class = sd),
                              prior(lkj(2), class = cor)),
                    iter = 3000)

Show the estimate of the effect of log-frequency on the log-odds scale:

posterior_summary(fit_acc_word, variable = "b_c_lfreq")

## Estimate Est.Error Q2.5 Q97.5
## b_c_lfreq 0.573 0.0258 0.523 0.627

For this specific data set, it does not matter whether response times or accuracy are chosen
as the dependent variable, since both yield results with a similar interpretation: More frequent
words are identified more easily, that is, with shorter reading times (this is evident from the
negative sign on the estimate of the mean effect), and with higher accuracy (positive sign on
the estimate). However, it might be the case that some data set shows divergent directions in
response times and accuracy. For example, more frequent words might take longer to identify,
leading to a slowdown in response time as frequency increases, but might still be identified
more accurately.

Furthermore, two models are fit above, treating response times and accuracy as independent.
In reality, there is plenty of evidence that they are related (e.g., the speed-accuracy trade-off).
Even in these data, as frequency increases, correct answers are given faster, and most errors
are for low-frequency words (see Figure 20.2).

acc_lbl <- as_labeller(c(`0` = "Incorrect", `1` = "Correct"))
ggplot(df_blp, aes(y = rt, x = freq + .01, shape = lex, color = lex)) +
  geom_point(alpha = .5) +
  facet_grid(. ~ acc, labeller = labeller(acc = acc_lbl)) +
  scale_x_continuous("Frequency per million (log-scaled axis)",
                     limits = c(.0001, 2000),
                     breaks = c(.01, 1, seq(5, 2000, 5)),
                     labels = ~ ifelse(.x %in% c(.01, 1, 5, 100, 2000), .x, "")) +
  scale_y_continuous("Response times in ms (log-scaled axis)",
                     limits = c(150, 8000),
                     breaks = seq(500, 7500, 500),
                     labels = ~ ifelse(.x %in% c(500, 1000, 2000, 7500), .x, "")) +
  scale_color_discrete("lexicality") +
  scale_shape_discrete("lexicality") +
  theme(legend.position = "bottom") +
  coord_trans(x = "log", y = "log")



FIGURE 20.2: The distribution of response times for words and non-words, and correct and
incorrect answers.
A powerful way to convey the relationship between response times and accuracy is using
quantile probability plots (Ratcliff and Tuerlinckx 2002; these are closely related to the latency
probability plots of Audley and Pike 1965).

A quantile probability plot shows quantiles of the response time distribution (typically 0.1,
0.3, 0.5, 0.7, and 0.9) for correct and incorrect responses on the y-axis against probabilities of
correct and incorrect responses for experimental conditions on the x-axis. The plot is built by
first aggregating the data.

To display a quantile probability plot, create a custom function qpf() that takes as arguments
a data set grouped by an experimental condition (e.g., words vs non-words, here by lex ),
and the quantiles that need to be displayed (by default, 0.1, 0.3, 0.5, 0.7, 0.9). The function
works as follows: First, calculate the desired quantiles of the response times for incorrect and
correct responses by condition (these are stored in rt_q ). Second, calculate the proportion
of incorrect and correct responses by condition (these are stored in p ); because this
information is needed for each quantile, repeat it for the number of quantiles chosen (here,
five times). Last, record the quantile that each response time and response probability
corresponds to (this is recorded in q ), and whether it corresponds to an incorrect or a correct
response (this information is stored in response ).

qpf <- function(df_grouped, quantiles = c(.1, .3, .5, .7, .9)) {
  df_grouped %>%
    summarize(rt_q = list(c(quantile(rt[acc == 0], quantiles),
                            quantile(rt[acc == 1], quantiles))),
              p = list(c(rep(mean(acc == 0), length(quantiles)),
                         rep(mean(acc == 1), length(quantiles)))),
              q = list(rep(quantiles, 2)),
              response = list(c(rep("incorrect", length(quantiles)),
                                rep("correct", length(quantiles))))) %>%
    # Since the summary contains a list in each column,
    # we unnest it to have the following number of rows:
    # number of quantiles x groups x 2 (incorrect, correct)
    unnest(cols = c(rt_q, p, q, response))
}
df_blp_lex_q <- df_blp %>%
  group_by(lex) %>%
  qpf()

The aggregated data look like this:

df_blp_lex_q %>% print(n = 10)

## # A tibble: 20 × 5
## lex rt_q p q response
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 non-word 433. 0.0521 0.1 incorrect
## 2 non-word 521. 0.0521 0.3 incorrect
## 3 non-word 613 0.0521 0.5 incorrect
## 4 non-word 779. 0.0521 0.7 incorrect
## 5 non-word 1110 0.0521 0.9 incorrect
## 6 non-word 448 0.948 0.1 correct
## 7 non-word 513 0.948 0.3 correct
## 8 non-word 575 0.948 0.5 correct
## 9 non-word 666 0.948 0.7 correct
## 10 non-word 905 0.948 0.9 correct
## # … with 10 more rows

Plot the data by joining the points that belong to the same quantiles with lines. Given that
incorrect responses in most tasks occur in less than 50% of the trials and correct responses
occur in the complementary proportion (i.e., in more than 50% of the trials), incorrect responses
usually appear in the left half of the plot, and correct ones in the right half. The code that
appears below produces Figure 20.3.

ggplot(df_blp_lex_q, aes(x = p, y = rt_q)) +
  geom_vline(xintercept = .5, linetype = "dashed") +
  geom_point(aes(shape = lex)) +
  geom_line(aes(group = interaction(q, response))) +
  ylab("RT quantiles (ms)") +
  scale_x_continuous("Response proportion", breaks = seq(0, 1, .2)) +
  scale_shape_discrete("Lexicality") +
  annotate("text", x = .40, y = 500, label = "incorrect") +
  annotate("text", x = .60, y = 500, label = "correct")

FIGURE 20.3: Quantile probability plots showing 0.1, 0.3, 0.5, 0.7, and 0.9 response time
quantiles plotted against proportion of incorrect responses (left) and proportion of correct
responses (right) for strings that are words and non-words.
The vertical spread among the lines shows the shape of the response time distribution. The
lower quantile lines correspond to the left part of the response time distribution, and the higher
quantiles to the right part of the distribution. Since the response time distribution is long-tailed
and right-skewed, the higher quantiles are more spread apart than the lower quantiles.

A quantile probability plot can also be used to corroborate the observation that high-frequency
words are easier to recognize. To do that, subset the data to only words, and group the strings
according to their “frequency group” (that is, according to the quantile of frequency that the
strings belong to). Whereas we previously aggregated over all the observations, ignoring
subjects, we can also aggregate by subjects first, and then average the results. This will
prevent some idiosyncratic subjects from dominating the plot. (We can also plot individual
quantile probability plots by subject.) Apart from the fact that the aggregation is by subjects,
the code below follows the same steps as before, and the result is shown in Figure 20.4. The
plot shows that for more frequent words, accuracy improves and responses are faster.

df_blp_freq_q <- df_blp %>%
  # Subset only words:
  filter(lex == "word") %>%
  # Create 5 word frequency groups:
  mutate(freq_group = cut(lfreq,
                          quantile(lfreq, c(0, 0.2, 0.4, 0.6, 0.8, 1)),
                          include.lowest = TRUE,
                          labels = c("0-0.2", "0.2-0.4", "0.4-0.6",
                                     "0.6-0.8", ".8-1"))) %>%
  # Group by condition and subject:
  group_by(freq_group, subj) %>%
  # Apply the quantile probability function:
  qpf() %>%
  # Group again removing subject:
  group_by(freq_group, q, response) %>%
  # Get averages of all the quantities:
  summarize(rt_q = mean(rt_q),
            p = mean(p))
# Plot:
ggplot(df_blp_freq_q, aes(x = p, y = rt_q)) +
  geom_point(shape = 4) +
  geom_text(data = df_blp_freq_q %>% filter(q == .1),
            aes(label = freq_group), nudge_y = 12) +
  geom_line(aes(group = interaction(q, response))) +
  ylab("RT quantiles (ms)") +
  scale_x_continuous("Response proportion", breaks = seq(0, 1, .2)) +
  annotate("text", x = .40, y = 900, label = "incorrect") +
  annotate("text", x = .60, y = 900, label = "correct")



FIGURE 20.4: Quantile probability plot showing 0.1, 0.3, 0.5, 0.7, and 0.9 response time
quantiles plotted against proportion of incorrect responses (left) and proportion of correct
responses (right) for words of different frequency. Word frequency is grouped according to
quantiles: The first group is words with frequencies smaller than the 0.2-th quantile, the
second group is words with frequencies smaller than the 0.4-th quantile and larger than the
0.2-th quantile, and so forth.
So far, we have presented several ways of describing the data graphically. Next, we turn to
modeling the data.

20.1.1 Modeling the lexical decision task with the log-normal race model

The log-normal race model is used here to examine the effect of word frequency in both
response times and choice (word vs. non-word) in the lexical decision task presented earlier.
In this example, the log-normal race model is limited to fitting two choices; as mentioned
earlier, this model can in principle fit more than two choices. When modeling a task with two
choices, there are two ways to account for the data: either fit the response times and the
accuracy (i.e., accuracy coding: correct vs. incorrect), or fit the response times and actual
responses (i.e., stimulus coding: in this case word vs. non-word). In this example, we will use
the stimulus-coding approach.

The following code chunk adds a new column that incorporates the actual choice made (as
word vs non-word in choice and as 1 vs 2 in nchoice ):

df_blp <- df_blp %>%
  mutate(choice = ifelse((lex == "word" & acc == 1) |
                           (lex == "non-word" & acc == 0),
                         "word", "non-word"),
         nchoice = ifelse(choice == "word", 1, 2))

To start modeling the data, think about the behavior of one synthetic subject. This subject
simultaneously accumulates evidence for the response, “word” in one accumulator, and for
“non-word” in another independent accumulator. Unlike other sequential sampling models, an
increase in evidence for one choice doesn’t necessarily reduce the evidence for the other
choices. Rouder et al. (2015) points out that it might seem odd to assume that we accumulate
evidence for a non-word in the same manner as we accumulate evidence for a word, since
non-words may be conceptualized as the absence of a word. However, they stress that this
approach is closely related to novelty detection, where the salience of never-before
experienced stimuli seems to indicate that novelty is psychologically represented as more than
the absence of familiarity. Nevertheless, notions of words and non-word evidence
accumulation are indeed controversial (see Dufau, Grainger, and Ziegler 2012). The
alternative approach of fitting accuracy rather than stimuli discussed before doesn’t really
circumvent the problem. This is because when the correct answer is word , we assume that
the “correct” accumulator accumulates evidence for word , and the incorrect one for non-
word , and the other way around when the correct answer is non-word .

20.1.2 A generative model for a race between accumulators

To build a generative model of the task based on the log-normal race model, start by spelling
out the assumptions. In a race of accumulators model, the assumption is that the time T taken
for each accumulator of evidence to reach the threshold at distance D is simply defined by
T = D/V

where the denominator V is the rate (velocity, sometimes also called drift rate) of evidence
accumulation.

The log-normal race model assumes that the rate in each trial is sampled from a log-normal
distribution:

V ∼ LogNormal(μ_v, σ_v)


FIGURE 20.5: A schematic illustration of the log-normal race model for the lexical-decision
task with a word stimulus. A larger rate of accumulation (V) leads to a larger angle. Here, the
choice of word is selected.

A log-normal distribution is partly justified by the work of Ulrich and Miller (1993) (also see
Box 4.3), and, as discussed later, it is very convenient mathematically.

For simplicity, assume that the distance D to the threshold is kept constant. This might not be
a good assumption if the experiment is designed so that subjects change their threshold
depending on speed or accuracy incentives (that was not the case in this experiment), or if the
subject gets fatigued as the experiment progresses, or if there is reason to believe that there
might be random fluctuations in this threshold. Later in this chapter, we will discuss what
happens if this assumption is relaxed.
Assume that, for trial n, the location μ_w of the distribution of rates of accumulation of evidence
for a string w is a function of the lexicality of the string presented (only a word will increase
this rate of accumulation and not a non-word), frequency (i.e., high-frequency words might be
easier to identify, leading to a faster rate of accumulation than with low-frequency words), and
bias (i.e., a subject might have a tendency to answer that a string is a word rather than non-
word or vice-versa, regardless of the stimuli). This assumption can be modeled with a linear
regression over μ_w, with parameters that represent the bias, α_w, the effect of lexicality,
β_{lex_w}, and the effect of log-frequency, β_{lfreq_w}:

μ_{w,n} = α_w + lex_n ⋅ β_{lex_w} + lfreq_n ⋅ β_{lfreq_w}

The location for the rate of accumulation of evidence for the non-word accumulator is defined
similarly:

μ_{nw,n} = α_{nw} + lex_n ⋅ β_{lex_nw} + lfreq_n ⋅ β_{lfreq_nw}

Thus the rates are generated as follows:

V_{w,n} ∼ LogNormal(μ_{w,n}, σ)

V_{nw,n} ∼ LogNormal(μ_{nw,n}, σ)

The accumulators reach the threshold in times:

T_{w,n} = D/V_{w,n}

T_{nw,n} = D/V_{nw,n}

The choice for trial n corresponds to the accumulator with the shortest time for that trial,

choice_n = { word,     if T_{w,n} < T_{nw,n}
           { non-word, otherwise

and the decision for trial n is made in time

T_n = min(T_{w,n}, T_{nw,n})

We also need to take into account that not all the time spent in the task involves making the
decision: Time is spent fixating the gaze on the screen, pressing a button, etc. We’ll add a
shift to the distribution, representing the minimum amount of time that a subject needs for all
the peripheral processes that happen before and after the decision (also see Rouder 2005).
We represent this with T_nd; “nd” stands for non-decision. Although some variation in the non-
decision time is highly likely, we use a constant as an approximation that will be reasonable if
its variation is small relative to the variation associated with the decision time (Heathcote and
Love 2012).

rt_n = T_nd + T_n

The following chunk of code generates synthetic data for one subject, by setting true values to
the parameters and translating the previous equations to R. The true values are relatively
arbitrary and were decided by trial and error until a relatively realistic distribution of response
times was obtained. Considering that this is only one subject (unlike what was shown in
previous figures), Figure 20.6 looks relatively fine. (One can also inspect the quantile
probability plots of individual subjects in the real data set and compare them to the synthetic
data.)

First, set a seed to always generate the same pseudo-random values, take a subset of the
data set to keep the same structure of the data frame for our simulated subject, and set true
values:

set.seed(123)
df_blp_1subj <- df_blp %>%
  filter(subj == 1)
D <- 1800
alpha_w <- .8
beta_wlex <- .5
beta_wlfreq <- .2
alpha_nw <- 1
beta_nwlex <- -.5
beta_nwlfreq <- -.05
sigma <- .8
T_nd <- 150

Second, generate the locations of both accumulators, mu_w and mu_nw , for every trial. This
means that both variables are vectors of length N, the number of trials:

mu_w <- alpha_w + df_blp_1subj$c_lfreq * beta_wlfreq +
  df_blp_1subj$c_lex * beta_wlex
mu_nw <- alpha_nw + df_blp_1subj$c_lfreq * beta_nwlfreq +
  df_blp_1subj$c_lex * beta_nwlex
N <- nrow(df_blp_1subj)
Third, generate values for the rates of accumulation, V_w and V_nw , for every trial. Use
those rates to calculate T_w and T_nw , how long it will take for each accumulator to reach
its threshold for every trial:

V_w <- rlnorm(N, mu_w, sigma)
V_nw <- rlnorm(N, mu_nw, sigma)
T_w <- D / V_w
T_nw <- D / V_nw

Fourth, calculate the time it takes to reach a decision in every trial, T_winner , as the by-trial
minimum between T_w and T_nw . Similarly, store the winning accumulator for each trial in
accumulator_winner :

T_winner <- pmin(T_w, T_nw)
accumulator_winner <- ifelse(T_w == pmin(T_w, T_nw),
                             "word",
                             "non-word")

Finally, add this information to the data frame that now indicates choice, time, and accuracy
for each trial:

df_blp1_sim <- df_blp_1subj %>%
  mutate(rt = T_nd + T_winner,
         choice = accumulator_winner,
         nchoice = ifelse(choice == "word", 1, 2)) %>%
  mutate(acc = ifelse(lex == choice, 1, 0))

acc_lbl <- as_labeller(c(`0` = "Incorrect", `1` = "Correct"))
ggplot(df_blp1_sim, aes(y = rt, x = freq + .01, shape = lex, color = lex)) +
  geom_point(alpha = .5) +
  facet_grid(. ~ acc, labeller = labeller(acc = acc_lbl)) +
  scale_x_continuous("Frequency per million (log-scaled axis)",
                     limits = c(.0001, 2000),
                     breaks = c(.01, 1, seq(5, 2000, 5)),
                     labels = ~ ifelse(.x %in% c(.01, 1, 5, 100, 2000), .x, "")) +
  scale_y_continuous("Response times in ms (log-scaled axis)",
                     limits = c(150, 8000),
                     breaks = seq(500, 7500, 500),
                     labels = ~ ifelse(.x %in% c(500, 1000, 2000, 7500), .x, "")) +
  scale_color_discrete("lexicality") +
  scale_shape_discrete("lexicality") +
  theme(legend.position = "bottom") +
  coord_trans(x = "log", y = "log")



FIGURE 20.6: The distribution of response times for words and non-words, and correct and
incorrect answers for the synthetic data of one subject.

20.1.3 Fitting the log-normal race model

A first issue that we face when we attempt to fit the log-normal race model is that we need to
fit its likelihood to a ratio of the random variables D and V; that is, we need a ratio or quotient
distribution function. Although for arbitrary distributions this requires solving (sometimes
extremely complex) integrals (see, for example, Nelson 1981), there are two situations that
are compatible with our assumptions and are mathematically simple:

1. If we assume that D is a constant k, then T = k/V, and

log(T) = log(k/V) = log(k) − log(V)

Since V is log-normally distributed, log(V) ∼ Normal(μ_v, σ_v), and log(k) is a constant:

log(T) ∼ Normal(log(k) − μ_v, σ_v)
T ∼ LogNormal(log(k) − μ_v, σ_v)    (20.1)

2. A log-normally distributed time is not uniquely predicted by assuming that distance is a
constant. It also follows if distance is a log-normally distributed variable: If we assume that
D ∼ LogNormal(μ_d, σ_d), then T is the ratio of two random variables, D/V, and

log(T) = log(D/V) = log(D) − log(V)

We have a difference of independent, normally distributed random variables. It follows from
random variable theory that:

log(T) ∼ Normal(μ_d − μ_v, √(σ_d² + σ_v²))
T ∼ LogNormal(μ_d − μ_v, √(σ_d² + σ_v²))    (20.2)

From Equations (20.1) and (20.2), it should be clear that the threshold and accumulation rate
cannot be disentangled: a manipulation that affects the rate or the decision threshold will
affect the location of the distribution in the same way (also see Rouder et al. 2015). Another
important observation is that T won’t have a log-normal distribution when D has any other
distributional form.
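As a quick sanity check of Equation (20.2), we can simulate the ratio of two log-normal random variables in R and compare the observed location and scale of log(T) with the predicted values; the parameter values below are arbitrary, chosen only for illustration:

# A minimal simulation sketch (arbitrary values): if D and V are both
# log-normal, T = D/V should be log-normal with location mu_d - mu_v and
# scale sqrt(sigma_d^2 + sigma_v^2), as stated in Equation (20.2).
set.seed(123)
mu_d <- 7; sigma_d <- .3
mu_v <- 1; sigma_v <- .4
T_sim <- rlnorm(1e5, mu_d, sigma_d) / rlnorm(1e5, mu_v, sigma_v)
c(observed = mean(log(T_sim)), predicted = mu_d - mu_v)
c(observed = sd(log(T_sim)), predicted = sqrt(sigma_d^2 + sigma_v^2))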
Following Rouder et al. (2015), we assume that the noise parameter is the same for each
accumulator, since this means that contrasts between finishing time distributions are captured
completely by contrasts of the locations of the log-normal distributions. We discuss at the end
of the chapter why one would need to relax this assumption (also see Exercise 20.2).

In each trial n, with an accumulator for words, indicated with the subscript w, and one for non-
words, indicated with nw, we can model the time it takes for each accumulator to get to the
threshold D in the following way. For the word accumulator,

μ′_{w,n} = μ_d − μ_{w,n}
μ′_{w,n} = μ_d − (α_w + lex_n ⋅ β_{lex_w} + lfreq_n ⋅ β_{lfreq_w})
μ′_{w,n} = (μ_d − α_w) − lex_n ⋅ β_{lex_w} − lfreq_n ⋅ β_{lfreq_w}    (20.3)
μ′_{w,n} = α′_w − lex_n ⋅ β_{lex_w} − lfreq_n ⋅ β_{lfreq_w}

T_{w,n} ∼ LogNormal(μ′_{w,n}, σ)

The parameter α′_w absorbs the location of the threshold distribution minus the intercept of the
rate distribution, and represents a bias. As α′_w gets smaller, the accumulator will be more
likely to reach the threshold first, all things being equal, biasing the responses to word .
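To build some intuition for this bias parameter, a short simulation sketch (with arbitrary values) shows that an accumulator with a smaller location wins the race more often, all things being equal:

# Sketch with arbitrary values: the accumulator with the smaller location
# (here, the word accumulator) reaches the threshold first more often.
set.seed(123)
T_w_ex <- rlnorm(1e5, meanlog = 6.5, sdlog = .8)
T_nw_ex <- rlnorm(1e5, meanlog = 6.7, sdlog = .8)
mean(T_w_ex < T_nw_ex) # proportion of "word" wins; larger than 0.5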

Similarly, for the non-word accumulator,

μ′_{nw,n} = α′_{nw} − lex_n ⋅ β_{lex_nw} − lfreq_n ⋅ β_{lfreq_nw}

T_{nw,n} ∼ LogNormal(μ′_{nw,n}, σ)

The only observed time is the one associated with the winner accumulator, the response
selected s, which corresponds to the faster accumulator:

T_{accum=s,n} ∼ LogNormal(μ′_{accum=s,n}, σ)

If we only fit the observed finishing times of the accumulators, we’re always ignoring that in a
given trial the accumulator that lost was slower than the accumulator for which we have the
latency; this means that we underestimate the time it takes to reach the threshold and we
overestimate the rate of accumulation of both accumulators. We can treat this as a problem of
censored data, where for each trial we don’t know the slower observations.

Since the potential decision time for the accumulator that wasn't selected is definitely longer
than the one of the winner accumulator, we obtain the likelihood for each unobserved time by
integrating out all the possible decision times that the accumulator could have, that is, from
the time it took for the winner accumulator to reach the threshold to infinitely large decision
times:

P(T_{accum≠s,n}) = ∫_{T_{accum=s,n}}^{∞} LogNormal(T | μ′_{accum≠s,n}, σ) dT

This integral is the complement of the CDF of the log-normal distribution evaluated at
T_{accum=s,n}:

P(T_{accum≠s,n}) = 1 − LogNormal_CDF(T_{accum=s,n} | μ′_{accum≠s,n}, σ)
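In R, this censored contribution can be computed directly with plnorm() ; the following is a minimal sketch with made-up values (it mirrors Stan's lognormal_lccdf() , which is used below):

# Sketch (made-up values): log-likelihood contribution of the losing
# accumulator, i.e., log(1 - CDF) of the log-normal distribution
# evaluated at the winner's decision time.
T_win_ex <- 500
mu_loser <- 6.5
plnorm(T_win_ex, meanlog = mu_loser, sdlog = .8,
       lower.tail = FALSE, log.p = TRUE)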

So far we have been fitting the decision time T, but our dependent variable is response times,
rt , the sum of the decision time T and the non-decision time T_nd. This requires a change of
variables in our model, T_n = rt_n − T_nd, since rt but not T is available as data. Here, the
Jacobian is 1, since adjusting the likelihood is equal to multiplying the likelihood by one (or
adding zero to the log likelihood). This is because |d(rt_n − T_nd)/d rt_n| = 1; for more details,
see section 12.1. So, although one could write in the Jacobian adjustment in the Stan code, it
will always evaluate to zero.

To sum up, our model can be stated as follows:

T_n = rt_n − T_nd
μ′_{w,n} = α′_w − lex_n ⋅ β_{lex_w} − lfreq_n ⋅ β_{lfreq_w}
μ′_{nw,n} = α′_{nw} − lex_n ⋅ β_{lex_nw} − lfreq_n ⋅ β_{lfreq_nw}

T_n ∼ LogNormal(μ′_{w,n}, σ)   if choice = word
T_n ∼ LogNormal(μ′_{nw,n}, σ)  otherwise

T_{censored,n} = rt_{censored,n} − T_nd

Rather than trying to estimate all the censored observations, we integrate them out:51

P(T_{censored}) = 1 − LogNormal_CDF(T_n | μ′_{nw,n}, σ)   if choice = word
P(T_{censored}) = 1 − LogNormal_CDF(T_n | μ′_{w,n}, σ)    otherwise

We need priors for all the parameters. An added complication here is the prior for the non-
decision time, T_nd: we need to make sure that it's strictly positive and also that it's smaller
than the shortest observed response time. This is because the decision time for each
observation, T_n, should also be strictly positive:

T_n = rt_n − T_nd > 0
rt_n > T_nd
min(rt) > T_nd

We thus truncate the prior of T_nd so that the values lie between zero and min(rt) , the
minimum value of the vector of response times. Given the time it takes to fixate the gaze on
the screen and a minimal motor response time, centering the prior at 150 ms seems
reasonable. The rest of the priors are on the log scale. One should use prior predictive checks
to verify that the order of magnitude of all the priors is appropriate. We skip this step here and
present the priors below:

T_nd ∼ Normal(150, 100), with 0 < T_nd < min(rt_n)

α ∼ Normal(6, 1)

β ∼ Normal(0, .5)

σ ∼ Normal+(.5, .2)

where α is the vector ⟨α′_w, α′_{nw}⟩, and β is a vector of all the β used in the likelihoods.

To translate the model into Stan, we need a normal distribution truncated so that the values lie
between zero and min(rt) for the prior of T_nd . This means dividing the original distribution
by the difference of the CDFs evaluated at these two points; see Box 4.1. In log-space, this is
a difference between the log-transformed original distribution and the logarithm of the
difference of the CDFs. The function log_diff_exp is a more stable version of this last
operation. What log_diff_exp does is to take the log of the difference of the exponent of two
functions. In this case the functions are two log-CDFs.

target += normal_lpdf(T_nd | 150, 100)
          - log_diff_exp(normal_lcdf(min(rt) | 150, 100),
                         normal_lcdf(0 | 150, 100));
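The same computation can be sketched in R to see what the truncation does; the values of T_nd_ex and min_rt_ex below are made up for illustration:

# Sketch (made-up values): log-density of a Normal(150, 100) prior
# truncated to lie between 0 and min_rt_ex, as in the Stan snippet above.
T_nd_ex <- 120
min_rt_ex <- 300
dnorm(T_nd_ex, 150, 100, log = TRUE) -
  log(pnorm(min_rt_ex, 150, 100) - pnorm(0, 150, 100))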

We implement the likelihood of each joint observation of response time and choice with an if-
else clause that calls the likelihood of the accumulator that corresponds to the choice selected
in trial n, and the complement of the CDF for the accumulator that was not selected:

if(nchoice[n] == 1)
  target += lognormal_lpdf(T[n] | alpha[1] -
                           c_lex[n] * beta[1] -
                           c_lfreq[n] * beta[2], sigma) +
            lognormal_lccdf(T[n] | alpha[2] -
                            c_lex[n] * beta[3] -
                            c_lfreq[n] * beta[4], sigma);
else
  target += lognormal_lpdf(T[n] | alpha[2] -
                           c_lex[n] * beta[3] -
                           c_lfreq[n] * beta[4], sigma) +
            lognormal_lccdf(T[n] | alpha[1] -
                            c_lex[n] * beta[1] -
                            c_lfreq[n] * beta[2], sigma);

The complete Stan code for this model is shown below as lnrace.stan :

data {
  int<lower = 1> N;
  vector[N] c_lfreq;
  vector[N] c_lex;
  vector[N] rt;
  array[N] int nchoice;
}
parameters {
  array[2] real alpha;
  array[4] real beta;
  real<lower = 0> sigma;
  real<lower = 0, upper = min(rt)> T_nd;
}
model {
  vector[N] T = rt - T_nd;
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .5);
  target += normal_lpdf(sigma | .5, .2)
            - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(T_nd | 150, 100)
            - log_diff_exp(normal_lcdf(min(rt) | 150, 100),
                           normal_lcdf(0 | 150, 100));
  for(n in 1:N){
    if(nchoice[n] == 1)
      target += lognormal_lpdf(T[n] | alpha[1] -
                               c_lex[n] * beta[1] -
                               c_lfreq[n] * beta[2], sigma) +
                lognormal_lccdf(T[n] | alpha[2] -
                                c_lex[n] * beta[3] -
                                c_lfreq[n] * beta[4], sigma);
    else
      target += lognormal_lpdf(T[n] | alpha[2] -
                               c_lex[n] * beta[3] -
                               c_lfreq[n] * beta[4], sigma) +
                lognormal_lccdf(T[n] | alpha[1] -
                                c_lex[n] * beta[1] -
                                c_lfreq[n] * beta[2], sigma);
  }
}

Store the data in a list and fit the model. Some warnings might appear during the warm-up, but
these warnings can be ignored since they no longer appear afterwards, and all the
convergence checks look fine (omitted here):

lnrace <- system.file("stan_models",
                      "lnrace.stan",
                      package = "bcogsci")
ls_blp1_sim <- list(N = nrow(df_blp1_sim),
                    rt = df_blp1_sim$rt,
                    nchoice = df_blp1_sim$nchoice,
                    c_lex = df_blp1_sim$c_lex,
                    c_lfreq = df_blp1_sim$c_lfreq)
fit_blp1_sim <- stan(lnrace, data = ls_blp1_sim)

Print the parameters values:

print(fit_blp1_sim, pars = c("alpha", "beta", "T_nd", "sigma"))

## mean 2.5% 97.5% n_eff Rhat
## alpha[1] 6.67 6.60 6.73 4580 1
## alpha[2] 6.53 6.46 6.59 4211 1
## beta[1] 0.36 0.16 0.55 3372 1
## beta[2] 0.21 0.18 0.25 3210 1
## beta[3] -0.46 -0.69 -0.22 3207 1
## beta[4] -0.07 -0.13 -0.02 3125 1
## T_nd 143.41 131.36 152.30 3372 1
## sigma 0.78 0.74 0.82 3673 1

As in previous chapters, mcmc_recover_hist() can be used to compare the posterior
distributions of the relevant parameters of the model with their true values (Figure 20.7). First,
however, we need to reparameterize the true values, since D cannot be known, and we don't
fit V, but rather D/V, with V log-normally distributed. We therefore obtain an estimate of α′,
rather than α, such that α′ = log(D) − α.

true_values <- c(log(D) - alpha_w,
                 log(D) - alpha_nw,
                 beta_wlex,
                 beta_wlfreq,
                 beta_nwlex,
                 beta_nwlfreq,
                 sigma,
                 T_nd)
estimates <- as.data.frame(fit_blp1_sim) %>%
  select(-lp__)
mcmc_recover_hist(estimates, true_values)


FIGURE 20.7: Posterior distributions of the main parameters of the log-normal race model
fit_blp1_sim together with their true values.
Before moving on to a more complex version of this model, it's worth spending some time
making the code more modular. Encapsulate the likelihood of the log-normal race model by
writing it as a function. The function has four arguments: the decision time T , the choice
nchoice (this will only work with two choices, 1 and 2 ), an array of locations mu (which
we again implicitly assume has two elements), and a common scale sigma .

functions {
  real lognormal_race2_lpdf(real T, int nchoice,
                            real[] mu, real sigma){
    real lpdf;
    if(nchoice == 1)
      lpdf = lognormal_lpdf(T | mu[1], sigma) +
             lognormal_lccdf(T | mu[2], sigma);
    else
      lpdf = lognormal_lpdf(T | mu[2], sigma) +
             lognormal_lccdf(T | mu[1], sigma);
    return lpdf;
  }
}

Next, for each iteration n of the original for loop, generate an auxiliary variable T which
contains the decision time for the current trial, and mu as an array of size two that contains
all the parameters that affect the location at each trial. This will allow us to use our new
function as follows in the model block:52

real log_lik[N];
for(n in 1:N){
  real T = rt[n] - T_nd;
  real mu[2] = {alpha[1] -
                c_lex[n] * beta[1] -
                c_lfreq[n] * beta[2],
                alpha[2] -
                c_lex[n] * beta[3] -
                c_lfreq[n] * beta[4]};
  log_lik[n] = lognormal_race2_lpdf(T | nchoice[n], mu, sigma);
}

The variable log_lik contains the log-likelihood for each trial. We must not forget to add the
total log-likelihood to the target variable. This is done simply by target += sum(log_lik) .

The complete Stan code for this model can be found in the bcogsci package as
lnrace_mod.stan ; it is left to the reader to verify that the results are the same as those from
the non-modular model lnrace.stan fit earlier.

20.1.4 A hierarchical implementation of the log-normal race model

A simple hierarchical version of the previous model assumes that all the parameters α′ and β
have by-subject adjustments:

μ′_{w,n} = α′_w + u_{subj[n],1} − lex_n ⋅ (β_{lex_w} + u_{subj[n],2}) − lfreq_n ⋅ (β_{lfreq_w} + u_{subj[n],3})

μ′_{nw,n} = α′_{nw} + u_{subj[n],4} − lex_n ⋅ (β_{lex_nw} + u_{subj[n],5}) − lfreq_n ⋅ (β_{lfreq_nw} + u_{subj[n],6})

Similarly to the hierarchical implementation of the fast-guess model in section 19.1.5, assume
that u is a matrix with as many rows as subjects and six columns. Also assume that u follows
a multivariate normal distribution centered at zero. For lack of more information, we assume
the same (weakly informative) prior distribution for the six variance components τ_{u_{1,...,6}}, with
a somewhat smaller effect than we assumed for the prior of σ. As with previous hierarchical
models, we assign a regularizing LKJ prior for the correlations between the adjustments:53

u ∼ N(0, Σ_u)

τ_{u_{1..6}} ∼ Normal+(.1, .1)

ρ_u ∼ LKJcorr(2)
Before we fit the model to the real data, we’ll verify that it works with simulated data. To create
synthetic data of several subjects, we repeat the same generative process we used before
and we add the by-subject adjustments u in the same way as in section 19.1.5. This version
of the log-normal race model assumes that all the parameters α and β have by-subject
adjustments; that is, 6 adjustments. To simplify the model, we ignore the possibility of an
adjustment for the non-decision time Tnd , but see Nicenboim and Vasishth (2018) for an
implementation of the log-normal race model with a hierarchical non-decision time. For
simplicity, all the adjustments u are normally distributed with the same standard deviation of
0.2 , and they have a 0.3 correlation between pairs of u ’s; see tau_u and rho below.

First, set a seed, take a subset of the data set to keep the same structure, set true values, and
auxiliary variables that indicate the number of observations, subjects, etc.

set.seed(42)
df_blp_sim <- df_blp %>%
  group_by(subj) %>%
  slice_sample(n = 100) %>%
  ungroup()
D <- 1800
alpha_w <- .8
beta_wlex <- .5
beta_wlfreq <- .2
alpha_nw <- 1
beta_nwlex <- -.5
beta_nwlfreq <- -.05
sigma <- .8
T_nd <- 150
N <- nrow(df_blp_sim)
N_subj <- max(df_blp_sim$subj)
N_adj <- 6
tau_u <- rep(.2, N_adj)
rho <- .3
Cor_u <- matrix(rep(rho, N_adj * N_adj), nrow = N_adj)
diag(Cor_u) <- 1
Sigma_u <- diag(tau_u, N_adj, N_adj) %*%
  Cor_u %*%
  diag(tau_u, N_adj, N_adj)
u <- mvrnorm(n = N_subj, rep(0, N_adj), Sigma_u)
subj <- df_blp_sim$subj

Second, generate the locations of both accumulators, mu_w and mu_nw , for every trial:

mu_w <- alpha_w + u[subj, 1] +
  df_blp_sim$c_lfreq * (beta_wlfreq + u[subj, 2]) +
  df_blp_sim$c_lex * (beta_wlex + u[subj, 3])
mu_nw <- alpha_nw + u[subj, 4] +
  df_blp_sim$c_lfreq * (beta_nwlfreq + u[subj, 5]) +
  df_blp_sim$c_lex * (beta_nwlex + u[subj, 6])
Third, generate values for the rates of accumulation and use those rates to calculate T_w
and T_nw .

V_w <- rlnorm(N, mu_w, sigma)
V_nw <- rlnorm(N, mu_nw, sigma)
T_w <- D / V_w
T_nw <- D / V_nw

Fourth, calculate the time it takes to reach a decision and the winning accumulator for each
trial.

T_winner <- pmin(T_w, T_nw)
accumulator_winner <- ifelse(T_w == pmin(T_w, T_nw),
                             "word",
                             "non-word")

Finally, add this information to the data frame.

df_blp_sim <- df_blp_sim %>%
  mutate(rt = T_nd + T_winner,
         choice = accumulator_winner,
         nchoice = ifelse(choice == "word", 1, 2),
         acc = ifelse(lex == choice, 1, 0))

The Stan code for this model implements the non-centered parameterization for correlated
adjustments (see section 11.1.3 for more details). The model is shown below as
lnrace_h.stan :

functions {
  real lognormal_race2_lpdf(real T, int nchoice, real[] mu, real sigma){
    real lpdf;
    if(nchoice == 1)
      lpdf = lognormal_lpdf(T | mu[1], sigma) +
             lognormal_lccdf(T | mu[2], sigma);
    else
      lpdf = lognormal_lpdf(T | mu[2], sigma) +
             lognormal_lccdf(T | mu[1], sigma);
    return lpdf;
  }
}
data {
  int<lower = 1> N;
  int<lower = 1> N_subj;
  vector[N] c_lfreq;
  vector[N] c_lex;
  vector[N] rt;
  array[N] int nchoice;
  array[N] int subj;
}
transformed data{
  real min_rt = min(rt);
  real max_rt = max(rt);
  int N_re = 6;
}
parameters {
  array[2] real alpha;
  array[4] real beta;
  real<lower = 0> sigma;
  real<lower = 0, upper = min(rt)> T_nd;
  vector<lower = 0>[N_re] tau_u;
  matrix[N_re, N_subj] z_u;
  cholesky_factor_corr[N_re] L_u;
}
transformed parameters {
  matrix[N_subj, N_re] u;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
}
model {
  array[N] real log_lik;
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .5);
  target += normal_lpdf(sigma | .5, .2)
            - normal_lccdf(0 | .5, .2);
  target += normal_lpdf(T_nd | 150, 100)
            - log_diff_exp(normal_lcdf(min(rt) | 150, 100),
                           normal_lcdf(0 | 150, 100));
  target += normal_lpdf(tau_u | .1, .1)
            - N_re * normal_lccdf(0 | .1, .1);
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += std_normal_lpdf(to_vector(z_u));
  for(n in 1:N){
    real T = rt[n] - T_nd;
    real mu[2] = {alpha[1] + u[subj[n], 1] -
                  c_lex[n] * (beta[1] + u[subj[n], 2]) -
                  c_lfreq[n] * (beta[2] + u[subj[n], 3]),
                  alpha[2] + u[subj[n], 4] -
                  c_lex[n] * (beta[3] + u[subj[n], 5]) -
                  c_lfreq[n] * (beta[4] + u[subj[n], 6])};
    log_lik[n] = lognormal_race2_lpdf(T | nchoice[n], mu, sigma);
  }
  target += sum(log_lik);
}
generated quantities {
  corr_matrix[N_re] rho_u = L_u * L_u';
}

Store the simulated data in a list and fit it.

lnrace_h <- system.file("stan_models",
                        "lnrace_h.stan",
                        package = "bcogsci")
ls_blp_h_sim <- list(N = nrow(df_blp_sim),
                     N_subj = max(df_blp_sim$subj),
                     subj = df_blp_sim$subj,
                     rt = df_blp_sim$rt,
                     nchoice = df_blp_sim$nchoice,
                     c_lex = df_blp_sim$c_lex,
                     c_lfreq = df_blp_sim$c_lfreq)
fit_blp_h_sim <- stan(lnrace_h, data = ls_blp_h_sim)

The code below compares the posterior distributions of the relevant parameters of the model
with their true values, and plots them in Figure 20.8. The true value for all the correlations was
0.3 , but we need to correct the sign depending on whether or not there was a minus sign in
front of the adjustment when we built mu_w and mu_nw : For example, there is no minus
before u[subj, 1] , but there is one before u[subj, 2] , thus the true correlation between
u[subj, 1] and u[subj, 2] that we generated should be negative (plus times minus is
minus); and there is a minus before both u[subj, 2] and u[subj, 3] , and thus the
correlation between u[subj, 2] and u[subj, 3] should be positive (minus times minus is
positive).

rho_us <- c(paste0("rho_u[1,", 2:6, "]"),
            paste0("rho_u[2,", 3:6, "]"),
            paste0("rho_u[3,", 4:6, "]"),
            paste0("rho_u[4,", 5:6, "]"),
            "rho_u[5,6]")
corrs <- rho * c(-1, -1, 1, -1, -1, 1, -1, 1,
                 1, -1, 1, 1, -1, -1, 1)
true_values <- c(log(D) - alpha_w,
                 log(D) - alpha_nw,
                 beta_wlex,
                 beta_wlfreq,
                 beta_nwlex,
                 beta_nwlfreq,
                 T_nd,
                 sigma,
                 tau_u,
                 corrs)
par_names <- c("alpha",
               "beta",
               "T_nd",
               "sigma",
               "tau_u",
               rho_us)
estimates <- as.data.frame(fit_blp_h_sim) %>%
  select(contains(par_names))
mcmc_recover_hist(estimates, true_values)
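Instead of writing out the fifteen signs by hand, they can also be derived programmatically. A small sketch: the vector signs below encodes the sign with which each adjustment enters the Stan model (plus for u[subj, 1] and u[subj, 4] , minus for the rest), and the sign of each pairwise correlation is the product of the signs of the two adjustments involved:

# Sketch: derive the sign corrections programmatically. Because the
# matrix is symmetric, extracting the lower triangle in R's column-major
# order yields the same row-wise pair ordering as rho_us above.
signs <- c(1, -1, -1, 1, -1, -1)
sign_mat <- outer(signs, signs)
corrs2 <- rho * sign_mat[lower.tri(sign_mat)]
all(corrs2 == corrs) # TRUE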

FIGURE 20.8: Posterior distributions of the main parameters of the log-normal race model
fit_blp_h_sim together with their true values.

Figure 20.8 shows that we can recover the true values quite well, even if there is a great deal
of uncertainty in the posteriors of the correlations. As mentioned in previous chapters, a
more principled (and computationally demanding) approach uses simulation-based calibration;
this was introduced in section 12.2 (also see Talts et al. 2018; Schad, Betancourt, and
Vasishth 2020).

We are now ready to fit the model to the observed data.

Create a list with the real data and fit the same Stan model:

lnrace_h <- system.file("stan_models",
                        "lnrace_h.stan",
                        package = "bcogsci")
ls_blp_h <- list(N = nrow(df_blp),
                 N_subj = max(df_blp$subj),
                 subj = df_blp$subj,
                 rt = df_blp$rt,
                 nchoice = df_blp$nchoice,
                 c_lex = df_blp$c_lex,
                 c_lfreq = df_blp$c_lfreq)
fit_blp_h <- stan(lnrace_h, data = ls_blp_h)

Print the summary (omit the correlations for now).

print(fit_blp_h, pars = c("alpha",
                          "beta",
                          "T_nd",
                          "sigma",
                          "tau_u"))

## mean 2.5% 97.5% n_eff Rhat
## alpha[1] 6.83 6.75 6.91 610 1
## alpha[2] 6.64 6.58 6.69 1109 1
## beta[1] 0.37 0.31 0.42 2110 1
## beta[2] 0.06 0.06 0.07 1334 1
## beta[3] -0.14 -0.18 -0.09 2732 1
## beta[4] -0.07 -0.07 -0.06 4032 1
## T_nd 11.31 10.00 12.45 9007 1
## sigma 0.34 0.34 0.35 7575 1
## tau_u[1] 0.17 0.13 0.23 1505 1
## tau_u[2] 0.10 0.07 0.15 3076 1
## tau_u[3] 0.02 0.01 0.02 2108 1
## tau_u[4] 0.14 0.10 0.18 1840 1
## tau_u[5] 0.08 0.05 0.12 3344 1
## tau_u[6] 0.01 0.00 0.02 1335 1
Even though the model converged, the posterior summary shows a clear problem with the
model: The estimate of the non-decision time, T_nd, is less than 12 milliseconds! This is just
not possible; physiological research (Clark, Fan, and Hillyard 1994) shows that the eye-to-
brain lag, the time it takes for the visual features on the screen to be propagated from the
retina to the brain, is at least 50 milliseconds. Besides identifying the stimulus, the subjects
also need to initiate a motor response, which takes at least 100 milliseconds. How, then, is it
possible that we obtained this extremely fast non-decision time? The reason is that the
parameter T_nd is constrained to be between zero and the shortest reaction time.

Verify the shortest reaction time in the data set and how many observations are below 150
milliseconds:

min(df_blp$rt)

## [1] 16

sum(df_blp$rt < 150)

## [1] 2

This shows that some responses must have been initiated even before the stimulus was
presented! Next, we deal with contaminant observations (Ratcliff and Tuerlinckx 2002).

20.1.5 Dealing with contaminant responses

So far we have assumed that all the observations come from responses made after a
decision. But what happens if there are anticipations or lapses of attention, where the subject
responds either before the stimulus is presented, or after the stimulus is presented but
without attending to it? We are in a situation analogous to what we described before in
chapter 19 with the fast-guess model (Ollman 1966). There, we assumed that the behavior of
a subject would be a mixture of two distributions, one corresponding to a guessing mode of
responding and another one to a task-engaged mode. There are two major differences from
the fast-guess model, however. First, we assume here that guesses can be fast (e.g.,
anticipations) as well as slow. Second, here guesses occur in a minority of the cases, and
choice and response times are mostly explained by the log-normal race model. The
distribution that corresponds to these guesses is sometimes called a contaminant distribution
(Ratcliff and Tuerlinckx 2002). When the contaminant response times are outside the usual
range of response times (either shorter or longer), they can cause major problems in data
analysis, distorting estimates. As we saw before, extremely short response times caused by
anticipating the response can make it virtually impossible to estimate the non-decision time.

A recommended approach (e.g., Ratcliff and Tuerlinckx 2002) for dealing with this problem
that we follow here is to assume that the responses come from a mixture between the
sequential sampling model (in this case the log-normal race model) and a uniform distribution
bounded at the minimum and maximum observed response time.

The new likelihood function will look as follows:

p(rt_n, choice_n) = θ_c ⋅ Uniform(rt_n | min(rt), max(rt)) ⋅ Bernoulli(choice_n | θ_bias) +
                    (1 − θ_c) ⋅ p_lnrace(rt_n, choice_n | μ′, σ)    (20.4)

The first term of the sum represents the contaminant component, which occurs with probability
θ_c and has a likelihood that depends on the response time, represented with the uniform PDF,
and on the response given, represented with a Bernoulli PMF. When a subject is guessing, the
likelihood of each choice depends on θ_bias.

The second term of the likelihood represents the log-normal race model, which occurs with
probability 1 − θ_c. We use p_lnrace(rt_n, choice_n) as a shorthand for the following function
(which we have already used in the models before):

p_lnrace(rt_n, choice_n) =
  { LogNormal(T_n | μ′_{w,n}, σ) ⋅ (1 − LogNormal_CDF(T_n | μ′_{nw,n}, σ)),   if choice = word
  { LogNormal(T_n | μ′_{nw,n}, σ) ⋅ (1 − LogNormal_CDF(T_n | μ′_{w,n}, σ)),   otherwise

To simplify the model, we assume that contaminant responses are completely random (i.e.,
there is no bias toward word or non-word). This assumption is encoded in the model by setting
θ_bias = 0.5.54 This makes Bernoulli(choice_n | θ_bias) = .5.
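As an aside, the mixture computation that Stan will carry out below can be sketched in R with made-up values; matrixStats::logSumExp() plays the role of Stan's log_sum_exp :

# Sketch (made-up values): log-likelihood of one trial as a mixture of a
# uniform contaminant component and the log-normal race component.
library(matrixStats)
rt_n <- 600; min_rt_ex <- 16; max_rt_ex <- 7800
theta_c <- .02 # assumed mixing proportion
lp_lnrace <- -7 # assumed log-density under the log-normal race model
logSumExp(c(log(theta_c) + dunif(rt_n, min_rt_ex, max_rt_ex, log = TRUE) +
              log(.5),
            log1p(-theta_c) + lp_lnrace))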

For this model to converge, we need to assume that θ_c is much smaller than 1. This is a
sensible assumption for this particular model, since the contaminant distribution is assumed
to apply in only a minority of the cases. We set the following prior on θ_c:

θ_c ∼ Beta(0.9, 70)

By setting the first parameter of the beta distribution to a number smaller than 1, we get a
distribution of possible probabilities with a “horn” on the left; see Figure 20.9. Our prior belief
for θ_c has mean 0.007, and its 95% CrI is [0, 0.048].55
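The interval implied by this prior can be checked directly in R:

# The 95% interval implied by the Beta(0.9, 70) prior;
# this yields approximately [0, 0.048].
qbeta(c(.025, .975), shape1 = .9, shape2 = 70)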

ggplot(data = tibble(theta_c = c(0, 1)), aes(theta_c)) +
  stat_function(fun = dbeta,
                args = list(shape1 = .9, shape2 = 70)) +
  ylab("density")


FIGURE 20.9: Prior distribution for θc . Most of the probability mass is close to 0.

We also want to “push” the non-decision time further away from zero to get more realistic
values. For this reason, we increase the informativity of the prior of T_nd. A log-normal prior
discourages values too close to zero, even with a location similar (on the log scale) to that of
the truncated normal prior. We settle on the following prior:

T_nd ∼ LogNormal(log(150), .6)

sdlog <- .6
lq <- qlnorm(.025, log(150), sdlog)
hq <- qlnorm(.975, log(150), sdlog)
ggplot(data = tibble(T_nd = c(0, 1000)), aes(T_nd)) +
  stat_function(fun = dlnorm,
                args = list(meanlog = log(150), sdlog = sdlog)) +
  geom_vline(xintercept = lq, linetype = "dashed") +
  geom_vline(xintercept = hq, linetype = "dashed") +
  geom_text(label = round(lq), x = lq - 50, y = 0.0025) +
  geom_text(label = round(hq), x = hq + 50, y = 0.0025) +
  ylab("density")


FIGURE 20.10: Prior distribution for Tnd . The dashed lines show the 95% credible interval.
Finally, we want to be able to account for response times that are actually faster than the non-
decision time. If the observed response time, rt , is smaller than the non-decision time,
T_nd , we can be sure that the observation belongs to the contaminant distribution, because
otherwise the decision time T would be negative. This means that in this case, the log-normal
race likelihood is 0 (and its logarithm is negative infinity). When T < 0, the log-likelihood of our
model is log(θ_c ⋅ Uniform(rt_n | min, max) ⋅ 0.5 + (1 − θ_c) ⋅ p_lnrace) with p_lnrace = 0. This
means that we only use the following code:

target += log(theta_c) + uniform_lpdf(rt[n] | min_rt, max_rt) +
          log(0.5);

This also means that we need to relax the constraint on T_nd ; it doesn't need to be smaller
than the smallest observed response time, since some of the observations are responses
from the contaminant distribution:

real<lower = 0> T_nd;

When T > 0, the likelihood is a mixture of the contaminant distribution and the log-normal
race model, as defined in (20.4). We use log_sum_exp exactly as we did in sections 19.1.2
and 19.1.3 for the fast-guess model:

target += log_sum_exp(
  log(theta_c) + uniform_lpdf(rt[n] | min_rt, max_rt)
  + log(0.5),
  log1m(theta_c) + lognormal_race2_lpdf(T[n] | nchoice[n],
                                        mu, sigma));

Finally, the complete Stan code for this model is shown below as lnrace_h_cont.stan :

functions {
  real lognormal_race2_lpdf(real T, int nchoice, real[] mu, real sigma){
    real lpdf;
    if(nchoice == 1)
      lpdf = lognormal_lpdf(T | mu[1], sigma) +
             lognormal_lccdf(T | mu[2], sigma);
    else
      lpdf = lognormal_lpdf(T | mu[2], sigma) +
             lognormal_lccdf(T | mu[1], sigma);
    return lpdf;
  }
}
data {
  int<lower = 1> N;
  int<lower = 1> N_subj;
  vector[N] c_lfreq;
  vector[N] c_lex;
  vector[N] rt;
  array[N] int nchoice;
  array[N] int subj;
}
transformed data{
  real min_rt = min(rt);
  real max_rt = max(rt);
  int N_re = 6;
}
parameters {
  array[2] real alpha;
  array[4] real beta;
  real<lower = 0> sigma;
  real<lower = 0> T_nd;
  real<lower = 0, upper = 1> theta_c;
  vector<lower = 0>[N_re] tau_u;
  matrix[N_re, N_subj] z_u;
  cholesky_factor_corr[N_re] L_u;
}
transformed parameters {
  matrix[N_subj, N_re] u;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
}
model {
  array[N] real log_lik;
  target += normal_lpdf(alpha | 6, 1);
  target += normal_lpdf(beta | 0, .5);
  target += normal_lpdf(sigma | .5, .2)
            - normal_lccdf(0 | .5, .2);
  target += lognormal_lpdf(T_nd | log(150), .6);
  target += beta_lpdf(theta_c | .9, 70);
  target += normal_lpdf(tau_u | .1, .1)
            - N_re * normal_lccdf(0 | .1, .1);
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += std_normal_lpdf(to_vector(z_u));
  for(n in 1:N){
    real T = rt[n] - T_nd;
    if(T > 0){
      real mu[2] = {alpha[1] + u[subj[n], 1] -
                    c_lex[n] * (beta[1] + u[subj[n], 2]) -
                    c_lfreq[n] * (beta[2] + u[subj[n], 3]),
                    alpha[2] + u[subj[n], 4] -
                    c_lex[n] * (beta[3] + u[subj[n], 5]) -
                    c_lfreq[n] * (beta[4] + u[subj[n], 6])};
      log_lik[n] = log_sum_exp(
        log(theta_c) + uniform_lpdf(rt[n] | min_rt, max_rt)
        + log(.5),
        log1m(theta_c) + lognormal_race2_lpdf(T | nchoice[n], mu, sigma));
    } else {
      // T < 0: the observed time is smaller than the non-decision time
      log_lik[n] = log(theta_c) + uniform_lpdf(rt[n] | min_rt, max_rt)
                   + log(.5);
    }
  }
  target += sum(log_lik);
}
In practice, we should verify that this model can recover the true values of its parameters by simulating data, fitting the model to the simulated data, and using simulation-based calibration. We skip these steps here.
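As a sketch of what the first step could look like, the following R code simulates response times and choices of a single subject from a log-normal race model with a contaminant distribution; all parameter values and the contaminant range are assumptions chosen for illustration, and the predictors c_lex and c_lfreq are ignored:

N <- 1000
alpha <- c(6.2, 6)
sigma <- 0.6
T_nd <- 150
theta_c <- 0.02
# Is each trial a contaminant?
cont <- rbinom(N, 1, theta_c)
# Finishing times of the two accumulators:
accum1 <- rlnorm(N, alpha[1], sigma)
accum2 <- rlnorm(N, alpha[2], sigma)
# Contaminant responses are uniform; otherwise the faster
# accumulator determines the response time and the choice:
rt <- ifelse(cont == 1,
             runif(N, 200, 2000),
             pmin(accum1, accum2) + T_nd)
nchoice <- ifelse(cont == 1,
                  rbinom(N, 1, 0.5) + 1,
                  (accum1 > accum2) + 1)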

Store the real data in a list and fit the model:

lnrace_h_cont <- system.file("stan_models",
                             "lnrace_h_cont.stan",
                             package = "bcogsci")
ls_blp_h <- list(N = nrow(df_blp),
                 N_subj = max(df_blp$subj),
                 subj = df_blp$subj,
                 rt = df_blp$rt,
                 nchoice = df_blp$nchoice,
                 c_lex = df_blp$c_lex,
                 c_lfreq = df_blp$c_lfreq)
fit_blp_h_cont <- stan(lnrace_h_cont, data = ls_blp_h)

This model takes more than a day to finish on a relatively powerful computer and, disappointingly, it doesn't converge; this non-convergence is apparent from the traceplots in Figure 20.11. We'll see later that even though the converging model finishes faster, it still takes a considerable amount of time. If one has a powerful computer (with, for example, multiple cores) available, it is possible to parallelize the sampling further than what we did so far. This is possible with special functions that allow for multithreading, which are discussed in the Stan user guide (Stan Development Team 2021, ch. 25).
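To give a flavor of what this looks like, here is a minimal sketch of within-chain parallelization with reduce_sum, shown for a simple normal likelihood rather than the race model (the data and parameter names are placeholders; adapting the race likelihood would follow the same pattern):

functions {
  real partial_sum(array[] real y_slice, int start, int end,
                   real mu, real sigma) {
    return normal_lpdf(y_slice | mu, sigma);
  }
}
data {
  int<lower = 1> N;
  array[N] real y;
}
parameters {
  real mu;
  real<lower = 0> sigma;
}
model {
  int grainsize = 100;
  // reduce_sum splits y into slices and adds up the partial
  // log-likelihoods in parallel threads:
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma);
}

Returning to the non-converging fit, we can inspect the chains with traceplots: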

traceplot(fit_blp_h_cont, pars = c("alpha",
                                   "beta",
                                   "T_nd",
                                   "theta_c",
                                   "sigma",
                                   "tau_u"))
[Traceplots of alpha[1], alpha[2], beta[1]-beta[4], T_nd, theta_c, sigma, and tau_u[1]-tau_u[6], with four chains each.]

FIGURE 20.11: The traceplots of fit_blp_h_cont show that the chains are clearly not mixing well for the real data set.
The traceplots in Figure 20.11 show that the chains get stuck and don't mix well. It seems that there is not enough information to constrain the model. If we look at the parameter theta_c, which represents the mixing proportion between the contaminant and the log-normal race distribution, we see that its chains are getting stuck at very unlikely values of over 0.25. Although it is in general not recommended to cut off values from a prior just because they're unlikely, in this case restricting the parameter theta_c to be smaller than 0.1 helps solve the convergence problems. To truncate the prior for θ_c in Stan, we declare the parameter to have an upper bound of 0.1:

real<lower = 0, upper = 0.1> theta_c;

Change its prior distribution in the model block to the following:

target += beta_lpdf(theta_c | 0.9, 70) -
  beta_lcdf(0.1 | 0.9, 70);
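The truncation removes very little prior mass: the proportion of the Beta(0.9, 70) prior above 0.1 can be verified in R:

pbeta(0.1, 0.9, 70, lower.tail = FALSE)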
If we were to fit this new model, we would see that some chains of T_nd mix well around values of 300 ms, while other chains sometimes get stuck at values very close to zero. This indicates that the model needs more information regarding the non-decision time. Another issue that slows down convergence is that the parameter T_nd is on a different scale (with a value above 100) than the rest of the parameters (with values below 10). Rather than sampling from T_nd, we sample from lT_nd in a new model, so that T_nd = exp(lT_nd). We assign the following prior to lT_nd:

lT_nd ∼ Normal(log(200), 0.3)

This is mathematically equivalent to assigning the following prior to T_nd:

T_nd ∼ LogNormal(log(200), 0.3)
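A minimal sketch of the relevant change is the following (assuming that the rest of the model stays as in lnrace_h_cont.stan); since the prior is placed directly on lT_nd, no Jacobian adjustment is needed:

parameters {
  real lT_nd; // log of the non-decision time
}
transformed parameters {
  real<lower = 0> T_nd = exp(lT_nd);
}
model {
  target += normal_lpdf(lT_nd | log(200), .3);
}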

The complete new version of the model can be found as lnrace_h_contb.stan in the bcogsci package. Fit the modified model to the same data.

lnrace_h_contb <- system.file("stan_models",
                              "lnrace_h_contb.stan",
                              package = "bcogsci")
fit_blp_h_contb <- stan(lnrace_h_contb, data = ls_blp_h)

This time the model finishes considerably faster, although it still takes nine hours. However, the model does converge and the posterior distribution makes sense now.

Print the summary of the main parameters:

print(fit_blp_h_contb, pars = c("alpha",
                                "beta",
                                "T_nd",
                                "sigma",
                                "theta_c",
                                "tau_u"))
##             mean   2.5%  97.5% n_eff Rhat
## alpha[1]    6.36   6.23   6.48   565 1.01
## alpha[2]    6.02   5.91   6.14  1133 1.00
## beta[1]     0.60   0.52   0.68  2339 1.00
## beta[2]     0.13   0.11   0.14  1969 1.00
## beta[3]    -0.23  -0.30  -0.16  3036 1.00
## beta[4]    -0.12  -0.14  -0.11  3454 1.00
## T_nd      315.66 312.48 318.63  5504 1.00
## sigma       0.62   0.61   0.62  5065 1.00
## theta_c     0.01   0.00   0.01  6811 1.00
## tau_u[1]    0.29   0.23   0.37  1718 1.00
## tau_u[2]    0.15   0.10   0.21  2321 1.00
## tau_u[3]    0.02   0.02   0.04  1879 1.00
## tau_u[4]    0.26   0.20   0.33  2173 1.00
## tau_u[5]    0.12   0.07   0.19  3071 1.00
## tau_u[6]    0.02   0.00   0.04  1255 1.00

Print the summary of the correlations:

print(fit_blp_h_contb, pars = rho_us)


##             mean  2.5% 97.5% n_eff Rhat
## rho_u[1,2] -0.06 -0.42  0.29  2798    1
## rho_u[1,3]  0.21 -0.15  0.53  2295    1
## rho_u[1,4]  0.61  0.33  0.80  2288    1
## rho_u[1,5]  0.00 -0.38  0.37  3742    1
## rho_u[1,6] -0.07 -0.54  0.42  5335    1
## rho_u[2,3] -0.39 -0.73  0.09  1935    1
## rho_u[2,4]  0.06 -0.30  0.43  2046    1
## rho_u[2,5] -0.51 -0.84 -0.06  2565    1
## rho_u[2,6] -0.34 -0.79  0.28  3303    1
## rho_u[3,4]  0.04 -0.33  0.40  2213    1
## rho_u[3,5]  0.35 -0.12  0.73  3550    1
## rho_u[3,6]  0.00 -0.53  0.53  3869    1
## rho_u[4,5] -0.33 -0.67  0.06  3976    1
## rho_u[4,6] -0.06 -0.53  0.44  4631    1
## rho_u[5,6]  0.04 -0.52  0.62  3615    1
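Incidentally, rho_u does not appear in the Stan code displayed above; as in previous chapters, we assume that the correlation matrix is reconstructed from its Cholesky factor L_u in a generated quantities block along the following lines (a sketch of that convention, to be appended to the model):

generated quantities {
  corr_matrix[N_re] rho_u = multiply_lower_tri_self_transpose(L_u);
}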

What can we say about the fit of the model now?

Under the assumptions that we have made, we can look at the parameters and conclude the
following:

- All other things being equal, there is an overall bias to respond non-word rather than word. We can deduce this because the parameters α represent the boundary separation of each accumulator minus its rate of accumulation (see equation (20.3)): a smaller alpha indicates a closer evidence boundary and/or a faster rate for a given accumulator, and here alpha[2] is smaller than alpha[1]. However, the fact that tau_u[1] is relatively large suggests large individual differences.
- The task seems to have been well understood given that, when a word appears, the rate of accumulation of the word accumulator increases (beta[1] > 0), and the rate of the non-word accumulator decreases (beta[3] < 0).
- As expected, and replicating previous findings in the literature, words with higher frequency are easier to identify correctly as words than lower-frequency words (beta[2] > 0 and beta[4] < 0).
- The non-decision time (T_nd) is relatively long, considering that the normal reading of words in a sentence takes around 200-400 ms (Rayner 1998).
- The proportion of contaminant responses is quite small (1%), but without taking them into account, it would not be possible to estimate the non-decision time.
- Subjects that are faster to answer in the word trials tend also to be faster to answer in the non-word trials. We can deduce this from the high correlation between the by-subject adjustments to the parameters alpha (rho_u[1,4] is much larger than 0).
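Conclusions like the first one can be checked directly with the posterior draws. For example, the posterior probability that alpha[1] exceeds alpha[2] (i.e., that there is a bias toward non-word responses) is simply the proportion of draws in which this holds; a minimal sketch:

draws_alpha <- as.matrix(fit_blp_h_contb, pars = "alpha")
# Posterior probability that alpha[1] > alpha[2]:
mean(draws_alpha[, "alpha[1]"] > draws_alpha[, "alpha[2]"])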

Our assumptions include both the likelihood we have chosen and the priors. It's clear from the difficulties in fitting the data that the model is very sensitive to the choice of priors. Since there does not seem to be enough information in the data, we need to provide information through the prior distributions.

20.2 Posterior predictive check with the quantile probability plots

As in the previous chapter, we can write the generated quantities block for the posterior predictive checks in a new file. The advantage is that we can generate as many observations as needed after estimating the parameters. The model block in the following Stan program is empty, since no parameters are estimated. Furthermore, the gqs() function requires a Stan model without transformed parameters; for this reason, this model includes u in the parameters block rather than in transformed parameters. The complete Stan code for this model is shown below as lnrace_h_contb_gen.stan.

data {
  int<lower = 1> N;
  int<lower = 1> N_subj;
  vector[N] c_lfreq;
  vector[N] c_lex;
  vector[N] rt;
  array[N] int nchoice;
  array[N] int subj;
}
transformed data {
  real min_rt = min(rt);
  real max_rt = max(rt);
  int N_re = 6;
}
parameters {
  array[2] real alpha;
  array[4] real beta;
  real<lower = 0> sigma;
  real<lower = 0> T_nd;
  real<lower = 0, upper = .1> theta_c;
  vector<lower = 0>[N_re] tau_u;
  matrix[N_subj, N_re] u;
}
model {
}
generated quantities {
  array[N] real rt_pred;
  array[N] real nchoice_pred;
  for (n in 1:N) {
    real T = rt[n] - T_nd;
    real mu[2] = {alpha[1] + u[subj[n], 1] -
                  c_lex[n] * (beta[1] + u[subj[n], 2]) -
                  c_lfreq[n] * (beta[2] + u[subj[n], 3]),
                  alpha[2] + u[subj[n], 4] -
                  c_lex[n] * (beta[3] + u[subj[n], 5]) -
                  c_lfreq[n] * (beta[4] + u[subj[n], 6])};
    real cont = bernoulli_rng(theta_c);
    if (cont == 1) {
      rt_pred[n] = uniform_rng(min_rt, max_rt);
      nchoice_pred[n] = bernoulli_rng(0.5) + 1;
    } else {
      real accum1 = lognormal_rng(mu[1], sigma);
      real accum2 = lognormal_rng(mu[2], sigma);
      rt_pred[n] = fmin(accum1, accum2) + T_nd;
      nchoice_pred[n] = (accum1 > accum2) + 1;
    }
  }
}

Compile the model lnrace_h_contb_gen.stan and extract 200 draws of each parameter appearing in its parameters block from the previous fit:

lnrace_h_gen <- system.file("stan_models",
                            "lnrace_h_contb_gen.stan",
                            package = "bcogsci")
gen_model <- stan_model(lnrace_h_gen)
draws_par <- as.matrix(fit_blp_h_contb,
                       pars = c("alpha",
                                "beta",
                                "sigma",
                                "T_nd",
                                "theta_c",
                                "tau_u",
                                "u"))[1:200, , drop = FALSE]

Use the function gqs() (this function draws samples of generated quantities from the Stan
model) to generate responses from 200 simulated experiments:

gen_race_data <- gqs(gen_model,
                     data = ls_blp_h,
                     draws = draws_par)
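Each generated quantity is stored as a matrix of draws by observations; we can verify that we obtained 200 draws for each of the 24000 observations:

# Returns c(200, 24000), i.e., draws x observations:
dim(rstan::extract(gen_race_data)$rt_pred)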

One can examine the general distribution of response times generated by the posterior
predictive model, or the effect of the experimental manipulations on response times, as we did
in section 19.1.5.3.

However, it is more informative to look at the quantile probability plot of the posterior predictive distribution. To create this plot, first extract the predicted response times and choices, and match each of them to its corresponding observation in the corresponding simulation. Before we can use qpf(), we need a data frame identical to the one with the observed data, df_blp, but one that also includes the response time and choice for each observation of each simulation (indicated with the column sim). The code below yields a data frame called df_blp_pred_qpf, which has the quantile probability information of the observed data and of each one of the 200 simulations.

First, extract the array of predicted response times from gen_race_data and transform it to a long format:

df_rt <- rstan::extract(gen_race_data)$rt_pred %>%
  # Convert a matrix of 200 x 24000 (iterations x N observations)
  # into a data frame where each column is V1, ..., V24000:
  as.data.frame() %>%
  # Add a column which identifies each iteration as a simulation:
  mutate(sim = 1:n()) %>%
  # Pivot the data frame so that it has length 200 * 24000.
  # Each row indicates:
  # - sim: from which simulation the observation is coming
  # - obs_id: identifies the 24000 observations
  # - rt_pred: simulated RT
  # Since each observation is in a column starting with V,
  # `names_prefix` removes the "V":
  pivot_longer(cols = -sim,
               names_to = "obs_id",
               names_prefix = "V",
               values_to = "rt_pred") %>%
  # Make sure that obs_id is a number (and not a
  # number represented as a character):
  mutate(obs_id = as.numeric(obs_id))
df_rt

## # A tibble: 4,800,000 × 3
##     sim obs_id rt_pred
##   <int>  <dbl>   <dbl>
## 1     1      1    483.
## 2     1      2    488.
## 3     1      3    623.
## # … with 4,799,997 more rows

Second, extract the array of predicted choice (1 for words and 2 for non-words) and transform
it into a long format:

df_nchoice <- rstan::extract(gen_race_data)$nchoice_pred %>%
  as.data.frame() %>%
  mutate(sim = 1:n()) %>%
  pivot_longer(cols = -sim,
               names_to = "obs_id",
               names_prefix = "V",
               values_to = "nchoice_pred") %>%
  mutate(obs_id = as.numeric(obs_id))
df_nchoice

## # A tibble: 4,800,000 × 3
##     sim obs_id nchoice_pred
##   <int>  <dbl>        <dbl>
## 1     1      1            2
## 2     1      2            2
## 3     1      3            2
## # … with 4,799,997 more rows

Third, create a new data frame with the characteristics of the stimuli, the predictions, and the observations. The predictions come from 200 simulated data sets; each simulation is indexed with sim (whereas for the empirical observations, sim is set to NA):

df_blp_main <- df_blp %>%
  select(subj, lex, lfreq, rt, nchoice) %>%
  mutate(obs_id = 1:n())
df_blp_pred <- left_join(df_rt, df_nchoice) %>%
  left_join(select(df_blp_main, -rt, -nchoice)) %>%
  rename(rt = rt_pred, nchoice = nchoice_pred) %>%
  bind_rows(df_blp_main) %>%
  mutate(acc = ifelse((nchoice == 1 & lex == "word") |
                        (nchoice == 2 & lex == "non-word"), 1, 0))
df_blp_pred

## # A tibble: 4,824,000 × 8
##     sim obs_id    rt nchoice  subj lex      lfreq   acc
##   <int>  <dbl> <dbl>   <dbl> <dbl> <chr>    <dbl> <dbl>
## 1     1      1  483.       2     1 non-word -4.61     1
## 2     1      2  488.       2     1 non-word -4.61     1
## 3     1      3  623.       2     1 non-word -4.61     1
## # … with 4,823,997 more rows

Finally, create a data frame with the results of the quantile probability function applied by
simulation and by frequency:

df_blp_pred_qpf <- df_blp_pred %>%
  # Subset only words:
  filter(lex == "word") %>%
  # Create 5 word frequency groups:
  mutate(freq_group =
           cut(lfreq,
               quantile(lfreq, c(0, .2, .4, .6, .8, 1)),
               include.lowest = TRUE,
               labels =
                 c("0-.2", ".2-.4", ".4-.6", ".6-.8", ".8-1"))) %>%
  # Group by condition and subject:
  group_by(freq_group, sim, subj) %>%
  # Apply the quantile probability function:
  qpf() %>%
  # Group again, removing subj:
  group_by(freq_group, sim, q, response) %>%
  # Get averages of all the quantities:
  summarize(rt_q = mean(rt_q),
            p = mean(p))

Now, plot the results with the code shown below (Figure 20.12).

ggplot(df_blp_pred_qpf %>% filter(sim < 200), aes(x = p, y = rt_q)) +
  geom_point(shape = 4, alpha = 0.1) +
  geom_line(aes(group = interaction(q, response, sim)),
            alpha = 0.1,
            color = "grey") +
  ylab("log-transformed RT quantiles (ms)") +
  xlab("Response proportion") +
  geom_point(data = df_blp_pred_qpf %>% filter(is.na(sim)),
             shape = 4) +
  geom_line(data = df_blp_pred_qpf %>% filter(is.na(sim)),
            aes(group = interaction(q, response))) +
  coord_trans(y = "log")


FIGURE 20.12: Quantile probability plots showing the 0.1, 0.3, 0.5, 0.7, and 0.9 response time quantiles plotted against the proportion of incorrect responses (left) and the proportion of correct responses (right), for words of different frequencies. Word frequencies are grouped according to quantiles: the first group contains words with frequencies smaller than the 0.2-th quantile, the second group contains words with frequencies between the 0.2-th and the 0.4-th quantile, and so forth. The summary of the observed data is plotted in black and the summaries of the synthetic data sets are plotted in gray.
Figure 20.12 shows the quantile probability plots of the observed and simulated data. We
would expect the observed quantile probability summaries (in black) to fall within those of the
posterior predictive distribution (in grey).

The fit is clearly bad. One major issue is the misfit at the left side of the plot: the model is unable to capture fast errors. Low-probability responses must be slow, because having a slow rate and "losing" the race often is what makes their probability low. Other sequential sampling models, such as the linear ballistic accumulator and the drift diffusion model, can account for fast errors under the assumption that they happen in trials where there is a strong initial bias toward the wrong response; this bias occurs due to random variation in the starting points of the accumulators. This characterization of fast errors is not possible for the log-normal race model, because bias and rate effects combine to determine the location of the distribution (Heathcote and Love 2012). Heathcote and Love (2012) point out that the log-normal race model can still produce fast errors if the scale of the accumulator that corresponds to the incorrect choice is larger than the scale of the accumulator of the correct choice. We leave it as an exercise for the reader to verify that a log-normal race model with a scale that depends on the stimuli improves the fit (see Exercise 20.2).

20.3 Summary

In this chapter, we learned to fit what is arguably the simplest sequential sampling model, a log-normal race model with equal scales (or variances), starting with one subject, continuing with a fully hierarchical model, and finally incorporating a contaminant distribution by using a mixture model. We saw how to evaluate the model's fit of response times and choices using quantile probability plots. We saw that the log-normal race model with equal scales is unable to account for fast errors, and that a more complex model is needed. Crucially, many of the techniques explained in this chapter (e.g., including a shift in the distribution, mixture distributions for dealing with contaminated responses, and quantile probability plots) can be used with virtually any type of model that fits response times and choice. One downside of this type of model is that it takes a long time to fit.

20.4 Further reading

The log-normal race model was first introduced in Heathcote and Love (2012), and its first
Bayesian implementation is described in Rouder et al. (2015). The log-normal race model is
closely connected to the retrieval process from memory that is assumed in the cognitive
architecture ACT-R; see, for example, Nicenboim and Vasishth (2018); Fisher, Houpt, and
Gunzelmann (2022); Lissón et al. (2021). Heathcote et al. (2019) outline how to fit evidence-
accumulation models in a Bayesian framework with a custom R package that relies on a
Differential-Evolution sampler (DE-MCMC; Turner et al. 2013).

20.5 Exercises

Exercise 20.1 Can we recover the true values of the parameters of a model when dealing with a contaminant distribution?

In Section 20.1.5, we fit a hierarchical model that assumed a contaminant distribution (lnrace_h_cont.stan) without first verifying that we can recover the true values of its parameters from simulated data. An important first step would be to work with a non-hierarchical version of this model.

1. Generate data for one subject as in section 20.1.2, but assume a contaminant distribution as in section 20.1.5.
2. Fit a non-hierarchical version of lnrace_h_cont.stan without restricting the parameter theta_c to be smaller than 0.1.
3. Plot the posterior distributions of the model and verify that you can recover the true values of the parameters.

Exercise 20.2 Can the log-normal race model account for fast errors?

Subject 13 shows fast errors for incorrect responses. This can be seen in the left side of the quantile probability plot in Figure 20.13.

1. Fit a log-normal race model (with equal scales for the two accumulators) that accounts for contaminant responses.
2. Fit a variation of this model in which the accumulator's scale is affected by whether or not the lexicality of the string matches the accumulator.
3. Visualize the fit of each model with quantile probability plots.
4. Use cross-validation to compare the models.

Notice that the models should be fit to only one subject, and they should not have a hierarchical structure.

FIGURE 20.13: Quantile probability plot showing the 0.1, 0.3, 0.5, 0.7, and 0.9 response time quantiles plotted against the proportion of incorrect responses (left) and the proportion of correct responses (right), for words of different frequencies, for subject number 13.
Exercise 20.3 Accounting for response time and choice in the lexical decision task using the log-normal race model.

In Chapter 19, we modeled the data of the global motion detection task from Dutilh et al. (2011) (df_dots) using a mixture model. Now, we'll investigate what happens if we fit a log-normal race model to the same data. As a reminder, in this type of task, subjects see a number of random dots on the screen, of which a proportion move in a single direction (left or right) while the rest move in random directions. The goal of the task is to estimate the overall direction of the movement. In this data set, there are two difficulty levels (diff) and two types of instructions (emphasis) that focus on accuracy or speed. (More information about the data set can be found by loading the bcogsci package and typing ?df_dots in the R console.) For the sake of speed, we'll fit only one subject from this data set.

1. Before modeling the data, show the relationship between response times and accuracy with a quantile probability plot that shows quantiles and accuracy of the easy and hard difficulty conditions.
2. Fit a non-hierarchical log-normal race model to account for how both choice and response time are affected by task difficulty and emphasis. Assume no contaminant distribution of responses.

Note that the direction of the dots is indicated with stim; when stim and resp match (both are L, left, or both are R, right), the accuracy, acc, is 1. For modeling this task with a log-normal race model, the difficulty of the task should be coded in a way that reflects that the stimuli will be harder to detect for the relevant accumulator. One way to do this is the following:

df_dots_subset <- df_dots %>%
  filter(subj == 1)
df_dots_subset <- df_dots_subset %>%
  mutate(c_diff = case_when(stim == "L" & diff == "easy" ~ .5,
                            stim == "L" & diff == "hard" ~ -.5,
                            stim == "R" & diff == "easy" ~ -.5,
                            stim == "R" & diff == "hard" ~ .5))

4. Expand the previous model including a contaminant distribution of responses.
5. Visualize the fit of the two previous models by doing posterior predictive checks using quantile probability plots.
6. Use cross-validation to compare the models.

References

Audley, RJ, and AR Pike. 1965. “Some Alternative Stochastic Models of Choice 1.” British Journal of Mathematical and Statistical Psychology 18 (2). Wiley Online Library: 207–25.

Brown, Scott D., and Andrew Heathcote. 2008. “The Simplest Complete Model of Choice Response Time: Linear Ballistic Accumulation.” Cognitive Psychology 57 (3): 153–78. https://doi.org/10.1016/j.cogpsych.2007.12.002.

Brown, Scott, and Andrew Heathcote. 2005. “A Ballistic Model of Choice Response Time.” Psychological Review 112 (1). American Psychological Association: 117.

Brysbaert, Marc, Paweł Mandera, and Emmanuel Keuleers. 2018. “The Word Frequency Effect in Word Processing: An Updated Review.” Current Directions in Psychological Science 27 (1): 45–50. https://doi.org/10.1177/0963721417727521.

Clark, Vincent P, Silu Fan, and Steven A Hillyard. 1994. “Identification of Early Visual Evoked Potential Generators by Retinotopic and Topographic Analyses.” Human Brain Mapping 2 (3). Wiley Online Library: 170–87.

Dufau, Stéphane, Jonathan Grainger, and Johannes C Ziegler. 2012. “How to Say ‘No’ to a Nonword: A Leaky Competing Accumulator Model of Lexical Decision.” Journal of Experimental Psychology: Learning, Memory, and Cognition 38 (4). American Psychological Association: 1117.

Dutilh, Gilles, Eric-Jan Wagenmakers, Ingmar Visser, and Han L. J. van der Maas. 2011. “A Phase Transition Model for the Speed-Accuracy Trade-Off in Response Time Experiments.” Cognitive Science 35 (2): 211–50. https://doi.org/10.1111/j.1551-6709.2010.01147.x.

Fisher, Christopher R, Joseph W Houpt, and Glenn Gunzelmann. 2022. “Fundamental Tools for Developing Likelihood Functions Within ACT-R.” Journal of Mathematical Psychology 107. Elsevier: 102636.

Heathcote, Andrew, Yi-Shin Lin, Angus Reynolds, Luke Strickland, Matthew Gretton, and Dora Matzke. 2019. “Dynamic Models of Choice.” Behavior Research Methods 51 (2). Springer: 961–85.

Heathcote, Andrew, and Jonathon Love. 2012. “Linear Deterministic Accumulator Models of Simple Choice.” Frontiers in Psychology 3: 292. https://doi.org/10.3389/fpsyg.2012.00292.

Keuleers, Emmanuel, Paula Lacey, Kathleen Rastle, and Marc Brysbaert. 2012. “The British Lexicon Project: Lexical Decision Data for 28,730 Monosyllabic and Disyllabic English Words.” Behavior Research Methods 44 (1). Springer: 287–304.

Lissón, Paula, Dorothea Pregla, Bruno Nicenboim, Dario Paape, Mick van het Nederend, Frank Burchert, Nicole Stadie, David Caplan, and Shravan Vasishth. 2021. “A Computational Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia.” Cognitive Science 45 (4): e12956. https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.12956.

Nelson, Peter R. 1981. “The Algebra of Random Variables.” Technometrics 23 (2). Taylor & Francis: 197–98. https://doi.org/10.1080/00401706.1981.10486266.

Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” Journal of Memory and Language 99: 1–34. https://doi.org/10.1016/j.jml.2017.08.004.

Ollman, Robert. 1966. “Fast Guesses in Choice Reaction Time.” Psychonomic Science 6 (4). Springer: 155–56.

Ratcliff, Roger. 1978. “A Theory of Memory Retrieval.” Psychological Review 85 (2). American Psychological Association: 59.

Ratcliff, Roger, Philip L. Smith, Scott D. Brown, and Gail McKoon. 2016. “Diffusion Decision Model: Current Issues and History.” Trends in Cognitive Sciences 20 (4): 260–81. https://doi.org/10.1016/j.tics.2016.01.007.

Ratcliff, Roger, and Francis Tuerlinckx. 2002. “Estimating Parameters of the Diffusion Model: Approaches to Dealing with Contaminant Reaction Times and Parameter Variability.” Psychonomic Bulletin & Review 9 (3). Springer: 438–81.

Rayner, K. 1998. “Eye movements in reading and information processing: 20 years of research.” Psychological Bulletin 124 (3): 372–422.

Rouder, Jeffrey N. 2005. “Are Unshifted Distributional Models Appropriate for Response Time?” Psychometrika 70 (2). Springer Science + Business Media: 377–81. https://doi.org/10.1007/s11336-005-1297-7.

Rouder, Jeffrey N., Jordan M. Province, Richard D. Morey, Pablo Gomez, and Andrew Heathcote. 2015. “The Lognormal Race: A Cognitive-Process Model of Choice and Latency with Desirable Psychometric Properties.” Psychometrika 80 (2): 491–513. https://doi.org/10.1007/s11336-013-9396-3.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1). American Psychological Association: 103–26.

Stan Development Team. 2021. “Stan Modeling Language Users Guide and Reference Manual, Version 2.27.” https://mc-stan.org.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint arXiv:1804.06788.

Turner, Brandon M, Per B Sederberg, Scott D Brown, and Mark Steyvers. 2013. “A Method for Efficiently Sampling from Distributions with Correlated Dimensions.” Psychological Methods 18 (3). American Psychological Association: 368.

Ulrich, Rolf, and Jeff Miller. 1993. “Information Processing Models Generating Lognormally Distributed Reaction Times.” Journal of Mathematical Psychology 37 (4): 513–25. https://doi.org/10.1006/jmps.1993.1032.

51. One could estimate the censored times by fitting log-normal distributions that are truncated at T_n, since this is the minimum possible time for each censored observation:

T_censored,n ∼ LogNormal(μ_nw,n, σ), with T_censored,n > T_n, if choice = word
T_censored,n ∼ LogNormal(μ_w,n, σ), with T_censored,n > T_n, otherwise↩

52. This for-loop can also be implemented in the transformed parameters block; the advantage of doing this is that the log-likelihood of each observation can be used, for example, for cross-validation; the disadvantage is that the R object might be very large, because it will store the log-likelihood during the warm-up period as well.↩

53. There are 15 correlations, since there are 15 ways to choose 2 variables out of 6 for specifying the pairwise correlations, where order doesn't matter. This is the binomial coefficient (6 choose 2), which is choose(6, 2) in R.↩

54. This is, of course, just an assumption that could be verified. But we’ll see that the model is
already quite complex and achieving convergence is not trivial.↩

55. The average is calculated as follows: 0.9/(0.9 + 70); this is because the mean of a beta distribution with parameters a, b is a/(a + b). The endpoints of the 95% interval can be calculated in R with qbeta(c(0.025, 0.975), .9, 70).↩
Code

References
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley
Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2019. rmarkdown: Dynamic
Documents for R. https://CRAN.R-project.org/package=rmarkdown.

Anderson, John R., Dan Bothell, Michael D. Byrne, Scott Douglass, Christian Lebiere, and
Yulin Qin. 2004. “An Integrated Theory of the Mind.” Psychological Review 111 (4): 1036–60.

Antoine Lucas, Dirk Eddelbuettel with contributions by, Jarek Tuszynski, Henrik Bengtsson,
Simon Urbanek, Mario Frasca, Bryan Lewis, Murray Stokely, et al. 2021. Digest: Create
Compact Hash Digests of R Objects. https://CRAN.R-project.org/package=digest.

Ashby, F Gregory. 1982. “Testing the Assumptions of Exponential, Additive Reaction Time
Models.” Memory & Cognition 10 (2). Springer: 125–34.

Ashby, F Gregory, and James T Townsend. 1980. “Decomposing the Reaction Time
Distribution: Pure Insertion and Selective Influence Revisited.” Journal of Mathematical
Psychology 21 (2). Elsevier: 93–123.

Audley, RJ, and AR Pike. 1965. “Some Alternative Stochastic Models of Choice 1.” British
Journal of Mathematical and Statistical Psychology 18 (2). Wiley Online Library: 207–25.

Auguie, Baptiste. 2017. GridExtra: Miscellaneous Functions for "Grid" Graphics.


https://CRAN.R-project.org/package=gridExtra.

Aust, Frederik. 2019. citr: RStudio Add-in to Insert Markdown Citations. https://CRAN.R-
project.org/package=citr.

Aust, Frederik, and Marius Barth. 2020. papaja: Create APA Manuscripts with R Markdown.
https://github.com/crsh/papaja.

Baayen, R Harald, Douglas J Davidson, and Douglas M Bates. 2008. “Mixed-Effects Modeling
with Crossed Random Effects for Subjects and Items.” Journal of Memory and Language 59
(4). Elsevier: 390–412.

Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral
Sciences. Macmillan International Higher Education.
Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects
Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and
Language 68 (3). Elsevier: 255–78.

Barth, Marius. 2022. tinylabels: Lightweight Variable Labels. https://cran.r-


project.org/package=tinylabels.

Batchelder, William H, and David M Riefer. 1990. “Multinomial Processing Models of Source
Monitoring.” Psychological Review 97 (4). American Psychological Association: 548.

———. 1999. “Theoretical and Empirical Review of Multinomial Process Tree Modeling.”
Psychonomic Bulletin & Review 6 (1). Springer: 57–86.

Bates, Douglas M, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2015.
“Parsimonious Mixed Models.”

Bates, Douglas M, and Martin Maechler. 2019. Matrix: Sparse and Dense Matrix Classes and
Methods. https://CRAN.R-project.org/package=Matrix.

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015a. “Fitting Linear
Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.
https://doi.org/10.18637/jss.v067.i01.

———. 2015b. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical
Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Beall, Alec T., and Jessica L. Tracy. 2013. “Women Are More Likely to Wear Red or Pink at
Peak Fertility.” Psychological Science 24 (9). Sage Publications Sage CA: Los Angeles, CA:
1837–41.

Beer, Randall D. 2000. “Dynamical Approaches to Cognitive Science.” Trends in Cognitive


Sciences 4 (3). Elsevier: 91–99.

Belin, TR, and DB Rubin. 1990. “Analysis of a Finite Mixture Model with Variance
Components.” In Proceedings of the Social Statistics Section, 211–15.

Bennett, Charles H. 1976. “Efficient Estimation of Free Energy Differences from Monte Carlo
Data.” Journal of Computational Physics 22 (2): 245–68. https://doi.org/10.1016/0021-
9991(76)90078-4.

Bernardo, José M, and Adrian FM Smith. 2009. Bayesian Theory. Vol. 405. John Wiley &
Sons.
Betancourt, Michael J. 2016. “Identifying the Optimal Integration Time in Hamiltonian Monte
Carlo.”

———. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.”

———. 2018. “Towards a Principled Bayesian Workflow.”


https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html.

Betancourt, Michael J., and Mark Girolami. 2013. “Hamiltonian Monte Carlo for Hierarchical
Models.”

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.

Blitzstein, Joseph K, and Jessica Hwang. 2014. Introduction to Probability. Chapman;


Hall/CRC.

Blokpoel, Mark, and Iris van Rooij. 2021. Theoretical Modeling for Cognitive Science and
Psychology.

Blumberg, Eric J., Matthew S. Peterson, and Raja Parasuraman. 2015. “Enhancing Multiple
Object Tracking Performance with Noninvasive Brain Stimulation: A Causal Role for the
Anterior Intraparietal Sulcus.” Frontiers in Systems Neuroscience 9: 3.
https://doi.org/10.3389/fnsys.2015.00003.

Bolker, Ben. 2018. “Https://Github.com/Bbolker/Mixedmodels-


Misc/Blob/Master/Notes/Contrasts.rmd.”

Box, George EP. 1979. “Robustness in the Strategy of Scientific Model Building.” In
Robustness in Statistics, 201–36. Elsevier.

Box, George E.P., and David R. Cox. 1964. “An Analysis of Transformations.” Journal of the
Royal Statistical Society. Series B (Methodological). JSTOR, 211–52.

Brée, David S. 1975. “The Distribution of Problem-Solving Times: An Examination of the


Stages Model.” British Journal of Mathematical and Statistical Psychology 28 (2): 177–200.
https://doi.org/10/cnx3q7.

Britten, Kenneth H., Michael N. Shadlen, William T. Newsome, and J. Anthony Movshon. 1993.
“Responses of Neurons in Macaque Mt to Stochastic Motion Signals.” Visual Neuroscience 10
(6). Cambridge University Press: 1157–69. https://doi.org/10.1017/S0952523800010269.

Broadbent, Donald E., and Margaret H. P. Broadbent. 1987. “From Detection to Identification:
Response to Multiple Targets in Rapid Serial Visual Presentation.” Perception &
Psychophysics 42 (2): 105–13. https://doi.org/10.3758/BF03210498.
Brown, Scott D., and Andrew Heathcote. 2008. “The Simplest Complete Model of Choice
Response Time: Linear Ballistic Accumulation.” Cognitive Psychology 57 (3): 153–78.
https://doi.org/10.1016/j.cogpsych.2007.12.002.

Brown, Scott, and Andrew Heathcote. 2005. “A Ballistic Model of Choice Response Time.”
Psychological Review 112 (1). American Psychological Association: 117.

Browne, William J, and David Draper. 2006. “A Comparison of Bayesian and Likelihood-Based
Methods for Fitting Multilevel Models.” Bayesian Analysis 1 (3). International Society for
Bayesian Analysis: 473–514.

Brysbaert, Marc, Paweł Mandera, and Emmanuel Keuleers. 2018. “The Word Frequency
Effect in Word Processing: An Updated Review.” Current Directions in Psychological Science
27 (1): 45–50. https://doi.org/10.1177/0963721417727521.

Burger, Edward B, and Michael Starbird. 2012. The 5 Elements of Effective Thinking.
Princeton University Press.

Busemeyer, Jerome R, and Adele Diederich. 2010. Cognitive Modeling. Sage.

Buzsáki, György, and Kenji Mizuseki. 2014. “The Log-Dynamic Brain: How Skewed
Distributions Affect Network Operations.” Nature Reviews Neuroscience 15 (4): 264–78.
https://doi.org/10.1038/nrn3687.

Bürki, Audrey, Francois-Xavier Alario, and Shravan Vasishth. 2022. “When Words Collide:
Bayesian Meta-Analyses of Distractor and Target Properties in the Picture-Word Interference
Paradigm.” Quarterly Journal of Experimental Psychology.

Bürki, Audrey, Shereen Elbuy, Sylvain Madec, and Shravan Vasishth. 2020. “What Did We
Learn from Forty Years of Research on Semantic Interference? A Bayesian Meta-Analysis.”
Journal of Memory and Language. https://doi.org/10.1016/j.jml.2020.104125.

Bürkner, Paul-Christian. 2019. brms: Bayesian Regression Models Using “Stan”.


https://CRAN.R-project.org/package=brms.

Bürkner, Paul-Christian, and Emmanuel Charpentier. 2020. “Modelling Monotonic Effects of


Ordinal Predictors in Bayesian Regression Models.” British Journal of Mathematical and
Statistical Psychology. Wiley Online Library.

Bürkner, Paul-Christian, and Matti Vuorre. 2018. “Ordinal Regression Models in Psychological
Research: A Tutorial.” PsyArXiv Preprints.

Caplan, D., and G. S. Waters. 1999. “Verbal Working Memory and Sentence Comprehension.”
Behavioral and Brain Science 22: 77–94.
Carlin, Bradley P, and Thomas A Louis. 2008. Bayesian Methods for Data Analysis. CRC
Press.

Carney, Dana R, Amy JC Cuddy, and Andy J Yap. 2010. “Power Posing: Brief Nonverbal
Displays Affect Neuroendocrine Levels and Risk Tolerance.” Psychological Science 21 (10).
Sage Publications Sage CA: Los Angeles, CA: 1363–8.

Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael J.
Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A
Probabilistic Programming Language.” Journal of Statistical Software 76 (1). Columbia Univ.,
New York, NY (United States); Harvard Univ., Cambridge, MA (United States).

Chambers, Chris. 2019. The Seven Deadly Sins of Psychology: A Manifesto for Reforming the
Culture of Scientific Practice. Princeton University Press.

Chang, Winston. 2018. webshot: Take Screenshots of Web Pages. https://CRAN.R-


project.org/package=webshot.

Chen, Stanley F, and Joshua Goodman. 1999. “An Empirical Study of Smoothing Techniques
for Language Modeling.” Computer Speech & Language 13 (4): 359–94.
https://doi.org/https://doi.org/10.1006/csla.1999.0128.

Cheng, Joe. 2018. miniUI: Shiny Ui Widgets for Small Screens. https://CRAN.R-
project.org/package=miniUI.

Christensen, Ronald, Wesley Johnson, Adam Branscum, and Timothy Hanson. 2011.
“Bayesian Ideas and Data Analysis.” CRC Press.

Clark, Vincent P, Silu Fan, and Steven A Hillyard. 1994. “Identification of Early Visual Evoked
Potential Generators by Retinotopic and Topographic Analyses.” Human Brain Mapping 2 (3).
Wiley Online Library: 170–87.

Conway, Andrew RA, Michael J Kane, Michael F Bunting, D Zach Hambrick, Oliver Wilhelm,
and Randall W Engle. 2005. “Working Memory Span Tasks: A Methodological Review and
User’s Guide.” Psychonomic Bulletin & Review 12 (5). Springer: 769–86.

Cook, Samantha R, Andrew Gelman, and Donald B Rubin. 2006. “Validation of Software for
Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical
Statistics 15 (3). Taylor & Francis: 675–92. https://doi.org/10.1198/106186006X136976.

Cox, Christopher Martin Mikkelsen, Tamar Keren-Portnoy, Andreas Roepstorff, and Riccardo
Fusaroli. 2022. “A Bayesian Meta-Analysis of Infants’ Ability to Perceive Audio–Visual
Congruence for Speech.” Infancy 27 (1). Wiley Online Library: 67–96.
Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–
29.

Damasio, Antonio R. 1992. “Aphasia.” New England Journal of Medicine 326 (8). Mass
Medical Soc: 531–39.

de Finetti, Bruno. 1931. “Funcione Caratteristica Di Un Fenomeno Aleatorio.” Atti Dela Reale
Accademia Nazionale Dei Lincei, Serie 6. Memorie, Classe Di Scienze Fisiche, Mathematice
E Naturale 4: 251–99.

DeLong, Katherine A, Thomas P Urbach, and Marta Kutas. 2005. “Probabilistic Word Pre-
Activation During Language Comprehension Inferred from Electrical Brain Activity.” Nature
Neuroscience 8 (8): 1117–21. https://doi.org/10.1038/nn1504.

Denckla, Martha Bridge, and Rita G Rudel. 1976. “Rapid ‘Automatized’naming (Ran): Dyslexia
Differentiated from Other Learning Disabilities.” Neuropsychologia 14 (4). Elsevier: 471–79.

DerSimonian, Rebecca, and Nan Laird. 1986. “Meta-Analysis in Clinical Trials.” Controlled
Clinical Trials 7 (3). Elsevier: 177–88.

Dickey, James M, BP Lientz, and others. 1970. “The Weighted Likelihood Ratio, Sharp
Hypotheses About Chances, the Order of a Markov Chain.” The Annals of Mathematical
Statistics 41 (1). Institute of Mathematical Statistics: 214–26.

Dillon, Brian, Alan Mishler, Shayne Sloggett, and Colin Phillips. 2013. “Contrasting Intrusion
Profiles for Agreement and Anaphora: Experimental and Modeling Evidence.” Journal of
Memory and Language 69 (2). Elsevier: 85–103.

Dillon, Brian William. 2011. “Structured Access in Sentence Comprehension.” PhD thesis.

Dobson, Annette J, and Adrian Barnett. 2011. An Introduction to Generalized Linear Models.
CRC press.

Drton, Mathias. 2013. SIN: A Sinful Approach to Selection of Gaussian Graphical Markov
Models. https://CRAN.R-project.org/package=SIN.

Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. “Hybrid Monte
Carlo.” Physics Letters B 195 (2): 216–22. https://doi.org/10.1016/0370-2693(87)91197-X.

Dufau, Stéphane, Jonathan Grainger, and Johannes C Ziegler. 2012. “How to Say ‘No’ to a
Nonword: A Leaky Competing Accumulator Model of Lexical Decision.” Journal of
Experimental Psychology: Learning, Memory, and Cognition 38 (4). American Psychological
Association: 1117.
Dutilh, Gilles, Jeffrey Annis, Scott D Brown, Peter Cassey, Nathan J Evans, Raoul PPP
Grasman, Guy E Hawkins, et al. 2019. “The Quality of Response Time Data Inference: A
Blinded, Collaborative Assessment of the Validity of Cognitive Models.” Psychonomic Bulletin
& Review 26 (4). Springer: 1051–69.

Dutilh, Gilles, Eric-Jan Wagenmakers, Ingmar Visser, and Han L. J. van der Maas. 2011. “A
Phase Transition Model for the Speed-Accuracy Trade-Off in Response Time Experiments.”
Cognitive Science 35 (2): 211–50. https://doi.org/10.1111/j.1551-6709.2010.01147.x.

Ebersole, Charles R., Olivia E. Atherton, Aimee L. Belanger, Hayley M. Skulborstad, Jill M.
Allen, Jonathan B. Banks, Erica Baranski, et al. 2016. “Many Labs 3: Evaluating Participant
Pool Quality Across the Academic Semester via Replication.” Journal of Experimental Social
Psychology 67: 68–82. https://doi.org/https://doi.org/10.1016/j.jesp.2015.10.012.

Eddelbuettel, Dirk, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell,
Douglas M Bates, and John Chambers. 2019. Rcpp: Seamless R and C++ Integration.
https://CRAN.R-project.org/package=Rcpp.

Engelmann, Felix, Lena A. Jäger, and Shravan Vasishth. 2020. “The Effect of Prominence and
Cue Association in Retrieval Processes: A Computational Account.” Cognitive Science 43 (12):
e12800. https://doi.org/10.1111/cogs.12800.

Epstein, Joshua M. 2008. “Why Model?” Journal of Artificial Societies and Social Simulation 11
(4): 12.

Faraway, Julian J. 2016. Extending the Linear Model with R: Generalized Linear, Mixed
Effects and Nonparametric Regression Models. Chapman; Hall/CRC.

Faraway, Julian James. 2002. Practical Regression and ANOVA using R. Vol. 168. Citeseer.

Farrell, Simon, and Stephan Lewandowsky. 2018. Computational Modeling of Cognition and
Behavior. Cambridge University Press.

Fedorenko, Evelina, Edward Gibson, and Douglas Rohde. 2006. “The Nature of Working
Memory Capacity in Sentence Comprehension: Evidence Against Domain-Specific Working
Memory Resources.” Journal of Memory and Language 54 (4). Elsevier: 541–53.

Feldman, Jacob. 2017. “What Are the ‘True’ Statistics of the Environment?” Cognitive Science
41 (7). Wiley Online Library: 1871–1903.

Fieller, Nick. 2016. Basics of Matrix Algebra for Statistics with R. Boca Raton, FL: CRC Press.
Fisher, Christopher R, Joseph W Houpt, and Glenn Gunzelmann. 2022. “Fundamental Tools
for Developing Likelihood Functions Within Act-R.” Journal of Mathematical Psychology 107.
Elsevier: 102636.

Fosse, Nathan E. 2016. “Replication Data for ‘Power Posing: Brief Nonverbal Displays Affect
Neuroendocrine Levels and Risk Tolerance’ by Carney, Cuddy, Yap (2010).” Harvard
Dataverse. https://doi.org/10.7910/DVN/FMEGS6.

Fox, John. 2009. A Mathematical Primer for Social Statistics. Vol. 159. Sage.

———. 2015. Applied Regression Analysis and Generalized Linear Models. Sage
Publications.

Francois, Romain. 2017. Bibtex: Bibtex Parser. https://CRAN.R-project.org/package=bibtex.

Frank, Stefan L., Leun J. Otten, Giulia Galli, and Gabriella Vigliocco. 2015. “The ERP
Response to the Amount of Information Conveyed by Words in Sentences.” Brain and
Language 140: 1–11. https://doi.org/10.1016/j.bandl.2014.10.006.

Frank, Stefan L., Thijs Trompenaars, and Shravan Vasishth. 2015. “Cross-Linguistic
Differences in Processing Double-Embedded Relative Clauses: Working-Memory Constraints
or Language Statistics?” Cognitive Science 40: 554–78. https://doi.org/10.1111/cogs.12247.

Frazier, Lyn. 1979. “On Comprehending Sentences: Syntactic Parsing Strategies.” PhD thesis,
Amherst: University of Massachusetts.

Freedman, Laurence S., D. Lowe, and P. Macaskill. 1984. “Stopping Rules for Clinical Trials
Incorporating Clinical Opinion.” Biometrics 40 (3): 575–86.

Friendly, Michael, John Fox, and Phil Chalmers. 2020. Matlib: Matrix Functions for Teaching
and Learning Linear Algebra and Multivariate Statistics. https://CRAN.R-
project.org/package=matlib.

Gabry, Jonah, and Rok Češnovar. 2021. cmdstanr: R Interface to “CmdStan”.

Gabry, Jonah, and Tristan Mahr. 2019. bayesplot: Plotting for Bayesian Models.
https://CRAN.R-project.org/package=bayesplot.

Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael J. Betancourt, and Andrew Gelman. 2017.
“Visualization in Bayesian Workflow.” arXiv Preprint arXiv:1709.01449.

Gamerman, Dani, and Hedibert F Lopes. 2006. Markov chain Monte Carlo: Stochastic
simulation for Bayesian inference. CRC Press.
Ge, Hong, Kai Xu, and Zoubin Ghahramani. 2018. “Turing: A Language for Flexible
Probabilistic Inference.” In Proceedings of Machine Learning Research, edited by Amos
Storkey and Fernando Perez-Cruz, 84:1682–90. Playa Blanca, Lanzarote, Canary Islands:
PMLR. http://proceedings.mlr.press/v84/ge18b.html.

Geisser, Seymour, and William F Eddy. 1979. “A Predictive Approach to Model Selection.”
Journal of the American Statistical Association 74 (365). Taylor & Francis Group: 153–60.

Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models
(Comment on Article by Browne and Draper).” Bayesian Analysis 1 (3). International Society
for Bayesian Analysis: 515–34.

Gelman, Andrew, and John B. Carlin. 2014. “Beyond Power Calculations: Assessing Type S
(Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6). SAGE
Publications: 641–51.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B.
Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC
Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge University Press.

Gelman, Andrew, Daniel Simpson, and Michael J. Betancourt. 2017. “The Prior Can Often
Only Be Understood in the Context of the Likelihood.” Entropy 19 (10): 555.
https://doi.org/10.3390/e19100555.

Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling
Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020.
“Bayesian Workflow.” arXiv Preprint arXiv:2011.01808.

Gentle, James E. 2007. “Matrix Algebra: Theory, Computations, and Applications in Statistics.”
Springer Texts in Statistics 10. New York, NY: Springer.

Gibson, Edward, and James Thomas. 1999. “Memory Limitations and Structural Forgetting:
The Perception of Complex Ungrammatical Sentences as Grammatical.” Language and
Cognitive Processes 14(3): 225–48.

Gibson, Edward, and H-H Iris Wu. 2013. “Processing Chinese Relative Clauses in Context.”
Language and Cognitive Processes 28 (1-2). Taylor & Francis: 125–55.

Gill, Jeff. 2006. Essential Mathematics for Political and Social Research. Cambridge University
Press Cambridge.
Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and
Estimation.” Journal of the American Statistical Association 102 (477). Taylor & Francis: 359–
78. https://doi.org/10.1198/016214506000001437.

Gohel, David, Hadley Wickham, Lionel Henry, and Jeroen Ooms. 2019. gdtools: Utilities for
Graphical Rendering. https://CRAN.R-project.org/package=gdtools.

Good, I. J. 1952. “Rational Decisions.” Journal of the Royal Statistical Society. Series B
(Methodological) 14 (1). [Royal Statistical Society, Wiley]: 107–14.
http://www.jstor.org/stable/2984087.

Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2018. “Rstanarm: Bayesian
Applied Regression Modeling via Stan.” http://mc-stan.org/.

Goodrich, Ben, Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Michael
Betancourt, Marcus Brubaker, et al. 2019. StanHeaders: C++ Header Files for Stan.
https://CRAN.R-project.org/package=StanHeaders.

Gordon, P. C., Randall Hendrick, and Marcus Johnson. 2001. “Memory Interference During
Language Processing.” Journal of Experimental Psychology: Learning, Memory, and Cognition
27(6): 1411–23.

Grassi, Massimo, Camilla Crotti, David Giofrè, Ingrid Boedker, and Enrico Toffalini. 2021. “Two
Replications of Raymond, Shapiro, and Arnell (1992), the Attentional Blink.” Behavior
Research Methods 53 (2): 656–68. https://doi.org/10.3758/s13428-020-01457-6.

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic
Input.” Cognitive Science 29: 261–90.

Gronau, Quentin F., Alexandra Sarafoglou, Dora Matzke, Alexander Ly, Udo Boehm, Maarten
Marsman, David S Leslie, Jonathan J Forster, Eric-Jan Wagenmakers, and Helen
Steingroever. 2017a. “A Tutorial on Bridge Sampling.” Journal of Mathematical Psychology 81.
Elsevier: 80–97.

———. 2017b. “A Tutorial on Bridge Sampling.” Journal of Mathematical Psychology 81: 80–
97. https://doi.org/10.1016/j.jmp.2017.09.005.

Gronau, Quentin F., Henrik Singmann, and Eric-Jan Wagenmakers. 2017. “Bridgesampling: An
R Package for Estimating Normalizing Constants.” Arxiv. http://arxiv.org/abs/1710.08162.

———. 2020. “bridgesampling: An R Package for Estimating Normalizing Constants.” Journal


of Statistical Software 92 (10): 1–29. https://doi.org/10.18637/jss.v092.i10.
Gronau, Quentin F., and Eric-Jan Wagenmakers. 2018. “Limitations of Bayesian Leave-One-
Out Cross-Validation for Model Selection.” Computational Brain & Behavior.
https://doi.org/10.1007/s42113-018-0011-7.

———. 2019. “Rejoinder: More Limitations of Bayesian Leave-One-Out Cross-Validation.”


Computational Brain & Behavior 2 (1). Springer: 35–47.

Guo, Jiqiang, Jonah Gabry, and Ben Goodrich. 2019. rstan: R Interface to Stan.
https://CRAN.R-project.org/package=rstan.

Haaf, Julia M., and Jeffrey N. Rouder. 2019. “Some Do and Some Don’t? Accounting for
Variability of Individual Difference Structures.” Psychonomic Bulletin & Review 26 (3).
Springer: 772–89.

Haines, Nathaniel, Peter D Kvam, Louis H Irving, Colin Smith, Theodore P Beauchaine, Mark
A Pitt, Woo-Young Ahn, and Brandon Turner. 2020. “Learning from the Reliability Paradox:
How Theoretically Informed Generative Models Can Advance the Social, Behavioral, and
Brain Sciences.” Unpublished. PsyArXiv.

Hammerly, Christopher, Adrian Staub, and Brian Dillon. 2019. “The Grammaticality Asymmetry
in Agreement Attraction Reflects Response Bias: Experimental and Modeling Evidence.”
Cognitive Psychology 110: 70–104.

Han, Ding, Jana Wegrzyn, Hua Bi, Ruihua Wei, Bin Zhang, and Xiaorong Li. 2018. “Practice
Makes the Deficiency of Global Motion Detection in People with Pattern-Related Visual Stress
More Apparent.” PLOS ONE 13 (2). Public Library of Science: 1–13.
https://doi.org/10.1371/journal.pone.0193215.

Harrell Jr, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models,
Logistic and Ordinal Regression, and Survival Analysis. New York, NY: Springer.

Harris, Christopher M., and Jonathan Waddington. 2012. “On the Convergence of Time
Interval Moments: Caveat Sciscitator.” Journal of Neuroscience Methods 205 (2): 345–56.
https://doi.org/https://doi.org/10.1016/j.jneumeth.2012.01.017.

Harris, Christopher M., Jonathan Waddington, Valerio Biscione, and Sean Manzi. 2014.
“Manual Choice Reaction Times in the Rate-Domain.” Frontiers in Human Neuroscience 8:
418. https://doi.org/10.3389/fnhum.2014.00418.

Hartmann, Raphael, Lea Johannsen, and Karl Christoph Klauer. 2020. “rtmpt: An R Package
for Fitting Response-Time Extended Multinomial Processing Tree Models.” Behavior Research
Methods 52 (3). Springer: 1313–38.
Hayes, Taylor R., and Alexander A. Petrov. 2016. “Mapping and Correcting the Influence of
Gaze Position on Pupil Size Measurements.” Behavior Research Methods 48 (2): 510–27.
https://doi.org/10.3758/s13428-015-0588-x.

Heathcote, Andrew. 2004. “Fitting Wald and ex-Wald Distributions to Response Time Data: An
Example Using Functions for the S-Plus Package.” Behavior Research Methods, Instruments,
& Computers 36 (4). Springer: 678–94.

Heathcote, Andrew, Yi-Shin Lin, Angus Reynolds, Luke Strickland, Matthew Gretton, and Dora
Matzke. 2019. “Dynamic Models of Choice.” Behavior Research Methods 51 (2). Springer:
961–85.

Heathcote, Andrew, and Jonathon Love. 2012. “Linear Deterministic Accumulator Models of
Simple Choice.” Frontiers in Psychology 3: 292. https://doi.org/10.3389/fpsyg.2012.00292.

Heister, Julian, Kay-Michael Würzner, and Reinhold Kliegl. 2012. “Analysing Large Datasets of
Eye Movements During Reading.” Visual Word Recognition 2: 102–30.

Henrich, Joseph, Steven J. Heine, and Ara Norenzayan. 2010. “The Weirdest People in the
World?” Behavioral and Brain Sciences 33 (2-3). Cambridge University Press: 61–83.
https://doi.org/10.1017/S0140525X0999152X.

Henry, Lionel, and Hadley Wickham. 2019. purrr: Functional Programming Tools.
https://CRAN.R-project.org/package=purrr.

Hester, Jim, Gábor Csárdi, Hadley Wickham, Winston Chang, Martin Morgan, and Dan
Tenenbaum. 2021. Remotes: R Package Installation from Remote Repositories, Including
’Github’. https://CRAN.R-project.org/package=remotes.

Higgins, Julian, and Sally Green. 2008. Cochrane Handbook for Systematics Reviews of
Interventions. New York: Wiley-Blackwell.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively
Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15
(1): 1593–1623. http://dl.acm.org/citation.cfm?id=2627435.2638586.

Hsiao, Fanny Pai-Fang, and Edward Gibson. 2003. “Processing Relative Clauses in Chinese.”
Cognition 90: 3–27.

Hubel, Kerry A, Bruce Reed, E William Yund, Timothy J Herron, and David L Woods. 2013.
“Computerized Measures of Finger Tapping: Effects of Hand Dominance, Age, and Sex.”
Perceptual and Motor Skills 116 (3). SAGE Publications: 929–52.

Izrailev, Sergei. 2014. tictoc: Functions for Timing R Scripts, as Well as Implementations of
Stack and List Structures. https://CRAN.R-project.org/package=tictoc.

Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Vol. 846. John Wiley &
Sons.

JASP Team. 2019. “JASP (Version 0.11.1)[Computer software].” https://jasp-stats.org/.

Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University Press.

Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference
in Sentence Comprehension: Literature review and Bayesian meta-analysis.” Journal of
Memory and Language 94: 316–39. https://doi.org/10.1016/j.jml.2017.01.004.

Jäger, Lena A., Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth. 2020.
“Interference Patterns in Subject-Verb Agreement and Reflexives Revisited: A Large-Sample
Study.” Journal of Memory and Language 111.
https://doi.org/10.1016/j.jml.2019.104063.

Jeffreys, Harold. 1939. Theory of Probability. Oxford: Clarendon Press.

Johnson, Norman L, Samuel Kotz, and Narayanaswamy Balakrishnan. 1995. Continuous
Univariate Distributions. Vol. 289. John Wiley & Sons.

Jurafsky, Daniel. 1996. “A Probabilistic Model of Lexical and Syntactic Access and
Disambiguation.” Cognitive Science 20 (2): 137–94.

Just, M.A., P.A. Carpenter, and S. Varma. 1999. “Computational Modeling of High-Level
Cognition and Brain Function.” Human Brain Mapping 8: 128–36.

Just, Marcel A., and Patricia A. Carpenter. 1992. “A Capacity Theory of Comprehension:
Individual Differences in Working Memory.” Psychological Review 99 (1): 122–49.

Kadane, Joseph, and Lara J Wolfson. 1998. “Experiences in Elicitation: [Read Before the
Royal Statistical Society at a Meeting on ‘Elicitation’ on Wednesday, April 16th, 1997, the
President, Professor A. F. M. Smith in the Chair].” Journal of the Royal Statistical Society:
Series D (The Statistician) 47 (1). Wiley Online Library: 3–19.

Kass, Robert E, and Joel B Greenhouse. 1989. “[Investigating Therapies of Potentially Great
Benefit: ECMO]: Comment: A Bayesian Perspective.” Statistical Science 4 (4). JSTOR:
310–17.

Kass, Robert E, and Adrian E Raftery. 1995. “Bayes Factors.” Journal of the American
Statistical Association 90 (430). Taylor & Francis: 773–95.

Kerns, G.J. 2014. Introduction to Probability and Statistics Using R. Second Edition.

Keuleers, Emmanuel, Paula Lacey, Kathleen Rastle, and Marc Brysbaert. 2012. “The British
Lexicon Project: Lexical Decision Data for 28,730 Monosyllabic and Disyllabic English Words.”
Behavior Research Methods 44 (1). Springer: 287–304.

Kim, Shinyoung, Hyunji Moon, Martin Modrák, and Teemu Säilynoja. 2022. SBC: Simulation
Based Calibration for rstan/cmdstanr Models.

Klauer, Karl Christoph, and David Kellen. 2018. “RT-MPTs: Process Models for Response-
Time Distributions Based on Multinomial Processing Trees with Applications to Recognition
Memory.” Journal of Mathematical Psychology 82: 111–30.
https://doi.org/10.1016/j.jmp.2017.12.003.

Kolmogorov, Andreĭ Nikolaevich. 1933. Foundations of the Theory of Probability: Second
English Edition. Courier Dover Publications.

Koster, Jeremy, and Richard McElreath. 2017. “Multinomial Analysis of Behavior: Statistical
Methods.” Behavioral Ecology and Sociobiology 71 (9): 138. https://doi.org/10.1007/s00265-
017-2363-8.

Kruschke, John. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan.
Academic Press.

Kruschke, John, and Torrin M Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing,
Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic
Bulletin & Review 25 (1). Springer: 178–206.

Kutas, Marta, and Kara D. Federmeier. 2011. “Thirty Years and Counting: Finding Meaning in
the N400 Component of the Event-Related Brain Potential (ERP).” Annual Review of
Psychology 62 (1): 621–47. https://doi.org/10.1146/annurev.psych.093008.131123.

Kutas, Marta, and Steven A Hillyard. 1980. “Reading Senseless Sentences: Brain Potentials
Reflect Semantic Incongruity.” Science 207 (4427): 203–5.
https://doi.org/10.1126/science.7350657.

———. 1984. “Brain Potentials During Reading Reflect Word Expectancy and Semantic
Association.” Nature 307 (5947): 161–63. https://doi.org/10.1038/307161a0.

Lago, Sol, Diego Shalom, Mariano Sigman, Ellen F Lau, and Colin Phillips. 2015. “Agreement
Processes in Spanish Comprehension.” Journal of Memory and Language 82: 133–49.

Laird, John E. 2019. The Soar Cognitive Architecture. MIT Press.

Laird, Nan M, and James H Ware. 1982. “Random-Effects Models for Longitudinal Data.”
Biometrics 38 (4). JSTOR: 963–74.

Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. London, UK: Sage.

Landau, William Michael. 2021. “The Stantargets R Package: A Workflow Framework for
Efficient Reproducible Stan-Powered Bayesian Data Analysis Pipelines.” Journal of Open
Source Software 6 (60): 3193. https://doi.org/10.21105/joss.03193.

Laurinavichyute, Anna. 2020. “Similarity-Based Interference and Faulty Encoding Accounts of
Sentence Processing.” Dissertation, University of Potsdam.

Lee, Michael D., ed. 2011a. “Special Issue on Hierarchical Bayesian Models.” Journal of
Mathematical Psychology 55 (1). https://www.sciencedirect.com/journal/journal-of-
mathematical-psychology/vol/55/issue/1.

———. 2011b. “How Cognitive Modeling Can Benefit from Hierarchical Bayesian Models.”
Journal of Mathematical Psychology 55 (1). Elsevier BV: 1–7.
https://doi.org/10.1016/j.jmp.2010.08.013.

Lee, Michael D., Jason R Bock, Isaiah Cushman, and William R Shankle. 2020. “An
Application of Multinomial Processing Tree Models and Bayesian Methods to Understanding
Memory Impairment.” Journal of Mathematical Psychology 95. Elsevier: 102328.

Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical
Course. Cambridge University Press.

Lee, Peter M. 2012. Bayesian Statistics: An Introduction. John Wiley & Sons.

Levy, Dan. 2021. Maxims for Thinking Analytically: The Wisdom of Legendary Harvard
Professor Richard Zeckhauser. Dan Levy.

Levy, Deborah L, Philip S Holzman, Steven Matthysse, and Nancy R Mendell. 1993. “Eye
Tracking Dysfunction and Schizophrenia: A Critical Perspective.” Schizophrenia Bulletin 19
(3). Oxford University Press: 461–536.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random
Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate
Analysis 100 (9): 1989–2001.

Lewis, Richard L., and Shravan Vasishth. 2005. “An Activation-Based Model of Sentence
Processing as Skilled Memory Retrieval.” Cognitive Science 29: 1–45.

Lidstone, George James. 1920. “Note on the General Case of the Bayes-Laplace Formula for
Inductive or a Posteriori Probabilities.” Transactions of the Faculty of Actuaries 8 (182-192):
13.

Limpert, Eckhard, Werner A. Stahel, and Markus Abbt. 2001. “Log-Normal Distributions Across
the Sciences: Keys and Clues.” BioScience 51 (5): 341. https://doi.org/10.1641/0006-
3568(2001)051[0341:LNDATS]2.0.CO;2.

Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of
Statistical Software 63 (1): 1–25.

Lindley, Dennis V. 1991. Making Decisions. Second. John Wiley & Sons.

Lissón, Paula, Dario Paape, Dorothea Pregla, Frank Burchert, Nicole Stadie, and Shravan
Vasishth. 2022. “Similarity-Based Interference in Sentence Comprehension in Aphasia: A
Computational Evaluation of Two Models of Cue-Based Retrieval.”

Lissón, Paula, Dorothea Pregla, Bruno Nicenboim, Dario Paape, Mick van het Nederend,
Frank Burchert, Nicole Stadie, David Caplan, and Shravan Vasishth. 2021. “A Computational
Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia.”
Cognitive Science 45 (4): e12956. https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.12956.

Logačev, Pavel, and Noyan Dokudan. 2021. “A Multinomial Processing Tree Model of RC
Attachment.” In Proceedings of the Workshop on Cognitive Modeling and Computational
Linguistics, 39–47. Online: Association for Computational Linguistics.
https://www.aclweb.org/anthology/2021.cmcl-1.4.

Logačev, Pavel, and Shravan Vasishth. 2016. “A Multiple-Channel Model of Task-Dependent
Ambiguity Resolution in Sentence Comprehension.” Cognitive Science 40 (2): 266–98.
https://doi.org/10.1111/cogs.12228.

Luce, R Duncan. 1991. Response Times: Their Role in Inferring Elementary Mental
Organization. Oxford University Press.

Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012.
The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.

Lunn, D.J., A. Thomas, N. Best, and D. Spiegelhalter. 2000. “WinBUGS-A Bayesian Modelling
Framework: Concepts, Structure, and Extensibility.” Statistics and Computing 10 (4). Springer:
325–37.

Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for
Social Scientists. New York, NY: Springer.

MacKay, David JC. 2003. Information Theory, Inference and Learning Algorithms. Cambridge,
UK: Cambridge University Press.

MacLeod, Colin M. 1991. “Half a Century of Research on the Stroop Effect: An Integrative
Review.” Psychological Bulletin 109 (2). American Psychological Association: 163.

Mahajan, Sanjoy. 2010. Street-Fighting Mathematics: The Art of Educated Guessing and
Opportunistic Problem Solving. Cambridge, MA: The MIT Press.

———. 2014. The Art of Insight in Science and Engineering: Mastering Complexity.
Cambridge, MA: The MIT Press.

Mahowald, Kyle, Ariel James, Richard Futrell, and Edward Gibson. 2016. “A Meta-Analysis of
Syntactic Priming in Language Production.” Journal of Memory and Language 91. Elsevier:
5–27.

Mathôt, Sebastiaan. 2018. “Pupillometry: Psychology, Physiology, and Function.” Journal of
Cognition 1 (1): 16. https://doi.org/10.5334/joc.18.

Matuschek, Hannes, Reinhold Kliegl, Shravan Vasishth, R. Harald Baayen, and Douglas M
Bates. 2017. “Balancing Type I Error and Power in Linear Mixed Models.” Journal of Memory
and Language 94: 305–15. https://doi.org/10.1016/j.jml.2017.01.001.

Matzke, Dora, Conor V. Dolan, William H. Batchelder, and Eric-Jan Wagenmakers. 2015.
“Bayesian Estimation of Multinomial Processing Tree Models with Heterogeneity in
Participants and Items.” Psychometrika 80 (1): 205–35. https://doi.org/10.1007/s11336-013-
9374-9.

Maxwell, Scott E, Harold D Delaney, and Ken Kelley. 2017. Designing Experiments and
Analyzing Data: A Model Comparison Perspective. New York, NY: Routledge.

McClelland, James L. 2009a. “The Place of Modeling in Cognitive Science.” Topics in
Cognitive Science 1 (1): 11–38. https://doi.org/10.1111/j.1756-8765.2008.01003.x.

———. 2009b. “The Place of Modeling in Cognitive Science.” Topics in Cognitive Science 1
(1). Wiley Online Library: 11–38.
McClelland, James L, and David E Rumelhart. 1989. Explorations in Parallel Distributed
Processing: A Handbook of Models, Programs, and Exercises. MIT Press.

McCullagh, Peter, and J.A. Nelder. 2019. Generalized Linear Models. Second Edition. Boca
Raton, Florida: Chapman & Hall/CRC.

McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and
Stan. Boca Raton, Florida: Chapman & Hall/CRC.

———. 2021. rethinking: Statistical Rethinking Book Package.

McElree, Brian. 2000. “Sentence Comprehension Is Mediated by Content-Addressable
Memory Structures.” Journal of Psycholinguistic Research 29 (2). Springer: 111–23.

McLean, Mathew William. 2017. “RefManageR: Import and Manage BibTeX and BibLaTeX
References in R.” The Journal of Open Source Software. https://doi.org/10.21105/joss.00338.

Meehl, Paul E. 1997. “The Problem Is Epistemology, Not Statistics: Replace Significance Tests
by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions.” In What If
There Were No Significance Tests?, edited by L.L. Harlow, S.A. Mulaik, and J. H. Steiger.
Mahwah, New Jersey: Erlbaum.

Meng, Xiao-li, and Wing Hung Wong. 1996. “Simulating Ratios of Normalizing Constants via a
Simple Identity: A Theoretical Exploration.” Statistica Sinica, 831–60.

Miller, I., and M. Miller. 2004. John E. Freund’s Mathematical Statistics with Applications.
Upper Saddle River, NJ: Prentice Hall.

Monnahan, Cole C., James T. Thorson, and Trevor A. Branch. 2017. “Faster Estimation of
Bayesian Models in Ecology Using Hamiltonian Monte Carlo.” Edited by Robert B. O’Hara.
Methods in Ecology and Evolution 8 (3): 339–48. https://doi.org/10.1111/2041-210X.12681.

Montgomery, D. C., E. A. Peck, and G. G. Vining. 2012. An Introduction to Linear Regression
Analysis. 5th ed. Hoboken, NJ: Wiley.

Morin, David J. 2016. Probability: For the Enthusiastic Beginner. Createspace Independent
Publishing Platform.

Müller, Kirill, and Hadley Wickham. 2020. tibble: Simple Data Frames. https://CRAN.R-
project.org/package=tibble.

Navarro, Daniel. 2015. Learning Statistics with R. https://learningstatisticswithr.com.


Navarro, Danielle J. 2019. “Between the Devil and the Deep Blue Sea: Tensions Between
Scientific Judgement and Statistical Model Selection.” Computational Brain & Behavior 2 (1):
28–34. https://doi.org/10.1007/s42113-018-0019-z.

Neal, Radford M. 2003. “Slice Sampling.” Annals of Statistics 31 (3). The Institute of
Mathematical Statistics: 705–67. https://doi.org/10.1214/aos/1056562461.

———. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte
Carlo, edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Taylor &
Francis. https://doi.org/10.1201/b10905-10.

Nelson, Peter R. 1981. “The Algebra of Random Variables.” Technometrics 23 (2). Taylor &
Francis: 197–98. https://doi.org/10.1080/00401706.1981.10486266.

Newell, Allen. 1990. Unified Theories of Cognition. Cambridge, MA: Harvard University Press.

Nicenboim, Bruno. 2018. “The Implementation of a Model of Choice: The (Truncated) Linear
Ballistic Accumulator.” In StanCon. Aalto University, Helsinki, Finland.
https://doi.org/10.5281/zenodo.1465990.

Nicenboim, Bruno, Pavel Logačev, Carolina Gattei, and Shravan Vasishth. 2016. “When High-
Capacity Readers Slow down and Low-Capacity Readers Speed up: Working Memory and
Locality Effects.” Frontiers in Psychology 7 (280). https://doi.org/10.3389/fpsyg.2016.00280.

Nicenboim, Bruno, Timo B. Roettger, and Shravan Vasishth. 2018. “Using Meta-Analysis for
Evidence Synthesis: The case of incomplete neutralization in German.” Journal of Phonetics
70: 39–55. https://doi.org/10.1016/j.wocn.2018.06.001.

Nicenboim, Bruno, Daniel J. Schad, and Shravan Vasishth. 2020. bcogsci: Data and Models
for the Book “An Introduction to Bayesian Data Analysis for Cognitive Science”.

Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical methods for linguistic research:
Foundational Ideas – Part II.” Language and Linguistics Compass 10 (11): 591–613.
https://doi.org/10.1111/lnc3.12207.

———. 2018. “Models of Retrieval in Sentence Comprehension: A Computational Evaluation
Using Bayesian Hierarchical Modeling.” Journal of Memory and Language 99: 1–34.
https://doi.org/10.1016/j.jml.2017.08.004.

Nicenboim, Bruno, Shravan Vasishth, Felix Engelmann, and Katja Suckow. 2018. “Exploratory
and Confirmatory Analyses in Sentence Processing: A case study of number interference in
German.” Cognitive Science 42 (S4). https://doi.org/10.1111/cogs.12589.
Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020a. “Are Words Pre-Activated
Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian
Random-Effects Meta-Analysis Using Publicly Available Data.” Neuropsychologia 142.
https://doi.org/10.1016/j.neuropsychologia.2020.107427.

———. 2020b. “Are Words Pre-Activated Probabilistically During Sentence Comprehension?
Evidence from New Data and a Bayesian Random-Effects Meta-Analysis Using Publicly
Available Data.” Neuropsychologia 142.
https://doi.org/10.1016/j.neuropsychologia.2020.107427.

———. 2020c. “Are Words Pre-Activated Probabilistically During Sentence Comprehension?
Evidence from New Data and a Bayesian Random-Effects Meta-Analysis Using Publicly
Available Data.” Neuropsychologia, 107427.

Nieuwland, Mante S, Stephen Politzer-Ahles, Evelien Heyselaar, Katrien Segaert, Emily
Darley, Nina Kazanina, Sarah Von Grebmer Zu Wolfsthurn, et al. 2018. “Large-Scale
Replication Study Reveals a Limit on Probabilistic Prediction in Language Comprehension.”
eLife 7. https://doi.org/10.7554/eLife.33468.

Normand, S.L.T. 1999. “Tutorial in Biostatistics Meta-Analysis: Formulating, Evaluating,
Combining, and Reporting.” Statistics in Medicine 18 (3): 321–59.

Oakley, Jeremy. 2021. SHELF: Tools to Support the Sheffield Elicitation Framework.
https://CRAN.R-project.org/package=SHELF.

Oberauer, Klaus. 2019. “Working Memory Capacity Limits Memory for Bindings.” Journal of
Cognition 2 (1): 40. https://doi.org/10.5334/joc.86.

Oberauer, Klaus, and Reinhold Kliegl. 2001. “Beyond Resources: Formal Models of
Complexity Effects and Age Differences in Working Memory.” European Journal of Cognitive
Psychology 13 (1-2). Routledge: 187–215. https://doi.org/10.1080/09541440042000278.

O’Hagan, Anthony, Caitlin E Buck, Alireza Daneshkhah, J Richard Eiser, Paul H Garthwaite,
David J Jenkinson, Jeremy E Oakley, and Tim Rakow. 2006. Uncertain Judgements: Eliciting
Experts’ Probabilities. John Wiley & Sons.

O’Hagan, Anthony, and Jonathan Forster. 2004. Kendall’s Advanced Theory of Statistics, Vol.
2B: Bayesian Inference. Wiley.

Ollman, Robert. 1966. “Fast Guesses in Choice Reaction Time.” Psychonomic Science 6 (4).
Springer: 155–56.

Ooms, Jeroen. 2021. pdftools: Text Extraction, Rendering and Converting of PDF Documents.
https://CRAN.R-project.org/package=pdftools.

Paananen, Topi, Juho Piironen, Paul-Christian Bürkner, and Aki Vehtari. 2021. “Implicitly
Adaptive Importance Sampling.” Statistics and Computing 31 (2). Springer Science+Business
Media LLC. https://doi.org/10.1007/s11222-020-09982-2.

Paape, Dario, Serine Avetisyan, Sol Lago, and Shravan Vasishth. 2021. “Modeling Misretrieval
and Feature Substitution in Agreement Attraction: A Computational Evaluation.” Cognitive
Science 45 (8). https://doi.org/10.1111/cogs.13019.

Paape, Dario, Bruno Nicenboim, and Shravan Vasishth. 2017. “Does Antecedent Complexity
Affect Ellipsis Processing? An Empirical Investigation.” Glossa: A Journal of General
Linguistics 2 (1).

Paape, Dario, and Shravan Vasishth. 2022. “Estimating the True Cost of Garden-Pathing: A
Computational Model of Latent Cognitive Processes.” Cognitive Science 46 (8): e13186.

Paolacci, Gabriele, Jesse Chandler, and Panagiotis G Ipeirotis. 2010. “Running Experiments
on Amazon Mechanical Turk.” Judgment and Decision Making 5 (5): 411–19.

Papaspiliopoulos, Omiros, Gareth O. Roberts, and Martin Sköld. 2007. “A General Framework
for the Parametrization of Hierarchical Models.” Statistical Science 22 (1). The Institute of
Mathematical Statistics: 59–73. https://doi.org/10.1214/088342307000000014.

Parmigiani, Giovanni, and Lurdes Inoue. 2009. Decision Theory: Principles and Approaches.
John Wiley & Sons.

Phillips, Colin, Matthew W. Wagers, and Ellen F. Lau. 2011. “Grammatical Illusions and
Selective Fallibility in Real-Time Language Comprehension.” In Experiments at the Interfaces,
37:147–80. Emerald Bingley, UK.

Picton, T.W., S. Bentin, P. Berg, E. Donchin, S.A. Hillyard, R. Johnson Jr., G.A. Miller, et al.
2000. “Guidelines for Using Human Event-Related Potentials to Study Cognition: Recording
Standards and Publication Criteria.” Psychophysiology 37 (2): 127–52.
https://doi.org/10.1111/1469-8986.3720127.

Piironen, Juho, Markus Paasiniemi, and Aki Vehtari. 2020. “Projective inference in high-
dimensional problems: Prediction and feature selection.” Electronic Journal of Statistics 14 (1).
Institute of Mathematical Statistics; Bernoulli Society: 2155–97. https://doi.org/10.1214/20-
EJS1711.

Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model
Selection.” Statistics and Computing 27 (3): 711–35. https://doi.org/10.1007/s11222-016-9649-
y.

Pinheiro, José C, and Douglas M Bates. 2000. Mixed-Effects Models in S and S-PLUS. New
York: Springer-Verlag.

Pitt, Mark A., and In Jae Myung. 2002. “When a Good Fit Can Be Bad.” Trends in Cognitive
Sciences 6 (10): 421–25. https://doi.org/10.1016/S1364-6613(02)01964-2.

Plummer, Martyn. 2016. “JAGS Version 4.2.0 User Manual.”

Plummer, Martyn. 2022. “Simulation-Based Bayesian Analysis.” Annual Reviews.

Port, Robert F, and Timothy Van Gelder. 1995. Mind as Motion: Explorations in the Dynamics
of Cognition. MIT Press.

Pullin, Jeffrey, Lyle Gurrin, and Damjan Vukcevic. 2021. “Statistical Models of Repeated
Categorical Ratings: The R Package rater.”

Pylyshyn, Zenon W., and Ron W. Storm. 1988. “Tracking Multiple Independent Targets:
Evidence for a Parallel Tracking Mechanism.” Spatial Vision 3 (3): 179–97.
https://doi.org/10.1163/156856888X00122.

Rabe, Maximilian M., Johan Chandra, André Krügel, Stefan A. Seelig, Shravan Vasishth, and
Ralf Engbert. 2021. “A Bayesian Approach to Dynamical Modeling of Eye-Movement Control in
Reading of Normal, Mirrored, and Scrambled Texts.” Psychological Review.
https://doi.org/10.1037/rev0000268.

Rabe, Maximilian M., Shravan Vasishth, Sven Hohenstein, Reinhold Kliegl, and Daniel J.
Schad. 2020a. “hypr: An R Package for Hypothesis-Driven Contrast Coding.” The Journal of
Open Source Software. https://doi.org/10.21105/joss.02134.

Rabe, Maximilian M., Shravan Vasishth, Sven Hohenstein, Reinhold Kliegl, and Daniel J
Schad. 2020b. “hypr: An R Package for Hypothesis-Driven Contrast Coding.” Journal of Open
Source Software 5 (48): 2134.

Ratcliff, Roger. 1978. “A Theory of Memory Retrieval.” Psychological Review 85 (2). American
Psychological Association: 59.

Ratcliff, Roger, Philip L. Smith, Scott D. Brown, and Gail McKoon. 2016. “Diffusion Decision
Model: Current Issues and History.” Trends in Cognitive Sciences 20 (4): 260–81.
https://doi.org/10.1016/j.tics.2016.01.007.

Ratcliff, Roger, and Francis Tuerlinckx. 2002. “Estimating Parameters of the Diffusion Model:
Approaches to Dealing with Contaminant Reaction Times and Parameter Variability.”
Psychonomic Bulletin & Review 9 (3). Springer: 438–81.

Raymond, Jane E, Kimron L Shapiro, and Karen M Arnell. 1992. “Temporary Suppression of
Visual Processing in an RSVP Task: An Attentional Blink?” Journal of Experimental
Psychology: Human Perception and Performance 18 (3). American Psychological Association:
849.

Rayner, K. 1998. “Eye movements in reading and information processing: 20 years of
research.” Psychological Bulletin 124 (3): 372–422.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna,
Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Reali, Florencia, and Morten H Christiansen. 2007. “Processing of Relative Clauses Is Made
Easier by Frequency of Occurrence.” Journal of Memory and Language 57 (1). Elsevier: 1–23.

Ripley, Brian. 2019. MASS: Support Functions and Datasets for Venables and Ripley’s MASS.
https://CRAN.R-project.org/package=MASS.

Robert, Christian P. 2022. “50 Shades of Bayesian Testing of Hypotheses.” arXiv Preprint
arXiv:2206.06659.

Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on
Theory Testing.” Psychological Review 107 (2): 358–67.

Rodriguez, Josue E, Donald R Williams, and Philippe Rast. 2021. “Who Is and Is Not
‘Average’? Random Effects Selection with Spike-and-Slab Priors.” PsyArXiv.

Rosenthal, Robert, Ralph L Rosnow, and Donald B Rubin. 2000. Contrasts and Effect Sizes in
Behavioral Research: A Correlational Approach. Cambridge University Press.

Ross, Sheldon. 2002. A First Course in Probability. Pearson Education.

Rouder, Jeffrey N. 2005. “Are Unshifted Distributional Models Appropriate for Response
Time?” Psychometrika 70 (2). Springer Science + Business Media: 377–81.
https://doi.org/10.1007/s11336-005-1297-7.

Rouder, Jeffrey N., Julia M Haaf, and Joachim Vandekerckhove. 2018. “Bayesian Inference for
Psychology, Part IV: Parameter Estimation and Bayes Factors.” Psychonomic Bulletin &
Review 25 (1): 102–13.

Rouder, Jeffrey N., Jordan M. Province, Richard D. Morey, Pablo Gomez, and Andrew
Heathcote. 2015. “The Lognormal Race: A Cognitive-Process Model of Choice and Latency
with Desirable Psychometric Properties.” Psychometrika 80 (2): 491–513.
https://doi.org/10.1007/s11336-013-9396-3.

Rouder, Jeffrey N., Paul L Speckman, Dongchu Sun, Richard D Morey, and Geoffrey Iverson.
2009. “Bayesian T Tests for Accepting and Rejecting the Null Hypothesis.” Psychonomic
Bulletin & Review 16 (2): 225–37.

Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. New York: Chapman and
Hall/CRC Press.

Safavi, Molood Sadat, Samar Husain, and Shravan Vasishth. 2016. “Dependency Resolution
Difficulty Increases with Distance in Persian Separable Complex Predicates: Implications for
Expectation and Memory-Based Accounts.” Frontiers in Psychology 7 (403).

Salas, Saturnino L, Garret J Etgen, and Einar Hille. 2003. Calculus: One and Several
Variables. Ninth. John Wiley & Sons.

Salvatier, John, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. “Probabilistic
Programming in Python Using PyMC3.” PeerJ Computer Science 2 (April). PeerJ: e55.
https://doi.org/10.7717/peerj-cs.55.

Säilynoja, Teemu, Paul-Christian Bürkner, and Aki Vehtari. 2022. “Graphical Test for Discrete
Uniformity and Its Applications in Goodness-of-Fit Evaluation and Multiple Sample
Comparison.” Statistics and Computing 32 (2). Springer: 1–21.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled
Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1). American
Psychological Association: 103–26.

Schad, Daniel J., Michael Betancourt, and Shravan Vasishth. 2019. “Toward a Principled
Bayesian Workflow in Cognitive Science.” arXiv. https://doi.org/10.48550/ARXIV.1904.12765.

Schad, Daniel J., Bruno Nicenboim, Paul-Christian Bürkner, Michael J. Betancourt, and
Shravan Vasishth. 2021. “Workflow Techniques for the Robust Use of Bayes Factors.”

Schad, Daniel J, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, and Shravan
Vasishth. 2022. “Workflow Techniques for the Robust Use of Bayes Factors.” Psychological
Methods. American Psychological Association.

Schad, Daniel J, Bruno Nicenboim, and Shravan Vasishth. 2022. “Data Aggregation Can Lead
to Biased Inferences in Bayesian Linear Mixed Models.” arXiv Preprint arXiv:2203.02361.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to
Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and
Language 110. https://doi.org/10.1016/j.jml.2019.104038.

———. 2020. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.”
Journal of Memory and Language 110. Elsevier: 104038.

Schönbrodt, Felix D, and Eric-Jan Wagenmakers. 2018. “Bayes Factor Design Analysis:
Planning for Compelling Evidence.” Psychonomic Bulletin & Review 25 (1): 128–42.

Seber, George A. F., and Allen J. Lee. 2003. Linear Regression Analysis. 2nd Edition.
Hoboken, NJ: John Wiley & Sons.

Shiffrin, Richard, Michael D. Lee, Woojae Kim, and Eric-Jan Wagenmakers. 2008. “A Survey
of Model Evaluation Approaches with a Tutorial on Hierarchical Bayesian Methods.” Cognitive
Science: A Multidisciplinary Journal 32 (8): 1248–84.
https://doi.org/10.1080/03640210802414826.

Sidi, Jonathan, and Daniel Polhamus. 2020. texPreview: Compile and Preview Snippets of
“LaTeX”. https://CRAN.R-project.org/package=texPreview.

Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G. Martins, and Sigrunn H. Sørbye.
2017. “Penalising Model Component Complexity: A Principled, Practical Approach to
Constructing Priors.” Statistical Science 32 (1): 1–28. https://doi.org/10.1214/16-STS576.

Singmann, Henrik, Ben Bolker, Jake Westfall, Frederik Aust, and Mattan S. Ben-Shachar.
2020. afex: Analysis of Factorial Experiments. https://CRAN.R-project.org/package=afex.

Sivula, Tuomas, Måns Magnusson, and Aki Vehtari. 2020. “Uncertainty in Bayesian Leave-
One-Out Cross-Validation Based Model Comparison.”

Smith, Jared B, and William H Batchelder. 2010. “Beta-MPT: Multinomial Processing Tree
Models for Addressing Individual Differences.” Journal of Mathematical Psychology 54 (1).
Elsevier: 167–83.

Soetaert, Karline, and Peter M.J. Herman. 2009. A Practical Guide to Ecological Modelling.
Using R as a Simulation Platform. Springer.

Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed
Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.”
Quantitative Methods for Psychology 12 (3): 175–200.

Spector, Robert H. 1990. “The Pupils.” In Clinical Methods: The History, Physical, and
Laboratory Examinations, edited by H. Kenneth Walker, W. Dallas Hall, and J. Willis Hurst, 3rd
ed. Boston: Butterworths.

Spiegelhalter, David J, Keith R Abrams, and Jonathan P Myles. 2004. Bayesian Approaches to
Clinical Trials and Health-Care Evaluation. Vol. 13. John Wiley & Sons.

Spiegelhalter, David J, Laurence S. Freedman, and Mahesh KB Parmar. 1994. “Bayesian
Approaches to Randomized Trials.” Journal of the Royal Statistical Society. Series A (Statistics
in Society) 157 (3): 357–416.

Spurdle, Abby. 2020a. barsurf: Heatmap-Related Plots and Smooth Multiband Color
Interpolation. https://CRAN.R-project.org/package=barsurf.

———. 2020b. bivariate: Bivariate Probability Distributions.
https://CRAN.R-project.org/package=bivariate.

Spurdle, Abby, and Emil Bode. 2020. intoo: Minimal Language-Like Extensions.
https://CRAN.R-project.org/package=intoo.

Stan Development Team. 2021. “Stan Modeling Language Users Guide and Reference
Manual, Version 2.27.” https://mc-stan.org.

Steyer, Rolf, and Werner Nagel. 2017. Probability and Conditional Expectation: Fundamentals
for the Empirical Sciences. Vol. 5. John Wiley & Sons.

Stroop, J Ridley. 1935. “Studies of Interference in Serial Verbal Reactions.” Journal of
Experimental Psychology 18 (6). Psychological Review Company: 643.

Sutton, Alexander J, Nicky J Welton, Nicola Cooper, Keith R Abrams, and AE Ades. 2012.
Evidence Synthesis for Decision Making in Healthcare. Vol. 132. John Wiley & Sons.

Szollosi, Aba, David Kellen, Danielle J Navarro, Richard Shiffrin, Iris van Rooij, Trisha Van
Zandt, and Chris Donkin. 2020. “Is Preregistration Worthwhile?” Trends in Cognitive Sciences
24 (2). Elsevier: 94–95.

Tabor, Whitney, and Michael K Tanenhaus. 1999. “Dynamical Models of Sentence
Processing.” Cognitive Science 23 (4). Wiley Online Library: 491–515.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018.
“Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint
arXiv:1804.06788.

Tendeiro, Jorge N., and Henk A. L. Kiers. 2022. “Diagnosing the Use of the Bayes Factor in
Applied Research.”

Tetlock, Philip, and Dan Gardner. 2015. Superforecasting: The Art and Science of Prediction.
Crown Publishers.

Turner, Brandon M, Per B Sederberg, Scott D Brown, and Mark Steyvers. 2013. “A Method for
Efficiently Sampling from Distributions with Correlated Dimensions.” Psychological Methods
18 (3). American Psychological Association: 368.

Turner, R.M., D.J. Spiegelhalter, G. Smith, and S.G. Thompson. 2008. “Bias Modelling in
Evidence Synthesis.” Journal of the Royal Statistical Society: Series A (Statistics in Society)
172 (1). Wiley Online Library: 21–47.

Tversky, Amos, and Daniel Kahneman. 1983. “Extensional Versus Intuitive Reasoning: The
Conjunction Fallacy in Probability Judgment.” Psychological Review 90 (4). American
Psychological Association: 293.

Ulrich, Rolf, and Jeff Miller. 1993. “Information Processing Models Generating Lognormally
Distributed Reaction Times.” Journal of Mathematical Psychology 37 (4): 513–25.
https://doi.org/10.1006/jmps.1993.1032.

———. 1994. “Effects of Truncation on Reaction Time Analysis.” Journal of Experimental
Psychology: General 123 (1): 34–80. https://doi.org/10/b8tsnh.

Vaidyanathan, Ramnath, Yihui Xie, JJ Allaire, Joe Cheng, and Kenton Russell. 2018.
htmlwidgets: HTML Widgets for R. https://CRAN.R-project.org/package=htmlwidgets.

van Doorn, Johnny, Frederik Aust, Julia M Haaf, Angelika Stefan, and Eric-Jan Wagenmakers.
2021. “Bayes Factors for Mixed Models.” Computational Brain and Behavior.
https://doi.org/10.1007/s42113-021-00113-2.

Van Dyke, Julie A, and Brian McElree. 2011. “Cue-Dependent Interference in
Comprehension.” Journal of Memory and Language 65 (3). Elsevier: 247–63.

Vasishth, Shravan. 2015. “A Meta-Analysis of Relative Clause Processing in Mandarin
Chinese Using Bias Modelling.” Master’s thesis, Sheffield, UK: School of Mathematics and
Statistics, University of Sheffield.
http://www.ling.uni-potsdam.de/~vasishth/pdfs/VasishthMScStatistics.pdf.

Vasishth, Shravan, Sven Bruessow, Richard L. Lewis, and Heiner Drenhaus. 2008.
“Processing Polarity: How the Ungrammatical Intrudes on the Grammatical.” Cognitive
Science 32 (4): 685–712.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese
Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public
Library of Science: 1–14.

Vasishth, Shravan, Nicolas Chopin, Robin Ryder, and Bruno Nicenboim. 2017. “Modelling
Dependency Completion in Sentence Comprehension as a Bayesian Hierarchical Mixture
Process: A Case Study Involving Chinese Relative Clauses.” In Proceedings of Cognitive
Science Conference. London, UK. https://arxiv.org/abs/1702.00564v2.

Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive
Process: A Computational Approach. Cambridge, UK: Cambridge University Press.
https://books.google.de/books?id=6KZKzgEACAAJ.

Vasishth, Shravan, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. 2018a. “The
Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of
Memory and Language 103: 151–75. https://doi.org/10.1016/j.jml.2018.07.004.

Vasishth, Shravan, Daniela Mertzen, Lena A Jäger, and Andrew Gelman. 2018b. “The
Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of
Memory and Language 103: 151–75.

Vasishth, Shravan, and Bruno Nicenboim. 2016. “Statistical Methods for Linguistic Research:
Foundational Ideas – Part I.” Language and Linguistics Compass 10 (8): 349–69.

Vasishth, Shravan, Bruno Nicenboim, Felix Engelmann, and Frank Burchert. 2019.
“Computational Models of Retrieval Processes in Sentence Processing.” Trends in Cognitive
Sciences 23: 968–82. https://doi.org/10.1016/j.tics.2019.09.003.

Vasishth, Shravan, Daniel J. Schad, Audrey Bürki, and Reinhold Kliegl. 2021. Linear Mixed
Models for Linguistics and Psychology: A Comprehensive Introduction. CRC Press.
https://vasishth.github.io/Freq_CogSci/.

Vasishth, Shravan, Katja Suckow, Richard L. Lewis, and Sabine Kern. 2011. “Short-Term
Forgetting in Sentence Comprehension: Crosslinguistic Evidence from Head-Final Structures.”
Language and Cognitive Processes 25: 533–67.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample
Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.”
Computational Brain and Behavior.

Vasishth, S., L. A. Jäger, and B. Nicenboim. 2017. “Feature overwriting as a finite mixture
process: Evidence from comprehension data.” In Proceedings of MathPsych/ICCM
Conference. Warwick, UK. https://arxiv.org/abs/1703.04081.

Vehtari, Aki. 2022. “Cross-validation FAQ.”
https://web.archive.org/web/20221219223947/https://avehtari.github.io/modelselection/CV-
FAQ.html.

Vehtari, Aki, and Andrew Gelman. 2015. “Pareto Smoothed Importance Sampling.” arXiv
Preprint arXiv:1507.02646.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017a. “Practical Bayesian Model Evaluation
Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.
https://doi.org/10.1007/s11222-016-9696-4.

———. 2017b. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation
and WAIC.” Statistics and Computing 27 (5): 1413–32.
https://doi.org/10.1007/s11222-016-9696-4.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner.
2019. “Rank-Normalization, Folding, and Localization: An Improved R̂ for Assessing
Convergence of MCMC.” arXiv Preprint arXiv:1903.08008.

Vehtari, Aki, and Jouko Lampinen. 2002. “Bayesian Model Assessment and Comparison Using
Cross-Validation Predictive Densities.” Neural Computation 14 (10): 2439–68.
https://doi.org/10.1162/08997660260293292.

Vehtari, Aki, and Janne Ojanen. 2012. “A Survey of Bayesian Predictive Methods for Model
Assessment, Selection and Comparison.” Statistical Surveys 6 (0). Institute of Mathematical
Statistics: 142–228. https://doi.org/10.1214/12-ss102.

Vehtari, Aki, Daniel P. Simpson, Yuling Yao, and Andrew Gelman. 2019. “Limitations of
‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection’.” Computational
Brain & Behavior 2 (1): 22–27. https://doi.org/10.1007/s42113-018-0020-6.

Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S-PLUS. New
York: Springer.

Verhagen, Josine, and Eric-Jan Wagenmakers. 2014. “Bayesian Tests to Quantify the Result
of a Replication Attempt.” Journal of Experimental Psychology: General 143 (4): 1457–75.
https://doi.org/10.1037/a0036731.

Von Baeyer, Hans Christian. 1988. “How Fermi Would Have Fixed It.” The Sciences 28 (5).
Blackwell Publishing Ltd Oxford, UK: 2–4.

Wagenmakers, Eric-Jan, and Scott Brown. 2007. “On the Linear Relation Between the Mean
and the Standard Deviation of a Response Time Distribution.” Psychological Review 114 (3).
American Psychological Association: 830.

Wagenmakers, Eric-Jan, Raoul P. P. P. Grasman, and Peter C. M. Molenaar. 2005. “On the
Relation Between the Mean and the Variance of a Diffusion Model Response Time
Distribution.” Journal of Mathematical Psychology 49 (3): 195–204.
https://doi.org/10.1016/j.jmp.2005.02.003.

Wagenmakers, Eric-Jan, Michael D. Lee, Jeffrey N. Rouder, and Richard D. Morey. 2020. “The
Principle of Predictive Irrelevance or Why Intervals Should Not Be Used for Model
Comparison Featuring a Point Null Hypothesis.” In The Theory of Statistics in Psychology:
Applications, Use, and Misunderstandings, edited by Craig W. Gruber, 111–29. Cham:
Springer International Publishing. https://doi.org/10.1007/978-3-030-48043-1_8.

Wagenmakers, Eric-Jan, Tom Lodewyckx, Himanshu Kuriyal, and Raoul Grasman. 2010.
“Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method.”
Cognitive Psychology 60 (3). Elsevier: 158–89.

Wahn, Basil, Daniel P. Ferris, W. David Hairston, and Peter König. 2016. “Pupil Sizes Scale
with Attentional Load and Task Experience in a Multiple Object Tracking Task.” PLOS ONE 11
(12): e0168087. https://doi.org/10.1371/journal.pone.0168087.

Walker, Grant M, Gregory Hickok, and Julius Fridriksson. 2018. “A Cognitive Psychometric
Model for Assessment of Picture Naming Abilities in Aphasia.” Psychological Assessment 30 (6).
American Psychological Association: 809–26. https://doi.org/10.1037/pas0000529.

Wang, Fei, and Alan E Gelfand. 2002. “A Simulation-Based Approach to Bayesian Sample
Size Determination for Performance Under a Given Model and for Separating Models.”
Statistical Science. JSTOR, 193–208.

Wang, Wei, and Andrew Gelman. 2014. “Difficulty of Selecting Among Multilevel Models Using
Predictive Accuracy.” Statistics and Its Interface 7: 1–8.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values:
Context, Process, and Purpose.” The American Statistician 70 (2). Taylor & Francis: 129–33.

Wickelgren, Wayne A. 1977. “Speed-Accuracy Tradeoff and Information Processing
Dynamics.” Acta Psychologica 41 (1): 67–85.

Wickelmaier, Florian, and Achim Zeileis. 2018. “Using Recursive Partitioning to Account for
Parameter Heterogeneity in Multinomial Processing Tree Models.” Behavior Research
Methods 50 (3). Springer: 1217–33.

Wickham, Hadley. 2019a. forcats: Tools for Working with Categorical Variables (Factors).
https://CRAN.R-project.org/package=forcats.

———. 2019b. stringr: Simple, Consistent Wrappers for Common String Operations.
https://CRAN.R-project.org/package=stringr.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan,
Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open
Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi,
Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. ggplot2: Create Elegant Data Visualisations
Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. dplyr: A Grammar of
Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Lionel Henry. 2019. tidyr: Tidy Messy Data. https://CRAN.R-
project.org/package=tidyr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2018. readr: Read Rectangular Text
Data. https://CRAN.R-project.org/package=readr.

Wilke, Claus O. 2020. cowplot: Streamlined Plot Theme and Plot Annotations for ’Ggplot2’.
https://CRAN.R-project.org/package=cowplot.

Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K
Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Computational Biology 13
(6). Public Library of Science San Francisco, CA USA: e1005510.

Wilson, Robert C, and Anne GE Collins. 2019. “Ten Simple Rules for the Computational
Modeling of Behavioral Data.” Edited by Timothy E Behrens. eLife 8 (November). eLife
Sciences Publications, Ltd: e49547. https://doi.org/10.7554/eLife.49547.

Wolodzko, Tymoteusz. 2019. extraDistr: Additional Univariate and Multivariate Distributions.
https://CRAN.R-project.org/package=extraDistr.

Xie, Yihui. 2019a. bookdown: Authoring Books and Technical Documents with R Markdown.
https://CRAN.R-project.org/package=bookdown.

———. 2019b. knitr: A General-Purpose Package for Dynamic Report Generation in R.
https://CRAN.R-project.org/package=knitr.

———. 2019c. servr: A Simple Http Server to Serve Static Files or Dynamic Documents.
https://CRAN.R-project.org/package=servr.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2019. DT: A Wrapper of the Javascript Library
’Datatables’. https://CRAN.R-project.org/package=DT.

Yackulic, Charles B., Michael Dodrill, Maria Dzul, Jamie S. Sanderlin, and Janice A. Reid.
2020. “A Need for Speed in Bayesian Population Models: A Practical Guide to Marginalizing
and Recovering Discrete Latent States.” Ecological Applications 30 (5): e02112.
https://doi.org/10.1002/eap.2112.

Yadav, Himanshu, Dario Paape, Garrett Smith, Brian W. Dillon, and Shravan Vasishth. 2022.
“Individual Differences in Cue Weighting in Sentence Comprehension: An evaluation using
Approximate Bayesian Computation.” Open Mind.
https://doi.org/10.1162/opmi_a_00052.

Yadav, Himanshu, Garrett Smith, Sebastian Reich, and Shravan Vasishth. 2022. “Number
Feature Distortion Modulates Cue-Based Retrieval in Reading.” Journal of Memory and
Language.

Yadav, Himanshu, Garrett Smith, and Shravan Vasishth. 2021a. “Feature Encoding Modulates
Cue-Based Retrieval: Modeling Interference Effects in Both Grammatical and Ungrammatical
Sentences.” Proceedings of the Cognitive Science Conference.

———. 2021b. “Is Similarity-Based Interference Caused by Lossy Compression or Cue-Based
Retrieval? A Computational Evaluation.” Proceedings of the International Conference on
Cognitive Modeling.

Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2017. “Using Stacking to
Average Bayesian Predictive Distributions.” Bayesian Analysis. https://doi.org/10.1214/17-
BA1091.

———. 2018. “Yes, but Did It Work?: Evaluating Variational Inference.” In International
Conference on Machine Learning, 5581–90. PMLR.

Yarkoni, Tal. 2020. “The Generalizability Crisis.” Behavioral and Brain Sciences. Cambridge
University Press, 1–37. https://doi.org/10.1017/S0140525X20001685.

Yellott, John I. 1967. “Correction for Guessing in Choice Reaction Time.” Psychonomic
Science 8 (8): 321–22. https://doi.org/10.3758/BF03331682.

———. 1971. “Correction for Fast Guessing and the Speed-Accuracy Tradeoff in Choice
Reaction Time.” Journal of Mathematical Psychology 8 (2): 159–99.
https://doi.org/10.1016/0022-2496(71)90011-3.

Zhu, Hao. 2019. kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax.
https://CRAN.R-project.org/package=kableExtra.
