Human–AI Interactions in Public Sector Decision Making: “Automation Bias” and “Selective Adherence” to Algorithmic Advice
https://doi.org/10.1093/jopart/muac007
Advance access publication 8 February 2022
Article
Saar Alon-Barkat,* Madalina Busuioc†
*University of Haifa, Israel
†Vrije Universiteit Amsterdam, The Netherlands
Address correspondence to the author at [email protected].
Abstract
Artificial intelligence algorithms are increasingly adopted as decisional aides by public bodies, with the promise of overcoming biases of human
decision-makers. At the same time, they may introduce new biases in the human–algorithm interaction. Drawing on psychology and public
administration literatures, we investigate two key biases: overreliance on algorithmic advice even in the face of “warning signals” from other
sources (automation bias), and selective adoption of algorithmic advice when this corresponds to stereotypes (selective adherence). We assess
these via three experimental studies conducted in the Netherlands: In study 1 (N = 605), we test automation bias by exploring participants’
adherence to an algorithmic prediction compared to an equivalent human-expert prediction. We do not find evidence for automation bias. In
study 2 (N = 904), we replicate these findings, and also test selective adherence. We find a stronger propensity for adherence when the advice
is aligned with group stereotypes, with no significant differences between algorithmic and human-expert advice. In study 3 (N = 1,345), we
replicate our design with a sample of civil servants. This study was conducted shortly after a major scandal involving public authorities’ reliance
on an algorithm with discriminatory outcomes (the “childcare benefits scandal”). The scandal is itself illustrative of our theory and of the patterns diagnosed empirically in our experiments; yet in study 3, while our prior findings as to automation bias are supported, we do not find patterns of selective adherence. We suggest this is driven by bureaucrats’ enhanced awareness of discrimination and algorithmic biases in the aftermath of
the scandal. We discuss the implications of our findings for public sector decision making in the age of automation. Overall, our study speaks to
potential negative effects of automation of the administrative state for already vulnerable and disadvantaged citizens.
© The Author(s) 2022. Published by Oxford University Press on behalf of the Public Management Research Association.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/
licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For
commercial re-use, please contact [email protected]
predictive policing—support human decision making. As such, algorithms do not remove the human decision-maker out of the equation—instead, algorithmic decision making arises at the interaction of the two.

For all its promise, the deployment of AI algorithmic technologies in the public sector has raised important concerns. High among these are concerns with algorithmic accountability and oversight of algorithmic outputs (Busuioc 2021; Diakopoulos 2014); issues of “algorithmic bias”—the well-documented propensity of algorithms to learn systemic bias through, among others, their reliance on historical data and to come to perpetuate it, effectively “automating inequality” (Eubanks 2018); as well as the potential for bias arising from human processing of AI algorithmic outputs. This article focuses on the latter, which we believe is an important and especially worthy aspect of analysis in light of algorithms’ roles as decisional aids in public sector decision making. In this context, it becomes important to understand the implications of these technologies in shaping public sector decision making and the specific cognitive biases that might arise in this respect. This gains yet further relevance as, in the context of the rise of algorithmic governance, human decision-makers are regarded as important safeguards, as decisional mediators, on issues of algorithmic bias. Investigating to what extent our cognitive limits allow us to act as effective decisional mediators becomes critical in an increasingly automated administrative state.

In this article, we focus on two diverging biases, theorizing on the basis of two strands of literature from different disciplines that have thus far not spoken to each other on this topic. The first bias, which builds on previous social psychology studies, is automation bias. It refers to a well-documented human propensity to automatically defer to automated systems, despite warning signals or contradictory information from other sources. In other words, human actors are found to uncritically abdicate their decision making to automation. While robust, these findings have been documented for AI algorithmic precursors such as pilot navigation systems and in fields outside a public sector context. The second bias we theorize and test can be extrapolated from existing public administration research on biased information processing, and pertains to decision-makers’ selective adherence to algorithmic advice: namely, the propensity to adopt algorithmic advice selectively, when it matches pre-existing stereotypes about decision subjects (e.g., when predicting high risk for members of negatively stereotyped minority groups). This bias has not yet been investigated in our field with respect to algorithmic sources.

We report the results of three survey experiment studies conducted in the Netherlands, which provide rigorous tests for these hypothesized biases. In study 1 (N = 605), we put automation bias to the test by exploring participants’ adherence to an algorithmic prediction (which contradicts additional evidence) and comparing it to an equivalent human-expert prediction. In study 2 (N = 904), we replicate these findings, and at the same time, we also test whether decision subjects’ ethnic background moderates decision-makers’ inclination to follow the algorithmic advice; in other words, whether respondents are more likely to follow algorithmic advice when this prediction is aligned with pre-existing group stereotypes (engaging in “selective” rather than automatic adherence). Studies 1 and 2 were conducted among citizens in a context where citizens can act as decision-makers. In study 3, we set out to further replicate our findings with a sample of Dutch civil servants (N = 1,345). During our preparations for that study, a major political scandal occurred in the Netherlands (the “childcare benefits scandal”), involving algorithm use by public authorities. The scandal involved the tax authorities’ reliance, as a decisional aid, on an AI algorithm that used nationality as a discriminant predictive feature, with ensuing bureaucratic decisions reflecting discrimination of minority groups. We discuss the results of study 3 in light of its co-occurrence with these events, which closely align with our theory.

Our focus is on human processing biases arising from the use of AI algorithms in a public sector context. While we would expect such biases to be equally relevant for algorithmic decision making in the private sector, we focus on the public sector because the stakes are especially high for governments. AI algorithms are increasingly adopted in high-stakes areas—where they are highly consequential for individuals’ lives, rendering these questions especially pressing in a public sector context.

Automation and Decision Making in the Public Sector: A Tale of Two Biases

An important and growing literature in public administration is concerned with the effects of the increasing reliance on digital technologies for public sector decision making. A key concern in particular pertains to the implications of these technologies for the discretion and professional judgment of decision-makers such as (street-level) bureaucrats (Bovens and Zouridis 2002). This literature has flagged the potential, in the age of automation, for “digital discretion” (Busch and Henriksen 2018), “automated discretion” (Zouridis, van Eck, and Bovens 2020) or, specifically in the context of AI, “artificial discretion” (Young, Bullock, and Lecy 2019) to supplant the discretion of bureaucratic actors in the administration (see also Buffat 2015; Bullock 2019; de Boer and Raaphorst 2021). In other words, digital tools have the potential to “influence or replace human judgment” in public service provision (Busch and Henriksen 2018, 4), to alter the very nature of public managers’ work (Kim, Andersen, and Lee 2021), and to reshape bureaucratic structures and routines (Meijer, Lorenz, and Wessels 2021). Such tools stand to fundamentally shape public sector decision making through constraining, or even removing, the scope for human expertise and discretion, or through influencing human judgment and cognition in unexpected ways. In doing so, the delegation of administrative decision-making authority to AI technologies could have profound implications for bureaucratic legitimacy (Busuioc 2021) and public values more broadly (Schiff, Schiff, and Pierson 2021).

In this context, it therefore becomes important to understand how decision-makers in a public sector context process algorithmic outputs used as decisional aids, how they incorporate them into their decision making, what the implications are, and whether these differ in significant ways from the processing of traditional (human-sourced) advice. To operationalize the potential implications of AI advice for decision making, and given limited theorizing of potential cognitive biases in this emerging area, we borrow from, theorize, and integrate insights from two separate strands of literature, which offer
important starting points to unpack this topic: social psychology literature on automation and public administration research on information processing. Interestingly, these two literatures offer us somewhat competing projections as to what to expect.

Automation Bias: Automatic Adherence to Algorithmic Advice

While AI is meant to help us overcome our biases, research from social psychology suggests that automated systems might give rise to new and distinct biases arising from human processing of automated outputs. “Automation bias” is a well-recognized decisional support problem that has emerged from studies in aviation and healthcare, areas that have traditionally relied heavily on automated tools. Automation bias refers to undue deference to automated systems by human actors who disregard contradictory information from other sources or do not (thoroughly) search for additional information (Cummings 2006; Lyell and Coiera 2017; Mosier et al. 2001; Parasuraman and Riley 1997; Skitka, Mosier, and Burdick 1999, 2000; Skitka et al. 2000). In other words, it is manifest in “the use of automation as a heuristic replacement for vigilant information seeking and processing” (Mosier et al. 1998, 201), a “short cut that prematurely shuts down situation assessment” (Skitka, Mosier, and Burdick 2000, 714).

Experimental lab studies have diagnosed this tendency across a number of research fields (Goddard, Roudsari, and Wyatt 2012). While robust, these findings have not been investigated in a bureaucratic context. As such, we do not know to what extent such biases are relevant and replicate in administrative contexts. Extant studies suggest that this propensity to defer to automation stems, on the one hand, from the perceived inherent superiority of automated systems by human actors and, on the other, from “cognitive laziness,” a human reluctance to engage in cognitively demanding mental processes, including thorough information search and processing (Skitka, Mosier, and Burdick 2000, 702). Research findings on automation bias are further supported by ample anecdotal evidence of automation bias with respect to commercial flights (Skitka et al. 2000, 703) and car navigation systems (Milner 2016), and more recently it has also been specifically documented in the context of AI for self-driving cars (National Transportation Safety Board 2017). Recent business management experiment-based studies similarly talk about “algorithm appreciation” (Logg, Minson, and Moore 2019), describing a similar tendency to over-trust algorithmic outputs.

Concerns with automation bias have been increasingly voiced by scholars in the context of a growing reliance on AI tools in the public sector and high-stakes scenarios (Cobbe 2019; Edwards and Veale 2017; Medium—Open Letter Concerned AI Researchers 2019; Zerilli et al. 2019), and increasingly so also by public administration scholars (Busuioc 2021; Peeters 2020; Giest and Grimmelikhuijsen 2020; Young et al. 2021). More broadly, this also corresponds to concerns raised by the public administration literature, as discussed above, on the potential of AI algorithmic tools (and digital tools more broadly) to replace bureaucratic discretion and professional judgment. Our investigation into automation bias speaks directly to this literature by setting out to elucidate whether the scope for discretion of human decision-makers is removed through the introduction of such tools.

Such concerns become particularly relevant given well-documented failures and malfunctioning of AI(-informed) systems (e.g., Benjamin 2019; Buolamwini and Gebru 2018; Ferguson 2017; Eubanks 2018; O’Neil 2016; Richardson, Schultz, and Crawford 2019; Rudin 2019). Due, among others, to model and/or data inadequacies, AI algorithms have been found to reproduce and automate systemic bias, and to do so in ways that, by virtue of their opaqueness and/or high complexity, have proven difficult to diagnose for domain experts and system engineers alike. A human propensity for default deference to algorithmic systems under such circumstances would become especially problematic—even more so given the high stakes of AI use in a public sector context.

H1—Decision-makers are more likely to trust and to follow algorithmic advice than human advice, when faced with similar contradicting external evidence. (automation bias)

Selective Adherence to Algorithmic Advice

We theorize a second, diverging concern regarding decision-makers’ use of algorithmic advice, extrapolating from behavioral work on public decision-makers’ information processing. Following a motivated reasoning logic, this growing body of literature has established that decision-makers are prone to selectively seek and interpret information in light of pre-existing stereotypes, beliefs, and social identities. They assign greater weight to information congruent with prior beliefs and contest inputs that contradict them (Baekgaard et al. 2019; Baekgaard and Serritzlew 2016; Christensen et al. 2018; Christensen 2018; James and Van Ryzin 2017; Jilke 2017; Jilke and Baekgaard 2020). These studies have demonstrated such “confirmation biases” with regard to the processing and interpretation of “unambiguous” information such as performance indicators. However, this has not yet been explicitly theorized or investigated in relation to algorithmic decisional aids.

We theoretically extend this literature and argue that this motivated reasoning mechanism would apply not only to information inputs generated by humans, but also to information produced by AI algorithms. Thus, we would similarly expect decision-makers to adhere to algorithmic advice selectively, when it matches stereotypical views of the decision subject (rather than by default, as expected by the automation bias literature). This theoretical expectation also corresponds to works on bureaucratic discrimination indicating that bureaucratic decision-makers search for stereotype-consistent cues in their decisions, or respond to them unconsciously (e.g., Andersen and Guul 2019; Assouline, Gilad, and Bloom 2022; Jilke and Tummers 2018; Pedersen et al. 2018; Schram et al. 2009). In this regard, we theorize that an algorithmic prediction that accords with a group stereotype would similarly amount to such a cue, which provides expectancy confirmation.

While public administration scholars have thus far not investigated selective processing of algorithmic outputs, it has been the subject of recent investigations by law and computer science scholars in studies on the use of algorithmic
risk assessment by criminal courts (Green and Chen 2019a, 2019b; Stevenson 2018), diagnosing patterns that are consistent with selective adherence and motivated reasoning. Hence, extrapolating from and theorizing on the basis of these literatures, we first hypothesize that:

H2—Decision-makers are more likely to follow advice (human or algorithmic-based) that matches stereotypical views of the decision subjects. (selective adherence)

To clarify, H2 pertains to the expectation that selective adherence biases diagnosed for human-sourced advice are also present for algorithmic advice, that is, selective adherence across both human and algorithmic advice types. In other words, we theorize that these biases persist (do not disappear) in the adoption of algorithms in public sector decision making. Establishing whether selective adherence is present is important in a context where AI algorithms are said to have the potential to do away with human decisional biases. What is more, the presence of selective adherence biases gains special relevance in the algorithmic case. As evidence of systematic algorithmic biases is accumulating, human decision-makers in-the-loop are seen as critical checks, in their roles as decisional mediators. Investigating the presence of selective adherence therefore also speaks to the extent to which human decision-makers can actually function as effective decisional mediators and safeguards against such risks.

If selective adherence biases are to persist, the next question is whether they are more emphasized in the use of algorithms. Are decision-makers more prone to selective adherence to algorithmic advice compared to equivalent human advice? In other words, do algorithmic outputs exacerbate the risk of selective adoption and discriminatory decisions? We theorize that algorithms have the potential to amplify these biases due to their unique nature. The literature on automation has theorized that automated decisional aids tend to create a “moral buffer,” acting as a psychological distancing mechanism resulting in a diminished sense of moral agency, personal responsibility, and accountability for the human actor “because of a perception that the automation is in charge” (Cummings 2006, 8). These feelings of moral and ethical disengagement and decreased responsibility may reduce decision-makers’ awareness of potential biases and implicit prejudice. Or worse: the algorithmic advice could vindicate and give free license to decision-makers’ latent views (racial, xenophobic, misogynistic, etc.) by providing them with a seemingly legitimate reason to adopt discriminatory decisions. Algorithms, in other words, could serve to “give permission” to decision-makers to act on their biases: algorithms’ face-value “neutral” or “objective” character would fend off potential suspicions of bias and/or confirm the validity of biased or prejudiced decisions. An algorithmic recommendation aligned with decision-makers’ own biases could amount to a powerful (mathematical!) endorsement thereof. We therefore expect biased adherence to become especially emphasized for algorithmic advice by comparison with human advice.

Consequently, we further hypothesize that:

H3—Selective adherence is likely to be exacerbated when decision-makers receive algorithmic rather than human advice. (exacerbated selective adherence)

Empirical Evidence from Previous Studies

To date, we lack systematic empirical evidence about the prevalence of cognitive biases in algorithm-based public sector decision-making. Existing peer-reviewed empirical studies on this topic are from law and computer science scholars in the context of algorithm use in pretrial criminal justice decisions in the United States. These studies stem from the underlying concern with high levels of detention in the United States and its growing carceral state and are aimed at investigating the promise of algorithmic risk assessments to decrease detention levels through improving the accuracy of judges’ assessments of recidivism risk. Their tentative findings, as detailed below, are consistent with our theorized patterns of selective adherence. Stevenson (2018) uses archival data of criminal cases from the state of Kentucky to compare observationally detention rates before and after a reform in 2011 that made risk assessment mandatory in pretrial procedures. She finds that the expansion in the use of risk scores led to an overall increase in pretrial release immediately after the implementation of the reform; however, this eroded and almost disappeared within a matter of years. Additionally, the study finds that judges were more likely to accept low scores for white defendants, while overriding similar scores for black defendants.

These findings are further supported by a series of experimental studies among laypersons (Green and Chen 2019a, 2019b; Grgić-Hlača, Engel, and Gummadi 2019). These studies include a judicial decision-making task in which participants are shown details of arrests and are asked to predict recidivism risk, comparing participants’ predictions with/without an algorithmic risk assessment. Grgić-Hlača, Engel, and Gummadi (2019) find that participants did not significantly change their decisions in response to the algorithmic prediction, even when they receive feedback about its high accuracy or are incentivized to make correct predictions. Green and Chen (2019a, 2019b) further compare outcomes for black and white defendants and diagnose participant reliance on algorithms indicative of “disparate interactions”: participants adhered to the algorithmic advice to a greater degree when it predicted either high risk for a black defendant or low risk for a white defendant.

All in all, while most of these studies demonstrate that public decision-makers can be affected in their decisions by algorithmic decisional aids, they do not provide particularly strong evidence for automatic deference to algorithmic advice, as would be expected on the basis of the automation bias literature. They provide instead tentative empirical evidence that decision-makers tend to process such advice in a biased, selective manner.

Still, these studies have several important limitations. First, while the aim of these studies was to learn about the influence of algorithmic decisional aids, their comparison was only to a condition where decision-makers did not receive any advice at all, as opposed to comparable human expert advice. It is an open question, therefore, whether the effects found are attributable to algorithms per se, or whether other professional advice that similarly includes numeric outputs would yield the same outcome. We propose that, in order to isolate the distinct effect of algorithms, the appropriate counterfactual should be equivalent numeric advice produced by a human expert. Second, we argue that these studies are ill-equipped to investigate automation bias, since they lacked additional contradictory evidence
or inputs from other sources. Rather, automation bias can be tested effectively by supplementing the algorithmic advice with such additional inputs, a condition which “forces” decision-makers to choose whether to rely on the automated authority or rather take into account additional information and indicators. A similar approach was applied by previous automation bias experimental studies, where participants were given automated aids not aligned with other indicators (Mosier et al. 1998; Skitka, Mosier, and Burdick 1999, 2000; Skitka et al. 2000). Third, these studies are focused on the application of algorithms in one specific policy context. It is important to explore the generalizability of these patterns to additional public policy areas, especially given the rapid spread of algorithms across various policy contexts and jurisdictions.

Below, in the methodology section, we present our research design and discuss how it overcomes these limitations.

Research Design

To examine our hypotheses, we designed and conducted a series of three survey experiments among Dutch citizens and civil servants. Study 1 (N = 605) was designed to test our automation bias hypothesis. Study 2 (N = 904) was designed to replicate study 1 on a separate sample, as well as to test our two hypotheses regarding selective adherence to algorithmic advice. Studies 1 and 2 were conducted among Dutch citizens in a context where citizens can act as decision-makers. Thereafter, in study 3 (N = 1,345), we repeated our experimental design with a large sample of Dutch civil servants. The demographic characteristics of our samples are summarized in Appendix 2.

The studies involve an administrative decision-making task that concerns local school board decisions on the employment of teachers. As elaborated below, we utilized a hypothetical scenario of an algorithmic performance evaluation tool, used as a decisional aid for the assessment of Dutch high-school teachers.

In the Netherlands, members of such boards are not required to complete a specific professional certification, and are composed, among others, of volunteers (lay persons) such as parents or citizens from the local community (OECD 2014a, 14; OECD 2014b, 22, 98). As such, lay citizens are relevant decision-makers in this context. Moreover, to further enhance the external validity of our study, we additionally replicate the study with a large sample of actual civil servants—Dutch decision-makers from various policy areas and across government levels. An important advantage of our choice of empirical setting is that it involves a bureaucratic task that can be relatively easily exercised in a vignette survey experiment with participants who are not necessarily experts on the specific task, allowing us to test our expectations among decision-makers in a public sector context more broadly. Our explicit aim is to tap into generalized human biases in algorithm-supported decision making in the public sector.

Our decision to focus on the education setting in the vignette was inspired by the real-life case of Sarah Wysocki—a teacher in the United States who was fired based on the prediction of an algorithmic score, while ignoring her record and reputation as a well-performing teacher (Turque 2012). Wysocki’s story is often mentioned as an illustrative example of the dangers of bureaucracies’ reliance on black-box algorithms (see O’Neil 2016). We aimed to simulate a similar scenario in which officials are required to make a decision of whether to extend the employment contract of a teacher, when an algorithmic score indicates that she performs poorly, yet additional evidence suggests otherwise. We test experimentally whether people are more inclined to adhere to such advice when produced by an algorithm, compared to a human expert, as expected by our automation bias hypothesis. We further examine (in studies 2 and 3) whether participants are more likely to follow such advice when it concerns a decision subject from an ethnic minority background, and whether participants do so to a greater extent when the advice comes from an algorithm (as opposed to a human expert). This allows us to explore instead patterns of selective (rather than automatic) adherence.

We tailored our survey experimental design to the Dutch context. In the Netherlands, all schools operate under publicly funded educational associations, which enjoy a large autonomy in their management. Important decisions, including personnel management, are made by a school board, which includes representatives of the educational association. In our study, as detailed below, we invite participants to a simulation task where they act as board members of a hypothetical Dutch high school and are asked to make decisions about the employment of three new teachers. Below we present each of the three studies and their results. In addition, the results are summarized in supplementary table A4.

Study 1: Automatic Adherence to Algorithmic Versus Human Advice (Automation Bias)

Study 1 is designed to examine our hypothesis that decision-makers are inclined to over-trust algorithmic advice—to follow algorithmic predictions despite additional contradicting evidence, and to do so to a greater extent than when presented with equivalent advice by a human expert (H1). We preregistered the study and administered it in February 2020.1 The survey experiments were hosted on Qualtrics, and participants (N = 605) were recruited through a large online panel company—Dynata.2 The survey was conducted in Dutch.

1 The preregistration form of study 1 is available at https://aspredicted.org/5de9d.pdf. Methodological choices are further discussed in the supplementary section A6.
2 We estimated that a modest effect size of OR = 1.5 is detectable with power of 0.8 (p = .05, one-sided test), assuming a probability of .3 for the baseline human-advice group.
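The power estimate in footnote 2 can be approximated by treating the design as a two-group comparison of proportions. The following is a minimal sketch of such a calculation (our own illustration, with our own variable names; the authors may have used a different procedure):

    # Approximate the smallest detectable effect described in footnote 2:
    # baseline adherence probability .3, odds ratio 1.5, power 0.8, one-sided alpha .05.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    p_baseline = 0.30                                     # assumed human-advice adherence rate
    odds_treat = 1.5 * p_baseline / (1 - p_baseline)      # apply OR = 1.5 to the baseline odds
    p_treat = odds_treat / (1 + odds_treat)               # implied algorithmic-advice rate (~.39)

    effect = proportion_effectsize(p_treat, p_baseline)   # Cohen's h for the two proportions
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative="larger"
    )
    print(round(n_per_group))                             # required participants per condition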
Procedure

Survey participants are asked to act as board members of a hypothetical Dutch high school. In the main experimental task, we ask participants to make a decision regarding the employment of three teachers, who were hired the previous year for a trial period. Only two of the three new teachers can be permanently hired and, accordingly, participants must choose one teacher whose contract will not be renewed. As a basis for their decision, participants are given two data inputs per teacher (one qualitative input, and one numeric input—a score) in both the algorithmic and the human advice conditions. In the algorithmic condition, respondents are told the numeric input is produced by an algorithm, while in the human-expert advice condition they are told that it is produced by a human expert.

The first input, which was identical for all participants, is a brief summary of a qualitative evaluation by the HR person of the educational association. The second is a numeric prediction of the teachers’ potential to perform well in the future, ranging from 1 (lowest) to 10 (highest). Participants are told that this numeric prediction was conducted by a body named ILE (short for “Innovatieve Lerarenevaluatie”—“Innovative Teachers’ Evaluation”), and accordingly we refer to it as the “ILE evaluation score”. Respondents are randomly assigned to one of two conditions: they are told that the ILE score is either produced by a machine learning algorithm (algorithmic advice condition) or by consultants (human-expert advice condition). To bolster participants’ confidence in the predictive capacity of the ILE score, we noted (in both conditions) that it “has proven highly effective in predicting teacher performance, with an accuracy rate of 95%.”

It is noteworthy that the format we used for the ILE evaluation score (an integer between 1 and 10) was designed to resemble the COMPAS risk score that is used in pretrial procedures across the United States, which similarly ranges from 1 to 10. The comparison between a numeric algorithmic prediction and additional qualitative evidence (e.g., case file evidence presented to a judge) is typical of many policy areas where algorithms are used as decisional aids.

Participants were shown a table that presents the three teachers and the two inputs for each teacher, as illustrated in figure 1. To minimize additional differences in the characteristics of the three teachers, which could potentially affect participants’ decisions, all three teachers are female, have typical Dutch names, and their teaching areas are in the natural sciences. The order of the three teachers was randomized (see also supplementary section A5.4).

In line with our theoretical focus, we deliberately designed the task so that there would be an incongruence between the two inputs in the table: the lowest ILE score (4) is never matched with the most negative qualitative HR evaluation. The incongruence was as follows: one of the three teachers received a low ILE score of 4, whereas the other two received scores of 8 and 6. The HR person’s qualitative evaluation similarly varies, as one of the three teachers gets negative remarks, whereas the other two teachers receive positive and, respectively, mixed evaluations. Most importantly, however, the negative qualitative evaluation is never assigned to the teacher with the lowest ILE score (4), but to one of the other teachers. Instead, the teacher with the lowest ILE score receives either the positive or the mixed qualitative evaluation. Accordingly, participants faced a decision of whether or not to follow the ILE score (i.e., to fire the teacher with the most negative ILE score), given its incongruence with the HR person’s qualitative evaluation.
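To make the structure of the stimulus concrete, the sketch below simulates the random assignment of the advice source and the construction of the three-teacher table under the constraint just described (the lowest ILE score is never paired with the negative HR evaluation). It is purely illustrative: the names and data structures are ours and do not reproduce the authors’ survey materials.

    import random

    def make_vignette(rng: random.Random) -> dict:
        # Between-subjects factor: who is said to have produced the ILE score.
        advice_source = rng.choice(["machine learning algorithm", "consultants (human experts)"])
        # The negative HR evaluation never goes to the teacher with the lowest score (4);
        # that teacher receives either the positive or the mixed evaluation.
        evaluations = {4: rng.choice(["positive", "mixed"])}
        remaining = [e for e in ["positive", "mixed", "negative"] if e != evaluations[4]]
        rng.shuffle(remaining)
        evaluations[6], evaluations[8] = remaining
        teachers = [{"ile_score": s, "hr_evaluation": evaluations[s]} for s in (4, 6, 8)]
        rng.shuffle(teachers)  # presentation order was randomized in study 1
        return {"advice_source": advice_source, "teachers": teachers}

    print(make_vignette(random.Random(2020)))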
For exploratory purposes, we also randomized the distribution of the ILE scores (4, 6, and 8) across the three teachers, to generate different levels of incongruence between the ILE score and the qualitative evaluation. We assigned participants to one of two main conditions of incongruence. In the high incongruence condition (displayed in figure 1), the teacher with the lowest ILE score receives the most favorable qualitative evaluation. In the modest incongruence condition, the teacher with the lowest ILE score receives the mixed qualitative evaluation. In other words, through the qualitative input (the HR evaluation), respondents in both conditions receive informational cues that are at odds, to varying degrees, with the ILE score.

Our main outcome variable is participants’ likelihood to follow the ILE score. We coded 1 when participants chose to fire the teacher with the lowest ILE score and 0 otherwise. In our analyses below, we compare this binary variable between the two conditions (algorithmic versus human-expert).

The main task was followed by a series of manipulation check questions to confirm that participants were aware of the source of advice (algorithmic versus human) as well as of the actual ILE score (supplementary section A3). The survey further included an attention check, additional items regarding participants’ perceptions of algorithms and their familiarity with the use of algorithms by public bodies, and a set of demographic questions. The full survey is attached in the supplementary sections A8 and A9.

We excluded from all analyses participants who did not pass the attention check or completed the questionnaire in under 3 minutes. These filtering criteria are not associated with the assignment to the experimental conditions (supplementary table A2.1). The two experimental groups are balanced in relation to gender, reported income, and education, yet participants assigned to the algorithmic group are slightly older on average (supplementary table A1.1). In robustness analyses, we further control for these covariates (supplementary table A5.1.1). While this sample consists of Dutch citizens, their average age (47) and the share of participants with high education (50%) are comparable to those of the population of Dutch civil servants. Compared with civil servants, our sample over-represents women, and people aged less than 25 or above 65 (Appendix 2). In our analyses below, we control for these variables and confirm that these characteristics do not affect our results.

A technical clarification on our statistical reporting: in all results tables presented in the article we use two-tailed p values uniformly, for consistency. We additionally report, for our preregistered directional hypotheses, the one-sided p values, both in the tables and in the main text when discussing the effects.

Results (Study 1)

Tables 1 and 2 present the main experimental results of study 1. Table 1 reports the results of the logistic regression analysis as to the effect of our manipulation of the type of advice (algorithmic versus human), and Table 2 presents descriptively the distribution of participants’ decisions across the two conditions. Based on our first hypothesis, and in line with the automation bias literature, we expected the probability of following the advice of the ILE score (i.e., selecting not to renew the contract of the teacher with the lowest ILE score) to be higher among those assigned to the AI algorithmic advice, compared to those receiving an equivalent prediction produced by human experts.

Table 1. Study 1—Regression Results of Participants’ Adherence to Algorithmic Versus Human-Expert Advice (Automation Bias)

Predictors        (1)
                  OR [95% CI]         z         p value
Algorithm         0.96 [0.58–1.58]    −0.16     .876
Intercept         0.14 [0.09–0.19]    −11.41    <.001
Observations      605
Log-likelihood    −218.769

Note: Logistic regression model; OR, odds ratio; p values refer to a two-sided test (by default). Binary outcome: following the ILE score (1 = non-renewal of employment of teacher with lowest ILE score).

Table 2. Study 1—Descriptive Results (Automation Bias)

Outcome: Teacher Selected (Non-renewal of employment)       Algorithmic Advice (n = 295), %    Human-Expert Advice (n = 310), %
Teacher with lowest ILE score (algorithmic/human-expert)    11.5                               11.9
Teacher with poorest qualitative evaluation                 77.3                               81.0
Other                                                       11.2                               7.1

In contrast to our theoretical expectation, we find very small, statistically insignificant differences between the algorithmic-advice and human-advice conditions (table 1). Under both conditions, the vast majority of participants chose to override the ILE score and instead preferred to fire (not renew) the teacher with the poorest qualitative evaluation (table 2). Including covariates and restricting the samples to those who successfully completed the manipulation checks does not change the results (supplementary tables A5.1.1 and A5.2.1).

Furthermore, these patterns are similar regardless of whether the lowest ILE score was assigned to the teacher with the best qualitative evaluation (high incongruence condition) or to the teacher with the mixed qualitative evaluation (modest incongruence condition)—providing further confidence that the diagnosed patterns are stable (supplementary section A5.3). Also, randomizing the order of the three teachers did not significantly alter the results (supplementary section A5.4).

In summary, in study 1, we did not find evidence supporting the automation bias expectation. The majority of participants, under both algorithmic and human advice conditions, and across the conditions of incongruence, chose to override the ILE score.
Study 2: Selective Adherence to Algorithmic Versus Human Advice Matching Stereotypes

The purpose of study 2 is two-fold: first, it aims to replicate the results of study 1 on a separate sample; second, it is also designed to test the additional hypotheses that, similar to human advice, decision-makers are more inclined to follow algorithmic advice inasmuch as this is aligned with stereotypical views of the decision subjects (H2), and that this selective adherence pattern is exacerbated by AI algorithms compared with equivalent human expert advice (H3). We preregistered the study and administered it mid-March 2020, and similarly recruited participants through Dynata (N = 904).3

3 The preregistration form of study 2 is available at https://aspredicted.org/v3u29.pdf. Methodological choices are further discussed in the supplementary section A6.

Procedure

We repeated the procedure of study 1, while adding a manipulation of teachers’ names as a cue for their ethnic background. The control condition is identical to study 1—all three teachers are given typical Dutch surnames (“Verhagen,” “Jansen,” and “den Heijer”). In the treatment condition, the name of the teacher who received the lowest ILE score (4) is changed to “El Amrani,” a common surname for citizens with a Moroccan background. We henceforth refer to these conditions as “Dutch teacher” and “Moroccan-Dutch teacher.” We specifically selected this ethnic minority group in the Netherlands, since it is a minority group that is often negatively stereotyped (Jilke, Van Dooren, and Rys 2018; Kamans et al. 2009). Identical to study 1, we randomized the level of incongruence between the ILE score and the qualitative evaluation.4

4 In study 2, we did not randomize the order of the three teachers, based on the results of study 1.

Based on our theory, we expect that participants will be more inclined to fire the teacher with the lowest ILE score when that teacher has a Moroccan-sounding name (H2). Our sample for testing the selective adherence hypotheses was therefore filtered to include only respondents of Dutch descent (n = 792).5 In our analyses below, we examine the effect of this manipulation on participants’ inclination to follow the ILE score, and its interaction with the type of advice (algorithmic versus human, H3).

5 We assume negative stereotypes toward citizens of migrant descent to be more emphasized among citizens without a migration background.

It is important to note that previous vignette survey experimental studies have frequently failed to identify discriminatory patterns, which has been explained by methodological reasons, mainly social desirability pressures and the difficulty of simulating the conditions of real-world decision making (Wulff and Villadsen 2020). We were certainly aware of this limitation when designing our study, and for this reason we argue that our study can be considered a particularly hard case for our selective adherence hypothesis.

Similar to study 1, we excluded from all the analyses participants who did not pass the attention check or completed the questionnaire in less than 3 minutes. These filtering criteria are not associated with the assignment to the experimental conditions (supplementary table A2.2). After this filtering, we were left with an analytical sample of N = 904 for the replication of study 1 (automation bias hypothesis) and N = 792 for the testing of our selective adherence hypotheses, that is, H2 and H3, the teacher ethnicity manipulation. The advice groups and teacher-name groups are balanced in relation to gender, reported income, education, and age (supplementary table A1.2).

Results (Study 2)

Automation Bias

The results of study 2 with regard to the automation bias hypothesis are displayed in tables 3 and 4. Consistent with our study 1 findings, we find small, statistically insignificant differences between the algorithmic-advice and human-advice conditions. Including covariates and restricting the samples to those who successfully completed the manipulation checks does not change the results (supplementary tables A5.1.1 and A5.2.1). Also, there were no major differences across the randomized incongruence versions. In both cases, the differences are in the expected direction, yet they are relatively small and statistically insignificant (supplementary tables A5.3.2 and A5.3.4).

Thus, in both studies 1 and 2, we did not find that participants are more likely to follow the algorithmic advice compared with equivalent human advice. We also pooled the two samples to maximize statistical power (N = 1,509), and the differences, while in the expected direction (11.1% versus 10.5%), were still statistically insignificant (OR = 1.07, z = 0.41, tables 3 and 4).6 We do not find support for automation bias. We further ruled out, via interaction models, that potential differences in demographic and socioeconomic characteristics between our sample and the civil service population might impact our experimental results (supplementary table A5.5.1). We also examined whether participants’ propensity to follow the algorithmic advice is influenced by their familiarity with the use of algorithms by public organizations. 21% of the participants assigned to the algorithmic advice in the two studies reported that they were familiar with algorithm use by public bodies. This variable too had an insignificant effect (supplementary table A5.6.1).

6 For this sample size and baseline probability, we estimate that a small effect-size of OR = 1.45 (a probability change of approximately 5%) is detectable (power = 0.8, p = .05, one-sided test). For post hoc power analyses, see supplementary section A7.

Selective Adherence

We now turn to the results of our second study in relation to our hypotheses of selective adherence. Table 5 reports the regression results of the comparison between the two teacher ethnicity conditions, across the algorithmic and human advice. In Model 1, we regressed our outcome variable on the two manipulations to test their main effects, and thereafter in Model 2 we add their interaction. Table 6 then summarizes the descriptive differences in raw scores.

We find a main effect for the teacher ethnicity manipulation in the expected direction. Respondents are more likely to adhere to advice when it predicts low performance for a decision subject from a negatively stereotyped minority. A Moroccan-Dutch teacher with a low ILE score is 50% more likely not to have their contract renewed, compared to a Dutch teacher with the same score (OR = 1.50, p = .04, one-sided test). Descriptively, the difference in probabilities is 12.3% versus 8.6%. In other words, in line with our H2, we find selective adherence across both types of advice: human and algorithmic. This effect remains positive and significant when controlling for covariates (supplementary table A5.1.2).7 Given established difficulties for survey experimental designs to identify such discriminatory patterns, these findings are important and likely speak to the prevalence of such biases.

7 Consistently, the coefficient is positive in supplementary analyses after filtering out those who did not properly read the task (OR = 1.39, 9.5% versus 7%), yet it is not sufficiently significant, arguably due to the smaller sample size (supplementary table A5.2.2).

This pattern is consistent across our two incongruence conditions (supplementary tables A5.3.3 and A5.3.5), which further speaks to its robustness. We also examined the interaction between the teacher ethnicity manipulation and participants’ age, gender, level of education, and reported income. All these interactions are statistically insignificant (supplementary table A5.5.2).
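As a back-of-the-envelope check (our own arithmetic, not reported in the article), the main-effect odds ratio of roughly 1.50 maps onto the raw percentages reported above as follows:

\[
\text{odds}_{\text{Dutch}} = \frac{0.086}{1 - 0.086} \approx 0.094, \qquad
1.50 \times 0.094 \approx 0.141, \qquad
p_{\text{Moroccan-Dutch}} \approx \frac{0.141}{1 + 0.141} \approx 0.124,
\]

so applying the estimated odds ratio to the 8.6% baseline implies a probability of about 12.4%, consistent with the 12.3% observed for the Moroccan-Dutch teacher condition.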
Table 3. Study 2—Regression Results of Participants’ Adherence to Algorithmic Versus Human-Expert Advice (Automation Bias)

                            (1) Study 2                               (2) Studies 1 and 2, pooled
Predictors                  OR [95% CI]         z         p value     OR [95% CI]         z         p value
Study 2 (ref. = study 1)                                              0.85 [0.61–1.18]    −0.96     .335
Intercept                   0.10 [0.08–0.14]    −13.91    <.001       0.13 [0.09–0.17]    −13.57    <.001
Observations                904                                       1,509
Log-likelihood              −297.144                                  −516.073

Note: Binary outcome: following the ILE score (1 = non-renewal of employment of teacher with lowest ILE score).

Next, we examine the interaction effect, in line with our H3. While we find statistically significant evidence that participants are more inclined to follow the ILE score when the prediction is aligned with stereotypes, our findings do not suggest that this bias is increased when the score is produced by an algorithm compared to human advice, in contrast with our H3. Participants under both conditions were more likely not to renew the contract of the teacher of Moroccan background, and the interaction between the teacher ethnicity manipulation and the algorithmic advice condition is not statistically significant in our interaction model (table 5, model 2). The differences between the Moroccan-Dutch and Dutch teachers in the human-expert advice group were slightly greater compared with the algorithmic group, yet these descriptive differences are not statistically significant, as evidenced by the insignificant interaction term (|z| = 1.45, p = .147),8 and as such could be entirely due to chance (type I error). The coefficient of the interaction further diminishes when we control for covariates (|z| = 1.25, p = .210), and restricting the samples to those who successfully completed the manipulation checks does not change the results (supplementary tables A5.1.2 and A5.2.2).

8 A similar result is produced by a likelihood ratio test comparing the interaction model with a main-effects-only model (χ2(1) = 2.141, p = .143). Comparing the two models via BIC and AIC indicates that the main effects model is more appropriate (Lorah 2020).

On this basis, given the significant main effect and insignificant interaction, our findings indicate that decision-makers are subject to selective adherence when processing decisional aid outputs, regardless of whether these outputs are produced by humans or algorithms. At the same time, and despite our considerable sample size, we acknowledge statistical power limitations in our interaction analysis (H3).9 Still, we can infer with sufficient confidence that a significant increase (as we hypothesized) is unlikely, given that the interaction coefficient is in the opposite direction.

9 See supplementary section A7.

To summarize our main findings in study 2, we find that participants, across both sources of advice (human and algorithmic), tend to follow the advice in a selective manner—when it corresponds to pre-existing biases and stereotypes, which translates into group disparities (in support of our H2). All else constant, a Moroccan-Dutch teacher is significantly more likely to be sanctioned due to a negative evaluation score compared to a Dutch teacher with the same score. Our findings indicate that there are no significant differences between human and algorithmic advice in this respect. Selective, biased processing patterns are found for both types of advice, persisting in algorithm adoption.
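A minimal sketch of the two specifications reported in Table 5 (below) and of the model comparison in footnote 8 (likelihood-ratio test plus BIC/AIC) is given here. Variable names and the data file are assumptions of ours for illustration; this is not the authors’ replication code.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    # Assumed columns: follow_ile (0/1), algorithm (0/1), moroccan_dutch (0/1)
    df = pd.read_csv("study2_dutch_descent.csv")  # hypothetical file, n = 792

    m1 = smf.logit("follow_ile ~ algorithm + moroccan_dutch", data=df).fit()  # Model 1: main effects
    m2 = smf.logit("follow_ile ~ algorithm * moroccan_dutch", data=df).fit()  # Model 2: adds interaction

    print(np.exp(m2.params))  # odds ratios, including the Algorithm x Moroccan-Dutch term

    # Likelihood-ratio test of the interaction (cf. footnote 8: chi2(1) = 2.141, p = .143)
    lr_stat = 2 * (m2.llf - m1.llf)
    p_value = stats.chi2.sf(lr_stat, df=1)
    print(lr_stat, p_value)

    # Information criteria: lower values favour the main-effects-only specification
    print(m1.aic, m2.aic, m1.bic, m2.bic)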
Table 5. Regression Results of Study 2—Participants’ Selective Adherence to Advice (Algorithmic Versus Human-Expert) That Matches Stereotypical Views of Decision Subjects

                                      Model 1 (main effects)                               Model 2 (with interaction)
Predictors                            OR [95% CI]         z         p value                OR [95% CI]         z         p value
Moroccan-Dutch teacher                1.50 [0.95–2.40]    1.73      .083 (.042 one-sided)  2.23 [1.11–4.73]    2.19      .029
Algorithm × Moroccan-Dutch teacher                                                         0.50 [0.19–1.27]    −1.45     .147
Intercept                             0.08 [0.05–0.13]    −11.16    <.001                  0.07 [0.03–0.11]    −9.10     <.001
Observations                          792                                                  792
Log-likelihood                        −261.789                                             −260.719
BIC                                   543.602                                              548.136
AIC                                   529.579                                              529.438

Note: Binary outcome: following the ILE score (1 = non-renewal of employment of teacher with lowest ILE score).

Table 6. Study 2—Descriptive Results (Selective Adherence)

Outcome: Teacher Selected                        All (n = 792)                        Algorithmic Advice (n = 405)         Human-Expert Advice (n = 387)
(Non-renewal of employment)                      Dutch teacher  Moroccan-Dutch        Dutch teacher  Moroccan-Dutch        Dutch teacher  Moroccan-Dutch
Teacher with lowest ILE score                    8.6            12.3                  10.6           11.6                  6.2            12.9
(algorithmic/human-expert)
Teacher with poorest qualitative evaluation      84.1           79.9                  82.4           78.8                  86.0           80.9
Other                                            7.3            7.8                   6.9            9.5                   7.8            6.2

Study 3: Replication with a Sample of Civil Servants

In study 3, we aimed to replicate studies 1 and 2 with a sample of civil servants. For this purpose, we contracted a government-owned personnel research program (Internetspiegel/ICTU) operating an online panel of Dutch civil servants (Flitspanel/“Flashpanel”). Participating civil servants register themselves for participation in the panel, and it is used both by the government itself, to survey current policy issues, and for academic studies.10

10 For additional information, see https://flitspanel.nl.

The online survey was administered and distributed by ICTU to 3,294 civil servants. The fieldwork was conducted between the 2nd and 22nd of February 2021. ICTU sent the invitations to the participants via email, followed by two reminders. A total of 1,345 participants completed the survey (41% response rate).11 The sample includes civil servants working in different policy sectors, at both national and local levels.12 Yet, it should be noted that the sample is not entirely representative of the Dutch public sector. Women are underrepresented in our sample, and the mean age was higher compared with the Dutch public sector. In Appendix 2, we present the demographic characteristics of the sample and control for these characteristics in our robustness analyses, as detailed below.

11 Assuming the baseline probability we found in studies 1 and 2, we estimate this sample size is sufficient for detecting a modest effect-size of OR = 1.52.
12 The survey was sent to participants from the following sectors: central government (national ministries), municipalities, provinces, inter-municipal cooperative arrangements, water boards, defense, and police.

We repeated the 2 × 2 factorial design and the experimental procedure of study 2. The online survey was administered by ICTU (using different software than Qualtrics), and for technical reasons we could not include the additional randomization into high and modest incongruence. Hence, in this study, we assigned all participants to the high incongruence scenario, where the teacher who receives the lowest ILE score is the one with the most positive qualitative evaluation. We did not include an attention check in this survey, as per the panel’s request, and therefore our analytical sample for the automation bias hypothesis is the full sample of 1,345 participants.13 For the analyses of the teacher ethnicity manipulation, same as in study 2, we included only participants of Dutch descent (N = 1,203). This screening is not associated with the assignment to the experimental conditions (supplementary table A2.3), and the randomization groups are balanced in relation to gender, reported income, education, and age (supplementary table A1.3).

13 All participants completed the survey in more than 3 minutes.

It is important to note that the fieldwork of study 3 coincided with the occurrence of significant events in the Netherlands surrounding the “childcare benefits scandal” (toeslagenaffaire in Dutch). The scandal reached its peak during the technical preparations of our survey and shortly before its distribution, with growing public attention in December 2020 following the publication of a parliamentary report on the scandal (titled “Unprecedented Injustice,” Parlementaire Ondervragingscommissie Kinderopvangtoeslag 2020), resulting in the resignation of the Dutch government mid-January 2021.

The scandal involved the reliance by the Dutch tax authorities on an AI algorithm—a “learning algorithm” that used, among other criteria, nationality as a discriminant predictive feature, and served as a decisional aid in flagging high-risk applicants for further scrutiny. The requests flagged by the algorithm were checked manually by tax employees after considering (and/or requesting from applicants) additional information (Autoriteit Persoonsgegevens/Dutch Data Protection Authority 2020). The scandal disproportionately affected citizens of foreign descent, with mostly dual nationality families wrongly accused of benefits fraud: “[T]he tax ministry singled out tens of thousands of families often on the basis of their ethnic background” (Financial Times 2021). Victims of the scandal were required to retroactively repay large sums of money (amounting to as much as tens of thousands of euros), with the financial strain reportedly resulting in acute financial problems, bankruptcies, mental health issues, and broken families (Geiger 2021).
The scandal is a textbook example of the meeting point between algorithmic bias and human decision-makers' biases. While the system itself was biased, using nationality as a predictive feature, the way tax officials went about their work reinforced the system's biases: "Both the automated risk selection and the individual investigations of officials were discriminatory, the data protection authority ruled" (Volkskrant 2020). The scandal is illustrative of the patterns diagnosed in our study 2: algorithmic recommendations aligned with prevalent stereotypes (i.e., indicating a negative assessment for members of an ethnic minority group), with decision-makers likely not to override such recommendations. Victims of the scandal, much like our teacher of Moroccan heritage in study 2, were specifically singled out for targeted scrutiny because of their ethnic origin or double nationality, following an algorithmic prediction ("families of largely Moroccan, Turkish and Dutch Antilles origin were targeted, according to the national data protection authority," Financial Times 2021).
Given that the survey was conducted shortly after the scandal, the results of this study should be interpreted in light of it. Our participants were highly aware of, and sensitive to, the risk of algorithmic bias. Sixty-three percent of the participants reported that they are familiar with the use of algorithms by public organizations, more than half of these (33%) mentioned this case when asked to give an example, and many of them spontaneously expressed their criticism toward it in their qualitative answers. While we anticipated such public reactions to be reflected in participants' answers, we decided not to withhold the fieldwork, as we believe that investigating our research question under these conditions can yield meaningful insights. We return to this point below in our discussion.

Results—Study 3

Automation Bias
The logistic regression results of study 3 with regard to the comparison between algorithmic and human advice (automation bias hypothesis) are displayed in table 7, with descriptive differences presented in table 8. Participants were significantly less likely to follow the ILE score when produced by an algorithm, and more likely to select the teacher with the poorest qualitative evaluation. This confirms the findings of our previous two studies, which similarly did not diagnose automation bias in decision making. In fact, the patterns in study 3 are in the opposite direction to the automation bias expectations. Including covariates and restricting the sample to those who successfully completed the manipulation checks does not change the results (supplementary tables A5.1.1 and A5.2.1).
Also, the interactions between the advice manipulation and participants' gender, age, and higher education are all statistically insignificant (supplementary table A5.5.3). This suggests that a sample more representative of the civil service's demographic and socioeconomic characteristics would have yielded similar results. However, in contrast with studies 1 and 2, the negative effect of the algorithmic advice is linked to respondents' reported familiarity with the use of algorithms by public organizations. When filtering our sample in study 3 to participants who were not familiar with the use of algorithms by public organizations before the survey (n = 498), the likelihood of following the algorithmic advice increases and is not significantly lower compared to the human-expert condition (6.9% versus 8.8%, p = .431, two-sided). We return to this point later in our discussion.

Table 7. Study 3—Regression Results of Participants' Adherence to Algorithmic Versus Human-Expert Advice (Automation Bias)

Model (1): OR [95% CI], z, p value
  Algorithm: 0.54 [0.34–0.83], z = −2.74, p = .006
  Intercept: 0.09 [0.07–0.12], z = −17.32, p < .001
  Observations: 1,345; log-likelihood: −329.022

Note: Binary outcome: following the ILE score (1 = non-renewal of employment of the teacher with the lowest ILE score).

Table 8. Study 3—Descriptive Results (Automation Bias)

Outcome: teacher selected (non-renewal of employment). Cell entries are percentages for the algorithmic advice condition (n = 662) versus the human advice condition (n = 683).
  Teacher with lowest ILE score (algorithmic/human-expert): 4.8 versus 8.6
  Teacher with poorest qualitative evaluation: 89.1 versus 83.9
  Other: 6.0 versus 7.5

Selective Adherence
Tables 9 and 10 present the results of the comparison between the teacher ethnicity conditions, across the algorithm and human advice, which is relevant to our selective adherence hypotheses. We find a negative main effect for the Moroccan-Dutch teacher (table 9). In departure from our study 2, participants in study 3 (civil servants in the aftermath of a major public scandal involving algorithm use and ethnic discrimination) were less likely to fire a Moroccan-Dutch teacher with a low ILE score, compared to a Dutch teacher with the same score. These differences are fairly similar across the two groups, and the interaction is insignificant. These results do not change when adding controls and filtering out those who did not properly read the task (supplementary tables A5.1.3 and A5.2.3).
To summarize the main results, in this study with a sample of civil servants, similar to our previous two studies, we do not find support for automation bias. Our study 3 reveals that participants in the aftermath of the scandal were less likely to be influenced in their decision by the ILE score when generated by an AI algorithm rather than by human experts. Also, and in contrast with our study 2, they were less likely to sanction the Moroccan-Dutch teacher, regardless of the type of advice. These findings arguably speak to the effect of the scandal in shaping bureaucratic responses, as we discuss below.
Table 9. Regression Results of Study 3—Participants' Selective Adherence to Advice (Algorithmic Versus Human-Expert) that Matches Stereotypical View of Decision Subjects

Model (1): OR [95% CI], z, p value
  Moroccan-Dutch teacher: 0.57 [0.35–0.91], z = −2.33, p = .020
  Intercept: 0.13 [0.09–0.17], z = −12.13, p < .001
  Observations: 1,203; log-likelihood: −286.297; BIC: 593.872; AIC: 578.594

Model (2): OR [95% CI], z, p value
  Moroccan-Dutch teacher: 0.64 [0.36–1.13], z = −1.52, p = .129
  Algorithm × Moroccan-Dutch teacher: 0.69 [0.24–1.89], z = −0.71, p = .480
  Intercept: 0.12 [0.08–0.17], z = −11.36, p < .001
  Observations: 1,203; log-likelihood: −286.044; BIC: 600.458; AIC: 580.087

Note: Binary outcome: following the ILE score (1 = non-renewal of employment of the teacher with the lowest ILE score).

Table 10. Study 3—Descriptive Results (Selective Adherence)

Outcome: teacher selected (non-renewal of employment). Cell entries are percentages, reported separately by the ethnicity of the teacher with the lowest ILE score (Dutch / Moroccan-Dutch).

All (n = 1,203)
  Teacher with lowest ILE score (algorithmic/human-expert): 8.3 / 5.0
  Teacher with poorest qualitative evaluation: 83.6 / 89.2
  Other: 8.1 / 5.8

Algorithmic advice (n = 595)
  Teacher with lowest ILE score: 5.9 / 2.7
  Teacher with poorest qualitative evaluation: 86.8 / 92.5
  Other: 7.3 / 4.8

Human-expert advice (n = 608)
  Teacher with lowest ILE score: 10.7 / 7.1
  Teacher with poorest qualitative evaluation: 80.3 / 86.0
  Other: 9.0 / 6.8
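As a concrete illustration of how estimates like those in Tables 7 and 9 can be produced, the snippet below fits the corresponding logistic specifications on a generic data frame and converts coefficients to odds ratios. It is a minimal sketch, not the authors' replication code: the file and column names (study3_responses.csv, followed_ile, algorithm, moroccan_dutch) are hypothetical, and the published analyses add robustness variants with covariates and sample filters.

```python
# Minimal sketch (not the authors' code) of the logistic models behind
# Tables 7 and 9, assuming a data frame with hypothetical 0/1 columns:
#   followed_ile   (1 = non-renewal of the teacher with the lowest ILE score)
#   algorithm      (1 = algorithmic advice, 0 = human-expert advice)
#   moroccan_dutch (1 = lowest ILE score assigned to the Moroccan-Dutch teacher)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study3_responses.csv")  # hypothetical file name

# Automation bias: does the advice source shift adherence to the ILE score?
m_auto = smf.logit("followed_ile ~ algorithm", data=df).fit()

# Selective adherence: ethnicity main effect (model 1) and its interaction
# with the advice source (model 2).
m_main = smf.logit("followed_ile ~ algorithm + moroccan_dutch", data=df).fit()
m_int = smf.logit("followed_ile ~ algorithm * moroccan_dutch", data=df).fit()

for name, model in [("automation", m_auto), ("main", m_main), ("interaction", m_int)]:
    odds_ratios = np.exp(model.params)       # coefficients -> odds ratios
    conf_int = np.exp(model.conf_int())      # 95% CI on the odds-ratio scale
    print(name, "\n", pd.concat([odds_ratios, conf_int], axis=1), "\n")
```

Reading the output on the odds-ratio scale mirrors the tables: values below 1 indicate a lower likelihood of following the ILE score for that condition.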
Discussion and Conclusion
With AI set to fundamentally alter decision making in public organizations, how do human decision-makers actually process algorithmic advice? Drawing on two separate strands of behavioral literature, we have theorized that two biases in particular are of high relevance and in dire need of investigation by public administration scholars: "automation bias" and "selective adherence" to algorithmic advice.
A first bias stemming from automation studies is that decision-makers would automatically default to the algorithm, potentially then also to poor algorithmic advice, ignoring contradictory cues from other sources: automation bias. A second hypothesized bias, which we extrapolated from public administration literature, regards decision-makers' tendency to defer to the algorithm selectively, when algorithmic predictions match pre-existing stereotypes: selective adherence. The use of algorithms could then disproportionately negatively affect stereotyped groups, potentially creating administrative burdens (Herd and Moynihan 2019) and compounding discrimination. Below we discuss and reflect, in turn, on our findings in relation to each of these biases and their implications for public sector decision making in the age of automation.

Automation Bias
Overall, our experimental findings from three separate studies with an aggregated sample of 2,854 participants do not reveal a general pattern of automatic adherence to algorithmic advice. Across the three studies, we consistently did not find evidence for an overall tendency for automation bias.
In none of the three studies were participants more likely to follow the ILE score when produced by an algorithm compared to a human expert: in studies 1 and 2, the differences were small and statistically insignificant, and in study 3, conducted shortly after the childcare benefits scandal, participants were actually less likely to follow the algorithmic advice, indicative of a growing reluctance to trust algorithms in its aftermath. We attribute this latter negative effect primarily to the proximity of the study to the scandal, increasing participants' exposure to the dangers of reliance on AI algorithmic models (as exemplified by the scandal). A considerable number of respondents in study 3 (33%) were aware of the use of algorithms in the benefits scandal, as evidenced by their open answers. Furthermore, as reported above, among those respondents who were not aware of the use of algorithms by public organizations we did not find a lower propensity to follow the algorithmic advice compared to human advice. This suggests the results of study 3 represent a response to the scandal rather than being indicative of an inherent distrust toward algorithmic-sourced advice. At the same time, our study should also serve as further caution as to the adoption of unvetted, under-performing algorithmic systems in public sector decision making, increasingly diagnosed in practice (e.g., Ferguson 2017; Eubanks 2018; O'Neil 2016), and as exemplified in our article by the Dutch childcare benefits scandal. Such failures, once exposed, have consequences, with poorly implemented systems resulting in lower levels of trust in their performance.
These experimental findings are largely consistent with findings from earlier studies outside our discipline on pretrial algorithmic risk scores in the US context. These studies, too, did not reveal an overwhelming pattern of automatic adherence to algorithmic risk scores. An important limitation of these previous studies, however, was that they failed to compare algorithmic advice with equivalent human advice, which we remedy with our current investigation.
Still, how can we reconcile the results of our study (and the studies above) with findings from studies in social psychology on the use of automation in aviation and healthcare (e.g., Lyell and Coiera 2017; Skitka, Mosier, and Burdick 1999, 2000), where patterns of automation bias have been well-documented and recognized? One possible explanation for this discrepancy is a relative skepticism about the performative capacity of AI algorithms, with many participants, based on their self-reporting, still under-exposed to their performative capacities (in studies 1 and 2), or exposed to their negative consequences (in study 3, following the benefits scandal). This is an important difference to earlier studies on automation applied in areas well-accustomed to such devices (aviation, healthcare), characterized by routine use of reliable automation, resulting in high levels of trust in their performance.
These findings also have important implications for the public administration literature on automation and discretion. Introducing algorithmic tools into the decision-making process, we find in our studies, did not supplant the discretion and judgment of human decision-makers, with the vast majority of our respondents overriding the prediction. At the same time, we argue that it is too soon to rule out concerns with undue bureaucratic deference to AI systems. Rather, automatic deference to algorithmic advice could become more prevalent as decision-makers become increasingly exposed to AI algorithms in the practice of public organizations. Repeated experience with high-performing systems (in so far as such systems are high-performing) might increase "user appreciation" of their judgment capacities (decrease skepticism), leading to higher levels of deference over repeated interactions.

Selective Adherence
We also theorized, extrapolating from behavioral literature, that similar to human advice, decision-makers are likely to rely on algorithmic inputs in a biased, selective manner, that is, to assign more weight to the advice and follow it against contradicting evidence when it is aligned with pre-existing stereotypes. Establishing whether selective adherence is present across both types of advice is important in a context where AI algorithms are said to have the potential to do away with human decisional biases. We further theorized that selective adherence biases could be exacerbated by algorithms, by virtue of their unique nature.
In study 2, which consisted of a sample of Dutch citizens in a context where citizens can serve as actual decision-makers, we found evidence supporting selective adherence patterns across both human and algorithmic advice conditions. Namely, when the low prediction score is assigned to a teacher from a negatively stereotyped ethnic minority, participants were significantly more likely to rely on it in their decisions and less likely to override it. These selective adherence patterns are present across both types of advice (human and algorithmic), as evidenced by the positive and significant main effect for the teacher ethnicity manipulation. In both conditions, participants were more likely not to renew the contract of the ethnic minority teacher.
Importantly, we found that this bias is not more pronounced for algorithmic advice when compared to human advice, as the interaction between the two manipulations in our factorial design is insignificant. Taken jointly, these two sets of findings indicate that, while not exacerbated by algorithms, selective adherence patterns occur across both sources of advice rather than being restricted to human advice. The replacement of human advice with algorithmic advice does not make selective adherence disappear. These findings are also in line with results from other studies on pretrial algorithmic risk scores by law and computer science scholars, respectively, which report patterns consistent with biased adherence to algorithmic advice (Green and Chen 2019a, 2019b; Stevenson 2018). The findings of our study, and others, carry important implications, as they indicate that decision-making biases endure in AI algorithm adoption as decisional aides in the public sector, contrary to the promise that propelled their adoption as a means to do away with such biases. Similar to human-sourced advice, a tendency to follow algorithmic advice, too, rather than being generalized, is instead selective and more likely when this advice matches pre-existing stereotypical beliefs.
Our sample of civil servants in study 3 did not yield similar results. The participants in study 3, conducted in the aftermath of a scandal involving algorithm use and discrimination in bureaucratic decision making, were less likely not to renew the contract of a teacher from a negatively stereotyped minority with a low score compared to a Dutch teacher with the same score.
How can we explain the discrepancy between the studies on this aspect? Several explanations could account for these findings. First, one could speculate that these differences might stem from distinctive characteristics of civil servants compared with lay citizens, namely the ability of the former to overcome social biases and prejudice as a result of their professional training, expertise, or background. However, a vast body of literature in social science provides us with theory and empirical evidence for the existence of discriminatory decision making that is also rooted in subtle and unconscious cognitive mechanisms (e.g., Schram et al. 2009). These patterns have been theorized and are well-documented in bureaucratic contexts, also among highly educated, professional decision-makers (e.g., Andersen and Guul 2019; Assouline, Gilad, and Bloom 2022; Giulietti, Tonin, and Vlassopoulos 2019).
A methodological explanation is also plausible for the patterns encountered in study 3: namely, that civil servant participants' responses could reflect social desirability bias, an (unconscious) need to answer questions in ways that demonstrate that they do not discriminate. This is a common threat to studies of discrimination more broadly, and indeed several survey experimental studies have "failed" to find racial discrimination in their data, arguably for this reason (e.g., Baekgaard and George 2018; Wulff and Villadsen 2020). This threat is plausibly more likely for the sample of professional civil servants surveyed in study 3, compared with the sample of lay citizens in study 2 (even though both groups were explicitly guaranteed anonymity). Furthermore, this threat is potentially exacerbated by the fact that civil servant participants were invited by a panel linked to the Dutch government.
The more plausible explanation, in our reading, for the fact that we did not encounter patterns of selective adherence in study 3 (as we did in study 2) is that participants' responses were an authentic reaction to the recent childcare benefits scandal and the political, media, and public scrutiny that followed from it. The scandal represented a case of systemic bureaucratic discrimination against citizens with a migration background, an empirical case that incidentally closely matched our own hypothetical scenario, with many civil servant respondents spontaneously indicating familiarity with the scandal in their open answers. It is likely that the scandal increased civil servants' awareness of racial profiling and discrimination toward ethnic minorities in the Netherlands, explicitly also in relation to algorithm use in bureaucratic decision-making. Indeed, social psychology studies have theorized that racial biases can be attenuated when people are highly motivated to do so (Devine et al. 2002). This would suggest the scandal had a learning effect, although our study does not allow us to assess to what extent these effects are long-lived.
It is important to note that the scandal itself is an illustrative example of the theorized patterns of decision-makers' adherence to algorithmic advice, and of how it can result in discrimination in decision making. The scandal speaks acutely to the serious real-life repercussions that can arise when human bias meets algorithmic bias in bureaucratic decision making. Taken together with our empirical findings in study 2, we believe that there is evidence for selective adherence to algorithmic advice that calls for additional and pressing investigation of this issue by public administration scholars.
A key justification put forward for algorithm adoption in high-stakes public sector areas such as criminal justice or policing, and for "tolerating" shortcomings of such systems (e.g., pertaining to their opaqueness and associated concerns with transparency and accountability), has been their perceived superior performance and said "objectivity" as data-driven technologies, as a way to overcome human biases and limitations. While such claims have been deflated when it comes to algorithms' own learning and functioning (e.g., algorithms replicating and propagating systematic biases learned from training data is a well-documented problem that can arise in algorithm deployment), it is important to keep in mind that bias can also crop up at another level: in the human–AI interaction, in how decision-makers process, interpret, and act upon algorithmic outputs. Our findings raise further questions about the added value of the reliance on algorithmic advice as a mechanism to avoid bias and speak to potential negative effects of automation of the administrative state for already vulnerable and disadvantaged citizens (Eubanks 2018; Ranchordas 2022). Even assuming that the algorithmic outputs themselves could be bias-free, we find some evidence that human decision-makers tend to rely on such outputs selectively, that is, when their predictions "suit" pre-existing stereotypes.
Keeping humans-in-the-loop (human intervention) is an important safeguard against algorithmic failures and is even legally mandated to that end in forward-looking regulatory frameworks such as the EU GDPR. While our findings as to a lack of automatic deference are encouraging in this context, the likelihood that decision-makers adhere to algorithmic advice (rather than resist it) precisely when predictions are aligned with group stereotypes and disadvantage minority groups is disconcerting, and speaks to potential blind spots in our ability to exercise meaningful oversight. Such concerns can become especially problematic, as we saw, in mixed algorithmic decision making when human bias meets algorithmic bias. At the same time, an encouraging, tentative take-away that emerges from our investigation is that high-visibility, public exposure of such biases (as in the aftermath of the benefits scandal) can have learning effects through rendering civil servants more conscious of and alert to such risks, leading to their potential attenuation in decision making, at least in the short term.
Our study takes a first step to investigate how public decision-makers process AI algorithmic advice from decisional support systems. As AI tools proliferate in the public sector, this comes with significant possible implications for the nature of administrative decision making, rendering this issue increasingly salient for our discipline. Future studies may investigate these aspects in scenarios pertaining to different sectors and across multiple national jurisdictions. Importantly, and following our results, follow-up work could further test the role of decision-makers' learning and repeat exposure through a design that allows for repeat interactions with the algorithm, so as to assess to what extent participants' trust in the algorithm changes over time, potentially leading to patterns of enhanced deference. Investigating the cognitive mechanisms underpinning algorithmic decision making in an administrative context will be of crucial theoretical and empirical significance, part and parcel of tackling broader, fundamental questions as to the impact of artificial intelligence for bureaucratic expertise and discretion, the nature of public authority, and public accountability in the age of automation.

Supplementary Material
Supplementary data are available at the Journal of Public Administration Research and Theory online.

Funding
This article is part of a project that has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 716439).

Data Availability
The data underlying this article are available in Harvard Dataverse, at https://doi.org/10.7910/DVN/TQYJNF.
Appendix A
Randomization Groups
[Figure showing the randomization groups across experimental conditions; not reproduced here.]
Exclusions:
* Participants who failed the attention check or completed the questionnaire in less than 3 min.
** Participants not of Dutch descent (excluded from the analysis of the selective adherence hypotheses, H2 and H3).

Appendix B
Sample Characteristics
[Table of demographic characteristics for the samples of study 1, study 2, and study 3, compared with the Dutch civil service; values not reproduced here.]
Note: Valid percentages are reported. Dutch civil service data are from 2018, covering 412,999 civil servants from the national and local civil service, including defense and police.
Source: https://kennisopenbaarbestuur.nl/.
References
Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias. ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed March 2022).
Assouline, Michaela, Sharon Gilad, and Pazit Ben-Nun Bloom. 2022. Discrimination of minority welfare claimants in the real world: The effect of implicit prejudice. Journal of Public Administration Research and Theory 32 (1): 75–96.
Autoriteit Persoonsgegevens/Dutch Data Protection Authority. 2020. Belastingdienst/Toeslagen: De verwerking van de nationaliteit van aanvragers van kinderopvangtoeslag. https://autoriteitpersoonsgegevens.nl/sites/default/files/atoms/files/onderzoek_belastingdienst_kinderopvangtoeslag.pdf (accessed March 2022).
Baekgaard, Martin, and Bert George. 2018. Equal access to the top? Representative bureaucracy and politicians' recruitment preferences for top administrative staff. Journal of Public Administration Research and Theory 28 (4): 535–50.
Baekgaard, Martin, Julian Christensen, Casper Mondrup Dahlmann, Asbjørn Mathiasen, and Niels Bjørn Grund Petersen. 2019. The role of evidence in politics: Motivated reasoning and persuasion among politicians. British Journal of Political Science 49 (3): 1117–40.
Baekgaard, Martin, and Søren Serritzlew. 2016. Interpreting performance information: Motivated reasoning or unbiased comprehension. Public Administration Review 76 (1): 73–82.
Benjamin, Ruha. 2019. Race after technology: Abolitionist tools for the New Jim Code. Polity Press.
Bovens, Mark, and Stavros Zouridis. 2002. From street-level to system-level bureaucracies: How information and communication technology is transforming administrative discretion and constitutional control. Public Administration Review 62 (2): 174–84.
Buffat, Aurélien. 2015. Street-level bureaucracy and e-government. Public Management Review 17 (1): 149–61.
Bullock, Justin B. 2019. Artificial intelligence, discretion, and bureaucracy. The American Review of Public Administration 49 (7): 751–61.
Buolamwini, Joy, and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research 81:1–15.
Busch, Peter, and Helle Henriksen. 2018. Digital discretion: A systematic literature review of ICT and street-level discretion. Information Polity 23 (1): 3–28.
Busuioc, Madalina. 2021. Accountable artificial intelligence: Holding algorithms to account. Public Administration Review 81 (5): 825–36.
Calo, Ryan, and Danielle Keats Citron. 2021. The automated administrative state: A crisis of legitimacy. Emory Law Journal 70:797–846.
Christensen, Julian. 2018. Biased, not blind: An experimental test of self-serving biases in service users' evaluations of performance information. Public Administration 96 (3): 468–80.
Christensen, Julian, Casper Mondrup Dahlmann, Asbjørn Hovgaard Mathiasen, Donald P. Moynihan, and Niels Bjørn Grund Petersen. 2018. How do elected officials evaluate performance? Goal preferences, governance preferences, and the process of goal reprioritization. Journal of Public Administration Research and Theory 28 (2): 197–211.
Cobbe, Jennifer. 2019. Administrative law and the machines of government: Judicial review of automated public-sector decision-making. Legal Studies 39 (4): 636–55.
Cummings, Mary L. 2006. Automation and accountability in decision support system interface design. The Journal of Technology Studies 32 (1): 23–31.
de Boer, Noortje, and Nadine Raaphorst. 2021. Automation and discretion: Explaining the effect of automation on how street-level bureaucrats enforce. Public Management Review.
Devine, Patricia G., E. Ashby Plant, David M. Amodio, Eddie Harmon-Jones, and Stephanie L. Vance. 2002. The regulation of explicit and implicit race bias: The role of motivations to respond without prejudice. Journal of Personality and Social Psychology 82 (5): 835–48.
Diakopoulos, Nicholas. 2014. Algorithmic accountability reporting: On the investigation of black boxes. New York: Columbia Univ., Tow Center for Digital Journalism.
Edwards, Lilian, and Michael Veale. 2017. Slave to the algorithm? Why a 'right to an explanation' is probably not the remedy you are looking for. Duke Law & Technology Review 18:18–84.
Engstrom, David F., Daniel E. Ho, Catherine M. Sharkey, and Mariano-Florentino Cuéllar. 2020. Government by algorithm: Artificial intelligence in federal administrative agencies. Report Submitted to the Administrative Conference of the United States, February 19.
Eubanks, Virginia. 2018. Automating inequality: How high-tech tools profile, police, and punish the poor. New York: St. Martin's Press.
Ferguson, Andrew Guthrie. 2017. The rise of big data policing: Surveillance, race, and the future of law enforcement. New York: NYU Press.
Financial Times. 2021. Scandals tarnish Dutch reputation for clean government. June 24. https://www.ft.com/content/9996a65e-0996-4a08-aa65-041be685deae?shareType=nongift (accessed March 2022).
Geiger, Gabriel. 2021. How a discriminatory algorithm wrongly accused thousands of families of fraud. Vice, March 1. https://www.vice.com/en/article/jgq35d/how-a-discriminatory-algorithm-wrongly-accused-thousands-of-families-of-fraud
Giest, Sarah, and Stephan Grimmelikhuijsen. 2020. Introduction to special issue algorithmic transparency in government: Towards a multi-level perspective. Information Polity 25 (4): 409–17.
Giulietti, Corrado, Mirco Tonin, and Michael Vlassopoulos. 2019. Racial discrimination in local public services: A field experiment in the United States. Journal of the European Economic Association 17 (1): 165–204.
Goddard, Kate, Abdul Roudsari, and Jeremy C. Wyatt. 2012. Automation bias: A systematic review of frequency, effect mediators, and mitigators. Journal of the American Medical Informatics Association 19 (1): 121–7.
Green, Ben, and Yiling Chen. 2019a. Disparate interactions: An algorithm-in-the-loop analysis of fairness in risk assessments. In FAT* '19: Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, January 29–31.
———. 2019b. The principles and limits of algorithm-in-the-loop decision making. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW): 1–24.
Grgić-Hlača, Nina, Christoph Engel, and Krishna P. Gummadi. 2019. Human decision making with machine assistance: An experiment on bailing and jailing. Proceedings of the ACM on Human-Computer Interaction 3:1–25.
Herd, Pamela, and Donald P. Moynihan. 2019. Administrative burden: Policymaking by other means. New York: Russell Sage Foundation.
Israni, Ellora Thadaney. 2017. When an algorithm helps send you to prison. New York Times, October 26. https://www.nytimes.com/2017/10/26/opinion/algorithm-compas-sentencing-bias.html
James, Oliver, and Gregg G. Van Ryzin. 2017. Motivated reasoning about public performance: An experimental study of how citizens judge the Affordable Care Act. Journal of Public Administration Research and Theory 27 (1): 197–209.
Jilke, Sebastian. 2017. Citizen satisfaction under changing political leadership: The role of partisan motivated reasoning. Governance 31 (3): 515–33.
Jilke, Sebastian, and Lars Tummers. 2018. Which clients are deserving of help? A theoretical model and experimental test. Journal of Public Administration Research and Theory 28 (2): 226–38.
Jilke, Sebastian, and Martin Baekgaard. 2020. The political psychology of citizen satisfaction: Does functional responsibility matter? Journal of Public Administration Research and Theory 30 (1): 130–43.
Kamans, Elanor, Ernestine H. Gordijn, Hilbrand Oldenhuis, and Sabine Otten. 2009. What I think you see is what you get: Influence of prejudice on assimilation to negative meta-stereotypes among Dutch Moroccan teenagers. European Journal of Social Psychology 39 (5): 842–51.
Kim, Soonhee, Kim Normann Andersen, and Jungwoo Lee. 2021. Platform government in the era of smart technology. Public Administration Review. doi:10.1111/puar.13422
Logg, Jennifer, Julia A. Minson, and Don A. Moore. 2019. Algorithm appreciation: People prefer algorithmic to human judgment. Organizational Behavior and Human Decision Processes 151:90–103.
Lorah, Julie A. 2020. Interpretation of main effects in the presence of non-significant interaction effects. The Quantitative Methods for Psychology 16 (1): 33–45.
Lyell, David, and Enrico Coiera. 2017. Automation bias and verification complexity: A systematic review. Journal of the American Medical Informatics Association 24 (2): 423–31.
Medium—Open Letter Concerned AI Researchers. 2019. On recent research auditing commercial facial analysis technology. Medium, March 26. https://medium.com/@bu64dcjrytwitb8/on-recent-research-auditing-commercial-facial-analysis-technology-19148bda1832
Meijer, Albert, Lukas Lorenz, and Martijn Wessels. 2021. Algorithmization of bureaucratic organizations: Using a practice lens to study how context shapes predictive policing systems. Public Administration Review 81 (5): 837–46.
Milner, Greg. 2016. Death by GPS: Are Satnavs changing our brains? The Guardian, June 25. https://www.theguardian.com/technology/2016/jun/25/gps-horror-stories-driving-satnav-greg-milner
Mosier, Kathleen, Melisa Dunbar, Lori McDonnell, Linda Skitka, Mark Burdick, and Bonnie Rosenblatt. 1998. Automation bias and errors: Are teams better than individuals? Proceedings of the Human Factors and Ergonomics Society Annual Meeting 42 (3): 201–5.
Mosier, Kathleen L., Linda J. Skitka, Melisa Dunbar, and Lori McDonnell. 2001. Aircrews and automation bias: The advantages of teamwork? The International Journal of Aviation Psychology 11 (1): 1–14.
National Transportation Safety Board. 2017. Collision between a car operating with automated vehicle control systems and a tractor-semitrailer truck near Williston, Florida, May 7, 2016. National Transportation Safety Board, September 12. https://www.ntsb.gov/investigations/Pages/HWY16FH018.aspx (accessed March 2022).
OECD. 2014a. Education policy outlook: Netherlands. http://www.oecd.org/education/EDUCATION%20POLICY%20OUTLOOK_NETHERLANDS_EN%20.pdf (accessed March 2022).
OECD. 2014b. OECD reviews of evaluation and assessment in education: Netherlands 2014. https://www.oecd.org/education/school/OECD-Evaluation-Assessment-Review-Netherlands.pdf (accessed March 2022).
O'Neil, Cathy. 2016. Weapons of math destruction: How big data increases inequality and threatens democracy. Crown Books.
Parasuraman, Raja, and Victor Riley. 1997. Humans and automation: Use, misuse, disuse, abuse. Human Factors 39 (2): 230–53.
Parlementaire Ondervragingscommissie Kinderopvangtoeslag. 2020. Ongekend onrecht. https://www.tweedekamer.nl/sites/default/files/atoms/files/20201217_eindverslag_parlementaire_ondervragingscommissie_kinderopvangtoeslag.pdf (accessed March 2022).
Pedersen, Mogens Jin, Justin M. Stritch, and Frederik Thuesen. 2018. Punishment on the frontlines of public service delivery: Client ethnicity and caseworker sanctioning decisions in a Scandinavian welfare state. Journal of Public Administration Research and Theory 28 (3): 339–54.
Peeters, Rik. 2020. The agency of algorithms: Understanding human–algorithm interaction in administrative decision-making. Information Polity 25 (4): 507–22.
Ranchordas, Sofia. 2022. Empathy in the digital administrative state. Duke Law Journal, forthcoming.
Richardson, Rashida, Jason Schultz, and Kate Crawford. 2019. Dirty data, bad predictions: How civil rights violations impact police data, predictive policing systems, and justice. New York University Law Review Online 94:192–233.
Rudin, Cynthia. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5): 206–15.
Schiff, Daniel S., Kaylyn Jackson Schiff, and Patrick Pierson. 2021. Assessing public value failure in government adoption of artificial intelligence. Public Administration. doi:10.1111/padm.12742
Schram, Sanford F., Joe Soss, Richard C. Fording, and Linda Houser. 2009. Deciding to discipline: Race, choice, and punishment at the frontlines of welfare reform. American Sociological Review 74 (3): 398–422.
Skitka, Linda J., Kathleen L. Mosier, and Mark D. Burdick. 1999. Does automation bias decision-making? International Journal of Human-Computer Studies 51 (5): 991–1006.
———. 2000. Accountability and automation bias. International Journal of Human-Computer Studies 52 (4): 701–17.
Skitka, Linda J., Kathleen L. Mosier, Mark D. Burdick, and Bonnie Rosenblatt. 2000. Automation bias and errors: Are crews better than individuals? The International Journal of Aviation Psychology 10 (1): 85–97.
Stevenson, Megan. 2018. Assessing risk assessment in action. Minnesota Law Review 103:303–84.
Turque, Bill. 2012. 'Creative...motivating' and fired. Washington Post, March 6. https://www.washingtonpost.com/local/education/creative--motivating-and-fired/2012/02/04/gIQAwzZpvR_story.html
Veale, Michael, and Irina Brass. 2019. Administration by algorithm? Public management meets public sector machine learning. In Algorithmic regulation, eds. Karen Yeung and Martin Lodge. Oxford: Oxford University Press.
Vogl, Thomas M., Cathrine Seidelin, Bharath Ganesh, and Jonathan Bright. 2020. Smart technology and the emergence of algorithmic bureaucracy: Artificial intelligence in UK local authorities. Public Administration Review 80 (6): 946–61.
Volkskrant. 2020. Belastingdienst schuldig aan structurele discriminatie van mensen die toeslagen ontvingen. Volkskrant, July 21. https://www.volkskrant.nl/nieuws-achtergrond/belastingdienst-schuldig-aan-structurele-discriminatie-van-mensen-die-toeslagen-ontvingen~baebefdb/
Wulff, Jesper N., and Anders R. Villadsen. 2020. Are survey experiments as valid as field experiments in management research? An empirical comparison using the case of ethnic employment discrimination. European Management Review 17 (1): 347–56.
Yeung, Karen, and Martin Lodge. 2019. Algorithmic regulation: An introduction. In Algorithmic regulation, eds. Karen Yeung and Martin Lodge, 1–18. Oxford Univ. Press.
Young, Matthew M., Justin B. Bullock, and Jesse D. Lecy. 2019. Artificial discretion as a tool of governance: A framework for understanding the impact of artificial intelligence on public administration. Perspectives on Public Management and Governance 2 (4): 301–13.
Young, Matthew M., Johannes Himmelreich, Justin B. Bullock, and Kyoung-Cheol Kim. 2021. Artificial intelligence and administrative evil. Perspectives on Public Management and Governance 4 (3): 244–58.
Zerilli, John, Alistair Knott, James Maclaurin, and Colin Gavaghan. 2019. Algorithmic decision-making and the control problem. Minds & Machines 29:555–78.
Zouridis, Stavros, Marlies van Eck, and Mark Bovens. 2020. Automated discretion. In Discretion and the quest for controlled freedom, eds. Tony Evans and Peter Hupe, 313–29. London: Palgrave Macmillan.