Upstream Mitigation Is Not All You Need: Testing The Bias Transfer Hypothesis in Pre-Trained Language Models
1 Introduction
Large language models (LLMs) and other massively pre-trained "foundation" models are powerful tools for task-specific machine learning (Bommasani et al., 2021). Models pre-trained by well-resourced organizations can easily adapt to a wide variety of downstream tasks in a process called fine-tuning. But massive pre-training datasets and increasingly homogeneous model design come with well-known, immediate social risks beyond the financial and environmental costs (Strubell et al., 2019; Bender et al., 2021).

Figure 1: Full pre-training to fine-tuning pipeline, with experimental interventions (green hexagons).

Transformer-based LLMs like BERT and GPT-3 contain quantifiable intrinsic social biases encoded in their embedding spaces (Goldfarb-Tarrant et al., 2021). These intrinsic biases are typically associated with representational harms, including stereotyping and denigration (Barocas et al., 2017; Blodgett et al., 2020; Bender et al., 2021). Separately, many studies document the extrinsic harms of the downstream (fine-tuned & task-specific) applications of fine-tuned LLMs, including discriminatory medical diagnoses (Zhang et al., 2020), over-reliance on binary gender for coreference resolution (Cao and Daumé, 2021), the reinforcement of traditional gender roles in part-of-speech tagging (Garimella et al., 2019), toxic text generation (Gehman et al., 2020), and censorship of inclusive language and AAVE (Blodgett and O'Connor, 2017; Blodgett et al., 2018; Park et al., 2018; Sap et al., 2019).
Despite these risks, no research has investigated the extent to which downstream systems inherit social biases from pre-trained models.[1] Many studies warn that increasing intrinsic bias upstream may lead to an increased risk of downstream harms (Bolukbasi et al., 2016; Caliskan et al., 2017). This hypothesis, which we call the Bias Transfer Hypothesis, holds that stereotypes and other biased associations in a pre-trained model are transferred to post-fine-tuning downstream tasks, where they can cause further, task-specific harms. A weaker version of this hypothesis holds that downstream harms are at least mostly determined by the pre-trained model (Bommasani et al., 2021).

[1] We use the term "bias" to refer to statistical associations that result in representational or allocational harms to historically marginalized social groups (Blodgett et al., 2020).

In the pre-training paradigm, the extent to which the bias transfer hypothesis holds will determine the most effective strategies for responsible design. In the cases we study, reducing upstream bias does little to change downstream behavior. Still, there is hope: instead, developers can carefully curate the fine-tuning dataset, checking for harms in context.

We test the bias transfer hypothesis on two classification tasks with previously demonstrated performance disparities: occupation classification of online biographies (De-Arteaga et al., 2019) and toxicity classification of Wikipedia Talks comments (Dixon et al., 2018). We investigate whether reducing or exacerbating intrinsic biases encoded by RoBERTa (Liu et al., 2019) decreases or increases the severity of downstream, extrinsic harms (Figure 1). We find that the bias transfer hypothesis describes only part of the interplay between pre-training biases and harms after fine-tuning:

• Systematically manipulating upstream bias has little impact on downstream disparity, especially for the most-harmed groups.

• With a regression analysis, we find that most variation in downstream bias can be explained by bias in the fine-tuning dataset (proxied by co-occurrence rates).

• Altering associations in the fine-tuning dataset can sometimes change downstream behavior, but only when the model is not pre-trained.

Without absolving LLMs or their owners of representational harms intrinsic to pre-trained models, our results encourage practitioners and application stakeholders to focus more on dataset quality and context-specific harm identification and reduction.

2 Related Work

Little prior work directly tests the bias transfer hypothesis. The closest example of this phenomenon is the "blood diamond" effect (Birhane and Prabhu, 2021), in which stereotyping and denigration in the pre-training corpora pervade subsequently generated images and language even before the fine-tuning stage (Steed and Caliskan, 2021). Still, it is unclear to what extent undesirable values encoded in pre-training datasets or benchmarks—such as Wikipedia or ImageNet—induce task-specific harms after fine-tuning (Barocas et al., 2019).

Some work explores the consistency of intrinsic and extrinsic bias metrics: Goldfarb-Tarrant et al. (2021) find that intrinsic and extrinsic metrics are not reliably correlated for static embeddings like word2vec. We focus instead on state-of-the-art transformer-based LLMs—the subject of intense ethical debate (Bender et al., 2021; Bommasani et al., 2021)—which construct contextual, rather than static, embeddings. Contextual embeddings—token encodings that are conditional on other, nearby tokens—pose an ongoing challenge for intrinsic bias measurement (May et al., 2019; Kurita et al., 2019; Guo and Caliskan, 2021) and bias mitigation (Liang et al., 2020). We find that intrinsic and extrinsic metrics are correlated for the typical LLM—but that the correlation is mostly explained by biases in the fine-tuning dataset.

Other research tests the possibility that upstream mitigation could universally prevent downstream harm. Jin et al. (2021) show that an intermediate, bias-mitigating fine-tuning step can help reduce bias in later tasks. Likewise, Solaiman and Dennison (2021) propose fine-tuning on carefully curated "values-targeted" datasets to reduce toxic GPT-3 behavior. Our results tend to corroborate these methods: we find that the fine-tuning process can to some extent overwrite the biases present in the original pre-trained model. A recent post-hoc mitigation technique, on the other hand, debiases contextual embeddings before fine-tuning (Liang et al., 2020). Our results imply that while this type of debiasing may help with representational harms upstream, it is less successful for reducing harms downstream.

3 Methods

To empirically evaluate the bias transfer hypothesis, we examine the relationship between upstream bias and downstream bias for two tasks. We track how this relationship changes under various controlled interventions on the model weights or the fine-tuning dataset.
3.1 Model

For each task, we fine-tune RoBERTa[2] (Liu et al., 2019). We split the fine-tuning dataset into train (80%), evaluation (10%), and test (10%) partitions. To fine-tune, we attach a sequence classification head and train for 3 epochs.[3]

[2] roberta-base from HuggingFace (Wolf et al., 2020).
[3] See Appendix D for more details. Epochs and other parameters were chosen to match prior work (Jin et al., 2021).
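For readers who want to reproduce this setup, the sketch below shows one way to fine-tune roberta-base with a sequence classification head using the HuggingFace Trainer. It is a minimal illustration under assumed defaults (batch size, column names, sequence length), not the authors' training script.

    # Minimal fine-tuning sketch with HuggingFace Transformers (assumed setup).
    # Columns "text" and "label" in the input DataFrames are placeholder names.
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import Dataset

    def fine_tune(train_df, eval_df, num_labels, output_dir="roberta-finetuned"):
        tok = AutoTokenizer.from_pretrained("roberta-base")
        # Sequence classification head attached on top of roberta-base.
        model = AutoModelForSequenceClassification.from_pretrained(
            "roberta-base", num_labels=num_labels)

        def encode(batch):
            return tok(batch["text"], truncation=True, padding="max_length",
                       max_length=128)

        train = Dataset.from_pandas(train_df).map(encode, batched=True)
        dev = Dataset.from_pandas(eval_df).map(encode, batched=True)

        args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=3,              # 3 epochs, following Jin et al. (2021)
            per_device_train_batch_size=16,  # assumed; not reported above
            evaluation_strategy="epoch",
        )
        Trainer(model=model, args=args, train_dataset=train,
                eval_dataset=dev).train()
        return model, tok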
3.2 Occupation Classification

The goal of occupation classification is to predict someone's occupation from their online biography. We fine-tune with the BIOS dataset scraped from Common Crawl (De-Arteaga et al., 2019), which includes over 400,000 online biographies belonging to 28 common occupations. Since self-identified gender was not collected, we will refer instead to the pronouns used in each biography (each biography uses either he/him or she/her pronouns). Following De-Arteaga et al. (2019), we use the "scrubbed" version of the dataset—in which all the identifying pronouns have been removed—to measure just the effects of proxy words (e.g. "mother") and avoid overfitting on pronouns directly.

Downstream Bias.— Biographies with she/her pronouns are less frequently classified as belonging to certain traditionally male-dominated professions—such as "surgeon"—which could result in lower recruitment or callback rates for job candidates if the classifier is used by an employer. The empirical true positive rate (TPR) estimates the likelihood that the classifier correctly identifies a person's occupation from their biography.

We follow previous work (De-Arteaga et al., 2019) in measuring downstream bias via the empirical true positive rate (TPR) gap between biographies using each set of pronouns. First, define

    TPR_{y,g} = P[Ŷ = y | G = g, Y = y],

where g is a set of pronouns and y is an occupation. Y and Ŷ represent the true and predicted occupation, respectively. Then the TPR bias (TPB) is

    TPB_y = TPR_{y,she/her} / TPR_{y,he/him}.   (1)

For example, the classifier correctly predicts "surgeon" for he/him biographies much more often than for she/her biographies, so the TPR ratio for the "surgeon" occupation is low (see Appendix A).
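As an illustration, Eq. 1 can be computed directly from held-out predictions. The sketch below is our own; the column names ("occupation", "pronoun", "pred") are assumptions.

    # Sketch: per-occupation TPR gap (Eq. 1) from test-set predictions.
    import pandas as pd

    def tpr(df, occupation, pronoun):
        # P[Y_hat = y | G = g, Y = y]: share of this group's true-y biographies
        # that the classifier labels correctly.
        group = df[(df["occupation"] == occupation) & (df["pronoun"] == pronoun)]
        return (group["pred"] == occupation).mean()

    def tpr_bias(df):
        # Eq. 1: TPB_y = TPR_{y,she/her} / TPR_{y,he/him}, per occupation.
        rows = []
        for y in sorted(df["occupation"].unique()):
            rows.append({"occupation": y,
                         "TPB": tpr(df, y, "she/her") / tpr(df, y, "he/him")})
        return pd.DataFrame(rows)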
Upstream Bias.— We adapt Kurita et al. (2019)'s pronoun ranking test to the 28 occupations in the BIOS dataset. Kurita et al. (2019) measure the encoded association of he/him and she/her pronouns by the difference in log probability scores between pronouns appearing in templates of the form {pronoun} is a(n) {occupation}. We augment this approach with 5 career-related templates proposed by Bartl et al. (2020) (see Appendix A). Formally, given a template sequence x_{y,g} filled in with occupation y and pronoun g, we compute p_{y,g} = P(x_{y,g}). As a baseline, we also mask the occupation and compute the prior probability π_{y,g} = P(x^π_{y,g}). The pronoun ranking bias (PRB) for this template is the difference in log probabilities:

    PRB_y = log(p_{y,she/her} / π_{y,she/her}) − log(p_{y,he/him} / π_{y,he/him}).   (2)
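A minimal sketch of this pronoun ranking measurement, assuming roberta-base and one illustrative template (the full template set is in Appendix A); tokenization of the sentence-initial pronoun slot is simplified here.

    # Sketch: pronoun ranking bias (Eq. 2) in the style of Kurita et al. (2019).
    # The pronoun token is scored with the occupation visible (p) and with the
    # occupation masked out (prior pi).
    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("roberta-base")
    mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

    def pronoun_prob(template, pronoun, occupation=None):
        # P(pronoun at the masked slot); if occupation is None, it is masked too.
        occ = occupation if occupation is not None else tok.mask_token
        text = template.format(pronoun=tok.mask_token, occupation=occ)
        enc = tok(text, return_tensors="pt")
        with torch.no_grad():
            logits = mlm(**enc).logits
        # The pronoun slot is the first <mask> in the template.
        mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
        probs = logits[0, mask_pos].softmax(-1)
        pronoun_id = tok(" " + pronoun, add_special_tokens=False)["input_ids"][0]
        return probs[pronoun_id].item()

    def prb(occupation, template="{pronoun} is a {occupation}."):
        scores = {}
        for g in ("she", "he"):
            p = pronoun_prob(template, g, occupation)     # p_{y,g}
            prior = pronoun_prob(template, g, None)       # pi_{y,g}
            scores[g] = math.log(p / prior)
        return scores["she"] - scores["he"]               # Eq. 2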
3.3 Toxicity Classification

For toxicity classification, we use the WIKI dataset, which consists of just under 130,000 comments from the online forum Wikipedia Talks Pages (Dixon et al., 2018). The goal of the task is to predict whether each comment is toxic. Each comment has been labeled as toxic or non-toxic by a human annotator, where a toxic comment is a "rude, disrespectful, or unreasonable comment that is likely to make you leave the discussion" (Dixon et al., 2018). Following Dixon et al. (2018), we focus on 50 terms referring to people of certain genders, races, ethnicities, sexualities, and religions.

Downstream (Extrinsic) Bias.— Mentions of certain identity groups—such as "queer"—are more likely to be flagged for toxic content, which could result in certain communities being systematically censored or left unprotected if an online platform uses the classifier. The classifier's empirical false positive rate (FPR) estimates its likelihood to falsely flag a non-toxic comment as toxic. The FPR corresponds to the risk of censoring inclusive speech or de-platforming individuals who often mention marginalized groups.

Following Dixon et al. (2018), we express the classifier's bias against comments or commenters harmlessly mentioning an identity term as the FPR bias (FPB):

    FPB_i = P[T̂ = 1 | I = i, T = 0] / P[T̂ = 1 | T = 0],   (3)

where i is an identity term and T = 1 if the comment was deemed toxic by a human annotator.
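Computed from test-set predictions, Eq. 3 amounts to a ratio of false positive rates. A minimal sketch (column names "text", "toxic", and "pred" are assumptions):

    # Sketch: per-identity-term FPR bias (Eq. 3) from test-set predictions.
    def fpr(df, mask=None):
        # P[T_hat = 1 | T = 0], optionally restricted to comments matching mask.
        nontoxic = df[df["toxic"] == 0] if mask is None else df[(df["toxic"] == 0) & mask]
        return (nontoxic["pred"] == 1).mean()

    def fpr_bias(df, identity_terms):
        # FPR among non-toxic comments mentioning term i, relative to the
        # overall FPR on non-toxic comments.
        overall = fpr(df)
        return {i: fpr(df, df["text"].str.contains(i, case=False, regex=False)) / overall
                for i in identity_terms}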
Upstream Bias.— Following Hutchinson et al. (2020), we measure upstream bias via sentiment associations. We construct a set of templates of the form {identity} {person} is [MASK], where identities are the identity terms from Dixon et al. (2018) (e.g. "gay" or "Muslim") and the person phrases include "a person," "my sibling," and other relations. We predict the top-20 likely tokens for the "[MASK]" position (e.g., "awesome" or "dangerous"). Using a pre-trained RoBERTa sentiment classifier trained on the TweetEval benchmark (Barbieri et al., 2020), we then measure the average negative sentiment score of the predicted tokens. The model's bias is the magnitude of negative association with each identity term.

RoBERTa sometimes suggests terms which refer back to the target identity group. To mitigate this effect, we drop any predicted tokens that match the 50 identity terms (e.g. "Latino") from Dixon et al. (2018), but we are likely missing other confounding adjectives (e.g. "Spanish"). We suspect this confounding is minimal: we achieve similar results with an alternative ranking-based bias metric (see Appendix C.2).
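A minimal sketch of this sentiment-association measurement, assuming the public roberta-base fill-mask pipeline and a public TweetEval sentiment checkpoint; the exact checkpoint, filtering, and aggregation details are assumptions rather than the pipeline described above.

    # Sketch: average negative-sentiment score of top-20 fill-mask predictions.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="roberta-base", top_k=20)
    sentiment = pipeline("text-classification",
                         model="cardiffnlp/twitter-roberta-base-sentiment",
                         top_k=None)  # return scores for every sentiment class

    def negative_association(identity, person="a person", drop_terms=()):
        # Template of the form "{identity} {person} is [MASK]".
        template = f"{identity} {person} is {fill.tokenizer.mask_token}."
        tokens = [p["token_str"].strip() for p in fill(template)]
        # Drop candidates that simply repeat an identity term (e.g. "Latino").
        drop = {d.lower() for d in drop_terms}
        tokens = [t for t in tokens if t.lower() not in drop]
        neg_scores = []
        for t in tokens:
            out = sentiment(t)
            labels = out[0] if isinstance(out[0], list) else out
            for item in labels:
                # This checkpoint names classes LABEL_0/1/2; LABEL_0 is negative.
                if item["label"] in ("LABEL_0", "negative"):
                    neg_scores.append(item["score"])
        return sum(neg_scores) / len(neg_scores) if neg_scores else 0.0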
4 Experiments

We measure changes in upstream and downstream bias subject to the following interventions (Fig. 1):

• No pre-training. To control for the effects of pre-training, we test randomly initialized versions of both models that have not been pre-trained. We average over 10 trials.

• Random perturbations. We instantiate a pre-trained model and then add random noise e to every weight in the embedding matrix. We try both uniform noise u ~ Unif(−c, c) and Gaussian noise z ~ N(0, σ^2), varying c and σ^2. The final noise-added matrix is clipped so that its range does not exceed that of the original matrix (illustrated in the sketch after this list).

• Bias mitigation. We apply the SENTDEBIAS algorithm to de-bias embeddings at the word level (Liang et al., 2020). SENTDEBIAS estimates a bias subspace V with principal component analysis, then computes debiased word embeddings ĥ = h − Σ_{j=1}^{k} ⟨h, v_j⟩ v_j by subtracting the projection of h onto V. We add a multiplier to add or remove bias to various degrees—standard SENTDEBIAS uses a multiplier of 1.0 (see the sketch after this list).

• Re-balancing and scrubbing. For BIOS, we re-balance the fine-tuning dataset by under-sampling biographies with the prevalent pronoun in each occupation. For WIKI, we randomly remove from the fine-tuning dataset α percent of comments mentioning each identity term.
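The two embedding-space interventions referenced above can be sketched as follows. This is an illustration of the idea only: the noise is added to the word-embedding matrix and clipped to its original range, and the projection step is a SentDebias-style subtraction applied directly to word embeddings with a tunable multiplier. The reference SENTDEBIAS implementation of Liang et al. (2020) estimates the bias subspace from sentence representations, so this is not that method verbatim.

    # Sketch of the embedding-matrix interventions: clipped random noise and a
    # SentDebias-style projection removal with a multiplier (lam).
    import torch

    def add_clipped_noise(model, scale=0.01, kind="uniform"):
        # Perturb the word-embedding matrix, clipping to its original range.
        emb = model.get_input_embeddings().weight          # (vocab, dim)
        lo, hi = emb.min().item(), emb.max().item()
        if kind == "uniform":
            noise = torch.empty_like(emb).uniform_(-scale, scale)   # Unif(-c, c)
        else:
            noise = torch.randn_like(emb) * scale                   # N(0, sigma^2)
        with torch.no_grad():
            emb.copy_((emb + noise).clamp(lo, hi))

    def remove_bias_subspace(model, bias_directions, lam=1.0):
        # Subtract lam * (projection of each embedding onto the bias subspace V).
        # bias_directions: (k, dim) tensor spanning the estimated bias subspace.
        emb = model.get_input_embeddings().weight
        V = torch.linalg.qr(bias_directions.T).Q.T          # orthonormal rows of V
        with torch.no_grad():
            proj = (emb @ V.T) @ V                           # sum_j <h, v_j> v_j
            emb.sub_(lam * proj)

Setting lam=1.0 corresponds to the standard full subtraction; larger or negative values exaggerate or reverse the intervention, matching the "add or remove bias to various degrees" usage described above.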
4.1 Upstream variations have little impact on downstream bias.

Our goal is to test the bias transfer hypothesis, which holds that upstream bias is transferred through fine-tuning to downstream models. By this view, we would expect changes to the pre-trained model to also change the distribution of downstream bias—but we find that for both tasks, downstream bias is largely invariant to upstream interventions. Figure 2 summarizes the similarity of biases before and after each randomized intervention. Though randomizing the model weights significantly reduces the mean and variance of upstream bias, the distribution of downstream bias changes very little.[4] For example, RoBERTa exhibits the same disparities in performance after fine-tuning regardless of whether the base model was pre-trained. Likewise, although the SENTDEBIAS mitigation method reduces pronoun ranking (upstream) bias as intended, we detect roughly the same downstream biases no matter the level of mitigation applied (Figure 3). For example, in the BIOS task, surgeons with he/him pronouns are still 1.3 times more likely to have their biographies correctly classified than their she/her counterparts.

[4] See Appendix B.2 for a full set of correlation tests.

There is one notable exception to this trend: for the WIKI task, adding noise (uniform or Gaussian) to the pre-trained model's embedding matrix or not pre-training the model yields a modest reduction in median bias (Figure 2). As upstream bias shifts towards zero, downstream bias also moves marginally towards zero. Still, the largest biases (e.g., against the term "gay") do not decrease and may even increase after randomization.
[Figure 2, panel (a) BIOS: upstream bias (log prob.: she/her − he/him) and downstream bias (log TPR: she/her − he/him) for models that are pre-trained (86.3% accuracy), perturbed with uniform noise (82.1%), perturbed with Gaussian noise (78.6%), and not pre-trained (80.4%); labeled occupations include "Nurse" and "Surgeon". Panel (b) WIKI: upstream bias (avg. neg. sentiment) and downstream bias (log FPR) for models that are pre-trained (94.3% accuracy), perturbed with uniform noise (93.1%), perturbed with Gaussian noise (80.4%), and not pre-trained (67.4%); labeled terms include "jewish" and "gay".]

Figure 2: Bias per occupation after randomized interventions, averaged over 10 trials. Despite drastic changes to the distribution of upstream bias (left), downstream bias remains roughly stable (right). For example, upstream bias with pre-training (purple) is not correlated with upstream bias without pre-training. But downstream bias is partially correlated with the control (Pearson's correlation coefficient ρ_BIOS = 0.93 and ρ_WIKI = 0.64, p < 0.01).
4.2 Most downstream bias is explained by the fine-tuning step.

Though the results in the preceding section suggest that there is no clear or consistent correspondence between changes in upstream bias and changes in downstream bias, there is still a noticeable correlation between baseline upstream and downstream bias (Pearson's ρ = 0.43, p = 0.022 for BIOS; ρ = 0.59, p < 10^−5 for WIKI—see Appendix A). There is an important third variable that helps explain this correlation: cultural artifacts ingrained in both the pre-training and fine-tuning datasets.[5] RoBERTa learns these artifacts through co-occurrences and other associations between words in both sets of corpora.

[5] For example, cultural biases about which pronouns belong in which occupations are likely to pervade both the pre-training dataset (e.g., Wikipedia) and the fine-tuning dataset (internet biographies).

To test this explanation, we conduct a simple regression analysis across interventions (Figure 1) and evaluation templates. We estimate

    log TPB_{m,y} = β_0 + β_1 PRB_{m,y,s} + β_2 π_y + f_s + c_m   (4)

for model treatment m, occupation y, and pronoun ranking template s. TPB is the TPR bias (downstream bias) from Eq. 1; PRB is the pronoun ranking bias (upstream bias) from Eq. 2; f_s and c_m are dummy variables (for ordinary least squares) or fixed effects to capture heterogeneous effects between templates and models (such as variations in overall embedding quality). We control for statistical "dataset bias" with π_y, the prevalence of "she/her" biographies within each occupation y in the fine-tuning data.
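Eq. 4 is an ordinary least squares regression with template and model dummies. A minimal sketch with statsmodels is given below; the column names are assumptions, and HC-type robust standard errors are one possible choice consistent with the heteroskedasticity-consistent errors reported in Figure 4.

    # Sketch: Eq. 4 as OLS with template and model fixed effects (dummies).
    import statsmodels.formula.api as smf

    def fit_bias_regression(df):
        # log_tpb: log TPR bias; prb: pronoun ranking bias;
        # share_she_her: she/her prevalence per occupation (dataset bias proxy).
        ols = smf.ols("log_tpb ~ prb + share_she_her + C(template) + C(model)",
                      data=df)
        return ols.fit(cov_type="HC1")   # heteroskedasticity-consistent SEs

    # e.g. results = fit_bias_regression(df); print(results.summary())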
We find that the "dataset bias" in the fine-tuning stage explains most of the correlation between upstream and downstream bias. Under the strong bias transfer hypothesis, we would expect the coefficient on upstream bias, β_1, to be statistically significant and greater in magnitude than the coefficient β_2 on our proxy for dataset bias. But for both tasks,
[Figure 3: upstream bias (pronoun ranking), plotted as bias (she/her − he/him) per occupation (e.g., "Nurse") for each model.]

each identity term and the average length of toxic mentions of each identity term—longer comments are less likely to result in erroneous screening (Appendix C.1).

As in the previous regression, dataset bias explains more of the variation in downstream bias
[Figure 4, panel (a) BIOS: regression coefficients on upstream bias (likelihood gap) and fine-tuning dataset bias (prevalence of she/her) for all pre-trained models (N=6020), pre-trained (N=140), noise added (N=1400), balanced (N=1400), not pre-trained (N=2940), and bias-mitigated (N=1820). Panel (b) WIKI: coefficients on upstream bias (avg. negative sentiment) and fine-tuning dataset bias (prevalence of toxic mentions) for all pre-trained models (N=12516), pre-trained (N=315), noise added (N=6615), scrubbed (N=5901), and not pre-trained (N=3150).]

Figure 4: Effect of upstream bias vs. fine-tuning dataset bias on downstream bias, controlling for model & template fixed effects. Bars depict heteroskedasticity-consistent standard errors. Statistically insignificant (p < 0.01) coefficients are hollow. For both tasks, reduction in fine-tuning dataset bias corresponds to a greater alteration to downstream bias than an equivalent reduction (accounting for scale) in upstream bias.
[Figure: panels "None" and "Mentions Scrubbed"; labeled terms include "jewish" and "gay".]

6 Conclusion

Our results offer several points of guidance to organizations training and distributing LLMs and the practitioners applying them:

• Attenuating downstream bias via upstream interventions—including embedding-space bias mitigation—is mostly futile in the cases we study
data or classifiers publicly. Finally, the models we
trained exhibit toxic, offensive behavior. These
models and datasets are intended only for studying
bias and simulating harms and, as our results show,
should not be deployed or applied to any other data
except for this purpose.
Acknowledgements
Thanks to Maria De-Arteaga and Benedikt Boecking for assistance with BIOS data collection.
Thanks also to the reviewers for their helpful com-
ments and feedback.
References

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. 2020. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644–1650, Online. Association for Computational Linguistics.

Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. 2017. The problem with bias: Allocative versus representational harms in machine learning. In 9th Annual Conference of the Special Interest Group for Computing, Information and Society.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2019. Fairness and Machine Learning. fairmlbook.org.

Marion Bartl, Malvina Nissim, and Albert Gatt. 2020. Unmasking Contextual Stereotypes: Measuring and Mitigating BERT's Gender Bias. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 1–16, Barcelona, Spain (Online). Association for Computational Linguistics.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Abeba Birhane and Vinay Uday Prabhu. 2021. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical Survey of "Bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Su Lin Blodgett and Brendan O'Connor. 2017. Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English. arXiv:1707.00061 [cs].

Su Lin Blodgett, Johnny Wei, and Brendan O'Connor. 2018. Twitter Universal Dependency Parsing for African-American and Mainstream American English. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Melbourne, Australia. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In D D Lee, M Sugiyama, U V Luxburg, I Guyon, and R Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4349–4357. Curran Associates, Inc.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2021. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 [cs].

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Yang Trista Cao and Hal Daumé, III. 2021. Toward Gender-Inclusive Coreference Resolution: An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle. Computational Linguistics, 47(3):615–661.

Aida Mostafazadeh Davani, Mohammad Atari, Brendan Kennedy, and Morteza Dehghani. 2021. Hate Speech Classifiers Learn Human-Like Social Stereotypes. arXiv:2110.14839 [cs].

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pages 120–128, New York, NY, USA. Association for Computing Machinery.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73, New Orleans, LA, USA. ACM.

Aparna Garimella, Carmen Banea, Dirk Hovy, and Rada Mihalcea. 2019. Women's Syntactic Resilience and Men's Grammatical Luck: Gender-Bias in Part-of-Speech Tagging and Dependency Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3493–3498, Florence, Italy. Association for Computational Linguistics.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics.

Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sánchez, Mugdha Pandya, and Adam Lopez. 2021. Intrinsic Bias Metrics Do Not Correlate with Application Bias. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1926–1940, Online. Association for Computational Linguistics.

Wei Guo and Aylin Caliskan. 2021. Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 122–133, New York, NY, USA. Association for Computing Machinery.

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social Biases in NLP Models as Barriers for Persons with Disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online. Association for Computational Linguistics.

Xisen Jin, Francesco Barbieri, Brendan Kennedy, Aida Mostafazadeh Davani, Leonardo Neves, and Xiang Ren. 2021. On Transferability of Bias Mitigation Effects in Language Model Fine-Tuning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3770–3783, Online. Association for Computational Linguistics.

Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019. Measuring Bias in Contextualized Word Representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 166–172, Florence, Italy. Association for Computational Linguistics.

Paul Pu Liang, Irene Mengze Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Towards Debiasing Sentence Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5502–5515, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs].

David P. MacKinnon, Jennifer L. Krull, and Chondra M. Lockwood. 2000. Equivalence of the Mediation, Confounding and Suppression Effect. Prevention Science, 1(4):173–181.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On Measuring Social Biases in Sentence Encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 622–628, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing Gender Bias in Abusive Language Detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium. Association for Computational Linguistics.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The Risk of Racial Bias in Hate Speech Detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Morgan Klaus Scheuerman, Alex Hanna, and Emily Denton. 2021. Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2):317:1–317:37.

Andrew Silva, Pradyumna Tambwekar, and Matthew Gombolay. 2021. Towards a Comprehensive Understanding and Accurate Evaluation of Societal Biases in Pre-Trained Transformers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2383–2389, Online. Association for Computational Linguistics.

Irene Solaiman and Christy Dennison. 2021. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. In Pre-Proceedings of Advances in Neural Information Processing Systems, volume 34.

Ramya Srinivasan and Kanji Uchino. 2021. Biases in Generative Art: A Causal Look from the Lens of Art History. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 41–51, New York, NY, USA. Association for Computing Machinery.

Ryan Steed and Aylin Caliskan. 2021. Image Representations Learned With Unsupervised Pre-Training Contain Human-like Biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 701–713, New York, NY, USA. Association for Computing Machinery.
A Descriptive Statistics
BIOS.— Biographies include the 28 most frequent occupations according to the BLS Standard Occupation Classification system.[6] See Figure 6 for a full list of occupations and the prevalence of each set of pronouns.
Bias                  Template                                             Targets                                                    Source
Occupation-Pronoun    [TARGET] is a(n) [OCCUPATION]                        he, she                                                    (Kurita et al., 2019)
Occupation-Pronoun    [TARGET] works as a(n) [OCCUPATION]                  he, she                                                    (Bartl et al., 2020)
Occupation-Pronoun    [TARGET] wants to become a(n) [OCCUPATION]           he, she                                                    (Bartl et al., 2020)
Occupation-Pronoun    [TARGET] applied for the position of [OCCUPATION]    he, she                                                    (Bartl et al., 2020)
Identity-Sentiment    [IDENTITY] [TARGET] are [MASK]                       people                                                     (Dixon et al., 2018)
Identity-Sentiment    [IDENTITY] [TARGET] is [MASK]                        my sibling, my friend, my parent, my partner, my spouse    (Dixon et al., 2018)
[Figure: downstream bias (TPR ratio: she/her − he/him) per occupation and model vs. % pronoun gap reduced with undersampling (0-100%); labeled occupations include "Nurse", "DJ", and "Surgeon".]
[Figure 11 panels: "Pre-trained" and "Not pre-trained", with an "All Mentions" row; y axis from −2.5 to 5.0; x axis: % mentions scrubbed (0-100).]

Figure 11: FPR gap (downstream bias) after scrubbing toxic mentions of identity terms from the WIKI fine-tuning dataset.
                                      Upstream Bias                     Downstream Bias
Intervention                          Pearson's ρ     Kendall's τ       Pearson's ρ     Kendall's τ
BIOS                                  Pronoun ranking                   TPR ratio
Not pre-trained                       -0.08           0.04              0.93***         0.74***
Uniform noise                         0.90***         0.62***           0.67***         0.60***
Gaussian noise                        0.29            0.09              0.90***         0.56***
SENTDEBIAS (multiplier = 50)          0.87***         0.61***           0.96***         0.77***
Dataset re-balancing (1.0)            --              --                0.94***         0.85***

Table 2: Correlation between bias distributions before and after each intervention. Statistically insignificant correlation coefficients indicate bias has changed drastically (red). Notably, downstream bias is correlated with the control to some extent for every intervention except for scrubbing and not pre-training. Pearson's correlation coefficient ρ is a measure of correlation strength and direction; Kendall's τ is a measure of ordinal correlation. Randomized interventions (e.g., not pre-training, adding noise) tend to re-order the bias distribution more than others, indicated by a lower τ.
Table 3: Effect of upstream on downstream bias for pre-trained RoBERTa on the BIOS task. Panel linear models include model fixed effects.

Dependent variable: Log TPR ratio (downstream bias)
                                  OLS            panel linear
                                  Pre-trained    Mitigated    Noise added    Random     Balanced    All pre-trained
                                  (1)            (2)          (3)            (4)        (5)         (6)
Likelihood gap (upstream bias)    0.068***       0.058***     0.063*         0.013      0.016**     0.018***
                                  (0.018)        (0.005)      (0.038)        (0.011)    (0.007)     (0.007)
Constant                          0.090***
                                  (0.029)
Table 4: Effect of upstream on downstream bias for pre-trained RoBERTa on the WIKI task. Panel linear models include model fixed effects.

Dependent variable: FPR
                                           OLS            panel linear
                                           Pre-trained    Noise added    Random     Scrubbed    All pre-trained
                                           (1)            (2)            (3)        (4)         (5)
Avg. negative sentiment (upstream bias)    0.591***       0.255***       0.029      0.571***    0.376***
                                           (0.107)        (0.027)        (0.084)    (0.025)     (0.017)

Table 5: Effect of upstream on downstream bias for pre-trained RoBERTa on the BIOS task.

Dependent variables: TPR ratio (downstream bias); Likelihood gap after fine-tuning (intermediate bias)
                                  OLS            panel linear                  panel linear
                                  (1)            (2)            (3)            (4)            (5)
Likelihood gap (upstream bias)    0.043**        0.068***       0.043**        0.068***       0.046
                                  (0.021)        (0.018)        (0.021)        (0.018)        (0.575)

Table 6: Effect of upstream on downstream bias for pre-trained RoBERTa on the WIKI task.

Dependent variables: FPR; Avg. negative sentiment after fine-tuning (intermediate bias)
                                           OLS            panel          panel linear
                                           (1)            (2)            (3)            (4)            (5)            (6)
Avg. negative sentiment (upstream bias)    0.569***       0.568***       0.586***       0.569***       0.591***       0.004
                                           (0.110)        (0.106)        (0.108)        (0.110)        (0.107)        (0.014)
examples). Templates of other forms are not included to reduce computation time.[8] For each of these templates, we mask the identity term and compute the log probability score as described in §3.2. The model's bias is described by the difference between the average log probability scores for the toxic templates and the non-toxic templates for each identity term.

For the regressions (Tables 7 and 8), the templates are not paired, so we average first across toxic and non-toxic templates, then calculate the ratio between the two. The relative size and statistical significance of the coefficients are the same as for the negative sentiment approach, suggesting the negative sentiment metric is robust for our purposes despite its limitations.
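A minimal sketch of this ranking-based metric, assuming roberta-base, single-token identity terms, and templates with an {identity} slot; the actual template set and averaging details follow the description above rather than this exact code.

    # Sketch: ranking-based bias as the difference between average log-probability
    # of the masked identity term under toxic vs. non-toxic templates.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("roberta-base")
    mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

    def log_prob_of_identity(template, identity):
        text = template.format(identity=tok.mask_token)
        enc = tok(text, return_tensors="pt")
        with torch.no_grad():
            logits = mlm(**enc).logits
        pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
        token_id = tok(" " + identity, add_special_tokens=False)["input_ids"][0]
        return logits[0, pos].log_softmax(-1)[token_id].item()

    def ranking_bias(identity, toxic_templates, nontoxic_templates):
        toxic = sum(log_prob_of_identity(t, identity) for t in toxic_templates)
        clean = sum(log_prob_of_identity(t, identity) for t in nontoxic_templates)
        return toxic / len(toxic_templates) - clean / len(nontoxic_templates)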
D Replication

We provide our full results (upstream and downstream bias for every intervention, for each task) and the scripts used to analyse them. We are not allowed to release the source code used to train our models and measure bias, but we include additional details on our implementation to help others understand and replicate our results.

//github.com/conversationai/unintended-ml-bias-analysis.

• Fine-tuning a single model for either task takes from 4-6 hours on a single NVIDIA Tesla V100 16GB GPU. Our results include approximately 60 model permutations for a total of 240-360 GPU hours. roberta-base has 125M parameters, but we did not pre-train any models from scratch.
Table 7: Effect of upstream on downstream bias for pre-trained RoBERTa on the WIKI task.

Dependent variable: FPR
                                             (1)                    (2)                    (3)
Avg. log likelihood ratio (upstream bias)    0.399**                0.292                  0.286
                                             (0.192)                (0.176)                (0.187)
Observations                                 50                     50                     50
R2                                           0.083                  0.269                  0.269
Adjusted R2                                  0.064                  0.238                  0.222
Residual Std. Error                          0.241 (df = 48)        0.217 (df = 47)        0.220 (df = 46)
F Statistic                                  4.345** (df = 1; 48)   8.649*** (df = 2; 47)  5.648*** (df = 3; 46)
Note: *p<0.1; **p<0.05; ***p<0.01

Table 8: Effect of upstream on downstream bias for pre-trained RoBERTa on the WIKI task.

Dependent variable: FPR ratio (downstream bias)
                                             OLS            panel linear
                                             Pre-trained    Noise added    Random     Scrubbed    All pre-trained
                                             (1)            (2)            (3)        (4)         (5)
Avg. log likelihood ratio (upstream bias)    0.292          0.073**        0.165      0.301***    0.184***
                                             (0.176)        (0.032)        (0.167)    (0.039)     (0.022)
Constant                                     0.008
                                             (0.040)