
Upstream Mitigation Is Not All You Need: Testing the Bias Transfer Hypothesis in Pre-Trained Language Models

Ryan Steed
Carnegie Mellon University
[email protected]

Swetasudha Panda, Ari Kobren, Michael Wick
Oracle Labs
{swetasudha.panda,ari.kobren,michael.wick}@oracle.com

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3524-3542, May 22-27, 2022. © 2022 Association for Computational Linguistics.

Abstract

A few large, homogenous, pre-trained models undergird many machine learning systems—and often, these models contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis: the theory that social biases (such as stereotypes) internalized by large language models during pre-training transfer into harmful task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier's discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.

1 Introduction

Large language models (LLMs) and other massively pre-trained "foundation" models are powerful tools for task-specific machine learning (Bommasani et al., 2021). Models pre-trained by well-resourced organizations can easily adapt to a wide variety of downstream tasks in a process called fine-tuning. But massive pre-training datasets and increasingly homogeneous model design come with well-known, immediate social risks beyond the financial and environmental costs (Strubell et al., 2019; Bender et al., 2021).

Figure 1: Full pre-training to fine-tuning pipeline, with experimental interventions (green hexagons).

Transformer-based LLMs like BERT and GPT-3 contain quantifiable intrinsic social biases encoded in their embedding spaces (Goldfarb-Tarrant et al., 2021). These intrinsic biases are typically associated with representational harms, including stereotyping and denigration (Barocas et al., 2017; Blodgett et al., 2020; Bender et al., 2021). Separately, many studies document the extrinsic harms of the downstream (fine-tuned & task-specific) applications of fine-tuned LLMs, including discriminatory medical diagnoses (Zhang et al., 2020), over-reliance on binary gender for coreference resolution (Cao and Daumé, 2021), the reinforcement of traditional gender roles in part-of-speech tagging (Garimella et al., 2019), toxic text generation (Gehman et al., 2020), and censorship of inclusive language and AAVE (Blodgett and O'Connor, 2017; Blodgett et al., 2018; Park et al., 2018; Sap et al., 2019).

Despite these risks, no research has investigated the extent to which downstream systems inherit social biases from pre-trained models.[1]

[1] We use the term "bias" to refer to statistical associations that result in representational or allocational harms to historically marginalized social groups (Blodgett et al., 2020).
Many studies warn that increasing intrinsic bias upstream may lead to an increased risk of downstream harms (Bolukbasi et al., 2016; Caliskan et al., 2017). This hypothesis, which we call the Bias Transfer Hypothesis, holds that stereotypes and other biased associations in a pre-trained model are transferred to post-fine-tuning downstream tasks, where they can cause further, task-specific harms. A weaker version of this hypothesis holds that downstream harms are at least mostly determined by the pre-trained model (Bommasani et al., 2021).

In the pre-training paradigm, the extent to which the bias transfer hypothesis holds will determine the most effective strategies for responsible design. In the cases we study, reducing upstream bias does little to change downstream behavior. Still, there is hope: instead, developers can carefully curate the fine-tuning dataset, checking for harms in context.

We test the bias transfer hypothesis on two classification tasks with previously demonstrated performance disparities: occupation classification of online biographies (De-Arteaga et al., 2019) and toxicity classification of Wikipedia Talks comments (Dixon et al., 2018). We investigate whether reducing or exacerbating intrinsic biases encoded by RoBERTa (Liu et al., 2019) decreases or increases the severity of downstream, extrinsic harms (Figure 1). We find that the bias transfer hypothesis describes only part of the interplay between pre-training biases and harms after fine-tuning:

• Systematically manipulating upstream bias has little impact on downstream disparity, especially for the most-harmed groups.

• With a regression analysis, we find that most variation in downstream bias can be explained by bias in the fine-tuning dataset (proxied by co-occurrence rates).

• Altering associations in the fine-tuning dataset can sometimes change downstream behavior, but only when the model is not pre-trained.

Without absolving LLMs or their owners of representational harms intrinsic to pre-trained models, our results encourage practitioners and application stakeholders to focus more on dataset quality and context-specific harm identification and reduction.

2 Related Work

Little prior work directly tests the bias transfer hypothesis. The closest example of this phenomenon is the "blood diamond" effect (Birhane and Prabhu, 2021), in which stereotyping and denigration in the pre-training corpora pervade subsequently generated images and language even before the fine-tuning stage (Steed and Caliskan, 2021). Still, it is unclear to what extent undesirable values encoded in pre-training datasets or benchmarks—such as Wikipedia or ImageNet—induce task-specific harms after fine-tuning (Barocas et al., 2019).

Some work explores the consistency of intrinsic and extrinsic bias metrics: Goldfarb-Tarrant et al. (2021) find that intrinsic and extrinsic metrics are not reliably correlated for static embeddings like word2vec. We focus instead on state-of-the-art transformer-based LLMs—the subject of intense ethical debate (Bender et al., 2021; Bommasani et al., 2021)—which construct contextual, rather than static, embeddings. Contextual embeddings—token encodings that are conditional on other, nearby tokens—pose an ongoing challenge for intrinsic bias measurement (May et al., 2019; Kurita et al., 2019; Guo and Caliskan, 2021) and bias mitigation (Liang et al., 2020). We find that intrinsic and extrinsic metrics are correlated for the typical LLM—but that the correlation is mostly explained by biases in the fine-tuning dataset.

Other research tests the possibility that upstream mitigation could universally prevent downstream harm. Jin et al. (2021) show that an intermediate, bias-mitigating fine-tuning step can help reduce bias in later tasks. Likewise, Solaiman and Dennison (2021) propose fine-tuning on carefully curated "values-targeted" datasets to reduce toxic GPT-3 behavior. Our results tend to corroborate these methods: we find that the fine-tuning process can to some extent overwrite the biases present in the original pre-trained model. A recent post-hoc mitigation technique, on the other hand, debiases contextual embeddings before fine-tuning (Liang et al., 2020). Our results imply that while this type of debiasing may help with representational harms upstream, it is less successful for reducing harms downstream.

3 Methods

To empirically evaluate the bias transfer hypothesis, we examine the relationship between upstream bias and downstream bias for two tasks. We track how this relationship changes under various controlled interventions on the model weights or the fine-tuning dataset.
3.1 Model

For each task, we fine-tune RoBERTa[2] (Liu et al., 2019). We split the fine-tuning dataset into train (80%), evaluation (10%), and test (20%) partitions. To fine-tune, we attach a sequence classification head and train for 3 epochs.[3]

[2] roberta-base from HuggingFace (Wolf et al., 2020).
[3] See Appendix D for more details. Epochs and other parameters were chosen to match prior work (Jin et al., 2021).

3.2 Occupation Classification

The goal of occupation classification is to predict someone's occupation from their online biography. We fine-tune with the BIOS dataset scraped from Common Crawl (De-Arteaga et al., 2019), which includes over 400,000 online biographies belonging to 28 common occupations. Since self-identified gender was not collected, we will refer instead to the pronouns used in each biography (each biography uses either he/him or she/her pronouns). Following De-Arteaga et al. (2019), we use the "scrubbed" version of the dataset—in which all the identifying pronouns have been removed—to measure just the effects of proxy words (e.g. "mother") and avoid overfitting on pronouns directly.

Downstream Bias.— Biographies with she/her pronouns are less frequently classified as belonging to certain traditionally male-dominated professions—such as "surgeon"—which could result in lower recruitment or callback rates for job candidates if the classifier is used by an employer. The empirical true positive rate (TPR) estimates the likelihood that the classifier correctly identifies a person's occupation from their biography. We follow previous work (De-Arteaga et al., 2019) in measuring downstream bias via the empirical true positive rate (TPR) gap between biographies using each set of pronouns. First, define

TPR_{y,g} = P[Ŷ = y | G = g, Y = y],

where g is a set of pronouns and y is an occupation. Y and Ŷ represent the true and predicted occupation, respectively. Then the TPR bias (TPB) is

TPB_y = TPR_{y,she/her} / TPR_{y,he/him}.    (1)

For example, the classifier correctly predicts "surgeon" for he/him biographies much more often than for she/her biographies, so the TPR ratio for the "surgeon" occupation is low (see Appendix A).

Upstream Bias.— We adapt Kurita et al. (2019)'s pronoun ranking test to the 28 occupations in the BIOS dataset. Kurita et al. (2019) measure the encoded association of he/him and she/her pronouns by the difference in log probability scores between pronouns appearing in templates of the form {pronoun} is a(n) {occupation}. We augment this approach with 5 career-related templates proposed by Bartl et al. (2020) (see Appendix A). Formally, given a template sequence x_{y,g} filled in with occupation y and pronoun g, we compute p_{y,g} = P(x_{y,g}). As a baseline, we also mask the occupation and compute the prior probability π_{y,g} (the probability of the same template with the occupation masked). The pronoun ranking bias (PRB) for this template is the difference in log probabilities:

PRB_y = log(p_{y,she/her} / π_{y,she/her}) − log(p_{y,he/him} / π_{y,he/him}).    (2)

3.3 Toxicity Classification

For toxicity classification, we use the WIKI dataset, which consists of just under 130,000 comments from the online forum Wikipedia Talks Pages (Dixon et al., 2018). The goal of the task is to predict whether each comment is toxic. Each comment has been labeled as toxic or non-toxic by a human annotator, where a toxic comment is a "rude, disrespectful, or unreasonable comment that is likely to make you leave the discussion" (Dixon et al., 2018). Following Dixon et al. (2018), we focus on 50 terms referring to people of certain genders, races, ethnicities, sexualities, and religions.

Downstream (Extrinsic) Bias.— Mentions of certain identity groups—such as "queer"—are more likely to be flagged for toxic content, which could result in certain communities being systematically censored or left unprotected if an online platform uses the classifier. The classifier's empirical false positive rate (FPR) estimates its likelihood to falsely flag a non-toxic comment as toxic. The FPR corresponds to the risk of censoring inclusive speech or de-platforming individuals who often mention marginalized groups.
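To make Eq. 1 concrete, the TPR gap can be computed from per-example predictions with a few lines of pandas. This is a minimal sketch rather than the authors' released code; the column names (occupation, pronouns, predicted) are hypothetical.

```python
import numpy as np
import pandas as pd

def tpr_bias(df: pd.DataFrame) -> pd.Series:
    """Compute the TPR ratio (Eq. 1) per occupation.

    Assumes one row per test biography with hypothetical columns:
      occupation -- true label y
      pronouns   -- "she/her" or "he/him"
      predicted  -- the classifier's predicted occupation
    """
    df = df.assign(correct=df["predicted"] == df["occupation"])
    # Empirical TPR for each (occupation, pronoun) group.
    tpr = df.groupby(["occupation", "pronouns"])["correct"].mean().unstack()
    # TPB_y = TPR_{y, she/her} / TPR_{y, he/him}; the log of this ratio is
    # the "log TPR gap" plotted in the paper's figures.
    return tpr["she/her"] / tpr["he/him"]

# Example usage: tpb = tpr_bias(test_predictions); np.log(tpb).sort_values()
```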
Following Dixon et al. (2018), we express the classifier's bias against comments or commenters harmlessly mentioning an identity term as the FPR bias (FPB):

FPB_i = P[T̂ = 1 | I = i, T = 0] / P[T̂ = 1 | T = 0],    (3)

where i is an identity term and T = 1 if the comment was deemed toxic by a human annotator.

Upstream Bias.— Following Hutchinson et al. (2020), we measure upstream bias via sentiment associations. We construct a set of templates of the form {identity} {person} is [MASK], where identities are the identity terms from Dixon et al. (2018) (e.g. "gay" or "Muslim") and the person phrases include "a person," "my sibling," and other relations. We predict the top-20 likely tokens for the "[MASK]" position (e.g., "awesome" or "dangerous"). Using a pre-trained RoBERTa sentiment classifier trained on the TweetEval benchmark (Barbieri et al., 2020), we then measure the average negative sentiment score of the predicted tokens. The model's bias is the magnitude of negative association with each identity term.

RoBERTa sometimes suggests terms which refer back to the target identity group. To mitigate this effect, we drop any predicted tokens that match the 50 identity terms (e.g. "Latino") from Dixon et al. (2018), but we are likely missing other confounding adjectives (e.g. "Spanish"). We suspect this confounding is minimal: we achieve similar results with an alternative ranking-based bias metric (see Appendix C.2).

4 Experiments

We measure changes in upstream and downstream bias subject to the following interventions (Fig. 1):

• No pre-training. To control for the effects of pre-training, we test randomly initialized versions of both models that have not been pre-trained. We average over 10 trials.

• Random perturbations. We instantiate a pre-trained model and then add random noise e to every weight in the embedding matrix. We try both uniform noise u ~ Unif(−c, c) and Gaussian noise z ~ N(0, σ²), varying c and σ². The final noise-added matrix is clipped so that its range does not exceed that of the original matrix.

• Bias mitigation. We apply the SENTDEBIAS algorithm to de-bias embeddings at the word level (Liang et al., 2020). SENTDEBIAS estimates a bias subspace V with principal component analysis, then computes debiased word embeddings ĥ = h − Σ_{j=1}^{k} ⟨h, v_j⟩ v_j by subtracting the projection of h onto V. We add the multiplier γ to add or remove bias to various degrees—standard SENTDEBIAS uses γ = 1.0 (a minimal sketch of this scaled projection is given after §4.1).

• Re-balancing and scrubbing. For BIOS, we re-balance the fine-tuning dataset by under-sampling biographies with the prevalent pronoun in each occupation. For WIKI, we randomly remove from the fine-tuning dataset α percent of comments mentioning each identity term.

4.1 Upstream variations have little impact on downstream bias.

Our goal is to test the bias transfer hypothesis, which holds that upstream bias is transferred through fine-tuning to downstream models. By this view, we would expect changes to the pre-trained model to also change the distribution of downstream bias—but we find that for both tasks, downstream bias is largely invariant to upstream interventions. Figure 2 summarizes the similarity of biases before and after each randomized event. Though randomizing the model weights significantly reduces the mean and variance of upstream bias, the distribution of downstream bias changes very little.[4] For example, RoBERTa exhibits the same disparities in performance after fine-tuning regardless of whether the base model was pre-trained. Likewise, although the SENTDEBIAS mitigation method reduces pronoun ranking (upstream) bias as intended, we detect roughly the same downstream biases no matter the level of mitigation applied (Figure 3). For example, in the BIOS task, surgeons with he/him pronouns are still 1.3 times more likely to have their biographies correctly classified than their she/her counterparts.

There is one notable exception to this trend: for the WIKI task, adding noise (uniform or Gaussian) to the pre-trained model's embedding matrix or not pre-training the model yields a modest reduction in median bias (Figure 2). As upstream bias shifts towards zero, downstream bias also moves marginally towards zero. Still, the largest biases (e.g., against the term "gay") do not decrease and may even increase after randomization.

[4] See Appendix B.2 for a full set of correlation tests.
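For illustration, the γ-scaled projection removal used in the bias-mitigation intervention can be written compactly in numpy. This is a sketch under the assumption that the bias directions have already been estimated with PCA; it is not the reference SENTDEBIAS implementation from Liang et al. (2020).

```python
import numpy as np

def scaled_debias(h: np.ndarray, V: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Remove (or amplify) the bias subspace from embedding(s) h.

    h     : (..., d) embedding vector(s)
    V     : (k, d) orthonormal bias directions, assumed estimated via PCA
    gamma : 1.0 reproduces standard debiasing; negative values add bias back,
            large positive values over-correct (the paper sweeps -50..50).
    """
    # Projection of h onto the bias subspace: sum_j <h, v_j> v_j
    proj = (h @ V.T) @ V
    return h - gamma * proj
```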
[Figure 2 panels: (a) BIOS, upstream bias (log prob.: she/her − he/him) and downstream bias (log TPR: she/her − he/him) per occupation (e.g., nurse, surgeon), for pre-trained (86.3% accuracy), uniform noise (82.1%), Gaussian noise (78.6%), and not pre-trained (80.4%) models. (b) WIKI, upstream bias (avg. neg. sentiment) and downstream bias (log FPR) per identity term (e.g., "jewish", "gay"), for pre-trained (94.3% accuracy), uniform noise (93.1%), Gaussian noise (80.4%), and not pre-trained (67.4%) models.]

Figure 2: Bias per occupation after randomized interventions, averaged over 10 trials. Despite drastic changes to the distribution of upstream bias (left), downstream bias remains roughly stable (right). For example, upstream bias with pre-training (purple) is not correlated with upstream bias without pre-training. But downstream bias is partially correlated with the control (Pearson's correlation coefficient ρ_BIOS = 0.93 and ρ_WIKI = 0.64, p < 0.01).

4.2 Most downstream bias is explained by the fine-tuning step.

Though the results in the preceding section suggest that there is no clear or consistent correspondence between changes in upstream bias and changes in downstream bias, there is still a noticeable correlation between baseline upstream and downstream bias (Pearson's ρ = 0.43, p = 0.022 for BIOS, ρ = 0.59, p < 10^-5 for WIKI—see Appendix A). There is an important third variable that helps explain this correlation: cultural artifacts ingrained in both the pre-training and fine-tuning datasets.[5] RoBERTa learns these artifacts through co-occurrences and other associations between words in both sets of corpora.

To test this explanation, we conduct a simple regression analysis across interventions (Figure 1) and evaluation templates. We estimate

log TPB_{m,y} = β_0 + β_1 PRB_{m,y,s} + β_2 π_y + f_s + c_m    (4)

for model treatment m, occupation y, and pronoun ranking template s. TPB is the TPR bias (downstream bias) from Eq. 1; PRB is the pronoun ranking bias (upstream bias) from Eq. 2; f_s and c_m are dummy variables (for ordinary least squares) or fixed effects to capture heterogeneous effects between templates and models (such as variations in overall embedding quality). We control for statistical "dataset bias" with π, the prevalence of "she/her" biographies within each occupation y in the fine-tuning data.

We find that the "dataset bias" in the fine-tuning stage explains most of the correlation between upstream and downstream bias. Under the strong bias transfer hypothesis, we would expect the coefficient on upstream bias β_1 to be statistically significant and greater in magnitude than the coefficient β_2 on our proxy for dataset bias. But for both tasks, the opposite is true: fine-tuning dataset bias has a larger effect than upstream bias.

[5] For example, cultural biases about which pronouns belong in which occupations are likely to pervade both the pre-training dataset (e.g., Wikipedia) and the fine-tuning dataset (internet biographies).
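A regression of the form in Eq. 4 can be fit with statsmodels' formula API, with dummy variables standing in for the template and model fixed effects and HC3 robust errors as in Appendix C. The sketch below uses synthetic data and hypothetical column names; it is an illustration of the specification, not the authors' exact analysis script.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real measurements: one row per
# (model treatment, occupation, template) with hypothetical column names.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "prb": rng.normal(size=n),                 # upstream pronoun-ranking bias
    "prev_sheher": rng.uniform(0, 1, size=n),  # dataset-bias proxy (share of she/her)
    "template": rng.integers(0, 5, size=n),    # plays the role of f_s
    "model": rng.integers(0, 4, size=n),       # plays the role of c_m
})
df["log_tpb"] = 0.02 * df["prb"] + 0.6 * df["prev_sheher"] + rng.normal(scale=0.1, size=n)

# OLS with template/model dummies and HC3 heteroskedasticity-consistent errors.
fit = smf.ols("log_tpb ~ prb + prev_sheher + C(template) + C(model)", data=df).fit(cov_type="HC3")
print(fit.params[["prb", "prev_sheher"]])
```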
[Figure 3 panels: top, upstream bias (pronoun ranking), plotted as bias in log probability (she/her − he/him) per occupation; bottom, downstream bias (TPR ratio) per occupation (e.g., Nurse, DJ, Surgeon); x-axis: mitigation multiplier γ from −50 to 50, one line per model.]

Figure 3: Log TPR bias per occupation after scaled SENTDEBIAS on the BIOS task. Mitigation significantly reduces pronoun ranking (upstream) bias compared to base RoBERTa (top); but even when upstream bias decreases, the TPR ratio (downstream bias) remains mostly constant (bottom). The distribution of downstream bias without any mitigation is almost perfectly correlated with the distribution at γ = 50 (Pearson's ρ = 0.96, p < 0.01).

Figure 4 reports the coefficient estimates for these two variables. (See Appendix C.1 for all estimates, standard errors, assumptions and additional specifications.)

In the BIOS task, a large decrease in upstream bias corresponds to a small but statistically significant increase in downstream bias. On average, a reduction of 0.3 to the log likelihood gap—equivalent to the reduction in bias towards nurses after upstream mitigation—corresponds to a 0.5% increase in the TPR ratio. Almost all the downstream bias in the BIOS task is explained by dataset bias instead: a 10% increase in the prevalence of she/her pronouns within an occupation corresponds to a much larger 6.5% increase in the TPR ratio.

In the WIKI task, upstream bias has a more noticeable effect—but the effect of dataset bias is still much larger. The regression takes the same form as Eq. 4, where downstream bias is FPR bias (Eq. 3), upstream bias is negative sentiment, and π_i is the proportion of toxic mentions of identity i. We additionally control for the prevalence of each identity term and the average length of toxic mentions of each identity term—longer comments are less likely to result in erroneous screening (Appendix C.1).

As in the previous regression, dataset bias explains more of the variation in downstream bias than does upstream bias. On average, a large increase in average negative sentiment against a given identity term (e.g. 0.1, one standard deviation) corresponds to only a modest 3.7% increase in FPR. In comparison, a 10% increase in the prevalence of toxic mentions of an identity corresponds to an even larger 6.3% increase in FPR.

We also check that intrinsic bias downstream changes due to fine-tuning. We measure intrinsic bias again after fine-tuning and regress on downstream intrinsic bias instead of downstream extrinsic bias (Eq. 4). The results are consistent: after controlling for the overall increase in log likelihood, the effect of upstream intrinsic bias on downstream intrinsic bias is explained almost entirely by fine-tuning dataset bias (Appendix C.1).

4.3 Re-sampling and re-scrubbing has little effect on downstream behavior.

Given the strong relation between our proxies for dataset bias and downstream bias, we test whether manipulating these proxies admits some control over downstream bias. For example, were the fine-tuning dataset to include exactly as many she/her nurse biographies as he/him, would the model still exhibit biased performance on that occupation?

Our findings suggest not. No matter the amount of re-sampling, downstream bias remains relatively stable for pre-trained RoBERTa. The distributions of downstream bias with and without re-balancing are almost perfectly correlated (Pearson's ρ = 0.94, p < 0.01—see Appendix B.1). Though co-occurrence statistics help to explain downstream bias, they are still only proxies for dataset bias. Directly altering these statistics via re-sampling the dataset does not alter the sentence-level context in which the words are used.

Based on this result, we also try completely removing mentions of identity terms. Scrubbing mentions of identity terms—in all comments or only in toxic comments—appears to reduce bias only when the model is not pre-trained and all mentions of the term are scrubbed (Figure 5). For a pre-trained model trained on scrubbed data, a 10% decrease in mentions of an identity term corresponds to a 7.2% decrease in FPR.
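For reference, the pronoun re-balancing intervention described above amounts to undersampling the prevalent pronoun group within each occupation of the fine-tuning set. A minimal pandas sketch with hypothetical column names (occupation, pronouns), not the authors' code:

```python
import pandas as pd

def rebalance(bios: pd.DataFrame, alpha: float = 1.0, seed: int = 0) -> pd.DataFrame:
    """Undersample the majority-pronoun group within each occupation.

    alpha=1.0 closes the whole pronoun gap (equal counts); alpha=0.5 closes
    half of it. Assumes hypothetical columns `occupation` and `pronouns`.
    """
    kept = []
    for _, group in bios.groupby("occupation"):
        counts = group["pronouns"].value_counts()
        if len(counts) < 2:  # only one pronoun group present: keep as-is
            kept.append(group)
            continue
        minority, majority = counts.idxmin(), counts.idxmax()
        # Majority-group size after closing alpha of the gap.
        target = int(counts[majority] - alpha * (counts[majority] - counts[minority]))
        kept.append(group[group["pronouns"] == minority])
        kept.append(group[group["pronouns"] == majority].sample(target, random_state=seed))
    return pd.concat(kept).reset_index(drop=True)
```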
[Figure 4 panels: (a) BIOS, coefficient estimates for upstream bias (likelihood gap) and fine-tuning dataset bias (prevalence of she/her), by model group (all pre-trained N=6020, pre-trained N=140, noise added N=1400, balanced N=1400, not pre-trained N=2940, bias-mitigated N=1820). (b) WIKI, coefficient estimates for upstream bias (avg. negative sentiment) and fine-tuning dataset bias (prevalence of toxic mentions), by model group (all pre-trained N=12516, pre-trained N=315, noise added N=6615, scrubbed N=5901, not pre-trained N=3150). x-axis: coefficient.]

Figure 4: Effect of upstream bias vs. fine-tuning dataset bias on downstream bias, controlling for model & template fixed effects. Bars depict heteroskedasticity-consistent standard errors. Statistically insignificant (p < 0.01) coefficients are hollow. For both tasks, reduction in fine-tuning dataset bias corresponds to a greater alteration to downstream bias than an equivalent reduction (accounting for scale) in upstream bias.

We speculate that RoBERTa relies on its high quality feature embeddings to learn proxy biases about identity terms based on the way they are used in the pre-training corpora. For example, our model classifies a sequence containing only the term "gay" as toxic without any context. If a term like "gay" is often used pejoratively on the web, RoBERTa is likely to infer that sentences including "gay" are toxic even if the term never appears in the fine-tuning dataset.

But when the upstream model is not pre-trained, the fine-tuned model has no such prejudices. In this case, removing all mentions of identity results in a distribution of bias entirely uncorrelated with the control (Pearson's ρ = 0.09, p > 0.1). Notably, though, even a small number of mentions of an identity term like "gay" in the fine-tuning dataset are enough for a randomly initialized model to exhibit the same biases as the pre-trained model (Figure 5).

5 Limitations

Our approach comes with several limitations. First, our results may not generalize to all tasks—especially non-classification tasks—or all kinds of bias (e.g., bias against AAVE or non-English speakers). Also, while similar studies of bias have been successfully applied to vision transformers (Steed and Caliskan, 2021; Srinivasan and Uchino, 2021), our results may vary for substrates other than English language.

Second, Goldfarb-Tarrant et al. (2021) conclude that the lack of correlation between intrinsic bias indicators and downstream bias is because some embedding bias metrics are unsuitable for measuring model bias. To ensure our intrinsic and extrinsic metrics measure the same construct, we chose upstream indicators that correlate with real-world occupation statistics (Caliskan et al., 2017; Kurita et al., 2019). Pronoun ranking in particular may be more reliable for transformer models than
[Figure 5: downstream bias (log FPR ratio, −4 to 4) per identity term (e.g., "jewish", "gay") for pre-trained vs. not pre-trained models, under no scrubbing ("None"), scrubbing of toxic mentions only, and scrubbing of all mentions.]

Figure 5: FPR gap (downstream bias) after scrubbing toxic mentions of identity terms from the WIKI fine-tuning dataset. A combination of scrubbing and not pre-training (orange) results in a zero-mean, noticeably re-ordered bias distribution. Scrubbing but still pre-training (purple) results in a bias distribution that is still correlated with the original bias distribution (Pearson's ρ = 0.99, 0.93 for toxic and all respectively, p < 0.01).

6 Conclusion

Our results offer several points of guidance to organizations training and distributing LLMs and the practitioners applying them:

• Attenuating downstream bias via upstream interventions—including embedding-space bias mitigation—is mostly futile in the cases we study and may be fruitless in similar settings.

• For a typical pre-trained model trained for the tasks we study, the fine-tuning dataset plays a much larger role than upstream bias in determining downstream harms.

• Still, simply modulating co-occurrence statistics (e.g., by scrubbing harmful mentions of certain identities) is not sufficient. Task framing, design, and data quality are also very important for preventing harm.

• If a model is pre-trained, it may be more resistant to scrubbing, re-balancing, and other simple modulations of the fine-tuning dataset.

But our results also corroborate a nascent, somewhat optimistic view of pre-training bias. LLMs' intrinsic biases are harmful even before downstream applications, and correcting those biases is not guaranteed to prevent downstream harms. Increased emphasis on the role of fine-tuning dataset bias offers an opportunity for practitioners to shift to a more careful, quality-focused and context-aware approach to NLP applications (Zhu et al., 2018; Scheuerman et al., 2021).
other metrics (Silva et al., 2021). Still, downstream annotator prejudices and other label biases could skew our extrinsic bias metrics as well (Davani et al., 2021).

Third, there may be other explanations for the relationship between upstream and downstream bias: for example, decreasing the magnitude of upstream bias often requires a reduction in model accuracy, though we attempt to control for between-model variation with fixed effects and other controls. Alternate regression specifications included in Appendix C.1 show how our results change with the inclusion of controls.

Ethical Considerations

This study navigates several difficult ethical issues in NLP ethics research. First, unlike prior work, we do not claim to measure gender biases—only biases related to someone's choice of personal pronouns. However, our dataset is limited to the English "he/him" and "she/her," so our results do not capture biases against other pronouns. Our study is also very Western-centric: we study only English models/datasets and test for biases considered normatively pressing in Western research. Second, our training data (including pre-training datasets) was almost entirely scraped from internet users without compensation or explicit consent. To avoid exploiting these users further, we only used already-scraped data and replicated already-existing classifiers, and we do not release these
data or classifiers publicly. Finally, the models we
trained exhibit toxic, offensive behavior. These
models and datasets are intended only for studying
bias and simulating harms and, as our results show,
should not be deployed or applied to any other data
except for this purpose.

Acknowledgements
Thanks to Maria De-Arteaga and Benedikt Boeck-
ing for assistance with B IOS data collection.
Thanks also to the reviewers for their helpful com-
ments and feedback.

References Tolga Bolukbasi, Kai-Wei Chang, James Y Zou,
Venkatesh Saligrama, and Adam T Kalai. 2016.
Francesco Barbieri, Jose Camacho-Collados, Luis Es- Man is to Computer Programmer as Woman is to
pinosa Anke, and Leonardo Neves. 2020. Tweet- Homemaker? Debiasing Word Embeddings. In D D
Eval: Unified Benchmark and Comparative Evalu- Lee, M Sugiyama, U V Luxburg, I Guyon, and
ation for Tweet Classification. In Findings of the R Garnett, editors, Advances in Neural Information
Association for Computational Linguistics: EMNLP Processing Systems 29, pages 4349–4357. Curran
2020, pages 1644–1650, Online. Association for Associates, Inc.
Computational Linguistics.
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ
Solon Barocas, Kate Crawford, Aaron Shapiro, and
Altman, Simran Arora, Sydney von Arx, Michael S.
Hanna Wallach. 2017. The problem with bias: Al-
Bernstein, Jeannette Bohg, Antoine Bosselut, Emma
locative versus representational harms in machine
Brunskill, Erik Brynjolfsson, Shyamal Buch, Dal-
learning. In 9th Annual Conference of the Special
las Card, Rodrigo Castellon, Niladri Chatterji, An-
Interest Group for Computing, Information and So-
nie Chen, Kathleen Creel, Jared Quincy Davis, Dora
ciety.
Demszky, Chris Donahue, Moussa Doumbouya,
Esin Durmus, Stefano Ermon, John Etchemendy,
Solon Barocas, Moritz Hardt, and Arvind Narayanan. Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor
2019. Fairness and Machine Learning. fairml- Gale, Lauren Gillespie, Karan Goel, Noah Good-
book.org. man, Shelby Grossman, Neel Guha, Tatsunori
Hashimoto, Peter Henderson, John Hewitt, Daniel E.
Marion Bartl, Malvina Nissim, and Albert Gatt. 2020. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas
Unmasking Contextual Stereotypes: Measuring and Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri,
Mitigating BERT’s Gender Bias. In Proceedings Siddharth Karamcheti, Geoff Keeling, Fereshte
of the Second Workshop on Gender Bias in Natu- Khani, Omar Khattab, Pang Wei Kohd, Mark Krass,
ral Language Processing, pages 1–16, Barcelona, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar,
Spain (Online). Association for Computational Lin- Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec,
guistics. Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu
Ma, Ali Malik, Christopher D. Manning, Suvir Mir-
Emily M. Bender, Timnit Gebru, Angelina McMillan- chandani, Eric Mitchell, Zanele Munyikwa, Suraj
Major, and Shmargaret Shmitchell. 2021. On the Nair, Avanika Narayan, Deepak Narayanan, Ben
Dangers of Stochastic Parrots: Can Language Mod- Newman, Allen Nie, Juan Carlos Niebles, Hamed
els Be Too Big? In Proceedings of the 2021 ACM Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr,
Conference on Fairness, Accountability, and Trans- Isabel Papadimitriou, Joon Sung Park, Chris Piech,
parency, FAccT ’21, pages 610–623, New York, NY, Eva Portelance, Christopher Potts, Aditi Raghu-
USA. Association for Computing Machinery. nathan, Rob Reich, Hongyu Ren, Frieda Rong,
Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christo-
Abeba Birhane and Vinay Uday Prabhu. 2021. Large pher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav
image datasets: A pyrrhic win for computer vision? Santhanam, Andy Shih, Krishnan Srinivasan, Alex
In 2021 IEEE Winter Conference on Applications of Tamkin, Rohan Taori, Armin W. Thomas, Florian
Computer Vision (WACV), pages 1536–1546. ISSN: Tramèr, Rose E. Wang, William Wang, Bohan Wu,
2642-9381. Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michi-
hiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael
Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang,
Hanna Wallach. 2020. Language (Technology) is Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2021.
Power: A Critical Survey of “Bias” in NLP. In On the Opportunities and Risks of Foundation Mod-
Proceedings of the 58th Annual Meeting of the Asso- els. arXiv:2108.07258 [cs]. ArXiv: 2108.07258.
ciation for Computational Linguistics, pages 5454–
5476, Online. Association for Computational Lin-
guistics. Aylin Caliskan, Joanna J. Bryson, and Arvind
Narayanan. 2017. Semantics derived automatically
Su Lin Blodgett and Brendan O’Connor. 2017. Racial from language corpora contain human-like biases.
Disparity in Natural Language Processing: A Case Science, 356(6334):183–186.
Study of Social Media African-American English.
arXiv:1707.00061 [cs]. ArXiv: 1707.00061. Yang Trista Cao and Hal Daumé, III. 2021. Toward
Gender-Inclusive Coreference Resolution: An Anal-
Su Lin Blodgett, Johnny Wei, and Brendan O’Connor. ysis of Gender and Bias Throughout the Machine
2018. Twitter Universal Dependency Parsing for Learning Lifecycle*. Computational Linguistics,
African-American and Mainstream American En- 47(3):615–661.
glish. In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics (Vol- Aida Mostafazadeh Davani, Mohammad Atari, Bren-
ume 1: Long Papers), pages 1415–1425, Melbourne, dan Kennedy, and Morteza Dehghani. 2021. Hate
Australia. Association for Computational Linguis- Speech Classifiers Learn Human-Like Social Stereo-
tics. types. arXiv:2110.14839 [cs]. ArXiv: 2110.14839.

Maria De-Arteaga, Alexey Romanov, Hanna Wal- Linguistics: Human Language Technologies, pages
lach, Jennifer Chayes, Christian Borgs, Alexandra 3770–3783, Online. Association for Computational
Chouldechova, Sahin Geyik, Krishnaram Kentha- Linguistics.
padi, and Adam Tauman Kalai. 2019. Bias in Bios:
A Case Study of Semantic Representation Bias in a Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W
High-Stakes Setting. In Proceedings of the Confer- Black, and Yulia Tsvetkov. 2019. Measuring Bias
ence on Fairness, Accountability, and Transparency, in Contextualized Word Representations. In Pro-
FAT* ’19, pages 120–128, New York, NY, USA. As- ceedings of the First Workshop on Gender Bias in
sociation for Computing Machinery. Natural Language Processing, pages 166–172, Flo-
rence, Italy. Association for Computational Linguis-
Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, tics (ACL).
and Lucy Vasserman. 2018. Measuring and Mitigat-
ing Unintended Bias in Text Classification. In Pro- Paul Pu Liang, Irene Mengze Li, Emily Zheng,
ceedings of the 2018 AAAI/ACM Conference on AI, Yao Chong Lim, Ruslan Salakhutdinov, and Louis-
Ethics, and Society, pages 67–73, New Orleans LA Philippe Morency. 2020. Towards Debiasing Sen-
USA. ACM. tence Representations. In Proceedings of the 58th
Aparna Garimella, Carmen Banea, Dirk Hovy, and Annual Meeting of the Association for Computa-
Rada Mihalcea. 2019. Women’s Syntactic Re- tional Linguistics, pages 5502–5515, Online. Asso-
silience and Men’s Grammatical Luck: Gender-Bias ciation for Computational Linguistics.
in Part-of-Speech Tagging and Dependency Parsing.
In Proceedings of the 57th Annual Meeting of the Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
Association for Computational Linguistics, pages dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
3493–3498, Florence, Italy. Association for Compu- Luke Zettlemoyer, and Veselin Stoyanov. 2019.
tational Linguistics. RoBERTa: A Robustly Optimized BERT Pretrain-
ing Approach. arXiv:1907.11692 [cs]. ArXiv:
Samuel Gehman, Suchin Gururangan, Maarten Sap, 1907.11692.
Yejin Choi, and Noah A. Smith. 2020. RealToxic-
ityPrompts: Evaluating Neural Toxic Degeneration David P. MacKinnon, Jennifer L. Krull, and Chon-
in Language Models. In Findings of the Association dra M. Lockwood. 2000. Equivalence of the Me-
for Computational Linguistics: EMNLP 2020, pages diation, Confounding and Suppression Effect. Pre-
3356–3369, Online. Association for Computational vention Science, 1(4):173–181.
Linguistics.
Chandler May, Alex Wang, Shikha Bordia, Samuel R.
Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ri- Bowman, and Rachel Rudinger. 2019. On Measur-
cardo Muñoz Sánchez, Mugdha Pandya, and Adam ing Social Biases in Sentence Encoders. In Proceed-
Lopez. 2021. Intrinsic Bias Metrics Do Not Corre- ings of the 2019 Conference of the North, pages 622–
late with Application Bias. In Proceedings of the 628, Stroudsburg, PA, USA. Association for Compu-
59th Annual Meeting of the Association for Compu- tational Linguistics.
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Vol- Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Re-
ume 1: Long Papers), pages 1926–1940, Online. As- ducing Gender Bias in Abusive Language Detec-
sociation for Computational Linguistics. tion. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing,
Wei Guo and Aylin Caliskan. 2021. Detecting Emer- pages 2799–2804, Brussels, Belgium. Association
gent Intersectional Biases: Contextualized Word for Computational Linguistics.
Embeddings Contain a Distribution of Human-like
Biases. In Proceedings of the 2021 AAAI/ACM Con- Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi,
ference on AI, Ethics, and Society, pages 122–133. and Noah A. Smith. 2019. The Risk of Racial Bias
Association for Computing Machinery, New York, in Hate Speech Detection. In Proceedings of the
NY, USA. 57th Annual Meeting of the Association for Com-
Ben Hutchinson, Vinodkumar Prabhakaran, Emily putational Linguistics, pages 1668–1678, Florence,
Denton, Kellie Webster, Yu Zhong, and Stephen De- Italy. Association for Computational Linguistics.
nuyl. 2020. Social Biases in NLP Models as Barriers
for Persons with Disabilities. In Proceedings of the Morgan Klaus Scheuerman, Alex Hanna, and Emily
58th Annual Meeting of the Association for Compu- Denton. 2021. Do Datasets Have Politics? Dis-
tational Linguistics, pages 5491–5501, Online. As- ciplinary Values in Computer Vision Dataset De-
sociation for Computational Linguistics. velopment. Proceedings of the ACM on Human-
Computer Interaction, 5(CSCW2):317:1–317:37.
Xisen Jin, Francesco Barbieri, Brendan Kennedy, Aida
Mostafazadeh Davani, Leonardo Neves, and Xiang Andrew Silva, Pradyumna Tambwekar, and Matthew
Ren. 2021. On Transferability of Bias Mitigation Gombolay. 2021. Towards a Comprehensive Under-
Effects in Language Model Fine-Tuning. In Pro- standing and Accurate Evaluation of Societal Biases
ceedings of the 2021 Conference of the North Amer- in Pre-Trained Transformers. In Proceedings of the
ican Chapter of the Association for Computational 2021 Conference of the North American Chapter of

the Association for Computational Linguistics: Hu-
man Language Technologies, pages 2383–2389, On-
line. Association for Computational Linguistics.
Irene Solaiman and Christy Dennison. 2021. Process
for adapting language models to society (palms)
with values-targeted datasets. In Pre-Proceedings of
Advances in Neural Information Processing Systems,
volume 34.
Ramya Srinivasan and Kanji Uchino. 2021. Biases in
Generative Art: A Causal Look from the Lens of Art
History. In Proceedings of the 2021 ACM Confer-
ence on Fairness, Accountability, and Transparency,
FAccT ’21, pages 41–51, New York, NY, USA. As-
sociation for Computing Machinery.
Ryan Steed and Aylin Caliskan. 2021. Image Repre-
sentations Learned With Unsupervised Pre-Training
Contain Human-like Biases. In Proceedings of the
2021 ACM Conference on Fairness, Accountability,
and Transparency, FAccT ’21, pages 701–713, New
York, NY, USA. Association for Computing Machin-
ery.

Emma Strubell, Ananya Ganesh, and Andrew McCal-


lum. 2019. Energy and Policy Considerations for
Deep Learning in NLP. In Proceedings of the 57th
Annual Meeting of the Association for Computa-
tional Linguistics, pages 3645–3650, Florence, Italy.
Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien


Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Remi Louf, Morgan Funtow-
icz, Joe Davison, Sam Shleifer, Patrick von Platen,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,
Teven Le Scao, Sylvain Gugger, Mariama Drame,
Quentin Lhoest, and Alexander Rush. 2020. Trans-
formers: State-of-the-Art Natural Language Process-
ing. In Proceedings of the 2020 Conference on Em-
pirical Methods in Natural Language Processing:
System Demonstrations, pages 38–45, Online. Asso-
ciation for Computational Linguistics.

Haoran Zhang, Amy X. Lu, Mohamed Abdalla,


Matthew McDermott, and Marzyeh Ghassemi. 2020.
Hurtful words: quantifying biases in clinical con-
textual word embeddings. In Proceedings of the
ACM Conference on Health, Inference, and Learn-
ing, CHIL ’20, pages 110–120, New York, NY, USA.
Association for Computing Machinery.
Haiyi Zhu, Bowen Yu, Aaron Halfaker, and Loren
Terveen. 2018. Value-Sensitive Algorithm De-
sign: Method, Case Study, and Lessons. Proceed-
ings of the ACM on Human-Computer Interaction,
2(CSCW):194:1–194:23.

A Descriptive Statistics

BIOS.— Biographies include the 28-most frequent occupations according to the BLS Standard Occupation Classification system.[6] See Figure 6 for a full list of occupations and the prevalence of each set of pronouns.

Figure 6: Frequency of occupations in BIOS dataset.

Upstream bias (measured with pronoun ranking) is depicted in Figure 7. Table 1 gives the full list of templates used for testing. Traditionally female occupations (e.g., "nurse") are generally biased towards "she/her," with some exceptions (e.g., "software engineer"). Downstream bias is similarly distributed—Figure 9 depicts the relationship between upstream and downstream bias, which is generally linear (Pearson's ρ = 0.43, p < 0.05). There are noticeable outliers (e.g., "surgeon") for which real-world harms could be especially acute.

Figure 7: Distribution of pronoun ranking (upstream) bias across occupations in the BIOS task.

WIKI.— Identity terms occur relatively infrequently in the dataset and some terms appear in toxic comments more often than others. See Figure 8 for a full list of identity terms tested and their prevalence in toxic and non-toxic comments.

Figure 8: Frequency of comments mentioning each identity term in the WIKI dataset.

Upstream and downstream bias have an even stronger relationship for WIKI than for BIOS (ρ = 0.59, p < 0.01). The terms "homosexual" and "gay" have especially high biases both upstream and downstream, likely because of their use as common pejoratives on certain internet spaces represented in the pre-training and fine-tuning datasets.

B Additional Results

B.1 Partial Interventions

This section gives the full results for each partial intervention (e.g., scrubbing only half of the toxic mentions, rather than all of them). Figure 10 shows downstream bias at each level of pronoun re-balancing. To measure the effects of different levels of gender balance, we either randomly over-sample the minority group to close α percent of the gap between the majority and minority pronouns, randomly under-sample the majority group to close a given percentage of the gap, or some combination of both.

Figure 11 shows the full set of partial scrubbing interventions tested; notably, there is a slight convergence to zero in downstream bias when the model is not pre-trained.

[6] https://www.bls.gov/soc/
Bias               | Template                                          | Targets                                                 | Source
Occupation-Pronoun | [TARGET] is a(n) [OCCUPATION]                     | he, she                                                 | Kurita et al. (2019)
Occupation-Pronoun | [TARGET] works as a(n) [OCCUPATION]               | he, she                                                 | Bartl et al. (2020)
Occupation-Pronoun | [TARGET] wants to become a(n) [OCCUPATION]        | he, she                                                 | Bartl et al. (2020)
Occupation-Pronoun | [TARGET] applied for the position of [OCCUPATION] | he, she                                                 | Bartl et al. (2020)
Identity-Sentiment | [IDENTITY] [TARGET] are [MASK]                    | people                                                  | Dixon et al. (2018)
Identity-Sentiment | [IDENTITY] [TARGET] is [MASK]                     | my sibling, my friend, my parent, my partner, my spouse | Dixon et al. (2018)

Table 1: Templates used for bias measurement.
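To illustrate how an occupation-pronoun template is scored, the sketch below fills a template with the public roberta-base checkpoint and compares masked-token probabilities for "she" and "he", in the spirit of Kurita et al. (2019). It omits the prior-probability correction from Eq. 2 and is an approximation of the procedure, not the authors' scoring code.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

def pronoun_prob(occupation: str, pronoun: str) -> float:
    """P(pronoun at the masked slot) for the template '<mask> is a {occupation}'."""
    text = f"{tok.mask_token} is a {occupation}"
    enc = tok(text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]
    probs = logits.softmax(-1)
    token_id = tok.convert_tokens_to_ids(tok.tokenize(" " + pronoun))[0]
    return probs[token_id].item()

# Raw association (no prior correction) for one template and occupation:
ratio = pronoun_prob("surgeon", "she") / pronoun_prob("surgeon", "he")
```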

Figure 9: Correlation between upstream and downstream bias across occupations (BIOS) and identity terms (WIKI). ρ is the Pearson correlation coefficient.

[Figure 10: downstream bias (TPR ratio: she/her − he/him) per occupation (e.g., Nurse, DJ, Surgeon) vs. % pronoun gap reduced with undersampling (0-100).]

Figure 10: TPR gap (downstream bias) after balancing pronouns within each occupation of the BIOS fine-tuning dataset. As shown, balancing pronoun prevalence has little effect on downstream bias.

B.2 Correlation Tests

To quantify the effect of our experimental interventions on upstream and downstream bias, we compared the distribution of bias after each intervention to the distribution of bias after an unmodified pre-trained model. We tested for statistical correlation between these two distributions with both Pearson's correlation coefficient ρ and Kendall's correlation coefficient τ. For Pearson's, we assume that the two distributions are approximately normally distributed. This assumption seems reasonable because our samples are not too small (N = 28 and N = 50 for BIOS and WIKI, respectively), but a Shapiro-Wilk test of normality shows that downstream bias for both tasks is likely non-normal (W = 0.81 and W = 0.67 for TPR ratio and FPR ratio respectively, p < 0.01). So, we also compute Kendall's correlation coefficient τ, which is a nonparametric test of ordinal association. The results are similar in magnitude and significance (Table 2).

C Full Regression Results

Tables 3 and 4 report the full set of coefficient estimates used to generate Figure 4 and the effects described in the paper. We use HC3 heteroskedasticity-consistent standard errors.[7]

The Variance Inflation Factors (VIF) for the covariates are all less than 2.5 for an unmodified pre-trained model for both tasks (a sign that multicollinearity may not be too severe).

[Figure 11 panels: Pre-trained vs. Not pre-trained columns; rows Toxic Mentions Only and All Mentions; downstream bias (log FPR ratio, −2.5 to 5.0) vs. % mentions scrubbed (0-100).]

Figure 11: FPR gap (downstream bias) after scrubbing toxic mentions of identity terms from the WIKI fine-tuning dataset.

The fixed effects regression requires a few assumptions for unbiased, normally-converging estimates. First, we assume that the error is uncorrelated with every covariate (i.e., there are no omitted variables; we discuss this possibility in the limitations section). Second, we assume that the samples are independent and identically distributed (independence is assured by our experimental setup, which varies one factor at a time). Third, we assume that large outliers are unlikely (evident from the distribution plots presented).

C.1 Additional Specifications

We tested several regression specifications on just the unmodified, pre-trained model (Tables 5 and 6). For BIOS, note that the direct and indirect (after controlling for dataset bias) effects of upstream bias on downstream bias have opposite signs. The change is the effect of including dataset bias, a collinear confounder, in the regression. Confounders can be interpreted as "explaining" the relationship between the independent (upstream) and dependent (downstream) variables (MacKinnon et al., 2000).

To test whether the effect observed is mediated by a change in the model's weights, we also include an estimate of the effect of upstream intrinsic bias (e.g., from pronoun ranking) on downstream intrinsic bias (intrinsic bias, measured after fine-tuning). We control for the overall increase in log likelihood by including in the regression the difference in log likelihood of the neutral pronouns "they/them" before and after fine-tuning. We find that a similar relationship holds between upstream bias and intrinsic bias downstream as holds between upstream bias and extrinsic bias downstream, suggesting that the model's internal representations change in concert with its downstream behavior.

C.2 Identity Ranking - Robustness Check

Because of the limitations of the sentiment-based approach, we check the robustness of our results with an identity ranking approach based on pronoun ranking. Included in the Dixon et al. (2018) study of toxicity classification bias is an extensive evaluation set composed of 89,000 templates such as "[IDENTITY] is [ATTRIBUTE]," where the attributes include both positive (for non-toxic examples) and extremely negative words (for toxic

[7] For a simple OLS specification on WIKI, the Breusch-Pagan test rejects the hypothesis that our errors are homoskedastic with BP = 27.039, p < 10^-3. For BIOS, the hypothesis is not rejected (BP = 5.033, p = 0.41).
Intervention                                   | Upstream Bias: Pearson's ρ | Upstream Bias: Kendall's τ | Downstream Bias: Pearson's ρ | Downstream Bias: Kendall's τ
BIOS (upstream: pronoun ranking; downstream: TPR ratio)
Not pre-trained                                | -0.08   | 0.04    | 0.93*** | 0.74***
Uniform noise                                  | 0.90*** | 0.62*** | 0.67*** | 0.60***
Gaussian noise                                 | 0.29    | 0.09    | 0.90*** | 0.56***
SENTDEBIAS (γ = 50)                            | 0.87*** | 0.61*** | 0.96*** | 0.77***
Dataset re-balancing (1.0)                     |         |         | 0.94*** | 0.85***
WIKI (upstream: negative sentiment; downstream: FPR ratio)
Not pre-trained                                | -0.21   | -0.14   | 0.64*** | 0.39***
Uniform noise                                  | 0.56*** | 0.37*** | 0.91*** | 0.56***
Gaussian noise                                 | 0.36*** | 0.24**  | 0.78*** | 0.45***
Scrubbing toxic mentions                       |         |         | 0.99*** | 0.85***
Scrubbing all mentions                         |         |         | 0.93*** | 0.77***
Scrubbing toxic mentions, not pre-trained      | 0.30**  | 0.21**  | -0.11   | -0.07
Scrubbing all mentions, not pre-trained        | 0.30**  | 0.21**  | 0.09    | 0.10
Note: * p<0.1; ** p<0.05; *** p<0.01

Table 2: Correlation between bias distributions before and after each intervention. Statistically insignificant correlation coefficients indicate bias has changed drastically (red). Notably, downstream bias is correlated with the control to some extent for every intervention except for scrubbing and not pre-training. Pearson's correlation coefficient ρ is a measure of correlation strength and direction; Kendall's τ is a measure of ordinal correlation. Randomized interventions (e.g., not pre-training, adding noise) tend to re-order the bias distribution more than others, indicated by a lower τ.
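The comparisons in Table 2 and Appendix B.2 map directly onto scipy: Pearson's ρ, Kendall's τ, and the Shapiro-Wilk normality check. A sketch with synthetic per-occupation bias values (assumed inputs, not the paper's data):

```python
import numpy as np
from scipy import stats

def compare_bias_distributions(control: np.ndarray, treated: np.ndarray):
    """Pearson's rho, Kendall's tau, and a Shapiro-Wilk normality check."""
    pearson = stats.pearsonr(control, treated)    # assumes roughly normal marginals
    kendall = stats.kendalltau(control, treated)  # nonparametric, ordinal association
    shapiro = stats.shapiro(treated)              # W close to 1 suggests normality
    return pearson, kendall, shapiro

# Example with synthetic values for N=28 occupations (the BIOS setting):
rng = np.random.default_rng(1)
control = rng.normal(size=28)
treated = 0.9 * control + rng.normal(scale=0.3, size=28)
print(compare_bias_distributions(control, treated))
```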

Table 3: Effect of upstream on downstream bias for pre-trained RoBERTa on the B IOS task. Panel linear models
include model fixed effects.

Dependent variable:
Log TPR ratio (downstream bias)
OLS panel
linear
Pre-trained Mitigated Noise added Random Balanced All pre-trained
(1) (2) (3) (4) (5) (6)
Likelihood gap (upstream bias) 0.068⇤⇤⇤ 0.058⇤⇤⇤ 0.063⇤ 0.013 0.016⇤⇤ 0.018⇤⇤⇤
(0.018) (0.005) (0.038) (0.011) (0.007) (0.007)

Prevalance of she/her 0.485⇤⇤⇤ 0.458⇤⇤⇤ 0.534⇤⇤⇤ 0.739⇤⇤⇤ 0.820⇤⇤⇤ 0.633⇤⇤⇤


(0.043) (0.011) (0.016) (0.048) (0.036) (0.027)

Constant 0.090⇤⇤⇤
(0.029)

Template Dummies? Yes Yes Yes Yes Yes Yes


Observations 140 1,820 1,400 2,940 1,400 6,020
R2 0.500 0.489 0.438 0.075 0.297 0.085
Adjusted R2 0.477 0.484 0.432 0.067 0.289 0.078
Residual Std. Error 0.109 (df = 133)
F Statistic 22.149⇤⇤⇤ (df = 6; 133) 287.282⇤⇤⇤ (df = 6; 1801) 179.585⇤⇤⇤ (df = 6; 1384) 39.276⇤⇤⇤ (df = 6; 2913) 97.257⇤⇤⇤ (df = 6; 1384) 92.307⇤⇤⇤ (df = 6; 5971)
Note: ⇤ p<0.1; ⇤⇤ p<0.05; ⇤⇤⇤ p<0.01

Table 4: Effect of upstream on downstream bias for pre-trained RoBERTa on the W IKI task. Panel linear models
include model fixed effects.

Dependent variable:
FPR
OLS panel
linear
Pre-trained Noise added Random Scrubbed All pre-trained
(1) (2) (3) (4) (5)
Avg. negative sentiment (upstream bias) 0.591⇤⇤⇤ 0.255⇤⇤⇤ 0.029 0.571⇤⇤⇤ 0.376⇤⇤⇤
(0.107) (0.027) (0.084) (0.025) (0.017)

Prevalance of toxic mentions 0.650⇤⇤⇤ 0.556⇤⇤⇤ 0.425⇤⇤⇤ 0.716⇤⇤⇤ 0.626⇤⇤⇤


(0.068) (0.013) (0.015) (0.018) (0.011)

Prevalance of identity term 5.024 1.575⇤⇤ 6.526⇤⇤⇤ 7.274⇤⇤⇤ 1.231⇤⇤


(3.740) (0.708) (0.815) (1.086) (0.612)

Avg. length of toxic mentions 0.373⇤⇤⇤


(0.077)

Template Dummies? Yes Yes Yes Yes Yes


Observations 315 6,615 3,150 5,901 12,516
R2 0.296 0.225 0.210 0.283 0.241
Adjusted R2 0.276 0.221 0.206 0.279 0.238
Residual Std. Error 0.221 (df = 305)
F Statistic 14.283⇤⇤⇤ (df = 9; 305) 211.998⇤⇤⇤ (df = 9; 6585) 92.711⇤⇤⇤ (df = 9; 3131) 257.043⇤⇤⇤ (df = 9; 5873) 439.225⇤⇤⇤ (df = 9; 12467)
Note: ⇤ p<0.1; ⇤⇤ p<0.05; ⇤⇤⇤ p<0.01

Table 5: Effect of upstream on downstream bias for pre-trained RoBERTa on the B IOS task.

Dependent variable:
TPR ratio (downstream bias) Likelihood gap after fine-tuning (intermediate bias)
OLS panel panel
linear linear
(1) (2) (3) (4) (5)
Likelihood gap (upstream bias) 0.043⇤⇤ 0.068⇤⇤⇤ 0.043⇤⇤ 0.068⇤⇤⇤ 0.046
(0.021) (0.018) (0.021) (0.018) (0.575)

Prevalance of she/her 0.485⇤⇤⇤ 0.485⇤⇤⇤ 10.311⇤⇤⇤


(0.043) (0.043) (1.424)

Difference in they/them log likelihood before and after pre-training 0.206⇤⇤


(0.086)

Constant 0.013 0.090⇤⇤⇤


(0.039) (0.029)

Template dummies? Yes Yes No No No


Observations 140 140 140 140 140
R2 0.031 0.500 0.031 0.500 0.366
Adjusted R2 0.005 0.477 0.005 0.477 0.332
Residual Std. Error 0.151 (df = 134) 0.109 (df = 133)
F Statistic 0.856 (df = 5; 134) 22.149⇤⇤⇤ (df = 6; 133) 4.279⇤⇤ (df = 1; 134) 66.447⇤⇤⇤ (df = 2; 133) 25.386⇤⇤⇤ (df = 3; 132)
Note: ⇤ p<0.1; ⇤⇤ p<0.05; ⇤⇤⇤ p<0.01

Table 6: Effect of upstream on downstream bias for pre-trained RoBERTa on the W IKI task.

Dependent variable:
FPR Avg. negative sentiment after fine-tuning (intermediate bias)
OLS panel panel
linear linear
(1) (2) (3) (4) (5) (6)
Avg. negative sentiment (upstream bias) 0.569⇤⇤⇤ 0.568⇤⇤⇤ 0.586⇤⇤⇤ 0.569⇤⇤⇤ 0.591⇤⇤⇤ 0.004
(0.110) (0.106) (0.108) (0.110) (0.107) (0.014)

Prevalance of toxic mentions 0.657⇤⇤⇤ 0.654⇤⇤⇤ 0.650⇤⇤⇤ 0.020⇤⇤


(0.068) (0.070) (0.068) (0.009)

Prevalance of identity term 4.973 5.024 0.656


(3.749) (3.740) (0.481)

Avg. length of toxic mentions 0.00001


(0.00002)

Constant 0.281⇤⇤⇤ 0.371⇤⇤⇤ 0.375⇤⇤⇤


(0.079) (0.077) (0.078)

Template Dummies? Yes Yes Yes No No No


Observations 350 315 315 350 315 315
R2 0.073 0.292 0.297 0.073 0.296 0.024
Adjusted R2 0.054 0.274 0.274 0.054 0.276 0.004
Residual Std. Error 0.240 (df = 342) 0.221 (df = 306) 0.221 (df = 304)
F Statistic 3.829⇤⇤⇤ (df = 7; 342) 15.801⇤⇤⇤ (df = 8; 306) 12.826⇤⇤⇤ (df = 10; 304) 26.802⇤⇤⇤ (df = 1; 342) 42.848⇤⇤⇤ (df = 3; 305) 2.551⇤ (df = 3; 305)
Note: ⇤ p<0.1; ⇤⇤ p<0.05; ⇤⇤⇤ p<0.01

examples). Templates of other forms are not included to reduce computation time.[8] For each of these templates, we mask the identity term and compute the log probability score as described in §3.2. The model's bias is described by the difference between the average log probability scores for the toxic templates and the non-toxic templates for each identity term.

For the regressions (Tables 7 and 8), the templates are not paired, so we average first across toxic and non-toxic templates, then calculate the ratio between the two. The relative size and statistical significance of the coefficients are the same as for the negative sentiment approach, suggesting the negative sentiment metric is robust for our purposes despite its limitations.

[8] A full list of these templates can be found in (Dixon et al., 2018) or https://github.com/conversationai/unintended-ml-bias-analysis.

D Replication

We provide our full results (upstream and downstream bias for every intervention, for each task) and the scripts used to analyse them. We are not allowed to release the source code used to train our models and measure bias, but we include additional details on our implementation to help others understand and replicate our results.

• Our code for the pronoun ranking tests is adapted from Zhang et al. (2020)'s implementation available at https://github.com/MLforHealth/HurtfulWords.

• Our code for SENTDEBIAS is adapted from the original authors' (Liang et al., 2020), available at https://github.com/pliang279/sent_debias.

• Epochs and other parameters were chosen to match prior work on the same tasks (Jin et al., 2021). We train with 5 epochs, batch sizes 16 and 64 for training and evaluation respectively, and a learning rate of 5e-6. Otherwise, we use the default hyper-parameters for roberta-base (https://huggingface.co/roberta-base); a comparable configuration is sketched after this list.

• Code for scraping the BIOS dataset is provided by the original authors at https://github.com/microsoft/biosbias. The WIKI dataset is available at https://github.com/conversationai/unintended-ml-bias-analysis.

• Fine-tuning a single model for either task takes from 4-6 hours on a single NVIDIA Tesla V100 16GB GPU. Our results include approximately 60 model permutations for a total of 240-360 GPU hours. roberta-base has 125M parameters, but we did not pre-train any models from scratch.
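The hyper-parameters listed above can be expressed as a Hugging Face Trainer configuration. This is a hedged sketch of a comparable setup with dataset loading and preprocessing omitted, not the authors' unreleased training script.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=28)  # 28 occupations for BIOS; 2 labels for WIKI

args = TrainingArguments(
    output_dir="bias-transfer-finetune",   # hypothetical output path
    num_train_epochs=5,                    # epochs as listed in Appendix D
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=5e-6,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```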
Table 7: Effect of upstream on downstream bias for pre-trained RoBERTa on the W IKI task.

Dependent variable:
FPR
(1) (2) (3)
Avg. log likelihood ratio (upstream bias) 0.399⇤⇤ 0.292 0.286
(0.192) (0.176) (0.187)

Prevalance of toxic mentions 0.624⇤⇤⇤ 0.628⇤⇤⇤


(0.180) (0.186)

Prevalance of identity term 0.00001


(0.0001)

Avg. length of toxic mentions 0.096⇤⇤⇤ 0.008 0.004


(0.034) (0.040) (0.058)

Observations 50 50 50
R2 0.083 0.269 0.269
Adjusted R2 0.064 0.238 0.222
Residual Std. Error 0.241 (df = 48) 0.217 (df = 47) 0.220 (df = 46)
F Statistic 4.345⇤⇤ (df = 1; 48) 8.649⇤⇤⇤ (df = 2; 47) 5.648⇤⇤⇤ (df = 3; 46)
Note: ⇤ p<0.1; ⇤⇤ p<0.05; ⇤⇤⇤ p<0.01

Table 8: Effect of upstream on downstream bias for pre-trained RoBERTa on the W IKI task.

Dependent variable:
FPR ratio (downstream bias)
OLS panel
linear
Pre-trained Noise added Random Scrubbed All pre-trained
(1) (2) (3) (4) (5)
Avg. log likelihood ratio (upstream bias) 0.292 0.073⇤⇤ 0.165 0.301⇤⇤⇤ 0.184⇤⇤⇤
(0.176) (0.032) (0.167) (0.039) (0.022)

Prevalance of toxic mentions 0.624⇤⇤⇤ 0.558⇤⇤⇤ 0.421⇤⇤⇤ 0.688⇤⇤⇤ 0.536⇤⇤⇤


(0.180) (0.034) (0.039) (0.046) (0.021)

Prevalance of identity term 2.611 1.074 2.682⇤⇤


(1.777) (2.700) (1.177)

Constant 0.008
(0.040)

Observations 50 1,050 500 1,000 3,050


R2 0.269 0.216 0.196 0.247 0.200
Adjusted R2 0.238 0.198 0.178 0.230 0.183
Residual Std. Error 0.217 (df = 47)
F Statistic 8.649⇤⇤⇤ (df = 2; 47) 94.264⇤⇤⇤ (df = 3; 1026) 59.616⇤⇤⇤ (df = 2; 488) 106.960⇤⇤⇤ (df = 3; 977) 248.602⇤⇤⇤ (df = 3; 2986)
Note: ⇤ p<0.1; ⇤⇤ p<0.05; ⇤⇤⇤ p<0.01

