Toxic Language Detection in Email Communications
Meghana Moorthy Bhat ∗† Saghar Hosseini‡ Ahmed Hassan Awadallah‡
Paul N. Bennett‡ Weisheng Li§
† The Ohio State University   ‡ Microsoft Research   § Microsoft
[email protected], {sahoss, hassanam, pauben, weishli}@microsoft.com
Abstract
Warning: this paper contains content that may be offensive or upsetting.
Table 2: Sub-type categories of toxic language that we developed based on the literature and email conversations. Examples demonstrate that the phenomenon is complex and differs from offensive text or negative sentiment.

been extensively studied in the research community. We follow a similar definition of offensive language to Zampieri et al. (2019b), which refers to any form of unacceptable language used to insult a targeted individual or group. In our setting, we define offensive language to include five broad categories: profanity, bullying, harassment, discrimination, and violence.

3.2 Annotation task

We design a hierarchical annotation framework to collect instances of sentences in an email and the corresponding labels on a crowd-sourcing platform. Before working on the task, annotators go through a brief set of guidelines explaining the task. We collect the dataset in batches of around 1,000 examples each. For the first three batches, we upload 75-100 instances manually labeled as toxic by the group of researchers working on the project to check whether the annotators followed the guidelines. We repeat this pilot testing until desirable performance is achieved. We also manually review a sample of the examples submitted by each annotator after each batch, exclude annotators who do not provide accurate inputs from the annotator pool, and redo all of their annotations. A key characteristic of subtle toxic emails is that they often result from prior experiences, cultural differences, or background between individuals (Sue et al., 2007). Hence, designing an annotation task for detecting toxicity is difficult, and there will be discrepancies in perceived toxicity between annotators. To minimize ambiguity and provide clearer context to the annotators, we provide the email body, subject, and the prior email in the thread as context information.

For each highlighted sentence, annotators indicate whether the post is toxic, the type of toxicity, whether the target of the toxic comment is the recipient or someone else, whether the prior email as context was helpful, the kind of negative affect associated with the toxicity, and whether the whole email was toxic. We provide annotators with a subset of negative affects from WordNet-Affect (Strapparava and Valitutti, 2004). The annotators answer the questions on the type of toxicity and the target only if they indicate potential toxicity during annotation. They can also choose multiple toxic categories for a highlighted sentence. Finally, annotators are provided an optional text box to give additional details if the highlighted sentence does not belong to any of the categories we defined. Please note that the sub-types of toxicity do not have a clear boundary and are not mutually exclusive.

A total of 76 annotators participated in this task. All annotators were fluent in English and came from four countries: the USA, Canada, Great Britain, and India, with the majority residing in the USA. Each highlighted statement in an email was annotated by three annotators, who were compensated at an hourly rate (as opposed to per annotation) to encourage them to optimize for quality. They took an average of 5 minutes per annotation. We consider a sentence toxic even if only one out of the three annotators perceived it as toxic. We adopt this principle to be inclusive of every individual's background, culture, and sexual orientation, and because implicit toxic language can be subtle. Similarly, we include the union of the toxicity types selected by the three annotators for each instance. A snapshot of our crowd-sourcing framework can be found in Appendix 5.
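To make the aggregation rule concrete, the sketch below shows one way to merge three annotator judgments into a single instance label under the "at least one annotator says toxic" rule and the union of sub-types described above. The field names and data layout are illustrative assumptions, not the exact annotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Annotation:
    """One annotator's judgment for a highlighted sentence (hypothetical schema)."""
    is_toxic: bool
    subtypes: Set[str] = field(default_factory=set)   # e.g., {"impolite", "gossip"}

def aggregate(annotations: List[Annotation]) -> dict:
    """Toxic if any annotator says toxic; sub-types are the union over annotators."""
    is_toxic = any(a.is_toxic for a in annotations)
    subtypes = set().union(*(a.subtypes for a in annotations)) if is_toxic else set()
    return {"toxic": is_toxic, "subtypes": sorted(subtypes)}

# Example: only one of three annotators flags the sentence, as impolite.
print(aggregate([
    Annotation(False),
    Annotation(True, {"impolite"}),
    Annotation(False),
]))  # {'toxic': True, 'subtypes': ['impolite']}
```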
Due to the scarce nature of toxic conversations in emails, we adopt a two-round approach for data collection. For the first round of annotations, we use several heuristics to increase the chances of identifying positive instances in the sample. We tried running the Perspective API and the microaggression model (Breitfeller et al., 2019) against the Avocado corpus. The coverage of the Perspective API is extremely low (0.1%) since little overtly toxic text is present in the Avocado corpus. On the other hand, the microaggression model output has low precision (0.12%). To further prune false positives, we employ filtering methods² over the outputs from the microaggression model before sending the positive labels for annotation. The first round of annotations provided a positive label ratio of 2.74%, compared to 0.29% from a manually annotated batch of around 800 random email sentences. This implies the need to be selective regarding the emails we submit for annotation. In addition, for the second round of annotations, we used an SVM classifier to pick positive instances from the unlabeled email corpus. To avoid model biases, we randomly sample unlabeled email sentences based on their probability scores, with more instances being sampled from the higher score ranges. The second round of annotations provided a positive label ratio of 11.2%, which is significantly higher than our previous rounds. The classifier is updated with more examples after each round of annotations.

² LIWC lexicon (Pennebaker et al., 2015), WordNet-Affect (Strapparava and Valitutti, 2004), https://github.com/snguyenthanh/better_profanity
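The second-round sampling can be sketched as follows: score unlabeled sentences with a classifier that outputs probabilities, then sample more heavily from the high-score ranges while still keeping some low-score sentences. This is only an illustrative sketch of the sampling idea, assuming scikit-learn and a generic TF-IDF + linear SVM pipeline; the exact features, score bins, and sampling weights are assumptions, not the values used to build ToxiScope.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def sample_for_annotation(labeled_texts, labels, unlabeled_texts, budget, seed=0):
    """Pick `budget` unlabeled sentences, biased toward high toxicity scores."""
    # Linear SVM with probability calibration so scores can be binned.
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        CalibratedClassifierCV(LinearSVC()),
    )
    model.fit(labeled_texts, labels)                    # labels: 0 = non-toxic, 1 = toxic
    scores = model.predict_proba(unlabeled_texts)[:, 1]

    # Bin scores and give higher-score bins a larger sampling weight (assumed weights).
    bins = np.digitize(scores, [0.25, 0.5, 0.75])       # four score ranges
    bin_weights = np.array([1.0, 2.0, 4.0, 8.0])        # skew toward likely-toxic text
    probs = bin_weights[bins] / bin_weights[bins].sum()

    rng = np.random.default_rng(seed)
    return rng.choice(len(unlabeled_texts), size=budget, replace=False, p=probs)
```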
Overall, the final dataset contains 10,110 email sentences, of which 1,120 are labeled as toxic by annotators. We call this dataset for studying toxic language in workplace communications ToxiScope. Please note that we asked the annotators to identify spam emails and their types, including Advertisement, Adult content, and Derogatory content. We observed that 99% of the emails in the Spam category are advertisements, and we decided to exclude those emails since advertisement content is not in the scope of toxic language detection. Figure 2 shows the distribution of toxic emails over sub-categories of toxic language, which indicates a higher frequency of Impolite emails.

Figure 2: Frequency of each sub-category of toxic sentences.

Annotators Agreement: Overall, the annotations showed an inter-annotator agreement of Krippendorff's α = 0.718 on whether a given sentence was toxic or not. Broken down by category, annotators agreed on a sentence being offensive at Krippendorff's α = 0.77, impolite at Krippendorff's α = 0.29, and gossip at Krippendorff's α = 0.32. The high agreement score on overall toxicity shows that annotator judgements are reliable, and the lower agreement scores on sub-types are indicative of the subjectivity and lack of objectivity of implicit toxicity (Lilienfeld, 2017), not of annotation quality. We also note several prior works on toxicity and other tasks that lack objectivity and report inter-annotator agreement scores in our range: the Microaggression dataset has a score of 0.41 for 200 instances, and Rashkin et al. (2016) report an inter-annotator agreement of 0.25.
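For reproducibility, agreement scores of this kind can be computed with an off-the-shelf implementation. The sketch below assumes the third-party krippendorff Python package and a toy reliability matrix (annotators as rows, sentences as columns, np.nan for missing ratings); it is not the authors' evaluation code.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Rows are annotators, columns are sentences; values are binary toxicity labels
# (1 = toxic, 0 = non-toxic), with np.nan where an annotator did not rate an item.
reliability_data = np.array([
    [1, 0, 0, 1, np.nan, 0],
    [1, 0, 1, 1, 0,      0],
    [0, 0, 0, 1, 0,      np.nan],
], dtype=float)

alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    level_of_measurement="nominal",  # binary / categorical labels
)
print(f"Krippendorff's alpha = {alpha:.3f}")
```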
Insights from annotation task: Sometimes defining a clear boundary between categories of toxic language is challenging because they are not mutually exclusive; a statement can therefore belong to multiple toxic categories. For example, the content of an email can be about gossiping and at the same time be discriminatory against a certain group of people. Our analysis shows that 92% of emails belong to a single toxic category, while the rest contain two or more types of toxic language. Figure 3 shows the co-occurrence of different toxic contents in the same email. We can observe that the Offensive and Impolite categories are slightly more likely to occur in the same email than either is with Gossip. Since our task is highly subjective, in order to understand the reasons behind perceived toxicity we ask annotators several questions about the target and affect of the toxic statement, and whether the context (previous email) is useful in determining the toxicity of the statement. We find that in 41% of the instances, context information was helpful to determine toxicity. In 76.86% of the toxic instances, the language was targeted at another individual or a group. Understandably, all the toxic instances carry negative affect, with anger and hostility present in most of the cases. However, annotators find gossip examples more disgusting, and a toxic sentence is found 6.1% more annoying when it is targeted at another individual who is not in the conversation.

Figure 3: Correlation between email toxic categories.

We use 70% of the data for training and 10% as a validation set. We hold out the remaining 20% of the data as a test set. Table 3 provides a summary of the final dataset.

Sentence Type    Train   Dev   Test
Toxic              886   117    207
  Impolite         636    84    139
  Gossip           176    23     47
  Offensive         74    10     21
Non-toxic         6308   864   1728
Total             7194   981   1935

Table 3: Number of instances in each toxic category and data split of ToxiScope.
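The 70/10/20 split summarized in Table 3 can be reproduced with a standard two-stage stratified split. The snippet below is a generic sketch using scikit-learn; the random seed and exact stratification are assumptions, not the authors' released split.

```python
from sklearn.model_selection import train_test_split

def split_70_10_20(sentences, labels, seed=42):
    """Stratified 70/10/20 train/dev/test split over sentence-level labels."""
    # First carve out the 20% test set, then take 1/8 of the remainder as dev
    # (0.8 * 0.125 = 0.10 of the full data).
    X_rest, X_test, y_rest, y_test = train_test_split(
        sentences, labels, test_size=0.20, stratify=labels, random_state=seed)
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```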
4 Detecting toxic conversations in Emails

We design our experiments with the following goals: (1) Investigate whether contextual information (email body, the parent email) helps in determining toxicity; we also study which categories of toxic language benefit from adding context to the sentence. (2) Test our hypothesis that current toxic language datasets cannot identify indirectly aggressive or impolite sentences; we consider current state-of-the-art toxic language detectors for this task. (3) Evaluate our baseline models on other datasets, including Wiki Comments (Wulczyn et al., 2017) and GitHub (Raman et al., 2020), to study whether understanding subtle signals helps in detecting overtly toxic language.

We experimented with publicly available state-of-the-art models from the literature and the Perspective API:

Linear Models: We generate n-grams (with n up to 2) and feed them as feature vectors to the classifier. We experiment with Logistic Regression and Support Vector Machines (SVM), as utilized by Breitfeller et al. (2019) and Raman et al. (2020), for our task.
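A minimal version of this n-gram baseline, assuming scikit-learn and binary toxic/non-toxic labels (hyperparameters here are illustrative, not the paper's tuned values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_linear_baseline(kind="logreg"):
    """Unigram+bigram features with a linear classifier (logreg or linear SVM)."""
    clf = (LogisticRegression(max_iter=1000, class_weight="balanced")
           if kind == "logreg" else LinearSVC(class_weight="balanced"))
    return Pipeline([
        ("ngrams", TfidfVectorizer(ngram_range=(1, 2), min_df=2, lowercase=True)),
        ("clf", clf),
    ])

# model = build_linear_baseline("svm")
# model.fit(train_sentences, train_labels)
# predictions = model.predict(test_sentences)
```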
Context-Aware Sentence Classification: Wang et al. (2019) developed a GRU model with a context encoder that uses an attention mechanism over the context sentences and a fusion layer that concatenates the target and context sentence representations to study the influence of context in intent classification. We leverage this model for our experiments.
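As a rough sketch of this kind of architecture (not Wang et al.'s released implementation), the model below encodes the target sentence and its context with GRUs, attends over the context hidden states using the target representation as the query, and fuses the two vectors before classification. Dimensions and vocabulary handling are simplified assumptions.

```python
import torch
import torch.nn as nn

class ContextAwareClassifier(nn.Module):
    """Target GRU + context GRU with attention, fused by concatenation."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.target_gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.context_gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, hidden)           # scores context states vs. target
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, target_ids, context_ids):
        # Encode the target sentence; use the final hidden state as its representation.
        _, target_h = self.target_gru(self.embed(target_ids))          # (1, B, H)
        target_vec = target_h.squeeze(0)                                # (B, H)

        # Encode the context (e.g., the prior email) token by token.
        context_states, _ = self.context_gru(self.embed(context_ids))  # (B, T, H)

        # Attention: score each context state against the target representation.
        scores = torch.bmm(self.attn(context_states),
                           target_vec.unsqueeze(2)).squeeze(2)          # (B, T)
        weights = torch.softmax(scores, dim=1).unsqueeze(1)             # (B, 1, T)
        context_vec = torch.bmm(weights, context_states).squeeze(1)     # (B, H)

        # Fusion layer: concatenate target and attended-context vectors.
        return self.classifier(torch.cat([target_vec, context_vec], dim=1))

# logits = ContextAwareClassifier(vocab_size=30000)(target_ids, context_ids)
```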
Bert Classification: We experimented with the Bert-based model proposed by Liu et al. (2019). We fine-tuned the model, initially trained on the Zampieri et al. data, on ToxiScope. This model concatenates the text of the parent and target comments, separated by Bert's [SEP] token, as in Bert's next-sentence-prediction pre-training task.
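The parent/target concatenation can be expressed with the Hugging Face transformers sentence-pair encoding. The sketch below is a generic illustration (the model name and inputs are assumptions), not the exact fine-tuning setup of Liu et al.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # toxic vs. non-toxic

# Sentence-pair encoding: the parent (prior) email and the target sentence are
# joined as "[CLS] parent [SEP] target [SEP]", mirroring BERT's next-sentence format.
batch = tokenizer(
    ["Thanks for the update on the release."],           # hypothetical parent email
    ["Honestly, your team never gets anything right."],  # hypothetical target sentence
    truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(dim=-1))  # class probabilities before any fine-tuning
```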
Bert + MLP: For this model, we experimented with the context-aware version of the Bert-based classifier explained above. We freeze the first 8 layers of Bert and add a non-linear activation function before the classification layer.
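A possible realization of this variant, assuming the Hugging Face BERT implementation: freeze the embeddings and the first 8 encoder layers, then place a small non-linear head on top of the [CLS] representation. The head size, activation, and dropout are assumptions.

```python
import torch.nn as nn
from transformers import AutoModel

class BertMLPClassifier(nn.Module):
    """BERT encoder with the first 8 layers frozen and a non-linear head."""
    def __init__(self, model_name="bert-base-uncased", num_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)

        # Freeze the embedding layer and the first 8 transformer layers.
        for param in self.bert.embeddings.parameters():
            param.requires_grad = False
        for layer in self.bert.encoder.layer[:8]:
            for param in layer.parameters():
                param.requires_grad = False

        hidden = self.bert.config.hidden_size
        self.head = nn.Sequential(      # non-linear activation before classification
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Dropout(0.1),
            nn.Linear(hidden, num_classes))

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_vec = outputs.last_hidden_state[:, 0]   # [CLS] token representation
        return self.head(cls_vec)
```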
5 Results and Analysis

Table 4 summarizes the performance of models trained and tested on ToxiScope. The baselines' performance is reported for binary classification (toxic vs. non-toxic). Due to class imbalance, we report F1 (macro and micro) and per-class accuracy (TPR and TNR). For the models in Table 4 that required context as an input, we took the prior email in the thread during pre-processing. The results imply that pretrained Bert models fine-tuned on ToxiScope perform better than non-pretrained models; hence, we focus on these models to evaluate the effect of context on the outcome. In addition, the low recall, or True Positive Rate (TPR), demonstrates the challenge of detecting subtle toxic instances in communications; from now on we pay more attention to the TPR and F1 score metrics.

Model                      Accuracy                                     F1 (macro/micro)
                           toxic (TPR)   non-toxic (TNR)   overall
Logistic Regression        0.0097        1.00              0.5050       0.4816/0.8941
Linear SVM                 0.3092        0.9421            0.6257       0.6378/0.8744
DCRNN (Wang et al., 2019)  0.1223        1.00              0.5610       0.4980/0.8537
Bert Classification        0.4348        0.9825            0.7102       0.75/0.91
Bert + MLP                 0.4300        0.9925            0.7112       0.7696/0.9213

Table 4: Performance of different models trained and tested on ToxiScope. We report True Positive Rate (TPR), True Negative Rate (TNR), and overall accuracy, along with F1 (macro and micro) scores.
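These metrics can be computed directly from predictions with scikit-learn. The sketch below is generic, not the authors' evaluation script; the exact aggregation behind the "overall" column in the tables is not specified here, so the sketch reports standard accuracy alongside TPR, TNR, and macro/micro F1.

```python
from sklearn.metrics import confusion_matrix, f1_score

def report_metrics(y_true, y_pred):
    """TPR (toxic recall), TNR (non-toxic recall), accuracy, macro/micro F1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "TPR": tp / (tp + fn),                        # recall on the toxic class
        "TNR": tn / (tn + fp),                        # recall on the non-toxic class
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
    }

print(report_metrics([0, 0, 1, 1, 1, 0], [0, 1, 1, 0, 1, 0]))
```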
Effect of adding context: As outlined in Section 3.2, annotators find the prior email and email body helpful in determining toxicity. Pavlopoulos et al. (2020) showed that adding context did not help pre-trained models like Bert in boosting the performance.

Table 5: Performance of our baseline models across different categories of toxic language. We report True Positive Rate (TPR) for each category and the average over their TPR.

Table 6: Performance of baseline models trained on ToxiScope and tested on several toxic language datasets.
aggressive text as toxic. The best performing classifier by Raman et al. (2020) on the GitHub dataset has a TPR of 0.35. One reason for the poor scores on the GitHub dataset can be attributed to noisy labels: we sampled a few instances from the GitHub dataset and found 15% of them to be noisy. Overall, these experimental results imply the potential benefits of using our dataset for detecting toxic language in the social media and open-source community domains.

Models trained on the Microaggression dataset are more applicable to workplace toxic language detection. However, they still perform worse than the in-domain models (Table 4). Impolite and gossip (comprising sarcasm, stereotyping, and rudeness) categories are predominantly present in ToxiScope, while there are not many datasets available for these tasks and the existing datasets are small in size. This could explain the inadequate performance of these models.

Table 7: Performance of Bert models trained on the Microaggression, Wiki Comments, and GitHub datasets and tested on ToxiScope. The column denotes the dataset the models were trained on.
Model                      Accuracy                                     F1 (macro/micro)
                           toxic (TPR)   non-toxic (TNR)   overall
Perspective API            0.2174        0.9907            0.6040       0.6432/0.8848
Raman et al. (2020)        0.1014        0.9797            0.8858       0.5492/0.8734
Breitfeller et al. (2019)  0.3987        0.5556            0.5217       0.4375/0.5483
Liu et al. (2019)          0.4348        0.9825            0.7102       0.75/0.91

Table 8: Performance of different models with inference on ToxiScope.

self-training for better performance. Going forward, we will also investigate other research questions pertaining to the likelihood of an individual using toxic language repeatedly, the correlation of power and gender dynamics with toxicity, the presence of bias (racial/gender) in ToxiScope, and understanding the degree of severity of toxic text. We hope our work will encourage researchers in the community to study and develop methods to detect workplace toxicity.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback, and Saleema Amershi, Michael Gammon, Alexandra Olteanu, Allison Hegel, Liye Fu, and Subho Mukherjee for valuable discussions during the project.
7 Ethical Considerations

7.1 Annotation

In this work, we leverage the publicly available Avocado corpus, which belongs to the Linguistic Data Consortium (LDC). This email dataset has been processed and anonymized by the LDC. We received approval from our organization's Internal Review Board (IRB) before starting the annotation task to make sure we are in compliance with the Avocado Research Email Collection license agreements as well as ethical guidelines. We understand that annotating potentially toxic content can have a negative impact on the workers. In order to reduce these effects, we provided warnings and information about the research project in a consent form. We asked the annotators to read the consent form and only proceed if they agreed to its terms (Figure 4). The risks and benefits of working on this annotation task were presented to annotators in the consent form:

Benefits: There are no direct benefits to you that might reasonably be expected as a result of being in this study. The research team expects to learn to detect micro-aggressive and toxic language in email communications from the results of this research, as well as any public benefit that may come from these Research Results being shared with the greater scientific community.

Risks: During your participation, you may experience some discomfort from being exposed to profanity and toxic and discriminatory language in emails. To mitigate this risk, the research team makes it possible for you to take a break or skip tasks without adversely affecting your ratings within the crowd-sourcing platform. This research may involve risks to you that are currently unforeseeable.

In addition, we did not collect any personal or demographic information other than the annotators' crowd-sourcing platform identification numbers. The consent form explains how we manage their information and provides details about their compensation. Resources were also provided to answer the annotators' questions and concerns. Moreover, we limited the number of emails an annotator can work on in a task and paid them above minimum wage ($12-15 per hour).
7.2 Deployment

Detecting harmful language in email communication is a difficult task even for humans. Recent work has shown that toxic language detection models are also very prone to racial biases (Sap et al., 2019; Davidson et al., 2019) because they use biased datasets. In this work, we hired annotators from different English-speaking countries to reduce the bias in our dataset. However, this is a research paper whose goal is to better understand the problem of toxic language in workplace communications and to encourage other researchers to work on this problem. We believe further study needs to be done on this dataset to make sure it is not biased before deploying any computational model.

In addition, deploying this technology requires access to employees' communications. To the best of our knowledge, most workplaces do not provide any guarantee of privacy for employees' communications on enterprise systems. In addition, there are several existing technologies implemented on workplace communications to improve users' productivity, such as response generation and intent detection in emails. These technologies are used without violating users' privacy thanks to advances in the fields of unsupervised learning and privacy-preserving machine learning.

Moreover, this technology has multiple applications, and some of them can potentially be used to harm employees and their friends and family. For example, using this model to detect toxic language and report employees to HR or their manager is a high-stakes application. If this system makes a false positive error, it may damage an employee's reputation, force the employee to defend themselves, and diminish their trust in the company. This technology can also be used to provide feedback to employees about their written communication style; such a tool can be used for training purposes and for increasing workers' awareness of micro-aggressive language. If this system makes frequent false positive errors, employees will become annoyed and less productive, which causes an eventual drop in the company's profits. Companies can pursue mitigation steps and allow employees to provide feedback and dispute the system's predictions.

Figure 4: A snapshot of the annotation task, showing that the annotator must read the consent form and agree to its terms before proceeding to annotate an email.

References

Philip Bramsen, Martha Escobar-Molano, Ami Patel, and Rafael Alonso. 2011. Extracting social power relationships from natural language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 773–782, USA. Association for Computational Linguistics.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.

Tommaso Caselli, Valerio Basile, Jelena Mitrović, Inga Kartoziya, and Michael Granitzer. 2020. I feel offended, don't be abusive! Implicit/explicit messages in offensive and abusive language. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6193–6202, Marseille, France. European Language Resources Association.
Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.

Michelle K. Duffy, Daniel C. Ganster, and Milan Pagon. 2002. Social undermining in the workplace. Academy of Management Journal, 45(2):331–351.

Lea Ellwardt, Giuseppe (Joe) Labianca, and Rafael Wittek. 2012. Who are the objects of positive and negative gossip at work? A social network perspective on workplace gossip. Social Networks, 34(2):193–205.

Paula Fortuna and Sérgio Nunes. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys, 51(4).

Akshita Jha and Radhika Mamidi. 2017. When does a compliment become sexist? Analysis and classification of ambivalent sexism using Twitter data. In Proceedings of the Second Workshop on NLP and Computational Social Science, pages 7–16, Vancouver, Canada. Association for Computational Linguistics.

David Jurgens, Libby Hemphill, and Eshwar Chandrasekharan. 2019. A just and comprehensive strategy for using NLP to address online abuse. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3658–3666, Florence, Italy. Association for Computational Linguistics.

Ming Kong. 2018. Effect of perceived negative workplace gossip on employees' behaviors. Frontiers in Psychology, 9:1112.

Scott O. Lilienfeld. 2017. Microaggressions: Strong claims, inadequate evidence. Perspectives on Psychological Science, 12(1):138–169. PMID: 28073337.

Ping Liu, Wen Li, and Liang Zou. 2019. NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 87–91, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Aman Madaan, Amrith Setlur, Tanmay Parekh, Barnabas Poczos, Graham Neubig, Yiming Yang, Ruslan Salakhutdinov, Alan W Black, and Shrimai Prabhumoye. 2020. Politeness transfer: A tag and generate approach.

Francis T. McAndrew, Emily K. Bell, and Contitta Maria Garcia. 2007. Who do we tell and whom do we tell on? Gossip as a strategy for status enhancement. Journal of Applied Social Psychology, 37(7):1562–1577.

Kevin L. Nadal, Katie E. Griffin, Yinglee Wong, Sahran Hamit, and Morgan Rasmus. 2014. The impact of racial microaggressions on mental health: Counseling implications for clients of color. Journal of Counseling & Development, 92(1):57–66.

Douglas Oard, William Webber, David Kirsch, and Sergey Golitsynskiy. 2015. Avocado research email collection. DVD.

Silviu Oprea and Walid Magdy. 2020. iSarcasm: A dataset of intended sarcasm. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1279–1289, Online. Association for Computational Linguistics.

John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, and Ion Androutsopoulos. 2020. Toxicity detection: Does context really matter? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics.

James Pennebaker, Martha Francis, and Roger Booth. 2015. Linguistic Inquiry and Word Count (LIWC).

Vinodkumar Prabhakaran, Emily E. Reid, and Owen Rambow. 2014. Gender and power: How gender and gender environment affect manifestations of power. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1965–1976, Doha, Qatar. Association for Computational Linguistics.

Jing Qian, Mai ElSherief, Elizabeth Belding, and William Yang Wang. 2019. Learning to decipher hate symbols. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3006–3015, Minneapolis, Minnesota. Association for Computational Linguistics.

Santhosh Rajamanickam, Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2020. Joint modelling of emotion and abusive language detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4270–4279, Online. Association for Computational Linguistics.

Naveen Raman, Minxuan Cao, Yulia Tsvetkov, Christian Kästner, and Bogdan Vasilescu. 2020. Stress and burnout in open source: Toward finding, understanding, and mitigating unhealthy interactions. New York, NY, USA. Association for Computing Machinery.

Hannah Rashkin, Sameer Singh, and Yejin Choi. 2016. Connotation frames: A data-driven investigation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 311–321, Berlin, Germany. Association for Computational Linguistics.

Niloofar Safi Samghabadi, Afsheen Hatami, Mahsa Shafaei, Sudipta Kar, and Thamar Solorio. 2020. Attending the emotions to detect online abusive language. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 79–88, Online. Association for Computational Linguistics.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, Valencia, Spain. Association for Computational Linguistics.

Carlo Strapparava and Alessandro Valitutti. 2004. WordNet Affect: An affective extension of WordNet. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal. European Language Resources Association (ELRA).

Derald Sue, Christina Capodilupo, Gina Torino, Jennifer Bucceri, Aisha Holder, Kevin Nadal, and Marta Esquilin. 2007. Racial microaggressions in everyday life: Implications for clinical practice. The American Psychologist, 62:271–286.

Derald Wing Sue. 2010. Microaggressions in Everyday Life: Race, Gender, and Sexual Orientation. Wiley.

Zeerak Waseem, James Thorne, and Joachim Bingel. 2018. Bridging the Gaps: Multi Task Learning for Domain Transfer of Hate Speech Detection, pages 29–55. Springer International Publishing, Cham.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Wikipedia talk labels: Personal attacks.

Jun-Ming Xu, Kwang-Sung Jun, Xiaojin Zhu, and Amy Bellmore. 2012. Learning from bullying traces in social media. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, pages 656–666, USA. Association for Computational Linguistics.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019a. Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1415–1420, Minneapolis, Minnesota. Association for Computational Linguistics.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019b. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 75–86, Minneapolis, Minnesota, USA. Association for Computational Linguistics.