A Multidimensional Framework For Evaluating Lexical Semantic Change With Social Science Applications
Despite advances in detecting and modelling lexical semantic change, there is a need for a unifying framework to integrate multiple dimensions of change. The present study addresses this gap by proposing a framework which synthesizes the theoretical insights of historical linguists about the many distinct forms of diachronic lexical semantic change (e.g., Bloomfield, 1933) and aligns them with the methodological sophistication of natural language processing. The comprehensive computational framework for evaluating lexical semantic change that emerges should be valuable for computational social scientists seeking to understand and model social and cultural change.

2. Related Work

2.1 Forms of Lexical Semantic Change

Historical linguists have developed several taxonomies of the forms of lexical semantic change (Blank, 1999; Bréal, 1897; Ullmann, 1962), but Bloomfield's (1933) is one of the most well-established. Bloomfield described nine forms identified by earlier scholars: (1) narrowing: superordinate to subordinate, or when a meaning becomes more restricted (Old English mete 'all food' > meat 'edible flesh'); (2) widening: subordinate to superordinate, or specific to general expansion of meaning (Middle English dogge 'dog of a specific breed' > dog); (3) metaphor: the transfer of a name based on associations of similarity or hidden comparison (Primitive Germanic bitraz 'biting', derivative of 'I bite' > bitter 'harsh of taste'); (4) metonymy: change based on the meanings' proximity in space or time (Old English ceace 'jaw' > cheek); (5) synecdoche: the meanings are related as whole and part (pre-English stobo 'heated room' > stove); (6) hyperbole: stronger to weaker meaning by overstatement (pre-French extonare 'to strike with thunder' > to astonish; English borrowed astound and astonish from Old French); (7) meiosis (Bloomfield, 1933, refers to this class as litotes, but we use meiosis to reflect general understatement): weaker to stronger meaning by understatement (pre-English kwalljan 'to torment' > Old English cwellan 'to kill'); (8) degeneration: positive to negative connotation (Old English cnafa 'boy servant' > knave); (9) elevation: negative to positive connotation (Old English cniht 'boy, servant' > knight).

Bloomfield's classes align closely with the forms of change identified in studies of denotational and connotational meaning (Geeraerts, 2010). For denotational (referential) meaning, Geeraerts identifies (1) specialization, (2) generalization, (3) metonymy, and (4) metaphor. Specialization (semantic 'restriction' and 'narrowing') implies that the new meaning covers a subset of the old meaning's range; for generalization (or 'expansion', 'extension', 'schematization', 'broadening'), the new range includes the old meaning. Metonymy (here including synecdoche) is a "link between two readings of a lexical item based on a relationship of contiguity between the referents of the expression in each of those readings" (Geeraerts, 2010, p. 27). Conversely, metaphor is based on similarity. Geeraerts also identifies two forms of change in connotational meaning (i.e., the aspects of a word's meaning that are related to the writer or reader's emotions, sentiment, opinions, or evaluations): (1) pejorative and (2) ameliorative change (i.e., shift towards a more negative/positive emotive meaning). An example of pejoration is 'silly', which formerly meant 'deserving sympathy, helpless' but has come to mean 'showing a lack of common sense'. Amelioration is shown by 'knight', which once meant 'boy, servant'.

2.2 Expanding Concepts of Harm and Pathology

Semantic change processes such as these may partly reflect cultural, social, and political shifts, and are of interest to social science researchers. One example is social psychological research on concept creep, the semantic expansion of harm-related concepts (e.g., abuse, bullying, mental illness, prejudice, trauma, violence; Haslam, 2016). Concept creep takes two forms: harm-related concepts have expanded 'horizontally' to cover a wider range of harms and 'vertically' to encompass less intense harms. It is theorized to be driven by rising cultural sensitivity to harm (Furedi, 2016; Wheeler et al., 2019), falling societal prevalence of harm (Levari et al., 2018; Pinker, 2011), and deliberate conceptual expansion by "opprobrium entrepreneurs" (Sunstein, 2018). Concept creep is theorized to have mixed blessings (Haslam et al., 2020), trivializing harms on one hand (Dakin et al., 2023) and enhancing the recognition and redress of major harms on the other (Tse and Haslam, 2021).
Prior empirical work has evaluated concept creep in historical text corpora. Studies assessing horizontal expansion as increases in the broadening of harm concepts found that some concepts (e.g., addiction, bullying, trauma) have broadened within academic psychology (Haslam et al., 2021; Vylomova et al., 2019; Vylomova and Haslam, 2021). Recent work evaluated the vertical form of concept creep, defined as the concept's use in contexts of declining emotional intensity, and yielded mixed findings for anxiety, depression, grief, stress, and trauma (Baes et al., 2023a,b; Xiao et al., 2023).

Mental illness has become an increasingly salient term in society (Haslam and Baes, 2024), partly due to the recent prioritization of mental health in global health policy (WHO, 2021). Critics have raised concerns that the rising prominence of mental health discourse is instigating problematic changes in how people conceptualize mental ill health. Some contend that concepts of mental illness have broadened so that everyday life is increasingly pathologized (Brinkmann, 2016; Horwitz and Wakefield, 2007, 2012). Experiences that were once considered normal are now given diagnostic labels, such as using 'depression' to reference ordinary sadness (Bröer and Besseling, 2017). Alternatively, it has been argued that terms like "mental health problems" are being normalized and broadened (Sartorius, 2007), alongside increasing prevalence of mental illnesses. Some argue that concepts of mental illness are becoming less stigmatizing, although this question has only been addressed in surveys of public attitudes (e.g., Schomerus et al., 2022), rather than in changes in word connotations. In view of the widespread speculation on the ways in which concepts of mental illness have changed historically and the lack of scientific evidence of these shifts, a systematic study of conceptual change in this domain is a priority.

3. Method

3.1 Framework

The proposed framework, illustrated in Figure 1, economically reduces classes of lexical semantic change identified by historical linguists (excluding metaphor and metonymy; Geeraerts, 2010) to three dimensions. It recognizes that these classes represent opposed pairs of change types, each member corresponding to a pole on a single dimension. In essence, the framework reformulates six classes as three dimensions, allowing lexical semantic change to be quantified on three axes simultaneously rather than categorized into exclusive types. A recent survey paper (de Sá et al., 2024) has also classified semantic change into three classes of characterizations, related to a word's meaning becoming used in a more (1) pejorative or ameliorated sense (orientation), (2) metaphoric or metonymic context (relation), or (3) abstract/general or more specific/narrow context (dimension). However, their theoretical framework does not consider hyperbole/litotes.
Dimension | Rising | Falling
Sentiment | Elevation (Bloomfield, 1933); Amelioration (Ullmann, 1962) | Degeneration (Bloomfield, 1933); Pejoration (Ullmann, 1962)
Breadth | Widening (Bloomfield, 1933; Ullmann, 1962); Generalization of meaning (Blank, 1999); Horizontal Creep (Haslam, 2016)* | Narrowing (Bloomfield, 1933; Ullmann, 1962); Specialization of meaning (Blank, 1999)
Intensity | Meiosis (Bloomfield, 1933) | Hyperbole (Bloomfield, 1933); Vertical Creep (Haslam, 2016)*

Table 1: Dimensions of Lexical Semantic Change and their associated forms. * = specific to harm-related concepts.
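As a minimal illustration of how Table 1 can be operationalized, the sketch below encodes each dimension with its rising and falling poles and maps the sign of a fitted trend back to the classical change types; the structure and names (FRAMEWORK, classify) are ours, not part of the study's released code.

```python
# Minimal sketch: Table 1 as a lookup from dimension to its rising/falling poles.
FRAMEWORK = {
    "sentiment": {"rising": ["elevation", "amelioration"],
                  "falling": ["degeneration", "pejoration"]},
    "breadth":   {"rising": ["widening", "generalization", "horizontal creep"],
                  "falling": ["narrowing", "specialization"]},
    "intensity": {"rising": ["meiosis"],
                  "falling": ["hyperbole", "vertical creep"]},
}

def classify(dimension: str, slope: float) -> list[str]:
    """Map the sign of a fitted trend on one dimension to the classical change types."""
    pole = "rising" if slope > 0 else "falling"
    return FRAMEWORK[dimension][pole]

# Example: a positive trend on the breadth index corresponds to widening/generalization.
print(classify("breadth", slope=0.9))
```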
The three proposed dimensions align with established dimensions in other domains. For example, Sentiment and Intensity resemble the two primary dimensions of human emotion, Valence and Arousal (Russell, 2003), and two primary dimensions of connotational meaning, Evaluation (e.g., "good/bad") and Potency (e.g., "strong/weak") (Osgood et al., 1975), both of which have been shown to have cross-cultural validity. Although our dimensions capture the primary forms of lexical change, we argue that they can be complemented by evaluation of changes in a word's salience (i.e., relative frequency of use) and its thematic content (i.e., shifts in the specific contexts in which the word is used). These dimensions may reflect psychological, sociocultural, or cultural forces that contribute to or result from semantic change (Blank, 1999). Our case study of mental health and mental illness illustrates how attention to salience and thematic content enriches the characterization of semantic change that the three primary dimensions provide. We now turn to the details of that case study, including the computational methodologies for evaluating these dimensions. Future implementations of our three-dimensional framework are likely to include technical refinements of these methodologies; those employed in the case study simply demonstrate one way to implement it using interpretable techniques.

3.2 Sentiment

The sentiment of the target concepts (mental health and mental illness, and the control concept perception) was evaluated using valence norms from Warriner et al. (2013), which provide valence ratings for 13,915 English lemmas collected from 1,827 United States residents, ranging from low valence (1: feeling extremely "unhappy", "despaired") to high valence (9: feeling extremely "happy", "hopeful"). See Appendix A for more information regarding the valence ratings. Collocates of each target concept were extracted within a ±5-word context window (Agirre et al., 2009) and matched to the Warriner et al. norms, which showed adequate coverage for the psychology corpus but poorer coverage for the general corpus ("mental_health": psychology = 84%, general = 50%; "mental_illness": psychology = 83%, general = 48%; "perception": psychology = 84%, general = 39%). Annual counts of Warriner-matched collocates for each target concept were then extracted from the lemmatized corpora; these counts were sparse before 1990 because the general corpus contains few texts mentioning the targets before that year (see Appendix B), so analyses excluded general-corpus texts before 1990. The annual sentiment score for each concept was computed by weighting the valence rating for each collocate by its annual number of appearances, standardized by the total number of (matched) collocates in the respective year. The index represents the mean valence of terms [1,9] collocating with the target concepts, where higher scores indicate higher valence.
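As a rough illustration of this weighting scheme, the sketch below computes a yearly valence index from per-year texts and a valence lookup; the whitespace tokenizer and the toy norms are assumptions made for brevity, whereas the actual analysis operates on the lemmatized corpora and the full 13,915-lemma Warriner norms.

```python
from collections import Counter

def collocates(tokens, target, window=5):
    """Yield tokens within +/- `window` positions of each occurrence of `target`."""
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield tokens[j]

def valence_index(texts_by_year, target, valence_norms, window=5):
    """Mean Warriner valence [1,9] of norm-matched collocates, weighted by annual counts."""
    index = {}
    for year, texts in texts_by_year.items():
        counts = Counter()
        for text in texts:
            counts.update(collocates(text.lower().split(), target, window))
        matched = {w: n for w, n in counts.items() if w in valence_norms}
        total = sum(matched.values())
        if total:
            index[year] = sum(valence_norms[w] * n for w, n in matched.items()) / total
    return index

# Toy usage with hypothetical norms; the arousal index in Section 3.4 replaces the
# valence ratings with Warriner arousal ratings but is otherwise identical.
norms = {"problem": 2.6, "service": 6.2, "care": 7.0}
texts = {2010: ["community mental_health service and care", "mental_health problem"]}
print(valence_index(texts, "mental_health", norms))
```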
3.3 Breadth

The semantic broadening of the target concept was evaluated as the average inverse cosine similarity between the sentence-level embeddings of sentences containing the target term. Our method adapts previous work (Vylomova et al., 2019; Vylomova and Haslam, 2021) by replacing type-level word embeddings with contextualized sentence-level embeddings. Given that this breadth measure resembles the Semantic Textual Similarity (STS) task (Cer et al., 2017; the degree to which two sentences are semantically equivalent to each other), to select the optimal model we compared the sentence similarity scores, from corpus samples, of models that have shown good performance for encoding sentences. Many of the original Sentence-BERT models (Reimers and Gurevych, 2019) with good scores on semantic textual similarity benchmarks (Tsukagoshi et al., 2022; Reimers and Gurevych, 2019) are deprecated, so we examined and compared three public pre-trained models from the sentence-transformers library (https://www.sbert.net/docs/pretrained_models.html) that currently excel at encoding sentences. See Appendix C for more information regarding model selection, comparison, and results. The pre-trained model used in the present study, "all-mpnet-base-v2" (from Hugging Face, sentence-transformers: https://huggingface.co/sentence-transformers/all-mpnet-base-v2), performed best at detecting semantic information and encoding sentences across 14 diverse tasks from different domains.

To compute the breadth score, relevant texts were extracted from our corpora. Inspecting their frequencies showed that it was acceptable to sample 50 texts from each five-year interval (Appendix C explains interval selection). Thus, we randomly and uniformly sampled up to 50 sentences per interval and repeated the procedure 10 times to reduce sampling noise. These sentences were then passed to the sentence transformer model, "all-mpnet-base-v2" (MPNet: Masked and Permuted Pre-training for Language Understanding; Song et al., 2020), to be tokenized and encoded into embeddings representing their semantic characteristics. Cosine distance was computed for each pair of sentence vectors by inverting the similarity scores (1 - cosine similarity). The final breadth metric [0,1] was calculated by averaging scores across samples in each interval. Higher scores indicate greater breadth (dissimilarity) between sentence vectors.
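A minimal sketch of this pairwise cosine-distance computation with the sentence-transformers package is shown below; it omits the per-interval sampling of up to 50 sentences and the 10 repetitions, and the helper name breadth_score is ours.

```python
import itertools
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def breadth_score(sentences):
    """Average pairwise cosine distance (1 - cosine similarity) between sentence embeddings."""
    emb = np.asarray(model.encode(sentences))
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize for cosine
    distances = [1.0 - float(emb[i] @ emb[j])
                 for i, j in itertools.combinations(range(len(emb)), 2)]
    return float(np.mean(distances))

# Toy usage: in the study, up to 50 target-bearing sentences are sampled per
# five-year interval and the procedure is repeated 10 times before averaging.
sample = ["Her mental_health improved after treatment.",
          "Funding for community mental_health services was cut.",
          "Exercise supports mental_health and wellbeing."]
print(breadth_score(sample))
```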
3.4 Intensity

Changes in the intensity of the concepts were evaluated in two ways. First, we computed an arousal index, adapting a previously established procedure (Baes et al., 2023a,b; Xiao et al., 2023). In an equivalent manner to the sentiment analysis, we examined the collocates of each concept and computed weighted average annual ratings, using Warriner et al.'s arousal norms, which range from low arousal (1: feeling "calm", "unaroused" while reading the lemma) to high arousal (9: feeling "agitated", "aroused"). See Appendix A for more information regarding the arousal ratings. The annual arousal score for each concept was calculated by weighting the arousal rating for each collocate by its total number of appearances in each year and normalizing by the total (matched) collocate count for the respective year. The index represents the mean arousal of terms [1,9] collocating with the target concepts, where higher scores indicate higher arousal.

Second, we developed a new index to directly capture shifts in a concept's intensity. Instead of examining the arousal of the concept's collocates (regardless of their order), it examines the occurrence of intensifying expressions that directly modify the concept. If a concept increasingly appears with an intensifying modifier, it can be inferred that its unmodified meaning has become less intense. This "intensifier index" evaluates the relative frequency with which 11 adjectival modifiers ("great", "intense", "severe", "harsh", "major", "extreme", "powerful", "serious", "devastating", "destructive", "debilitating") preceded "mental health" and "mental illness". De-adjectival adverbs from Luo et al. (2019) were considered, but most were not sufficiently general (e.g., "devastating", "excruciating", "vicarious"). We used the dependency-parsed corpora (see Section 4.2) to compute the proportion of instances of each target concept that have any of the 11 terms as an adjectival modifier.
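One way to compute such a proportion with a spaCy dependency parse is sketched below, assuming the single-token targets produced by the preprocessing in Section 4.2 (e.g., "mental_illness"); it loads the small English model for convenience, whereas the study used "en_core_web_trf".

```python
import spacy

# The paper used "en_core_web_trf"; the small model keeps this sketch lightweight.
nlp = spacy.load("en_core_web_sm")

INTENSIFIERS = {"great", "intense", "severe", "harsh", "major", "extreme",
                "powerful", "serious", "devastating", "destructive", "debilitating"}

def intensifier_index(sentences, target="mental_illness"):
    """Proportion of target occurrences modified by one of the 11 intensifying adjectives."""
    occurrences = modified = 0
    for doc in nlp.pipe(sentences):
        for tok in doc:
            if tok.lower_ == target:
                occurrences += 1
                if any(child.dep_ == "amod" and child.lemma_.lower() in INTENSIFIERS
                       for child in tok.children):
                    modified += 1
    return modified / occurrences if occurrences else 0.0

print(intensifier_index(["She lives with severe mental_illness.",
                         "Stigma around mental_illness persists."]))
```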
3.5 Thematic content

Thematic content was evaluated using a top-down approach. The theme of interest was pathology, given concerns raised by critics about the pathologization of mental health and mental illness (Brinkmann, 2016; Horwitz and Wakefield, 2007, 2012). We used a pathologization dictionary developed by Baes et al. (2023a) to compute the pathologization index; the same approach can be used to construct dictionaries for other themes of interest. First, we generated unambiguously disease-related words with a restricted range of meaning: "clinical", "disorder", "symptom", "illness", "pathology", and "disease". Next, their forward word associations (participant responses to each disease-related word), drawn from the English Small World of Words project (De Deyne et al., 2019), were listed and duplicates were removed. We filtered the list for terms reflecting pathologization (i.e., viewing or characterizing something as medically or psychologically abnormal), leaving 17 terms: "ailment", "clinical", "clinic", "cure", "diagnosis", "disease", "disorder", "ill", "illness", "medical", "medicine", "pathology", "prognosis", "sick", "sickness", "symptom", "treatment". Following Baes et al. (2023a), we computed the pathologization index by dividing the appearances of the 17 terms among the target concept's collocates (±5-word context window) in a specific year by the total number of collocates in that year.
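Because the pathologization index is a dictionary match over the same ±5-word collocate windows used for the valence index, it reduces to a simple ratio once the collocate counts are in hand; the sketch below assumes such a per-year collocate Counter and is illustrative rather than the released implementation.

```python
from collections import Counter

PATHOLOGY_TERMS = {"ailment", "clinical", "clinic", "cure", "diagnosis", "disease",
                   "disorder", "ill", "illness", "medical", "medicine", "pathology",
                   "prognosis", "sick", "sickness", "symptom", "treatment"}

def pathologization_index(collocate_counts: Counter) -> float:
    """Share of a year's collocate tokens (within the +/-5-word window) that are pathology terms."""
    total = sum(collocate_counts.values())
    hits = sum(n for w, n in collocate_counts.items() if w in PATHOLOGY_TERMS)
    return hits / total if total else 0.0

# Toy usage: collocate_counts would come from the same windowing step as the valence index.
print(pathologization_index(Counter({"disorder": 3, "treatment": 2, "community": 5})))
```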
3.6 Salience

Salience was computed as the concept's annual relative frequency, using the raw corpus versions.

4. Materials

4.1 Corpora

Two corpora were chosen for their historical length, their magnitude, and their texts. The psychology corpus contained 143,575,773 tokens from 871,344 abstracts from 875 (Scimago-indexed) psychology journals, ranging from 1930 to 2019, sourced from the E-Research and PubMed databases (Vylomova et al., 2019). The journal set was distributed across all subdisciplines of psychology. The final corpus of psychology abstracts was limited to 1970-2016 due to the relatively small number of abstracts outside this period (Vylomova et al., 2019), yielding 129,980,596 tokens from 793,942 abstracts.

The second corpus is a combination of two related corpora: the Corpus of Historical American English (Davies, 2010; 1810-2009) and the Corpus of Contemporary American English (Davies, 2008; 1990-2019). Academic texts were excluded to avoid any potential overlap with psychology articles. After merging the two corpora, containing 115,000 everyday publications and >500,000 contemporary texts, the combined corpus was processed following recommendations from Alatrash et al. (2020) to maintain data integrity (see Appendix D for a comprehensive explanation). The current study restricted the corpus period from 1970 to 2016, using 501,415,577 tokens from 244,552 texts (books: 23,855 fiction, 1,498 non-fiction; 88,641 magazines; 73,557 newspapers; 40,036 spoken language; 16,965 TV shows).

4.2 Preprocessing

Analyses required three versions of the corpora: (1) a raw cleaned version transforming target concepts into single noun tokens (Sections 3.3, 3.4, and 3.6); (2) a lemmatized version (Sections 3.2, 3.4, and 3.5); and (3) a dependency-parsed version (Section 3.4). The first version, which retains punctuation, uppercasing, and numbers, was used for all analyses after transforming multiword target concepts into single tokens (e.g., "mental health" > "mental_health") using case-sensitive matching. The lemmatization pipeline included tokenization, part-of-speech tagging (skipping tokens with uninformative tags: punctuation, symbols, spaces, numbers), removal of stop words (uninformative words like "the"), and lemmatization using spaCy (https://spacy.io/). For dependency parsing we used the raw corpora, which provide more contextual information for the model to better capture relationships between words. The English transformer model "en_core_web_trf" (based on roberta-base; chosen as it demonstrates the highest accuracy on 13 evaluation tasks: https://spacy.io/models/en#en_core_web_trf) was used to preprocess the corpus on a high-performance computing system (Lafayette et al., 2016).
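A compressed sketch of the lemmatized version of this pipeline is given below; the small spaCy model, the regex used to join the multiword targets, and the function names are stand-ins for the study's actual preprocessing, which ran "en_core_web_trf" on a high-performance computing system.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")          # stand-in for "en_core_web_trf"
SKIP_POS = {"PUNCT", "SYM", "SPACE", "NUM"}  # uninformative tags skipped by the pipeline

def join_targets(text):
    """Turn multiword targets into single tokens, e.g. "mental health" > "mental_health"."""
    return re.sub(r"\bmental (health|illness)\b", r"mental_\1", text)

def lemmatize(text):
    """Tokenize, drop uninformative tags and stop words, and return lowercased lemmas."""
    doc = nlp(join_targets(text))
    return [tok.lemma_.lower() for tok in doc
            if tok.pos_ not in SKIP_POS and not tok.is_stop]

print(lemmatize("Community mental health services were expanded in 1995."))
```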
4.3 Target Concepts

Two terms were chosen to analyze levels of semantic change (Hamilton et al., 2016a): mental_health and mental_illness. We also ran control analyses using the neutral term perception, for which a fixed rate of change was expected and which demonstrated a steady rise in relative frequency starting around 1945 in the Google Ngram Viewer (https://books.google.com/ngrams/info).

4.4 Statistical Analysis

Linear regression analyses were performed to test the statistical significance of historical trends in the semantic indices (Jebb et al., 2015). Ordinary least squares served as the primary estimator, with generalized least squares used as a secondary estimator to account for auto-correlated residuals (Durbin-Watson test: p < .05). Coefficients, standard errors, and confidence intervals were standardized using the betaSandwich package (Pesigan et al., 2023), employing Dudgeon's (2017) heteroskedasticity-consistent estimator approach (HC3), which is well suited to estimation with nonnormal data and small sample sizes (Dudgeon, 2017). The code is publicly available at https://osf.io/4d7ur/.
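For readers working in Python rather than R, a rough statsmodels analogue of this procedure is sketched below; the HC3 covariance mirrors Dudgeon's estimator, but the coefficient standardization performed by betaSandwich is not reproduced, and the Durbin-Watson screen is a crude stand-in for the test reported in the paper.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

def trend_test(years, index_values):
    """OLS of an annual semantic index on year with HC3 robust standard errors,
    refit with an AR(1) GLS model when residuals look autocorrelated."""
    y = np.asarray(index_values, dtype=float)
    X = sm.add_constant(np.asarray(years, dtype=float))
    ols = sm.OLS(y, X).fit(cov_type="HC3")
    if not 1.5 < durbin_watson(ols.resid) < 2.5:  # crude autocorrelation screen
        return sm.GLSAR(y, X, rho=1).iterative_fit()
    return ols

# Toy usage with synthetic data; the study regresses each annual index on year.
years = np.arange(1970, 2017)
index = 5.5 - 0.01 * (years - 1970) + np.random.default_rng(0).normal(0, 0.05, years.size)
fit = trend_test(years, index)
print(fit.params, fit.bse)
```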
5. Results

Sentiment: The linear regression models mostly show decreasing trends for the valence index. Figure 2 shows a significant declining trend in the valence of words used in the context of mental health in the psychology corpus and the general corpus. For mental illness, the valence index shows a decreasing trend in psychology and an increase in the general corpus. The valence of perception only shows a decreasing trend in the general corpus.
Figure 2: Valence index over the study period (1970-2016).

Figure 3: Breadth score over five-year intervals (1970-2014).

Breadth: The linear regression models testing the trend for the cosine distance of sentential contexts containing targets show significant increasing trends for mental health, mental illness, and perception in the psychology corpus, reflecting greater sentence diversity, with a decrease for mental health and an increase for perception in the general corpus, as shown in Figure 3.

Intensity: Figure 4 shows the significant rise and fall in the use of intensifiers to modify mental illness in the psychology corpus, but no trend in the general corpus. Examining the top-ranked adjective modifiers in each decade (Table 4 and Table 7 in Appendix E) reveals that "severe", "serious", "major", and "chronic" come to be more associated with mental illness from the 1990s onwards. Although mental health is not frequently modified by intensifiers, as expected, "poor" and "positive" remain closely associated with it across the decades, with "maternal" becoming more associated with mental health from the 1990s onwards. Despite demonstrating a significant increase in its intensifier index in the psychology corpus, perception does not display intensifiers among its top adjective modifiers.

Figure 5 shows a significant increasing trend in the intensity (arousal index) of mental health-related words in both corpora. For mental illness and perception, the index increases significantly in the psychology corpus, and only perception shows an increasing trend in the general corpus.

Figure 4: Intensifier index for mental illness over the study period (1970-2016).

Thematic content: The target concepts, mental health and mental illness, and the control, perception, become significantly more associated with pathology-related terms in the psychology corpus, and all targets except mental health do so in the general corpus, as shown in Figure 6. Inspecting the top ten ranked collocates for the main target terms (see Appendix F) shows the presence of only two of the 17 pathology-related terms in the psychology and general corpora ("disorder" and "treatment"), and no pathology-related terms among the top-ranked collocates for the control. The diversity of terms among the top-ranked collocates for mental health and mental illness indicates that multiple themes are present in the semantic space.

Salience: Figure 7 illustrates that the relative frequencies rise significantly for both target concepts, mental health and mental illness, in both corpora. The relative frequency of perception increases significantly in the psychology corpus and remains relatively stable in the general corpus. The significance of the trends was determined by examining standardized beta coefficients and their associated standard errors (see Table 17). As shown in Appendix G, the strongest effect sizes can be observed for the two target terms with breadth (both

Figure 7: Normalized term frequencies for the general and psychology corpora (1970-2016).
8. Limitations

Limitations inspire future directions. The procedures employed in the present study are simply a first implementation of the framework. Future research should refine its computational methodology by enhancing or replacing procedures with more robust or sensitive alternatives. While the Warriner norms data we used (i) follow a rigorous and reliable rating procedure, (ii) are highly interpretable, and (iii) have high face validity, future work might consider alternative methods in addition to closed-vocabulary approaches (Eichstaedt et al., 2021). The current method could be compared against publicly available BERT-based models fine-tuned for sentiment analysis (Goworek and Dubossarsky, 2024), VADER (a rule-based sentiment analysis tool; Hutto and Gilbert, 2014), or other sentiment-emotion lexica (Boyd-Graber et al., 2022; Mohammad, 2018). Ideally, the approach will capture the nuanced sentiment contributions of the target word, which averaging the sentiment of contexts fails to capture (Goworek and Dubossarsky, 2024). Robustness checks should be conducted on new methods by comparing their convergent validity against the existing method to evaluate the extent to which the alternatives correlate when applied to the same dataset. In addition, because the target term's semantic broadening is operationalized as the cosine dissimilarity of the target's sentential contextual usages, it only differentiates between quantitatively (not qualitatively) different meanings. Future work should introduce more fine-grained follow-up analyses by, for example, identifying hypernymy or using state-of-the-art word-in-context (WiC) models, like XL-LEXEME (Cassotti et al., 2023), which beats GPT-4 on the WiC task and BERT, mBERT, and XLM-R on the graded change detection task (Periti and Tahmasebi, 2024). It should also introduce a diachronic analysis to examine whether the target's prototypical meaning has been diluted or intensified.

Additionally, while the present study includes a neutral control term, future work should evaluate how to (semi-)automatically identify baseline semantic change in the global corpus (a stability axis) against which to normalize the semantic change of the target concepts. A control condition where no change of meaning is expected could also be set up (Dubossarsky et al., 2017) using a chronologically shuffled corpus, so that the assumed changes become uniform and any detected change is an artefact (reflecting random "noise" rather than variation over time). To better capture themes, future work should develop a bottom-up approach, rather than a top-down dictionary-based one, by using topic modeling or clustering contextualized word embeddings (Montariol et al., 2021) and evaluating the target's proximity to the centroid of the semantic category cluster. These methods might reveal senses or domains without imposing a dictionary on the semantic space. It will also be crucial to consider LLM approaches to lexical semantic change (Wang and Choi, 2023).

With regard to substantive studies, it will be important to make a general case for the framework by, ideally, finding an existing data set that includes annotated examples of semantic change for evaluation and estimation of the recall/coverage of the methods. In addition, our findings should be extended by applying the framework to a wider assortment of mental health-related concepts such as diagnostic terms (e.g., anxiety, depression, autism, obsessive-compulsive disorder, schizophrenia, attention-deficit hyperactivity disorder). Characterizing how specific diagnoses have altered their meanings in a differentiated, multi-dimensional manner will illuminate historical changes that have only been the focus of theoretical speculation and qualitative research to date (e.g., Brinkmann, 2016; Horwitz and Wakefield, 2007, 2012; Parrott, 2023). Future research can also capitalize on the new framework to explore possible causal relationships between dimensions, such as whether rising salience drives conceptual broadening (Haslam et al., 2021), whether rising breadth of mental illness-related concepts drives improvements in sentiment (a destigmatization process), and whether trade-offs exist (e.g., rising breadth may lead to shifts in intensity). Studies already point to related laws of semantic change, finding that sentiment change is associated with semantic change (Goworek and Dubossarsky, 2024). Future studies should conduct fine-grained analyses of semantic shifts in discourse around mental health to examine how online group dynamics and macro social and cultural shifts (e.g., prevailing stereotypes and stigma towards social groups; see Garg et al., 2018; Charlesworth and Hatzenbuehler, 2024; Durrheim et al., 2023) contribute to observed semantic shifts and possibly to the social transmission of mental disorders, as shown in adolescent peer networks (Alho et al., 2024). Ideally, studies will be conducted with many corpora (e.g., news, social media) with high frequencies of the target terms.
9. Ethics Statement

We do not identify any foreseeable risks or potential for harmful use of our work. Analyses use licensed data that are openly accessible for academic purposes, ensuring transparency and accountability.

References

Naveen Badathala, Abisek Rajakumar Kalarani, Tejpalsingh Siledar, and Pushpak Bhattacharyya. 2023. A match made in heaven: A multi-task framework for hyperbole and metaphor detection. In Findings of the Association for Computational Linguistics: ACL 2023, pages 388–401, Toronto, Canada. Association for Computational Linguistics.

Naomi Baes, Nick Haslam, and Ekaterina Vylomova. 2023a. Semantic shifts in mental health-related concepts. In Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, pages 119–128, Singapore. Association for Computational Linguistics.

Naomi Baes, Ekaterina Vylomova, Michael J. Zyphur, and Nick Haslam. 2023b. The semantic inflation of "trauma" in psychology. Psychology of Language and Communication, 27(1):23–45.

Pierluigi Cassotti, Lucia Siciliani, Marco DeGemmis, Giovanni Semeraro, and Pierpaolo Basile. 2023. XL-LEXEME: WiC pretrained model for cross-lingual LEXical sEMantic changE. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1577–1585, Toronto, Canada. Association for Computational Linguistics.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and
crosslingual focused evaluation. In Proceedings Johannes C Eichstaedt, Margaret L Kern, David B
of the 11th International Workshop on Semantic Yaden, H Andrew Schwartz, Salvatore Giorgi, Gre-
Evaluation (SemEval-2017), pages 1–14, Vancouver, gory Park, Courtney A Hagan, Victoria A Tobolsky,
Canada. Association for Computational Linguistics. Laura K Smith, Anneke Buffone, et al. 2021. Closed-
and open-vocabulary approaches to text analysis: A
Tessa ES Charlesworth and Mark L Hatzenbuehler. review, quantitative comparison, and recommenda-
2024. Mechanisms upholding the persistence of tions. Psychological Methods, 26(4):398.
stigma across 100 years of historical text. Scientific
Reports, 14(1):11069. Lauren Fonteyn and Enrique Manjavacas. 2021. Adjust-
ing scope: A computational approach to case-driven
Brodie C Dakin, Melanie J McGrath, Joshua J Rhee, research on semantic change. In CHR, pages 280–
and Nick Haslam. 2023. Broadened concepts of harm 298.
appear less serious. Social Psychological and Per-
sonality Science, 14(1):72–83. Allen Frances. 2013. Saving Normal: An Insider’s Re-
volt Against Out-of-Control Psychiatric Diagnosis,
Mark Davies. 2008. The corpus of contempo- DSM-5, Big Pharma, and the Medicalization of Ordi-
rary american english (COCA). https://www. nary Life. HarperCollins Publishers (Australia) Pty.
english-corpora.org/coca/. Ltd., Level 13, 201 Elizabeth Street, Sydney, NSW
2000, Australia.
Mark Davies. 2010. The corpus of historical american
english (coha). Available online at https://www. Frank Furedi. 2016. The cultural underpinning of con-
english-corpora.org/coha/. cept creep. Psychological Inquiry, 27(1):34–39.
Simon De Deyne, Daniel J. Navarro, Amy Perfors, Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and
Marc Brysbaert, and Gert Storms. 2019. The “small James Zou. 2018. Word embeddings quantify 100
world of words” english word association norms for years of gender and ethnic stereotypes. Proceedings
over 12,000 cue words. Behavior Research Methods, of the National Academy of Sciences, 115(16):E3635–
51(3):987–1006. E3644.
Jader Martins Camboim de Sá, Marcos Da Silveira, and Dirk Geeraerts. 2010. Theories of lexical semantics.
Cédric Pruski. 2024. Survey in characterization of Oxford University Press.
semantic change. arXiv preprint arXiv:2402.19088. Roksana Goworek and Haim Dubossarsky. 2024. To-
ward sentiment aware semantic change analysis. In
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Proceedings of the 18th Conference of the European
Kristina Toutanova. 2019. Bert: Pre-training of deep
Chapter of the Association for Computational Lin-
bidirectional transformers for language understand-
guistics: Student Research Workshop, pages 350–
ing. In Proceedings of the 2019 Conference of the
357.
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo- William L Hamilton, Jure Leskovec, and Dan Jurafsky.
gies, Volume 1 (Long Papers), pages 4171–4186. 2016a. Cultural shift or linguistic drift? comparing
two computational measures of semantic change. In
Liviu P. Dinu, Ioan-Bogdan Iordache, Ana Sabina Uban, Proceedings of the conference on empirical methods
and Marcos Zampieri. 2021. A computational ex- in natural language processing. Conference on empir-
ploration of pejorative language in social media. In ical methods in natural language processing, volume
Findings of the Association for Computational Lin- 2016, page 2116. NIH Public Access.
guistics: EMNLP 2021, pages 3493–3498, Punta
Cana, Dominican Republic. Association for Compu- William L. Hamilton, Jure Leskovec, and Dan Jurafsky.
tational Linguistics. 2016b. Diachronic word embeddings reveal statisti-
cal laws of semantic change. In Proceedings of the
Haim Dubossarsky, Daphna Weinshall, and Eitan Gross- 54th Annual Meeting of the Association for Compu-
man. 2017. Outta control: Laws of semantic change tational Linguistics (Volume 1: Long Papers), pages
and inherent biases in word representation models. 1489–1501, Berlin, Germany. Association for Com-
In Proceedings of the 2017 Conference on Empiri- putational Linguistics.
cal Methods in Natural Language Processing, pages
1136–1145, Copenhagen, Denmark. Association for Nick Haslam. 2016. Concept creep: Psychology’s ex-
Computational Linguistics. panding concepts of harm and pathology. Psycholog-
ical Inquiry, 27(1):1–17.
Paul Dudgeon. 2017. Some improvements in confi-
dence intervals for standardized regression coeffi- Nick Haslam and Naomi Baes. 2024. What should we
cients. Psychometrika, 82:928–951. call mental ill health? historical shifts in the popular-
ity of generic terms. PLOS Ment Health, 1(1).
Kevin Durrheim, Maria Schuld, Martin Mafunda, and
Sindisiwe Mazibuko. 2023. Using word embeddings Nick Haslam, Brodie C Dakin, Fabian Fabiano,
to investigate cultural biases. British Journal of So- Melanie J McGrath, Joshua Rhee, Ekaterina Vylo-
cial Psychology, 62(1):617–629. mova, Morgan Weaving, and Melissa A Wheeler.
2020. Harm inflation: Making sense of concept creep. Lev Lafayette, Greg Sauter, Linh Vu, and Bernard
European Review of Social Psychology, 31(1):254– Meade. 2016. Spartan performance and flexibil-
286. ity: An hpc-cloud chimera. OpenStack Summit,
Barcelona, 27:6.
Nick Haslam, Ekaterina Vylomova, Michael J. Zy-
phur, and Yoshihisa Kashima. 2021. The cultural David E. Levari, Daniel T. Gilbert, Timothy D. Wil-
dynamics of concept creep. American Psychologist, son, Baruch Sievers, David M. Amodio, and Thalia
76(6):1013–1026. Wheatley. 2018. Prevalence-induced concept change
in human judgment. Science, 360(6396):1465–1467.
Simon Hengchen, Nina Tahmasebi, Dominik
Schlechtweg, and Haim Dubossarsky. 2021. Chal- Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-
lenges for computational lexical semantic change. feng Gao. 2019. Multi-task deep neural networks for
In Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang natural language understanding. In Proceedings of
Xu, and Simon Hengchen, editors, Computational the 57th Annual Meeting of the Association for Com-
approaches to semantic change, pages 341–372. putational Linguistics, pages 4487–4496, Florence,
Language Science Press, Berlin. Italy. Association for Computational Linguistics.
Bjørn Hofmann. 2016. Medicalization and overdiagno- Yu Luo, Dan Jurafsky, and Beth Levin. 2019. From in-
sis: Different but alike. Medicine, Health Care and sanely jealous to insanely delicious: Computational
Philosophy, 19(2):253–264. models for the semantic bleaching of english intensi-
fiers. In Proceedings of the 1st International Work-
Allan V. Horwitz and Jerome C. Wakefield. 2007. The shop on Computational Approaches to Historical
loss of sadness: How psychiatry transformed normal Language Change, pages 1–13.
sorrow into depressive disorder. Oxford University
Press. Christopher D Manning. 2022. Human language under-
standing & reasoning. Daedalus, 151(2):127–138.
Allan V. Horwitz and Jerome C. Wakefield. 2012. All we
have to fear: Psychiatry’s transformation of natural Rowan Hall Maudslay and Simone Teufel. 2022.
anxieties into mental disorders. Oxford University Metaphorical polysemy detection: Conventional
Press. metaphor meets word sense disambiguation. In Pro-
ceedings of the 29th International Conference on
Clayton Hutto and Eric Gilbert. 2014. Vader: A parsi- Computational Linguistics, pages 65–77, Gyeongju,
monious rule-based model for sentiment analysis of Republic of Korea. International Committee on Com-
social media text. In Proceedings of the International putational Linguistics.
AAAI Conference on Web and Social Media, pages
216–225. Tomas Mikolov, Kai Chen, Gregory S. Corrado, and
Jeffrey Dean. 2013. Efficient estimation of word
Andrew T. Jebb, Louis Tay, Wei Wang, and Qiming representations in vector space. In International Con-
Huang. 2015. Time series analysis for psychological ference on Learning Representations.
research: Examining and forecasting change. Fron-
tiers in Psychology, 6. Saif Mohammad. 2018. Obtaining reliable human rat-
ings of valence, arousal, and dominance for 20,000
Daniel Jurafsky and James H. Martin. 2023. Vector English words. In Proceedings of the 56th Annual
Semantics and Embeddings. Draft of February 3, Meeting of the Association for Computational Lin-
2024. Draft chapters available online: https://web. guistics (Volume 1: Long Papers), pages 174–184,
stanford.edu/~jurafsky/slp3/. Melbourne, Australia. Association for Computational
Linguistics.
Li Kong, Chuanyi Li, Jidong Ge, Bin Luo, and Vin-
cent Ng. 2020. Identifying exaggerated language. In Stefano Montanelli and Fabio Periti. 2023. A survey
Proceedings of the 2020 Conference on Empirical on contextualised semantic shift detection. arXiv,
Methods in Natural Language Processing (EMNLP), arXiv:2304.01666.
pages 7024–7034, Online. Association for Computa-
tional Linguistics. Syrielle Montariol, Matej Martinc, and Lidia Pivovarova.
2021. Scalable and interpretable semantic change
Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, detection. In Proceedings of the 2021 Conference
and Erik Velldal. 2018. Diachronic word embeddings of the North American Chapter of the Association
and semantic shifts: a survey. In Proceedings of the for Computational Linguistics: Human Language
27th International Conference on Computational Lin- Technologies, pages 4642–4652.
guistics, pages 1384–1397, Santa Fe, New Mexico,
USA. Association for Computational Linguistics. Charles Egerton Osgood, William H May, and Murray S
Miron. 1975. Cross-Cultural Universals of Affective
Andrey Kutuzov, Erik Velldal, and Lilja Øvrelid. 2022. Meaning. University of Illionois Press.
Contextualized embeddings for semantic change de-
tection: Lessons learned. In Northern European Joel Paris. 2020. Overdiagnosis in Psychiatry: How
Journal of Language Technology, Volume 8, Copen- Modern Psychiatry Lost Its Way While Creating a Di-
hagen, Denmark. Northern European Association of agnosis for Almost All of Life’s Misfortunes. Oxford
Language Technology. University Press.
Scott Parrott. 2023. PTSD in the news: Media framing, Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016.
stigma, and myths about mental illness. Electronic Improving hypernymy detection with an integrated
News, 17(3):181–197. path-based and distributional method. In Proceed-
ings of the 54th Annual Meeting of the Association for
Jeffrey Pennington, Richard Socher, and Christopher D Computational Linguistics (Volume 1: Long Papers),
Manning. 2014. Glove: Global vectors for word rep- pages 2389–2398, Berlin, Germany. Association for
resentation. In Proceedings of the 2014 conference Computational Linguistics.
on empirical methods in natural language processing
(EMNLP), pages 1532–1543. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-
Yan Liu. 2020. MPNet: Masked and permuted pre-
Francesco Periti and Nina Tahmasebi. 2024. A sys- training for language understanding. Advances in
tematic comparison of contextualized word embed- neural information processing systems, 33:16857–
dings for lexical semantic change. arXiv preprint 16867.
arXiv:2402.12011.
Cass R. Sunstein. 2018. The power of the normal.
SSRN. SSRN Scholarly Paper ID 3239204. Social
Ivan Jacob Agaloos Pesigan, Rong Wei Sun, and Shu Fai
Science Research Network. https://doi.org/10.
Cheung. 2023. betadelta and betasandwich: Con-
2139/ssrn.3239204.
fidence intervals for standardized regression coef-
ficients in r. Multivariate Behavioral Research, Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2021.
58(6):1183–1186. Survey of computational approaches to lexical seman-
tic change detection. In Nina Tahmasebi, Lars Borin,
Steven Pinker. 2011. The Better Angels of Our Nature: Adam Jatowt, Yang Xu, and Simon Hengchen, edi-
Why Violence Has Declined. Viking Books. tors, Computational approaches to semantic change,
pages 1–91. Language Science Press, Berlin.
Nils Reimers and Iryna Gurevych. 2019. Sentence-
BERT: Sentence embeddings using Siamese BERT- Xuri Tang. 2018. A state-of-the-art of semantic
networks. In Proceedings of the 2019 Conference on change computation. Natural Language Engineering,
Empirical Methods in Natural Language Processing 24(5):649–676.
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages Yufei Tian, Arvind Krishna Sridhar, and Nanyun Peng.
3982–3992, Hong Kong, China. Association for Com- 2021. Hypogen: Hyperbole generation with com-
putational Linguistics. monsense and counterfactual knowledge. In Find-
ings of the Association for Computational Linguistics:
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. EMNLP 2021, pages 1583–1593.
2020. A primer in BERTology: What we know about
how BERT works. Transactions of the Association Jesse S. Y. Tse and Nick Haslam. 2021. In-
for Computational Linguistics, 8:842–866. clusiveness of the concept of mental disorder
and differences in help-seeking between asian
James A. Russell. 2003. Core affect and the psychologi- and white americans. Frontiers in Psychology,
cal construction of emotion. Psychological Review, 12. https://www.frontiersin.org/articles/
110(1):145–172. 10.3389/fpsyg.2021.699750.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Hayato Tsukagoshi, Ryohei Sasano, and Koichi Takeda.
Thomas Wolf. 2019. Distilbert, a distilled version 2022. Comparison and combination of sentence em-
of bert: smaller, faster, cheaper and lighter. ArXiv, beddings derived from different supervision signals.
abs/1910.01108. In Proceedings of the 11th Joint Conference on Lex-
ical and Computational Semantics, pages 139–150,
Seattle, Washington. Association for Computational
Norman Sartorius. 2007. Stigma and mental health.
Linguistics.
The Lancet, 370(9590):810–811.
Stephen Ullmann. 1962. Semantics: An Introduction to
Nina Schneidermann, Daniel Hershcovich, and Bolette the Science of Meaning. Blackwell.
Pedersen. 2023. Probing for hyperbole in pre-trained
language models. In Proceedings of the 61st An- Ekaterina Vylomova and Nick Haslam. 2021. Semantic
nual Meeting of the Association for Computational changes in harm-related concepts in english. In Nina
Linguistics (Volume 4: Student Research Workshop), Tahmasebi, Lars Borin, Adam Jatowt, Yue Xu, and Si-
pages 200–211, Toronto, Canada. Association for mon Hengchen, editors, Computational Approaches
Computational Linguistics. to Semantic Change. Language Science Press.
Georg Schomerus, Stephanie Schindler, Christian Ekaterina Vylomova, Sean Murphy, and Nick Haslam.
Sander, Eva Baumann, and Matthias C Angermeyer. 2019. Evaluation of semantic change of harm-related
2022. Changes in mental illness stigma over 30 years– concepts in psychology. In Proceedings of the 1st In-
improvement, persistence, or deterioration? Euro- ternational Workshop on Computational Approaches
pean Psychiatry, 65(1):e78. to Historical Language Change, pages 29–34.
Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and A. Appendix A
Timothy Baldwin. 2016. Take and took, gaggle and
goose, book and read: Evaluating the utility of vector To elaborate on what a word being low or high
differences for lexical relation learning. In Proceed- "arousal" or "valence" means, Warriner et al. (2013)
ings of the 54th Annual Meeting of the Association for defined them in the following way when (valid)
Computational Linguistics (Volume 1: Long Papers),
pages 1671–1682, Berlin, Germany. Association for participants made direct judgements of the large
Computational Linguistics. sample of words on the measured attributes (n =
419: valence; n = 448: arousal; 16-87 years; ma-
Ruiyu Wang and Matthew Choi. 2023. Large language
models on lexical semantic change detection: An
jority were female (60%), English native language
evaluation. arXiv preprint arXiv:2312.06002. speakers, held a college degree):
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan • Valence: "You are invited to take part in the
Yang, and Ming Zhou. 2020. Minilm: Deep self- study that [...] concerns how people respond
attention distillation for task-agnostic compression
of pre-trained transformers. Advances in Neural In-
to different types of words. You will use a scale
formation Processing Systems, 33:5776–5788. to rate how you felt while reading each word.
[...] The scale ranges from 1 (happy) to 9 (un-
Amy Beth Warriner, Victor Kuperman, and Marc Brys- happy). At one extreme of this scale, you are
baert. 2013. Norms of valence, arousal, and domi-
nance for 13,915 english lemmas. Behavior Research happy, pleased, satisfied, contented, hopeful.
Methods, 45(4):1191–1207. When you feel completely happy you should
indicate this by choosing rating 1. The other
Melissa A Wheeler, Melanie J McGrath, and Nick
Haslam. 2019. Twentieth century morality: The rise end of the scale is when you feel completely
and fall of moral concepts from 1900 to 2007. PLOS unhappy, annoyed, unsatisfied, melancholic,
ONE, 14(2):e0212267. despaired, or bored. You can indicate feeling
completely unhappy by selecting 9. The num-
WHO. 2021. Comprehensive mental health action plan
2013–2030. bers also allow you to describe intermediate
feelings of pleasure, by selecting any of the
Yu Xiao, Naomi Baes, Ekaterina Vylomova, and Nick other feelings. If you feel completely neutral,
Haslam. 2023. Have the concepts of ‘anxiety’ and
‘depression’ been normalized or pathologized? a cor- neither happy nor sad, select the middle of the
pus study of historical semantic change. PLOS ONE, scale (rating 5)."
18(6):e0288027.
• Arousal: “You are invited to take part in the
Arda Yüksel, Berke Uğurlu, and Aykut Koç. 2021. Se- study that [...] concerns how people respond
mantic change detection with gaussian word embed-
dings. IEEE/ACM Transactions on Audio, Speech,
to different types of words. You will use a
and Language Processing, 29:3349–3361. scale to rate how you felt while reading each
word. [...] The scale ranges from 1 (excited)
to 9 (calm). At one extreme of this scale, you
are stimulated, excited, frenzied, jittery, wide-
awake, or aroused. When you feel completely
aroused you should indicate this by choosing
rating 1. The other end of the scale is when
you feel completely relaxed, calm, sluggish,
dull, sleepy, or unaroused. You can indicate
feeling completely calm by selecting 9. The
numbers also allow you to describe intermedi-
ate feelings of calmness/arousal, by selecting
any of the other feelings. If you feel com-
pletely neutral, not excited nor at all calm,
select the middle of the scale (rating 5).”
B. Appendix B C. Appendix C
Total lines where target term appears in the text for Breadth Model Selection
both corpora (1970-2016): for the General corpus: The top three (pre-trained) sentence transformer
mental_health = 3,233; mental_illness = 1,559, per- models were chosen, ranked by their performance
ception = 9,440; for the Psychology corpus (1970- in embedding sentences.11 The best-performing
2016): mental_health = 26,482; mental_illness = model on the semantic textual similarity bench-
4,219, perception = 54,694. mark,12 Multi-Task Deep Neural Network (Liu
et al., 2019), was unavailable.13 See Table 2 for
descriptive statistics of models.
• "all-mpnet-base-v2"14 is maintained by the
SentenceTransformers community and excels
in encoding sentences across 14 diverse tasks
from different domains using the MPNet
(Masked and Permuted Pre-training for Lan-
guage Understanding) (Song et al., 2020) ar-
chitecture.
• "all-distilroberta-v1"15 uses a distilled ver-
sion of "distilroberta-base" (Sanh et al., 2019),
based on BERT architecture, employing
knowledge distillation during pre-training and
a triple loss (language modeling, distillation
and cosine-distance losses) to leverage the in-
ductive biases of LLMs during pre-training.
• "all-MiniLM-L6-v2"16 uses the MiniLM ar-
chitecture (Wang et al., 2020) employing deep
self-attention distillation (using self-attention
relation distillation for task-agnostic compres-
sion of pre-trained Transformers).
• Additionally, "bert-base-uncased"17 (Devlin
et al., 2019) was included for comparison,
although its network structure prohibits the
direct comparison of sentence embeddings,
and BERT maps sentences to a vector space
that is unsuitable for use with common simi-
larity measures and performs below average
GloVe embeddings on STS tasks (Reimers
Figure 8: Annual counts of articles where target terms
appear in the main text (1970-2016). Note: Top three and Gurevych, 2019).
panels = Psychology corpus; bottom three panels = 11
https://www.sbert.net/docs/pretrained_models.
General corpus. html
12
https://paperswithcode.com/sota/
semantic-textual-similarity-on-sts-benchmark
13
See https://github.com/namisan/mt-dnn
14
"all-mpnet-base-v2" from Hugging Face,
sentence-transformers: https://huggingface.co/
sentence-transformers/all-mpnet-base-v2
15
"all-distilroberta-v1" from Hugging Face,
sentence-transformers: https://huggingface.co/
sentence-transformers/all-distilroberta-v1
16
"all-MiniLM-L6-v2": https://huggingface.co/
sentence-transformers/all-MiniLM-L6-v2
17
"bert-base-uncased": https://huggingface.co/
google-bert/bert-base-uncased
Model Info all-mpnet-base- all-distilroberta- all-MiniLM-L6- bert-base-uncased
v2* v1* v2*
Table 2: Summary of language models sampled in the present study. Note: * = embeddings are normalized. + =
Average performance on encoding sentence over 14 tasks over 14 diverse tasks from different domains (14 datasets).
SNL = 570k sentence pairs annotated with labels. Multi-Genre NLI = 430k sentence pairs covering spoken and
written text. BookCorpus = 11,038 unpublished books scraped from the Internet.
Model Comparison: Test Sample
First, we compared similarity scores for sentence
embedding pairs for each sentence transformer
model to get a qualitative understanding of the cap-
tured dimensions. After feeding seven sample sen-
tences through each sentence transformer model
for encoding, similarity arrays of each sentence
embedding pair were compared. Tokenization and
preprocessing is handled as part of the sentence
transformers library.
• 1 = "I didn’t want to believe I had any men- Figure 10: Cosine similarity matrix for sentence embed-
tal_health issues and went into denial." dings using the "all-distilroberta-v1" model.
Breadth Measure
Figure 13: Breadth score over five-year intervals for each model (1970- 2014). Note: Model order demonstrates
rank of cosine distance score at the final data point (2010-2014) from highest to lowest.
D. Appendix D
To create the general corpus, a rigorous procedure
was followed. We first combined two related cor-
pora: the Corpus of Historical American English
(CoHA; Davies, 2008) and the Corpus of Contem-
porary American English (CoCA; Davies, 2008).
CoHA contains 400 million words from 1810-
2009, drawn from 115,000 texts distributed across
everyday publications (fiction, magazines, news-
papers, and non-fiction books). CoCA contains
560 million words from 1990-2019 drawn from
500,000 texts (from spoken language, TV shows,
academic journals, fiction, magazines, newspapers,
and blogs). After merging the two corpora, the com-
bined corpus spanning 1810-2019 was processed
following recommendations from Alatrash et al.
(2020) to clean it without compromising the quali-
tative and distributional properties of the data. This
process included first excluding the special token
“@”, which appears in 5% of the CoHA corpus (in-
troduced for legal reasons), malformed tokens that
are possible artifacts of the digitization process or
the data processing, and clean-up performed using
the web interface (“&c?;”, “q!”, “|p130”, “NUL”),
and removing escaped HTML characters (“ ( STAR
) ”, “<p>”, “<>”). Other symbols were excluded
after manual inspection of the corpus (e.g., “ // ”,
“ | ”, “ – ”, “*”, “..”, “PHOTO”, “( COLOR )”,
“ ILLUSTRATION ”, “/”). Blogs were also ex-
cluded (89,054 web articles; 98,788 blogs) for not
containing associated year data, and 25,418 aca-
demic texts were removed. Forty-one lines were
removed for missing text data (3 fiction, 11 news,
25 magazines, 2 spoken text) and 32 lines were re-
moved for column misalignment (15 mag, 15 news,
1 fiction, 1 tv). The cleaned corpus was then lower-
cased and punctuation (commas, periods, question
marks), function words, numerals and academic
texts were removed. The final combined corpus
contained 822,620,111 words from 344,634 texts:
30,496 fiction books, 136,476 magazines, 113,421
newspapers, 2,635 non-fiction books, 43,209 spo-
ken language and 18,397 TV shows. The current
study restricted the corpus period from 1970 to
2016 using 501,415,577 tokens from 244,552 ar-
ticles (23,855 fiction; 88,641 magazines; 73,557
news; 1,498 non-fiction; 40,036 spoken; 16,965
TV).
E. Appendix E
F. Appendix F
1970 1980 1990 2000 2010
1970 1980 1990 2000 2010 department center have say have
community service service service service state institute national have say
center community child problem problem center service institute national issue
service professional professional child child health fund care institute care
program center use use study
city have service child problem
professional problem problem care use
child use care study care
director national professional care system
school study study treatment treatment institute allow abuse community service
problem social treatment professional need national commissioner
state need health
group child need need outcome new department center service physical
worker program community health physical program oak department problem professional
Table 9: Top 10 Warriner-matched collocates of mental Table 12: Top 10 Warriner-matched collocates of mental
health in the psychology corpus (terms are ranked by health in the general corpus (terms are ranked by their
their relative count for the respective decade) relative count for the respective decade)
G. Appendix G
Year effect sizes for indices operationalizing major dimensions of lexical semantic change in the psychology
corpus (filled circles) and general corpus (empty circles). Note: First degree = Linear; Second degree = Quadratic.
Vertical dotted line = Standardized beta coefficient of 0; Standard errors (SE) that overlap line indicate that the null
hypothesis can be rejected at the 5% significance level.
Table 15: Regression Coefficients (Scaled) and Fit Statistics Predicting Intensifier Indices as a Function of Year.
Note: * = p-value for the overall model = <.001. Regression coefficients are unstandardized. For mental_illness
in psychology, residuals were autocorrelated, and outcome variable was re-fit with Generalized Least Squares
approach, yielding: B = 0.74; SE = 0.09; p < .001; RSE(DF) = 0.62(47,44); BIC = 108.52.
Index Concept Corpus B SE p F (DF) Adj. R2
Psychology -0.003 3 × 10−4 <.001 122.65 (1,45) 0.73
Mental Health
General -0.005 0.003 .071 3.55 (1,25) 0.09
Valence Psychology -0.002 9 × 10−4 .057 3.82 (1,45) 0.058
Mental Illness
General 0.01 0.005 .011 7.62 (1,25) 0.20
Psychology −1 × 10−5 2 × 10−4 .949 0.004 (1,45) -0.02
Perception
General -0.002 0.002 .188 1.84 (1,25) 0.03
Psychology 0.001 3 × 10−4 0.001 28.19 (1,7) 0.77
Mental Health
General -0.001 7 × 10−4 .213 2.49 (1,3) 0.27
Breadth Psychology 0.002 4 × 10−4 .006 14.99 (1,7) 0.64
Mental Illness
General −6 × 10−6 6× 10−4 .992 1× 10−4 (1,3) -0.33
Psychology 0.001 3 × 10−4 0.006 15.12 (1,7) 0.64
Perception
General 7 × 10−4 3 × 10−4 .076 7.13 (1,3) 0.61
Psychology 0.003 3 × 10−4 <.001 89.38 (1,45) 0.66
Mental Health
General 0.005 0.002 <.001 7.83 (1,25) 0.21
Arousal Psychology 0.003 9× 10−4 <.001 7.51 (1,45) 0.12
Mental Illness
General 0.002 0.003 .462 0.56 (1,25) -0.02
Psychology 0.001 2 × 10−4 <.001 23.65 (1,45) 0.33
Perception
General 0.002 0.001 .148 2.22 (1,25) 0.05
Psychology 4 × 10−4 3 × 10−5 <.001 163.34 (1,45) 0.78
Mental Health
General 3 × 10−4 2 × 10−4 .130 2.48 (1,21) 0.06
Path. Psychology 2 × 10−4 1 × 10−4 .049 4.12 (1,43) 0.07
Mental Illness
General −1 × 10−4 2× 10−4 .552 0.36 (1,23) -0.03
Psychology 2 × 10−3 4 × 10−2 <.001 118.42 (1,44) 0.72
Perception
General 5 × 10−5 2 × 10−5 .051 5.95 (1,6) 0.41
Psychology 7 × 10−6 4 × 10−7 <.001 292.52 (1,45) 0.86
Mental Health
General 2× 10−7 4× 10−8 <.001 18.17 (1,45) 0.27
Salience Psychology 3 × 10−7 9 × 10−8 <.001 13.21 (1,45) 0.21
Mental Illness
General 1 × 10−7 2 × 10−8 <.001 42.21 (1,45) 0.47
Psychology 5 × 10−7 3 × 10−7 .160 2.04 (1,45) 0.02
Perception
General −3 × 10−8 6× 10−8 .568 0.33 (1,45) -0.01
Table 16: Unstandardized Regression Coefficients and Fit Statistics Predicting Indices as a Function of Year. Note:
The midrule separates the main dimensions (above) and the exploratory dimensions (below). Path. = Pathologization.
Generalized Least Squares approach also used for models with autocorrelated residuals.
• Arousal: mental_health (P): B = 0.003; SE = 3 × 10−4 ; p < .001; RSE(DF) = 0.03(47,45); BIC = -172.07
• Salience: mental_health (P): B = 7 × 10−6 ; SE = 4 × 10−7 ; p <.001; RSE(DF) = 4 × 10−5 (47,45); BIC = -767.87;
mental_illness (P): B = 3 × 10−7 ; SE = 9 × 10−7 ; p < .001; RSE(DF) = 9 × 10−6 (47,45); BIC = -895.27; perception
(P): B = 5 × 10−7 ; SE = 3 × 10−7 ; p = .160; RSE(DF) = 3 × 10−5 (47,45); BIC = -785.60; mental_illness (G): B =
1 × 10−7 ; SE = 2 × 10−8 ; p < .001; RSE(DF) = 2 × 10−6 (47,45); BIC = -1048.85
Index Concept Corpus β SE 95% CI
Psychology -0.86* 0.04 (-0.94, -0.77)
Mental Health
General -0.35 0.17 (-0.71, 0.004)
Valence Psychology -0.28* 0.12 (-0.53, -0.03)
Mental Illness
General 0.48* 0.13 (0.21, 0.76)
Psychology -0.01 0.18 (-0.37, 0.35)
Perception
General -0.26 0.15 (-0.57, 0.05)
Psychology 0.90* 0.04 (0.80, 0.99)
Mental Health
General -0.67* 0.09 (-0.95, -0.40)
Breadth Psychology 0.83* 0.14 (0.50, 1.15)
Mental Illness
General -0.01 0.69 (-2.19, 2.18)
Psychology 0.83* 0.11 (0.57, 1.09)
Perception
General 0.84* 0.22 (0.13, 1.54)
Psychology(1) 0.74* 0.05 (0.64, 0.85)
Psychology(2) -0.30* 0.10 (-0.50, -0.09)
Mental illness
General(1) -0.09 0.23 (-0.56, 0.38)
Intensifier General(2) 0.26 0.15 (-0.05, 0.57)
Psychology(1) 0.60* 0.08 (0.44, 0.76)
Psychology(2) -0.07 0.10 (-0.28, 0.13)
Perception
General(1) 0.05 0.17 (-0.30, 0.41)
General(2) -0.06 0.21 (-0.50, 0.37)
Psychology 0.82* 0.08 (0.66, 0.97)
Mental Health
General 0.49 0.15 (0.19, 0.79)
Arousal Psychology 0.38* 0.17 (0.05, 0.71)
Mental Illness
General 0.15 0.20 (-0.27, 0.57)
Psychology 0.59* 0.08 (0.44, 0.74)
Perception
General 0.29 0.22 (-0.17, 0.74)
Psychology 0.30* 0.12 (0.06, 0.53)
Mental Health
General -0.12 0.23 (-0.61, 0.36)
Pathologization Psychology 0.89* 0.02 (0.85, 0.92)
Mental Illness
General 0.32 0.20 (-0.09, 0.74)
Psychology 0.85* 0.30 (0.79, 0.92)
Perception
General 0.71 0.30 (-0.03, 1.45)
Psychology 0.93* 0.02 (0.89, 0.97)
Mental Health
General 0.54* 0.10 (0.34, 0.73)
Salience Psychology 0.48* 0.13 (0.21, 0.74)
Mental Illness
General 0.70* 0.07 (0.56, 0.83)
Psychology 0.21 0.15 (-0.10, 0.52)
Perception
General -0.09 0.15 (-0.38, 0.21)
Table 17: Standardized Regression Coefficients (β) predicting Semantic Change Indices by Year. Note: Midrule
separates main dimensions of semantic change (above). * = p: < .05. (1) = First degree. (2) = Second degree.