A Multidimensional Framework For Evaluating Lexical Semantic Change With Social Science Applications
Despite advances in detecting and modelling lexical semantic change, there is a need for a unifying framework to integrate multiple dimensions of change. The present study addresses this gap by proposing a framework which synthesizes the theoretical insights of historical linguists about the many distinct forms of diachronic lexical semantic change (e.g., Bloomfield, 1933) and aligns them with the methodological sophistication of natural language processing. The comprehensive computational framework for evaluating lexical semantic change that emerges should be valuable for computational social scientists seeking to understand and model social and cultural change.

2. Related Work

2.1 Forms of Lexical Semantic Change

Historical linguists have developed several taxonomies of the forms of lexical semantic change (Blank, 1999; Bréal, 1897; Ullmann, 1962), but Bloomfield's (1933) is one of the most well-established. Bloomfield described nine forms identified by earlier scholars: (1) narrowing: superordinate to subordinate, or when a meaning becomes more restricted (Old English mete 'all food' > meat 'edible flesh'); (2) widening: subordinate to superordinate, or specific to general expansion of meaning (Middle English dogge 'dog of a specific breed' > dog); (3) metaphor: the transfer of a name based on associations of similarity or hidden comparison (Primitive Germanic bitraz 'biting', derivative of 'I bite' > bitter 'harsh of taste'); (4) metonymy: change based on the meanings' proximity in space or time (Old English ceace 'jaw' > cheek); (5) synecdoche: the meanings are related as whole and part (pre-English stobo 'heated room' > stove); (6) hyperbole: stronger to weaker meaning by overstatement (pre-French extonare 'to strike with thunder' > to astonish; English borrowed astound and astonish from Old French); (7) meiosis (Bloomfield, 1933, refers to this class as litotes, but we use meiosis to reflect general understatement): weaker to stronger meaning by understatement (pre-English kwalljan 'to torment' > Old English cwellan 'to kill'); (8) degeneration: positive to negative connotation (Old English cnafa 'boy servant' > knave); (9) elevation: negative to positive connotation (Old English cniht 'boy, servant' > knight).

Bloomfield's classes align closely with the forms of change identified in studies of denotational and connotational meaning (Geeraerts, 2010). For denotational (referential) meaning, Geeraerts identifies (1) specialization, (2) generalization, (3) metonymy, and (4) metaphor. Specialization (semantic 'restriction' and 'narrowing') implies that the new meaning covers a subset of the old meaning's range; for generalization (or 'expansion', 'extension', 'schematization', 'broadening'), the new range includes the old meaning. Metonymy (here including synecdoche) is a "link between two readings of a lexical item based on a relationship of contiguity between the referents of the expression in each of those readings" (Geeraerts, 2010, p. 27). Conversely, metaphor is based on similarity. Geeraerts also identifies two forms of change in connotational meaning (i.e., the aspects of a word's meaning that are related to the writer or reader's emotions, sentiment, opinions, or evaluations): (1) pejorative and (2) ameliorative change (i.e., shift towards a more negative/positive emotive meaning). An example of pejoration is 'silly', which formerly meant 'deserving sympathy, helpless' but has come to mean 'showing a lack of common sense'. Amelioration is shown by 'knight', which once meant 'boy, servant'.

2.2 Expanding Concepts of Harm and Pathology

Semantic change processes such as these may partly reflect cultural, social, and political shifts, and are of interest to social science researchers. One example is social psychological research on concept creep, the semantic expansion of harm-related concepts (e.g., abuse, bullying, mental illness, prejudice, trauma, violence; Haslam, 2016). Concept creep takes two forms: harm-related concepts have expanded 'horizontally' to cover a wider range of harms and 'vertically' to encompass less intense harms. It is theorized to be driven by rising cultural sensitivity to harm (Furedi, 2016; Wheeler et al., 2019), falling societal prevalence of harm (Levari et al., 2018; Pinker, 2011), and deliberate conceptual expansion by "opprobrium entrepreneurs" (Sunstein, 2018). Concept creep is theorized to have mixed blessings (Haslam et al., 2020), trivializing harms on one hand (Dakin et al., 2023) and enhancing the recognition and redress of major harms on the other (Tse and Haslam, 2021).
Prior empirical work has evaluated concept creep in historical text corpora. Studies assessing horizontal expansion as increases in the broadening of harm concepts found that some concepts (e.g., addiction, bullying, trauma) have broadened within academic psychology (Haslam et al., 2021; Vylomova et al., 2019; Vylomova and Haslam, 2021). Recent work evaluated the vertical form of concept creep, defined as the concept's use in contexts of declining emotional intensity, and yielded mixed findings for anxiety, depression, grief, stress, and trauma (Baes et al., 2023a,b; Xiao et al., 2023).

Mental illness has become an increasingly salient term in society (Haslam and Baes, 2024), partly due to the recent prioritization of mental health in global health policy (WHO, 2021). Critics have raised concerns that the rising prominence of mental health discourse is instigating problematic changes in how people conceptualize mental ill health. Some contend that concepts of mental illness have broadened so that everyday life is increasingly pathologized (Brinkmann, 2016; Horwitz and Wakefield, 2007, 2012). Experiences that were once considered normal are now given diagnostic labels, such as using 'depression' to reference ordinary sadness (Bröer and Besseling, 2017). Alternatively, it has been argued that terms like "mental health problems" are being normalized and broadened (Sartorius, 2007), alongside increasing prevalence of mental illnesses. Some argue that concepts of mental illness are becoming less stigmatizing, although this question has only been addressed in surveys of public attitudes (e.g., Schomerus et al., 2022), rather than in changes in word connotations. In view of the widespread speculation on the ways in which concepts of mental illness have changed historically and the lack of scientific evidence of these shifts, a systematic study of conceptual change in this domain is a priority.

3. Method

3.1 Framework

The proposed framework, illustrated in Figure 1, economically reduces classes of lexical semantic change identified by historical linguists (excluding metaphor and metonymy; Geeraerts, 2010) to three dimensions. It recognizes that these classes represent opposed pairs of change types, each member corresponding to a pole on a single dimension. In essence, the framework reformulates six classes as three dimensions, allowing lexical semantic change to be quantified on three axes simultaneously rather than categorized into exclusive types. A recent survey paper (de Sá et al., 2024) has also classified semantic change into three classes of characterizations, related to a word's meaning becoming used in a more (1) pejorative or ameliorated sense (orientation), (2) metaphoric or metonymic context (relation), or (3) abstract/general or more specific/narrow context (dimension). However, their theoretical framework does not consider hyperbole/litotes.
Dimension | Rising | Falling
Sentiment | Elevation (Bloomfield, 1933); Amelioration (Ullmann, 1962) | Degeneration (Bloomfield, 1933); Pejoration (Ullmann, 1962)
Breadth | Widening (Bloomfield, 1933; Ullmann, 1962); Generalization of meaning (Blank, 1999); Horizontal Creep (Haslam, 2016)* | Narrowing (Bloomfield, 1933; Ullmann, 1962); Specialization of meaning (Blank, 1999)
Intensity | Meiosis (Bloomfield, 1933) | Hyperbole (Bloomfield, 1933); Vertical Creep (Haslam, 2016)*

Table 1: Dimensions of Lexical Semantic Change and their associated forms. * = specific to harm-related concepts.
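As a minimal illustration of how Table 1 can be operationalized, the sketch below encodes each dimension with its rising and falling poles and maps the sign of a fitted trend back to the classical change types; the structure and names (FRAMEWORK, classify) are ours, not part of the study's released code.

```python
# Minimal sketch: Table 1 as a lookup from dimension to its rising/falling poles.
FRAMEWORK = {
    "sentiment": {"rising": ["elevation", "amelioration"],
                  "falling": ["degeneration", "pejoration"]},
    "breadth":   {"rising": ["widening", "generalization", "horizontal creep"],
                  "falling": ["narrowing", "specialization"]},
    "intensity": {"rising": ["meiosis"],
                  "falling": ["hyperbole", "vertical creep"]},
}

def classify(dimension: str, slope: float) -> list[str]:
    """Map the sign of a fitted trend on one dimension to the classical change types."""
    pole = "rising" if slope > 0 else "falling"
    return FRAMEWORK[dimension][pole]

# Example: a positive trend on the breadth index corresponds to widening/generalization.
print(classify("breadth", slope=0.9))
```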
The three proposed dimensions align with established dimensions in other domains. For example, Sentiment and Intensity resemble the two primary dimensions of human emotion, Valence and Arousal (Russell, 2003), and two primary dimensions of connotational meaning, Evaluation (e.g., "good/bad") and Potency (e.g., "strong/weak") (Osgood et al., 1975), both of which have been shown to have cross-cultural validity. Although our dimensions capture the primary forms of lexical change, we argue that they can be complemented by evaluation of changes in a word's salience (i.e., relative frequency of use) and its thematic content (i.e., shifts in the specific contexts in which the word is used). These dimensions may reflect psychological, sociocultural, or cultural forces that contribute to or result from semantic change (Blank, 1999). Our case study of mental health and mental illness illustrates how attention to salience and thematic content enriches the characterization of semantic change that the three primary dimensions provide. We now turn to the details of that case study, including the computational methodologies for evaluating these dimensions. Future implementations of our three-dimensional framework are likely to include technical refinements of these methodologies; those employed in the case study simply demonstrate one way to implement it using interpretable techniques.

3.2 Sentiment

The sentiment of the target concepts (mental health and mental illness, and the control concept perception) was evaluated using valence norms from Warriner et al. (2013), which provide valence ratings for 13,915 English lemmas collected from 1,827 United States residents, ranging from low valence (1: feeling extremely "unhappy", "despaired") to high valence (9: feeling extremely "happy", "hopeful"). See Appendix A for more information regarding the valence ratings. Collocates of each target concept were extracted within a ±5-word context window (Agirre et al., 2009) and matched to the Warriner et al. norms, which showed adequate coverage for the psychology corpus but poorer coverage for the general corpus ("mental_health": psychology = 84%, general = 50%; "mental_illness": psychology = 83%, general = 48%; "perception": psychology = 84%, general = 39%). Annual counts of Warriner-matched collocates for each target concept were then extracted from the lemmatized corpora; these counts were sparse before 1990 because the general corpus contains few texts mentioning the targets before that year (see Appendix B), so analyses excluded general-corpus texts before 1990. The annual sentiment score for each concept was computed by weighting the valence rating for each collocate by its annual number of appearances, standardized by the total number of (matched) collocates in the respective year. The index represents the mean valence of terms [1,9] collocating with the target concepts, where higher scores indicate higher valence.
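As a rough illustration of this weighting scheme, the sketch below computes a yearly valence index from per-year texts and a valence lookup; the whitespace tokenizer and the toy norms are assumptions made for brevity, whereas the actual analysis operates on the lemmatized corpora and the full 13,915-lemma Warriner norms.

```python
from collections import Counter

def collocates(tokens, target, window=5):
    """Yield tokens within +/- `window` positions of each occurrence of `target`."""
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield tokens[j]

def valence_index(texts_by_year, target, valence_norms, window=5):
    """Mean Warriner valence [1,9] of norm-matched collocates, weighted by annual counts."""
    index = {}
    for year, texts in texts_by_year.items():
        counts = Counter()
        for text in texts:
            counts.update(collocates(text.lower().split(), target, window))
        matched = {w: n for w, n in counts.items() if w in valence_norms}
        total = sum(matched.values())
        if total:
            index[year] = sum(valence_norms[w] * n for w, n in matched.items()) / total
    return index

# Toy usage with hypothetical norms; the arousal index in Section 3.4 replaces the
# valence ratings with Warriner arousal ratings but is otherwise identical.
norms = {"problem": 2.6, "service": 6.2, "care": 7.0}
texts = {2010: ["community mental_health service and care", "mental_health problem"]}
print(valence_index(texts, "mental_health", norms))
```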
3.3 Breadth

The semantic broadening of the target concept was evaluated as the average inverse cosine similarity between the sentence-level embeddings of sentences containing the target term. Our method adapts previous work (Vylomova et al., 2019; Vylomova and Haslam, 2021) by replacing type-level word embeddings with contextualized sentence-level embeddings. Given that this breadth measure resembles the Semantic Textual Similarity (STS) task (Cer et al., 2017; the degree to which two sentences are semantically equivalent to each other), to select the optimal model we compared the sentence similarity scores, from corpus samples, of models that have shown good performance for encoding sentences. Many of the original Sentence-BERT models (Reimers and Gurevych, 2019) with good scores on semantic textual similarity benchmarks (Tsukagoshi et al., 2022; Reimers and Gurevych, 2019) are deprecated, so we examined and compared three public pre-trained models from the sentence-transformers library (https://www.sbert.net/docs/pretrained_models.html) that currently excel at encoding sentences. See Appendix C for more information regarding model selection, comparison, and results. The pre-trained model used in the present study, "all-mpnet-base-v2" (from Hugging Face, sentence-transformers: https://huggingface.co/sentence-transformers/all-mpnet-base-v2), performed best at detecting semantic information and encoding sentences across 14 diverse tasks from different domains.

To compute the breadth score, relevant texts were extracted from our corpora. Inspecting their frequencies showed that it was acceptable to sample 50 texts from each five-year interval (Appendix C explains interval selection). Thus, we randomly and uniformly sampled up to 50 sentences per interval and repeated the procedure 10 times to reduce sampling noise. These sentences were then passed to the sentence transformer model, "all-mpnet-base-v2" (MPNet: Masked and Permuted Pre-training for Language Understanding; Song et al., 2020), to be tokenized and encoded into embeddings representing their semantic characteristics. Cosine distance was computed for each pair of sentence vectors by inverting the similarity scores (1 - cosine similarity). The final breadth metric [0,1] was calculated by averaging scores across samples in each interval. Higher scores indicate greater breadth (dissimilarity) between sentence vectors.
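A minimal sketch of this pairwise cosine-distance computation with the sentence-transformers package is shown below; it omits the per-interval sampling of up to 50 sentences and the 10 repetitions, and the helper name breadth_score is ours.

```python
import itertools
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def breadth_score(sentences):
    """Average pairwise cosine distance (1 - cosine similarity) between sentence embeddings."""
    emb = np.asarray(model.encode(sentences))
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize for cosine
    distances = [1.0 - float(emb[i] @ emb[j])
                 for i, j in itertools.combinations(range(len(emb)), 2)]
    return float(np.mean(distances))

# Toy usage: in the study, up to 50 target-bearing sentences are sampled per
# five-year interval and the procedure is repeated 10 times before averaging.
sample = ["Her mental_health improved after treatment.",
          "Funding for community mental_health services was cut.",
          "Exercise supports mental_health and wellbeing."]
print(breadth_score(sample))
```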
3.4 Intensity

Changes in the intensity of the concepts were evaluated in two ways. First, we computed an arousal index, adapting a previously established procedure (Baes et al., 2023a,b; Xiao et al., 2023). In an equivalent manner to the sentiment analysis, we examined the collocates of each concept and computed weighted average annual ratings, using Warriner et al.'s arousal norms, which range from low arousal (1: feeling "calm", "unaroused" while reading the lemma) to high arousal (9: feeling "agitated", "aroused"). See Appendix A for more information regarding the arousal ratings. The annual arousal score for each concept was calculated by weighting the arousal rating for each collocate by its total number of appearances in each year and normalizing by the total (matched) collocate count for the respective year. The index represents the mean arousal of terms [1,9] collocating with the target concepts, where higher scores indicate higher arousal.

Second, we developed a new index to directly capture shifts in a concept's intensity. Instead of examining the arousal of the concept's collocates (regardless of their order), it examines the occurrence of intensifying expressions that directly modify the concept. If a concept increasingly appears with an intensifying modifier, it can be inferred that its unmodified meaning has become less intense. This "intensifier index" evaluates the relative frequency with which 11 adjectival modifiers ("great", "intense", "severe", "harsh", "major", "extreme", "powerful", "serious", "devastating", "destructive", "debilitating") preceded "mental health" and "mental illness". De-adjectival adverbs from Luo et al. (2019) were considered, but most were not sufficiently general (e.g., "devastating", "excruciating", "vicarious"). We used the dependency-parsed corpora (see Section 4.2) to compute the proportion of instances of each target concept that have any of the 11 terms as an adjectival modifier.
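One way to compute such a proportion with a spaCy dependency parse is sketched below, assuming the single-token targets produced by the preprocessing in Section 4.2 (e.g., "mental_illness"); it loads the small English model for convenience, whereas the study used "en_core_web_trf".

```python
import spacy

# The paper used "en_core_web_trf"; the small model keeps this sketch lightweight.
nlp = spacy.load("en_core_web_sm")

INTENSIFIERS = {"great", "intense", "severe", "harsh", "major", "extreme",
                "powerful", "serious", "devastating", "destructive", "debilitating"}

def intensifier_index(sentences, target="mental_illness"):
    """Proportion of target occurrences modified by one of the 11 intensifying adjectives."""
    occurrences = modified = 0
    for doc in nlp.pipe(sentences):
        for tok in doc:
            if tok.lower_ == target:
                occurrences += 1
                if any(child.dep_ == "amod" and child.lemma_.lower() in INTENSIFIERS
                       for child in tok.children):
                    modified += 1
    return modified / occurrences if occurrences else 0.0

print(intensifier_index(["She lives with severe mental_illness.",
                         "Stigma around mental_illness persists."]))
```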
3.5 Thematic content

Thematic content was evaluated using a top-down approach. The theme of interest was pathology, given concerns raised by critics about the pathologization of mental health and mental illness (Brinkmann, 2016; Horwitz and Wakefield, 2007, 2012). We used a pathologization dictionary developed by Baes et al. (2023a) to compute the pathologization index; the same approach can be used to construct dictionaries for other themes of interest. First, we generated unambiguously disease-related words with a restricted range of meaning: "clinical", "disorder", "symptom", "illness", "pathology", and "disease". Next, their forward word associations (participant responses to each disease-related word), drawn from the English Small World of Words project (De Deyne et al., 2019), were listed and duplicates were removed. We filtered the list for terms reflecting pathologization (i.e., viewing or characterizing something as medically or psychologically abnormal), leaving 17 terms: "ailment", "clinical", "clinic", "cure", "diagnosis", "disease", "disorder", "ill", "illness", "medical", "medicine", "pathology", "prognosis", "sick", "sickness", "symptom", "treatment". Following Baes et al. (2023a), we computed the pathologization index by dividing the appearances of the 17 terms among the target concept's collocates (±5-word context window) in a specific year by the total number of collocates in that year.
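Because the pathologization index is a dictionary match over the same ±5-word collocate windows used for the valence index, it reduces to a simple ratio once the collocate counts are in hand; the sketch below assumes such a per-year collocate Counter and is illustrative rather than the released implementation.

```python
from collections import Counter

PATHOLOGY_TERMS = {"ailment", "clinical", "clinic", "cure", "diagnosis", "disease",
                   "disorder", "ill", "illness", "medical", "medicine", "pathology",
                   "prognosis", "sick", "sickness", "symptom", "treatment"}

def pathologization_index(collocate_counts: Counter) -> float:
    """Share of a year's collocate tokens (within the +/-5-word window) that are pathology terms."""
    total = sum(collocate_counts.values())
    hits = sum(n for w, n in collocate_counts.items() if w in PATHOLOGY_TERMS)
    return hits / total if total else 0.0

# Toy usage: collocate_counts would come from the same windowing step as the valence index.
print(pathologization_index(Counter({"disorder": 3, "treatment": 2, "community": 5})))
```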
3.6 Salience

Salience was computed as the concept's annual relative frequency, using the raw corpus versions.

4. Materials

4.1 Corpora

Two corpora were chosen for their historical length, their magnitude, and their texts. The psychology corpus contained 143,575,773 tokens from 871,344 abstracts from 875 (Scimago-indexed) psychology journals, ranging from 1930 to 2019, sourced from the E-Research and PubMed databases (Vylomova et al., 2019). The journal set was distributed across all subdisciplines of psychology. The final corpus of psychology abstracts was limited to 1970-2016 due to the relatively small number of abstracts outside this period (Vylomova et al., 2019), yielding 129,980,596 tokens from 793,942 abstracts.

The second corpus is a combination of two related corpora: the Corpus of Historical American English (Davies, 2010; 1810-2009) and the Corpus of Contemporary American English (Davies, 2008; 1990-2019). Academic texts were excluded to avoid any potential overlap with psychology articles. After merging the two corpora, containing 115,000 everyday publications and >500,000 contemporary texts, the combined corpus was processed following recommendations from Alatrash et al. (2020) to maintain data integrity (see Appendix D for a comprehensive explanation). The current study restricted the corpus period from 1970 to 2016, using 501,415,577 tokens from 244,552 texts (books: 23,855 fiction, 1,498 non-fiction; 88,641 magazines; 73,557 newspapers; 40,036 spoken language; 16,965 TV shows).

4.2 Preprocessing

Analyses required three versions of the corpora: (1) a raw cleaned version transforming target concepts into single noun tokens (Sections 3.3, 3.4, and 3.6); (2) a lemmatized version (Sections 3.2, 3.4, and 3.5); and (3) a dependency-parsed version (Section 3.4). The first version, which retains punctuation, uppercasing, and numbers, was used for all analyses after transforming multiword target concepts into single tokens (e.g., "mental health" > "mental_health") using case-sensitive matching. The lemmatization pipeline included tokenization, part-of-speech tagging (skipping tokens with uninformative tags: punctuation, symbols, spaces, numbers), removal of stop words (uninformative words like "the"), and lemmatization using spaCy (https://spacy.io/). For dependency parsing we used the raw corpora, which provide more contextual information for the model to better capture relationships between words. The English transformer model "en_core_web_trf" (based on roberta-base; chosen as it demonstrates the highest accuracy on 13 evaluation tasks: https://spacy.io/models/en#en_core_web_trf) was used to preprocess the corpus on a high-performance computing system (Lafayette et al., 2016).
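A compressed sketch of the lemmatized version of this pipeline is given below; the small spaCy model, the regex used to join the multiword targets, and the function names are stand-ins for the study's actual preprocessing, which ran "en_core_web_trf" on a high-performance computing system.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")          # stand-in for "en_core_web_trf"
SKIP_POS = {"PUNCT", "SYM", "SPACE", "NUM"}  # uninformative tags skipped by the pipeline

def join_targets(text):
    """Turn multiword targets into single tokens, e.g. "mental health" > "mental_health"."""
    return re.sub(r"\bmental (health|illness)\b", r"mental_\1", text)

def lemmatize(text):
    """Tokenize, drop uninformative tags and stop words, and return lowercased lemmas."""
    doc = nlp(join_targets(text))
    return [tok.lemma_.lower() for tok in doc
            if tok.pos_ not in SKIP_POS and not tok.is_stop]

print(lemmatize("Community mental health services were expanded in 1995."))
```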
4.3 Target Concepts

Two terms were chosen to analyze levels of semantic change (Hamilton et al., 2016a): mental_health and mental_illness. We also ran control analyses using the neutral term perception, for which a fixed rate of change was expected and which demonstrated a steady rise in relative frequency starting around 1945 in the Google Ngram Viewer (https://books.google.com/ngrams/info).

4.4 Statistical Analysis

Linear regression analyses were performed to test the statistical significance of historical trends in the semantic indices (Jebb et al., 2015). Ordinary least squares served as the primary estimator, with generalized least squares used as a secondary estimator to account for auto-correlated residuals (Durbin-Watson test: p < .05). Coefficients, standard errors, and confidence intervals were standardized using the betaSandwich package (Pesigan et al., 2023), employing Dudgeon's (2017) heteroskedasticity-consistent estimator approach (HC3), which is well suited to estimation with nonnormal data and small sample sizes (Dudgeon, 2017). The code is publicly available at https://osf.io/4d7ur/.
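For readers working in Python rather than R, a rough statsmodels analogue of this procedure is sketched below; the HC3 covariance mirrors Dudgeon's estimator, but the coefficient standardization performed by betaSandwich is not reproduced, and the Durbin-Watson screen is a crude stand-in for the test reported in the paper.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

def trend_test(years, index_values):
    """OLS of an annual semantic index on year with HC3 robust standard errors,
    refit with an AR(1) GLS model when residuals look autocorrelated."""
    y = np.asarray(index_values, dtype=float)
    X = sm.add_constant(np.asarray(years, dtype=float))
    ols = sm.OLS(y, X).fit(cov_type="HC3")
    if not 1.5 < durbin_watson(ols.resid) < 2.5:  # crude autocorrelation screen
        return sm.GLSAR(y, X, rho=1).iterative_fit()
    return ols

# Toy usage with synthetic data; the study regresses each annual index on year.
years = np.arange(1970, 2017)
index = 5.5 - 0.01 * (years - 1970) + np.random.default_rng(0).normal(0, 0.05, years.size)
fit = trend_test(years, index)
print(fit.params, fit.bse)
```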
5. Results

Sentiment: The linear regression models mostly show decreasing trends for the valence index. Figure 2 shows a significant declining trend in the valence of words used in the context of mental health in the psychology corpus and the general corpus. For mental illness, the valence index shows a decreasing trend in psychology and an increase in the general corpus. The valence of perception only shows a decreasing trend in the general corpus.
Figure 2: Valence index over the study period (1970-2016).

Figure 3: Breadth score over five-year intervals (1970-2014).

Breadth: The linear regression models testing the trend for the cosine distance of sentential contexts containing targets show significant increasing trends for mental health, mental illness, and perception in the psychology corpus, reflecting greater sentence diversity, with a decrease for mental health and an increase for perception in the general corpus, as shown in Figure 3.

Intensity: Figure 4 shows the significant rise and fall in the use of intensifiers to modify mental illness in the psychology corpus, but no trend in the general corpus. Examining the top-ranked adjective modifiers in each decade (Table 4 and Table 7 in Appendix E) reveals that "severe", "serious", "major", and "chronic" come to be more associated with mental illness from the 1990s onwards. Although mental health is not frequently modified by intensifiers, as expected, "poor" and "positive" remain closely associated with it across the decades, with "maternal" becoming more associated with mental health from the 1990s onwards. Despite demonstrating a significant increase in its intensifier index in the psychology corpus, perception does not display intensifiers among its top adjective modifiers.

Figure 5 shows a significant increasing trend in the intensity (arousal index) of mental health-related words in both corpora. For mental illness and perception, the index increases significantly in the psychology corpus, and only perception shows an increasing trend in the general corpus.

Figure 4: Intensifier index for mental illness over the study period (1970-2016).

Thematic content: The target concepts, mental health and mental illness, and the control, perception, become significantly more associated with pathology-related terms in the psychology corpus, and all targets except mental health do so in the general corpus, as shown in Figure 6. Inspecting the top ten ranked collocates for the main target terms (see Appendix F) shows the presence of only two of the 17 pathology-related terms in the psychology and general corpora ("disorder" and "treatment"), and no pathology-related terms among the top-ranked collocates for the control. The diversity of terms among the top-ranked collocates for mental health and mental illness indicates that multiple themes are present in the semantic space.

Salience: Figure 7 illustrates that the relative frequencies rise significantly for both target concepts, mental health and mental illness, in both corpora. The relative frequency of perception increases significantly in the psychology corpus and remains relatively stable in the general corpus. The significance of the trends was determined by examining standardized beta coefficients and their associated standard errors (see Table 17). As shown in Appendix G, the strongest effect sizes can be observed for the two target terms with breadth (both

Figure 7: Normalized term frequencies for the general and psychology corpora (1970-2016).
8. Limitations

Limitations inspire future directions. The procedures employed in the present study are simply a first implementation of the framework. Future research should refine its computational methodology by enhancing or replacing procedures with more robust or sensitive alternatives. While the Warriner norms data we used (i) follow a rigorous and reliable rating procedure, (ii) are highly interpretable, and (iii) have high face validity, future work might consider alternative methods in addition to closed-vocabulary approaches (Eichstaedt et al., 2021). The current method could be compared against publicly available BERT-based models fine-tuned for sentiment analysis (Goworek and Dubossarsky, 2024), VADER (a rule-based sentiment analysis tool; Hutto and Gilbert, 2014), or other sentiment-emotion lexica (Boyd-Graber et al., 2022; Mohammad, 2018). Ideally, the approach will capture the nuanced sentiment contributions of the target word, which averaging the sentiment of contexts fails to capture (Goworek and Dubossarsky, 2024). Robustness checks should be conducted on new methods by comparing their convergent validity against the existing method to evaluate the extent to which the alternatives correlate when applied to the same dataset. In addition, because the target term's semantic broadening is operationalized as the cosine dissimilarity of the target's sentential contextual usages, it only differentiates between quantitatively (not qualitatively) different meanings. Future work should introduce more fine-grained follow-up analyses by, for example, identifying hypernymy or using state-of-the-art word-in-context (WiC) models, like XL-LEXEME (Cassotti et al., 2023), which beats GPT-4 on the WiC task and BERT, mBERT, and XLM-R on the graded change detection task (Periti and Tahmasebi, 2024). It should also introduce a diachronic analysis to examine whether the target's prototypical meaning has been diluted or intensified.

Additionally, while the present study includes a neutral control term, future work should evaluate how to (semi-)automatically identify baseline semantic change in the global corpus (a stability axis) against which to normalize the semantic change of the target concepts. A control condition where no change of meaning is expected could also be set up (Dubossarsky et al., 2017) using a chronologically shuffled corpus, so that the assumed changes become uniform and any detected change is an artefact (reflecting random "noise" rather than variation over time). To better capture themes, future work should develop a bottom-up approach, rather than a top-down dictionary-based one, by using topic modeling or clustering contextualized word embeddings (Montariol et al., 2021) and evaluating the target's proximity to the centroid of the semantic category cluster. These methods might reveal senses or domains without imposing a dictionary on the semantic space. It will also be crucial to consider LLM approaches to lexical semantic change (Wang and Choi, 2023).

With regard to substantive studies, it will be important to make a general case for the framework by, ideally, finding an existing data set that includes annotated examples of semantic change for evaluation and estimation of the recall/coverage of the methods. In addition, our findings should be extended by applying the framework to a wider assortment of mental health-related concepts such as diagnostic terms (e.g., anxiety, depression, autism, obsessive-compulsive disorder, schizophrenia, attention-deficit hyperactivity disorder). Characterizing how specific diagnoses have altered their meanings in a differentiated, multi-dimensional manner will illuminate historical changes that have only been the focus of theoretical speculation and qualitative research to date (e.g., Brinkmann, 2016; Horwitz and Wakefield, 2007, 2012; Parrott, 2023). Future research can also capitalize on the new framework to explore possible causal relationships between dimensions, such as whether rising salience drives conceptual broadening (Haslam et al., 2021), whether rising breadth of mental illness-related concepts drives improvements in sentiment (a destigmatization process), and whether trade-offs exist (e.g., rising breadth may lead to shifts in intensity). Studies already point to related laws of semantic change, finding that sentiment change is associated with semantic change (Goworek and Dubossarsky, 2024). Future studies should conduct fine-grained analyses of semantic shifts in discourse around mental health to examine how online group dynamics and macro social and cultural shifts (e.g., prevailing stereotypes and stigma towards social groups; see Garg et al., 2018; Charlesworth and Hatzenbuehler, 2024; Durrheim et al., 2023) contribute to observed semantic shifts and possibly to the social transmission of mental disorders, as shown in adolescent peer networks (Alho et al., 2024). Ideally, studies will be conducted with many corpora (e.g., news, social media) with high frequencies of the target terms.
9. Ethics Statement

We do not identify any foreseeable risks or potential for harmful use of our work. Analyses use licensed data that are openly accessible for academic purposes, ensuring transparency and accountability.

References

Naveen Badathala, Abisek Rajakumar Kalarani, Tejpalsingh Siledar, and Pushpak Bhattacharyya. 2023. A match made in heaven: A multi-task framework for hyperbole and metaphor detection. In Findings of the Association for Computational Linguistics: ACL 2023, pages 388–401, Toronto, Canada. Association for Computational Linguistics.

Naomi Baes, Nick Haslam, and Ekaterina Vylomova. 2023a. Semantic shifts in mental health-related concepts. In Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, pages 119–128, Singapore. Association for Computational Linguistics.

Naomi Baes, Ekaterina Vylomova, Michael J. Zyphur, and Nick Haslam. 2023b. The semantic inflation of "trauma" in psychology. Psychology of Language and Communication, 27(1):23–45.

Pierluigi Cassotti, Lucia Siciliani, Marco DeGemmis, Giovanni Semeraro, and Pierpaolo Basile. 2023. XL-LEXEME: WiC pretrained model for cross-lingual LEXical sEMantic changE. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1577–1585, Toronto, Canada. Association for Computational Linguistics.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and
crosslingual focused evaluation. In Proceedings Johannes C Eichstaedt, Margaret L Kern, David B
of the 11th International Workshop on Semantic Yaden, H Andrew Schwartz, Salvatore Giorgi, Gre-
Evaluation (SemEval-2017), pages 1–14, Vancouver, gory Park, Courtney A Hagan, Victoria A Tobolsky,
Canada. Association for Computational Linguistics. Laura K Smith, Anneke Buffone, et al. 2021. Closed-
and open-vocabulary approaches to text analysis: A
Tessa ES Charlesworth and Mark L Hatzenbuehler. review, quantitative comparison, and recommenda-
2024. Mechanisms upholding the persistence of tions. Psychological Methods, 26(4):398.
stigma across 100 years of historical text. Scientific
Reports, 14(1):11069. Lauren Fonteyn and Enrique Manjavacas. 2021. Adjust-
ing scope: A computational approach to case-driven
Brodie C Dakin, Melanie J McGrath, Joshua J Rhee, research on semantic change. In CHR, pages 280–
and Nick Haslam. 2023. Broadened concepts of harm 298.
appear less serious. Social Psychological and Per-
sonality Science, 14(1):72–83. Allen Frances. 2013. Saving Normal: An Insider’s Re-
volt Against Out-of-Control Psychiatric Diagnosis,
Mark Davies. 2008. The corpus of contempo- DSM-5, Big Pharma, and the Medicalization of Ordi-
rary american english (COCA). https://www. nary Life. HarperCollins Publishers (Australia) Pty.
english-corpora.org/coca/. Ltd., Level 13, 201 Elizabeth Street, Sydney, NSW
2000, Australia.
Mark Davies. 2010. The corpus of historical american
english (coha). Available online at https://www. Frank Furedi. 2016. The cultural underpinning of con-
english-corpora.org/coha/. cept creep. Psychological Inquiry, 27(1):34–39.
Simon De Deyne, Daniel J. Navarro, Amy Perfors, Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and
Marc Brysbaert, and Gert Storms. 2019. The “small James Zou. 2018. Word embeddings quantify 100
world of words” english word association norms for years of gender and ethnic stereotypes. Proceedings
over 12,000 cue words. Behavior Research Methods, of the National Academy of Sciences, 115(16):E3635–
51(3):987–1006. E3644.
Jader Martins Camboim de Sá, Marcos Da Silveira, and Dirk Geeraerts. 2010. Theories of lexical semantics.
Cédric Pruski. 2024. Survey in characterization of Oxford University Press.
semantic change. arXiv preprint arXiv:2402.19088. Roksana Goworek and Haim Dubossarsky. 2024. To-
ward sentiment aware semantic change analysis. In
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Proceedings of the 18th Conference of the European
Kristina Toutanova. 2019. Bert: Pre-training of deep
Chapter of the Association for Computational Lin-
bidirectional transformers for language understand-
guistics: Student Research Workshop, pages 350–
ing. In Proceedings of the 2019 Conference of the
357.
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo- William L Hamilton, Jure Leskovec, and Dan Jurafsky.
gies, Volume 1 (Long Papers), pages 4171–4186. 2016a. Cultural shift or linguistic drift? comparing
two computational measures of semantic change. In
Liviu P. Dinu, Ioan-Bogdan Iordache, Ana Sabina Uban, Proceedings of the conference on empirical methods
and Marcos Zampieri. 2021. A computational ex- in natural language processing. Conference on empir-
ploration of pejorative language in social media. In ical methods in natural language processing, volume
Findings of the Association for Computational Lin- 2016, page 2116. NIH Public Access.
guistics: EMNLP 2021, pages 3493–3498, Punta
Cana, Dominican Republic. Association for Compu- William L. Hamilton, Jure Leskovec, and Dan Jurafsky.
tational Linguistics. 2016b. Diachronic word embeddings reveal statisti-
cal laws of semantic change. In Proceedings of the
Haim Dubossarsky, Daphna Weinshall, and Eitan Gross- 54th Annual Meeting of the Association for Compu-
man. 2017. Outta control: Laws of semantic change tational Linguistics (Volume 1: Long Papers), pages
and inherent biases in word representation models. 1489–1501, Berlin, Germany. Association for Com-
In Proceedings of the 2017 Conference on Empiri- putational Linguistics.
cal Methods in Natural Language Processing, pages
1136–1145, Copenhagen, Denmark. Association for Nick Haslam. 2016. Concept creep: Psychology’s ex-
Computational Linguistics. panding concepts of harm and pathology. Psycholog-
ical Inquiry, 27(1):1–17.
Paul Dudgeon. 2017. Some improvements in confi-
dence intervals for standardized regression coeffi- Nick Haslam and Naomi Baes. 2024. What should we
cients. Psychometrika, 82:928–951. call mental ill health? historical shifts in the popular-
ity of generic terms. PLOS Ment Health, 1(1).
Kevin Durrheim, Maria Schuld, Martin Mafunda, and
Sindisiwe Mazibuko. 2023. Using word embeddings Nick Haslam, Brodie C Dakin, Fabian Fabiano,
to investigate cultural biases. British Journal of So- Melanie J McGrath, Joshua Rhee, Ekaterina Vylo-
cial Psychology, 62(1):617–629. mova, Morgan Weaving, and Melissa A Wheeler.
2020. Harm inflation: Making sense of concept creep. Lev Lafayette, Greg Sauter, Linh Vu, and Bernard
European Review of Social Psychology, 31(1):254– Meade. 2016. Spartan performance and flexibil-
286. ity: An hpc-cloud chimera. OpenStack Summit,
Barcelona, 27:6.
Nick Haslam, Ekaterina Vylomova, Michael J. Zy-
phur, and Yoshihisa Kashima. 2021. The cultural David E. Levari, Daniel T. Gilbert, Timothy D. Wil-
dynamics of concept creep. American Psychologist, son, Baruch Sievers, David M. Amodio, and Thalia
76(6):1013–1026. Wheatley. 2018. Prevalence-induced concept change
in human judgment. Science, 360(6396):1465–1467.
Simon Hengchen, Nina Tahmasebi, Dominik
Schlechtweg, and Haim Dubossarsky. 2021. Chal- Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-
lenges for computational lexical semantic change. feng Gao. 2019. Multi-task deep neural networks for
In Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang natural language understanding. In Proceedings of
Xu, and Simon Hengchen, editors, Computational the 57th Annual Meeting of the Association for Com-
approaches to semantic change, pages 341–372. putational Linguistics, pages 4487–4496, Florence,
Language Science Press, Berlin. Italy. Association for Computational Linguistics.
Bjørn Hofmann. 2016. Medicalization and overdiagno- Yu Luo, Dan Jurafsky, and Beth Levin. 2019. From in-
sis: Different but alike. Medicine, Health Care and sanely jealous to insanely delicious: Computational
Philosophy, 19(2):253–264. models for the semantic bleaching of english intensi-
fiers. In Proceedings of the 1st International Work-
Allan V. Horwitz and Jerome C. Wakefield. 2007. The shop on Computational Approaches to Historical
loss of sadness: How psychiatry transformed normal Language Change, pages 1–13.
sorrow into depressive disorder. Oxford University
Press. Christopher D Manning. 2022. Human language under-
standing & reasoning. Daedalus, 151(2):127–138.
Allan V. Horwitz and Jerome C. Wakefield. 2012. All we
have to fear: Psychiatry’s transformation of natural Rowan Hall Maudslay and Simone Teufel. 2022.
anxieties into mental disorders. Oxford University Metaphorical polysemy detection: Conventional
Press. metaphor meets word sense disambiguation. In Pro-
ceedings of the 29th International Conference on
Clayton Hutto and Eric Gilbert. 2014. Vader: A parsi- Computational Linguistics, pages 65–77, Gyeongju,
monious rule-based model for sentiment analysis of Republic of Korea. International Committee on Com-
social media text. In Proceedings of the International putational Linguistics.
AAAI Conference on Web and Social Media, pages
216–225. Tomas Mikolov, Kai Chen, Gregory S. Corrado, and
Jeffrey Dean. 2013. Efficient estimation of word
Andrew T. Jebb, Louis Tay, Wei Wang, and Qiming representations in vector space. In International Con-
Huang. 2015. Time series analysis for psychological ference on Learning Representations.
research: Examining and forecasting change. Fron-
tiers in Psychology, 6. Saif Mohammad. 2018. Obtaining reliable human rat-
ings of valence, arousal, and dominance for 20,000
Daniel Jurafsky and James H. Martin. 2023. Vector English words. In Proceedings of the 56th Annual
Semantics and Embeddings. Draft of February 3, Meeting of the Association for Computational Lin-
2024. Draft chapters available online: https://web. guistics (Volume 1: Long Papers), pages 174–184,
stanford.edu/~jurafsky/slp3/. Melbourne, Australia. Association for Computational
Linguistics.
Li Kong, Chuanyi Li, Jidong Ge, Bin Luo, and Vin-
cent Ng. 2020. Identifying exaggerated language. In Stefano Montanelli and Fabio Periti. 2023. A survey
Proceedings of the 2020 Conference on Empirical on contextualised semantic shift detection. arXiv,
Methods in Natural Language Processing (EMNLP), arXiv:2304.01666.
pages 7024–7034, Online. Association for Computa-
tional Linguistics. Syrielle Montariol, Matej Martinc, and Lidia Pivovarova.
2021. Scalable and interpretable semantic change
Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, detection. In Proceedings of the 2021 Conference
and Erik Velldal. 2018. Diachronic word embeddings of the North American Chapter of the Association
and semantic shifts: a survey. In Proceedings of the for Computational Linguistics: Human Language
27th International Conference on Computational Lin- Technologies, pages 4642–4652.
guistics, pages 1384–1397, Santa Fe, New Mexico,
USA. Association for Computational Linguistics. Charles Egerton Osgood, William H May, and Murray S
Miron. 1975. Cross-Cultural Universals of Affective
Andrey Kutuzov, Erik Velldal, and Lilja Øvrelid. 2022. Meaning. University of Illionois Press.
Contextualized embeddings for semantic change de-
tection: Lessons learned. In Northern European Joel Paris. 2020. Overdiagnosis in Psychiatry: How
Journal of Language Technology, Volume 8, Copen- Modern Psychiatry Lost Its Way While Creating a Di-
hagen, Denmark. Northern European Association of agnosis for Almost All of Life’s Misfortunes. Oxford
Language Technology. University Press.
Scott Parrott. 2023. PTSD in the news: Media framing, Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016.
stigma, and myths about mental illness. Electronic Improving hypernymy detection with an integrated
News, 17(3):181–197. path-based and distributional method. In Proceed-
ings of the 54th Annual Meeting of the Association for
Jeffrey Pennington, Richard Socher, and Christopher D Computational Linguistics (Volume 1: Long Papers),
Manning. 2014. Glove: Global vectors for word rep- pages 2389–2398, Berlin, Germany. Association for
resentation. In Proceedings of the 2014 conference Computational Linguistics.
on empirical methods in natural language processing
(EMNLP), pages 1532–1543. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-
Yan Liu. 2020. MPNet: Masked and permuted pre-
Francesco Periti and Nina Tahmasebi. 2024. A sys- training for language understanding. Advances in
tematic comparison of contextualized word embed- neural information processing systems, 33:16857–
dings for lexical semantic change. arXiv preprint 16867.
arXiv:2402.12011.
Cass R. Sunstein. 2018. The power of the normal.
SSRN. SSRN Scholarly Paper ID 3239204. Social
Ivan Jacob Agaloos Pesigan, Rong Wei Sun, and Shu Fai
Science Research Network. https://doi.org/10.
Cheung. 2023. betadelta and betasandwich: Con-
2139/ssrn.3239204.
fidence intervals for standardized regression coef-
ficients in r. Multivariate Behavioral Research, Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2021.
58(6):1183–1186. Survey of computational approaches to lexical seman-
tic change detection. In Nina Tahmasebi, Lars Borin,
Steven Pinker. 2011. The Better Angels of Our Nature: Adam Jatowt, Yang Xu, and Simon Hengchen, edi-
Why Violence Has Declined. Viking Books. tors, Computational approaches to semantic change,
pages 1–91. Language Science Press, Berlin.
Nils Reimers and Iryna Gurevych. 2019. Sentence-
BERT: Sentence embeddings using Siamese BERT- Xuri Tang. 2018. A state-of-the-art of semantic
networks. In Proceedings of the 2019 Conference on change computation. Natural Language Engineering,
Empirical Methods in Natural Language Processing 24(5):649–676.
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages Yufei Tian, Arvind Krishna Sridhar, and Nanyun Peng.
3982–3992, Hong Kong, China. Association for Com- 2021. Hypogen: Hyperbole generation with com-
putational Linguistics. monsense and counterfactual knowledge. In Find-
ings of the Association for Computational Linguistics:
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. EMNLP 2021, pages 1583–1593.
2020. A primer in BERTology: What we know about
how BERT works. Transactions of the Association Jesse S. Y. Tse and Nick Haslam. 2021. In-
for Computational Linguistics, 8:842–866. clusiveness of the concept of mental disorder
and differences in help-seeking between asian
James A. Russell. 2003. Core affect and the psychologi- and white americans. Frontiers in Psychology,
cal construction of emotion. Psychological Review, 12. https://www.frontiersin.org/articles/
110(1):145–172. 10.3389/fpsyg.2021.699750.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Hayato Tsukagoshi, Ryohei Sasano, and Koichi Takeda.
Thomas Wolf. 2019. Distilbert, a distilled version 2022. Comparison and combination of sentence em-
of bert: smaller, faster, cheaper and lighter. ArXiv, beddings derived from different supervision signals.
abs/1910.01108. In Proceedings of the 11th Joint Conference on Lex-
ical and Computational Semantics, pages 139–150,
Seattle, Washington. Association for Computational
Norman Sartorius. 2007. Stigma and mental health.
Linguistics.
The Lancet, 370(9590):810–811.
Stephen Ullmann. 1962. Semantics: An Introduction to
Nina Schneidermann, Daniel Hershcovich, and Bolette the Science of Meaning. Blackwell.
Pedersen. 2023. Probing for hyperbole in pre-trained
language models. In Proceedings of the 61st An- Ekaterina Vylomova and Nick Haslam. 2021. Semantic
nual Meeting of the Association for Computational changes in harm-related concepts in english. In Nina
Linguistics (Volume 4: Student Research Workshop), Tahmasebi, Lars Borin, Adam Jatowt, Yue Xu, and Si-
pages 200–211, Toronto, Canada. Association for mon Hengchen, editors, Computational Approaches
Computational Linguistics. to Semantic Change. Language Science Press.
Georg Schomerus, Stephanie Schindler, Christian Ekaterina Vylomova, Sean Murphy, and Nick Haslam.
Sander, Eva Baumann, and Matthias C Angermeyer. 2019. Evaluation of semantic change of harm-related
2022. Changes in mental illness stigma over 30 years– concepts in psychology. In Proceedings of the 1st In-
improvement, persistence, or deterioration? Euro- ternational Workshop on Computational Approaches
pean Psychiatry, 65(1):e78. to Historical Language Change, pages 29–34.
Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and A. Appendix A
Timothy Baldwin. 2016. Take and took, gaggle and
goose, book and read: Evaluating the utility of vector To elaborate on what a word being low or high
differences for lexical relation learning. In Proceed- "arousal" or "valence" means, Warriner et al. (2013)
ings of the 54th Annual Meeting of the Association for defined them in the following way when (valid)
Computational Linguistics (Volume 1: Long Papers),
pages 1671–1682, Berlin, Germany. Association for participants made direct judgements of the large
Computational Linguistics. sample of words on the measured attributes (n =
419: valence; n = 448: arousal; 16-87 years; ma-
Ruiyu Wang and Matthew Choi. 2023. Large language
models on lexical semantic change detection: An
jority were female (60%), English native language
evaluation. arXiv preprint arXiv:2312.06002. speakers, held a college degree):
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan • Valence: "You are invited to take part in the
Yang, and Ming Zhou. 2020. Minilm: Deep self- study that [...] concerns how people respond
attention distillation for task-agnostic compression
of pre-trained transformers. Advances in Neural In-
to different types of words. You will use a scale
formation Processing Systems, 33:5776–5788. to rate how you felt while reading each word.
[...] The scale ranges from 1 (happy) to 9 (un-
Amy Beth Warriner, Victor Kuperman, and Marc Brys- happy). At one extreme of this scale, you are
baert. 2013. Norms of valence, arousal, and domi-
nance for 13,915 english lemmas. Behavior Research happy, pleased, satisfied, contented, hopeful.
Methods, 45(4):1191–1207. When you feel completely happy you should
indicate this by choosing rating 1. The other
Melissa A Wheeler, Melanie J McGrath, and Nick
Haslam. 2019. Twentieth century morality: The rise end of the scale is when you feel completely
and fall of moral concepts from 1900 to 2007. PLOS unhappy, annoyed, unsatisfied, melancholic,
ONE, 14(2):e0212267. despaired, or bored. You can indicate feeling
completely unhappy by selecting 9. The num-
WHO. 2021. Comprehensive mental health action plan
2013–2030. bers also allow you to describe intermediate
feelings of pleasure, by selecting any of the
Yu Xiao, Naomi Baes, Ekaterina Vylomova, and Nick other feelings. If you feel completely neutral,
Haslam. 2023. Have the concepts of ‘anxiety’ and
‘depression’ been normalized or pathologized? a cor- neither happy nor sad, select the middle of the
pus study of historical semantic change. PLOS ONE, scale (rating 5)."
18(6):e0288027.
• Arousal: “You are invited to take part in the
Arda Yüksel, Berke Uğurlu, and Aykut Koç. 2021. Se- study that [...] concerns how people respond
mantic change detection with gaussian word embed-
dings. IEEE/ACM Transactions on Audio, Speech,
to different types of words. You will use a
and Language Processing, 29:3349–3361. scale to rate how you felt while reading each
word. [...] The scale ranges from 1 (excited)
to 9 (calm). At one extreme of this scale, you
are stimulated, excited, frenzied, jittery, wide-
awake, or aroused. When you feel completely
aroused you should indicate this by choosing
rating 1. The other end of the scale is when
you feel completely relaxed, calm, sluggish,
dull, sleepy, or unaroused. You can indicate
feeling completely calm by selecting 9. The
numbers also allow you to describe intermedi-
ate feelings of calmness/arousal, by selecting
any of the other feelings. If you feel com-
pletely neutral, not excited nor at all calm,
select the middle of the scale (rating 5).”
B. Appendix B C. Appendix C
Total lines where target term appears in the text for Breadth Model Selection
both corpora (1970-2016): for the General corpus: The top three (pre-trained) sentence transformer
mental_health = 3,233; mental_illness = 1,559, per- models were chosen, ranked by their performance
ception = 9,440; for the Psychology corpus (1970- in embedding sentences.11 The best-performing
2016): mental_health = 26,482; mental_illness = model on the semantic textual similarity bench-
4,219, perception = 54,694. mark,12 Multi-Task Deep Neural Network (Liu
et al., 2019), was unavailable.13 See Table 2 for
descriptive statistics of models.
• "all-mpnet-base-v2"14 is maintained by the
SentenceTransformers community and excels
in encoding sentences across 14 diverse tasks
from different domains using the MPNet
(Masked and Permuted Pre-training for Lan-
guage Understanding) (Song et al., 2020) ar-
chitecture.
• "all-distilroberta-v1"15 uses a distilled ver-
sion of "distilroberta-base" (Sanh et al., 2019),
based on BERT architecture, employing
knowledge distillation during pre-training and
a triple loss (language modeling, distillation
and cosine-distance losses) to leverage the in-
ductive biases of LLMs during pre-training.
• "all-MiniLM-L6-v2"16 uses the MiniLM ar-
chitecture (Wang et al., 2020) employing deep
self-attention distillation (using self-attention
relation distillation for task-agnostic compres-
sion of pre-trained Transformers).
• Additionally, "bert-base-uncased"17 (Devlin
et al., 2019) was included for comparison,
although its network structure prohibits the
direct comparison of sentence embeddings,
and BERT maps sentences to a vector space
that is unsuitable for use with common simi-
larity measures and performs below average
GloVe embeddings on STS tasks (Reimers
Figure 8: Annual counts of articles where target terms
appear in the main text (1970-2016). Note: Top three and Gurevych, 2019).
panels = Psychology corpus; bottom three panels = 11
https://www.sbert.net/docs/pretrained_models.
General corpus. html
12
https://paperswithcode.com/sota/
semantic-textual-similarity-on-sts-benchmark
13
See https://github.com/namisan/mt-dnn
14
"all-mpnet-base-v2" from Hugging Face,
sentence-transformers: https://huggingface.co/
sentence-transformers/all-mpnet-base-v2
15
"all-distilroberta-v1" from Hugging Face,
sentence-transformers: https://huggingface.co/
sentence-transformers/all-distilroberta-v1
16
"all-MiniLM-L6-v2": https://huggingface.co/
sentence-transformers/all-MiniLM-L6-v2
17
"bert-base-uncased": https://huggingface.co/
google-bert/bert-base-uncased
Model Info all-mpnet-base- all-distilroberta- all-MiniLM-L6- bert-base-uncased
v2* v1* v2*
Table 2: Summary of language models sampled in the present study. Note: * = embeddings are normalized. + =
Average performance on encoding sentence over 14 tasks over 14 diverse tasks from different domains (14 datasets).
SNL = 570k sentence pairs annotated with labels. Multi-Genre NLI = 430k sentence pairs covering spoken and
written text. BookCorpus = 11,038 unpublished books scraped from the Internet.
Model Comparison: Test Sample
First, we compared similarity scores for sentence
embedding pairs for each sentence transformer
model to get a qualitative understanding of the cap-
tured dimensions. After feeding seven sample sen-
tences through each sentence transformer model
for encoding, similarity arrays of each sentence
embedding pair were compared. Tokenization and
preprocessing is handled as part of the sentence
transformers library.
• 1 = "I didn’t want to believe I had any men- Figure 10: Cosine similarity matrix for sentence embed-
tal_health issues and went into denial." dings using the "all-distilroberta-v1" model.
Breadth Measure
Figure 13: Breadth score over five-year intervals for each model (1970- 2014). Note: Model order demonstrates
rank of cosine distance score at the final data point (2010-2014) from highest to lowest.
D. Appendix D
To create the general corpus, a rigorous procedure
was followed. We first combined two related cor-
pora: the Corpus of Historical American English
(CoHA; Davies, 2008) and the Corpus of Contem-
porary American English (CoCA; Davies, 2008).
CoHA contains 400 million words from 1810-
2009, drawn from 115,000 texts distributed across
everyday publications (fiction, magazines, news-
papers, and non-fiction books). CoCA contains
560 million words from 1990-2019 drawn from
500,000 texts (from spoken language, TV shows,
academic journals, fiction, magazines, newspapers,
and blogs). After merging the two corpora, the com-
bined corpus spanning 1810-2019 was processed
following recommendations from Alatrash et al.
(2020) to clean it without compromising the quali-
tative and distributional properties of the data. This
process included first excluding the special token
“@”, which appears in 5% of the CoHA corpus (in-
troduced for legal reasons), malformed tokens that
are possible artifacts of the digitization process or
the data processing, and clean-up performed using
the web interface (“&c?;”, “q!”, “|p130”, “NUL”),
and removing escaped HTML characters (“ ( STAR
) ”, “<p>”, “<>”). Other symbols were excluded
after manual inspection of the corpus (e.g., “ // ”,
“ | ”, “ – ”, “*”, “..”, “PHOTO”, “( COLOR )”,
“ ILLUSTRATION ”, “/”). Blogs were also ex-
cluded (89,054 web articles; 98,788 blogs) for not
containing associated year data, and 25,418 aca-
demic texts were removed. Forty-one lines were
removed for missing text data (3 fiction, 11 news,
25 magazines, 2 spoken text) and 32 lines were re-
moved for column misalignment (15 mag, 15 news,
1 fiction, 1 tv). The cleaned corpus was then lower-
cased and punctuation (commas, periods, question
marks), function words, numerals and academic
texts were removed. The final combined corpus
contained 822,620,111 words from 344,634 texts:
30,496 fiction books, 136,476 magazines, 113,421
newspapers, 2,635 non-fiction books, 43,209 spo-
ken language and 18,397 TV shows. The current
study restricted the corpus period from 1970 to
2016 using 501,415,577 tokens from 244,552 ar-
ticles (23,855 fiction; 88,641 magazines; 73,557
news; 1,498 non-fiction; 40,036 spoken; 16,965
TV).
E. Appendix E
F. Appendix F
1970 1980 1990 2000 2010
1970 1980 1990 2000 2010 department center have say have
community service service service service state institute national have say
center community child problem problem center service institute national issue
service professional professional child child health fund care institute care
program center use use study
city have service child problem
professional problem problem care use
child use care study care
director national professional care system
school study study treatment treatment institute allow abuse community service
problem social treatment professional need national commissioner
state need health
group child need need outcome new department center service physical
worker program community health physical program oak department problem professional
Table 9: Top 10 Warriner-matched collocates of mental Table 12: Top 10 Warriner-matched collocates of mental
health in the psychology corpus (terms are ranked by health in the general corpus (terms are ranked by their
their relative count for the respective decade) relative count for the respective decade)
G. Appendix G
Year effect sizes for indices operationalizing major dimensions of lexical semantic change in the psychology
corpus (filled circles) and general corpus (empty circles). Note: First degree = Linear; Second degree = Quadratic.
Vertical dotted line = Standardized beta coefficient of 0; Standard errors (SE) that overlap line indicate that the null
hypothesis can be rejected at the 5% significance level.
Table 15: Regression Coefficients (Scaled) and Fit Statistics Predicting Intensifier Indices as a Function of Year.
Note: * = p-value for the overall model = <.001. Regression coefficients are unstandardized. For mental_illness
in psychology, residuals were autocorrelated, and outcome variable was re-fit with Generalized Least Squares
approach, yielding: B = 0.74; SE = 0.09; p < .001; RSE(DF) = 0.62(47,44); BIC = 108.52.
Index Concept Corpus B SE p F (DF) Adj. R2
Psychology -0.003 3 × 10−4 <.001 122.65 (1,45) 0.73
Mental Health
General -0.005 0.003 .071 3.55 (1,25) 0.09
Valence Psychology -0.002 9 × 10−4 .057 3.82 (1,45) 0.058
Mental Illness
General 0.01 0.005 .011 7.62 (1,25) 0.20
Psychology −1 × 10−5 2 × 10−4 .949 0.004 (1,45) -0.02
Perception
General -0.002 0.002 .188 1.84 (1,25) 0.03
Psychology 0.001 3 × 10−4 0.001 28.19 (1,7) 0.77
Mental Health
General -0.001 7 × 10−4 .213 2.49 (1,3) 0.27
Breadth Psychology 0.002 4 × 10−4 .006 14.99 (1,7) 0.64
Mental Illness
General −6 × 10−6 6× 10−4 .992 1× 10−4 (1,3) -0.33
Psychology 0.001 3 × 10−4 0.006 15.12 (1,7) 0.64
Perception
General 7 × 10−4 3 × 10−4 .076 7.13 (1,3) 0.61
Psychology 0.003 3 × 10−4 <.001 89.38 (1,45) 0.66
Mental Health
General 0.005 0.002 <.001 7.83 (1,25) 0.21
Arousal Psychology 0.003 9× 10−4 <.001 7.51 (1,45) 0.12
Mental Illness
General 0.002 0.003 .462 0.56 (1,25) -0.02
Psychology 0.001 2 × 10−4 <.001 23.65 (1,45) 0.33
Perception
General 0.002 0.001 .148 2.22 (1,25) 0.05
Psychology 4 × 10−4 3 × 10−5 <.001 163.34 (1,45) 0.78
Mental Health
General 3 × 10−4 2 × 10−4 .130 2.48 (1,21) 0.06
Path. Psychology 2 × 10−4 1 × 10−4 .049 4.12 (1,43) 0.07
Mental Illness
General −1 × 10−4 2× 10−4 .552 0.36 (1,23) -0.03
Psychology 2 × 10−3 4 × 10−2 <.001 118.42 (1,44) 0.72
Perception
General 5 × 10−5 2 × 10−5 .051 5.95 (1,6) 0.41
Psychology 7 × 10−6 4 × 10−7 <.001 292.52 (1,45) 0.86
Mental Health
General 2× 10−7 4× 10−8 <.001 18.17 (1,45) 0.27
Salience Psychology 3 × 10−7 9 × 10−8 <.001 13.21 (1,45) 0.21
Mental Illness
General 1 × 10−7 2 × 10−8 <.001 42.21 (1,45) 0.47
Psychology 5 × 10−7 3 × 10−7 .160 2.04 (1,45) 0.02
Perception
General −3 × 10−8 6× 10−8 .568 0.33 (1,45) -0.01
Table 16: Unstandardized Regression Coefficients and Fit Statistics Predicting Indices as a Function of Year. Note:
The midrule separates the main dimensions (above) and the exploratory dimensions (below). Path. = Pathologization.
Generalized Least Squares approach also used for models with autocorrelated residuals.
• Arousal: mental_health (P): B = 0.003; SE = 3 × 10−4 ; p < .001; RSE(DF) = 0.03(47,45); BIC = -172.07
• Salience: mental_health (P): B = 7 × 10−6 ; SE = 4 × 10−7 ; p <.001; RSE(DF) = 4 × 10−5 (47,45); BIC = -767.87;
mental_illness (P): B = 3 × 10−7 ; SE = 9 × 10−7 ; p < .001; RSE(DF) = 9 × 10−6 (47,45); BIC = -895.27; perception
(P): B = 5 × 10−7 ; SE = 3 × 10−7 ; p = .160; RSE(DF) = 3 × 10−5 (47,45); BIC = -785.60; mental_illness (G): B =
1 × 10−7 ; SE = 2 × 10−8 ; p < .001; RSE(DF) = 2 × 10−6 (47,45); BIC = -1048.85
Index Concept Corpus β SE 95% CI
Psychology -0.86* 0.04 (-0.94, -0.77)
Mental Health
General -0.35 0.17 (-0.71, 0.004)
Valence Psychology -0.28* 0.12 (-0.53, -0.03)
Mental Illness
General 0.48* 0.13 (0.21, 0.76)
Psychology -0.01 0.18 (-0.37, 0.35)
Perception
General -0.26 0.15 (-0.57, 0.05)
Psychology 0.90* 0.04 (0.80, 0.99)
Mental Health
General -0.67* 0.09 (-0.95, -0.40)
Breadth Psychology 0.83* 0.14 (0.50, 1.15)
Mental Illness
General -0.01 0.69 (-2.19, 2.18)
Psychology 0.83* 0.11 (0.57, 1.09)
Perception
General 0.84* 0.22 (0.13, 1.54)
Psychology(1) 0.74* 0.05 (0.64, 0.85)
Psychology(2) -0.30* 0.10 (-0.50, -0.09)
Mental illness
General(1) -0.09 0.23 (-0.56, 0.38)
Intensifier General(2) 0.26 0.15 (-0.05, 0.57)
Psychology(1) 0.60* 0.08 (0.44, 0.76)
Psychology(2) -0.07 0.10 (-0.28, 0.13)
Perception
General(1) 0.05 0.17 (-0.30, 0.41)
General(2) -0.06 0.21 (-0.50, 0.37)
Psychology 0.82* 0.08 (0.66, 0.97)
Mental Health
General 0.49 0.15 (0.19, 0.79)
Arousal Psychology 0.38* 0.17 (0.05, 0.71)
Mental Illness
General 0.15 0.20 (-0.27, 0.57)
Psychology 0.59* 0.08 (0.44, 0.74)
Perception
General 0.29 0.22 (-0.17, 0.74)
Psychology 0.30* 0.12 (0.06, 0.53)
Mental Health
General -0.12 0.23 (-0.61, 0.36)
Pathologization Psychology 0.89* 0.02 (0.85, 0.92)
Mental Illness
General 0.32 0.20 (-0.09, 0.74)
Psychology 0.85* 0.30 (0.79, 0.92)
Perception
General 0.71 0.30 (-0.03, 1.45)
Psychology 0.93* 0.02 (0.89, 0.97)
Mental Health
General 0.54* 0.10 (0.34, 0.73)
Salience Psychology 0.48* 0.13 (0.21, 0.74)
Mental Illness
General 0.70* 0.07 (0.56, 0.83)
Psychology 0.21 0.15 (-0.10, 0.52)
Perception
General -0.09 0.15 (-0.38, 0.21)
Table 17: Standardized Regression Coefficients (β) predicting Semantic Change Indices by Year. Note: Midrule
separates main dimensions of semantic change (above). * = p: < .05. (1) = First degree. (2) = Second degree.