Data Analytics for Discourse Analysis with Python
38 Researching Metaphors
Towards a Comprehensive Account
Edited by Michele Prandi and Micaela Rossi
39 The Referential Mechanism of Proper Names
Cross-cultural Investigations into Referential Intuitions
Jincai Li
40 Discourse Particles in Asian Languages Volume I
East Asia
Edited by Elin McCready and Hiroki Nomoto
41 Discourse Particles in Asian Languages Volume II
Southeast Asia
Edited by Hiroki Nomoto and Elin McCready
42 The Present in Linguistic Expressions of Temporality
Case Studies from Australian English and Indigenous Australian languages
Marie-Eve Ritz
43 Theorizing and Applying Systemic Functional Linguistics
Developments by Christian M.I.M. Matthiessen
Edited by Bo Wang and Yuanyi Ma
44 Coordination and the Strong Minimalist Thesis
Stefanie Bode
45 Data Analytics for Discourse Analysis with Python
The Case of Therapy Talk
Dennis Tay
First published 2024
by Routledge
605 Third Avenue, New York, NY 10158
and by Routledge
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2024 Dennis Tay
The right of Dennis Tay to be identified as author of this work has been
asserted in accordance with sections 77 and 78 of the Copyright, Designs and
Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or
utilised in any form or by any electronic, mechanical, or other means, now
known or hereafter invented, including photocopying and recording, or in
any information storage or retrieval system, without permission in writing
from the publishers.
Trademark notice: Product or corporate names may be trademarks or
registered trademarks, and are used only for identification and explanation
without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Names: Tay, Dennis, author.
Title: Data analytics for discourse analysis with Python: the case of
therapy talk / Dennis Tay.
Description: New York, NY: Routledge, 2024. | Series: Routledge studies in
linguistics | Includes bibliographical references and index. |
Identifiers: LCCN 2023047858 | ISBN 9781032419015 (hardback) |
ISBN 9781032419022 (paperback) | ISBN 9781003360292 (ebook)
Subjects: LCSH: Discourse analysis–Data processing. | Python (Computer
program language) | Psychotherapy–Language–Case studies.
Classification: LCC P302.3 .T39 2024 | DDC 401.410285–dc23/
eng/20231226
LC record available at https://lccn.loc.gov/2023047858
ISBN: 9781032419015 (hbk)
ISBN: 9781032419022 (pbk)
ISBN: 9781003360292 (ebk)
DOI: 10.4324/9781003360292
Typeset in Sabon
by Deanta Global Publishing Services, Chennai, India
Contents
1 Introduction 1
Defining data analytics 1
Data analytics for discourse analysis 3
The case of psychotherapy talk 6
Outline of the book 11
Quantifying language and implementing data analytics 12
Quantification of language: Word embedding 13
Quantification of language: LIWC scores 17
Introduction to Python and basic operations 19
3 Cluster analysis 66
Introduction to cluster analysis: Creating groups
for objects 66
Agglomerative hierarchical clustering (AHC) 68
k-means clustering 71
4 Classification 105
Introduction to classification: Predicting groups from
objects 105
Case study: Predicting therapy types from therapist-client
language 108
Step 1: Data and LIWC scoring 109
Step 2: k-NN and model validation 114
Python code used in this chapter 123
6 Conclusion 165
Data analytics as a rifle and a spade 166
Applications in other discourse contexts 168
Combining data analytic techniques in a project 172
Final words: Invigorate, collaborate, and empower 174
Index 177
Figures
Tables
1 Introduction
DOI: 10.4324/9781003360292-1
context of the issues they represent. These include summarizing and visual-
izing data in various ways, comparing alternative approaches to connect
the data with the research problem, and allowing contextual knowledge
to shape analytic decisions. The role of contextual or domain knowledge
(Conway, 2010) is the second key trait of data analytics. Though under-
pinned by a common core of statistical concepts, techniques are adaptable
for specific questions in different domains like business (Chen et al., 2012)
and health care (Raghupathi & Raghupathi, 2014). For example, the same
statistical models used to forecast stock prices can be used to forecast
demand for hospital beds (Earnest et al., 2005), but model assumptions
and the forecasting horizon of interest clearly differ.
Evans and Lindner (2012) discuss four subtypes of data analytics. In
increasing order of complexity and value-addedness, these are descriptive
analytics, diagnostic analytics, predictive analytics, and, finally, prescrip-
tive analytics. The four subtypes were originally conceived in business
contexts where ‘value’ has a concrete financial sense, but other research-
ers can understand them as progressive phases of inquiry into their data.
Descriptive analytics is roughly synonymous with the traditional notion of
descriptive statistics, where the objective is a broad overview of the data to
prepare for later analysis. Simple examples include visualizing the central
tendency and distribution of data with boxplots, histograms, or pie charts.
The increasing sophistication and importance of visual aesthetics and user
interactivity, however, have made data visualization a distinct field in itself
(Bederson & Shneiderman, 2003). The next phase of diagnostic analytics
involves discovering relationships in the data using statistical techniques.
It is more complex and ‘valuable’ than descriptive analytics because the
connections between different aspects of data help us infer potential causes
underlying observed effects, addressing the why behind the what. In an
applied linguistics context, for example, large differences in scores between
student groups discovered in the descriptive analytics phase might moti-
vate a correlational study with student demographics to diagnose potential
sociocultural factors that explain this difference. We will later see how
diagnostic analytics can also offer practical solutions like inferring missing
information from incomplete datasets or assigning data to theoretically
meaningful groups by their characteristics.
If diagnostic analytics is about revealing past outcomes, the next phase
of predictive analytics is aimed at predicting yet-to-be-observed, or future,
outcomes. This means making informed guesses about the future based
on present and historical data points using core techniques like regression
and time series analysis. It is clear why predictive analytics constitutes a
quantum leap in value for businesses that are inherently forward looking.
The same applies for objectives like automatic classification of new texts or
forecasting of language assessment scores. As ever-increasing volumes of
discourse, on the one hand, and richer educational experiences for stu-
dents, on the other (Asamoah et al., 2015).
Figure 1.1 Applying time series analytics to the construction of expertise on YouTube
C: Like if you’re in a, a car and someone’s doing something silly you have
the ability to stop it and get out.
T: Mm hmm.
C: Whereas this was no control, no control.
T: Yeah, yeah. And what kind of emotional impact did that have on you
straight away? I mean you were out of control; you couldn’t control it.
Linguistic synchrony can also be measured in ways that go beyond the use
of similar words. LIWC is likewise useful for this purpose – each transcript
can be assigned scores based on displayed socio-psychological stances like
analyticity, authenticity, clout, and emotional tone, and (a)synchrony can
then be defined in terms of (dis)similarity between scores. We will exam-
ine how the data analytic technique of cluster analysis – focusing specifi-
cally on the k-means clustering algorithm – can be used on LIWC scores
to discover natural groupings in therapist and client language, leading to
a concrete and replicable synchrony measure per dyad. Just as the case
T: So, um what I’d like to do today is start off by reviewing what we did
in our last session. Kind of seeing if you’ve had any further thoughts
about that questions, concerns come up. And then make a start on the
attention training task which we briefly talked about last session.
C: Ok.
T: Yeah. How did you find our last session, did anything come up?
C: I found it most objective.
T: Ok.
fits, or predicts, the existing sample data. Model validation, on the other
hand, is also concerned with how well the model predicts out-of-sample
data, which would be a better test of its real-world applicability. Model
validation is often described as a context-dependent process (Mayer &
Butler, 1993), and this bears two important implications for us. First, and
more generally, we should not assume that these techniques will be useful
for analyzing psychotherapy talk just because they are common in other
domains like finance and engineering. Second, as this book will show,
each technique and its context of application may require specific valida-
tion procedures appropriate to the case at hand. The following validation
procedures (Good & Hardin, 2012) will be demonstrated throughout the
book: (1) resampling and refitting the model multiple times on different
parts of the dataset, (2) splitting the data sample into training and test-
ing datasets, and (3) comparing outcomes of alternative methods and/or
using external sources of information. We will also highlight cases where
model validation serves not only as quality control but also as an interest-
ing avenue for stimulating further research hypotheses.
To maximize consistency of presentation and argumentation, Chapters
2 to 4 as described above will be structured with the following common
elements
the transcripts or, in simple terms, to ‘convert words into numbers’ using
some measure or quantification scheme to prepare for further quantitative
data analytics. The present choice of LIWC, among many other ways to
do so, will be explained below. In a nutshell, the main rationale is that
it allows users a relatively simple option to focus on socio-psychological
stances rather than semantic contents underlying word choices. Following
quantification, the second broad step is the actual implementation of the
techniques. There are again many options for this, ranging from paid sta-
tistical software like SPSS to open-source programming languages, and the
present choice of the open-source Python programming language will like-
wise be explained.
and the columns show the number of times each unique word appears in
each sentence.
Table 1.1 is known as a document-term matrix where each review is
a document and each word a term. The rows spell out the vectors for the
corresponding reviews. R1 in vectorized form is therefore [1 1 1 1 0 0 0 0],
R2 is [0 1 0 0 1 1 1 0], and R3 is [2 1 1 0 0 0 0 1]. Geometrically speaking,
the eight unique words each represent a spatial dimension or axis, each
document is a vector occupying this eight-dimensional space, and the fre-
quencies of the terms determine the vector length. The more a certain term
appears in a document, the longer the vector will be in the corresponding
dimension. As this simple approach merely counts the raw frequencies of
terms and, if desired, contiguous term sequences (i.e., n-grams), it may
be less useful for making more nuanced comparisons within the corpus.
For instance, a term that occurs frequently in a document may not actu-
ally be that important or informative to that document, if the same term
also occurs in many other documents in the corpus. Grammatical words
like articles and prepositions are good examples. A more refined way to
compute the document-term matrix would then be to scale the frequency
of each term by considering how often the term occurs among all other
documents. This scaled frequency is known as the term frequency-inverse
document frequency (tf-idf). A simple version of the formula is
$$\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \times \log\!\left(\frac{N}{\mathrm{df} + 1}\right)$$
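As a quick worked illustration, the short snippet below applies this formula to the document-term matrix in Table 1.1 (the column ordering of the eight words is assumed here). Note that off-the-shelf implementations such as scikit-learn add further smoothing and normalization, so their exact values will differ.

import numpy as np

#document-term matrix from Table 1.1: rows = reviews R1-R3, columns = the eight unique words
dtm = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                [0, 1, 0, 0, 1, 1, 1, 0],
                [2, 1, 1, 0, 0, 0, 0, 1]])
N = dtm.shape[0]                    #number of documents in the corpus
df = np.count_nonzero(dtm, axis=0)  #document frequency of each term
tfidf = dtm * np.log(N / (df + 1))  #apply the tf-idf formula to every cell
print(np.round(tfidf, 3))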
import texthero as hero  #requires the texthero library

#pre-process data
data['text'] = hero.clean(data['text'])
#calculate tf-idf scores
data['tfidf'] = hero.tfidf(data['text'])
#perform PCA on scores
data['pca'] = hero.pca(data['tfidf'])
Table 1.2 shows the hypothetical outcome of the above process for a cor-
pus of ten documents. Each document is now a vector with just two entries
corresponding to the reduced tf-idf scores. With only two dimensions, we
can then represent the information in more visually intuitive ways. Figure
1.2 shows the vectors representing our hypothetical documents F [0.1164
0.428016] and H [0.1874 0.2571] in a two-dimensional space. We can then
quantify, among other things, (dis)similarity between documents by calcu-
lating distances and angles between their representing vectors. Documents
with words having more similar tf-idf scores will have smaller distances
and/or angles between their vectors. A common basic measure that can be
conveniently computed with free online calculators is the cosine distance.
Table 1.2 Reduced tf-idf scores (two dimensions) for ten hypothetical documents

Document   Dimension 1   Dimension 2
A          –0.37685      –0.0089
B          –0.34303      –0.05053
C          –0.42005       0.037291
D          –0.34016       0.044264
E          –0.332        –0.17402
F           0.1164        0.428016
G          –0.35413       0.027041
H           0.1874        0.2571
I          –0.38525       0.02156
J          –0.20533      –0.10186
This is just the cosine of the angle θ between vectors A and B, or their dot
product divided by the product of their lengths. Formally,
$$\text{Cosine distance} = \cos\theta = \frac{A \cdot B}{\lvert A \rvert \, \lvert B \rvert}$$
The cosine distance always has an absolute value between 0 and 1. The
higher the value, the more similar the documents. The distance between
documents F and H in Figure 1.2, for example, has a high value of 0.934.
If we go back to our one-sentence reviews in Table 1.1, the value between
R1 and R2 is 0.25, between R1 and R3 is 0.756, and between R2 and R3 is
0.189. Such measures could then be used for different analytical purposes
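For readers who prefer Python to an online calculator, the values just reported can be reproduced with a few lines of NumPy (a minimal sketch using the vectorized reviews from Table 1.1).

import numpy as np

#cosine of the angle between two vectors: dot product divided by the product of their lengths
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

R1 = np.array([1, 1, 1, 1, 0, 0, 0, 0])
R2 = np.array([0, 1, 0, 0, 1, 1, 1, 0])
R3 = np.array([2, 1, 1, 0, 0, 0, 0, 1])
print(round(cosine(R1, R2), 3), round(cosine(R1, R3), 3), round(cosine(R2, R3), 3))
#output: 0.25 0.756 0.189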
is used in therapy and related genres like narratives. They reflect aspects
like how stories are told, the stance of therapists when dispensing advice
and of clients when receiving it, the negotiation of relationships, and lin-
guistic displays of emotional states. For example, therapists could speak in
a highly logical way (analytic thinking) but hedge their advice (clout) and
use more positive words (emotional tone) to come across as more personal
(authenticity) and optimistic. The scores can therefore help to distinguish
various configurations of linguistic and interactional styles. Different from
the semantic annotation tools and vectorization processes described above,
it is also noteworthy that grammatical function words play a relatively
important role in defining the summary variables. This follows from the
argument that socio-psychological stances at both individual and cultural
levels are more reliably indexed by function than content word choices
(Tausczik & Pennebaker, 2010). One reason is that content words are, to
a large extent, tied to (arbitrary) topics at hand. The argument may well
apply to psychotherapy transcripts – it would be more useful to claim that
two sessions are similar because the speakers show comparable levels of
analytic thinking, clout, authenticity, and/or emotional tone rather than
because they happen to be talking about the same topic. Notwithstanding
these advantageous characteristics, it should be highlighted that the choice
of quantification scheme is, in principle, independent from the subsequent
data analytic process and should be made based on a holistic considera-
tion of the setting and objectives at hand. Other than the vectorization
techniques discussed above, other characteristics of session transcripts
that could be incorporated into an analyst’s quantification scheme include
demographic details as well as expert or client ratings on the attainment of
certain therapeutic processes/outcomes.
it with low-level languages like C where the vocabulary and grammar are
unintuitive to humans but more directly ‘understood’ by computers, and
domain-specific languages like SQL and MATLAB, which are designed to
solve particular kinds of problems. Implementing data analytic techniques
is just one of Python’s many uses. While it would do an excellent job for
the present and other conceivable discourse analytic purposes, we should
bear in mind that Python is by no means the only way to implement the
techniques discussed in this book. Traditional commercial statistical soft-
ware like SPSS and Stata are ready alternatives but are neither customizable
nor open source. Another highly popular programming language among
linguists and academics in general is R (Baayen, 2008; Levshina, 2015),
and there is a lively debate on how it compares with Python. Regardless,
both languages are popular because of their open-source nature, strong
community advocacy and support, relative ease of use, and numerous
learning resources available online.
Installation of Python is straightforward on modern operating systems.
It is, in fact, preinstalled on Unix-based systems like MacOS and Linux
and also on some Windows systems, especially if the administrative tools
were programmed with it. However, most beginners should find it easier to
run Python via a separately installed integrated development environment
(IDE). IDEs provide a graphical user interface for basic operations like
writing, running, debugging, loading, and saving code, somewhat analo-
gous to popular software programs like Microsoft Word. A Python code
file has the .py file extension, which is essentially in plain text format and
can be opened in any text processor and run in an IDE. The file contains
lines of Python code like the simple one-line example below, and by run-
ning the code we instruct the computer to execute the command – in our
case, the output is simply the phrase ‘Hello, world!’. Output is typically
displayed in a separate window in the IDE – be it text, numbers, or a
figure. Figures, in particular, are extremely common and can be saved as
picture files straight from IDEs.
print('Hello, world!')
import pandas as pd
import numpy as np
import seaborn as sns
The next step is to import files, usually Excel spreadsheets, that contain
data for analysis. Each row of the spreadsheet usually represents a subject
– be it a human subject, transcript, or some other unit of analysis – while
each column represents a variable – scores, ratings, group labels, and so
on. This presentation format is commonly known as wide data because
many columns are used. As Figure 1.3 shows, the alternative is known as
long or narrow data where the same transcript occurs across many rows,
and all the variables and scores are captured under a single column.
Just like in traditional statistical software like SPSS, both formats can be
used and even converted to each other in Python. The more convenient for-
mat often depends on the situation and objectives at hand. The detailed dif-
ferences between them are not of immediate present relevance, but interested
readers can look them up with a simple search phrase like ‘wide vs. long
data’. To import an Excel spreadsheet, we first import the pandas library
and then use the following code with read_csv for .csv spreadsheets and
read_excel for .xlsx ones. If the Spyder IDE is used, ensure that the directory
path shown on the top right-hand corner correctly points to the file.
import pandas as pd
data = pd.read_csv('filename.csv')   # or: data = pd.read_excel('filename.xlsx')
This will import the spreadsheet and create a pandas dataframe named
data, or any other given name. A dataframe is the most common structure
for managing and analyzing data in Python, and it will be used throughout
the book.
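Returning briefly to the wide versus long distinction mentioned earlier, the conversion between the two formats can be sketched as follows (a minimal example; the column names and values are hypothetical).

import pandas as pd

#hypothetical wide data: one row per transcript, one column per variable
wide = pd.DataFrame({'Transcript': ['T1', 'T2'],
                     'Analytic': [85.2, 79.4],
                     'Clout': [60.1, 72.3]})
#wide to long: every variable/score pair becomes its own row
long = wide.melt(id_vars='Transcript', var_name='Variable', value_name='Score')
#long back to wide
wide_again = long.pivot(index='Transcript', columns='Variable', values='Score')
print(long)
print(wide_again)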
The final generic step after importing a dataset is to actually perform the
analysis at hand. This may include sub-steps like initial visualization of the
data, performing the technique, and evaluating the outcomes of analyses,
all of which will be laid out in sequence in the annotated code throughout.
Readers are encouraged to spend time observing how the code logically
unfolds throughout an extended analysis. Grasping how each code section
relates and contributes to the overall objective is often just as important
as mastering its exact syntax, which is the main reason for including the
end-of-chapter code summaries.
Lastly, note that this book is best used in combination with different
general resources available for learning Python. The specific data analytic
techniques featured here are, after all, implemented based upon an exten-
sive foundation of more basic operations, which can only be mastered with
repeated practice in different contexts. Besides the official online docu-
mentation of the key libraries mentioned earlier, good online resources
that offer a blend of free and subscription-only content include datacamp.com and towardsdatascience.com. The former provides video tutorials and
other training materials for different programming languages and data
analytic tools across a range of proficiency levels. The latter houses a vast
collection of useful short articles and tutorials written by a dedicated com-
munity of data science practitioners in different fields.
References
Abidin, C. (2018). Internet celebrity: Understanding fame online. Emerald.
Antaki, C., Barnes, R., & Leudar, I. (2005). Diagnostic formulations in psychotherapy.
Discourse Studies, 7(6), 627–647. https://doi.org/10.1177/1461445605055420
Archer, D., Wilson, A., & Rayson, P. (2002). Introduction to the USAS category
system. Lancaster University: UCREL. https://ucrel.lancs.ac.uk/usas/
Asamoah, D., Doran, D., & Schiller, S. (2015). Teaching the foundations of data
science: An interdisciplinary approach (arXiv:1512.04456). arXiv. https://doi
.org/10.48550/arXiv.1512.04456
Avdi, E., & Georgaca, E. (2007). Discourse analysis and psychotherapy: A critical
review. European Journal of Psychotherapy, Counselling and Health, 9(2),
157–176.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to
statistics using R. Cambridge University Press. https://doi.org/10.1558/sols.v2i3
.471
Baker, P., & Levon, E. (2015). Picking the right cherries? A comparison of corpus-
based and qualitative analyses of news articles about masculinity. Discourse and
Communication, 9(2), 221–236. https://doi.org/10.1177/1750481314568542
Bederson, B., & Shneiderman, B. (Eds.). (2003). The craft of information
visualization: Readings and reflections (1st ed.). Morgan Kaufmann. https://
www.elsevier.com/books/the-craft-of-information-visualization/bederson/978
-1-55860-915-0
Beysolow II, T. (2018). Applied natural language processing with python:
Implementing machine learning and deep learning algorithms for natural
language processing. Apress.
Bhatia, A. (2018). Interdiscursive performance in digital professions: The case of
YouTube tutorials. Journal of Pragmatics, 124, 106–120. https://doi.org/10
.1016/j.pragma.2017.11.001
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series
analysis: Forecasting and control (5th ed.). Wiley.
Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The
development and psychometric properties of LIWC-22. University of Texas at
Austin. https://www.liwc.app
Boyd, R. L., Blackburn, K. G., & Pennebaker, J. W. (2020). The narrative arc:
Revealing core narrative structures through text analysis. Science Advances,
6(32), eaba2196.
Breeze, R. (2011). Critical discourse analysis and its critics. Pragmatics, 21(4),
493–525.
Brezina, V. (2018). Statistics in corpus linguistics: A practical guide. Cambridge
University Press. https://doi.org/10.1017/9781316410899
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and
analytics: From big data to big impact. MIS Quarterly: Management Information
Systems, 36(4), 1165–1188.
Cohn, M. A., Mehl, M. R., & Pennebaker, J. W. (2004). Linguistic markers of
psychological change surrounding September 11, 2001. Psychological Science,
15(10), 687–693.
Connolly Gibbons, M. B., Rothbard, A., Farris, K. D., Wiltsey Stirman, S.,
Thompson, S. M., Scott, K., Heintz, L. E., Gallop, R., & Crits-Christoph, P.
(2011). Changes in psychotherapy utilization among consumers of services
for major depressive disorder in the community mental health system.
Administration and Policy in Mental Health, 38(6), 495–503. https://doi.org/10
.1007/s10488-011-0336-1
Conway, D. (2010). The data science Venn diagram. blog.revolutionanalytics.
com
Earnest, A., Chen, M. I., Ng, D., & Leo, Y. S. (2005). Using autoregressive
integrated moving average (ARIMA) models to predict and monitor the number
Pennebaker, J. W., Chung, C. K., Frazee, J., Lavergne, G. M., & Beaver, D. I.
(2014). When small words foretell academic success: The case of college
admissions essays. PLOS ONE, 9(12), 1–10.
Peräkylä, A., Antaki, C., Vehviläinen, S., & Leudar, I. (Eds.). (2011). Conversation
analysis and psychotherapy. Cambridge University Press.
Pittenger, R. E., Hockett, C. F., & Danehy, J. J. (1960). The first five minutes. Carl
Martineau.
Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare:
Promise and potential. Health Information Science and Systems, 2(3), 1–10.
Reeves, A., Bowl, R., Wheeler, S., & Guthrie, E. (2004). The hardest words:
Exploring the dialogue of suicide in the counselling process — A discourse
analysis. Counselling and Psychotherapy Research, 4(1), 62–71. https://doi.org
/10.1080/14733140412331384068
Reisigl, M. (2017). The discourse-historical approach. In J. Flowerdew & J. E.
Richardson (Eds.), The Routledge handbook of critical discourse studies (pp.
44–59). Routledge.
Reisigl, M., & Wodak, R. (Eds.). (2001). Discourse and discrimination: Rhetorics
of racism and anti-semitism. Routledge.
Sarkar, D. (2016). Text analytics with python. Springer.
Scheflen, A. E. (1973). Communicational structure: Analysis of a psychotherapy
transaction. Indiana University Press.
Semin, G. R., & Cacioppo, J. T. (2008). Grounding social cognition:
Synchronization, entrainment, and coordination. In G. R. Semin & E. R. Smith
(Eds.), Embodied grounding: Social, cognitive, affective, and neuroscientific
approaches (pp. 119–147). Cambridge University Press.
Spong, S. (2010). Discourse analysis: Rich pickings for counsellors and therapists.
Counselling and Psychotherapy Research, 10(1), 67–74.
Spong, S., & Hollanders, H. (2005). Cognitive counsellors’ constructions of social
power. Psychotherapy and Politics International, 3(1), 47–57. https://doi.org
/10.1002/ppi.17
Stieglitz, S., & Dang-Xuan, L. (2013). Social media and political communication:
A social media analytics framework. Social Network Analysis and Mining, 3(4),
1277–1291.
Tannen, D., Hamilton, H. E., & Schiffrin, D. (Eds.). (2015). The handbook of
discourse analysis (2nd ed.). Blackwell.
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words:
LIWC and computerized text analysis methods. Journal of Language and Social
Psychology, 29(1), 24–54.
Tay, D. (2013). Metaphor in psychotherapy. A descriptive and prescriptive analysis.
John Benjamins.
Tay, D. (2017a). Quantitative metaphor usage patterns in Chinese psychotherapy
talk. Communication and Medicine, 14(1), 51–68.
Tay, D. (2017b). Time series analysis of discourse: A case study of metaphor in
psychotherapy sessions. Discourse Studies, 19(6), 694–710.
Tay, D. (2019). Time series analysis of discourse: Method and case studies.
Routledge.
Tay, D. (2020). A computerized text and cluster analysis approach to psychotherapy
talk. Language and Psychoanalysis, 9(1), 1–22.
Tay, D. (2021a). Automated lexical and time series modelling for critical discourse
research: A case study of Hong Kong protest editorials. Lingua, 255, 103056.
Tay, D. (2021b). COVID-19 press conferences across time: World Health
Organization vs. Chinese Ministry of Foreign Affairs. In R. Breeze, K.
Kondo, A. Musolff, & S. Vilar-Lluch (Eds.), Pandemic and crisis discourse:
Communicating COVID-19 (pp. 13–30). Bloomsbury.
Tay, D. (2021c). Modelability across time as a signature of identity construction on
YouTube. Journal of Pragmatics, 182, 1–15.
Tay, D., & Pan, M. X. (Eds.). (2022). Data analytics in cognitive linguistics:
Methods and insights. De Gruyter Mouton.
Taylor, S. (2013). What is discourse analysis? Bloomsbury Academic. https://doi
.org/10.5040/9781472545213
Van Staden, C. W., & Fulford, K. W. M. M. (2004). Changes in semantic uses of
first person pronouns as possible linguistic markers of recovery in psychotherapy.
Australian and New Zealand Journal of Psychiatry, 38(4), 226–232. https://doi
.org/10.1111/j.1440-1614.2004.01339.x
Wei, C., Wang, Y.-C., Wang, B., & Kuo, C.-C. J. (2023). An overview on language
models: Recent developments and outlook (arXiv:2303.05759). arXiv. https://
doi.org/10.48550/arXiv.2303.05759
Zimmerman, J., Brockmeyer, T., Hunn, M., Schauenburg, H., & Wolf, M. (2016).
First-person pronoun use in spoken language as a predictor of future depressive
symptoms: Preliminary evidence from a clinical sample of depressed patients.
Clinical Psychology and Psychotherapy, 24(2), 384–391.
2 Monte Carlo simulations
The first thoughts and attempts I made to practice [the Monte Carlo
Method] were suggested by a question which occurred to me in 1946
as I was convalescing from an illness and playing solitaires. The ques-
tion was what are the chances that a Canfield solitaire laid out with
52 cards will come out successfully? After spending a lot of time try-
ing to estimate them by pure combinatorial calculations, I wondered
whether a more practical method than “abstract thinking” might not
be to lay it out say one hundred times and simply observe and count
the number of successful plays. This was already possible to envisage
with the beginning of the new era of fast computers, and I immedi-
ately thought of problems of neutron diffusion and other questions
of mathematical physics, and more generally how to change pro-
cesses described by certain differential equations into an equivalent
DOI: 10.4324/9781003360292-2
The italicized portion outlines the gist of MCS. Facing an abstract prob-
lem where there is no tractable analytic solution, like Ulam’s solitaire com-
binations and neutron diffusion, one could attempt what is known as a
numerical solution instead. Most mathematical problems can be solved
either analytically or numerically. An analytic solution requires us to frame
the problem in a well-defined form and calculate an exact, deterministic
answer. A numerical solution, on the other hand, means making intelligent
guesses at the solution until a satisfactory (but not exact) answer is obtained.
This ‘guessing’ is often done by observing, or making sensible assumptions,
about the different ways in which the underlying real-world phenomena of
interest would occur when left to chance. Statisticians refer to this as the
data generating process. In the solitaire example above, the data-generating
process is probabilistically defined based on the practical idea that certain
configurations of cards must have certain odds of appearing. Using knowl-
edge and/or assumptions about their probability distributions, MCS works
by simulating a large range of possible outcomes and aggregating them for
further analysis. Thanks to modern computers that can handle multiple
simulations and some basic laws of probability to be explained shortly, the
aggregated outcome can give us (1) a highly reliable approximation of the
actual analytic solution, and (2) in many contexts, insights into the range
of potential real-world outcomes that might occur. We will call these two
claimed advantages our ‘overarching claims A and B’ and return to them
shortly. Some readers may at this point already be making a conceptual
connection between MCS and our practical problem of missing or incom-
plete transcripts – facing the problem of uncertainty over the properties (i.e.,
LIWC scores) of a certain dyad because of missing data, one could attempt
to simulate these scores using the distribution of available data and obtain
sensible estimates and information about what the scores are likely to be.
Before we officially make this connection, let us go through some examples
to familiarize ourselves more with the logic of MCS and its underpinning
laws of probability. We will begin to use Python code, so let us first import
the following libraries to be used in the rest of the chapter.
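A minimal set of imports that covers the code used in this chapter is sketched below, together with a sketch of the analytic birthday function called in the next code block (the exact implementations in the original may differ; the function simply computes one minus the probability that all n birthdays are distinct).

import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats
import matplotlib.pyplot as plt

#analytic solution to the birthday problem (a sketch)
def probsamebirthday(n):
    p_all_different = np.prod([(365 - i) / 365 for i in range(n)])
    print('Probability of at least one shared bday among', n, 'people =',
          round(1 - p_all_different, 3))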
#runs the function for n=23. feel free to change the value of n.
probsamebirthday(23)
Let us now solve the same problem numerically using an MCS approach
that follows the broad procedure described above: (1) simulate a range of
possibilities using some assumed probability distribution, and (2) analyze
the aggregated outcome. Simulations always involve the computer drawing
random numbers, and the outcomes will therefore vary each time. This may
be disadvantageous if we want the same outcomes on different occasions
The code below simply specifies the number of people and the number of
times we want to repeat the process to obtain the aggregated outcome (i.e.,
number of simulation trials). For this demonstration we use 23 and 5,000,
respectively. It is customary to run at least 1000 trials.
#specify no. of people and trials. feel free to change the values.
n_people=23
n_trials=5000
We then run the following code that loops through the specified number of
trials, drawing a random birthday for the specified number of people each
time. This list of birthdays is subjected to the above function each time to
check for duplicates, the result (TRUE/FALSE) being recorded in a list.
After 5,000 trials, we simply count the number of TRUES in the list and
derive the probability p by dividing this number by 5,000.
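The simulation loop itself can be sketched as follows (a minimal version consistent with the description above; the duplicate-checking helper and the results list named list, which shadows the Python built-in, are assumptions chosen to match the code that follows).

np.random.seed(0)  #fix the random seed so results are reproducible

#helper that returns True if at least two people in a list share a birthday (assumed name)
def has_duplicates(birthdays):
    return len(set(birthdays)) < len(birthdays)

list = []  #results list; named 'list' to match the code below
for i in range(n_trials):
    #draw a uniformly random birthday (day 1-365) for each person
    birthdays = np.random.randint(1, 366, size=n_people)
    list.append(has_duplicates(birthdays))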
#calculate the final probability as the fraction of all trials where there are duplicates
probability = len([p for p in list if p == True]) / len(list)
print("With", n_trials, "trials, probability of at least one shared bday among", n_people, "people =", probability)
We observe again that the outcome (see y-axis) with 5,000 trials is p =
0.5056, which as mentioned is very close to the analytic solution of 0.507.
This example supports our overarching claim A that numerical simulations
can give us a highly reliable approximation of the actual analytic solu-
tion. To recall, the result is simply telling us that the computer simulated
23-people birthday lists for 5,000 times, and in 2,528 (50.56%) of these
times, there were at least two people with the same birthday.
Besides this result, the most important feature in Figure 2.1 is that as the
number of trials increases from 0 to 5,000, the probability fluctuates less
and less, and eventually converges upon the final value of 0.5056. That is
to say, with only a small number of trials, we get wildly fluctuating and
imprecise results even by slightly changing this number. The line, in fact,
begins to stabilize only after about 3000 trials, beyond which more trials
no longer tweak the result by much, giving us greater confidence in our
simulated answer. This convergence is generally true and illustrates a very
important law of probability that enables MCS: the law of large numbers.
The law states that as the number of identically distributed, randomly gen-
erated variables increases, their average value will get closer and closer to
the true theoretical average. In our case, each trial is indeed identically
uniformly distributed and randomly generated. Therefore, with more and
more trials, the proportion of trials with duplicate birthdays will get closer
and closer to the true proportion given by the analytic solution above. The
law of large numbers generally ensures that the more simulation trials we
perform, the better our estimate of the true analytic solution will be. This is
especially crucial for problems for which an analytic solution is intractable
or impossible, meaning that there is nothing to judge our numerical solu-
tions against. However, the practical tradeoff is increased computational
expense, and slower computers will struggle with handling too many trials.
Another very important and related principle that enables MCS is the
well-known central limit theorem. This will be encountered in our second
example, to which we now turn. It is a tribute to the tenuous gambling
origins of MCS and assumes the setting of a standard casino. While the
birthday paradox lent support to our overarching claim A (numerical sim-
ulations can give us a highly reliable approximation of the actual analytic
solution), the casino example is meant to support our overarching claim
B – that, in many real-world contexts, MCS are useful not just for estimat-
ing a single analytic value, but also for showcasing the range of values that
would have occurred by chance, and their implications.
probability of 45%. If the patron plays 100 spins, the expected winnings under an analytic approach would therefore be

$$100 \times \big(0.45 \times \$1 + 0.55 \times (-\$1)\big) = -\$10,$$
which equals an expected loss of $10. However, this is not very informa-
tive for casino owners as they want more information about how likely a
patron would get extremely lucky. We therefore turn to the MCS approach,
again by (1) simulating a range of possibilities using some assumed prob-
ability distribution, and (2) analyzing the aggregated outcome.
Let us simulate a scenario where there are 1,000 patrons and each of
them spins the wheel 100 times. Same as before, we set np.random.seed(0)
prior to drawing any random numbers to ensure reproducibility. We begin
by writing a function to implement our rule that the patron wins $1 if each
spin lands on 56–100 and loses $1 otherwise.
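A minimal sketch of such a function is given below (the name spin_result is an assumption; the rule itself follows the description above and the code summary at the end of the chapter).

def spin_result(x):
    #numbers 1-55 (probability 55%): the patron loses $1
    if x <= 55:
        return -1
    #numbers 56-100 (probability 45%): the patron wins $1
    else:
        return 1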
We then specify the number of spins per set (or per patron), and the number
of sets (or trials) to be simulated. Note the similar logic with our birthday
example above. For this demonstration we use 100 spins and 1,000 sets,
bearing in mind it is customary to run at least 1,000 trials. These numbers
can be changed for experimentation just like the birthday example, and
readers are encouraged to do so to witness the law of large numbers and
the central limit theorem discussed below in action.
#specify no. of spins per set and no. of sets. feel free to change the values.
spins=100
sets=1000
We then run the following code that loops through the specified number of
sets, and for each set simulates the specified number of spins. Each spin is a
random number from 1 to 100. We again use a uniform distribution because
it is reasonable to assume that the ball has an equal chance of landing on
each number. The spin outcomes are then subjected to the function above
to determine if $1 is won or lost for each spin. Finally, these are summed to
determine the total winnings per set, which is then stored in a list.
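A sketch of this loop, reusing the assumed spin_result function from above, might look like the following.

np.random.seed(0)   #for reproducibility
winnings_list = []  #total winnings per set (i.e., per patron)
for i in range(sets):
    #each spin is a uniformly random number from 1 to 100
    spin_outcomes = np.random.randint(1, 101, size=spins)
    #convert each spin to +$1 or -$1 and sum over the set
    winnings_list.append(sum(spin_result(x) for x in spin_outcomes))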
Figure 2.2 is the resulting histogram from the code above. The line indi-
cates the mean (or expected) winnings of −$9.364. The standard deviation
of winnings across the 1,000 sets is 9.594.
This is where the aforementioned central limit theorem, which facili-
tates our overarching claim B, becomes relevant. The central limit theo-
rem might be familiar to some readers. It states that if independent and
identically distributed random samples of size N are repeatedly drawn,
the larger N is, the more normally distributed the sample means will be.
This happens regardless of how the phenomenon defining the popula-
tion is actually distributed. This is more or less what is happening in our
example. The ‘population’ of roulette spins is not normally distributed,
but the simulated outcomes, which involve computing winnings on a
large number of random spins with N = 100, approach a normal distribu-
tion as seen from the histogram. Note that we do not get a ‘perfect’ nor-
mal distribution because of this additional computation of winnings (i.e.,
$1 or –$1 per spin), but for many other applications (as we will see later
in our case study) we do. The second important feature of the central
limit theorem is that the larger N is, the closer the mean of all the simu-
lated sample means will be to the ‘true’ population mean. This is closely
related to the convergent property of the law of large numbers above.
The central limit theorem allows the casino owners to derive interesting
insights from MCS. Since the winnings distribution approximates a nor-
mal distribution, we can use the known properties of a normal distribution
to estimate the likelihood of extreme scenarios. For example, 95% of all
randomly sampled outcomes are expected to fall within about two stand-
ard deviations on either side of the mean, and 2.5% to fall beyond that on
either side as ‘extreme’ outcomes. This means that although the casino’s long-term expected profit is about $9.364 per patron (i.e., per set of 100 spins), casino owners should still anticipate patron winnings of at least (−9.364 + 2 × 9.594) = $9.824, a loss for the casino, about 2.5% of the time. Conversely, they can expect patron losses of at least 9.364 + 2 × 9.594 = $28.552, an unusually high profit, about 2.5% of the time.
we call an MCS-enabled ‘scenario analysis’ of potential real-life outcomes
from inherent uncertainty.
We are now ready to make an explicit connection between our two
examples above and the practical problem of missing or incomplete
transcripts. Just as we have generated random samples of birthdays
information about their income if they are less educated. This implies that
subsequent analyses making use of income data will not be representative
even if the survey respondents themselves were carefully sampled. Lastly,
data not missing at random are most likely to lead to biased results, and,
in most cases, they cannot be ignored. This is where the reason(s) for miss-
ing values are directly and systematically related to the variable at hand.
Going back to the previous example, if those who are less willing to reveal
their income indeed demonstrably earn less, we have a case of data not
missing at random. Another example is when sicker patients on whom we
expect the clearest results drop out of a longitudinal drug trial, rendering
the remaining subjects unrepresentative. All three categories are, in princi-
ple, possible in the case of psychotherapy transcripts. At one end, we may
have transcripts missing completely at random due to randomly occurring
technical faults. At the other end, we may also have transcripts missing not
at random if there are identifiable systematic reasons for some sessions to
be left out.
There are many suggested ways to deal with missing data, including
methods like imputation, interpolation, and deletion. The most appropri-
ate method depends on the type of data and missing data mechanism at
hand. Imputation and interpolation generally aim to preserve cases with
missing data by estimating the missing values, while deletion means omit-
ting such cases from the analysis. It is of course also possible to prevent
missing data in the first place by proper planning and collection, especially
in experimental settings where researchers have greater control (Kang,
2013). We will not go into the details and pros and cons of these sug-
gestions but will instead demonstrate MCS as yet another useful method
in our case of observational transcript data. The main strength of MCS
for the present purpose is that each simulated set of transcripts realizes a
hypothetical ‘discourse scenario’ that we would expect to potentially arise
by chance. Analogous to the examples above, the simulation outcome is
a probability distribution mapping out the likelihood of these discourse
scenarios.
It is important to note that MCS does not simulate any actual words or
language that could have been used by therapists and clients. From per-
sonal experience, this has been a typical misconception and leads to unnec-
essary doubts about the ethicality of the process. MCS is instead based
on the LIWC summary variable scores discussed in Chapter 1, which, to
reiterate, could be replaced by any other desired quantification scheme.
We are using the scores and observed probability distribution of available
transcripts for a certain therapist–client dyad – for example, 30 out of 40
sessions – to simulate scores for the remaining 10 missing sessions, thereby
providing an estimate of the scores of all 40 sessions. Formally, the Monte
Carlo estimator is given by
$$E(X) \approx \bar{X}_N = \frac{1}{N}\sum_{i=1}^{N} X^{(i)}$$
where N random draws are taken from the observed probability distribu-
tion of available data and averaged to derive the estimated mean/expected
value of X, the unknown population value. Recall from Chapter 1 that it
is important to evaluate the accuracy of this estimation in order to deter-
mine if MCS is reliable enough to solve our practical problem of missing
transcripts. We will do this by a systematic process of model validation
that, as mentioned earlier, serves the secondary purpose of an entry point
into potentially interesting theoretical hypotheses. Our demonstration case
study will be presented below in three major steps, followed by a discus-
sion of the results and implications.
• Simulation run A
• Training: Sessions 11–40 / Validation: Sessions 1–10
• Simulation run B
• Training: Sessions 1–10, 21–40 / Validation: Sessions 11–20
• Simulation run C
• Training: Sessions 1–20, 31–40 / Validation: Sessions 21–30
• Simulation run D
• Training: Sessions 1–30 / Validation: Sessions 31–40
• Simulation run E
• Training: 30 random sessions / Validation: 10 random sessions
Our objective is therefore not only to evaluate the overall accuracy of MCS,
but also to see how this accuracy varies across the temporal phases of treat-
ment. In other words, we want to see which ‘missing’ phase would result in
the most/least accurate simulations. A phase that results in more accurate
simulations when missing would imply that that phase is linguistically least
different from the remainder of the dyad (as measured by LIWC), and vice
versa, with attendant theoretical implications to be explored. On the other
hand, simulation run E would serve to mimic data ‘missing completely at
random’ as described above, and allow us to compare outcomes with sys-
tematically missing data in the other runs.
We start by preparing the training and validation datasets. For simula-
tion runs A to D, this can be done manually by preparing two excel files
for each run and naming them something like
train_A.csv and validation_A.csv. The former would have scores (in
columns) of 30 sessions (in rows) and the latter scores of 10 sessions. For
simulation run E, we can use Python to randomly select the training and
validation data from the full dataset (e.g., 40_sessions.csv) with the code
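A minimal sketch of such a split is shown below (assuming the full dataset is stored in 40_sessions.csv with a Session index column; the variable names are placeholders).

data = pd.read_csv('40_sessions.csv', index_col='Session')
validation_E = data.sample(n=10, random_state=0)  #10 randomly chosen sessions for validation
train_E = data.drop(validation_E.index)           #the remaining 30 sessions for training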
Table 2.1 summarizes the mean and standard deviation of all four LIWC
summary variables scores across our five training datasets. Shapiro-Wilk
tests confirmed that all scores are normally distributed (p>0.05). As such,
their means and standard deviations will be used as input to draw ran-
dom values from a normal distribution. Recall that we will be using these
30-session properties as a basis to simulate multiple sets of 40 sessions,
in effect compensating for the 10 ‘missing’ sessions each time. Note that
although we have a multivariate profile, the simulations will be univariate
as in the birthday and casino examples to simplify the illustration. This
means that each variable will be separately and independently simulated
without assuming or making use of statistical relationships that may exist
between them. If we do not wish to make this assumption, an alterna-
tive but more complex approach would be to perform multivariate normal
sampling, which requires not only the means and standard deviations of
each variable, but also the covariances or covariance structure among all
four variables.
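For readers curious about this multivariate route, which is not pursued here, a minimal sketch would be the following (assuming the training data have been loaded into a dataframe named data, as in the code below, and with assumed column names).

cols = ['Analytic', 'Clout', 'Authentic', 'Tone']  #assumed column names
means = data[cols].mean()
cov = data[cols].cov()  #covariance structure among the four variables
simulated = np.random.multivariate_normal(means, cov, size=40)  #one simulated 40-session set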
The following code is used for each of the five simulation runs.
The first part is straightforward and only involves loading the data-
set (e.g., train_A.csv), calculating the mean and standard deviation of
each variable, and specifying the number of sessions and simulation
trials desired. We will simulate 5,000 sets of 40 sessions for each run.
These 5,000 sets correspond to 5,000 potential ‘discourse scenarios’
that might occur probabilistically as mentioned above. Readers using
the code on their own datasets and/or variables can simply change the
names accordingly.
#load data
data = pd.read_csv('train_A.csv', index_col='Session')
#specify no. of sessions and simulations. feel free to change the values.
num_sessions = 40
num_simulations = 5000
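The mean and standard deviation step mentioned above can be sketched as follows (the column names are assumptions and should be changed to match the actual LIWC output; the resulting variable names are the ones used in the simulation code).

analytic_avg, analytic_sd = data['Analytic'].mean(), data['Analytic'].std()
clout_avg, clout_sd = data['Clout'].mean(), data['Clout'].std()
auth_avg, auth_sd = data['Authentic'].mean(), data['Authentic'].std()
tone_avg, tone_sd = data['Tone'].mean(), data['Tone'].std()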
For the actual simulation, two options are presented here for readers’
experimentation. Option 1 is more straightforward and follows what we
have been doing in our examples so far. That is, simulate the variable
scores for 40 sessions on the basis of the 30-session training data param-
eters, calculate and store the 40-session mean scores, and loop 5,000 times
to generate 5,000 discourse scenarios, each one represented by a set of
these mean scores. The code for option 1 is as follows.
authentic_vector=auth_avg+(auth_sd*(normal_vector))
authentic_strata_mean=np.mean(authentic_vector, axis=1)
authentic=np.mean(authentic_strata_mean)
clout_vector=clout_avg+(clout_sd*(normal_vector))
clout_strata_mean=np.mean(clout_vector, axis=1)
clout=np.mean(clout_strata_mean)
tone_vector=tone_avg+(tone_sd*(normal_vector))
tone_strata_mean=np.mean(tone_vector, axis=1)
tone=np.mean(tone_strata_mean)
The code below will generate histograms of the simulation outcomes, with
useful annotations like the mean and standard deviation for each vari-
able, rounded to three decimal places. We call the figure title ‘Simulation
(A)’ to indicate that these are results of our simulation run A, but this
can be changed accordingly. The same goes for other visual customization
options like the co-ordinates of the text annotation (set to 0.5, 0.5 here),
the labels for each histogram, and so on.
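A minimal sketch for a single variable is given below, where simulated_authentic stands for the list of 5,000 simulated authenticity means (the variable name and most plotting details are assumptions and can be customized as described).

plt.hist(simulated_authentic, bins=30, label='Authenticity')
plt.title('Simulation (A)')
plt.text(0.5, 0.5,
         'mean=' + str(round(np.mean(simulated_authentic), 3)) +
         ', sd=' + str(round(np.std(simulated_authentic), 3)),
         transform=plt.gca().transAxes)
plt.legend(fontsize=10)
plt.show()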
The above code is then rerun for each of the remaining four simulation runs.
We then proceed to Step 3 where the aggregated outcomes are analyzed for
each run. This includes the validation procedure of comparing how differ-
ent each simulation is to the corresponding validation dataset.
We begin with simulation run A where the first 10 ‘missing’ sessions com-
prise the validation set, and the remaining sessions 11–40 comprise the
training set. The top half of Figure 2.3 shows the histograms generated with
the code above. These are the distributions of 5,000 simulations of each
variable, with means and standard deviations indicated. In other words,
they are the aggregated outcomes of MCS showing the range of possible
40-session outcomes over 5,000 different probabilistic discourse scenarios.
By the central limit theorem and law of large numbers discussed above,
we know that (1) a large number of randomly drawn samples (N=5,000)
each with a reasonably large sample size (N=40) would yield a sampling
distribution that approaches normality, regardless of what the actual
for the actual 40-session scores if the first 10 sessions, or first quarter of
the dyad’s interaction, are missing. We now analyze and perform the same
simple validation procedure for the remaining simulation runs.
For simulation run B shown in Figure 2.4, the second quarter of interac-
tion (sessions 11–20) is ‘missing’ and comprises the validation set. Sessions
1–10 and 21–40 comprise the training set. This time, only three out of
four variables: analytic thinking (t(9)= 1.010, p=0.34), clout (t(9)= 1.252,
p=0.24), and authenticity (t(9)= 1.072, p=0.31) were not significantly dif-
ferent between the simulated and validation datasets. The simulated emo-
tional tone score was significantly higher than the observed score in the
validation dataset (t(9)= 2.93, p=0.02), suggesting that MCS provides a
less than accurate estimation of the use of emotion words in the event that
transcripts of the second quarter of interaction are missing. Estimations
of the other three variables are nevertheless satisfactory. We will return
to what this contrast in simulation results between and within different
‘missing’ phases implies later on when comparing all five runs side by side.
Figure 2.5 shows the results of simulation run C, this time with the third
quarter (sessions 21–30) as the validation dataset and sessions 1–20 and
31–40 as the training dataset. We obtain the same general result as simu-
lation run B that three out of four variables: clout (t(9)= 0.339, p=0.74),
authenticity (t(9)= −0.436, p=0.67), and emotional tone (t(9)= −1.59,
p=0.15) were not significantly different between the simulated and valida-
tion datasets. This time, however, it was analytic thinking that turned out
inaccurate. The simulated analytic thinking score was significantly higher
than the observed score in the validation dataset (t(9)= 3.213, p=0.01).
Figure 2.6 shows the results of simulation run D with the last quarter
(sessions 31–40) as the validation dataset, and sessions 1–30 as the train-
ing dataset. Interestingly, the results turned out to be similar to those of its ‘mirror’ image, simulation run A, where the first quarter was missing. None of the
four variables showed a significant difference between the simulated and
validation means: analytic thinking (t(9)= −1.627, p=0.14), clout (t(9)=
−1.638, p=0.14), authenticity (t(9)= −0.121, p=0.91), and emotional tone
(t(9)= −0.189, p=0.85). This again implies that the simulated 40-session
scores would be an acceptable replacement for the actual 40-session scores
if the last 10 sessions, or last quarter of the dyad’s interaction, are missing.
Our final simulation run E (Figure 2.7) served as a pseudo-control con-
dition where 10 randomly chosen sessions comprised the validation data-
set, and the remaining 30 comprised the training dataset. As mentioned
earlier, the outcomes would allow us to compare the relative accuracy
of MCS for data missing completely at random versus systematically by
time. Just two out of four variables turned out to be accurate. While there
were no significant differences between the simulated and validation scores
for authenticity (t(9)= 0.103, p=0.92), and emotional tone (t(9)= 0.689,
p=0.51), the simulated analytic thinking (t(9)= 3.152, p=0.01) and clout
scores (t(9)= 4.09, p<0.01) were significantly higher than in the validation
dataset.
We will now juxtapose and consider the results of all five simulation
runs. This gives us a big picture of the usefulness of MCS in our usage
context and the emergent theoretical implications from the model valida-
tion process. Table 2.2 summarizes the previously presented results and
highlights the (in)accurately simulated variables in each run.
Table 2.2 Accurately and inaccurately simulated variables in each simulation run

Run A (Training: Sessions 11–40; Validation: Sessions 1–10)
  Accurate: Analytic (p=0.17), Clout (p=0.86), Authentic (p=0.76), Tone (p=0.87)
  Inaccurate: none

Run B (Training: Sessions 1–10, 21–40; Validation: Sessions 11–20)
  Accurate: Analytic (p=0.34), Clout (p=0.24), Authentic (p=0.31)
  Inaccurate: Tone (p=0.02)

Run C (Training: Sessions 1–20, 31–40; Validation: Sessions 21–30)
  Accurate: Clout (p=0.74), Authentic (p=0.67), Tone (p=0.15)
  Inaccurate: Analytic (p=0.01)

Run D (Training: Sessions 1–30; Validation: Sessions 31–40)
  Accurate: Analytic (p=0.14), Clout (p=0.14), Authentic (p=0.91), Tone (p=0.85)
  Inaccurate: none

Run E (Training: 30 random sessions; Validation: 10 random sessions)
  Accurate: Authentic (p=0.92), Tone (p=0.51)
  Inaccurate: Analytic (p=0.01), Clout (p<0.01)
#runs the function for n=23. feel free to change the value of n.
probsamebirthday(23)
#specify no. of people and trials. feel free to change the values.
n_people=23
n_trials=5000
#calculate the final probability as the fraction of all trials where there are duplicates
probability = len([p for p in list if p == True]) / len(list)
print("With", n_trials, "trials, probability of at least one shared bday among", n_people, "people =", probability)
def spin_result(x):  #function name is an assumption; rule: numbers 1-55 lose $1, 56-100 win $1
    if x <= 55:
        return -1
    else:
        return 1
#specify no. of spins per set and no. of sets. feel free to change the values.
spins=100
sets=1000
MCS
#load data
data = pd.read_csv('train_A.csv', index_col='Session')
#specify no. of sessions and simulations. feel free to change the values.
num_sessions = 40
num_simulations = 5000
np.random.seed(0)
allstats=[]
for i in range(num_simulations):
    #distribute num_sessions evenly along num_strata
    L=int(num_sessions/num_strata)
    #allocate the probability space 0-1 evenly among the strata
    lower_limits=np.arange(0,num_strata)/num_strata
    upper_limits=np.arange(1,num_strata+1)/num_strata
    #generate random numbers confined to the allocated probability space within each stratum.
    #each random number represents the cumulative distribution function (CDF) for a normal distribution
    points=np.random.uniform(lower_limits, upper_limits, size=[int(L),num_strata]).T
    #create a vector of z-scores, each corresponding to the CDF values calculated above
    normal_vector=sp.stats.norm.ppf(points)
    authentic_vector=auth_avg+(auth_sd*(normal_vector))
    authentic_strata_mean=np.mean(authentic_vector, axis=1)
    authentic=np.mean(authentic_strata_mean)
    clout_vector=clout_avg+(clout_sd*(normal_vector))
    clout_strata_mean=np.mean(clout_vector, axis=1)
    clout=np.mean(clout_strata_mean)
    tone_vector=tone_avg+(tone_sd*(normal_vector))
    tone_strata_mean=np.mean(tone_vector, axis=1)
    tone=np.mean(tone_strata_mean)
plt.xticks(fontsize=10,rotation=0)
plt.yticks(fontsize=10,rotation=0)
plt.legend(fontsize=10)
3 Cluster analysis
DOI: 10.4324/9781003360292-3
We first demonstrate AHC. After importing the relevant libraries and dataset in Python, most notably the SciPy library for AHC, the first step is to standardize our four properties by subtracting the mean from each value and dividing by the standard deviation across the 60 countries/regions. This scales the whole dataset to zero mean and unit variance, and will improve clustering outcomes when the properties are measured on different scales, as in the present case. We can import the StandardScaler feature from the scikit-learn library to do this. The code below implements all the above. Note that there are other scaling techniques, like normalization, that could be used instead.
#import dataset
data = pd.read_csv('covid.csv', index_col='Country')
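The standardization step itself is sketched below; the variable name data_scaled is introduced here for illustration and is assumed in the later sketches.
#a minimal sketch of the imports and standardization described above
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('covid.csv', index_col='Country')

#scale every property to zero mean and unit variance across the 60 countries/regions
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), index=data.index, columns=data.columns)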
We are now ready to perform AHC. Recall that this means starting with as many clusters as objects (i.e., our countries/regions) and iteratively merging neighboring objects into larger clusters until one giant cluster remains. The
algorithm decides what to merge based on the similarity or ‘distance’ between
objects, as well as the ‘linkage measure’ (similar to distance) between the
resulting clusters. One could argue that the usefulness of clustering critically
depends on the extent to which spatial distance is the best criterion for simi-
larity in the context of application. In any case, the different ways to define
and measure these distances will not be detailed here (see Yim & Ramdeen,
2015). We will opt for the most widely used Euclidean distance and Ward’s
linkage, respectively. In so doing, we conceptualize objects as points in a
Euclidean space where each property/variable defines a dimension, and the
values determine the object positions. This is why the standardization of
scores above is important to prevent variables with larger scales from dis-
torting the positions. The following code will generate a dendrogram (Figure
3.1), the standard visualization of an AHC clustering solution.
#generate dendrogram
plt.title("COVID-19 dendrogram",fontsize=25)
plt.xticks(fontsize=25)
plt.yticks(fontsize=25)
plt.ylabel('Distance',fontsize=25)
plt.xlabel('Country/region',fontsize=25)
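The fragment above covers only the plot settings; a minimal sketch of the linkage and dendrogram calls it accompanies is given below, assuming the standardized dataframe data_scaled from the earlier sketch.
#a minimal sketch of Ward's linkage and the dendrogram call (the plot settings above apply)
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

linkage_matrix = sch.linkage(data_scaled, method='ward', metric='euclidean')
sch.dendrogram(linkage_matrix, labels=list(data_scaled.index), color_threshold=10)
plt.show()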
The code contains typical cosmetic settings like font sizes, orientation, and rotation angle of the labels that can all be experimented with. Besides these, the three most critical settings are the distance metric and linkage method (set to Euclidean and Ward, respectively), and color_threshold. The latter tells Python the aforementioned arbitrary cut-off point at which to determine the final number of clusters, whereupon each cluster will be colored differently. The setting of 10 means any split in the hierarchy
below the Euclidean distance value of 10 will be counted as a cluster. This
will be apparent in Figure 3.1.
The dendrogram indicates three distinct clusters at our chosen thresh-
old. The visual interpretation should be quite intuitive – for example,
Zambia and Ghana (far left) are more similar to each other than to Indonesia (far right of cluster one), but all three are more similar to one another than to anything outside cluster one. Also, cluster three terminates higher up
the hierarchy, indicating that the level of similarity among its members is
the lowest among the three clusters – vice versa for cluster two. It is up
to readers to evaluate whether the clustering solution ‘makes sense’ with
reference to their real-world knowledge. As a statistical evaluation, we can
calculate and show the cophenetic correlation coefficient (Sokal & Rohlf,
1962) with the following code. It measures how well the distances between objects in the dendrogram correlate with their original, unmodeled distances. The closer the coefficient is to one, the better the clustering
solution is. The current value of 0.631 is adequate.
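The code for the coefficient is sketched below, assuming the linkage matrix and standardized data from the sketches above.
#cophenetic correlation: how well the dendrogram distances preserve the original pairwise distances
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

coph_coeff, coph_dists = cophenet(linkage_matrix, pdist(data_scaled))
print(coph_coeff)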
k-means clustering
Various approaches have been developed to help the analyst select the optimal value of k, or the optimal
number of clusters (Kodinariya & Makwana, 2013). In general, param-
eters like k that are set prior to the machine-learning process rather than
learnt from the data are known as hyperparameters. A common approach
suggests that the optimal number of clusters is the point at which adding
another cluster no longer produces a noteworthy decrease in distortion/
inertia, and hence does not compensate for the corresponding loss in par-
simony. This is colloquially known as the elbow method, and will be dem-
onstrated below. Another approach is silhouette analysis, which evaluates
how well objects are matched to their assigned cluster. This is done by
comparing the average distance between each object and all other objects
in its cluster, with the average distance between each object and all other
objects in the nearest neighboring cluster. The optimal k is that which
leads to the clustering solution with the highest average silhouette score
across all objects. We will see in the upcoming case study that we can begin
the analysis by first determining this optimal number, and then specifying
it as k for the algorithm described in the paragraph above.
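As a brief illustration of how silhouette analysis could be implemented with scikit-learn (the candidate range of two to five clusters and the reuse of the standardized COVID data are illustrative assumptions):
#average silhouette score for each candidate number of clusters; higher is better
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 6):  #silhouette analysis requires at least two clusters
    labels = KMeans(n_clusters=k).fit_predict(data_scaled)
    print(k, silhouette_score(data_scaled, labels))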
Figure 3.2 is a 2D-scatterplot showing the outcome of k-means cluster-
ing on our COVID dataset. In most practical cases there is a need to reduce
the original higher dimensional Euclidean space (four dimensions in our
case, corresponding to four variables) into two dimensions via principal
components analysis. This comes with some inevitable information loss.

We could define (and measure) linguistic (a)synchrony as the extent to which therapist and client linguistic choices (mis)align across the treatment span. Existing approaches to this task, however, tend to be
polarized along the usual qualitative-quantitative divide and focus on dis-
parate levels and units of analysis. The influential Interactive Alignment
Model (Pickering & Garrod, 2004) synthesizes previous work (Brennan
& Clark, 1996; Sacks et al., 1974; Zwaan & Radvansky, 1998) to pro-
vide a cognitive account of linguistic synchrony. The main idea is that
speakers in natural dialogue prime each other to develop aligned repre-
sentations across phonological, syntactic, semantic, and situational levels.
Each level has its own monitoring and repair mechanisms, and alignment
at one level reinforces other levels to enhance the overall perception of
synchrony. Since this process is assumed to be primitive and unconscious,
the model is less able to account for complexities like higher-order com-
munication strategies, deliberate attempts to (mis)align with each other,
and other context-specific features emergent in spontaneous interaction.
Psychotherapy is seldom discussed in this model but is a case where we
would precisely expect to see such complexities (Anderson & Goolishian,
1988). Elsewhere, communication and language researchers consider these
complexities to be of primary interest. Communication Accommodation
Theory (Giles, 2016), for example, claims that our interactions are con-
sciously motivated by their perceived social consequences. People linguisti-
cally (mis)align/(a)synchronize with one another to highlight or downplay
differences as desired. This has been demonstrated by linguistic analyses
in settings like intercultural language education (Dörnyei & Csizér, 2005),
law enforcement (Giles et al., 2006), and, in fact, psychotherapy (Ferrara,
1991, 1994). As noted in Chapter 1, Ferrara (1994, p. 5) observes that
therapists and clients use the core strategies of ‘repetition’ and ‘contiguity’
to construct meaning in accommodative ways, “taking up portions of the
other’s speech to interweave with their own”. Another leading approach
to psychotherapy talk is conversation analysis, which focuses on the turn-
by-turn architecture of dialogue (Peräkylä et al., 2011; Weiste & Peräkylä,
2013). Although the term ‘synchrony’ is not often used, it coheres with the
main idea that therapeutic processes are tied to sequences of therapist–cli-
ent interaction, and the quality of these sequential relations reflects the
extent of synchrony. The general idea of (a)synchrony as linguistic (mis)
alignment is also apparent in related research like the act of ‘wordsmith-
ing’ in counseling (Strong, 2006), the strategic communication of risks
(Sarangi et al., 2003), and the co-construction of figurative language by
therapists and clients (Kopp, 1995; Tay, 2016, 2021b).
The above work mostly employs nuanced qualitative analysis of ‘local-
ized’ linguistic units or ‘isolated snapshots’ (Brown et al., 1996) like a
single conversational turn or topic. An inevitable tradeoff is the inability to
depict (a)synchrony at higher and perhaps more natural levels. Like the rest
of this book, we draw attention to the prime example of the institutional-
ized sessional level. It was explained in Chapter 1 that sessions are likely to
be more intuitive, concretely experienced, and recallable than single turns
or topics. Despite this, there is little work on linguistic (a)synchrony at
sessional level, in large part because it is hard to analyze entire sessions
in a nuanced yet reliable way. Complementary quantitative methods are
required for this. Computational techniques to analyze natural language
offer potential solutions. On the unconscious-versus-strategic alignment
debate described above, computational evidence suggests that “alignment
is not a completely automatic process but rather one of many discourse
strategies that speakers use to achieve their conversational goals” (Doyle
& Frank, 2016). Furthermore, due to the relative concreteness of words
over other grammatical levels (Gries, 2005; Healey et al., 2014), there is a
general preference for quantification at word level at least for the English
language. Two general types of synchrony measures have been proposed:
distributional and conditional. Distributional measures include the Zelig
Quotient (Jones et al., 2014) and Linguistic Style Matching (Niederhoffer
& Pennebaker, 2002), which determine linguistic (dis)similarity between
independent units of analysis. Conditional measures like Local Linguistic
Alignment (Fusaroli et al., 2012) focus instead on the relationship between
adjacent units – somewhat in the vein of conversation analysis. Both types
are complementary because distributional measures show global similarity
but not necessarily the contextual qualities of alignment, and vice versa
for conditional measures. They represent two kinds of interpretative logic
that are seldom applied together – as our case study will show, quantitative
determination of linguistic (a)synchrony using cluster analysis on sessional
LIWC variable scores, which is a distributional measure, can be supported
by qualitative investigation of how such (a)synchrony is played out in con-
text. Also noteworthy is that recent work emphasizes the importance of
function or grammatical words. The main reason is that while content
words are often tied to arbitrarily changing topics, grammatical words are
more context-invariant and thus more revealing of speakers’ interactional
styles (Doyle & Frank, 2016). This explains our current choice of LIWC
variable scores, which relies heavily on grammatical categories to measure
the socio-psychological dimensions of language. Following the same struc-
ture as Chapter 2, our case study will now be presented in stepwise fash-
ion, including computing LIWC variable scores, k-means clustering with
model validation to group sessions as a basis for measuring synchrony,
validating the clustering solutions, and qualitative analysis of how syn-
chrony is co-constructed in context. Note that most parts of this case study
were also reported in Tay and Qiu (2022) but without instruction on how
to implement the steps in Python.
The first step is similar to Chapter 2 where LIWC summary variable scores
are computed for the dataset at hand. The present approach applies k-means
clustering to measure linguistic (a)synchrony on a sessional and dyadic
basis. In other words, each therapist–client dyad receives a synchrony
score over the number of sessions conducted, which can be compared
between dyads and therapy approaches. To illustrate such a comparison,
our dataset comprises three dyads A to C. A is a psychoanalysis dyad with
15 sessions (74,697 words), B is a cognitive-behavioral therapy dyad with
14 sessions (68,812 words), and C is a humanistic therapy dyad with 20
sessions (101,044 words). The dyads were selected to maximize compara-
bility as all three clients shared broadly similar demographics and present-
ing conditions. All were heterosexual white American females in their early
to late twenties diagnosed with anxiety disorder and depression, reporting
relationship issues with their parents/spouse/partner. Given the inevitably
unique nature of each dyad, the present dataset can only illustrate but not
represent the therapy types. This again underlines the case study–oriented
nature of the present approach – it could be applied to larger datasets to
make stronger claims about therapy types if desired, as well as more lim-
ited ones for purposes like therapists’ self-monitoring and reflection.
Instead of computing LIWC scores for each session transcript as in
Chapter 2, here we first split each transcript into therapist and client-only
language. These will be called ‘sub-transcripts’. For example, dyad A will
have 30 sub-transcripts labeled T1 to T15 and C1 to C15, each with its four
LIWC summary variable scores as defining properties. Table 3.1 shows
the scores for all three dyads. The sub-transcripts will be the objects to
undergo clustering, one dyad at a time. The basic logic is that for each ses-
sion x, if sub-transcripts Tx and Cx fall into the same cluster, then session
x is considered synchronized. Otherwise, session x is asynchronized. This
is because therapist and client language within the same session ought to
be more similar to each other than they are to other sessions, if we want
to claim that session as synchronized. Applying k-means clustering would
therefore yield the following concrete outcomes: (1) which sessions across
the treatment span are (a)synchronized, (2) the percentage of synchronized
versus asynchronized sessions as an overall measure of the dyad, and (3)
the distribution pattern of (a)synchrony across time. Each outcome can
be further (qualitatively) examined using the actual transcripts in context.
Readers are again reminded that our LIWC multivariate profile can be
replaced by any other desired quantification scheme without affecting the
logic and process of k-means clustering.
Table 3.1 Summary variable scores for dyads A–C (columns from left to right: dyad A, dyad B, dyad C; within each dyad: Analytic, Clout, Authentic, Tone)
C1 12.62 41.26 63.11 67.58 7.2 11.6 95.16 59.63 8.31 20.79 83.95 30.48
C2 17.73 21.95 69.21 60.4 4.59 14.94 92.75 60.95 9.76 17.3 84.89 39.14
C3 35.24 20.39 80.5 66.26 8.49 15.99 92.94 51.18 12.23 34.34 77.4 37.81
C4 30.8 25.32 68.91 40.86 6.76 16.41 88.04 50.45 5.99 12.05 89.67 33.92
C5 10.08 14.08 69.39 66.7 5.26 19.92 83.7 81.93 7.8 23.77 84.25 52.62
C6 19.74 35.59 69.84 50.09 4.23 38.41 73.35 34.36 6.86 15.47 84.35 28.09
C7 21.73 42.54 72.59 49.84 5.38 9.82 92.72 41.34 10.71 14.55 84.31 42.98
C8 15.09 51.7 28.18 42.92 1.89 8.74 97.23 48.31 13.8 16.09 89.81 56.77
C9 20.46 14.71 87.51 49.69 5.35 20.47 92.83 50.54 7.4 19 80.6 33.32
C10 10.43 32.1 67.85 50.33 3.22 20.56 85.43 44.17 8.4 20.27 76.16 27.55
C11 17.74 32.18 67.41 77.29 5.08 8.66 95.22 75.85 12.11 12.48 84.71 64.38
C12 11.08 18.31 86.55 67.23 6.09 16.98 86.3 43.6 9 13.76 89.37 35.98
C13 17.27 25.92 56.74 75.06 5.75 32.15 76.13 64.51 8.1 32.02 73.69 31.32
C14 21.27 15.99 87.01 56.85 6.27 18.84 85.88 78.6 10.38 10.52 96.12 36.5
C15 29.99 37.22 52.08 41.54 9.93 24.98 89.52 34.03
C16 16.14 22.22 85.27 40.3
C17 13.69 38.62 65.02 28.35
C18 8.72 21.52 84.27 34.38
C19 9.96 25.35 90.8 61.12
C20 10.09 11.12 92.02 41
T1 33.45 73.58 72.27 43.14 15.4 90.23 50.24 58.7 9.33 84.32 53.22 45.87
T2 7.39 69.04 69 72.32 11.88 92.89 51.82 62.78 10.82 71.38 50.59 41.91
T3 9.49 60.35 66.71 42.41 17.9 88.02 54.4 64.12 14.26 60.8 59.47 34.18
T4 20.15 65.65 50.19 22.31 16.87 93.37 46.92 61.21 3.63 46.31 68.51 46.11
T5 21.73 61.72 50.61 58.63 9.17 94.7 47.28 74.81 4.78 68.84 55.12 30.38
T6 18.49 55.65 59.94 30.12 15.01 91.05 44.97 74.45 5.17 69.12 70.92 35.67
T7 20.23 71.69 56.41 31.35 11.37 87.56 59.66 46.79 8.82 50.61 76.79 42.31
T11 15.17 48.14 65 85.53 19.2 92.56 56.03 75.48 2.54 43.82 85.18 51.45
T12 13.79 68.94 86.4 91.04 16.37 92.4 48.83 74.9 7.23 45.65 84.16 30.59
T13 19.69 64.82 75.53 44.51 15.69 80.49 60.4 65.98 8.09 62.79 72.3 28.49
T14 38.33 80.23 73.1 91.52 24.48 87.23 43.96 72.94 6.27 37.4 86.97 15.06
T15 37.64 75.49 54.67 56.41 20.86 70.2 51.28 28.22
T16 10.81 56.49 67.51 61.87
T17 7.25 63.79 75.51 16.18
T18 8.91 70.43 71.29 50.49
T19 3.91 91.21 60.97 64.83
T20 12.18 46.21 75.23 40.69
We are now ready to perform k-means clustering using the LIWC variable
scores in Table 3.1. This is done one dyad at a time to derive a clustering
solution that will be the basis for measuring linguistic synchrony for that
dyad. Each dyad can thus be saved and imported as a separate Excel file for better data organization. Similar to the COVID example above, it is recommended to first standardize the raw LIWC scores by subtracting the mean from each value and then dividing by the standard deviation. However, standardization is not crucial here because the four variables are measured on the same scale of 0–100, unlike the COVID dataset. The following code imports a dyad’s Excel file and performs standardization using the same scikit-learn StandardScaler feature as above. It is recommended to set the
‘session’ column as the index to facilitate annotation of the data in subse-
quent visualizations.
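A minimal sketch of this step follows; the file name dyad_A.xlsx is a hypothetical example of how one dyad’s data might be saved.
#import one dyad's file with 'Session' as the index, then standardize the four LIWC scores
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_excel('dyad_A.xlsx', index_col='Session')  #hypothetical file name
data = pd.DataFrame(StandardScaler().fit_transform(data), index=data.index, columns=data.columns)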
After the data are imported and standardized, the optimal number of clus-
ters can be determined. This was defined earlier as the number k beyond
which adding another cluster no longer produces a noteworthy decrease in
distortion/inertia. We determine k by repeatedly performing k-means clus-
tering for an incremental number of clusters from 1 to a specified number n
(5 is usually enough), calculating the distortion or inertia value each time.
We then plot the values against the number of clusters and note the point
beyond which the decrease in distortion/inertia tapers off. It will soon be
apparent why this is known as the elbow method for determining opti-
mal k. The following code implements the above. Note that the libraries
required for k-means clustering are also imported at this point. Both scikit-learn and SciPy are good options, but we choose the former here.
from sklearn.cluster import KMeans
#perform k-means for 1 to 5 clusters and record the inertia of each solution
num_clusters=range(1,6)
inertias=[]
for i in num_clusters:
    model=KMeans(n_clusters=i)
    model.fit(data)
    inertias.append(model.inertia_)
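The plotting step that produces the elbow plots in Figure 3.3 might then look as follows (the axis labels are illustrative).
#plot inertia against the number of clusters to locate the 'elbow'
import matplotlib.pyplot as plt

plt.plot(num_clusters, inertias, '-o')
plt.xticks(list(num_clusters))
plt.xlabel('Number of clusters (k)')
plt.ylabel('Distortion/inertia')
plt.show()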
Readers are encouraged to compare this code with the simulation codes in
Chapter 2 and recognize the similar logic of the loops deployed. In both
chapters we are essentially specifying a number of loops through the code,
performing some computation in each iteration and recording the result in
a list, and then doing a summary analysis of the list at the end. The com-
putation in this case is the performing of k-means clustering (or ‘fitting of
a k-means model’) with the number of clusters defined by the current loop
iteration, and the result recorded in the ‘inertias’ list is the corresponding
inertia value. While the summary analysis in Chapter 2 concerned the dis-
tribution of simulation outcomes as depicted in histograms, the summary
analysis here is simply to plot the inertia values against the number of
clusters.
Figure 3.3 shows the resulting plots for each of our three dyads. These
are commonly known as ‘elbow plots’ because the optimal k resembles the
elbow joint of an arm. Beyond k, further decreases in inertia become marginal, implying that having k+1 clusters does not compensate for the loss of parsimony.
The ‘elbows’ are usually obvious and thus visual inspection is sufficient
to determine the optimal number of clusters k. For example, it is clear
that k=2 for dyad B (CBT). In cases like dyads A and C, however, it is
less easy to identify an obvious elbow joint. The following code will help
to objectively determine k by calculating the decrease in inertia with each
additional cluster. It simply takes the inertias list from above, converts it
into a dataframe, and calculates the absolute difference between each suc-
cessive value with the diff() and abs() commands.
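A sketch of that calculation (the variable name inertia_changes is introduced here for illustration):
#absolute decrease in inertia with each additional cluster
import pandas as pd

inertia_changes = pd.DataFrame(inertias, index=list(num_clusters)).diff().abs()
print(inertia_changes)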
The optimal number k is the point after which the absolute difference
value (i.e., change of inertia) is at its lowest. For example, applying the
above code to dyad A yields 35.447, 20.597, 14.572, and 8.066 for k=1 to
4, respectively. The value of k is therefore 3. With this elbow method, we
conclude that the optimal number of clusters for dyads A to C is, respec-
tively, 3, 2, and 2.
Now that we know the value of k for each dyad, we can proceed to
perform k-means clustering one dyad at a time. This was of course already
done when running the elbow method loop, but having determined k we
can go further to assign each sub-transcript to one of the k clusters and vis-
ualize the clustering solution like in Figure 3.2 above. The following code
will (1) fit a k-means model with k specified (k=3 for dyad A), (2) ‘predict’
the cluster membership of each sub-transcript using the model and assign
a corresponding cluster label, (3) add these labels to the original dataframe
for easy reference, and (4) obtain the positions of the k cluster centroids
(i.e., cluster centres) for later plotting of the red crosses like in Figure 3.2.
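A minimal sketch of these four steps for dyad A (k=3) is given below; variable names other than cluster_labels and cluster_centres, which are referred to next, are illustrative.
from sklearn.cluster import KMeans

#(1) fit a k-means model with k=3
model = KMeans(n_clusters=3)
model.fit(data)
#(2) 'predict' the cluster membership of each sub-transcript
labels = model.predict(data)
#(3) add the cluster labels to the original dataframe for easy reference
data['cluster_labels'] = labels
#(4) obtain the positions of the k cluster centroids for later plotting
cluster_centres = model.cluster_centers_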
If we re-examine the dataframe, we will now see the new cluster_labels col-
umn at the right indicating the cluster membership of each sub-transcript/
object. Examining cluster_centres will show an array consisting of k rows
representing k clusters. Each row has n numbers where n is the number
of variables defining the objects (n=4 in our case). For example, the array
below is from dyad A.
[ 1.71687033, 1.33623213, 0.07475749, 0.50641184]
[-0.22099794, 0.61341394, -0.96469075, -1.00276756]
[-0.28697058, -0.63915612, 0.49312864, 0.41172122]
The four numbers in each row/cluster locate its centroid position, which as
mentioned above is a point in n-dimensional Euclidean space just like all the
objects. Because of the optimization process explained earlier, each number
is, in fact, the mean LIWC variable score of all the sub-transcripts assigned
to that cluster. Note that some scores are negative because we standardized
them just now. In other words, these scores express the number of standard
deviations above (+) or below (-) the mean of the whole dataset.
We now have everything we need to generate the 2D-scatterplot visual-
izing the clustering solution. The following code will (1) employ princi-
pal components analysis to reduce our n-dimensional data, including the
cluster centroid locations, to two dimensions, (2) plot and label all objects
and centroid locations. The PCA library from scikit-learn is imported to
perform principal components analysis in the first step. There are many
cosmetic details that can be freely modified to suit different visual prefer-
ences. For example, palette determines the coloring scheme for the cluster,
s sets the marker size of the objects, and rx tells Python to use red crosses
for the cluster centroids.
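A sketch of this visualization is given below; the use of seaborn for the colored scatter is an assumption suggested by the palette setting mentioned above, and the cosmetic values are illustrative.
#(1) reduce the standardized scores and centroid locations to two principal components
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
components = pca.fit_transform(data.drop('cluster_labels', axis=1))
centres_2d = pca.transform(cluster_centres)

#(2) plot the objects colored by cluster, label them, and mark the centroids with red crosses
sns.scatterplot(x=components[:, 0], y=components[:, 1], hue=data['cluster_labels'], palette='tab10', s=100)
for label, (x, y) in zip(data.index, components):
    plt.annotate(label, (x, y), fontsize=8)
plt.plot(centres_2d[:, 0], centres_2d[:, 1], 'rx', markersize=12)
plt.show()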
Figure 3.4 shows the three 2D-scatterplots visualizing the outcomes of our
k-means clustering. The first thing to bear in mind is that the clustering
solutions are independent of each other in each dyad, and so ‘cluster 0’ in
dyad A has no relationship with ‘cluster 0’ in dyads B and C. It is visually
apparent that among the three dyads, dyad B (CBT) has the most distinct
separation between the clusters while dyads A and C have more ambigu-
ous cluster allocations that suggest poorer clustering solutions. The next
step would be to interpret these outcomes according to the present pur-
pose, which is to determine for each dyad which and how many sessions
are synchronized based on whether sub-transcripts Tx and Cx fall into the
same cluster. Before this can be done, however, the three clustering solu-
tions (or clustering models) need to be validated.
Recall that model validation is crucial to evaluate the accuracy and use-
fulness of data analytic outcomes and that there are different context-spe-
cific methods for it. The train-test approach discussed in Chapter 2 seems
to be an option here – the general idea being to redo the analysis with say
80% of the sub-transcripts, use the new model to predict cluster member-
ship for the remaining 20%, and compare the original and new cluster
labels. However, recall that cluster analysis is an unsupervised technique
where there are no correct or pre-existing cluster labels to compare the
results against. The very reason why we perform it is, in fact, to discover
potentially useful but hidden classification schemes. The above train-test
validation procedure, if used without pre-existing labels, would therefore
run the risk of being logically circular because we would end up evaluating
one set of predictions with another.
Two alternative procedures will be introduced in its place. The first is
informal, simpler, and relies mostly on visual inspection just like elbow
plots. The key idea is to see whether the cluster centroid locations for each
dyad are ‘distinct’ enough in the higher-dimensional spaces they occupy.
Recall that our cluster centroids are points in 4-dimensional space. Each
dimension represents one of the four LIWC summary variables, and the
centroid is the mean (or mid-point) of all the sub-transcripts in that cluster.
Therefore, if we plot the mean values of the four variables by cluster, we
would be visually reconstructing the cluster centroids in a way that allows
us to evaluate their mutual distinctiveness. The following code will gener-
ate this plot.
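A sketch of such code, grouping the dataframe by its cluster labels (plot details are illustrative):
#mean standardized LIWC scores by cluster, shown as grouped bars
import matplotlib.pyplot as plt

data.groupby('cluster_labels').mean().plot(kind='bar')
plt.ylabel('Mean standardized score')
plt.show()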
Relatedly, we can also use the code below to check and plot how many
members there are in each cluster, which is useful in a descriptive account
of the analysis.
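A corresponding sketch for the membership counts:
#number of sub-transcripts in each cluster
import matplotlib.pyplot as plt

cluster_sizes = data.groupby('cluster_labels').size()
print(cluster_sizes)
cluster_sizes.plot(kind='bar')
plt.ylabel('Number of sub-transcripts')
plt.show()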
In both code snippets above, the groupby command is used to group the
dataset by the cluster_labels column and perform various actions like
counting and plotting the mean variable scores in each group. Figure 3.5
shows the bar plots of mean variable scores for all three dyads.
Note that the y-axis shows our standardized scores rather than raw
LIWC values. We can verify that the bar plots for dyad A indeed recon-
struct the cluster centroid positions shown by the array below, which was
presented earlier. The top row corresponds to cluster 0, the second row to
cluster 1, and the bottom row to cluster 2.
[ 1.71687033, 1.33623213, 0.07475749, 0.50641184]
[-0.22099794, 0.61341394, -0.96469075, -1.00276756]
[-0.28697058, -0.63915612, 0.49312864, 0.41172122]
An important point to reiterate is that just as the three clustering solu-
tions are mutually independent, the standardized scores are relativized
within each dyad and thus cannot be compared across dyads. For example,
the high analytical thinking and clout bars for cluster 0 of dyad A and their
low counterparts in cluster 0 of dyad B do not (necessarily) mean that
both variables are higher in A than B. What they actually convey is that
analytical thinking and clout scores in cluster 0 of dyad A are relatively
higher than those in the other two clusters of the same dyad.
How, then, do we visually evaluate if the clusters are distinct and hence
the outcome adequate? Clusters that are not distinct would have their
centroids, or mean scores, close to each other, implying that their constitu-
ent sub-transcripts are also close to one another. This, in turn, implies an
unsatisfactory clustering solution because we did not successfully put our
objects into clear groups. Conversely, distinct clusters would have mean
scores with magnitude and/or direction that are clearly different from one
another, implying that the constituent sub-transcripts form well-delineated
groups. Visual inspection of all three dyads suggests that the clusters are
indeed fairly distinct, especially for dyad B, which gives us some confidence
in our clustering solutions.
Nevertheless, bar plots and other visualizations rely on subjective vis-
ual judgments that are not quantifiable. While such judgments might be
good enough for practical purposes if the differences are obvious, they
should be supported by more quantitative validation procedures other-
wise. We will revisit the dual use of visual and quantitative validation
tools in later chapters. For the present case, a useful approach is to evalu-
ate the extent to which the outcomes of an alternative data analytic tech-
nique align with the present k-means clustering solutions. We mentioned
logistic regression as a common supervised machine-learning technique
at the start of the chapter. Recall that the main difference between super-
vised and unsupervised techniques like clustering is the use of pre-existing
group labels in the former. For example, if our objective were to see if
LIWC scores can successfully predict the therapy approach (psychoa-
nalysis versus CBT versus humanistic therapy) used in each transcript,
we would use logistic regression with the LIWC scores as our properties
and therapy approach as the pre-existing outcome. We can then validate
the logistic regression model by evaluating its predictive accuracy; i.e.,
how many of the predicted labels match the pre-existing labels. Our
approach is therefore to treat the cluster labels generated by k-means
clustering as ‘pre-existing’, and use logistic regression to see how well its
predictions match these labels. This allows us to step out of the previ-
ously mentioned logical circularity because a different algorithm is now
involved.
A full introduction to the logic and characteristics of logistic regression
is beyond the present scope, but readers may refer to the following exam-
ples of discourse-related applications (Tay, 2021a; Zhang, 2016; Zhang et
al., 2011). The code below imports the needed libraries from scikit-learn
and prepares our data from k-means clustering by splitting it into X and y.
X retains the LIWC scores (by removing the last column of cluster labels),
and y is the cluster_labels.
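A minimal sketch of this preparation follows; the metric imports are included here because they are used in the next step.
#import the libraries and split the clustered dataframe into properties (X) and labels (y)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

X = data.drop('cluster_labels', axis=1)  #the four LIWC summary variable scores
y = data['cluster_labels']               #k-means cluster labels, treated as 'pre-existing'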
At this point, we have the option of using the same train_test_split fea-
ture from Chapter 2 to randomly split the data into training and testing
datasets. We would then fit the logistic regression to the training set and
use it to predict the testing set to evaluate the predictive accuracy of our
k-means clustering model. Recall that this allows us to test how well our
model performs with ‘unseen’ data. The code below instead implements a
simplified process where the entire dataset is used to fit the logistic regres-
sion model, and all the predicted labels are compared with the actual
labels. In other words, we are working only with ‘seen’ data. Readers may
try to write the code to implement the train_test procedure as a challenge!
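A sketch of the simplified procedure just described, following the (logreg.predict(X), y) ordering discussed next:
#fit the logistic regression on the entire ('seen') dataset
logreg = LogisticRegression()
logreg.fit(X, y)

#compare predicted with actual cluster labels
print(confusion_matrix(logreg.predict(X), y))  #rows: predicted labels, columns: actual labels
print(accuracy_score(y, logreg.predict(X)))    #overall proportion of correct predictions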
The rows represent predicted labels and columns represent actual labels,
following the order we specified them in the code (logreg.predict(X), y). The
matrix tells us that 10 sub-transcripts that were assigned to the first cluster
(cluster 0) were also correctly predicted to belong to it. Likewise, 29 sub-
transcripts that were assigned to the second cluster (cluster 1) were correctly
predicted as such. There was in fact only one prediction error – a sub-tran-
script that was assigned to cluster 0 but predicted to belong to cluster 1. In
any confusion matrix, the number of correct predictions is always the sum
of the top-left to bottom-right diagonal. We have 39 in this case out of 40,
giving us the above accuracy score of 39/40 = 0.975. We will learn about
other nuanced measures in Chapter 4 to complement this simple measure of
overall accuracy, but the latter is sufficient for the present purpose of using
an alternative data analytic technique to validate our clustering models.
In summary, because a pure train_test validation approach on a cluster-
ing dataset without pre-existing cluster labels is logically circular, we can
use complementary alternative methods like plotting cluster means and
logistic regression. There are still more ways to establish the validity of
clustering solutions. A further suggestion on how to boost the external
validity of the present examples might be to elicit rating scores from thera-
pists and clients on their subjective perception of synchrony and compare
these with the clustering outcomes. To reiterate what was broadly stated
in Chapter 1, model validation is a context-specific process that should not
rely on one-size-fits-all solutions.
After validating our clustering solutions, we simply go through each
dyad and keep track of which sessions have therapist and client sub-tran-
scripts falling into the same cluster. This will also give us the percentage
of synchronized sessions as an overall measure of a dyad, as well as the
distribution pattern of (a)synchrony across sessions for potential further
analysis. Table 3.2 summarizes the above with green cells indicating syn-
chrony (Tx and Cx in the same cluster) and red cells asynchrony (Tx and Cx
not in the same cluster).
A total of 5 out of 15 sessions (33.3%) in the psychoanalysis dyad are
synchronized, compared to 5 out of 20 sessions (25%) in the humanistic
therapy dyad. Remarkably, the cognitive-behavioral therapy dyad ends
up as a case of ‘perfect asynchrony’ as none of the sessions are synchro-
nized. The distribution across sessions (left to right) provides a visual
overview of where in the treatment span (a)synchrony occurs. This is
useful for capturing potentially interesting patterns like contiguous or
intermittent ‘(a)synchrony blocks’ as explained below. The (a)synchronic
nature of each session, which could be seen as a categorical outcome,
could also be forecasted using relevant techniques like Markov chains
that lie beyond the present scope. We will return to forecasting as an
objective in Chapter 5.
Each of the three dyads will accordingly be illustrated by one extract below. Even though the
words in these extracts might not be the main contributors to the LIWC
scores, and hence cluster membership, of the sub-transcripts they belong to,
they suffice to illustrate how quantitative and qualitative analysis comple-
ment each other. It was mentioned above that all three clients share similar
demographics, which maximizes their comparability. The extracts were also
selected on this basis, in that they all zoom in on a discussion of the client’s
difficulties in relating with a specific individual. Furthermore, because the
main objective for all three dyads is to resolve these difficulties, it is reason-
able to assume that the interactional styles featured in each of these extracts
will recur throughout the rest of their respective sessions.
Extract 1 is from session 8 of dyad A (psychoanalysis), a synchronized
session (see Table 3.2). Recall that dyad A has the highest synchrony meas-
ure of 33%. The client is relating her boyfriend’s problems at work and
her frustration at not being able to help. The therapist is guiding the client
to re-experience her emotional disturbances and pin down the cause of her
reactions.
1. CLIENT: So he’s worked now for four years in this job, and it’s going
to be so hard for him to turn the other way and I can’t will him to do
anything.
2. THERAPIST: Yeah. I guess I’m imagining that, seeing him suffer this
way and be himself so sort of helpless and being so helpless yourself to
do too much about it, is part of what makes it so difficult.
3. CLIENT: Um hmm. So I sense some urgency in the, like, speeding up,
or in getting the most out of therapy while he has it. That is my get-
ting-the-most-out-of-things tendency. He does not feel this way. He's
like, “Ah, she just told me I was punishing myself”.
4. THERAPIST: Hmm.
5. CLIENT: Like, yeah. That’s the point.
6. THERAPIST: It’s pretty hard to sit by, huh?
7. CLIENT: Yeah. It’s so hard. It was so much harder in college though.
God, I was like, um, I felt that I could not go on in the relationship a
number of times.
8. THERAPIST: I mean I guess the place to look would be you know,
um, I mean, it does almost like you’ve vicariously experienced his
stress and, except that you’re helpless cause you can’t do all the things
that you would have done if it were you. But it was him.
9. CLIENT: Yeah.
10. THERAPIST: You know, that is what I imagine used to hang you up
about this.
11. CLIENT: Yeah, that was probably the main thing.
Referring back to Table 3.1, in this session the therapist and client have
similar scores for analytical thinking (Client=15.09, Therapist=11.39),
clout (Client=51.7, Therapist=63.4), and authenticity (Client=28.18,
Therapist=24.97), with a larger difference for emotional tone (Client=42.92,
Therapist=28.62). These similarities, which led to the statistical determina-
tion of synchrony, are linguistically reflected by the observable level of con-
cord between therapist and client. Markers of agreement like ‘yeah’ (Turn
2, 5, 7, 9, 11) and ‘um hmm’ (Turn 3) suggest gradual co-construction of a
shared interpretation of the client’s situation. Their similarly low analyti-
cal thinking indicates a mutually informal, personal, here-and-now, and
narrative style, as the therapist guides the client to explore the underlying
meanings, causes, and mechanisms of her thoughts and feelings. The thera-
pist’s display of mid-level clout is noticeable. On the one hand, she asserts
her interpretations by frequently using ‘I’ and directs them towards the cli-
ent with ‘you’ (turns 2, 8, 10). On the other hand, she carefully reduces the
force of these interpretations with hedging expressions like ‘I guess’ (Turn
2), ‘almost like’ (Turn 8), and ‘I imagine’ (Turn 10). The client, in turn,
does not display a significantly lower clout as she appeared to respond well
to this approach, concurring with the therapist’s interpretations. These
observations also account for the similar relatively low levels of authentic-
ity – unsurprisingly, given the general psychoanalytic aim of ‘making the
implicit explicit’, clients may find themselves speaking in a more guarded
and distanced manner when working through repressed thoughts and feel-
ings. The therapist also displays a comparable level of authenticity, which
may suggest an explicit effort to ‘reflect’ the client in a neutral and non-
interfering manner. It is also interesting to note that, contrary to strate-
gies like repetition and contiguity observed elsewhere (Ferrara, 1994), the
present linguistic synchrony does not seem to be based on taking up each
other’s keywords or phrases. This is consistent with the earlier observation
that content words may be less revealing of interactional stances (Doyle &
Frank, 2016), and will be further illustrated by the next extract where we
see the converse case of high repetition but low synchrony.
Extract 2 is from session 6 of dyad B (CBT). Recall that all sessions in
dyad B are asynchronized. The therapist establishes the client’s ‘dire need
for approval’ from her mother as a key irrational belief to be disputed, and
asks her to identify more potential irrational beliefs. The therapist then
proceeds to point out why they are irrational.
1. THERAPIST: So this need for approval, this dire need for approval,
and very pointedly from your parents, maybe more so from your
mother, is going to keep you suffering if you don’t continue the good
work you’re doing. So, any more irrational beliefs before you dispute
the life out of these ones right here and now?
As mentioned earlier, both speakers echo each other from Turn 3 to 6. The
therapist guides this process by repeating key parts of the client’s utter-
ances to prompt further reflection, and the client repeats them again (‘I
was so impulsive, that makes me…’, ‘As I shouldn’t be’) to demonstrate
this reflection. Such repetitions and overlaps are expected when dysfunc-
tional thoughts, beliefs, assumptions, etc. are discussed because they often
involve concrete details depicted by content words. In fact, if we had per-
formed cluster analysis with an alternative quantification scheme that is
based on document similarity, like the vectorization process mentioned in
Chapter 1, we would very likely see a high degree of synchrony in dyad B.
Our motivated choice of LIWC and its non-emphasis on content words,
however, reveals that surface-level similarity does not entail a high syn-
chrony measure. Referring again to Table 3.1, the therapist and client have
very different scores for all variables: analytical thinking (Client=4.23,
Therapist=15.01), clout (Client=38.41, Therapist=91.05), authentic-
ity (Client=73.35, Therapist=44.97), emotional tone (Client=34.36,
Therapist=74.45), which suggest highly divergent interactional stances.
The contrast in clout is evident from Turn 1 as the therapist uses many
client-directed pronouns (‘if you don’t continue…’, ‘before you dispute…’)
to assume an expert-like and directive stance to establish the ‘disput(ing) the life out of’ irrational beliefs as a key focus of their interaction. The
client obliges by reflecting on the therapist’s directions using self-directed
pronouns (‘I was so impulsive’, ‘I shouldn’t be’). We see the reverse pat-
tern for authenticity – the client’s high score is reflected in her willing-
ness to disclose her thoughts and feelings, which is unsurprising given the
present therapist-directed focus on her irrational beliefs. By contrast, the
therapist’s low score is attributable to her exclusive focus on the client.
The therapist’s higher scores for analytical thinking and emotional tone
and she’s also so good-hearted. I empathize with her too much, and I
know what she feels like too much, and I know how she’s not able to
cope, and it hurts.
10. THERAPIST: And it’s like there’s too much overlap, it not only ends
up maybe damaging you, but making it really hard for you to… It may
sort of feel weak or like meaningless to someone else, but like it really
helps to have that, at that point because it sounds like in one way you
feel the same way. You’ve got to have that separation to retain any
sense of yourself.
11. CLIENT: Um hmm
12. THERAPIST: …sense of yourself.
From Table 3.1 we see that the synchrony of this session is attributable
to similar scores for analytical thinking (Client=7.4, Therapist=5.65)
and authenticity (Client=80.6, Therapist=78.43), with larger differences
in emotional tone (Client=33.2, Therapist=43.22) and especially clout
(Client=19, Therapist=47.84). In this sense it lies between extract 1 where
three variables are highly similar, and extract 2 where none of the variables
are. Interestingly, this coincides with the fact that the overall synchrony
measure of dyad C is also midway between A and B.
A closer analysis of the interactional construction of synchrony, indeed,
reveals elements that resemble both extracts 1 and 2. The therapist attempts
to clarify the client’s feelings by paraphrasing her account more precisely
like ‘keep yourself separate from her’ (Turn 2) and ‘those things can crip-
ple’ (Turn 6). The observed concord in extract 1 is noticeable here as the
client shows tacit agreement by echoing these utterances in the following
turns (‘what’s keeping myself separate’, ‘really crippling’), and markers
of agreement like ‘(to an extent) yeah’ (Turn 3) and ‘um hmm’ (Turn 11).
This general dynamic accounts for their similar analytical thinking and
authenticity scores. Both score low in the former as the conversation is
informal and narrative-like, and high in the latter as the client’s disclosure
of her feelings (‘I haven’t done a damn thing about my mother’, ‘I’m
upset in five minutes’, ‘I am so vulnerable’) is met with the therapist’s open
and empathetic understanding (‘Man, those things can cripple’, ‘I know’).
Their emotional tone scores are not as similar, but both tend towards the
negative end. Their use of negative emotion words is consistent throughout
as the client relates her and her mother’s feelings (‘upset’, ‘helpless’, ‘vul-
nerable’, ‘hurts’), and the therapist focuses more on the personal meaning
of her experiences (‘lost’, ‘damaging’, ‘weak’). However, while the thera-
pist in extract 1 does not seem to take the lead, the therapist here is subtly
leading the process by summarizing the client’s reflections, drawing out
their implications, and proposing an interpretation that is expected to be
accepted (Antaki et al., 2005). This is closer to the therapist’s educative
stance in extract 2 and accounts for the disparity in clout. Notice also that
the aforementioned concord is demonstrated to a lesser extent here. The
client appears to agree with the therapist (only) ‘to a degree’ (Turn 3), and,
unlike extract 1, she expresses her feelings more independently and does
not orient her utterance as a response to the therapist at every turn.
In summary, the above analyses attempted to contextualize the quanti-
tative synchrony measures and illustrate how linguistic (a)synchrony can
be constructed in different ways that can be examined at the individual
dyadic level. Our examples generally reflect characteristics expected at the
theoretical level of therapy type – dyad A demonstrates a high level of non-
judgmental ‘reflection’ often discussed in psychoanalysis, dyad B presents
a sharp contrast where the CBT therapist adopts a more institutionalized
educative role, and dyad C contains elements of both in the humanistic
therapist’s broad adoption of a guiding, empathetic approach.
To conclude, this chapter demonstrated the combined application of
cluster and discourse analysis to model linguistic (a)synchrony in thera-
pist–client interaction. It follows Chapter 2’s focus on the session as the
key unit of quantitative analysis, with contextually bound extracts serv-
ing an important illustrative purpose. This affirms the importance of a
mixed-method orientation to linguistic (a)synchrony research, where
quantified measures of a dataset are complemented with a more critical
eye on higher-order communicative strategies and phenomena. The sam-
ple analyses of three dyads from key therapy approaches were performed
with both researcher and practitioner objectives in mind. Researchers may
adopt a similar comparative approach on more representative datasets to
study how (a)synchrony varies across therapy types, the temporal distri-
bution of (a)synchronized sessions within a dyad, and/or conduct further
qualitative analyses of (a)synchrony construction from different theoreti-
cal perspectives. Interested practitioners can apply the approach to their
own work and critically reflect on their socio-psychological stances vis-
à-vis their clients, as well as their avowed therapeutic approach. It would
be particularly interesting to track how one’s tendency to (a)synchronize
changes across different clients and over time. Additionally, the approach
could also be applied to other social contexts where there is an interest in
examining linguistic (a)synchrony between speakers and across motivated
intervals, such as classroom interaction (e.g., teacher versus student talk
across lessons) or online fora (different posts across time). It is worth reit-
erating that cluster analysis can be performed on the outcomes of alterna-
tive quantification schemes that have different theoretical assumptions and
underpinnings than LIWC. A comparison of how the resulting clustering
solutions differ is, in fact, an interesting direction in itself. At the very
least, it would demonstrate how data analytic techniques can be used as
systematic testing grounds for just how different various discourse analytic
perspectives on the same dataset are.
Python code used in this chapter
#import dataset
data = pd.read_csv('covid.csv', index_col='Country')
#generate dendrogram
plt.title("COVID-19 dendrogram",fontsize=25)
plt.xticks(fontsize=25)
plt.yticks(fontsize=25)
plt.ylabel('Distance',fontsize=25)
plt.xlabel('Country/region',fontsize=25)
k-means clustering
for i in num_clusters:
    model=KMeans(n_clusters=i)
    model.fit(data)
    inertias.append(model.inertia_)
References
Anderson, H., & Goolishian, H. (1988). Human systems as linguistic systems:
Preliminary and evolving ideas about the implications for clinical theory. Family
Process, 27(4), 371–393.
Antaki, C., Barnes, R., & Leudar, I. (2005). Diagnostic formulations in
psychotherapy. Discourse Studies, 7(6), 627–647. https://doi.org/10.1177/1461445605055420
Ardito, R. B., & Rabellino, D. (2011). Therapeutic alliance and outcome of
psychotherapy: Historical excursus, measurements, and prospects for research.
Frontiers in Psychology, 2, 1–11.
Arkowitz, H., & Hannah, M. T. (1989). Cognitive, behavioral, and psychodynamic
therapies: Converging or diverging pathways to change? In A. Freeman, K.
Spong, S. (2010). Discourse analysis: Rich pickings for counsellors and therapists.
Counselling and Psychotherapy Research, 10(1), 67–74.
Strong, T. (2006). Wordsmithing in counselling. European Journal of
Psychotherapy, Counselling and Health, 8(3), 251–268.
Tay, D. (2016). Finding the middle ground between therapist-centred and client-
centred metaphor research in psychotherapy. In M. O’Reilly & J. N. Lester
(Eds.), The Palgrave handbook of adult mental health (pp. 558–576). Palgrave
Macmillan. https://doi.org/10.1057/9781137496850_29
Tay, D. (2021a). Is the social unrest like COVID-19 or is COVID-19 like the social
unrest? A case study of source-target reversibility. Metaphor and Symbol, 36(2),
99–115.
Tay, D. (2021b). Metaphor response categories and distribution between therapists
and clients: A case study in the Chinese context. Journal of Constructivist
Psychology, 34(4), 378–394. https://doi.org/10.1080/10720537.2019.1697913
Tay, D., & Qiu, H. (2022). Modeling linguistic (A)synchrony: A case study of
therapist–client interaction. Frontiers in Psychology, 13. https://www.frontiersin.org/article/10.3389/fpsyg.2022.903227
Valdesolo, P., & DeSteno, D. (2011). Synchrony and the social tuning of
compassion. Emotion, 11(2), 262–266.
Watson, J. C., Goldman, R. N., & Greenberg, L. S. (2011). Humanistic and
experiential theories of psychotherapy. In J. C. Norcross, G. R. VandenBos, &
D. K. Freedheim (Eds.), History of psychotherapy: Continuity and change (2nd
ed., pp. 141–172). American Psychological Association. https://doi.org/10.1037/12353-005
Weiste, E., & Peräkylä, A. (2013). A comparative conversation analytic study
of formulations in psychoanalysis and cognitive psychotherapy. Research on
Language and Social Interaction, 46(4), 299–321. https://doi.org/10.1080/08351813.2013.839093
Yim, O., & Ramdeen, K. T. (2015). Hierarchical cluster analysis: Comparison
of three linkage measures and application to psychological data. Quantitative
Methods for Psychology, 11(1), 8–21. https://doi.org/10.20982/tqmp.11.1.p008
Zhang, W. (2016). Variation in metonymy. Mouton de Gruyter.
Zhang, W., Speelman, D., & Geeraerts, D. (2011). Variation in the (non)metonymic
capital names in Mainland Chinese and Taiwan Chinese. Metaphor and the
Social World, 1(1), 90–112. https://doi.org/10.1075/msw.1.1.09zha
Zwaan, R., & Radvansky, G. (1998). Situation models in language comprehension
and memory. Psychological Bulletin, 123(2), 162–185.
4 Classification
DOI: 10.4324/9781003360292-4
Given the objective of our case study and the stated conceptual simi-
larities between k-means clustering and k-NN, it would seem that we
can reuse the dataset from Chapter 3, which consists of the sub-transcripts of three dyads, each representing a therapy type. We would then simply use the
Table 4.1
1 15.17 14.04 86.52 3.15 8.49 15.99 92.94 51.18 5.99 12.05 89.67 33.92
2 9.34 33.42 85.29 19.33 15.12 25.32 92.85 58.51 9 13.76 89.37 35.98
3 13.34 31.25 85.28 35.86 5.35 20.47 92.83 50.54 3.37 28.34 88.31 44.79
4 16.81 19.45 84.5 71.95 4.59 14.94 92.75 60.95 4.05 25.41 87.67 47.29
5 7.8 24.39 84.14 37.45 6.27 12.79 90.42 71.04 15.47 22.87 87.61 52.98
6 13.34 31.05 83.34 80.62 4.79 33.69 88.52 45.91 16.14 22.22 85.27 40.3
7 9.74 18.65 82.39 17.22 6.76 16.41 88.04 50.45 9.76 17.3 84.89 39.14
8 14.88 70.88 78.39 34.27 6.09 16.98 86.3 43.6 7.8 23.77 84.25 52.62
9 12.39 55.84 77.9 19.38 6.27 18.84 85.88 78.6 7.23 45.65 84.16 30.59
10 6.8 28 77.51 44.32 5.26 19.92 83.7 81.93 3.7 18.59 83.79 53.46
11 15.05 33.74 77.48 16.63 4.97 23.68 82.26 75.61 13.49 29.84 81.4 38.44
12 11.95 24.91 76.85 44.43 10.32 21.19 80.48 44.49 7.4 19 80.6 33.32
13 19.59 64.84 75.72 44.54 9.28 20.66 79.56 33.91 8.33 25.57 79.57 47.62
14 18.31 50.8 75.56 74.48 4.86 33.87 76.55 18.63 9.76 21.19 79.31 20.22
15 22.5 62.65 75.17 68.28 5.75 32.15 76.13 64.51 5.65 47.84 78.43 43.22
16 10.12 42.86 74.43 39.62 9.75 66 75.88 26.93 12.23 34.34 77.4 37.81
17 5.28 38.44 74.36 45.29 4.23 38.41 73.35 34.36 8.36 19.96 77.04 26.94
18 9.22 56.6 73.03 45.39 11.4 34.03 70.42 35.74 8.02 33.58 75.97 46.55
19 11.54 39.55 72.72 61.63 6.11 82.13 70.28 44.49 6.41 65.26 75.86 44.72
20 20.18 44.24 72.55 41.42 6.53 76.84 68.99 43.18 11.81 72.85 75.38 35.53
21 15.57 62.09 72.03 57.66 7.26 46.03 67.41 44.42 25.06 50.5 72.96 36.25
22 6.81 28.55 69.6 27.03 8.36 83.54 64.03 44.83 5.81 30.36 71.72 42.36
23 17.73 21.95 69.21 60.4 10.75 44.33 61.92 47.22 5.76 34.66 70.68 39.84
24 7.39 69.04 69 72.32 15.69 80.49 60.4 65.98 6.39 73.28 70.47 39.13
25 13.05 67.06 68.92 53.45 11.37 87.56 59.66 46.79 2.53 31.73 69.25 58.63
26 8.23 41.19 68.69 21.42 7.63 46.31 59.13 46.03 3.63 46.31 68.51 46.11
Classification 111
27 14.09 75.14 68.64 45.02 12.74 90.6 57.11 72.19 10.81 56.49 67.51 61.87
(Continued)
Table 4.1 (Continued)
32 23.58 50.61 64.29 86.79 13.77 89.53 51.86 71.25 14.26 60.8 59.47 34.18
33 29.3 44.91 63.57 70.14 11.88 92.89 51.82 62.78 10.64 56.62 57.91 65.72
34 6.11 68.1 62.25 64.04 8.35 91.88 50.45 85.78 4.44 77.66 56.43 59.24
35 17.55 54.6 62.01 41.44 15.4 90.23 50.24 58.7 3.59 58.22 55.25 62.36
36 6.86 72.46 60.4 16.54 4.6 81.59 50.22 39.11 4.78 68.84 55.12 30.38
37 21.26 71.2 58.98 82.83 10.53 93.57 50.21 69.14 17.35 73.35 51.41 25.77
38 7.71 58.64 57.4 39.98 16.37 92.4 48.83 74.9 10.82 71.38 50.59 41.91
39 30.34 65.4 56.89 62.38 9.17 94.7 47.28 74.81 5.44 77.45 49.75 25.77
40 17.25 25.91 56.56 75.07 16.87 93.37 46.92 61.21 8.91 61.91 47.73 52.5
41 9.03 28.76 56.05 34.75 7.51 95.9 46.58 55.78 6.11 94.37 42.3 24.95
42 5.24 32.06 55.45 27.77 15.01 91.05 44.97 74.45 13 66.53 40.02 43.36
43 7.29 55.85 54.73 27.97 24.48 87.23 43.96 72.94 9.04 85.29 33.51 36.94
44 19.57 62.62 50.87 71.48 17.91 88.03 40.12 51.58 6.64 78.97 32.89 24.8
45 37.39 57.16 45.17 70.29 8.73 59.25 39.23 35.19 19.37 73.3 31.49 40.83
46 7.48 44.18 38.18 14.34 12.06 90.37 39.1 58.91 21.72 65.42 30.42 35.35
47 5.7 81.48 34.72 6.73 8.77 96.26 37.95 86.43 15.09 89.27 29.11 32.58
48 21.78 82.88 32.67 76 7.65 98.19 34.35 76.51 6.65 92.54 28.21 32.43
49 9.36 81.69 27.73 14.44 5.27 91.81 33.42 87.93 4.07 86.05 27.9 40.11
50 8.64 83.22 22.22 58.9 7.78 81.71 31.38 14.56 11.88 71.44 27.85 41.37
Classification 113
We then randomly split X and y into training and testing datasets (X_train, X_test, y_train, y_test) as before. Although the split shuffles the transcripts, it does not break the pairing between predictors and labels: each row of X_train still corresponds to the same transcript as the matching row of y_train, and likewise for X_test and y_test. Recall that test_size determines the ratio of this split, which is
set to 75:25 here. We therefore have 112 transcripts in the training dataset
and 38 in the testing dataset. Additionally, because our transcripts have
pre-existing group labels, it would be desirable to ensure that the training
and testing datasets maintain the proportion of group labels (i.e., therapy
types) in the entire dataset. We can do this by specifying stratify=y, which
tells Python to perform stratified random sampling based on the frequencies
observed in y, the group labels. We will discuss the limitations and alterna-
tives to this ‘one-off’ splitting approach towards the end of the chapter.
#generate random training set with stratify=y to ensure fair split among types
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
    random_state=0, stratify=y)
and at the end of the loop identifying the k with the highest accuracy. It is
usually enough to consider test_accuracy but we will introduce two addi-
tional measures for illustrative purposes below.
• test_accuracy: this is the usual measure. We fit/train the model using the
training data, apply this model onto the testing data only, and compute
the percentage of correctly predicted group labels among the testing
data.
• train_accuracy: we fit/train the model using the training data, apply this
model back onto the training data, and compute the percentage of cor-
rectly predicted group labels among the training data.
• overall_accuracy: we fit/train the model using the training data, apply
this model back onto the whole dataset (i.e., training + testing), and
compute the percentage of correctly predicted group labels among the
whole dataset.
We then initiate the for-loop that will iterate 20 times in our example,
each time incrementing k by 1. A k-NN model is fitted each time specify-
ing X_train and y_train as the training predictors and labels, respectively.
The three different accuracy measures are computed in turn using knn
.score and stored in the arrays described above. Note the use of enu-
merate instead of range (Chapter 2). The main difference is that range
iterates over just one numerical range but enumerate iterates over what
are known as tuples, or ordered pairs. We need the latter here because
if we use for k in range(1,21) and then store the accuracy measures as
test_accuracy[k], there would be a conflict between k starting at 1 (the
minimum number of neighbors) and the first element of the array being
test_accuracy[0] instead of test_accuracy[1]. Having an ordered pair i, k
resolves this issue.
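The loop itself does not appear on this page, so the following is a minimal sketch of the procedure just described, assuming the three accuracy arrays are pre-allocated to the length of neighbors (the exact variable names and settings in the original script may differ):
#k-NN loop over candidate values of k (sketch)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
neighbors = np.arange(1, 21)                       # candidate numbers of neighbors
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
overall_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):                  # i indexes the arrays, k is the number of neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)                      # fit on training data only
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)
    overall_accuracy[i] = knn.score(X, y)          # training + testing data combined
plt.plot(neighbors, train_accuracy, label='train accuracy')
plt.plot(neighbors, test_accuracy, label='test accuracy')
plt.plot(neighbors, overall_accuracy, label='overall accuracy')
plt.legend()
The formatting lines below then label and display the resulting plot.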
plt.xticks(neighbors)
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()
These were actually already done during the fourth iteration of the loop
above when searching for optimal k, but repeating them makes things
clearer. The resulting accuracy measures should match those in Figure 4.3.
The accuracy measures give us an overall idea of how well the k-NN model
can predict therapy types from transcript LIWC scores. Bringing our focus
back to the most commonly used test accuracy score, which is 0.71 (71%)
in this case, we conclude that 27 out of the 38 transcripts in our test dataset
were correctly predicted. This can be considered an adequate form of model
validation for basic purposes, similar to the logistic regression approach
used in Chapter 3. However, we might want to know further details like
how this accuracy varies across the three therapy types, because a generic
‘global’ measure may not always work well. Taking an extreme example,
if 99% of our dataset comprises just one therapy type, our model would
be measured as 99% accurate even if it just blindly predicts everything to be
that type. There are, in fact, a number of more nuanced model validation
measures that we will now introduce. The first step is, as before, to generate
a confusion matrix that reveals the distribution of correct and wrong predic-
tions by therapy types. The code below will first generate the predicted labels
(test_pred) from our model on the testing dataset, and then cross tabulate this
with the actual labels (y_test) to produce a 3x3 confusion matrix. Specifying
y_test before test_pred will position the former as rows and the latter as col-
umns, which is the standard convention for observed and predicted values,
respectively. Note the additional specification of labels=[“CBT”, “HUM”,
“PA”] to tell Python to arrange the rows in this sequence (HUM=humanistic
therapy, PA=psychoanalysis). If unspecified like in Chapter 3, the order will
be based on the sequence in which the labels appear in the data.
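The code itself is not reproduced on this page; a minimal sketch using scikit-learn's confusion_matrix function, one plausible way of implementing the cross tabulation described above, is:
#cross tabulate observed (rows) and predicted (columns) labels (sketch)
from sklearn.metrics import confusion_matrix
test_pred = knn.predict(X_test)
print(confusion_matrix(y_test, test_pred, labels=['CBT', 'HUM', 'PA']))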
Recall = True positives / (True positives + False negatives)
Now let us look down the columns to derive a slightly different metric.
Looking down the first column, 12 CBT transcripts were correctly pre-
dicted as CBT, and 1 HUM transcript and 4 PA transcripts were wrongly
predicted as CBT. This time, CBT scores only 71%, which is lower than
HUM (80%) but higher than PA (56%). These percentages express the
ratio of the number of true positives (CBTs correctly identified as CBT)
to the sum of true positives and false positives (something else incorrectly
identified as CBTs). This is known as the precision score, which measures
the accuracy of a model in not mistaking negatives as positives.
Precision = True positives / (True positives + False positives)
Therefore, high recall does not entail high precision, the contrasting results
of CBT and HUM being illustrative. What are the implications of these
different nuanced measures? In general, a model with high recall but low
precision for a group (CBT) is able to identify many group members, but
it is also more likely to misidentify non-group members as members. This
is a case of quantity over quality because many members are identified at
It turns out that scikit-learn provides a convenient way to generate all the
above measures in a table known as a classification report. This can be
done in one line of code below. Be mindful to consistently specify y_test
before test_pred to avoid mistakes. Table 4.2 shows the resulting classifica-
tion report.
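A minimal sketch of that single line, assuming the predicted labels test_pred generated earlier:
#precision, recall, f1-score, and support for each therapy type (sketch)
from sklearn.metrics import classification_report
print(classification_report(y_test, test_pred))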
The first three columns show the precision, recall, and f1-scores correspond-
ing to each of the three groups. The support column refers to the number
of samples in the (test) dataset upon which these scores are computed.
The equal distribution across groups is a result of our earlier specification
of stratify=y when splitting the original dataset into train and test data.
Without this specification, the three groups may be unevenly represented,
Our ten scores in this case are as follows, giving an average of 0.567.
[0.46666667, 0.46666667, 0.53333333, 0.46666667, 0.33333333,
0.6, 0.8, 0.73333333, 0.6, 0.66666667]
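The code that produced these ten scores does not appear in this excerpt; one plausible sketch uses scikit-learn's ten-fold cross-validation on the full dataset (the fold settings here are assumptions):
#10-fold cross-validation of the k-NN model (sketch)
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(knn, X, y, cv=10)
print(cv_scores, cv_scores.mean())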
There is cause for concern here because the average is substantially
lower than the earlier classification report figures (all >0.7), and the large
variability in the ten scores suggests that our model cannot consistently
provide accurate predictions on different sets of unseen data. Possible
solutions include going back to the drawing board and refitting the k-NN
model with a different k value, or retreating even further to reconsider
the transcript dataset. We will not illustrate the whole process again, as
the more important point being made is that model fitting and validation
#generate random training set with stratify=y to ensure fair split among types
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
    random_state=0, stratify=y)
Model validation
References
Aggarwal, C. C. (Ed.). (2020). Data classification: Algorithms and applications.
Chapman and Hall.
Demiray, Ç. K., & Gençöz, T. (2018). Linguistic reflections on psychotherapy:
Change in usage of the first person pronoun in information structure positions.
Journal of Psycholinguistic Research, 47(4), 959–973.
Fortmann-Roe, S. (2012). Understanding the bias-variance tradeoff. http://scott
.fortmann-roe.com/docs/biasvariance.html
Huston, J., Meier, S. T., Faith, M., & Reynolds, A. (2019). Exploratory study of
automated linguistic analysis for progress monitoring and outcome assessment.
Counselling and Psychotherapy Research, 19(3), 321–328.
Levitt, H., Korman, Y., & Angus, L. (2000). A metaphor analysis in treatments
of depression: Metaphor as a marker of change. Counselling Psychology
Quarterly, 13(1), 23–35.
Qiu, H., & Tay, D. (2023). A mixed-method comparison of therapist and
client language across four therapeutic approaches. Journal of Constructivist
Psychology, 36(3), 337–60. https://doi.org/10.1080/10720537.2021.2021570
Van Staden, C. W., & Fulford, K. W. M. M. (2004). Changes in semantic uses of
first person pronouns as possible linguistic markers of recovery in psychotherapy.
Australian and New Zealand Journal of Psychiatry, 38(4), 226–232. https://doi.org/10.1111/j.1440-1614.2004.01339.x
5 Time series analysis
DOI: 10.4324/9781003360292-5
Stock prices, rainfall, and birth/death rates are time series data measured
over very different time intervals, or sampling frequencies, from seconds to
decades. However, their analysts share the common objective of discover-
ing patterns underlying the fluctuations and making reliable predictions
on that basis. Two broad information sources can support this objective.
In cases where the phenomenon and its contributing factors are theoreti-
cally well understood, we can use the contributing factors as predictor
variables and build regression models to predict the phenomenon. Birth
or death rates may, for example, be predicted from measures of well-being
like income, education, and health-care levels. These predictions may
either be longitudinal (i.e., forecasting future values for the same sample)
or cross-sectional (predicting values of another sample at a similar time
point). However, in cases where the phenomenon and contributing fac-
tors are ‘messier’ or less understood, it may well be the case that past
values become the most reliable, or only available, predictors of future
values. In many such time series we, in fact, find that successive values
are systematically correlated and dependent on one another. Statisticians
call this ‘autocorrelation’ or ‘serial correlation’. Rather than using exter-
nal factors as predictors, we can therefore build regression models using
autocorrelational information from past values instead. The parameters of
these models can then be interpreted to understand structural regularities
underlying the series.
This is the crux of time series analysis – more specifically, the widely
used Box-Jenkins method (Box et al., 2015), which employs a class of
statistical models known as ARIMA (Autoregressive Integrated Moving
Average) models to analyze time series data. Note that ARIMA models are
used for numerical time series, which our present LIWC scores exemplify.
It is also possible to model and forecast categorical time series data, if say
each session is described by some categorical feature. However, this would
require different approaches like Markov chain models (Allen, 2017) that
are beyond the present scope. The meaning of ‘autoregressive integrated
moving average’ will be clarified later in the chapter. Typical applications
lie in domains where (1) the effect of time is of primary relevance to the
phenomenon of interest, and (2) the phenomenon arises from a data gen-
erating process (cf. Chapter 2) that is not fully understood, as mentioned
above. Point (2) relates to what Keating and Wilson (2019) aptly describe
as the ‘black box’ philosophy of the Box-Jenkins method. It should be clear
why financial data, like stock prices, and health data, like the incidence of
diseases, are typical contexts of application. In both domains, time is of the
essence and (causal) mechanisms of prices and infection rates are seldom
transparent. Interestingly, there is also good reason to believe that various
types of language/discourse-related data exhibit somewhat similar behavior
and are therefore potential candidates for such analyses. Let us consider the
first key feature of time being a primary (if not always explicit) variable of
concern. Examples of longitudinally sensitive language/discourse research
range broadly from time-based psycholinguistic experiments (Tay, 2020)
to grammaticalization (Hopper & Traugott, 2003) and sociolinguistic
variation and change (Tagliamonte, 2012). The time intervals of interest
likewise range from seconds to decades, and it often makes sense to assume
that past and present manifestations of the phenomenon are not independ-
ent of one another. In other words, some degree of autocorrelation exists
in the series. The second key feature, which basically precludes reliance on
pre-specified predictor variables, also squares well with received discourse
analytic wisdom. We will see that the Box-Jenkins method relies only on
the observed series and extracts patterns from it until pure randomness
or, in statistical parlance, white noise remains. These patterns are essen-
tially based on the aforementioned autocorrelation in the series. A use-
ful analogy for this process that brings back many childhood memories is
the extraction of juice from sugarcane. Juice is extracted by passing raw
sugarcane multiple times through an extractor until the fibrous residue
remains, just like autocorrelational patterns are filtered stepwise from a
raw time series until random white noise remains. Although the math-
ematical definition of these autocorrelation patterns is the same regardless
of context, the ‘why’ and ‘so what’ of their presence must be understood in
the light of domain-specific knowledge, which is some linguistic/discourse
theory in this case. This line of reasoning is consistent with the mainstream
discourse analytic logic of anticipating emergent patterns and phenomena
rather than necessarily hypothesizing them beforehand.
intercept = 14.274) estimate the initial value of y = 14.274, and every unit
increment of t is predicted to increase y by 0.812.
However, a closer look at the residuals plot reveals that the prediction
errors tend to persist in the same direction over many consecutive inter-
vals. The most recent predicted values of y from t = 20 to 30 (circled) are
all lower than the actual values with signs of a worsening upward devia-
tion. This is where we can make a useful distinction between prediction
and forecasting. Prediction is a more general term, which includes using
the model to generate predictions for existing values and the respective
residuals, while forecasting refers to the prediction of future values that
are not yet known. Therefore, even if our high R2 model fits the existing
data well and makes good predictions, it will make systematic forecast
errors for the 31st interval and beyond. This is another example of poten-
tial underfitting, as briefly discussed in Chapter 4, as the linear model fails
to capture the recent ‘localized’ upward movement. More generally, the
overall fit may conceal ‘localized’ up/downward movements resulting from
autocorrelation in the series, and it is precisely these movements that point
towards context-specific phenomena like a growing/falling momentum in
some discourse feature over the time intervals at hand. ARIMA time series
models, in contrast, account for these movements by definition. They are
intended to capture not just the way things are, but also the way things
move (Hyndman & Athanasopoulos, 2018). Before we proceed to demon-
strate how the juice extraction process works step by step in the case study,
it is useful to gain an overview by looking at (1) what the autocorrelational
structure of a time series exactly means, and (2) how autocorrelation trans-
lates to key components of time series data across different contexts.
Autocorrelation or serial correlation is the correlation between values
that are separated from each other by a given number of intervals. We can
see it as the correlation between the series and a lagged version of itself.
Table 5.1 illustrates this concept by reproducing the 30 measured y and t
values from Figure 5.1 above.
t y yt+1 yt+2
1 20 18 22
2 18 22 16
3 22 16 16
4 16 16 21
5 16 21 22
6 21 22 23
7 22 23 21
8 23 21 21
9 21 21 24
10 21 24 23
11 24 23 22
12 23 22 22
13 22 22 23
14 22 23 24
15 23 24 23
16 24 23 25
17 23 25 25
18 25 25 27
19 25 27 32
20 27 32 33
21 32 33 35
22 33 35 36
23 35 36 35
24 36 35 38
25 35 38 37
26 38 37 38
27 37 38 42
28 38 42 42
29 42 42 -
30 42 - -
Recall that in Figure 5.1, the 30 paired values of y and t were fitted to
a standard linear regression model that assumed independence among the
y values across t. The lag 1 and lag 2 columns in Table 5.1, on the other
hand, juxtapose each y value with the y value one and two intervals later
(i.e., yt+1 and yt+2). The lag 1 autocorrelation is then the Pearson’s cor-
relation coefficient between the 29 paired values of yt and yt+1, the lag 2
autocorrelation is that between the 28 paired values of yt and yt+2, and so
on for lag k in general. The longer the time series, the more lagged autocor-
relations we can calculate, but each higher lag will have one pair of values
less. The original series in the first column is also called lag 0, and the lag
0 autocorrelation is always +1 because it is simply the correlation of the
series with itself. Autocorrelations explicitly measure the degree of interde-
pendence within the series as the values are sequenced in time.
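As a quick illustration, a single lagged autocorrelation can be computed directly in pandas, assuming the 30 values are stored in a dataframe column y (a sketch only; the chapter's own (P)ACF code follows):
#lag 1 autocorrelation (sketch): correlation between the series and itself shifted by one interval
r1 = series.y.autocorr(lag=1)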
#compute (P)ACF
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
series = pd.read_csv('data.csv', index_col='Session')
acf(series.y, nlags=10)
pacf(series.y, nlags=10)
#plot (P)ACF
plot_acf(series.y, lags=10, alpha=0.05, title='Autocorrelation function')
plot_pacf(series.y, lags=10, alpha=0.05, title='Partial autocorrelation function')
Setting series.index creates what is known as a ‘date time index’ for the input
time series. A date time index allows the analyst to easily perform various
time-related operations on time series data such as resampling, rolling,
and expanding metrics, but these are beyond the present scope. We then
specify three settings before plotting the decomposed time series: (1) the
frequency of time intervals, set to ‘monthly’ in this case; (2) the start time
of the series, set to the year 1949 for our example, and (3) the number of
periods or observations in our data, which we can automatically count
with len(series). We then run seasonal_decompose().plot() on the desired
series, under column ‘y’ in the dataframe in this case. Figure 5.2 shows the
seasonal decomposition outcome of our example. This is a widely used
the predicted values at each interval, as well as the forecasted values for a
specified number of periods into the future. The grey zone indicates 95%
confidence intervals for the forecasts. The two lines are quite close to each
other in both cases, reflecting the prototypical fit. Our case study will dem-
onstrate the stepwise Box-Jenkins method to arrive at these outcomes. For
now, notice that the AR1 series is characterized by strong period-to-period
up or downward momentum for long stretches of time, with occasional
directional switches. This is what was meant by a ‘structural signature’
that identifies typical datasets like stock prices with bullish momentum.
On the other hand, the MA1 series is characterized by rapid period-to-
period ‘bouncing’ where sudden jumps tend to be ‘restored’ by an oppo-
site movement thereafter. This is a very different structural signature that
typifies a correspondingly different context, like high frequency trading
with minute-to-minute transactions where past values have little long-term
influence. Readers may already see where this is going. For different time
series discourse data in psychotherapy, classrooms, the media, and so on,
questions like which model types fit well and what the structural signatures
imply are exciting and underexplored. The ability to forecast future values
is also an intriguing application in some discourse contexts, as we will
show in the case study below.
Figure 5.2 showed that inspecting a decomposed series can reveal trends,
seasons, and cycles. Another important reason to inspect the series as the
first step is that the Box-Jenkins method relies on just one realization (i.e.,
our data) to estimate the parameters of its underlying data-generating pro-
cess. In most cases it is, in fact, only possible to have one realization because
we cannot go back in time to collect another ‘sample’. Note that we are not
talking about the number of observations in our series, because they still
collectively comprise only one series regardless of how many observations
there are. For the statistical estimation to be valid, the mean and variance
of our ‘N=1’ sample series should therefore remain constant across differ-
ent sections of time. This is known as the condition of stationarity, and
we should have a stationary series before moving to Step 2. If our series is
non-stationary, it needs to be transformed into a stationary one. Assuming
that our two series (therapist and client analytical thinking scores over 40
sessions) are imported as two columns of a dataframe called series, the fol-
lowing code will plot and annotate them. It is also good practice to number
the sessions consecutively and set it as the index column of the dataframe.
The setting subplots=True will plot each series on a separate grid, which
can be changed to False to plot everything on just one grid. Lastly, series.
describe() provides handy descriptive statistics like the number of observa-
tions, mean, and standard deviation of each series.
#inspect series
import pandas as pd
series = pd.read_csv('data.csv', index_col='Session')
series.plot(subplots=True)
series.describe()
Figure 5.4 shows our two series (therapist and client) at the top, as well
as additional examples of non-stationary series and outcomes of various
transformation procedures.
The means of our two series appear to remain constant over time as
the plots fluctuate around a midpoint along the y-axis. From the descrip-
tive statistics, we note for now that the therapist series has a higher mean
(49.999) than the client series (42.001), suggesting a higher degree of ana-
lytical thinking in general. Nevertheless, as with the elbow plots in Chapter
3, it is always good to verify visual inspection with a formal measure.
We can do this with the Augmented Dickey-Fuller (ADF) test where H0 =
the series is non-stationary. The code below imports the relevant librar-
ies and returns just the p-value from the full ADF test output. Note that only one series is tested at a time – in this case the
therapist scores, which are found under the Therapist column. As p <0.01
for both therapist and client, H0 is rejected and both series are confirmed
as stationary.
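The test code is not shown on this page; a minimal sketch using statsmodels (the p-value is the second element of adfuller's output tuple) is:
#ADF test for stationarity (sketch); H0 = the series is non-stationary
from statsmodels.tsa.stattools import adfuller
print(adfuller(series.Therapist)[1])    # p-value for the therapist series
print(adfuller(series.Client)[1])       # p-value for the client series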
The variances of both series also appear constant over time since the disper-
sion around the means is not obviously changing. Such series are described
as homoscedastic. Generally, a stationary series is also homoscedastic but
not vice versa. Homoscedasticity can be verified with the Breusch-Pagan
test, where H0 = the series is homoscedastic. The code for this is more
cumbersome than the ADF test as it involves additional steps like regress-
ing the series on the time intervals expressed as integers, and then creating
an array out of the series. We do not need to delve into the details, but the
final outcome is likewise a p-value for each series.
import numpy as np
def create_array(col):
    s = []
    for i in col:
        a = [1, i]
        s.append(a)
    return np.array(s)
array = create_array(series.Therapist)
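The remaining lines are not reproduced here; a rough sketch of how the p-value could then be obtained with statsmodels, following the steps just described, is:
#Breusch-Pagan test (sketch); H0 = the series is homoscedastic
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
t = np.arange(1, len(series) + 1)                           # time intervals as integers
resid = sm.OLS(series.Therapist, sm.add_constant(t)).fit().resid
print(het_breuschpagan(resid, array)[1])                    # second element is the p-value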
#differencing
series['Therapist_diff'] = series.Therapist.diff(periods=1)
The outcome of differencing the third plot is shown in the fourth plot. Note
that some values are negative because yt - yt-1 can be negative. The most
important feature is that the rising trend is now eliminated. If the differenced
series is still non-stationary (as confirmed by an ADF test), we can perform
a further second-order differencing. This means to difference the first-differ-
enced series one more time, by computing zt − zt-1, where z denotes the first-differenced series. We can do this by applying .diff(periods=1) once more to the newly created Therapist_diff column, preferably saving the outcome to another new column. Most non-stationary time series will become sta-
tionary after, at most, two orders of differencing. Note that with every order
of differencing, we “lose” one observed value. The recommended minimum
number of observations for accurate TSA is 50 (McCleary et al., 1980).
The fifth plot is an example of a heteroscedastic series with the disper-
sion around the mean, or the series variance, increasing over time. There
are several ways to deal with this. The first is to perform a log-transfor-
mation by converting each value yt to log(yt), the result of which is shown
in the final plot. Log-transformation is a commonly applied technique for
other purposes like making a skewed distribution more normal, or chang-
ing the scales of graphical axes. The following code creates a new col-
umn (Therapist_logged) by log-transforming the original therapist series
(assuming it was heteroscedastic).
#log-transformation
import numpy as np
series['Therapist_logged'] = np.log(series.Therapist)
In some cases, the original series may need to undergo both differenc-
ing and variance stabilization. If both procedures are required, variance
stabilization should be performed first. The transformed series can then be converted back to the original scale (e.g., by exponentiating log-transformed values) when generating predictions and forecasts later.
The ACF and PACF of each series are computed in Step 2. This is argu-
ably the most important step because subsequent interpretation of statisti-
cally significant autocorrelations is crucial for identifying which types of
ARIMA models will fit the series. The code for computing and plotting
(P)ACF was already provided earlier in the chapter, but the code below
introduces the subplots feature to plot the (P)ACF correlograms of our two
series in a 2x2 grid.
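The grid-plotting code is not shown on this page; a minimal sketch (the panel titles are assumptions) is:
#(P)ACF correlograms of both series in a 2x2 grid (sketch)
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(2, 2)
plot_acf(series.Therapist, lags=10, ax=axes[0, 0], title='Therapist ACF')
plot_pacf(series.Therapist, lags=10, ax=axes[0, 1], title='Therapist PACF')
plot_acf(series.Client, lags=10, ax=axes[1, 0], title='Client ACF')
plot_pacf(series.Client, lags=10, ax=axes[1, 1], title='Client PACF')
plt.show()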
In Step 3, we decide which ARIMA model type best describes our data
based on the relative behavior of (P)ACF: an autoregressive (AR) model, a
moving average (MA) model, or a combination of both (ARMA/ARIMA).
Mathematically, AR models express the current value of the time series as
a function of its past values, MA models express it as a function of residu-
als in past intervals, while ARMA models combine both. The difference
between ARMA and ARIMA is the letter I for ‘integrated’. The latter refers
to cases where differencing was performed to achieve stationarity in Step
1. We will soon discuss what these ‘mean’ as structural signatures in a
discourse context. For now, Table 5.2 offers key guidelines for candidate
model selection based on the five most common behavior patterns of (P)
ACF.
The order k corresponds to the number of lags for which the (P)ACF is significant. The general form of an MA(k) model is yt = μ − θ1at-1 − … − θkat-k + at, where μ is the constant/intercept, θ1 … θk are the moving average coefficients, and at is the residual (white noise) term at interval t.
In Step 4, we fit the candidate model to our time series and estimate its parameters. For an MA model the parameters are the intercept/constant μ and θ1 … θk, while for an AR model they are the intercept/constant (1 − Φ1 − … − Φk)μ and Φ1 … Φk. Due to the relative complexity of ARIMA over lin-
ear regression models, different software may give slightly different param-
eter estimates. Our model validation procedure also begins at this step.
Previous chapters introduced train-test data, resampling, visualizations,
and alternative analyses as validation techniques. For time series analysis,
we opt for a train-test approach where the first 37 out of 40 session scores
are used as training data to fit the candidate model, and the final three
sessions are reserved as testing data.
We are now ready to fit an AR(1) model to the therapist series and
MA(1) model to the client series, using the SARIMAX class in statsmod-
els and our training data train_series. The following code performs the
necessary import and model fitting, naming the therapist and client mod-
els model1 and model2, respectively. The AR(1) model is specified by
order=(1,0,0) as per the (p,d,q) nomenclature introduced above, while
the MA(1) model is specified by order=(0,0,1). Specifying trend=‘c’ tells
Python to include the constant/intercept term in the model. This could be
removed, and the model refitted, if the constant/intercept turns out to be
statistically insignificant. The model.summary() will generate the output
shown below.
import statsmodels.api as sm
model1 = sm.tsa.SARIMAX(train_series.Therapist, order=(1,0,0), trend='c').fit()   # AR(1) for the therapist series
model2 = sm.tsa.SARIMAX(train_series.Client, order=(0,0,1), trend='c').fit()      # MA(1) for the client series
model1.summary()
model2.summary()
Figure 5.6 shows screenshots of the output for the AR(1) (top) and MA(1)
model (bottom). Among other details, the series (or dependent variable)
and fitted model, the number of observations, and model fit statistics are
shown at the top of each output. Residual diagnostics like the Ljung-Box Q
and its associated p-value, to be discussed in the next step, are shown at the
bottom. The middle panel shows the estimated parameters (coef), standard
errors, and z-scores (number of standard deviations away from 0), p-values
Evaluating a time series model involves three interrelated aspects: (1) pre-
dictive accuracy: how well it predicts observed values, especially the testing
dataset, (2) model fit: how well it fits the data in general, and (3) residual
diagnostics: whether the model has captured all patterns from the series (or
squeezed all the juice from the sugarcane). If we are satisfied with all three,
we can move on to the final step of interpreting the model in context. If
not, we return to Step 3 and select another candidate model.
Predictive accuracy and model fit can be visually evaluated by plotting
the model predicted values, including forecasts, against the observed val-
ues. The code below generates the predicted values and forecasts using
the therapist AR(1) model (model1). The same can be done by replacing
model1 with model2. Remember to also rename predict and forecast when
doing so, or the results for model1 will be overwritten.
predict = model1.get_prediction(start=1, end=len(series))                 # predictions over the observed sessions
predictinfo = predict.summary_frame()
forecast = model1.get_prediction(start=len(series), end=len(series)+3)    # forecasts beyond the observed series
forecastinfo = forecast.summary_frame()
Note that predict consists of predicted values from the first session (start=1)
to the last session in the original 40-session series (end=len(series)). It
therefore includes the predicted values for the final three sessions, which
we could then compare against the testing data. A summary of the results,
including the predicted values and their 95% confidence intervals, is gener-
ated by predict.summary_frame() and stored as predictinfo. On the other
hand, forecast consists of predicted values beyond the observed interval –
from session 40 to 43, which is why they are called forecasts. A summary
is likewise generated and stored as forecastinfo.
With this information we can now generate our plot with the code
below. This plots the therapist series, but relevant parts can be changed
to show the client series. As usual, cosmetic details like labels, colors, and
font sizes can be changed at will. ax.axvspan() creates a red region to com-
pare predicted versus observed values in the last three sessions (i.e., the
testing data), and ax.fill_between() colors the 95% confidence intervals of
forecasts. Figure 5.7 shows the predicted versus observed plots for both
models.
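The plotting code itself is not reproduced on this page; a minimal sketch of the elements just described for the therapist series (labels, colors, and transparency values are assumptions) is:
#predicted versus observed plot for the therapist series (sketch)
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(series.Therapist, label='Observed')
ax.plot(predictinfo['mean'], label='Predicted')
ax.plot(forecastinfo['mean'], linestyle='--', label='Forecast')
ax.axvspan(38, 40, color='red', alpha=0.2)                  # final three sessions (train-test zone)
ax.fill_between(forecastinfo.index, forecastinfo['mean_ci_lower'],
                forecastinfo['mean_ci_upper'], color='gray', alpha=0.3)   # 95% CIs of forecasts
ax.set_xlabel('Session')
ax.set_ylabel('Analytical thinking')
ax.legend()
plt.show()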
The blue lines depict the observed time series and the orange lines the
predicted series. The red region marks the final three sessions, or the ‘train-
test zone’ so to speak. The dotted line depicts forecasts after session 40,
and the gray region indicates 95% confidence intervals for each forecast.
Many analysts also calculate the slightly different prediction intervals,
which we will touch upon later. The plots can reveal a lot about predictive
accuracy and model fit. The main difference between the two measures is
that predictive accuracy is evaluated with testing data only (i.e., the dis-
parity between predicted values and testing data values), to judge how
well the model performs on data it has not ‘seen’. Model fit, on the other
hand, is often evaluated with training data only (i.e., the disparity between
predicted values and training data values), to judge how well the trained
model reflects the data it has ‘seen’. It is of course not possible to evaluate
the quality of forecasts until the sessions actually happen.
We can see that the AR(1) model (yt = 16.6586 + 0.6629yt-1 + at) seems to
capture both the ‘shape’ and the magnitude of the therapist series quite well
because the two lines are close, except in the red region, which indicates
poorer predictive accuracy. However, it correctly predicted the general
upward movement of the series in the last three sessions. The MA(1) model
(yt = 42.2339 − 0.6256at-1 + at), on the other hand, seems to capture the
shape better than the magnitude. Nevertheless, just as in previous chapters,
visual assessment should be supported by more objective measures when-
ever possible. One option is the mean absolute error (MAE), which simply
sums the absolute error (predicted – observed, positive values only) for all
For the therapist series, R2 = 0.885 and for the client series R2 = 0.620.
These measures concur with the previous MAPE evaluation of the better
fit of the AR(1) model for the therapist series, although the MA(1) model
for the client series is also acceptable.
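The computations behind these figures are not reproduced in this excerpt; a rough numpy sketch, assuming test_obs and test_pred hold the observed and predicted values of the final three sessions (both hypothetical names), is:
#MAE, MAPE, and R-squared (sketch)
import numpy as np
errors = np.asarray(test_pred) - np.asarray(test_obs)
mae = np.mean(np.abs(errors))                                  # mean absolute error
mape = np.mean(np.abs(errors) / np.asarray(test_obs)) * 100    # mean absolute percentage error
obs = series.Therapist.values
pred = predictinfo['mean'].values
r2 = 1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)   # R-squared over the full series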
Having evaluated predictive accuracy and model fit, the final part of
Step 5 is to perform residual diagnostics. This tends to be neglected when
reporting regression analyses, but it is an important step to evaluate whether
any patterns in the data have been ‘left over’. In our time series context,
it means to check if the candidate model has extracted all juice from the
sugarcane – or transformed the original series into patternless residuals or
‘white noise’ (Figure 5.2). We do this by checking (1) if there is autocor-
relation in the residuals, (2) if the mean of the residuals is near-zero, and
(3) if the residuals are normally distributed. For (1), we treat the residuals
across sessions (predicted – observed values) as a time series itself and eval-
uate its (P)ACF just like in Step 2 of our main process. Absence of spikes
in (P)ACF at all lags implies that there is no more autocorrelational infor-
mation left in the residuals. On the other hand, spikes would suggest that
the residuals are still patterned across time, which needs to be addressed
either by modeling the residual series and adding it to the original model, or by returning to Step 3 to select a different candidate model.
The correlograms clearly show that there are no (P)ACF spikes in the
residuals. This also reflects the result of the Ljung-Box Q test (Q=0.14,
p=0.71) shown at the bottom left of model1’s summary in Figure 5.6,
where H0 = the series is independently distributed and has zero autocor-
relation. The raw plot of the residuals suggests that it fluctuates around
a mean of near-zero, as further confirmed by the histogram. Lastly, the
distribution appears normal, which can be confirmed with a Shapiro-Wilk
test of normality (H0 = the series is normally distributed) using the code
below. The result (W=0.974, p=0.542) confirms that the model1 residuals
are normally distributed.
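The test takes one line with scipy (a sketch, assuming the fitted model object model1 from above):
#Shapiro-Wilk test of normality on the residuals (sketch); H0 = the series is normally distributed
from scipy.stats import shapiro
W, p = shapiro(model1.resid)
print(W, p)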
The final step of interpreting the model means to relate its parameters,
and the structural signature they constitute, to our understanding and/
or qualitative analysis of the (discourse) phenomena in their context(s) of
occurrence. Recall that our two models for therapist and client analytical
thinking are, respectively, yt = 16.6586 + 0.6629yt-1 + at (AR(1)) and yt = 42.2339 − 0.6256at-1 + at (MA(1)).
1. THERAPIST: Okay. Because what you just said is what you can do.
And I haven’t heard the irrational beliefs yet.
2. CLIENT: Okay, here they are. Sorry. Yes, here they are. Okay, so, I
need the surgery to reduce my pain – where are the irrational beliefs?
Oh, no, here it is. I need the surgery to reduce my pain; otherwise, I
can’t go on in life as I should. I’m just going to go through – shall I go
through them?
3. THERAPIST: Quickly, yes.
4. CLIENT: Okay. “Others will not approve of my negative attitude,
and I couldn’t stand that”. “Comparing myself to this guy going into
surgery and telling myself he’s better than me, as he shouldn’t be. And
I can’t stand my mother thinking I’m not strong enough to recover as
fast as he does”.
5. THERAPIST: Well done in identifying the irrational beliefs. When I
say, “Well done,” – I also mean “Well caught.”
6. CLIENT: Well caught, right. I feel angry towards my mom for pushing
me back to work when the surgery hasn’t even happened, and I can’t
stand that feeling.
7. THERAPIST: But perhaps, “My mother shouldn’t have told me—” is
a more helpful way of expressing the irrational belief here.
8. CLIENT: Ah, got you. Okay.
9. THERAPIST: Remember: precision is helpful when identifying the
irrational beliefs.
#compute (P)ACF
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
series = pd.read_csv('data.csv', index_col='t')
acf(series.y, nlags=10)
pacf(series.y, nlags=10)
#plot (P)ACF
plot_acf(series.y, lags=10, alpha=0.05, title='Autocorrelation function')
plot_pacf(series.y, lags=10, alpha=0.05, title='Partial autocorrelation function')
Seasonal decomposition
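The decomposition code itself does not appear in this excerpt; a minimal sketch of the procedure described in the chapter, assuming monthly data starting in 1949, is:
#seasonal decomposition (sketch)
from statsmodels.tsa.seasonal import seasonal_decompose
series = pd.read_csv('data.csv')
series.index = pd.date_range(start='1949-01', periods=len(series), freq='M')   # monthly date time index
seasonal_decompose(series['y']).plot()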
#inspect series
series = pd.read_csv('data.csv', index_col='Session')
series.plot(subplots=True)
series.describe()
import numpy as np
def create_array(col):
    s = []
    for i in col:
        a = [1, i]
        s.append(a)
    return np.array(s)
array = create_array(series.Therapist)
#differencing
series['Therapist_diff'] = series.Therapist.diff(periods=1)
import statsmodels.api as sm
model2 = sm.tsa.SARIMAX(train_series.Client, order=(0,0,1), trend='c').fit()
model2.summary()
forecast=(model1.get_prediction(start=len(series),end=len(series)+3))
forecastinfo=forecast.summary_frame()
References
Allen, M. (2017). Markov analysis. In Mike Allen (Ed.), The SAGE encyclopedia
of communication research methods (pp. 906–909). Sage.
Althoff, T., Clark, K., & Leskovec, J. (2016). Large-scale analysis of counseling
conversations: An application of natural language processing to mental health.
Transactions of the Association for Computational Linguistics, 4, 463–476.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity.
Journal of Econometrics, 31(3), 307–327.
Bowerman, B., & O’Connell, R. (1987). Time series forecasting: Unified concepts
and computer implementation (2nd ed.). Duxbury Press.
Brockwell, P. J., & Davis, R. A. (2016). Introduction to time series and forecasting
(3rd ed.). Springer.
Chatfield, C. (1989). The analysis of time series: An introduction (4th ed.).
Chapman and Hall.
Earnest, A., Chen, M. I., Ng, D., & Leo, Y. S. (2005). Using autoregressive
integrated moving average (ARIMA) models to predict and monitor the number
of beds occupied during a SARS outbreak in a tertiary hospital in Singapore.
BMC Health Services Research, 5, 1–8. https://doi.org/10.1186/1472-6963-5-36
Eubanks, P. (1999). Conceptual metaphor as rhetorical response: A reconsideration
of metaphor. Written Communication, 16(2), 171–199.
Gelo, O. C. G., & Mergenthaler, E. (2003). Psychotherapy and metaphorical
language. Psicoterapia, 27, 53–65.
Hopper, P. J., & Traugott, E. C. (2003). Grammaticalization (2nd ed.). Cambridge
University Press.
Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and
practice. OTexts.
Keating, B., & Wilson, J. H. (2019). Forecasting and predictive analytics (7th ed.).
McGraw-Hill.
Kopp, R. R., & Craw, M. J. (1998). Metaphoric language, metaphoric cognition,
and cognitive therapy. Psychotherapy, 35(3), 306–311.
Lai, D. (2005). Monitoring the SARS epidemic in China: A time series analysis.
Journal of Data Science, 3(3), 279–293. https://doi.org/10.6339/JDS.2005.03(3).229
Levitt, H., Korman, Y., & Angus, L. (2000). A metaphor analysis in treatments
of depression: Metaphor as a marker of change. Counselling Psychology
Quarterly, 13(1), 23–35.
Luborsky, L., Auerbach, A., Chandler, M., Cohen, J., & Bachrach, H. (1971).
Factors influencing the outcome of psychotherapy: A review of quantitative
research. Psychological Bulletin, 75(3), 145–185.
6 Conclusion
DOI: 10.4324/9781003360292-6
As the first two points above describe the efficacy of data analytics and
the third point its exploratory value, it is useful to metaphorically describe
data analytics as both a ‘rifle’ and a ‘spade’ for doing discourse analysis.
Rifles are designed for precise aiming and accuracy when hitting the target.
Likewise, finding appropriate data analytic tools and solutions to investi-
gate discourse analytic constructs like (a)synchrony and identity construc-
tion is like an aiming exercise that locks the relevant questions on target,
with systematic and replicable techniques to ensure that the bullets hit the
target with a high level of reliability. Beyond this conceptualization of data
analytics as a ‘problem solver’, it is also an under-utilized spade for plough-
ing new ground, sowing new seeds, and finding something unexpected,
all of which are very much in the inductive spirit of discourse analytic
research. We witnessed scenarios where by-products of the main analytical
process – from observed differences in the predictive accuracy of different
simulation runs in Chapter 2 to precision versus recall scores in Chapter
4 – pointed towards new directions for research that often lie beyond the
original scope of inquiry. Therefore, while hitting predetermined research
targets using the rifle of data analytics, one should also have their spade in
hand to dig up surprises along the way. Emphasizing either metaphor runs
the risk of obscuring the other, so the most important point to remember is
that data analytics functions as both – and often simultaneously.
It was also emphasized at various points that there is much room for
interested readers to go beyond the present introductory scope and fur-
ther their understanding of data analytics. Useful and relatively affordable
online resources include, as mentioned in the introduction chapter, data-
camp.com and towardsdatascience.com. The following are some suggested
learning directions that are covered by these resources, each of which would
deserve its own monograph. First, while we relied mainly on LIWC scores,
readers can explore a wide range of other quantification schemes for their
language and discourse data. Pretrained language models using Google’s
Word2Vec, for example, can be freely downloaded and used to derive
word and document vectors for one’s own discourse data. It is, of course,
Let us first consider the case of missing data and applicability of Monte
Carlo simulations. All three discussed categories of data missing completely
at random, at random, and not at random are equally conceivable in these
discourse contexts. For example, social media researchers who employ web
scraping and text mining techniques to gather data from platforms like
Facebook and Twitter may face technical, ethical, and legal challenges that
result in missing or incomplete data (Bruns, 2019). Researchers of political
discourse face similar issues as they turn increasingly towards social media
platforms as a primary source of data (Stieglitz & Dang-Xuan, 2013).
Classroom discourse analysis (Rymes, 2016), an important part of educa-
tion discourse research, is yet another instance where data collection may
be compromised by technical and ethical concerns. In all the above cases,
the general Monte Carlo logic of estimating and evaluating a probabilistic
range of outcomes based on available sample data is straightforwardly
applicable given a suitable quantification scheme.
Basic (un)supervised machine learning techniques like clustering and
classification also have many potential applications in these discourse con-
texts, without necessarily involving the enormous quantities of data that
seem to typify machine learning in many people's minds. We have, in fact, not delved too deeply into the issue of sample sizes because there is simply no catch-all answer to a question like ‘how many samples do I need for simulations/clustering/classification/time series analysis?'. There are
well-known pros, cons, and remedies to large and small datasets alike that
are beyond the present introductory scope, and, beyond basic statistical
considerations, sample size determination for data analytic techniques is
highly dependent on the context of analysis (Figueroa et al., 2012; Kokol
et al., 2022). The examples throughout this book have illustrated a practi-
cal, needs-based approach that focuses on the objectives and situation at
hand rather than fixating upon a priori sample size determination pro-
cedures. From this perspective, recall that the main objective of cluster-
ing algorithms is to discern latent groups based on (dis)similarities among
observed features, while that of classification algorithms is to evaluate how
pre-existing groups are predictable from these features. It is known that
questions of identity and group membership are among the most pressing
across a variety of theoretical perspectives and discourse research contexts
(Van De Mieroop, 2015). While many of these eschew reductionistic defi-
nitions of identities as static and clear-cut categories, clustering and clas-
sification approaches would at least present new and/or complementary
methodological possibilities. This was amply demonstrated in Bamman
et al.’s (2014) application of cluster analysis to investigate the relation-
ship between gender, linguistic style, and social networks on Twitter. In
the context of political discourse, one specific interesting application is to
investigate election campaign speeches and test if candidates from the same
Extract 3: WP as incumbent
We have worked hard to earn your trust sometimes under very dif-
ficult circumstances. For those who feel that we have not met your
expectations, we seek your understanding and promise to do bet-
ter. Now, more than ever, your vote is essential to chart the kind of
political system Singapore should have. Make your vote count. Vote
for The Workers’ Party. Voters of Aljunied GRC, the PAP keeps say-
ing there’s no need to vote for the opposition as the NCMP scheme
ensures your voice in Parliament. Don’t be swayed by this argument.
Parliament is not just a talk shop where MPs make speeches. It exists
to make laws which are voted on by MPs. The PAP will feel safe
as long as their two-thirds majority is not threatened. But once the
opposition gains more seats, they will be forced to consult you, and
you will have a more responsive government.
As for the time series analysis of language and discourse data, ARIMA
models likewise offer underexplored possibilities for media, political,
and educational discourse research across a range of phenomena and
time scales (Tay, 2019). Traditional print media like newspapers tend to
be published daily and the language therein collectively analyzed across
longer intervals like years and beyond (Partington, 2010). However, lin-
guistic and discursive elements in contemporary social media may unfold
across much shorter and underexplored time intervals in the order of sec-
onds, especially in highly interactional contexts like live gaming streams
(Recktenwald, 2017). The modelability and structural signatures underly-
ing linguistic/discursive elements in such exchanges, against the backdrop
of spontaneous real-time activity like live streams, opens up a whole new
area of investigation that are aligned with the key assumptions of ARIMA
time series models. Other forms of social media that tend to unfold across
more ‘familiar’ time intervals, such as daily or weekly video uploads on
YouTube, can also be analyzed in similar ways (Tay, 2021b). Likewise,
many forms of political discourse are inherently time-sensitive, ranging
from the above examples of daily electoral campaign speeches to regular
communications in response to real-world events. Tay (2021a), for exam-
ple, compared the language of daily press conferences by the World Health
Organization and the Chinese Ministry of Foreign Affairs in response to
the initial phases of COVID-19, interpreting the modelability of various
series from an ideological perspective. Various types of annual speeches in
different political settings are also ripe for time series analysis, with findings
relatable to larger scale background observations of policy, governance,
and so on (Liu & Tay, 2023; Zeng et al., 2020, 2021). Educational dis-
course likewise assumes many time-sensitive forms like teacher talk and/or
teacher-student interaction that could be modeled across multiple sessions,
or even identified segments within a session. The COVID-19 pandemic,
still impacting some parts of the world at the time of writing, forced many
ill-prepared teachers and students to transition between face-to-face, online,
and hybrid modes of teaching and learning (Mishra et al., 2020). This
raises interesting questions not only about general differences in teacher
talk, student talk, and/or teacher-student interaction between these modes,
but also specific structural patterns across sessions that may underlie each
scenario. The extent to which the implicit structural signatures may relate
to explicit pedagogical strategies employed to cope with the pandemic may
be of particular interest.
and regression analysis has, in fact, been seen in recent work across fields
ranging from education to computer science (Budisteanu & Mocanu,
2021; El Aissaoui et al., 2019). It could also be considered a form of fea-
ture engineering since we are exploring and extracting new features from
raw data with the ultimate objective of creating better models to explain
phenomena of interest.
References
Bamman, D., Eisenstein, J., & Schnoebelen, T. (2014). Gender identity and lexical
variation in social media. Journal of Sociolinguistics, 18(2), 135–160.
Bruns, A. (2019). After the ‘APIcalypse’: Social media platforms and their fight
against critical scholarly research. Information, Communication and Society,
22(11), 1544–1566.
Budisteanu, E.-A., & Mocanu, I. G. (2021). Combining supervised and unsupervised
learning algorithms for human activity recognition. Sensors, 21(18), 6309.
Csomay, E. (2007). A corpus-based look at linguistic variation in classroom
interaction: Teacher talk versus student talk in American University classes.
Journal of English for Academic Purposes, 6(4), 336–355.
El Aissaoui, O., El Alami El Madani, Y., Oughdir, L., & El Allioui, Y. (2019).
Combining supervised and unsupervised machine learning algorithms to predict
the learners’ learning styles. Procedia Computer Science, 148, 87–96.
Figueroa, R. L., Zeng-Treitler, Q., Kandula, S., & Ngo, L. H. (2012). Predicting
sample size required for classification performance. BMC Medical Informatics
and Decision Making, 12(1), 8.
Kokol, P., Kokol, M., & Zagoranski, S. (2022). Machine learning on small
size samples: A synthetic knowledge synthesis. Science Progress, 105(1),
00368504211029777. https://doi.org/10.1177/00368504211029777
Liu, Y., & Tay, D. (2023). Modelability of WAR metaphors across time in cross-
national COVID-19 news translation: An insight into ideology manipulation.
Lingua, 286, 103490. https://doi.org/10.1016/j.lingua.2023.103490
Mishra, L., Gupta, T., & Shree, A. (2020). Online teaching-learning in higher
education during lockdown period of COVID-19 pandemic. International
Journal of Educational Research Open, 1, 100012. https://doi.org/10.1016/j.ijedro.2020.100012
Partington, A. (2010). Modern Diachronic Corpus-Assisted Discourse Studies
(MD-CADS) on UK newspapers: An overview of the project. Corpora, 5(2),
83–108.
Recktenwald, D. (2017). Toward a transcription and analysis of live streaming on
Twitch. Journal of Pragmatics, 115, 68–81. https://doi.org/10.1016/j.pragma.2017.01.013
Rymes, B. (Ed.). (2016). Classroom discourse analysis: A tool for critical reflection
(2nd ed.). Routledge.
Stieglitz, S., & Dang-Xuan, L. (2013). Social media and political communication:
A social media analytics framework. Social Network Analysis and Mining, 3(4),
1277–1291.
Tay, D. (2019). Time series analysis of discourse: Method and case studies. Routledge.
Tay, D. (2021a). COVID-19 press conferences across time: World Health Organization vs. Chinese Ministry of Foreign Affairs. In R. Breeze, K. Kondo, A. Musolff, & S. Vilar-Lluch (Eds.), Pandemic and crisis discourse: Communicating COVID-19 (pp. 13–30). Bloomsbury.
Tay, D. (2021b). Modelability across time as a signature of identity construction on YouTube. Journal of Pragmatics, 182, 1–15.
Van De Mieroop, D. (2015). Social identity theory and the discursive analysis of collective identities in narratives. In A. De Fina & A. Georgakopoulou (Eds.), The handbook of narrative analysis (pp. 408–428). John Wiley & Sons.
Zeng, H., Burgers, C., & Ahrens, K. (2021). Framing metaphor use over time: ‘Free Economy’ metaphors in Hong Kong political discourse (1997–2017). Lingua, 252, 102955. https://doi.org/10.1016/j.lingua.2020.102955
Zeng, H., Tay, D., & Ahrens, K. (2020). A multifactorial analysis of metaphors in political discourse: Gendered influence in Hong Kong political speeches. Metaphor and the Social World, 10(1), 141–168.
Index
accuracy 37, 40–42, 88–89, 115–116, 118–119, 121–122, 124–125, 166; average 116–117; averaged measure of 42; computing 116, 124; evaluating 42, 83; global 120–121; highest 115; measures 115–116, 118–119, 121–122, 124; overall 42, 89, 115–117, 124; percentage 88, 101; predictive 87–88, 108, 114, 121, 137, 146, 148–152, 154, 161, 166; relative 52; score 119; test 115–118, 124; train 115–117, 124; varying 56; verifying 41
agglomerative hierarchical clustering (AHC) 68–71, 73, 98, 165
Akaike Information Criterion (AIC) 140, 152, 160
Althoff, T. 136
American Psychological Association (APA) 11
analytic: algorithms 173; applications 11; approach 29, 34, 90–91, 158; conceptualization 173; constructs 4–6, 166; context 73, 105, 158; decisions 2; dimensions 4, 166; example 66; features 126; import 141; interest 7; interpretation 141; literature 158; logic 128; modelling 73; models 167; outcomes 83, 166; perspectives 57; phase 98; point of departure 173; process 19; program 17; purposes 20, 141; questions 3, 173; repertoire 173; research 7, 166; scenarios 66; situations 154; solution 28–29, 31–33, 58; subtypes 4; tasks 117, 172; techniques 1–2, 5, 9, 20–22, 87, 89, 97, 138, 159, 165, 167–168, 171; theory 11; thinking 19, 50–52, 155; tools 3, 22, 166; value 33; visualization 170; wisdom 128; work 7
analytical: language levels 6; methods 11; options 165, 167; processes 6, 166; purposes 10, 16, 71; scenarios 6; techniques 173; thinking 5–6, 17, 18, 40–41, 44, 48, 56–57, 63, 85, 93–94, 96, 138–139, 142, 149, 155–157, 159, 161–162; tone 157; units 172
analyticity 5, 9
ANOVA 38, 109, 170
approach: additive 132; alternative 2; comparative 97; computational linguistic 8; critical 4; descriptive 4, 7; flexible 10; historical 4; logistic regression 118; looping 108; multiplicative 132; needs-based 168; non-parametric 170; numerical 29, 31; plausible 3; psychotherapy 55; splitting 114; synchrony 98; therapeutic 40, 97; therapy 8, 57, 76, 87, 97, 109; train-test 37, 41, 73, 83, 89, 145; univariate modeling 157; see also analytic
association rules mining 167
Augmented Dickey-Fuller (ADF) test 139–141, 160
authenticity 9, 17–19, 40–41, 44, 48, 50, 52, 56, 63, 93–94, 96
autocorrelation 127–135, 142, 144, 146, 152, 154, 156, 160
autoregressive 135; Integrated Moving Average (ARIMA) 10–11, 127, 129, 131; model 6, 132, 135, 143
average 40, 71, 116, 121–122, 134; accuracy 116–117; distance 72; macro 121; measure of accuracy 42; measure of recall 120; moving 135, 143; number of sessions 8; precision 121; silhouette score 72; theoretical 33; value 33; weighted 121
Bamman, D. 168
Bayesian Information Criterion (BIC) 152
behavior 6, 10, 127; contrasting 131, 143; cooperative 73; human 3; linguistic 73, 137, 156; modifying 73; paralinguistic 98; pattern 144; relative 143; responses 73; seasonal 142
bias-variance tradeoff 108, 115–117, 124
Bokeh 167
Box-Jenkins method 127–128, 132, 134, 136–138, 141, 157
Breusch-Pagan test 140, 160
central limit theorem 33–36, 48, 54
classification 1, 66, 105, 107, 117, 126, 167–168, 170, 172; algorithm 10, 105, 168; approaches 168; automatic 2; hidden 83; models 121, 172; non-linguistic 105; report 120, 121, 122, 125; tasks 105–106; techniques 10, 105–106, 109, 170; text 17
classifiers 106; linear 106; probabilistic 106
clout 9, 17–19, 40–41, 44–45, 47–48, 50–52, 56, 61–63, 85, 93–94, 96–97
cluster 67–73, 76, 79–80, 82–83, 85, 89, 101, 107, 113; allocations 83; analysis 9, 11, 66–68, 75, 83, 90–91, 94, 97, 105, 156, 159, 168, 170; application of 97; centres 81–82, 86; centroids 71, 73, 81–83, 85, 100–101; higher-level 67; label 81–83, 85, 87–89, 100, 107; membership 81–83, 92, 98
clustering 1, 10, 67–68, 72, 76, 87, 98, 106, 126, 168, 170, 172; agglomerative hierarchical (AHC) 67–69, 98; algorithm 9, 67, 165, 168; dataset 89; hierarchical 67; k-means 71, 72, 73, 75–76, 79–81, 83, 84, 87–88, 90, 99, 105–109, 165, 170, 172; models 83, 89, 172; non-hierarchical 68; outcomes 68, 89, 105; purposes of 71; solution 67–68, 70–73, 75, 79, 81–83, 85, 87–89, 97, 101; tasks 167; techniques 170; transcripts 109; usefulness of 69
cognitive-behavioral therapy (CBT) 76, 77–78, 80, 83, 87, 89–92, 97–98, 109, 110–111, 118–120, 121, 125, 138, 155–157
Communication Accommodation Theory 74
confusion matrix 88–89, 101, 118–119, 125
cophenetic correlation coefficient 70–71, 99
correlograms 131, 142, 143, 153–154
cosine distance 16
CountVectorizer 15
covariance structure 43
COVID-19 5, 68–69, 70, 72, 99, 105, 107, 110, 126, 169, 171
cycles 132, 134, 138, 158
data: analytics 1–4, 6, 8–9, 11–13, 17, 20, 98, 120, 165–167, 172–173; -frame 22, 35, 45, 47, 50, 60–63, 80–82, 100, 114, 131–132, 138, 172; generating process 11, 28, 37, 41, 127, 131, 138; long 21; narrow 21; richness 3, 173; visualization 2, 21, 110, 167; wide 21
decision trees 106
Demiray, Ç. K. 109
differences 9, 50, 74, 87, 148, 170; detailed 21; discursive 170; first 140; general 171; large 2, 96; linguistic 165; major principled 106; noteworthy 170; observed 166; philosophical 91; significant 1, 13, 52, 57
differencing 140–143, 145, 161
discourse analysis 3–4, 7, 11, 97, 137, 158, 166, 168, 170, 173
disorder: anxiety 76; depressive 8
distortion 71–72, 79
document-term matrix 9, 14
dot product 16
… 172; clustering 83, 88–89, 100; complex 108, 115, 152; first-order autoregressive 6; fit 11–12, 80–81, 99–100, 114–115, 118, 122, 124, 128–129, 135–137, 145–149, 152, 154, 161; language 13, 120, 166; linear 101, 129, 134, 170; logistic regression 87–88; machine 66; multilevel 110; multiplicative 132; parameters 146, 148, 154; precision 146; predictive 114, 148; regression 127–128, 130, 134, 145; relationships 105; results 73; selection 43, 60, 144; simplified 108; statistical 2, 11, 127; summary 152; theoretical 90; therapy 7; time series 134, 156, 158–159, 171; validation 11–12, 37, 40, 52, 73, 75, 79, 83, 89, 114, 118, 120–121, 124, 145
Monte Carlo simulations (MCS) 8, 10, 27–29, 32–34, 36–37, 39–42, 46, 48, 50, 52–57, 60, 114, 120, 146, 165, 168
multivariate normal sampling 43
naïve bayes 106
natural language processing (NLP) 13, 15
Nikkie Tutorials 5
normalization 68
numerical: approaches 29; range 116; representation 13; simulation 31–33; solution 28, 31, 33, 58; time series 127; vectors 13
objectivity 3, 173
optimal: allocation 71; balance 108; courses of action 3; number of clusters 71–73, 79–81, 99–100, 108; outcome 68; parameters 117; value 68, 72, 108, 114, 117, 124
optimization procedure 67, 106
out-of-sample 12
overfit 108
overfitting 117, 151, 167
overfitting/underfitting 167
paralinguistic 98
Pittenger, R. E. 7
Plotly 167
precision 46, 119–122, 146, 156–157, 166, 174
predict 4, 6, 10, 66, 81, 83, 87–89, 105–107, 118, 125, 127–128, 137, 146, 148–149, 151, 161–162, 172
predictinfo 148–149, 161–162
prediction 8, 129; error 89, 129, 134; intervals 149; outcomes 41–42
principal components analysis (PCA) 15, 72, 82, 100, 113, 123
Python code 12, 20, 28, 58, 73, 98, 123, 159, 165
Qiu, H. 109
quantification 13, 75, 98; of language 13, 17; scheme 13, 19, 39, 54–55, 76, 94, 97, 166–168
random: completely at 38–39, 52, 57, 168; not at 38–39, 168; numbers 29–30, 34, 46, 62, 115, 126; operations 28; outcomes 30; samples 35–37; values 37, 43, 45, 61, 126, 159; variables 3, 33, 45–46, 61, 126
range 22, 28–29, 31, 33–35, 45–46, 48, 58–62, 66, 79, 83, 99–100, 115–116, 128, 132, 137, 140, 160, 165–166, 168, 171
realization 138
regression 1–2, 11, 66, 87–89, 105–107, 114, 118, 127–130, 134, 144–145, 152, 167, 170, 172–173
regularization 167
Reisigl, M. 4–5
relativized 85
residuals 128–129, 132, 134, 143, 152–154, 157, 162–163
root mean squared error (RMSE) 151
sample: analyses 97, 157; autocorrelation 131; data 1, 108, 168; dataset 68; dyad 90–91; findings 98; independent 50; instantiation 131; means 35–36, 155; random 35–37, 48; series 138; sizes 38, 54, 121, 131, 138, 168
sampling frequencies 127
Sarkar, D. 15
SARS epidemic 126
scalability 3, 173
scale-dependent measures 151
scale-independent measures 151
Scheflen, A. E. 7