
Data Analytics for Discourse Analysis with Python

This concise volume, using examples of psychotherapy talk, showcases the potential applications of data analytics for advancing discourse research and other related disciplines.
The book provides a brief primer on data analytics, defined as the
science of analyzing raw data to reveal new insights and support decision
making. Although data analytics is currently underutilized in discourse research, Tay draws on the case of psychotherapy talk, in which clients’ concerns are worked through via verbal interaction with therapists, to demonstrate how it can address both practical and theoretical concerns. Each chapter
follows a consistent structure, offering a streamlined walkthrough of a key
technique, an example case study, and annotated Python code. The volume
shows how techniques such as simulations, classification, clustering, and
time series analysis can address such issues as incomplete data transcripts,
therapist–client (a)synchrony, and client prognosis, offering inspiration
for research, training, and practitioner self-reflection in psychotherapy and
other discourse contexts.
This volume is a valuable resource for discourse and linguistics
researchers, particularly for those interested in complementary approaches
to qualitative methods, as well as active practitioners.

Dennis Tay is Professor at the Department of English and Communication, the Hong Kong Polytechnic University. He is Co-Editor-in-Chief of
Metaphor and the Social World, Associate Editor of Metaphor and
Symbol, Academic Editor of PLOS One, and Review Editor of Cognitive
Linguistic Studies. His recent Routledge publication is Time Series Analysis
of Discourse: Method and Case Studies (2020).
Routledge Studies in Linguistics

38 Researching Metaphors
Towards a Comprehensive Account
Edited by Michele Prandi and Micaela Rossi
39 The Referential Mechanism of Proper Names
Cross-cultural Investigations into Referential Intuitions
Jincai Li
40 Discourse Particles in Asian Languages Volume I
East Asia
Edited by Elin McCready and Hiroki Nomoto
41 Discourse Particles in Asian Languages Volume II
Southeast Asia
Edited by Hiroki Nomoto and Elin McCready
42 The Present in Linguistic Expressions of Temporality
Case Studies from Australian English and Indigenous Australian languages
Marie-Eve Ritz
43 Theorizing and Applying Systemic Functional Linguistics
Developments by Christian M.I.M. Matthiessen
Edited by Bo Wang and Yuanyi Ma
44 Coordination and the Strong Minimalist Thesis
Stefanie Bode
45 Data Analytics for Discourse Analysis with Python
The Case of Therapy Talk
Dennis Tay

For more information about this series, please visit: https://www.routledge.com/Routledge-Studies-in-Linguistics/book-series/SE0719
Data Analytics for Discourse
Analysis with Python
The Case of Therapy Talk

Dennis Tay
First published 2024
by Routledge
605 Third Avenue, New York, NY 10158
and by Routledge
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2024 Dennis Tay
The right of Dennis Tay to be identified as author of this work has been
asserted in accordance with sections 77 and 78 of the Copyright, Designs and
Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or
utilised in any form or by any electronic, mechanical, or other means, now
known or hereafter invented, including photocopying and recording, or in
any information storage or retrieval system, without permission in writing
from the publishers.
Trademark notice: Product or corporate names may be trademarks or
registered trademarks, and are used only for identification and explanation
without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Names: Tay, Dennis, author.
Title: Data analytics for discourse analysis with Python: the case of
therapy talk / Dennis Tay.
Description: New York, NY: Routledge, 2024. | Series: Routledge studies in
linguistics | Includes bibliographical references and index. |
Identifiers: LCCN 2023047858 | ISBN 9781032419015 (hardback) |
ISBN 9781032419022 (paperback) | ISBN 9781003360292 (ebook)
Subjects: LCSH: Discourse analysis–Data processing. | Python (Computer
program language) | Psychotherapy–Language–Case studies.
Classification: LCC P302.3 .T39 2024 | DDC 401.410285–dc23/
eng/20231226
LC record available at https://lccn.loc.gov/2023047858
ISBN: 9781032419015 (hbk)
ISBN: 9781032419022 (pbk)
ISBN: 9781003360292 (ebk)
DOI: 10.4324/9781003360292
Typeset in Sabon
by Deanta Global Publishing Services, Chennai, India
Contents

List of Figures
List of Tables

1 Introduction
   Defining data analytics
   Data analytics for discourse analysis
   The case of psychotherapy talk
   Outline of the book
   Quantifying language and implementing data analytics
   Quantification of language: Word embedding
   Quantification of language: LIWC scores
   Introduction to Python and basic operations

2 Monte Carlo simulations
   Introduction to MCS: Bombs, birthdays, and casinos
   The birthday problem
   Spinning the casino roulette
   Case study: Simulating missing or incomplete transcripts
   Step 1: Data and LIWC scoring
   Step 2: Simulation runs with a train-test approach
   Step 3: Analysis and validation of aggregated outcomes
   Python code used in this chapter

3 Cluster analysis
   Introduction to cluster analysis: Creating groups for objects
   Agglomerative hierarchical clustering (AHC)
   k-means clustering
   Case study: Measuring linguistic (a)synchrony between therapists and clients
   Step 1: Data and LIWC scoring
   Step 2: k-means clustering and model validation
   Step 3: Qualitative analysis in context
   Python code used in this chapter

4 Classification
   Introduction to classification: Predicting groups from objects
   Case study: Predicting therapy types from therapist-client language
   Step 1: Data and LIWC scoring
   Step 2: k-NN and model validation
   Python code used in this chapter

5 Time series analysis
   Introduction to time series analysis: Squeezing juice from sugarcane
   Structure and components of time series data
   Time series models as structural signatures
   Case study: Modeling and forecasting psychotherapy language across sessions
   Step 1: Inspect series
   Step 2: Compute (P)ACF
   Step 3: Identify candidate models
   Step 4: Fit model and estimate parameters
   Step 5: Evaluate predictive accuracy, model fit, and residual diagnostics
   Step 6: Interpret model in context
   Python code used in this chapter

6 Conclusion
   Data analytics as a rifle and a spade
   Applications in other discourse contexts
   Combining data analytic techniques in a project
   Final words: Invigorate, collaborate, and empower

Index
Figures

1.1 Applying time series analytics to the construction of expertise on YouTube
1.2 Documents in a two-dimensional space
1.3 Wide vs. long data
2.1 Outcome of 5,000 birthday paradox simulations
2.2 Distribution of winnings from roulette simulations
2.3 Results of simulation run A
2.4 Results of simulation run B
2.5 Results of simulation run C
2.6 Results of simulation run D
2.7 Results of simulation run E
3.1 Outcome of AHC on COVID-19 data
3.2 Outcome of k-means clustering on COVID-19 data
3.3 Elbow plots for the three dyads
3.4 Outcome of k-means clustering on the three dyads
3.5 Cluster centers of the three dyads
4.1 Predicting group labels with k-NN
4.2 Scatterplot of distribution of transcripts across therapy types
4.3 Plot of accuracy for different k values
5.1 A standard regression and residuals plot
5.2 Components of time series data
5.3 Prototypical ARIMA models as structural signatures
5.4 Examples of (non-)stationary series
5.5 (P)ACF correlograms
5.6 Output of time series model fitting
5.7 Predicted vs. observed plots for both models
5.8 Residual diagnostic plots


Tables

1.1 A simple document-term matrix for three short reviews
1.2 Reduced tf-idf scores for ten documents
1.3 Summary variables and defining lexical categories
2.1 Training data properties
2.2 Summary of the five simulation runs
3.1 Summary variable scores for dyads A–C
3.2 Synchrony measures of the three dyads
4.1 Summary variable scores for transcripts across three therapy types
4.2 Classification report for testing data
5.1 Autocorrelation at lags 1 and 2
5.2 Basic guidelines for model selection


1 Introduction

Defining data analytics


A popular definition of data analytics is the processing and analysis of data
to extract information for enhancing knowledge and decision making. The
financial context is often used as a key example where increasingly large
volumes of data are used to support better strategic planning, improve
operational efficiency, enhance customer experience, drive innovation, and
so on. The rapid growth of university degree programs, virtual learning
platforms, consultancy services, and demand for data analysts testifies to
its importance in today’s economy. Many institutions send the message
that data analytics is fundamentally interdisciplinary, and they extend
course offerings to students outside the STEM (science, technology, engi-
neering, mathematics) and business fields.
Mathematician and entrepreneur Clive Humby famously proclaimed
data as ‘the new oil’ near the turn of the 21st century, which may have
helped to elevate the prestige and perceived novelty of data analytics in
the eyes of the general public. However, core data analytic techniques like
simulation, regression, classification, and clustering have long been used
across various fields of academic research, although on a smaller scale than
present. Some might therefore consider the data analytics movement as a
passing fad or a mere rebranding and upscaling of traditional statistical
methods. Nevertheless, important key traits distinguish the two. First, data
analytics is a more holistic and contextual process that spans across the
trajectory from problem formulation to data collection, hypothesis testing,
and interpretation. Many researchers might be inclined to narrowly equate
statistical methods to the hypothesis testing phase where categorical deci-
sions about ‘significant’ differences or relationships are made based on
abstract test statistics derived from technical operations on sample data.
However, just as educators have advocated a broader view of statistical
analysis as a contextually driven methodological framework (MacKay &
Oldford, 2000), data analytic techniques aim to make sense of data in the

context of the issues they represent. These include summarizing and visual-
izing data in various ways, comparing alternative approaches to connect
the data with the research problem, and allowing contextual knowledge
to shape analytic decisions. The role of contextual or domain knowledge
(Conway, 2010) is the second key trait of data analytics. Though under-
pinned by a common core of statistical concepts, techniques are adaptable
for specific questions in different domains like business (Chen et al., 2012)
and health care (Raghupathi & Raghupathi, 2014). For example, the same
statistical models used to forecast stock prices can be used to forecast
demand for hospital beds (Earnest et al., 2005), but model assumptions
and the forecasting horizon of interest clearly differ.
Evans and Lindner (2012) discuss four subtypes of data analytics. In
increasing order of complexity and value-addedness, these are descriptive
analytics, diagnostic analytics, predictive analytics, and, finally, prescrip-
tive analytics. The four subtypes were originally conceived in business
contexts where ‘value’ has a concrete financial sense, but other research-
ers can understand them as progressive phases of inquiry into their data.
Descriptive analytics is roughly synonymous with the traditional notion of
descriptive statistics, where the objective is a broad overview of the data to
prepare for later analysis. Simple examples include visualizing the central
tendency and distribution of data with boxplots, histograms, or pie charts.
The increasing sophistication and importance of visual aesthetics and user
interactivity, however, have made data visualization a distinct field in itself
(Bederson & Shneiderman, 2003). The next phase of diagnostic analytics
involves discovering relationships in the data using statistical techniques.
It is more complex and ‘valuable’ than descriptive analytics because the
connections between different aspects of data help us infer potential causes
underlying observed effects, addressing the why behind the what. In an
applied linguistics context, for example, large differences in scores between
student groups discovered in the descriptive analytics phase might moti-
vate a correlational study with student demographics to diagnose potential
sociocultural factors that explain this difference. We will later see how
diagnostic analytics can also offer practical solutions like inferring missing
information from incomplete datasets or assigning data to theoretically
meaningful groups by their characteristics.
If diagnostic analytics is about revealing past outcomes, the next phase
of predictive analytics is aimed at predicting yet-to-be-observed, or future,
outcomes. This means making informed guesses about the future based
on present and historical data points using core techniques like regression
and time series analysis. It is clear why predictive analytics constitutes a
quantum leap in value for businesses that are inherently forward looking.
The same applies for objectives like automatic classification of new texts or
forecasting of language assessment scores. As ever-increasing volumes of
data become available, contemporary diagnostic and predictive analytics
capitalize on machine learning and artificial intelligence to quickly identify
patterns, build models, and make decisions with minimal human inter-
vention. Lastly, prescriptive analytics bridges the gap between knowing
and doing by translating the above insights into optimal courses of action.
These actions are often split-second responses to ever-changing informa-
tion and conditions, such as pricing airline tickets and self-driving cars in
real time. Linguistics research is unlikely to involve this level of challenge
in the near future. However, prescriptive analytics is still related to the
notion of applied research, with the growing need to demonstrate how our
findings improve personal and social lives.

Data analytics for discourse analysis


Discourse analysis examines language and communication in daily life,
and it aims to offer insights into human behavior, culture, and society.
Since discourse phenomena are often spontaneous and could be regarded
as quantifiable random variables, data analytics appears to be a plausible
approach at first glance. The following points are routinely raised in any
programmatic argument for applying data analytics to discourse analysis.

• Objectivity: data analytics uses reproducible algorithms and quantitative methods to minimize subjective interpretation and ‘cherry-picking’ data
to support preconceived conclusions (Baker & Levon, 2015; Breeze,
2011).
• Scalability: data analytics handles large amounts of data with minimal
cost, helping us uncover otherwise hidden patterns in discourse (Tay,
2017a).
• Data richness: data analytics taps into the exponentially growing gold
mine of digital data like social media and online forums (Han et al.,
2000), providing a rich and representative picture of contemporary
discourse.
• Novelty: data analytics introduces a suite of underexplored techniques
for studying discourse across dimensions of key theoretical interest like
time, space, and human networks (Stieglitz & Dang-Xuan, 2013; Tay,
2019).
• Flexibility: data analytic tools, as alluded to above, are not a set of
invariant ‘tests’ and procedures, and they can be customized to spe-
cific discourse analytic questions and objectives in context (Tay & Pan,
2022).
• Interdisciplinarity: data analytics is receptive towards interdisciplinary
ideas and assumptions, enabling more comprehensive insights into
discourse, on the one hand, and richer educational experiences for stu-
dents, on the other (Asamoah et al., 2015).

Nevertheless, discourse analysts have not always welcomed data analytics with open arms. This seems to be true for descriptive and critical
approaches alike (Gee, 2011; Tannen et al., 2015), which focus, respec-
tively, on the content/structure/function(s) of language in use and broader
social forces that influence linguistic and other communicative practices.
A basic tenet shared by these approaches is that discourse analysis focuses
on how language is actually used, and its effects and implications, rather
than predicting or prescribing how it should be used. This implies that two of
our four data analytic subtypes – predictive and prescriptive analytics – are
fundamentally misdirected. There is also the familiar and serious argument
that statistical analysis reduces complex humanistic phenomena to ‘mean-
ingless’ numbers or that statistical analysis is itself built upon subjective
value judgments (Taylor, 2013).
Instead of rehearsing the above debates, let us consider two more spe-
cific arguments for harnessing data analytics for discourse analysis. The
first is that many discourse analytic constructs and objectives, which are
often elaborated in broad but somewhat vague terms, have underexplored
data analytic dimensions that should be explored in the spirit of critical
inquiry (Tay, 2019). Doing so neither problematizes these constructs/
objectives nor demands a fundamental change in how discourse analy-
sis is done. Instead, it aligns with values like complementarity and multi-
disciplinarity that discourse analysts often promote. A good example is
the ‘discourse-historical approach’ that is situated within critical discourse
studies (Reisigl & Wodak, 2001). Reisigl (2017, p. 53) emphasizes the
historical context of discursive practices as a key feature of this approach,
and he outlines several ways for analysts to connect discourse with history.
One of these states that

A diachronic series or sequence of thematically or/and functionally connected discourse fragments or utterances is taken as a starting point, and their historical interrelationships are reconstructed within a specified period. This way, specific discourse elements can be related to each other within a particular period of the past, e.g., a period of
some months, years, decades, etc.
(Reisigl, 2017, p. 53)

On a broad conceptual level, it certainly makes sense to assume that consecutive observations of physical/social phenomena are related to one another
over a specified time period. Past events are often one of the best predictors
of present happenings and, by extension, forecasters of the future. This logic
is foundational to predictive analytic techniques like time series analysis,
which offer explicit and replicable methods to model the present as a func-
tion of past observations. Notwithstanding what discourse-historical ana-
lysts actually do to establish links between specific discourse elements over
time, it is worthwhile to explore the extent to which predictive analytic tech-
niques can support this task. This has been demonstrated in a series of recent
work in different contexts like the use of metaphors across psychotherapy
sessions (Tay, 2017b), COVID-19 press conferences (Tay, 2021b), newspa-
per protest editorials (Tay, 2021a), and the discursive construction of exper-
tise on YouTube (Tay, 2021c). Figure 1.1 illustrates the YouTube study
where the language of 109 consecutive makeup tutorial videos spanning
nine years from the popular NikkieTutorials channel were analyzed for the
gradual construction of an ‘amateur expert’ identity (Abidin, 2018; Bhatia,
2018). It exemplifies Reisigl’s (2017) call to analyze a ‘diachronic series …
of thematically or/and functionally connected … utterances’ in servicing or
otherwise underpinning identity as a discourse analytic construct. Using the
text analysis programme LIWC (Linguistic Inquiry and Word Count) (Boyd
et al., 2022; Pennebaker et al., 2015), each video was scored on the extent
to which the language reflects ‘analytical thinking’, i.e., formal, logical, and
hierarchical versus informal, here-and-now, and narrative thinking.
The blue line tracks the measured analytical thinking values across all
109 sessions that comprise NikkieTutorials up to that point. The erratic
pattern motivates the application of time series analytic techniques to
account for how the linguistic display of analyticity has evolved over nine
years, supporting what would otherwise be a difficult task for qualita-
tive methods alone. The orange line depicts the resulting time series model

Figure 1.1 Applying time series analytics to the construction of expertise on YouTube
known as an AR(1) model, i.e., first-order autoregressive model, which is best able to predict the actual values with minimal error. If deemed
appropriate by the objectives at hand, this model could be further used
to forecast analytical thinking scores in future hypothetical videos. The
dotted line at the end depicts these forecasted values three videos ahead.
The model offers the critical insight that analytical language levels exhibit
a ‘short-term momentum’ with frequent video-to-video increases or
decreases followed by a sudden movement towards the opposite direction.
This, in turn, provides an entry point for qualitative analysis of the tran-
scripts to discover if the sustained intermittent switch from formal/logical/
hierarchical to simpler language is a strategy of identity construction. We
will return to time series analysis later in the book.
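
To make this concrete, the minimal sketch below fits an AR(1) model, i.e., ARIMA(1,0,0), to a simulated series of analytical thinking scores and forecasts three values ahead with the statsmodels library; the simulated scores are illustrative assumptions, not the NikkieTutorials data.

#a minimal sketch: fitting an AR(1) model and forecasting three values ahead
#the 'analytic' series is simulated for illustration, not the NikkieTutorials data
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

#simulate 109 hypothetical analytical thinking scores with short-term momentum
rng = np.random.default_rng(1)
scores = [50.0]
for _ in range(108):
    scores.append(0.6 * scores[-1] + 0.4 * 50 + rng.normal(0, 5))
analytic = pd.Series(scores)

#fit a first-order autoregressive model, i.e., ARIMA(1,0,0)
model = ARIMA(analytic, order=(1, 0, 0)).fit()
print(model.summary())

#forecast analytical thinking scores three hypothetical videos ahead
print(model.forecast(steps=3))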

The case of psychotherapy talk


The second argument for data analytics is that the specific features of
different data analytic techniques are often well suited to address corre-
spondingly specific issues in an area of discourse research and/or practice.
With some critical thinking and interpretation, many of the planned and
unplanned analytical processes involved can also lead to potential hypoth-
eses for further pursuit. These points are related to the aforementioned
role of domain knowledge (Conway, 2010) as a key trait of contemporary
data analytics. We could say that specialized domain knowledge helps to
motivate as well as constrain choices made on the application of particu-
lar data analytic techniques. It is the main objective of this book to dem-
onstrate this argument using psychotherapy talk as a case study context.
Nevertheless, it should already be made clear that the analytical scenarios
and issues to be discussed also exist in varying forms in other comparable
discourse contexts, including but not limited to media, politics, and educa-
tion. This provides an opportunity for readers working in these areas to
think about how to apply the techniques for their own purposes. While the
concluding chapter will touch upon some aspects of this cross-fertilization,
readers are encouraged to make relevant connections for themselves as
they make their way through this book.
Psychotherapy, nicknamed the ‘talking cure’, is defined as

the informed and intentional application of clinical methods and interpersonal stances derived from established psychological princi-
ples for the purpose of assisting people to modify their behaviors,
cognitions, emotions, and/or other personal characteristics in direc-
tions that the participants deem desirable.
(Norcross, 1990, pp. 218–220)
Although psychotherapy is a mental health-care practice built foremost
upon psychological principles, its verbal interactive nature has attracted
the attention of discourse analysts and sociolinguists, who see it as a
prime example of language being constitutive of social reality. Therapists
have likewise affirmed the potential clinical relevance of discourse ana-
lytic research. Spong (2010), for instance, outlines five ways it can benefit
therapy practice ranging from a deeper understanding of client issues and
therapy models to critical reflection on how these are shaped by broader
social structures. Pioneering discourse analytic work includes Pittenger
et al.’s (1960) fine-grained linguistic analysis of the initial moments of
therapist-client interaction, Labov and Fanshel’s (1977) seminal work
on speech acts in therapy talk, and Scheflen’s (1973) account of bodily
movement as a communicative resource. The two broad types of discourse
analysis outlined earlier – descriptive and critical – are both well attested
in the psychotherapy context. The most influential descriptive approach
to psychotherapy talk has arguably been conversation analysis (Peräkylä
et al., 2011). Its core idea is that therapy does not proceed in a top-down
manner guided solely by psychological principles but as a sequentially
unfolding ‘bottom-up’ process. Key therapeutic and therapeutically ori-
ented activities like interpretations (Antaki et al., 2005) and relationship-
building (Ferrara, 1994) are mutually constructed through organized
speech sequences as therapists and clients “each take up portions of the
other’s speech to interweave with their own” (Ferrara, 1994, p. 5). Shifting
emphasis from structure to form, a wealth of other descriptive stud-
ies cover various linguistic features, including pronouns (Van Staden &
Fulford, 2004; Zimmerman et al., 2016), questions (Reeves et al., 2004),
and metaphors (McMullen, 2008; Tay, 2013), many of which attempt to
connect language use to therapy processes and outcomes. There is also a
substantial amount of critical analytic work that relates language in ther-
apy to “broader social structures, meanings, and power relations” (Spong,
2010, p. 69). Examples are multifaceted and include therapists’ reflections
on how social power, inequalities, and values impact their attitudes and
practices (Spong & Hollanders, 2005) or how attitudes towards the very
notion of ‘psychological difficulties’ are culturally constructed (Avdi &
Georgaca, 2007). Many such studies tend to have the overarching objec-
tive of advocating greater awareness and reflexivity in therapists, clients,
and the general public through analyzing language.
The body of research described above firmly establishes psychotherapy
as a genre, context, or speech event (Ferrara, 1994) that will continue to
motivate questions of primarily discourse analytic interest. While intended
for discourse analysts, this introductory-level book also approaches psy-
chotherapy language from the perspective of an interested therapist con-
vinced by the value of discourse analysis, but not necessarily aiming to
advance its theory. Such a therapist may instead be keener on concrete
solutions to practical issues that arise from research and practice. This
book will show that many such solutions lie at the intersection of discourse
and data analytics, and the latter can provide valuable insights without
demanding a high level of technical expertise. Among the many real-life
scenarios that underlie these issues, we will discuss four examples that
span across the trajectory of data collection, measurement, analysis, and
application through prediction and/or forecasting. Their practicality lies
with the fact that they do not just define potential research objectives in
their own right, but they are relevant to therapists’ self-evaluation and
reflection on their individual linguistic practices. We will see that all four
examples focus on the psychotherapy session – as opposed to, say, isolated
turn sequences or extracts – as the basic unit of analysis. One obvious
rationale for doing so is that the session is likely to be more concrete, relat-
able, or recallable since the course of treatment is usually organized on a
sessional basis. Consequently, most research on psychotherapy processes
and outcomes pivot around in-session (or more recently, intersession)
events and experiences (Hartmann et al., 2011; Orlinsky et al., 2004). Our
four examples will each comprise the subject matter from Chapters 2 to 5.
The first issue to be discussed in Chapter 2 is a basic yet underexplored
problem that is often encountered at the very beginning of the research
trajectory in psychotherapy as well as other contexts. That is the problem
of missing or incomplete data due to technical failure, lack of consent, and
other reasons. As mentioned above, psychotherapy is usually conducted
on a sessional basis with substantial variation in the number of sessions
across individuals, therapy approaches, and so on. A study of utilization
for major depressive disorder in the United States found that the average
number of sessions per client was 8.5 (SD = 10.0) in 1993 and 9.4 (SD =
10.6) in 2003 (Connolly Gibbons et al., 2011), and it is easy to find cases
with at least 20 sessions in online transcript databases. Psychotherapy talk
is often recorded, transcribed, and presented on a sessional basis, but it is
common for some (parts of) sessions to be omitted because of recording
difficulties or clients requesting so. One could imagine similar problems in
other professional discourse contexts like the classroom, boardroom, or
courtroom. While missing data are not too problematic for those who use
transcripts for training or illustrative purposes, they might be for research
purposes, especially if the missing sessions create multiple gaps in the treat-
ment span. Chapter 2 will discuss how basic simulation techniques – in
particular, Monte Carlo simulations – can partly address the problem by
making use of the statistical properties of available data to simulate how hypothetical missing data would have turned out under the laws of chance.
There are, of course, multiple ways to depict the statistical properties
of transcript data. A common computational linguistic approach is the
document-term matrix, which represents each transcript in terms of the
frequencies of words appearing across all transcripts. We will instead use
the aforementioned LIWC (Linguistic Inquiry and Word Count) (Boyd et
al., 2022; Pennebaker et al., 2015), which focuses not so much on content
words and their frequencies but on how the combination of content and
function words reflect various socio-psychological stances. The differences
between these approaches will be explained in more detail below.
Beyond data collection, the overarching endeavor of psychotherapy
discourse research is to discover how aspects of therapist-client language
reflect therapeutic processes and/or outcomes. Chapter 3 illustrates the
application of data analytics to measure the extent of therapist-client ‘lin-
guistic synchrony’, as both a research construct and a useful self-reflective
measure for individual therapists. Linguistic synchrony is a component
of the broader notion of ‘interpersonal synchrony’ between therapists
and clients. This is defined as the “alignment of neural, perceptual, affec-
tive, physiological, and behavioural responses during social interaction”
(Semin & Cacioppo, 2008), and it is generally thought to be ideal in inter-
personal relationships. Linguistic (a)synchrony is thus the extent to which
the linguistic choices by a therapist and their client (mis)align across the
treatment span. Simple examples of alignment readily observable on the
surface include lexical repetitions and other efforts at cohesion/coher-
ence (Ferrara, 1991, 1994). In the example below, the client describes his
experience of a recent earthquake and emphasizes that he had ‘no control’
unlike when ‘in a car and someone’s doing something silly’. The therapist
proceeds to ask for elaboration by aligning with the client’s description
using the word ‘control’, thereby affirming its relevance and aptness.

C: Like if you’re in a, a car and someone’s doing something silly you have
the ability to stop it and get out.
T: Mm hmm.
C: Whereas this was no control, no control.
T: Yeah, yeah. And what kind of emotional impact did that have on you
straight away? I mean you were out of control; you couldn’t control it.

Linguistic synchrony can also be measured in ways that go beyond the use
of similar words. LIWC is likewise useful for this purpose – each transcript
can be assigned scores based on displayed socio-psychological stances like
analyticity, authenticity, clout, and emotional tone, and (a)synchrony can
then be defined in terms of (dis)similarity between scores. We will exam-
ine how the data analytic technique of cluster analysis – focusing specifi-
cally on the k-means clustering algorithm – can be used on LIWC scores
to discover natural groupings in therapist and client language, leading to
a concrete and replicable synchrony measure per dyad. Just as
with Monte Carlo simulations, discourse researchers can apply a simi-
lar grouping logic to investigate linguistic (a)synchrony in other contexts
with sessional dialogic interaction. Following this, Chapter 4 will further
introduce related grouping or classification techniques to investigate psy-
chotherapy discourse. While k-means clustering can be used to discover
emergent groups among transcripts based on linguistic similarity, Chapter
4 discusses when and how to work with groups that pre-exist independent
of language use. An obvious example is the therapy type such as cognitive-
behavioral therapy or psychoanalysis that underpins the talk, which is pre-
determined by the theoretical orientation and training of therapists. The
k-nearest neighbors (k-NN) classification algorithm will be introduced as
a basic yet flexible approach to test the extent to which LIWC scores,
and, hence, language use, can reliably predict the therapy type at hand.
Likewise, this could be applied to examine if language use discriminates
other therapeutically relevant categories (e.g., good versus poor outcomes)
as well as categories of interest in other discourse contexts.
Psychotherapy sessions are often scheduled weekly. As treatment pro-
gresses along this fairly short interval, it is sensible to assume that atti-
tudes and behaviors (including language) from past sessions have some
lingering influence on present and future sessions. We can find support
for this interdependence assumption by observing how a session’s open-
ing sequences often tend to refer to the previous session(s), akin to how a
teacher starts a lesson by reviewing the previous one. The initial exchange
below is illustrative.

T: So, um what I’d like to do today is start off by reviewing what we did
in our last session. Kind of seeing if you’ve had any further thoughts
about that questions, concerns come up. And then make a start on the
attention training task which we briefly talked about last session.
C: Ok.
T: Yeah. How did you find our last session, did anything come up?
C: I found it most objective.
T: Ok.

A precise description or model of the nature of this influence could then help therapists develop potentially useful prognoses of their clients.
Chapter 5 explores the intriguing possibility of using this sessional inter-
dependence assumption to model and forecast language usage trends, as
measured by LIWC scores, across the treatment span. The application of
time series analysis to YouTube videos for a comparable analytical pur-
pose was briefly discussed above (Figure 1.1). We will likewise fit similar
time series models, formally known as ARIMA (Autoregressive Integrated
Moving Average) models (Box et al., 2015), to consecutive psychotherapy
sessions. This will enable us to discover any regularities in socio-psycho-
logical stances across time that are very likely to be invisible with quali-
tative analytical methods alone. Such regularities, in turn, point towards
tendencies, strategies, or other emergent dynamics of therapist-client com-
munication, and they can be harnessed for the intriguing task of forecast-
ing future tendencies of language use.

Outline of the book


This book is an introductory-level demonstration of data analytics for
discourse analysis, focusing on the case of psychotherapy. We will dem-
onstrate how the featured techniques can be applied to (1) simulate and
thus compensate for incomplete or missing transcripts, (2) measure thera-
pist-client synchrony based on grouping (dis)similar transcripts, (3) evalu-
ate the predictability of therapy types from language use, and (4) model
and forecast language use across treatment sessions. The primary source
of psychotherapy talk data used to demonstrate the techniques is the
Counselling and Psychotherapy Transcripts Series published by Alexander
Street Press, an online database with a growing collection of more than
4,000 session transcripts at the time of writing. The database adheres to
the American Psychological Association’s Ethics Guidelines on privacy,
client rights, and related issues, and it is widely used in education and
research.
We mentioned above that these data analytic applications best serve the
needs of those interested in discourse analysis but who are not necessarily
looking to advance discourse analytic theory. The general nature of our
featured techniques further clarifies this point. They are theory-neutral in
that one does not need to be fixated upon specific discourse theoretical
constructs or measures in order to use them, and they are exploratory in
the sense that, unlike typical statistical techniques like regression, one is
not required to specify discourse-related predictors to investigate the out-
comes. We will, for example, see that cluster analysis does not require any
input beyond the LIWC scores, and ARIMA time series analysis uses only
past data to model present and future data. With appropriate considera-
tion of specialized domain knowledge, these techniques can also be applied
to address questions in discourse contexts other than psychotherapy, as
will be discussed in the concluding chapter.
For each technique, the important task of model validation will also be
demonstrated in detail. This is the necessary step of confirming that the
outputs of statistical model(s) are accurate enough to represent the real-
world data generating process(es) – in our case, therapist-client language.
Many readers may be familiar with the concept of model fit and relevant
standard measures like R2 that generally aim to evaluate how well a model
fits, or predicts, the existing sample data. Model validation, on the other
hand, is also concerned with how well the model predicts out-of-sample
data, which would be a better test of its real-world applicability. Model
validation is often described as a context-dependent process (Mayer &
Butler, 1993), and this bears two important implications for us. First, and
more generally, we should not assume that these techniques will be useful
for analyzing psychotherapy talk just because they are common in other
domains like finance and engineering. Second, as this book will show,
each technique and its context of application may require specific valida-
tion procedures appropriate to the case at hand. The following validation
procedures (Good & Hardin, 2012) will be demonstrated throughout the
book: (1) resampling and refitting the model multiple times on different
parts of the dataset, (2) splitting the data sample into training and test-
ing datasets, and (3) comparing outcomes of alternative methods and/or
using external sources of information. We will also highlight cases where
model validation serves not only as quality control but also as an interest-
ing avenue for stimulating further research hypotheses.
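
As a brief preview of procedure (2), the sketch below shows one common way to split a table of LIWC scores into training and testing portions with scikit-learn; the file name and column names here are hypothetical placeholders.

#a minimal sketch of a train-test split; file and column names are hypothetical
import pandas as pd
from sklearn.model_selection import train_test_split

#assumed input: one row per transcript with LIWC summary variables and a label
data = pd.read_csv('liwc_scores.csv')
X = data[['Analytic', 'Clout', 'Authentic', 'Tone']]
y = data['therapy_type']

#hold out 20% of the transcripts as unseen testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
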
To maximize consistency of presentation and argumentation, Chapters
2 to 4 as described above will be structured with the following common
elements

• a streamlined conceptual walkthrough of the technique at hand
• an example case study application including model validation
• annotated Python code for readers to try the techniques on their own
data.

The Python code used to implement examples in the conceptual walkthroughs and case studies will be presented as they are discussed, for read-
ers to try them out as they go along.
The various code snippets throughout will be presented again at the end
of each chapter to provide a bird’s-eye view of the logical flow of the tech-
niques and their components. All code and data files are also available upon
request and stored on my GitHub site (https://github.com/dennistay1981).
Readers interested in discourse contexts other than psychotherapy would
hopefully find this presentation structure helpful for adapting the code for
their own data and purposes. The concluding Chapter 6 will provide a
synthesized summary of the discussions throughout the book and suggest
ways to explore data analytics beyond its introductory scope.

Quantifying language and implementing data analytics


The case study applications featuring the techniques will all follow two
broad methodological steps. The first step is to quantify the language of
the transcripts or, in simple terms, to ‘convert words into numbers’ using
some measure or quantification scheme to prepare for further quantitative
data analytics. The present choice of LIWC, among many other ways to
do so, will be explained below. In a nutshell, the main rationale is that
it allows users a relatively simple option to focus on socio-psychological
stances rather than semantic contents underlying word choices. Following
quantification, the second broad step is the actual implementation of the
techniques. There are again many options for this, ranging from paid sta-
tistical software like SPSS to open-source programming languages, and the
present choice of the open-source Python programming language will like-
wise be explained.

Quantification of language: Word embedding


The numerical representation of different aspects of texts to facilitate sub-
sequent quantitative analysis is a basic task in fields like corpus linguistics
and natural language processing (NLP). The most basic form of numerical
representation is the frequency of different linguistic forms and structures,
which can then be compared within and across text types. Corpus lin-
guists, for example, annotate their corpora with various types of meta-
data like parts of speech, word stems, lemmas, and other grammatical and
semantic information, and they use frequency-based statistical measures
like keyness and mutual information to characterize the linguistic proper-
ties of texts (Brezina, 2018). Many of these numerical representations aim
to directly provide information about the substantive contents or themes
of texts. For instance, significant differences in parts-of-speech distribution
can reflect differing emphases on entities versus processes, while keyness
analyses reveal how and why particular (content) words are over or under-
used in a corpus. Many readers should already be familiar with these cor-
pus linguistic concepts.
Somewhat related but perhaps less familiar to discourse researchers
is the fundamental computational linguistic process of word embedding.
This refers to various techniques to map language-in-context to numerical
vectors that represent their semantic and syntactic properties, to facilitate
subsequent tasks like language modeling, part-of-speech tagging, named
entity recognition, sentiment analysis, machine translation, and so on.
These increasingly sophisticated techniques underpin state-of-the-art large
language models like Google’s Word2Vec, BERT and, more famously,
OpenAI’s GPT (Wei et al., 2023). Let us however illustrate the most basic
vectorization techniques based on word frequencies rather than semantic
and syntactic relationships. Consider a simple example in Table 1.1. There
are three short sentences R1–R3 forming a mini corpus of movie reviews,

Table 1.1 A simple document-term matrix for three short reviews

                                   very  scary  but  cliché  not  just  alright  good
R1: “very scary, but cliché”         1     1     1      1     0     0       0      0
R2: “not scary, just alright”        0     1     0      0     1     1       1      0
R3: “very scary, but very good”      2     1     1      0     0     0       0      1

and the columns show the number of times each unique word appears in
each sentence.
Table 1.1 is known as a document-term matrix where each review is
a document and each word a term. The rows spell out the vectors for the
corresponding reviews. R1 in vectorized form is therefore [1 1 1 1 0 0 0 0],
R2 is [0 1 0 0 1 1 1 0], and R3 is [2 1 1 0 0 0 0 1]. Geometrically speaking,
the eight unique words each represent a spatial dimension or axis, each
document is a vector occupying this eight-dimensional space, and the fre-
quencies of the terms determine the vector length. The more a certain term
appears in a document, the longer the vector will be in the corresponding
dimension. As this simple approach merely counts the raw frequencies of
terms and, if desired, contiguous term sequences (i.e., n-grams), it may
be less useful for making more nuanced comparisons within the corpus.
For instance, a term that occurs frequently in a document may not actu-
ally be that important or informative to that document, if the same term
also occurs in many other documents in the corpus. Grammatical words
like articles and prepositions are good examples. A more refined way to
compute the document-term matrix would then be to scale the frequency
of each term by considering how often the term occurs among all other
documents. This scaled frequency is known as the term frequency-inverse
document frequency (tf-idf). A simple version of the formula is

\[ \mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \times \log\left(\frac{N}{df + 1}\right) \]

where tf-idf(t,d) is the tf-idf score of term t in document d, tf(t,d) is the (basic) term frequency of term t in document d as described above, N is the
total number of documents in the corpus, and df is the number of docu-
ments in which term t appears at least once. The higher df is, the closer the
fraction inside the logarithm is to 1, and the closer the logarithmic function
is to 0 (since log 1=0), and vice versa. In this way, we have a more precise
evaluation of the relative informational value of terms and, hence, a better
way to distinguish documents in a corpus.
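
To illustrate, the sketch below rebuilds the raw counts in Table 1.1 with scikit-learn's CountVectorizer and then applies the simple tf-idf formula above; note that ready-made tools such as TfidfVectorizer use additional smoothing and normalization, so their scores will differ slightly from these.

#a minimal sketch: the document-term matrix in Table 1.1 and the simple tf-idf formula
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["very scary, but cliché",
           "not scary, just alright",
           "very scary, but very good"]

#raw term frequencies tf(t,d), as in Table 1.1
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(reviews).toarray()
print(vectorizer.get_feature_names_out())
print(tf)

#tf-idf(t,d) = tf(t,d) * log(N / (df + 1)), the simple version given above
N = tf.shape[0]                 #total number of documents
df = (tf > 0).sum(axis=0)       #number of documents containing each term
print((tf * np.log(N / (df + 1))).round(3))
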
Regardless of whether raw frequencies or tf-idf scores are used to vec-
torize our corpus, we still end up with our eight-dimensional matrix. This
number is already impossible to visualize, not to mention real-world texts
with hundreds if not thousands of unique words/dimensions. A typical
preparatory step when dealing with real-world texts is to first reduce the
thousands of dimensions into a far more manageable number, usually just
two, using a statistical technique called principal components analysis
(PCA). We will encounter PCA again later, but, for now, we can under-
stand it as transforming the thousands of axes into a new coordinate
system while preserving as much information from the original data as
possible. Python is an excellent choice for implementing the above tasks
like word embedding, PCA, and NLP in general. Its scikit-learn library,
which we will use many times in this book, has the CountVectorizer and
TfidfVectorizer tools to perform basic and tfidf-vectorization, and it can
also perform various forms of PCA. Sarkar (2016) is a recommended
resource with many examples and annotated code for performing such
analyses. An alternative simple way to compute tf-idf scores and reduce
them to two dimensions in just a few lines of code is to use the texthero
package (https://texthero.org/). The code below, for example, will pre-
process a corpus with standard text cleaning procedures like removing
grammatical ‘stopwords’ and punctuation, calculate tf-idf scores for each
unique word, and reduce the scores to just two dimensions with PCA.
This way of presenting annotated Python will be used throughout the
book.

#import the texthero package
import texthero as hero

#pre-process data
data['text'] = hero.clean(data['text'])
#calculate tf-idf scores
data['tfidf'] = hero.tfidf(data['text'])
#perform PCA on scores
data['pca'] = hero.pca(data['tfidf'])

Table 1.2 shows the hypothetical outcome of the above process for a cor-
pus of ten documents. Each document is now a vector with just two entries
corresponding to the reduced tf-idf scores. With only two dimensions, we
can then represent the information in more visually intuitive ways. Figure
1.2 shows the vectors representing our hypothetical documents F [0.1164
0.428016] and H [0.1874 0.2571] in a two-dimensional space. We can then
quantify, among other things, (dis)similarity between documents by calcu-
lating distances and angles between their representing vectors. Documents
with words having more similar tf-idf scores will have smaller distances
and/or angles between their vectors. A common basic measure that can be
conveniently computed with free online calculators is the cosine distance.
Table 1.2 Reduced tf-idf scores for ten documents

Documents Dimension 1 Dimension 2

A –0.37685 –0.0089
B –0.34303 –0.05053
C –0.42005 0.037291
D –0.34016 0.044264
E –0.332 –0.17402
F 0.1164 0.428016
G –0.35413 0.027041
H 0.1874 0.2571
I –0.38525 0.02156
J –0.20533 –0.10186

Figure 1.2 Documents in a two-dimensional space

This is just the cosine of the angle θ between vectors A and B, or their dot
product divided by the product of their lengths. Formally,

\[ \text{Cosine distance} = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \]

The cosine distance always has an absolute value between 0 and 1. The
higher the value, the more similar the documents. The distance between
documents F and H in Figure 1.2, for example, has a high value of 0.934.
If we go back to our one-sentence reviews in Table 1.1, the value between
R1 and R2 is 0.25, between R1 and R3 is 0.756, and between R2 and R3 is
0.189. Such measures could then be used for different analytical purposes
like text classification or predictors of characteristics like sentiment or
genre (Beysolow II, 2018; Sarkar, 2016).
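
Readers who wish to verify such values without an online calculator can do so in a few lines; the sketch below computes the measure for documents F and H from Table 1.2 with numpy and reproduces the value of approximately 0.934.

#a minimal sketch: the cosine measure for documents F and H in Table 1.2
import numpy as np

F = np.array([0.1164, 0.428016])
H = np.array([0.1874, 0.2571])

#dot product divided by the product of the vector lengths
print(round(np.dot(F, H) / (np.linalg.norm(F) * np.linalg.norm(H)), 3))   #0.934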

Quantification of language: LIWC scores


The approaches described above are all plausible ways to quantify our
psychotherapy transcripts or other kinds of discourse data for further data
analytics. Each therapy session transcript could, for instance, be treated as
a document and vectorized with tf-idf scores. The present book, however,
opts for a slightly different alternative given the large volume of literature
that already exists on these approaches. Instead of directly quantifying
session transcripts based on word frequencies, we make use of the LIWC
text analytic program to measure the extent to which the language in each
transcript reflects different socio-psychological stances adopted by thera-
pists and clients in their interaction. Using an extensive built-in dictionary,
LIWC computes two broad classes of variable scores for a given input
text: ‘linguistic dimensions’ and ‘summary variables’. Linguistic dimen-
sion scores are likewise ‘direct’ measures in that they are just normalized
frequencies of approximately 90 different linguistic categories that appear
in the input text. These categories include grammatical (e.g., pronouns,
articles, prepositions, conjunctions, parts of speech, quantifiers) and
socio-psychological semantic domains (e.g., affective, cognitive, percep-
tual, psychological). This is broadly similar to automatic semantic annota-
tion schemes used by corpus linguists like the UCREL Semantic Analysis
System (Archer et al., 2002). Summary variables, on the other hand, are
the ‘indirect’ measures of primary interest in this book. The latest ver-
sions of LIWC (LIWC-22 and LIWC2015) feature four summary vari-
ables, scored from 0 to 100 in an input text. Each summary variable can be
understood as a socio-psychological construct defined by some combina-
tion of content and function word categories. They are analytical thinking,
clout, authenticity, and emotional tone. As part of LIWC’s psychometric
validation (Boyd et al., 2022; Pennebaker et al., 2015), the word categories
that define the variables have been shown to reliably co-occur and differ-
entiate input texts along their implied traits. Table 1.3 lists the summary
variables, their defining word categories, and relevant validating studies.
The plus/minus signs indicate that the category concerned occurs more/
less frequently in a text that would score higher in the respective summary
variable.
A high analytical thinking score suggests formal, logical, and hierarchi-
cal thinking versus informal, personal, here-and-now, and narrative think-
ing. This was partly validated by a study of American college admission
essays. Essays with more articles and prepositions were more formal and
precise in describing objects, events, goals, and plans, while those with
more pronouns, auxiliary verbs, etc. were more likely to involve personal
stories (Pennebaker et al., 2014). A high clout score suggests speaking/
writing with high expertise and confidence versus a more tentative and
humble style. This was partly validated by studies of decision-making
tasks, chats, and personal correspondences. Higher status individuals used
more we/our, you/your, and fewer tentative words, which was attributed
to an association between relative status and attentional bias. Higher-ups
are more other-focused and less unsure while lower-status individuals are more
self-focused and tentative (Kacewicz et al., 2013). A high authenticity
score suggests more honest, personal, and disclosing discourse versus more
guarded and distanced discourse. This was partly validated by studies of
elicited true versus false stories where the latter has fewer first- and third-
person pronouns, exclusive words, and more negative emotion and motion
verbs. The explanation was that liars tend to dissociate themselves from
the lie, feel greater tension and guilt, and speak in less cognitively complex
ways. These linguistic tendencies accurately distinguished truth-tellers ver-
sus liars in independent data more than 60% of the time (Newman et al.,
2003). Lastly, a high emotional tone score suggests a more positive and
upbeat style, a low score suggests anxiety/sadness/hostility, while a neutral score
around 50 suggests a lack of emotionality. This was partly validated by a
study of online journals prior to and after the September 11 attacks where
negative emotion words increased sharply following the attack and gradu-
ally returned to pre-attack baselines after some time (Cohn et al., 2004).
Table 1.3 Summary variables and defining lexical categories

Summary variable      Word categories
Analytical thinking   +articles, prepositions
                      −pronouns, auxiliary verbs, conjunctions, adverbs, negations (Pennebaker et al., 2014)
Clout                 +1st person plural pronouns, 2nd person pronouns
                      −tentative words (e.g., maybe, perhaps) (Kacewicz et al., 2013)
Authenticity          +1st person singular pronouns, 3rd person pronouns, exclusive words (e.g., but, except, without)
                      −negative emotion words (e.g., hurt, ugly, nasty), motion verbs (e.g., walk, move, go) (Newman et al., 2003)
Emotional tone        +positive emotion words (e.g., love, nice, sweet)
                      −negative emotion words (e.g., hurt, ugly, nasty) (Cohn et al., 2004)

Considered holistically, the four summary variable scores could be seen
as sketching a multivariate socio-psychological profile of an input text.
This would be a transcript of a psychotherapy session in our case. Recent
studies (Boyd et al., 2020; Huston et al., 2019; Tay, 2020) have argued
for the usefulness of these summary variables for revealing how language
is used in therapy and related genres like narratives. They reflect aspects
like how stories are told, the stance of therapists when dispensing advice
and of clients when receiving it, the negotiation of relationships, and lin-
guistic displays of emotional states. For example, therapists could speak in
a highly logical way (analytic thinking) but hedge their advice (clout) and
use more positive words (emotional tone) to come across as more personal
(authenticity) and optimistic. The scores can therefore help to distinguish
various configurations of linguistic and interactional styles. Different from
the semantic annotation tools and vectorization processes described above,
it is also noteworthy that grammatical function words play a relatively
important role in defining the summary variables. This follows from the
argument that socio-psychological stances at both individual and cultural
levels are more reliably indexed by function than content word choices
(Tausczik & Pennebaker, 2010). One reason is that content words are, to
a large extent, tied to (arbitrary) topics at hand. The argument may well
apply to psychotherapy transcripts – it would be more useful to claim that
two sessions are similar because the speakers show comparable levels of
analytic thinking, clout, authenticity, and/or emotional tone rather than
because they happen to be talking about the same topic. Notwithstanding
these advantageous characteristics, it should be highlighted that the choice
of quantification scheme is, in principle, independent from the subsequent
data analytic process and should be made based on a holistic considera-
tion of the setting and objectives at hand. Other than the vectorization
techniques discussed above, other characteristics of session transcripts
that could be incorporated into an analyst’s quantification scheme include
demographic details as well as expert or client ratings on the attainment of
certain therapeutic processes/outcomes.

Introduction to Python and basic operations


A programming language can be described as a closed set of vocabulary
and grammatical rules for instructing a computer to perform tasks. Python
is acknowledged by many sources as one of the most popular and fastest
growing programming languages in the data science industry, and this is
the main reason for choosing it to implement the techniques in this book.
Reportedly named after the comedy series Monty Python’s Flying Circus, it
was created by Dutch programmer Guido van Rossum, who once held the
modest title of its ‘benevolent dictator for life’. Real-world contexts where
Python is used include web and software development, AI and machine
learning, finance, image processing, operating systems, statistical analysis,
and many more.
Python is recognized as a high-level, general-purpose programming lan-
guage. This means it resembles human language more than machine code,
has multiple uses, and works across operating systems. We can contrast
it with low-level languages like C where the vocabulary and grammar are
unintuitive to humans but more directly ‘understood’ by computers, and
domain-specific languages like SQL and MATLAB, which are designed to
solve particular kinds of problems. Implementing data analytic techniques
is just one of Python’s many uses. While it would do an excellent job for
the present and other conceivable discourse analytic purposes, we should
bear in mind that Python is by no means the only way to implement the
techniques discussed in this book. Traditional commercial statistical soft-
ware like SPSS and Stata are ready alternatives but neither customizable
nor open source. Another highly popular programming language among
linguists and academics in general is R (Baayen, 2008; Levshina, 2015),
and there is a lively debate on how it compares with Python. Regardless,
both languages are popular because of their open-source nature, strong
community advocacy and support, relative ease of use, and numerous
learning resources available online.
Installation of Python is straightforward on modern operating systems.
It is, in fact, preinstalled on Unix-based systems like MacOS and Linux
and also on some Windows systems, especially if the administrative tools
were programmed with it. However, most beginners should find it easier to
run Python via a separately installed integrated development environment
(IDE). IDEs provide a graphical user interface for basic operations like
writing, running, debugging, loading, and saving code, somewhat analo-
gous to popular software programs like Microsoft Word. A Python code
file has the .py file extension, which is essentially in plain text format and
can be opened in any text processor and run in an IDE. The file contains
lines of Python code like the simple one-line example below, and by run-
ning the code we instruct the computer to execute the command – in our
case, the output is simply the phrase ‘Hello, world!’. Output is typically
displayed in a separate window in the IDE – be it text, numbers, or a
figure. Figures, in particular, are extremely common and can be saved as
picture files straight from IDEs.

print('Hello, world!')

Because of the popularity of both Python and R, the recommended
approach is to install Anaconda (anaconda.com), a data science platform
that includes both languages. Anaconda, in turn, comes with a variety
of user-friendly IDEs like Spyder, PyCharm, JupyterLab, DataSpell, and
RStudio. The recommended Python IDE for most beginners is Spyder.
After installation, there are three generic steps to any basic data analysis
procedure that we will follow throughout this book. The Python code pro-
vided throughout will also be structured according to these steps. They are
(1) importing libraries, (2) importing datafiles, and (3) performing the data
analytic technique at hand.
Python libraries refer to pre-written collections of code, or modules,
that are available for use once they are ‘imported’. Each library is typically
designed to perform a key task or coherent group of tasks. Many librar-
ies, including the ones used in this book, come with a standard Python
installation, but more specialized ones may need to be installed by the
user. We will be consistently using the following key libraries – pandas
(for data management); matplotlib and seaborn (for data visualization);
and numpy, scikit-learn, and statsmodels (for statistics and data analytics).
Each of these libraries comes with official online user guides and docu-
mentation that can be found by searching its name. Importing libraries is a
preparatory step often done at the start/top of any code file. As seen from
the examples below, we simply type ‘import’ followed by the library name,
and then optionally ‘as x’ where x is a conventional short form that can
subsequently be used in the rest of the code.

import pandas as pd
import numpy as np
import seaborn as sns

The next step is to import files, usually Excel spreadsheets, that contain
data for analysis. Each row of the spreadsheet usually represents a subject
– be it a human subject, transcript, or some other unit of analysis – while
each column represents a variable – scores, ratings, group labels, and so
on. This presentation format is commonly known as wide data because
many columns are used. As Figure 1.3 shows, the alternative is known as
long or narrow data where the same transcript occurs across many rows,
and all the variables and scores are captured under a single column.

Figure 1.3 Wide vs. long data

Just like in traditional statistical software like SPSS, both formats can be
used and even converted to each other in Python. The more convenient for-
mat often depends on the situation and objectives at hand. The detailed dif-
ferences between them are not of immediate present relevance, but interested
readers can look them up with a simple search phrase like ‘wide vs. long
data’. To import an Excel spreadsheet, we first import the pandas library
and then use the following code with read_csv for .csv spreadsheets and
read_excel for .xlsx ones. If the Spyder IDE is used, ensure that the directory
path shown on the top right-hand corner correctly points to the file.

import pandas as pd
data = pd.read_csv('filename.csv')    #for .csv spreadsheets
#or: data = pd.read_excel('filename.xlsx')    #for .xlsx spreadsheets

This will import the spreadsheet and create a pandas dataframe named
data, or any other given name. A dataframe is the most common structure
for managing and analyzing data in Python, and it will be used throughout
the book.
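As a brief illustration of the wide-to-long conversion mentioned above, the minimal sketch below reshapes a small wide dataframe with the pandas melt function. The dataframe and its column names are hypothetical and only meant to show the mechanics.

#a minimal sketch of converting wide data to long data; the dataframe and
#column names below are hypothetical
import pandas as pd
wide = pd.DataFrame({'transcript': ['T1', 'T2'],
                     'analytic': [85.2, 62.1],
                     'clout': [40.5, 55.3]})
#melt() gathers the score columns into a single variable/score pair per row
long = wide.melt(id_vars='transcript', var_name='variable', value_name='score')
print(long)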
The final generic step after importing a dataset is to actually perform the
analysis at hand. This may include sub-steps like initial visualization of the
data, performing the technique, and evaluating the outcomes of analyses,
all of which will be laid out in sequence in the annotated code throughout.
Readers are encouraged to spend time observing how the code logically
unfolds throughout an extended analysis. Grasping how each code section
relates and contributes to the overall objective is often just as important
as mastering its exact syntax, which is the main reason for including the
end-of-chapter code summaries.
Lastly, note that this book is best used in combination with different
general resources available for learning Python. The specific data analytic
techniques featured here are, after all, implemented based upon an exten-
sive foundation of more basic operations, which can only be mastered with
repeated practice in different contexts. Besides the official online docu-
mentation of the key libraries mentioned earlier, good online resources
that offer a blend of free and subscription-only content include datacamp.com
and towardsdatascience.com. The former provides video tutorials and
other training materials for different programming languages and data
analytic tools across a range of proficiency levels. The latter houses a vast
collection of useful short articles and tutorials written by a dedicated com-
munity of data science practitioners in different fields.

References
Abidin, C. (2018). Internet celebrity: Understanding fame online. Emerald.
Antaki, C., Barnes, R., & Leudar, I. (2005). Diagnostic formulations in psychotherapy.
Discourse Studies, 7(6), 627–647. https://doi​.org​/10​.1177​/1461445605055420
Archer, D., Wilson, A., & Rayson, P. (2002). Introduction to the USAS category
system. Lancaster University. https://ucrel.lancs.ac.uk/usas/
Asamoah, D., Doran, D., & Schiller, S. (2015). Teaching the foundations of data
science: An interdisciplinary approach (arXiv:1512.04456). arXiv. https://doi​
.org​/10​.48550​/arXiv​.1512​.04456
Avdi, E., & Georgaca, E. (2007). Discourse analysis and psychotherapy: A critical
review. European Journal of Psychotherapy, Counselling and Health, 9(2),
157–176.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to
statistics using R. Cambridge University Press. https://doi​.org​/10​.1558​/sols​.v2i3​
.471
Baker, P., & Levon, E. (2015). Picking the right cherries? A comparison of corpus-
based and qualitative analyses of news articles about masculinity. Discourse and
Communication, 9(2), 221–236. https://doi​.org​/10​.1177​/1750481314568542
Bederson, B., & Shneiderman, B. (Eds.). (2003). The craft of information
visualization: Readings and reflections (1st ed.). Morgan Kaufmann. https://www.elsevier.com/books/the-craft-of-information-visualization/bederson/978-1-55860-915-0
Beysolow II, T. (2018). Applied natural language processing with python:
Implementing machine learning and deep learning algorithms for natural
language processing. Apress.
Bhatia, A. (2018). Interdiscursive performance in digital professions: The case of
YouTube tutorials. Journal of Pragmatics, 124, 106–120. https://doi​.org​/10​
.1016​/j​.pragma​.2017​.11​.001
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series
analysis: Forecasting and control (5th ed.). Wiley.
Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The
development and psychometric properties of LIWC-22. University of Texas at
Austin. https://www​.liwc​.app
Boyd, R. L., Blackburn, K. G., & Pennebaker, J. W. (2020). The narrative arc:
Revealing core narrative structures through text analysis. Science Advances,
6(32), eaba2196.
Breeze, R. (2011). Critical discourse analysis and its critics. Pragmatics, 21(4),
493–525.
Brezina, V. (2018). Statistics in corpus linguistics: A practical guide. Cambridge
University Press. https://doi​.org​/10​.1017​/9781316410899
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and
analytics: From big data to big impact. MIS Quarterly: Management Information
Systems, 36(4), 1165–1188.
Cohn, M. A., Mehl, M. R., & Pennebaker, J. W. (2004). Linguistic markers of
psychological change surrounding September 11, 2001. Psychological Science,
15(10), 687–693.
Connolly Gibbons, M. B., Rothbard, A., Farris, K. D., Wiltsey Stirman, S.,
Thompson, S. M., Scott, K., Heintz, L. E., Gallop, R., & Crits-Christoph, P.
(2011). Changes in psychotherapy utilization among consumers of services
for major depressive disorder in the community mental health system.
Administration and Policy in Mental Health, 38(6), 495–503. https://doi​.org​/10​
.1007​/s10488​-011​-0336-1
Conway, D. (2010). The data science Venn diagram. blog.revolutionanalytics.com
Earnest, A., Chen, M. I., Ng, D., & Leo, Y. S. (2005). Using autoregressive
integrated moving average (ARIMA) models to predict and monitor the number
of beds occupied during a SARS outbreak in a tertiary hospital in Singapore.
BMC Health Services Research, 5, 1–8. https://doi​.org​/10​.1186​/1472​-6963​-5​
-36
Evans, J., & Lindner, C. (2012). Business analytics: The next frontier for decision
sciences. Decision Line, 43(2), 4–6.
Ferrara, K. W. (1991). Accommodation in therapy. In H. Giles, J. Coupland, &
N. Coupland (Eds.), Contexts of accommodation (pp. 187–222). Cambridge
University Press and Maison des Sciences de l’Homme.
Ferrara, K. W. (1994). Therapeutic ways with words. Oxford University Press.
Gee, J. P. (2011). An introduction to discourse analysis: Theory and method (3rd
ed.). Routledge.
Good, P., & Hardin, J. (2012). Common errors in statistics (and how to
avoid them) (1st ed.). John Wiley & Sons, Ltd. https://doi​.org​/10​.1002​
/9781118360125
Han, J., Kamber, M., & Pei, J. (2000). Data mining: Concepts and techniques (3rd
ed.). Morgan Kaufmann.
Hartmann, A., Orlinsky, D., & Zeeck, A. (2011). The structure of intersession
experience in psychotherapy and its relation to the therapeutic alliance. Journal
of Clinical Psychology, 67(10), 1044–1063. https://doi​.org​/10​.1002​/jclp​.20826
Huston, J., Meier, S. T., Faith, M., & Reynolds, A. (2019). Exploratory study of
automated linguistic analysis for progress monitoring and outcome assessment.
Counselling and Psychotherapy Research, 19(3), 321–328.
Kacewicz, E., Pennebaker, J. W., Jeon, M., Graesser, A. C., & Davis, M. (2013).
Pronoun use reflects standings in social hierarchies. Journal of Language and
Social Psychology, 33(2), 125–143. https://doi​.org​/10​.1177​/0261927x13502654
Labov, W., & Fanshel, D. (1977). Therapeutic discourse: Psychotherapy as
conversation. Academic Press.
Levshina, N. (2015). How to do linguistics with R. John Benjamins.
MacKay, R. J., & Oldford, R. W. (2000). Scientific method, statistical method and
the speed of light. Statistical Science, 15(3), 254–278.
Mayer, D. G., & Butler, D. G. (1993). Statistical validation. Ecological Modelling,
68(1), 21–32. https://doi​.org​/10​.1016​/0304​-3800(93)90105-2
McMullen, L. M. (2008). Putting it in context: Metaphor and psychotherapy. In
R. W. Gibbs (Ed.), The Cambridge handbook of metaphor and thought (pp.
397–411). Cambridge University Press.
Newman, M. L., Pennebaker, J. W., Berry, D. S., & Richards, J. M. (2003).
Lying words: Predicting deception from linguistic styles. Personality
and Social Psychology Bulletin, 29(5), 665–675. https://doi​.org​/10​.1177​
/0146167203251529
Norcross, J. C. (1990). An eclectic definition of psychotherapy. In J. K. Zeig & W.
M. Munion (Eds.), What is psychotherapy? Contemporary perspectives (pp.
218–220). Jossey-Bass.
Orlinsky, D., Michael, R., & Willutzki, U. (2004). Fifty years of psychotherapy
process-outcome research: Continuity and change. In Bergin and Garfield’s
handbook of psychotherapy and behavior change (pp. 307–389). Wiley.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The
development and psychometric properties of LIWC2015. University of Texas
at Austin.
Pennebaker, J. W., Chung, C. K., Frazee, J., Lavergne, G. M., & Beaver, D. I.
(2014). When small words foretell academic success: The case of college
admissions essays. PLOS ONE, 9(12), 1–10.
Peräkylä, A., Antaki, C., Vehviläinen, S., & Leudar, I. (Eds.). (2011). Conversation
analysis and psychotherapy. Cambridge University Press.
Pittenger, R. E., Hockett, C. F., & Danehy, J. J. (1960). The first five minutes. Carl
Martineau.
Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare:
Promise and potential. Health Information Science and Systems, 2(3), 1–10.
Reeves, A., Bowl, R., Wheeler, S., & Guthrie, E. (2004). The hardest words:
Exploring the dialogue of suicide in the counselling process — A discourse
analysis. Counselling and Psychotherapy Research, 4(1), 62–71. https://doi​.org​
/10​.1080​/147​3314​0412​331384068
Reisigl, M. (2017). The discourse-historical approach. In J. Flowerdew & J. E.
Richardson (Eds.), The Routledge handbook of critical discourse studies (pp.
44–59). Routledge.
Reisigl, M., & Wodak, R. (Eds.). (2001). Discourse and discrimination: Rhetorics
of racism and anti-semitism. Routledge.
Sarkar, D. (2016). Text analytics with python. Springer.
Scheflen, A. E. (1973). Communicational structure: Analysis of a psychotherapy
transaction. Indiana University Press.
Semin, G. R., & Cacioppo, J. T. (2008). Grounding social cognition:
Synchronization, entrainment, and coordination. In G. R. Semin & E. R. Smith
(Eds.), Embodied grounding: Social, cognitive, affective, and neuroscientific
approaches (pp. 119–147). Cambridge University Press.
Spong, S. (2010). Discourse analysis: Rich pickings for counsellors and therapists.
Counselling and Psychotherapy Research, 10(1), 67–74.
Spong, S., & Hollanders, H. (2005). Cognitive counsellors’ constructions of social
power. Psychotherapy and Politics International, 3(1), 47–57. https://doi​.org​
/10​.1002​/ppi​.17
Stieglitz, S., & Dang-Xuan, L. (2013). Social media and political communication:
A social media analytics framework. Social Network Analysis and Mining, 3(4),
1277–1291.
Tannen, D., Hamilton, H. E., & Schiffrin, D. (Eds.). (2015). The handbook of
discourse analysis (2nd ed.). Blackwell.
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words:
LIWC and computerized text analysis methods. Journal of Language and Social
Psychology, 29(1), 24–54.
Tay, D. (2013). Metaphor in psychotherapy. A descriptive and prescriptive analysis.
John Benjamins.
Tay, D. (2017a). Quantitative metaphor usage patterns in Chinese psychotherapy
talk. Communication and Medicine, 14(1), 51–68.
Tay, D. (2017b). Time series analysis of discourse: A case study of metaphor in
psychotherapy sessions. Discourse Studies, 19(6), 694–710.
Tay, D. (2019). Time series analysis of discourse: Method and case studies.
Routledge.
Tay, D. (2020). A computerized text and cluster analysis approach to psychotherapy
talk. Language and Psychoanalysis, 9(1), 1–22.
Tay, D. (2021a). Automated lexical and time series modelling for critical discourse
research: A case study of Hong Kong protest editorials. Lingua, 255, 103056.
Tay, D. (2021b). COVID-19 press conferences across time: World Health
Organization vs. Chinese Ministry of Foreign Affairs. In R. Breeze, K.
Kondo, A. Musolff, & S. Vilar-Lluch (Eds.), Pandemic and crisis discourse:
Communicating COVID-19 (pp. 13–30). Bloomsbury.
Tay, D. (2021c). Modelability across time as a signature of identity construction on
YouTube. Journal of Pragmatics, 182, 1–15.
Tay, D., & Pan, M. X. (Eds.). (2022). Data analytics in cognitive linguistics:
Methods and insights. De Gruyter Mouton.
Taylor, S. (2013). What is discourse analysis? Bloomsbury Academic. https://doi​
.org​/10​.5040​/9781472545213
Van Staden, C. W., & Fulford, K. W. M. M. (2004). Changes in semantic uses of
first person pronouns as possible linguistic markers of recovery in psychotherapy.
Australian and New Zealand Journal of Psychiatry, 38(4), 226–232. https://doi​
.org​/10​.1111​/j​.1440​-1614​.2004​.01339.x
Wei, C., Wang, Y.-C., Wang, B., & Kuo, C.-C. J. (2023). An overview on language
models: Recent developments and outlook (arXiv:2303.05759). arXiv. https://
doi​.org​/10​.48550​/arXiv​.2303​.05759
Zimmerman, J., Brockmeyer, T., Hunn, M., Schauenburg, H., & Wolf, M. (2016).
First-person pronoun use in spoken language as a predictor of future depressive
symptoms: Preliminary evidence from a clinical sample of depressed patients.
Clinical Psychology and Psychotherapy, 24(2), 384–391.
2 Monte Carlo simulations

Introduction to MCS: Bombs, birthdays, and casinos


Monte Carlo simulations (MCS), or more generally the Monte Carlo
method, provides a simple yet flexible and elegant solution to deal with dif-
ferent scenarios of uncertainty. It has evolved over the years from being a
‘last resort’ to a leading methodology in many branches of science, finance,
and engineering (Kroese et al., 2014). Despite its name, the origins of MCS
are closer to atomic bombs than gambling resorts. It was developed in
the mid-1940s by Stanislaw Ulam and John von Neumann during their
wartime involvement with the Manhattan Project, and the name ‘Monte
Carlo’ was a code name referencing Ulam’s gambling uncle. Interestingly,
however, Ulam’s initial inspiration for the method did indeed come while
playing solitaire (Metropolis, 1987).
The basic idea of MCS is to calculate many different possible outcomes
of some inherently uncertain phenomenon, consider the distribution of
their probabilities of occurrence, and analyze the resulting ‘reconstruction’
of the phenomenon in ways that suit the objective(s) at hand. In Ulam’s
own words (italics mine),

The first thoughts and attempts I made to practice [the Monte Carlo
Method] were suggested by a question which occurred to me in 1946
as I was convalescing from an illness and playing solitaires. The ques-
tion was what are the chances that a Canfield solitaire laid out with
52 cards will come out successfully? After spending a lot of time try-
ing to estimate them by pure combinatorial calculations, I wondered
whether a more practical method than “abstract thinking” might not
be to lay it out say one hundred times and simply observe and count
the number of successful plays. This was already possible to envisage
with the beginning of the new era of fast computers, and I immedi-
ately thought of problems of neutron diffusion and other questions
of mathematical physics, and more generally how to change pro-
cesses described by certain differential equations into an equivalent
form interpretable as a succession of random operations. Later [in
1946], I described the idea to John von Neumann, and we began to
plan actual calculations. (Eckhardt, 1987)

The italicized portion outlines the gist of MCS. Facing an abstract prob-
lem where there is no tractable analytic solution, like Ulam’s solitaire com-
binations and neutron diffusion, one could attempt what is known as a
numerical solution instead. Most mathematical problems can be solved
either analytically or numerically. An analytic solution requires us to frame
the problem in a well-defined form and calculate an exact, deterministic
answer. A numerical solution, on the other hand, means making intelligent
guesses at the solution until a satisfactory (but not exact) answer is obtained.
This ‘guessing’ is often done by observing, or making sensible assumptions,
about the different ways in which the underlying real-world phenomena of
interest would occur when left to chance. Statisticians refer to this as the
data generating process. In the solitaire example above, the data-generating
process is probabilistically defined based on the practical idea that certain
configurations of cards must have certain odds of appearing. Using knowl-
edge and/or assumptions about their probability distributions, MCS works
by simulating a large range of possible outcomes and aggregating them for
further analysis. Thanks to modern computers that can handle multiple
simulations and some basic laws of probability to be explained shortly, the
aggregated outcome can give us (1) a highly reliable approximation of the
actual analytic solution, and (2) in many contexts, insights into the range
of potential real-world outcomes that might occur. We will call these two
claimed advantages our ‘overarching claims A and B’ and return to them
shortly. Some readers may at this point already be making a conceptual
connection between MCS and our practical problem of missing or incom-
plete transcripts – facing the problem of uncertainty over the properties (i.e.,
LIWC scores) of a certain dyad because of missing data, one could attempt
to simulate these scores using the distribution of available data and obtain
sensible estimates and information about what the scores are likely to be.
Before we officially make this connection, let us go through some examples
to familiarize ourselves more with the logic of MCS and its underpinning
laws of probability. We will begin to use Python code, so let us first import
the following libraries to be used in the rest of the chapter.

#import Python libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import seaborn as sns
The birthday problem


Our first example is the well-known ‘birthday problem’ that will help us to
compare the analytic and numerical approaches. Suppose there are n peo-
ple in a room, and we want to find the probability that at least two of them
share the same birthday. This is sometimes called the birthday paradox
because many people are surprised to find out that it takes only 23 people
for this probability to exceed 0.5. In other words, although there are 365
days in a year, there is already a 50% chance that at least two have the
same birthday among 23 people. The simplest and rather clever analytic
solution is as follows. The probability p we are after is actually 1 minus the
probability that no one has the same birthday. We calculate the latter by
noting that while the first person’s birthday can be any day of the year, the
second person would have only 364 days left to ‘choose’ from, the third
person 363 days, and so on. Therefore, for n people, the probability that
no one has the same birthday is given by

\frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{365 - (n - 1)}{365}
And the probability p we are originally after is simply 1 minus the above.
The following code will calculate the above for any given value of n. It is a
simple function that takes n as an input and implements the above formula
as a loop with n iterations, returning the desired final analytic solution. For
n=23, the probability to three decimal places is 0.507.

#a function to calculate the probability p given n number of people
def probsamebirthday(n):
    q = 1
    for i in range(1, n):
        probability = i / 365
        q *= (1 - probability)
    p = 1 - q
    return p

#runs the function for n=23. feel free to change the value of n.
probsamebirthday(23)

Let us now solve the same problem numerically using an MCS approach
that follows the broad procedure described above: (1) simulate a range of
possibilities using some assumed probability distribution, and (2) analyze
the aggregated outcome. Simulations always involve the computer drawing
random numbers, and the outcomes will therefore vary each time. This may
be disadvantageous if we want the same outcomes on different occasions
to ensure reproducibility for instructional purposes. We can do this by
running the np.random.seed command before each simulation, which will
make Python draw the same random numbers each time. Changing the
number 0 will make Python draw a different set of random numbers.

#fix the random seed to ensure reproducibility
np.random.seed(0)

We then proceed to simulate a random birthday for each of 23 hypotheti-
cal people under the reasonable assumption that birthdays are uniformly
distributed (i.e., each day of the year is equally likely to be a birthday).
This is implemented by telling the computer to randomly pick 23 numbers
from 1 to 365 from a uniform distribution. There are other approaches to
obtain multiple random outcomes like jackknifing and bootstrapping that
are beyond the present scope – readers may refer to Efron (1982). We then
check and note if there is at least one duplicate/shared birthday among the
23 numbers. This process is then repeated multiple times. The aggregated
outcome is simply the number of times where there is at least one dupli-
cate, and the desired probability p is this number of times divided by the
number of times we repeated the process.
The code below is the function to check if there are any duplicates in the
simulated birthdays. It works by comparing the number of unique dates
to the total number of dates simulated. If the numbers are different, which
implies that there is at least one duplicate (i.e., shared birthday), the func-
tion returns the value TRUE.

#a function to check for duplicates in a list of birthdays
def contains_duplicates(X):
    return len(np.unique(X)) != len(X)

The code below simply specifies the number of people and the number of
times we want to repeat the process to obtain the aggregated outcome (i.e.,
number of simulation trials). For this demonstration we use 23 and 5,000,
respectively. It is customary to run at least 1000 trials.

#specify no. of people and trials. feel free to change the values.
n_people=23
n_trials=5000

We then run the following code that loops through the specified number of
trials, drawing a random birthday for the specified number of people each
time. This list of birthdays is subjected to the above function each time to
check for duplicates, the result (TRUE/FALSE) being recorded in a list.
After 5,000 trials, we simply count the number of TRUES in the list and
derive the probability p by dividing this number by 5,000.

#a for-loop to generate birthdays and calculate the probability for the
#specified no. of people and trials
list=[]
for i in range(0, n_trials):
    #loop through the number of trials. For each, draw random birthdays for n_people
    dates = np.random.choice(range(1, 366), size=(n_people))
    #use the function above to check for duplicates in each trial, appending the result to list
    list.append(contains_duplicates(dates))

#calculate the final probability as the fraction of all trials where there are duplicates
probability = len([p for p in list if p == True]) / len(list)
print("With", n_trials, "trials, probability of at least one shared bday among", n_people, "people =", probability)

Our solution using this numerical simulation approach is p = 0.5056,
which is remarkably close to the analytic solution of p = 0.507. To illus-
trate the effect of varying the number of trials on our numerical solution,
we can use the following code. It is a more complex for-loop to keep track
of the numerical solution from an increasing number of trials from 0 to
5,000 (or any other specified number). These solutions (i.e., probabilities)
are then plotted against the corresponding number of trials in Figure 2.1.

#a more complex for-loop to track and plot the probability for an
#increasing number of trials from 0 to n_trials
n_people=23
n_trials=5000
trial_count = 0
shared_birthdays = 0
probabilities = []

for experiment_index in range(0, n_trials):
    dates = np.random.choice(range(1, 366), size=(n_people))
    if contains_duplicates(dates):
        shared_birthdays += 1
    trial_count += 1
    probability = shared_birthdays / trial_count
    probabilities.append(probability)
Figure 2.1 Outcome of 5,000 birthday paradox simulations

#plot the outcome (probability) for an increasing number of trials
plt.plot(probabilities)
plt.title("Outcome of " + str(n_trials) + " trials converging to p=" + str(probabilities[-1]))
plt.show()

We observe again that the outcome (see y-axis) with 5,000 trials is p =
0.5056, which as mentioned is very close to the analytic solution of 0.507.
This example supports our overarching claim A that numerical simulations
can give us a highly reliable approximation of the actual analytic solu-
tion. To recall, the result is simply telling us that the computer simulated
23-people birthday lists for 5,000 times, and in 2,528 (50.56%) of these
times, there were at least two people with the same birthday.
Besides this result, the most important feature in Figure 2.1 is that as the
number of trials increases from 0 to 5,000, the probability fluctuates less
and less, and eventually converges upon the final value of 0.5056. That is
to say, with only a small number of trials, we get wildly fluctuating and
imprecise results even by slightly changing this number. The line, in fact,
begins to stabilize only after about 3000 trials, beyond which more trials
no longer tweak the result by much, giving us greater confidence in our
simulated answer. This convergence is generally true and illustrates a very
important law of probability that enables MCS: the law of large numbers.
The law states that as the number of identically distributed, randomly gen-
erated variables increase, their average value will get closer and closer to
the true theoretical average. In our case, each trial is indeed identically
uniformly distributed and randomly generated. Therefore, with more and
more trials, the proportion of trials with duplicate birthdays will get closer
and closer to the true proportion given by the analytic solution above. The
law of large numbers generally ensures that the more simulation trials we
perform, the better our estimate of the true analytic solution will be. This is
especially crucial for problems for which an analytic solution is intractable
or impossible, meaning that there is nothing to judge our numerical solu-
tions against. However, the practical tradeoff is increased computational
expense, and slower computers will struggle with handling too many trials.
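For readers who would like to see the law of large numbers in isolation, the short sketch below (not part of the book's own examples) tracks the running mean of simulated fair die rolls, which converges to the true mean of 3.5 as the number of rolls grows. It reuses the numpy and matplotlib libraries imported earlier.

#a minimal sketch of the law of large numbers: the running mean of
#simulated fair die rolls converges to the true mean of 3.5
np.random.seed(0)
rolls = np.random.choice(range(1, 7), size=5000)
running_mean = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)
plt.plot(running_mean)
plt.axhline(y=3.5, linestyle='dotted')
plt.title('Running mean of die rolls converging to 3.5')
plt.show()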
Another very important and related principle that enables MCS is the
well-known central limit theorem. This will be encountered in our second
example, to which we now turn. It is a tribute to the tenuous gambling
origins of MCS and assumes the setting of a standard casino. While the
birthday paradox lent support to our overarching claim A (numerical sim-
ulations can give us a highly reliable approximation of the actual analytic
solution), the casino example is meant to support our overarching claim
B – that, in many real-world contexts, MCS are useful not just for estimat-
ing a single analytic value, but also for showcasing the range of values that
would have occurred by chance, and their implications.

Spinning the casino roulette


The saying ‘the house always wins’ warns us that in any casino, the games
are designed to virtually guarantee a profit for casino owners relative to
all its patrons. In other words, the cruel laws of probability ensure that
any short-term success for individual patrons will be ‘canceled out’ long-
term on themselves or others. What if we (or most casino owners, for that
matter) want to be more precise and ask the question of how much the
casino is expected to win over a certain period? Just like the birthday para-
dox, this is a problem that can be approached analytically or by numerical
simulations. Unlike the birthday paradox, however, we are not so much
interested in a single deterministic answer as the outcome. Instead, it is
more useful to have a range of profits/losses and their relative probabilities
of occurrence, so that more nuanced business planning decisions could be
made. This is exactly what our aforementioned overarching claim B states.
As an illustration, imagine a simple roulette game where the ball lands on
a number from 1 to 100. The odds are always slightly tipped toward the
house so that patrons feel a realistic chance of winning and are motivated
to keep playing. To keep things simple, suppose that the patron wins $1
if each spin lands on 56–100 and loses $1 otherwise, giving us a winning
probability of 45%. If the patron plays 100 spins, the expected winnings
under an analytic approach would therefore be

(0.45 \times \$1 - 0.55 \times \$1) \times 100 = -\$10

which equals an expected loss of $10. However, this is not very informa-
tive for casino owners as they want more information about how likely a
patron would get extremely lucky. We therefore turn to the MCS approach,
again by (1) simulating a range of possibilities using some assumed prob-
ability distribution, and (2) analyzing the aggregated outcome.
Let us simulate a scenario where there are 1,000 patrons and each of
them spins the wheel 100 times. Same as before, we set np.random.seed(0)
prior to drawing any random numbers to ensure reproducibility. We begin
by writing a function to implement our rule that the patron wins $1 if each
spin lands on 56–100 and loses $1 otherwise.

#a function to calculate winnings per spin
def winnings(x):
    if x <= 55:
        return -1
    else:
        return 1

We then specify the number of spins per set (or per patron), and the number
of sets (or trials) to be simulated. Note the similar logic with our birthday
example above. For this demonstration we use 100 spins and 1,000 sets,
bearing in mind it is customary to run at least 1,000 trials. These numbers
can be changed for experimentation just like the birthday example, and
readers are encouraged to do so to witness the law of large numbers and
the central limit theorem discussed below in action.

#specify no. of spins per set and no. of sets. feel free to change the values.
spins=100
sets=1000

We then run the following code that loops through the specified number of
sets, and for each set simulates the specified number of spins. Each spin is a
random number from 1 to 100. We again use a uniform distribution because
it is reasonable to assume that the ball has an equal chance of landing on
each number. The spin outcomes are then subjected to the function above
to determine if $1 is won or lost for each spin. Finally, these are summed to
determine the total winnings per set, which is then stored in a list.
#a for-loop to spin the wheel, calculate and keep track of winnings
list=[]
for i in range(0, sets):
    #loop through the number of sets and spin the specified number of times
    x = np.random.uniform(1, 100, spins)
    #keep track of spin outcomes in a dataframe
    df = pd.DataFrame(data={'x': x})
    #use the function above to determine the amount won per spin
    df['Winning'] = df['x'].apply(winnings)
    #sum the amount won for all spins, to obtain and record the total winnings per set
    list.append([df['Winning'].sum()])

The code below plots a histogram to show the distribution of winnings
over all 1000 sets. A blue vertical line indicating the mean winnings per set
is included. The standard deviation of the 1000 winnings is also calculated
to three decimal places.

#plot the distribution of winnings over all the sets
ax = sns.distplot(list, kde=False)
ax.set(xlabel='Winnings', ylabel='No. of sets (' + str(spins) + ' spins each)')
ax.axvline(x=np.mean(list), color='blue', label='Mean', linestyle='dotted', linewidth=2)
ax.legend()
plt.title('Mean winnings=' + str(np.mean(list)) + '\n St dev winnings=' + str(np.round(np.std(list), decimals=3)))

Figure 2.2 is the resulting histogram from the code above. The line indi-
cates the mean (or expected) winnings of −$9.364. The standard deviation
of winnings across the 1,000 sets is 9.594.
This is where the aforementioned central limit theorem, which facili-
tates our overarching claim B, becomes relevant. The central limit theo-
rem might be familiar to some readers. It states that if independent and
identically distributed random samples of size N are repeatedly drawn,
the larger N is, the more normally distributed the sample means will be.
This happens regardless of how the phenomenon defining the popula-
tion is actually distributed. This is more or less what is happening in our
example. The ‘population’ of roulette spins is not normally distributed,
but the simulated outcomes, which involve computing winnings on a
large number of random spins with N = 100, approach a normal distribu-
tion as seen from the histogram. Note that we do not get a ‘perfect’ nor-
mal distribution because of this additional computation of winnings (i.e.,
$1 or –$1 per spin), but for many other applications (as we will see later
in our case study) we do.

Figure 2.2 Distribution of winnings from roulette simulations

The second important feature of the central
limit theorem is that the larger N is, the closer the mean of all the simu-
lated sample means will be to the ‘true’ population mean. This is closely
related to the convergent property of the law of large numbers above.
The central limit theorem allows the casino owners to derive interesting
insights from MCS. Since the winnings distribution approximates a nor-
mal distribution, we can use the known properties of a normal distribution
to estimate the likelihood of extreme scenarios. For example, 95% of all
randomly sampled outcomes are expected to fall within about two stand-
ard deviations on either side of the mean, and 2.5% to fall beyond that on
either side as ‘extreme’ outcomes. This means that although the long-term
expected profit is about $9.364 per patron (per 100 spins), casino owners should
still anticipate a loss of at least $9.824 (patron winnings of −9.364 + 2 × 9.594), 2.5% of
the time. Conversely, they can expect high profits of at least $28.552 (patron winnings
of −9.364 − 2 × 9.594 = −28.552), 2.5% of the time. This is a simple example of what
we call an MCS-enabled ‘scenario analysis’ of potential real-life outcomes
from inherent uncertainty.
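If we prefer not to rely on the normal approximation, a quick empirical alternative is to read the cutoffs directly off the simulated winnings with percentiles. The sketch below assumes the list of winnings produced by the roulette loop above is still in memory.

#an empirical alternative to the normal-approximation scenario analysis:
#read the 2.5th and 97.5th percentiles directly off the simulated winnings
lower, upper = np.percentile(list, [2.5, 97.5])
print('2.5% of sets fall below', lower, 'and 2.5% of sets fall above', upper)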
We are now ready to make an explicit connection between our two
examples above and the practical problem of missing or incomplete
transcripts. Just as we have generated random samples of birthdays
and roulette spins based on reasonable assumptions of their (uniform)
distribution, we will generate random samples of transcripts based on
the distribution of the LIWC scores of available transcripts. The result-
ing outcomes can then be subject to the same kind of scenario analysis
described above. We will also see that MCS can be used to perform inter-
esting experiments on a full set of transcripts by running multiple differ-
ent simulations, or ‘resampling’ the data (Carsey & Harden, 2013), each
time treating transcripts from a different treatment phase (e.g., beginning,
middle, end) as if they were missing. This is described in machine learning
parlance as a ‘train-test approach’ where actual data are initially withheld
from the model-building process and then used later to test its accuracy.
The simulated transcript scores are compared to the actual scores as a
form of model validation (Mayer & Butler, 1993), and even as an entry
point to interesting hypotheses about the nature of psychotherapy talk at
different phases.
A final point is that the above examples and the forthcoming introduc-
tory applications in this chapter are all limited to univariate simulations.
That is to say, only one variable, or one source of uncertainty, or one
data generating process – be it a birthday, roulette spin, or LIWC sum-
mary variable score – defines the outcome of interest. Many other real-life
MCS applications involve (complex) interactions between multiple sources
of uncertainty. A simple example of a multivariate simulation from the
e-commerce domain might be to simulate the signup flow of a new web-
site. During a typical signup journey, a potential customer sees an online
advertisement and decides (1) whether to click on the ad, and (2) whether
to sign up afterwards. We therefore have two sources of uncertainty, each
with its own probability distribution depending on managerial decisions
like how much money is spent on the advertising. The relationship between
the two variables is also important and straightforward in this case – the
number of clicks must be simulated first, and the number of signups, our
ultimate outcome of interest, is derived thereafter. Despite the additional
complexities, the primary logic and process of univariate and multivari-
ate MCS are the same: define the variable(s) of interest, generate multiple
random values from an assumed/observed probability distribution, com-
pute the relationship(s) between the variables in the multivariate case, and
then aggregate and analyze the results. Interested readers may refer to the
following references to learn about more complex applications of MCS
beyond the present scope (Carsey & Harden, 2013; Owen & Glynn, 2018).
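As a rough illustration of this two-stage logic, the sketch below simulates the signup flow just described with two chained binomial draws. The impression count and the click and signup probabilities are purely made-up numbers for illustration.

#a minimal two-stage (multivariate) simulation of the signup flow described
#above; the impression count and probabilities are purely illustrative
np.random.seed(0)
signups = []
for i in range(0, 1000):
    #stage 1: how many of 10,000 ad impressions are clicked (assumed p = 0.02)
    clicks = np.random.binomial(10000, 0.02)
    #stage 2: how many of those clicks become signups (assumed p = 0.10)
    signups.append(np.random.binomial(clicks, 0.10))
print('Mean simulated signups =', np.mean(signups))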

Case study: Simulating missing or incomplete transcripts


We now explore how MCS can address the practical problem of miss-
ing or incomplete transcripts by using the properties of available data to
simulate how missing data would turn out by the laws of chance. The
major approaches to psychotherapy language and discourse research were
briefly discussed in Chapter 1. Whether descriptive or critical, and qualita-
tive or quantitatively oriented, much of this research relies on transcripts
of therapist–client interaction as a key data source. It is therefore some-
what surprising that the common problem of missing or incomplete tran-
scripts, even in professionally maintained databases, is seldom explicitly
confronted. The most typical reasons are (1) logistical. Voice recording
technologies are imperfect especially for older databases with sessions
dating back to the 1970s, many of which were transcribed long after the
recordings. A good number of utterances tend to be marked as inaudible as
a result; (2) absence of consent. Most therapists and clients provide general
consent to be recorded at the start of the treatment span. However, this
consent may be withdrawn for sessions where highly sensitive material is
discussed.
Missing or incomplete transcripts can be seen as a practical instance of
the much-discussed statistical problem of missing data (Graham, 2009).
The most obvious general consequence of missing data in most research
contexts is the loss of representativeness, potentially leading to biased
results. Typical examples include survey, experimental, and data from sec-
ondary sources, especially in within-subjects and/or longitudinal studies
where subjects drop out before completion. Missing data also pose prob-
lems for many (but not all) statistical procedures that are unable to handle
it. Basic examples include the matched pairs t-test and repeated measures
ANOVA (Tay, 2022), which require all compared samples to have equal
sample sizes. For psychotherapy language and discourse research, missing
data pose problems for both qualitative and quantitative analyses alike.
Conversation analysis, for instance, relies almost by definition on complete
turn sequences to examine the sequential organization of therapy talk.
From a statistical perspective, the severity of problems resulting from
missing data depends on exactly how the data are missing. Statisticians
refer to this as the missing data mechanism. The many possible scenarios
are classified into three broad categories (Rubin, 1976; Schafer & Graham,
2002): data that are missing completely at random, missing at random,
and missing not at random. Perhaps the best-case scenario is for data to
be missing completely at random. This means that what is missing is a
completely random subset of the full dataset, which minimizes the prob-
ability that the available data are biased in some way. A simple example is
survey respondents not answering some questions by accident. It might be
safe in many such cases to simply ignore the problem because the results
are not likely to be systematically affected. Slightly more problematic are
cases of data missing at random. This means that the data appear to be
randomly missing but could actually be related to other variables in the
dataset. For example, survey respondents may be less willing to provide
information about their income if they are less educated. This implies that
subsequent analyses making use of income data will not be representative
even if the survey respondents themselves were carefully sampled. Lastly,
data not missing at random are most likely to lead to biased results, and,
in most cases, they cannot be ignored. This is where the reason(s) for miss-
ing values are directly and systematically related to the variable at hand.
Going back to the previous example, if those who are less willing to reveal
their income indeed demonstrably earn less, we have a case of data not
missing at random. Another example is when sicker patients on whom we
expect the clearest results drop out of a longitudinal drug trial, rendering
the remaining subjects unrepresentative. All three categories are, in princi-
ple, possible in the case of psychotherapy transcripts. At one end, we may
have transcripts missing completely at random due to randomly occurring
technical faults. At the other end, we may also have transcripts missing not
at random if there are identifiable systematic reasons for some sessions to
be left out.
There are many suggested ways to deal with missing data, including
methods like imputation, interpolation, and deletion. The most appropri-
ate method depends on the type of data and missing data mechanism at
hand. Imputation and interpolation generally aim to preserve cases with
missing data by estimating the missing values, while deletion means omit-
ting such cases from the analysis. It is of course also possible to prevent
missing data in the first place by proper planning and collection, especially
in experimental settings where researchers have greater control (Kang,
2013). We will not go into the details and pros and cons of these sug-
gestions but will instead demonstrate MCS as yet another useful method
in our case of observational transcript data. The main strength of MCS
for the present purpose is that each simulated set of transcripts realizes a
hypothetical ‘discourse scenario’ that we would expect to potentially arise
by chance. Analogous to the examples above, the simulation outcome is
a probability distribution mapping out the likelihood of these discourse
scenarios.
It is important to note that MCS does not simulate any actual words or
language that could have been used by therapists and clients. From per-
sonal experience, this has been a typical misconception and leads to unnec-
essary doubts about the ethicality of the process. MCS is instead based
on the LIWC summary variable scores discussed in Chapter 1, which, to
reiterate, could be replaced by any other desired quantification scheme.
We are using the scores and observed probability distribution of available
transcripts for a certain therapist–client dyad – for example, 30 out of 40
sessions – to simulate scores for the remaining 10 missing sessions, thereby
providing an estimate of the scores of all 40 sessions. Formally, the Monte
Carlo estimator is given by
E(X) \approx \frac{1}{N}\sum_{i=1}^{N} X^{(i)} = \bar{X}_N

where N random draws are taken from the observed probability distribu-
tion of available data and averaged to derive the estimated mean/expected
value of X, the unknown population value. Recall from Chapter 1 that it
is important to evaluate the accuracy of this estimation in order to deter-
mine if MCS is reliable enough to solve our practical problem of missing
transcripts. We will do this by a systematic process of model validation
that, as mentioned earlier, serves the secondary purpose of an entry point
into potentially interesting theoretical hypotheses. Our demonstration case
study will be presented below in three major steps, followed by a discus-
sion of the results and implications.
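Before moving to the case study proper, the following minimal sketch shows the estimator above in code. The array of 'available' scores is hypothetical, and the normal distribution fitted to it is only one possible assumption about the data-generating process.

#a minimal sketch of the Monte Carlo estimator: draw N values from a
#distribution fitted to the available scores and average them
#(the scores and the normality assumption are purely illustrative)
np.random.seed(0)
available = np.array([72.4, 65.1, 80.3, 58.9, 69.7])
N = 10000
draws = np.random.normal(np.mean(available), np.std(available), N)
print('Monte Carlo estimate of E(X) =', np.mean(draws))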

Step 1: Data and LIWC scoring

The case study will make use of a dataset of 40 complete psychoanalysis
session transcripts from a single therapist–client dyad from the Counselling
and Psychotherapy Transcripts Series. The total word count is 199,170
words. Psychoanalysis is a leading therapeutic approach developed from
the ideas of prominent figures like Sigmund Freud, Erik Erikson, Erich
Fromm, and Carl Jung. The key aim in Jung’s words is to ‘make the uncon-
scious conscious’ by using therapist–client interaction to confront, clarify,
interpret, and work through clients’ resistance toward repressed thoughts,
feelings, and interactional patterns that originate from early life experi-
ences (Kramer et al., 2008). The language of such interaction, ranging
from pronouns to metaphors, continues to attract a good deal of attention
from linguists and discourse analysts (Borbely, 2008; Qiu & Tay, 2023;
Rizzuto, 1993; Tay, 2020). However, resistance toward talking about
these experiences also implies a fair chance for transcripts to be missing or
incomplete. We will demonstrate MCS by treating different parts of our
complete 40-session set as missing, simulating the LIWC summary variable
scores of these missing sessions, and evaluating the different outcomes.
As a quick reminder, LIWC is a text analysis program widely used in
language and social psychology research. It assumes that “the words we
use in daily life reflect what we are paying attention to, what we are think-
ing about, what we are trying to avoid, how we are feeling, and how we
are organizing and analyzing our worlds” (Tausczik & Pennebaker, 2010,
p. 30). Given an input text, LIWC computes the frequencies of more than
90 word categories and can further compute four ‘summary variables’
as combinations of these categories. These are called analytical thinking,
clout, authenticity, and emotional tone. Each summary variable is scored
from 0–100 in each transcript, reflecting the normalized frequencies of
relevant word combinations (see Table 1.3). A high analytical thinking
score suggests formal, logical, and hierarchical thinking versus informal,
personal, here-and-now, and narrative thinking. A high clout score sug-
gests speaking/writing with high expertise and confidence, versus a more
tentative and humble style. A high authenticity score suggests more honest,
personal, and disclosing discourse, versus more guarded and distanced dis-
course. A high emotional tone score suggests a more positive and upbeat
style, a low score suggests anxiety, sadness, or hostility, and a neutral score
around 50 suggests a lack of emotionality. We use LIWC to process our 40
transcripts and derive the four summary variable scores, giving us a
multivariate profile for each transcript. The output can be saved as a
spreadsheet or CSV file with session transcripts as rows and variable scores
as columns, and then imported into Python for the MCS process below.
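
As a minimal sketch of this import step (the file and column names are those used later in the chapter's code), the scores can be loaded and inspected as follows:

import pandas as pd

#assumed layout: one row per session, one column per LIWC summary variable
data = pd.read_csv('40_sessions.csv', index_col='Session')
print(data[['Analytic', 'Clout', 'Authentic', 'Tone']].describe())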

Step 2: Simulation runs with a train-test approach

We are now ready to perform MCS by a resampling procedure that will


involve five separate simulation runs. For each run, 30 out of 40 tran-
scripts (75%) will be selected and used as available data. We can also
describe them as ‘training data’. The remaining 10 transcripts (25%) that
are left out will thus be treated as if they were missing. Their LIWC scores
will be simulated based on the mean, standard deviation, and distribution
of the LIWC scores from the 30 transcripts. This approach is conceptually
similar to our birthday and casino examples. Back then, we assumed that
the outcomes follow a uniform distribution based on our knowledge of the
data-generating process. Here, we have no such prior knowledge of how
LIWC scores in transcripts are distributed, and we must therefore rely on
the observed properties of available data. Since the simulated scores will be
compared with the actual scores of the 10 transcripts as a validation proce-
dure to determine accuracy, we can describe the 10 transcripts as ‘valida-
tion data’ or ‘testing data’. As mentioned earlier, this ‘train-test’ approach,
where a small part of the original dataset is withheld and then used to
evaluate simulation/prediction outcomes, is widely used in machine learn-
ing. It is also often combined with resampling procedures like the present
case where multiple different sets of train-test data are drawn from the
same dataset, as will be revisited in Chapter 4. It is a crucial step because in
a real-life situation where transcripts are really missing, it would be much
harder to objectively verify the accuracy of simulated scores. Experiments
with training and testing data are needed to increase our general confi-
dence in the method.
How, then, do we decide which 30 transcripts to use as training data
and which 10 to leave out as validation data for each run? As mentioned,
a typical practice is to randomly select a part of the full dataset as training


data, evaluate the accuracy by comparing simulation/prediction outcomes
with the validation data, repeat the process multiple times, and obtain an
averaged measure of accuracy. This is a perfectly sound approach, but the
present suggestion is that a more strategic composition of training versus
validation data can serve the secondary purpose of providing interesting
theoretical insights for discourse analysts. In our five simulation runs A-E,
we will therefore use (1) four validation datasets corresponding to four
equal temporal phases of treatment (i.e., Sessions 1–10, 11–20, 21–30,
31–40), plus (2) the aforementioned typical practice of an additional
validation dataset of ten randomly selected sessions as a pseudo-control
condition.
The details are summarized below for clarity.

• Simulation run A
• Training: Sessions 11–40 / Validation: Sessions 1–10
• Simulation run B
• Training: Sessions 1–10, 21–40 / Validation: Sessions 11–20
• Simulation run C
• Training: Sessions 1–20, 31–40 / Validation: Sessions 21–30
• Simulation run D
• Training: Sessions 1–30 / Validation: Sessions 31–40
• Simulation run E
• Training: 30 random sessions / Validation: 10 random sessions

Our objective is therefore not only to evaluate the overall accuracy of MCS,
but also to see how this accuracy varies across the temporal phases of treat-
ment. In other words, we want to see which ‘missing’ phase would result in
the most or least accurate simulations. A phase that yields more accurate
simulations when treated as missing would imply that it is linguistically
less distinct from the remainder of the dyad's sessions (as measured by
LIWC), and vice versa, with attendant theoretical implications to be
explored. Simulation run E, on the other hand, serves to mimic data
'missing completely at random' as described above, allowing us to compare
its outcomes with the systematically missing data in the other runs.
We start by preparing the training and validation datasets. For simulation
runs A to D, this can be done manually by preparing two CSV files for each
run and naming them something like train_A.csv and validation_A.csv. The
former would contain the scores (in columns) of the 30 training sessions
(in rows) and the latter the scores of the 10 validation sessions. For
simulation run E, we can use Python to randomly select the training and
validation data from the full dataset (e.g., 40_sessions.csv) with the code
below. This is done by importing and using the train_test_split feature
from the scikit-learn library, which randomly divides the 40 sessions using
the specified test data proportion (0.25 in our case). Note that we set
random_state=0, which functions similarly to np.random.seed(0) to ensure
the same split each time. This can be disabled by setting random_state=None,
which is also the default setting if random_state is not specified. Finally,
we save the split training and testing data into two separate files named,
following our nomenclature, train_random.csv and validation_random.csv.

#generate random training set
from sklearn.model_selection import train_test_split
data=pd.read_csv('40_sessions.csv',index_col='Session')
train,test=train_test_split(data,test_size=0.25,random_state=0)
train.to_csv('train_random.csv')
test.to_csv('validation_random.csv')

Table 2.1 summarizes the mean and standard deviation of all four LIWC
summary variable scores across our five training datasets. Shapiro-Wilk
tests showed no significant departures from normality for any of the scores
(p>0.05). As such, their means and standard deviations will be used as
input to draw random values from a normal distribution. Recall that we
will be using these 30-session properties as a basis to simulate multiple
sets of 40 sessions, in effect compensating for the 10 'missing' sessions
each time. Note that although we have a multivariate profile, the
simulations will be univariate, as in the birthday and casino examples, to
simplify the illustration. This means that each variable will be simulated
separately and independently, without making use of statistical
relationships that may exist between the variables; in effect, we assume
they are independent. If we do not wish to make this assumption, an
alternative but more complex approach is multivariate normal sampling,
which requires not only the means and standard deviations of each variable
but also the covariance structure among all four variables.
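
For readers who wish to explore that alternative, a minimal sketch is given below. It is not used in the case study itself; the file and column names are those of the training data loaded in the next code block, and the covariance structure is simply estimated from that data.

#multivariate normal sampling: preserves the covariance structure among the variables
import numpy as np
import pandas as pd

data = pd.read_csv('train_A.csv', index_col='Session')
cols = ['Analytic', 'Authentic', 'Clout', 'Tone']
means = data[cols].mean().values   #vector of the four variable means
cov = data[cols].cov().values      #4x4 covariance matrix

np.random.seed(0)
#one simulated 40-session set drawn jointly across the four variables
simulated = pd.DataFrame(np.random.multivariate_normal(means, cov, size=40), columns=cols)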
The following code is used for each of the five simulation runs. The first
part is straightforward and only involves loading the dataset (e.g.,
train_A.csv), calculating the mean and standard deviation of each variable,
and specifying the number of sessions and simulation trials desired. We
will simulate 5,000 sets of 40 sessions for each run. These 5,000 sets
correspond to 5,000 potential 'discourse scenarios' that might occur
probabilistically, as mentioned above. Readers using the code on their own
datasets and/or variables can simply change the names accordingly.
Table 2.1 Training data properties

Simulation run A (Training: Sessions 11–40 / Validation: Sessions 1–10)
  Analytical thinking: Mean 23.96, SD 8.27, Shapiro-Wilk p 0.162
  Clout: Mean 44.08, SD 9.46, Shapiro-Wilk p 0.553
  Authenticity: Mean 60.05, SD 12.24, Shapiro-Wilk p 0.826
  Emotional tone: Mean 30.01, SD 12.97, Shapiro-Wilk p 0.093

Simulation run B (Training: Sessions 1–10, 21–40 / Validation: Sessions 11–20)
  Analytical thinking: Mean 25.38, SD 8.14, Shapiro-Wilk p 0.256
  Clout: Mean 45.26, SD 9.59, Shapiro-Wilk p 0.396
  Authenticity: Mean 61.46, SD 13.27, Shapiro-Wilk p 0.372
  Emotional tone: Mean 32.55, SD 14.95, Shapiro-Wilk p 0.963

Simulation run C (Training: Sessions 1–20, 31–40 / Validation: Sessions 21–30)
  Analytical thinking: Mean 26.21, SD 8.20, Shapiro-Wilk p 0.440
  Clout: Mean 44.50, SD 10.04, Shapiro-Wilk p 0.135
  Authenticity: Mean 59.87, SD 11.92, Shapiro-Wilk p 0.789
  Emotional tone: Mean 28.37, SD 13.86, Shapiro-Wilk p 0.854

Simulation run D (Training: Sessions 1–30 / Validation: Sessions 31–40)
  Analytical thinking: Mean 23.53, SD 7.12, Shapiro-Wilk p 0.641
  Clout: Mean 43.15, SD 10.08, Shapiro-Wilk p 0.272
  Authenticity: Mean 60.36, SD 14.46, Shapiro-Wilk p 0.545
  Emotional tone: Mean 30.08, SD 15.32, Shapiro-Wilk p 0.766

Simulation run E (Training: 30 random sessions / Validation: 10 random sessions)
  Analytical thinking: Mean 26.15, SD 8.27, Shapiro-Wilk p 0.560
  Clout: Mean 46.97, SD 8.72, Shapiro-Wilk p 0.159
  Authenticity: Mean 60.56, SD 12.21, Shapiro-Wilk p 0.973
  Emotional tone: Mean 31.03, SD 14.48, Shapiro-Wilk p 0.928

#load data
data=pd.read_csv('train_A.csv',index_col='Session')

#calculate parameters for each variable
ana_avg = np.mean(data.Analytic)
ana_sd = np.std(data.Analytic)
auth_avg = np.mean(data.Authentic)
auth_sd = np.std(data.Authentic)
clout_avg = np.mean(data.Clout)
clout_sd = np.std(data.Clout)
tone_avg = np.mean(data.Tone)
tone_sd = np.std(data.Tone)

#specify no. of sessions and simulations. feel free to change the values.
num_sessions = 40
num_simulations = 5000

For the actual simulation, two options are presented here for readers’
experimentation. Option 1 is more straightforward and follows what we
have been doing in our examples so far. That is, simulate the variable
scores for 40 sessions on the basis of the 30-session training data param-
eters, calculate and store the 40-session mean scores, and loop 5,000 times
to generate 5,000 discourse scenarios, each one represented by a set of
these mean scores. The code for option 1 is as follows.

#a for-loop to draw random variable scores for sessions, calculate and keep track of the mean scores
np.random.seed(0)
allstats=[]
for i in range(num_simulations):
    #loop through the number of simulations, drawing random values for each variable
    analytic = np.random.normal(ana_avg, ana_sd, num_sessions)
    authentic = np.random.normal(auth_avg, auth_sd, num_sessions)
    clout = np.random.normal(clout_avg, clout_sd, num_sessions)
    tone = np.random.normal(tone_avg, tone_sd, num_sessions)
    #keep track of simulated scores in a dataframe
    df = pd.DataFrame(index=range(num_sessions), data={'Analytic':analytic, 'Authentic':authentic, 'Clout':clout, 'Tone':tone})
    #calculate and store the mean scores
    allstats.append([df['Analytic'].mean(), df['Authentic'].mean(), df['Clout'].mean(), df['Tone'].mean()])

Option 2 is conceptually more advanced as it implements a variance reduc-


tion technique known as stratified sampling (Caflisch, 1998). In MCS,
variance reduction increases the precision of estimates by – as the name
suggests – reducing the variance among simulated values. Common tech-
niques for doing so include using antithetic variates, control variates, and
the presently used stratified sampling. Readers might have heard of this
term in the context of survey research. The idea is to first divide the ses-
sions to be simulated into equal groups, or strata. Then, when performing
random draws, which essentially means drawing values that span across
the probability space from 0 to 1, we allocate this space evenly among the
strata to ensure that the entire space will be represented in the random
draws. To illustrate this, consider the case without variance reduction (i.e.,
Option 1). When we draw our 40 values from a normal distribution, there
is no guarantee that they will be spread evenly across the range of possible
values. For example, we might be ‘unlucky’ and obtain a large number of
outlier values that are much bigger/smaller than the mean. The eventual
outcome would be simulated values with high variance, in turn leading
to imprecise estimates with large confidence intervals. By implementing
stratified sampling, say with 10 strata, we carve up the probability space
from 0 to 1 into ten equal parts, and ensure that the random draws in each
stratum (i.e., 4 sessions per stratum) only come from the corresponding
smaller probability space. This way, we basically eliminate the chance of
obtaining too many outlier values, and reduce the variance of the simula-
tions as a result. The more complicated code for option 2 is as follows.

#Specify number of strata
num_strata = 10

#a for-loop to draw random variable scores for sessions, calculate and keep track of the mean scores
np.random.seed(0)
allstats=[]
for i in range(num_simulations):
    #distribute num_sessions evenly along num_strata
    L = int(num_sessions/num_strata)
    #allocate the probability space 0-1 evenly among the strata
    lower_limits = np.arange(0, num_strata)/num_strata
    upper_limits = np.arange(1, num_strata+1)/num_strata
    #generate random numbers confined to the allocated probability space within each stratum.
    #each random number represents a cumulative distribution function value for a normal distribution
    points = np.random.uniform(lower_limits, upper_limits, size=[int(L), num_strata]).T
    #create a vector of z-scores, each corresponding to the CDF values calculated above
    normal_vector = sp.stats.norm.ppf(points)

    #For each of the four summary variables, generate a vector of normally distributed scores
    #(one score per session) using the normal vector above
    analytic_vector = ana_avg+(ana_sd*normal_vector)
    analytic_strata_mean = np.mean(analytic_vector, axis=1)
    analytic = np.mean(analytic_strata_mean)

    authentic_vector = auth_avg+(auth_sd*normal_vector)
    authentic_strata_mean = np.mean(authentic_vector, axis=1)
    authentic = np.mean(authentic_strata_mean)

    clout_vector = clout_avg+(clout_sd*normal_vector)
    clout_strata_mean = np.mean(clout_vector, axis=1)
    clout = np.mean(clout_strata_mean)

    tone_vector = tone_avg+(tone_sd*normal_vector)
    tone_strata_mean = np.mean(tone_vector, axis=1)
    tone = np.mean(tone_strata_mean)

    #keep track of simulated scores in a dataframe
    df = pd.DataFrame(index=range(num_sessions), data={'Analytic':analytic, 'Authentic':authentic, 'Clout':clout, 'Tone':tone})
    #calculate and store the mean scores
    allstats.append([df['Analytic'].mean(), df['Authentic'].mean(), df['Clout'].mean(), df['Tone'].mean()])

Whichever option is chosen, we can now summarize and visualize the
simulations in preparation for the final step of analyzing the aggregated
outcomes. The code below will display descriptive statistics of our 5,000
simulations for all four variables, including their means, standard
deviations, and percentiles, rounded to three decimal places. Readers who
try both options can verify for themselves that the standard deviations for
option 2, with variance reduction, are much lower than those for option 1.

#convert to dataframe and summarize final outcomes
results_df=pd.DataFrame.from_records(allstats,columns=['Analytic','Authentic','Clout','Tone'])
results_df.describe().round(3)

The code below will generate histograms of the simulation outcomes, with
useful annotations like the mean and standard deviation for each vari-
able, rounded to three decimal places. We call the figure title ‘Simulation
(A)’ to indicate that these are results of our simulation run A, but this
can be changed accordingly. The same goes for other visual customization
options like the coordinates of the text annotation (set to 0.5, 0.5 here),
the labels for each histogram, and so on.

#plot histograms of final outcomes
fig,axes=plt.subplots(2,2)
fig.suptitle('Simulation (A)')
sns.distplot(results_df.Analytic,kde=False,ax=axes[0,0],axlabel='Analytical thinking')
sns.distplot(results_df.Authentic,kde=False,ax=axes[0,1],axlabel='Authenticity')
sns.distplot(results_df.Clout,kde=False,ax=axes[1,0],axlabel='Clout')
sns.distplot(results_df.Tone,kde=False,ax=axes[1,1],axlabel='Emotional tone')
axes[0,0].text(0.5,0.5,f'M={np.round(np.mean(results_df.Analytic),3)}, SD={np.round(np.std(results_df.Analytic),3)}', ha="center", va="top", transform=axes[0,0].transAxes)
axes[0,1].text(0.5,0.5,f'M={np.round(np.mean(results_df.Authentic),3)}, SD={np.round(np.std(results_df.Authentic),3)}', ha="center", va="top", transform=axes[0,1].transAxes)
axes[1,0].text(0.5,0.5,f'M={np.round(np.mean(results_df.Clout),3)}, SD={np.round(np.std(results_df.Clout),3)}', ha="center", va="top", transform=axes[1,0].transAxes)
axes[1,1].text(0.5,0.5,f'M={np.round(np.mean(results_df.Tone),3)}, SD={np.round(np.std(results_df.Tone),3)}', ha="center", va="top", transform=axes[1,1].transAxes)

The above code is rerun for each of the remaining four simulation runs.
We then proceed to Step 3, where the aggregated outcomes are analyzed for
each run. This includes the validation procedure of comparing how different
each simulation is from the corresponding validation dataset.

Step 3: Analysis and validation of aggregated outcomes

Figure 2.3 Results of simulation run A

We begin with simulation run A, where the first 10 'missing' sessions
comprise the validation set and the remaining sessions 11–40 comprise the
training set. The top half of Figure 2.3 shows the histograms generated with
the code above. These are the distributions of 5,000 simulations of each
variable, with means and standard deviations indicated. In other words,
they are the aggregated outcomes of MCS showing the range of possible
40-session outcomes over 5,000 different probabilistic discourse scenarios.
By the central limit theorem and law of large numbers discussed above,
we know that (1) a large number of randomly drawn samples (N=5,000),
each with a reasonably large sample size (N=40), would yield a sampling
distribution that approaches normality, regardless of what the actual
population distribution is. 'Actual population distribution' here refers to


the abstract notion of the ‘true’ linguistic tendencies of the dyad at hand,
as measured by LIWC, over all possibly occurring scenarios of their inter-
action; (2) the mean of these 5,000 samples would be a good estimate of
the ‘true’ mean as described above. Both of these theoretical consequences
are visually corroborated by the characteristic bell-shaped histograms
for all four variables. More importantly, they enable us to make reliable
inferences about how far any 40-session sample might depart from the
mean value – in other words, the likelihood of different ‘discourse sce-


narios’ occurring. For example, emotional tone (bottom right) has M =
30.02 and SD = 0.4. By the normal distribution, there are therefore (only)
5 out of 100 scenarios (p<0.05) where the mean of the emotional tone
falls below (30.02 – 1.96*0.4) = 29.24, or goes above (30.02 + 1.96*0.4)
= 30.8. This is the gist of how MCS helps us estimate LIWC scores in the
event that transcripts are missing and the actual information unavailable.
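
This kind of interval can be reproduced directly from the reported mean and standard deviation, as in the brief sketch below; the values are those shown above for emotional tone.

from scipy import stats

#95% of simulated 40-session means fall within this range
m, sd = 30.02, 0.4
lower, upper = stats.norm.interval(0.95, loc=m, scale=sd)
print(round(lower, 2), round(upper, 2))  #approximately 29.24 and 30.8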
The next step is to validate our simulated scores by comparing them
with the actual scores in the validation dataset, to see if they are suffi-
ciently close to each other. We take the simplest approach by computing
t-statistics for the differences between the mean simulated scores and the
mean validation dataset scores. Welch’s t-tests are used since equal vari-
ances cannot be assumed between the simulated and actual scores (Delacre
et al., 2017). The following code will load the validation data
(validation_A.csv), perform independent-samples Welch's t-tests between the
simulated and validation data for each variable, and create bar plots to
visualize the results, as shown in the bottom half of Figure 2.3.

#perform Welch's t-test for validation
valid=pd.read_csv('validation_A.csv',index_col='Session')
sp.stats.ttest_ind(results_df.Analytic, valid.Analytic, equal_var=False) #Analytic
sp.stats.ttest_ind(results_df.Authentic, valid.Authentic, equal_var=False) #Authentic
sp.stats.ttest_ind(results_df.Clout, valid.Clout, equal_var=False) #Clout
sp.stats.ttest_ind(results_df.Tone, valid.Tone, equal_var=False) #Tone

#join the two dataframes (results_df and valid) to prepare barplots
joint = pd.concat([results_df, valid], axis=0, ignore_index=False)
joint['Dataset'] = (len(results_df)*('Simulated set',) + len(valid)*('Validation set',))

#plot simulated and validation set variable means
joint.groupby('Dataset').mean().plot(kind='bar',title='Validation (A)').legend(loc='best')
plt.xticks(fontsize=10,rotation=0)
plt.yticks(fontsize=10,rotation=0)
plt.legend(fontsize=10)

Using α=0.05 as a threshold, none of the four variables showed a significant


difference between the simulated and validation means: analytic thinking
(t(9)= −1.476, p=0.17), clout (t(9)= −0.186, p=0.86), authenticity (t(9)=
−0.310, p=0.76), and emotional tone (t(9)= −0.16, p=0.87). This implies
that the simulated 40-session scores would be an acceptable replacement
for the actual 40-session scores if the first 10 sessions, or first quarter
of the dyad's interaction, are missing. We now analyze and perform the same
simple validation procedure for the remaining simulation runs.

Figure 2.4 Results of simulation run B
For simulation run B shown in Figure 2.4, the second quarter of interac-
tion (sessions 11–20) is ‘missing’ and comprises the validation set. Sessions
1–10 and 21–40 comprise the training set. This time, only three out of
four variables: analytic thinking (t(9)= 1.010, p=0.34), clout (t(9)= 1.252,
p=0.24), and authenticity (t(9)= 1.072, p=0.31) were not significantly dif-
ferent between the simulated and validation datasets. The simulated emo-
tional tone score was significantly higher than the observed score in the
validation dataset (t(9)= 2.93, p=0.02), suggesting that MCS provides a
less than accurate estimation of the use of emotion words in the event that
transcripts of the second quarter of interaction are missing. Estimations
of the other three variables are nevertheless satisfactory. We will return
to what this contrast in simulation results between and within different
‘missing’ phases implies later on when comparing all five runs side by side.
Figure 2.5 shows the results of simulation run C, this time with the third
quarter (sessions 21–30) as the validation dataset and sessions 1–20 and
31–40 as the training dataset. We obtain the same general result as simu-
lation run B that three out of four variables: clout (t(9)= 0.339, p=0.74),
authenticity (t(9)= −0.436, p=0.67), and emotional tone (t(9)= −1.59,
p=0.15) were not significantly different between the simulated and valida-
tion datasets. This time, however, it was analytic thinking that turned out
inaccurate. The simulated analytic thinking score was significantly higher
than the observed score in the validation dataset (t(9)= 3.213, p=0.01).
Figure 2.6 shows the results of simulation run D with the last quarter
(sessions 31–40) as the validation dataset, and sessions 1–30 as the train-
ing dataset. Interestingly, the results turned out to be similar to those of
its 'mirror image', simulation run A, where the first quarter was missing.
None of the
four variables showed a significant difference between the simulated and
validation means: analytic thinking (t(9)= −1.627, p=0.14), clout (t(9)=
−1.638, p=0.14), authenticity (t(9)= −0.121, p=0.91), and emotional tone
(t(9)= −0.189, p=0.85). This again implies that the simulated 40-session
scores would be an acceptable replacement for the actual 40-session scores
if the last 10 sessions, or last quarter of the dyad’s interaction, are missing.
Our final simulation run E (Figure 2.7) served as a pseudo-control con-
dition where 10 randomly chosen sessions comprised the validation data-
set, and the remaining 30 comprised the training dataset. As mentioned
earlier, the outcomes would allow us to compare the relative accuracy
of MCS for data missing completely at random versus systematically by
time. Just two out of four variables turned out to be accurate. While there
were no significant differences between the simulated and validation scores
for authenticity (t(9)= 0.103, p=0.92), and emotional tone (t(9)= 0.689,
p=0.51), the simulated analytic thinking (t(9)= 3.152, p=0.01) and clout
scores (t(9)= 4.09, p<0.01) were significantly higher than in the validation
dataset.
We will now juxtapose and consider the results of all five simulation
runs. This gives us a big picture of the usefulness of MCS in our usage
context and the emergent theoretical implications from the model valida-
tion process. Table 2.2 summarizes the previously presented results and
highlights the (in)accurately simulated variables in each run.

Figure 2.5 Results of simulation run C

Overall, the MCS performance can be considered satisfactory with 16


out of 20 (80%) accurate variable simulations over five runs. The differ-
ence between simulated and validation data was also relatively small in
three out of the four inaccurate cases as seen from the respective p-values
(p=0.01 or 0.02). While the demonstration case study is limited to only one
dyad and therapy type, it nevertheless provides encouraging evidence for
the usefulness of MCS as a technique to simulate and compensate for missing
transcripts.

Figure 2.6 Results of simulation run D

Another important practical advantage of MCS lies in its flexibility and
relative ease of use. Key parameters like the sample size and number of
simulations can be readily adjusted, and the main underlying principles,
like the law of large numbers and the central limit theorem, are quite
accessible. Future research can build on the present results by reporting
simulations with larger datasets, experimenting with quantification schemes
other than LIWC, and comparing outcomes across natural parameters like
different psychotherapy approaches, therapist versus client language, and
so on. Extensions to and evaluations of MCS in similarly sessional
discourse contexts, including classroom and social media discourse, would
also be welcome contributions.

Figure 2.7 Results of simulation run E

Table 2.2 Summary of the five simulation runs

Simulation run A (Training: Sessions 11–40 / Validation: Sessions 1–10)
  Accurate: Analytic (p=0.17), Clout (p=0.86), Authentic (p=0.76), Tone (p=0.87)
  Inaccurate: none

Simulation run B (Training: Sessions 1–10, 21–40 / Validation: Sessions 11–20)
  Accurate: Analytic (p=0.34), Clout (p=0.24), Authentic (p=0.31)
  Inaccurate: Tone (p=0.02)

Simulation run C (Training: Sessions 1–20, 31–40 / Validation: Sessions 21–30)
  Accurate: Clout (p=0.74), Authentic (p=0.67), Tone (p=0.15)
  Inaccurate: Analytic (p=0.01)

Simulation run D (Training: Sessions 1–30 / Validation: Sessions 31–40)
  Accurate: Analytic (p=0.14), Clout (p=0.14), Authentic (p=0.91), Tone (p=0.85)
  Inaccurate: none

Simulation run E (Training: 30 random sessions / Validation: 10 random sessions)
  Accurate: Authentic (p=0.92), Tone (p=0.51)
  Inaccurate: Analytic (p=0.01), Clout (p<0.01)

We now provide a more detailed analysis of how accuracy varies


according to where the missing transcripts are along the treatment span.
Recall that while the primary purpose of this variation is to obtain a more
accurate validation measure by evaluating different parts of the dataset,
the analysis below is an attempt to harness its secondary potential by con-
sidering its outcomes from a theoretical perspective grounded upon the
subject matter, or domain knowledge, at hand. It is far beyond the present
scope for us to actually pursue the potential hypotheses that emerge from
this analysis, but our main intention is to illustrate how MCS helped us to
arrive at them in the first place.
We mentioned earlier that the more accurately a 'missing' phase of treatment
can be simulated, the less linguistically volatile or 'surprising' that
phase is likely to be with respect to the rest of the treatment. This is
precisely why
the remaining 30 sessions would provide an accurate basis for simulating
the missing phase. Our results show that simulation runs A and D, cor-
responding to the first and last quarter of the 40 sessions, were the most
accurate. This reflects an intuitive if not (yet) empirically demonstrated
point about the nature of psychotherapy talk – that the beginning and end
of the treatment span are less likely to witness departures from central ten-
dencies of analytical thinking, clout, authenticity, and emotional tone. In
other words, psychoanalysts and clients may follow a fairly standardized
or even formulaic interactional routine at both ends, before and after the
idiosyncrasies of specific clients come into focus, which gets reflected in the
language they use as measured by LIWC summary variables. It would be
interesting to investigate this point further with more representative and


diverse datasets as suggested above. For the practical purposes of MCS, the
results also imply that missing transcripts at both ends can be more reliably
compensated for by simulation.
A closer look at the middle portion – simulation runs B and C – reveals
further nuances. In B where sessions approaching the midpoint (Session
11–20) are missing, emotional tone was inaccurately simulated while in C
where sessions approaching the last quarter (Session 21–30) are missing,
analytical thinking was inaccurately simulated. This implies the interest-
ing theoretical hypothesis that the linguistic display of emotionality by
psychoanalysts and clients is most volatile, compared to the rest of the
sessions, as the therapy progresses toward the halfway point. Such a dis-
play is then followed immediately by a phase of volatile display of analyti-
cal thinking in the next quarter. Putting all the pieces together, we may
theorize, based on our initial observation of this dyad, that a relatively
stable beginning and end frame two volatile periods of linguistic display
– first in terms of emotions-focused exploration, then in terms of logic-
focused exploration. We will not dive into the details here, but the extent
to which this linguistic patterning (mis)aligns with the strategic objectives
of psychoanalysis across sessions, and exactly how such (mis)alignment is
constructed in context, becomes a natural question for discourse analysts
and practitioners alike. What we want to highlight instead is how this
hypothesis came about as a ‘byproduct’ of applying MCS for a markedly
different purpose.
Two final observations could be made in this connection. First, our
pseudo-control condition consisting of random ‘missing’ sessions (simula-
tion run E) turned out to perform the worst. This is surprising as it appears
to go against the aforementioned assumption that data missing completely
at random are least detrimental. Theoretically, it also implies that there
is less variability across contiguous session blocks than random session
blocks, which further reflects the ‘segmental’ nature of psychotherapy talk.
We can again only state this as a potential emergent hypothesis that needs
to be investigated more rigorously by using multiple random validation
sets, as well as more representative and diverse datasets involving more
dyads and therapy approaches. Chapters 3 and 4 will revisit this segmental
nature of therapy from different data analytic and theoretical perspectives.
The second observation is that throughout the five runs, where significant
differences exist, the simulated scores were always higher than the actual
observed scores. There is no direct and intuitive explanation for this, but it
points toward a systematic bias (Nguyen et al., 2016) to be confirmed on
other datasets. The possibility that psychotherapy talk consistently mani-
fests a systematic bias in simulated scores would again be an emergent
hypothesis of considerable interest.

Python code used in this chapter


The Python code used throughout the chapter is reproduced in sequence
below for readers’ convenience and understanding of how the analysis
gradually progresses.

#import Python libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import seaborn as sns

Birthday paradox (analytic solution)

#a function to calculate the probability p given n number of people
def probsamebirthday(n):
    q=1
    for i in range(1, n):
        probability = i / 365
        q *= (1 - probability)
    p=1-q
    return(p)

#runs the function for n=23. feel free to change the value of n.
probsamebirthday(23)

Birthday paradox (numerical solution)

#fix the random seed to ensure reproducibility
np.random.seed(0)

#a function to check for duplicates in a list of birthdays
def contains_duplicates(X):
    return len(np.unique(X)) != len(X)

#specify no. of people and trials. feel free to change the values.
n_people=23
n_trials=5000

#a for-loop to generate birthdays and calculate the probability for the specified no. of people and trials
list=[]
for i in range(0,n_trials):
    #loop through the number of trials. For each, draw random birthdays for n_people
    dates = np.random.choice(range(1,366), size=(n_people))
    #use the function above to check for duplicates in each trial, appending the result to list
    list.append(contains_duplicates(dates))

#calculate the final probability as the fraction of all trials where there are duplicates
probability=len([p for p in list if p == True]) / len(list)
print("With", n_trials, "trials, probability of at least one shared bday among", n_people, "people =", probability)

#a more complex for-loop to track and plot the probability for an increasing number of trials from 0 to n_trials
n_people=23
n_trials=5000
trial_count = 0
shared_birthdays = 0
probabilities = []

for experiment_index in range(0,n_trials):
    dates = np.random.choice(range(1,366), size=(n_people))
    if contains_duplicates(dates):
        shared_birthdays+=1
    trial_count += 1
    probability = shared_birthdays/trial_count
    probabilities.append(probability)

#plot the outcome (probability) for an increasing number of trials
plt.plot(probabilities)
plt.title("Outcome of " +str(n_trials)+ " trials converging to p=" + str(probabilities[-1]))
plt.show()

Casino roulette simulation

#fix the random seed to ensure reproducibility
np.random.seed(0)

#a function to calculate winnings per spin
def winnings(x):
    if x<=55:
        return -1
    else:
        return 1

#specify no. of spins per set and no. of sets. feel free to change the values.
spins=100
sets=1000

#a for-loop to spin the wheel, calculate and keep track of winnings
list=[]
for i in range(0,sets):
    #loop through the number of sets and spin the specified number of times
    x=np.random.uniform(1,100,spins)
    #keep track of spin outcomes in a dataframe
    df=pd.DataFrame(data={'x':x})
    #use the function above to determine amount won per spin
    df['Winning']=df['x'].apply(winnings)
    #sum the amount won for all spins, to obtain and record the total winnings per set
    list.append([df['Winning'].sum()])

#plot the distribution of winnings over all the sets
ax=sns.distplot(list,kde=False)
ax.set(xlabel='Winnings', ylabel='No. of sets ('+str(spins)+' spins each)')
ax.axvline(x=np.mean(list), color='blue', label='Mean', linestyle='dotted', linewidth=2)
ax.legend()
plt.title('Mean winnings=' +str(np.mean(list)) + '\n St dev winnings=' +str(np.round(np.std(list),decimals=3)))

MCS

#generate random training set
from sklearn.model_selection import train_test_split
data=pd.read_csv('40_sessions.csv',index_col='Session')
train,test=train_test_split(data,test_size=0.25,random_state=0)
train.to_csv('train_random.csv')
test.to_csv('validation_random.csv')

#load data
data=pd.read_csv('train_A.csv',index_col='Session')

#calculate parameters for each variable
ana_avg = np.mean(data.Analytic)
ana_sd = np.std(data.Analytic)
auth_avg = np.mean(data.Authentic)
auth_sd = np.std(data.Authentic)
clout_avg = np.mean(data.Clout)
clout_sd = np.std(data.Clout)
tone_avg = np.mean(data.Tone)
tone_sd = np.std(data.Tone)

#specify no. of sessions and simulations. feel free to change the values.
num_sessions = 40
num_simulations = 5000

Option 1: Simulation without stratified sampling for variance reduction (more straightforward)

#a for-loop to draw random variable scores for sessions, calculate and keep track of the mean scores
np.random.seed(0)
allstats=[]
for i in range(num_simulations):
    #loop through the number of simulations, drawing random values for each variable
    analytic = np.random.normal(ana_avg, ana_sd, num_sessions)
    authentic = np.random.normal(auth_avg, auth_sd, num_sessions)
    clout = np.random.normal(clout_avg, clout_sd, num_sessions)
    tone = np.random.normal(tone_avg, tone_sd, num_sessions)
    #keep track of simulated scores in a dataframe
    df = pd.DataFrame(index=range(num_sessions), data={'Analytic':analytic, 'Authentic':authentic, 'Clout':clout, 'Tone':tone})
    #calculate and store the mean scores
    allstats.append([df['Analytic'].mean(), df['Authentic'].mean(), df['Clout'].mean(), df['Tone'].mean()])

Option 2: Simulation with stratified sampling for variance reduction (more complicated)

#Specify number of strata
num_strata = 10

#a for-loop to draw random variable scores for sessions, calculate and keep track of the mean scores
np.random.seed(0)
allstats=[]
for i in range(num_simulations):
    #distribute num_sessions evenly along num_strata
    L = int(num_sessions/num_strata)
    #allocate the probability space 0-1 evenly among the strata
    lower_limits = np.arange(0, num_strata)/num_strata
    upper_limits = np.arange(1, num_strata+1)/num_strata
    #generate random numbers confined to the allocated probability space within each stratum.
    #each random number represents a cumulative distribution function value for a normal distribution
    points = np.random.uniform(lower_limits, upper_limits, size=[int(L), num_strata]).T
    #create a vector of z-scores, each corresponding to the CDF values calculated above
    normal_vector = sp.stats.norm.ppf(points)

    #For each of the four summary variables, generate a vector of normally distributed scores
    #(one score per session) using the normal vector above
    analytic_vector = ana_avg+(ana_sd*normal_vector)
    analytic_strata_mean = np.mean(analytic_vector, axis=1)
    analytic = np.mean(analytic_strata_mean)

    authentic_vector = auth_avg+(auth_sd*normal_vector)
    authentic_strata_mean = np.mean(authentic_vector, axis=1)
    authentic = np.mean(authentic_strata_mean)

    clout_vector = clout_avg+(clout_sd*normal_vector)
    clout_strata_mean = np.mean(clout_vector, axis=1)
    clout = np.mean(clout_strata_mean)

    tone_vector = tone_avg+(tone_sd*normal_vector)
    tone_strata_mean = np.mean(tone_vector, axis=1)
    tone = np.mean(tone_strata_mean)

    #keep track of simulated scores in a dataframe
    df = pd.DataFrame(index=range(num_sessions), data={'Analytic':analytic, 'Authentic':authentic, 'Clout':clout, 'Tone':tone})
    #calculate and store the mean scores
    allstats.append([df['Analytic'].mean(), df['Authentic'].mean(), df['Clout'].mean(), df['Tone'].mean()])

#convert to dataframe and summarize final outcomes
results_df=pd.DataFrame.from_records(allstats,columns=['Analytic','Authentic','Clout','Tone'])
results_df.describe().round(3)

#plot histograms of final outcomes
fig,axes=plt.subplots(2,2)
fig.suptitle('Simulation (A)')
sns.distplot(results_df.Analytic,kde=False,ax=axes[0,0],axlabel='Analytical thinking')
sns.distplot(results_df.Authentic,kde=False,ax=axes[0,1],axlabel='Authenticity')
sns.distplot(results_df.Clout,kde=False,ax=axes[1,0],axlabel='Clout')
sns.distplot(results_df.Tone,kde=False,ax=axes[1,1],axlabel='Emotional tone')
axes[0,0].text(0.5,0.5,f'M={np.round(np.mean(results_df.Analytic),3)}, SD={np.round(np.std(results_df.Analytic),3)}', ha="center", va="top", transform=axes[0,0].transAxes)
axes[0,1].text(0.5,0.5,f'M={np.round(np.mean(results_df.Authentic),3)}, SD={np.round(np.std(results_df.Authentic),3)}', ha="center", va="top", transform=axes[0,1].transAxes)
axes[1,0].text(0.5,0.5,f'M={np.round(np.mean(results_df.Clout),3)}, SD={np.round(np.std(results_df.Clout),3)}', ha="center", va="top", transform=axes[1,0].transAxes)
axes[1,1].text(0.5,0.5,f'M={np.round(np.mean(results_df.Tone),3)}, SD={np.round(np.std(results_df.Tone),3)}', ha="center", va="top", transform=axes[1,1].transAxes)

#perform Welch's t-test for validation
valid=pd.read_csv('validation_A.csv',index_col='Session')
sp.stats.ttest_ind(results_df.Analytic, valid.Analytic, equal_var=False) #Analytic
sp.stats.ttest_ind(results_df.Authentic, valid.Authentic, equal_var=False) #Authentic
sp.stats.ttest_ind(results_df.Clout, valid.Clout, equal_var=False) #Clout
sp.stats.ttest_ind(results_df.Tone, valid.Tone, equal_var=False) #Tone

#join the two dataframes (results_df and valid) to prepare barplots
joint = pd.concat([results_df, valid], axis=0, ignore_index=False)
joint['Dataset'] = (len(results_df)*('Simulated set',) + len(valid)*('Validation set',))

#plot simulated and validation set variable means
joint.groupby('Dataset').mean().plot(kind='bar',title='Validation (A)').legend(loc='best')
plt.xticks(fontsize=10,rotation=0)
plt.yticks(fontsize=10,rotation=0)
plt.legend(fontsize=10)

References

Borbely, A. F. (2008). Metaphor and psychoanalysis. In R. W. Gibbs (Ed.), The Cambridge handbook of metaphor and thought (pp. 412–424). Cambridge University Press.
Caflisch, R. E. (1998). Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 7, 1–49.
Carsey, T. M., & Harden, J. J. (2013). Monte Carlo simulation and resampling methods for social science. Sage.
Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology, 30(1), Article 1. https://doi.org/10.5334/irsp.82
Eckhardt, R. (1987). Stan Ulam, John von Neumann, and the Monte Carlo method. Los Alamos Science, 15, 131–144.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611970319
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60(1), 549–576. https://doi.org/10.1146/annurev.psych.58.110405.085530
Kang, H. (2013). The prevention and handling of the missing data. Korean Journal of Anesthesiology, 64(5), 402–406.
Kramer, G. P., Bernstein, D. A., & Phares, V. (2008). Introduction to clinical psychology (7th ed.). Pearson.
Kroese, D. P., Brereton, T., Taimre, T., & Botev, Z. I. (2014). Why the Monte Carlo method is so important today. Wiley Interdisciplinary Reviews: Computational Statistics, 6(6), 386–392.
Mayer, D. G., & Butler, D. G. (1993). Statistical validation. Ecological Modelling, 68(1), 21–32. https://doi.org/10.1016/0304-3800(93)90105-2
Metropolis, N. (1987). The beginning of the Monte Carlo method. Los Alamos Science, 15, 125–131.
Nguyen, H., Mehrotra, R., & Sharma, A. (2016). Correcting for systematic biases in GCM simulations in the frequency domain. Journal of Hydrology, 538, 117–126. https://doi.org/10.1016/j.jhydrol.2016.04.018
Owen, A., & Glynn, P. W. (Eds.). (2018). Monte Carlo and quasi-Monte Carlo methods. Springer.
Qiu, H., & Tay, D. (2023). A mixed-method comparison of therapist and client language across four therapeutic approaches. Journal of Constructivist Psychology, 36(3), 337–360. https://doi.org/10.1080/10720537.2021.2021570
Rizzuto, A. M. (1993). First person personal pronouns and their psychic referents. International Journal of Psychoanalysis, 74(3), 535–546.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54.
Tay, D. (2020). A computerized text and cluster analysis approach to psychotherapy talk. Language and Psychoanalysis, 9(1), 1–22.
Tay, D. (2022). Navigating the realities of metaphor and psychotherapy research. Cambridge University Press.
3 Cluster analysis

Introduction to cluster analysis: Creating groups for objects


A key objective in many data analytic scenarios is to assign objects into
natural groups for further interpretation and analysis. Such objects can
range from human beings to countries to texts, or anything else with
measurable properties. The task is straightforward if both the groups and
grouping criteria are already well defined, so we just need to assign objects
based on how their properties fit the criteria. Less straightforward is when
the groups are already well defined, but how the properties fit the criteria
remains implicit. We might then want to find out why or how the objects
end up in their groups – in other words, whether there is some stable rela-
tionship that can predict group membership based on object properties. A
discourse analytic example is the metonymic versus non-metonymic use
of national capital names in news articles, as in ‘Beijing made a general
investigation into …’ (metonymic) versus ‘In a supermarket in Beijing …’
(non-metonymic) (Zhang et al., 2011). The group in which each usage falls
is relatively apparent (metonymy versus non-metonymy), but the lexico-
grammatical, discourse, and contextual features underpinning each usage
that might predict whether it ends up as (non)-metonymic are not. The task
of investigating relationships between observable properties and existing
group labels is generally known as supervised machine learning. We ‘super-
vise’ the computer by giving it a list of object properties and corresponding
group labels, the machine models/’learns’ statistically reliable relationships
between them, and can then proceed to predict the group membership of
new objects. Another example mentioned in Chapter 1 is the not always
obvious relationship between the language of product reviews and their
sentiment category – positive, negative, or neutral. Common techniques
for supervised machine learning include logistic regression, which we will
revisit later in the chapter, and k-nearest neighbors classification, the sub-
ject matter of the next chapter.
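
As a minimal sketch of this supervised logic (the feature values below are invented toy numbers, and the group labels simply echo the metonymy example above), the scikit-learn library can be used as follows:

from sklearn.linear_model import LogisticRegression

#toy data: two hypothetical linguistic measures per usage, with known group labels
X = [[0.2, 1.5], [0.4, 1.1], [3.1, 0.3], [2.8, 0.5]]
y = ['non-metonymic', 'non-metonymic', 'metonymic', 'metonymic']

model = LogisticRegression().fit(X, y)   #'learn' the property-label relationship
print(model.predict([[2.9, 0.4]]))       #predict the group of a new usage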
Even less straightforward are cases where neither the groups nor group-
ing criteria are well defined. Instead, we simply have a list of objects with

their observed/measured properties and want to figure out sensible or nat-


ural ways to group them that would be meaningful in that particular con-
text. Staying with our product reviews, instead of predicting (well-defined)
sentiments, we may want to see if and how the 100,000 customer
reviewers can somehow be grouped based on their language use. Neither
the grouping criteria of language patterns nor the groups themselves are
well defined in this case. If we are successful in discerning these language-
usage groups, they might be usefully compared with well-defined group-
ings like age, gender, etc. for insights. Similarly, we might have a database
of 100,000 customers and their product browsing and purchasing statistics
and want to find out if there is a way to group them based on these statis-
tics. Specific marketing strategies could then be formulated to target each
distinct group. The discovery of patterns among objects without group
labels is generally known as unsupervised machine learning, since we pro-
vide the machine with no input or ‘right answers’ other than the objects
and their properties. Unsupervised machine learning is thus often used to
uncover previously unknown relationships, or similarities, between the
objects of interest. One of the most common unsupervised machine learn-
ing techniques is cluster analysis, the subject matter of this chapter.
Cluster analysis, or clustering, is the task of grouping objects based on
their properties such that objects ending up in the same group (or cluster)
are maximally similar to one another, and maximally dissimilar to objects
in other clusters. This definition implies that clustering is an optimization
procedure to provide the most effective solution (in this case, groupings)
with available information, rather than an objectively ‘correct’ solution. In
other words, it is up to the analyst to interpret what the groups mean, in
a specific context of application. The many diverse applications of cluster
analysis include the aforementioned marketing scenario, deriving taxonomies
in biology, and grouping and recommending similar products in
recommendation engines.
There are many clustering algorithms that differ on key parameters like
how (dis)similarity is defined and what constitutes a cluster (Rodriguez et
al., 2019). The two major types are hierarchical and non-hierarchical. As
the name implies, the outcome of hierarchical clustering (also called the
clustering solution) is a hierarchy of clusters where two or more lower-
level clusters, each with their own objects, also belong to a higher-level
cluster at a ‘cruder’ level of similarity. The eventual number of clusters in
the dataset thus needs to be interpreted by the analyst based on a ‘cut-off’
point in the hierarchy. There are, in turn, two subtypes of hierarchical
clustering: divisive and agglomerative. The former starts by putting all
objects in a single cluster and iteratively divides this into smaller clusters,
in a top-down manner down the hierarchy. The latter starts with each
object forming its own cluster, and iteratively merges them into larger
clusters in a bottom-up manner, up the hierarchy until one giant cluster


remains. Each approach has its pros and cons, but agglomerative hier-
archical clustering (AHC) appears to be more commonly used due to its
lower computational cost (Roux, 2018). A comprehensive machine learn-
ing project will in any case likely involve applying and comparing the out-
comes of different approaches. The second major type, non-hierarchical
clustering, does not produce clusters ordered in a hierarchy. Instead, the
number of clusters and objects that end up in each are based on optimizing
an overall evaluation criterion (more on this later), and there is no deter-
mined relationship between the clusters. The eventual number of clusters
also does not require interpretation. It is an explicitly stated optimal out-
come, and analysts can, in fact, specify the number of clusters desired
beforehand. Changing the number of clusters from its optimal value to
something else may make the clustering solution less optimal, but it is
useful in some contexts where a fixed number of groups is required. The
most common example of non-hierarchical clustering is k-means cluster-
ing, where k stands for the number of clusters. This is our chosen tech-
nique for this chapter.
As with Chapter 2, we will now use an example to introduce and com-
pare the logic and processes of AHC and k-means clustering – two key rep-
resentatives of hierarchical and non-hierarchical algorithms, respectively.
Since the start of the COVID-19 pandemic, all kinds of statistics have been
gathered and analyzed in various ways to compare countries/regions. We
will use a sample dataset of just 60 countries/regions and four properties:
GDP per capita (as an indicator of economic well-being), life expectancy
(as an indicator of health status), COVID cases per million, and COVID-
related deaths per million as of the end of 2020. We want to use cluster
analysis to put the countries/regions into groups based on their overall
similarity across the four properties/variables. These groups are meaning-
ful (only) in this specific context of seeing which places share similar eco-
nomic, health, and COVID profiles.

Agglomerative hierarchical clustering (AHC)

We first demonstrate AHC. After importing the relevant libraries and data-
set in Python, most notably the SciPy library for AHC, the first step is to
standardize our four properties by subtracting the mean from each value
and dividing by the standard deviation across the 60 countries/regions.
This scales the whole dataset to zero mean and unit variance, and will
improve clustering outcomes when the properties are measured with dif-
ferent scales like the present case. We can import the StandardScaler fea-
ture from the scikit-learn library to do this. The code below implements
all the above. Note that there are other techniques like normalization and
mean-centering aimed at improving outcomes when the properties have


different scales, but a detailed discussion is beyond the present scope.

#import Python libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet
from scipy.spatial.distance import pdist

#import dataset
data = pd.read_csv('covid.csv', index_col='Country')

#scale data to zero mean and unit variance
scaler=StandardScaler()
data=pd.DataFrame(scaler.fit_transform(data),columns=data.columns,index=data.index)

We are now ready to perform AHC. Recall that this means starting with as
many clusters as objects (i.e., our countries/regions) and iteratively
merging neighboring objects into larger clusters until one giant cluster
remains. The
algorithm decides what to merge based on the similarity or ‘distance’ between
objects, as well as the ‘linkage measure’ (similar to distance) between the
resulting clusters. One could argue that the usefulness of clustering critically
depends on the extent to which spatial distance is the best criterion for simi-
larity in the context of application. In any case, the different ways to define
and measure these distances will not be detailed here (see Yim & Ramdeen,
2015). We will opt for the most widely used Euclidean distance and Ward’s
linkage, respectively. In so doing, we conceptualize objects as points in a
Euclidean space where each property/variable defines a dimension, and the
values determine the object positions. This is why the standardization of
scores above is important to prevent variables with larger scales from dis-
torting the positions. The following code will generate a dendrogram (Figure
3.1), the standard visualization of an AHC clustering solution.

#generate dendrogram
plt.title("COVID-19 dendrogram",fontsize=25)
plt.xticks(fontsize=25)
plt.yticks(fontsize=25)
plt.ylabel('Distance',fontsize=25)
plt.xlabel('Country/region',fontsize=25)
dend = dendrogram(linkage(data, metric='euclidean', method='ward'), orientation='top', leaf_rotation=80, leaf_font_size=25, labels=data.index, color_threshold=10)
plt.show()

Figure 3.1 Outcome of AHC on COVID-19 data

The code contains typical cosmetic settings like fontsizes, orientation, and
rotation angle of the labels that can all be experimented with. Besides
these, the three most critical settings are the distance metric, linkage meas-
ure method (set to Euclidean and Ward, respectively), and color_thresh-
old. This tells Python the aforementioned arbitrary cut-off point at which
to determine the final number of clusters, whereupon each cluster will
be colored differently. The setting of 10 means any split in the hierarchy
below the Euclidean distance value of 10 will be counted as a cluster. This
will be apparent in Figure 3.1.
The dendrogram indicates three distinct clusters at our chosen thresh-
old. The visual interpretation should be quite intuitive – for example,
Zambia and Ghana (far left) are more similar to each other than to Indonesia
(far right of cluster one), but all three are more similar to one another
than to any country/region outside cluster one. Also, cluster three
terminates higher up
the hierarchy, indicating that the level of similarity among its members is
the lowest among the three clusters – vice versa for cluster two. It is up
to readers to evaluate whether the clustering solution ‘makes sense’ with
reference to their real-world knowledge. As a statistical evaluation, we can
calculate and show the cophenetic correlation coefficient (Sokal & Rohlf,
1962) with the following code. It is a measure of how well correlated the
distances between objects in the dendrogram are to their original unmod-
eled distances. The closer the coefficient is to one, the better the clustering
solution is. The current value of 0.631 is adequate.

#calculate cophenetic correlation coefficient
c, coph_dist = cophenet(shc.linkage(data, method='ward', metric='euclidean'), pdist(data))
c

k-means clustering

We will now demonstrate k-means clustering on the same dataset. Just like AHC, k-means conceptualizes objects as points in a Euclidean space
with as many dimensions as properties, and object positions fixed by these
properties. It does not, however, form a hierarchy of clusters through itera-
tive merging/splitting. Instead, it identifies ‘cluster centroids’, which are
also points in the same Euclidean space, and allocates each object to its
nearest centroid. These centroids and their allocated objects comprise the
eventual clusters. The evaluation criterion mentioned above for optimal
allocation is intuitive: to find the optimal number of clusters, and their cen-
troid positions, so that the overall distance between them and the objects is
minimized. We can understand it as minimizing the ‘error’ of the clustering
solution since a smaller distance implies that the objects within each cluster
show greater similarity. This overall distance is known as distortion or
inertia, depending on how it is calculated. Distortion is the average of the
Euclidean squared distances, while inertia is the sum.
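As a minimal numerical illustration of the difference (a toy sketch with made-up points and assignments, not part of the chapter's dataset), both quantities can be computed once every object has been assigned to a centroid:

import numpy as np

#toy example: five objects and two cluster centroids in a two-dimensional space
points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 8.5], [1.2, 2.2]])
centroids = np.array([[1.2, 2.0], [5.5, 8.2]])
assignment = np.array([0, 0, 1, 1, 0]) #index of the assigned centroid for each object

#squared Euclidean distance from each object to its assigned centroid
sq_dist = ((points - centroids[assignment])**2).sum(axis=1)
distortion = sq_dist.mean() #average of the squared distances
inertia = sq_dist.sum() #sum of the squared distances (reported by scikit-learn as inertia_)
print(distortion, inertia)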
How does the optimization process work? The algorithm starts by
randomly initiating a specified k number of cluster centroids and then
assigning each object to its nearest cluster. This random initial solution
is, of course, very unlikely to be optimal. The next step then computes
the mean values of all object properties within each cluster. These mean
values define a new mean position in the Euclidean space that more accu-
rately represents these objects, compared to the initial random position.
The new mean position then serves as an ‘updated’ cluster centroid, and
all objects are again reassigned to the nearest updated centroid. The pro-
cess repeats itself, each cycle moving gradually towards more refined
cluster centroids, until there is no further change in cluster centroid allo-
cation. Readers should now see why the algorithm is called k-means,
since we are working with k number of mean positions. The finalized
cluster centroids and their member objects are the final outcomes of
k-means clustering.
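To make the update cycle concrete, the following is a bare-bones NumPy sketch of the algorithm just described. It is offered for illustration only; the case study below uses scikit-learn's KMeans, which adds refinements such as smarter initialization and multiple restarts.

import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    #points: a NumPy array with one row per object and one column per property
    rng = np.random.default_rng(seed)
    #step 1: randomly pick k objects as the initial cluster centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        #step 2: assign each object to its nearest centroid
        dists = ((points[:, None, :] - centroids[None, :, :])**2).sum(axis=2)
        labels = dists.argmin(axis=1)
        #step 3: recompute each centroid as the mean position of its assigned objects
        new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        #stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids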
Readers might already spot an important point here. By definition,
having more cluster centroids would reduce the overall distance because
there are more ‘options’ for objects to latch onto. If the objective is to
minimize this distance, why not have as many objects as centroids (as with
the starting point of AHC) such that the distance is perfectly zero? While
mathematically sound, having one object per cluster obviously defeats the
analytical purpose of clustering. Various approaches have therefore been
developed to help the analyst select the optimal value of k, or the optimal
number of clusters (Kodinariya & Makwana, 2013). In general, param-
eters like k that are set prior to the machine-learning process rather than
learnt from the data are known as hyperparameters. A common approach
suggests that the optimal number of clusters is the point at which adding
another cluster no longer produces a noteworthy decrease in distortion/
inertia, and hence does not compensate for the corresponding loss in par-
simony. This is colloquially known as the elbow method, and will be dem-
onstrated below. Another approach is silhouette analysis, which evaluates
how well objects are matched to their assigned cluster. This is done by
comparing the average distance between each object and all other objects
in its cluster, with the average distance between each object and all other
objects in the nearest neighboring cluster. The optimal k is that which
leads to the clustering solution with the highest average silhouette score
across all objects. We will see in the upcoming case study that we can begin
the analysis by first determining this optimal number, and then specifying
it as k for the algorithm described in the paragraph above.
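For readers who wish to try silhouette analysis, scikit-learn provides silhouette_score, which can be looped over candidate values of k in much the same spirit as the elbow method demonstrated later in the chapter. A brief sketch, assuming a standardized dataframe named data as in the examples above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

#average silhouette score for k = 2 to 5 (silhouette analysis needs at least two clusters)
for k in range(2, 6):
    labels = KMeans(n_clusters=k).fit_predict(data)
    print(k, silhouette_score(data, labels))

The k with the highest average score would then be carried forward as the number of clusters.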
Figure 3.2 is a 2D-scatterplot showing the outcome of k-means cluster-
ing on our COVID dataset. In most practical cases there is a need to reduce
the original higher dimensional Euclidean space (four dimensions in our
case, corresponding to four variables) into two dimensions via principal
components analysis. This comes with some inevitable information loss but is necessary because it is impossible to visualize four dimensions. For brevity, the Python code for implementing and visualizing k-means clustering will be provided later in the chapter when discussing our case study.

Figure 3.2 Outcome of k-means clustering on COVID-19 data

The optimal number of clusters was determined to be three, and each
cluster and its member objects are represented with a different color.
Objects are also text annotated. X marks the spot as they say, so the three
big crosses mark the positions of the three final cluster centroids. We
can see that object allocation in this case is quite unproblematic because
most countries/locations are unambiguously closest to one of them. Our
k-means clustering solution turns out to be very close to the AHC solution
above despite the very different algorithms. Out of 60, only Ireland and
Taiwan end up in different groups each time. It turns out that comparing
the outcomes of different algorithms, instead of or in addition to the train-
test approach in Chapter 2, is another possible approach to validating data
analytic modeling results. We will discuss model validation in more detail
in our upcoming case study. The algorithm of choice will be k-means clus-
tering – as seen in its comparison with AHC in our COVID example, the
generation of relatively unambiguous group memberships is a strong prac-
tical advantage in our discourse analytic context of application.
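One way to make such a cross-algorithm comparison concrete, offered here as a sketch rather than part of the reported analysis, is to cut the AHC dendrogram into the same number of clusters and compute an agreement index between the two sets of labels. The code below uses SciPy's fcluster and scikit-learn's adjusted Rand index on the standardized COVID dataframe data from above; a value of 1 indicates identical partitions, while values near 0 indicate chance-level agreement.

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

#cut the AHC tree into three clusters and run k-means with k=3 on the same data
ahc_labels = fcluster(linkage(data, metric='euclidean', method='ward'),
                      t=3, criterion='maxclust')
km_labels = KMeans(n_clusters=3).fit_predict(data)

#adjusted Rand index quantifies how closely the two partitions agree
print(adjusted_rand_score(ahc_labels, km_labels))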

Case study: Measuring linguistic (a)synchrony between therapists and clients

The notion of interpersonal and linguistic (a)synchrony was briefly out-
lined in Chapter 1. To recall, interpersonal synchrony is the alignment of
neural, perceptual, affective, physiological, and behavioral responses dur-
ing social interaction (Semin & Cacioppo, 2008). It is linked to positive
evaluations of teacher-student (Bernieri, 1988) and spousal relationships
(Julien et al., 2000), and claimed to promote cooperative behavior, affilia-
tion, and compassion (Hove & Risen, 2009; Valdesolo & DeSteno, 2011).
Psychotherapy is a highly relevant context to investigate interpersonal syn-
chrony as therapists interact with clients to help them modify behaviors,
cognitions, and emotions (Norcross, 1990). Synchronous processes may
strengthen the therapeutic alliance and lead to more positive treatment
outcomes (Ardito & Rabellino, 2011). Koole and Tschacher (2016) out-
line three interlinked levels of synchronous processes that occur in psy-
chotherapy. These are perceptual-motor processes like movement, facial
expressions, and gestures, complex cognitive processes like memory and
language, and emotion regulation.
The key role of language in psychotherapy has led to considerable inter-
est in linguistic synchrony as a distinct component of interpersonal syn-
chrony. At its most general level, therapists and clients attain synchrony
by developing shared mental representations or a “common language”
through “mutual adaptation to another’s linguistic behaviors” (Koole & Tschacher, 2016, p. 7). We could define (and measure) linguistic (a)
synchrony as the extent to which their linguistic choices (mis)align across
the treatment span. Existing approaches to this task, however, tend to be
polarized along the usual qualitative-quantitative divide and focus on dis-
parate levels and units of analysis. The influential Interactive Alignment
Model (Pickering & Garrod, 2004) synthesizes previous work (Brennan
& Clark, 1996; Sacks et al., 1974; Zwaan & Radvansky, 1998) to pro-
vide a cognitive account of linguistic synchrony. The main idea is that
speakers in natural dialogue prime each other to develop aligned repre-
sentations across phonological, syntactic, semantic, and situational levels.
Each level has its own monitoring and repair mechanisms, and alignment
at one level reinforces other levels to enhance the overall perception of
synchrony. Since this process is assumed to be primitive and unconscious,
the model is less able to account for complexities like higher-order com-
munication strategies, deliberate attempts to (mis)align with each other,
and other context-specific features emergent in spontaneous interaction.
Psychotherapy is seldom discussed in this model but is a case where we
would precisely expect to see such complexities (Anderson & Goolishian,
1988). Elsewhere, communication and language researchers consider these
complexities to be of primary interest. Communication Accommodation
Theory (Giles, 2016), for example, claims that our interactions are con-
sciously motivated by their perceived social consequences. People linguisti-
cally (mis)align/(a)synchronize with one another to highlight or downplay
differences as desired. This has been demonstrated by linguistic analyses
in settings like intercultural language education (Dörnyei & Csizér, 2005),
law enforcement (Giles et al., 2006), and, in fact, psychotherapy (Ferrara,
1991, 1994). As noted in Chapter 1, Ferrara (1994, p. 5) observes that
therapists and clients use the core strategies of ‘repetition’ and ‘contiguity’
to construct meaning in accommodative ways, “taking up portions of the
other’s speech to interweave with their own”. Another leading approach
to psychotherapy talk is conversation analysis, which focuses on the turn-
by-turn architecture of dialogue (Peräkylä et al., 2011; Weiste & Peräkylä,
2013). Although the term ‘synchrony’ is not often used, it coheres with the
main idea that therapeutic processes are tied to sequences of therapist–cli-
ent interaction, and the quality of these sequential relations reflects the
extent of synchrony. The general idea of (a)synchrony as linguistic (mis)
alignment is also apparent in related research like the act of ‘wordsmith-
ing’ in counseling (Strong, 2006), the strategic communication of risks
(Sarangi et al., 2003), and the co-construction of figurative language by
therapists and clients (Kopp, 1995; Tay, 2016, 2021b).
The above work mostly employs nuanced qualitative analysis of ‘local-
ized’ linguistic units or ‘isolated snapshots’ (Brown et al., 1996) like a
single conversational turn or topic. An inevitable tradeoff is the inability to
depict (a)synchrony at higher and perhaps more natural levels. Like the rest
of this book, we draw attention to the prime example of the institutional-
ized sessional level. It was explained in Chapter 1 that sessions are likely to
be more intuitive, concretely experienced, and recallable than single turns
or topics. Despite this, there is little work on linguistic (a)synchrony at
sessional level, in large part because it is hard to analyze entire sessions
in a nuanced yet reliable way. Complementary quantitative methods are
required for this. Computational techniques to analyze natural language
offer potential solutions. On the unconscious-versus-strategic alignment
debate described above, computational evidence suggests that “alignment
is not a completely automatic process but rather one of many discourse
strategies that speakers use to achieve their conversational goals” (Doyle
& Frank, 2016). Furthermore, due to the relative concreteness of words
over other grammatical levels (Gries, 2005; Healey et al., 2014), there is a
general preference for quantification at word level at least for the English
language. Two general types of synchrony measures have been proposed:
distributional and conditional. Distributional measures include the Zelig
Quotient (Jones et al., 2014) and Linguistic Style Matching (Niederhoffer
& Pennebaker, 2002), which determine linguistic (dis)similarity between
independent units of analysis. Conditional measures like Local Linguistic
Alignment (Fusaroli et al., 2012) focus instead on the relationship between
adjacent units – somewhat in the vein of conversation analysis. Both types
are complementary because distributional measures show global similarity
but not necessarily the contextual qualities of alignment, and vice versa
for conditional measures. They represent two kinds of interpretative logic
that are seldom applied together – as our case study will show, quantitative
determination of linguistic (a)synchrony using cluster analysis on sessional
LIWC variable scores, which is a distributional measure, can be supported
by qualitative investigation of how such (a)synchrony is played out in con-
text. Also noteworthy is that recent work emphasizes the importance of
function or grammatical words. The main reason is that while content
words are often tied to arbitrarily changing topics, grammatical words are
more context-invariant and thus more revealing of speakers’ interactional
styles (Doyle & Frank, 2016). This explains our current choice of LIWC
variable scores, which relies heavily on grammatical categories to measure
the socio-psychological dimensions of language. Following the same struc-
ture as Chapter 2, our case study will now be presented in stepwise fash-
ion, including computing LIWC variable scores, k-means clustering with
model validation to group sessions as a basis for measuring synchrony,
validating the clustering solutions, and qualitative analysis of how syn-
chrony is co-constructed in context. Note that most parts of this case study
were also reported in Tay and Qiu (2022) but without instruction on how
to implement the steps in Python.

Step 1: Data and LIWC scoring

The first step is similar to Chapter 2 where LIWC summary variable scores
are computed for the dataset at hand. The present approach applies k-means
clustering to measure linguistic (a)synchrony on a sessional and dyadic
basis. In other words, each therapist–client dyad receives a synchrony
score over the number of sessions conducted, which can be compared
between dyads and therapy approaches. To illustrate such a comparison,
our dataset comprises three dyads A to C. A is a psychoanalysis dyad with
15 sessions (74,697 words), B is a cognitive-behavioral therapy dyad with
14 sessions (68,812 words), and C is a humanistic therapy dyad with 20
sessions (101,044 words). The dyads were selected to maximize compara-
bility as all three clients shared broadly similar demographics and present-
ing conditions. All were heterosexual white American females in their early
to late twenties diagnosed with anxiety disorder and depression, reporting
relationship issues with their parents/spouse/partner. Given the inevitably
unique nature of each dyad, the present dataset can only illustrate but not
represent the therapy types. This again underlines the case study–oriented
nature of the present approach – it could be applied to larger datasets to
make stronger claims about therapy types if desired, as well as more lim-
ited ones for purposes like therapists’ self-monitoring and reflection.
Instead of computing LIWC scores for each session transcript as in
Chapter 2, here we first split each transcript into therapist and client-only
language. These will be called ‘sub-transcripts’. For example, dyad A will
have 30 sub-transcripts labeled T1 to T15 and C1 to C15, each with its four
LIWC summary variable scores as defining properties. Table 3.1 shows
the scores for all three dyads. The sub-transcripts will be the objects to
undergo clustering, one dyad at a time. The basic logic is that for each ses-
sion x, if sub-transcripts Tx and Cx fall into the same cluster, then session
x is considered synchronized. Otherwise, session x is asynchronized. This
is because therapist and client language within the same session ought to
be more similar to each other than they are to other sessions, if we want
to claim that session as synchronized. Applying k-means clustering would
therefore yield the following concrete outcomes: (1) which sessions across
the treatment span are (a)synchronized, (2) the percentage of synchronized
versus asynchronized sessions as an overall measure of the dyad, and (3)
the distribution pattern of (a)synchrony across time. Each outcome can
be further (qualitatively) examined using the actual transcripts in context.
Readers are again reminded that our LIWC multivariate profile can be
replaced by any other desired quantification scheme without affecting the
logic and process of k-means clustering.
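How the sub-transcripts are produced will depend on the transcript format at hand. As a minimal sketch, assuming (hypothetically) a plain-text file session1.txt in which every line begins with either 'THERAPIST:' or 'CLIENT:', the splitting could be done as follows before the two resulting files are scored in LIWC:

#collect therapist-only and client-only language from one session transcript
therapist_lines, client_lines = [], []
with open('session1.txt') as f:
    for line in f:
        if line.startswith('THERAPIST:'):
            therapist_lines.append(line.replace('THERAPIST:', '').strip())
        elif line.startswith('CLIENT:'):
            client_lines.append(line.replace('CLIENT:', '').strip())

#save the two sub-transcripts as separate files (e.g., T1 and C1 for session 1)
with open('T1.txt', 'w') as f:
    f.write(' '.join(therapist_lines))
with open('C1.txt', 'w') as f:
    f.write(' '.join(client_lines))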
Table 3.1 Summary variable scores for dyads A–C

Session Dyad A (Psychoanalysis) Dyad B (CBT) Dyad C (Humanistic)

Ana Cl Auth Tone Ana Cl Auth Tone Ana Cl Auth Tone

C1 12.62 41.26 63.11 67.58 7.2 11.6 95.16 59.63 8.31 20.79 83.95 30.48
C2 17.73 21.95 69.21 60.4 4.59 14.94 92.75 60.95 9.76 17.3 84.89 39.14
C3 35.24 20.39 80.5 66.26 8.49 15.99 92.94 51.18 12.23 34.34 77.4 37.81
C4 30.8 25.32 68.91 40.86 6.76 16.41 88.04 50.45 5.99 12.05 89.67 33.92
C5 10.08 14.08 69.39 66.7 5.26 19.92 83.7 81.93 7.8 23.77 84.25 52.62
C6 19.74 35.59 69.84 50.09 4.23 38.41 73.35 34.36 6.86 15.47 84.35 28.09
C7 21.73 42.54 72.59 49.84 5.38 9.82 92.72 41.34 10.71 14.55 84.31 42.98
C8 15.09 51.7 28.18 42.92 1.89 8.74 97.23 48.31 13.8 16.09 89.81 56.77
C9 20.46 14.71 87.51 49.69 5.35 20.47 92.83 50.54 7.4 19 80.6 33.32
C10 10.43 32.1 67.85 50.33 3.22 20.56 85.43 44.17 8.4 20.27 76.16 27.55
C11 17.74 32.18 67.41 77.29 5.08 8.66 95.22 75.85 12.11 12.48 84.71 64.38
C12 11.08 18.31 86.55 67.23 6.09 16.98 86.3 43.6 9 13.76 89.37 35.98
C13 17.27 25.92 56.74 75.06 5.75 32.15 76.13 64.51 8.1 32.02 73.69 31.32
C14 21.27 15.99 87.01 56.85 6.27 18.84 85.88 78.6 10.38 10.52 96.12 36.5
C15 29.99 37.22 52.08 41.54 9.93 24.98 89.52 34.03
C16 16.14 22.22 85.27 40.3
C17 13.69 38.62 65.02 28.35
C18 8.72 21.52 84.27 34.38
C19 9.96 25.35 90.8 61.12
C20 10.09 11.12 92.02 41
T1 33.45 73.58 72.27 43.14 15.4 90.23 50.24 58.7 9.33 84.32 53.22 45.87
T2 7.39 69.04 69 72.32 11.88 92.89 51.82 62.78 10.82 71.38 50.59 41.91
T3 9.49 60.35 66.71 42.41 17.9 88.02 54.4 64.12 14.26 60.8 59.47 34.18
T4 20.15 65.65 50.19 22.31 16.87 93.37 46.92 61.21 3.63 46.31 68.51 46.11
T5 21.73 61.72 50.61 58.63 9.17 94.7 47.28 74.81 4.78 68.84 55.12 30.38
T6 18.49 55.65 59.94 30.12 15.01 91.05 44.97 74.45 5.17 69.12 70.92 35.67
T7 20.23 71.69 56.41 31.35 11.37 87.56 59.66 46.79 8.82 50.61 76.79 42.31
T8 11.39 63.4 24.98 28.62 12.74 90.6 57.11 72.19 20.29 67.18 66.65 50.7
T9 30.32 67.62 67.4 68.9 17.91 88.03 40.12 51.58 5.65 47.84 78.43 43.22
T10 21.96 34.07 73.84 49.06 12.06 90.37 39.1 58.91 9.7 73.77 66.41 37.37
T11 15.17 48.14 65 85.53 19.2 92.56 56.03 75.48 2.54 43.82 85.18 51.45
T12 13.79 68.94 86.4 91.04 16.37 92.4 48.83 74.9 7.23 45.65 84.16 30.59
T13 19.69 64.82 75.53 44.51 15.69 80.49 60.4 65.98 8.09 62.79 72.3 28.49
T14 38.33 80.23 73.1 91.52 24.48 87.23 43.96 72.94 6.27 37.4 86.97 15.06
T15 37.64 75.49 54.67 56.41 20.86 70.2 51.28 28.22
T16 10.81 56.49 67.51 61.87
T17 7.25 63.79 75.51 16.18
T18 8.91 70.43 71.29 50.49
T19 3.91 91.21 60.97 64.83
T20 12.18 46.21 75.23 40.69

Step 2: k-means clustering and model validation

We are now ready to perform k-means clustering using the LIWC variable
scores in Table 3.1. This is done one dyad at a time to derive a clustering
solution that will be the basis for measuring linguistic synchrony for that
dyad. Each dyad can thus be saved and imported as a separate excel file for
better data organization. Similar to the COVID example above, it is rec-
ommended to first standardize the raw LIWC scores by subtracting the mean from each value and then dividing by the standard deviation. However,
standardization is not crucial because the four variables are measured on
the same scale of 0–100, unlike the COVID dataset. The following code
imports a dyad’s excel file and performs standardization using the same
scikit-learn StandardScaler feature as above. It is recommended to set the
‘session’ column as the index to facilitate annotation of the data in subse-
quent visualizations.

#import dataset and scale scores
data = pd.read_csv('A.csv', index_col='Session')
scaler = StandardScaler()
scaler.fit(data)
data = pd.DataFrame(scaler.transform(data), columns=data.columns, index=data.index)

After the data are imported and standardized, the optimal number of clus-
ters can be determined. This was defined earlier as the number k beyond
which adding another cluster no longer produces a noteworthy decrease in
distortion/inertia. We determine k by repeatedly performing k-means clus-
tering for an incremental number of clusters from 1 to a specified number n
(5 is usually enough), calculating the distortion or inertia value each time.
We then plot the values against the number of clusters and note the point
beyond which the decrease in distortion/inertia tapers off. It will soon be
apparent why this is known as the elbow method for determining opti-
mal k. The following code implements the above. Note that the libraries
required for k-means clustering are also imported at this point. Both scikit-
learn and SciPy are good options, but we choose the former here.

#determine the optimal number of clusters with the 'elbow method'
from sklearn.cluster import KMeans

n = 5 #n can be changed to test more clusters
num_clusters = range(1, n+1)
inertias = []

for i in num_clusters:
    model = KMeans(n_clusters=i)
    model.fit(data)
    inertias.append(model.inertia_)

#generate 'elbow plot'
plt.plot(num_clusters, inertias, '-o')
plt.xlabel('number of clusters, k', fontsize=15)
plt.ylabel('inertia value', fontsize=15)
plt.title('Dyad A (Psychoanalysis)', fontsize=15)
plt.xticks(num_clusters, fontsize=15)
plt.show()

Readers are encouraged to compare this code with the simulation codes in
Chapter 2 and recognize the similar logic of the loops deployed. In both
chapters we are essentially specifying a number of loops through the code,
performing some computation in each iteration and recording the result in
a list, and then doing a summary analysis of the list at the end. The com-
putation in this case is the performing of k-means clustering (or ‘fitting of
a k-means model’) with the number of clusters defined by the current loop
iteration, and the result recorded in the ‘inertias’ list is the corresponding
inertia value. While the summary analysis in Chapter 2 concerned the dis-
tribution of simulation outcomes as depicted in histograms, the summary
analysis here is simply to plot the inertia values against the number of
clusters.
Figure 3.3 shows the resulting plots for each of our three dyads. These
are commonly known as ‘elbow plots’ because the optimal k resembles the
elbow joint of an arm. Beyond k, further decreases in inertia become minimal, implying that having k+1 clusters does not compensate for the
loss of parsimony.
The ‘elbows’ are usually obvious and thus visual inspection is sufficient
to determine the optimal number of clusters k. For example, it is clear
that k=2 for dyad B (CBT). In cases like dyads A and C, however, it is
less easy to identify an obvious elbow joint. The following code will help
to objectively determine k by calculating the decrease in inertia with each
additional cluster. It simply takes the inertias list from above, converts it
into a dataframe, and calculates the absolute difference between each suc-
cessive value with the diff() and abs() commands.

#confirm visual inspection with actual inertia value change
pd.DataFrame(inertias).diff().abs()

The optimal number k is the point after which the absolute difference
value (i.e., change of inertia) is at its lowest. For example, applying the
above code to dyad A yields 35.447, 20.597, 14.572, and 8.066 for k=1 to
4, respectively. The value of k is therefore 3. With this elbow method, we conclude that the optimal number of clusters for dyads A to C is, respectively, 3, 2, and 2.

Figure 3.3 Elbow plots for the three dyads
Now that we know the value of k for each dyad, we can proceed to
perform k-means clustering one dyad at a time. This was of course already
done when running the elbow method loop, but having determined k we
can go further to assign each sub-transcript to one of the k clusters and vis-
ualize the clustering solution like in Figure 3.2 above. The following code
will (1) fit a k-means model with k specified (k=3 for dyad A), (2) ‘predict’
the cluster membership of each sub-transcript using the model and assign
a corresponding cluster label, (3) add these labels to the original dataframe
for easy reference, and (4) obtain the positions of the k cluster centroids
(i.e., cluster centres) for later plotting of the red crosses like in Figure 3.2.

#generate cluster centroids and labels using the optimal number of clusters
model = KMeans(n_clusters=3) #the value of n_clusters should be set to the optimal k
labels = model.fit_predict(data)
data['cluster_labels'] = labels

#obtain cluster centroid positions for later plotting
cluster_centres = model.cluster_centers_

If we re-examine the dataframe, we will now see the new cluster_labels col-
umn at the right indicating the cluster membership of each sub-transcript/
object. Examining cluster_centres will show an array consisting of k rows
representing k clusters. Each row has n numbers where n is the number
of variables defining the objects (n=4 in our case). For example, the array
below is from dyad A.
[ 1.71687033, 1.33623213, 0.07475749, 0.50641184]
[-0.22099794, 0.61341394, -0.96469075, -1.00276756]
[-0.28697058, -0.63915612, 0.49312864, 0.41172122]
The four numbers in each row/cluster locate its centroid position, which as
mentioned above is a point in n-dimensional Euclidean space just like all the
objects. Because of the optimization process explained earlier, each number
is, in fact, the mean LIWC variable score of all the sub-transcripts assigned
to that cluster. Note that some scores are negative because we standardized
them just now. In other words, these scores express the number of standard
deviations above (+) or below (-) the mean of the whole dataset.
We now have everything we need to generate the 2D-scatterplot visual-
izing the clustering solution. The following code will (1) employ princi-
pal components analysis to reduce our n-dimensional data, including the
cluster centroid locations, to two dimensions, (2) plot and label all objects
and centroid locations. The PCA library from scikit-learn is imported to
perform principal components analysis in the first step. There are many
cosmetic details that can be freely modified to suit different visual prefer-
ences. For example, palette determines the coloring scheme for the cluster,
s sets the marker size of the objects, and rx tells Python to use red crosses
for the cluster centroids.

from sklearn.decomposition import PCA as sklearnPCA
import seaborn as sns #seaborn is used for the scatterplot below

#specify two principal components
pca = sklearnPCA(n_components=2)
#use data.iloc to remove cluster labels in the rightmost column before reducing the data
reduced = pd.DataFrame(pca.fit_transform(data.iloc[:,:-1]),
                       columns=['Dim_1','Dim_2'], index=data.index)
#project the cluster centroid locations into the same two dimensions
cent = pca.transform(cluster_centres).T
#reattach previous cluster labels to prepare for plotting
reduced['cluster'] = data['cluster_labels']

#generate scatterplot and color according to clusters
sns.scatterplot(x='Dim_1', y='Dim_2', hue='cluster', data=reduced, palette='tab10', s=30)

#plot cluster centroids
plt.plot(cent[0], cent[1], 'rx', markersize=15)

#annotate each object
for i in range(reduced.shape[0]):
    plt.text(x=reduced.Dim_1[i]+0.05, y=reduced.Dim_2[i]+0.05, s=reduced.index[i],
             fontdict=dict(color='black', size=10))
plt.legend(title='cluster')
plt.xlabel("Dimension 1", fontsize=15)
plt.ylabel("Dimension 2", fontsize=15)
plt.title("Dyad A (Psychoanalysis)", fontsize=15)
plt.show()

Figure 3.4 shows the three 2D-scatterplots visualizing the outcomes of our
k-means clustering. The first thing to bear in mind is that the clustering
solutions are independent of each other in each dyad, and so ‘cluster 0’ in
dyad A has no relationship with ‘cluster 0’ in dyads B and C. It is visually
apparent that among the three dyads, dyad B (CBT) has the most distinct
separation between the clusters while dyads A and C have more ambigu-
ous cluster allocations that suggest poorer clustering solutions. The next
step would be to interpret these outcomes according to the present pur-
pose, which is to determine for each dyad which and how many sessions
are synchronized based on whether sub-transcripts Tx and Cx fall into the
same cluster. Before this can be done, however, the three clustering solu-
tions (or clustering models) need to be validated.
Recall that model validation is crucial to evaluate the accuracy and use-
fulness of data analytic outcomes and that there are different context-spe-
cific methods for it. The train-test approach discussed in Chapter 2 seems
to be an option here – the general idea being to redo the analysis with say
80% of the sub-transcripts, use the new model to predict cluster member-
ship for the remaining 20%, and compare the original and new cluster
labels. However, recall that cluster analysis is an unsupervised technique
where there are no correct or pre-existing cluster labels to compare the
results against. The very reason why we perform it is, in fact, to discover
potentially useful but hidden classification schemes. The above train-test
validation procedure, if used without pre-existing labels, would therefore
run the risk of being logically circular because we would end up evaluating
one set of predictions with another.
Two alternative procedures will be introduced in its place. The first is
informal, simpler, and relies mostly on visual inspection just like elbow
plots. The key idea is to see whether the cluster centroid locations for each
dyad are ‘distinct’ enough in the higher-dimensional spaces they occupy.

Figure 3.4 Outcome of k-means clustering on the three dyads



Recall that our cluster centroids are points in 4-dimensional space. Each
dimension represents one of the four LIWC summary variables, and the
centroid is the mean (or mid-point) of all the sub-transcripts in that cluster.
Therefore, if we plot the mean values of the four variables by cluster, we
would be visually reconstructing the cluster centroids in a way that allows
us to evaluate their mutual distinctiveness. The following code will gener-
ate this plot.

#visual validation by reconstructing cluster centroids
data.groupby('cluster_labels').mean().plot(kind='bar')
plt.show()

Relatedly, we can also use the code below to check and plot how many
members there are in each cluster, which is useful in a descriptive account
of the analysis.

#check and plot cluster sizes
data.groupby('cluster_labels').count()
sns.countplot(data=data, x=data['cluster_labels'])

In both code snippets above, the groupby command is used to group the
dataset by the cluster_labels column and perform various actions like
counting and plotting the mean variable scores in each group. Figure 3.5
shows the bar plots of mean variable scores for all three dyads.
Note that the y-axis shows our standardized scores rather than raw
LIWC values. We can verify that the bar plots for dyad A indeed recon-
struct the cluster centroid positions shown by the array below, which was
presented earlier. The top row corresponds to cluster 0, the second row to
cluster 1, and the bottom row to cluster 2.
[ 1.71687033, 1.33623213, 0.07475749, 0.50641184]
[-0.22099794, 0.61341394, -0.96469075, -1.00276756]
[-0.28697058, -0.63915612, 0.49312864, 0.41172122]
An important point to reiterate is that just as the three clustering solu-
tions are mutually independent, the standardized scores are relativized
within each dyad and thus cannot be compared across dyads. For example,
the high analytical thinking and clout bars for cluster 0 of dyad A and their
low counterparts in cluster 0 of dyad B do not (necessarily) mean that
both variables are higher in A than B. What they actually convey is that
analytical thinking and clout scores in cluster 0 of dyad A are relatively
higher than those in the other two clusters of the same dyad.

Figure 3.5 Cluster centers of the three dyads

How, then, do we visually evaluate if the clusters are distinct and hence the outcome adequate? Clusters that are not distinct would have their centroids, or mean scores, close to each other, implying that their constitu-
ent sub-transcripts are also close to one another. This, in turn, implies an
unsatisfactory clustering solution because we did not successfully put our
objects into clear group. Conversely, distinct clusters would have mean
scores with magnitude and/or direction that are clearly different from one
another, implying that the constituent sub-transcripts form well-delineated
groups. Visual inspection of all three dyads suggests that the clusters are
indeed fairly distinct, especially for dyad B, which gives us some confidence
in our clustering solutions.
Nevertheless, bar plots and other visualizations rely on subjective vis-
ual judgments that are not quantifiable. While such judgments might be
good enough for practical purposes if the differences are obvious, they
should be supported by more quantitative validation procedures other-
wise. We will revisit the dual use of visual and quantitative validation
tools in later chapters. For the present case, a useful approach is to evalu-
ate the extent to which the outcomes of an alternative data analytic tech-
nique align with the present k-means clustering solutions. We mentioned
logistic regression as a common supervised machine-learning technique
at the start of the chapter. Recall that the main difference between super-
vised and unsupervised techniques like clustering is the use of pre-existing
group labels in the former. For example, if our objective were to see if
LIWC scores can successfully predict the therapy approach (psychoa-
nalysis versus CBT versus humanistic therapy) used in each transcript,
we would use logistic regression with the LIWC scores as our properties
and therapy approach as the pre-existing outcome. We can then validate
the logistic regression model by evaluating its predictive accuracy; i.e.,
how many predicted labels correctly predict the pre-existing labels. Our
approach is therefore to treat the cluster labels generated by k-means
clustering as ‘pre-existing’, and use logistic regression to see how well its
predictions match these labels. This allows us to step out of the previ-
ously mentioned logical circularity because a different algorithm is now
involved.
A full introduction to the logic and characteristics of logistic regression
is beyond the present scope, but readers may refer to the following exam-
ples of discourse-related applications (Tay, 2021a; Zhang, 2016; Zhang et
al., 2011). The code below imports the needed libraries from scikit-learn
and prepares our data from k-means clustering by splitting it into X and y.
X retains the LIWC scores (by removing the last column of cluster labels),
and y is the cluster_labels.

#import Python libraries
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

#split dataset into features and target variable
X = data.iloc[:,:-1]
y = data['cluster_labels']

At this point, we have the option of using the same train_test_split fea-
ture from Chapter 2 to randomly split the data into training and testing
datasets. We would then fit the logistic regression to the training set and
use it to predict the testing set to evaluate the predictive accuracy of our
k-means clustering model. Recall that this allows us to test how well our
model performs with ‘unseen’ data. The code below instead implements a
simplified process where the entire dataset is used to fit the logistic regres-
sion model, and all the predicted labels are compared with the actual
labels. In other words, we are working only with ‘seen’ data. Readers may
try to write the code to implement the train_test procedure as a challenge!

#instantiate the model. multi_class='auto' detects if outcome is binary or multinomial
logreg = LogisticRegression(multi_class='auto')
logreg.fit(X,y)
#get percentage accuracy
logreg.score(X,y)

The multi_class=‘auto’ setting, which is also the default setting, automati-


cally detects if the outcome variable (the cluster labels in our case) has only
two categories (i.e., two clusters) or more. This will determine if binary or
multinomial logistic regression is used. After fitting the model, logreg.score(X,y) computes the percentage accuracy of our model, which is simply
the proportion of predicted labels that are correct. Running the above for
dyad C, for example, yields a high score of 0.975, which gives us strong
confidence in the clustering solution. The final code snippet below gener-
ates a confusion matrix, one of the most common visualizations of predic-
tive accuracy. We will be using a confusion matrix again in Chapter 4.

#generate confusion matrix
metrics.confusion_matrix(logreg.predict(X), y)

logreg.predict(X) generates the list of labels predicted by the logistic regression model for each sub-transcript, and this is cross tabulated against
the actual labels y. Since there are two clusters for dyad C, this would give
us a 2x2 confusion matrix that looks like this.
    0   1
0 [10,  0]
1 [ 1, 29]

The rows represent predicted labels and columns represent actual labels,
following the order we specified them in the code (logreg.predict(X), y). The
matrix tells us that 10 sub-transcripts that were assigned to the first cluster
(cluster 0) were also correctly predicted to belong to it. Likewise, 29 sub-
transcripts that were assigned to the second cluster (cluster 1) were correctly
predicted as such. There was in fact only one prediction error – a sub-tran-
script that was assigned to cluster 0 but predicted to belong to cluster 1. In
any confusion matrix, the number of correct predictions is always the sum
of the top-left to bottom-right diagonal. We have 39 in this case out of 40,
giving us the above accuracy score of 39/40 = 0.975. We will learn about
other nuanced measures in Chapter 4 to complement this simple measure of
overall accuracy, but the latter is sufficient for the present purpose of using
an alternative data analytic technique to validate our clustering models.
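As a quick sketch of the arithmetic just described, reusing the metrics import and fitted logreg model from above, the overall accuracy can also be recovered directly from the matrix by dividing its diagonal sum by the total number of predictions:

import numpy as np

cm = metrics.confusion_matrix(logreg.predict(X), y)
print(np.trace(cm) / cm.sum()) #correct predictions (the diagonal) over all predictions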
In summary, because a pure train_test validation approach on a cluster-
ing dataset without pre-existing cluster labels is logically circular, we can
use complementary alternative methods like plotting cluster means and
logistic regression. There are still more ways to establish the validity of
clustering solutions. A further suggestion on how to boost the external
validity of the present examples might be to elicit rating scores from thera-
pists and clients on their subjective perception of synchrony and compare
these with the clustering outcomes. To reiterate what was broadly stated
in Chapter 1, model validation is a context-specific process that should not
rely on one-size-fits-all solutions.
After validating our clustering solutions, we simply go through each
dyad and keep track of which sessions have therapist and client sub-tran-
scripts falling into the same cluster. This will also give us the percentage
of synchronized sessions as an overall measure of a dyad, as well as the
distribution pattern of (a)synchrony across sessions for potential further
analysis. Table 3.2 summarizes the above with green cells indicating syn-
chrony (Tx and Cx in the same cluster) and red cells asynchrony (Tx and Cx
not in the same cluster).
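The bookkeeping itself is easy to automate. A minimal sketch, assuming (as in the code above) a per-dyad dataframe whose index holds the session labels T1…Tn and C1…Cn from Table 3.1 and whose last column is cluster_labels:

labels = data['cluster_labels']
n_sessions = sum(name.startswith('T') for name in data.index)

#a session is synchronized if its therapist and client sub-transcripts share a cluster label
synchronized = [x for x in range(1, n_sessions + 1)
                if labels['T' + str(x)] == labels['C' + str(x)]]
print(synchronized)
print(len(synchronized) / n_sessions) #proportion of synchronized sessions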
A total of 5 out of 15 sessions (33.3%) in the psychoanalysis dyad are
synchronized, compared to 5 out of 20 sessions (25%) in the humanistic
therapy dyad. Remarkably, the cognitive-behavioral therapy dyad ends
up as a case of ‘perfect asynchrony’ as none of the sessions are synchro-
nized. The distribution across sessions (left to right) provides a visual
overview of where in the treatment span (a)synchrony occurs. This is
useful for capturing potentially interesting patterns like contiguous or
intermittent ‘(a)synchrony blocks’ as explained below. The (a)synchronic
nature of each session, which could be seen as a categorical outcome,
could also be forecasted using relevant techniques like Markov chains
that lie beyond the present scope. We will return to forecasting as an
objective in Chapter 5.

Table 3.2 Synchrony measures of the three dyads

Dyad   Synchronized sessions   % of synchronized sessions   Synchrony distribution across sessions
A      2, 8, 10, 11, 12        33.3
B      -                       0
C      9, 11, 12, 14, 17       25

Step 3: Qualitative analysis in context

It was mentioned earlier that most linguistic and discourse analytic approaches to psychotherapy talk include some form of qualitative anal-
ysis of transcript extracts. Some of these are complemented by relevant
quantitative analysis before, after, or in an intermittent way (Creswell,
2014). For the present case, a key rationale for performing quantitative
k-means clustering before qualitative analysis is that cluster analysis offers
concrete ‘entry’ or focal points for the necessarily selective task of deciding
which (types of) examples to focus on. The same rationale will be apparent
again for our time series analyses in Chapter 4. For the present case study,
the synchrony measures obtained in Step 2 will be qualitatively discussed
at two levels. The first is the more general level of therapy types where we
discuss the extent to which each representative dyad’s synchrony measure
reflects its ostensible theoretical traits. While we should not expect any
dyad to perfectly enact its underpinning therapeutic principles, this level
of analysis is still helpful especially when making larger scale compari-
sons across therapy types. Individual practitioners may also attempt it as a
self-reflection on how their own practices (mis)align with theory. We then
move to the specific level of examining how and why (a)synchrony is co-
constructed using selected extracts from each dyad. This helps us connect
the global quantitative synchrony measures with what actually goes on in
therapist–client interaction, and potentially reveals higher-order communi-
cation strategies as well as other context-specific features that may deepen
our understanding of the nature of (a)synchrony.
We start with the first level where the above measures suggest that the
psychoanalysis sample dyad is the most synchronized (33%), followed by
humanistic therapy (25%) and CBT (0%). Psychoanalysis and humanis-
tic therapies are known to be less structured, adhering less to theoretical
models and attaching more importance to the therapist–client relationship
(Arkowitz & Hannah, 1989; Bland & DeRobertis, 2020). Humanistic
therapy regards this relationship as a primary curative factor and empha-
sizes attentiveness to the client’s experiential and affective world (Rogers,
1951), while psychoanalysis uses therapist–client interaction as a tool to
probe into the client’s repressed thoughts, feelings, and interactional pat-
terns (Kramer et al., 2008).
Both these approaches are also less educative in that therapists avoid
imposing solutions and interpretations (Watson et al., 2011). They instead
prefer to ‘reflect’ clients’ unconscious in a neutral manner (Freud, 1924)
such that positive changes emerge as a natural result rather than a pre-
conceived goal. On the other hand, CBT is described as educative, prob-
lem-focused, and task-based, and it adheres closely to a set of established
techniques (Fenn & Byrne, 2013). Therapists often explicitly demonstrate
their expertise to ‘teach’ clients to develop more adaptive ways of thinking
and behaving. Based on these general philosophical differences, it would
be reasonable to suggest that psychoanalytic and humanistic dyads have
greater linguistic synchrony especially when co-constructing shared under-
standings of clients’ life situations. CBT, on the other hand, is likely to
result in greater asymmetries in therapist and client linguistic styles.
Our sample dyads seem to support these hypotheses as the psychoana-
lytic (33%) and humanistic dyads (25%) are comparable and both much
higher than CBT (0%). Looking more closely at the distribution of syn-
chronized sessions across time, the psychoanalytic dyad is barely synchro-
nized at the beginning of treatment but experiences four near-consecutive
synchronized sessions near the end. The distribution is more intermittent
in the humanistic dyad, but synchrony likewise occurs mostly in the lat-
ter half of treatment. The greater display of synchrony in the latter half
is likely a result of more co-constructive interaction as treatment devel-
ops, with the consecutive block in the psychoanalytic dyad indicating a
prolonged period of such activity. Importantly, we are neither claiming
that the above synchrony measures correlate with real or perceived treat-
ment quality/outcomes nor that certain approaches inherently (dis)prefer
synchrony because of high/low measures. These are important empirical
questions that could be addressed in future by exploiting cluster analysis
as a key methodological tool.
Moving to the second level of analysis, a more detailed examination of
relevant transcripts would allow researchers to learn more about the lin-
guistic and interactional construction of (a)synchrony, and practitioners to
critically reflect on their own practices (Spong, 2010). The striking example
of CBT’s perfect asynchrony might be particularly illustrative in this regard.
Recall that whether sub-transcripts are judged to be synchronized depends
on their holistic similarity across all four LIWC variables. While each word
that appears both in the text and in the LIWC dictionary contributes to a
specific variable score (Pennebaker et al., 2015), a correspondingly holistic
qualitative analytic approach is to focus on the overall language use in con-
text rather than specific word-variable correspondences. Each of the three
dyads will accordingly be illustrated by one extract below. Even though the
words in these extracts might not be the main contributors to the LIWC
scores, and hence cluster membership, of the sub-transcripts they belong to,
they suffice to illustrate how quantitative and qualitative analysis comple-
ment each other. It was mentioned above that all three clients share similar
demographics that maximizes their comparability. The extracts were also
selected on this basis, in that they all zoom in on a discussion of the client’s
difficulties in relating with a specific individual. Furthermore, because the
main objective for all three dyads is to resolve these difficulties, it is reason-
able to assume that the interactional styles featured in each of these extracts
will recur throughout the rest of their respective sessions.
Extract 1 is from session 8 of dyad A (psychoanalysis), a synchronized
session (see Table 3.2). Recall that dyad A has the highest synchrony meas-
ure of 33%. The client is relating her boyfriend’s problems at work and
her frustration at not being able to help. The therapist is guiding the client
to re-experience her emotional disturbances and pin down the cause of her
reactions.

1. CLIENT: So he’s worked now for four years in this job, and it’s going
to be so hard for him to turn the other way and I can’t will him to do
anything.
2. THERAPIST: Yeah. I guess I’m imagining that, seeing him suffer this
way and be himself so sort of helpless and being so helpless yourself to
do too much about it, is part of what makes it so difficult.
3. CLIENT: Um hmm. So I sense some urgency in the, like, speeding up,
or in getting the most out of therapy while he has it. That is my get-
ting-the-most-out-of-things tendency. He does not feel this way. He's
like, “Ah, she just told me I was punishing myself”.
4. THERAPIST: Hmm.
5. CLIENT: Like, yeah. That’s the point.
6. THERAPIST: It’s pretty hard to sit by, huh?
7. CLIENT: Yeah. It’s so hard. It was so much harder in college though.
God, I was like, um, I felt that I could not go on in the relationship a
number of times.
8. THERAPIST: I mean I guess the place to look would be you know,
um, I mean, it does almost like you’ve vicariously experienced his
stress and, except that you’re helpless cause you can’t do all the things
that you would have done if it were you. But it was him.
9. CLIENT: Yeah.
10. THERAPIST: You know, that is what I imagine used to hang you up
about this.
11. CLIENT: Yeah, that was probably the main thing.

Referring back to Table 3.1, in this session the therapist and client have
similar scores for analytical thinking (Client=15.09, Therapist=11.39),
clout (Client=51.7, Therapist=63.4), and authenticity (Client=28.18,
Therapist=24.97), with a larger difference for emotional tone (Client=42.92,
Therapist=28.62). These similarities, which led to the statistical determina-
tion of synchrony, are linguistically reflected by the observable level of con-
cord between therapist and client. Markers of agreement like ‘yeah’ (Turn
2, 5, 7, 9, 11) and ‘um hmm’ (Turn 3) suggest gradual co-construction of a
shared interpretation of the client’s situation. Their similarly low analyti-
cal thinking indicates a mutually informal, personal, here-and-now, and
narrative style, as the therapist guides the client to explore the underlying
meanings, causes, and mechanisms of her thoughts and feelings. The thera-
pist’s display of mid-level clout is noticeable. On the one hand, she asserts
her interpretations by frequently using ‘I’ and directs them towards the cli-
ent with ‘you’ (turns 2, 8, 10). On the other hand, she carefully reduces the
force of these interpretations with hedging expressions like ‘I guess’ (Turn
2), ‘almost like’ (Turn 8), and ‘I imagine’ (Turn 10). The client, in turn,
does not display a significantly lower clout as she appeared to respond well
to this approach, concurring with the therapist’s interpretations. These
observations also account for the similar relatively low levels of authentic-
ity – unsurprisingly, given the general psychoanalytic aim of ‘making the
implicit explicit’, clients may find themselves speaking in a more guarded
and distanced manner when working through repressed thoughts and feel-
ings. The therapist also displays a comparable level of authenticity, which
may suggest an explicit effort to ‘reflect’ the client in a neutral and non-
interfering manner. It is also interesting to note that, contrary to strate-
gies like repetition and contiguity observed elsewhere (Ferrara, 1994), the
present linguistic synchrony does not seem to be based on taking up each
other’s keywords or phrases. This is consistent with the earlier observation
that content words may be less revealing of interactional stances (Doyle &
Frank, 2016), and will be further illustrated by the next extract where we
see the converse case of high repetition but low synchrony.
Extract 2 is from session 6 of dyad B (CBT). Recall that all sessions in
dyad B are asynchronized. The therapist establishes the client’s ‘dire need
for approval’ from her mother as a key irrational belief to be disputed, and
asks her to identify more potential irrational beliefs. The therapist then
proceeds to point out why they are irrational.

1. THERAPIST: So this need for approval, this dire need for approval,
and very pointedly from your parents, maybe more so from your
mother, is going to keep you suffering if you don’t continue the good
work you’re doing. So, any more irrational beliefs before you dispute
the life out of these ones right here and now?
2. CLIENT: Yeah. Let me think. “I should have not been so impulsive to say what I was feeling at the time”.
3. THERAPIST: And because I was so impulsive, that makes me…
4. CLIENT: “And because I was so impulsive, that makes me a thought-
less daughter”.
5. THERAPIST: As I shouldn’t be.
6. CLIENT: “As I shouldn’t be”. See, it’s always that ending.
7. THERAPIST: Well it usually is that ending, if the result is anxiety,
panic or other unpleasantries.
8. CLIENT: Right. God. That part always gets me.
9. THERAPIST: You mean, until now, that part has not been as evident
as you’d like it to be. You can’t say it always gets you, because always
implies always: past, present and future, and you still have years of life
left.

As mentioned earlier, both speakers echo each other from Turn 3 to 6. The
therapist guides this process by repeating key parts of the client’s utter-
ances to prompt further reflection, and the client repeats them again (‘I
was so impulsive, that makes me…’, ‘As I shouldn’t be’) to demonstrate
this reflection. Such repetitions and overlaps are expected when dysfunc-
tional thoughts, beliefs, assumptions, etc. are discussed because they often
involve concrete details depicted by content words. In fact, if we had per-
formed cluster analysis with an alternative quantification scheme that is
based on document similarity, like the vectorization process mentioned in
Chapter 1, we would very likely see a high degree of synchrony in dyad B.
Our motivated choice of LIWC and its non-emphasis on content words,
however, reveals that surface-level similarity does not entail a high syn-
chrony measure. Referring again to Table 3.1, the therapist and client have
very different scores for all variables: analytical thinking (Client=4.23,
Therapist=15.01), clout (Client=38.41, Therapist=91.05), authentic-
ity (Client=73.35, Therapist=44.97), emotional tone (Client=34.36,
Therapist=74.45), which suggest highly divergent interactional stances.
The contrast in clout is evident from Turn 1 as the therapist uses many
client-directed pronouns (‘if you don’t continue…’, ‘before you dispute…’)
to assume an expert-like and directive stance to establish the ‘disput(ing)
the life out of” of irrational beliefs as a key focus of their interaction. The
client obliges by reflecting on the therapist’s directions using self-directed
pronouns (‘I was so impulsive’, ‘I shouldn’t be’). We see the reverse pat-
tern for authenticity – the client’s high score is reflected in her willing-
ness to disclose her thoughts and feelings, which is unsurprising given the
present therapist-directed focus on her irrational beliefs. By contrast, the
therapist’s low score is attributable to her exclusive focus on the client.
The therapist’s higher scores for analytical thinking and emotional tone
are likewise predictable from her general educative, problem-focused,
and task-based stance. Her language contains more logical markers (‘if’,
‘because’, ‘until’) compared with the client’s complementary narrative
style, and she is obliged to use more optimistic or emotionally positive
language in contrast with the client’s generally negative depiction of her
situation. This becomes apparent later in the session (not in extract) when
the therapist urges the client to remind herself that she is worth ‘approval,
adoration, and acceptance’.
Our analyses of extracts 1 and 2 have shown how a quantitative con-
trast in synchrony measures is reflected in equally contrastive interactional
stances between dyads A and B. We now consider dyad C (humanistic
therapy), which has a synchrony measure (25%) midway between A and
B. Extract 3 is from session 9 where the client relates her frustrations with
her mother and why she has been avoiding her. The therapist guides the
client to explore these feelings and suggests that the avoidance is linked
with her need to retain her sense of self.

1. CLIENT: Like I haven’t done a damn thing about my mother since
July and she’s been through hell since then, and I just completely
turned off.
2. THERAPIST: You have some feeling like there’s some resemblances
between what she does and how you are sometime, that it’s absolutely
necessary that you keep yourself separate from her?
3. CLIENT: To a degree yeah, mostly because I haven’t sorted out what’s
keeping myself separate is in a way like she’s fantastically … like I’m
upset in five minutes after being with her. She’s just overwhelmingly

4. THERAPIST: In all of these.
5. CLIENT: She’s like, “I am helpless. You don’t talk to me”. It’s every
kind of accusation except that she never says it that way, so you can’t
yell at her for saying it that way. It’s always done in a nice, rational
tone
6. THERAPIST: Man, those things can cripple.
7. CLIENT: I’ve experienced them really crippling, and I have experi-
enced being literally nervous wrecks for a long time growing up, […]
the only time that I’ve been able to handle her has been in the last year
since I’ve been able to not treat her like a human being, but treat her
like a patient. But if I once let myself soften to her, I am so vulnerable.
8. THERAPIST: I know. It’d almost be a bad thing tucked away or being
lost.
9. CLIENT: Because I don’t know who I am, but I know that I’m a very
negative person from her. I can’t take that from her anymore, every
time she staggers or it’s just like you just feel like crying, she’s so pitiful,
and she’s also so good-hearted. I empathize with her too much, and I
know what she feels like too much, and I know how she’s not able to
cope, and it hurts.
10. THERAPIST: And it’s like there’s too much overlap, it not only ends
up maybe damaging you, but making it really hard for you to… It may
sort of feel weak or like meaningless to someone else, but like it really
helps to have that, at that point because it sounds like in one way you
feel the same way. You’ve got to have that separation to retain any
sense of yourself.
11. CLIENT: Um hmm
12. THERAPIST: …sense of yourself.

From Table 3.1 we see that the synchrony of this session is attributable
to similar scores for analytical thinking (Client=7.4, Therapist=5.65)
and authenticity (Client=80.6, Therapist=78.43), with larger differences
in emotional tone (Client=33.2, Therapist=43.22) and especially clout
(Client=19, Therapist=47.84). In this sense it lies between extract 1 where
three variables are highly similar, and extract 2 where none of the variables
are. Interestingly, this coincides with the fact that the overall synchrony
measure of dyad C is also midway between A and B.
A closer analysis of the interactional construction of synchrony, indeed,
reveals elements that resemble both extracts 1 and 2. The therapist attempts
to clarify the client’s feelings by paraphrasing her account more precisely
like ‘keep yourself separate from her’ (Turn 2) and ‘those things can crip-
ple’ (Turn 6). The observed concord in extract 1 is noticeable here as the
client shows tacit agreement by echoing these utterances in the following
turns (‘what’s keeping myself separate’, ‘really crippling’), and markers
of agreement like ‘(to a degree) yeah’ (Turn 3) and ‘um hmm’ (Turn 11).
This general dynamic accounts for their similar analytical thinking and
authenticity scores. Both score low in the former as the conversation is
informal and narrative-like, and high in the latter as the client’s disclosure
of her feelings (‘I haven’t done a damn thing about my mother’, ‘I’m
upset in five minutes’, ‘I am so vulnerable’) is met with the therapist’s open
and empathetic understanding (‘Man, those things can cripple’, ‘I know’).
Their emotional tone scores are not as similar, but both tend towards the
negative end. Their use of negative emotion words is consistent throughout
as the client relates her and her mother’s feelings (‘upset’, ‘helpless’, ‘vul-
nerable’, ‘hurts’), and the therapist focuses more on the personal meaning
of her experiences (‘lost’, ‘damaging’, ‘weak’). However, while the thera-
pist in extract 1 does not seem to take the lead, the therapist here is subtly
leading the process by summarizing the client’s reflections, drawing out
their implications, and proposing an interpretation that is expected to be
accepted (Antaki et al., 2005). This is closer to the therapist’s educative
stance in extract 2 and accounts for the disparity in clout. Notice also that
the aforementioned concord is demonstrated to a lesser extent here. The
client appears to agree with the therapist (only) ‘to a degree’ (Turn 3), and,
unlike extract 1, she expresses her feelings more independently and does
not orient her utterance as a response to the therapist at every turn.
In summary, the above analyses attempted to contextualize the quanti-
tative synchrony measures and illustrate how linguistic (a)synchrony can
be constructed in different ways that can be examined at the individual
dyadic level. Our examples generally reflect characteristics expected at the
theoretical level of therapy type – dyad A demonstrates a high level of non-
judgmental ‘reflection’ often discussed in psychoanalysis, dyad B presents
a sharp contrast where the CBT therapist adopts a more institutionalized
educative role, and dyad C contains elements of both in the humanistic
therapist’s broad adoption of a guiding, empathetic approach.
To conclude, this chapter demonstrated the combined application of
cluster and discourse analysis to model linguistic (a)synchrony in thera-
pist–client interaction. It follows Chapter 2’s focus on the session as the
key unit of quantitative analysis, with contextually bound extracts serv-
ing an important illustrative purpose. This affirms the importance of a
mixed-method orientation to linguistic (a)synchrony research, where
quantified measures of a dataset are complemented with a more critical
eye on higher-order communicative strategies and phenomena. The sam-
ple analyses of three dyads from key therapy approaches were performed
with both researcher and practitioner objectives in mind. Researchers may
adopt a similar comparative approach on more representative datasets to
study how (a)synchrony varies across therapy types, the temporal distri-
bution of (a)synchronized sessions within a dyad, and/or conduct further
qualitative analyses of (a)synchrony construction from different theoreti-
cal perspectives. Interested practitioners can apply the approach to their
own work and critically reflect on their socio-psychological stances vis-
à-vis their clients, as well as their avowed therapeutic approach. It would
be particularly interesting to track how one’s tendency to (a)synchronize
changes across different clients and over time. Additionally, the approach
could also be applied to other social contexts where there is an interest in
examining linguistic (a)synchrony between speakers and across motivated
intervals, such as classroom interaction (e.g., teacher versus student talk
across lessons) or online fora (different posts across time). It is worth reit-
erating that cluster analysis can be performed on the outcomes of alterna-
tive quantification schemes that have different theoretical assumptions and
underpinnings than LIWC. A comparison of how the resulting clustering
solutions differ is, in fact, an interesting direction in itself. At the very
least, it would demonstrate how data analytic techniques can be used as
systematic testing grounds for just how different various discourse analytic
perspectives on the same dataset are.
Lastly, while it is tempting to suggest that higher linguistic synchrony
measures correlate with better treatment or interactional outcomes, such
questions are beyond the present emphasis on methodology. Given that
evidence exists for the general effectiveness of all three therapy types, our
sample findings may raise the question of how detrimental a seemingly
low-synchrony approach like CBT truly is. It goes without saying that the
present approach does not aim to overcome the inherent limitations of sec-
ondary transcript analysis and has a more descriptive focus on modeling
rather than prescribing language use. It nevertheless invites future work to
incorporate outcome measures and investigate links between linguistic syn-
chrony and treatment quality. Returning to machine learning parlance, we
might then employ supervised techniques where these outcome measures
function as pre-existing ‘labels’ for each session, and the cluster member-
ships function as predictors. Future work may also investigate how linguistic
constructions of (a)synchrony might vary along demographic variables like
age and gender, and its non-verbal manifestations like gestures and other
paralinguistic cues. Again, these can be flexibly incorporated into a data
analytics framework because of its atheoretical and context-specific nature.
Quantification and clustering can be extended to non- or paralinguistic
behaviors and variables, and the subsequent qualitative analytic phase is not
married to any particular framework. The above considerations also apply
to other potential contexts of linguistic (a)synchrony where our approach
can be applied.

Python code used in this chapter


The Python code used throughout the chapter is reproduced in sequence
below for readers’ convenience and understanding of how the analysis
gradually progresses.

Agglomerative Hierarchical Clustering

#import Python libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet
from scipy.spatial.distance import pdist

#import dataset
data = pd.read_csv('covid.csv', index_col='Country')

#scale data to zero mean and unit variance
scaler = StandardScaler()
#apply the scaler so that the scaling described in the comment above takes effect
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns, index=data.index)

#generate dendrogram
plt.title("COVID-19 dendrogram", fontsize=25)
plt.xticks(fontsize=25)
plt.yticks(fontsize=25)
plt.ylabel('Distance', fontsize=25)
plt.xlabel('Country/region', fontsize=25)
dend = dendrogram(linkage(data, metric='euclidean', method='ward'),
                  orientation='top', leaf_rotation=80, leaf_font_size=25,
                  labels=data.index, color_threshold=10)
plt.show()

#calculate cophenetic correlation coefficient
c, coph_dist = cophenet(shc.linkage(data, method='ward', metric='euclidean'), pdist(data))
c

k-means clustering

#import dataset and scale scores
data = pd.read_csv('A.csv', index_col='Session')
scaler = StandardScaler()
scaler.fit(data)
data = pd.DataFrame(scaler.transform(data), columns=data.columns, index=data.index)

#determine the optimal number of clusters with the 'elbow method'
from sklearn.cluster import KMeans
n = 5 #n can be changed to test more clusters
num_clusters = range(1, n+1)
inertias = []

for i in num_clusters:
    model = KMeans(n_clusters=i)
    model.fit(data)
    inertias.append(model.inertia_)

#generate 'elbow plot'
plt.plot(num_clusters, inertias, '-o')
plt.xlabel('number of clusters, k', fontsize=15)
plt.ylabel('inertia value', fontsize=15)
plt.title('Dyad A (Psychoanalysis)', fontsize=15)
plt.xticks(num_clusters, fontsize=15)
plt.show()

#confirm visual inspection with actual inertia value change
pd.DataFrame(inertias).diff().abs()

#generate cluster centroids and labels using the optimal number of clusters
model = KMeans(n_clusters=3) #the value of n_clusters should be set to the optimal k
labels = model.fit_predict(data)
data['cluster_labels'] = labels

#obtain cluster centroid positions for later plotting
cluster_centres = model.cluster_centers_

from sklearn.decomposition import PCA as sklearnPCA
#specify two principal components
pca = sklearnPCA(n_components=2)
#reduce the cluster centroid locations into two dimensions
cent = pca.fit_transform(cluster_centres).T
#use data.iloc to remove cluster labels in the rightmost column before reducing the data
reduced = pd.DataFrame(pca.fit_transform(data.iloc[:,:-1]),
                       columns=['Dim_1', 'Dim_2'], index=data.index)
#reattach previous cluster labels to prepare for plotting
reduced['cluster'] = data['cluster_labels']

#generate scatterplot and color according to clusters
import seaborn as sns #seaborn is needed for the plotting functions below
sns.scatterplot(x='Dim_1', y='Dim_2', hue='cluster', data=reduced, palette='tab10', s=30)

#plot cluster centroids
plt.plot(cent[0], cent[1], 'rx', markersize=15)

#annotate each object
for i in range(reduced.shape[0]):
    plt.text(x=reduced.Dim_1[i]+0.05, y=reduced.Dim_2[i]+0.05,
             s=reduced.index[i], fontdict=dict(color='black', size=10))
plt.legend(title='cluster')
plt.xlabel("Dimension 1", fontsize=15)
plt.ylabel("Dimension 2", fontsize=15)
plt.title("Dyad A (Psychoanalysis)", fontsize=15)
plt.show()

Validation of clustering solutions

#visual validation by reconstructing cluster centroids
data.groupby('cluster_labels').mean().plot(kind='bar')
plt.show()

#check and plot cluster sizes
data.groupby('cluster_labels').count()
sns.countplot(data=data, x='cluster_labels')

#import Python libraries
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

#split dataset into features and target variable
X = data.iloc[:,:-1]
y = data['cluster_labels']

#instantiate the model. multi_class='auto' detects if outcome is binary or multinomial
logreg = LogisticRegression(multi_class='auto')
logreg.fit(X, y)
#get percentage accuracy
logreg.score(X, y)

#generate confusion matrix
metrics.confusion_matrix(logreg.predict(X), y)

References
Anderson, H., & Goolishian, H. (1988). Human systems as linguistic systems:
Preliminary and evolving ideas about the implications for clinical theory. Family
Process, 27(4), 371–393.
Antaki, C., Barnes, R., & Leudar, I. (2005). Diagnostic formulations in
psychotherapy. Discourse Studies, 7(6), 627–647. https://doi​.org​/10​.1177​
/1461445605055420
Ardito, R. B., & Rabellino, D. (2011). Therapeutic alliance and outcome of
psychotherapy: Historical excursus, measurements, and prospects for research.
Frontiers in Psychology, 2, 1–11.
Arkowitz, H., & Hannah, M. T. (1989). Cognitive, behavioral, and psychodynamic
therapies: Converging or diverging pathways to change? In A. Freeman, K.
M. Simon, L. E. Beutler, & H. Arkowitz (Eds.), Comprehensive handbook of
cognitive therapy (pp. 143–167). Plenum Press.
Bernieri, F. J. (1988). Coordinated movement and rapport in teacher-student
interactions. Journal of Nonverbal Behavior, 12(2), 120–138.
Bland, A. M., & DeRobertis, E. M. (2020). Humanistic perspective. In V. Zeigler-
Hill & T. K. Shackelford (Eds.), Encyclopedia of personality and individual
differences (pp. 2061–2079). Springer International Publishing. https://doi​.org​
/10​.1007​/978​-3​-319​-24612​-3​_1484
Brennan, S. E., & Clark, H. H. (1996). Conceptual pacts and lexical choices in
conversation. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 22(6), 1482–1493.
Brown, B., Nolan, P., Crawford, P., & Lewis, A. (1996). Interaction, language
and the “narrative turn” in psychotherapy and psychiatry. Social Science and
Medicine, 43(11), 1569–1578. https://doi​.org​/10​.1016​/S0277​-9536(96)00053-6
Creswell, J. (2014). Research design: Qualitative, quantitative, and mixed methods
approaches (4th ed.). Sage.
Dörnyei, Z., & Csizér, K. (2005). The effects of intercultural contact and tourism
on language attitudes and language learning motivation. Journal of Language
and Social Psychology, 24(4), 327–357.
Doyle, G., & Frank, M. C. (2016). Investigating the sources of linguistic alignment
in conversation. 54th Annual Meeting of the Association for Computational
Linguistics, ACL 2016 - Long Papers, 1, 526–536.
Fenn, K., & Byrne, M. (2013). The key principles of cognitive behavioural therapy.
InnovAiT, 6(9), 579–585. https://doi​.org​/10​.1177​/1755738012471029
Ferrara, K. W. (1991). Accommodation in therapy. In H. Giles, J. Coupland, &
N. Coupland (Eds.), Contexts of accommodation (pp. 187–222). Cambridge
University Press and Maison des Sciences de l’Homme.
Ferrara, K. W. (1994). Therapeutic ways with words. Oxford University Press.
Freud, S. (1924). A general introduction to psychoanalysis. Horace Liveright.
Fusaroli, R., Bahrami, B., Olsen, K., Roepstorff, A., Rees, G., Frith, C., & Tylén,
K. (2012). Coming to terms: Quantifying the benefits of linguistic coordination.
Psychological Science, 23(8), 931–939.
Giles, H. (Ed.). (2016). Communication accommodation theory: Negotiating
personal relationships and social identities across contexts. Cambridge
University Press.
Giles, H., Dailey, R., Barker, V., Hajek, C., Anderson, D. C., & Rule, N. (2006).
Communication accommodation: Law enforcement and the public. In B. Poire
& R. Dailey (Eds.), Applied interpersonal communication matters (pp. 241–
270). Peter Lang.
Gries, S. T. (2005). Syntactic priming: A corpus-based approach. Journal of
Psycholinguistic Research, 34(4), 365–399.
Healey, P., Purver, M., & Howes, C. (2014). Divergence in dialogue. PLoS ONE,
9(2), e98598
Hove, M. J., & Risen, J. L. (2009). It’s all in the timing: Interpersonal synchrony
increases affiliation. Social Cognition, 27(6), 949–960.
Jones, S., Cotterill, R., Dewdney, N., Muir, K., & Joinson, A. (2014). Finding
Zelig in text: A measure for normalising linguistic accommodation. In
Proceedings of the 25th international conference on computational linguistics.
Dublin City University and Association for Computational Linguistics, (pp.
455–465).
Julien, D., Brault, M., Chartrand, É., & Bégin, J. (2000). Immediacy behaviours and
synchrony in satisfied and dissatisfied couples. Canadian Journal of Behaviorial
Science, 32(2), 84–90.
Kodinariya, T., & Makwana, P. (2013). Review on determining number of Cluster
in K-Means Clustering. International Journal of Advance Research in Computer
Science and Management Studies, 1(6), 90–95.
Koole, S. L., & Tschacher, W. (2016). Synchrony in psychotherapy: A review and
an integrative framework for the therapeutic alliance. Frontiers in Psychology,
7(June), 1–17.
Kopp, R. R. (1995). Metaphor therapy: Using client-generated metaphors in
psychotherapy. Brunner/Mazel.
Kramer, G. P., Bernstein, D. A., & Phares, V. (2008). Introduction to clinical
psychology (7th ed.). Pearson.
Niederhoffer, K. G., & Pennebaker, J. W. (2002). Linguistic style matching
in social interaction. Journal of Language and Social Psychology, 21(4),
337–360.
Norcross, J. C. (1990). An eclectic definition of psychotherapy. In J. K. Zeig & W.
M. Munion (Eds.), What is psychotherapy? Contemporary perspectives (pp.
218–220). Jossey-Bass.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The
development and psychometric properties of LIWC2015. University of Texas
at Austin.
Peräkylä, A., Antaki, C., Vehviläinen, S., & Leudar, I. (Eds.). (2011). Conversation
analysis and psychotherapy. Cambridge University Press.
Pickering, M. J., & Garrod, S. (2004). Toward a mechanistic psychology of
dialogue. Behavioral and Brain Sciences, 27(2), 169–190.
Rodriguez, M. Z., Comin, C. H., Casanova, D., Bruno, O. M., Amancio, D.
R., Costa, L. da F., & Rodrigues, F. A. (2019). Clustering algorithms: A
comparative approach. PLoS ONE, 1, 1–34. https://doi​.org​/10​.1371​/journal​
.pone​.0210236
Rogers, C. R. (1951). Client-centered psychotherapy. Houghton Mifflin.
Roux, M. (2018). A comparative study of divisive and agglomerative hierarchical
clustering algorithms. Journal of Classification, 35(2), 345–366.
Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the
organization of turn-taking in conversation. Language, 50, 696–735.
Sarangi, S., Bennert, K., Howell, L., & Clarke, A. (2003). ‘Relatively speaking’:
Relativisation of genetic risk in counselling for predictive testing. Health, Risk
and Society, 5(2), 155–171.
Semin, G. R., & Cacioppo, J. T. (2008). Grounding social cognition:
Synchronization, entrainment, and coordination. In G. R. Semin & E. R. Smith
(Eds.), Embodied grounding: Social, cognitive, affective, and neuroscientific
approaches (pp. 119–147). Cambridge University Press.
Sokal, R., & Rohlf, J. (1962). The comparison of dendrograms by objective
methods. Taxon, 11(2), 33–40.
Spong, S. (2010). Discourse analysis: Rich pickings for counsellors and therapists.
Counselling and Psychotherapy Research, 10(1), 67–74.
Strong, T. (2006). Wordsmithing in counselling. European Journal of
Psychotherapy, Counselling and Health, 8(3), 251–268.
Tay, D. (2016). Finding the middle ground between therapist-centred and client-
centred metaphor research in psychotherapy. In M. O’Reilly & J. N. Lester
(Eds.), The Palgrave handbook of adult mental health (pp. 558–576). Palgrave
Macmillan. https://doi​.org​/10​.1057​/9781137496850​_29
Tay, D. (2021a). Is the social unrest like COVID-19 or is COVID-19 like the social
unrest? A case study of source-target reversibility. Metaphor and Symbol, 36(2),
99–115.
Tay, D. (2021b). Metaphor response categories and distribution between therapists
and clients: A case study in the Chinese context. Journal of Constructivist
Psychology, 34(4), 378–394. https://doi​.org​/10​.1080​/10720537​.2019​.1697913
Tay, D., & Qiu, H. (2022). Modeling linguistic (A)synchrony: A case study of
therapist–client interaction. Frontiers in Psychology, 13. https://www​.frontiersin​
.org​/article​/10​.3389​/fpsyg​.2022​.903227
Valdesolo, P., & DeSteno, D. (2011). Synchrony and the social tuning of
compassion. Emotion, 11(2), 262–266.
Watson, J. C., Goldman, R. N., & Greenberg, L. S. (2011). Humanistic and
experiential theories of psychotherapy. In J. C. Norcross, G. R. VandenBos, &
D. K. Freedheim (Eds.), History of psychotherapy: Continuity and change (2nd
ed., pp. 141–172). American Psychological Association. https://doi​.org​/10​.1037​
/12353​-005
Weiste, E., & Peräkylä, A. (2013). A comparative conversation analytic study
of formulations in psychoanalysis and cognitive psychotherapy. Research on
Language and Social Interaction, 46(4), 299–321. https://doi​.org​/10​.1080​
/08351813​.2013​.839093
Yim, O., & Ramdeen, K. T. (2015). Hierarchical cluster analysis: Comparison
of three linkage measures and application to psychological data. Quantitative
Methods for Psychology, 11(1), 8–21. https://doi​.org​/10​.20982​/tqmp​.11​.1​
.p008
Zhang, W. (2016). Variation in metonymy. Mouton de Gruyter.
Zhang, W., Speelman, D., & Geeraerts, D. (2011). Variation in the (non)metonymic
capital names in Mainland Chinese and Taiwan Chinese. Metaphor and the
Social World, 1(1), 90–112. https://doi​.org​/10​.1075​/msw​.1​.1​.09zha
Zwaan, R., & Radvansky, G. (1998). Situation models in language comprehension
and memory. Psychological Bulletin, 123(2), 162–185.
4 Classification

Introduction to classification: Predicting groups from objects


In Chapter 3, the distinction between unsupervised and supervised machine
learning was explained. K-means clustering was demonstrated as an exam-
ple of unsupervised learning and applied to measure therapist–client (a)syn-
chrony based on latent groupings of linguistic (dis)similarity. Supervised
learning, on the other hand, involves supplying the computer with pre-exist-
ing outcome labels. A standard example is classification techniques that
model relationships between measured properties and existing category/
group labels. These techniques can determine if the properties of the objects
at hand can be used to reliably classify them into the existing groups. In
machine-learning parlance, we assume that there is a mathematical func-
tion or mapping between properties and group labels, and the goal is to
approximate this function as best as possible. The properties and groups are
of a different nature in most cases, precisely because we want to investigate
these properties as non-obvious or implicit determinants of group member-
ship. For example, we will revisit the COVID-19 example from Chapter 3
to see if the four properties of a country/region – GDP per capita, life expec-
tancy, COVID cases per million, and COVID-related deaths per million
– can be used to predict the continent that it is located in. A positive result
would then imply that such statistics vary systematically across different
geographical parts of the world. In discourse analytic contexts, the analo-
gous objective would be to show that linguistic or discursive properties of
texts are able to predict inherently non-linguistic classifications.
Logistic regression is a common classification technique that was briefly
introduced and demonstrated in Chapter 3 to validate clustering out-
comes. Recall that the validation logic was to treat the emergent group
labels as ‘pre-existing’ and take the reverse approach to test if the proper-
ties could predict these labels. Just like cluster analysis, there are many
techniques, algorithms, or classifiers that can be used to perform classi-
fication tasks. More comprehensive machine-learning projects will often
adopt an approach known as ensemble learning, which involves using
multiple algorithms and comparing their relative performance or aggre-
gating their outcomes to attain better ‘meta’ results. Major classification
algorithms include logistic regression, decision trees, random forests, sup-
port vector machine (SVM), naïve bayes, and k-nearest neighbors (k-NN)
(Aggarwal, 2020). Each of these approaches the task of classification with
a fundamentally different basic logic. Additionally, the major principled
differences between them include whether they are (1) probabilistic or non-
probabilistic, (2) linear or non-linear, and (3) parametric or non-paramet-
ric. Probabilistic classifiers like logistic regression and naïve bayes compute
the probability of each object belonging to its predicted group, while non-
probabilistic ones like k-NN and SVM simply output the predicted group
as a categorical decision. As the name implies, linear classifiers like logistic
regression and naïve bayes categorize objects using linear combinations
of the predicting properties. From a geometrical perspective we can think
of them as relying on straight lines to demarcate group decision bounda-
ries. Non-linear classifiers like decision trees and k-NN, on the other hand,
can demarcate objects in various other ways and thus work better with
objects that are not linearly separable. Linear classifiers are also likely to be
parametric as the predicting properties are required to fulfil certain statistical
assumptions, while non-linear classifiers are likely to be non-parametric
and, hence, more flexible. The classifier of choice for this chapter is k-near-
est neighbors, the main reason being that it is conceptually most similar to
the subject matter of the previous chapter, k-means clustering.
K-means clustering and k-nearest neighbors both conceptualize (1)
objects as points in a Euclidean space, (2) each object property as a dimen-
sion in that space, (3) object positions as fixed by their properties, and (4)
proximity between objects as a measure of their similarity. These simi-
larities mean that we can afford to be more concise in the present con-
ceptual introduction as many of these points have already been explained
in Chapter 3. Just like k-means clustering, the extent to which point (4)
is true determines the usefulness of k-NN as a classification technique. In
some contexts there may, after all, be specific reasons to want to place
objectively similar things in different categories. Unlike clustering, how-
ever, there is no need for an optimization procedure of identifying ‘cen-
troids’ and collecting nearby objects to induce group labels. This is because
each object in the (training) dataset already has a group label, and the main
objective is to predict the group label of new or unseen objects based on
their (Euclidean) distances to existing ones.
The k-NN algorithm is relatively straightforward and intuitive. We start
by defining the value of the parameter k. While k stands for the number
of clusters in k-means clustering, in k-NN it denotes the number of neigh-
bors closest to the object whose group label we want to predict. For any
such object, the predicted label is simply the ‘majority label’ among k of its
most proximate neighbors. The same logic applies when using the k-NN
algorithm for regression instead of classification, where the outcome is a
continuous variable rather than a group label. In this case, the predicted
value is derived by averaging the values of the k most proximate neighbors.
To illustrate k-NN classification, let us revisit our COVID-19 scatterplot
from Chapter 3 showing the relative positions of 60 countries/regions along
four defining properties, reduced to two dimensions/axes (See Figure 4.1).
Recall that the four properties are first standardized by subtracting each
value by the mean and then dividing by the standard deviation. This time,
however, the countries/regions are not colored according to the emergent
cluster labels. They are, instead, colored according to the pre-existing label
of which continent they belong to. The more the same color tends to cluster
together, the more likely the properties that defined the countries/regions’
location on the plot would be successful in predicting the continent. Suppose
that we want to predict the continent of a new object indicated by the black
star. If our defined value of k=5, the algorithm locates five neighbors nearest
to the black star, with distances indicated by the dotted lines, and identifies
the majority label among them. Remember that the pictured distance is a
2D simplification of a 4D space, but the idea is the same. Since three out
of the five nearest neighbors to our black star are in Asia-Pacific, the pre-
dicted continent of the black star will be Asia-Pacific. The same logic can
be applied to predict other types of labels, such as various socioeconomic
categories, that are hypothesized to be linked with the four defining proper-
ties. Also notice that if k=1, the predicted label will simply be that of the
object’s nearest neighbor. On the other hand, there are different proposed
‘tie-breaks’ to resolve situations where there is no clear majority label. These

Figure 4.1 Predicting group labels with k-NN

include choosing a different k (especially an odd number), randomly choos-
ing one of the tied labels, or exercising manual judgment with contextual
information. According to its official documentation, scikit-learn breaks ties
by choosing the first that appears in the list of labels. We will be using scikit-
learn to implement k-NN like we did for k-means clustering.
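
To make the voting procedure concrete, the following minimal sketch (not
part of the chapter's dataset) fits scikit-learn's KNeighborsClassifier on five
invented, already-standardized points and predicts the label of a new point
with k=3. The coordinates and continent labels are hypothetical and chosen
purely for illustration.

#minimal k-NN sketch with invented toy data (illustration only)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

#five hypothetical objects described by two standardized properties
X_toy = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [2.0, 1.8], [1.9, 2.1]])
y_toy = np.array(['Asia-Pacific', 'Asia-Pacific', 'Europe', 'Europe', 'Europe'])

#with k=3, the prediction is the majority label among the three nearest neighbors
knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)
#the three nearest neighbors of (0.2, 0.3) are the first three points, so 'Asia-Pacific' wins 2-1
knn_toy.predict([[0.2, 0.3]])
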
Standard applications of k-NN begin with the familiar procedure of
randomly splitting the dataset into training and testing data, using the
training data as reference points to predict the labels of the testing data,
and evaluating the predictive accuracy based on the percentage of correctly
predicted labels. A good k-NN model can then be used to predict labels for
new data of interest. Just like for k-means clustering, the optimal value of k
can be determined prior to the analysis, with the most common evaluation
criterion being the value that maximizes predictive accuracy. This value is
often highly dependent on the nature of the training dataset. For example,
larger k values are known to be better for datasets with many outliers.
Large k values are also associated with high bias and low variance. Models
with higher bias pay relatively less attention to nuanced relations between
predicting features and outcomes, resulting in a simplified model that may
underfit the data. High bias also implies low variance because the model
will be relatively consistent even when fitted to different training datasets.
Smaller k values, on the other hand, are associated with low bias and
high variance. This means more attention is paid to these nuanced rela-
tions. The result is complex models that overfit the training data, vary
when applied to different training datasets, and, hence, fail to generalize
to testing data. We can intuit this from the case of the lowest possible
value k=1. Since the horizon of inference is only one neighboring unit,
similar objects lying around a wider general zone that may point towards
an alternative predicted label are ignored. Note that bias and variance are,
by definition, in a trade-off relationship, and how to strike the optimal bal-
ance in this bias-variance tradeoff across different algorithms remains an
interesting question in machine learning (Fortmann-Roe, 2012). There are
two common ways to determine the optimal k given these considerations.
The first is a rule-of-thumb estimate where k equals the square root of the
number of data samples, rounded off to the nearest integer. We should
therefore have k=8 as optimal in our example of 60 countries/regions.
The second, which is more comprehensive and will be used in the upcom-
ing case study, is to simply loop through a list of increasing k values and
choose the one that produces the highest predictive accuracy. This looping
approach is similar to the elbow method to determine the optimal number
of clusters in Chapter 3.
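
As a one-line sketch of the rule-of-thumb estimate, assuming a dataset of
60 samples as in the COVID-19 example:

#rule-of-thumb k: square root of the sample size, rounded to the nearest integer
import math
k_estimate = round(math.sqrt(60))   #sqrt(60) is about 7.75, so k_estimate = 8
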

Case study: Predicting therapy types from therapist-client language


The idea of using k-means clustering to induce natural groups among sub-
transcripts was exploited in Chapter 3 to address the specific theoretical
issue of linguistic (a)synchrony. In psychotherapy, however, research based
on natural pre-existing groups is much more common, especially those of a
comparative nature. Examples of pre-existing groups include therapy types
(e.g., psychoanalysis vs. CBT), speaker (therapist versus client), process and/
or outcome categories (good versus bad), treatment phases (early versus
late) and so on. While some of these groups are likely to be predictors (e.g.,
therapy types) and others outcome variables (e.g., outcomes), all have been
featured in psychotherapy language and discourse research. Qiu and Tay
(2023), for example, found significant interactions between therapy type
(psychoanalysis, humanistic, CBT, eclectic therapy) and speaker as factors
influencing language use in a 1-million word corpus. Therapists generally
speak in a more logical, formal, and confident way across approaches, but
nuanced patterns of variation exist within each approach. Studies involving
good versus bad outcomes include comparisons of metaphor (Levitt et al.,
2000) and pronoun usage patterns (Van Staden & Fulford, 2004). Good
outcomes are associated with transformative rather than merely repetitive
metaphor use, as well as semantically agentive rather than passive first-per-
son pronouns. Huston et al. (2019) explored the effects of both outcomes
and treatment phases. While good outcomes are reportedly associated with
the use of more positive emotion words in the early phase of therapy, no
evidence of change in language use from early to late phase was found.
Demiray and Gençöz (2018), however, found that the use of first-person
pronouns in pre-verbal positions changed significantly from early to late
phases of therapy. It should of course be noted that different studies tend
to define variables like outcome quality and treatment phase differently.
Our case study will as usual be presented stepwise below. We will
explore if language use in psychotherapy transcripts, measured by the four
LIWC summary variables, can satisfactorily discriminate between three
therapy types (psychoanalysis, CBT, and humanistic therapy) that under-
pin them. Besides addressing research questions on linguistic variation
across therapy types, basic classification techniques like k-NN allow prac-
titioners to self-evaluate if their own linguistic practices align with others
following the same therapy approach. It is also worth noting that, unlike
studies that use standard means-comparison techniques like t-tests or
ANOVA to derive general conclusions about whether two or more groups
differ, k-NN places the study in a more direct predictive context. This
gives us a more straightforward way to assign predicted group member-
ship to newly collected or observed transcripts.

Step 1: Data and LIWC scoring

Given the objective of our case study and the stated conceptual simi-
larities between k-means clustering and k-NN, it would seem that we
can reuse the dataset from Chapter 3, which are sub-transcripts of three
dyads each representing a therapy type. We would then simply use the
same LIWC scores as predictors and the pre-existing label of therapy
type as the predicted outcomes. However, this approach would, in fact,
not be advisable. The main reason is that while the point of Chapter 3
was to measure synchrony on a dyadic basis by clustering transcripts
produced by the same therapist and client, it is better here to use ran-
dom transcripts of different therapists and clients to avoid within-dyad
similarities from confounding within-therapy type similarities. In other
words, we want the potential similarities among transcripts to be attrib-
utable to their therapy types rather than the fact that they come from the
same people. There are statistical methods to control for such confounds
in cases where alternative datasets are not feasible, such as mean-center-
ing and fitting multilevel models, but they are beyond the present scope.
We will therefore use a different dataset of 150 transcripts randomly
sampled across different phases of treatment, 50 each from psychoa-
nalysis (251,172 words), CBT (254,913 words), and humanistic therapy
(251,926 words). Furthermore, unlike Chapter 3, each transcript is not
sub-divided into therapist and client utterances as we will be focusing on
the predictive potential of language use in general. Table 4.1 shows the
LIWC scores for all 150 sessions across the three therapy types. Note that
the sessions are non-consecutive.
It is now a good idea to make use of our pre-existing group labels to
generate 2D-scatterplots like those in Chapter 3 and the COVID-19 exam-
ple above. This visualization strategy gives us an initial visual estimation
of how likely the therapy types are distinguishable on the basis of LIWC
scores. We first import the required libraries and dataset, and just like in
Chapter 3, standardize the LIWC scores using scikit-learn’s StandardScaler
feature. Note that instead of setting ‘session’ as the index column in order
to facilitate subsequent data visualization, as was done in Chapter 3, we
now assign it to the therapy type of each transcript.

#import Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

#import dataset and scale scores
data = pd.read_csv('data.csv', index_col='Type')
scaler = StandardScaler()
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns, index=data.index)
Table 4.1 Summary variable scores for transcripts across three therapy types

Session Psychoanalysis CBT Humanistic

Ana Cl Auth Tone Ana Cl Auth Tone Ana Cl Auth Tone

1 15.17 14.04 86.52 3.15 8.49 15.99 92.94 51.18 5.99 12.05 89.67 33.92
2 9.34 33.42 85.29 19.33 15.12 25.32 92.85 58.51 9 13.76 89.37 35.98
3 13.34 31.25 85.28 35.86 5.35 20.47 92.83 50.54 3.37 28.34 88.31 44.79
4 16.81 19.45 84.5 71.95 4.59 14.94 92.75 60.95 4.05 25.41 87.67 47.29
5 7.8 24.39 84.14 37.45 6.27 12.79 90.42 71.04 15.47 22.87 87.61 52.98
6 13.34 31.05 83.34 80.62 4.79 33.69 88.52 45.91 16.14 22.22 85.27 40.3
7 9.74 18.65 82.39 17.22 6.76 16.41 88.04 50.45 9.76 17.3 84.89 39.14
8 14.88 70.88 78.39 34.27 6.09 16.98 86.3 43.6 7.8 23.77 84.25 52.62
9 12.39 55.84 77.9 19.38 6.27 18.84 85.88 78.6 7.23 45.65 84.16 30.59
10 6.8 28 77.51 44.32 5.26 19.92 83.7 81.93 3.7 18.59 83.79 53.46
11 15.05 33.74 77.48 16.63 4.97 23.68 82.26 75.61 13.49 29.84 81.4 38.44
12 11.95 24.91 76.85 44.43 10.32 21.19 80.48 44.49 7.4 19 80.6 33.32
13 19.59 64.84 75.72 44.54 9.28 20.66 79.56 33.91 8.33 25.57 79.57 47.62
14 18.31 50.8 75.56 74.48 4.86 33.87 76.55 18.63 9.76 21.19 79.31 20.22
15 22.5 62.65 75.17 68.28 5.75 32.15 76.13 64.51 5.65 47.84 78.43 43.22
16 10.12 42.86 74.43 39.62 9.75 66 75.88 26.93 12.23 34.34 77.4 37.81
17 5.28 38.44 74.36 45.29 4.23 38.41 73.35 34.36 8.36 19.96 77.04 26.94
18 9.22 56.6 73.03 45.39 11.4 34.03 70.42 35.74 8.02 33.58 75.97 46.55
19 11.54 39.55 72.72 61.63 6.11 82.13 70.28 44.49 6.41 65.26 75.86 44.72
20 20.18 44.24 72.55 41.42 6.53 76.84 68.99 43.18 11.81 72.85 75.38 35.53
21 15.57 62.09 72.03 57.66 7.26 46.03 67.41 44.42 25.06 50.5 72.96 36.25
22 6.81 28.55 69.6 27.03 8.36 83.54 64.03 44.83 5.81 30.36 71.72 42.36
23 17.73 21.95 69.21 60.4 10.75 44.33 61.92 47.22 5.76 34.66 70.68 39.84
24 7.39 69.04 69 72.32 15.69 80.49 60.4 65.98 6.39 73.28 70.47 39.13
25 13.05 67.06 68.92 53.45 11.37 87.56 59.66 46.79 2.53 31.73 69.25 58.63
26 8.23 41.19 68.69 21.42 7.63 46.31 59.13 46.03 3.63 46.31 68.51 46.11
27 14.09 75.14 68.64 45.02 12.74 90.6 57.11 72.19 10.81 56.49 67.51 61.87
28 17.92 34.94 68.1 7.06 9.65 93.44 56.84 94.84 3 31.11 63.51 64.73
29 13.35 49.12 67.98 40.97 8.14 75.07 56.45 46.54 3.93 55.31 63.04 66.47
30 5.37 41.6 66.14 35.49 19.2 92.56 56.03 75.48 13.48 74.24 62.15 65.14
31 5.68 53.93 65.38 65.68 17.9 88.02 54.4 64.12 21.1 44.3 60.26 21.73
32 23.58 50.61 64.29 86.79 13.77 89.53 51.86 71.25 14.26 60.8 59.47 34.18
33 29.3 44.91 63.57 70.14 11.88 92.89 51.82 62.78 10.64 56.62 57.91 65.72
34 6.11 68.1 62.25 64.04 8.35 91.88 50.45 85.78 4.44 77.66 56.43 59.24
35 17.55 54.6 62.01 41.44 15.4 90.23 50.24 58.7 3.59 58.22 55.25 62.36
36 6.86 72.46 60.4 16.54 4.6 81.59 50.22 39.11 4.78 68.84 55.12 30.38
37 21.26 71.2 58.98 82.83 10.53 93.57 50.21 69.14 17.35 73.35 51.41 25.77
38 7.71 58.64 57.4 39.98 16.37 92.4 48.83 74.9 10.82 71.38 50.59 41.91
39 30.34 65.4 56.89 62.38 9.17 94.7 47.28 74.81 5.44 77.45 49.75 25.77
40 17.25 25.91 56.56 75.07 16.87 93.37 46.92 61.21 8.91 61.91 47.73 52.5
41 9.03 28.76 56.05 34.75 7.51 95.9 46.58 55.78 6.11 94.37 42.3 24.95
42 5.24 32.06 55.45 27.77 15.01 91.05 44.97 74.45 13 66.53 40.02 43.36
43 7.29 55.85 54.73 27.97 24.48 87.23 43.96 72.94 9.04 85.29 33.51 36.94
44 19.57 62.62 50.87 71.48 17.91 88.03 40.12 51.58 6.64 78.97 32.89 24.8
45 37.39 57.16 45.17 70.29 8.73 59.25 39.23 35.19 19.37 73.3 31.49 40.83
46 7.48 44.18 38.18 14.34 12.06 90.37 39.1 58.91 21.72 65.42 30.42 35.35
47 5.7 81.48 34.72 6.73 8.77 96.26 37.95 86.43 15.09 89.27 29.11 32.58
48 21.78 82.88 32.67 76 7.65 98.19 34.35 76.51 6.65 92.54 28.21 32.43
49 9.36 81.69 27.73 14.44 5.27 91.81 33.42 87.93 4.07 86.05 27.9 40.11
50 8.64 83.22 22.22 58.9 7.78 81.71 31.38 14.56 11.88 71.44 27.85 41.37
The following code will then employ principal components analysis to
reduce our four properties/dimensions to two dimensions, just like in
Chapter 3, and plot the objects/transcripts in this 2D-space. Note that the
transcripts are now color-coded by therapy type.

#reduce dataset to 2D and plot distribution of pre-existing groups
from sklearn.decomposition import PCA as sklearnPCA
pca = sklearnPCA(n_components=2)
data2D = pd.DataFrame(pca.fit_transform(data),
                      columns=['Dim_1', 'Dim_2'], index=data.index)
sns.scatterplot(data=data2D, x='Dim_1', y='Dim_2', hue='Type')

The resulting scatterplot is shown in Figure 4.2.


Recall that each transcript’s position is fixed by its four LIWC scores,
which have been standardized and reduced to two dimensions. The spa-
tial distance between transcripts therefore reflects their similarity based on
these scores. Similar to the COVID-19 example above, the more transcripts
of the same color (i.e., therapy type) cluster together, the better the scores
would be at predicting the types. It appears that Figure 4.2 is less satisfac-
tory than Figure 4.1 in this regard. All three colors overlap to a consider-
able extent, which implies that predicting the label of objects based on the
majority vote of their k neighbors would be less than straightforward. As
always, however, visualizations provide only estimates to be confirmed by
the actual k-NN process described next.

Figure 4.2 Scatterplot of distribution of transcripts across therapy types


Step 2: k-NN and model validation

Similar to Monte Carlo simulations (Chapter 2) and other supervised tech-
niques in general, we now randomly split the dataset into training and
testing data, using the latter to evaluate the predictive accuracy of a model
fitted with the former. Before using train_test_split to do so, we first sep-
arate the four standardized LIWC scores and the outcome group labels
into two arrays X and y, respectively, so that we can later tell Python to
treat them as predictors and outcomes, respectively, when fitting the k-NN
model. This procedure applies generally to all types of predictive modeling
(e.g., linear and logistic regression) with Python. The code below assigns
the four variable scores in our dataframe to X, and the group labels that
were set as the dataframe index to y.

#create arrays for the predictors and outcome
X = data.values
y = data.index.values

We then randomly split X and y into training and testing datasets (X_train,
X_test, y_train, y_test) as before. Doing so does not break the correspond-
ence between transcripts and their labels. That is to say, each row across
X_train and y_train, as well as X_test and y_test, will still point to the same
transcript. Recall that test_size determines the ratio of this split, which is
set to 75:25 here. We therefore have 112 transcripts in the training dataset
and 38 in the testing dataset. Additionally, because our transcripts have
pre-existing group labels, it would be desirable to ensure that the training
and testing datasets maintain the proportion of group labels (i.e., therapy
types) in the entire dataset. We can do this by specifying stratify=y, which
tells Python to perform stratified random sampling based on the frequencies
observed in y, the group labels. We will discuss the limitations and alterna-
tives to this ‘one-off’ splitting approach towards the end of the chapter.

#generate random training set with stratify=y to ensure fair split among types
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0, stratify=y)
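
One optional check, not part of the original workflow, is to count the group
labels in each split and confirm that stratification has roughly preserved the
50/50/50 balance of the full dataset:

#count therapy-type labels in the training and testing splits
import numpy as np
print(np.unique(y_train, return_counts=True))   #expect roughly 37-38 transcripts per type
print(np.unique(y_test, return_counts=True))    #expect roughly 12-13 transcripts per type
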

We are now ready to determine the optimal value of k, or the number of
neighbors to consider when determining the majority label. As explained
above, this is done by simply looping through a list of k from 1 to a speci-
fied maximum number (k=20 is often sufficient), fitting a k-NN model
using the training data in each iteration, computing predictive accuracies,
and at the end of the loop identifying the k with the highest accuracy. It is
usually enough to consider test_accuracy but we will introduce two addi-
tional measures for illustrative purposes below.

• test_accuracy: this is the usual measure. We fit/train the model using the
training data, apply this model onto the testing data only, and compute
the percentage of correctly predicted group labels among the testing
data.
• train_accuracy: we fit/train the model using the training data, apply this
model back onto the training data, and compute the percentage of cor-
rectly predicted group labels among the training data.
• overall_accuracy: we fit/train the model using the training data, apply
this model back onto the whole dataset (i.e., training + testing), and
compute the percentage of correctly predicted group labels among the
whole dataset.

The main purpose of presenting the three different measures is to illustrate
the earlier conceptual point on bias-variance tradeoff. Recall that lower
k values result in more complex models that do not generalize beyond
training data well. This implies that lower k values will often have higher
train_accuracy than test_accuracy. Higher k values, on the other hand,
may show the opposite. The overall_accuracy generally lies somewhere
between train_accuracy and test_accuracy. These will be made visually
apparent by plotting the values later.
The code below will perform the loop and generate a plot of these
accuracy measures against the values of k. We first define the range of k
to be tested using np.arange. Note that to set the maximum k to 20, we
should define np.arange(1,21) because the upper bound (21) is exclusive.
Three empty arrays of this range are then generated with np.empty. These
serve as placeholders to store the forthcoming accuracy measures. It is
interesting to note that these arrays are strictly speaking not ‘empty’,
but filled with arbitrary uninitialized values that will be overwritten by the
measures later. We can also use np.zeros instead of np.empty, the only
difference being that the placeholder arrays would be filled with zeroes
instead. However, this is marginally slower since Python has to explicitly
set the zeroes.
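
A quick illustration of the difference, purely for demonstration (neither line
is needed for the analysis itself):

import numpy as np
np.empty(3)   #returns an array of arbitrary leftover values that differ from run to run
np.zeros(3)   #always returns array([0., 0., 0.])
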

#setup arrays to store various predictive accuracies
neighbors = np.arange(1, 21)
test_accuracy = np.empty(len(neighbors))
train_accuracy = np.empty(len(neighbors))
overall_accuracy = np.empty(len(neighbors))
We then initiate the for-loop that will iterate 20 times in our example,
each time incrementing k by 1. A k-NN model is fitted each time specify-
ing X_train and y_train as the training predictors and labels, respectively.
The three different accuracy measures are computed in turn using
knn.score and stored in the arrays described above. Note the use of enu-
merate instead of range (Chapter 2). The main difference is that range
iterates over just one numerical range but enumerate iterates over what
are known as tuples, or ordered pairs. We need the latter here because
if we use for k in range(1,21) and then store the accuracy measures as
test_accuracy[k], there would be a conflict between k starting at 1 (the
minimum number of neighbors) and the first element of the array being
test_accuracy[0] instead of test_accuracy[1]. Having an ordered pair i, k
resolves this issue.
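
A small self-contained illustration of this behaviour (the short range below
is invented for demonstration):

#enumerate yields (index, value) pairs: i counts array positions from 0, k is the neighbor count
import numpy as np
for i, k in enumerate(np.arange(1, 4)):
    print(i, k)   #prints 0 1, then 1 2, then 2 3
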

#loop over values of k, fitting a k-NN model and computing accuracies each time
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    #compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)
    #compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    #compute accuracy on the whole set
    overall_accuracy[i] = knn.score(X, y)
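
As an optional supplement to the plot generated below, the stored arrays
can also be inspected programmatically to locate the k value with the
highest testing accuracy; this short check is not part of the chapter's
original code:

#identify the k value with the highest testing accuracy
best_k = neighbors[np.argmax(test_accuracy)]
print(best_k, test_accuracy.max())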

We now have three arrays of accuracy measures, which, if examined indi-
vidually, will show the corresponding values from k=1 to 20. As described
above, they are derived from computing the percentage of correctly pre-
dicted labels using the trained model on the testing, training, and whole
dataset, respectively. The following code will plot these results on the same
axis for convenient visualization. A fourth derived measure – average accu-
racy – which is simply the average of the three measures at each k, is also
included. Figure 4.3 shows this plot.

#generate plot. note that the plot illustrates bias-variance tradeoff
plt.title('Accuracy with different k')
plt.plot(neighbors, test_accuracy, label='Testing accuracy')
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.plot(neighbors, overall_accuracy, label='Overall accuracy')
plt.plot(neighbors, (test_accuracy+train_accuracy+overall_accuracy)/3,
         ':', label='Average')
plt.legend()
plt.xticks(neighbors)
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

Figure 4.3 Plot of accuracy for different k values

The plot visualizes the bias-variance tradeoff mentioned earlier. Training
accuracy starts off higher than testing accuracy but becomes lower at
some high k values, while overall accuracy lies in between throughout.
Assuming equal consideration of all three measures, the dotted average
accuracy line is highest at k=1, suggesting this to be the optimal k value.
However, recall that in many cases k=1 is likely to lead to overfitting and
lower testing accuracy (0.68 here), which is less ideal if predicting the
group labels of new objects is of primary importance. It is therefore rea-
sonable in this case to go for the highest testing accuracy score, giving us
an optimal k=4. Note that k=4 also results in the second highest average
accuracy.
The important point to bear in mind is that there is not always an objec-
tively ‘correct’ decision when determining optimal parameters for classifi-
cation and other data analytic tasks. Instead, the best decision may require
careful consideration of different factors, the most important of which is
the exact objective of the analysis at hand.
Having found the optimal k, the following code simply (re)fits a k-NN
model onto the training data by specifying k=4 and (re)printing the test,
train, and overall accuracy measures. We say refit and reprint because
these were actually already done during the fourth iteration of the loop
above when searching for optimal k, but repeating them makes things
clearer. The resulting accuracy measures should match those in Figure 4.3.

#(re)fit the model to the training data, specifying optimal k
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train, y_train)

#(re)print accuracy measures (test, train, overall)
knn.score(X_test, y_test)
knn.score(X_train, y_train)
knn.score(X, y)

The accuracy measures give us an overall idea of how well the k-NN model
can predict therapy types from transcript LIWC scores. Bringing our focus
back to the most commonly used test accuracy score, which is 0.71 (71%)
in this case, we conclude that 27 out of the 38 transcripts in our test dataset
were correctly predicted. This can be considered an adequate form of model
validation for basic purposes, similar to the logistic regression approach
used in Chapter 3. However, we might want to know further details like
how this accuracy varies across the three therapy types, because a generic
‘global’ measure may not always work well. Taking an extreme example,
if 99% of our dataset comprises just one therapy type, our model would
be measured as 99% accurate even if it just blindly predicts everything to be
that type. There are, in fact, a number of more nuanced model validation
measures that we will now introduce. The first step is, as before, to generate
a confusion matrix that reveals the distribution of correct and wrong predic-
tions by therapy types. The code below will first generate the predicted labels
(test_pred) from our model on the testing dataset, and then cross tabulate this
with the actual labels (y_test) to produce a 3x3 confusion matrix. Specifying
y_test before test_pred will position the former as rows and the latter as col-
umns, which is the standard convention for observed and predicted values,
respectively. Note the additional specification of labels=[“CBT”, “HUM”,
“PA”] to tell Python to arrange the rows in this sequence (HUM=humanistic
therapy, PA=psychoanalysis). If unspecified like in Chapter 3, the order will
be based on the sequence in which the labels appear in the data.

#predict labels for the testing set
test_pred = knn.predict(X_test)

#generate confusion matrix
metrics.confusion_matrix(y_test, test_pred, labels=["CBT", "HUM", "PA"])
The confusion matrix is shown below with descriptive labels added.


      CBT  HUM  PA
CBT  [ 12    1   0]
HUM  [  1    8   4]
PA   [  4    1   7]
Recall that the top-left to bottom-right diagonal indicates correct pre-
dictions where actual and predicted labels match. The diagonal sum
(12+8+7=27) as well as total sum (38) tally with the accuracy score calcu-
lated above. Beyond this global accuracy measure, we can derive further
information by looking across each row of actual values, and then down
each column of predicted values. Looking across the first row, 12 CBT
transcripts were correctly predicted as CBT, 1 CBT transcript was wrongly
predicted as HUM, and no CBT transcript was wrongly predicted as PA,
yielding a percentage of 92%. This is clearly the highest compared to the
HUM (62%) and PA rows (58%). These percentages express the ratio of
the number of true positives (i.e., CBTs correctly identified as CBT) to
the sum of true positives and false negatives (CBTs incorrectly identified
as something else). This is known as the recall score, which measures the
ability of a model to find all instances of a certain group.

Recall = True positives / (True positives + False negatives)

Now let us look down the columns to derive a slightly different metric.
Looking down the first column, 12 CBT transcripts were correctly pre-
dicted as CBT, and 1 HUM transcript and 4 PA transcripts were wrongly
predicted as CBT. This time, CBT scores only 71%, which is lower than HUM (80%) but higher than PA (64%). These percentages express the ratio of the number of true positives (CBTs correctly identified as CBT) to the sum of true positives and false positives (something else incorrectly identified as CBT). This is known as the precision score, which measures how often a model's positive identifications are actually correct, that is, how well it avoids mistaking non-members for members.
Precision = True positives / (True positives + False positives)

Therefore, high recall does not entail high precision, the contrasting results
of CBT and HUM being illustrative. What are the implications of these
different nuanced measures? In general, a model with high recall but low
precision for a group (CBT) is able to identify many group members, but
it is also more likely to misidentify non-group members as members. This
is a case of quantity over quality because many members are identified at the expense of wrongly identifying non-members. Conversely, a model with high precision but low recall (HUM) is not as good at identifying group members, but it is less likely to misidentify non-group members as members. This means quality over quantity because while some members are missed, the identifications made are more likely to be correct.
Putting this into our present context of a language-based model, it seems
that CBT has a set of linguistic characteristics (as measured by LIWC) that
are shared by a large number of its transcripts, but also occur more than
occasionally in transcripts of other therapy types. At the other end, HUM
has a narrower core of characteristics that clearly define a fair number
of its transcripts, which at the same time are rarely seen in other therapy
types. This is yet another example of a potential hypothesis emerging from
the model validation process, similar to the Monte Carlo simulations in
Chapter 2. We will discuss this overlooked dimension of data analytics in
greater detail in the concluding chapter of this book.
Besides global accuracy, recall, and precision, there is yet another meas-
ure known as the F1 score that is essentially an averaged measure of recall
and precision. We can think of it as evaluating the extent to which posi-
tive predictions turned out to be correct, balancing the two types of errors
(false negatives and false positives). The F1 score is the harmonic mean of recall and precision, which differs from the arithmetic mean by giving more weight to small values and less to large ones, so that a low recall or a low precision pulls the score down.
F1 = 2 * Recall * Precision / (Recall + Precision), or equivalently,
F1 = True positives / (True positives + 0.5 * (False positives + False negatives))
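As a quick worked check using the CBT figures derived above (recall ≈ 0.92, precision ≈ 0.71), F1 = 2 × 0.92 × 0.71 / (0.92 + 0.71) ≈ 0.80. Equivalently, using the confusion matrix counts (12 true positives, 5 false positives, 1 false negative), F1 = 12 / (12 + 0.5 × 6) = 0.80. Both routes give the same value, which reappears in the classification report below.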

It turns out that scikit-learn provides a convenient way to generate all the
above measures in a table known as a classification report. This can be
done in one line of code below. Be mindful to consistently specify y_test
before test_pred to avoid mistakes. Table 4.2 shows the resulting classifica-
tion report.

#generate classification report

print(metrics.classification_report(y_test, test_pred))

The first three columns show the precision, recall, and f1-scores correspond-
ing to each of the three groups. The support column refers to the number
of samples in the (test) dataset upon which these scores are computed.
The equal distribution across groups is a result of our earlier specification
of stratify=y when splitting the original dataset into train and test data.
Without this specification, the three groups may be unevenly represented, leading to potentially misleading measures.

Table 4.2 Classification report for testing data

              Precision  Recall  f1-score  Support

CBT                0.71    0.92      0.80       13
HUM                0.80    0.62      0.70       13
PA                 0.64    0.58      0.61       12
accuracy                             0.71       38
macro avg          0.71    0.71      0.70       38
weighted avg       0.72    0.71      0.70       38

The accuracy measure in the fourth row is simply the global accuracy score described above, where 27
out of 38 transcripts in the test dataset (71%) were correctly predicted.
Notice that the support value is now 38 since the accuracy measure is
based on all 38 transcripts. In the bottom two rows, we have the macro
average and weighted average for the precision, recall, and f1-scores. The
macro average is simply the arithmetic mean of each score across all three
groups, and can be reported as an overall evaluation of the model. The
weighted average, on the other hand, is a proportional measure based on
the sample size of each group. For example, the weighted precision average is 0.71 * (13/38) + 0.80 * (13/38) + 0.64 * (12/38) = 0.72. The two types
of averages do not differ much in this case because of the evenly distrib-
uted groups. Some analysts recommend the weighted f1-score as a means
to evaluate and compare different classification models like k-NN, rather
than the simple global accuracy measure.
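As a small sketch of how these averages come about, the macro and weighted averages of the precision column can be reproduced with numpy, using the per-group scores and support counts from Table 4.2 (small discrepancies with the table are due to rounding of the displayed values):

#macro vs. weighted average of the precision scores in Table 4.2
precision = np.array([0.71, 0.80, 0.64])
support = np.array([13, 13, 12])

np.mean(precision)                      #macro average
np.average(precision, weights=support)  #weighted average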
The model validation process described above was fairly extensive and produced several nuanced measures that give us general confidence that language is predictive of therapy types. These measures are nevertheless
based on just one random train-test dataset split made at the beginning,
and there is always a chance of a biased selection leading to misleading
predictive accuracy measures. A potential solution introduced in Chapter
2 is to use a resampling approach where different training and testing data-
sets are drawn from the original data. Five different sets of training data
each comprising a contiguous block of session transcripts were purposively
chosen back then to test which treatment phase is most amenable to simu-
lation. The contiguity of sessions needed to be preserved because the ses-
sion sequence was analytically relevant, as will again be the case for our
time series analyses in Chapter 5.
In statistical parlance, this implies that the data were not independently
and identically distributed because sequential relationships between ses-
sions were assumed. In the present case however, because session sequence
is not relevant and the transcripts are randomly drawn from different
dyads, we can draw multiple random training and testing datasets with
a widely used approach known as k-folds cross validation. The idea is
straightforward – the k-value represents a chosen number of folds, or the
number of equally sized groups the original dataset is randomly split into.
Typical values include k=5 or k=10. The first fold is then treated as the
testing data, the remaining data are used to train the model, and the accuracy measures above are computed. Setting k=5 thus implies a 20%–80% split
between testing and training data, and k=10 a 10%–90% split. The process
is then repeated for all folds, giving us k number of accuracy measures in
the end. This way, all parts of the data will at some point be used as training
and testing data in a non-overlapping way, allowing an unbiased evaluation
of the model at hand. Some analysts also advocate first splitting the dataset
into training and testing data, and then performing cross-validation on the
training data only – in other words, to not touch the testing data until we
are satisfied with the model’s performance on training data. The final list
of k accuracy measures is useful in several ways. Besides simply averaging
them to derive an overall evaluation, they can reveal potential problems.
For example, a large variance or inconsistency among the values suggests
that the model is unable to provide consistent predictions, signaling the
need to refit the model – with a different k number of neighbors, in this case.
The following code imports the necessary package from scikit-learn and
performs k-folds cross validation specifying (1) the previously instantiated
k-NN classifier with four neighbors (knn), (2) the original unsplit dataset
(X and y), and (3) ten folds (cv=10). We can also specify the type of evalua-
tion score needed. Options include accuracy (the global measure described
above) as well as precision, recall, and f1. All ten scores and their average
are then printed out.

#k-folds cross validation

from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(cv_scores)
print("Average 10-Fold CV Score:", np.mean(cv_scores))

Our ten scores in this case are as follows, giving an average of 0.567.
[0.46666667, 0.46666667, 0.53333333, 0.46666667, 0.33333333,
0.6, 0.8, 0.73333333, 0.6, 0.66666667]
There is cause for concern here because the average is substantially
lower than the earlier classification report figures (all >0.7), and the large
variability in the ten scores suggests that our model cannot consistently
provide accurate predictions on different sets of unseen data. Possible
solutions include going back to the drawing board and refitting the k-NN
model with a different k value, or retreating even further to reconsider
the transcript dataset. We will not illustrate the whole process again, as
the more important point being made is that model fitting and validation
in real-world situations require careful investigative work, and sometimes
iterative action – something we will again observe in Chapter 5.
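To give a flavor of what such iterative work might look like, the short sketch below cross-validates every candidate k from 1 to 20 and prints the mean and standard deviation of the ten fold scores for each. This is not a step taken in the chapter, just one plausible way of acting on the concern raised above: a k with a high mean and a small spread would be a more defensible choice.

#one possible follow-up: cross-validate a range of k values
for k in range(1, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10, scoring='accuracy')
    print(k, round(np.mean(scores), 3), round(np.std(scores), 3))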

Python code used in this chapter


The Python code used throughout the chapter is reproduced in sequence
below for readers’ convenience and understanding of how the analysis
gradually progresses.

#import Python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

#import dataset and scale scores

data=pd.read_csv('data.csv',index_col='Type')
scaler=StandardScaler()
data=pd.DataFrame(scaler.fit_transform(data),columns=data.columns,index=data.index)

Initial visualization of pre-existing groups

#reduce dataset to 2D and plot distribution of pre-existing groups
from sklearn.decomposition import PCA as sklearnPCA
pca = sklearnPCA(n_components=2)
data2D=pd.DataFrame(pca.fit_transform(data),columns=['Dim_1','Dim_2'],index=data.index)
sns.scatterplot(data=data2D,x='Dim_1',y='Dim_2',hue='Type')

The k-NN process

#create arrays for the predictors and outcome

X=data.values
y=data.index.values

#generate random training set with stratify=y to ensure fair split among types
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=0, stratify=y)
#setup arrays to store various predictive accuracies

neighbors = np.arange(1, 21)
test_accuracy = np.empty(len(neighbors))
train_accuracy = np.empty(len(neighbors))
overall_accuracy = np.empty(len(neighbors))

Determine the optimal value of k

#loop over values of k, fitting a k-NN model and computing accuracies each time
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,y_train)
    #compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)
    #compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    #compute accuracy on the whole set
    overall_accuracy[i] = knn.score(X, y)

#generate plot. note that the plot illustrates bias-variance tradeoff

plt.title('Accuracy with different k')
plt.plot(neighbors, test_accuracy, label = 'Testing accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training accuracy')
plt.plot(neighbors, overall_accuracy, label = 'Overall accuracy')
plt.plot(neighbors, (test_accuracy+train_accuracy+overall_accuracy)/3, ':', label = 'Average')
plt.legend()
plt.xticks(neighbors)
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

#(re)fit the model to the training data, specifying optimal k

knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train,y_train)

Model validation

#(re)print accuracy measures (test, train, overall)

knn.score(X_test, y_test)
knn.score(X_train, y_train)
knn.score(X, y)
#predict labels for the testing set

test_pred = knn.predict(X_test)

#generate confusion matrix

metrics.confusion_matrix(y_test, test_pred, labels=["CBT", "HUM", "PA"])

#generate classification report

print(metrics.classification_report(y_test, test_pred))

#k-folds cross validation

from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(cv_scores)
print("Average 10-Fold CV Score:", np.mean(cv_scores))

References
Aggarwal, C. C. (Ed.). (2020). Data classification: Algorithms and applications.
Chapman and Hall.
Demiray, Ç. K., & Gençöz, T. (2018). Linguistic reflections on psychotherapy:
Change in usage of the first person pronoun in information structure positions.
Journal of Psycholinguistic Research, 47(4), 959–973.
Fortmann-Roe, S. (2012). Understanding the bias-variance tradeoff. http://scott.fortmann-roe.com/docs/biasvariance.html
Huston, J., Meier, S. T., Faith, M., & Reynolds, A. (2019). Exploratory study of
automated linguistic analysis for progress monitoring and outcome assessment.
Counselling and Psychotherapy Research, 19(3), 321–328.
Levitt, H., Korman, Y., & Angus, L. (2000). A metaphor analysis in treatments
of depression: Metaphor as a marker of change. Counselling Psychology
Quarterly, 13(1), 23–35.
Qiu, H., & Tay, D. (2023). A mixed-method comparison of therapist and
client language across four therapeutic approaches. Journal of Constructivist
Psychology, 36(3), 337–60. https://doi.org/10.1080/10720537.2021.2021570
Van Staden, C. W., & Fulford, K. W. M. M. (2004). Changes in semantic uses of
first person pronouns as possible linguistic markers of recovery in psychotherapy.
Australian and New Zealand Journal of Psychiatry, 38(4), 226–232. https://doi.org/10.1111/j.1440-1614.2004.01339.x
5 Time series analysis

Introduction to time series analysis: Squeezing juice from sugarcane


From Chapters 2 to 4, simulations, clustering, and classification were dis-
cussed as techniques that offer solutions for discourse-relevant issues like
missing data and linguistic (a)synchrony. As mentioned in the introduction
chapter, the aim was to illustrate how to harness data analytic features
that are particularly well suited for specific discourse contexts and ques-
tions. We will now introduce time series analysis as our fourth and final
technique that capitalizes on the sessional nature of psychotherapy talk to
perform useful tasks like modeling and forecasting LIWC variable scores.
The conceptual introduction that follows will be the most technical so far
due to the more complex nature of time series analysis. Some sections draw
from similar introductions elsewhere (Tay, 2019, 2022), and readers may
also refer to well-known textbooks for further understanding (Bowerman
& O’Connell, 1987; Chatfield, 1989).
A time series is a set of consecutive measurements of a random variable
made at equal time intervals. Typical time series data in the social and
physical world include stock prices, rainfall, and birth/death rates, and
these are the contexts in which time series analysis is often used. Another
topical example mentioned in Chapter 3 is COVID-19 cases over time
and other related phenomena from the earlier SARS epidemic (Earnest
et al., 2005; Lai, 2005). The simple code below generates and plots 100
random numbers from a normal distribution with mean 0 and variance
1. Remember that the plot will differ with each run because of the rand-
omization, but we can use np.random.seed to overcome this if necessary
(see Chapter 2). We see the familiar fluctuations that are a key visual trait
of time series data, if we imagine that the values are measured one after
another at equal intervals.

#generate and plot 100 random values

pd.DataFrame(np.random.normal(0,1,100)).plot()
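If the same plot needs to be reproduced exactly across runs, the generator can be seeded first, as noted above; this is a minimal sketch and the seed value itself is arbitrary (42 below is just an example).

#optional: fix the random seed so the same series is generated on every run
np.random.seed(42)
pd.DataFrame(np.random.normal(0,1,100)).plot()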


Stock prices, rainfall, and birth/death rates are time series data measured
over very different time intervals, or sampling frequencies, from seconds to
decades. However, their analysts share the common objective of discover-
ing patterns underlying the fluctuations and making reliable predictions
on that basis. Two broad information sources can support this objective.
In cases where the phenomenon and its contributing factors are theoreti-
cally well understood, we can use the contributing factors as predictor
variables and build regression models to predict the phenomenon. Birth
or death rates may, for example, be predicted from measures of well-being
like income, education, and health-care levels. These predictions may
either be longitudinal (i.e., forecasting future values for the same sample)
or cross-sectional (predicting values of another sample at a similar time
point). However, in cases where the phenomenon and contributing fac-
tors are ‘messier’ or less understood, it may well be the case that past
values become the most reliable, or only available, predictors of future
values. In many such time series we, in fact, find that successive values
are systematically correlated and dependent on one another. Statisticians
call this ‘autocorrelation’ or ‘serial correlation’. Rather than using exter-
nal factors as predictors, we can therefore build regression models using
autocorrelational information from past values instead. The parameters of
these models can then be interpreted to understand structural regularities
underlying the series.
This is the crux of time series analysis – more specifically, the widely
used Box-Jenkins method (Box et al., 2015), which employs a class of
statistical models known as ARIMA (Autoregressive Integrated Moving
Average) models to analyze time series data. Note that ARIMA models are
used for numerical time series, which our present LIWC scores exemplify.
It is also possible to model and forecast categorical time series data, if say
each session is described by some categorical feature. However, this would
require different approaches like Markov chain models (Allen, 2017) that
are beyond the present scope. The meaning of ‘autoregressive integrated
moving average’ will be clarified later in the chapter. Typical applications
lie in domains where (1) the effect of time is of primary relevance to the
phenomenon of interest, and (2) the phenomenon arises from a data gen-
erating process (cf. Chapter 2) that is not fully understood, as mentioned
above. Point (2) relates to what Keating and Wilson (2019) aptly describe
as the ‘black box’ philosophy of the Box-Jenkins method. It should be clear
why financial data, like stock prices, and health data, like the incidence of
diseases, are typical contexts of application. In both domains, time is of the
essence and (causal) mechanisms of prices and infection rates are seldom
transparent. Interestingly, there is also good reason to believe that various
types of language/discourse-related data exhibit somewhat similar behavior
and are therefore potential candidates for such analyses. Let us consider the
first key feature of time being a primary (if not always explicit) variable of
concern. Examples of longitudinally sensitive language/discourse research
range broadly from time-based psycholinguistic experiments (Tay, 2020)
to grammaticalization (Hopper & Traugott, 2003) and sociolinguistic
variation and change (Tagliamonte, 2012). The time intervals of interest
likewise range from seconds to decades, and it often makes sense to assume
that past and present manifestations of the phenomenon are not independ-
ent of one another. In other words, some degree of autocorrelation exists
in the series. The second key feature, which basically precludes reliance on
pre-specified predictor variables, also squares well with received discourse
analytic wisdom. We will see that the Box-Jenkins method relies only on
the observed series and extracts patterns from it until pure randomness
or, in statistical parlance, white noise remains. These patterns are essen-
tially based on the aforementioned autocorrelation in the series. A use-
ful analogy for this process that brings back many childhood memories is
the extraction of juice from sugarcane. Juice is extracted by passing raw
sugarcane multiple times through an extractor until the fibrous residue
remains, just like autocorrelational patterns are filtered stepwise from a
raw time series until random white noise remains. Although the math-
ematical definition of these autocorrelation patterns is the same regardless
of context, the ‘why’ and ‘so what’ of their presence must be understood in
the light of domain-specific knowledge, which is some linguistic/discourse
theory in this case. This line of reasoning is consistent with the mainstream
discourse analytic logic of anticipating emergent patterns and phenomena
rather than necessarily hypothesizing them beforehand.

Structure and components of time series data


To further illustrate why time series data may require special treatment
because of their characteristic structure and components, imagine a hypo-
thetical series y measured over 30 intervals. The plot of y against time t is
shown on the left of Figure 5.1. We can see the typical fluctuating shape if
we connect the dots. Suppose we overlook the possibility that successive
values of y are correlated and just fit a standard linear regression model
to predict y in terms of t. This is tantamount to assuming that the values
of successive y’s have no bearing on one another. The resulting regression
line is shown, and the right of Figure 5.1 is the plot of residuals, or the
difference between the observed and predicted value of y at each interval t.
At first glance the model appears to fit well as the measured y’s are close
to the regression line and the residuals are distributed around zero in both
directions. The translucent band around the regression line indicating 95%
confidence intervals is thin, and R2 = 0.857 is high with almost 86% of the
total variance in y accounted for. The model (yt = 0.812t + 14.274) could
thus be considered good in general. The model parameters (slope = 0.812, intercept = 14.274) estimate the initial value of y = 14.274, and every unit increment of t is predicted to increase y by 0.812.

Figure 5.1 A standard regression and residuals plot

However, a closer look at the residuals plot reveals that the prediction
errors tend to persist in the same direction over many consecutive inter-
vals. The most recent predicted values of y from t = 20 to 30 (circled) are
all lower than the actual values with signs of a worsening upward devia-
tion. This is where we can make a useful distinction between prediction
and forecasting. Prediction is a more general term, which includes using
the model to generate predictions for existing values and the respective
residuals, while forecasting refers to the prediction of future values that
are not yet known. Therefore, even if our high R2 model fits the existing
data well and makes good predictions, it will make systematic forecast
errors for the 31st interval and beyond. This is another example of poten-
tial underfitting, as briefly discussed in Chapter 4, as the linear model fails
to capture the recent ‘localized’ upward movement. More generally, the
overall fit may conceal ‘localized’ up/downward movements resulting from
autocorrelation in the series, and it is precisely these movements that point
towards context-specific phenomena like a growing/falling momentum in
some discourse feature over the time intervals at hand. ARIMA time series
models, in contrast, account for these movements by definition. They are
intended to capture not just the way things are, but also the way things
move (Hyndman & Athanasopoulos, 2018). Before we proceed to demon-
strate how the juice extraction process works step by step in the case study,
it is useful to gain an overview by looking at (1) what the autocorrelational
structure of a time series exactly means, and (2) how autocorrelation trans-
lates to key components of time series data across different contexts.
Autocorrelation or serial correlation is the correlation between values
that are separated from each other by a given number of intervals. We can
see it as the correlation between the series and a lagged version of itself.
Table 5.1 illustrates this concept by reproducing the 30 measured y and t
values from Figure 5.1 above.

Table 5.1 Autocorrelation at lags 1 and 2

Lag 0 (original series) Lag 1 Lag 2

t y yt+1 yt+2
1 20 18 22
2 18 22 16
3 22 16 16
4 16 16 21
5 16 21 22
6 21 22 23
7 22 23 21
8 23 21 21
9 21 21 24
10 21 24 23
11 24 23 22
12 23 22 22
13 22 22 23
14 22 23 24
15 23 24 23
16 24 23 25
17 23 25 25
18 25 25 27
19 25 27 32
20 27 32 33
21 32 33 35
22 33 35 36
23 35 36 35
24 36 35 38
25 35 38 37
26 38 37 38
27 37 38 42
28 38 42 42
29 42 42 -
30 42 - -

Recall that in Figure 5.1, the 30 paired values of y and t were fitted to
a standard linear regression model that assumed independence among the
y values across t. The lag 1 and lag 2 columns in Table 5.1, on the other
hand, juxtapose each y value with the y value one and two intervals later
(i.e., yt+1 and yt+2). The lag 1 autocorrelation is then the Pearson’s cor-
relation coefficient between the 29 paired values of yt and yt+1, the lag 2
autocorrelation is that between the 28 paired values of yt and yt+2, and so
on for lag k in general. The longer the time series, the more lagged autocor-
relations we can calculate, but each higher lag will have one pair of values
less. The original series in the first column is also called lag 0, and the lag
0 autocorrelation is always +1 because it is simply the correlation of the
series with itself. Autocorrelations explicitly measure the degree of interde-
pendence within the series as the values are sequenced in time.
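As a small illustration (using our own variable names rather than the chapter code), the sketch below recreates the y values from Table 5.1, computes the lag 1 and lag 2 autocorrelations with pandas, and fits a straight line with numpy; the fitted coefficients come out at roughly 0.81 and 14.3, matching the regression model reported for Figure 5.1.

#lag 1 and lag 2 autocorrelations of the Table 5.1 series
import numpy as np
import pandas as pd

y = pd.Series([20,18,22,16,16,21,22,23,21,21,24,23,22,22,23,
               24,23,25,25,27,32,33,35,36,35,38,37,38,42,42])
y.autocorr(lag=1)
y.autocorr(lag=2)

#straight-line fit against t = 1 to 30, returned as (slope, intercept)
np.polyfit(np.arange(1, 31), y, 1)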

Like all Pearson’s correlation coefficients, a lag k autocorrelation ranges
from -1 to +1, and its statistical significance – whether it is significantly different from zero – depends on the magnitude of the coefficient and sam-
ple size (i.e., number of value pairs). A statistically significant positive lag
k autocorrelation means that values at time t and t+k tend to move in the
same direction up/downwards. A significant negative lag k autocorrelation
means that values at time t and t+k tend to move in opposite directions. A
non-significant lag k autocorrelation means that values at time t and t+k are
unrelated. Therefore, a strong lag k autocorrelation translates to many con-
secutive movements in the same direction over k intervals, much like what we
observed in Figure 5.1. Our time series analysis process will involve calculat-
ing all autocorrelations up to a specified k (usually at least 10) to yield the
overall autocorrelation function (ACF) of the series. The ACF is sometimes
called the sample autocorrelation as opposed to the theoretical autocorrela-
tion. This is because each time series is considered a sample instantiation of
an underlying data generating process, or theoretical series. Closely related to
the autocorrelation function is the partial autocorrelation function (PACF).
The lag k partial autocorrelation is simply the lag k autocorrelation, but with
the effects of the intervening time period from lag 1 to lag k-1 removed. We
will soon see that the ACF and PACF of a time series exhibit contrasting
behaviors in a way that provides a ‘signature’ for analysts to decide what
ARIMA model to fit. Just like the ACF, the PACF is also called the sample
partial autocorrelation as opposed to the theoretical partial autocorrelation.
The following code will import the relevant libraries, compute, and plot
(P)ACF for a given input series. Note that the time series data are conven-
tionally imported as a dataframe named ‘series’. The (P)ACF plots, also
known as correlograms, show the magnitude of the (partial) autocorrela-
tions from lag 0 to a specified lag k (k=10 in this case). The alpha value
can be changed to depict confidence intervals at the standard 95% level
(alpha=0.05) or otherwise. These plots visually depict the aforementioned
contrasting behaviors that we will revisit in the analysis.

#compute (P)ACF
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
series=pd.read_csv('data.csv',index_col='Session')
acf(series.y,nlags=10)
pacf(series.y,nlags=10)

#plot (P)ACF
plot_acf(series.y, lags=10, alpha=0.05, title='Autocorrelation function')
plot_pacf(series.y, lags=10, alpha=0.05, title='Partial autocorrelation function')

It is the very presence of significant autocorrelations that gives time series
data their characteristic shape. The next step of our conceptual overview
of the Box-Jenkins method is to introduce a visualization process known
as seasonal decomposition, which breaks down a time series into the key
components of trends, seasons, cycles, and residuals. Trends, seasons, and
cycles are known as the systematic components of time series because
they result from autocorrelations and can thus be systematically modeled,
which is the gist of the Box-Jenkins method. The outcome, our sugarcane
juice, is an ARIMA model that depicts present values in terms of past
values and can be used to forecast future values. Residuals, also variously
known as error, white noise, or random fluctuations, is the non-systematic
component that is left over after the good stuff above is filtered away. Just
as we only discard the sugarcane fiber after ensuring that all the juice is
extracted, simple diagnostic tests can be performed on residuals to ensure
that no more extractable patterns remain. In a nutshell, the Box-Jenkins
method of time series analysis is therefore a process of extracting patterns
from a series until an unpatterned random series remains. We also note
that either an additive or a multiplicative approach could be taken, where
the series is modeled as either the sum or the product of their trends, sea-
sons, and cycles. We will illustrate only the additive approach here. In
general, multiplicative models are appropriate only when the magnitude of
trends, cycles, and/or seasons change substantially across time.
Seasonal decomposition can be done on any input time series with the
following code.

#seasonal decomposition of time series

from statsmodels.tsa.seasonal import seasonal_decompose
#create a 'date time index' for the series and specify three settings
series.index=pd.date_range(freq='m',start='1949',periods=len(series))
seasonal_decompose(series.y).plot()

The series.index assignment creates what is known as a ‘date time index’ for the input
time series. A date time index allows the analyst to easily perform various
time-related operations on time series data such as resampling, rolling,
and expanding metrics, but these are beyond the present scope. We then
specify three settings before plotting the decomposed time series: (1) the
frequency of time intervals, set to ‘monthly’ in this case; (2) the start time
of the series, set to the year 1949 for our example, and (3) the number of
periods or observations in our data, which we can automatically count
with len(series). We then run seasonal_decompose().plot() on the desired
series, under column ‘y’ in the dataframe in this case. Figure 5.2 shows the
seasonal decomposition outcome of our example. This is a widely used example of the total number of airline passengers from January 1949 to December 1960 and is accessible from many sources online.

Figure 5.2 Components of time series data

The top panel is just a raw time plot of y, the total number of airline
passengers (in thousands) against time. We see a general increase over the
years with somewhat periodic fluctuations. Note that the x-axis is auto-
matically simplified to show only the years for a cleaner presentation. The
next panel extracts the trend from the series. Trends are gradual increases
or decreases over a long-term period usually defined as more than one
year. There is a clear upward trend that will be marked by strong auto-
correlations persisting for many lags (i.e., a high value of k), since values
at time t and t+k move generally in the same direction. Note that if the
autocorrelation is negative, for example at lag 1, we will see a zig-zagging
pattern instead since values at time t and t+1 move in opposite directions.
Interpreting in context, our trend is likely due to a stable background
economic growth that increases both demand and supply for air travel.
What about an analogous discourse example? Suppose we are modeling
the frequency of metaphor use over a series of psychotherapy sessions
(Tay, 2017b). Metaphors are commonly used by therapists and clients for
purposes like expressing abstract experiences and explaining concepts, but
their dynamics from the perspective of sessional progression are far less
clear. An upward trend might occur if therapists and clients co-elaborate
an insightful metaphor with greater frequency as treatment progresses,
especially if the therapist is using a specific technique for this (Kopp &
Craw, 1998; Sims, 2003).
The next panel extracts seasons and cycles from the series. Seasons are
short-term oscillations due to stable recurrent factors like increases in air
travel during annual holidays. They are often predictably subsumed under
larger trends like in our example where the seasonal rises and falls recur
annually. Seasons are likewise marked by a strong positive autocorrela-
tion at lag k where k is some multiple that defines the season. For exam-
ple, if we track monthly sales of Christmas gifts over several years, we
expect a strong autocorrelation at lag 12 because the pattern recurs every
12 months (e.g., gradual rise into December followed by a gradual dip).
Cycles, on the other hand, are less stable longer-term oscillations such as
five- to seven-year business cycles of expansion and recession. They can
be misinterpreted as trends if the observed series is too short. Seasons and
cycles may likewise occur in discourse contexts. An obvious example is the
increased use of words related to ‘Summer Olympics’ or the ‘World Cup’
in newspapers every four years. In the case of metaphors in psychotherapy,
we will observe seasons if particular metaphorical topics recur over pre-
dictable or planned intervals by the therapist.
The bottom panel shows the residuals after the above components are
‘filtered out’. Just like a standard regression model, they represent quanti-
ties that cannot be systematically explained and constitute the prediction
errors at each interval. We will see later that standard diagnostics are per-
formed to verify that no more systematic patterns can be extracted from
the residual series, before finalizing and using our model.

Time series models as structural signatures


Before demonstrating the Box-Jenkins method step by step on LIWC
scores across sessions, we introduce one final conceptual idea of time
series models as structural signatures. We know by now that the outcome
of a time series analysis is, broadly speaking, a model that represents the
present value of y as a function of its past values, and the exact details
of this representation depend on the autocorrelational structure of the
series. Just like any other regression model, these details take the form of
parameters and coefficients that reveal insights into structural regularities
underlying the series. For example, the linear model yt = 0.812t + 14.274
(Figure 5.1) tells us that every time step increases y by 0.812 units on
average. Although there are infinite possible values for coefficients, and
infinite possible models as a result, time series models fall into a much
smaller number of categories in practice. There are, in fact, just two major
attested types – autoregressive (AR) and moving average (MA) models
– and their variants, depending on how many lags the autocorrelations
persist into, as well as whether the autocorrelations are positive or nega-
tive. These two major types are what give the ARIMA model its name. If
the model fits the data adequately, each type and variant can then be
interpreted as a ‘structural signature’ of the series, telling us exactly how,
how far, and how much successive values in a series are related. In dis-
course contexts, they become signatures of discourse structure, and they
can be critically interpreted against contextual and theoretical aspects of
the data. To the extent that different datasets fit different models, they
can even function as typologies, or theoretical handles for comparative
discourse research.
Figure 5.3 shows two hypothetical time series as examples, each with 50
periods. The top series is a prototypical fit for an AR(1) model. This means
an autoregressive model of order 1. The bottom series, on the other hand,
is a prototypical fit for an MA(1) model. This means a moving average
model of order 1. What these orders imply and how they are determined
will be explained later.
Figure 5.3 Prototypical ARIMA models as structural signatures

Figure 5.3 also depicts the outcomes of time series analysis. The orange line is the plot of the original series, while the blue forecast line shows
the predicted values at each interval, as well as the forecasted values for a
specified number of periods into the future. The grey zone indicates 95%
confidence intervals for the forecasts. The two lines are quite close to each
other in both cases, reflecting the prototypical fit. Our case study will dem-
onstrate the stepwise Box-Jenkins method to arrive at these outcomes. For
now, notice that the AR1 series is characterized by strong period-to-period
up or downward momentum for long stretches of time, with occasional
directional switches. This is what was meant by a ‘structural signature’
that identifies typical datasets like stock prices with bullish momentum.
On the other hand, the MA1 series is characterized by rapid period-to-
period ‘bouncing’ where sudden jumps tend to be ‘restored’ by an oppo-
site movement thereafter. This is a very different structural signature that
typifies a correspondingly different context, like high frequency trading
with minute-to-minute transactions where past values have little long-term
influence. Readers may already see where this is going. For different time
series discourse data in psychotherapy, classrooms, the media, and so on,
questions like which model types fit well and what the structural signatures
imply are exciting and underexplored. The ability to forecast future values
is also an intriguing application in some discourse contexts, as we will
show in the case study below.

Case study: Modeling and forecasting psychotherapy language across sessions

Time is an inherent dimension underlying psychotherapy processes and
outcomes. There are of course many justifiable ways to define temporal
intervals of interest. Althoff et al.’s (2016) comparative study of good and
poor outcomes in SMS texting-based counseling, for example, divided
each conversation into five equal chunks, and Tay’s (2017a) analysis of
metaphor usage patterns divided each session into three. These studies
are insightful on their own terms, but the somewhat arbitrary chunking
does not find clear support in psychotherapy theory. As demonstrated in
previous chapters, the whole session is instead likely to be the most intui-
tive unit of analysis preferred by psychotherapy researchers. Mergenthaler
and associates (Gelo & Mergenthaler, 2003; Mergenthaler, 1996), for
example, observed cyclical patterns in clients’ use of emotion words and
abstract words from session to session, although they stopped short of
modeling this cyclicity in more explicit ways (e.g., with ARIMA times
series models). A similar focus on sessional progress can be observed in
studies of other linguistic features like pronouns and metaphors. Pronouns
reflect self-presentation and attention, ego, other persons and entities, and
how engaged clients are with therapists (Rizzuto, 1993). Van Staden and
Fulford (2004) compared initial and final session transcripts between good
and poor outcomes and found that good outcomes had a higher increase of
first-person pronouns, and used more pronouns in an active position (e.g.,
I chased the dog versus the dog chased me). In the case of metaphor, good
outcome clients are better able to transform less optimistic metaphors to
more optimistic ones (Levitt et al., 2000; Sarpavaara & Koski-Jännes,
2013) across sessions.
Time series analysis with ARIMA models would extend these studies
in two substantial ways. First, as mentioned in the conceptual introduc-
tion above, an explicit account of structural relationships or ‘signatures’
between language variables in past, present, and future sessions can pro-
vide a more detailed account – backed by subsequent qualitative analysis
in context – of linguistic behavior across time. These structural relation-
ships could, in turn, be related to different process and outcome measures
to further examine the co-occurrence of linguistic and therapeutic changes.
Second, since forecasting is a primary objective of ARIMA models in most
contexts, their application to psychotherapy talk serves to bridge the some-
what conflicting aims of discourse analysis and psychotherapy research – at
least as traditionally conceived. We mentioned in Chapter 1 that most dis-
course analysts aim to depict how language is used rather than predict or
prescribe how they should be used. Another way to put it is that linguistic
analysis of psychotherapy necessarily focuses on sessions that have taken
place, and thus its potential contribution to prognosis (i.e., looking ahead
at future sessions and outcomes) is less clear. In a health-care context like
psychotherapy, however, it is commonplace for therapists to prognose the
development of clients’ conditions using a range of information, including
their medical history, educational and social assets, motivation, intel-
ligence, and experiences of sessions (Luborsky et al., 1971). To the extent
that language use in future sessions can be reliably forecasted, linguistic
analysis would prove to be an additional tool in the prognostic framework
to shed light on how the treatment might progress over time.
We will now demonstrate the six steps of the Box-Jenkins method of
time series analysis as listed below.

• Step 1: Inspect series
• Step 2: Compute (P)ACF
• Step 3: Identify candidate models
• Step 4: Fit model and estimate parameters
• Step 5: Evaluate predictive accuracy, model fit, and residual diagnostics
  ○ If inadequate, return to Step 3
  ○ If adequate, move to Step 6
• Step 6: Interpret model in context

The method combines automatic computation with the analyst’s informed
judgment. While Steps 2 and 4 are automated, the rest require more
engaged interpretation of computer output. In particular, the selection of
potential candidate models in Step 3 requires the analyst to judge the (P)
ACF of the series, and the evaluation in Step 5 requires a balanced con-
sideration of different factors as explained later. Step 5 also reflects the
iterative nature of the method. If the candidate model selected for fitting is
deemed inadequate, the process returns to Step 3 where a different model
is selected.
To illustrate the complementary potential of different data analytic
techniques, our case study follows from the linguistic synchrony analysis
in Chapter 3 where a sample of CBT sessions was found to exhibit perfect
therapist–client asynchrony. We will perform and compare the results of
two time series analyses, focusing just on the LIWC analytical thinking
scores of a CBT therapist versus client over 40 sessions. If CBT dyads
are in general relatively asynchronous, it would be reasonable to expect
different trajectories for therapist and client LIWC scores, with nuanced
analysis of their structural signatures promising further insight. Note that
a different dyad from Chapter 3 will be analyzed here due to sample size
requirements explained below. The 14 CBT sessions from Chapter 3 are,
in general, inadequate for accurate time series modeling.

Step 1: Inspect series

Figure 5.2 showed that inspecting a decomposed series can reveal trends,
seasons, and cycles. Another important reason to inspect the series as the
first step is that the Box-Jenkins method relies on just one realization (i.e.,
our data) to estimate the parameters of its underlying data-generating pro-
cess. In most cases it is, in fact, only possible to have one realization because
we cannot go back in time to collect another ‘sample’. Note that we are not
talking about the number of observations in our series, because they still
collectively comprise only one series regardless of how many observations
there are. For the statistical estimation to be valid, the mean and variance
of our ‘N=1’ sample series should therefore remain constant across differ-
ent sections of time. This is known as the condition of stationarity, and
we should have a stationary series before moving to Step 2. If our series is
non-stationary, it needs to be transformed into a stationary one. Assuming
that our two series (therapist and client analytical thinking scores over 40
sessions) are imported as two columns of a dataframe called series, the fol-
lowing code will plot and annotate them. It is also good practice to number
the sessions consecutively and set it as the index column of the dataframe.
The setting subplots=True will plot each series on a separate grid, which
can be changed to False to plot everything on just one grid. Lastly, series.
describe() provides handy descriptive statistics like the number of observa-
tions, mean, and standard deviation of each series.

#inspect series
series=pd.read_csv('data.csv',index_col='Session')
series.plot(subplots=True)
series.describe()

Figure 5.4 shows our two series (therapist and client) at the top, as well
as additional examples of non-stationary series and outcomes of various
transformation procedures.

Figure 5.4 Examples of (non-)stationary series

The means of our two series appear to remain constant over time as
the plots fluctuate around a midpoint along the y-axis. From the descrip-
tive statistics, we note for now that the therapist series has a higher mean
(49.999) than the client series (42.001), suggesting a higher degree of ana-
lytical thinking in general. Nevertheless, as with the elbow plots in Chapter
3, it is always good to verify visual inspection with a formal measure.
We can do this with the Augmented Dickey-Fuller (ADF) test where H0 =
the series is non-stationary. The code below imports the relevant librar-
ies and returns just the p-value, which is the second element of the full ADF test output. Note that only one series is tested at a time – in this case the
therapist scores, which are found under the Therapist column. As p <0.01
for both therapist and client, H0 is rejected and both series are confirmed
as stationary.

#ADF test for stationarity

from statsmodels.tsa.stattools import adfuller
adfuller(series.Therapist, autolag='AIC')[1]
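Since only one series is tested at a time, a short loop (a convenience sketch rather than part of the chapter code) can apply the same test to both columns:

#ADF p-values for both series
for col in ['Therapist', 'Client']:
    print(col, adfuller(series[col], autolag='AIC')[1])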

The variances of both series also appear constant over time since the disper-
sion around the means is not obviously changing. Such series are described
as homoscedastic. Generally, a stationary series is also homoscedastic but
not vice versa. Homoscedasticity can be verified with the Breusch-Pagan
test, where H0 = the series is homoscedastic. The code for this is more
cumbersome than the ADF test as it involves additional steps like regress-
ing the series on the time intervals expressed as integers, and then creating
an array out of the series. We do not need to delve into the details, but the
final outcome is likewise a p-value for each series.

#Breusch-Pagan test for homoscedasticity

from statsmodels.formula.api import ols
series['time']=np.array(range(len(series)))
model = ols('Therapist ~ time', data=series).fit()

def create_array(col):
    s = []
    for i in col:
        a = [1,i]
        s.append(a)
    return (np.array(s))
array = create_array(series.Therapist)

from statsmodels.stats.diagnostic import het_breuschpagan

het_breuschpagan(model.resid, array)[1]

Before proceeding to Step 2, let us consider some problematic cases. The
third plot in Figure 5.4 is an example of a non-stationary series where the
mean rises and is therefore not constant over time. We deal with such cases
by transforming them with a procedure known as differencing. First-order
differencing means to subtract the values of successive observations to
derive a new series of zt, where zt = yt – yt-1. The new series is also known as
the first differences. By performing differencing, we are effectively remov-
ing the trend component of a time series, thereby performing the first step of our sugarcane extraction process. The following code creates a new column
(Therapist_diff) by differencing the original therapist series (assuming it
was actually non-stationary).

#differencing
series['Therapist_diff']= series.Therapist.diff(periods=1)

The outcome of differencing the third plot is shown in the fourth plot. Note
that some values are negative because yt - yt-1 can be negative. The most
important feature is that the rising trend is now eliminated. If the differenced
series is still non-stationary (as confirmed by an ADF test), we can perform
a further second-order differencing. This means to difference the first-differ-
enced series one more time, by computing zt – zt-1. We can do this by applying .diff(periods=1) again to the first-differenced column, preferably saving the outcome to another new column (note that changing periods=1 to periods=2 in the code above would instead compute yt – yt-2, which is not the same thing). Most non-stationary time series will become sta-
tionary after, at most, two orders of differencing. Note that with every order
of differencing, we “lose” one observed value. The recommended minimum number of observations for accurate time series analysis is 50 (McCleary et al., 1980).
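A minimal sketch of what this looks like in code, assuming the first-differenced column created above: the first differences are differenced once more, and the ADF test is re-run on the result after dropping the missing values created by differencing.

#second-order differencing: difference the first differences one more time
series['Therapist_diff2'] = series['Therapist_diff'].diff(periods=1)

#re-check stationarity on the twice-differenced series
adfuller(series['Therapist_diff2'].dropna(), autolag='AIC')[1]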
The fifth plot is an example of a heteroscedastic series with the disper-
sion around the mean, or the series variance, increasing over time. There
are several ways to deal with this. The first is to perform a log-transfor-
mation by converting each value yt to log(yt), the result of which is shown
in the final plot. Log-transformation is a commonly applied technique for
other purposes like making a skewed distribution more normal, or chang-
ing the scales of graphical axes. The following code creates a new col-
umn (Therapist_logged) by log-transforming the original therapist series
(assuming it was heteroscedastic).

#log-transformation
series['Therapist_logged']= np.log(series.Therapist)

The transformation achieves variance stabilization. Although the shape of


the series remains the same, the scale of the y-axis and hence the variance
is compressed. Another option is to model the changing variance explicitly
with a different set of (G)ARCH (Generalized Autoregressive Conditional
Heteroskedasticity) models (Bollerslev, 1986). These are beyond the pre-
sent scope, and much harder to interpret for discourse analytic purposes
compared to ARIMA models. This is because ARIMA models directly
represent how the observed variable changes in time, while the modeled
variances in (G)ARCH models have no obvious discourse analytic import.
This is also why the Box-Jenkins method is more suited to the modeling
of language/discourse because it is less easy to intuit a discourse analytic
interpretation from other established approaches.

In some cases, the original series may need to undergo both differenc-
ing and variance stabilization. If both procedures are required, variance
stabilization should be performed first. Python will automatically convert
transformed series back into their original forms, when performing predic-
tions and forecasts later.

Step 2: Compute (P)ACF

The ACF and PACF of each series are computed in Step 2. This is argu-
ably the most important step because subsequent interpretation of statisti-
cally significant autocorrelations is crucial for identifying which types of
ARIMA models will fit the series. The code for computing and plotting
(P)ACF was already provided earlier in the chapter, but the code below
introduces the subplots feature to plot the (P)ACF correlograms of our two
series in a 2x2 grid.

#multiple (P)ACF plots

fig,axes=plt.subplots(2,2)
fig.text(0.5, 0.05, 'Therapist vs. Client (analytical thinking)', ha='center', fontsize=15)
plot_acf(series.Therapist, ax=axes[0,0], alpha=0.05, title='Therapist ACF', lags=10)
plot_pacf(series.Therapist, ax=axes[0,1], alpha=0.05, title='Therapist PACF', lags=10)
plot_acf(series.Client, ax=axes[1,0], alpha=0.05, title='Client ACF', lags=10)
plot_pacf(series.Client, ax=axes[1,1], alpha=0.05, title='Client PACF', lags=10)

Correlograms (Figure 5.5) provide a useful visual aid to identify can-
didate ARIMA models in Step 3. For both ACF and PACF, the y-axis
shows the correlation coefficient at the lag indicated by the x-axis. This
value is always 1.0 at lag 0. The shaded boundaries are 95% confidence
bands, beyond which the correlation is statistically significant at α=0.05.
Significant lag k autocorrelations will thus appear as ‘spikes’ or vertical
bars that extend beyond the confidence bands. For the therapist series,
the ACF spikes up to lag 2 and the PACF up to lag 1. For the client series,
both ACF and PACF spike up to lag 1. There is no seasonal behavior in
our series as there are no spikes that recur at seasonal multiples (e.g., lags
4, 8, 12…). In visually unclear cases where the (P)ACF is very close to the
bands, we can further calculate a t-statistic (i.e., the correlation coefficient
at that lag divided by its standard error) to determine if a spike exists, but
we leave this out for now.

Figure 5.5 (P)ACF correlograms

The typical contrasting behaviors of ACF and PACF mentioned earlier
should now be noted. Notice that for the therapist series, the ACF tapers
off or ‘dies down’ gradually over consecutive lags, while the PACF ‘cuts
off’ more abruptly. In some cases, the converse occurs, while in others both
ACF and PACF die down or cut off in similar fashion. It is precisely this
relative behavior that will help us select candidate ARIMA models for the
series in Step 3.

Step 3: Identify candidate models

In Step 3, we decide which ARIMA model type best describes our data
based on the relative behavior of (P)ACF: an autoregressive (AR) model, a
moving average (MA) model, or a combination of both (ARMA/ARIMA).
Mathematically, AR models express the current value of the time series as
a function of its past values, MA models express it as a function of residu-
als in past intervals, while ARMA models combine both. The difference
between ARMA and ARIMA is the letter I for ‘integrated’. The latter refers
to cases where differencing was performed to achieve stationarity in Step
1. We will soon discuss what these ‘mean’ as structural signatures in a
144 Time series analysis

Table 5.2 Basic guidelines for model selection

1. ACF spikes up to lag k and cuts off; PACF spikes up to lag k and dies
   down. Candidate model: MA model of order k, i.e., MA(k) model.
2. ACF spikes up to lag k and dies down; PACF spikes up to lag k and cuts
   off. Candidate model: AR model of order k, i.e., AR(k) model.
3. Both ACF and PACF spike up to lag k and cut off. Candidate model: if
   ACF cuts off more abruptly, use an MA(k) model; if PACF cuts off more
   abruptly, use an AR(k) model; if both cut off equally abruptly, try
   both models to see which fits better.
4. Both ACF and PACF spike up to lag k and die down. Candidate model:
   ARMA model of order k, i.e., both MA(k) and AR(k) model.
5. Both ACF and PACF have no spikes at all lags. Candidate model: no
   suitable model since autocorrelations are absent.

For now, Table 5.2 offers key guidelines for candidate
model selection based on the five most common behavior patterns of (P)
ACF.
The order k corresponds to the number of lags for which the (P)ACF is
significant. The general form of an MA(k) model is yt = μ − θ1at-1 − … − θkat-k + at where

• yt is the present value in the series


• μ is the estimated mean of the process that generated the observed
series
• θ1 … θk are coefficients also known as MA(k) operators
• at is the residual (i.e., observed – predicted value) at time t
• at-1 … at-k are past residuals from time t-1 to t-k

An MA(k) model is thus called an MA model of order k. Note that the
general form yt = μ − θ1at-1 − … − θkat-k + at resembles the simple linear
regression model y = β0 + β1x1 + … + βkxk + et.
The constant μ corresponds to the intercept β0, the coefficients θ1 … θk
correspond to β1 … βk, the past residuals at-1 … at-k correspond to the
predictors x1 … xk, and the residual at corresponds to the error term et.
Likewise, the general form of an AR(k) model is yt = (1 − Φ1 − … − Φk)μ +
Φ1yt-1 + … + Φkyt-k + at where

• yt is the present value in the series


• μ is the estimated mean of the process that generated the observed series

• Φ1 … Φk are coefficients also known as AR(k) operators
• (1 − Φ1 − … − Φk)μ is known as the constant term of the model
• at is the residual (i.e., observed – predicted value) at time t
• yt-1 … yt-k are past series values from time t-1 to t-k

An AR(k) model is thus called an AR model of order k. Once again, the
general form yt = (1 − Φ1 − … − Φk)μ + Φ1yt-1 + … + Φkyt-k + at resembles the
simple linear regression model y = β0 + β1x1 + … + βkxk + et. The constant
(1 − Φ1 − … − Φk)μ corresponds to the intercept β0, the coefficients Φ1 … Φk
correspond to β1 … βk, the past series values yt-1 … yt-k correspond to the
predictors x1 … xk, and the residual at corresponds to the error term et.
ARMA models, by definition, combine AR/MA(k) models additively and
therefore share the characteristics of both. These general forms capture the
essential characteristics of AR and MA models as structural signatures, as
will become clear when discussing our two examples later.
Applying Table 5.2, our therapist series reflects pattern #2. ACF spikes
up to lag 2 and dies down, while PACF spikes up to only lag 1 and cuts off.
Noting that the ACF at lag 2 just passes the significance threshold while the
PACF at lag 2 is far off, we conservatively choose AR(1) as the candidate
model. A common way to denote an ARIMA model is by ARIMA(p,d,q)
where p=the order of AR, q=the order of MA, and d=how many orders of
differencing were performed. Our AR(1) model is therefore also called an
ARIMA(1,0,0) model. As for the client series, both ACF and PACF spike
up to lag 1 and cut off, which reflects pattern #3. We further note that
ACF cuts off more abruptly because PACF becomes significant again
at much later lags. We therefore choose an MA(1) model to be the can-
didate model. Beyond the present introductory scope are SARIMA or
seasonal ARIMA models that capture seasonal phenomena in time series
– something not often seen in spontaneous discourse contexts. Having
determined a candidate model, we now move on to the standard statisti-
cal modeling procedures of estimating its parameters and determining the
goodness-of-fit.

Step 4: Fit model and estimate parameters

In Step 4, we fit the candidate model to our time series and estimate its
parameters. For an MA model the parameters are the intercept/constant μ
and θ1 … θk, while for an AR model they are the intercept/constant
(1 − Φ1 − … − Φk)μ and Φ1 … Φk. Due to the relative complexity of ARIMA over lin-
ear regression models, different software may give slightly different param-
eter estimates. Our model validation procedure also begins at this step.
Previous chapters introduced train-test data, resampling, visualizations,
and alternative analyses as validation techniques. For time series analysis,
we opt for a train-test approach where the first 37 out of 40 session scores
are used as training data to fit the candidate model, and the final three

sessions are withheld as testing data to evaluate the predictive accuracy


of the fitted model. The approach is similar to our Monte Carlo simula-
tions in Chapter 2 in that the training data cannot be randomly selected/
resampled, and the contiguity of session transcripts must be preserved.
This is because the sequence of observations defines a time series and its
autocorrelational structure, and thus should not be randomized. It is also
possible to use approaches similar to k-folds cross-validation introduced in
Chapter 4 where multiple contiguous session folds are randomly selected
as training data, and not just the final sessions (e.g., by using scikit-learn’s
TimeSeriesSplit). These are beyond the present introductory scope.
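For readers who want to explore this route, the sketch below shows the general shape such a scheme could take with scikit-learn; the number of splits is an arbitrary choice for illustration, and each split keeps the sessions in temporal order with an expanding training window.

#sketch of time-series cross-validation splits (illustrative only)
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(series.Therapist):
    print('train sessions', train_index[0]+1, 'to', train_index[-1]+1,
          '| test sessions', test_index[0]+1, 'to', test_index[-1]+1)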
As before, there is no hard and fast rule on the required size of the test-
ing data, with recommendations ranging from 20% of the original series
to being at least as large as the number of forecasting intervals eventu-
ally required. A larger training dataset generally increases model preci-
sion because there are more samples from which to induce patterns. On
the other hand, a larger testing dataset gives a better indication of how
well these patterns predict new data not yet seen by the model. If the
predictive accuracy and other aspects of goodness-of-fit (see next step) are
deemed acceptable, the testing data can be re-incorporated to re-estimate
the model parameters, leading to an even more fine-tuned model. The fol-
lowing code removes the final three values of our series as described above,
and stores the resulting 37-session training data as train_series.

#remove final three values to form training data


train_series = series.iloc[0:len(series)-3]

We are now ready to fit an AR(1) model to the therapist series and
MA(1) model to the client series, using the SARIMAX class in statsmod-
els and our training data train_series. The following code performs the
necessary import and model fitting, naming the therapist and client mod-
els model1 and model2, respectively. The AR(1) model is specified by
order=(1,0,0) as per the (p,d,q) nomenclature introduced above, while
the MA(1) model is specified by order=(0,0,1). Specifying trend=‘c’ tells
Python to include the constant/intercept term in the model. This could be
removed, and the model refitted, if the constant/intercept turns out to be
statistically insignificant. The model.summary() will generate the output
shown below.

#fit candidate models


import statsmodels.api as sm
model1 = sm.tsa.SARIMAX(train_series.Therapist, order=(1,0,0), trend='c').fit()
model1.summary()

model2 = sm.tsa.SARIMAX(train_series.Client, order=(0,0,1), trend='c').fit()
model2.summary()

Figure 5.6 shows screenshots of the output for the AR(1) (top) and MA(1)
model (bottom). Among other details, the series (or dependent variable)
and fitted model, the number of observations, and model fit statistics are
shown at the top of each output. Residual diagnostics like the Ljung-Box Q
and its associated p-value, to be discussed in the next step, are shown at the
bottom. The middle panel shows the estimated parameters (coef), standard
errors, and z-scores (number of standard deviations away from 0), p-values

Figure 5.6 Output of time series model fitting



of the parameters, and 95% confidence intervals of the estimates. Sigma2


estimates the variance of the error term at, which essentially quantifies how
well the model predicts observed values. For both models, the intercepts
and AR(1)/MA(1) operators (represented as ar.L1 and ma.L1) are all sig-
nificantly different from zero (p<0.01). Plugging the parameters into the
general forms described above, the models for our two series are

• therapist series: yt = 16.6586 + 0.6629yt-1 + at


• client series: yt = 42.2339 − 0.6256at-1 + at.

As mentioned above, different statistical software may produce slightly


different results because of varying approaches to parameter estimation.
For example, sigma2 estimates can be ‘biased’ or ‘unbiased’ depending on
whether the number of model parameters is taken into account (Brockwell
& Davis, 2016). These differences are usually minor and understanding
the technical details is not necessary for our purposes. With the models in
hand, the final step before interpreting them is to evaluate their quality.

Step 5: Evaluate predictive accuracy, model fit, and residual diagnostics

Evaluating a time series model involves three interrelated aspects: (1) pre-
dictive accuracy: how well it predicts observed values, especially the testing
dataset, (2) model fit: how well it fits the data in general, and (3) residual
diagnostics: whether the model has captured all patterns from the series (or
squeezed all the juice from the sugarcane). If we are satisfied with all three,
we can move on to the final step of interpreting the model in context. If
not, we return to Step 3 and select another candidate model.
Predictive accuracy and model fit can be visually evaluated by plotting
the model predicted values, including forecasts, against the observed val-
ues. The code below generates the predicted values and forecasts using
the therapist AR(1) model (model1). The same can be done by replacing
model1 with model2. Remember to also rename predict and forecast when
doing so, or the results for model1 will be overwritten.

#predict and forecast values


predict = (model1.get_prediction(start=1, end=len(series)))
predictinfo = predict.summary_frame()

forecast = (model1.get_prediction(start=len(series), end=len(series)+3))
forecastinfo = forecast.summary_frame()

Note that predict consists of predicted values from the first session (start=1)
to the last session in the original 40-session series (end=len(series)). It

therefore includes the predicted values for the final three sessions, which
we could then compare against the testing data. A summary of the results,
including the predicted values and their 95% confidence intervals, is gener-
ated by predict.summary_frame() and stored as predictinfo. On the other
hand, forecast consists of predicted values beyond the observed interval –
from session 40 to 43, which is why they are called forecasts. A summary
is likewise generated and stored as forecastinfo.
With this information we can now generate our plot with the code
below. This plots the therapist series, but relevant parts can be changed
to show the client series. As usual, cosmetic details like labels, colors, and
font sizes can be changed at will. ax.axvspan() creates a red region to com-
pare predicted versus observed values in the last three sessions (i.e., the
testing data), and ax.fill_between() colors the 95% confidence intervals of
forecasts. Figure 5.7 shows the predicted versus observed plots for both
models.

#plot predicted/forecasted vs. observed values


fig, ax = plt.subplots(figsize=(10, 5))
forecastinfo['mean'].plot(ax=ax, style='k--', label="forecast")
plt.plot(series.Therapist, label="observed", color='dodgerblue')
plt.plot(predictinfo['mean'], label="predicted", color='orange')
ax.axvspan(len(train_series), len(train_series)+3, color='red', alpha=0.2, label='train-test')
ax.fill_between(forecastinfo.index, forecastinfo['mean_ci_lower'],
                forecastinfo['mean_ci_upper'], color='grey', alpha=0.2,
                label="95% CI")
ax.set_ylabel('Analytical thinking', fontsize=12)
ax.set_xlabel('Session', fontsize=12)
ax.set_title('Therapist series', fontsize=12)
plt.setp(ax, xticks=np.arange(1, len(series)+4, step=2))
plt.legend(loc='best', fontsize=10)

The blue lines depict the observed time series and the orange lines the
predicted series. The red region marks the final three sessions, or the ‘train-
test zone’ so to speak. The dotted line depicts forecasts after session 40,
and the gray region indicates 95% confidence intervals for each forecast.
Many analysts also calculate the slightly different prediction intervals,
which we will touch upon later. The plots can reveal a lot about predictive
accuracy and model fit. The main difference between the two measures is
that predictive accuracy is evaluated with testing data only (i.e., the dis-
parity between predicted values and testing data values), to judge how
well the model performs on data it has not ‘seen’. Model fit, on the other
hand, is often evaluated with training data only (i.e., the disparity between

Figure 5.7 Predicted vs. observed plots for both models

predicted values and training data values), to judge how well the trained
model reflects the data it has ‘seen’. It is of course not possible to evaluate
the quality of forecasts until the sessions actually happen.
We can see that the AR(1) model (yt = 16.6586 + 0.6629yt-1 + at) seems to
capture both the ‘shape’ and the magnitude of the therapist series quite well
because the two lines are close, except in the red region, which indicates
poorer predictive accuracy. However, it correctly predicted the general
upward movement of the series in the last three sessions. The MA(1) model
(yt = 42.2339 − 0.6256at-1 + at), on the other hand, seems to capture the
shape better than the magnitude. Nevertheless, just as in previous chapters,
visual assessment should be supported by more objective measures when-
ever possible. One option is the mean absolute error (MAE), which simply
sums the absolute error (predicted – observed, positive values only) for all

observations and divides this by the number of observations. Two similar


measures are the mean squared error (MSE), which squares each error and
divides the sum by the number of observations, and the root mean squared
error (RMSE), which is the square root of MSE. These options are easy
to interpret because they measure the error directly in terms of the unit
of the series. They are thus called scale-dependent measures. However, if
we want to compare models across different phenomena, scale-independ-
ent measures that calculate errors as percentages are preferred. The most
common is the mean absolute percentage error (MAPE), which sums the
absolute percentage error (|predicted – observed| / observed × 100%) across all
observations and divides this by the number of observations. However,
the biggest disadvantage with MAPE is that it does not work well when
observed values are 0 or close to 0, which leads to undefined or extreme
values. Another less obvious disadvantage that would probably not apply
to most discourse contexts is when the unit is not a ratio variable and
has an arbitrary zero point (e.g., temperature), rendering percentage errors
meaningless.
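For completeness, the scale-dependent measures just described can be obtained from the same sklearn.metrics module used below for MAPE; the following is a minimal sketch for the therapist training data, reusing the predict object created earlier (the intermediate variable names are just for readability).

#scale-dependent error measures for the therapist training data (sketch)
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

observed = train_series.Therapist
predicted = predict.predicted_mean.iloc[0:len(train_series)]
mae = mean_absolute_error(observed, predicted)
mse = mean_squared_error(observed, predicted)
rmse = np.sqrt(mse)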
The code below calculates MAPE for the training data (i.e., from ses-
sions 1 to 37), and then the testing data (from sessions 38–40), for the
therapist series. The value mape is multiplied by 100 to express it in per-
centage terms.

#MAPE for training data (i.e., sessions 1–37)


from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(train_series.Therapist,
    predict.predicted_mean.iloc[0:len(train_series)])
mape*100

#MAPE for testing data (sessions 38-40)

mape = mean_absolute_percentage_error(series.Therapist.iloc[-3:],
    predict.predicted_mean.iloc[-3:])
mape*100

For the therapist series, MAPE=4.21% (training) and 7.96% (testing),


while for the client series, MAPE=18.24% (training) and 11.67% (test-
ing). It appears that the AR(1) model for the therapist series has a bet-
ter fit and predictive accuracy than the MA(1) model for the client series.
Interestingly, however, the relative performance on testing versus train-
ing data is better in the MA(1) than AR(1) model. While not a serious
issue in this case, a model performing much better on training than testing
data might be a sign of overfitting – where the predictions correspond too
closely to a specific set of data, and the model therefore fails to generalize
or fit additional/future data.

Other common model fit measures that can be used to evaluate


ARIMA models include the Akaike Information Criterion (AIC), Bayesian
Information Criterion (BIC), and the widely used coefficient of determina-
tion (R2). AIC/BIC evaluate models on their training dataset performance
as well as complexity – with less complex models preferred over more com-
plex ones (Chakrabarti & Ghosh, 2011). Lower AIC/BIC values indicate
better models, and these values are provided in the model summary out-
put (Figure 5.6). However, AIC/BIC are relative measures used to decide
between more than one candidate model, which is not useful for us now
because only one candidate model was identified for each series. Lastly,
the coefficient of determination R2 can also be considered. R2 indicates the
proportion of variation in the dependent variable (i.e., each of our two
series) that is accounted for by the model. A high R2 means the model cap-
tures much of the variation or, equivalently, produces small error residuals
(predicted – observed values). It is therefore conceptually similar to MSE
or RMSE, but for that reason should not be used to compare different
models directly. The following code calculates R2 for the therapist training
data. Notice that both MAPE and R2 are imported from sklearn.metrics.

#R2 for training data


from sklearn.metrics import r2_score
r2 = r2_score(train_series.Therapist, predict.predicted_mean.iloc[0:len(train_series)])

For the therapist series, R2 = 0.885 and for the client series R2 = 0.620.
These measures concur with the previous MAPE evaluation of the better
fit of the AR(1) model for the therapist series, although the MA(1) model
for the client series is also acceptable.
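Although not needed here because only one candidate model was identified per series, AIC/BIC can also be retrieved programmatically from the fitted results objects whenever several candidates do need to be compared; a minimal sketch:

#retrieve AIC/BIC from the fitted results (sketch; lower values indicate better models)
print(model1.aic, model1.bic)
print(model2.aic, model2.bic)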
Having evaluated predictive accuracy and model fit, the final part of
Step 5 is to perform residual diagnostics. This tends to be neglected when
reporting regression analyses, but it is an important step to evaluate whether
any patterns in the data have been ‘left over’. In our time series context,
it means to check if the candidate model has extracted all juice from the
sugarcane – or transformed the original series into patternless residuals or
‘white noise’ (Figure 5.2). We do this by checking (1) if there is autocor-
relation in the residuals, (2) if the mean of the residuals is near-zero, and
(3) if the residuals are normally distributed. For (1), we treat the residuals
across sessions (predicted – observed values) as a time series itself and eval-
uate its (P)ACF just like in Step 2 of our main process. Absence of spikes
in (P)ACF at all lags implies that there is no more autocorrelational infor-
mation left in the residuals. On the other hand, spikes would suggest that
the residuals are still patterned across time, which needs to be addressed
either by modeling the residual series and adding it to the original model

or by going back to Step 3 to choose another candidate model. For (2), we


check if the residual series has a mean value close to zero, which suggests
that positive and negative errors cancel one another out. A non-zero mean
would lead to biased forecasts in either direction, which is often addressed
by simply adjusting forecasts by an amount equivalent to the mean. Lastly,
it is ideal (but not necessary) for the residuals to be normally distributed
and homoscedastic for easier calculation of forecast prediction intervals.
For example, the 95% prediction intervals for a one-step-ahead forecast
would simply be the forecasted value ± 1.96 standard deviations of the
normally distributed residual series. If the residuals are not normally dis-
tributed, prediction intervals would require more complicated bootstrap-
ping methods to derive.
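As a concrete illustration of the arithmetic just described, the sketch below builds a rough 95% interval around one of the point forecasts obtained earlier, using the ±1.96 standard deviation rule (which strictly applies to a one-step-ahead forecast and assumes approximately normal residuals); it is for illustration only.

#rough 95% interval using the +/-1.96 standard deviation rule (sketch)
resid_sd = model1.resid.std()                 #standard deviation of the residual series
point_forecast = forecastinfo['mean'].iloc[0] #one of the point forecasts generated earlier
lower = point_forecast - 1.96*resid_sd
upper = point_forecast + 1.96*resid_sd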
The following code first obtains the residuals from a model (model1
in this case), and then produces a convenient collection of relevant plots,
including the (P)ACF correlograms, a raw plot of the residuals across ses-
sions, and histogram showing its distribution. Figure 5.8 shows the plots
for model1, the AR(1) model for the therapist series.

#obtain residuals from model


residuals = model1.resid

Figure 5.8 Residual diagnostic plots



#residual diagnostic plots


fig, axes = plt.subplots(2, 2)
fig.tight_layout(pad=2.0)
fig.suptitle('Residual diagnostics', fontsize=14, y=1.05)
plot_acf(residuals, ax=axes[0,0], alpha=0.05, title='ACF of residuals', lags=10)
plot_pacf(residuals, ax=axes[0,1], alpha=0.05, title='PACF of residuals', lags=10)
residuals.plot(ax=axes[1,0], title='Residuals series plot')
sns.distplot(residuals, kde=True, axlabel='Residuals histogram', ax=axes[1,1])

The correlograms clearly show that there are no (P)ACF spikes in the
residuals. This also reflects the result of the Ljung-Box Q test (Q=0.14,
p=0.71) shown at the bottom left of model1’s summary in Figure 5.6,
where H0 = the series is independently distributed and has zero autocor-
relation. The raw plot of the residuals suggests that they fluctuate around
a near-zero mean, as further confirmed by the histogram. Lastly, the
distribution appears normal, which can be confirmed with a Shapiro-Wilk
test of normality (H0 = the series is normally distributed) using the code
below. The result (W=0.974, p=0.542) confirms that the model1 residuals
are normally distributed.

#shapiro-wilk test of normality of residuals


stats.shapiro(residuals)
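Similarly, the Ljung-Box statistic reported in the model summary and referenced above can be recomputed directly from the residuals; a minimal sketch using statsmodels:

#ljung-box test for residual autocorrelation (sketch)
from statsmodels.stats.diagnostic import acorr_ljungbox
acorr_ljungbox(residuals, lags=[1])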

It is now time to consider all the information gathered in Step 5 and


decide if the candidate model is adequate. While predictive accuracy,
model fit, and residual diagnostic statistics are automatically computed,
the final decision may depend on other factors like knowledge of the mod-
eled phenomenon and the specific objectives of modeling. It is well known
that discourse is “messy” in content and structure (Eubanks, 1999). We
should therefore not expect the same extent of structural regularity in
discourse as compared to, say, annual sales data of familiar products.
We should further consider if our objective is to understand the ‘shape’
of the series or to make reliable predictions/forecasts. Understanding the
shape or structural signature of the data is likely to be more pertinent in
most discourse analytic situations. Given these considerations, we would
thus pay more attention to the fact that (1) the model parameters are sta-
tistically significant, (2) the shape of the observed and predicted plots is
similar, and (3) the residuals are not autocorrelated and have a near-zero
mean. We would be more inclined to overlook poor predictive accuracy,
non-normally distributed residuals, and/or large confidence intervals of

forecasts. For the example of model1 we have been discussing so far,


we can therefore conclude that our AR(1) model for therapist analyti-
cal thinking is adequate, and we can proceed to the final step of inter-
preting the model in context. In other words, we now have confidence
that the aforementioned structural signature described by the model is a
good account of the nature of our series. If we instead conclude that the
model is inadequate, the process reverts to Step 3 where another candi-
date model is chosen.

Step 6: Interpret model in context

The final step of interpreting the model means to relate its parameters,
and the structural signature they constitute, to our understanding and/
or qualitative analysis of the (discourse) phenomena in their context(s) of
occurrence. Recall that our two models for therapist and client analytical
thinking are, respectively

• AR(1): yt = 16.6586 + 0.6629yt-1 + at


• MA(1): yt = 42.2339 − 0.6256at-1 + at.

The following points detail their structural signatures from a comparative


perspective

• the estimated mean μ of therapist analytical thinking is
16.6586/(1 − 0.6629) = 49.417 (since the constant term 16.6586 = (1 − Φ1)μ),
while that of client analytical thinking is 42.2339. These are close to the
earlier calculated sample means (i.e., means of the observed series) of
49.999 and 42.001, respectively (a quick numerical check is sketched after
this list). The therapist demonstrates a higher level of analytical thinking
(i.e., formal, logical, and hierarchical versus informal, personal,
here-and-now, and narrative) than the client, which is not surprising for
CBT dyads.
• for the therapist, a 1-unit increase/decrease in analytic thinking pre-
dicts a 0.6629-unit increase/decrease in the immediately following ses-
sion, with no significant influence beyond that. This implies a forward
but short-term momentum of analytical thinking display that could be
linked to either a strategic stepping up of some analytically oriented
technique or, conversely, a strategic relaxation of the same.
• for the client, a 1-unit increase/decrease in the residual (i.e., prediction
error) predicts a 0.6256-unit decrease/increase in analytical thinking in
the immediately following session, with no significant influence beyond
that. This implies a session-to-session ‘bounce’ where unexpectedly
high/low displays of analytical thinking tend to be balanced out by a
movement in the opposite direction in the following session. It could be

linked to some ‘reactionary’ dynamic where the client self-monitors and


adjusts their linguistic display on a sessional basis.
• considering the two together, the contrasting structural signatures
in our CBT dyad are consistent with the therapist–client linguistic
asynchrony uncovered via cluster analysis in Chapter 3, although the
analysis is limited only to one variable here.
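A quick numerical check of the first point above, assuming the series and fitted parameter estimates from earlier in the chapter, could look like this:

#numerical check of the implied means (sketch)
implied_mean_therapist = 16.6586/(1 - 0.6629)  #constant term divided by (1 - AR(1) operator)
implied_mean_client = 42.2339                  #for the MA(1) model, the constant is the mean itself
print(implied_mean_therapist, series.Therapist.mean())
print(implied_mean_client, series.Client.mean())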

The time series models lend a useful complementary perspective to the


previous cluster analysis outcomes since the latter regards (sub)sessions as
independent units of analysis while time series analysis directly exploits the
interdependence or autocorrelation among session properties. This means
that time series and cluster analysis could be used as complementary tools
for the same research, training, and/or self-evaluation purposes previously
highlighted. To further investigate how the structural signatures are mani-
fested in context, we can adopt a strategy of manually examining thera-
pist–client interaction for analytical thinking-related linguistic behavior
at one-session intervals, as informed by the models. The extract below is
from just one session, but it provides an example of a starting point for
such qualitative investigation.

1. THERAPIST: Okay. Because what you just said is what you can do.
And I haven’t heard the irrational beliefs yet.
2. CLIENT: Okay, here they are. Sorry. Yes, here they are. Okay, so, I
need the surgery to reduce my pain – where are the irrational beliefs?
Oh, no, here it is. I need the surgery to reduce my pain; otherwise, I
can’t go on in life as I should. I’m just going to go through – shall I go
through them?
3. THERAPIST: Quickly, yes.
4. CLIENT: Okay. “Others will not approve of my negative attitude,
and I couldn’t stand that”. “Comparing myself to this guy going into
surgery and telling myself he’s better than me, as he shouldn’t be. And
I can’t stand my mother thinking I’m not strong enough to recover as
fast as he does”.
5. THERAPIST: Well done in identifying the irrational beliefs. When I
say, “Well done,” – I also mean “Well caught.”
6. CLIENT: Well caught, right. I feel angry towards my mom for pushing
me back to work when the surgery hasn’t even happened, and I can’t
stand that feeling.
7. THERAPIST: But perhaps, “My mother shouldn’t have told me—” is
a more helpful way of expressing the irrational belief here.
8. CLIENT: Ah, got you. Okay.
9. THERAPIST: Remember: precision is helpful when identifying the
irrational beliefs.

10. CLIENT: Precision, okay. My mother –


11. THERAPIST: But you’re doing very well.
12. CLIENT: My mother shouldn’t have told me to go back to work when
this other person is or whatever before the surgery hasn’t even hap-
pened yet. And I can’t stand that feeling. I shouldn’t feel this, as she – I
shouldn’t be thinking in this way, as she is expressing herself. And she
is just expressing herself. It’s unfair – oh, wait, maybe this is where –
hold on, I didn’t organize it for whatever – no, no.
13. THERAPIST: Which, by the way, is a lesson that the more well-organ-
ized your ABCs are, the more clearly the disputing steps can follow.

The extract typifies CBT interaction. The therapist adopts an analytical,


and often instructive, tone in guiding the client through identifying her
irrational beliefs as a precursor to disputing them. This is apparent from
turns 1, 5, 7, and 9. The client plays a more passive role in responding to
the therapist’s instruction, narrating her thoughts and reflections on her
feelings towards her mother. Unsurprisingly, the therapist scores higher on
analytical thinking. The AR(1) model is realized by a momentum where
the therapist’s display of analytical thinking strengthens from one session
to the next, indicative of a strategic stepping up of relevant techniques
(e.g., directing, questioning, summarizing) – or conversely, a gradual relax-
ation when the focus of treatment shifts elsewhere. The upward-down-
ward cycle is relatively frequent as analytical thinking scores are mutually
correlated across (only) one-session intervals. The client’s MA(1) model,
on the other hand, is realized by a more reactionary and potentially ‘self-
adjusting’ tendency, which we see traces of in turns 6, 8, and 10. Since pre-
sent values are correlated with past residuals, likewise across one-session
intervals, unexpectedly high/low displays of analytical thinking tend to
be balanced out by a movement in the opposite direction in the follow-
ing session. The fact that both contrasting trajectories play out over the
same one-session interval (as both AR and MA models have an order of
1) points towards what we may call ‘asynchrony in tandem’ – which again
dovetails with Chapter 3’s finding that none of the sampled CBT sessions
were synchronized. Our sample analysis deals with only one LIWC vari-
able as the introduced Box-Jenkins method is fundamentally a univariate
modeling approach. We can expect more intricate findings by repeating
the procedure on other variables, or using more complex approaches that
model the relationships between multiple variables as they change across
time. Examples include vector autoregression models (Lütkepohl, 2005)
that are beyond our scope.
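For readers curious about what such a multivariate extension looks like in code, a minimal sketch using the VAR class in statsmodels is shown below; it is illustrative only and not part of the analysis reported in this chapter.

#sketch of a vector autoregression (VAR) model relating both series (illustrative only)
from statsmodels.tsa.api import VAR
var_model = VAR(series[['Therapist', 'Client']]).fit(maxlags=1)
var_model.summary()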
At a more general level of analysis, the very observation of whether the
time series at hand is modelable (Tay, 2021), which is revealed as early
as Step 2, can also be insightful. A modelable series is simply a series of

measurements for which a well-defined model can be found to fit, implying
some regularity across time in the underpinning processes and/or phe-
nomena. Conversely, a non-modelable series randomly fluctuates over
time such that (P)ACF have no spikes at all lags. We could also say the
series is white noise right from the beginning. In many discourse analytic
contexts, these discourse-relevant measures are assumed to reflect and/
or even constitute some background social reality, and the very objective
of discourse analysis is to theorize the relationship between the two. The
(non)modelability of a discourse time series, considered together with
relevant background knowledge, would therefore point towards one of
the following relationships:

• A non-modelable series occurring against a correspondingly unpredictable
background reality. This appears to be the most ‘innocuous’ sce-
nario where no link is established between the two. An example is some
discourse feature used in an unpatterned way across daily news reports.
• A modelable series occurring against an unpredictable background real-
ity. This means that the discourse feature is patterned despite its osten-
sible background being non-patterned, and implies the strategic use of
language/discourse to construe or construct reality, as widely assumed
in critical discourse analytic approaches. An example is some discourse
feature used in a patterned way across daily news reports.
• A non-modelable series occurring against a predictable background
reality. This could either imply an ‘innocuous’ relationship, where there
is no evidence that the discourse feature is strategically linked to the
background, or a more subtle construal strategy, where care is taken to
avoid perceptions of this link. An example of the former is the use of a
non-Olympics-related discourse feature over four-yearly cycles, and the
latter is the use of an Olympics-related feature over the same.
• A modelable series occurring against a predictable background reality.
This could again imply one of two things, depending on whether the
two ‘match’; i.e., whether the model depicting the series aligns with the
model depicting the background. In the former case, the predictability
of the discourse corresponds to the background and simply reflects it, an
example being a time series of the phrase ‘Olympic games’ in the media
over four-yearly cycles. In the latter case, there would be a case for the
discourse feature being used to construe a reality that diverges from
what is actually going on. Many examples from the critical discourse
analytic literature could be cited here.

A time series analysis of discourse could therefore be conducted holistically


at both levels, with accounts of the specific time series models that fit the
data as well as the general relationship between its (non)modelability and

the nature of the underpinning context. As mentioned earlier, forecasting


values is another key element of time series analysis that may not always be
relevant to discourse analysts, but could shed prognostic light on the likely
course of psychotherapy treatment. Returning to our therapist model yt =
16.6586 + 0.6629yt-1 + at, the forecasted analytical thinking score for the
forthcoming 41st session can be obtained by simply substituting the value
of the 40th session into yt-1. Note that at is the error term in the 41st ses-
sion and can only be calculated once the actual value of the 41st session
is known. Also, for longer forecasting horizons many intervals into the
future, each successive forecast will itself be based upon a forecasted value.
Prediction intervals take this into account and thus always become wider
as the horizon increases. Confidence intervals, on the other hand, estimate
the uncertainty of each predicted value only at that point in time. The dif-
ference between them is, however, not always crucial in practice (Granger
& Newbold, 1986).
Figure 5.7 illustrated our three-session-ahead forecasts in both series,
with 95% confidence intervals. This gives therapists a prognostic idea of
how the linguistic display of analytical thinking is likely to be if the cur-
rent interactional dynamic continues. It therefore also allows them to plan
or adjust their interactional styles should the forecasted values be deemed
undesirable for some reason. This is analogous to using time series models
for statistical process control (Alwan & Roberts, 1988) in manufacturing
and other industries, where statistical analysis informs the monitoring and
quality control of various processes. Nevertheless,
statistical forecasting techniques that rely entirely on data and prob-
ability principles should be seen as just one of many tools. Judgmental or
‘qualitative forecasts’ that rely more on knowledge and experience could be
just as important in research and practice (Hyndman & Athanasopoulos,
2018). As mentioned, the complementary use of data analytic techniques
like cluster analysis and time series analysis should also be encouraged to
offer the therapist/researcher multiple perspectives for research, training,
and self-evaluation.

Python code used in this chapter


The Python code used throughout the chapter is reproduced in sequence
below for readers’ convenience and understanding of how the analysis
gradually progresses.

Generating and plotting a random series

#library imports used in the code below
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#generate and plot 100 random values
pd.DataFrame(np.random.normal(0,1,100)).plot()

Computing and plotting (P)ACF

#compute (P)ACF
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
series = pd.read_csv('data.csv', index_col='t')
acf(series.y, nlags=10)
pacf(series.y, nlags=10)

#plot (P)ACF
plot_acf(series.y, lags=10, alpha=0.05, title='Autocorrelation function')
plot_pacf(series.y, lags=10, alpha=0.05, title='Partial autocorrelation function')

Seasonal decomposition

#seasonal decomposition of time series


from statsmodels.tsa.seasonal import seasonal_decompose
series.index = pd.date_range(freq='m', start='1949', periods=len(series))
seasonal_decompose(series.y).plot()

STEP 1: Inspect series

#inspect series
series = pd.read_csv('data.csv', index_col='Session')
series.plot(subplots=True)
series.describe()

#ADF test for stationarity
from statsmodels.tsa.stattools import adfuller
adfuller(series.Therapist, autolag='AIC')[1]

#Breusch-Pagan test for homoscedasticity
from statsmodels.formula.api import ols
series['time'] = np.array(range(len(series)))
model = ols('Therapist ~ time', data=series).fit()

#build a design matrix with a constant column for the Breusch-Pagan test
def create_array(col):
    s = []
    for i in col:
        a = [1, i]
        s.append(a)
    return np.array(s)

array = create_array(series.Therapist)

from statsmodels.stats.diagnostic import het_breuschpagan
het_breuschpagan(model.resid, array)[1]

#differencing
series['Therapist_diff'] = series.Therapist.diff(periods=1)

STEP 2: Compute (P)ACF

#multiple (P)ACF plots


fig, axes = plt.subplots(2, 2)
fig.text(0.5, 0.05, 'Therapist vs. Client (analytical thinking)', ha='center', fontsize=15)
plot_acf(series.Therapist, ax=axes[0,0], alpha=0.05, title='Therapist ACF', lags=10)
plot_pacf(series.Therapist, ax=axes[0,1], alpha=0.05, title='Therapist PACF', lags=10)
plot_acf(series.Client, ax=axes[1,0], alpha=0.05, title='Client ACF', lags=10)
plot_pacf(series.Client, ax=axes[1,1], alpha=0.05, title='Client PACF', lags=10)

STEP 4: Fit model and estimate parameters

#remove final three values to form training data


train_series = series.iloc[0:len(series)-3]

#fit candidate models

import statsmodels.api as sm
model1 = sm.tsa.SARIMAX(train_series.Therapist, order=(1,0,0), trend='c').fit()
model1.summary()

model2 = sm.tsa.SARIMAX(train_series.Client, order=(0,0,1), trend='c').fit()
model2.summary()

STEP 5: Evaluate predictive accuracy, model fit, and residual diagnostics

#predict and forecast values


predict = (model1.get_prediction(start=1, end=len(series)))
predictinfo = predict.summary_frame()

forecast = (model1.get_prediction(start=len(series), end=len(series)+3))
forecastinfo = forecast.summary_frame()

#plot predicted/forecasted vs. observed values

fig, ax = plt.subplots(figsize=(10, 5))
forecastinfo['mean'].plot(ax=ax, style='k--', label="forecast")
plt.plot(series.Therapist, label="observed", color='dodgerblue')
plt.plot(predictinfo['mean'], label="predicted", color='orange')
ax.axvspan(len(train_series), len(train_series)+3, color='red', alpha=0.2, label='train-test')
ax.fill_between(forecastinfo.index, forecastinfo['mean_ci_lower'],
                forecastinfo['mean_ci_upper'], color='grey', alpha=0.2,
                label="95% CI")
ax.set_ylabel('Analytical thinking', fontsize=12)
ax.set_xlabel('Session', fontsize=12)
ax.set_title('Therapist series', fontsize=12)
plt.setp(ax, xticks=np.arange(1, len(series)+4, step=2))
plt.legend(loc='best', fontsize=10)

#MAPE for training dataset (i.e., sessions 1-37)

from sklearn.metrics import mean_absolute_percentage_error
mape_train = mean_absolute_percentage_error(train_series.Therapist,
    predict.predicted_mean.iloc[0:len(train_series)])
mape_train*100

#MAPE for testing dataset (sessions 38-40)

mape_test = mean_absolute_percentage_error(series.Therapist.iloc[-3:],
    predict.predicted_mean.iloc[-3:])
mape_test*100

#R2 for training data


from sklearn.metrics import r2_score
r2 = r2_score(train_series.Therapist, predict.predicted_mean.iloc[0:len(train_series)])

#obtain residuals from model

residuals = model1.resid

#residual diagnostic plots

fig, axes = plt.subplots(2, 2)
fig.tight_layout(pad=2.0)
fig.suptitle('Residual diagnostics', fontsize=14, y=1.05)
plot_acf(residuals, ax=axes[0,0], alpha=0.05, title='ACF of residuals', lags=10)
plot_pacf(residuals, ax=axes[0,1], alpha=0.05, title='PACF of residuals', lags=10)
residuals.plot(ax=axes[1,0], title='Residuals series plot')
sns.distplot(residuals, kde=True, axlabel='Residuals histogram', ax=axes[1,1])

#shapiro-wilk test of normality of residuals

from scipy import stats
stats.shapiro(residuals)

References
Allen, M. (2017). Markov analysis. In Mike Allen (Ed.), The SAGE encyclopedia
of communication research methods (pp. 906–909). Sage.
Althoff, T., Clark, K., & Leskovec, J. (2016). Large-scale analysis of counseling
conversations: An application of natural language processing to mental health.
Transactions of the Association for Computational Linguistics, 4, 463–476.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity.
Journal of Econometrics, 31(3), 307–327.
Bowerman, B., & O’Connell, R. (1987). Time series forecasting: Unified concepts
and computer implementation (2nd ed.). Duxbury Press.
Brockwell, P. J., & Davis, R. A. (2016). Introduction to time series and forecasting
(3rd ed.). Springer.
Chatfield, C. (1989). The analysis of time series: An introduction (4th ed.).
Chapman and Hall.
Earnest, A., Chen, M. I., Ng, D., & Leo, Y. S. (2005). Using autoregressive
integrated moving average (ARIMA) models to predict and monitor the number
of beds occupied during a SARS outbreak in a tertiary hospital in Singapore.
BMC Health Services Research, 5, 1–8. https://doi.org/10.1186/1472-6963-5-36
Eubanks, P. (1999). Conceptual metaphor as rhetorical response: A reconsideration
of metaphor. Written Communication, 16(2), 171–199.
Gelo, O. C. G., & Mergenthaler, E. (2003). Psychotherapy and metaphorical
language. Psicoterapia, 27, 53–65.
Hopper, P. J., & Traugott, E. C. (2003). Grammaticalization (2nd ed.). Cambridge
University Press.
Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and
practice. OTexts.
Keating, B., & Wilson, J. H. (2019). Forecasting and predictive analytics (7th ed.).
McGraw-Hill.
Kopp, R. R., & Craw, M. J. (1998). Metaphoric language, metaphoric cognition,
and cognitive therapy. Psychotherapy, 35(3), 306–311.
Lai, D. (2005). Monitoring the SARS epidemic in China: A time series analysis.
Journal of Data Science, 3(3), 279–293. https://doi.org/10.6339/JDS.2005.03(3).229
Levitt, H., Korman, Y., & Angus, L. (2000). A metaphor analysis in treatments
of depression: Metaphor as a marker of change. Counselling Psychology
Quarterly, 13(1), 23–35.
Luborsky, L., Auerbach, A., Chandler, M., Cohen, J., & Bachrach, H. (1971).
Factors influencing the outcome of psychotherapy: A review of quantitative
research. Psychological Bulletin, 75(3), 145–185.

Lütkepohl, H. (2005). New introduction to multiple time series analysis. Springer.


https://link.springer.com/book/10.1007/978-3-540-27752-1
Mergenthaler, E. (1996). Emotion-Abstraction patterns in verbatim protocols: A
new way of describing psychotherapeutic processes. Journal of Consulting and
Clinical Psychology, 64(6), 1306–1315.
Rizzuto, A. M. (1993). First person personal pronouns and their psychic referents.
International Journal of Psychoanalysis, 74(3), 535–546.
Sarpavaara, H., & Koski-Jännes, A. (2013). Change as a journey-clients’
metaphoric change talk as an outcome predictor in initial motivational sessions
with probationers. Qualitative Research in Psychology, 10(1), 86–101.
Sims, P. A. (2003). Working with metaphor. American Journal of Psychotherapy,
57(4), 528–536.
Tagliamonte, S. A. (2012). Variationist sociolinguistics: Change, observation,
interpretation. Wiley-Blackwell.
Tay, D. (2017a). Quantitative metaphor usage patterns in Chinese psychotherapy
talk. Communication and Medicine, 14(1), 51–68.
Tay, D. (2017b). Time series analysis of discourse: A case study of metaphor in
psychotherapy sessions. Discourse Studies, 19(6), 694–710.
Tay, D. (2019). Time series analysis of discourse: Method and case studies.
Routledge.
Tay, D. (2020). Affective engagement in metaphorical versus literal communication
styles in counseling. Discourse Processes, 57(4), 360–375.
Tay, D. (2021). Modelability across time as a signature of identity construction on
YouTube. Journal of Pragmatics, 182, 1–15.
Tay, D. (2022). Time series analysis with python. In D. Tay & M. X. Pan (Eds.),
Data analytics in cognitive linguistics (pp. 49–74). De Gruyter. https://doi.org/10.1515/9783110687279-003
Van Staden, C. W., & Fulford, K. W. M. M. (2004). Changes in semantic uses of
first person pronouns as possible linguistic markers of recovery in psychotherapy.
Australian and New Zealand Journal of Psychiatry, 38(4), 226–232. https://doi.org/10.1111/j.1440-1614.2004.01339.x
6 Conclusion

Data analytics as a rifle and a spade


Readers should now have an adequate conceptual understanding of basic
data analytic techniques like Monte Carlo simulations, the k-means cluster-
ing and k-nearest neighbors algorithms, and ARIMA time series analysis.
Each of these techniques was demonstrated to address specific questions in
psychotherapy research and practice across the trajectory from data collec-
tion to analysis. Readers can adopt and apply the annotated Python code
provided at the end of each chapter to their own datasets and objectives. In
Chapter 2, the Monte Carlo method was introduced as a flexible solution
to the practical problem of missing transcripts, by using what is statistically
known about available data to estimate a range of probable values that
would have been taken on by the missing data. Chapters 3 and 4, respec-
tively, introduced key examples of unsupervised and supervised machine
learning techniques. Chapter 3 showed how k-means clustering, as well as
other clustering algorithms like AHC, can be used to discern latent groups
among psychotherapy transcripts. This, in turn, forms the basis for a lin-
guistic (a)synchrony measure that can be used for research and practitioner
self-reflection alike. While k-means clustering is an unsupervised technique
to induce groups in the dataset, Chapter 4 followed it up by introducing
k-NN as a supervised technique that works with pre-existing groups. This
again presents various analytical options for research and practice, such
as evaluating the extent to which linguistic differences can discriminate
important non-linguistic groupings like therapy type, good/bad outcomes,
and so on. Lastly, Chapter 5 demonstrated ARIMA time series analysis as
a methodological tool for research, practitioner self-reflection, and other
objectives related to prognosis. These include understanding how language
use exhibits structurally regular patterns or signatures of change across ses-
sions, how these relate with non-linguistic measures of change, and the
possibility of forecasting future language use.


Relating the above applications back to the arguments made in Chapter


1 for harnessing data analytics for discourse analysis, the examples
throughout the book have collectively shown how

• discourse analytic constructs have underexplored data analytic


dimensions.
• data analytic outcomes can be interpreted in ways that uniquely address
specific issues in an area of discourse research and/or practice.
• planned and unplanned analytical processes alike can stimulate hypoth-
eses for further investigation.

As the first two points above describe the efficacy of data analytics and
the third point its exploratory value, it is useful to metaphorically describe
data analytics as both a ‘rifle’ and a ‘spade’ for doing discourse analysis.
Rifles are designed for precise aiming and accuracy when hitting the target.
Likewise, finding appropriate data analytic tools and solutions to investi-
gate discourse analytic constructs like (a)synchrony and identity construc-
tion is like an aiming exercise that locks the relevant questions on target,
with systematic and replicable techniques to ensure that the bullets hit the
target with a high level of reliability. Beyond this conceptualization of data
analytics as a ‘problem solver’, it is also an under-utilized spade for plough-
ing new ground, sowing new seeds, and finding something unexpected,
all of which are very much in the inductive spirit of discourse analytic
research. We witnessed scenarios where by-products of the main analytical
process – from observed differences in the predictive accuracy of different
simulation runs in Chapter 2 to precision versus recall scores in Chapter
4 – pointed towards new directions for research that often lie beyond the
original scope of inquiry. Therefore, while hitting predetermined research
targets using the rifle of data analytics, one should also have their spade in
hand to dig up surprises along the way. Emphasizing either metaphor runs
the risk of obscuring the other, so the most important point to remember is
that data analytics functions as both – and often simultaneously.
It was also emphasized at various points that there is much room for
interested readers to go beyond the present introductory scope and fur-
ther their understanding of data analytics. Useful and relatively affordable
online resources include, as mentioned in the introduction chapter, data-
camp.com and towardsdatascience.com. The following are some suggested
learning directions that are covered by these resources, each of which would
deserve its own monograph. First, while we relied mainly on LIWC scores,
readers can explore a wide range of other quantification schemes for their
language and discourse data. Pretrained language models using Google’s
Word2Vec, for example, can be freely downloaded and used to derive
word and document vectors for one’s own discourse data. It is, of course,

also possible to construct unique quantification schemes to better reflect


one’s data and objectives, such as focusing on particular lexical categories
only. Second, readers can equip themselves with many other techniques
useful for discourse research, ideally together with variant algorithms for
the same technique. For instance, in large part because it is already well
known, this book did not discuss regression and its variants where the out-
come of interest is continuous rather than categorical. Variant algorithms
that employ different logical and computational procedures to perform
classification and clustering tasks were briefly mentioned throughout, and
a key step towards data analytics mastery is to practice implementing them
(e.g., as an ensemble) and comparing their outcomes. The latter is, in fact,
an interesting and underdeveloped area for discourse research in its own
right. Other underexplored techniques that can be readily implemented
in Python and harnessed for discourse research include survival analysis
and association rules mining, both of which unfortunately lie beyond the
present scope. Third, readers can explore many other ways to enhance,
validate, and/or evaluate data analytic models to counter perennial prob-
lems like overfitting/underfitting, imbalanced datasets, and so on. These
include skills like feature selection (selecting the most important or rel-
evant features instead of using all features to train models), feature extrac-
tion (transforming raw data into distinct features while retaining as much
information in the raw data as possible), feature engineering (creating new
features from existing ones; see example below), regularization (imposing
penalty terms on models to avoid overfitting), and so on. Last but not least,
while often overlooked as just a matter of aesthetics, readers can further
hone their data visualization and communication skills for important pur-
poses like teaching and knowledge dissemination. Excellent Python visuali-
zation libraries besides the standard duo of matplotlib and seaborn include
Yellowbrick, Plotly, and Bokeh, many of which have advanced interactive
visualization capabilities supported by full online documentation.

Applications in other discourse contexts


It was emphasized early in Chapter 1 that although all major examples in
this book are situated in the context of psychotherapy talk, the data ana-
lytic techniques are all applicable in other discourse contexts. The three
widely researched contexts of media (including social media), politics, and
education were suggested as prime examples. An important reason is that
the temporal and dialogic aspects of discourse are often just as prominent
in all of these contexts as in psychotherapy. This implies that methodo-
logical and analytical options, such as creating sub-transcripts, simulation,
classification, and time series analysis, are all of potential relevance for
addressing specific questions that arise in them.

Let us first consider the case of missing data and applicability of Monte
Carlo simulations. All three discussed categories of data missing completely
at random, at random, and not at random are equally conceivable in these
discourse contexts. For example, social media researchers who employ web
scraping and text mining techniques to gather data from platforms like
Facebook and Twitter may face technical, ethical, and legal challenges that
result in missing or incomplete data (Bruns, 2019). Researchers of political
discourse face similar issues as they turn increasingly towards social media
platforms as a primary source of data (Stieglitz & Dang-Xuan, 2013).
Classroom discourse analysis (Rymes, 2016), an important part of educa-
tion discourse research, is yet another instance where data collection may
be compromised by technical and ethical concerns. In all the above cases,
the general Monte Carlo logic of estimating and evaluating a probabilistic
range of outcomes based on available sample data is straightforwardly
applicable given a suitable quantification scheme.
Basic (un)supervised machine learning techniques like clustering and
classification also have many potential applications in these discourse con-
texts, without necessarily involving the enormous quantities of data that
seem to typify machine learning in many people’s minds. We have, in fact,
not delved too deeply into the issue of sample sizes because there is sim-
ply no catch-all answer to a question like ‘how many samples do I need
for simulations/clustering/classification/time series analysis?’. There are
well-known pros, cons, and remedies to large and small datasets alike that
are beyond the present introductory scope, and, beyond basic statistical
considerations, sample size determination for data analytic techniques is
highly dependent on the context of analysis (Figueroa et al., 2012; Kokol
et al., 2022). The examples throughout this book have illustrated a practi-
cal, needs-based approach that focuses on the objectives and situation at
hand rather than fixating upon a priori sample size determination pro-
cedures. From this perspective, recall that the main objective of cluster-
ing algorithms is to discern latent groups based on (dis)similarities among
observed features, while that of classification algorithms is to evaluate how
pre-existing groups are predictable from these features. It is known that
questions of identity and group membership are among the most pressing
across a variety of theoretical perspectives and discourse research contexts
(Van De Mieroop, 2015). While many of these eschew reductionistic defi-
nitions of identities as static and clear-cut categories, clustering and clas-
sification approaches would at least present new and/or complementary
methodological possibilities. This was amply demonstrated in Bamman
et al.’s (2014) application of cluster analysis to investigate the relation-
ship between gender, linguistic style, and social networks on Twitter. In
the context of political discourse, one specific interesting application is to
investigate election campaign speeches and test if candidates from the same
political party end up clustered together – or conversely, whether the lan-
guage of these speeches predicts party membership, amidst various contex-
tual complexities that may impact language styles. Consider the following
extracts from televised campaign speeches during the 2020 Singaporean
general election (GE). In this unicameral electoral system, parties compete
across geographically defined constituencies for the simple majority vote.
The party that wins the most constituencies is then empowered to form
the national government. Extracts 1 and 2 are by candidates from the PAP
(People’s Action Party), which has won every election and formed the gov-
ernment ever since the nation’s independence. However, while Extract 1
is from a constituency where the PAP is incumbent and the clear favorite,
Extract 2 is from a constituency where it is challenging the WP (Workers'
Party), the incumbent opposition. Extract 3, on the other hand, is by can-
didates of WP who are incumbent in that constituency.

Extract 1: PAP as incumbent


Right now, we are all concerned with COVID-19. We understand
your anxiety about your jobs and your families. Let me assure you,
that once elected, the PAP team at East Coast and the PAP govern-
ment will continue to work hard to see us through this crisis. At the
national level, we have committed almost $100 billion to safeguard
the lives and livelihoods of our people. We will do our best to keep
you in jobs, help you find new ones, and support you to bounce back
stronger. We will continue to strengthen our connectivity, so that our
businesses and our people have opportunities on the global stage.

Extract 2: PAP as challenger


Dear friends, since the 2011 GE, the PAP team continues to walk
with you. We have listened to you, we know the issues, we know how
you feel. Aljunied has not been left behind. I am Chua Eng Leong,
candidate for election, for Aljunied. Victor, Shamsul and myself, con-
tested the 2015 GE with you and for you. Even though we were not
elected, we were not disheartened. We know your concerns about
our ageing population, medical and estate improvements. Just look
around you, you can see many new facilities that have been added
within the last five years. Our infrastructure has improved, as we
continue to engage the government on your behalf, and provide nec-
essary ground feedback.

Extract 3: WP as incumbent
We have worked hard to earn your trust sometimes under very dif-
ficult circumstances. For those who feel that we have not met your
expectations, we seek your understanding and promise to do bet-
ter. Now, more than ever, your vote is essential to chart the kind of
political system Singapore should have. Make your vote count. Vote
for The Workers’ Party. Voters of Aljunied GRC, the PAP keeps say-
ing there’s no need to vote for the opposition as the NCMP scheme
ensures your voice in Parliament. Don’t be swayed by this argument.
Parliament is not just a talk shop where MPs make speeches. It exists
to make laws which are voted on by MPs. The PAP will feel safe
as long as their two-thirds majority is not threatened. But once the
opposition gains more seats, they will be forced to consult you, and
you will have a more responsive government.

In such predictable environments where certain parties tend to be con-
stant favorites and others tend to be underdogs, we might expect clear
linguistic/discursive boundaries that delineate them, boundaries that
clustering/classification techniques should be able to capture. This is evident as PAP
tends to focus on their accomplishments and promises at both constituency
and national levels despite being the challenger in Extract 2, whereas WP
feels the need to emphasize the importance of opposition representation
despite being the incumbent in Extract 3. However, clustering/classifica-
tion is still helpful to tease out and quantify the effects of contextual fac-
tors that might blur these boundaries across a large dataset of speeches.
For example, there are some noteworthy differences between Extracts 1
and 2 where the PAP is incumbent and challenger respectively. Extract
1 exudes a confident and assertive tone that delivers a series of unmiti-
gated promises (‘We will…’) and almost assumes victory (‘Let me assure
you, that once elected…’). Extract 2 is decidedly less assertive and seeks
to establish a more interpersonal connection with voters (‘…continues
to walk with you’), in a similar vein to WP in Extract 3 (‘we seek your
understanding and promise to do better’). The overlapping characteristics
of these speeches can be more concretely represented with data analytic
techniques and visualizations, for example in terms of their relative spa-
tial proximity as demonstrated in Chapters 3 and 4. Similar ideas can be
applied to questions in education discourse analysis. For example, the con-
cept of linguistic (a)synchrony may be extended to classroom talk across
lessons and investigated with cluster analysis, and linguistic/discursive dif-
ferences between natural groupings like teacher-student, discipline, and
levels of instruction (Csomay, 2007) may be investigated with classifica-
tion techniques like k-NN. Such differences are usually tested with general
linear models like t-tests, ANOVA, and linear regression. It is noticeable
that researchers do not always address whether their data meet standard
assumptions like linearity, constant variance, (multivariate) normality, and
independence, so non-parametric approaches like k-means clustering and
k-NN may be more appropriate in cases where these are not met.
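As a hedged illustration of how either scenario might be set up, the sketch below clusters a handful of texts (campaign speeches or classroom turns) by invented LIWC-style scores with k-means, and then checks how well a k-NN classifier recovers their pre-existing group labels (party or speaker role). All values and labels are placeholders; a real analysis would of course involve a much larger dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# hypothetical LIWC-style scores, one row per text
X = pd.DataFrame({
    "analytic": [78, 82, 55, 60, 74, 58, 80, 57],
    "clout":    [90, 88, 70, 66, 85, 69, 87, 71],
    "tone":     [65, 70, 80, 84, 68, 79, 66, 82],
})
groups = np.array(["A", "A", "B", "B", "A", "B", "A", "B"])  # e.g. party or role

X_std = StandardScaler().fit_transform(X)

# unsupervised: do the texts fall into latent clusters, and do these
# clusters line up with the pre-existing groups?
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print(pd.crosstab(groups, clusters))

# supervised: how well do the linguistic features predict group membership?
X_tr, X_te, y_tr, y_te = train_test_split(X_std, groups, test_size=0.25,
                                          stratify=groups, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("k-NN test accuracy:", knn.score(X_te, y_te))
```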
As for the time series analysis of language and discourse data, ARIMA
models likewise offer underexplored possibilities for media, political,
and educational discourse research across a range of phenomena and
time scales (Tay, 2019). Traditional print media like newspapers tend to
be published daily and the language therein collectively analyzed across
longer intervals like years and beyond (Partington, 2010). However, lin-
guistic and discursive elements in contemporary social media may unfold
across much shorter and underexplored time intervals in the order of sec-
onds, especially in highly interactional contexts like live gaming streams
(Recktenwald, 2017). The modelability and structural signatures underly-
ing linguistic/discursive elements in such exchanges, against the backdrop
of spontaneous real-time activity like live streams, open up a whole new
area of investigation that is aligned with the key assumptions of ARIMA
time series models. Other forms of social media that tend to unfold across
more ‘familiar’ time intervals, such as daily or weekly video uploads on
YouTube, can also be analyzed in similar ways (Tay, 2021b). Likewise,
many forms of political discourse are inherently time-sensitive, ranging
from the above examples of daily electoral campaign speeches to regular
communications in response to real-world events. Tay (2021a), for exam-
ple, compared the language of daily press conferences by the World Health
Organization and the Chinese Ministry of Foreign Affairs in response to
the initial phases of COVID-19, interpreting the modelability of various
series from an ideological perspective. Various types of annual speeches in
different political settings are also ripe for time series analysis, with findings
relatable to larger scale background observations of policy, governance,
and so on (Liu & Tay, 2023; Zeng et al., 2020, 2021). Educational dis-
course likewise assumes many time-sensitive forms like teacher talk and/or
teacher-student interaction that could be modeled across multiple sessions,
or even identified segments within a session. The COVID-19 pandemic,
still impacting some parts of the world at the time of writing, forced many
ill-prepared teachers and students to transition between face-to-face, online,
and hybrid modes of teaching and learning (Mishra et al., 2020). This
raises interesting questions not only about general differences in teacher
talk, student talk, and/or teacher-student interaction between these modes,
but also specific structural patterns across sessions that may underlie each
scenario. The extent to which the implicit structural signatures may relate
to explicit pedagogical strategies employed to cope with the pandemic may
be of particular interest.
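As a minimal sketch of how such a per-session series might be handled, the code below follows the Box-Jenkins steps of checking stationarity, fitting, and forecasting on an invented series. The series, its length, and the chosen (1, d, 1) order are illustrative assumptions rather than recommendations for any particular dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)

# hypothetical per-session scores for one teacher or dyad (40 sessions)
series = pd.Series(50 + np.cumsum(rng.normal(0, 2, 40)))

# Box-Jenkins step 1: check stationarity and difference once if needed
d = 0 if adfuller(series)[1] < 0.05 else 1

# Box-Jenkins steps 2-3: fit a simple ARIMA(1, d, 1) model and inspect it
model = ARIMA(series, order=(1, d, 1)).fit()
print(model.summary())

# forecast the next five sessions
print(model.forecast(steps=5))
```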

Combining data analytic techniques in a project


This book presented each data analytic technique in a separate chapter
for the sake of organization and clarity. This does not, however, mean
that there is always a one-to-one match between a certain technique and
a corresponding discourse analytic task. We should, in fact, think about
how the techniques could be co-deployed in useful ways, especially in
larger-scale projects with more comprehensive research objectives and
datasets. This echoes the point made early in Chapter 1 that data analyt-
ics is a “holistic and contextual process that spans across the trajectory
from problem formulation to data collection, hypothesis testing, and inter-
pretation”. Readers should nevertheless not confuse this with the related
but different concept of ensemble learning mentioned in Chapter 4, where
alternative algorithms for the same technique are used to obtain aggre-
gated performance measures.
As an example of combining techniques, imagine a project that is
broadly similar to our demonstrations from Chapters 3 to 5, which aims to
exploit the linguistic properties of psychotherapy talk for various machine-
learning tasks to answer theoretically and practically motivated questions.
Instead of using individual session transcripts as analytical units, however,
the project attempts a larger scale endeavor where each unit of analysis
(or ‘row’ in the dataframe) comprises a complete series of transcripts
representing one dyad. We might, for instance, be interested in how well
dyadic properties measured over multiple sessions, or partially simulated
by Monte Carlo methods in cases of missing data, predict therapy types.
Returning briefly to the election campaign example above, this would be
analogous to predicting political party membership based on aggregated
properties of multiple speeches by the same person(s) within the same con-
stituency. While we might rely on, say, the mean LIWC variable scores for
each dyad as predictive features, the newly introduced time series compo-
nent could be exploited by incorporating relevant information – for exam-
ple, the standard deviation of scores, or even ARIMA parameters like AR
and MA operators – as potential additional predictive features for our
clustering and/or classification models. After all, since these features would
have been found to characterize the data, they should contribute meaning-
fully to the modeling process and improve model performance. This step
of feeding time series analysis outcomes into clustering/classification can
be described as a basic form of feature engineering – the use of extra infor-
mation within and beyond raw data, supported by domain knowledge, to
create new features and improve the quality of machine-learning processes.
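A hedged sketch of this feature-engineering step is given below: each dyad's session-level series of a single LIWC-style score is reduced to a mean, a standard deviation, and a fitted AR(1) coefficient, producing one row per dyad. The six-dyad setup, the simulated scores, and the therapy-type labels are all invented placeholders.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)

rows = []
for dyad_id in range(6):
    # hypothetical 30-session series of one LIWC variable for this dyad
    series = pd.Series(60 + rng.normal(0, 5, 30))
    ar1 = ARIMA(series, order=(1, 0, 0)).fit().params["ar.L1"]
    rows.append({"dyad": dyad_id,
                 "mean_analytic": series.mean(),
                 "sd_analytic": series.std(ddof=1),
                 "ar1_analytic": ar1})

features = pd.DataFrame(rows)
features["therapy_type"] = ["CBT"] * 3 + ["humanistic"] * 3
print(features)
# The columns other than 'dyad' and 'therapy_type' can now feed the clustering
# and classification workflows illustrated in Chapters 3 and 4.
```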
Another interesting approach to combine techniques is to think about
how supervised and unsupervised learning – illustrated in this book by
k-nearest neighbors and k-means clustering, respectively – can comple-
ment each other for the objectives at hand. We discussed a potential appli-
cation in Chapter 3 to validate unsupervised k-means clustering outcomes
by treating the emergent group labels as ‘pre-existing’ and attempting to
predict them from the data by supervised logistic regression. The general
strategy of using clustering to prepare the data for further classification
and regression analysis has, in fact, been seen in recent work across fields
ranging from education to computer science (Budisteanu & Mocanu,
2021; El Aissaoui et al., 2019). It could also be considered a form of fea-
ture engineering since we are exploring and extracting new features from
raw data with the ultimate objective of creating better models to explain
phenomena of interest.
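The sketch below illustrates this tandem use in its simplest form: k-means labels derived from a placeholder feature matrix are treated as 'pre-existing' groups and then predicted back from the same features with cross-validated logistic regression. The data are random and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))   # placeholder feature matrix (e.g. LIWC scores)

# unsupervised step: derive emergent group labels
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# supervised step: treat the labels as 'pre-existing' and predict them back
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, labels, cv=5)
print("mean cross-validated accuracy:", round(scores.mean(), 3))
```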

Final words: Invigorate, collaborate, and empower


At the beginning of this book, some fundamental characteristics of data
analytics were highlighted in making a case for its application to dis-
course analysis: objectivity, scalability, data richness, novelty, flexibility,
and interdisciplinarity. As we look back and reflect on how our basic case
studies exemplify these characteristics in their own ways, we draw the cur-
tain with three final statements that reiterate the theoretical, practical, and
pedagogical value of data analytics for discourse analysts.
First, data analytic conceptualizations of discourse analytic questions
can invigorate our theoretical perspectives and constructs, bridging tra-
ditional and novel ways of doing discourse analysis. The co-existence of
methodological variants or alternatives has always been an important and
valuable trait of discourse research. Besides bringing different theoretical
backgrounds, assumptions, and analytical techniques to the table, meth-
odological diversity also promotes general ideals like critical self-reflection
and reflexivity among researchers. A data analytic point of departure
in this regard would represent a genuine example of cross-fertilization
between disciplines and epistemologies that are quite often wrongly
assumed to be incompatible.
Second, developing our data analytic repertoire as discourse researchers
opens up more collaborative pathways – be it for the purpose of knowl-
edge dissemination, transfer, or enhancing competitiveness with inter-dis-
ciplinary teams in research funding. The latter point is most pertinent for
discourse researchers who explicitly position their research as appliable,
yet often encounter criticism when navigating strongly evidence-based
contexts like health care and education. The ability to connect discourse-
oriented constructs and claims with data analytic algorithms and solutions
has, in my personal experience, been valuable not only in discussions with
collaborators, but also in enhancing the relevance of discourse research
amidst the growing emphasis on ‘quantifiable outcomes’.
Last but not least, a data analytics approach to discourse studies has the
understated potential to empower our students (and ourselves!) through
educational ‘by-products’ like quantitative literacy and programming
skills that are in high demand in the digital economy. A growing num-
ber of tertiary institutions around the world have begun to offer options
for humanities students to pursue qualifications in data analytics, ranging
anywhere from elective courses to a minor or double major program. The
opportunity to impart these skills through examples grounded in human-
istic contexts is all the more valuable since it allows educators to convey
the importance of balancing precision with nuance in understanding our
linguistic and social worlds. It is hoped that this book has successfully out-
lined a general approach towards that end.

References
Bamman, D., Eisenstein, J., & Schnoebelen, T. (2014). Gender identity and lexical
variation in social media. Journal of Sociolinguistics, 18(2), 135–160.
Bruns, A. (2019). After the ‘APIcalypse’: Social media platforms and their fight
against critical scholarly research. Information, Communication and Society,
22(11), 1544–1566.
Budisteanu, E.-A., & Mocanu, I. G. (2021). Combining supervised and unsupervised
learning algorithms for human activity recognition. Sensors, 21(18), 6309.
Csomay, E. (2007). A corpus-based look at linguistic variation in classroom
interaction: Teacher talk versus student talk in American University classes.
Journal of English for Academic Purposes, 6(4), 336–355.
El Aissaoui, O., El Alami El Madani, Y., Oughdir, L., & El Allioui, Y. (2019).
Combining supervised and unsupervised machine learning algorithms to predict
the learners’ learning styles. Procedia Computer Science, 148, 87–96.
Figueroa, R. L., Zeng-Treitler, Q., Kandula, S., & Ngo, L. H. (2012). Predicting
sample size required for classification performance. BMC Medical Informatics
and Decision Making, 12(1), 8.
Kokol, P., Kokol, M., & Zagoranski, S. (2022). Machine learning on small
size samples: A synthetic knowledge synthesis. Science Progress, 105(1),
00368504211029777. https://doi.org/10.1177/00368504211029777
Liu, Y., & Tay, D. (2023). Modelability of WAR metaphors across time in cross-
national COVID-19 news translation: An insight into ideology manipulation.
Lingua, 286, 103490. https://doi.org/10.1016/j.lingua.2023.103490
Mishra, L., Gupta, T., & Shree, A. (2020). Online teaching-learning in higher
education during lockdown period of COVID-19 pandemic. International
Journal of Educational Research Open, 1, 100012. https://doi.org/10.1016/j.ijedro.2020.100012
Partington, A. (2010). Modern Diachronic Corpus-Assisted Discourse Studies
(MD-CADS) on UK newspapers: An overview of the project. Corpora, 5(2),
83–108.
Recktenwald, D. (2017). Toward a transcription and analysis of live streaming on
Twitch. Journal of Pragmatics, 115, 68–81. https://doi.org/10.1016/j.pragma.2017.01.013
Rymes, B. (Ed.). (2016). Classroom discourse analysis: A tool for critical reflection
(2nd ed.). Routledge.
Stieglitz, S., & Dang-Xuan, L. (2013). Social media and political communication:
A social media analytics framework. Social Network Analysis and Mining, 3(4),
1277–1291.
Tay, D. (2019). Time series analysis of discourse: Method and case studies.
Routledge.
Tay, D. (2021a). COVID-19 press conferences across time: World Health
Organization vs. Chinese Ministry of Foreign Affairs. In R. Breeze, K.
Kondo, A. Musolff, & S. Vilar-Lluch (Eds.), Pandemic and crisis discourse:
Communicating COVID-19 (pp. 13–30). Bloomsbury.
Tay, D. (2021b). Modelability across time as a signature of identity construction
on YouTube. Journal of Pragmatics, 182, 1–15.
Van De Mieroop, D. (2015). Social identity theory and the discursive analysis of
collective identities in narratives. In A. De Fina & A. Georgakopoulou (Eds.),
The handbook of narrative analysis (pp. 408–428). John Wiley & Sons.
Zeng, H., Burgers, C., & Ahrens, K. (2021). Framing metaphor use over time: ‘Free
Economy’ metaphors in Hong Kong political discourse (1997–2017). Lingua,
252, 102955. https://doi.org/10.1016/j.lingua.2020.102955
Zeng, H., Tay, D., & Ahrens, K. (2020). A multifactorial analysis of metaphors
in political discourse: Gendered influence in Hong Kong political speeches.
Metaphor and the Social World, 10(1), 141–168.