
This article has been accepted for publication in Journal of Personality Assessment,

published by Taylor & Francis.

A Critical Evaluation of the Reflective Functioning Questionnaire (RFQ)

Sascha Müller1*, Leon P. Wendt1*, Carsten Spitzer2, Oliver Masuhr3, Sarah N. Back4,

& Johannes Zimmermann1

1 University of Kassel, Department of Psychology, Kassel, Germany
2 Rostock University Medical Center, Rostock, Germany
3 Asklepios Fachklinikum Tiefenbrunn, Rosdorf, Germany
4 Ludwig-Maximilians-Universität München, Munich, Germany

*shared first authorship

Authors’ note: Ethics committee approval was obtained for data collection (Ethics Committee

at the University Medicine Rostock, #A2020-0075; Ethics Committee at the Faculty of

Behavioural and Cultural Studies, University of Heidelberg, #Tau 2019 1/1). The first authors

contributed equally to this research. The data used in Study 1 are available from the

corresponding authors upon reasonable request. Data and R code for reproducing the analyses

of Study 2 are permanently and openly accessible at https://osf.io/stbz5. Please address

correspondence to Sascha Müller and Leon Wendt, Department of Psychology, University of

Kassel, Holländische Str. 36-38, 34127, Kassel, Germany; E-Mail: [email protected],
[email protected].

Abstract

The Reflective Functioning Questionnaire (RFQ) is an 8-item self-report measure of

reflective functioning that is presumed to capture individual differences in hypo- and

hypermentalizing. Despite its broad acceptance by the field, we argue that the validity of the

measure is not well-established. The current research elaborates on problems of the RFQ

related to its item content, scoring procedure, dimensionality, and associations with

psychopathology. We tested these considerations across three large clinical and non-clinical

samples from Germany and the US (total N = 2289). In a first study, we found that the RFQ

may assess a single latent dimension related to hypomentalizing but is rather unlikely to

capture maladaptive forms of hypermentalizing. Moreover, the RFQ exhibited very strong

associations with measures of personality pathology, while associations with measures of

symptom distress were less strong. In a second preregistered study focused on convergent and

discriminant validity, however, a commonality analysis indicated that associations with

indicators of personality pathology are inflated because some of the RFQ items tap into

emotional lability and impulsivity rather than mentalizing. Our findings demonstrate

limitations of the RFQ. We discuss key challenges in assessing mentalizing via self-report.

Keywords: reflective functioning; mentalizing; validity; factor analysis; U-shaped

associations; commonality analysis



A Critical Evaluation of the Reflective Functioning Questionnaire (RFQ)

Fonagy and colleagues (2016) introduced the Reflective Functioning Questionnaire

(RFQ), a brief screening measure that is intended to assess an individual’s capacity to

adequately interpret mental states of both the self and others (i.e., reflective functioning or

mentalizing) via self-report. The measure as well as its translated versions have been

positively evaluated in several validation studies (Badoud et al., 2015; Fonagy et al., 2016;

Morandotti et al., 2018), leading to the widely shared conclusion that the RFQ is able to capture

deficits in reflective functioning. According to mentalizing theory (e.g., Bateman & Fonagy,

2016), the theoretical spectrum of reflective functioning includes hypo- and hypermentalizing

(i.e., too little or too much certainty about one’s interpretation of mental states) as well as

genuine mentalizing (the optimal trait level, i.e., acknowledging the opaqueness of mental

states). Deviations from the optimal trait level in both directions are presumed to be a sign, a

symptom, and a transdiagnostic risk factor of psychopathology (e.g., Luyten et al., 2020) and

have been targeted by a specialized psychotherapeutic approach, that is, mentalization-based

treatment (e.g., Bateman & Fonagy, 2016).

In the RFQ, respondents are asked to endorse or reject eight statements that are

supposed to relate to mentalizing processes (e.g., Item 6: “Sometimes I do things without

really knowing why”) on a 7-point scale ranging from do not agree at all (= 1) to agree

completely (= 7) (see Panel A of Figure 1). The creators implemented a procedure that

involves scoring half of the items twice (i.e., double-scoring) and splitting the underlying

information into two scales (see Panel B of Figure 1): certainty about mental states (RFQ_C;

i.e., 3, 2, 1, 0, 0, 0, 0) and uncertainty about mental states (RFQ_U; i.e., 0, 0, 0, 0, 1, 2, 3).

According to this logic, a strong rejection of Item 6 on the original scale (= 1) is scored in

such a manner that it is indicative of high certainty (= 3) on the RFQ_C scale and at the same

time indicative of low uncertainty (= 0) on the RFQ_U scale, while a strong agreement is

scored as indicative of low certainty (= 0) on the RFQ_C scale and at the same time as

indicative of high uncertainty (= 3) on the RFQ_U scale. High levels of certainty about

mental states are assumed to reflect hypermentalizing and high levels of uncertainty are

assumed to reflect hypomentalizing (e.g., Badoud et al., 2015; Fonagy et al., 2016;

Morandotti et al., 2018). The latent structure of the RFQ has been investigated repeatedly

using double-scored items and reportedly consists of two negatively correlated factors (with

correlations typically ranging between -.60 and -.80) reflecting the scales of RFQ_C and

RFQ_U (Badoud et al., 2015; Fonagy et al., 2016).

However, in a recent attempt to validate the factor structure of the German version of

the RFQ in a large sample (N = 2477) representative of the German general population,

Spitzer et al. (2021) criticized the use of double-scored items in factor analysis (e.g., Badoud

et al., 2015; Fonagy et al., 2016; Morandotti et al., 2018), noting that the assumption of

uncorrelated item residuals seems unrealistic when two items are derived from the same

original responses. Instead, Spitzer and colleagues provided initial evidence that a

unidimensional model sufficiently explained the observed covariation of the original

responses to RFQ items and suggested that this representation could give rise to two

maladaptive poles of hypo- and hypermentalizing deviating from an adaptive middle region.

Following this notion, they proceeded with testing U-shaped relationships between a

unidimensional RFQ score (dropping two items to improve internal consistency) and

depression, anxiety, and somatic symptoms but found no evidence for such associations. The

findings of Spitzer and colleagues in conjunction with their briefly noted considerations raise

some initial doubts and call for a critical re-examination of the RFQ as a psychometric test. A

critical re-examination and discussion of the RFQ appears to be of particular importance

given that researchers are increasingly adopting the measure for primary investigations of the

mentalizing construct (e.g., Badoud et al., 2018; de Meulemeester et al., 2017, 2018; Huang

et al., 2020; Li et al., 2020). In the following, we will identify and explore potential issues

with the RFQ in detail.

Item Content

An initial concern is that the coverage of the reflective functioning construct in the

RFQ does not seem to converge well with the definition given by the creators (Fonagy et al.,

2016). According to them, reflective functioning pertains to “the capacity to interpret both the

self and others in terms of internal mental states, such as feelings, wishes, goals, desires, and

attitudes” (Fonagy et al., 2016, p. 1). However, all RFQ items but one refer to understanding

oneself (and not others) and most items refer to understanding one’s own behavior (and not

feelings, desires, wishes, goals, or attitudes). The construct of reflective functioning as

defined by the authors might thus not be optimally covered by the items. Although it should

be acknowledged that the RFQ was introduced as a brief screening measure and the creators

themselves generally called for a multidimensional assessment of mentalizing (e.g., Fonagy

et al., 2016; Luyten et al., 2020), a potential lack of coverage cannot be compensated for by

relating the RFQ to a validated and more comprehensive long form. This is because, even

though long forms (i.e., RFQ-54, RFQ-46) are accessible from the creators’ website and are

also used in some studies (e.g., Euler et al., 2021), there are no publications that investigate

their psychometric properties or their relationship to the brief RFQ.

In addition, it seems that the item content of the RFQ may overrepresent another

maladaptive characteristic, namely, a tendency towards impulsive behavior when

experiencing negative emotions (e.g., Item 3: “When I get angry, I say things without really

knowing why I am saying them”; Item 4: “When I get angry, I say things that I later regret”;

Item 5: “If I feel insecure I can behave in ways that put others’ backs up”). That is, it could be

that individuals who endorse these items are not necessarily lacking in reflective capacity, but

merely show a high level of negative urgency (Cyders & Smith, 2008; Settles et al., 2012).

Negative urgency denotes the disposition to act rashly and ill-advisedly under negative

emotions and can thus be seen as reflecting a blend of emotional lability and impulsivity.

While this behavioral signature could be considered a consequence of impairments in

mentalizing abilities, it may well be due to other underlying characteristics and deficiencies

(e.g., lack of adaptive emotion regulation strategies; King et al., 2018). Therefore, these items

do not seem specific enough to the core definition of reflective functioning. By contrast, other

items of the RFQ such as “People’s thoughts are a mystery to me” (Item 1) or “I always know

what I feel” (Item 7) seem to address more directly the certainty or uncertainty involved in

forming inferences about mental states.

Scoring Procedure

The second concern pertains to the aforementioned double-scoring of four of the eight

RFQ items (i.e., Items 2, 4, 5, 6) to derive RFQ_U and RFQ_C scale scores and their use in

factor analysis, as it causes psychometric problems. Given that respondents only provide one

rating for each of these four items on the 7-point scale, the resulting eight rescaled scores on

RFQ_C and RFQ_U are mutually determined. For example, when a respondent rates Item 6

with strong agreement, this necessarily leads to a rescaled score of 3 on RFQ_C and a

rescaled score of 0 on RFQ_U (see Panels A and B of Figure 1). Thus, the two rescaled

scores are not independent of each other and overlap with regard to their information. In fact,

nine of the 16 theoretical combinations of the two scores are mathematically impossible (see

Panel C of Figure 1). In our Sample 1 (see below), this results in an artificially negative correlation of r = -.55 between the two rescaled scores of Item 6. The exact value of the

correlation depends on the univariate distribution of the raw scores and varies slightly

between items and samples. However, when using polychoric correlations that take into

account the ordinal scaling of the scores, the estimate will always be exactly r = -1. This

shows that the two rescaled scores are in fact completely redundant. This issue is particularly

problematic because the assumed factor structure of the RFQ is based on confirmatory factor

analysis (CFA) using double-scored items as indicators (Badoud et al., 2015; Fonagy et al.,

2016). Applying ordinal CFA should produce warning messages because the polychoric

correlations between several indicators will approach r = -1. However, if instead one treats

the indicators as continuous and applies robust maximum likelihood estimation (as was done

in previous validation studies), one also cannot undo the inherent dependencies of the

indicators. In this case, the suggested two-dimensional factor model will artificially induce a

negative correlation between the two factors because the residual correlations of double-

scored item pairs are restricted to zero. Therefore, the finding of two negatively correlated

dimensions of the RFQ (i.e., RFQ_U and RFQ_C) using double-scored items should not be interpreted as evidence in favor of the instrument’s structural validity, as was done in

previous validation studies (Badoud et al., 2015; Fonagy et al., 2016). We argue that the

factor structure of the RFQ is still open to debate.
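
To make the redundancy induced by double-scoring concrete, the following minimal R sketch recodes hypothetical raw responses to a double-scored item (such as Item 6) into the RFQ_C and RFQ_U metrics; the object names are illustrative and not taken from our analysis scripts.

set.seed(1)
# Hypothetical raw responses (1-7) to a double-scored RFQ item, e.g., Item 6
raw <- sample(1:7, size = 500, replace = TRUE)
# Recoding keys of the original scoring procedure
rfq_c <- c(3, 2, 1, 0, 0, 0, 0)[raw]  # certainty about mental states
rfq_u <- c(0, 0, 0, 0, 1, 2, 3)[raw]  # uncertainty about mental states
# Cross-tabulation: 9 of the 16 possible (RFQ_C, RFQ_U) combinations never occur
table(rfq_c, rfq_u)
# The Pearson correlation is artificially negative; its exact value depends on the
# raw-score distribution, whereas the polychoric correlation approaches r = -1
cor(rfq_c, rfq_u)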

The problematic scoring procedure associated with the two scales manifests itself in

conceptual inconsistencies with regard to the RFQ_C scale in particular. Although RFQ_C is

supposed to represent certainty about mental states, all items are geared towards a state of

uncertainty with respect to their semantic content (e.g., Item 1: “People’s thoughts are a

mystery to me”) and are, ultimately, reverse-scored. The certainty scale is thus based

entirely on the denial of uncertainty. In fact, the RFQ contains only one item directly

referring to a state of certainty (Item 7: “I always know what I feel”) and this item is scored

exclusively for the uncertainty scale (RFQ_U).

Associations with Psychopathology

A third issue is that findings on associations between RFQ scales and

psychopathology constructs are somewhat conflicting. As theory posits that mentalizing

impairments such as hypo- and hypermentalizing are a vulnerability factor for severe

psychopathology (e.g., Luyten et al., 2020), positive associations between mentalizing

impairments and various indicators of psychopathology are expected. However, the

correlational patterns of the RFQ_C and RFQ_U scales that emerge from the literature (e.g.,

Badoud et al., 2015; Fonagy et al., 2016; Huang et al., 2020; Li et al., 2020) appear to be at odds with the interpretation that RFQ_C assesses hypermentalizing, as has been noted

previously (de Meulemeester et al., 2018; Euler et al., 2021). Overall, the RFQ_C scale

(certainty about mental states) was often positively associated with mental health, suggesting

that it captures an adaptive characteristic. By contrast, the RFQ_U scale (uncertainty about

mental states) appears to be quite strongly related to various indices of psychopathology.

Taken together, the two scales of the RFQ tend to exhibit similar correlational patterns to

external criteria but with opposite signs. These opposing correlational patterns

of RFQ_C and RFQ_U in a shared nomological net seem more compatible with the notion

that the RFQ reflects a unidimensional continuum ranging from genuine to impaired

mentalizing.

Another aspect in this regard is that, from a theoretical point of view, the association

between mentalizing and psychopathology should depend on which form of psychopathology

is being considered. Specifically, impairments in mentalizing are thought to be a core feature

of personality disorders (e.g., Bateman & Fonagy, 2019; Fonagy et al., 2017; Luyten et al.,

2020). For example, in the Alternative Model for Personality Disorders (AMPD) in the fifth

edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; APA, 2013),

the first general criterion of personality disorders refers to difficulties in understanding and

regulating the self and interpersonal relationships, with mentalizing deficits explicitly

considered as one important element (Bender et al., 2011). Thus, the strongest correlations in

the field of psychopathology should occur where the RFQ is correlated with measures that

capture the severity of personality pathology. Previous studies documented strong



associations between the RFQ scales and measures of personality pathology (e.g., Badoud et

al., 2015; de Meulemeester et al., 2017; Fonagy et al., 2016) as well as various symptom

measures (e.g., de Meulemeester et al., 2018; Fonagy et al., 2016; Huang et al., 2020; Li et

al., 2020). However, the magnitudes of these associations should be compared systematically

by inferential testing. Notably, differential associations should be carefully interpreted in

terms of validity, not least because the RFQ also contains item content that may reflect

constructs such as negative urgency rather than mentalizing specifically. For example,

impulsive responsivity to emotions may be a general underlying disposition for psychological

distress that would be expected to exhibit similar patterns of associations as were reported for

the scales of the RFQ (e.g., Carver et al., 2017; Settles et al., 2012). We therefore argue that

more research is needed to evaluate whether associations of the RFQ with specific,

theoretically selected forms of psychopathology are stronger than associations with non-

specific symptom distress. In addition, it should be investigated whether associations between

the RFQ and indicators of personality dysfunction are artificially inflated due to item content

that is shared with other maladaptive characteristics.

Study 1

We have stated potential concerns with regard to the RFQ related to its item content,

scoring procedure, dimensionality, and associations with psychopathology. Based on our

interpretation of previous findings, we make three observations: First, we propose that the

measure may emerge as unidimensional when modeled more adequately, that is, when not

using double-scored items in CFA given the methodological problems associated with this

approach. Although initial evidence for the unidimensionality of the RFQ has been provided

by Spitzer et al. (2021), their study only involved a single non-clinical sample. In the first

study, we used data from a clinical and a non-clinical sample to compare a series of factor

models for the RFQ including unidimensional and two-dimensional representations in order

to test the generalizability and robustness of a unidimensional solution.

Second, if the RFQ turns out to be unidimensional, it could still capture a unipolar or

bipolar construct. As a unipolar construct, it could range from genuine to impaired

mentalizing, whereas as a bipolar construct, it could capture hypo- and hypermentalizing as

two maladaptive ends of one continuum. Notably, in our view, a unidimensional scoring of

the RFQ lends itself to investigating the unique characteristics of the two poles using non-

linear statistical models. Specifically, in the case where the RFQ captures a continuum that

runs from extreme uncertainty about mental states (“too little mentalizing”) to an optimal

level (“genuine mentalizing”) to excessive certainty about mental states (“too much

mentalizing”), one would predict U-shaped associations with maladaptive characteristics

(e.g., depression) and, vice versa, inverse U-shaped associations with adaptive characteristics

(e.g., well-being). 1 When first testing this hypothesis by specifying quadratic terms in

regression models to predict psychopathology outcomes, Spitzer et al. (2021) found no

support for such U-shaped associations. However, the authors considered only a narrow set of criteria, namely short screening measures of depression, anxiety, and somatic symptoms.

In this study, we explored (inverse) U-shaped relationships between the RFQ and measures

indicative of psychopathology and personality pathology to provide more comprehensive

tests of the hypothesis that the RFQ captures two maladaptive variants of mentalizing on a

single continuum (i.e., hypo- and hypermentalizing).

Third, given that the RFQ contains several non-specific items, we examined

associations with specific forms of psychopathology that are theoretically more closely

1 Thus, both low and high levels of the unidimensional RFQ-8 are supposed to be maladaptive, marking the high
ends of the U-shape when inspecting a maladaptive criterion. By contrast, middle levels are not supposed to be
maladaptive, marking the turning point of the U-shape when inspecting a maladaptive criterion. Similarly, the
association between the RFQ-8 and adaptive criteria should be inverse U-shaped, indicating that middle levels
are again adaptive whereas both low and high levels are not.

related to the core construct of mentalizing. We therefore selected several measures of

personality pathology that are linked to Criterion A (impairments in personality functioning)

and Criterion B (maladaptive personality traits) of the AMPD. We compared the correlations

between the RFQ and measures of personality pathology with correlations between the RFQ

and measures of general symptom distress, and further scrutinized the latent associations

between the RFQ and personality pathology using bifactor exploratory structural equation

modeling (bifactor ESEM).

Method

Sample 1. The first sample was collected at a psychosomatic clinic, the Asklepios

Fachklinikum Tiefenbrunn. Participants (64% female) were 861 inpatients ranging in age

from 18 to 68 (M = 34.0, SD = 13.3). The data presented here were collected as part of routine diagnostics from patients admitted for inpatient treatment, within three days of admission and again shortly before discharge. The RFQ was administered to all included participants at admission without missing data, whereas only 364 participants completed the

measure at discharge. The reason for these missing observations is not dropout; the RFQ was

only administered at discharge at a later stage of data collection and thus only pertained to

364 patients. For more detailed information about the sample characteristics, see Note S1 in

the supplement.

Sample 2. The second sample was collected online within a study on dimensional

measures of personality at Heidelberg University Hospital (Back et al., 2020). Participants

were recruited via flyers and calls for participation on the university website and in several

online forums. The sample comprises 566 young adults (74% female) who completed the

study and whose ages ranged from 18 to 30 (M = 24.2, SD = 3.13). There were no missing

data. For more detailed information about the sample characteristics, see Note S2 in the

supplement.

Measures. The RFQ was administered in both samples. The Brief Symptom

Inventory (BSI), Inventory of Interpersonal Problems (IIP-32), WHO-5 Well-Being Index

(WHO-5), Patient Health Questionnaire (PHQ), Inventory of Personality Organization (IPO-

16), and Operationalized Structural Diagnostic Questionnaire – Short Form (OPD-SQS) were

administered in Sample 1. Some of the criterion measures were administered at both

admission and discharge in Sample 1; however, we used only the data at admission (i.e.,

before treatment) for exploring their associations with the RFQ. The Personality Inventory

for DSM-5 – Brief Form (PID-5-BF) was administered in Sample 2.

Reflective Functioning Questionnaire (RFQ). The RFQ (Fonagy et al., 2016)

comprises eight items forming the two scales of certainty about mental states (RFQ_C) and

uncertainty about mental states (RFQ_U). According to the scoring procedure described

above, the 7-point Likert scale (“do not agree at all” = 1 to “agree completely” = 7) is

recoded as 3, 2, 1, 0, 0, 0, 0 for the scale RFQ_C; for the scale RFQ_U, items are recoded as

0, 0, 0, 0, 1, 2, 3 (except for the reverse-coded Item 7, which is recoded as shown for RFQ_C). Results for the original scoring procedure are reported in the supplement. For the main results, we refrained from

applying the scoring procedure due to the aforementioned problems. Items were thus kept in

the original coding (i.e., 1, 2, 3, 4, 5, 6, 7) and only Item 7 was reversed such that it

corresponded to the content polarity of the other items. High values indicated uncertainty

about mental states and low values indicated certainty. In the following, we will refer to the

mean score of the 8-item RFQ as the RFQ-8. In a previous investigation using a

representative sample from the German population, Spitzer et al. (2021) presented results in

favor of using a reduced 6-item version of the RFQ (omitting Items 7 and 4), which will subsequently be referred to as the RFQ-6.
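
As a minimal sketch of this unidimensional scoring in R, assuming a hypothetical data frame dat with raw item responses rfq1 to rfq8 on the original 1-7 scale:

dat$rfq7r <- 8 - dat$rfq7  # reverse Item 7 to match the polarity of the other items
# RFQ-8: mean of all eight items (Item 7 reversed); high values indicate uncertainty
dat$rfq8_score <- rowMeans(dat[, c("rfq1", "rfq2", "rfq3", "rfq4",
                                   "rfq5", "rfq6", "rfq7r", "rfq8")])
# RFQ-6: omits Items 7 and 4 (Spitzer et al., 2021)
dat$rfq6_score <- rowMeans(dat[, c("rfq1", "rfq2", "rfq3", "rfq5", "rfq6", "rfq8")])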

Brief Symptom Inventory (BSI). The BSI (Franke, 2000) was used to assess distress

associated with symptoms of mental illness during the last week (e.g., “loss of appetite”) on a

5-point scale ranging from “not at all” (0) to “extremely” (4). Internal consistency of the BSI

total score was estimated at α = .95 in Sample 1.

Inventory of Interpersonal Problems (IIP-32). Interpersonal problems were

measured with the 32-item version of the IIP (Horowitz et al., 2000). The measure assesses

distress associated with interpersonal behaviors (e.g., “I open up to other people too much”)

that are performed excessively or inhibited strongly on a 5-point scale ranging from “not at

all” (0) to “extremely” (4). Internal consistency of the IIP-32 total score was estimated at α =

.87 in Sample 1.

WHO-5 Well-Being Index (WHO-5). The WHO-5 (World Health Organization,

1998) is a self-report measure of well-being. It consists of five items (e.g., “Over the last two

weeks I have felt cheerful and in good spirits”) that are rated on a 6-point Likert scale ranging

from “at no time” (0) to “all the time” (5). High scores indicate high subjective well-being.

Internal consistency of the WHO-5 total score was estimated at α = .85 in Sample 1.

Patient Health Questionnaire (PHQ). The PHQ-15 (Kroenke et al., 2002) is a 15-

item module for assessing the severity of impairment associated with somatic symptoms

(e.g., “back pain”) that have been experienced during the last four weeks. Items are rated on a

3-point Likert scale ranging from “not bothered at all” (0) to “bothered a lot” (2). The PHQ-9

(Kroenke et al., 2001) is a 9-item module for assessing the impairment associated with the

nine DSM-IV criteria of depression (e.g., “Little interest or pleasure in doing things”,

“Feeling tired or having little energy”) that may have been experienced during the last two

weeks. Items are rated on a 4-point scale ranging from “not at all” (0) to “nearly every day”

(3). Internal consistencies of the PHQ-15 and PHQ-9 total scores were estimated at α = .81 and α = .84, respectively, in Sample 1.

Inventory of Personality Organization (IPO-16). The IPO-16 assesses general

personality dysfunction (Zimmermann et al., 2013) in the domains of identity diffusion (e.g.,

“I feel that my tastes and opinions are not really my own, but have been borrowed from other

people”), primitive defenses (e.g., “People tell me I behave in contradictory ways”), and

reality testing (e.g., “I can’t tell whether certain physical sensations I’m having are real, or

whether I’m imagining them”). The 16 items are answered on a 5-point scale from “never

applies” (1) to “always applies” (5). Internal consistency of the IPO-16 total score was

estimated at α = .86 in Sample 1.

Operationalized Structural Diagnostic Questionnaire – Short Form (OPD-SQS).

The OPD-SQS is a 12-item measure of personality dysfunction (Ehrenthal et al., 2015).

Statements are endorsed or rejected on a 5-point scale ranging from “completely untrue” (0)

to “entirely true” (4). The items give rise to the scales of self-perception (e.g., “I sometimes

feel like a stranger to myself”), contact (e.g., “I sometimes misjudge how my behavior affects

others”), and relationship (e.g., “It can be dangerous to let others get too close to you.”).

Internal consistency of the OPD-SQS total score was estimated at α = .86 in Sample 1.

Personality Inventory for DSM-5 – Brief Form (PID-5-BF). The PID-5-BF (APA,

2013; Zimmermann et al., 2014) is a 25-item measure assessing the broad maladaptive

personality domains of negative affectivity, detachment, disinhibition, antagonism, and

psychoticism with five items each. Items are rated on a 4-point scale ranging from “very

false” (0) to “very true” (3). Internal consistency of the PID-5-BF total score was estimated at

α = .86 in Sample 2.

Statistical Analyses. The analyses were performed using R version 4.0.3 (R Core

Team, 2020) in conjunction with the package lavaan (Rosseel, 2012) and Mplus version 8.4

(Muthén & Muthén, 1998–2019). All latent variable models were estimated using the

Weighted Least Squares Mean and Variance Adjusted (WLSMV) estimator that was applied

to the polychoric correlation matrix. The fit of latent variable models was evaluated by a

commonly used combination of fit indices and cut-off criteria (i.e., Comparative Fit Index

[CFI] > .95, Root Mean Square Error of Approximation [RMSEA] < .06, Standardized Root

Mean Square Residual [SRMR] < .08; Hu & Bentler, 1999). The internal consistency of the

RFQ was estimated with the model-based McDonald’s ω for categorical variables (Flora,

2020). We report fully standardized estimates.

Factor Structure. The factor structure of the RFQ was evaluated by confirmatory

factor analysis (CFA) and exploratory factor analysis (EFA) using the original item

responses. The two-dimensional measurement model using double-scored items that was

reported by the creators of the RFQ (Badoud et al., 2015; Fonagy et al., 2016) is not taken

into account for the main results due to the methodological problems associated with the

scoring procedure that are described above. However, we report the results based on the

original model and the creators’ recommendations in the supplement. For this analysis, we

considered the following measurement models: (1) a unidimensional CFA; (2) a two-

dimensional CFA with cross-loadings that follow the scoring procedure of RFQ_C and

RFQ_U as proposed in the original publication; and (3) a two-dimensional EFA with oblique

factor rotation (quartimin). In all CFA models, the correlation between the residual variances

of Items 3 and 4 was freely estimated since these items have a large overlap in terms of

semantic content and wording. As two measurement occasions were available in Sample 1

(i.e., at admission and discharge, respectively), we specified repeated measures CFA models

with equality constraints for loadings, thresholds, intercepts, latent covariance, and residual

covariances (Liu et al., 2017). For the repeated measures CFA, we dealt with missing data by

means of pairwise deletion.
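
For illustration, a minimal lavaan sketch of the unidimensional model for a single measurement occasion might look as follows; the data frame and item names are hypothetical, and the repeated measures constraints described above are omitted for brevity.

library(lavaan)
# Unidimensional CFA with ordinal indicators (Item 7 reversed as rfq7r) and the
# a priori residual correlation between Items 3 and 4
model_uni <- '
  rf =~ rfq1 + rfq2 + rfq3 + rfq4 + rfq5 + rfq6 + rfq7r + rfq8
  rfq3 ~~ rfq4
'
fit_uni <- cfa(model_uni, data = dat, estimator = "WLSMV",
               ordered = c("rfq1", "rfq2", "rfq3", "rfq4",
                           "rfq5", "rfq6", "rfq7r", "rfq8"))
fitMeasures(fit_uni, c("cfi.scaled", "rmsea.scaled", "srmr"))
standardizedSolution(fit_uni)  # fully standardized loadings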

U-Shaped Associations with Psychopathology. We investigated the shape of

associations between the RFQ scale scores and various measures of psychopathology

spanning symptoms of mental illness and maladaptive characteristics. More specifically,

given the conceptualization of the RFQ, one would expect U-shaped associations with

maladaptive characteristics (i.e., signs and symptoms of psychopathology) and inverse U-

shaped associations with adaptive characteristics (e.g., subjective well-being), respectively. In

order to test the hypothesis that hypo- and hypermentalizing may delineate extreme

maladaptive ends of a unidimensional continuum (i.e., very low and very high values on the

manifest score), we examined U-shaped and inverse U-shaped associations between the RFQ

score and measures of symptomatic distress (i.e., general symptomatic distress: BSI; somatic

symptoms: PHQ-15; depressive symptoms vs. well-being: PHQ-9, WHO-5), personality

dysfunction (i.e., IPO-16, OPD-SQS, PID-5-BF), and interpersonal distress (i.e., IIP-32). To

this end, lowess smoothing curves were inspected and regression models were estimated in

which a quadratic term of the RFQ score was added for predicting criterion variables. It

should be noted that a significant effect of the quadratic predictor is not sufficiently indicative

of the presence of two maladaptive poles, as this would also be observed, for example, in the presence of floor or ceiling effects. More specifically, to form a U-shape, the predicted values for a criterion should also differ between medium and extreme scores on the RFQ. We therefore further used the two-lines test as a more rigorous method (Simonsohn,

2018) that estimates two regression lines, one before and one after a break-point in the

distribution of a predictor variable, in order to detect a change in the sign of the regression

slope.
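
As a minimal R sketch of these checks for a single criterion (hypothetical variable names rfq6_score and bsi), where the median is used as the break point purely for illustration rather than the algorithmically chosen break point of the two-lines test (Simonsohn, 2018):

# Quadratic regression: add a squared term of the RFQ score when predicting a criterion
fit_quad <- lm(bsi ~ rfq6_score + I(rfq6_score^2), data = dat)
summary(fit_quad)
# Visual check: lowess smoother over the scatterplot
plot(dat$rfq6_score, dat$bsi)
lines(lowess(dat$rfq6_score, dat$bsi))
# Two-lines logic (simplified): a genuine U-shape implies slopes of opposite sign
# before and after the break point
bp <- median(dat$rfq6_score)
coef(lm(bsi ~ rfq6_score, data = subset(dat, rfq6_score <= bp)))["rfq6_score"]
coef(lm(bsi ~ rfq6_score, data = subset(dat, rfq6_score > bp)))["rfq6_score"]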

Specific and Latent Associations with Personality Pathology. We investigated the

differential associations between the RFQ and indicators of personality pathology (i.e., IPO-

16, OPD-SQS) and various dimensions of symptomatic distress (i.e., BSI, PHQ-15, PHQ-9,

IIP-32, WHO-5), respectively. To this end, we compared the magnitude of their correlation

coefficients in Sample 1. Additionally, we estimated the empirical overlap between the RFQ

and indicators of personality pathology (i.e., PID-5-BF, IPO-16, OPD-SQS) in both samples

using a bifactor exploratory structural equation modeling approach (i.e., bifactor ESEM). In a

bifactor ESEM (e.g., Morin et al., 2020), a criterion (e.g., RFQ) is regressed on the

orthogonal general and specific factors of an exploratory bifactor measurement model

reflecting a multidimensional construct (e.g., PID-5-BF). This approach has two advantages

for this analysis. First, the use of latent variable modeling partitions reliable variance and

measurement error and thus estimates the disattenuated associations among constructs.

Second, bifactor models with uncorrelated general and specific factors allow for a clear

partitioning of variance in the presence of multidimensionality and a strong general factor

(Reise, 2012), as is the case for the PID-5-BF, IPO-16, and OPD-SQS (Zimmermann et al.,

2020). Further details and explanations about the bifactor ESEM approach used here can be

found in Note S3 of the supplement.

Results

Factor Structure. All estimated model parameters are depicted in Figure 2 per

sample. In Sample 1, the unidimensional repeated measures CFA model showed good model

fit, χ²(139) = 353.0, p < .001, CFI = .98, RMSEA = .046, 90% CI [.041; .052], SRMR = .06.

Factor loadings were acceptable (λ ≥ .49), but Item 7 showed a weak loading on the latent

factor (λ = .34). The a priori specified residual correlation between Items 3 and 4 was

estimated at .60. Heywood cases occurred for the two-dimensional CFA model (i.e., factor

correlation > 1) and the two-dimensional EFA model (i.e., standardized factor loading > 1 for

Item 4). In Sample 2, the unidimensional CFA model provided a good fit to the data, χ²(19) =

91.8, p < .001, CFI = .98, RMSEA = .08, 90% CI [.07; .10], SRMR = .04. Factor loadings

were acceptable (λ ≥ .51), but Item 7 again exhibited the weakest loading on the latent factor

(λ = .45). The a priori specified residual correlation between Items 3 and 4 was estimated at

.66. The two-dimensional CFA model did not reach convergence. The two-dimensional EFA

model again showed a Heywood case (i.e., standardized loading > 1 for Item 4).

Taken together, we did not identify the two proposed dimensions of RFQ_C and

RFQ_U in the estimated solutions, and we never identified more than one meaningful factor

in general, suggesting that the RFQ essentially captures a unidimensional construct. Results

for the originally proposed two-dimensional CFA model that uses double-scoring of RFQ

items are reported in the supplement (see Figures S1 and S2). Note that we found similar

parameters and model fit for the original specifications based on double-scored items as

found in previous validation studies (Badoud et al., 2015; Fonagy et al., 2016). We explained

above why these models and their respective fit statistics should not be interpreted substantively, as they are based on questionable assumptions.

The internal consistency of the RFQ factor from the unidimensional solution was

estimated at ω = .79/.81 (Sample 1; admission/discharge) and ω = .82 (Sample 2),

respectively. In the following, we omitted Items 4 and 7 in order to improve the internal

consistency of the scale (Spitzer et al., 2021). This decision was based on the observation that Item 7 (“I always know what I feel”) tended to have low factor loadings, whereas Item 3

(“When I get angry, I say things without really knowing why I am saying them”) and Item 4

(“When I get angry, I say things I regret later”) overlap strongly with respect to their content.

The removal of these two items resulted in a 6-item scale with internal consistency of ω = .82

(Sample 1) and ω = .83 (Sample 2). Subsequent results refer to the 6-item version of the RFQ

(i.e., RFQ-6) with low scores supposedly reflecting certainty about mental states and high

scores reflecting uncertainty about mental states. 2 This directionality was chosen because all items of the RFQ-6 are geared towards an endorsement of uncertainty in their original format.

2 The total scores of the RFQ-8 and the reduced RFQ-6 were correlated at .97 (Sample 1) and at .98 (Sample 2),
respectively. It should be emphasized that retaining items 4 and 7 produced highly similar results and equivalent
conclusions for all of the presented analyses.

Associations with Psychopathology. We did not find any evidence for (inverse) U-

shaped relationships between the RFQ-6 and criteria indicative of symptom distress,

personality pathology, or well-being. Although a significant quadratic term was found for

OPD-SQS in Sample 1 and for PID-5-BF detachment in Sample 2, the associations were not

U-shaped but merely indicated a ceiling effect for higher values of the RFQ-6 (see

supplement Figures S3 and S4). The two-lines test did not indicate any U-shaped statistical

associations (see Figures S5 and S6). For the unidimensional RFQ-8, U-shaped associations

were absent as well.

By contrast, the RFQ-6 exhibited substantial linear associations with measures of

psychopathology (see Tables 1 and 2). In Sample 1, the strongest correlations were found

between the RFQ-6 and measures of personality pathology, including the IPO-16 (r = .72)

and the OPD-SQS (r = .65), which were also significantly larger (p < .001, respectively) than

the correlations between the RFQ-6 and other measures, for example as compared to the BSI

(r = .54) and the IIP (r = .54). The bivariate associations between the original scales based on

double-scoring of the RFQ (i.e., RFQ_C and RFQ_U) as well as the RFQ-8 and criterion

measures are reported in the supplement (see Tables S1 and S2).

In Sample 1, the bifactor ESEM (see Panel A of Figure 3) had acceptable fit, χ²(373)

= 1764.83, p < .001, CFI = .94, RMSEA = .07, 90% CI [.06; .07], SRMR = .04. All items of

the IPO-16 and OPD-SQS loaded significantly on a general factor of personality pathology

with standardized loadings ranging from λ = .32 to λ = .71. For the specific factors, target

loadings were in the expected direction (i.e., positive factor loadings) and most of them had

values of λ > .30. (Absolute) non-target loadings were consistently < .30. The standardized

regression coefficient of the general factor was β = .83 (p < .001), explaining 69% of variance

in the RFQ factor. The specific factors of OPD relationship (β = -.17, p < .001), OPD contact

(β = .08, p = .005), OPD self-perception (β = .14, p < .001), IPO reality testing (β = .09, p =

.001), IPO identity diffusion (β = .06, p = .033), and IPO primitive defenses (β = .19, p <

.001) incrementally explained a combined 10% of variance in the RFQ factor. The

unexplained variance in the latent RFQ factor thus amounted to 21%. For the (highly similar)

results using RFQ-8, see Panel A of Figure S7.

In Sample 2, the bifactor ESEM (see Panel B of Figure 3) had good fit, χ²(318) =

743.0, p < .001, CFI = .97, RMSEA = .05, 90% CI [.04; .05], SRMR = .04. All items of the

PID-5-BF loaded significantly on a general factor of personality pathology with standardized

loadings ranging from λ = .31 to λ = .75. For the specific factors, target loadings were in the

expected direction (i.e., positive factor loadings) and most of them had values of λ > .30.

(Absolute) non-target loadings were consistently < .30. The standardized regression

coefficient of the general factor was β = .68 (p < .001), explaining 47% of variance in the

RFQ factor. The specific factors of PID-5-BF negative affect (β = .31, p < .001), PID-5-BF

detachment (β = .09, p = .022), PID-5-BF disinhibition (β = .08, p = .042), and PID-5-BF

psychoticism (β = .32, p < .001) incrementally explained a combined 21% of variance in the

RFQ factor. The unexplained variance in the latent RFQ factor thus amounted to 32%. See

Panel B of Figure S7 for the (highly similar) results using all eight RFQ items.

Discussion

Regarding the factor structure of the RFQ, we found evidence for a unidimensional

construct in a large clinical sample (Sample 1) and a large sample of young adults (Sample

2), thereby corroborating initial findings (Spitzer et al., 2021). We have argued that using a

unidimensional approach in combination with non-linear statistical modeling is consistent

with mentalizing theory and allows for conceptualizing hypo- and hypermentalizing as two

maladaptive poles of a continuum. We tested this notion by means of quadratic regression,

lowess curves, and the two-lines test. Across a broad range of criterion variables, however,

we found no evidence that the RFQ assesses a maladaptive form of having too much certainty

about mental states (i.e., hypermentalizing). Only the uncertainty pole of the RFQ was

associated with poorer mental health. Finally, in line with theoretical expectations with regard

to the mentalizing construct (e.g., Bateman & Fonagy, 2019), the RFQ was more strongly

related to indicators of personality dysfunction than to symptomatic distress.

More specifically, we found that the variance of the RFQ primarily reflected broad

indicators of self-reported personality dysfunction in both samples (i.e., clinical and non-

clinical) using diverse measures (i.e., PID-5-BF, IPO-16, OPD-SQS), whereas comparatively

little variance was unique to the RFQ. On the one hand, this may suggest that the constructs

of mentalizing and personality pathology, albeit conceptually separate, might be so greatly

intertwined that they cannot clearly be distinguished empirically (at least using self-reports).

On the other hand, in light of the observation that various items of the RFQ may tap into

related maladaptive dispositions (e.g., negative urgency, emotional lability, or impulsivity),

the large overlap between the RFQ and dimensions of personality pathology may also signal

caution. In fact, this finding could point to potential problems with respect to the discriminant

validity of the measure. Consistent with this notion, the existing research literature also

provided mixed results for the convergent validity of the RFQ with respect to mentalizing and

related constructs. In the initial validation studies, RFQ_U and RFQ_C exhibited strong

correlations with alexithymia (comparable to those for indicators of personality pathology) but substantially smaller correlations with cognitive empathy and perspective-taking (Badoud et al.,

2015; Fonagy et al., 2016; Morandotti et al., 2018). This might be concerning given that

constructs such as cognitive empathy and perspective-taking are very similar to mentalizing

by definition (e.g., Ickes, 1993), except that the former constructs solely pertain to

understanding others’ mental states. In general, additional examinations of the convergence

of the RFQ with alternative self-report measures of mentalizing are needed.

Study 2

To address the question of whether validity issues of the RFQ contribute to its strong

empirical overlap with indicators of personality pathology, we conducted a second study that

was fully preregistered (see https://osf.io/qr38t). We hypothesized that the item content of the

RFQ reflects both mentalizing and possible consequences of impaired mentalizing, namely,

emotional lability and impulsivity. If the RFQ conflated these constructs, this could

artificially inflate associations between mentalizing as operationalized by the RFQ and

indicators of personality dysfunction, thereby impeding the interpretability of RFQ scores and

the measure’s utility for theory testing. Using commonality analysis (Nimon et al., 2008), we

tested whether and to what extent the associations between the RFQ and indicators of

personality dysfunction are driven by content overlap rather than reflecting the true

association between constructs. Furthermore, we investigated item-level correlations of the

RFQ items with convergent and discriminant measures to examine their nomological

consistency (Thielmann & Hilbig, 2019). For example, nomological inconsistency would be

indicated by some RFQ items correlating more strongly with impulsivity and other RFQ

items correlating more strongly with other measures of mentalizing. Thus, Study 2 included

alternative self-report measures of mentalizing, broad measures of personality pathology, and

assessments of potential confounders, namely, emotional lability and impulsivity. With

regard to alternative measures of mentalizing, we deliberately selected questionnaires that

pertain to the core construct of mentalizing, that is, interpreting the mental states of the self

and others.

Method

Sample 3. We recruited participants from the United States online via the panel

provider Prolific. Data quality was ensured by a series of attention and validity checks.

Participants received minimum wage as compensation. There were no missing data as

individuals were not able to proceed without answering each item. We collected data from N

= 862 participants based on an a priori power analysis (see preregistration). Participants (47% female) ranged in age from 18 to 75 (M = 34.9, SD = 11.7). For more details about

exclusion criteria and sample characteristics, see Note S4.

Measures. In the study, all questionnaires were presented in random order. The RFQ

(Fonagy et al., 2016) and the IPO-16 (Zimmermann et al., 2013) were administered again in

their respective English versions. Based on our findings from Study 1, we used a

unidimensional mean score of the RFQ such that high values reflected uncertainty about

mental states. The analyses were performed for both the RFQ-8 and the RFQ-6. The internal

consistencies of the RFQ-6 (ω = .87), RFQ-8 (ω = .87), and the IPO-16 (α = .90) were good.

Level of Personality Functioning Scale - Brief Form 2.0 (LPFS-BF). The LPFS-BF

(Weekers et al., 2019) is a 12-item self-report measure assessing impairments in the domains

of self-functioning (6 items) and interpersonal functioning (6 items). The items (e.g., “I often

make unrealistic demands on myself”) are answered on a 4-point scale ranging from

completely untrue (1) to completely true (4). High scores on the respective scales indicate

self-dysfunction or interpersonal dysfunction. Internal consistency of the LPFS-BF total score

was estimated at α = .90.

Certainty About Mental States Questionnaire (CAMSQ). The CAMSQ (Müller et al.,

2021) is a 20-item self-report measure of mentalizing that assesses the perceived certainty

associated with making inferences about the mental states of the self (i.e., Self-Certainty) and

others (i.e., Other-Certainty). The items capture affective, cognitive, and motivational content

(e.g., “I understand my feelings”, “I know when other people are hiding their thoughts”) and

are answered on a 7-point frequency scale ranging from never (1) to always (7). High scores

reflect high levels of certainty. Internal consistencies of CAMSQ Self-Certainty (α = .93) and

CAMSQ Other-Certainty (α = .92) were high.



Empathy Quotient (EQ). The EQ (Baron-Cohen & Wheelwright, 2004) is a 40-item

self-report measure of empathy. The 9-item Cognitive Empathy scale was used as a measure

of mentalizing others. Items (e.g., “I am good at predicting how someone will feel”) are rated

on a 4-point scale ranging from strongly disagree (1) to strongly agree (4). Internal

consistency of the EQ Cognitive Empathy scale was estimated at α = .90.

Self-Reflection and Insight Scale (SRIS). The SRIS (Grant et al., 2002) is a 20-item

self-report measure. The 8-item Self-Insight scale of the SRIS was used as a measure of

mentalizing oneself. Items (e.g., “I usually know why I feel the way I do”) are rated on a 6-

point scale ranging from strongly disagree (1) to strongly agree (6). Internal consistency of

the SRIS Self-Insight scale was estimated at α = .88.

UPPS-P Impulsive Behavior Scale (UPPS-P). The UPPS-P (Lynam et al., 2007)

assesses five impulsive personality traits with 59 items. We used the 12-item Negative

Urgency scale that assesses impulsivity in terms of acting rashly under the influence of

negative emotions. Items (e.g., “When I am upset, I often act without thinking”) are endorsed

or rejected on a 4-point scale ranging from disagree strongly (1) to agree strongly (4). Internal

consistency of the UPPS-P Negative Urgency scale was estimated at α = .93.

Difficulties in Emotion Regulation Scale (DERS). The DERS (Gratz & Roemer,

2004) is a 36-item self-report measure that assesses various aspects of emotional

dysregulation. We used the 6-item Impulse Control Difficulties subscale to assess

impulsivity. Items (e.g., “When I’m upset, I become out of control”) are rated on a 5-point

scale ranging from almost never (1) to almost always (5) indicating the frequency of

experiencing the described behavior. Internal consistency of the DERS Impulse Control

Difficulties scale was α = .90.

Personality Inventory for DSM-5 (PID-5). The PID-5 (APA, 2013) assesses

personality pathology according to the AMPD with 220 items. The 7-item Emotional Lability

facet scale was used. Items (e.g., “I am a highly emotional person”) are answered on a 4-point

scale ranging from very false or often false (0) to very true or often true (3). Internal

consistency of the PID-5 Emotional Lability scale was α = .91.

Personality Assessment Inventory – Borderline Features (PAI-BOR). The PAI-BOR

(Morey, 2004) assesses traits associated with borderline personality disorder using 24 items.

We used the six items of the Affective Instability scale to assess emotional lability. Items

(e.g., “My mood can shift quite suddenly”) are rated on a 4-point scale ranging from false,

not at all true (0) to very true (3). Internal consistency of the PAI-BOR Affective Instability

scale was α = .80.

Statistical Analysis. To test the hypothesis of whether the associations between the

RFQ and indicators of personality pathology are driven by content overlap, we conducted a

commonality analysis (Nimon et al., 2008) that estimates the common and unique

contributions of each predictor in predicting a criterion. Commonality analysis involves

estimating a series of multiple regression models considering all possible combinations of

predictors. Specifically, the RFQ mean score is regressed on a broad measure of personality

dysfunction, other measures of mentalizing the self and others, and measures of emotional

lability and impulsivity. We expected that the effect of personality dysfunction on the RFQ

mean score can be partly attributed to variation that both measures share with mentalizing

impairment, while another significant part can be attributed to variation shared with

emotional lability and impulsivity alone (i.e., variation that is non-overlapping with other

measures of mentalizing). For example, the manifest correlation between the RFQ mean

score and indicators of personality dysfunction amounted to an r of around .65 in Study 1,

corresponding to a shared variance of R² = .42. Using commonality analysis, we decomposed

the variance explained in the RFQ by the respective measure of personality dysfunction into

(a) variance that is shared with the respective measures of impulsivity and emotional lability

but not with the respective measures of mentalizing the self and others, (b) variance that is

shared with the respective measures of mentalizing the self and others, and (c) variance

explained uniquely by the respective measure of personality dysfunction. We performed a

permutation test using 5000 random permutations of the dependent variable to compute p-

values for testing the null hypothesis that part (a) equals zero (i.e., that no part of the variance

explained is due to variance shared with emotional lability and impulsivity alone). We tested

this hypothesis with two different sets of measures to facilitate the generalizability of the

results (irrespective of specific measures; see preregistration). To control for performing this

test twice for each set of variables, we considered a p-value less than .025 to be indicative of

statistical significance. Furthermore, we computed bias-corrected bootstrap confidence

intervals for evaluating the precision of the estimate using 5000 bootstrap resamples. The

commonality analysis was conducted via the R package yhat (Nimon et al., 2021).
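
The following R sketch illustrates the underlying logic with R² differences between nested regression models (not the full all-subsets decomposition implemented in yhat); all variable names are hypothetical placeholders for the measures described above.

# Helper: R-squared of a linear model predicting the RFQ mean score
r2 <- function(f) summary(lm(f, data = dat))$r.squared
# Full model: personality dysfunction (lpfs), mentalizing measures (sris, camsq_self,
# camsq_other, eq_cog), and emotional lability/impulsivity (upps_nu, pid5_el)
r2_full    <- r2(rfq6_score ~ lpfs + sris + camsq_self + camsq_other + eq_cog + upps_nu + pid5_el)
r2_no_lpfs <- r2(rfq6_score ~        sris + camsq_self + camsq_other + eq_cog + upps_nu + pid5_el)
# Variance in the RFQ uniquely explained by the personality dysfunction measure
unique_lpfs <- r2_full - r2_no_lpfs
# Variance the dysfunction measure shares with the remaining predictors; the full
# commonality analysis further partitions this common part across all predictor subsets
common_lpfs <- r2(rfq6_score ~ lpfs) - unique_lpfs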

We also examined the pattern of associations between the eight RFQ items and

personality dysfunction, emotional lability, impulsivity, and other measures of mentalizing to

evaluate whether these correlations differ in magnitude. Differential patterns of association

would indicate that the eight RFQ items are not nomologically consistent (Thielmann &

Hilbig, 2019). For the sake of completeness, we again tested the factor models considered in

Study 1.

Results

We provide open data and a script for reproducing the analyses in R

(https://osf.io/stbz5). As in Study 1, the unidimensional CFA model fit the data well, χ²(19) =

272.7, p < .001, CFI = .98, RMSEA = .125, 90% CI [.112; .138], SRMR = .04, whereas two-

dimensional models did not provide parameter estimates in line with the proposed structure

of RFQ_C and RFQ_U (see Figure 2). Measures that were intended to assess the same

constructs (e.g., CAMSQ Self-Certainty and SRIS Self-Insight) showed the expected high

convergent correlations. The RFQ mean score exhibited similarly strong associations with

broad indicators of personality dysfunction, measures of emotional lability and impulsivity,

and measures of mentalizing the self (see Table 3 and Table S3 for the full correlation

matrices including RFQ-6 and RFQ-8). Rather low correlations were observed between the

RFQ mean score and measures of mentalizing others, indicating that the RFQ primarily

pertains to mentalizing the self.

Following our preregistration, we used commonality analysis to estimate what content

(i.e., mentalizing, emotional lability, impulsivity) accounts for the association between the

RFQ and indicators of personality dysfunction. Using the first set of variables, the association

between the RFQ-6 and LPFS-BF (R2 = .46) was decomposed into variance shared with

measures of mentalizing (ΔR2 = .22, 95% CI [.14; .27], p < .001), variance shared with

measures of emotional lability and impulsivity alone (ΔR2 = .20, 95% CI [.16; .30], p < .001),

and variance uniquely explained by LPFS-BF (ΔR2 = .04, 95% CI [.02; .06], p < .001). In the

second set of variables, the association between the RFQ-6 and IPO-16 (R2 = .45) was

decomposed into variance shared with measures of mentalizing (ΔR2 = .33, 95% CI [.24; .38],

p < .001), variance shared with measures of emotional lability and impulsivity alone (ΔR2 =

.07, 95% CI [.04; .18], p < .001), and variance uniquely explained by IPO-16 (ΔR2 = .05, 95%

CI [.03; .07], p < .001). In light of the difference between the two arbitrarily selected sets of

variables, we then conducted the commonality analysis across all 32 possible variable

combinations to account for influences of specific combinations. On average, the association

between the RFQ-6 and broad indicators of personality dysfunction (mean R2 = .45) was

decomposed into variance shared with measures of mentalizing (mean ΔR2 = .27), variance

shared with measures of emotional lability and impulsivity alone (mean ΔR2 = .14), and

variance uniquely explained by measures of personality dysfunction (mean ΔR2 = .04). Thus,

59% of the observed associations between RFQ-6 and indicators of personality dysfunction

were due to variance shared with other measures of mentalizing, whereas 31% were due to

variance shared with measures of emotional lability and impulsivity alone, and 10% were

unique to measures of personality dysfunction. Very similar results were obtained using the

RFQ-8 (see Note S5).
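
One generic way to run the decomposition across all combinations is to enumerate one candidate measure per construct slot and loop over the resulting grid. The sketch below assumes five slots with two candidate measures each (2^5 = 32), which matches the number of combinations reported above; the actual composition of the candidate sets is specified in the preregistration, the column labels are hypothetical, and decompose_rfq() stands in for the commonality-analysis step sketched earlier.

# Hypothetical column labels; decompose_rfq() is a placeholder for the commonality-analysis step.
grid <- expand.grid(
  pd          = c("lpfs", "ipo"),                  # personality dysfunction
  ment_self   = c("camsq_self", "sris_insight"),   # mentalizing the self
  ment_other  = c("camsq_other", "eq_cognitive"),  # mentalizing others
  lability    = c("pid5_lability", "pai_affect"),  # emotional lability
  impulsivity = c("upps_urgency", "ders_impulse"), # impulsivity
  stringsAsFactors = FALSE
)
nrow(grid)  # 32 combinations

results <- lapply(seq_len(nrow(grid)), function(i) {
  decompose_rfq(dat, dv = "rfq6", vars = unlist(grid[i, ]))
})

# Average the three summed components (shared with mentalizing, shared with
# emotional lability/impulsivity alone, unique to personality dysfunction).
colMeans(do.call(rbind, results))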

The item-level analysis demonstrated nomological inconsistencies between the eight

items of the RFQ in terms of significant differences in their correlational patterns (see Table

4). Generally, the item-level associations were most pronounced for UPPS-P Negative

Urgency as a measure of impulsivity and for SRIS Self-Insight as a measure of mentalizing

the self. Thus, we specifically used these two scales for testing differences in the magnitude

of correlations, although the correlational patterns were consistent across the other measures

as well. RFQ Items 3, 4, and 8 correlated significantly more strongly (p < .001) with Negative

Urgency than with Self-Insight. In contrast, Items 1, 2, 5, and 7 correlated significantly more

strongly with Self-Insight than with Negative Urgency (p < .001).

Discussion

In Study 2, we aimed to test the hypothesis that the items of the RFQ conflate content

associated with mentalizing and content associated with assumed consequences of

mentalizing impairment (i.e., emotional lability and impulsivity). The preregistered analyses

provide evidence for this hypothesis and suggest that associations between the RFQ and

measures of personality dysfunction may be inflated by approximately 30% because they

exploit common variance with aspects of emotional lability and impulsivity. These results

point to limitations of the RFQ with regard to its convergent and discriminant validity.

Specifically, the item-level analysis indicated nomological inconsistencies (Thielmann &

Hilbig, 2019) that suggest RFQ Item 3 (“When I get angry I say things without really

knowing why I am saying them”), Item 4 (“When I get angry, I say things that I later regret”),

and Item 8 (“Strong feelings often cloud my thinking”) may be the reason for the conflation

because they converge with negative urgency rather than mentalizing. Study 2 further

illustrates the relevance of assessing mentalizing the self and mentalizing others separately,

as these two dimensions provide unique information. Considering that the RFQ only contains

one item that is clearly geared towards understanding others’ mental states (i.e., Item 1:

“People’s thoughts are a mystery to me”), the RFQ cannot measure mentalizing others with

sufficient fidelity.

General Discussion

The RFQ has been proposed as a short self-report measure of reflective functioning.

In this article we have elaborated our concerns with respect to the instrument’s validity and

the methodology used in prior studies (e.g., Badoud et al., 2015; Fonagy et al., 2016),

particularly in reference to its item content, scoring procedure, dimensionality, and

associations with psychopathology. Using large clinical and non-clinical samples from

Germany and the US, we augmented the critical discussion with new empirical analyses.

First, our findings suggest that the RFQ assesses a unidimensional construct. Consequently,

we recommend refraining from using the originally proposed scoring procedure to derive RFQ_C

and RFQ_U (Fonagy et al., 2016) and instead relying on a unidimensional score using the

original responses to the RFQ items (such as mean scores on the psychometrically optimized

RFQ-6 or the RFQ-8). Second, we have demonstrated that although the RFQ reflects

mentalizing impairments regarding the self, it also exhibits a substantial confound with

emotional lability and impulsivity due to shared item content. Consequently, when the RFQ is

used to address research questions derived from mentalizing theory, observed associations with

other constructs may partly reflect this confound, impeding inferences about mentalizing.

Third, consistent with previous accounts (e.g., de Meulemeester et al., 2018), the present

results further indicate that the RFQ does indeed capture a maladaptive form of having too

little certainty about mental states (i.e., hypomentalizing) but appears to be unable to capture

a maladaptive form of having too much certainty about mental states (i.e., hypermentalizing)

as its certainty pole does not exhibit positive associations with negative outcomes. One

reason for this could be that the RFQ does not measure variation in hypermentalizing with

sufficient reliability on the certainty pole of the continuum (e.g., because its items are

formulated with reference to uncertainty).

It should be noted that the present empirical findings are limited to the English and the

German versions of the RFQ and might not necessarily generalize to other languages.

However, our concerns about the item content and the scoring procedure apply to the various

translations as well. Moreover, our conclusions cannot be generalized to the long forms of the

RFQ (e.g., Euler et al., 2021), although it should be noted that, to date, no validation studies

have been published for these forms and they have seldom been used in previous research.

Finally, the current study is subject to the limitation that it did not focus on the primary target

population that is assumed to show particularly severe impairments in mentalizing capacity,

that is, individuals with borderline personality pathology (Luyten et al., 2020).

Mentalizing is arguably an important psychological construct with great relevance for

psychopathology, personality, and psychotherapy research (APA, 2013; Bender et al., 2011).

This calls for a valid and economical self-report assessment of the construct. Although the RFQ

has been rather broadly accepted by the field and is used in a growing body of research, we

have argued that the validity evidence is not compelling. First, in our view, self-report

assessments of mentalizing should adhere more closely to the specific core of the construct

(i.e., inferring mental states) rather than emphasizing hypothetical consequences of impaired

mentalizing (e.g., impulsivity); this would avoid conflating distinct constructs. Second,

both hypo- and hypermentalizing should ideally be captured by measures of mentalizing.

Indeed, demonstrating that these two maladaptive variants of mentalizing impairment can be

assessed by self-report questionnaires is an important empirical test in itself that would



increase confidence in construct validity. For example, the CAMSQ (Müller et al., 2021) was

recently introduced as a measure that focuses on the core definition of inferring mental states

of the self and others provided by Fonagy and colleagues (2016). Initial results for the

CAMSQ suggest that it captures maladaptive levels of both too little or too much certainty

about mental states that could be interpreted as forms of hypo- and hypermentalizing.

Conclusion

The RFQ is regularly used to study the concept of mentalizing. Herein, we have

outlined critical considerations regarding the validity of the RFQ and provided empirical

evidence to support the critique. Findings indicate that the RFQ is a unidimensional measure

that reflects hypomentalizing pertaining to the self but is also conflated with content related

to emotional lability and impulsivity. Thus, researchers and mental health professionals alike

should be rather cautious in using the RFQ for theory testing and individual assessment.

References

American Psychiatric Association. (2013). Diagnostic and statistical manual of mental

disorders (DSM-5®). American Psychiatric Pub.

Back, S., Zettl, M., Bertsch, K., & Taubner, S. (2020). Persönlichkeitsniveau, maladaptive

Traits und Kindheitstraumata. Psychotherapeut, 65, 374–382.

https://doi.org/10.1007/s00278-020-00445-7

Badoud, D., Luyten, P., Fonseca-Pedrero, E., Eliez, S., Fonagy, P., & Debbané, M. (2015).

The French version of the Reflective Functioning Questionnaire: Validity data for

adolescents and adults and its association with non-suicidal self-injury. PLoS ONE, 10,

e0145892. https://doi.org/10.1371/journal.pone.0145892

Badoud, D., Prada, P., Nicastro, R., Germond, C., Luyten, P., Perroud, N., & Debbané, M.

(2018). Attachment and reflective functioning in women with borderline personality

disorder. Journal of Personality Disorders, 32, 17–30.

https://doi.org/10.1521/pedi_2017_31_283

Baron-Cohen, S., & Wheelwright, S. (2004). The empathy quotient: An investigation of

adults with Asperger syndrome or high functioning autism, and normal sex

differences. Journal of Autism and Developmental Disorders, 34, 163–175.

https://doi.org/10.1023/b:jadd.0000022607.19833.00

Bateman, A., & Fonagy, P. (2016). Mentalization-based treatment for personality disorders:

A practical guide. Oxford, UK: Oxford Univ. Press.

Bateman, A. W., & Fonagy, P. (Eds.). (2019). Handbook of mentalizing in mental health

practice. American Psychiatric Publishing, Inc.

Bender, D. S., Morey, L. C., & Skodol, A. E. (2011). Toward a model for assessing level of

personality functioning in DSM-5, part I: A review of theory and methods. Journal of

Personality Assessment, 93, 332–346. https://doi.org/10.1080/00223891.2011.583808



Carver, C. S., Johnson, S. L., & Timpano, K. R. (2017). Toward a functional view of the p

factor in psychopathology. Clinical Psychological Science, 5, 880–889.

https://doi.org/10.1177/2167702617710037

Cyders, M. A., & Smith, G. T. (2008). Emotion-based dispositions to rash action: Positive

and negative urgency. Psychological Bulletin, 134, 807–828.

https://doi.org/10.1037/a0013341

de Meulemeester, C., Lowyck, B., Vermote, R., Verhaest, Y., & Luyten, P. (2017).

Mentalizing and interpersonal problems in borderline personality disorder: The

mediating role of identity diffusion. Psychiatry Research, 258, 141–144.

https://doi.org/10.1016/j.psychres.2017.09.061

de Meulemeester, C., Vansteelandt, K., Luyten, P., & Lowyck, B. (2018). Mentalizing as a

mechanism of change in the treatment of patients with borderline personality disorder:

A parallel process growth modeling approach. Personality Disorders: Theory,

Research, and Treatment, 9, 22–29. https://doi.org/10.1037/per0000256

Ehrenthal, J. C., Dinger, U., Schauenburg, H., Horsch, L., Dahlbender, R. W., & Gierk, B.

(2015). Entwicklung einer Zwölf-Item-Version des OPD-Strukturfragebogens (OPD-

SFK). Zeitschrift für Psychosomatische Medizin und Psychotherapie, 61, 262–274.

https://doi.org/10.13109/zptm.2015.61.3.262

Euler, S., Nolte, T., Constantinou, M., Griem, J., Montague, P. R., Fonagy, P., & Personality

and Mood Disorders Research Network. (2021). Interpersonal problems in borderline

personality disorder: Associations with mentalizing, emotion regulation, and

impulsiveness. Journal of Personality Disorders, 35, 177–193.

https://doi.org/10.1521/pedi_2019_33_427

Flora, D. B. (2020). Your coefficient alpha is probably wrong, but which coefficient omega is

right? A tutorial on using R to obtain better reliability estimates. Advances in Methods



and Practices in Psychological Science, 3, 484–501.

https://doi.org/10.1177/2515245920951747

Fonagy, P., Luyten, P., Allison, E., & Campbell, C. (2017). What we have changed our minds

about: Part 1. Borderline personality disorder as a limitation of resilience. Borderline

Personality Disorder and Emotion Dysregulation, 4, 1–11.

https://doi.org/10.1186/s40479-017-0061-9

Fonagy, P., Luyten, P., Moulton-Perkins, A., Lee, Y.-W., Warren, F., Howard, S., Ghinai, R.,

Fearon, P., & Lowyck, B. (2016). Development and validation of a self-report

measure of mentalizing: The Reflective Functioning Questionnaire. PLoS ONE, 11,

e0158678. https://doi.org/10.1371/journal.pone.0158678

Franke, H. (2000). The Brief Symptom Inventory – Deutsche Version. Manual. Göttingen:

Beltz.

Grant, A. M., Franklin, J., & Langford, P. (2002). The self-reflection and insight scale: A

new measure of private self-consciousness. Social Behavior and Personality: An

International Journal, 30, 821–835. https://doi.org/10.2224/sbp.2002.30.8.821

Gratz, K. L., & Roemer, L. (2004). Multidimensional assessment of emotion regulation and

dysregulation: Development, factor structure, and initial validation of the difficulties

in emotion regulation scale. Journal of Psychopathology and Behavioral Assessment,

26, 41–54. https://doi.org/10.1023/B:JOBA.0000007455.08539.94

Horowitz, L. M., Alden, L. E., Kordy, H., & Strauß, B. (2000). Inventar zur Erfassung

interpersonaler Probleme: Deutsche Version; IIP-D. Beltz-Test.

Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure

analysis: Conventional criteria versus new alternatives. Structural Equation

Modeling: A Multidisciplinary Journal, 6, 1–55.

https://doi.org/10.1080/10705519909540118

Huang, Y. L., Fonagy, P., Feigenbaum, J., Montague, P. R., Nolte, T., & Mood Disorder

Research Consortium. (2020). Multidirectional pathways between attachment,

mentalizing, and posttraumatic stress symptomatology in the context of childhood

trauma. Psychopathology, 53, 48–58. https://doi.org/10.1159/000506406

Ickes, W. (1993). Empathic accuracy. Journal of Personality, 61, 587–610.

https://doi.org/10.1111/j.1467-6494.1993.tb00783.x

King, K. M., Feil, M. C., & Halvorson, M. A. (2018). Negative urgency is correlated with the

use of reflexive and disengagement emotion regulation strategies. Clinical

Psychological Science, 6, 822–834. https://doi.org/10.1177/2167702618785619

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief

depression severity measure. Journal of General Internal Medicine, 16, 606–613.

https://doi.org/10.1046/j.1525-1497.2001.016009606.x

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new

measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine,

64, 258–266.

Li, E. T., Carracher, E., & Bird, T. (2020). Linking childhood emotional abuse and adult

depressive symptoms: The role of mentalizing incapacity. Child Abuse & Neglect, 99,

104253. https://doi.org/10.1016/j.chiabu.2019.104253

Liu, Y., Millsap, R. E., West, S. G., Tein, J. Y., Tanaka, R., & Grimm, K. J. (2017). Testing

measurement invariance in longitudinal data with ordered-categorical measures.

Psychological Methods, 22, 486–506.

https://psycnet.apa.org/doi/10.1037/met0000075

Luyten, P., Campbell, C., Allison, E., & Fonagy, P. (2020). The mentalizing approach to

psychopathology: State of the art and future directions. Annual Review of Clinical

Psychology, 16, 297–325. https://doi.org/10.1146/annurev-clinpsy-071919-015355



Lynam, D. R., Smith, G. T., Cyders, M. A., Fischer, S., & Whiteside, S. P. (2007). The

UPPS-P questionnaire measure of five dispositions to rash action. Unpublished

technical report, Purdue University.

Morandotti, N., Brondino, N., Merelli, A., Boldrini, A., De Vidovich, G. Z., Ricciardo, S.,

Abbiati, V., Ambrosi, P., Cavercasi, E., Fonagy, P., & Luyten, P. (2018). The Italian

version of the Reflective Functioning Questionnaire: Validity data for adults and its

association with severity of borderline personality disorder. PLoS ONE, 13, e0206433.

https://doi.org/10.1371/journal.pone.0206433

Morey, L. C. (2004). The Personality Assessment Inventory (PAI). Lawrence Erlbaum

Associates Publishers.

Morin, A. J., Myers, N. D., & Lee, S. (2020). Modern Factor Analytic Techniques: Bifactor

Models, Exploratory Structural Equation Modeling (ESEM), and Bifactor-ESEM.

Handbook of Sport Psychology, 51, 1044–1073.

https://doi.org/10.1002/9781119568124.ch51

Müller, S., Wendt, L. P., & Zimmermann, J. (2021, May 19). Development and Validation of

the Certainty About Mental States Questionnaire (CAMSQ): A Self-Report Measure

of Mentalizing Oneself and Others. https://doi.org/10.31234/osf.io/jtc3s

Muthén, L. K., & Muthén, B. O. (1998-2019). Mplus User’s Guide. 8th Edition. Los Angeles,

CA: Muthén & Muthén.

Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute

commonality coefficients in the multiple regression case: An introduction to the

package and a practical example. Behavior Research Methods, 40, 457–466.

https://doi.org/10.3758/BRM.40.2.457

Nimon, K., Oswald, F., & Roberts, J. K. (2021). yhat: Interpreting Regression Effects. R

package version 2.0-3. https://CRAN.R-project.org/package=yhat



R Core Team. (2020). R: A language and environment for statistical computing. R

Foundation for Statistical Computing, Vienna, Austria.

Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate

Behavioral Research, 47, 667–696. https://doi.org/10.1080/00273171.2012.715555

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of

Statistical Software, 48, 1–36. http://www.jstatsoft.org/v48/i02/

Settles, R. E., Fischer, S., Cyders, M. A., Combs, J. L., Gunn, R. L., & Smith, G. T. (2012).

Negative urgency: A personality predictor of externalizing behavior characterized by

neuroticism, low conscientiousness, and disagreeableness. Journal of Abnormal

Psychology, 121, 160–172. https://doi.org/10.1037/a0024948

Simonsohn, U. (2018). Two lines: A valid alternative to the invalid testing of U-shaped

relationships with quadratic regressions. Advances in Methods and Practices in

Psychological Science, 1, 538–555. https://doi.org/10.1177/2515245918805755

Spitzer, C., Zimmermann, J., Brähler, E., Euler, S., Wendt, L. P., & Müller, S. (2021). Die

deutsche Version des Reflective Functioning Questionnaire (RFQ): Eine

teststatistische Überprüfung in der Allgemeinbevölkerung. Psychotherapie -

Psychosomatik - Medizinische Psychologie, 71, 124–131. https://doi.org/10.1055/a-

1234-6317

Thielmann, I., & Hilbig, B. E. (2019). Nomological consistency: A comprehensive test of the

equivalence of different trait indicators for the same constructs. Journal of

Personality, 87, 715–730. https://doi.org/10.1111/jopy.12428

Weekers, L. C., Hutsebaut, J., & Kamphuis, J. H. (2019). The Level of Personality

Functioning Scale-Brief Form 2.0: Update of a brief instrument for assessing level of

personality functioning. Personality and Mental Health, 13, 3–14.

https://doi.org/10.1002/pmh.1434

World Health Organization. (1998). Wellbeing measures in primary health care / The

Depcare Project. WHO Regional Office for Europe: Copenhagen.

Zimmermann, J., Altenstein, D., Krieger, T., Grosse Holtforth, M., Pretsch, J., Alexopoulos,

J., Spitzer, C., Benecke, C., Krueger, R. F., Markon, K. E., & Leising, D. (2014). The

structure and correlates of self-reported DSM-5 maladaptive personality traits:

Findings from two German-speaking samples. Journal of Personality Disorders, 28,

518–540. https://doi.org/10.1521/pedi_2014_28_130

Zimmermann, J., Benecke, C., Hörz, S., Rentrop, M., Peham, D., Bock, A., Wallner, T.,

Schauenburg, H., Frommer, J., Huber, D., Clarkin, J. F., & Dammann, G. (2013).

Validierung einer deutschsprachigen 16-Item-Version des Inventars der

Persönlichkeitsorganisation (IPO-16). Diagnostica, 59, 3–16.

https://doi.org/10.1026/0012-1924/a000076

Zimmermann, J., Müller, S., Bach, B., Hutsebaut, J., Hummelen, B., & Fischer, F. (2020). A

common metric for self-reported severity of personality disorder. Psychopathology,

53, 161–171. https://doi.org/10.1159/000507377



Table 1
Bivariate Correlations in Sample 1 at Admission

RFQ-6 BSI PHQ-15 PHQ-9 IIP-32 IPO-16 OPD-SQS

BSI .54

PHQ-15 .26 .59

PHQ-9 .41 .76 .47

IIP-32 .54 .63 .32 .51

IPO-16 .72 .59 .33 .46 .60

OPD-SQS .65 .72 .41 .59 .68 .69

WHO-5 -.22 -.51 -.29 -.65 -.35 -.22 -.38

Note. N = 861. All correlations are statistically significant at p < .001.



Table 2
Bivariate Correlations in Sample 2

RFQ-6 Total NEG DET ANT DIS

PID-5-BF Total Score .68

Negative Affectivity (NEG) .58 .73

Detachment (DET) .47 .72 .43

Antagonism (ANT) .32 .65 .29 .29

Disinhibition (DIS) .45 .73 .40 .36 .43

Psychoticism (PSY) .63 .82 .52 .52 .44 .50

Note. N = 566. All correlations are statistically significant at p < .001.



Table 3
Bivariate Correlations in Sample 3

RFQ-6 (1) (2) (3) (4) (5) (6) (7) (8) (9)

(1) LPFS-BF .68

(2) IPO-16 .67 .66

(3) CAMSQ Self-Certainty -.50 -.51 -.38

(4) SRIS Self-Insight -.67 -.60 -.60 .68

(5) CAMSQ Other-Certainty -.25 -.23 -.09 .56 .28

(6) EQ Cognitive Empathy -.26 -.27 -.11 .42 .28 .73

(7) PID-5 Emotional Lability .60 .59 .61 -.36 -.53 -.08 -.08

(8) PAI-BOR Affective Instability .61 .67 .56 -.42 -.53 -.16 -.17 .72

(9) UPPS-P Negative Urgency .71 .63 .59 -.43 -.54 -.18 -.16 .62 .69

(10) DERS Impulse Control Difficulties .63 .65 .61 -.41 -.57 -.13 -.15 .66 .69 .68

Note. N = 862. All correlations are statistically significant at p < .05.



Table 4
Bivariate Correlations Between the Items of the RFQ and Further Measures in Sample 3

RFQ-8

Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8

(1) LPFS-BF .40 .60 .51 .43 .45 .59 .41 .52

(2) IPO-16 .39 .54 .51 .42 .48 .57 .22 .54

(3) CAMSQ Self-Certainty -.27 -.50 -.38 -.29 -.31 -.49 -.58 -.38

(4) SRIS Self-Insight -.41 -.65 -.50 -.39 -.42 -.62 -.51 -.52

(5) CAMSQ Other-Certainty -.31 -.23 -.15 -.17 -.11 -.22 -.30 -.16

(6) EQ Cognitive Empathy -.36 -.25 -.15 -.14 -.11 -.23 -.30 -.12

(7) PID-5 Emotional Lability .26 .47 .48 .40 .42 .49 .24 .56

(8) PAI-BOR Affective Instability .24 .50 .50 .50 .41 .53 .34 .54

(9) UPPS-P Negative Urgency .25 .56 .66 .68 .51 .62 .31 .59

(10) DERS Impulse Control Difficulties .29 .50 .58 .51 .41 .54 .30 .52

Note. N = 862. All correlations are statistically significant at p < .001. The strongest correlation for each RFQ item on a

descriptive level is highlighted in bold.



Figure 1

Empirical Distributions of Item 6 Before and After Applying the Scoring Procedure Outlined

in Fonagy et al. (2016)

Note. Relative frequencies (in %) of responses to Item 6 (“Sometimes I do things without

really knowing why”) in Sample 1. Panel A shows the univariate distribution of raw scores on

the 7-point scale, Panel B shows the univariate distributions of the rescaled scores on the 4-

point scales, and Panel C shows the bivariate distribution of the rescaled scores.
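
To make the rescaling concrete, the following sketch recodes a hypothetical vector item6 of raw 7-point responses (anchors assumed to run from 1 = strongly disagree to 7 = strongly agree) into the two 4-point scores shown in Panels B and C. The thresholds shown here are one plausible implementation of the rescaling and should be checked against the scoring key published by Fonagy et al. (2016).

# item6: hypothetical vector of raw responses on the 7-point scale
rescale_certainty   <- function(x) pmax(4 - x, 0)  # 1,2,3 -> 3,2,1; 4-7 -> 0 (RFQ_C direction)
rescale_uncertainty <- function(x) pmax(x - 4, 0)  # 5,6,7 -> 1,2,3; 1-4 -> 0 (RFQ_U direction)

rfq_c_item6 <- rescale_certainty(item6)
rfq_u_item6 <- rescale_uncertainty(item6)

prop.table(table(rfq_c_item6)) * 100                # Panel B, RFQ_C (relative frequencies in %)
prop.table(table(rfq_u_item6)) * 100                # Panel B, RFQ_U
prop.table(table(rfq_c_item6, rfq_u_item6)) * 100   # Panel C, bivariate distribution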

Figure 2
Standardized Parameter Estimates of the Factor Models in all Samples
[Panels, left to right: Clinical Sample 1 (GER); Non-Clinical Sample 2 (GER); Non-Clinical Sample 3 (US)]

Note. All estimates are standardized. Loadings smaller than .30 are grayed out. In Sample 1,
only the parameter estimates for admission are displayed. 1 = Unidimensional CFA model,
2 = Two-dimensional CFA model, 3 = Two-dimensional EFA model; RFQ_C = Certainty
About Mental States; RFQ_U = Uncertainty About Mental States; η = RFQ factor.

Figure 3
Bifactor Exploratory Structural Equation Model in Sample 1 at Admission (Panel A) and
Sample 2 (Panel B)

Note. All estimates are standardized. Intercepts, thresholds, and non-target loadings are not
displayed. Non-target loadings were all < .30. Target loadings < .30 are indicated by dashed lines.
o1-o12 = OPD-SQS items; i1-i16 = IPO-16 items; r1-r6 = RFQ-6 items; p1-p25 = PID-5-BF items;
OPD RS = Relationship; OPD CT = Conflict; OPD SP = Self-Perception; IPO RT = Reality-
Testing; IPO PD = Primitive Defenses; IPO ID = Identity Diffusion; NEG = Negative Affectivity;
DET = Detachment; ANT = Antagonism; DIS = Disinhibition; PSY = Psychoticism; gPD = General
Personality Pathology.
*p < .05

Online Supplement

Note S1
Further Information About Sample 1

In Sample 1, participants varied with regard to their educational background with 18%
holding a university degree, 30% holding the higher education entrance qualification (i.e., the
highest German school degree; “Abitur”), and 48% reporting 10 years of schooling or less.
Patients were diagnosed using clinical judgements based on the tenth edition of the
International Classification of Diseases (ICD-10; World Health Organization [WHO], 2004).
The most frequent group of diagnoses was that of depressive disorders (93%), followed by
anxiety disorders (42%), personality disorders (39%), substance use disorders (20%), and
somatoform disorders (18%). Personality disorders (PD) included borderline PD (11%),
avoidant PD (11%), dependent PD (2%), and PD not otherwise specified (15%). As is
commonly found, patients exhibited a high level of comorbidity (82% more than one
diagnosis, 54% more than two, 21% more than three, 5% more than four).

Note S2
Further Information About Sample 2

In Sample 2, participants had a rather high level of education with 41% of participants
holding a university degree and an additional 37% holding the “Abitur” certificate.
Participants indicated via self-report whether they were acutely suffering from a mental health
condition. Twenty percent of the participants reported being affected by any mental disorder.
Of these, more than half indicated an affective disorder (52%), followed by borderline
personality disorder (16%) and post-traumatic stress disorder (9%). As the university’s
website might also be a reference source for psychiatric inpatients of the university hospital
that specializes in the treatment of severe trauma, it is possible that such inpatients were
attracted to participate in the study via this route.

Note S3
Bifactor ESEM Model Specification

In an exploratory bifactor measurement model (i.e., bifactor EFA), all indicators should
load on a general factor and, at the same time, on at least one of multiple (i.e., at least two)
specific factors. In contrast to classic bifactor models (i.e., bifactor CFA; Holzinger &
Swineford, 1937), indicators are allowed to load on multiple specific factors. The bifactor
ESEM used here involves a unidimensional confirmatory measurement model for the RFQ and
an exploratory measurement model with orthogonal target rotation for indicators of personality
dysfunction. In Sample 1, the indicators of IPO-16 and OPD-SQS were target rotated such that
the loadings of indicators were targeted towards a general factor reflecting general personality
dysfunction (gPD) and were also targeted towards their respective specific factors in alignment
with the six assumed content domains or scales of the measures (i.e., IPO-16: identity diffusion,
primitive defenses, and reality testing; Zimmermann et al., 2013; OPD-SQS: self-perception,
contact, and relationship; Ehrenthal et al., 2015). In Sample 2, the indicators of PID-5-BF were
target rotated in the same vein towards a general factor (gPD) and the five PID domains as
specific factors (i.e., negative affectivity, antagonism, detachment, disinhibition, psychoticism;
APA, 2013).

American Psychiatric Association (2013). Diagnostic and statistical manual of mental disorders (DSM-5®).
American Psychiatric Pub.
Ehrenthal, J. C., Dinger, U., Schauenburg, H., Horsch, L., Dahlbender, R. W., & Gierk, B. (2015). Entwicklung
einer Zwölf-Item-Version des OPD-Strukturfragebogens (OPD-SFK) [Development of a 12-item version
of the OPD-Structure Questionnaire (OPD-SQS)]. Zeitschrift für Psychosomatische Medizin und
Psychotherapie, 61, 262–274. https://doi.org/10.13109/zptm.2015.61.3.262
Holzinger, K. J., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2, 41–54.
Zimmermann, J., Benecke, C., Hörz, S., Rentrop, M., Peham, D., Bock, A., Wallner, T., Schauenburg, H.,
Frommer, J., Huber, D., Clarkin, J. F., & Dammann, G. (2013). Validierung einer deutschsprachigen 16-
Item-Version des Inventars der Persönlichkeitsorganisation (IPO-16) [Validity of a German 16-item-
version of the Inventory of Personality Organization (IPO-16)]. Diagnostica, 59, 3–16.
https://doi.org/10.1026/0012-1924/a000076

Note S4
Further Information About Sample 3

To identify careless respondents and thereby ensure data quality, we implemented a


number of validity checks into the data collection process. Whereas attention checks are
useful to detect careless responding in general, language fluency tests specifically target
participants who gain access to the study by misrepresenting their language proficiency or place of
residence. We employed two instructed response items that required participants to mark a
specific response option that was clearly stated in the item instruction (e.g., “If you are paying
attention, mark ‘strongly agree’. Otherwise, you will be disqualified”). Thirty-four
participants failed at least one of the two instructed response items and were excluded. In
addition, we employed a language fluency check to ensure that participants met the
participation requirement in terms of language proficiency (i.e., being fluent in English). We
used an open-ended question at the end of the study that tasked participants with providing a
statement on an opinion question. Four participants were excluded for failing to provide a
proper statement.
In Sample 3, participants varied with regard to their educational level (e.g., 64% with a
bachelor’s degree or higher) and occupational status (e.g., 57% employed for wages).
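
A minimal sketch of the exclusion logic described above is given below; the column names attn1, attn2, and fluency_text are hypothetical, and the open-ended fluency responses were in fact screened for a proper statement rather than merely for being non-empty.

# Hypothetical column names; attn1/attn2 hold responses to the two instructed response items.
required_option <- "strongly agree"   # the option stated in the item instruction

keep <- dat$attn1 == required_option &
  dat$attn2 == required_option &
  nchar(trimws(dat$fluency_text)) > 0   # crude placeholder for the manual fluency screening

dat_clean <- dat[keep, ]
sum(!keep)   # number of excluded participants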

Note S5
Commonality Analysis Using the RFQ-8

Using the first set of variables, the association between the RFQ-8 and LPFS-BF (R2 =
.48) was decomposed into variance shared with measures of mentalizing (ΔR2 = .25, 95% CI
[.16; .30], p < .001), variance shared with measures of emotional lability and impulsivity
alone (ΔR2 = .20, 95% CI [.15; .32], p < .001), and variance uniquely explained by LPFS-BF
(ΔR2 = .03, 95% CI [.02; .05], p < .001). In the second set of variables, the association
between the RFQ-8 and IPO-16 (R2 = .42) was decomposed into variance shared with
measures of mentalizing (ΔR2 = .32, 95% CI [.23; .37], p < .001), variance shared with
measures of emotional lability and impulsivity alone (ΔR2 = .06, 95% CI [.04; .18], p < .001),
and variance uniquely explained by IPO-16 (ΔR2 = .03, 95% CI [.01; .05], p < .001). On
average, across all 32 combinations of predictors, the association between the RFQ-8 and
broad indicators of personality dysfunction (mean R2 = .44) was decomposed into variance
shared with measures of mentalizing (mean ΔR2 = .28), variance shared with measures of
emotional lability and impulsivity alone (mean ΔR2 = .13), and variance uniquely explained
by measures of personality dysfunction (mean ΔR2 = .03). Thus, 63% of the observed associations between RFQ-8 and
indicators of personality dysfunction were due to variance shared with other measures of
mentalizing, whereas 30% were due to variance shared with measures of emotional lability
and impulsivity alone and 7% were unique to measures of personality dysfunction.

Table S1
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Criterion Measures in Sample 1 at Admission

RFQ-8 RFQ_C RFQ_U BSI PHQ-15 PHQ-9 IIP-32 IPO-16 OPD-SFK

RFQ_C -.92

RFQ_U .80 -.65

BSI .55 -.46 .53

PHQ-15 .26 -.23 .25 .59

PHQ-9 .42 -.34 .43 .76 .47

IIP-32 .54 -.48 .48 .63 .32 .51

IPO-16 .72 -.68 .62 .59 .33 .46 .60

OPD-SFK .65 -.57 .60 .72 .41 .59 .68 .69

WHO-5 -.23 .18 -.24 -.51 -.29 -.65 -.35 -.22 -.38

Note. N = 861. All p < .001.



Table S2
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Criterion Measures in Sample 2

RFQ-8 RFQ_C RFQ_U Total NEG DET ANT DIS

RFQ_C -.91

RFQ_U .88 -.64

PID-5-BF Total Score .67 -.61 .60

Negative Affectivity (NEG) .60 -.52 .56 .73

Detachment (DET) .46 -.43 .39 .72 .43

Antagonism (ANT) .30 -.32 .23 .65 .29 .29

Disinhibition (DIS) .44 -.40 .41 .73 .40 .36 .43

Psychoticism (PSY) .62 -.56 .59 .82 .52 .52 .44 .50

Note. N = 566. All p < .001.



Table S3
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Further Measures in Sample 3

RFQ-8 (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)

(1) RFQ_C -.92

(2) RFQ_U .81 -.62

(3) LPFS-BF .69 -.61 .55

(4) IPO-16 .64 -.60 .53 .66

(5) CAMSQ Self-Certainty -.54 .54 -.44 -.51 -.38

(6) SRIS Self-Insight -.68 .65 -.63 -.60 -.60 .68

(7) CAMSQ Other-Certainty -.28 .31 .16 -.23 -.09 .56 .28

(8) EQ Cognitive Empathy -.28 .29 -.18 -.27 -.11 .42 .28 .73

(9) PID-5 Emotional Lability .59 -.49 .53 .59 .61 -.36 -.53 -.08 -.08

(10) PAI-BOR Affective Instability .63 -.53 .56 .67 .56 -.42 -.53 -.16 -.17 .72

(11) UPPS-P Negative Urgency .74 -.69 .59 .63 .59 -.43 -.54 -.18 -.16 .62 .69

(12) DERS Impulse Control Difficulties .65 -.55 .57 .65 .61 -.41 -.57 -.13 -.15 .66 .69 .68

Note. N = 862. All correlations are statistically significant at p < .05.



Figure S1
Originally Proposed Two-Dimensional CFA Model Using Double-Scoring in Sample 1 at Admission

Note. N = 861. χ²(51) = 490.05, CFI = .95, RMSEA = .10, SRMR = .15. ω = .88/.73 (RFQ_C/RFQ_U).
The model was estimated using the WLSMV estimator. Correlated errors were specified following
recommendations by Fonagy et al. (2016). Intercepts, residual variances, and thresholds are not displayed.

Figure S2
Originally Proposed Two-Dimensional CFA Model Using Double-Scoring in Sample 2

Note. N = 566. χ²(51) = 313.46, CFI = .95, RMSEA = .10, SRMR = .15. ω = .88/.78 (RFQ_C/RFQ_U).
The model was estimated using the WLSMV estimator. Correlated errors were specified following
recommendations by Fonagy et al. (2016). Intercepts, residual variances, and thresholds are not displayed.

Figure S3
Lowess-Smoothed Regression Curves of the Association Between the RFQ-6 and Criterion Measures in Sample 1 at Admission

Figure S4
Lowess-Smoothed Regression Curves of the Association Between the RFQ-6 and Criterion Measures in Sample 2

Figure S5
Two-Lines Test of the Association Between the RFQ-6 and Criterion Measures in Sample 1 at Admission

Figure S6
Two-Lines Test of the Association Between the RFQ-6 and Criterion Measures in Sample 2

Figure S7
Bifactor Exploratory Structural Equation Model in Sample 1 at Admission (Panel A) and
Sample 2 (Panel B) Using all Eight Items of the RFQ

Note. All estimates are standardized. Intercepts, thresholds, and non-target loadings are not
displayed. Non-target loadings were all < .30. Target loadings < .30 are indicated by dashed lines.
Variable names are the same as in Figure 3. In Sample 1, model fit was CFI = .94, RMSEA = .06,
SRMR = .04. In Sample 2, model fit was CFI = .96, RMSEA = .05, SRMR = .04.
*p < .05
