
This article has been accepted for publication in Journal of Personality Assessment,

published by Taylor & Francis.

A Critical Evaluation of the Reflective Functioning Questionnaire (RFQ)

Sascha Müller1*, Leon P. Wendt1*, Carsten Spitzer2, Oliver Masuhr3, Sarah N. Back4,

& Johannes Zimmermann1

1 University of Kassel, Department of Psychology, Kassel, Germany
2 Rostock University Medical Center, Rostock, Germany
3 Asklepios Fachklinikum Tiefenbrunn, Rosdorf, Germany
4 Ludwig-Maximilians-Universität München, Munich, Germany

*shared first authorship

Authors’ note: Ethics committee approval was obtained for data collection (Ethics Committee

at the University Medicine Rostock, #A2020-0075; Ethics Committee at the Faculty of

Behavioural and Cultural Studies, University of Heidelberg, #Tau 2019 1/1). The first authors

contributed equally to this research. The data used in Study 1 are available from the

corresponding authors upon reasonable request. Data and R code for reproducing the analyses

of Study 2 are permanently and openly accessible at https://osf.io/stbz5. Please address

correspondence to Sascha Müller and Leon Wendt, Department of Psychology, University of

Kassel, Holländische Str. 36-38, 34127, Kassel, Germany; E-Mail: [email protected],
[email protected].

Abstract

The Reflective Functioning Questionnaire (RFQ) is an 8-item self-report measure of

reflective functioning that is presumed to capture individual differences in hypo- and

hypermentalizing. Despite its broad acceptance by the field, we argue that the validity of the

measure is not well-established. The current research elaborates on problems of the RFQ

related to its item content, scoring procedure, dimensionality, and associations with

psychopathology. We tested these considerations across three large clinical and non-clinical

samples from Germany and the US (total N = 2289). In a first study, we found that the RFQ

may assess a single latent dimension related to hypomentalizing but is rather unlikely to

capture maladaptive forms of hypermentalizing. Moreover, the RFQ exhibited very strong

associations with measures of personality pathology, while associations with measures of

symptom distress were less strong. In a second preregistered study focused on convergent and

discriminant validity, however, a commonality analysis indicated that associations with

indicators of personality pathology are inflated because some of the RFQ items tap into

emotional lability and impulsivity rather than mentalizing. Our findings demonstrate

limitations of the RFQ. We discuss key challenges in assessing mentalizing via self-report.

Keywords: reflective functioning; mentalizing; validity; factor analysis; U-shaped

associations; commonality analysis



A Critical Evaluation of the Reflective Functioning Questionnaire (RFQ)

Fonagy and colleagues (2016) introduced the Reflective Functioning Questionnaire

(RFQ), a brief screening measure that is intended to assess an individual’s capacity to

adequately interpret mental states of both the self and others (i.e., reflective functioning or

mentalizing) via self-report. The measure as well as its translated versions have been

positively evaluated in several validation studies (Badoud et al., 2015; Fonagy et al., 2016;

Morandotti et al., 2018), leading to the widely shared conclusion that the RFQ is able to capture

deficits in reflective functioning. According to mentalizing theory (e.g., Bateman & Fonagy,

2016), the theoretical spectrum of reflective functioning includes hypo- and hypermentalizing

(i.e., too little or too much certainty about one’s interpretation of mental states) as well as

genuine mentalizing (the optimal trait level, i.e., acknowledging the opaqueness of mental

states). Deviations from the optimal trait level in both directions are presumed to be a sign, a

symptom, and a transdiagnostic risk factor of psychopathology (e.g., Luyten et al., 2020) and

have been targeted by a specialized psychotherapeutic approach, that is, mentalization-based

treatment (e.g., Bateman & Fonagy, 2016).

In the RFQ, respondents are asked to endorse or reject eight statements that are

supposed to relate to mentalizing processes (e.g., Item 6: “Sometimes I do things without

really knowing why”) on a 7-point scale ranging from do not agree at all (= 1) to agree

completely (= 7) (see Panel A of Figure 1). The creators implemented a procedure that

involves scoring half of the items twice (i.e., double-scoring) and splitting the underlying

information into two scales (see Panel B of Figure 1): certainty about mental states (RFQ_C;

i.e., 3, 2, 1, 0, 0, 0, 0) and uncertainty about mental states (RFQ_U; i.e., 0, 0, 0, 0, 1, 2, 3).

According to this logic, a strong rejection of Item 6 on the original scale (= 1) is scored in

such a manner that it is indicative of high certainty (= 3) on the RFQ_C scale and at the same

time indicative of low uncertainty (= 0) on the RFQ_U scale, while a strong agreement is

scored as indicative of low certainty (= 0) on the RFQ_C scale and at the same time as

indicative of high uncertainty (= 3) on the RFQ_U scale. High levels of certainty about

mental states are assumed to reflect hypermentalizing and high levels of uncertainty are

assumed to reflect hypomentalizing (e.g., Badoud et al., 2015; Fonagy et al., 2016;

Morandotti et al., 2018). The latent structure of the RFQ has been investigated repeatedly

using double-scored items and reportedly consists of two negatively correlated factors (with

correlations typically ranging between -.60 and -.80) reflecting the scales of RFQ_C and

RFQ_U (Badoud et al., 2015; Fonagy et al., 2016).

However, in a recent attempt to validate the factor structure of the German version of

the RFQ in a large sample (N = 2477) representative of the German general population,

Spitzer et al. (2021) criticized the use of double-scored items in factor analysis (e.g., Badoud

et al., 2015; Fonagy et al., 2016; Morandotti et al., 2018), noting that the assumption of

uncorrelated item residuals seems unrealistic when two items are derived from the same

original responses. Instead, Spitzer and colleagues provided initial evidence that a

unidimensional model sufficiently explained the observed covariation of the original

responses to RFQ items and suggested that this representation could give rise to two

maladaptive poles of hypo- and hypermentalizing deviating from an adaptive middle region.

Following this notion, they proceeded with testing U-shaped relationships between a

unidimensional RFQ score (dropping two items to improve internal consistency) and

depression, anxiety, and somatic symptoms but found no evidence for such associations. The

findings of Spitzer and colleagues in conjunction with their briefly noted considerations raise

some initial doubts and call for a critical re-examination of the RFQ as a psychometric test. A

critical re-examination and discussion of the RFQ appears to be of particular importance

given that researchers are increasingly adopting the measure for primary investigations of the

mentalizing construct (e.g., Badoud et al., 2018; de Meulemeester et al., 2017, 2018; Huang

et al., 2020; Li et al., 2020). In the following, we will identify and explore potential issues

with the RFQ in detail.

Item Content

An initial concern is that the coverage of the reflective functioning construct in the

RFQ does not seem to converge well with the definition given by the creators (Fonagy et al.,

2016). According to them, reflective functioning pertains to “the capacity to interpret both the

self and others in terms of internal mental states, such as feelings, wishes, goals, desires, and

attitudes” (Fonagy et al., 2016, p. 1). However, all RFQ items but one refer to understanding

oneself (and not others) and most items refer to understanding one’s own behavior (and not

feelings, desires, wishes, goals, or attitudes). The construct of reflective functioning as

defined by the authors might thus not be optimally covered by the items. Although it should

be acknowledged that the RFQ was introduced as a brief screening measure and the creators

themselves generally called for a multidimensional assessment of mentalizing (e.g., Fonagy

et al., 2016; Luyten et al., 2020), a potential lack of coverage cannot be compensated for by

relating the RFQ to a validated and more comprehensive long form. This is because, even

though long forms (i.e., RFQ-54, RFQ-46) are accessible from the creators’ website and are

also used in some studies (e.g., Euler et al., 2021), there are no publications that investigate

their psychometric properties or their relationship to the brief RFQ.

In addition, it seems that the item content of the RFQ may overrepresent another

maladaptive characteristic, namely, a tendency towards impulsive behavior when

experiencing negative emotions (e.g., Item 3: “When I get angry, I say things without really

knowing why I am saying them”; Item 4: “When I get angry, I say things that I later regret”;

Item 5: “If I feel insecure I can behave in ways that put others’ backs up”). That is, it could be

that individuals who endorse these items are not necessarily lacking in reflective capacity, but

merely show a high level of negative urgency (Cyders & Smith, 2008; Settles et al., 2012).

Negative urgency denotes the disposition to act rashly and ill-advisedly under negative

emotions and can thus be seen as reflecting a blend of emotional lability and impulsivity.

While this behavioral signature could be considered a consequence of impairments in

mentalizing abilities, it may well be due to other underlying characteristics and deficiencies

(e.g., lack of adaptive emotion regulation strategies; King et al., 2018). Therefore, these items

do not seem specific enough to the core definition of reflective functioning. By contrast, other

items of the RFQ such as “People’s thoughts are a mystery to me” (Item 1) or “I always know

what I feel” (Item 7) seem to address more directly the certainty or uncertainty involved in

forming inferences about mental states.

Scoring Procedure

The second concern pertains to the aforementioned double-scoring of four of the eight

RFQ items (i.e., Items 2, 4, 5, 6) to derive RFQ_U and RFQ_C scale scores and their use in

factor analysis, as it causes psychometric problems. Given that respondents only provide one

rating for each of these four items on the 7-point scale, the resulting eight rescaled scores on

RFQ_C and RFQ_U are mutually determined. For example, when a respondent rates Item 6

with strong agreement, this necessarily leads to a rescaled score of 3 on RFQ_C and a

rescaled score of 0 on RFQ_U (see Panels A and B of Figure 1). Thus, the two rescaled

scores are not independent of each other and overlap with regard to their information. In fact,

nine of the 16 theoretical combinations of the two scores are mathematically impossible (see

Panel C of Figure 1). In our Sample 1 (see below), this results in an artificially negative correlation of r = -.55 between the two rescaled scores of Item 6. The exact value of the

correlation depends on the univariate distribution of the raw scores and varies slightly

between items and samples. However, when using polychoric correlations that take into

account the ordinal scaling of the scores, the estimate will always be exactly r = -1. This

shows that the two rescaled scores are in fact completely redundant. This issue is particularly

problematic because the assumed factor structure of the RFQ is based on confirmatory factor

analysis (CFA) using double-scored items as indicators (Badoud et al., 2015; Fonagy et al.,

2016). Applying ordinal CFA should produce warning messages because the polychoric

correlations between several indicators will approach r = -1. However, if instead one treats

the indicators as continuous and applies robust maximum likelihood estimation (as was done

in previous validation studies), one also cannot undo the inherent dependencies of the

indicators. In this case, the suggested two-dimensional factor model will artificially induce a

negative correlation between the two factors because the residual correlations of double-

scored item pairs are restricted to zero. Therefore, the finding of two negatively correlated

dimensions of the RFQ (i.e., RFQ_U and RFQ_C) using double-scored items should not be interpreted as evidence in favor of the instrument’s structural validity, as was done in

previous validation studies (Badoud et al., 2015; Fonagy et al., 2016). We argue that the

factor structure of the RFQ is still open to debate.
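
To make the redundancy induced by double-scoring concrete, the following minimal R sketch recodes hypothetical raw responses to a double-scored item (such as Item 6) into the RFQ_C and RFQ_U metrics; the object names are illustrative and not taken from our analysis scripts.

set.seed(1)
# Hypothetical raw responses (1-7) to a double-scored RFQ item, e.g., Item 6
raw <- sample(1:7, size = 500, replace = TRUE)
# Recoding keys of the original scoring procedure
rfq_c <- c(3, 2, 1, 0, 0, 0, 0)[raw]  # certainty about mental states
rfq_u <- c(0, 0, 0, 0, 1, 2, 3)[raw]  # uncertainty about mental states
# Cross-tabulation: 9 of the 16 possible (RFQ_C, RFQ_U) combinations never occur
table(rfq_c, rfq_u)
# The Pearson correlation is artificially negative; its exact value depends on the
# raw-score distribution, whereas the polychoric correlation approaches r = -1
cor(rfq_c, rfq_u)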

The problematic scoring procedure associated with the two scales manifests itself in

conceptual inconsistencies with regard to the RFQ_C scale in particular. Although RFQ_C is

supposed to represent certainty about mental states, all items are geared towards a state of

uncertainty with respect to their semantic content (e.g., Item 1: “People’s thoughts are a

mystery to me”) and are, ultimately, reverse-scored. The certainty scale is thus based

entirely on the denial of uncertainty. In fact, the RFQ contains only one item directly

referring to a state of certainty (Item 7: “I always know what I feel”) and this item is scored

exclusively for the uncertainty scale (RFQ_U).

Associations with Psychopathology

A third issue is that findings on associations between RFQ scales and

psychopathology constructs are somewhat conflicting. As theory posits that mentalizing

impairments such as hypo- and hypermentalizing are a vulnerability factor for severe

psychopathology (e.g., Luyten et al., 2020), positive associations between mentalizing

impairments and various indicators of psychopathology are expected. However, the

correlational patterns of the RFQ_C and RFQ_U scales that emerge from the literature (e.g.,

Badoud et al., 2015; Fonagy et al., 2016; Huang et al., 2020; Li et al., 2020) appear to be at odds with the interpretation that RFQ_C assesses hypermentalizing, as has been noted

previously (de Meulemeester et al., 2018; Euler et al., 2021). Overall, the RFQ_C scale

(certainty about mental states) was often positively associated with mental health, suggesting

that it captures an adaptive characteristic. By contrast, the RFQ_U scale (uncertainty about

mental states) appears to be quite strongly related to various indices of psychopathology.

Taken together, the two scales of the RFQ tend to exhibit similar correlational patterns to

external criteria but with opposite signs. These opposing correlational patterns

of RFQ_C and RFQ_U in a shared nomological net seem more compatible with the notion

that the RFQ reflects a unidimensional continuum ranging from genuine to impaired

mentalizing.

Another aspect in this regard is that, from a theoretical point of view, the association

between mentalizing and psychopathology should depend on which form of psychopathology

is being considered. Specifically, impairments in mentalizing are thought to be a core feature

of personality disorders (e.g., Bateman & Fonagy, 2019; Fonagy et al., 2017; Luyten et al.,

2020). For example, in the Alternative Model for Personality Disorders (AMPD) in the fifth

edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; APA, 2013),

the first general criterion of personality disorders refers to difficulties in understanding and

regulating the self and interpersonal relationships, with mentalizing deficits explicitly

considered as one important element (Bender et al., 2011). Thus, the strongest correlations in

the field of psychopathology should occur where the RFQ is correlated with measures that

capture the severity of personality pathology. Previous studies documented strong



associations between the RFQ scales and measures of personality pathology (e.g., Badoud et

al., 2015; de Meulemeester et al., 2017; Fonagy et al., 2016) as well as various symptom

measures (e.g., de Meulemeester et al., 2018; Fonagy et al., 2016; Huang et al., 2020; Li et

al., 2020). However, the magnitudes of these associations should be compared systematically

by inferential testing. Notably, differential associations should be carefully interpreted in

terms of validity, not least because the RFQ also contains item content that may reflect

constructs such as negative urgency rather than mentalizing specifically. For example,

impulsive responsivity to emotions may be a general underlying disposition for psychological

distress that would be expected to exhibit similar patterns of associations as were reported for

the scales of the RFQ (e.g., Carver et al., 2017; Settles et al., 2012). We therefore argue that

more research is needed to evaluate whether associations of the RFQ with specific,

theoretically selected forms of psychopathology are stronger than associations with non-

specific symptom distress. In addition, it should be investigated whether associations between

the RFQ and indicators of personality dysfunction are artificially inflated due to item content

that is shared with other maladaptive characteristics.

Study 1

We have stated potential concerns with regard to the RFQ related to its item content,

scoring procedure, dimensionality, and associations with psychopathology. Based on our

interpretation of previous findings, we make three observations: First, we propose that the

measure may emerge as unidimensional when modeled more adequately, that is, when not

using double-scored items in CFA given the methodological problems associated with this

approach. Although initial evidence for the unidimensionality of the RFQ has been provided

by Spitzer et al. (2021), their study only involved a single non-clinical sample. In the first

study, we used data from a clinical and a non-clinical sample to compare a series of factor

models for the RFQ including unidimensional and two-dimensional representations in order

to test the generalizability and robustness of a unidimensional solution.

Second, if the RFQ turns out to be unidimensional, it could still capture a unipolar or

bipolar construct. As a unipolar construct, it could range from genuine to impaired

mentalizing, whereas as a bipolar construct, it could capture hypo- and hypermentalizing as

two maladaptive ends of one continuum. Notably, in our view, a unidimensional scoring of

the RFQ lends itself to investigating the unique characteristics of the two poles using non-

linear statistical models. Specifically, in the case where the RFQ captures a continuum that

runs from extreme uncertainty about mental states (“too little mentalizing”) to an optimal

level (“genuine mentalizing”) to excessive certainty about mental states (“too much

mentalizing”), one would predict U-shaped associations with maladaptive characteristics

(e.g., depression) and, vice versa, inverse U-shaped associations with adaptive characteristics

(e.g., well-being). 1 When first testing this hypothesis by specifying quadratic terms in

regression models to predict psychopathology outcomes, Spitzer et al. (2021) found no

support for such U-shaped associations. However, the authors considered only a narrow set of criteria, namely short screening measures of depression, anxiety, and somatic symptoms.

In this study, we explored (inverse) U-shaped relationships between the RFQ and measures

indicative of psychopathology and personality pathology to provide more comprehensive

tests of the hypothesis that the RFQ captures two maladaptive variants of mentalizing on a

single continuum (i.e., hypo- and hypermentalizing).

Third, given that the RFQ contains several non-specific items, we examined

associations with specific forms of psychopathology that are theoretically more closely

1 Thus, both low and high levels of the unidimensional RFQ-8 are supposed to be maladaptive, marking the high
ends of the U-shape when inspecting a maladaptive criterion. By contrast, middle levels are not supposed to be
maladaptive, marking the turning point of the U-shape when inspecting a maladaptive criterion. Similarly, the
association between the RFQ-8 and adaptive criteria should be inverse U-shaped, indicating that middle levels
are again adaptive whereas both low and high levels are not.

related to the core construct of mentalizing. We therefore selected several measures of

personality pathology that are linked to Criterion A (impairments in personality functioning)

and Criterion B (maladaptive personality traits) of the AMPD. We compared the correlations

between the RFQ and measures of personality pathology with correlations between the RFQ

and measures of general symptom distress, and further scrutinized the latent associations

between the RFQ and personality pathology using bifactor exploratory structural equation

modeling (bifactor ESEM).

Method

Sample 1. The first sample was collected at a psychosomatic clinic, the Asklepios

Fachklinikum Tiefenbrunn. Participants (64% female) were 861 inpatients ranging in age

from 18 to 68 (M = 34.0, SD = 13.3). The data presented here were collected as part of routine diagnostics from patients admitted for inpatient treatment, within three days of admission and again shortly before discharge. The RFQ was administered to all included participants at admission without missing data, whereas only 364 participants completed the

measure at discharge. The reason for these missing observations is not dropout; the RFQ was

only administered at discharge at a later stage of data collection and thus only pertained to

364 patients. For more detailed information about the sample characteristics, see Note S1 in

the supplement.

Sample 2. The second sample was collected online within a study on dimensional

measures of personality at Heidelberg University Hospital (Back et al., 2020). Participants

were recruited via flyers and calls for participation on the university website and in several

online forums. The sample comprises 566 young adults (74% female) who completed the

study and whose ages ranged from 18 to 30 (M = 24.2, SD = 3.13). There were no missing

data. For more detailed information about the sample characteristics, see Note S2 in the

supplement.

Measures. The RFQ was administered in both samples. The Brief Symptom

Inventory (BSI), Inventory of Interpersonal Problems (IIP-32), WHO-5 Well-Being Index

(WHO-5), Patient Health Questionnaire (PHQ), Inventory of Personality Organization (IPO-

16), and Operationalized Structural Diagnostic Questionnaire – Short Form (OPD-SQS) were

administered in Sample 1. Some of the criterion measures were administered at both

admission and discharge in Sample 1; however, we used only the data at admission (i.e.,

before treatment) for exploring their associations with the RFQ. The Personality Inventory

for DSM-5 – Brief Form (PID-5-BF) was administered in Sample 2.

Reflective Functioning Questionnaire (RFQ). The RFQ (Fonagy et al., 2016)

comprises eight items forming the two scales of certainty about mental states (RFQ_C) and

uncertainty about mental states (RFQ_U). According to the scoring procedure described

above, the 7-point Likert scale (“do not agree at all” = 1 to “agree completely” = 7) is

recoded as 3, 2, 1, 0, 0, 0, 0 for the scale RFQ_C; for the scale RFQ_U, items are recoded as

0, 0, 0, 0, 1, 2, 3 (except for the reverse-coded Item 7, which is recoded as shown for RFQ_C). Results for the original scoring procedure are reported in the supplement. For the main results, we refrained from

applying the scoring procedure due to the aforementioned problems. Items were thus kept in

the original coding (i.e., 1, 2, 3, 4, 5, 6, 7) and only Item 7 was reversed such that it

corresponded to the content polarity of the other items. High values indicated uncertainty

about mental states and low values indicated certainty. In the following, we will refer to the

mean score of the 8-item RFQ as the RFQ-8. In a previous investigation using a

representative sample from the German population, Spitzer et al. (2021) presented results in

favor of using a reduced 6-item version of the RFQ (omitting Items 7 and 4), which will subsequently be referred to as the RFQ-6.
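
As a minimal sketch of this unidimensional scoring in R, assuming a hypothetical data frame dat with raw item responses rfq1 to rfq8 on the original 1-7 scale:

dat$rfq7r <- 8 - dat$rfq7  # reverse Item 7 to match the polarity of the other items
# RFQ-8: mean of all eight items (Item 7 reversed); high values indicate uncertainty
dat$rfq8_score <- rowMeans(dat[, c("rfq1", "rfq2", "rfq3", "rfq4",
                                   "rfq5", "rfq6", "rfq7r", "rfq8")])
# RFQ-6: omits Items 7 and 4 (Spitzer et al., 2021)
dat$rfq6_score <- rowMeans(dat[, c("rfq1", "rfq2", "rfq3", "rfq5", "rfq6", "rfq8")])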

Brief Symptom Inventory (BSI). The BSI (Franke, 2000) was used to assess distress

associated with symptoms of mental illness during the last week (e.g., “loss of appetite”) on a

5-point scale ranging from “not at all” (0) to “extremely” (4). Internal consistency of the BSI

total score was estimated at α = .95 in Sample 1.

Inventory of Interpersonal Problems (IIP-32). Interpersonal problems were

measured with the 32-item version of the IIP (Horowitz et al., 2000). The measure assesses

distress associated with interpersonal behaviors (e.g., “I open up to other people too much”)

that are performed excessively or inhibited strongly on a 5-point scale ranging from “not at

all” (0) to “extremely” (4). Internal consistency of the IIP-32 total score was estimated at α =

.87 in Sample 1.

WHO-5 Well-Being Index (WHO-5). The WHO-5 (World Health Organization,

1998) is a self-report measure of well-being. It consists of five items (e.g., “Over the last two

weeks I have felt cheerful and in good spirits”) that are rated on a 6-point Likert scale ranging

from “at no time” (0) to “all the time” (5). High scores indicate high subjective well-being.

Internal consistency of the WHO-5 total score was estimated at α = .85 in Sample 1.

Patient Health Questionnaire (PHQ). The PHQ-15 (Kroenke et al., 2002) is a 15-

item module for assessing the severity of impairment associated with somatic symptoms

(e.g., “back pain”) that have been experienced during the last four weeks. Items are rated on a

3-point Likert scale ranging from “not bothered at all” (0) to “bothered a lot” (2). The PHQ-9

(Kroenke et al., 2001) is a 9-item module for assessing the impairment associated with the

nine DSM-IV criteria of depression (e.g., “Little interest or pleasure in doing things”,

“Feeling tired or having little energy”) that may have been experienced during the last two

weeks. Items are rated on a 4-point scale ranging from “not at all” (0) to “nearly every day”

(3). Internal consistencies of the PHQ-15 and PHQ-9 total scores were estimated at α = .81 and α = .84, respectively, in Sample 1.

Inventory of Personality Organization (IPO-16). The IPO-16 assesses general

personality dysfunction (Zimmermann et al., 2013) in the domains of identity diffusion (e.g.,

“I feel that my tastes and opinions are not really my own, but have been borrowed from other

people”), primitive defenses (e.g., “People tell me I behave in contradictory ways”), and

reality testing (e.g., “I can’t tell whether certain physical sensations I’m having are real, or

whether I’m imagining them”). The 16 items are answered on a 5-point scale from “never

applies” (1) to “always applies” (5). Internal consistency of the IPO-16 total score was

estimated at α = .86 in Sample 1.

Operationalized Structural Diagnostic Questionnaire – Short Form (OPD-SQS).

The OPD-SQS is a 12-item measure of personality dysfunction (Ehrenthal et al., 2015).

Statements are endorsed or rejected on a 5-point scale ranging from “completely untrue” (0)

to “entirely true” (4). The items give rise to the scales of self-perception (e.g., “I sometimes

feel like a stranger to myself”), contact (e.g., “I sometimes misjudge how my behavior affects

others”), and relationship (e.g., “It can be dangerous to let others get too close to you.”).

Internal consistency of the OPD-SQS total score was estimated at α = .86 in Sample 1.

Personality Inventory for DSM-5 – Brief Form (PID-5-BF). The PID-5-BF (APA,

2013; Zimmermann et al., 2014) is a 25-item measure assessing the broad maladaptive

personality domains of negative affectivity, detachment, disinhibition, antagonism, and

psychoticism with five items each. Items are rated on a 4-point scale ranging from “very

false” (0) to “very true” (3). Internal consistency of the PID-5-BF total score was estimated at

α = .86 in Sample 2.

Statistical Analyses. The analyses were performed using R version 4.0.3 (R Core

Team, 2020) in conjunction with the package lavaan (Rosseel, 2012) and Mplus version 8.4

(Muthén & Muthén, 1998–2019). All latent variable models were estimated using the

Weighted Least Squares Mean and Variance Adjusted (WLSMV) estimator that was applied

to the polychoric correlation matrix. The fit of latent variable models was evaluated by a

commonly used combination of fit indices and cut-off criteria (i.e., Comparative Fit Index

[CFI] > .95, Root Mean Square Error of Approximation [RMSEA] < .06, Standardized Root

Mean Square Residual [SRMR] < .08; Hu & Bentler, 1999). The internal consistency of the

RFQ was estimated with the model-based McDonald’s ω for categorical variables (Flora,

2020). We report fully standardized estimates.

Factor Structure. The factor structure of the RFQ was evaluated by confirmatory

factor analysis (CFA) and exploratory factor analysis (EFA) using the original item

responses. The two-dimensional measurement model using double-scored items that was

reported by the creators of the RFQ (Badoud et al., 2015; Fonagy et al., 2016) is not taken

into account for the main results due to the methodological problems associated with the

scoring procedure that are described above. However, we report the results based on the

original model and the creators’ recommendations in the supplement. For this analysis, we

considered the following measurement models: (1) a unidimensional CFA; (2) a two-

dimensional CFA with cross-loadings that follow the scoring procedure of RFQ_C and

RFQ_U as proposed in the original publication; and (3) a two-dimensional EFA with oblique

factor rotation (quartimin). In all CFA models, the correlation between the residual variances

of Items 3 and 4 was freely estimated since these items have a large overlap in terms of

semantic content and wording. As two measurement occasions were available in Sample 1

(i.e., at admission and discharge, respectively), we specified repeated measures CFA models

with equality constraints for loadings, thresholds, intercepts, latent covariance, and residual

covariances (Liu et al., 2017). For the repeated measures CFA, we dealt with missing data by

means of pairwise deletion.
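
For illustration, a minimal lavaan sketch of the unidimensional model for a single measurement occasion might look as follows; the data frame and item names are hypothetical, and the repeated measures constraints described above are omitted for brevity.

library(lavaan)
# Unidimensional CFA with ordinal indicators (Item 7 reversed as rfq7r) and the
# a priori residual correlation between Items 3 and 4
model_uni <- '
  rf =~ rfq1 + rfq2 + rfq3 + rfq4 + rfq5 + rfq6 + rfq7r + rfq8
  rfq3 ~~ rfq4
'
fit_uni <- cfa(model_uni, data = dat, estimator = "WLSMV",
               ordered = c("rfq1", "rfq2", "rfq3", "rfq4",
                           "rfq5", "rfq6", "rfq7r", "rfq8"))
fitMeasures(fit_uni, c("cfi.scaled", "rmsea.scaled", "srmr"))
standardizedSolution(fit_uni)  # fully standardized loadings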

U-Shaped Associations with Psychopathology. We investigated the shape of

associations between the RFQ scale scores and various measures of psychopathology

spanning symptoms of mental illness and maladaptive characteristics. More specifically,

given the conceptualization of the RFQ, one would expect U-shaped associations with

maladaptive characteristics (i.e., signs and symptoms of psychopathology) and inverse U-

shaped associations with adaptive characteristics (e.g., subjective well-being), respectively. In

order to test the hypothesis that hypo- and hypermentalizing may delineate extreme

maladaptive ends of a unidimensional continuum (i.e., very low and very high values on the

manifest score), we examined U-shaped and inverse U-shaped associations between the RFQ

score and measures of symptomatic distress (i.e., general symptomatic distress: BSI; somatic

symptoms: PHQ-15; depressive symptoms vs. well-being: PHQ-9, WHO-5), personality

dysfunction (i.e., IPO-16, OPD-SQS, PID-5-BF), and interpersonal distress (i.e., IIP-32). To

this end, lowess smoothing curves were inspected and regression models were estimated in

which a quadratic term of the RFQ score was added for predicting criterion variables. It

should be noted that a significant effect of the quadratic predictor is not sufficiently indicative

of the presence of two maladaptive poles, as this would also be observed, for example, in the presence of floor or ceiling effects. More specifically, to form a U-shape, the predicted values for a criterion should also differ between medium and extreme scores on the RFQ. We therefore further used the two-lines test as a more rigorous method (Simonsohn,

2018) that estimates two regression lines, one before and one after a break-point in the

distribution of a predictor variable, in order to detect a change in the sign of the regression

slope.
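
As a minimal R sketch of these checks for a single criterion (hypothetical variable names rfq6_score and bsi), where the median is used as the break point purely for illustration rather than the algorithmically chosen break point of the two-lines test (Simonsohn, 2018):

# Quadratic regression: add a squared term of the RFQ score when predicting a criterion
fit_quad <- lm(bsi ~ rfq6_score + I(rfq6_score^2), data = dat)
summary(fit_quad)
# Visual check: lowess smoother over the scatterplot
plot(dat$rfq6_score, dat$bsi)
lines(lowess(dat$rfq6_score, dat$bsi))
# Two-lines logic (simplified): a genuine U-shape implies slopes of opposite sign
# before and after the break point
bp <- median(dat$rfq6_score)
coef(lm(bsi ~ rfq6_score, data = subset(dat, rfq6_score <= bp)))["rfq6_score"]
coef(lm(bsi ~ rfq6_score, data = subset(dat, rfq6_score > bp)))["rfq6_score"]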

Specific and Latent Associations with Personality Pathology. We investigated the

differential associations between the RFQ and indicators of personality pathology (i.e., IPO-

16, OPD-SQS) and various dimensions of symptomatic distress (i.e., BSI, PHQ-15, PHQ-9,

IIP-32, WHO-5), respectively. To this end, we compared the magnitude of their correlation

coefficients in Sample 1. Additionally, we estimated the empirical overlap between the RFQ

and indicators of personality pathology (i.e., PID-5-BF, IPO-16, OPD-SQS) in both samples

using a bifactor exploratory structural equation modeling approach (i.e., bifactor ESEM). In a

bifactor ESEM (e.g., Morin et al., 2020), a criterion (e.g., RFQ) is regressed on the

orthogonal general and specific factors of an exploratory bifactor measurement model

reflecting a multidimensional construct (e.g., PID-5-BF). This approach has two advantages

for this analysis. First, the use of latent variable modeling partitions reliable variance and

measurement error and thus estimates the disattenuated associations among constructs.

Second, bifactor models with uncorrelated general and specific factors allow for a clear

partitioning of variance in the presence of multidimensionality and a strong general factor

(Reise, 2012), as is the case for the PID-5-BF, IPO-16, and OPD-SQS (Zimmermann et al.,

2020). Further details and explanations about the bifactor ESEM approach used here can be

found in Note S3 of the supplement.

Results

Factor Structure. All estimated model parameters are depicted in Figure 2 per

sample. In Sample 1, the unidimensional repeated measures CFA model showed good model

fit, χ²(139) = 353.0, p < .001, CFI = .98, RMSEA = .046, 90% CI [.041; .052], SRMR = .06.

Factor loadings were acceptable (λ ≥ .49), but Item 7 showed a weak loading on the latent

factor (λ = .34). The a priori specified residual correlation between Items 3 and 4 was

estimated at .60. Heywood cases occurred for the two-dimensional CFA model (i.e., factor

correlation > 1) and the two-dimensional EFA model (i.e., standardized factor loading > 1 for

Item 4). In Sample 2, the unidimensional CFA model provided a good fit to the data, χ²(19) =

91.8, p < .001, CFI = .98, RMSEA = .08, 90% CI [.07; .10], SRMR = .04. Factor loadings

were acceptable (λ ≥ .51), but Item 7 again exhibited the weakest loading on the latent factor

(λ = .45). The a priori specified residual correlation between Items 3 and 4 was estimated at

.66. The two-dimensional CFA model did not reach convergence. The two-dimensional EFA

model again showed a Heywood case (i.e., standardized loading > 1 for Item 4).

Taken together, we did not identify the two proposed dimensions of RFQ_C and

RFQ_U in the estimated solutions, and we never identified more than one meaningful factor

in general, suggesting that the RFQ essentially captures a unidimensional construct. Results

for the originally proposed two-dimensional CFA model that uses double-scoring of RFQ

items are reported in the supplement (see Figures S1 and S2). Note that we found similar

parameters and model fit for the original specifications based on double-scored items as

found in previous validation studies (Badoud et al., 2015; Fonagy et al., 2016). We explained

above why these models and their respective fit statistics should not be interpreted substantively, as they are based on questionable assumptions.

The internal consistency of the RFQ factor from the unidimensional solution was

estimated at ω = .79/.81 (Sample 1; admission/discharge) and ω = .82 (Sample 2),

respectively. In the following, we omitted Items 4 and 7 in order to improve the internal

consistency of the scale (Spitzer et al., 2021). This decision was based on the observation that Item 7 (“I always know what I feel”) tended to have low factor loadings, whereas Item 3

(“When I get angry, I say things without really knowing why I am saying them”) and Item 4

(“When I get angry, I say things I regret later”) overlap strongly with respect to their content.

The removal of these two items resulted in a 6-item scale with internal consistency of ω = .82

(Sample 1) and ω = .83 (Sample 2). Subsequent results refer to the 6-item version of the RFQ

(i.e., RFQ-6) with low scores supposedly reflecting certainty about mental states and high

scores reflecting uncertainty about mental states. 2 This directionality was chosen because all items of the RFQ-6 are geared towards an endorsement of uncertainty in their original format.

2 The total scores of the RFQ-8 and the reduced RFQ-6 were correlated at .97 (Sample 1) and at .98 (Sample 2),
respectively. It should be emphasized that retaining items 4 and 7 produced highly similar results and equivalent
conclusions for all of the presented analyses.

Associations with Psychopathology. We did not find any evidence for (inverse) U-

shaped relationships between the RFQ-6 and criteria indicative of symptom distress,

personality pathology, or well-being. Although a significant quadratic term was found for

OPD-SQS in Sample 1 and for PID-5-BF detachment in Sample 2, the associations were not

U-shaped but merely indicated a ceiling effect for higher values of the RFQ-6 (see

supplement Figures S3 and S4). The two-lines test did not indicate any U-shaped statistical

associations (see Figures S5 and S6). For the unidimensional RFQ-8, U-shaped associations

were absent as well.

By contrast, the RFQ-6 exhibited substantial linear associations with measures of

psychopathology (see Tables 1 and 2). In Sample 1, the strongest correlations were found

between the RFQ-6 and measures of personality pathology, including the IPO-16 (r = .72)

and the OPD-SQS (r = .65), which were also significantly larger (p < .001, respectively) than

the correlations between the RFQ-6 and other measures, for example as compared to the BSI

(r = .54) and the IIP (r = .54). The bivariate associations between the original scales based on

double-scoring of the RFQ (i.e., RFQ_C and RFQ_U) as well as the RFQ-8 and criterion

measures are reported in the supplement (see Tables S1 and S2).

In Sample 1, the bifactor ESEM (see Panel A of Figure 3) had acceptable fit, χ²(373)

= 1764.83, p < .001, CFI = .94, RMSEA = .07, 90% CI [.06; .07], SRMR = .04. All items of

the IPO-16 and OPD-SQS loaded significantly on a general factor of personality pathology

with standardized loadings ranging from λ = .32 to λ = .71. For the specific factors, target

loadings were in the expected direction (i.e., positive factor loadings) and most of them had

values of λ > .30. (Absolute) non-target loadings were consistently < .30. The standardized

regression coefficient of the general factor was β = .83 (p < .001), explaining 69% of variance

in the RFQ factor. The specific factors of OPD relationship (β = -.17, p < .001), OPD contact

(β = .08, p = .005), OPD self-perception (β = .14, p < .001), IPO reality testing (β = .09, p =

.001), IPO identity diffusion (β = .06, p = .033), and IPO primitive defenses (β = .19, p <

.001) incrementally explained a combined 10% of variance in the RFQ factor. The

unexplained variance in the latent RFQ factor thus amounted to 21%. For the (highly similar)

results using RFQ-8, see Panel A of Figure S7.

In Sample 2, the bifactor ESEM (see Panel B of Figure 3) had good fit, χ²(318) =

743.0, p < .001, CFI = .97, RMSEA = .05, 90% CI [.04; .05], SRMR = .04. All items of the

PID-5-BF loaded significantly on a general factor of personality pathology with standardized

loadings ranging from λ = .31 to λ = .75. For the specific factors, target loadings were in the

expected direction (i.e., positive factor loadings) and most of them had values of λ > .30.

(Absolute) non-target loadings were consistently < .30. The standardized regression

coefficient of the general factor was β = .68 (p < .001), explaining 47% of variance in the

RFQ factor. The specific factors of PID-5-BF negative affect (β = .31, p < .001), PID-5-BF

detachment (β = .09, p = .022), PID-5-BF disinhibition (β = .08, p = .042), and PID-5-BF

psychoticism (β = .32, p < .001) incrementally explained a combined 21% of variance in the

RFQ factor. The unexplained variance in the latent RFQ factor thus amounted to 32%. See

Panel B of Figure S7 for the (highly similar) results using all eight RFQ items.

Discussion

Regarding the factor structure of the RFQ, we found evidence for a unidimensional

construct in a large clinical sample (Sample 1) and a large sample of young adults (Sample

2), thereby corroborating initial findings (Spitzer et al., 2021). We have argued that using a

unidimensional approach in combination with non-linear statistical modeling is consistent

with mentalizing theory and allows for conceptualizing hypo- and hypermentalizing as two

maladaptive poles of a continuum. We tested this notion by means of quadratic regression,

lowess curves, and the two-lines test. Across a broad range of criterion variables, however,

we found no evidence that the RFQ assesses a maladaptive form of having too much certainty

about mental states (i.e., hypermentalizing). Only the uncertainty pole of the RFQ was

associated with poorer mental health. Finally, in line with theoretical expectations with regard

to the mentalizing construct (e.g., Bateman & Fonagy, 2019), the RFQ was more strongly

related to indicators of personality dysfunction than to symptomatic distress.

More specifically, we found that the variance of the RFQ primarily reflected broad

indicators of self-reported personality dysfunction in both samples (i.e., clinical and non-

clinical) using diverse measures (i.e., PID-5-BF, IPO-16, OPD-SQS), whereas comparatively

little variance was unique to the RFQ. On the one hand, this may suggest that the constructs

of mentalizing and personality pathology, albeit conceptually separate, might be so greatly

intertwined that they cannot clearly be distinguished empirically (at least using self-reports).

On the other hand, in light of the observation that various items of the RFQ may tap into

related maladaptive dispositions (e.g., negative urgency, emotional lability, or impulsivity),

the large overlap between the RFQ and dimensions of personality pathology may also signal

caution. In fact, this finding could point to potential problems with respect to the discriminant

validity of the measure. Consistent with this notion, the existing research literature also

provided mixed results for the convergent validity of the RFQ with respect to mentalizing and

related constructs. In the initial validation studies, RFQ_U and RFQ_C exhibited strong

correlations with alexithymia (comparable to those for indicators of personality pathology) but substantially smaller correlations with cognitive empathy and perspective-taking (Badoud et al.,

2015; Fonagy et al., 2016; Morandotti et al., 2018). This might be concerning given that

constructs such as cognitive empathy and perspective-taking are very similar to mentalizing

by definition (e.g., Ickes, 1993), except that the former constructs solely pertain to

understanding others’ mental states. In general, additional examinations of the convergence

of the RFQ with alternative self-report measures of mentalizing are needed.

Study 2

To address the question of whether validity issues of the RFQ contribute to its strong

empirical overlap with indicators of personality pathology, we conducted a second study that

was fully preregistered (see https://osf.io/qr38t). We hypothesized that the item content of the

RFQ reflects both mentalizing and possible consequences of impaired mentalizing, namely,

emotional lability and impulsivity. If the RFQ conflated these constructs, this could

artificially inflate associations between mentalizing as operationalized by the RFQ and

indicators of personality dysfunction, thereby impeding the interpretability of RFQ scores and

the measure’s utility for theory testing. Using commonality analysis (Nimon et al., 2008), we

tested whether and to what extent the associations between the RFQ and indicators of

personality dysfunction are driven by content overlap rather than reflecting the true

association between constructs. Furthermore, we investigated item-level correlations of the

RFQ items with convergent and discriminant measures to examine their nomological

consistency (Thielmann & Hilbig, 2019). For example, nomological inconsistency would be

indicated by some RFQ items correlating more strongly with impulsivity and other RFQ

items correlating more strongly with other measures of mentalizing. Thus, Study 2 included

alternative self-report measures of mentalizing, broad measures of personality pathology, and

assessments of potential confounders, namely, emotional lability and impulsivity. With

regard to alternative measures of mentalizing, we deliberately selected questionnaires that

pertain to the core construct of mentalizing, that is, interpreting the mental states of the self

and others.

Method

Sample 3. We recruited participants from the United States online via the panel

provider Prolific. Data quality was ensured by a series of attention and validity checks.

Participants received minimum wage as compensation. There were no missing data as

individuals were not able to proceed without answering each item. We collected data from N

= 862 participants based on an a priori power analysis (see preregistration). Participants (47% female) ranged in age from 18 to 75 (M = 34.9, SD = 11.7). For more details about

exclusion criteria and sample characteristics, see Note S4.

Measures. In the study, all questionnaires were presented in random order. The RFQ

(Fonagy et al., 2016) and the IPO-16 (Zimmermann et al., 2013) were administered again in

their respective English versions. Based on our findings from Study 1, we used a

unidimensional mean score of the RFQ such that high values reflected uncertainty about

mental states. The analyses were performed for both the RFQ-8 and the RFQ-6. The internal

consistencies of the RFQ-6 (ω = .87), RFQ-8 (ω = .87), and the IPO-16 (α = .90) were good.

Level of Personality Functioning Scale - Brief Form 2.0 (LPFS-BF). The LPFS-BF

(Weekers et al., 2019) is a 12-item self-report measure assessing impairments in the domains

of self-functioning (6 items) and interpersonal functioning (6 items). The items (e.g., “I often

make unrealistic demands on myself”) are answered on a 4-point scale ranging from

completely untrue (1) to completely true (4). High scores on the respective scales indicate

self-dysfunction or interpersonal dysfunction. Internal consistency of the LPFS-BF total score

was estimated at α = .90.

Certainty About Mental States Questionnaire (CAMSQ). The CAMSQ (Müller et al.,

2021) is a 20-item self-report measure of mentalizing that assesses the perceived certainty

associated with making inferences about the mental states of the self (i.e., Self-Certainty) and

others (i.e., Other-Certainty). The items capture affective, cognitive, and motivational content

(e.g., “I understand my feelings”, “I know when other people are hiding their thoughts”) and

are answered on a 7-point frequency scale ranging from never (1) to always (7). High scores

reflect high levels of certainty. Internal consistencies of CAMSQ Self-Certainty (α = .93) and

CAMSQ Other-Certainty (α = .92) were high.



Empathy Quotient (EQ). The EQ (Baron-Cohen & Wheelwright, 2004) is a 40-item

self-report measure of empathy. The 9-item Cognitive Empathy scale was used as a measure

of mentalizing others. Items (e.g., “I am good at predicting how someone will feel”) are rated

on a 4-point scale ranging from strongly disagree (1) to strongly agree (4). Internal

consistency of the EQ Cognitive Empathy scale was estimated at α = .90.

Self-Reflection and Insight Scale (SRIS). The SRIS (Grant et al., 2002) is a 20-item

self-report measure. The 8-item Self-Insight scale of the SRIS was used as a measure of

mentalizing oneself. Items (e.g., “I usually know why I feel the way I do”) are rated on a 6-

point scale ranging from strongly disagree (1) to strongly agree (6). Internal consistency of

the SRIS Self-Insight scale was estimated at α = .88.

UPPS-P Impulsive Behavior Scale (UPPS-P). The UPPS-P (Lynam et al., 2007)

assesses five impulsive personality traits with 59 items. We used the 12-item Negative

Urgency scale that assesses impulsivity in terms of acting rashly under the influence of

negative emotions. Items (e.g., “When I am upset, I often act without thinking”) are endorsed

or rejected on a 4-point scale ranging from disagree strongly (1) to agree strongly (4). Internal

consistency of the UPPS-P Negative Urgency scale was estimated at α = .93.

Difficulties in Emotion Regulation Scale (DERS). The DERS (Gratz & Roemer,

2004) is a 36-item self-report measure that assesses various aspects of emotional

dysregulation. We used the 6-item Impulse Control Difficulties subscale to assess

impulsivity. Items (e.g., “When I’m upset, I become out of control”) are rated on a 5-point

scale ranging from almost never (1) to almost always (5) indicating the frequency of

experiencing the described behavior. Internal consistency of the DERS Impulse Control

Difficulties scale was α = .90.

Personality Inventory for DSM-5 (PID-5). The PID-5 (APA, 2013) assesses

personality pathology according to the AMPD with 220 items. The 7-item Emotional Lability

facet scale was used. Items (e.g., “I am a highly emotional person”) are answered on a 4-point

scale ranging from very false or often false (0) to very true or often true (3). Internal

consistency of the PID-5 Emotional Lability scale was α = .91.

Personality Assessment Inventory – Borderline Features (PAI-BOR). The PAI-BOR

(Morey, 2004) assesses traits associated with borderline personality disorder using 24 items.

We used the six items of the Affective Instability scale to assess emotional lability. Items

(e.g., “My mood can shift quite suddenly”) are rated on a 4-point scale ranging from false,

not at all true (0) to very true (3). Internal consistency of the PAI-BOR Affective Instability

scale was α = .80.

Statistical Analysis. To test the hypothesis of whether the associations between the

RFQ and indicators of personality pathology are driven by content overlap, we conducted a

commonality analysis (Nimon et al., 2008) that estimates the common and unique

contributions of each predictor in predicting a criterion. Commonality analysis involves

estimating a series of multiple regression models considering all possible combinations of

predictors. Specifically, the RFQ mean score is regressed on a broad measure of personality

dysfunction, other measures of mentalizing the self and others, and measures of emotional

lability and impulsivity. We expected that the effect of personality dysfunction on the RFQ

mean score can be partly attributed to variation that both measures share with mentalizing

impairment, while another significant part can be attributed to variation shared with

emotional lability and impulsivity alone (i.e., variation that is non-overlapping with other

measures of mentalizing). For example, the manifest correlation between the RFQ mean

score and indicators of personality dysfunction amounted to an r of around .65 in Study 1,

corresponding to a shared variance of R² = .42. Using commonality analysis, we decomposed

the variance explained in the RFQ by the respective measure of personality dysfunction into

(a) variance that is shared with the respective measures of impulsivity and emotional lability

but not with the respective measures of mentalizing the self and others, (b) variance that is

shared with the respective measures of mentalizing the self and others, and (c) variance

explained uniquely by the respective measure of personality dysfunction. We performed a

permutation test using 5000 random permutations of the dependent variable to compute p-

values for testing the null hypothesis that part (a) equals zero (i.e., that no part of the variance

explained is due to variance shared with emotional lability and impulsivity alone). We tested

this hypothesis with two different sets of measures to facilitate the generalizability of the

results (irrespective of specific measures; see preregistration). To control for performing this

test twice for each set of variables, we considered a p-value less than .025 to be indicative of

statistical significance. Furthermore, we computed bias-corrected bootstrap confidence

intervals for evaluating the precision of the estimate using 5000 bootstrap resamples. The

commonality analysis was conducted via the R package yhat (Nimon et al., 2021).
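
The following R sketch illustrates the underlying logic with R² differences between nested regression models (not the full all-subsets decomposition implemented in yhat); all variable names are hypothetical placeholders for the measures described above.

# Helper: R-squared of a linear model predicting the RFQ mean score
r2 <- function(f) summary(lm(f, data = dat))$r.squared
# Full model: personality dysfunction (lpfs), mentalizing measures (sris, camsq_self,
# camsq_other, eq_cog), and emotional lability/impulsivity (upps_nu, pid5_el)
r2_full    <- r2(rfq6_score ~ lpfs + sris + camsq_self + camsq_other + eq_cog + upps_nu + pid5_el)
r2_no_lpfs <- r2(rfq6_score ~        sris + camsq_self + camsq_other + eq_cog + upps_nu + pid5_el)
# Variance in the RFQ uniquely explained by the personality dysfunction measure
unique_lpfs <- r2_full - r2_no_lpfs
# Variance the dysfunction measure shares with the remaining predictors; the full
# commonality analysis further partitions this common part across all predictor subsets
common_lpfs <- r2(rfq6_score ~ lpfs) - unique_lpfs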

We also examined the pattern of associations between the eight RFQ items and

personality dysfunction, emotional lability, impulsivity, and other measures of mentalizing to

evaluate whether these correlations differ in magnitude. Differential patterns of association

would indicate that the eight RFQ items are not nomologically consistent (Thielmann &

Hilbig, 2019). For the sake of completeness, we again tested the factor models considered in

Study 1.

Results

We provide open data and a script for reproducing the analyses in R

(https://osf.io/stbz5). As in Study 1, the unidimensional CFA model fit the data well, χ²(19) =

272.7, p < .001, CFI = .98, RMSEA = .125, 90% CI [.112; .138], SRMR = .04, whereas two-

dimensional models did not provide parameter estimates in line with the proposed structure

of RFQ_C and RFQ_U (see Figure 2). Measures that were intended to assess the same

constructs (e.g., CAMSQ Self-Certainty and SRIS Self-Insight) showed the expected high

convergent correlations. The RFQ mean score exhibited similarly strong associations with

broad indicators of personality dysfunction, measures of emotional lability and impulsivity,

and measures of mentalizing the self (see Table 3 and Table S3 for the full correlation

matrices including RFQ-6 and RFQ-8). Rather low correlations were observed between the

RFQ mean score and measures of mentalizing others, indicating that the RFQ primarily

pertains to mentalizing the self.

Following our preregistration, we used commonality analysis to estimate what content

(i.e., mentalizing, emotional lability, impulsivity) accounts for the association between the

RFQ and indicators of personality dysfunction. Using the first set of variables, the association

between the RFQ-6 and LPFS-BF (R2 = .46) was decomposed into variance shared with

measures of mentalizing (ΔR2 = .22, 95% CI [.14; .27], p < .001), variance shared with

measures of emotional lability and impulsivity alone (ΔR2 = .20, 95% CI [.16; .30], p < .001),

and variance uniquely explained by LPFS-BF (ΔR2 = .04, 95% CI [.02; .06], p < .001). In the

second set of variables, the association between the RFQ-6 and IPO-16 (R2 = .45) was

decomposed into variance shared with measures of mentalizing (ΔR2 = .33, 95% CI [.24; .38],

p < .001), variance shared with measures of emotional lability and impulsivity alone (ΔR2 =

.07, 95% CI [.04; .18], p < .001), and variance uniquely explained by IPO-16 (ΔR2 = .05, 95%

CI [.03; .07], p < .001). In light of the difference between the two arbitrarily selected sets of

variables, we then conducted the commonality analysis across all 32 possible variable

combinations to account for influences of specific combinations. On average, the association

between the RFQ-6 and broad indicators of personality dysfunction (mean R2 = .45) was

decomposed into variance shared with measures of mentalizing (mean ΔR2 = .27), variance

shared with measures of emotional lability and impulsivity alone (mean ΔR2 = .14), and

variance uniquely explained by measures of personality dysfunction (mean ΔR2 = .04). Thus,

59% of the observed associations between RFQ-6 and indicators of personality dysfunction

were due to variance shared with other measures of mentalizing, whereas 31% were due to

variance shared with measures of emotional lability and impulsivity alone, and 10% were

unique to measures of personality dysfunction. Very similar results were obtained using the

RFQ-8 (see Note S5).
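
One generic way to run the decomposition across all combinations is to enumerate one candidate measure per construct slot and loop over the resulting grid. The sketch below assumes five slots with two candidate measures each (2^5 = 32), which matches the number of combinations reported above; the actual composition of the candidate sets is specified in the preregistration, the column labels are hypothetical, and decompose_rfq() stands in for the commonality-analysis step sketched earlier.

# Hypothetical column labels; decompose_rfq() is a placeholder for the commonality-analysis step.
grid <- expand.grid(
  pd          = c("lpfs", "ipo"),                  # personality dysfunction
  ment_self   = c("camsq_self", "sris_insight"),   # mentalizing the self
  ment_other  = c("camsq_other", "eq_cognitive"),  # mentalizing others
  lability    = c("pid5_lability", "pai_affect"),  # emotional lability
  impulsivity = c("upps_urgency", "ders_impulse"), # impulsivity
  stringsAsFactors = FALSE
)
nrow(grid)  # 32 combinations

results <- lapply(seq_len(nrow(grid)), function(i) {
  decompose_rfq(dat, dv = "rfq6", vars = unlist(grid[i, ]))
})

# Average the three summed components (shared with mentalizing, shared with
# emotional lability/impulsivity alone, unique to personality dysfunction).
colMeans(do.call(rbind, results))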

The item-level analysis demonstrated nomological inconsistencies between the eight

items of the RFQ in terms of significant differences in their correlational patterns (see Table

4). Generally, the item-level associations were most pronounced for UPPS-P Negative

Urgency as a measure of impulsivity and for SRIS Self-Insight as a measure of mentalizing

the self. Thus, we specifically used these two scales for testing differences in the magnitude

of correlations, although the correlational patterns were consistent across the other measures

as well. RFQ Items 3, 4, and 8 correlated significantly more strongly (p < .001) with Negative

Urgency than with Self-Insight. In contrast, Items 1, 2, 5, and 7 correlated significantly more

strongly with Self-Insight than with Negative Urgency (p < .001).

Discussion

In Study 2, we aimed to test the hypothesis that the items of the RFQ conflate content

associated with mentalizing and content associated with assumed consequences of

mentalizing impairment (i.e., emotional lability and impulsivity). The preregistered analyses

provide evidence for this hypothesis and suggest that associations between the RFQ and

measures of personality dysfunction may be inflated by approximately 30% because they

exploit common variance with aspects of emotional lability and impulsivity. These results

point to limitations of the RFQ with regard to its convergent and discriminant validity.

Specifically, the item-level analysis indicated nomological inconsistencies (Thielmann &

Hilbig, 2019) that suggest RFQ Item 3 (“When I get angry I say things without really

knowing why I am saying them”), Item 4 (“When I get angry, I say things that I later regret”),

and Item 8 (“Strong feelings often cloud my thinking”) may be the reason for the conflation

because they converge with negative urgency rather than mentalizing. Study 2 further

illustrates the relevance of assessing mentalizing the self and mentalizing others separately,

as these two dimensions provide unique information. Considering that the RFQ only contains

one item that is clearly geared towards understanding others’ mental states (i.e., Item 1:

“People’s thoughts are a mystery to me”), the RFQ cannot measure mentalizing others with

sufficient fidelity.

General Discussion

The RFQ has been proposed as a short self-report measure of reflective functioning.

In this article we have elaborated our concerns with respect to the instrument’s validity and

the methodology used in prior studies (e.g., Badoud et al., 2015; Fonagy et al., 2016),

particularly in reference to its item content, scoring procedure, dimensionality, and

associations with psychopathology. Using large clinical and non-clinical samples from

Germany and the US, we augmented the critical discussion with new empirical analyses.

First, our findings suggest that the RFQ assesses a unidimensional construct. Consequently,

we recommend refraining from using the originally proposed scoring procedure to derive RFQ_C

and RFQ_U (Fonagy et al., 2016) and instead relying on a unidimensional score using the

original responses to the RFQ items (such as mean scores on the psychometrically optimized

RFQ-6 or the RFQ-8). Second, we have demonstrated that although the RFQ reflects

mentalizing impairments regarding the self, it also exhibits a substantial confound with

emotional lability and impulsivity due to shared item content. Consequently, when the RFQ is

used to address research questions derived from mentalizing theory, observed associations with

other constructs may partly reflect this confound, impeding inferences about mentalizing.

Third, consistent with previous accounts (e.g., de Meulemeester et al., 2018), the present

results further indicate that the RFQ does indeed capture a maladaptive form of having too

little certainty about mental states (i.e., hypomentalizing) but appears to be unable to capture

a maladaptive form of having too much certainty about mental states (i.e., hypermentalizing)

as its certainty pole does not exhibit positive associations with negative outcomes. One

reason for this could be that the RFQ does not measure variation in hypermentalizing with

sufficient reliability on the certainty pole of the continuum (e.g., because its items are

formulated with reference to uncertainty).

It should be noted that the present empirical findings are limited to the English and the

German versions of the RFQ and might not necessarily generalize to other languages.

However, our concerns about the item content and the scoring procedure apply to the various

translations as well. Moreover, our conclusions cannot be generalized to the long forms of the

RFQ (e.g., Euler et al., 2021), although it should be noted that, to date, no validation studies

have been published for these forms and they have seldom been used in previous research.

Finally, the current study is subject to the limitation that it did not focus on the primary target

population that is assumed to show particularly severe impairments in mentalizing capacity,

that is, individuals with borderline personality pathology (Luyten et al., 2020).

Mentalizing is arguably an important psychological construct with great relevance for

psychopathology, personality, and psychotherapy research (APA, 2013; Bender et al., 2011).

This calls for a valid and economical self-report assessment of the construct. Although the RFQ

has been rather broadly accepted by the field and is used in a growing body of research, we

have argued that the validity evidence is not compelling. First, in our view, self-report

assessments of mentalizing should adhere more closely to the specific core of the construct

(i.e., inferring mental states) rather than emphasizing hypothetical consequences of impaired

mentalizing (e.g., impulsivity); this would avoid conflating distinct constructs. Second,

both hypo- and hypermentalizing should ideally be captured by measures of mentalizing.

Indeed, demonstrating that these two maladaptive variants of mentalizing impairment can be

assessed by self-report questionnaires is an important empirical test in itself that would



increase confidence in construct validity. For example, the CAMSQ (Müller et al., 2021) was

recently introduced as a measure that focuses on the core definition of inferring mental states

of the self and others provided by Fonagy and colleagues (2016). Initial results for the

CAMSQ suggest that it captures maladaptive levels of both too little or too much certainty

about mental states that could be interpreted as forms of hypo- and hypermentalizing.

Conclusion

The RFQ is regularly used to study the concept of mentalizing. Herein, we have

outlined critical considerations regarding the validity of the RFQ and provided empirical

evidence to support the critique. Findings indicate that the RFQ is a unidimensional measure

that reflects hypomentalizing pertaining to the self but is also conflated with content related

to emotional lability and impulsivity. Thus, researchers and mental health professionals alike

should be rather cautious in using the RFQ for theory testing and individual assessment.

References

American Psychiatric Association. (2013). Diagnostic and statistical manual of mental

disorders (DSM-5®). American Psychiatric Pub.

Back, S., Zettl, M., Bertsch, K., & Taubner, S. (2020). Persönlichkeitsniveau, maladaptive

Traits und Kindheitstraumata. Psychotherapeut, 65, 374–382.

https://doi.org/10.1007/s00278-020-00445-7

Badoud, D., Luyten, P., Fonseca-Pedrero, E., Eliez, S., Fonagy, P., & Debbané, M. (2015).

The French version of the Reflective Functioning Questionnaire: Validity data for

adolescents and adults and its association with non-suicidal self-injury. PLoS ONE, 10,

e0145892. https://doi.org/10.1371/journal.pone.0145892

Badoud, D., Prada, P., Nicastro, R., Germond, C., Luyten, P., Perroud, N., & Debbané, M.

(2018). Attachment and reflective functioning in women with borderline personality

disorder. Journal of Personality Disorders, 32, 17–30.

https://doi.org/10.1521/pedi_2017_31_283

Baron-Cohen, S., & Wheelwright, S. (2004). The empathy quotient: An investigation of

adults with Asperger syndrome or high functioning autism, and normal sex

differences. Journal of Autism and Developmental Disorders, 34, 163–175.

https://doi.org/10.1023/b:jadd.0000022607.19833.00

Bateman, A., & Fonagy, P. (2016). Mentalization-based treatment for personality disorders:

A practical guide. Oxford, UK: Oxford Univ. Press.

Bateman, A. W., & Fonagy, P. (Eds.). (2019). Handbook of mentalizing in mental health

practice. American Psychiatric Publishing, Inc.

Bender, D. S., Morey, L. C., & Skodol, A. E. (2011). Toward a model for assessing level of

personality functioning in DSM-5, part I: A review of theory and methods. Journal of

Personality Assessment, 93, 332–346. https://doi.org/10.1080/00223891.2011.583808



Carver, C. S., Johnson, S. L., & Timpano, K. R. (2017). Toward a functional view of the p

factor in psychopathology. Clinical Psychological Science, 5, 880–889.

https://doi.org/10.1177/2167702617710037

Cyders, M. A., & Smith, G. T. (2008). Emotion-based dispositions to rash action: Positive

and negative urgency. Psychological Bulletin, 134, 807–828.

https://doi.org/10.1037/a0013341

de Meulemeester, C., Lowyck, B., Vermote, R., Verhaest, Y., & Luyten, P. (2017).

Mentalizing and interpersonal problems in borderline personality disorder: The

mediating role of identity diffusion. Psychiatry Research, 258, 141–144.

https://doi.org/10.1016/j.psychres.2017.09.061

de Meulemeester, C., Vansteelandt, K., Luyten, P., & Lowyck, B. (2018). Mentalizing as a

mechanism of change in the treatment of patients with borderline personality disorder:

A parallel process growth modeling approach. Personality Disorders: Theory,

Research, and Treatment, 9, 22–29. https://doi.org/10.1037/per0000256

Ehrenthal, J. C., Dinger, U., Schauenburg, H., Horsch, L., Dahlbender, R. W., & Gierk, B.

(2015). Entwicklung einer Zwölf-Item-Version des OPD-Strukturfragebogens (OPD-

SFK). Zeitschrift für Psychosomatische Medizin und Psychotherapie, 61, 262–274.

https://doi.org/10.13109/zptm.2015.61.3.262

Euler, S., Nolte, T., Constantinou, M., Griem, J., Montague, P. R., Fonagy, P., & Personality

and Mood Disorders Research Network. (2021). Interpersonal problems in borderline

personality disorder: Associations with mentalizing, emotion regulation, and

impulsiveness. Journal of Personality Disorders, 35, 177–193.

https://doi.org/10.1521/pedi_2019_33_427

Flora, D. B. (2020). Your coefficient alpha is probably wrong, but which coefficient omega is

right? A tutorial on using R to obtain better reliability estimates. Advances in Methods



and Practices in Psychological Science, 3, 484–501.

https://doi.org/10.1177/2515245920951747

Fonagy, P., Luyten, P., Allison, E., & Campbell, C. (2017). What we have changed our minds

about: Part 1. Borderline personality disorder as a limitation of resilience. Borderline

Personality Disorder and Emotion Dysregulation, 4, 1–11.

https://doi.org/10.1186/s40479-017-0061-9

Fonagy, P., Luyten, P., Moulton-Perkins, A., Lee, Y.-W., Warren, F., Howard, S., Ghinai, R.,

Fearon, P., & Lowyck, B. (2016). Development and validation of a self-report

measure of mentalizing: The Reflective Functioning Questionnaire. PLoS ONE, 11,

e0158678. https://doi.org/10.1371/journal.pone.0158678

Franke, H. (2000). The Brief Symptom Inventory – Deutsche Version. Manual. Göttingen:

Beltz.

Grant, A. M., Franklin, J., & Langford, P. (2002). The self-reflection and insight scale: A

new measure of private self-consciousness. Social Behavior and Personality: An

International Journal, 30, 821–835. https://doi.org/10.2224/sbp.2002.30.8.821

Gratz, K. L., & Roemer, L. (2004). Multidimensional assessment of emotion regulation and

dysregulation: Development, factor structure, and initial validation of the difficulties

in emotion regulation scale. Journal of Psychopathology and Behavioral Assessment,

26, 41–54. https://doi.org/10.1023/B:JOBA.0000007455.08539.94

Horowitz, L. M., Alden, L. E., Kordy, H., & Strauß, B. (2000). Inventar zur Erfassung

interpersonaler Probleme: Deutsche Version; IIP-D. Beltz-Test.

Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure

analysis: Conventional criteria versus new alternatives. Structural Equation

Modeling: A Multidisciplinary Journal, 6, 1–55.

https://doi.org/10.1080/10705519909540118

Huang, Y. L., Fonagy, P., Feigenbaum, J., Montague, P. R., Nolte, T., & Mood Disorder

Research Consortium. (2020). Multidirectional pathways between attachment,

mentalizing, and posttraumatic stress symptomatology in the context of childhood

trauma. Psychopathology, 53, 48–58. https://doi.org/10.1159/000506406

Ickes, W. (1993). Empathic accuracy. Journal of Personality, 61, 587–610.

https://doi.org/10.1111/j.1467-6494.1993.tb00783.x

King, K. M., Feil, M. C., & Halvorson, M. A. (2018). Negative urgency is correlated with the

use of reflexive and disengagement emotion regulation strategies. Clinical

Psychological Science, 6, 822–834. https://doi.org/10.1177/2167702618785619

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief

depression severity measure. Journal of General Internal Medicine, 16, 606–613.

https://doi.org/10.1046/j.1525-1497.2001.016009606.x

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new

measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine,

64, 258–266.

Li, E. T., Carracher, E., & Bird, T. (2020). Linking childhood emotional abuse and adult

depressive symptoms: The role of mentalizing incapacity. Child Abuse & Neglect, 99,

104253. https://doi.org/10.1016/j.chiabu.2019.104253

Liu, Y., Millsap, R. E., West, S. G., Tein, J. Y., Tanaka, R., & Grimm, K. J. (2017). Testing

measurement invariance in longitudinal data with ordered-categorical measures.

Psychological Methods, 22, 486–506.

https://psycnet.apa.org/doi/10.1037/met0000075

Luyten, P., Campbell, C., Allison, E., & Fonagy, P. (2020). The mentalizing approach to

psychopathology: State of the art and future directions. Annual Review of Clinical

Psychology, 16, 297–325. https://doi.org/10.1146/annurev-clinpsy-071919-015355



Lynam, D. R., Smith, G. T., Cyders, M. A., Fischer, S., & Whiteside, S. P. (2007). The

UPPS-P questionnaire measure of five dispositions to rash action. Unpublished

technical report, Purdue University.

Morandotti, N., Brondino, N., Merelli, A., Boldrini, A., De Vidovich, G. Z., Ricciardo, S.,

Abbiati, V., Ambrosi, P., Cavercasi, E., Fonagy, P., & Luyten, P. (2018). The Italian

version of the Reflective Functioning Questionnaire: Validity data for adults and its

association with severity of borderline personality disorder. PLoS ONE, 13, e0206433.

https://doi.org/10.1371/journal.pone.0206433

Morey, L. C. (2004). The Personality Assessment Inventory (PAI). Lawrence Erlbaum

Associates Publishers.

Morin, A. J., Myers, N. D., & Lee, S. (2020). Modern Factor Analytic Techniques: Bifactor

Models, Exploratory Structural Equation Modeling (ESEM), and Bifactor-ESEM.

Handbook of Sport Psychology, 51, 1044–1073.

https://doi.org/10.1002/9781119568124.ch51

Müller, S., Wendt, L. P., & Zimmermann, J. (2021, May 19). Development and Validation of

the Certainty About Mental States Questionnaire (CAMSQ): A Self-Report Measure

of Mentalizing Oneself and Others. https://doi.org/10.31234/osf.io/jtc3s

Muthén, L. K., & Muthén, B. O. (1998-2019). Mplus User’s Guide. 8th Edition. Los Angeles,

CA: Muthén & Muthén.

Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute

commonality coefficients in the multiple regression case: An introduction to the

package and a practical example. Behavior Research Methods, 40, 457–466.

https://doi.org/10.3758/BRM.40.2.457

Nimon, K., Oswald, F., & Roberts, J. K. (2021). yhat: Interpreting Regression Effects. R

package version 2.0-3. https://CRAN.R-project.org/package=yhat



R Core Team. (2020). R: A language and environment for statistical computing. R

Foundation for Statistical Computing, Vienna, Austria.

Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate

Behavioral Research, 47, 667–696. https://doi.org/10.1080/00273171.2012.715555

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of

Statistical Software, 48, 1–36. http://www.jstatsoft.org/v48/i02/

Settles, R. E., Fischer, S., Cyders, M. A., Combs, J. L., Gunn, R. L., & Smith, G. T. (2012).

Negative urgency: A personality predictor of externalizing behavior characterized by

neuroticism, low conscientiousness, and disagreeableness. Journal of Abnormal

Psychology, 121, 160–172. https://doi.org/10.1037/a0024948

Simonsohn, U. (2018). Two lines: A valid alternative to the invalid testing of U-shaped

relationships with quadratic regressions. Advances in Methods and Practices in

Psychological Science, 1, 538–555. https://doi.org/10.1177/2515245918805755

Spitzer, C., Zimmermann, J., Brähler, E., Euler, S., Wendt, L. P., & Müller, S. (2021). Die

deutsche Version des Reflective Functioning Questionnaire (RFQ): Eine

teststatistische Überprüfung in der Allgemeinbevölkerung. Psychotherapie -

Psychosomatik - Medizinische Psychologie, 71, 124–131. https://doi.org/10.1055/a-

1234-6317

Thielmann, I., & Hilbig, B. E. (2019). Nomological consistency: A comprehensive test of the

equivalence of different trait indicators for the same constructs. Journal of

Personality, 87, 715–730. https://doi.org/10.1111/jopy.12428

Weekers, L. C., Hutsebaut, J., & Kamphuis, J. H. (2019). The Level of Personality

Functioning Scale-Brief Form 2.0: Update of a brief instrument for assessing level of

personality functioning. Personality and Mental Health, 13, 3–14.

https://doi.org/10.1002/pmh.1434

World Health Organization. (1998). Wellbeing measures in primary health care / The

Depcare Project. WHO Regional Office for Europe: Copenhagen.

Zimmermann, J., Altenstein, D., Krieger, T., Grosse Holtforth, M., Pretsch, J., Alexopoulos,

J., Spitzer, C., Benecke, C., Krueger, R. F., Markon, K. E., & Leising, D. (2014). The

structure and correlates of self-reported DSM-5 maladaptive personality traits:

Findings from two German-speaking samples. Journal of Personality Disorders, 28,

518–540. https://doi.org/10.1521/pedi_2014_28_130

Zimmermann, J., Benecke, C., Hörz, S., Rentrop, M., Peham, D., Bock, A., Wallner, T.,

Schauenburg, H., Frommer, J., Huber, D., Clarkin, J. F., & Dammann, G. (2013).

Validierung einer deutschsprachigen 16-Item-Version des Inventars der

Persönlichkeitsorganisation (IPO-16). Diagnostica, 59, 3–16.

https://doi.org/10.1026/0012-1924/a000076

Zimmermann, J., Müller, S., Bach, B., Hutsebaut, J., Hummelen, B., & Fischer, F. (2020). A

common metric for self-reported severity of personality disorder. Psychopathology,

53, 161–171. https://doi.org/10.1159/000507377



Table 1
Bivariate Correlations in Sample 1 at Admission

RFQ-6 BSI PHQ-15 PHQ-9 IIP-32 IPO-16 OPD-SQS

BSI .54

PHQ-15 .26 .59

PHQ-9 .41 .76 .47

IIP-32 .54 .63 .32 .51

IPO-16 .72 .59 .33 .46 .60

OPD-SQS .65 .72 .41 .59 .68 .69

WHO-5 -.22 -.51 -.29 -.65 -.35 -.22 -.38

Note. N = 861. All correlations are statistically significant at p < .001.



Table 2
Bivariate Correlations in Sample 2

RFQ-6 Total NEG DET ANT DIS

PID-5-BF Total Score .68

Negative Affectivity (NEG) .58 .73

Detachment (DET) .47 .72 .43

Antagonism (ANT) .32 .65 .29 .29

Disinhibition (DIS) .45 .73 .40 .36 .43

Psychoticism (PSY) .63 .82 .52 .52 .44 .50

Note. N = 566. All correlations are statistically significant at p < .001.



Table 3
Bivariate Correlations in Sample 3

RFQ-6 (1) (2) (3) (4) (5) (6) (7) (8) (9)

(1) LPFS-BF .68

(2) IPO-16 .67 .66

(3) CAMSQ Self-Certainty -.50 -.51 -.38

(4) SRIS Self-Insight -.67 -.60 -.60 .68

(5) CAMSQ Other-Certainty -.25 -.23 -.09 .56 .28

(6) EQ Cognitive Empathy -.26 -.27 -.11 .42 .28 .73

(7) PID-5 Emotional Lability .60 .59 .61 -.36 -.53 -.08 -.08

(8) PAI-BOR Affective Instability .61 .67 .56 -.42 -.53 -.16 -.17 .72

(9) UPPS-P Negative Urgency .71 .63 .59 -.43 -.54 -.18 -.16 .62 .69

(10) DERS Impulse Control Difficulties .63 .65 .61 -.41 -.57 -.13 -.15 .66 .69 .68

Note. N = 862. All correlations are statistically significant at p < .05.



Table 4
Bivariate Correlations Between the Items of the RFQ and Further Measures in Sample 3

RFQ-8

Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8

(1) LPFS-BF .40 .60 .51 .43 .45 .59 .41 .52

(2) IPO-16 .39 .54 .51 .42 .48 .57 .22 .54

(3) CAMSQ Self-Certainty -.27 -.50 -.38 -.29 -.31 -.49 -.58 -.38

(4) SRIS Self-Insight -.41 -.65 -.50 -.39 -.42 -.62 -.51 -.52

(5) CAMSQ Other-Certainty -.31 -.23 -.15 -.17 -.11 -.22 -.30 -.16

(6) EQ Cognitive Empathy -.36 -.25 -.15 -.14 -.11 -.23 -.30 -.12

(7) PID-5 Emotional Lability .26 .47 .48 .40 .42 .49 .24 .56

(8) PAI-BOR Affective Instability .24 .50 .50 .50 .41 .53 .34 .54

(9) UPPS-P Negative Urgency .25 .56 .66 .68 .51 .62 .31 .59

(10) DERS Impulse Control Difficulties .29 .50 .58 .51 .41 .54 .30 .52

Note. N = 862. All correlations are statistically significant at p < .001. The strongest correlation for each RFQ item on a

descriptive level is highlighted in bold.



Figure 1

Empirical Distributions of Item 6 Before and After Applying the Scoring Procedure Outlined

in Fonagy et al. (2016)

Note. Relative frequencies (in %) of responses to Item 6 (“Sometimes I do things without

really knowing why”) in Sample 1. Panel A shows the univariate distribution of raw scores on

the 7-point scale, Panel B shows the univariate distributions of the rescaled scores on the 4-

point scales, and Panel C shows the bivariate distribution of the rescaled scores.
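
To make the rescaling concrete, the following sketch recodes a hypothetical vector item6 of raw 7-point responses (anchors assumed to run from 1 = strongly disagree to 7 = strongly agree) into the two 4-point scores shown in Panels B and C. The thresholds shown here are one plausible implementation of the rescaling and should be checked against the scoring key published by Fonagy et al. (2016).

# item6: hypothetical vector of raw responses on the 7-point scale
rescale_certainty   <- function(x) pmax(4 - x, 0)  # 1,2,3 -> 3,2,1; 4-7 -> 0 (RFQ_C direction)
rescale_uncertainty <- function(x) pmax(x - 4, 0)  # 5,6,7 -> 1,2,3; 1-4 -> 0 (RFQ_U direction)

rfq_c_item6 <- rescale_certainty(item6)
rfq_u_item6 <- rescale_uncertainty(item6)

prop.table(table(rfq_c_item6)) * 100                # Panel B, RFQ_C (relative frequencies in %)
prop.table(table(rfq_u_item6)) * 100                # Panel B, RFQ_U
prop.table(table(rfq_c_item6, rfq_u_item6)) * 100   # Panel C, bivariate distribution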

Figure 2
Standardized Parameter Estimates of the Factor Models in all Samples
[Panels, left to right: Clinical Sample 1 (GER); Non-Clinical Sample 2 (GER); Non-Clinical Sample 3 (US)]

Note. All estimates are standardized. Loadings smaller than .30 are grayed out. In Sample 1,
only the parameter estimates for admission are displayed. 1 = Unidimensional CFA model,
2 = Two-dimensional CFA model, 3 = Two-dimensional EFA model; RFQ_C = Certainty
About Mental States; RFQ_U = Uncertainty About Mental States; η = RFQ factor.

Figure 3
Bifactor Exploratory Structural Equation Model in Sample 1 at Admission (Panel A) and
Sample 2 (Panel B)

Note. All estimates are standardized. Intercepts, thresholds, and non-target loadings are not
displayed. Non-target loadings were all < .30. Target loadings < .30 are indicated by dashed lines.
o1-o12 = OPD-SQS items; i1-i16 = IPO-16 items; r1-r6 = RFQ-6 items; p1-p25 = PID-5-BF items;
OPD RS = Relationship; OPD CT = Conflict; OPD SP = Self-Perception; IPO RT = Reality-
Testing; IPO PD = Primitive Defenses; IPO ID = Identity Diffusion; NEG = Negative Affectivity;
DET = Detachment; ANT = Antagonism; DIS = Disinhibition; PSY = Psychoticism; gPD = General
Personality Pathology.
*p < .05

Online Supplement

Note S1
Further Information About Sample 1

In Sample 1, participants varied with regard to their educational background with 18%
holding a university degree, 30% holding the higher education entrance qualification (i.e., the
highest German school degree; “Abitur”), and 48% reporting 10 years of schooling or less.
Patients were diagnosed using clinical judgements based on the tenth edition of the
International Classification of Diseases (ICD-10; World Health Organization [WHO], 2004).
The most frequent group of diagnoses was that of depressive disorders (93%), followed by
anxiety disorders (42%), personality disorders (39%), substance use disorders (20%), and
somatoform disorders (18%). Personality disorders (PD) included borderline PD (11%),
avoidant PD (11%), dependent PD (2%), and PD not otherwise specified (15%). As is
commonly found, patients exhibited a high level of comorbidity (82% more than one
diagnosis, 54% more than two, 21% more than three, 5% more than four).

Note S2
Further Information About Sample 2

In Sample 2, participants had a rather high level of education with 41% of participants
holding a university degree and an additional 37% holding the “Abitur” certificate.
Participants indicated via self-report whether they were acutely suffering from a mental health
condition. Twenty percent of the participants reported being affected by any mental disorder.
Of these, more than half indicated an affective disorder (52%), followed by borderline
personality disorder (16%) and post-traumatic stress disorder (9%). As the university’s
website might also be a reference source for psychiatric inpatients of the university hospital
that specializes in the treatment of severe trauma, it is possible that such inpatients were
attracted to participate in the study via this route.

Note S3
Bifactor ESEM Model Specification

In an exploratory bifactor measurement model (i.e., bifactor EFA), all indicators should
load on a general factor and, at the same time, on at least one of multiple (i.e., at least two)
specific factors. In contrast to classic bifactor models (i.e., bifactor CFA; Holzinger &
Swineford, 1937), indicators are allowed to load on multiple specific factors. The bifactor
ESEM used here involves a unidimensional confirmatory measurement model for the RFQ and
an exploratory measurement model with orthogonal target rotation for indicators of personality
dysfunction. In Sample 1, the indicators of IPO-16 and OPD-SQS were target rotated such that
the loadings of indicators were targeted towards a general factor reflecting general personality
dysfunction (gPD) and were also targeted towards their respective specific factors in alignment
with the six assumed content domains or scales of the measures (i.e., IPO-16: identity diffusion,
primitive defenses, and reality testing; Zimmermann et al., 2013; OPD-SQS: self-perception,
contact, and relationship; Ehrenthal et al., 2015). In Sample 2, the indicators of PID-5-BF were
target rotated in the same vein towards a general factor (gPD) and the five PID domains as
specific factors (i.e., negative affectivity, antagonism, detachment, disinhibition, psychoticism;
APA, 2013).

American Psychiatric Association (2013). Diagnostic and statistical manual of mental disorders (DSM-5®).
American Psychiatric Pub.
Ehrenthal, J. C., Dinger, U., Schauenburg, H., Horsch, L., Dahlbender, R. W., & Gierk, B. (2015). Entwicklung
einer Zwölf-Item-Version des OPD-Strukturfragebogens (OPD-SFK) [Development of a 12-item version
of the OPD-Structure Questionnaire (OPD-SQS)]. Zeitschrift für Psychosomatische Medizin und
Psychotherapie, 61, 262–274. https://doi.org/10.13109/zptm.2015.61.3.262
Holzinger, K. J., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2, 41–54.
Zimmermann, J., Benecke, C., Hörz, S., Rentrop, M., Peham, D., Bock, A., Wallner, T., Schauenburg, H.,
Frommer, J., Huber, D., Clarkin, J. F., & Dammann, G. (2013). Validierung einer deutschsprachigen 16-
Item-Version des Inventars der Persönlichkeitsorganisation (IPO-16) [Validity of a German 16-item-
version of the Inventory of Personality Organization (IPO-16)]. Diagnostica, 59, 3–16.
https://doi.org/10.1026/0012-1924/a000076

Note S4
Further Information About Sample 3

To identify careless respondents and thereby ensure data quality, we implemented a


number of validity checks into the data collection process. Whereas attention checks are
useful to detect careless responding in general, language fluency tests specifically target
participants who gain access to the study by misrepresenting their language proficiency or place of
residence. We employed two instructed response items that required participants to mark a
specific response option that was clearly stated in the item instruction (e.g., “If you are paying
attention, mark ‘strongly agree’. Otherwise, you will be disqualified”). Thirty-four
participants failed at least one of the two instructed response items and were excluded. In
addition, we employed a language fluency check to ensure that participants met the
participation requirement in terms of language proficiency (i.e., being fluent in English). We
used an open-ended question at the end of the study that tasked participants with providing a
statement on an opinion question. Four participants were excluded for failing to provide a
proper statement.
In Sample 3, participants varied with regard to their educational level (e.g., 64% with a
bachelor’s degree or higher) and occupational status (e.g., 57% employed for wages).
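
A minimal sketch of the exclusion logic described above is given below; the column names attn1, attn2, and fluency_text are hypothetical, and the open-ended fluency responses were in fact screened for a proper statement rather than merely for being non-empty.

# Hypothetical column names; attn1/attn2 hold responses to the two instructed response items.
required_option <- "strongly agree"   # the option stated in the item instruction

keep <- dat$attn1 == required_option &
  dat$attn2 == required_option &
  nchar(trimws(dat$fluency_text)) > 0   # crude placeholder for the manual fluency screening

dat_clean <- dat[keep, ]
sum(!keep)   # number of excluded participants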

Note S5
Commonality Analysis Using the RFQ-8

Using the first set of variables, the association between the RFQ-8 and LPFS-BF (R2 =
.48) was decomposed into variance shared with measures of mentalizing (ΔR2 = .25, 95% CI
[.16; .30], p < .001), variance shared with measures of emotional lability and impulsivity
alone (ΔR2 = .20, 95% CI [.15; .32], p < .001), and variance uniquely explained by LPFS-BF
(ΔR2 = .03, 95% CI [.02; .05], p < .001). In the second set of variables, the association
between the RFQ-8 and IPO-16 (R2 = .42) was decomposed into variance shared with
measures of mentalizing (ΔR2 = .32, 95% CI [.23; .37], p < .001), variance shared with
measures of emotional lability and impulsivity alone (ΔR2 = .06, 95% CI [.04; .18], p < .001),
and variance uniquely explained by IPO-16 (ΔR2 = .03, 95% CI [.01; .05], p < .001). On
average, across all 32 combinations of predictors, the association between the RFQ-8 and
broad indicators of personality dysfunction (mean R2 = .44) was decomposed into variance
shared with measures of mentalizing (mean ΔR2 = .28), variance shared with measures of
emotional lability and impulsivity alone (mean ΔR2 = .13), and variance uniquely explained
by measures of personality dysfunction (mean ΔR2 = .03). Thus, 63% of the observed associations between RFQ-8 and
indicators of personality dysfunction were due to variance shared with other measures of
mentalizing, whereas 30% were due to variance shared with measures of emotional lability
and impulsivity alone and 7% were unique to measures of personality dysfunction.

Table S1
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Criterion Measures in Sample 1 at Admission

RFQ-8 RFQ_C RFQ_U BSI PHQ-15 PHQ-9 IIP-32 IPO-16 OPD-SFK

RFQ_C -.92

RFQ_U .80 -.65

BSI .55 -.46 .53

PHQ-15 .26 -.23 .25 .59

PHQ-9 .42 -.34 .43 .76 .47

IIP-32 .54 -.48 .48 .63 .32 .51

IPO-16 .72 -.68 .62 .59 .33 .46 .60

OPD-SFK .65 -.57 .60 .72 .41 .59 .68 .69

WHO-5 -.23 .18 -.24 -.51 -.29 -.65 -.35 -.22 -.38

Note. N = 861. All p < .001.



Table S2
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Criterion Measures in Sample 2

RFQ-8 RFQ_C RFQ_U Total NEG DET ANT DIS

RFQ_C -.91

RFQ_U .88 -.64

PID-5-BF Total Score .67 -.61 .60

Negative Affectivity (NEG) .60 -.52 .56 .73

Detachment (DET) .46 -.43 .39 .72 .43

Antagonism (ANT) .30 -.32 .23 .65 .29 .29

Disinhibition (DIS) .44 -.40 .41 .73 .40 .36 .43

Psychoticism (PSY) .62 -.56 .59 .82 .52 .52 .44 .50

Note. N = 566. All p < .001.



Table S3
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Further Measures in Sample 3

RFQ-8 (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)

(1) RFQ_C -.92

(2) RFQ_U .81 -.62

(3) LPFS-BF .69 -.61 .55

(4) IPO-16 .64 -.60 .53 .66

(5) CAMSQ Self-Certainty -.54 .54 -.44 -.51 -.38

(6) SRIS Self-Insight -.68 .65 -.63 -.60 -.60 .68

(7) CAMSQ Other-Certainty -.28 .31 .16 -.23 -.09 .56 .28

(8) EQ Cognitive Empathy -.28 .29 -.18 -.27 -.11 .42 .28 .73

(9) PID-5 Emotional Lability .59 -.49 .53 .59 .61 -.36 -.53 -.08 -.08

(10) PAI-BOR Affective Instability .63 -.53 .56 .67 .56 -.42 -.53 -.16 -.17 .72

(11) UPPS-P Negative Urgency .74 -.69 .59 .63 .59 -.43 -.54 -.18 -.16 .62 .69

(12) DERS Impulse Control Difficulties .65 -.55 .57 .65 .61 -.41 -.57 -.13 -.15 .66 .69 .68

Note. N = 862. All correlations are statistically significant at p < .05.



Figure S1
Originally Proposed Two-Dimensional CFA Model Using Double-Scoring in Sample 1 at Admission

Note. N = 861. χ²(51) = 490.05, CFI = .95, RMSEA = .10, SRMR = .15. ω = .88/.73 (RFQ_C/RFQ_U).
The model was estimated using the WLSMV estimator. Correlated errors were specified following
recommendations by Fonagy et al. (2016). Intercepts, residual variances, and thresholds are not displayed.

Figure S2
Originally Proposed Two-Dimensional CFA Model Using Double-Scoring in Sample 2

Note. N = 566. χ²(51) = 313.46, CFI = .95, RMSEA = .10, SRMR = .15. ω = .88/.78 (RFQ_C/RFQ_U).
The model was estimated using the WLSMV estimator. Correlated errors were specified following
recommendations by Fonagy et al. (2016). Intercepts, residual variances, and thresholds are not displayed.

Figure S3
Lowess-Smoothed Regression Curves of the Association Between the RFQ-6 and Criterion Measures in Sample 1 at Admission

Figure S4
Lowess-Smoothed Regression Curves of the Association Between the RFQ-6 and Criterion Measures in Sample 2

Figure S5
Two-Lines Test of the Association Between the RFQ-6 and Criterion Measures in Sample 1 at Admission

Figure S6
Two-Lines Test of the Association Between the RFQ-6 and Criterion Measures in Sample 2

Figure S7
Bifactor Exploratory Structural Equation Model in Sample 1 at Admission (Panel A) and
Sample 2 (Panel B) Using all Eight Items of the RFQ

Note. All estimates are standardized. Intercepts, thresholds, and non-target loadings are not
displayed. Non-target loadings were all < .30. Target loadings < .30 are indicated by dashed lines.
Variable names are the same as in Figure 3. In Sample 1, model fit was CFI = .94, RMSEA = .06,
SRMR = .04. In Sample 2, model fit was CFI = .96, RMSEA = .05, SRMR = .04.
*p < .05
