Critical Evaluation of The RFQ - Accepted Manuscript
This article has been accepted for publication in Journal of Personality Assessment,
Sascha Müller1*, Leon P. Wendt1*, Carsten Spitzer2, Oliver Masuhr3, Sarah N. Back4,
1 University of Kassel, Department of Psychology, Kassel, Germany
2 Rostock University Medical Center, Rostock, Germany
3 Asklepios Fachklinikum Tiefenbrunn, Rosdorf, Germany
4 Ludwig-Maximilians-Universität München, Munich, Germany
Authors’ note: Ethics committee approval was obtained for data collection (Ethics Committee
Behavioural and Cultural Studies, University of Heidelberg, #Tau 2019 1/1). The first authors
contributed equally to this research. The data used in Study 1 are available from the
corresponding authors upon reasonable request. Data and R code for reproducing the analyses
kassel.de, [email protected].
A CRITICAL EVALUATION OF THE RFQ 2
Abstract
hypermentalizing. Despite its broad acceptance by the field, we argue that the validity of the
measure is not well-established. The current research elaborates on problems of the RFQ
related to its item content, scoring procedure, dimensionality, and associations with
psychopathology. We tested these considerations across three large clinical and non-clinical
samples from Germany and the US (total N = 2289). In a first study, we found that the RFQ
may assess a single latent dimension related to hypomentalizing but is rather unlikely to
capture maladaptive forms of hypermentalizing. Moreover, the RFQ exhibited very strong
symptom distress were less strong. In a second preregistered study focused on convergent and
indicators of personality pathology are inflated because some of the RFQ items tap into
emotional lability and impulsivity rather than mentalizing. Our findings demonstrate
limitations of the RFQ. We discuss key challenges in assessing mentalizing via self-report.
adequately interpret mental states of both the self and others (i.e., reflective functioning or
mentalizing) via self-report. The measure as well as its translated versions have been
positively evaluated in several validation studies (Badoud et al., 2015; Fonagy et al., 2016;
Morandotti et al., 2018), leading to the consensual conclusion that the RFQ is able to capture
deficits in reflective functioning. According to mentalizing theory (e.g., Bateman & Fonagy,
2016), the theoretical spectrum of reflective functioning includes hypo- and hypermentalizing
(i.e., too little or too much certainty about one’s interpretation of mental states) as well as
genuine mentalizing (the optimal trait level, i.e., acknowledging the opaqueness of mental
states). Deviations from the optimal trait level in both directions are presumed to be a sign, a
symptom, and a transdiagnostic risk factor of psychopathology (e.g., Luyten et al., 2020) and
In the RFQ, respondents are asked to endorse or reject eight statements that are
really knowing why”) on a 7-point scale ranging from do not agree at all (= 1) to agree
completely (= 7) (see Panel A of Figure 1). The creators implemented a procedure that
involves scoring half of the items twice (i.e., double-scoring) and splitting the underlying
information into two scales (see Panel B of Figure 1): certainty about mental states (RFQ_C;
According to this logic, a strong rejection of Item 6 on the original scale (= 1) is scored in
such a manner that it is indicative of high certainty (= 3) on the RFQ_C scale and at the same
time indicative of low uncertainty (= 0) on the RFQ_U scale, while a strong agreement is
scored as indicative of low certainty (= 0) on the RFQ_C scale and at the same time as
indicative of high uncertainty (= 3) on the RFQ_U scale. High levels of certainty about
mental states are assumed to reflect hypermentalizing and high levels of uncertainty are
assumed to reflect hypomentalizing (e.g., Badoud et al., 2015; Fonagy et al., 2016;
Morandotti et al., 2018). The latent structure of the RFQ has been investigated repeatedly
using double-scored items and reportedly consists of two negatively correlated factors (with
correlations typically ranging between -.60 and -.80) reflecting the scales of RFQ_C and
However, in a recent attempt to validate the factor structure of the German version of
the RFQ in a large sample (N = 2477) representative of the German general population,
Spitzer et al. (2021) criticized the use of double-scored items in factor analysis (e.g., Badoud
et al., 2015; Fonagy et al., 2016; Morandotti et al., 2018), noting that the assumption of
uncorrelated item residuals seems unrealistic when two items are derived from the same
original responses. Instead, Spitzer and colleagues provided initial evidence that a
responses to RFQ items and suggested that this representation could give rise to two
maladaptive poles of hypo- and hypermentalizing deviating from an adaptive middle region.
Following this notion, they proceeded with testing U-shaped relationships between a
unidimensional RFQ score (dropping two items to improve internal consistency) and
depression, anxiety, and somatic symptoms but found no evidence for such associations. The
findings of Spitzer and colleagues in conjunction with their briefly noted considerations raise
some initial doubts and call for a critical re-examination of the RFQ as a psychometric test. A
given that researchers are increasingly adopting the measure for primary investigations of the
mentalizing construct (e.g., Badoud et al., 2018; de Meulemeester et al., 2017, 2018; Huang
et al., 2020; Li et al., 2020). In the following, we will identify and explore potential issues
Item Content
An initial concern is that the coverage of the reflective functioning construct in the
RFQ does not seem to converge well with the definition given by the creators (Fonagy et al.,
2016). According to them, reflective functioning pertains to “the capacity to interpret both the
self and others in terms of internal mental states, such as feelings, wishes, goals, desires, and
attitudes” (Fonagy et al., 2016, p. 1). However, all RFQ items but one refer to understanding
oneself (and not others) and most items refer to understanding one’s own behavior (and not
defined by the authors might thus not be optimally covered by the items. Although it should
be acknowledged that the RFQ was introduced as a brief screening measure and the creators
et al., 2016; Luyten et al., 2020), a potential lack of coverage cannot be compensated by
relating the RFQ to a validated and more comprehensive long form. This is because, even
though long forms (i.e., RFQ-54, RFQ-46) are accessible from the creators’ website and are
also used in some studies (e.g., Euler et al., 2021), there are no publications that investigate
In addition, it seems that the item content of the RFQ may overrepresent another
experiencing negative emotions (e.g., Item 3: “When I get angry, I say things without really
knowing why I am saying them”; Item 4: “When I get angry, I say things that I later regret”;
Item 5: “If I feel insecure I can behave in ways that put others’ backs up”). That is, it could be
that individuals who endorse these items are not necessarily lacking in reflective capacity, but
merely show a high level of negative urgency (Cyders & Smith, 2008; Settles et al., 2012).
Negative urgency denotes the disposition to act rashly and ill-advisedly under negative
emotions and can thus be seen as reflecting a blend of emotional lability and impulsivity.
mentalizing abilities, it may well be due to other underlying characteristics and deficiencies
(e.g., lack of adaptive emotion regulation strategies; King et al., 2018). Therefore, these items
do not seem specific enough to the core definition of reflective functioning. By contrast, other
items of the RFQ such as “People’s thoughts are a mystery to me” (Item 1) or “I always know
what I feel” (Item 7) seem to address more directly the certainty or uncertainty involved in
Scoring Procedure
The second concern pertains to the aforementioned double-scoring of four of the eight
RFQ items (i.e., Items 2, 4, 5, 6) to derive RFQ_U and RFQ_C scale scores and their use in
factor analysis, as it causes psychometric problems. Given that respondents only provide one
rating for each of these four items on the 7-point scale, the resulting eight rescaled scores on
RFQ_C and RFQ_U are mutually determined. For example, when a respondent rates Item 6
with strong agreement, this necessarily leads to a rescaled score of 0 on RFQ_C and a
rescaled score of 3 on RFQ_U (see Panels A and B of Figure 1). Thus, the two rescaled
scores are not independent of each other and overlap with regard to their information. In fact,
nine of the 16 theoretical combinations of the two scores are mathematically impossible (see
Panel C of Figure 1). In our Sample 1 (see below), this results in an artificially negative
correlation between the two rescaled scores of Item 6 with r = -.55. The exact value of the
correlation depends on the univariate distribution of the raw scores and varies slightly
between items and samples. However, when using polychoric correlations that take into
account the ordinal scaling of the scores, the estimate will approach r = -1. This
shows that the two rescaled scores are in fact completely redundant. This issue is particularly
problematic because the assumed factor structure of the RFQ is based on confirmatory factor
analysis (CFA) using double-scored items as indicators (Badoud et al., 2015; Fonagy et al.,
2016). Applying ordinal CFA should produce warning messages because the polychoric
correlations between several indicators will approach r = -1. However, if instead one treats
the indicators as continuous and applies robust maximum likelihood estimation (as was done
in previous validation studies), one also cannot undo the inherent dependencies of the
indicators. In this case, the suggested two-dimensional factor model will artificially induce a
negative correlation between the two factors because the residual correlations of double-
scored item pairs are restricted to zero. Therefore, the finding of two negatively correlated
dimensions of the RFQ (i.e., RFQ_U and RFQ_C) using double-scored items should not
be interpreted as evidence in favor of the instrument’s structural validity, as was done in
previous validation studies (Badoud et al., 2015; Fonagy et al., 2016). We argue that the
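The dependency between the two rescaled scores of a double-scored item can be illustrated with a short sketch. It is written in Python purely for illustration (the analyses reported here used R and Mplus) and applies the recoding 3, 2, 1, 0, 0, 0, 0 (RFQ_C) and 0, 0, 0, 0, 1, 2, 3 (RFQ_U) to a regularly keyed item such as Item 6. The raw responses are hypothetical (one respondent per response category), and the helper names are assumptions rather than part of any RFQ scoring software.

```python
from itertools import product
from math import sqrt

# Recoding of a raw 7-point response (1..7) into the two rescaled scores.
RFQ_C = {1: 3, 2: 2, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}
RFQ_U = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 2, 7: 3}


def double_score(raw):
    """Return the (RFQ_C, RFQ_U) pair implied by one raw response."""
    return RFQ_C[raw], RFQ_U[raw]


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Score pairs that can actually occur versus all 4 x 4 theoretical pairs.
observable = {double_score(raw) for raw in range(1, 8)}
impossible = set(product(range(4), repeat=2)) - observable

# Hypothetical data: one respondent in each of the seven response categories.
raw_responses = list(range(1, 8))
c_scores = [RFQ_C[r] for r in raw_responses]
u_scores = [RFQ_U[r] for r in raw_responses]
r_cu = pearson(c_scores, u_scores)

print(len(observable), len(impossible), round(r_cu, 2))
```

In this hypothetical case only 7 of the 16 score pairs can occur, and the Pearson correlation between the two rescaled scores is negative by construction. Its exact value depends on the distribution of the raw responses.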
The problematic scoring procedure associated with the two scales manifests itself in
conceptual inconsistencies with regard to the RFQ_C scale in particular. Although RFQ_C is
supposed to represent certainty about mental states, all items are geared towards a state of
uncertainty with respect to their semantic content (e.g., Item 1: “People’s thoughts are a
mystery to me”) and are, ultimately, reversely scored. The certainty scale is thus based
entirely on the denial of uncertainty. In fact, the RFQ contains only one item directly
referring to a state of certainty (Item 7: “I always know what I feel”) and this item is scored
impairments such as hypo- and hypermentalizing are a vulnerability factor for severe
correlational patterns of the RFQ_C and RFQ_U scales that emerge from the literature (e.g.,
Badoud et al., 2015; Fonagy et al., 2016; Huang et al., 2020; Li et al., 2020) rather appear to
be in contrast to the interpretation that RFQ_C assesses hypermentalizing, as has been noted
previously (de Meulemeester et al., 2018; Euler et al., 2021). Overall, the RFQ_C scale
(certainty about mental states) was often positively associated with mental health, suggesting
that it captures an adaptive characteristic. By contrast, the RFQ_U scale (uncertainty about
Taken together, the two scales of the RFQ tend to exhibit similar correlational patterns with
external criteria but with opposite signs. These opposing correlational patterns
of RFQ_C and RFQ_U in a shared nomological net seem more compatible with the notion
that the RFQ reflects a unidimensional continuum ranging from genuine to impaired
mentalizing.
Another aspect in this regard is that, from a theoretical point of view, the association
of personality disorders (e.g., Bateman & Fonagy, 2019; Fonagy et al., 2017; Luyten et al.,
2020). For example, in the Alternative Model for Personality Disorders (AMPD) in the fifth
edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; APA, 2013),
the first general criterion of personality disorders refers to difficulties in understanding and
regulating the self and interpersonal relationships, with mentalizing deficits explicitly
considered as one important element (Bender et al., 2011). Thus, the strongest correlations in
the field of psychopathology should occur where the RFQ is correlated with measures that
associations between the RFQ scales and measures of personality pathology (e.g., Badoud et
al., 2015; de Meulemeester et al., 2017; Fonagy et al., 2016) as well as various symptom
measures (e.g., de Meulemeester et al., 2018; Fonagy et al., 2016; Huang et al., 2020; Li et
al., 2020). However, the magnitudes of these associations should be compared systematically
terms of validity, not least because the RFQ also contains item content that may reflect
constructs such as negative urgency rather than mentalizing in particular. For example,
distress that would be expected to exhibit similar patterns of associations as were reported for
the scales of the RFQ (e.g., Carver et al., 2017; Settles et al., 2012). We therefore argue that
more research is needed to evaluate whether associations of the RFQ with specific,
theoretically selected forms of psychopathology are higher than associations with non-
the RFQ and indicators of personality dysfunction are artificially inflated due to item content
Study 1
We have stated potential concerns with regard to the RFQ related to its item content,
interpretation of previous findings, we make three observations: First, we propose that the
measure may emerge as unidimensional when modeled more adequately, that is, when not
using double-scored items in CFA given the methodological problems associated with this
approach. Although initial evidence for the unidimensionality of the RFQ has been provided
by Spitzer et al. (2021), their study only involved a single non-clinical sample. In the first
study, we used data from a clinical and a non-clinical sample to compare a series of factor
models for the RFQ including unidimensional and two-dimensional representations in order
Second, if the RFQ turns out to be unidimensional, it could still capture a unipolar or
two maladaptive ends of one continuum. Notably, in our view, a unidimensional scoring of
the RFQ lends itself to investigating the unique characteristics of the two poles using non-
linear statistical models. Specifically, in the case where the RFQ captures a continuum that
runs from extreme uncertainty about mental states (“too little mentalizing”) to an optimal
level (“genuine mentalizing”) to excessive certainty about mental states (“too much
(e.g., depression) and, vice versa, inverse U-shaped associations with adaptive characteristics
(e.g., well-being). 1 When first testing this hypothesis by specifying quadratic terms in
support for such U-shaped associations. However, the authors considered only rather weak
criteria in terms of short screening measures of depression, anxiety, and somatic symptoms.
In this study, we explored (inverse) U-shaped relationships between the RFQ and measures
tests of the hypothesis that the RFQ captures two maladaptive variants of mentalizing on a
Third, given that the RFQ contains several non-specific items, we examined
associations with specific forms of psychopathology that are theoretically more closely
1 Thus, both low and high levels of the unidimensional RFQ-8 are supposed to be maladaptive, marking the high
ends of the U-shape when inspecting a maladaptive criterion. By contrast, middle levels are not supposed to be
maladaptive, marking the turning point of the U-shape when inspecting a maladaptive criterion. Similarly, the
association between the RFQ-8 and adaptive criteria should be inverse U-shaped, indicating that middle levels
are again adaptive whereas both low and high levels are not.
and Criterion B (maladaptive personality traits) of the AMPD. We compared the correlations
between the RFQ and measures of personality pathology with correlations between the RFQ
and measures of general symptom distress, and further scrutinized the latent associations
between the RFQ and personality pathology using bifactor exploratory structural equation
Method
Sample 1. The first sample was collected at a psychosomatic clinic, the Asklepios
Fachklinikum Tiefenbrunn. Participants (64% female) were 861 inpatients ranging in age
from 18 to 68 (M = 34.0, SD = 13.3). The data presented here were collected as part of
routine diagnostics on two occasions: within three days of admission to inpatient treatment
and shortly before discharge. All included participants completed the RFQ at admission with
no missing data, whereas only 364 participants completed the measure at discharge. These
missing observations are not due to dropout; the RFQ was added to the discharge assessment
only at a later stage of data collection, so that discharge data are available for only
364 patients. For more detailed information about the sample characteristics, see Note S1 in
the supplement.
Sample 2. The second sample was collected online within a study on dimensional
were recruited via flyers and calls for participation on the university website and in several
online forums. The sample comprises 566 young adults (74% female) who completed the
study and whose ages ranged from 18 to 30 (M = 24.2, SD = 3.13). There were no missing
data. For more detailed information about the sample characteristics, see Note S2 in the
supplement.
Measures. The RFQ was administered in both samples. The Brief Symptom
16), and Operationalized Structural Diagnostic Questionnaire – Short Form (OPD-SQS) were
admission and discharge in Sample 1; however, we used only the data at admission (i.e.,
before treatment) for exploring their associations with the RFQ. The Personality Inventory
comprises eight items forming the two scales of certainty about mental states (RFQ_C) and
uncertainty about mental states (RFQ_U). According to the scoring procedure described
above, the 7-point Likert scale (“do not agree at all” = 1 to “agree completely” = 7) is
recoded as 3, 2, 1, 0, 0, 0, 0 for the scale RFQ_C; for the scale RFQ_U, items are recoded as
0, 0, 0, 0, 1, 2, 3 (except for the inverse Item 7, which is recoded as shown for RFQ_C). Results
for the original scoring procedure are reported in the supplement. For the main results, we refrained from
applying the scoring procedure due to the aforementioned problems. Items were thus kept in
the original coding (i.e., 1, 2, 3, 4, 5, 6, 7) and only Item 7 was reversed such that it
corresponded to the content polarity of the other items. High values indicated uncertainty
about mental states and low values indicated certainty. In the following, we will refer to the
mean score of the 8-item RFQ as the RFQ-8. In a previous investigation using a
representative sample from the German population, Spitzer et al. (2021) presented results in
favor of using a reduced 6-item version of the RFQ (omitting Items 7 and 4), that will
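The unidimensional scoring described above can be sketched as follows. This is a minimal illustration in Python (the analyses reported here used R and Mplus), assuming that the reversal of Item 7 on the 7-point scale is computed as 8 minus the raw response. The function name and the example responses are hypothetical.

```python
from statistics import mean

REVERSED_ITEMS = {7}  # Item 7 is keyed towards certainty and is therefore reversed


def score_rfq(responses, drop_items=()):
    """Mean score of the RFQ items kept in their original 1..7 coding.

    responses:  dict mapping item number (1..8) to the raw response (1..7)
    drop_items: item numbers to leave out, e.g. (4, 7) for a 6-item version
    """
    kept = []
    for item, raw in responses.items():
        if not 1 <= raw <= 7:
            raise ValueError(f"Item {item}: response {raw} is outside 1..7")
        if item in drop_items:
            continue
        # Reversal assumed to be 8 - raw so that high values indicate uncertainty.
        kept.append(8 - raw if item in REVERSED_ITEMS else raw)
    return mean(kept)


# Hypothetical respondent.
example = {1: 2, 2: 3, 3: 5, 4: 6, 5: 4, 6: 2, 7: 6, 8: 3}
rfq8 = score_rfq(example)                      # all eight items
rfq6 = score_rfq(example, drop_items=(4, 7))   # reduced version without Items 4 and 7
print(rfq8, round(rfq6, 2))
```

Higher values on either score indicate more uncertainty about mental states, in line with the coding direction described above.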
Brief Symptom Inventory (BSI). The BSI (Franke, 2000) was used to assess distress
associated with symptoms of mental illness during the last week (e.g., “loss of appetite”) on a
5-point scale ranging from “not at all” (0) to “extremely” (4). Internal consistency of the BSI
measured with the 32-item version of the IIP (Horowitz et al., 2000). The measure assesses
distress associated with interpersonal behaviors (e.g., “I open up to other people too much”)
that are performed excessively or inhibited strongly on a 5-point scale ranging from “not at
all” (0) to “extremely” (4). Internal consistency of the IIP-32 total score was estimated at α =
.87 in Sample 1.
1998) is a self-report measure of well-being. It consists of five items (e.g., “Over the last two
weeks I have felt cheerful and in good spirits”) that are rated on a 6-point Likert scale ranging
from “at no time” (0) to “all the time” (5). High scores indicate a high subjective well-being.
Internal consistency of the WHO-5 total score was estimated at α = .85 in Sample 1.
Patient Health Questionnaire (PHQ). The PHQ-15 (Kroenke et al., 2002) is a 15-
item module for assessing the severity of impairment associated with somatic symptoms
(e.g., “back pain”) that have been experienced during the last four weeks. Items are rated on a
3-point Likert scale ranging from “not bothered at all” (0) to “bothered a lot” (2). The PHQ-9
(Kroenke et al., 2001) is a 9-item module for assessing the impairment associated with the
nine DSM-IV criteria of depression (e.g., “Little interest or pleasure in doing things”,
“Feeling tired or having little energy”) that may have been experienced during the last two
weeks. Items are rated on a 4-point scale ranging from “not at all” (0) to “nearly every day”
(3). Internal consistencies of the PHQ-15/PHQ-9 total scores were estimated at α = .81/.84 in
Sample 1.
personality dysfunction (Zimmermann et al., 2013) in the domains of identity diffusion (e.g.,
“I feel that my tastes and opinions are not really my own, but have been borrowed from other
people”), primitive defenses (e.g., “People tell me I behave in contradictory ways”), and
reality testing (e.g., “I can’t tell whether certain physical sensations I’m having are real, or
whether I’m imagining them”). The 16 items are answered on a 5-point scale from “never
applies” (1) to “always applies” (5). Internal consistency of the IPO-16 total score was
Statements are endorsed or rejected on a 5-point scale ranging from “completely untrue” (0)
to “entirely true” (4). The items give rise to the scales of self-perception (e.g., “I sometimes
feel like a stranger to myself”), contact (e.g., “I sometimes misjudge how my behavior affects
others”), and relationship (e.g., “It can be dangerous to let others get too close to you.”).
Internal consistency of the OPD-SQS total score was estimated at α = .86 in Sample 1.
Personality Inventory for DSM-5 – Brief Form (PID-5-BF). The PID-5-BF (APA,
2013; Zimmermann et al., 2014) is a 25-item measure assessing the broad maladaptive
psychoticism with five items each. Items are rated on a 4-point scale ranging from “very
false” (0) to “very true” (3). Internal consistency of the PID-5-BF total score was estimated at
α = .86 in Sample 2.
Statistical Analyses. The analyses were performed using R version 4.0.3 (R Core
Team, 2020) in conjunction with the package lavaan (Rosseel, 2012) and Mplus version 8.4
(Muthén & Muthén, 1998–2019). All latent variable models were estimated using the
Weighted Least Squares Mean and Variance Adjusted (WLSMV) estimator that was applied
to the polychoric correlation matrix. The fit of latent variable models was evaluated by a
commonly used combination of fit indices and cut-off criteria (i.e., Comparative Fit Index
[CFI] > .95, Root Mean Square Error of Approximation [RMSEA] < .06, Standardized Root
Mean Square Residual [SRMR] < .08; Hu & Bentler, 1999). The internal consistency of the
RFQ was estimated with the model-based McDonald’s ω for categorical variables (Flora,
Factor Structure. The factor structure of the RFQ was evaluated by confirmatory
factor analysis (CFA) and exploratory factor analysis (EFA) using the original item
responses. The two-dimensional measurement model using double-scored items that was
reported by the creators of the RFQ (Badoud et al., 2015; Fonagy et al., 2016) is not taken
into account for the main results due to the methodological problems associated with the
scoring procedure that are described above. However, we report the results based on the
original model and the creators’ recommendations in the supplement. For this analysis, we
considered the following measurement models: (1) a unidimensional CFA; (2) a two-
dimensional CFA with cross-loadings that follow the scoring procedure of RFQ_C and
RFQ_U as proposed in the original publication; and (3) a two-dimensional EFA with oblique
factor rotation (quartimin). In all CFA models, the correlation between the residual variances
of Items 3 and 4 was freely estimated since these items have a large overlap in terms of
semantic content and wording. As two measurement occasions were available in Sample 1
(i.e., at admission and discharge, respectively), we specified repeated measures CFA models
with equality constraints for loadings, thresholds, intercepts, latent covariance, and residual
covariances (Liu et al., 2017). For the repeated measures CFA, we dealt with missing data by
associations between the RFQ scale scores and various measures of psychopathology
given the conceptualization of the RFQ, one would expect U-shaped associations with
order to test the hypothesis that hypo- and hypermentalizing may delineate extreme
maladaptive ends of a unidimensional continuum (i.e., very low and very high values on the
manifest score), we examined U-shaped and inverse U-shaped associations between the RFQ
score and measures of symptomatic distress (i.e., general symptomatic distress: BSI; somatic
dysfunction (i.e., IPO-16, OPD-SQS, PID-5-BF), and interpersonal distress (i.e., IIP-32). To
this end, lowess smoothing curves were inspected and regression models were estimated in
which the quadratic term was added to the RFQ score for predicting criterion variables. It
should be noted that a significant effect of the quadratic predictor is not sufficiently indicative
of the presence of two maladaptive poles, as this would also be observed, for example, in
floor or ceiling effects. More specifically, the predicted values for a criterion should also
differ between medium scores on the RFQ as compared to extreme scores by forming a U-
shape. We therefore further used the two-lines test as a more rigorous method (Simonsohn,
2018) that estimates two regression lines, one before and one after a break-point in the
distribution of a predictor variable, in order to detect a change in the sign of the regression
slope.
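The logic of checking for a sign change in the slope can be sketched in a few lines. The Python code below is a deliberately simplified illustration and not the two-lines test as published: it assumes a fixed, user-supplied break-point (the published procedure selects the break-point from the data) and it only compares the signs of two ordinary least squares slopes without any significance testing. All function names and the toy data are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def two_line_slopes(xs, ys, breakpoint):
    """Fit one line at or below and one line at or above the break-point."""
    low = [(x, y) for x, y in zip(xs, ys) if x <= breakpoint]
    high = [(x, y) for x, y in zip(xs, ys) if x >= breakpoint]
    return ols_slope(*zip(*low)), ols_slope(*zip(*high))


def is_u_shaped(xs, ys, breakpoint):
    """True if the slope is negative before and positive after the break-point."""
    slope_low, slope_high = two_line_slopes(xs, ys, breakpoint)
    return slope_low < 0 < slope_high


# Toy predictor on a 1..7 metric with two hypothetical criteria.
x = [1, 2, 3, 4, 5, 6, 7]
y_u_shape = [(v - 4) ** 2 for v in x]   # both extremes are elevated
y_ceiling = [min(v, 5) for v in x]      # increases, then levels off

print(is_u_shaped(x, y_u_shape, breakpoint=4))
print(is_u_shaped(x, y_ceiling, breakpoint=4))
```

The second toy criterion would produce a nonzero quadratic term in a regression because the curve flattens, yet the slope never changes sign. This is the pattern that a ceiling effect produces and that a check for a sign change is meant to rule out.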
differential associations between the RFQ and indicators of personality pathology (i.e., IPO-
16, OPD-SQS) and various dimensions of symptomatic distress (i.e., BSI, PHQ-15, PHQ-9,
IIP-32, WHO-5), respectively. To this end, we compared the magnitude of their correlation
coefficients in Sample 1. Additionally, we estimated the empirical overlap between the RFQ
and indicators of personality pathology (i.e., PID-5-BF, IPO-16, OPD-SQS) in both samples
using a bifactor exploratory structural equation modeling approach (i.e., bifactor ESEM). In a
bifactor ESEM (e.g., Morin et al., 2020), a criterion (e.g., RFQ) is regressed on the
reflecting a multidimensional construct (e.g., PID-5-BF). This approach has two advantages
for this analysis. First, the use of latent variable modeling partitions reliable variance and
measurement error and thus estimates the disattenuated associations among constructs.
Second, bifactor models with uncorrelated general and specific factors allow for a clear
(Reise, 2012), as is the case for the PID-5-BF, IPO-16, and OPD-SQS (Zimmermann et al.,
2020). Further details and explanations about the bifactor ESEM approach used here can be
Results
Factor Structure. All estimated model parameters are depicted in Figure 2, separately for each
sample. In Sample 1, the unidimensional repeated measures CFA model showed good model
fit, χ²(139) = 353.0, p < .001, CFI = .98, RMSEA = .046, 90% CI [.041; .052], SRMR = .06.
Factor loadings were acceptable (λ ≥ .49), but Item 7 showed a weak loading on the latent
factor (λ = .34). The a priori specified residual correlation between Items 3 and 4 was
estimated at .60. Heywood cases occurred for the two-dimensional CFA model (i.e., factor
correlation > 1) and the two-dimensional EFA model (i.e., standardized factor loading > 1 for
Item 4). In Sample 2, the unidimensional CFA model provided a good fit to the data, χ²(19) =
91.8, p < .001, CFI = .98, RMSEA = .08, 90% CI [.07; .10], SRMR = .04. Factor loadings
were acceptable (λ ≥ .51), but Item 7 again exhibited the weakest loading on the latent factor
(λ = .45). The a priori specified residual correlation between Items 3 and 4 was estimated at
.66. The two-dimensional CFA model did not reach convergence. The two-dimensional EFA
model again showed a Heywood case (i.e., standardized loading > 1 for Item 4).
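The cut-off criteria named in the Statistical Analyses section (CFI > .95, RMSEA < .06, SRMR < .08) can be applied to reported fit indices with a small helper. The following Python sketch is illustrative only: the function name is hypothetical, the input values are the point estimates reported above for the unidimensional models, and the cut-offs are conventional rules of thumb rather than strict tests.

```python
def check_fit(cfi, rmsea, srmr):
    """Compare fit indices with the conventional cut-offs, one index at a time."""
    return {
        "CFI > .95": cfi > 0.95,
        "RMSEA < .06": rmsea < 0.06,
        "SRMR < .08": srmr < 0.08,
    }


# Point estimates reported for the unidimensional CFA models.
sample1 = check_fit(cfi=0.98, rmsea=0.046, srmr=0.06)
sample2 = check_fit(cfi=0.98, rmsea=0.08, srmr=0.04)

for label, result in (("Sample 1", sample1), ("Sample 2", sample2)):
    print(label, result)
```

The helper returns one flag per index instead of a single verdict, because the indices need not agree and overall model fit is judged from their combination.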
Taken together, we did not identify the two proposed dimensions of RFQ_C and
RFQ_U in the estimated solutions, and we never identified more than one meaningful factor
in general, suggesting that the RFQ essentially captures a unidimensional construct. Results
for the originally proposed two-dimensional CFA model that uses double-scoring of RFQ
items are reported in the supplement (see Figures S1 and S2). Note that we found similar
parameters and model fit for the original specifications based on double-scored items as
found in previous validation studies (Badoud et al., 2015; Fonagy et al., 2016). We explained
above why these models and their respective fit statistics should not be interpreted
The internal consistency of the RFQ factor from the unidimensional solution was
respectively. In the following, we omitted Items 4 and 7 in order to improve the internal
consistency of the scale (Spitzer et al., 2021). Item 7 (“I always
know what I feel”) tended to have low factor loadings overall, whereas Item 3
(“When I get angry, I say things without really knowing why I am saying them”) and Item 4
(“When I get angry, I say things that I later regret”) overlap strongly in content.
The removal of these two items resulted in a 6-item scale with internal consistency of ω = .82
(Sample 1) and ω = .83 (Sample 2). Subsequent results refer to the 6-item version of the RFQ
(i.e., RFQ-6) with low scores supposedly reflecting certainty about mental states and high
scores reflecting uncertainty about mental states. 2 This directionality is a decision based on
the circumstance that all items of the RFQ-6 are geared towards an endorsement of
2 The total scores of the RFQ-8 and the reduced RFQ-6 were correlated at .97 (Sample 1) and at .98 (Sample 2),
respectively. It should be emphasized that retaining items 4 and 7 produced highly similar results and equivalent
conclusions for all of the presented analyses.
Associations with Psychopathology. We did not find any evidence for (inverse) U-
shaped relationships between the RFQ-6 and criteria indicative of symptom distress,
personality pathology, or well-being. Although a significant quadratic term was found for
OPD-SQS in Sample 1 and for PID-5-BF detachment in Sample 2, the associations were not
U-shaped but merely indicated a ceiling effect for higher values of the RFQ-6 (see
supplement Figures S3 and S4). The two-lines test did not indicate any U-shaped statistical
associations (see Figures S5 and S6). For the unidimensional RFQ-8, U-shaped associations
On the contrary, the RFQ-6 exhibited substantial linear associations with measures of
psychopathology (see Tables 1 and 2). In Sample 1, the strongest correlations were found
between the RFQ-6 and measures of personality pathology, including the IPO-16 (r = .72)
and the OPD-SQS (r = .65), both of which were significantly larger (each p < .001) than
the correlations between the RFQ-6 and other measures, such as the BSI
(r = .54) and the IIP (r = .54). The bivariate associations between the original scales based on
double-scoring of the RFQ (i.e., RFQ_C and RFQ_U) as well as the RFQ-8 and criterion
In Sample 1, the bifactor ESEM (see Panel A of Figure 3) had acceptable fit, χ²(373)
= 1764.83, p < .001, CFI = .94, RMSEA = .07, 90% CI [.06; .07], SRMR = .04. All items of
the IPO-16 and OPD-SQS loaded significantly on a general factor of personality pathology
with standardized loadings ranging from λ = .32 to λ = .71. For the specific factors, target
loadings were in the expected direction (i.e., positive factor loadings) and most of them had
values of λ > .30. (Absolute) non-target loadings were consistently < .30. The standardized
regression coefficient of the general factor was β = .83 (p < .001), explaining 69% of variance
in the RFQ factor. The specific factors of OPD relationship (β = -.17, p < .001), OPD contact
(β = .08, p = .005), OPD self-perception (β = .14, p < .001), IPO reality testing (β = .09, p =
.001), IPO identity diffusion (β = .06, p = .033), and IPO primitive defenses (β = .19, p <
.001) incrementally explained a combined 10% of variance in the RFQ factor. The
unexplained variance in the latent RFQ factor thus amounted to 21%. For the (highly similar)
In Sample 2, the bifactor ESEM (see Panel B of Figure 3) had good fit, χ²(318) =
743.0, p < .001, CFI = .97, RMSEA = .05, 90% CI [.04; .05], SRMR = .04. All items of the
loadings ranging from λ = .31 to λ = .75. For the specific factors, target loadings were in the
expected direction (i.e., positive factor loadings) and most of them had values of λ > .30.
(Absolute) non-target loadings were consistently < .30. The standardized regression
coefficient of the general factor was β = .68 (p < .001), explaining 47% of variance in the
RFQ factor. The specific factors of PID-5-BF negative affect (β = .31, p < .001), PID-5-BF
psychoticism (β = .32, p < .001) incrementally explained a combined 21% of variance in the
RFQ factor. The unexplained variance in the latent RFQ factor thus amounted to 32%. See
Panel B of Figure S7 for the (highly similar) results using all eight RFQ items.
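As a check on the arithmetic for Sample 1, the reported variance components can be recovered from the standardized coefficients. The sketch below assumes that the general and specific factors are mutually orthogonal (in line with the orthogonal target rotation described in Note S3), so that squared standardized coefficients add up to explained variance. The variable names are illustrative only.

```python
# Standardized regression coefficients reported for Sample 1
general_beta = 0.83
specific_betas = {
    "OPD relationship": -0.17,
    "OPD contact": 0.08,
    "OPD self-perception": 0.14,
    "IPO reality testing": 0.09,
    "IPO identity diffusion": 0.06,
    "IPO primitive defenses": 0.19,
}

# With orthogonal predictors, each squared coefficient is a variance share
general_share = general_beta ** 2
specific_share = sum(beta ** 2 for beta in specific_betas.values())
unexplained_share = 1 - general_share - specific_share

print(f"general factor:   {general_share:.1%}")
print(f"specific factors: {specific_share:.1%}")
print(f"unexplained:      {unexplained_share:.1%}")
```

Rounded to whole percentages, these values correspond to the 69%, 10%, and 21% reported for Sample 1.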
Discussion
Regarding the factor structure of the RFQ, we found evidence for a unidimensional
construct in a large clinical sample (Sample 1) and a large sample of young adults (Sample
2), thereby corroborating initial findings (Spitzer et al., 2021). We have argued that using a
with mentalizing theory and allows for conceptualizing hypo- and hypermentalizing as two
lowess curves, and the two-lines test. Across a broad range of criterion variables, however,
we found no evidence that the RFQ assesses a maladaptive form of having too much certainty
about mental states (i.e., hypermentalizing). Only the uncertainty pole of the RFQ was
associated with poorer mental health. Finally, in line with theoretical expectations with regard
to the mentalizing construct (e.g., Bateman & Fonagy, 2019), the RFQ was more strongly
More specifically, we found that the variance of the RFQ primarily reflected broad
indicators of self-reported personality dysfunction in both samples (i.e., clinical and non-
clinical) using diverse measures (i.e., PID-5-BF, IPO-16, OPD-SQS), whereas comparatively
little variance was unique to the RFQ. On the one hand, this may suggest that the constructs
intertwined that they cannot clearly be distinguished empirically (at least using self-reports).
On the other hand, in light of the observation that various items of the RFQ may tap into
the large overlap between the RFQ and dimensions of personality pathology may also signal
caution. In fact, this finding could point to potential problems with respect to the discriminant
validity of the measure. Consistent with this notion, the existing research literature also
provided mixed results for the convergent validity of the RFQ with respect to mentalizing and
related constructs. In the initial validation studies, RFQ_U and RFQ_C exhibited strong
2015; Fonagy et al., 2016; Morandotti et al., 2018). This might be concerning given that
constructs such as cognitive empathy and perspective-taking are very similar to mentalizing
by definition (e.g., Ickes, 1993), except that the former constructs solely pertain to
Study 2
To address the question of whether validity issues of the RFQ contribute to its strong
empirical overlap with indicators of personality pathology, we conducted a second study that
was fully preregistered (see https://osf.io/qr38t). We hypothesized that the item content of the
RFQ reflects both mentalizing and possible consequences of impaired mentalizing, namely,
emotional lability and impulsivity. If the RFQ conflated these constructs, this could
indicators of personality dysfunction, thereby impeding the interpretability of RFQ scores and
the measure’s utility for theory testing. Using commonality analysis (Nimon et al., 2008), we
tested whether and to what extent the associations between the RFQ and indicators of
personality dysfunction are driven by content overlap rather than reflecting the true
RFQ items with convergent and discriminant measures to examine their nomological
consistency (Thielmann & Hilbig, 2019). For example, nomological inconsistency would be
indicated by some RFQ items correlating more strongly with impulsivity and other RFQ
items correlating more strongly with other measures of mentalizing. Thus, Study 2 included
pertain to the core construct of mentalizing, that is, interpreting the mental states of the self
and others.
Method
Sample 3. We recruited participants from the United States online via the panel
provider Prolific. Data quality was ensured by a series of attention and validity checks.
individuals were not able to proceed without answering each item. We collected data from N
(47% female) age ranged from 18 to 75 (M = 34.9, SD = 11.7). For more details about
Measures. In the study, all questionnaires were presented in random order. The RFQ
(Fonagy et al., 2016) and the IPO-16 (Zimmermann et al., 2013) were administered again in
their respective English versions. Based on our findings from Study 1, we used a
unidimensional mean score of the RFQ such that high values reflected uncertainty about
mental states. The analyses were performed for both the RFQ-8 and the RFQ-6. The internal
consistencies of the RFQ-6 (ω = .87), RFQ-8 (ω = .87), and the IPO-16 (α = .90) were good.
Level of Personality Functioning Scale - Brief Form 2.0 (LPFS-BF). The LPFS-BF
(Weekers et al., 2019) is a 12-item self-report measure assessing impairments in the domains
of self-functioning (6 items) and interpersonal functioning (6 items). The items (e.g., “I often
make unrealistic demands on myself”) are answered on a 4-point scale ranging from
completely untrue (1) to completely true (4). High scores on the respective scales indicate
Certainty About Mental States Questionnaire (CAMSQ). The CAMSQ (Müller et al.,
2021) is a 20-item self-report measure of mentalizing that assesses the perceived certainty
associated with making inferences about the mental states of the self (i.e., Self-Certainty) and
others (i.e., Other-Certainty). The items capture affective, cognitive, and motivational content
(e.g., “I understand my feelings”, “I know when other people are hiding their thoughts”) and
are answered on a 7-point frequency scale ranging from never (1) to always (7). High scores
reflect high levels of certainty. Internal consistencies of CAMSQ Self-Certainty (α = .93) and
self-report measure of empathy. The 9-item Cognitive Empathy scale was used as a measure
of mentalizing others. Items (e.g., “I am good at predicting how someone will feel”) are rated
on a 4-point scale ranging from strongly disagree (1) to strongly agree (4). Internal
Self-Reflection and Insight Scale (SRIS). The SRIS (Grant et al., 2002) is a 20-item
self-report measure. The 8-item Self-Insight scale of the SRIS was used as a measure of
mentalizing oneself. Items (e.g., “I usually know why I feel the way I do”) are rated on a 6-
point scale ranging from strongly disagree (1) to strongly agree (6). Internal consistency of
UPPS-P Impulsive Behavior Scale (UPPS-P). The UPPS-P (Lynam et al., 2007)
assesses five impulsive personality traits with 59 items. We used the 12-item Negative
Urgency scale that assesses impulsivity in terms of acting rashly under the influence of
negative emotions. Items (e.g., “When I am upset, I often act without thinking”) are endorsed
or rejected on a 4-point scale ranging from disagree strongly (1) to agree strongly (4). Internal
Difficulties in Emotion Regulation Scale (DERS). The DERS (Gratz & Roemer,
impulsivity. Items (e.g., “When I’m upset, I become out of control”) are rated on a 5-point
scale ranging from almost never (1) to almost always (5) indicating the frequency of
experiencing the described behavior. Internal consistency of the DERS Impulse Control
Personality Inventory for DSM-5 (PID-5). The PID-5 (APA, 2013) assesses
personality pathology according to the AMPD with 220 items. The 7-item Emotional Lability
facet scale was used. Items (e.g., “I am a highly emotional person”) are answered on a 4-point
scale ranging from very false or often false (0) to very true or often true (3). Internal
(Morey, 2004) assesses traits associated with borderline personality disorder using 24 items.
We used the six items of the Affective Instability scale to assess emotional lability. Items
(e.g., “My mood can shift quite suddenly”) are rated on a 4-point scale ranging from false,
not at all true (0) to very true (3). Internal consistency of the PAI-BOR Affective Instability
Statistical Analysis. To test whether the associations between the
RFQ and indicators of personality pathology are driven by content overlap, we conducted a
commonality analysis (Nimon et al., 2008) that estimates the common and unique
predictors. Specifically, the RFQ mean score is regressed on a broad measure of personality
dysfunction, other measures of mentalizing the self and others, and measures of emotional
lability and impulsivity. We expected that the effect of personality dysfunction on the RFQ
mean score can be partly attributed to variation that both measures share with mentalizing
impairment, while another significant part can be attributed to variation shared with
emotional lability and impulsivity alone (i.e., variation that is non-overlapping with other
measures of mentalizing). For example, the manifest correlation between the RFQ mean
the variance explained in the RFQ by the respective measure of personality dysfunction into
(a) variance that is shared with the respective measures of impulsivity and emotional lability
but not with the respective measures of mentalizing the self and others, (b) variance that is
shared with the respective measures of mentalizing the self and others, and (c) variance
permutation test using 5000 random permutations of the dependent variable to compute p-
values for testing the null hypothesis that part (a) equals zero (i.e., that no part of the variance
explained is due to variance shared with emotional lability and impulsivity alone). We tested
this hypothesis with two different sets of measures to facilitate the generalizability of the
results (irrespective of specific measures; see preregistration). To control for performing this
test twice (once for each set of variables), we considered a p-value less than .025 to be indicative of
intervals for evaluating the precision of the estimate using 5000 bootstrap resamples. The
commonality analysis was conducted via the R package yhat (Nimon et al., 2021).
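The decomposition into parts (a), (b), and (c) can be sketched with R² values from nested regressions. The following Python example is an illustration under stated assumptions, not the code used for the reported analyses (which relied on the R package yhat): the data are simulated, each construct is represented by a single hypothetical variable, the helper functions are written from scratch, and only 200 permutations are drawn to keep the run time short.

```python
import random

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r2(y, predictors):
    """R-squared of an OLS regression of y on a list of predictor columns."""
    def center(v):
        mean = sum(v) / len(v)
        return [u - mean for u in v]
    yc = center(y)
    xs = [center(p) for p in predictors]
    xtx = [[sum(a * b for a, b in zip(p, q)) for q in xs] for p in xs]
    xty = [sum(a * b for a, b in zip(p, yc)) for p in xs]
    beta = solve(xtx, xty)
    return sum(b * c for b, c in zip(beta, xty)) / sum(v * v for v in yc)

def decompose(y, dysf, ment, emo):
    """Split the R2 of y on dysf into parts (a), (b), and (c)."""
    r_p, r_m = r2(y, dysf), r2(y, ment)
    r_pm, r_em = r2(y, dysf + ment), r2(y, emo + ment)
    r_all = r2(y, dysf + emo + ment)
    part_a = r_pm + r_em - r_m - r_all  # shared with emo, but not with ment
    part_b = r_p + r_m - r_pm           # shared with ment
    part_c = r_all - r_em               # unique to dysf
    return r_p, part_a, part_b, part_c

def permutation_p(y, dysf, ment, emo, n_perm=200, seed=0):
    """One-sided permutation p-value for H0: part (a) equals zero."""
    rng = random.Random(seed)
    observed = decompose(y, dysf, ment, emo)[1]
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the dependent variable
        if decompose(shuffled, dysf, ment, emo)[1] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Simulated data: two latent sources feed both the outcome and the dysfunction score
rng = random.Random(7)
n = 200
m_true = [rng.gauss(0, 1) for _ in range(n)]
e_true = [rng.gauss(0, 1) for _ in range(n)]
ment = [[v + 0.5 * rng.gauss(0, 1) for v in m_true]]
emo = [[v + 0.5 * rng.gauss(0, 1) for v in e_true]]
dysf = [[0.6 * a + 0.6 * b + 0.5 * rng.gauss(0, 1) for a, b in zip(m_true, e_true)]]
rfq = [0.6 * a + 0.5 * b + 0.5 * rng.gauss(0, 1) for a, b in zip(m_true, e_true)]

total, part_a, part_b, part_c = decompose(rfq, dysf, ment, emo)
p_value = permutation_p(rfq, dysf, ment, emo)
print(round(total, 3), round(part_a, 3), round(part_b, 3), round(part_c, 3), p_value)
```

By construction, the three parts sum to the R² of the regression on the dysfunction measure alone. Each predictor set is passed as a list of columns, so sets with several measures (as in the actual analysis) can be handled by adding columns.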
We also examined the pattern of associations between the eight RFQ items and
would indicate that the eight RFQ items are not nomologically consistent (Thielmann &
Hilbig, 2019). For the sake of completeness, we again tested the factor models considered in
Study 1.
Results
(https://osf.io/stbz5). As in Study 1, the unidimensional CFA model fit the data well, χ²(19) =
272.7, p < .001, CFI = .98, RMSEA = .125, 90% CI [.112; .138], SRMR = .04, whereas two-
dimensional models did not provide parameter estimates in line with the proposed structure
of RFQ_C and RFQ_U (see Figure 2). Measures that were intended to assess the same
constructs (e.g., CAMSQ Self-Certainty and SRIS Self-Insight) showed the expected high
convergent correlations. The RFQ mean score exhibited similarly strong associations with
and measures of mentalizing the self (see Table 3 and Table S3 for the full correlation
matrices including RFQ-6 and RFQ-8). Rather low correlations were observed between the
RFQ mean score and measures of mentalizing others, indicating that the RFQ primarily
(i.e., mentalizing, emotional lability, impulsivity) accounts for the association between the
RFQ and indicators of personality dysfunction. Using the first set of variables, the association
between the RFQ-6 and LPFS-BF (R² = .46) was decomposed into variance shared with
measures of mentalizing (ΔR² = .22, 95% CI [.14; .27], p < .001), variance shared with
measures of emotional lability and impulsivity alone (ΔR² = .20, 95% CI [.16; .30], p < .001),
and variance uniquely explained by LPFS-BF (ΔR² = .04, 95% CI [.02; .06], p < .001). In the
second set of variables, the association between the RFQ-6 and IPO-16 (R² = .45) was
decomposed into variance shared with measures of mentalizing (ΔR² = .33, 95% CI [.24; .38],
p < .001), variance shared with measures of emotional lability and impulsivity alone (ΔR² =
.07, 95% CI [.04; .18], p < .001), and variance uniquely explained by IPO-16 (ΔR² = .05, 95%
CI [.03; .07], p < .001). In light of the difference between the two arbitrarily selected sets of
variables, we then conducted the commonality analysis across all 32 possible variable
between the RFQ-6 and broad indicators of personality dysfunction (mean R² = .45) was
decomposed into variance shared with measures of mentalizing (mean ΔR² = .27), variance
shared with measures of emotional lability and impulsivity alone (mean ΔR² = .14), and
variance uniquely explained by measures of personality dysfunction (mean ΔR² = .04). Thus,
59% of the observed associations between RFQ-6 and indicators of personality dysfunction
were due to variance shared with other measures of mentalizing, whereas 31% were due to
variance shared with measures of emotional lability and impulsivity alone, and 10% were
unique to measures of personality dysfunction. Very similar results were obtained using the
items of the RFQ in terms of significant differences in their correlational patterns (see Table
4). Generally, the item-level associations were most pronounced for UPPS-P Negative
the self. Thus, we specifically used these two scales for testing differences in the magnitude
of correlations, although it should be noted that the correlational patterns were consistent for
the other measures that were included. RFQ Items 3, 4, and 8 correlated significantly more strongly
(p < .001) with Negative Urgency than with Self-Insight. In contrast, Items 1, 2, 5, and 7
correlated significantly more strongly with Self-Insight than with Negative Urgency (p < .001).
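To illustrate how two dependent correlations that share one variable can be compared, the sketch below applies Williams's t-test to Item 4. This is an assumption for illustration: the text does not state which test was used. The inputs are read from Tables 3 and 4, assuming that the columns of Table 4 correspond to RFQ Items 1 to 8 in order and that Table 3 uses the same numbering of measures. Because the comparison concerns magnitudes, the sign of Self-Insight is reversed.

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams's t for H0: r12 == r13, where variable 1 is shared (df = n - 3)."""
    det = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    denom = 2 * (n - 1) / (n - 3) * det + rbar ** 2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)

n = 862                  # sample size given in the note to Table 4
r_item4_urgency = 0.68   # Item 4 with UPPS-P Negative Urgency
r_item4_insight = 0.39   # Item 4 with SRIS Self-Insight (sign reversed)
r_urgency_insight = 0.54 # Negative Urgency with Self-Insight (sign reversed)

t_value = williams_t(r_item4_urgency, r_item4_insight, r_urgency_insight, n)
# Two-sided p-value from a normal approximation, reasonable for df = 859
p_value = math.erfc(abs(t_value) / math.sqrt(2))
print(round(t_value, 2), p_value)
```

The same function can be applied to the other items by substituting the corresponding correlations.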
Discussion
In Study 2, we aimed to test the hypothesis that the items of the RFQ conflate content
mentalizing impairment (i.e., emotional lability and impulsivity). The preregistered analyses
provide evidence for this hypothesis and suggest that associations between the RFQ and
exploit common variance with aspects of emotional lability and impulsivity. These results
point to limitations of the RFQ with regard to its convergent and discriminant validity.
Hilbig, 2019) that suggest RFQ Item 3 (“When I get angry I say things without really
knowing why I am saying them”), Item 4 (“When I get angry, I say things that I later regret”),
and Item 8 (“Strong feelings often cloud my thinking”) may be the reason for the conflation
because they converge with negative urgency rather than mentalizing. Study 2 further
illustrates the relevance of assessing mentalizing the self and mentalizing others separately,
as these two dimensions provide unique information. Considering that the RFQ only contains
one item that is clearly geared towards understanding others’ mental states (i.e., Item 1:
“People’s thoughts are a mystery to me”), the RFQ cannot measure mentalizing others with
sufficient fidelity.
General Discussion
The RFQ has been proposed as a short self-report measure of reflective functioning.
In this article we have elaborated our concerns with respect to the instrument’s validity and
the methodology used in prior studies (e.g., Badoud et al., 2015; Fonagy et al., 2016),
associations with psychopathology. Using large clinical and non-clinical samples from
Germany and the US, we augmented the critical discussion with new empirical analyses.
First, our findings suggest that the RFQ assesses a unidimensional construct. Consequently,
we recommend refraining from the originally proposed scoring procedure to derive RFQ_C
and RFQ_U (Fonagy et al., 2016) and instead relying on a unidimensional score using the
original responses to the RFQ items (such as mean scores on the psychometrically optimized
RFQ-6 or the RFQ-8). Second, we have demonstrated that although the RFQ reflects
mentalizing impairments regarding the self, it also exhibits a substantial confound with
emotional lability and impulsivity due to shared item content. This indicates that, when using
the RFQ to study research questions related to mentalizing theory, observed associations with
other constructs can be influenced by this confound and inferences can thus be impeded.
Third, consistent with previous accounts (e.g., de Meulemeester et al., 2018), the present
results further indicate that the RFQ does indeed capture a maladaptive form of having too
little certainty about mental states (i.e., hypomentalizing) but appears to be unable to capture
a maladaptive form of having too much certainty about mental states (i.e., hypermentalizing)
as its certainty pole does not exhibit positive associations with negative outcomes. One
reason for this could be that the RFQ does not measure variation in hypermentalizing with
sufficient reliability on the certainty pole of the continuum (e.g., because its items are
It should be noted that the present empirical findings are limited to the English and the
German versions of the RFQ and might not necessarily generalize to other languages.
However, our concerns about the item content and the scoring procedure apply to the various
translations as well. Moreover, our conclusions cannot be generalized to the long forms of the
RFQ (e.g., Euler et al., 2021), although it should be noted that, to date, no validation studies
have been published for these forms and they have seldom been used in previous research.
Finally, the current study is subject to the limitation that it did not focus on the primary target
that is, individuals with borderline personality pathology (Luyten et al., 2020).
psychopathology, personality, and psychotherapy research (APA, 2013; Bender et al., 2011).
This calls for a valid and economical self-report assessment of the construct. Although the RFQ
has been rather broadly accepted by the field and is used in a growing body of research, we
have argued that the validity evidence is not compelling. In our view, self-report assessments
of mentalizing should adhere more closely to the specific core of the construct (i.e., inferring
(e.g., impulsivity). Thereby, the conflation of different constructs could be avoided. Second,
Indeed, demonstrating that these two maladaptive variants of mentalizing impairment can be
increase confidence in construct validity. For example, the CAMSQ (Müller et al., 2021) was
recently introduced as a measure that focuses on the core definition of inferring mental states
of the self and others provided by Fonagy and colleagues (2016). Initial results for the
CAMSQ suggest that it captures maladaptive levels of both too little and too much certainty
about mental states that could be interpreted as forms of hypo- and hypermentalizing.
Conclusion
The RFQ is regularly used to study the concept of mentalizing. Herein, we have
outlined critical considerations regarding the validity of the RFQ and provided empirical
evidence to support the critique. Findings indicate that the RFQ is a unidimensional measure
that reflects hypomentalizing pertaining to the self but is also conflated with content related
to emotional lability and impulsivity. Thus, researchers and mental health professionals alike
should be rather cautious in using the RFQ for theory testing and individual assessment.
References
Back, S., Zettl, M., Bertsch, K., & Taubner, S. (2020). Persönlichkeitsniveau, maladaptive
https://doi.org/10.1007/s00278-020-00445-7
Badoud, D., Luyten, P., Fonseca-Pedrero, E., Eliez, S., Fonagy, P., & Debbané, M. (2015).
The French version of the Reflective Functioning Questionnaire: Validity data for
adolescents and adults and its association with non-suicidal self-injury. PLoS ONE, 10,
e0145892. https://doi.org/10.1371/journal.pone.0145892
Badoud, D., Prada, P., Nicastro, R., Germond, C., Luyten, P., Perroud, N., & Debbané, M.
https://doi.org/10.1521/pedi_2017_31_283
adults with Asperger syndrome or high functioning autism, and normal sex
https://doi.org/10.1023/b:jadd.0000022607.19833.00
Bateman, A., & Fonagy, P. (2016). Mentalization-based treatment for personality disorders:
Bateman, A. W., & Fonagy, P. (Eds.). (2019). Handbook of mentalizing in mental health
Bender, D. S., Morey, L. C., & Skodol, A. E. (2011). Toward a model for assessing level of
Carver, C. S., Johnson, S. L., & Timpano, K. R. (2017). Toward a functional view of the p
https://doi.org/10.1177/2167702617710037
Cyders, M. A., & Smith, G. T. (2008). Emotion-based dispositions to rash action: Positive
https://doi.org/10.1037/a0013341
de Meulemeester, C., Lowyck, B., Vermote, R., Verhaest, Y., & Luyten, P. (2017).
https://doi.org/10.1016/j.psychres.2017.09.061
de Meulemeester, C., Vansteelandt, K., Luyten, P., & Lowyck, B. (2018). Mentalizing as a
Ehrenthal, J. C., Dinger, U., Schauenburg, H., Horsch, L., Dahlbender, R. W., & Gierk, B.
https://doi.org/10.13109/zptm.2015.61.3.262
Euler, S., Nolte, T., Constantinou, M., Griem, J., Montague, P. R., Fonagy, P., & Personality
https://doi.org/10.1521/pedi_2019_33_427
Flora, D. B. (2020). Your coefficient alpha is probably wrong, but which coefficient omega is
https://doi.org/10.1177/2515245920951747
Fonagy, P., Luyten, P., Allison, E., & Campbell, C. (2017). What we have changed our minds
https://doi.org/10.1186/s40479-017-0061-9
Fonagy, P., Luyten, P., Moulton-Perkins, A., Lee, Y.-W., Warren, F., Howard, S., Ghinai, R.,
e0158678. https://doi.org/10.1371/journal.pone.0158678
Franke, H. (2000). The Brief Symptom Inventory – Deutsche Version. Manual. Göttingen:
Beltz.
Grant, A. M., Franklin, J., & Langford, P. (2002). The self-reflection and insight scale: A
Gratz, K. L., & Roemer, L. (2004). Multidimensional assessment of emotion regulation and
Horowitz, L. M., Alden, L. E., Kordy, H., & Strauß, B. (2000). Inventar zur Erfassung
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure
https://doi.org/10.1080/10705519909540118
Huang, Y. L., Fonagy, P., Feigenbaum, J., Montague, P. R., Nolte, T., & Mood Disorder
https://doi.org/10.1111/j.1467-6494.1993.tb00783.x
King, K. M., Feil, M. C., & Halvorson, M. A. (2018). Negative urgency is correlated with the
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief
https://doi.org/10.1046/j.1525-1497.2001.016009606.x
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new
64, 258–266.
Li, E. T., Carracher, E., & Bird, T. (2020). Linking childhood emotional abuse and adult
depressive symptoms: The role of mentalizing incapacity. Child Abuse & Neglect, 99,
104253. https://doi.org/10.1016/j.chiabu.2019.104253
Liu, Y., Millsap, R. E., West, S. G., Tein, J. Y., Tanaka, R., & Grimm, K. J. (2017). Testing
https://psycnet.apa.org/doi/10.1037/met0000075
Luyten, P., Campbell, C., Allison, E., & Fonagy, P. (2020). The mentalizing approach to
psychopathology: State of the art and future directions. Annual Review of Clinical
Lynam, D. R., Smith, G. T., Cyders, M. A., Fischer, S., & Whiteside, S. P. (2007). The
Morandotti, N., Brondino, N., Merelli, A., Boldrini, A., De Vidovich, G. Z., Ricciardo, S.,
Abbiati, V., Ambrosi, P., Cavercasi, E., Fonagy, P., & Luyten, P. (2018). The Italian
version of the Reflective Functioning Questionnaire: Validity data for adults and its
association with severity of borderline personality disorder. PLoS ONE, 13, e0206433.
https://doi.org/10.1371/journal.pone.0206433
Associates Publishers.
Morin, A. J., Myers, N. D., & Lee, S. (2020). Modern Factor Analytic Techniques: Bifactor
https://doi.org/10.1002/9781119568124.ch51
Müller, S., Wendt, L. P., & Zimmermann, J. (2021, May 19). Development and Validation of
Muthén, L. K., & Muthén, B. O. (1998-2019). Mplus User’s Guide. 8th Edition. Los Angeles,
Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute
https://doi.org/10.3758/BRM.40.2.457
Nimon, K., Oswald, F., & Roberts, J. K. (2021). yhat: Interpreting Regression Effects. R
Settles, R. E., Fischer, S., Cyders, M. A., Combs, J. L., Gunn, R. L., & Smith, G. T. (2012).
Simonsohn, U. (2018). Two lines: A valid alternative to the invalid testing of U-shaped
Spitzer, C., Zimmermann, J., Brähler, E., Euler, S., Wendt, L. P., & Müller, S. (2021). Die
1234-6317
Thielmann, I., & Hilbig, B. E. (2019). Nomological consistency: A comprehensive test of the
Weekers, L. C., Hutsebaut, J., & Kamphuis, J. H. (2019). The Level of Personality
Functioning Scale-Brief Form 2.0: Update of a brief instrument for assessing level of
https://doi.org/10.1002/pmh.1434
World Health Organization. (1998). Wellbeing measures in primary health care / The
Zimmermann, J., Altenstein, D., Krieger, T., Grosse Holtforth, M., Pretsch, J., Alexopoulos,
J., Spitzer, C., Benecke, C., Krueger, R. F., Markon, K. E., & Leising, D. (2014). The
518–540. https://doi.org/10.1521/pedi_2014_28_130
Zimmermann, J., Benecke, C., Hörz, S., Rentrop, M., Peham, D., Bock, A., Wallner, T.,
Schauenburg, H., Frommer, J., Huber, D., Clarkin, J. F., & Dammann, G. (2013).
https://doi.org/10.1026/0012-1924/a000076
Zimmermann, J., Müller, S., Bach, B., Hutsebaut, J., Hummelen, B., & Fischer, F. (2020). A
Table 1
Bivariate Correlations in Sample 1 at Admission
BSI .54
Table 2
Bivariate Correlations in Sample 2
Table 3
Bivariate Correlations in Sample 3
RFQ-6 (1) (2) (3) (4) (5) (6) (7) (8) (9)
(7) PID-5 Emotional Lability .60 .59 .61 -.36 -.53 -.08 -.08
(8) PAI-BOR Affective Instability .61 .67 .56 -.42 -.53 -.16 -.17 .72
(9) UPPS-P Negative Urgency .71 .63 .59 -.43 -.54 -.18 -.16 .62 .69
(10) DERS Impulse Control Difficulties .63 .65 .61 -.41 -.57 -.13 -.15 .66 .69 .68
Table 4
Bivariate Correlations Between the Items of the RFQ and Further Measures in Sample 3
RFQ-8
(1) LPFS-BF .40 .60 .51 .43 .45 .59 .41 .52
(2) IPO-16 .39 .54 .51 .42 .48 .57 .22 .54
(3) CAMSQ Self-Certainty -.27 -.50 -.38 -.29 -.31 -.49 -.58 -.38
(4) SRIS Self-Insight -.41 -.65 -.50 -.39 -.42 -.62 -.51 -.52
(5) CAMSQ Other-Certainty -.31 -.23 -.15 -.17 -.11 -.22 -.30 -.16
(6) EQ Cognitive Empathy -.36 -.25 -.15 -.14 -.11 -.23 -.30 -.12
(7) PID-5 Emotional Lability .26 .47 .48 .40 .42 .49 .24 .56
(8) PAI-BOR Affective Instability .24 .50 .50 .50 .41 .53 .34 .54
(9) UPPS-P Negative Urgency .25 .56 .66 .68 .51 .62 .31 .59
(10) DERS Impulse Control Difficulties .29 .50 .58 .51 .41 .54 .30 .52
Note. N = 862. All correlations are statistically significant at p < .001. The strongest correlation for each RFQ item on a
Figure 1
Empirical Distributions of Item 6 Before and After Applying the Scoring Procedure Outlined
really knowing why”) in Sample 1. Panel A shows the univariate distribution of raw scores on
the 7-point scale, Panel B shows the univariate distributions of the rescaled scores on the 4-
point scales, and Panel C shows the bivariate distribution of the rescaled scores.
Figure 2
Standardized Parameter Estimates of the Factor Models in all Samples
Clinical Sample 1 (GER) Non-Clinical Sample 2 (GER) Non-Clinical Sample 3 (US)
Note. All estimates are standardized. Loadings smaller than .30 are grayed out. In Sample 1,
only the parameter estimates for admission are displayed. 1 = Unidimensional CFA model,
2 = Two-dimensional CFA model, 3 = Two-dimensional EFA model; RFQ_C = Certainty
About Mental States; RFQ_U = Uncertainty About Mental States; η = RFQ factor.
Figure 3
Bifactor Exploratory Structural Equation Model in Sample 1 at Admission (Panel A) and
Sample 2 (Panel B)
Note. All estimates are standardized. Intercepts, thresholds, and non-target loadings are not
displayed. Non-target loadings were all < .30. Target loadings < .30 are indicated by dashed lines.
o1-o12 = OPD-SQS items; i1-i16 = IPO-16 items; r1-r6 = RFQ-6 items; p1-p25 = PID-5-BF items;
OPD RS = Relationship; OPD CT = Contact; OPD SP = Self-Perception; IPO RT = Reality-
Testing; IPO PD = Primitive Defenses; IPO ID = Identity Diffusion; NEG = Negative Affectivity;
DET = Detachment; ANT = Antagonism; DIS = Disinhibition; PSY = Psychoticism; gPD = General
Personality Pathology.
*p < .05
Online Supplement
Note S1
Further Information About Sample 1
In Sample 1, participants varied with regard to their educational background with 18%
holding a university degree, 30% holding the higher education entrance qualification (i.e., the
highest German school degree; “Abitur”), and 48% reporting 10 years of schooling or less.
Patients were diagnosed using clinical judgements based on the tenth edition of the
International Classification of Diseases (ICD-10; World Health Organization [WHO], 2004).
The most frequent group of diagnoses was that of depressive disorders (93%), followed by
anxiety disorders (42%), personality disorders (39%), substance use disorders (20%), and
somatoform disorders (18%). Personality disorders (PD) included borderline PD (11%),
avoidant PD (11%), dependent PD (2%), and PD not otherwise specified (15%). As is
commonly found, patients exhibited a high level of comorbidity (82% more than one
diagnosis, 54% more than two, 21% more than three, 5% more than four).
Note S2
Further Information About Sample 2
In Sample 2, participants had a rather high level of education with 41% of participants
holding a university degree and an additional 37% holding the “Abitur” certificate.
Participants indicated via self-report whether they were acutely suffering from a mental health
condition. Twenty percent of the participants reported being affected by a mental disorder.
Of these, more than half indicated an affective disorder (52%), followed by borderline
personality disorder (16%) and post-traumatic stress disorder (9%). Because psychiatric
inpatients of the university hospital, which specializes in the treatment of severe trauma,
may also consult the university’s website, some of these inpatients may have been recruited
into the study via this route.
Note S3
Bifactor ESEM Model Specification
In an exploratory bifactor measurement model (i.e., bifactor EFA), every indicator loads on
a general factor and, at the same time, on at least one of two or more specific factors. In
contrast to classic bifactor models (i.e., bifactor CFA; Holzinger & Swineford, 1937), in
which each indicator loads on exactly one specific factor, indicators are allowed to load on
multiple specific factors. The bifactor
ESEM used here involves a unidimensional confirmatory measurement model for the RFQ and
an exploratory measurement model with orthogonal target rotation for indicators of personality
dysfunction. In Sample 1, the indicators of IPO-16 and OPD-SQS were target rotated such that
the loadings of indicators were targeted towards a general factor reflecting general personality
dysfunction (gPD) and were also targeted towards their respective specific factors in alignment
with the six assumed content domains or scales of the measures (i.e., IPO-16: identity diffusion,
primitive defenses, and reality testing; Zimmermann et al., 2013; OPD-SQS: self-perception,
contact, and relationship; Ehrenthal et al., 2015). In Sample 2, the indicators of PID-5-BF were
target rotated in the same vein towards a general factor (gPD) and the five PID domains as
specific factors (i.e., negative affectivity, antagonism, detachment, disinhibition, psychoticism;
APA, 2013).
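The target specification described above can be illustrated with a minimal Python sketch for the Sample 2 case (25 PID-5-BF items, one general factor, five specific factors). The item-to-domain assignment used here (consecutive blocks of five items) is a placeholder assumption for illustration only and does not reproduce the actual PID-5-BF scoring key; the helper name `build_target_matrix` is likewise hypothetical. `None` marks a loading that the rotation leaves free, and `0.0` marks a non-target loading that the rotation pulls toward zero.

```python
# Sketch of a target pattern for an orthogonal bifactor target rotation.
# ASSUMPTION: items are assigned to domains in consecutive blocks of five;
# this is a placeholder, not the real PID-5-BF scoring key.
DOMAINS = ["NEG", "DET", "ANT", "DIS", "PSY"]
N_ITEMS = 25


def build_target_matrix(n_items, domains, assignment):
    """Return {item: {factor: target}}.

    None -> loading is left free (targeted factor),
    0.0  -> loading is rotated toward zero (non-target factor).
    """
    factors = ["gPD"] + list(domains)
    target = {}
    for item in range(1, n_items + 1):
        row = {}
        for factor in factors:
            # Every item targets the general factor and its own domain.
            if factor == "gPD" or factor == assignment[item]:
                row[factor] = None
            else:
                row[factor] = 0.0
        target[f"p{item}"] = row
    return target


# Hypothetical assignment: p1-p5 -> NEG, p6-p10 -> DET, and so on.
assignment = {item: DOMAINS[(item - 1) // 5] for item in range(1, N_ITEMS + 1)}
target = build_target_matrix(N_ITEMS, DOMAINS, assignment)

# Show the pattern for the first item of each block.
for name in ("p1", "p6", "p11", "p16", "p21"):
    print(name, target[name])
```

In this layout, each row has two free entries (gPD and one specific factor) and four zero targets, which mirrors the idea that non-target loadings are estimated but kept small rather than fixed to zero.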
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (DSM-5®).
American Psychiatric Publishing.
Ehrenthal, J. C., Dinger, U., Schauenburg, H., Horsch, L., Dahlbender, R. W., & Gierk, B. (2015). Entwicklung
einer Zwölf-Item-Version des OPD-Strukturfragebogens (OPD-SFK) [Development of a 12-item version
of the OPD-Structure Questionnaire (OPD-SQS)]. Zeitschrift für Psychosomatische Medizin und
Psychotherapie, 61, 262–274. https://doi.org/10.13109/zptm.2015.61.3.262
Holzinger, K. J., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2, 41–54.
Zimmermann, J., Benecke, C., Hörz, S., Rentrop, M., Peham, D., Bock, A., Wallner, T., Schauenburg, H.,
Frommer, J., Huber, D., Clarkin, J. F., & Dammann, G. (2013). Validierung einer deutschsprachigen 16-
Item-Version des Inventars der Persönlichkeitsorganisation (IPO-16) [Validity of a German 16-item-
version of the Inventory of Personality Organization (IPO-16)]. Diagnostica, 59, 3–16.
https://doi.org/10.1026/0012-1924/a000076
Note S4
Further Information About Sample 3
Note S5
Commonality Analysis Using the RFQ-8
Using the first set of variables, the association between the RFQ-8 and LPFS-BF (R² =
.48) was decomposed into variance shared with measures of mentalizing (ΔR² = .25, 95% CI
[.16; .30], p < .001), variance shared with measures of emotional lability and impulsivity
alone (ΔR² = .20, 95% CI [.15; .32], p < .001), and variance uniquely explained by LPFS-BF
(ΔR² = .03, 95% CI [.02; .05], p < .001). In the second set of variables, the association
between the RFQ-8 and IPO-16 (R² = .42) was decomposed into variance shared with
measures of mentalizing (ΔR² = .32, 95% CI [.23; .37], p < .001), variance shared with
measures of emotional lability and impulsivity alone (ΔR² = .06, 95% CI [.04; .18], p < .001),
and variance uniquely explained by IPO-16 (ΔR² = .03, 95% CI [.01; .05], p < .001). On
average, across all 32 combinations of predictors, the association between the RFQ-8 and
broad indicators of personality dysfunction (mean R² = .44) was decomposed into variance
shared with measures of mentalizing (mean ΔR² = .28), variance shared with measures of
emotional lability and impulsivity alone (mean ΔR² = .13), and variance uniquely explained
by the indicators of personality dysfunction (mean ΔR² = .03). Thus, 63% of the observed
associations between the RFQ-8 and indicators of personality dysfunction were due to variance
shared with other measures of mentalizing, whereas 30% were due to variance shared with
measures of emotional lability and impulsivity alone and 7% were unique to measures of
personality dysfunction.
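As a worked check of the arithmetic in this note, the following Python sketch converts the mean variance components into shares of the mean total. It uses only the two-decimal values reported above, so the shares can differ from the reported 63%, 30%, and 7% by about one percentage point; the reported percentages were presumably computed from unrounded estimates, which is an assumption on our part.

```python
# Mean variance components as reported in Note S5 (rounded to two decimals).
components = {
    "shared with mentalizing measures": 0.28,
    "shared with emotional lability/impulsivity alone": 0.13,
    "unique to personality dysfunction measures": 0.03,
}
mean_r2 = 0.44  # mean squared association between RFQ-8 and the indicators


def shares(parts, total):
    """Express each variance component as a proportion of the total."""
    return {label: value / total for label, value in parts.items()}


component_sum = sum(components.values())
result = shares(components, mean_r2)

for label, share in result.items():
    print(f"{label}: {share:.1%}")
print(f"sum of components: {component_sum:.2f} (reported mean R^2: {mean_r2:.2f})")
```

The same check applied to the two individual variable sets gives .25 + .20 + .03 = .48 for LPFS-BF and .32 + .06 + .03 = .41 for IPO-16, the latter matching the reported .42 only up to rounding.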
Table S1
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Criterion Measures in Sample 1 at Admission
RFQ_C -.92
WHO-5 -.23 .18 -.24 -.51 -.29 -.65 -.35 -.22 -.38
Table S2
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Criterion Measures in Sample 2
RFQ_C -.91
Psychoticism (PSY) .62 -.56 .59 .82 .52 .52 .44 .50
Table S3
Bivariate Correlations Between RFQ-8, RFQ_C, RFQ_U, and Further Measures in Sample 3
                                         RFQ-8   (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)  (10)  (11)
(7) CAMSQ Other-Certainty                 -.28   .31   .16  -.23  -.09   .56   .28
(8) EQ Cognitive Empathy                  -.28   .29  -.18  -.27  -.11   .42   .28   .73
(9) PID-5 Emotional Lability               .59  -.49   .53   .59   .61  -.36  -.53  -.08  -.08
(10) PAI-BOR Affective Instability         .63  -.53   .56   .67   .56  -.42  -.53  -.16  -.17   .72
(11) UPPS-P Negative Urgency               .74  -.69   .59   .63   .59  -.43  -.54  -.18  -.16   .62   .69
(12) DERS Impulse Control Difficulties     .65  -.55   .57   .65   .61  -.41  -.57  -.13  -.15   .66   .69   .68
Figure S1
Originally Proposed Two-Dimensional CFA Model Using Double-Scoring in Sample 1 at Admission
Note. N = 861. χ²(51) = 490.05, CFI = .95, RMSEA = .10, SRMR = .15. ω = .88/.73 (RFQ_C/RFQ_U).
The model was estimated using the WLSMV estimator. Correlated errors were specified following
recommendations by Fonagy et al. (2016). Intercepts, residual variances, and thresholds are not displayed.
Figure S2
Originally Proposed Two-Dimensional CFA Model Using Double-Scoring in Sample 2
Note. N = 566. χ²(51) = 313.46, CFI = .95, RMSEA = .10, SRMR = .15. ω = .88/.78 (RFQ_C/RFQ_U).
The model was estimated using the WLSMV estimator. Correlated errors were specified following
recommendations by Fonagy et al. (2016). Intercepts, residual variances, and thresholds are not displayed.
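The RMSEA values in the notes to Figures S1 and S2 can be approximately recovered from the reported χ², degrees of freedom, and sample size. The Python sketch below uses the common point-estimate formula RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))). This is an assumption about the formula: some software divides by N instead of N − 1, and with the WLSMV estimator the reported χ² is a scaled statistic, so the result is an approximation rather than a reproduction of the authors' computation.

```python
import math


def rmsea(chi2, df, n):
    """Point estimate of RMSEA from a chi-square statistic.

    ASSUMPTION: uses the (N - 1) denominator; some programs use N instead.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# (chi-square, degrees of freedom, N) as reported in the figure notes.
fits = {
    "Sample 1": (490.05, 51, 861),
    "Sample 2": (313.46, 51, 566),
}

estimates = {name: rmsea(*values) for name, values in fits.items()}
for name, value in estimates.items():
    print(f"{name}: RMSEA = {value:.3f}")
```

Both estimates round to .10, in line with the values given in the notes.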
Figure S3
Lowess-Smoothed Regression Curves of the Association Between the RFQ-6 and Criterion Measures in Sample 1 at Admission
Figure S4
Lowess-Smoothed Regression Curves of the Association Between the RFQ-6 and Criterion Measures in Sample 2
Figure S5
Two-Lines Test of the Association Between the RFQ-6 and Criterion Measures in Sample 1 at Admission
Figure S6
Two-Lines Test of the Association Between the RFQ-6 and Criterion Measures in Sample 2
Figure S7
Bifactor Exploratory Structural Equation Model in Sample 1 at Admission (Panel A) and
Sample 2 (Panel B) Using All Eight Items of the RFQ
Note. All estimates are standardized. Intercepts, thresholds, and non-target loadings are not
displayed. Non-target loadings were all < .30. Target loadings < .30 are indicated by dashed lines.
Variable names are the same as in Figure 3. In Sample 1, model fit was CFI = .94, RMSEA = .06,
SRMR = .04. In Sample 2, model fit was CFI = .96, RMSEA = .05, SRMR = .04.
*p < .05