The Healthcare Complaints Analysis Tool: Development and Reliability Testing of A Method For Service Monitoring and Organisational Learning
Gillespie A, Reader TW. BMJ Qual Saf 2016;25:937–946.

Department of Social Psychology, London School of Economics, London, UK

Correspondence to: Dr Alex Gillespie, Department of Social Psychology, London School of Economics, London WC2A 2AE, UK; [email protected]

Received 13 July 2015; Revised 22 October 2015; Accepted 25 October 2015; Published Online First 6 January 2016

▸ Additional material is published online only. To view please visit the journal online (http://dx.doi.org/10.1136/bmjqs-2015-004596).

ABSTRACT
Background Letters of complaint written by patients and their advocates reporting poor healthcare experiences represent an under-used data source. The lack of a method for extracting reliable data from these heterogeneous letters hinders their use for monitoring and learning. To address this gap, we report on the development and reliability testing of the Healthcare Complaints Analysis Tool (HCAT).

Methods HCAT was developed from a taxonomy of healthcare complaints reported in a previously published systematic review. It introduces the novel idea that complaints should be analysed in terms of severity. Recruiting three groups of educated lay participants (n=58, n=58, n=55), we refined the taxonomy through three iterations of discriminant content validity testing. We then supplemented this refined taxonomy with explicit coding procedures for seven problem categories (each with four levels of severity), stage of care and harm. These combined elements were further refined through iterative coding of a UK national sample of healthcare complaints (n=25, n=80, n=137, n=839). To assess reliability and accuracy for the resultant tool, 14 educated lay participants coded a referent sample of 125 healthcare complaints.

Results The seven HCAT problem categories (quality, safety, environment, institutional processes, listening, communication, and respect and patient rights) were found to be conceptually distinct. On average, raters identified 1.94 problems (SD=0.26) per complaint letter. Coders exhibited substantial reliability in identifying problems at four levels of severity; moderate and substantial reliability in identifying stages of care (except for 'discharge/transfer', which was only fairly reliable) and substantial reliability in identifying overall harm.

Conclusions HCAT is not only the first reliable tool for coding complaints, it is the first tool to measure the severity of complaints. It facilitates service monitoring and organisational learning and it enables future research examining whether healthcare complaints are a leading indicator of poor service outcomes. HCAT is freely available to download and use.

INTRODUCTION
Improving the analysis of complaints by patients and families about poor healthcare experiences (herein termed 'healthcare complaints') is an urgent priority for service providers1–3 and researchers.4 5 It is increasingly recognised that patients can provide reliable data on a range of issues,6–12 and healthcare complaints have been shown to reveal problems in patient care (eg, medical errors, breaching clinical standards, poor communication) not captured through safety and quality monitoring systems (ie, incident reporting, case review and risk management).13–15 Patients are valuable sources of data for multiple reasons. First, patients and families, collectively, observe a huge number of data points within healthcare settings;16 second, they have privileged access to information on continuity of care,17 18 communication failures,19 dignity issues20 and patient-centred care;21 third, once treatment is concluded, they are more free than staff to speak up;22 fourth, they are outside the organisation, thus providing an independent assessment that reflects the norms and expectations of society.23 Moreover, patients and their families filter the data, only writing complaints when a threshold of dissatisfaction has been crossed.24
Unlocking the potential of healthcare complaints requires more than encouraging and facilitating complaint reporting (eg, patients being unclear about how to complain, believing complaints to be ineffective or fearing negative consequences for their healthcare);3 25 it also requires systematic procedures for analysing the complaints, as is the case with adverse event data.4 It has even been suggested that patient complaints might actually precede, rather than follow, safety incidents, potentially acting as an early warning system.5 26 However, any systematic investigation of such potential requires a reliable and valid tool for coding and analysing healthcare complaints. Existing tools lag far behind established methods for analysing adverse events and critical incidents.27–31 The present article answers recent calls to develop a reliable method for analysing healthcare complaints.4 5 31 32

A previous systematic review of 59 articles reporting healthcare complaint coding tools revealed critical limitations with the way healthcare complaints are analysed.26 First, there is no established taxonomy for categorising healthcare complaints. Existing taxonomies differ widely (eg, 40% do not code safety-related data), mix general issues with specific issues, fail to distinguish problems from stages of care and lack a theoretical basis. Second, there is minimal standardisation of the procedures (eg, coding guidelines, training), and no Healthcare Complaints Analysis Tool (HCAT) has been thoroughly tested for reliability (ie, that two coders will observe the same problems within a complaint). Third, analysis of healthcare complaints often overlooks secondary issues in favour of single issues. Finally, despite the varying severity of problems raised (eg, from parking charges to gross medical negligence), existing tools do not assess complaint severity.

To begin addressing these limitations, the previous systematic review26 aggregated the coding taxonomies from the 59 studies, revealing 729 uniquely worded codes, which were refined and conceptualised into seven categories and three broad domains (http://qualitysafety.bmj.com/content/23/8/678/F4.large.jpg). The overarching tripartite distinction between clinical, management and relational domains represents theory and practice on healthcare delivery. The 'clinical domain' refers to the behaviour of clinical staff and relates to the literature on human factors and safety.33–35 The 'management domain' refers to the behaviour of administrative, technical and facilities staff and relates to the literature on health service management.36–38 The 'relationship domain' refers to patients' encounters with staff and relates to the literatures on patient perspectives,39 misunderstandings,40 empathy41 and dignity.20 These domains also have an empirical basis in studies of patient–doctor interaction, where the discourses (or 'voices') of medicine, institutions and patients are evident,41 42 and clashes between the 'system' (clinical and management domains) and 'lifeworld' (relational domain) are observed.43–45 Although the taxonomy developed in the systematic review26 is comprehensive and theoretically informed, it remains a first step. It needs to be extended into a tool, similar to those used in adverse event research,20–22 that can reliably distinguish the types of problem reported, their severity and the stages of care at which they occur.

Our aim is to create a tool that supports healthcare organisations to listen46 to complaints, and to analyse and aggregate these data in order to improve service monitoring and organisational learning. Although healthcare complaints are heterogeneous47 and require detailed redress at an individual level,48 we demonstrate that complaints and associated severity levels can be reliably identified and aggregated. Although this process necessarily loses the voice of individual complainants, it can enable the collective voice of complainants to inform service monitoring and learning in healthcare institutions.

METHOD
Tool development often entails separate phases of development, refinement and testing.49 50 We developed and tested the HCAT through three phases (for which ethical approval was sought and obtained) with the following aims:
1. To test and refine the conceptual validity of the original taxonomy.
2. To develop the refined taxonomy into a comprehensive rating tool, with robust guidelines capable of distinguishing problems, their severity and stages of care.
3. To test the reliability and calibration of the tool.

Phase 1: testing and refining discriminant content validity
Discriminant content validity examines whether a measure (eg, questionnaire item) or code (eg, for categorising data) accurately reflects the construct in terms of content, and whether a number of measures or codes are clearly distinct in terms of content (ie, that they do not overlap).51 To assess whether the categories identified in the original systematic review26 conceptually subsumed the subcategories and whether these categories were distinct from each other, we followed a six-step discriminant content validity procedure.51 First, we listed definitions of the problem categories and their associated domains. Second, we listed the subcategories as the items to be sorted into the categories. Third, we recruited three groups (n=58, n=58, n=55) of non-expert but educated lay participants from a university participant pool (comprising students from a range of degree programmes across London who were paid £5 for 30 min) to perform the sorting exercise. Fourth, participants sorted each of the subcategories into one of the seven problem categories and provided a confidence rating on a scale of 0–10. In addition, we asked participants to indicate whether the subcategory item being sorted was either a 'problem' or a 'stage of
care’. Fifth, we analysed the data to examine the saturation56 (ie, the fourth iteration resulted in
extent to which each subcategory item was sorted minimal revisions).
under their expected category and participants’ confi-
dence. Finally, we used this procedure to revise the
taxonomy through three rounds of testing. Phase 3: testing tool reliability and calibration
To test the reliability and calibration of HCAT, we
Phase 2: tool development through iterative application created a ‘referent standard’ of 125 healthcare com-
To broaden the refined taxonomy into a comprehen- plaints.57 This was a stratified subsample of the 1081
sive tool, we first incorporated coding procedures healthcare complaints described in the previous
established in the literature. To record background section. To construct the referent standard, the
details, we used the codes most commonly reported authors separately coded the letters and then agreed
in the healthcare complaint literature,26 namely: (1) on the most appropriate ratings. Letters were included
who made the complaint (family member, patient or such that the referent standard comprised at least five
unspecified/other), (2) gender of the patient (female, occurrences of each problem at each severity level (ie,
male or unspecified/other) and (3) which staff the so it was possible to test the reliability of coding for
complaint refers to (administrative, medical, nursing all HCAT problems and severity levels). Because
or unspecified/other). To record the stage of care, we healthcare complaints often relate to multiple
adopted the five basic stages of care coded within problem categories (and some are less common than
adverse event reports,52 namely: (1) admissions, (2) others), it was impossible to have a completely
examination and diagnosis, (3) care on the ward, (4) balanced distribution (table 1). These letters were all
operation and procedures and (5) discharge and trans- type written (either letters or emails), digitally
fers. To record harm, we used the UK National scanned, with length varying from 645 characters to
Reporting and Learning System’s risk matrix,53 which 14 365 characters (mean 2680.58, SD 1897.03).
has a five-point scale ranging from minimal harm (1) To test the reliability of HCAT, 14 participants with
to catastrophic harm (5). MSc-level psychology education were recruited from
Next, we aimed to (1) identify the range of severity the host department as ‘raters’ to apply HCAT to the
for each category and identify ‘indicators’ that referent standard. We chose educated non-expert
covered the diversity of complaints within each cat- raters because complaints are routinely coded by edu-
egory, both in terms of content and severity; (2) evalu- cated non-clinical experts, for example, hospital
ate the procedures for coding background details, administrators.26 There are no fixed criteria on the
stage of care and harm and (3) establish clear guide- number of raters required to assess the reliability of a
lines for the coding process as explicit criteria have coding framework,58 59 and a relatively large group of
been linked to inter-rater reliability.54 We used an raters (n=14) was recruited in order to provide a
iterative qualitative approach (repeatedly applying robust test of reliability and better understand any var-
HCAT to healthcare complaints) because it is suited iations in coding. Raters were trained during one of
for creating taxonomies (in our case indicators) that two 5 h training courses (each with seven raters).
ensure a diversity of issues can be covered parsimoni- Training included an introduction to HCAT, applying
ously.55 Also, through experiencing the complexity of HCAT to 10 healthcare complaints (three in a group
coding healthcare complaints, this iterative qualitative setting and seven individually) and receiving feedback.
approach allowed for us to refine both the codes and Raters then had 20 h to work independently to code
the coding guidelines. the 125 healthcare complaints. SPSS Statistics V.21
We used the Freedom of Information Act to obtain and AgreeStat V.3.2 were used to test reliability and
a redacted (ie, all personally identifying information calibration.
removed) random sample (of 7%) of the complaints
received from 52 healthcare conglomerates (termed
Table 1 Distribution of Healthcare Complaints Analysis Tool
‘Trust’) during the period April 2011 to March 2012.
problem severity across the referent standard
This yielded a dataset of 1082 letters, about 1% of
the 107 000 complaints received by NHS Trusts Not
present Low Medium High
during the period. This sample reflects the population (rated 0) (rated 1) (rated 2) (rated 3)
of UK healthcare complaints with a CI of 3 and a con-
fidence level of 95%. Quality 81 10 22 12
The authors then separately coded subsamples of Safety 73 5 24 23
the complaint letters using HCAT, subsequently Environment 101 6 10 8
meeting to discuss discrepancies. Once sufficient Institutional processes 86 10 18 11
insight had been gained, HCAT was revised and Listening 99 5 11 10
another iteration of coding ensued. After four itera- Communication 96 7 14 8
tions (n= 25, n=80, n=137, n=839), the sample of Respect and patient 88 19 13 5
complaints was exhausted, and we had reached rights
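The sampling estimate reported above (a 7% random sample of 1082 letters representing the roughly 107 000 NHS complaints with a CI of 3 at a 95% confidence level) follows from the standard margin-of-error formula for a proportion. A minimal sketch in Python; the function name, the worst-case p=0.5 and the use of a finite population correction are our assumptions for illustration, not details given in the paper:

```python
import math

def margin_of_error(n, population, p=0.5, z=1.96):
    """Margin of error (percentage points) for a sample proportion drawn
    without replacement from a finite population."""
    se = math.sqrt(p * (1 - p) / n)  # worst-case standard error at p=0.5
    fpc = math.sqrt((population - n) / (population - 1))  # finite population correction
    return 100 * z * se * fpc

# 1082 letters sampled from ~107 000 complaints received by NHS Trusts
print(round(margin_of_error(1082, 107_000), 1))  # -> 3.0
```

This reproduces the reported figure: roughly ±3 percentage points at the 95% confidence level.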
First, we used Gwet's AC1 statistic to test among raters the inter-rater reliability of coding for complaint categories and their underlying severity ratings (not present (0), low (1), medium (2) and high (3)).60 61 This test examines the reliability of scoring for two or more coders using a categorical rating scale, taking into account skewed datasets, where there are several categories and one rating occurs at a much higher rate than the others62 (ie, 0s in the current study, because the majority of categories are not present in each letter). Furthermore, quadratic weights were applied, in order that large discrepancies in ratings (ie, between 0 and 3) were treated as more significant in terms of indicating poor reliability than small discrepancies (ie, between 2 and 3).60 Gwet's AC1 test was also applied to test for inter-rater reliability in coding the stages of care complained about. Although Gwet's AC1 is the most appropriate test for the data, we also calculated Fleiss' κ because this is more commonly used and provides a more conservative test (because it ignores the skewed distribution). Finally, because harm was rated as a continuous variable, an intraclass correlation (ICC) coefficient was used to test for reliability. To interpret the coefficients, the following commonly used guidelines60 63 were followed: 0.01–0.20=poor/slight agreement; 0.21–0.40=fair agreement; 0.41–0.60=moderate agreement; 0.61–0.80=substantial agreement and 0.81–1.00=excellent agreement.

Second, we tested whether the 14 raters applied HCAT to the problem categories in a manner consistent with the referent standard (ie, as coded by the authors). Gwet's AC1 (weighted) was calculated by comparing each rater's coding of problem categories and severity against the referent standard and then calculating an average Gwet's AC1 score. The average inter-rater reliability coefficient (ie, across all 14 raters) was then calculated for each problem category in order to provide an overall assessment of calibration. Again, Fleiss' κ was also calculated in order to provide a more conservative test.

RESULTS
Phase 1: discriminant content validity results
The first test of discriminant content validity revealed large differences in the correct sorting of subcategories by participants (range 21%–97%, mean=76.19%, SD=19.35%). There was overlap between 'institutional issues' (bureaucracy, environment, finance and billing, service issues, staffing and resources) and 'timing and access' (access and admission, delays, discharge and referrals). The 'humaneness/caring' category was also problematic, with subcategory items often miscategorised as 'patient rights' or 'communication'. Finally, participants would often classify subcategory items as a 'stage of care'.

Accordingly, we revised the problematic categories and subcategories twice. During these revisions, we removed reference to stages of care (ie, subcategory items 'admissions', 'examinations' and 'discharge'), we merged 'humaneness/caring' into 'respect and patient rights' and, in light of recent literature that emphasises the importance of listening,64 65 we created a new category 'listening' (information moving from patients to staff) as distinct from 'communication' (information moving from staff to patients). Also, we reconceptualised the management domain as 'environment' and 'institutional processes', which proved easier for participants to distinguish. The third and final test of discriminant content validity yielded much improved results, with subcategory items being correctly sorted into the categories and domains on average 85.65% of the time (range, 58%–100%; SD, 10.89%).

Phase 2: creating the HCAT
Applying HCAT to actual letters of healthcare complaint revealed that reliable coding at the subcategory level was difficult. However, while the raters often disagreed at the subcategory level, they agreed at the category level. Accordingly, the decision was made to focus on the reliability of the three domains and seven categories, with the subcategories shaping the severity indicators for each category. This decision to focus on the macro structure of HCAT is consistent with the overall aim of HCAT to identify macro trends rather than to identify and resolve individual complaints.

To develop severity indicators for each category, we iteratively applied the refined taxonomy to four samples (n=25, n=80, n=137, n=839) of healthcare complaints. These sample sizes were determined by the necessity to change some aspects of the tool. The increasing sample sizes reveal that fewer changes were required as the iterative refinement of the tool progressed. Rather than applying an abstract scale of severity, we identified vivid indicators of severity, appropriate to each problem category and subcategory, which should be used to guide coding. Figure 1 reports the final HCAT problem categories and illustrative severity indicators.

The coding procedures for background details, stage of care and harm proved relatively unproblematic to apply. The only modifications necessary included adding an 'unspecified or other' category for stage of care and a harm category '0' for when no information on harm was available.

Resolving disagreements about how to apply HCAT to a specific healthcare complaint led us to the development of a set of guidelines for coding healthcare complaints (box 1). The final version of the HCAT, with all the severity indicators and guidelines, is freely available to download (see online supplementary file). Figure 2 demonstrates applying HCAT to illustrative excerpts.

Phase 3: reliability and calibration results
The results of the reliability analysis are reported in table 2. On average, raters applied 1.94 codes per
letter (SD, 0.26). The Gwet's AC1 coefficients reveal that the problem categories, each with four levels of severity, were reliably coded (ie, with substantial agreement or better). Safety showed least reliability (0.69), and respect and patient rights showed most reliability (0.91). Additional analysis using Fleiss' κ (which takes no account of the skewed data) found moderate to substantial reliability for all problem categories and severity ratings (0.48 (listening)–0.61 (safety, respect and patient rights)). The most significant discrepancies between Gwet's AC1 and Fleiss' κ occur on the items with the largest skew (ie, listening), thus underscoring the problem with Fleiss' κ and our rationale for privileging Gwet's AC1. For stages of care, one showed substantial agreement (care on the ward), three showed moderate agreement (admissions, examination and diagnosis, operation or procedure) and one had only fair agreement (discharge/transfer). Demographic data were coded at substantial reliability or higher. The ICC coefficient also demonstrated harm to be coded reliably (ICC, 0.68; 95% CI 0.62 to 0.75).

Figure 1 The Healthcare Complaints Analysis Tool domains and problem categories with severity indicators for the safety and communication categories.

Box 1 The guidelines for coding healthcare complaints with Healthcare Complaints Analysis Tool
▸ Coding should be based on empirically identifiable text, not on inferences.
▸ No judgement should be made of the intentions of the complainant, their right to complain or the importance they attach to the problems they describe.
▸ Each hospital complaint is assessed for the presence of each problem category, and where a category is not identified, it is coded as not present.
▸ Severity ratings are independent of outcomes (ie, harm) and not comparable across problem categories.
▸ Coding severity should be based on the provided indicators, which reflect the severity distribution within the problem category.
▸ When one problem category is present at multiple levels of severity, the highest level of severity is recorded.
▸ Each problem should be associated with at least one stage of care (a problem can relate to multiple stages of care).
▸ Harm relates exclusively to the harm resulting from the incident being complained about.

The results of the calibration analysis are reported in table 3. Gwet's AC1 scores show raters, on average, to have substantial to excellent reliability against the referent standard. Fleiss' κ scores show substantial agreement (0.62–0.67). Further analysis revealed some raters to be better calibrated (across all categories) against the referent standard than others.

Finally, exploratory analysis indicated that the length of letter (in terms of characters per letter) was negatively associated with reliability in coding for listening (r=−0.266, p<0.01), communication (r=−0.211, p<0.05) and environment (r=−0.202, p<0.05). It was not associated with reliability in coding for respect
and patient rights, institutional processes, safety or quality. Furthermore, there was no relationship between the number of codes applied per letter and the length of the letter.

Figure 2 Applying Healthcare Complaints Analysis Tool to letters of complaint (excerpts are illustrative, not actual). GP, general practitioner.

DISCUSSION
The present article has reported on the development and testing of a tool for analysing healthcare complaints. The aim is to facilitate organisational listening,46 to respond to the ethical imperative to listen to grievances66 and to improve the effectiveness of healthcare delivery by incorporating the voice of patients.4 Many complainants aim to contribute information that will improve healthcare delivery,67 yet to date there has been no reliable tool for aggregating this voice of patients in order to support system-level monitoring and learning.4 5 25 The present article establishes HCAT as capable of reliably identifying the problems, severity, stage of care, and harm reported in healthcare complaints. This tool contributes to the three domains that it monitors.

First, HCAT contributes to monitoring and enhancing clinical safety and quality. It is well documented that existing tools (eg, case reviews, incident reporting) are limited in the type and range of incidents they capture,13 68 and that healthcare complaints are an underused data source for augmenting existing monitoring tools.1 2 4 The lack of a reliable tool for distinguishing problem types and severity has been an obstacle.5 26 HCAT provides a reliable additional data stream for monitoring healthcare safety and quality.69

Second, HCAT can contribute to understanding the relational side of patient experience. Nearly one-third of healthcare complaints relate to the relationship domain,26 and a better understanding of these problems, and how they relate to clinical and management practice, is essential for improving patient satisfaction and perceptions of health services.4 67 These softer aspects of care have proved difficult to monitor,70–72 and again, HCAT can provide a reliable additional data stream.

Third, HCAT can contribute to the management of healthcare. Concretely, HCAT could be integrated into existing complaint coding processes such that the HCAT severity ratings can then be extracted and passed on to managers, external monitors and researchers. HCAT could be used as an alternative metric of success in meeting standards (eg, on hospital hygiene, waiting times, patient satisfaction). It could also be used longitudinally as a means to assess clinical, management or relationship interventions. Additionally, HCAT could be used to benchmark units or regions. Accumulating normative data would allow for healthcare organisations to be compared for deviations (eg, poor or excellent complaint profiles), and
this would facilitate interorganisational learning (eg, sharing practice).73

Across these three domains, HCAT can bring into decision-making the distinctive voice of patients, providing an external perspective (eg, in comparison with staff and incident reports) on the culture of healthcare organisations. For example, where safety culture is poor (and thus incident reporting likely to be low), the analysis of complaints can provide a benchmark that is independent of that poor culture.

Finally, one of the main innovations of HCAT is the ability to reliably code severity within each complaint category. To date, analysis of healthcare complaints has been limited to frequency of problem occurrence (regardless of severity). This effectively penalises institutions that actively solicit complaints to improve quality; it might be that the optimum complaint profile is a high percentage of low-severity complaints, as this would demonstrate that the institution facilitates complaints and has managed to protect against severe failures.

Table 3 Average calibration of raters (n=14) against the referent standard

HCAT problem categories      Average Gwet's AC1  Range         Average Fleiss' κ  Range
Quality                      0.79                0.59 to 0.88  0.62               0.45 to 0.77
Safety                       0.76                0.69 to 0.83  0.68               0.49 to 0.78
Environment                  0.89                0.73 to 0.94  0.67               0.49 to 0.78
Institutional processes      0.84                0.73 to 0.89  0.63               0.58 to 0.72
Listening                    0.89                0.82 to 0.94  0.62               0.52 to 0.77
Communication                0.86                0.72 to 0.93  0.62               0.41 to 0.76
Respect and patient rights   0.91                0.87 to 0.94  0.65               0.51 to 0.72

p<0.001 for all tests. HCAT, Healthcare Complaints Analysis Tool.

Future research
Having a reliable tool for analysing healthcare complaints paves the way for empirically examining recent suggestions that healthcare complaints might be a leading indicator of outcome variables.4 5 There is already evidence that complaints predict individual outcomes;74 the next question is whether a pattern of complaints can predict organisation-level outcomes. For example: Do severe clinical complaints correlate with hospital-level mortality or safety incidents? Might complaints about management correlate with waiting times? Do relationship complaints correlate with patient satisfaction? If any such relationships are found, then the question will become whether healthcare complaints are leading or lagging indicators.
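For readers wishing to reproduce the reliability logic outside SPSS and AgreeStat, the contrast between Gwet's AC1 and Fleiss' κ on skewed distributions (the rationale given in the Method section for privileging AC1) can be demonstrated with a short sketch. This is a hypothetical, unweighted implementation assuming an equal number of raters per letter; the study itself applied quadratic weights via AgreeStat, so this illustrates the behaviour of the two statistics rather than the paper's exact computation:

```python
from collections import Counter

def agreement_stats(ratings, categories):
    """Gwet's AC1 and Fleiss' kappa for subjects each rated by the same
    number of raters. `ratings` is a list of per-subject rating lists."""
    n, r, q = len(ratings), len(ratings[0]), len(categories)
    counts = [Counter(row) for row in ratings]  # raters per category, per subject
    # Observed agreement: proportion of agreeing rater pairs, averaged over subjects
    pa = sum(c[k] * (c[k] - 1) for c in counts for k in categories) / (n * r * (r - 1))
    # Marginal proportion of all ratings falling in each category
    pi = {k: sum(c[k] for c in counts) / (n * r) for k in categories}
    pe_ac1 = sum(pi[k] * (1 - pi[k]) for k in categories) / (q - 1)  # chance agreement (Gwet)
    pe_kappa = sum(pi[k] ** 2 for k in categories)                   # chance agreement (Fleiss)
    return (pa - pe_ac1) / (1 - pe_ac1), (pa - pe_kappa) / (1 - pe_kappa)

# Skewed data: 20 letters, 2 raters; the problem is 'not present' (0) in nearly all
ratings = [[0, 0]] * 18 + [[0, 1]] * 2
ac1, kappa = agreement_stats(ratings, categories=[0, 1])
print(f"AC1={ac1:.2f}, kappa={kappa:.2f}")  # -> AC1=0.89, kappa=-0.05
```

On this deliberately skewed example, raw agreement is 90%, yet Fleiss' κ falls below zero (the well-known κ paradox) while AC1 remains high, mirroring the discrepancies reported above for heavily skewed categories such as 'listening'.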