Background The increased adoption of the internet, social media, wearable devices, e-health services, and other
technology-driven services in medicine and healthcare has led to the rapid generation of various types of digital data,
providing a valuable data source beyond the confines of traditional clinical trials, epidemiological studies, and
lab-based experiments.
Methods We provide a brief overview of the types and sources of real-world data and the common models and
approaches to utilize and analyze real-world data. We discuss the challenges and opportunities of using real-world
data for evidence-based decision making. This review does not aim to be comprehensive or cover all aspects of the
intriguing topic of RWD (from both the research and practical perspectives) but serves as a primer and provides
useful sources for readers who are interested in this topic.
Results and Conclusions Real-world data hold great potential for generating real-world evidence for designing and
conducting confirmatory trials and answering questions that may not be addressed otherwise. The volume and
complexity of real-world data also call for the development of more appropriate, sophisticated, and innovative data
processing and analysis techniques while maintaining scientific rigor in research findings, and for attention to data
ethics, to harness the power of real-world data.
Keywords Real-world data (RWD), Real-world evidence (RWE), Electronic health records, Machine learning, Artificial
intelligence, Causal inference
Fig. 1 RWD Types and Sources (source: Fig. 1 in [16] with written permission by Dr. Brandon Swift to use the figure)
vaccination [3–5], to model localized COVID-19 control strategies [6], to characterize COVID-19 and flu using data from smartphones and wearables [7], to study behavioral and mental health changes in relation to the lockdown of public life [8], and to assist in decision and policy making, among others.
In what follows, we provide a brief review of the types and sources of RWD (Section 2) and the common models and approaches to utilize and analyze RWD (Section 3), and discuss the challenges and opportunities of using RWD for evidence-based decision making (Section 4). This review does not aim to be comprehensive or cover all aspects of the intriguing topic of RWD (from both the research and practical perspectives) but serves as a primer and provides useful sources for readers who are interested in this topic.

Characteristics, types and applications of RWD
RWD have several characteristics as compared to data collected from randomized trials in controlled settings. First, RWD are observational as opposed to data gathered in a controlled setting. Second, many types of RWD are unstructured (e.g., texts, imaging, networks) and at times inconsistent due to entry variations across providers and health systems. Third, RWD may be generated in a high-frequency manner (e.g., measurements at the millisecond level from wearables), resulting in voluminous and dynamic data. Fourth, RWD may be incomplete and lack key endpoints for an analysis, given that the data were not originally collected for that purpose. For example, claims data usually do not have clinical endpoints; registry data have limited follow-ups. Fifth, RWD may be subject to bias and measurement errors (random and non-random). For example, data generated from the internet, mobile devices, and wearables can be subject to selection bias; an RWD dataset may be an unrepresentative sample of the underlying population that a study intends to understand; claims data are known to contain fraudulent values. In summary, RWD are messy, incomplete, heterogeneous, and subject to different types of measurement errors and biases. A systematic scoping review of the literature suggests that the data quality of RWD is not consistent and that quality assessments are challenging due to the complex and heterogeneous nature of these data. The sub-optimal data quality of RWD is well recognized [9–12]; how to improve it (e.g., to regulatory grade) is a work in progress [13–15].
There are many different types of RWD. Figure 1 [16] provides a list of the RWD types and sources in medicine. We also refer readers to [11] for a comprehensive overview of the RWD data types. Here we use a few common RWD types, i.e., EHRs, registry data, claims data, patient-reported outcome (PRO) data, and data collected
from wearables, as examples to demonstrate the variety of RWD and the purposes for which they can be used.
EHRs are collected as part of routine care across clinics, hospitals, and healthcare institutions. EHR data are typical RWD: noisy, heterogeneous, both structured and unstructured (e.g., text, imaging), and dynamic; they require careful and intensive pre-processing efforts [17]. EHRs have created unprecedented opportunities for data-driven approaches to learn patterns, make new discoveries, and assist preoperative planning, diagnostics, and clinical prognostication, among others [18–27]; to improve predictions for selected outcomes, especially if linked with administrative and claims data and analyzed with proper machine learning techniques [27–30]; and to validate and replicate findings from clinical trials [31].
Registry data are of various types. For example, product registries include patients who have been exposed to a biopharmaceutical product or a medical device; health services registries consist of patients who have had a common procedure or hospitalization; and disease registries contain information about people diagnosed with a specific type of disease. Registry data enable the identification and sharing of best clinical practices, improve the accuracy of estimates, and provide valuable data for supporting regulatory decision-making [32–35]. Especially for rare diseases, where clinical trials are often of small size and data are subject to high variability, registries provide a valuable data source to help understand the course of a disease and provide critical information for confirmatory clinical trial design and translational research to develop treatments and improve patient care [34, 36, 37]. Readers may refer to [38] for a comprehensive overview of registry data and how they help the understanding of patient outcomes.
Claims data refer to data generated during the processing of healthcare claims in health insurance plans or from practice management systems. Although claims data are collected and stored primarily for payment purposes, they have been used in healthcare to understand patients’ and prescribers’ behavior and how they interact; to estimate disease prevalence; to learn about disease progression, disease diagnosis, medication usage, and drug-drug interactions; and to validate and replicate findings from clinical trials [31, 39–46]. A known pitfall of claims data is fraud, such as upcoding (see footnote 1) [47], on top of the common data characteristics of RWD. The data fraud problem can be mitigated with detailed audits and the adoption of modern statistical, data mining and ML techniques for fraud detection [48–51].
1 Upcoding refers to a provider seeking additional reimbursement from insurance by coding a service it provided as a more expensive service than what was actually performed.
PRO data refer to data reported directly by patients on their health status. PRO data have been used to provide RWE on the effectiveness of interventions, symptom monitoring, and relationships between exposure and outcomes, among others [52–55]. PRO data are subject to recall bias and large inter-individual variability.
Wearable devices generate continuous streams of data. When combined with contextual data (e.g., location data, social media), they provide an opportunity to conduct expansive research studies that are large in scale and scope [56] and that would otherwise be infeasible in controlled trials. Examples of using wearable RWD to generate RWE include applications in neuroscience and environmental health [57–60]. Wearables generate huge amounts of data. Advances in data storage, real-time processing capabilities and efficient battery technology will be essential for the full utilization of wearable data.

Using and analyzing RWD
A wide range of research methods are available to make use of RWD. In what follows, we outline a few approaches, including pragmatic clinical trials, target trial emulation, and applications of ML and AI techniques.
Pragmatic clinical trials are trials designed to test the effectiveness of an intervention in the real-world clinical setting. Pragmatic trials leverage the increasingly integrated healthcare system and may use data from EHRs, claims, patient reminder systems, telephone-based care, etc. Due to the data characteristics of RWD, new guidelines and methodologies are being developed to mitigate bias in RWE generated from RWD for decision making and causal inference, especially for per-protocol analysis [61, 62]. The research question under investigation in pragmatic trials is whether an intervention works in real life, and the trials are designed to maximize the applicability and generalizability of the intervention. Various types of outcomes can be measured in these trials, but they are mostly patient-centered, rather than the typical measurable symptoms or markers in explanatory trials. For example, the ADAPTABLE trial [63, 64] is a high-profile pragmatic trial and is the first large-scale, EHR-enabled clinical trial conducted within the U.S. It used EHR data to identify around 450,000 patients with established atherosclerotic cardiovascular disease (CVD) for recruitment and eventually enrolled about 15,000 individuals at 40 clinical centers, who were randomized to two aspirin dose arms. Electronic patient follow-up for patient-reported outcomes was completed every 3 to 6 months, with a median follow-up of 26.2 months, to determine the optimal dosage of aspirin in
CVD patients, with the primary endpoint being the composite of all-cause mortality, hospitalization for non-fatal myocardial infarction, or hospitalization for a non-fatal stroke. The cost of ADAPTABLE is estimated to be only 1/5 to 1/2 of that of a traditional RCT of that scale.
Target trial emulation is the application of trial design and analysis principles from (target) randomized trials to the analysis of observational data [65]. By precisely specifying the target trial’s inclusion/exclusion criteria, treatment strategies, treatment assignment, causal contrast, outcomes, follow-up period, and statistical analysis, one may draw valid causal inferences about an intervention from RWD. Target trial emulation can be an important tool, especially when comparative evaluation is not yet available or feasible in randomized trials. For example, [66] employs target trial emulation to evaluate real-world COVID-19 vaccine effectiveness, measured by protection against COVID-19 infection or related death, in racially and ethnically diverse, elderly populations by comparing newly vaccinated persons with matched unvaccinated controls using data from the US Department of Veterans Affairs health care system. The emulated trial was conducted with clearly defined inclusion/exclusion criteria and identification of matched controls, including matching based on propensity scores with careful selection of model covariates. Target trial emulation has also been used to evaluate the effect of colon cancer screening on cancer incidence over eight years of follow-up [67], and the risk of urinary tract infection among diabetic patients [68].
RWD can also be used as historical controls and reference groups for controlled trials, with assessment of the quality and appropriateness of the RWD and employment of proper statistical approaches for analyzing the data [69]. Controlling for selection bias and confounding is key to the validity of this approach because of the lack of randomization and potentially unrecognized baseline differences, and the control group needs to be comparable with the treated group. RWD also provide a great opportunity to study rare events given the volume of the data [70–72]. These studies also highlight the need for improving RWD quality, developing surrogate endpoints, and standardizing data collection for outcome measures in registries.
In terms of analysis of RWD, statistical models and inferential approaches are necessary for making sense of RWD, obtaining causal relationships, testing/validating hypotheses, and generating regulatory-grade RWE to inform policymakers and regulators in decision making, just as in the controlled trial settings. In fact, the motivation for, and the design and analysis principles of, pragmatic trials and target trial emulation are to support causal inference, with more innovative methods beyond the traditional statistical methods used to adjust for potential confounders and improve the capabilities of RWD for causal inference [73–76].
ML techniques are getting increasingly popular and are powerful tools for predictive modeling. First, modern ML techniques are very capable of dealing with voluminous, messy, multi-modal, and various unstructured data types without strong assumptions about the distribution of data. For example, deep learning can learn abstract representations of large, complex, and unstructured data; natural language processing (NLP) and embedding methods can be used to process texts and clinical notes in EHRs and transform them to real-valued vectors for downstream learning tasks. Second, new and more powerful ML techniques are being developed rapidly, due to the high demand and the large group of researchers attracted to the field. Third, much open-source code (e.g., on GitHub) and many software libraries (e.g., TensorFlow, PyTorch, Keras) are available to facilitate the implementation of these techniques. Indeed, ML has enjoyed a rapid surge in the last decade or so for a wide range of applications in RWD, outperforming more conventional approaches [77–85]. For example, ML is widely applied in health informatics to generate RWE and formulate personalized healthcare [86–90] and was successfully employed on RWD collected during the COVID-19 pandemic to help understand the disease and evaluate its prevention and treatment strategies [91–95]. It should be noted that ML techniques are largely used for prediction and classification (e.g., disease diagnosis), variable selection (e.g., biomarker screening), data visualization, etc., rather than for generating regulatory-level RWE; but this may change soon as regulatory agencies are aggressively evaluating ML/AI for generating RWE and engaging stakeholders on the topic [96–99].
It would be more effective and powerful to combine the expertise from statistical inference and ML when it comes to generating RWE and learning causal relationships. One of the recent methodological developments is indeed in that direction: leveraging the advances in semi-parametric and empirical process theory and incorporating the benefits of ML into comparative effectiveness research using RWD. A well-known framework is targeted learning [100–102], which has been successfully applied in causal inference for dynamic treatment rules using EHR data [103] and in evaluating the efficacy of COVID-19 treatments [104], among others.
Regardless of which area an RWD project focuses on (causal inference, or prediction and classification), it is critical that the RWD be representative of the population to which the conclusions of the project will be generalized. Otherwise, estimation or prediction can be misleading or even harmful. The information in RWD
might not be adequate to validate the appropriateness of the data for generalization; in that case, the investigators should resist the temptation to generalize to groups that they are unsure about.

Challenges and opportunities
Various challenges, from data gathering to data quality control to decision making, still exist in all stages of an RWD life cycle despite all the excitement around the transformative potential of RWD. We list some of the challenges below, where plenty of opportunities for improvement exist and greater efforts are needed to harness the power of RWD.
Data quality: RWD are now often used for purposes other than those for which they were originally collected and thus may lack information on critical endpoints and not always be positioned for generating regulatory-grade evidence. On top of that, RWD are messy, heterogeneous, and subject to various measurement errors, all of which contribute to the lower quality of RWD compared to data from controlled trials. As a result, the accuracy and precision of results based on RWD are negatively impacted, and misleading results or false conclusions can be generated. While these do not preclude the use of RWD in evidence generation and decision making, data quality issues need to be consistently documented and addressed as much as possible through data cleaning and pre-processing (e.g., imputation to fill in missing values, over-sampling for imbalanced data, denoising, combining disparate pieces of information across databases, etc.). If an issue cannot be addressed during the pre-processing stage, efforts should be made to correct it during data analysis, or caution should be used when interpreting the results. Early engagement of key stakeholders (e.g., regulatory agencies if needed, research institutes, industries, etc.) is encouraged to establish data quality standards and reduce unforeseen risks and issues.
Efficient and practical ML and statistical procedures: The fast growth of digital medical data and the influx of workforce and investment into the field also drive the rapid development and adoption of modern statistical procedures and ML algorithms to analyze the data. The availability of open-source platforms and software greatly facilitates the application of the procedures in practice. On the other hand, the noisiness, heterogeneity, incompleteness, and imbalance of RWD may cause considerable under-performance of the existing statistical and ML procedures and demand new procedures that specifically target RWD and can be effectively deployed in the real world. Further, the availability of open-source platforms and software and the accompanying convenience, while offered with good intentions, also increase the chance of practitioners misusing the procedures if they are not equipped with proper training or an understanding of the principles of the techniques before applying them to real-world situations. In addition, to maintain scientific rigor during the RWE generation process from RWD, results from statistical and ML procedures would require medical validation, either using expert knowledge or by conducting reproducibility and replicability studies, before they are used for decision making in the real world [105].
Explainability and interpretability: Modern ML approaches are often employed in a black-box fashion, and there is a lack of understanding of the relationships between input and output and of causal effects. Model selection, parameter initialization, and hyper-parameter tuning are also often conducted in a trial-and-error manner, without domain expert input. This is in contrast to the medical and healthcare field, where interpretability is critical to building patient/user trust, and doctors are unlikely to use technology that they don’t understand. Promising and encouraging research work on this topic has already started [106–111], but more research is warranted.
Reproducibility and replicability: Reproducibility and replicability (see footnote 2) are major principles in scientific research, RWD included. If an analytical procedure is not robust and its output is not reproducible or replicable, the public would call into question the scientific rigor of the work and doubt the conclusions of an RWD-based study [113–115]. Result validation, reproducibility, and replicability can be challenging given the messiness, incompleteness, and unstructured nature of RWD, but need to be established, especially considering that the generated evidence could be used towards regulatory decisions and affect the lives of millions of people. Irreproducibility can be mitigated by sharing raw and processed data and code, assuming no privacy is compromised in this process. For replicability, given that RWD are not generated from controlled trials and every data set may have its own unique data characteristics, complete replicability can be difficult or even infeasible. Nevertheless, detailed documentation of data characteristics and pre-processing, pre-registration of analysis procedures, and adherence to open science principles (e.g., code repositories [116]) are critical for replicating findings on different RWD datasets, assuming they come from the same underlying population. Readers may refer to [117–119] for more suggestions and discussions on this topic.
2 Reproducibility refers to “instances in which the original researcher’s data and computer codes are used to regenerate the results” and replicability refers to “instances in which a researcher collects new data to arrive at the same scientific findings as a previous study.” [112]
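The documentation and sharing practices mentioned for reproducibility can be made concrete by storing a small manifest alongside each analysis. The Python sketch below is illustrative only: the function names (`fingerprint`, `preprocess`, `bootstrap_mean`, `run_analysis`), the toy records, and the manifest fields are assumptions introduced here, not part of any cited study or standard. It records hashes of the raw and processed data, the random seed, and the Python version, and checks that a rerun with the same inputs regenerates the same result.

```python
import hashlib
import json
import platform
import random


def fingerprint(rows):
    """Return a SHA-256 digest of the rows, serialized in a stable key order."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def preprocess(rows):
    """Hypothetical pre-processing step: drop records with a missing outcome."""
    return [r for r in rows if r["outcome"] is not None]


def bootstrap_mean(rows, seed, n_boot=200):
    """Bootstrap estimate of the mean outcome, driven by an explicit seed."""
    rng = random.Random(seed)
    values = [r["outcome"] for r in rows]
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


def run_analysis(raw_rows, seed):
    """Run the toy pipeline and return the estimate together with a manifest."""
    clean = preprocess(raw_rows)
    estimate = bootstrap_mean(clean, seed)
    manifest = {
        "raw_data_sha256": fingerprint(raw_rows),
        "processed_data_sha256": fingerprint(clean),
        "n_raw": len(raw_rows),
        "n_processed": len(clean),
        "seed": seed,
        "python_version": platform.python_version(),
    }
    return estimate, manifest


# Made-up toy records; not taken from any real data source.
raw = [
    {"id": 1, "outcome": 0.8},
    {"id": 2, "outcome": None},
    {"id": 3, "outcome": 1.4},
    {"id": 4, "outcome": 1.1},
]

estimate_a, manifest_a = run_analysis(raw, seed=2024)
estimate_b, manifest_b = run_analysis(raw, seed=2024)
print(json.dumps(manifest_a, indent=2))
print("identical rerun:", estimate_a == estimate_b and manifest_a == manifest_b)
```

A manifest of this kind addresses reproducibility in the sense of footnote 2 only; replicability on newly collected RWD still depends on the documentation and pre-registration practices described above. When raw data cannot be shared for privacy reasons, sharing the hashes and the code may still let collaborators with authorized access confirm that they are working from the same inputs.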
Privacy: Ethical issues exist when an RWD project is implemented, among which privacy is a commonly discussed topic. Information in RWD is often sensitive, such as medical histories, disease status, financial situations, and social behaviors, among others. Privacy risk can increase dramatically when different databases (e.g., EHR, wearables, claims) are linked together, a common practice in the analysis of RWD. Data users and policymakers should make every effort to ensure that RWD collection, storage, sharing, and analysis follow established data privacy principles (i.e., lawfulness, fairness, purpose limitation, and data minimization). In addition, privacy-enhancing technology and privacy-preserving data sharing and analysis can be deployed, for which there already exist plenty of effective and well-accepted state-of-the-art concepts and approaches, such as differential privacy (see footnote 3) [120] and federated learning (see footnote 4) [121, 122]. Investigators and policymakers may consider integrating these concepts and technologies when collecting and analyzing RWD and disseminating the results and RWE from the RWD.
3 Differential privacy: randomized procedures are used to guarantee individual privacy when releasing information.
4 Federated learning enables local devices to collaboratively learn a shared model while keeping all training data on the local devices without sharing, mitigating privacy risks.
Diversity, Equity, Algorithmic fairness, and Transparency (DEAT): DEAT is another important ethical issue to consider in an RWD project. RWD may contain information from various demographic groups, which can be used to generate RWE with improved generalizability compared to data collected in controlled settings. On the other hand, certain types of RWD may be heavily biased and unbalanced toward a certain group, not as diverse or inclusive, and in some cases may even exacerbate disparity (e.g., wearables and access to facilities and treatment may be limited to certain demographic groups). Greater effort will be needed to gain access to RWD from underrepresented groups and to effectively take into account the heterogeneity in RWD while being mindful of the limitations for diversity/equity. This topic also relates to algorithmic fairness, which aims at understanding and preventing bias in ML models. Algorithmic fairness is an increasingly popular research topic in the literature [123–127]. Incorrect and misleading conclusions may be drawn if the trained models systematically disadvantage a certain group (e.g., a trained algorithm might be less likely to detect cancer in black patients than white patients or in men than women). Transparency means that information and communication concerning the processing of personal data must be easily accessible and easy to understand. Transparency ensures that data contributors are aware of how their data are being used and for what purposes, and that decision-makers can evaluate the quality of the methods and the applicability of the generated RWE [128–131]. Being transparent when working with RWD is critical for building trust among the key stakeholders during an RWD life cycle (individuals who supply the data, those who collect and manage the data, data curators who design studies and analyze the data, and decision and policy makers).
The above challenges are not isolated but rather connected, as depicted in Fig. 2. Data quality affects the performance of statistical and ML procedures; data sources and the cleaning and pre-processing process relate to result reproducibility and replicability. How data are analyzed and which statistical and ML procedures are used have an impact on reproducibility and replicability; whether privacy-preserving procedures are used during data collection and analysis, and how information is shared and released, relate to data privacy, DEAT, and explainability and interpretability, which can in turn affect which ML procedures to apply and the development of new ML techniques.

Conclusions
RWD provide a valuable and rich data source beyond the confines of traditional epidemiological studies, clinical trials, and lab-based experiments, with lower cost in data collection compared to the latter. If used and analyzed appropriately, RWD have the potential to generate valid and unbiased RWE with savings in both cost and time, compared to controlled trials, and to enhance the efficiency of medical and health-related research and decision-making. Procedures that improve the quality of
