Data Management
Overview
Without clear and feasible data collection procedures, the ability of the trial
to answer its question is compromised.
Good clinical data management (CDM) ensures that data collected as part of
clinical trials is high-quality, reliable, and statistically sound.
Data capture
Data processing
Data maintenance and integrity
Data analysis
Adhering to good clinical practice means that there should also be systems
in place to securely store all information collected in connection with the
trial and to maintain the confidentiality of trial participants.
Define data to be collected
These considerations will drive the methods used to collect the data for the
trial. Before moving on to that, we need to think about how the data will be
stored and managed throughout the trial.
Clinical Trial Databases
You are probably all familiar with Microsoft Excel, and in small-scale or
pilot studies with few variables, this may be an option. However, MS Excel
is a spreadsheet program rather than a database, and it is not ideal for
clinical trial data management for a number of reasons, including limited
linkage capabilities, difficulty managing multiple measurements, and a lack
of ongoing tracking of access and changes to the data.
“(a) Ensure and document that the electronic data processing system(s)
conforms to the sponsor’s established requirements for completeness,
accuracy, reliability, and consistent intended performance (i.e.
validation).
(c) Ensure that the systems are designed to permit data changes in such
a way that the data changes are documented and that there is no
deletion of entered data (i.e. maintain an audit trail, data trail, edit
trail).
(e) Maintain a list of the individuals who are authorized to make data
changes.
(g) Safeguard the blinding, if any (e.g. maintain the blinding during
data entry and processing).”
Most clinical trials are managed using relational databases. A relational
database is a collection of spreadsheet-like, two-dimensional tables in
which the rows correspond to individual records or study participants and
the columns correspond to the different characteristics or attributes of these
entities. In each table, there is a single column or combination of columns
that uniquely identifies a participant/row. Remember, in the trial database,
this ‘identifier’ (usually a study ID) should also maintain study participants’
anonymity. The list linking study IDs to personal identifying information
(eg, names and addresses) should be stored separately from the trial
database.
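As a minimal sketch of the idea above, the following uses Python's built-in sqlite3 module to keep a de-identified trial table keyed by study ID, with the identifying information held in a separate store. The table names, column names and example values are hypothetical and are only there for illustration.

```python
import sqlite3

# Trial database: holds only de-identified data, keyed by study ID.
trial_db = sqlite3.connect(":memory:")
trial_db.execute(
    "CREATE TABLE participants ("
    " study_id TEXT PRIMARY KEY,"   # uniquely identifies each row
    " age_years INTEGER,"
    " current_smoker INTEGER)"
)

# Separate store for the list linking study IDs to personal details.
# In practice this would live in a different, access-restricted location.
linkage_db = sqlite3.connect(":memory:")
linkage_db.execute(
    "CREATE TABLE identity_key ("
    " study_id TEXT PRIMARY KEY,"
    " full_name TEXT,"
    " address TEXT)"
)

trial_db.execute("INSERT INTO participants VALUES (?, ?, ?)", ("S001", 54, 0))
linkage_db.execute(
    "INSERT INTO identity_key VALUES (?, ?, ?)",
    ("S001", "Example Person", "1 Example Street"),
)

# The primary key stops the same study ID being entered twice.
try:
    trial_db.execute("INSERT INTO participants VALUES (?, ?, ?)", ("S001", 60, 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Column names of the trial table: no names or addresses appear here.
trial_columns = [row[1] for row in trial_db.execute("PRAGMA table_info(participants)")]
```

Keeping the two stores apart means that someone with access to the trial data alone cannot identify a participant from it.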
Creating the ‘relationships’ between the tables of the trial database means
that the data can be more easily maintained (separate tables can exist for
specific measures or specific time points, ensuring that no single data table
becomes unwieldy). When it comes to organising, filtering or querying the
data, or creating datasets for analysis, the individual tables are linked in
a robust manner.
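To illustrate how linked tables can be combined into an analysis dataset, here is a small sketch using sqlite3. The tables (participants, pain_scores), the column names and the values are assumptions made for this example, not part of any particular trial system.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relationship

db.execute("CREATE TABLE participants (study_id TEXT PRIMARY KEY, arm TEXT)")
db.execute(
    "CREATE TABLE pain_scores ("
    " study_id TEXT REFERENCES participants(study_id),"
    " visit INTEGER,"
    " pain INTEGER,"
    " PRIMARY KEY (study_id, visit))"  # one row per participant per visit
)

db.executemany(
    "INSERT INTO participants VALUES (?, ?)",
    [("S001", "A"), ("S002", "B")],
)
db.executemany(
    "INSERT INTO pain_scores VALUES (?, ?, ?)",
    [("S001", 1, 3), ("S001", 2, 2), ("S002", 1, 1)],
)

# Build an analysis dataset by joining the two tables on study ID.
analysis_rows = db.execute(
    "SELECT p.study_id, p.arm, s.visit, s.pain "
    "FROM participants AS p "
    "JOIN pain_scores AS s ON s.study_id = p.study_id "
    "ORDER BY p.study_id, s.visit"
).fetchall()

# A measurement for an unknown participant is rejected by the foreign key.
try:
    db.execute("INSERT INTO pain_scores VALUES (?, ?, ?)", ("S999", 1, 2))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Dedicated trial database software would normally do this through its user interface; the SQL is shown only to make the linkage visible.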
If you purchase relational database software for use in your trial, it will
generally have user interfaces built in to allow you to do all of the data
management needed for your study and it won’t be necessary to be able to
program in SQL to manipulate your database.
Database set up
If not enabled through table headings, the first row (or rows) of a database
generally hold variable names for each column. These names need to be
short, but sufficiently self-explanatory.
If the trial data need to be kept in one ‘flat’ table, statistical analysis software
programs are very capable of reorganising data between long and wide
formats depending upon need.
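As a rough sketch of what reorganising between long and wide formats involves, the plain-Python example below converts repeated pain measurements in both directions. Statistical packages have their own built-in commands for this; the variable names (study_id, visit, pain, pain_v1 and so on) are hypothetical.

```python
# Long format: one row per participant per visit.
long_rows = [
    {"study_id": "S001", "visit": 1, "pain": 3},
    {"study_id": "S001", "visit": 2, "pain": 2},
    {"study_id": "S002", "visit": 1, "pain": 1},
]


def long_to_wide(rows):
    """Return one row per participant, with one pain column per visit."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row["study_id"], {"study_id": row["study_id"]})
        record[f"pain_v{row['visit']}"] = row["pain"]
    return list(wide.values())


def wide_to_long(rows):
    """Return one row per participant per visit from the wide layout."""
    prefix = "pain_v"
    long = []
    for row in rows:
        for name, value in row.items():
            if name.startswith(prefix):
                long.append(
                    {
                        "study_id": row["study_id"],
                        "visit": int(name[len(prefix):]),
                        "pain": value,
                    }
                )
    return long


wide_rows = long_to_wide(long_rows)
round_trip = wide_to_long(wide_rows)
```

Note that a participant with a missing visit simply has no column for that visit in the wide layout here; real software would usually fill it with a missing-value code.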
Data type
Categorical
- Nominal
- Ordinal
Numeric
- Discrete
- Continuous
When entering nominal data into a database, ideally these are coded as
numbers. For example in the ‘current smoker’ example above, ‘No’ could
be coded as ‘0’ and ‘Yes’ could be coded as ‘1’ (generally we designate a
positive response as 1 and a negative response as 0). For ordinal variables
(eg, pain categorised as ‘mild’, ‘moderate’ or ‘severe’), conventionally we
use ascending numbers to designate increasing order of responses (eg
1=mild, 2=moderate, 3=severe). The data dictionary should indicate what
each number within a categorical variable designates.
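The coding scheme described above can be sketched as a small data dictionary that maps each numeric code to its label. The variable names (current_smoker, pain_category) and the helper functions are hypothetical; the codes follow the conventions given in the text.

```python
# Data dictionary: for each categorical variable, what each code designates.
DATA_DICTIONARY = {
    "current_smoker": {0: "No", 1: "Yes"},                     # nominal
    "pain_category": {1: "mild", 2: "moderate", 3: "severe"},  # ordinal
}


def encode(variable, label):
    """Return the numeric code for a recorded label, ignoring case and spacing."""
    wanted = label.strip().lower()
    for code, text in DATA_DICTIONARY[variable].items():
        if text.lower() == wanted:
            return code
    raise ValueError(f"{label!r} is not a valid response for {variable}")


def decode(variable, code):
    """Return the label that a numeric code designates."""
    return DATA_DICTIONARY[variable][code]


smoker_code = encode("current_smoker", "Yes")
pain_code = encode("pain_category", " Severe ")
pain_label = decode("pain_category", 2)
```

Raising an error for an unrecognised response, rather than guessing, means unexpected entries are flagged at data entry instead of being silently miscoded.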
There are also FDA guidelines for capture of data from electronic health
records.
CRFs
Case report forms (CRFs) are by ICH GCP definition: “A printed, optical or
electronic document designed to record all the protocol-required
information to be reported to the sponsor on each trial subject”.
Think about from whom, where, and how, the data is being collected:
- Clinician-completed forms?
The CRF should be laid out to logically follow how data is collected in
practice (it should not be necessary to skip back and forth between
pages to complete the form)
There are three common methods for collecting data and getting it into the
trial database.
Paper-based CRFs (with manual data entry into the study database)
Paper-based CRFs that are designed to be scanned and directly entered
into a database using Optical Character Recognition software (OCR)
Electronic CRFs that enable direct entry/export into the trial database
Which method is chosen for a trial will depend on the study setting and
resources, but there are some issues to consider for each that will impact on
data quality and study workload.
CRF design principles
If collecting data at multiple sites, are the units used for particular
measures consistent across all sites?
e.g., in Australia we use mmol/l for total plasma cholesterol, but in the USA
they use mg/dl. The absolute numbers between these two measures are very
different. Does the CRF need to cater to more than one unit of
measurement?
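If a CRF does accept more than one unit, the values need to be harmonised before analysis. The sketch below assumes the CRF records the unit alongside each value and uses the commonly cited factor of about 38.67 mg/dl per mmol/l for total cholesterol. The site labels and function name are hypothetical, and the factor should be confirmed against the laboratory's own reference.

```python
# Commonly cited factor for total cholesterol: 1 mmol/l is about 38.67 mg/dl.
# Treat this as an assumption and confirm it with the trial laboratory.
MG_DL_PER_MMOL_L = 38.67


def cholesterol_to_mmol_per_l(value, unit):
    """Convert a total cholesterol value to mmol/l from the recorded unit."""
    normalised = unit.strip().lower()
    if normalised == "mmol/l":
        return float(value)
    if normalised == "mg/dl":
        return float(value) / MG_DL_PER_MMOL_L
    raise ValueError(f"Unsupported unit: {unit!r}")


# Hypothetical values from two sites, each recorded with its own unit.
site_values = [("AU-01", 5.2, "mmol/l"), ("US-01", 200, "mg/dl")]

harmonised = {
    site: round(cholesterol_to_mmol_per_l(value, unit), 2)
    for site, value, unit in site_values
}
```

Rejecting unknown units outright is a deliberate choice here: a value with an unrecognised or missing unit is better raised as a query than converted on a guess.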
Consider: On how many days in the past week have you experienced knee
pain? ____ days
Paper-based CRFs
In some cases, a paper-based CRF may be the best option; for example, a
trial might be mailing a self-administered questionnaire to study
participants or data are being collected in resource-limited settings. Whether
entering paper-based CRF data manually or through OCR scanning, some
common principles apply:
There are many software tools now available to support data management
in clinical trials. In the case of multi-centre trials collecting large amounts
of data, such systems (whether commercial or custom built) are essential.
Some are open source (e.g. OpenClinica), while there are also individually
customisable commercial systems (e.g. Oracle Clinical One).
https://bit.ly/3sI3TuP
CRFs are designed specifically for the trial and enable the collection of data
in a protocol-driven, controlled manner. This is not necessarily done in a
way that is consistent with routine clinical practice, i.e., it is often extra
work for clinicians, nurses or other staff.
Another option for data collection for a trial is the extraction of data from
existing sources (eg, electronic health records, clinical registries, health
administrative databases, insurance claims). This has the advantage of
reducing data collection burden, but may limit what data are available to the
trial, and may not be entirely consistent across multiple centres.
Private clouds can be used for trial data management, but given that these
are generally run by third parties, researchers must ensure:
*There are rules about trans-border transfer of personal data under the
Australian privacy laws
Data validation
Once you have extracted your trial data from the CRF, there are a few more
steps prior to analysis.
Discrepancies can arise due to inconsistent data, missing data, data that are
out of range (eg, excessive age) and deviations from the protocol.
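The kinds of discrepancy listed above can be picked up with simple programmed checks. The following is a minimal sketch only: the field names, the permitted age range and the example records are all hypothetical, and a real trial would define its checks in a data validation plan.

```python
from datetime import date

# Hypothetical eligibility range for age, for illustration only.
AGE_RANGE = (18, 100)
REQUIRED_FIELDS = ("study_id", "age_years", "enrolment_date", "visit_date")


def find_discrepancies(record):
    """Return a list of human-readable issues found in one record."""
    issues = []

    # Missing data
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            issues.append(f"missing: {field}")

    # Out-of-range data
    age = record.get("age_years")
    if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        issues.append(f"out of range: age_years={age}")

    # Inconsistent data: a visit cannot precede enrolment
    enrolled = record.get("enrolment_date")
    visited = record.get("visit_date")
    if enrolled is not None and visited is not None and visited < enrolled:
        issues.append("inconsistent: visit_date before enrolment_date")

    return issues


records = [
    {"study_id": "S001", "age_years": 54,
     "enrolment_date": date(2023, 3, 1), "visit_date": date(2023, 3, 15)},
    {"study_id": "S002", "age_years": 154,
     "enrolment_date": date(2023, 3, 1), "visit_date": date(2023, 2, 1)},
    {"study_id": "S003", "age_years": None,
     "enrolment_date": date(2023, 3, 1), "visit_date": date(2023, 3, 8)},
]

queries = {r["study_id"]: find_discrepancies(r) for r in records}
```

Each flagged issue would then be raised as a query with the site for resolution, rather than being corrected automatically.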
Just as we would have an audit trail for the data collection in the first
instance, there would also be an audit trail for any amendments to the data
arising from resolution of discrepancies – the trial must be able to track who
has made changes to the data and for what reason.
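One way to picture such an audit trail is a record that never overwrites a value without logging who changed it, when, why, and what the old value was. The class below is a sketch under those assumptions; the class name, field names and user label are hypothetical, and validated trial systems provide this functionality themselves.

```python
from datetime import datetime, timezone


class AuditedRecord:
    """A participant record whose amendments are logged, never silently made."""

    def __init__(self, study_id, values):
        self.study_id = study_id
        self._values = dict(values)
        self.audit_trail = []  # append-only list of amendments

    def get(self, field):
        return self._values.get(field)

    def amend(self, field, new_value, user, reason):
        """Change a value, recording who changed it, when, and why."""
        if not reason.strip():
            raise ValueError("A reason is required for every amendment")
        self.audit_trail.append(
            {
                "field": field,
                "old_value": self._values.get(field),  # original value is retained
                "new_value": new_value,
                "user": user,
                "reason": reason,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
        )
        self._values[field] = new_value


record = AuditedRecord("S001", {"age_years": 154})
record.amend(
    "age_years", 54,
    user="data_manager_1",
    reason="Transcription error confirmed against source document",
)
```

Because the old value is kept in the trail, the entered data is never deleted, only superseded.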
Data coding
If a critical data issue arises after database lock, the data may be modified,
but this requires a clear audit trail and documentation.