
Data management

Overview

This week we examine how trial data are collected and how they should be handled.

We examine another important practical issue in clinical trial conduct: Clinical Data Management (CDM).

Without clear and feasible data collection procedures, the ability of the trial
to answer its question is compromised.

Good CDM ensures that the data collected as part of clinical trials are high-quality, reliable, and statistically sound.

By the end of this week, you will be able to:

1. recognise the role of good data management in trial design and outcomes
2. understand the principles of clinical data management
Data management - Main phases
The data management process for clinical trials can be conceptualised into
four phases:

Data capture
Data processing
Data maintenance and integrity
Data analysis

The first three phases can be further broken down into:

1. Consent and confidentiality
2. Define data to be collected
3. Data storage (database software)
4. Collection and entry
5. Data validation
6. Management procedures
Consent and confidentiality

Clear and robust methods need to be in place for obtaining informed consent from participants, and documentation of that consent is required, which needs to be kept as part of the study records.

Adhering to good clinical practice means that there should also be systems in place for all information collected in association with the trial to be stored securely and for the confidentiality of trial participants to be maintained.
Define data to be collected

It is important to clearly identify the study endpoints that allow investigators to answer the research questions underpinning the trial, to define precisely what data need to be collected for the endpoints that will be analysed at the end of the trial, and to consider how those data will be collected.

What are the primary and secondary endpoints?
What measures need to be made, or what data need to be obtained, to collect these endpoints in a valid and reliable way that complies with regulatory and ethical standards?
If the study is multi-centre across different countries, are the endpoints, and the methods for obtaining the endpoint data, acceptable and consistent for all concerned regulatory/clinical authorities?
What other data need to be collected for the study? For example, covariates, demographic data, or data to enable translational analyses arising from the trial (eg cost data).
From where are the data being collected? Directly? From secondary sources? Via data linkage?

These considerations will drive the methods used to collect the data for the
trial. Before moving on to that, we need to think about how the data will be
stored and managed throughout the trial.
Clinical Trial Databases

You are probably all familiar with Microsoft Excel, and in small-scale or pilot studies with few variables this may be an option. However, MS Excel is a spreadsheet program rather than a database, and it is not ideal for clinical trial data management for a number of reasons, including limited linkage capabilities, difficulty managing multiple measurements, and limited ongoing tracking of access and changes to the data.

The International Conference on Harmonisation Good Clinical Practice (ICH-GCP) guidelines (section 5.5) outline the responsibilities of the sponsor when using electronic systems for trial data handling:

“(a) Ensure and document that the electronic data processing system(s)
conforms to the sponsor’s established requirements for completeness,
accuracy, reliability, and consistent intended performance (i.e.
validation).

(b) Maintains SOPs for using these systems.

(c) Ensure that the systems are designed to permit data changes in such
a way that the data changes are documented and that there is no
deletion of entered data (i.e. maintain an audit trail, data trail, edit
trail).

(d) Maintain a security system that prevents unauthorized access to the data.

(e) Maintain a list of the individuals who are authorized to make data
changes.

(f) Maintain adequate backup of the data.

(g) Safeguard the blinding, if any (e.g. maintain the blinding during
data entry and processing).”
Most clinical trials are managed using relational databases. A relational
database is a collection of spreadsheet-like, two-dimensional tables in
which the rows correspond to individual records or study participants and
the columns correspond to the different characteristics or attributes of these
entities. In each table, there is a single column or combination of columns
that uniquely identifies a participant/row. Remember, in the trial database,
this ‘identifier’ (usually a study ID) should also maintain study participants’
anonymity. The list linking study IDs to personal identifying information
(eg, names and addresses) should be stored separately from the trial
database.

In a relational database, the column (or combination of columns) that serves as the unique identifier is the table's 'primary key'. A table may also include a column (or group of columns) that is the primary key of another table in the database; this is called a 'foreign key'. Including a foreign key in a table creates a relationship between the current table and the other table for which the foreign key is the primary key. The concept of a relational database as a collection of related tables is a useful analogy (although strictly speaking that is not why they are called relational databases; 'relation' is the term used to denote a table with a primary key).

Creating the 'relationships' between the tables of the trial database means that the data can be more easily maintained (separate tables can exist for specific measures or specific time points, ensuring that no single table becomes unwieldy), and when it comes to organising, filtering or querying the data, or creating datasets for analysis, the individual tables are linked in a robust manner.

Structured Query Language (SQL) is the standard programming language that underpins relational databases.

https://bit.ly/3tOivKG

If you purchase relational database software for use in your trial, it will
generally have user interfaces built in to allow you to do all of the data
management needed for your study and it won’t be necessary to be able to
program in SQL to manipulate your database.
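
To make the primary key/foreign key idea concrete, below is a minimal sketch using Python's built-in sqlite3 module. The table names, columns and values are illustrative assumptions, not a prescribed trial schema; a purpose-built clinical data management system would normally create and manage these structures for you.

```python
# Minimal sketch of a relational structure for a trial database, using
# Python's built-in sqlite3 module. Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
cur = conn.cursor()

# One row per participant; study_id is the primary key.
cur.execute("""
    CREATE TABLE participants (
        study_id   TEXT PRIMARY KEY,
        sex        TEXT,
        year_birth INTEGER
    )
""")

# One row per visit; (study_id, visit_no) is the primary key, and study_id
# is also a foreign key referencing the participants table.
cur.execute("""
    CREATE TABLE visits (
        study_id TEXT NOT NULL,
        visit_no INTEGER NOT NULL,
        sbp_mmhg REAL,
        PRIMARY KEY (study_id, visit_no),
        FOREIGN KEY (study_id) REFERENCES participants (study_id)
    )
""")

cur.execute("INSERT INTO participants VALUES ('P001', 'F', 1970)")
cur.execute("INSERT INTO visits VALUES ('P001', 1, 132.0)")

# The foreign key relationship lets the tables be joined when building
# datasets for analysis.
cur.execute("""
    SELECT p.study_id, p.sex, v.visit_no, v.sbp_mmhg
    FROM participants p
    JOIN visits v ON v.study_id = p.study_id
""")
print(cur.fetchall())
conn.close()
```
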
Database set up

If not handled through table headings, the first row (or rows) of a database table generally holds the variable names for each column. These names need to be short, but sufficiently self-explanatory.

Detailed information about the variables should be maintained in a study data dictionary. The data dictionary should allow anyone unfamiliar with the study data or methodology to understand the dataset.
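
A data dictionary can be kept in whatever form the study team finds easiest to maintain (a controlled document, a spreadsheet, or a table within the data management system). The sketch below holds a few hypothetical entries in Python purely to illustrate the kind of information each entry should record; the variable names, codes and ranges are assumptions.

```python
# Illustrative data dictionary entries (variable names, codes and ranges are
# hypothetical). In practice this would live in a controlled document or in
# the data management system itself.
data_dictionary = [
    {"variable": "study_id", "label": "Participant study ID",
     "type": "text", "values": "unique per participant", "units": None},
    {"variable": "smoker", "label": "Current smoker",
     "type": "nominal", "values": "0 = No, 1 = Yes", "units": None},
    {"variable": "sbp_mmhg", "label": "Systolic blood pressure",
     "type": "continuous", "values": "plausible range 70-250", "units": "mmHg"},
]

for entry in data_dictionary:
    print(f"{entry['variable']}: {entry['label']} ({entry['type']})")
```
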

In clinical trials, participants will commonly have repeated measures (eg, they come in for a number of visits during the trial to undergo clinical assessments). When working with a relational database, each visit can be set up as a separate table. If a database is set up as a single table, there are two options for entering the data: either 'long' or 'wide' format.

In 'long' format, there is one row per subject per visit, and a unique ID is created for each row as a concatenation of subject ID and visit number. The same hypothetical data in 'wide' format would instead have one row per subject, with a separate column for each visit's measurement (see the reshaping sketch below).
If the trial data need to be kept in one ‘flat’ table, statistical analysis software
programs are very capable of reorganising data between long and wide
formats depending upon need.
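
The sketch below shows this kind of reorganisation between long and wide formats using pandas; the subject IDs, visit numbers and blood pressure values are made up for illustration.

```python
# Sketch of reshaping trial data between 'long' and 'wide' formats with
# pandas (hypothetical subject IDs and blood pressure values).
import pandas as pd

long_df = pd.DataFrame({
    "study_id": ["P001", "P001", "P002", "P002"],
    "visit":    [1, 2, 1, 2],
    "sbp":      [132, 128, 145, 141],
})
# Unique row identifier: concatenation of subject ID and visit number
long_df["record_id"] = long_df["study_id"] + "_" + long_df["visit"].astype(str)

# Long -> wide: one row per subject, one column per visit
wide_df = long_df.pivot(index="study_id", columns="visit", values="sbp")
wide_df.columns = [f"sbp_visit{v}" for v in wide_df.columns]

# Wide -> long again, if the analysis needs it
back_to_long = wide_df.reset_index().melt(
    id_vars="study_id", var_name="visit", value_name="sbp"
)

print(wide_df)
```
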
Data type

Variable type is a determinant of the statistical tests that will be applied to answer the trial hypotheses. Being clear about what data are collected, and how, is crucial.

Variable type can be classified (and sub-classified) as:

- Categorical (nominal or ordinal)
- Numeric (discrete or continuous)

Examples: 'current smoker' (yes/no) is nominal; pain severity (mild/moderate/severe) is ordinal; the number of days on which knee pain was experienced in the past week is discrete; height and weight are continuous.

Categorisation of numeric data can easily be undertaken after initial data entry. For example, an analysis plan may specify comparisons among those who are underweight, normal weight, overweight and obese. The researchers collect height and weight, use these to calculate body mass index (BMI), and BMI can then be further categorised into weight status categories (eg overweight, underweight, etc.). However, keeping the original numeric height and weight data in the database is important. Numeric data are usually more analysis-friendly and richer in detail, and they enable re-analysis or future analyses that may not have been pre-specified in the original trial data analysis plan.
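
As a small illustration of deriving a categorised variable while retaining the original measurements, the sketch below calculates and categorises BMI in Python. The cut-points shown are the commonly used WHO categories; the categories actually applied should follow the trial's analysis plan.

```python
# Sketch of deriving and categorising BMI while keeping the original height
# and weight values. Cut-points follow the commonly used WHO categories;
# confirm against the trial's own analysis plan.
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index = weight (kg) divided by height (m) squared."""
    return weight_kg / (height_m ** 2)

def bmi_category(bmi_value: float) -> str:
    if bmi_value < 18.5:
        return "underweight"
    elif bmi_value < 25:
        return "normal weight"
    elif bmi_value < 30:
        return "overweight"
    return "obese"

record = {"study_id": "P001", "height_m": 1.68, "weight_kg": 82.0}
record["bmi"] = round(bmi(record["weight_kg"], record["height_m"]), 1)
record["bmi_cat"] = bmi_category(record["bmi"])  # derived; originals retained
print(record)
```
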

When entering nominal data into a database, ideally these are coded as
numbers. For example in the ‘current smoker’ example above, ‘No’ could
be coded as ‘0’ and ‘Yes’ could be coded as ‘1’ (generally we designate a
positive response as 1 and a negative response as 0). For ordinal variables
(eg, pain categorised as ‘mild’, ‘moderate’ or ‘severe’), conventionally we
use ascending numbers to designate increasing order of responses (eg
1=mild, 2=moderate, 3=severe). The data dictionary should indicate what
each number within a categorical variable designates.
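
A minimal sketch of this kind of coding is shown below; the specific variable names and codes are illustrative and would be recorded in the data dictionary.

```python
# Sketch of coding categorical responses as numbers at entry time.
# The code lists are illustrative and would be documented in the data dictionary.
SMOKER_CODES = {"No": 0, "Yes": 1}                     # nominal: 1 = positive response
PAIN_CODES = {"mild": 1, "moderate": 2, "severe": 3}   # ordinal: ascending severity

raw_responses = {"current_smoker": "Yes", "pain_severity": "moderate"}
coded = {
    "current_smoker": SMOKER_CODES[raw_responses["current_smoker"]],
    "pain_severity": PAIN_CODES[raw_responses["pain_severity"]],
}
print(coded)  # {'current_smoker': 1, 'pain_severity': 2}
```
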

Ensuring that there are clear and comprehensive standard operating procedures (SOPs) for data entry will assist in reducing the amount of time spent 'cleaning' data at the end of the study.
Data collection and entry

GCP guidelines have been put in place to help trials achieve clinical trial data management that is compliant with ethical and regulatory frameworks.

Relevant guidelines include the FDA's 'Electronic Source Data in Clinical Investigations', which provides guidance on the capture, review and retention of electronic source data for clinical trials submitted to the FDA, and the ICH GCP guidelines for data handling.

There are also FDA guidelines for the capture of data from electronic health records.
CRFs

Case report forms (CRFs) are by ICH GCP definition: “A printed, optical or
electronic document designed to record all the protocol-required
information to be reported to the sponsor on each trial subject”.

A well-designed CRF supports the collection of the data needed to answer the trial question in a way that is accurate, complete, consistent and clear to everyone involved in data collection, management and analysis. It also assists in documenting compliance with regulatory frameworks and GCP.

When designing a CRF, there are a number of principles that drive good CRF design.

Think about from whom, where, and how the data are being collected:

- Clinician-completed forms?
- Nurse, research assistant, other healthcare worker?
- Patient/participant-completed forms?
- Observation, interview, focus groups?
- Retrospective case note review?
- Collected at specified time points?
- Collected 'in the field'? In the back of an ambulance? From papers in a filing cabinet?
- If considering an electronic CRF (eCRF), is there a reliable internet connection?
- Are research staff required to administer the CRF? Or, if participant self-administered, is there a need for staff to be present to answer questions about completing the CRF?
Involve principal investigators, statisticians, data managers and trial coordinators in discussions around the design and development of the CRF.

Questions and instructions should be unambiguous.

The CRF should be laid out to logically follow how data are collected in practice (it should not be necessary to skip back and forth between pages to complete the form).

The form should be well organised to make it as easy as possible to enter data in a consistent way.
CRFs and data entry

There are three common methods for collecting data and getting it into the
trial database.

Paper-based CRFs (with manual data entry into the study database)
Paper-based CRFs that are designed to be scanned and entered directly into a database using Optical Character Recognition (OCR) software
Electronic CRFs that enable direct entry/export into the trial database

Which method is chosen for a trial will depend on the study setting and
resources, but there are some issues to consider for each that will impact on
data quality and study workload.
CRF design principles

For all CRFs (whether paper or electronic)

Use character separators and pre-print decimal points and units

e.g., Q3.1 Subject height: _ . _ _ metres

If collecting data at multiple sites, are the units used for particular measures consistent across all sites?

e.g., in Australia we use mmol/L for total plasma cholesterol, but in the USA they use mg/dL. The absolute numbers between these two units are very different. Does the CRF need to cater for more than one unit of measurement? (A conversion sketch is given at the end of this section.)

Where possible, for questions with a limited number of potential answers (eg name of hospital), consider specifying all of the alternative answers (with an 'other', and possibly a 'please specify', if there are potentially some other less common answers).

If information needs to be collected in the form of free text, think about how these answers will be coded (generally such data will need to be coded before analysis).

Where possible, ask specific questions with a time frame:

Instead of: Have you recently experienced knee pain?

Consider: On how many days in the past week have you experienced knee pain? ___ days
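
To illustrate the units issue mentioned above, here is a sketch of converting total cholesterol values reported in different units to a single unit before storage. The conversion factor quoted (1 mmol/L of total cholesterol is approximately 38.67 mg/dL) is the commonly used one; the function and its names are illustrative, and the appropriate factor should be confirmed for each analyte.

```python
# Sketch of harmonising cholesterol units across sites. The commonly quoted
# factor for total cholesterol is 1 mmol/L ≈ 38.67 mg/dL; confirm the factor
# for the analyte in question before use.
MG_DL_PER_MMOL_L = 38.67

def cholesterol_to_mmol_l(value: float, unit: str) -> float:
    """Store all total cholesterol values in mmol/L regardless of source unit."""
    if unit == "mmol/L":
        return value
    if unit == "mg/dL":
        return value / MG_DL_PER_MMOL_L
    raise ValueError(f"Unrecognised unit: {unit}")

print(round(cholesterol_to_mmol_l(193.0, "mg/dL"), 2))  # ~4.99 mmol/L
print(cholesterol_to_mmol_l(5.2, "mmol/L"))             # already in mmol/L
```
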
Paper-based CRFs

In some cases, a paper-based CRF may be the best option; for example, a trial might be mailing a self-administered questionnaire to study participants, or data may be being collected in resource-limited settings. Whether entering paper-based CRFs manually or through OCR scanning, some common principles apply:

Make sure that the participant's unique study ID is pre-printed on every page
Provide detailed instructions about completing the form
Provide contact details for any questions about completing the form
Provide special instructions and visual pointers where necessary
For example: "If you have ticked No to question 1.32, then complete form
F23 (Subject Exit form)"

Label the forms in the chronological order in which they are to be used
Electronic CRFs (eCRFS)

Electronic CRFs (eCRFs) have some advantages over paper CRFs.

Accuracy of data: range constraints and consistency checks can be put in place to minimise data entry errors.

Completeness of data: eCRFs can flag when a data entry field has not been completed.

Efficiency of data collection: data can be entered directly into the trial database, reducing study costs.

Enhanced document management and 24/7 access to files: e.g., a new version of a CRF can be implemented at the same time at all sites of a multi-centre trial if an update or change is required.
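
As a sketch of the sort of range and consistency checks an eCRF might apply at entry time, consider the Python example below; the field names and plausible ranges are assumptions for illustration, not requirements of any particular system.

```python
# Sketch of eCRF-style range and consistency checks applied to one entry.
# Field names and limits are illustrative.
from datetime import date

def check_entry(entry: dict) -> list[str]:
    """Return a list of validation messages for one CRF entry."""
    issues = []
    if entry.get("age") is None:
        issues.append("age: missing value")
    elif not 18 <= entry["age"] <= 110:
        issues.append("age: outside expected range 18-110")
    if entry.get("visit_date") and entry.get("consent_date"):
        if entry["visit_date"] < entry["consent_date"]:
            issues.append("visit_date precedes consent_date")
    return issues

entry = {"age": 130, "consent_date": date(2023, 5, 1), "visit_date": date(2023, 4, 20)}
print(check_entry(entry))  # flags the out-of-range age and the date inconsistency
```
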

Obviously, in cases where there is an interruption to power or internet access, a paper CRF would have the advantage!

Like a paper-based CRF, each eCRF should be ‘signed off’.


Clinical Data Management Software

There are many software tools now available to support data management
in clinical trials. In the case of multi-centre trials collecting large amounts
of data, such systems (whether commercial or custom built) are essential.
Some are open source (e.g. OpenClinica), while there are also individually
customisable commercial systems (e.g. Oracle Clinical One).

For simpler trial designs (generally smaller trials in which extensive custom features are not required), a web-based data capture and management system such as REDCap might also be appropriate.

https://bit.ly/3sI3TuP

There are a number of different such applications, and REDCap is mentioned here as an example. Monash University uses REDCap for a number of its studies, typically smaller ones. It has basic functionality, but it allows the database and web server to be hosted internally within the institution, meeting data security requirements.
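
Where REDCap is used, trial data can also be pulled into analysis pipelines through its API. The sketch below uses the requests library and REDCap's standard 'Export Records' API method; the host URL and token are placeholders, and the exact parameters available depend on the project configuration.

```python
# Sketch of exporting records from a REDCap project via its API.
# The URL and token are placeholders; parameters follow REDCap's
# standard 'Export Records' method.
import requests

REDCAP_API_URL = "https://redcap.example.edu/api/"  # hypothetical host
API_TOKEN = "REPLACE_WITH_PROJECT_TOKEN"           # issued per project and user

payload = {
    "token": API_TOKEN,
    "content": "record",  # export records
    "format": "json",     # JSON rather than CSV/XML
    "type": "flat",       # one row per record (per event)
}

response = requests.post(REDCAP_API_URL, data=payload, timeout=30)
response.raise_for_status()
records = response.json()
print(f"Exported {len(records)} records")
```
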
Data Sources for Trials

CRFs are designed specifically for the trial and enable the collection of data
in a protocol-driven, controlled manner. This is not necessarily done in a
way that is consistent with routine clinical practice, i.e., it is often extra
work for clinicians, nurses or other staff.

Another option for data collection for a trial is the extraction of data from
existing sources (eg, electronic health records, clinical registries, health
administrative databases, insurance claims). This has the advantage of
reducing data collection burden, but may limit what data are available to the
trial, and may not be entirely consistent across multiple centres.

Another common option is the 'hybrid' combination of routinely collected data with additional study-specific data.
Cloud Storage for Clinical Trial Data

Cloud computing, or 'the cloud', can be thought of as a mechanism for sharing information technology (IT) resources (eg, processors, memory, and storage) to leverage greater IT capacity, functionality and performance. Use of cloud computing can maximise the effectiveness of these shared resources, as they are dynamically reallocated to where they are most needed. For clinical trials, the potential benefits include quick access to powerful computing for processor-intensive data analysis and the scalable storage of study documents and trial data, orders of magnitude beyond what might be available at a local site.

However, great care needs to be taken when considering a cloud-based database for trial management.

There are three types of ‘cloud’:

Public cloud - cloud infrastructure over a network that is open to the public
Private cloud - cloud infrastructure operated solely for a single organisation, hosted either internally or externally
Hybrid cloud - a composition of public and private clouds

As indicated in the ICH guidelines, trial sponsors have a responsibility to ensure that clinical trial data are secure, that confidentiality is maintained, that the data are only accessed by authorised users, that back-ups of the data are maintained, and that there is a clear data audit and edit trail (ie, it can be traced what changes were made to the data, by whom and when). The lack of security associated with the public cloud is a barrier to using it for clinical trial data management.

Private clouds can be used for trial data management, but given that these
are generally run by third parties, researchers must ensure:

Privacy principles are adhered to by the private cloud service provider, in a way that enables the sponsor to meet its obligations under the Privacy Act (understanding that these vary by jurisdiction)*
Security standards are clearly defined and understood by both parties, with liabilities and risks (and identification of who bears responsibility for violations) contractually agreed.

*There are rules about trans-border transfer of personal data under the
Australian privacy laws
Data validation

Once you have extracted your trial data from the CRF, there are a few more
steps prior to analysis.

One of these is data validation (checking) and resolution of discrepancies (query resolution).

Discrepancies can arise due to inconsistent data, missing data, data that are out of range (eg, an implausible age), and deviations from the protocol.

As mentioned, eCRFs can allow for some validation checks to be built in, eg, flagging missing data, or inconsistent or out-of-range data. In other cases, discrepancies might arise between different and unconnected data sources.

In some cases, these discrepancies might be minor and able to be resolved by the data team without having to seek external clarification (eg, a spelling error).

In other cases, the discrepancies flagged might require investigators to source supporting data to resolve the discrepancy and validate the data field. For example, there could be a date of death recorded for a participant, but a date of medical procedure recorded that comes after the date of death for that same participant. It is likely that these two dates arose from different sources (eg, the former might have been provided by a next of kin and the latter derived from linkage to an administrative database).
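
A sketch of how such a check might be automated is shown below, flagging any procedure date that falls after a recorded date of death; the data are hypothetical and the field names are illustrative.

```python
# Sketch of flagging date discrepancies, eg a procedure recorded after a
# participant's date of death (hypothetical data, illustrative field names).
import pandas as pd

df = pd.DataFrame({
    "study_id":       ["P001", "P002"],
    "date_of_death":  ["2023-03-10", None],
    "procedure_date": ["2023-04-02", "2023-02-15"],
})
df["date_of_death"] = pd.to_datetime(df["date_of_death"])
df["procedure_date"] = pd.to_datetime(df["procedure_date"])

# Discrepancy: a procedure date later than the recorded date of death
flags = df[df["date_of_death"].notna() & (df["procedure_date"] > df["date_of_death"])]
print(flags[["study_id", "date_of_death", "procedure_date"]])
```
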

Data validation and management of discrepancies include ongoing data checking (including quality control of data processing) throughout the trial, reviewing discrepancies, investigating the reason, and resolving them with documentary proof or declaring them 'irresolvable'.

Just as we would have an audit trail for the data collection in the first
instance, there would also be an audit trail for any amendments to the data
arising from resolution of discrepancies – the trial must be able to track who
has made changes to the data and for what reason.
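
The sketch below shows the kind of information such an audit record captures when a value is amended during query resolution; the field names are illustrative, and in practice a clinical data management system records this automatically rather than through ad-hoc code.

```python
# Sketch of an audit-trail record for a data amendment (who, what, when, why).
# Field names are illustrative; real systems capture this automatically.
from datetime import datetime, timezone

audit_log = []

def amend_value(record: dict, field: str, new_value, user: str, reason: str):
    """Change a value and append who/what/when/why to the audit log."""
    audit_log.append({
        "study_id": record["study_id"],
        "field": field,
        "old_value": record.get(field),
        "new_value": new_value,
        "changed_by": user,
        "changed_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    })
    record[field] = new_value

record = {"study_id": "P001", "procedure_date": "2023-04-02"}
amend_value(record, "procedure_date", "2023-02-04", user="data_manager_1",
            reason="Query resolved against source document: transposed digits")
print(audit_log[-1])
```
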
Data coding

Data collected in a clinical trial are usually matched to standard medical codes for reporting and compliance documentation.

The Medical Dictionary for Regulatory Activities (MedDRA) is an internationally used set of terms relating to medical conditions, medicines and medical devices, and is commonly used (including by the TGA) for coding adverse events.

MedDRA includes standardised terms for:

symptoms, signs, diseases, syndromes and diagnoses
medical device malfunctions
medication errors
medical history information
sites (e.g. application, implant and injection sites)
medical and surgical procedures
types of tests and other investigations

For example, in a multi-centre trial, the investigators at different sites may use different terms for the same adverse event. However, it is important to code them all to a single and consistent standard to ensure the correct coding and classification of adverse events and medications. Incorrect or inconsistent coding risks masking safety concerns with a drug or device, or incorrectly identifying safety concerns with a particular drug or device.
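
As a rough sketch of what consistent coding of verbatim terms looks like, the example below maps site-reported adverse event terms to a single coded term. The lookup table is illustrative only; real coding is performed against the licensed MedDRA dictionary and is usually reviewed by trained medical coders.

```python
# Sketch of mapping verbatim adverse event terms to one coded term.
# The lookup table is illustrative; real coding uses the licensed MedDRA
# dictionary and coder review.
VERBATIM_TO_CODED = {
    "head ache": "Headache",
    "headache": "Headache",
    "cephalalgia": "Headache",
    "feeling sick": "Nausea",
    "nausea": "Nausea",
}

def code_adverse_event(verbatim: str) -> str:
    term = verbatim.strip().lower()
    return VERBATIM_TO_CODED.get(term, "UNCODED - refer to medical coder")

for reported in ["Head ache", "Cephalalgia", "dizzy spells"]:
    print(reported, "->", code_adverse_event(reported))
```
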
Data locking

Following completion of thorough data quality checks, a final data validation is undertaken.

If no remaining discrepancies arise, and no outstanding data management tasks remain, the database can be locked (finalised) prior to statistical analysis.

Once data are locked down, generally no further modifications to the database can be made. Hence, prior to the data being locked down, it is critical that data checks are finalised and all stakeholders (including statisticians and investigators) are consulted. Data extraction for clinical trial analysis is undertaken from this finalised trial database.

If a critical data issue arises post-lockdown, the data may be modified, but this requires a clear audit trail and documentation.
