Clinical Data Management
Introduction
Clinical Data Management (CDM) is involved in all aspects of processing clinical data, working with
a range of computer applications and database systems to support the collection, cleaning, and
management of subject or trial data. For efficient clinical data management, a proper Data Management
Plan (DMP) is prepared. These documents are more accurate when written at the start of the study rather
than as a summary or report at the end. Another common name for this document is ‘data handling plan’.
A DMP should touch on all elements of the data management process for the study in question.
For each element or group of elements in the data management process, a DMP specifies:
The DMP also becomes an approximate table of contents for the study file and can be used as a
source document for internal quality assurance audits. The content of DMPs is largely consistent
from company to company. When writing an SOP for the DMP, be very careful with the word
“require”. If the SOP states that the study file must have a particular folder or tab, that folder or tab
must actually be there, with something in it, during an audit. Require only those folders you actually
need, and use vaguer text such as “may include” for the others.
1. CRF Design
2. Study Setup
3. Tracking CRF Data (CRF Workflow)
4. Entering data
5. Cleaning data
6. Managing lab data
7. SAE Handling
8. Coding reported terms
9. Creating reports and transferring data
10. Closing studies
1. The work to be done and responsibilities are clearly stated at the start of the study so
that everyone knows what is expected.
2. The expected documents are listed at the start of the study so that they can be
produced during the course of, rather than after, the conduct of the study.
3. The document helps everyone fulfill regulatory requirements.
4. Data management tasks become more visible to other groups when the DMP is made
available to the project team.
5. The DMP provides continuity of process and a history of a project. This is particularly
useful for long-term studies.
User acceptance testing – Once an electronic CRF (e-CRF) is built, the clinical data manager (and other
parties as appropriate) conducts User Acceptance Testing (UAT). The tester enters test data into the
e-CRF and records whether it functions as intended. UAT is repeated until all issues found are
resolved.
Open-ended questions and Closed-ended questions
Closed-ended questions are those that can be answered with a simple "yes" or "no",
while open-ended questions are those that require more thought and more than a simple one-word
answer.
Database
Data from a clinical trial will be collected and stored in some kind of computer system (HDDs,
cloud, etc.). The initial data capture method may be paper, a computer entry screen at the Investigator’s
site, a laboratory instrument, a central lab system, or perhaps a hand-held computer or tablet used to
collect diary information from the patient. Regardless of the way the data is captured initially, it must
eventually be collected and stored in a computer system or systems that will allow complex data
cleaning, reviewing, and reporting. The storage and maintenance of data is achieved by getting it into
one or more databases. (A database is simply a structured set of data. This may be an Excel spreadsheet,
a Microsoft Access application, an indexed proprietary data format, a collection of SAS tables, or a set of
tables built in one of the traditional relational applications such as Oracle®.) For any software that a
company uses to store and manage clinical data, the data manager will design the structures (a database) for
each clinical trial after the Protocol has been defined and the capture methods (such as case report form
[CRF] booklets or electronic data capture [EDC] entry screens) have been drafted.
CRF pages or electronic data entry screens are the most common means of data collection, but
other sources may include lab data files, interactive voice response (IVR) systems, and so forth. Because
they collect data from the sites, the data capture instruments should be carefully designed to be clear
and easy to use.
Clinical data from any of the common sources, such as CRFs, electronic entry screens, and
electronic files, always contain some text. The most common text-type fields are:
• Short comments
• Reported terms
• Long comments
Coded values
Categories of answers, often known as coded values, comprise the largest number of text fields.
These fields have a limited list of possible answers and usually can contain only those answers. Common
examples of coded fields include “Yes/No” answers, “Male/Female” for gender, and “Mild/
Moderate/Severe” for severity. The list of possible answers is usually known as a “code list”. Coded text
fields only present a problem if the associated field can contain more than one answer or if the answer
frequently falls outside the predefined list. When more than one answer is possible, the database design
changes from a single field with a single answer to a series of fields, each of which may hold any eligible
response from the code list.
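The code-list idea above can be sketched in a few lines of Python. This is only an illustration: the code-list names (`YES_NO`, `SEVERITY`) and the helper functions are hypothetical and do not come from any particular data management system.

```python
# Hypothetical code lists; the names and contents are illustrative only.
CODE_LISTS = {
    "YES_NO": {"Yes", "No"},
    "SEVERITY": {"Mild", "Moderate", "Severe"},
}


def check_coded_value(code_list_name, value):
    """Return True if the value is an eligible response from the named code list."""
    return value in CODE_LISTS[code_list_name]


def check_multi_answer(code_list_name, values):
    """Check a series of fields that each hold one response from the same code list.

    Returns the values that fall outside the code list.
    """
    return [v for v in values if not check_coded_value(code_list_name, v)]
```

A single-answer field is checked with `check_coded_value`, while a question that allows several answers is modeled as a series of fields and checked with `check_multi_answer`.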
Short texts
A very common short-comment field is the one associated with a “Normal/Abnormal” response and
labeled “If abnormal, explain.” Medical-history terms or “Reason for Discontinuation” fields are other
common examples of short text. These responses can be reviewed for clinical or safety monitoring, but
it is difficult to analyze them because any kind of summarization or analysis of these values depends on
some consistency of the data. The best that can be done for these fields is to standardize spelling,
spacing, and symbols as much as possible.
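As a rough sketch of what such standardization might look like, the Python function below normalizes spacing, one symbol, and letter case. The specific rules (replacing "&" with "AND", converting to upper case) are assumptions chosen for illustration; real conventions would be set in the study's data entry guidelines.

```python
def standardize_short_text(text):
    """Standardize spacing, symbols, and case in a short free-text value.

    The rules here are illustrative assumptions, not a standard.
    """
    # Symbols: write the ampersand out as a word.
    text = text.replace("&", " AND ")
    # Spacing: trim the ends and collapse runs of whitespace to one space.
    text = " ".join(text.split())
    # Case: store everything in upper case so equal texts compare equal.
    return text.upper()
```

After standardization, values such as "mild  rash & itching" and "Mild rash and itching" become the same string and can be grouped in a listing.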
Reported terms
If the short texts or comments are important to the study, a codelist, possibly a very large one,
should be used to group like texts. Large codelists (sometimes called dictionaries) are available for a very
common class of short free-text fields: adverse events (AEs), medications, and diagnoses. These kinds of free text are
often called reported terms.
Long texts
Longer texts are those that cover several lines and are usually associated with a group of fields or
even an entire page or visit. Clinical research associates and medical monitors use this text to
cross-check against other problems reported or found in the data. Long texts are never analyzed. Long
comments can be stored in several ways:
Flow of Data
Data Collection
Data Entry
Data Validation
Data Cleanup
Data Analysis
Data Reporting
Data Presentation
Creating Database
Data Validation
Data validation is the process of testing the validity of data in accordance with the protocol
specifications. Edit check programs, which are embedded in the database, are written to identify
discrepancies in the entered data and so ensure data validity. These programs are written according to
the logic conditions specified in the data validation plan (DVP) and are initially tested with dummy data
containing discrepancies. A discrepancy is defined as a data point that fails to pass a validation check.
Discrepancies may be due to inconsistent data, missing data, out-of-range values, or deviations from the
protocol. In e-CRF-based studies, the data validation process is run frequently to identify
discrepancies, which are resolved by investigators after logging into the system.
Ongoing quality control of data processing is undertaken at regular intervals during the course of CDM.
For example, if the inclusion criteria specify that the age of the patient should be between 18 and 65
years (both inclusive), an edit check program will be written for two conditions, namely age < 18 and
age > 65. If either condition is TRUE for any patient, a discrepancy will be generated. These
discrepancies will be highlighted in the system and Data Clarification Forms (DCFs) can be generated.
DCFs are documents containing queries pertaining to the discrepancies identified.
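The age example can be written as a small edit check. The following Python sketch is not taken from any real CDM system; the `Discrepancy` record, the function names, and the `AGE` field name are hypothetical, and a missing age is treated as a discrepancy as an added assumption.

```python
from dataclasses import dataclass


@dataclass
class Discrepancy:
    subject_id: str
    field: str
    value: object
    message: str


def check_age(subject_id, age, low=18, high=65):
    """Edit check for the inclusion criterion 18 <= age <= 65 (both inclusive).

    Returns a Discrepancy if the check fails, otherwise None.
    """
    if age is None:
        # Assumption: missing data is also reported as a discrepancy.
        return Discrepancy(subject_id, "AGE", age, "Age is missing")
    if age < low or age > high:
        return Discrepancy(subject_id, "AGE", age,
                           f"Age {age} is outside the range {low}-{high}")
    return None


def run_edit_check(records):
    """Run the age check over (subject_id, age) pairs and collect discrepancies."""
    found = []
    for subject_id, age in records:
        discrepancy = check_age(subject_id, age)
        if discrepancy is not None:
            found.append(discrepancy)
    return found
```

Before use on real data, a check like this would be run against dummy records that deliberately contain discrepancies, for example `run_edit_check([("S001", 34), ("S002", 17), ("S003", 66)])`, which should flag the second and third subjects only.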
System Validation
All computer systems used in the processing and management of clinical trial data must undergo
validation testing to ensure that they perform as intended and that results are reproducible. “System
validation” is conducted to ensure data security, during which system specifications, user requirements,
and regulatory compliance are evaluated before implementation. Study details like objectives, intervals,
visits, investigators, sites, and patients are defined in the database and CRF layouts are designed for data
entry.
Data Entry
When data in a clinical study is captured on paper case report forms (CRFs), that data must be
transferred to a database for final storage. We will call the process of transferring it from paper or image
into electronic storage “data entry”. Data entry may be entirely manual or it may be partly
computerized using optical character recognition (OCR). Regardless of whether there is a computerized
step involved in the process, and regardless of the specific application used, the main data entry issues
that must be addressed by technology or process, or both, are:
Accurately transcribing the data from the CRF to the database is essential. Errors in transcription are
usually due to typographical/typing errors (“typos”) or illegibility of the values on the CRF. Companies
aim to reduce transcription errors using one of these methods:
Illegible fields
Illegible writing on the CRF always causes problems for data entry, data management staff, and
clinical research associates (CRAs). Each data management group should consider the following
questions when planning an approach to illegible fields:
Even when staff tries to appropriately identify values, some data is just illegible and will have to be
sent to the Investigator for clarification during the data cleaning process.
Notations in margins
Investigators will sometimes supply data that is not requested. This most frequently takes the form
of comments running in the margins of a CRF page but may also take the form of unrequested, repeated
measurements written in between fields. Some database applications can support annotations or extra
measurements that are not requested, but many cannot. If site annotations are not supported, data
management, together with the clinical team, must decide what is to be done:
Can the information be stored in the database as a comment or annotation (but then it could
not be listed or analyzed)?
Can the comment be ignored?
Should the site be asked to remove the comment and transcribe it somewhere appropriate?
Pre-entry Review
In the past, many companies had an experienced data manager conduct a pre-entry review of all
CRF pages. The idea was to speed up entry by dealing with significant issues ahead of time. The problem
is that extensive review circumvents the independent double entry process. Because entry staff
would enter whatever the reviewer wrote without considering whether it was appropriate, the result was
effectively “one person decides”. Also, there were many cases where the reviewer’s notes changed the
data or made assumptions.
Single entry
While relatively rare in traditional clinical data management, single entry is an option when
there are strong supporting processes and technologies in place to identify possible typos or errors
because of unclear data. Electronic data capture (EDC) is a perfect example of single-pass entry; the site
is the entry operator transcribing from source documents, and checks in the application immediately
identify potential errors.
Double entry
With its error rate often given as 0.1 to 0.2%, double data entry has long been used without
question as a reliable method of transcription. In double entry, one operator enters all the data in a first
pass, and then an independent second operator enters the data again. Two different techniques are
used to identify mismatches and also to resolve those mismatches. In one double entry method, the two
entries are made and both are stored. After both passes have been completed, a comparison program
checks the entries and identifies any differences. Typically, a third person reviews the report of
differences and makes a decision as to whether there is a clear correct answer (for example, because
one entry had a typo). This method of double entry is sometimes known as “blind” double entry since
the operators have no knowledge of each other’s work.
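The comparison step of blind double entry can be sketched as follows. This is a minimal illustration in Python, assuming each pass is stored as a dictionary of field name to entered value; the function name and the data layout are hypothetical.

```python
def compare_entries(first_pass, second_pass):
    """Compare two independently entered passes of the same CRF page.

    Each pass maps field name -> entered value. Returns a list of
    (field, first_value, second_value) tuples for every mismatch,
    including fields present in only one pass.
    """
    differences = []
    for field in sorted(set(first_pass) | set(second_pass)):
        first_value = first_pass.get(field)
        second_value = second_pass.get(field)
        if first_value != second_value:
            differences.append((field, first_value, second_value))
    return differences
```

The returned list plays the role of the report of differences that a third person reviews; the program itself does not decide which entry is correct.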
The other double entry method uses the second entry operator to resolve mismatches. After
first pass entry, the second entry operator selects a set of data and begins to re-enter it. If the entry
application detects a mismatch, it stops the second operator who decides, right then and there, what
the correct value should be or whether to register a discrepancy to be resolved by a third person.
No matter how well designed a CRF is, there will be occasional problems with values in fields.
The problems may be due to confusion about a particular question or they may be due to the person
filling it out. The most common problem is illegibility; another is notations or comments in the margins.
Sometimes pre-entry review of the CRFs can help manage these and other problems. Because
companies deal with these data problems in different ways, each data management group must specify
the correct processing for each problem in data entry guidelines.
Cleaning Data
The biggest job for any data management group running a paper-based trial is not the entry of
the data into a clinical database; it is checking the data for discrepancies and getting the data cleaned
up. Discrepancies may be identified manually at any point during processing when someone reviews
either the case report form (CRF) or the data. Most commonly, they are identified by the data
management system automatically at entry or after entry. The process or system to store discrepancies
can be called a discrepancy management system. A discrepancy can often be resolved by internal groups,
such as the data management group or clinical research associates (CRAs), but at least some portion will
need to be resolved by the investigator. Discrepancies that are sent to the sites for resolution are often
called “queries”.
Electronic data capture (EDC) systems deliver clinical trial data from the investigation sites to the
sponsor through electronic means rather than paper case report forms (CRFs). The site may be entering
the information directly into screens without first writing it down on paper, in which case the record is
an electronic source record. The site may first record the information on paper and then enter it later.
EDC systems are optimized for site activities during a clinical trial and typically feature:
Data Collection
CRF Tracking
CRF Annotation
Database Design
Data Entry
Medical Coding
Data Validation
Discrepancy Management
Database lock
Data collection
Data collection is done using the CRF, which may exist in paper or electronic
form. The traditional method is to employ paper CRFs to collect the data responses, which are
transcribed into the database by means of data entry done in-house. These paper CRFs are filled in by
the investigator according to the completion guidelines. In e-CRF-based CDM, the investigator or a
designee logs into the CDM system and enters the data directly at the site. With the e-CRF method,
the chance of errors is lower and discrepancies are resolved faster. Since pharmaceutical
companies try to shorten drug development by speeding up the
processes involved, many are opting for e-CRFs (also called remote
data entry).
Database Design
Databases are clinical software applications built to facilitate CDM tasks across
multiple studies. Generally, these tools have built-in compliance with regulatory requirements
and are easy to use. The entry screens are tested with dummy data before they are moved to real
data capture.
Data entry
Data entry takes place according to the guidelines prepared along with the DMP. This is applicable only
in the case of paper CRF retrieved from the sites. Usually, double data entry is performed wherein the
data is entered by two operators separately. The second pass entry (entry made by the second person)
helps in verification and reconciliation by identifying the transcription errors and discrepancies caused
by illegible data. Moreover, double data entry helps in getting a cleaner database compared to a single
data entry. Earlier studies have shown that double data entry gives better consistency with the paper CRF,
as shown by a lower error rate.
Medical coding
Medical coding helps in identifying and properly classifying the medical terminologies associated with
the clinical trial. For classification of events, medical dictionaries available online are used. Technically,
this activity needs the knowledge of medical terminology, understanding of disease entities, drugs used,
and a basic knowledge of the pathological processes involved. Functionally, it also requires knowledge
about the structure of electronic medical dictionaries and the hierarchy of classifications available in
them. Adverse events occurring during the study, prior and concomitant medications,
and pre-existing or co-existing illnesses are coded using the available medical dictionaries. Commonly, the Medical
Dictionary for Regulatory Activities (MedDRA) is used for the coding of adverse events as well as other
illnesses and World Health Organization–Drug Dictionary Enhanced (WHO-DDE) is used for coding the
medications. These dictionaries contain the respective classifications of adverse events and drugs in
proper classes. Other dictionaries are also available for use in data management (e.g., WHO-ART is a
dictionary that deals with adverse reactions terminology). Some pharmaceutical companies utilize
customized dictionaries to suit their needs and meet their standard operating procedures.
Medical coding helps in classifying reported medical terms on the CRF to standard dictionary terms in
order to achieve data consistency and avoid unnecessary duplication. For example, the investigators
may use different terms for the same adverse event, but it is important to code all of them to a single
standard code and maintain uniformity in the process. Correct coding and classification of adverse
events and medications is crucial, as incorrect coding may mask safety issues or highlight
the wrong safety concerns related to the drug.
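The idea of coding different reported terms to one standard term can be illustrated with a toy lookup in Python. The dictionary below is entirely hypothetical and is not MedDRA or WHO-DDE content; real dictionaries are licensed, hierarchical, and far larger.

```python
# Hypothetical synonym table: reported term (normalized) -> standard term.
# These entries are made up for illustration and are not from MedDRA.
STANDARD_TERMS = {
    "headache": "Headache",
    "head ache": "Headache",
    "cephalalgia": "Headache",
    "nausea": "Nausea",
    "feeling sick": "Nausea",
}


def code_term(reported_term):
    """Map a reported term to its standard term.

    Returns None when no match is found, meaning the term would need
    manual coding or a query to the site.
    """
    # Normalize case and spacing before the lookup.
    key = " ".join(reported_term.lower().split())
    return STANDARD_TERMS.get(key)
```

Here "Head  Ache" and "cephalalgia" both code to "Headache", so the same adverse event reported in different words is counted once rather than as separate events.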
Discrepancy management
This is also called query resolution. Discrepancy management includes reviewing discrepancies,
investigating the reason, and resolving them with documentary proof or declaring them as irresolvable.
Discrepancy management helps in cleaning the data and gathers enough evidence for the deviations
observed in data. Almost all CDMS have a discrepancy database where all discrepancies will be recorded
and stored with audit trail.
Based on the types identified, discrepancies are either flagged to the investigator for clarification or
closed in-house by Self-Evident Corrections (SEC) without sending DCF to the site. The most common
SECs are obvious spelling errors. For discrepancies that require clarifications from the investigator, DCFs
will be sent to the site. The CDM tools help in the creation and printing of DCFs. Investigators will write
the resolution or explain the circumstances that led to the discrepancy in data.
When an item or variable has an error or a query raised against it, it is said to have a “discrepancy” or
“query”.
All EDC systems have a discrepancy management tool, also referred to as an “edit check” or “validation
check”, that is programmed using any common programming language (e.g. PL/SQL, C#, SQL, Python).
So what is a ‘query’? A query is an error generated when a validation check detects a problem with the
data. Validation checks are run automatically whenever a page is saved (“submitted”) and can identify
problems with a single variable, between two or more variables on the same e-CRF page, or between
variables on different pages. A variable can have multiple validation checks associated with it.
A query is typically resolved in one of two ways:
• by correcting the error, for example by entering a new value or updating the data point
• by marking the variable as correct; some EDC systems require an additional response, and a
further query can be raised if the response is not satisfactory.
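As an illustration of validation checks that run when a page is saved, the Python sketch below attaches two checks to the same variable, one of which compares two variables on the same page. The field names `VISIT_DATE` and `CONSENT_DATE`, the rule itself, and all function names are assumptions made for the example.

```python
from datetime import date


def check_visit_date_present(page):
    """Single-variable check: the visit date must be filled in."""
    if page.get("VISIT_DATE") is None:
        return "VISIT_DATE is missing"
    return None


def check_visit_not_before_consent(page):
    """Cross-variable check: the visit must not precede informed consent."""
    visit = page.get("VISIT_DATE")
    consent = page.get("CONSENT_DATE")
    if visit is not None and consent is not None and visit < consent:
        return "VISIT_DATE is earlier than CONSENT_DATE"
    return None


# Both checks involve VISIT_DATE: one variable can have several checks.
VALIDATION_CHECKS = [check_visit_date_present, check_visit_not_before_consent]


def save_page(page):
    """Run every validation check when the page is saved; return the queries raised."""
    queries = []
    for check in VALIDATION_CHECKS:
        message = check(page)
        if message is not None:
            queries.append(message)
    return queries
```

Saving a page with `VISIT_DATE = date(2024, 1, 5)` and `CONSENT_DATE = date(2024, 1, 10)` raises one query; once the site corrects the visit date and saves again, the list of queries comes back empty.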
Types of Queries
Manual Queries
Coding Queries
Queries can be issued and/or answered by a number of people involved in the trial. Some of the
common setups are: CDM, CRA or monitors, Site or coordinators.
Database locking
After a proper quality check and assurance, the final data validation is run. If there are no discrepancies,
the SAS datasets are finalized in consultation with the statistician. All data management activities should
have been completed prior to database lock. To ensure this, a pre-lock checklist is used and completion
of all activities is confirmed. This is done because the database should not be changed after locking.
Once the approval for locking is obtained from all stakeholders, the database is locked and clean data is
extracted for statistical analysis. Generally, no modification of the database is possible after this
point. However, in case of a
critical issue or for other important operational reasons, privileged users can modify the data even after
the database is locked. This, however, requires proper documentation and an audit trail has to be
maintained with sufficient justification for updating the locked database. Data extraction is done from
the final database after locking. This is followed by its archival.
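The locking rules described above can be modeled in a short Python sketch. It is a simplified illustration under stated assumptions, not how any specific CDM product works: the class, method, and parameter names are hypothetical, and the pre-lock checklist is reduced to a single flag.

```python
from datetime import datetime, timezone


class LockedDatabaseError(Exception):
    """Raised when a change to a locked database is not permitted."""


class StudyDatabase:
    def __init__(self):
        self.data = {}
        self.locked = False
        self.audit_trail = []

    def lock(self, open_discrepancies, checklist_complete):
        """Lock only when no discrepancies remain and the pre-lock checklist is done."""
        if open_discrepancies > 0 or not checklist_complete:
            raise ValueError("Cannot lock: open discrepancies or incomplete checklist")
        self.locked = True

    def update(self, key, value, user, privileged=False, justification=None):
        """Change a value; after lock this needs a privileged user and a justification."""
        if self.locked and not (privileged and justification):
            raise LockedDatabaseError("Database is locked")
        # Every change is recorded in the audit trail, before and after lock.
        self.audit_trail.append({
            "key": key,
            "old": self.data.get(key),
            "new": value,
            "user": user,
            "justification": justification,
            "after_lock": self.locked,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        self.data[key] = value
```

An ordinary update after `lock()` raises `LockedDatabaseError`, while a privileged update with a written justification succeeds and leaves an audit trail entry marked `after_lock`.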