CDISC Express
Jiangtang Hu
[email protected]
www.jiangtanghu.com/
05/07/2011
Recently I did some research on Clinovo's open source application, CDISC Express, a SAS
application built on an Excel framework and designed to map clinical data to CDISC SDTM domains
automatically. It is not perfect yet, but it is easy to understand and practically usable after a few
hours of exploring the user guide. Most importantly, it is on the right track: an automatic
CDISC converter is the magic weapon in almost every clinical programmer's dreams.
CDISC Express is the first and only practically usable open source CDISC converter I have ever met.
I wrote a post one month ago when I first tested it with great interest and reported some issues
to be fixed. I then had the great opportunity to discuss the software via email with its core
developer, Romain Miralles. This post is just my personal notes on how to use and dig into the
software, and it will best serve as working documentation. Feel free to contact me with any
questions and comments.
1. Download and Installation
You can get CDISC Express for free at
http://www.clinovo.com/cdisc/download
It is a Windows application and will be installed by default in
C:\Program Files\CDISC Express\
After installation, this path is coded as a macro variable, &CDISCPATH, in the following six
SAS files, which are all located in C:\Program Files\CDISC Express\programs\:
create_new_study.sas
generate_Definexml.sas
generate_mapping_template.sas
generate_SDTM.sas
Validate_Mapping_File.sas
Validate_SDTM_Domains.sas
The macro variable reads as
%LET CDISCPATH = C:\Program Files\CDISC Express;
If you change the destination folder at the installation stage, e.g., to D:\CDISC Express\, the
value of the macro variable &CDISCPATH will be changed accordingly in the six files
mentioned above:
%LET CDISCPATH = D:\CDISC Express;
Note that if you want to copy the whole folder of files to another destination, you should at least
manually change the value of &CDISCPATH in these six files, or add some code to capture the
path automatically. From this point of view, the path setting of CDISC Express is not very
portable. If you have such a need, I recommend simply re-installing the software in whatever
destination you want. It does not write any records into the registry, and you can have many
copies on one machine.
The following discussion assumes the software is installed in C:\Program Files\CDISC Express\.
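One way to "capture the path" is sketched below. It is only an illustration, not part of CDISC Express: the macro name %derive_cdiscpath is hypothetical, and the sketch assumes the program being run sits in the programs\ subfolder of the installation.

```sas
/* Sketch only: derive CDISCPATH from the full path of a program that    */
/* lives in <root>\programs\, instead of hard-coding the root folder.    */
/* The macro name derive_cdiscpath is hypothetical.                      */
%macro derive_cdiscpath(progpath);
  %global CDISCPATH;
  data _null_;
    length p root $ 260;
    p = symget('progpath');
    /* position of the "\programs\" folder, ignoring case */
    pos = find(p, '\programs\', 'i');
    if pos > 1 then do;
      root = substr(p, 1, pos - 1);
      call symputx('CDISCPATH', root, 'G');
    end;
    else put 'WARNING: could not derive CDISCPATH; set it manually.';
  run;
  %put NOTE: CDISCPATH=&CDISCPATH;
%mend derive_cdiscpath;

/* Example call with an explicit path */
%derive_cdiscpath(D:\CDISC Express\programs\generate_SDTM.sas);
```

In an interactive Windows session using the Enhanced Editor, the argument could instead be %sysget(SAS_EXECFILEPATH), which holds the full path of the program being submitted; that environment variable is not available in every run mode, so treat this as an assumption to verify in your own setup.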
2. Working Flow
You can follow the six action steps one by one; each is coded as a program in
C:\Program Files\CDISC Express\programs\
1) Create a new study (create_new_study.sas)
Simple and easy. Just assign a new study name in a macro call and run.
2) Generate mapping file (generate_mapping_template.sas)
This is the critical and most time-consuming part. You design mapping rules in Excel
spreadsheets (the MAPPING FILE) for every domain needed. Once that is done, all other tasks,
such as generating SDTM datasets, SAS transport files, define.xml and validation reports, can
be done by just clicking buttons.
The folder "results" and its subfolders hold all the outputs, such as define.xml, SAS
transport files, validation reports and SDTM datasets;
The folder "source" holds all the clinical raw data used as inputs for SDTM domains;
The folder "tempdata" holds all the temporary datasets generated by the subsequent macro calls.
In addition, a configuration file named CLINCAP_configuration.sas is placed in C:\Program Files\CDISC
Express\programs\study configuration\. This file is used to set some study-level parameters,
such as lab and toxicity specifications (details in C:\Program Files\CDISC Express\specs\Lab
specs\).
Two versions of SDTM implementation guides are supported by CDISC Express, CDISC SDTM
Implementation Guide Version 3.1.1 and Version 3.1.2. You can find the corresponding
specification files in C:\Program Files\CDISC Express\specs\SDTM specs\:
SDTM_Specs_3_1_1.xls
SDTM_Specs_3_1_2.xls
The choice of SDTM implementation version is also coded in the configuration file, at Line 41:
%LET SDTMSPECFILE=SDTM_Specs_3_1_1.xls;
Version 3.1.1 is used by default. You can also choose Version 3.1.2 if needed:
%LET SDTMSPECFILE=SDTM_Specs_3_1_2.xls;
Assign a study name and choose an SDTM implementation version. That's all that is needed in step 1.
Let's take a few minutes to navigate the software. CDISC Express is a set of macros and Excel
files. It is important to know the file structure first.
SDTM Validation
study1
specs
    Excel engine
    Lab specs
    Mapping validation
    SDTM specs
    SDTM Terminology
studies
    example1
temp
As we have seen, all the action programs, such as create_new_study.sas, are located in
C:\Program Files\CDISC Express\programs\. create_new_study.sas calls one macro,
%addnewstudy, which is in C:\Program Files\CDISC Express\macros\ClinMap\.
Note that in C:\Program Files\CDISC Express\macros\ there are two sets of macros in
different folders:
C:\Program Files\CDISC Express\macros\ClinMap\: this folder holds all the
system-level macros used only by the application. Modifying them is not encouraged.
C:\Program Files\CDISC Express\macros\function_library\: macros used for
mapping across studies. You can also create your own macros in this folder. The
macros embedded in the application are also documented in the user guide.
4. step 2 of 6: Generate mapping file
Generating a blank template mapping file takes little effort: just submit
generate_mapping_template.sas. The toughest part is filling it with mapping rules for the
specific study.
You should also choose which CORE variables (REQUIRED, PERMISSIBLE and EXPECTED) to include
by setting &req, &perm, and &exp to YES or NO. Note that
REQUIRED and EXPECTED variables must always be included (req=YES, exp=YES);
PERMISSIBLE variables are included only if needed (perm=YES or perm=NO).
Submit generate_mapping_template.sas and you can get a blank template mapping file
tmpmapping.xls in C:\Program Files\CDISC Express\temp\.
Next is the StudyMetadata sheet, an XML metadata specification used to generate define.xml. You
only need to add some information in the Values column:
This format structure is similar to what we get when exporting formats from a format catalog using
proc format library=library cntlout=format_out;
run;
In most production environments, programmers get formats from the clinical data management group.
If all the formats are assigned to the proper libraries (work or library), you don't need to export
them into this spreadsheet. Of course, you can also type customized formats into the format
sheet.
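To get a feel for what a format catalog looks like once exported, here is a small self-contained sketch. The format $sexfmt and its values are hypothetical, and which of the exported columns the CDISC Express format sheet actually expects is not confirmed here; the sketch only shows the CNTLOUT= structure itself.

```sas
/* Hypothetical character format, for illustration only */
proc format library=work;
  value $sexfmt 'M' = 'Male'
                'F' = 'Female';
run;

/* Export the WORK format catalog to a dataset */
proc format library=work cntlout=format_out;
run;

/* FMTNAME, START, END, LABEL and TYPE are usually the columns of interest */
proc print data=format_out noobs;
  var fmtname start end label type;
run;
```

Each value-label pair becomes one row of format_out, which is the row-per-value layout that a spreadsheet of formats naturally follows.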
Finally (AT LAST!), a typical domain sheet, which takes real effort and an understanding of the
software. Take DM for example:
The Dataset column shows that three raw datasets from C:\Program Files\CDISC
Express\studies\example1\source\ are needed to map into the DM domain: demog, siteinv and
eligassess. Note that you can use any dataset options, such as drop=, rename= and where=, on the
input datasets.
At the end of the Dataset column, "all" indicates that all the datasets mentioned above
should be merged together for final processing.
In the Merge Key column, sitecode is assigned to datasets demog and siteinv, which means
demog and siteinv should be merged by their common key, sitecode.
As mentioned, all the previous datasets are merged at the end, but no common key is
given for that step in the Merge Key column. The rule is: if no key is specified for a merge, USUBJID
is used by default.
The third column is CDISC variable, which lists all the needed variables according to the
implementation version. An important note: you do not need to implement the variables
in the order in which they appear in the blank template mapping file. In the blank
file, AGE in the DM domain is at Line 12, but in this working file AGE is calculated
second to last. The variable order of the final DM domain will be the same as in the blank one.
This makes sense in practice. For example, a sequence variable such as AESEQ is ordered right after
USUBJID, but you can only derive the sequence number once all other variables are done. So
SEQ variables are always computed at the final stage in a working mapping file.
The Expression column specifies the mapping rule from raw datasets to SDTM domains.
Assignments, expressions and macro calls (macros located in C:\Program Files\CDISC
Express\macros\function_library\) are allowed in this column, and most of them are
straightforward. We will discuss them further in the following section.
To sum up, we can translate this mapping sheet into SAS code for a better understanding of the CDISC
Express architecture:
data tem1;
set demog;
STUDYID=study;
DOMAIN =&domain;
USUBJID=%CONCATENATE(_variables=study sitecode patid);
SUBJID =patid;
RFSTDTC=%D_RFST(_dataset=trtinf,_date=trtinfdt,_key=patid,_ivrsds=ivrs,_ivrsdt=randdat);
RFENDTC=%SORTLOOKUP(_dataset=disc,_variable=fupdat,_key=patid,_sort_variable=fupdat,_keep=last);
SITEID =sitecode;
INVID =invcode;
BRTHDTC=%FORMAT(_variable=brthdat,_format=yymmdd10);
SEX =%GENDER(_gender=sex);
RACE =%D_RACE();
ETHNIC =upcase(ethnic);
COUNTRY="USA";
DMDTC =%FORMAT(_variable=formdat,_format=yymmdd10);
ARMCD ="";
ARM ="";
run;
data tem2;
merge tem1 siteinv(drop=invcode);
by sitecode;
INVNAM=trim(lname)||","||fname;
run;
data dm;
A macro, %cpd_importlist, is used to import the external dataset, _visits. Again, this
macro is located in C:\Program Files\CDISC Express\macros\function_library\.
Using a macro call to reshape or modify an input dataset offers great flexibility in referencing data.
We will discuss the benefits later on.
4.3.2 Assignment
In the Expression column, you can assign a number, a string, or a dataset variable, combined with
any valid SAS functions, to an SDTM domain variable.
Sometimes a temporary variable is needed for a later calculation. You can create such a temporary
variable in the Dataset column, with an assignment in the Expression column, just as for
any other domain variable. There are two differences: first, the name of a temporary variable begins with
an asterisk, *; second, temporary variables are not included in the final domain. Once
created, a temporary variable can be used in any other expression.
Three special symbols are used in the Dataset column of CDISC Express. The asterisk, *,
indicates a temporary variable, while the other two are
Tilde, ~: indicates a variable used for the supplemental qualifiers domain (SUPPQUAL).
Number sign, #: indicates a variable used for the comments domain (CO).
Another symbol, the at sign, @, is used in the Expression column to reference a variable
produced earlier:
In this case, AGEU uses AGE as input, and AGE has been calculated beforehand. @AGE simply
indicates the dependency. Conceptually, it resembles the calculated keyword in SAS PROC SQL:
proc sql ;
select (AvgHigh - 32) * 5/9 as HighC ,
(AvgLow - 32) * 5/9 as LowC ,
(calculated HighC - calculated LowC)
as Range
from temps;
quit;
4.3.3 Match-merging
We have already seen a match-merging example. If "all" appears as a dataset in the Dataset
column, all the previous datasets are first merged, by the common key
specified in the Merge Key column, for later processing. If no key is assigned, the patient ID is used by the system.
CDISC Express also supports two types of join, inner join and outer join (left, right, full), using
data steps. The implementation differs slightly from standard SQL, but the ideas are the same.
We add a new column, Join, usually beside the Merge Key column.
There are two values for Join, O and I: O stands for outer join and I for inner
join. A join indicator of I means the dataset's in= flag is used to subset the merged result, while O means it is not.
Using the above as an illustration, the corresponding SAS code behind the scenes looks like
data temp;
merge demog(in=a) siteinv(in=b);
by sitecode;
if b;
run;
This is the so-called right outer join. The combinations of I and O across these two datasets can
perform all four types of join, one inner join and three outer joins:
Join type: inner join; combination: (I,I)
data inner;
merge demog(in=a) siteinv(in=b);
by sitecode;
if a and b;
run;

Join type: left outer join; combination: (I,O)
data left;
merge demog(in=a) siteinv(in=b);
by sitecode;
if a;
run;

Join type: right outer join; combination: (O,I)
data right;
merge demog(in=a) siteinv(in=b);
by sitecode;
if b;
run;

Join type: full outer join; combination: (O,O)
data full;
merge demog(in=a) siteinv(in=b);
by sitecode;
run;
As we have seen, if no Join column is specified, CDISC Express performs an inner join by
default.
So far CDISC Express does not support multiple merge keys. For example, the following mapping is
currently illegal:
Dataset      Merge Key
arm          siteid, grpno
armdescri    siteid, grpno
The developer, Romain, indicated that this enhancement would be considered for the next round of
the product road map, and he also proposed a workaround. To use multiple keys for merging, we can
create a temporary variable that holds the keys as a concatenation; this temporary
variable can then be used as a single merge key.
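The idea behind this workaround can be sketched in plain SAS. This is not mapping-file syntax: the sample rows, the variable types, and the names mkey, armcd and armtxt are all assumptions made for illustration; only the dataset and key names (arm, armdescri, siteid, grpno) come from the example above.

```sas
/* Hypothetical sample data; values and variable types are assumptions */
data arm;
  input siteid $ grpno armcd $;
  datalines;
S01 1 A
S01 2 B
S02 1 A
;
run;

data armdescri;
  input siteid $ grpno armtxt $;
  datalines;
S01 1 Placebo
S01 2 Active
S02 1 Placebo
;
run;

/* Build one composite key (hypothetical name: mkey) from the two keys */
data arm_k;
  set arm;
  length mkey $ 40;
  mkey = catx('|', siteid, grpno);
run;

data armdescri_k;
  set armdescri;
  length mkey $ 40;
  mkey = catx('|', siteid, grpno);
run;

proc sort data=arm_k;       by mkey; run;
proc sort data=armdescri_k; by mkey; run;

/* Merge on the single composite key, then drop it */
data arm_all;
  merge arm_k(in=a) armdescri_k(in=b);
  by mkey;
  if a and b;
  drop mkey;
run;
```

The delimiter in catx matters: without it, key pairs such as ("S1", 12) and ("S11", 2) would collapse into the same string. In a mapping file, the composite key would presumably be declared as a temporary variable (name starting with *) in each input dataset.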
4.3.4 Concatenating
Above we discussed the merge operation in CDISC Express at length. This section is dedicated to the
set operation. We already know how to set one dataset for referencing, but how do we set
multiple datasets, i.e., concatenate them?
Just as "all" in the Dataset column indicates a merge operation, "all
(stack)" indicates a concatenating operation:
The above file can also be translated into SAS code for better understanding:
data height;
set vtsigns(where=(height ne .));
VSTESTCD="HEIGHT";
VSTEST ="Height";
VSORRES =put(height,best12.);
VSORRESU="cm";
VSSTRESC=put(height,best12.);
VSSTRESN=height;
VSSTRESU="cm";
run;
data weight;
set vtsigns(where=(weight ne .));
VSTESTCD="WEIGHT";
VSTEST ="Weight";
VSORRES =put(weight,best12.);
VSORRESU="kg";
VSSTRESC=put(weight,best12.);
VSSTRESN=weight;
VSSTRESU="kg";
run;
data vs;
set height weight;
STUDYID =study;
DOMAIN =&domain;
USUBJID =%CONCATENATE(_variables=study sitecode patid);
VSSEQ =%SEQUENCE();
. . .
run;
4.3.5 Transpose
Clinical SAS programmers do a lot of transposing to reshape raw data to fit the
CDISC standards. Currently there is no explicit guidance in CDISC Express on how to transpose,
but this is not the end of the story.
There are two types of transpose:
Type I: from a wide dataset (more variables, fewer observations) to a long dataset (fewer
variables, more observations), e.g. transposing a one-row-per-subject dataset to a
multiple-row-per-subject dataset
Type II: from a long dataset (fewer variables, more observations) to a wide dataset (more
variables, fewer observations), e.g. transposing a multiple-row-per-subject dataset to a
one-row-per-subject dataset
As good practice, in SAS we usually use a data step with the output statement to perform a type I
transpose and PROC TRANSPOSE for type II. Although CDISC Express doesn't support
transposing explicitly, you can at least perform a type I transpose, and, surprisingly,
we have already seen it!
Just go back to the section on concatenating. The example is taken from C:\Program Files\CDISC
Express\studies\example2\:
data height;
set vtsigns(where=(height ne .));
VSTESTCD="HEIGHT";
VSTEST ="Height";
VSORRES =put(height,best12.);
VSORRESU="cm";
VSSTRESC=put(height,best12.);
VSSTRESN=height;
VSSTRESU="cm";
run;
data weight;
set vtsigns(where=(weight ne .));
VSTESTCD="WEIGHT";
VSTEST ="Weight";
VSORRES =put(weight,best12.);
VSORRESU="kg";
VSSTRESC=put(weight,best12.);
VSSTRESN=weight;
VSSTRESU="kg";
run;
data vs;
set height weight;
STUDYID =study;
DOMAIN =&domain;
USUBJID =%CONCATENATE(_variables=study sitecode patid);
VSSEQ =%SEQUENCE();
. . .
run;
We can see that the input dataset vtsigns is a typical wide table (more variables, fewer
observations):
And the final domain VS is a typical long table (fewer variables, more observations):
So this concatenating operation has in fact performed a type I transpose, from a wide
table to a long table! More often, the compact SAS code for a type I transpose looks like:
data vs;
set vtsigns;
if height ne . then do;
VSTESTCD="HEIGHT";
VSTEST ="Height";
VSORRES =put(height,best12.);
VSORRESU="cm";
VSSTRESC=put(height,best12.);
VSSTRESN=height;
VSSTRESU="cm";
output;
end;
if weight ne . then do;
VSTESTCD="WEIGHT";
VSTEST ="Weight";
VSORRES =put(weight,best12.);
VSORRESU="kg";
VSSTRESC=put(weight,best12.);
VSSTRESN=weight;
VSSTRESU="kg";
output;
end;
. . .
run;
Similarly, for a type II transpose it would be convenient to create temporary datasets in the
Dataset column using macros with an embedded transpose operation. Everything SAS can do, you can
also implement in CDISC Express: just use macros in the Expression and Dataset columns accordingly.
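As a rough illustration of the type II (long to wide) operation that such a macro might wrap, here is a plain PROC TRANSPOSE sketch. The dataset vs_long and its values are hypothetical, and how the result would be packaged as a function_library macro for the Dataset column is left open.

```sas
/* Hypothetical long dataset: one row per subject per test */
data vs_long;
  input patid $ vstestcd $ vsstresn;
  datalines;
001 HEIGHT 172
001 WEIGHT 68
002 HEIGHT 165
002 WEIGHT 59
;
run;

proc sort data=vs_long;
  by patid;
run;

/* Type II transpose: one row per subject, one column per test code */
proc transpose data=vs_long out=vs_wide(drop=_name_);
  by patid;
  id vstestcd;
  var vsstresn;
run;
```

The ID statement turns each VSTESTCD value into a column name, so vs_wide ends up with the variables patid, HEIGHT and WEIGHT.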
Raw data varies with the trial design and with the clinical data capture system and procedures. It
is impossible and impractical to expect a CDISC SDTM converter such as CDISC Express
to map all the data at the click of a button. Introducing CDISC Express doesn't make
programmers redundant. It just removes most of the trivial work from programmers' daily lives and
lets them concentrate on creative work and be more productive and efficient.
http://www.clinovo.com/cdisc/game
The due date is July 15th and I have already submitted my work. That's fun.