R Vs SAS
Abstract
In this paper we show how to generate an ADaM-compliant ADSL dataset using R. The R packages sas7bdat, dplyr, tidyr, and
Hmisc are used to generate the ADSL dataset. A procedure to set up the R environment for generating the ADSL
dataset is shown. The typical steps used to create the ADSL dataset, along with the derivation of numeric variables, flags, treatment
variables and trial dates, are outlined. R procedures to attach labels to the variables are discussed. A side-by-side comparison of R
and SAS code is presented. A known weakness of R, attaching labels to variables [1], has been resolved in this work. The
challenges encountered in generating the ADSL dataset using R are discussed.
Introduction
SAS is widely used in clinical trials. Like SAS, R is a language and environment for statistical computing and graphics. R can be
considered a viable alternative to SAS for generating specialized clinical trial datasets, tables, listings and figures. R is a freely
available, open-source environment supported by the R Foundation for Statistical Computing. R has many
packages available for the design, monitoring and analysis of clinical trials. These include sas7bdat, dplyr, tidyr, and Hmisc,
which enable reading, merging, transposing, and attaching labels to variables, respectively [2].
Steps To Generate ADSL Dataset
The steps below outline the general process for generating an ADSL dataset, beginning with reading the SDTM
datasets and ending with the generation of the ADSL dataset.
1: Reading Datasets
The SDTM datasets, which include DM, SUPPDM, EX, SUPPEX, DS and SUPPDS, are imported into the statistical
programming environment.
The DM and SUPPDM datasets are sorted; the SUPPDM dataset is then transposed and finally merged with the DM dataset.
The DS and SUPPDS datasets are sorted; the SUPPDS dataset is then transposed and finally merged with the DS dataset.
The EX and SUPPEX datasets are sorted; the SUPPEX dataset is then transposed and finally merged with the EX dataset.
5: Exposure Variables
8: Generating Flags
Subject-level population flags such as the safety population flag (SAFFL), intent-to-treat population flag (ITTFL) and enrollment
population flag (ENRFL) are generated.
Trial dates such as treatment start date and treatment end date are generated.
Setting-up the R Environment
The first step in using any programming environment is typically to install the required packages and load the libraries needed for
a program. R packages can be installed in RStudio through the lower right pane of the RStudio IDE (4th quadrant). Figure 3 below
shows how to install the dplyr package; the same procedure can be used to install the other R packages, namely sas7bdat, tidyr and
Hmisc, shown in Table 1. R packages can also be installed in RStudio through console commands, as shown in Figure 4.
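As a minimal sketch of the console route, the commands below install the four packages named in this paper from CRAN and then attach them. The installation line needs network access and only has to be run once per machine; note that package names are case-sensitive in R.

```r
# One-time installation from CRAN (requires network access)
install.packages(c("sas7bdat", "dplyr", "tidyr", "Hmisc"))

# Attach the libraries at the top of each program
library(sas7bdat)  # reading SAS datasets
library(dplyr)     # merging, renaming, selecting
library(tidyr)     # transposing
library(Hmisc)     # attaching and checking labels
```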
dm     <- read.sas7bdat("c:/sdtm/dm.sas7bdat")
suppdm <- read.sas7bdat("c:/sdtm/suppdm.sas7bdat")
ds     <- read.sas7bdat("c:/sdtm/ds.sas7bdat")
suppds <- read.sas7bdat("c:/sdtm/suppds.sas7bdat")
ex     <- read.sas7bdat("c:/sdtm/ex.sas7bdat")
suppex <- read.sas7bdat("c:/sdtm/suppex.sas7bdat")
2. Combining the DM and SUPPDM Datasets by Sorting, Transposing and Merging
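The sort, transpose and merge sequence for DM and SUPPDM can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact code: the two data frames are small toy stand-ins for the imported datasets, the column names are assumed to be lowercase, and the QNAM values (RACEOTH, COMPLT) are hypothetical. The sketch uses spread from tidyr, as listed among the functions in this paper; newer tidyr versions offer pivot_wider for the same purpose.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the imported SDTM datasets (hypothetical values)
dm <- data.frame(
  usubjid = c("S-002", "S-001"),
  sex     = c("f", "m"),
  stringsAsFactors = FALSE
)
suppdm <- data.frame(
  usubjid = c("S-001", "S-001", "S-002"),
  qnam    = c("RACEOTH", "COMPLT", "COMPLT"),
  qval    = c("mixed", "Y", "N"),
  stringsAsFactors = FALSE
)

# Sort both datasets by subject identifier
dm     <- dm[order(dm$usubjid), ]
suppdm <- suppdm[order(suppdm$usubjid, suppdm$qnam), ]

# Transpose SUPPDM: one row per subject, one column per QNAM value
suppdm_t <- suppdm %>%
  select(usubjid, qnam, qval) %>%
  spread(key = qnam, value = qval)

# Merge, keeping every subject in DM even without supplemental records
dm2 <- left_join(dm, suppdm_t, by = "usubjid")
```

The same pattern would apply to DS with SUPPDS and to EX with SUPPEX, with the merge keys adjusted to the record level of each domain.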
ex2 <- ex2 %>%
  rename(extrt2 = extrt, exdose2 = exdose,
         exstdtc2 = exstdtc, exendtc2 = exendtc)
adsl$sexn[adsl$sex == "m"] <- 1
adsl$sexn[adsl$sex == "f"] <- 2
adsl$racen[adsl$race == "asian"] <- 1
adsl$racen[adsl$race == "other"] <- 2
10. Generating Trial Dates
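A minimal sketch of deriving the treatment start and end dates is given below. It assumes, for illustration only, that the start date is the earliest exposure start and the end date is the latest exposure end per subject; the variable names trtsdt and trtedt follow the usual ADaM naming, and the toy records and lowercase column names are hypothetical.

```r
library(dplyr)

# Toy exposure records (hypothetical values); SDTM dates are ISO 8601 text
ex <- data.frame(
  usubjid = c("S-001", "S-001", "S-002"),
  exstdtc = c("2020-01-10", "2020-02-07", "2020-01-15"),
  exendtc = c("2020-02-06", "2020-03-05", "2020-02-11"),
  stringsAsFactors = FALSE
)
adsl <- data.frame(
  usubjid = c("S-001", "S-002", "S-003"),
  stringsAsFactors = FALSE
)

# Convert the date part of the text to Date; incomplete dates become NA
to_date <- function(x) as.Date(substr(x, 1, 10), format = "%Y-%m-%d")

# Earliest start and latest end of exposure per subject
trt_dates <- ex %>%
  mutate(exstdt = to_date(exstdtc), exendt = to_date(exendtc)) %>%
  group_by(usubjid) %>%
  summarise(trtsdt = min(exstdt), trtedt = max(exendt))

# Subjects without exposure records keep missing dates
adsl <- left_join(adsl, trt_dates, by = "usubjid")
```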
describe(adsl)
Procedure               R Function
Read SAS Dataset        read.sas7bdat
Merge Datasets          inner_join / left_join / right_join / full_join
Transpose Dataset       spread
Attach Label            label
Variable Selection      select
Character to Numeric    as.numeric
Numeric to Character    as.character
Check Attributes        contents
Sort Dataset            order
Rename Variable         rename
Conditional Operator    ifelse()
Check Labels            describe
Check Variable Type     class
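A few of the base R functions from the list above are illustrated in the short sketch below. The toy data frame and the age grouping are hypothetical and serve only to show the calls.

```r
# Toy data (hypothetical values); age arrives as character in this example
adsl <- data.frame(
  usubjid = c("S-002", "S-001"),
  age     = c("54", "61"),
  stringsAsFactors = FALSE
)

class(adsl$age)                        # check variable type: "character"
adsl$age  <- as.numeric(adsl$age)      # character to numeric
adsl$agec <- as.character(adsl$age)    # numeric to character
adsl <- adsl[order(adsl$usubjid), ]    # sort dataset by subject

# Conditional derivation with ifelse()
adsl$agegr <- ifelse(adsl$age >= 60, ">=60", "<60")
```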
Table 3 below compares the SAS and R code used to generate the ADSL dataset. It serves as a step-by-step procedure and code
reference for generating the ADSL dataset in SAS and R.
Table 3: SAS and R code comparison.
Step 9: Generating Safety Flag, ITT Flag, Enrolment Flag

SAS:
    data adsl3;
      set adsl2;
      if exdose ^= . and exstdtc ^= '' then saffl = "Y";
      else saffl = "N";
      if armcd ^= '' then ittfl = 'Y';
      else ittfl = 'N';
      if rfstdtc ^= " " and rficdtc ^= " " then enrfl = 'Y';
      else enrfl = 'N';
    run;

R:
    adsl$saffl <- ifelse(!is.na(adsl$exdose) & !is.na(adsl$exstdtc), "Y", "N")
    adsl$ittfl <- ifelse(!is.na(adsl$armcd), "Y", "N")
    adsl$enrfl <- ifelse(!is.na(adsl$rfstdtc) & !is.na(adsl$rficdtc), "Y", "N")

Step 12: Treatment Variables: TRT01P, TRT02P, TRT01PN, TRT02PN, TRT01A, TRT02A, TRT01AN, TRT02AN

SAS:
    data adsl4;
      set adsl3;
      if armcd = "xx" then do;
        trt01p = "refe";
        trt02p = "test";
        trt01pn = 2;
        trt02pn = 1;
      end;
      if armcd = "yy" then do;
        trt01p = "test";
        trt02p = "refe";
        trt01pn = 1;
        trt02pn = 2;
      end;
    run;
    /* Similarly for actual treatment */

R:
    # Treatment variables for planned treatment
    adsl$trt01p[adsl$armcd == "xx"] <- "refe"
    adsl$trt02p[adsl$armcd == "xx"] <- "test"
    adsl$trt01pn[adsl$armcd == "xx"] <- 2
    adsl$trt02pn[adsl$armcd == "xx"] <- 1

    adsl$trt01p[adsl$armcd == "yy"] <- "test"
    adsl$trt02p[adsl$armcd == "yy"] <- "refe"
    adsl$trt01pn[adsl$armcd == "yy"] <- 1
    adsl$trt02pn[adsl$armcd == "yy"] <- 2
    # Similarly for actual treatment
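The flag derivation can be tried on toy data as in the sketch below. One caveat worth checking on real data: the SAS code tests character variables against blank strings, while is.na() in R only catches NA, so a blank string imported from SAS would count as populated. The helper has_value is hypothetical and treats both NA and blank strings as missing; the toy values are illustrative only.

```r
# Toy ADSL rows (hypothetical values) mixing NA and blank strings
adsl <- data.frame(
  usubjid = c("S-001", "S-002", "S-003"),
  armcd   = c("xx", "yy", ""),
  exdose  = c(10, NA, NA),
  exstdtc = c("2020-01-10", "", NA),
  rfstdtc = c("2020-01-10", "2020-01-15", ""),
  rficdtc = c("2020-01-02", "2020-01-08", "2020-01-09"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a value is neither NA nor blank
has_value <- function(x) !is.na(x) & trimws(as.character(x)) != ""

adsl$saffl <- ifelse(has_value(adsl$exdose) & has_value(adsl$exstdtc), "Y", "N")
adsl$ittfl <- ifelse(has_value(adsl$armcd), "Y", "N")
adsl$enrfl <- ifelse(has_value(adsl$rfstdtc) & has_value(adsl$rficdtc), "Y", "N")
```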
library(Hmisc)
Using the label function in Hmisc, we attach a label to each variable.
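A minimal sketch of the labelling step is shown below, assuming the Hmisc package is installed. The toy data frame and the label texts are illustrative and are not taken from the paper's dataset.

```r
library(Hmisc)

# Toy ADSL rows (hypothetical values)
adsl <- data.frame(
  usubjid = c("S-001", "S-002"),
  sexn    = c(1, 2),
  saffl   = c("Y", "N"),
  stringsAsFactors = FALSE
)

# Attach a label to each variable
label(adsl$usubjid) <- "Unique Subject Identifier"
label(adsl$sexn)    <- "Sex (N)"
label(adsl$saffl)   <- "Safety Population Flag"

# Read a label back to confirm it was stored
label(adsl$saffl)
```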
The call describe(adsl) can be used to check the labels attached to the variables. Below is a snapshot of the dataset, which displays
the labels alongside the variable names.
Two key challenges were faced when coding in R: (1) R does not provide a log like SAS, so debugging code is more difficult in R
than in SAS; (2) we were able to attach labels to the variables but were not able to attach a label to the dataset.
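One possible workaround for the second challenge, which was not part of this work, is to store the dataset label as a plain attribute on the data frame. The sketch below uses only base R. The attribute name "label" and the label text are assumptions, and whether the attribute survives later subsetting, merging or export depends on the functions applied afterwards, so it should be verified in the target workflow.

```r
# Toy ADSL rows (hypothetical values)
adsl <- data.frame(
  usubjid = c("S-001", "S-002"),
  stringsAsFactors = FALSE
)

# Store a dataset-level label as an attribute of the data frame
attr(adsl, "label") <- "Subject-Level Analysis Dataset"

# Read it back
attr(adsl, "label")
```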
Conclusion
In this paper we generated an ADaM-compliant ADSL dataset using the R programming language. We demonstrated that R can be
used as an effective alternative for creating clinical trial datasets. We provided a step-by-step process to set up the R environment
and the R code for reading the input datasets and processing the data to generate the ADSL dataset. We also compared the SAS and
R code for the process, discussed the challenges encountered, and addressed issues such as attaching labels to variables.
Reference
1. Prasanna Murugesan, 2018, “Clinical Trial Datasets (CDISC - SDTM/ADaM) Using R”, Phuse US Connect.
2. Monika M. Wahi, Peter Seebach, 2018, “Analyzing Health Data in R for SAS Users”, Boca Raton, Florida, Taylor &
Francis.
3. CDISC Analysis Data Model Team, 2016, “Analysis Data Model Implementation Guide Version 1.1”.
4. Ali Dootson, 2020, “TFL Programming in R versus SAS”, d-wise, https://www.d-wise.com/blog/tfl-programming-in-r-versus-sas
5. Amol Waykar, Kevin Kramer, Kalyani Komarasetti, Andrew Miskell, 2020, “Generating TFLs in R - Challenges and
Successes compared to SAS”, Phuse US Connect.
Acknowledgement
I would like to thank Nagadip Rao, Associate Director of Eliassen Group Life Science for reviewing this paper and providing
valuable comments. I would also like to thank Lalitkumar Bansal of Statum Analytics for technical discussions and editorial inputs
to this paper.
Contact Information
Vipin Kumpawat
Eliassen Group Life Science
Somerset, New Jersey, USA
[email protected]