02 - Programming Languages
02 - Programming Languages
Programming Languages are an integral component of data science within the context of the
pharmaceutical industry. Their use is varied, and they can, for example, be used to provide analytics
from data collected from real-world evidence, clinical and nonclinical trials, to assess the efficacy
and safety of drugs and/or devices (e.g. inhalers) in trials.
One of the most prevalent languages in this area, and which has been used for decades, is SAS, but
over time many other languages and/or analytics tools have come and gone. Some of the most
popular languages at present include (but is not limited to):
• SAS
• R
• Python
• Julia
• S-PLUS
Newcomers to the pharmaceutical industry may also be surprised to see older languages still in use
in specialised cases (e.g. Fortran), but these cases are now rare and for this overview, we will look at
those languages that are currently prevalent.
Note that the above languages can support data visualisation aspects and, indeed, some of these
have specific language sets designed to facilitate data visualisation (e.g. R-Shiny). This aspect is not
covered here as it’s a specialised topic in its own
right. Similarily, machine learning, deep learning and artificial intelligence will be covered
elsewhere.
SAS
Created by Anthony James Barr at North Carolina State University, the SAS language started out as a
computer programming language used for statistical analysis. Over the years, SAS evolved to run
across multiple operating systems on mainframes, desktop computers and, more recently, on cloud-
based servers.
The capabilities of SAS have also increased over the years, adapting to the latest trends in the
industry, and its development is based by The SAS Institute under the leadership of their CEO, Dr Jim
Goodnight.
Importantly, the submission of data for approval of a drug by regulators such as the FDA mandates
the use of SAS Datasets using the CDISC STDM and ADaM standard. This often leads to the full set of
analytics being performed in SAS.
SAS is best described as a procedural language, where procedures are used to perform specific
statistical analyses. For example, PROC MIXED allows the analysis of data using mixed effect
statistical models; PROC GLM allows the analysis of data using general linear models mixed. Non-
parametric analyses are also catered for with, for example, the PROC NPAR1WAW.
SAS also incorporates some very powerful data manipulation procedures to sort, transpose, merge
and manipulate data. One procedure even allows the use of ANSI standard SQL to manipulate data,
but the most complex manipulations are often performed using programming commands within
what is called the DATA STEP.
SAS also incorporates your own sub-routines via the SAS Macro Language.
R
R is a free, community-driven, open-source programming language widely used in academia and
industry by statisticians and data miners for developing statistical software and data analysis. R itself
is an implantation of S programming language, whose roots go back as far as 1976.
R and its libraries implement a wide variety of statistical and graphical techniques,
including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification
and clustering.
R is easily extensible, object-orientated language and the R community is noted for its active
contributions in terms of developing new packages for statistical analysis.
Whilst having such a large community of developers is a strength of R, it can also be perceived as a
weakness since validation of R packages is less well defined compared to SAS. There are, however,
companies that support the validation of R, and the fact that the underlying source code is open to
scrutiny by the community as a whole brings a different dimension to its reliability.
At this time, submissions to regulators using SAS far outnumber submissions in R, but R is increasing
in use and will continue to do so.
2 - R Overview
Python
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van
Rossum and first released in 1991, Python’s design philosophy emphasises code readability with its
notable use of significant whitespace. Its language constructs and object-oriented approach aim to
help programmers write clear, logical code for small and large-scale projects.
Rather than having all of its functionality built into its core, Python was designed to be
highly extensible. This compact modularity has made it particularly popular as a means of adding
programmable interfaces to existing applications.
Especially for those who are interested in data science and machine learning fields, Python, along
with R, has become one of the must-know languages. And, as the pharmaceutical industry and
biometrics department move into data science and machine learning, programmers
with Python programming experience and knowledge will find themselves in demand.
3 - Python Overview and examples
Julia
First launched in 2012, Julia is a high-level, high-performance, dynamic programming language.
While it is a general-purpose language and can be used to write any application, many of its features
are well-suited for numerical analysis and computational science.
Whilst relatively new, Julia has some exciting potential in the area of data visualisation and could see
increased use in the pharmaceutical industry over time.
S-Plus
S-PLUS is a commercial implementation of the S statistical programming language, with a
publication-quality graphics package and a matrix-based programming language.
S-PLUS is written and runs in the TIBCO Spotfire S+ statistical programming environment.
Statisticians and researchers use S-PLUS to perform advanced statistical analysis on large datasets.
Reference Material
Recommended PHUSE educational material
Recommended Readings
Recommended Videos
Recommended Websites