GPP Guidance Document v1.1
Steering Board for Good Programming Practice in Health and Life Sciences
Version 1.1 March 2014
Table of Contents
Introduction
Getting Started With a New Project
Language
Program header
Revision history
Comments
Naming conventions
Coding conventions
Log File Checking
Portability
Hard coding
Defensive programming
APPENDIX
Guidance on Good Programming Practice
Introduction
This document provides guidance for good programming practices (GPP) for analysis,
reporting and data manipulation of clinical data in health and life sciences organizations.
This guidance is primarily aimed at SAS programmers; however, the principles of GPP also
apply to other languages such as R and Stata. In addition, although this is not produced with
SAS macros in mind, the same principles apply to macros too.
We often have to update existing programs to add new rules, copy programs from one study
to another, and take over programs written by others. The guidance aims to show how to
produce well structured and well documented programs so that they are easy to read and
maintain over time. It is meant to be applicable to all programs, and hence all programmers
regardless of experience. Specific rules may be of more use to novice programmers, but
experienced programmers and mentors should also keep the principles in mind.
Getting Started With a New Project
Before you start programming, it is important to do the following:
• Familiarize yourself with the system you are working on.
• Check for company specific programming standards.
• Check for study and project specific standards.
• Check for industry standards like Clinical Data Interchange Standards Consortium
  (CDISC) which are to be applied or can be applied.
• Check if a similar project/study has been worked on, i.e. check if available SAS code
  can be reused.
• Check for project-independent macros that can be applied.
Language
The language used in programming code and within headers and comments is English.
Reference organization specific guidance here.
Program header
A standard header should be used for every program. The purpose of the header is to identify
the program and provide documentation including revision history. It provides the necessary
information for a code reviewer to identify and understand the program and its development
life cycle. The elements included in a header will vary from organization to organization but
below is a discussion of some of the most common elements.
Required elements
The following should be included in all program headers:
• Identification of the project of which the program is a part.
• Program name.
• Author identification which should be human readable and unique.
• Short description of program purpose.
• List of macros used in the program.
• Date program was first put into production, was finalized, or when it passed first
  validation.
  o This date will be chosen based on the operational procedures used within the
    company/organization creating the program. The date should indicate the
    first date when the program was released for final use.
• Revision history (see discussion below).
Recommended elements
The following are not required but are highly recommended in all program headers:
• All outputs generated by the program, including both file creation and modification.
• External files used such as datasets or databases that are used as data inputs to the
  program or macros used.
• Platform and operating system on which the program was developed to run.
• Software/programming language and version in which the program was written.
Revision history
The revision history section is critical to document the revisions made to the program once it
is put into production. A well designed revision history section should include the author of
the change, date of release of the change, and a short description of the change. Revision history
may also include a version number for changes which can be used as a reference in the code.
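As an illustration only, a header covering the required elements, some of the recommended elements, and a revision history could be laid out as in the sketch below. The layout and field labels are assumptions, and every value in angle brackets is a placeholder to be filled in according to the organization's own template.

```sas
/******************************************************************************
* Project       : <project identifier>
* Program name  : <program name>.sas
* Author        : <full name> (<unique user id>)
* Description   : <short description of the program purpose>
* Macros used   : <list of macros called by the program>
* Release date  : <date first released for final use>
*
* Inputs        : <datasets, databases or external files read>
* Outputs       : <files created or modified>
* Platform      : <platform and operating system>
* Software      : <software / programming language and version>
*------------------------------------------------------------------------------
* Revision history
* Version  Date          Author     Description
* <n.n>    <yyyy-mm-dd>  <user id>  <short description of the change>
******************************************************************************/
```

Keeping the revision history as the last part of the header means new entries can be appended without disturbing the other fields, and the version number can be quoted in a comment next to the changed code.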
Comments
Comments are important to help anyone reviewing, modifying or using a program to be able
to quickly understand the code. All major data or proc steps should be commented,
especially data specific and complex code. Ideally comments should be comprehensive, and
should describe the rationale and not simply the action. For example, instead of simply
typing "Access demography data", describe which data elements you are accessing and why
they are needed, for example, "Bringing in DM to get gender and age and subset to include
only the intent to treat population". Comments can also include links to external
documentation (requirement specifications, design documents). The programs should also be
split up into sections by creating different types of comments, e.g. many rows with asterisks.
This helps to structure the program and make it easier for others to see an overview of the
program.
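The sketch below shows one possible way to combine a section divider with a comment that gives the rationale rather than just the action. The dataset work.dm, its variables (usubjid, sex, age, ittfl) and the values in it are hypothetical and exist only so that the example runs on its own.

```sas
/******************************************************************************/
/* Section 1: Read demography data                                            */
/******************************************************************************/

/* Hypothetical demography data, created here only to make the sketch runnable. */
data work.dm;
  input usubjid $ sex $ age ittfl $;
  datalines;
S001 M 34 Y
S002 F 51 N
S003 F 47 Y
;
run;

/* Bring in DM to get sex and age for the baseline summary, and subset to the */
/* intent to treat population because the summary is reported for ITT only.  */
data work.dm_itt;
  set work.dm (keep=usubjid sex age ittfl);
  where ittfl = "Y";
run;
```

The second comment tells a reviewer why the subset is applied, which a comment such as "Access demography data" would not.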
Reference organization specific guidance here.
Naming conventions
All organizations should have standard naming conventions. Program naming conventions
will make it possible to identify groups of related programs such as adverse events tables.
Dataset and variable names should describe as best as possible their content, but of course
datasets following CDISC standards will have pre-defined names. Space characters should be
avoided in variable, dataset and output file names.
Reference organization specific guidance here.
Coding conventions
In order to be efficient and streamline the sharing of program code between programmers,
with regulatory agencies, and with external partners or vendors, it is vital for code structure to
follow standard conventions. SAS code which follows these conventions is much easier to
read, modify, maintain, and correct. These conventions are divided into those which are
required and those which are recommended.
Required conventions
• Do not overwrite existing datasets; use different meaningful names for each
  temporary dataset.
• Each organization may have its own standards for using case within programming
  code but use of all uppercase should be avoided.
• Separate data steps and procedures with at least one blank line.
• Use the data=dataset option in procedure statements so that the dataset being used is
  explicitly stated to ensure that the statement will work if it is moved to another
  location.
• End data steps and procedures with run or quit to provide a boundary and allow for
  independent execution.
• Split data steps into logical parts.
• Put each statement on a separate line.
• Left justify global statements and data and procedure statements and their
  corresponding run and quit statements.
• Indent statements belonging to a level by 2 to 5 columns (use the same number of
  spaces throughout the program), i.e. every nesting level should be visibly indented
  from the previous level.
• Do not use tabs for indentation because they will display differently depending on the
  platform and text editor being used; use blanks instead.
• For do loops, place the end statement in the same position as the do statement so that
  they can be easily matched.
• Insert parentheses in meaningful places in order to clarify the sequence in which
  mathematical or logical operations are performed.
• When converting character variables to numeric or vice versa, use the put and input
  functions to explicitly convert the variable to ensure that it is done in the way
  intended and to avoid errors, warnings, and notes in the program log.
• Structure your program to read in all external data at the top, do the processing, then
  produce any outputs or permanent analysis datasets.
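A short sketch that applies several of the required conventions together is given below: a new dataset name at each step, an explicit data= option, run statements as step boundaries, consistent two-space indentation, aligned do and end, clarifying parentheses, and explicit conversion with the input function. The dataset work.vs_raw, its variables and its values are hypothetical and are created only so that the example is self-contained.

```sas
/* Hypothetical vital signs data; weight is held as a character variable. */
data work.vs_raw;
  input usubjid $ visitnum weight_c $;
  datalines;
S001 1 70.5
S001 2 71.0
S002 1 82.3
;
run;

/* Convert weight explicitly so that no automatic conversion NOTE is written */
/* to the log, and flag the baseline visit.                                  */
data work.vs_num;
  set work.vs_raw;
  weight_n = input(weight_c, 8.);
  if (visitnum = 1) then do;
    baseline_fl = "Y";
  end;
  else do;
    baseline_fl = "N";
  end;
run;

/* Write the sorted data to a new dataset instead of overwriting the input. */
proc sort data=work.vs_num out=work.vs_sorted;
  by usubjid visitnum;
run;

proc means data=work.vs_sorted n mean;
  var weight_n;
run;
```

Because every procedure names its input with data= and ends with run, any one of these steps can be moved or executed on its own without depending on which dataset happened to be created last.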
Recommended conventions
• Perform only one task per module or macro.
• Use logical groupings to separate code into blocks.
• Double space between sections.
• Group similar statements together.
• Define new variables with the attrib statement in order to ensure that the variable
  properties such as length, format, and label are correct instead of allowing them to be
  implicitly determined by the circumstances in which they are initialized in the code.
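The value of the attrib recommendation can be seen in a minimal sketch such as the one below. The dataset work.subj, the variable names and the age cut-off are hypothetical. Without the attrib statement, agegr1 would take its length from the first assignment ("<65", three characters) and the value ">=65" would be truncated.

```sas
/* Hypothetical subject-level data, created only to make the sketch runnable. */
data work.subj;
  input usubjid $ age height_cm weight_kg;
  datalines;
S001 34 180 81
S002 67 165 60
;
run;

data work.subj_derived;
  set work.subj;

  /* Declare the properties of the new variables before they are assigned. */
  attrib
    agegr1 length=$5 label="Age group"
    bmi    length=8  format=8.1 label="Body mass index (kg/m2)"
  ;

  if (age < 65) then agegr1 = "<65";
  else agegr1 = ">=65";

  bmi = weight_kg / ((height_cm / 100) ** 2);
run;
```

A production version would also need to decide how a missing age is handled, since a missing numeric value compares as less than 65 in SAS.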
Reference organization specific guidance here.
Log File Checking
As part of development and validation practices, it is often mandated that the log file
generated is checked to ensure that the program has executed as intended. Many
companies may have their own automatic log file checking utilities to aid in this, and there
are many examples of such tools in widely available papers. ERROR and WARNING messages in
logs should normally be avoided. There are sometimes exceptions to this, such as warnings
that are output from statistical models that do not have enough data. Ordinarily, any
warnings that are deemed acceptable are documented. There are also some specific
NOTEs that can indicate a problem. The common NOTEs that should normally be
avoided include those containing the text "repeats", "more than one", "uninitialized" and
"referenced".
Also, any user defined checks that have been added, such as from defensive programming,
should be checked for in the log and followed up on. A company-specific naming convention
for user defined checks can aid in this, so the specific string can be searched for within the
log. Examples of such conventions include ISSUE:, USER:, and ALERT:. Avoid the
use of user-generated errors and warnings labeled "NOTE:", "WARNING:" or "ERROR:", as
these may make it difficult to find genuine problems when searching the log.
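A very small log check along these lines might look like the sketch below. It is not a substitute for a validated company utility. The sample log lines are made up and written to a temporary file only so that the example runs on its own, and the ALERT: prefix is assumed to be the user defined convention; in practice the infile statement would point at the log being checked.

```sas
/* Write a small, made-up log to a temporary file (illustration only). */
filename samplog temp;

data _null_;
  file samplog;
  put "NOTE: The data set WORK.DEMO has 3 observations and 2 variables.";
  put "NOTE: Variable x is uninitialized.";
  put "WARNING: Multiple lengths were specified for the variable y.";
  put "ALERT: Unexpected visit value found.";
run;

/* Keep only the log lines that need to be followed up. */
data work.log_issues;
  length issue_type $10 logline $256;
  infile samplog truncover;
  input logline $char256.;
  line_no = _n_;
  if (logline =: "ERROR") then issue_type = "ERROR";
  else if (logline =: "WARNING") then issue_type = "WARNING";
  else if (logline =: "ALERT:") then issue_type = "USER CHECK";
  else if (logline =: "NOTE") and
          ((index(logline, "repeats") > 0) or
           (index(logline, "more than one") > 0) or
           (index(logline, "uninitialized") > 0) or
           (index(logline, "referenced") > 0)) then issue_type = "NOTE";
  if not missing(issue_type) then output;
run;

proc print data=work.log_issues noobs;
  var line_no issue_type logline;
run;
```

Only lines that start with one of the prefixes are flagged, so source statements echoed in the log (which are preceded by a line number) are not picked up by mistake.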
Portability
Most organizations are now working across multiple platforms, commonly combining
Windows and Unix environments. There can be many occasions where code will work on
one platform and not on another. Portability is more than just working across multiplatform
environments; it is also about making programs easier to use across projects. Below are
some suggestions to address some of the most common impediments to portability.
• Use rounding in newly created variables (if applicable) in order to avoid different results
  e.g. from 64 bit operating systems to 32 bit systems. (However give careful consideration
  to doing this and round at the limit of precision as otherwise it may affect results. Where
  rounding is only required for presenting results, do so after calculations and derivations
  are completed.)
• Avoid explicitly defining file paths in libname, filename, and %include statements
  requiring platform specific syntax such as forward slash or back slash.
• Avoid the use of X commands to execute statements directly on the operating system.
• Avoid explicit project or data specific code by using macro variables where possible. An
  example of this is using macro variables to describe dosing groups in table headers
  instead of typing them out in the report section.
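One possible way to follow the last two path and macro variable suggestions is sketched below. All names are hypothetical: project_root, trt1_label, trt2_label, the libref projdata, and the counts in the dataset. The WORK library location is used as a stand-in for a project folder so that the example runs on any platform; in practice the root would be set once in a project setup program.

```sas
/* Project specific values are defined once, in one place. */
%let project_root = %sysfunc(pathname(work));
%let trt1_label   = Placebo;
%let trt2_label   = Drug X 10 mg;

/* No platform specific path is typed into the libname statement. */
libname projdata "&project_root";

/* Hypothetical counts, created only to have something to report. */
data projdata.trt_counts;
  length treatment $40;
  treatment = "&trt1_label";
  n = 25;
  output;
  treatment = "&trt2_label";
  n = 27;
  output;
run;

title1 "Number of subjects by treatment group";

proc print data=projdata.trt_counts noobs label;
  label treatment = "Treatment group"
        n         = "N";
run;

title1;
```

Copying the program to another study then only requires the three %let statements to be changed, not the reporting code.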
Reference organization specific guidance here.
Hard coding
Hardcoding is the modification of the value of an item of source data within program code.
Hardcoding should be avoided whenever possible in final code, and changes to source data
should be done in data entry or capture systems which provide better compliance with regulations
such as FDA 21 CFR Part 11. Hardcoding may be done temporarily in order to get a program to
run despite dirty data or to correct for database inconsistencies. Permanent hardcoding to fix
incorrect data values in a final database is strongly discouraged, but if it is unavoidable then it
must be approved following a standard process and clearly documented using standard
comments and PUT statements to the log to show what has been hard coded.
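Where a hard code cannot be avoided, the documentation described above might be sketched as follows. The dataset, the subject, the invalid date and its replacement are all hypothetical, the approval reference is left as a placeholder, and the HARDCODE: log prefix is an assumed user defined convention.

```sas
/* Hypothetical adverse event data containing an invalid start date. */
data work.ae_raw;
  input usubjid $ aeseq aestdtc :$10.;
  datalines;
S001 1 2013-02-30
S002 1 2013-03-15
;
run;

data work.ae_fixed;
  set work.ae_raw;

  /* HARDCODE: invalid AE start date for subject S001, AE sequence 1.      */
  /* Approved under <approval reference>. Remove once the source database  */
  /* has been corrected.                                                    */
  if (usubjid = "S001") and (aeseq = 1) and (aestdtc = "2013-02-30") then do;
    put "HARDCODE: " usubjid= aeseq= "AESTDTC changed from " aestdtc "to 2013-02-28";
    aestdtc = "2013-02-28";
  end;
run;
```

Testing for the original incorrect value as part of the condition means the hard code stops being applied, and stops writing to the log, as soon as the source data is corrected.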
Reference organization specific guidance here.
Defensive programming
Defensive programming is an approach to programming intended to anticipate future changes
of the data that might influence the coding algorithms. Ideally programs should be written in
such a way that they will continue to work correctly in case of new or unexpected data values
which did not exist at the time the code was developed. Analysis dataset and table programs
are often developed in the early stages of a project or even when the only available data is test
data. In these situations the data often does not contain all possible values of data points such
as visits or time points, race values, and questionnaire responses, but the program must be
able to handle those values when they do become present in the data at a later point.
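As a minimal sketch of this idea, the step below maps visit names to visit numbers and writes a user defined message to the log for any value that was not anticipated when the code was written. The dataset work.qs_raw, the visit values, the variable avisitn and the ALERT: prefix are assumptions made for illustration.

```sas
/* Hypothetical questionnaire data with a visit value that was not expected. */
data work.qs_raw;
  input usubjid $ visit :$20.;
  datalines;
S001 SCREENING
S001 WEEK4
S002 UNSCHEDULED
;
run;

data work.qs_visits;
  set work.qs_raw;
  select (visit);
    when ("SCREENING") avisitn = 0;
    when ("WEEK4")     avisitn = 4;
    otherwise do;
      /* Unanticipated value: leave the result missing and report it. */
      avisitn = .;
      put "ALERT: Unexpected visit value " visit= usubjid=;
    end;
  end;
run;
```

The otherwise branch prevents a new visit from being silently assigned to an existing category, and the ALERT: text can be picked up by the log check.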
APPENDIX
Appendices can be added to the document to include organization specific guidance as well
as any templates or examples.