Understanding The Define - XML File and Converting It To A Relational Database
Paper 157-2010
ABSTRACT
When submitting clinical study data in electronic format to the FDA, not only information from trials has to be submitted,
but also information to help understand the data. Part of this information is a data definition file, which is the metadata
describing the format and content of the submitted data sets. When submitting data in the CDISC SDTM format it is
required to submit the data definition file in the Case Report Tabulation Data Definition Specification (define.xml) format
as prepared by the CDISC define.xml team.
This paper illustrates how the define.xml can be transformed into relational SAS® data sets by using SAS XML Mapper
technology. Once the content of the define.xml is available in SAS data sets it can be presented as a PDF file with the
use of SAS ODS (Output Delivery System).
The relational SAS data sets can also be used to validate the metadata contained in the define.xml file against the SAS
transport files and the different types of SDTM metadata:
• metadata for domain datasets
• metadata for domain content (including value level metadata)
• metadata for controlled terminology
INTRODUCTION
The FDA issued the Final Guidance on Electronic Submissions using the eCTD specifications in April 2006. The latest
revision of this guidance was published in June 2008 [1].
Technical specifications associated with this guidance are provided as stand-alone documents. Among these are Study
Data Specifications [2] that provide further guidance for submitting animal and human study data in electronic format
when providing electronic submissions to the FDA. Study data includes information from trials submitted to the agency
for evaluation and information to understand the data (data definition). The study data includes both raw and derived
data.
As of January 1, 2008, sponsors submitting data electronically to the FDA are required to follow the new eCTD guidance.
The previous guidance, originally issued in 1999, has been withdrawn as of the same date.
The new guidance differs from the 1999 guidance in one significant aspect: The application table of contents is no longer
submitted as a PDF file, but is submitted as XML (eXtensible Markup Language). This means that the electronic
submissions will now be XML based.
The current version of the Study Data Specifications contains specifications for the Data Tabulation data sets of human
drug product clinical studies and provides a reference to the Study Data Tabulation Model (SDTM) [3][4][5][6] developed
by the Submission Data Standards (SDS) working group of the Clinical Data Interchange Standards Consortium (CDISC).
Further, the Study Data Specifications document gives a reference to the Case Report Tabulation Data Definition
Specification (CRT-DDS or define.xml) developed by the CDISC define.xml Team [7].
SAS Global Forum 2010 Hands-on Workshops
In July 2007 the CDISC Submission Data Standards (SDS) Metadata Team released a draft version of the Metadata
Submission Guidelines, Appendix to the Study Data Tabulation Model Implementation Guide 3.1.1 for review [9]. This
release included a sample electronic submission that contains examples of CRF annotations, metadata associated with
the submission domains, SDTM domains, and an example of a define.xml file.
Once the metadata that is contained in the define.xml is converted to relational SAS data sets, we can use the extensive
reporting capabilities of SAS to present the define.xml metadata in a variety of output formats like PDF, RTF or Excel.
The PDF representation of the define.xml will allow us to print the metadata contained in the define.xml.
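As a minimal sketch of this reporting step, the code below sends a PROC REPORT listing of variable-level metadata to a PDF file through ODS. The two rows in the demo data set, the column headers, and the output file name define_metadata.pdf are assumptions made for illustration; in practice the ItemDef data set would come from the converted define.xml.

```sas
/* Illustrative stand-in for the ItemDef data set; the rows are examples only. */
data work.ItemDef;
  length ItemDef_OID $40 ItemDef_Name $8 ItemDef_Label $40;
  infile datalines dlm='|';
  input ItemDef_OID $ ItemDef_Name $ ItemDef_Label $;
  datalines;
VS.VSTESTCD|VSTESTCD|Vital Signs Test Short Name
VS.VSTEST|VSTEST|Vital Signs Test Name
;
run;

/* Route the report to a PDF file (hypothetical file name). */
ods pdf file="define_metadata.pdf";
title "Variable-Level Metadata from define.xml";

proc report data=work.ItemDef nowd;
  column ItemDef_OID ItemDef_Name ItemDef_Label;
  define ItemDef_OID   / display "OID";
  define ItemDef_Name  / display "Variable";
  define ItemDef_Label / display "Label";
run;

ods pdf close;
title;
```

Replacing the ODS PDF statements with ODS RTF or another ODS destination would produce the other output formats mentioned above from the same PROC REPORT step.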
XML 101
In this section we present a short introduction to XML.
BASIC SYNTAX
The Extensible Markup Language (XML) [11] is a general-purpose markup language. It is classified as an extensible
language because it allows users to define their own elements. Its primary purpose is to facilitate the sharing of
structured data across different information systems. XML is a language that is hierarchical, text-based and describes
data in terms of markup tags. A good introductory guide to XML can be found in the reference section [12].
Every XML document starts with the XML declaration. This is a small collection of details that prepares an XML
processor for working with the document.
Elements are the basic building blocks of XML, dividing a document into a hierarchy of regions. Some elements are
containers, holding text or (child) elements. Others are empty and serve as place markers. Every XML file should have
exactly one root element that contains all other elements.
In the element starting tag there can be information about the element in the form of an attribute. Attributes define
properties of elements. They associate a name with a value, which is a string of character data enclosed in quotes.
There is no limit to how many attributes an element can have, as long as no two attributes have the same name.
Namespaces are a mechanism by which element and attribute names can be assigned to groups. They are most often
used when combining different vocabularies in the same document. If each vocabulary is given a namespace then the
ambiguity between identically named elements or attributes can be resolved.
A CDATA (character data) section tells the XML parser that this section of the document contains no markup and should
be treated as regular text. CDATA sections can be useful for large regions of text that contain a lot of ‘forbidden’ XML
characters. They should be used with care though, since it may be hard to use any elements or attributes inside the
marked region.
Processing instructions are containers for data that is targeted at a specific XML processor; other processors may
ignore them. A processing instruction looks like the XML declaration, and the XML declaration can itself be viewed as a
processing instruction for all XML processors.
Now that we have explained the most important building blocks of XML, we can look at a more complete example in
Figure 1.
An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure
and content of documents of that type. This description goes above and beyond the basic syntax constraints imposed
by well-formedness of an XML document [13].
The schema defines the allowed elements and attributes, the order of the elements, the overall structure, and so on.
A schema might also describe that the content of a certain element is only valid if it conforms to the ISO 8601 date and
time specification. An XML document is valid, if it conforms to a specific XML schema.
The following example illustrates the difference between well-formed and validated. The XML document in Figure 2 is
a well-formed XML document, but is obviously not valid with respect to the schema that defines a valid define.xml file.
The XPath standard gives XML developers a tool for navigating the structure of an XML document and extracting
information from it. Two examples based on Figure 2 demonstrate this:
• The XPath location path /favorites/single/artist returns all artist elements: “The Sheppards” and
“Joyce Harris”.
• The XPath location path /favorites/single/@genre returns all genre attributes, with the values “doowop” and “R&B”.
The XPath standard is part of the XSL family of standards.
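The two location paths above can be tried out in SAS with a small XMLMap, as in the following sketch. The XML document is a stand-in for Figure 2: only the artist elements and the genre values are taken from the text, and the rest of the layout is assumed. The librefs, filerefs, and column lengths are also illustrative. The sketch uses the XMLV2 engine nickname; in SAS 9.2 the corresponding nickname is XML92.

```sas
filename favxml temp;   /* temporary XML document */
filename favmap temp;   /* temporary XMLMap       */

/* Write a small well-formed document (assumed layout). */
data _null_;
  file favxml;
  put '<?xml version="1.0" encoding="UTF-8"?>';
  put '<favorites>';
  put '  <single genre="doowop"><artist>The Sheppards</artist></single>';
  put '  <single genre="R&amp;B"><artist>Joyce Harris</artist></single>';
  put '</favorites>';
run;

/* Write an XMLMap with one table; each column is located by an XPath. */
data _null_;
  file favmap;
  put '<?xml version="1.0" encoding="UTF-8"?>';
  put '<SXLEMAP name="favorites" version="2.1">';
  put '  <NAMESPACES count="0"/>';
  put '  <TABLE name="single">';
  put '    <TABLE-PATH syntax="XPath">/favorites/single</TABLE-PATH>';
  put '    <COLUMN name="artist">';
  put '      <PATH syntax="XPath">/favorites/single/artist</PATH>';
  put '      <TYPE>character</TYPE><DATATYPE>string</DATATYPE><LENGTH>40</LENGTH>';
  put '    </COLUMN>';
  put '    <COLUMN name="genre">';
  put '      <PATH syntax="XPath">/favorites/single/@genre</PATH>';
  put '      <TYPE>character</TYPE><DATATYPE>string</DATATYPE><LENGTH>20</LENGTH>';
  put '    </COLUMN>';
  put '  </TABLE>';
  put '</SXLEMAP>';
run;

/* Read the document through the map: one row per single element. */
libname fav xmlv2 xmlfileref=favxml xmlmap=favmap access=readonly;

data work.single;
  set fav.single;
run;

libname fav clear;

proc print data=work.single noobs;
run;
```

Because the genre column is mapped explicitly to /favorites/single/@genre, the attribute value ends up in the data set next to the artist element of the same single.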
XSL is the Extensible Stylesheet Language [14], one of the most complicated, and most useful, parts of XML. While
XML itself is intended to define the structure of a document, it does not contain any information on how it is to be
displayed. In order to do this we need a language, XSL, to describe the format of a document, ready for use in a display
application (computer screen, cell phone screen, paper, ...).
XSL is actually a family of transformation languages which allows one to describe how files encoded in the XML
standard are to be formatted or transformed.
The following three languages can be distinguished:
• XSL Transformations (XSLT): an XML language for transforming XML documents.
• XSL Formatting Objects (XSL-FO): an XML language for specifying the visual formatting of an XML document
• The XML Path Language (XPath): a non-XML language used by XSLT, and also available for use in non-XSLT
contexts, for addressing the parts of an XML document.
SCHEMA STRUCTURE
The CRT-DDS (define.xml) standard is based on the CDISC Operational Data Model (ODM). The ODM is defined by an XML
schema that allows extension [8]. This extension mechanism has been implemented by expressing the ODM schema
using two files:
• a foundation XML Schema file (ODM1-2-1-foundation.xsd), which defines the elements, attributes and structure
of the base ODM schema.
• an application XML Schema file (ODM1-2-1.xsd), which imports the foundation XML Schema and other schema
definitions needed by ODM, such as the core W3C XML schema (xml.xsd) and the XML schema that defines
the W3C XML Signature standard (xmldsig-core-schema.xsd).
In November 2009 CDISC published an XML Schema Validation for Define.xml white paper [15]. This white paper,
created by the CDISC XML Technology team, provides guidance on validating define.xml documents against the
define.xml XML schemas and proposes practices and tools to improve define.xml schema validation.
Figure 3: The define.xml (CRT-DDS) and the associated style sheet and schemas
The header section of the define.xml is important for the XML processor.
The first line identifies the file as an XML document and specifies the XML version (“1.0”) and the encoding of the
document (“ISO-8859-1”). The second line includes a reference to the style sheet (“define1-0-0.xsl”) that can be used by
an XSL processor to render the document. In this case the style sheet can be used by the XSL processor in a web
browser to render the XML file as HTML for display. As mentioned before, the HTML that gets created by the browser
depends on the particular browser application.
Following the header section is the ODM root element. All other elements in the define.xml file will be contained
within the ODM element. The ODM element contains attributes that define the namespaces and the location of the
schema that specifies the XML document.
Other required ODM attributes are displayed in Table 1.
Study is the first element contained in the ODM root element. The Study element collects static structural
information about an individual study. It has one attribute “OID”, which is the unique identifier of the Study.
The Study element has two child-elements in the define.xml:
• GlobalVariables - General summary information about the Study.
• MetaDataVersion - Domain and variable definitions within the submission.
GlobalVariables is a required child element of the Study element and contains three required child elements:
• StudyName - Short external name for the study.
• StudyDescription - Free-text description of the study.
• ProtocolName - The sponsor’s internal name for the protocol.
The MetaDataVersion is a child element of the Study element and contains the domain and variable definitions
included within a submission. Table 2 lists the MetaDataVersion attributes that are part of the define.xml file. The
attributes with a prefix of “def:” are extensions to the ODM schema.
Table 3 lists the MetaDataVersion child elements. The ItemGroupDef and ItemDef elements are required.
ItemGroupDef
For every data set in the submission there will be an ItemGroupDef element with domain-level metadata. Table 4 lists the
ItemGroupDef attributes.
1
SASDatasetName is an optional attribute that is part of the ODM foundation. This means that it is a valid attribute in the
define.xml. This attribute is not mentioned in the define.xml specification.
def:ComputationMethod
The def:ComputationMethod element has one attribute (OID) and contains the details about computational algorithms
used to derive or impute variable values.
Figure 8 illustrates several concepts that are related to ItemRef, ItemDef, def:ComputationMethod, CodeList and
def:ValueListDef elements.
2
SASFieldName is an optional attribute that is part of the ODM foundation. This means that it is a valid attribute in the
define.xml. This attribute is not mentioned in the define.xml specification.
The black numbers on the right side represent the source of a relation and the red numbers on the left side represent the
target of a relation. In every relation there has to be a correspondence between the identifier of the source and the
target.
For example, in relation 4 the source identifier is ItemRef/@ItemOID, the target identifier is ItemDef/@OID, and both have
the value “VS.VSTESTCD”.
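Once the metadata is held in SAS data sets, such a relation becomes an ordinary join. The sketch below follows relation 4 by joining ItemRef records to ItemDef records on matching OIDs. It uses the data set and variable names that appear in the validation queries later in this paper; the demo rows, including the ItemGroupDef OID "VS", the order numbers, and the labels, are assumptions made for illustration.

```sas
/* Illustrative ItemRef records (source side of relation 4). */
data work.ItemGroupDef_ItemRef;
  length ItemGroupDef_OID $40 ItemRef_ItemOID $40 ItemRef_OrderNumber 8;
  infile datalines dlm='|';
  input ItemGroupDef_OID $ ItemRef_ItemOID $ ItemRef_OrderNumber;
  datalines;
VS|VS.VSTESTCD|1
VS|VS.VSTEST|2
;
run;

/* Illustrative ItemDef records (target side of relation 4). */
data work.ItemDef;
  length ItemDef_OID $40 ItemDef_Name $8 ItemDef_Label $40;
  infile datalines dlm='|';
  input ItemDef_OID $ ItemDef_Name $ ItemDef_Label $;
  datalines;
VS.VSTESTCD|VSTESTCD|Vital Signs Test Short Name
VS.VSTEST|VSTEST|Vital Signs Test Name
;
run;

/* Follow the relation: ItemRef/@ItemOID must equal ItemDef/@OID. */
proc sql;
  create table work.relation4 as
  select r.ItemGroupDef_OID,
         r.ItemRef_OrderNumber,
         r.ItemRef_ItemOID,
         d.ItemDef_Name,
         d.ItemDef_Label
  from work.ItemGroupDef_ItemRef as r
       inner join work.ItemDef as d
       on r.ItemRef_ItemOID = d.ItemDef_OID
  order by r.ItemGroupDef_OID, r.ItemRef_OrderNumber;
quit;
```

The other relations in the figure can be followed in the same way by joining on the corresponding source and target identifiers.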
We can map the define.xml to the following data sets (all data sets also contain the MetaDataVersion OID, stored in the
SAS variable MetaDataVersion_OID):
• MetaDataVersion which contains one record:
o ODM attributes (xsi:schemaLocation, FileOID, ODMVersion, FileType, CreationDateTime)
o Study OID attribute
o Text of GlobalVariables child elements: StudyName, StudyDescription and ProtocolName
o MetaDataVersion attributes: OID, Name, Description, def:DefineVersion, def:StandardName and
def:StandardVersion
• AnnotatedCRF which contains:
o For every def:AnnotatedCRF element and every def:DocumentRef child element the leafID attribute
• SupplementalDoc which contains:
o For every def:SupplementalDoc element and every def:DocumentRef child element the leafID
attribute
• Leaf which contains:
o For every def:AnnotatedCRF element the leafID attribute and the corresponding def:leaf/@xlink:href
and def:leaf/def:title
o For every def:SupplementalDoc element the leafID attribute and the corresponding def:leaf/@xlink:href
and def:leaf/def:title
• ItemGroupDef which contains for every ItemGroupDef:
o the ItemGroupDef attributes: OID, Name, Repeating, IsReferenceData, SASDatasetName, Purpose,
def:Label, def:Structure, def:DomainKeys, def:Class, def:ArchiveLocationID
o from ItemGroupDef/def:leaf the ID attribute, xlink:href attribute and the contents of the def:title child
element (this is possible since every ItemGroupDef has exactly one def:leaf child element)
• ItemGroupDef_ItemRef which contains for every ItemRef element in an ItemGroupDef element:
o The OID attribute of the parent ItemGroupDef element
o ItemRef attributes: ItemOID, OrderNumber, Mandatory, Role and RoleCodeListOID
• ItemDef which contains for every ItemDef element:
o ItemDef attributes: OID, Name, DataType, Length, SignificantDigits, SASFieldName, Origin,
Comment, def:Label, def:DisplayFormat and def:ComputationMethodOID
o If applicable, the OID of the associated codelist: CodeListRef/@CodeListOID
o If applicable, the OID of the associated valuelist: def:ValueListRef/@ValueListOID
• ComputationMethod which contains for every def:ComputationMethod element:
o The OID attribute and the text content of the element.
The OID for a def:ComputationMethod element must be unique within a single study.
• CodeList which contains for every CodeList element:
o Attributes OID, Name, DataType and SASFormatName
We can already see an issue here, because the attribute /favorites/single/@genre (with values ‘doowop’ and ‘r&b’) has
not been included in the SAS data set.
Figure 9 shows part of an XMLMAP. It specifies that the data set ItemDef has a character variable named
ItemDef_Name (length 40), whose content is defined by the XPath specification
/ODM/Study/MetaDataVersion/ItemDef/@Name.
Figure 9: XMLMAP example
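Once an XMLMap of this kind has been saved from SAS XML Mapper, the define.xml can be read through the XML LIBNAME engine. The sketch below assumes the hypothetical file names define.xml and define.map in the current directory and copies only the ItemDef table; it does nothing when the files are not present. The XMLV2 engine nickname is used here; in SAS 9.2 the corresponding nickname is XML92.

```sas
%let define_xml = define.xml;   /* hypothetical location of the define.xml */
%let define_map = define.map;   /* hypothetical XMLMap saved from SAS XML Mapper */

%macro read_define;
  %if %sysfunc(fileexist(&define_xml)) and %sysfunc(fileexist(&define_map)) %then %do;
    filename sxlemap "&define_map";

    /* Each TABLE in the XMLMap becomes a member of the DEFINE library. */
    libname define xmlv2 "&define_xml" xmlmap=sxlemap access=readonly;

    /* Copy one mapped table into WORK so that it can be queried repeatedly. */
    data work.ItemDef;
      set define.ItemDef;
    run;

    libname define clear;
    filename sxlemap clear;
  %end;
  %else %put NOTE: define.xml or XMLMap not found, nothing was read.;
%mend read_define;

%read_define
```

The same DATA step pattern can be repeated for every other table defined in the XMLMap.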
3
This particular XMLMap gave a warning in the XML Mapper application: "Column (MetaDataVersion_OID) in table
(ItemDef) has an XPath outside the scope of the table path. The contents of this column may not correspond to other
row values and/or may be missing entirely." In this particular case using this XMLMap would not lead to unexpected
results. However, inexperienced SAS XML Mapper users might create tables by dragging elements and attributes into
the same SAS target table, regardless of the origin of the data in the XML source file. To guard against this, SAS
warns users when they go outside of the table boundary. This warning should be considered a "safety net".
There are numerous tools available that can check well-formedness and validate an XML file against an XML schema.
As mentioned earlier in this paper, the CDISC XML Technologies Team has published an XML Schema Validation for
Define.xml white paper, that provides guidance on validating define.xml documents against the define.xml XML schemas
and proposes practices and tools to improve define.xml schema validation [15].
To further ensure the quality of the define.xml, more validation needs to be performed:
• Validate against the rules as specified in the Case Report Tabulation Data Definition Specification (define.xml)
version 1.0 [7]. An example of this is the rule that the ItemOID attribute in an ItemRef element must match the
OID attribute of a corresponding ItemDef element.
• Check the consistency between the define.xml content and the SAS transport files. For example, an
ItemGroupDef element must contain an ItemRef element for each variable included in the corresponding SAS
transport file.
• Check the consistency between the define.xml and the SDTM metadata. An example of this is the “Mandatory”
attribute of the ItemRef element, which needs to be consistent with the SDTM Core attribute (Required,
Expected, Permissible).
Once we have the define.xml converted into relational SAS data sets, it will be easy to perform these validation checks.
/* List ItemDef records whose OID occurs more than once. */
PROC SQL;
  SELECT ItemDef_OID, ItemDef_Name, ItemDef_Label
  FROM ItemDef
  GROUP BY ItemDef_OID
  HAVING COUNT(*) > 1
  ORDER BY ItemDef_OID;
QUIT;

/* List ItemRef records that occur more than once within one ItemGroupDef. */
PROC SQL;
  SELECT ItemGroupDef_OID, ItemRef_ItemOID, ItemRef_OrderNumber
  FROM ItemGroupDef_ItemRef
  GROUP BY ItemGroupDef_OID, ItemRef_ItemOID
  HAVING COUNT(*) > 1
  ORDER BY ItemGroupDef_OID, ItemRef_ItemOID;
QUIT;

/* List ItemRef ItemOIDs that have no matching ItemDef OID. */
PROC SQL;
  SELECT ItemRef_ItemOID
  FROM ItemGroupDef_ItemRef
  EXCEPT
  SELECT ItemDef_OID
  FROM ItemDef;
QUIT;
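The consistency check between the define.xml content and the SAS transport files, described in the list above, could be sketched as follows. All demo data are assumptions made for illustration: WORK.VS stands in for a data set read from a transport file (for example through the XPORT LIBNAME engine), the variable ItemGroupDef_Name is an assumed name for the ItemGroupDef Name attribute, and the OID values are invented. The query lists variables that exist in the data set but have no ItemRef in the corresponding ItemGroupDef.

```sas
/* Stand-in for a data set read from a transport file. */
data work.vs;
  length STUDYID $12 VSTESTCD $8 VSTEST $40 VSBLFL $1;
  call missing(of _all_);
run;

/* Stand-ins for the metadata data sets created from the define.xml. */
data work.ItemGroupDef;
  length ItemGroupDef_OID $40 ItemGroupDef_Name $8;
  ItemGroupDef_OID  = 'VS';
  ItemGroupDef_Name = 'VS';   /* assumed variable name */
run;

data work.ItemGroupDef_ItemRef;
  length ItemGroupDef_OID $40 ItemRef_ItemOID $40;
  infile datalines dlm='|';
  input ItemGroupDef_OID $ ItemRef_ItemOID $;
  datalines;
VS|VS.STUDYID
VS|VS.VSTESTCD
VS|VS.VSTEST
;
run;

data work.ItemDef;
  length ItemDef_OID $40 ItemDef_Name $8;
  infile datalines dlm='|';
  input ItemDef_OID $ ItemDef_Name $;
  datalines;
VS.STUDYID|STUDYID
VS.VSTESTCD|VSTESTCD
VS.VSTEST|VSTEST
;
run;

/* Variables present in the data set but not described in the define.xml. */
proc sql;
  create table work.not_in_define as
  select c.memname, c.name
  from dictionary.columns as c
  where c.libname = 'WORK'
    and c.memname = 'VS'
    and upcase(c.name) not in
        (select upcase(d.ItemDef_Name)
         from work.ItemGroupDef as g,
              work.ItemGroupDef_ItemRef as r,
              work.ItemDef as d
         where g.ItemGroupDef_OID = r.ItemGroupDef_OID
           and r.ItemRef_ItemOID = d.ItemDef_OID
           and upcase(g.ItemGroupDef_Name) = c.memname);
quit;
```

In this demo the variable VSBLFL was deliberately left out of the metadata, so it is the variable the query reports. Reversing the roles of the two sources would list variables that are described in the define.xml but missing from the data set.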
CONCLUSION
Base SAS together with the SAS XML Mapper enables the conversion of the define.xml into relational SAS data sets.
This enables the creation of a PDF rendition of the define.xml file and the use of SAS to perform various checks to
validate the define.xml against the clinical study data. The methodology described in this paper is applicable to the
current version of the define.xml file, but also to future extensions.
REFERENCES
1. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and
Research (CDER), Center for Biologics Evaluation and Research (CBER).
“Final Guidance for Industry: Providing Regulatory Submissions in Electronic Format--Human Pharmaceutical
Applications and Related Submissions Using the eCTD Specifications”. Revision 2, June 2008.
(http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM072349.pdf)
2. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and
Research (CDER). Study Data Specifications, Version 1.5.1, January 2010
(http://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/UCM199759.pdf)
3. CDISC Study Data Tabulation Model, Version 1.1, April 28, 2005
(http://www.cdisc.org/content1605)
4. CDISC Study Data Tabulation Model Implementation Guide: Human Clinical Trials, Version 3.1.1, August 26, 2005
(http://www.cdisc.org/content1605)
5. CDISC Study Data Tabulation Model, Version 1.2, November 12, 2008
(http://www.cdisc.org/sdtm)
6. CDISC Study Data Tabulation Model Implementation Guide: Human Clinical Trials, Version 3.1.2, November 12,
2008 (http://www.cdisc.org/sdtm)
7. Case Report Tabulation Data Definition Specification (define.xml), Version 1.0, February 9, 2005
(http://www.cdisc.org/define-xml)
8. CDISC Operational Data Model (ODM), Version 1.2.1, January, 2005
(http://www.cdisc.org/odm)
9. CDISC Metadata Submission Guidelines, Appendix to the Study Data Tabulation Model Implementation Guide
3.1.1, Draft version 0.9, July 25, 2007 (http://www.cdisc.org/content1210)
10. CDISC SDTM/ADaM Pilot Project Report. January 31, 2008. (http://www.cdisc.org/content1037)
11. Extensible Markup Language (XML) 1.0, Fourth Edition, August 16, 2006
(http://www.w3.org/TR/2006/REC-xml-20060816)
12. Eric T. Ray, 2003, Learning XML, Creating Self-Describing Data. 2nd Edition, (O’Reilly and Associates)
13. Eric van der Vlist, 2002, XML Schema, The W3C’s Object-Oriented Descriptions for XML (O’Reilly and Associates)
14. Doug Tidwell, 2001, XSLT, Mastering XML Transformations. (O’Reilly and Associates)
15. XML Schema Validation for Define.xml, Version 1.0, November 30, 2009
(http://www.cdisc.org/define-xml)
16. SAS Institute Inc. 2009. SAS® 9.2 XML LIBNAME Engine: User’s Guide. Cary, NC: SAS Institute Inc.
(http://support.sas.com/documentation/cdl/en/engxml/61740/PDF/default/engxml.pdf)
17. SAS XML Mapper download: http://www.sas.com/apps/demosdownloads/92_SDL_sysdep.jsp?packageID=000513
CONTACT INFORMATION
Lex Jansen,
Senior Consultant, Clinical Data Strategies
Octagon Research Solutions, Inc.
585 East Swedesford Road, Suite 200
Wayne, PA 19087
Email:
This paper can be found at http://www.lexjansen.com together with links to more than 10,000 other papers that
were presented at SAS user groups.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute
Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of
their respective companies.
APPENDIX
(incomplete) data model for Define.xml