Metadata Vs Paradata-FT
Metadata and Paradata
TABLE OF CONTENTS
Executive Summary
Preface
Background
I. Terminology
II. Review of Metadata Resources
III. Specific Efforts in Other Countries
IV. Specific Initiatives in the US Government
V. Survey of ICSP Websites
VI. Other Items of Interest
Appendix A: An Illustrative Example
Appendix B: WSDL For Data Swapping
Appendix C: Author
EXECUTIVE SUMMARY
Metadata and/or paradata accompany federal statistical agency data files to describe or define the data
elements and the collection and processing of these data. Practices vary across the Statistical Community
of Practice (SCOP). Distinctions between the two terms are not well-defined, and there are no generally
accepted standards in use by all the federal statistical agencies. Therefore SCOP commissioned the survey
of definitions and practices in current use that makes up this report, with the objective of laying the
groundwork for development of standardized definitions and practices.
The goal of this report is therefore to provide the necessary background information for developing and
implementing standardized definitions and best practices for metadata and paradata within the US federal
system. In particular, this report
1. Reviews existing practices and resources both nationally and internationally;
2. Inventories current metadata and paradata practices in federal statistical agencies;
3. Identifies key elements of metadata and paradata that would be useful to federal statistical
agencies in development and implementation of standardized definitions.
The following definitions are proposed to SCOP:
• Metadata: Formalized data about statistical data needed to search for, display and analyze those
data.
• Paradata: Formalized data on methodologies, processes and quality associated with the production
and assembly of statistical data.
Note: survey weights should be regarded as data, even though calculation of them employs paradata about
the design and conduct of the survey.
• Markup Language: A method for annotating text in a way that is syntactically distinguishable from
that text and in consequence is computer-processable (e.g., HTML, HyperText Markup Language).
To date, there are essentially two systems for metadata and paradata on which to build. Each is
tied to a description language. DDI is based on XML (Extensible Markup Language); SDMX is based on UML
(Unified Modeling Language).
A five-step process is suggested for the development and implementation of ICSP (Interagency Committee
on Statistical Policy) agency-wide definitions and practices.
1. Agreement on definitions for data, metadata and paradata with a concept of metadata that is
completely independent of that of file structure.
2. Commitment to markup language-based metadata, and to markup language-based data.
3. “Proof-of-concept” based on one or both of DDI and SDMX, consisting of full markup-language-based
metadata and data files for a modest-scale survey, including development of parsers that
would put the data into common formats.
4. Creation of an ICSP-wide extensible metadata template for surveys, together with tools for
common tasks.
5. Creation of a repository for metadata for surveys.
PREFACE
The Federal Statistical Community of Practice (SCOP) and the National Center for Education Statistics (NCES)
asked the National Institute of Statistical Sciences (NISS) to survey existing standards and practices for
metadata and paradata that are currently in use for federal data. The goals for this report are therefore
to:
1. Review existing practices and resources both nationally and internationally to help inform efforts to
develop and implement standardized metadata and paradata for use in the US statistical system.
2. Inventory current metadata and paradata practices in federal statistical agencies to identify
possible best practices and possible gaps.
3. Identify key elements of metadata and paradata that would be useful to federal statistical agencies,
moving next to the development of standardized definitions, and ultimately to the implementation
of those definitions.
I. TERMINOLOGY
Metadata and/or paradata accompany federal statistical agency data files to describe or define the data
elements and the collection and processing of these data. Practices vary among agencies (and other
holders of large data files). Distinctions between the two terms are not well-defined, and there are no
generally accepted standards in use by all the federal statistical agencies. Therefore the current report
surveys definitions and practices in current use with the objective of laying the groundwork for
development of standardized definitions and practices.
The terms metadata and paradata are sometimes used synonymously, but we believe that there is a
meaningful, albeit nebulous, distinction. The SDMX Metadata Common Vocabulary (see Section 3.2.2)
contains the following definitions:
• Metadata: data that defines [sic] and describes [sic] other data and processes
• Statistical Metadata: Data about statistical data, comprising data and other documentation that
describe objects in a formalized manner. They provide information on data and about processes
of producing and using data. Statistical metadata describe statistical data and - to some extent -
processes and tools involved in the production and usage of statistical data.
There is no corresponding definition of paradata, but the same source further states that “there is a clear
high-level distinction between the metadata needed to search for and display data (Structural metadata)
and the metadata that give more information on definitions, methodologies, processes and quality
(Reference metadata).”
Some people would define paradata as a subset of what SDMX calls metadata: “information about […]
processes and tools involved in the production and usage of statistical data.”
It is not clear that a distinction between metadata and paradata will persist into the future, but it is useful
to maintain the distinction in the short run, and in particular to understand the current situation. One
justification is that although in some settings a distinction between metadata and paradata is not useful,
the distinction is useful in the context of surveys.
To some extent, similar issues arise in distinguishing data from metadata and paradata, which are discussed
further below.
1 Data provenance is a major concern in the digital archiving community, especially insofar as it affects and is informative about
data quality. The usage is similar but not identical to that in connection with works of art. A useful introduction is available at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.7132&rep=rep1&type=pdf
useful course of action at this time. An intermediate position would be to permit an original dataset and a
confidentiality-protected version to have different metadata.
It may also be useful to think of metadata as logically distinct from the data collection process and collected
data, in the sense that metadata are definable in advance of any data collection. This would imply that only
data and paradata can contain record-level values, and that metadata cannot. In terms of the discussion
above, this would require that reports of nonresponse bias analysis be regarded as either data or paradata.
An essential break with the past is that metadata does not mean file structure. Indeed, the markup
language-based current thinking discussed in Section 2.2 and the appendices forces metadata and file
structure to be logically and operationally independent of each other. The same reasoning applies in
principle to the terms “codebook” and “data dictionary.” In practice, objects identified as one of these
seem to be closer to being true metadata than unannotated descriptions of data objects and data products.
2 See http://en.wikipedia.org/wiki/HTML
</NAME>
<STREET_ADDRESS>
<NUMBER>19</NUMBER>
<STREET_NAME>T. W. Alexander</STREET_NAME>
<STREET_SUFFIX>Drive</STREET_SUFFIX>
</STREET_ADDRESS>
<BOX_ADDRESS>
<PREFIX>P.O. Box</PREFIX>
<NUMBER>14006</NUMBER>
</BOX_ADDRESS>
<CITY>Research Triangle Park</CITY>
<STATE>
<NAME>North Carolina</NAME>
<USPS_ABBREVIATION>NC</USPS_ABBREVIATION>
</STATE>
<ZIP>
<5DIGIT>27709</5DIGIT>
<PLUS4>4006</PLUS4>
</ZIP>
</ADDRESS>
The essential characteristics of this representation are:
• Tags of the form <XXX> … </XXX>. The / indicates the end of the tag.
• Syntactic separation of annotation (tag pairs) from the content between them.
• The hierarchical structure: tags can be nested within one another, but a tag must end before any parent tag
containing it ends.
A parser is software capable of reading a markup language and resolving the annotation and content. Web
browsers are one class of examples. A parser capable of reading the markup language version of the
address would be able to arbitrarily combine or reorder the content elements.
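As a minimal sketch of what such a parser makes possible, the following Python fragment reads a simplified, well-formed variant of the address and recombines its elements in two different orders. The tag names are assumptions patterned on the example above (FIVE_DIGIT is hypothetical, chosen because XML element names cannot begin with a digit), and the standard-library xml.etree.ElementTree module stands in for a purpose-built parser.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical address record patterned on the example above.
ADDRESS_XML = """
<ADDRESS>
  <STREET_ADDRESS>
    <NUMBER>19</NUMBER>
    <STREET_NAME>T. W. Alexander</STREET_NAME>
    <STREET_SUFFIX>Drive</STREET_SUFFIX>
  </STREET_ADDRESS>
  <CITY>Research Triangle Park</CITY>
  <STATE>
    <USPS_ABBREVIATION>NC</USPS_ABBREVIATION>
  </STATE>
  <ZIP>
    <FIVE_DIGIT>27709</FIVE_DIGIT>
  </ZIP>
</ADDRESS>
"""


def text_at(root, path):
    """Return the stripped text content of the element found at path."""
    return root.findtext(path).strip()


root = ET.fromstring(ADDRESS_XML)

# Recombine the content elements in mailing-label order.
street_line = " ".join(
    text_at(root, f"STREET_ADDRESS/{tag}")
    for tag in ("NUMBER", "STREET_NAME", "STREET_SUFFIX")
)
city_line = "{}, {} {}".format(
    text_at(root, "CITY"),
    text_at(root, "STATE/USPS_ABBREVIATION"),
    text_at(root, "ZIP/FIVE_DIGIT"),
)

# Reorder the same elements, for example as a sort key (ZIP first).
sort_key = (text_at(root, "ZIP/FIVE_DIGIT"), text_at(root, "CITY"))

print(street_line)
print(city_line)
```

Because the content is addressed by tag rather than by position, the same source record can feed a mailing label, a sort key, or any other arrangement without being rewritten.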
By comparison, LaTeX uses the basic form
\begin{tag}Content\end{tag}
For example, text to be italicized appears in the source (markup) document as
\begin{italic}This sentence is in italics.\end{italic}
and in the parser-processed output document as
This sentence is in italics.
3 See http://en.wikipedia.org/wiki/XML.
4 See http://en.wikipedia.org/wiki/Unified_Modeling_Language.
“Focusing on time series and indicators, SDMX is the result of a joined effort from the Bank for International
Settlements, the European Central Bank (ECB), EUROSTAT, the International Monetary Fund (IMF), the
Organization for Economic Cooperation and Development (OECD), the United Nations (UN), and the World
Bank (WB) to create an XML specification to support the exchange of aggregate data and metadata. SDMX
provides three types of statistical metadata standards: standards for data formats, standards for metadata
and a registry-based architecture to implement these standards and to exchange data between systems.
One of the requirements of SDMX was the awareness of other metadata specifications such as the Data
Documentation Initiative (DDI). Any of the DDI metadata - which emphasizes archival metadata and micro-
data, rather than aggregate data - is exchangeable in an equivalent SDMX metadata format. This ensures
inter-operability of metadata across namespaces.”
The most accessible description is the user guide:
http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf.
Standards: SDMX Standards Version 2.0 Complete Package: http://sdmx.org/?page_id=16#package
“SDMX Technical Standards Version 2.0 provide the technical specifications for the exchange of data and
metadata based on a common information model. The scope of this effort is to define formats for the
exchange of aggregated statistical data and the metadata needed to understand how the data is structured.
The major focus is on data presented as time series, although cross-sectional XML formats are also
supported.
Version 2.0 Technical Standards are backward compatible with the earlier Version 1.0 efforts, which
focused on XML- and EDIFACT-syntax data formats. The latest work broadens the technical framework to
support wider coverage of metadata exchange as well as a more fully articulated architecture for data and
metadata exchange.
These specifications have been developed, reviewed, and adopted by SDMX. Steps will be taken to bring
this work forward within the context of the International Standards Organization (ISO), with a view to
updating ISO/Technical Specification 17369:2005 SDMX.”
Vocabulary: http://sdmx.org/wp-content/uploads/2009/01/04_sdmx_cog_annex_4_mcv_2009.pdf
5 See http://www.w3.org/RDF/
• Part 2: Classification
• Part 3: Registry metamodel and basic attributes
• Part 4: Formulation of data definition
• Part 5: Naming and identification principles
• Part 6: Registration”
Base URL: http://metadata-stds.org/11179/
Quoting from http://en.wikipedia.org/wiki/ISO/IEC_11179, “The ISO/IEC 11179 model is a result of two
principles of semantic theory, combined with basic principles of data modelling. The first principle from
semantic theory is the thesaurus type relation between wider and more narrow (or specific) concepts, i.e.
the wide concept "income" has a relation to the more narrow concept "net income". The second principle
from semantic theory is the relation between a concept and its representation, i.e. "buy" and "purchase"
are the same concept even if different terms are used.”
Another ISO standard, ISO 23081-1:2006 (http://www.iso.org/iso/catalogue_detail.htm?csnumber=40832),
“covers the principles that underpin and govern records management metadata. These principles apply
through time to:
• records and their metadata;
• all processes that affect them;
• any system in which they reside;
• any organization that is responsible for their management.”
This standard seems to be oriented primarily to database management.
Library: http://www.cabinetoffice.gov.uk/govtalk/schemasstandards/xmlschemas/schemalibrary.aspx
2. eCAF XML Schema is an electronic implementation of the Common Assessment Framework (CAF). This
schema defines the format for CAF data exchange into and out of National eCAF. It will form part of
the solution for transferring Common Assessment information between National and Local eCAF
systems.
Base URL: http://www.dcsf.gov.uk/everychildmatters/strategy/deliveringservices1/caf/cafframework/
3.2 Australia
Australian Government Recordkeeping Metadata Standard:
https://www.naa.gov.au/information-management/information-management-standards/agls-metadata-standard
Australian Institute of Health and Welfare - Metadata Online Registry (METeOR)
3.3 Canada
Records Management Metadata Standard
http://www.collectionscanada.gc.ca/government/products-services/007002-5001-e.html
This standard is based on the Dublin Core; see
http://www.collectionscanada.gc.ca/government/products-services/007002-5001.1-e.html
In particular, it contains seven Dublin Core descriptive metadata elements: Creator, Description, Identifier,
Language, Subject, Title and Type.
3.4 Eurostat
Base URL: http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/metadata
Concepts and Definitions Database (CODED):
http://ec.europa.eu/eurostat/ramon/nomenclatures/index.cfm?TargetUrl=LST_NOM_DTL_GLOSSARY&StrNom=CODED2&StrLanguageCode=EN&CFID=4475384&CFTOKEN=7863e757b805d521-CF12BB8F-D860-0887-7D24223B18D5EE56&jsessionid=1e51a0ba6ae9f1ab363b2c3517235a22665dTR
nationwide data publishing effort is known as the National Spatial Data Infrastructure (NSDI). The NSDI is a
physical, organizational, and virtual network designed to enable the development and sharing of this
nation's digital geographic information resources. FGDC activities are administered through the FGDC
Secretariat, hosted by the U.S. Geological Survey.”
Base URL: http://www.fgdc.gov/
Metadata: http://www.fgdc.gov/metadata/csdgm/
Content Standards: http://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/base-metadata/v2_0698.pdf
information in emergency situations, as well as support the day-to-day operations of agencies throughout
the nation.”
Base URL: US National Information Exchange Model NIEM
• Documentation must also include descriptions for each variable in the data set that includes the
variable name, description, type (categorical, numerical, date/time, etc.), format, entry restrictions
(e.g., categories, range), and missing value codes.
• Indicate changes made to previously released data and the “as of” date of the data file.”
URL: http://www.bts.gov/programs/statistical_policy_and_research/bts_statistical_standards_manual/html/chapter_06.html
Codebook: Some codebooks are available. An example is
http://www.bts.gov/programs/omnibus_surveys/targeted_survey/2002_national_transportation_availability_and_use_survey/public_use_data_files/excel/code_book.xls
Data Dictionary: Similar mentions as for metadata, suggesting that the two terms might be seen as
interchangeable.
Paradata: Search returned no results.
standards are current at the time data files are prepared and produce associated metadata for their files
that are in compliance with applicable standards.”
Distinctions between metadata and file structure are sometimes vague. “STANDARD 7-1-2: A file
description and record layout must be provided for each file. The file information/metadata [SIC] header
must include the following:
1. Title of the survey (survey name, part, and year as applicable);
2. Name(s) of each file;
3. Reference year for the data;
4. Version number and date of release;
5. Logical record length (in positional files) or number of variables on the file (delimited files);
6. Number of records per case or observation; and
7. Number of cases in the data file. For delimited files also include the delimiters (e.g., comma,
space).”
Similarly, “STANDARD 7-1-3: For each variable on the file, the file description must include the following:
1. Variable name;
2. Data type (alpha or numeric);
3. Record number (if multiple records per case);
4. Position within the record (beginning-end, or variable number if delimited) within the record, field
length, and variable label; and
5. The survey question wording and response categories.”
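A variable-level file description of this kind can also be held in machine-readable form, which makes gaps easy to detect. The following Python sketch is illustrative only: the class, its field names, and the example variable are assumptions patterned on the items quoted in STANDARD 7-1-3, not part of the NCES standards.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class VariableDescription:
    """One entry of a file description, patterned on the quoted STANDARD 7-1-3."""
    name: str
    data_type: str               # "alpha" or "numeric"
    position: str                # e.g. "1-3", or a variable number if delimited
    field_length: int
    label: str
    question_wording: str
    response_categories: tuple
    record_number: Optional[int] = None   # needed only with multiple records per case


def missing_items(description):
    """List the required items that are empty; record_number is optional."""
    return [
        f.name
        for f in fields(description)
        if f.name != "record_number"
        and getattr(description, f.name) in ("", (), None)
    ]


# Hypothetical variable whose question wording has not been supplied.
example = VariableDescription(
    name="AGE",
    data_type="numeric",
    position="1-3",
    field_length=3,
    label="Age of respondent",
    question_wording="",
    response_categories=("0-5", "6-10"),
)

print(missing_items(example))
```

A description held this way is still a file layout rather than metadata in the sense proposed in this report, but it shows how little is needed to make such a layout checkable by software.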
The National Assessment of Educational Progress (NAEP) appears to consider metadata to be the same as
file layout: http://nces.ed.gov/nationsreportcard/tdw/database/metadata.asp.
There is some discussion in the context of Department of Education Statewide Longitudinal Data Systems (SLDS)
programs, an example of which is http://nces.ed.gov/programs/slds/state.asp?stateabbr=KS
Codebook: Almost all NCES data products have an electronic codebook and associated documentation. An
example is the Early Childhood Longitudinal Study (ECLS-K)
http://nces.ed.gov/ecls/data/ECLSK_K8_Manual_part2.pdf
There are also codebooks for public use files; see, for example, http://nces.ed.gov/pubs2010/2010334.pdf.
Data Dictionary: In the Forum Guide, http://nces.ed.gov/pubs2009/metadata/ch2_dictionaries.asp: “A data
dictionary is an agreed-upon set of clearly and consistently defined elements, definitions, and attributes …
Although many items in a data dictionary can be classified as metadata, data dictionaries and metadata
systems are not identical. Data dictionaries generally contain only some of the metadata necessary for
understanding and navigating data elements and databases and, thus, contain only a subset of the
metadata found in a robust metadata system. Metadata systems, on the other hand, generally include the
entire range of items used for data system management and analysis, including features for sorting,
searching, organizing, and connecting data and metadata (see exhibit 2.3).”
The Data Systems Standards: http://nces.ed.gov/dataguidelines/ contains the National Education Data
Model (http://nces.ed.gov/forum/datamodel/). There are also references to 2 XML standards
Information about OEI’s GeoData Gateway suggests that geographical data are FGDC-compliant.
Codebook: Search returned no results.
Data Dictionary: Search returned no results.
Paradata: Search returned no results.
Repurposing
The second essential step, which amounts to recognition of reality, is commitment to markup language-
based metadata, and ultimately to markup language-based data. The two appendices to this report make
concrete on a small scale what this entails; below we propose what we believe to be a first “step in the
right direction.” Ultimately, this path leads to a major change in the way ICSP agencies disseminate their
data, in which system-specific (for instance, SAS or Excel) files or formats (for instance, CSV) would be
replaced by markup language-based files and parsers. 6 Nothing else will work in the long run. 7
Operationally, this step seems to entail a choice between DDI and SDMX (and by implication, between XML
and UML) as the basis for data and metadata. An important prerequisite to this decision would be for the
ICSP agencies to undertake, or ask an external body to undertake, a detailed comparison between DDI and
SDMX.
A sensible third step would be to prepare a demonstration (“proof-of-concept”) case, based on one or both
of DDI and SDMX, consisting of full markup-language-based metadata and data files for a modest-scale
survey conducted by one of the ICSP agencies, including development of parsers that would put the data
into common formats. The survey should be complex enough to permit full understanding of the issues, but
not so complex (and in particular, not longitudinal) that the effort is either too lengthy or too expensive. In
view of NCES’ role in SCOP, a cross-sectional survey such as the Schools and Staffing Survey (SASS) might
be a suitable choice.
Based on the outcome of the third step, a more difficult next step would follow: creation of an ICSP-wide
extensible metadata template for surveys, together with tools for common tasks. The administrative
framework might be analogous to that of the FGDC, with an (external?) Oversight Committee that controls
the core of the template and operates a mechanism for making additions. The goal would be to capture a
set of major components common to essentially all ICSP-agency surveys, including the following (this list is
meant to be illustrative, not prescriptive):
• “Facts:” Agency, legislation, purpose, dates, contractor, contractor number, cost
• Key Words: For identification of datasets
• Design: Population, frame, sampling (PSAs, …), sample size, design weights
• Data Collection Instrument and Mode(s)
• Data Collection: Time period(s), unit response rate (unit history)
• Variables
o Definition
o Format, Units, Flags (missing, edit, imputation)
o Item response rate
o Classification: frame, collected, obtained from administrative data, calculated from other
variables
o Access restrictions
o Documentation, in the form of URLs or other pointers, including information on adjustments
for nonresponse bias and methodologies for edit, imputation and SDL.
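To make the idea of a controlled core with agency-specific additions concrete, the following Python sketch builds an empty template and checks a metadata document against the core. It is a sketch under stated assumptions: the element names (SURVEY_METADATA, EXTENSIONS, and the core component names) are hypothetical labels derived from the illustrative list above, not a proposed standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical core components, taken from the illustrative list above.
CORE_COMPONENTS = (
    "FACTS",
    "KEY_WORDS",
    "DESIGN",
    "INSTRUMENT_AND_MODES",
    "DATA_COLLECTION",
    "VARIABLES",
)


def new_survey_template(extensions=()):
    """Return an empty template: core components plus agency-specific extensions."""
    root = ET.Element("SURVEY_METADATA")
    for tag in CORE_COMPONENTS:
        ET.SubElement(root, tag)
    extension_root = ET.SubElement(root, "EXTENSIONS")
    for tag in extensions:
        ET.SubElement(extension_root, tag)
    return root


def missing_core(root):
    """Return the core components that are absent from a metadata document."""
    return [tag for tag in CORE_COMPONENTS if root.find(tag) is None]


# An agency adds one (hypothetical) component without touching the core.
template = new_survey_template(extensions=("STATE_LINKAGE_KEYS",))
print(ET.tostring(template, encoding="unicode"))
```

Keeping additions under a separate element is one possible design; it lets the body that controls the core validate every document the same way regardless of what individual agencies add.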
6 Agencies could, of course, apply the parsers to create and disseminate files in currently “popular” formats.
7 Even today, many researchers cannot handle legacy file formats such as fixed-width fields in which padding with space characters
prevents numbers from being recognized as such. Even recognizing the problems can require hexadecimal dumps of files.
Undertaking this step involves, or at least leads to, collective agency commitment to prepare, or have
contractors prepare, core metadata for all future (and some past?) surveys.
The remaining step, which is more complex than the others, would be creation of a repository for metadata
for surveys.
Other issues need to be addressed. The most pressing of these was discussed in Section 2: creation of a
conceptual and operational basis for dealing with confidentiality and SDL in the context of markup
language-based metadata (and, for that matter, markup language-based data). A second issue is cross-
survey compatibility, which is embedded in the “extensible metadata template” step described above, and
is very important because it instantiates standardization across surveys. A third issue, which seems crucial,
is to ensure that the metadata template facilitates linking datasets. A final issue, which may be less pressing, is
provenance. Currently, agencies tightly monitor and control their datasets, even when collection is done by
a contractor. However, as more and more sharing and repurposing of data occurs, failure to document
provenance may have more serious consequences.
APPENDICES
Appendix A: An Illustrative Example
This example is designed to show in an informative but not overwhelming way how markup language-based
metadata works.
Consider the following data table.
Then, although more information could be included, a DDI-style 8 description of the data alone is given by
the following metadata object.
8 IMPORTANT: This specification is not DDI compliant, and is not, for clarity, put into the DDI schema.
<DATA DESCRIPTION>
<ATTRIBUTE>
<NAME>Name</NAME>
<TYPE>Text</TYPE>
<MISSING INDICATOR>Empty </MISSING INDICATOR>
<IMPUTATION >
<IMPUTATION PRESENT>No</IMPUTATION PRESENT>
</IMPUTATION >
</ATTRIBUTE>
<ATTRIBUTE>
<NAME>Gender</NAME>
<TYPE>Text</TYPE>
<ALLOWABLE VALUE>M</ALLOWABLE VALUE>
<ALLOWABLE VALUE>F</ALLOWABLE VALUE>
<MISSING INDICATOR>Empty </MISSING INDICATOR>
<IMPUTATION>
<IMPUTATION PRESENT>No</IMPUTATION PRESENT>
</IMPUTATION>
</ATTRIBUTE>
<ATTRIBUTE>
<NAME>Age</NAME>
<TYPE>Numerical</TYPE>
<UNITS>Years</UNITS>
<PRECISION>5-Year Ranges</PRECISION>
<ALLOWABLE VALUE>0-5</ALLOWABLE VALUE>
…
<ALLOWABLE VALUE>196-200</ALLOWABLE VALUE>
<MISSING INDICATOR>Empty </MISSING INDICATOR>
<IMPUTATION>
<IMPUTATION PRESENT>No</IMPUTATION PRESENT>
</IMPUTATION>
</ATTRIBUTE>
<ATTRIBUTE>
<NAME>Height</NAME>
<TYPE>Numerical</TYPE>
<UNITS>Centimeters</UNITS>
<PRECISION>Integer</PRECISION>
<ALLOWABLE VALUE>150</ALLOWABLE VALUE>
…
<ALLOWABLE VALUE>400</ALLOWABLE VALUE>
<MISSING INDICATOR>Empty </MISSING INDICATOR>
<IMPUTATION>
<IMPUTATION PRESENT>No</IMPUTATION PRESENT>
<IMPUTATION FLAG>N/A</IMPUTATION FLAG>
<IMPUTATION METHOD>N/A</IMPUTATION METHOD>
</IMPUTATION >
</ATTRIBUTE>
<ATTRIBUTE>
<NAME>Weight</NAME>
<TYPE>Numerical</TYPE>
<UNITS>Kilograms</UNITS>
<PRECISION>Integer</PRECISION>
<ALLOWABLE VALUE>25</ALLOWABLE VALUE>
…
<ALLOWABLE VALUE>150</ALLOWABLE VALUE>
<MISSING INDICATOR>Empty </MISSING INDICATOR>
<IMPUTATION>
<IMPUTATION PRESENT>Yes</IMPUTATION PRESENT>
<IMPUTATION FLAG>WeightImputeFlag=1</IMPUTATION FLAG>
<IMPUTATION METHOD>Weight = Height/3</IMPUTATION METHOD>
</IMPUTATION>
</ATTRIBUTE>
<ATTRIBUTE>
<NAME>WeightImputeFlag</NAME>
<TYPE>Binary</TYPE>
<ALLOWABLE VALUE>0</ALLOWABLE VALUE>
<ALLOWABLE VALUE>1</ALLOWABLE VALUE>
<MISSING INDICATOR>Empty </MISSING INDICATOR>
</ATTRIBUTE>
</DATA DESCRIPTION>
For each attribute, this description contains some of: its name, its type (text, numerical or binary), the units in
which it is reported, the numerical precision, allowable values, the form in which missing values are
reported, whether imputation is present and, if so, the manner in which it is indicated and the
method used (imputing weight as height divided by 3 is purely illustrative).
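A short Python sketch shows how software might read such a description into a lookup table. It rests on one loudly stated assumption: spaces in the tag names above are replaced by underscores (for example MISSING_INDICATOR), because XML element names cannot contain spaces. Only two of the attributes are reproduced, and the dictionary keys are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Assumption: underscores replace the spaces used in the tag names above,
# because XML element names cannot contain spaces.
DESCRIPTION_XML = """
<DATA_DESCRIPTION>
  <ATTRIBUTE>
    <NAME>Gender</NAME>
    <TYPE>Text</TYPE>
    <ALLOWABLE_VALUE>M</ALLOWABLE_VALUE>
    <ALLOWABLE_VALUE>F</ALLOWABLE_VALUE>
    <MISSING_INDICATOR>Empty</MISSING_INDICATOR>
  </ATTRIBUTE>
  <ATTRIBUTE>
    <NAME>Weight</NAME>
    <TYPE>Numerical</TYPE>
    <UNITS>Kilograms</UNITS>
    <IMPUTATION>
      <IMPUTATION_PRESENT>Yes</IMPUTATION_PRESENT>
      <IMPUTATION_FLAG>WeightImputeFlag=1</IMPUTATION_FLAG>
    </IMPUTATION>
  </ATTRIBUTE>
</DATA_DESCRIPTION>
"""


def load_description(xml_text):
    """Map each attribute name to a dictionary of its metadata."""
    described = {}
    for attribute in ET.fromstring(xml_text).findall("ATTRIBUTE"):
        described[attribute.findtext("NAME")] = {
            "type": attribute.findtext("TYPE"),
            "units": attribute.findtext("UNITS"),  # None when no units are given
            "allowable": [v.text for v in attribute.findall("ALLOWABLE_VALUE")],
            "imputed": attribute.findtext("IMPUTATION/IMPUTATION_PRESENT") == "Yes",
        }
    return described


metadata = load_description(DESCRIPTION_XML)
print(metadata["Weight"])
```

Items that an attribute does not declare simply come back empty, which matches the statement that each attribute carries only some of the possible descriptors.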
An example of an associated markup language version of the actual data file follows. With a suitable parser,
this information can be put into any format, including (tab- or comma-) delimited text, a SAS data object, an
R data object, and an Excel file. Using the metadata, the parser can also check whether the values in the file are valid.
<DATA FILE>
<RECORD>
<Name> Joe Smith</Name>
<Gender>M</Gender>
<Age>31-35</Age>
<Height>180</Height>
<Weight>73</Weight>
<WeightImputeFlag>0</WeightImputeFlag>
</RECORD>
<RECORD>
<Name> Bob Jones</Name>
<Gender>M</Gender>
<Age>26-30</Age>
<Height>195</Height>
<Weight>65</Weight>
<WeightImputeFlag>1</WeightImputeFlag>
</RECORD>
<RECORD>
<Name> Mary White</Name>
<Gender>F</Gender>
<Age>56-60</Age>
<Height>180</Height>
<Weight>73</Weight>
<WeightImputeFlag>0</WeightImputeFlag>
</RECORD>
</DATA FILE>
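The validity check mentioned above can be sketched in a few lines of Python. Several things here are assumptions made for illustration: DATA_FILE replaces the tag name "DATA FILE" (which is not a legal XML name), the allowable values are transcribed by hand from the metadata object rather than parsed from it, and the second record is a made-up invalid case that does not appear in the data file above.

```python
import xml.etree.ElementTree as ET

# Assumption: DATA_FILE stands in for "DATA FILE". The second record is a
# hypothetical invalid case added only to exercise the check.
DATA_XML = """
<DATA_FILE>
  <RECORD><Name>Joe Smith</Name><Gender>M</Gender><Height>180</Height></RECORD>
  <RECORD><Name>Pat Doe</Name><Gender>X</Gender><Height>95</Height></RECORD>
</DATA_FILE>
"""

# Allowable values transcribed by hand from the metadata object above:
# Gender is M or F; Height is an integer from 150 to 400 centimeters.
ALLOWABLE = {
    "Gender": {"M", "F"},
    "Height": {str(cm) for cm in range(150, 401)},
}


def invalid_values(xml_text, allowable):
    """Return (record index, attribute, value) for each value the metadata disallows."""
    problems = []
    for index, record in enumerate(ET.fromstring(xml_text).findall("RECORD")):
        for field in record:
            value = (field.text or "").strip()
            # An empty value is the missing indicator, so it is not an error.
            if value and field.tag in allowable and value not in allowable[field.tag]:
                problems.append((index, field.tag, value))
    return problems


print(invalid_values(DATA_XML, ALLOWABLE))
```

Attributes without a declared list of allowable values, such as Name, are passed through unchecked in this sketch.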
Note that there is neither a logical nor an implementation-driven need for the attributes of a record to be in
any prescribed order. The following file is completely equivalent to the one above, and proper parsers would
have no problem dealing with it.
<DATA FILE>
<RECORD>
<Age>31-35</Age>
<Name> Joe Smith</Name>
<WeightImputeFlag>0</WeightImputeFlag>
<Gender>M</Gender>
<Weight>73</Weight>
<Height>180</Height>
</RECORD>
<RECORD>
<Height>195</Height>
<Gender>M</Gender>
<Age>26-30</Age>
<WeightImputeFlag>1</WeightImputeFlag>
<Weight>65</Weight>
<Name> Bob Jones</Name>
</RECORD>
<RECORD>
<Height>180</Height>
<Weight>73</Weight>
<Gender>F</Gender>
<Age>56-60</Age>
<WeightImputeFlag>0</WeightImputeFlag>
<Name> Mary White</Name>
</RECORD>
</DATA FILE>
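The equivalence of the two orderings can be checked mechanically. In this minimal Python sketch, one record is written in two different element orders (abridged to three attributes) and read into a dictionary keyed by tag; the function name is an illustrative choice.

```python
import xml.etree.ElementTree as ET

# The same record with its attributes in two different orders (abridged).
ORDER_A = "<RECORD><Name>Joe Smith</Name><Gender>M</Gender><Age>31-35</Age></RECORD>"
ORDER_B = "<RECORD><Age>31-35</Age><Name>Joe Smith</Name><Gender>M</Gender></RECORD>"


def as_mapping(record_xml):
    """Read one record into a tag -> value dictionary, ignoring element order."""
    return {field.tag: field.text.strip() for field in ET.fromstring(record_xml)}


print(as_mapping(ORDER_A) == as_mapping(ORDER_B))
```

Once values are looked up by tag, position within the record carries no information, which is exactly what separates this representation from a positional file layout.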
Building, for example, a CSV data file would require the following markup language description of a physical
data product.
<FORM>Delimited</FORM>
<RECORD>
<FIELD>
<NAME>ID</NAME>
<ATTRIBUTE>Name</ATTRIBUTE>
<ORDER>1</ORDER>
<SEPARATOR>,</SEPARATOR>
</FIELD>
<FIELD>
<NAME>Sex</NAME>
<ATTRIBUTE>Gender</ATTRIBUTE>
<ORDER>2</ORDER>
<SEPARATOR>,</SEPARATOR>
</FIELD>
<FIELD>
<NAME>Age in Years</NAME>
<ATTRIBUTE>Age</ATTRIBUTE>
<ORDER>3</ORDER>
<SEPARATOR>,</SEPARATOR>
</FIELD>
<FIELD>
<NAME>Height (cm)</NAME>
<ATTRIBUTE>Height</ATTRIBUTE>
<ORDER>4</ORDER>
<SEPARATOR>,</SEPARATOR>
</FIELD>
<FIELD>
<NAME>Weight (kg)</NAME>
<ATTRIBUTE>Weight</ATTRIBUTE>
<ORDER>5</ORDER>
<SEPARATOR>,</SEPARATOR>
</FIELD>
<FIELD>
<NAME>Impute Flag for Weight</NAME>
<ATTRIBUTE>WeightImputeFlag</ATTRIBUTE>
<ORDER>6</ORDER>
<SEPARATOR>,</SEPARATOR>
</FIELD>
<EOR>Carriage return/Line feed</EOR>
</RECORD>
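A parser can use a description like this one to write the physical file. The following Python sketch is a simplified illustration: the enclosing LAYOUT element is hypothetical, only two of the six fields are included, the records are supplied as plain dictionaries, and the end-of-record marker is hard-coded as carriage return/line feed rather than read from the EOR element.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Abridged layout; the LAYOUT wrapper element is a hypothetical addition.
# The fields are deliberately listed out of order to show that ORDER governs.
LAYOUT_XML = """
<LAYOUT>
  <FORM>Delimited</FORM>
  <RECORD>
    <FIELD><NAME>Sex</NAME><ATTRIBUTE>Gender</ATTRIBUTE><ORDER>2</ORDER><SEPARATOR>,</SEPARATOR></FIELD>
    <FIELD><NAME>ID</NAME><ATTRIBUTE>Name</ATTRIBUTE><ORDER>1</ORDER><SEPARATOR>,</SEPARATOR></FIELD>
  </RECORD>
</LAYOUT>
"""

records = [
    {"Name": "Joe Smith", "Gender": "M", "Age": "31-35"},
    {"Name": "Mary White", "Gender": "F", "Age": "56-60"},
]


def write_delimited(layout_xml, rows):
    """Render rows as delimited text in the column order the layout prescribes."""
    layout_fields = sorted(
        ET.fromstring(layout_xml).findall("RECORD/FIELD"),
        key=lambda field: int(field.findtext("ORDER")),
    )
    separator = layout_fields[0].findtext("SEPARATOR")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=separator, lineterminator="\r\n")
    # Header row uses the column names; data rows look up the mapped attribute.
    writer.writerow([field.findtext("NAME") for field in layout_fields])
    for row in rows:
        writer.writerow([row[field.findtext("ATTRIBUTE")] for field in layout_fields])
    return buffer.getvalue()


print(write_delimited(LAYOUT_XML, records))
```

Attributes that the layout does not mention (Age in this abridged case) are simply left out of the product, so several physical products can be generated from one data file by swapping the layout.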
Appendix B: WSDL For Data Swapping
This (real) example contains the XML-based WSDL (Web Services Description Language) file for a Web
services implementation of data swapping created by NISS. Briefly, this service allows users to perform data
swapping using remote software; see “NISSWebSwap: A Web Service for data swapping,” by A. Sanil, S.
Gomatam, A. F. Karr and S. Liu, Journal of Statistical Software 8(7) (2003) for a complete description. The
markup structure should be apparent. It allows communication of the file name and parameters for the
swapping.
<?xml version="1.0" encoding="UTF-8"?>
<definitions name="Swap_dataService" targetNamespace="http://WebSwap_swap.org/wsdl"
xmlns:tns="http://WebSwap_swap.org/wsdl" xmlns="http://schemas.xmlsoap.org/wsdl/"
xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:ns2="http://WebSwap_swap.org/types"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<types>
<schema targetNamespace="http://WebSwap_swap.org/types"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tns="http://WebSwap_swap.org/types" xmlns:soap-enc="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
xmlns="http://www.w3.org/2001/XMLSchema">
<complexType name="SwapData">
<sequence>
<element name="numFields" type="int"/>
<element name="outputFile" type="string"/>
<element name="numRecords" type="int"/>
<element name="riskCutoff" type="double"/>
<element name="data" type="tns:ArrayOfArrayOfstring"/>
<element name="dataFile" type="string"/>
<element name="constraints" type="base64Binary"/>
<element name="log" type="tns:ArrayOfstring"/>
<element name="swapRate" type="double"/>
<element name="riskFraction" type="double"/>
<element name="logFile" type="string"/>
<element name="csvType" type="string"/></sequence>
</complexType>
<complexType name="ArrayOfArrayOfstring">
<complexContent>
<restriction base="soap-enc:Array">
<attribute ref="soap-enc:arrayType" wsdl:arrayType="tns:ArrayOfstring[]"/>
</restriction>
</complexContent>
</complexType>
<complexType name="ArrayOfstring">
<complexContent>
<restriction base="soap-enc:Array">
<attribute ref="soap-enc:arrayType" wsdl:arrayType="string[]"/>
</restriction>
</complexContent>
</complexType>
</schema>
</types>
<message name="doSwap">
<part name="SwapData_1" type="ns2:SwapData"/>
</message>
<message name="doSwapResponse">
<part name="result" type="ns2:SwapData"/>
</message>
<portType name="SwapIF">
<operation name="doSwap">
<input message="tns:doSwap"/>
<output message="tns:doSwapResponse"/>
</operation>
</portType>
<soap:operation soapAction=""/>
</operation>
<service name="Swap_data">
<port name="SwapIFPort" binding="tns:SwapIFBinding">
<soap:address location="http://www.niss.web-services:8080/WebSwap/SwapIF"/>
</port>
</service>
</definitions>
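Because the service description is itself markup, the parameters it communicates can be listed by software. The following Python sketch parses an abridged copy of the SwapData type (four of its twelve elements) and returns each parameter with its declared type. The helper name is an illustrative choice, and the sketch reads only the schema portion, not a complete WSDL document.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# Abridged copy of the SwapData type from the WSDL above (four of its elements).
SCHEMA_XML = f"""
<schema xmlns="{XSD}">
  <complexType name="SwapData">
    <sequence>
      <element name="numFields" type="int"/>
      <element name="dataFile" type="string"/>
      <element name="swapRate" type="double"/>
      <element name="riskCutoff" type="double"/>
    </sequence>
  </complexType>
</schema>
"""


def parameters(schema_xml, type_name):
    """Return {element name: declared type} for one named complexType."""
    namespaces = {"xsd": XSD}
    root = ET.fromstring(schema_xml)
    for complex_type in root.findall("xsd:complexType", namespaces):
        if complex_type.get("name") == type_name:
            return {
                element.get("name"): element.get("type")
                for element in complex_type.findall("xsd:sequence/xsd:element", namespaces)
            }
    return {}


print(parameters(SCHEMA_XML, "SwapData"))
```

The namespace prefix must be supplied explicitly because the schema declares XML Schema as its default namespace; a search for a bare "complexType" would find nothing.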
Appendix C: Author