Organizing Research Data
Peter Sestoft
IT University of Copenhagen
Abstract
Research relies on ever larger amounts of data from experiments, automated production equipment,
questionnaires, time series such as weather records, and so on. A major task in science is to combine, process
and analyse such data to obtain evidence of patterns and correlations.
Most research data are in digital form, which in principle ensures easy processing and analysis, easy long-term
preservation, and easy reuse in future research, perhaps in entirely unanticipated ways. However, in practice,
obstacles such as incompatible or undocumented data formats, poor data quality and lack of familiarity with
current technology prevent researchers from making full use of available data.
This paper argues that relational databases are excellent tools for veterinary research and animal production;
provides a small example to introduce basic database concepts; and points out some concerns that must be
addressed when organizing data for research purposes.
© 2011 Sestoft; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Sestoft Acta Veterinaria Scandinavica 2011, 53(Suppl 1):S2 Page 2 of 7
http://www.actavetscand.com/content/53/S1/S2
Figure 1 Flat list of farms, cows and milking events
The heading lists the attributes or columns (id, address and postcode) of the table. Each line below it is called a record or row of the table. The unique farm id is a key; a given key must appear at most once in the table. Those database keys are the reason everything (people, cows, supermarket goods) has a number in modern society.
In the Cow table in Figure 3, each record describes a cow: the cow id is the key in the table, the farmId says which farm the cow belongs to, and the birth attribute is the cow’s birthdate. A cow’s farmId attribute is intended to refer to some farm’s id, which is the key in the Farm table; hence the farmId in the Cow table is called a foreign key.
In the Milk table in Figure 4, each record describes a milking event: the cow id and the date-and-time (the “when” attribute) together constitute the key of the table, and the remaining attributes give the amount of milk obtained and possibly the cell count.
Missing observations, such as those in the cellCount column of the Milk table, are said to be null. We may require, and the database system may enforce, that all values must be non-null, except possibly in the cellCount column. This requirement would not work in the original flat list in Figure 1, because it would prevent us from creating a farm record before the farm has a cow, which is illogical. Furthermore, the splitting of the flat list into separate Farm, Cow and Milk tables means that there is no redundancy and hence less risk of inconsistency: the address of a farm is stated only once per farm, and the farm to which a cow belongs is stated only once per cow.

Queries in relational databases
The beneficial splitting of the flat list of farm, cow and milk data into three separate tables introduces a challenge, though: How does one combine the tables to obtain useful information, such as the total milk production in each postcode? In a relational database this is done using queries, expressed in the language SQL, or Structured Query Language. All modern database systems, including the open source systems MySQL and PostgreSQL and the commercial systems DB2, Oracle, Microsoft SQL Server and Microsoft Access, understand some variant of SQL and can execute queries involving millions of records in a few seconds. Although the complete SQL language is rather complex, an introduction can be found in any database book, such as [2]. Here we shall just consider some examples of SQL queries, from very simple to moderately complex.
The simplest possible query lists all cows. Figure 5 shows an SQL query that extracts all columns (denoted by the asterisk *) and all rows of the Cow table; the result, shown in italics to the right, is a “table” very similar to the Cow table itself.
To see only the cow’s id and its birth date, we may specify the id and birth columns after SELECT as shown in Figure 6; the result is a table that has only two of the Cow table’s columns, but all its rows.
To see only the cows belonging to farm number 12160, we use a WHERE-clause as in Figure 7; the result is a table that has all of the Cow table’s columns, but only those of its rows where the cow’s farmId equals 12160.
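
The table design described earlier in this section (keys, a foreign key from Cow to Farm, and null allowed only for the cell count) could be declared roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module, not the paper's own declarations: the column types, the Milk column names cowId and amount, the helper names, and any sample values used with it are assumptions made for illustration.

```python
import sqlite3

# Hypothetical declarations of the Farm, Cow and Milk tables.
# Column types and the names cowId/amount are assumptions.
SCHEMA = """
CREATE TABLE Farm (
    id       INTEGER PRIMARY KEY,                 -- key: at most one row per farm id
    address  TEXT    NOT NULL,
    postcode INTEGER NOT NULL
);
CREATE TABLE Cow (
    id     INTEGER PRIMARY KEY,                   -- key
    farmId INTEGER NOT NULL REFERENCES Farm(id),  -- foreign key to Farm
    birth  TEXT    NOT NULL                       -- ISO date string
);
CREATE TABLE Milk (
    cowId     INTEGER NOT NULL REFERENCES Cow(id),
    "when"    TEXT    NOT NULL,                   -- quoted: WHEN is an SQL keyword
    amount    REAL    NOT NULL,
    cellCount INTEGER,                            -- the only column that may be null
    PRIMARY KEY (cowId, "when")                   -- composite key
);
"""

def open_database():
    """Create an empty in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With these declarations a farm can be recorded before it has any cows, while a second farm with the same id, a cow pointing at a non-existent farm, or a milking record without an amount is rejected by the database system itself.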
SELECT *        id         | farmId | birth
FROM Cow        -----------+--------+-----------
                1216000002 | 12160  | 2004-06-15
                3417400019 | 12160  | 2006-04-01
                3417400021 | 12169  | 2007-12-19
Figure 5 Query to get all columns and rows of the Cow table
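
The three simple queries described in the text (all columns and rows, selected columns, selected rows) can be tried out with the sketch below, which loads the three Cow rows of Figure 5 into an in-memory SQLite database. The query strings for Figures 6 and 7 are reconstructed from the prose rather than copied from the figures, and the helper names make_cow_table and run are hypothetical.

```python
import sqlite3

def make_cow_table():
    """Build an in-memory Cow table holding the rows shown in Figure 5."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Cow (id INTEGER PRIMARY KEY, farmId INTEGER, birth TEXT)")
    conn.executemany(
        "INSERT INTO Cow VALUES (?, ?, ?)",
        [
            (1216000002, 12160, "2004-06-15"),
            (3417400019, 12160, "2006-04-01"),
            (3417400021, 12169, "2007-12-19"),
        ],
    )
    return conn

# All columns, all rows (Figure 5).
ALL_COWS = "SELECT * FROM Cow"
# Two columns, all rows (as described for Figure 6).
ID_AND_BIRTH = "SELECT id, birth FROM Cow"
# All columns, only the rows of one farm (as described for Figure 7);
# the farm number is passed as a parameter instead of being written inline.
COWS_ON_FARM = "SELECT * FROM Cow WHERE farmId = ?"

def run(conn, sql, params=()):
    """Execute a query and return its result rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()

if __name__ == "__main__":
    conn = make_cow_table()
    for row in run(conn, COWS_ON_FARM, (12160,)):
        print(row)
```

Without an ORDER BY clause SQL does not promise any particular row order, so code that depends on the order should request it explicitly.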
Figure 6 Query to get some columns and all rows of the Cow table
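
The question raised at the start of this section, the total milk production in each postcode, needs all three tables at once. One possible formulation joins Farm, Cow and Milk and then groups by postcode; it is a sketch and not necessarily the query used in the paper. The Farm rows, the Milk rows and the Milk column names cowId and amount are invented for illustration; only the Cow rows come from Figure 5.

```python
import sqlite3

def make_database():
    """Build a small in-memory farm-cow-milk database with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Farm (id INTEGER PRIMARY KEY, address TEXT, postcode INTEGER);
        CREATE TABLE Cow  (id INTEGER PRIMARY KEY,
                           farmId INTEGER REFERENCES Farm(id), birth TEXT);
        CREATE TABLE Milk (cowId INTEGER REFERENCES Cow(id), "when" TEXT,
                           amount REAL, cellCount INTEGER,
                           PRIMARY KEY (cowId, "when"));
        -- Farm and Milk rows are hypothetical sample data.
        INSERT INTO Farm VALUES (12160, 'Hypothetical Road 1', 5591),
                                (12169, 'Hypothetical Road 2', 5592);
        INSERT INTO Cow VALUES (1216000002, 12160, '2004-06-15'),
                               (3417400019, 12160, '2006-04-01'),
                               (3417400021, 12169, '2007-12-19');
        INSERT INTO Milk VALUES (1216000002, '2010-03-01 06:00', 10.0, 200),
                                (1216000002, '2010-03-01 18:00', 12.5, NULL),
                                (3417400019, '2010-03-01 06:00',  9.5, 150),
                                (3417400021, '2010-03-01 06:00', 11.0, NULL);
    """)
    return conn

# Follow the foreign keys Milk -> Cow -> Farm, then add up amounts per postcode.
TOTAL_PER_POSTCODE = """
SELECT Farm.postcode, SUM(Milk.amount) AS total
FROM Farm
JOIN Cow  ON Cow.farmId = Farm.id
JOIN Milk ON Milk.cowId = Cow.id
GROUP BY Farm.postcode
ORDER BY Farm.postcode
"""

def total_milk_per_postcode(conn):
    """Return a list of (postcode, total amount) tuples."""
    return conn.execute(TOTAL_PER_POSTCODE).fetchall()

if __name__ == "__main__":
    for postcode, total in total_milk_per_postcode(make_database()):
        print(postcode, total)
```

The join conditions follow the foreign keys exactly, which is why stating each fact only once in the split tables does not make the combined view harder to obtain.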
2010, or what is the total milk production per postcode in each of the months of 2010. Unfortunately, the SQL queries become a good deal more complex. The theory of temporal databases is well-developed; a good introduction is provided by [3].
Moreover, much data is spatial: a farm or field is located at a particular place, which may be described by UTM coordinates or longitude and latitude. Knowing where objects are when allows for queries such as “at what times was this cow near Gelsted” or “find all pairs of cows that were within 8 km of each other at some time”, as well as epidemiological analyses and easy visualization.

Terminology and ontology
Here we shall consider a problem that is often overlooked in database books: the design of categories or “codes”. Assume that we want to extend our farm-cow-milk database with veterinarians’ observations of various diseases of cows. For this purpose we might introduce two more tables. Table Clinical in Figure 13 contains clinical observations about a given cow, made by a given veterinarian at a given time, recording a clinical observation such as joint infection by a code, here 38.
Another table, called ClinicalTerm and shown in Figure 14, associates a description with each clinical code.
However, there are some potential problems with the clinical term codes in Figure 14. First of all, codes 81 and 140 appear to have the same meaning, so there is a risk that two people may use different codes for the same observation, which may later produce misleading results (e.g. statistics) when queries are made to the database. Second, no distinction is made between findings (e.g. 88 will not drink), diagnoses (e.g. 11 udder infection) and procedures (e.g. 80 hoof trimming); whether or not this leads to problems depends on the discipline and consistency with which veterinarians register clinical observations. Finally, some codes correspond to subcategories or specializations of others; for instance 11 udder infection and 38 joint infection are both special cases of 42 infection; should one then always use the most specific code available (e.g. 11 or 38) or alternatively always register a more general code (e.g. 42) along with more specific ones (e.g. 11 or 38)? In the former case, will somebody who queries the Clinical table in Figure 13 for all cases of infection remember to also query for the more specific ones (e.g. 11 and 38)? This example illustrates some problems with designing category codes for use in databases, and in classifying observations in general.
A suitable system of “codes”, including a consideration about how “codes” relate to each other, is often called a terminology, a controlled vocabulary, or an ontology.
An ontology reflects the domain that it describes, such as the domain of animal disease symptoms discussed above. One must first decide what parts of reality to model (for instance, this cow has an infection), what parts of reality to ignore (such as, where is the infection located). Similarly, in a database of clinical observations one must make clear whether one records symptoms (e.g. diarrhea) or diagnosis (e.g. enteritis) or cause (Salmonella) or all of these. One must also decide how to relate the various parts of reality to each other. For instance, pneumonia is a special case of infection. Moreover, it affects the lungs, which is part of the anatomy. A good domain model should be able to express both forms of hierarchical relationship.
It takes domain experts, technological understanding, and good taste to arrive at adequate domain models that are not too complex.
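
One way to keep a query for "all cases of infection" from silently missing the more specific codes is to store the specialization relationship in the database and let the query follow it. The sketch below assumes a hypothetical parent column in ClinicalTerm, which is not part of the table described in the paper, and uses a recursive SQL query to collect a code together with all of its specializations. The Clinical column names and rows are invented; the codes 11, 38, 42 and 80 and their descriptions are those mentioned in the text.

```python
import sqlite3

def make_database():
    """Build in-memory ClinicalTerm and Clinical tables with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- 'parent' is a hypothetical column naming the more general code.
        CREATE TABLE ClinicalTerm (
            code        INTEGER PRIMARY KEY,
            description TEXT NOT NULL,
            parent      INTEGER REFERENCES ClinicalTerm(code)
        );
        INSERT INTO ClinicalTerm VALUES (42, 'infection',       NULL),
                                        (11, 'udder infection', 42),
                                        (38, 'joint infection', 42),
                                        (80, 'hoof trimming',   NULL);
        -- Column names and observations are invented for illustration.
        CREATE TABLE Clinical (
            cowId  INTEGER,
            vetId  INTEGER,
            "when" TEXT,
            code   INTEGER REFERENCES ClinicalTerm(code)
        );
        INSERT INTO Clinical VALUES (1216000002, 1, '2010-02-03', 38),
                                    (3417400019, 1, '2010-02-04', 11),
                                    (3417400021, 2, '2010-02-05', 42),
                                    (3417400021, 2, '2010-02-06', 80);
    """)
    return conn

# Naive query: finds only observations registered with exactly this code.
EXACT_CODE = 'SELECT cowId, code FROM Clinical WHERE code = ? ORDER BY "when"'

# Hierarchical query: Sub collects the given code and, step by step,
# every code whose parent is already in Sub.
CODE_AND_SPECIALIZATIONS = """
WITH RECURSIVE Sub(code) AS (
    SELECT ?
    UNION
    SELECT t.code FROM ClinicalTerm t JOIN Sub ON t.parent = Sub.code
)
SELECT Clinical.cowId, Clinical.code
FROM Clinical JOIN Sub ON Clinical.code = Sub.code
ORDER BY Clinical."when"
"""

def cases(conn, sql, code):
    """Return (cowId, code) tuples for the observations matched by the query."""
    return conn.execute(sql, (code,)).fetchall()

if __name__ == "__main__":
    conn = make_database()
    print(cases(conn, EXACT_CODE, 42))
    print(cases(conn, CODE_AND_SPECIALIZATIONS, 42))
```

On the sample rows the naive query for code 42 matches one observation, while the hierarchical query also picks up the observations coded 11 and 38. This covers only the "special case of" relationship; relating a disease to anatomy would need a second kind of link.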
Figure 7 Query to get all columns and some rows of the Cow table
An example of a well-designed (but complex) domain model is SNOMED/CT, which stands for Systematized Nomenclature of Medicine Clinical Terms. This is a set of standard terms for use in hospitals, electronic patient records, and so on [4]. There are three components of SNOMED/CT:
• Concepts, used to describe disorders (e.g. 128139000 Inflammatory disorder and 233604007 Pneumonia), procedures (e.g. 11466000 Cesarean section), findings (e.g. 62315008 Diarrhea and 55184003 Infectious enteritis), causative organisms (e.g. 110378009 Salmonella enterica), anatomy, and more.
• Descriptions, used primarily for synonyms, e.g. 497137013 Infective enteritis (synonym for concept 55184003 Infectious enteritis).
• Relationships, used to describe how concepts relate to each other, e.g. Pneumonia IS_A Inflammatory disorder and Pneumonia FINDING SITE Lung structure.
Note how each concept and each description has a unique numeric key. Also note how relationships can be used to relate one concept (pneumonia) both to a disease category and to anatomy, that is, to place the concept in different hierarchies.
SNOMED/CT is maintained by an international organization whose member countries include the United States, United Kingdom, Germany, The Netherlands, Spain, Sweden, Denmark, and many more. In Denmark and most other places, electronic patient records are still based on older and less powerful classification systems, but SNOMED/CT is expected to replace those in the future [5].
Full SNOMED/CT is very complex, with 311,000 concepts, 800,000 descriptions and 1,360,000 relations as of April 2010. A smaller subset for veterinary use is being maintained by Virginia Terminology Services [6].

Data stewardship, standards, and sharing
Sometimes a whole discipline manages to agree on an ontology, as in the case of SNOMED/CT. Such standardization requires considerable effort, but also offers huge synergistic benefits, especially when databases are made available to all interested parties in a standard format. For instance, within bioinformatics this has led to tremendous advances in research on animals, microorganisms, plants and medicine. Important steps were the development in the 1980s of standard formats [7] that enable free interchange of DNA sequence data between US, Japanese and European institutions, and the requirement that any sequence data used as basis for a scientific publication must be published, free of any restrictions on further research, in the joint international databases [8].
While the development of standard formats and ontologies is important and enables much better utilization of research investments, it looks more like infrastructure development than research, which means that it appears less exciting and that it may be difficult to obtain funding for it. As a consequence, it may be more tempting to propose new organizations, web sites and portals than to lay the foundation for them, which caused a Nature editorial to admonish that “Initiatives for digital research infrastructure should focus more on making standardized data openly available, and less on developing new portals” [9].
Thanks to lab automation, sensor development and computerized instruments, research produces new data on a scale never seen before. Yet in many cases the required efforts to document, check and preserve all these data lag behind researchers’ ability to generate the data in the first place [10].
This problem is the subject of a report from the US National Academies [11] on integrity, accessibility and stewardship of digital data, encouraged and sponsored in part by leading journals [12,13]. The report’s three
Acknowledgements
This article has been published as part of Acta Veterinaria Scandinavica
Volume 53 Supplement 1, 2011: Databases in veterinary medicine: validation,
harmonisation and application. Proceedings of the 24th Symposium of the
Nordic Committee for Veterinary Scientific Cooperation (NKVet). The full
contents of the supplement are available online at http://www.actavetscand.com/supplements/53/S1.
Competing interests
The authors declare that they have no competing interests.
References
1. Codd EF: A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 1970, 13(6):377-387, doi:10.1145/362384.362685.
2. Churcher C: Beginning Database Design. Apress 2007.
3. Snodgrass R: Developing Time-Oriented Database Applications in SQL.
Morgan Kaufmann; 1999, Full text at http://www.cs.arizona.edu/people/rts/tdbbook.pdf.
4. SNOMED Clinical Terms User Guide. International Health Terminology
Standards Development Organisation; 2009, At http://www.ihtsdo.org/snomed-ct/.
5. Lippert S: IT University of Copenhagen, personal communication. 2010.
6. Wilcke R: Veterinary adaptation of SNOMED-CT. Presentation at Talbot
Symposium, AVMA Convention 2009, At http://snomed.vetmed.vt.edu/.
7. International Nucleotide Sequence Database Collaboration: The DDBJ/
EMBL/GenBank Feature Table, version 8.3. 2010, At http://www.insdc.org/.
8. Brunak S, Danchin A, Hattori M, Nakamura H, Shinozaki K, Matise T,
Preusset D: Nucleotide Sequence Database Policies. Science 2002,
298:1333.
9. Data for the masses. Editorial. Nature 2009, 457:129-129, doi:10.1038/457129a.
10. Gray J, Liu DT, Nieto-Santisteban M, Szalay AS: Scientific Data Management
in the Coming Decade. Microsoft Research Technical Report MSR-TR-2005-10
2005, At http://arxiv.org/pdf/cs/0502008.
11. National Academy of Sciences: Ensuring the integrity, accessibility, and
stewardship of research data in the digital age. National Academies Press
2009.
12. Information overload. Editorial. Nature 2009, 460:551-551, doi:10.1038/460551a.
13. Kleppner D, Sharp PA: Research Data in the Digital Age. Editorial. Science
2009, 325:368-368, doi:10.1126/science.1178927.
doi:10.1186/1751-0147-53-S1-S2
Cite this article as: Sestoft: Organizing research data. Acta Veterinaria
Scandinavica 2011 53(Suppl 1):S2.