Fundamentals of Clinical Data Science

Pieter Kubben · Michel Dumontier
Andre Dekker
Editors

Fundamentals of Clinical
Data Science
Editors

Pieter Kubben
Department of Neurosurgery
Maastricht University
Maastricht, Limburg, The Netherlands

Michel Dumontier
Institute of Data Science
Maastricht University
Maastricht, Limburg, The Netherlands

Andre Dekker
Maastro Clinic
Maastricht, Limburg, The Netherlands

This book is an open access publication.


ISBN 978-3-319-99712-4    ISBN 978-3-319-99713-1 (eBook)
https://doi.org/10.1007/978-3-319-99713-1

Library of Congress Control Number: 2018963226

© The Editor(s) (if applicable) and The Author(s) 2019


Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit
to the original author(s) and the source, provide a link to the Creative Commons license and indicate if
changes were made.
The images or other third party material in this book are included in the book’s Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the book’s
Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims
in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Introduction “Fundamentals of Clinical
Data Science”

In the era of eHealth and personalized medicine, “big data” and “machine learning” are increasingly becoming part of the medical world. Algorithms are capable of supporting diagnostic and therapeutic processes and offer added value for both healthcare professionals and patients. The field of big data, machine learning, deep learning, and algorithm development and validation is often referred to as “data science,” and “data scientist” was mentioned in Harvard Business Review as “the sexiest job of the 21st century” (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). A commonly used visual representation of the field is Drew Conway’s Venn diagram (Fig. 1), which describes data science as a mix of content expertise, methodological knowledge, and IT skills.

Fig. 1 Data science Venn diagram by Drew Conway. (Reproduced with permission)

Unfortunately, most healthcare professionals still consider the field of clinical data science as highly technical and something “for the IT whizzkids.” That leaves
many interesting and valuable opportunities unexplored and could even contribute
to serious flaws in developed algorithms. Chen and Asch described machine learn-
ing’s “peak of inflated expectations” and suggest that “we can soften a subsequent
crash into a ‘trough of disillusionment’ by fostering a stronger appreciation of the
technology’s capabilities and limitations” (Chen and Asch 2017). They conclude
that “combining machine-learning software with the best human clinician ‘hard-
ware’ will permit delivery of care that outperforms what either can do alone.” We
could not agree more.
This book is for you, the healthcare professional and “best human clinician hard-
ware” who would like to embrace the field of clinical data science but who is still
looking for a resource that explains the topic in nonengineering terminology. This
book’s promise is “no math, no code.” It contains three sections that help you under-
stand the transformation of data to model and to applications. It should be sufficient
to give you a decent grasp on the topic for understanding and a solid foundation if
you are to continue with active mastery of the field by taking programming courses
online or in a classroom setting. Either way, we want you to get aboard.
Our thanks go to the NFU Citrienfonds, which made it financially possible to publish this e-book as open access (the Citrienfonds of the NFU and ZonMw helps to develop sustainable solutions in Dutch healthcare); to all authors for their valuable time and contributions; to Studio Piranha for the website; and to Springer for their help in the publishing process.

Pieter Kubben, Michel Dumontier, and André Dekker


www.clinicaldatasciencebook.com
Reference
Chen JH, Asch SM. Machine learning and prediction in medicine – beyond the peak of inflated expectations. N Engl J Med. 2017;376(26):2507–9. https://doi.org/10.1056/NEJMp1702071.
Contents

Part I Data Collection

1 Data Sources  3
Pieter Kubben

2 Data at Scale  11
Alberto Traverso, Frank J. W. M. Dankers, Leonard Wee, and Sander M. J. van Kuijk

3 Standards in Healthcare Data  19
Stefan Schulz, Robert Stegwee, and Catherine Chronaki

4 Research Data Stewardship for Healthcare Professionals  37
Paula Jansen, Linda van den Berg, Petra van Overveld, and Jan-Willem Boiten

5 The EU’s General Data Protection Regulation (GDPR) in a Research Context  55
Christopher F. Mondschein and Cosimo Monda

Part II From Data to Model

6 Preparing Data for Predictive Modelling  75
Sander M. J. van Kuijk, Frank J. W. M. Dankers, Alberto Traverso, and Leonard Wee

7 Extracting Features from Time Series  85
Christian Herff and Dean J. Krusienski

8 Prediction Modeling Methodology  101
Frank J. W. M. Dankers, Alberto Traverso, Leonard Wee, and Sander M. J. van Kuijk

9 Diving Deeper into Models  121
Alberto Traverso, Frank J. W. M. Dankers, Biche Osong, Leonard Wee, and Sander M. J. van Kuijk

10 Reporting Standards and Critical Appraisal of Prediction Models  135
Leonard Wee, Sander M. J. van Kuijk, Frank J. W. M. Dankers, Alberto Traverso, Mattea Welch, and Andre Dekker

Part III From Model to Application

11 Clinical Decision Support Systems  153
A. T. M. Wasylewicz and A. M. J. W. Scheepers-Hoeks

12 Mobile Apps  171
Pieter Kubben

13 Optimizing Care Processes with Operational Excellence & Process Mining  181
Henri J. Boersma, Tiffany I. Leung, Rob Vanwersch, Elske Heeren, and G. G. van Merode

14 Value-Based Health Care Supported by Data Science  193
Tiffany I. Leung and G. G. van Merode

Index  213
Part I
Data Collection
Chapter 1
Data Sources

Pieter Kubben

1.1 Data Sources

1.1.1 Electronic Medical Records

Electronic medical records (EMRs), often also referred to as electronic health records (EHRs), are a major source of clinical data, although EMR and EHR have subtle differences [6]. EMRs are computerized medical information systems that collect, store and display patient information. They are a means to create legible and organized recordings and to access clinical information about individual patients. EMRs
have been described as an important tool to reduce medical errors and improve
information sharing among physicians [1]. Nevertheless, there are many barriers
that limit EMR adoption, varying from time, cost, security concerns and vendor
trust to absence of computer skills for the physician [1]. To some extent such barri-
ers can be lowered by using a framework for systematic EMR implementation [2].
On the other hand, expectations about using EHRs need to be tempered by practical
considerations, recognizing that even those countries with relatively high rates of
EHR penetration have achieved only limited successes in using EHR data for popu-
lation health [7]. To what extent EMRs effectively succeed in improving quality of
care and patient safety, remains a matter of debate [12, 16].
EMRs contain different sources of data which are relevant for data science. Most
obvious are data that are directly linked to personal health status, such as laboratory
values (tabular data), medical imaging (audiovisual data) or physicians’ written
notes (semi-structured or free text). Less obvious but definitely not less important
are data that can be obtained from computerized physician order entry systems,

P. Kubben (*)
Department of Neurosurgery, Maastricht University, Maastricht, Limburg, The Netherlands
e-mail: [email protected]

© The Author(s) 2019
P. Kubben et al. (eds.), Fundamentals of Clinical Data Science,
https://doi.org/10.1007/978-3-319-99713-1_1

clinical decision support systems or scheduling systems. The latter are more related to healthcare processes, which are described later in the chapters on operational excellence and value-based healthcare.
Given the highly sensitive data stored in EMRs, security is a particularly
important issue. Three types of safeguards have been described to limit the chance
for adverse events: access control (technical safeguard), physical access control
(physical safeguard) and administrative safeguards (such as local policies and
procedures) [11].

1.1.2 Other Medical Information Systems

A laboratory information (management) system (LI(M)S) is a software system that records, manages, and stores data for clinical laboratories. A LIS has traditionally
been most adept at sending laboratory test orders to lab instruments, tracking those
orders, and then recording the results, typically to a searchable database. The stan-
dard LIS has supported the operations of public health institutions (like hospitals
and clinics) and their associated labs by managing and reporting critical data con-
cerning “the status of infection, immunology, and care and treatment status of
patients” [3].
Radiology information systems (RIS) were introduced much earlier than EMRs for efficient ordering and scheduling, and were later integrated with the
Picture Archiving and Communication System (PACS) for increased workflow effi-
ciency in radiology departments [13]. For example, this integration saved 68 min
per radiologist per day, and reduced the average uncorrected or missed errors by 21
[10]. PACS will eventually be replaced by a Vendor Neutral Archive (VNA) [4]
which can be used for more than only radiology imaging (e.g. also intraoperative
video recordings or dermatology photos).
Another important source of information is the systems in use by external care and cure organizations, such as general practitioners. These systems are expected to achieve better integration or communication with hospitals’ EMRs, which would facilitate data exchange and provide new approaches for a more complete overview of a patient’s individual journey, including data collection at different time points and in different healthcare settings.

1.1.3 Mobile Apps

For many telemonitoring (telemedicine, telehealth) applications, mobile apps are a very important tool to measure health-related data independent of time and location. Modern smartphones can capture various sorts of data and store them directly to a remote server using built-in wireless communication channels. Such data do not only consist of surveys, but can also be audiovisual (using the built-in camera or microphone), movement data (accelerometer, gyroscope) or location (GPS). Using push messages, users can be reached immediately when a direct response is
required. This allows for “real time” feedback, or experience sampling, in which
momentary assessments can be obtained multiple times a day during activities of
daily life [17, 18].
In the context of health-related data, Apple HealthKit (for iOS) and Google
Fit (for Android) are of particular importance. These frameworks integrate all
sorts of health-related data and provide a universal interface for external devel-
opers to acquire such data after explicit consent by the user. Dedicated frame-
works for scientific research (Apple ResearchKit and Google Study) take this
process one step further and even allow for large scale studies using smartphone
technology only.

1.1.4 Internet of Things and Big Data

Internet of Things (IoT) refers to the networked interconnection of everyday objects, which are often equipped with omnipresent intelligence. Such objects
could be wearables (like smartwatches) but also shoe insoles or home domotics.
IoT will increase the ubiquity of the Internet by integrating every object for inter-
action via embedded systems, which leads to a highly distributed network of
devices communicating with human beings as well as other devices. Thanks to
rapid advances in underlying technologies, IoT is opening tremendous opportuni-
ties for a large number of novel applications that promise to improve the quality of
our lives [19]. By 2020, 40% of IoT-related technology will be health-related,
more than any other category, making up a $117 billion market [5]. IoT is a major
source for “Big Data”, which is often defined by “the four V’s”: Volume, Velocity,
Variety, and Value / Veracity [8, 14]. More information on Big Data is provided in
the next chapters.
An important concept to understand is that Big Data in itself is nothing more than a pile of bricks; it is not a house yet. In healthcare, Big Data are increasingly referred
to as the solution for all sorts of problems. Although they are of fundamental impor-
tance, what matters is what we do with these data. That is covered later in this book
in the sections on modelling.

1.1.5 Social Media

Social media such as Twitter, Facebook and blogs can also be an important source
of data. Publicly available data (e.g. Twitter) can be used for several sorts of analysis, like sentiment analysis or graph networks. They are also relevant media to recruit participants for studies that can take place completely online using frameworks such as Apple ResearchKit or Google Study.

1.2 GDPR

The General Data Protection Regulation (GDPR) is a European regulation that became the standard for privacy in May 2018. All European organizations that process privacy-sensitive data have to comply with the GDPR. Therefore, the GDPR
applies to all data sources mentioned above. Moreover, for scientific research most
medical-ethical research committees now also require explicit attention to the
GDPR when filing a new research protocol. A detailed description of the GDPR is
provided in Chap. 5.

1.3 Data Types

1.3.1 Tabular Data

Tabular data are the most common and well known data for research and data science. They are represented in a column-row format in which, most commonly, rows represent individual records and columns represent the relevant variables. For machine learning applications in which you try to predict one variable based on the others (supervised learning), the variable you try to predict is called the dependent, outcome, or class variable, and the others are the feature or predictor variables.
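As a purely illustrative sketch of this layout, the following fragment (in Python with the pandas library; all variable names and values are invented for illustration) arranges a few records with three predictor columns and one outcome column, as is typical for supervised learning:

    import pandas as pd

    # Hypothetical table: each row is one patient record, each column a variable
    records = pd.DataFrame({
        "age":            [54, 61, 47, 70],      # predictor (feature)
        "blood_pressure": [130, 145, 120, 160],  # predictor (feature)
        "smoker":         [0, 1, 0, 1],          # predictor (feature)
        "outcome":        [0, 1, 0, 1],          # dependent / class variable to predict
    })

    X = records.drop(columns=["outcome"])  # feature (predictor) matrix
    y = records["outcome"]                 # target (class) variable

A model trained on X to predict y would be an instance of supervised learning on tabular data.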

1.3.2 Time Series

Time series are an ordered sequence of values of a variable at equally spaced time
intervals. They are a particular sort of tabular data in which (mostly) columns rep-
resent different time stamps in chronological order. In data science applications the
goal is mostly to predict future events. Time series require specific sorts of prepro-
cessing as values (e.g. the mean) can -by definition- change over time. A particu-
larly relevant sort of time series are processes. Improving healthcare frequently
means improving processes. Process mining refers to the automated analysis of
processes and involves time series analysis. Another relevant sort of time series are
discrete time signals (e.g. digitally recorded accelerometer or ECG data). Such sig-
nals can be analyzed in the time domain (in which they are recorded) but also in the
frequency domain (after a Fourier transform) and using time-frequency analysis
(e.g. wavelets) in case of non-stationary signals. In this case, features are extracted
from the data before modelling takes place. For common machine learning applica-
tions, feature extraction is done explicitly by the researcher, but more advanced
deep neural networks are capable of automated feature extraction nowadays. More
information is available in Chaps. 6–9.
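As a small illustration of the time-domain versus frequency-domain idea, the sketch below (Python with NumPy; the signal is a synthetic sine wave standing in for, say, a digitally recorded accelerometer trace, so the numbers are purely assumptions) extracts a few simple features in both domains:

    import numpy as np

    fs = 100.0                        # assumed sampling frequency in Hz
    t = np.arange(0, 10, 1 / fs)      # 10 seconds of equally spaced samples
    # Synthetic "accelerometer" signal: a 3 Hz oscillation plus noise
    signal = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(t.size)

    # Time-domain features
    mean_value = signal.mean()
    standard_deviation = signal.std()

    # Frequency-domain features via the Fourier transform
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
    dominant_freq = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC component

    print(mean_value, standard_deviation, dominant_freq)  # dominant_freq is close to 3 Hz

Real applications typically compute many such features per time window before any modelling takes place.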

1.3.3 Natural Language

In many medical applications free text format is still frequently used by physicians
(physician notes, radiology reports), but also surveys or daily logs by patients can
contain free text. Besides, social media contain free text as their data source.
Techniques are available for text mining, also called “natural language processing”,
to extract meaning in an automated fashion from free text input. These techniques
in particular fall outside the scope of this book, but general principles for modelling
do still apply.

1.3.4 Images and Videos

Images are another important source of data for data science, and also require specific processing techniques for feature extraction before modelling can take place.
Also here, deep neural networks can perform automated feature extraction nowa-
days. A famous example is Google’s Deepmind project, in which a computer model
was fed videos that were tagged as containing cats or not containing cats. The model
came up with cat images, despite never being trained in recognizing the concept of a
cat. The same deep learning platform was later used to defeat the world champion in
the game of Go, and an improved version learned to play the game from scratch and
defeated the previous (world champion beating) algorithm with 100-0 [15].

1.4 Data Standards

Standardizing health care data involves the following [9]:


• Definition of data elements—determination of the data content to be collected
and exchanged.
• Data interchange formats—standard formats for electronically encoding the
data elements (including sequencing and error handling). Interchange standards
can also include document architectures for structuring data elements as they are
exchanged and information models that define the relationships among data ele-
ments in a message.
• Terminologies—the medical terms and concepts used to describe, classify, and
code the data elements and data expression languages and syntax that describe
the relationships among the terms/concepts.
• Knowledge Representation—standard methods for electronically representing
medical literature, clinical guidelines, and the like for decision support.
More detailed information on standards is available later in Chap. 3.
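To make the first three bullet points more tangible, the fragment below sketches how a single standardized data element (a haemoglobin result) might look when encoded with an interchange format such as HL7 FHIR and a terminology such as LOINC. This is a simplified, illustrative sketch, not a complete or authoritative FHIR resource, and the specific code and values shown are assumptions:

    # Illustrative, simplified sketch loosely following the HL7 FHIR
    # "Observation" resource; all values are invented.
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [{
                "system": "http://loinc.org",    # terminology: LOINC
                "code": "718-7",                 # assumed haemoglobin code
                "display": "Hemoglobin [Mass/volume] in Blood",
            }]
        },
        "subject": {"reference": "Patient/example"},
        "valueQuantity": {"value": 13.2, "unit": "g/dL"},
    }

Here the data element is defined (a haemoglobin measurement with a value and unit), expressed in an agreed interchange format, and coded against a shared terminology, so that a receiving system can interpret it without knowing local conventions.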

1.5 Conclusion

A variety of data sources and data types are relevant for clinical data science. A general overview of such data sources has been provided, and the concepts of different data types were introduced. The next chapters will dive deeper into data and standards, and a toolkit for data stewardship will be provided.

References

1. Ajami S, BagheriTadi T. Barriers for adopting electronic health records (EHRs) by physicians. Acta Inform Med. 2013;21(2):129–34. https://doi.org/10.5455/aim.2013.21.129-134.
2. Boonstra A, Versluis A, Vos JFJ. Implementing electronic health records in hospitals: a systematic literature review. BMC Health Serv Res. 2014;14:370. https://doi.org/10.1186/1472-6963-14-370.
3. Common L. Laboratory information system. LIMSWiki. n.d. https://doi.org/10.1097/PAP.0b013e318248b787.
4. Dennison D. PACS in 2018: an autopsy. J Digit Imaging. 2013;27(1):7–11. https://doi.org/10.1007/s10278-013-9660-1.
5. Dimitrov DV. Medical internet of things and big data in healthcare. Healthc Inform Res. 2016;22(3):156–8. https://doi.org/10.4258/hir.2016.22.3.156.
6. EHR (electronic health record) vs. EMR (electronic medical record). n.d. Retrieved June 22, 2018, from https://www.practicefusion.com/blog/ehr-vs-emr/.
7. Friedman DJ, Parrish RG, Ross DA. Electronic health records and US public health: current realities and future promise. Am J Public Health. 2013;103(9):1560–7. https://doi.org/10.2105/AJPH.2013.301220.
8. Huang T, Lan L, Fang X, An P, Min J, Wang F. Promises and challenges of big data computing in health sciences. Big Data Res. 2015;2(1):2–11. https://doi.org/10.1016/j.bdr.2015.02.002.
9. Institute of Medicine. IOM report: patient safety: achieving a new standard for care. Acad Emerg Med. 2005;12(10):1011–2. https://doi.org/10.1197/j.aem.2005.07.010.
10. Kovacs MD, Cho MY, Burchett PF, Trambert M. Benefits of integrated RIS/PACS/reporting due to automatic population of templated reports. Curr Probl Diagn Radiol. 2018:1–3. https://doi.org/10.1067/j.cpradiol.2017.12.002.
11. Kruse CS, Smith B, Vanderlinden H, Nealand A. Security techniques for the electronic health records. J Med Syst. 2017:1–9. https://doi.org/10.1007/s10916-017-0778-4.
12. Manca DP. Do electronic medical records improve quality of care?: yes. Can Fam Physician. 2015;61(10):846–7.
13. Nance JW Jr, Meenan C, Nagy PG. The future of the radiology information system. AJR Am J Roentgenol. 2013;200(5):1064–70. https://doi.org/10.2214/AJR.12.10326.
14. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2:3. https://doi.org/10.1186/2047-2501-2-3.
15. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of go without human knowledge. Nature. 2017;550(7676):354–9. https://doi.org/10.1038/nature24270.
16. Tubaishat A. The effect of electronic health records on patient safety: a qualitative exploratory study. Inform Health Soc Care. 2017:1–13. https://doi.org/10.1080/17538157.2017.1398753.
17. van Os J, Verhagen S, Marsman A, Peeters F, Bak M, Marcelis M, et al. The experience sampling method as an mHealth tool to support self-monitoring, self-insight, and personalized health care in clinical practice. Depress Anxiety. 2017;34(6):481–93. https://doi.org/10.1002/da.22647.
18. Verhagen SJW, Hasmi L, Drukker M, van Os J, Delespaul PAEG. Use of the experience sampling method in the context of clinical trials. Evid Based Ment Health. 2016;19(3):86–9. https://doi.org/10.1136/ebmental-2016-102418.
19. Xia F, Yang LT, Wang L, Vinel A. Internet of Things. Int J Commun Syst. 2012;25(9):1101–2. https://doi.org/10.1002/dac.2417.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 2
Data at Scale

Alberto Traverso, Frank J. W. M. Dankers, Leonard Wee,


and Sander M. J. van Kuijk

2.1 Introduction

A wide variety of data is generated daily in hospital facilities by different sources. Data is usually stored electronically and spread across different locations. For example, electronic reports with patients’ treatment information are usually stored within the oncology department of a hospital, whereas patients’ images are often stored in the radiology department in a different data platform (PACS, Picture Archiving and Communication System). In addition, different departments within the same hospital might use different infrastructures (e.g. software, data formats) to store acquired clinical data. Very often, those systems and/or data formats are not interoperable with each other. No matter what the source of clinical data is, data fragmentation represents one of the biggest issues when dealing with clinical data in general [1]. Data fragmentation occurs when a collection of data is broken up into many pieces that are not stored close together. The problem becomes even more pronounced in multicenter studies

A. Traverso, PhD (*) · L. Wee, PhD


Department of Radiation Oncology (MAASTRO), GROW School for Oncology
and Developmental Biology, Maastricht University Medical Center+,
Maastricht, The Netherlands
e-mail: [email protected]
F. J. W. M. Dankers, MSc
Department of Radiation Oncology (MAASTRO), GROW School for Oncology
and Developmental Biology, Maastricht University Medical Center+,
Maastricht, The Netherlands
Department of Radiation Oncology, Radboud University Medical Center,
Nijmegen, The Netherlands
S. M. J. van Kuijk, PhD
Department of Clinical Epidemiology and Medical Technology Assessment,
Maastricht University Medical Center, Maastricht, The Netherlands

© The Author(s) 2019
P. Kubben et al. (eds.), Fundamentals of Clinical Data Science,
https://doi.org/10.1007/978-3-319-99713-1_2

(e.g. developing and validating a model using data from different institutions). In fact, relevant information might be spread across the different institutions and, due to a lack of standardization, data interoperability might be compromised.
In addition, in the last decade we have seen a continuous and rapid exponential growth in the usage and production of clinical data, for example in the field of radiation oncology [2]. This growth has affected all the different sources of clinical data. For example, new technologies and scanners that can acquire images of a patient in less than a second have led to what has been called a ‘data explosion’ [3] in medical imaging. In general, technological developments in healthcare (such as new powerful imaging machines) have on the one hand improved the overall quality of healthcare. On the other hand, they have produced much more data than expected, while developments in data mining techniques have grown much more slowly, or at least not as fast as the production of data.
In fact, this data volume has been increasing so rapidly that it has grown beyond the capability of humans to analyze it. This data therefore represents an almost unexplored source of potential information that can be used, for example, to develop clinical prediction models using all the information (e.g. imaging, genetic banks, and electronic reports) available in medical institutions.
Some of the biggest problems associated with this unexplored data are the presence of missing values and the absence of a pre-determined structure.
Missing values occur when no data value is stored for a variable in an observation [4]. Missing data is a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Statistical techniques such as data imputation (explained later in the book) can be used to replace missing values.
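As a minimal sketch of such imputation (Python with pandas and scikit-learn; the tiny table and the choice of mean imputation are assumptions made purely for illustration, and later chapters treat data preparation in more depth):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Tiny, invented dataset with missing values (NaN)
    data = pd.DataFrame({
        "age":        [54, np.nan, 47, 70],
        "hemoglobin": [13.2, 11.8, np.nan, 12.5],
    })

    # Replace each missing value with the mean of its column
    imputer = SimpleImputer(strategy="mean")
    completed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
    print(completed)

Mean imputation is only one of several strategies; which one is appropriate depends on why the values are missing.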
Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner [5]. A data model is an agreement between several institutions on the format and database structure for storing data. Unstructured information is typically text-heavy, but may also contain dates, numbers and facts, as well as audiovisual material, locations and sensor data. If we look at clinical data, we can recognize both the presence of missing values and the absence of a predetermined structure. For these reasons, clinical data is still not ready to be mined (i.e. processed) automatically by machines (e.g. artificial intelligence).
Therefore, the term big (clinical) data refers not only to a large volume of data, but to a large volume of complex, unstructured and fragmented data coming from different sources.
We will explain this concept in the next section.

2.2 ‘Big’ Clinical Data: The Four ‘Vs’

As already mentioned in the introduction, the problem with clinical data is not only its large and growing volume, but also that data is collected in different formats and stored in various separate databases (fragmentation), together with the absence of an agreed data format (not structured). Now, why do we use the term ‘big’, and what makes big data ‘big’?
We performed a literature search and summarized the most common definitions of big data.
The community agrees that big data can be characterized by the four ‘V’ concepts: volume, variety, velocity, and veracity.
1. Volume: the volume of data increases exponentially every day, since not only humans but also, and especially, machines are producing new information faster and faster (recall the earlier example of the ‘data explosion’ in medical imaging, but also the “Internet of Things”). In the community, data on the order of terabytes and larger is considered ‘big volume’. Volume contributes to the major issue that traditional storage systems, such as traditional databases, are no longer suitable for accommodating such huge amounts of data.
2. Variety: big data comes from different sources and is stored in different formats:
(a) Different types: in the past, the major sources of clinical data were databases or spreadsheets, which usually contain structured or, less often, semi-structured data (e.g. databases with some missing values or inconsistencies). Now data can also come in the form of free text (electronic reports) or images (patients’ scans).
(b) Different sources: variety also means that data can come from different sources. These sources do not necessarily belong to the same institution.
Variety affects both data collection and storage. Two major challenges must be faced: (a) storing and retrieving this data in an efficient and cost-effective way, and (b) aligning data types from different sources, so that all the data can be mined at the same time.
There is also an additional complexity due to the interaction between variety and volume. In fact, unstructured data is growing much faster than structured data. One estimate is that the amount of unstructured data doubles around every 3 months [1]. Therefore, the complexity and fragmentation of data is far from slowing down: we will have to deal with much more unstructured data than we expected.
3. Velocity: the production of big data (by machines or humans) is a continuous and massive flow.
(a) Data in motion and real-time big data analytics: big data is produced in ‘real time’ and most of the time needs to be analyzed in ‘real time’. Therefore, an architecture for capturing and mining big data flows must support real-time turnaround.
(b) Lifetime of data utility: a second dimension of data velocity is how long data will remain valuable. Understanding this additional ‘temporal’ dimension of velocity allows us to discard data that is no longer meaningful once new, up-to-date and more detailed information has been produced. The period of “data lifetime” can be long, but in some cases also short (days). For example, we might think that for a specific analysis we only need the results from a recent lab test (most recent data). However, for a more detailed analysis we might want to trace the same measurements from the past (longer lifetime).
4. Veracity: big data, due to its complexity, might present inconsistencies, such as missing values. More generally, big data contains noise, biases and abnormalities. The data science community usually recognizes veracity as the biggest challenge, compared to velocity and volume. For example, if we take three measurements of blood pressure, which may each vary, reporting the average may be common practice, but the average is not a real measurement value.
Besides these four properties, four additional ‘Vs’ have been proposed by the community: validity, volatility, viscosity, and virality.
5. Validity: due to the large volume and the veracity of the data, we need to make sure the data is accurate for the intended use. However, in contrast to small datasets, in the initial stage of the analysis there is no need to worry about the validity of each single data element. In fact, it is more important to see whether any relationships exist between elements within this massive data source than to ensure that all elements are valid.
6. Volatility: big data volatility refers to how long data must be available and how long it should be stored, since concerns about the required storage capacity might be raised.
7. Viscosity: viscosity measures the resistance to flow in the volume of data. This resistance can come from different data sources, friction from integration flow rates, and the processing required to turn the data into insight.
8. Virality: the rate at which the data spreads; for example, it measures how often the data is picked up and re-used by users other than the original owner of the data.
To see the four main ‘Vs’ in action, let us consider the case of imaging data (e.g. patients’ scans) collected within a hospital institution:
1. Due to improvements in hardware (e.g. scanning machines), a large number of images are produced (and stored) within a short period of time (Volume).
2. Developments in hardware, and in the imaging healthcare sector in general, are producing machines able to generate many more images, combining different modalities at the same time. This phenomenon is growing exponentially (Velocity).
3. Different imaging modalities are combined together (Variety).
4. Although there is a unified standard for storing and transmitting medical images (DICOM, Digital Imaging and Communications in Medicine), there is no agreement on associated metadata, such as medical annotations of patients’ scans. As a result, metadata associated with imaging data can come in different formats, without a uniquely agreed data model (Veracity).
These considerations apply to clinical data in general. We advise the reader to identify the eight ‘Vs’ in the different sources of data presented in the previous chapter.
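As a small illustration of the veracity point above, the sketch below uses the pydicom library (an assumption about available tooling; the file name is hypothetical) to read one image slice: attributes standardized by DICOM can be retrieved reliably, whereas private, vendor- or site-specific metadata vary and lack an agreed data model:

    import pydicom

    # Hypothetical path to a single slice exported from a scanner
    ds = pydicom.dcmread("example_ct_slice.dcm")

    # Attributes standardized by DICOM are reliably available
    print(ds.Modality)       # e.g. "CT"
    print(ds.StudyDate)      # acquisition date
    print(ds.PixelSpacing)   # in-plane resolution in mm

    # Private/vendor tags and clinical annotations differ per site and vendor
    private_elements = [elem for elem in ds if elem.tag.is_private]
    print(len(private_elements), "private (non-standard) elements")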

2.3 Data Landscape

A good visualization of the scale of data is given by the concept of the data landscape, shown in Fig. 2.1.
Fig. 2.1 The data landscape. Missing dots represent missing values. The clinical routine data covers the whole data landscape

We can observe that:
1. Data collections such as clinical data registries or clinical trial data cover only a small portion of the data landscape. In fact,
(a) A cancer registry usually contains various pieces of information about a large number of patients (y-axis) or a population, but the variables (or features, x-axis) collected are limited.
(b) Clinical trial data usually collect more information than cancer registries, but only for a selected and limited patient population.
2. Clinical routine data covers the whole data landscape. Unfortunately, the figure shows that the data landscape is not fully covered by points in the clinical routine domain. These missing dots represent ‘missing’ values. ‘Real world’ clinical data is characterized by a large amount (around 80%) of missing values.
When looking at Fig. 2.1, it is possible to identify again some of the ‘Vs’ associated with big data:
1. A vast volume of data is produced (large extension along the x-axis and y-axis): Velocity + Volume.
2. The data includes many kinds of information from different sources (‘features’): Veracity + Variety.

In the last part of this chapter, we will analyze some of the barriers that currently limit the sharing of big data across institutions (or sometimes even between different departments of the same institution). We will also provide the reader with some advanced data management techniques that may solve the mentioned issues.

2.4 Barriers to Big Data Exchange

Even when we reach a level that allows us to correctly mine and retrieve meaningful information from clinical big data, its exchange is still restrained by the following issues:
1. Administrative barriers: mining big clinical data might require additional effort, such as new dedicated roles in the hospital, increasing the cost of personnel.
2. Ethical barriers: these issues are mainly related to data privacy concerns. Several different privacy laws might apply, leading to relevant differences in the interpretation of privacy and the application of data confidentiality, and different legislations exist between countries [6].
3. Political barriers: even if technical barriers have been overcome, very often people are not willing to share their data. A joint effort by the community is then required to prove the benefits associated with ‘big’ data exchange.
4. Technical barriers: technical barriers are mainly related to poor big data interoperability across different institutions. We saw that veracity is one of the causes of poor big data interoperability. Secondly, a lack of standardization and big data harmonization still limits data exchange. More generally, technical barriers stem from a lack of support for internationally standardized protocols, formats and semantics.
We believe that the whole community should collaborate to face these challenges. In fact, the success of effective clinical prediction models based on big clinical data depends much more on the curation of the data used to develop and validate the model than on sophisticated choices in model development (e.g. the use of very complicated machine learning algorithms).
Some of the key points for a large-scale collaboration using big data in the clinical domain are:
1. Accelerating the progress toward a standardized and agreed data model for the clinical domain by making use of advanced techniques such as ontologies [7] and the Semantic Web [8]. Ontologies provide a common terminology to overcome, for example, language barriers. In fact, in an ontology, data is associated with universal concepts (classes) uniquely identified by a Uniform Resource Identifier (URI). By means of the Semantic Web, data and related metadata are published and made accessible (via queries) using the universal concepts defined by the ontology [9]. In this way, data and metadata can be queried without knowing a priori the structures or data formats of the original sources (a small sketch of this idea follows this list).
2. Showing the advantages of using real-world clinical data by focusing on high-quality, published research articles that convincingly prove the benefits of data exchange (e.g. efficiency, robustness and security).
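Purely as a sketch of the idea in point 1, the fragment below uses the rdflib library (an assumption about tooling, not a method prescribed by the authors) to attach a local data point to universal concepts identified by URIs and to query it with SPARQL, without the consumer needing to know the original database layout. The ontology namespace and property names are invented for illustration:

    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    # Hypothetical ontology namespace defining universal concepts (classes)
    ONTO = Namespace("http://example.org/clinical-ontology#")

    g = Graph()
    patient = URIRef("http://example.org/data/patient/123")

    # Link local data to universal concepts identified by URIs
    g.add((patient, RDF.type, ONTO.Patient))
    g.add((patient, ONTO.hasTumourStage, Literal("T2")))

    # Anyone who knows the ontology can query the data, regardless of how
    # it was originally structured at the source institution.
    results = g.query("""
        PREFIX onto: <http://example.org/clinical-ontology#>
        SELECT ?p ?stage WHERE { ?p a onto:Patient ; onto:hasTumourStage ?stage . }
    """)
    for row in results:
        print(row.p, row.stage)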

2.5 Conclusion

–– Data volume has been increasing so rapidly that it has grown beyond the capability of humans to analyze it. This data therefore represents an almost unexplored source of potential information.
–– The term big (clinical) data refers not only to a large volume of data, but to a large volume of complex, unstructured and fragmented data coming from different sources.
–– Big clinical data is defined by the four ‘Vs’: volume, variety, velocity, and veracity.
–– Several issues limit the sharing and exchange of big clinical data: administrative, ethical, political, and technical barriers.

References

1. Lustberg T, van Soest J, Jochems A, Deist T, van Wijk Y, Walsh S, et al. Big data in radiation
therapy: challenges and opportunities. Br J Radiol. 2017;90(1069):20160689.
2. Chen AB. Comparative effectiveness research in radiation oncology: assessing technology.
Semin Radiat Oncol. 2014;24(1):25–34.
3. Rubin GD. Data explosion: the challenge of multidetector-row CT. Eur J Radiol. 2000
Nov;36(2):74–80.
4. Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. Hoboken: Wiley; 2002.
381 p. (Wiley series in probability and statistics).
5. Han J, Kamber M, Pei J. Data mining: concepts and techniques. San Francisco: Morgan
Kaufmann; 2011.
6. Skripcak T, Belka C, Bosch W, Brink C, Brunner T, Budach V, et al. Creating a data exchange
strategy for radiotherapy research: towards federated databases and anonymised public datas-
ets. Radiother Oncol. 2014;113(3):303–9.
7. Bechhofer S. OWL: web ontology language. In: Liu L, Özsu MT, editors. Encyclopedia of database systems [Internet]. Boston: Springer US; 2009. p. 2008–2009. Available from: https://doi.org/10.1007/978-0-387-39940-9_1073.
8. Berners-Lee T, Hendler J. Publishing on the semantic web. Nature. 2001;410(6832):1023–4.
9. Traverso A, van Soest J, Wee L, Dekker A. The radiation oncology ontology (ROO): publishing
linked data in radiation oncology using semantic web and ontology techniques. Med Phys.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 3
Standards in Healthcare Data

Stefan Schulz, Robert Stegwee, and Catherine Chronaki

3.1 Introduction

Our industrialised societies are heavily dependent on standards. That we can safely assume that electric plugs of a certain kind, independently of their manufacturer, fit into sockets of certain types and not into sockets of other types is just one example of how manufacturing is guided by standards. The benefit is obvious: complex technical artefacts can be assembled out of smaller components. Conformance to standards facilitates their exchange and substitutability, creates independence from manufacturers, eases competition and generates interoperability across borders. Standardisation of commodities and consumer goods makes them easier to compare, to categorise and, consequently, to trade. In addition, compliance with safety standards will increase trust in the safe operation of components under predefined conditions. The authors of this chapter argue that standardisation is equally required for data in general and clinical data in particular, for which safety, exchangeability and interoperability are an overarching aim, in particular with regard to the emerging field of data science.
There are many definitions of standards. Our approach is pragmatic and committed to the view that standards are information artefacts developed in community-driven consensus processes that specify uniform features, criteria, methods, processes and

S. Schulz (*)
Medical University of Graz, Institute for Medical Informatics, Statistics and Documentation,
Graz, Austria
Averbis GmbH, Freiburg, Germany
e-mail: [email protected]
R. Stegwee
CGI Netherlands, Health Unit, Rotterdam, The Netherlands
CEN Technical Committee 251 Health Informatics, Brussels, Belgium
C. Chronaki
HL7 Foundation, Brussels, Belgium

© The Author(s) 2019
P. Kubben et al. (eds.), Fundamentals of Clinical Data Science,
https://doi.org/10.1007/978-3-319-99713-1_3
