Fundamentals of Clinical Data Science. ISBN 978-3319997124
Pieter Kubben · Michel Dumontier · Andre Dekker
Editors

Fundamentals of Clinical Data Science

Editors

Pieter Kubben
Department of Neurosurgery
Maastricht University
Maastricht, Limburg, The Netherlands

Michel Dumontier
Institute of Data Science
Maastricht University
Maastricht, Limburg, The Netherlands

Andre Dekker
Maastro Clinic
Maastricht, Limburg, The Netherlands
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Introduction: “Fundamentals of Clinical Data Science”
In the era of eHealth and personalized medicine, “big data” and “machine learning” are increasingly becoming part of the medical world. Algorithms are capable of supporting diagnostic and therapeutic processes and offer added value for both healthcare professionals and patients. The field of big data, machine learning, deep learning, and algorithm development and validation is often referred to as “data science,” and “data scientist” was described in Harvard Business Review as “the sexiest job of the 21st century” (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). A commonly used visual representation of the field is Drew Conway’s Venn diagram (Fig. 1), which describes data science as a mix of content expertise, methodological knowledge, and IT skills.
Pieter Kubben
P. Kubben (*)
Department of Neurosurgery, Maastricht University, Maastricht, Limburg, The Netherlands
e-mail: [email protected]
clinical decision support systems or scheduling systems. The latter are more related to healthcare processes, which are described later in the chapters on operational excellence and value-based healthcare.
Given the highly sensitive data stored in EMRs, security is a particularly important issue. Three types of safeguards have been described to limit the chance of adverse events: access control (a technical safeguard), physical access control (a physical safeguard), and administrative safeguards (such as local policies and procedures) [11].
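As a minimal sketch of what a technical safeguard can look like in practice, the example below implements a toy role-based access check with an audit trail; the roles, permissions, and log format are invented for illustration and do not reflect any particular EMR system.

```python
# Minimal sketch of a technical safeguard: role-based access control with an audit trail.
# Roles, permissions, and the audit log are hypothetical examples, not a real EMR API.
from datetime import datetime, timezone

PERMISSIONS = {
    "physician": {"read_record", "write_record"},
    "nurse": {"read_record"},
    "researcher": set(),  # researchers would only see de-identified extracts elsewhere
}

audit_log = []  # administrative safeguards typically require such a trail

def access_record(user: str, role: str, patient_id: str, action: str) -> bool:
    """Grant or deny an action on a patient record and log the attempt."""
    allowed = action in PERMISSIONS.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user, "role": role, "patient": patient_id,
        "action": action, "allowed": allowed,
    })
    return allowed

print(access_record("j.doe", "nurse", "patient-001", "write_record"))  # False: denied
```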
Social media such as Twitter, Facebook and blogs can also be an important source of data. Publicly available data (e.g. Twitter) can be used for several kinds of analysis, such as sentiment analysis or graph (network) analysis. They are also relevant media for recruiting participants for studies that can take place entirely online using frameworks such as Apple ResearchKit or Google Study.
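As a minimal illustration of sentiment analysis on public posts, the sketch below scores invented example posts against a tiny hand-made word list; real applications would use dedicated NLP tooling and validated lexicons.

```python
# Toy sentiment analysis of public social media posts (invented data and lexicon).
POSITIVE = {"great", "relieved", "better", "thankful"}
NEGATIVE = {"pain", "worse", "tired", "anxious"}

posts = [
    "Feeling so much better after the new treatment, thankful!",
    "Pain is worse today and I am tired of waiting for results.",
]

def polarity(text: str) -> int:
    """Positive minus negative word count: a crude sentiment score."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

for post in posts:
    print(polarity(post), post)  # higher score -> more positive sentiment
```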
Tabular data are the most common and best known data for research and data science. They are represented in a column-row format in which (most commonly) rows represent individual records and columns represent the relevant variables. For machine learning applications in which you try to predict one variable based on the others (supervised learning), the variable you try to predict is called the dependent, outcome, or class variable, and the others are the feature or predictor variables.
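As a minimal sketch of this layout, the toy table below (with invented values) holds two predictor columns and one class column that a supervised model learns to predict; it assumes pandas and scikit-learn are available.

```python
# Toy tabular dataset: rows are patients (records), columns are variables (invented values).
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age": [54, 61, 47, 70, 58, 66],
    "systolic_bp": [130, 145, 120, 160, 138, 150],
    "complication": [0, 1, 0, 1, 0, 1],  # outcome (class) variable to be predicted
})

X = df[["age", "systolic_bp"]]   # feature / predictor variables
y = df["complication"]           # dependent (class) variable

model = LogisticRegression().fit(X, y)
print(model.predict(X[:2]))      # predictions for the first two records
```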
Time series are an ordered sequence of values of a variable at equally spaced time intervals. They are a particular sort of tabular data in which (mostly) columns represent different time stamps in chronological order. In data science applications the goal is usually to predict future events. Time series require specific sorts of preprocessing, because values (e.g. the mean) can, by definition, change over time. A particularly relevant sort of time series are processes. Improving healthcare frequently means improving processes. Process mining refers to the automated analysis of processes and involves time series analysis. Another relevant sort of time series are discrete-time signals (e.g. digitally recorded accelerometer or ECG data). Such signals can be analyzed in the time domain (in which they are recorded), but also in the frequency domain (after a Fourier transform) and, in the case of non-stationary signals, using time-frequency analysis (e.g. wavelets). In this case, features are extracted from the data before modelling takes place. For common machine learning applications, feature extraction is done explicitly by the researcher, but more advanced deep neural networks are nowadays capable of automated feature extraction. More information is available in Chaps. 6–9.
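A minimal sketch of explicit feature extraction from a discrete-time signal is shown below: a simulated signal (not real ECG or accelerometer data) is summarized in the time domain and, after a Fourier transform, in the frequency domain, assuming NumPy is installed.

```python
# Extract simple time-domain and frequency-domain features from a simulated signal.
import numpy as np

fs = 100.0                                   # sampling frequency in Hz (simulated)
t = np.arange(0, 10, 1 / fs)                 # 10 seconds of equally spaced samples
signal = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.random.randn(t.size)

# Time-domain features
mean, std = signal.mean(), signal.std()

# Frequency-domain features via the discrete Fourier transform
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
dominant_freq = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC component

print(f"mean={mean:.2f}, std={std:.2f}, dominant frequency={dominant_freq:.2f} Hz")
```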
In many medical applications, free text is still frequently used by physicians (physician notes, radiology reports), but surveys or daily logs kept by patients can also contain free text. In addition, social media have free text as their data source. Techniques for text mining, also called “natural language processing”, are available to extract meaning from free-text input in an automated fashion. These techniques fall outside the scope of this book, but the general principles for modelling still apply.
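Although natural language processing falls outside the scope of this book, the following minimal bag-of-words sketch gives a flavour of how free text can be turned into model features; the notes are invented and a recent version of scikit-learn is assumed.

```python
# Turn free-text notes (invented examples) into a simple bag-of-words feature matrix.
from sklearn.feature_extraction.text import CountVectorizer

notes = [
    "Patient reports chest pain radiating to the left arm.",
    "No chest pain today; mild headache after medication change.",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(notes)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # extracted vocabulary
print(X.toarray())                           # word counts per note
```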
Images are another important source of data for data science, and they also require specific processing techniques for feature extraction before modelling can take place. Here too, deep neural networks can nowadays perform automated feature extraction. A famous example is Google’s DeepMind project, in which a computer model was fed videos that were tagged as containing cats or not containing cats. The model came up with cat images, despite never having been trained to recognize the concept of a cat. The same deep learning platform was later used to defeat the world champion in the game of Go, and an improved version learned to play the game from scratch and defeated the previous (world-champion-beating) algorithm 100–0 [15].
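The sketch below illustrates the idea of automated feature extraction with a small convolutional network in PyTorch; the architecture and the random input image are placeholders chosen for illustration, not the models discussed above.

```python
# A small convolutional network that learns image features automatically (PyTorch).
# The layers and the random input are illustrative placeholders, not a published model.
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learns low-level filters (edges, blobs)
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # learns higher-level combinations
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                                # 16 learned features per image
)

image = torch.randn(1, 1, 64, 64)                # one grayscale 64x64 "image"
features = feature_extractor(image)
print(features.shape)                            # torch.Size([1, 16])
```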
1.5 Conclusion
A variety of data sources and data types are relevant for clinical data science. This chapter has provided a general overview of such data sources and introduced the concepts of different data types. The next chapters dive deeper into data and standards, and provide a toolkit for natural data stewardship.
References
1. Ajami S, Bagheri Tadi T. Barriers for adopting electronic health records (EHRs) by physicians. Acta Inform Med. 2013;21(2):129–34. https://doi.org/10.5455/aim.2013.21.129-134.
2. Boonstra A, Versluis A, Vos JFJ. Implementing electronic health records in hospitals: a systematic literature review. BMC Health Serv Res. 2014;14:370. https://doi.org/10.1186/1472-6963-14-370.
3. Laboratory information system. LIMSWiki. n.d. https://doi.org/10.1097/PAP.0b013e318248b787.
4. Dennison D. PACS in 2018: an autopsy. J Digit Imaging. 2013;27(1):7–11. https://doi.org/10.1007/s10278-013-9660-1.
5. Dimitrov DV. Medical internet of things and big data in healthcare. Healthc Inform Res. 2016;22(3):156–8. https://doi.org/10.4258/hir.2016.22.3.156.
6. EHR (electronic health record) vs. EMR (electronic medical record). n.d. Retrieved June 22, 2018, from https://www.practicefusion.com/blog/ehr-vs-emr/.
7. Friedman DJ, Parrish RG, Ross DA. Electronic health records and US public health: current realities and future promise. Am J Public Health. 2013;103(9):1560–7. https://doi.org/10.2105/AJPH.2013.301220.
8. Huang T, Lan L, Fang X, An P, Min J, Wang F. Promises and challenges of big data computing in health sciences. Big Data Res. 2015;2(1):2–11. https://doi.org/10.1016/j.bdr.2015.02.002.
9. Institute of Medicine. IOM report: patient safety—achieving a new standard for care. Acad Emerg Med. 2005;12(10):1011–2. https://doi.org/10.1197/j.aem.2005.07.010.
10. Kovacs MD, Cho MY, Burchett PF, Trambert M. Benefits of integrated RIS/PACS/reporting due to automatic population of templated reports. Curr Probl Diagn Radiol. 2018:1–3. https://doi.org/10.1067/j.cpradiol.2017.12.002.
11. Kruse CS, Smith B, Vanderlinden H, Nealand A. Security techniques for the electronic health records. J Med Syst. 2017;41(8):127. https://doi.org/10.1007/s10916-017-0778-4.
12. Manca DP. Do electronic medical records improve quality of care?: yes. Can Fam Physician. 2015;61(10):846–7.
13. Nance JW Jr, Meenan C, Nagy PG. The future of the radiology information system. AJR Am J Roentgenol. 2013;200(5):1064–70. https://doi.org/10.2214/AJR.12.10326.
14. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2:3. https://doi.org/10.1186/2047-2501-2-3.
15. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go without human knowledge. Nature. 2017;550(7676):354–9. https://doi.org/10.1038/nature24270.
16. Tubaishat A. The effect of electronic health records on patient safety: a qualitative exploratory study. Inform Health Soc Care. 2017:1–13. https://doi.org/10.1080/17538157.2017.1398753.
17. van Os J, Verhagen S, Marsman A, Peeters F, Bak M, Marcelis M, et al. The experience sampling method as an mHealth tool to support self-monitoring, self-insight, and personalized health care in clinical practice. Depress Anxiety. 2017;34(6):481–93. https://doi.org/10.1002/da.22647.
18. Verhagen SJW, Hasmi L, Drukker M, van Os J, Delespaul PAEG. Use of the experience sampling method in the context of clinical trials. Evid Based Ment Health. 2016;19(3):86–9. https://doi.org/10.1136/ebmental-2016-102418.
19. Xia F, Yang LT, Wang L, Vinel A. Internet of things. Int J Commun Syst. 2012;25(9):1101–2. https://doi.org/10.1002/dac.2417.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 2
Data at Scale
2.1 Introduction
(e.g. developing and validating a model using data from different institutions). Relevant information might be spread across different institutions and, due to a lack of standardization, data interoperability might be compromised.
In addition, over the last decade we have seen continuous and rapid exponential growth in the production and use of clinical data, for example in the field of radiation oncology [2]. This growth affects all the different sources of clinical data. For example, new scanner technologies that can acquire images of a patient in less than a second have led to what has been called a ‘data explosion’ [3] in medical imaging. In general, technological developments in healthcare (such as new, powerful imaging machines) have improved overall healthcare quality, but they have also produced far more data than expected. At the same time, developments in data mining techniques have progressed much more slowly, or at least not as fast as the production of data.
This data volume has been increasing so rapidly that it exceeds the capacity of humans to analyze it. The data therefore represent an almost unexplored source of potential information that can be used, for example, to develop clinical prediction models using all the information (e.g. imaging, genetic biobanks, and electronic reports) available in medical institutions.
Some of the biggest problems associated with this unexplored data are the presence of missing values and the absence of a pre-determined structure.
Missing values occur when no data value is stored for a variable in an observation [4]. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Statistical techniques such as data imputation (explained later in the book) can be used to replace missing values.
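A minimal sketch of such imputation, assuming pandas and scikit-learn, replaces missing values with the column mean; more sophisticated strategies are discussed later in the book.

```python
# Replace missing values with the column mean (a simple form of imputation).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [63, 71, np.nan, 58],
    "creatinine": [1.1, np.nan, 0.9, 1.4],
})  # invented values with gaps

imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```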
Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner [5]. A data model is an agreement between several institutions on the format and database structure used to store data. Unstructured information is typically text-heavy, but may also contain dates, numbers, and facts, as well as audiovisual material, location data, and sensor data.
If we look at clinical data, we can recognize both the presence of missing values and the absence of a predetermined structure. For these reasons, clinical data is still not ready to be mined (i.e. processed) automatically by machines (e.g. artificial intelligence).
Therefore, the term big (clinical) data refers not only to a large volume of data, but to a large volume of complex, unstructured and fragmented data coming from different sources.
We will explain this concept in the next section.
As already mentioned in the introduction, the problem of clinical data is not only its large and growing volume, but also that data is collected in different formats and stored in various separate databases (fragmentation), together with the absence of an agreed data format (lack of structure). Why, then, do we use the term ‘big’, and what makes big data ‘big’?
We performed a literature review and summarized the most common definitions of big data. The community agrees that big data can be characterized by four ‘V’ concepts: volume, variety, velocity, and veracity.
1. Volume: the volume of data increases exponentially every day, since not only humans, but also and especially machines, are producing new information ever faster (recall the earlier example of the ‘data explosion’ in medical imaging, but also the “Internet of Things”). In the community, data on the order of terabytes and larger is considered ‘big volume’. Volume contributes to the major issue that traditional storage systems, such as conventional databases, are no longer suitable for accommodating such amounts of data.
2. Variety: big data come from different sources and are stored in different formats:
(a) Different types: in the past, the major sources of clinical data were databases or spreadsheets, which usually contain structured or, less often, semi-structured data (e.g. databases with some missing values or inconsistencies). Now data can also come in the form of free text (electronic reports) or images (patients’ scans).
(b) Different sources: variety also means that data can come from different sources, which do not necessarily belong to the same institution.
Variety affects both data collection and storage. Two major challenges must be faced: (a) storing and retrieving these data in an efficient and cost-effective way, and (b) aligning data types from different sources so that all the data can be mined together, as the sketch below illustrates.
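A minimal sketch of the second challenge, aligning data from different sources, is shown below; the two toy extracts, their column names, and the unit conversion are invented for illustration and assume pandas is available.

```python
# Align two toy extracts from different institutions into one table (invented data).
import pandas as pd

site_a = pd.DataFrame({"patient_id": ["A1", "A2"], "weight_kg": [80.0, 65.5]})
site_b = pd.DataFrame({"PatID": ["B7", "B8"], "weight_lb": [176.0, 144.0]})

# Harmonize column names and units before mining the data together
site_b = site_b.rename(columns={"PatID": "patient_id"})
site_b["weight_kg"] = site_b.pop("weight_lb") * 0.45359237

combined = pd.concat([site_a, site_b], ignore_index=True)
print(combined)
```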
There is additional complexity due to the interaction between variety and volume: unstructured data is growing much faster than structured data. One estimate is that unstructured data doubles around every 3 months [1]. The complexity and fragmentation of data are therefore far from slowing down: we will have to deal with much more unstructured data than we expected.
3. Velocity: the production of big data (by machines or humans) is a continuous
and massive flow.
(a) Data in motion and real-time big data analytics: big data is produced in real time and most of the time needs to be analyzed in real time. Therefore, an architecture for capturing and mining big data flows must support real-time turnaround.
(b) Lifetime of data utility: a second dimension of data velocity is how long the data remain valuable. Understanding this additional ‘temporal’ dimension of velocity allows us to discard data that are no longer meaningful once newer and more detailed information has been produced. The ‘data lifetime’ can be long, but in some cases also short (days). For example, for a specific analysis we may only need the results of a recent lab test (most recent data), whereas for a more detailed analysis we may want to trace the same measurements from the past (longer lifetime).
4. Veracity: big data, due to its complexity, may present inconsistencies, such as missing values. More generally, big data contains noise, biases and abnormalities. The data science community usually recognizes veracity as a bigger challenge than velocity and volume. For example, if we take three measurements of blood pressure, reporting their average may be common practice, but the average is not an actually measured value.
Besides these four properties, four additional ‘Vs’ have been proposed by the community: validity, volatility, viscosity, and virality.
5. Validity: because of the large volume and the issue of veracity, we need to make sure the data are accurate for the intended use. However, unlike with small datasets, in the initial stage of the analysis there is no need to verify the validity of every single data element. It is more important to see whether any relationships exist between elements within this massive data source than to ensure that all elements are valid.
6. Volatility: big data volatility refers to how long data must remain available and how long they should be stored, given the concerns raised by ever-increasing storage requirements.
7. Viscosity: viscosity measures the resistance to flow in the volume of data. This resistance can come from different data sources, friction from integration flow rates, and the processing required to turn the data into insight.
8. Virality: the rate at which the data spreads; for example, it measures how often the data is picked up and re-used by users other than the original owner of the data.
To see the four main ‘Vs’ in action, let us consider the case of imaging data (e.g. patients’ scans) collected within a hospital:
1. Due to improvements in hardware (e.g. scanning machines), a large number of images is produced (and stored) within a short time (Volume).
2. Developments in hardware, and in the imaging healthcare sector in general, are producing machines able to generate many more images and to combine different modalities at the same time. This phenomenon is growing exponentially (Velocity).
3. Different imaging modalities are combined (Variety).
4. Although there is a unified standard for storing and transmitting medical images (DICOM, Digital Imaging and Communications in Medicine), there is no agreement on the associated metadata, such as medical annotations of patients’ scans. As a result, the metadata associated with imaging data can come in different formats, without a single agreed data model (Veracity), as the sketch below illustrates.
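The sketch below, assuming the pydicom package and a placeholder file name, shows how standardized DICOM tags can be read while site-specific annotation metadata may or may not be present.

```python
# Read standard DICOM tags; non-standard annotation metadata may be absent or vary by site.
import pydicom

ds = pydicom.dcmread("example.dcm")          # placeholder path to a DICOM file

print(ds.Modality, ds.StudyDate)             # standardized tags defined by DICOM
# Annotations are often stored outside the standard, so they may simply be missing:
annotation = ds.get("ImageComments", "no annotation stored")
print(annotation)
```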
These considerations apply to clinical data in general. We invite the reader to identify the eight ‘Vs’ in the different sources of data presented in the previous chapter.
In the last part of this chapter, we analyze some of the barriers that currently limit the sharing of big data across institutions (or sometimes even between departments of the same institution). We also provide the reader with some advanced data management techniques to address these issues.
Even when we have reached a level that allows us to correctly mine and retrieve meaningful information from big clinical data, its exchange is still restrained by the following issues:
1. Administrative barriers: mining big clinical data might require additional effort, such as new dedicated roles within the hospital, increasing personnel costs.
2. Ethical barriers: these are mainly related to data privacy concerns. Several different privacy laws may apply, leading to substantial differences in how privacy is interpreted and data confidentiality is applied, and legislation differs between countries [6].
3. Political barriers: even when technical barriers have been overcome, people are very often not willing to share their data. A joint effort by the community is needed to demonstrate the benefits of ‘big’ data exchange.
4. Technical barriers: these are mainly related to poor big data interoperability across institutions. We saw that veracity is one of the causes of poor big data interoperability. In addition, the lack of standardization and harmonization of big data still limits data exchange. More generally, technical barriers are determined by a lack of support for internationally standardized protocols, formats and semantics.
We believe the whole community should collaborate to face these challenges. The success of effective clinical prediction models based on big clinical data depends much more on the curation of the data used to develop and validate the model than on sophisticated modelling choices (e.g. the use of very complicated machine learning algorithms).
Some of the key points for a large-scale collaboration using big data in the clini-
cal domain are:
1. Accelerating progress toward a standardized and agreed data model for the clinical domain by making use of advanced techniques such as ontologies [7] and the Semantic Web [8]. Ontologies provide a common terminology that overcomes, for example, language barriers. In an ontology, data is associated with universal concepts (classes), each specified by a Uniform Resource Identifier (URI). By means of the Semantic Web, data and related metadata are published and made accessible (via queries) using the universal concepts defined by the ontology [9]. In this way, data and metadata can be queried without knowing a priori the original structure or data format of the sources (see the sketch after this list).
2. Demonstrating the advantages of using real-world clinical data by focusing on high-quality published research that convincingly proves the benefits of data exchange (e.g. efficiency, robustness and security).
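A minimal sketch of querying data through ontology concepts is given below; the SPARQL endpoint, namespace, and class/property names are hypothetical placeholders, and the SPARQLWrapper package is assumed.

```python
# Query data via universal ontology concepts (URIs) instead of a local database schema.
# The endpoint URL, ontology namespace, and class/property names are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example-hospital.org/sparql")  # hypothetical endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX ex: <http://example.org/clinical-ontology#>
    SELECT ?patient ?age WHERE {
        ?patient a ex:Patient ;
                 ex:hasAge ?age .
    } LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["patient"]["value"], row["age"]["value"])
```

Because the query refers only to the ontology's universal concepts, the same request can be answered by any institution that has mapped its local data to those URIs, regardless of the underlying database structure.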
2.5 Conclusion
–– Data volume has been increasing so rapidly that it exceeds the capability of humans to analyze it. This data represents an almost unexplored source of potential information.
–– The term big (clinical) data refers not only to a large volume of data, but to a large volume of complex, unstructured and fragmented data coming from different sources.
–– Big clinical data are defined by the four ‘Vs’: volume, variety, velocity, and veracity.
–– Several issues limit the sharing and exchange of big clinical data: administrative, ethical, political, and technical barriers.
References
1. Lustberg T, van Soest J, Jochems A, Deist T, van Wijk Y, Walsh S, et al. Big data in radiation
therapy: challenges and opportunities. Br J Radiol. 2017;90(1069):20160689.
2. Chen AB. Comparative effectiveness research in radiation oncology: assessing technology.
Semin Radiat Oncol. 2014;24(1):25–34.
3. Rubin GD. Data explosion: the challenge of multidetector-row CT. Eur J Radiol. 2000
Nov;36(2):74–80.
4. Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. Hoboken: Wiley; 2002.
381 p. (Wiley series in probability and statistics).
5. Han J, Kamber M, Pei J. Data mining: concepts and techniques. San Francisco: Morgan
Kaufmann; 2011.
6. Skripcak T, Belka C, Bosch W, Brink C, Brunner T, Budach V, et al. Creating a data exchange
strategy for radiotherapy research: towards federated databases and anonymised public datas-
ets. Radiother Oncol. 2014;113(3):303–9.
7. Bechhofer S. OWL: web ontology language. In: Liu L, Özsu MT, editors. Encyclopedia of database systems [Internet]. Boston: Springer US; 2009. p. 2008–9. Available from: https://doi.org/10.1007/978-0-387-39940-9_1073.
8. Berners-Lee T, Hendler J. Publishing on the semantic web. Nature. 2001;410(6832):1023–4.
9. Traverso A, van Soest J, Wee L, Dekker A. The radiation oncology ontology (ROO): publishing
linked data in radiation oncology using semantic web and ontology techniques. Med Phys.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 3
Standards in Healthcare Data
3.1 Introduction
Our industrialised societies are heavily dependent on standards. That we can safely assume that electric plugs of a certain kind, independently of their manufacturer, fit into sockets of certain types and not into sockets of other types is just one example of how manufacturing is guided by standards. The benefit is obvious: complex technical artefacts can be assembled out of smaller components. Conformance to standards facilitates their exchange and substitutability, creates independence from manufacturers, eases competition and generates interoperability across borders. Standardisation of commodities and consumer goods makes them easier to compare, to categorise and, consequently, to trade. In addition, compliance with safety standards increases trust in the safe operation of components under predefined conditions. The authors of this chapter argue that standardisation is equally required for data in general and clinical data in particular, for which safety, exchangeability and interoperability are overriding aims, particularly with regard to the emerging field of data science.
There are many definitions of standards. Our approach is pragmatic and committed
to the view that standards are information artefacts developed in community-driven
consensus processes that specify uniform features, criteria, methods, processes and
S. Schulz (*)
Medical University of Graz, Institute for Medical Informatics, Statistics and Documentation,
Graz, Austria
Averbis GmbH, Freiburg, Germany
e-mail: [email protected]
R. Stegwee
CGI Netherlands, Health Unit, Rotterdam, The Netherlands
CEN Technical Committee 251 Health Informatics, Brussels, Belgium
C. Chronaki
HL7 Foundation, Brussels, Belgium