0% found this document useful (0 votes)
26 views23 pages

Spark Healthcare

Uploaded by

sabinaahmetaj815
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
26 views23 pages

Spark Healthcare

Uploaded by

sabinaahmetaj815
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

SCALPEL3: a scalable open-source library for healthcare claims

databases
Emmanuel Bacry1,2 , Stéphane Gaı̈ffas3 , Fanny Leroy4 , Maryan Morel2 , Dinh Phong
Nguyen2,4 , Youcef Sebiat2 , and Dian Sun2
1
CEREMADE, Université Paris-Dauphine, PSL, Paris, France
3
LPSM, Université de Paris, Paris, France
2
CMAP, Ecole polytechnique, Palaiseau, France
4
Caisse Nationale de l’Assurance Maladie, 75986 Paris Cedex 20, France
arXiv:1910.07045v2 [cs.DC] 26 Aug 2020

August 27, 2020

Abstract
Objective: This article introduces SCALPEL3 (SCAlable Pipeline for hEaLth data), a scalable open-
source framework for studies involving Large Observational Databases (LODs). It focuses on scal-
able medical concept extraction, easy interactive analysis, and helpers for data flow analysis to ac-
celerate studies performed on LODs.
Materials and methods: Inspired from web analytics, SCALPEL3 rely on distributed computing,
data denormalization and columnar storage. It was compared to the existing SAS-Oracle SNDS
infrastructure by performing several queries on a dataset containing a three years-long history of
healthcare claims of 13.7 million patients.
Results and Discussion: SCALPEL3 horizontal scalability allows handling large tasks quicker than
the existing infrastructure while it has comparable performance when using only a few executors.
SCALPEL3 provides a sharp interactive control of data processing through legible code, which helps
to build studies with full reproducibility, leading to improved maintainability and audit of studies
performed on LODs.
Conclusion: SCALPEL3 makes studies based on SNDS much easier and more scalable than the
existing framework [46]. It is now used at the agency collecting SNDS data, at the French Ministry
of Health and soon at the National Health Data Hub in France [10].
Keywords. Large observational database; Healthcare claims data; ETL; Scalability; Reproducibil-
ity; Interactive data manipulation

1 Introduction
In the past decade, the volume of healthcare data and its accessibility rose quickly. For instance, in
France, the SNDS claims database contained 86% of the French population in 2010 [45] to reach 98.8%
in 2015 [46] leading to one of the world’s largest health Large Observational Database (LOD) [46, 7].
The exhaustivity of LODs such as SNDS has proven useful for public health research, by improving the
statistical power of algorithms using this data and by mitigating the sensitivity to selection biases [46].
However, such an abundance of data comes at a cost: SNDS is a very complex database, with data
spread across hundreds of tables and columns. Its scale makes data manipulation non-trivial. More
importantly, using this data requires a tremendous amount of knowledge from SNDS experts. Many
coding or data recording subtleties, such as data duplication caused by administrative complexity, might

1
bewilder inexperienced users. Deriving proper health events definitions and extracting them accurately
is, therefore, a difficult task, having important consequences on the derived studies [46, 13]. These
issues are of course not unique to SNDS but shared by many LODs [25].
This paper proposes an answer to this problem by introducing SCALPEL3 (SCALable Pipeline
for hEaLth data), an open-source framework intending to reduce such entry barriers to LODs. This
framework attempts to simplify medical concept extraction by providing a set of tools performing batch
Extract-Transform-Load (ETL) tasks, while an interactive API eases the manipulation and the explo-
ration of longitudinal cohorts. Thus, this research focuses on the following objectives:

1. Design and implement a scalable tool allowing to extract and manipulate longitudinal patient data
from large observational databases;

2. Simplify methodological research by reducing SNDS data complexity and by easing data loading
into formats used by common machine learning libraries;

3. Foster reproducibility by monitoring the data flow and by following best practices for clean code;

4. Promote reusability and extensibility by documenting and open-sourcing SCALPEL3 implemen-


tation.

The main concepts used by SCALPEL3 and some related works are presented in Section 2. The LOD
for which SCALPEL3 was initially designed for is described in Section 3, together with SCALPEL3
methods and abstractions. The scalability of SCALPEL3 is evaluated in Section 4, while Section 5
discusses its strengths and limitations.

2 Background
LODs are not designed to perform medical research. Electronic Health Records (EHR) data directly
supports clinical care and are used to justify care billing and reimbursement, while claims data are pri-
marily used for reimbursement purpose. The data models and terminologies used in such databases were
optimized to suit these particular goals, resulting in normalized data models built around hospital stays,
transactions, or cash flows [46]. Extracting meaningful patients care pathways from such data can be
decomposed into two tasks. First, all the data corresponding to a set of patients need to be identified and
collected. When the data is not normalized around the patients, this task requires several join operations
which can be very costly in terms of computations as the data volume increases. Second, medical con-
cepts have to be properly identified from administrative codes: this phenotyping task relies heavily on
a combination of medical and database knowledge. The algorithms used to perform concept extraction
from administrative data are either disclosed through scientific publications or shared as lengthy SQL
queries [24]. Their code or the description of the algorithms involved can vary in quality, hindering
reuse, and reproducibility. As a result, building a study from scratch might be faster than reusing poorly
documented code from previous works [35, 24]. Besides, access to LODs such as SNDS might rely on
proprietary software such as SAS [40] or SPSS [44]. While these tools are suitable to produce public
health studies, they hinder methodological research as they do not interact easily with R or Python pack-
ages that implement state-of-the-art machine learning algorithms. All of these challenges are complex
to solve and exacerbated by the data volume at hand.

Related works
Several research programs produce tools in order to alleviate some of these issues. An important research
effort aims at easing data integration and interoperability by producing standard data models and termi-
nologies to be shared across institutions. Observational Medical Outcomes Partnership Common Data

2
Model (OMOP CDM), which is supported by the Observational Health Data Sciences and Informatics
(OHDSI) research program [17], and the Informatics for Integrating Biology & the Bedside (i2b2) data
model [28], can be considered as the most pervasive data models developed for this purpose. OMOP
CDM can be used to standardize EHR or claims data, while i2b2 is focused on EHR data.
Both models are centered around the patients, thus reducing the number of join operations required
to access a specific patient history. They also rely on a normalized data model combined with SQL
databases. A collection of open-source software has been developed on top of these models, imple-
menting analytics or visualization tools [18]. These softwares can take the form of R libraries [18], or
compiled Java [41] programs with a graphical user interface. While making these softwares freely avail-
able is an important step to foster methodological research, they do not seem to be easily extensible or
interoperable as they do not provide documented APIs to build new software upon it. Besides, the pro-
cess of transforming an existing database in order to conform to such standards is costly, as it requires to
build complex mappings between shared representations expressed through highly heterogeneous codes
from one information system to the other. In the case of the SNDS database considered in this work,
such a mapping is still work in progress [12].
In other fields, web-scale analytics have shifted from the use of normalized SQL databases towards
NoSQL technologies relying on distributed computing, denormalization, and columnar storage. The
use of distributed computing allowed gains in computational power using low cost, commodity servers
instead of expensive dedicated hardware [8]. A work from OHDSI [37] compared the ACHILLES soft-
ware (R [38], PostgreSQL [36]) with Apache Spark [49] using common SQL requests. They observed
performance gains for Spark even on a single server or small clusters, at the exception of requests leading
to large network I/O, since such operations are known to be the slowest operations in a distributed com-
puting framework because of network latencies and throughput. It can create bottlenecks when many
data chunks are sent across the servers in the cluster to perform a join or a groupby operation (leading
to so-called shuffles). Denormalization can be a way to circumvent this issue by performing a set of join
operations beforehand, once and for all [48, 23, 11], reducing join operations to simple look-ups over a
very large table. The data duplication resulting from such joins operations might lead to storage issues,
which can be mitigated with the help of columnar storage formats [23, 26] using compression strategies.
To the best of our knowledge, such an approach has not been implemented to perform ETL on large
health databases. Prior works are either relying on SQL and normalized schemas [19, 31] or applied to
small datasets [14]. This paper describes and implements such an approach for large health databases,
as explained in the next section.

3 Material and Methods


This work focuses on (i) denormalizing the data in combination with columnar storage and distributed
computing to perform concept extraction, (ii) providing a structured and re-usable concept library, and
(iii) introduce useful abstractions to handle cohort data. Scalability issues are handled by (i), while (ii)
and (iii) foster the reuse of code and knowledge across studies. This is achieved by reducing both study-
specific code and database entry barriers by providing ready-to-use concepts. SCALPEL3 provides
Scala [30] and Python APIs to ensure easy extension and interoperability with numerous libraries. All
the code supporting this paper is open source and freely available.
This paper is not about data integration from disparate sources, such as multiple EHR systems, but
rather about an ETL based on batch distributed processing of a large, centralized claims database.

3.1 The SNDS database


This work was performed using the Système National des Données de Santé (SNDS), a large claims
database containing pseudonymized data on 98.8% of the French population (66 million patients in

3
2015) [46, 7]. It contains time-stamped information about medical events leading to reimbursement
(see Table 1 in [46] for an exhaustive list of available data) in the last 3 years1 . It contains more than
20 billion health events per year, representing roughly 70TB of data.
SNDS is composed of multiple “sub-databases”, each one with a star schema. The central table
records events leading to cash flows that need to be joined to many other tables to access medical
information2 . In this form, retrieving patient information for statistical studies is very costly in terms
of computation and expert knowledge: targeted data can be spread across multiple databases, tens of
tables, and hundreds of columns, and its identification requires a deep administrative knowledge of the
French health-care reimbursement mechanisms. Mitigating these issues is precisely the motivation of
the SCALPEL3 framework.

3.2 SCALPEL3: a SCAlable Pipeline for hEaLth data


SCALPEL3 is based on Apache Spark [49], a robust and widely adopted distributed in-memory com-
putation framework. Spark provides a powerful SQL-like high-level API and a more granular API to
perform data operations. It can be coupled with the Hadoop File System (HDFS) [43] replication system
to accelerate large files reading and distribution over a computing cluster. SCALPEL3 is an open-source
framework organized in the following three components.
SCALPEL-Flattening [22] denormalizes the data “once and for all” to avoid joining many tables each
time the data of a patient is accessed. Its input is a set of CSV files extracted from the original SNDS
database.
SCALPEL-Extraction [34] defines concepts extractors that process the denormalized data and trans-
formers, that compute more complex events based on extractors output. For example, extractors can
fetch all drug dispenses or medical acts.
SCALPEL-Analysis [42] implements powerful and scalable abstractions that can be used for data anal-
ysis, such as easy ways to investigate data quality issues. It can load data into formats commonly used
in machine learning, such as TensorFlow or PyTorch tensors or NumPy arrays.
As SCALPEL-Flattening and SCALPEL-Extraction perform batch operations, they need to read
(resp. write) input (resp. output) data from the file-system (local or HDFS). They are implemented in
Scala in order to access Spark’s low-level API and take advantage of functional programming and static
typing, resulting in rigorous automated testing (94% of the Scala code is covered by unit tests). Both can
be configured through textual configuration files or be used as libraries. SCALPEL-Analysis is a python
module implemented in Python/PySpark and designed for interactive use. It can be used in a Jupyter
notebook [21] for instance. This workflow is illustrated in Figure 1.

3.3 SCALPEL-Flattening: denormalization of the data


As mentioned earlier, performing data analysis on SNDS patients’ health requires many joins and can
consequently be extremely slow. To circumvent this issue, the data are denormalized by joining the
tables sequentially to obtain a big table in which each line corresponds to a patient identifier and a wide
representation of an event.
Denormalizing a star-schema database results in a really big table due to values replications. To
circumvent storage and computation issues, the denormalized data is stored in Parquet [3] files, an
1
which can be extended up to 20 years under some restrictions.
2
We work with two main sub-databases containing data relevant for public-health research. When working on drug safety
studies, each of these two databases contains 8 relevant tables, representing approximately 5 billion lines per year when
restricted to 65+ y.o. subjects.

4
Scala API (batch)
SNDS
Python API (interactive) (csv files)

read

configuration file SCALPEL-Flattening

write

Denormalized SNDS
(Parquet files)

read

SCALPEL-Extraction
configuration file
Patient, Event, Extractor, Transformer

write

CohortCollection*
(Patients and events datasets, Parquet files)

read

SCALPEL-Analysis
Cohort, CohortCollection, CohortFlow, FeatureDriver

Interactive Machine Learning


(TensorFlow or Pytorch Tensor,numpy ndarray)

Figure 1: SCALPEL3 workflow. SCALPEL3 is made of three independent open-source libraries


plugged one after another. SCALPEL-Flattening, which is implemented in Scala/Spark, denormalizes
the input database exported as CSV or Parquet files into a single big flat database. Then, SCALPEL-
Extraction, implemented in Scala/Spark, extracts concepts from this flat database. Finally, SCALPEL-
Analysis, implemented in Python/PySpark loads extracted concepts to perform in-memory interactive
analysis and feed machine learning algorithms.

5
open-source columnar storage format implementing Google’s Dremel [26] data model. Parquet is well-
integrated in the Spark ecosystem [4], allowing us to take advantage of the columnar storage in terms
of data compression and query optimization. SCALPEL-Flattening first converts the input CSV files
containing exports of SNDS tables to Parquet files. Then, it recursively performs left joins with these
tables, starting with the central table. Finally, it writes the results in a single Parquet file. To ensure the
scalability of these big join operations, the input data can be automatically divided with respect to some
time unit (such as years, months) before performing the join operations. In this case, the joins results
are sequentially appended to the output parquet file. These operations are repeated for each SNDS
sub-databases. The size of the temporal slicing used in the joins, the schema, and the joining keys can
be tuned by the end-user through a configuration file, which defaults to the denormalization of tables
containing only medical data (as opposed to econometric and administrative data). A set of statistics that
monitors the denormalization process is automatically computed along the steps involved in it, in order
to ensure that no loss of information occurs.

3.4 SCALPEL-Extraction: extraction of concepts


SCALPEL-Extraction provides fast extractions of medical concepts from the denormalized tables pro-
duced by SCALPEL-Flattening. By providing ready-to-use medical events, SCALPEL-extraction en-
capsulates SNDS technical knowledge but keeps medical data as raw as possible, so that end-users have
access to fine-grained data which is critical when designing observational studies [47, 16]. The extracted
concepts are organized around two abstractions: Patient and Event.
The Patient abstraction has a unique patientID, a gender, a birthDate and eventually a
deathDate.
The Event abstraction allows to represent any event associated to a patient. It can be punctual (e.g.,
medical act) or continuous (e.g., hospitalization).
All concepts are automatically extracted into Patient or Event objects by a set of Extractors
and Transformers, designed to fetch the data in the relevant tables and columns of the SNDS
Sources.
The Extractor abstraction maps a Row of a Source to zero or many Events:

Extractor: Row 7→ List[Event].

Extractors successively refines data from the input (wide denormalized tables) by (1) identifying the
relevant columns, (2) filtering out null values according to some columns and (3) conform the extracted
data to a standardized schema. These three operations are very fast when performed on columnar data,
as they exploit sparsity (null values are not represented in the data) and consist in simple look-ups over
hash tables containing columns metadata. An optional step that filters rows by value can occur before
step (3). This operation is slower as it manipulates row values, but since it is performed near the end of
the extraction process, it typically occurs on small data. This process is illustrated Figure 2.
Many extractors are available to fetch medical acts, diagnoses, hospital stays, among others, an ex-
ample being the drug dispense Extractor which allows extracting events related to specific subsets of
drugs and to output events at multiple levels of granularity (drug, molecule, ATC class, custom classes)
as defined in a configuration file. This simple architecture makes it easy to add new Extractors and
to answer to any extraction need.
The Transformer abstraction transforms a collection of Events related to a unique Patient into
a list of more complex Events (complex diseases, drug exposures, . . . ):

Transformer: List[Event] 7→ List[Event].

6
Wide denormalized table

get_input
Step 1: projection to fetch relevant columns
columnar operation (instantaneous)

lossy data compression


null values

Step 2: filter out null values


filter_null
leverages column sparsity (fast)

filter_values optional step: filter rows based on their values


(row operation ; slow)

build_event Step 3: output converted to Event schema


columnar operation (fast)
write Parquet file

Figure 2: Extractor design. Extractors implemented in SCALPEL-Extraction successively re-


fines the input table (a large denormalized table) by taking advantage of fast columnar operations to
produce ready-to-use medical events. Step 1 selects the relevant columns (equivalent to a hash table
look-up) while Step 2 removes rows where null values are detected in specific columns, taking advan-
tage of the sparsity of columnar representation (null values are not encoded in the data). Optionally, this
extraction process filters out rows based on their values. Finally, Step 3 conforms the data to the Event
schema, and is written to a Parquet file.

7
A Transformer is based on specific algorithms requiring multidisciplinary knowledge from epi-
demiologists, statisticians, clinicians, physicians, and SNDS experts [46]. Transformers usually
combine events built by Extractors to build more complex events, such as computing drug expo-
sures from timestamped drug dispenses. Extractors and Transformers can be used through a
Scala API or controlled using a textual configuration file. Many Transformers used in several stud-
ies such as [27, 29] are implemented and ready to use.
Besides Parquet files containing extracted events, SCALPEL-Extraction outputs metadata tracking
the data used to build each type of extracted events. This file can be leveraged by SCALPEL-Analysis
to build Cohorts and flowcharts, as explained below.

3.5 SCALPEL-Analysis: interactive manipulation and analysis of cohorts


While SCALPEL-Flattening and SCALPEL-Extraction are implemented in Scala/Spark for performance
and maintainability, SCALPEL-Analysis is implemented in Python/PySpark [49] since it is designed for
interactive environments, such as Jupyter notebooks [21]. SCALPEL-Analysis eases the manipulation
and analysis of cohort data. It is based on the following abstractions:
The Cohort abstraction is a set of Patients and their associated Events in a [startDate,
endDate] time-window. Basic operations such as union, intersection, and difference can be performed
between Cohorts, while a human-readable description is automatically updated in the results. More
granular control is kept available through accesses to the underlying Spark DataFrames (using Spark
DataFrame API). This combination allows easy data engineering and fine-grained, yet reproducible,
experiments.
The CohortCollection abstraction is a collection of Cohorts on which operations can be jointly
performed. The CohortCollection has metadata that keeps the information about each Cohort,
such as the successive operations performed on it, the Parquet files they are stored in and a git commit
hash of the code producing the extraction from the Source.
International guidelines [5] regarding studies based on LODs insist on the explanation of cohort con-
struction to highlight eventual population biases, motivating the following CohortFlow abstraction.
The CohortFlow abstraction is an ordered iterator defined as the following left fold operation
foldl(c : CohortCollection, ∩) := (((c0 ∩ c1 ) ∩ c2 ) ∩ . . . cn )
assuming an input CohortCollection c of length n, where ∩ denotes an intersection of the Cohorts’
patients. It is meant to track the stages leading to a final Cohort, where each intermediate Cohort is
stored along with textual information about the filtering rules used to go from each stage to the next one.
The scalpel.stats module produces descriptive statistics on a Cohort and their associated plots.
For now, it contains more than 25 Patient-centric or Event-centric statistics, adding a custom one
being very easy. Among other things, this module provides automatic reporting as text or graphical dis-
plays, with performance optimization through data caching. It can be combined with CohortFlow to
compute various statistics at each analysis stage, to assess the biases induced along with successive pop-
ulation filtering operations. Flowcharts can easily be produced to track how many subjects were removed
at each stage. Flowcharts can be produced either from a CohortFlow, or the metadata tracking the
data extraction process produced by SCALPEL-Extraction. Examples are provided in Supplementary
Material.
SCALPEL-Analysis also provides tools producing datasets in formats compatible with popular ma-
chine learning libraries. At the core of these tools is the FeatureDriver abstraction.
The FeatureDriver abstraction is used to transform Cohorts into data formats suitable for ma-
chine learning algorithms, such as numpy.ndarray [20], tensorflow.tensor [1] and pytorch.tensor [33].

8
It is mainly a transformation of a Spark dataframe representation into a tensor-based format. FeatureDrivers
perform several sanity checks, such as time-zone and event dates consistency, and can be easily extended
by end-users, thanks to the PySpark API.

4 Results
Scaling experiments presented in this section were performed on a SNDS subset containing 13.7 mil-
lion patients followed up to three years described in Table 1. Data from this sample is structured data

Count DCIR PMSI-MCO


Rows in the central table 10,579,545,716 35,375,046
Rows in the denormalized table 10,636,094,654 3,208,682,967
Patients 13,762,623 7,807,517
Drug reimbursements events 1,933,985,925 NA
Distinct drug codes 16,289 NA
Reimbursed medical acts events 210,847,422 97,484,303
Distinct medical acts codes 7254 7591
Diagnoses events NA 120,212,253
Distinct diagnoses codes NA 16,895
Source data set disk size (CSV, GB) 6,416.3 48.7
Source data set disk size (Parquet, GB) 572.7 5.9
Flattened data set disk size (Parquet, GB) 690.6 8.9

Table 1: Characteristics of the dataset used for experiments. Results are produced on a subset of SNDS
containing 13.7 million subjects, followed up to three years. The scope is restricted to outpatient data
(DCIR) and inpatient data excepted hospitalization at home, rehabilitation centers and psychiatric hos-
pitals (PMSI-MCO). The central fact table of DCIR records cash flows resulting from healthcare re-
imbursements to patients covered by the French national healthcare insurance. One line in this table
correspond to one cash flow (such as the reimbursement of a drug bought following a prescription). The
central fact table of PMSI-MCO records hospital stays. Events occurring during the stay are stored in
dimension tables linked to this central table.

containing common data types (timestamps, integers, floats, small strings), normalized according to the
SNDS data model. The testing data consisted in outpatient data (DCIR) and inpatient data excepted
home hospitalization, rehabilitation centers and psychiatric hospitals (PMSI-MCO). Raw data was ex-
tracted from the SNDS by CNAM, the French agency that manages this database. Extracts were dumped
on the testing cluster as a set of CSV files.
SCALPEL3 was tested on a Mesos [15] cluster of commodity servers with 14 worker nodes driven
by 4 master nodes. Worker nodes resources amount to 224 2.4Ghz logical cores, 1.7Tb of RAM, and
448Tb of storage distributed over 88 spinning hard drives. These resources are shared over the cluster
by HDFS [43] for data storage and by Spark for memory storage and computations. This cluster and the
configuration of the jobs were not fine-tuned for the usage of SCALPEL3, but follow standard guidelines
for cluster configuration for distributed computing with Spark.
Denormalizing this dataset using SCALPEL-Flattening took about 6 hours using the 14 worker
nodes. During the conversion of CSV tables to parquet files, worker nodes CPU and memory usage
are maxed out on most worker nodes. During the join operations, resource usage is first dominated by
network I/O to shuffle the data across the workers, followed by an increase in CPU and memory usage
reaching two-thirds of the cluster capacity. Note that the current framework used for SNDS data cannot
handle such denormalization so that there is no element of comparison for SCALPEL-Flattening with it.

9
SCALPEL-Extraction was evaluated on the following extraction tasks, that correspond to typical
events required for public health research studying relations between fractures and some drug expo-
sures: (a) extraction of patient demographics (gender, age, eventual date of death), (b) extraction of drug
dispenses, (c) filtering of patients w.r.t their first date of drug use (prevalent drug users, 65 drugs), (d)
computation of drug exposures based on drug dispenses dates, (e) extraction of reimbursed medical acts,
(f) extraction of diagnoses, (g) identification of fractures using the algorithm described in [9] based on
medical acts and diagnoses.
Indicative baseline performance was established by executing similar queries on the current SNDS
infrastructure, based on SAS Enterprise Guide for analytics [40], connected to an Oracle SQL database
hosted on Oracle Exadata servers [32]. This baseline performance was computed with a single run, as
the current SNDS framework is designed to allocate resources dynamically each time a new query is
submitted. The monitoring of resource usage on this SAS-Oracle infrastructure is not straightforward,
since computations are divided between SAS and Oracle jobs, and since the resources of the Oracle
Exadata infrastructure are divided across servers focused on storage or computation. At peak use (for
task (c)), the Oracle job was using 10 CPUs supported by 4.9GB of PGA memory, while SAS was using
1 to 6GB of RAM.
An assessment of the horizontal scaling of SCALPEL3 is performed by varying the number of
executors (4 logical cores and 25 GB RAM) to perform these queries. All the results are displayed in
Figure 3.
SCALPEL-Analysis aims at providing useful abstractions to ease cohort data manipulation. We
provide in Supplementary Material, see Section A herein, examples that illustrate how these abstractions
can be leveraged to perform typical data preparation in a few lines of code.

5 Discussion
SCALPEL-Extraction reaches performances similar to SQL-SAS based SNDS framework when using
6 executors (Figure 3(h)). It is consistently faster on tasks involving large data volumes or complex
operations such as tasks (b), (c), (d), and (g). On the other hand, tasks involving the PMSI-MCO database
(tasks (e) and (f)) exhibit poor performance. This is rooted in the flat table structure as PMSI-MCO is not
sparse-by-block like DCIR (see the difference in the ratio of Rows in central table w.r.t. denormalized
table in Table 1). It results in performing more tests on row values and data shuffle than necessary when
performing queries on PMSI-MCO. Performance on these tasks could be further improved by slightly
modifying the join strategy in the flattening step to ensure PMSI-MCO sparsity by block.
The cost of data denormalization should be considered to be fixed as this operation is done once and
for all. The denormalized data can then be updated incrementally when new data are fed into the cluster
(typically a few times a year).
SCALPEL-Extraction scales almost linearly from 4 to 16 executors. The scaling gains then slow
down, reaching peak performance at 28 executors (see Figure 3). These diminishing returns can be
caused by the cluster resource sharing between storage services (HDFS) and computation (SCALPEL3).
As a result, SCALPEL3 resource usage can be in conflict with HDFS resources as soon as the number of
nodes used by SCALPEL3 excess one-third of the cluster3 . Splitting the cluster nodes between storage
nodes and computation nodes could improve horizontal scalability. Note that for very small tasks (such
as (c), (d), (g)), runtime is dominated by I/O operations and do not benefit particularly from additional
CPUs.
Besides performance considerations, note that SCALPEL3 uses only open-source, free software and
runs on commodity hardware, which is likely cheaper than Oracle Exadata servers and easier to scale if
3
HDFS is configured to replicate the data across the worker nodes three times; HDFS performance is thus not much
impacted if one-third of the nodes are not available at some point.

10
(a) Patients demographics (b) Drug dispenses
Job duration (minutes)
27 27 SAS SQL-Oracle
SCALPEL3
26 Linear scaling
25
25
24 23
23 21
22 23 24 25 22 23 24 25
(c) Prevalent drug users (d) Drug exposures
20
Job duration (minutes)

22
2 1
21
2 2
20
2 3
21
2 4
22
22 23 24 25 22 23 24 25
(e) Medical acts (f) Diagnoses
26
Job duration (minutes)

26
25
24
24
22
23
22 23 24 25 22 23 24 25
(g) Fractures (h) All tasks
Job duration (minutes)

20
28
21
22 27
23
24 26

22 23 24 25 22 23 24 25
Number of executors Number of executors
Figure 3: SCALPEL-Extraction scaling experiments. The blue solid line represents the mean total
running time (in seconds) of queries (a)–(g) described in Section 4 when varying the number of worker
nodes used to perform the computation. Figure (h) represents the total running time of the (a)–(h)
queries. Light blue bands represent one standard deviation computed over 5 runs. The dotted line
corresponds to a theoretical performance assuming a perfect horizontal linear scaling (based on the
single node performance). Dashed lines represent the runtime of similar queries on the SNDS SAS-
Oracle infrastructure using a single run. Multiple runs were not performed on SAS-Oracle as computing
resources are dynamically allocated for each queries and cannot be set beforehand.

11
the data volume increases: a Spark cluster easily scales “horizontally” by adding more nodes.
The performance comparison between the two infrastructures is limited by (i) the impossibility to
set the resources used by SAS-Oracle beforehand for these experiments does not allow for multiple runs
and (ii) slight differences in query implementation caused by design differences such as columnar vs
row orientation. Nonetheless, it shows that SCALPEL3 can be used as a viable open-source alternative
running on commodity hardware while benefiting from horizontal scaling on very large jobs.
Besides, SCALPEL3 greatly improves the maintainability, audit, and reproducibility of studies using
SNDS. First, continuous integration of code updates and large code coverage (94%) with unit testing is a
big improvement in terms of maintainability over copy-pasted SQL snippets. Secondly, SNDS expertise
encapsulation for events extraction is fully tested and maintained in SCALPEL3, so it eases extraction
algorithms reuse for studies and lowers the entry-barrier to SNDS. Obviously, design and maintenance
of SNDS concept extractors by a team of developers and SNDS specialists is a mandatory task, as the
database contents are constantly evolving. Moreover, the relevance of extracted data (to answer a trade
issue) requires some SNDS knowledge and is the responsibility of the user.
The combination of expert knowledge encapsulation (SCALPEL-Extraction) and interactive cohort
manipulation (SCALPEL-Analysis) results in smaller and more readable user-code, leading to eas-
ily shared and reproducible studies, supported by data tracking and automated audit reports. Finally,
SCALPEL3 allows producing datasets compatible with several Python machine learning libraries for-
mats, fostering methodological research on SNDS data, which was not possible with the proprietary
software that is currently used.
The choice of the Python language might help SCALPEL3 adoption among the data science and ma-
chine learning community, while it might hinder its use among public health researchers who are tradi-
tionally using proprietary statistical softwares or the R language. SCALPEL3 can be used in standalone
mode4 or in distributed mode5 when working on large datasets. The knowledge and skills required to
manage a computing cluster are not yet widespread which could also impede a large adoption of the
distributed mode among small organizations.
Finally, while SCALPEL3 does not support international data standards yet, the development of
vocabulary mapping tables in France was anticipated so as to ease future support of data standards such
as OMOP-CDM [39] or FHIR [6] to SCALPEL3.

6 Conclusion
SCALPEL3 could be further improved by optimizing the flattening step, so as to ensure optimal block-
sparsity of the resulting denormalized databases automatically. Besides, optimizing the cluster design to
separate storage from computation as well as using YARN instead of Mesos to manage resources could
help to improve its performance further by lowering data access times. Finally, using Apache ORC [2]
instead of Parquet could also lead to further performance improvements. Parquet was initially chosen
over ORC because of better integration with Spark. ORC is now well-integrated in it and has been
reported to have better performances and a higher compression factor on non-nested data.

7 Acknowledgments
We thank the engineers who worked on this project at some point: Firas Ben Sassi, Prosper Burq, Philip
Deegan, Daniel De Paula e Silva, Angel Francisco Orta, Xristos Giatsidis, Sathiya Prabhu Kumar.
We also thank the people from CNAM or Polytechnique who were or are currently involved in the
Polytechnique-CNAM partnership, namely, for CNAM : Muhammad Abdallah, Aurélie Bannay, Hélène
4
Using a single large server.
5
Using a computing cluster.

12
Caillol, Anthony Du, Sébastien Dumontier, Mehdi Gabbas, Claude Gissot, Moussa Laanani, Mickaël
Lechapelier, Clémence Martin, Anke Neumann, Cédric Pulrulczyk, Jérémie Rudant, Omar Sow, Kévin
Vu Saintonge, Alain Weill, and for Polytechnique: Qing Chen, Agathe Guilloux, Anastasiia Nitavskyi,
Yiyang Yu.

References
[1] Martı́n Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew
Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath
Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike
Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent
Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,
Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on
heterogeneous systems, 2015. Software available from tensorflow.org.

[2] ORC Apache. Apache orc: High-performance columnar storage for hadoop, 2015.

[3] Apache Parquet, 2015.

[4] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xi-
angrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. Spark SQL:
Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Con-
ference on Management of Data, SIGMOD ’15, pages 1383–1394, New York, NY, USA, 2015.
ACM.

[5] Eric I Benchimol, Liam Smeeth, Astrid Guttmann, Katie Harron, David Moher, Irene Petersen,
Henrik T Sørensen, Erik von Elm, Sinéad M Langan, RECORD Working Committee, et al. The re-
porting of studies conducted using observational routinely-collected health data (RECORD) state-
ment. PLoS medicine, 12(10):e1001885, 2015.

[6] Duane Bender and Kamran Sartipi. HL7 FHIR: An agile and RESTful approach to healthcare
information exchange. In Proceedings of CBMS 2013 - 26th IEEE International Symposium on
Computer-Based Medical Systems, pages 326–331. IEEE, jun 2013.

[7] Julien Bezin, Mai Duong, Régis Lassalle, Cécile Droz, Antoine Pariente, Patrick Blin, and
Nicholas Moore. The national healthcare system claims databases in france, SNIIRAM and EGB:
Powerful tools for pharmacoepidemiology. Pharmacoepidemiology and Drug Safety, 26(8):954–
962, aug 2017.

[8] Stephen Bonner, Ibad Kureshi, John Brennan, and Georgios Theodoropoulos. Exploring the evolu-
tion of big data technologies. In Software Architecture for Big Data and the Cloud, pages 253–283.
Elsevier, 2017.

[9] Benjamin Bouyer, Fanny Leroy, Jérémie Rudant, Alain Weill, and Joël Coste. Burden of fractures
in France: incidence and severity by age, gender, and site in 2016. International Orthopaedics,
February 2020.

[10] Marc Cuggia, Dominique Polton, Gilles Wainrib, and Stéphanie Combes. Health Data Hub: mis-
sion de préfiguration. Technical report, Ministère des Solidarités et de la Santé, 10 2018. In
French.

13
[11] Khaled Dehdouh, Fadila Bentayeb, Omar Boussaid, and Nadia Kabachi. Using the column ori-
ented nosql model for implementing big data warehouses. In Proceedings of the International
Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), page
469. The Steering Committee of The World Congress in Computer Science, Computer . . . , 2015.

[12] M Doutreligne, D-P Nguyen, A Parot, A Lamer, and N Paris. Alignement à grande échelle du
système des données de santé vers le modèle commun de données omop. Revue d’Épidémiologie
et de Santé Publique, 68:S37, 2020.

[13] Richard A. Hansen, Michael D. Gray, Brent I. Fox, Joshua C. Hollingsworth, Juan Gao, and Peng
Zeng. How well do various health outcome definitions identify appropriate cases in observational
studies. Drug Safety, 36(SUPPL.1):27–32, oct 2013.

[14] Steve Harris, Sinan Shi, David Brealey, Niall S. MacCallum, Spiros Denaxas, David Perez-Suarez,
Ari Ercole, Peter Watkinson, Andrew Jones, Simon Ashworth, Richard Beale, Duncan Young,
Stephen Brett, and Mervyn Singer. Critical Care Health Informatics Collaborative (CCHIC): Data,
tools and methods for reproducible research: A multi-centre UK intensive care database. Interna-
tional Journal of Medical Informatics, 112:82–89, apr 2018.

[15] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D Joseph, Randy H
Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-grained resource sharing in the
data center. In NSDI, volume 11, pages 22–22, 2011.

[16] Na Hong, Ning Zhang, Huawei Wu, Shanshan Lu, Yue Yu, Li Hou, Yinying Lu, Hongfang Liu,
and Guoqian Jiang. Preliminary exploration of survival analysis using the OHDSI common data
model: a case study of intrahepatic cholangiocarcinoma. BMC Medical Informatics and Decision
Making, 18(S5):116, dec 2018.

[17] George Hripcsak, Jon D Duke, Nigam H Shah, Christian G Reich, Vojtech Huser, Martijn J
Schuemie, Marc A Suchard, Rae Woong Park, Ian Chi Kei Wong, Peter R Rijnbeek, et al. Observa-
tional health data sciences and informatics (OHDSI): opportunities for observational researchers.
Studies in health technology and informatics, 216:574, 2015.

[18] Vojtech Huser, Frank J DeFalco, Martijn Schuemie, Patrick B Ryan, Ning Shang, Mark Velez,
Rae Woong Park, Richard D Boyce, Jon Duke, Ritu Khare, et al. Multisite evaluation of a data
quality tool for patient-level clinical data sets. eGEMs, 4(1), 2016.

[19] Anne-Sophie Jannot, Eric Zapletal, Paul Avillach, Marie-France Mamzer, Anita Burgun, and
Patrice Degoulet. The georges pompidou university hospital clinical data warehouse: a 8-years
follow-up experience. International journal of medical informatics, 102:21–28, 2017.

[20] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python,
2001–.

[21] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Busson-
nier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov,
Damián Avila, Safia Abdalla, and Carol Willing. Jupyter notebooks – a publishing format for
reproducible computational workflows. In F. Loizides and B. Schmidt, editors, Positioning and
Power in Academic Publishing: Players, Agents and Agendas, pages 87 – 90. IOS Press, 2016.

[22] Sathiya P. Kumar, Youcef Sebiat, Firas Ben Sassi, Dian Sun, Daniel Paula e Silva, and Prosper
Burq. SCALPEL-Flattening, 2019.

14
[23] Yinan Li and Jignesh M Patel. Widetable: An accelerator for analytical data processing. Proceed-
ings of the VLDB Endowment, 7(10):907–918, 2014.

[24] Vincent Looten. Are studies of claims databases reproducible? the hypothesis of an instituted
ethical misconduct in public health. Medecine sciences, 35(8-9):689–692, 2019.

[25] David Madigan, Paul E. Stang, Jesse A. Berlin, Martijn Schuemie, J. Marc Overhage, Marc A.
Suchard, Bill Dumouchel, Abraham G. Hartzema, and Patrick B. Ryan. A systematic statistical
approach to evaluating evidence from observational studies. Annual Review of Statistics and Its
Application, 1(1):11–39, 2014.

[26] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton,
and Theo Vassilakis. Dremel: Interactive analysis of web-scale datasets. Proc. VLDB Endow., 3(1-
2):330–339, September 2010.

[27] Maryan Morel, Emmanuel Bacry, Stéphane Gaı̈ffas, Agathe Guilloux, and Fanny Leroy. Con-
vSCCS: convolutional self-controlled case series model for lagged adverse event detection. Bio-
statistics, 2019.

[28] Shawn N Murphy, Griffin Weber, Michael Mendis, Vivian Gainer, Henry C Chueh, Susanne
Churchill, and Isaac Kohane. Serving the enterprise and beyond with informatics for integrat-
ing biology and the bedside (i2b2). Journal of the American Medical Informatics Association,
17(2):124–130, 2010.

[29] A Neumann, A Weill, P Ricordeau, JP Fagot, F Alla, and H Allemand. Pioglitazone and risk of
bladder cancer among diabetic patients in france: a population-based cohort study. Diabetologia,
55(7):1953–1962, 2012.

[30] Martin Odersky, Philippe Altherr, Vincent Cremet, Burak Emir, Sebastian Maneth, Stéphane Mich-
eloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, and Matthias Zenger. An overview of the
Scala programming language. Technical report, École Polytechnique Fédérale de Lausanne, 2004.

[31] Toan C Ong, Michael G Kahn, Bethany M Kwan, Traci Yamashita, Elias Brandt, Patrick
Hosokawa, Chris Uhrich, and Lisa M Schilling. Dynamic-etl: a hybrid approach for health data
extraction, transformation and loading. BMC medical informatics and decision making, 17(1):134,
2017.

[32] Exadata Database Machine — Oracle, 2008.

[33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in Py-
Torch, 2017.

[34] Daniel Paula e Silva, Youcef Sebiat, Sathiya Prabhu Kumar, Firas Ben Sassi, Prosper Burq, Dian
Sun, Maryan Morel, Kevin Vu Saintonge, and Philip Deegan. SCALPEL-Extraction, 2019.

[35] Roger D Peng, Francesca Dominici, and Scott L Zeger. Reproducible epidemiologic research.
American journal of epidemiology, 163(9):783–789, 2006.

[36] Behandelt PostgreSQL. Postgresql. Web resource: http://www. PostgreSQL. org/about, 1996.

[37] Joshua Powers. Apache spark performance compared to a traditional relational database using
open source big data health software. 2016.

15
[38] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria, 2017.

[39] Stephanie J Reisinger, Patrick B Ryan, Donald J O’Hara, Gregory E Powell, Jeffery L Painter,
Edward N Pattishall, and Jonathan A Morris. Development and evaluation of a common data
model enabling active drug safety surveillance using disparate healthcare databases. Journal of the
American Medical Informatics Association, 17(6):652–662, 2010.

[40] SAS Enterprise Guide, 1968.

[41] M. J. Schuemie and M. Moinat. WhiteRabbit, 2014.

[42] Youcef Sebiat, Maryan Morel, Dian Sun, and Dinh Phong Nguyen. SCALPEL-Analysis, 2019.

[43] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed
file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Tech-
nologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society.

[44] SPSS Statistical software, 1976.

[45] P. Tuppin, L. de Roquefeuil, A. Weill, P. Ricordeau, and Y. Merlière. French national health
insurance information system and the permanent beneficiaries sample. Revue d’Épidémiologie et
de Santé Publique, 58(4):286 – 290, 2010.

[46] P. Tuppin, J. Rudant, P. Constantinou, C. Gastaldi-Ménager, A. Rachas, L. de Roquefeuil,


G. Maura, H. Caillol, A. Tajahmady, J. Coste, C. Gissot, A. Weill, and A. Fagot-Campagna.
Value of a national administrative database to guide public decisions: From the système national
d’information interrégimes de l’assurance maladie (SNIIRAM) to the système national des données
de santé (SNDS) in france. Revue d’Épidémiologie et de Santé Publique, 65:S149 – S167, 2017.
Réseau REDSIAM.

[47] S. V. Wang, P Verpillat, J. A. Rassen, A Patrick, E. M. Garry, and D. B. Bartels. Transparency and
reproducibility of observational cohort studies using large healthcare databases. Clinical Pharma-
cology and Therapeutics, 99(3):325–332, mar 2016.

[48] Zhou Wei, Jiang Dejun, Guillaume Pierre, Chi-Hung Chi, and Maarten van Steen. Service-oriented
data denormalization for scalable web applications. In Proceedings of the 17th international con-
ference on World Wide Web, pages 267–276, 2008.

[49] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave,
Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph
Gonzalez, Scott Shenker, and Ion Stoica. Apache Spark: A unified engine for big data processing.
Commun. ACM, 59(11):56–65, October 2016.

16
Supplementary material

A Scalpel Analysis usage examples


This section presents a quick example of the SCALPEL-Analysis API. In [ ] and Out [ ] re-
spectively indicate numbered input code and output results. Example [1] shows how to load a cohort
collection from a json file produced by SCALPEL-Extraction. Examples [2], [3], [4] show how to
access a cohort from a cohort collection and to count their subjects. Example [5] shows how to use
algebraic manipulations over cohort to remove prevalent cases from a given population, and times this
operation to show that it fast enough for interactive use. Example [6] highlight automatically gener-
ated captions for cohorts resulting from algebraic operations, while examples [7] and [8] show how to
access subject and event data from a cohort. Examples [9] and [10] illustrate how to use SCALPEL-
Analysis to define a CohortFlow from a sequence of cohorts, then using scalpel.stats to obtain
statistics about the distributions of gender and age along the stages. In example [9], excluding patients
with a fracture does not introduce much changes in the gender and age distributions. In example [10]
however, keeping only patients with fractures in the final stage leads to an older population, with an
important change in the age distribution of women (a well-known phenomenon related to osteoporosis).
In [1]: from scalpel.core.cohort_collection import CohortCollection

# metadata_path = '/path/to/some/metadata_file.json'
cc = CohortCollection.from_json(metadata_path)
print(cc.cohorts_names)

Out[1]: {'follow up', 'acts', 'fractures', 'extract hospital stays',


'filter patients', 'liberal acts', 'extract patients', 'exposures',
'diagnoses', 'drug purchases'}

In [2]: base_population = cc.get('extract_patients')


base_population.subjects.count()

Out[2]: 5186601

In [3]: exposed_subjects = cc.get('exposures')


exposed_subjects.subjects.count()

Out[3]: 2666662

In [4]: fractured_subjects = cc.get('fractures')


fractured_subjects.subjects.count()

Out[4]: 179072

In [5]: %%timeit
# Select subjects in base population who were exposed but
# have not experienced a fracture
final_cohort = (exposed_subjects.intersection(base_population)
).difference(fractured_subjects)
final_cohort.subjects.count()

11.3 s ± 4.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Out[5]: 2542922

In [6]: final_cohort.describe()

17
Out[6]: 'Events are exposures. Events contain only subjects
with event exposures with extract patients without
subjects with event fractures.'

In [7]: final_cohort.subjects.show()

+---------+------+-------------------+-------------------+
|patientID|gender| birthDate| deathDate|
+---------+------+-------------------+-------------------+
| Alice| 2|1934-07-27 00:00:00| null|
| Bob| 1|1951-05-01 00:00:00| null|
| Carole| 2|1942-01-12 00:00:00| null|
| Chuck| 1|1933-10-03 00:00:00|2011-06-20 00:00:00|
| Craig| 1|1943-07-27 00:00:00|2012-12-10 00:00:00|
| Dan| 1|1971-10-07 00:00:00| null|
| Erin| 2|1924-01-12 00:00:00| null|
| Eve| 2|1953-02-21 00:00:00| null|
+---------+------+-------------------+-------------------+

In [8]: final_cohort.events.show()

+---------+--------+-------+-----+------+-------------------+-------------------+
|patientID|category|groupID|value|weight| start| end|
+---------+--------+-------+-----+------+-------------------+-------------------+
| Alice|exposure| null|DrugA| 1.0|2013-08-08 00:00:00|2013-10-07 00:00:00|
| Alice|exposure| null|DrugB| 1.0|2012-09-11 00:00:00|2012-12-30 00:00:00|
| Alice|exposure| null|DrugC| 1.0|2013-01-23 00:00:00|2013-03-24 00:00:00|
| Bob|exposure| null|DrugB| 1.0|2014-03-04 00:00:00|2014-05-03 00:00:00|
| Carole|exposure| null|DrugB| 1.0|2010-01-25 00:00:00|2010-12-13 00:00:00|
| Dan|exposure| null|DrugA| 1.0|2012-11-29 00:00:00|2013-01-28 00:00:00|
| Erin|exposure| null|DrugC| 1.0|2010-09-09 00:00:00|2011-01-17 00:00:00|
| Eve|exposure| null|DrugA| 1.0|2010-04-30 00:00:00|2010-08-02 00:00:00|
+---------+--------+-------+-----+------+-------------------+-------------------+

In [9]: from scalpel.stats.patients import distribution_by_gender_age_bucket


from scalpel.core.cohort_flow import CohortFlow

flow = CohortFlow([base_population, exposed_subjects, final_cohort])

for cohort in flow.steps:


figure = plt.figure(figsize=(8, 4.5))
distribution_by_gender_age_bucket(cohort=cohort, figure=figure)

18
Gender and age bucket distribution among extract_patients with s-
ubjects with event exposures
300000
Gender
250000 Homme

Subjects count
Femme
200000
150000
100000
50000
0
[0, 5[
[5, 10[
[10, 15[
[15, 20[
[20, 25[
[25, 30[
[30, 35[
[35, 40[
[40, 45[
[45, 50[
[50, 55[
[55, 60[
[60, 65[
[65, 70[
[70, 75[
[75, 80[
[80, 85[
[85, 90[
[90, 95[
[95, 100[
[100, 105[
[105, 110[
[110, 115[
[115, 120[
Age bucket

Gender and age bucket distribution among extract_patients with s-


ubjects with event exposures with subjects with event exposures
with extract_patients without subjects with event fractures
250000 Gender
Homme
Subjects count

200000 Femme
150000
100000
50000
0
[0, 5[
[5, 10[
[10, 15[
[15, 20[
[20, 25[
[25, 30[
[30, 35[
[35, 40[
[40, 45[
[45, 50[
[50, 55[
[55, 60[
[60, 65[
[65, 70[
[70, 75[
[75, 80[
[80, 85[
[85, 90[
[90, 95[
[95, 100[
[100, 105[
[105, 110[
[110, 115[
[115, 120[

Age bucket

In [10]: from scalpel.stats.patients import distribution_by_gender_age_bucket


from scalpel.core.cohort_flow import CohortFlow

flow = CohortFlow([base_population, exposed_subjects, fractured_subjects])

for cohort in flow.steps:


figure = plt.figure(figsize=(8, 4.5))
distribution_by_gender_age_bucket(cohort=cohort, figure=figure)

19
Subjects count Subjects count

0
5000
10000
15000
20000
0
50000
100000
150000
200000
250000
300000

[5, 10[ [0, 5[


[10, 15[ [5, 10[
[15, 20[ [10, 15[
[20, 25[ [15, 20[

Gender
Gender

[25, 30[ [20, 25[

Femme
Homme
Femme
Homme

[30, 35[ [25, 30[


[35, 40[ [30, 35[
[40, 45[ [35, 40[
[45, 50[ [40, 45[
[50, 55[ [45, 50[
[55, 60[ [50, 55[

20
[60, 65[ [55, 60[
[65, 70[ [60, 65[
[65, 70[

Age bucket
Age bucket

[70, 75[
[75, 80[ [70, 75[
[80, 85[ [75, 80[
[85, 90[ [80, 85[
[85, 90[
ubjects with event exposures

[90, 95[ [90, 95[


[95, 100[ [95, 100[
[100, 105[ [100, 105[
[105, 110[ [105, 110[
[115, 120[ [110, 115[
[110, 115[ [115, 120[
ubjects with event exposures with subjects with event fractures
Gender and age bucket distribution among extract_patients with s-
Gender and age bucket distribution among extract_patients with s-
B List of SNDS databases currently denormalized.

Database Contents
DCIR Outpatients reimbursement data
PMSI Hospital discharges
MCO Acute ward
MCO CE Acute ward outpatients treatment
SSR Rehabilitation
SSR CE Rehabilitation outpatients treatment
HAD Home-to-home care
IR IMB R Long term chronic diseases
IR BEN R Patients socio-demographic information

Table 2: List of SNDS sub-databases which are currently denormalized by SCALPEL-Flattening.


IR IMB R and IR BEN R are tables and were simply converted to Parquet files.

21
C List of available extractors

Extractor Source databases Event Type


Medical acts
CCAM DCIR, MCO, MCOCE, Punctual
SSR, SSRCE, HAD
NGAP DCIR, MCOCE Punctual
CSARR SSR Punctual
Biological acts DCIR Punctual
Practitioner encounter
Medical DCIR Punctual
Non-medical DCIR Punctual
Drug dispenses DCIR Punctual
Diagnoses
Main MCO, SSR, HAD Punctual
Associated MCO, SSR, HAD Punctual
Linked MCO, SSR, HAD Punctual
Long-term chronic disease IR IMB R Longitudinal
Hospital stay MCO Longitudinal
Emergency visit MCOCE Punctual
SSR Stay SSR Longitudinal
Hospital takeover SSR, HAD Punctual
Main Takeover reason HAD Punctual
Associated Takeover reason HAD Punctual
Patient IR BEN R, DCIR, MCO, Person
SSR, HAD

Table 3: List of implemented event extractors. This list is meant to grow over time. More
details are available in SCALPEL-Extraction [34] wiki on GitHub at https://github.com/
X-DataInitiative/SCALPEL-Extraction/wiki.

22
D List of the available transformers

Transformer Source events [optional]


Observation period Patients, [Any]
Trackloss Patients, [drug dispenses]
Follow-up Patients, observation period, [trackloss, drug dispenses, diagnoses]
Drug prescription Drug dispenses
Drug interaction Drug dispenses
Exposure
Limited in time Drug dispenses, Follow-up, [drug interaction]
Unlimited Drug dispenses, Follow-up, [drug interaction]
Outcomes
Fractures per body site Medical acts, diagnoses
Bladder cancer Medical acts, diagnoses
Infarctus Diagnoses
Heart failure Diagnoses

Table 4: List of implemented transformers. This list is meant to grow over time. More de-
tails are available in SCALPEL-Extraction [34] wiki on GitHub at https://github.com/
X-DataInitiative/SCALPEL-Extraction/wiki.

23

You might also like