OMOP Common Data Model Extract Transform Load

Ground Rules
• We have built in some decent-sized breaks; please return before time is up

Instructors
Clair Blacketer
Erica A. Voss
Evanette K. Burrows
Maxim Moinat

Connecting to the Hotel WIFI
Network: OHDSISYMP
Password: OHDSI2019

Follow Along
• This full deck can be found here:
– https://github.com/OHDSI/Tutorial-ETL
– Materials → OMOP Common Data Model Extract, Transform & Load.pptx

OHDSI in a Box
[Diagram: a PostgreSQL server holding Raw Lauren, CDM Lauren (empty), and CDM Synpuf (100K), alongside R Studio with the OHDSI Methods Library packages, WhiteRabbit, and Usagi]

How to Sign into the Remote Desktop
• From your command prompt, type %systemroot%\system32\mstsc.exe to launch Remote Desktop
• URL TBD
• Pick one of the rows and put your name in the second column
• Take Column A from the spreadsheet and copy it into the “Computer” field
• Pick ‘Use Another Account’
• If you get this page, select “Yes”

OHDSI in a Box – Ready

OHDSI’s Mission & Vision
Mission: To improve health by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care.
Vision: A world in which observational research produces a comprehensive understanding of health and disease.
Join us on the journey: http://ohdsi.org

Current Approach: “One Study – One Script”
"What's the adherence to my drug in the data assets I own?"
Analytical method: Adherence to Drug
Application to data: North America, China, Southeast Asia, Japan, Europe, UK, India, Switzerland, Italy, South Africa, Israel
Current solution: One SAS or R script for each study
• Not scalable
• Not transparent
• Expensive
• Slow
• Prohibitive to non-expert routine use

Solution: Data Standardization Enables Systematic Research
[Diagram: source data is converted into standardized data, which supports analyses such as Mortality, Source of Business, Adherence, and Safety Signals]

CDM Version 6 Key Domains
• Standardized clinical data: Person, Observation_period, Visit_occurrence, Visit_detail, Condition_occurrence, Drug_exposure, Procedure_occurrence, Device_exposure, Measurement, Note, Note_NLP, Survey_conduct
• Standardized health system data: Location, Location_history, Care_site, Provider
• Standardized metadata: CDM_source, Metadata
• Standardized vocabularies: Concept, Vocabulary, Domain, Concept_class, Concept_relationship, Relationship, Concept_synonym
• Standardized derived elements: Condition_era, Drug_era, Dose_era
• Results Schema: Cohort, Cohort_definition

OMOP Vocabularies has greatly increased our ability to find relevant codes
You truly know your data if you convert it to the CDM
If you know a problem with your data, you can use the ETL to address it
You can use standardized tools developed by OHDSI like ATLAS and the Patient Level Prediction Package
Buy vs Build: leverage an entire community of technical and scientific capability for “free”

ETL
• Extract, Transform, Load
• To get from our native/raw data into the OMOP CDM, we need to design and develop an ETL process
• The goal of ETLing is to standardize the format and terminology
• This tutorial
– Will teach you best practices around designing an ETL and CDM maintenance
– Will not teach you how to program an ETL

ETL Process
[Diagram: ETL Documentation → ETL]
https://ohdsi.github.io/TheBookOfOhdsi/ExtractTransformLoad.html

Hands On Exercises for Today
• Scan a database with White Rabbit

A Patient’s Story: Lauren
https://www.endometriosis-uk.org/laurens-story

What data do we have?
Lauren’s Timeline
[Timeline: markers at -3 Years, -2 Years, -1 Years, -2 Weeks, -3 Days, and Day 0; events shown: dysmenorrhea, abdominal pain, GP visit, ultrasound, endometrioma, endometriosis]

Data Format
• Synthea™ is a Synthetic Patient Population Simulator. The goal is to output synthetic, realistic (but not real), patient data and associated health records in a variety of formats.
• The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated).
• https://github.com/synthetichealth/synthea
Walonoski J, Kramer M, Nichols J, Quina A, Moesel C, Hall D, Duffett C, Dube K, Gallagher T, McLachlan S. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc. 2017 Aug 30. doi: 10.1093/jamia/ocx079. [Epub ahead of print] PubMed PMID: 29025144.

Synthea Tables
File                 Description
allergies.csv        Patient allergy data.
careplans.csv        Patient care plan data, including goals.
conditions.csv       Patient conditions or diagnoses.
encounters.csv       Patient encounter data.
imaging_studies.csv  Patient imaging metadata.
immunizations.csv    Patient immunization data.
medications.csv      Patient medication data.
observations.csv     Patient observations including vital signs and lab reports.
organizations.csv    Provider organizations including hospitals.
patients.csv         Patient demographic data.
procedures.csv       Patient procedure data including surgeries.
providers            Clinicians that provide patient care.

Raw Data
raw_lauren, raw_synthea

Tools help us get started . . .

White Rabbit - Location
White Rabbit - Scan
White Rabbit – Scan Report: raw_synthea
• Overview Tab
• Patients Tab

Now Your Turn: Scan Lauren’s Data
[Screenshot labels: ohdsi, raw_lauren]
• Test connection
• Open ScanReport.xlsx

White Rabbit

Rabbit in a Hat
• Provides a graphical interface to allow a user to connect source data to tables
[Screenshot: raw_synthea]
• Generate document

Resources
• Important links to keep in mind when working on an ETL:
– CDM Wiki: https://github.com/OHDSI/CommonDataModel/wiki
Information about the CDM structure and conventions to follow can be found here.
– OHDSI Forums: http://forums.ohdsi.org/ and http://forums.ohdsi.org/c/cdm-builders
OHDSI is an active community. Your question may already have been asked on the forum; if not, do not be afraid to ask it yourself!

Rabbit in a Hat
• The full ETL document: https://ohdsi.github.io/ETL-Synthea/

Some Parting Thoughts On ETL
• Vocabulary will tell a source record where to go.
– For example, just because a code is a condition code in a condition table does not mean it will end up in CONDITION_OCCURRENCE
– ICD9 783.1 - Abnormal weight gain
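
The routing idea above can be sketched as a lookup from the mapped concept's domain to a CDM table. This is a minimal illustration, not the tutorial's implementation: the domain-to-table dictionary covers only a few domains, and the source codes and the domain assigned to the second row are made up for the example.

```python
# Route a record to a CDM table based on the domain of its mapped standard concept.
# Only a handful of domains are listed; a real ETL would cover every domain it meets.
DOMAIN_TO_TABLE = {
    "Condition": "CONDITION_OCCURRENCE",
    "Observation": "OBSERVATION",
    "Measurement": "MEASUREMENT",
    "Procedure": "PROCEDURE_OCCURRENCE",
    "Drug": "DRUG_EXPOSURE",
}


def target_table(domain_id: str) -> str:
    """Return the CDM table for a concept's domain_id, or fail loudly."""
    try:
        return DOMAIN_TO_TABLE[domain_id]
    except KeyError:
        raise ValueError(f"no routing rule for domain {domain_id!r}")


# Hypothetical rows: (source_code, domain_id of the mapped standard concept).
# Both rows come from a source "condition" table, yet they land in different CDM tables.
mapped_rows = [
    ("CODE_A", "Condition"),
    ("CODE_B", "Measurement"),  # made-up domain, for illustration only
]

routed = {code: target_table(domain) for code, domain in mapped_rows}
print(routed)
```

The point of the design is that the source table name never appears in the routing decision; only the domain of the standard concept does.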

Upcoming enhancements
• Additional scan report metrics

Standardizing Terminologies
SOURCE_CODE (e.g., ICPC-1 Dutch codes, ICD9, etc.): XYZ
→ ?
STANDARD_CONCEPT_ID (e.g., SNOMED for conditions and RxNorm for drugs): 123456789
2. TABLE_SOURCE_CONCEPT_ID: concept representation of the source code; helps maintain a tie to the raw data

OMOP Vocab
• There are two standard queries to help us use the OMOP Vocabulary:
– SOURCE_TO_STANDARD.sql
– SOURCE_TO_SOURCE.sql
• https://github.com/OHDSI/Tutorial-ETL
– Materials → Queries
• If your source data’s codes are in the OMOP Vocab, you can use it to translate to a standard
• For example:
– ICD9 → SNOMED
– NDC → RxNorm

Mapping a Lauren Row to CONCEPT_ID
SELECT * FROM RAW_LAUREN.CONDITIONS WHERE ENCOUNTER = '70'
START     STOP  PATIENT  ENCOUNTER  CODE   DESCRIPTION
1/6/2010        1        70         N94.6  Dysmenorrhea
CONDITION_CONCEPT_ID = ?
CONDITION_SOURCE_CONCEPT_ID = ?

Source to Standard
WITH CTE_VOCAB_MAP AS (
    SELECT c.concept_code AS SOURCE_CODE, c.concept_id AS SOURCE_CONCEPT_ID,
           c.concept_name AS SOURCE_CODE_DESCRIPTION, c.vocabulary_id AS SOURCE_VOCABULARY_ID,
           c.domain_id AS SOURCE_DOMAIN_ID, c.CONCEPT_CLASS_ID AS SOURCE_CONCEPT_CLASS_ID,
           c.VALID_START_DATE AS SOURCE_VALID_START_DATE, c.VALID_END_DATE AS SOURCE_VALID_END_DATE,
           c.INVALID_REASON AS SOURCE_INVALID_REASON,
           c1.concept_id AS TARGET_CONCEPT_ID, c1.concept_name AS TARGET_CONCEPT_NAME,
           c1.VOCABULARY_ID AS TARGET_VOCABULARY_ID, c1.domain_id AS TARGET_DOMAIN_ID,
           c1.concept_class_id AS TARGET_CONCEPT_CLASS_ID, c1.INVALID_REASON AS TARGET_INVALID_REASON,
           c1.standard_concept AS TARGET_STANDARD_CONCEPT
    FROM CONCEPT C
    JOIN CONCEPT_RELATIONSHIP CR
      ON C.CONCEPT_ID = CR.CONCEPT_ID_1
     AND CR.invalid_reason IS NULL
     AND cr.relationship_id = 'Maps to'
    JOIN CONCEPT C1
      ON CR.CONCEPT_ID_2 = C1.CONCEPT_ID
     AND C1.INVALID_REASON IS NULL
    UNION
    SELECT source_code, SOURCE_CONCEPT_ID, SOURCE_CODE_DESCRIPTION, source_vocabulary_id,
           c1.domain_id AS SOURCE_DOMAIN_ID, c2.CONCEPT_CLASS_ID AS SOURCE_CONCEPT_CLASS_ID,
           c1.VALID_START_DATE AS SOURCE_VALID_START_DATE, c1.VALID_END_DATE AS SOURCE_VALID_END_DATE,
           stcm.INVALID_REASON AS SOURCE_INVALID_REASON,
           target_concept_id, c2.CONCEPT_NAME AS TARGET_CONCEPT_NAME,
           target_vocabulary_id, c2.domain_id AS TARGET_DOMAIN_ID,
           c2.concept_class_id AS TARGET_CONCEPT_CLASS_ID, c2.INVALID_REASON AS TARGET_INVALID_REASON,
           c2.standard_concept AS TARGET_STANDARD_CONCEPT
    FROM source_to_concept_map stcm
    LEFT OUTER JOIN CONCEPT c1 ON c1.concept_id = stcm.source_concept_id
    LEFT OUTER JOIN CONCEPT c2 ON c2.CONCEPT_ID = stcm.target_concept_id
    WHERE stcm.INVALID_REASON IS NULL
)
SELECT TARGET_CONCEPT_ID, TARGET_CONCEPT_NAME, TARGET_DOMAIN_ID
FROM CTE_VOCAB_MAP
WHERE SOURCE_CODE = 'N94.6'
  AND SOURCE_VOCABULARY_ID = 'ICD10CM'
  AND TARGET_STANDARD_CONCEPT = 'S'

Mapping a Lauren Row to CONCEPT_ID: Source to Standard
START     STOP  PATIENT  ENCOUNTER  CODE   DESCRIPTION
1/6/2010        1        70         N94.6  Dysmenorrhea
TARGET_CONCEPT_ID  TARGET_CONCEPT_NAME  TARGET_DOMAIN_ID
194696             Dysmenorrhea         Condition
CONDITION_CONCEPT_ID = 194696
CONDITION_SOURCE_CONCEPT_ID = ?
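
To see the 'Maps to' lookup run end to end without a full vocabulary download, here is a small sketch that loads two concepts into an in-memory SQLite database and applies the first half of the Source to Standard query. The two concept IDs and names come from the slides; the reduced table layouts, the placeholder SNOMED code, and the function name `source_to_standard` are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced versions of the CONCEPT and CONCEPT_RELATIONSHIP tables (assumed subset of columns).
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    domain_id TEXT,
    vocabulary_id TEXT,
    concept_code TEXT,
    standard_concept TEXT,
    invalid_reason TEXT
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT,
    invalid_reason TEXT
);
""")

# Concept IDs and names are taken from the slides; the SNOMED code is a placeholder.
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (35209488, "Dysmenorrhea, unspecified", "Condition", "ICD10CM", "N94.6", None, None),
        (194696, "Dysmenorrhea", "Condition", "SNOMED", "PLACEHOLDER", "S", None),
    ],
)
conn.execute("INSERT INTO concept_relationship VALUES (35209488, 194696, 'Maps to', NULL)")


def source_to_standard(code, vocabulary):
    """Return (source_concept_id, target_concept_id, target_name, target_domain) rows."""
    sql = """
        SELECT c.concept_id, c1.concept_id, c1.concept_name, c1.domain_id
        FROM concept c
        JOIN concept_relationship cr
          ON c.concept_id = cr.concept_id_1
         AND cr.invalid_reason IS NULL
         AND cr.relationship_id = 'Maps to'
        JOIN concept c1
          ON cr.concept_id_2 = c1.concept_id
         AND c1.invalid_reason IS NULL
        WHERE c.concept_code = ?
          AND c.vocabulary_id = ?
          AND c1.standard_concept = 'S'
    """
    return conn.execute(sql, (code, vocabulary)).fetchall()


rows = source_to_standard("N94.6", "ICD10CM")
print(rows)
```

The first element of each row is what would go into CONDITION_SOURCE_CONCEPT_ID and the second into CONDITION_CONCEPT_ID.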

Source to Source
WITH CTE_VOCAB_MAP AS (
    SELECT c.concept_code AS SOURCE_CODE, c.concept_id AS SOURCE_CONCEPT_ID,
           c.CONCEPT_NAME AS SOURCE_CODE_DESCRIPTION, c.vocabulary_id AS SOURCE_VOCABULARY_ID,
           c.domain_id AS SOURCE_DOMAIN_ID, c.concept_class_id AS SOURCE_CONCEPT_CLASS_ID,
           c.VALID_START_DATE AS SOURCE_VALID_START_DATE, c.VALID_END_DATE AS SOURCE_VALID_END_DATE,
           c.invalid_reason AS SOURCE_INVALID_REASON,
           c.concept_ID AS TARGET_CONCEPT_ID, c.concept_name AS TARGET_CONCEPT_NAME,
           c.vocabulary_id AS TARGET_VOCABULARY_ID, c.domain_id AS TARGET_DOMAIN_ID,
           c.concept_class_id AS TARGET_CONCEPT_CLASS_ID, c.INVALID_REASON AS TARGET_INVALID_REASON,
           c.STANDARD_CONCEPT AS TARGET_STANDARD_CONCEPT
    FROM CONCEPT c
    UNION
    SELECT source_code, SOURCE_CONCEPT_ID, SOURCE_CODE_DESCRIPTION, source_vocabulary_id,
           c1.domain_id AS SOURCE_DOMAIN_ID, c2.CONCEPT_CLASS_ID AS SOURCE_CONCEPT_CLASS_ID,
           c1.VALID_START_DATE AS SOURCE_VALID_START_DATE, c1.VALID_END_DATE AS SOURCE_VALID_END_DATE,
           stcm.INVALID_REASON AS SOURCE_INVALID_REASON,
           target_concept_id, c2.CONCEPT_NAME AS TARGET_CONCEPT_NAME,
           target_vocabulary_id, c2.domain_id AS TARGET_DOMAIN_ID,
           c2.concept_class_id AS TARGET_CONCEPT_CLASS_ID, c2.INVALID_REASON AS TARGET_INVALID_REASON,
           c2.standard_concept AS TARGET_STANDARD_CONCEPT
    FROM source_to_concept_map stcm
    LEFT OUTER JOIN CONCEPT c1 ON c1.concept_id = stcm.source_concept_id
    LEFT OUTER JOIN CONCEPT c2 ON c2.CONCEPT_ID = stcm.target_concept_id
    WHERE stcm.INVALID_REASON IS NULL
)
SELECT *
FROM CTE_VOCAB_MAP
WHERE SOURCE_CODE = 'N94.6'
  AND SOURCE_VOCABULARY_ID = 'ICD10CM'

Source to Source
• Look up your source code here (the same query as on the previous slide, with the lookup location highlighted)

Mapping a Lauren Row to CONCEPT_ID: Source to Source
START     STOP  PATIENT  ENCOUNTER  CODE   DESCRIPTION
1/6/2010        1        70         N94.6  Dysmenorrhea
TARGET_CONCEPT_ID  TARGET_CONCEPT_NAME        TARGET_DOMAIN_ID
35209488           Dysmenorrhea, unspecified  Condition
CONDITION_CONCEPT_ID = 194696
CONDITION_SOURCE_CONCEPT_ID = 35209488

Mapping Source Codes – Your turn
• Let’s open PostgreSQL
– Open up pgAdmin4 using the icon on the task bar

Mapping Source Codes – Your turn
CODE   DESCRIPTION                    CODE TYPE
C83.3  Diffuse large B-cell lymphoma  ICD10 (not ICD10CM)
CONDITION_CONCEPT_ID = ?
CONDITION_SOURCE_CONCEPT_ID = ?
https://github.com/OHDSI/Tutorial-ETL/tree/master/materials/Queries

Answer:
CONDITION_CONCEPT_ID = 4300704
CONDITION_SOURCE_CONCEPT_ID = 1567654

What do you do with the mapping information?

Usagi
• When the Vocabulary does not have your source codes, you will need to create a map to OMOP Vocabulary concepts
• Usagi is Japanese for rabbit; the tool was named after the first mapping exercise it was used for: mapping source codes used in a Japanese dataset into OMOP Vocabulary concepts
• Usagi is a software tool that helps with the process of mapping source codes to OMOP concepts

Usagi Process
1. Get a copy of the Vocabulary from ATHENA: http://athena.ohdsi.org
2. Download Usagi: https://github.com/OHDSI/Usagi
3. Have Usagi build an index on the Vocabulary
4. Load your source codes and let Usagi process them
• If the codes are not in English, use Google Translate to convert them
5. Review and update the suggested mappings with someone who has medical knowledge
• The Usagi screen has three areas: Overview Table, Selected Mapping, and Search Facility
• It may be valuable to sort on “Match Score”; reviewing the codes that Usagi is most confident about first may quickly knock out a significant chunk of codes
• Sorting on “Frequency” is valuable; it is important to spend more effort on frequent codes than on infrequent ones
• It is okay to map to 0 – “No matching concept”
• A source code might end up being mapped to two concepts
• A code that your source system places in one domain may be placed in a different domain by the OMOP Vocabulary
6. Export codes into the SOURCE_TO_CONCEPT_MAP
• You then load your generated maps into the empty Vocabulary table.

Usagi – Your Turn
1. Get a copy of the Vocabulary from ATHENA
2. Download Usagi

Now Your Turn: Open Usagi
• We have provided a small subset of codes to try to map:
https://github.com/OHDSI/Tutorial-ETL/ -> Materials -> Usagi -> DUTCH_ICPC_CONDITION_CODES_TO_MAP.xlsx
• Your mission:
– Download the codes to map
– Translate codes to English
– Import codes into Usagi
– Map to standard concepts
– Export SOURCE_TO_CONCEPT_MAP table

Usagi – Your Turn
• What CONCEPT_ID do you map “Dermatomycosis (s)” to?
– USAGI mapped:
Source Term         Concept ID  Concept Name
Dermatomycosis (s)  135473      Dermatophytosis
Dermatomycosis: a fungal infection of the skin, especially by a dermatophyte
Dermatophytosis: a highly contagious, fungal infection of the skin or scalp (ringworm)
– We remapped to:
Source Term         Concept ID  Concept Name
Dermatomycosis (s)  137213      Dermal mycosis
Dermatomycosis: a fungal infection of the skin, especially by a dermatophyte
Dermal mycosis: fungal infection
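
Once mappings have been reviewed, the ETL only needs a lookup against SOURCE_TO_CONCEPT_MAP. The sketch below shows that lookup on an in-memory SQLite table. It is an illustration under several assumptions: the table is a reduced version with only the columns the example needs, the source codes `HYP01`/`HYP02` and the vocabulary name `HYP_ICPC` are hypothetical, and only the target concept 137213 comes from the slide above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced SOURCE_TO_CONCEPT_MAP: only the columns this sketch needs (assumption).
conn.execute("""
    CREATE TABLE source_to_concept_map (
        source_code TEXT,
        source_vocabulary_id TEXT,
        source_code_description TEXT,
        target_concept_id INTEGER,
        invalid_reason TEXT
    )
""")

# Reviewed mappings. The codes and vocabulary name are hypothetical placeholders;
# 137213 is the remapped target concept from the Dermatomycosis example.
reviewed = [
    ("HYP01", "HYP_ICPC", "Dermatomycosis", 137213, None),
    ("HYP02", "HYP_ICPC", "Code with no good match", 0, None),
]
conn.executemany("INSERT INTO source_to_concept_map VALUES (?, ?, ?, ?, ?)", reviewed)


def lookup_target(code, vocabulary):
    """Return the mapped concept id, or 0 ("No matching concept") when no valid map exists."""
    row = conn.execute(
        "SELECT target_concept_id FROM source_to_concept_map "
        "WHERE source_code = ? AND source_vocabulary_id = ? AND invalid_reason IS NULL",
        (code, vocabulary),
    ).fetchone()
    return row[0] if row else 0


print(lookup_target("HYP01", "HYP_ICPC"))
print(lookup_target("NOT_IN_MAP", "HYP_ICPC"))
```

Returning 0 for both an explicit "no match" decision and a missing row keeps the ETL running, though a real build may want to log the two cases separately.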

ETL Implementation
There are multiple tools available to implement your ETL. In this example we created a builder using SQL and R, though your choice will largely depend on the size and complexity of the ETL design.

General Flow of Implementation
A good rule of thumb is to always create the PERSON table first

ETL Implementation
How should the PERSON table logic be implemented in SQL?
Let’s review the logic we decided on for how the PERSON table should be created:
• Gender
• Birthdate
• Race
• Ethnicity
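
The four PERSON fields above can be sketched as a single transformation function. This is not the logic in Insert_Person_Lauren.sql, which is not reproduced in the deck; it is a minimal illustration that assumes Synthea-style source values (`M`/`F`, lowercase race and ethnicity strings, ISO birthdates) and uses the standard OMOP gender, race, and ethnicity concept IDs. The sample patient is made up.

```python
from datetime import date

# Standard OMOP concept IDs; the source values on the left are assumed, not taken from the deck.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}
RACE_CONCEPTS = {"white": 8527, "black": 8516, "asian": 8515}
ETHNICITY_CONCEPTS = {"hispanic": 38003563, "nonhispanic": 38003564}


def to_person(person_id, patient):
    """Build a PERSON-like row (dict) from one source patient record."""
    birth = date.fromisoformat(patient["birthdate"])
    return {
        "person_id": person_id,
        # Unmapped source values fall back to 0 rather than being guessed.
        "gender_concept_id": GENDER_CONCEPTS.get(patient["gender"].upper(), 0),
        "year_of_birth": birth.year,
        "month_of_birth": birth.month,
        "day_of_birth": birth.day,
        "race_concept_id": RACE_CONCEPTS.get(patient["race"].lower(), 0),
        "ethnicity_concept_id": ETHNICITY_CONCEPTS.get(patient["ethnicity"].lower(), 0),
        # Keep the raw values so the mapping can be audited later.
        "gender_source_value": patient["gender"],
        "race_source_value": patient["race"],
        "ethnicity_source_value": patient["ethnicity"],
    }


# Made-up patient record for illustration.
sample = {"birthdate": "1985-03-12", "gender": "F", "race": "white", "ethnicity": "irish"}
person = to_person(1, sample)
print(person)
```

In this sketch the ethnicity value `irish` has no entry in the lookup, so it maps to 0; that is the kind of gap a review of the ethnicity logic would need to resolve.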

ETL Implementation
Now let us run the code and create the PERSON table in the cdm_lauren schema
1. Download the query from: https://github.com/OHDSI/Tutorial-ETL
Materials → Implementation → Insert_Person_Lauren.sql
2. Open up pgAdmin4 using the icon on the task bar
3. Expand the server list, right-click on PostgreSQL 10, and choose Connect Server from the drop-down menu
7. Paste the SQL code to create the PERSON table into the query window and press F5 or click the [icon]
NOTE: The ‘truncate’ statement at the beginning deletes anything that is already in the table without deleting the table itself (helpful if you make a mistake)
QUESTIONS:
• How would you check that your PERSON table was created?
• How could you fix the ethnicity mapping?
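
One way to answer the first question is to count the rows after the load, and the truncate-then-insert pattern makes the load safe to repeat. The sketch below shows both ideas on an in-memory SQLite table. SQLite has no TRUNCATE statement, so `DELETE FROM` stands in for it here; the reduced table layout and the single row are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced PERSON table (assumed subset of columns).
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        gender_concept_id INTEGER,
        year_of_birth INTEGER
    )
""")


def reload_person(rows):
    """Empty the table, then insert all rows, inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM person")  # stands in for TRUNCATE in PostgreSQL
        conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)


def person_count():
    """Simple post-load check: how many rows ended up in PERSON?"""
    return conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]


rows = [(1, 8532, 1985)]  # made-up row
reload_person(rows)
reload_person(rows)  # running the load twice does not duplicate rows
print(person_count())
```

Without the clearing step, the second run would fail on the primary key (or duplicate rows in a table without one), which is why the clearing statement sits at the top of the script.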

ETL Implementation
Data Quality at implementation – ethnicity correction
• Ethnicity

Build the rest of the tables . . .
github.com/ohdsi/ETL-synthea

Resources
• The full Synthea builder can be found here: https://github.com/OHDSI/ETL-Synthea
• Another example of an R/SQL builder for a much larger database: https://github.com/OHDSI/ETL-HealthVerityBuilder
• A builder created using .NET: https://github.com/OHDSI/ETL-CDMBuilder
• A builder created using the AWS Lambda functionality: https://github.com/OHDSI/ETL-lambdabuilder (in development)

Example Builder 1: Janssen CDM Builder Over Time
From simple to sophisticated:
• Simple SQL Queries
• Simple SQL Queries + Cursors
• SAS Builder
• C# Single Machine
• C# Multiple Machine
• C# in Cloud Enabled Environment
Team for the simple builders: Data Experts & CDM Experts
Team for the sophisticated builders: Data Experts, Technical Experts & CDM Experts
https://github.com/OHDSI/ETL-CDMBuilder

Example Builder 2: PEDSnet
PEDSnet (n=6.2 million patients), 8 contributing sites
Data Coordinating Center: Children’s Hospital of Philadelphia (CHOP)

CHOP

CHOP ETL Flow – More like LTE
• L: Load (very little re-organization of data)
• T: Transform (mapping of concepts, remapping)
[Diagram: Epic Clarity → Staging Tables (Staging Postgres DB) → ETL → Final Tables]

Challenges/Lessons Learned
• We ultimately have to make decisions about our data:
– What do we include?
• Cancelled visits with associated information reflect a known workflow for research visits
• Our ETL is time-constrained due to the clinical system ETL
– Structured program to take into account the midnight system-wide shutdown for ETL
• Clinical data does not always fit OMOP rules
– Multivitamin prescriptions with a 2055 `end_date`
– Fetal procedures with `procedure_start_date` before `birth_date`
– Autopsy procedures with `procedure_start_date` after `death_date`
– Multiple “encounters” associated with one visit
• Intermediate/temporary tables are crucial for debugging
– Tables containing source identification numbers (IDs such as MRNs, patient IDs, source system IDs) alongside OMOP data before the “final version”
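
The date conflicts listed above can be surfaced with a few comparisons before anyone decides what to do with the records. The sketch below is an illustration, not PEDSnet's code: the function names, the issue labels, and the sample dates are all made up, and the functions only flag records so that a person can decide whether the data is wrong or simply unusual (as with fetal procedures and autopsies).

```python
from datetime import date


def procedure_issues(procedure_start, birth_date, death_date=None):
    """Flag procedures dated before birth or after death; flagged does not mean wrong."""
    issues = []
    if procedure_start < birth_date:
        issues.append("procedure before birth")
    if death_date is not None and procedure_start > death_date:
        issues.append("procedure after death")
    return issues


def end_date_issues(end_date, latest_plausible):
    """Flag end dates beyond a cutoff chosen by the ETL team (an assumed parameter)."""
    if end_date > latest_plausible:
        return ["end date after latest plausible date"]
    return []


# Made-up examples mirroring the cases on the slide.
fetal = procedure_issues(date(2015, 1, 10), birth_date=date(2015, 3, 1))
autopsy = procedure_issues(date(2018, 6, 2), birth_date=date(2010, 1, 1), death_date=date(2018, 6, 1))
vitamin = end_date_issues(date(2055, 1, 1), latest_plausible=date(2030, 12, 31))
print(fetal, autopsy, vitamin)
```

Writing the flags to an intermediate table next to the source identifiers, as the last bullet suggests, makes it much easier to trace a flagged row back to the clinical system.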

Data Validation: Data Model Validator
• Validates table structures and data types
• Prompts user to specify the model and version number
• Alerts if there are any unexpected columns and/or tables
• https://github.com/infomodels/infomodels (OMOP model supported)

Data Validation: Data Quality Framework
• Automated program where issues are flagged as GitHub issues categorized by table, domain and priority (High, Medium, Low)
• Checks fall into the following categories:
– Fidelity/Reliability: Is this data correct? Is it being coded/mapped correctly?
– Consistency/Internal Validity: Are there any drops/inconsistencies between submissions?
– Accuracy: Does the data correctly reflect clinical characteristics of patients?
– Completeness: Is there data that is missing?
– https://pedsnet.org/data/data-quality/

Quality
What tools are available to check that the CDM logic was implemented correctly?

Unit Test Cases
• Testing your CDM builder is important:
– ETL is often complex, increasing the danger of making mistakes that go unnoticed
• Rabbit-in-a-Hat
The test framework creates a series of R functions that enable you to specify your ‘fake’ people and records in the same structure as your source data, using the scan report as a guide.
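
The idea of testing with 'fake' people can be shown in a few lines. The Rabbit-in-a-Hat framework generates R functions; the Python sketch below is only a stand-in to illustrate the pattern, and the toy `etl_person` step, its drop rule, and the test names are all assumptions made for the example.

```python
GENDER_CONCEPTS = {"M": 8507, "F": 8532}  # standard OMOP gender concepts


def etl_person(patients):
    """Toy ETL step: map gender to a concept and drop patients with unknown gender."""
    out = []
    for p in patients:
        concept = GENDER_CONCEPTS.get(p["gender"])
        if concept is None:
            continue  # assumed rule for this sketch: unknown gender is dropped
        out.append({"person_source_value": p["id"], "gender_concept_id": concept})
    return out


def run_tests():
    """Declare fake source records, run the ETL step, and compare with expectations."""
    results = {}
    fake_female = [{"id": "t1", "gender": "F"}]
    results["female maps to 8532"] = etl_person(fake_female) == [
        {"person_source_value": "t1", "gender_concept_id": 8532}
    ]
    fake_unknown = [{"id": "t2", "gender": "X"}]
    results["unknown gender is dropped"] = etl_person(fake_unknown) == []
    return results


results = run_tests()
for name, passed in results.items():
    print("PASS" if passed else "FAIL", name)
```

Each test states one rule from the ETL document, so a failing test points directly at the rule that was implemented differently from the specification.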
• An example of how this was done for the Synthea data is available from: https://github.com/OHDSI/Tutorial-ETL/tree/master/materials/Unit%20Tests
Let us revisit the PERSON table logic:

Achilles
Achilles is a data characterization and quality tool available for download here: https://github.com/OHDSI/Achilles
For an example of how it was run for our sample data, the R script is located here: https://github.com/OHDSI/Tutorial-ETL/blob/master/materials/Achilles/achillesRun.R
• This plot shows that the bulk of the data starts in 2005. However, there also appear to be a few records from around 1961, which is likely an error in the data.
• This change coincides with changes in the reimbursement rules in this specific country, leading to more diagnoses but probably not a true increase in prevalence in the underlying population.

Achilles Heel
Achilles Heel is a report generated by the Achilles application that runs a series of data quality checks on the CDM using the Achilles data

DataQualityDashboard (DQD)
DQD Example Rules

Issues in our synthetic data? (cdm_synthea)
• Did our test cases run?
• Did Achilles notice anything?
• Did DQD notice anything?

Maybe we have a bug?
• In CONDITION_OCCURRENCE, 61% of rows are mapped to 0
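
A figure like the 61% above comes from a simple count of rows whose concept ID is 0, and listing the most frequent unmapped source values shows where to start fixing. The sketch below runs both queries on an in-memory SQLite table; the reduced table layout and the four rows (including the source values `HYP_A` and `HYP_B`) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced CONDITION_OCCURRENCE table (assumed subset of columns).
conn.execute("""
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        condition_concept_id INTEGER,
        condition_source_value TEXT
    )
""")
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
    [(1, 194696, "N94.6"), (2, 0, "HYP_A"), (3, 0, "HYP_A"), (4, 0, "HYP_B")],
)


def unmapped_share():
    """Fraction of rows whose condition_concept_id is 0."""
    total, unmapped = conn.execute(
        "SELECT COUNT(*), SUM(CASE WHEN condition_concept_id = 0 THEN 1 ELSE 0 END) "
        "FROM condition_occurrence"
    ).fetchone()
    return unmapped / total if total else 0.0


def top_unmapped(limit=10):
    """Most frequent source values among the unmapped rows."""
    return conn.execute(
        "SELECT condition_source_value, COUNT(*) AS n FROM condition_occurrence "
        "WHERE condition_concept_id = 0 "
        "GROUP BY condition_source_value ORDER BY n DESC, condition_source_value LIMIT ?",
        (limit,),
    ).fetchall()


print(unmapped_share())
print(top_unmapped())
```

Sorting the unmapped values by frequency follows the same reasoning as sorting on "Frequency" in Usagi: fixing the most common codes first removes the largest share of zeros.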

ETL Maintenance
[Diagram: ETL Documentation → ETL → Updated CDM]
• A technical person implements the ETL
• All are involved in quality control
Questions that trigger maintenance:
• Changed or updated raw data?
• Bug found?
• New vocab?
• CDM update?

Document the Bug

Vocabulary to fix the problem
Look in the Source to Concept Map table for a map:
WITH CTE_VOCAB_MAP AS (
    SELECT c.concept_code AS SOURCE_CODE, c.concept_id AS SOURCE_CONCEPT_ID,
           c.concept_name AS SOURCE_CODE_DESCRIPTION, c.vocabulary_id AS SOURCE_VOCABULARY_ID,
           c.domain_id AS SOURCE_DOMAIN_ID, c.CONCEPT_CLASS_ID AS SOURCE_CONCEPT_CLASS_ID,
           c.VALID_START_DATE AS SOURCE_VALID_START_DATE, c.VALID_END_DATE AS SOURCE_VALID_END_DATE,
           c.INVALID_REASON AS SOURCE_INVALID_REASON,
           c1.concept_id AS TARGET_CONCEPT_ID, c1.concept_name AS TARGET_CONCEPT_NAME,
           c1.VOCABULARY_ID AS TARGET_VOCABULARY_ID, c1.domain_id AS TARGET_DOMAIN_ID,
           c1.concept_class_id AS TARGET_CONCEPT_CLASS_ID, c1.INVALID_REASON AS TARGET_INVALID_REASON,
           c1.standard_concept AS TARGET_STANDARD_CONCEPT
    FROM CONCEPT C
    JOIN CONCEPT_RELATIONSHIP CR
      ON C.CONCEPT_ID = CR.CONCEPT_ID_1
     AND CR.invalid_reason IS NULL
     AND cr.relationship_id = 'Maps to'
    JOIN CONCEPT C1
      ON CR.CONCEPT_ID_2 = C1.CONCEPT_ID
     AND C1.INVALID_REASON IS NULL
    UNION
    SELECT source_code, SOURCE_CONCEPT_ID, SOURCE_CODE_DESCRIPTION, source_vocabulary_id,
           c1.domain_id AS SOURCE_DOMAIN_ID, c2.CONCEPT_CLASS_ID AS SOURCE_CONCEPT_CLASS_ID,
           c1.VALID_START_DATE AS SOURCE_VALID_START_DATE, c1.VALID_END_DATE AS SOURCE_VALID_END_DATE,
           stcm.INVALID_REASON AS SOURCE_INVALID_REASON,
           target_concept_id, c2.CONCEPT_NAME AS TARGET_CONCEPT_NAME,
           target_vocabulary_id, c2.domain_id AS TARGET_DOMAIN_ID,
           c2.concept_class_id AS TARGET_CONCEPT_CLASS_ID, c2.INVALID_REASON AS TARGET_INVALID_REASON,
           c2.standard_concept AS TARGET_STANDARD_CONCEPT
    FROM source_to_concept_map stcm
    LEFT OUTER JOIN CONCEPT c1 ON c1.concept_id = stcm.source_concept_id
    LEFT OUTER JOIN CONCEPT c2 ON c2.CONCEPT_ID = stcm.target_concept_id
    WHERE stcm.INVALID_REASON IS NULL
)
SELECT TARGET_CONCEPT_ID, TARGET_CONCEPT_NAME, TARGET_DOMAIN_ID
FROM CTE_VOCAB_MAP
WHERE SOURCE_VOCABULARY_ID = 'Synthea_conditions'

Update the ETL document
• https://ohdsi.github.io/Tutorial-ETL/docs/cdm_synthea_v2

Re-run the DQD
TBD

Re-run Achilles

Final Hard Lessons Learned
80/20 Rule

Comfort with Data Loss
• If there is data that is not of research quality, or there are methods to adjust it, use the ETL to standardize that
Example Patient Drop Counts from a CDM Build
Reason to Drop Someone                                         Person Count
Unknown gender                                                 23,592
Implausible year of birth - past                               749
Implausible year of birth - post earliest observation period   3,836
Gender changes                                                 2
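
A drop-count table like the one above can be produced by having the ETL record a reason for every person it excludes. The sketch below illustrates that bookkeeping; it is not the build that produced the counts on the slide. The cutoff year `MIN_BIRTH_YEAR`, the field names, and the four sample people are assumptions, and the "Gender changes" rule is left out because it needs a person's history rather than a single record.

```python
from collections import Counter

MIN_BIRTH_YEAR = 1900  # assumed cutoff for "implausible - past"


def drop_reason(person):
    """Return the reason a person is dropped, or None if the person is kept."""
    if person["gender"] not in ("M", "F"):
        return "Unknown gender"
    if person["year_of_birth"] < MIN_BIRTH_YEAR:
        return "Implausible year of birth - past"
    if person["year_of_birth"] > person["earliest_observation_year"]:
        return "Implausible year of birth - post earliest observation period"
    return None


# Made-up people for illustration.
people = [
    {"gender": "F", "year_of_birth": 1985, "earliest_observation_year": 2010},
    {"gender": "U", "year_of_birth": 1970, "earliest_observation_year": 2005},
    {"gender": "M", "year_of_birth": 1850, "earliest_observation_year": 2005},
    {"gender": "M", "year_of_birth": 2015, "earliest_observation_year": 2008},
]

reasons = [drop_reason(p) for p in people]
drop_counts = Counter(r for r in reasons if r is not None)
kept = [p for p, r in zip(people, reasons) if r is None]

for reason, count in drop_counts.items():
    print(f"{reason}: {count}")
```

Publishing these counts with each CDM build lets researchers judge whether the data loss matters for their study.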

ETL Process
[Diagram: ETL Documentation → ETL → Updated CDM]
• A technical person implements the ETL
• All are involved in quality control
Questions that trigger maintenance:
• Changed or updated raw data?
• Bug found?
• New vocab?
• CDM update?

Thank you!