Summary 2
Big Data
Another related term is big data, which describes large and ever-
increasing volumes of data that adhere to the following attributes:
Volume – ever-increasing amounts
Velocity – quickly generated
Variety – many different types
Veracity – from trustworthy sources
While big data is considered a buzzword by some, we are already having to
deal with terabytes and petabytes of information today. With the
addition of genomics, big data will only escalate.
Healthcare organizations are generating an ever-increasing amount of
data. In all healthcare organizations, clinical data takes a variety of
forms, from structured (e.g., lab results) to unstructured (e.g., images
and textual notes, including clinical narratives, reports, and other
types of documents).
For example, Kaiser Permanente estimated in 2013 that its data store
for its 9+ million members exceeded 30 petabytes
(1 petabyte = 1,024 terabytes) of data.
Another example is CancerLinQ that will provide a comprehensive
system for clinicians and researchers consisting of EHR data
collection, application of clinical decision support, data mining and
visualization, and quality feedback.
Lastly, IBM's Watson is now focusing on healthcare, specifically
oncology, so that massive amounts of cancer information and research
can be analyzed and applied to individual patient decision making.
Conclusions
- Healthcare data has proliferated greatly, in large part due to the
accelerated adoption of EHRs
- Analytic platforms will examine data from multiple sources, such
as clinical records, genomic data, financial systems, and
administrative systems
- Analytics is necessary to transform data to information and
knowledge
- Accountable care organizations and other new models of healthcare
delivery will rely heavily on analytics to analyze financial and clinical
data
- There is a great demand for skilled data analysts in healthcare;
expertise in informatics will be important for such individuals
Questions
Discuss the difference between descriptive, predictive and prescriptive
analytics
Describe the characteristics of “Big Data”
Enumerate the necessary skills for a worker in the data analytics field
List the limitations of healthcare data analytics
Discuss the critical role electronic health records play in healthcare data
analytics
Topic 6: Clinical Decision Support
Introduction
Definition: “Clinical decision support (CDS) provides clinicians, staff,
patients or other individuals with knowledge and person-specific
information, intelligently filtered or presented at appropriate times, to
enhance health and health care.” (ONC)
Keep in mind that any resource that aids in decision making
should be considered CDS. We will only consider electronic
CDS.
We define clinical decision support systems (CDSSs) as the
technology that supports CDS
Early on, CDS was thought of only in terms of reminders and alerts.
Now we must include diagnostic help, cost reminders, calculators,
etc.
Although we can use the Internet's potent search engines to answer
questions, many organizations promote CDS as a major strategy to
improve patient safety
Most CDS strategies involve the 5 Rights: the right information, to the
right person, in the right format, through the right channel, at the
right time
Historical perspective
- As early as the 1950s scientists predicted computers would aid
medical decision making
- CDS programs appeared in the 1970s and were standalone
programs that eventually became inactive
- De Dombal’s system for acute abdominal pain: used Bayes
theorem to suggest differential diagnoses
- Internist-1: CDS program that used IF-THEN statements to
predict diagnoses
- Mycin: rule-based system to suggest diagnosis and treatment of
infections
- DXplain: 1984 program that used clinical findings to list possible
diagnoses. Now a commercial product
- QMR (Quick Medical Reference): diagnostic program that began as
Internist-1; development ended in 2001
- HELP: system that began in the 1980s at the University of Utah and
includes diagnostic advice, references and clinical practice guidelines
- Iliad: diagnostic program, also developed by the University of
Utah in the 1980s
- Isabel: commercial differential diagnosis tool with information
inputted as free text or from the EHR.
- Inference engine uses natural language processing and is
supported by 100,000 documents
- SimulConsult: diagnostic program based on Bayes probabilities.
- Predictions can also include clinical and genetic information
- SnapDx: free mobile app that performs diagnostic CDS for
clinicians. It is based on positive and negative likelihood ratios
from medical literature. App covers about 50 common medical
scenarios
Supporting Organizations
Institute of Medicine (IOM): promoted “automated clinical
information and CDS”
AMIA: developed the 3 pillars of CDS in 2006: best available
evidence; high adoption and effective use; and continuous
improvement
ONC: has funded research to promote excellent CDS and sharing
possibilities
AHRQ: also funded multiple CDS research projects and initiatives
HL7: has a CDS working group and developed FHIR standards,
discussed later
National Quality Forum (NQF): developed a CDS taxonomy
Leapfrog: has promoted both CPOE and CDS
HIMSS: its EMR Adoption Model rates EMRs from stage 0 to 7.
Full use of CDS qualifies as stage 6
CMS: Meaningful Use Stages 1 and 2 include CDS measures
CDS Methodology
Two phases of CDS: knowledge use and knowledge management
Knowledge representation:
o Configuration: knowledge is represented by choices made by the
institution
o Table-based: rules are stored in tables, such that if a current drug
on a patient is in one row and an order for a second inappropriate
drug is stored in the same row, an alert is triggered for the
clinician
o Rules based: knowledge base has IF-THEN statements; if the
patient is allergic to sulfa and sulfa is ordered, then an alert is
triggered. Earlier CDS programs, such as Mycin, were rule based
(a minimal sketch of such a rule appears after this list)
o Bayesian networks: based on Bayes Theorem of conditional
probabilities, they predict the future (posterior) probability based on
pre-test probability or prevalence. Although the approach assumes
the findings (such as signs and symptoms) are independent, it
works very well and is commonly employed in medicine.
Formula: P(D | F) = P(F | D) × P(D) / P(F), where D is a disease
and F the findings
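As a minimal illustration of the rules-based approach above, here is a hedged Python sketch; the patient record, allergy entry, and alert text are invented for illustration:

```python
# Minimal sketch of a rules-based CDS check (hypothetical data and rule).
# IF the patient is allergic to a drug AND that drug is ordered,
# THEN trigger an alert for the clinician.

patient = {"name": "Example Patient", "allergies": {"sulfa"}}  # invented record

def check_order(patient, ordered_drug):
    """Return an alert string if the order conflicts with a known allergy."""
    if ordered_drug in patient["allergies"]:
        return f"ALERT: {patient['name']} is allergic to {ordered_drug}."
    return None  # no rule fired; order proceeds silently

alert = check_order(patient, "sulfa")
if alert:
    print(alert)  # -> ALERT: Example Patient is allergic to sulfa.
```

Real CDSSs store such rules in a maintained knowledge base rather than in code, but the IF-THEN logic is the same.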
- The previous knowledge representation methods were based on
known data so they would be labelled “knowledge based CDS”.
If CDS is based on data mining-related techniques it would be
referred to as “non-knowledge based CDS”
- Data mining (machine learning) algorithms have to be
developed and validated ahead of actual implementation. This
approach is divided into supervised and unsupervised learning
- Supervised learning: assumes the user knows the categories of
data that exist, such as gender, diagnoses, age, etc. If the target
(outcome or dependent variable) is categorical (nominal, such
as lived or died) the approach will be called a classification
model. If the target is numerical (such as size of tumor, income,
etc.) then this is a regression model
- Neural networks: configured like a human neuron. The model is
trained until its output is close to the desired target. This is not
intuitive and requires great expertise.
- Logistic regression: in spite of the name regression, it is most
commonly used where the desired output/target is binary (cancer
recurrence vs. no cancer recurrence).
Multiple predictors are inputted, such as age, gender, family history,
etc., and odds ratios are generated. This is the gold standard for much
of predictive analytics.
- Decision trees: can perform classification or regression and are
the easiest to understand and visualize. Trees are used by both
statisticians and machine learning programs. A classic example is
the contact lens decision tree (choosing no lenses, soft lenses, or
hard lenses).
- Unsupervised learning: means data is analyzed without first
knowing the classes of data, to look for new patterns of interest.
This has been hugely important in analyzing genetic data sets.
- Cluster analysis is one of the most common ways to analyze large data
sets for undiscovered trends (a minimal clustering sketch appears after
this list). It is also more complex, requiring more expertise
- Association algorithms look for relationships of interest
- Knowledge maintenance: means there is a need to constantly
update expert evidence-based information. This task is difficult
and may fall to a CDS committee or technology vendor
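Referring to the cluster analysis bullet above, here is a minimal clustering sketch, assuming scikit-learn is installed; the data points are invented:

```python
# Minimal sketch of unsupervised learning via k-means clustering.
# No outcome labels are supplied; the algorithm discovers groupings itself.
from sklearn.cluster import KMeans

# invented rows: e.g., patients described by two numeric lab values
data = [[1.0, 2.1], [0.9, 1.8], [5.0, 8.2], [5.2, 8.0], [0.8, 2.0], [5.1, 7.9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_)  # cluster assignment for each row, found without labels
```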
CDS Standards
- CDS developers have struggled for a long time with how to
share knowledge representation with others or how to modify
rules locally. Standards were developed to try to overcome
these obstacles:
- Of these, HL7's FHIR (Fast Healthcare Interoperability Resources) is a
RESTful API (like Google uses) that uses either JSON or XML for
data representation (a hedged request sketch appears below)
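As a hedged sketch of what a FHIR-style RESTful exchange looks like, here is a Python example using the requests library; the server URL and patient ID are hypothetical:

```python
# Minimal sketch of a FHIR RESTful read: GET [base]/Patient/[id].
# The base URL and patient ID below are hypothetical.
import requests

base = "https://fhir.example.org/fhir"  # hypothetical FHIR server
resp = requests.get(f"{base}/Patient/123",
                    headers={"Accept": "application/fhir+json"})

if resp.ok:
    patient = resp.json()  # FHIR resources are returned as JSON (or XML)
    print(patient.get("resourceType"))  # -> "Patient"
```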
CDS Functionality
- CDSSs can be classified in multiple ways:
o Knowledge and non-knowledge-based systems
o Internal or external to the EHR
o Activation before, during or after a patient encounter
o Activated automatically or on demand
o Alerts can be interruptive or non-interruptive
CDS Functionality
- Ordering facilitators:
- Order sets are templated EHR orders, commercial or homegrown, that
are modified to follow national practice guidelines. For example, a
patient with a suspected heart attack has orders that automatically
include aspirin, oxygen, EKG, etc.
- Therapeutic support includes commercial products such as
TheraDoc® and calculators for a variety of medical conditions
- Smart forms are templated forms, generally used for specific
conditions such as diabetes. They can include simple check boxes
with evidence-based recommendations
- Alerts and reminders are the classic CDS output, usually alerting
clinicians about drug allergies and drug-drug interactions or
providing preventive medicine reminders.
- Relevant information displays: infobuttons, hyperlinks, and mouse-overs
are common methods to connect to evidence-based information
- Diagnostic support: most diagnostic support is external and not
integrated with the EHR; such as SimulConsult
- Dashboards: can be patient level as well as population level; a
patient-level dashboard summarizes a patient's status and informs the
clinician about multiple aspects of that patient's care
CDS Sharing
Currently, there is no single method by which CDS knowledge can be
universally shared. The approach has been either to use standards to
share the knowledge or to run CDS on a shared external server
Socratic Grid and OpenCDS are open-source web services platforms
that support CDS
The FHIR standard appears to have the greatest chance for success,
but it is still too early to know.
CDS Challenges
General: exploding medical information that is complicated and
evolving. Tough to write rules
Organizational support: CDS must be supported by leadership, IT
and clinical staff. Currently, only large
healthcare organizations can create robust CDSSs
Lack of a clear business case: evidence shows CDS helps improve
processes but it is unclear if it affects behavior and patient
outcomes. Therefore, there may not be a strong business case
to invest in CDSSs
Unintended consequences: alert fatigue
Medico-legal: adhering to or defying alerts has legal
implications. Product liability for EHR vendors
Clinical: must fit clinician workflow and fit the 5 Rights
Technical: complex CDS requires an expert IT team
Lack of interoperability: must be solved for CDS to succeed
Long term CDS benefits: requires long term commitment and
proof of benefit to be durable
Future Trends
- The future of Meaningful Use is unclear so there is no obvious
CDS business case for clinicians, hospitals and vendors
- If the FHIR standard makes interoperability easier we may see
new CDS innovations and improved adoption
Conclusions
- CDS could potentially assist with clinical decision making in
multiple areas
- While there is widespread support for CDS, there are a
multitude of challenges
- CDS is primarily achieved by larger healthcare systems
- The evidence so far suggests that CDS improves patient
processes and to a lesser degree clinical outcomes
Questions
Define electronic clinical decision support (CDS)
Enumerate the goals and potential benefits of CDS
Discuss the government and private organizations supporting
CDS
Discuss CDS taxonomy, functionality and interoperability
List the challenges associated with CDS
Enumerate CDS implementation steps and lessons learned
Topic 7: Healthcare Safety, Quality, and Ethics
Patient Safety-Related Definitions
Safety: minimization of the risk and occurrence of patient harm events
Harm: inappropriate or avoidable psychological or physical injury to patient and/or
family
Adverse Events: “an injury resulting from a medical intervention”
Preventable Adverse Events: “errors that result in an adverse event that are
preventable”
Overuse: “the delivery of care of little or no value” e.g. widespread use of antibiotics
for viral infections
Underuse: “the failure to deliver appropriate care” e.g. vaccines or cancer screening
Misuse: “the use of certain services in situations where they are not clinically indicated” e.g.
MRI for routine low back pain
Introduction
Medical errors are unfortunately common in healthcare, in spite of
sophisticated hospitals and well-trained clinicians
Often the cause is a breakdown in protocol and communication rather than individual error
Technology has potential to reduce medical errors (particularly
medication errors) by:
- Improving communication between physicians and patients
- Improving clinical decision support
- Decreasing diagnostic errors
Unfortunately, technology also has the potential to create unique new
errors that cause harm
Medical Errors
Errors can be related to diagnosis, treatment and preventive care.
Furthermore, medical errors can be errors of commission or omission
and fortunately not all errors result in an injury and not all medical errors
are preventable
Most common outpatient errors:
Prescribing medications
Getting the correct laboratory test for the correct patient at the correct
time
Filing system errors
Dispensing medications and responding to abnormal test results
While many would argue that treatment errors are the most common
category of medical errors, diagnostic errors accounted for the largest
percentage of malpractice claims, surpassing treatment errors in one
study
Diagnostic errors can result from missed, wrong or delayed diagnoses
and are more likely in the outpatient setting. This is somewhat
surprising given the fact that US physicians tend to practice
“defensive medicine”
Over-diagnosis may also cause medical errors but this has been less
well studied
Unsafe Actions
Most adverse events result from unsafe actions or inactions by
anyone on the healthcare team, including the patient
Missed care is “any aspect of required care that is omitted either in
part or in whole or delayed”
Many of the above go unreported
Medication Reconciliation
When patients transition from hospital to hospital, from physician to
physician, or from floor to floor, medication errors are more likely to occur
The Joint Commission mandates that hospitals reconcile a list of patient
medications on admission, transfer and discharge
The task may be facilitated with an EHR, but confusion may still exist if there
are multiple physicians, multiple pharmacies, poor compliance or dementia
Barriers to Improving
Patient Safety through Technology
Organizational: health systems leadership must develop a strong
“culture of safety”
Financial: Cost for multiple sophisticated HIT systems is considerable
Error reporting: is voluntary and inadequate and usually “after the
fact”
Unintended Consequences
Technology may reduce medical errors but create new ones:
1. Medical alarm fatigue
2. Infusion Pump errors
3. Distractions related to mobile devices
4. Electronic health records: data can be missing and/or incorrect,
there can be typographical entry errors, and older information is
sometimes copied and pasted into the current record.
International Considerations:
Ethics, Laws and Culture
Health informatics ethics is influenced by a country's laws and culture
The relationship between ethics, law, culture and society is unclear, is
not fixed internationally, and may be fluid even within a given
country over time
Conclusions
Patient safety continues to be an ongoing problem with too many
medical errors reported yearly
Multiple organizations are reporting patient safety data transparently
to hopefully support change
There is a great expectation that HIT will improve patient quality
which in turn will decrease medical errors
There is some evidence that clinical decision support reduces errors,
but studies overall are mixed
Leadership must establish a “culture of safety” to effectively achieve
improvement in patient safety
Health informatics ethics stems from medical ethics
The IMIA Code of Ethics contains guidelines for multiple categories
The relationship between ethics, law, culture and society is fluid and
must be monitored
The pertinent ethical principles are: right to privacy, guarding against
excess, security and integrity of data, informed consent, data sharing,
beneficence and non-maleficence and non-transferability of
responsibility
Questions
Define safety, quality, near miss, and unsafe action
List the safety and quality factors that justified the clinical implementation
of electronic health record systems
Discuss three reasons why the electronic health record is central to safety,
quality, and value
List three issues that clinicians have with the current electronic health
record systems and discuss how these problems affect safety and quality
Describe a specific electronic patient safety measurement system and a
specific electronic safety reporting system
Describe two integrated clinical decision support systems and discuss
how they may improve safety and quality
Describe the 20th century medical and computing background to
health informatics ethics
Identify the main sections of the IMIA Code of Ethics for Health
Information Professionals
Describe the complexities in the relationship between ethics, law,
culture and society
Describe different views of ethics in different countries
Summarize the most pertinent principles in health informatics ethics
Discuss the application of health informatics ethics to research into
pertinent areas of health informatics
Discuss appropriate health informatics behavior by medical students
Topic 8: Introduction to Data Science
Introduction
Data are ubiquitous, coming from multiple industries, in multiple sizes
and formats, and with varying complexity
The business domain seems to have led the way with analytics, using it to
determine potential customer loss (churn) and who will buy item B
after buying item A (market basket analysis)
Data science is a very convenient umbrella term
Our attention will only be on healthcare data
Definitions
Data science “means the scientific study of the creation, validation
and transformation of data to create meaning.”
Because data science is relatively new, definitions are still evolving.
Data analytics is “the discovery and communication of meaningful
patterns in data.” While some would argue for separating data
analytics from data mining and knowledge discovery from data (KDD),
we will use the terms interchangeably.
Background
The term data science was first used in a publication in 2001; however, a
small group of statisticians had been talking about expanding the scope
of statistics as far back as 1962
The reality is that statistics is too narrow a field to evaluate all aspects of
the avalanche of data currently available.
Importantly, data science must consider the alternative approach of
machine learning developed by computer scientists
Cutting edge businesses such as Google, Facebook and LinkedIn have
employed data scientists for many years to do innovative things with their
data
There are now data dealing with geolocation, surveys,
sensors, images, social media and so forth
Data science has been greatly aided by the simultaneous
improvement in storage, processor speed, etc.
The above has led to the Big Data era, which we will cover later
In spite of some hype associated with data science, there is clearly
tremendous interest by all industries and great demand for data
scientists
Data Basics
Datum is singular; the term data is plural!
Data are just numbers; information is data with meaning; and
knowledge is information that is felt to be true
The smallest unit of data is the bit (binary digit), which can be
represented as a zero or a one. A byte consists of 8 bits, so 256
possible combinations can be created; for example, 0100 0001
represents the capital letter A in ASCII
Binary coding like this is very important because computers easily
interpret these binary numbers 0 and 1
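A quick worked example in Python confirms the byte-to-letter mapping described above:

```python
# The bit pattern 0100 0001 interpreted as a byte is 65,
# which is the ASCII code for the capital letter A.
code = 0b01000001
print(code)       # -> 65
print(chr(code))  # -> A
print(format(ord("A"), "08b"))  # -> 01000001, back to the bit pattern
```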
Statistics Basics
- Data can be structured (fits into a database field), unstructured
(free text) or semi-structured (e.g. XML)
- Data can also be classified as nominal (categorical), meaning it is a
name and not a number, such as gender. Ordinal data is similar but
has order, such as small, medium and large. Interval data is
numerical with defined intervals but no meaningful zero, such as
Celsius temperature. Ratio data is numerical with a meaningful zero,
such as height and weight
- Nominal/ordinal data can also be considered qualitative data,
while interval/ratio data can be considered quantitative data
- Parametric data: tends to follow a normal distribution
- Non-parametric data: does not follow a normal distribution
- Statisticians use non-parametric tests for non- parametric data
and parametric tests for parametric data.
- It is important to look at measures of central tendency, such as the
mean and median, and measures of dispersion, such as the standard
deviation, to see how the data are distributed.
- Mean is the sum of values divided by the number of values
- Median is the middle of the distribution
- Mode is the most common value and the range is the difference
between the lowest and highest value
- Standard deviation is a measure of dispersion of the data from the
mean. It is calculated by taking the square root of the variance.
- Variance is calculated by subtracting the mean from each data value,
squaring the differences, totaling them, and then dividing by the
number of values: divide by n − 1 for a sample, or by n if your
values come from the entire population
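A short worked example using Python's standard library ties these measures together; the sample values are invented:

```python
# Mean, median, mode, range, sample variance, and standard deviation.
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # invented sample

print(statistics.mean(values))      # mean: sum of values / number of values
print(statistics.median(values))    # median: middle of the distribution
print(statistics.mode(values))      # mode: most common value
print(max(values) - min(values))    # range: highest minus lowest value
print(statistics.variance(values))  # sample variance (divides by n - 1)
print(statistics.stdev(values))     # standard deviation: sqrt of the variance
```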
Effect size:
Measures the magnitude of the effect and is independent of the sample size
There are a variety of effect size calculations available, depending on which
statistical test is run
More and more journals are requesting confidence intervals and effect sizes,
in addition to standard p values
Data pre-processing
Cleaning: correcting or removing data errors
Missing data: decision must be made what to do, e.g. deleting vs
imputing
Integration: combining disparate sources of data into a spreadsheet
or database
Reduction: consolidating categories to reduce attributes
Exploratory Data Analysis (EDA)
Descriptive statistics are used to look at the distribution, mean,
mode, range and standard deviation in order to determine the
optimal statistical method or algorithm to use for analysis
These earlier phases of working with data may take up to 80% of the
overall time spent on an analysis!
Data visualization may be used early as well as late
Programming languages
Two dominant languages are R and Python.
Both can perform statistical and machine learning analyses
R tends to be used more by data scientists and Python more by
computer scientists
Substantial learning curve for both.
SQL (Structured Query Language) is used to manipulate databases and
generate reports, so in a loose sense it is a programming language for analysis
Predictive analytics
Regression model is a linear (line-like) model. It seeks the mathematical
relationship between two or more numerical variables
Plotted: the dependent (target) variable on the y axis and independent
variable on the x axis
Formula: y = ax + b where a is the slope of the line and b is the intercept
Regression model: The goal is to fit the values as close to the line as
possible, minimizing the sum of squared errors (square the
difference between each value and the line, then sum them). The
lower the number, the better
For example, if we use the formula y = 0.425x + 0.785 and we set x = 2,
then y = 0.425(2) + 0.785 = 1.635 ≈ 1.64.
Correlation coefficient: value between -1 and 1
R2 is a common measure to determine how close the fit is:
0-100% range. The higher the %, the better the fit
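The sketch below fits such a line and computes R² by hand, assuming NumPy is installed; the x and y values are invented:

```python
# Fit a least-squares line y = ax + b and compute R^2.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1.2, 1.6, 2.1, 2.4, 2.9])

a, b = np.polyfit(x, y, 1)  # slope (a) and intercept (b) of best-fit line
print(a, b)

pred = a * x + b
ss_res = np.sum((y - pred) ** 2)        # sum of squared errors around the line
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation in y
print(1 - ss_res / ss_tot)              # R^2: closer to 1 (100%) = better fit
```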
Classification model: is non-linear and uses supervised learning. The
target or outcome variable is categorical (nominal). Algorithms:
Classification and Regression Trees (CART): decision trees are easy to
understand and visualize. Common trees are C4.5 and J48. The tree displays
root and leaf nodes (outcomes). In the classic contact lens decision tree, the
choices are no lenses, soft or hard lenses, and the first branch, or decision,
is tear production (normal or reduced). Random forest algorithms involve
multiple decision trees and can be more accurate (a minimal tree sketch
appears below)
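Here is the minimal tree sketch referenced above, assuming scikit-learn; the encoded contact-lens-style rows are invented stand-ins for the real dataset:

```python
# Train and display a small decision tree (contact-lens-style toy data).
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [tear_production (0=reduced, 1=normal), astigmatism (0=no, 1=yes)]
X = [[0, 0], [0, 1], [1, 0], [1, 0], [1, 1], [1, 1]]
# target: 0 = no lenses, 1 = soft lenses, 2 = hard lenses
y = [0, 0, 1, 1, 2, 2]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["tear_production", "astigmatism"]))
print(tree.predict([[1, 0]]))  # normal tears, no astigmatism -> [1] (soft)
```

As in the text, the first split the tree learns here is tear production, because it best separates the outcomes.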
Classification model
Neural networks: organized like human neurons, but this approach is more
empirical and more difficult to explain and confirm
Naïve Bayes: uses Bayes theorem to perform regression and classification to
predict future probability, based on known prevalence. Called “naïve”
because it assumes each variable is independent, which is not always the
case. Regardless, this algorithm works very well
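A worked Bayes example in Python shows how prevalence and test characteristics combine into a posterior probability; the numbers are invented for illustration:

```python
# Bayes theorem for a diagnostic test: posterior = P(disease | positive test).
prevalence = 0.01   # pre-test probability P(disease)
sensitivity = 0.90  # P(positive test | disease)
specificity = 0.95  # P(negative test | no disease)

# Total probability of a positive test over both disease states
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

posterior = sensitivity * prevalence / p_pos
print(round(posterior, 3))  # -> 0.154: low prevalence keeps the posterior modest
```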
Classification model:
K-nearest neighbor: works for regression and classification. It works by
looking for “nearest neighbors” using several methods, including Euclidean
distance. It uses the entire data set at prediction time, so there is no
explicit training phase.
Support Vector Machines (SVMs): algorithm separates the attribute spaces
into hyperplanes and determines the best coefficients for separation of the
data. Can be used for classification and regression
Classification model:
Logistic regression: in spite of the name regression, this approach is used for
classification, usually with a binary outcome such as lived or died, which can
be dummy coded as 0 or 1.
The outcome is reported as odds ratios, so the higher the odds the more
likely a predictor (independent variable) will predict an outcome (dependent
variable). This approach is shared by statisticians and computer scientists
and is perhaps the most common approach for classification scenarios.
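A minimal sketch of this approach, assuming scikit-learn is installed; the predictor and outcome values are invented:

```python
# Logistic regression with odds ratios for a binary outcome (0/1).
import numpy as np
from sklearn.linear_model import LogisticRegression

# invented predictors: [age, family_history (0/1)]; outcome: recurrence (0/1)
X = np.array([[45, 0], [50, 1], [61, 1], [38, 0], [66, 1],
              [55, 0], [70, 1], [42, 0], [58, 1], [35, 0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
odds_ratios = np.exp(model.coef_[0])  # exponentiated coefficients = odds ratios
print(dict(zip(["age", "family_history"], odds_ratios)))
print(model.predict([[60, 1]]))  # predicted class for a new patient
```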
Performance Summary
Results must make sense. Statistically significant doesn't equal
clinically significant, and correlation doesn't equal causation
Must take into account whether sensitivity or specificity, or both, is
the most important goal for your model (a small sketch computing
both appears below).
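As a small sketch of that trade-off, sensitivity and specificity can be computed directly from confusion-matrix counts; the counts below are invented:

```python
# Sensitivity and specificity from confusion-matrix counts.
tp, fn = 80, 20  # diseased patients: detected vs. missed
tn, fp = 90, 10  # healthy patients: correctly ruled out vs. false alarms

sensitivity = tp / (tp + fn)  # share of true disease detected -> 0.8
specificity = tn / (tn + fp)  # share of healthy correctly ruled out -> 0.9
print(sensitivity, specificity)
```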
Big Data
Data so large it can’t be analyzed with one computer
Common definition of big data using the Five Vs:
Volume: massive size of data
Velocity: rapid generation
Variety: structured, unstructured or mixed
Veracity: may be “messy” with missing data, but with
huge volumes this becomes less important
Value: big data must have value, in order for it to become information and
knowledge
Certifications:
Certified Analytics Professional (CAP): for professionals with 5 years of
experience and a bachelor's degree, or 3 years' experience with a master's.
Certified Health Data Analyst (CHDA) is hosted by AHIMA with eligibility
stated on their web site
Coursera (2018) included 1103 low-cost courses related to data
science
Multiple Data Science Centers have also appeared across the US in
the past several years.
Future Trends
Look for more automation of analytics, such that the average
healthcare worker can perform predictive analytics
Look for more analytics embedded in cloud storage and other
platforms
Look for more low-cost courses covering data science
Conclusions
Data science is an important umbrella term that incorporates many
current data-related activities
All industries have been clamoring for more experts in data
analytics/data science
Until we train enough data scientists, we will have to rely on a team
of skilled healthcare workers who can digest and interpret
complicated data sets
Questions
- Define the field of data science
- Enumerate the general requirements for data science expertise
- Differentiate between modeling, using statistics and machine
learning
- Describe the general steps from data wrangling to data
presentation
- Discuss the characteristics of big data
- Provide examples of how data analytics is assisting healthcare
- Discuss the challenges facing healthcare or biomedical data analytics