Topic 5: Healthcare Data Analytics

One of the promises of the growing clinical data in electronic health record (EHR) systems is secondary use (or re-use) of the data for other purposes, such as quality improvement and clinical research.
Interest in healthcare data has grown exponentially due to EHR incentives after the HITECH Act and the addition of genomic information that will eventually be integrated with EHRs.
The term analytics is achieving wide use both in and out of healthcare. A leader in the field defines analytics as “the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions.”
IBM defines analytics as “the systematic use of data and related business insights developed through applied analytical disciplines to drive fact-based decision making for planning, management, measurement and learning.”

Different Types of Analytics


The three types below offer increasing functionality and value:
Descriptive – standard types of reporting that describe current situations and problems (e.g., how many uninsured patients do we have with type 2 diabetes?)
Predictive – simulation and modeling techniques that identify trends and portend outcomes of actions taken (e.g., can we predict who will be readmitted for heart failure in the next 30 days?)
Prescriptive – optimizing clinical, financial, and other outcomes (e.g., of those patients identified as high risk for readmission for heart failure, is it more cost-effective to case manage in the hospital or at home?)
Analytics Concepts
Machine learning is the area of computer science that aims to build
systems and algorithms that learn from data
Data mining is defined as the processing and modeling of large
amounts of data to discover previously unknown patterns or
relationships
Text mining, a sub-area, applies data mining techniques to mostly
unstructured textual data
Provenance refers to where the data originated and how trustworthy it is for large-scale processing and analysis
Business intelligence, which in healthcare refers to the “processes
and technologies used to obtain timely, valuable insights into
business and clinical data”
Learning health system, where data can be used for continuous
learning to allow the healthcare system to better carry out disease
surveillance and response, targeting of healthcare services, improving
decision-making, managing
misinformation, reducing harm, avoiding costly errors, and advancing
clinical research

Big Data
Another related term is big data, which describes large and ever-
increasing volumes of data that adhere to the following attributes:
Volume – ever-increasing amounts
Velocity – quickly generated
Variety – many different types
Veracity – from trustable sources
While big data is considered a buzz word by some, we are having to
deal with terabytes and petabytes of information today. With the
addition of genomics big data will escalate.
Healthcare organizations are generating an ever-increasing amount of data. In all healthcare organizations, clinical data takes a variety of forms, from structured (e.g., lab results) to unstructured (e.g., images and textual notes, including clinical narratives, reports, and other types of documents).
For example, Kaiser Permanente estimated in 2013 that the data store for its 9+ million members exceeded 30 petabytes (1 petabyte = 1,024 terabytes) of data.
Another example is CancerLinQ that will provide a comprehensive
system for clinicians and researchers consisting of EHR data
collection, application of clinical decision support, data mining and
visualization, and quality feedback.
Lastly, IBM’s Watson is now focusing on healthcare, specifically
Oncology so that massive amounts of cancer information/research
can be analyzed and applied to individual patient decision making.

The Analytics Big Data Pipeline


According to Kumar et al.:
 One begins with multiple data sources that are extracted, cleansed, and normalized
 Statistical processing prepares the data for output
 Finally, the data helps generate descriptive, predictive and prescriptive analytics
Big Data Will Drive ACOs
Accountable care organizations (ACOs) provide incentives to deliver
high-quality care in cost-efficient ways that will require a robust IT
architecture, health information exchange (HIE) plus analytics. This
approach would be used to predict and quickly act on excess costs
As one pundit put it: ACOs = HIE + Analytics

Challenges to Data Analytics


- Data generated in the routine care of patients may be limited in
its use for analytical purposes. For example, data may be
inaccurate or incomplete. It may be transformed in ways that
undermine its meaning (e.g., coding for billing priorities)
- It may exhibit the well-known statistical phenomenon of censoring, i.e., the first instance of disease in the record may not be when it was first manifested (left censoring), or the data source may not cover a sufficiently long time interval (right censoring)
- Data may also incompletely adhere to well-known standards,
which makes combining it from different sources more difficult
- Clinical data mostly allows observational and not experimental studies, thus raising issues of cause-and-effect for the findings discovered
- Research questions asked of the data tend to be driven by what
can be answered, as opposed to prospective hypotheses
- Data are not always as objective as one might like, and “bigger”
is not necessarily better
- There are ethical concerns over how the data of individuals is
used, the means by which it is collected, and the possible divide
between those who have access to data and those who do not
- Who owns the data and who can use it?

Research and Application of Analytics


o There is an emerging base of research that demonstrates how
data from operational clinical systems can be used to identify
critical situations or patients whose costs are outliers.
o There is less research, however, demonstrating how this data
can be put to use to actually improve clinical outcomes or
reduce costs. Studies using EHR data for clinical prediction have
been proliferating.
o One common area of focus has been the use of data analytics to
identify patients at risk for hospital readmission within 30 days
of discharge. The importance of this factor comes from the US
Centers for Medicare and Medicaid Services (CMS)
Readmissions Reduction Program that penalizes hospitals for
excessive numbers of readmissions.
o This has led to research using EHR data to predict hospital
readmissions. Thus far, the results are mixed and several
examples of trials are included in the textbook chapter.

Research and Application of Analytics

Scenarios for EHR Data Analysis
- Predicting 30-day risk of readmission and death among HIV-
infected inpatients
- Identification of children with asthma
- Risk-adjusting hospital mortality rates
- Detecting postoperative complications
- Measuring processes of care
- Determining five-year life expectancy
- Detecting potential delays in cancer diagnosis
- Identifying patients with cirrhosis at high risk for readmission
- Predicting out of intensive care unit cardiopulmonary arrest or
death

Research and Application of Analytics


Identifying Patients for Research Using EHR Data
 Identifying patients who might be eligible for participation in
clinical studies
 Determining eligibility for clinical trials
 Identifying patients with diabetes and the earliest date of
diagnosis
 Predicting diagnosis in new patients

Research and Application of Analytics


Use EHR Data to Replicate Randomized Controlled Trials
- Virtual Data Warehouse (VDW) Project was able to demonstrate
a link between childhood obesity and hyperglycemia in
pregnancy
- United Kingdom General Practice Research Database (UKGPRD),
a repository of longitudinal records of general practitioners, was
able to demonstrate the ability to replicate the findings of the
Women’s Health Initiative and RCTs of other cardiovascular
diseases
- Other data repositories have helped to predict a variety of
cancers, risk for venous thromboembolism (blood clots) and
even rare medical disorders
- Note the info box in the next slide that discusses data analytics
by the Veterans Health Administration (VHA)

Research and Application of Analytics


Using Genomic Information and EHRs
 Researchers have carried out genome-wide association studies
(GWAS) that associate specific findings from the EHR (the
“phenotype”) with the growing amount of genomic and related
data (the “genotype”) in the Electronic Medical Records and
Genomics (eMERGE) Network
 eMERGE has demonstrated the ability to identify genomic
variants associated with atrioventricular conduction
abnormalities, red blood cell traits, white blood cell count
abnormalities, and thyroid disorders
 More recent work has “inverted” the paradigm to carry out phenome-wide association studies (PheWAS) that associate multiple phenotypes with varying genotypes
 Genome-wide and phenome-wide association studies are also
discussed in the chapter on bioinformatics
Role of Informaticians in Analytics
 There has been little focus on the human experts who will carry
out analytics, to say nothing of those who will support their
efforts in building systems to capture data, put it into usable
form, and apply the results of analysis
 Where will these workers come from and what will be the
education of those who work in this emerging area, that some
call data science?
 We do know that data analytics experts are in high demand
 From basic biomedical scientists to clinicians and public health
workers, those who are researchers and practitioners are
drowning in data, needing tools and techniques to allow its use
in meaningful and actionable ways
 Dr. Hersh believes that a strong background in Health
Informatics or Biomedical Informatics is the best preparation for
the healthcare data analytics field
 Data science is more than statistics or computer science applied
in a specific subject domain. It requires an understanding of
data, its varying types, and how to manipulate and leverage it
 The field requires skills in machine learning, a strong foundation
in statistics (especially Bayesian), computer science
(representation and manipulation of data), and knowledge of
correlation and causation (modeling)

The Need for Data Analytics Experts


A report by McKinsey consulting states that there will soon be a need in the US for 140,000-190,000 individuals who have “deep analytical talent” and an additional 1.5 million “data-savvy managers needed to take full advantage of big data”
An analysis by SAS estimated that by 2018 there would be over 6,400 organizations hiring 100 or more analytics staff
Another report found that data scientists currently comprise
less than 1% of all big data positions, with more common job
roles consisting of developers (42% of advertised positions),
architects (10%), analysts (8%) and administrators (6%)
The technical skills most commonly required for big data
positions as a whole were NoSQL, Oracle, Java and SQL
PricewaterhouseCoopers noted that healthcare organizations
need to acquire talent in systems and data integration, data
statistics and analytics, technology and architecture support,
and clinical informatics
Business knowledge is also useful

The Need for Data Analytics Experts


What Skill Sets Should Universities Train For?
 Programming - especially with data-oriented tools, such as SQL
and statistical programming languages
 Statistics - working knowledge to apply tools and techniques
 Domain knowledge - depending on one's area of work,
bioscience or health care
 Communication - being able to understand needs of people and
organizations and articulate results back to them

Conclusions
- Healthcare data has proliferated greatly, in large part due to the
accelerated adoption of EHRs
- Analytic platforms will examine data from multiple sources, such
as clinical records, genomic data, financial systems, and
administrative systems
- Analytics is necessary to transform data to information and
knowledge
- Accountable care organizations and other new models of healthcare
delivery will rely heavily on analytics to analyze financial and clinical
data
- There is a great demand for skilled data analysts in healthcare;
expertise in informatics will be important for such individuals
Questions
 Discuss the difference between descriptive, predictive and prescriptive
analytics
 Describe the characteristics of “Big Data”
 Enumerate the necessary skills for a worker in the data analytics field
 List the limitations of healthcare data analytics
 Discuss the critical role electronic health records play in healthcare data
analytics
Topic 6: Clinical Decision Support

Introduction
Definition: “Clinical decision support (CDS) provides clinicians, staff,
patients or other individuals with knowledge and person-specific
information, intelligently filtered or presented at appropriate times, to
enhance health and health care.” (ONC)
Keep in mind that any resource that aids in decision making
should be considered CDS. We will only consider electronic
CDS.
We define clinical decision support systems (CDSSs) as the
technology that supports CDS

Early on, CDS was thought of only in terms of reminders and alerts.
Now we must include diagnostic help, cost reminders, calculators,
etc.
In spite of the fact that we can use the Internet’s potent search
engines to answer questions, many organizations promote CDS as a
major strategy to improve patient safety
Most CDS strategies involve the 5 rights

Five Rights of CDS


The right information (what): should be based on the highest level of
evidence possible and adequately referenced.
To the right person (who): the person who is making the clinical
decision, the physician, the patient or some other team member
In the right format (how): should the information appear as part of an
alert, reminder, infobutton or order set?
Through the right channel (where): should the information be available
as an EHR alert, a text message, email alert, etc.?
At the right time (when): new information, particularly in the format of an alert, should appear early in the order entry process so clinicians are aware of an issue before they complete the task

Historical perspective
- As early as the 1950s scientists predicted computers would aid
medical decision making
- CDS programs appeared in the 1970s and were standalone
programs that eventually became inactive
- De Dombal’s system for acute abdominal pain: used Bayes
theorem to suggest differential diagnoses
- Internist-1: CDS program that used IF-THEN statements to
predict diagnoses
- Mycin: rule-based system to suggest diagnosis and treatment of
infections
- DXplain: 1984 program that used clinical findings to list possible diagnoses. Now a commercial product
- QMR: began as Internist-1 for diagnoses and ended in 2001
- HELP: began in the 1980s at the University of Utah; it includes diagnostic advice, references and clinical practice guidelines
- Iliad: diagnostic program, also developed by the University of
Utah in the 1980s
- Isabel: commercial differential diagnosis tool with information inputted as free text or from the EHR. Its inference engine uses natural language processing and is supported by 100,000 documents
- SimulConsult: diagnostic program based on Bayes probabilities.
- Predictions can also include clinical and genetic information
- SnapDx: free mobile app that performs diagnostic CDS for
clinicians. It is based on positive and negative likelihood ratios
from medical literature. App covers about 50 common medical
scenarios
Supporting Organizations
 Institute of Medicine (IOM): they promoted “automated clinical
information and CDS”
 AMIA: developed 3 pillars of CDS in 2006—best available
evidence, high adoption and effective use and continuous
improvement.
 ONC: has funded research to promote excellent CDS and sharing
possibilities
 AHRQ: also funded multiple CDS research projects and initiatives
 HL7: has a CDS working group and developed FHIR standards,
discussed later
 National Quality Forum (NQF): developed a CDS taxonomy
 Leapfrog: they have promoted both CPOE and CDS
 HIMSS: their EMR Adoption Model rates EMRs on stages 0-7; full use of CDS qualifies as stage 6
 CMS: Meaningful Use, Stage 1 and 2 includes CDS measures

CDS Methodology
Two phases of CDS: knowledge use and knowledge management

 Knowledge Use. Involves these sequential steps:


o Triggers are an event, such as an order for a medication >>
o Input data refers to information within, for example the EHR,
that might include patient allergies
o Interventions are the CDS actions such as displayed alerts >>
o Action steps might be overriding the alert or canceling an order
for a drug to which the patient is allergic
 Knowledge management involves:
o Knowledge acquisition: acquire expert internal (e.g. EHR data) or external data (e.g. APACHE scores) for CDS
o Knowledge representation: use expert information, integrate it with an inference engine and communicate it to the end user, e.g. as an alert
o Knowledge maintenance: keep the expert evidence-based content up to date (discussed further below)

CDS Methodology
 Knowledge representation:
o Configuration: knowledge is represented by choices made by the
institution
o Table-based: rules are stored in tables, such that if a current drug
on a patient is in one row and an order for a second inappropriate
drug is stored in the same row, an alert is triggered for the
clinician
o Rules-based: the knowledge base has IF-THEN statements; if the patient is allergic to sulfa and sulfa is ordered, then an alert is triggered. Earlier CDS programs, such as Mycin, were rule based
 Knowledge representation
o Bayesian networks: based on Bayes’ theorem of conditional probabilities, this approach predicts the future (posterior) probability based on the pre-test probability or prevalence. Although it assumes the findings (such as signs and symptoms) are independent, the Bayesian approach works very well and is commonly employed in medicine. The formula is included below
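The slide's formula did not survive extraction; a standard statement of Bayes' theorem, written here in plain notation, is:

P(disease | finding) = P(finding | disease) × P(disease) / P(finding)

where P(disease) is the pre-test probability (prevalence) and P(disease | finding) is the post-test (posterior) probability once the finding is known.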
CDS Methodology
- The previous knowledge representation methods were based on known rules and data, so they are labelled “knowledge-based CDS”. If CDS is based on data mining-related techniques, it is referred to as “non-knowledge-based CDS”
- Data mining (machine learning) algorithms have to be
developed and validated ahead of actual implementation. This
approach is divided into supervised and unsupervised learning
- Supervised learning: assumes the user knows the categories of data that exist, such as gender, diagnoses, age, etc. If the target (outcome or dependent variable) is categorical (nominal, such as lived or died), the approach is called a classification model. If the target is numerical (such as size of tumor, income, etc.), then this is a regression model
- Supervised learning:
- Neural networks: configured like human neurons. The model is trained until its output is close to the desired target. This approach is not intuitive and requires great expertise.
- Supervised learning:
- Logistic regression: in spite of the name regression, it is most
commonly used where the desired output/target is binary (cancer
recurrence, no cancer recurrence).
Multiple predictors are inputted, such as age, gender, family history,
etc. and odds ratios are generated. This is the gold standard for much
of predictive analytics.
- Decision trees: can perform classification or regression and are the easiest to understand and visualize. Trees are used by both statisticians and machine learning programs. A classic teaching example is a contact lens decision tree.
- Unsupervised learning: means data is analyzed without first
knowing the classes of data to look for new patterns of interest.
This has been hugely important in looking at genetic data sets.

- Cluster analysis is one of the most common ways to analyze large data
sets for undiscovered trends. It is also more complex, requiring more
expertise
- Association algorithms look for relationships of interest
- Knowledge maintenance: means there is a need to constantly
update expert evidence-based information. This task is difficult
and may fall to a CDS committee or technology vendor

CDS Standards
- CDS developers have struggled for a long time with how to
share knowledge representation with others or how to modify
rules locally. Standards were developed to try to overcome
these obstacles:

- Arden syntax: represented by medical logic modules (MLMs) that encode decision information. Ironically, the information can’t be shared because institution-specific coding resides within curly braces {} in the MLM. This approach was doomed by what is known as the “curly brace problem”
- GELLO: can query EHRs for data to create decision criteria. Part
of HL7 v. 3
- GEM: permits clinical practice guidelines to be shared in an XML
format, as an ASTM standard
- GLIF: enables sharable and computable guidelines
- CQL: draft HL7 standard to be used in XML format for electronic
clinical quality measures (eCQMs)
- Infobuttons: can be placed in workflow where decisions are
made with recommendations
- Fast Healthcare Interoperability Resources (FHIR): developed by HL7; there is great hope that this standard will solve many interoperability issues.
- It is a RESTful API (the style Google uses) that uses either JSON or XML for data representation
- It is data- and not document-centric, so a clinician could place an HTTP request to retrieve just a lab value from EHR B, instead of, e.g., an entire CCDA document
- An EHR can also request decision support from software on a CDS server
- Approximately 95 resources have been developed to handle the most common clinical data issues
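A minimal sketch of that data-centric style in Python, assuming the requests library is installed; the server URL and patient ID are hypothetical, and a real server would also require authentication:

import requests

# Hypothetical FHIR R4 server and patient ID (illustration only)
base = "https://fhir.example.org/R4"
# Ask for a handful of laboratory Observations for one patient as JSON,
# rather than retrieving an entire document such as a CCDA
resp = requests.get(
    f"{base}/Observation",
    params={"patient": "12345", "category": "laboratory", "_count": 5},
    headers={"Accept": "application/fhir+json"},
)
bundle = resp.json()
for entry in bundle.get("entry", []):
    obs = entry["resource"]
    print(obs["code"]["text"], obs.get("valueQuantity", {}).get("value"))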

CDS Functionality
- CDSSs can be classified in multiple ways:
o Knowledge and non-knowledge-based systems
o Internal or external to the EHR
o Activation before, during or after a patient encounter
o Activated automatically or on demand
o Alerts can be interruptive or non-interruptive
CDS Functionality
- Ordering facilitators:
- Order sets are EHR templated commercial or home grown orders that
are modified to follow national practice guidelines. For example, a
patient with a suspected heart attack has orders that automatically
include aspirin, oxygen, EKG, etc.
- Therapeutic support includes commercial products such as
TheradocⓇ and calculators for a variety of medical conditions
- Smart forms are templated forms, generally used for specific conditions such as diabetes. They can include simple checkboxes with evidence-based recommendations
- Alerts and reminders are the classic CDS output that usually reminds
clinicians about drug allergies, drug to drug interactions and
preventive medicine reminders.
- Relevant information displays: infobuttons, hyperlinks and mouseovers are common methods to connect to evidence-based information
- Diagnostic support: most diagnostic support is external and not
integrated with the EHR; such as SimulConsult
- Dashboards: can be at the patient level as well as the population level, so they can summarize a patient’s status and inform the clinician about multiple aspects of the patient at once

CDS Sharing
Currently, there is no single method by which CDS knowledge can be universally shared. The approach has been either to use standards to share the knowledge or to host CDS on a shared external server
Socratic Grid and OpenCDS are open-source web services platforms
that support CDS
The FHIR standard appears to have the greatest chance for success, but it is still too early to know.
CDS Challenges
 General: exploding medical information that is complicated and
evolving. Tough to write rules
 Organizational support: CDS must be supported by leadership, IT and clinical staff. Currently, only large healthcare organizations can create robust CDSSs
 Lack of a clear business case: evidence shows CDS helps improve processes, but it is unclear if it affects behavior and patient outcomes. Therefore, there may not be a strong business case to invest in CDSSs
 Unintended consequences: alert fatigue
 Medico-legal: adhering to or defying alerts has legal implications; product liability for EHR vendors
 Clinical: must fit clinician workflow and fit the 5 Rights
 Technical: complex CDS requires an expert IT team
 Lack of interoperability: must be solved for CDS to succeed
 Long term CDS benefits: requires long term commitment and
proof of benefit to be durable
Future Trends
- The future of Meaningful Use is unclear so there is no obvious
CDS business case for clinicians, hospitals and vendors
- If the FHIR standard makes interoperability easier we may see
new CDS innovations and improved adoption

Conclusions
- CDS could potentially assist with clinical decision making in
multiple areas
- While there is widespread support for CDS, there are a
multitude of challenges
- CDS is primarily achieved by larger healthcare systems
- The evidence so far suggests that CDS improves patient
processes and to a lesser degree clinical outcomes

Questions
 Define electronic clinical decision support (CDS)
 Enumerate the goals and potential benefits of CDS
 Discuss the government and private organizations supporting
CDS
 Discuss CDS taxonomy, functionality and interoperability
 List the challenges associated with CDS
 Enumerate CDS implementation steps and lessons learned
Topic 7: Healthcare Safety, Quality, and Ethics
Patient Safety-Related Definitions
Safety: minimization of the risk and occurrence of patient harm events
Harm: inappropriate or avoidable psychological or physical injury to patient and/or
family
Adverse Events: “an injury resulting from a medical intervention”
Preventable Adverse Events: “errors that result in an adverse event that are
preventable”
Overuse: “the delivery of care of little or no value” e.g. widespread use of antibiotics
for viral infections
Underuse: “the failure to deliver appropriate care” e.g. vaccines or cancer screening
Misuse: “the use of certain services in situations where they are not clinically indicated” e.g.
MRI for routine low back pain

Introduction
Medical errors are unfortunately common in healthcare, in spite of sophisticated hospitals and well-trained clinicians
Often the cause is a breakdown in protocol and communication, not individual error
Technology has potential to reduce medical errors (particularly
medication errors) by:
- Improving communication between physicians and patients
- Improving clinical decision support
- Decreasing diagnostic errors
Unfortunately, technology also has the potential to create unique new
errors that cause harm

Medical Errors
Errors can be related to diagnosis, treatment and preventive care.
Furthermore, medical errors can be errors of commission or omission
and fortunately not all errors result in an injury and not all medical errors
are preventable
Most common outpatient errors:
Prescribing medications
Getting the correct laboratory test for the correct patient at the correct
time
Filing system errors
Dispensing medications and responding to abnormal test results
While many would argue that treatment errors are the most common
category of medical errors, diagnostic errors accounted for the largest
percentage of malpractice claims, surpassing treatment errors in one
study
Diagnostic errors can result from missed, wrong or delayed diagnoses
and are more likely in the outpatient setting. This is somewhat
surprising given the fact that US physicians tend to practice
“defensive medicine”
Over-diagnosis may also cause medical errors but this has been less
well studied

Quality, Safety and Value


Unsafe healthcare lowers quality but safe medicine is not always high
quality
From the National Academy of Medicine’s perspective, quality is a set
of six aspirational goals: medical care should be safe, effective,
timely, efficient, patient- centered, and equitable
Value relates to how important something is to us: Is it cost-effective? Is it necessary? Does it affect morbidity, mortality or quality of life?

Unsafe Actions
Most adverse events result from unsafe actions or inactions by
anyone on the healthcare team, including the patient
Missed care is “any aspect of required care that is omitted either in
part or in whole or delayed”
Many of the above go unreported

Reporting Unsafe Actions


Most near-miss events are not reported. Many are not witnessed
The tendency is to blame the individual, but healthcare is complex and there are often “system errors”
Most safety systems are retrospective; we need to move to be
proactive
We need good data, such as the number of detected unsafe actions divided by the opportunities for an unsafe action, over a specified time interval

Patient Safety Systems


Patient Safety Reporting System: the event is recorded and, if it is a sentinel event, it is investigated. Most systems are not integrated with the EHR
Root Cause Analysis: common approach to determine the cause of an adverse event. This has limitations
HEDIS measures can help track quality issues
Current reimbursement models mandate quality measures, e.g. the Medicare Patient Safety Monitoring System, now operated by AHRQ. The new system is known as the Quality and Safety Review System. It is still labor intensive and manual
Global Trigger Tool: evaluates hospital safety and is said to detect 90% of adverse events. Ten discharge records are selected and two reviewers review each chart for any of the 53 “triggers”
Using the EHR to Improve Safety,
Quality and Value
Paper records have multiple disadvantages, as pointed out in the EHR
chapter
Expectations have been very high regarding the EHR’s impact on
safety, quality and value
Unfortunately, results have been mixed and there has not been a
prospective study conducted to prove the
EHR’s benefit towards safety and quality

Clinical Decision Support


 High expectations that CDS that is part of EHRs will improve
safety
 As per multiple chapters in the textbook, CDS has mixed
reviews, in terms of safety and quality
 Adverse events regarding CDS include “alert fatigue”
 The FDA will regulate software that is related to treatment and
decision making

Clinician’s Issues with EHRs


- Results in altered workflow and decreased efficiency
- Physicians are staying late to complete notes in the EHR
- In an effort to save time, physicians may “cut and paste” old histories into the EHR, creating new problems
- EHRs may create new safety issues (“e-iatrogenesis”). Because of the multiple issues, it is very common to see offices and hospitals change EHRs, not always solving the problem
- Roughly 2/3 of EHR data is unstructured (free text), so it is not computable
- While natural language processing (NLP) may help solve this, we are a long way from resolution
- Multiple open source and commercial NLP programs exist, but they require a great deal of time and expertise to match the results a manual chart review would produce

Technologies with Potential to Decrease Medication Errors

Computerized provider order entry (CPOE) benefits:


o Improved handwriting identification
o Reduced time to arrive in the pharmacy
o Fewer errors related to similar drug names
o Easier to integrate with other IT systems
o Easier to link to drug-drug interactions
o More likely to identify the prescriber
o Available for immediate analysis
o Can link to clinical decision support to recommend drugs of choice
o Jury still out on actual reduction of serious ADEs

Health Information Exchange (HIE): improves patient safety through better communication between disparate healthcare participants
Automated Dispensing Cabinets (ADCs): like ATM machines for medications on a ward
Home Electronic Medication Management System: home dispensing, particularly for the elderly or non-compliant patient
Pharmacy Dispensing Robots: bottles are filled automatically
Electronic Medication Administration Record (eMAR): electronic record of medications that is integrated with the EHR and pharmacy
Intravenous (IV) Infusion Pumps: regulate IV drug dosing accurately
Bar Coding Medication Administration: the patient, drug and nurse all have a barcoded identity, and these must all match for the drug to be given without any alerts. Bar codes are inexpensive, but the software and other components are expensive. Some healthcare systems have shown a significant reduction in medication administration errors, but many of these were minor and would not have resulted in serious harm.

Medication Reconciliation
When patients transition from hospital to hospital, from physician to physician, or from floor to floor, medication errors are more likely to occur
Joint Commission mandated hospitals must reconcile a list of patient
medications on admission, transfer and discharge
Task may be facilitated with EHR but still confusion may exist if there are
multiple physicians, multiple pharmacies, poor compliance or dementia
Barriers to Improving
Patient Safety through Technology
Organizational: health systems leadership must develop a strong
“culture of safety”
Financial: Cost for multiple sophisticated HIT systems is considerable
Error reporting: is voluntary and inadequate and usually “after the
fact”

Unintended Consequences
Technology may reduce medical errors but create new ones:
1. Medical alarm fatigue
2. Infusion Pump errors
3. Distractions related to mobile devices
4. Electronic health records: data can be missing and/or incorrect,
there can be typographical entry errors, and older information is
sometimes copied and pasted into the current record.

Health Informatics Ethics


Informatics Ethics
The Nuremberg Code
Related to the Holocaust (death of 11 million people by the Nazis)
Medical crimes against humanity were committed
Code established voluntary consent and right to withdraw from
experiment and right to qualified medical experimenter
World Medical Association’s (WMA) Declaration of Helsinki
Added the right to privacy and confidentiality of personal information of
research subjects to the Nuremberg Code
International Medical Informatics Association’s (IMIA) Code of Ethics. Very
expansive.

International Considerations:
Ethics, Laws and Culture
Influenced by a country’s laws and culture
The relationship between ethics, law, culture and society is unclear, is
not fixed internationally, and may be fluid even within a given
country over time

Three Different Views of Ethics


 Ethics does not exist outside the law, and exists only for the
good of a properly ordered and legal society
 Ethics is usually strongly informed by the law, society, and the prevailing culture, and is an extension of these
 Ethics exists entirely outside of the law, and is a matter of
personal conscience. Where there is conflict, the ethical
viewpoint must prevail

Pertinent Ethical Principles


Right to privacy
Guard against excessive personal data collection
Security of data
Integrity of data; it must be kept current and accurate
Informed consent for patients
Awareness of existing laws
Medical ethics applies to health informatics ethics
Sharing data only when appropriate
Clinicians have broad responsibilities towards entire community
Clinicians must practice beneficence
This responsibility cannot be transferred
Difficulties Applying Medical Ethics
in the Digital World
How to obtain informed consent for the use of patient data in
large databases?
Obtain broad informed consent
One should guard against corporate ownership of databases
Research on electronic postings: privacy and disclosure depends on
which model is adopted
Human subject model – extension of the medical view
Textual object model – only rules of plagiarism and copyright apply

Challenges in Transferring Ethical Responsibility
Researchers must obey the law, but laws do not establish ethics
Submit a protocol to Ethics Committee or an Institutional
Review Board (IRB) but members may not be familiar with subtleties
of health informatics
Data can be kept secure by transferring responsibility to a database manager who takes full responsibility, but ultimately the researcher is still likely to be responsible

Electronic Communication with Patients and Caregivers
American Medical Association’s (AMA) guidelines provide medico-legal advice:
 Make patient aware of who is reading the email


 Delineate types of email topics that are acceptable
 Use of appropriate language
 Provide tips for patients to ensure they can quickly reference relevant
emails
 Do not use email communication with new patients
Measures to Ensure
Documents Are Understood
 Flesch Reading Ease Test
Assigns a value of 1 (most difficult) to 100 (easy)
 Flesch-Kincaid Test
Assigns a number that corresponds to a US school grade (1–14)
 Microsoft Office Word. Under Options >> Proofing
Provides a readability score based on Flesch Reading Ease and Flesch-Kincaid Grade Level

Simple Data Protection


 Encryption programs to encrypt hard drive, folders or files
TrueCrypt – free software www.truecrypt.org
 Password and document encryption protection
 Anti-virus programs
 Anti-spyware and malware software
 Erase computer hard drives before discarding
 Consider using encrypted email with programs (plug ins) such as
Mailvelope
Limiting Collection of Visitor Data to
Your Website
Most web sites use tracking cookies or tracking tools that are used
without consent or even notification
Ideally should obtain consent and state clearly
What information will be gathered?
How will it be stored and secured?
With whom will it be shared?
For how long will it be kept and then destroyed?

Health Informatics Ethics and Medical Students

 Students should be careful about online comments and photographs of themselves, colleagues and patients on social networks
 Care in the use of mobile devices with cameras
 For all research projects, big or small, follow IRB guidance
 Avoid plagiarism
 Avoid paper mills
 Manipulation of electronic files. Ensure copyright is not violated
 Avoid recording of lectures without consent
 Avoid using pirated digital files
 Avoid accessing documents illegally

Conclusions
Patient safety continues to be an ongoing problem with too many
medical errors reported yearly
Multiple organizations are reporting patient safety data transparently
to hopefully support change
There is a great expectation that HIT will improve patient quality
which in turn will decrease medical errors
There is some evidence that clinical decision support reduces errors,
but studies overall are mixed
Leadership must establish a “culture of safety” to effectively achieve
improvement in patient safety
Health informatics ethics stems from medical ethics
The IMIA Code of Ethics contains guidelines for multiple categories
The relationship between ethics, law, culture and society is fluid and
must be monitored
The pertinent ethical principles are: right to privacy, guarding against
excess, security and integrity of data, informed consent, data sharing,
beneficence and non-maleficence and non-transferability of
responsibility

Questions
 Define safety, quality, near miss, and unsafe action
 List the safety and quality factors that justified the clinical implementation
of electronic health record systems
 Discuss three reasons why the electronic health record is central to safety,
quality, and value
 List three issues that clinicians have with the current electronic health
record systems and discuss how these problems affect safety and quality
 Describe a specific electronic patient safety measurement system and a
specific electronic safety reporting system
 Describe two integrated clinical decision support systems and discuss how they may improve safety and quality
 Describe the 20th century medical and computing background to
health informatics ethics
 Identify the main sections of the IMIA Code of Ethics for Health
Information Professionals
 Describe the complexities in the relationship between ethics, law,
culture and society
 Describe different views of ethics in different countries
 Summarize the most pertinent principles in health informatics ethics
 Discuss the application of health informatics ethics to research into
pertinent areas of health informatics
 Discuss appropriate health informatics behavior by medical students
Topic 8: Introduction to Data Science
Introduction
Data are ubiquitous, coming from multiple industries, in multiple sizes and formats, with varying complexity
Business domain seems to have led the way with analytics to
determine potential customer loss (churn) and who will buy item B,
after buying item A (market basket analysis)
Data science is a very convenient umbrella term
Our attention will only be on healthcare data

Definitions
Data science “means the scientific study of the creation, validation
and transformation of data to create meaning.”
Because data science is relatively new, definitions are still evolving.
Data analytics is “the discovery and communication of meaningful
patterns in data.” While some would argue for separating data
analytics from data mining and knowledge discovery from data (KDD),
we will use the terms interchangeably.

Background
The term data science was first used in a publication in 2001; however, a small group of statisticians had been talking about expanding the scope of statistics as far back as 1962
The reality is that statistics is too narrow a field to evaluate all aspects of
the avalanche of data currently available.
Importantly, data science must consider the alternative approach of
machine learning developed by computer scientists
Cutting edge businesses such as Google, Facebook and LinkedIn have
employed data scientists for many years to do innovative things with their
data
There are now data dealing with geolocation, surveys,
sensors, images, social media and so forth
Data science has been greatly aided by the simultaneous
improvement in storage, processor speed, etc.
The above has led to the Big Data era, which we will cover later
In spite of some hype associated with data science, there is clearly
tremendous interest by all industries and great demand for data
scientists

Venn diagram of data science
• One could argue that data science is the combination of multiple
sub-fields
• The reality is that machine learning and math/statistics have a lot in
common and both should be taught
Artificial intelligence (AI) is the ability of a computer or a robot
controlled by a computer to do tasks that are usually done by humans
because they require human intelligence and discernment. Specific
applications of AI include expert systems, natural language
processing, speech recognition and machine vision.

Data mining, also referred to as knowledge discovery, is the process of identifying and discovering useful insights from significant volumes of data that are stored in data warehouses and databases. It is done for decision-making processes in businesses.

Machine learning is an application of artificial intelligence (AI) that


provides systems the ability to automatically learn and improve from
experience without being explicitly programmed. Machine learning
focuses on the development of computer programs that can access
data and use it to learn for themselves. The primary aim is to allow computers to learn automatically without human intervention or assistance and adjust actions accordingly.

Pattern recognition is a data analysis method that uses machine


learning algorithms to automatically recognize patterns and
regularities in data. This data can be anything from text and images to
sounds or other definable qualities. Pattern recognition systems can
recognize familiar patterns quickly and accurately. They can also
recognize and classify unfamiliar objects, recognize shapes and
objects from different angles, and identify patterns and objects even
if they’re partially obscured.
Skill sets required of data scientists
 Mathematics and statistics
 Domain expertise e.g. business and healthcare
 Programming in multiple languages: R, Python, SQL, etc.
 Database and data warehousing
 Predictive modeling and descriptive statistics
 Machine learning and algorithms
 Big data
 Communication and presentation

Data Basics
Datum is singular; the term data is plural!
Data is just a number; information is data with meaning and
knowledge is information that is felt to be true
The smallest unit of data is the bit (binary digit), which can be represented as a zero or a one. A byte consists of 8 bits, so 256 possible combinations can be created; for example, 0100 0001 represents the capital letter A
Binary coding like this is very important because computers easily
interpret these binary numbers 0 and 1
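A small illustration of this binary coding in Python (one of the analysis languages discussed later in this topic):

# 0100 0001 in binary is 65 in decimal, the ASCII code for capital A
print(int("01000001", 2))   # 65
print(chr(0b01000001))      # A
print(2 ** 8)               # 256 possible values in one 8-bit byte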
Statistics Basics
- Data can be structured (fits into a database field), unstructured
(free text) or semi-structured (e.g. XML)
- Data can also be classified as nominal (categorical), meaning it is a name and not a number, such as gender. Ordinal data is similar but has order, such as small, medium and large. Interval data is numerical with defined intervals but no meaningful zero, such as Celsius temperature. Ratio data is numerical with a meaningful zero, such as height and weight
- Nominal/ordinal data can also be considered qualitative data,
while interval/ratio data can be considered quantitative data
- Parametric data: tends to follow a normal distribution
- Non-parametric data: follows a non-normal distribution
- Statisticians use non-parametric tests for non-parametric data and parametric tests for parametric data.
- It is important to look at measures of central tendency (mean, median, mode) and of dispersion (standard deviation) to see how the data are distributed.
- Mean is the sum of values divided by the number of values
- Median is the middle value of the distribution
- Mode is the most common value, and the range is the difference between the lowest and highest value
- Standard deviation is a measure of dispersion of the data from the mean. It is calculated by taking the square root of the variance.
- Variance is calculated by subtracting each data value from the mean, squaring the differences, totaling them, then dividing by the number of values minus 1 if you are dealing with a sample (without subtracting 1 if your values came from the entire population)

- Example: if the variance is 7, the standard deviation is √7 ≈ 2.65

When the values come from the entire population (dividing by n rather than n − 1), the calculated variance and standard deviation will be slightly smaller and the distribution curve narrower
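A short Python sketch of the variance and standard deviation calculation just described, using a small made-up sample:

values = [4, 8, 6, 5, 3, 7]                            # made-up sample data
n = len(values)
mean = sum(values) / n
squared_diffs = [(x - mean) ** 2 for x in values]
sample_variance = sum(squared_diffs) / (n - 1)         # divide by n - 1 for a sample
population_variance = sum(squared_diffs) / n           # divide by n for a full population
sample_sd = sample_variance ** 0.5                     # standard deviation = square root of variance
print(round(sample_variance, 2), round(sample_sd, 2))  # 3.5 1.87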
Statistics Basics
Normal distribution: 95% of data falls within 2 standard deviations of the mean; 99.7% falls within 3 standard deviations. Data outside this range may be outliers or errors
(Figure: a non-normal distribution that is skewed to the right, with possible outliers)
Dependent variables are also known as target, outcome, response, label
or class variables. It is the value you are likely trying to predict
Independent variables are also known as predictor or explanatory
variables and these are factors that predict the dependent variable.
For example, you want to predict whether a patient is readmitted (yes or
no)(dependent variable) based on factors such as severity of illness, age,
insurance status, etc. (independent variables)

There are two approaches researchers use to examine data:


Descriptive statistics: looks at mean, median and the distribution curve
Inferential statistics: looks at a sample to infer conclusions about the
population.
Surveys are a common example.
Confidence Intervals and hypothesis testing are used with inferential statistics

Confidence intervals measure the uncertainty of the sample, similar


to margin of error. The most common value is the 95% confidence
interval (CI). It displays a range of numbers and you can be 95%
certain it contains the true mean of the population
The smaller the sample, the wider the CIs.
CIs can be used for numerical and categorical data

Confidence intervals can be used for:


Visual display of the range of values
The precision of the estimate
Comparison of CIs between different studies
Hypothesis testing: values outside the 95% CI may reject the null hypothesis
To power a research study
Hypothesis testing:
The null hypothesis is there is no difference between the mean (μ) of A,
versus the mean of B or Ho: μA = μB. The alternate hypothesis is there is a
difference between the mean of A versus B or HA: μA ≠ μB
Commonly, a p value of < 0.05 is used to determine statistical significance. If the p value is less than 0.05, the findings are unlikely to be due to chance (less than a 1 in 20 chance), so we reject the null hypothesis
Note: statistical significance ≠ clinical significance
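A minimal sketch of this kind of hypothesis test in Python, assuming SciPy is installed and using made-up systolic blood pressure values for groups A and B:

from scipy import stats

group_a = [128, 131, 125, 140, 136, 129, 133]   # made-up values for group A
group_b = [122, 118, 127, 121, 119, 125, 123]   # made-up values for group B

# Two-sample t-test of H0: mean(A) == mean(B)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < 0.05:
    print("Reject the null hypothesis (difference unlikely to be due to chance)")
else:
    print("Fail to reject the null hypothesis")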

Effect size:
Measures the magnitude of the effect and is independent of the sample size
There are a variety of effect sizes calculations available, depending on which
statistical test is run
More and more journals are requesting confidence intervals and effect sizes,
in addition to standard p values

Type I error is observing statistical significance (p < .05) when no real difference actually exists (false positive)
Type II error is concluding there is no effect (the null hypothesis is not rejected) when one really exists (false negative). The most common cause of a type II error is a sample size that is too small
Application Programming Interfaces
APIs are a common means to access, transfer and share data. They create a portal to, for example, a web application or an EHR for users, developers and researchers.
This has great implications regarding interoperability among
disparate technologies
The most common communication data standard is RESTful API, used
by e.g. Google and Amazon
Organizations can have internal APIs for their staff and external APIs
for external customers
For example: Data.gov has an API catalog to access data
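A minimal sketch of calling a RESTful API from Python with the requests library; the endpoint shown is hypothetical, and real services usually also require an API key:

import requests

# Hypothetical RESTful endpoint that returns JSON (illustration only)
url = "https://api.example.gov/hospitals"
resp = requests.get(url, params={"state": "OR", "limit": 10})
resp.raise_for_status()            # raise an error if the request failed
for record in resp.json():         # assume the service returns a JSON list of records
    print(record)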

Data Analytical Processes


In the next slide there is a diagram of the processes or steps used to
analyze data. While the flow appears linear, in reality it is not and
tends to be iterative
The basic steps are to locate raw data >> perform data pre-processing
>> begin exploratory data analysis >> conduct analysis with statistical
modeling, machine learning or programming >> utilize data
visualization and communicate your findings to others
The first step is to have a hypothesis or concern and try to find the
most appropriate data set for analysis
Raw data
Raw data (primary data) may be unstructured and messy
It may require transformation into a computable format

Data pre-processing
Cleaning: correcting or removing data errors
Missing data: a decision must be made about what to do, e.g. deleting vs. imputing values
Integration: combining disparate sources of data into a spreadsheet
or database
Reduction: consolidating categories to reduce attributes
Exploratory Data Analysis (EDA)
Descriptive statistics are used to look at the distribution, mean,
mode, range and standard deviation in order to determine the
optimal statistical method or algorithm to use for analysis
These earlier phases of working with data may take up to 80% of the
overall time required with data!
Data visualization may be used early as well as late

Visualization as part of EDA


Examine non-parametric data using pie and bar charts and parametric
data using box plots and histograms
Next slide shows a boxplot of blood pressure. Note the mean, median
and standard deviation. The whiskers define the range or minimum to
maximum values. The goal is to look for possible skewed data (non-
normal distribution) and outliers
Scatter plots are useful to compare two variables for a potential
linear relationship
Exploratory data analysis
Explore for missing, incorrect and duplicate data
Need strategy for missing data, to include imputation
If data attributes have widely varying ranges, you may need to normalize the data by converting numerical values to a scale from 0 to 1
You can also standardize the data so that the mean is zero with a standard deviation of 1, i.e., convert the data to z-scores (subtract the mean from each value and divide by the standard deviation)
Categorization (binning): you can convert numerical data into categorical data (e.g., age into decades)
Nominal data can be converted to numerical data by dummy coding (male = 0, female = 1)
Variable selection: the set of independent variables can be reduced to determine which ones affect the dependent variable
Transformation: skewed data can be transformed to a more normal
distribution using techniques such as log or Fourier transformations
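A brief pandas sketch of several of the transformations above, using a made-up data frame (assumes pandas is installed):

import pandas as pd

df = pd.DataFrame({"age": [23, 45, 67, 34], "sex": ["male", "female", "female", "male"]})
# Normalize age to a 0-1 scale (min-max scaling)
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
# Standardize age to z-scores (mean 0, standard deviation 1)
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std()
# Categorize (bin) age into decades
df["age_decade"] = (df["age"] // 10) * 10
# Dummy code the nominal variable sex (male = 0, female = 1)
df["sex_code"] = (df["sex"] == "female").astype(int)
print(df)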
Three data analytical approaches
- In reality both statistics and machine learning are based on
mathematics
- Both approaches use decision trees, linear and logistic
regression
- Machine learning has unique algorithms such as neural
networks, k-nearest neighbors, Bayes, etc. Unlike statistics,
users can easily analyze data with multiple algorithms and
determine which produces the best results

Three analytical approaches


Machine learning is usually divided into:
Supervised learning: you know ahead of time the classes or attributes of
data you will be using. This is the most common learning and includes
classification and regression
Unsupervised learning: you do not know the classes of data so you use
techniques such as clustering to search for interesting patterns. Commonly
used in genetic research.
Another technique, association, looks for trends, such as, if you buy beer,
how likely are you to buy chips? (market basket analysis)
Textbook Table 22.5 lists commonly used algorithms

Programming languages
Two dominant languages are R and Python.
Both can perform statistical and machine learning analyses
R tends to be used more by data scientists and Python more by
computer scientists
There is a substantial learning curve for both. SQL (structured query language) is used to manipulate databases and generate reports, so in a loose sense it is also a programming language for analysis

Major types of analytics


Descriptive analytics (unsupervised learning)
Association rules: market basket analysis
Sequence rules: order of events
Clustering: discover new patterns
Predictive analytics (supervised learning)
Regression
Classification
Predictive analytics
Creates a model that examines the predictors associated with an outcome or target. For example, what are the factors that contribute to readmission for heart failure?
Regression model is used when the target is numerical, such as income, size,
age, etc.
Classification model is used when the target is nominal (categorical), such as
admitted and not admitted.

Predictive analytics
Regression model is a linear (line-like) model. It seeks the mathematical relationship between two or more numerical variables
Plotted: the dependent (target) variable on the y axis and the independent variable on the x axis
Formula: y = ax + b, where a is the slope of the line and b is the intercept
Predictive analytics
Regression model: the goal is to fit the values as close to the line as possible to reduce the sum of squared errors (square the difference between each value and the line, then sum them). The lower the number, the better
For example, if we use the formula y = 0.425x + 0.785 and we set x = 2, then y = 0.425 × 2 + 0.785 = 1.635 ≈ 1.64.
Correlation coefficient: value between -1 and 1
R2 is a common measure to determine how close the fit is:
0-100% range. The higher the %, the better the fit
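A minimal sketch of fitting such a line in Python with scikit-learn, using made-up paired values (the coefficients 0.425 and 0.785 quoted above come from the slide's example, not from this code):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5], [6]])      # made-up predictor values
y = np.array([1.2, 1.9, 2.8, 4.2, 4.8, 6.1])      # made-up target values

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)   # slope a and intercept b in y = ax + b
print(model.score(x, y))                  # R^2: how closely the points fit the line
print(model.predict([[7]]))               # predicted y for x = 7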
Classification model: is non-linear and uses supervised learning. The
target or outcome variable is categorical (nominal). Algorithms:
Classification and Regression Trees (CART): decision trees are easy to
understand and visualize. Common trees are C4.5 and J48. The tree displays
root and leaf nodes (outcomes). In the next slide there is a decision tree for
contact lens, where the choices are no lenses, soft or hard lenses. The first
branch, or decision, is tear production, normal or reduced. Random forest
algorithms involve multiple decision trees and can be more accurate

Classification model
Neural networks: organized like human neurons, but this approach is more
empirical and more difficult to explain and confirm
Naïve Bayes: uses Bayes theorem to perform regression and classification to
predict future probability, based on known prevalence. Called “naïve”
because it assumes each variable is independent, which is not always the
case. Regardless, this algorithm works very well
Classification:
K-nearest neighbor: works for regression and classification. It works by looking for “nearest neighbors” using several methods, including Euclidean distance. It uses the entire data set, so there is no separate training step.
Support Vector Machines (SVMs): algorithm separates the attribute spaces
into hyperplanes and determines the best coefficients for separation of the
data. Can be used for classification and regression
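A minimal sketch (again with made-up data) of two of the classifiers above, naïve
Bayes and a support vector machine, in scikit-learn:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])   # made-up features
y = np.array([0, 0, 1, 1])                                       # categorical target

nb = GaussianNB().fit(X, y)          # naïve Bayes: assumes the features are independent
print(nb.predict([[1.2, 1.9]]))      # predicted class for a new case

svm = SVC(kernel="linear").fit(X, y) # SVM: finds a separating hyperplane
print(svm.predict([[5.5, 8.5]]))     # predicted class for another new case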

Classification model:
Logistic regression: in spite of the name, this approach is used for
classification, usually with a binary outcome such as lived or died, which
can be dummy coded as 0 or 1.
Results are reported as odds ratios: the higher the odds ratio, the more
strongly a predictor (independent variable) is associated with the outcome
(dependent variable). The approach is shared by statisticians and computer
scientists and is perhaps the most common choice for classification scenarios.
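A minimal sketch (with made-up data) of logistic regression in scikit-learn,
converting the fitted coefficients to odds ratios:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up predictors (e.g., age and a prior-admission flag) and a binary outcome
X = np.array([[50, 0], [60, 1], [70, 1], [45, 0], [80, 1], [55, 0]])
y = np.array([0, 1, 1, 0, 1, 0])

model = LogisticRegression(max_iter=1000).fit(X, y)
odds_ratios = np.exp(model.coef_[0])   # exponentiated coefficients give the odds ratios
print(odds_ratios)                     # values above 1 suggest the predictor raises the odds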

Regression Model Evaluation


Performance is usually measured with:
R (correlation coefficient), with a range from -1 to +1
R2, which determines how close the data fit the regression line, with a
range of 0-100%
Root mean square error (RMSE), which also measures the fit; the lower the
RMSE, the better.
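A minimal sketch of these regression measures in Python, using made-up observed
and predicted values:

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([3.0, 4.5, 6.1, 7.9])      # observed values (made up)
y_pred = np.array([2.8, 4.7, 6.0, 8.2])      # values predicted by some regression model

r = np.corrcoef(y_true, y_pred)[0, 1]              # correlation coefficient, -1 to +1
r2 = r2_score(y_true, y_pred)                      # how close the fit is, 0-100%
rmse = np.sqrt(mean_squared_error(y_true, y_pred)) # root mean square error; lower is better
print(r, r2, rmse)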

Classification Model Evaluation


Measures:
True and false positives (TP, FP) and true and false negatives (TN, FN),
arranged in a confusion matrix or truth table
Classification accuracy and error %
Recall, which is the same as sensitivity (the true positive rate)
Precision, which is the same as positive predictive value (PPV)
F1, the harmonic mean of precision and recall
Plot a receiver operating characteristic (ROC) curve and calculate the area
under the curve (AUC); values of roughly 70-90% are generally considered good
ROC Curve
Y axis = true positive rate; X axis = false positive rate
In the example figure, the green curve yields the best results; an AUC of
0.530 is not considered optimal, since useful values should be better than
70%
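A minimal sketch of the classification measures above in scikit-learn, using
made-up labels and predicted probabilities:

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                 # actual outcomes (made up)
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1]) # model-predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)                        # probabilities thresholded at 0.5

print(confusion_matrix(y_true, y_pred))    # the TN/FP/FN/TP counts (truth table)
print(accuracy_score(y_true, y_pred))      # classification accuracy
print(recall_score(y_true, y_pred))        # recall = sensitivity
print(precision_score(y_true, y_pred))     # precision = positive predictive value
print(f1_score(y_true, y_pred))            # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))       # area under the ROC curve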
Training models for machine learning
10-fold cross validation: a common approach with small data sets. The data
is divided into 10 sets; the model trains on 9 of them and tests on the
remaining one. This is repeated 10 times and the results are averaged. It
is often the default choice in programs such as WEKA.
Train and test split (hold-out validation): for larger datasets, algorithms
commonly train on roughly 70% of the data and test (validate) the results
on the remaining 30%.
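A minimal sketch (with synthetic data) of both training approaches in
scikit-learn:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 100 made-up records with 3 features
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int) # a synthetic binary outcome

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=10)         # 10-fold cross validation
print(scores.mean())                                 # average accuracy over the 10 folds

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)                          # train on roughly 70% of the data
print(model.score(X_test, y_test))                   # test (validate) on the remaining 30%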
Text mining with natural language
processing (NLP)
About 70% of healthcare data (with or without an EHR) is unstructured and
therefore not directly computable
NLP is an artificial intelligence technique that mines textual data,
primarily to address clinical outcomes and billing
Currently, manual chart review is the most common method to extract
meaning from unstructured data; it is slow and expensive
Multiple commercial and open-source NLP tools are available, and NLP
packages also exist for the R language
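A minimal, deliberately simplified sketch of making text computable with a
bag-of-words representation in scikit-learn; real clinical NLP pipelines
(negation detection, concept mapping, etc.) are far richer:

from sklearn.feature_extraction.text import CountVectorizer

notes = [
    "Patient admitted with acute heart failure and shortness of breath.",
    "Follow-up visit for type 2 diabetes, no acute complaints.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(notes)          # each note becomes a vector of word counts
print(vectorizer.get_feature_names_out())    # the vocabulary extracted from the notes
print(X.toarray())                           # a structured, computable representation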

Visualization and Communication


Data visualization is used initially to view data distributions and also in
the later stages of data analysis to present and communicate results
Traditionally, programs such as Microsoft Excel dominated visualization,
but newer tools such as Tableau and Qlik Sense are now available.
Dashboards are another way to present a variety of results visually for
easy interpretation.

Performance Summary
Results must make sense: statistical significance does not equal clinical
significance, and correlation does not imply causation
Take into account whether sensitivity, specificity, or both are the most
important goals for your model.
Big Data
Data so large it cannot be analyzed on a single computer
A common definition of big data uses the five Vs:
Volume: massive size of data
Velocity: rapid generation
Variety: structured, unstructured, or mixed
Veracity: the data may be “messy” with missing values, but with huge
volumes this becomes less important
Value: big data must have value in order for it to become information and
knowledge

Google developed MapReduce in 2004 to deal with very large datasets; it was
followed by the Hadoop Distributed File System (HDFS).
MapReduce maps the data into key-value pairs, reduces (aggregates) them by
key, and distributes the work among multiple computers (the nodes that
make up a cluster).
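A minimal sketch of the map and reduce steps in plain Python (counting made-up
diagnoses); real MapReduce/Hadoop distributes these steps across the nodes of a
cluster:

from collections import defaultdict

records = ["heart failure", "diabetes", "heart failure", "asthma"]   # made-up diagnoses

# Map step: emit (key, value) pairs
mapped = [(diagnosis, 1) for diagnosis in records]

# Reduce step: combine the values that share a key
counts = defaultdict(int)
for key, value in mapped:
    counts[key] += value

print(dict(counts))   # e.g., {'heart failure': 2, 'diabetes': 1, 'asthma': 1}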
Apache Hive is a data warehousing system built on Hadoop; SQL-like queries
can retrieve data from the HDFS.
Apache Mahout is a machine learning library that can be used with
Hadoop.
NoSQL databases are an alternative to standard relational database
systems
NoSQL is a general term for a variety of databases that are designed
to handle larger volumes of data
NoSQL databases are not based on the ACID properties but on BASE (Basically
Available, Soft state, Eventual consistency): data is always available and,
while not immediately consistent, will eventually become consistent.

Big Data challenges


The hardware and dedicated staff are expensive
It is labor intensive to integrate multiple disparate data sources, such
as geolocation, images, sensor data, etc.
Will the information gained be of value for all members of the
healthcare team or only administrators concerned with rising costs?
In other words, will it change practice patterns or behavior?

Analytical software for healthcare workers


Microsoft Excel with the Data Analysis ToolPak: a free add-in that includes
many common statistical tests. There are no wizards or guidance to help.
Microsoft SQL Server Analysis Services (SSAS): includes statistical tools
and machine learning. Wizards are available to assist.
Statistical Package for the Social Sciences (SPSS): a powerful statistical
package; together with SPSS Modeler it can perform machine learning.

IBM Watson Analytics: automates the exploration, prediction, and
visualization of data. It is based on SPSS statistics, and inputting
questions is augmented by NLP.
It tackles common statistical modeling scenarios but does not include
machine learning algorithms. Fast and easy to use. Visualizations are
created automatically and can be used for dashboards and infographics. A
free academic program is available for non-commercial use.

Machine learning/data mining:


SSAS, SPSS as mentioned
Free open-source programs:
WEKA: the only program in this group that does not require
selecting widgets and connectors to initiate machine learning
KNIME
RapidMiner
Orange
Data Science education
According to the DataScience Community, 379 US universities offer data
science-related programs:
Certificate (82), Bachelors (24), Masters (259), and
Doctorate (14).
37% of the programs are online
Programs were housed in business (101), mathematics (40),
computer science (39), and new data science departments (9), with the
remainder in various departments
KDnuggets is an excellent resource for courses and other aspects of
data science.

Certifications:
Certified Analytics Professional (CAP): for professionals with five years of
experience and a bachelor's degree, or three years of experience with a
master's degree.
Certified Health Data Analyst (CHDA): hosted by AHIMA, with eligibility
requirements stated on its website
Coursera (2018) included 1103 low-cost courses related to data
science
Multiple Data Science Centers have also appeared across the US in
the past several years.

Data Science challenges

Inadequate workforce


Hardware and software expense to tackle big data
We need to close ranks and combine the statistical approach with
machine learning
Health data analytics is more complicated than in other domains because
the data is more heterogeneous and complex, and it is associated with
more privacy and legal implications
There is considerable hype associated with data science, so we need to
stay alert and focus on factual information

Future Trends
Look for more automation of analytics, such that the average
healthcare worker can perform predictive analytics
Look for more analytics embedded in cloud storage and other
platforms
Look for more low-cost courses covering data science

Conclusions
Data science is an important umbrella term that incorporates many
current data-related activities
All industries have been clamoring for more experts in data
analytics/data science
Until we train enough data scientists, we will have to rely on a team
of skilled healthcare workers who can digest and interpret
complicated data sets

Questions
- Define the field of data science
- Enumerate the general requirements for data science expertise
- Differentiate between modeling, using statistics and machine
learning
- Describe the general steps from data wrangling to data
presentation
- Discuss the characteristics of big data
- Provide examples of how data analytics is assisting healthcare
- Discuss the challenges facing healthcare or biomedical data analytics
