Lecture 03 DS Methodology
Data Science Methodology
• Motivation for Data Science
• Data Science Methodology
• From Problem to Approach
• Working with the Data
• Deriving the Answer
Motivation
[Figure: bar chart of data size in zettabytes, 2018 vs. 2025 (axis from 0 to 200). Source: IDC 2018]
Motivation
Source: 451 Research
CRISP-DM
• CRISP-DM: Cross Industry Standard Process for Data Mining
• Goal: Encourage interoperable tools across the entire data mining process
Need for a Standard Process
[Figure: Data Science Methodology cycle with the stages Data Requirement, Data Collection, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Feedback]
Data Science Methodology
3 Broad Steps:
• From Problem to Approach
• Working with the Data
• Deriving the Answer
10 Key Questions to Answer
1. From Problem to Approach
• What is the problem that is being solved?
• How can the data be used to answer the question?
2. Working with the Data
• What data is needed to answer the question?
• Where is the data coming from, or how can it be obtained?
• Is the data collected representative of the problem being solved?
• What additional work is required to manipulate and work with the data?
3. Deriving the Answer
• In what way can the data be visualized to get to the answer?
• Does the model used really answer the initial question, or does it need to be adjusted?
• Can the model be put into practice?
• Can constructive feedback be obtained to answer the question?
Case Study
• A limited budget for providing health care to the public
• Assess the patient's condition prior to discharge
1. From Problem to Approach
• Business Understanding: What is the problem that is being solved?
• Analytic Approach: How can the data be used to answer the question?
1.1 Business Understanding
• Question
• What is the best way to allocate the limited health-care budget to maximize its use in providing quality health care?
Business Understanding
Case Study
• Goal
• To provide quality care without increasing costs
• Objective
• To review the process to identify inefficiencies
Business Understanding
Case Study
• Predictive model
• Descriptive model: show relationships
• Classification model: yes/no answer
Analytic Approach
Case Study
• Predictive Model
• To predict an outcome
• Decision Tree Classification
• Categorical Outcome
• Explicit decision path showing conditions leading to high risk
• Easy to understand and apply
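To illustrate what an "explicit decision path" can look like, here is a minimal sketch of a hand-written decision tree. The feature names (`prior_admissions`, `length_of_stay`) and the thresholds are hypothetical and are not taken from the case study; a real tree would be learned from data.

```python
# Hypothetical hand-written tree; features and thresholds are illustrative only.
TREE = {
    "feature": "prior_admissions", "threshold": 2,
    "yes": {"feature": "length_of_stay", "threshold": 7,
            "yes": "high risk", "no": "low risk"},
    "no": "low risk",
}

def classify(record, node=TREE, path=None):
    """Walk the tree and return (label, list of conditions followed)."""
    path = [] if path is None else path
    if isinstance(node, str):  # leaf: the categorical outcome
        return node, path
    passed = record[node["feature"]] >= node["threshold"]
    answer = "Y" if passed else "N"
    path.append(f'{node["feature"]} >= {node["threshold"]}: {answer}')
    return classify(record, node["yes" if passed else "no"], path)

label, path = classify({"prior_admissions": 3, "length_of_stay": 9})
print(label)
for step in path:
    print(" ", step)
```

The returned path lists every condition that led to the outcome, which is what makes this kind of model easy to explain to non-specialists.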
Analytic Approach
Case Study
[Figure: decision tree with Y/N branches]
2. Working with the Data
2.1 Data Requirements
• How: How will the data be used?
• Where: Where do you get the data?
• How: How do you obtain the data?
• When: When do you need the data?
• Why: Why do we need the data?
2.2 Data Collection
2.3 Data Understanding
• Describe Data
• Check data volume and examine its properties
• Accessibility and availability of attributes
• Attribute types, ranges, correlations, identifiers
• Understand the meaning of each attribute and attribute
value in business terms.
• For each attribute, compute basic statistics
• Distribution
• Average, Max, Min
• Explore Data
• Analyze properties of interesting attributes in detail
• Verify Data Quality
• Identify special values and catalogue their meaning
• Does it cover all the cases required?
• Does it contain errors?
• Identify missing attributes
• Do the meanings of the attributes and their contained values fit together?
• Check spelling of values (“The case of exploding mangoes”, “the Case of Exploding Mangoes”)
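The checks listed above can be sketched in a few lines. This is only an illustration using the standard library; the sample rows and the `quantity` attribute are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"title": "The case of exploding mangoes", "quantity": 300},
    {"title": "the Case of Exploding Mangoes", "quantity": 320},
    {"title": "Another Title", "quantity": None},
]

# Describe data: basic statistics for one numeric attribute.
values = [r["quantity"] for r in rows if r["quantity"] is not None]
stats = {"min": min(values), "max": max(values), "mean": mean(values)}

# Verify data quality: count missing values.
missing = sum(1 for r in rows if r["quantity"] is None)

# Verify data quality: find values that differ only by letter case.
variants = {}
for r in rows:
    variants.setdefault(r["title"].casefold(), set()).add(r["title"])
inconsistent = {key: v for key, v in variants.items() if len(v) > 1}

print(stats, missing, inconsistent)
```

Grouping by `casefold()` only catches differences in letter case; other spelling variants would need a fuzzier comparison.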
2.3 Data Understanding
Case Study
• Data Quality
• Missing Values
• Invalid or misleading values
2.3 Data Understanding
Case Study
• Data Cleaning
• Correct, remove or ignore noise
• Decide how to deal with special values and their meaning
• 0 Male, 1 Female
• Aggregation Level
• Outliers
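A minimal sketch of these cleaning steps follows. The records, the `max_age` limit, and the choice of median imputation are assumptions made for illustration; the slide leaves open whether noise is corrected, removed, or ignored. Only the 0/1 gender coding comes from the slide.

```python
from statistics import median

GENDER_CODES = {0: "Male", 1: "Female"}  # coding taken from the slide

# Hypothetical records: None is a missing value, 400 is an implausible age.
records = [
    {"gender": 0, "age": 64},
    {"gender": 1, "age": 71},
    {"gender": 1, "age": None},
    {"gender": 0, "age": 400},
    {"gender": 1, "age": 58},
]

def clean(records, max_age=120):
    """Decode gender codes and replace missing or out-of-range ages with the median."""
    valid = [r["age"] for r in records
             if r["age"] is not None and 0 <= r["age"] <= max_age]
    fill = median(valid)
    cleaned = []
    for r in records:
        age = r["age"]
        if age is None or not 0 <= age <= max_age:
            age = fill  # one possible policy; dropping the record is another
        cleaned.append({"gender": GENDER_CODES[r["gender"]], "age": age})
    return cleaned

cleaned = clean(records)
print(cleaned)
```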
2.4 Data Preparation
• Feature Engineering
• The process of using domain knowledge to create features that make machine learning algorithms work.
2.4 Data Preparation
• Integrate Data
• Integrate sources and store result
• Format Data
• Re-arrange attributes
• First field is the identifier, last field is the label
• Re-order records
• Reformat within values
• Remove illegal characters
• Convert upper case to lower case, etc.
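The formatting steps above might look like the following sketch. The allowed character set, the field names, and the sample row are hypothetical; which characters count as "illegal" depends on the downstream tool.

```python
import re

def format_value(text):
    """Lower-case a value and strip characters outside an assumed allowed set."""
    return re.sub(r"[^a-z0-9 _-]", "", text.lower()).strip()

def format_record(record, id_field, label_field):
    """Return (name, value) pairs with the identifier first and the label last."""
    middle = sorted(k for k in record if k not in (id_field, label_field))
    order = [id_field] + middle + [label_field]
    return [
        (k, format_value(record[k]) if isinstance(record[k], str) else record[k])
        for k in order
    ]

# Hypothetical input row with fields in arbitrary order.
row = {"readmitted": "Yes", "drug": " ACE-Inhibitor#", "patient_id": "P001", "age": 70}
formatted = format_record(row, "patient_id", "readmitted")
print(formatted)
```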
2.4 Data Preparation
Case Study
• Aggregating Records
• Claims:
• Professional provider, facility, pharmaceutical
• Inpatient and outpatient records
• Diagnosis, procedure, prescription, etc.
• Possibly thousands per patient (depending on clinical history)
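Aggregating transactional claims into one record per patient can be sketched as below. The claim rows, field names, and summary measures are hypothetical; real claim data would carry many more attributes.

```python
from collections import defaultdict

# Hypothetical claim-level rows, several per patient.
claims = [
    {"patient_id": "P1", "type": "inpatient", "cost": 1200.0},
    {"patient_id": "P1", "type": "pharmaceutical", "cost": 80.0},
    {"patient_id": "P2", "type": "outpatient", "cost": 150.0},
    {"patient_id": "P1", "type": "outpatient", "cost": 200.0},
]

def aggregate(claims):
    """Collapse claim rows into one summary record per patient."""
    summary = defaultdict(lambda: {"n_claims": 0, "total_cost": 0.0, "types": set()})
    for claim in claims:
        s = summary[claim["patient_id"]]
        s["n_claims"] += 1
        s["total_cost"] += claim["cost"]
        s["types"].add(claim["type"])
    return dict(summary)

per_patient = aggregate(claims)
print(per_patient)
```

The result has exactly one entry per patient, which is the aggregation level a per-patient model needs.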
2.4 Data Preparation
Case Study
• Target
• CHF readmission within 30 days (Yes/No) following discharge from CHF hospitalization
• Measures: Gender, Age, Primary Drug, Length of Stay, Prior Admission, CHF Diagnosis Importance (Primary, Secondary, Tertiary)
• Diagnosis Flags: CHF, Renal Failure, Hypertension, Diabetes, Pneumonia
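As an illustration of how the Yes/No target could be derived from dates, here is a small sketch. The function name, the dates, and the boundary rule (an admission on the discharge day itself is not counted) are assumptions; the case study does not specify them.

```python
from datetime import date

def readmitted_within_30_days(discharge, later_chf_admissions, window_days=30):
    """Return 'Yes' if any CHF admission falls 1 to window_days days after discharge."""
    for admitted in later_chf_admissions:
        if 0 < (admitted - discharge).days <= window_days:
            return "Yes"
    return "No"

# Hypothetical dates for one patient.
discharge = date(2020, 1, 10)
print(readmitted_within_30_days(discharge, [date(2020, 2, 5)]))  # 26 days later
print(readmitted_within_30_days(discharge, [date(2020, 3, 1)]))  # 51 days later
```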
3.1 Modeling
• Build Model
• Set initial parameters and document the reasons for choosing those values
• Run the selected technique on the input data set
• Record the parameter settings used to produce the model
• Describe the model, its special features, behavior, and interpretation
3.2 Evaluation
• Does the model used really answer the initial question, or does it need to be adjusted?
• Evaluate Results
• Understand the results and cross-verify them against the goals
• Check results against the knowledge base (usefulness and novelty)
• Rank results with respect to the business success criteria
• State conclusions for future projects
3.1/2 Modeling / Evaluation
Case Study
• Confusion Matrix
                      Actual Positive    Actual Negative
Predicted Positive    TP                 FP
Predicted Negative    FN                 TN
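The four cells of the confusion matrix can be counted directly from paired actual and predicted labels. This is a minimal sketch; the label lists are hypothetical and "Y" is assumed to be the positive class.

```python
def confusion_matrix(actual, predicted, positive="Y"):
    """Count TP, FP, FN, TN for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical labels for six cases.
actual    = ["Y", "Y", "N", "N", "Y", "N"]
predicted = ["Y", "N", "N", "Y", "Y", "N"]
cm = confusion_matrix(actual, predicted)
print(cm)
```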
3.1/2 Modeling / Evaluation
Case Study
• Analyzing the 3 models
Model | Relative Cost Y:N | Overall Accuracy (% of Correct Y & N) | Sensitivity (Y Accuracy) | Specificity (N Accuracy)
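The three accuracy columns follow the standard definitions: overall accuracy is the share of all cases classified correctly, sensitivity is the share of actual Y cases predicted as Y, and specificity is the share of actual N cases predicted as N. The sketch below computes them from confusion-matrix counts; the counts used are hypothetical and are not results from the case study.

```python
def metrics(tp, fp, fn, tn):
    """Compute overall accuracy, sensitivity, and specificity from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # correct Y and N over all cases
        "sensitivity": tp / (tp + fn),                # accuracy on actual Y cases
        "specificity": tn / (tn + fp),                # accuracy on actual N cases
    }

# Hypothetical counts for one model.
m = metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```

A model can score well on overall accuracy while having poor sensitivity, which is why the three columns are compared side by side.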
3.3 Deployment
Case Study
• Additional Requirements
• Training for clinical staff
• Tracking / monitoring processes
3.4 Feedback
Case Study
• Refine Model
• Initial review after the first year of implementation
• Based on feedback data and knowledge gained
• Possibly incorporate detailed pharmaceutical data originally deferred
3.4 Feedback
Case Study
• Redeploy
• Continue modeling, deployment, feedback, and refinement throughout the life of the intervention program
Acknowledgement