Lecture 03 DS Methodology

Uploaded by

saqib ullah

CS 5638 – Principles of Data Science

• Motivation for Data Science Methodology
• Data Science Methodology
• From Problem to Approach
• Working with the Data
• Deriving the Answer

1
Motivation
Source: IDC 2018

[Figure: bar chart of data size in zettabytes, 2018 vs. 2025; vertical axis from 0 to 200]
2
Motivation
Source: 451 Research

• 80% of the data will be unstructured by 2025
• It is challenging to obtain information from unstructured data
• Need for a clear, well-thought-out, standardized methodology for data science
3
Methodology
Source: Wikipedia

• Methodology is the systematic, theoretical analysis of the methods applied to a field of study.

4
CRISP-DM

• CRISP-DM
• Cross-Industry Standard Process for Data Mining
• Goal
• Encourage interoperable tools across the entire data mining process

5
Need for a Standard Process

• Framework for recording experience
• Allows projects to be replicated
• Aid to project planning and management
• “Comfort factor” for new adopters
• Reduces dependency on “stars”
• Encourages best practices and helps to obtain better results
6
[Figure: the Data Science Methodology cycle: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback]
8
Data Science Methodology

3 broad steps:

• From Problem to Approach
• Working with the Data
• Deriving the Answer

10 key questions to answer

9
1. From Problem to Approach

• What is the problem that is being solved?
• How can the data be used to answer the question?

10
2. Working with the Data

• What data is needed to answer the question?
• Where is the data coming from? Or how to get it?
• Is the data collected representative of the problem being solved?
• What additional work is required to manipulate and work with the data?

11
3. Deriving the Answer

• In what way can the data be visualized to get to the answer?
• Does the model used really answer the initial question, or does it need to be adjusted?
• Can the model be put into practice?
• Can constructive feedback be obtained to answer the question?

12
Case Study

• A limited budget for providing health care to the public
• Hospital re-admission: a sign of failure of the system
• Assess the patient’s condition prior to discharge
• How can data science help? By providing new data-driven tools for timely decisions

13
1. From Problem to Approach

• Business Understanding
• What is the problem that is being solved?
• Analytic Approach
• How can the data be used to answer the question?

16
1.1 Business Understanding

• Determine Data Science Goals
• Translate the business questions into data science goals
• Specify the data science problem type
• Specify the criterion for model assessment
• Produce Project Plan
• Define an initial process plan and discuss feasibility with stakeholders
• Put identified goals and selected techniques into a coherent procedure
• Estimate the effort and resources needed; identify critical steps

17
Business Understanding
Case Study

• Question
• What is the best way to allocate the limited health-care budget
to maximize its use in providing quality health care?

18
Business Understanding
Case Study

• Goal
• To provide quality care without increasing
costs

• Objective
• To review the process to identify inefficiencies.
19
Business Understanding
Case Study

Patients re-admitted to a rehabilitation center:

• 35% within one year
• 50% within five years
20
Business Understanding
Case Study

• Review the data
• Findings
• Patients with Congestive Heart Failure (CHF) were at the top of the re-admission data
• A decision model can be applied to check why this is happening

21
Business Understanding
Case Study

• Four business requirements are identified
1. Predict CHF readmission outcome (0 or 1) for each patient
2. Predict the readmission risk for each patient
3. Understand explicitly what combination of events led to the predicted outcome for each patient
4. Apply an easy-to-understand process to new patients to predict their readmission risk
22
1.2 Analytic Approach

• Predictive model: determine the probability of an action
• Descriptive model: show relationships
• Classification model: yes/no answer

23
Analytic Approach
Case Study

• Predictive Model
• To predict an outcome
• Decision Tree Classification
• Categorical Outcome
• Explicit Decision Path showing conditions leading to high risk
• Easy to understand and apply

24
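
The "explicit decision path" property can be illustrated with a small sketch. The tree below is hypothetical: its feature names and structure are invented for illustration and are not clinical rules or the model used in the case study. It only shows how a decision tree returns both a categorical outcome and the conditions that led to it.

```python
# Hypothetical tree: feature names and structure are illustrative assumptions,
# not clinical rules and not the model from the case study.
TREE = {
    "feature": "reduced_exercise_ability",
    True: {"feature": "fatigue", True: "Y", False: "N"},
    False: {"feature": "rapid_weight_gain", True: "Y", False: "N"},
}


def classify(node, patient, path=None):
    """Walk the tree and return (label, decision_path)."""
    path = [] if path is None else path
    if not isinstance(node, dict):  # reached a leaf label
        return node, path
    value = bool(patient[node["feature"]])
    path.append(f"{node['feature']} = {value}")
    return classify(node[value], patient, path)


patient = {
    "reduced_exercise_ability": True,
    "fatigue": True,
    "rapid_weight_gain": False,
}
label, path = classify(TREE, patient)
print(label, path)
```

Because each prediction comes with the list of tested conditions, the result is easy to explain to clinicians, which is the reason the slides give for choosing this approach.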
Analytic Approach
Case Study

[Figure: example decision tree for heart failure. Decision nodes include reduced ability to exercise, rapid weight gain, fatigue and lack of appetite; True/False branches end in Y/N leaves.]

25
2. Working with the Data

• What data is needed to answer the question?
• Where is the data coming from? Or how to get it?
• Is the data collected representative of the problem being solved?
• What additional work is required to manipulate and work with the data?

28
2.1 Data Requirements

• How to cook Tiramisu?
• Problem to resolve
• How to cook Tiramisu?
• Data
• Ingredients
• Which ingredients are required?
• How to collect them?
• How to prepare the ingredients to cook the desired dish?
29
2.1 Data Requirements

What are data requirements? Six key questions:

• What: What type of data is required?
• Where: Where do you get the data?
• When: When do you need the data?
• Why: Why do we need the data?
• How: How will the data be used?
• How: How do you obtain the data?

30
2.1 Data Requirements
Case Study

• Define data requirements for the decision tree classification approach
• Define and select cohort
• In-patient within health insurance provider’s service area
• Primary diagnosis of CHF in one year
• Continuous enrollment for at least 6 months prior to primary CHF admission
• Disqualifying conditions
• Patients with other significant medical conditions

31
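
The cohort criteria above can be read as a filter over patient records. The following is a minimal sketch under stated assumptions: the field names are hypothetical, and "6 months" is approximated as 182 days, which the slides do not specify.

```python
from datetime import date


def in_cohort(patient, min_enrollment_days=182):
    """Apply the cohort criteria to one patient record.

    Field names are hypothetical; 182 days is an assumed reading of "6 months".
    """
    enrolled_days = (patient["chf_admission"] - patient["enrollment_start"]).days
    return bool(
        patient["in_service_area"]
        and patient["primary_diagnosis"] == "CHF"
        and enrolled_days >= min_enrollment_days
        and not patient["disqualifying_conditions"]
    )


candidate = {
    "in_service_area": True,
    "primary_diagnosis": "CHF",
    "enrollment_start": date(2019, 1, 1),
    "chf_admission": date(2019, 9, 1),
    "disqualifying_conditions": [],
}
print(in_cohort(candidate))
```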


Defining the Data
Case Study

• Contents, formats, representation suitable for decision tree classifier
• One record per patient
• Columns representing variables
• Contents covering all aspects of patient’s clinical history
• Transactional format
• Transformation required
32
2.2 Data Collection

• The data scientist must assess the data after the initial data collection.
• Determine whether the data is what is required.
• Some data might be missing.
• Some might be hard to get.

33
2.2 Data Collection

• Various techniques can be applied to assess the contents and quality of the data and to gain initial insight into it.
• Visualization
• Descriptive Statistics

35
Data Collection
Case Study

• Available data sources
• Corporate data warehouse
• Single source of medical and claims data
• In-patient record system
• Claim payment system
• Disease management program information
36
Data Collection
Case Study

• Data wanted but not available
• Pharmaceutical records
• OK to defer

37
Data Collection
Case Study

Merging the data

Eliminate the redundant data

38
2.3 Data Understanding

• Is the data to be collected representative of the problem to be solved?
• What does it mean to “prepare” or “clean” the data?

39
2.3 Data Understanding

• Describe Data
• Check data volume and examine its properties
• Accessibility and availability of attributes
• Attribute types, ranges, correlations, identifiers
• Understand the meaning of each attribute and attribute value in business terms
• For each attribute, compute basic statistics
• Distribution
• Average, max, min
• Standard deviation, variance, skewness

40
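
The basic statistics listed above can be computed with Python's standard library alone. This is a minimal sketch: the `length_of_stay` values are made up for illustration, and population (not sample) formulas are an assumption, since the slides do not say which to use.

```python
import statistics


def describe(values):
    """Basic per-attribute statistics using population formulas (an assumption)."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Population skewness: third central moment divided by std cubed.
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3) if std else 0.0
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "variance": statistics.pvariance(values),
        "std": std,
        "skewness": skew,
    }


length_of_stay = [2, 3, 3, 4, 5, 7, 18]  # made-up values, in days
summary = describe(length_of_stay)
print(summary)
```

A clearly positive skewness, as in this made-up sample, signals a long right tail that is worth inspecting before modeling.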


2.3 Data Understanding

• Explore Data
• Analyze properties of interesting attributes in detail
• Verify Data Quality
• Identify special values and catalogue their meaning
• Does it cover all the cases required?
• Does it contain errors?
• Identify missing attributes.
• Do the meanings of attributes and their contained values fit together?
• Check spelling of values (“The case of exploding mangoes”, “the Case of Exploding Mangoes”)
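
Two of the quality checks above, missing values and inconsistent spelling, can be sketched as follows. The sample list is made up (it reuses the two title spellings from the slide), and treating values as equal after lowercasing and collapsing whitespace is an assumed normalization rule.

```python
from collections import Counter, defaultdict


def quality_report(values):
    """Count missing values and group spellings that differ only by case or spacing."""
    missing = 0
    variants = defaultdict(Counter)
    for v in values:
        if v is None or str(v).strip() == "":
            missing += 1
            continue
        key = " ".join(str(v).lower().split())  # assumed normalization rule
        variants[key][v] += 1
    inconsistent = {k: dict(c) for k, c in variants.items() if len(c) > 1}
    return missing, inconsistent


titles = [
    "The case of exploding mangoes",
    "the Case of Exploding Mangoes",
    "",
    None,
    "Another Title",  # made-up filler value
]
missing, inconsistent = quality_report(titles)
print(missing, inconsistent)
```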
41
2.3 Data Understanding
Case Study

• Run descriptive statistics against the data columns that can become variables in the model.
• Descriptive Statistics
• Univariate Statistics
• Pairwise Correlation
• Histogram
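
Pairwise correlation between candidate variables can be sketched with a hand-written Pearson coefficient. The column names and values below are made up for illustration; a column with zero variance would cause a division by zero and is not handled here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant


# Made-up columns; names are hypothetical.
columns = {
    "age": [50, 60, 70, 80],
    "length_of_stay": [2, 4, 6, 8],
    "prior_admissions": [3, 2, 1, 0],
}
correlations = {
    (a, b): pearson(columns[a], columns[b]) for a, b in combinations(columns, 2)
}
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

Pairs with a correlation close to 1 or -1 carry largely redundant information, so one of the two may be dropped before modeling.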

42
2.3 Data Understanding
Case Study

• Data Quality
• Missing Values
• Invalid or misleading values

43
2.3 Data Understanding
Case Study

• Iterative Data Collection and Understanding
• Refined definition of “CHF admission”
• Initial definition
• Initial diagnosis of primary diagnosis of CHF
• Refine the definition based on the clinical information

44
2.4 Data Preparation

• Data Cleaning
• Correct, remove or ignore noise
• Decide how to deal with special values and their meaning
• 0 Male, 1 Female
• Aggregation Level
• Outliers

45
2.4 Data Preparation

• Feature Engineering
• Process of using domain knowledge to create features that make the machine learning algorithm work.

46
2.4 Data Preparation

• Integrate Data
• Integrate sources and store result
• Format Data
• Re-arrange attributes
• First field identifier, last field the label
• Re-ordering records
• Reformatting within values
• Removing illegal characters
• Upper case to lowercase, etc.

47
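
The formatting steps above (identifier first, label last, illegal characters removed, lowercase values) can be sketched as one small function. The field names are hypothetical, and the set of "legal" characters in the regular expression is an assumption chosen for the example.

```python
import re


def format_record(record, id_field, label_field):
    """Put the identifier first and the label last, and clean string values."""

    def clean(value):
        if isinstance(value, str):
            # Assumed rule: keep word characters, whitespace, '-', '.', '/'.
            value = re.sub(r"[^\w\s\-./]", "", value)
            value = value.strip().lower()
        return value

    middle = sorted(k for k in record if k not in (id_field, label_field))
    ordered = [id_field, *middle, label_field]
    return {k: clean(record[k]) for k in ordered}


raw = {"readmitted": "Yes", "gender": " FEMALE* ", "patient_id": "P-001#"}
formatted = format_record(raw, id_field="patient_id", label_field="readmitted")
print(formatted)
```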
2.4 Data Preparation
Case Study

• CHF broad definition
• Define the readmission criterion
• Index admission
• Readmission
• Based on expert advice and the data, a 30-day time frame is set for readmission.
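
The 30-day readmission criterion can be expressed as a date comparison. In this sketch the function and argument names are hypothetical, and counting both day 0 and day 30 as inside the window is an assumption; the slides only state the 30-day time frame.

```python
from datetime import date

READMISSION_WINDOW_DAYS = 30  # time frame stated in the slides


def is_readmission(index_discharge, next_admission, window_days=READMISSION_WINDOW_DAYS):
    """True if the next admission falls within the window after the index discharge.

    Inclusive bounds (day 0 through day 30) are an assumption.
    """
    if next_admission is None:
        return False
    gap = (next_admission - index_discharge).days
    return 0 <= gap <= window_days


print(is_readmission(date(2020, 1, 1), date(2020, 1, 31)))
print(is_readmission(date(2020, 1, 1), date(2020, 2, 1)))
```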

48
2.4 Data Preparation
Case Study

• Aggregating Records
• Claims:
• Professional provider, facility, pharmaceutical
• Inpatient and outpatient records
• Diagnosis, procedure, prescription, etc.
• Possibly thousands per patient (depending on clinical history)
49
2.4 Data Preparation
Case Study

• Aggregate to patient level
• Roll up to 1 record per patient
• Create new columns representing the transactions
• Outpatient visits
• Inpatient episodes
• Frequency
• Recency
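
Rolling transactional records up to one row per patient can be sketched as below. The record layout (`patient_id`, `kind`, `date`) and the sample transactions are made up; frequency is represented by visit counts and recency by days since the most recent event, which is one possible reading of the slide.

```python
from collections import defaultdict
from datetime import date


def roll_up(transactions, as_of):
    """Aggregate transactions to one record per patient (hypothetical layout)."""
    acc = defaultdict(
        lambda: {"outpatient_visits": 0, "inpatient_episodes": 0, "last_event": None}
    )
    for t in transactions:
        row = acc[t["patient_id"]]
        if t["kind"] == "outpatient":
            row["outpatient_visits"] += 1
        elif t["kind"] == "inpatient":
            row["inpatient_episodes"] += 1
        if row["last_event"] is None or t["date"] > row["last_event"]:
            row["last_event"] = t["date"]
    return {
        pid: {
            "outpatient_visits": r["outpatient_visits"],      # frequency
            "inpatient_episodes": r["inpatient_episodes"],    # frequency
            "days_since_last_event": (as_of - r["last_event"]).days,  # recency
        }
        for pid, r in acc.items()
    }


transactions = [
    {"patient_id": "P1", "kind": "outpatient", "date": date(2020, 1, 5)},
    {"patient_id": "P1", "kind": "inpatient", "date": date(2020, 2, 10)},
    {"patient_id": "P1", "kind": "outpatient", "date": date(2020, 3, 1)},
    {"patient_id": "P2", "kind": "inpatient", "date": date(2020, 1, 20)},
]
patients = roll_up(transactions, as_of=date(2020, 3, 31))
print(patients)
```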
50
2.4 Data Preparation
Case Study

• More or less data needed?


• Literature review of important factors for CHF
readmission

51
2.4 Data Preparation
Case Study

• Completing the Data Set


• Merge all records
• List of variables used in modeling
• Target
• CHF readmission within 30 days (Yes/No) following discharge from
CHF hospitalization

52
2.4 Data Preparation
Case Study

• Target
• CHF readmission within 30 days (Yes/No) following discharge from CHF hospitalization
• Measures: Gender, Age, Primary Drug, Length of Stay, Prior Admission, CHF Diagnosis Importance (Primary, Secondary, Tertiary)
• Diagnosis Flags: CHF, Renal Failure, Hypertension, Diabetes, Pneumonia
53
2.4 Data Preparation
Case Study

• Using Training Set
• Total records: 2,343
• Randomly divide into training and test sets (70%/30% split)
• Training – 1,640
• Testing – 703
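
The 70%/30% split can be reproduced with a shuffle and a cut point. This is a sketch: the fixed seed and the use of integer placeholders instead of real patient records are assumptions made so the example is repeatable.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it at the training fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(2343))  # placeholders for the 2,343 patient records
train, test = train_test_split(records)
print(len(train), len(test))
```

With 2,343 records the cut point is 1,640, which leaves 703 records for testing, matching the counts on the slide.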

54
3. Deriving the Answer

• In what way can the data be visualized to get to the answer?
• Does the model used really answer the initial question, or does it need to be adjusted?
• Can the model be put into practice?
• Can constructive feedback be obtained to answer the question?

56
3.1 Modeling

• In what way can the data be visualized to get the required answer?
• Select Modeling Technique
• Select technique
• Identify any assumptions made by the technique about the data
• Compare the assumptions with the data description report
• Ensure there is no mismatch

57
3.1 Modeling

• Build Model
• Set initial parameters and document the reason for choosing those values
• Run the selected technique on the input data set
• Record the parameter settings used to produce the model
• Describe the model, its special features, behavior and interpretation

58
3.2 Evaluation

• Does the model used really answer the initial question, or does it need to be adjusted?
• Evaluate Results
• Understand results. Cross-verify against goals.
• Check results against knowledge base (usefulness and novelty)
• Rank results with respect to business success criteria
• State conclusions for future projects

60
3.1/2 Modeling / Evaluation
Case Study

• Confusion Matrix

                     Actual Positive   Actual Negative
Predicted Positive   TP                FP
Predicted Negative   FN                TN
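
The confusion matrix cells and the rates derived from them can be computed as in the sketch below. The actual and predicted labels are made up for illustration; "Y" is taken as the positive class, following the Yes/No target used in the case study.

```python
def confusion_counts(actual, predicted, positive="Y"):
    """Return (TP, FP, FN, TN) for paired actual and predicted labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Overall accuracy, sensitivity (Y accuracy) and specificity (N accuracy)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up labels for eight patients.
actual = ["Y", "Y", "Y", "N", "N", "N", "N", "N"]
predicted = ["Y", "Y", "N", "N", "N", "N", "Y", "N"]
counts = confusion_counts(actual, predicted)
metrics = rates(*counts)
print(counts, metrics)
```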

61
3.1/2 Modeling / Evaluation
Case Study

• Analyzing the 3 models

Model   Relative Cost Y:N   Overall Accuracy (% of Correct Y & N)   Sensitivity (Y Accuracy)   Specificity (N Accuracy)
1       1:1                 85%                                     45%                        97%
2       9:1                 49%                                     97%                        35%
3       4:1                 81%                                     68%                        85%

62
3.1/2 Modeling / Evaluation
Case Study

• How to determine the optimal model?
• Balance true-positive rate and false-positive rate for best model

Model   Relative Cost Y:N   TP Rate (Sensitivity)   Specificity (N Accuracy)   FP Rate (1 - Specificity)
1       1:1                 0.45                    0.97                       0.03
2       1.5:1               0.60                    0.92                       0.08
3       4:1                 0.68                    0.85                       0.15
4       9:1                 0.97                    0.35                       0.65

64
3.1/2 Modeling / Evaluation
Case Study

• Using ROC Curve
• Classification model performance
• TP rate vs. FP rate
• Optimal model at max separation
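
One way to read "max separation" is the largest gap between the TP rate and the FP rate on the ROC curve. That reading is an assumption, since the slides do not define the term. The sketch below applies it to the four (TP rate, FP rate) pairs from the table of models.

```python
# (TP rate, FP rate) per model, taken from the table of four models in the slides.
models = {
    1: (0.45, 0.03),
    2: (0.60, 0.08),
    3: (0.68, 0.15),
    4: (0.97, 0.65),
}


def separation(tp_rate, fp_rate):
    """Vertical distance between the ROC point and the diagonal (assumed criterion)."""
    return tp_rate - fp_rate


best_model = max(models, key=lambda m: separation(*models[m]))
for m, (tpr, fpr) in models.items():
    print(m, round(separation(tpr, fpr), 2))
print("best:", best_model)
```

Under this criterion the 4:1 model has the largest separation, narrowly ahead of the 1.5:1 model.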

65
3.3 Deployment

• Can the model be put into practice?


• Plan Deployment
• How will the knowledge or information be propagated to users?
• How will the use of the results be monitored or its benefits measured?
• Identify possible problems when deploying the data mining results.

67
3.3 Deployment
Case Study

• Assimilate knowledge for business


• Practical understanding of the meaning of model results
• Implications of model results for designing intervention
actions

68
3.3 Deployment
Case Study

• Gathering Application Requirements
• Automated, near real-time risk assessments of CHF inpatients
• Easy to use
• Automated data preparation and scoring
• Up-to-date risk assessment to help clinicians target high-risk patients

69
3.3 Deployment
Case Study

• Additional Requirements
• Training for clinical staff
• Tracking / monitoring
processes

70
3.4 Feedback

• Can constructive feedback be obtained to answer the question?

73
3.4 Feedback
Case Study

• Define review process
• To measure the result of applying the risk model to the CHF patient population
• Track patients who received intervention
• Actual readmission outcomes
• Measure effectiveness of intervention
• Compare re-admission rates before and after model implementation

74
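
The before/after comparison in the review process can be sketched as a difference of readmission rates. The outcome lists below are made up (1 means readmitted within the window, 0 means not) and do not come from the case study.

```python
def readmission_rate(outcomes):
    """Share of patients readmitted; outcomes are 1 (readmitted) or 0 (not)."""
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for ten patients before and ten after model implementation.
before = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]
after = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

rate_before = readmission_rate(before)
rate_after = readmission_rate(after)
change = rate_after - rate_before
print(rate_before, rate_after, round(change, 2))
```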
3.4 Feedback
Case Study

• Refine Model
• Initial review after the first year of implementation
• Based on feedback data and knowledge gained
• Possibly incorporate detailed pharmaceutical data originally deferred

75
3.4 Feedback
Case Study

• Redeploy
• Continue modeling, deployment, feedback, and refinement
throughout the life of the intervention program

76
Acknowledgement

• Material presented in these slides is adapted from various sources, including:
• IBM Data Science Course
• João Mendes Moreira’s lecture on CRISP-DM

75
