Module 6 - Data Science Methodology (Steps)
Learning Competencies
6.1. Apply methodology in different types of data science problems.
6.2. Compare the stages of data science methodology from understanding to
preparation, modeling to evaluation, and deployment to feedback.
6.3. Explain what happens when a model is deployed and why model feedback is
important.
The fundamentals of the data science methodology were presented in the previous lesson. Take note that the stages of this methodology are iterative. These steps include forming a concrete business or research problem, collecting and analyzing data, building a model, and understanding the feedback after model deployment.
To further deepen your understanding of the whole process, in this lesson you will learn how to think like a data scientist, including taking the steps involved in tackling a data science problem and applying them to interesting real-world examples.
The Business Dictionary defines methodology as a system of broad principles or rules from which specific methods or procedures may be derived to interpret or solve problems. It is necessary to keep this in mind, since the temptation to circumvent the methodology and jump directly to solutions is often great.
1. Business Understanding
Applying the concepts
Problem: How best can a limited health budget be divided and put to optimal use to provide quality care?
As public funding for readmissions declined, the insurance company risked having to absorb the difference in costs, which could lead to higher rates for its clients.
The first thing to do is to define the problem being faced. In this case, the insurance company's rates are under review.
2. Analytic Approach
In a case where a question about human behavior is asked, a clustering approach would be an appropriate response. Let us now examine how the analytic approach was applied in the case study. Here, a decision tree classification model was used to identify the combination of conditions that led to each patient's outcome.
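The module itself contains no code, but a minimal sketch of what decision tree classification can look like in practice may help make the idea concrete. The sketch below uses Python with pandas and scikit-learn; all column names and values are invented for illustration and are not taken from the case study data.

    # A minimal, hypothetical sketch of decision tree classification (Python).
    # pandas and scikit-learn are assumed; columns and values are invented.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    patients = pd.DataFrame({
        "age":                  [54, 67, 71, 48, 80, 62],
        "num_prior_admissions": [0, 2, 3, 0, 4, 1],
        "has_diabetes":         [0, 1, 1, 0, 1, 0],
        "readmitted":           ["no", "yes", "yes", "no", "yes", "no"],  # the outcome to predict
    })

    X = patients.drop(columns="readmitted")
    y = patients["readmitted"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # The tree learns the combinations of conditions that lead to each outcome.
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X_train, y_train)
    print(model.predict(X_test))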
3. Data Requirements
• First, the data scientist defined the content, format, and representations of the data needed for decision tree classification.
• This modeling technique requires one record per patient, with columns representing the variables of the model. To model readmission outcomes, data covering all aspects of the patient's medical history must be available.
• This content includes admissions; primary, secondary, and tertiary diagnoses; procedures; prescriptions; and other services provided during hospitalizations or patient/doctor visits.
In this way, a given patient can have thousands of records that represent all of their attributes. To obtain the one-record-per-patient format, the data analysts aggregated the transactional records from the patient files and created a set of new variables to represent that information. This was a task for the data preparation phase, so it is important to anticipate the next stages.
• This case study also required specific information on drugs, but this data source was not yet integrated with the rest of the data sources.
• This brings us to an important point: it is acceptable to postpone decisions about unavailable data and to try to capture it later.
• For example, this can happen even after obtaining intermediate results from the predictive modeling. If those results indicate that drug information may be important for a good model, the team would then spend the time needed to obtain it.
However, it turned out that a reasonably good model could be built without the drug information.
4. Data Collection
• Database administrators and programmers often work together to extract data from the different sources and then combine them, as sketched below.
• In this way, redundant data can be deleted and the data made available for the next stage of the methodology, namely data understanding.
• At this stage, the data scientists and analytics team members can discuss ways to better manage the data, for example by automating certain database processes to facilitate data collection.
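As a rough illustration of this extract-and-combine step, the sketch below merges two hypothetical source extracts and drops duplicate rows with pandas. The file names, the patient_id key, and the columns are assumptions made only for this example.

    # Sketch: combining data extracted from different sources and removing redundant records.
    # File names, keys, and columns are hypothetical.
    import pandas as pd

    claims     = pd.read_csv("claims_extract.csv")       # e.g. extract from the claims system
    admissions = pd.read_csv("admissions_extract.csv")   # e.g. extract from the hospital system

    # Combine the sources on a shared patient identifier.
    combined = claims.merge(admissions, on="patient_id", how="outer")

    # Delete redundant (duplicate) rows before handing the data to the data understanding stage.
    combined = combined.drop_duplicates()
    combined.to_csv("combined_data.csv", index=False)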
5. Data Understanding
• First, heart failure admissions were identified based on a primary diagnosis of heart failure. However, the data understanding work revealed that this initial definition did not cover all of the heart failure cases that clinical experience would expect.
• This meant looping back to the data collection phase, adding secondary and tertiary diagnoses, and building a more complete definition of a heart failure admission, as sketched below.
• This is just one example of the iterative processes in the methodology. The more you work with the problem and the data, the more you learn, and the more the model can be adjusted, which ultimately leads to a better solution to the problem.
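To make the idea of refining the definition concrete, the sketch below compares how many admissions a primary-diagnosis-only definition captures against one that also checks secondary and tertiary diagnoses. The diagnosis codes and column names are hypothetical placeholders, not the codes used in the case study.

    # Sketch: comparing a narrow and a widened definition of heart failure admissions.
    # Codes and column names are hypothetical.
    import pandas as pd

    admissions = pd.read_csv("combined_data.csv")
    chf_codes = {"I50.1", "I50.9"}   # placeholder congestive heart failure codes

    primary_only = admissions["primary_dx"].isin(chf_codes)
    widened = (primary_only
               | admissions["secondary_dx"].isin(chf_codes)
               | admissions["tertiary_dx"].isin(chf_codes))

    print("CHF admissions, primary diagnosis only:", primary_only.sum())
    print("CHF admissions, widened definition:    ", widened.sum())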
6. Data Preparation
An important first step in the data preparation stage was to actually define congestive heart failure. This sounded easy at first, but defining it precisely was not straightforward.
• First, the set of diagnosis-related group codes needed to be identified, as congestive heart failure implies certain kinds of fluid buildup.
• We also need to consider that congestive heart failure is only one type of heart failure. Clinical guidance was needed to get the right codes for congestive heart failure.
• The next step involved defining the re-admission criteria for the same condition. The timing of events needed to be evaluated in order to define whether a particular congestive heart failure admission was an initial event, which is called an index admission, or a congestive heart failure-related re-admission.
Based on clinical expertise, a time period of 30 days following discharge from the initial admission was set as the window for a readmission relevant to congestive heart failure patients.
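A rough sketch of how such a 30-day readmission flag could be derived is shown below. The input file and column names are hypothetical; the logic simply compares each admission date with the discharge date of the same patient's previous admission.

    # Sketch: labeling each CHF admission as an index admission or a 30-day readmission.
    # File and column names are hypothetical.
    import pandas as pd

    adm = pd.read_csv("chf_admissions.csv", parse_dates=["admit_date", "discharge_date"])
    adm = adm.sort_values(["patient_id", "admit_date"])

    # Discharge date of the same patient's previous CHF admission (NaT for the first one).
    prev_discharge = adm.groupby("patient_id")["discharge_date"].shift(1)

    # Readmission if admitted within 30 days of the previous discharge; otherwise an index admission.
    days_since_discharge = (adm["admit_date"] - prev_discharge).dt.days
    adm["is_readmission"] = days_since_discharge.le(30)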
• Next, the records that were in transactional format, meaning that the data included multiple records for each patient, were aggregated.
Transactional records included professional, provider, and facility claims submitted for physician, laboratory, hospital, and clinical services. Also included were records describing all the diagnoses, procedures, prescriptions, and other information about in-patients and out-patients. A given patient could easily have hundreds or even thousands of these records, depending on their clinical history.
• Then, all of this history was aggregated to the patient level, yielding a single record for each patient, as required for the decision-tree classification method that would be used for modeling.
As part of the aggregation process, many new columns were created representing the information in the transactions, for example the frequency and most recent occurrence of visits to doctors, clinics, and hospitals with diagnoses, procedures, and prescriptions. Co-morbidities with congestive heart failure were also considered, such as diabetes, hypertension, and many other diseases and chronic conditions that could impact the risk of re-admission for congestive heart failure.
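The sketch below illustrates this roll-up from transactional records to one row per patient, creating a few new summary columns of the kind described above. The input file, column names, visit types, and the example diagnosis code are all hypothetical.

    # Sketch: aggregating transactional records to a single row per patient.
    # File, columns, and codes are hypothetical.
    import pandas as pd

    tx = pd.read_csv("patient_transactions.csv", parse_dates=["service_date"])

    per_patient = tx.groupby("patient_id").agg(
        n_doctor_visits=("visit_type", lambda s: (s == "doctor").sum()),
        n_hospital_stays=("visit_type", lambda s: (s == "hospital").sum()),
        most_recent_visit=("service_date", "max"),
        has_diabetes=("diagnosis_code", lambda s: s.eq("E11.9").any()),  # example co-morbidity flag
    ).reset_index()

    # per_patient now has one record per patient, as the decision tree model requires.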
During the discussions around data preparation, a literature review on congestive heart failure was also undertaken to see whether any important data elements had been overlooked, such as co-morbidities that had not yet been accounted for. The literature review involved looping back to the data collection stage to add a few more indicators for conditions and procedures.
7. Data Modeling
In this step, many aspects of model construction are discussed. One of them is optimizing the parameters to improve the model.
Analyzing the 1st model
• With a set of prepared training data, the first decision tree classification model for congestive heart failure readmission could be built. We are looking for patients with a high risk of readmission, so the outcome of interest is a congestive heart failure readmission equal to “yes”. In this first model, the overall accuracy in classifying the yes and no outcomes was 85%. This sounds good, but it represents only 45% of the actual readmissions (the “yes” outcomes) being classified correctly, which means that the model is not very accurate for the outcome of interest.
• The question is: how can the accuracy of the model in predicting the outcome of interest be improved? For decision tree classification, the best parameter to adjust is the relative cost of misclassified yes and no results.
• Think of it this way: when a true non-readmission is misclassified as a readmission, and action is taken to reduce that patient’s risk, the cost of the error is a wasted intervention.
• A statistician calls this a Type I error, or a false positive. But when a true readmission is misclassified as a non-readmission, and no action is taken to reduce the risk, the cost of that error is the readmission and all of its associated costs, as well as the trauma to the patient.
• That is a Type II error, or a false negative. We can see that the costs of the two different types of misclassification errors can be very different. For this reason, it is reasonable to adjust the relative weights of the misclassified yes and no results.
• The default cost ratio is 1:1, but the decision tree algorithm allows a higher cost to be set for the misclassified yes outcomes.
Analyzing the 2nd model
• For the second model, the relative cost was set at 9:1. This ratio is very high, but it gives more insight into the model’s behavior. This time the model correctly classified 97% of the yes outcomes, but at the expense of very low accuracy on the no outcomes, with an overall accuracy of only 49%. Obviously, this is not a good model either.
• The problem with this result is the large number of false positives, which would suggest unnecessary and costly interventions for patients who would never have been re-admitted.
• Therefore, the data scientist must try again to find a better balance between the yes and no accuracies.
Analyzing the 3rd model
• For the third model, the relative cost was set at a more reasonable 4:1 ratio. This time, 68% accuracy was obtained on the yes outcomes (what statisticians call sensitivity) and 85% accuracy on the no outcomes (called specificity), with an overall accuracy of 81%.
• This is the best balance that can be achieved with a relatively small training set by adjusting the relative cost of the misclassified yes and no results. Of course, much more work goes into modeling, including iterating back to the data preparation phase to redefine some of the other variables so that they better represent the underlying information and thus improve the model. A minimal sketch of this cost-weighted modeling step is shown below.
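The exact tool used in the case study is not named in this module, but in scikit-learn a comparable lever for the relative misclassification cost is the class_weight parameter of the decision tree. The sketch below weights the "yes" outcome 4:1, as in the third model, and reports sensitivity, specificity, and overall accuracy; the data and column names are hypothetical.

    # Sketch: weighting misclassified "yes" outcomes more heavily (4:1) and reporting
    # sensitivity, specificity, and overall accuracy. Data is hypothetical, and
    # class_weight is used as a stand-in for the relative cost parameter described above.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix

    data = pd.DataFrame({
        "age":                  [54, 67, 71, 48, 80, 62, 59, 73],
        "num_prior_admissions": [0, 2, 3, 0, 4, 1, 0, 2],
        "readmitted":           ["no", "yes", "yes", "no", "yes", "no", "no", "yes"],
    })
    X = data.drop(columns="readmitted")
    y = data["readmitted"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0, stratify=y)

    model = DecisionTreeClassifier(class_weight={"yes": 4, "no": 1}, random_state=0)
    model.fit(X_train, y_train)

    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test), labels=["no", "yes"]).ravel()
    sensitivity = tp / (tp + fn)                 # accuracy on the actual "yes" patients
    specificity = tn / (tn + fp)                 # accuracy on the actual "no" patients
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    print(sensitivity, specificity, accuracy)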
8. Model Evaluation
In this part of the case study, the model evaluation component of the data science methodology is applied.
Determining the optimal model
• Look for a way to find the optimal model through a diagnostic measure based on tuning one of the model-building parameters. Specifically, examine more closely how the relative costs of misclassifying yes and no results can be adjusted. Four models were constructed with four different relative misclassification costs.
• Each increase in this model-building parameter increases the true-positive rate, or sensitivity, of the accuracy in predicting yes, at the expense of lower accuracy in predicting no, that is, an increasing false-positive rate.
• The question is, which model is best, based on the setting of this parameter? On one hand, for budgetary reasons, the risk-reduction intervention could not be applied to most heart failure patients, many of whom would not have been re-admitted anyway.
• On the other hand, if too few patients were flagged, the intervention would not be as effective as it should be at improving patient care, because not enough high-risk heart failure patients would receive it.
• So how do we determine which model is optimal? The optimal model is the one that provides the maximum separation between the blue ROC curve and the red baseline.
• We can see that model 3, with a relative misclassification cost of 4:1, is the best of the four models. In case you were wondering, ROC stands for receiver operating characteristic curve, which was first developed during World War II to detect enemy aircraft on radar. A brief sketch of comparing models by their ROC curves is given below.
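The sketch below shows one way to compare candidate models by their ROC curves and to measure how far each curve rises above the diagonal baseline. The true outcomes and the model scores are invented for illustration only.

    # Sketch: comparing models by ROC curve and by their separation from the baseline.
    # Outcomes and scores are hypothetical.
    import numpy as np
    from sklearn.metrics import roc_curve, auc

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])   # 1 = readmitted ("yes")
    model_scores = {
        "model 1 (1:1 cost)": np.array([0.20, 0.40, 0.50, 0.60, 0.30, 0.70, 0.10, 0.40]),
        "model 3 (4:1 cost)": np.array([0.10, 0.30, 0.80, 0.70, 0.20, 0.90, 0.30, 0.60]),
    }

    for name, scores in model_scores.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        separation = (tpr - fpr).max()    # largest vertical distance above the diagonal baseline
        print(name, "AUC =", round(auc(fpr, tpr), 3), "max separation =", round(separation, 3))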
9. Deployment
• Then, doctors would have the most up-to-date risk assessment for each patient to help them choose which patients to treat after discharge. As part of providing the solution, the intervention team would develop and deliver training for the clinical staff.
• In addition, in collaboration with IT developers and database administrators, tracking and monitoring processes would be developed for the patients receiving the intervention, so that the results could flow through the feedback phase and the model could mature over time.
• This map is an example of a solution implemented through a Cognos application (IBM Cognos Business Intelligence is a web-based integrated business intelligence suite by IBM). In this case, the application focused on the hospitalization risk of patients with juvenile diabetes. Similar to the congestive heart failure study, decision tree classification was used to create the risk model that formed the basis of this application.
• The map provides an overview of hospitalization risk nationwide, with a planned interactive risk assessment for different patient conditions and other characteristics. A risk summary report for a given node of the decision tree model provides an interactive summary of the risk for that patient population, so that doctors can understand the combination of conditions for that subset of patients.
Individual patient risk report
• This report provides a detailed summary for a single patient, including details of the patient’s history and expected risk, giving the doctor a concise overview.
10. Feedback
Once the model has been evaluated and the data scientist trusts that it will work, it is deployed and undergoes the final test: real use, in real time, in the field.
• The feedback phase plan included the following steps. First, the review process would be defined and put in place, with overall responsibility for measuring the results of applying the risk model to the heart failure risk population. Clinical management would have overall responsibility for the review process.
• Second, patients with heart failure who receive
an intervention would be monitored and their
readmission results recorded.
• Third, the intervention would be measured to
determine its effectiveness in reducing the
number of readmissions.
• For ethical reasons, patients with heart failure would not be split into control and treatment groups. Instead, readmission rates would be compared before and after the implementation of the model to measure its impact.
• After deployment and feedback, the impact of the intervention program on readmission rates would be reviewed after the first year of implementation.
• Then, the model would be refined based on all of the data compiled after implementation and on the knowledge gained during these steps. Other improvements would include incorporating information on participation in the intervention program and possibly refining the model with detailed pharmaceutical data.
• Drug data collection was initially deferred because that data was not available at the time. However, after feedback and practical experience with the model, it might be decided that adding this data would be worth the investment of time and money. The possibility of further adjustments during the feedback phase must also be considered.
• In addition, the intervention actions and processes would be reviewed and probably refined, based on the experience and knowledge gained during the initial implementation and feedback.
• Finally, the refined model and intervention would be redeployed, and the feedback
process would continue throughout the intervention program.
Multiple choice. Analyze the questions carefully. Choose the letter of the correct answer.
1. Select the correct statement that describes data science methodology.
A. Data science methodology is not an iterative process – one does not go back and forth
between methodological steps.
B. Data science methodology is a specific strategy that guides processes and activities
relating to data science only for text analytics.
C. Data science methodology depends on a specific set of technologies or tools.
D. Data science methodology provides data scientists with a framework for how to proceed
to obtain answers.
2. What do data scientists usually use for exploratory analysis of data and to get acquainted
with them?
A. They use support vector machines and neural networks as feature extraction techniques.
B. They begin with regression, classification, or clustering.
C. They use descriptive statistics and data visualization techniques.
D. They use deep learning.
3. Why should data scientists maintain continuous communication with business sponsors
throughout a project?
A. So that business sponsors can provide domain expertise.
B. So that business sponsors can ensure the work remains on track to generate the intended
solution.
C. So that business sponsors can review intermediate findings.
D. All of the above.
4. For predictive models, a test set, which is similar to – but independent of – the training set,
is used to determine how well the model predicts outcomes. This is an example of what
step in the methodology?
A. Deployment
B. Data preparation
C. Model evaluation
D. Analytic approach
5. Data understanding involves all of the following EXCEPT for?
A. Discovering initial insights about the data
B. Visualizing the data
C. Assessing data quality
D. Gathering and analyzing feedback for assessment of the model’s performance
6. The following are all examples of rapidly evolving technologies that affect data science
methodology EXCEPT for?
A. Data Sampling
B. Text Analysis
C. Platform Growth
D. In-database analytics
7. Data scientists may use either a “top-down” or a “bottom-up” approach to data science.
These two approaches refer to:
A. Top-down approach – the data, when sorted, is modeled from the “top” of the data towards
the “bottom”. Bottom-up approach – the data is modeled from the “bottom” of the data
to the “top”.
B. Top-down approach – models are fit before the data is explored. Bottom-up approach –
the data is explored, and then a model is fit.
C. Top-down approach – first defining a business problem, then analyzing the data to find a
solution. Bottom-up approach – starting with the data, and then coming up with a business
problem based on the data.
D. Top-down approach – using massively parallel warehouses with huge data volumes as the
data source. Bottom-up approach – using a sample of small data before using large data.
8. A car company asked a data scientist to determine what type of customers are more likely
to purchase their vehicles. However, the data comes from several sources and is in a
relatively “raw format”. What kind of processing can the data scientist perform on the data
to prepare it for modeling?
A. Feature Engineering
B. Transforming the data into more useful variables
C. Addressing missing/invalid values
D. All of the above
9. A data scientist, John, was asked to help reduce readmission rates at a local hospital. After
some time, John provided a model that predicted which patients were more likely to be
readmitted to the hospital and declared that his work was done. Which of the following best
describes the scenario?
A. John only provided one model as a solution and he should have provided multiple
models.
B. The scenario is already optimal.
C. Even though John only submitted one solution, it might be a good one. However, John
needed feedback on his model from the hospital to confirm that his model was able to
address the problem appropriately and sufficiently.
D. John still needs to collect more data.
10. Data scientists may frequently return to a previous stage to make adjustments, as they learn
more about the data and the modeling.
A. True
B. False
A case study helps students learn by immersing them in a real-world business scenario where they can act as problem-solvers and decision-makers. The case presents facts about a particular organization. The analysis is done by focusing on the most important facts and using this information to determine the opportunities and problems facing that organization. Then, alternative courses of action to deal with the problems are identified.
To be more familiar with the task, you may visit this link:
● Laudon, Kenneth C. (2021, November). Essentials of Management Information
Systems, Sixth Edition. https://tinyurl.com/63t6sr49
In your Virtual Expo Entry No. 5, you will look for research, examples, and case studies to which you can apply the data science methodology. Dissect your chosen case study and identify which specific parts relate to the different steps of the methodology.
The content and presentation of your analysis on your group’s website will be graded as your
Performance Task. You may convey it through creative graphics and illustrations. Navigation will also
be graded.
Refer to this rubric for the grading of your output:
VIRTUAL EXPO RUBRIC

Content
• Exemplary (15): The content is rich, concise, and straightforward. The content is relevant to the discussed topics and thoroughly answers the questions.
• Proficient (12): Content is complete and includes relevant detail.
• Partially Proficient (9): There is adequate detail. Some extraneous information and minor gaps are included.
• Incomplete (5): There is insufficient detail, or detail is irrelevant and extraneous.

Creativity/Visual
• Exemplary (15): The expo is visually effective. The use of graphics/images/photographs seamlessly relates well to the content.
• Proficient (12): The expo is visually sensible. The graphics/images/photographs used are included and appropriate.
• Partially Proficient (9): The main theme is still discernible, but the graphics/images/photographs are used randomly.
• Incomplete (5): Lacks visual clarity. The graphics/images/photographs distract from the content of the expo.

Navigation
• Exemplary (10): The document is fully hyperlinked. The index is well organized and easy to navigate.
• Proficient (8): Hyperlinks are organized into logical groups. Not all possible features have been employed.
• Partially Proficient (5): Hyperlinks are good but lack organization.
• Incomplete (2): There are few links. Some links are “broken”.
References
Logallo, Nunzio (2019). Data Science Methodology 101: How can a Data Scientist organize his work? https://towardsdatascience.com/data-science-methodology-101-ce9f0d660336
Patel, Ashish (2019). Data Science Methodology — How to design your data science project. https://medium.com/ml-research-lab/data-science-methodology-101-2fa9b7cf2ffe
Multiple Choice
1. D
2. C
3. D
4. C
5. D
6. A
7. C
8. D
9. C
10. A