Module 6 - Data Science Methodology (Steps)
Learning Competencies
6.1. Apply methodology in different types of data science problems.
6.2. Compare the stages of data science methodology from understanding to
preparation, modeling to evaluation, and deployment to feedback.
6.3. Explain what happens when a model is deployed and why model feedback is
important.
The fundamentals of the data science methodology were presented in the previous lesson. Take note that the stages of this methodology are iterative. These steps include forming a concrete business or research problem, collecting and analyzing data, building a model, and understanding the feedback after model deployment.
To further deepen your understanding of the whole process, in this lesson you will learn how to think like a data scientist, including taking the steps involved in tackling a data science problem and applying them to interesting real-world examples.
The Business Dictionary defines methodology as a system of broad principles or rules from which specific methods or procedures may be derived to interpret or solve problems. It is necessary to keep this in mind, since the temptation to circumvent the methodology and jump directly to solutions is often great.
1. Business Understanding
Applying the concepts
Problem: How best can a limited health budget be divided and put to optimal use to provide quality care?
As public funding for readmissions declined, the insurance company risked having to absorb the difference in costs, which could lead to higher rates for its clients.
The first thing to do is to define the problem being faced. In this case, the insurance company's rates are under review.
2. Analytic Approach
In a case where a question about human behavior is asked, a clustering approach would be an appropriate response. Let us now examine how the analytic approach was applied in the case study. Here, a decision tree classification model was used to identify the combination of conditions that led to each patient's outcome.
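The module itself contains no code, but a minimal sketch of what decision tree classification can look like in practice may help make the idea concrete. The sketch below uses Python with pandas and scikit-learn; all column names and values are invented for illustration and are not taken from the case study data.

    # A minimal, hypothetical sketch of decision tree classification (Python).
    # pandas and scikit-learn are assumed; columns and values are invented.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    patients = pd.DataFrame({
        "age":                  [54, 67, 71, 48, 80, 62],
        "num_prior_admissions": [0, 2, 3, 0, 4, 1],
        "has_diabetes":         [0, 1, 1, 0, 1, 0],
        "readmitted":           ["no", "yes", "yes", "no", "yes", "no"],  # the outcome to predict
    })

    X = patients.drop(columns="readmitted")
    y = patients["readmitted"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # The tree learns the combinations of conditions that lead to each outcome.
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X_train, y_train)
    print(model.predict(X_test))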
3. Data Requirements
• First, the data scientist defined the content, format, and representations of the data needed for decision tree classification.
• This modeling technique requires one record per patient, with columns representing the variables of the model. To model readmission outcomes, data covering all aspects of the patient's medical history must be available.
• This content includes admissions; primary, secondary, and tertiary diagnoses; procedures; prescriptions; and other services provided during hospitalizations or patient/doctor visits.
In this way, a given patient can have thousands of records that represent all of their attributes. To obtain the one-record-per-patient format, the data analysts aggregated the transactional records from the patient files and created a set of new variables to represent that information. This was a task for the data preparation phase, so it is important to anticipate the next stages.
• This case study also required specific information on drugs, but this data source was not yet integrated with the rest of the data sources.
• This brings us to an important point: it is acceptable to postpone decisions about unavailable data and to try to capture it later.
• For example, this can happen even after obtaining intermediate results from the predictive modeling. If those results indicate that drug information may be important for a good model, the team would then spend the time needed to obtain it.
However, it turned out that a reasonably good model could be built without the drug information.
4. Data Collection
• Database administrators and programmers often work together to extract data from the different sources and then combine them, as sketched below.
• In this way, redundant data can be deleted and the data made available for the next stage of the methodology, namely data understanding.
• At this stage, the data scientists and analytics team members can discuss ways to better manage the data, for example by automating certain database processes to facilitate data collection.
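As a rough illustration of this extract-and-combine step, the sketch below merges two hypothetical source extracts and drops duplicate rows with pandas. The file names, the patient_id key, and the columns are assumptions made only for this example.

    # Sketch: combining data extracted from different sources and removing redundant records.
    # File names, keys, and columns are hypothetical.
    import pandas as pd

    claims     = pd.read_csv("claims_extract.csv")       # e.g. extract from the claims system
    admissions = pd.read_csv("admissions_extract.csv")   # e.g. extract from the hospital system

    # Combine the sources on a shared patient identifier.
    combined = claims.merge(admissions, on="patient_id", how="outer")

    # Delete redundant (duplicate) rows before handing the data to the data understanding stage.
    combined = combined.drop_duplicates()
    combined.to_csv("combined_data.csv", index=False)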
5. Data Understanding
• First, heart failure admissions were identified based on a primary diagnosis of heart failure. However, the data understanding work revealed that this initial definition did not cover all of the heart failure cases that clinical experience would expect.
• This meant looping back to the data collection phase, adding secondary and tertiary diagnoses, and building a more complete definition of a heart failure admission, as sketched below.
• This is just one example of the iterative processes in the methodology. The more you work with the problem and the data, the more you learn, and the more the model can be adjusted, which ultimately leads to a better solution to the problem.
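To make the idea of refining the definition concrete, the sketch below compares how many admissions a primary-diagnosis-only definition captures against one that also checks secondary and tertiary diagnoses. The diagnosis codes and column names are hypothetical placeholders, not the codes used in the case study.

    # Sketch: comparing a narrow and a widened definition of heart failure admissions.
    # Codes and column names are hypothetical.
    import pandas as pd

    admissions = pd.read_csv("combined_data.csv")
    chf_codes = {"I50.1", "I50.9"}   # placeholder congestive heart failure codes

    primary_only = admissions["primary_dx"].isin(chf_codes)
    widened = (primary_only
               | admissions["secondary_dx"].isin(chf_codes)
               | admissions["tertiary_dx"].isin(chf_codes))

    print("CHF admissions, primary diagnosis only:", primary_only.sum())
    print("CHF admissions, widened definition:    ", widened.sum())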
6. Data Preparation
An important first step in the data preparation stage was to actually define congestive heart failure. This sounded easy at first, but defining it precisely was not straightforward.
• First, the set of diagnosis-related group codes needed to be identified, as congestive heart failure implies certain kinds of fluid buildup.
• We also need to consider that congestive heart failure is only one type of heart failure. Clinical guidance was needed to get the right codes for congestive heart failure.
• The next step involved defining the re-admission criteria for the same condition. The timing of events needed to be evaluated in order to define whether a particular congestive heart failure admission was an initial event, which is called an index admission, or a congestive heart failure-related re-admission.
Based on clinical expertise, a time period of 30 days following discharge from the initial admission was set as the window for a readmission relevant to congestive heart failure patients.
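A rough sketch of how such a 30-day readmission flag could be derived is shown below. The input file and column names are hypothetical; the logic simply compares each admission date with the discharge date of the same patient's previous admission.

    # Sketch: labeling each CHF admission as an index admission or a 30-day readmission.
    # File and column names are hypothetical.
    import pandas as pd

    adm = pd.read_csv("chf_admissions.csv", parse_dates=["admit_date", "discharge_date"])
    adm = adm.sort_values(["patient_id", "admit_date"])

    # Discharge date of the same patient's previous CHF admission (NaT for the first one).
    prev_discharge = adm.groupby("patient_id")["discharge_date"].shift(1)

    # Readmission if admitted within 30 days of the previous discharge; otherwise an index admission.
    days_since_discharge = (adm["admit_date"] - prev_discharge).dt.days
    adm["is_readmission"] = days_since_discharge.le(30)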
• Next, the records that were in transactional format, meaning that the data included multiple records for each patient, were aggregated.
Transactional records included professional, provider, and facility claims submitted for physician, laboratory, hospital, and clinical services. Also included were records describing all the diagnoses, procedures, prescriptions, and other information about in-patients and out-patients. A given patient could easily have hundreds or even thousands of these records, depending on their clinical history.
• Then, all of this history was aggregated to the patient level, yielding a single record for each patient, as required for the decision-tree classification method that would be used for modeling.
As part of the aggregation process, many new columns were created representing the information in the transactions, for example the frequency and most recent occurrence of visits to doctors, clinics, and hospitals with diagnoses, procedures, and prescriptions. Co-morbidities with congestive heart failure were also considered, such as diabetes, hypertension, and many other diseases and chronic conditions that could impact the risk of re-admission for congestive heart failure.
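The sketch below illustrates this roll-up from transactional records to one row per patient, creating a few new summary columns of the kind described above. The input file, column names, visit types, and the example diagnosis code are all hypothetical.

    # Sketch: aggregating transactional records to a single row per patient.
    # File, columns, and codes are hypothetical.
    import pandas as pd

    tx = pd.read_csv("patient_transactions.csv", parse_dates=["service_date"])

    per_patient = tx.groupby("patient_id").agg(
        n_doctor_visits=("visit_type", lambda s: (s == "doctor").sum()),
        n_hospital_stays=("visit_type", lambda s: (s == "hospital").sum()),
        most_recent_visit=("service_date", "max"),
        has_diabetes=("diagnosis_code", lambda s: s.eq("E11.9").any()),  # example co-morbidity flag
    ).reset_index()

    # per_patient now has one record per patient, as the decision tree model requires.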
During the discussions around data preparation, a literature review on congestive heart failure was also undertaken to see whether any important data elements had been overlooked, such as co-morbidities that had not yet been accounted for. The literature review involved looping back to the data collection stage to add a few more indicators for conditions and procedures.
7. Data Modeling
In this step, many aspects of model construction are discussed. One of them is optimizing the parameters to improve the model.
Analyzing the 1st model
• With a set of prepared training data, the first decision tree classification model for congestive heart failure readmission could be built. We are looking for patients with a high risk of readmission, so the outcome of interest is a congestive heart failure readmission equal to “yes”. In this first model, the overall accuracy in classifying the yes and no outcomes was 85%. This sounds good, but it represents only 45% of the actual readmissions (the “yes” outcomes) being classified correctly, which means that the model is not very accurate for the outcome of interest.
• The question is: how can the accuracy of the model in predicting the outcome of interest be improved? For decision tree classification, the best parameter to adjust is the relative cost of misclassified yes and no results.
• Think of it this way: when a true non-readmission is misclassified as a readmission, and action is taken to reduce that patient’s risk, the cost of the error is a wasted intervention.
• A statistician calls this a Type I error, or a false positive. But when a true readmission is misclassified as a non-readmission, and no action is taken to reduce the risk, the cost of that error is the readmission and all of its associated costs, as well as the trauma to the patient.
• That is a Type II error, or a false negative. We can see that the costs of the two different types of misclassification errors can be very different. For this reason, it is reasonable to adjust the relative weights of the misclassified yes and no results.
• The default cost ratio is 1:1, but the decision tree algorithm allows a higher cost to be set for the misclassified yes outcomes.
Analyzing the 2nd model
• For the second model, the relative cost was set at 9:1. This ratio is very high, but it gives more insight into the model’s behavior. This time the model correctly classified 97% of the yes outcomes, but at the expense of very low accuracy on the no outcomes, with an overall accuracy of only 49%. Obviously, this is not a good model either.
• The problem with this result is the large number of false positives, which would suggest unnecessary and costly interventions for patients who would never have been re-admitted.
• Therefore, the data scientist must try again to find a better balance between the yes and no accuracies.
Analyzing the 3rd model
• For the third model, the relative cost was set at a more reasonable 4:1 ratio. This time, 68% accuracy was obtained on the yes outcomes (what statisticians call sensitivity) and 85% accuracy on the no outcomes (called specificity), with an overall accuracy of 81%.
• This is the best balance that can be achieved with a relatively small training set by adjusting the relative cost of the misclassified yes and no results. Of course, much more work goes into modeling, including iterating back to the data preparation phase to redefine some of the other variables so that they better represent the underlying information and thus improve the model. A minimal sketch of this cost-weighted modeling step is shown below.
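The exact tool used in the case study is not named in this module, but in scikit-learn a comparable lever for the relative misclassification cost is the class_weight parameter of the decision tree. The sketch below weights the "yes" outcome 4:1, as in the third model, and reports sensitivity, specificity, and overall accuracy; the data and column names are hypothetical.

    # Sketch: weighting misclassified "yes" outcomes more heavily (4:1) and reporting
    # sensitivity, specificity, and overall accuracy. Data is hypothetical, and
    # class_weight is used as a stand-in for the relative cost parameter described above.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix

    data = pd.DataFrame({
        "age":                  [54, 67, 71, 48, 80, 62, 59, 73],
        "num_prior_admissions": [0, 2, 3, 0, 4, 1, 0, 2],
        "readmitted":           ["no", "yes", "yes", "no", "yes", "no", "no", "yes"],
    })
    X = data.drop(columns="readmitted")
    y = data["readmitted"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0, stratify=y)

    model = DecisionTreeClassifier(class_weight={"yes": 4, "no": 1}, random_state=0)
    model.fit(X_train, y_train)

    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test), labels=["no", "yes"]).ravel()
    sensitivity = tp / (tp + fn)                 # accuracy on the actual "yes" patients
    specificity = tn / (tn + fp)                 # accuracy on the actual "no" patients
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    print(sensitivity, specificity, accuracy)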
8. Model Evaluation
In this part of the case study, the model evaluation component of the data science methodology is applied.
Determining the optimal model
• Look for a way to find the optimal model through a diagnostic measure based on tuning one of the model-building parameters. Specifically, examine more closely how the relative costs of misclassifying yes and no results can be adjusted. Four models were constructed with four different relative misclassification costs.
• Each increase in this model-building parameter increases the true-positive rate, or sensitivity, of the accuracy in predicting yes, at the expense of lower accuracy in predicting no, that is, an increasing false-positive rate.
• The question is, which model is best, based on the setting of this parameter? On one hand, for budgetary reasons, the risk-reduction intervention could not be applied to most heart failure patients, many of whom would not have been re-admitted anyway.
• On the other hand, if too few patients were flagged, the intervention would not be as effective as it should be at improving patient care, because not enough high-risk heart failure patients would receive it.
• So how do we determine which model is optimal? The optimal model is the one that provides the maximum separation between the blue ROC curve and the red baseline.
• We can see that model 3, with a relative misclassification cost of 4:1, is the best of the four models. In case you were wondering, ROC stands for receiver operating characteristic curve, which was first developed during World War II to detect enemy aircraft on radar. A brief sketch of comparing models by their ROC curves is given below.
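The sketch below shows one way to compare candidate models by their ROC curves and to measure how far each curve rises above the diagonal baseline. The true outcomes and the model scores are invented for illustration only.

    # Sketch: comparing models by ROC curve and by their separation from the baseline.
    # Outcomes and scores are hypothetical.
    import numpy as np
    from sklearn.metrics import roc_curve, auc

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])   # 1 = readmitted ("yes")
    model_scores = {
        "model 1 (1:1 cost)": np.array([0.20, 0.40, 0.50, 0.60, 0.30, 0.70, 0.10, 0.40]),
        "model 3 (4:1 cost)": np.array([0.10, 0.30, 0.80, 0.70, 0.20, 0.90, 0.30, 0.60]),
    }

    for name, scores in model_scores.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        separation = (tpr - fpr).max()    # largest vertical distance above the diagonal baseline
        print(name, "AUC =", round(auc(fpr, tpr), 3), "max separation =", round(separation, 3))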
9. Deployment
• Then, doctors would have the most up-to-date risk assessment for each patient to help them choose which patients to treat after discharge. As part of providing the solution, the intervention team would develop and deliver training for the clinical staff.
• In addition, in collaboration with IT developers and database administrators, tracking and monitoring processes would be developed for the patients receiving the intervention, so that the results could flow through the feedback phase and the model could mature over time.
• This map is an example of a solution implemented through a Cognos application (IBM Cognos Business Intelligence is a web-based integrated business intelligence suite by IBM). In this case, the application focused on the hospitalization risk of patients with juvenile diabetes. Similar to the congestive heart failure study, decision tree classification was used to create the risk model that formed the basis of this application.
• The map provides an overview of hospitalization risk nationwide, with a planned interactive risk assessment for different patient conditions and other characteristics. A risk summary report for a given node of the decision tree model provides an interactive summary of the risk for that patient population, so that doctors can understand the combination of conditions for that subset of patients.
Individual patient risk report
• This report provides a detailed summary for a single patient, including details of the patient’s history and expected risk, giving the doctor a concise overview.
10. Feedback
Once the model has been evaluated and the data scientist trusts that it will work, it is deployed and undergoes the final test: real use, in real time, in the field.
• The feedback phase plan included the following steps. First, the review process would be defined and put in place, with overall responsibility for measuring the results of applying the risk model to the heart failure risk population. Clinical management would have overall responsibility for the review process.
• Second, patients with heart failure who receive
an intervention would be monitored and their
readmission results recorded.
• Third, the intervention would be measured to
determine its effectiveness in reducing the
number of readmissions.
• For ethical reasons, patients with heart failure would not be split into control and treatment groups. Instead, readmission rates would be compared before and after the implementation of the model to measure its impact.
• After deployment and feedback, the impact of the intervention program on readmission rates would be reviewed after the first year of implementation.
• Then, the model would be refined based on all of the data compiled after implementation and on the knowledge gained during these steps. Other improvements would include incorporating information on participation in the intervention program and possibly refining the model with detailed pharmaceutical data.
• Drug data collection was initially deferred because that data was not available at the time. However, after feedback and practical experience with the model, it might be decided that adding this data would be worth the investment of time and money. The possibility of further adjustments during the feedback phase must also be considered.
• In addition, the intervention actions and processes would be reviewed and probably refined, based on the experience and knowledge gained during the initial implementation and feedback.
• Finally, the refined model and intervention would be redeployed, and the feedback
process would continue throughout the intervention program.
Multiple choice. Analyze the questions carefully. Choose the letter of the correct answer.
1. Select the correct statement that describes data science methodology.
A. Data science methodology is not an iterative process – one does not go back and forth
between methodological steps.
B. Data science methodology is a specific strategy that guides processes and activities
relating to data science only for text analytics.
C. Data science methodology depends on a specific set of technologies or tools.
D. Data science methodology provides data scientists with a framework for how to proceed
to obtain answers.
2. What do data scientists usually use for exploratory analysis of data and to get acquainted
with them?
A. They use support vector machines and neural networks as feature extraction techniques.
B. They begin with regression, classification, or clustering.
C. They use descriptive statistics and data visualization techniques.
D. They use deep learning.
3. Why should data scientists maintain continuous communication with business sponsors
throughout a project?
A. So that business sponsors can provide domain expertise.
B. So that business sponsors can ensure the work remains on track to generate the intended
solution.
C. So that business sponsors can review intermediate findings.
D. All of the above.
4. For predictive models, a test set, which is similar to – but independent of – the training set,
is used to determine how well the model predicts outcomes. This is an example of what
step in the methodology?
A. Deployment
B. Data preparation
C. Model evaluation
D. Analytic approach
5. Data understanding involves all of the following EXCEPT for?
A. Discovering initial insights about the data
B. Visualizing the data
C. Assessing data quality
D. Gathering and analyzing feedback for assessment of the model’s performance
6. The following are all examples of rapidly evolving technologies that affect data science
methodology EXCEPT for?
A. Data Sampling
B. Text Analysis
C. Platform Growth
D. In-database analytics
7. Data scientists may use either a “top-down” or a “bottom-up” approach to data science.
These two approaches refer to:
A. Top-down approach – the data, when sorted, is modeled from the “top” of the data towards
the “bottom”. Bottom-up approach – the data is modeled from the “bottom” of the data
to the “top”.
B. Top-down approach – models are fit before the data is explored. Bottom-up approach –
the data is explored, and then a model is fit.
C. Top-down approach – first defining a business problem, then analyzing the data to find a
solution. Bottom-up approach – starting with the data, and then coming up with a business
problem based on the data.
D. Top-down approach – using massively parallel warehouses with huge data volumes as the
data source. Bottom-up approach – using a sample of small data before using large data.
8. A car company asked a data scientist to determine what type of customers are more likely
to purchase their vehicles. However, the data comes from several sources and is in a
relatively “raw format”. What kind of processing can the data scientist perform on the data
to prepare it for modeling?
A. Feature Engineering
B. Transforming the data into more useful variables
C. Addressing missing/invalid values
D. All of the above
9. A data scientist, John, was asked to help reduce readmission rates at a local hospital. After
some time, John provided a model that predicted which patients were more likely to be
readmitted to the hospital and declared that his work was done. Which of the following best
describes the scenario?
A. John only provided one model as a solution and he should have provided multiple
models.
B. The scenario is already optimal.
C. Even though John only submitted one solution, it might be a good one. However, John
needed feedback on his model from the hospital to confirm that his model was able to
address the problem appropriately and sufficiently.
D. John still needs to collect more data.
10. Data scientists may frequently return to a previous stage to make adjustments, as they learn
more about the data and the modeling.
A. True
B. False
A case study helps students learn by immersing them in a real-world business scenario where they can act as problem-solvers and decision-makers. The case presents facts about a particular organization. The analysis is done by focusing on the most important facts and using this information to determine the opportunities and problems facing that organization. Then, alternative courses of action to deal with the problems are identified.
To be more familiar with the task, you may visit this link:
● Laudon, Kenneth C. (2021, November). Essentials of Management Information
Systems, Sixth Edition. https://tinyurl.com/63t6sr49
In your Virtual Expo Entry No. 5, you will look for research, examples, and case studies to which you can apply the data science methodology. Dissect your chosen case study and identify which specific parts relate to the different steps of the methodology.
The content and presentation of your analysis on your group’s website will be graded as your
Performance Task. You may convey it through creative graphics and illustrations. Navigation will also
be graded.
Refer to this rubric for the grading of your output:
VIRTUAL EXPO RUBRIC

Content
• Exemplary (15): The content is rich, concise, and straightforward. The content is relevant to the discussed topics and thoroughly answers the questions.
• Proficient (12): Content is complete and includes relevant detail.
• Partially Proficient (9): There is adequate detail. Some extraneous information and minor gaps are included.
• Incomplete (5): There is insufficient detail, or detail is irrelevant and extraneous.

Creativity/Visual
• Exemplary (15): The expo is visually effective. The use of graphics/images/photographs seamlessly relates well to the content.
• Proficient (12): The expo is visually sensible. The graphics/images/photographs used are included and appropriate.
• Partially Proficient (9): The main theme is still discernible, but the graphics/images/photographs are used randomly.
• Incomplete (5): Lacks visual clarity. The graphics/images/photographs distract from the content of the expo.

Navigation
• Exemplary (10): The document is fully hyperlinked. The index is well organized and easy to navigate.
• Proficient (8): Hyperlinks are organized into logical groups. Not all possible features have been employed.
• Partially Proficient (5): Hyperlinks are good but lack organization.
• Incomplete (2): There are few links. Some links are “broken”.
References
Logallo, Nunzio (2019). Data Science Methodology 101: How can a Data Scientist organize his work? https://towardsdatascience.com/data-science-methodology-101-ce9f0d660336
Patel, Ashish (2019). Data Science Methodology — How to design your data science project. https://medium.com/ml-research-lab/data-science-methodology-101-2fa9b7cf2ffe
Multiple Choice
1. D
2. C
3. D
4. C
5. D
6. A
7. C
8. D
9. C
10. A