Module 5 - Data Science Methodologies


MODULE 5:

MODULE 1: MULTIMEDIA

Data science

Learning Competencies

5.1. Familiarize with the methodology and its flowchart.


5.2. Understand methodology as needed by data scientists.
5.3. Apply methodology in different types of data science problems.

G11: 41

Data science is an enormous field. It is not only about developing machine learning models or
predicting outcomes for the various scenarios a person can encounter when dealing with data. A
data scientist wears different hats and might be responsible for one or more of the following:
business understanding, data understanding, data preparation, modeling, evaluation, and
deployment.
These tasks are linked to one another, and each supports the other roles within the data
science methodology. A data science methodology lays out the routine for finding solutions to a
specific problem. It is a cyclic process of critical review that guides business analysts and
data scientists to act accordingly.

What is Methodology in Data Science?


Every data scientist needs a methodology to solve data science problems. The right methodology
helps to organize work, analyze different types of data, and solve the problem at hand.
The data science methodology presented here is composed of 10 stages.

From Problem to Approach and From Requirements to Collection


1. Business understanding

➢ What is the problem you are trying to solve?

Business understanding is the first step of any data science methodology. The methodology
begins by seeking clarification of the problem, because only once the problem is clear can you
determine which data are needed to answer the central question.
The business partners who need the analytics solution play a critical role in this phase by
defining the problem, the project objectives, and the solution requirements from a business
perspective.

For example, if a business owner asks, “How can we lower the cost of an activity?”, we need to
understand whether the goal is to improve the efficiency of the activity or to increase the
profitability of the business. Once the goal is clear, the next piece of the puzzle is to
determine the objectives that support it. Breaking down the objectives leads to structured
discussions in which priorities are set, which helps to organize and plan how to deal with the
problem. Depending on the problem, different stakeholders should take part in the discussion to
identify the requirements and clarify the questions.

2. Analytic approach

➢ How can you use the data to answer the question?

Once the business problem has been clearly identified, the data scientist can define the
analytic approach. To do this, the problem must be expressed in the context of statistical and
machine learning techniques so that the data scientist can identify the techniques suited to
achieving the desired result.
Choosing the right analytic approach depends on the question being asked. This involves asking
the person who posed the question for clarification, so that the most appropriate approach can
be chosen.

Here we can understand the second stage of the data science methodology.


Once the problem to be addressed is defined, the appropriate analytic approach is selected in
the context of the needs of the business. This is the second step in the data science
methodology.
➢ Once a deep understanding of the question is established, the analytic approach can be
selected. This means identifying what type of pattern is needed to address the problem most
effectively.
➢ When the question is about determining the probability of an action, a predictive model can
be used.
➢ When the question is about identifying relationships, a descriptive approach may be
necessary. This would be one that analyzes groups of similar activities based on events and
preferences.
➢ Statistical analysis applies to problems that require counts. If the question requires a
yes/no answer, a classification approach to predicting the response is appropriate.
➢ Machine learning is a field of study in which computers learn without being explicitly
programmed. It can be used to identify relationships and trends in data that would otherwise be
inaccessible or go unidentified.
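
The choice of approach described in this section can be sketched as a simple lookup from the kind of question to a technique. This is only an illustration and not part of the original module: the function name `suggest_approach` and the question-type labels are hypothetical, and a real project would settle the approach in discussion with the person asking the question.

```python
# Hypothetical mapping from the kind of question to an analytic approach,
# following the examples given in this section.
APPROACHES = {
    "probability of an action": "predictive model",
    "relationships between items": "descriptive approach",
    "counts": "statistical analysis",
    "yes/no answer": "classification",
}


def suggest_approach(question_type: str) -> str:
    """Return the approach for a question type, or ask for clarification."""
    key = question_type.strip().lower()
    # An unknown question type means the question must be clarified first.
    return APPROACHES.get(key, "clarify the question with the person asking")


print(suggest_approach("Yes/No answer"))  # classification
print(suggest_approach("counts"))         # statistical analysis
```

When the question type is not recognized, the sketch falls back to asking for clarification, which mirrors the advice given at the start of this stage.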

3. Data requirements

➢ What data do you need to answer the question?

The analytic approach determines the data requirements, because the methods of analysis to be
used require specific content, formats, and data representations, guided by domain knowledge.
Imagine that your goal is to prepare a spaghetti dinner but you do not have the right
ingredients for the dish: your success will be affected.
Think of this section of the data science methodology as cooking with data. Each step is
essential for the preparation of the meal.


So, if the problem to be solved is the recipe and the data are the ingredients, the data
scientist must identify which ingredients are necessary, how to obtain or collect them, how to
understand or use them, and how to prepare them to achieve the desired result.
➢ Based on the understanding of the problem and the analytic approach chosen, the data
scientist is ready to begin. Let’s look at some examples of the data requirements in the data
science methodology. Before the data collection and data preparation steps are performed, it is
important to define the data requirements for decision tree classification.
➢ This involves identifying the content, formats, and data sources needed for the initial data
collection. Now consider the case study on the application of “data requirements”.
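
One way to write down data requirements is as a small list of content, format, and source entries that can be checked before collection starts. The sketch below is an assumption made for illustration: the `DataRequirement` class, the three example requirements, and the list of available sources are all hypothetical and do not come from the module's case study.

```python
from dataclasses import dataclass


@dataclass
class DataRequirement:
    """One required data item: what it is, its format, and where it comes from."""
    content: str
    fmt: str
    source: str


# Hypothetical requirements for a yes/no classification problem.
requirements = [
    DataRequirement("customer age", "integer", "CRM export"),
    DataRequirement("purchase history", "CSV table", "sales database"),
    DataRequirement("responded to offer", "yes/no flag", "campaign log"),
]

# Sources assumed to be available at the start of the project.
available_sources = {"CRM export", "sales database"}

# Any requirement whose source is unavailable must be resolved before collection.
missing = [r.content for r in requirements if r.source not in available_sources]
print(missing)  # ['responded to offer']
```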

4. Data collection

➢ Where is the data coming from (identify all sources), and how will you get it?

The data scientist identifies and collects the data resources (structured, unstructured, and
semi-structured) relevant to the problem area. If the data scientist finds gaps in the collected
data, the data requirements may need to be revised and more data collected.
➢ Once the data collection is completed, the data scientist performs an assessment to
determine whether the required resources are available. As with buying ingredients for a meal,
some ingredients may be out of season and more difficult to obtain, or may cost more than
initially planned.
➢ At this stage, the data requirements are reviewed and a decision is made as to whether
more or less data is required for the collection.
➢ Once the data components have been collected, the data scientist has a good understanding
of what data will be available to work with.
➢ Techniques such as descriptive statistics and visualization can be applied to the dataset to
evaluate the content, quality, and initial insights of the original data. Gaps in the data are
identified, and plans must be made to fill them or to find replacements.

Essentially, the ingredients are now sitting on the cutting board. Now let’s look at some
examples of the data collection phase in the data science methodology. This step is performed
as a result of the data requirements step. Let us now consider the case study on the
application of “data collection.” To collect data, you must know the source or know where the
required data items are located.
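
As a rough sketch of this stage, the example below gathers records from a structured source (CSV) and a semi-structured source (JSON) and then lists the gaps. The two in-memory strings, the field names, and the `required` tuple are hypothetical stand-ins; a real project would read from files, databases, or services.

```python
import csv
import io
import json

# Hypothetical in-memory stand-ins for two data sources.
csv_source = "id,age\n1,34\n2,\n3,51\n"  # structured
json_source = '[{"id": 1, "city": "Manila"}, {"id": 3, "city": "Cebu"}]'  # semi-structured

records = {}
for row in csv.DictReader(io.StringIO(csv_source)):
    # An empty age field is stored as None so that it shows up as a gap.
    records[int(row["id"])] = {"age": int(row["age"]) if row["age"] else None}

for item in json.loads(json_source):
    records.setdefault(item["id"], {})["city"] = item["city"]

# A gap is any record that lacks one of the required fields.
required = ("age", "city")
gaps = {rid: [f for f in required if rec.get(f) is None] for rid, rec in records.items()}
gaps = {rid: fields for rid, fields in gaps.items() if fields}
print(gaps)  # {2: ['age', 'city']}
```

Here record 2 has neither an age nor a city, so the data requirements would be reviewed and more data collected for it.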

5. Data understanding

➢ Is the data that you collected representative of the problem to be solved?

Descriptive statistics and visualization techniques can help a data scientist understand the
content of the data, assess its quality, and obtain initial information about the data.
Returning to the previous step, data collection, may be necessary to close gaps in
understanding.
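
A first pass at data understanding can be as small as a table of descriptive statistics. The sketch below uses Python's standard `statistics` module on a hypothetical list of ages, where `None` marks a missing value; the data are invented for illustration only.

```python
import statistics

# Hypothetical collected values; None marks a missing entry.
ages = [34, None, 51, 29, 42, None, 38]

present = [a for a in ages if a is not None]
summary = {
    "count": len(present),
    "missing": len(ages) - len(present),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "min": min(present),
    "max": max(present),
}
print(summary)
```

A high count of missing values in such a summary is one signal that a return to data collection may be needed.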

6. Data preparation

➢ What additional work is required to manipulate and work with the data?

The data preparation step includes all the activities used to create the data set used during
the modeling phase. This includes cleansing data, combining data from multiple sources, and
transforming data into more useful variables. In addition, feature engineering and text
analysis can be used to derive new structured variables, enriching the set of predictors and
improving model accuracy.
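
The three activities named above (cleansing, combining, and transforming) can be sketched in a few lines. The `profiles` and `purchases` dictionaries and the derived variable names are hypothetical, and dropping records with a missing age is only one possible cleansing rule.

```python
# Hypothetical raw records from two sources, keyed by customer id.
profiles = {1: {"age": "34"}, 2: {"age": ""}, 3: {"age": "51"}}
purchases = {1: [120.0, 80.0], 3: [40.0]}

prepared = []
for cid, profile in profiles.items():
    # Cleansing: skip records whose age is missing.
    if not profile["age"]:
        continue
    # Combining: join the two sources on the customer id.
    amounts = purchases.get(cid, [])
    prepared.append({
        "id": cid,
        "age": int(profile["age"]),      # Transforming: text to number.
        "total_spent": sum(amounts),     # Feature engineering: derived variable.
        "n_purchases": len(amounts),     # Feature engineering: derived variable.
    })

print(prepared)
```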

7. Model Training

➢ In what way can the data be visualized to get the answer that is required?

Starting from the first version of the prepared data set, data scientists use a training data
set (historical data in which the desired result is known) to develop predictive or descriptive
models using the analytic approach described previously. The modeling process is highly
iterative and may vary from one problem to another.
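
To make the idea of training on historical data concrete, the sketch below fits a decision stump, that is, a decision tree with a single split. It is a deliberately small stand-in for a real model, not the method used in the module; the training set and the function name `train_stump` are hypothetical.

```python
# Hypothetical training set: (total_spent, responded) pairs with known outcomes.
training_set = [(20, 0), (35, 0), (50, 0), (80, 1), (120, 1), (200, 1)]


def train_stump(data):
    """Fit a one-split rule: predict 1 when the feature exceeds a threshold."""
    values = sorted({x for x, _ in data})
    # Candidate thresholds are the midpoints between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        # Number of training examples the rule would misclassify.
        return sum((x > threshold) != bool(y) for x, y in data)

    # Keep the threshold that misclassifies the fewest training examples.
    return min(candidates, key=errors)


threshold = train_stump(training_set)
print(threshold)  # 65.0
```

In practice this step is repeated with different features, parameters, or algorithms, which is what makes modeling iterative.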


8. Model Evaluation

➢ Does the model used really answer the initial question, or does it need to be adjusted?

The data scientist evaluates the quality of the model and verifies that the business problem
is handled in a complete and adequate manner. To do this, several diagnostic measures and
other results, such as tables and graphs, are computed using a test set for the predictive
model.
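
Common diagnostic measures for a yes/no model are accuracy, precision, and recall, all derived from the confusion matrix. The sketch below computes them for hypothetical test labels and predictions; the numbers are invented for illustration.

```python
# Hypothetical held-out test labels and the model's predictions for them.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives
fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found

print(tp, tn, fp, fn)               # 3 3 1 1
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```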

9. Deployment

➢ Can you put the model into practice?

Once a satisfactory model has been developed and approved by the business sponsors, it is
deployed into the production environment or a comparable test environment. Such deployment is
often initially limited to allow for performance evaluation. Deploying a model into an
operational business process generally involves multiple groups, skills, and technologies.

10. Feedback

➢ Can you get constructive feedback on answering the question?

By collecting the results of the deployed model, the organization receives feedback on the
performance of the model and its impact on the environment in which it was deployed. By
analyzing this information, the data scientist can refine the model, increasing its accuracy
and, therefore, its usefulness.
This phase is often neglected, yet it can bring significant additional benefits when carried
out as part of the overall process. The flow of this methodology illustrates the iterative
nature of the problem-solving process.
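
A minimal sketch of the feedback loop is to compare the deployed model's predictions with the outcomes later observed and to flag the model for refinement when accuracy falls below an agreed level. The feedback log and the `TARGET_ACCURACY` value below are assumptions made for illustration.

```python
# Hypothetical feedback log: (prediction, observed outcome) pairs from production.
feedback_log = [(1, 1), (0, 0), (1, 0), (0, 0), (1, 1), (0, 1), (1, 0), (0, 0)]

# Assumed acceptance level agreed with the business sponsors.
TARGET_ACCURACY = 0.8

live_accuracy = sum(p == o for p, o in feedback_log) / len(feedback_log)
needs_refinement = live_accuracy < TARGET_ACCURACY
print(live_accuracy, needs_refinement)  # 0.625 True
```

A flagged model sends the project back to earlier stages, such as data collection or data preparation.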
I hope you now have a basic understanding of the process cycle and of how to think at every
stage, so that you can follow a successful methodology in your own data science project.


I. True/False:
Direction: Read the statements carefully. Write True if the statement is correct, otherwise,
False.
1. Data requirements involves identifying the content, formats, and data sources needed for
the initial data collection.
2. Descriptive statistics and visualization techniques can help a data scientist understand the
content of the data, assess its quality, and obtain initial information about the data.
3. Once a business problem has been clearly identified, the Data Scientist can define the data
collection.
4. Statistical analysis applies to problems that require counts.
5. Descriptive statistics and visualization techniques can help a data scientist understand the
content of the data, assess its quality, and obtain initial information about the data.

II. Multiple choice:


Direction: Analyze the questions carefully. Choose the letter of the correct answer.

6. Business understanding is an important stage in the data science methodology. Why?


A. Because it clearly defines the problem and the needs from a business perspective.
B. Because it ensures that the work generates all possible solutions.
C. Because it is determined by the analytical approach you want to use.
D. Because it generates the data that will be used in the study.

7. Select the correct statement about the Data understanding stage.


A. The data understanding stage encompasses sorting the data.
B. The data understanding stage encompasses all activities related to constructing the dataset.
C. The data understanding stage encompasses removing redundant data.
D. The data understanding stage evaluates the quality of the model and verifies that the
business problem is handled in a complete and adequate manner.

8. Select the correct statement.


A. A methodology is a statement of methods used in a particular area of study or activity.
B. A methodology is a set of instructions.
C. A methodology is an application for a computer program.
D. A methodology is the result of the implemented model.

9. At this stage, the data requirements are reviewed and a decision is made as to whether more
or less data is required for the collection.
A. Data collection
B. Data preparation
C. Model evaluation
D. Analytic approach


10. At this stage, all the activities used to create the data set used during the modeling phase
include cleansing data, combining data from multiple sources, and transforming data into
more useful variables.
A. Data preparation
B. Data collection
C. Model evaluation
D. Data requirements

IBM Data Science Methodology. https://www.coursera.org/learn/data-science-methodology
Gajare, Shreyal. 2019. Data Science Methodology and Approach.
https://www.geeksforgeeks.org/data-science-methodology-and-approach/
Logallo, Nunzio. 2019. Data Science Methodology 101: How can a Data Scientist organize his work?
https://towardsdatascience.com/data-science-methodology-101-ce9f0d660336
Patel, Ashish. 2019. Data Science Methodology: How to design your data science project.
https://medium.com/ml-research-lab/data-science-methodology-101-2fa9b7cf2ffe

Module Author/Curator : Mrs. Jenny P. Macalalad


Template & Layout Designer : Mrs. Jenny P. Macalalad

True/False Multiple Choice


1. T 6. A
2. T 7. B
3. F 8. A
4. T 9. A
5. T 10. A
