Best Methodologies
Gustavo Mendizabal
While looking for websites with information about the best methodologies for
Data Science, I came across the Data Science Process Alliance, which describes itself as
“A Community of Data & AI Practitioners”. It lists methodologies such as
Waterfall, KDD, SEMMA, CRISP-DM, TDSP, and DOMINO, and explains each one in
a visual way that makes it easy to understand. After thoroughly reading each one of
them, I chose the two that I consider the best methodologies to use. Keep in mind that they all
have their strengths and challenges, as well as situations they are best suited for.
CRISP-DM
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a methodology created
in 1996 by five companies: Integral Solutions Ltd, Teradata, Daimler AG, NCR Corporation, and
OHRA. It consists of the six phases described below.
Business Understanding
What does the business need? This phase focuses on understanding the objectives and
requirements from a business perspective, and then turning this knowledge into a data mining
problem.
Data Understanding
What data do we have and/or need? Is it clean? This phase focuses on understanding
the data and becoming familiar with it in order to form hypotheses from it.
Data Preparation
How do we organize the data for modeling? Once the previous phase is over, it’s time
to begin constructing the final dataset. This preparation task will most likely be performed
multiple times.
Modeling
What modeling techniques should we apply? In this phase, various modeling techniques
can be applied to find the best model based on the prepared data from the
previous phases. This is usually considered the best part of the methodology, and it is often the shortest one.
Evaluation
Which model best meets the business objectives? At this stage, it’s important to
evaluate and review whether the model is the one best suited to the objectives from a business
perspective. The decision of whether it’s acceptable or not must be reached at this point.
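As an illustration of this evaluation step, here is a minimal sketch in Python, not a definitive implementation. The candidate model names, their validation accuracies, and the 0.80 acceptance threshold are all hypothetical values chosen for the example; a real project would take its success criteria from the Business Understanding phase.

```python
# Hypothetical validation accuracies for three candidate models
# (illustrative values only, not results from a real project).
candidate_scores = {
    "decision_tree": 0.78,
    "logistic_regression": 0.83,
    "random_forest": 0.86,
}

# Assumed business requirement: a model must reach at least this accuracy.
BUSINESS_THRESHOLD = 0.80


def evaluate_candidates(scores, threshold):
    """Return (best_model_name, accepted) for the given scores and threshold."""
    best_name = max(scores, key=scores.get)  # model with the highest score
    accepted = scores[best_name] >= threshold  # does it meet the business need?
    return best_name, accepted


best_model, is_acceptable = evaluate_candidates(candidate_scores, BUSINESS_THRESHOLD)
print(best_model, is_acceptable)
```

The point of the sketch is that the evaluation ends with an explicit accept-or-reject decision, not only with a ranking of models.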
Deployment
How do stakeholders access the results? At this point, the model can be presented to and
used by the customer. In many cases, it is the customer who gives the order to deploy the model.
This methodology is used in many data science projects. However, because it was
created in 1996, it is becoming outdated as data become more sophisticated, which is why
newer methods like TDSP or DOMINO, which are, in a sense, a “modern” CRISP-DM, are being
implemented.
SEMMA
SEMMA stands for Sample, Explore, Modify, Model, and Assess. It is a methodology that
data scientists use for tasks such as fraud detection, customer loyalty analysis, bankruptcy
forecasting, and more. It has five stages, which break down as follows:
Sample
To build a model, this step must provide an appropriate volume of data and identify
the variables that influence the process. Once identified, the information is sorted and
categorized.
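The sampling step can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_and_partition`, the sample size of 100, the 70/30 split, and the stand-in record IDs are all hypothetical choices, not something SEMMA prescribes.

```python
import random


def sample_and_partition(records, sample_size, train_fraction=0.7, seed=42):
    """Draw a random sample and split it into training and validation parts."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(records, sample_size)  # sampling without replacement
    cutoff = int(len(sample) * train_fraction)
    return sample[:cutoff], sample[cutoff:]


# Hypothetical data: 1,000 record IDs standing in for real rows.
records = list(range(1000))
train, validation = sample_and_partition(records, sample_size=100)
print(len(train), len(validation))
```

Working on a sample keeps the later steps fast, and holding back part of it leaves data that can be used to check the model afterwards.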
Explore
In this step, the sorted information is studied in order to check for relationships
among the variables. Every factor that may influence the data must be analyzed.
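One common way to check for a relationship between two numeric variables is the Pearson correlation coefficient. The sketch below computes it by hand; the variable names and values are hypothetical, and a real exploration would also use plots and other statistics.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical variables (illustrative values only).
monthly_visits = [1, 2, 3, 4, 5]
monthly_spend = [12, 19, 33, 41, 50]

r = pearson_correlation(monthly_visits, monthly_spend)
print(round(r, 2))
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests there is little or none.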
Modify
Once the exploration phase is completed, the data is then cleaned for modeling.
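As one small example of what cleaning can involve, the sketch below fills in missing values with the median of the observed ones. Median imputation is only an assumed choice for illustration, and the `ages` column is hypothetical; depending on the data, dropping rows or using another strategy may be more appropriate.

```python
from statistics import median


def clean_column(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [34, None, 29, 41, None, 38]
cleaned_ages = clean_column(ages)
print(cleaned_ages)
```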
Model
What modeling techniques should we apply? In this stage, various modeling techniques
can be applied to find the best model based on the prepared data from the
previous stages. This is usually considered the best part of the methodology, and it is often the shortest one.
Assess
Which model best meets the business objectives? At this stage, it’s important to
evaluate and review whether the model is the one best suited to the objectives from a business
perspective. The decision of whether it’s acceptable or not must be reached at this point.
The last two steps of SEMMA are the same as the Modeling and Evaluation phases of
CRISP-DM. The difference between the two methodologies is the sampling step, which is directly
related to the KDD process. This is the second most popular method used for data science.