CRISP-DM
The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that
serves as the base for a data science process. It has six sequential phases:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
Published in 1999 to standardize data mining processes across industries, it has since become
the most common methodology for data mining, analytics, and data science projects.
Data science teams that combine a loose implementation of CRISP-DM with overarching
team-based agile project management approaches will likely see the best results.
6 CRISP-DM Phases
I. Business Understanding
Any good project starts with a deep understanding of the customer’s needs. Data mining
projects are no exception and CRISP-DM recognizes this.
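The Business Understanding phase focuses on understanding the objectives and requirements
of the project. It has four tasks. Aside from the third task, the three other tasks in this
phase are foundational project management activities that are universal to most projects:
1. Determine business objectives: Understand, from a business perspective, what the
customer really wants to accomplish, and define business success criteria.
2. Assess situation: Determine resource availability and project requirements, assess
risks and contingencies, and conduct a cost-benefit analysis.
3. Determine data mining goals: Define what success looks like from a technical data
mining perspective.
4. Produce project plan: Select technologies and tools and define detailed plans for
each project phase.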
While many teams hurry through this phase, establishing a strong business understanding is
like building the foundation of a house – absolutely essential.
II. Data Understanding
The Data Understanding phase focuses on identifying, collecting, and analyzing the data sets
needed to accomplish the project goals. This phase has four tasks:
1. Collect initial data: Acquire the necessary data and (if necessary) load it into your
analysis tool.
2. Describe data: Examine the data and document its surface properties like data
format, number of records, or field identities.
3. Explore data: Dig deeper into the data. Query it, visualize it, and identify
relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document any quality issues.
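The four tasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text, the field names (id, age, city), and the helper functions describe and quality_issues are hypothetical and are not part of CRISP-DM itself.

```python
import csv
import io
from collections import Counter

# Hypothetical raw data standing in for a real source (an assumption).
RAW_CSV = """id,age,city
1,34,Boston
2,,Denver
3,29,Boston
4,abc,Austin
"""

def describe(rows):
    """Describe data: record count, field names, and missing-value counts."""
    fields = list(rows[0].keys())
    missing = {f: sum(1 for r in rows if r[f] == "") for f in fields}
    return {"records": len(rows), "fields": fields, "missing": missing}

def quality_issues(rows, numeric_field):
    """Verify data quality: ids whose numeric field is not a whole number."""
    return [r["id"] for r in rows if not r[numeric_field].isdigit()]

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))  # 1. Collect initial data
summary = describe(rows)                           # 2. Describe data
city_counts = Counter(r["city"] for r in rows)     # 3. Explore data
bad_age_ids = quality_issues(rows, "age")          # 4. Verify data quality
```

Here bad_age_ids flags both the blank age and the non-numeric one, which is the kind of finding the phase asks you to document before moving on.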
III. Data Preparation
This phase, which is often referred to as “data munging”, prepares the final data set(s) for
modeling. It has five tasks:
1. Select data: Determine which data sets will be used and document reasons for
inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to
garbage-in, garbage-out. A common practice during this task is to correct, impute, or
remove erroneous values.
3. Construct data: Derive new attributes that will be helpful. For example, derive
someone’s body mass index from height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple sources.
5. Format data: Re-format data as necessary. For example, you might convert string
values that store numbers to numeric values so that you can perform mathematical
operations.
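As a rough illustration of the clean, construct, integrate, and format tasks, the sketch below works on a tiny in-memory data set. The records, the field names (height_m, weight_kg), the cities lookup, and the choice of median imputation are all assumptions made for the example; a real project would pick its own cleaning rules.

```python
from statistics import median

# Hypothetical records from two sources; numeric values arrive as strings.
people = [
    {"id": 1, "height_m": "1.80", "weight_kg": "81"},
    {"id": 2, "height_m": "1.60", "weight_kg": ""},   # missing weight
    {"id": 3, "height_m": "1.75", "weight_kg": "70"},
]
cities = {1: "Boston", 2: "Denver", 3: "Austin"}

def to_float(text):
    """Format data: convert a numeric string to float, or None if blank."""
    return float(text) if text.strip() else None

def prepare(people, cities):
    # Format: turn string fields into numbers.
    rows = [
        {"id": p["id"],
         "height_m": to_float(p["height_m"]),
         "weight_kg": to_float(p["weight_kg"])}
        for p in people
    ]
    # Clean: impute missing weights with the median of the known weights.
    fill = median(r["weight_kg"] for r in rows if r["weight_kg"] is not None)
    for r in rows:
        if r["weight_kg"] is None:
            r["weight_kg"] = fill
        # Construct: derive body mass index from height and weight.
        r["bmi"] = round(r["weight_kg"] / r["height_m"] ** 2, 1)
        # Integrate: combine with the second (hypothetical) source.
        r["city"] = cities.get(r["id"])
    return rows

prepared = prepare(people, cities)
```

Imputing with the median is only one option; removing the record or correcting it from another source may suit a given project better.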
IV. Modeling
What is widely regarded as data science’s most exciting work is also often the shortest phase
of the project.
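Here you’ll likely build and assess various models based on several different modeling
techniques. This phase has four tasks:
1. Select modeling techniques: Determine which algorithms to try (e.g. regression,
neural net).
2. Generate test design: Depending on your modeling approach, you might need to split
the data into training, test, and validation sets.
3. Build model: Fit the selected techniques to the prepared data.
4. Assess model: Interpret the results of the competing models based on domain
knowledge, the pre-defined success criteria, and the test design.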
Although the CRISP-DM Guide suggests that you “iterate model building and assessment until you
strongly believe that you have found the best model(s)”, in practice teams should continue
iterating until they find a “good enough” model, proceed through the CRISP-DM lifecycle,
then further improve the model in future iterations.
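The “good enough” idea can be sketched as a loop that tries candidate techniques in order and stops at the first one that meets a threshold on held-out data. Everything here is assumed for illustration: the toy data, the two candidate models, the mean absolute error metric, and the threshold value.

```python
from statistics import mean

# Hypothetical data in which y is roughly 2*x + 1.
train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1)]
test = [(6, 13.0), (7, 15.1)]

def fit_mean(data):
    """Baseline technique: always predict the average target value."""
    avg = mean(y for _, y in data)
    return lambda x: avg

def fit_line(data):
    """Simple technique: least-squares straight line."""
    x_bar = mean(x for x, _ in data)
    y_bar = mean(y for _, y in data)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
             / sum((x - x_bar) ** 2 for x, _ in data))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def mae(model, data):
    """Assess model: mean absolute error on the given data."""
    return mean(abs(model(x) - y) for x, y in data)

def first_good_enough(candidates, train, test, threshold):
    """Build and assess candidates in order; stop at the first acceptable one."""
    for name, fit in candidates:
        error = mae(fit(train), test)
        if error <= threshold:
            return name, error
    return None, None

GOOD_ENOUGH_MAE = 0.5  # assumed threshold, set by the project team
chosen, chosen_error = first_good_enough(
    [("mean", fit_mean), ("line", fit_line)], train, test, GOOD_ENOUGH_MAE)
```

Stopping at the first acceptable model, rather than searching for the best one, leaves room to improve it in a later pass through the lifecycle.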
V. Evaluation
Whereas the Assess Model task of the Modeling phase focuses on technical model
assessment, the Evaluation phase looks more broadly at which model best meets the business
needs and what to do next. This phase has three tasks:
1. Evaluate results: Do the models meet the business success criteria? Which one(s)
should we approve for the business?
2. Review process: Review the work accomplished. Was anything overlooked? Were all
steps properly executed? Summarize findings and correct anything if needed.
3. Determine next steps: Based on the previous two tasks, determine whether to
proceed to deployment, iterate further, or initiate new projects.
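One way to make the first and third tasks concrete is to write the business success criteria down as explicit thresholds and check each candidate model against them. The metric names, numbers, and criteria below are hypothetical; they only illustrate that this check is about business fit rather than technical accuracy alone.

```python
# Hypothetical technical results for two candidate models.
candidates = {
    "model_a": {"recall": 0.82, "cost_per_month": 1200},
    "model_b": {"recall": 0.91, "cost_per_month": 4000},
}

# Hypothetical business success criteria agreed during Business Understanding.
criteria = {"min_recall": 0.80, "max_cost_per_month": 2000}

def approved_models(candidates, criteria):
    """Evaluate results: keep models that meet every business criterion."""
    return sorted(
        name for name, m in candidates.items()
        if m["recall"] >= criteria["min_recall"]
        and m["cost_per_month"] <= criteria["max_cost_per_month"]
    )

def next_step(approved):
    """Determine next steps: deploy if something was approved, else iterate."""
    return "proceed to deployment" if approved else "iterate further"

approved = approved_models(candidates, criteria)
decision = next_step(approved)
```

In this sketch the model with the higher recall is rejected because it exceeds the assumed cost limit, which is the kind of trade-off the Evaluation phase is meant to surface.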
VI. Deployment
“Depending on the requirements, the deployment phase can be as simple as generating a
report or as complex as implementing a repeatable data mining process across the
enterprise.”
–CRISP-DM Guide
A model is not particularly useful unless the customer can access its results. The
complexity of this phase varies widely. This final phase has four tasks:
1. Plan deployment: Develop and document a plan for deploying the model.
2. Plan monitoring and maintenance: Develop a thorough monitoring and maintenance
plan to avoid issues during the operational phase (or post-project phase) of a model.
3. Produce final report: The project team documents a summary of the project which
might include a final presentation of data mining results.
4. Review project: Conduct a project retrospective about what went well, what could
have been better, and how to improve in the future.
Your organization’s work might not end there. As a project framework, CRISP-DM does not
outline what to do after the project (also known as “operations”). But if the model is going to
production, be sure you maintain the model in production. Constant monitoring and
occasional model tuning are often required.
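A minimal monitoring check might compare recent model inputs with the values seen at training time and flag the model for review when they drift apart. The drift measure (shift in mean, expressed in baseline standard deviations), the threshold of 2.0, and the sample values are assumptions chosen for this sketch, not something CRISP-DM prescribes.

```python
from statistics import mean, stdev

def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline std devs."""
    return abs(mean(recent) - mean(baseline)) / stdev(baseline)

def needs_review(baseline, recent, threshold=2.0):
    """Flag the model for tuning when the inputs have drifted too far."""
    return drift_score(baseline, recent) > threshold

# Hypothetical feature values seen during training and in production.
baseline = [10, 12, 11, 13, 9, 11, 12, 10]
stable_week = [11, 10, 12, 11]
shifted_week = [16, 17, 15, 18]

stable_flag = needs_review(baseline, stable_week)
shifted_flag = needs_review(baseline, shifted_week)
```

A check like this would typically run on a schedule as part of the monitoring and maintenance plan.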
Waterfall: If you follow CRISP-DM precisely (defining detailed plans for each phase at the
project start and including every report) and choose not to iterate frequently, then you’re
operating more of a waterfall process.
Agile: On the other hand, CRISP-DM indirectly advocates agile principles and practices by
stating: “The sequence of the phases is not rigid. Moving back and forth between different
phases is always required. The outcome of each phase determines which phase, or particular
task of a phase, has to be performed next.”
Thus if you follow CRISP-DM in a more flexible way, iterate quickly, and layer in other
agile processes, you’ll wind up with an agile approach.
KDnuggets Polls
KDnuggets is a common source for data on data mining methodology usage. Each of the polls
in 2002, 2004, and 2007 posed the question: “What main methodology are you using for data
mining?”, and the 2014 poll expanded the question to include “…for analytics, data mining,
or data science projects.” 150-200 respondents answered each poll.
CRISP-DM was the most popular methodology in each poll spanning the 12 years.
Bear in mind that the website caters toward data mining, and the data science field has
changed a lot since 2014.
In our own poll, CRISP-DM was the clear winner, garnering nearly half of the 109 votes.
Note that the response options for our poll were different from the KDnuggets polls, and our
site attracts a different audience.
Google Searches
For yet a third view into CRISP-DM, we turned to the Google Keyword Planner tool, which
provided the average monthly search volumes in the USA for select key search terms and
related terms (e.g. “crispdm” or “crisp dm data science”). Clearly irrelevant searches like
“tdsp electrical charges” or “semma both aagatha” were then removed.
Given the ambiguity of a searcher’s intent, some searches like “my own” could not be
analyzed and others like “tdsp” and “semma” could be misleading.
CRISP-DM yet again reigned as king, and this time with a much broader margin.
Like most answers in data science, it’s kind of complicated. But here’s a quick overview.
Benefits
From today’s data science perspective this seems like common sense. This is exactly the
point. The common process is so logical that it has become embedded into all our education,
training, and practice.
Generalizable: Although CRISP-DM was designed for data mining, William Vorhies, one
of its creators, argues that because all data science projects start with
business understanding, have data that must be gathered and cleaned, and apply data
science algorithms, “CRISP-DM provides strong guidance for even the most
advanced of today’s data science activities” (Vorhies, 2016).
Common Sense: When students were asked to do a data science project without
project management direction, they “tended toward a CRISP-like methodology and
identified the phases and did several iterations.” Moreover, teams which were trained
and explicitly told to implement CRISP-DM performed better than teams using other
approaches (Saltz, Shamshurin, & Crowston, 2017).
Adoptable: Like Kanban, CRISP-DM can be implemented without much training,
organizational role changes, or controversy.
Right Start: The initial focus on Business Understanding is helpful to align technical
work with business needs and to steer data scientists away from jumping into a
problem without properly understanding business objectives.
Strong Finish: Its final step, Deployment, likewise addresses important considerations
to close out the project and transition to maintenance and operations.
Flexible: A loose CRISP-DM implementation can be flexible enough to provide many of the
benefits of agile principles and practices. By accepting that a project starts with
significant unknowns, the user can cycle through steps, each time gaining a deeper
understanding of the data and the problem. The empirical knowledge learned from
previous cycles can then feed into the following cycles.
Weaknesses
Rigid: On the other hand, some argue that CRISP-DM suffers from the same
weaknesses as Waterfall and encumbers rapid iteration.
Documentation Heavy: Nearly every task has a documentation step. While
documenting one’s work is key in a mature process, CRISP-DM’s documentation
requirements might unnecessarily slow the team from actually delivering increments.
Not Modern: Counter to Vorhies’ argument for the sustaining relevance of CRISP-DM,
others argue that CRISP-DM, as a process that pre-dates big data, “might not be
suitable for Big Data projects due its four V’s” (Saltz & Shamshurin, 2016).
Not a Project Management Approach: Perhaps most significantly, CRISP-DM is
not a true project management methodology because it implicitly assumes that its user
is a single person or small, tight-knit team and ignores the teamwork coordination
necessary for larger projects (Saltz, Shamshurin, & Connors, 2017).
With this amount of data, simple statistics with manual intervention no longer work.
The data mining process fulfills this need, which has led to a shift from simple
statistics to complex data mining algorithms.
The data mining process extracts relevant information from raw data such as transactions,
photos, videos, and flat files, and automatically processes the information to generate
reports that businesses can act on.
Thus, the data mining process is crucial for businesses to make better decisions by
discovering patterns and trends in data, summarizing the data, and extracting relevant
information.
The volume of data grows day by day, so when a new data source is found, it can change the
results.
SEMMA (Sample, Explore, Modify, Model, Assess) makes it easy to apply exploratory
statistical and visualization techniques, select and transform the significant predictor
variables, create a model using those variables to produce a result, and check its
accuracy. SEMMA is also driven by a highly iterative cycle.
Steps in SEMMA
1. Sample: In this step, a sample that represents the full data is extracted from a large
dataset. Sampling reduces the computational cost and processing time.
2. Explore: The data is explored for outliers and anomalies to gain a better
understanding of it. The data is also checked visually to find trends and
groupings.
3. Modify: In this step, the data is manipulated, for example by grouping and
subgrouping, with the model to be built kept in focus.
4. Model: Based on the explorations and modifications, the models that explain the
patterns in data are constructed.
5. Assess: The usefulness and reliability of the constructed model are assessed in this
step. Testing of the model against real data is done here.
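The five steps above can be walked through on synthetic data as follows. This is a sketch under stated assumptions: the generated data set, the sample size of 100, the rescaling used in the Modify step, and the straight-line model are all choices made for the example and are not prescribed by SEMMA.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical full data set: y is roughly 2*x plus noise.
population = [(x, 2.0 * x + rng.gauss(0, 1)) for x in range(1, 1001)]

# 1. Sample: draw a smaller random subset to cut processing cost.
sample = rng.sample(population, 100)

# 2. Explore: basic summary of the sampled target values.
ys = [y for _, y in sample]
summary = {"min": min(ys), "max": max(ys), "mean": mean(ys)}

# 3. Modify: rescale x to the 0-1 range with the model in mind,
#    then hold back part of the sample for assessment.
scaled = [(x / 1000, y) for x, y in sample]
train, holdout = scaled[:80], scaled[80:]

# 4. Model: fit a least-squares line on the training part.
def fit_line(data):
    x_bar = mean(x for x, _ in data)
    y_bar = mean(y for _, y in data)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
             / sum((x - x_bar) ** 2 for x, _ in data))
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(train)

# 5. Assess: mean absolute error on data the model has not seen.
mae = mean(abs(slope * x + intercept - y) for x, y in holdout)
```

Because the cycle is iterative, a poor result in the Assess step would send you back to Explore or Modify rather than forward.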
Both the SEMMA and CRISP-DM approaches work for the Knowledge Discovery Process. Once
models are built, they are deployed for businesses and research work.
This step involves the removal of noisy or incomplete data from the collection.
Many methods that clean data automatically are available, but they are not robust.
Data integration can be performed using data migration tools such as Oracle Data Service
Integrator and Microsoft SQL.
Relational database management systems such as Oracle support data mining using CRISP-DM.
The facilities of the Oracle database are useful in data preparation and understanding.
Oracle supports data mining through a Java interface, a PL/SQL interface, automated data
mining, SQL functions, and graphical user interfaces.
Data mining in multidimensional space is carried out in OLAP (Online Analytical
Processing) style, which allows exploration of multiple combinations of dimensions at
varying levels of granularity.
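To show what exploring combinations of dimensions at varying granularity can look like, the sketch below aggregates a small fact table over every subset of its dimensions, from the grand total down to the finest level. The sales records, the dimension names, and the roll_up and cube helpers are hypothetical; a real OLAP engine does this far more efficiently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales facts with three dimensions and one measure.
facts = [
    {"region": "East", "product": "A", "year": 2023, "sales": 100},
    {"region": "East", "product": "B", "year": 2023, "sales": 150},
    {"region": "West", "product": "A", "year": 2023, "sales": 200},
    {"region": "West", "product": "A", "year": 2024, "sales": 50},
]
DIMENSIONS = ("region", "product", "year")

def roll_up(facts, dims, measure="sales"):
    """Total the measure for each distinct combination of the given dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[d] for d in dims)] += fact[measure]
    return dict(totals)

def cube(facts, dimensions):
    """Aggregate over every subset of dimensions, coarsest to finest."""
    return {
        dims: roll_up(facts, dims)
        for n in range(len(dimensions) + 1)
        for dims in combinations(dimensions, n)
    }

sales_cube = cube(facts, DIMENSIONS)
```

For example, sales_cube[("region",)] holds totals per region, while sales_cube[("region", "product")] drills down one level further.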
Data mining processes can be performed on any kind of data, including database data and
data held in advanced databases such as time series. The data mining process comes with its
own challenges as well.