
Machine Learning Models: CRISP-DM and SEMMA to highlight the various steps of solving the problem

CRISP-DM

The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that
serves as the base for a data science process. It has six sequential phases:

1. Business understanding – What does the business need?
2. Data understanding – What data do we have / need? Is it clean?
3. Data preparation – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
5. Evaluation – Which model best meets the business objectives?
6. Deployment – How do stakeholders access the results?

Published in 1999 to standardize data mining processes across industries, it has since become
the most common methodology for data mining, analytics, and data science projects.

Data science teams that combine a loose implementation of CRISP-DM with overarching
team-based agile project management approaches will likely see the best results.

6 CRISP-DM Phases

I. Business Understanding
Any good project starts with a deep understanding of the customer’s needs. Data mining
projects are no exception and CRISP-DM recognizes this.

The Business Understanding phase focuses on understanding the objectives and requirements
of the project. Aside from the third task, the three other tasks in this phase are foundational
project management activities that are universal to most projects:

1. Determine business objectives: You should first “thoroughly understand, from a
business perspective, what the customer really wants to accomplish” (CRISP-DM
Guide) and then define business success criteria.
2. Assess situation: Determine resource availability and project requirements, assess
risks and contingencies, and conduct a cost-benefit analysis.
3. Determine data mining goals: In addition to defining the business objectives, you
should also define what success looks like from a technical data mining perspective.
4. Produce project plan: Select technologies and tools and define detailed plans for
each project phase.

While many teams hurry through this phase, establishing a strong business understanding is
like building the foundation of a house – absolutely essential.

II. Data Understanding


Next is the Data Understanding phase. Adding to the foundation of Business Understanding,
it drives the focus to identifying, collecting, and analyzing the data sets that can help you
accomplish the project goals. This phase also has four tasks (a short sketch follows the list):

1. Collect initial data: Acquire the necessary data and (if necessary) load it into your
analysis tool.
2. Describe data: Examine the data and document its surface properties like data
format, number of records, or field identities.
3. Explore data: Dig deeper into the data. Query it, visualize it, and identify
relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document any quality issues.
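
As a minimal sketch of the Describe data and Verify data quality tasks (assuming,
hypothetically, that the data sits in a customers.csv file and that pandas is the analysis tool):

import pandas as pd

# Collect initial data: load the data set into the analysis tool
df = pd.read_csv("customers.csv")

# Describe data: surface properties such as record count, field identities, format
print(df.shape)       # number of records and fields
print(df.dtypes)      # data format of each field
print(df.describe())  # summary statistics for numeric fields

# Verify data quality: document missing values and duplicates
print(df.isna().sum())        # missing values per field
print(df.duplicated().sum())  # number of duplicate records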

III. Data Preparation


A common rule of thumb is that 80% of the project is data preparation.

This phase, which is often referred to as “data munging”, prepares the final data set(s) for
modeling. It has five tasks (a short sketch follows the list):

1. Select data: Determine which data sets will be used and document reasons for
inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to
garbage-in, garbage-out. A common practice during this task is to correct, impute, or
remove erroneous values.
3. Construct data: Derive new attributes that will be helpful. For example, derive
someone’s body mass index from height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple sources.
5. Format data: Re-format data as necessary. For example, you might convert string
values that store numbers to numeric values so that you can perform mathematical
operations.
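
As a minimal sketch of the Clean, Construct, and Format tasks (the column names and
values here are hypothetical), the BMI derivation and string-to-numeric conversion from the
list above might look like this in pandas:

import pandas as pd

df = pd.DataFrame({
    "height_m": [1.75, 1.62, 1.80],
    "weight_kg": [70, 55, 95],
    "revenue": ["1200", "850", "2300"],  # numbers stored as strings
})

# Clean data: drop records with missing height or weight
df = df.dropna(subset=["height_m", "weight_kg"])

# Construct data: derive body mass index from the height and weight fields
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Format data: convert string values that store numbers to numeric values
df["revenue"] = pd.to_numeric(df["revenue"])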

IV. Modeling
What is widely regarded as data science’s most exciting work is also often the shortest phase
of the project.

Here you’ll likely build and assess various models based on several different modeling
techniques. This phase has four tasks (a short sketch follows the list):

1. Select modeling techniques: Determine which algorithms to try (e.g. regression,
neural net).
2. Generate test design: Depending on your modeling approach, you might need to split
the data into training, test, and validation sets.
3. Build model: As glamorous as this might sound, this might just be executing a few
lines of code like “reg = LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing against each other, and the
data scientist needs to interpret the model results based on domain knowledge, the
pre-defined success criteria, and the test design.
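
Expanding that quoted line into a minimal sketch of the Generate test design, Build model,
and Assess model tasks (the data here is synthetic, standing in for a real data set):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # stand-in feature matrix
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Generate test design: split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build model: essentially the few lines of code quoted above
reg = LinearRegression().fit(X_train, y_train)

# Assess model: score on held-out data against the pre-defined success criteria
print(r2_score(y_test, reg.predict(X_test)))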

Although the CRISP-DM Guide suggests that you “iterate model building and assessment until you
strongly believe that you have found the best model(s)”, in practice teams should continue
iterating until they find a “good enough” model, proceed through the CRISP-DM lifecycle,
then further improve the model in future iterations.

V. Evaluation
Whereas the Assess Model task of the Modeling phase focuses on technical model
assessment, the Evaluation phase looks more broadly at which model best meets the business
objectives and what to do next. This phase has three tasks (a short sketch follows the list):

1. Evaluate results: Do the models meet the business success criteria? Which one(s)
should we approve for the business?
2. Review process: Review the work accomplished. Was anything overlooked? Were all
steps properly executed? Summarize findings and correct anything if needed.
3. Determine next steps: Based on the results of the previous tasks, determine whether to
proceed to deployment, iterate further, or initiate new projects.
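
As a minimal sketch of the Evaluate results task, assuming a hypothetical business success
criterion of at least 80% accuracy and two candidate models on synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

BUSINESS_SUCCESS_CRITERION = 0.80  # hypothetical minimum acceptable accuracy

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5),
}
for name, model in candidates.items():
    accuracy = model.fit(X_train, y_train).score(X_test, y_test)
    verdict = "approve" if accuracy >= BUSINESS_SUCCESS_CRITERION else "iterate further"
    print(f"{name}: accuracy={accuracy:.2f} -> {verdict}")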

VI. Deployment
“Depending on the requirements, the deployment phase can be as simple as generating a
report or as complex as implementing a repeatable data mining process across the
enterprise.”

–CRISP-DM Guide

A model is not particularly useful unless the customer can access its results. The
complexity of this phase varies widely. This final phase has four tasks:

1. Plan deployment: Develop and document a plan for deploying the model.
2. Plan monitoring and maintenance: Develop a thorough monitoring and maintenance
plan to avoid issues during the operational phase (or post-project phase) of a model.
3. Produce final report: The project team documents a summary of the project which
might include a final presentation of data mining results.
4. Review project: Conduct a project retrospective about what went well, what could
have been better, and how to improve in the future.

Your organization’s work might not end there. As a project framework, CRISP-DM does not
outline what to do after the project (also known as “operations”). But if the model is going to
production, be sure to maintain the model in production. Constant monitoring and occasional
model tuning are often required.

Is CRISP-DM Agile or Waterfall?


Some argue that CRISP-DM is flexible and agile, while others see it as rigid. What really
matters is how you implement it.
Waterfall: On one hand, many view CRISP-DM as a rigid waterfall process – in part
because its reporting requirements are excessive for most projects. Moreover, the guide
states in the business understanding phase that “the project plan contains detailed plans for
each phase” – a hallmark of traditional waterfall approaches that require detailed,
upfront planning.

Indeed, if you follow CRISP-DM precisely (defining detailed plans for each phase at the
project start and producing every report) and choose not to iterate frequently, then you’re
operating more of a waterfall process.

Agile: On the other hand, CRISP-DM indirectly advocates agile principles and practices by
stating: “The sequence of the phases is not rigid. Moving back and forth between different
phases is always required. The outcome of each phase determines which phase, or particular
task of a phase, has to be performed next.”

Thus if you follow CRISP-DM in a more flexible way, iterate quickly, and layer in other
agile processes, you’ll wind up with an agile approach.

Example: To illustrate how CRISP-DM could be implemented in either an agile or waterfall
manner, imagine a churn project with three deliverables: a voluntary churn model, a non-pay
disconnect churn model, and a propensity model for accepting a retention-focused offer.

CRISP-DM Waterfall: Horizontal Slicing



In a waterfall-style implementation, the team’s work would comprehensively and
horizontally span across each deliverable. The team might infrequently loop back to a lower
horizontal layer only if critically needed. One “big bang” deliverable is delivered at the end
of the project.

CRISP-DM Agile: Vertical Slicing


Alternatively, in an agile implementation of CRISP-DM, the team would narrowly focus on
quickly delivering one vertical slice up the value chain at a time. They would deliver multiple
smaller vertical releases and frequently solicit feedback along the way.
Which is better?
When possible, take an agile approach and slice vertically so that:

 Stakeholders get value sooner
 Stakeholders can provide meaningful feedback
 The data scientists can assess model performance earlier
 The project team can adjust the plan based on stakeholder feedback

How popular is CRISP-DM?


Definitive research does not exist on how frequently data science teams use different
management approaches. So to get an idea of approach popularity, we investigated
KDnuggets polls, conducted our own poll, and researched Google search volumes. Each of
these views suggests that CRISP-DM is the most commonly used approach for data
science projects.

KDnuggets Polls
KDnuggets is a common source for data mining methodology usage. Each of the polls
in 2002, 2004, and 2007 posed the question: “What main methodology are you using for data
mining?”, and the 2014 poll expanded the question to include “…for analytics, data mining,
or data science projects.” 150-200 respondents answered each poll. Bear in mind that the
website caters toward data mining, and the data science field has changed a lot since 2014.
CRISP-DM was the most popular methodology in each poll spanning the 12 years.

Our 2020 Poll


For a more current look into the popularity of various approaches, we conducted our own poll
on this site in August and September 2020.

Note the response options for our poll were different from the KDnuggets polls, and our site
attracts a different audience.

CRISP-DM was the clear winner, garnering nearly half of the 109 votes.

Google Searches
For yet a third view into CRISP-DM’s popularity, we turned to the Google Keyword Planner
tool, which provided the average monthly search volumes in the USA for select key search
terms and related terms (e.g. “crispdm” or “crisp dm data science”). Clearly irrelevant
searches like “tdsp electrical charges” or “semma both aagatha” were removed. Given the
ambiguity of a searcher’s intent, some searches like “my own” could not be analyzed, and
others like “tdsp” and “semma” could be misleading.
CRISP-DM yet again reigned as king, and this time by a much broader margin.

Should I use CRISP-DM for Data Science?


So CRISP is popular. But should you use it?

Like most answers in data science, it’s kind of complicated. But here’s a quick overview.

Benefits
From today’s data science perspective this seems like common sense. This is exactly the
point. The common process is so logical that it has become embedded into all our education,
training, and practice.

–William Vorhies, one of CRISP-DM’s authors (from Data Science Central)

 Generalize-able: Although designed for data mining, William Vorhies, one of the
creators of CRISP-DM, argues that because all data science projects start with
business understanding, have data that must be gathered and cleaned, and apply data
science algorithms, “CRISP-DM provides strong guidance for even the most
advanced of today’s data science activities” (Vorhies, 2016).
 Common Sense: When students were asked to do a data science project without
project management direction, they “tended toward a CRISP-like methodology and
identified the phases and did several iterations.” Moreover, teams which were trained
and explicitly told to implement CRISP-DM performed better than teams using other
approaches (Saltz, Shamshurin, & Crowston, 2017).
 Adopt-able: Like Kanban, CRISP-DM can be implemented without much training,
organizational role changes, or controversy.
 Right Start: The initial focus on Business Understanding is helpful to align technical
work with business needs and to steer data scientists away from jumping into a
problem without properly understanding business objectives.
 Strong Finish: Its final step Deployment likewise addresses important considerations
to close out the project and transition to maintenance and operations.
 Flexible: A loose CRISP-DM implementation can be flexible to provide many of the
benefits of agile principles and practices. By accepting that a project starts with
significant unknowns, the user can cycle through steps, each time gaining a deeper
understanding of the data and the problem. The empirical knowledge learned from
previous cycles can then feed into the following cycles.

Weaknesses & Challenges


In a controlled experiment, students who used CRISP-DM “were the last to start coding” and
“did not fully understand the coding challenges they were going to face”

–Saltz, Shamshurin, & Crowston, 2017

 Rigid: On the other hand, some argue that CRISP-DM suffers from the same
weaknesses as Waterfall and encumbers rapid iteration.
 Documentation Heavy: Nearly every task has a documentation step. While
documenting one’s work is key in a mature process, CRISP-DM’s documentation
requirements might unnecessarily slow the team from actually delivering increments.
 Not Modern: Counter to Vorhies’ argument for the sustaining relevance of CRISP-
DM, others argue that CRISP-DM, as a process that pre-dates big data, “might not be
suitable for Big Data projects due [to] its four V’s” (Saltz & Shamshurin, 2016).
 Not a Project Management Approach: Perhaps most significantly, CRISP-DM is
not a true project management methodology because it implicitly assumes that its user
is a single person or small, tight-knit team and ignores the teamwork coordination
necessary for larger projects (Saltz, Shamshurin, & Connors, 2017).

What Is Data Mining?


Data Mining is a process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the web, and other
information repositories or data that are streamed into the system dynamically.

Why Do Businesses Need Data Extraction?


With the advent of big data, data mining has become more prevalent. Big data is extremely
large sets of data that can be analyzed by computers to reveal patterns, associations,
and trends that can be understood by humans. Big data contains extensive information of
varied types and varied content.

With this amount of data, simple statistics with manual intervention no longer work. This
need is fulfilled by the data mining process, which shifts analysis from simple data statistics
to complex data mining algorithms.

The data mining process extracts relevant information from raw data such as transactions,
photos, videos, and flat files, and automatically processes the information to generate reports
useful for businesses to take action.

Thus, the data mining process is crucial for businesses to make better decisions by
discovering patterns and trends in data, summarizing the data, and extracting relevant
information.

Data Extraction As A Process


Any business problem is tackled by examining the raw data to build a model that describes
the information and produces the reports to be used by the business. Building a model from
many data sources and data formats is an iterative process, as the raw data is available in
many different sources and many forms.

Data is increasing day by day, so when a new data source is found, it can change the
results.

Data Mining Models
Many industries such as manufacturing, marketing, chemical, and aerospace are taking
advantage of data mining. Thus the demand for standard and reliable data mining processes
has increased drastically.

The important data mining models include:


#1) Cross-Industry Standard Process for Data Mining (CRISP-DM)
CRISP-DM is a reliable data mining model consisting of six phases. It is a cyclical process
that provides a structured approach to data mining. The phases are not strictly sequential;
projects often backtrack to previous steps and repeat actions as new insights emerge.

The six phases of CRISP-DM include:


#1) Business Understanding: In this step, the business goals are set and the important
factors that will help achieve them are discovered.
#2) Data Understanding: This step collects the whole data set and populates it in the tool
(if using any tool). The data is listed with its data source, location, how it was acquired, and
any issues encountered. Data is visualized and queried to check its completeness.
#3) Data Preparation: This step involves selecting the appropriate data, cleaning it,
constructing attributes from the data, and integrating data from multiple databases.
#4) Modeling: This step covers selecting a data mining technique such as a decision tree,
generating a test design to evaluate the selected model, building models from the dataset, and
assessing the built model with experts to discuss the results.
#5) Evaluation: This step determines the degree to which the resulting model meets the
business requirements. Evaluation can be done by testing the model on real applications. The
model is reviewed for any mistakes or steps that should be repeated.
#6) Deployment: In this step, a deployment plan is made, a strategy to monitor and maintain
the data mining model’s results is formed to check their usefulness, final reports are
produced, and the whole process is reviewed to catch mistakes and identify any steps that
should be repeated.

#2) SEMMA (Sample, Explore, Modify, Model, Assess)


SEMMA is another data mining methodology developed by SAS Institute. The acronym
SEMMA stands for sample, explore, modify, model, assess.

SEMMA makes it easy to apply exploratory statistical and visualization techniques, select
and transform the most significant predictive variables, create a model using those variables,
and check its accuracy. SEMMA is also driven by a highly iterative cycle.
Steps in SEMMA
1. Sample: In this step, a large dataset is extracted and a sample that represents the full
data is taken out. Sampling reduces computational costs and processing time (see the
sketch after this list).
2. Explore: The data is explored for outliers and anomalies to gain a better
understanding of it. The data is visually checked to find trends and groupings.
3. Modify: In this step, the data is manipulated, for example by grouping and
subgrouping, keeping in focus the model to be built.
4. Model: Based on the explorations and modifications, the models that explain the
patterns in data are constructed.
5. Assess: The usefulness and reliability of the constructed model are assessed in this
step. Testing of the model against real data is done here.
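
As a minimal sketch of the Sample step, assuming a hypothetical large pandas DataFrame,
a representative random sample can be drawn before the costlier downstream steps:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
full_data = pd.DataFrame({"x": rng.normal(size=1_000_000)})  # stand-in large data set

# Sample: draw a 1% random sample to cut computational cost and processing time
sample = full_data.sample(frac=0.01, random_state=0)
print(len(sample))  # 10,000 rows carry forward into Explore/Modify/Model/Assess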
Both the SEMMA and CRISP-DM approaches support the knowledge discovery process. Once
models are built, they are deployed for business and research work.

Steps In The Data Mining Process


The data mining process is divided into two parts, i.e. data pre-processing and data mining.
Data pre-processing involves data cleaning, data integration, data reduction, and data
transformation. The data mining part performs data mining, pattern evaluation, and
knowledge representation of the data.

Why do we preprocess the data?


There are many factors that determine the usefulness of data, such as accuracy,
completeness, consistency, and timeliness. Data is of high quality if it satisfies its intended
purpose. Thus pre-processing is crucial in the data mining process. The major steps involved
in data pre-processing are explained below.

#1) Data Cleaning


Data cleaning is the first step in data mining. It is important because dirty data, if used
directly in mining, can confuse procedures and produce inaccurate results.

Basically, this step involves the removal of noisy or incomplete data from the collection.
Many methods that clean data automatically are available, but they are not robust.

This step carries out the routine cleaning work by:


(i) Filling in the missing data:
Missing data can be filled in by methods such as:

 Ignoring the tuple.
 Filling in the missing value manually.
 Using a measure of central tendency, such as the mean or median.
 Filling in the most probable value.
(ii) Removing the noisy data: Random error or variance in a measured variable is called
noisy data.
Methods to remove noise include:
Binning: Binning methods are applied by sorting values into buckets or bins, then smoothing
each value by consulting its neighbouring values. In smoothing by bin means, each value in
a bin is replaced by the bin mean; in smoothing by bin medians, each value is replaced by the
bin median; in smoothing by bin boundaries, the minimum and maximum values in a bin are
taken as the bin boundaries and each value is replaced by the closest boundary value. (The
median fill and bin smoothing are sketched in code at the end of this section.)

 Identifying the outliers
 Resolving inconsistencies
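
As a minimal sketch of the median fill and bin smoothing described above (the values are
made up for illustration):

import numpy as np
import pandas as pd

values = pd.Series([4, 8, np.nan, 15, 21, 21, 24, 25, np.nan, 34])

# Fill the missing data with a measure of central tendency (here, the median)
values = values.fillna(values.median())

# Remove noisy data by binning: sort the values into bins,
# then smooth by replacing each value with its bin mean
values = values.sort_values(ignore_index=True)
bins = pd.qcut(values, q=2, labels=False)          # two quantile-based bins
smoothed = values.groupby(bins).transform("mean")  # smoothing by bin means
print(smoothed)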
#2) Data Integration
When multiple heterogeneous data sources, such as databases, data cubes, or files, are
combined for analysis, the process is called data integration. This can help improve the
accuracy and speed of the data mining process.

Different databases use different naming conventions for variables, which causes
redundancies in the combined data. Additional data cleaning can be performed to remove
these redundancies and inconsistencies without affecting the reliability of the data.

Data Integration can be performed using Data Migration Tools such as Oracle Data Service
Integrator and Microsoft SQL etc.
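
As a minimal sketch of data integration, assuming two hypothetical sources that name the
customer key differently, pandas can combine the sources and remove the redundancy the
merge introduces:

import pandas as pd

# Two heterogeneous sources with different naming conventions
orders = pd.DataFrame({"cust_id": [1, 2], "amount": [100, 250]})
profiles = pd.DataFrame({"customer_id": [1, 2], "segment": ["A", "B"]})

# Integrate: combine the sources on the (differently named) key
combined = orders.merge(profiles, left_on="cust_id", right_on="customer_id")

# Additional data cleaning: drop the redundant key column
combined = combined.drop(columns="customer_id")
print(combined)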

#3) Data Reduction


This technique is applied to obtain a smaller, relevant representation of the data for analysis.
The reduced representation is much smaller in volume while maintaining the integrity of the
original data. Data reduction is performed using methods such as Naive Bayes, decision
trees, neural networks, etc.

Some strategies of data reduction are listed below, with a dimensionality reduction sketch after the list:


 Dimensionality Reduction: Reducing the number of attributes in the dataset.
 Numerosity Reduction: Replacing the original data volume by smaller forms of data
representation.
 Data Compression: Compressed representation of the original data.
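
As a minimal sketch of the dimensionality reduction strategy, principal component analysis
(one common technique among several) shrinks a stand-in 10-attribute data set to 3
components while reporting how much variance is retained:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in data set with 10 attributes

# Dimensionality reduction: keep 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (200, 3)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained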
#4) Data Transformation
In this process, data is transformed into a form suitable for the data mining process. Data is
consolidated so that the mining process is more efficient and the patterns are easier to
understand. Data transformation involves data mapping and code generation.

Strategies for data transformation are listed below, with a normalization and discretization sketch after the list:


 Smoothing: Removing noise from data using clustering, regression techniques, etc.
 Aggregation: Summary operations are applied to data.
 Normalization: Scaling of data to fall within a smaller range.
 Discretization: Raw values of numeric data are replaced by intervals. For
example, raw age values can be replaced by age bands.
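
As a minimal sketch of normalization and discretization, using hypothetical age values for
the Age example above:

import pandas as pd

ages = pd.Series([22, 35, 41, 58, 67])

# Normalization: min-max scaling of the data to the range [0, 1]
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw age values with intervals (age bands)
bands = pd.cut(ages, bins=[0, 30, 45, 60, 100],
               labels=["<30", "30-44", "45-59", "60+"])
print(pd.DataFrame({"age": ages, "normalized": normalized, "band": bands}))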
#5) Data Mining
Data mining is the process of identifying interesting patterns and knowledge from a large
amount of data. In this step, intelligent methods are applied to extract data patterns. The data
is represented in the form of patterns, and models are structured using classification and
clustering techniques (a clustering sketch follows).
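
As a minimal sketch of structuring data with a clustering technique, k-means on synthetic
stand-in data groups the records into discovered patterns:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # stand-in data

# Structure the data into patterns using a clustering technique
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment per record
print(kmeans.cluster_centers_)  # discovered pattern centers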

#6) Pattern Evaluation


This step involves identifying the interesting patterns that represent knowledge, based on
interestingness measures. Data summarization and visualization methods are used to make
the data understandable to the user.

#7) Knowledge Representation


Knowledge representation is a step where data visualization and knowledge representation
tools are used to represent the mined data. Data is visualized in the form of reports, tables,
etc.

Data Mining Process In Oracle DBMS


An RDBMS represents data in the form of tables with rows and columns. Data can be
accessed by writing database queries.

Relational database management systems such as Oracle support data mining using CRISP-
DM. The facilities of the Oracle database are useful in data preparation and understanding.
Oracle supports data mining through a Java interface, a PL/SQL interface, automated data
mining, SQL functions, and graphical user interfaces.

Data Mining Process In Datawarehouse


A data warehouse is modeled around a multidimensional data structure called a data cube.
Each cell in a data cube stores the value of some aggregate measure.

Data mining in multidimensional space is carried out in OLAP (Online Analytical
Processing) style, which allows exploration of multiple combinations of dimensions at
varying levels of granularity.

What Are The Applications of Data Extraction?


Areas where data mining is widely used include:
#1) Financial Data Analysis: Data Mining is widely used in banking, investment, credit
services, mortgage, automobile loans, and insurance & stock investment services. The data
collected from these sources is complete, reliable, and of high quality. This facilitates
systematic data analysis and data mining.
#2) Retail and Telecommunication Industries: The retail sector collects huge amounts of
data on sales, customer shopping history, goods transportation, consumption, and service.
Retail data mining helps identify customer buying behaviors, shopping patterns, and trends,
improve the quality of customer service, and achieve better customer retention and
satisfaction.
#3) Science and Engineering: Data mining in computer science and engineering can help
monitor system status, improve system performance, isolate software bugs, detect software
plagiarism, and recognize system malfunctions.
#4) Intrusion Detection and Prevention: Intrusion is defined as any set of actions that
threaten the integrity, confidentiality, or availability of network resources. Data mining
methods can help intrusion detection and prevention systems enhance their performance.
#5) Recommender Systems: Recommender systems help consumers by making product
recommendations that are of interest to users.
Data Mining Challenges
Enlisted below are the various challenges involved in Data Mining.
1. Data mining needs large databases and data collections that are difficult to manage.
2. The data mining process requires domain experts, who are likewise difficult to find.
3. Integration from heterogeneous databases is a complex process.
4. The organizational level practices need to be modified to use the data mining results.
Restructuring the process requires effort and cost.
Conclusion
Data Mining is an iterative process where the mining process can be refined, and new data
can be integrated to get more efficient results. Data Mining meets the requirement of
effective, scalable and flexible data analysis.

It can be considered a natural evolution of information technology. As a knowledge
discovery process, it comprises the data preparation and data mining tasks that together
complete the data mining process.

Data mining can be performed on any kind of data, from database data to advanced data
sets such as time series. The data mining process also comes with its own challenges.
