
Introduction To CRISP DM Framework For Data Science and Machine Learning

The document discusses the CRISP-DM framework, which is a widely used methodology for data science and machine learning projects. It consists of six steps: 1) business understanding, 2) data understanding, 3) data preparation, 4) modeling, 5) evaluation, and 6) deployment. The first step involves understanding business goals and problems. The second step focuses on data exploration, including identifying dependent and independent variables. The third step prepares the data for modeling through tasks like handling missing values.

Uploaded by

rameshsripada

10/12/2018 Chapter 1 - Introduction to CRISP DM Framework for Data Science and Machine Learning | LinkedIn

Anshul Roy
Sr. Data Engineer/Data Scientist on R/Python/Spark/Scala/Big Data

Chapter 1 - Introduction to
CRISP DM Framework for Data
Science and Machine Learning
Published on June 21, 2018

CRISP DM Framework

In my first post, I would like to discuss the basic framework that is
normally used in any Data Science/ML project. It is very important for
anyone working on such a project to follow a streamlined approach to
creating a Machine Learning model. Doing so also ensures that we do not
miss any of the steps required to create the model.

Of the many such methodologies available, one that is widely used is the
CRISP DM Framework.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The
CRISP-DM methodology provides a structured approach to planning a data
mining project. It is a robust and well-proven methodology.

https://www.linkedin.com/pulse/chapter-1-introduction-crisp-dm-framework-data-science-anshul-roy/ 1/7
The basic steps of the CRISP DM Framework are:

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

I will try to cover the key tasks and the significance of each of the
steps listed above.

1.      Business Understanding –

This step mostly focuses on understanding the business in all its
different aspects. It consists of the following steps.

a. Identify the goal and frame the business problem.

b. Gather information on resources, constraints, assumptions, risks etc.

c. Prepare the Analytical Goal

d. Flow Chart

2.      Data Understanding –

The Data Understanding phase of the CRISP DM Framework focuses on
collecting, describing and exploring the data.

Exploring the data involves analyzing the data in hand for:

· Dependent and Independent Variable Identification.

· Uni-variate Analysis – Exploring each independent variable on its own.

· Bi-variate Analysis – Exploring different combinations of two or more
variables using Correlation, Chi-Sq Test, T-Test, Z-Test etc. This step
also involves subcategory analysis of each independent variable against
the dependent variable.

· Aggregated data exploration.

· Data Quality Check, which is also performed at this step.
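
The uni-variate, bi-variate and data quality checks listed above can be
sketched in a few lines of plain Python. This is only an illustration and
not the method of any particular package: the function names, the use of
None to mark a missing value, and the experience/salary figures are all
assumptions made for the example. In practice a data analysis library
would normally do this work.

```python
import math
from statistics import mean, median


def univariate_summary(values):
    """Summarise one variable; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),  # simple data quality check
        "mean": mean(present),
        "median": median(present),
        "min": min(present),
        "max": max(present),
    }


def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long numeric variables."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / (spread_x * spread_y)


# Hypothetical data: years of experience (independent), salary (dependent).
experience = [1, 2, 3, 4, 5]
salary = [30, 35, 45, 50, 60]

summary = univariate_summary([1, 2, None, 4, 5])       # uni-variate view
correlation = pearson_correlation(experience, salary)  # bi-variate view
```

A correlation close to +1 or -1 hints that the independent variable may be
useful for predicting the dependent one, while a value near 0 suggests
little linear relationship.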

Note: Independent variables in a dataset are the variables which are
used to make Machine Learning predictions. Dependent variables are the
variables that we are required to predict.

Let's consider predicting whether an email is SPAM or HAM. An email
dataset may have many attributes like Email_To, Subject,
Mail_Body…..SPAM_HAM. Among all these attributes, SPAM_HAM is an
attribute with values 0/1, i.e. a given email is either SPAM or HAM.
The value of this attribute depends on all the other attributes, and
hence it is called the Dependent Attribute. The other attributes do not
need any other attribute to determine their values, and hence they are
called Independent Attributes.
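
As a small sketch of this idea, the snippet below separates the dependent
attribute from the independent ones. The two sample records, the helper
name split_features_target and the convention that 1 means SPAM and 0
means HAM are assumptions made purely for illustration; the attribute
names come from the example above.

```python
TARGET = "SPAM_HAM"  # the dependent attribute

# Hypothetical records; 1 = SPAM and 0 = HAM is an assumed encoding.
emails = [
    {"Email_To": "a@example.com", "Subject": "Win a prize now",
     "Mail_Body": "Click here to claim", "SPAM_HAM": 1},
    {"Email_To": "b@example.com", "Subject": "Meeting notes",
     "Mail_Body": "Please find the notes attached", "SPAM_HAM": 0},
]


def split_features_target(rows, target):
    """Return (independent attributes, dependent attribute) for each row."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


X, y = split_features_target(emails, TARGET)
```

Here X holds only the independent attributes, which the model may use as
inputs, and y holds the values the model is required to predict.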

In an ML implementation, we may get numerous independent variables
that may or may not contribute to the prediction of the Dependent
Variable. The data understanding step at times even gives us the gist of
which attributes may be important for predicting the dependent variable.

3.      Data preparation:

In this step, we prepare and clean the provided data. There are many steps
that one should follow to complete the data preparation phase.

a. The first and foremost step is NA treatment. Normally the data at
hand is not clean and very often contains NA values. We must identify
such values and appropriately fill or replace them. There are many
different techniques for NA treatment, and there are packages in R and
Python which treat such values automatically based on some default logic.
However, it is always good to do it manually, as this way we get to
understand the data even further and can replace these NAs based on our
understanding of the Business Requirement. The article below from
Analytics Vidhya explains R packages which can do the NA treatment on
their own.

Imputing Missing Values

b. The next step is to treat Nulls. This step is as important as NA
treatment and, in my experience, the steps below work well for Null
treatment. Again, they may change depending on the data in hand, but they
will help to some extent.

i. If the variable is continuous in nature (i.e. a numeric variable), we
can use the Mean/Median/Mode for the missing value treatment.

ii. If the variable is categorical in nature, we can impute the Nulls
with “Unavailable”, as these Null or Unavailable values may contribute
significantly to the model creation and we do not want to lose any
important attribute.
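
A minimal sketch of the two rules above, in plain Python. It assumes that
a missing value (NA or Null) is represented as None, and the helper names
impute_numeric and impute_categorical are made up for this example. The
median is used as the default fill value, and the mean or mode can be
passed in instead.

```python
from statistics import mean, median


def impute_numeric(values, strategy=median):
    """Fill missing numeric values with a statistic of the known values."""
    present = [v for v in values if v is not None]
    fill_value = strategy(present)
    return [fill_value if v is None else v for v in values]


def impute_categorical(values, placeholder="Unavailable"):
    """Keep missing categories as their own explicit level."""
    return [placeholder if v is None else v for v in values]


# Hypothetical columns with missing entries.
ages = impute_numeric([25, None, 31, 90])
ages_mean = impute_numeric([20, None, 40], strategy=mean)
cities = impute_categorical(["Pune", None, "Delhi"])
```

The median is often preferred over the mean when the column contains
extreme values, because a single very large value (90 in the sample) moves
the mean much more than it moves the median.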

c. Outlier Treatment of the data should be the next step, wherein we
check for the presence of outliers and try to impute them, again with the
mean/median/mode. The imputation may follow other methods as well, such
as replacing the outlier values with the values from the data which are
not outliers and lie at the border. We can visualise such values using a
Boxplot of the attribute. We can also bin the variables as part of
Outlier Treatment.
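
The "border value" approach can be sketched as follows. The text does not
say how an outlier is identified, so this example assumes the usual
boxplot rule: a value is an outlier if it lies more than 1.5 times the
interquartile range below the first quartile or above the third quartile.
The function name cap_outliers and the sample data are hypothetical.

```python
from statistics import quantiles


def cap_outliers(values, k=1.5):
    """Replace outliers with the nearest non-outlier (border) value."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if lower_fence <= v <= upper_fence]
    low_border, high_border = min(inliers), max(inliers)
    # Values outside the fences are pulled in to the border values.
    return [min(max(v, low_border), high_border) for v in values]


# Hypothetical attribute with one extreme value.
capped = cap_outliers([10, 12, 11, 13, 12, 11, 100])
```

With the sample data, the value 100 is replaced by 13, the largest value
that is not an outlier, and the remaining values are left unchanged.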

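d. Standardization or Scaling of the data – This step rescales the
values of an attribute to a common range, for example so that they lie
between -1 and 1 (strictly speaking, standardization rescales to zero
mean and unit variance rather than to a fixed range). With scaling, the
ranges of the variables are reduced and made comparable, which can result
in a better predicting variable. This is done on Continuous Variables.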
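
Both flavours of this step can be sketched in plain Python. The helper
names scale_to_range and standardize are assumptions for this example.
The first maps the values linearly onto the range -1 to 1, and the second
subtracts the mean and divides by the (population) standard deviation.

```python
from statistics import mean, pstdev


def scale_to_range(values, low=-1.0, high=1.0):
    """Min-max scaling: map the smallest value to low, the largest to high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:  # constant column: place every value mid-range
        return [(low + high) / 2 for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


def standardize(values):
    """Z-score scaling: zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scaled = scale_to_range([0, 5, 10])
z_scores = standardize([2, 4, 6])
```

Whichever method is chosen, the minimum, maximum, mean and standard
deviation should be computed on the training data only and then reused for
any new data, so that all data is scaled in the same way.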

e. Feature Engineering: One of the most important steps. It can be
regarded as the combination of Feature Transformation and Feature
Extraction. In this step, we try to create more attributes from the
available attributes. Though there are some rules of thumb that help us
here, this step is open to analysis and performs better the more it is
explored.

i. Create value transformations such as the Square, Cube, Square root,
Cube root or Log of certain columns, as it has been seen that such
derived columns can contribute more to the algorithm than the column
they were derived from.
ii. Variable creation such as Dummy Encoding (creating dummy variables
from the categorical variables), Data Split etc. also contributes to
Feature Engineering.

Feature Engineering is one step that rewards further exploration and can
contribute significantly to the outcome.
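
The two kinds of feature engineering mentioned above, a value
transformation and Dummy Encoding, can be sketched as below. The rows, the
column names income and city, and the helper names are hypothetical. The
log transform here uses log(1 + x) so that a value of 0 is handled, which
assumes the column has no negative values.

```python
import math


def add_log_feature(rows, column):
    """Add a derived column holding log(1 + value) of an existing column."""
    return [{**row, f"log_{column}": math.log1p(row[column])} for row in rows]


def dummy_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records.
rows = [
    {"income": 0, "city": "Pune"},
    {"income": 99, "city": "Delhi"},
]
with_log = add_log_feature(rows, "income")
with_dummies = dummy_encode(rows, "city")
```

After encoding, the city column is replaced by the columns city_Delhi and
city_Pune, each holding 1 for rows of that city and 0 otherwise.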

f. Dimensionality Reduction: Another significant topic. I will cover it
in a little detail in upcoming posts.

4.      Modeling :

Once the above steps are done, we have covered the basic prerequisites
of ML and can proceed with implementing different ML algorithms. The
algorithm to be selected depends on the business requirement, the
available data and the desired outcome. In an ideal situation, we should
try different algorithms or combinations of algorithms (Ensembles) to
arrive at our final best algorithm. We will discuss the different ML
algorithms in detail later.
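
The idea of trying several algorithms and keeping the best one can be
sketched as a simple comparison on held-out data. Everything in this
example is an assumption chosen to keep it short: the two toy "algorithms"
(a majority-class baseline and a 1-nearest-neighbour rule on a single
numeric feature), the tiny dataset, and accuracy as the selection metric.
Real projects would compare proper ML algorithms in the same way.

```python
from collections import Counter


def fit_majority(train):
    """Baseline: always predict the most common label in the training data."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_neighbour(train):
    """Predict the label of the closest training point (one numeric feature)."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(model, data):
    """Share of (feature, label) pairs that the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def select_best(candidates, train, validation):
    """Fit every candidate on train and score it on the validation data."""
    scores = {name: accuracy(fit(train), validation)
              for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores


# Hypothetical (feature, label) pairs.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (8.0, 1), (9.0, 1)]
validation = [(1.5, 0), (8.5, 1), (7.5, 1), (2.5, 0)]
candidates = {"majority": fit_majority,
              "nearest_neighbour": fit_nearest_neighbour}
best_name, scores = select_best(candidates, train, validation)
```

Scoring the candidates on data that was not used for fitting is what makes
the comparison fair, since a model can otherwise look good simply by
memorising its training data.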

5.      Evaluation of the Model :

There are many model evaluation metrics, such as Accuracy, Sensitivity,
Specificity, F-Score, AUC, R-Sq, Adj R-Sq, RMSE (Root Mean Square
Error), and the list keeps going. Which evaluation metric to choose
depends on the evaluation criteria, the desired outcome of the final
model, the business requirement and the model algorithm used.
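
A few of these metrics can be computed directly from the actual and
predicted values, as in the sketch below. The function names and the
sample labels are assumptions for illustration, with 1 taken as the
positive class. The F-Score shown is the F1 variant, i.e. the harmonic
mean of precision and sensitivity.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def rmse(y_true, y_pred):
    """Root Mean Square Error for numeric predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted labels.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                 [1, 1, 0, 0, 0, 1, 0, 1])
error = rmse([3.0, 5.0, 7.0], [3.0, 5.0, 4.0])
```

Accuracy, Sensitivity, Specificity and F-Score apply to classification
problems such as the SPAM/HAM example, whereas RMSE and R-Sq apply to
regression problems where the dependent variable is numeric.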

6.      Deployment:

Finally, once the model has been created, tested and evaluated on the
Test and Validation data, it is presented to the business (with a PPT).
The model then undergoes different real-time evaluation and testing, such
as A/B Testing, and after all the approval processes the code is pushed
to PROD to run on live data.
