Introduction To CRISP DM Framework For Data Science and Machine Learning

The document discusses the CRISP-DM framework, which is a widely used methodology for data science and machine learning projects. It consists of six steps: 1) business understanding, 2) data understanding, 3) data preparation, 4) modeling, 5) evaluation, and 6) deployment. The first step involves understanding business goals and problems. The second step focuses on data exploration, including identifying dependent and independent variables. The third step prepares the data for modeling through tasks like handling missing values.

Uploaded by

rameshsripada

10/12/2018 Chapter 1 - Introduction to CRISP DM Framework for Data Science and Machine Learning | LinkedIn

Anshul Roy
Sr. Data Engineer/Data Scientist on R/Python/Spark/Scala/Big Data

Chapter 1 - Introduction to
CRISP DM Framework for Data
Science and Machine Learning
Published on June 21, 2018

CRISP DM Framework

In my first post, I would like to discuss the basic framework that is
commonly used in Data Science/ML projects. It is important for anyone
working on such a project to follow a streamlined approach to creating a
Machine Learning model. Doing so also ensures that we do not miss any of
the steps required to create the model.

Of the many such methodologies available, one that is widely used is the
CRISP-DM framework.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The
CRISP-DM methodology provides a structured approach to planning a data
mining project. It is a robust and well-proven methodology.

https://www.linkedin.com/pulse/chapter-1-introduction-crisp-dm-framework-data-science-anshul-roy/ 1/7

The basic steps of the CRISP-DM framework are –

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

I will try to bring out the key activities and the significance of each of
the steps listed above.


1. Business Understanding –

This step focuses on understanding the business in all its different
aspects. It consists of the following sub-steps.

a. Identify the goal and frame the business problem.

b. Gather information on resources, constraints, assumptions, risks, etc.

c. Prepare Analytical Goal

d. Flow Chart

2. Data Understanding –

The Data Understanding phase of the CRISP-DM framework focuses on
collecting, describing and exploring the data.

Exploring the data involves analyzing the data in hand for

· Dependent and Independent Variable Identification.

· Uni-variate Analysis – Exploring each independent variable on its own.

· Bi-variate Analysis – Exploring different combinations of two or more
variables using Correlation, Chi-Sq Test, T-Test, Z-Test etc. This step
also involves subcategory analysis of each independent variable against
the dependent variable.

· Aggregated data exploration.

· Data Quality Check, which is also performed at this step.
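
The exploration tasks listed above can be sketched in a few lines. This is only a minimal illustration using the Python standard library. The toy records and the column names (age, income, segment, churned) are made up for the example, and in a real project a data-frame library would normally do this work.

```python
from statistics import mean, median, stdev

# Hypothetical toy records; the column names are assumptions for illustration.
rows = [
    {"age": 25, "income": 30000, "segment": "A", "churned": 0},
    {"age": 32, "income": 42000, "segment": "A", "churned": 0},
    {"age": 47, "income": 61000, "segment": "B", "churned": 1},
    {"age": 51, "income": 58000, "segment": "B", "churned": 1},
    {"age": 38, "income": None, "segment": "A", "churned": 0},
    {"age": 29, "income": 35000, "segment": "B", "churned": 1},
]


def univariate_summary(values):
    """Uni-variate analysis: summary statistics for one numeric variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def pearson(xs, ys):
    """Bi-variate analysis: Pearson correlation between two numeric variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def missing_count(records, column):
    """Data quality check: number of records with no value in a column."""
    return sum(1 for r in records if r[column] is None)


def rate_by_group(records, group_col, target_col):
    """Aggregated exploration: mean of the dependent variable per category."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_col], []).append(r[target_col])
    return {g: mean(vals) for g, vals in groups.items()}


# Correlation is computed only on records where both values are present.
complete = [r for r in rows if r["income"] is not None]
age_income_corr = pearson(
    [r["age"] for r in complete], [r["income"] for r in complete]
)
age_summary = univariate_summary([r["age"] for r in rows])
churn_by_segment = rate_by_group(rows, "segment", "churned")
```

Here `churned` plays the role of the dependent variable, and `rate_by_group` is one simple way to look at a subcategory of an independent variable against it.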


Note: Independent variables in a dataset are the variables that are
used as inputs to make Machine Learning predictions. Dependent
variables are the variables that we are required to predict.

Let's consider the prediction of SPAM or HAM emails. An email dataset
may have many attributes like Email_To, Subject, Mail_Body, …, SPAM_HAM.
Among these attributes, SPAM_HAM takes the values 0/1, i.e. it records
whether a given email is SPAM or HAM. Its value depends on all the other
attributes, and hence it is called the Dependent Attribute. The other
attributes do not need any other attribute to determine their values, and
hence they are called Independent Attributes.
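
As a small sketch of this idea, the snippet below separates the dependent attribute from the independent ones. The two email records are invented for illustration, and the choice of 1 for SPAM and 0 for HAM is an assumption, since the encoding is not fixed above.

```python
# Hypothetical email records using the attribute names from the example.
# Assumption: SPAM_HAM = 1 means SPAM and 0 means HAM.
emails = [
    {"Email_To": "a@example.com", "Subject": "Win a prize now",
     "Mail_Body": "Click here to claim", "SPAM_HAM": 1},
    {"Email_To": "b@example.com", "Subject": "Meeting notes",
     "Mail_Body": "Notes are attached", "SPAM_HAM": 0},
]

TARGET = "SPAM_HAM"  # the dependent attribute we want to predict


def split_features_target(records, target):
    """Return (independent attributes, dependent attribute) for each record."""
    features = [{k: v for k, v in r.items() if k != target} for r in records]
    labels = [r[target] for r in records]
    return features, labels


X, y = split_features_target(emails, TARGET)
```

After the split, `X` holds only the independent attributes and `y` holds the values the model has to predict.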

In an ML implementation, we may get numerous independent variables that
may or may not contribute to the prediction of the dependent variable.
The data understanding step at times even gives us the gist of which
attributes may be important for predicting the dependent variable.

3. Data preparation:

In this step, we prepare and clean the provided data. There are many steps
that one should follow to complete the data preparation phase.

a. The first and foremost step is NA treatment. Normally the data at
hand is not clean and often contains NA values. We must identify such
values and fill or replace them appropriately. There are many different
techniques for NA treatment, and there are packages in R and Python
that automatically treat such values based on some default logic.
However, it is often better to do it manually, as this way we get to
understand the data even further and can replace these NAs based on our
understanding of the business requirement. The article below from
Analytics Vidhya explains R packages that can do the NA treatment on
their own.

Imputing Missing Values


b. The next step would be to treat Nulls. This step is as important as
NA treatment, and in my experience the steps below work for Null
treatment. They may change depending on the data in hand, but should
help to some extent.

i. If the variable is continuous in nature (i.e. a numeric variable), we
can use the Mean/Median/Mode for the missing value treatment.

ii. If the variable is categorical in nature, we can impute the Nulls
with “Unavailable”, as these Null or Unavailable values may contribute
significantly to the model and we do not want to lose any important
attribute.
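
A minimal sketch of the two rules above, using only the Python standard library. It assumes missing values are represented as None. The function names are made up for this example and do not come from any particular package.

```python
from statistics import mean, median, mode

# Map each strategy name to the standard-library function that computes it.
STRATEGIES = {"mean": mean, "median": median, "mode": mode}


def impute_numeric(values, strategy="median"):
    """Fill missing (None) entries of a continuous variable.

    The fill value is computed from the observed entries only.
    """
    observed = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values, placeholder="Unavailable"):
    """Fill missing (None) entries of a categorical variable with a label."""
    return [placeholder if v is None else v for v in values]


ages = impute_numeric([25, None, 47, 51], strategy="median")
colours = impute_categorical(["red", None, "blue"])
```

The median is used as the default here because it is less sensitive to extreme values than the mean. That is a design choice for the sketch, and the right strategy still depends on the data and the business context.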

c. Outlier Treatment of the data should be the next step, wherein we
check for outliers and try to impute them, again with the
mean/median/mode. Other methods may be used as well, such as replacing
each outlier with the nearest value in the data that is not an outlier,
i.e. one that lies at the border. We can visualize outliers using a
Boxplot of the attribute. We can also bin the variables as part of
Outlier Treatment.
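
One way to sketch the "border value" idea is with the interquartile range (IQR) rule, which is also what a Boxplot's whiskers usually use. The 1.5 × IQR threshold is a common convention, not something prescribed above, so treat it as an assumption.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits; values outside them are treated as outliers."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, k=1.5):
    """Replace each outlier with the nearest non-outlier value in the data."""
    low, high = iqr_bounds(values, k)
    inliers = [v for v in values if low <= v <= high]
    lo_edge, hi_edge = min(inliers), max(inliers)  # the "border" values
    return [min(max(v, lo_edge), hi_edge) for v in values]


readings = [10, 12, 11, 13, 12, 11, 100]
capped = cap_outliers(readings)
```

In this toy list, 100 lies far above the upper limit, so it is replaced by 13, the largest value that is not an outlier.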

d. Standardization or Scaling of the data – This step rescales the
values of an attribute to a common range, for example between -1 and 1
(min-max scaling), or to zero mean and unit variance (standardization).
Scaling puts the variables on comparable ranges, which often helps
algorithms that are sensitive to the magnitude of the variables. This
can be done on Continuous Variables.
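
The two common variants can be sketched as follows. This is a plain standard-library illustration with made-up function names. Note that only min-max scaling guarantees values between -1 and 1. Standardized values are centred on 0 but are not bounded.

```python
from statistics import mean, pstdev


def min_max_scale(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so they lie between new_min and new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to the middle of the range.
        return [(new_min + new_max) / 2 for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scaled = min_max_scale([0, 5, 10])
z_scores = standardize([1, 2, 3])
```

In a real project the scaling parameters (minimum, maximum, mean, standard deviation) would be learned on the training data and then reused on new data, so that both are transformed the same way.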

e. Feature Engineering: One of the most important steps, which can be
seen as the combination of Feature Transformation and Feature
Extraction. In this step, we try to create more attributes from the
available attributes. Though there are some rules of thumb that help us
in this step, it is open to analysis and performs better the more it is
explored.

i. Create value transformations like the Square, Cube, Square root,
Cube root or Log of certain columns, as it has been seen that such
derived columns can contribute more to the algorithm than the columns
they are derived from.

ii. Variable creation like Dummy Encoding (creating dummy variables from
the categorical variables), Data Split etc. also contributes to Feature
Engineering.

Feature Engineering is one step that can be explored further and can
contribute significantly to the outcome.
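
Both kinds of feature creation can be sketched briefly. The records and the column names (income, city) are hypothetical, and the transformations assume strictly positive values, since the square root and log are not defined otherwise.

```python
import math


def add_transforms(records, column):
    """Add square, square-root and log versions of a positive numeric column."""
    out = []
    for r in records:
        x = r[column]
        new = dict(r)  # copy so the original record is left untouched
        new[f"{column}_sq"] = x ** 2
        new[f"{column}_sqrt"] = math.sqrt(x)
        new[f"{column}_log"] = math.log(x)
        out.append(new)
    return out


def dummy_encode(records, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in records})
    out = []
    for r in records:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}_{c}"] = 1 if r[column] == c else 0
        out.append(new)
    return out


# Hypothetical toy records for illustration.
customers = [
    {"income": 100.0, "city": "Pune"},
    {"income": 400.0, "city": "Delhi"},
]
transformed = add_transforms(customers, "income")
encoded = dummy_encode(customers, "city")
```

For linear models it is common to drop one of the dummy columns to avoid redundancy. The sketch keeps all of them for simplicity.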

f. Dimensionality Reduction: Another significant topic. I will cover it
in a little detail in upcoming chapters.

4. Modeling :

Once the above steps are done, we have met the basic prerequisites of ML
and can proceed with implementing different ML algorithms. The algorithm
to be selected depends on the business requirement, the available data
and the desired outcome. Ideally, we should try different algorithms or
combinations of algorithms (Ensembles) to arrive at our final best
algorithm. We will discuss the different ML algorithms in detail later.

5. Evaluation of the Model :

There are many model evaluation metrics, like Accuracy, Sensitivity,
Specificity, F-Score, AUC, R-Sq, Adj R-Sq, RMSE (Root Mean Square
Error), and the list keeps going. Which evaluation metric to choose
depends on the evaluation criteria, the desired outcome of the final
model, the business requirement and the model algorithm used.
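
A few of these metrics can be computed directly from predictions, as sketched below with the standard library. The function names are made up for the example. The labels are assumed to be 0/1 with 1 as the positive class, and the F-Score shown is the F1 variant.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, Sensitivity (recall), Specificity, precision and F1."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


def rmse(y_true, y_pred):
    """Root Mean Square Error for a regression model."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical labels and predictions for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(actual, predicted)
```

Accuracy, Sensitivity, Specificity and F-Score apply to classification models, while RMSE and R-Sq apply to regression models, which is one reason the choice of metric depends on the algorithm used.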

6. Deployment:

Finally, once the model is created, tested and evaluated on the Test
and Validation data, it is presented to the business (with a PPT). The
model then undergoes different real-time evaluation and testing, like
A/B Testing, and after all the approval processes, the code is pushed to
PROD/Live data.
