Module 5 - Data Science Methodology
Understanding the Data Science Methodology
CRISP-DM Methodology
• CRISP-DM stands for CRoss-Industry Standard Process for Data Mining.
• CRISP-DM is a process model with six phases that naturally describes the
data science life cycle. It’s like a set of guardrails to help you plan, organize,
and implement your data science (or machine learning) project.
• The process consists of the following six phases:
    • Business understanding – What does the business need?
    • Data understanding – What data do we have / need? Is it clean?
    • Data preparation – How do we organize the data for modeling?
    • Modeling – What modeling techniques should we apply?
    • Evaluation – Which model best meets the business objectives?
    • Deployment – How do stakeholders access the results?
CRISP-DM Methodology Diagrams
Business Understanding
The Business Understanding phase focuses on understanding the objectives
and requirements of the project.
Data Preparation
The Data Preparation phase prepares the final data set(s) for modeling.
It has five tasks:
• Select data: Determine which data sets will be used and document reasons
for inclusion/exclusion.
• Clean data: Often this is the lengthiest task. Without it, you’ll likely fall
victim to garbage-in, garbage-out. A common practice during this task is to
correct, impute, or remove erroneous values.
• Construct data: Derive new attributes that will be helpful. For example,
derive someone’s body mass index from height and weight fields.
• Integrate data: Create new data sets by combining data from multiple
sources.
• Format data: Re-format data as necessary. For example, you might convert
string values that store numbers to numeric values so that you can perform
mathematical operations.
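The format, clean, construct, and integrate tasks above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a prescribed CRISP-DM implementation: the records, field names (height_cm, weight_kg, visits), the choice of median imputation, and the helper names to_float and prepare are all assumptions made up for this example.

```python
from statistics import median

# Hypothetical raw records; field names and values are invented for illustration.
patients = [
    {"id": 1, "height_cm": "170", "weight_kg": "65"},
    {"id": 2, "height_cm": "180", "weight_kg": ""},    # missing weight
    {"id": 3, "height_cm": "-5", "weight_kg": "80"},   # erroneous height
    {"id": 4, "height_cm": "160", "weight_kg": "51.2"},
]
# Hypothetical second source: visit counts keyed by patient id.
visits = {1: 3, 2: 1, 4: 7}


def to_float(text):
    """Format data: convert a numeric string to float, or None if it is not a number."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def prepare(records, visit_counts):
    # Format: parse the string fields into numbers.
    rows = [
        {
            "id": r["id"],
            "height_cm": to_float(r["height_cm"]),
            "weight_kg": to_float(r["weight_kg"]),
        }
        for r in records
    ]
    # Clean (remove): drop rows whose height is missing or not positive.
    rows = [r for r in rows if r["height_cm"] is not None and r["height_cm"] > 0]
    # Clean (impute): fill missing weights with the median of the known weights.
    known = [r["weight_kg"] for r in rows if r["weight_kg"] is not None]
    fill = median(known)
    for r in rows:
        if r["weight_kg"] is None:
            r["weight_kg"] = fill
        # Construct: derive BMI = weight (kg) / height (m) squared.
        r["bmi"] = round(r["weight_kg"] / (r["height_cm"] / 100) ** 2, 1)
        # Integrate: attach the visit count from the second source (0 if absent).
        r["visits"] = visit_counts.get(r["id"], 0)
    return rows


prepared = prepare(patients, visits)
for row in prepared:
    print(row)
```

Whether to impute or remove an erroneous value depends on the project. Here an impossible height removes the row, while a missing weight is imputed, purely to show both options side by side.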
Modeling
Here you’ll likely build and assess various models based on several
different modeling techniques.
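As a rough sketch of what building and assessing several models can look like, the example below fits two candidate techniques on the same training data and compares them on held-out data. The data points, the two candidates (a mean baseline and a least-squares line), and the choice of mean absolute error as the metric are assumptions for illustration only.

```python
# Hypothetical (x, y) pairs invented for illustration.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
test = [(6, 12.0), (7, 14.1)]


def fit_mean(data):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_line(data):
    """Simple linear regression fitted by ordinary least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_abs_error(model, data):
    """Average absolute difference between predictions and true values."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)


# Build each candidate on the training data, then assess it on the test data.
candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {name: mean_abs_error(fit(train), test) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: MAE = {score:.3f}")
print("Lowest error:", best)
```

Scoring on data the models did not see during fitting keeps the comparison honest. The Evaluation phase then asks whether the best-scoring model actually meets the business objectives.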