Data Science Methodologies

The document outlines various methodologies for managing data science projects, including KDD, SEMMA, CRISP-DM, and TDSP, emphasizing their structured approaches and iterative processes. It details each methodology's phases, objectives, and key activities, highlighting the importance of aligning data science efforts with business goals. Additionally, it discusses the benefits and challenges associated with each methodology, providing insights into their application in real-world scenarios.


M.Sc (IT - AI/CC/Security) Semester I

DATA SCIENCE AND ANALYTICS (DSA)

Ms. Pooja R. Tupe
Visiting Faculty, UDIT, University of Mumbai.
Topics to cover

Different Methodologies –
KDD, SEMMA, CRISP-DM, TDSP
• Managing DS projects involves navigating complexity through
tailored project management strategies.
• This includes orchestrating diverse teams, handling extensive
data sets, and ensuring alignment with both business objectives
and rigorous scientific standards.
• Over the past few years, there has been considerable effort to
standardize the methodologies and define the best practices
followed in building data science solutions and projects.
Different Methodologies for DS
• We will look at the various project management methodologies and
frameworks that are used for building data mining solutions:
• Knowledge Discovery in Databases (KDD)
• Cross-Industry Standard Process for Data Mining (CRISP-DM)
• SEMMA, which stands for Sample, Explore, Modify, Model, Assess,
and refers to the process of conducting a data mining (DM) project
• Team Data Science Process (TDSP)
DIKW pyramid
• This is called the DIKW pyramid, also known variously as the
DIKW hierarchy, wisdom hierarchy, knowledge hierarchy, information
hierarchy, information pyramid, and the data pyramid.
• It refers loosely to a class of models for representing purported
structural and/or functional relationships between data,
information, knowledge, and wisdom.
1. Knowledge Discovery in Databases (KDD)
• A comprehensive process used in data mining and machine learning
to extract useful knowledge from large datasets.
• The KDD process typically consists of several stages or steps,
which are often represented as a sequence:

1. Selection:
• Objective: In the selection stage, the focus is on identifying
and retrieving data from various sources that are relevant to the
analysis and decision-making process.
• Activities:
• Define the criteria for selecting data based on the problem
domain and objectives.
• Gather data from databases, data warehouses, or other sources
that meet the defined criteria.
• Ensure the data collected is comprehensive and representative of
the problem at hand.
2. Pre-processing:
• Objective: Pre-processing involves cleaning and transforming the
raw data to prepare it for further analysis.
• Activities:
• Clean the data by handling missing values, outliers, and noise.
• Normalize or standardize data to ensure consistency and
comparability across different variables.
• Apply feature selection or extraction to identify the relevant
attributes that contribute most to the analysis.
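The cleaning and normalization activities above can be sketched in plain Python. This is an illustrative toy, not from the slides: the input column, the mean-imputation choice, and the [0, 1] min-max scaling are all assumptions for the example.

```python
import statistics

def preprocess(values):
    """Impute missing values with the mean, then min-max normalize to [0, 1]."""
    # Impute: replace None (a missing entry) with the mean of observed values.
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    filled = [v if v is not None else mean for v in values]
    # Normalize: rescale so the smallest value maps to 0 and the largest to 1.
    lo, hi = min(filled), max(filled)
    return [(v - lo) / (hi - lo) for v in filled]

print(preprocess([10.0, None, 30.0, 20.0]))  # missing entry imputed as 20.0
```

In practice a library such as pandas would handle this, but the logic is the same: fill gaps first, then rescale so variables become comparable.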
3. Transformation:
• Objective: Transformation aims to convert the pre-processed
data into a format suitable for mining and modeling.
• Activities:
• Aggregate or summarize data to reduce complexity and improve
efficiency in subsequent analysis.
• Perform dimensionality reduction techniques such as PCA
(Principal Component Analysis) to reduce the number of
variables while retaining important information.
• Apply encoding techniques for categorical variables to convert
them into numerical format if necessary.
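One of the listed transformation activities, encoding categorical variables as numbers, can be sketched as a minimal one-hot encoder. The color values below are invented for illustration:

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows): each row is a 0/1 vector with a 1
    in the position of that value's category.
    """
    categories = sorted(set(values))  # fixed, repeatable column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cats, encoded = one_hot(["red", "green", "red"])
print(cats)     # ['green', 'red']
print(encoded)  # [[0, 1], [1, 0], [0, 1]]
```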
4. Data Mining:
• Objective: Applying algorithms and statistical methods
to discover patterns, relationships, and insights from the
transformed data.
• Activities:
• Use various data mining techniques such as clustering,
classification, association rule mining, and regression
to extract patterns.
• Evaluate and compare different models to identify the
most suitable one for the problem at hand.
• Iteratively refine models based on feedback and
insights gained from the evaluation process.
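As a toy illustration of one classification technique from the list above, here is a 1-nearest-neighbor classifier in plain Python. The points and labels are invented for the example:

```python
import math

def nearest_neighbor(train, query):
    """Classify `query` by the label of its nearest (point, label) pair."""
    point, label = min(train, key=lambda pl: math.dist(pl[0], query))
    return label

# Tiny labeled dataset: two clusters of 2-D points.
train = [((0.0, 0.0), "low"), ((0.1, 0.2), "low"),
         ((1.0, 1.0), "high"), ((0.9, 1.1), "high")]
print(nearest_neighbor(train, (0.2, 0.1)))  # "low"
print(nearest_neighbor(train, (0.8, 0.9)))  # "high"
```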
5. Interpretation/Evaluation:
• Objective: Interpreting the patterns discovered and
evaluating the effectiveness and reliability of the models
developed.
• Activities:
• Interpret and visualize the patterns and relationships discovered
during the data mining process.
• Assess the quality and relevance of the insights gained against
the initial objectives and business goals.
• Communicate findings and recommendations to stakeholders in a
clear and understandable manner.
Key Characteristics
• Iterative Process: The KDD process is iterative, allowing
for refinement and adjustment of each stage based on
insights gained from subsequent stages.
• Comprehensive Approach: It integrates data selection,
preprocessing, transformation, mining, and evaluation into a
cohesive framework, ensuring a thorough analysis of data.
• Application Flexibility: KDD can be applied across various
domains and industries, adapting to different data types and
analysis requirements.
2. SEMMA
• The SEMMA process is a methodology developed by SAS
(Statistical Analysis System) for data mining and predictive
modeling.
• SEMMA stands for Sample, Explore, Modify, Model, and
Assess.
Sample
• Objective: The first phase involves selecting a
representative sample of data from the population for
analysis. This sample should accurately reflect the
characteristics of the entire dataset.
• Activities:
• Define sampling criteria based on project goals and data
characteristics.
• Randomly select samples from the dataset using appropriate
sampling techniques.
• Ensure the sample size is sufficient for meaningful analysis.
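A minimal sketch of drawing a simple random sample, assuming the dataset fits in a Python list. The population, sample size, and seed are illustrative choices, not from the slides:

```python
import random

def draw_sample(population, size, seed=42):
    """Draw a simple random sample without replacement.

    Fixing the seed makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    return rng.sample(population, size)

dataset = list(range(1000))        # stand-in for the full population
sample = draw_sample(dataset, 100)
print(len(sample))                 # 100 records, no duplicates
```

Seeding matters in practice: a reproducible sample lets the later Explore and Model phases be rerun on exactly the same data.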
Explore
• Objective: In this phase, the selected data sample is explored to
understand its characteristics, identify patterns, and gain
insights into potential relationships between variables.
• Activities:
• Perform descriptive statistics to summarize data distribution,
central tendencies, and variability.
• Visualize data through charts, graphs, and plots to identify
trends and outliers.
• Conduct preliminary data analysis to identify potential data
quality issues or missing values.
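The descriptive statistics mentioned above can be computed with Python's standard statistics module; the input values here are invented for illustration:

```python
import statistics

def describe(values):
    """Summarize distribution, central tendency, and variability."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

print(describe([4, 8, 15, 16, 23, 42]))
```

A gap between mean and median, or extreme min/max values, is often the first hint of skew or outliers worth plotting.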
Modify
• Objective: The modify phase focuses on data preparation
and preprocessing to ensure data quality and suitability for
modeling.
• Activities:
• Clean the data by handling missing values, outliers, and
inconsistencies.
• Transform variables as needed, such as normalization,
scaling, or encoding categorical variables.
• Select relevant features or variables that are most
predictive for the modeling phase.
Model
• Objective: The modeling phase involves building and validating
predictive or descriptive models using statistical or machine
learning techniques.
• Activities:
• Select appropriate modeling techniques based on project
objectives and data characteristics (e.g., regression,
classification, clustering).
• Train models using the prepared dataset and evaluate model
performance using appropriate metrics.
• Iteratively refine models by tuning parameters and assessing
model robustness and generalization.
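As one small example of the modeling phase, a univariate linear regression can be fit in closed form by ordinary least squares. The training points are constructed to lie exactly on a known line so the result is easy to check:

```python
def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Training data lying exactly on y = 2x + 1.
xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```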
Assess
• Objective: The assess phase evaluates the performance and
effectiveness of the models developed in the previous phase.
• Activities:
• Assess model performance using evaluation metrics (e.g.,
accuracy, precision, recall, F1-score for classification; RMSE,
MAE for regression).
• Validate models by testing against new or unseen data to ensure
they generalize well.
• Interpret and communicate results to stakeholders, including
recommendations based on model findings.
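The classification metrics named above (accuracy, precision, recall, F1) can be computed directly from pairs of labels; the actual/predicted vectors below are invented for illustration:

```python
def classification_metrics(actual, predicted):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
print(classification_metrics(actual, predicted))
```

Note that these should be computed on held-out data, per the validation bullet above, not on the training set.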
Key Benefits
• Structured Approach: SEMMA provides a systematic and
structured approach to data mining projects, guiding
practitioners through each essential phase from data
sampling to model assessment.
• Iterative Process: It allows for iterative refinement and
improvement of models based on insights gained during
exploration and assessment phases.
• Clear Methodological Steps: Each phase in SEMMA is
clearly defined, facilitating easier communication and
collaboration among team members involved in the project.
CRISP-DM
• CRISP-DM (Cross-Industry Standard Process for Data
Mining) is a widely used methodology for data mining and
data science projects.
• It provides a structured approach to planning and executing
data mining projects, ensuring that the results are relevant
and actionable. CRISP-DM is composed of six major phases:
Business Understanding:
• Focuses on understanding the project objectives and
requirements from a business perspective.
• Determines the business objectives, assesses the situation,
defines the data mining goals, and produces a project plan.
Data Understanding:
• Involves collecting initial data, describing the data,
exploring the data, and verifying data quality.
• Helps in understanding the data’s characteristics and
identifying any data quality issues.
Data Preparation:
• Covers all activities needed to construct the final dataset from
the initial raw data.
• Includes tasks such as selecting data, cleaning data, constructing
data, integrating data, and formatting data.
Modeling:
• Involves selecting modeling techniques, generating test design,
building models, and assessing models.
• Different modeling techniques may require different data formats
and assumptions.
Evaluation:
• Assesses the model thoroughly to ensure it meets the business
objectives.
• Involves reviewing the steps executed, ensuring that the model is
achieving the intended goals, and deciding on the next steps.
Deployment
• Involves deploying the model into the operational environment
for use.
• Can include generating reports, implementing the model within an
application, or creating a repeatable data mining process for
ongoing use.

Each phase in CRISP-DM is iterative, and the process often
requires revisiting previous phases as new insights are gained and
requirements evolve.
This flexibility and structured approach make CRISP-DM a popular
choice in the data science community.
4. Team Data Science Process (TDSP)
• If you combine Scrum and CRISP-DM, you will get something
that looks like Microsoft’s Team Data Science Process.
• Launched in 2016, TDSP is "an agile, iterative data science
methodology to deliver predictive analytics solutions and
intelligent applications efficiently" (Microsoft, 2020).
• This is a modern data science process that combines
elements of the core data science life cycle, software
engineering, and Agile processes.
TDSP Components
• TDSP has four main components:
• A data science lifecycle definition
• A standardized project structure
• Recommended infrastructure and resources
• Recommended tools and utilities
TDSP Life Cycle
• Although the lifecycle graphic looks quite different, TDSP's
project lifecycle is similar to CRISP-DM and includes five
iterative stages:
1. Business Understanding: define objectives and identify data sources
2. Data Acquisition and Understanding: ingest data and determine if it
can answer the presenting question (effectively combines Data
Understanding and Data Cleaning from CRISP-DM)
3. Modeling: feature engineering and model training
(combines Modeling and Evaluation)
4. Deployment: deploy into a production environment
5. Customer Acceptance: customer validation that the system meets
business needs (a phase not explicitly covered by CRISP-DM)
Evaluation
• Pros
• Agile: Emphasizes the need for incremental deliverables.
• Familiar: The product backlog, features, user stories, bugs, Git
versioning, and sprint planning are familiar to those used to common
software practices.
• Data Science Native: TDSP acknowledges that data science and
software engineering are different, and is built for data science teams
working on production-bound projects.
• Flexible: TDSP can be implemented as it is defined or in conjunction with
other approaches such as CRISP-DM.
• Thorough: Because of its rich team focus and detailed documentation,
TDSP is arguably the most mature CRISP-derived project management
approach. It is conceptually similar to Domino Data Lab’s Lifecycle but is
more detailed.
• Free Templates: Go to Microsoft Azure’s GitHub repository to get
started.
Cons
• Fixed Sprints: TDSP leverages fixed-length planning sprints,
which many data scientists struggle with.
• Some Inconsistencies: Not all of Microsoft's documentation is
consistent.

• TDSP is a good option for data science teams who aspire to
deliver production-level data science products.
• It may not be appropriate for one-person data science teams or
for projects without a production goal.
