Data Science Methodologies

The document outlines various methodologies for managing data science projects, including KDD, SEMMA, CRISP-DM, and TDSP, emphasizing their structured approaches and iterative processes. It details each methodology's phases, objectives, and key activities, highlighting the importance of aligning data science efforts with business goals. Additionally, it discusses the benefits and challenges associated with each methodology, providing insights into their application in real-world scenarios.


M.Sc (IT - AI/CC/Security) Semester I

DATA SCIENCE AND ANALYTICS (DSA)

Ms. Pooja R. Tupe
Visiting Faculty, UDIT, University of Mumbai.
Topics to cover

Different Methodologies –
KDD, SEMMA, CRISP-DM, TDSP
• Managing DS projects involves navigating complexity through
tailored project management strategies.
• This includes orchestrating diverse teams, handling extensive
data sets, and ensuring alignment with both business objectives
and rigorous scientific standards.
• Over the past few years, there has been considerable effort to
standardize the methodologies and define the best practices
followed in building data science solutions and projects.
Different Methodologies for DS
• We will look at the various project management methodologies and
frameworks that are used for building data mining solutions:
• Knowledge Discovery in Databases (KDD)
• Cross-Industry Standard Process for Data Mining (CRISP-DM)
• SEMMA, which stands for Sample, Explore, Modify, Model, Assess,
and refers to the process of conducting a data mining (DM) project
• Team Data Science Process (TDSP)
DIKW pyramid
• This is called the DIKW pyramid, also known variously as the
DIKW hierarchy, wisdom hierarchy, knowledge hierarchy, information
hierarchy, information pyramid, and the data pyramid.
• It refers loosely to a class of models for representing purported
structural and/or functional relationships between data,
information, knowledge, and wisdom.
1. Knowledge Discovery in Databases (KDD)
• A comprehensive process used in data mining and machine learning
to extract useful knowledge from large datasets.
• The KDD process typically consists of several stages or steps,
which are often represented as a sequence:

1. Selection:
• Objective: In the selection stage, the focus is on identifying
and retrieving data from various sources that are relevant to the
analysis and decision-making process.
• Activities:
• Define the criteria for selecting data based on the problem
domain and objectives.
• Gather data from databases, data warehouses, or other sources
that meet the defined criteria.
• Ensure the data collected is comprehensive and representative of
the problem at hand.
2. Pre-processing:
• Objective: Pre-processing involves cleaning and transforming the
raw data to prepare it for further analysis.
• Activities:
• Clean the data by handling missing values, outliers, and noise.
• Normalize or standardize data to ensure consistency and
comparability across different variables.
• Apply feature selection or extraction to identify the relevant
attributes that contribute most to the analysis.
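The cleaning and normalization activities above can be sketched in plain Python. This is an illustrative toy, not from the slides: the input column, the mean-imputation choice, and the [0, 1] min-max scaling are all assumptions for the example.

```python
import statistics

def preprocess(values):
    """Impute missing values with the mean, then min-max normalize to [0, 1]."""
    # Impute: replace None (a missing entry) with the mean of observed values.
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    filled = [v if v is not None else mean for v in values]
    # Normalize: rescale so the smallest value maps to 0 and the largest to 1.
    lo, hi = min(filled), max(filled)
    return [(v - lo) / (hi - lo) for v in filled]

print(preprocess([10.0, None, 30.0, 20.0]))  # missing entry imputed as 20.0
```

In practice a library such as pandas would handle this, but the logic is the same: fill gaps first, then rescale so variables become comparable.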
3. Transformation:
• Objective: Transformation aims to convert the pre-processed
data into a format suitable for mining and modeling.
• Activities:
• Aggregate or summarize data to reduce complexity and improve
efficiency in subsequent analysis.
• Perform dimensionality reduction techniques such as PCA
(Principal Component Analysis) to reduce the number of
variables while retaining important information.
• Apply encoding techniques for categorical variables to convert
them into numerical format if necessary.
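One of the listed transformation activities, encoding categorical variables as numbers, can be sketched as a minimal one-hot encoder. The color values below are invented for illustration:

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows): each row is a 0/1 vector with a 1
    in the position of that value's category.
    """
    categories = sorted(set(values))  # fixed, repeatable column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cats, encoded = one_hot(["red", "green", "red"])
print(cats)     # ['green', 'red']
print(encoded)  # [[0, 1], [1, 0], [0, 1]]
```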
4. Data Mining:
• Objective: Applying algorithms and statistical methods
to discover patterns, relationships, and insights from the
transformed data.
• Activities:
• Use various data mining techniques such as clustering,
classification, association rule mining, and regression
to extract patterns.
• Evaluate and compare different models to identify the
most suitable one for the problem at hand.
• Iteratively refine models based on feedback and
insights gained from the evaluation process.
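As a toy illustration of one classification technique from the list above, here is a 1-nearest-neighbor classifier in plain Python. The points and labels are invented for the example:

```python
import math

def nearest_neighbor(train, query):
    """Classify `query` by the label of its nearest (point, label) pair."""
    point, label = min(train, key=lambda pl: math.dist(pl[0], query))
    return label

# Tiny labeled dataset: two clusters of 2-D points.
train = [((0.0, 0.0), "low"), ((0.1, 0.2), "low"),
         ((1.0, 1.0), "high"), ((0.9, 1.1), "high")]
print(nearest_neighbor(train, (0.2, 0.1)))  # "low"
print(nearest_neighbor(train, (0.8, 0.9)))  # "high"
```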
5. Interpretation/Evaluation:
• Objective: Interpreting the patterns discovered and
evaluating the effectiveness and reliability of the models
developed.
• Activities:
• Interpret and visualize the patterns and relationships discovered
during the data mining process.
• Assess the quality and relevance of the insights gained against
the initial objectives and business goals.
• Communicate findings and recommendations to stakeholders in a
clear and understandable manner.
Key Characteristics
• Iterative Process: The KDD process is iterative, allowing
for refinement and adjustment of each stage based on
insights gained from subsequent stages.
• Comprehensive Approach: It integrates data selection,
preprocessing, transformation, mining, and evaluation into a
cohesive framework, ensuring a thorough analysis of data.
• Application Flexibility: KDD can be applied across various
domains and industries, adapting to different data types and
analysis requirements.
2. SEMMA
• The SEMMA process is a methodology developed by SAS
(Statistical Analysis System) for data mining and predictive
modeling.
• SEMMA stands for Sample, Explore, Modify, Model, and
Assess.
Sample
• Objective: The first phase involves selecting a
representative sample of data from the population for
analysis. This sample should accurately reflect the
characteristics of the entire dataset.
• Activities:
• Define sampling criteria based on project goals and data
characteristics.
• Randomly select samples from the dataset using appropriate
sampling techniques.
• Ensure the sample size is sufficient for meaningful analysis.
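A minimal sketch of drawing a simple random sample, assuming the dataset fits in a Python list. The population, sample size, and seed are illustrative choices, not from the slides:

```python
import random

def draw_sample(population, size, seed=42):
    """Draw a simple random sample without replacement.

    Fixing the seed makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    return rng.sample(population, size)

dataset = list(range(1000))        # stand-in for the full population
sample = draw_sample(dataset, 100)
print(len(sample))                 # 100 records, no duplicates
```

Seeding matters in practice: a reproducible sample lets the later Explore and Model phases be rerun on exactly the same data.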
Explore
• Objective: In this phase, the selected data sample is explored to
understand its characteristics, identify patterns, and gain
insights into potential relationships between variables.
• Activities:
• Perform descriptive statistics to summarize data distribution,
central tendencies, and variability.
• Visualize data through charts, graphs, and plots to identify
trends and outliers.
• Conduct preliminary data analysis to identify potential data
quality issues or missing values.
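The descriptive statistics mentioned above can be computed with Python's standard statistics module; the input values here are invented for illustration:

```python
import statistics

def describe(values):
    """Summarize distribution, central tendency, and variability."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

print(describe([4, 8, 15, 16, 23, 42]))
```

A gap between mean and median, or extreme min/max values, is often the first hint of skew or outliers worth plotting.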
Modify
• Objective: The modify phase focuses on data preparation
and preprocessing to ensure data quality and suitability for
modeling.
• Activities:
• Clean the data by handling missing values, outliers, and
inconsistencies.
• Transform variables as needed, such as normalization,
scaling, or encoding categorical variables.
• Select relevant features or variables that are most
predictive for the modeling phase.
Model
• Objective: The modeling phase involves building and validating
predictive or descriptive models using statistical or machine
learning techniques.
• Activities:
• Select appropriate modeling techniques based on project
objectives and data characteristics (e.g., regression,
classification, clustering).
• Train models using the prepared dataset and evaluate model
performance using appropriate metrics.
• Iteratively refine models by tuning parameters and assessing
model robustness and generalization.
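As one small example of the modeling phase, a univariate linear regression can be fit in closed form by ordinary least squares. The training points are constructed to lie exactly on a known line so the result is easy to check:

```python
def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Training data lying exactly on y = 2x + 1.
xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```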
Assess
• Objective: The assess phase evaluates the performance and
effectiveness of the models developed in the previous phase.
• Activities:
• Assess model performance using evaluation metrics (e.g.,
accuracy, precision, recall, F1-score for classification; RMSE,
MAE for regression).
• Validate models by testing against new or unseen data to ensure
they generalize well.
• Interpret and communicate results to stakeholders, including
recommendations based on model findings.
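The classification metrics named above (accuracy, precision, recall, F1) can be computed directly from pairs of labels; the actual/predicted vectors below are invented for illustration:

```python
def classification_metrics(actual, predicted):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
print(classification_metrics(actual, predicted))
```

Note that these should be computed on held-out data, per the validation bullet above, not on the training set.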
Key Benefits
• Structured Approach: SEMMA provides a systematic and
structured approach to data mining projects, guiding
practitioners through each essential phase from data
sampling to model assessment.
• Iterative Process: It allows for iterative refinement and
improvement of models based on insights gained during
exploration and assessment phases.
• Clear Methodological Steps: Each phase in SEMMA is
clearly defined, facilitating easier communication and
collaboration among team members involved in the project.
CRISP-DM
• CRISP-DM (Cross-Industry Standard Process for Data
Mining) is a widely used methodology for data mining and
data science projects.
• It provides a structured approach to planning and executing
data mining projects, ensuring that the results are relevant
and actionable. CRISP-DM is composed of six major phases:
Business Understanding:
• Focuses on understanding the project objectives and
requirements from a business perspective.
• Determines the business objectives, assesses the situation,
defines the data mining goals, and produces a project plan.
Data Understanding:
• Involves collecting initial data, describing the data,
exploring the data, and verifying data quality.
• Helps in understanding the data’s characteristics and
identifying any data quality issues.
Data Preparation:
• Covers all activities needed to construct the final dataset from
the initial raw data.
• Includes tasks such as selecting data, cleaning data, constructing
data, integrating data, and formatting data.
Modeling:
• Involves selecting modeling techniques, generating test design,
building models, and assessing models.
• Different modeling techniques may require different data formats
and assumptions.
Evaluation:
• Assesses the model thoroughly to ensure it meets the business
objectives.
• Involves reviewing the steps executed, ensuring that the model is
achieving the intended goals, and deciding on the next steps.
Deployment
• Involves deploying the model into the operational environment
for use.
• Can include generating reports, implementing the model within an
application, or creating a repeatable data mining process for
ongoing use.

Each phase in CRISP-DM is iterative, and the process often
requires revisiting previous phases as new insights are gained and
requirements evolve.
This flexibility and structured approach make CRISP-DM a popular
choice in the data science community.
4. Team Data Science Process (TDSP)
• If you combine Scrum and CRISP-DM, you will get something
that looks like Microsoft’s Team Data Science Process.
• Launched in 2016, TDSP is "an agile, iterative data science
methodology to deliver predictive analytics solutions and
intelligent applications efficiently" (Microsoft, 2020).
• This is a modern data science process that combines
elements of the core data science life cycle, software
engineering, and Agile processes.
TDSP Components
• TDSP has four main components:
• A data science lifecycle definition
• A standardized project structure
• Recommended infrastructure and resources
• Recommended tools and utilities
TDSP Life Cycle
• Although the lifecycle graphic looks quite different, TDSP's
project lifecycle is similar to CRISP-DM and includes five
iterative stages:
1. Business Understanding: define objectives and identify data sources
2. Data Acquisition and Understanding: ingest data and determine if it
can answer the presenting question (effectively combines Data
Understanding and Data Cleaning from CRISP-DM)
3. Modeling: feature engineering and model training
(combines Modeling and Evaluation)
4. Deployment: deploy into a production environment
5. Customer Acceptance: customer validation that the system meets
business needs (a phase not explicitly covered by CRISP-DM)
Evaluation
• Pros
• Agile: Emphasizes the need for incremental deliverables.
• Familiar: The product backlog, features, user stories, bugs, Git
versioning, and sprint planning are familiar to those used to common
software practices.
• Data Science Native: TDSP acknowledges that data science and
software engineering are different, and is built for data science teams
working on production-bound projects.
• Flexible: TDSP can be implemented as it is defined or in conjunction with
other approaches such as CRISP-DM.
• Thorough: Because of its rich team focus and detailed documentation,
TDSP is arguably the most mature CRISP-derived project management
approach. It is conceptually similar to Domino Data Lab’s Lifecycle but is
more detailed.
• Free Templates: Go to Microsoft Azure’s GitHub repository to get
started.
Cons
• Fixed Sprints: TDSP leverages fixed-length planning sprints,
which many data scientists struggle with.
• Some Inconsistencies: Not all of Microsoft's documentation is
consistent.

• TDSP is a good option for data science teams who aspire to
deliver production-level data science products.
• It may not be appropriate for one-person data science teams or
for projects without a production goal.
