
Lecture 8 - Lifecycle of A Data Science Project - Part 2

The document discusses the CRISP-DM methodology for data science projects. It outlines the key phases of CRISP-DM including business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It then provides an example use case of applying CRISP-DM to analyze grid loss data for an energy company. The document highlights some of the challenges with the grid loss data and discusses strategies for testing models and monitoring performance in production.


TDT4259 – Applied Data Science

Lecture 8: Lifecycle of a data science project II


Nisha Dalal
Adj. Associate Professor

[email protected]
CRISP-DM: with a use case
3

What is CRISP-DM
Cross-industry standard process for data mining - CRISP-DM

• An open standard developed in 1996 by leading companies in data analysis
• It is still the most popular methodology for data-centric projects
• It is an agile method that introduces almost no overhead and emphasizes adaptive transitions between project phases

Source
4

What is CRISP-DM
Cross-industry standard process for data mining - CRISP-DM

[CRISP-DM diagram with an added "Maintenance and monitoring" phase]

• An open standard developed in 1996 by leading companies in data analysis
• It is still the most popular methodology for data-centric projects
• It is an agile method that introduces almost no overhead and emphasizes adaptive transitions between project phases
5

Aneo: Grid loss data


• Grid load
• Total amount of electric energy in the grid
• Grid loss
• Difference between the electricity produced by the power plants (the load) and the electricity sold to customers
• Grid load = consumption by customers + grid loss (see the sketch below)
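
To make the relationship concrete, here is a minimal Python sketch that derives grid loss from the two measured quantities. The column names are hypothetical and not taken from the Aneo dataset.

```python
import pandas as pd

def add_grid_loss(df: pd.DataFrame) -> pd.DataFrame:
    """Assumes hourly rows with hypothetical 'grid_load' and 'consumption' columns (MWh)."""
    out = df.copy()
    # Grid load = consumption by customers + grid loss  =>  grid loss = grid load - consumption
    out["grid_loss"] = out["grid_load"] - out["consumption"]
    # Loss as a share of total load is a convenient quantity to track over time.
    out["loss_share"] = out["grid_loss"] / out["grid_load"]
    return out
```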
6

Problems: Grid loss data


• Delayed measurements
• Missing values
• Incorrect values
• Changing values
• Small datasets
• Missing features
• Not grid-specific
• Manual retraining
• Manual and subjective alterations
• Lack of monitoring infrastructure
• Poor scalability
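
A few of these issues (missing values, incorrect values, delayed measurements) can be surfaced with simple automated checks. The sketch below is illustrative only; the column names, the lag threshold, and the tz-aware UTC index are assumptions, not details from the Aneo setup.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame, max_lag_hours: float = 24.0) -> dict:
    """Rough data-quality summary for an hourly frame with a tz-aware UTC DatetimeIndex."""
    report = {
        # Missing values per column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Incorrect values: grid loss should normally not be negative.
        "negative_grid_loss_rows": int((df["grid_loss"] < 0).sum()),
        # Delayed measurements: hours since the newest available timestamp.
        "hours_since_last_value": (pd.Timestamp.now(tz="UTC") - df.index.max()).total_seconds() / 3600.0,
    }
    report["delayed"] = report["hours_since_last_value"] > max_lag_hours
    return report
```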
7

Deployment
• Pre-deployment
• Testing online
• Monitoring and logging
• Active feedback
• On-call responsibilities
• Set aside enough time for this phase

[CRISP-DM diagram highlighting the added "Maintenance and monitoring" phase]
8

Deployment
• Three Vs of MLOps: Velocity, Validation and Version

[CRISP-DM diagram highlighting the added "Maintenance and monitoring" phase]
9

Experimental and Deployable Code


• Experimental code
• Fast (high velocity)
• Easy to adjust and parameterize
• Quick fixes
• Strict evaluation
• Development environment

• Deployable code
• Robust
• Standardized code quality
• Hard to make unintended changes
• Infrastructure constraints
• Easy to maintain
• Well tested
• Production environment
10

Experimental --> Deployable Code


• Quick fixes --> Robust codebase
• Code reviews
• Performance reviews
• Business metrics alignment
• Documentation (experimental and production both)
• Package reusable code
• Tests (a small sketch follows this list)
• Deployment (in phases)
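
As a small illustration of "package reusable code" and "tests", the sketch below shows a helper that could be moved out of a notebook into a shared module, together with a pytest-style test. The function and its behaviour are hypothetical examples, not Aneo's actual code.

```python
# A minimal sketch: a reusable helper plus a pytest-style test that guards it.
import pandas as pd

def build_lag_features(series: pd.Series, lags=(1, 2)) -> pd.DataFrame:
    """Build lagged copies of a series so a model only ever sees past values."""
    return pd.DataFrame({f"lag_{k}": series.shift(k) for k in lags})

def test_lag_features_do_not_leak_future_values():
    s = pd.Series([1.0, 2.0, 3.0, 4.0])
    feats = build_lag_features(s, lags=(1, 2))
    assert feats.loc[2, "lag_1"] == 2.0      # value from one step earlier
    assert feats.loc[3, "lag_2"] == 2.0      # value from two steps earlier
    assert feats["lag_2"].isna().sum() == 2  # not enough history for the first rows
```

Once the helper lives in an importable package, running `pytest` in CI keeps the deployable version from silently regressing when the experimental code changes.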
11

Production readiness
12

Traditional software testing

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
13

Data Science pipeline testing

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
14

Data Science testing


• Tests for Data and Features
• Tests for Model development
• Tests for Infrastructure
• Tests for Monitoring

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
15

Data and feature tests

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
16

Model development tests

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
17

Infrastructure tests

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
18

Maintenance and Monitoring

[CRISP-DM diagram highlighting the added "Maintenance and monitoring" phase]
19

Monitoring
• By definition, your system is making predictions on previously unseen data
• Crucial to know that the system continues to work correctly over time
• Using dashboards displaying relevant graphs and statistics
• Monitoring the system, pipelines and input data
• Alerting the team when metrics deviate significantly from expectations (a minimal example follows)
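
Here is one way such an alert could look: compare a recent error metric against a baseline period and trigger when it deviates by more than a tolerance factor. The metric choice and threshold are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def should_alert(recent_errors: np.ndarray, baseline_errors: np.ndarray,
                 tolerance: float = 1.5) -> bool:
    """Alert if the recent mean absolute error exceeds the baseline by `tolerance`x."""
    recent_mae = float(np.mean(np.abs(recent_errors)))
    baseline_mae = float(np.mean(np.abs(baseline_errors)))
    return recent_mae > tolerance * baseline_mae
```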
20

Monitoring

Source
21

Monitoring
22

Handling A Spectrum of Data Errors


• Hard errors are obvious and result in clearly “bad predictions”, such as when mixing or
swapping columns or when violating constraints (e.g., a negative age).
• Soft errors, such as a few null-valued features in a data point, are less pernicious and can
still yield reasonable predictions, making them hard to catch and quantify.
• Drift errors occur when the live data is from a seemingly different distribution than the
training set; these happen relatively slowly over time.

Shankar et al, 2022
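
Drift errors in particular can be flagged by comparing live feature values against the training distribution, for example with a two-sample Kolmogorov-Smirnov test. This is a generic sketch rather than the approach used in the grid loss project, and the significance level is a judgment call.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values: np.ndarray, live_values: np.ndarray,
                    alpha: float = 0.01) -> bool:
    """True if the live sample is unlikely to come from the training distribution."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha
```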


23

Monitoring tests

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
24

Alert fatigue
• A surplus of false-positive alerts leads to fatigue and to alerts being silenced, which can cause real performance drops to be missed.

“Recently we've noticed that some of these alerts have been rather noisy and not
necessarily reflective of events that we care about triaging and fixing. So we've recently
taken a close look at those alerts and are trying to figure out, how can we more precisely
specify that query such that it's only highlighting the problematic events?”

“You typically ignore most alerts...I guess on record I'd say 90% of them aren't immediate.
You just have to acknowledge them internally, like just be aware that there is something
happening.”

Shankar et al, 2022


25

Maintenance

Source
26

Maintenance

Source
27

Grid loss data: Problems


• Not grid-specific
• Manual retraining
• Manual and subjective alterations
• Lack of monitoring infrastructure
• Poor scalability
• Delayed measurements
• Missing values
• Incorrect values
• Changing values
• Small datasets
• Missing features
• Performance
28

Data and Publications

• https://www.kaggle.com/trnderenergikraft/grid-loss-time-series-dataset
29

When things don’t work: Aneo grid loss


30

When things don’t work: Aneo grid loss


31

When things don’t work


• Data availability

• Feature relevancy changes

• Consumer behavior changes

• Market changes

• Technology updates

• Hardware/Software updates

• And 100s more


Learnings from experience
33

Simplicity
Simplicity is an advantage but sadly, complexity sells better (Source)

• Complexity signals effort, mastery and innovation

BUT simple ideas and features

• Easier to understand, use and trust

• Easier to build and scale

• Easier to maintain and fix

• Have lower operational costs, mostly


34

Rules of ML/ Data Science


• Rule #1: Don’t be afraid to launch a product without machine learning.

• Rule #3: Choose machine learning over a complex heuristic.

• Rule #4: Keep the first model simple and get the infrastructure right.

• Rule #5: Test the infrastructure independently from the machine learning.

• Rule #10: Watch for silent failures.

• Rule #26: Look for patterns in the measured errors, and create new features.

Source
35

Good project ideas start with collaborators

“We really think it's important to bridge that gap


between what's often, you know, a (subject matter
expert) in one room annotating and then handing
things over the wire to a data scientist—a scene
where you have no communication. So we make sure
there's both data science and subject matter
expertise representation (on our teams).”

Shankar et al, 2022


36

Spread a deployment across multiple stages

“In (the large companies I've worked at), when we


deploy code it goes through what's called a staged
deployment process, where we have designated test
clusters, (stage 1) clusters, (stage 2) clusters, then
the global deployment (to all users). The idea here is
you deploy increasingly along these clusters, so that
you catch problems before they've met customers.”

Shankar et al, 2022


37

ML evaluation metrics should be tied to product metrics

“Tying (model performance) to the business's KPIs


(key performance indicators) is really important. But
it's a process—you need to figure out what (the KPIs)
are, and frankly I think that's how people should be
doing AI. It (shouldn't be) like: hey, let's do these
experiments and get cool numbers and show off
these nice precision-recall curves to our bosses and
call it a day. It should be like: hey, let's actually show
the same business metrics that everyone else is held
accountable to our bosses at the end of the day.”

Shankar et al, 2022


38

Don't keep your GPUs warm

“One thing that I've noticed is, especially when you have as many resources as
large companies do, that there's a compulsive need to leverage all the
resources that you have. And just, you know, get all the experiments out there.
Come up with a bunch of ideas; run a bunch of stuff. I actually think that's bad.
You can be overly concerned with keeping your GPUs warm, so much so that
you don't actually think deeply about what the highest value experiment is. I
think you can end up saving a lot more time—and obviously GPU cycles, but
mostly end-to-end completion time—if you spend more efforts choosing the
right experiment to run instead of spreading yourself thin. All these different
experiments have their own frontier to explore, and all these frontiers have
different options. I basically will only do the most important thing from each
project's frontier at a given time, and I found that the net throughput for
myself has been much higher.”

Shankar et al, 2022


39

Important to know!
• Communication is key (stakeholders, management, domain experts, end users, data scientists/engineers).

• Start from the problem (not the tech)

• Choosing the right problem is half the battle won.

• Model performance depends less on the model than on the data we feed it.

• It is important not to dive right in; first think about the problem and get feedback from experts.

• The best-predicting model might not be the best value-creating model.

• Put more emphasis on the model evaluation system than on individual models.

• Important to test the system on three fronts: traditional software, the data science pipeline, and value creation.

• An imperfect deployed system is more valuable than a perfect undeployed system (80-20 rule).
40

Resources
41

Important Deadlines
When you will need to deliver or complete a task

1 20/9 Register yourself/group and the company/dataset for group assignment

2 30/10 Deliver individual assignment

3 27/11 Deliver presentation and report for group assignment


42

Lecture Plan
Unpacking the course syllabus

1   23/8   Lecture 1: Introduction [Nisha Dalal]
2   30/8   Lecture 2: Presentation of datasets [Nisha Dalal]
3   6/9    Lecture 3: Crash course in machine learning [Kshitij Sharma]
4   13/9   Lecture 4: Data analysis with low or no-code tools [Nisha Dalal]
5   20/9   No lecture
6   27/9   Lecture 5: Lifecycle of a Data Science project I [Nisha Dalal]
7   4/10   No lecture
8   11/10  Lecture 6: Data Visualization & Storytelling [Manos Papagiannidis]
9   18/10  Lecture 7: Data Science in the time of ChatGPT [Pikakshi Manchanda]
10  25/10  Lecture 8: Lifecycle of a Data Science project II [Nisha Dalal]
11  1/11   Lecture 9: Decision making with data science [Nisha Dalal]
12  8/11   Lecture 10: Experiences from Industry [Thomas Thorensen]
13  15/11  Course finish
43

Questions & Discussion

Nisha Dalal
[email protected]
