
Lecture 8 - Lifecycle of A Data Science Project - Part 2

The document discusses the CRISP-DM methodology for data science projects. It outlines the key phases of CRISP-DM including business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It then provides an example use case of applying CRISP-DM to analyze grid loss data for an energy company. The document highlights some of the challenges with the grid loss data and discusses strategies for testing models and monitoring performance in production.


TDT4259 – Applied Data Science

Lecture 8: Lifecycle of a data science project II


Nisha Dalal
Adj. Associate Professor

[email protected]
CRISP-DM: with a use case
3

What is CRISP-DM
Cross-industry standard process for data mining - CRISP-DM

• An open standard developed in 1996 by leading companies in data analysis
• It is still the most popular methodology for data-centric projects
• It is an agile method that introduces almost no overhead and emphasizes adaptive transitions between project phases

Source
4

What is CRISP-DM
Cross-industry standard process for data mining - CRISP-DM

[CRISP-DM diagram with an added "Maintenance and monitoring" phase]

• An open standard developed in 1996 by leading companies in data analysis
• It is still the most popular methodology for data-centric projects
• It is an agile method that introduces almost no overhead and emphasizes adaptive transitions between project phases
5

Aneo: Grid loss data


• Grid load
• Total amount of electric energy in the grid
• Grid loss
• Difference between the electricity produced by the power plants (the load) and the electricity sold to customers
• Grid load = consumption by customers + grid loss (see the sketch below)
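
To make the relationship concrete, here is a minimal Python sketch that derives grid loss from the two measured quantities. The column names are hypothetical and not taken from the Aneo dataset.

```python
import pandas as pd

def add_grid_loss(df: pd.DataFrame) -> pd.DataFrame:
    """Assumes hourly rows with hypothetical 'grid_load' and 'consumption' columns (MWh)."""
    out = df.copy()
    # Grid load = consumption by customers + grid loss  =>  grid loss = grid load - consumption
    out["grid_loss"] = out["grid_load"] - out["consumption"]
    # Loss as a share of total load is a convenient quantity to track over time.
    out["loss_share"] = out["grid_loss"] / out["grid_load"]
    return out
```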
6

Problems: Grid loss data


• Delayed measurements
• Missing values
• Incorrect values
• Changing values
• Small datasets
• Missing features
• Not grid-specific
• Manual retraining
• Manual and subjective alterations
• Lack of monitoring infrastructure
• Poor scalability
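
A few of these issues (missing values, incorrect values, delayed measurements) can be surfaced with simple automated checks. The sketch below is illustrative only; the column names, the lag threshold, and the tz-aware UTC index are assumptions, not details from the Aneo setup.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame, max_lag_hours: float = 24.0) -> dict:
    """Rough data-quality summary for an hourly frame with a tz-aware UTC DatetimeIndex."""
    report = {
        # Missing values per column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Incorrect values: grid loss should normally not be negative.
        "negative_grid_loss_rows": int((df["grid_loss"] < 0).sum()),
        # Delayed measurements: hours since the newest available timestamp.
        "hours_since_last_value": (pd.Timestamp.now(tz="UTC") - df.index.max()).total_seconds() / 3600.0,
    }
    report["delayed"] = report["hours_since_last_value"] > max_lag_hours
    return report
```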
7

Deployment
• Pre-deployment
• Testing online
• Monitoring and logging
• Active feedback
• On-call responsibilities
• Set aside enough time for this phase

[CRISP-DM diagram highlighting the added "Maintenance and monitoring" phase]
8

Deployment
• Three Vs of MLOps: Velocity, Validation and Version

[CRISP-DM diagram highlighting the added "Maintenance and monitoring" phase]
9

Experimental and Deployable Code


• Experimental code
• Fast (high velocity)
• Easy to adjust and parameterize
• Quick fixes
• Strict evaluation
• Development environment

• Deployable code
• Robust
• Standardized code quality
• Hard to make unintended changes
• Infrastructure constraints
• Easy to maintain
• Well tested
• Production environment
10

Experimental --> Deployable Code


• Quick fixes --> Robust codebase
• Code reviews
• Performance reviews
• Business metrics alignment
• Documentation (experimental and production both)
• Package reusable code
• Tests (a small sketch follows this list)
• Deployment (in phases)
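
As a small illustration of "package reusable code" and "tests", the sketch below shows a helper that could be moved out of a notebook into a shared module, together with a pytest-style test. The function and its behaviour are hypothetical examples, not Aneo's actual code.

```python
# A minimal sketch: a reusable helper plus a pytest-style test that guards it.
import pandas as pd

def build_lag_features(series: pd.Series, lags=(1, 2)) -> pd.DataFrame:
    """Build lagged copies of a series so a model only ever sees past values."""
    return pd.DataFrame({f"lag_{k}": series.shift(k) for k in lags})

def test_lag_features_do_not_leak_future_values():
    s = pd.Series([1.0, 2.0, 3.0, 4.0])
    feats = build_lag_features(s, lags=(1, 2))
    assert feats.loc[2, "lag_1"] == 2.0      # value from one step earlier
    assert feats.loc[3, "lag_2"] == 2.0      # value from two steps earlier
    assert feats["lag_2"].isna().sum() == 2  # not enough history for the first rows
```

Once the helper lives in an importable package, running `pytest` in CI keeps the deployable version from silently regressing when the experimental code changes.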
11

Production readiness
12

Traditional software testing

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
13

Data Science pipeline testing

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
14

Data Science testing


• Tests for Data and Features
• Tests for Model development
• Tests for Infrastructure
• Tests for Monitoring

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
15

Data and feature tests

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
16

Model development tests

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
17

Infrastructure tests

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
18

Maintenance and Monitoring

[CRISP-DM diagram highlighting the added "Maintenance and monitoring" phase]
19

Monitoring
• By definition, your system is making predictions on previously unseen data
• Crucial to know that the system continues to work correctly over time
• Using dashboards displaying relevant graphs and statistics
• Monitoring the system, pipelines and input data
• Alerting the team when metrics deviate significantly from expectations (a minimal example follows)
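
Here is one way such an alert could look: compare a recent error metric against a baseline period and trigger when it deviates by more than a tolerance factor. The metric choice and threshold are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def should_alert(recent_errors: np.ndarray, baseline_errors: np.ndarray,
                 tolerance: float = 1.5) -> bool:
    """Alert if the recent mean absolute error exceeds the baseline by `tolerance`x."""
    recent_mae = float(np.mean(np.abs(recent_errors)))
    baseline_mae = float(np.mean(np.abs(baseline_errors)))
    return recent_mae > tolerance * baseline_mae
```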
20

Monitoring

Source
21

Monitoring
22

Handling A Spectrum of Data Errors


• Hard errors are obvious and result in clearly “bad predictions”, such as when mixing or
swapping columns or when violating constraints (e.g., a negative age).
• Soft errors, such as a few null-valued features in a data point, are less pernicious and can
still yield reasonable predictions, making them hard to catch and quantify.
• Drift errors occur when the live data is from a seemingly different distribution than the
training set; these happen relatively slowly over time.

Shankar et al, 2022
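
Drift errors in particular can be flagged by comparing live feature values against the training distribution, for example with a two-sample Kolmogorov-Smirnov test. This is a generic sketch rather than the approach used in the grid loss project, and the significance level is a judgment call.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values: np.ndarray, live_values: np.ndarray,
                    alpha: float = 0.01) -> bool:
    """True if the live sample is unlikely to come from the training distribution."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha
```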


23

Monitoring tests

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
24

Alert fatigue
• A surplus of false-positive alerts leads to fatigue and to alerts being silenced, which can cause real performance drops to be missed.

“Recently we've noticed that some of these alerts have been rather noisy and not
necessarily reflective of events that we care about triaging and fixing. So we've recently
taken a close look at those alerts and are trying to figure out, how can we more precisely
specify that query such that it's only highlighting the problematic events?”

“You typically ignore most alerts...I guess on record I'd say 90% of them aren't immediate.
You just have to acknowledge them internally, like just be aware that there is something
happening.”

Shankar et al, 2022


25

Maintenance

Source
26

Maintenance

Source
27

Grid loss data: Problems


• Not grid-specific
• Manual retraining
• Manual and subjective alterations
• Lack of monitoring infrastructure
• Poor scalability
• Delayed measurements
• Missing values
• Incorrect values
• Changing values
• Small datasets
• Missing features
• Performance
28

Data and Publications

• https://www.kaggle.com/trnderenergikraft/grid-loss-time-series-dataset
29

When things don’t work: Aneo grid loss


30

When things don’t work: Aneo grid loss


31

When things don’t work


• Data availability

• Feature relevancy changes

• Consumer behavior changes

• Market changes

• Technology updates

• Hardware/Software updates

• And 100s more


Learnings from experience
33

Simplicity
Simplicity is an advantage but sadly, complexity sells better (Source)

• Complexity signals effort, mastery and innovation

BUT simple ideas and features

• Easier to understand, use and trust

• Easier to build and scale

• Easier to maintain and fix

• Have lower operational costs, mostly


34

Rules of ML/ Data Science


• Rule #1: Don’t be afraid to launch a product without machine learning.

• Rule #3: Choose machine learning over a complex heuristic.

• Rule #4: Keep the first model simple and get the infrastructure right.

• Rule #5: Test the infrastructure independently from the machine learning.

• Rule #10: Watch for silent failures.

• Rule #26: Look for patterns in the measured errors, and create new features.

Source
35

Good project ideas start with collaborators

“We really think it's important to bridge that gap


between what's often, you know, a (subject matter
expert) in one room annotating and then handing
things over the wire to a data scientist—a scene
where you have no communication. So we make sure
there's both data science and subject matter
expertise representation (on our teams).”

Shankar et al, 2022


36

Spread a deployment across multiple stages

“In (the large companies I've worked at), when we


deploy code it goes through what's called a staged
deployment process, where we have designated test
clusters, (stage 1) clusters, (stage 2) clusters, then
the global deployment (to all users). The idea here is
you deploy increasingly along these clusters, so that
you catch problems before they've met customers.”

Shankar et al, 2022


37

ML evaluation metrics should be tied to product metrics

“Tying (model performance) to the business's KPIs


(key performance indicators) is really important. But
it's a process—you need to figure out what (the KPIs)
are, and frankly I think that's how people should be
doing AI. It (shouldn't be) like: hey, let's do these
experiments and get cool numbers and show off
these nice precision-recall curves to our bosses and
call it a day. It should be like: hey, let's actually show
the same business metrics that everyone else is held
accountable to our bosses at the end of the day.”

Shankar et al, 2022


38

Don't keep your GPUs warm

“One thing that I've noticed is, especially when you have as many resources as
large companies do, that there's a compulsive need to leverage all the
resources that you have. And just, you know, get all the experiments out there.
Come up with a bunch of ideas; run a bunch of stuff. I actually think that's bad.
You can be overly concerned with keeping your GPUs warm, so much so that
you don't actually think deeply about what the highest value experiment is. I
think you can end up saving a lot more time—and obviously GPU cycles, but
mostly end-to-end completion time—if you spend more efforts choosing the
right experiment to run instead of spreading yourself thin. All these different
experiments have their own frontier to explore, and all these frontiers have
different options. I basically will only do the most important thing from each
project's frontier at a given time, and I found that the net throughput for
myself has been much higher.”

Shankar et al, 2022


39

Important to know!
• Communication is key (stakeholders, management, domain experts, end users, data scientists/engineers).

• Start from the problem (not the tech)

• Choosing the right problem is half the battle won.

• Model performance depends less on the model than on the data we feed it.

• It is important not to dive right in; first think about the problem and get feedback from experts.

• The best-predicting model might not be the best value-creating model.

• Put more emphasis on the model evaluation system than on individual models.

• Important to test the system on three fronts: traditional software, the data science pipeline, and value creation.

• An imperfect deployed system is more valuable than a perfect undeployed system (80-20 rule).
40

Resources
41

Important Deadlines
When you will need to deliver or complete a task

1 20/9 Register yourself/group and the company/dataset for group assignment

2 30/10 Deliver individual assignment

3 27/11 Deliver presentation and report for group assignment


42

Lecture Plan
Unpacking the course syllabus

1   23/8   Lecture 1: Introduction [Nisha Dalal]
2   30/8   Lecture 2: Presentation of datasets [Nisha Dalal]
3   6/9    Lecture 3: Crash course in machine learning [Kshitij Sharma]
4   13/9   Lecture 4: Data analysis with low or no-code tools [Nisha Dalal]
5   20/9   No lecture
6   27/9   Lecture 5: Lifecycle of a Data Science project I [Nisha Dalal]
7   4/10   No lecture
8   11/10  Lecture 6: Data Visualization & Storytelling [Manos Papagiannidis]
9   18/10  Lecture 7: Data Science in the time of ChatGPT [Pikakshi Manchanda]
10  25/10  Lecture 8: Lifecycle of a Data Science project II [Nisha Dalal]
11  1/11   Lecture 9: Decision making with data science [Nisha Dalal]
12  8/11   Lecture 10: Experiences from Industry [Thomas Thorensen]
13  15/11  Course finish
43

Questions & Discussion

Nisha Dalal
[email protected]
