
Lecture 5 - Lifecycle of A Data Science Project

The document discusses the lifecycle of a data science project using the CRISP-DM methodology. It covers the business understanding and data understanding phases of CRISP-DM, including understanding the problem to be solved, documenting goals and scope, exploring and assessing data quality. It also provides an example use case of predicting grid losses using a dataset on grid load, weather forecasts, and other features.

Uploaded by

Giorgio Aduso
TDT4259 – Applied Data Science

Lecture 5: Lifecycle of a data science project


Nisha Dalal
Adj. Associate Professor

[email protected]
2

But first,
• Check your groups
• Contact your team members
• Decide datasets
• Discuss contributions
• Email from TAs
• Language preferences
• Schedule group meeting with TAs (preferably after)
• Some changes in the scoring scheme for group assignment

• Questions on Slack
3

Reference groups and feedback

We are looking for 5-8 students to form the reference group. The purpose of the reference group is to provide
constructive feedback about the course through an ongoing, open dialogue with other students throughout the semester.
You can read more about the tasks of the reference group in this link.

If you want to sign up to be a member of the reference group, use this link.

A survey to evaluate the course will be sent out to everyone during the last week.
CRISP-DM: with a use case
5

Aneo: Grid loss data


• Grid load
• Total amount of electric energy in the grid
• Grid loss
• Difference in electricity between what has been produced by the power plants (or load) and what has been sold to the customers
• Grid load = consumption by customers + grid loss
6

What is CRISP-DM
Cross-industry standard process for data mining (CRISP-DM)

• An open standard developed in 1996 by leading companies in data analysis

• It is still the most popular methodology for data-centric projects

• It is an agile method that introduces almost no overhead and emphasizes adaptive transitions between project phases

Source
8

What is CRISP-DM

[Figure: CRISP-DM process diagram, including a "Maintenance and monitoring" phase]
9

Business Understanding
• Initially, it is vital to understand the problem to be solved

• This may seem obvious, but business projects seldom come pre-packaged as clear and unambiguous data science problems

• The design team should think carefully about the problem to be solved and about the use scenario

• Learn to concretize or even reduce the scope of the initial idea
10

Documentation
• One Pager
• Design document

• High-level documents explaining the overall goal
• Quick feedback from the stakeholders and data scientists/engineers
• Different documents for different audiences
• Enough information to make decisions and provide feedback
• Everyone on the same page
• Easier to scope the project
• Provides clarity and avoids getting into the perfection rabbit hole
• Makes project planning easier
11

Data analytics
Examining data to answer questions, identify trends, and extract insights.
12

Types of data analytics


13

Descriptive analysis
• Pull trends from raw data and describe them succinctly.

• Focus on: What happened, or what is currently happening?
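
As a small illustration of describing raw data succinctly, the sketch below summarizes a handful of grid-load readings per day. The readings, the units, and the `describe` helper are made up for this example and are not taken from the Aneo dataset.

```python
from statistics import mean

# Hypothetical grid-load readings in MWh, keyed by day (made-up values).
readings = {
    "Mon": [310.0, 295.5, 402.0, 388.5],
    "Tue": [305.0, 290.0, 410.0, 395.0],
}


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


# One summary per day: this answers "what happened?", not "why?".
summary = {day: describe(values) for day, values in readings.items()}

for day, stats in summary.items():
    print(day, stats)
```

A summary like this only reports what the data look like; explaining the differences between days belongs to diagnostic analysis.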


14

Descriptive analysis
15

Descriptive analysis
16

Descriptive analysis
17

Diagnostic analysis
• Comparing coexisting trends or movement, uncovering correlations between
variables, and determining causal relationships where possible.

• Focus on: Why did it happen?


18

Diagnostic analysis
19

Diagnostic analysis
• Comparing coexisting trends or movement, uncovering correlations between
variables, and determining causal relationships where possible.

• Focus on: Why did it happen?

• Correlation versus Causation
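
To make the distinction concrete, here is a minimal sketch with made-up numbers in which a third variable (temperature) drives two others. The two driven variables correlate perfectly even though neither causes the other. The variable names and values are assumptions chosen for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical confounder: daily temperature in degrees Celsius.
temperature = [10, 14, 18, 22, 26, 30]

# Both series depend only on temperature, not on each other (made-up rules).
ice_cream_sales = [2 * t + 5 for t in temperature]
sunburn_cases = [3 * t - 20 for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation between sales and sunburn: {r:.3f}")
```

The high correlation here says nothing about ice cream causing sunburn; it only reflects the shared dependence on temperature.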


20

Correlation versus Causation


21

Spurious Correlations
22

Correlation versus Causation


23

Predictive analysis
• Predict future trends and events using the data at hand.

• Focus on: What might happen in the future?


24

Predictive analysis
25

Predictive analysis

Source
26

Prescriptive analysis
• Suggests actionable takeaways considering all possible factors in a scenario

• Focus on: What should we do next?


27

Prescriptive analysis
28

Prescriptive analysis
30

Aneo: Grid loss data


• Grid load
• Total amount of electric energy in the grid
• Grid loss
• Difference in electricity between what has been produced by the power plants (or load) and what has been sold to the customers
• Grid load = consumption by customers + grid loss
• Estimated grid loss = idle loss + k * (expected power consumption)²
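
The two relations on this slide can be written as small functions. This is only a sketch: the slide does not give values for the idle loss or for `k`, so the numbers below are placeholders, and the function and parameter names are chosen for this example.

```python
def estimated_grid_loss(idle_loss, k, expected_consumption):
    """Estimated grid loss = idle loss + k * (expected power consumption)^2."""
    return idle_loss + k * expected_consumption ** 2


def grid_load(consumption, grid_loss):
    """Grid load = consumption by customers + grid loss."""
    return consumption + grid_loss


# Placeholder inputs (assumptions, not values from the lecture or the dataset).
loss = estimated_grid_loss(idle_loss=2.0, k=0.001, expected_consumption=100.0)
load = grid_load(consumption=100.0, grid_loss=loss)

print(f"estimated grid loss: {loss:.2f}")
print(f"grid load: {load:.2f}")
```

Because the loss grows with the square of the expected consumption, doubling consumption roughly quadruples the variable part of the loss while the idle loss stays constant.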
32

Grid loss data: Problems


• Not grid-specific
• Manual retraining
• Manual and subjective alterations
• Lack of monitoring infrastructure
• Poor scalability
33

Data Understanding
• If solving the business problem is the goal, the data comprise the available raw material from which the solution will be built

• Collect initial data
o Existing data
o Purchased data
o Additional data
• Describe data
o Amount of data
o Value types
o Coding schemes
• Explore data
• Verify data quality
o Missing data
o Data errors
o Coding inconsistencies
o Bad metadata
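
The "verify data quality" step can be sketched as a small report over the raw records. The rows, the column names, the allowed codes, and the rule that a negative load counts as a data error are all assumptions made for this example.

```python
# Hypothetical raw records (made-up values and column names).
rows = [
    {"timestamp": "2023-09-01 00:00", "grid_load": 310.0, "holiday": "no"},
    {"timestamp": "2023-09-01 01:00", "grid_load": None, "holiday": "No"},
    {"timestamp": "2023-09-01 02:00", "grid_load": -5.0, "holiday": "N"},
    {"timestamp": "2023-09-01 03:00", "grid_load": 298.0, "holiday": "yes"},
]


def quality_report(rows, allowed_codes=("yes", "no")):
    """Count missing values, suspicious values, and inconsistent codes."""
    missing = sum(1 for r in rows if r["grid_load"] is None)
    # Assumption: a negative grid load is treated as a data error.
    negative = sum(
        1 for r in rows if r["grid_load"] is not None and r["grid_load"] < 0
    )
    # Codes outside the agreed scheme point to coding inconsistencies.
    bad_codes = sorted(
        {r["holiday"] for r in rows if r["holiday"] not in allowed_codes}
    )
    return {
        "rows": len(rows),
        "missing_load": missing,
        "negative_load": negative,
        "unexpected_codes": bad_codes,
    }


report = quality_report(rows)
print(report)
```

A report like this does not fix anything; it only shows where the data preparation phase will have work to do.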
34

Data Understanding

[Figure: CRISP-DM process diagram, including a "Maintenance and monitoring" phase]
36

Aneo: Grid loss data

• https://www.kaggle.com/trnderenergikraft/grid-loss-time-series-dataset
37

Aneo: Grid loss data


1. Grid load:
• Grid load = consumption by customers + grid loss
• Estimated grid loss = idle loss + k * (expected power consumption)²

2. Calendar features

3. Weather forecasts

4. Estimated demand in the area


38

Data Understanding
Dos and don'ts

• Do not economize on this phase
o The earlier you discover issues with your data, the better
o Data understanding leads to domain understanding

• Verify, as far as you can, that your data is correct, complete, coherent, deduplicated, representative, independent and up to date

• Investigate what sort of processing was applied to the raw data

• Understand anomalies and outliers
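
One common way to start looking at outliers is the interquartile-range (IQR) rule. The lecture does not prescribe a particular method, so the sketch below is just one option, applied to made-up measurements.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return values lying more than `factor` IQRs outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one obviously suspicious reading.
measurements = [10, 11, 9, 10, 12, 11, 10, 95]

outliers = iqr_outliers(measurements)
print("flagged for investigation:", outliers)
```

Flagging a value is only the first step: whether it is a sensor error, a delayed measurement, or a real event still has to be investigated with domain knowledge.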


39

Grid loss data: Problems


• Delayed measurements
• Missing values
• Incorrect values
• Changing values
• Small datasets
• Missing features
40

Data Preparation
1. Select data
• Select features

2. Clean data
• Correct data errors
• Make coding consistent
• Fill in or infer missing data

3. Construct data
• Generate derived attributes

4. Integrate data
• Merge information from different sources

5. Format data
• Convert to format convenient for modelling
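
A few of these steps can be sketched on a toy example: filling a missing value (clean), deriving a new attribute (construct), and merging two sources on a shared key (integrate). The data, the forward-fill strategy, and the `load_change` attribute are assumptions for illustration; other strategies may suit a real dataset better.

```python
# Two hypothetical sources keyed by hour of day (made-up values).
load = {"00:00": 310.0, "01:00": None, "02:00": 305.0}
temperature = {"00:00": 12.5, "01:00": 12.0, "02:00": 11.5}


def forward_fill(series):
    """Replace missing values with the most recent known value.

    Assumes the first value in the series is present.
    """
    filled, last = {}, None
    for key, value in series.items():
        if value is None:
            value = last
        filled[key] = value
        last = value
    return filled


clean_load = forward_fill(load)  # clean: fill in missing data

prepared = []
previous = None
for time in sorted(clean_load):
    prepared.append({
        "time": time,
        "load": clean_load[time],
        "temperature": temperature[time],  # integrate: merge second source
        # construct: derived attribute, change since the previous hour
        "load_change": 0.0 if previous is None else clean_load[time] - previous,
    })
    previous = clean_load[time]

for row in prepared:
    print(row)
```

The result is a list of flat records, which is one convenient format for passing on to the modelling phase.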
41

Data Preparation

[Figure: CRISP-DM process diagram, including a "Maintenance and monitoring" phase]
42

Day in the life of a Data Scientist


44

Aneo: Grid loss data


1. Grid load
• Is it possible to predict grid load?

2. Calendar features
• Categorical features
• Encoding?

3. Time series decomposed features
• Prophet based features?
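
On the encoding question for calendar features, two commonly used options are one-hot encoding for the day of the week and a sine/cosine (cyclical) encoding for the hour of the day. The slide leaves the choice open, so the sketch below is an illustration with assumed feature names, not the encoding used in the Aneo project.

```python
import math
from datetime import datetime


def encode_calendar(ts):
    """Encode the weekday as one-hot and the hour as a point on a circle."""
    # One-hot: Monday is index 0, Sunday is index 6.
    weekday_one_hot = [1 if ts.weekday() == i else 0 for i in range(7)]
    # Cyclical: hour 23 ends up close to hour 0, unlike a plain integer.
    angle = 2 * math.pi * ts.hour / 24
    return {
        "weekday_one_hot": weekday_one_hot,
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
    }


# 27 September 2023 was a Wednesday; the time of day is arbitrary.
features = encode_calendar(datetime(2023, 9, 27, 6, 0))
print(features)
```

One-hot encoding treats each weekday as an unrelated category, while the cyclical encoding preserves the fact that late evening and early morning are neighbours in time.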
45

Modelling
1. Select modelling techniques
• Select an algorithm or a model

2. Build the model
• Feature selection
• Hyperparameter optimization
• Training and validation

3. Assess model
• Model performance on test dataset
• Time
• Other Key Performance Indicators (KPIs)
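
Training, validation, and test data need to be kept apart. For time series such as grid load, one common approach is to split chronologically so that the model is never validated or tested on data older than its training data. The lecture does not specify the split, so the fractions and the function below are assumptions.

```python
def chronological_split(rows, train_frac=0.7, val_frac=0.15):
    """Split time-ordered rows into train, validation, and test parts.

    Rows are assumed to be sorted from oldest to newest.
    """
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Stand-in for 20 time-ordered observations (the integers act as timestamps).
series = list(range(20))
train, val, test = chronological_split(series)

print(len(train), len(val), len(test))
```

The validation part supports feature selection and hyperparameter optimization, while the test part is held back for the final assessment of the model.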
46

Aneo: Grid loss data


1. Select modelling techniques
• Select an algorithm or a model

2. Build the model


• Feature selection
• Hyperparameter optimization
• Training and validation

3. Assess model
• Model performance on test dataset
• Time
• Other KPIs
48

Aneo: Grid loss data


Which models to use
• Multi-layer perceptron
• Decision tree regressor
• Gradient boosting regressor ensemble
• CatBoost

What baselines to compare to


• Manual method
• Last week

How much training data to use

Which features to use
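
The "last week" baseline can be sketched as a forecast that simply repeats the value observed one week earlier. The daily values below are made up, and mean absolute error is used here as one possible metric; the lecture does not state which metric or time resolution was used.

```python
def last_week_forecast(series, period=7):
    """Predict each value as the value observed `period` steps earlier."""
    return [series[i - period] for i in range(period, len(series))]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Two weeks of hypothetical daily grid-loss values (made-up numbers).
daily_loss = [10, 12, 11, 13, 12, 8, 7, 11, 12, 12, 14, 12, 9, 7]

predicted = last_week_forecast(daily_loss)
actual = daily_loss[7:]
baseline_mae = mean_absolute_error(actual, predicted)

print(f"last-week baseline MAE: {baseline_mae:.3f}")
```

Any candidate model, such as a gradient boosting regressor, is only worth deploying if it beats simple baselines like this one on the same test period.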


50

Traditional software testing

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
51

Data Science pipeline testing

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
52

Important Deadlines
When you will need to deliver or complete a task

1 20/9 Register yourself/group and the company/dataset for group assignment

2 30/10 Deliver individual assignment

3 27/11 Deliver presentation and report for group assignment


53

Lecture Plan
Unpacking the course syllabus

1 23/8 Lecture 1: Introduction [Nisha Dalal]
2 30/8 Lecture 2: Presentation of datasets [Nisha Dalal]
3 6/9 Lecture 3: Crash course in machine learning [Kshitij Sharma]
4 13/9 Lecture 4: Data analysis with low or no-code tools [Nisha Dalal]
5 20/9 No lecture
6 27/9 Lecture 5: Lifecycle of a Data Science project I [Nisha Dalal]
7 4/10 Lecture 6: Lifecycle of a Data Science project II [Nisha Dalal]
8 11/10 Lecture 7: Data Visualization & Storytelling [Manos Papagiannidis]
9 18/10 Lecture 8: Data Science in the time of ChatGPT [Pikakshi Manchanda]
10 25/10 No lecture
11 1/11 Lecture 9: Experiences from Industry [Thomas Thorensen]
12 8/11 Lecture 10: Decision making with data science [Nisha Dalal]
13 15/11 Course finish
Summer Internship 2024
in the AI and Product Development department of Aneo

Read more and apply here: https://tinyurl.com/2xxh5uhx

We are looking for you who want to contribute to a sustainable future by
applying AI and/or software development in the renewable energy sector!
Where: Trondheim
When: Summer 2024 (7 weeks, dates TBA)
Deadline: November 12th, 2023
55

Questions & Discussion
Nisha Dalal
[email protected]
