S2 - Data Science Lifecycle

The data science lifecycle involves several interdependent tasks including organizing data, exploring patterns through data mining, selecting and refining models, and other tasks. There is no single workflow that applies to all projects. Business understanding is critical to define objectives and success metrics. Data acquisition identifies sources and evaluates quality. Data preparation explores and conditions data to address issues like outliers. Modelling includes feature engineering, training models, and evaluating performance. Successful models are then deployed for operational use.


Data Science Lifecycle

DATA SCIENCE LIFE CYCLE

People often confuse the lifecycle of a data science project with that of a software engineering project.

Data science is more science than engineering.

There is no one-size-fits-all workflow for data science projects; data scientists have to determine which workflow best fits the business requirements.
DATA SCIENCE LIFE CYCLE

The typical lifecycle of a data science project involves jumping back and forth among various interdependent tasks, using a variety of tools, techniques (mostly statistical methods and formulae), programming, etc.
DATA SCIENTIST EFFORT

[Pie chart: share of a data scientist's effort by task]
 Organize & clean data: 60%
 Collect data / datasets: 19%
 Data mining to draw patterns: 9%
 Model selection, training, and refining: 7%
 Other tasks: 5%
BUSINESS UNDERSTANDING

 The data science team must learn and investigate the business problem,

 develop context and understanding,

 and clearly define project objectives, translating them into KPIs and success metrics.
SOME COMMON DATA SCIENCE PROJECT OBJECTIVES
 Prediction (predict a value based on inputs)
 Classification (e.g., spam or not spam)
 Recommendations (e.g., Amazon and Netflix recommendations)
 Pattern detection and grouping (e.g., classification without known classes)
 Anomaly detection (e.g., fraud detection)
 Recognition (image, text, audio, video, facial, …)
 Actionable insights (via dashboards, reports, visualizations, …)
 Automated processes and decision-making (e.g., credit card approval)
 Scoring and ranking (e.g., FICO score)
 Segmentation (e.g., demographic-based marketing)
 Optimization (e.g., risk management)
 Forecasts (e.g., sales and revenue)
DATA ACQUISITION
 The team typically performs the following activities:

 Identify data sources: Make a list of data sources the team may need to test the initial hypotheses outlined in this phase. Make an inventory of the datasets currently available and those that can be purchased or otherwise acquired for the tests the team wants to perform.

 Capture aggregate data sources: This is for previewing the data and providing a high-level understanding. It enables the team to gain a quick overview of the data and perform further exploration on specific areas.

 Review the raw data: Begin understanding the interdependencies among the data attributes. Become familiar with the content of the data, its quality, and its limitations.
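This first-look review can be sketched in a few lines, assuming pandas is available; the dataset below is made up purely for illustration:

```python
import pandas as pd

# Hypothetical raw dataset standing in for an acquired source.
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29],
    "income": [48000, 52000, 61000, None, 45000],
    "segment": ["A", "B", "B", "A", "C"],
})

# Become familiar with the content, quality, and limitations of the data.
print(df.dtypes)        # attribute types
print(df.isna().sum())  # missing values per attribute
print(df.describe())    # summary statistics for numeric attributes
```

At this stage the goal is orientation, not modelling: types, missing values, and basic distributions reveal quality issues early.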
DATA ACQUISITION

[Diagram: categories of data sources]
 Static: CSV data sets / text files; log data, memory dumps
 Live: feedback systems; sensors, controllers, etc.
 Virtual: data virtualization; caching, storing
DATA ACQUISITION

 Evaluate the data structures and tools needed: The data type and structure dictate which tools the team can use to analyze the data.

 Scope the sort of data infrastructure needed for this type of problem: In addition to the tools needed, the data influences the kind of infrastructure that's required, such as disk storage and network capacity.
DATA PREPARATION

Need for Data Preparation
 Bad or poor-quality data can reduce accuracy and lead to incorrect insights.
 The dataset might contain discrepancies in names or codes.
 The dataset might contain outliers or errors.
 The dataset might lack the attributes of interest for the analysis.
 All in all, the dataset may offer quantity but not quality.
 Gartner: poor-quality data costs the average organization $13.5M per year.

Steps Involved
[Diagram: steps involved in data preparation]
DATA PREPARATION

 Includes steps to explore, preprocess, and condition data
 Create a robust environment – an analytics sandbox
 Data preparation tends to be the most labor-intensive step in the analytics lifecycle, often taking at least 50–60% of the data science project's time
 The data preparation phase is generally the most iterative and the one that data scientists tend to underestimate most often
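A minimal sketch of typical conditioning steps, assuming pandas; the column names, values, and the 1.5×IQR outlier rule are illustrative choices, not the only options:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "U.S.A.", "usa", "UK"],   # name discrepancies
    "sales":   [100.0, 120.0, None, 9_999.0],    # missing value and an outlier
})

# Harmonize inconsistent names/codes.
df["country"] = df["country"].str.upper().str.replace(".", "", regex=False)

# Impute the missing value with the median.
df["sales"] = df["sales"].fillna(df["sales"].median())

# Flag outliers beyond 1.5 * IQR (a common rule of thumb).
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (df["sales"] < q1 - 1.5 * iqr) | (df["sales"] > q3 + 1.5 * iqr)
```

Each of these decisions (imputation strategy, outlier rule) is itself iterative and should be revisited as the analysis matures.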
MODELLING
There are three main tasks addressed in this stage:

 Feature engineering: Create data features from the raw data to facilitate model training.

 Model training: Find the model that answers the question most accurately by comparing success metrics.

 Model evaluation: Determine if your model is suitable for production.

CREATE YOUR MODEL & EVALUATE

Split the input data randomly into a training dataset and a test dataset.

Build the models by using the training dataset.

Evaluate the models on the training and test datasets. Use a series of competing machine-learning algorithms along with the various associated tuning parameters (known as a parameter sweep) that are geared toward answering the question of interest with the current data.

Determine the "best" solution to answer the question by comparing the success metrics between alternative methods.
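The split / train / sweep / compare flow above can be sketched with scikit-learn; the Iris dataset, the two candidate algorithms, and the parameter grids are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Split the input data randomly into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Competing algorithms, each with its own parameter sweep.
candidates = {
    "knn": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5),
    "svm": GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5),
}

# Fit on the training data, then compare success metrics on held-out test data.
scores = {}
for name, search in candidates.items():
    search.fit(X_train, y_train)
    scores[name] = search.score(X_test, y_test)

best = max(scores, key=scores.get)
print(best, scores[best])
```

The held-out test score, not the training score, is what decides the "best" solution, since it estimates performance on unseen data.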
CREATE YOUR MODEL & EVALUATE

Algorithms
• Supervised learning: Naive Bayes, KNN, Support Vector Machines (SVM), Linear Regression
• Unsupervised learning: Principal Component Analysis, K-Means

Evaluation Metrics
• Classification metrics: Accuracy Score, Classification Report, Confusion Matrix
• Regression metrics: Mean Absolute Error, Mean Squared Error, R2 Score
• Clustering metrics: Adjusted Rand Index, Homogeneity, V-measure
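A short sketch of the classification and regression metrics listed above, using scikit-learn; the label and value arrays are made up for illustration:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error, r2_score)

# Classification metrics on hypothetical true/predicted labels.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)   # rows: actual class, cols: predicted class

# Regression metrics on hypothetical actual/predicted values.
actual    = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 3.0, 8.0]
mae = mean_absolute_error(actual, predicted)  # average absolute error
mse = mean_squared_error(actual, predicted)   # average squared error
r2  = r2_score(actual, predicted)             # variance explained (1.0 is perfect)
```

Which family of metrics applies depends on the objective from the business-understanding phase: classification metrics for label prediction, regression metrics for value prediction, clustering metrics when there are no known classes.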
DEPLOYMENT

After you have a set of models that perform well, you can operationalize them through APIs or other interfaces, to be consumed by various applications, such as:

• Online websites
• Spreadsheets
• Dashboards
• Line-of-business applications
• Back-end applications
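A minimal sketch of exposing a model through an HTTP API; Flask is an illustrative choice (any web framework works), and `predict` is a placeholder standing in for a trained model that would normally be loaded at startup (e.g., via joblib):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict(features):
    # Placeholder scoring logic standing in for a trained model.
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})

# Exercise the endpoint with Flask's built-in test client.
client = app.test_client()
resp = client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
print(resp.get_json())
```

Websites, dashboards, and line-of-business applications can then all consume the same endpoint, keeping the model logic in one place.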