S2 - Data Science Lifecycle

The data science lifecycle involves several interdependent tasks including organizing data, exploring patterns through data mining, selecting and refining models, and other tasks. There is no single workflow that applies to all projects. Business understanding is critical to define objectives and success metrics. Data acquisition identifies sources and evaluates quality. Data preparation explores and conditions data to address issues like outliers. Modelling includes feature engineering, training models, and evaluating performance. Successful models are then deployed for operational use.


Data Science Lifecycle

DATA SCIENCE LIFE CYCLE

People often confuse the lifecycle of a data science project with that of a software engineering project.

Data science is more science than engineering.

There is no one-size-fits-all workflow for data science projects; data scientists have to determine which workflow best fits the business requirements.
DATA SCIENCE LIFE CYCLE

The typical lifecycle of a data science project involves jumping back and forth among various interdependent tasks, using a variety of tools, techniques (mostly statistical methods and formulae), programming, etc.
DATA SCIENTIST EFFORT

[Pie chart: share of a data scientist's effort by task]
 Organize & clean data: 60%
 Collect data / datasets: 19%
 Data mining to draw patterns: 9%
 Model selection, training, and refining: 7%
 Other tasks: 5%
BUSINESS UNDERSTANDING

 The data science team must learn and investigate the business problem,

 develop context and understanding,

 and clearly define project objectives, translating them into KPIs and success metrics.
SOME COMMON DATA SCIENCE PROJECT OBJECTIVES
 Prediction (predict a value based on inputs)
 Classification (e.g., spam or not spam)
 Recommendations (e.g., Amazon and Netflix recommendations)
 Pattern detection and grouping (e.g., classification without known classes)
 Anomaly detection (e.g., fraud detection)
 Recognition (image, text, audio, video, facial, …)
 Actionable insights (via dashboards, reports, visualizations, …)
 Automated processes and decision-making (e.g., credit card approval)
 Scoring and ranking (e.g., FICO score)
 Segmentation (e.g., demographic-based marketing)
 Optimization (e.g., risk management)
 Forecasts (e.g., sales and revenue)
DATA ACQUISITION
 The team typically performs the following activities:

 Identify data sources: Make a list of data sources the team may need to test the initial hypotheses outlined in this phase. Make an inventory of the datasets currently available and those that can be purchased or otherwise acquired for the tests the team wants to perform.

 Capture aggregate data sources: This is for previewing the data and providing a high-level understanding. It enables the team to gain a quick overview of the data and perform further exploration on specific areas.

 Review the raw data: Begin understanding the interdependencies among the data attributes. Become familiar with the content of the data, its quality, and its limitations.
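This first-look review can be sketched in a few lines, assuming pandas is available; the dataset below is made up purely for illustration:

```python
import pandas as pd

# Hypothetical raw dataset standing in for an acquired source.
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29],
    "income": [48000, 52000, 61000, None, 45000],
    "segment": ["A", "B", "B", "A", "C"],
})

# Become familiar with the content, quality, and limitations of the data.
print(df.dtypes)        # attribute types
print(df.isna().sum())  # missing values per attribute
print(df.describe())    # summary statistics for numeric attributes
```

At this stage the goal is orientation, not modelling: types, missing values, and basic distributions reveal quality issues early.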
DATA ACQUISITION

[Diagram: categories of data sources]
 Static: CSV data sets / text files; log data, memory dumps
 Live: feedback systems; sensors, controllers, etc.
 Virtual: data virtualization; caching, storing
DATA ACQUISITION

 Evaluate the data structures and tools needed: The data type and structure dictate which tools the team can use to analyze the data.

 Scope the sort of data infrastructure needed for this type of problem: In addition to the tools needed, the data influences the kind of infrastructure that's required, such as disk storage and network capacity.
DATA PREPARATION

Need for Data Preparation
 Bad or poor-quality data can reduce accuracy and lead to incorrect insights.
 The dataset might contain discrepancies in names or codes.
 The dataset might contain outliers or errors.
 The dataset might lack the attributes of interest for the analysis.
 All in all, the dataset may offer quantity but not quality.
 Gartner: poor-quality data costs the average organization $13.5M per year.

Steps Involved
[Diagram: steps involved in data preparation]
DATA PREPARATION

 Includes steps to explore, preprocess, and condition data
 Create a robust environment – an analytics sandbox
 Data preparation tends to be the most labor-intensive step in the analytics lifecycle, often taking at least 50–60% of the data science project's time
 The data preparation phase is generally the most iterative and the one that data scientists tend to underestimate most often
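A minimal sketch of typical conditioning steps, assuming pandas; the column names, values, and the 1.5×IQR outlier rule are illustrative choices, not the only options:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "U.S.A.", "usa", "UK"],   # name discrepancies
    "sales":   [100.0, 120.0, None, 9_999.0],    # missing value and an outlier
})

# Harmonize inconsistent names/codes.
df["country"] = df["country"].str.upper().str.replace(".", "", regex=False)

# Impute the missing value with the median.
df["sales"] = df["sales"].fillna(df["sales"].median())

# Flag outliers beyond 1.5 * IQR (a common rule of thumb).
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (df["sales"] < q1 - 1.5 * iqr) | (df["sales"] > q3 + 1.5 * iqr)
```

Each of these decisions (imputation strategy, outlier rule) is itself iterative and should be revisited as the analysis matures.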
MODELLING
There are three main tasks addressed in this stage:

 Feature engineering: Create data features from the raw data to facilitate model training.

 Model training: Find the model that answers the question most accurately by comparing success metrics.

 Model evaluation: Determine if your model is suitable for production.

CREATE YOUR MODEL & EVALUATE

Split the input data randomly into a training dataset and a test dataset.

Build the models by using the training dataset.

Evaluate the models on the training and test datasets. Use a series of competing machine-learning algorithms along with the various associated tuning parameters (known as a parameter sweep) that are geared toward answering the question of interest with the current data.

Determine the "best" solution to answer the question by comparing the success metrics between alternative methods.
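The split / train / sweep / compare flow above can be sketched with scikit-learn; the Iris dataset, the two candidate algorithms, and the parameter grids are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Split the input data randomly into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Competing algorithms, each with its own parameter sweep.
candidates = {
    "knn": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5),
    "svm": GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5),
}

# Fit on the training data, then compare success metrics on held-out test data.
scores = {}
for name, search in candidates.items():
    search.fit(X_train, y_train)
    scores[name] = search.score(X_test, y_test)

best = max(scores, key=scores.get)
print(best, scores[best])
```

The held-out test score, not the training score, is what decides the "best" solution, since it estimates performance on unseen data.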
CREATE YOUR MODEL & EVALUATE

Algorithms
• Supervised learning: Naive Bayes, KNN, Support Vector Machines (SVM), Linear Regression
• Unsupervised learning: Principal Component Analysis, K-Means

Evaluation Metrics
• Classification metrics: Accuracy Score, Classification Report, Confusion Matrix
• Regression metrics: Mean Absolute Error, Mean Squared Error, R2 Score
• Clustering metrics: Adjusted Rand Index, Homogeneity, V-measure
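A short sketch of the classification and regression metrics listed above, using scikit-learn; the label and value arrays are made up for illustration:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error, r2_score)

# Classification metrics on hypothetical true/predicted labels.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)   # rows: actual class, cols: predicted class

# Regression metrics on hypothetical actual/predicted values.
actual    = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 3.0, 8.0]
mae = mean_absolute_error(actual, predicted)  # average absolute error
mse = mean_squared_error(actual, predicted)   # average squared error
r2  = r2_score(actual, predicted)             # variance explained (1.0 is perfect)
```

Which family of metrics applies depends on the objective from the business-understanding phase: classification metrics for label prediction, regression metrics for value prediction, clustering metrics when there are no known classes.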
DEPLOYMENT

After you have a set of models that perform well, you can operationalize them through APIs or other interfaces, to be consumed by various applications, such as:

• Online websites
• Spreadsheets
• Dashboards
• Line-of-business applications
• Back-end applications
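A minimal sketch of exposing a model through an HTTP API; Flask is an illustrative choice (any web framework works), and `predict` is a placeholder standing in for a trained model that would normally be loaded at startup (e.g., via joblib):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict(features):
    # Placeholder scoring logic standing in for a trained model.
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})

# Exercise the endpoint with Flask's built-in test client.
client = app.test_client()
resp = client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
print(resp.get_json())
```

Websites, dashboards, and line-of-business applications can then all consume the same endpoint, keeping the model logic in one place.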