S2 - Data Science Lifecycle
Collect data / Dataset
Model selection, training and refining
Other tasks
BUSINESS UNDERSTANDING
Clearly define project objectives and translate them into KPIs and
success metrics.
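As an illustration of turning an objective into success metrics, here is a minimal sketch. It assumes a hypothetical spam-filter objective; the target values in TARGETS and the sample labels are invented for the example and do not come from a real project.

```python
# Hypothetical objective: "block spam without hiding legitimate mail".
# The targets below are illustrative assumptions, not recommended values.
TARGETS = {"precision": 0.95, "recall": 0.75}

def precision_recall(actual, predicted):
    """Compute precision and recall for binary labels (1 = spam, 0 = not spam)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # spam correctly flagged
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # legitimate mail flagged
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # spam that slipped through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

# Made-up evaluation labels for ten messages.
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

metrics = precision_recall(actual, predicted)
# A KPI is met when the measured value reaches its target.
kpi_met = {name: metrics[name] >= target for name, target in TARGETS.items()}

for name in TARGETS:
    print(f"{name}: {metrics[name]:.2f} (target {TARGETS[name]:.2f}) met={kpi_met[name]}")
```

Writing the targets down before modeling starts gives the team an agreed test for whether the project has succeeded.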
SOME COMMON DATA SCIENCE PROJECT OBJECTIVES
Prediction (predict a value based on inputs)
Classification (e.g., spam or not spam)
Recommendations (e.g., Amazon and Netflix recommendations)
Pattern detection and grouping (e.g., clustering, which groups items
without known classes)
Anomaly detection (e.g., fraud detection)
Recognition (image, text, audio, video, facial, …)
Actionable insights (via dashboards, reports, visualizations, …)
Automated processes and decision-making (e.g., credit card approval)
Scoring and ranking (e.g., FICO score)
Segmentation (e.g., demographic-based marketing)
Optimization (e.g., risk management)
Forecasts (e.g., sales and revenue)
DATA ACQUISITION
The team typically performs the following activities:
Identify data sources: Make a list of data sources the team may need to test the
initial hypotheses outlined in this phase.
Make an inventory of the datasets currently available and those that can be
purchased or otherwise acquired for the tests the team wants to perform.
Capture aggregate data sources: This is for previewing the data and providing high-
level understanding.
It enables the team to gain a quick overview of the data and perform further
exploration on specific areas.
Review the raw data: Begin understanding the interdependencies among the data
attributes.
Become familiar with the content of the data, its quality, and its limitations.
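The preview and review activities above can be sketched as a small profiling pass over the raw records. This is only an illustration: the rows and column names are hypothetical, and a real project would more likely use a data-frame library than plain dictionaries.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
rows = [
    {"age": 34, "city": "Lyon", "income": 42000.0},
    {"age": None, "city": "Paris", "income": 58000.0},
    {"age": 51, "city": "Paris", "income": None},
    {"age": 29, "city": None, "income": 39000.0},
]

def profile(rows):
    """Summarize each column: missing count, value types, and numeric range."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "missing": len(values) - len(present),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
        # Add range statistics only when every present value is numeric.
        if present and all(isinstance(v, (int, float)) for v in present):
            info["min"] = min(present)
            info["max"] = max(present)
            info["mean"] = sum(present) / len(present)
        summary[col] = info
    return summary

report = profile(rows)
for col, info in report.items():
    print(col, info)
```

A summary like this gives a quick view of data quality (missing values, mixed types) before deeper exploration of specific attributes.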
DATA ACQUISITION (CONTINUED)
• Feedback system
• Data Virtualization
Evaluate the data structures and tools needed: The data type
and structure dictate which tools the team can use to analyze the data.
Split the input data randomly into a training dataset and a test
dataset for modeling.
Evaluate the models on both the training and the test datasets. Run a
series of competing machine-learning algorithms, each with its
associated tuning parameters (known as a parameter sweep), geared toward
answering the question of interest with the current data.
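A minimal sketch of the random split and a parameter sweep follows. It makes several assumptions: the dataset is synthetic, the only algorithm is a hand-written k-nearest-neighbours classifier, and the only tuning parameter swept is k. A real sweep would compare several algorithm families, usually with a library such as scikit-learn.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

def accuracy(train, data, k):
    """Fraction of records in data that the k-NN model labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

# Synthetic one-feature dataset: label is 1 when x > 0.5, with 10% label noise.
rng = random.Random(42)
records = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    records.append((x, label))

train, test = train_test_split(records)

# Parameter sweep: score every candidate k on both datasets.
results = {}
for k in (1, 3, 5, 9, 15):
    results[k] = (accuracy(train, train, k), accuracy(train, test, k))
    print(f"k={k:2d} train={results[k][0]:.2f} test={results[k][1]:.2f}")

best_k = max(results, key=lambda k: results[k][1])
print("best k on the test set:", best_k)
```

Comparing the two scores for each setting shows overfitting: a model that is near perfect on the training data but weaker on the test data has memorized noise. Note that picking the winner on the test set makes its test score optimistic, so in practice the choice is usually made on a separate validation set or with cross-validation.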
CREATE YOUR MODEL & EVALUATE
After you have a set of models that perform well, you can operationalize
them by exposing them through APIs or other interfaces that various
applications can consume, such as:
• Online websites
• Spreadsheets
• Dashboards
• Line-of-business applications
• Back-end applications
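One way to expose a model to consumers like these is a small HTTP scoring endpoint. The sketch below uses only the Python standard library. The "model" is a hypothetical fixed linear score, and the feature names (amount, num_prior_claims), the weights, and the /predict path are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical stand-in for a trained model: a fixed linear score.
WEIGHTS = {"amount": 0.002, "num_prior_claims": 0.3}
BIAS = -1.0

def predict(features):
    """Return a score and a yes/no decision for one feature dictionary."""
    score = BIAS + sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)
    return {"score": round(score, 4), "flagged": score > 0}

def score_request(body):
    """Decode a JSON request body, score it, and return (status, JSON bytes)."""
    try:
        result = predict(json.loads(body))
        return 200, json.dumps(result).encode("utf-8")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON or missing/invalid features become a client error.
        return 400, json.dumps({"error": str(exc)}).encode("utf-8")

class PredictHandler(BaseHTTPRequestHandler):
    """POST /predict with a JSON body returns the model output as JSON."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = score_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# Exercise the scoring logic directly, without starting a server.
status, payload = score_request(b'{"amount": 800, "num_prior_claims": 2}')
print(status, payload.decode("utf-8"))
```

To serve it locally, pass the handler to http.server.HTTPServer, for example HTTPServer(("127.0.0.1", 8000), PredictHandler).serve_forever(). Keeping score_request separate from the handler lets a website, dashboard, or back-end job reuse the same logic behind a different interface.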