Week 3 - LAQ
A data science lifecycle describes the iterative steps taken to build, deliver and maintain
a data science product. Not all data science projects are built the same way, so their
lifecycles vary as well. Still, we can picture a general lifecycle that includes the most
common data science steps. A general data science lifecycle uses
machine learning algorithms and statistical practices to produce better prediction
models. Some of the most common steps in the process are
data extraction, preparation, cleansing, modeling and evaluation. The data
science community refers to this general process as the “Cross Industry Standard Process for Data
Mining” (CRISP-DM).
We will go through these steps individually in the subsequent sections and understand
how businesses execute these steps throughout data science projects. But before that,
let us take a look at the data science professionals involved in any data science project.
Domain Expert: Data science projects are applied in different real-world domains or
industries such as banking, healthcare and the petroleum industry. A domain
expert is a person who has experience working in a particular domain and knows
it inside out.
Business analyst: A business analyst is required to understand the business needs
in the identified domain. This person can guide the team in devising the right solution and
its timeline.
Data Scientist: A data scientist is an expert in data science projects, has
experience working with data, and can work out what data is
needed to produce the required solution.
Machine Learning Engineer: A machine learning engineer can advise on which
model to apply to get the desired output and can devise a solution that produces
the correct, required output.
Data Engineer and Architect: Data architects and data engineers are the experts in
the modeling of data. They look after visualization of data for better understanding, as well as
storage and efficient retrieval of data.
1. Problem identification
This is a crucial step in any data science project. The first task is to understand how
data science can be useful in the domain under consideration and to identify
the appropriate tasks for it. Domain experts and data
scientists are the key people in problem identification. The domain expert
has in-depth knowledge of the application domain and of exactly what problem is to be
solved. The data scientist understands the domain and helps identify the problem
and possible solutions to it.
2. Business Understanding
Business understanding means understanding exactly what the customer wants from a business
perspective. Whether the customer wishes to make predictions, improve
sales, minimize losses or optimize a particular process, these wishes form the business
goals. During business understanding, two important steps are followed:
KPI (Key Performance Indicator)
For any data science project, key performance indicators define the performance or
success of the project. There needs to be an agreement between the customer and
the data science project team on the business-related indicators and the related data science
project goals. The business indicators are devised according to the business need, and the
data science project team then decides its goals and indicators accordingly. To better
understand this, consider an example. Suppose the business need is to optimise the
overall spending of the company; the data science goal could then be to use the existing
resources to manage double the number of clients. Defining the key performance indicators is
crucial for any data science project, as the cost of the solution will differ for
different goals.
SLA (Service Level Agreement)
Once the performance indicators are set, finalizing the service level agreement is
important. The terms of the service level agreement are decided according to the business goals.
For example, suppose an airline reservation system requires simultaneous processing of, say, 1000
users. The requirement that the product must satisfy this service level is then part of the
service level agreement.
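A service requirement like the one above can be turned into an automated check. The following is a minimal sketch in Python, not a real load test: handle_reservation is a hypothetical stand-in for the reservation handler, and the 5-second time limit is an assumed SLA term that does not come from the text. Only the figure of 1000 simultaneous users is taken from the example.

```python
import asyncio
import time

REQUIRED_CONCURRENT_USERS = 1000  # figure from the airline example in the text


async def handle_reservation(user_id):
    """Hypothetical stand-in for the real reservation handler."""
    await asyncio.sleep(0.01)  # simulated I/O wait (assumption)
    return True


async def run_load(users):
    """Issue one request per user concurrently and time the whole batch."""
    start = time.perf_counter()
    results = await asyncio.gather(*(handle_reservation(u) for u in range(users)))
    elapsed = time.perf_counter() - start
    return all(results), elapsed


def meets_sla(users=REQUIRED_CONCURRENT_USERS, max_seconds=5.0):
    """Return True if every request succeeds within the assumed time limit."""
    ok, elapsed = asyncio.run(run_load(users))
    return ok and elapsed <= max_seconds
```

Calling meets_sla() returns True or False, so the check could be run before each release to confirm that the agreed service level still holds.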
Once the performance indicators are agreed and the service level agreement is completed,
the project proceeds to the next important step.
3. Collecting Data
Data collection is an important step because it forms the base for achieving the targeted
business goals. Data can flow into the system in various ways, as shown in
figure 2.
Basic data collection can be done using surveys. Generally, the data collected
through surveys provides important insights. Much of the data is collected from the
various processes followed in the enterprise. At various steps data is recorded in
the software systems used in the enterprise, and this data is important for understanding the
process followed from product development to deployment and delivery. The
historical data available through archives is also important for better understanding the
business. Transactional data also plays a vital role, as it is collected on a daily basis.
Many statistical methods are applied to the data to extract important information
related to the business. In a data science project the major role is played by data, so proper
data collection methods are important.
4. Pre-processing data
A large volume of data is collected from archives, daily transactions and intermediate records.
The data is available in various formats and forms. Some data may be available only in
hard copy. The data is scattered across various servers. All
of this data is extracted, converted into a single format and then processed. Typically,
a data warehouse is constructed, in which the Extract, Transform and Load (ETL) process
is carried out. In a data science project this ETL operation is vital.
The data architect plays an important role at this stage, deciding the structure of
the data warehouse and performing the steps of the ETL operations.
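The ETL steps described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not a production pipeline: the two CSV snippets, their column names and the orders table are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical exports from two source systems that use different formats.
SALES_CSV = "order_id,amount,date\n1,100.50,2023-01-05\n2,80.00,2023-01-06\n"
LEGACY_CSV = "id;total;day\n3;45,25;05/01/2023\n"


def extract(text, delimiter):
    """Extract: read the raw rows of one source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform_sales(row):
    """Transform: the sales export only needs its values typed."""
    return {"order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "date": row["date"]}


def transform_legacy(row):
    """Transform: rename columns, fix the decimal comma and the dd/mm/yyyy date."""
    day, month, year = row["day"].split("/")
    return {"order_id": int(row["id"]),
            "amount": float(row["total"].replace(",", ".")),
            "date": f"{year}-{month}-{day}"}


def load(conn, rows):
    """Load: write the unified rows into a single warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER PRIMARY KEY, amount REAL, date TEXT)")
    conn.executemany("INSERT INTO orders VALUES (:order_id, :amount, :date)", rows)
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    rows = [transform_sales(r) for r in extract(SALES_CSV, ",")]
    rows += [transform_legacy(r) for r in extract(LEGACY_CSV, ";")]
    load(conn, rows)
    return conn
```

After run_etl(), both sources sit in one table with one format, so a query such as SELECT SUM(amount) FROM orders works across all of them.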
5. Analyzing data
Now that the data is available in the required format, the next important step
is to understand the data in depth. This understanding comes from analysing the data using
the various statistical tools available. A data engineer plays a vital role in the analysis of
data. This step is also called Exploratory Data Analysis (EDA). Here the data is examined by
applying various statistical functions, and the dependent and independent variables or
features are identified. Careful analysis of the data reveals which data or features are
important and what the spread of the data is. Various plots are used to visualize the data
for better understanding. Tools like Tableau and Power BI are popular for performing
exploratory data analysis and visualization. Knowledge of data science with Python and
R is important for performing EDA on any type of data.
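As a small illustration of the statistical side of EDA, the sketch below summarizes the spread of a feature and measures how strongly an independent variable relates to a dependent one. The ad_spend and sales values are made-up sample data, and the helper names are hypothetical; only the standard statistics module is used.

```python
import statistics

# Hypothetical observations: advertising spend (independent variable)
# and sales (dependent variable).
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]


def summarize(values):
    """Central tendency and spread of a single feature."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)
```

A correlation close to 1 or -1 suggests the feature is worth keeping for modelling, while a value near 0 suggests little linear relationship with the target.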
6. Data Modelling
Data modelling is the next important step once the data is analysed and visualized. The
important components are retained in the dataset, so the data is further refined. Now
the important question is how to model the data and which tasks are suitable for modelling.
Which task is suitable, such as classification or regression, depends on what
business value is required. Within each of these tasks, many ways of modelling are available.
The machine learning engineer applies various algorithms to the data and generates the
output. While modelling the data, the models are often first tested on dummy
data similar to the actual data.
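Testing a model on dummy data first can be sketched as follows. This example assumes a regression task and uses ordinary least squares for a single feature; the slope, intercept and noise level of the dummy data are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import random


def make_dummy_data(n=50, slope=2.0, intercept=5.0, noise=0.5, seed=0):
    """Generate dummy data with a known linear relationship plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs, ys = make_dummy_data()
slope, intercept = fit_line(xs, ys)
```

Because the dummy data was generated with a known slope and intercept, the fitted values can be compared against them to confirm the modelling code behaves sensibly before it sees the actual data.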
8. Model Training
Once the task, the model and the data drift analysis modelling are finalized,
the important next step is to train the model. The training can be done in phases, where
the important parameters can be further fine-tuned to get the required accuracy of
output. In the production phase the model is exposed to the actual data and its output
is monitored.
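Training in phases can be illustrated with plain gradient descent on a one-feature linear model. This is a sketch under assumptions: the learning rates and epoch counts of the two phases are hypothetical values chosen for this toy data set, where the first phase trains coarsely and the second fine-tunes with a smaller step.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_phase(w, b, xs, ys, learning_rate, epochs):
    """Run gradient descent for a fixed number of epochs."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def train_in_phases(xs, ys, phases=((0.05, 200), (0.01, 500))):
    """Train phase by phase and record the loss reached after each phase."""
    w, b = 0.0, 0.0
    history = []
    for learning_rate, epochs in phases:
        w, b = train_phase(w, b, xs, ys, learning_rate, epochs)
        history.append(mse(w, b, xs, ys))
    return w, b, history


# Toy training data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, history = train_in_phases(xs, ys)
```

The history list gives one loss value per phase, which makes it easy to see whether a fine-tuning phase actually improved the output.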
9. Model Deployment
Once the model is trained with the actual data and the parameters are fine-tuned, the model
is deployed. The model is now exposed to real-time data flowing into the system, and
output is generated. The model can be deployed as a web service or embedded
in an edge or mobile application. This is a very important step, as the model is now
exposed to the real world.
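Deployment as a web service can be sketched with Python's built-in http.server module. The model here is a hypothetical linear model with made-up parameters, and the JSON request shape ({"x": ...}) and port 8000 are assumptions; a real deployment would normally use a production-grade server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

WEIGHT, BIAS = 2.0, 1.0  # hypothetical parameters of an already trained model


def predict(x):
    """Apply the trained model to a single input value."""
    return WEIGHT * x + BIAS


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        """Read a JSON body such as {"x": 3} and answer with the prediction."""
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            result = predict(float(payload["x"]))
        except (KeyError, TypeError, ValueError):
            self.send_error(400, "expected a JSON body with a numeric 'x'")
            return
        body = json.dumps({"prediction": result}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet; real services should log requests


def make_server(port=8000):
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Running make_server().serve_forever() starts the service locally, after which any client that can send an HTTP POST request with a JSON body can obtain predictions from the model.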
10. Driving insights and generating BI reports
After the model is deployed in the real world, the next step is to find out how the model
behaves in real-world scenarios. The model is used to get insights that aid strategic
business decisions. The business goals are bound to these insights. Various
reports are generated to see how the business is performing. These reports help in finding out
whether the key performance indicators are achieved.
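A report that checks indicators against their targets can be sketched in a few lines. The KPI names and target values below are hypothetical; in a real project they would come from the indicators agreed with the customer during business understanding.

```python
# Hypothetical KPI targets and the direction in which each one should move.
KPI_TARGETS = {"clients_per_agent": 40.0, "monthly_cost": 100000.0}
HIGHER_IS_BETTER = {"clients_per_agent": True, "monthly_cost": False}


def kpi_report(actuals):
    """Compare measured values against targets; return one row per KPI."""
    rows = []
    for name, target in KPI_TARGETS.items():
        actual = actuals[name]
        met = actual >= target if HIGHER_IS_BETTER[name] else actual <= target
        rows.append((name, actual, target, met))
    return rows


def format_report(rows):
    """Render the rows as plain text for a simple BI report."""
    lines = []
    for name, actual, target, met in rows:
        status = "MET" if met else "MISSED"
        lines.append(f"{name}: actual={actual:.1f} target={target:.1f} {status}")
    return "\n".join(lines)
```

For example, format_report(kpi_report({"clients_per_agent": 42.0, "monthly_cost": 120000.0})) produces one line per indicator, marking the first as met and the second as missed.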