Xii Analytical Approach
After clearly stating a business problem, the data scientist can define the analytic approach to
solving it. Doing so involves expressing the problem in the context of statistical and machine
learning techniques so that the data scientist can identify techniques suitable for achieving the
desired outcome. The right analytic approach depends on the question being asked: once the
problem to be addressed is defined, the approach is selected in the context of the business
requirements.
In the Data Requirements stage, we identify the necessary data content, formats, and
sources for initial data collection. This includes the 5W1H approach (who, what, when, where,
why, and how).
In the Data Collection Stage, data scientists identify the available data resources relevant to
the problem domain.
Once the data collection stage is complete, data scientists use descriptive statistics and
visualization techniques to understand the data better. They explore the dataset to
understand its content and quality and to gain initial insights. Gaps in the data are
identified, and plans are made either to fill them or to make substitutions. Data scientists
also determine whether additional data is needed, both to fill any gaps and to verify the
quality of the data.
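As an illustration of this exploration step, the sketch below computes a few descriptive statistics and counts missing values per column using only the Python standard library. The records, the column names `age` and `income`, and the helper `summarize` are hypothetical, chosen only to show the idea; a real project would more likely use a dedicated data analysis library.

```python
import statistics

# Hypothetical sample records; None marks a missing value (a "gap").
records = [
    {"age": 34, "income": 52000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
    {"age": None, "income": 61000},
    {"age": 52, "income": 75000},
]

def summarize(rows, column):
    """Return descriptive statistics and the missing-value count for one column."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

# One summary per column; the "missing" entry shows where the gaps are.
summary = {col: summarize(records, col) for col in ("age", "income")}
for col, stats in summary.items():
    print(col, stats)
```

A non-zero `missing` count is the kind of gap that then has to be filled, substituted, or covered by collecting more data.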
In the Data Preparation stage, data scientists prepare the data for modeling by cleaning it,
so that it is as free of errors as possible when it is used during modeling.
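A minimal sketch of what such cleaning might look like is given below. The raw rows, the field names `email` and `age`, and the helper `clean` are hypothetical. The choice to drop rows with an unparseable age (rather than fill in a substitute value) is only one possible policy, assumed here for brevity.

```python
# Hypothetical raw rows with typical problems: stray whitespace,
# inconsistent casing, a non-numeric value, an empty value, and a duplicate.
raw_rows = [
    {"email": " Alice@Example.com ", "age": "34"},
    {"email": "bob@example.com", "age": "n/a"},
    {"email": "alice@example.com", "age": "34"},
    {"email": "carol@example.com", "age": ""},
    {"email": "dave@example.com", "age": "41"},
]

def clean(rows):
    """Normalize emails, parse ages, and drop invalid or duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        email = row["email"].strip().lower()
        age_text = row["age"].strip()
        if not age_text.isdigit():
            continue  # assumed policy: drop rows whose age cannot be parsed
        if email in seen:
            continue  # drop duplicates of an already accepted email
        seen.add(email)
        cleaned.append({"email": email, "age": int(age_text)})
    return cleaned

clean_rows = clean(raw_rows)
for row in clean_rows:
    print(row)
```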
Once the data are prepared for the chosen machine learning algorithm, we are ready for modeling.
Modeling focuses on developing models that are either descriptive or predictive, based on the
analytic approach that was chosen, whether statistical or machine learning. Descriptive
modeling is a mathematical process that describes real-world events and the relationships
between the factors responsible for them. For example, a descriptive model might examine
patterns such as: if a person did this, then they are likely to prefer that. Predictive
modeling is a process that uses data mining and probability to forecast outcomes; for
example, a predictive model might be used to determine whether an email is spam or not.
For predictive modeling, data scientists use a training set, which is a set of historical data
in which the outcomes are already known. This step can be repeated several times, refining the
model until it answers the question adequately.
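To make the spam example concrete, the sketch below trains a small naive Bayes classifier on a training set of messages with known outcomes. Naive Bayes is chosen here only as one simple probabilistic technique; the text does not prescribe it. The messages, the labels `spam` and `not_spam`, and the functions `train` and `predict` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical training set: historical messages whose outcome (label) is known.
training_set = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "not_spam"),
    ("lunch on monday with the team", "not_spam"),
    ("project status and agenda", "not_spam"),
]

def train(examples):
    """Count words per label and messages per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: share of training messages carrying this label.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen word/label pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training_set)
print(predict(model, "free prize now"))
print(predict(model, "agenda for monday"))
```

Repeating the step, as described above, would correspond to retraining with more or better-prepared examples and checking the predictions again.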
In the Model Evaluation stage, data scientists can evaluate the model in two ways: Hold-Out
and Cross-Validation. In the Hold-Out method, the dataset is divided into three subsets:
a training set, as described in the modeling stage; a validation set, used to assess the
performance of the model built in the training phase; and a test set, used to estimate the
likely future performance of the model. In Cross-Validation, the dataset is instead split into
several folds, and each fold in turn serves as the held-out set while the model is trained on
the rest.
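The two evaluation schemes can be sketched as follows. The split fractions (60/20/20), the fixed seed, and the function names `hold_out_split` and `k_fold_indices` are assumptions made for illustration. The k-fold scheme shown is one common form of cross-validation, not necessarily the variant the authors had in mind.

```python
import random

def hold_out_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the data and split it into training, validation, and test subsets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    # Whatever remains after the training and validation rows is the test set.
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

data = list(range(10))  # stand-in for ten data rows
train_set, validation_set, test_set = hold_out_split(data)
folds = list(k_fold_indices(len(data), k=5))

print(len(train_set), len(validation_set), len(test_set))
for train_idx, test_idx in folds:
    print(test_idx)
```

In the k-fold case every row is used for testing exactly once, which makes better use of a small dataset than a single hold-out split, at the cost of training the model k times.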
The Deployment stage depends on the purpose of the model, which may be rolled out to a
limited group of users or in a test environment.
In the Feedback stage, most of the input usually comes from the customer. After the
deployment stage, customers can say whether the model works for their purposes. Data
scientists take this feedback and decide whether to improve the model, because the process
from modeling to feedback is highly iterative.