Data Science: Lesson 5

DATA ANALYTICS LIFE CYCLE

PART 2
Module Objectives
At the end of this module, students must be able to:
1. describe the processes involved in model planning, such as data exploration and variable and model selection;
2. enumerate the key decisions needed to finalize the model as well as the tools available for model building;
3. discuss the importance of communicating the results obtained to key stakeholders;
4. describe the steps in operationalizing the results.
Recap:
From the previous discussion, we learned about:
1. an overview of the data analytics life cycle;
2. the seven key roles in an analytics project;
3. the discovery phase (Phase 1), in which the data science team learns about the business domain, assesses the
resources available, and formulates initial hypotheses to test in learning about the data;
4. the data preparation phase (Phase 2), which covers preparing the analytic sandbox, performing ETLT, data
conditioning, and related tasks.
Phase 3 – Model Planning

✔ In Phase 3, the data science team identifies candidate models to apply to the data for clustering,
classification, or finding relationships, depending on the goal of the project.

✔ It is during this phase that the team refers to the hypotheses developed in Phase 1, when the team first
became acquainted with the data and began to understand the business problem and domain area.

✔ These hypotheses help the team frame the analytics to execute in Phase 4 and select the right methods to
achieve its objectives.
Some of the activities to consider in this phase include the following:

✔ Assess the structure of the datasets. The structure of the datasets is one factor that dictates the tools and
analytical techniques for the next phase. Depending on whether the team plans to analyze textual data or
transactional data, for example, different tools and approaches are required.

✔ Ensure that the analytical techniques enable the team to meet the business objectives and accept or reject
the working hypotheses.

✔ Determine if the situation warrants a single model or a series of techniques as part of a larger analytic
workflow.

✔ In addition to the considerations just listed, it is useful to research and understand how other analysts
generally approach a specific kind of problem.

✔ Given the kind of data and resources that are available, evaluate whether similar, existing approaches will
work or if the team will need to create something new. Many times teams can get ideas from analogous
problems that other people have solved in different industry verticals or domain areas.

✔ One exercise of this type involves researching churn models in multiple industry verticals and summarizing,
for each domain area, the types of models previously used for that kind of classification problem.
✔ Performing this sort of diligence gives the team ideas of how others have solved similar problems and
presents the team with a list of candidate models to try as part of the model planning phase.

Model Planning – Data Exploration and Variable Selection

✔ In Phase 3, the objective of the data exploration is to understand the relationships among the variables to
inform selection of the variables and methods and to understand the problem domain. As with earlier
phases of the Data Analytics Lifecycle, it is important to spend time and focus attention on this preparatory
work to make the subsequent phases of model selection and execution easier and more efficient.

✔ A common way to conduct this step involves using tools to perform data visualizations. Approaching the
data exploration in this way aids the team in previewing the data and assessing relationships between
variables at a high level.
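
As an illustration only (not part of the original lesson), the following minimal Python sketch shows one way to perform this kind of high-level exploratory pass; the file name "sandbox_extract.csv" and the use of pandas/seaborn are assumptions rather than prescribed tools:

```python
# Minimal exploratory sketch (file name and columns are illustrative assumptions).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sandbox_extract.csv")        # hypothetical extract from the analytic sandbox
numeric = df.select_dtypes("number")

# Pairwise scatter plots preview relationships between variables at a high level.
sns.pairplot(numeric)
plt.show()

# A correlation heatmap highlights strongly related (and possibly redundant) inputs.
sns.heatmap(numeric.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```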

✔ As the team begins to question assumptions and test initial ideas of the project sponsors and
stakeholders, it needs to consider the inputs and data that will be needed, and then it must examine
whether these inputs are actually correlated with the outcomes that the team plans to predict or analyze.

✔ Some methods and types of models will handle correlated variables better than others. Depending on what
the team is attempting to solve, it may need to consider an alternate method, reduce the number of data
inputs, or transform the inputs to allow the team to use the best method for a given business problem.

✔ The key to this approach is to aim for capturing the most essential predictors and variables rather than
considering every possible variable that people think may influence the outcome.

✔ Approaching the problem in this manner requires iterations and testing to identify the most essential
variables for the intended analyses. The team should plan to test a range of variables to include in the
model and then focus on the most important and influential variables.
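
One hedged way to begin this narrowing exercise, assuming a numeric 0/1 outcome column hypothetically named "churned", is to rank candidate predictors by the strength of their correlation with the outcome; this is a sketch, not the lesson's prescribed procedure:

```python
# Illustrative only: rank candidate numeric predictors by the strength of their
# relationship to a hypothetical 0/1 outcome column named "churned".
import pandas as pd

df = pd.read_csv("sandbox_extract.csv")        # hypothetical extract
numeric = df.select_dtypes("number")

ranking = (
    numeric.corr()["churned"]                  # correlation of each variable with the outcome
    .drop("churned")
    .abs()
    .sort_values(ascending=False)
)
print(ranking.head(10))                        # shortlist of the most influential candidates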

✔ If the team plans to run regression analyses, identify the candidate predictors and outcome variables of the
model. Aim for predictors that demonstrate a strong relationship to the outcome rather than to the other input
variables. This includes remaining vigilant for problems such as serial correlation, multicollinearity, and other
typical data modeling challenges that interfere with the validity of these models; one common check is
sketched below.
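
A common multicollinearity check, offered here as an illustration rather than as the lesson's required method, is the variance inflation factor (VIF) available in statsmodels; the predictor names below are hypothetical:

```python
# Sketch of a multicollinearity check using variance inflation factors (VIF).
# Predictor names are hypothetical; a VIF well above roughly 5-10 suggests a
# variable is largely explained by the other inputs and may be dropped.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("sandbox_extract.csv")        # hypothetical extract
X = sm.add_constant(df[["tenure", "monthly_spend", "support_calls"]])

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```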

Model Planning – Model Selection

✔ In the model selection subphase, the team’s main goal is to choose an analytical technique, or a short list
of candidate techniques, based on the end goal of the project.
✔ For the context of this module, a model is discussed in general terms. In this case, a model simply refers to
an abstraction from reality. One observes events happening in a real-world situation or with live data
and attempts to construct models that emulate this behavior with a set of rules and conditions.

✔ In the case of machine learning and data mining, these rules and conditions are grouped into several
general sets of techniques, such as classification, association rules, and clustering. When reviewing this
list of types of potential models, the team can winnow down the list to several viable models to try to
address a given problem.
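
As a hedged sketch of this winnowing step, a short list of candidate classification techniques could be compared with simple cross-validation before committing to one; the estimators, file name, and column names below are illustrative assumptions, not the lesson's mandated choices:

```python
# Illustrative sketch: score a short list of candidate classification techniques
# with cross-validation before committing to one. Column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

df = pd.read_csv("sandbox_extract.csv")              # hypothetical extract
X, y = df[["tenure", "monthly_spend", "support_calls"]], df["churned"]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "naive_bayes": GaussianNB(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)      # 5-fold accuracy by default
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```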

✔ An additional consideration in this area for dealing with Big Data involves determining if the team will be
using techniques that are best suited for structured data, unstructured data, or a hybrid approach.

✔ Lastly, the team should take care to identify and document the modeling assumptions it is making as it
chooses and constructs preliminary models.

✔ Typically, teams create the initial models using a statistical software package such as R, SAS, or Matlab.
Although these tools are designed for data mining and machine learning algorithms, they may have
limitations when applying the models to very large datasets, as is common with Big Data.

Phase 4 – Model Building

✔ In Phase 4, the data science team needs to develop datasets for training, testing, and production
purposes. These datasets enable the data scientist to develop the analytical model and train it (“training
data”), while holding aside some of the data (“hold-out data” or “test data”) for testing the model.

✔ During this process, it is critical to ensure that the training and test datasets are sufficiently robust for the
model and analytical techniques. A simple way to think of these datasets is to view the training dataset as the
data used for conducting the initial experiments and the test set as the data used for validating an approach
once the initial experiments and models have been run.

✔ In the model building phase, an analytical model is developed, fit on the training data, and evaluated
(scored) against the test data, as sketched below.
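
A minimal sketch of that split/fit/score loop, assuming the hypothetical churn-style columns used earlier and scikit-learn as the modeling package (one option among the tools listed later in this module):

```python
# Sketch of the training/hold-out split and fit/score loop described above.
# Column names and the chosen estimator are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("sandbox_extract.csv")                          # hypothetical extract
X, y = df[["tenure", "monthly_spend", "support_calls"]], df["churned"]

# Hold out a portion of the data so the model is evaluated on records it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # fit on training data
print(classification_report(y_test, model.predict(X_test)))      # score against test data
```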

✔ The phases of model planning and model building can overlap quite a bit, and in practice one can iterate
back and forth between the two phases for a while before settling on a final model.

✔ Although the modeling techniques and logic required to develop models can be highly complex, the actual
duration of this phase can be short compared to the time spent preparing the data and defining the
approaches.

✔ In general, plan to spend more time preparing and learning the data (Phases 1–2) and crafting a
presentation of the findings (Phase 5). Phases 3 and 4 tend to move more quickly, although they are more
complex from a conceptual standpoint. As part of this phase, the data science team needs to execute the
models defined in Phase 3.

✔ During this phase, the team runs models from analytical software packages, such as R or SAS, on file extracts
and small datasets for testing purposes. In addition, the team assesses the validity of the model and its results
and determines whether the model accounts for most of the data and has robust predictive power.

✔ Also, at this point, the team refines the models to optimize the results, such as by modifying variable inputs or
reducing correlated variables where appropriate. In Phase 3, the team may have had some knowledge of
correlated variables or problematic data attributes, which will be confirmed or refuted once the models are
actually executed.

✔ When immersed in the details of constructing models and transforming data, many small decisions are
often made about the data and the approach for the modeling. These details can be easily forgotten once
the project is completed. Therefore, it is vital to record the results and logic of the model during this phase.
In addition, one must take care to record any operating assumptions that were made in the modeling
process regarding the data or the context.
Creating robust models that are suitable to a specific situation requires thoughtful consideration to ensure the
models being developed ultimately meet the objectives outlined in Phase 1. Questions to consider include these:

✔ Does the model appear valid and accurate on the test data?

✔ Does the model output/behavior make sense to the domain experts? That is, does it appear as if the
model is giving answers that make sense in this context?

✔ Do the parameter values of the fitted model make sense in the context of the domain?

✔ Is the model sufficiently accurate to meet the goal?

✔ Does the model avoid intolerable mistakes?

✔ Are more data or more inputs needed? Do any of the inputs need to be transformed or eliminated?

✔ Will the kind of model chosen support the runtime requirements?

✔ Is a different form of the model required to address the business problem? If so, go back to the model
planning phase and revise the modeling approach.

✔ Once the data science team can determine either that the model is sufficiently robust to solve the problem or
that the approach has failed, it can move to the next phase in the Data Analytics Lifecycle.

✔ There are many tools available to assist in this phase, focused primarily on statistical analysis or data
mining software. Common tools in this space include, but are not limited to, the following:

Commercial Tools:
1. SAS Enterprise Miner
2. SPSS Modeler
3. Matlab
4. Alpine Miner
5. STATISTICA
6. Mathematica

Free or Open-Source Tools:
1. R and PL/R
2. Octave
3. WEKA
4. Python
5. SQL

Phase 5 – Communicate the Results

✔ After executing the model, the team needs to compare the outcomes of the modeling to the criteria
established for success and failure.

✔ In Phase 5, the team considers how best to articulate the findings and outcomes to the various
team members and stakeholders, taking into account caveats, assumptions, and any limitations of the
results.

✔ Because the presentation is often circulated within an organization, it is critical to articulate the results
properly and position the findings in a way that is appropriate for the audience.

✔ As part of Phase 5, the team needs to determine if it succeeded or failed in its objectives. Many times
people do not want to admit to failing, but in this instance failure should not be considered a true failure;
rather, it is a failure of the data to accept or reject a given hypothesis adequately.
✔ This concept can be counterintuitive for those who have been told their whole careers not to fail. However,
the key is to remember that the team must be rigorous enough with the data to determine whether it will
prove or disprove the hypotheses outlined in Phase 1 (discovery).

✔ Sometimes teams have only done a superficial analysis, which is not robust enough to accept or reject a
hypothesis. Other times, teams perform very robust analysis and are searching for ways to show results,
even when results may not be there. It is important to strike a balance between these two extremes when
it comes to analyzing data and being pragmatic in terms of showing real-world results.

✔ When conducting this assessment, determine if the results are statistically significant and valid. If they are,
identify the aspects of the results that stand out and may provide salient findings when it comes time to
communicate them.
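
One possible (not prescribed) way to sanity-check significance for a classification model is a permutation test, which compares the model's cross-validated score against scores obtained on label-shuffled data; the file name, columns, and estimator below are hypothetical:

```python
# Hedged significance sanity check: compare the model's cross-validated score to
# scores on label-shuffled data. A small p-value suggests the result is unlikely
# to be due to chance. Names and the chosen estimator are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

df = pd.read_csv("sandbox_extract.csv")              # hypothetical extract
X, y = df[["tenure", "monthly_spend", "support_calls"]], df["churned"]

score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, n_permutations=200, random_state=0
)
print(f"model score = {score:.3f}, p-value vs. shuffled labels = {p_value:.3f}")
```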

✔ If the results are not valid, think about adjustments that can be made to refine and iterate on the model to
make it valid. During this step, assess the results and identify which data points may have been surprising
and which were in line with the hypotheses that were developed in Phase 1.

✔ Comparing the actual results to the ideas formulated early on produces additional ideas and insights that
would have been missed if the team had not taken time to formulate initial hypotheses early in the process.

✔ By this time, the team should have determined which model or models address the analytical challenge in
the most appropriate way. In addition, the team should have ideas of some of the findings as a result of the
project. The best practice in this phase is to record all the findings and then select the three most
significant ones that can be shared with the stakeholders.

✔ In addition, the team needs to reflect on the implications of these findings and measure the business
value. Depending on what emerged as a result of the model, the team may need to spend time quantifying
the business impact of the results to help prepare for the presentation and demonstrate the value of the
findings.

✔ Now that the team has run the model, completed a thorough discovery phase, and learned a great deal
about the datasets, reflect on the project, consider what obstacles were encountered, and identify what can
be improved in the future.

✔ Make recommendations for future work or improvements to existing processes, and consider what each of
the team members and stakeholders needs in order to fulfill their responsibilities. For instance, sponsors must
champion the project. Stakeholders must understand how the model affects their processes.

✔ For example, if the team has created a model to predict customer churn, the Marketing team must
understand how to use the churn model predictions in planning their interventions.

✔ Production engineers need to operationalize the work that has been done. In addition, this is the phase to
underscore the business benefits of the work and begin making the case to implement the logic into a live
production environment.

Phase 6 – Operationalize

✔ In the final phase, the team communicates the benefits of the project more broadly and sets up a pilot
project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem
of users.

✔ Phase 6 represents the first time that most analytics teams approach deploying the new analytical
methods or models in a production environment. Rather than deploying these models immediately on a
wide-scale basis, the team can manage risk more effectively and learn by undertaking a small-scope pilot
deployment before a wide-scale rollout.
✔ This approach enables the team to learn about the performance and related constraints of the model in a
production environment on a small scale and make adjustments before a full deployment.

✔ Be aware that this phase can bring in a new set of team members—usually the engineers responsible for
the production environment who have a new set of issues and concerns beyond those of the core project
team.

✔ This technical group needs to ensure that running the model fits smoothly into the production environment
and that the model can be integrated into related business processes.

✔ Part of the operationalizing phase includes creating a mechanism for performing ongoing monitoring of
model accuracy and, if accuracy degrades, finding ways to retrain the model.

✔ If feasible, design alerts for when the model is operating “out-of-bounds.” This includes situations when the
inputs are beyond the range that the model was trained on, which may cause the outputs of the model to
be inaccurate or invalid. If this begins to happen regularly, the model needs to be retrained on new data.
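
A minimal sketch of such an out-of-bounds check, assuming numeric inputs and pandas; the file names and the alerting policy are hypothetical, and real deployments would use whatever monitoring stack the production environment provides:

```python
# Minimal monitoring sketch (assumptions: numeric inputs, pandas DataFrames).
# Record the range of each input seen during training, then flag scoring-time
# records that fall outside that range, since the model was not trained on them.
import pandas as pd

def input_bounds(train_df: pd.DataFrame) -> pd.DataFrame:
    """Capture per-column min/max from the training data."""
    return train_df.agg(["min", "max"])

def out_of_bounds(new_df: pd.DataFrame, bounds: pd.DataFrame) -> pd.Series:
    """Return a boolean Series marking rows with any value outside the training range."""
    too_low = new_df.lt(bounds.loc["min"], axis=1)
    too_high = new_df.gt(bounds.loc["max"], axis=1)
    return (too_low | too_high).any(axis=1)

# Usage (hypothetical files): alert or consider retraining if flags become frequent.
bounds = input_bounds(pd.read_csv("training_inputs.csv").select_dtypes("number"))
incoming = pd.read_csv("incoming_inputs.csv").select_dtypes("number")
flags = out_of_bounds(incoming, bounds)
print(f"{flags.mean():.1%} of incoming records are outside the training range")
```
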
Although many roles represent many interests within a project, these interests usually overlap, and most of them
can be met with four main deliverables.

✔ Presentation for project sponsors: This contains high-level takeaways for executive level stakeholders,
with a few key messages to aid their decision-making process. Focus on clean, easy visuals for the
presenter to explain and for the viewer to grasp.

✔ Presentation for analysts: This describes business process changes and reporting changes. Fellow
data scientists will want the details and are comfortable with technical graphs such as Receiver Operating
Characteristic (ROC) curves, density plots, and histograms.

✔ Code for technical people.

✔ Technical specifications for implementing the code.

✔ As a general rule, the more executive the audience, the more succinct the presentation needs to be. Most
executive sponsors attend many briefings in the course of a day or a week. Ensure that the presentation
gets to the point quickly and frames the results in terms of value to the sponsor’s organization.

✔ When presenting to other audiences with more quantitative backgrounds, focus more time on the
methodology and findings. In these instances, the team can be more expansive in describing the
outcomes, methodology, and analytical experiment with a peer group.
