Module_1B

The Data Analytics Lifecycle consists of six phases: Discovery, Data Preparation, Model Planning, Model Building, Communicate Results, and Operationalize, designed to tackle complex data problems. Key roles include Business User, Project Sponsor, Project Manager, Business Intelligence Analyst, Database Administrator, Data Engineer, and Data Scientist. The lifecycle emphasizes iterative processes, with significant time often spent in the preparation phases to ensure data quality and relevance for successful analytics outcomes.


Data Analytics Lifecycle

• Data Analytics Lifecycle Overview


• Phase 1: Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communicate Results
• Phase 6: Operationalize
• Case Study: Global Innovation Network and Analysis (GINA)
Data Analytics Lifecycle Overview
• The Data Analytics Lifecycle is designed for complex data problems and
data science projects
• With six phases, project work can occur in several phases simultaneously
• The cycle is iterative, reflecting how a real project unfolds
• Work can return to earlier phases as new information is uncovered
Key Roles for a Successful Analytics Project

• Business User – understands the domain area


• Project Sponsor – provides requirements
• Project Manager – ensures that key objectives and milestones are met
• Business Intelligence Analyst – provides business domain
expertise based on deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists with data
management and extraction, and supports the analytic sandbox
• Data Scientist – provides analytic techniques and
modeling
Four Types of Data Analytics
Types of Analytics
• Descriptive: A set of techniques for reviewing and examining
the data set(s) to understand the data and analyze business
performance.
• Diagnostic: A set of techniques for determining what has
happened and why
• Predictive: A set of techniques that analyze current and
historical data to determine what is most likely (or unlikely) to
happen
• Prescriptive: A set of techniques for computationally
developing and analyzing alternatives that can become
courses of action – either tactical or strategic – that may
discover the unexpected
Descriptive Analytics
– Identify the attributes, then assess/evaluate the attributes
– Estimate the magnitude of each attribute's relative contribution
to the final solution
– Accumulate more instances of data from the data sources
– If possible, perform the steps of evaluation, classification and categorization
quickly
– Yield a measure of adaptability within the OODA (observe, orient, decide, act) loop
• At some threshold, crossover into diagnostic and predictive analytics
Diagnostic Analytics
– Begin with descriptive analytics
– Extract patterns from large data quantities via data mining
– Correlate data types for explanation of near-term behavior – past
and present
– Estimate linear/non-linear behavior not easily identifiable through
other approaches.
• Example: by classifying past insurance claims, estimate
the number of future claims to flag for investigation with
a high probability of being fraudulent.
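A minimal sketch of the insurance example, assuming a historical claims file with a labeled fraud column; the logistic regression model and the feature names are stand-ins, not part of the slides.

```python
# Hypothetical sketch: classify past insurance claims, then flag new claims
# with a high predicted probability of fraud for investigation.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

claims = pd.read_csv("claims_history.csv")        # assumed file with a labeled 'fraud' column
features = ["claim_amount", "claimant_age", "days_to_report"]  # assumed feature names

X_train, X_test, y_train, y_test = train_test_split(
    claims[features], claims["fraud"], test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))

# Flag incoming claims whose predicted fraud probability exceeds a threshold
new_claims = pd.read_csv("new_claims.csv")        # assumed file of unlabeled claims
new_claims["fraud_prob"] = model.predict_proba(new_claims[features])[:, 1]
flagged = new_claims[new_claims["fraud_prob"] > 0.8]
print(f"{len(flagged)} claims flagged for investigation")
```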
Predictive Analytics

• Begin with descriptive AND diagnostic analytics
• Choose the right data based on
domain knowledge and relationships
among variables
• Choose the right techniques to yield
insight into possible outcomes
• Determine the likelihood of possible
outcomes given initial boundary
conditions
Prescriptive Analytics
• Process:
– Begin w/ predictive analytics
– Determine what should occur and how to make it so
– Determine the mitigating factors that lead to desirable/undesirable outcomes
– “What-if” analysis w/ local or global optimization
– Ex: Find the best set of prices and advertising frequency to maximize revenue,
and the right set of business moves to achieve that goal (see the sketch below)

“Make it so”
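A toy what-if sketch of the pricing example above; the demand and cost model is invented purely for illustration, and a real analysis would fit a response model from data.

```python
# Illustrative what-if analysis: search a grid of candidate prices and ad
# frequencies for the combination that maximizes modeled revenue.
import itertools

def modeled_revenue(price, ads_per_week):
    demand = 1000 - 40 * price + 120 * ads_per_week   # assumed linear demand model
    cost = 50 * ads_per_week                           # assumed advertising cost
    return max(demand, 0) * price - cost

candidates = itertools.product(range(5, 26), range(0, 8))   # prices $5-25, 0-7 ads/week
best = max(candidates, key=lambda c: modeled_revenue(*c))
print("Best price and ad frequency:", best, "modeled revenue:", modeled_revenue(*best))
```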
Overview of Data Analytics Lifecycle
Phase 1: Discovery

1. Learning the Business Domain


2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
Phase 2: Data Preparation
• Includes steps to explore, preprocess, and
condition data
• Create robust environment – analytics sandbox
• Data preparation tends to be the most labor-
intensive step in the analytics lifecycle
– Often at least 50% of the data science project’s time
• The data preparation phase is generally the
most iterative and the one that teams tend to
underestimate most often
Preparing the Analytic Sandbox

• Create the analytic sandbox (also called workspace)


• Allows team to explore data without interfering with live
production data
• Sandbox collects all kinds of data (expansive approach)
• The sandbox allows organizations to undertake
ambitious projects beyond traditional data analysis and
BI to perform advanced predictive analytics
• Although the concept of an analytics sandbox is
relatively new, it has become widely accepted by
data science teams and IT groups
Performing ETLT
(Extract, Transform, Load, Transform)
• In ETL, users extract, transform, and then load the data
• In the sandbox the process is often ELT – loading early
preserves the raw data, which can be useful to examine
• Example – in credit card fraud detection, outliers can
represent high-risk transactions that might be
inadvertently filtered out or transformed before being
loaded into the database
• New technologies such as Hadoop and NoSQL databases are often used here
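A hedged sketch of the load-before-transform idea, using SQLite as a stand-in sandbox database and an assumed transactions extract; the raw table is loaded untouched so outliers survive for later inspection.

```python
# ELT sketch: load raw data into the sandbox first, transform afterwards,
# so potentially informative outliers are not filtered out up front.
import pandas as pd
import sqlite3                                    # stand-in for the sandbox database

raw = pd.read_csv("transactions.csv")             # assumed raw extract
conn = sqlite3.connect("sandbox.db")
raw.to_sql("transactions_raw", conn, if_exists="replace", index=False)   # load as-is

# Transform later, inside the sandbox, keeping the raw table intact
cleaned = pd.read_sql("SELECT * FROM transactions_raw", conn)
cleaned["amount"] = cleaned["amount"].clip(upper=cleaned["amount"].quantile(0.999))
cleaned.to_sql("transactions_conditioned", conn, if_exists="replace", index=False)
```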
Learning about the Data

• Becoming familiar with the data is critical


• This activity accomplishes several goals:
– Determines the data available to the team
early in the project
– Highlights gaps – identifies data not currently
available
– Identifies data outside the organization that
might be useful
Learning about the Data
Sample Dataset Inventory
Data Conditioning

• Data conditioning includes cleaning data, normalizing datasets, and
performing transformations
– Often viewed as a preprocessing step prior to data analysis, it might be
performed by the data owner, IT department, DBA, etc.
– Best to have data scientists involved
– Data science teams generally prefer having too much data rather than too little
Data Conditioning

• Additional questions and considerations


– What are the data sources? Target fields?
– How clean is the data?
– How consistent are the contents and files? Missing
or inconsistent values?
– Assess the consistency of the data types – numeric,
alphanumeric?
– Review the contents to ensure the data makes sense
– Look for evidence of systematic error
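A short pandas sketch of a few of these checks; the file and column names are hypothetical.

```python
# Basic conditioning checks: missing values, type consistency, and a quick
# look for evidence of systematic error (e.g., a field stuck at one value).
import pandas as pd

df = pd.read_csv("sandbox_extract.csv")           # assumed sandbox dataset

print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
print(df.dtypes)                                        # numeric vs. alphanumeric consistency

# Normalize an inconsistent categorical field (extra spaces, mixed case)
df["region"] = df["region"].str.strip().str.title()    # assumed column name

# Evidence of systematic error: columns with no variation at all
suspect = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
print("Columns with no variation:", suspect)
```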
Survey and Visualize
• Leverage data visualization tools to gain an overview of the
data
• Review data to ensure calculations are consistent
• Does the data distribution stay consistent?
• Assess the granularity of the data, the range of values, and
the level of aggregation of the data
• Does the data represent the population of interest?
• Check time-related variables – daily, weekly, monthly? Is
this good enough?
• Is the data standardized/normalized? Scales consistent?
• For geospatial datasets, are state/country abbreviations
consistent?
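A small profiling sketch along these lines, assuming a pandas DataFrame with a date column, to review distributions, value ranges, and time granularity.

```python
# Quick survey of the data: distributions, value ranges, and time granularity.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sandbox_extract.csv", parse_dates=["order_date"])   # assumed date column

print(df.describe(include="all"))                 # counts, ranges, rough distributions
df.hist(figsize=(10, 6))                          # distribution overview per numeric column
plt.tight_layout()
plt.show()

# Is the granularity daily, weekly, monthly? Count records per period.
print(df.set_index("order_date").resample("W").size().head())
```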
Common Tools for Data Preparation
• Microsoft Power BI Tools
Data Preparation Tools
• Trifacta
• Datameer
• Databricks
• Altair
• IBM
• Tableau
Phase 3: Model Planning

• Activities to consider
– Assess the structure of the data – this dictates the tools and
analytic techniques for the next phase
– Ensure the analytic techniques enable the team to meet the
business objectives and accept or reject the working
hypotheses
– Determine if the situation warrants a single model or a series
of techniques as part of a larger analytic workflow
– Research and understand how other analysts have
approached this kind of problem, or similar ones
Phase 3: Model Planning
Model Planning in Industry Verticals

• Example of other analysts approaching a similar problem


Data Exploration and Variable Selection
• Explore the data to understand the relationships among the
variables and thus inform the selection of variables and methods
• A common way to do this – data visualization tools
• Stakeholders and subject matter experts may have ideas
– For example, some hypothesis that led to the project
• Aim for capturing the most essential predictors & variables
– This often requires iterations and testing to identify key
variables
• If the team plans to run regression analysis, identify the
candidate predictors and outcome variables of the model
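One simple starting point is a correlation screen against the outcome variable, sketched below with hypothetical column names; it complements, rather than replaces, domain knowledge and iteration.

```python
# First-pass variable selection: rank candidate predictors by correlation
# with the outcome, then review the shortlist with subject matter experts.
import pandas as pd

df = pd.read_csv("model_planning_data.csv")       # assumed analytic dataset
outcome = "churned"                               # assumed numeric outcome variable

corr = df.corr(numeric_only=True)[outcome].drop(outcome)
candidates = corr.abs().sort_values(ascending=False).head(10)
print("Top candidate predictors:\n", candidates)
```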
Model Selection
• The main goal is to choose an analytical technique, or several
candidates, based on the end goal of the project
• We observe events in the real world and attempt to construct
models that emulate this behavior with a set of rules and
conditions
• Determine whether to use techniques best suited for structured
data, unstructured data, or a hybrid approach
• Teams often create initial models using statistical software
packages such as R, SAS, or Matlab
– Which may have limitations when applied to very large datasets
• The team moves to the model building phase once it has a good
idea about the type of model to try
Common Tools for the Model
Planning Phase
• R has a complete set of modeling capabilities
– R contains about 5000 packages for data analysis and
graphical presentation
• SQL Analysis services can perform in-database analytics of
common data mining functions, involved aggregations, and basic
predictive models
• SAS/ACCESS provides integration between SAS and the analytics
sandbox via multiple data connections
• RapidMiner
• Tableau Public
Phase 4: Model Building
• Execute the models defined in Phase 3
• Develop datasets for training, testing, and production
• Develop analytic model on training data, test on test data
• Questions to consider
– Does the model appear valid and accurate on the test data?
– Does the model output/behavior make sense to the domain experts?
– Do the parameter values make sense in the context of the domain?
– Is the model sufficiently accurate to meet the goal?
– Does the model avoid intolerable mistakes?
– Are more data or inputs needed?
– Will the kind of model chosen support the runtime environment?
– Is a different form of the model required to address the business
problem?
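A compact sketch of the build-and-test loop, assuming training and test datasets from this phase; the random forest is a placeholder for whatever model type Phase 3 selected.

```python
# Build on the training set, evaluate on the held-out test set, and sanity-check
# the questions above (accuracy, sensible behavior) before moving on.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

train = pd.read_csv("train.csv")                  # assumed training dataset
test = pd.read_csv("test.csv")                    # assumed test dataset
target = "outcome"                                # assumed target column

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(train.drop(columns=[target]), train[target])

preds = model.predict(test.drop(columns=[target]))
print(classification_report(test[target], preds))  # is it accurate enough for the goal?
```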
Lifecycle of Model Building Phase

• Select variables
• Balance data
• Build models
• Validate
• Deploy
• Maintain
• Define success
• Explore data
• Condition data
Common Tools for
the Model Building Phase
• Commercial Tools
– SAS Enterprise Miner – built for enterprise-level computing &
analytics
– SPSS Modeler (IBM) – provides enterprise-level computing and
analytics
– Matlab – high-level language for data analytics, algorithms, data
exploration
– Alpine Miner – provides GUI frontend for backend analytics tools
• Free or Open Source Tools
– R and PL/R - PL/R is a procedural language for PostgreSQL with R
– Octave – language for computational modeling
– WEKA – data mining software package with analytic workbench
– Python – language providing toolkits for machine learning and
analysis
Phase 5: Communicate Results
• After executing the model, the team needs to
compare the outcomes of the modeling to the
criteria established for success and failure.
• The team considers how best to articulate the
findings and outcomes to the various team
members and stakeholders, taking into
account caveats and assumptions.
• The team should identify key findings,
quantify the business value, and develop a
narrative to summarize and convey the
findings to stakeholders.
Phase 5: Communicate Results
• Determine if the team succeeded or failed in its
objectives
• Assess if the results are statistically significant and
valid
– If so, identify aspects of the results that present salient
findings
– Identify surprising results and those in line with the
hypotheses
• Communicate and document the key findings and
major insights derived from the analysis
– This is the most visible portion of the process to the outside
stakeholders and sponsors
Phase 6: Operationalize

• In this last phase, the team communicates the benefits of the project
more broadly and sets up a pilot project to deploy the work in a
controlled way
• Risk is managed effectively by undertaking a small-scope pilot
deployment before a wide-scale rollout
• During the pilot project, the team may need to execute the algorithm
more efficiently in the database rather than with in-memory tools
like R, especially with larger datasets
• To test the model in a live setting, consider running the model in a
production environment for a discrete set of products or a single line
of business
• Monitor model accuracy and retrain the model if necessary
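A minimal monitoring sketch for such a pilot, assuming scored records eventually receive actual outcomes; if accuracy falls below an agreed threshold, the team retrains.

```python
# Pilot monitoring sketch: compare recent predictions with actual outcomes and
# trigger retraining when accuracy drops below an agreed threshold.
import pandas as pd
from sklearn.metrics import accuracy_score

scored = pd.read_csv("pilot_scored_records.csv")  # assumed columns: prediction, actual
accuracy = accuracy_score(scored["actual"], scored["prediction"])
print(f"Pilot accuracy: {accuracy:.2%}")

RETRAIN_THRESHOLD = 0.85                          # assumed value, agreed with the sponsor
if accuracy < RETRAIN_THRESHOLD:
    print("Accuracy below threshold; schedule model retraining")
```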
Phase 6: Operationalize
Key outputs from a successful analytics project
• Business user – tries to determine business benefits and
implications
• Project sponsor – wants business impact, risks, ROI
• Project manager – needs to determine if project
completed on time, within budget, goals met
• Business intelligence analyst – needs to know if reports
and dashboards will be impacted and need to change
• Data engineer and DBA – must share code and documentation
• Data scientist – must share code and explain model to
peers, managers, stakeholders
Phase 6: Operationalize
Four main deliverables

• Although the seven roles represent many interests, the interests
overlap and can be met with four main deliverables
1. Presentation for project sponsors – high-level
takeaways for executive level stakeholders
2. Presentation for analysts – describes business
process changes and reporting changes, includes
details and technical graphs
3. Code for technical people
4. Technical specifications for implementing the code
Practical Data Analytics Life Cycle
Case Study:
Global Innovation Network and Analysis (GINA)

• EMC’s new director wanted to improve the company’s engagement of
employees across the global centers of excellence (GCE) to drive
innovation, research, and university partnerships
• This project was created to accomplish
– Store formal and informal data
– Track research from global technologists
– Mine the data for patterns and insights to improve the
team’s operations and strategy
Phase 1: Discovery

• Team members and roles


– Business user, project sponsor, project manager –
Vice President from Office of CTO
– BI analyst – person from IT
– Data engineer and DBA – people from IT
– Data scientist – distinguished engineer
Phase 1: Discovery

• The data fell into two categories


– Five years of idea submissions from internal innovation
contests
– Minutes and notes representing innovation and research
activity from around the world
• Hypotheses grouped into two categories
– Descriptive analytics of what is happening to spark further
creativity, collaboration, and asset generation
– Predictive analytics to advise executive management of
where it should be investing in the future
Phase 2: Data Preparation

• Set up an analytics sandbox


• Discovered that certain data needed conditioning and
normalization and that missing datasets were critical
• Team recognized that poor quality data could impact
subsequent steps
• They discovered that many names were misspelled and that there were
problems with extra spaces
• These seemingly small problems had to be addressed
Phase 3: Model Planning

• The study included the following considerations


– Identify the right milestones to achieve the goals
– Trace how people move ideas from each milestone
toward the goal
– Track ideas that die and others that reach the goal
– Compare times and outcomes using a few different
methods
Phase 4: Model Building

• Several analytic methods were employed


– NLP on textual descriptions
– Social network analysis using R and RStudio
– Developed social graphs and visualizations
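The slides do not include the GINA code; the sketch below shows the same idea in Python with networkx, building a social graph from hypothetical idea-submission records and ranking likely influencers by degree centrality.

```python
# Sketch of a social graph of idea submitters: nodes are people, edges connect
# co-submitters, and degree centrality suggests likely innovation influencers.
import itertools
import pandas as pd
import networkx as nx

ideas = pd.read_csv("idea_submissions.csv")       # assumed columns: idea_id, submitter
G = nx.Graph()
for _, group in ideas.groupby("idea_id"):
    for a, b in itertools.combinations(group["submitter"].unique(), 2):
        G.add_edge(a, b)

influencers = sorted(nx.degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True)
print("Top influencers:", influencers[:5])
```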
Phase 4: Model Building
Social graph of data submitters and finalists
Phase 4: Model Building
Social graph of top innovation influencers
Phase 5: Communicate Results

• The study was successful in identifying hidden innovators


– Found high density of innovators in Cork, Ireland
• The CTO office launched longitudinal studies
Phase 6: Operationalize

• Deployment was not really discussed


• Key findings
– Need more data in future
– Some data were sensitive
– A parallel initiative needs to be created to improve
basic BI activities
– A mechanism is needed to continually reevaluate
the model after deployment
Summary
• The Data Analytics Lifecycle is an approach
to managing and executing analytic
projects
• Lifecycle has six phases
• Bulk of the time usually spent on
preparation – phases 1 and 2
• Seven roles needed for a data science team
