Module 1
Data Analytics and Visualization
Course Code- CSC601
Module I- Introduction to Data Analytics and Life Cycle (5 Hr, CO1)
Data Analytics Lifecycle Overview: Key Roles for a Successful Analytics Project, Background and
Overview of the Data Analytics Lifecycle
Phase 1: Discovery: Learning the Business Domain, Resources, Framing the Problem, Identifying
Key Stakeholders, Interviewing the Analytics Sponsor, Developing Initial Hypotheses, Identifying
Potential Data Sources
Phase 2: Data Preparation: Preparing the Analytic Sandbox, Performing ETLT, Learning About
the Data, Data Conditioning, Survey and Visualize, Common Tools for the Data Preparation Phase
Phase 3: Model Planning: Data Exploration and Variable Selection, Model Selection, Common
Tools for the Model Planning Phase
Phase 4: Model Building: Common Tools for the Model Building Phase
Phase 5: Communicate Results
Phase 6: Operationalize
Introduction to Data Analytics-
Analytics is the discovery and communication of meaningful patterns in data. Especially valuable
in areas rich with recorded information, analytics relies on the simultaneous application of
statistics, computer programming, and operations research to quantify performance. Analytics often
favors data visualization to communicate insight.
Firms commonly apply analytics to business data to describe, predict, and improve business
performance. Areas within analytics include predictive analytics, enterprise decision
management, etc. Since analytics can require extensive computation (because of big data), the
algorithms and software used for analytics harness the most current methods in computer science.
In a nutshell, analytics is the scientific process of transforming data into insight for making better
decisions. The goal of Data Analytics is to get actionable insights resulting in smarter decisions
and better business outcomes.
It is critical to design and build a data warehouse or Business Intelligence (BI) architecture that
provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and
analysis of large and diverse data sets.
There are four types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. Predictive
analytics uses data to determine the probable outcome of an event or the likelihood
of a situation occurring.
Predictive analytics draws on a variety of statistical techniques from modeling,
machine learning, data mining, and game theory that analyze current and
historical facts to make predictions about future events. Techniques used for
predictive analytics include (a short sketch follows this list):
● Linear Regression
● Time series analysis and forecasting
● Data Mining
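For instance, a simple linear regression can be fit on historical data and then used to score a new scenario. The sketch below is a minimal illustration in Python, assuming pandas and scikit-learn are available; the file monthly_sales.csv and its columns are hypothetical.

```python
# A minimal sketch of predictive analytics with linear regression.
# The file 'monthly_sales.csv' and its columns are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("monthly_sales.csv")   # historical facts
X = df[["ad_spend"]]                    # explanatory variable
y = df["sales"]                         # outcome to predict

model = LinearRegression().fit(X, y)    # learn the relationship from history

# Predict the probable outcome for a planned ad spend of 50,000.
predicted_sales = model.predict(pd.DataFrame({"ad_spend": [50000]}))
print(predicted_sales[0])
```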
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It
looks at past performance and understands that performance by mining historical data to determine
the causes of success or failure in the past. Almost all management reporting, such as sales, marketing,
operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or
prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single
customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic reviews, such as the following (a brief sketch follows this list):
● Data Queries
● Reports
● Descriptive Statistics
● Data dashboard
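As an illustration, descriptive statistics and a simple report-style summary can be produced in a few lines of Python with pandas; the file sales_2023.csv and its columns are hypothetical.

```python
# A minimal sketch of descriptive analytics: summarizing historical performance.
# The file 'sales_2023.csv' and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("sales_2023.csv")

print(df.describe())                          # descriptive statistics (mean, std, quartiles)
print(df.groupby("region")["revenue"].sum())  # a simple management-style report by region
```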
Prescriptive Analytics
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions
that benefit from the predictions and showing the decision maker the implications of each decision
option. Prescriptive analytics not only anticipates what will happen and when it will happen but
also why it will happen. Further, prescriptive analytics can suggest decision options for taking
advantage of a future opportunity or mitigating a future risk, and it can illustrate the implications
of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning by leveraging
operational and usage data combined with data on external factors such
as economic data, population demography, etc.
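As a rough illustration of weighing decision options under constraints, a small linear program can recommend the best action. The sketch below uses SciPy's linprog on a made-up production-planning example; all numbers and the scenario itself are hypothetical.

```python
# A minimal sketch of prescriptive analytics: choosing the best action under constraints.
# Hypothetical example: decide how many units of two products to make next week.
from scipy.optimize import linprog

# Maximize profit 40*x1 + 30*x2 (linprog minimizes, so negate the objective).
c = [-40, -30]
# Constraints: machine hours 2*x1 + 1*x2 <= 100, labor hours 1*x1 + 1*x2 <= 80.
A_ub = [[2, 1], [1, 1]]
b_ub = [100, 80]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x)     # recommended production plan (the decision option)
print(-result.fun)  # expected profit of that decision option
```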
Diagnostic Analytics
In diagnostic analytics, we generally use historical data over other data to answer a question
or solve a problem. We try to find dependencies and patterns in the
historical data of the particular problem.
For example, companies favor this analysis because it gives great insight into a
problem, and they already keep detailed information at their disposal; otherwise, data
collection would have to be repeated for every problem and would be very time-consuming.
Common techniques used for Diagnostic Analytics are (a short sketch follows this list):
● Data discovery
● Data mining
● Correlations
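A minimal sketch of the correlation technique in Python with pandas; the file support_tickets.csv and its columns are hypothetical.

```python
# A minimal sketch of diagnostic analytics: looking for dependencies in historical data.
# The file 'support_tickets.csv' and its numeric columns are hypothetical.
import pandas as pd

df = pd.read_csv("support_tickets.csv")

# Correlation matrix: which historical factors move together with satisfaction?
print(df[["response_time", "resolution_time", "customer_satisfaction"]].corr())
```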
Data Analytics Lifecycle Overview
Key Roles for a Successful Analytics Project-
The seven roles are as follows:
1. Business User
2. Project Sponsor
3. Project Manager
4. Business Intelligence Analyst
5. Database Administrator (DBA)
6. Data Engineer
7. Data Scientist
1. Business User
● The business user is someone who understands the domain area of the project
and typically benefits from the results.
● This user advises and consults the team working on the project about the
value of the results obtained and how the outputs will be operationalized.
● A business manager, line manager, or deep subject matter expert in the
project domain usually fulfills this role.
2. Project Sponsor
● The project sponsor is the person responsible for initiating the project.
The project sponsor provides the actual requirements for the project and presents
the basic business issue.
● The sponsor generally provides the funding and gauges the degree of value from the
final outputs of the team working on the project.
● This person introduces the prime concern and shapes the desired output.
3. Project Manager:
● This person ensures that the key milestones and objectives of the project are met on
time and with the expected quality.
7. Data Scientist:
● The data scientist provides subject matter expertise for analytical techniques,
data modeling, and applying valid analytical techniques to given business
issues.
● The data scientist ensures that the overall analytical objectives are met.
● Data scientists outline and apply analytical methods and approaches to the data
available for the concerned project.
Data Analytics Lifecycle
Phase 1- Discovery
● The data science team learns and investigates the problem.
● The team develops context and understanding.
● The team comes to know about the data sources needed and available for the project.
● The team formulates initial hypotheses that can later be tested with data.
In Phase 1, the team learns the business domain, including relevant history such as
whether the organization or business unit has attempted similar projects in the past from
which they can learn. The team assesses the resources available to support the project
in terms of people, technology, time, and data. Important activities in this phase include
framing the business problem as an analytics challenge that can be addressed in
subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the
data.
Phase 1: Discovery
1.2.1 Learning the Business Domain
In many cases, data scientists will have deep computational and quantitative
knowledge that can be broadly applied across many disciplines. An example of
this role would be someone with an advanced degree in applied
mathematics or statistics.
These data scientists have deep knowledge of the methods, techniques, and
ways for applying heuristics to a variety of business and conceptual
problems.
Others in this area may have deep knowledge of a domain area, coupled with
quantitative expertise.
1.2.2 Resources Framing the Problem
As part of the discovery phase, the team needs to assess the resources available to support the
project. In this context, resources include technology, tools, systems, data, and people.
During this scoping, consider the available tools and technology the team will be using and the
types of systems needed for later phases to operationalize the models.
For the project to have long-term success, what types of skills and roles will be needed for the
recipients of the model being developed? Does the requisite level of expertise exist within the
organization today, or will it need to be cultivated?
Answering these questions will influence the techniques the team selects and the kind of
implementation the team chooses to pursue in subsequent phases of the Data Analytics
Lifecycle.
1.2.3 Identifying Key Stakeholders
Each team member may hear slightly different things related to the needs and the problem and
have somewhat different ideas of possible solutions. For these reasons, it is crucial to state
the analytics problem, as well as why and to whom it is important.
Essentially, the team needs to clearly articulate the current situation and its main challenges.
As part of this activity, it is important to identify the main objectives of the project, identify what
needs to be achieved in business terms, and identify what needs to be done to meet the needs.
Additionally, consider the objectives and the success criteria for the project. What is the team
attempting to achieve by doing the project, and what will be considered “good enough” as an
outcome of the project? This is critical to document and share with the project team and key
stakeholders.
1.2.4 Interviewing the Analytics Sponsor
The team should plan to collaborate with the stakeholders to clarify and frame the
analytics problem.
At the outset, project sponsors may have a predetermined solution that may not
necessarily realize the desired outcome. In these cases, the team must use its
knowledge and expertise to identify the true underlying problem and appropriate
solution.
In essence, the data science team can take a more objective approach, as the
stakeholders may have developed biases over time, based on their experience.
Also, what may have been true in the past may no longer be a valid working
assumption.
1.2.5 Developing Initial Hypotheses and Identifying Potential Data Sources
Identify data sources: Make a list of candidate data sources the team may need to test the initial
hypotheses outlined in this phase. Make an inventory of the datasets currently available and
those that can be purchased or otherwise acquired for the tests the team wants to perform.
Capture aggregate data sources: This is for previewing the data and providing high-level
understanding. It enables the team to gain a quick overview of the data and perform further
exploration on specific areas. It also points the team to possible areas of interest within the
data.
Review the raw data: Obtain preliminary data from initial data feeds. Begin understanding the
interdependencies among the data attributes, and become familiar with the content of the data,
its quality, and its limitations.
Evaluate the data structures and tools needed: The data type and structure dictate which tools
the team can use to analyze the data. This evaluation gets the team thinking about which
technologies may be good candidates for the project and how to start getting access to these
tools.
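As a rough illustration of reviewing raw data and evaluating its structure, the following Python sketch profiles a preliminary feed with pandas; the file name initial_feed.csv is hypothetical.

```python
# A minimal sketch of reviewing raw data from an initial data feed.
# The file 'initial_feed.csv' is hypothetical.
import pandas as pd

raw = pd.read_csv("initial_feed.csv")

print(raw.head())        # become familiar with the content of the data
print(raw.dtypes)        # data types hint at which tools and models apply
print(raw.isna().sum())  # quality check: missing values per attribute
print(raw.describe())    # high-level, aggregate view of the data
```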
Phase 2- Data preparation:
● Steps to explore, preprocess, and condition data prior to modeling and analysis.
● It requires the presence of an analytic sandbox; the team executes extract, load, and
transform (ELT) processes to get data into the sandbox.
● Data preparation tasks are likely to be performed multiple times and not in a
predefined order.
● Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
Phase 2 requires the presence of an analytic sandbox, in which the team can work with
data and perform analytics for the duration of the project. The team needs to execute
extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into
the sandbox. ELT and ETL are sometimes abbreviated as ETLT. Data should be
transformed in the ETLT process so the team can work with it and analyze it. In this
phase, the team also needs to familiarize itself with the data thoroughly and take steps to
condition the data.
Phase 2: Data Preparation
1.3.1 Preparing the Analytic Sandbox
In ETL, users perform extract, transform, load processes to extract data from a
datastore, perform data transformations, and load the data back into the
datastore. However, the analytic sandbox approach differs slightly; it advocates
extract, load, and then transform. In this case, the data is extracted in its raw
form and loaded into the datastore, where analysts can choose to transform the
data into a new state or leave it in its original, raw condition.
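A minimal sketch of the ELT pattern described above, using SQLite as a stand-in sandbox and pandas for the extract and later transform steps; the file transactions_raw.csv and all table and column names are hypothetical.

```python
# A minimal sketch of ELT into an analytic sandbox (SQLite used as a stand-in sandbox).
# The file 'transactions_raw.csv' and the table/column names are hypothetical.
import sqlite3
import pandas as pd

raw = pd.read_csv("transactions_raw.csv")  # Extract the data in its raw form

conn = sqlite3.connect("analytic_sandbox.db")
raw.to_sql("transactions_raw", conn, if_exists="replace", index=False)  # Load it as-is

# Transform later, inside the sandbox, only when the analysis calls for it.
clean = pd.read_sql_query(
    "SELECT customer_id, amount FROM transactions_raw WHERE amount > 0", conn
)
clean.to_sql("transactions_clean", conn, if_exists="replace", index=False)
conn.close()
```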
1.3.3 Learning About the Data
It is important to catalog the data sources that the team has access to and identify
additional data sources that the team can leverage but perhaps does not have
access to today.
Some of the activities in this step may overlap with the initial investigation of the
datasets that occurs in the discovery phase. Doing this activity accomplishes
several goals:
● It clarifies the data that the data science team has access to at the start of the
project.
● It highlights gaps by identifying datasets within the organization that the team may
find useful but cannot access today.
1.3.4 Data Conditioning
∙ Data conditioning refers to the process of cleaning data, normalizing datasets,
and performing transformations on the data. A critical step within the Data Analytics
Lifecycle, data conditioning can involve many complex steps to join or merge
datasets or otherwise get datasets into a state that enables analysis in further
phases.
∙ Data conditioning is often viewed as a preprocessing step for the data analysis
because it involves many operations on the dataset before developing models
to process or analyze the data.
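A minimal data-conditioning sketch in Python with pandas, showing cleaning, normalizing, and merging; the files orders.csv and customers.csv and their columns are hypothetical.

```python
# A minimal sketch of data conditioning: cleaning, normalizing, and merging datasets.
# The files 'orders.csv' and 'customers.csv' and their columns are hypothetical.
import pandas as pd

orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

orders = orders.drop_duplicates()                                      # clean duplicates
orders["amount"] = orders["amount"].fillna(orders["amount"].median())  # fill missing values

# Normalize a numeric column to the 0-1 range so later models treat features comparably.
amt = orders["amount"]
orders["amount_norm"] = (amt - amt.min()) / (amt.max() - amt.min())

# Join datasets into a single, analysis-ready table.
conditioned = orders.merge(customers, on="customer_id", how="left")
print(conditioned.head())
```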
1.3.5 Survey and Visualize
∙ After the team has collected and obtained at least some of the datasets
needed for the subsequent analysis, a useful step is to leverage data
visualization tools to gain an overview of the data (a brief code sketch follows the tool list below).
∙ Alpine Miner provides a Graphical User Interface (GUI) for creating analytic workflows, including
data manipulations and a series of analytic events such as staged data-mining techniques (for
example, first select the top 100 customers, and then run descriptive statistics and clustering) on
Postgres SQL and other Big Data sources.
∙ OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for working with
messy data.” It is a popular GUI-based tool for performing data transformations, and it is one of the
most robust free tools currently available.
∙ Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and transformation.
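For teams working in code rather than a GUI, the same kind of quick survey and visualization can be sketched in a few lines of Python with pandas and matplotlib; the file customer_data.csv and its columns are hypothetical.

```python
# A minimal sketch of surveying and visualizing a dataset to gain an overview.
# The file 'customer_data.csv' and its columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_data.csv")

print(df.describe(include="all"))    # quick overview of every column

df["purchase_amount"].hist(bins=30)  # distribution of a key numeric variable
plt.xlabel("purchase_amount")
plt.ylabel("count")
plt.show()
```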
Phase 3- Model Planning:
● The team explores data to learn about the relationships between variables and,
subsequently, selects key variables and the most suitable models.
● In this phase, the data science team develops datasets for training, testing, and
production purposes.
● The team then builds and executes models, in the next phase, based on the work done in the
model planning phase.
● Several tools commonly used for this phase are MATLAB and STATISTICA.
Phase 3 is model planning, where the team determines the methods, techniques, and
workflow it intends to follow for the subsequent model building phase. The team explores
the data to learn about the relationships between variables and subsequently selects key
variables and the most suitable models.
Phase 3: Model Planning
1.4.1 Data Exploration and Variable Selection
∙ As with earlier phases of the Data Analytics Lifecycle, it is important to spend time
and focus attention on this preparatory work to make the subsequent phases of model
selection and execution easier and more efficient.
1.4.2 Model Selection
∙ In the model selection subphase, the team’s main goal is to choose an analytical
technique, or a short list of candidate techniques, based on the end goal of the project.
∙ In this case, a model simply refers to an abstraction from reality. One observes events
happening in a real-world situation or with live data and attempts to construct models
that emulate this behavior with a set of rules and conditions.
∙ In the case of machine learning and data mining, these rules and conditions are grouped
into several general sets of techniques, such as classification, association rules, and
clustering. When reviewing this list of types of potential models, the team can winnow
down the list to several viable models to try to address a given problem.
∙ An additional consideration in this area for dealing with Big Data involves determining if
the team will be using techniques that are best suited for structured data, unstructured
data, or a hybrid approach. For instance, the team can leverage MapReduce to analyze
unstructured data.
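As an illustration of winnowing down candidate techniques, the sketch below tries one supervised technique (classification) and one unsupervised technique (clustering) on the same data, using scikit-learn; the file customers_labeled.csv and its columns are hypothetical.

```python
# A minimal sketch of comparing two candidate techniques for model selection:
# classification when a label is available, clustering when it is not.
# The file 'customers_labeled.csv' and its columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

df = pd.read_csv("customers_labeled.csv")
features = df[["age", "income", "visits_per_month"]]

# Candidate 1: classification (supervised), predicting a churn label.
X_train, X_test, y_train, y_test = train_test_split(features, df["churned"], test_size=0.3)
clf = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Candidate 2: clustering (unsupervised), grouping similar customers.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(features)
print("cluster sizes:", pd.Series(clusters).value_counts().to_dict())
```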
1.4.3 Common Tools for the Model Planning Phase
∙ R has a complete set of modeling capabilities and provides a good
environment for building interpretive models with high-quality code.
Phase 6- Operationalize:
● The team communicates the benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to the full enterprise of users.
● This approach enables the team to learn about the performance and related constraints of
the model in a production environment on a small scale and make adjustments before
full deployment.
● The team delivers final reports, briefings, and code.
● Free or open source tools – Octave, WEKA, SQL, MADlib.
In Phase 6, the team delivers final reports, briefings, code, and technical documents. In
addition, the team may run a pilot project to implement the models in a production
environment.
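As a rough illustration of running models on a small scale before full deployment, the sketch below persists a trained model and then scores a batch of new records in a separate pilot step; the files, column names, and the choice of logistic regression are all hypothetical.

```python
# A minimal sketch of operationalizing a model in a small pilot: persist the trained
# model, then score new records in a separate job. All file and column names are hypothetical.
import pickle
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Training side: fit the model once and save it.
train = pd.read_csv("churn_training.csv")
model = LogisticRegression(max_iter=1000).fit(train[["age", "income"]], train["churned"])
with open("churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Pilot scoring side: load the saved model and score a small batch of new customers.
with open("churn_model.pkl", "rb") as f:
    deployed = pickle.load(f)

new_customers = pd.read_csv("new_customers.csv")
new_customers["churn_risk"] = deployed.predict_proba(new_customers[["age", "income"]])[:, 1]
new_customers.to_csv("pilot_scores.csv", index=False)  # output consumed by the pilot users
```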
Thank You