
Data Analytics and

Visualization
Course Code- CSC601
Module I- Introduction to Data analytics and life cycle (5Hr CO1)
Data Analytics Lifecycle Overview: Key Roles for a Successful Analytics Project, Background and Overview of the Data Analytics Lifecycle
Phase 1: Discovery: Learning the Business Domain, Resources and Framing the Problem, Identifying Key Stakeholders, Interviewing the Analytics Sponsor, Developing Initial Hypotheses and Identifying Potential Data Sources
Phase 2: Data Preparation: Preparing the Analytic Sandbox, Performing ETLT, Learning About the Data, Data Conditioning, Survey and Visualize, Common Tools for the Data Preparation Phase
Phase 3: Model Planning: Data Exploration and Variable Selection, Model Selection, Common Tools for the Model Planning Phase
Phase 4: Model Building: Common Tools for the Model Building Phase
Phase 5: Communicate Results
Phase 6: Operationalize
Introduction to Data Analytics
Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming, and operations research to quantify performance. Analytics often favors data visualization to communicate insight.
Firms commonly apply analytics to business data to describe, predict, and improve business performance. Areas within business analytics include predictive analytics, enterprise decision management, etc. Since analytics can require extensive computation (because of big data), the algorithms and software used for analytics harness the most current methods in computer science.
In a nutshell, analytics is the scientific process of transforming data into insight for making better decisions. The goal of data analytics is to get actionable insights resulting in smarter decisions and better business outcomes.
It is critical to design and build a data warehouse or Business Intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse data sets.
There are four types of data analytics:

1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics

Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable outcome of an event or the likelihood of a situation occurring.
Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics are listed below; a minimal sketch follows the list:
● Linear Regression
● Time series analysis and forecasting
● Data Mining
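As a rough illustration of the idea, the sketch below fits a simple linear regression to synthetic monthly sales figures (the data and numbers are assumptions for illustration, not from the text) and forecasts the next period:

import numpy as np

# Synthetic historical monthly sales (assumed data for illustration only)
months = np.arange(1, 13)                                # months 1..12
sales = 100 + 5 * months + np.random.normal(0, 3, 12)    # roughly linear trend with noise

# Fit a simple linear regression (degree-1 polynomial) to the history
slope, intercept = np.polyfit(months, sales, 1)

# Predict the probable outcome for the next month (month 13)
forecast = slope * 13 + intercept
print(f"Forecast for month 13: {forecast:.1f}")

The same pattern extends to time series models, which additionally account for trend and seasonality.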
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It examines past performance by mining historical data to understand the causes of past success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historical reviews, such as the items below; a short sketch follows the list:
● Data Queries
● Reports
● Descriptive Statistics
● Data dashboard
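A minimal sketch of descriptive statistics and a report-style aggregation, assuming pandas is available (the sample sales records are made up for illustration):

import pandas as pd

# Assumed sample of historical sales records (illustrative only)
sales = pd.DataFrame({
    "region":  ["North", "South", "North", "West", "South"],
    "revenue": [1200, 950, 1430, 800, 1100],
})

# Descriptive statistics summarising past performance
print(sales["revenue"].describe())            # count, mean, std, min, quartiles, max

# A simple report-style aggregation by region
print(sales.groupby("region")["revenue"].sum())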
Prescriptive Analytics

Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests a decision option to take advantage of the prediction.

Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implications of each decision option.

For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data on external factors such as economic data, population demographics, etc.
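A toy sketch of the prediction-to-decision step, continuing the healthcare example (the predicted patient load, staffing options, and cost figures are assumptions invented for illustration):

# Assumed output of an earlier predictive step
predicted_patients = 120

# Candidate decision options: staffing level -> (capacity, staffing cost)
options = {
    "small team":  (100, 5000),
    "medium team": (130, 7000),
    "large team":  (160, 9000),
}

PENALTY_PER_UNSERVED = 200   # assumed cost of each patient above capacity

def total_cost(capacity, staffing_cost):
    # Implication of an option: staffing cost plus penalty for unmet demand
    unserved = max(0, predicted_patients - capacity)
    return staffing_cost + unserved * PENALTY_PER_UNSERVED

# Show the implication of each option, then suggest the cheapest one
for name, (cap, cost) in options.items():
    print(name, total_cost(cap, cost))
best = min(options, key=lambda name: total_cost(*options[name]))
print("Suggested option:", best)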
Diagnostic Analytics

In this analysis, we generally use historical data to answer a question or diagnose the root cause of a problem. We try to find dependencies and patterns in the historical data related to the particular problem.
For example, companies choose this analysis because it gives great insight into a problem and because they already keep detailed information at their disposal; otherwise, data collection could become a separate effort for every problem and would be very time-consuming.
Common techniques used for diagnostic analytics are listed below; a brief sketch follows the list:

● Data discovery
● Data mining
● Correlations
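A small correlation sketch with pandas, assuming a few candidate drivers of an outcome (the dataset is synthetic and purely illustrative):

import pandas as pd

# Assumed historical data for a problem under investigation (illustrative only)
df = pd.DataFrame({
    "ad_spend":     [10, 15, 8, 20, 12, 18],
    "discount_pct": [5, 0, 10, 0, 5, 0],
    "units_sold":   [110, 150, 105, 190, 125, 170],
})

# Correlation matrix: which factors move together with the outcome?
print(df.corr())

# Pairwise correlations against units_sold hint at likely drivers,
# which the team would then investigate further.
print(df.corr()["units_sold"].sort_values(ascending=False))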
Data Analytics Lifecycle Overview
Key Roles for a Successful Analytics Project-
The seven roles are as follows:

1. Business User
2. Project Sponsor
3. Project Manager
4. Business Intelligence Analyst
5. Database Administrator (DBA)
6. Data Engineer
7. Data Scientist
1. Business User
● The business user is the one who understands the main domain area of the project and typically benefits directly from the results.
● This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be operationalized.
● A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.

2. Project Sponsor
● The project sponsor is the one responsible for initiating the project. The project sponsor provides the actual requirements for the project and presents the core business problem.
● He or she generally provides the funding and gauges the degree of value from the final outputs of the team working on the project.
● This person sets the priorities for the project and clarifies the desired outputs.
3. Project Manager:
● This person ensures that key milestones and objectives of the project are met on time and with the expected quality.

4. Business Intelligence Analyst :


● The business intelligence analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective.
● This person generally creates dashboards and reports and knows about the data feeds and sources.
5. Database Administrator (DBA):
● The DBA provisions and configures the database environment to support the analytics needs of the team working on the project.
● His or her responsibilities may include providing access to key databases or tables and ensuring that the appropriate security levels are in place for the data repositories.
6. Data Engineer:
● The data engineer brings deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox.
● The data engineer works closely with the data scientist to help shape data into the right form for analysis.

7. Data Scientist:
● The data scientist provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems.
● He or she ensures that the overall analytical objectives are met.
● Data scientists design and apply analytical methods and approaches to the data available for the project.
Data Analytics Lifecycle
Phase 1- Discovery
● The data science team learns and investigates the problem.
● The team develops context and understanding.
● The team comes to know about the data sources needed and available for the project.
● The team formulates initial hypotheses that can later be tested with data.

In Phase 1, the team learns the business domain, including relevant history such as
whether the organization or business unit has attempted similar projects in the past from
which they can learn. The team assesses the resources available to support the project
in terms of people, technology, time, and data. Important activities in this phase include
framing the business problem as an analytics challenge that can be addressed in
subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the
data.
Phase 1: Discovery
1.2.1 Learning the Business Domain

1.2.2 Resources and Framing the Problem

1.2.3 Identifying Key Stakeholders

1.2.4 Interviewing the Analytics Sponsor

1.2.5 Developing Initial Hypotheses and Identifying Potential Data Sources


1.2.1 Learning the Business Domain

In many cases, data scientists will have deep computational and quantitative
knowledge that can be broadly applied across many disciplines. An example of
this role would be someone with an advanced degree in applied
mathematics or statistics.
These data scientists have deep knowledge of the methods, techniques, and
ways for applying heuristics to a variety of business and conceptual
problems.
Others in this area may have deep knowledge of a domain area, coupled with
quantitative expertise.
1.2.2 Resources and Framing the Problem

As part of the discovery phase, the team needs to assess the resources available to support the
project. In this context, resources include technology, tools, systems, data, and people.

During this scoping, consider the available tools and technology the team will be using and the
types of systems needed for later phases to operationalize the models.

For the project to have long-term success, what types of skills and roles will be needed for the
recipients of the model being developed? Does the requisite level of expertise exist within the
organization today, or will it need to be cultivated?

Answering these questions will influence the techniques the team selects and the kind of
implementation the team chooses to pursue in subsequent phases of the Data Analytics
Lifecycle.
1.2.3 Identifying Key Stakeholders

Each team member may hear slightly different things related to the needs and the problem and
have somewhat different ideas of possible solutions. For these reasons, it is crucial to state
the analytics problem, as well as why and to whom it is important.

Essentially, the team needs to clearly articulate the current situation and its main challenges.
As part of this activity, it is important to identify the main objectives of the project, identify what
needs to be achieved in business terms, and identify what needs to be done to meet the needs.

Additionally, consider the objectives and the success criteria for the project. What is the team
attempting to achieve by doing the project, and what will be considered “good enough” as an
outcome of the project? This is critical to document and share with the project team and key
stakeholders.
1.2.4 Interviewing the Analytics Sponsor
The team should plan to collaborate with the stakeholders to clarify and frame the
analytics problem.
At the outset, project sponsors may have a predetermined solution that may not
necessarily realize the desired outcome. In these cases, the team must use its
knowledge and expertise to identify the true underlying problem and appropriate
solution.
In essence, the data science team can take a more objective approach, as the
stakeholders may have developed biases over time, based on their experience.
Also, what may have been true in the past may no longer be a valid working
assumption.
1.2.5 Developing Initial Hypotheses and Identifying Potential Data Sources

Identify data sources: Make a list of candidate data sources the team may need to test the initial
hypotheses outlined in this phase. Make an inventory of the datasets currently available and
those that can be purchased or otherwise acquired for the tests the team wants to perform.
Capture aggregate data sources: This is for previewing the data and providing high-level
understanding. It enables the team to gain a quick overview of the data and perform further
exploration on specific areas. It also points the team to possible areas of interest within the
data.

Review the raw data: Obtain preliminary data from initial data feeds. Begin understanding the interdependencies among the data attributes, and become familiar with the content of the data, its quality, and its limitations.

Evaluate the data structures and tools needed: The data type and structure dictate which tools the team can use to analyze the data. This evaluation gets the team thinking about which technologies may be good candidates for the project and how to start getting access to these tools.
Phase 2- Data preparation:

● Steps to explore, preprocess, and condition data prior to modeling and analysis.
● It requires the presence of an analytic sandbox; the team executes extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox.
● Data preparation tasks are likely to be performed multiple times and not in a predefined order.
● Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
Phase 2 requires the presence of an analytic sandbox, in which the team can work with
data and perform analytics for the duration of the project. The team needs to execute
extract, load, and transform (ELT) or extract, transform and load (ETL) to get data into
the sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should be
transformed in the ETLT process so the team can work with it and analyze it. In this
phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
Phase 2: Data Preparation
1.3.1 Preparing the Analytic Sandbox

1.3.2 Performing ETLT

1.3.3 Learning About the Data

1.3.4 Data Conditioning

1.3.5 Survey and Visualize

1.3.6 Common Tools for the Data Preparation Phase


1.3.1 Preparing the Analytic Sandbox
The first subphase of data preparation requires the team to obtain an analytic
sandbox (also commonly referred to as a workspace), in which the team can
explore the data without interfering with live production databases.
Consider an example in which the team needs to work with a company’s financial
data.
The team should access a copy of the financial data from the analytic sandbox
rather than interacting with the production version of the organization’s main
database, because that will be tightly controlled and needed for financial
reporting.
1.3.2 Performing ETLT
As the team looks to begin data transformations, make sure the analytics
sandbox has ample bandwidth and reliable network connections to the
underlying data sources to enable uninterrupted read and write.

In ETL, users perform extract, transform, load processes to extract data from a
datastore, perform data transformations, and load the data back into the
datastore. However, the analytic sandbox approach differs slightly; it advocates
extract, load, and then transform. In this case, the data is extracted in its raw
form and loaded into the datastore, where analysts can choose to transform the
data into a new state or leave it in its original, raw condition.
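A rough sketch of the ELT pattern described above, using SQLite and pandas as stand-ins for the sandbox datastore (the file, table, and column names are assumptions for this sketch, not taken from the text):

import sqlite3
import pandas as pd

# Extract: pull raw data from a source system (here, an assumed CSV export)
raw = pd.read_csv("transactions_raw.csv")

# Load: land the data in its raw form in the analytic sandbox
sandbox = sqlite3.connect("analytic_sandbox.db")
raw.to_sql("transactions_raw", sandbox, if_exists="replace", index=False)

# Transform (later, inside the sandbox): analysts may reshape the raw data
# into a new state, or keep working with the original raw table.
clean = pd.read_sql("SELECT * FROM transactions_raw WHERE amount IS NOT NULL", sandbox)
clean.to_sql("transactions_clean", sandbox, if_exists="replace", index=False)

The key design point is that the raw table remains available untouched, so analysts can revisit the original data if a transformation turns out to discard something useful.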
1.3.3 Learning About the Data
It is important to catalog the data sources that the team has access to and identify
additional data sources that the team can leverage but perhaps does not have
access to today.

Some of the activities in this step may overlap with the initial investigation of the datasets that occurs in the discovery phase. Doing this activity accomplishes several goals:

● It clarifies the data that the data science team has access to at the start of the project.

● It highlights gaps by identifying datasets within the organization that the team may find useful but may not be accessible to the team today.
1.3.4 Data Conditioning
∙ Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.

∙ A critical step within the Data Analytics Lifecycle, data conditioning can involve many complex steps to join or merge datasets or otherwise get datasets into a state that enables analysis in further phases.

∙ Data conditioning is often viewed as a preprocessing step for the data analysis because it involves many operations on the dataset before developing models to process or analyze the data.
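A minimal data conditioning sketch with pandas (the customer dataset and its fields are assumptions for illustration): it cleans duplicates, fills missing values, and normalizes one feature.

import pandas as pd

# Assumed raw customer dataset (illustrative only)
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, None, None, 51, 29],
    "income":      [52000, 61000, 61000, None, 43000],
})

# Cleaning: drop duplicate records and fill missing values with the median
conditioned = customers.drop_duplicates(subset="customer_id")
conditioned = conditioned.fillna({"age": conditioned["age"].median(),
                                  "income": conditioned["income"].median()})

# Normalizing: scale income to the 0-1 range so features are comparable
inc = conditioned["income"]
conditioned["income_norm"] = (inc - inc.min()) / (inc.max() - inc.min())
print(conditioned)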
1.3.5 Survey and Visualize
∙ After the team has collected and obtained at least some of the datasets needed for the subsequent analysis, a useful step is to leverage data visualization tools to gain an overview of the data.

∙ Seeing high-level patterns in the data enables one to understand characteristics about the data very quickly.

∙ One example is using data visualization to examine data quality, such as whether the data contains many unexpected values or other indicators of dirty data.

∙ Another example is skewness, such as if the majority of the data is heavily shifted toward one value or end of a continuum.
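A quick survey-and-visualize sketch, assuming matplotlib is available; the order-value sample is synthetic and deliberately skewed to show what such a check reveals:

import numpy as np
import matplotlib.pyplot as plt

# Assumed sample of one numeric attribute from a collected dataset
order_values = np.random.exponential(scale=50, size=1000)   # deliberately skewed

# A quick histogram reveals skewness and suspicious values at a glance
plt.hist(order_values, bins=40)
plt.title("Order value distribution (data quality and skew survey)")
plt.xlabel("order value")
plt.ylabel("count")
plt.show()

# Simple numeric check for unexpected values (e.g., negatives that should not occur)
print("negative values:", (order_values < 0).sum())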
1.3.6 Common Tools for the Data Preparation Phase
∙ Hadoop can perform massively parallel ingest and custom analysis for web traffic parsing, GPS
location analytics, genomic analysis, and combining of massive unstructured data feeds from multiple
sources.

∙ Alpine Miner provides a Graphical User Interface (GUI) for creating analytic workflows, including data manipulations and a series of analytic events such as staged data-mining techniques (for example, first select the top 100 customers, and then run descriptive statistics and clustering) on PostgreSQL and other Big Data sources.

∙ OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for working with messy data.” It is a popular GUI-based tool for performing data transformations, and it is one of the most robust free tools currently available.

∙ Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and transformation.
Phase 3-Model planning:
● The team explores data to learn about relationships between variables and subsequently selects key variables and the most suitable models.
● In this phase, the data science team develops datasets for training, testing, and production purposes.
● The team builds and executes models based on the work done in the model planning phase.
● Several tools commonly used for this phase are MATLAB and STATISTICA.
Phase 3 is model planning, where the team determines the methods, techniques, and
workflow it intends to follow for the subsequent model building phase. The team explores
the data to learn about the relationships between variables and subsequently selects key
variables and the most suitable models.
Phase 3: Model Planning
1.4.1 Data Exploration and Variable Selection

1.4.2 Model Selection

1.4.3 Common Tools for the Model Planning Phase


1.4.1 Data Exploration and Variable Selection
∙ Although some data exploration takes place in the data preparation phase, those activities focus mainly on data hygiene and on assessing the quality of the data itself.

∙ In Phase 3, the objective of the data exploration is to understand the relationships among the variables to inform selection of the variables and methods and to understand the problem domain.

∙ As with earlier phases of the Data Analytics Lifecycle, it is important to spend time and focus attention on this preparatory work to make the subsequent phases of model selection and execution easier and more efficient.
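A small exploration-and-variable-selection sketch, assuming pandas and NumPy; the churn-style dataset, its column names, and the 0.1 correlation cutoff are all assumptions chosen only to illustrate the idea:

import numpy as np
import pandas as pd

# Assumed candidate predictor variables and a target (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, 200),
    "monthly_spend": rng.normal(70, 15, 200),
    "support_calls": rng.integers(0, 10, 200),
})
df["churned"] = ((df["support_calls"] > 6) | (df["monthly_spend"] < 50)).astype(int)

# Explore relationships: correlation of each candidate variable with the target
correlations = df.corr()["churned"].drop("churned").abs().sort_values(ascending=False)
print(correlations)

# Keep the variables most related to the outcome as a starting shortlist
selected = correlations[correlations > 0.1].index.tolist()
print("candidate variables:", selected)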
1.4.2 Model Selection
∙ In the model selection subphase, the team’s main goal is to choose an analytical
technique, or a short list of candidate techniques, based on the end goal of the project.

∙ In this case, a model simply refers to an abstraction from reality. One observes events
happening in a real-world situation or with live data and attempts to construct models
that emulate this behavior with a set of rules and conditions.

∙ In the case of machine learning and data mining, these rules and conditions are grouped
into several general sets of techniques, such as classification, association rules, and
clustering. When reviewing this list of types of potential models, the team can winnow
down the list to several viable models to try to address a given problem.

∙ An additional consideration in this area for dealing with Big Data involves determining if
the team will be using techniques that are best suited for structured data, unstructured
data, or a hybrid approach. For instance, the team can leverage MapReduce to analyze
unstructured data.
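A hedged sketch of winnowing a short list of candidate techniques, assuming scikit-learn is available and using synthetic data in place of the project dataset; the two candidates shown are examples, not a recommendation from the text:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for the prepared analytic dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Short list of candidate techniques chosen for the project's end goal
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree":       DecisionTreeClassifier(max_depth=5),
}

# Quick cross-validated comparison to winnow the list to viable models
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {score:.3f}")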
1.4.3 Common Tools for the Model Planning Phase
∙ R has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code.

∙ In addition, it has the ability to interface with databases via an ODBC connection and execute statistical tests and analyses against Big Data via an open source connection.

∙ SQL Analysis services can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models.

∙ SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB.
Phase 4-Model building:
● The team develops datasets for testing, training, and production purposes.
● The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing models.
● Free or open-source tools: R and PL/R, Octave, WEKA.
● Commercial tools: MATLAB, STATISTICA.
In Phase 4, the team develops datasets for testing, training, and production purposes. In
addition, in this phase the team builds and executes models based on the work done in
the model planning phase. The team also considers whether its existing tools will suffice
for running the models, or if it will need a more robust environment for executing models
and work flows (for example, fast hardware and parallel processing, if applicable).
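A minimal model building sketch under the same assumptions as before (scikit-learn available, synthetic data standing in for the project dataset), showing the split into training and testing sets and an evaluation on held-out data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the prepared analytic dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Develop separate datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build and execute the model chosen in the model planning phase
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data before considering production use
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))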
Phase 5-Communicate results:

● After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
● The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats and assumptions.
● The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
In Phase 5, the team, in collaboration with major stakeholders, determines if the
results of the project are a success or a failure based on the criteria developed in
Phase 1. The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
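A toy sketch of comparing model outcomes to the success criteria agreed in Phase 1 (the threshold, model accuracy, and baseline figures below are invented for illustration):

# Assumed success criterion agreed with stakeholders in Phase 1
SUCCESS_THRESHOLD = 0.80          # e.g., model must reach 80% accuracy

# Assumed outcome from the model building phase
model_accuracy = 0.84

# Assumed performance of the current manual process, for the business narrative
baseline_accuracy = 0.70

if model_accuracy >= SUCCESS_THRESHOLD:
    uplift = model_accuracy - baseline_accuracy
    print(f"Success: accuracy {model_accuracy:.0%} meets the {SUCCESS_THRESHOLD:.0%} criterion, "
          f"an uplift of {uplift:.0%} over the current process.")
else:
    print("Criterion not met: revisit the model or reframe expectations with the sponsor.")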
Phase 6-Operationalize:

● The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise of users.
● This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
● The team delivers final reports, briefings, and code.
● Free or open-source tools: Octave, WEKA, SQL, MADlib.
In Phase 6, the team delivers final reports, briefings, code, and technical documents. In
addition, the team may run a pilot project to implement the models in a production
environment.
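One simple way to picture the handoff to a pilot environment, assuming the scikit-learn model from the earlier sketches: persist the trained model as a deliverable, then load and score with it in the pilot before broadening the rollout (file name and scope are assumptions).

import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a final model (synthetic data standing in for the project dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Deliverable: persist the trained model so the pilot environment can load it
with open("churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# In the pilot production environment: load the model and score a small batch first,
# monitoring performance and constraints before broadening to all users.
with open("churn_model.pkl", "rb") as f:
    deployed = pickle.load(f)
print(deployed.predict(X[:5]))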
Thank You
