Big Data Analytics Unit1
***************************************************************************
Data Analytics Lifecycle Overview
Key Roles for a Successful Analytics Project
● Business User: Someone who understands the domain area and usually benefits from the
results. This person can consult and advise the project team on the context of the project,
the value of the results, and how the outputs will be operationalized. Usually a business
analyst, line manager, or deep subject matter expert in the project domain fulfills this role.
● Project Sponsor: Responsible for the genesis of the project. Provides the impetus and
requirements for the project and defines the core business problem. Generally provides the
funding and gauges the degree of value from the final outputs of the working team. This
person sets the priorities for the project and clarifies the desired outputs.
● Project Manager: Ensures that key milestones and objectives are met on time and at the
expected quality.
● Business Intelligence Analyst: Provides business domain expertise based on a deep
understanding of the data, key performance indicators (KPIs), key metrics, and business
intelligence from a reporting perspective. Business Intelligence Analysts generally create
dashboards and reports and have knowledge of the data feeds and sources.
● Database Administrator (DBA): Provisions and configures the database environment to
support the analytics needs of the working team. These responsibilities may include
providing access to key databases or tables and ensuring the appropriate security levels are
in place related to the data repositories.
● Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data
management and data extraction, and provides support for data ingestion into the analytic
sandbox. The data engineer works closely with the data scientist to help shape data in the
right ways for analyses.
● Data Scientist: Provides subject matter expertise for analytical techniques, data modeling,
and applying valid analytical techniques to given business problems. Ensures overall
analytics objectives are met. Designs and executes analytical methods and approaches with
the data available to the project.
***************************************************************************
Discovery
The first phase of the Data Analytics Lifecycle involves discovery (Figure 2-3). In this phase,
the data science team must learn and investigate the problem, develop context and
understanding, and learn about the data sources needed and available for the project. In
addition, the team formulates initial hypotheses that can later be tested with data.
Learning the Business Domain
Understanding the domain area of the problem is essential. In many cases, data scientists
will have deep computational and quantitative knowledge that can be broadly applied across
many disciplines. An example of this role would be someone with an advanced degree in
applied mathematics or statistics. These data scientists have deep knowledge of the
methods, techniques, and ways for applying heuristics to a variety of business and
conceptual problems.
Resources
As part of the discovery phase, the team needs to assess the resources available to support
the project. In this context, resources include technology, tools, systems, data, and people.
During this scoping, consider the available tools and technology the team will be using and
the types of systems needed for later phases to operationalize the models.
Framing the Problem
Framing the problem well is critical to the success of the project. Framing is the process of
stating the analytics problem to be solved. At this point, it is a best practice to write down
the problem statement and share it with the key stakeholders. Each team member may hear
slightly different things related to the needs and the problem and have somewhat different
ideas of possible solutions.
Identifying Key Stakeholders
Another important step is to identify the key stakeholders and their interests in the project.
During these discussions, the team can identify the success criteria, key risks, and
stakeholders, which should include anyone who will benefit from the project or will be
significantly impacted by the project. When interviewing stakeholders, learn about the
domain area and any relevant history from similar analytics projects.
Interviewing the Analytics Sponsor
The team should plan to collaborate with the stakeholders to clarify and frame the analytics
problem. At the outset, project sponsors may have a predetermined solution that may not
necessarily realize the desired outcome. In these cases, the team must use its knowledge
and expertise to identify the true underlying problem and appropriate solution.
● Prepare for the interview; draft questions, and review with colleagues.
● Use open-ended questions; avoid asking leading questions.
● Probe for details and pose follow-up questions.
● Avoid filling every silence in the conversation; give the other person time to think.
● Let the sponsors express their ideas and ask clarifying questions, such as “Why? Is that
correct? Is this idea on target? Is there anything else?”
● Use active listening techniques; repeat back what was heard to make sure the team heard
it correctly, or reframe what was said.
● Try to avoid expressing the team’s opinions, which can introduce bias; instead, focus on listening.
● Be mindful of the body language of the interviewers and stakeholders; use eye contact where appropriate, and be attentive.
● Minimize distractions.
● Document what the team heard, and review it with the sponsors.
Following is a brief list of common questions that are helpful to ask during the discovery phase when interviewing the project sponsor. The responses will begin to shape the scope of the project and give the team an idea of the goals and objectives of the project.
Developing Initial Hypotheses
Developing a set of initial hypotheses (IHs) is a key facet of the discovery phase. This step involves forming ideas that the team can test with data. Generally, it is best to come up with a few primary hypotheses to test and then be creative about developing several more. These IHs form the basis of the analytical tests the team will use in later phases and serve as the foundation for the findings in Phase 5.
Identifying Potential Data Sources
As part of the discovery phase, identify the kinds of data the team will need to solve the
problem. Consider the volume, type, and time span of the data needed to test the
hypotheses. Ensure that the team can access more than simply aggregated data. In most
cases, the team will need the raw data to avoid introducing bias for the downstream
analysis.
The team should perform five main activities during this step of the discovery phase (a brief data-preview sketch follows this list):
● Identify data sources: Make a list of candidate data sources the team may need to test the
initial hypotheses outlined in this phase. Make an inventory of the datasets currently
available and those that can be purchased or otherwise acquired for the tests the team
wants to perform.
● Capture aggregate data sources: This is for previewing the data and providing high-level
understanding. It enables the team to gain a quick overview of the data and perform further
exploration on specific areas. It also points the team to possible areas of interest within the
data.
● Review the raw data: Obtain preliminary data from initial data feeds. Begin
understanding the interdependencies among the data attributes, and become familiar with
the content of the data, its quality, and its limitations.
● Evaluate the data structures and tools needed: The data type and structure dictate which
tools the team can use to analyze the data. This evaluation gets the team thinking about
which technologies may be good candidates for the project and how to start getting access
to these tools.
● Scope the sort of data infrastructure needed for this type of problem: In addition to the
tools needed, the data influences the kind of infrastructure that’s required, such as disk
storage and network capacity.
***************************************************************************
Data Preparation
The second phase of the Data Analytics Lifecycle involves data preparation, which includes
the steps to explore, preprocess, and condition data prior to modeling and analysis. In this
phase, the team needs to create a robust environment in which it can explore the data that
is separate from a production environment. Usually, this is done by preparing an analytics
sandbox.
Preparing the Analytic Sandbox
The first subphase of data preparation requires the team to obtain an analytic sandbox (also commonly referred to as a workspace), in which the team can explore the data without interfering with live production databases.
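Below is a minimal sketch of provisioning such a sandbox, assuming a SQLite file stands in for the data warehouse; the database and table names are hypothetical.

import sqlite3

con = sqlite3.connect("analytics_sandbox.db")            # workspace separate from production
con.execute("ATTACH DATABASE 'production.db' AS prod")   # source data; treat as read-only
con.execute("""
    CREATE TABLE IF NOT EXISTS customers_sandbox AS
    SELECT * FROM prod.customers
""")                                                      # copy data into the sandbox for exploration
con.commit()
con.close()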
Performing ETLT
As the team looks to begin data transformations, make sure the analytics sandbox has ample bandwidth and reliable network connections to the underlying data sources to enable uninterrupted read and write. In ETL, users perform extract, transform, load processes to extract data from a datastore, perform data transformations, and load the data back into the datastore. The team may also extract and load the data in its raw form and transform it afterward inside the sandbox (ELT); the combination of the two approaches is referred to as ETLT.
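A minimal sketch contrasting the two orderings inside the sandbox, using pandas and sqlite3; the source file, table names, and the date transformation are assumptions for illustration.

import pandas as pd
import sqlite3

con = sqlite3.connect("analytics_sandbox.db")

# ETL: extract, transform in memory, then load the cleaned result into the sandbox.
orders = pd.read_csv("orders_extract.csv")                             # extract
orders["order_date"] = pd.to_datetime(orders["order_date"])            # transform
orders.to_sql("orders_clean", con, if_exists="replace", index=False)   # load

# ELT: load the raw extract first, then transform with SQL inside the sandbox,
# preserving the raw data for later use.
pd.read_csv("orders_extract.csv").to_sql("orders_raw", con, if_exists="replace", index=False)
con.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_day AS
    SELECT order_date, COUNT(*) AS n_orders
    FROM orders_raw
    GROUP BY order_date
""")
con.commit()
con.close()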
Learning About the Data
A critical aspect of a data science project is to become familiar with the data itself. Spending time to learn the nuances of the datasets provides context to understand what constitutes a reasonable value and expected output versus what is a surprising finding. Cataloging the data the team has access to accomplishes several goals:
● Clarifies the data that the data science team has access to at the start of the project
● Highlights gaps by identifying datasets within an organization that the team may find
useful but may not be accessible to the team today. As a consequence, this activity can
trigger a project to begin building relationships with the data owners and finding ways to
share data in appropriate ways. In addition, this activity may provide an impetus to begin
collecting new data that benefits the organization or a specific long-term project.
● Identifies datasets outside the organization that may be useful to obtain, through open APIs, data sharing, or purchasing data to supplement already existing datasets (a brief sketch of pulling data from an open API follows this list).
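As noted in the last item, here is a minimal sketch of pulling a supplemental dataset from an open API; the URL, parameters, and response fields are hypothetical placeholders, not a real service.

import requests
import pandas as pd

resp = requests.get(
    "https://example.org/api/v1/demographics",   # hypothetical open data endpoint
    params={"region": "US", "year": 2023},
    timeout=30,
)
resp.raise_for_status()
external = pd.DataFrame(resp.json()["records"])  # record layout depends on the actual API
print(external.head())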
Data Conditioning
Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data. A critical step within the Data Analytics Lifecycle, data conditioning can involve many complex steps to join or merge datasets or otherwise get datasets into a state that enables analysis in further phases. Key questions to consider include the following (a brief sketch of such checks follows the list):
● What are the data sources? What are the target fields (for example, columns of the
tables)?
● How clean is the data?
● How consistent are the contents and files? Determine to what degree the data contains
missing or inconsistent values and if the data contains values deviating from normal.
● Assess the consistency of the data types. For instance, if the team expects certain data to be numeric, confirm it is numeric or determine whether it is a mixture of alphanumeric strings and text.
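As referenced above, a minimal pandas sketch of such data-conditioning checks; the file name and the target fields (age, annual_income) are assumptions for illustration.

import pandas as pd

df = pd.read_csv("customers_sandbox.csv")

print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
print(df.dtypes)                                        # check the data types actually loaded

# Flag columns expected to be numeric that arrived as mixed alphanumeric strings.
expected_numeric = ["age", "annual_income"]             # hypothetical target fields
for col in expected_numeric:
    bad = pd.to_numeric(df[col], errors="coerce").isna() & df[col].notna()
    print(col, "non-numeric values:", int(bad.sum()))

# Example transformation: coerce to numeric and normalize to [0, 1] for later modeling phases.
df["annual_income"] = pd.to_numeric(df["annual_income"], errors="coerce")
df["annual_income_norm"] = (df["annual_income"] - df["annual_income"].min()) / (
    df["annual_income"].max() - df["annual_income"].min()
)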
***************************************************************************