Big Data Analytics Unit 1

Introduction to Big Data

1.1 Big Data Overview


Data is created constantly, and at an ever-increasing rate. Mobile phones, social media,
and the imaging technologies used to determine a medical diagnosis all create new data
that must be stored somewhere for some purpose. Devices and sensors
automatically generate diagnostic information that needs to be stored and processed in real
time. Merely keeping up with this huge influx of data is difficult, but substantially more
challenging is analyzing vast amounts of it, especially when it does not conform to
traditional notions of data structure, to identify meaningful patterns and extract useful
information. These challenges of the data deluge present the opportunity to transform
business, government, science, and everyday life. Several industries have led the way in
developing their ability to gather and exploit data:
● Credit card companies monitor every purchase their customers make and can identify
fraudulent purchases with a high degree of accuracy using rules derived by processing
billions of transactions.
● Mobile phone companies analyze subscribers’ calling patterns to determine, for example,
whether a caller’s frequent contacts are on a rival network. If that rival network is offering
an attractive promotion that might cause the subscriber to defect, the mobile phone
company can proactively offer the subscriber an incentive to remain in her contract.
● For companies such as LinkedIn and Facebook, data itself is their primary product. The
valuations of these companies are heavily derived from the data they gather and host, which
contains more and more intrinsic value as the data grows.

Three attributes stand out as defining Big Data characteristics:
● Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions
of rows and millions of columns.
● Complexity of data types and structures: Big Data reflects the variety of new data
sources, formats, and structures, including digital traces being left on the web and other
digital repositories for subsequent analysis.
● Speed of new data creation and growth: Big Data can describe high velocity data, with
rapid data ingestion and near real time analysis. Although the volume of Big Data tends to
attract the most attention, generally the variety and velocity of the data provide a more apt
definition of Big Data.
***************************************************************************
1.2 State of the Practice in Analytics
Current business problems provide many opportunities for organizations to become more
analytical and data driven, as shown in Table 1-2.

1.2.1 BI Versus Data Science


The four business drivers shown in Table 1-2 require a variety of analytical techniques to
address them properly. Although much is written generally about analytics, it is important
to distinguish between BI and Data Science. As shown in Figure 1-8, there are several ways
to compare these groups of analytical techniques.
One way to evaluate the type of analysis being performed is to examine the time horizon
and the kind of analytical approaches being used. BI tends to provide reports, dashboards,
and queries on business questions for the current period or in the past. BI systems make it
easy to answer questions about quarter-to-date revenue, progress toward quarterly
targets, and how much of a given product was sold in a prior quarter or year.
These questions tend to be closed-ended and explain current or past behavior, typically by
aggregating historical data and grouping it in some way. BI provides hindsight and some
insight and generally answers questions related to “when” and “where” events occurred. By
comparison, Data Science tends to use disaggregated data in a more forward-looking,
exploratory way, focusing on analyzing the present and enabling informed decisions about
the future.
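To make the contrast concrete, here is a minimal sketch in Python with pandas, using a small invented sales table; all column names and values are illustrative assumptions, not data from the text. The grouped sum answers a closed-ended, BI-style question about a past period, while the row-level inspection is a first step toward the exploratory, disaggregated analysis characteristic of Data Science.

import pandas as pd

# Hypothetical sales records; names and numbers are invented for illustration.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "product": ["A", "B", "A", "B"],
    "revenue": [120.0, 80.0, 150.0, 95.0],
})

# BI-style, closed-ended question: how much revenue was earned per quarter?
print(sales.groupby("quarter")["revenue"].sum())

# Data-Science-style starting point: inspect disaggregated, row-level records
# and summary statistics to generate hypotheses about future behavior.
print(sales.describe())
print(sales.sort_values("revenue", ascending=False).head())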
Drivers of Big Data
To better understand the market drivers related to Big Data, it is helpful to first understand
the history of data stores and the kinds of repositories and tools used to manage them.
Among the newer sources of large-scale data driving this market are:
● Medical information, such as genomic sequencing and diagnostic imaging
● Photos and video footage uploaded to the World Wide Web
● Video surveillance, such as the thousands of video cameras spread across a city
● Mobile devices, which provide geospatial location data of the users, as well as metadata
about text messages, phone calls, and application usage on smart phones
● Smart devices, which provide sensor-based collection of information from smart electric
grids, smart buildings, and many other public and industry infrastructures
● Nontraditional IT devices, including the use of radio-frequency identification (RFID)
readers, GPS navigation systems, and seismic processing

***************************************************************************
Data Analytics Lifecycle Overview
Key Roles for a Successful Analytics Project
● Business User: Someone who understands the domain area and usually benefits from the
results. This person can consult and advise the project team on the context of the project,
the value of the results, and how the outputs will be operationalized. Usually a business
analyst, line manager, or deep subject matter expert in the project domain fulfills this role.
● Project Sponsor: Responsible for the genesis of the project. Provides the impetus and
requirements for the project and defines the core business problem. Generally provides the
funding and gauges the degree of value from the final outputs of the working team. This
person sets the priorities for the project and clarifies the desired outputs.
● Project Manager: Ensures that key milestones and objectives are met on time and at the
expected quality.
● Business Intelligence Analyst: Provides business domain expertise based on a deep
understanding of the data, key performance indicators (KPIs), key metrics, and business
intelligence from a reporting perspective. Business Intelligence Analysts generally create
dashboards and reports and have knowledge of the data feeds and sources.
● Database Administrator (DBA): Provisions and configures the database environment to
support the analytics needs of the working team. These responsibilities may include
providing access to key databases or tables and ensuring the appropriate security levels are
in place related to the data repositories.
● Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data
management and data extraction, and provides support for data ingestion into the analytic
sandbox. The data engineer works closely with the data scientist to help shape data in the
right ways for analyses.
● Data Scientist: Provides subject matter expertise for analytical techniques, data modeling,
and applying valid analytical techniques to given business problems. Ensures overall
analytics objectives are met. Designs and executes analytical methods and approaches with
the data available to the project.
***************************************************************************
Discovery
The first phase of the Data Analytics Lifecycle involves discovery (Figure 2-3). In this phase,
the data science team must learn and investigate the problem, develop context and
understanding, and learn about the data sources needed and available for the project. In
addition, the team formulates initial hypotheses that can later be tested with data.
Learning the Business Domain
Understanding the domain area of the problem is essential. In many cases, data scientists
will have deep computational and quantitative knowledge that can be broadly applied across
many disciplines. An example of this role would be someone with an advanced degree in
applied mathematics or statistics. These data scientists have deep knowledge of the
methods, techniques, and ways of applying heuristics to a variety of business and
conceptual problems.
Resources
As part of the discovery phase, the team needs to assess the resources available to support
the project. In this context, resources include technology, tools, systems, data, and people.
During this scoping, consider the available tools and technology the team will be using and
the types of systems needed for later phases to operationalize the models.
Framing the Problem
Framing the problem well is critical to the success of the project. Framing is the process of
stating the analytics problem to be solved. At this point, it is a best practice to write down
the problem statement and share it with the key stakeholders. Each team member may hear
slightly different things related to the needs and the problem and have somewhat different
ideas of possible solutions.
Identifying Key Stakeholders
Another important step is to identify the key stakeholders and their interests in the project.
During these discussions, the team can identify the success criteria, key risks, and
stakeholders, which should include anyone who will benefit from the project or will be
significantly impacted by the project. When interviewing stakeholders, learn about the
domain area and any relevant history from similar analytics projects.
Interviewing the Analytics Sponsor
The team should plan to collaborate with the stakeholders to clarify and frame the analytics
problem. At the outset, project sponsors may have a predetermined solution that may not
necessarily realize the desired outcome. In these cases, the team must use its knowledge
and expertise to identify the true underlying problem and appropriate solution.
● Prepare for the interview; draft questions, and review with colleagues.
● Use open-ended questions; avoid asking leading questions.
● Probe for details and pose follow-up questions.
● Avoid filling every silence in the conversation; give the other person time to think.
● Let the sponsors express their ideas and ask clarifying questions, such as “Why? Is that
correct? Is this idea on target? Is there anything else?”
● Use active listening techniques; repeat back what was heard to make sure the team heard
it correctly, or reframe what was said.
● Try to avoid expressing the team’s opinions, which can introduce bias; instead, focus on
listening.
● Be mindful of the body language of the interviewers and stakeholders; use eye contact
where appropriate, and be attentive.
● Minimize distractions.
● Document what the team heard, and review it with the sponsors.
The answers to common interview questions posed to the project sponsor during the
discovery phase will begin to shape the scope of the project and give the team an idea of
its goals and objectives.
Developing Initial Hypotheses
Developing a set of initial hypotheses (IHs) is a key facet of the discovery phase. This step
involves forming ideas that the team can test with data. Generally, it is best to come up
with a few primary hypotheses to test and then be creative about developing several more.
These IHs form the basis of the analytical tests the team will use in later phases and serve
as the foundation for the findings in Phase 5.
Identifying Potential Data Sources
As part of the discovery phase, identify the kinds of data the team will need to solve the
problem. Consider the volume, type, and time span of the data needed to test the
hypotheses. Ensure that the team can access more than simply aggregated data. In most
cases, the team will need the raw data to avoid introducing bias for the downstream
analysis.
The team should perform five main activities during this step of the discovery phase:
● Identify data sources: Make a list of candidate data sources the team may need to test the
initial hypotheses outlined in this phase. Make an inventory of the datasets currently
available and those that can be purchased or otherwise acquired for the tests the team
wants to perform.
● Capture aggregate data sources: This is for previewing the data and providing high-level
understanding. It enables the team to gain a quick overview of the data and perform further
exploration on specific areas. It also points the team to possible areas of interest within the
data.
● Review the raw data: Obtain preliminary data from initial data feeds. Begin
understanding the interdependencies among the data attributes, and become familiar with
the content of the data, its quality, and its limitations.
● Evaluate the data structures and tools needed: The data type and structure dictate which
tools the team can use to analyze the data. This evaluation gets the team thinking about
which technologies may be good candidates for the project and how to start getting access
to these tools.
● Scope the sort of data infrastructure needed for this type of problem: In addition to the
tools needed, the data influences the kind of infrastructure that’s required, such as disk
storage and network capacity.
***************************************************************************
Data Preparation
The second phase of the Data Analytics Lifecycle involves data preparation, which includes
the steps to explore, preprocess, and condition data prior to modeling and analysis. In this
phase, the team needs to create a robust environment in which it can explore the data that
is separate from a production environment. Usually, this is done by preparing an analytics
sandbox.
Preparing the Analytic Sandbox
The first subphase of data preparation requires the team to obtain an analytic sandbox (also
commonly referred to as a workspace), in which the team can explore the data without
interfering with live production databases.
Performing ETLT
As the team looks to begin data transformations, make sure the analytics sandbox has ample
bandwidth and reliable network connections to the underlying data sources to enable
uninterrupted reads and writes. In ETL, users perform extract, transform, load processes to
extract data from a datastore, perform data transformations, and load the data back into the
datastore. In ELT, raw data is instead loaded into the sandbox first and transformed there;
ETLT is a hybrid that applies both approaches, so the team has access to both conditioned
and raw data.
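As a rough illustration of the ETL pattern described above (a sketch only, not a prescribed implementation), the following Python code uses sqlite3 and pandas against an in-memory datastore; the table and column names are invented assumptions.

import sqlite3

import pandas as pd

# Create a small in-memory datastore with illustrative raw records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, 1999), (2, 4550), (3, 1250)])

# Extract: pull the raw records out of the datastore.
df = pd.read_sql_query("SELECT * FROM raw_orders", conn)

# Transform: convert cents to dollars for downstream analysis.
df["amount_dollars"] = df["amount_cents"] / 100.0

# Load: write the transformed table back into the datastore.
df.to_sql("orders_clean", conn, index=False)
print(pd.read_sql_query("SELECT * FROM orders_clean", conn))

In an ELT or ETLT variant, the raw table would also be kept in the sandbox so analysts can work with untransformed records.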
Learning About the Data
A critical aspect of a data science project is to become familiar with the data itself. Spending
time to learn the nuances of the datasets provides context to understand what constitutes a
reasonable value and expected output versus what is a surprising finding. Cataloging the
data in this way accomplishes several goals:
● Clarifies the data that the data science team has access to at the start of the project
● Highlights gaps by identifying datasets within an organization that the team may find
useful but may not be accessible to the team today. As a consequence, this activity can
trigger a project to begin building relationships with the data owners and finding ways to
share data in appropriate ways. In addition, this activity may provide an impetus to begin
collecting new data that benefits the organization or a specific long-term project.
● Identifies datasets outside the organization that may be useful to obtain, through open
APIs, data sharing, or purchasing data to supplement already existing datasets

Data Conditioning
Data conditioning refers to the process of cleaning data, normalizing datasets, and
performing transformations on the data. A critical step within the Data Analytics Lifecycle,
data conditioning can involve many complex steps to join or merge datasets or otherwise get
datasets into a state that enables analysis in further phases. Key questions to ask about the
data include the following; a short sketch of such checks in code appears after the list.
● What are the data sources? What are the target fields (for example, columns of the
tables)?
● How clean is the data?
● How consistent are the contents and files? Determine to what degree the data contains
missing or inconsistent values and whether it contains values deviating from normal.
● Assess the consistency of the data types. For instance, if the team expects certain data to
be numeric, confirm whether it is numeric or a mixture of alphanumeric strings and text.
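The following is a minimal sketch of such conditioning checks in Python with pandas; the dataset, the "n/a" placeholder, and the min-max scaling step are illustrative assumptions rather than prescribed procedures.

import numpy as np
import pandas as pd

# Hypothetical dataset with the kinds of problems the checklist asks about.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age": ["34", "41", "n/a", "29"],          # expected numeric, actually mixed strings
    "income": [52000.0, np.nan, 61000.0, 48000.0],
})

# How clean is the data? Count missing values per column.
print(df.isna().sum())

# Assess type consistency: coerce 'age' to numeric; non-numeric entries become NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# One common transformation: normalize a numeric field to the [0, 1] range.
span = df["income"].max() - df["income"].min()
df["income_scaled"] = (df["income"] - df["income"].min()) / span

print(df)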

Survey and Visualize


After the team has collected and obtained at least some of the datasets needed for the
subsequent analysis, a useful step is to leverage data visualization tools to gain an overview
of the data. Seeing high-level patterns in the data enables one to understand characteristics
about the data very quickly. One example is using data visualization to examine data quality,
such as whether the data contains many unexpected values or other indicators of dirty data.
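As one small example of this kind of survey (a sketch under invented assumptions, including the -999 sentinel used to simulate dirty data), the Python code below plots a histogram of a synthetic column; the spike of unexpected values stands out immediately.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate a synthetic measurement column and inject dirty sentinel values.
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=1000)
values[rng.integers(0, 1000, size=25)] = -999  # simulated dirty data

df = pd.DataFrame({"measurement": values})

# A histogram makes unexpected values visible at a glance.
df["measurement"].hist(bins=50)
plt.xlabel("measurement")
plt.ylabel("count")
plt.title("Quick data-quality survey")
plt.show()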
***************************************************************************
Model Planning
In Phase 3, the data science team identifies candidate models to apply to the data for
clustering, classifying, or finding relationships in the data depending on the goal of the
project, as shown in Figure 2-5. It is during this phase that the team refers to the hypotheses
developed in Phase 1, when the team first became acquainted with the data and began
understanding the business problem and domain area.
● Assess the structure of the datasets. The structure of the datasets is one factor that
dictates the tools and analytical techniques for the next phase. Depending on whether the
team plans to analyze textual data or transactional data, for example, different tools and
approaches are required.
● Ensure that the analytical techniques enable the team to meet the business objectives and
accept or reject the working hypotheses.
● Determine if the situation warrants a single model or a series of techniques as part of a
larger analytic workflow. A few example models include association rules (Chapter 5,
“Advanced Analytical Theory and Methods: Association Rules”) and logistic regression
(Chapter 6, “Advanced Analytical Theory and Methods: Regression”). Other tools, such as
Alpine Miner, enable users to set up a series of steps and analyses and can serve as a
front-end user interface (UI) for manipulating Big Data sources in PostgreSQL. A brief
sketch of trying one such candidate model appears after this list.
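As a hedged illustration of trying one of the candidate models named above, the sketch below fits a logistic regression on synthetic data with scikit-learn; the generated dataset and the holdout split are assumptions made for demonstration, not part of the lifecycle itself.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for a real business dataset.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit one candidate model from the planning phase.
model = LogisticRegression()
model.fit(X_train, y_train)

# A quick holdout score helps compare candidate models before committing to one.
print("holdout accuracy:", model.score(X_test, y_test))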
Data Exploration and Variable Selection
Although some data exploration takes place in the data preparation phase, those activities
focus mainly on data hygiene and on assessing the quality of the data itself. In Phase 3, the
objective of the data exploration is to understand the relationships among the variables to
inform selection of the variables and methods and to understand the problem domain. As
with earlier phases of the Data Analytics Lifecycle, it is important to spend time and focus
attention on this preparatory work to make the subsequent phases of model selection and
execution easier and more efficient. A common way to conduct this step involves using tools
to perform data visualizations. Approaching the data exploration in this way aids the team in
previewing the data and assessing relationships between variables at a high level.
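One simple way to carry out this high-level review in code is a correlation matrix. The sketch below uses synthetic data in which "revenue" is correlated with "ad_spend" by construction; all variable names and relationships are invented for illustration.

import numpy as np
import pandas as pd

# Synthetic variables: one engineered relationship, one unrelated identifier.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"ad_spend": rng.normal(100, 20, n)})
df["revenue"] = 3 * df["ad_spend"] + rng.normal(0, 15, n)  # correlated by design
df["store_id"] = rng.integers(1, 50, n)                    # unrelated identifier

# Strong correlations flag variables worth keeping (or deduplicating);
# near-zero correlations suggest little value for linear methods.
print(df.corr().round(2))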

***************************************************************************
