Lesson 3 Big Data Overview
Data Science
BIG DATA OVERVIEW
Module Objectives
At the end of this module, students must be able to:
1. discuss Big Data and its characteristics;
2. differentiate the data structures;
3. distinguish the different repositories used by data scientists;
4. examine the state of the practice of analytics;
5. differentiate between business intelligence and data science; and
6. examine the current analytical architecture and its problems.
Big Data
Data is created constantly, and at an ever-increasing rate. Mobile phones, social
media, imaging technologies to determine a medical diagnosis—all these and
more create new data, and that must be stored somewhere for some purpose.
Merely keeping up with this huge influx of data is difficult, but substantially more
challenging is analyzing vast amounts of it, especially when it does not conform
to traditional notions of data structure, to identify meaningful patterns and extract
useful information.
*Text taken from Data Science and Big Data Analytics by EMC Education Services
Several industries have led the way in developing their ability to gather and exploit data:
Credit card companies monitor every purchase their customers make and can
identify fraudulent purchases with a high degree of accuracy using rules derived by
processing billions of transactions.
Mobile phone companies analyze subscribers’ calling patterns to determine, for
example, whether a caller’s frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might cause the subscriber to defect,
the mobile phone company can proactively offer the subscriber an incentive to
remain in her contract.
For companies such as LinkedIn and Facebook, data itself is their primary product.
The valuations of these companies are heavily derived from the data they gather and
host, which contains more and more intrinsic value as the data grows.
Three attributes stand out as defining Big Data characteristics:
Huge volume of data: Rather than thousands or millions of rows, Big Data can be
billions of rows and millions of columns.
Complexity of data types and structures: Big Data reflects the variety of new
data sources, formats, and structures, including digital traces being left on the web
and other digital repositories for subsequent analysis.
Speed of new data creation and growth: Big Data can describe high velocity
data, with rapid data ingestion and near real time analysis.
Although the volume of Big Data tends to attract the most attention, generally
the variety and velocity of the data provide a more apt definition of Big Data.
Due to its size or structure, Big Data cannot be efficiently analyzed using only
traditional databases or methods. Big Data problems require new tools and
technologies to store, manage, and realize the business benefit. These new
tools and technologies enable creation, manipulation, and management of large
datasets and the storage environments that house them.
Big Data Definition:
“Big Data is data whose scale, distribution, diversity, and/or timeliness require
the use of new technical architectures and analytics to enable insights that
unlock new sources of business value.”
-McKinsey Global Report, 2011
McKinsey’s definition of Big Data implies that organizations will need new data
architectures and analytic sandboxes, new tools, new analytical methods, and
an integration of multiple skills into the new role of the data scientist.
Social media and genetic sequencing are among the fastest-growing sources
of Big Data and examples of untraditional sources of data being used for
analysis.
For example, in 2012 Facebook users posted 700 status updates per second
worldwide, which can be leveraged to deduce latent interests or political views
of users and show relevant ads. For instance, an update in which a woman
changes her relationship status from “single” to “engaged” would trigger ads
on bridal dresses, wedding planning, or name-changing services.
Data Structures
Big data can come in multiple forms, including structured and non-structured
data such as financial data, text files, multimedia files, and genetic mappings.
The following shows four types of
data structures, with 80–90% of
future data growth coming from non-
structured data types.
Structured Data
Structured data: Data containing a defined data type, format, and structure
(for example, transaction data, online analytical processing data cubes,
traditional RDBMS, CSV files, and even simple spreadsheets).
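To make "defined data type, format, and structure" concrete, here is a minimal sketch that reads a small CSV extract and casts every column to a fixed type. The column names and the sample rows are assumptions made up for illustration; they do not come from the source text.

```python
import csv
import io

# Hypothetical transaction extract; column names and values are assumptions.
RAW_CSV = """transaction_id,customer_id,amount,currency
1001,C01,25.50,USD
1002,C02,310.00,USD
1003,C01,12.75,USD
"""


def load_transactions(text):
    """Parse CSV text into dicts, casting each column to its defined type."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "transaction_id": int(record["transaction_id"]),
            "customer_id": record["customer_id"],
            "amount": float(record["amount"]),
            "currency": record["currency"],
        })
    return rows


transactions = load_transactions(RAW_CSV)

# Because every row has the same typed columns, aggregation is straightforward.
total_by_customer = {}
for row in transactions:
    customer = row["customer_id"]
    total_by_customer[customer] = total_by_customer.get(customer, 0.0) + row["amount"]

print(total_by_customer)
```

Every row follows the same schema, so no per-row cleanup is needed before the totals are computed.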
Semi-structured Data
Semi-structured data: Textual data files with a discernible pattern that
enables parsing (such as Extensible Markup Language [XML] data files that are
self-describing and defined by an XML schema).
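As a rough illustration of a "discernible pattern that enables parsing", the sketch below reads a small XML document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`orders`, `order`, `note`, and so on) are hypothetical, and the code does not validate against an XML schema; it only shows how self-describing tags let a parser handle records whose fields vary.

```python
import xml.etree.ElementTree as ET

# Hypothetical order feed; tag and attribute names are assumptions.
RAW_XML = """<orders>
  <order id="1001">
    <customer>C01</customer>
    <amount currency="USD">25.50</amount>
  </order>
  <order id="1002">
    <customer>C02</customer>
    <amount currency="USD">310.00</amount>
    <note>gift wrap</note>
  </order>
</orders>"""


def parse_orders(text):
    """Turn each <order> element into a dict; optional fields become None."""
    root = ET.fromstring(text)
    orders = []
    for order in root.findall("order"):
        amount = order.find("amount")
        note = order.find("note")  # present on some orders only
        orders.append({
            "id": int(order.get("id")),
            "customer": order.findtext("customer"),
            "amount": float(amount.text),
            "currency": amount.get("currency"),
            "note": note.text if note is not None else None,
        })
    return orders


orders = parse_orders(RAW_XML)
for entry in orders:
    print(entry)
```

Unlike the fixed columns of a CSV file, the second order carries an extra `note` element, and the tags themselves tell the parser what each value means.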
Quasi-structured Data
Quasi-structured data: Textual data
with erratic data formats that can be
formatted with effort, tools, and time
(for instance, web clickstream data
that may contain inconsistencies in
data values and formats).
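The sketch below suggests what "formatted with effort, tools, and time" can look like for clickstream data. The raw URLs and the cleanup rules (lowercasing, dropping a `www.` prefix, treating `q` as the search parameter) are assumptions chosen for illustration; real clickstreams usually need many more rules.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical raw click records with inconsistent formats.
RAW_CLICKS = [
    "https://www.example.com/search?q=Big+Data&page=2",
    "HTTP://Example.com/Search/?Q=big%20data",
    "example.com/search?q=BIG+DATA&utm_source=mail",
    "not a url at all",
]


def normalize_click(raw):
    """Return (host, path, search_term), or None if no search term is found."""
    text = raw.strip()
    if "://" not in text:
        text = "http://" + text  # assume a missing scheme means plain http
    try:
        parts = urlsplit(text)
        host = (parts.hostname or "").lower()
    except ValueError:
        return None
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.lower().rstrip("/")
    # Parameter names arrive in mixed case, so compare them in lowercase.
    params = {key.lower(): values for key, values in parse_qs(parts.query).items()}
    terms = params.get("q")
    if not host or not terms:
        return None
    return host, path, terms[0].lower()


cleaned = [click for click in map(normalize_click, RAW_CLICKS) if click is not None]
rejected = len(RAW_CLICKS) - len(cleaned)
print(cleaned, rejected)
```

Three differently formatted records collapse into the same normalized tuple, and the record that cannot be parsed is counted rather than silently dropped.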
Unstructured Data
Unstructured data: Data that
has no inherent structure,
which may include text
documents, PDFs, images,
and video.
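Because unstructured data has no inherent structure, any structure has to be imposed by the analysis itself. The sketch below is one simple way to do that for free text: it counts term frequencies. The sample note and the tokenization rule (runs of letters only) are assumptions for illustration, not a method described in the source.

```python
import re
from collections import Counter

# Hypothetical free-text note; there are no fields, columns, or tags.
DOCUMENT = (
    "The patient reports mild chest pain. Chest X-ray ordered. "
    "Pain improved after rest; no further chest pain reported."
)


def term_counts(text):
    """Impose a simple structure on free text: lowercase word frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


counts = term_counts(DOCUMENT)
print(counts.most_common(3))
```

The resulting table of terms and counts is structured data derived from the text; images and video need far heavier processing before they yield anything comparable.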
Data Repositories
As data needs grew, so did more scalable data warehousing solutions. These
technologies enabled data to be managed centrally, providing benefits of security,
failover, and a single repository where users could rely on getting an “official”
source of data for financial reporting or other mission-critical tasks.
State of Practice in Analytics
Current business problems provide many opportunities for organizations to
become more analytical and data driven, as shown in the following table:
The previous table outlines four categories of common business problems that organizations
contend with where they have an opportunity to leverage advanced analytics to create
competitive advantage. Rather than only performing standard reporting on these areas,
organizations can apply advanced analytical techniques to optimize processes and derive
more value from these common tasks.
The first three examples do not represent new problems. Organizations have been trying to reduce
customer churn, increase sales, and cross-sell customers for many years. What is new is the opportunity
to fuse advanced analytical techniques with Big Data to produce more impactful analyses for these
traditional problems.
The last example portrays emerging regulatory requirements. Many compliance and
regulatory laws have been in existence for decades, but additional requirements are
added every year, which represent additional complexity and data requirements for
organizations. Laws related to anti-money laundering (AML) and fraud prevention
require advanced analytical techniques to comply with and manage properly.
Business Intelligence vs Data Science
Although much is written
generally about analytics, it is
important to distinguish between
Business Intelligence (BI) and
Data Science. As shown in the figure
on the right, there are several
ways to compare these groups of
analytical techniques.
BI tends to provide reports, dashboards, and queries on business questions
for the current period or in the past. BI systems make it easy to answer
questions about quarter-to-date revenue and progress toward quarterly
targets, and to understand how much of a given product was sold in a prior
quarter or year.
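A typical BI question such as quarter-to-date revenue reduces to filtering and summing historical records. The sketch below shows that calculation under stated assumptions: the sales rows, the reporting date, and the quarterly target are all hypothetical.

```python
from datetime import date

# Hypothetical sales records: (sale_date, product, revenue).
SALES = [
    (date(2023, 12, 28), "widget", 500.0),
    (date(2024, 1, 5), "widget", 1200.0),
    (date(2024, 1, 20), "gadget", 800.0),
    (date(2024, 2, 10), "widget", 400.0),
    (date(2024, 3, 1), "gadget", 950.0),
]


def quarter_start(day):
    """First day of the calendar quarter that contains `day`."""
    first_month = 3 * ((day.month - 1) // 3) + 1
    return date(day.year, first_month, 1)


def quarter_to_date_revenue(sales, as_of):
    """Sum revenue from the start of the quarter through `as_of`, inclusive."""
    start = quarter_start(as_of)
    return sum(revenue for day, _, revenue in sales if start <= day <= as_of)


as_of = date(2024, 2, 15)      # assumed reporting date
quarterly_target = 4000.0      # assumed target
qtd = quarter_to_date_revenue(SALES, as_of)
progress = qtd / quarterly_target
print(f"QTD revenue: {qtd:.2f} ({progress:.0%} of target)")
```

The answer describes what has already happened; nothing in the calculation looks forward.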
By comparison, Data Science tends to use disaggregated data in a more
forward-looking, exploratory way, focusing on analyzing the present and
enabling informed decisions about the future.
In addition, Data Science tends to be more exploratory in nature and may use
scenario optimization to deal with more open-ended questions. This approach
provides insight into current activity and foresight into future events, while
generally focusing on questions related to “how” and “why” events occur.
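To contrast with backward-looking reporting, the sketch below fits a straight-line trend to monthly figures and projects the next month. This is only one very simple forward-looking technique, chosen here for illustration; the source does not prescribe it, and the revenue figures are hypothetical.

```python
# Hypothetical monthly revenue, oldest first.
MONTHLY_REVENUE = [100.0, 110.0, 125.0, 130.0, 145.0, 150.0]


def fit_line(ys):
    """Ordinary least squares fit of y = slope * x + intercept, x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(MONTHLY_REVENUE)
next_month = len(MONTHLY_REVENUE)          # index of the first unseen month
forecast = slope * next_month + intercept  # projection, not an observed value
print(f"trend: {slope:.2f} per month, forecast: {forecast:.2f}")
```

The output is an estimate about a period with no data yet, which is the kind of foresight the text attributes to Data Science; a real project would also test whether a linear trend is a reasonable assumption.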
Where BI problems tend to require highly structured data organized in rows
and columns for accurate reporting, Data Science projects tend to use many
types of data sources, including large or unconventional datasets.
Current Analytical Architecture
Data Science projects need workspaces that are purpose-built for
experimenting with data, with flexible and agile data architectures.
Most organizations still have data warehouses that provide excellent support
for traditional reporting and simple data analysis activities but unfortunately
have a more difficult time supporting more robust analyses.
The following figure shows
a typical data architecture
and several of the
challenges it presents to
data scientists and others
trying to do advanced
analytics.
1. For data sources to be loaded into the data
warehouse, the data needs to be well
understood, structured, and normalized
with the appropriate data type definitions.
This kind of centralization enables
security, backup, and failover of highly
critical data.
3. Once in the data warehouse, data is read
by additional applications across the
enterprise for BI and reporting purposes.
These are high-priority operational processes
getting critical data feeds from the data
warehouses and repositories.
The typical data architectures just described are designed for storing and
processing mission-critical data, supporting enterprise applications, and
enabling corporate reporting activities.
Although reports and dashboards are still important for organizations, most
traditional data architectures inhibit data exploration and more sophisticated
analysis.
Problems in Traditional Data Architecture
High-value data is hard to reach and leverage, and predictive analytics and data mining
activities are last in line for data. Because the EDWs are designed for central data
management and reporting, those wanting data for analysis are generally prioritized only after
operational processes.
Data moves in batches from the EDW to local analytical tools. This workflow means that
data scientists are limited to performing in-memory analytics, which restricts the size of the
datasets they can use. As such, analysis may be subject to constraints of sampling, which
can skew model accuracy.
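The effect of sampling on a rare event can be sketched with a small simulation. The population size, the event count, the sample size, and the random seed below are all assumptions chosen for illustration.

```python
import random


def event_rate(rows):
    """Fraction of rows flagged as events (1 = event, 0 = no event)."""
    return sum(rows) / len(rows)


def sampled_rates(rows, sample_size, trials, seed=0):
    """Event rate measured on several independent random samples."""
    rng = random.Random(seed)  # fixed seed so repeated runs match
    return [event_rate(rng.sample(rows, sample_size)) for _ in range(trials)]


# Hypothetical population: 50 rare events among 100,000 records.
population = [1] * 50 + [0] * 99_950
full_rate = event_rate(population)

# Assume only 1,000 rows fit into the local in-memory tool.
rates = sampled_rates(population, sample_size=1_000, trials=20)

print(f"rate on full data: {full_rate:.4%}")
print(f"rates on samples:  {min(rates):.4%} to {max(rates):.4%}")
```

With 1,000 rows, every sample rate is a multiple of 0.1%, so no sample can reproduce the true 0.05% rate exactly, and a sample that happens to contain no events reports a rate of zero. A model trained on such a sample sees a distorted picture of the rare event.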
Data Science projects will remain isolated and ad hoc, rather than centrally managed. The
implication of this isolation is that the organization can never harness the power of advanced
analytics in a scalable way, and Data Science projects will exist as nonstandard initiatives,
which are frequently not aligned with corporate business goals or strategy.