Unit-1 Introduction To Big Data Analytics
By DCG Data Core Systems India Pvt Ltd, Kolkata
Overview
• It’s not easy to measure the total volume of data stored electronically, but one estimate projects that global data creation will grow to more than 180 zettabytes by 2025.
• Although the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.
• The size, speed, and complexity of big data necessitate the use of specialist software, which in turn relies on significant processing power and storage capabilities. While costly, embracing big data analytics enables organizations to derive powerful insights and gain a competitive edge.
• By 2029, the value of the big data analytics market is expected to reach over
655 billion U.S. dollars, up from around
• Data Volume
• Growth of 40% per year
• From 8 zettabytes (2016) to 44 ZB (2020)
• Data volume is increasing exponentially
[Slide figure: exponential increase in collected/generated data. Example statistics cited: processing 20 PB a day (2008); crawling 20B web pages a day (2012); a search index of 100+ PB covering 400B pages (5/2014); Bigtable serving 2+ EB at 600M QPS (5/2014); 10+ PB (2/2014); a Hadoop deployment of 365 PB across 330K nodes (6/2014).]
• Different Types:
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can only scan the data once
• A single application can be generating/collecting many types of data
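Because streaming data can only be scanned once, any summary statistics have to be maintained incrementally rather than by re-reading the data. Below is a minimal sketch (in Python, with made-up values) of a single-pass computation over a stream:

```python
# Minimal sketch: streaming data can only be scanned once, so summaries
# must be maintained incrementally instead of re-reading the data.
from typing import Iterable

def running_mean(stream: Iterable[float]) -> float:
    """Compute the mean of a stream in a single pass, using constant memory."""
    count, total = 0, 0.0
    for value in stream:          # each record is seen exactly once
        count += 1
        total += value
    return total / count if count else 0.0

# Example: events arriving one at a time (simulated here by a generator).
events = (float(x) for x in [3, 5, 4, 8, 2])
print(running_mean(events))       # 4.4
```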
• Different Sources:
• Movie reviews from IMDB and Rotten Tomatoes
• Product reviews from different provider websites
To extract knowledge, all these types of data need to be linked together.
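As a rough illustration of such linking, the sketch below (hypothetical titles, ratings, and column names) joins reviews of the same movie from two sources on a shared key so they can be analyzed together:

```python
# Minimal sketch: linking reviews of the same movie from two sources.
# Titles, scores, and column names are made up for illustration.
import pandas as pd

imdb = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "imdb_rating": [8.1, 6.4],
})
rotten_tomatoes = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "rt_score": [87, 55],
})

# Join on the shared movie title to build one linked view per film.
linked = imdb.merge(rotten_tomatoes, on="title", how="inner")
print(linked)
```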
A Single View of the Customer
[Slide figure: a single, linked view of "Our Customer" combining social media, banking, finance, gaming, known history, entertainment, and purchase data.]
A Global View of Linked Big Data
[Slide figure: a heterogeneous information network linking doctors, patients, prescriptions, diagnoses, drugs, targets, mutations, genes, proteins, tissues, and a disease such as "Ebola"; contrasted with a diversified social network.]
Velocity (Speed)
[Slide figure: examples where the speed of analysis creates value:
• Product recommendations that are relevant and compelling
• Learning why customers switch to competitors and their offers, in time to counter
• Customer influence behavior
• Friend invitations to join a game or activity that expands the business
• Improving the marketing effectiveness of a promotion while it is still in play
• Preventing fraud as it is occurring, and preventing more proactively]
Extended Big Data Characteristics: 6V
• Volume: In a big data environment, the amounts of data collected and processed are
much larger than those stored in typical relational databases.
• Velocity: Big data arrives at the organization at high speed and from multiple sources simultaneously.
• Veracity: Data quality issues are particularly challenging in a big data context.
• Value: Ultimately, big data is meaningless if it does not provide value toward some
meaningful goal.
Veracity (Quality & Trust)
• Data = quantity + quality
(Source: "Transforming Energy and Utilities through Big Data & Analytics", Anders Quitzau, IBM)
Other V’s
• Variability
Variability refers to data whose meaning is constantly changing. This is
particularly the case when gathering data relies on language processing.
• Viscosity
This term is sometimes used to describe the latency or lag time in the
data relative to the event being described. We found that this is just as
easily understood as an element of Velocity.
• Virality
Defined by some users as the rate at which the data spreads; how often
it is picked up and repeated by other users or events.
• Volatility
Big data volatility refers to how long data remains valid and how long it should be stored. You need to determine at what point data is no longer relevant to the current analysis (a minimal retention check is sketched after this list).
• More V’s in the future …
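To make the volatility point concrete, here is a minimal sketch (the 90-day retention window is an assumed policy, not a rule from the source) of dropping records that are no longer relevant to the current analysis:

```python
# Minimal sketch: apply an assumed retention window so records that are no
# longer valid for the current analysis are filtered out.
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=90)   # assumed retention policy for this analysis

def still_relevant(record_time: datetime, now: Optional[datetime] = None) -> bool:
    now = now or datetime.now(timezone.utc)
    return now - record_time <= RETENTION

recent = datetime.now(timezone.utc) - timedelta(days=10)
stale = datetime.now(timezone.utc) - timedelta(days=400)
print(still_relevant(recent), still_relevant(stale))   # True False
```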
Big Data Overview
Several industries have led the way in developing their ability to gather and exploit data:
• Credit card companies monitor every purchase their customers make and can identify fraudulent purchases with a high degree of accuracy, using rules derived by processing billions of transactions (a toy example of such a rule is sketched after this list).
• For companies such as LinkedIn and Facebook, data itself is their primary product. The
valuations of these companies are heavily derived from the data they gather and host,
which contains more and more intrinsic value as the data grows.
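The fraud rules themselves are proprietary, but a toy version of the idea (thresholds and history are invented for illustration) is to flag a purchase whose amount falls far outside a customer's historical spending pattern:

```python
# Toy sketch of a rule derived from past transactions: flag a purchase whose
# amount is far outside the customer's usual spending. The 3-standard-deviation
# threshold is an assumption for illustration, not an industry rule.
from statistics import mean, stdev

def looks_fraudulent(amount: float, past_amounts: list) -> bool:
    if len(past_amounts) < 2:
        return False                      # not enough history to derive a rule
    mu, sigma = mean(past_amounts), stdev(past_amounts)
    return sigma > 0 and abs(amount - mu) > 3 * sigma

history = [25.0, 40.0, 32.0, 28.0, 35.0]
print(looks_fraudulent(36.0, history))    # False: within the usual range
print(looks_fraudulent(900.0, history))   # True: far outside the usual range
```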
Big Data Overview
McKinsey’s definition of Big Data implies that organizations will need new data
architectures and analytic sandboxes, new tools, new analytical methods, and an
integration of multiple skills into the new role of the data scientist
Big Data Overview
• Social media and genetic sequencing are among the fastest-growing sources of Big
Data and examples of untraditional sources of data being used for analysis.
• For example, in 2012 Facebook users posted 700 status updates per second worldwide,
which can be leveraged to deduce latent interests or political views of users and show
relevant ads. For instance, an update in which a woman changes her relationship
status from “single” to “engaged” would trigger ads on bridal dresses, wedding
planning, or name-changing services.
• Facebook can also construct social graphs to analyze which users are connected to
each other as an interconnected network. In March 2013, Facebook released a new
feature called “Graph Search,” enabling users and developers to search social graphs
for people with similar interests, hobbies, and shared locations.
Big Data Overview
• Another example comes from genomics. Genetic sequencing and human genome
mapping provide a detailed understanding of genetic makeup and lineage. The health
care industry is looking toward these advances to help predict which illnesses a person
is likely to get in his lifetime and take steps to avoid these maladies or reduce their
impact through the use of personalized medicine and treatment.
• Such tests also highlight typical responses to different medications and pharmaceutical
drugs, heightening risk awareness of specific drug treatments.
Data Structures
• Big data can come in multiple forms, including structured and non-structured data such
as financial data, text files, multimedia files, and genetic mappings.
Current business problems provide many opportunities for organizations to become more analytical and data driven. There are four categories of common business problems that organizations contend with where they have an opportunity to leverage advanced analytics to create competitive advantage. Rather than only performing standard reporting on these areas, organizations can apply advanced analytical techniques to optimize processes and derive more value from these common tasks.
BI Versus Data Science
• Data Science projects need workspaces that are purpose-built for experimenting
with data, with flexible and agile data architectures. Most organizations still have data
warehouses that provide excellent support for traditional reporting and simple data
analysis activities but unfortunately have a more difficult time supporting more robust
analyses.
• The data flow to the data scientist, and how the individual fits into the process of getting data to analyze on projects, is as follows:
1. For data sources to be loaded into the data warehouse, data needs to be well understood, structured, and normalized with the appropriate data type definitions.
2. Once in the data warehouse, data is read by additional applications across the enterprise for BI and reporting purposes. These are high-priority operational processes getting critical data feeds from the data warehouses and repositories.
3. At the end of this workflow, analysts get data provisioned for their downstream analytics. Because users generally are not allowed to run custom or intensive analytics on production databases, analysts create data extracts from the EDW to analyze data offline in R or other local analytical tools (a minimal extract workflow is sketched at the end of this section).
Because new data sources slowly accumulate in the EDW due to the rigorous validation and data structuring process, data is slow to move into the EDW, and the data schema is slow to change.
The typical data architectures just described are designed for storing and processing mission-critical data, supporting enterprise applications, and enabling corporate reporting activities, and thus limit the ability of analysts to iterate on the data in a separate nonproduction environment.
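Step 3 above is often just a query against the warehouse followed by local analysis. The sketch below (an in-memory SQLite database stands in for the EDW; the table and columns are hypothetical) shows the pattern of extracting once and then iterating offline:

```python
# Minimal sketch: pull a read-only extract out of the warehouse and analyze it
# offline instead of running heavy queries against production systems.
# An in-memory SQLite database stands in for the EDW; names are hypothetical.
import sqlite3
import pandas as pd

edw = sqlite3.connect(":memory:")
edw.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 120.0), ('west', 80.0), ('east', 45.0);
""")

# Extract once, then iterate locally (pandas here; the text also mentions R).
extract = pd.read_sql_query("SELECT region, amount FROM sales", edw)
print(extract.groupby("region")["amount"].sum())
```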
Key Roles for the New Big Data Ecosystem
The need for applying more advanced analytical techniques to increasingly complex
business problems has driven the emergence of new roles, new technology platforms,
and new analytical methods.
Three Roles for the New Big Data Ecosystem
Key Roles for the New Big Data Ecosystem
• Deep Analytical Talent — is technically savvy, with strong analytical skills. Members
possess a combination of skills to handle raw, unstructured data and to apply complex
analytical techniques at massive scales. This group has advanced training in quantitative
disciplines, such as mathematics, statistics, and machine learning.
• Data Savvy Professionals — has less technical depth but has a basic knowledge of
statistics or machine learning and can define key questions that can be answered using
advanced analytics. These people tend to have a base knowledge of working with data, or
an appreciation for some of the work being performed by data scientists and others with
deep analytical talent.
Data scientists are generally thought of as having five main sets of skills and behavioral characteristics:
• Skeptical mind-set and critical thinking: It is important that data scientists can
examine their work critically rather than in a one-sided way.
• Curious and creative: Data scientists are passionate about data and finding creative
ways to solve problems and portray information.
Big Data presents many opportunities to improve sales and marketing analytics. An example of this is the U.S. retailer Target. Charles Duhigg’s book The Power of Habit discusses how the retailer used Big Data and advanced analytical methods to drive new revenue. After analyzing consumer-purchasing behavior, the retailer made a great deal of money from three main life-event situations. The retailer determined that the most lucrative of these life events is the third situation: pregnancy. Using data collected from shoppers, the retailer was able to identify this fact and predict which of its shoppers were pregnant. This kind of knowledge allowed the retailer to offer specific coupons and incentives to its pregnant shoppers. In fact, the retailer could not only determine whether a shopper was pregnant, but also in which month of pregnancy she might be. This enabled the retailer to manage its inventory, knowing that there would be demand for specific products, and that demand would likely vary by month over the coming nine- to ten-month cycle.
Examples of Big Data Analytics
As of 2014, LinkedIn has more than 250 million user accounts (Current data??) and
has added many additional features and data-related products, such as recruiting, job
seeker tools, advertising, and InMaps, which show a social graph of a user’s professional
network.
Roughly seven key roles need to be fulfilled for a high-functioning data science team to execute analytic projects successfully.
• Business User: Someone who understands the domain area and usually benefits from the
results. This person can consult and advise the project team on the context of the project,
the value of the results, and how the outputs will be operationalized.
• Project Sponsor: Responsible for the genesis of the project. Provides the impetus and
requirements for the project and defines the core business problem. Generally provides the
funding and gauges the degree of value from the final outputs of the working team. This
person sets the priorities for the project and clarifies the desired outputs.
• Project Manager: Ensures that key milestones and objectives are met on time and at the
expected quality
• CRISP-DM provides useful input on ways to frame analytics problems and is a popular approach for
data mining.
https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
• Tom Davenport’s DELTA framework : The DELTA framework offers an approach for data analytics
projects, including the context of the organization’s skills, datasets, and leadership engagement.
(Analytics at Work: Smarter Decisions, Better Results, 2010, Harvard Business Review Press )
• Doug Hubbard’s Applied Information Economics (AIE) approach : AIE provides a framework for
measuring intangibles and provides guidance on developing decision models, calibrating expert
estimates, and deriving the expected value of information.
(How to Measure Anything: Finding the Value of Intangibles in Business, 2010, Hoboken, NJ: John
Wiley & Sons)
• “MAD Skills” by Cohen et al. offers input for several of the techniques mentioned in Phases 2–4 that
focus on model planning, execution, and key findings.
(MAD Skills: New Analysis Practices for Big Data, Watertown, MA 2009)
Phase 1: Discovery
In this phase, the data science team must learn and investigate the problem, develop
context and understanding, and learn about the data sources needed and available for the
project. In addition, the team formulates initial hypotheses that can later be tested with
data.
Phase 2: Data Preparation
This phase includes the steps to explore, preprocess, and condition data prior to modeling and analysis.
• In this phase, the team needs to create a robust environment in which it can explore the data
that is separate from a production environment. Usually, this is done by preparing an
analytics sandbox.
• To get the data into the sandbox, the team needs to perform ETL, by a combination of
extracting, transforming, and loading data into the sandbox. Once the data is in the sandbox,
the team needs to learn about the data and become familiar with it.
• The team also must decide how to condition and transform data to get it into a format to
facilitate subsequent analysis.
• The team may perform data visualizations to help team members understand the data,
including its trends, outliers, and relationships among data variables.
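A minimal Phase 2 sketch of this flow (file name and columns are hypothetical, and pandas/matplotlib are assumed as the sandbox tools) might look like:

```python
# Minimal sketch: load raw data into the analytics sandbox, condition it, and
# take a first visual look. The file name and columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# Extract/load: read a raw extract into the sandbox environment.
raw = pd.read_csv("sandbox/customer_orders.csv")

# Transform/condition: fix types and drop obviously bad records.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_date", "amount"])
clean = clean[clean["amount"] > 0]

# Quick visualization to spot trends and outliers before any modeling.
clean["amount"].hist(bins=30)
plt.title("Order amounts in the sandbox extract")
plt.show()
```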
Phase 3: Model Planning
• The data science team identifies candidate models to apply to the data for clustering,
classifying, or finding relationships in the data depending on the goal of the project.
• Given the kind of data and resources that are available, evaluate whether similar,
existing approaches will work or if the team will need to create something new.
• The data science team needs to develop datasets for training, testing, and production
purposes. These datasets enable the data scientist to develop the analytical model and
train it (“training data”), while holding aside some of the data (“hold-out data” or “test
data”) for testing the model.
• It is critical to ensure that the training and test datasets are sufficiently robust for the
model and analytical techniques
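A minimal sketch of holding data aside (using scikit-learn on a synthetic dataset, which is an assumption for illustration) is shown below:

```python
# Minimal sketch: split the analytical dataset into training data and
# hold-out test data before fitting any model. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# 80% for training the model, 20% held aside purely for testing it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)   # (400, 8) (100, 8)
```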
Phase 4: Model Building
• Does the model appear valid and accurate on the test data?
• Does the model output/behavior make sense to the domain experts?
• Do the parameter values of the fitted model make sense in the context of the domain?
• Is the model sufficiently accurate to meet the goal?
• Does the model avoid intolerable mistakes?
• Are more data or more inputs needed? Do any of the inputs need to be transformed or
eliminated?
• Will the kind of model chosen support the runtime requirements?
• Is a different form of the model required to address the business problem? If so, go back
to the model planning phase and revise the modeling approach.
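For the first two questions above, a minimal sketch (synthetic data and a random forest chosen purely for illustration) of checking a fitted model against the hold-out test data:

```python
# Minimal sketch: fit a candidate model on the training data, then check how
# it behaves on the held-out test data before reviewing it with domain experts.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)

print("test accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))   # review the mistakes with experts
```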
Phase 5: Communicate Results
• After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various
team members and stakeholders, considering caveats, assumptions, and any limitations
of the results
• As a result of this phase, the team will have documented the key findings and major
insights derived from the analysis.
• The deliverable of this phase will be the most visible portion of the process to the
outside stakeholders and sponsors, so take care to clearly articulate the results,
methodology, and business value of the findings
Phase 6: Operationalize
• The team communicates the benefits of the project more broadly and sets up a pilot
project to deploy the work in a controlled way before broadening the work to a full
enterprise or ecosystem of users.
• This approach enables the team to learn about the performance and related constraints
of the model in a production environment on a small scale and make adjustments before
a full deployment.
• While scoping the effort involved in conducting a pilot project, consider running the
model in a production environment for a discrete set of products or a single line of
business, which tests the model in a live setting.
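A minimal sketch of such a scoped pilot (the product line, scoring rule, and data are all hypothetical) is to score only one discrete segment in production and review the results before a full rollout:

```python
# Minimal sketch: run the model in production for a single, discrete product
# line (the pilot scope) and inspect the scores before a full deployment.
# The product line, scoring rule, and data are hypothetical.
import pandas as pd

def score(row: pd.Series) -> float:
    """Stand-in for the trained model's scoring function."""
    return min(1.0, row["recent_purchases"] / 10.0)

orders = pd.DataFrame({
    "product_line": ["gardening", "electronics", "gardening"],
    "recent_purchases": [2, 7, 9],
})

# Pilot scope: only the "gardening" line is scored in this controlled rollout.
pilot = orders[orders["product_line"] == "gardening"].copy()
pilot["score"] = pilot.apply(score, axis=1)
print(pilot)
```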