Basic Concepts in Big Data
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
• "Big Data are high-volume, high-velocity, and/or
high-variety information assets that require new
forms of processing to enable enhanced decision
making, insight discovery and process
optimization” (Gartner 2012)
• Complicated (intelligent) analysis of data may make small data "appear" to be "big"
• Bottom line: any data that exceeds our current capability of processing can be regarded as "big"
Why is “big data” a “big deal”?
• Government
– Many leading countries' administrations have announced "big data" initiatives
– Many different big data programs have been launched
• Private Sector
– Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data
– Facebook handles 40 billion photos from its user base.
– Falcon Credit Card Fraud Detection System protects 2.1 billion
active accounts world-wide
• Science
– The Large Synoptic Survey Telescope will generate 140 terabytes of data every 5 days.
– Biomedical computation like decoding human Genome &
personalized medicine
– Social science revolution
– …
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine (digital archive of www) has 3 PB +
100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
"640K ought to be enough for anybody."
Type of Data
• Streaming Data
– You can only scan the data once
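The one-scan constraint above is what streaming algorithms are built around. As a minimal sketch (not from the slides), reservoir sampling (Algorithm R) keeps a uniform random sample of a stream while scanning the data exactly once:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream,
    scanning the data exactly once (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items
            sample.append(item)
        else:
            # Keep the i-th item with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# A stream too large to hold in memory can still be sampled:
sample = reservoir_sample(range(1_000_000), 10)
print(len(sample))  # 10
```

Each item needs to be seen only once and only k items are ever stored, which is exactly the constraint streaming data imposes.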
What to do with these data?
• Aggregation and Statistics
– Data warehouse and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching over XML/RDF (the Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model)
• Knowledge discovery
– Data Mining
– Statistical Modeling
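The aggregation-and-statistics bullet above can be sketched as a toy OLAP-style roll-up; the store names and transaction records below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical transaction records: (store, product, amount)
transactions = [
    ("NYC", "tv", 300.0),
    ("NYC", "phone", 150.0),
    ("LA", "tv", 280.0),
    ("LA", "tv", 320.0),
]

# OLAP-style roll-up: total sales per store
sales_by_store = defaultdict(float)
for store, product, amount in transactions:
    sales_by_store[store] += amount

print(dict(sales_by_store))  # {'NYC': 450.0, 'LA': 600.0}
```

A data warehouse does the same grouping and summing, but over billions of rows and along many dimensions (store, product, time) at once.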
Lifecycle of Data: 4 “A”s
[Cycle diagram: Acquisition (scattered/raw log data) → Aggregation (integrated data) → Analysis (knowledge) → Application]
Computational View of Big Data
[Diagram: data storage, formatting/cleaning, and visualization as computational stages]
Big Data: 3V’s
Volume (Scale)
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
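The quoted figures are internally consistent; a quick arithmetic check, with the years and volumes taken from the bullets above:

```python
# Reported growth: 0.8 ZB in 2009 to 35 ZB in 2020
start, end, years = 0.8, 35.0, 2020 - 2009

print(round(end / start))        # 44  (the "44x" figure)

# Implied compound annual growth rate
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.0%} per year")    # 41% per year
```

So "exponential increase" here means roughly 41% compound growth per year over the eleven-year span.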
[Figure: sources of the data deluge]
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
Maximilien Brice, © CERN
The Earthscope
• The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Variety (Complexity)
[Figure: a single customer seen across many data varieties — social media, banking/finance, gaming, entertainment, purchase history, our known history]
Velocity (Speed)
Real-time/Fast Data
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Real-Time Analytics/Decision Requirement
• Product recommendations that are relevant & compelling
• Learning why customers switch to competitors and their offers, in time to counter
• Influencing customer behavior: friend invitations to join a game or activity that expands business
• Improving the marketing effectiveness of a promotion while it is still in play
• Preventing fraud as it is occurring & preventing more proactively
Some Make it 4V’s
• Volume:
– How much data is really relevant to the problem solution? Cost of processing?
– So, can you really afford to store and process all that data?
• Velocity:
– Much data coming in at high speed
– Need for streaming versus block approach to data analysis
– So, how to analyze data in-flight and combine with data at-rest
• Variety:
– A small fraction is in structured formats: relational, XML, etc.
– A fair amount is semi-structured, such as web logs
– The rest of the data is unstructured: text, photographs, etc.
– So, no single data model can currently handle the diversity
• Veracity: cover term for …
– Accuracy, Precision, Reliability, Integrity
– So, what is it that you don’t know you don’t know about the data?
• Value:
– How much value is created for each unit of data (whatever it is)?
– So, what is the contribution of subsets of the data to the problem solution?
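The Velocity bullet's "in-flight vs. at-rest" distinction can be sketched as follows; the account names, baseline values, and threshold factor are all hypothetical:

```python
# Data at rest: per-account average transaction size, precomputed in batch
baseline = {"acct1": 40.0, "acct2": 500.0}

def flag_in_flight(events, baseline, factor=10.0):
    """Yield stream events that look anomalous relative to the stored baseline.

    Each in-flight event is checked against at-rest state without
    waiting for the next batch job to run.
    """
    for acct, amount in events:
        # Unknown accounts have no baseline, so they are never flagged here
        if amount > factor * baseline.get(acct, float("inf")):
            yield (acct, amount)

stream = [("acct1", 35.0), ("acct1", 800.0), ("acct2", 450.0)]
print(list(flag_in_flight(stream, baseline)))  # [('acct1', 800.0)]
```

This is the shape of the fraud-detection scenario above: a batch-built model at rest, consulted per event while the data is still in flight.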
Harnessing Big Data
The Model Has Changed…
Old Model: few companies are generating data; all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
What’s driving Big Data
THE EVOLUTION OF BUSINESS INTELLIGENCE
[Diagram: quadrants along two axes, Speed and Scale]
• BI Reporting, OLAP & Data warehouse: Business Objects, SAS, Informatica, Cognos, other SQL reporting tools
• Interactive Business Intelligence & In-memory RDBMS (speed): QlikView, Tableau, HANA
• Big Data, Real Time & Single View (speed + scale): graph databases
• Big Data, Batch Processing & Distributed Data Store (scale): Hadoop/Spark; HBase/Cassandra
Big Data Technology
[Diagram: big data technology draws on cloud computing; hardware advances (COBOL, Edsel, Amazon.com, ARPANET, the Internet); formatting and cleaning; signal processing; storage; information theory; data analysis techniques; and visualization — many applications!]