Week 6: IoT Big Data
Baris Aksanli
02/10/2016
Why is there big data?
• 3V model: Volume, Velocity, Variety [META]
– +1V: Value [IDC]
Value of Big Data
• New business and efficiency opportunities
• $300B in potential value for the US medical industry
• Increased efficiency of government operations
• Search engines personalized for users
• Personalized ads, products, etc.
IoT and Big Data
• IoT applications continuously generate data
– Even the smallest device generates data
• The problem: data processing capacity is lower than data generation speed
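To make the gap between generation and processing concrete, here is a minimal sketch of how a backlog grows over time. The function name and all rates are hypothetical numbers chosen for illustration, not measurements from any real deployment.

```python
def backlog_after(seconds, generated_per_s, processed_per_s):
    """Return the number of unprocessed records after `seconds`.

    The backlog never drops below zero: a pipeline that is faster
    than the data source simply stays idle part of the time.
    """
    return max(0, (generated_per_s - processed_per_s) * seconds)


# Hypothetical rates: 1,000 sensors each emitting 10 readings per second,
# against a pipeline that can handle 8,000 readings per second.
generated = 1_000 * 10
processed = 8_000

for t in (1, 60, 3600):
    print(f"after {t:>4} s: {backlog_after(t, generated, processed)} records waiting")
```

Under these assumed rates the backlog grows linearly and without bound, which is why the slide treats the mismatch as the core problem rather than a temporary spike.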
Big Data Classification
Path of the Data
• Google file system (GFS)
– File broken into chunks (typically 64MB)
– Master manages metadata
– Data transfers happen directly between clients and chunkservers
• Other examples:
– HDFS and Kosmos
– Extensions to GFS
– Cosmos from MS
– Haystack from FB
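The GFS bullets above can be sketched in a few lines: a file is cut into fixed-size chunks, and a master keeps only the mapping from chunks to servers. This is a toy model, not the GFS implementation; the `Master` class, the server names, and the round-robin placement are all assumptions made for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical chunk size named above


def plan_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs that cover a file of `file_size` bytes."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


class Master:
    """Toy master: it stores metadata only, never the file contents."""

    def __init__(self, servers):
        self.servers = servers
        self.table = {}  # (filename, chunk_index) -> chunkserver name

    def register(self, filename, file_size):
        for index, _ in enumerate(plan_chunks(file_size)):
            # Round-robin placement is an assumption for this sketch.
            self.table[(filename, index)] = self.servers[index % len(self.servers)]

    def locate(self, filename, offset):
        """Tell a client which chunk and which server hold a byte offset."""
        index = offset // CHUNK_SIZE
        return index, self.table[(filename, index)]


master = Master(["cs-a", "cs-b", "cs-c"])          # hypothetical server names
master.register("sensor.log", 200 * 1024 * 1024)   # a 200 MB file
print(master.locate("sensor.log", 130 * 1024 * 1024))
```

After `locate` returns, the client would contact that chunkserver directly for the bytes, so the master stays off the data path.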
Database Technology
• Key-value databases: data is stored under unique keys -> shorter query response time
– Provide expandability by distributing keys across nodes
– Dynamo [Amazon] and Voldemort [LinkedIn]
• Column-oriented databases: store and process data according to columns rather than rows
– Both columns and rows are segmented across multiple nodes to realize expandability
– BigTable [Google] and Cassandra [Facebook]
• Document databases: can support more complex data forms, and key-value pairs can still be saved
– Structured data storage with objects
– MongoDB [Binary JSON objects], SimpleDB [Amazon] and CouchDB [Apache]
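Two of the ideas above lend themselves to a short sketch: distributing keys across nodes, and laying data out by column instead of by row. The code below is a minimal illustration under stated assumptions. It uses simple hash-modulo placement, which is not how Dynamo or Voldemort place keys, and the class, node names, and sample readings are all hypothetical.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


class KeyValueCluster:
    """Toy key-value store whose keys are spread over several nodes."""

    def __init__(self, node_names):
        self.nodes = list(node_names)
        self.storage = {name: {} for name in self.nodes}

    def put(self, key, value):
        self.storage[node_for(key, self.nodes)][key] = value

    def get(self, key):
        # The same hash sends the lookup straight to the owning node.
        return self.storage[node_for(key, self.nodes)].get(key)


cluster = KeyValueCluster(["n1", "n2", "n3"])  # hypothetical node names
cluster.put("device-42", {"temp": 21.0})
print(cluster.get("device-42"))

# Row layout versus column layout for the same (made-up) readings.
rows = [
    {"device": "d1", "temp": 21.0, "humidity": 40},
    {"device": "d2", "temp": 22.0, "humidity": 42},
    {"device": "d3", "temp": 23.0, "humidity": 38},
]
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one attribute touches a single column only.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(avg_temp)
```

A lookup needs one hash and one node visit, which is where the short response time comes from. The column layout shows why analytic queries over a single attribute avoid reading whole rows.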
Programming Models
• Traditional parallel models do not perform well
– Scalability issues: big data are generally stored in hundreds and even thousands of commercial servers
Data Analysis
• Goal is to extract useful values, with suggestions or decisions
• Traditional data analysis
– Cluster analysis: grouping objects
– Factor analysis: describe the relation among many elements with a few factors
– Correlation analysis: dependence among variables
– Regression analysis: dependence relationships among variables hidden by randomness
– A/B testing: improve target variables by comparing the tested group
– Statistical analysis: summarize and describe data sets
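As a small illustration of two of the methods listed above, the sketch below computes a Pearson correlation coefficient and an ordinary least-squares line using only the standard library. The function names and the sample data are hypothetical; real analyses would normally rely on a statistics package.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Correlation analysis: strength of linear dependence, in [-1, 1]."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fit_line(xs, ys):
    """Regression analysis: least-squares slope and intercept of y on x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Made-up data: hours of operation versus a noisy sensor reading.
hours = [1, 2, 3, 4, 5]
readings = [2.1, 3.9, 6.2, 7.8, 10.1]

print("correlation:", round(pearson(hours, readings), 3))
slope, intercept = fit_line(hours, readings)
print("fitted line: reading =", round(slope, 2), "* hours +", round(intercept, 2))
```

The correlation says how strongly the two variables move together, while the fitted line recovers the underlying relationship that the noise in the readings partly hides.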
Big Data Analytics
• Bloom filter: uses hash functions to conduct lossy compression storage of data
– High space efficiency and high query speed
• Hashing: transforms data into shorter fixed-length numerical values or index values
– Rapid reading, but hard to find a good hash function
• Index: fast data retrieval and modification
– Additional cost for storing index files, which should be maintained dynamically when data is updated
• Trie: also called a trie tree, a variant of hash tree
– Fast string operations
– Leverages common prefixes of character strings to reduce comparisons on character strings
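Two of the structures above are compact enough to sketch directly. The Bloom filter below stores only bit positions derived from hashes, so it can report false positives but never false negatives, which is what "lossy" means here. The trie shares common prefixes between strings. Both classes are minimal sketches: the sizes, the use of seeded SHA-256 as the hash family, and the sample strings are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several hash functions by prefixing a seed to the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class Trie:
    """Prefix tree: strings with a common prefix share the same path."""

    _END = ""  # end-of-word marker; cannot collide with a single character

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def starts_with(self, prefix):
        """Return all stored words that begin with `prefix`, sorted."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        results, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            for key, child in current.items():
                if key == self._END:
                    results.append(text)
                else:
                    stack.append((child, text + key))
        return sorted(results)


seen = BloomFilter()
seen.add("sensor-17")
print(seen.might_contain("sensor-17"))

names = Trie()
for word in ("sensor", "send", "server"):
    names.insert(word)
print(names.starts_with("sen"))
```

The filter never stores the items themselves, only `size_bits` flags, which is the source of its space efficiency. A prefix lookup in the trie walks one node per character instead of comparing whole strings.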
Tools for Big Data Analysis
• The five most widely used software tools, according to a 2012 KDnuggets survey of 798 professionals that asked "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?"
– R [30.7%]
– Excel [29.8%]
– Rapid-I RapidMiner [26.7%]
– KNIME [21.8%]
– Weka/Pentaho [14.8%]
Summary
• Big data is different from traditional massive data
– Cannot be processed by general computers within an acceptable time
– Why big data is an inevitable result of the IoT
• The basics of big data and analytics
– Data generation/acquisition
– Data storage
– Data analytics
• Many systems have been built, each addressing a different aspect of big data