
Big Data and IoT

Baris Aksanli
02/10/2016

Why is there big data?

• Number of devices increasing exponentially
  – They continuously generate data
  – For example, on average, 72 hours of video are uploaded to YouTube every minute

How much data is big?

• 2010, Apache Hadoop: "datasets which could not be captured, managed, and processed by general computers within an acceptable scope"
• 3V model: Volume, Velocity, Variety [META]
  – +1V: Value [IDC]

Value of Big Data

• New business and efficiency opportunities
• An estimated $300B in potential value in the US medical industry
• Increased efficiency of government operations
• Search engines personalized for users
• Personalized ads, products, etc.

IoT and Big Data

• IoT applications continuously generate data
  – Even the smallest device generates data
• The problem: data processing capacity is lower than data generation speed

Big Data Classification

[Figure: classification of big data]

Path of the Data

Data collection & acquisition → Data transfer → Data processing & analysis

Data Generation

• Enterprise data: big companies, e.g. Facebook, Amazon
  – Business data is expected to double every 1.2 years
  – Walmart processes 1M customer transactions per hour
  – Akamai analyzes 75M events per day
• IoT data: pervasive applications, clinical medical care, R&D
  – Large-scale, heterogeneous, and strongly correlated data
  – 30 billion RFID tags and 4.6 billion camera phones are in use around the world today
  – If Walmart operated RFID at the item level, it would be expected to generate 7 terabytes (TB) of data every day
• Bio-medical data: human gene sequencing
  – One sequencing of a human gene may generate 100 sequences of 600GB of raw data
• Other areas: physics, bio-informatics, etc.
  – Astronomy: in the Sloan Digital Sky Survey (SDSS), the data volume generated per night surpasses 20TB

Data Acquisition

• Log files: almost all digital devices provide logging capability (parsing sketch after this slide)
  – Web activity recording, financial applications, network monitoring
• Sensing: converting physical quantities into readable digital signals
  – Sound waves, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc.
  – Localization
• Mobile platforms: similar to sensing
  – More personalized, specific to a user

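As an illustration of log-file acquisition, here is a minimal Python sketch that pulls structured records out of web-server access logs; the log format (Apache common log format) and the sample line are assumptions for illustration, not taken from the slides.

```python
import re

# Apache common log format: host, timestamp, request line, status, response size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log(lines):
    """Yield one dict per well-formed log line; silently skip malformed lines."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()

sample = ['192.168.1.10 - - [10/Feb/2016:13:55:36 -0800] "GET /index.html HTTP/1.1" 200 2326']
print(list(parse_log(sample)))
```
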
Data Transportation

• Data is transferred to a storage infrastructure for processing and analysis
• Inter-DCN (data center network) transmissions:
  – Source to data center
  – Using WAN links: 40-100 Gbps
• Intra-DCN transmissions:
  – Data center interconnect
  – Top-of-rack vs. aggregator switches
  – 1-10-100 Gbps

[Figure: data transportation example]

Data Preprocessing

• Eliminate or reduce redundancy, noise, and meaningless data
  – Increases storage efficiency and data analysis speed
• Integration: combining data from different sources
  – Data warehouse: ETL (Extract, Transform and Load)
  – Data federation
  – Mostly used by search engines
• Cleaning: how can data be cleaned?
  – Define error types -> identify errors -> correct errors -> document errors -> modify infrastructure to prevent errors
• Redundancy elimination (see the sketch after this slide)
  – Redundancy detection, data filtering, data compression
  – Areas: images, videos
• One solution: compression!

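A minimal sketch of the cleaning and redundancy-elimination steps above, applied to a stream of records before storage; the record fields and the validity range are illustrative assumptions.

```python
import hashlib

def preprocess(records):
    """Drop malformed records and exact duplicates before storage or analysis."""
    seen = set()
    for rec in records:
        # Cleaning: discard records with a missing or out-of-range reading
        if rec.get("value") is None or not (-50.0 <= rec["value"] <= 150.0):
            continue
        # Redundancy elimination: fingerprint the record, skip exact duplicates
        digest = hashlib.sha1(repr(sorted(rec.items())).encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield rec

readings = [
    {"sensor": "t1", "value": 21.5},
    {"sensor": "t1", "value": 21.5},   # exact duplicate, removed
    {"sensor": "t2", "value": None},   # missing value, removed
]
print(list(preprocess(readings)))      # [{'sensor': 't1', 'value': 21.5}]
```

Compression (the last bullet) would then be applied to whatever survives this pass.
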
Preprocessing Capabilities

Arduino (16 MHz, 32 KB flash) --1 Gbps network--> Raspberry Pi 2 (600 MHz, 1 GB RAM) --10 Gbps network--> Commodity server (3 GHz, 32 GB RAM)

• Assume there is a job with 1TB total size
• 100K Arduinos, 1K Raspberry Pi 2s, 100 servers
• Time spent in computation vs. networking (worked sketch after this slide)
  – Arduino level
  – Raspberry Pi 2 level
  – Server level

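A back-of-the-envelope sketch of the networking side of this comparison, assuming the 1TB job is split evenly across the devices of a tier and each device ships its share over the link speed shown above (the server-side link speed is an assumption). The computation side would need a per-device processing throughput, which the slide leaves open.

```python
TOTAL_BYTES = 1 * 10**12   # 1 TB job

tiers = {
    # tier: (device count, assumed uplink speed in bits/s)
    "Arduino":        (100_000, 1e9),    # 1 Gbps link, per the diagram
    "Raspberry Pi 2": (1_000,   10e9),   # 10 Gbps link, per the diagram
    "Server":         (100,     10e9),   # assumed 10 Gbps
}

for name, (count, bps) in tiers.items():
    share = TOTAL_BYTES / count          # bytes handled per device
    transfer_s = share * 8 / bps         # time to move that share over the link
    print(f"{name:>14}: {share / 1e6:10.1f} MB/device, "
          f"network time ~ {transfer_s:8.3f} s/device")
```

Note that the 10 MB per-device share already dwarfs the Arduino's 32 KB of flash, which hints at why preprocessing capability differs so much across the tiers.
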
Big Data Storage

• Storage and management of large-scale data sets while achieving reliability and availability of data access
  – Traditionally on servers with structured RDBMSs
• Existing storage systems for massive data
  – Direct-attached storage (DAS)
    • Several hard disks directly connected to servers
    • Only suitable for interconnecting servers at a small scale
  – Network-attached storage (NAS)
    • NAS uses the network to provide a unified interface for data access and sharing
    • The I/O burden on the server is reduced extensively, since the server accesses the storage device indirectly through the network
  – Storage area network (SAN)
    • Designed for data storage with a scalable, bandwidth-intensive network
    • Data storage management is relatively independent within the storage local area network

Distributed Storage System

• CAP: Consistency, Availability, Partition tolerance
  – At most two of the three requirements can be satisfied simultaneously
• CA vs. CP vs. AP systems
  – CA: for single servers
  – CP: useful for moderate load [BigTable and HBase]
  – AP: useful when there is no high demand on accuracy [Dynamo and Cassandra]

File systems for Big Data

• Google File System (GFS)
  – Files broken into chunks (typically 64MB)
  – A master manages the metadata
  – Data transfers happen directly between clients and chunkservers (toy sketch after this slide)
• Other examples:
  – HDFS and Kosmos
  – Extensions to GFS
  – Cosmos from MS
  – Haystack from FB

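A toy sketch of the chunking idea only (not the actual GFS protocol): the master keeps metadata mapping each 64MB chunk to a chunkserver, while the chunk bytes themselves would flow directly between clients and chunkservers. The chunkserver names are placeholders.

```python
import itertools
import uuid

CHUNK_SIZE = 64 * 1024 * 1024               # 64 MB, as on the slide
CHUNKSERVERS = ["cs-0", "cs-1", "cs-2"]     # hypothetical chunkserver names

def register_file(path):
    """Master-side bookkeeping: map each chunk of a file to a chunkserver."""
    metadata = []
    servers = itertools.cycle(CHUNKSERVERS)  # naive round-robin placement
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            metadata.append({
                "file": path,
                "chunk_index": index,
                "chunk_id": uuid.uuid4().hex,
                "server": next(servers),
            })
            index += 1
    return metadata

# register_file("sensor_dump.bin") -> list of {file, chunk_index, chunk_id, server}
```
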
Database Technology

• Key-value databases: data is stored and looked up by a unique key -> shorter query response time
  – Provide expandability by distributing keys across nodes (see the sketch after this slide)
  – Dynamo [Amazon] and Voldemort [LinkedIn]
• Column-oriented databases: store and process data by column rather than by row
  – Both columns and rows are segmented across multiple nodes to realize expandability
  – BigTable [Google] and Cassandra [Facebook]
• Document databases: can support more complex data forms, and key-value pairs can still be saved
  – Structured data storage with objects
  – MongoDB [binary JSON objects], SimpleDB [Amazon] and CouchDB [Apache]

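A minimal sketch of the key-value idea: hash each key to pick the node responsible for it, so data and load spread across nodes. Real systems such as Dynamo use consistent hashing and replication; this toy version uses plain modulo placement and in-memory dicts as "nodes".

```python
import hashlib

class TinyKeyValueStore:
    """Toy partitioned key-value store: each key lives on exactly one node."""

    def __init__(self, num_nodes=3):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]   # modulo placement, not consistent hashing

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

store = TinyKeyValueStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))   # {'name': 'Ada'}
```
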
Programming Models

• Traditional parallel models do not perform well
  – Scalability issues: big data is generally stored across hundreds or even thousands of commodity servers (see the sketch after this slide)

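The slide does not name a replacement model, but the map/reduce style is the usual answer to this scalability problem: ship the computation to the servers that already hold the data, then combine the partial results. A minimal single-process sketch of the idea (word counting), not any particular framework's API:

```python
from collections import defaultdict

def map_phase(document):
    # map: emit (word, 1) for every word in one document
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # reduce: sum the counts emitted for each word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["IoT devices generate data", "big data needs big storage"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))   # e.g. 'big' -> 2, 'data' -> 2, 'iot' -> 1, ...
```

In a real deployment the map calls run in parallel on the machines storing each document, which is what gives the model its scalability.
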
Data Analysis

• The goal is to extract useful value, with suggestions or decisions
• Traditional data analysis
  – Cluster analysis: grouping objects
  – Factor analysis: describe the relations among many elements with a few factors
  – Correlation analysis: dependence among variables
  – Regression analysis: dependence relationships among variables hidden by randomness (see the sketch after this slide)
  – A/B testing: improve target variables by comparing a tested group against a control group
  – Statistical analysis: summarize and describe data sets

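A small worked example of correlation and regression analysis on made-up sensor data (requires Python 3.10+ for the statistics helpers); the numbers are illustrative only.

```python
from statistics import correlation, linear_regression   # Python 3.10+

# Hypothetical readings: ambient temperature vs. device power draw
temperature = [18.0, 20.5, 23.1, 25.0, 27.4, 30.2]
power_watts = [4.1, 4.4, 4.9, 5.2, 5.8, 6.3]

r = correlation(temperature, power_watts)            # strength of linear dependence
fit = linear_regression(temperature, power_watts)    # power ~ slope*temp + intercept

print(f"correlation r = {r:.3f}")
print(f"power ~ {fit.slope:.3f} * temperature + {fit.intercept:.3f}")
```
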
Big Data Analytics

• Bloom filter: uses hash functions to perform lossy, compressed storage of data (see the sketch after this slide)
  – High space efficiency and high query speed
• Hashing: transforms data into shorter fixed-length numerical or index values
  – Rapid reading, but it is hard to find a good hash function
• Index: fast data retrieval and modification
  – Additional cost for storing index files, which must be maintained dynamically when data is updated
• Trie (trie tree): a variant of the hash tree
  – Fast string operations
  – Leverages common prefixes of character strings to reduce string comparisons

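A minimal Bloom filter sketch matching the first bullet: k hash functions set k bits per inserted item, so membership tests may return false positives but never false negatives, at a fraction of the space of storing the items themselves. The sizes and hash construction are arbitrary choices for illustration.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter over a fixed-size bit array."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("sensor-17")
print("sensor-17" in bf)   # True
print("sensor-99" in bf)   # False with high probability (false positives possible)
```
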
Tools for Big Data Analysis

• The top five most widely used software packages, according to a 2012 KDnuggets survey of 798 professionals asking "What analytics, data mining, big data software have you used in the past 12 months for a real project?":
• R [30.7%]
• Excel [29.8%]
• Rapid-I RapidMiner [26.7%]
• KNIME [21.8%]
• Weka/Pentaho [14.8%]

Summary

• Big data is different from traditional massive data
  – It cannot be processed by general computers within an acceptable time
  – Big data is an inevitable result of the IoT
• The basics of big data and analytics
  – Data generation/acquisition
  – Data storage
  – Data analytics
• Many systems have been built, each addressing a different aspect of big data
