Bigdata Overview PDF
Large DataSets
Author: Rajdeep Dua
Twitter: @rajdeepdua
Agenda
Introduction, Historical Perspective
SQL and its Limitations
Data Pipelines
Cloud and Big Data
Storing and Serializing Data
Map Reduce, Hadoop Ecosystem
Hadoop Components: MR, Pig, Hive
NoSQL
Big Data in the Cloud
Machine Learning Introduction
Machine Learning Demos
Introduction
Large Data is being Generated
    Mobility
    Internet of Things
    Social Data
Need to
    Store Data
    Analyze Data
    Visualize Data
Analysing Data
Data analysis has been done for ages
The traditional terms were Data Mining / Machine Learning
Data Science is the new term
Need for New Age Skills
    Learn new tools
    Learn how to handle large data sets
    Learn the domain and how to extract meaning from a LOT of noise
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Historical Perspective
When Data became Important:
    Apple II
    Windows 3.1, Rise of Desktop Tools like Spreadsheets and SPSS
SQL Databases
SQL databases have evolved from System R [1974]
First commercial database released by Oracle in 1979
Rise of SQL (Structured Query Language): SQL 92, SQL 99
Codd's rules: Rules 0 to 12, 13 rules which define the characteristics of a relational database
http://en.wikipedia.org/wiki/Relational_database_management_system
Limitations of RDBMS
Fixed Schema
Cannot Scale to Large Clusters
Very Expensive Licensing
Querying large data sets using SQL on an RDBMS is very time consuming
Jim Gray (Turing Award Winner)
Ben Fry (Visualization Expert): Acquire, Parse, Filter, Mine, Represent, Refine, Interact
Jeff Hammerbacher (Facebook, Cloudera): Identify Problem, Instrument Data Sources, Collect Data, Evaluate, Build Model, Communicate Results
Spreadsheets, Databases, Data Warehousing Tools
Special Tools: SPSS, SAS, Spreadsheets
Batch Jobs: Map Reduce, Pig, Hive
Apache Storm, Spark
Apache Mahout, Apache Spark, R, Python Based Tools
[Figure: Processed Data Pipeline. Labels: Source1, Source2, Data Dump, Curated Data, API, Job1, Job2, HDFS, Persistence Layer (SQL/NoSQL), Dashboard, Machine Learning as a Service]
Storing Data
The approach varies depending on the following factors:
    Privacy requirements of the data
    Type of organization
    Preference for private or public cloud
Public Cloud
    Store files on AWS S3, Google Cloud Storage or Azure Storage
Private Cloud
    Store files on EMC ViPR, OpenStack Swift
Serializing Data
Need to send data in an efficient binary format rather than a text format
Binary format is much smaller in size
Serialization and deserialization are much faster
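The size difference between a text and a binary encoding can be sketched with the Python standard library. The record below and its field layout are hypothetical, chosen only for illustration; Thrift and Protocol Buffers use their own wire formats, which this sketch does not reproduce.

```python
import json
import struct

# Hypothetical record: a sensor id, a timestamp and a reading.
record = {"sensor_id": 42, "timestamp": 1700000000, "reading": 21.5}

# Text encoding: field names and digits are spelled out as characters.
text_bytes = json.dumps(record).encode("utf-8")

# Binary encoding with a fixed, assumed layout (little-endian, no padding):
# unsigned int (4 bytes), unsigned long long (8 bytes), double (8 bytes).
LAYOUT = "<IQd"
binary_bytes = struct.pack(
    LAYOUT, record["sensor_id"], record["timestamp"], record["reading"]
)

# Deserialization reverses the packing using the same layout.
sensor_id, timestamp, reading = struct.unpack(LAYOUT, binary_bytes)

print(len(text_bytes), len(binary_bytes))
```

The binary form needs both sides to agree on the layout in advance, which is the role a schema file plays in the tools discussed here.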
Apache Thrift
Define your service definition using a Thrift file
Generate bindings for specific languages: Java, Python, C++ etc.
Create the server using those bindings. The server accepts a TCP connection
Create a client which serializes data using these bindings
Protocol Buffers: Serializing Data
Developed by Google
Used extensively by Google for all its services for communicating with each other
Runtime Performance
Map Reduce
A programming framework to process very large data sets (sometimes petabytes) over a large cluster of servers
First implemented at Google to build the search index and process incoming ads
The open source version, implemented at Yahoo, is called Hadoop
Commercial distributions are available from Cloudera, Hortonworks, MapR and Greenplum
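The programming model can be illustrated with a single-process word count. This is only a sketch of the map, shuffle and reduce steps in plain Python; it does not use the Hadoop API, and the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

On a real cluster the map calls run in parallel on different blocks of the input and the shuffle moves data between machines; the logic per key stays the same.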
Hadoop Ecosystem
Apache Oozie: Workflow
Hive: DW System
Pig Latin: Data Analysis
Mahout: Machine Learning
Other YARN frameworks: MPI, Giraph
Flume: Unstructured Data
Sqoop: Structured Data
HBase
Processing: YARN
HDFS
[Figure: Hadoop cluster layout. Master: NameNode, Resource Manager, Secondary NameNode. Slave: DataNode, Node Manager.]
[Figure: YARN has one Resource Manager and several Node Managers. HDFS has one NameNode and several DataNodes.]
HDFS Federation
[Figure: HDFS layers. Namespace; Block Storage Service, made up of Block Management and Storage on the Datanodes.]
Hive
What is Hive
Data warehousing package built on top of Hadoop
Used for data analysis
Targeted towards users comfortable with SQL
Abstracts the complexity of Hadoop
No need to learn Java and Hadoop APIs
Developed by Facebook and maintained by the community
What is Hive?
Defines a SQL-like query language: QL
Allows programmers to plug in custom mappers and reducers
Provides tools to enable ETL
Hive Applications
Data Mining
Log Processing
BI
Predictive Modelling
Hive Architecture
[Figure: Hive architecture. Interfaces: JDBC, ODBC, CLI, HWI, Thrift Server. Hive: Driver (compiles, optimizes, executes), Metastore. Hadoop: Master (JobTracker, Name Node), DFS.]
Limitations of Hive
Not designed for online transaction processing
Does not offer real-time queries and row-level updates
Latency for Hive queries is generally very high (minutes)
Provides acceptable (not optimal) latency for interactive data browsing
Abilities of HiveQL
Filter rows from a table using a 'where' clause
Store the results of a query into another table
Manage tables and partitions (create, drop and alter)
Store the results of a query in a Hadoop DFS directory
Do equi-joins between two tables
Partitions
HDFS sub-directory for college=IITD and branch=ece:
/hive/warehouse/Student/college=IITD/branch=ece
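The mapping from partition values to a directory can be sketched as a small helper. The function name `partition_path` is hypothetical and only mirrors the naming pattern shown in the path above; Hive builds these directories itself.

```python
def partition_path(warehouse_dir, table, partition_spec):
    """Build the directory for one partition.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one nested "column=value" directory level.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([warehouse_dir.rstrip("/"), table] + parts)


path = partition_path(
    "/hive/warehouse", "Student", [("college", "IITD"), ("branch", "ece")]
)
print(path)
```

Because each partition is its own directory, a query that filters on college and branch only has to read the files under the matching path.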
Load Data in HDFS
Load Data into Hive
Query Data
John Doe    100000.0    Mary Smith,Todd Jones    Federal Taxes-.2,State Taxes-.05,Insurance-.1    A1~Michigan Ave~Chicago~IL~B60600
Mary Smith    80000.0    Bill King    Federal Taxes-.2,State Taxes-.05,Insurance-.1    100~Ontario St.~Chicago~IL~60601
Todd Jones    70000.0    Mary Smith    Federal Taxes-.15,State Taxes-.03,Insurance-.1    200~Chicago Ave.~Oak Park~IL~B60700
Bill King    60000.0    Todd Jones    Federal Taxes-.15,State Taxes-.03,Insurance-.1    300~Obscure Dr.~Obscuria~IL~60100
Field Separator: \t
Array Delimiter: ,
Map Key-Value Separator: -
Struct Separator: ~
Query Data
hive> select * from employees;
OK
John Doe 100000.0 ["Mary Smith","Todd Jones"]
{"Federal Taxes":0.2,"State Taxes .05":null}
{"street":"Insurance-.1","city":null,"state":null,"zip":null}
Mary Smith 80000.0 ["Bill King"]
{"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
{"street":"100~Ontario St.~Chicago~IL~60601","city":null,"state":null,"zip":null}
Todd Jones 70000.0 ["Mary Smith"]
{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
{"street":"200~Chicago Ave.~Oak Park~IL~B60700","city":null,"state":null,"zip":null}
Bill King 60000.0 ["Todd Jones"]
{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
{"street":"300~Obscure Dr.~Obscuria~IL~60100","city":null,"state":null,"zip":null}
Time taken: 0.453 seconds, Fetched: 4 row(s)
Internal Tables
Also called managed tables, because Hive controls the lifecycle of their data
When we drop an internal table, Hive deletes the data in the table
Less convenient for sharing with other tools
External Tables
Unlike internal tables, Hive does not own the data in an external table
Dropping the table does not delete the data; only the metadata for the table is deleted
Use cases
Processing of web logs
Data processing for search platforms
Support for ad hoc queries across large datasets
Quick prototyping of algorithms for processing large datasets
Grunt
An interactive shell for running Pig commands
It is also possible to run Pig scripts within Grunt using the run and exec commands
Embedded
Pig programs can be run from Java, much as JDBC is used to run SQL programs from Java
Components
Pig is made up of two components
Pig Latin
    Used to express data flows
Execution Environments
    Distributed execution on a Hadoop cluster
    Local execution in a single JVM
Pig Latin
A Pig Latin program is made up of operations or transformations that are applied to the input data to produce output.
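The data-flow idea, a chain of transformations from input to output, can be sketched as a Python analogy. This is not Pig Latin syntax; the records and the filter, group and aggregate steps are hypothetical and stand in for the kind of operations a data-flow script expresses.

```python
from collections import defaultdict

# Hypothetical input: (user, url, seconds_on_page) records.
records = [
    ("ann", "/home", 12),
    ("bob", "/home", 3),
    ("ann", "/cart", 40),
    ("bob", "/cart", 25),
]

# Step 1: filter, keeping only visits of at least 10 seconds.
long_visits = [r for r in records if r[2] >= 10]

# Step 2: group the remaining visits by url.
by_url = defaultdict(list)
for user, url, seconds in long_visits:
    by_url[url].append(seconds)

# Step 3: aggregate each group into a total.
total_seconds = {url: sum(values) for url, values in by_url.items()}
print(total_seconds)
```

Each step consumes the previous step's result, which is what lets an execution environment run the same flow locally or on a cluster.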
NoSQL Databases
Scalability
    Ability to easily scale up or down
Can handle and store data depending on the source
Flexible schema or no schema
What is NoSQL
NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility.
Need for NoSQL Databases
Ability to easily scale up to very large clusters
Can handle and store data depending on the source
Flexible schema or no schema
NoSQL Case Study
ACID vs BASE
ACID: Atomicity, Consistency, Isolation, Durability
BASE: Basic Availability, Soft State, Eventual Consistency
Apache Cassandra
Started at Facebook, open sourced in 2008
Top-level Apache project
Used by Netflix, Twitter and Rackspace
Why Cassandra
    Scale
    Operations
    Data Model
MongoDB Overview
Documents
At the heart of MongoDB is the document: an ordered set of keys with associated values. The representation of a document varies by programming language, but most languages have a data structure that is a natural fit, such as a map, hash, or dictionary. In JavaScript, for example, documents are represented as objects:
{"greeting" : "Hello, world!"}
{"greeting" : "Hello, world!", "foo" : 3}
Collections
A collection is a group of documents. A collection is like a table in a relational database, and a document is like a row.
Dynamic Schemas
Collections have dynamic schemas. This means that the documents within a single collection can have any number of different shapes. For example, both of the following documents could be stored in a single collection:
{"greeting" : "Hello, world!"}
{"foo" : 3}
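In Python the natural fit for a document is a dictionary, and a collection can be pictured as a list of dictionaries. The sketch below is an in-memory illustration only; the `find` helper is hypothetical and is not the MongoDB driver API.

```python
# Two documents with different shapes held in the same collection.
collection = [
    {"greeting": "Hello, world!"},
    {"foo": 3},
]


def find(documents, field):
    """Return the documents that contain the given field."""
    return [doc for doc in documents if field in doc]


with_greeting = find(collection, "greeting")

# No schema is declared up front, so the set of fields is only known
# by looking at the documents themselves.
field_names = sorted({key for doc in collection for key in doc})
print(with_greeting, field_names)
```

Code that reads such a collection has to tolerate missing fields, since nothing forces every document to carry the same keys.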
Databases
Collections are grouped into databases
[Figure: a Database contains Collections, and each Collection contains Documents]
Introduction
Amazon EMR is an AWS service that allows users to launch and use resizable Hadoop clusters inside Amazon's infrastructure
Can be used to analyze large data sets
Greatly simplifies setup and management of the cluster of Hadoop and MapReduce components
EMR instances use Amazon's prebuilt and customized EC2 instances
Can take full advantage of other AWS services
Introduction contd...
EC2 instances are launched when we start a new Job Flow to form an EMR cluster
A Job Flow is Amazon's term for the complete data processing that occurs through a number of compute steps
A Job Flow is specified by the MapReduce application and its input and output parameters
Architecture
Accessing EMR
Using the Management Console
Using the Command Line Interface
Amazon EMR SDKs: Java, PHP, Python, .NET etc.
Choosing the Instance Type
    Depends on the use case
Machine Learning
Extract useful information from the data by designing models
Use Cases
    Clustering
    Classification
    Decision Trees
    Regression
Clustering Techniques
Centroid Based Clustering: k-means
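A minimal k-means sketch in plain Python follows. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. Using the first k points as starting centroids is a simplifying assumption made here to keep the run deterministic; practical implementations usually pick starting points more carefully.

```python
def kmeans(points, k, iterations=10):
    """Cluster points (tuples of floats) into k groups."""
    centroids = [points[i] for i in range(k)]  # assumed, simplistic start
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```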
Clustering Techniques
Distribution Based Clustering: The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution.
One prominent method is known as Gaussian mixture models (using the expectation-maximization algorithm).
The data set is modelled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to fit better to the data set.
Example algorithm: Gaussian Mixture Models using Expectation Maximization
Real world example: classifying genes into clusters using Gaussian Mixture Models
Clustering : DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester et al. in 1996.
It is a density-based clustering algorithm: given a set of points in some space, it
    groups together points that are closely packed together (points with many nearby neighbors),
    marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
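The grouping and outlier rules described above can be sketched as a compact DBSCAN in plain Python. It is a brute-force illustration for small 2-D inputs, with neighbor search by scanning every point; the parameter names `eps` (neighborhood radius) and `min_pts` (neighbors needed for a dense region) follow common usage and are not taken from this deck.

```python
NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within eps of points[index], itself included."""
    px, py = points[index]
    return [
        j
        for j, (qx, qy) in enumerate(points)
        if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2
    ]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not given in advance; it falls out of the density parameters.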
http://www.slideshare.net/jonsedar/customer-clustering-for-marketing?related=1
Process Followed
Analysis
    Converted features into principal components
Clustering technique: k-means
For parameter learning, the expectation maximization algorithm alternates between computing probabilities for assignments of each gene to each cluster (E-step) and updating the cluster means and covariance based on the set of genes predominantly belonging to that cluster (M-step).
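The E-step and M-step alternation can be sketched for a one-dimensional Gaussian mixture. This is the standard soft-assignment form of EM, in which every point contributes to every cluster in proportion to its probability; the data values and starting parameters are hypothetical, and a full treatment would use multivariate covariances.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture; the parameter lists are updated in place."""
    k = len(means)
    resp = []
    for _ in range(iterations):
        # E-step: probability of each point belonging to each cluster.
        resp = []
        for x in data:
            likes = [
                weights[j] * gaussian_pdf(x, means[j], variances[j])
                for j in range(k)
            ]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate each cluster from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed cluster
            weights[j] = nj / len(data)
    return means, variances, weights, resp


data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(means, weights)
```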
Classification Algorithms
Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Example: assigning a given email into "spam" or "non-spam" classes, or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).
List of Algorithms
Classifiers
    Linear Classifiers
    Quadratic Classifiers
    Support Vector Machines
    Decision Trees
    Neural Networks
Regression
    Linear Regression
    Logistic Regression
    Polynomial Regression
    Generalized Linear Model
Linear Classifier
A linear classifier determines the class of an object by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.
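The decision rule can be sketched in a few lines: compute a weighted sum of the feature vector plus a bias, then threshold it. The weights, bias and threshold at zero below are hypothetical values chosen for illustration; in practice the weights are learned from training data.

```python
def linear_score(weights, bias, features):
    """Linear combination of the feature values."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Return class 1 if the score is non-negative, otherwise class 0."""
    return 1 if linear_score(weights, bias, features) >= 0 else 0


weights = [2.0, -1.0]  # hypothetical, one weight per feature
bias = -0.5            # hypothetical

print(classify(weights, bias, [1.0, 0.5]))
print(classify(weights, bias, [0.0, 1.0]))
```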
Linear Regression
Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
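For the simple case with one explanatory variable, the least-squares line has a closed form, sketched below in plain Python. The sample values are hypothetical and lie exactly on a line so the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```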
Logistic Regression
Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive" or "win" vs. "loss"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C").
Logistic regression is used to predict the odds of being a case based on the values of the independent variables (predictors). The odds are defined as the probability that a particular outcome is a case divided by the probability that it is a noncase.
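The link between probability and odds can be sketched for the binary case. In a logistic model the probability is the logistic function of a linear score, so the odds equal the exponential of that score. The intercept and coefficient below are hypothetical, not fitted values.

```python
import math


def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Probability of a case divided by probability of a noncase."""
    return p / (1.0 - p)


intercept, coefficient = -1.0, 0.8  # hypothetical model parameters
x = 2.5                             # one predictor value
z = intercept + coefficient * x     # linear score
p = sigmoid(z)

print(p, odds(p), math.exp(z))
```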
Decision Trees
A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path.
Find the entropy at each level and choose the attribute whose split gives the lowest entropy for splitting the data
Can be used for classification and regression
Other types of trees:
    Random Forest
    Boosted Trees
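The entropy-based choice of a split can be sketched as follows. The rows, attribute names and labels are hypothetical; the code computes the weighted entropy left after splitting on each attribute and picks the attribute with the lowest value, which is one level of tree building rather than a full tree.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def split_entropy(rows, attribute, label_key):
    """Weighted entropy of the labels after splitting rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label_key])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def best_attribute(rows, attributes, label_key):
    return min(attributes, key=lambda a: split_entropy(rows, a, label_key))


rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
best = best_attribute(rows, ["outlook", "windy"], "play")
print(best)
```

Here splitting on outlook leaves two pure groups, while splitting on windy leaves the labels as mixed as before, so outlook is chosen.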
Appendix
Demos
MemSQL: http://fast.wistia.net/embed/iframe/yi5kwa94uk?popover=true
MetaMarkets: Analytics on Programmatic Advertising: http://fast.wistia.net/embed/iframe/yi5kwa94uk?popover=true