Big Data Analytics
C. Ranichandra & N.C. Senthilkumar
NoSQL
Not Only SQL: databases that are non-relational, distributed, open-source and horizontally scalable.
More than 255 NoSQL databases exist, categorized as:
Column store / column families: HBase, Accumulo, IBM Informix
Document store: Azure DocumentDB, MongoDB, IBM Cloudant
Key-value / tuple store: DynamoDB, Azure Table Storage, Oracle NoSQL DB
Graph databases: AllegroGraph, Neo4j, OrientDB
NoSQL
Since the 1970s, the RDBMS had been the only widely used solution for data storage, manipulation and maintenance.
Once data changed along every dimension (the Vs), companies needed new solutions for processing big data.
One solution is Hadoop, but it offers only sequential access to data.
HBase
HBase is a distributed, column-oriented database built on top of HDFS.
It is modelled on Google's Bigtable and provides random access to structured data.
As part of the Hadoop ecosystem, it provides random read/write access to data stored in HDFS.
Random R/W
HDFS and HBase
HDFS                                  HBase
Distributed file system               Database built on HDFS
High-latency batch processing         Low-latency access to single rows
Only sequential access to data        Random access via hash index
Storage Mechanism in HBase
Column-oriented storage.
The table schema defines only column families; within a family, each column is stored as a key-value pair.

Row id    Column family 1           Column family 2
          col1   col2   col3        col1   col2   col3
HBase and RDBMS
HBase                                              RDBMS
Schema-less                                        Schema-oriented
Built for wide tables, horizontally scalable       Thin, built for small tables, hard to scale
No transactions; suitable for OLAP                 Transactional
Denormalized data                                  Normalized data
Good for semi-structured and structured data       Good for structured data
Applications of HBase
Write-heavy applications.
Applications that need random access to data.
Facebook, Twitter, Yahoo and Adobe use HBase internally.
HBase Architecture
Tables are split into regions, each served by a region server.
Regions are divided into stores, whose data is stored as files in HDFS.
HBase Shell Commands
General:
status
version
table_help
whoami
DDL:
create
list
disable
is_disabled
enable
is_enabled
describe
alter
exists
drop_all - drops the tables matching a regex
HBase Shell Commands
DML:
put - put a cell value
get - get a row or cell
delete - delete a cell value
deleteall - delete all the cells in a row
scan - scan and return table values
count - count the number of rows in a table
truncate - disable, drop and recreate a specified table
DDL
create '<table name>', '<column family>'
list
disable '<table name>'
describe '<table name>'
drop '<table name>'
drop_all '<regex>'
alter '<table name>', NAME => '<column family>', VERSIONS => 5
alter '<table name>', NAME => '<column family>', METHOD => 'delete'
DML Commands
put '<table name>', '<row key>', '<column family:column>', '<value>'
scan '<table name>'
get '<table name>', '<row key>', {COLUMN => '<column family:column>'}
delete '<table name>', '<row key>', '<column family:column>'
deleteall '<table name>', '<row key>'
truncate '<table name>'
Example
DDL+DML Commands
create 'emp', 'personal data', 'professional data'
put 'emp', '1', 'personal data:name', 'raju'
put 'emp', '1', 'personal data:city', 'hyderabad'
put 'emp', '1', 'professional data:designation', 'manager'
put 'emp', '1', 'professional data:salary', '50000'
scan 'emp'
DDL+DML Commands
get 'emp', '1'
get 'emp', '1', {COLUMN => 'personal data:name'}
delete 'emp', '1', 'personal data:city'
deleteall 'emp', '1'
put 'emp', '1', 'personal data:city', 'Delhi'
count 'emp'
truncate 'emp'
disable 'emp'
drop 'emp'
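The same DDL and DML can also be performed through the HBase Java client API. The sketch below is not part of the original slides: it assumes an HBase 2.x client library on the classpath and a reachable cluster, and it mirrors the 'emp' shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class EmpTableDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();          // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // DDL: create 'emp' with two column families
            TableName emp = TableName.valueOf("emp");
            admin.createTable(TableDescriptorBuilder.newBuilder(emp)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal data"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional data"))
                    .build());
            // DML: put a cell value, get a row, scan the table
            try (Table table = conn.getTable(emp)) {
                Put p = new Put(Bytes.toBytes("1"));
                p.addColumn(Bytes.toBytes("personal data"),
                            Bytes.toBytes("name"), Bytes.toBytes("raju"));
                table.put(p);                                       // like: put 'emp','1','personal data:name','raju'
                Result row = table.get(new Get(Bytes.toBytes("1"))); // like: get 'emp', '1'
                System.out.println(Bytes.toString(
                        row.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"))));
                try (ResultScanner scanner = table.getScanner(new Scan())) { // like: scan 'emp'
                    for (Result r : scanner) System.out.println(r);
                }
            }
        }
    }
}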
Exercise: MBA Admissions
Row key: Application no
Column family Personal Data: Name, Gender, Address
Column family Academic details: UG qualification, University/College, Overall percentage
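One possible starting point for this exercise, as a sketch in the same Java client style as above. The table name 'admissions', the sample row key and the sample values are illustrative choices, not prescribed by the slides.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AdmissionsTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Row key = application no; two column families as in the exercise layout
            TableName admissions = TableName.valueOf("admissions");
            admin.createTable(TableDescriptorBuilder.newBuilder(admissions)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal data"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("academic details"))
                    .build());
            try (Table table = conn.getTable(admissions)) {
                Put p = new Put(Bytes.toBytes("A001"));             // application no as the row key
                p.addColumn(Bytes.toBytes("personal data"),
                            Bytes.toBytes("name"), Bytes.toBytes("asha"));
                p.addColumn(Bytes.toBytes("academic details"),
                            Bytes.toBytes("overall percentage"), Bytes.toBytes("78"));
                table.put(p);
            }
        }
    }
}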
Batch Processing
Batch processing is the execution of a series of jobs in a program on a computer without manual intervention (non-interactive).
Strictly speaking, it is a processing mode: the execution of a series of programs, each on a set or "batch" of inputs, rather than on a single input (which would instead be a custom job).
Batch Processing
Batch processing is very efficient for high-volume data: data is collected, entered into the system and processed, and the results are produced in batches.
The time taken for processing is not a major concern.
Batch jobs are configured to run without manual intervention and are run against the entire dataset at scale, producing output in the form of computational analyses and data files.
Depending on the size of the data being processed and the computational power of the system, output can be delayed significantly.
MapReduce
MapReduce is a programming paradigm that runs in the background of Hadoop to provide scalability and easy data-processing solutions.
The Map task takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output of the Map task as input and combines those tuples into a smaller set of tuples.
Phases
Input Phase: a record reader translates each record in the input file and sends the parsed data to the mapper as key-value pairs.
Map: a user-defined function that takes a series of key-value pairs and processes each of them to generate zero or more key-value pairs.
Intermediate Keys: the key-value pairs generated by the mapper are known as intermediate keys.
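As an illustration of the Map phase, here is a minimal mapper sketch for the word-count example that appears a few slides later. It uses the org.apache.hadoop.mapreduce API; the class name TokenizerMapper is our own choice, not from the slides. For each input line it emits one (word, 1) intermediate pair per token.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text) from the record reader.
// Output: zero or more intermediate (word, 1) pairs per line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}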
Phases
Combiner: a type of local reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not part of the core MapReduce algorithm and is optional.
Shuffle and Sort: the Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine where the reducer is running. The individual key-value pairs are sorted by key into a larger data list, which groups equivalent keys together so that their values can be iterated easily in the Reducer task.
Phases
Reducer: takes the grouped key-value data as input and runs a reducer function on each group. Here the data can be aggregated, filtered and combined in a number of ways, which may require a wide range of processing. Once execution is over, it emits zero or more key-value pairs to the final step.
Output Phase: an output formatter translates the final key-value pairs from the reducer function and writes them to a file using a record writer.
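The matching reducer sketch for the word-count example (again our own class name, not from the slides): shuffle and sort hands it each word together with the list of 1s emitted for it, and it writes one (word, total) pair to the output phase. Because summing is associative and commutative, the same class can also serve as the optional combiner.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a word and the list of counts grouped for it by shuffle-and-sort.
// Output: one (word, total) pair per distinct word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();       // aggregate the values for this key
        }
        result.set(sum);
        context.write(key, result); // final key-value pair for the output phase
    }
}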
Word Count Example
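To complete the word-count example, here is a driver sketch (not from the slides) that wires together the TokenizerMapper and IntSumReducer classes sketched above; the job name, reducer count and command-line paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // map phase: emits (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);   // optional local reducer run on map output
        job.setReducerClass(IntSumReducer.class);    // reduce phase: sums the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                    // number of reducers (mapred.reduce.tasks)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}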
Anatomy of MapReduce
MapReduce 1 (classic)
MapReduce 2 (YARN)
hadoop> start-dfs.sh
  starts the namenode, datanodes and secondary namenode
hadoop> jps
  lists process IDs: NameNode and SecondaryNameNode on the master, DataNode on the slaves
hadoop> start-yarn.sh
  starts the resource manager on the master and a node manager on each slave
hadoop> jps
  lists process IDs: ResourceManager on the master, NodeManager on the slaves
Execute wordcount
hadoop> hadoop jar wordcount.jar -input /usr/local/hadoop/input/4800-4.txt -output /usr/local/hadoop/output
Classic MapReduce Framework
Four entities:
The client, which submits the job.
The jobtracker, which coordinates the job (a Java application whose main class is JobTracker).
The tasktrackers, which run the tasks the job has been split into (main class TaskTracker).
The distributed file system, used for sharing job files between the entities.
Job Submission
Asks the jobtracker for a new job ID.
Checks the output directory (error if it is not specified or already exists).
Computes the input splits (error if the input path does not exist).
Copies the resources needed to run the job (the job JAR file, the configuration file and the computed input splits) to the jobtracker's filesystem.
Tells the jobtracker that the job is ready for execution by calling submitJob().
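A sketch of what the client side of these steps looks like with the classic (org.apache.hadoop.mapred) API. This is not from the slides; for brevity it uses the default identity mapper and reducer, so it only demonstrates the submission path rather than a useful computation.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClassicSubmit {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ClassicSubmit.class);   // job configuration; JAR located via the class
        conf.setJobName("identity-passthrough");
        // Defaults apply: IdentityMapper and IdentityReducer; with TextInputFormat the
        // records are (LongWritable offset, Text line), passed through unchanged.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input splits are computed from here
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not exist yet
        // runJob() asks the jobtracker for a job ID, checks the output directory, computes the
        // splits, copies the JAR, configuration and splits to the jobtracker's filesystem, and
        // then calls submitJob(), i.e. the sequence described on this slide.
        JobClient.runJob(conf);
    }
}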
Job Initialization
The jobtracker puts the job in an internal queue.
The job scheduler picks up the job and initializes it by creating an object that encapsulates its tasks and bookkeeping information.
One map task is created for each split.
The number of reduce tasks is given by the mapred.reduce.tasks property.
In addition, a setup task and a cleanup task are created, run by tasktrackers.
Task Assignment
Tasktrackers run a simple loop that periodically sends a heartbeat to the jobtracker.
The heartbeat indicates whether the tasktracker is ready to run a task.
Each tasktracker has a fixed number of slots for map tasks and reduce tasks (default: 2 each).
For a map task, the jobtracker chooses a tasktracker close to the input split (data-local or rack-local).
For a reduce task, it simply takes the next one in its list.
Task Execution and Job Completion
The job JAR and other supporting files are copied (localized) from the shared filesystem.
The tasktracker creates an instance of TaskRunner to run the task.
When the jobtracker is notified that the last task (the job cleanup task) is complete, it marks the job as successful.
YARN (MapReduce 2)
Designed for large clusters (4000 nodes and beyond).
YARN: Yet Another Resource Negotiator.
Remedies the scalability limits of classic MapReduce by splitting the role of the jobtracker between a resource manager and an application master.
Entities
Client: submits the job.
Resource manager: coordinates the allocation of compute resources on the cluster.
Node managers: monitor the machines in the cluster.
Application master: coordinates the tasks running the MapReduce job.
HDFS: used for sharing job files between the entities.
Failures
In classic MapReduce, the failure modes are: failure of the running task, of the tasktracker, and of the jobtracker.
Task failure: a map or reduce task fails, for example due to a runtime exception.
Tasktracker failure: a tasktracker fails by crashing or running slowly; the jobtracker notices via missing heartbeats and removes it from its pool. Any job that is incomplete or in progress on it is scheduled again on other tasktrackers, since the intermediate results (intermediate keys) may exist only on the failed node's local filesystem.
A tasktracker is blacklisted by the jobtracker if more than four tasks from the same job fail on it.
Jobtracker failure: the most serious mode and a single point of failure; classic Hadoop has no mechanism for dealing with jobtracker failure.
Failures in YARN
Modes: task, application master, node manager, resource manager.
Task failure: handled the same way as in classic MapReduce.
Application master failure: applications in YARN are retried multiple times in the event of failure; the resource manager detects the failure and starts a new application master in a new container.
Node manager failure: node managers send periodic heartbeats to the resource manager, so the resource manager detects the failure and removes the node from its list.
A node manager is blacklisted if the number of application failures on it is high.
Resource manager failure: the most serious; the resource manager recovers from crashes by using a checkpointing mechanism.
Job Scheduling
Early versions of Hadoop used a FIFO, queue-based scheduler; job priorities were added later, but without preemption, so high-priority jobs could still wait behind long-running jobs.
Later versions added:
The Fair Scheduler: gives each user a fair share of the cluster capacity. A single job gets all the nodes in the cluster; as more jobs are submitted, the free capacity is shared fairly among them. Jobs are placed in pools. It supports preemption: if a pool has not received its fair share for some time, the scheduler kills tasks in over-capacity pools and gives the slots to the under-served pool.
It is a contrib module: place its JAR on Hadoop's classpath by copying it from the contrib/fairscheduler directory, and set the scheduler property (mapred.jobtracker.taskScheduler) to org.apache.hadoop.mapred.FairScheduler.
The Capacity Scheduler: the cluster is made up of a number of queues, which may be hierarchical, and each queue has an allocated capacity. Within each queue, jobs are scheduled using FIFO scheduling (with priorities).