Big Data Analytics
C. Ranichandra & N.C. Senthilkumar
NoSQL
Not Only SQL: databases that are non-relational, distributed, open-source and horizontally scalable.
More than 255 NoSQL databases exist, categorized as:
Column store / column families: HBase, Accumulo, IBM Informix
Document store: Azure DocumentDB, MongoDB, IBM Cloudant
Key-value / tuple store: DynamoDB, Azure Table Storage, Oracle NoSQL DB
Graph databases: AllegroGraph, Neo4j, OrientDB
NoSQL
Since the 1970s, the RDBMS had been the only widely used solution for data storage, manipulation and maintenance.
Once data changed along every dimension (the Vs), companies needed new solutions for processing big data.
One solution is Hadoop, but it offers only sequential access to data.
HBase
HBase is a distributed, column-oriented database built on top of HDFS.
It is modelled on Google's Bigtable and provides random access to structured data.
As part of the Hadoop ecosystem, it provides random read/write access to data stored in HDFS.
Random R/W
HDFS and HBase
HDFS                                  HBase
Distributed file system               Database built on HDFS
High-latency batch processing         Low-latency access to single rows
Only sequential access to data        Random access via hash index
Storage Mechanism in HBase
Column-oriented storage.
The table schema defines only column families; within a family, each column is stored as a key-value pair.

Row id    Column family 1           Column family 2
          col1   col2   col3        col1   col2   col3
HBase and RDBMS
HBase                                              RDBMS
Schema-less                                        Schema-oriented
Built for wide tables, horizontally scalable       Thin, built for small tables, hard to scale
No transactions; suitable for OLAP                 Transactional
Denormalized data                                  Normalized data
Good for semi-structured and structured data       Good for structured data
Applications of HBase
Write-heavy applications.
Applications that need random access to data.
Facebook, Twitter, Yahoo and Adobe use HBase internally.
HBase Architecture
Tables are split into regions, each served by a region server.
Regions are divided into stores, whose data is stored as files in HDFS.
HBase Shell Commands
General:
status
version
table_help
whoami
DDL:
create
list
disable
is_disabled
enable
is_enabled
describe
alter
exists
drop_all - drops the tables matching a regex
HBase Shell Commands
DML:
put - put a cell value
get - get a row or cell
delete - delete a cell value
deleteall - delete all the cells in a row
scan - scan and return table values
count - count the number of rows in a table
truncate - disable, drop and recreate a specified table
DDL
create '<table name>', '<column family>'
list
disable '<table name>'
describe '<table name>'
drop '<table name>'
drop_all '<regex>'
alter '<table name>', NAME => '<column family>', VERSIONS => 5
alter '<table name>', NAME => '<column family>', METHOD => 'delete'
DML Commands
put '<table name>', '<row key>', '<column family:column>', '<value>'
scan '<table name>'
get '<table name>', '<row key>', {COLUMN => '<column family:column>'}
delete '<table name>', '<row key>', '<column family:column>'
deleteall '<table name>', '<row key>'
truncate '<table name>'
Example
DDL+DML Commands
create 'emp', 'personal data', 'professional data'
put 'emp', '1', 'personal data:name', 'raju'
put 'emp', '1', 'personal data:city', 'hyderabad'
put 'emp', '1', 'professional data:designation', 'manager'
put 'emp', '1', 'professional data:salary', '50000'
scan 'emp'
DDL+DML Commands
get 'emp', '1'
get 'emp', '1', {COLUMN => 'personal data:name'}
delete 'emp', '1', 'personal data:city'
deleteall 'emp', '1'
put 'emp', '1', 'personal data:city', 'Delhi'
count 'emp'
truncate 'emp'
disable 'emp'
drop 'emp'
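The same DDL and DML can also be performed through the HBase Java client API. The sketch below is not part of the original slides: it assumes an HBase 2.x client library on the classpath and a reachable cluster, and it mirrors the 'emp' shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class EmpTableDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();          // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // DDL: create 'emp' with two column families
            TableName emp = TableName.valueOf("emp");
            admin.createTable(TableDescriptorBuilder.newBuilder(emp)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal data"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional data"))
                    .build());
            // DML: put a cell value, get a row, scan the table
            try (Table table = conn.getTable(emp)) {
                Put p = new Put(Bytes.toBytes("1"));
                p.addColumn(Bytes.toBytes("personal data"),
                            Bytes.toBytes("name"), Bytes.toBytes("raju"));
                table.put(p);                                       // like: put 'emp','1','personal data:name','raju'
                Result row = table.get(new Get(Bytes.toBytes("1"))); // like: get 'emp', '1'
                System.out.println(Bytes.toString(
                        row.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"))));
                try (ResultScanner scanner = table.getScanner(new Scan())) { // like: scan 'emp'
                    for (Result r : scanner) System.out.println(r);
                }
            }
        }
    }
}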
Exercise: MBA Admissions
Row key: Application no
Column family Personal Data: Name, Gender, Address
Column family Academic details: UG qualification, University/College, Overall percentage
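One possible starting point for this exercise, as a sketch in the same Java client style as above. The table name 'admissions', the sample row key and the sample values are illustrative choices, not prescribed by the slides.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AdmissionsTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Row key = application no; two column families as in the exercise layout
            TableName admissions = TableName.valueOf("admissions");
            admin.createTable(TableDescriptorBuilder.newBuilder(admissions)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal data"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("academic details"))
                    .build());
            try (Table table = conn.getTable(admissions)) {
                Put p = new Put(Bytes.toBytes("A001"));             // application no as the row key
                p.addColumn(Bytes.toBytes("personal data"),
                            Bytes.toBytes("name"), Bytes.toBytes("asha"));
                p.addColumn(Bytes.toBytes("academic details"),
                            Bytes.toBytes("overall percentage"), Bytes.toBytes("78"));
                table.put(p);
            }
        }
    }
}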
Batch Processing
Batch processing is the execution of a series of jobs in a program on a computer without manual intervention (non-interactive).
Strictly speaking, it is a processing mode: the execution of a series of programs, each on a set or "batch" of inputs, rather than on a single input (which would instead be a custom job).
Batch Processing
Batch processing is very efficient for high-volume data: data is collected, entered into the system and processed, and the results are produced in batches.
The time taken for processing is not a major concern.
Batch jobs are configured to run without manual intervention and are run against the entire dataset at scale, producing output in the form of computational analyses and data files.
Depending on the size of the data being processed and the computational power of the system, output can be delayed significantly.
MapReduce
MapReduce is a programming paradigm that runs in the background of Hadoop to provide scalability and easy data-processing solutions.
The Map task takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output of the Map task as input and combines those tuples into a smaller set of tuples.
Phases
Input Phase: a record reader translates each record in the input file and sends the parsed data to the mapper as key-value pairs.
Map: a user-defined function that takes a series of key-value pairs and processes each of them to generate zero or more key-value pairs.
Intermediate Keys: the key-value pairs generated by the mapper are known as intermediate keys.
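As an illustration of the Map phase, here is a minimal mapper sketch for the word-count example that appears a few slides later. It uses the org.apache.hadoop.mapreduce API; the class name TokenizerMapper is our own choice, not from the slides. For each input line it emits one (word, 1) intermediate pair per token.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text) from the record reader.
// Output: zero or more intermediate (word, 1) pairs per line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}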
Phases
Combiner: a type of local reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not part of the core MapReduce algorithm and is optional.
Shuffle and Sort: the Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine where the reducer is running. The individual key-value pairs are sorted by key into a larger data list, which groups equivalent keys together so that their values can be iterated easily in the Reducer task.
Phases
Reducer: takes the grouped key-value data as input and runs a reducer function on each group. Here the data can be aggregated, filtered and combined in a number of ways, which may require a wide range of processing. Once execution is over, it emits zero or more key-value pairs to the final step.
Output Phase: an output formatter translates the final key-value pairs from the reducer function and writes them to a file using a record writer.
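The matching reducer sketch for the word-count example (again our own class name, not from the slides): shuffle and sort hands it each word together with the list of 1s emitted for it, and it writes one (word, total) pair to the output phase. Because summing is associative and commutative, the same class can also serve as the optional combiner.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a word and the list of counts grouped for it by shuffle-and-sort.
// Output: one (word, total) pair per distinct word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();       // aggregate the values for this key
        }
        result.set(sum);
        context.write(key, result); // final key-value pair for the output phase
    }
}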
Word Count Example
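To complete the word-count example, here is a driver sketch (not from the slides) that wires together the TokenizerMapper and IntSumReducer classes sketched above; the job name, reducer count and command-line paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // map phase: emits (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);   // optional local reducer run on map output
        job.setReducerClass(IntSumReducer.class);    // reduce phase: sums the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                    // number of reducers (mapred.reduce.tasks)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}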
Anatomy of MapReduce
MapReduce 1 (classic)
MapReduce 2 (YARN)
hadoop> start-dfs.sh
  starts the namenode, datanodes and secondary namenode
hadoop> jps
  lists process IDs: NameNode and SecondaryNameNode on the master, DataNode on the slaves
hadoop> start-yarn.sh
  starts the resource manager on the master and a node manager on each slave
hadoop> jps
  lists process IDs: ResourceManager on the master, NodeManager on the slaves
Execute wordcount
hadoop> hadoop jar wordcount.jar -input /usr/local/hadoop/input/4800-4.txt -output /usr/local/hadoop/output
Classic MapReduce Framework
Four entities:
The client, which submits the job.
The jobtracker, which coordinates the job (a Java application whose main class is JobTracker).
The tasktrackers, which run the tasks the job has been split into (main class TaskTracker).
The distributed file system, used for sharing job files between the entities.
Job Submission
Asks the jobtracker for a new job ID.
Checks the output directory (error if it is not specified or already exists).
Computes the input splits (error if the input path does not exist).
Copies the resources needed to run the job (the job JAR file, the configuration file and the computed input splits) to the jobtracker's filesystem.
Tells the jobtracker that the job is ready for execution by calling submitJob().
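A sketch of what the client side of these steps looks like with the classic (org.apache.hadoop.mapred) API. This is not from the slides; for brevity it uses the default identity mapper and reducer, so it only demonstrates the submission path rather than a useful computation.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClassicSubmit {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ClassicSubmit.class);   // job configuration; JAR located via the class
        conf.setJobName("identity-passthrough");
        // Defaults apply: IdentityMapper and IdentityReducer; with TextInputFormat the
        // records are (LongWritable offset, Text line), passed through unchanged.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input splits are computed from here
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not exist yet
        // runJob() asks the jobtracker for a job ID, checks the output directory, computes the
        // splits, copies the JAR, configuration and splits to the jobtracker's filesystem, and
        // then calls submitJob(), i.e. the sequence described on this slide.
        JobClient.runJob(conf);
    }
}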
Job Initialization
The jobtracker puts the job in an internal queue.
The job scheduler picks up the job and initializes it by creating an object that encapsulates its tasks and bookkeeping information.
One map task is created for each split.
The number of reduce tasks is given by the mapred.reduce.tasks property.
In addition, a setup task and a cleanup task are created, run by tasktrackers.
Task Assignment
Tasktrackers run a simple loop that periodically sends a heartbeat to the jobtracker.
The heartbeat indicates whether the tasktracker is ready to run a task.
Each tasktracker has a fixed number of slots for map tasks and reduce tasks (default: 2 each).
For a map task, the jobtracker chooses a tasktracker close to the input split (data-local or rack-local).
For a reduce task, it simply takes the next one in its list.
Task Execution and Job Completion
The job JAR and other supporting files are copied (localized) from the shared filesystem.
The tasktracker creates an instance of TaskRunner to run the task.
When the jobtracker is notified that the last task (the job cleanup task) is complete, it marks the job as successful.
YARN (MapReduce 2)
Designed for large clusters (4000 nodes and beyond).
YARN: Yet Another Resource Negotiator.
Remedies the scalability limits of classic MapReduce by splitting the role of the jobtracker between a resource manager and an application master.
Entities
Client: submits the job.
Resource manager: coordinates the allocation of compute resources on the cluster.
Node managers: monitor the machines in the cluster.
Application master: coordinates the tasks running the MapReduce job.
HDFS: used for sharing job files between the entities.
Failures
In classic MapReduce, the failure modes are: failure of the running task, of the tasktracker, and of the jobtracker.
Task failure: a map or reduce task fails, for example due to a runtime exception.
Tasktracker failure: a tasktracker fails by crashing or running slowly; the jobtracker notices via missing heartbeats and removes it from its pool. Any job that is incomplete or in progress on it is scheduled again on other tasktrackers, since the intermediate results (intermediate keys) may exist only on the failed node's local filesystem.
A tasktracker is blacklisted by the jobtracker if more than four tasks from the same job fail on it.
Jobtracker failure: the most serious mode and a single point of failure; classic Hadoop has no mechanism for dealing with jobtracker failure.
Failures in YARN
Modes: task, application master, node manager, resource manager.
Task failure: handled the same way as in classic MapReduce.
Application master failure: applications in YARN are retried multiple times in the event of failure; the resource manager detects the failure and starts a new application master in a new container.
Node manager failure: node managers send periodic heartbeats to the resource manager, so the resource manager detects the failure and removes the node from its list.
A node manager is blacklisted if the number of application failures on it is high.
Resource manager failure: the most serious; the resource manager recovers from crashes by using a checkpointing mechanism.
Job Scheduling
Early versions of Hadoop used a FIFO, queue-based scheduler; job priorities were added later, but without preemption, so high-priority jobs could still wait behind long-running jobs.
Later versions added:
The Fair Scheduler: gives each user a fair share of the cluster capacity. A single job gets all the nodes in the cluster; as more jobs are submitted, the free capacity is shared fairly among them. Jobs are placed in pools. It supports preemption: if a pool has not received its fair share for some time, the scheduler kills tasks in over-capacity pools and gives the slots to the under-served pool.
It is a contrib module: place its JAR on Hadoop's classpath by copying it from the contrib/fairscheduler directory, and set the scheduler property (mapred.jobtracker.taskScheduler) to org.apache.hadoop.mapred.FairScheduler.
The Capacity Scheduler: the cluster is made up of a number of queues, which may be hierarchical, and each queue has an allocated capacity. Within each queue, jobs are scheduled using FIFO scheduling (with priorities).