BDA - Unit 3
Topics Covered
Hadoop: 3.3.1 Storing data in Hadoop, 3.3.2 Introduction to HDFS architecture, 3.3.3 HDFS file system types and commands, 3.3.4 The org.apache.hadoop.io package, 3.3.5 HDFS high availability, 3.3.6 Interacting with the Hadoop ecosystem
1. MapReduce (MR) programming is a software framework which helps to process massive amounts of data in parallel.
2. In MR the input data set is split into independent chunks.
3. MR involves two tasks: Map task and Reduce task
4. The Map task processes the independent chunks in a parallel manner; it converts the input data into key-value pairs. The Reduce task combines the outputs of the mappers and produces a reduced data set (see the Mapper sketch after this list).
5. The output of the Mappers is automatically shuffled and sorted by the framework and stored as intermediate data on the local disk of that server.
6. The MR framework sorts the output of the mappers based on keys.
7. The sorted output becomes the input to the Reduce task.
8. The Reduce task combines the outputs of the various Mappers and produces a reduced output.
9. The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks.
10. For a given job, the inputs and outputs are stored in a file system (here HDFS is used).
11. HDFS and MR framework run on the same set of nodes.
12. The paradigm shift here is that tasks are scheduled on the nodes where the data is present: the model moves from data-to-compute to compute-to-data, i.e. data processing is co-located with data storage (data locality). This achieves high throughput.
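As a minimal sketch of how the Map task converts input into key-value pairs, the word-count mapper below emits (word, 1) for every word in its input split; the framework then shuffles and sorts these pairs by key before the Reduce task runs. The class name WordCountMapper is illustrative, but the Mapper API used is the standard org.apache.hadoop.mapreduce one.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: for each input line, emit (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair, stored on local disk
        }
    }
}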
MR daemons
• There are two daemons associated with MR
- 1. Job tracker: a master daemon. There is a single job tracker, on the master node, per cluster of nodes.
- 2. Task trackers: one slave task tracker for each node.
Job tracker:
• Responsible for scheduling tasks on the Task trackers, monitoring the tasks, and re-executing a task if its Task tracker fails.
• It provides connectivity between Hadoop and our MR application.
• The MR functions and the input/output locations are specified by our MR application program through the job configuration.
• In Hadoop, the job client submits the job (jar/executable, etc.) to the job tracker.
• The job tracker creates the execution plan and decides which task to assign to which node.
• The job tracker monitors the tasks; if a task fails, it automatically reschedules it on a different node after a predetermined number of tries.
Task trackers
• This daemon, present on every node, is responsible for executing the tasks assigned to it by the job tracker of the cluster.
• There is a single task tracker per slave node, and it spawns multiple JVMs to handle multiple map or reduce tasks in parallel.
• The task tracker continuously sends heartbeat messages to the job tracker.
• Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides on how many tasks will run (number of mappers)
• Decides on where to run each mapper (concept of locality)
• Task Tracker is the slave node (runs on each datanode)
• Receives the task from Job Tracker
• Runs the task until completion (either map or reduce task)
• Always in communication with the Job Tracker reporting progress
Applications
- It is used in machine learning, graphics programming, and multi-core programming.
MR programming
• Requires three things:
• 1. Driver class: it specifies the job configuration details (see the driver sketch after this list)
• 2. Mapper class: it overrides the map function based on the problem statement
• 3. Reducer class: it overrides the reduce function based on the problem statement
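A minimal driver class sketch is shown below: it sets the job configuration (mapper, reducer, output types, input/output paths in HDFS) and submits the job, after which the job client hands the packaged jar to the job tracker. WordCountMapper and WordCountReducer refer to the illustrative sketches elsewhere in this unit, and the input/output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: specifies the job configuration details and submits the job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // mapper class
        job.setReducerClass(WordCountReducer.class);   // reducer class

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait for completion
    }
}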
Implementations of MR
• Many implementations of MR have been developed in different languages for different purposes.
1. Hadoop: the most popular open-source implementation is Hadoop, developed at Yahoo, which runs on top of HDFS. It is now used by Facebook, Amazon, and others.
- This implementation processes hundreds of terabytes of data on at least 10,000 cores.
2. Google implementation: it runs on top of the Google File System (GFS). Within GFS, data is loaded, partitioned into chunks, and each chunk is replicated.
- It processes about 20 petabytes per day.
MR programming model
• The reduce function, also written by the user, merges all intermediate values associated with a particular intermediate key.
• reduce(key2, list(value2)) -> list(value2), called once for each unique key in the sorted list.
• Finally the key/value pairs are reduced, one for each unique key in the sorted list, i.e. the reduce function sums all the counts emitted for a particular key.
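The word-count reducer below is a sketch of this step: it receives one sorted key together with the list of all counts emitted for it, and writes out their sum, one output pair per unique key. The class name is illustrative and pairs with the WordCountMapper sketch above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: receives (word, [1, 1, ...]) and emits (word, total).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();          // sum all counts emitted for this key
        }
        total.set(sum);
        context.write(key, total);       // one reduced output per unique key
    }
}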
• Since the 1970s, RDBMS has been the solution for data storage and maintenance related problems.
• After the advent of big data, companies realized the benefit of processing big data and started
opting for solutions like Hadoop.
• Hadoop uses distributed file system HDFS for storing big data, and MapReduce to process it.
• Hadoop excels in storing and processing huge volumes of data in various formats: arbitrary, semi-structured, or even unstructured.
HBase
It is a distributed, column-oriented database built on top of the Hadoop Distributed File System (HDFS).
Storing big data in HBase
• HBase is a column-oriented database and the tables in it are sorted by row.
• The table schema defines only column families, which are the key-value pairs. A table can have multiple column families, and each column family can have any number of columns.
• Subsequent column values are stored contiguously on the disk. Each cell value of the table
has a timestamp.
• In short, in HBase (see the client sketch after this list):
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
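A small sketch with the HBase Java client API illustrates this model: it writes a single cell identified by row key, column family, column qualifier, and value, then reads it back. The table name "employee", the column family "personal", and the column "name" are only examples, and the sketch assumes a reachable HBase installation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write one cell into an HBase table and read it back.
public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Row "row1", column family "personal", column "name", value "Alice"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back; each stored cell value also carries a timestamp
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}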
HBase operations-programming with HBase, Installation
• Installing HBase:
• We can install HBase in any of the three modes: Standalone mode, Pseudo Distributed mode,
and Fully Distributed mode.
• Installing HBase in Standalone Mode
• Download the latest stable version of HBase from http://www.interior-dsgn.com/apache/hbase/stable/ using the "wget" command, and extract it using the "tar zxvf" command.
• Before proceeding with HBase, you have to edit the following files and configure HBase.
• hbase-env.sh
• hbase-site.xml (a sample sketch of this file follows)
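As a sketch, in standalone mode hbase-site.xml usually only needs to point HBase (and its bundled ZooKeeper) at directories on the local file system; the two paths below are illustrative and should be replaced with real directories on your machine.

<configuration>
   <!-- Directory where HBase stores its data (local file system in standalone mode) -->
   <property>
      <name>hbase.rootdir</name>
      <value>file:///home/hadoop/HBase/HFiles</value>
   </property>
   <!-- Directory where the bundled ZooKeeper keeps its data -->
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hadoop/zookeeper</value>
   </property>
</configuration>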
Hadoop
• Hadoop is a framework that runs on clusters. Each cluster has two main layers:
• HDFS layer: the Hadoop Distributed File System layer, which consists of one name node and multiple data nodes
• MapReduce layer: the execution engine layer, which consists of one job tracker and multiple task trackers
Developed by Yahoo
• HDFS (Hadoop Distributed File System): for big data storage in a distributed environment; allows dumping of any kind of data across the cluster
• MapReduce: for faster data processing; allows parallel processing of data stored in HDFS, with processing done at the data nodes instead of the data going to the processor (NameNode)
• It is an Apache project built and used by a community of contributors
• Premier web players such as Google, Yahoo, Microsoft, and Facebook use it as an engine to power the cloud
• The project is a collection of various subprojects:
Apache Hadoop Common, Avro, Chukwa, HBase, HDFS, Hive, MapReduce, Pig, ZooKeeper
Limitations of Hadoop
• Hadoop can perform only batch processing, and data will be accessed only in a sequential
manner. That means one has to search the entire dataset even for the simplest of jobs.
• A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of data in a
single unit of time (random access).
• Hadoop Random Access Databases:
• Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.
• Cloud Computing
• A computing model where any computing infrastructure can run on the cloud
• Hardware & Software are provided as remote services
• Elastic: grows and shrinks based on the user’s demand
• Example: Amazon EC2