
UNIT-3

Understanding MR fundamentals and HBase

Topics Covered

MapReduce: 3.1.1. The MapReduce framework, 3.1.2. Techniques to optimize MR jobs, 3.1.3. Uses of MR

HBase: 3.2.1. Role of HBase in big data processing, 3.2.2. Introducing HBase architecture, 3.2.3. Storing big data in HBase, 3.2.4. HBase operations: programming with HBase, installation

Hadoop: 3.3.1. Storing data in Hadoop, 3.3.2. Introduction to HDFS architecture, 3.3.3. HDFS file system types and commands, 3.3.4. The org.apache.hadoop.io package, 3.3.5. HDFS high availability, 3.3.6. Interacting with the Hadoop ecosystem, combining HBase and HDFS

Map Reduce framework

1. MapReduce (MR) is a software framework that helps process massive amounts of data in parallel.
2. In MR, the input data set is split into independent chunks.
3. MR involves two tasks: the Map task and the Reduce task.
4. The Map task processes the independent chunks in parallel and converts the input data into key/value pairs.
The Reduce task combines the outputs of the mappers and produces a reduced data set.
5. The output of the mappers is automatically shuffled and sorted by the framework and stored
as intermediate data on the local disk of that server.
6. The MR framework sorts the output of the mappers based on keys.
7. The sorted output becomes the input to the Reduce task.
8. The Reduce task combines the outputs of the various mappers and produces a reduced output.
9. The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks.
10. For a given job, the inputs and outputs are stored in a file system (here, HDFS).
11. HDFS and the MR framework run on the same set of nodes.
12. The paradigm shift is that tasks are scheduled on the nodes where the data is present:
from the data-to-compute model to the compute-to-data model,
i.e. data processing is co-located with data storage (data locality). This achieves high throughput.

MR daemons
• There are two daemons associated with MR:
- 1. Job tracker: a master daemon; a single job tracker runs on the master node per cluster
- 2. Task trackers: one slave task tracker runs on each node

Job tracker:
• Responsible for scheduling tasks to the task trackers, monitoring the tasks, and re-executing
a task if its task tracker fails.
• It provides connectivity between Hadoop and our MR application.
• The MR functions and the input/output locations are specified by our MR application program
(the job configuration).
• In Hadoop, the job client submits the job (jar/executable, etc.) to the job tracker.
• The job tracker creates the execution plan and decides which task to assign to which node.
• The job tracker monitors the tasks; if a task fails, it will automatically reschedule the task on a different
node after a predetermined number of tries.

Task trackers
• This daemon is present on every node and is responsible for executing the tasks assigned to it by
the job tracker of the cluster.
• There is a single task tracker per slave node, and it spawns multiple JVMs to handle multiple
map or reduce tasks in parallel.
• The task tracker continuously sends progress (heartbeat) messages to the job tracker.

Map Reduce features


Simplicity: programmers can easily design parallel and distributed applications.
Manageability: data and computation are allocated to the same slave (data) node, so there is no need
to move data around for computation.
Scalability: data nodes can be added to handle larger jobs with minimal loss of performance.
Fault tolerance: any node with a hardware failure can be removed and a new node installed in its place.
Reliability: tasks that were in progress or failed on a faulty node are re-run on other nodes.

Map Reduce framework


• Job Tracker is the master node (runs with the NameNode)
• Receives the user's job
• Decides how many tasks will run (number of mappers)
• Decides where to run each mapper (concept of locality)

• Task Tracker is the slave node (runs on each data node)
• Receives the task from the Job Tracker
• Runs the task until completion (either map or reduce task)
• Is always in communication with the Job Tracker, reporting progress


Applications
- MapReduce is used in machine learning,
graph processing, and
multi-core programming

MR programming
• Writing an MR program requires three things (a minimal word-count sketch in Java follows below):
• 1. Driver class: specifies the job configuration details
• 2. Mapper class: overrides the map function based on the problem statement
• 3. Reducer class: overrides the reduce function based on the problem statement
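As a hedged illustration of these three pieces, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names (WordCount, TokenMapper, SumReducer) are chosen for this sketch only; input and output paths are taken from the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class: emits (word, 1) for every word in the input line
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer class: sums all counts emitted for the same word
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: job configuration (mapper/reducer classes, key/value types, input/output paths)
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would typically be packaged as a jar and submitted with the hadoop command, passing an HDFS input directory and a non-existent output directory as the two arguments.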

Implementations of MR
• Many implementations of MR have been developed in different languages for different purposes.
1. Hadoop: the most popular open-source implementation, developed at Yahoo!, which
runs on top of HDFS. It is now used by Facebook, Amazon, and others.
- This implementation processes hundreds of terabytes of data on at least 10,000 cores.
2. Google's implementation: runs on top of the Google File System (GFS). Within GFS, data is
loaded, partitioned into chunks, and each chunk is replicated.
- It processes about 20 petabytes per day.

MR programming model

• The map and reduce abstractions are borrowed from functional languages such as Lisp.

• The map function, written by the user, processes a key/value pair to generate a list of intermediate
key/value pairs:
map(key1, value1) -> list(key2, value2)

• The reduce function, also written by the user, merges all intermediate values associated with a
particular intermediate key:
reduce(key2, list(value2)) -> list(value2)
• Finally, the key/value pairs are reduced, with one reduce call per unique key in the sorted list;
for example, in a counting job the reduce function sums all the counts emitted for a particular key.
A small worked trace of this flow follows below.
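As a worked illustration of this data flow (the records here are made up), consider a color-count job over the three input values red, blue, red:

map phase emits:        (red, 1), (blue, 1), (red, 1)
shuffle/sort groups:    (blue, [1]), (red, [1, 1])
reduce phase emits:     (blue, 1), (red, 2)

Each reduce call receives one key together with the list of all intermediate values for that key, and emits the summed count for that key.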

Example 1: Color Count (figure)

Example 2: Color Count (figure)

Example 3: Color Filter (figure)

Example 4: Word Count (figure)

Introduction to Hadoop, MR and HBase

• Since the 1970s, the RDBMS has been the standard solution for data storage and maintenance problems.
• After the advent of big data, companies realized the benefit of processing big data and started
opting for solutions like Hadoop.
• Hadoop uses the distributed file system HDFS for storing big data, and MapReduce to process it.
• Hadoop excels at storing and processing huge volumes of data in various formats:
structured, semi-structured, or even unstructured.

HBase
HBase is a distributed, column-oriented database built on top of the Hadoop Distributed File System (HDFS).

Storing big data in HBase
• HBase is a column-oriented database and the tables in it are sorted by row.
• The table schema defines only column families, which hold the key-value pairs.
• A table can have multiple column families, and each column family can have any number of
columns.
• Subsequent column values are stored contiguously on disk. Each cell value of the table
has a timestamp.
• In short, in HBase:
• A table is a collection of rows.
• A row is a collection of column families.
• A column family is a collection of columns.
• A column is a collection of key-value pairs.
(A short Java sketch of this data model follows below.)
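As a hedged illustration (not part of the original notes), the following Java sketch uses the HBase client API (HBase 1.0+ style) to write and read one row. The table name "employee", the column family "personal", and the cell values are assumptions for this example; the table is assumed to already exist with that column family (a creation sketch appears after the installation notes below).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDataModelDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Row key "emp1", column family "personal", column qualifiers "name" and "city"
            Put put = new Put(Bytes.toBytes("emp1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                          Bytes.toBytes("Asma"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"),
                          Bytes.toBytes("Hyderabad"));
            table.put(put);

            // Read the row back; each stored cell carries an implicit timestamp
            Result result = table.get(new Get(Bytes.toBytes("emp1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}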

HBase operations: programming with HBase, and installation
• Installing HBase:
• We can install HBase in any of three modes: standalone mode, pseudo-distributed mode,
and fully distributed mode.
• Installing HBase in standalone mode:

• Download the latest stable version of HBase from http://www.interior-dsgn.com/apache/hbase/stable/
using the wget command, and extract it using the tar zxvf command.
• Before proceeding with HBase, you have to edit the following files to configure HBase:
• hbase-env.sh
• hbase-site.xml
(A small Java sketch of basic HBase operations follows below.)
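Once HBase is running, tables can also be created and read programmatically. The following sketch uses the HBase 2.x Java client API; the table name "employee" and column family "personal" are assumptions for this example and match the data-model sketch above.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseAdminDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableName name = TableName.valueOf("employee");   // example table name

            // Create the table with one column family if it does not exist yet
            if (!admin.tableExists(name)) {
                admin.createTable(TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                        .build());
            }

            // Scan the table and print every row key
            try (Table table = conn.getTable(name);
                 ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}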

Hadoop

• Problem 1: storing exponentially growing datasets

• Solution: the Hadoop Distributed File System (HDFS) divides input data files into chunks and
stores them across the cluster.
• Problem 2: storing unstructured data
• Solution: Hadoop allows storing of unstructured, semi-structured and structured data.
• It follows WORM (Write Once, Read Many).
• No schema validation is required while dumping data.
• It is designed to run on clusters of commodity machines and is scalable as per requirements.
• Problem 3: processing the data faster
• Solution: MapReduce in Hadoop
provides parallel processing of the data present in HDFS.
Each data node processes the portion of the data stored within that node.
(A small Java sketch of storing and reading a file in HDFS follows below.)
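A minimal sketch of writing and then reading a file in HDFS through the Java FileSystem API (org.apache.hadoop.fs) is shown below. The path /user/demo/sample.txt is made up for this example, and the Configuration is assumed to pick up the cluster's core-site.xml from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);           // handle to HDFS (or the local FS if not configured)

        Path path = new Path("/user/demo/sample.txt");  // hypothetical path used for this sketch

        // Write once: HDFS follows the write-once-read-many model
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read many: any client can open and read the file
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}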

Why is Hadoop able to compete with a conventional DBMS?

What is the Hadoop architecture?

• Hadoop is a framework that runs on clusters. Each cluster has two main layers:
• HDFS layer: the Hadoop Distributed File System layer, consisting of one name node and
multiple data nodes
• MapReduce layer: the execution engine layer, consisting of one job tracker and multiple
task trackers

Hadoop was originally developed at Yahoo!

Main components of Hadoop

• HDFS (Hadoop Distributed File System): for big data storage in a distributed environment;
allows dumping of any kind of data across the cluster
• MapReduce: for faster data processing; allows parallel processing of data stored in HDFS,
with processing done at the data nodes instead of moving the data to a central processor
• Hadoop is an Apache project built and used by a community of contributors
• Premier web players such as Google, Yahoo!, Microsoft and Facebook use it as the engine to power the cloud
• The project is a collection of various subprojects:
Apache Hadoop Common, Avro, Chukwa, HBase, HDFS, Hive, MapReduce, Pig, ZooKeeper

Hadoop ecosystem (overview of tools)


• Sqoop and Flume: to ingest data into HDFS
• HDFS: distributed file system that allows storage of all three types of data
• YARN (Yet Another Resource Negotiator): the brain of Hadoop; allocates resources, schedules
tasks, and manages the processing activities
• Pig: a platform used to analyze large data sets by representing them as data flows; introduced by
Yahoo!; uses the Pig Latin language
• Hive: a data warehousing tool that allows us to perform big data analytics using the Hive
Query Language, which is similar to SQL; introduced by Facebook
• MapReduce: written in Java; provides parallel processing of data sets
• HBase: a NoSQL database on top of HDFS that enables us to store unstructured and semi-
structured data with ease and provides real-time read/write access
• Apache Spark: an in-memory data processing engine that allows efficient execution of
streaming, machine learning, or SQL workloads that require fast iterative access to datasets
Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave, shared-nothing architecture

Design Principles of Hadoop


• Need to process big data
• Need to parallelize computation across thousands of nodes
• Support commodity hardware
– A large number of low-end, cheap machines working in parallel to
solve a computing problem
– This is in contrast to conventional DBMSs, where a small number of high-
end, expensive machines are used
• Automatic parallelization and distribution
– Hidden from the end user
• Fault tolerance and automatic recovery
– Nodes and tasks will fail and will recover automatically
• Clean and simple programming abstraction
– Users only provide two functions, "map" and "reduce"

Hadoop: How it Works


Hadoop Distributed File System (HDFS)

Limitations of Hadoop
• Hadoop can perform only batch processing, and data will be accessed only in a sequential
manner. That means one has to search the entire dataset even for the simplest of jobs.
• A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of data in a
single unit of time (random access).
• Hadoop random access databases:
• Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the
databases that store huge amounts of data and access the data in a random manner.

Hadoop vs. Other Systems

• Cloud Computing
• A computing model in which the computing infrastructure is hosted remotely and consumed as a service ("on the cloud")
• Hardware and software are provided as remote services
• Elastic: grows and shrinks based on the user's demand
• Example: Amazon EC2

Combining HBase with HDFS

