BDA - Unit 3
Topics Covered
Hadoop: 3.3.1 Storing data in Hadoop, 3.3.2 Introduction to HDFS architecture, 3.3.3 HDFS file system types and commands, 3.3.4 The org.apache.hadoop.io package, 3.3.5 HDFS high availability, 3.3.6 Interacting with the Hadoop ecosystem
1. MapReduce (MR) programming is a software framework which helps to process massive amounts of data in parallel.
2. In MR the input data set is split into independent chunks.
3. MR involves two tasks: Map task and Reduce task
4. The Map task processes the independent chunks in a parallel manner; it converts the input data into key-value pairs. The Reduce task combines the outputs of the mappers and produces a reduced data set (see the Mapper sketch after this list).
5. The output of the Mappers is automatically shuffled and sorted by the framework and stored as intermediate data on the local disk of that server.
6. The MR framework sorts the output of the mappers based on keys.
7. The sorted output becomes the input to the Reduce task.
8. The Reduce task combines the outputs of the various Mappers and produces a reduced output.
9. The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks.
10. For a given job, the inputs and outputs are stored in a file system (here HDFS is used).
11. HDFS and MR framework run on the same set of nodes.
12. The paradigm shift here is that tasks are scheduled on the nodes where the data is present: the model moves from data-to-compute to compute-to-data, i.e. data processing is co-located with data storage (data locality). This achieves high throughput.
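As a minimal sketch of how the Map task converts input into key-value pairs, the word-count mapper below emits (word, 1) for every word in its input split; the framework then shuffles and sorts these pairs by key before the Reduce task runs. The class name WordCountMapper is illustrative, but the Mapper API used is the standard org.apache.hadoop.mapreduce one.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: for each input line, emit (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair, stored on local disk
        }
    }
}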
MR daemons
• There are two daemons associated with MR
- 1. Job tracker: a master daemon. There is a single job tracker, on the master node, per cluster of nodes.
- 2. Task trackers: one slave task tracker for each node.
Job tracker:
• Responsible for scheduling tasks on the Task trackers, monitoring the tasks, and re-executing a task if its Task tracker fails.
• It provides connectivity between Hadoop and our MR application.
• The MR functions and the input/output locations are specified by our MR application program through the job configuration.
• In Hadoop, the job client submits the job (jar/executable, etc.) to the job tracker.
• The job tracker creates the execution plan and decides which task to assign to which node.
• The job tracker monitors the tasks; if a task fails, it automatically reschedules it on a different node after a predetermined number of tries.
Task trackers
• This daemon, present on every node, is responsible for executing the tasks assigned to it by the job tracker of the cluster.
• There is a single task tracker per slave node, and it spawns multiple JVMs to handle multiple map or reduce tasks in parallel.
• The task tracker continuously sends heartbeat messages to the job tracker.
• Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides on how many tasks will run (number of mappers)
• Decides on where to run each mapper (concept of locality)
• Task Tracker is the slave node (runs on each datanode)
• Receives the task from Job Tracker
• Runs the task until completion (either map or reduce task)
• Always in communication with the Job Tracker reporting progress
Applications
- It is used in machine learning, graphics programming, and multi-core programming.
MR programming
• Requires three things:
• 1. Driver class: it specifies the job configuration details (see the driver sketch after this list)
• 2. Mapper class: it overrides the map function based on the problem statement
• 3. Reducer class: it overrides the reduce function based on the problem statement
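A minimal driver class sketch is shown below: it sets the job configuration (mapper, reducer, output types, input/output paths in HDFS) and submits the job, after which the job client hands the packaged jar to the job tracker. WordCountMapper and WordCountReducer refer to the illustrative sketches elsewhere in this unit, and the input/output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: specifies the job configuration details and submits the job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // mapper class
        job.setReducerClass(WordCountReducer.class);   // reducer class

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait for completion
    }
}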
Implementations of MR
• Many implementations of MR have been developed in different languages for different purposes.
1. Hadoop: the most popular open-source implementation is Hadoop, developed at Yahoo, which runs on top of HDFS. It is now used by Facebook, Amazon, and others.
- This implementation processes hundreds of terabytes of data on at least 10,000 cores.
2. Google implementation: it runs on top of the Google File System (GFS). Within GFS, data is loaded, partitioned into chunks, and each chunk is replicated.
- It processes about 20 petabytes per day.
MR programming model
• The reduce function, also written by the user, merges all intermediate values associated with a particular intermediate key.
• reduce(key2, list(value2)) -> list(value2), called once for each unique key in the sorted list.
• Finally the key/value pairs are reduced, one for each unique key in the sorted list, i.e. the reduce function sums all the counts emitted for a particular key.
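The word-count reducer below is a sketch of this step: it receives one sorted key together with the list of all counts emitted for it, and writes out their sum, one output pair per unique key. The class name is illustrative and pairs with the WordCountMapper sketch above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: receives (word, [1, 1, ...]) and emits (word, total).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();          // sum all counts emitted for this key
        }
        total.set(sum);
        context.write(key, total);       // one reduced output per unique key
    }
}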
• Since the 1970s, RDBMS has been the solution for data storage and maintenance related problems.
• After the advent of big data, companies realized the benefit of processing big data and started
opting for solutions like Hadoop.
• Hadoop uses distributed file system HDFS for storing big data, and MapReduce to process it.
• Hadoop excels in storing and processing huge volumes of data in various formats: arbitrary, semi-structured, or even unstructured.
HBase
It is a distributed, column-oriented database built on top of the Hadoop Distributed File System (HDFS).
Storing big data in HBase
• HBase is a column-oriented database and the tables in it are sorted by row.
• The table schema defines only column families, which are the key-value pairs. A table can have multiple column families, and each column family can have any number of columns.
• Subsequent column values are stored contiguously on the disk. Each cell value of the table
has a timestamp.
• In short, in HBase (see the client sketch after this list):
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
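A small sketch with the HBase Java client API illustrates this model: it writes a single cell identified by row key, column family, column qualifier, and value, then reads it back. The table name "employee", the column family "personal", and the column "name" are only examples, and the sketch assumes a reachable HBase installation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write one cell into an HBase table and read it back.
public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Row "row1", column family "personal", column "name", value "Alice"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back; each stored cell value also carries a timestamp
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}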
HBase operations-programming with HBase, Installation
• Installing HBase:
• We can install HBase in any of the three modes: Standalone mode, Pseudo Distributed mode,
and Fully Distributed mode.
• Installing HBase in Standalone Mode
• Download the latest stable version of HBase from http://www.interior-dsgn.com/apache/hbase/stable/ using the "wget" command, and extract it using the "tar zxvf" command.
• Before proceeding with HBase, you have to edit the following files and configure HBase.
• hbase-env.sh
• hbase-site.xml (a sample sketch of this file follows)
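As a sketch, in standalone mode hbase-site.xml usually only needs to point HBase (and its bundled ZooKeeper) at directories on the local file system; the two paths below are illustrative and should be replaced with real directories on your machine.

<configuration>
   <!-- Directory where HBase stores its data (local file system in standalone mode) -->
   <property>
      <name>hbase.rootdir</name>
      <value>file:///home/hadoop/HBase/HFiles</value>
   </property>
   <!-- Directory where the bundled ZooKeeper keeps its data -->
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hadoop/zookeeper</value>
   </property>
</configuration>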
Hadoop
• Hadoop is a framework that runs on clusters. Each cluster has two main layers:
• HDFS layer: the Hadoop Distributed File System layer, which consists of one name node and multiple data nodes
• MapReduce layer: the execution engine layer, which consists of one job tracker and multiple task trackers
Developed by Yahoo
• HDFS (Hadoop Distributed File System): for big data storage in a distributed environment; allows dumping of any kind of data across the cluster
• MapReduce: for faster data processing; allows parallel processing of data stored in HDFS, with processing done at the data nodes instead of the data going to the processor (NameNode)
• It is an Apache project built and used by a community of contributors
• Premier web players such as Google, Yahoo, Microsoft, and Facebook use it as an engine to power the cloud
• The project is a collection of various subprojects:
Apache Hadoop Common, Avro, Chukwa, HBase, HDFS, Hive, MapReduce, Pig, ZooKeeper
Limitations of Hadoop
• Hadoop can perform only batch processing, and data will be accessed only in a sequential
manner. That means one has to search the entire dataset even for the simplest of jobs.
• A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of data in a
single unit of time (random access).
• Hadoop Random Access Databases:
• Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.
• Cloud Computing
• A computing model where any computing infrastructure can run on the cloud
• Hardware & Software are provided as remote services
• Elastic: grows and shrinks based on the user’s demand
• Example: Amazon EC2