Unit 2 - Hadoop PDF
● In Hadoop, data resides in a distributed file system called the Hadoop Distributed File System (HDFS).
● HDFS splits files into blocks and distributes them across the various nodes of a cluster.
● HDFS is based on the Google File System (GFS) and is designed to run on commodity hardware.
● Commodity hardware is cheap and widely available, which makes it possible to achieve large computational power at low cost.
● It provides high-throughput access to application data and is suitable for applications with large datasets.
▪ Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules.
▪ Hadoop YARN: This is a framework for job scheduling and cluster resource management (managing the resources of the cluster).
Hadoop HDFS -
▪ Name Node: The Name Node is the master node; it holds only metadata and therefore requires comparatively fewer resources than the Data Nodes, which store the actual data.
▪ Data Node: The Data Nodes store the actual data blocks and run on commodity hardware in the distributed environment, which undoubtedly keeps Hadoop cost effective. A small sketch of writing a file to HDFS from Java follows below.
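A minimal sketch of how a client writes a file into HDFS through Hadoop's Java FileSystem API. The NameNode address and the file path here are illustrative assumptions, not values from these notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
            // HDFS transparently splits the file into blocks and replicates them
            // across Data Nodes; the client only sees a single output stream.
            out.writeUTF("hello hdfs");
        }
    }
}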
Hadoop MapReduce -
● MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
▪ Map() performs sorting and filtering of the data, thereby organizing it into groups. Map() generates a key-value-pair result, which is later processed by the Reduce() method.
▪ Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples; the word-count sketch below illustrates the pattern.
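As an illustration of the Map()/Reduce() pattern, here is a sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class names are illustrative, and the input and output paths are taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emits a (word, 1) key-value pair for every word in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): receives all values for one key and aggregates them into a single count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}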
Hadoop YARN -
● YARN (Yet Another Resource Negotiator) is the framework responsible for job scheduling and for managing the cluster's resources, allocating them among the applications running on Hadoop.
Other components:
PIG:
● Pig was originally developed by Yahoo. It works on the Pig Latin language, which is a query-based language similar to SQL.
● It is a platform for structuring the data flow, processing and analyzing huge data sets.
● Pig does the work of executing the commands, and in the background all the MapReduce activity is taken care of. After processing, Pig stores the result in HDFS.
● The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the Java Virtual Machine (JVM).
● Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
Mahout:
● Mahout provides a library of scalable machine learning algorithms, such as classification, clustering, and collaborative filtering, that can run on top of Hadoop.
HIVE:
● With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
● It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
● Similar to other query-processing frameworks, Hive comes with two components: the Java Database Connectivity (JDBC) drivers and the Hive command line.
● JDBC, along with the ODBC drivers (which use the Open Database Connectivity interface defined by Microsoft), handles the data-storage permissions and the connection, whereas the Hive command line helps in the processing of queries. A small JDBC sketch follows below.
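A minimal sketch of querying Hive from Java through its JDBC driver. The HiveServer2 host and port, the user name, and the sales table are assumed values used only for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load Hive's JDBC driver; host, port, and table name below are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; Hive translates it into jobs that run on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}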
HBase:
● It is a NoSQL database that supports all kinds of data and is thus capable of handling almost anything within a Hadoop deployment. It provides the capabilities of Google's Bigtable, so it can work on big data sets effectively.
● At times when we need to search for or retrieve a small item from a huge database, the request must be processed within a very short span of time. In such cases HBase comes in handy, as it gives us a fault-tolerant way of storing such data and serving it with low latency; a small client sketch follows below.
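A minimal sketch of HBase's random read and write access through its Java client API. The table name, column family, and row key are illustrative assumptions, and the table is assumed to already exist:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Table, column family, and row key are illustrative; the "users" table is assumed to exist.
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: put one cell into row "user1001".
            Put put = new Put(Bytes.toBytes("user1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read: fetch the same row back with low latency.
            Result result = table.get(new Get(Bytes.toBytes("user1001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        }
    }
}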
Zookeeper:
● Coordinating and synchronizing the resources or components of Hadoop used to be a huge management problem, and it often resulted in inconsistency.
● ZooKeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance; a small client sketch follows below.
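A minimal sketch of coordination through the ZooKeeper Java client. The ensemble address and the znode path are illustrative assumptions:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to an assumed ZooKeeper ensemble address and wait for the session.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode that other components can watch in order to coordinate themselves.
        String path = zk.create("/demo-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back; a watcher registered here would be notified of any later change.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}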
Oozie:
● Oozie simply performs the task of a scheduler: it schedules jobs and binds them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs.
● Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
Ambari:
● It eliminates the need for the manual tasks that used to watch over Hadoop operations.
● It gives a simple and secure platform for provisioning, managing, and monitoring Hortonworks Data Platform (HDP) deployments.
Sqoop (SQL-to-Hadoop):
● It is a big data tool that offers the capability to extract data from non-Hadoop data
stores, transform the data into a form usable by Hadoop, and then load the data into
HDFS.
● Open Database Connectivity (ODBC) is an open standard Application Programming
Interface (API) for accessing a database.
DATA FORMAT:
● While the MapReduce programming model is at the heart of Hadoop, it is low-level and
as such becomes an unproductive way for developers to write complex analysis jobs.
● To increase developer productivity, several higher-level languages and APIs have been
created that abstract away the low-level details of the MapReduce programming model.
● There are several choices available for writing data analysis jobs.
● The Hive and Pig projects are popular choices that provide SQL-like and procedural data
flow-like languages, respectively.
● HBase is also a popular way to store and analyze data in HDFS. It is a column-oriented database and, unlike MapReduce, provides random read and write access to data with low latency (little or no delay in serving a request).
● MapReduce jobs can read and write data in HBase’s table format, but data processing is
often done via HBase’s own client API.
Scaling Up Vs Scaling Out:
● Once a decision has been made for data scaling, the specific scaling approach must be chosen.
● There are two commonly used types of data scaling :
▪ Up
▪ Out
● Scaling up, or vertical scaling :
▪ It involves obtaining a faster server with more powerful processors and more memory.
▪ This solution uses less network hardware, and consumes less power; but ultimately for
many platforms, it may only provide a short-term fix, especially if continued growth is
expected.
● Scaling out, or horizontal scaling :
▪ It involves adding servers for parallel computing.
▪ The scale-out technique is a long-term solution, as more and more servers may be added when needed. Going from one monolithic system to this type of cluster may be difficult, although it is an extremely effective solution.