
BIG DATA ANALYTICS (18CS72)

MODULE 02

Introduction to Hadoop
 Hadoop is an open-source software framework.

 It is a scalable and parallel computing platform to handle large amounts of data.

 Distributed data storage systems do not use the concept of JOINs.

 Data needs to be fault tolerant.

 Big Data stores follow the CAP theorem.

Big Data Store model

 Data stored in the file system consists of data blocks (physical divisions of large data).

 Data blocks are distributed across multiple nodes (DataNodes).

 DataNodes reside in the racks of a cluster.

 Each rack has multiple DataNodes (data servers).

 The Hadoop system uses this data store model.

 Data blocks are replicated across DataNodes so that data is not lost if any node fails.

Big data programming model

 The Hadoop system uses the Big Data programming model.

 In the Big Data programming model, application jobs and tasks are scheduled on the same
servers that store the data to be processed.

 Job means running an assignment of a set of instructions for processing. For example,
processing the queries in an application and sending the results back to the application is a job.

 Job scheduling means assigning a job for processing.

Key Terms

Cluster Computing

 Refers to computing, storing and analyzing huge amounts of unstructured or structured
data in a distributed computing environment.

 Each cluster is formed of loosely or tightly connected computing nodes that work together.

 Improves performance, cost-effectiveness and accessibility.

Data Flow

 Flow of data from one node to another node.

Data consistency

 All copies of a data block should have the same value.

Data availability

 At least one copy of the data should be available if a partition becomes inactive.

Resources

 Availability of physical/virtual components or devices.

Resource Management

 Managing resources such as creation, deletion or the manipulation of resource data.

Horizontal Scalability

 Increasing the number of systems working in coherence.

 Example: MPPs (massively parallel processing systems).

Vertical Scalability

 Increasing the number of tasks in the system, such as reporting, Business Processing (BP)
and Business Intelligence (BI) tasks.

Ecosystem

 Made up of multiple computing components which work together.

Hadoop and its Ecosystem


 Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems.

 It includes Apache projects and various commercial tools and solutions.

 Most of the tools or solutions are used to supplement or support the core elements of
Hadoop. All these tools work collectively to provide services such as absorption, analysis,
storage and maintenance of data etc.

Hadoop Core Components

 Above diagram shows the Core components of Hadoop.

1. Hadoop Common

 The Common module contains the libraries and utilities that are required by the other
modules of Hadoop.

 Hadoop Common provides various components and interfaces for the distributed file system
and general input/output. This includes serialization and file-based data structures (a minimal
sketch appears after this component list).

2. Hadoop Distributed File System(HDFS)

 A Java-based distributed file system which can store all kinds of data on the disks of the
clusters.

3. Map Reduce v1

 A software programming model in Hadoop 1 using Mapper and Reducer. MapReduce v1
processes large data sets in parallel and in batches.

4. YARN

 Software for managing resources for computing.

 The user application tasks or sub-tasks run in parallel in Hadoop. YARN uses scheduling
and handles the requests for the resources in the distributed running of the tasks.

5. Map Reduce v2

 Hadoop 2 YARN-based system for parallel processing of the application tasks.
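
As a brief illustration of the Hadoop Common input/output interfaces mentioned in component 1,
the following is a minimal sketch (not part of the original notes) that writes and reads Writable
key-value pairs through a Hadoop SequenceFile, one of the file-based data structures provided by
Hadoop Common. The file name pairs.seq and the stored pairs are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // default (local) configuration
        Path path = new Path("pairs.seq");          // illustrative file name

        // Write Writable key-value pairs using Hadoop Common serialization
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("hadoop"), new IntWritable(1));
            writer.append(new Text("hdfs"), new IntWritable(2));
        }

        // Read the pairs back in the order they were written
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value);
            }
        }
    }
}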

Spark
 Open-source framework
 Cluster computing framework
 Provides in-memory analytics
 Enables OLAP and real-time processing
 Adopted by companies such as Amazon, eBay, and Yahoo.
Features of Hadoop
 Fault-efficient, scalable, flexible and modular design
 Hadoop uses a simple and modular programming model.
 The system provides servers with high scalability. The system is scalable by adding new
nodes to handle large data.

 Hadoop proves very helpful in storing, managing, processing and analyzing Big Data.
 Modular functions make the system flexible.
 One can add or replace components with ease.
 Robust design of HDFS
 Execution of Big Data applications continues even when an individual server or cluster
fails, because Hadoop provisions backup and recovery mechanisms.

 HDFS has high reliability.


 Store and process Big Data

 Processes Big Data with 4V characteristics (Volume, Variety, Velocity, Veracity).


 Distributed cluster computing model with data locality

 Processes Big Data at high speed because the application tasks and sub-tasks are submitted to the
DataNodes that store the data.

 One can achieve more computing power by increasing the number of computing nodes.

 The processing splits across multiple DataNodes, which enables fast processing and
aggregated results.

 Hardware fault-tolerant

 A fault does not affect data and application processing.

 If a node goes down, the other nodes take care of the residue.

 This is due to multiple copies of all data blocks which replicate automatically.

 Open source framework

 Open source access and cloud services enable large data store.

 Hadoop uses a cluster of multiple inexpensive servers or the cloud.

 Java and Linux Based

 Hadoop uses Java interfaces.

 The Hadoop base is Linux, but it has its own set of shell commands.

 Hadoop provides various components and interfaces for distributed file system and general
input/output.

 HDFS is basically designed more for batch processing.

 YARN provides a platform for many different modes of data processing, from traditional
batch processing to processing of the applications such as interactive queries, text analytics
and streaming analytics.

Hadoop Ecosystem Components


 Hadoop ecosystem refers to a combination of technologies.

 The Hadoop ecosystem consists of its own family of applications which tie up together with
Hadoop.

 The system components support the storage, processing, access, analysis, governance,
security and operations for Big Data.

 The system enables the applications which run Big Data and deploy HDFS.

 The data store system consists of clusters, racks, DataNodes and blocks.

 Hadoop deploys application programming models, such as MapReduce and HBase. YARN
manages resources and schedules sub-tasks of the application.

 Below figure shows the Hadoop core components HDFS, MapReduce and YARN along
with the ecosystem. The ecosystem includes the application support layer and application layer
components.

 The components are AVRO, ZooKeeper, Pig, Hive, Sqoop, Ambari, Mahout, Spark, Flink and
Flume.

 The four layers in the figure are as follows:

1. Distributed Storage Layer

2. Resource-manager layer for job or application sub-tasks scheduling and execution.

3. Processing-framework layer, consisting of Mapper and Reducer for the MapReduce
process flow.

4. APIs at the application support layer. Application code communicates with these APIs.



 AVRO enables data serialization between the layers.

 ZooKeeper enables coordination among layer components; it is a centralized server
which provides the configuration of the layers.

 Mahout is a ready-to-use machine learning framework.

Hadoop Streaming
 HDFS with MapReduce and YARN-based system enables parallel processing of large
datasets.

 Spark provides in-memory processing of data, thus improving processing speed.

 In Hadoop streaming, Spark and Flink are used to interface between the Mapper and Reducer.

 Flink improves overall performance as it provides single run-time for streaming as well as
batch processing.

Hadoop Pipes
 This is another way of interfacing between the Mapper and Reducer.

 C++ pipes are used for interfacing.

Hadoop Distributed File System(HDFS)


 Big data analytics applications are software applications.

 HDFS is a core component of Hadoop.

 HDFS is designed to run on a cluster of computers and servers, including cloud-based utility
services.

 HDFS stores Big Data ranging from GBs to PBs.

HDFS Data Storage


 Hadoop data store concept implies storing the data at a number of clusters.

 Each cluster has a number of data stores, called racks. Each rack stores a number of
DataNodes.

 Each DataNode has a large number of Data Blocks.

 Racks are distributed across a cluster. The nodes have storage and processing capabilities.

 The data blocks are replicated by default on at least three DataNodes, in the same or remote
racks.

 Below diagram shows the replication of data blocks.

Features of HDFS
 Create, append, delete, rename and attribute modification functions.
 The content of an individual file cannot be modified or replaced, but new data can be appended
at the end of the file.

 Files are written once but read many times during usage and processing.
 The average file size can be more than 500 MB.
Hadoop Physical Organization
 The conventional file system uses directories.
 A directory consists of folders. A folder consists of files.
 When data is processed, the data sources are identified by pointers to the resources.

 A data dictionary stores the resource pointers. Master tables of the dictionary are stored at a
central location. The centrally stored tables make administration easier when the data
sources change during processing.

 The files, DataNodes and blocks need identification during processing in HDFS. HDFS
uses the NameNode and DataNodes for this purpose.

 Few nodes in Hadoop cluster act as NameNodes. These nodes are termed as Master Nodes
or simply Masters.

 These masters have different configurations and processing power.

 Master nodes have less local storage.

 The majority of the nodes in a Hadoop cluster act as DataNodes and TaskTrackers. These nodes
are referred to as slave nodes or slaves.

 The slaves have lots of disk storage and moderate amounts of processing capability.

 Slaves are responsible for storing the data and processing the computation tasks submitted by
the clients.

 Below diagram illustrates the above explanation

 A single master node provides HDFS, MapReduce and Hbase using threads in small to
medium sized clusters.

 When the cluster size is large, multiple servers are used to balance the load.

 The Secondary NameNode provides NameNode management services, and ZooKeeper is used
by HBase for metadata storage.

 The master node fundamentally plays the role of a coordinator.



 The master node receives client connections, maintains the description of the global file
system name space, and the allocation of file blocks.

 It also monitors the state of the system in order to detect any failure.

 The MasterNode consists of three components: NameNode, Secondary NameNode and JobTracker.

 The NameNode stores all the file system related information such as:

 Which sections of a file are stored in which part of the cluster

 Last access time for the file

 User permissions like which user has access to the file.

 The Secondary NameNode is an alternate for the NameNode. The Secondary NameNode keeps
a copy of the NameNode metadata.

 Masters and slaves, and the Hadoop client (node), load the data into the clusters, submit the
processing job and then retrieve the data to see the response after the job completion.

Hadoop 2

 Single NameNode failure in Hadoop 1 is an operational limitation. Scaling was also
restricted: clusters could not grow beyond a few thousand DataNodes, and the number of
clusters was limited.

 Hadoop 2 provides multiple NameNodes. This enables higher resource availability.

 Each Master Node has the following components.

 An associated NameNode

 A ZooKeeper coordination client (for the associated NameNode), which functions as a
centralized repository for distributed applications. ZooKeeper uses synchronization,
serialization, and coordination activities.

 An associated JournalNode (JN). The JN keeps the records of the state, resources assigned,
and intermediate results. The system takes care of failure issues as follows.

HDFS Commands
 HDFS commands are common to other modules of Hadoop. The HDFS shell
is not compliant with POSIX.

 Thus, the shell does not interact exactly as a Unix or Linux shell does.

 Commands for interacting with the files in HDFS require /bin/hdfs dfs <args>, where
args stands for the command arguments.

 Below table shows the examples and usages of commands.



MapReduce FRAMEWORK AND PROGRAMMING MODEL

 MapReduce function is an integral part of hadoop physical organization.

 MapReduce is a programming model for distributed computing.

 Mapper means software for doing the assigned task after organizing the data blocks
imported using the keys. A key is specified in the command line of the Mapper. The command
maps the key to the data, which an application uses.

 Reducer means software for reducing the mapped data by using an aggregation, query or
user-specified function. The Reducer provides a concise, cohesive response to the
application.

 Aggregation function means a function that groups the values of multiple rows together to
produce a single value of more significant meaning or measurement, for example, functions
such as count, sum, maximum, minimum, deviation and standard deviation.

 Querying function means a function that finds the desired values, for example, a function for
finding the student of a class who has shown the best performance in an examination.

Features of MapReduce framework are as follows:

1. Provides automatic parallelization and distribution of computation based on several

processors

2. Processes data stored on distributed clusters of DataNodes and racks

3. Allows processing large amounts of data in parallel

4. Provides scalability for the use of a large number of servers

5. Provides MapReduce batch-oriented programming model in Hadoop version 1

6. Provides additional processing modes in Hadoop 2 YARN-based system and enables

required parallel processing.

Hadoop MapReduce Framework

MapReduce provides two important functions.

 One function is the distribution of a job, based on a client application task or user query, to
various nodes within a cluster.

 The second function is organizing and reducing the results from each node into a cohesive
response to the application or answer to the query.

 The processing tasks are submitted to Hadoop. The Hadoop framework in turn manages
the tasks of issuing jobs, job completion, and copying data around the cluster between the
DataNodes, with the help of the JobTracker.

 Daemon refers to a highly dedicated program that runs in the background of a system. The
user does not control or interact with it.

 MapReduce runs as per the job assigned by the JobTracker, which keeps track of the jobs
submitted for execution and runs a TaskTracker for tracking the tasks.

MapReduce programming enables job scheduling and task execution as follows:

 A client node submits a request of an application to the JobTracker. A JobTracker is a
Hadoop daemon (background program).

The following are the steps on the request to MapReduce:

(i) Estimate the need of resources for processing that request,

(ii) Analyze the states of the slave nodes,

(iii) Place the mapping tasks in queue,

(iv) Monitor the progress of the tasks and, on failure, restart a task in the slots of time
available.

The job execution is controlled by two types of processes in MapReduce:

1. The Mapper deploys map tasks on the slots. Map tasks are assigned to those nodes where
the data for the application is stored. The Reducer output transfers to the client node after
the data serialization using AVRO.

2. The Hadoop system sends the Map and Reduce jobs to the appropriate servers in the cluster.
The Hadoop framework in turn manages the tasks of issuing jobs, job completion and copying
data around the cluster between the slave nodes. Finally, the cluster collects and reduces the data
to obtain the result and sends it back to the Hadoop server after completion of the given tasks.

 The job execution is controlled by two types of processes in MapReduce. A single master
process called JobTracker is one. This process coordinates all jobs running on the cluster and
assigns map and reduce tasks to run on the TaskTrackers.

 The second is a number of subordinate processes called TaskTrackers. These processes run
assigned tasks and periodically report their progress to the JobTracker.

MapReduce Programming Model

 A MapReduce program can be written in any language, including Java, C++ (Pipes) or Python.

 The Map function of a MapReduce program does the mapping to compute the data and converts
the data into other data sets.

 After the Mapper computations finish, the Reducer function collects the results of the map and
generates the final output. A MapReduce program can be applied to any type of data,
i.e., structured or unstructured, stored in HDFS.

 The input data is in the form of file or directory and is stored in the HDFS. The MapReduce
program performs two jobs on this input data, the Map job and the Reduce job.

 The map job takes a set of data and converts it into another set of data. The individual
elements are broken down into tuples (key/value pairs) in the resultant set of data. The
reduce job takes the output from a map as input and combines the data tuples into a smaller
set of tuples.

 Map and reduce jobs run in isolation from one another. As the sequence of the name
MapReduce implies, the reduce job is always performed after the map job.

 MapReduce v2 uses YARN-based resource scheduling, which simplifies software
development.
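
To make the Mapper and Reducer roles concrete, the following is a minimal word-count sketch in
Java using the org.apache.hadoop.mapreduce API. It is not part of the original notes; the class
names and the input/output paths taken from the command line are illustrative assumptions. The
map job emits (word, 1) tuples and the reduce job aggregates them by summing the counts per
word, matching the key/value description above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: breaks each input line into (word, 1) key/value tuples
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: aggregates the mapped tuples by summing the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job; input and output HDFS paths come from args
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The packaged JAR would typically be submitted with a command of the form
hadoop jar wordcount.jar WordCount /input /output, where the JAR name and the HDFS paths are
again illustrative.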

HADOOP YARN:

 YARN is a resource management platform which manages computer resources.

 The platform is responsible for providing the computational resources, such as CPUs,
memory, network I/O which are needed when an application executes.

 YARN manages the schedules for running of the sub-tasks. Each sub-task uses the resources
in allotted time intervals.

 YARN separates the resource management and processing components.

 An application consists of a number of tasks. Each task can consist of a number of sub-tasks
(threads), which run in parallel at the nodes in the cluster.

 YARN enables running of multi-threaded applications.

 YARN manages and allocates the resources for the application sub-tasks and makes the
resources available to them in the Hadoop system.

Hadoop 2 Execution Model

YARN components are:

1. Client,

2. Resource Manager (RM),

3. Node Manager (NM),

4. Application Master (AM) and Containers.

List of actions of YARN resource allocation and scheduling functions is as follows:

 A MasterNode has two components: (i) Job History Server and (ii) Resource Manager (RM).

 A Client Node submits the request of an application to the RM. The RM is the master. One
RM exists per cluster. The RM keeps information of all the slave NMs.

 Information is about the location and the number of resources (data blocks and servers) they
have.

 The RM also runs the Resource Scheduler service that decides how to assign the
resources. Therefore, the RM performs resource management as well as scheduling.

 Multiple NMs exist in a cluster. An NM creates an AM instance (AMI). The AMI initializes
itself and registers with the RM. Multiple AMIs can be created in an AM.

 The AMI performs role of an Application Manager (ApplM), that estimates the resources
requirement for running an application program or sub-task.

 The ApplMs send their requests for the necessary resources to the RM. Each NM includes
several containers for use by the sub-tasks of the application.

 NM is a slave of the infrastructure. It signals whenever it initializes. All active NMs send the
controlling signal periodically to the RM signaling their presence.

 Each NM assigns a container(s) for each AMI. The container(s) assigned at an instance may
be at same NM or another NM.

 RM allots the resources to AM, and thus to ApplMs for using assigned containers on the
same or other NM for running the application sub tasks in parallel.

Below figure shows the YARN-based execution model:



HADOOP ECOSYSTEM TOOLS

1. Zookeeper

2. Oozie

3. Sqoop

4. Flume

5. Ambari

1. Zookeeper:

 Apache ZooKeeper is a coordination service that enables synchronization across a cluster in
distributed applications.

 ZooKeeper in Hadoop behaves as a centralized repository where distributed applications can
write data at a node called a JournalNode and read the data out of it.

 It uses synchronization, serialization and coordination activities. It enables a distributed
system to function as a single unit.

ZooKeeper's main coordination services are:

Name service - A name service maps a name to the information associated with that name. For
example, DNS is a name service that maps a domain name to an IP address. Similarly, a name
service keeps track of the servers or services that are up and running, and looks up their status
by name.



Concurrency control - Concurrent access to a shared resource may cause inconsistency of the
resource. A concurrency control algorithm controls concurrent access to the shared resource in the
distributed system.

Configuration management - A requirement of a distributed system is a central configuration
manager. A newly joining node can pick up the up-to-date centralized configuration from the
ZooKeeper coordination service as soon as the node joins the system.

Failure - Distributed systems are susceptible to the problem of node failures. This requires
implementing an automatic recovery strategy by selecting some alternate node for processing.
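
The coordination services above can be illustrated with a minimal sketch (not part of the original
notes) using the ZooKeeper Java client API. The ensemble address localhost:2181, the znode path
/config-app and its payload are illustrative assumptions; the point is that any node joining the
system reads the same centrally stored value.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigSketch {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (address and timeout are illustrative)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();   // session established
                }
            }
        });
        connected.await();

        // Write a small piece of centralized configuration as a znode
        String path = "/config-app";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node joining the system can read the same configuration back
        byte[] data = zk.getData(path, false, null);
        System.out.println("Configuration read from ZooKeeper: " + new String(data));

        zk.close();
    }
}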

2. Oozie

 Apache Oozie is an open-source project of Apache that schedules Hadoop jobs. An efficient
process for job handling is required. Analysis of Big Data requires creation of multiple jobs
and sub-tasks in a process.

 Oozie design provisions the scalable processing of multiple jobs.

 Oozie provides a way to package and bundle multiple coordinator and workflow jobs, and
manage the lifecycle of those jobs.

The two basic Oozie functions are:

1. Oozie workflow jobs are represented as Directed Acyclic Graphs (DAGs), specifying a
sequence of actions to execute.

2. Oozie coordinator jobs are recurrent Oozie workflow jobs that are triggered by time and data
availability.

Oozie provisions for the following:

1. Integrates multiple jobs in a sequential manner.

2. Stores and supports Hadoop jobs for MapReduce, Hive, Pig, and Sqoop

3. Runs workflow jobs based on time and data triggers

4. Manages batch coordinator for the applications

5. Manages the timely execution of tens of elementary jobs lying in thousands of

workflows in a Hadoop cluster.

3. Sqoop

 Apache Sqoop is a tool built for efficiently loading voluminous amounts of data
between Hadoop and external data repositories that reside on enterprise application servers
or in relational databases.

 Sqoop works with relational databases such as Oracle, MySQL, PostgreSQL and DB2.
Sqoop provides the mechanism to import data from external Data Stores into HDFS.

 Sqoop relates to Hadoop Eco-system components, such as Hive and HBase. Sqoop can
extract data from Hadoop or other ecosystem components.

 Sqoop provides command line interface to its users.

 The tool allows defining the schema of the data for import.

 Sqoop exploits the MapReduce framework to import and export the data, and performs the
transfers as parallel sub-tasks. Sqoop provisions for fault tolerance.

 Sqoop initially parses the arguments passed in the command line and prepares the map task.

 The map task initializes multiple Mappers depending on the number supplied by the user in
the command line.

 Sqoop distributes the input data equally among the Mappers. Then each Mapper creates a
connection with the database using JDBC and fetches the part of data assigned by Sqoop and
writes it into HDFS/Hive/HBase as per the choice provided in the command line.

4. Flume

I. Apache Flume provides a distributed, reliable, and available service. Flume efficiently collects,
aggregates and transfers a large amount of streaming data into HDFS.

II. Flume enables upload of large files into Hadoop clusters.

III. The features of Flume include robustness and fault tolerance. Flume provides data transfer
which is reliable and provides for recovery in case of failure.

IV. Flume is useful for transferring a large amount of data in applications related to logs of
network traffic, sensor data, geo-location data, e-mails and social-media messages.

Apache Flume has the following four important components:

1. Sources which accept data from a server or an application.

2. Sinks which receive data and store it in HDFS repository or transmit the data to another
source. Data units that are transferred over a channel from source to sink are called events.

3. Channels connect sources and sinks by queuing event data for transactions. The size
of event data is usually 4 KB. The data source is considered to be a source of various sets of
events. Sources listen for events and write events to a channel. Sinks basically write event data to
a target and remove the event from the queue.

4. Agents run the sinks and sources in Flume. The interceptors drop the data or transfer data as it
flows into the system.

Ambari
I. Apache Ambari is a management platform for Hadoop. It is open source.

II. Ambari enables an enterprise to plan, securely install, manage and maintain Hadoop
clusters.

III. Ambari provisions for advanced cluster security capabilities, such as Kerberos.

Features of Ambari and associated components are as follows:

1. Simplification of installation, configuration and management.

2. Enables easy, efficient, repeatable and automated creation of clusters.

3. Manages and monitors scalable clustering.

4. Visualizes the health of clusters and critical metrics for their operations.

5. Enables detection of faulty node links.

6. Provides extensibility and customizability.

Hadoop Administration

I. Administrator procedures enable managing and administering Hadoop clusters, resources, and
associated Hadoop ecosystem components.

II. Administration includes installing and monitoring clusters.

III. Ambari provides a centralized setup for security.

IV. Ambari helps automation of the setup and configuration of Hadoop using Web User Interface
and REST APIs.

V. The console is similar to the web UI of Ambari. The console enables visualization of cluster
health, the HDFS directory structure, the status of MapReduce tasks, review of log records and
access to application status.

VI. A single harmonized view on the console makes administration tasks easier. Visualization can
go down to the individual component level on drilling down.

VII. Nodes addition and deletion are easy using the console.

HBase

I. HBase is a Hadoop system database. HBase was created for large tables.

II. HBase is an open-source, distributed, versioned and non-relational (NoSQL) database.

Features of HBase are:

1. Uses a partial columnar data schema on top of Hadoop and HDFS.

2. Supports a large table of billions of rows and millions of columns.

3. Supports data compression algorithms.

4. Provisions in-memory column-based data transactions.

5. Accesses rows serially and does not provision for random accesses and write into the

rows.

6. Provides random, real-time read/write access to Big Data.

7. Fault tolerant storage due to automatic failure support between DataNodes servers.

8. Similarity with Google BigTable.

HBase is written in Java. It stores data in a large structured table.

HBase provides scalable distributed Big Data Store.

HBase stores data as key-value pairs.

HBase system consists of a set of tables. Each table contains rows and columns, similar to a
traditional database.

HBase provides a primary key as in database table.

HBase applies a partial columnar scheme on top of the Hadoop and HDFS.

Hive

 Apache Hive is open-source data warehouse software. Hive facilitates reading, writing and
managing large datasets which reside in distributed Hadoop files. Hive uses an SQL-like language.

 Hive design provisions for batch processing of large sets of data.



 Hive does not process real time queries and does not update row-based data tables.

 Hive enables data serialization/deserialization and increases flexibility in schema design by
including a system catalog called the Hive Metastore.

 Hive supports different storage types like text files, sequence files, RC Files, ORC Files and
HBase.

Pig

 Apache Pig is an open source, high-level language platform. Pig was developed for
analyzing large-data sets.

 Pig executes queries on large datasets that are stored in HDFS using Apache Hadoop.

Additional features of Pig are as follows:

I. Loads the data after applying the required filters and dumps the data in the desired

format.

II. Requires Java runtime environment for executing Pig Latin programs.

III. Converts all the operations into map and reduce tasks. The tasks run on Hadoop.

IV. Allows concentrating upon the complete operation, irrespective of the individual Mapper
and Reducer functions that produce the output results.
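
As a hedged illustration (not part of the original notes) of how the Pig Latin operations above are
converted into map and reduce tasks, the sketch below embeds a small word-count script in Java
through Pig's PigServer API. Local execution mode, the input file input.txt and the field layout
are illustrative assumptions.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for this sketch; MapReduce or Tez modes would run on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load lines, split them into words, then group and count the words
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;");

        // Pig converts the statements above into map and reduce tasks and runs them
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}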

Mahout

 Mahout is a Java library implementing machine learning techniques for clustering, classification,
and recommendation.

Apache Mahout features are:

 Collaborative data-filtering that mines user behavior and makes product recommendations.

 Clustering that takes data items in a particular class, and organizes them into naturally occurring
groups, such that items belonging to the same group are similar to each other.

 Classification that means learning from existing categorizations and then assigning the future items
to the best category.

 Frequent item-set mining that analyzes items in a group and then identifies which items usually occur
together.

Hadoop Distributed File System Basics


 The Hadoop Distributed File System (HDFS) was designed for Big Data processing. HDFS
rigorously restricts data writing to one user at a time.

 All additional writes are "append-only", and there is no random writing to HDFS files.

 The design of HDFS is based on the design of the Google File System (GFS).

 HDFS is designed for data streaming where large amounts of data are read from disk in bulk.
The HDFS block size is typically 64MB or 128MB.

 In HDFS there is no local caching mechanism. The large block and file sizes make it more
efficient to read data from HDFS than to try to cache the data. Important feature of HDFS is
data locality.

 Data locality: The process of moving data requests to the place where the actual data resides.
The following points are the important features of HDFS

 The write-once/read-many design is intended to facilitate streaming reads.

 Files may be appended, but random seeks are not permitted. There is no caching of data.

 Converged data storage and processing happen on the same server nodes.

 A reliable file system maintains multiple copies of data across the cluster. Consequently,
failure of a single node (or even a rack in a large cluster) will not bring down the file system.

 A specialized file system is used, which is not designed for general use.

HDFS Components

HDFS consists of two main components:

1. A NameNode

2. Multiple DataNodes

 A single NameNode manages all the metadata needed to store and retrieve the actual data
from the DataNodes.

 The design is a master/slave architecture in which the master (NameNode) manages the file
system namespace and regulates access to files by clients.

 File system namespace operations such as opening, closing, and renaming files and
directories are all managed by the NameNode.

 The NameNode also determines the mapping of blocks to DataNodes and handles DataNode
failures.

 The slaves (DataNodes) are responsible for serving read and write requests from the file
system to the clients.

Below figure shows the various system roles in an HDFS deployment

 When a client writes data, it first communicates with the NameNode and requests to create a
file.

 The NameNode determines how many blocks are needed and provides the client with the
DataNodes that will store the data.

 As part of the storage process, the data blocks are replicated after they are written to the
assigned node.

 Depending on how many nodes are in the cluster, the NameNode will attempt to write
replicas of the data blocks on nodes that are in other separate racks.

 If there is only one rack, then the replicated blocks are written to other servers in the same
rack.

 After the DataNode acknowledges that the file block replication is complete, the client
closes the file and informs the NameNode that the operation is complete.

 The NameNode does not write any data directly to the DataNodes. It gives the client a limited
amount of time to complete the operation. If it does not complete in the time period, the
operation is canceled.

 Reading data happens in a similar fashion. The client requests a file from the NameNode,
which returns the best DataNodes from which to read the data.

 The client then accesses the data directly from the DataNodes. Once the metadata has been
delivered to the client, the NameNode steps back and lets the conversation between the
client and the DataNodes proceed (a client-side sketch using the Java FileSystem API is given
at the end of this section).

 While data transfer is progressing, the NameNode also monitors the DataNodes by listening
for heartbeats sent from DataNodes for detecting the failure.

 If a DataNode fails, the NameNode will route around the failed DataNode and begin re-
replicating the now-missing blocks.

 The block reports are sent every 10 heartbeats. The reports enable the NameNode to keep an
up-to-date account of all data blocks in the cluster.

 The purpose of the Secondary NameNode is to perform periodic checkpoints that evaluate
the status of the NameNode. It also has two disk files that track changes to the metadata

 The various roles in HDFS can be summarized as follows:

 HDFS uses a master/slave model designed for large file reading/streaming.

 The NameNode is a metadata server or "data traffic cop".

 HDFS provides a single namespace that is managed by the NameNode.

 Data is redundantly stored on DataNodes; there is no data on the NameNode.
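
The client interaction summarized above (the client obtains metadata from the NameNode and then
exchanges the actual blocks with the DataNodes) is normally hidden behind the HDFS Java
FileSystem API. The following is a minimal client-side sketch, not part of the original notes; the
NameNode address hdfs://namenode:9000 and the path /user/demo/notes.txt are illustrative
assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // illustrative NameNode address

        // FileSystem.get() talks to the NameNode; the data itself flows to/from DataNodes
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/notes.txt");        // illustrative path

        // Write: the NameNode assigns DataNodes and the client streams the blocks to them
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("HDFS write-once example line");
        }

        // Read: the NameNode returns block locations, then the client reads from DataNodes
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}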

HDFS Block Replication

 For Hadoop clusters containing more than eight DataNodes, the replication value is usually
set to 3 (replication factor RF = 3).

 The HDFS default block size is 64MB. In a typical operating system, the block size is 4KB
or 8KB. The HDFS default block size is not the minimum block size.

 If a 20KB file is written to HDFS, it will create a block that is approximately 20KB in size.
If a file of size 80MB is written to HDFS, a 64MB block and a 16MB block will be created.

 The HDFS blocks are based on size, while the splits are based on a logical partitioning of the
data.

 Below figure illustrates the HDFS block replication example



HDFS Safe Mode

 When the NameNode starts, it enters a read-only safe mode where blocks cannot be
replicated or deleted. Safe Mode enables the NameNode to perform two important
processes:

 The previous file system state is reconstructed by loading the fsimage file into memory
and replaying the edit log.

 The mapping between blocks and data nodes is created by waiting for enough of the
DataNodes to register so that at least one copy of the data is available. Not all
DataNodes are required to register before HDFS exits from Safe Mode. The registration
process may continue for some time.

 HDFS may also enter Safe Mode for maintenance using the hdfs dfsadmin -safemode
command or when there is a file system issue that must be addressed by the administrator.

Rack Awareness

 Rack awareness deals with data locality.

 Hadoop cluster will exhibit three levels of data locality:

1. Data resides on the local machine (best).



2. Data resides in the same rack (better).

3. Data resides in a different rack (good).

 When the YARN scheduler is assigning MapReduce containers to work as mappers, it will
try to place the container first on the local machine, then on the same rack, and finally on
another rack.

 In addition, the NameNode tries to place replicated data blocks on multiple racks for
improved fault tolerance. In such a case, an entire rack failure will not cause data loss or stop
HDFS from working.

 Performance may be degraded. A default Hadoop installation assumes all the nodes belong
to the same (large) rack. In that case, there is no option 3.

NameNode High Availability

 With early Hadoop installations, the NameNode was a single point of failure that could bring
down the entire Hadoop cluster.

 NameNode hardware often employed redundant power supplies and storage to guard against
such problems, but it was still susceptible to other failures.

 The solution was to implement NameNode High Availability (HA) as a means to provide
true fail over service.

 An HA Hadoop cluster has two (or more) separate NameNode machines. Each machine is
configured with exactly the same software.

 One of the NameNode machines is in the Active state, and the other is in the Standby state.

 Like a single NameNode cluster, the Active NameNode is responsible for all client HDFS
operations in the cluster.

 The Standby NameNode maintains enough state to provide a fast failover (if required).
To guarantee that the file system state is preserved in the HDFS High Availability design, both the
Active and Standby NameNodes receive block reports from the DataNodes.

 The Active node also sends all file system edits to a quorum of JournalNodes. At least three
physically separate JournalNode daemons are required, because edit log modifications must
be written to a majority of the JournalNodes.

 This design will enable the system to tolerate the failure of a single JournalNode machine.

 The Standby node continuously reads the edits from the JournalNodes to ensure its
namespace is synchronized with that of the Active node.

 In the event of an Active NameNode failure, the Standby node reads all remaining edits
from the JournalNodes before promoting itself to the Active state.

 To prevent confusion between NameNodes, the JournalNodes allow only one NameNode to
be a writer at a time.

 During failover, the NameNode that is chosen to become active takes over the role of
writing to the JournalNodes.

 A Secondary NameNode is not required in the HA configuration because the Standby node
also performs the tasks of the Secondary NameNode.

 Apache ZooKeeper is used to monitor the NameNode health. ZooKeeper is a highly
available service for maintaining small amounts of coordination data, notifying clients of
changes in that data, and monitoring clients for failures.

 HDFS failover relies on ZooKeeper for failure detection and for Standby to Active
NameNode election.

HDFS NameNode Federation

 Another important feature of HDFS is NameNode Federation.

 Older versions of HDFS provided a single namespace for the entire cluster managed by a
single NameNode.

 Federation addresses this limitation by adding support for multiple NameNodes/namespaces
to the HDFS file system.

 The key benefits are as follows:

Namespace scalability: HDFS cluster storage scales horizontally without placing a burden on the
NameNode.

Better performance: Adding more NameNodes to the cluster scales the file system read/write
operations throughput by separating the total namespace.

System isolation: Multiple NameNodes enable different categories of applications to be
distinguished, and users can be isolated to different namespaces.

Above figure illustrates how HDFS NameNode Federation is accomplished.

 NameNode1 manages the /research and /marketing namespaces, and

 NameNode2 manages the /data and /project namespaces.

 The NameNodes do not communicate with each other, and the DataNodes "just store
data blocks" as directed by either NameNode.

HDFS Checkpoints and Backups

 The NameNode stores the metadata of the HDFS file system in a file called fsimage.

 File systems modifications are written to an edits log file, and at startup the NameNode
merges the edits into a new fsimage.

 The SecondaryNameNode or CheckpointNode periodically fetches edits from the
NameNode, merges them, and returns an updated fsimage to the NameNode.

 Backups are different from checkpoints. A backup stores the data on hard disks for
future use.

HDFS Snapshots

 HDFS snapshots are similar to backups.



 Snapshots are created by administrators using the hdfs dfs -createSnapshot command. HDFS
snapshots are read-only point-in-time copies of the file system.

 They offer the following features:

 Snapshots can be taken of a sub-tree of the file system or the entire file system.

 Snapshots can be used for data backup, protection against user errors, and disaster
recovery.

HDFS User Commands

$ hdfs version - The version of HDFS

$ hdfs dfs -ls / - To list the files in the root HDFS directory

$ hdfs dfs -ls or $ hdfs dfs -ls /user/hdfs - To list files in your home directory

$ hdfs dfs -mkdir stuff - To make a directory in HDFS

$ hdfs dfs -put test stuff - To copy a file from your current local directory into HDFS

$ hdfs dfs -get stuff/test test-local – To copy Files from HDFS

$ hdfs dfs -cp stuff/test test.hdfs – To copy Files within HDFS

$ hdfs dfs -rm test.hdfs – To delete a File within HDFS

$ hdfs dfs -rm -r -skipTrash stuff – To delete a directory in HDFS (-r performs a recursive
deletion, and the Trash directory is similar to the Recycle Bin in Windows)

$ hdfs dfsadmin -report – To get an HDFS Status Report.



Essential Hadoop Tools

1. Apache Pig

i. Apache Pig is a high-level language that enables programmers to write complex MapReduce
transformations using a simple scripting language.

ii. Pig Latin defines a set of transformations on a data set such as aggregate, join, and sort.

iii. Pig is used for extract, transform, and load (ETL) data pipelines, quick research on raw data,
and iterative data processing.

iv. Apache Pig has several usage modes. The first is a local mode in which all processing is done
on the local machine.

v. The non-local (cluster) modes are MapReduce and Tez. These modes execute the job on the

cluster using either the MapReduce engine or the optimized Tez engine.

Below figure shows the modes supported by Apache Pig

2. Apache Hive

I. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, ad hoc queries, and the analysis of large data sets using a SQL-like language
called HiveQL.

II. Hive is considered the standard for interactive SQL queries over petabytes of data

Features offered by Apache Hive

▪ Tools to enable easy data extraction, transformation, and loading (ETL).



▪ A mechanism to impose structure on a variety of data formats.

▪ Access to files stored either directly in HDFS or in other data storage systems such as HBase

▪ Query execution via MapReduce

* Hive allows users to query the data on Hadoop clusters using SQL. Hive makes it possible
for programmers who are familiar with the MapReduce framework to add their custom mappers
and reducers to Hive queries. Hive queries can be dramatically accelerated using the Apache Tez
framework under YARN in Hadoop version 2.
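
As a hedged illustration (not part of the original notes) of issuing HiveQL from a client program,
the sketch below uses the HiveServer2 JDBC driver. The connection URL
jdbc:hive2://hiveserver:10000/default, the user name and the table and column names are
illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (part of the Hive JDBC client libraries)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Illustrative connection URL and credentials
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // HiveQL statements are sent as ordinary SQL strings
            stmt.execute("CREATE TABLE IF NOT EXISTS web_logs (ip STRING, bytes INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // Hive compiles the query into MapReduce (or Tez) jobs under YARN
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT ip, SUM(bytes) AS total FROM web_logs GROUP BY ip")) {
                while (rs.next()) {
                    System.out.println(rs.getString("ip") + " -> " + rs.getLong("total"));
                }
            }
        }
    }
}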

3. Apache Sqoop to Acquire Relational Data

 Sqoop is a tool designed to transfer data between Hadoop and relational databases.

 Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database.

 In version 1 of Sqoop, data were accessed using connectors written for specific databases.

 Version 2 does not support connectors, version 1 data transfer from an RDBMS directly to
Hive or HBase, or data transfer from Hive or HBase to your RDBMS. Instead, version 2
offers more generalized ways to accomplish these tasks.

Apache Sqoop Import and Export Methods

 Sqoop data import (to HDFS) process is done in two steps:

1. In the first step, Sqoop examines the database to gather the necessary metadata for the

data to be imported.

2. The second step is a map-only (no reduce step) Hadoop job that Sqoop submits to the

cluster. This job does the actual data transfer using the metadata captured in the previous
step. Each node doing the import must have access to the database.

 The imported data are saved in an HDFS directory. Sqoop will use the database name for the
directory, or the user can specify any alternative directory where the files should be
populated.

 By default, these files contain comma-delimited fields, with new lines separating different
records.

 Below diagram illustrates the Sqoop data import method

 The export is done in two steps:

 The first step is to examine the database for metadata. The export step again uses a map-
only Hadoop job to write the data to the database.

 Sqoop divides the input data set into splits, then uses individual map tasks to push the
splits to the database. This process assumes the map tasks have access to the database.

 Below diagram illustrates the 2-step Sqoop data export method.



Apache Sqoop Version Changes

 Sqoop Version 1 uses specialized connectors to access external systems. These connectors
are often optimized for various RDBMSs or for systems that do not support JDBC.

 Connectors are plug-in components based on Sqoop’s extension framework and can be
added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to
efficiently transfer data between Hadoop and the external store supported by the connector.

 By default, Sqoop version 1 includes connectors for popular databases such as MySQL,
PostgreSQL, Oracle, SQL Server, and DB2. It also supports direct transfer to and from the
RDBMS to HBase or Hive.

 Sqoop version 2 no longer supports specialized connectors or direct import into HBase or
Hive. All imports and exports are done through the JDBC interface.

 Below table illustrates the Differences in Version 1 and Version 2



4. Apache Flume

 Apache Flume is an independent agent designed to collect, transport, and store data into
HDFS.

 Data transport involves a number of Flume agents that may traverse a series of machines and
locations.

 Flume is used for log files, social media-generated data, email messages, and any continuous
data source.

Flume agent is composed of three components:

▪ Source. The source component receives data and sends it to a channel. It can send the data to
more than one channel. The input data can be from a real-time source (e.g., a weblog) or another
Flume agent.

▪ Channel. A channel is a data queue that forwards the source data to the sink destination. It can
be thought of as a buffer that manages input (source) and output (sink) flow rates.

▪ Sink. The sink delivers data to destination such as HDFS, a local file, or another Flume agent.

Below figure illustrates the Flume agent with source, channel, and sink

 A Flume agent can have several sources, channels, and sinks. Sources can write to multiple
channels, but a sink can take data from only a single channel.

 Data written to a channel remain in the channel until a sink removes the data. By default, the
data in a channel are kept in memory but may be optionally stored on disk to prevent data
loss in the event of a network failure.

 Flume agents may be placed in a pipeline, possibly to traverse several machines or domains.

 Below figure shows the Pipeline created by connecting Flume agents

 In a Flume pipeline, the sink from one agent is connected to the source of another. The data
transfer format normally used by Flume is called Apache Avro.

 Avro is a data serialization/deserialization system that uses a compact binary format.
The schema is sent as part of the data exchange and is defined using JSON.

 Avro also uses remote procedure calls (RPCs) to send data; an Avro sink will contact an Avro
source to send data.

 Below figure illustrates A Flume consolidation network.



5. Oozie

 Oozie is a workflow director system designed to run and manage multiple related Apache
Hadoop jobs.

 Oozie is not a substitute for the YARN scheduler. Oozie provides a way to connect and
control Hadoop jobs on the cluster.

 Oozie workflow jobs are represented as directed acyclic graphs (DAGs) of actions.

Three types of Oozie jobs are permitted:

▪ Workflow: a specified sequence of Hadoop jobs with outcome-based decision points

and control dependency. Progress from one action to another cannot happen until the first
action is complete.

▪ Coordinator: a scheduled workflow job that can run at various time intervals or when
data become available.

▪ Bundle: a higher-level Oozie abstraction that will batch a set of coordinator jobs.

 Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop
jobs out of the box as well as system-specific jobs

 Oozie provides a CLI(Command Line Interface) and a web UI(User Interface) for
monitoring jobs.

 Oozie runs a basic MapReduce operation. If the application was successful, the job ends; if
an error occurred, the job is killed.

 Below figure shows a simple Oozie DAG workflow

 Oozie workflow definitions are written in hPDL (an XML Process Definition Language).

 Such workflows contain several types of nodes:

▪ Control flow nodes define the beginning and the end of a workflow. They include start,
end, and optional fail nodes.

▪ Action nodes are where the actual processing tasks are defined. When an action node
finishes, the remote systems notify Oozie and the next node in the workflow is executed.
Action nodes can also include HDFS commands.

▪ Fork/join nodes enable parallel execution of tasks in the workflow. The fork node
enables two or more tasks to run at the same time. A join node represents a rendezvous
point that must wait until all forked tasks complete.

▪ Control flow nodes enable decisions to be made about the previous task. Control
decisions are based on the results of the previous action. Decision nodes are essentially
switch-case statements that use JSP EL (Java Server Pages—Expression Language) that
evaluate to either true or false.

Below figure shows a more complex Oozie DAG workflow (Adapted from Apache Oozie
Documentation)

6. Apache HBase

 Apache HBase is an open source, distributed, versioned, nonrelational database

 Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Important features are:

▪ Linear and modular scalability

▪ Strictly consistent reads and writes

▪ Automatic and configurable sharding of tables

▪ Automatic failover support between RegionServers.

▪ Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase
tables.

▪ Easy-to-use Java API for client access.

HBase Data Model Overview

 A table in HBase is similar to other databases, having rows and columns. Columns in HBase
are grouped into column families, all with the same prefix.

 It is possible to have many versions of data within an HBase cell. A version is specified as a
timestamp and is created each time data are written to a cell.
 The empty byte array denotes both the start and the end of a table’s namespace. All table
accesses are via the table row key, which is considered its primary key.
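
A minimal sketch (not part of the original notes) of the Java client API mentioned above, writing
and reading one cell. The table name webtable, the column family cf, the qualifier url and the row
key row1 are illustrative assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (ZooKeeper quorum and other settings)
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Put: row key -> column family:qualifier -> value (each write creates a new version)
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"),
                    Bytes.toBytes("http://example.com"));
            table.put(put);

            // Get: retrieve the cell back by the row key, which acts as the primary key
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url"));
            System.out.println("Stored value: " + Bytes.toString(value));
        }
    }
}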
