BDA Module 2 PDF
Hadoop ecosystem
• Sqoop: It is used to import and export data between HDFS and relational databases (RDBMS).
• Pig: It is a procedural language platform used to develop a
script for MapReduce operations.
• Hbase: HBase is a distributed column-oriented database
built on top of the Hadoop file system.
• Hive: It is a platform used to develop SQL-type scripts to perform MapReduce operations.
• Flume: Used to handle streaming data on top of Hadoop.
• Oozie: Apache Oozie is a workflow scheduler for Hadoop.
Introduction to Pig
● Pig raises the level of abstraction for processing large data sets. It is a fundamental platform for analyzing large data sets and provides a high-level language for expressing data analysis programs. It is an open-source platform developed by Yahoo.
● Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language. Pig Latin (the actual language) defines a set of transformations on a data set such as aggregate, join, and sort.
● Pig is often used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
● Apache Pig has several usage modes. The first is a local mode, in which all processing is done on the local machine. The non-local (cluster) modes are MapReduce and Tez; these execute the job on the cluster using either the MapReduce engine or the optimized Tez engine.
Pig offers the following advantages:
● Reusing the code
● Faster development
● Fewer lines of code
● Schema and type checking, etc.
PIG is
• Easy to learn, read, write, and implement if you know SQL.
● It implements a multi-query approach.
● Provides a large number of nested data types such as Maps, Tuples, and Bags, which are not easily available in MapReduce, along with data operations such as filters, ordering, and joins.
● It is used by many different user groups: for instance, up to 90% of Yahoo's MapReduce jobs and up to 80% of Twitter's MapReduce jobs are written in Pig, and various other companies such as Salesforce, LinkedIn, and Nokia also make heavy use of Pig.
Pig Latin comes with the following features:
● Simple programming: it is easy to code, execute and
manage the program.
● Better optimization: the system can automatically optimize execution as required.
● Extensibility: users can write their own functions to carry out highly specific processing tasks.
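● As a quick illustration (a sketch, not from these slides: the input file name and the word-count task are assumptions), the classic word count takes only a few lines of Pig Latin. Saved in a file such as wordcount.pig:
lines = LOAD 'input.txt' AS (line:chararray); -- one record per line of text
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word; -- split each line into words
grpd = GROUP words BY word; -- one group per distinct word
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt; -- count each group
DUMP counts; -- print the results
● The script can be run in local mode as shown below, or submitted to the cluster by dropping -x local:
$ pig -x local wordcount.pig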
Sqoop
• If you have not done so already, make sure Sqoop is installed on your
cluster. Sqoop is needed on only a single node in your cluster. This Sqoop
node will then serve as an entry point for all connecting Sqoop clients.
Because the Sqoop node is a Hadoop MapReduce client, it requires both a
Hadoop installation and access to HDFS. To install Sqoop using the HDP
distribution RPM files, simply enter:
– yum install sqoop sqoop-metastore
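• Once Sqoop is installed, a typical job imports a table from an RDBMS into HDFS (or exports results back). A minimal sketch; the MySQL host, database, table names, and user below are hypothetical:
$ sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser -P --table customers --target-dir /user/hdfs/customers -m 1
$ sqoop export --connect jdbc:mysql://dbhost/sales --username dbuser -P --table results --export-dir /user/hdfs/results -m 1
• Here -P prompts for the database password and -m sets the number of parallel map tasks.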
Flume
• Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
• Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from various web
servers to HDFS.
$ hadoop fs -put <path of the required file> <path in HDFS where to save the file>
Problem with put Command
• The put command can transfer only one file at a time, while data generators produce data continuously and at a much higher rate.
• put can only upload files that are already complete; it cannot hand streaming data (data that is still being generated) over to HDFS.
Available Solutions:
• Facebook's Scribe
• Apache Kafka
• Apache Flume
Applications of Flume
• Assume an e-commerce web application wants to analyze customer behavior in a particular region. To do so, it would need to move the available log data into Hadoop for analysis. Here, Apache Flume comes to our rescue.
• Flume is used to move the log data generated by application servers
into HDFS at a higher speed.
Advantages of Flume
• Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
• When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between data
producers and the centralized stores and provides a steady flow of data
between them.
• Flume provides the feature of contextual routing.
• The transactions in Flume are channel-based where two transactions
(one sender and one receiver) are maintained for each message. It
guarantees reliable message delivery.
• Flume is reliable, fault tolerant, scalable, manageable, and
customizable.
Flume Architecture
Flume Agent
• An agent is an independent daemon process (JVM) in Flume. It receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). A Flume deployment may have more than one agent.
Source
• A source is the component of an Agent which receives data from
the data generators and transfers it to one or more channels in the
form of Flume events.
• Apache Flume supports several types of sources and each source
receives events from a specified data generator.
• Examples − Avro source, Thrift source, Twitter 1% firehose source, etc.
Channel
• A channel is a transient store which receives the events from
the source and buffers them till they are consumed by sinks. It
acts as a bridge between the sources and the sinks.
• These channels are fully transactional and they can work
with any number of sources and sinks.
• Example − JDBC channel, File system channel, Memory
channel, etc.
Sink
• A sink stores the data into centralized stores like HBase and
HDFS. It consumes the data (events) from the channels and
delivers it to the destination. The destination of the sink might
be another agent or the central stores.
• Example − HDFS sink
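• Putting the three components together, a Flume agent is described in a Java-properties file. A minimal sketch (the agent name a1, the netcat source on port 44444, the memory channel, and the logger sink are illustrative choices, not from these slides), saved in a file such as example.conf:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
• The agent is then started with the flume-ng launcher:
$ flume-ng agent --conf conf --conf-file example.conf --name a1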
Setting multi-agent flow
• In order to flow the data across multiple agents or hops, the sink of
the previous agent and source of the current hop need to be avro type
with the sink pointing to the hostname (or IP address) and port of the
source.
• Within Flume, there can be multiple agents and before reaching the
final destination, an event may travel through more than one agent.
This is known as multi-hop flow.
Consolidation
• A very common scenario in log collection is a large number of log
producing clients sending data to a few consumer agents that are
attached to the storage subsystem. For example, logs collected from
hundreds of web servers sent to a dozen agents that write to an HDFS
cluster.
• This can be achieved in Flume by configuring a number of first tier
agents with an avro sink, all pointing to an avro source of single agent
(Again you could use the thrift sources/sinks/clients in such a scenario).
This source on the second tier agent consolidates the received events
into a single channel which is consumed by a sink to its final destination.
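• A sketch of the tier-1 to tier-2 wiring only (agent and host names are hypothetical; channel definitions are omitted for brevity). The first-tier agent's Avro sink points at the host and port of the second-tier agent's Avro source:
web1.sinks.k1.type = avro
web1.sinks.k1.hostname = collector-host
web1.sinks.k1.port = 4545
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://namenode/flume/events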
Apache Hive
Architecture of Hive
● User Interface: Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).
● Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
● HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program: instead of writing the MapReduce program in Java, we can write a query for the MapReduce job and process it.
● Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.
● HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage technique used to store data in the file system.
Working of Hive
1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, or the requirements of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker (which runs on the NameNode), and the JobTracker assigns this job to a TaskTracker (which runs on a DataNode). Here, the query executes the MapReduce job.
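• A minimal sketch of a query that follows this flow (the table, columns, and file layout are hypothetical); the SELECT ... GROUP BY below is compiled into a MapReduce job by the steps above:
$ hive -e "CREATE TABLE IF NOT EXISTS employees (name STRING, dept STRING, salary INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
$ hive -e "SELECT dept, AVG(salary) FROM employees GROUP BY dept;"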
Limitations of Hadoop
● Hadoop can perform only batch processing, and data will be accessed
only in a sequential manner. That means one has to search the entire
dataset even for the simplest of jobs.
● A huge dataset when processed results in another huge data set,
which should also be processed sequentially. At this point, a new
solution is needed to access any point of data in a single unit of time
(random access).
•HBase is a distributed column-oriented database built on top of the
Hadoop file system. It is an open-source project and is horizontally
scalable.
● HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
Storage Mechanism in HBase
● HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. In short, in an HBase table:
●Table is a collection of rows.
●Row is a collection of column families.
●Column family is a collection of columns.
●Column is a collection of key value pairs.
Use of HBase
● Apache HBase is used to have random, real-time read/write access to Big data
● It hosts very large tables on top of clusters of commodity hardware.
● Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
Hbase Architecture
Hbase Installation
● Download
● http://archive.apache.org/dist/hbase/0.98.24/
● Extract
● sudo tar -zxvf hbase-0.98.24-hadoop2-bin.tar.gz
● Move
● sudo mv hbase-0.98.24-hadoop2 /usr/local/Hbase
● cd /usr/local/Hbase/
● Edit conf/hbase-site.xml and, inside the <configuration> tags, add the following property. hbase.rootdir is the path where you want HBase to store its files:
● <property>
● <name>hbase.rootdir</name>
● <value>file:/usr/local/hadoop/HBase/HFiles</value>
● </property>
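● With hbase-site.xml in place, HBase can be started using the script shipped in its bin directory (a sketch assuming the paths used above):
● cd /usr/local/Hbase
● ./bin/start-hbase.sh
● jps (the HMaster process should now appear in the list of running JVMs)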
Hbase Shell
● hbase shell
● Create a table and insert data
● create 'apple', 'price', 'volume'
● put 'apple', '17-April-19', 'price:open', '125'
● put 'apple', '17-April-19', 'price:high', '126'
● put 'apple', '17-April-19', 'price:low', '124'
● put 'apple', '17-April-19', 'price:close', '125.5'
● put 'apple', '17-April-19', 'volume', '1000'
Inspect Database
● scan 'apple'
● ROW COLUMN+CELL
● 17-April-19 column=price:close, timestamp=1555508855040, value=125.5
● 17-April-19 column=price:high, timestamp=1555508840180, value=126
● 17-April-19 column=price:low, timestamp=1555508846589, value=124
● 17-April-19 column=price:open, timestamp=1555508823773, value=125
● 17-April-19 column=volume:, timestamp=1555508892705, value=1000
Get a row
● get 'apple', '17-April-19'
● COLUMN CELL
● price:close timestamp=1555508855040, value=125.5
● price:high timestamp=1555508840180, value=126
● price:low timestamp=1555508846589, value=124
● price:open timestamp=1555508823773, value=125
● volume: timestamp=1555508892705, value=1000
NodeManager
• The NodeManager is the slave process of YARN.
• It runs on every data node in a cluster.
• Its job is to create, monitor, and kill containers.
• It services requests from the ResourceManager and ApplicationMaster to create
containers, and it reports on the status of the containers to the ResourceManager. The
ResourceManager uses the data contained in these status messages to make
scheduling decisions for new container requests.
• On start-up, the NodeManager registers with the ResourceManager; it then sends
heartbeats with its status and waits for instructions. Its primary goal is to manage
application containers assigned to it by the ResourceManager.
YARN Applications
• The YARN framework/platform exists to manage applications, so let’s
take a look at what components a YARN application is composed of.
• A YARN application implements a specific function that runs on
Hadoop. A YARN application involves 3 components:
– Client
– ApplicationMaster(AM)
– Container
YARN Applications - YARN Client
• Launching a new YARN application starts with a YARN client
communicating with the ResourceManager to create a new YARN
ApplicationMaster instance.
• Part of this process involves the YARN client informing the
ResourceManager of the ApplicationMaster’s physical resource
requirements.
YARN ApplicationMaster
• The ApplicationMaster is the master process of a YARN application.
• It doesn’t perform any application-specific work, as these functions
are delegated to the containers. Instead, it’s responsible for managing
the application-specific containers.
• Once the ApplicationMaster is started (as a container), it will
periodically send heartbeats to the ResourceManager to affirm its
health and to update the record of its resource demands.
YARN Applications - YARN Container
• A container is an application-specific process that’s created by a
NodeManager on behalf of an ApplicationMaster.
• At the fundamental level, a container is a collection of physical
resources such as RAM, CPU cores, and disks on a single node.
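• These pieces can be inspected at run time with the yarn command-line tool; a sketch (exact output varies by Hadoop version):
$ yarn node -list (lists the NodeManagers registered with the ResourceManager)
$ yarn application -list -appStates RUNNING (lists running YARN applications)
$ yarn logs -applicationId <application id> (prints the aggregated container logs of a finished application)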
hdfs fsck /
• The hdfs fsck command checks the health of the HDFS file system; running it against / checks the entire namespace.
• Other options provide more detail, include snapshots and open files, and allow management of corrupted files.
-move moves corrupted files to /lost+found.
-delete deletes corrupted files.
-files prints out files being checked.
-openforwrite prints out files opened for write during the check.
-list-corruptfileblocks prints out a list of missing blocks and the files they belong to.
-blocks prints out a block report.
-locations prints out locations for every block.
-racks prints out network topology for data-node locations.
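• For example, a more detailed check of a single directory might look like this (the path is hypothetical):
hdfs fsck /user/hdfs/data -files -blocks -locations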
Basic HDFS Administration
Balancing HDFS
• Based on usage patterns and DataNode availability, the
number of data blocks across the DataNodes may become
unbalanced.
• To avoid over-utilized DataNodes, the HDFS balancer tool
rebalances data blocks across the available DataNodes.
• Data blocks are moved from over-utilized to under-utilized nodes until utilization is within a certain percentage threshold.
• Rebalancing can be done when new DataNodes are added
or when a DataNode is removed from service.
• This step does not create more space in HDFS, but rather
improves efficiency.
hdfs balancer
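• The balancer accepts a utilization threshold in percent; a sketch, assuming we want every DataNode within 5% of the cluster-average utilization:
hdfs balancer -threshold 5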
HDFS Safe Mode
• When the NameNode starts, it loads the file system state from the
fsimage and then applies the edits log file.
• It then waits for DataNodes to report their blocks. During this time,
the NameNode stays in a read-only Safe Mode. The NameNode leaves
Safe Mode automatically after the DataNodes have reported that most
file system blocks are available.
• The administrator can place HDFS in Safe Mode by giving the following
command:
hdfs dfsadmin -safemode enter
• Entering the following command turns off Safe Mode:
hdfs dfsadmin -safemode leave