Unit II BDA
History of Hadoop, Apache Hadoop, Analysing Data with Unix tools, Analysing
Data with Hadoop, The Design of HDFS, HDFS Concepts, Command Line
Interface, Hadoop file system interfaces, Data flow, Data Ingest with Flume and
Sqoop and Hadoop archives, Hadoop I/O: Compression, Serialization, Avro and
File-Based Data structures.
History of Hadoop
Hadoop is an open-source framework, managed by the Apache Software Foundation and written in Java, for storing and analyzing massive amounts of data on clusters of commodity hardware. There are primarily two issues with big data: the first is storing such a massive quantity of data, and the second is processing it. Hadoop serves as a solution to both the storage and the processing of large amounts of data, along with additional capabilities. Hadoop is composed chiefly of the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
Hadoop's Historical Background
Hadoop originated in 2002 with Doug Cutting and Mike Cafarella as part of their work on the Apache Nutch project. The Apache Nutch project aimed to develop a search engine system capable of indexing one billion pages. After extensive study, they determined that such a system would cost roughly half a million dollars in hardware, with a monthly operating cost of approximately $30,000, which was rather costly. They also realised that their project design would not cope with billions of web pages. They therefore sought a practical solution that would minimise implementation cost while still storing and processing massive datasets. In 2003, they discovered a paper describing the design of Google's distributed file system, GFS (Google File System), which Google had published to describe how it stored massive data collections. They saw that this work could solve the storage of the huge files created by their web crawling and indexing operations. However, it provided only a partial answer to their difficulty. In 2004, Google published another paper, on the MapReduce technique used to process such massive datasets. For Doug Cutting and Mike Cafarella, this paper was the other half of the solution for their Nutch project. Both techniques (GFS and MapReduce) existed only as white papers at Google; Google had not released an implementation of either. Doug Cutting knew from his work on Apache Lucene (a free and open-source information-retrieval software library that he had first written in Java in 1999) that open source is an excellent way to share technology with a broader audience. So he began working with Mike Cafarella on open-source implementations of Google's techniques (GFS and MapReduce) in the Apache Nutch project.
In 2005, Cutting found that Nutch was confined to clusters of between 20 and 40 nodes. He quickly saw two issues: (a) Nutch would not reach its full potential until it could run reliably on larger clusters, and (b) that seemed unachievable with just two people (Doug Cutting and Mike Cafarella) working on it, since the engineering work in the Nutch project was far more than he had anticipated. As a result, he began looking for a firm willing to invest in the effort, and he found that Yahoo! had a sizable engineering team ready to work on the project. So Doug Cutting joined Yahoo! in 2006, bringing the Nutch project with him. With Yahoo!'s assistance, he wanted to give the world an open-source, dependable, and scalable computing framework. The first step was to separate the distributed-computing components of Nutch into a new project at Yahoo!, named Hadoop. (He chose the name Hadoop because it was the name of his son's yellow toy elephant, it was simple to say, and it was a one-of-a-kind term.) He then wanted to optimise Hadoop's performance on hundreds of nodes, so he continued working on Hadoop based on GFS and MapReduce. Yahoo began using Hadoop in 2007 after successfully testing it on a 1000-node cluster. In January 2008, Yahoo released Hadoop to the Apache Software Foundation as an open-source project. In July 2008, the Apache Software Foundation successfully tested Hadoop on a 4000-node cluster. In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17 hours while processing billions of queries and indexing millions of web pages.
Moreover, Doug Cutting left Yahoo to join Cloudera, taking on the task of bringing Hadoop to new sectors. The Apache Software Foundation published Apache Hadoop version 1.0 in December 2011 and released version 2.0.6 in August 2013; as of December 2017, Apache Hadoop version 3.0 is available.
Apache Hadoop
Apache Hadoop is a free and open-source platform for storing and processing massive datasets ranging in size from gigabytes to petabytes. Rather than storing and processing data on a single colossal computer, Hadoop enables many computers to be clustered together to analyse enormous datasets in parallel.
Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and they are mainly useful for achieving greater computational power at low cost.
Although Hadoop is best known for MapReduce and its distributed file system (HDFS), the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.
Hadoop Architecture
NameNode:
The NameNode represents every file and directory used in the namespace.
DataNode:
A DataNode manages the state of an HDFS node and allows you to interact with its blocks.
MasterNode:
The master node allows you to conduct parallel processing of the data using Hadoop MapReduce.
Slave node:
The slave nodes are the additional machines in the Hadoop cluster that store data and conduct complex calculations. Every slave node comes with a Task Tracker and a DataNode; these synchronize their processes with the Job Tracker and the NameNode respectively.
Features Of ‘Hadoop’
• Suitable for Big Data Analysis
As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are best suited for analysis of Big Data. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps increase the efficiency of Hadoop-based applications.
• Scalability
HADOOP clusters can easily be scaled to any extent by adding additional cluster
nodes and thus allows for the growth of Big Data. Also, scaling does not require
modifications to application logic.
• Fault Tolerance
The HADOOP ecosystem has a provision to replicate the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed by using the data stored on another cluster node.
A Hadoop cluster consists of data centers, racks, and the nodes that actually execute jobs: a data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies depending upon their location. That is, the available bandwidth becomes lesser as we move from processes on the same node, to different nodes on the same rack, to nodes on different racks of the same data center, and finally to nodes in different data centers.
What is HDFS?
HDFS is a distributed file system for storing very large data files, running on clusters
of commodity hardware. It is fault tolerant, scalable, and extremely simple to expand.
Hadoop comes bundled with HDFS (Hadoop Distributed File Systems).
When data exceeds the capacity of storage on a single physical machine, it becomes
essential to divide it across a number of separate machines. A file system that
manages storage specific operations across a network of machines is called a
distributed file system. HDFS is one such software.
HDFS Architecture
An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into block-sized chunks, which are stored as independent units. The default block size is 64 MB (in Hadoop 1.x; later versions default to 128 MB).
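As a hedged illustration (the file path is an assumption), the way a file has been split into blocks, and where the replicas of each block are stored, can be inspected with the fsck utility:

# Show the files, blocks, and block locations for a given HDFS path
$HADOOP_HOME/bin/hdfs fsck /user/hduser/temp.txt -files -blocks -locations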
Read Operation in HDFS
• Data is read in the form of streams, wherein the client invokes the read() method repeatedly. This read() operation continues till it reaches the end of the block.
• Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
• Once the client has finished reading, it calls the close() method.
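A minimal sketch of this read flow using the Hadoop FileSystem Java API; the NameNode address and the file path are assumptions for illustration:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        InputStream in = null;
        try {
            // open() returns a stream that reads the file's blocks from the
            // DataNodes one after another (a DFSInputStream underneath).
            in = fs.open(new Path("/user/hduser/temp.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Release the connection once reading is finished.
            IOUtils.closeStream(in);
        }
    }
}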
Write Operation In HDFS
In this section, we will understand how data is written into HDFS through files.
Accessing HDFS Using the Java API
The object java.net.URL can be used for reading the contents of a file. To begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This is done by calling the static setURLStreamHandlerFactory method on the URL class and passing it an instance of FsUrlStreamHandlerFactory. This method can be executed only once per JVM, hence it is enclosed in a static block.
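A minimal sketch of this approach (the class name UrlCat and the example HDFS URL are illustrative assumptions):

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class UrlCat {
    static {
        // Register Hadoop's handler for the hdfs:// URL scheme.
        // setURLStreamHandlerFactory may be called only once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            // args[0] is an HDFS URL, e.g. hdfs://localhost:9000/user/hduser/temp.txt
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}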
Command Line Interface
Some of the widely used HDFS shell commands are listed below, along with some details of each one. The copy command copies the file temp.txt from the local filesystem to HDFS; the list command then shows the file 'temp.txt' (copied earlier) under the '/' directory.
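A hedged sketch of these two commands (the $HADOOP_HOME prefix and the target paths are assumptions about the local setup):

# Copy temp.txt from the local filesystem into the HDFS root directory
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /
# List the contents of the HDFS root directory
$HADOOP_HOME/bin/hdfs dfs -ls /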
Data Ingest with Sqoop
Apache Sqoop is a tool for transferring bulk data between Hadoop and structured data stores such as relational databases. An example use case of Sqoop is an enterprise that runs a nightly Sqoop import to load the day's data from a production transactional RDBMS into a Hive data warehouse for further analysis.
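A hedged sketch of such a nightly import (the connection string, credentials, and table and database names are placeholders, not values from this tutorial):

# Import the 'orders' table from a production MySQL database into a Hive table
sqoop import \
  --connect jdbc:mysql://prod-db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --hive-import --hive-table warehouse.orders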
Sqoop Architecture
All the existing Database Management Systems are designed with SQL standard in
mind. However, each DBMS differs with respect to dialect to some extent. So, this
difference poses challenges when it comes to data transfers across the systems. Sqoop
Connectors are components which help overcome these challenges.
Data transfer between Sqoop Hadoop and external storage system is made possible
with the help of Sqoop’s connectors.
Sqoop has connectors for working with a range of popular relational databases,
including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these
connectors knows how to interact with its associated DBMS. There is also a generic
JDBC connector for connecting to any database that supports Java’s JDBC protocol.
In addition, Sqoop Big data provides optimized MySQL and PostgreSQL connectors
that use database-specific APIs to perform bulk transfers efficiently.
In addition to this, Sqoop in big data has various third-party connectors for data
stores, ranging from enterprise data warehouses (including Netezza, Teradata, and
Oracle) to NoSQL stores (such as Couchbase). However, these connectors do not come with the Sqoop bundle; they need to be downloaded separately and can easily be added to an existing Sqoop installation.
Major Issues:
1. Data load using scripts
The traditional approach of using scripts to load data is not suitable for bulk data loads into Hadoop; this approach is inefficient and very time-consuming.
2. Direct access to external data from MapReduce applications
Providing MapReduce applications direct access to the data residing in external systems (without loading it into Hadoop) complicates those applications. So, this approach is not feasible.
In addition to having the ability to work with enormous data, Hadoop can work with data in several different forms. So, to load such heterogeneous data into Hadoop, different tools have been developed. Sqoop and Flume are two such data-loading tools.
Next, we will look at Flume, the other data-ingestion tool, and its architecture.
Flume Architecture
A Flume agent is a JVM process with three components, a Flume Source, a Flume Channel, and a Flume Sink, through which events propagate after being initiated at an external source.
Flume Architecture
1. In the above diagram, the events generated by external source (WebServer) are
consumed by Flume Data Source. The external source sends events to Flume
source in a format that is recognized by the target source.
2. Flume Source receives an event and stores it into one or more channels. The
channel acts as a store which keeps the event until it is consumed by the flume
sink. This channel may use a local file system in order to store these events.
3. The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. There could be multiple Flume agents, in which case the Flume sink forwards the event to the Flume source of the next Flume agent in the flow.
Flume has a flexible design based upon streaming data flows. It is fault tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including 'best-effort delivery' and 'end-to-end delivery'. Best-effort delivery does not tolerate any Flume node failure, whereas 'end-to-end delivery' mode guarantees delivery even in the event of multiple node failures.
Flume carries data between sources and sinks. This gathering of data can
either be scheduled or event-driven. Flume has its own query processing
engine which makes it easy to transform each new batch of data before it is
moved to the intended sink.
Possible Flume sinks include HDFS and HBase. Flume Hadoop can also be
used to transport event data including but not limited to network traffic data,
data generated by social media websites and email messages.
Example: Streaming Twitter Data into HDFS Using Flume
Copy the files MyTwitterSource.java and MyTwitterSourceForFlume.java into this directory.
Check the file permissions of all these files; if 'read' permissions are missing, grant them, for example as shown below.
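A minimal sketch of granting read permission, assuming the two Java files above are in the current working directory:

sudo chmod +r MyTwitterSource.java MyTwitterSourceForFlume.java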
Step 3) Copy the downloaded tarball in the directory of your choice and extract
contents using the following command
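A hedged sketch of the extraction command (the exact tarball name is an assumption, inferred from the directory name mentioned below):

tar -xvf apache-flume-1.4.0-bin.tar.gz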
This command will create a new directory named apache-flume-1.4.0-bin and extract
files into it. This directory will be referred to as <Installation Directory of
Flume> in rest of the article.
It is possible that some or all of the copied JARs have execute permission. This may cause an issue with the compilation of code, so revoke execute permission on such JARs.
export CLASSPATH="/usr/local/apache-flume-1.4.0-bin/lib/*:~/FlumeTutorial/
flume/mytwittersource/*"
Main-Class: flume.mytwittersource.MyTwitterSourceForFlume
Here, flume.mytwittersource.MyTwitterSourceForFlume is the name of the main class. Please note that you have to hit the Enter key at the end of this line.
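A hedged sketch of compiling the sources and packaging them with this manifest into a JAR (the manifest file name MANIFEST.txt and the output JAR name are assumptions; compilation relies on the CLASSPATH exported earlier):

# Compile the custom Flume source classes (dependencies come from the exported CLASSPATH)
javac -d . MyTwitterSource.java MyTwitterSourceForFlume.java
# Package the compiled classes into a JAR, embedding the Main-Class entry from MANIFEST.txt
jar cfm MyTwitterSourceForFlume.jar MANIFEST.txt flume/mytwittersource/*.class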
Step 8) Click on ‘Test OAuth’. This will display ‘OAuth’ settings of the application.
Note: These values belong to the user and hence are confidential, so should not
be shared.
MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS
MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey = <Copy consumer key value from Twitter App>
MyTwitAgent.sources.Twitter.consumerSecret = <Copy consumer secret value from Twitter App>
MyTwitAgent.sources.Twitter.accessToken = <Copy access token value from Twitter App>
MyTwitAgent.sources.Twitter.accessTokenSecret = <Copy access token secret value from Twitter App>
MyTwitAgent.sources.Twitter.keywords = guru99
MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 10000
MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 10000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000
Step 3) In order to flush the data to HDFS as and when it arrives, delete the below entry if it exists:
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 3) Two of the JAR files from the Flume tarball are not compatible with Hadoop 2.2.0. So, we will need to follow the steps below in this Apache Flume example to make Flume compatible with Hadoop 2.2.0.
sudo mv protobuf-java-2.4.1.jar ~/
sudo mv guava-10.0.1.jar ~/
c. Download guava-17.0.jar from http://mvnrepository.com/artifact/com.google.guava/guava/17.0
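Once Hadoop is running and the JAR conflicts are resolved, the Flume agent defined above can be started. A hedged sketch of the launch command (the agent name matches the configuration above; $FLUME_HOME and the configuration file path are assumptions):

$FLUME_HOME/bin/flume-ng agent -n MyTwitAgent -c $FLUME_HOME/conf -f $FLUME_HOME/conf/flume.conf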