
Unit II

HDFS (Hadoop Distributed File System)

History of Hadoop, Apache Hadoop, Analysing Data with Unix tools, Analysing
Data with Hadoop, The Design of HDFS, HDFS Concepts, Command Line
Interface, Hadoop file system interfaces, Data flow, Data Ingest with Flume and
Sqoop and Hadoop archives, Hadoop I/O: Compression, Serialization, Avro and
File-Based Data structures.
History of Hadoop

Hadoop is an open-source framework, managed by the Apache Software Foundation and developed in Java, for storing and analysing massive amounts of data on clusters of commodity hardware. Big data poses primarily two issues: the first is storing such a massive quantity of data, and the second is processing it. Hadoop serves as a solution to both, the storage and the processing of large amounts of data, along with additional capabilities. Hadoop is composed chiefly of the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
Hadoop's Historical Background

Hadoop originated in 2002 with Doug Cutting and Mike Cafarella as part of their work on the Apache Nutch project. The Apache Nutch project aimed to build a search engine system capable of indexing one billion pages. After extensive study, they estimated that such a system would require roughly half a million dollars in hardware plus a monthly operating cost of approximately $30,000, which was prohibitively expensive. They also realised that their project design would not scale to billions of web pages, so they sought a practical solution that would reduce the implementation cost while still allowing them to store and process massive datasets. In 2003, they came across a paper describing the design of Google's distributed file system, GFS (Google File System), which Google had published to describe how it stored massive data collections. They realised that this design could solve the problem of storing the huge files created by their web crawling and indexing operations; however, it provided only a partial answer to their difficulty. In 2004, Google published another paper, on the MapReduce technology it used to process such massive datasets. For Doug Cutting and Mike Cafarella, this paper was the other half of the solution for their Nutch project. Both approaches (GFS and MapReduce) existed only as white papers at Google; Google had not released any implementation of them. Doug Cutting knew from his work on Apache Lucene (a free and open-source information retrieval software library that he first wrote in Java in 1999) that open source is an excellent way to share technology with a broader audience. As a result, he began working with Mike Cafarella on open-source implementations of Google's techniques (GFS and MapReduce) in the Apache Nutch project.
Cutting discovered in 2005 that Nutch was confined to clusters of between 20 and 40 nodes. He quickly saw two issues: (a) Nutch would not reach its full potential until it could run stably on much larger clusters, and (b) that seemed unachievable with only two people working on it (Doug Cutting and Mike Cafarella). The engineering work in the Nutch project was far more than he had anticipated. As a result, he began looking for a firm willing to invest in the effort, and he found that Yahoo! had a sizable engineering staff ready to work on the project. Doug Cutting therefore joined Yahoo! in 2006, taking the Nutch project with him. With Yahoo!'s backing, he wanted to give the world an open-source, dependable, and scalable computing framework. He first separated the distributed computing components of Nutch into a new project at Yahoo!, which he named Hadoop (after a yellow toy elephant belonging to Doug Cutting's son, because the name was easy to say and unique). He then set out to optimise Hadoop's performance on thousands of nodes, continuing to build on the ideas of GFS and MapReduce. Yahoo! began using Hadoop in 2007 after successfully testing it on a 1000-node cluster. In January 2008, Yahoo! released Hadoop to the Apache Software Foundation as an open-source project. In July 2008, the Apache Software Foundation successfully tested Hadoop on a 4000-node cluster. In 2009, Hadoop was successfully used to sort a petabyte (PB) of data in less than 17 hours, processing billions of queries and indexing millions of web pages. Subsequently:

 Doug Cutting left Yahoo! to join Cloudera, taking on the task of bringing Hadoop to new sectors.
 Apache Hadoop version 1.0 was published by the Apache Software Foundation in December 2011.
 Version 2.0.6 was released in August 2013.
 As of December 2017, Apache Hadoop version 3.0 is available.

Apache Hadoop
Apache Hadoop is a free and open-source platform for storing and processing massive datasets ranging in size from gigabytes to petabytes. Rather than storing and processing the data on a single colossal computer, Hadoop clusters many computers together so that enormous datasets can be analysed in parallel.

Four significant modules of Hadoop:

 HDFS — A distributed file system that runs on commodity or low-end hardware. HDFS offers better data throughput than conventional file systems, along with fault tolerance and native support for massive datasets.
 YARN — Manages and monitors cluster nodes and resource utilisation, and automates the scheduling of jobs and tasks.
 MapReduce — A framework that enables programs to perform parallel computation on data. The map task turns the input data into a dataset of key-value pairs; reduce tasks consume the output of the map tasks and aggregate it to produce the desired result.
 Hadoop Common — Provides a set of shared Java libraries used by all modules.

How Hadoop Operates

Hadoop makes it simple to use all of the storage capacity available across the cluster's computers and to execute distributed algorithms against massive volumes of data. Hadoop also provides the foundation on which additional services and applications can be built.

Applications that gather data in various forms may upload it to the Hadoop cluster by connecting to the NameNode through an API. The NameNode maintains the directory structure of each file and the locations of the "chunks" (blocks) of each file, which are replicated among the DataNodes. To run a job that queries the data, you supply a MapReduce job consisting of several map and reduce tasks that execute against the data stored in HDFS across the DataNodes. Each node executes map tasks against its assigned input files, while reducers run to aggregate and arrange the final output; a minimal sketch of such a job follows below.
Due to Hadoop's flexibility, the ecosystem has evolved
tremendously over the years. Today, the Hadoop ecosystem
comprises a variety of tools and applications that aid in the
collection, storage, processing, analysis, and management of
large amounts of data. Several of the most prominent uses
include the following:

Spark — An open-source distributed processing engine often used to handle large amounts of data. Apache Spark provides general batch processing, streaming analytics, machine learning, graph processing, and ad hoc queries through in-memory caching and optimised execution.

Presto — A distributed SQL query engine geared for low-latency, ad hoc data analysis. It adheres to the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from various sources.
What is Hadoop?
Apache Hadoop is an open source software framework used to develop data
processing applications which are executed in a distributed computing environment.

Applications built using HADOOP are run on large data sets distributed across
clusters of commodity computers. Commodity computers are cheap and widely
available. These are mainly useful for achieving greater computational power at low
cost.

Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system, called the Hadoop Distributed File System. The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data. This computational logic is nothing but a compiled version of a program written in a high-level language such as Java; such a program processes data stored in Hadoop HDFS.

Hadoop EcoSystem and Components


Below diagram shows various components in the Hadoop ecosystem-

Apache Hadoop consists of two sub-projects –

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous amounts of data in parallel on large clusters of computation nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage
part of Hadoop applications. MapReduce applications consume data from
HDFS. HDFS creates multiple replicas of data blocks and distributes them on
compute nodes in a cluster. This distribution enables reliable and extremely
rapid computations.

Although Hadoop is best known for MapReduce and its distributed file system (HDFS), the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.

Hadoop Architecture

High Level Hadoop Architecture


Hadoop has a Master-Slave Architecture for data storage and distributed data
processing using MapReduce and HDFS methods.

NameNode:

The NameNode represents every file and directory used in the namespace.

DataNode:

A DataNode manages the state of an HDFS storage node and allows you to interact with its blocks.

MasterNode:

The master node allows you to conduct parallel processing of the data using Hadoop MapReduce.

Slave node:

The slave nodes are the additional machines in the Hadoop cluster that store data and carry out complex calculations. Each slave node runs a TaskTracker and a DataNode, which synchronise their work with the JobTracker and the NameNode respectively.

In Hadoop, the master and slave systems can be set up in the cloud or on premises.

Features Of ‘Hadoop’
• Suitable for Big Data Analysis

As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are best suited for analysis of Big Data. Since it is processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept, called data locality, helps increase the efficiency of Hadoop-based applications.

• Scalability

HADOOP clusters can easily be scaled to any extent by adding additional cluster nodes, and thus allow for the growth of Big Data. Scaling does not require modifications to application logic.

• Fault Tolerance

The HADOOP ecosystem has a provision to replicate the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed by using the data stored on another cluster node.

Network Topology In Hadoop


The topology (arrangement) of the network affects the performance of the Hadoop cluster as the cluster grows in size. In addition to performance, one also needs to care about high availability and the handling of failures. To achieve this, Hadoop cluster formation makes use of the network topology.

Typically, network bandwidth is an important factor to consider while forming any network. However, as measuring bandwidth can be difficult, in Hadoop a network is represented as a tree, and the distance between nodes of this tree (the number of hops) is considered an important factor in the formation of the Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor.

A Hadoop cluster consists of data centers, racks, and the nodes that actually execute jobs: a data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies depending on the location of the processes. That is, the available bandwidth becomes smaller as we move away from:

 Processes on the same node
 Different nodes on the same rack
 Nodes on different racks of the same data center
 Nodes in different data centers

A worked example of these distances is given below.
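For example, writing a node's location as /data-center/rack/node (the names d1, r1, n1 and so on are purely illustrative), the rule above gives: distance(/d1/r1/n1, /d1/r1/n1) = 0 for processes on the same node; distance(/d1/r1/n1, /d1/r1/n2) = 2 for different nodes on the same rack; distance(/d1/r1/n1, /d1/r2/n3) = 4 for nodes on different racks of the same data center; and distance(/d1/r1/n1, /d2/r3/n4) = 6 for nodes in different data centers.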

What is HDFS?
HDFS is a distributed file system for storing very large data files, running on clusters of commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. Hadoop comes bundled with HDFS (the Hadoop Distributed File System).

When data exceeds the capacity of storage on a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage-specific operations across a network of machines is called a distributed file system. HDFS is one such system.

HDFS Architecture
An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.

 NameNode: The NameNode can be considered the master of the system. It maintains the file system tree and the metadata for all the files and directories present in the system. Two files, the 'namespace image' and the 'edit log', are used to store the metadata information. The NameNode knows which DataNodes hold the data blocks for a given file; however, it does not store block locations persistently. This information is reconstructed from the DataNodes every time the system starts.
 DataNode: DataNodes are slaves that reside on each machine in a cluster and provide the actual storage. They are responsible for serving read and write requests from the clients.

Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into block-sized chunks, which are stored as independent units. The default block size is 64 MB (128 MB in Hadoop 2 and later).

HDFS operates on a concept of data replication, wherein multiple replicas of data blocks are created and distributed on nodes throughout the cluster to enable high availability of the data in the event of node failure.
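As a hedged illustration (the values shown are examples only, and dfs.blocksize is the property name used in Hadoop 2 and later), the replication factor and block size can be overridden per cluster in hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.blocksize</name>
  <!-- 128 MB, expressed in bytes -->
  <value>134217728</value>
</property>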

Read Operation In HDFS


A data read request is served by HDFS, the NameNode, and the DataNodes. Let us call the reader a 'client'. The diagram below depicts the file read operation in Hadoop.

1. A client initiates the read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Note that these addresses are those of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.

Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. In step 4 shown in the above diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.

5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly. This read() process continues until it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client has finished reading, it calls the close() method.
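As a hedged illustration of this flow (not code from the document), the same open()/read()/close() sequence can be driven through the FileSystem API; the HDFS URI shown in the comment is only an example:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost:54310/temp.txt
        Configuration conf = new Configuration();
        // For an hdfs:// URI this returns a DistributedFileSystem instance
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri)); // open() returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false); // repeated read() until end of file
        } finally {
            IOUtils.closeStream(in); // close()
        }
    }
}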
Write Operation In HDFS
In this section, we will understand how data is written into HDFS through files.

1. A client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file – step 1 in the above diagram.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates the creation of a new file. However, this file-create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file (which is being created) does not already exist and that the client has the correct permissions to create a new file. If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
3. Once a new record in NameNode is created, an object of type
FSDataOutputStream is returned to the client. A client uses it to write data into
the HDFS. Data write method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains DFSOutputStream object which looks after
communication with DataNodes and NameNode. While the client continues
writing data, DFSOutputStream continues creating packets with this data.
These packets are enqueued into a queue called the DataQueue.
5. There is one more component called DataStreamer which consumes
this DataQueue. DataStreamer also asks NameNode for allocation of new
blocks thereby picking desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes.
In our case, we have chosen a replication level of 3 and hence there are 3
DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in a pipeline stores packet received by it and forwards the
same to the second DataNode in a pipeline.
9. Another queue, ‘Ack Queue’ is maintained by DFSOutputStream to store
packets which are waiting for acknowledgment from DataNodes.
10. Once acknowledgment for a packet in the queue is received from all
DataNodes in the pipeline, it is removed from the ‘Ack Queue’. In the event of
any DataNode failure, packets from this queue are used to reinitiate the
operation.
11. After the client has finished writing data, it calls the close() method (step 9 in the diagram). The call to close() flushes the remaining data packets to the pipeline and then waits for acknowledgments.
12. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
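A minimal client-side sketch of this write path (illustrative only; the local source file and the HDFS destination are passed as command-line arguments, and the HDFS URI in the comment is an assumption):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemWrite {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0]; // local file to copy
        String dst = args[1];      // e.g. hdfs://localhost:54310/user/hduser/out.txt
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // create() asks the NameNode to add the file entry (no blocks yet) and
        // returns an FSDataOutputStream backed by DFSOutputStream
        FSDataOutputStream out = fs.create(new Path(dst));
        // Data is packetized and pipelined to the DataNodes behind the scenes;
        // the 'true' flag closes both streams when copying finishes (the close() step)
        IOUtils.copyBytes(in, out, 4096, true);
    }
}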

Access HDFS using JAVA API


In this section, we try to understand the Java interface used for accessing Hadoop's file system.

To interact with Hadoop's filesystem programmatically, Hadoop provides multiple Java classes. The package named org.apache.hadoop.fs contains classes useful for manipulating files in Hadoop's filesystem. These operations include open, read, write, and close. In fact, the file API for Hadoop is generic and can be extended to interact with filesystems other than HDFS.

Reading a file from HDFS, programmatically

A java.net.URL object is used for reading the contents of a file. To begin with, we need to make Java recognise Hadoop's hdfs URL scheme. This is done by calling the setURLStreamHandlerFactory method of the URL class and passing it an instance of FsUrlStreamHandlerFactory. This method needs to be executed only once per JVM, hence it is enclosed in a static block.

An example code is-

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

    static {
        // Register Hadoop's handler so java.net.URL understands hdfs:// URLs.
        // This may be done only once per JVM, hence the static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            // args[0] is the HDFS path of the file to read
            in = new URL(args[0]).openStream();
            // Copy the stream to standard output using a 4 KB buffer;
            // 'false' means the streams are not closed automatically.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
This code opens and reads the contents of a file. The path of this file on HDFS is passed to the program as a command-line argument.
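Assuming the class has been compiled and placed on the Hadoop classpath, it can be run with a command along the following lines (the NameNode host and port are examples and depend on your installation):

export HADOOP_CLASSPATH=.
hadoop URLCat hdfs://localhost:54310/temp.txt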

Access HDFS Using COMMAND-LINE INTERFACE


This is one of the simplest ways to interact with HDFS. The command-line interface has support for filesystem operations such as reading files, creating directories, moving files, deleting data, and listing directories.

We can run '$HADOOP_HOME/bin/hdfs dfs -help' to get detailed help on every command. Here, 'dfs' is a shell command of HDFS which supports multiple subcommands.

Some of the widely used commands are listed below along with some details of each
one.

1. Copy a file from the local filesystem to HDFS

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /

This command copies file temp.txt from the local filesystem to HDFS.

2. We can list files present in a directory using -ls

$HADOOP_HOME/bin/hdfs dfs -ls /

We can see a file ‘temp.txt’ (copied earlier) being listed under ‘ / ‘ directory.

3. Command to copy a file to the local filesystem from HDFS

$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt


We can see temp.txt copied to a local filesystem.

4. Command to create a new directory

$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory

What is SQOOP in Hadoop?


Apache SQOOP (SQL-to-Hadoop) is a tool designed to support bulk export and
import of data into HDFS from structured data stores such as relational databases,
enterprise data warehouses, and NoSQL systems. It is a data migration tool based
upon a connector architecture which supports plugins to provide connectivity to new
external systems.

An example use case of Hadoop Sqoop is an enterprise that runs a nightly Sqoop import to load the day's data from a production transactional RDBMS into a Hive data warehouse for further analysis; an illustrative import command is sketched below.
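The following sketch shows what such a nightly import might look like; the JDBC URL, database, credentials, and table name are hypothetical and must be replaced with values from your own environment:

sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username etl_user -P \
  --table orders \
  --hive-import \
  --num-mappers 4

Here --hive-import loads the imported data directly into a Hive table, -P prompts for the database password, and --num-mappers controls how many parallel map tasks Sqoop uses for the transfer.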

Sqoop Architecture
All existing Database Management Systems are designed with the SQL standard in mind. However, each DBMS differs to some extent with respect to its dialect, and this difference poses challenges when it comes to data transfer across systems. Sqoop Connectors are components which help overcome these challenges.

Data transfer between Sqoop/Hadoop and an external storage system is made possible with the help of Sqoop's connectors.

Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java's JDBC protocol. In addition, Sqoop provides optimised MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.
In addition to this, Sqoop has various third-party connectors for data stores, ranging from enterprise data warehouses (including Netezza, Teradata, and Oracle) to NoSQL stores (such as Couchbase). However, these connectors do not come with the Sqoop bundle; they need to be downloaded separately and can be added easily to an existing Sqoop installation.

Why do we need Sqoop?


Analytical processing using Hadoop requires loading huge amounts of data from diverse sources into Hadoop clusters. This process of bulk data loading into Hadoop from heterogeneous sources, and then processing it, comes with a certain set of challenges. Maintaining data consistency and ensuring efficient utilisation of resources are some factors to consider before selecting the right approach for data load.

Major Issues:
1. Data load using Scripts

The traditional approach of using scripts to load data is not suitable for bulk data load
into Hadoop; this approach is inefficient and very time-consuming.

2. Direct access to external data via Map-Reduce application

Providing map-reduce applications with direct access to data residing in external systems (without loading it into Hadoop) complicates these applications, so this approach is not feasible.

3. In addition to being able to work with enormous volumes of data, Hadoop can work with data in several different forms. To load such heterogeneous data into Hadoop, different tools have been developed. Sqoop and Flume are two such data-loading tools.

Next, we will learn about the difference between Sqoop, Flume, and HDFS.

Sqoop vs Flume vs HDFS in Hadoop


Basic function:
Sqoop is used for importing data from structured data sources such as RDBMS. Flume is used for moving bulk streaming data into HDFS. HDFS is a distributed file system used by the Hadoop ecosystem to store data.

Architecture:
Sqoop has a connector-based architecture; connectors know how to connect to the respective data source and fetch the data. Flume has an agent-based architecture; a piece of code called an 'agent' takes care of fetching the data. HDFS has a distributed architecture in which data is distributed across multiple data nodes.

Role of HDFS:
For Sqoop, HDFS is the destination for data import. In Flume, data flows to HDFS through zero or more channels. HDFS itself is the ultimate destination for data storage.

Event handling:
Sqoop data loads are not event-driven. Flume data loads can be driven by an event. HDFS simply stores whatever data is provided to it, by whatever means.

Typical use:
To import data from structured data sources, one uses Sqoop commands, because its connectors know how to interact with structured data sources and fetch data from them. To load streaming data, such as tweets generated on Twitter or the log files of a web server, Flume should be used, since Flume agents are built for fetching streaming data. HDFS has its own built-in shell commands to store data into it, but it cannot itself import streaming data.

Flume Architecture
A Flume agent is a JVM process with three components – Flume Source, Flume Channel, and Flume Sink – through which events propagate after being initiated at an external source.

1. In the above diagram, the events generated by the external source (a web server) are consumed by the Flume Source. The external source sends events to the Flume source in a format that is recognised by the target source.
2. The Flume Source receives an event and stores it into one or more channels. The channel acts as a store which keeps the event until it is consumed by the Flume sink. This channel may use a local file system in order to store these events.
3. The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. There can be multiple Flume agents, in which case the Flume sink forwards the event to the Flume source of the next Flume agent in the flow.

Some Important features of FLUME

 Flume has a flexible design based upon streaming data flows. It is fault tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including 'best-effort delivery' and 'end-to-end delivery'. Best-effort delivery does not tolerate any Flume node failure, whereas the 'end-to-end delivery' mode guarantees delivery even in the event of multiple node failures.
 Flume carries data between sources and sinks. This gathering of data can either be scheduled or event-driven. Flume has its own query processing engine, which makes it easy to transform each new batch of data before it is moved to the intended sink.
 Possible Flume sinks include HDFS and HBase. Flume can also be used to transport event data, including but not limited to network traffic data, data generated by social media websites, and email messages.

Flume, library and source code setup


Before we start with the actual process, ensure you have Hadoop installed. Change the user to 'hduser' (or whichever user id was used during your Hadoop configuration).

Step 1) Create a new directory with the name 'FlumeTutorial'

sudo mkdir FlumeTutorial

1. Give read, write, and execute permissions:

sudo chmod -R 777 FlumeTutorial

2. Copy the files MyTwitterSource.java and MyTwitterSourceForFlume.java into this directory.


Check the file permissions of all these files; if 'read' permission is missing, grant it.

Step 2) Download 'Apache Flume' from https://flume.apache.org/download.html

Apache Flume 1.4.0 has been used in this Flume tutorial.
Step 3) Copy the downloaded tarball to the directory of your choice and extract its contents using the following command

sudo tar -xvf apache-flume-1.4.0-bin.tar.gz

This command will create a new directory named apache-flume-1.4.0-bin and extract files into it. This directory will be referred to as <Installation Directory of Flume> in the rest of this article.

Step 4) Flume library setup

Copy twitter4j-core-4.0.1.jar, flume-ng-configuration-1.4.0.jar, flume-ng-core-1.4.0.jar, and flume-ng-sdk-1.4.0.jar to

<Installation Directory of Flume>/lib/

It is possible that some or all of the copied JARs will have execute permission set. This may cause an issue with the compilation of code, so revoke execute permission on such JARs.

In my case, twitter4j-core-4.0.1.jar had execute permission, which I revoked as below:

sudo chmod -x twitter4j-core-4.0.1.jar

After this, the following command gives 'read' permission on twitter4j-core-4.0.1.jar to all:

sudo chmod +rrr /usr/local/apache-flume-1.4.0-bin/lib/twitter4j-core-4.0.1.jar


Please note that I have downloaded:

– twitter4j-core-4.0.1.jar from https://mvnrepository.com/artifact/org.twitter4j/twitter4j-core

– all Flume JARs, i.e. flume-ng-*-1.4.0.jar, from http://mvnrepository.com/artifact/org.apache.flume


Load data from Twitter using Flume


Step 1) Go to the directory containing the source code files.

Step 2) Set CLASSPATH to contain <Flume Installation Dir>/lib/* and ~/FlumeTutorial/flume/mytwittersource/*

export CLASSPATH="/usr/local/apache-flume-1.4.0-bin/lib/*:~/FlumeTutorial/flume/mytwittersource/*"

Step 3) Compile source code using the command-

javac -d . MyTwitterSourceForFlume.java MyTwitterSource.java

Step 4) Create a JAR

First, create a Manifest.txt file using a text editor of your choice and add the line below to it:

Main-Class: flume.mytwittersource.MyTwitterSourceForFlume

Here, flume.mytwittersource.MyTwitterSourceForFlume is the name of the main class. Please note that you have to press the Enter key at the end of this line.

Now, create JAR ‘MyTwitterSourceForFlume.jar’ as-

jar cfm MyTwitterSourceForFlume.jar Manifest.txt flume/mytwittersource/*.class

Step 5) Copy this jar to <Flume Installation Directory>/lib/

sudo cp MyTwitterSourceForFlume.jar <Flume Installation Directory>/lib/

Step 6) Go to the configuration directory of Flume, <Flume Installation Directory>/conf

If flume.conf does not exist, then copy flume-conf.properties.template and rename it to flume.conf

sudo cp flume-conf.properties.template flume.conf

If flume-env.sh does not exist, then copy flume-env.sh.template and rename it to flume-env.sh
sudo cp flume-env.sh.template flume-env.sh

Creating a Twitter Application


Step 1) Create a Twitter application by signing in to https://developer.twitter.com/
Step 2) Go to 'My applications' (this option drops down when the 'Egg' button at the top right corner is clicked)

Step 3) Create a new application by clicking ‘Create New App’

Step 4) Fill up the application details by specifying the name of the application, a description, and a website. You may refer to the notes given underneath each input box.
Step 5) Scroll down the page, accept the terms by marking 'Yes, I agree', and click the button 'Create your Twitter application'

Step 6) On the window of the newly created application, go to the tab 'API Keys', scroll down the page and click the button 'Create my access token'
Step 7) Refresh the page.

Step 8) Click on ‘Test OAuth’. This will display ‘OAuth’ settings of the application.

Step 9) Modify 'flume.conf' using these OAuth settings. The steps to modify 'flume.conf' are given below.
We need to copy the Consumer key, Consumer secret, Access token, and Access token secret to update 'flume.conf'.

Note: These values belong to the user and hence are confidential, so should not
be shared.

Modify ‘flume.conf’ File


Step 1) Open ‘flume.conf’ in write mode and set values for below parameters-

sudo gedit flume.conf


Copy below contents-

MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS
MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey = <Copy consumer key value from Twitter App>
MyTwitAgent.sources.Twitter.consumerSecret = <Copy consumer secret value from Twitter App>
MyTwitAgent.sources.Twitter.accessToken = <Copy access token value from Twitter App>
MyTwitAgent.sources.Twitter.accessTokenSecret = <Copy access token secret value from Twitter App>
MyTwitAgent.sources.Twitter.keywords = guru99
MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 10000
MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 10000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000

Step 2) Also, set MyTwitAgent.sinks.HDFS.hdfs.path as below:

MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://<Host Name>:<Port Number>/<HDFS Home Directory>/flume/tweets/

To find <Host Name>, <Port Number>, and <HDFS Home Directory>, see the value of the parameter 'fs.defaultFS' set in $HADOOP_HOME/etc/hadoop/core-site.xml
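For illustration, if fs.defaultFS is set to hdfs://localhost:54310 (the value assumed in the configuration above), the corresponding core-site.xml entry looks like the following; the host and port will differ on your cluster:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:54310</value>
</property>

In that case, <Host Name> is localhost and <Port Number> is 54310.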

Step 3) In order to flush the data to HDFS as and when it arrives, delete the entry below if it exists:

MyTwitAgent.sinks.HDFS.hdfs.rollInterval = 600

Example: Streaming Twitter Data using Flume


Step 1) Open ‘flume-env.sh’ in write mode and set values for below parameters,

JAVA_HOME=<Installation directory of Java>
FLUME_CLASSPATH="<Flume Installation Directory>/lib/MyTwitterSourceForFlume.jar"

Step 2) Start Hadoop

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 3) Two of the JAR files in the Flume tarball are not compatible with Hadoop 2.2.0. So, we will need to follow the steps below to make Flume compatible with Hadoop 2.2.0.

a. Move protobuf-java-2.4.1.jar out of ‘<Flume Installation Directory>/lib’.

Go to ‘<Flume Installation Directory>/lib’

cd <Flume Installation Directory>/lib

sudo mv protobuf-java-2.4.1.jar ~/

b. Find the 'guava' JAR file as below:

find . -name "guava*"

Move guava-10.0.1.jar out of ‘<Flume Installation Directory>/lib’.

sudo mv guava-10.0.1.jar ~/
c. Download guava-17.0.jar from http://mvnrepository.com/artifact/com.google.guava/guava/17.0

Now, copy this downloaded jar file to ‘<Flume Installation Directory>/lib’

Step 4) Go to ‘<Flume Installation Directory>/bin’ and start Flume as-

./flume-ng agent -n MyTwitAgent -c conf -f <Flume Installation Directory>/conf/flume.conf

While Flume runs, the command prompt window shows messages indicating that tweets are being fetched. From these messages, we can see that the output is written to the /user/hduser/flume/tweets/ directory.

Now, open this directory using a web browser.

Step 5) To see the result of the data load, open http://localhost:50070/ in a browser and browse the file system, then go to the directory where the data has been loaded, that is:

<HDFS Home Directory>/flume/tweets/
