Flume Agent
Flume is used to move the log data generated by application servers into
HDFS at high throughput.
Advantages of Flume
Here are the advantages of using Flume −
• Using Apache Flume, we can store the data into centralized stores such as
HDFS and HBase.
• When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between data
producers and the centralized stores and provides a steady flow of data
between them.
• Flume provides the feature of contextual routing.
• The transactions in Flume are channel-based: two transactions (one on the
sender side and one on the receiver side) are maintained for each message.
This guarantees reliable message delivery.
• Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of Flume
Some of the notable features of Flume are as follows −
• Flume ingests log data from multiple web servers into a centralized store
(HDFS) efficiently.
• Using Flume, we can get the data from multiple servers into Hadoop in
near real time.
• Along with the log files, Flume is also used to import huge volumes of event
data produced by social networking sites like Facebook and Twitter, and e-
commerce websites like Amazon and Flipkart.
• Flume supports a large set of source and destination types.
• Flume supports multi-hop flows, fan-in, fan-out flows, contextual routing,
etc.
• Flume can be scaled horizontally.
Apache Flume - Data Transfer In Hadoop:
Streaming / Log Data
Generally, most of the data that is to be analyzed will be produced by
various data sources like application servers, social networking sites,
cloud servers, and enterprise servers. This data will be in the form of
log files and events.
Log file − In general, a log file is a file that lists events/actions that occur
in an operating system. For example, web servers list every request
made to the server in the log files.
On harvesting such log data, we can get information about application
performance and user behaviour.
The main challenge in handling the log data is in moving these logs
produced by multiple servers to the Hadoop environment. The simplest way
to copy a file into HDFS is Hadoop's put command:
$ hadoop fs -put <path-of-the-local-file> <destination-path-in-HDFS>
Problem with put Command:
We can use the put command of Hadoop to transfer data from these
sources to HDFS, but it suffers from the following drawbacks −
Using the put command, we can transfer only one file at a time, while the
data generators produce data at a much higher rate. Since the analysis
made on older data is less accurate, we need a solution that can transfer
data in real time.
Source:
A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume
events.
Apache Flume supports several types of sources and each source receives
events from a specified data generator.
Example −
• Avro source (provides data serialization and data exchange services),
• Thrift source (enables efficient and scalable communication and data
serialization between different programming languages), etc.
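As a rough sketch of how a source is declared, the snippet below configures a hypothetical agent named a1 with an Avro source; the agent name, channel name, and port are illustrative and not taken from this text.

# Declare the components of the hypothetical agent a1
a1.sources = r1
a1.channels = c1
# An Avro source that listens for serialized events on port 4141
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# Connect the source to the channel
a1.sources.r1.channels = c1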
Channel:
A channel is a transient store which receives the events from the source
and buffers them till they are consumed by sinks. It acts as a bridge
between the sources and the sinks.
These channels are fully transactional and they can work with any
number of sources and sinks.
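For illustration, a memory channel for the same hypothetical agent a1 could be declared as follows; the capacity values are arbitrary examples.

# A memory channel buffering up to 10000 events in RAM
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000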
Sink:
A sink stores the data into centralized stores like HBase and HDFS. It
consumes the data (events) from the channels and delivers it to the
destination. The destination of the sink might be another agent or the
central stores.
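A minimal sketch of an HDFS sink for the hypothetical agent a1 is shown below; the HDFS path and namenode address are placeholders.

# Declare the sink and point it at HDFS
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
# The sink drains events from channel c1
a1.sinks.k1.channel = c1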
Note − A Flume agent can have multiple sources, sinks, and channels.
Additional Components of Flume Agent:
Apart from the primitive components (sources, channels, and sinks), there
are a few more components that play a vital role in transferring the
events from the data generators to the centralized stores.
1. Interceptors
Interceptors are used to alter/inspect Flume events as they are
transferred between the source and the channel.
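As an illustrative sketch, Flume's built-in timestamp interceptor can be attached to the hypothetical source r1 of agent a1 as below; it adds the ingest time to each event's header.

# Attach an interceptor chain to the source
a1.sources.r1.interceptors = i1
# The timestamp interceptor adds a 'timestamp' header to every event
a1.sources.r1.interceptors.i1.type = timestamp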
2. Channel Selectors
These are used to determine which channel should be chosen to transfer
the data when a source is connected to multiple channels.
There are two types of channel selectors −
• Default channel selectors − These are also known as replicating
channel selectors; they replicate all the events in each configured channel.
• Multiplexing channel selectors − These decide which channel an event is
sent to based on the information in that event's header.
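A replicating selector is the default, but it can also be set explicitly; the sketch below assumes the hypothetical source r1 writes to two channels, c1 and c2.

# Copy every event into both channels
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating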
3. Sink Processors
These are used to invoke a particular sink from a selected group of
sinks, for example to create failover paths for your sinks or to
load-balance events across multiple sinks drawing from a channel.
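For example, a failover sink processor over two hypothetical sinks k1 and k2 can be sketched as follows; the priorities are arbitrary, and the sink with the higher priority is tried first.

# Group the two sinks and process them with a failover policy
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

Setting the processor type to load_balance instead distributes events across the sinks rather than preferring one.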
Apache Flume - Data Flow:
Flume agents running on the log servers receive the data from the data
generators and can forward it to intermediate agents that act as
collectors. Finally, the data from all these collectors is aggregated and
pushed to a centralized store such as HBase or HDFS.
Multi-hop Flow:
Within Flume, there can be multiple agents and before reaching the
final destination, an event may travel through more than one agent.
This is known as multi-hop flow.
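A minimal sketch of a two-agent hop, assuming hypothetical agents a1 and a2 running on hosts host1 and host2: a1 forwards events through an Avro sink, and a2 receives them with an Avro source.

# On host1: a1 sends its events to a2 over Avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = host2
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# On host2: a2 listens for the events forwarded by a1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1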
Fan-out Flow:
The dataflow from one source to multiple channels is known as fan-out
flow. It is of two types −
• Replicating − The data flow where the data will be replicated in all
the configured channels.
• Multiplexing − The data flow where the data will be sent to a
selected channel which is mentioned in the header of the event.
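A multiplexing fan-out can be sketched with a channel selector that routes on a header; here the header name "datacenter" and its values are purely illustrative.

# Route events to a channel based on the 'datacenter' header
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = datacenter
a1.sources.r1.selector.mapping.DC1 = c1
a1.sources.r1.selector.mapping.DC2 = c2
a1.sources.r1.selector.default = c1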
Fan-in Flow:
The data flow in which the data will be transferred from many sources
to one channel is known as fan-in flow.
Failure Handling
In Flume, two transactions take place for each event: one at the sender
and one at the receiver. The sender sends events to the receiver. Soon
after receiving the data, the receiver commits its own transaction and
sends a "received" signal to the sender. Only after receiving this signal
does the sender commit its own transaction. Because a sender never
commits until the receiver has acknowledged the data, no event is lost in
transit and each hop behaves atomically.