Flume Agent
Flume is used to move the log data generated by application servers into
HDFS at high throughput.
Advantages of Flume
Here are the advantages of using Flume −
• Using Apache Flume, we can store the data into centralized stores such as
HDFS and HBase.
• When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between data
producers and the centralized stores and provides a steady flow of data
between them.
• Flume provides the feature of contextual routing.
• The transactions in Flume are channel-based: two transactions (one on the
sender side and one on the receiver side) are maintained for each message.
This guarantees reliable message delivery.
• Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of Flume
Some of the notable features of Flume are as follows −
• Flume ingests log data from multiple web servers into a centralized store
(HDFS) efficiently.
• Using Flume, we can get the data from multiple servers into Hadoop in
near real time.
• Along with the log files, Flume is also used to import huge volumes of event
data produced by social networking sites like Facebook and Twitter, and e-
commerce websites like Amazon and Flipkart.
• Flume supports a large set of source and destination types.
• Flume supports multi-hop flows, fan-in, fan-out flows, contextual routing,
etc.
• Flume can be scaled horizontally.
Apache Flume - Data Transfer In Hadoop:
Streaming / Log Data
Generally, most of the data that is to be analyzed will be produced by
various data sources like application servers, social networking sites,
cloud servers, and enterprise servers. This data will be in the form of
log files and events.
Log file − In general, a log file is a file that lists events/actions that occur
in an operating system. For example, web servers list every request
made to the server in the log files.
On harvesting such log data, we can get information about application
performance and user behaviour.
The main challenge in handling the log data is in moving these logs
produced by multiple servers to the Hadoop environment. The simplest way
to copy a file into HDFS is Hadoop's put command:
$ hadoop fs -put <path-of-the-local-file> <destination-path-in-HDFS>
Problem with put Command:
We can use the put command of Hadoop to transfer data from these
sources to HDFS, but it suffers from the following drawbacks −
Using the put command, we can transfer only one file at a time, while the
data generators produce data at a much higher rate. Since the analysis
made on older data is less accurate, we need a solution that can transfer
data in real time.
Source:
A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume
events.
Apache Flume supports several types of sources and each source receives
events from a specified data generator.
Example −
• Avro source (provides data serialization and data exchange services),
• Thrift source (enables efficient and scalable communication and data
serialization between different programming languages), etc.
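As a rough sketch of how a source is declared, the snippet below configures a hypothetical agent named a1 with an Avro source; the agent name, channel name, and port are illustrative and not taken from this text.

# Declare the components of the hypothetical agent a1
a1.sources = r1
a1.channels = c1
# An Avro source that listens for serialized events on port 4141
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# Connect the source to the channel
a1.sources.r1.channels = c1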
Channel:
A channel is a transient store which receives the events from the source
and buffers them till they are consumed by sinks. It acts as a bridge
between the sources and the sinks.
These channels are fully transactional and they can work with any
number of sources and sinks.
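For illustration, a memory channel for the same hypothetical agent a1 could be declared as follows; the capacity values are arbitrary examples.

# A memory channel buffering up to 10000 events in RAM
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000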
Sink:
A sink stores the data into centralized stores like HBase and HDFS. It
consumes the data (events) from the channels and delivers it to the
destination. The destination of the sink might be another agent or the
central stores.
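A minimal sketch of an HDFS sink for the hypothetical agent a1 is shown below; the HDFS path and namenode address are placeholders.

# Declare the sink and point it at HDFS
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
# The sink drains events from channel c1
a1.sinks.k1.channel = c1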
Note − A Flume agent can have multiple sources, sinks, and channels.
Additional Components of Flume Agent:
Apart from the primitive components (sources, channels, and sinks), there
are a few more components that play a vital role in transferring the
events from the data generators to the centralized stores.
1. Interceptors
Interceptors are used to alter/inspect Flume events as they are
transferred between the source and the channel.
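As an illustrative sketch, Flume's built-in timestamp interceptor can be attached to the hypothetical source r1 of agent a1 as below; it adds the ingest time to each event's header.

# Attach an interceptor chain to the source
a1.sources.r1.interceptors = i1
# The timestamp interceptor adds a 'timestamp' header to every event
a1.sources.r1.interceptors.i1.type = timestamp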
2. Channel Selectors
These are used to determine which channel should be chosen to transfer
the data when a source is connected to multiple channels.
There are two types of channel selectors −
• Default channel selectors − These are also known as replicating
channel selectors; they replicate all the events in each configured channel.
• Multiplexing channel selectors − These decide which channel an event is
sent to based on the information in that event's header.
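A replicating selector is the default, but it can also be set explicitly; the sketch below assumes the hypothetical source r1 writes to two channels, c1 and c2.

# Copy every event into both channels
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating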
3. Sink Processors
These are used to invoke a particular sink from a selected group of
sinks, for example to create failover paths for your sinks or to
load-balance events across multiple sinks drawing from a channel.
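For example, a failover sink processor over two hypothetical sinks k1 and k2 can be sketched as follows; the priorities are arbitrary, and the sink with the higher priority is tried first.

# Group the two sinks and process them with a failover policy
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

Setting the processor type to load_balance instead distributes events across the sinks rather than preferring one.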
Apache Flume - Data Flow:
Flume agents running on the log servers receive the data from the data
generators and can forward it to intermediate agents that act as
collectors. Finally, the data from all these collectors is aggregated and
pushed to a centralized store such as HBase or HDFS.
Multi-hop Flow:
Within Flume, there can be multiple agents and before reaching the
final destination, an event may travel through more than one agent.
This is known as multi-hop flow.
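A minimal sketch of a two-agent hop, assuming hypothetical agents a1 and a2 running on hosts host1 and host2: a1 forwards events through an Avro sink, and a2 receives them with an Avro source.

# On host1: a1 sends its events to a2 over Avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = host2
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# On host2: a2 listens for the events forwarded by a1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1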
Fan-out Flow:
The dataflow from one source to multiple channels is known as fan-out
flow. It is of two types −
• Replicating − The data flow where the data will be replicated in all
the configured channels.
• Multiplexing − The data flow where the data will be sent to a
selected channel which is mentioned in the header of the event.
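A multiplexing fan-out can be sketched with a channel selector that routes on a header; here the header name "datacenter" and its values are purely illustrative.

# Route events to a channel based on the 'datacenter' header
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = datacenter
a1.sources.r1.selector.mapping.DC1 = c1
a1.sources.r1.selector.mapping.DC2 = c2
a1.sources.r1.selector.default = c1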
Fan-in Flow:
The data flow in which the data will be transferred from many sources
to one channel is known as fan-in flow.
Failure Handling
In Flume, two transactions take place for each event: one at the sender
and one at the receiver. The sender sends events to the receiver. Soon
after receiving the data, the receiver commits its own transaction and
sends a "received" signal to the sender. Only after receiving this signal
does the sender commit its own transaction. Because a sender never
commits until the receiver has acknowledged the data, no event is lost in
transit and each hop behaves atomically.