
Module 10 Flume - Massive Logs Aggregation

Flume is an open-source, distributed log aggregation system designed for collecting, processing, and transferring massive amounts of log data. It offers various features such as real-time log collection, customizable data senders and receivers, and supports multiple data sources and sinks including HDFS, HBase, and Kafka. The document outlines Flume's architecture, key characteristics, and provides examples of its applications in log collection and real-time data processing.


Technical Principles of

Flume

www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.


Foreword
 Flume is an open-source log system: a distributed, reliable, and
highly available system for aggregating massive volumes of logs.
Flume supports customized data senders and receivers for
collecting, processing, and transferring data.

Objectives
 Upon completion of this course, you will know:
 What Flume is
 Functions of Flume
 Position of Flume in FusionInsight
 System architecture of Flume
 Key characteristics of Flume
 Application Examples of Flume

Contents
1. Flume Overview and Architecture

2. Key Characteristics of Flume

3. Flume Applications

What Is Flume
 Flume is a streaming log collection tool. Flume performs simple
processing on collected data and writes the data to customizable data
receivers. Flume can collect data from various data sources, such as
local files (spooling directory source), real-time logs (taildir and
exec sources), REST messages, Thrift, Avro, Syslog, and Kafka.

Functions of Flume
 Flume can collect logs from a specified directory and save them to a
specified destination (HDFS, HBase, or Kafka).
 Flume can collect logs in real time (taildir source) and save them to a
specified path.
 Flume supports the cascading mode (multiple Flume nodes interworking
with each other) and data aggregation.
 Flume supports customized data collection.
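The real-time collection case above (taildir source into HDFS) can be sketched as a minimal agent definition. This is an illustrative sketch, not a configuration from this course: the agent name `server`, the file paths, and the channel and sink names are assumptions, and the TAILDIR source follows the Apache Flume convention.

```properties
# Hypothetical agent sketch: tail application logs in real time and write them to HDFS
server.sources = r1
server.channels = c1
server.sinks = k1

# taildir source: follows matching files as they grow, resuming from a position file
server.sources.r1.type = TAILDIR
server.sources.r1.filegroups = f1
server.sources.r1.filegroups.f1 = /var/log/app/.*log
server.sources.r1.positionFile = /tmp/flume/taildir_position.json
server.sources.r1.channels = c1

# file channel: buffered events survive a process restart
server.channels.c1.type = file
server.channels.c1.checkpointDir = /tmp/flume/checkpoint
server.channels.c1.dataDirs = /tmp/flume/data

# HDFS sink: archive the collected logs
server.sinks.k1.type = hdfs
server.sinks.k1.hdfs.path = /tmp/flume_taildir
server.sinks.k1.hdfs.fileType = DataStream
server.sinks.k1.channel = c1
```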

Position of Flume in FusionInsight
[Architecture figure] The FusionInsight stack: an application service layer (OpenAPI/SDK, REST/SNMP/Syslog) on top; the DataFarm layer (Data > Information > Knowledge > Wisdom) containing Flume, Miner, Farmer, and Manager; below it, via the Hadoop API and Plugin API, the Hadoop layer (HDFS/HBase, YARN/ZooKeeper, Hive, M/R, Spark, Storm, Flink) and LibrA; with system management, service governance, and security management alongside.

Flume is a distributed framework for collecting and aggregating stream data.

Architecture of Flume (1)
 Basic Flume architecture: Flume can directly collect data on a single node. This architecture is mainly
applicable to data collection within a cluster.

Log -> [ Source -> Channel -> Sink ] -> HDFS

 Multi-agent architecture of the Flume: Multiple Flume nodes can be connected. After
collecting initial data from data sources, Flume saves the data in the final storage system.
This architecture is mainly applicable to the import of data outside to the cluster.

Log -> [ Source -> Channel -> Sink ] -> [ Source -> Channel -> Sink ] -> HDFS

Architecture of Flume (2)

Within an agent, events flow from the Source into the Channel Processor, where Interceptors are applied; the Channel Selector then routes the events to one or more Channels. On the delivery side, the Sink Runner drives the Sink Processor, which takes events from a Channel and hands them to a Sink.

Basic Concept - Source (1)
 The source receives events, or generates events based on specific
mechanisms, and saves the events to one or more channels in batches.
Sources are classified into event-driven sources and event-polling sources.
 Event-driven source: The external source actively sends data to Flume,
driving Flume to accept the data.
 Event-polling source: Flume periodically pulls data in an active manner.
 A source must be associated with at least one channel.

Basic Concept - Source (2)
Source Type | Description
exec source | Runs a command or script and uses its output as a data source.
avro source | Provides an Avro-based server bound to a port; waits for data sent from Avro clients.
thrift source | The same as the avro source, except that the transmission protocol is Thrift.
http source | Supports data transmission based on HTTP POST.
syslog source | Collects syslog logs.
spooling directory source | Collects local static files.
jms source | Obtains data from a message queue.
kafka source | Obtains data from Kafka.
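As an example of the first row of the table, an exec source can be declared as follows. This is a sketch with assumed names (agent `server`, source `r1`, channel `ch1`) and an assumed command:

```properties
# Hypothetical exec source: run a command and treat each output line as an event
server.sources = r1
server.sources.r1.type = exec
server.sources.r1.command = tail -F /var/log/messages
server.sources.r1.channels = ch1
```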

Basic Concept - Channel (1)
 The channel is located between the source and the sink. It functions like a
queue and temporarily holds events. When the sink successfully sends events to
the next-hop channel or the final destination, the events are removed from the
current channel.
 The persistence level varies with the channel type:
 Memory channel: Persistence is not supported.
 File channel: Persistence is achieved based on a write-ahead log (WAL).
 JDBC channel: Persistence is achieved based on an embedded database.
 Channels support transactions and provide weak ordering guarantees. A channel
can connect to any number of sources and sinks.

Basic Concept - Channel (2)
 Memory channel: Messages are stored in memory. This channel provides high
throughput but no reliability; data may be lost.
 File channel: It supports data persistence, but the configuration is more
complex: both a data directory and a checkpoint directory must be configured,
and different file channels must use different checkpoint directories.
 JDBC channel: It uses an embedded Derby database. It supports event
persistence and high reliability, and can be used in place of the file
channel, which also supports persistence.
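The file channel requirements above (a data directory plus a per-channel checkpoint directory) can be sketched as follows; the directory paths and capacity values are assumptions for illustration:

```properties
# Hypothetical file channel: WAL-based persistence on local disk
server.channels = ch1
server.channels.ch1.type = file
server.channels.ch1.checkpointDir = /srv/flume/ch1/checkpoint
server.channels.ch1.dataDirs = /srv/flume/ch1/data
server.channels.ch1.capacity = 100000
server.channels.ch1.transactionCapacity = 1000
```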

Basic Concept - Sink (1)
 The sink transmits events to the next hop or to the final
destination. After the events are successfully transmitted, they are
removed from the current channel.
 A sink must be bound to a specific channel.

Basic Concept - Sink (2)
Sink Type | Description
hdfs sink | Writes data to HDFS.
avro sink | Transmits data to the next-hop Flume node using the Avro protocol.
thrift sink | The same as the avro sink, except that the transmission protocol is Thrift.
file roll sink | Saves data in the local file system.
hbase sink | Writes data to HBase.
kafka sink | Writes data to Kafka.
MorphlineSolr sink | Writes data to Solr.
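For example, the avro sink from the table, which forwards events to the next-hop Flume node, might be configured like this; the host address and port are placeholders, not values from this course:

```properties
# Hypothetical avro sink: forward events to a next-hop Flume agent
server.sinks = s1
server.sinks.s1.type = avro
server.sinks.s1.hostname = 192.168.0.2
server.sinks.s1.port = 21154
server.sinks.s1.channel = ch1
```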

Contents
1. Flume Overview and Architecture

2. Key Characteristics of Flume

3. Flume Applications

Log Collection
 Flume can collect logs from outside a cluster and archive them in
HDFS, HBase, or Kafka for data analysis and cleaning by upper-layer
applications.

Log -> [ Source -> Channel -> Sink ] -> HDFS
Log -> [ Source -> Channel -> Sink ] -> HBase
Log -> [ Source -> Channel -> Sink ] -> Kafka

Multi-level Cascading and Multi-channel Duplication
 Multiple Flume nodes can be cascaded. The cascaded nodes support
internal data duplication.
Log -> first agent (Source -> Channel -> Sink) -> second agent, whose Source duplicates each event into two Channels: one Channel -> Sink path writes to HDFS, the other Channel -> Sink path writes to HBase.
Message Compression and Encryption by Cascaded Flume Nodes
 Data transmitted between cascaded Flume nodes can be compressed
and encrypted, thereby improving data transmission efficiency and
security.

Application -> Flume API -> Flume (compression and encryption) -> Flume (decompression and decryption) -> HDFS/Hive/HBase/Kafka

Data Monitoring

[Figure] FusionInsight Manager collects Flume monitoring information: the received data size at the Source, the data buffer size in the Channel, and the transmitted data size at the Sink, along the path Application -> Flume API -> Source -> Channel -> Sink -> HDFS/Hive/HBase/Kafka.
Transmission Reliability
 Flume adopts a transaction-based mode for data transmission. This mode
ensures data security and enhances reliability during transmission. In
addition, if the file channel is used, data buffered in the channel is not
lost when a process or node restarts.

On the sending agent, the Sink starts a transaction, takes events from its Channel, and sends them. On the receiving agent, the Source starts a transaction and puts the received events into its Channel. Each transaction ends (commits) only after the hand-off succeeds.

Transmission Reliability (Failover)
 During data transmission, if the next-hop Flume node is faulty or receives data
abnormally, the data is automatically switched over to another path.

Log -> first agent (Source -> Channel), whose Channel feeds two Sinks: the active Sink sends to one next-hop agent (Source -> Channel -> Sink -> HDFS); if that path fails, the standby Sink sends to an alternate next-hop agent (Source -> Channel -> Sink -> HDFS).
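In Apache Flume, this failover behavior is expressed with a sink group using the failover sink processor: the highest-priority healthy sink receives the events, and a failed sink is penalized before being retried. A sketch with assumed names and priorities:

```properties
# Hypothetical failover sink group: s1 is preferred; s2 takes over if s1 fails
server.sinkgroups = g1
server.sinkgroups.g1.sinks = s1 s2
server.sinkgroups.g1.processor.type = failover
server.sinkgroups.g1.processor.priority.s1 = 10
server.sinkgroups.g1.processor.priority.s2 = 5
server.sinkgroups.g1.processor.maxpenalty = 10000
```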

Data Filtering During Transmission
 During data transmission, Flume can roughly filter and clean the data,
discarding unnecessary data. If complex data needs to be filtered, you can
develop filter plug-ins based on the characteristics of the data. Flume
supports third-party filter plug-ins.

events: Source -> Channel Processor (Interceptors applied) -> Channel Selector -> Channels
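As an example of built-in filtering, Apache Flume ships a regex_filter interceptor that drops (or keeps) events whose body matches a pattern. The pattern and names below are assumptions for illustration:

```properties
# Hypothetical interceptor: discard events whose body starts with DEBUG
server.sources.a1.interceptors = i1
server.sources.a1.interceptors.i1.type = regex_filter
server.sources.a1.interceptors.i1.regex = ^DEBUG.*
server.sources.a1.interceptors.i1.excludeEvents = true
```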

Contents
1. Flume Overview and Architecture

2. Key Characteristics of Flume

3. Flume Applications

Flume Example 1 (1)
 Description
 In this application scenario, Flume collects logs from an application
(for example, an online banking system) outside the cluster and
saves the logs to HDFS.

 Data preparations
 Create a log directory /tmp/log_test on a node in the cluster.

 Use this directory as the monitoring directory.

Flume Example 1 (2)
 Download the Flume Client:

 Log in to the FusionInsight HD cluster. Choose Service
Management > Flume > Download Client.

Flume Example 1 (3)
 Install the Flume client:
 Decompress the client:
tar -xvf FusionInsight_V100R002C60_Flume_Client.tar
tar -xvf FusionInsight_V100R002C60_Flume_ClientConfig.tar
cd FusionInsight_V100R002C60_Flume_ClientConfig/Flume
tar -xvf FusionInsight-Flume-1.6.0.tar.gz

 Install the client:
./install.sh -d /opt/FlumeClient -f hostIP -c flume/conf/client.properties.properties

Flume Example 1 (4)
 Configure the Flume source
server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_test
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.deserializer = LINE
server.sources.a1.selector.type = replicating
server.sources.a1.fileHeaderKey = file
server.sources.a1.fileHeader = false
server.sources.a1.channels = ch1

Flume Example 1 (5)
 Configure the Flume channel
# the channel configuration of ch1
server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
server.channels.ch1.channlefullcount = 10
server.channels.ch1.keep-alive = 3
server.channels.ch1.byteCapacityBufferPercentage = 20

Flume Example 1 (6)
 Configure the Flume sink
server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_avro
server.sinks.s1.hdfs.filePrefix = over_%{basename}
server.sinks.s1.hdfs.inUseSuffix = .tmp
server.sinks.s1.hdfs.rollInterval = 30
server.sinks.s1.hdfs.rollSize = 1024
server.sinks.s1.hdfs.rollCount = 10
server.sinks.s1.hdfs.batchSize = 1000
server.sinks.s1.hdfs.fileType = DataStream
server.sinks.s1.hdfs.maxOpenFiles = 5000
server.sinks.s1.hdfs.writeFormat = Writable
server.sinks.s1.hdfs.callTimeout = 10000
server.sinks.s1.hdfs.threadsPoolSize = 10
server.sinks.s1.hdfs.failcount = 10
server.sinks.s1.hdfs.fileCloseByEndEvent = true
server.sinks.s1.channel = ch1

Flume Example 1 (7)
 Name the Flume agent configuration file properties.properties.

 Upload the configuration file.

Flume Example 1 (8)
 Move the data files to the monitored directory /tmp/log_test:
mv /var/log/log.11 /tmp/log_test

 Check whether the data has been written to HDFS:
hdfs dfs -ls /tmp/flume_avro

 log.11 has been renamed log.11.COMPLETED, which indicates that the data
was collected successfully.

Flume Example 2 (1)
 Description
 In this application scenario, Flume collects real-time clickstream
logs and saves them to Kafka for real-time analysis and processing.

 Data preparations
 Create a log directory /tmp/log_click on a node in the cluster.

 Collect data to the Kafka topic topic_1028.

Flume Example 2 (2)
 Configure the Flume source:
server.sources = a1
server.channels = ch1
server.sinks = s1

# the source configuration of a1


server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_click
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.selector.type = replicating
server.sources.a1.basenameHeaderKey = basename
server.sources.a1.deserializer.maxBatchLine = 1
server.sources.a1.deserializer.maxLineLength = 2048
server.sources.a1.channels = ch1

Flume Example 2 (3)
 Configure the Flume channel:
# the channel configuration of ch1
server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
server.channels.ch1.channlefullcount = 10
server.channels.ch1.keep-alive = 3
server.channels.ch1.byteCapacityBufferPercentage = 20

Flume Example 2 (4)
 Configure the Flume sink:
# the sink configuration of s1
server.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
server.sinks.s1.kafka.topic = topic_1028
server.sinks.s1.flumeBatchSize = 1000
server.sinks.s1.kafka.producer.type = sync
server.sinks.s1.kafka.bootstrap.servers = 192.168.225.15:21007
server.sinks.s1.kafka.security.protocol = SASL_PLAINTEXT
server.sinks.s1.requiredAcks = 0
server.sinks.s1.channel = ch1

Flume Example 2 (5)
 Upload the configuration file to Flume.

 Use Kafka commands to view the data collected in the topic topic_1028.
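One way to view the topic is the Kafka console consumer. The exact script path and security options depend on the Kafka/FusionInsight client version, so treat this as a sketch; the bootstrap server is taken from the sink configuration above, and consumer.properties is an assumed file carrying the SASL settings:

```shell
# Hypothetical check: read the collected events from topic_1028
kafka-console-consumer.sh --topic topic_1028 \
  --bootstrap-server 192.168.225.15:21007 \
  --consumer.config consumer.properties \
  --from-beginning
```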

Summary
 This course describes Flume functions and application scenarios,
including its basic concepts, functions, reliability, and
configuration items. Upon completion of this course, you will
understand Flume functions, application scenarios, and
configuration methods.

Quiz
1. What is Flume? What are the functions of Flume?

2. What are the key characteristics of Flume?

3. What are the functions of the source, channel, and sink?

Quiz
True or False
 Flume supports cascading. That is, multiple Flume nodes can be cascaded for
data transmission.

More Information
 Training materials:
 http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term100002
5450&id=Node1000011796
 Exam outline:
 http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node10
00011797
 Mock exam:
 http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node10000
11798
 Authentication process:
 http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Thank You
www.huawei.com

