Module 10 Flume - Massive Logs Aggregation
Flume
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Objectives
Upon completion of this course, you will understand:
What Flume is
Functions of Flume
Position of Flume in FusionInsight
System architecture of Flume
Key characteristics of Flume
Application Examples of Flume
Contents
1. Flume Overview and Architecture
3. Flume Applications
What Is Flume
Flume is a streaming log collection tool. It can perform basic
processing on data and write the data to customizable data receivers.
Flume can collect data from various data sources such as local
files (spool directory source), real-time logs (taildir and exec),
REST messages, Thrift, Avro, Syslog, and Kafka.
Functions of Flume
Flume can collect logs from a specified directory and save them to a
specified destination (HDFS, HBase, or Kafka).
Flume can collect logs in real time (taildir) and save them to a specified destination.
Flume supports the cascading mode (multiple Flume nodes interwork
with each other) and data aggregation.
Flume supports customized data collection.
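Real-time collection with a taildir source might be configured as in the fragment below. This is a sketch that assumes the property names of the stock Apache Flume TAILDIR source; the source name r1, the file group f1, and both paths are hypothetical, the channel ch1 is assumed to be defined elsewhere, and the FusionInsight build may expose different keys.

```properties
# Sketch only: property names follow the stock Apache Flume TAILDIR source.
# r1, f1, and both paths are hypothetical placeholders.
server.sources = r1
server.sources.r1.type = TAILDIR
# Files matching this pattern are tailed as new lines are appended.
server.sources.r1.filegroups = f1
server.sources.r1.filegroups.f1 = /var/log/app/.*log.*
# Read offsets are recorded here so that tailing can resume after a restart.
server.sources.r1.positionFile = /var/lib/flume/taildir_position.json
# ch1 is assumed to be declared and configured elsewhere in the same file.
server.sources.r1.channels = ch1
```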
Position of Flume in FusionInsight
[Figure: position of Flume in FusionInsight; recoverable labels: Application service layer, OpenAPI/SDK, REST/SNMP/Syslog]
Architecture of Flume (1)
Basic Flume architecture: Flume can collect data directly on a single node. This architecture is mainly
applicable to data collection within a cluster.
[Figure: basic architecture; recoverable labels: Source, Sink]
Multi-agent Flume architecture: Multiple Flume nodes can be connected. After
collecting the initial data from the data sources, Flume passes it from node to node and saves it in the final storage system.
This architecture is mainly applicable to importing data from outside the cluster into the cluster.
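A two-agent link of the kind described above is commonly built from an avro sink on the upstream node and an avro source on the downstream node. The fragment below is a sketch using stock Apache Flume property names; the agent names client and server, the component names k1 and r1, the address, and the port are all hypothetical, and the FusionInsight client may generate different settings.

```properties
# Sketch only: stock Apache Flume avro sink/source property names.
# Agent names, component names, address, and port are hypothetical.

# Upstream agent (outside the cluster): forward events to the next hop.
client.sinks.k1.type = avro
client.sinks.k1.hostname = 192.0.2.10
client.sinks.k1.port = 4141
client.sinks.k1.channel = ch1

# Downstream agent (inside the cluster): listen for events from upstream.
server.sources.r1.type = avro
server.sources.r1.bind = 0.0.0.0
server.sources.r1.port = 4141
server.sources.r1.channels = ch1
```

The sink's hostname and port must point at the address and port that the downstream source binds to.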
Architecture of Flume (2)
[Figure: event flow inside a Flume agent; recoverable labels: Source, Interceptor, Channel Processor, Channel Selector, Channel, Sink Runner, Sink Processor, Sink, events]
Basic Concept - Source (1)
A source receives events or generates events based on specific mechanisms.
A source can write events to one or more channels in batches.
Sources are classified into event-driven sources and event-polling sources.
Event-driven source: The external data source actively sends data to Flume,
which drives Flume to accept the data.
Event-polling source: Flume actively and periodically fetches data.
A source must be associated with at least one channel.
Basic Concept - Source (2)
Source Type | Description
exec source | Runs a certain command or script, and outputs the execution results as a data source.
avro source | Provides an Avro-based server. It binds the server to a port so that the server waits for the data sent from the Avro-based client.
thrift source | The same as the avro source, except that the transmission protocol is Thrift.
http source | Supports data transmission based on HTTP POST.
syslog source | Collects syslog logs.
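As an illustration of two of the source types above, the fragment below sketches an exec source and an http source. It assumes stock Apache Flume property names; the source names r1 and r2, the command, the port, and the channel ch1 are hypothetical.

```properties
# Sketch only: stock Apache Flume property names; r1, r2, the command,
# the port, and ch1 are hypothetical placeholders.

# exec source: the standard output of the command becomes the event stream.
server.sources.r1.type = exec
server.sources.r1.command = tail -F /var/log/messages
server.sources.r1.channels = ch1

# http source: accepts events that clients send with HTTP POST.
server.sources.r2.type = http
server.sources.r2.bind = 0.0.0.0
server.sources.r2.port = 5140
server.sources.r2.channels = ch1
```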
Basic Concept - Channel (1)
The channel sits between the source and the sink and functions like a queue:
it temporarily stores events. After the sink successfully sends events to
the next-hop channel or the destination, the events are removed from the current
channel.
Persistence levels vary by channel type.
Memory channel: Persistence is not supported.
File channel: Persistence is implemented with a write-ahead log (WAL).
JDBC channel: Persistence is implemented with an embedded database.
Channels support transactions and provide weak ordering guarantees. A channel can
be connected to any number of sources and sinks.
Basic Concept - Channel (2)
Memory channel: Events are stored in memory. This channel offers high
throughput but no reliability guarantee; data may be lost.
File channel: It supports data persistence, but the configuration is more
complex. Both a data directory and a checkpoint directory must be
configured, and different file channels must use different checkpoint
directories.
JDBC channel: It is backed by an embedded Derby database. It supports event
persistence and high reliability, and can be used in place of the file
channel, which also supports persistence.
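The file channel requirements above can be sketched as follows. The fragment assumes the stock Apache Flume file channel properties checkpointDir and dataDirs; the channel names ch2 and ch3 and all directory paths are hypothetical.

```properties
# Sketch only: stock Apache Flume file channel property names.
# Channel names and directory paths are hypothetical placeholders.
server.channels = ch2 ch3

server.channels.ch2.type = file
server.channels.ch2.checkpointDir = /srv/flume/ch2/checkpoint
server.channels.ch2.dataDirs = /srv/flume/ch2/data

# A second file channel gets its own checkpoint and data directories.
server.channels.ch3.type = file
server.channels.ch3.checkpointDir = /srv/flume/ch3/checkpoint
server.channels.ch3.dataDirs = /srv/flume/ch3/data
```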
Basic Concept - Sink (1)
The sink transmits events to the next hop or destination.
After the events are successfully transmitted, they are
removed from the current channel.
The sink must bind to a specific channel.
Basic Concept - Sink (2)
Sink Type | Description
hdfs sink | Writes data to HDFS.
Log Collection
Flume can collect logs from outside a cluster and archive them in
HDFS, HBase, or Kafka for data analysis and cleaning by upper-layer
applications.
[Figure: log collection; recoverable labels: Log, Source, Channel, Sink, HDFS]
Multi-level Cascading and Multi-channel Duplication
Multiple Flume nodes can be cascaded. The cascaded nodes support
internal data duplication.
[Figure: cascading and duplication; recoverable labels: Log, Source, Channel, Sink]
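Duplication inside a node can be sketched with a replicating channel selector, which copies every event from one source into several channels. The fragment below reuses the selector.type = replicating key that appears in Example 1; the second channel ch2 and second sink s2 are hypothetical, and the types of the channels and sinks are assumed to be configured elsewhere.

```properties
# Sketch only: ch2 and s2 are hypothetical additions to the a1/ch1/s1 names.
server.sources = a1
server.channels = ch1 ch2
server.sinks = s1 s2

# Every event from a1 is copied into both channels.
server.sources.a1.selector.type = replicating
server.sources.a1.channels = ch1 ch2

# Each sink drains its own channel, so each destination receives a full copy.
server.sinks.s1.channel = ch1
server.sinks.s2.channel = ch2
```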
Message Compression and Encryption by Cascaded Flume Nodes
Data transmitted between cascaded Flume nodes can be compressed
and encrypted, thereby improving the data transmission efficiency
and security.
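In stock Apache Flume, compression and encryption between cascaded nodes are switched on at the avro sink and the avro source. The fragment below is a sketch under that assumption; the agent names client and server, the component names k1 and r1, the store paths, and the password placeholders are hypothetical, and FusionInsight may configure this differently.

```properties
# Sketch only: stock Apache Flume avro sink/source property names.
# Agent names, component names, paths, and passwords are placeholders.

# Sending side (avro sink): compress and encrypt outgoing batches.
client.sinks.k1.type = avro
client.sinks.k1.compression-type = deflate
client.sinks.k1.ssl = true
client.sinks.k1.truststore = /path/to/truststore.jks
client.sinks.k1.truststore-password = <truststore-password>

# Receiving side (avro source): compression-type must match the sender.
server.sources.r1.type = avro
server.sources.r1.compression-type = deflate
server.sources.r1.ssl = true
server.sources.r1.keystore = /path/to/keystore.jks
server.sources.r1.keystore-password = <keystore-password>
```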
Data Monitoring
[Figure: data monitoring; recoverable labels: FusionInsight Manager, Application, Flume API, Flume (Source, Channel, Sink), HDFS/Hive/HBase/Kafka; monitored metrics: received data size, data buffer size, transmitted data size]
Transmission Reliability
Flume uses transactions for data transmission, which protects the data and improves
reliability in transit. In addition, if a file channel is used, data buffered in the
channel is not lost when a process or node is restarted.
[Figure: transaction flow; recoverable labels: Put events, End tx]
Transmission Reliability (Failover)
During data transmission, if the next-hop Flume node is faulty or receives data
abnormally, the data flow is automatically switched over to another path.
[Figure: failover topology; recoverable labels: Log, Source, Channel, Sink, HDFS]
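Failover between two next-hop paths can be sketched with a sink group and the failover sink processor. The fragment assumes stock Apache Flume property names; the group name g1, the second sink s2, and the priority and penalty values are hypothetical, and the sinks themselves are assumed to be configured elsewhere.

```properties
# Sketch only: stock Apache Flume sink group property names.
# g1, s2, and the numeric values are hypothetical.
server.sinks = s1 s2
server.sinkgroups = g1
server.sinkgroups.g1.sinks = s1 s2
server.sinkgroups.g1.processor.type = failover
# The sink with the larger priority value is used first.
server.sinkgroups.g1.processor.priority.s1 = 10
server.sinkgroups.g1.processor.priority.s2 = 5
# Upper bound, in milliseconds, on the backoff applied to a failed sink.
server.sinkgroups.g1.processor.maxpenalty = 10000
```

With these values, events go through s1 while it is healthy; when s1 fails, the processor switches to s2 and retries s1 after the backoff period.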
Data Filtering During Transmission
During data transmission, Flume can perform simple filtering and cleaning
to drop unnecessary data. If you need to filter complex data, you can
develop filter plug-ins tailored to the characteristics of your data.
Flume also supports third-party filter plug-ins.
[Figure: event filtering path; recoverable labels: Source, Interceptor, Channel Processor, Channel Selector, Channel, events]
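Simple filtering of this kind can be sketched with the regex filtering interceptor that ships with stock Apache Flume. The interceptor name i1 and the pattern are hypothetical; a1 is the source name used in the examples of this course.

```properties
# Sketch only: stock Apache Flume regex_filter interceptor.
# i1 and the pattern are hypothetical.
server.sources.a1.interceptors = i1
server.sources.a1.interceptors.i1.type = regex_filter
# Events whose body matches this pattern are selected...
server.sources.a1.interceptors.i1.regex = .*DEBUG.*
# ...and, because excludeEvents is true, dropped before reaching the channel.
server.sources.a1.interceptors.i1.excludeEvents = true
```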
Flume Example 1 (1)
Description
In this application scenario, Flume collects logs from an application
(for example, the online banking system) outside the cluster and
saves the logs in the HDFS.
Data preparations
Create a log directory /tmp/log_test on a node in the cluster
Flume Example 1 (2)
Download the Flume Client:
Flume Example 1 (3)
Install the Flume client:
Decompress the client:
tar -xvf FusionInsight_V100R002C60_Flume_Client.tar
tar -xvf FusionInsight_V100R002C60_Flume_ClientConfig.tar
cd FusionInsight_V100R002C60_Flume_ClientConfig/Flume
tar -xvf FusionInsight-Flume-1.6.0.tar.gz
Flume Example 1 (4)
Configure the Flume source:
server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_test
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.deserializer = LINE
server.sources.a1.selector.type = replicating
server.sources.a1.fileHeaderKey = file
server.sources.a1.fileHeader = false
server.sources.a1.channels = ch1
Flume Example 1 (5)
Configure the Flume channel:
# the channel configuration of ch1
server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
server.channels.ch1.channlefullcount = 10
server.channels.ch1.keep-alive = 3
server.channels.ch1.byteCapacityBufferPercentage = 20
Flume Example 1 (6)
Configure the Flume sink:
server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_avro
server.sinks.s1.hdfs.filePrefix = over_%{basename}
server.sinks.s1.hdfs.inUseSuffix = .tmp
server.sinks.s1.hdfs.rollInterval = 30
server.sinks.s1.hdfs.rollSize = 1024
server.sinks.s1.hdfs.rollCount = 10
server.sinks.s1.hdfs.batchSize = 1000
server.sinks.s1.hdfs.fileType = DataStream
server.sinks.s1.hdfs.maxOpenFiles = 5000
server.sinks.s1.hdfs.writeFormat = Writable
server.sinks.s1.hdfs.callTimeout = 10000
server.sinks.s1.hdfs.threadsPoolSize = 10
server.sinks.s1.hdfs.failcount = 10
server.sinks.s1.hdfs.fileCloseByEndEvent = true
server.sinks.s1.channel = ch1
Flume Example 1 (7)
Name the Flume agent configuration file properties.properties.
Flume Example 1 (8)
Move data files to the monitor directory /tmp/log_test:
mv /var/log/log.11 /tmp/log_test
Flume Example 2 (1)
Description
In this application scenario, Flume collects real-time clickstream
logs and saves them to Kafka for real-time analysis and
processing.
Data preparations
Create a log directory /tmp/log_click on a node in the cluster
Flume Example 2 (2)
Configure the Flume source:
server.sources = a1
server.channels = ch1
server.sinks = s1
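The source settings for this example are not included in this text. One possible configuration, assuming the source mirrors the spooldir source of Example 1 and watches the /tmp/log_click directory created during data preparation, is sketched below; every value here is an assumption rather than the original configuration.

```properties
# Assumption: mirrors the spooldir source of Example 1. The original
# Example 2 source settings are not given, so treat every value as hypothetical.
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_click
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.channels = ch1
```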
Flume Example 2 (3)
Configure the Flume channel:
# the channel configuration of ch1
server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
server.channels.ch1.channlefullcount = 10
server.channels.ch1.keep-alive = 3
server.channels.ch1.byteCapacityBufferPercentage = 20
Flume Example 2 (4)
Configure the Flume sink:
# the sink configuration of s1
server.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
server.sinks.s1.kafka.topic = topic_1028
server.sinks.s1.flumeBatchSize = 1000
server.sinks.s1.kafka.producer.type = sync
server.sinks.s1.kafka.bootstrap.servers = 192.168.225.15:21007
server.sinks.s1.kafka.security.protocol = SASL_PLAINTEXT
server.sinks.s1.requiredAcks = 0
server.sinks.s1.channel = ch1
Flume Example 2 (5)
Upload the configuration file to Flume.
Summary
This course describes Flume functions and application scenarios,
including the basic concepts, functions, reliability, and
configuration items. Upon completion of this course, you can
understand Flume functions, application scenarios, and
configuration methods.
Quiz
1. What is Flume? What are the functions of Flume?
Quiz
True or False
Flume supports cascading. That is, multiple Flume nodes can be cascaded for
data transmission.
More Information
Training materials:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
Exam outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
Mock exam:
http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Thank You