LogLens: A Real-Time Log Analysis System
Abstract—Administrators of most user-facing systems depend on periodic log data to get an idea of the health and status of production applications. Logs report information which is crucial to diagnose the root cause of complex problems. In this paper, we present a real-time log analysis system called LogLens that automates the process of anomaly detection from logs with no (or minimal) target system knowledge and user specification. In LogLens, we employ unsupervised machine learning based techniques to discover patterns in application logs, and then leverage these patterns along with real-time log parsing for designing advanced log analytics applications. Compared to existing systems, which are primarily limited to log indexing and search capabilities, LogLens presents an extensible system for supporting both stateless and stateful log analysis applications. Currently, LogLens is running at the core of a commercial log analysis solution handling millions of logs generated from large-scale industrial environments, and has reported up to a 12096x man-hours reduction in troubleshooting operational problems compared to the manual approach.

I. INTRODUCTION

Log analysis is the process of transforming raw logs – written records of software system events – into information that helps operators and administrators to solve problems [1, 2]. Log analysis is used in a variety of domains such as detecting security threats [3, 4, 5], compliance auditing [6], power plant fault detection [7], or data center operations [8, 9, 10, 11, 12]. The ability to analyze logs quickly and accurately is critical to reduce system downtime and to detect operational problems before or while they occur.

A critical aspect of a log that enables fast and accurate analysis is its structure. Recognizing the structure of a log greatly helps in easy extraction of specific system information, such as the type, time of creation, source of a specific event, the value of key performance indicators, etc. Without a known log structure, log analysis becomes a simple keyword-based text search tool. In fact, most commercial log analytics platforms today [13, 14] allow users to directly specify log patterns or to generate models based on domain knowledge. While supervised log analysis can help extract important insights without ambiguity, it also has several shortcomings: a) it is specific to what the user seeks and focuses on known errors, and b) it cannot easily adapt to new data sources and formats. As more new devices and data formats enter the market (Gartner, Inc. forecasts that 20.4 billion IoT units will be in use worldwide by 2020 [15]), it becomes increasingly difficult for supervised log analysis tools to keep track of and adapt to new log structures and identify unknown anomalies.

In this paper, we describe LogLens, a log analysis system to automatically detect operational problems from any software system's logs. Rather than taking the log structure as an input, LogLens automatically learns structures from the "correct logs" and generates models that capture normal system behaviors. It subsequently employs these models to analyze production logs generated in real-time and detects anomalies. Here, we define an anomaly as a log or group of logs that does not match the normal system behavior models. LogLens requires no (or minimal) user involvement and adapts automatically to new log formats and patterns as long as users can provide a set of logs for building models against which anomalies are detected.

LogLens classifies anomaly detection algorithms into two major groups: stateful and stateless. Stateless anomalies arise from analyzing a single log instance, while stateful anomalies appear when a combination of multiple logs does not match the trained model. For example, identifying errors or warnings in operational logs does not require keeping state about each log. In contrast, identifying a maximum duration violation of a database transaction requires storing the start event time of the transaction so that when an end event of the same transaction comes, anomalies can be detected by calculating the duration of the transaction. LogLens presents one exemplary stateless algorithm and one exemplary stateful algorithm. The exemplary stateless algorithm is a log parser, which parses logs using patterns discovered during system normal runs and reports anomalies if streaming logs cannot be parsed using the discovered patterns. This stateless parser can parse logs up to 41x faster than Logstash [16], which is a widely used log parsing tool. The exemplary stateful algorithm discovers relationships among log sequences representing usual operational workflows from the system's normal runs and reports anomalies in the streaming logs. This stateful algorithm can handle heterogeneous log streams and can automatically discover ID fields to link multiple logs corresponding to an event.

To analyze massive volumes of logs with zero-downtime,

† Work done while working at NEC Laboratories America, Inc.
[Fig. 1: LogLens architecture. Agents ship logs to a Log Manager; a Log Parser (stateless) and a Log Sequence Anomaly Detector (stateful) detect anomalies; a Model Manager and a Model Controller supply models; a Heartbeat Controller, Anomaly Storage, and a Visualization Dashboard complete the pipeline.]
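To make the stateless/stateful split described in the introduction concrete, here is a minimal sketch in Python. The helper names, log formats, and threshold are illustrative assumptions, not LogLens's actual interfaces: a stateless check inspects one log in isolation, while a stateful check must keep the transaction's begin time as state until the matching end event arrives.

```python
import re

def stateless_check(log: str):
    """Stateless: a single log instance is enough to decide (e.g., ERROR/WARN)."""
    return "anomaly" if re.search(r"\b(ERROR|WARN)\b", log) else None

def stateful_check(log: str, open_txns: dict, max_duration: float):
    """Stateful: correlate begin/end logs of one transaction and check duration.
    Assumed toy log format: "<timestamp> <txn-id> <begin|end>"."""
    ts, txn_id, action = log.split()
    ts = float(ts)
    if action == "begin":
        open_txns[txn_id] = ts              # store start time as state
        return None
    duration = ts - open_txns.pop(txn_id)   # matching end event arrived
    return "anomaly" if duration > max_duration else None
```

For instance, with a 5-second limit, a transaction whose end event arrives 10 seconds after its begin event would be flagged, mirroring the database-transaction example above.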
manager allows human experts to inspect models and edit them to incorporate domain knowledge.

Model Controller gets notifications from the model manager and sends control instructions to the anomaly detectors. Models can be added, updated, or deleted, and each operation needs a separate instruction which contains detailed information about the steps that need to be executed. Anomaly detectors read control instructions and take action accordingly.

Log Parser takes streaming logs and the log-pattern model from the model manager as input. It parses logs using patterns and forwards them to the log sequence anomaly detector. All unparsed logs are reported as anomalies and presented to the user for further review. The log parser is an example implementation of the stateless anomaly detection algorithm. We describe it in detail in Section III.

Log Sequence Anomaly Detector detects an anomalous log sequence of an event (or transaction), which consists of a sequence of actions, where each action is represented by a log. It is a stateful algorithm which detects malfunctioning events by analyzing abnormal log sequences based on an automata-based model. We describe it in detail in Section IV.

Heartbeat Controller periodically sends heartbeat (i.e., echo or dummy) messages to the log sequence anomaly detector. These messages help to report anomalies in a timely manner and to identify open states in a transaction.

Anomaly Storage stores all anomalies for human validation. Each anomaly has a type, severity, reason, timestamp, associated logs, etc.

Visualization Dashboard provides a graphical user interface and dashboard to the end users. It combines information from log storage, model storage, and anomaly storage to present anomalies to the users. Users can easily view anomalies and take actions to rebuild or edit models. It also allows users to run complex analyses by issuing ad-hoc queries.

Most components described above can be implemented using many different open-source products. LogLens uses the Spark [17] big data processing framework. It uses Kafka [21] for shipping logs and communicating among different components. For storage, it uses Elasticsearch [14], a NoSQL database. Elasticsearch provides a very useful query facility that can be used for data exploration. Furthermore, it has close integration with Kibana [22], which provides a tool for building visualization front-ends and writing interactive ad-hoc queries.

Now, we describe our exemplary anomaly detection algorithms in Section III and Section IV, and deployment challenges and solutions in Section V.

III. STATELESS: LOG PARSER

For an automated log analysis system, a core step is to parse raw logs and make them structured so that various log analysis tasks can be carried out by leveraging the structured form of the raw logs. LogLens parses logs using patterns learned from the system's normal runs. Here, we define a pattern as a GROK expression [23]. For example, for the log "Connect DB 127.0.0.1 user abc123", one of the matching GROK patterns is "%{WORD:Action} DB %{IP:Server} user %{NOTSPACE:UserName}", and after parsing LogLens produces {"Action": "Connect", "Server": "127.0.0.1", "UserName": "abc123"} as the parsing output in JSON format. Parsed outputs can be used as a building block for designing various log analysis features. For example, our stateful algorithm (see Section IV) uses them to detect log sequence violations.

Challenges. Automatically parsing heterogeneous logs without human involvement is a non-trivial task. LogLens parses logs in two phases: 1) it discovers a set of GROK patterns from a set of logs representing system normal runs, and 2) it parses logs using these GROK patterns.

Existing log analysis tools either use predefined regular expressions (RegEx) or source-code level information for log parsing [11, 16, 24]. Thus, these tools are supervised and need human involvement – they cannot be used for the first phase. Our earlier work, LogMine [25], shows how to discover patterns without any human involvement by clustering similar logs. LogMine uses tokenized logs and datatypes of the tokens
during the similarity computation step. However, identifying some tokens, especially timestamps, is a very challenging task. In addition, LogMine may fail to meet user needs, as it is very hard to automatically infer the semantics of a field in the GROK pattern.

In the second phase, we need a tool to parse incoming logs. We can use Logstash [16], an industrial-strength open-source log parsing tool, which can parse logs using GROK patterns. However, we find that Logstash suffers from two severe scalability problems: 1) it cannot handle a large number of patterns, and 2) it consumes a huge amount of memory (see Section VI-A). Since LogLens discovers patterns with no (or minimal) human involvement, it can generate a huge number of patterns, which is very problematic for Logstash.

Solution. LogLens provides an efficient solution for identifying timestamps, and to meet user expectations it allows users to edit/modify automatically generated GROK patterns. For fast parsing, LogLens transforms both logs and patterns into their underlying datatypes and builds an index for quickly finding the log-to-GROK mapping. Now, we describe the log parsing workflow in detail.

A. Model Building

1) Tokenization: LogLens preprocesses a log by splitting it into individual units called tokens. Splitting is done based on a set of delimiters. The default delimiter set consists of white space characters (i.e., space, tab, etc.). LogLens also allows users to provide delimiters to overwrite the default delimiters in order to meet their needs. In addition, users can provide regular expression (RegEx) based rules to split a single token into multiple sub-tokens. For example, to split the token "123KB" into sub-tokens "123" and "KB", a user can provide the following RegEx rule: "[0-9]+KB" => "[0-9]+ KB".

2) Datatype Identification: During this step, for every token LogLens identifies its datatype based on RegEx rules. Table I shows the sample RegEx rules for identifying different datatypes in LogLens.

Datatype   Regular Expression (RegEx) Syntax
WORD       [a-zA-Z]+
NUMBER     -?[0-9]+(\.[0-9]+)?
IP         [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
NOTSPACE   \S+
DATETIME   [0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}
ANYDATA    .*

TABLE I: Syntax for various data types. Notation is adopted from the Java Pattern API [26].

Challenge. LogLens identifies timestamps and unifies them into a single format "yyyy/MM/dd HH:mm:ss.SSS" corresponding to the DATETIME datatype. However, we find that this is a very cumbersome process due to the heterogeneity of timestamp formats used in various logs. For example, the timestamp "2016/02/23 09:00:31" can be expressed as "2016/23/02 09:00:31" or "2016/23/02 09:00:31.000" or "Feb 23, 2016 09:00:31" or "2016 Feb 23 09:00:31" or "02/23/2016 09:00:31" or "02-23-2016 09:00:31" and so on. LogLens allows users to specify formats to identify timestamp related tokens. It uses Java's SimpleDateFormat [27] notation to specify a timestamp format. However, if users do not specify any format, LogLens identifies timestamps based on a set of predefined formats (for example, MM/dd HH:mm:ss, dd/MM HH:mm:ss:SSS, yyyy/MM/dd HH:mm:ss.SSS, etc.). Users can also add new formats to the predefined list. The worst case time complexity of identifying a timestamp is O(k), where k is the total number of predefined formats or the total number of user-specified formats.

Solution. LogLens uses the following two optimizations to quickly identify tokens related to the timestamp formats:
• Caching matched formats. LogLens maintains a cache to track the matched formats. Caching reduces the amortized time complexity to O(1). To identify timestamp related tokens in a log, LogLens first checks whether there is a cache hit. In case of a cache miss, LogLens checks the non-cached formats, and if a match is found, the corresponding format is added to the cache. This simple caching strategy works well in practice, as logs from the same (or similar) sources use the same formats, and every source uses only a few different formats to record timestamps.
• Filtering. LogLens maintains a set of keywords based on the most common forms of specifying month (i.e., jan-dec, january-december, 01-12, 1-9), day (i.e., 01-31), hour (i.e., 00-59), day of the week (i.e., mon-sun, monday-sunday), etc. It uses these keywords to filter out tokens which cannot be related to a timestamp. Only if a token cannot be filtered out does LogLens check the predefined formats.

3) Pattern Discovery By Clustering Similar Logs: In this step, LogLens clusters preprocessed logs based on a similarity distance using the LogMine [25] algorithm. All logs within a cluster are merged together to generate one final pattern in the form of a GROK expression. LogLens assigns a field ID to each field. The field ID consists of two parts: 1) the ID of the log pattern that this field belongs to, and 2) the sequence number of this field compared to other fields in the same pattern. The log pattern IDs can be assigned the integer numbers 1, 2, 3, ..., m for a log pattern set of size m. The field sequence order can be assigned the integer numbers 1, 2, 3, ..., k for a log pattern with k variable fields. For example, for the log "2016/02/23 09:00:31 127.0.0.1 login user1" the corresponding generated GROK pattern would be "%{DATETIME:P1F1} %{IP:P1F2} %{WORD:P1F3} user1".

4) Incorporating Domain Knowledge: LogLens automatically generates patterns, therefore it may not always meet user needs. In addition, users may want to generate patterns from one system and later apply them to a different system with some minor modifications. A user may even want to delete some patterns, add new patterns, or edit datatypes. To solve these issues, LogLens allows users to edit automatically generated patterns. It supports the following editing operations:
• LogLens allows users to add the semantic meaning of a field
by renaming its generic name. For example, LogLens may assign "P1F1" as a generic field name for the "logTime" field, thus it may be difficult for users to interpret the parsed output. By renaming "P1F1" to "logTime", users can easily fix this problem. To ease the renaming effort, LogLens uses a heuristic based approach to leverage commonly used patterns found in the logs. For example, LogLens automatically renames "PDU = %{NUMBER:P1F1}" as "PDU = %{NUMBER:PDU}". Only if none of the heuristics matches does LogLens assign a generic name.
• LogLens allows users to specialize a field. For example, a user can specialize "%{IP:P1F2}" by replacing it with the fixed value "127.0.0.1".
• LogLens allows users to generalize a specific token value. For example, a user can generalize "user1" to "%{NOTSPACE:userName}" in order to convert it into a variable field.
• LogLens allows users to edit a datatype definition to include multiple tokens under one field. To support this feature, it introduces the ANYDATA (i.e., wildcard) datatype, which is defined in Table I.

B. Parsing Logs and Anomaly Detection

LogLens uses the patterns discovered during the modeling stage for parsing logs. If a log does not match any patterns, then it is reported as an anomaly.

Problem Definition. The log parsing problem using a set of patterns can be formalized as follows: given a set of m GROK patterns and a set of n logs, find out the log-to-pattern mappings. A naïve solution scans all m patterns to find a match for every log. This simple algorithm needs on average m/2 comparisons for the matched logs, while for the unmatched logs it incurs m comparisons. So, the overall time complexity is O(mn). LogLens aims to reduce the number of comparisons to O(1), thus the overall time complexity reduces to O(n).

Solution Sketch. LogLens leverages the fact that logs and patterns have common underlying datatypes representing their structures, thus it can build an index based on these structures to quickly find the log-to-pattern mapping. LogLens maintains an index in order to reduce the number of comparisons by using the following three steps:
1) Finding the candidate-pattern-group. To parse a log, LogLens first generates a log-signature by concatenating the datatypes of all its tokens. For example, for the log "2016/02/23 09:00:31.000 127.0.0.1 login user1" the corresponding log-signature would be "DATETIME IP WORD NOTSPACE". Next, LogLens finds out if there is a candidate-pattern-group which can parse the log-signature.
2) Building the candidate-pattern-group. If no group is found, LogLens first builds a candidate-pattern-group by comparing an input log's log-signature with all m GROK patterns using their pattern-signatures (explained later) to find all potential candidate patterns, and puts all candidate patterns in one group. In a group, patterns are sorted in ascending order of datatype generality and length (in terms of number of tokens). If no candidate pattern is found, then the candidate-pattern-group is set to empty. Next, LogLens adds this group to a hash index using the log-signature as the "key" and the candidate-pattern-group as the "value". Finally, it follows Step 3.
3) Scanning the candidate-pattern-group. If a candidate-pattern-group is found, LogLens scans all patterns in that group until the input log is parsed. If an input log cannot be parsed or the group has no patterns (i.e., is empty), then LogLens reports it as an anomaly.

Pattern-Signature Generation. LogLens generates a pattern-signature from each GROK pattern as follows. First, it splits a pattern into various tokens separated by white space characters. Next, it replaces every token by its datatype. For example, the token "%{DATETIME:P1F1}" is replaced by its datatype "DATETIME". If a datatype is not present in the token, then LogLens finds out the datatype of the token's present value. For example, the token "user1" is replaced by "NOTSPACE" using the RegEx rule defined in Table I. Thus, the pattern-signature of the GROK pattern "%{DATETIME:P1F1} %{IP:P1F2} %{WORD:P1F3} user1" would be "DATETIME IP WORD NOTSPACE".

How to compare a log-signature with a pattern-signature? If a log-signature is parsed by a pattern-signature, then the corresponding GROK pattern is added to the candidate-pattern-group. There are two cases to consider for the pattern-signature: without and with the ANYDATA datatype (i.e., wildcard). The first case (i.e., without) is easy to handle, while the second case is challenging due to the variability arising from the presence of the wildcard. LogLens solves this problem with a dynamic programming algorithm. It can be formally defined as follows: given a log-signature of length r tokens, L = <l1, l2, ..., lr>, and a pattern-signature of length s tokens, P = <p1, p2, ..., ps>, we have to find out if L can be matched by P. Let us define T[i, j] to be the boolean value indicating whether <l1, l2, ..., li> is parsed by <p1, p2, ..., pj> or not. This matching problem has optimal substructure, which gives the following recursive formula:

  T[i, j] = true                          if i = 0 and j = 0
  T[i, j] = T[i-1, j-1]                   if li = pj or isCovered(li, pj)
  T[i, j] = T[i-1, j] OR T[i, j-1]        if pj = * (i.e., ANYDATA)

Here, isCovered(li, pj) is a function which returns true if the RegEx definition corresponding to li's datatype is covered by the RegEx definition of pj's datatype. For example, isCovered("WORD", "NOTSPACE") returns true. In contrast, isCovered("NOTSPACE", "WORD") returns false. Based on the above formulation, LogLens uses dynamic programming to compute the solution in a bottom-up fashion as outlined in Algorithm 1. If T[r, s] is true, then LogLens adds the GROK pattern corresponding to P to the candidate-pattern-group.

IV. STATEFUL: LOG SEQUENCE ANOMALY DETECTOR

The log sequence anomaly detector detects abnormal log sequences in an event (or transaction). Here, we define an event as follows: an event is an independent operational work unit of
Algorithm 1 Dynamic Programming Algorithm
procedure ISMATCHED
  Input: String logSignature, String patternSignature
  Output: boolean (i.e., true/false)
  String L[] = logSignature.split(" ");
  String P[] = patternSignature.split(" ");
  boolean T[][] = new boolean[L.length+1][P.length+1];
  T[0][0] = true;
  for (int i = 1; i < T.length; i++) do
    for (int j = 1; j < T[0].length; j++) do
      if (L[i-1].equals(P[j-1])) then
        T[i][j] = T[i-1][j-1];
      else if (P[j-1].equals("ANYDATA")) then      ▷ Handling the wildcard case
        T[i][j] = T[i-1][j] || T[i][j-1];
      else if (isCovered(L[i-1], P[j-1])) then     ▷ Is the log-token covered by the pattern-token?
        T[i][j] = T[i-1][j-1];
      end if
    end for
  end for
  return T[L.length][P.length];
end procedure

Fig. 2: Sample event trace logs.

Fig. 3: Sample automaton for an event from the logs of Figure 2. It has the rules of min/max occurrence of each state s and min/max time duration of an event. Each state corresponds to a log in that event.
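Algorithm 1 translates directly into a few lines of Python. The sketch below mirrors its recurrence, and reduces isCovered to a small illustrative containment table rather than the full RegEx-containment check LogLens performs.

```python
# Illustrative subset of datatype containment; LogLens derives this from
# the RegEx definitions in Table I (e.g., WORD is covered by NOTSPACE).
COVERS = {("WORD", "NOTSPACE"), ("IP", "NOTSPACE")}

def is_covered(li: str, pj: str) -> bool:
    return (li, pj) in COVERS

def is_matched(log_signature: str, pattern_signature: str) -> bool:
    """Bottom-up DP over T[i][j]: can the first i log tokens be parsed by
    the first j pattern tokens? (Mirrors Algorithm 1.)"""
    L = log_signature.split(" ")
    P = pattern_signature.split(" ")
    T = [[False] * (len(P) + 1) for _ in range(len(L) + 1)]
    T[0][0] = True
    for i in range(1, len(L) + 1):
        for j in range(1, len(P) + 1):
            if L[i - 1] == P[j - 1]:
                T[i][j] = T[i - 1][j - 1]
            elif P[j - 1] == "ANYDATA":           # wildcard: absorb or skip a token
                T[i][j] = T[i - 1][j] or T[i][j - 1]
            elif is_covered(L[i - 1], P[j - 1]):  # log-token covered by pattern-token
                T[i][j] = T[i - 1][j - 1]
    return T[len(L)][len(P)]
```

Note that, as in Algorithm 1, only T[0][0] is seeded, so an ANYDATA that would have to match zero leading tokens does not match; a production implementation might additionally seed row 0 for leading wildcards.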
possible event ID content, it builds a list of (log pattern, field) pairs for all logs that have this ID content. This gives multiple lists, and LogLens builds a set of unique lists. If any list covers all log patterns discovered in the training logs, then LogLens assigns it to an event ID Field.

2) Event Automata Modeling: In this step, LogLens profiles automata with rules from logs using ID Fields. It scans through each log and extracts its ID Field and content. LogLens also keeps track of the log arrival time. For an ID Field content, it keeps a sorted list of log patterns with their fields. Finally, it merges them and builds automata with rules. An automaton is built with states. Each log pattern with its ID Field is a state which stores the log arrival time, the number of occurrences, etc. Each automaton has a begin, an end, and multiple intermediate states. LogLens also tracks the occurrence of the intermediate states and the duration between the begin and the end state. After building the automata, LogLens profiles the minimum and maximum of those statistics (min/max duration of an event, min/max occurrence of the intermediate states, etc.) and uses them as rules for detecting anomalies.

B. Anomaly Detection

LogLens collects incoming logs in real-time. First, it extracts the log pattern and ID from each log. It groups all logs that have a common ID Field. After that, it sorts the logs in each group based on their arrival time – this gives the incoming log sequence of an event. Next, it scans the logs in each group and validates them against the automata rules discovered during model learning. Logs in a sequence will be flagged as anomalies if they violate any of these rules. Table II shows the various anomaly types reported by LogLens.

Type  Anomaly
1     Missing begin/end state
2     Missing intermediate states
3     Min/Max occurrence violation of the intermediate states
4     Min/Max time duration violation in between the begin state and the end state

TABLE II: Sample log sequence anomalies.

V. LogLens AS A SERVICE: CHALLENGES AND SOLUTIONS

In this section, we highlight two key real-world deployment challenges that we encountered when implementing LogLens as a service using Spark [17]. We believe that these challenges and our proposed generic solutions will offer insights for building similar services in the near future.

A. Supporting Dynamic Model Updates

Challenges. Spark's data-parallel execution model uses broadcast variables to load models and distribute data to all workers. However, broadcast variables have been designed to be immutable and can only be updated before data stream processing is started. The only possible way to update a model in Spark is to re-initialize and re-broadcast the model data to all workers. Unfortunately, this process can lead to drastic consequences: 1) it introduces a downtime of several seconds, if not minutes, depending on the size of the cluster; 2) restarting the cluster requires rescheduling and redistribution of data and memory, leading to a significant decrease in the throughput of the cluster; and 3) if a stateful Spark streaming service is terminated, all the state data is lost, and losing states can have a significant impact on the efficacy of the anomaly detection algorithms. To eliminate any possibility of downtime or loss of state, the model update mechanism should meet at least the following two requirements: 1) the service must be up and running all the time, and 2) states must be preserved during model updates.

Solution. In LogLens, to update a broadcast variable (BV) at runtime, we modify Spark internals with the minimum possible changes. Our solution is capable of rebroadcasting the immutable BVs at runtime without job termination. A BV is a serializable data object that is a virtual data block containing a reference to the actual disk block where the variable resides. When a BV is used in a Spark program, it is shipped to each individual worker. During execution, whenever a worker requests the value of a BV using the getValue() method, Spark first looks into the local data-block-cache of the worker for the variable. If there is a cache miss, it sends a pull request to the driver (where the variable is initially stored) to get the value over the network. Once this variable is received, it is stored in the local-disk-block-cache of that worker. From then on, this cached value of the variable is used whenever the getValue() method is called.

To rebroadcast a BV which already resides in the local-disk-block-cache of individual workers, LogLens invalidates all locally cached values. Thus, whenever the getValue() method is called for that BV, a pull request is made to the driver. At the driver, when a pull request is received, rather than handing over the old value, the driver sends the updated value. The worker then receives the updated value and stores it in the local cache. From then on, the newly fetched local copy of the BV is used.

Whenever a new model is issued from the model manager, it is read from the model storage and enrolled into a queue. The scheduler then waits for the current job to finish. LogLens's dynamic model update implementation communicates with the block manager of each worker as well as the driver. It also tracks all BV identifiers to maintain the same ID for the updated BV, which is otherwise incremented at each update. This allows workers to retrieve the original BV after cache invalidation. Furthermore, LogLens implements a thread-safe queuing mechanism to avoid any race conditions due to the extreme parallelization of the Spark jobs.

Spark data processing is a queue-based execution of the data received in every micro-batch. In LogLens, the model update operation runs between these micro-batches in a serialized lock process. The model data is loaded into memory, and an in-memory copy operation loads the data to the BV. The execution proceeds as normal, and whenever the broadcast value is required, workers fetch a fresh copy from the master. The only blocking operation is the in-memory copy operation,
and hence the overhead is directly dependent on the size of the model. In practice, we find that this overhead is negligible and it does not incur any slow-down on LogLens.

B. Implementing Stateful Algorithms Efficiently

Expedited Anomaly Detection. LogLens focuses on real-time anomaly detection, thus it is essential to report anomalies as soon as they occur. At the same time, to allow for scalable and fast execution, LogLens uses data-parallel algorithms to distribute the processing workload of incoming logs across worker nodes. The data partitioning logic is only constrained by grouping together logs which have an inherent causal dependency on each other (i.e., same model, source, etc.) – this allows LogLens to optimize performance and to avoid performance bottlenecks as much as possible.

In stateless anomaly detection, each log is independent of the others; thus, when a log comes, anomalies can be reported to the user immediately. However, several real-world issues are potentially problematic, especially in the case of stateful anomalies, which depend on the previous states. Two of these issues are:
1) What if a transaction fails and no log comes at all from a source or for a particular key or pattern of the model? Essentially, the saved state is already "anomalous", but cannot be reported since we have no concrete log as evidence. In this case, the anomaly would never be reported.
2) Similarly, what if logs of certain automata are coming very infrequently (several hours apart)? This could be because of overload in the target system. In such a scenario, the anomaly may not be reported immediately for any countermeasures to be taken.

Traditional timeout based approaches cannot be used, as they use system time, which can be very different from "log time". The log timestamps may come faster or slower than the actual time progress within the LogLens system. Hence, only the log rate of embedded timestamps within the logs can be used to predict timestamps in the absence of logs. Furthermore, the key based mapping of states only allows similar keys to access or modify the state. Even if somehow LogLens receives an event that informs the program logic to flush the unnecessary states, there is currently no way to access the states without their keys.

Solution. To allow for expedited real-time anomaly detection, LogLens uses an external heartbeat controller. This controller generates a heartbeat message for every log source and periodically sends it to the anomaly detectors if the corresponding log agent is still active. The heartbeat message is embedded with a timestamp based on the last log observed and the rate of logs from that source. Hence, in the absence of logs, the heartbeat message provides the current time of the target systems and allows LogLens to proceed with the anomaly detection.

Efficient State Management. To enable efficient memory management of the open states, LogLens extends the Spark API (v1.6.1) to expose the reference of the state in a partition to the program logic. Within the program logic, the state-map
the state object. This method returns the reference to the state-map object where all the states of that partition are stored. For anomaly detection, this state-map is enumerated to find the states that are open and expired with respect to the current log time. Although LogLens does not have the key to an open state, it can still access that state and report anomalies which would otherwise go entirely undetected. However, because of the event-driven nature of Spark's stream processing, LogLens still needs a trigger on all partitions to handle the infrequent log arrival scenario.

Solution. As a remedy, LogLens also leverages the external heartbeat controller and a custom partitioner. LogLens periodically receives heartbeat messages from this controller to trigger the expired state detection procedure. This external message is sent to the same data channel (where logs arrive) with a specific tag to indicate that it is a heartbeat message. If such a message is observed in the program logic, the custom partitioner kicks in and broadcasts the same heartbeat message to all partitions. Whenever a heartbeat message is received, the anomaly detection algorithm iterates over its states to detect anomalies and clean up expired states. This procedure is performed at all the partitions on every worker, since the heartbeat message is duplicated and broadcast to each partition on the data channel.

VI. EXPERIMENTAL RESULTS

The goal of this section is to show experimental results to evaluate the functionality and effectiveness of LogLens.

Dataset. We use six different datasets covering various data-center operations for evaluation, as shown in Table III. In the table, we have a proprietary dataset D1 of trace logs of a data center (Figure 2 shows sample logs), a synthetic dataset D2, a storage server based dataset D3, an OpenStack based dataset D4 for an infrastructure-as-a-service deployment, a PCAP based dataset D5, and a proprietary dataset D6 covering network operations. We simulate these datasets as streams in our LogLens system.

Dataset  Type            Total logs (Training)  Total logs (Testing)
D1       Trace log       16,000                 16,000
D2       Synthetic       18,000                 18,000
D3       Storage Server  792,176                NA
D4       OpenStack [33]  400,000                NA
D5       PCAP [34]       246,500                NA
D6       Network         1,000,000              NA

TABLE III: Evaluation Dataset.

Experimental setup. We perform our tests on a Spark cluster with Spark Streaming. Our cluster has one master and eight worker nodes. We use Spark version 1.6.1 with Kafka version 0.9.0.1. For replaying log data, we have developed an agent,
can be accessed by calling the getParentStateMap() method on which emulates the log streaming behavior.
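To make the heartbeat mechanism of Section V concrete, the sketch below estimates a silent source's "log time" from its last observed timestamp and its historical log rate, and uses that estimate to expire open states. This is a minimal illustration under assumed names, not LogLens source code.

```python
# Illustrative sketch (not the LogLens implementation): a heartbeat
# controller predicts the current "log time" of a silent source from the
# last embedded timestamp and a smoothed inter-log gap; a detector-side
# sweep then expires open states against that predicted time.

class HeartbeatController:
    def __init__(self):
        self.last_ts = {}   # source -> embedded timestamp of last observed log
        self.avg_gap = {}   # source -> smoothed gap between log timestamps
        self.silent = {}    # source -> heartbeats sent since the last real log

    def observe(self, source, ts):
        # Called for every real log; updates the per-source rate estimate.
        if source in self.last_ts:
            gap = ts - self.last_ts[source]
            self.avg_gap[source] = 0.9 * self.avg_gap.get(source, gap) + 0.1 * gap
        self.last_ts[source] = ts
        self.silent[source] = 0

    def heartbeat(self, source):
        # Called periodically while the log agent is alive; returns a tagged
        # message carrying the predicted current "log time" of the source.
        self.silent[source] = self.silent.get(source, 0) + 1
        predicted = self.last_ts[source] + self.silent[source] * self.avg_gap.get(source, 0.0)
        return {"tag": "HEARTBEAT", "source": source, "ts": predicted}


def sweep_expired(state_map, hb, timeout):
    # Detector-side sweep: on a heartbeat, iterate over open states and
    # report those whose expected end log never arrived within `timeout`
    # units of log time; clean them up so memory does not grow unbounded.
    expired = [k for k, opened in state_map.items() if hb["ts"] - opened > timeout]
    for k in expired:
        del state_map[k]
    return expired  # reported as "missing end state" anomalies
```

In LogLens itself, the heartbeat travels on the same data channel as the logs and is broadcast to every partition; here a single in-process sweep stands in for that distributed step.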
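Section VI-A below credits most of the timestamp-detection speedup to caching (trying the last successful format first) and filtering (pruning formats that cannot structurally match). The following sketch illustrates the idea with three stand-in formats instead of the 89 predefined ones; all names are hypothetical.

```python
# Illustrative sketch of caching + filtering for timestamp-format
# detection. FORMATS is a stand-in for the knowledge-base of 89 formats.
from datetime import datetime

FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%b/%Y:%H:%M:%S", "%b %d %H:%M:%S"]

class TimestampDetector:
    def __init__(self, formats=FORMATS):
        self.formats = list(formats)

    def detect(self, token):
        # Filtering: a cheap structural check prunes formats that cannot match.
        candidates = [f for f in self.formats if self._plausible(token, f)]
        for fmt in candidates:
            try:
                ts = datetime.strptime(token, fmt)
            except ValueError:
                continue
            # Caching: move the winning format to the front so subsequent
            # logs from the same source hit it on the first try.
            self.formats.remove(fmt)
            self.formats.insert(0, fmt)
            return ts, fmt
        return None, None

    @staticmethod
    def _plausible(token, fmt):
        # Very rough length-based filter; a real filter could also test
        # leading characters (digit vs. month name) or delimiter positions.
        return abs(len(token) - len(datetime(2000, 1, 1).strftime(fmt))) <= 2
```

The cache exploits the fact that a given log source almost always keeps a single timestamp format, so the amortized cost approaches one parse attempt per log.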
A. Log Parser

Fast Timestamp Identification. LogLens has 89 predefined timestamp formats in its knowledge-base. From our experiments using the datasets in Table III, we find that by combining both caching and filtering, LogLens can detect timestamps up to 22x faster than a linear scan-based solution: 19.4x is contributed by caching, and the rest is contributed by filtering.

Fast Log Parsing. We compare LogLens against Logstash [16], a popular open-source log parsing tool, to show its efficiency. For these experiments, we use the D3, D4, D5, and D6 datasets, which use the same set of logs in both the training and testing phases for sanity checking; a correct parser does not produce any anomalies for these datasets. Using the LogMine [25] algorithm, we first generate a set of GROK patterns from the training logs; next, we parse the testing logs using these patterns; and we expect every testing log to match a GROK pattern, since the testing logs are the same as the training logs. Table IV shows that LogLens runs up to 41x faster than Logstash (v5.3.0) and handles a large number of patterns. Both LogLens and Logstash parse all training logs and produce the same parsing results. For the D4 and D6 datasets, Logstash did not generate any output even after running for more than 48 hours, and we eventually stopped it. The main reason is as follows: the D4 and D6 datasets produce 3234 and 2012 patterns, respectively, and Logstash is not suitable for handling such large pattern-sets.

Dataset | Total Patterns | Running Time (LogLens) | Running Time (Logstash) | Improvement
D3      | 301            | 109 sec                | 4550 sec               | 4074.31%
D4      | 3234           | 72 sec                 | NA                     | NA
D5      | 243            | 34 sec                 | 588 sec                | 1629.41%
D6      | 2012           | 170 sec                | NA                     | NA

TABLE IV: Results: LogLens vs. Logstash.

B. Log Sequence Anomaly Detector

The effectiveness of LogLens in easing the human log analysis burden requires detecting anomalies accurately. We also need to verify that the heartbeat controller helps to report anomalies in real-time, and that the model controller instantly reflects system behavior changes after a model update operation. Since the log sequence anomaly detector uses the output of the log parser, our evaluation consequently also demonstrates the log parser's efficacy in building advanced log analytics applications.

Accuracy. We use D1 and D2 to evaluate the accuracy of the log sequence anomaly detector because we have ground truth for them. Figure 4 shows that D1 originally has 21 anomalous sequences, and our detector identifies all of them; D2 originally has 13 anomalous sequences, and our detector identifies all of them (red bar). Thus, for both datasets, we get 100% recall.

Fig. 4: Log sequence anomaly detector results accuracy.

Heartbeat Controller. In LogLens, the heartbeat (HB) controller controls open states and helps to report missing end state anomalies immediately. With the HB controller, we expect to report more anomalies as soon as they occur. Figure 5 shows the performance results of our HB controller. For a certain time period, we run our anomaly detector on D1 and D2. If we do not use the HB controller, we detect 20 anomalies for D1 and 10 anomalies for D2. However, with the HB controller, we detect 21 anomalies for D1 and 13 anomalies for D2, and all of these extra anomalies are related to the missing end states. These results demonstrate that the HB controller is effective in immediately reporting anomalies.

Fig. 5: Anomaly detection with and without heartbeats.

Model Controller. LogLens provides a key feature: model update as a service. It can add, update, or delete models without restarting the running service. The goal of this experiment is to show that the number of anomalies changes after a model update, in order to verify the model controller's functionality. We run two sets of experiments. First, we build models using the training logs of D1 and D2. D1's model has two automata, while D2's has three automata. Using these models, we detect 21 anomalies for D1 and 13 anomalies for D2. Next, we modify both models by deleting an automaton from each, update the models through the model controller without service interruption, and rerun the tests. Table V shows that deleting an automaton reduces the number of anomalies from 21 to 13 for D1, and from 13 to 9 for D2. This behavior matches our intuition, as the second set of experiments should produce fewer anomalies because the models have fewer automata rules. Therefore, these two sets of experiments validate the functionality of the model update operation.

VII. REAL-WORLD CASE STUDIES

A. Analyzing Custom Application Logs

In this case-study, users want to analyze logs from a custom application. These logs record SQL queries issued in the
Dataset | Automata | Total Anomaly Count | Automata (after delete) | Total Anomaly Count (after delete)
D1      | 2        | 21                  | 1                       | 13
D2      | 3        | 13                  | 2                       | 9

TABLE V: Anomaly counts before and after deleting an automaton.
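The model-update experiment above (Table V) can be illustrated with a toy sequence model: a set of two-state automata, each expecting a begin pattern to be closed by an end pattern, where deleting an automaton drops exactly the anomalies only it could raise, without restarting the detector. All names and structures here are hypothetical, not LogLens internals.

```python
# Hypothetical sketch: a log-sequence model as a set of simple automata
# (begin pattern -> required end pattern). A sequence whose begin event is
# never closed by the matching end event is a "missing end state" anomaly.

class SequenceModel:
    def __init__(self, automata):
        # automata: dict mapping automaton name -> (begin_pattern, end_pattern)
        self.automata = dict(automata)

    def delete_automaton(self, name):
        # Model update as a service: remove a rule in place, no restart.
        self.automata.pop(name, None)

    def detect(self, events):
        # events: list of (pattern, key) tuples; returns the keys of begin
        # events whose matching end event never arrives.
        anomalies = []
        for name, (begin, end) in self.automata.items():
            open_keys = set()
            for pattern, key in events:
                if pattern == begin:
                    open_keys.add(key)
                elif pattern == end:
                    open_keys.discard(key)
            anomalies.extend(sorted(open_keys))
        return anomalies
```

Deleting an automaton shrinks the rule set and therefore the anomaly count, which is the behavior Table V reports for D1 (21 to 13) and D2 (13 to 9).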
REFERENCES
[1] S. Alspaugh, B. Chen, J. Lin, A. Ganapathi, M. Hearst, and R. Katz, “Analyzing
log analysis: An empirical study of user log mining,” in LISA14, 2014, pp. 62–77.
[2] G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy, “The unified logging infrastructure
for data analytics at twitter,” VLDB, vol. 5, no. 12, pp. 1771–1780, 2012.
[3] LATK, “Log Analysis Tool Kit,” http://www.cert.org/digital-intelligence/tools/latke.cfm, Aug. 2017.
[4] M. Du, F. Li, G. Zheng, and V. Srikumar, “DeepLog: anomaly detection and
diagnosis from system logs through deep learning,” in ACM Conference on
Computer and Communications Security (CCS), 2017.
[5] C. C. Michael and A. Ghosh, “Simple, state-based approaches to program-based
anomaly detection,” ACM Transactions on Information and System Security, vol. 5,
no. 3, Aug. 2002.
[6] E. Analyzer, “An IT Compliance and Log Management Software for SIEM,” https:
//www.manageengine.com/products/eventlog/, Aug. 2017.
[7] PlantLog, “Operator Rounds Software,” https://plantlog.com/, Aug. 2017.
[8] LogEntries, “Log Analysis for Software-defined Data Centers,” https://blog.logentries.com/2015/02/log-analysis-for-software-defined-data-centers/, Feb. 2015.
[9] X. Yu, P. Joshi, J. Xu, G. Jin, H. Zhang, and G. Jiang, “Cloudseer: Workflow
monitoring of cloud infrastructures via interleaved logs,” in ASPLOS’16. ACM,
2016, pp. 489–502.
[10] Q. Fu, J. G. Lou, Y. Wang, and J. Li, “Execution anomaly detection in distributed
systems through unstructured log analysis,” in Data Mining, 2009. ICDM’09. Ninth
IEEE International Conference on, 2009.
[11] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, “Detecting large-scale
system problems by mining console logs,” in ACM SIGOPS. ACM, 2009, pp.
117–132.
[12] S. He, J. Zhu, P. He, and M. R. Lyu, “Experience report: System log analysis
for anomaly detection,” 2016 IEEE 27th International Symposium on Software
Reliability Engineering (ISSRE), pp. 207–218, 2016.
[13] Splunk, “Turn Machine Data Into Answers,” https://www.splunk.com, Aug. 2017.
[14] ElasticSearch, “Open-Source Log Storage,” Aug. 2017. [Online]. Available:
https://www.elastic.co/products/elasticsearch
[15] Gartner, “Iot forecast,” http://www.gartner.com/newsroom/id/3598917, Aug. 2017.
[16] Logstash, “Log Parser,” Aug. 2017. [Online]. Available: https://www.elastic.co/
products/logstash
[17] A. Spark, “Lightning-fast cluster computing,” http://spark.apache.org/, Aug. 2017.
[18] F. Yang, J. Li, and J. Cheng, “Husky: Towards a more efficient and expressive
distributed computing framework,” VLDB Endowment, vol. 9, no. 5, pp. 420–431,
2016.
[19] A. Flink, “Scalable Stream and Batch Data Processing,” https://flink.apache.org/,
Aug. 2017.
[20] Samza, “Distributed stream processing framework,” http://samza.apache.org/, Aug.
2017.
[21] Message-Broker, “Apache kafka,” http://kafka.apache.org/, Aug. 2017.
[22] Kibana, “Visualization tool,” Aug. 2017. [Online]. Available: https://www.elastic.
co/products/kibana
[23] GROK, “Pattern,” Aug. 2017. [Online]. Available: https://www.elastic.co/guide/
en/logstash/current/plugins-filters-grok.html
[24] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy, “Sherlog: error
diagnosis by connecting clues from run-time logs,” in ACM SIGARCH, vol. 38,
no. 1. ACM, 2010, pp. 143–154.
[25] H. Hamooni, B. Debnath, J. Xu, H. Zhang, G. Jiang, and A. Mueen, “Logmine: Fast pattern recognition for log analytics,” in CIKM. ACM, October 2016.
[26] “Java regex,” https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html,
Aug. 2017.
[27] RegEx-Format, “Java SimpleDateFormat,” https://docs.oracle.com/javase/7/docs/
api/java/text/SimpleDateFormat.html, Aug. 2017.
[28] C. Ezeife and D. Zhang, “Tidfp: mining frequent patterns in different databases with
transaction id,” in International Conference on Data Warehousing and Knowledge
Discovery. Springer, 2009, pp. 125–137.
[29] “DISCO,” https://fluxicon.com/disco/.
[30] G. Cugola and A. Margara, “Processing flows of information: From data stream to
complex event processing,” vol. 44, no. 3. ACM, 2012, p. 15.
[31] A. Margara, G. Cugola, and G. Tamburrelli, “Learning from the past: automated
rule generation for complex event processing,” in Proceedings of the 8th ACM
International Conference on Distributed Event-Based Systems. ACM, 2014, pp.
47–58.
[32] I. Tudor, “Association rule mining as a data mining technique,” Seria Matematic
Informatic Fizic Buletin, vol. 1, pp. 49–56, 2008.
[33] IaSS, “Openstack,” https://en.wikipedia.org/wiki/OpenStack, Aug. 2017.
[34] PCAP, “Packet Capture,” https://en.wikipedia.org/wiki/Pcap, Aug. 2017.
[35] Spoofing-Attack, “Spoofing attack,” https://en.wikipedia.org/wiki/Spoofing_attack, Aug. 2017.
[36] SS7, “Signalling system no. 7,” https://en.wikipedia.org/wiki/Signalling_System_No._7, Aug. 2017.