Evaluating Automatically Generated Yara Rules and Enhancing 15tho7h74a
Abstract—Emerging as a widely accepted technique for malware analysis, YARA rules, due to their flexible and customisable nature, allow malware analysts to develop rules according to the requirements of a specific security domain. YARA rules can be generated automatically using tools; however, they may require post-processing for their optimisation and may not be effective for the specific security domain. This creates the need to enhance automatically generated YARA rules and increase their effectiveness for malware analysis without increasing computational overheads. Reflecting on this requirement, this paper first evaluates automatically generated YARA rules using three YARA tools: yarGen, yaraGenerator and yabin. These are Python-based open-source tools that generate YARA rules automatically using different underlying techniques. Subsequently, it proposes a method to enhance automatically generated YARA rules using a fuzzy hashing method. The proposed enhancement method can improve the effectiveness of YARA rules irrespective of the YARA tool used to generate them, which is demonstrated through several experiments on samples of collected malware and goodware.

Index Terms—Malware Analysis; YARA Rules; Fuzzy Hashing; yarGen; yaraGenerator; yabin; Ransomware; Indicator of Compromise; IoC String.

I. INTRODUCTION

The accelerating rate of malware incidents on a daily basis indicates the magnitude of the problem in malware analysis. While malware analysts detect many malware attacks and incidents, keeping pace with the number and variety of attacks poses a significant challenge. There is no silver bullet with respect to malware: no single malware analysis technique has the capability to treat all malware incidents, and as a result analysts select the most suitable technique for the specific security incident under consideration [1]. In recent years, the YARA rules technique has emerged as a widely accepted technique for malware analysis due to its flexible and customisable nature, allowing malware analysts to develop YARA rules according to their specific requirements in targeting specific types of threats [2]. YARA rules are generated based on reverse engineering of malware samples to include the most common Indicator of Compromise (IoC) strings from those malware samples to find similar types of malware.

The success of YARA rules depends on the effectiveness of the generated rules, which is determined by the types and the number of IoC strings utilised in the rules [3]. Therefore, the generation of the most effective YARA rules is the biggest challenge in applying YARA rules for malware analysis [4]. YARA rules can be generated either manually or automatically. Generating YARA rules manually requires a highly specialised skill set in a specific security area, whereas generating YARA rules automatically using a tool is a relatively easy task [5]. However, there are several issues with automatically generated YARA rules: these rules require post-processing operations for their optimisation, and even then they may not be very effective for certain types of threats [4], [5]. This drives the requirement to enhance YARA rules and make them more effective for malware analysis. There are a number of ways to achieve this aim; however, any chosen mechanism should not increase the computational overheads, as certain types of YARA rules may slow down the operation when applied to a large sample of malware [2], [6], [7]. Reflecting on the above requirement and the further enhancement of YARA rules, this paper first evaluates automatically generated YARA rules using three YARA tools: yarGen, yaraGenerator and yabin. These are Python-based open-source tools that generate YARA rules automatically using different underlying techniques. Subsequently, it proposes a method to enhance automatically generated YARA rules using a fuzzy hashing method. The proposed enhancement method can improve the effectiveness of YARA rules irrespective of the YARA tool used to generate them [8], which is demonstrated through several experiments on the collected malware and goodware samples.

The paper is divided into the following sections: Section II discusses YARA rules and fuzzy hashing as the underlying methods. Section III describes the three employed tools for automatically generating YARA rules: yarGen, yaraGenerator and yabin. Section IV explains the collection and verification
process of ransomware and goodware samples. Section V evaluates automatically generated YARA rules using the yarGen, yaraGenerator and yabin tools. Section VI presents the proposed enhancement process for YARA rules generated automatically by the above tools, employing the fuzzy hashing method SSDEEP. Section VII explores advantages and limitations of YARA rules. Lastly, Section VIII concludes the paper and outlines some future work.

II. YARA RULES AND FUZZY HASHING

A. YARA Rules

YARA rules are developed to detect malware by matching its signatures/strings with existing malware signatures/strings [3], [9]. These rules contain predetermined signatures/strings related to known malware, which are matched against the targeted files, folders, or processes [10]. YARA rules consist of three sections: meta, strings and condition, as shown in Figs. 1 and 2. Strings can be classified into three types: text strings, hexadecimal strings and regular expression strings. Text strings are generally readable text complemented with modifiers (e.g., nocase, ascii, wide, and fullword) to manage the matching process more effectively [11]. Hexadecimal strings are sequences of raw bytes that support three flexible formats: wild-cards, jumps, and alternatives [11]. Regular expression strings are similar to text strings in that they are readable text complemented with modifiers; they have been available since version 2.0 and increase the capability of YARA rules [11]. Text strings and regular expressions can also express a sequence of raw bytes through the use of escape sequences. The final part of a YARA rule is the rule condition, which specifies the number of signatures/strings that must match the target to declare the sample as malware [12]. YARA conditions determine whether or not to trigger the rule; they are Boolean expressions similar to those used in other programming languages [11].

B. Fuzzy Hashing

Fuzzy hashing is used to determine the similarity between digital files, which makes it a very useful method for malware analysis, as several pieces of malware and their variants possess some similarity with each other. This similarity is not detected by a cryptographic hash, which has a binary outcome, i.e., the two files are either exactly identical or not [13], [14]. In a fuzzy hashing technique, the file of interest is split into several blocks and a hash is calculated for each block separately; finally, the hashes of all the blocks are concatenated to obtain the fuzzy hash of that file (see Fig. 3). A number of factors affect the size of the fuzzy hash of a file, including the block size, the size of the file and the output size of the chosen hash function [15]. Fuzzy hashing methods are divided into different types, namely: Context-Triggered Piecewise Hashing (CTPH), Statistically-Improbable Features (SIF), Block-Based Hashing (BBH) and Block-Based Rebuilding (BBR) [16], [17], [18]. Forensic analysis of malware requires a thorough knowledge of the degree of similarity between known malware and inert files to assess files for their threat potential [19]. This is especially important when considering the analysis and clustering of suspected malware in order to discover new variants [20], [21]. As a result, the similarity-preserving property of fuzzy hashing is useful when comparing unknown files with known malware families, where samples possess similar functionality yet different cryptographic hash values [22].
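The block-wise construction described under Fuzzy Hashing (split the file into blocks, hash each block, concatenate the block hashes) can be sketched in a few lines of Python. This is a simplified illustration only: it uses fixed-size blocks and a truncated SHA-256 digest per block, whereas SSDEEP is a CTPH scheme that chooses block boundaries with a rolling hash. The function names `piecewise_hash` and `similarity`, the 64-byte block size and the use of `difflib` for scoring are all assumptions made for this sketch.

```python
import hashlib
from difflib import SequenceMatcher


def piecewise_hash(data: bytes, block_size: int = 64) -> str:
    """Hash each fixed-size block separately and concatenate the short digests."""
    pieces = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # Keep only 2 hex characters per block so the signature stays short.
        pieces.append(hashlib.sha256(block).hexdigest()[:2])
    return "".join(pieces)


def similarity(sig_a: str, sig_b: str) -> float:
    """Return a score between 0 and 1 for two piecewise signatures."""
    return SequenceMatcher(None, sig_a, sig_b).ratio()


# Toy data standing in for a malware sample and a slightly modified variant.
original = bytes(range(256)) * 8      # 2048 bytes, i.e. 32 blocks of 64 bytes
modified = bytearray(original)
modified[100] ^= 0xFF                 # change a single byte in the second block
variant = bytes(modified)

sig_original = piecewise_hash(original)
sig_variant = piecewise_hash(variant)

# The cryptographic hashes of the whole files differ completely ...
crypto_equal = hashlib.sha256(original).digest() == hashlib.sha256(variant).digest()
# ... while the piecewise signatures remain largely the same.
score = similarity(sig_original, sig_variant)
print(crypto_equal, round(score, 3))
```

Because only one block changes, at most one two-character piece of the signature differs, so the similarity score stays close to 1 even though a whole-file cryptographic hash reports the two inputs as simply different.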
Fig. 1. YARA Rules: Syntax
Fig. 2. YARA Rules: Example

A. yarGen Tool

yarGen is a Python-based tool for generating YARA rules, developed by Florian Roth [23]. It generates YARA rules utilising some intelligent techniques such as fuzzy regular expressions, a Naive Bayes classifier and a Gibberish Detector [24]. The generated YARA rules include those strings and opcodes from malware which do not match the provided goodware databases [23]. These YARA rules contain a predefined number of strings (generally up to 20 strings), selected by their highest scores, to maintain a reasonable operational speed. This tool generates two types of rules, basic rules and super rules, depending on the malware sample types: basic rules generally target a specific malware sample, and super rules target a set of malware or a malware family.
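The selection idea described for yarGen (drop strings that also occur in goodware, score the remainder, keep only the highest-scoring strings up to a fixed limit) can be sketched as follows. This is not yarGen's code: the scoring heuristic `toy_score`, the keyword list, the function name `select_strings` and the sample strings are all hypothetical and serve only to show the filter, score and truncate pattern.

```python
from typing import Iterable, List, Set


def toy_score(s: str) -> float:
    """Hypothetical score: longer strings and suspicious keywords rank higher."""
    score = min(len(s), 40) / 10.0
    for keyword in ("encrypt", "ransom", ".onion"):
        if keyword in s.lower():
            score += 5.0
    return score


def select_strings(candidates: Iterable[str],
                   goodware: Set[str],
                   max_strings: int = 20) -> List[str]:
    """Filter out goodware strings, then keep the top-scoring candidates."""
    unique = {s for s in candidates if s not in goodware}
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(unique, key=lambda s: (-toy_score(s), s))
    return ranked[:max_strings]


# Toy inputs: strings extracted from a sample and a tiny goodware string set.
malware_strings = [
    "kernel32.dll",
    "Your files have been encrypted",
    "GetProcAddress",
    "paysite.onion",
    "ransom_note.txt",
    "abc",
]
goodware_strings = {"kernel32.dll", "GetProcAddress"}

selected = select_strings(malware_strings, goodware_strings, max_strings=3)
print(selected)
```

Capping the number of strings trades some coverage for matching speed, which is the motivation the text gives for the limit of roughly 20 strings per rule.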
Fig. 4. yarGen Generated YARA Rule

1) Advantages- yarGen Tool:
• It allows generation of YARA rules based on both opcodes and strings.
• It supports the use of PE (portable executable) modules, which are used by the Windows operating system for executables such as DLL and COM files.
• It can be integrated with other anti-malware software for more effective use.
• It reduces false positives by checking all strings against strings of goodware databases.
• The Python script is simple and easy to use through a command line interface.

2) Drawbacks- yarGen Tool:
• It requires post-processing of rules to increase their effectiveness.

B. yaraGenerator Tool

yaraGenerator is a Python-based tool for generating YARA rules, developed by Chris Clark [25]. It generates YARA rules with a completely different signature for different types of files, such as EXEs, PDFs and Emails, utilising string prioritization logic and code refactoring [25]. The generated YARA rules consist of strings only, including those strings from malware which do not match the provided blacklist of strings [25]. It uses a database of 30,000 blacklisted strings divided by file format. These YARA rules contain a large number of strings (depending on the types of samples), selected randomly, as the tool does not compute a score or weighting for strings.

1) Advantages- yaraGenerator Tool:
• It can generate specialised rules for a specific file format.
• It supports the use of PE (portable executable) modules, which are used by the Windows operating system for executables such as DLL and COM files.
• It reduces false positives by checking all strings against strings of blacklist files.
• The Python script is simple and easy to use through a command line interface.

2) Drawbacks- yaraGenerator Tool:
• It requires post-processing of rules to increase their effectiveness.

2) Drawbacks- yabin Tool:
• It requires post-processing of rules to make them more effective.
• It may not work on some specific file formats.
• It only uses functions and does not use other types of