
Malware Analysis using Machine Learning and Deep Learning techniques

Rajvardhan Patil
Department of Computer & Information Science
Arkansas Tech University
Russellville, AR, USA
[email protected]

Wei Deng
College of Information Science & Technology
University of Nebraska Omaha
Omaha, NE, 68106
[email protected]

SoutheastCon 2020 | 978-1-7281-6861-6/20/$31.00 ©2020 IEEE | DOI: 10.1109/SoutheastCon44009.2020.9368268

Abstract— In this era, where the volume and diversity of malware are rising exponentially, new techniques need to be employed for faster and more accurate identification of malware. Manual heuristic inspection in malware analysis is neither effective in detecting new malware nor efficient, as it fails to keep up with the high spreading rate of malware. Machine learning approaches have therefore gained momentum. They have been used to automate static and dynamic analysis, where malware with similar behavior is clustered together and, based on proximity, unknown malware is classified into its respective family. Although many such research efforts have applied data-mining and machine-learning techniques, in this paper we show how the accuracy can be further improved using deep learning networks. As deep learning offers superior classification by constructing neural networks with a higher number of potentially diverse layers, it leads to improvement in the automatic detection and classification of malware variants. In this research, we present a framework which extracts various feature-sets such as system calls, operational codes, sections, and byte codes from the malware files. In the experimental results section, we compare the accuracy obtained from each of these features and demonstrate that the feature vector for system calls yields the highest accuracy. The paper concludes by showing how the deep learning approach performs better than traditional shallow machine learning approaches.

Keywords—malware detection, malware analysis, deep learning, machine learning

I. INTRODUCTION

Malware is malicious software that infiltrates the security, integrity, and functionality of a system [2] without the user's consent to fulfill the harmful intent of the attacker [8, 15]. There are different types of malware such as Virus, Worm, Trojan-horse, Rootkit, Backdoor, Botnet, Spyware, Adware, etc. [1]. Anti-virus software is used to detect and prevent malware from executing; it applies a signature-matching algorithm to identify known threats.

The anti-virus software uses a signature database to detect malware. Here, a signature is generated for each known malware sample [7], so that the software can later identify that malware. The signature is a short string of bytes which is unique for each malware. The anti-virus system scans through the files, generates their signatures, and checks whether they exist in the database. If there is a match, the file under consideration is malware. Although this system correctly classifies known malware, it cannot detect unknown or new malware, as their signatures are missing from the database. In addition, even for known malware, attackers can use techniques such as obfuscation, polymorphism, and encryption [2, 16, 23] to dodge firewalls, gateways, and antivirus systems. Commonly used obfuscation techniques include dead-code insertion, register reassignment, subroutine reordering, instruction substitution, code transposition, and code integration [1]. It is also trivial for malware writers to evade such systems simply by deriving a slight variant of existing malware. As per [17], thousands of new malicious samples are introduced every day, and traditional signature-based systems fail to be effective when it comes to detecting unknown malware. Signature-based systems hardly pose any danger to worms exploiting zero-day attacks [22].

Therefore, to combat this issue, static and dynamic analysis approaches are used, which can identify a variation of an already known threat. Features derived from the analysis are used to group malware and to classify unknown malware into the existing families. In dynamic analysis the malicious code is executed (in a controlled environment), whereas in static analysis the code is inspected but not executed.

Static analysis is used to detect patterns and extract information such as strings, n-grams, byte sequences, opcodes, and call-graphs. Disassembler tools are used to reverse engineer Windows executables and generate the assembly instructions. In addition, memory dumper tools are used to obtain protected code (located in system memory) and to analyze packed executables that are otherwise difficult to disassemble. If required, the executable needs to be unpacked and decrypted before analysis. But techniques such as obfuscation, encryption, polymorphism, and metamorphism can be used to thwart the reverse compilation process, making static analysis a difficult choice [3, 16, 18, 19]. In addition, when source code is compiled into binary format, some information such as data structure sizes or variables gets lost [7]. To overcome these limitations, dynamic analysis is used, as it is less vulnerable to obfuscation techniques.

In dynamic analysis, the malicious code is executed in a controlled or virtual environment. Tools such as Process Monitor, Process Explorer, Wireshark, and Regshot [1] are used to analyze the behavior of the code in execution. Behavior such as function calls, function parameters, information flow, instruction traces, etc. is monitored [7]. Additional run-time information like transmitted network traffic, length of execution, and changes made in the file system [3] can also be noted. In dynamic analysis there is no need to
disassemble the executable. However, this approach is time and resource consuming. Additionally, some malware can behave differently in a virtual environment than in a real one, making it harder to trace [1]. Furthermore, the malware behavior might get triggered only under certain conditions, such as a specific system date. Therefore, dynamic analysis techniques can be easily evaded by malware that is aware of its execution conditions and computing environment. In dynamic analysis, as the files are executed, their behavior is noted down in a feature vector space that captures their pattern.

Although static and dynamic approaches can be used for analysis, to restrict the number of samples requiring close human analysis, several machine-learning algorithms have been used to automate the malware analysis and classification steps. Machine learning techniques (such as clustering and classification) are used to study the patterns obtained from static and/or dynamic analysis and to categorize unknown malware into the respective families.

Detecting whether a file is malware or not is a classification problem. Several different machine-learning algorithms are used to classify malware, such as Naïve Bayes, Decision Tree, and Support Vector Machine. In this machine learning approach, the dataset usually consists of files, and the label represents whether a file is malware or benign. This dataset is divided into training and testing sets. The training set is used to train a particular model, and cross-validation is used for better evaluation of the model. Once the model is trained, it is applied to the testing set. Here the model is unaware of the labels; it predicts the label for each file. The accuracy is then computed based on how many files were correctly classified. However, these machine-learning approaches have a shallow learning architecture, and the accuracy can be further improved using deep learning, which has a superior ability for automatic feature learning [2].

In the deep learning technique, multiple layers are added to extract features from the lowest level to the highest level [20]. Each layer identifies a certain type of feature and forwards it to the next layer, which combines these lower-level features to compose higher-level features, and so on. The features are aggregated as they are forwarded, and eventually the final layer in the model is able to classify whether the file is malign or benign. Unlike machine learning, where the feature set needs to be fed to the network, in deep learning the model is able to extract the features itself. In this paper, we apply both machine learning and deep learning methods on the dataset and compare the results.

The rest of the paper is structured as follows. Section 2 introduces our proposed deep learning approach for malware detection. Section 3 describes the feature extraction and analysis step. Section 4 evaluates the performance of our proposed method in comparison with alternative malware detection methods. Section 5 presents the related work, followed by Section 6, which concludes the paper.

II. METHODOLOGY

A. Data Collection

We used Microsoft's malware dataset available on the Kaggle website [4]. It consists of 10868 malware files representing a mix of 9 different families. Each malware file has an Id, a 20-character hash value uniquely identifying the file, and a class label indicating what type of malware the file belongs to. The label is an integer representing one of 9 family names: Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY, and Gatak.

The malware files come in two formats: .asm and .bytes. The asm-file is the output of the IDA disassembler tool, which displays the malware code as assembly instructions and provides various metadata extracted from the binary, such as function calls, strings, etc. The byte-file is the raw data containing the hexadecimal representation of the file's binary content without the PE header. The goal of the classifier was to classify the given malicious files into the correct families.

B. Preprocessing

Before we begin the analysis phase, the malware dataset needs to be mapped to a feature vector space. These feature vectors in turn can be used by the algorithms for classification. From the malware dataset, we derived four types of feature sets:

• frequency of bytecodes, where bytecodes are the hexadecimal codes ranging from 00 to FF.

• frequency of opcodes, where operational codes are the machine language instructions that specify the operation to be performed. For example: ADD, SUB, CMP, etc.

• frequency of sections, where a section is the smallest unit of an object file that can be relocated. For example: .init, .text, .bss, etc.

• one-hot encoding for system calls, where system calls are the API calls invoked by a user program to request kernel services. In this feature set, the vector length equals the number of distinct API calls in the malware dataset. A flag was set to one if a particular kernel API call was invoked from that malware file, else it was set to zero.

After extracting the feature sets, the data is transformed using the min-max scaler technique. Next, the data is shuffled and split into training and testing sets: 80% (8694) of the data was used for training the classifier, and 20% (2174) for testing it. To avoid overfitting, cross-validation is used, which splits the training set into multiple smaller training sets and a validation set. The model is trained against the smaller training sets and evaluated against the validation set before finally being tested on the test set.

C. System Overview

The processed feature sets are forwarded as inputs to the machine learning algorithms and neural networks, which classify the unknown malware (from the test dataset) into one of the predefined malware families. We use TensorFlow's Keras framework to build our DNN model. In our setup, we used NVIDIA's TITAN V GPU and the CUDA 9 platform to expedite the training process: as GPUs are designed for fast execution of linear algebra operations, such as matrix multiplications, we used them to speed up neural network training. Fig. 1 below represents the system overview.
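The preprocessing steps described above (min-max scaling of the frequency features to [0, 1], shuffling, and an 80/20 train/test split) can be sketched as follows. This is an illustrative sketch, not the authors' code; the toy `X`/`y` arrays stand in for the real feature vectors and family labels.

```python
# Sketch of the preprocessing pipeline from Section II-B (assumed shapes;
# 129 columns mimic the opcode-frequency feature set, labels are 1..9).
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 500, size=(100, 129)).astype(float)  # e.g. opcode counts
y = rng.integers(1, 10, size=100)                        # 9 malware families

# Min-max scaling per feature column (constant columns map to 0).
span = X.max(axis=0) - X.min(axis=0)
X_scaled = (X - X.min(axis=0)) / np.where(span == 0, 1, span)

# Shuffle, then split 80/20 into training and testing sets. (In practice
# the scaler statistics should come from the training portion only, to
# avoid leaking test-set information.)
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_train, X_test = X_scaled[idx[:cut]], X_scaled[idx[cut:]]
y_train, y_test = y[idx[:cut]], y[idx[cut:]]
```

The same split could equally be done with scikit-learn's `MinMaxScaler` and `train_test_split`; the manual version above just makes the two operations explicit.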
Fig. 1. System Overview (Malware dataset → Preprocessing (feature extraction) → Standardization of data and data partitioning (train and test) → Neural Network → Family#1 … Family#9)

D. Performance Measure

To measure the performance of the machine learning algorithms and the deep learning neural network (DNN), we used the confusion matrix to compute precision, recall, and F1-score. We also look at TPR and FPR. Here is the explanation of the measures we used in the context of malware classification:

• TP: True Positives, the number of files correctly classified as malicious.

• TN: True Negatives, the number of files correctly classified as benign.

• FP: False Positives, the number of files mistakenly classified as malicious.

• FN: False Negatives, the number of files mistakenly classified as benign.

• TPR: True Positive Rate, also called Recall, indicates the percentage of total relevant results (malicious files) that were correctly classified: TP / (TP + FN).

• FPR: False Positive Rate indicates the ratio of negative instances that are incorrectly classified as positive: FP / (FP + TN).

• Precision indicates the percentage of the results which are relevant: TP / (TP + FP).

• F1-Score: harmonic mean of precision and recall: 2 × (precision × recall) / (precision + recall).

• Accuracy: percentage of malicious and benign files that were correctly classified: (TP + TN) / (TP + TN + FP + FN).

III. FEATURE EXTRACTION AND ANALYSIS

A. Byte Codes

One of the representations of the input file was in hex-dump format. We extracted the byte codes and counted their frequency in each file. They ranged from 00 to FF; in addition, there was one special token (??), bringing the feature count to 257. Each file was represented as the frequency of these byte counts. After using this feature set for classification, we got the results shown in Table I.

TABLE I: PRECISION, RECALL, F1-SCORE FOR BYTE CODES

Algorithm            Precision  Recall  F1-Score
DecisionTree         91.45      93.28   92.25
SGD                  65.72      64.71   64.48
Kneighbors           90.11      92.3    90.98
Logistic Regression  59.9       44.31   42.06
RandomForest         94.87      95.77   95.27
Naïve Bayes          51.58      57.58   48.78
SVC                  13.56      15.31   11.09
DNN                  97.14      95.31   96.09

B. Opcode

The total number of operational codes (opcodes) we came across was 129. We counted their frequency in each file, and each file was represented as the frequency of these opcode counts. After using this feature set for classification, we got the results shown in Table II.

TABLE II: PRECISION, RECALL, F1-SCORE FOR OPCODES

Algorithm            Precision  Recall  F1-Score
DecisionTree         89.88      90.76   90.3
SGD                  66.1       43.44   44.87
Kneighbors           87.21      87.1    87.14
Logistic Regression  54.54      34.22   37.28
RandomForest         96.1       90.98   92.8
Naïve Bayes          55.39      52.45   41.62
SVC                  36.46      13      8.19
DNN                  97.42      95.18   96.14

C. API Calls

For the system calls, because of the space limitation we focused only on the top 1500 calls. We sorted the system calls in descending order based on the frequency with which they were called in the entire dataset, and selected the top 1500. Each file was represented using these top 1500 API calls as its feature vector: a flag was set to one if a particular API call was invoked from that malware file, else it was set to zero. After using this feature set for classification, we got the results shown in Table III.

TABLE III: PRECISION, RECALL, F1-SCORE FOR API-CALLS

Algorithm            Precision  Recall  F1-Score
DecisionTree         92.51      92.45   92.36
SGD                  96.4       92.93   94.3
Kneighbors           92.98      85.14   86.46
Logistic Regression  96.66      93.02   94.31
RandomForest         95.76      90.42   92.03
Naïve Bayes          80.99      81.64   80.11
SVC                  80.7       79.51   78.71
DNN                  98.36      98.37   98.3

D. Sections

We counted the distinct types of sections in the .asm format of the malware files and were able to extract 443 unique types. Each file was represented as the frequency of these section counts. After using this feature set for classification, we got the results shown in Table IV.

TABLE IV: PRECISION, RECALL, F1-SCORE FOR SECTIONS

Algorithm            Precision  Recall  F1-Score
DecisionTree         93.27      92.56   93.14
SGD                  82.11      48.89   52.79
Kneighbors           96.98      87.15   87.84
Logistic Regression  72.19      40.86   44.28
RandomForest         98.63      92.3    94.17
Naïve Bayes          76.72      51.79   49.65
SVC                  14.15      11.58   5.68
DNN                  98.75      98.57   98.65
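The per-class measures reported in Tables I–IV follow the definitions given in Section II-D. As a minimal sketch (not the authors' code), they can be computed from a multi-class confusion matrix and macro-averaged like this:

```python
# Per-class precision, recall, and F1 from a confusion matrix, following
# the definitions in Section II-D (TP/(TP+FP), TP/(TP+FN), harmonic mean).
# The toy y_true/y_pred below are assumptions for illustration.
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1              # rows: true class, columns: predicted
    return cm

def macro_scores(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp       # predicted as class c but actually other
    fn = cm.sum(axis=1) - tp       # actually class c but predicted as other
    precision = tp / np.maximum(tp + fp, 1)            # TP / (TP + FP)
    recall    = tp / np.maximum(tp + fn, 1)            # TP / (TP + FN) = TPR
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision.mean(), recall.mean(), f1.mean()

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
p, r, f = macro_scores(confusion_matrix(y_true, y_pred, 3))
```

In the 9-family setting of this paper, `n_classes` would be 9 and the averaging would run over the nine families rather than the three toy classes.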

IV. EXPERIMENTAL RESULTS

When it comes to malware analysis, most research has dealt with binary classification, where a file needs to be classified as either malign or benign. Our work, in contrast, deals with multi-class classification, where the specific family of the given malware file needs to be predicted.

As we are dealing with a multi-class classification problem, 9 binary classifiers are created in the background by the machine learning algorithms, one for each class. When classifying a file, the class whose classifier outputs the highest score is selected. However, the Random Forest classifier can directly classify instances into multiple classes and therefore doesn't create binary classifiers in the background. In our evaluation we used 3-fold cross-validation: two thirds of the samples are used to train the model, and one third to evaluate it. Below, we compare the accuracy, TPR, and FPR of the deep-learning approach to the shallow-learning based machine learning approaches. The results in Fig. 2, Fig. 3, Fig. 4, Fig. 5, and Fig. 6 show that the DNN outperforms the traditional machine-learning approaches. In the deep learning approach, we see how the back-propagation and gradient descent techniques help improve the weights, minimize the loss, and thus increase the overall accuracy of the model.

A. Accuracy

Fig. 2. Accuracy of Algorithms

B. Byte-Code

Fig. 3. TPR and FPR for Bytecodes

C. Sections

Fig. 4. TPR and FPR for Sections

D. Op-Code

Fig. 5. TPR and FPR for Opcodes

E. System-Calls

Fig. 6. TPR and FPR for System-calls

Experimental results show that our deep learning malware classification method is able to achieve high classification accuracy. From the analysis, it can be inferred that the back-propagation and gradient descent mechanisms in deep learning helped the algorithm improve accuracy and TPR and reduce FPR, classifying the malware files more accurately. Gradient descent is used to tweak parameters iteratively in order to minimize the loss or cost function. In gradient descent, the back-propagation algorithm first makes a prediction (forward pass) on the training instances; that is, it computes the output of every neuron in each consecutive layer. Then it measures the error (the difference between actual and desired output). It
then goes through each layer in the backward direction to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error. In the shallow machine learning approaches, as there is a single layer to compute the accuracy, the weights don't get readjusted and therefore the accuracy doesn't get refined further.

V. RELATED WORK

Machine learning techniques are widely used in various domains, such as wildfire detection [33], sentiment analysis [34], PVC pipe crack detection [35, 36, 37], and recommendation systems [38], to name a few. In this paper we apply these techniques to the malware domain.

Normally, variations of malware go undetected by traditional anti-virus software, as their signatures are missing from the database. To fix this problem, machine-learning algorithms are used to capture these small variations with similar behavioral patterns and to classify them into their known families. In this section, we discuss state-of-the-art techniques that use machine learning algorithms for malware classification.

In [9], three different features were extracted using static analysis: system resource information, strings, and byte sequences. The resource information comprised the list of DLLs used by the binary, the list of DLL function calls, and the number of different system calls used within each DLL. They used three algorithms: RIPPER, Naive Bayes, and a Multi-Classifier system. Reference [11] used n-grams of byte codes as features and applied machine learning algorithms such as naive Bayes, decision trees, support vector machines, and boosting, where boosted decision trees outperformed the other methods. Reference [10] proposed a method to visualize malware using image processing techniques and used K-nearest neighbors to classify them. Reference [12] presented a framework that used call graphs as features and applied distance metrics to evaluate the similarity between call graphs of malware programs; malware samples belonging to the same family clustered together. Reference [13] used two aspects of functions to classify Trojans: the length of the function measured in bytes, and the frequency of the function length. Their results indicate that function length along with its frequency is significant in identifying a malware family.

Reference [14] identified the issue with supervised learning that one has to identify and prepare labels for all the data. It instead focused on a semi-supervised learning approach, which identifies malware using a set of labeled and unlabeled instances. Reference [22] used variable-length instruction sequences to identify worms and applied machine learning algorithms like decision trees, bagging, and random forests to distinguish worms from clean programs. Reference [6] proposed an incremental approach for behavior-based analysis that combined clustering and classification models, which significantly reduced the run-time overhead: clustering was used to identify and group novel classes of malware with similar behavior, and unknown malware was classified into these classes using classification. Reference [24] analyzes a graph constructed using data obtained from dynamic analysis; a similarity matrix is created using Gaussian and spectral kernels, and a support vector machine is trained on this similarity matrix to classify the data. Reference [25] proposed a novel approach to cluster malware samples, where partitions or subsets of programs that exhibit similar behavior were created. Reference [27] used an automated tool to extract system-call sequences from binaries while they were running in a virtual environment. They used classifiers from the WEKA library to discriminate malicious files from benign files, as well as to classify the malware into their respective families.

Reference [28] pointed out how different anti-virus software characterized malware in inconsistent and incomplete ways and failed to be concise in their semantics across the board. To address this issue, they proposed a new classification technique that described malware behavior in terms of system state changes (e.g., files written, processes created) rather than in sequences or patterns of system calls. Further, they categorized malware into groups that exhibited similar behavior and demonstrated how behavior-based clustering is effective for analyzing and classifying malware. Reference [29] proposed a malware classification method based on maximal component subgraph detection. The behavior graph is generated by capturing system calls while the malware samples execute in a virtual environment, and the maximal common subgraph is computed to compare two executables. Results show that this method effectively groups malware and also has a low false positive rate. Reference [30] presented a proof of concept of a malware detection method: the malware are executed in a virtual environment to generate behavioral reports, from which sparse vector models are created for the machine learning algorithms. The paper used classifiers such as k-Nearest Neighbors (kNN), Naïve Bayes, J48 Decision Tree, Support Vector Machine (SVM), and Multilayer Perceptron Neural Network (MLP); of the 5 classifiers, the best performance was achieved by the J48 decision tree.

Reference [31] focused mainly on the network activity of malware. The framework takes network traces in the form of pcap files as input; flow information is extracted to generate a behavioral graph that represents the malware's network activities and dependencies. Features reflecting network behavior (such as size, degree, nodes, etc.) were extracted and used by classification algorithms to classify malware into their respective families based on their network behavior. Based on the results, the J48 decision tree performs better than the other classifiers. Reference [32] proposed a hybrid approach where features from static analysis (such as opcode frequencies) were combined with features from dynamic analysis (such as execution trace information); this hybrid approach enhanced the performance of both approaches run separately. Reference [5] made use of a deep learning based malware detection approach that achieved a detection rate of 95% and a false positive rate of 0.1%. In [21], the neural network consisted of convolutional and recurrent networks; combining the two in hierarchical fashion obtained the best features for classification and increased the malware detection capabilities. Reference [26] uses multiple-gated CNNs on hash features generated from API calls and feeds the result to a Bi-LSTM, which captures the sequential correlations of the API call sequence.

VI. CONCLUSION AND FUTURE WORK

In this research, we presented a framework which extracted system call, operational code, section, and byte code features from the malware files. In the experimental results section, we compared the accuracy obtained from each of these features and demonstrated that the feature vector for system calls yielded the highest accuracy. The experimental
and results section demonstrated how the deep learning approach, because of the back-propagation and gradient-descent techniques, performed better than the traditional shallow machine learning approaches.

In our future work, instead of testing individual feature sets, we plan to study the effect of all the feature sets combined on the loss and accuracy. We also plan to apply convolutional neural networks and recurrent neural networks in the malware classification domain.

REFERENCES

[1] E. Gandotra, D. Bansal, and S. Sofat, "Malware analysis and classification: A survey", Journal of Information Security 5, 2, 56-64, 2014.
[2] W. Hardy, L. Chen, S. Hou, Y. Ye, and X. Li, "DL4MD: A deep learning framework for intelligent malware detection", International Conference on Data Mining (DMIN), 2016.
[3] A. Shalaginov, S. Banin, A. Dehghantanha, and K. Franke, "Machine learning aided static malware analysis: A survey and tutorial", Cyber Threat Intelligence, 7-45, 2018.
[4] https://www.kaggle.com/c/malware-classification/data
[5] J. Saxe and K. Berlin, "Deep neural network based malware detection using two dimensional binary program features", Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on, IEEE, pp. 11-20, 2015.
[6] K. Rieck, P. Trinius, C. Willems, and T. Holz, "Automatic Analysis of Malware Behavior Using Machine Learning", Journal of Computer Security, 19, 639-668, 2011.
[7] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, "A Survey on Automated Dynamic Malware-Analysis Techniques and Tools", ACM Computing Surveys, 44, Article No. 6, 2012.
[8] A. Moser, C. Kruegel, and E. Kirda, "Limits of Static Analysis for Malware Detection", 23rd Annual Computer Security Applications Conference, Miami Beach, 421-430, 2007.
[9] M. Schultz, E. Eskin, F. Zadok, and S. Stolfo, "Data Mining Methods for Detection of New Malicious Executables", Proceedings of 2001 IEEE Symposium on Security and Privacy, Oakland, 14-16 May 2001, 38-49, 2001.
[10] L. Nataraj, S. Karthikeyan, G. Jacob, and B. Manjunath, "Malware Images: Visualization and Automatic Classification", Proceedings of the 8th International Symposium on Visualization for Cyber Security, Article No. 4, 2011.
[11] J. Kolter and M. Maloof, "Learning to Detect Malicious Executables in the Wild", Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 470-478, 2004.
[12] D. Kong and G. Yan, "Discriminant Malware Distance Learning on Structural Information for Automated Malware Classification", Proceedings of the ACM SIGMETRICS/International Conference on Measurement and
International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 222-230, 2011.
[18] M. Christodorescu and S. Jha, "Static analysis of executables to detect malicious patterns", SSYM'03: Proceedings of the 12th Conference on USENIX Security Symposium, pages 12-12, Berkeley, CA, USA, 2003.
[19] P. Beaucamps and E. Filiol, "On the possibility of practically obfuscating programs - Towards a unified perspective of code protection", Journal in Computer Virology, Springer Verlag, 3 (1), pp. 3-21, 2007.
[20] Y. Lv, Y. Duan, W. Kang, Z. Li, and F. Y. Wang, "Traffic flow prediction with big data: A deep learning approach", ITS, IEEE Transactions on, 16(2):865-873, 2015.
[21] B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, "Deep learning for classification of malware system call sequences", Australasian Joint Conference on Artificial Intelligence, pp. 137-149, Springer, 2016.
[22] M. Siddiqui, M. C. Wang, and J. Lee, "Detecting Internet Worms Using Data Mining Techniques", Journal of Systemics, Cybernetics and Informatics, 6, 48-53, 2009.
[23] M. F. Zolkipli and A. Jantan, "An Approach for Malware Behavior Identification and Classification", Proceedings of the 3rd International Conference on Computer Research and Development, Shanghai, 11-13, 191-194, 2011.
[24] B. Anderson, D. Quist, J. Neil, C. Storlie, and T. Lane, "Graph Based Malware Detection Using Dynamic Analysis", Journal in Computer Virology, 7, 247-258, 2011.
[25] U. Bayer, P. M. Comparetti, C. Hlauschek, and C. Kruegel, "Scalable, Behavior-Based Malware Clustering", Proceedings of the 16th Annual Network and Distributed System Security Symposium, 2009.
[26] Z. Zhang, P. Qi, and W. Wang, "Dynamic malware analysis with feature engineering and feature learning", arXiv preprint arXiv:1907.07352, 2019.
[27] R. Tian, M. R. Islam, L. Batten, and S. Versteeg, "Differentiating Malware from Cleanwares Using Behavioral Analysis", Proceedings of the 5th International Conference on Malicious and Unwanted Software (Malware), Nancy, 19-20, 23-30, 2010.
[28] M. Biley, J. Oberheid, J. Andersen, Z. M. Morley, F. Jahanian, and J. Nazario, "Automated Classification and Analysis of Internet Malware", Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection, 4637, 178-197, 2007.
[29] Y. Park, D. Reeves, V. Mulukutla, and B. Sundaravel, "Fast Malware Classification by Automated Behavioral Graph Matching", Proceedings of the 6th Annual Workshop on Cyber Security and Information Intelligence Research, Article No. 45, 2010.
[30] I. Firdausi, C. Lim, and A. Erwin, "Analysis of Machine Learning Techniques Used in Behavior Based Malware Detection", Proceedings of the 2nd International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT), Jakarta, 2-3, 201-203, 2010.
Modeling of Computer Systems, 347-348, 2013. [31] S. Nari, and A. Ghorbani, “Automated Malware Classification
[13] R. Tian, L. Batten, and S. Versteeg, “Function Length as a Tool Based on Network Behavior”, Proceedings of International
for Malware Classification”, Proceedings of the 3rd Conference on Computing, Networking and Communications
International Conference on Malicious and Unwanted (ICNC), San Diego, 28-31, 642-647, 2013.
Software, Fairfax, 7-8 October 2008, 57-64, 2008. [32] I. Santos, J. Devesa, F. Brezo, J. Nieves, and P. G. Bringas,
[14] I. Santos, J. Nieves, and P.G. Bringas, “Semi-Supervised “OPEM: A Static-Dynamic Approach for Machine Learning
Learning for Unknown Malware Detection”, International Based Malware Detection”, Proceedings of International
Symposium on Distributed Computing and Artificial Conference CISIS’12-ICEUTE’12, Special Sessions Advances
Intelligence Advances in Intelligent and Soft Computing, 91, in Intelligent Systems and Computing, 189, 271-280, 2013.
415-422, 2011. [33] M. Khan, R. Patil, and S. A. Haider, “Application of
[15] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, “Dynamic Convolutional Neural Networks for Wild Fire Detection”, In
Analysis of Malicious Code”, Journal in Computer Virology, proceedings of IEEE SoutheastCon 2020.
2, 67-77, 2006. [34] R. Patil, and A. Shrestha, “Feature-Set for Sentiment analysis”,
[16] A. Sung, J. Xu, P. Chavez, and S. Mukkamala, “Static analyzer In proceedings of IEEE SoutheastCon 2019.
of vicious executables (save)”, In Proceedings of the 20th [35] M. Khan, and R. Patil, “Application of Machine Learning
Annual Computer Security Applications Conference (ACSAC Algorithms for Crack Detection in PVC Pipes”, In proceedings
’04), 00:326–334, 2004. of IEEE SoutheastCon 2019.
[17] Y. Ye, T. Li, S. Zhu, W. Zhuang, E. Tas, U. Gupta, and M. [36] M. Khan, and R. Patil, “Statistical Analysis of Acoustic
Abdulhayoglu, “Combining File Content and File Relations for Response of PVC Pipes for Crack Detection”, In proceedings
Cloud Based Malware Detection”, In Proceedings of ACM of IEEE SoutheastCon 2018.

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 13,2025 at 08:44:24 UTC from IEEE Xplore. Restrictions apply.
[37] M. Khan and R. Patil, “Acoustic Characterization of PVC Sewer Pipes for Crack Detection Using Frequency Domain Analysis”, in Proceedings of the IEEE International Smart Cities Conference (ISC2).
[38] W. Deng, R. Patil, L. Najjar, Y. Shi, and Z. Chen, “Incorporating Community Detection and Clustering Techniques into Collaborative Filtering Model”, in Proceedings of Information Technology and Quantitative Management (ITQM), 66–74.
