Malware Analysis using Machine Learning and Deep Learning Techniques

Abstract— In this era, where the volume and diversity of malware are rising exponentially, new techniques need to be employed for faster and more accurate identification of malware. Manual heuristic inspection of malware is neither effective in detecting new malware nor efficient, as it fails to keep up with the high spreading rate of malware. Machine learning approaches have therefore gained momentum. They have been used to automate static and dynamic analysis, where malware exhibiting similar behavior are clustered together and, based on that proximity, unknown malware are classified into their respective families. Although many such research efforts have applied data-mining and machine-learning techniques, in this paper we show how the accuracy can be improved further using deep learning networks. As deep learning offers superior classification by constructing neural networks with a larger number of potentially diverse layers, it leads to improved automatic detection and classification of malware variants. In this research, we present a framework which extracts various feature sets such as system calls, operational codes, sections, and byte codes from the malware files. In the experimental and results section, we compare the accuracy obtained from each of these features and demonstrate that the feature vector for system calls yields the highest accuracy. The paper concludes by showing how the deep learning approach performs better than traditional shallow machine learning approaches.

Keywords—malware detection, malware analysis, deep learning, machine learning

I. INTRODUCTION

Malware is malicious software that infiltrates the security, integrity, and functionality of a system [2] without the user’s consent, to fulfill the harmful intent of the attacker [8, 15]. There are different types of malware such as virus, worm, Trojan horse, rootkit, backdoor, botnet, spyware, adware, etc. [1]. Anti-virus software is used to detect and prevent malware from executing, applying a signature-matching algorithm to identify known threats.

Anti-virus software uses a signature database to detect malware. Here, a signature is generated for each known malware [7]; a signature must be created for a given malware so that the software can later identify it. The signature is a short string of bytes which is unique to each malware. The anti-virus system scans through files, generates their signatures, and checks whether those signatures exist in the database. If there is a match, the file in question is malware. Although this system correctly classifies known malware, it cannot detect an unknown or new malware, as its signature will be missing from the database.
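The lookup described above can be pictured in a few lines of code. This is only an illustrative sketch: the signature bytes and malware names below are made-up placeholders, not entries from any real signature database.

```python
# Minimal illustration of signature-based scanning: each known malware maps
# to a short, unique byte string, and a file is flagged if any known
# signature occurs in its contents. The signatures here are placeholders.
SIGNATURE_DB = {
    b"\xde\xad\xbe\xef\x13\x37": "ExampleWorm.A",
    b"\x90\x90\x90\xcc\xcc\x41": "ExampleTrojan.B",
}

def scan_file(path):
    """Return the malware name if a known signature is found, else None."""
    with open(path, "rb") as f:
        content = f.read()
    for signature, name in SIGNATURE_DB.items():
        if signature in content:
            return name
    return None  # unknown or new malware slips through: the core limitation
```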
In addition, even for known malware, attackers can use several techniques such as obfuscation, polymorphism, and encryption [2, 16, 23] to dodge firewalls, gateways, and anti-virus systems. Some of the commonly used obfuscation techniques are dead-code insertion, register reassignment, subroutine reordering, instruction substitution, code transposition, and code integration [1]. Moreover, it is trivial for malware writers to evade such systems simply by deriving a slight variant of an existing malware. As per [17], thousands of new malicious samples are introduced every day, and traditional signature-based systems fail to be effective when it comes to detecting unknown malware; relying on signatures to detect worms offers little protection against zero-day attacks [22].

Therefore, to combat this issue, static and dynamic analysis approaches are used, which can identify variations of already known threats. Features derived from the analysis are used to group malware and to classify unknown malware into the existing families. In dynamic analysis the malicious code is executed (in a controlled environment), whereas in static analysis the code is inspected but not executed.

Static analysis is used to detect patterns and extract information such as strings, n-grams, byte sequences, opcodes, and call graphs. Disassembler tools are used to reverse engineer Windows executables and generate the assembly instructions. In addition, memory dumper tools are used to obtain protected code (located in system memory) and to analyze packed executables that are otherwise difficult to disassemble. If required, the executable must be unpacked and decrypted before analysis. However, techniques such as obfuscation, encryption, polymorphism, and metamorphism can be used to thwart the reverse-compilation process, making static analysis a difficult choice [3, 16, 18, 19]. In addition, when the source code is compiled into binary format, some information such as data-structure sizes or variables is lost [7]. To overcome these limitations, dynamic analysis is used, as it is less vulnerable to obfuscation techniques.

In dynamic analysis, the malicious code is executed in a controlled or virtual environment. Tools such as Process Monitor, Process Explorer, Wireshark, and Regshot [1] are used to analyze the behavior of the code during execution. Behavior such as function calls, function parameters, information flow, and instruction traces is monitored [7]. Additional run-time information, such as transmitted network traffic, length of execution, and changes made to the file system [3], can also be noted. In dynamic analysis, there is no need to unpack or decrypt the executable before it can be analyzed.
Fig. 1. System Overview. (Pipeline: malware dataset → preprocessing / feature extraction → standardization of data and data partitioning into train and test sets → neural network → classification into Family#1 … Family#9.)
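As a rough sketch of the standardization and partitioning stage in Fig. 1 (the 80/20 split, the scaler, and the placeholder data below are assumptions; the paper does not state these details):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X: (n_samples, n_features) feature matrix; y: family labels 1..9.
# Placeholder data; the real features come from the extraction step.
X = np.random.rand(100, 257)
y = (np.arange(100) % 9) + 1

# Partition into train and test sets (80/20 split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize: fit the scaler on the training data only, apply to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```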
D. Performance Measure

To measure the performance of the machine learning algorithms and the deep learning neural network (DNN), we used the confusion matrix to compute precision, recall, and F1-score. We also consider the TPR and FPR. The measures, in the context of malware classification, are:

• TP: True Positives, the number of files correctly classified as malicious.
• TN: True Negatives, the number of files correctly classified as benign.
• FP: False Positives, the number of files mistakenly classified as malicious.
• FN: False Negatives, the number of files mistakenly classified as benign.
• TPR: True Positive Rate, also called recall, the percentage of relevant results (malicious files) that were correctly classified: TP / (TP + FN).
• FPR: False Positive Rate, the ratio of negative instances that are incorrectly classified as positive: FP / (FP + TN).
• Precision: the percentage of returned results that are relevant: TP / (TP + FP).
• F1-Score: the harmonic mean of precision and recall: 2 · (precision · recall) / (precision + recall).
• Accuracy: the percentage of malicious and benign files that were correctly classified: (TP + TN) / (TP + TN + FP + FN).
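A small sketch of how these measures can be computed from a confusion matrix with scikit-learn; the label values and the macro averaging over families are illustrative assumptions:

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Illustrative labels; in the paper's setting these would be the test-set
# families and the classifier's predictions.
y_true = [1, 1, 2, 2, 3, 3, 3, 2]
y_pred = [1, 2, 2, 2, 3, 3, 1, 2]

cm = confusion_matrix(y_true, y_pred)

# Macro-averaged precision, recall, and F1 over the malware families
# (the averaging choice is an assumption; the paper does not state it).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# Overall accuracy: correctly classified files / all files.
accuracy = cm.trace() / cm.sum()

# For one class c, the bullet definitions map onto the matrix as:
#   TP = cm[c, c]; FN = (row c sum) - TP; FP = (column c sum) - TP
#   TPR (recall) = TP / (TP + FN); FPR = FP / (FP + TN); F1 = 2PR / (P + R)
```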
III. FEATURE EXTRACTION AND ANALYSIS

A. Byte Codes
One of the representations of the input file was the hex-dump format. We extracted the byte codes and counted their frequency in each file. They ranged from 00 to FF; in addition, there was one special token (??), bringing the feature count to 257. Each file was represented as the frequency of these byte counts. After using this feature set for classification, we got the results shown in Table I.

TABLE I: PRECISION, RECALL, F1-SCORE FOR BYTE CODES
Algorithm            Precision   Recall   F1-Score
DecisionTree         91.45       93.28    92.25
SGD                  65.72       64.71    64.48
Kneighbors           90.11       92.3     90.98
Logistic Regression  59.9        44.31    42.06
RandomForest         94.87       95.77    95.27
Naïve Bayes          51.58       57.58    48.78
SVC                  13.56       15.31    11.09
DNN                  97.14       95.31    96.09
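A minimal sketch of this byte-count extraction, assuming the hex dump is laid out like the Kaggle .bytes files (a leading address column followed by byte tokens, with '??' marking unreadable bytes):

```python
from collections import Counter

# The 257 feature columns: byte values 00..FF plus the '??' placeholder token.
TOKENS = [f"{i:02X}" for i in range(256)] + ["??"]

def byte_histogram(bytes_file):
    """Count how often each byte token appears in one hex-dump (.bytes) file."""
    counts = Counter()
    with open(bytes_file) as f:
        for line in f:
            tokens = line.split()[1:]            # drop the leading address column
            counts.update(t.upper() for t in tokens)
    return [counts[t] for t in TOKENS]           # fixed-length 257-dim feature vector
```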
B. Opcode
The total number of operational codes (opcodes) we came across was 129. We counted their frequency in each file, and each file was represented as the frequency of these opcode counts. After using this feature set for classification, we got the results shown in Table II.

TABLE II: PRECISION, RECALL, F1-SCORE FOR OPCODES
Algorithm            Precision   Recall   F1-Score
DecisionTree         89.88       90.76    90.3
SGD                  66.1        43.44    44.87
Kneighbors           87.21       87.1     87.14
Logistic Regression  54.54       34.22    37.28
RandomForest         96.1        90.98    92.8
Naïve Bayes          55.39       52.45    41.62
SVC                  36.46       13       8.19
DNN                  97.42       95.18    96.14
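A comparable sketch for the opcode counts, assuming the opcodes are read from disassembled .asm listings; the short mnemonic list below stands in for the full 129-opcode vocabulary observed in the dataset:

```python
import re
from collections import Counter

# Simplified stand-in for the 129 opcodes observed in the dataset.
OPCODES = ["mov", "push", "pop", "call", "jmp", "add", "sub", "xor", "cmp", "lea"]
OPCODE_RE = re.compile(r"\b(" + "|".join(OPCODES) + r")\b")

def opcode_histogram(asm_file):
    """Count occurrences of each known opcode mnemonic in one .asm file."""
    counts = Counter()
    with open(asm_file, errors="ignore") as f:
        for line in f:
            counts.update(OPCODE_RE.findall(line.lower()))
    return [counts[op] for op in OPCODES]   # one frequency per known opcode
```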
C. API Calls
For the system calls, because of the space limitation we focused only on the top 1500 calls. We sorted the system calls in descending order of the frequency with which they were called across the entire dataset, and selected the top 1500. Each file was represented using these top 1500 API calls as a feature vector: a flag was set to one if a particular API call was invoked from that malware file, and to zero otherwise. After using this feature set for classification, we got the results shown in Table III.

TABLE III: PRECISION, RECALL, F1-SCORE FOR API-CALLS
Algorithm            Precision   Recall   F1-Score
DecisionTree         92.51       92.45    92.36
SGD                  96.4        92.93    94.3
Kneighbors           92.98       85.14    86.46
Logistic Regression  96.66       93.02    94.31
RandomForest         95.76       90.42    92.03
Naïve Bayes          80.99       81.64    80.11
SVC                  80.7        79.51    78.71
DNN                  98.36       98.37    98.3
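The top-1500 selection and the presence flags could be computed roughly as below; calls_per_file is an assumed intermediate mapping each file to the API/system calls observed in it, produced by whatever parser extracts the calls:

```python
from collections import Counter

def build_api_features(calls_per_file):
    """calls_per_file: {file_name: list of API/system calls observed in it}."""
    # Rank calls by how often they occur across the whole dataset
    # and keep the 1500 most frequent ones.
    total = Counter()
    for calls in calls_per_file.values():
        total.update(calls)
    top_calls = [name for name, _ in total.most_common(1500)]

    # One binary flag per selected call: 1 if the file uses it, else 0.
    features = {}
    for file_name, calls in calls_per_file.items():
        present = set(calls)
        features[file_name] = [1 if c in present else 0 for c in top_calls]
    return features
```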
D. Sections
We counted the distinct types of sections in the .asm format of the malware files and were able to extract 443 unique types. Each file was represented as the frequency of these section counts. After using this feature set for classification, we got the results shown in Table IV.

TABLE IV: PRECISION, RECALL, F1-SCORE FOR SECTIONS
Algorithm            Precision   Recall   F1-Score
DecisionTree         93.27       92.56    93.14
SGD                  82.11       48.89    52.79
Kneighbors           96.98       87.15    87.84
Logistic Regression  72.19       40.86    44.28
RandomForest         98.63       92.3     94.17
Naïve Bayes          76.72       51.79    49.65
SVC                  14.15       11.58    5.68
DNN                  98.75       98.57    98.65
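A sketch of the section counts, under the assumption that the section name can be read off the segment prefix of each .asm line (e.g. '.text:00401000 push ebp'); the parsing rule is an assumption about the listing format:

```python
from collections import Counter

def section_histogram(asm_file, section_vocab):
    """Count how many lines of one .asm listing fall in each known section."""
    counts = Counter()
    with open(asm_file, errors="ignore") as f:
        for line in f:
            if ":" not in line:
                continue
            name = line.split(":", 1)[0].strip()   # e.g. '.text', '.rdata', 'HEADER'
            counts[name] += 1
    return [counts[s] for s in section_vocab]       # 443-dimensional vector in the paper
```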
… then goes through each layer in the backward direction to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error. In shallow machine learning, as there is only a single layer computing the output, the weights do not get readjusted in this way, and therefore the accuracy does not get refined further.
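The text shown here does not spell out the DNN architecture, so the following Keras sketch is only a plausible stand-in: a small fully connected network whose weights are refined by back-propagation and gradient descent, with nine outputs for the nine malware families. The layer sizes, optimizer, and placeholder data are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_families = 1500, 9            # e.g. the top-1500 API-call flags

# Illustrative data; in practice X_train/y_train come from the pipeline above.
X_train = np.random.rand(256, n_features)
y_train = np.random.randint(0, n_families, size=256)

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(n_families, activation="softmax"),
])

# Cross-entropy loss minimized by gradient descent; back-propagation computes
# the per-layer error contributions used to update the weights.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
```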
V. RELATED WORK

Machine learning techniques are widely used in various domains, such as wildfire detection [33], sentiment analysis [34], PVC pipe crack detection [35, 36, 37], and recommendation systems [38], to name a few. In this paper we apply these techniques to the malware domain.

Normally, variations of malware go undetected by traditional anti-virus software because their signatures are missing from the database. To fix this problem, machine-learning algorithms are used to capture these small variations, which share a similar behavioral pattern, and to classify them into their known families. In this section, we discuss state-of-the-art techniques that use machine learning algorithms for malware classification.

In [9], three different features were extracted using static analysis: system resource information, strings, and byte sequences. The resource information comprised the list of DLLs used by the binary, the list of DLL function calls, and the number of different system calls used within each DLL. They used three algorithms: RIPPER, Naive Bayes, and a multi-classifier system. Reference [11] used n-grams of byte codes as features and applied machine learning algorithms such as naive Bayes, decision trees, support vector machines, and boosting, where boosted decision trees outperformed the other methods. Reference [10] proposed a method to visualize malware using image processing techniques and used k-nearest neighbors to classify them. Reference [12] presented a framework that used call graphs as features and applied distance metrics to evaluate the similarity between the call graphs of malware programs; malware samples belonging to the same family clustered together. Reference [13] used two aspects of functions to classify Trojans: one was the length of the function measured in bytes, and the second was the frequency of that function length. Their results indicate that function length, along with its frequency, is significant in identifying the malware family.

Reference [14] identified the issue with supervised learning that one has to identify and prepare labels for all the data. It instead focused on a semi-supervised learning approach to identify malware, using a set of labeled and unlabeled instances. Reference [22] used variable-length instruction sequences to identify worms and applied machine learning algorithms such as decision trees, bagging, and random forests to separate worms from clean programs. Reference [6] proposed an incremental approach for behavior-based analysis that combined clustering and classification models, which significantly reduced the run-time overhead: clustering was used to identify and group novel classes of malware with similar behavior, and unknown malware was classified into these classes using classification. Reference [24] analyzes a graph constructed from the data obtained through dynamic analysis; a similarity matrix is created using Gaussian and spectral kernels, and a support vector machine is trained on this similarity matrix to classify the data. Reference [25] proposed a novel approach to cluster malware samples, where partitions or subsets of programs that exhibit similar behavior were created. Reference [27] used an automated tool to extract system-call sequences from binaries while they were running in a virtual environment, and used classifiers from the WEKA library to discriminate malicious files from benign files as well as to classify the malware into their respective families.

Reference [28] pointed out how different anti-virus software characterizes malware in inconsistent and incomplete ways and fails to be concise in its semantics across the board. To address this issue, they proposed a new classification technique that described malware behavior in terms of system state changes (e.g., files written, processes created) rather than sequences or patterns of system calls. Further, they categorized malware into groups that exhibited similar behavior and demonstrated how behavior-based clustering is effective for analyzing and classifying malware.

Reference [29] proposed a malware classification method based on maximal component subgraph detection. The behavior graph is generated by capturing system calls while the malware samples are executed in a virtual environment, and the maximal common subgraph is computed to compare two executables. Results show that this method effectively groups malware and also has a low false positive rate. Reference [30] presented a proof of concept of a malware detection method in which the malware is executed in a virtual environment to generate a behavioral report; from this report sparse vector models are created and fed to machine learning algorithms. The paper used classifiers such as k-nearest neighbors (kNN), naïve Bayes, J48 decision tree, support vector machine (SVM), and multilayer perceptron neural network (MLP). After the analysis and experiments with all five classifiers, the best performance was achieved by the J48 decision tree.

Reference [31] focused mainly on the network activity of the malware. The framework takes network traces in the form of pcap files as input; flow information is extracted to generate a behavioral graph that represents the malware’s network activities and dependencies. Features reflecting network behavior (such as size, degree, and number of nodes) were extracted and used by classification algorithms to classify malware into their respective families based on their network behavior. Based on the results, the J48 decision tree performs better than the other classifiers. Reference [32] proposed a hybrid approach where features from static analysis (such as opcode frequencies) were combined with features from dynamic analysis (such as information from execution traces); this hybrid approach improved on the performance of both approaches run separately. Reference [5] made use of a deep learning based malware detection approach that achieved a detection rate of 95% and a false positive rate of 0.1%. In [21], the neural network consisted of convolutional and recurrent networks; combining the two in a hierarchical fashion obtained the best features for classification and increased the malware detection capability. Reference [26] uses multiple gated CNNs on hash features generated from API calls and feeds the result to a Bi-LSTM, which captures the sequential correlations of the API call sequence.

VI. CONCLUSION AND FUTURE WORK

In this research, we presented a framework which extracted system-call, operational-code, section, and byte-code features from the malware files. In the experimental and results section, we compared the accuracy obtained from each of these features and demonstrated that the feature vector for system calls yielded the highest accuracy.
The experimental and results section also demonstrated how the deep learning approach, because of its back-propagation and gradient-descent techniques, performed better than the traditional shallow machine learning approaches.

In our future work, instead of testing individual feature sets, we plan to study the effect of all the feature sets combined on the loss and accuracy. We also plan to apply convolutional neural networks and recurrent neural networks to the malware classification domain.

REFERENCES
[1] E. Gandotra, D. Bansal, and S. Sofat, “Malware analysis and classification: A survey”, Journal of Information Security, 5, 2, 56–64, 2014.
[2] W. Hardy, L. Chen, S. Hou, Y. Ye, and X. Li, “DL4MD: A deep learning framework for intelligent malware detection”, in International Conference on Data Mining (DMIN), 2016.
[3] A. Shalaginov, S. Banin, A. Dehghantanha, and K. Franke, “Machine learning aided static malware analysis: A survey and tutorial”, Cyber Threat Intelligence, 7–45, 2018.
[4] https://www.kaggle.com/c/malware-classification/data
[5] J. Saxe and K. Berlin, “Deep neural network based malware detection using two dimensional binary program features”, in Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on, IEEE, pp. 11–20, 2015.
[6] K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic Analysis of Malware Behavior Using Machine Learning”, Journal of Computer Security, 19, 639–668, 2011.
[7] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A Survey on Automated Dynamic Malware-Analysis Techniques and Tools”, ACM Computing Surveys, 44, Article No. 6, 2012.
[8] A. Moser, C. Kruegel, and E. Kirda, “Limits of Static Analysis for Malware Detection”, 23rd Annual Computer Security Applications Conference, Miami Beach, 421–430, 2007.
[9] M. Schultz, E. Eskin, F. Zadok, and S. Stolfo, “Data Mining Methods for Detection of New Malicious Executables”, Proceedings of the 2001 IEEE Symposium on Security and Privacy, Oakland, 14–16 May 2001, 38–49, 2001.
[10] L. Nataraj, S. Karthikeyan, G. Jacob, and B. Manjunath, “Malware Images: Visualization and Automatic Classification”, Proceedings of the 8th International Symposium on Visualization for Cyber Security, Article No. 4, 2011.
[11] J. Kolter and M. Maloof, “Learning to Detect Malicious Executables in the Wild”, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 470–478, 2004.
[12] D. Kong and G. Yan, “Discriminant Malware Distance Learning on Structural Information for Automated Malware Classification”, Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, 347–348, 2013.
[13] R. Tian, L. Batten, and S. Versteeg, “Function Length as a Tool for Malware Classification”, Proceedings of the 3rd International Conference on Malicious and Unwanted Software, Fairfax, 7–8 October 2008, 57–64, 2008.
[14] I. Santos, J. Nieves, and P. G. Bringas, “Semi-Supervised Learning for Unknown Malware Detection”, International Symposium on Distributed Computing and Artificial Intelligence, Advances in Intelligent and Soft Computing, 91, 415–422, 2011.
[15] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, “Dynamic Analysis of Malicious Code”, Journal in Computer Virology, 2, 67–77, 2006.
[16] A. Sung, J. Xu, P. Chavez, and S. Mukkamala, “Static analyzer of vicious executables (SAVE)”, Proceedings of the 20th Annual Computer Security Applications Conference (ACSAC ’04), 326–334, 2004.
[17] Y. Ye, T. Li, S. Zhu, W. Zhuang, E. Tas, U. Gupta, and M. Abdulhayoglu, “Combining File Content and File Relations for Cloud Based Malware Detection”, Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 222–230, 2011.
[18] M. Christodorescu and S. Jha, “Static analysis of executables to detect malicious patterns”, SSYM’03: Proceedings of the 12th USENIX Security Symposium, pages 12–12, Berkeley, CA, USA, 2003.
[19] P. Beaucamps and E. Filiol, “On the possibility of practically obfuscating programs - Towards a unified perspective of code protection”, Journal in Computer Virology, Springer Verlag, 3(1), pp. 3–21, 2007.
[20] Y. Lv, Y. Duan, W. Kang, Z. Li, and F. Y. Wang, “Traffic flow prediction with big data: A deep learning approach”, IEEE Transactions on Intelligent Transportation Systems, 16(2), 865–873, 2015.
[21] B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, “Deep learning for classification of malware system call sequences”, in Australasian Joint Conference on Artificial Intelligence, pp. 137–149, Springer, 2016.
[22] M. Siddiqui, M. C. Wang, and J. Lee, “Detecting Internet Worms Using Data Mining Techniques”, Journal of Systemics, Cybernetics and Informatics, 6, 48–53, 2009.
[23] M. F. Zolkipli and A. Jantan, “An Approach for Malware Behavior Identification and Classification”, Proceedings of the 3rd International Conference on Computer Research and Development, Shanghai, 11–13, 191–194, 2011.
[24] B. Anderson, D. Quist, J. Neil, C. Storlie, and T. Lane, “Graph Based Malware Detection Using Dynamic Analysis”, Journal in Computer Virology, 7, 247–258, 2011.
[25] U. Bayer, P. M. Comparetti, C. Hlauschek, and C. Kruegel, “Scalable, Behavior-Based Malware Clustering”, Proceedings of the 16th Annual Network and Distributed System Security Symposium, 2009.
[26] Z. Zhang, P. Qi, and W. Wang, “Dynamic malware analysis with feature engineering and feature learning”, arXiv preprint arXiv:1907.07352, 2019.
[27] R. Tian, M. R. Islam, L. Batten, and S. Versteeg, “Differentiating Malware from Cleanwares Using Behavioral Analysis”, Proceedings of the 5th International Conference on Malicious and Unwanted Software (MALWARE), Nancy, 19–20, 23–30, 2010.
[28] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario, “Automated Classification and Analysis of Internet Malware”, Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection, 4637, 178–197, 2007.
[29] Y. Park, D. Reeves, V. Mulukutla, and B. Sundaravel, “Fast Malware Classification by Automated Behavioral Graph Matching”, Proceedings of the 6th Annual Workshop on Cyber Security and Information Intelligence Research, Article No. 45, 2010.
[30] I. Firdausi, C. Lim, and A. Erwin, “Analysis of Machine Learning Techniques Used in Behavior Based Malware Detection”, Proceedings of the 2nd International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT), Jakarta, 2–3, 201–203, 2010.
[31] S. Nari and A. Ghorbani, “Automated Malware Classification Based on Network Behavior”, Proceedings of the International Conference on Computing, Networking and Communications (ICNC), San Diego, 28–31, 642–647, 2013.
[32] I. Santos, J. Devesa, F. Brezo, J. Nieves, and P. G. Bringas, “OPEM: A Static-Dynamic Approach for Machine Learning Based Malware Detection”, Proceedings of the International Conference CISIS’12-ICEUTE’12, Special Sessions, Advances in Intelligent Systems and Computing, 189, 271–280, 2013.
[33] M. Khan, R. Patil, and S. A. Haider, “Application of Convolutional Neural Networks for Wild Fire Detection”, in Proceedings of IEEE SoutheastCon, 2020.
[34] R. Patil and A. Shrestha, “Feature-Set for Sentiment Analysis”, in Proceedings of IEEE SoutheastCon, 2019.
[35] M. Khan and R. Patil, “Application of Machine Learning Algorithms for Crack Detection in PVC Pipes”, in Proceedings of IEEE SoutheastCon, 2019.
[36] M. Khan and R. Patil, “Statistical Analysis of Acoustic Response of PVC Pipes for Crack Detection”, in Proceedings of IEEE SoutheastCon, 2018.
[37] M. Khan and R. Patil, “Acoustic Characterization of PVC Sewer Pipes for Crack Detection Using Frequency Domain Analysis”, in Proceedings of the IEEE International Smart Cities Conference (ISC2).
[38] W. Deng, R. Patil, L. Najjar, Y. Shi, and Z. Chen, “Incorporating Community Detection and Clustering Techniques into Collaborative Filtering Model”, in Proceedings of Information Technology and Quantitative Management (ITQM), 66–74.