
International Journal of Innovative Technology and Exploring Engineering (IJITEE)

ISSN: 2278-3075 (Online), Volume-9 Issue-5, March 2020

Building a Malware Detection System Based on a Machine Learning Method
Cho Do Xuan, Tisenko Victor Nikolaevich, Do Minh Tuan, Nguyen The Lam, Nguyen Anh Tuan

Revised Manuscript Received on March 30, 2020.
* Correspondence Author
Cho Do Xuan, FPT University, Hanoi, Vietnam. E-mail: [email protected]
Tisenko Victor Nikolaevich, Peter the Great St. Petersburg Polytechnic University, Russia, St. Petersburg, Polytechnicheskaya, 29. E-mail: [email protected]
Do Minh Tuan, FPT University, Hanoi, Vietnam. E-mail: [email protected]
Nguyen The Lam, FPT University, Hanoi, Vietnam. E-mail: [email protected]
Nguyen Anh Tuan, FPT University, Hanoi, Vietnam. E-mail: [email protected]

© The Authors. Published by Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Retrieval Number: E2945039520/2020©BEIESP
DOI: 10.35940/ijitee.E2945.039520
Journal Website: www.ijitee.org
Published By: Blue Eyes Intelligence Engineering & Sciences Publication

Abstract- Malware attacks are dangerous and difficult to detect and prevent. Detecting signs of malware and alerting users or the system is therefore essential today. One of the most effective malware detection approaches is to apply machine learning or deep learning to analyze malware behavior. Many studies analyze malicious behavior and then combine the results with classification or clustering methods to find signs of malware. In this paper, we propose a method that uses machine learning to detect signs of malware based on unusual behavior. Specifically, we analyze files using static and dynamic analysis methods to detect abnormal behaviors and combine the results with a supervised classification algorithm to reach a conclusion about malware behavior.

Keywords: Malware detection, feature selection, machine learning

I. INTRODUCTION

Malware is software designed to harm or perform unwanted actions on a computer system. Malicious software is essentially software like any other used on a computer every day and has all the characteristics and properties of normal software, except that its purpose is malicious. Common types of malware include Virus, Worm, Trojan Horse, Malicious Mobile Code, Tracking Cookie, Attacker Tool, Phishing, and Hoax Virus. According to statistics [15], malware distribution in 2019 increased by 79% compared to 2018. This is reasonable because hackers, who in the past focused on information systems, now usually choose to attack users directly. Malware is therefore increasing rapidly, not only in the number of attacks but also in how dangerous it is. According to the study [10], there are several approaches to detecting malware. The two basic methods are signature-based detection and detection based on behavioral analysis. Signature-based methods were studied and applied early because they are fast and accurate. Commonly used signatures include hash codes, IP addresses, domains, and indicators of compromise. However, the disadvantage of this method is that it cannot detect new malware samples that are not in the signature database. In this paper, we propose a method to detect malware based on machine learning techniques. The paper [1] points out some difficulties in detecting malware based on machine learning. In our study, we propose a malware detection process based on static and dynamic analysis. Finally, we propose using machine learning algorithms to decide whether malware is present in the system.

II. RELATED WORKS

2.1. Malware detection technique

a) Detection technique based on static analysis
Malware detection based on static analysis detects malware without running or executing any of its code. It includes three main methods: scanner technology, heuristic-based diagnostics, and integrity checkers.

b) Detection technique based on dynamic analysis
Malware detection based on dynamic analysis determines whether a file is infected by executing the program code and observing its behavior. The two main techniques are:
- Behavior Monitors/Blockers: A behavior blocker monitors the execution behavior of a program in real time. It also makes it possible to monitor suspicious actions and block malware.
- Emulation: Emulation-based detection runs and analyzes the code in a simulated environment. The two main techniques are dynamic heuristics and generic decryption.

2.2. Detecting malware based on machine learning
To overcome the disadvantage of signature-based detection, techniques that detect malware by analyzing its behavior were developed. In [1], the authors presented the idea of detecting malware from abnormal file behavior using machine learning algorithms. The paper [2] presented a number of basic approaches to machine-learning-based malware detection, including how to extract malicious features and which detection algorithms to use. To extract feature data, recent studies [2-7] often use three main techniques: static analysis, dynamic analysis, and combined analysis.

Based on these analysis techniques, the malware is analyzed and summarized as a corresponding sequence of behaviors. In this paper, we use static and dynamic analysis methods to look for abnormal behavior of malware with a sandbox tool [8]. After the sandbox tool has been configured and the malware analyzed, the main groups of behaviors that can be selected to extract malicious behavior include byte sequences, opcodes, network activity, system files, API and system calls, the Windows Registry, and PE file characteristics. Two basic families of algorithms are used with this method: machine learning and deep learning algorithms [9, 10]. In this paper, we use a supervised machine learning algorithm. This detection method is relatively effective and has been researched and tested in many studies.

2.3. Some malware detection tools
Implementing the above methods requires support tools corresponding to each specific method. Below, we list five main groups of tools:
- Antivirus software: Kaspersky, Bitdefender, Avast, Norton, Bkav…
- Network monitoring tool group: TCPView, Wireshark.
- A group of tools for monitoring file system resources: AutoRun, ProcessExplorer, ProcessMon, …
- Registry monitoring tool group: ProcessMon, AutoRun, …
- Automatic analysis tool: Sandboxie, Cuckoo Sandbox.

III. DEVELOP A MALWARE DETECTION SYSTEM
3.1. Proposing a model to detect malware

Figure 1. Malicious detection system based on rules and machine learning algorithms
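The two-stage flow in Figure 1 (a signature check first, a machine learning decision second) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper builds its signature database with Yara and consults VirusTotal, whereas the sketch assumes a plain set of SHA-256 hashes as a stand-in signature database. The names `sha256_of`, `detect`, and `toy_classifier` are hypothetical.

```python
import hashlib
from typing import Callable, Set


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest, used here as a stand-in signature."""
    return hashlib.sha256(data).hexdigest()


def detect(data: bytes,
           signature_db: Set[str],
           classify_behavior: Callable[[bytes], bool]) -> str:
    """Stage 1: signature lookup. Stage 2: behavior-based classifier."""
    if sha256_of(data) in signature_db:
        return "malware (signature match)"
    # Assumed stage 2: in the paper this is sandbox analysis followed by
    # a Random Forest model; here it is any callable returning True/False.
    if classify_behavior(data):
        return "malware (classifier)"
    return "clean"


# Toy usage with made-up samples and a placeholder classifier.
known_bad = b"known malicious sample"
signature_db = {sha256_of(known_bad)}


def toy_classifier(data: bytes) -> bool:
    return b"suspicious" in data


print(detect(known_bad, signature_db, toy_classifier))
print(detect(b"new suspicious sample", signature_db, toy_classifier))
print(detect(b"plain text document", signature_db, toy_classifier))
```

The point of the ordering is cost: the cheap lookup runs on every file, and only files that miss the signature database pay for the slower behavioral stage.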


The operating process of the malware detection system is as follows:
- First, the data is checked against the malware signature database. If a match is found, the system immediately reports the presence of malware. To accomplish this task, the file is sent to VirusTotal and checked against the signature database. We use Yara to build the signature database and VirusTotal as a tool to check the data when the signature database does not detect anything. Signature-based detection takes little time and gives accurate results. However, this method has great difficulty detecting new malware. If a file does not match the malware signature database, it is checked by the machine learning method.
- If the file is not found in the signature database, it is sent to the sandbox. In the sandbox, the file is loaded into a virtual environment and executed, and the sandbox system records the file's behavior logs. The system's machine learning module then evaluates the behavior of the test file against the Feature Dataset of the system using the Random Forest algorithm.

3.2. Select and extract features
The document [11] lists properties that can be used to detect malware. In this paper, we use some of these features to detect malware. The list of features is shown in the following table.
Table 1. List of malicious features

No  Category  Features          Data Type  Description
1   Static    Size              Numeric    File size.
2   Static    Timestamp         Numeric    Date the file was created.
3   Static    Signature         String     File signature.
4   Static    Packer            String     Packer.
5   Static    Section Features  String     Section and resource features.
6   Static    Static Import     String     Imported libraries.
7   Dynamic   Mutex                        Allows only one malware instance to execute in the system at a time…
8   Dynamic   Processes         String     Processes.
9   Dynamic   Dynamic imports   String     Imported external libraries.
10  Dynamic   File Read         String     Paths of files the malware reads.
11  Dynamic   File Written      String     Paths of files the malware overwrites.
12  Dynamic   File Delete       String     Paths of files the malware deletes.
13  Dynamic   File Copied       String     Paths of files the malware copies.
14  Dynamic   File Renamed      String     Paths of files the malware renames.
15  Dynamic   File Open         String     Paths of files the malware opens.
16  Dynamic   File Exists       String     Paths of opened files.
17  Dynamic   File Failed       String     Paths of files that caused errors.
18  Dynamic   File Operations   String     List of all the file operation types above.
19  Dynamic   TCP               String     IP addresses contacted over TCP.
20  Dynamic   UDP               String     IP addresses contacted over UDP.
21  Dynamic   HTTP              String     List of all HTTP connections.
22  Dynamic   Registry Written  String     Registry keys the malware edits.
23  Dynamic   Registry Delete   String     Registry keys the malware deletes.
24  Dynamic   API Stats         String     APIs called.
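As a rough illustration of how string-valued features such as those in Table 1 might be turned into numbers for a classifier, the sketch below counts the entries recorded for each feature in a sandbox report. Everything about the report layout is an assumption made for the example: the key names (`behavior`, `summary`, `file_read`, `network`, and so on) and the count-based encoding are hypothetical, since the paper does not state how the strings are encoded.

```python
from typing import Any, Dict, List

# Assumed mapping from Table 1 feature names to keys of a behavior
# summary; the key names are illustrative, not a documented schema.
SUMMARY_KEYS = {
    "File Read": "file_read",
    "File Written": "file_written",
    "File Delete": "file_deleted",
    "Registry Written": "regkey_written",
    "Registry Delete": "regkey_deleted",
    "Mutex": "mutex",
}
NETWORK_KEYS = {"TCP": "tcp", "UDP": "udp", "HTTP": "http"}


def extract_features(report: Dict[str, Any]) -> Dict[str, int]:
    """Turn list-valued report entries into per-feature counts."""
    summary = report.get("behavior", {}).get("summary", {})
    network = report.get("network", {})
    features = {name: len(summary.get(key, []))
                for name, key in SUMMARY_KEYS.items()}
    features.update({name: len(network.get(key, []))
                     for name, key in NETWORK_KEYS.items()})
    return features


def to_vector(features: Dict[str, int]) -> List[int]:
    """Fix the column order so every sample yields a comparable row."""
    return [features[name] for name in sorted(features)]


# Made-up report fragment; missing keys simply count as zero.
sample_report = {
    "behavior": {"summary": {
        "file_read": ["C:\\a.txt", "C:\\b.txt"],
        "file_deleted": ["C:\\a.txt"],
        "mutex": ["Global\\demo"],
    }},
    "network": {"tcp": [{"dst": "203.0.113.5"}], "http": []},
}
features = extract_features(sample_report)
print(to_vector(features))
```

Counting is only one possible encoding; hashing or one-hot encoding the individual strings would preserve more information at the cost of a much wider feature vector.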


3.3. Malware detection algorithm
Random forest belongs to the family of decision tree algorithms. The main idea of this algorithm is to build a number of decision trees. When the data is high-dimensional, a single tree grows deep and its identification (classification/regression) quality is low; instead of turning to other methods, the Random forest is used. A random forest is an identifier consisting of a set of decision trees combined by the voting method. The decision trees are constructed from different sub-datasets, each with a different subset of characteristics taken randomly from the observed data set. The construction of the Random forest consists of three phases:
- Data creation (random vector generator).
- Building decision trees.
- Combining the decision trees by the voting method.
The Random forest algorithm is generically described through the following steps:
Step 1: For $k = 1, \dots, L$, where $L$ is the number of decision trees:
- Draw a random set $R_k$ of $M$ data points from the data set $D$.
- $D_k$ is the projection of $R_k$ onto the randomly selected characteristics of the data set.
- Build the decision tree $T_k$ from $D_k$, obtaining the identifier $C_k$.
Step 2: Combine the voting results $\{C_k\}_{k=1}^{L}$ to give the identification result for new subjects.


The Random forest algorithm lets the decision trees run and produce results independently. The answer predicted by the largest number of decision trees is selected by the Random forest. However, if the decision trees happened to be identical, the result of a single decision tree would be the result of the entire model. To ensure that the decision trees differ, the Random forest randomly picks a subset of the features at each node. The remaining parameters are used in the Random forest just as in a decision tree. Each tree is also trained on a random sample of the data; this process is called bootstrapping. Using Bootstrap Aggregating trees (Bagging Trees) helps reduce variance and increase stability and accuracy. In addition, the Random forest model calculates the importance of features, unlike algorithms that treat all features as equal. The following figure shows an example of a Random forest:

Fig. 2. The operation model of the Random forest algorithm


The random forest depicted in Figure 2 creates a set of unpruned decision trees, each built on a bootstrap sample set. At each node, the best split is chosen from a randomly selected subset of features.
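The bootstrap, random feature subset, and voting steps described for the Random forest can be sketched in plain Python. This is a simplified teaching sketch under stated assumptions, not the authors' implementation: it uses Gini impurity as the split criterion, draws the feature subset at each node, and limits depth with `max_depth`. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label (the vote)."""
    return Counter(labels).most_common(1)[0][0]


def take(items, idx):
    return [items[i] for i in idx]


def build_tree(rows, labels, n_sub, rng, depth=0, max_depth=5):
    """Grow one tree; every node searches a random subset of n_sub features."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    best = None
    for f in rng.sample(range(len(rows[0])), n_sub):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(take(labels, left))
                     + len(right) * gini(take(labels, right)))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no usable split: make a leaf
        return majority(labels)
    _, f, t, left, right = best
    return (f, t,
            build_tree(take(rows, left), take(labels, left),
                       n_sub, rng, depth + 1, max_depth),
            build_tree(take(rows, right), take(labels, right),
                       n_sub, rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are 4-tuples."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def random_forest(rows, labels, n_trees=15, n_sub=1, seed=0):
    """Step 1: one tree per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree(take(rows, idx), take(labels, idx), n_sub, rng))
    return forest


def predict(forest, row):
    """Step 2: combine the trees by majority vote."""
    return majority([predict_tree(tree, row) for tree in forest])


# Toy data: label 1 stands for malware, 0 for benign (made-up values).
rows = [[0, 1], [1, 0], [1, 2], [2, 1], [8, 9], [9, 8], [7, 9], [9, 7]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = random_forest(rows, labels)
print(predict(forest, [0, 0]), predict(forest, [9, 9]))
```

An odd number of trees is used so that a two-class vote cannot tie. In practice a library implementation would normally be preferred over a hand-written one.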


IV. EXPERIMENT AND EVALUATION

4.1. The experimental model of detecting malware based on machine learning

Fig. 3. The experimental model of malware detection using machine learning


Training and Learning phase: The detector tries to learn the behavior and features of malware. This is the phase in which the malware detection model is built.
Detection phase: Based on the model learned in phase 1, the detector detects malware and sends a notification based on the machine learning algorithm.
Experiment Process:
- Start Cuckoo Sandbox as a service and wait for commands via the REST API.
- Use the Cuckoo Sandbox REST API to upload non-malware data (documents, images, audio, …) and run the analysis on Cuckoo Sandbox.
- Use the Cuckoo Sandbox REST API to upload malware entries and run the analysis on Cuckoo Sandbox.
- Take the logs (report.json and dump.pcap) of the harmless data and the malware.
- Export the features of each entry from the report file and the dump file (Table 1).
- Use the Random Forest Classification algorithm to begin training and build the model.

4.2. Data and experiment process
a) Preparing data
- Benign data: [12]
Quantity: 7620 (files).
Document File: (Docx, Excel, PDF, Text, Ebook…).
Programming File: (Java, Python, C, C++…).
Image File: (JPG, PNG…).
Audio File: (MP3, WAV).
Executable File: (DLL, EXE).
- Malware: [13, 14]
Quantity: 13966 (files).
Malware File: (Virus, Adware, Malware…).
b) Experiment script
Export data (including harmless data and malware, split at an 80% - 20% ratio for training and testing).
c) Calculation parameter
Table 2. Measured values

Parameter  Notes                                                  Calculation process
TP         True Positive: malware correctly detected              Number of malware files labeled 'malware'
TN         True Negative: harmless file correctly identified      Number of harmless files not labeled 'malware'
FP         False Positive: harmless file reported as malware      Number of harmless files labeled 'malware'
FN         False Negative: malware reported as harmless           Number of malware files not labeled 'malware'

Where:
$$\mathrm{acc} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$
$$\mathrm{precision} = \frac{TP}{TP + FP} \times 100\%$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%$$
$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{Recall}}{\mathrm{precision} + \mathrm{Recall}}$$
d) Experimental script
Test case:
Test case A: 3000 (clean) + 5000 (malware)
Test case B: 5000 (clean) + 8000 (malware)
Test case C: 6500 (clean) + 11000 (malware)
Test case D: 7620 (clean) + 13966 (malware)
e) Experimental results
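The measures defined for Table 2 (accuracy, precision, recall, and F1) can be computed from true and predicted labels as in the sketch below. It assumes the standard confusion-matrix definitions with 'malware' as the positive class, and it takes the test error to be 100 minus the accuracy, which is consistent with the rows of Table 3. The function name `evaluate` and the sample labels are hypothetical.

```python
def evaluate(y_true, y_pred, positive="malware"):
    """Compute the four counts and the derived measures (in percent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    acc = 100.0 * (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "acc": acc, "precision": precision, "recall": recall,
            "f1": f1, "test_error": 100.0 - acc}


# Made-up labels: one missed malware file and one false alarm.
y_true = ["malware"] * 4 + ["clean"] * 4
y_pred = ["malware", "malware", "malware", "clean",
          "clean", "clean", "clean", "malware"]
result = evaluate(y_true, y_pred)
print(result)
```

Reporting precision and recall alongside accuracy matters here because the data sets are imbalanced (more malware than clean files), so accuracy alone can hide a high false alarm or miss rate.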


Table 3. Experimental results of malware detection using machine learning

Case  Test Error (%)  Accuracy (%)  Training time  Testing time (s)
A     25              75.0          9.2            0.17234
B     12.15           87.85         4.2            0.37738
C     4.41            95.59         16.9           0.69376
D     1.06            98.94         25.8           1.05794

From the experimental results in Table 3, the system achieves its best detection result, 98.94%, with scenario D, when the data set is complete and large. However, with scenario D, the training and testing times are much larger than in the other scenarios. The lowest result, 75%, is obtained with data set A. With the above experimental results, it can be said that the malware detection system based on the properties we propose is highly effective.

V. CONCLUSIONS
In this paper, we have proposed a model of a malware detection system based on the machine learning method. The experimental results show that our approach to detecting and preventing malware is sound and reasonable. The innovation of our paper lies not only in the use of machine learning algorithms to detect malware but also in the proposal to use features that are not too complicated to calculate and extract yet are still highly effective in detecting abnormal process behavior. However, it is easy to see that detecting malware this way requires a cumbersome and complex collection and extraction system. Besides, the behavior of many advanced malware samples today is difficult to collect if collection is based solely on the sandbox tool. Therefore, future research should address detecting malware based on the processes it generates on the operating system.

REFERENCES
1. A. Shabtai, R. Moskovitch, Y. Elovici, C. Glezer, Detection of malware by applying machine learning classifiers on static features: A state-of-the-art survey, Inf. Secur. Tech. Rep. 14 (1) (2009) 16–29.
2. M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, J. Nazario, Automated classification and analysis of internet malware, in: Recent Advances in Intrusion Detection, Springer, 2007, pp. 178–197.
3. U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, E. Kirda, Scalable, behavior-based malware clustering, in: NDSS, Vol. 9, 2009, pp. 8–11.
4. K. Rieck, P. Trinius, C. Willems, T. Holz, Automatic analysis of malware behavior using machine learning, Journal of Computer Security 19 (4) (2011) 639–668.
5. S. Palahan, D. Babić, S. Chaudhuri, D. Kifer, Extraction of statistically significant malware behaviors, in: Computer Security Applications Conference, ACM, 2013, pp. 69–78.
6. M. Egele, M. Woo, P. Chapman, D. Brumley, Blanket execution: Dynamic similarity testing for program binaries and components, in: USENIX Security ’14, USENIX Association, San Diego, CA, 2014, pp. 303–317.
7. M. Lindorfer, C. Kolbitsch, P. M. Comparetti, Detecting environment-sensitive malware, in: Recent Advances in Intrusion Detection, Springer, 2011, pp. 338–357.
8. IMPORTANT INFORMATION REGARDING SANDBOXIE VERSIONS. https://www.sandboxie.com/. [Accessed February 15, 2020].
9. Smola, A.; Vishwanathan, S.V.N. Introduction to Machine Learning; Cambridge University Press: Cambridge, UK, 2008.
10. Daniele Ucci, Leonardo Aniello, Roberto Baldoni. Survey of Machine Learning Techniques for Malware Analysis. Computers & Security (2018), doi: https://doi.org/10.1016/j.cose.2018.11.001.
11. Mehedy Masud, Latifur Khan, and Bhavani Thuraisingham. Data Mining Tools for Malware Detection. CRC Press, 2011.
12. https://mp3.zing.vn. http://chinhphu.vn/portal/page/portal/chinhphu/trangchu
13. VirusShare. https://virusshare.com/. [Accessed February 15, 2020].
14. DAS MALWERK // malware samples. https://dasmalwerk.eu/?fbclid=IwAR1lI91cVexbTj09Qd449PO5y2zoSdq3SxJfxR3-8mxdn1MECA-W3rwtCsw. [Accessed February 15, 2020].
15. 2019 Internet Security Threat Report. https://resources.malwarebytes.com/files/2019/01/Malwarebytes-Labs-2019-State-of-Malware-Report-2.pdf [Accessed February 15, 2020].

AUTHORS PROFILE

Dr. Do Xuan Cho is currently a lecturer at the Faculty of Information Technology at Posts and Telecommunications Institute of Technology in Vietnam. In 2008, Dr. Cho received a bachelor's degree from the Saint Petersburg Electrotechnical University "LETI", Russia, in the specialty "Computer science and computer facilities". In 2010, Dr. Cho received a master's degree from the Saint Petersburg Electrotechnical University "LETI", Russia, in the specialty "Computer science and computer facilities". In 2013, Dr. Cho received a PhD from the Saint Petersburg Electrotechnical University "LETI", Russia, in the specialty CAD. Areas of scientific interest: modeling, control systems, algorithmization.
Email: [email protected]

Do Minh Tuan, Nguyen The Lam, and Nguyen Anh Tuan are fourth-year students majoring in information security at FPT University. They have over 2 years of experience working on APT attack detection.
Email: [email protected], [email protected], [email protected]

Second Author: My position is professor at the Institute of Computer Sciences and Technologies at Peter the Great Saint-Petersburg Polytechnic University. I received the degree of Doctor of Technical Sciences in 1998 in the scientific speciality "Systems of Automatic Design" at SPbPU. My area of scientific interest is the use of new types of fuzzy logic in different applications. I think that we could cooperate intensively in future.
Email: [email protected]
