Threat Detection Model Based On Machine Learning Algorithm
Abstract— A threat is anything that can cause potential damage to a network
system. Such threats can turn into attacks on the system. Threats occur in many
forms, such as viruses, outright attacks, and phishing attacks by hackers
seeking information. These attacks put both a user’s system and business
systems at risk. Cyber security aims to protect systems from attacks such as
unauthorized network access and intrusion attacks. This paper presents a novel
architecture model based on machine learning for the prediction of cyber
security malware that requires execution in a sandbox environment. A Machine
Learning approach is adopted to prevent attackers from infiltrating the system.

Keywords—Machine Learning, Sandbox

I. INTRODUCTION

IT infrastructure environments are growing more complex day by day as more data
is produced. Monitoring this data is beyond the capacity of the human brain,
which cannot reliably distinguish normal data from data that contains viruses
or malware. Cyber security is the domain that identifies such irregularities,
and it still needs human expertise alongside machine learning algorithms. It
can be defined as the intersection of network security, computer security and
information security. A cyber security system consists of two main parts: a
network security system and a host security system. Each of these parts has, at
a minimum, antivirus software, firewalls, and an Intrusion Detection System
(IDS). An IDS helps identify unauthorized use, alteration, duplication, and
destruction of information systems [7].

Machine Learning (ML) is a data analysis method that automates the building of
an analytical model using algorithms that learn from data and find insights in
the data without being explicitly programmed where to look [8].

ML works in three phases: it first trains on data, then validates, and then
tests the result. To decide which model is best to work with, the selection
should be made on the basis of the performance of the model, not on the
accuracy of the data set given. Machine Learning works with the following
steps [9]:

1. Categorize the characteristics from the training data (training model).
2. Classify split sets of attributes necessary for classification.
3. Learn the model using training data (providing the ML algorithm).
4. Use the trained model to create classes of unknown data and predict the
   result accurately.

The main advantage of ML algorithms is that they learn and improve based on
practice and results. For example, if a task takes 1 day today, it may take 20
hours tomorrow, 12 hours the day after, and so on. By "learning and
predicting", Machine Learning automates tasks to an extent that a human team
cannot reach [10].

Several factors need to be weighed when selecting a Machine Learning algorithm.
These include time complexity, incremental update capability, online/offline
mode, and the detection rate the algorithm achieves for a system. Because
algorithms play such diverse roles, it is quite difficult to choose the most
apt and efficient one [11].

The performance of a security analytic system depends on two important factors:
the class of input data and the implementation of the Machine Learning (ML)
algorithm. ML algorithms range from supervised learning (e.g., Logistic
Regression, Support Vector Machine, Naïve Bayes, and Decision Trees) to
unsupervised learning (e.g., k-means clustering) and reinforcement learning.

II. LITERATURE REVIEW

In the paper [1], ‘The promise of machine learning in cybersecurity’, J. Fraley
and J. Cannady describe some of the work that has been carried out on securing
data from attackers. The paper also helps to identify the significant
elementary parts as well as to find solutions using system software. It notes
that machine learning has moved from the regular lab setting, termed a ‘black
box’ environment, to an experimental platform. It has been adopted by numerous
emerging commercial companies. Applications built on it include speech
recognition, voice command, language processing and many more.

Cyber security combined with machine learning is well suited to handling large
volumes of data. Automated security devices are designed primarily to offer
signature-based control. Such applications, for example an Intrusion Prevention
System or Intrusion Detection System, detect suspicious events on data and
networks and raise alerts. The paper outlines a high-level approach to such
systems and their results from a security perspective.

Some reports, such as those from McAfee, state that about 3 million new
malicious files are generated every 60 minutes. The expectation is that machine
learning will be the main means of countering the current attack scenario.

To convey the state of ongoing attacks, the paper tabulates exact figures, in
thousands, for e-mail, network, internet, and malware attacks. These figures
apply across millions of users.

The paper follows six steps in analyzing and addressing the problem.

In the ‘Develop business understanding’ step, the actual reasons for the
defects are taken into consideration, and the answers to them are pursued
through the machine learning procedure.

In the ‘Analyse data and data dependencies’ step, the group combines, compares,
and analyses the data available for the situation in order to arrive at an
out-of-the-box answer to the problem. Globally there are 1200 alerts, which are
grouped into 20 sections to ease the study.

In the subsequent step, ‘Engage subject matter experts’, subject matter experts
(SMEs) take part in analysing and examining the situation and in grouping the
factors. The accuracy of attacks, along with the relevant figures, is listed
for clear analysis in later steps. The SMEs clearly intended to bring out
transparent results for the scenario.

Next is ‘Prepare data’, in which the data is fed to the system that will
provide the solutions. To get an optimum result, the data that is fed must be
analyzed and re-checked in an organized way. This step consumes around 85% of
the total engineering work. The data is examined for gaps, incompleteness, and
inconsistencies so that the analysis is clear.

Then comes ‘Develop model’, in which the group generates the model from the
data prepared in the previous step and fed to the system. TensorFlow is used in
this step to optimize the results. It can multitask, handling the system while
managing status in parallel.

The last step is ‘Evaluate model’, in which the models created from the data
set fed to the system are analyzed, evaluated, and calibrated. The stages are
cross-examined to understand the achieved performance, which is the major goal,
and all results are then properly labelled.

The study reported an accuracy of about 90%. Compared with the traditional
approach, the time consumed was reduced by approximately 78%. The problems
reported every hour were reduced by 1/4th of the original number.

In the paper [2], ‘A machine learning approach to detecting sensor data
modification intrusions in WBANs’, A. Verner and D. Butvinik note that only a
few quality solutions have been produced for WBANs, the best being CodeBlue and
ALARM-NET, which are based on ECC and AES security schemes respectively. The
work follows a similar method of organizing the received data in an efficient
format, as used in many fields. In their methodology the authors make a few
assumptions that make the approach easier to present. The study is organized
around ‘Assumptions, method, chosen ML features, chosen structure of negatively
labelled vector’. The SVM training phase and the execution time took a long
time to complete, so many of the vectors were combined to overcome the time
drawback while preserving accuracy. The method still needs further
improvement [2].

In the paper [3], ‘Machine learning to detect anomalies in web log analysis’,
Q. Cao, Y. Qiao, and Z. Lyu divide the whole system into four parts: data
pre-processing, a decision tree classifier, a data extractor, and a Hidden
Markov model. The work is mainly concerned with deriving solutions from the
flow of data. Clustering algorithms were the first approach used for network
anomaly detection. The proposed method builds a model of normal files and then
uses it as the detector. The paper reported an accuracy of about 93% and a
false positive rate of about 4%. The parameters were tuned automatically to
generate better results. The method was compared with other models and showed
better performance. Even with these better results, the adaptability of the
module was not assured.

In the paper [4], ‘Evaluation of machine learning techniques for network
intrusion detection’, M. Zaman and C. H. Lung mention that clustering methods
are the ones mostly used in evaluating machine learning for detection. The
platform they use is widely used in research as a data source. The paper
presents six separate machine learning methods along with six ensemble methods,
which are used to gather results for the attacks. When compared on ROC results,
however, the outcome was not satisfactory, although the work opened a new route
for machine learning. The technique tracks traffic data and records true
positives, false positives, true negatives and false negatives. The proposed
method did not provide better
results, so further study and more adaptable measures are a must in future
work.

In the paper [5], ‘A review of intrusion detection using anomaly based
detection’, U. Kumari and U. Soni illustrate that the most important aspect of
data security is intrusion detection, which identifies fraud, fake products,
and attacks. Intrusion detection is used to find the cause or root of an
attack, or the link between it and the large setback it causes. It is now
regularly used as the way to find the cause in all fields, both private and
commercial. Unusual patterns in the data, such as those from terrorist
activity, can be easily identified as attacks. The paper improves the
effectiveness of the system's functionality along with its security features.
Its major drawback is that it leaves out securing the database [5].

moved directly to the sandbox as test data for further deep malware detection.

The primary requirement of a training dataset is that it be accurately
labelled; otherwise it may produce misleading results for future unknown
samples. A training set containing one labelled and one unlabelled data set
(positive data (labelled) as well as malicious samples (unlabelled)) [21] is
applied to the training model. Rules are then extracted from this training data
and used to train the algorithm. In principle, this leads to the creation of a
classifier for new samples.

Here the training data is prepared for training the model by passing it through
various filters.
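The training step described above, in which labelled samples are turned into a classifier for new samples, can be illustrated with a minimal sketch. It assumes a Bernoulli Naive Bayes model (Naive Bayes is one of the algorithms the paper names, but it is not stated to be the authors' chosen method) and uses invented behavioural feature names such as `encrypts_files`; neither the features nor the toy data come from the paper.

```python
import math
from collections import defaultdict


def train_naive_bayes(samples):
    """Fit a Bernoulli Naive Bayes model.

    samples: list of (set_of_observed_features, label) pairs.
    Returns (class_counts, feature_counts, vocabulary).
    """
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    vocabulary = set()
    for features, label in samples:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocabulary.add(f)
    return class_counts, feature_counts, vocabulary


def classify(model, features):
    """Return the label with the highest log-posterior for a new sample."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for f in vocabulary:
            # Laplace-smoothed probability that feature f is present in this class
            p = (feature_counts[label][f] + 1) / (n + 2)
            score += math.log(p if f in features else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training data: behaviours observed per sample.
TRAINING = [
    ({"writes_registry", "encrypts_files", "deletes_backups"}, "malicious"),
    ({"encrypts_files", "contacts_unknown_host"}, "malicious"),
    ({"writes_registry", "contacts_unknown_host", "deletes_backups"}, "malicious"),
    ({"opens_window", "reads_config"}, "benign"),
    ({"reads_config", "writes_log"}, "benign"),
    ({"opens_window", "writes_log"}, "benign"),
]

model = train_naive_bayes(TRAINING)
print(classify(model, {"encrypts_files", "deletes_backups"}))  # prints "malicious"
print(classify(model, {"reads_config", "opens_window"}))       # prints "benign"
```

The sketch also shows why accurate labels matter: swapping the labels in `TRAINING` would flip every prediction, since the model has no other source of truth.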
From the database it goes to the sandbox, as shown in Figure 1.
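The hold-and-release flow that the paper describes, in which a transaction waits until the sandbox confirms it is safe to run on the main servers, might be sketched as follows. The `Controller` class, the `toy_sandbox` function, and the `encrypt_all_files` marker are all hypothetical names introduced for illustration; a real sandbox would execute the sample in an isolated virtual machine rather than inspect a string.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    CLEAN = "clean"
    MALICIOUS = "malicious"


@dataclass
class Controller:
    """Holds each transaction until a sandbox verdict arrives (illustrative only)."""
    held: dict = field(default_factory=dict)      # transaction id -> payload awaiting a verdict
    executed: list = field(default_factory=list)  # ids released to the main servers
    blocked: list = field(default_factory=list)   # ids rejected after sandbox analysis

    def submit(self, tx_id, payload):
        # Nothing runs on the main servers yet; the transaction is only queued.
        self.held[tx_id] = payload

    def on_sandbox_verdict(self, tx_id, verdict):
        # Release or reject the held transaction based on the sandbox result.
        self.held.pop(tx_id)
        if verdict is Verdict.CLEAN:
            self.executed.append(tx_id)
        else:
            self.blocked.append(tx_id)


def toy_sandbox(payload):
    """Stand-in for a real sandbox: flags payloads containing a suspicious marker."""
    return Verdict.MALICIOUS if "encrypt_all_files" in payload else Verdict.CLEAN


controller = Controller()
controller.submit("tx-1", "read_config; write_log")
controller.submit("tx-2", "encrypt_all_files; delete_backups")

# Copy the items first because on_sandbox_verdict mutates controller.held.
for tx_id, payload in list(controller.held.items()):
    controller.on_sandbox_verdict(tx_id, toy_sandbox(payload))

print(controller.executed, controller.blocked)  # prints ['tx-1'] ['tx-2']
```

Keeping held, executed, and blocked transactions in separate collections makes it easy to check that nothing reaches the main servers without a verdict.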
Another powerful tool for advanced threat detection is the sandbox. Sandboxing
combined with machine learning methods has emerged as a powerful cyber security
tool. The time an algorithm takes to train on a database (i.e., the training
time) varies from algorithm to algorithm. A good deal of research is needed to
select the ML algorithm suited to an application. One rule of thumb is to apply
supervised machine learning methods if the input is continuous and unsupervised
learning methods if it is discrete. The selection of algorithm also depends on
the working mode (online or offline) of the security analytic system.

To apply machine learning algorithms to any problem, it is essential to
represent the data in some form. A sandbox is used for this purpose. The
reports generated by the sandbox, which describe the behavioural data of each
sample, are pre-processed, and malware features are extracted from them [22].

Sandboxing involves capturing a document or executable file, which is then
opened within a secure virtual machine. Potential threats are run in this
controlled environment in order to observe exactly how the executing software
behaves. A sandbox is used to execute untrusted or untested programs or code,
possibly from unverified third parties, websites, or suppliers, without risking
harm to the host machine or operating system.

An advanced sandboxing solution provides CPU-level threat protection, which
depends on the exploitation stage of the attack. This allows an organisation to
detect and block advanced persistent threats (APT).

A software architect should take several things into consideration when
selecting and integrating an optimized algorithm. An ML algorithm that performs
well for one type of security analytics (e.g., detecting DoS attacks) may not
perform well for another (e.g., detecting brute force attacks).

Selecting an algorithm is tricky because one that delivers good performance may
degrade other system qualities such as accuracy, complexity, and
understandability of the final result. For example, researchers [24] compared
the Support Vector Machine (SVM) with the Extreme Learning Machine (ELM) in
terms of accuracy and performance. SVM produced more accurate results but
proved computationally expensive, whereas ELM proved lighter but produced less
accurate results. A practical trade-off among the various system qualities
should be recognized when selecting the algorithm.

As stated in paper [23], "While there is no silver bullet to solve the
cybersecurity challenge, the key is to use layered defences along with machine
learning threat detection capabilities for optimum results".

IV. EXAMPLES OF DIFFERENT ALGORITHMS EMPLOYED IN OTHER DETECTION APPLICATIONS

In the Cloud-based Threat Detector [25], the system has been implemented with
two ML algorithms: K-means and Naive Bayes. Comparing the time the two
algorithms take to train a system, it was observed that with 500 GB of training
data, K-means takes around 60 seconds while Naive Bayes takes around 92 seconds
to train a model.

V. APPLICATIONS OF APPLYING MACHINE LEARNING IN CYBER SECURITY

These are the threats that ML can protect against [38]:

1. Ransomware - malware that prevents users from accessing their personal files
   and demands a ransom payment to restore access.

2. Watering Hole - hackers track the websites that users visit regularly and
   use them to gain access to the users’ identification.

3. Webshell - a webshell is a short piece of code that allows the hacker to
   make modifications in a server's web root directory. This means that full
   access to the system's database is gained. If it is an e-commerce website,
   the hacker can access the database on a daily basis to collect customers’
   credit card information.

4. Spear Phishing - trained machine learning models can be used to detect
   whether an email is malicious by identifying key features such as email
   headers, subsamples of body data, punctuation patterns, etc.

VI. CONCLUSION

The sandboxing technique combined with Machine Learning methods is a powerful
cyber security tool for preventing cyber attacks on a network system, and this
paper has presented such a tool. It has proven to be a powerful method for
advanced threat protection. In the proposed method, transactions are held until
the controller receives confirmation from the sandbox servers to execute them
on the main servers. This technique proves to be efficient and autonomous for
advanced threat detection systems.

REFERENCES

[1] J. Fraley and J. Cannady, "The promise of machine learning in
    cybersecurity", SoutheastCon 2017, 2017.

[2] A. Verner and D. Butvinik, "A Machine Learning Approach to Detecting Sensor
    Data Modification Intrusions in WBANs", 2017 16th IEEE International
    Conference on Machine Learning and Applications (ICMLA), 2017.

[3] Q. Cao, Y. Qiao and Z. Lyu, "Machine learning to detect anomalies in web
    log analysis", 2017 3rd IEEE International Conference on Computer and
    Communications (ICCC), 2017.
[4] M. Zaman and C. Lung, "Evaluation of machine learning techniques for
    network intrusion detection", NOMS 2018 - 2018 IEEE/IFIP Network Operations
    and Management Symposium, 2018.

[5] U. Kumari and U. Soni, "A review of intrusion detection using anomaly based
    detection", 2017 2nd International Conference on Communication and
    Electronics Systems (ICCES), 2017.

[6] N. Papernot, P. McDaniel, A. Sinha, M. P. Wellman, "Security and privacy in
    machine learning", Tutorial at IEEE WIFS, Rennes, France, 2017.

[7] Mukkamala A, Sung A, Abraham A, "Cyber security challenges: Designing
    efficient intrusion detection systems and antivirus tools".

[8] R. Shanbhogue and B. Beena, "Survey of Data Mining (DM) and Machine
    Learning (ML) Methods on Cyber Security", Indian Journal of Science and
    Technology, vol. 10, no. 35, pp. 1-7, 2017.

[9] Congzheng Song, Thomas Ristenpart, Vitaly Shmatikov, "Machine Learning
    Models that Remember Too Much", 22 Sep 2017, arXiv:1709.07886v1 [cs.CR].

[10] Ullah, F., and Babar, "Architectural Tactics for Big Data Cybersecurity
     Analytic Systems: A Review", 9 Feb 2018, arXiv:1802.03178v1 [cs.CR].

[11] A. Buczak and E. Guven, "A Survey of Data Mining and Machine Learning
     Methods for Cyber Security Intrusion Detection", IEEE Communications
     Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176, 2016.

[12] DeLiang Wang, "Unsupervised Learning - Foundations of Neural Computation",
     AI Magazine, vol. 22, pp. 101-102, September 2001.

[13] Hwanjo Yu, Jiong Yang, Jiawei Han, "Classifying Large Data Sets Using SVMs
     with Hierarchical Clusters," SIGKDD ’03 Washington, DC, USA, ACM
     1581137370/03/0008, 2003.

[14] Andrew Ng, CS229 Lecture notes, support vector machines, part V,
     http://math480-s15-zarringhalam.wikispaces.umb.edu/file/view/SVMs_Ng_Stanford.pdf/538578768/SVMs_Ng_Stanford.pdf

[15] Jason Brownlee, "A Tour of Machine Learning Algorithms", 25 Nov 2013.

[16] T. Soni Madhulatha, "An Overview on Clustering Methods", IOSR Journal of
     Engineering, Apr 2012, Vol. 2(4) pp: 719-725, ISSN 2250-3021,
     arXiv:1205.1117v1 [cs.DS].

     University College London, NIPS Tutorials, http://www.gatsby.ucl.ac.uk/,
     Dec 1999.

[18] G. N. Ramadevi, K. Usharani, "Study on Dimensionality Reduction Techniques
     and Applications", Publications of Problems & Application in Engineering
     Research – Paper, Vol 04, Special Issue01, ISSN: 2230-8547; e-ISSN:
     2230-8555, 2013.

[19] W. Lee, M. Cheon, C. Hyun and M. Park, "Best Basis Selection Method Using
     Learning Weights for Face Recognition", Sensors, vol. 13, no. 10, pp.
     12830-12851, 2013.

[20] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification
     Techniques", Published in Proceedings of the 2007 conference on Emerging
     Artificial Intelligence Applications in Computer Engineering Real Word AI
     Systems with Applications in eHealth, HCI, Information Retrieval and
     Pervasive Technologies, IOS Press Amsterdam, The Netherlands, pp. 3-24,
     ISBN: 978-1-58603-780-2, 2007.

[21] Anoop Kumar Jain and Satyam Maheswari, "Survey of Recent Clustering
     Techniques in Data Mining", International Archive of Applied Sciences and
     Technology, Volume 3 [2], ISSN: 0976-4828, pp. 68 - 75, June 2012.

[22] B. Liu, X. Li, W. Lee, & P. S. Yu, "Text Classification by Labeling
     Words", American Association for Artificial Intelligence, 2004.

[23] L. Vokorokos, A. Balaz and B Mados, "Application Security through Sandbox
     Virtualization", Acta Polytechnica Hungarica, Vol. 12, No. 1, 2015.

[24] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised
     learning algorithms", Proceedings of the 23rd international conference on
     Machine learning - ICML '06, 2006.

[25] Chi Cheng, Wee Peng Tay and G. Huang, "Extreme learning machines for
     intrusion detection", The 2012 International Joint Conference on Neural
     Networks (IJCNN), 2012.

[26] N. Keegan, S. Ji, A. Chaudhary, C. Concolato, B. Yu and D. Jeong, "A
     survey of cloud-based network intrusion detection analysis", Human-centric
     Computing and Information Sciences, vol. 6, no. 1, 2016.

[27] A. Sallab, M. Abdou, E. Perot and S. Yogamani, "Deep Reinforcement
     Learning framework for Autonomous Driving", Electronic Imaging, vol. 2017,
     no. 19, pp. 70-76, 2017.