Fuzzy K-Mean Clustering To Preclude Cyber Security Risk: Problem Statement
Problem Statement:
Cyber threats are growing in prevalence and affect every network user. Numerous security
monitoring systems are employed to protect computer networks and resources from cyber-attacks,
and there is a pressing need for an efficient system to monitor the large network datasets
generated in this process. In this work, a large network dataset representing malware attacks is
used to establish an expert system. The characteristics of attackers’ IP addresses are extracted
from the integrated datasets to generate statistical data. A cyber security expert provides the
weight of each attribute and forms a scoring system by annotating the log history. A
semi-supervised method is then used to classify cyber security logs into attack, unsure and no
attack: the data is first broken into 3 clusters using Fuzzy K-Means (FKM), a small sample of the
data is then labelled manually (Analyst Intuition), and finally a multilayer perceptron (MLP)
neural network classifier is trained on the manually labelled data. This helps in finding
anomalies in cyber security logs, a task that generally produces a large number of false
detections. The classification results are encouraging in segregating the types of attacks. The
system also automates the integration of the datasets and implicitly sends the statistical data to
the machine learning and data mining algorithms, making a complete end-to-end process for
identifying attack-related traffic in the network datasets.
Literature Survey:
The sheer volume of big data presents a great challenge when we attempt to study the patterns
or associations in the data. Advances in handling big data enable many industrial problems and
challenges to be addressed. Industries and companies are now able to understand and process
volumes of data that were once beyond their reach. While many domains have benefited from big
data technologies, cyber security is one field that is just beginning to explore the use of big
data analytics. The ability to detect and deter cyber-attacks can make or break the functional
success of an enterprise [1]. Using big data,
organizations may be able to rigorously detect threats, create better defense mechanisms and
improve security. A Security Information and Event Management (SIEM) system [2] is capable of
analyzing data from several log files; however, such systems are limited in the amount of data
they can handle. With systems such as Hadoop [3], cyber security data can now be stored in a
dedicated repository that can accommodate more than three months of data and can combine and
analyze real-time data along with historical data. An advanced persistent threat (APT) is a
network attack in which an unauthorized person gains access to a network and stays there
undetected for a long period of time [4]. Because big data analytics deals with longer-term data,
it could potentially help to detect APTs that manifest over time. Big data analytics plays an
important role in detecting advanced threats and insider threats [5]. Monitoring systems can
potentially minimize false alarms by providing
smarter analytics. Data analytics can be used to assist systems in collecting internal data by
merging with relevant external data to detect known patterns to stay ahead of malicious activities
or intruders. Currently, 8% of major global companies [5] have adopted big data analytics for
one or more use cases related to security and fraud detection. Gartner predicted that within a
year this would increase to 25%, with a positive return on investment within six months of
implementation. Data analysis should be intelligent and timely, as anything that is delayed will
lose its value, especially in the field of cyber security. Given that hackers are well aware of
security measures and other fraud detection measures that are employed by enterprises, they are
able to directly attack without any reconnaissance phase. Hence, to stay ahead of hackers,
enterprises can use big data analytics to improve monitoring and detection systems with
contextual data and apply smarter analytics. Data correlation techniques can be used among the
high-priority alerts and monitoring systems to detect patterns and get a bigger picture of the state
of security. Also, enterprises can opt for fast tuning of their rules and models to test against data
streaming close to real time. The Teradata report [6] states that traditional methods that fall
short in detecting and preventing threats can be enhanced with big data analytics. Many big data
tools and techniques have emerged that can efficiently handle the volume and complexity of
various kinds of data, such as machine-generated and network-related data. The survey results in
the same report [6] likewise indicate that big data analytics can overcome these shortcomings
of traditional solutions. Hence, big data systems are part of a cyber defense strategy for every
enterprise to meet the needs of complex, large-scale analytics. A major concern with the cyber
security monitoring process is that
when multiple security monitoring systems are employed and each system generates numerous
log files (such as security logs and network traffic logs), there is no well-established system that
can identify the relationships among these log files and integrate them. These log files are crucial
for identifying attack-related patterns and assisting in the early detection of APTs or other
malicious attacks. The work in [2] identifies the challenges in dealing with big data analysis, such
as automating the whole process of locating, identifying and understanding the data. A good
database schema design is required before analyzing the dataset. Similarly, mining requires
data to be integrated, cleaned and efficiently accessible, which involves the use of effective
mining algorithms and big data computing environments. Labrinidis and Jagadish [7] also note that
significant research is required to achieve automated integration of data sets as well as a suitable
database design, even for simpler analysis of a single data set. It is also essential that effective
mining techniques are used to extract information from the large datasets. The objective of this
research is to use an efficient expert system that draws on the expertise of cyber security experts
and allows them to input suitable weights for different attributes. The cyber security expert also
contributes to the scoring system based on the words in the log files. We then adopt the Fuzzy
k-Means (FKM) algorithm to create clusters of attackers and non-attackers in order to segregate
the attack-related traffic from the network datasets. Our Analyst Intuition approach is inspired by
Kalyan [8] and Chang [9]. Kalyan [8] used a semi-supervised approach for huge, unbalanced and
unlabelled data: a sample of the data is labelled, the system is trained on that sample, and the
trained system is then tested against the remaining data. Likewise, Chang et al. [9] use a similar
method, known as “Expectation Regulated Neural Network”, for DDoS attacks.
In this study, we collected 3 days of data comprising 36 million log files and amounting to 36
gigabytes. The 3 days of data include the logs from about 1200 computers. The log files were
obtained from firewall server, intrusion detection system and anti-virus logs. The total number of
malware instances is about 60. We apply data mining techniques to study the statistical data
obtained from the integrated datasets. These analytics help in distinguishing attack-related traffic
from normal traffic as well as extracting attack patterns. The Fuzzy k-Means (FKM) clustering
algorithm was applied to create attacker and non-attacker clusters on the time-related and
connection-related data obtained from the integrated datasets. Several models were generated by
changing the key parameters. The testing step was repeated several times to determine the
accuracy and efficiency of the results. The results obtained from the algorithms were validated
against each other to verify the attack-related traffic. The FKM algorithm created three clusters
in total: (i) cluster-1 consists of no attackers, (ii) cluster-2 consists of an uncertain number of
attackers, and (iii) cluster-3 consists of 364 non-attackers. One issue in cyber security is that
different network security systems and tools generate log files in different formats, which
complicates consolidation. This research demonstrates the integration and analysis of datasets for
identifying attack-related traffic that can potentially lead to easier threat detection in cases
where attacks occur on multiple platforms.
Methodology:
A semi-supervised method is used to classify cyber security logs into attack, unsure and no
attack: the data is first broken into 3 clusters using Fuzzy K-Means (FKM), a small sample of the
data is then labelled manually (Analyst Intuition), and finally a multilayer perceptron (MLP)
neural network classifier is trained on the manually labelled data. This helps in finding
anomalies in cyber security logs, a task that generally produces a large number of false
detections. The combination of Artificial Intelligence (AI) and Analyst Intuition (AI) is also
known as AI2 [8][9]. The Fuzzy k-Means (FKM) clustering algorithm is used to create attacker and
non-attacker clusters on the time-related and connection-related data obtained from the integrated
datasets. The model is illustrated in Figure 1. It starts by extracting log data from the
application for training and testing. The data is split into 3 clusters based on the K-means
algorithm: no attack, unsure and attack. The Multilayer Perceptron neural network is then trained
on 2/3 of the data, and the remaining 1/3 is used for testing. The original data is not labelled.
Words found in the log files are given weights, from which a score is computed. The data is
visualized in Excel and manually labelled into 3 classes: attack, unsure and no attack. The model
is then trained on these labels and used for classification. The classification system takes in
the expert's view to provide weightages according to the types of attacks, as shown in Table I.
Figure 1: The model that detects anomalies from big data.
The total number of malware instances is about 60. Network traffic data arrives at very high
speed, resulting in more than 1000 log files being generated every second; we therefore use batch
processing instead of real-time processing.
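The batch processing and the 2/3 to 1/3 split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `batches` and `split_two_thirds`, the batch size and the random seed are hypothetical, and the MLP classifier itself is left out.

```python
import random
from itertools import islice


def batches(records, size):
    """Yield successive lists of at most `size` records from any iterable.

    Hypothetical helper: lets a high-rate log stream be handled in batches
    instead of record by record.
    """
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def split_two_thirds(labelled_rows, seed=0):
    """Shuffle labelled rows and return (train, test) in a 2/3 : 1/3 ratio."""
    rows = list(labelled_rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    cut = (2 * len(rows)) // 3
    return rows[:cut], rows[cut:]


# Example: nine manually labelled rows -> six for training, three for testing.
rows = [("log line %d" % i, "attack" if i % 3 == 0 else "no attack") for i in range(9)]
train, test = split_two_thirds(rows)
```

Shuffling before the cut keeps the test portion from being dominated by whichever log source happened to be written last; with only about 60 malware instances, a stratified split may be worth considering instead.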
Data mining techniques are applied to study the statistical data obtained from the integrated
datasets, which contain attacks such as Malware, Trojan, Passing off, Soft1026, and Virus. The
expert system allows cyber security experts to enter their inputs to form the scores. These
analytics help in distinguishing attack-related traffic from normal traffic as well as extracting
attack patterns. The Fuzzy k-Means (FKM) clustering algorithm is used to create attacker and
non-attacker clusters on the time-related and connection-related data obtained from the integrated
datasets. The clustering algorithm forms 3 clusters, Strong, Average, and Mild, which correspond
to the labels attack, unsure and no attack respectively. The input of the algorithm consists of 2
attributes: the first is word-found and the second is scoring. The corresponding algorithm is
shown in Algorithm 1. Euclidean distance, as in K-means, is used as the distance measure. Prior to
clustering, fuzzification is performed. The system first looks for keywords such as worm and
malware in the data and marks the word-found feature as 1 when a keyword is encountered. The
expert weightage is then applied to form the score.
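The keyword scoring and the clustering step described above can be sketched as follows. This is an illustrative sketch, not Algorithm 1 from the paper. The keyword weights, the fuzzifier m = 2, the deterministic choice of initial centres and all function names are assumptions made here. The clustering is the standard fuzzy c-means update with Euclidean distance, and the separate fuzzification step is not modelled.

```python
import math

# Hypothetical expert weights; in the paper these are entered by the security expert.
KEYWORD_WEIGHTS = {"worm": 0.9, "malware": 1.0, "trojan": 0.8, "virus": 0.7}


def extract_features(log_line):
    """Return the two attributes (word_found, score) for one log line."""
    hits = [KEYWORD_WEIGHTS[t] for t in log_line.lower().split() if t in KEYWORD_WEIGHTS]
    return (1.0 if hits else 0.0, sum(hits))


def fuzzy_k_means(points, k=3, m=2.0, iterations=50):
    """Standard fuzzy c-means with Euclidean distance.

    Membership of point i in cluster j is d_ij^(-2/(m-1)) normalised over all
    clusters; each centre is the mean of the points weighted by membership^m.
    Returns (centres, memberships); every membership row sums to 1.
    """
    distinct = sorted(set(points))
    if k < 2 or len(distinct) < k:
        raise ValueError("need k >= 2 and at least k distinct points")
    # Assumption: start from evenly spaced distinct points for repeatability.
    centres = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]
    memberships = []
    for _ in range(iterations):
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centres]
            if 0.0 in dists:
                # The point sits exactly on a centre: full membership there.
                raw = [1.0 if d == 0.0 else 0.0 for d in dists]
            else:
                raw = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(raw)
            memberships.append([r / total for r in raw])
        new_centres = []
        for j in range(k):
            weights = [row[j] ** m for row in memberships]
            wsum = sum(weights)
            new_centres.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(len(points[0]))
            ))
        centres = new_centres
    return centres, memberships


def name_clusters(centres, memberships, names=("no attack", "unsure", "attack")):
    """Label each point by its strongest cluster, ranking clusters by centre score."""
    order = sorted(range(len(centres)), key=lambda j: centres[j][1])
    rank = {j: names[r] for r, j in enumerate(order)}
    return [rank[row.index(max(row))] for row in memberships]


logs = [
    "user login ok", "file copied", "backup finished",
    "virus signature seen", "trojan blocked",
    "worm malware trojan detected", "malware worm virus outbreak",
]
points = [extract_features(line) for line in logs]
centres, memberships = fuzzy_k_means(points, k=3)
labels = name_clusters(centres, memberships)
```

Ranking the clusters by the score coordinate of their centres is one simple way to map them onto the Mild, Average and Strong groups; the membership rows themselves remain available when a soft decision is preferred over a hard label.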
REFERENCES:
[1] Harper, Jelani. "Enterprise Threats: Big Data and Cyber Security." 11 June 2013. Dataversity
Education.
[2] Cardenas, Alvaro, Pratyusa Manadhata and Sreeranga Rajan. "Big Data Analytics for Security." IEEE
Security and Privacy 2013. Document.
[5] Gartner. (February 2014). Security Announcements. Retrieved from Gartner website:
http://www.gartner.com/
[6] Ponemon Institute. "Big Data Analytics in Cyber Defense." Ponemon Institute Research Report.
2013.
[7] Labrinidis, Alexandros, and H. V. Jagadish. "Challenges and opportunities with big data."
Proceedings of the VLDB Endowment 5.12, 2012.
[8] Apache Mahout. The Apache Software Foundation. 2014. Retrieved from:
https://mahout.apache.org/general/faq.html
[9] Veeramachaneni, Kalyan, and Ignacio Arnaldo. "AI2: Training a big data machine to defend."
Retrieved from: https://people.csail.mit.edu/kalyan/AI2_Paper.pdf