Chapter 3
Data Analytics for Cybersecurity
- Introduction to Data Mining -
Vandana P. Janeja
Data Preprocessing
• Data Cleaning
• Data Transformation and Integration
• Data Reduction
• Measures of Similarity
• Measures of Evaluation
• Clustering Algorithms
• Classification
• Pattern Mining: Association Rule Mining
Data Mining Methods: Clustering
• The primary approach of partitioning algorithms, such as K-means, is to form subgroups of the entire set of data by making K partitions based on a similarity criterion. The data in each subgroup is grouped around a central object, such as a mean or medoid. The algorithms differ in the way they define the similarity criterion, the central object, and the convergence of clusters.
Partitioning Algorithms: K-means
• K-means (MacQueen, 1967)
• Based on finding partitions in the data by evaluating the distance of objects in the cluster to the centroid, which is the mean of the data in the cluster
• Intuitively, the bigger the error, the more spread out the points are around the mean
• If we plot the SSE for K = 1 to K = n (n = number of points), the SSE ranges from a very large value (all points in one cluster) to 0 (every point in its own cluster). The elbow of this plot provides an ideal value of K (see the sketch below)
• K-means has been a well-used and well-accepted clustering algorithm due to its intuitive approach and interpretable output. However, K-means does not work well in the presence of outliers and does not form non-spherical or free-form clusters
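A minimal sketch of the elbow heuristic, assuming scikit-learn and NumPy are available; the dataset X is illustrative stand-in data, and the elbow is read off the printed SSE values rather than detected automatically:

```python
# Sketch: compute SSE (inertia) for a range of K to locate the elbow.
# Assumes scikit-learn; X is stand-in data with three loose blobs,
# so the elbow should appear near K = 3.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 5, 10)])

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}  SSE={km.inertia_:.1f}")  # SSE falls sharply until the elbow, then flattens
```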
• Starts with a set of points, a value K for the number of clusters to form, and seed centroids, which can be selected from the data or randomly generated
• Points are partitioned around the seed centroids such that each point is allocated to the centroid to which its distance is smallest
• At the end of the first round we have K clusters partitioned around the seed centroids
• The means (centroids) of the newly formed clusters are then computed, and the process is repeated to align the points to the newly computed centroids
• This process is iterated until there is no more reassignment of points from one cluster to another (see the sketch below)
• K-means works on a heuristic-based partitioning rather than an optimal partitioning
• The quality of clusters can be evaluated using the sum of squared errors (SSE), which computes the distance of every point in the cluster to its centroid
(Figure: K-means iteration loop — cluster assignments → centroids recomputed → reassignment of points → final clusters when there is no more reassignment)
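The loop just described can be sketched directly in NumPy; this is an illustrative toy implementation of the reassignment/recomputation iteration, not a production K-means (empty clusters, for instance, are not handled):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Toy Lloyd-style K-means over X (an n x d array)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # seed centroids from the data
    labels = None
    for _ in range(max_iter):
        # Allocate each point to the centroid to which its distance is smallest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no more reassignment of points: clusters have converged
        labels = new_labels
        # Recompute centroids as the means of the newly formed clusters.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sse = ((X - centroids[labels]) ** 2).sum()  # cluster quality: sum of squared errors
    return labels, centroids, sse
```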
Density-Based Algorithms: DBSCAN
• Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, proposed by Ester et al. (1996)
• Foundation of several other density-based clustering algorithms
• Treats clusters as dense or sparse regions based on the number of points in the neighborhood region of a point under consideration
• Takes a radius ε and a minimum number of points in that region as parameters (ε, MinPts)
• The minimum number of points in a radius of ε facilitates the measurement of density in that neighborhood
DBSCAN Example: MinPts = 3, ε = 1
Core points: p2, p4, p6, and p7
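The same setup can be reproduced with scikit-learn's DBSCAN; the coordinates below are illustrative stand-ins (the slide's p1..p7 positions are not reproduced here), chosen so that ε = 1 and MinPts = 3 yield two dense regions and one noise point:

```python
# Assumes scikit-learn. In sklearn, min_samples plays the role of MinPts
# (the point itself counts toward its own neighborhood).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5],  # dense region 1
              [5.0, 5.0], [5.5, 5.0], [5.0, 5.5],              # dense region 2
              [9.0, 0.0]])                                     # isolated point -> noise

db = DBSCAN(eps=1.0, min_samples=3).fit(X)
print("labels:", db.labels_)                          # -1 marks noise
print("core point indices:", db.core_sample_indices_)
```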
Classification process
• Given a collection of records, the goal of classification is to derive a model that can assign a record to a class as accurately as possible
• Here class is often referred to as label; for example, a data point is anomaly or not. Here Anomaly and Not anomaly can be two classes
• Multi-class classifiers are also proposed which deal with classification of an instance as belonging to multiple classes
• Classification approaches learn a model from a training dataset, i.e., pre-labelled data with samples of both classes, to identify previously unseen observations in the test set, labelling each instance based on the model's prediction
• The labelled data, which is a set of database tuples with their corresponding class labels, can be divided into training and test data
Classification process
• In the training phase, a classification algorithm builds a classifier (a set of rules) that learns from a training set
• This classification model includes some descriptions (rules) for each class, using features or attributes in the data
• This model is used to determine the class to which a test data instance belongs
• In the testing phase, a set of data tuples that do not overlap with the training tuples is selected. Each test tuple is compared with the classification rules to determine its class
• The labels of the test tuples are reported, along with the percentage of correctly classified labels, to evaluate the accuracy of the model on previously unseen data
• As the model accuracy is evaluated and the rules are perfected for labelling previously unseen instances, these rules can be used for future predictions
• One way to do that is to maintain a knowledge base or expert system with these discovered rules; as incoming data is observed to match these rules, the labels for them are predicted
(Figure: workflow — labelled data → training data / test data → model generation → accuracy metrics → expert systems → real-time prediction)
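The training/testing workflow above maps onto a few library calls; a hedged sketch assuming scikit-learn, with synthetic stand-in data in place of real labelled tuples:

```python
# Sketch of the train/test classification workflow; data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))               # feature tuples
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # labels, e.g. anomaly vs. not anomaly

# Non-overlapping training and test tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # training phase
y_pred = model.predict(X_test)                                     # testing phase
print("accuracy on previously unseen tuples:", accuracy_score(y_test, y_pred))
```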
Data Mining methods: Classification
• Several classification algorithms have been proposed, which approach classification modelling using different mechanisms.
• Decision tree algorithms provide a set of rules, in the form of a decision tree, that assigns labels based on conditions in the tree branches
• Bayesian models provide a probability value of an instance belonging to a class
• Function-based methods provide functions for demarcations in the data such that the data is clearly divided between classes
• Classification combinations, namely ensemble methods, combine classifiers across multiple samples of the training data (see the sketch after this list)
  • These methods are designed to increase the accuracy of the classification task by training several different classifiers and combining their results to output a class label
  • A good analogy is when humans seek the input of several experts before making an important decision
  • Diversity among classifier models is a required condition for having a good ensemble classifier (He & Garcia, 2009; Li, 2007)
  • The base classifier in the ensemble should be a weak learner in order to get the best results out of an ensemble
• A classification algorithm is called a weak learner if a small change in the training data produces a big difference in the induced classifier mapping
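As an illustration of combining classifiers trained on multiple samples of the training data, a bagging sketch assuming scikit-learn, with synthetic data and decision stumps as the deliberately weak base learner (the `estimator` argument is named `base_estimator` in older scikit-learn versions):

```python
# Sketch: bagging fits each base classifier on a bootstrap sample of the
# training data and combines their labels by majority vote. Data is synthetic.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

stump = DecisionTreeClassifier(max_depth=1)   # a weak base learner
bag = BaggingClassifier(estimator=stump, n_estimators=50, random_state=2).fit(X_tr, y_tr)
print("single stump:   ", stump.fit(X_tr, y_tr).score(X_te, y_te))
print("bagged ensemble:", bag.score(X_te, y_te))
```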
Classification Algorithms
Decision Tree Based Classifier: C4.5
• The C4.5 decision tree algorithm starts by evaluating attributes to identify the attribute which gives the most information for making a decision about labelling with the class
• The decision tree provides a series of rules to identify which attribute should be evaluated to come up with the label for a record where the label is unknown
• These rules form the branches in the decision tree
• The purity of the attributes used to make a decision split in the tree is computed using measures such as entropy
• The entropy of a particular node Info_A (corresponding to an attribute) in a tree is the sum, over all classes represented in the node, of -p_i log2(p_i), where p_i is the proportion of records belonging to class i
• When entropy reduction is chosen as the splitting criterion, the algorithm searches for the split that reduces entropy, or equivalently, the split that increases the information learnt from that attribute by the greatest amount
• If a leaf in a decision tree is entirely pure, then the classes in the leaf can be clearly described, that is, they all fall in the same class
• If a leaf is highly impure, then describing it is much more complex
• Entropy helps us quantify the concept of purity of the node
• The best split is the one that does the best job of separating the records into groups where a single class predominates in that branch
Example C4.5

(a) Training records:

ID  DOW      packets        Flows      Attack
1   Weekday  <400           <10K       N
2   Weekday  <400           <10K       N
3   Weekday  <400           <10K       N
4   Weekday  <400           <10K       N
5   Weekday  <400           <10K       Y
6   Weekday  >400K&<1000K   <10K       N
7   Weekday  >400K&<1000K   >10K&<30K  N
8   Weekend  >400K&<1000K   >10K&<30K  Y
9   Weekend  >1000K         >10K&<30K  N
10  Weekend  >1000K         >10K&<30K  Y
11  Weekend  >1000K         >10K&<30K  Y
12  Weekend  >1000K         >10K&<30K  N
13  Weekend  >1000K         >10K&<30K  N
14  Weekend  >1000K         >30K       Y
15  Weekend  >1000K         >30K       Y
16  Weekend  >1000K         >30K       Y
17  Weekday  >1000K         >30K       Y
18  Weekday  >400K&<1000K   >30K       N

(b) Class counts per attribute value:

DOW       Weekday  Weekend
Attack_Y  2        6
Attack_N  7        3

packets   <400  >400K&<1000K  >1000K
Attack_Y  1     1             6
Attack_N  4     3             3

Flows     <10K  >10K&<30K  >30K
Attack_Y  1     3          4
Attack_N  5     4          1

The entropy reduction (information gain) for each candidate split can be computed directly from these counts (see the sketch below).
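A small sketch that computes the information gain for each candidate split from the class counts in table (b); the counts are copied from the table, with log base 2:

```python
from math import log2

def entropy(counts):
    """Node entropy: -sum(p * log2 p) over the class proportions."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_after_split(groups):
    """Weighted average entropy of the child nodes after a split."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)

root = [8, 10]  # (Attack_Y, Attack_N) over all 18 records
splits = {
    "DOW":     [[2, 7], [6, 3]],          # Weekday, Weekend
    "packets": [[1, 4], [1, 3], [6, 3]],  # <400, >400K&<1000K, >1000K
    "Flows":   [[1, 5], [3, 4], [4, 1]],  # <10K, >10K&<30K, >30K
}
for attr, groups in splits.items():
    print(f"{attr}: information gain = {entropy(root) - info_after_split(groups):.3f}")
```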
Data Mining Methods: Pattern Mining
• Pattern refers to the occurrence of multiple objects in a certain combination, which could lead to the discovery of an implicit interaction
Categorization of the various pattern mining approaches
Frequent Pattern Mining: Apriori
• The Apriori algorithm is based on finding frequent itemsets and then discovering and quantifying the association rules
• It provides an efficient mechanism for discovering frequent itemsets
• A subset of a frequent itemset must also be a frequent itemset, i.e., if {IP1, high} is a frequent itemset, both {IP1} and {high} should be frequent itemsets
• It iteratively finds frequent itemsets with cardinality from 1 to k (k-itemset) and uses the frequent itemsets to generate association rules
• Apriori generates the candidate itemsets by joining the itemsets with large support from the previous pass and deleting the subsets which have small support in the previous pass
• By only considering the itemsets with large support, the number of candidate itemsets is significantly reduced. In the first pass, itemsets with only one item are counted
• The itemsets with higher support are used to generate the candidate sets of the second pass
• Once the candidate itemsets are found, their supports are calculated to discover the itemsets of size two with large support, and so on
• This process terminates when no new itemsets with large support are found (a sketch of this loop follows below)
• Here min support is predetermined as a user-defined threshold
• A confidence threshold is also predetermined by the user
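A compact level-wise sketch of the join, prune, and count loop described above, in pure Python; `apriori`, its arguments, and the frozenset representation are illustrative choices, not the book's code:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Toy level-wise frequent itemset mining (transactions: list of sets)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # Pass 1: count 1-itemsets; keep those meeting the min support threshold.
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) / n >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join step: size-k candidates from the frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count step: keep candidates with large support; iterate to size k+1.
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```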
• If IP1 and high are two items, the association rule IP1 → high means that whenever the item IP1 occurs in a transaction, then high also occurs, with a quantified probability
• The probability or confidence threshold can be defined as the percentage of transactions containing both high and IP1 with respect to the percentage of transactions containing just IP1
• This can be seen in terms of conditional probability, where P(high | IP1) = P(IP1 ∪ high) / P(IP1)
• The strength of the rule can also be quantified in terms of support: support of X is the percentage of transactions containing X in the entire set of transactions. Confidence is determined as Support(IP1 ∪ high) / Support(IP1)
• This gives the confidence and support for the rule IP1 → high
• The rule IP1 → high is not the same as high → IP1, as the confidence will change

Candidate 2-itemset counts:
IP1, high   3
IP1, low    0
IP4, high   0
IP4, low    2
IP6, low    0
IP6, med    1
IP6, high   1
high, low   0

Frequent 2-itemsets:
IP1, high   3
IP4, low    2

Rule         Confidence                       Support
IP1 → high   |IP1 & high| / |IP1| = 1         |IP1 & high| / #tuples = 0.38
high → IP1   |IP1 & high| / |high| = 0.6      |IP1 & high| / #tuples = 0.38
IP4 → low    |IP4 & low| / |IP4| = 1          |IP4 & low| / #tuples = 0.25
low → IP4    |IP4 & low| / |low| = 1          |IP4 & low| / #tuples = 0.25
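The confidence and support figures in the table can be reproduced over a hypothetical 8-transaction set constructed to match the counts shown (an assumption, since the raw transactions do not appear on the slide):

```python
# Hypothetical transactions consistent with the slide's itemset counts.
T = [{"IP1", "high"}, {"IP1", "high"}, {"IP1", "high"},
     {"IP4", "low"}, {"IP4", "low"},
     {"IP6", "high"}, {"IP2", "high"}, {"IP6", "med"}]

def support(itemset):
    return sum(itemset <= t for t in T) / len(T)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(confidence({"IP1"}, {"high"}), support({"IP1", "high"}))  # 1.0, 0.375 (~0.38)
print(confidence({"high"}, {"IP1"}))                            # 0.6
print(confidence({"IP4"}, {"low"}), support({"IP4", "low"}))    # 1.0, 0.25
```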
Selected References
• Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From data mining to knowledge discovery in databases." AI magazine 17.3 (1996): 37.
• Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13—22.
• Chapman, P. (NCR), Clinton, J. (SPSS), Kerber, R. (NCR), Khabaza, T. (SPSS), Reinartz, T. (DaimlerChrysler), Shearer, C. (SPSS), & Wirth, R. (DaimlerChrysler). (2000). CRISP-DM 1.0: Step-by-step data mining guide.
• Lukasz A. Kurgan and Petr Musilek. 2006. A survey of Knowledge Discovery and Data Mining process models. Knowl. Eng. Rev. 21, 1 (March 2006), 1-24.
DOI=http://dx.doi.org/10.1017/S0269888906000737
• Famili, A., Shen, W. M., Weber, R., & Simoudis, E. (1997). Data preprocessing and intelligent data analysis. Intelligent data analysis, 1(1-4), 3-23.
• Liu, H., Hussain, F., Tan, C. L., & Dash, M. (2002). Discretization: An enabling technique. Data mining and knowledge discovery, 6(4), 393-423.
• Garcia, S., Luengo, J., Sáez, J. A., Lopez, V., & Herrera, F. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on
Knowledge and Data Engineering, 25(4), 734-750.
• Al Shalabi, L., Shaaban, Z., & Kasasbeh, B. (2006). Data mining: A preprocessing engine. Journal of Computer Science, 2(9), 735-739.
• Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
• Molina, L. C., Belanche, L., & Nebot, À. (2002). Feature selection algorithms: A survey and experimental evaluation. In Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE
International Conference on (pp. 306-313). IEEE.
• Fodor, Imola K. "A survey of dimension reduction techniques." Center for Applied Scientific Computing, Lawrence Livermore National Laboratory 9 (2002): 1-18.
• Wall, M. E., Rechtsteiner, A., & Rocha, L. M. (2003). Singular value decomposition and principal component analysis. In A practical approach to microarray data analysis (pp. 91-
109). Springer US.
• Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., ... & Zhou, Z. H. (2008). Top 10 algorithms in data mining. Knowledge and information systems, 14(1), 1-37.
• Leskovec, J., Kleinberg, J., & Faloutsos, C. (2005, August). Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM
SIGKDD international conference on Knowledge discovery in data mining (pp. 177-187). ACM.
• Shashi Shekhar, Chang-tien Lu, Pusheng Zhang, Detecting Graph-based Spatial Outliers: Algorithms and Applications, Proc. of the 7th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2001
• Joachims, T. (2002, July). Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data
mining (pp. 133-142). ACM.
• Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd (Vol. 96, No. 34, pp. 226-
231).
• Hu, M., & Liu, B. (2004, August). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data
mining (pp. 168-177). ACM.
• Kaufman, L., & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
• MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 281-297).
Selected References
• Ng, R. T., & Han, J. (1994, September). Efficient and Effective Clustering Methods for Spatial Data Mining. In Proceedings of VLDB (pp. 144-155).
• Zhang, T., Ramakrishnan, R., & Livny, M. (1996, June). BIRCH: an efficient data clustering method for very large databases. In ACM Sigmod Record (Vol. 25, No. 2, pp. 103-114).
ACM.
• Guha, S., Rastogi, R., & Shim, K. (1998, June). CURE: an efficient clustering algorithm for large databases. In ACM Sigmod Record (Vol. 27, No. 2, pp. 73-84). ACM.
• Karypis, G., Han, E. H., & Kumar, V. (1999). Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8), 68-75.
•Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999, June). OPTICS: ordering points to identify the clustering structure. In ACM Sigmod record (Vol. 28, No.
2, pp. 49-60). ACM.
•Hinneburg, A., & Keim, D. A. (1998, August). An efficient approach to clustering in large multimedia databases with noise. In KDD (Vol. 98, pp. 58-65).
• Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (1998). Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data mining and
knowledge discovery, 2(2), 169-194.
• Xu, X., Jäger, J., & Kriegel, H. P. (1999). A fast parallel clustering algorithm for large spatial databases. In High Performance Data Mining (pp. 263-290). Springer US.
• Xu, X., Ester, M., Kriegel, H. P., & Sander, J. (1998, February). A distribution-based clustering algorithm for mining in large spatial databases. In Data Engineering,
1998. Proceedings., 14th International Conference on (pp. 324-331). IEEE.
• Jarvis, Raymond Austin, and Edward A. Patrick. "Clustering using a similarity measure based on shared near neighbors." IEEE Transactions on computers 100.11
(1973): 1025-1034.
• Zhou, B., Cheung, D. W., & Kao, B. (1999, April). A fast algorithm for density-based clustering in large database. In Pacific-Asia Conference on Knowledge Discovery
and Data Mining (pp. 338-349). Springer Berlin Heidelberg.
• Stutz, J., & Cheeseman, P. (1996). AutoClass—a Bayesian approach to classification. In Maximum entropy and Bayesian methods (pp. 117-126). Springer
Netherlands.
• Gennari, J. H., Langley, P., & Fisher, D. (1989). Models of incremental concept formation. Artificial intelligence, 40(1-3), 11-61.
• Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B
(methodological), 1-38.
•Peña, J. M., Lozano, J. A., & Larrañaga, P. (2002). Learning recursive Bayesian multinets for data clustering by means of constructive induction. Machine Learning,
47(1), 63-89.
• Wang, W., Yang, J., & Muntz, R. (1997, August). STING: A statistical information grid approach to spatial data mining. In VLDB (Vol. 97, pp. 186-195).
• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the
1998 ACM SIGMOD international conference on Management of data, pages 94{105. ACM Press, 1998.
• Guha, S., Rastogi, R., & Shim, K. (2000). ROCK: A robust clustering algorithm for categorical attributes. Information systems, 25(5), 345-366.
• Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Computational and Applied Mathematics. 20: 53–
65