A Novel Method of Spam Mail Detection Using Text Based Clustering Approach
$idf_t = \log \left( \frac{N}{df_t} \right)$ .   (3)
Email documents with their weighted terms

        | Term 1 | ... | Term n
Email 1 | W_1    | ... | W_n
Email 2 | :      | :   | :
   :    | :      | :   | :
Email n | W_1    | ... | W_n
Now, combine the above expressions for term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by Eq. (4) as

$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t$ .   (4)
The model thus built shows the representation of each email document described by its attributes [13]. Each tuple is assumed to belong to a predefined class, as determined by one
of the attributes called the class label attribute. Thus, the e-mail documents with their weighted terms can be represented in tabular form as shown above.
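As a worked illustration of Eqs. (3) and (4), take N = 2893 (the total number of messages in the Ling-Spam corpus used later in this paper) and assume, purely for illustration, a term t with df_t = 100 and tf_{t,d} = 3 in some document d; using a base-10 logarithm (the base is not fixed by the paper):

$idf_t = \log_{10}(2893 / 100) \approx 1.46$
$tf\text{-}idf_{t,d} = 3 \times 1.46 \approx 4.38$

A term that occurs frequently in a document but rarely across the collection therefore receives a high weight.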
5. DATA REDUCTION & CLUSTERING
Data reduction includes data clustering that concerns how to
group a set of objects based on their similarity of attributes and /
or their proximity in the vector space. Clustering is applied to 2
different training sets, one belonging to spam and the other
belonging to non-spam. The cluster representatives will now
belong to 2 different classes. The class that each tuple (e-mail)
belongs to is given by one of the attributes of the tuple, often
called the class label attribute [14].
5.1 Data Clustering
Clustering is the process of partitioning or dividing a set of
patterns (data) into groups. Each cluster is abstracted using one or
more representatives. Representing the data by fewer clusters
necessarily loses certain fine details, but achieves simplification.
It models data by its clusters. Clustering is a type of classification imposed on a finite set of objects. The relationship between objects is represented in a proximity matrix in which the rows represent the n e-mails and the columns correspond to the terms given as dimensions.
If objects are categorized as patterns, or points in a d-dimensional
metric space, the proximity measure can be Euclidean distance
between a pair of points. Unless a meaningful measure of distance or proximity between a pair of objects is established, no meaningful cluster analysis is possible. Clustering is useful in many applications such as decision making, data mining, text mining, machine learning, grouping, pattern classification and intrusion detection. Clustering also helps in detecting outliers and in examining small clusters [15].
The proximity matrix is used in this context and thus serves as a useful input to the clustering algorithm. Clustering represents a set of n patterns by m points, typically with m < n, leading to data compression; the cluster centroids can be used as these representative points. This helps in prototype selection for efficient classification. The clustering algorithms are applied separately to the training sets belonging to the 2 different classes to obtain their corresponding cluster representatives.
There are different stages in clustering. Typical pattern clustering
activity involves the following steps, viz.,
Pattern representation (optionally including feature extraction
and/or selection),
Definition of a pattern proximity measure appropriate to the
data domain,
Clustering or grouping,
Data abstraction (if needed), and
Assessment of output (if needed).
Fig. 3 depicts a typical sequencing of the first three of the five steps mentioned above, including a feedback path
where the grouping process output could affect subsequent feature
extraction and similarity computations [16]. Pattern representation
refers to the number of classes, the number of available patterns,
and the number, type and scale of the features available to the
clustering algorithm where a pattern x is a single data item used
by the clustering algorithm. Some of this information may not be
controllable by the practitioner.
Fig. 3 : First three steps of the clustering process (patterns → feature selection / extraction → pattern representations → inter-pattern similarity → grouping → clusters, with a feedback loop from grouping back to feature extraction)
Feature selection is the process of identifying the most effective subset of the original features to use in clustering, where the individual scalar components x_i of a pattern x are called features.
Feature extraction is the use of one or more transformations of the
input features to produce new salient features. Either or both of
these techniques can be used to obtain an appropriate set of
features to use in clustering. Pattern proximity is usually
measured by a distance function defined on pairs of patterns, such
as the Euclidean distance between patterns.
The grouping step can be performed in a number of ways. The
output clustering (or clusterings) can be hard (a partition of the
data into groups) or fuzzy (where each pattern has a variable
degree of membership in each of the output clusters). Hierarchical
clustering algorithms produce a nested series of partitions based
on a criterion for merging or splitting clusters based on similarity.
Partitional clustering algorithms identify the partition that
optimizes (usually locally) a clustering criterion [17].
Fig. 4 : Approaches to clustering the data (hierarchical: single link, complete link; partitional: square error (K-means), graph theoretic, mixture resolving (expectation maximization), mode seeking)
Data abstraction is the process of extracting a simple and compact
representation of a data set. Here, simplicity is either from the
perspective of automatic analysis (so that a machine can perform
further processing efficiently) or it is human-oriented (so that the
representation obtained is easy to comprehend and intuitively
appealing).
In the clustering context, a typical data abstraction is a compact
description of each cluster, usually in terms of cluster prototypes
or representative patterns such as the centroid. Different
approaches to clustering data can be described with the help of the
hierarchy shown in Fig. 4. In the work considered in this paper,
the hierarchical type of clustering has been used [18].
Some of the clustering algorithms popularly used are:
5.1.1 K-means
The K-means algorithm takes the input parameter k, and partitions
a set of n objects into k clusters so that the resulting intra cluster
similarity is high but the inter-cluster similarity is low. The K-
means algorithm proceeds as follows:
Arbitrarily choose k objects as the initial cluster centers.
Repeat
(re)assign each object to the cluster to which the object is the
most similar, based on the mean value of the objects in the
cluster.
update the cluster means, i.e., calculate the mean value of the objects in each cluster.
until no change.
Generally, the k-means algorithm has the following important
properties, viz.,
It is efficient in processing large data sets.
It often terminates at a local optimum.
The clusters have spherical shapes.
It is sensitive to noise.
The K-means algorithm is classified as a batch method, because it
requires that all the data be available in advance. However, there are variants of the K-means clustering process which get around this limitation. Choosing the proper initial centroids is the
key step of the basic K-means procedure. The time complexity of
the k-means algorithm is O(nkl), where n is the number of objects,
k is the number of clusters, and l is the number of iterations [19].
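As a concrete illustration of the loop described above, the following is a minimal C sketch of K-means on one-dimensional data with hard-coded sample values; it is only an illustrative sketch, not the implementation used in this work.

#include <stdio.h>
#include <math.h>

#define N 8   /* number of objects (sample data) */
#define K 2   /* number of clusters              */

int main(void) {
    double x[N] = {1.0, 1.2, 0.8, 1.1, 8.0, 8.3, 7.9, 8.2}; /* sample 1-D data */
    double centroid[K];
    int label[N];
    int changed, i, j;

    for (i = 0; i < N; i++) label[i] = -1;

    /* arbitrarily choose K objects as the initial cluster centers */
    for (j = 0; j < K; j++)
        centroid[j] = x[j * (N / K)];

    do {
        changed = 0;
        /* (re)assign each object to the closest cluster center */
        for (i = 0; i < N; i++) {
            int best = 0;
            double bestd = fabs(x[i] - centroid[0]);
            for (j = 1; j < K; j++) {
                double d = fabs(x[i] - centroid[j]);
                if (d < bestd) { bestd = d; best = j; }
            }
            if (label[i] != best) { label[i] = best; changed = 1; }
        }
        /* update the cluster means */
        for (j = 0; j < K; j++) {
            double sum = 0.0;
            int cnt = 0;
            for (i = 0; i < N; i++)
                if (label[i] == j) { sum += x[i]; cnt++; }
            if (cnt > 0) centroid[j] = sum / cnt;
        }
    } while (changed);   /* until no change */

    for (j = 0; j < K; j++)
        printf("cluster %d centroid: %.2f\n", j, centroid[j]);
    return 0;
}

The cost of each iteration is proportional to N * K, in line with the O(nkl) complexity quoted above.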
The other variant of the K-means algorithm is the single-pass K-means algorithm. The basic K-means algorithm generally takes a large number of iterations to converge, so handling large data sets with K-means requires a buffering strategy that makes a single pass over the data set. It works as follows.
Let B be the size of the buffer and k_1 the set of cluster representatives, i.e., the cluster means. Initially, data of size B is loaded into the buffer and the K-means algorithm is applied to it. The cluster representatives are stored in memory and the remaining data is discarded. The next block of data is then loaded from disk into memory, and K-means is run on this new data together with the previous cluster representatives. This process repeats until the whole data set is clustered. It takes less computational effort than the normal K-means algorithm. However, the K-means algorithm suffers from sensitivity to the initial guess of the centroids and to the value of K, lack of scalability, its restriction to numerical attributes, and the possibility of unbalanced resulting clusters.
5.1.2 BIRCH-Balanced Iterative Reducing and
Clustering using Hierarchies
BIRCH is an integrated hierarchical clustering method. It uses a clustering feature (CF) and a clustering feature tree (CF-tree) to summarize cluster representations. A clustering feature is a triplet summarizing information about a sub-cluster of objects. Given N d-dimensional points or objects {o_i} in a sub-cluster, the CF of the sub-cluster is defined as

CF = (N, LS, SS)   (3)

where N is the number of points in the sub-cluster, LS is the linear sum of the N points and SS is the square sum of the data points.
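As an illustration of the CF triplet and of the additivity property that BIRCH exploits when merging sub-clusters, a minimal C sketch is given below; the struct and function names (CF, cf_add_point, cf_merge) are our own illustrative choices, not part of any standard BIRCH implementation.

#include <stdio.h>

#define DIM 3   /* dimensionality d of the data points (example value) */

/* clustering feature of a sub-cluster: CF = (N, LS, SS) */
typedef struct {
    long   n;        /* N : number of points in the sub-cluster    */
    double ls[DIM];  /* LS: per-dimension linear sum of the points  */
    double ss;       /* SS: square sum of the data points           */
} CF;

/* add one d-dimensional point to a CF */
void cf_add_point(CF *cf, const double p[DIM]) {
    cf->n++;
    for (int i = 0; i < DIM; i++) {
        cf->ls[i] += p[i];
        cf->ss   += p[i] * p[i];
    }
}

/* CF additivity: the CF of two merged sub-clusters is the sum of their CFs */
CF cf_merge(const CF *a, const CF *b) {
    CF m = *a;
    m.n += b->n;
    for (int i = 0; i < DIM; i++) m.ls[i] += b->ls[i];
    m.ss += b->ss;
    return m;
}

int main(void) {
    CF a = {0}, b = {0};
    double p1[DIM] = {1, 2, 3}, p2[DIM] = {2, 2, 2}, p3[DIM] = {8, 9, 7};
    cf_add_point(&a, p1);
    cf_add_point(&a, p2);
    cf_add_point(&b, p3);
    CF m = cf_merge(&a, &b);
    printf("merged: N=%ld LS=(%.1f,%.1f,%.1f) SS=%.1f\n",
           m.n, m.ls[0], m.ls[1], m.ls[2], m.ss);
    return 0;
}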
A CF-Tree is a height balanced tree that stores the clustering
features for a hierarchical clustering. The non-leaf nodes store
sums of the CFs of their children, and thus summarize clustering
information about their children. A CF tree has 2 parameters, viz.,
branching factor B and threshold T. The branching factor specifies
the maximum number of children a non-leaf node can have and
the threshold specifies the maximum diameter of the sub clusters
stored at the leaf nodes. The BIRCH algorithm consists of 4 phases:
Phase 1 : Load the data into memory by building a CF tree.
Phase 2 : Condense into a desirable range by building a smaller CF tree (this phase is optional).
Phase 3 : Global clustering.
Phase 4 : Cluster refining (this phase is optional).
Phase 1 : The CF tree is built as the objects are inserted. An object is inserted into the closest leaf entry. If the diameter of the sub-cluster stored in the leaf node becomes larger than the threshold, then the leaf node is split. After the insertion of a new object, information about it is passed towards the root of the tree.
Phase 2 : This phase is optional. It condenses the CF tree into a smaller one so that the existing global or semi-global clustering methods can be applied efficiently in Phase 3.
Phase 3 : A global clustering algorithm is used to cluster the leaf nodes of the CF tree.
Phase 4 : This phase is optional and entails the cost of additional passes over the data to correct inaccuracies and refine the clusters further. Up to this point, the original data has been scanned only once, although the tree and outlier information may have been scanned multiple times. Phase 4 uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds to obtain a set of new clusters [20].
6. CLASSIFICATION OF CLASSIFIERS
Classifiers are used to predict the class label of the new document
which is unlabelled [7]. For classification, we use classifiers like
NNC (Nearest Neighbour Classifier) and its variant K-NNC (K-
Nearest Neighbour Classifier).
6.1 NNC
The nearest neighbour classifier assigns to a test pattern a class
label of its closest neighbour. If there are n patterns X_1, X_2, ..., X_n, each of dimension d, and each pattern is associated with a class c, and if we have a test pattern P, then the nearest neighbour X_k satisfies

$d(P, X_k) = \min \{ d(P, X_i) \}, \quad i = 1, 2, \ldots, n$ .   (5)
To compare the distances of a given test pattern with other
patterns, the nearest neighbour classifier uses the Euclidean
distance method which is given by [1]
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ .   (6)
Pattern P is assigned to the class associated with X_k.
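A minimal C sketch of the nearest neighbour rule of Eqs. (5) and (6) is given below, using cluster centroids as the stored patterns (as is done in this work); the centroid values and class labels are made up for illustration.

#include <stdio.h>
#include <math.h>

#define DIM 2          /* feature dimension d (example value)        */
#define NCENTROIDS 4   /* number of stored centroids (2 per class)   */

/* Euclidean distance between two d-dimensional patterns, Eq. (6) */
double euclidean(const double a[DIM], const double b[DIM]) {
    double s = 0.0;
    for (int i = 0; i < DIM; i++)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(s);
}

int main(void) {
    /* centroids produced by the clustering step (illustrative values) */
    double centroid[NCENTROIDS][DIM] = {
        {0.9, 0.1}, {0.8, 0.2},   /* spam cluster representatives     */
        {0.1, 0.8}, {0.2, 0.9}    /* non-spam cluster representatives */
    };
    int class_of[NCENTROIDS] = {1, 1, 0, 0};  /* 1 = spam, 0 = non-spam */

    double test[DIM] = {0.85, 0.15};          /* test pattern P */

    /* NNC: assign P the class of its closest stored pattern, Eq. (5) */
    int nearest = 0;
    double best = euclidean(test, centroid[0]);
    for (int k = 1; k < NCENTROIDS; k++) {
        double d = euclidean(test, centroid[k]);
        if (d < best) { best = d; nearest = k; }
    }
    printf("test pattern classified as %s (distance %.3f)\n",
           class_of[nearest] ? "spam" : "non-spam", best);
    return 0;
}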
6.2 K-NNC
The K-nearest neighbour classifier is a variant of the nearest
neighbour classifier where instead of finding just one nearest
neighbour as in the case of nearest neighbour classifier, k nearest
neighbours are found. The nearest neighbours are found using the
Euclidean distance. The majority class of this k nearest neighbours
is the class label assigned to the new test pattern.
The value chosen for k is crucial; with the right value of k, the classification accuracy will be better than that of the nearest neighbour classifier. For large data sets, k can be made larger to reduce the error. Choosing k can be done experimentally: a number of patterns taken out of the training set are classified using the remaining training patterns for different values of k, and the value of k that gives the least classification error is chosen [2].
This method reduces the error in classification when the training patterns are noisy. The closest pattern to the test pattern may belong to another class, but when a number of neighbours are obtained and the majority class label is considered, the pattern is more likely to be classified correctly.
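Similarly, a compact C sketch of K-NNC with a majority vote over the k closest stored patterns is shown below; the pattern values, the labels and the choice k = 3 are illustrative assumptions.

#include <stdio.h>
#include <math.h>

#define DIM 2
#define NPAT 6     /* number of stored patterns/centroids (example)   */
#define KNN 3      /* k, the number of nearest neighbours to consider */

static double dist(const double a[DIM], const double b[DIM]) {
    double s = 0.0;
    for (int i = 0; i < DIM; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(s);
}

int main(void) {
    double pat[NPAT][DIM] = {
        {0.9, 0.1}, {0.8, 0.2}, {0.7, 0.3},   /* spam representatives     */
        {0.1, 0.8}, {0.2, 0.9}, {0.3, 0.7}    /* non-spam representatives */
    };
    int label[NPAT] = {1, 1, 1, 0, 0, 0};     /* 1 = spam, 0 = non-spam   */
    double test[DIM] = {0.6, 0.4};            /* test pattern             */

    double d[NPAT];
    int used[NPAT] = {0};
    int votes[2] = {0, 0};

    for (int i = 0; i < NPAT; i++) d[i] = dist(test, pat[i]);

    /* pick the KNN smallest distances and count their class labels */
    for (int n = 0; n < KNN; n++) {
        int best = -1;
        for (int i = 0; i < NPAT; i++)
            if (!used[i] && (best < 0 || d[i] < d[best])) best = i;
        used[best] = 1;
        votes[label[best]]++;
    }

    /* the majority class among the k nearest neighbours is assigned */
    printf("K-NNC (k=%d) classifies the test pattern as %s\n",
           KNN, votes[1] > votes[0] ? "spam" : "non-spam");
    return 0;
}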
7. PORTER'S ALGORITHM
In this section, a brief overview of Porter's algorithm is presented. The Porter Stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980. The
stemmer is based on the idea that the suffixes in the English
language (approximately 1200) are mostly made up of a
combination of smaller and simpler suffixes. This stemmer is a
linear step stemmer.
The Porter stemming algorithm (or Porter Stemmer) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalization process that is usually done when setting up information retrieval systems. Porter's algorithm works on the measure of a stem, i.e., the number of vowel sequences that are followed by a consonant; this measure must exceed a rule-dependent threshold (typically greater than one) for a rule to be applied [3]. Using Porter's algorithm, code has been developed and used so that the emails can be classified into spam and non-spam more efficiently.
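To illustrate the notion of the measure used by the Porter stemmer, the sketch below counts the vowel-consonant (VC) sequences of a stem; it treats only a, e, i, o, u as vowels (ignoring the stemmer's special handling of 'y') and is not the full Porter algorithm.

#include <stdio.h>
#include <ctype.h>

/* simplified vowel test: the real Porter stemmer also treats 'y' as a
   vowel when it follows a consonant; that detail is omitted here      */
static int is_vowel(char c) {
    c = (char)tolower((unsigned char)c);
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

/* measure m of a stem: the number of VC (vowel run followed by a
   consonant) sequences, i.e. the m in the form [C](VC)^m[V]           */
static int porter_measure(const char *stem) {
    int m = 0, in_vowel_run = 0;
    for (const char *p = stem; *p; p++) {
        if (is_vowel(*p)) {
            in_vowel_run = 1;
        } else if (in_vowel_run) {
            m++;                 /* a vowel run just ended in a consonant */
            in_vowel_run = 0;
        }
    }
    return m;
}

int main(void) {
    const char *examples[] = {"tr", "tree", "oats", "trouble", "private"};
    for (int i = 0; i < 5; i++)
        printf("m(\"%s\") = %d\n", examples[i], porter_measure(examples[i]));
    /* a rule with condition (m > 1) would apply to "private" (m = 2)
       but not to "tree" (m = 0) or "oats" (m = 1)                      */
    return 0;
}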
8. EVALUATION MEASURES
This is done in two steps, viz., classifier accuracy and alternatives to the accuracy measure.
8.1 Classifier Accuracy
Estimating classifier accuracy is important in that it allows one to
evaluate how accurately a given classifier will label the test data.
It can be calculated using the formula discussed below. The data set used for training and testing is the Ling-Spam corpus. Each of the 10 sub-directories contains spam and legitimate messages, one message in each file. The total number of spam messages is 481 and that of legitimate messages is 2412.
8.2 Alternatives to accuracy measure
A classifier is trained to classify e-mails as non-spam and spam
mails [6]. An accuracy of 85% may make the classifier appear accurate, but what if only 10-15% of the training samples are actually spam? Clearly, an accuracy of 85% may not be acceptable: the classifier could be correctly labelling only the non-spam samples. Instead, we would like to be able to assess how well the classifier can recognize spam samples (referred to as positive samples) and how well it can recognize non-spam samples (referred to as negative samples). The sensitivity (recall) and specificity measures can be used, respectively, for this purpose. In addition, we may use precision to assess the percentage of samples labelled as spam that actually are spam. The evaluation measures used in the testing process of our research work are defined as follows [4]:
True Positive (TP) : The number of spam documents correctly classified as spam.
True Negative (TN) : The number of non-spam documents correctly classified as non-spam.
False Positive (FP) : The number of spam documents classified as non-spam.
False Negative (FN) : The number of non-spam documents classified as spam.
TABLE II. THE DIFFERENT MEASURES USED FOR CLASSIFICATION OF SPAM & NON-SPAM SAMPLES

Measure | Formula | Meaning
Precision | TP / (TP + FP) | The percentage of positive predictions that are correct.
Recall / Sensitivity | TP / (TP + FN) | The percentage of positive labelled instances that were predicted as positive.
Specificity | TN / (TN + FP) | The percentage of negative labelled instances that were predicted as negative.
Accuracy | (TP + TN) / (TP + TN + FP + FN) | The percentage of predictions that are correct.
Note that the evaluation is done on the above 4 parameters. The different evaluation measures used in the research work considered are summarized in Table III.
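The formulas of Table II can be computed directly from the four counts; the following C sketch does so for a made-up confusion matrix (the counts are illustrative only and are not results from this paper).

#include <stdio.h>

int main(void) {
    /* illustrative counts only */
    double tp = 40;   /* true positives  */
    double tn = 230;  /* true negatives  */
    double fp = 8;    /* false positives */
    double fn = 12;   /* false negatives */

    double precision   = tp / (tp + fp);
    double recall      = tp / (tp + fn);            /* sensitivity */
    double specificity = tn / (tn + fp);
    double accuracy    = (tp + tn) / (tp + tn + fp + fn);

    printf("precision   = %.2f%%\n", 100.0 * precision);
    printf("recall      = %.2f%%\n", 100.0 * recall);
    printf("specificity = %.2f%%\n", 100.0 * specificity);
    printf("accuracy    = %.2f%%\n", 100.0 * accuracy);
    return 0;
}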
9. DETAILED DESIGN
The design involves three parts, viz., vector space model, the CF
tree & the development of the DFD.
Design Constraints :
The design constraints are divided into software & hardware
constraints, which are listed as below.
Software Constraints :
o Linux Operating system.
o Gcc compiler to compile C programs.
Hardware Constraints :
o Intel Pentium Processor.
o 1 GB RAM.
o 80 GB hard disc space.
9.1 Vector space model
Due to the large number of features (terms) in the training set, the memory requirements are high. Arrays cannot be used to store the features as this leads to memory problems, so a linked list is used to implement the storage of features and the Tf-idf calculation [5]. As the training set contains a large number of documents, the documents are also implemented in the linked list format as shown in Fig. 5.
Fig. 5 : Linked list format in the model (each vocabulary term Term_1 ... Term_d points to a chain of document nodes Doc_1 ... Doc_n, each holding the term frequency Tf and the Tf-idf value)
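A minimal C sketch of the linked-list layout of Fig. 5 is given below: each vocabulary term holds a chain of per-document nodes carrying the term frequency and, later, the Tf-idf value. The struct and function names are our own illustration of this design, not the code used in the paper.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* one (term, document) occurrence: Doc_i -> Tf, Tf-idf, as in Fig. 5 */
typedef struct DocNode {
    int    doc_id;          /* document number                    */
    int    tf;              /* term frequency in that document    */
    double tfidf;           /* Tf-idf weight, filled in later     */
    struct DocNode *next;   /* next document containing the term  */
} DocNode;

/* one vocabulary entry: Term_j -> list of document nodes */
typedef struct TermNode {
    char   term[64];        /* the (stemmed) word                 */
    int    df;              /* document frequency of the term     */
    DocNode *docs;          /* head of the per-document list      */
    struct TermNode *next;  /* next term in the vocabulary        */
} TermNode;

/* prepend a new term to the vocabulary list */
TermNode *add_term(TermNode *head, const char *word) {
    TermNode *t = calloc(1, sizeof *t);
    strncpy(t->term, word, sizeof t->term - 1);
    t->next = head;
    return t;
}

/* record one more occurrence of a term in a given document */
void add_occurrence(TermNode *t, int doc_id) {
    for (DocNode *d = t->docs; d; d = d->next)
        if (d->doc_id == doc_id) { d->tf++; return; }   /* seen before */
    DocNode *d = calloc(1, sizeof *d);                   /* first time  */
    d->doc_id = doc_id;
    d->tf = 1;
    d->next = t->docs;
    t->docs = d;
    t->df++;                              /* one more document for this term */
}

int main(void) {
    TermNode *vocab = add_term(NULL, "money");
    add_occurrence(vocab, 1);
    add_occurrence(vocab, 1);
    add_occurrence(vocab, 2);
    printf("term '%s': df=%d, tf in doc 1 = %d\n",
           vocab->term, vocab->df, vocab->docs->next->tf);
    return 0;
}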
9.2 CF Tree
BIRCH is used to cluster a large amount of data. It inserts the data into the nodes of a CF tree one by one for efficient memory usage. The insertion of a data point into the CF tree is carried out by traversing the CF tree top-down from the root according to an instance-cluster distance function, i.e., the Euclidean distance [6]. The data point is then inserted into the closest sub-cluster under a leaf node as shown in Fig. 6. Note that in Fig. 6, B = 7 and L = 6.
Fig. 6 : BIRCH insertion tree (a non-leaf node holds CF entries CF1 ... CF6, each pointing to a child node; leaf nodes are chained by Prev / Next pointers and hold the sub-cluster CF entries)
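For completeness, the following is an illustrative C declaration of how a CF-tree node of the kind shown in Fig. 6 might look, using the branching factor B = 7 and leaf capacity L = 6 from the figure; it is a sketch of the data structure only, not the BIRCH implementation used here.

#include <stdio.h>

#define DIM 3   /* data dimensionality (example value)                 */
#define B   7   /* branching factor: max children of a non-leaf node   */
#define L   6   /* max sub-cluster CF entries stored in a leaf node    */

/* clustering feature CF = (N, LS, SS), as defined in Section 5.1.2 */
typedef struct {
    long   n;
    double ls[DIM];
    double ss;
} CF;

struct CFNode;

/* one entry of a non-leaf node: a summarizing CF plus a child pointer */
typedef struct {
    CF cf;
    struct CFNode *child;
} CFEntry;

/* a CF-tree node; leaves are chained with prev/next as in Fig. 6 */
typedef struct CFNode {
    int is_leaf;
    int n_entries;
    CFEntry entries[B];          /* at most B children for non-leaf nodes */
    CF leaf_entries[L];          /* at most L sub-cluster CFs in a leaf   */
    struct CFNode *prev, *next;  /* leaf chain pointers                   */
} CFNode;

int main(void) {
    CFNode root = {0};
    root.is_leaf = 1;
    printf("CF-tree node size: %zu bytes (B=%d, L=%d)\n", sizeof root, B, L);
    return 0;
}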
The data flow diagram used for the design of the algorithm for efficient spam mail classification is shown in Fig. 7 along with the inputs and outputs. The inputs and outputs shown in Fig. 7 can be further explained by the following 5-step procedure [7].
1) In pre-processing of data, there are two main modules, i.e.,
Stopping and Stemming
Stopping
Input : Training & test document.
Output : Document with stopped words.
Stemming
Input : Output of Stopping module, i.e., document with
words that are stopped
Output : Document with stemmed words.
Fig. 7 : Data flow diagram (DFD) of the designed system, i.e., the proposed model (training & test documents → stopping & stemming → documents with stopped & stemmed words → vector space model → Tf-idf vectors of training and test patterns → clustering of the training patterns → centroids → classification of the test pattern → output: spam or non-spam)
2) In the Vector Space Model,
Input : Output of the stemming module.
Output : The Tf-idf of each document.
3) Data Reduction has two main modules, i.e., K-means and
BIRCH, both have identical Input and Output forms.
Input : Vector representation of the training data.
Output : Two sets of data, one belonging to spam & the
other to non-spam represented by centroids.
4) Classification also has two main modules, i.e., NNC and K-
NNC where both have identical Input and Output forms.
Input : Test pattern from the user & the centroids
Output : The classified result as the pattern
belongs to Spam or Non-Spam category.
5) The main module is the integration of all the above four
stages.
Input : Training pattern and test pattern where only the
training patterns are clustered.
Output : The classified result of the test pattern and the
accuracy.
Sequence diagrams have also been drawn for stopping, stemming, the vector space model, K-means, BIRCH, NNC and K-NNC; they are not shown in the paper for the sake of brevity [8].
10. CODING
Top-level pseudo code was developed in the C language. The code developed in this research work consists of 4 modules, viz., the main module, stopping, stemming and the vector space modelling. Each module developed is explained as follows [9]:
Main Module
Step 1 : Read each word from each document.
Step 2 : If the scanned word is a stop word then remove the
stop word.
Step 3 : Perform stemming based on the rules of the stemming
algorithm.
Step 4 : Build the vocabulary and calculate the Tf-idf
Step 5 : Cluster the documents using either of the two clustering algorithms, K-means or BIRCH.
Step 6 : Classify the test document by using either NNC or K-
NNC classifiers.
Stopping
Step 1 : Check if the word in the main module is present in
the stop list of words.
Step 2 : If present, then remove the word.
Step 3 : Else do not remove.
Step 4 : Check if the data is a number or any special symbol
Step 5 : If so, remove that word.
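A minimal C sketch of the stopping module follows, using a tiny hard-coded sample stop list and also discarding tokens that are numbers or special symbols, as described in Steps 4 and 5; a real stop list would be much larger.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* a tiny sample stop list; a real one would be much larger */
static const char *stoplist[] = {"a", "an", "and", "in", "is", "of", "the", "to"};
static const int n_stop = sizeof stoplist / sizeof stoplist[0];

/* return 1 if the word should be removed (stop word, number or symbol) */
static int is_stopped(const char *word) {
    int has_alpha = 0;
    for (const char *p = word; *p; p++)
        if (isalpha((unsigned char)*p)) { has_alpha = 1; break; }
    if (!has_alpha)                     /* number or special symbol */
        return 1;
    for (int i = 0; i < n_stop; i++)
        if (strcmp(word, stoplist[i]) == 0)
            return 1;                   /* present in the stop list */
    return 0;
}

int main(void) {
    const char *tokens[] = {"the", "cheap", "42", "offer", "!!!", "is", "free"};
    int ntok = sizeof tokens / sizeof tokens[0];
    for (int i = 0; i < ntok; i++)
        if (!is_stopped(tokens[i]))
            printf("keep: %s\n", tokens[i]);   /* cheap, offer, free */
    return 0;
}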
Stemming
Step 1 : If the word is not stopped, then check if a root word
exists for that word by various rules provided by the
algorithm.
Step 2 : If a root word exists, then replace all the occurrences
of that word with the root word.
Vector space model
Step 1 : Check if the word is already present in the vocabulary
list.
Step 2 : If not, insert this word into a new node and update
the document number and frequency in the
corresponding node.
Step 3 : If the word is already present, and if it is appearing
for the first time in the document, then create a new
node with the document number and its
corresponding frequency.
Step 4 : Else if the word is appearing again in the same
document then increment the frequency.
Step 5 : Calculate the inverse document frequency (idf) for each term (word) by the formula idf = log(N/df_t), where N is the total number of documents and df_t is the number of documents in which the term occurs.
Step 6 : Calculate the Tf-idf of each word in each document by the formula Tf-idf = frequency * idf.
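Steps 5 and 6 can be sketched in C as follows for a small hard-coded term-frequency table (the counts are made up); note that the paper does not fix the base of the logarithm, and C's log(), the natural logarithm, is used here.

#include <stdio.h>
#include <math.h>

#define NDOCS  4   /* N: total number of documents (example)  */
#define NTERMS 3   /* vocabulary size (example)               */

int main(void) {
    /* tf[t][d]: frequency of term t in document d (illustrative counts) */
    int tf[NTERMS][NDOCS] = {
        {2, 0, 1, 0},   /* term 0 */
        {1, 1, 1, 1},   /* term 1 */
        {0, 0, 3, 0}    /* term 2 */
    };

    for (int t = 0; t < NTERMS; t++) {
        /* df_t: number of documents in which term t occurs */
        int df = 0;
        for (int d = 0; d < NDOCS; d++)
            if (tf[t][d] > 0) df++;

        /* Step 5: idf = log(N / df_t) */
        double idf = log((double)NDOCS / df);

        /* Step 6: Tf-idf = frequency * idf, for every document */
        for (int d = 0; d < NDOCS; d++)
            printf("term %d, doc %d: tf=%d, tf-idf=%.3f\n",
                   t, d, tf[t][d], tf[t][d] * idf);
    }
    return 0;
}

A term such as term 1, which occurs in every document, receives idf = 0 and hence a zero Tf-idf weight, which is exactly the behaviour intended by Eq. (3).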
11. TESTING OF THE DESIGNED &
DEVELOPED SOFTWARE MODULE
WITH TEST CASE SPECS
Testing is a very important process in the design and development of any software. It uncovers the bugs in the software and helps make the application a successful product. It can be done in
four different stages such as unit testing, module testing,
integration testing and system testing. A very important criterion
for testing is the data set used, i.e., corpus. The corpus used for
training and testing is the Ling Spam corpus [10]. In LingSpam,
there are four subdirectories, corresponding to 4 versions of the
corpus, viz.,
bare: Lemmatiser disabled, stop-list disabled,
lemm: Lemmatiser enabled, stop-list disabled,
lemm_stop: Lemmatiser enabled, stop-list enabled,
stop: Lemmatiser disabled, stop-list enabled,
where lemmatizing is similar to stemming and stop-list tells if
stopping is done on the content of the parts or not. Our analysis is
done with the lemm_stop subdirectory. Each one of these 4 directories contains 10 subdirectories (part 1, ..., part 10). These correspond to the 10 partitions of the corpus that were used in the 10-fold experiments. In every part, 2/3rd of the content is taken as training data and 1/3rd as the test data.
Each one of the 10 subdirectories contains both spam and
legitimate messages, one message in each file. Files whose names
have the form spmsg*.txt are spam messages. All other files are
legitimate messages. The total number of spam messages is 481
and that of legitimate messages is 2412.
The rationale for the chosen data set is that it is easy to preprocess, relatively small in terms of features, and simple (only two categories); thus it is not very demanding computationally and not very time consuming, but it is still quite illustrative and inspiring, as well as of high practical importance.
K-means
Step 1 : Select k initial centres.
Step 2 : repeat {
assign every data instance to the closest cluster based
on the distance between the data instance and the
center of the cluster
compute the new centers of the k clusters
} until(the convergence criterion is met)
TABLE III. TESTING SCHEDULE

Sequence No. | Test case | Conditions being checked | Expected Output
1 | K-means | Value of K | The larger the data size and the higher the value of K, the better the clustering
2 | K-means | Value of clusterno | The larger the data size, the higher the value of K and the higher the value of clusterno, the better the clustering
3 | BIRCH | Branching factor | The smaller the branching factor, the better the cluster quality, and hence more centroids are obtained
4 | BIRCH | Threshold | The larger the threshold, the better the cluster quality and the more centroids are obtained
5 | K-NNC | K | The larger the data and the higher the value of K, the better the classification results
TABLE IV. EVALUATION MEASURES

Measure | Test document | Classified to
True Positive | Spam | Spam
True Negative | Non-Spam | Non-Spam
False Positive | Spam | Non-Spam
False Negative | Non-Spam | Spam
BIRCH
Phase 1 : Scan all data and build an initial CF tree.
Phase 2 : Condense into desirable length by building a smaller
CF tree.
Phase 3 : Global clustering
Phase 4 : Cluster refining (optional) - requires more passes
over the data to refine the results
NNC
Step 1 : Get the centroids from the clustering module.
Step 2 : Calculate the distance between the test data and each
centroid.
Step 3 : Test data is assigned to the class associated with the
least distance from the distances calculated.
K-NNC
Step 1 : Calculate the distance of test data with respect to each
centroid.
Step 2 : Find out the K nearest neighbors from the above
calculated distances.
Step 3 : Classify the test data into the class label that holds the majority among these K nearest neighbours.
12. RESULTS AND INFERENCE
The coding was done in C; after the code was run, various performance measures such as precision, recall, specificity and accuracy were observed. The results are shown in Figs. 8 to 11 respectively.
12.1 Precision
Fig. 8 : Plot of measure of precision vs. data size
Inference : The percentage of positive predictions that are correct is high for the nearest neighbour classifiers. The precision values in Table V and the graph in Fig. 8 show that for large data sets, BIRCH with NNC and K-means with NNC have optimal values.
TABLE V. QUANTITATIVE RESULTS OF PRECISION

Data size | K-means NNC | K-means K-NNC | BIRCH NNC | BIRCH K-NNC
120 | 81.2% | 72.9% | 75% | 61.9%
200 | 91.1% | 74.4% | 88.2% | 91.6%
400 | 97.5% | 56.6% | 93.7% | 69.8%
12.2 Recall
Fig. 9 : Plot of measure of recall vs. data size
Inference : The percentage of positive labelled instances that are predicted positive is high for the combination of the K-means algorithm with K-NNC as the classifier, and the percentage increases as the data set size increases. BIRCH does not work well for smaller data sets. The recall values can be seen in Table VI, which indicates that for large data sets, BIRCH with K-NNC has a high value; this can also be observed from Fig. 9.
TABLE VI. QUANTITATIVE RESULTS OF RECALL

Data size | K-means NNC | K-means K-NNC | BIRCH NNC | BIRCH K-NNC
120 | 78.7% | 84.3% | 63.1% | 49.6%
200 | 97.6% | 97.6% | 71.4% | 52.3%
400 | 63.4% | 96.2% | 58.1% | 63.7%
12.3 Specificity
Fig. 10 : Plot of measure of specificity vs. data size
TABLE VII. QUANTITATIVE RESULTS OF SPECIFICITY

Data size | K-means NNC | K-means K-NNC | BIRCH NNC | BIRCH K-NNC
120 | 80.6% | 68.7% | 69.2% | 75%
200 | 90.4% | 66.6% | 90.4% | 95.2%
400 | 97.6% | 53.9% | 93.6% | 82.8%
Inference : The percentage of negative labelled instances that are predicted as negative is high for the combinations using NNC as the classifier. The specificity values for large data sets, as seen from Table VII, are optimal for the BIRCH with K-NNC combination; this can also be observed from Fig. 10.
12.4 Accuracy
Fig. 11 : Plot of measure of accuracy vs. data size
TABLE VIII. QUANTITATIVE RESULTS OF ACCURACY

Data size | K-means NNC | K-means K-NNC | BIRCH NNC | BIRCH K-NNC
120 | 79.68% | 76.56% | 65.62% | 57.81%
200 | 94.4% | 82.14% | 80.9% | 73.8%
400 | 77.4% | 70.19% | 71.6% | 75.48%
Inference : The accuracy for BIRCH with K-NNC reaches an optimal value as the data set size increases, while K-means works well for smaller data sets. It can be seen from the graph in Fig. 11 that the conditions being checked hold good for large data, and that BIRCH with K-NNC is the best combination as the data set size increases. BIRCH with K-NNC is thus more accurate for large data, which can also be observed from the quantitative results shown in Table VIII.
TABLE IX. COMPARISONS OF BIRCH & K-MEANS WITH DATASETS

Criterion | BIRCH | K-means
Time | Faster | Slower
Sensitivity to input pattern of dataset | Yes | No
Cluster quality (centre location, number of data points in a cluster, radii of clusters) | More accurate | Less accurate
Demand for memory | Less | More
Finally, comparisons are made between BIRCH and K-means, and their advantages and disadvantages are shown in Table IX. It is concluded that BIRCH is the better choice when large data sets are taken into consideration.
13. CONCLUSIONS
In this paper, an email clustering method is proposed and implemented to efficiently detect spam mails. The proposed technique takes into account the distance between all of the attributes of an email. The proposed technique is implemented using open source technology in the C language; the Ling-Spam corpus dataset was selected for the experiment. Different performance measures such as precision, recall, specificity and accuracy were observed. The K-means clustering algorithm works well for smaller data sets.
BIRCH with K-NNC is the best combination as it works better with large data sets. In BIRCH, clustering decisions are made using local information, i.e., each clustering decision is made without scanning all the data points. BIRCH is a better clustering algorithm because it requires only a single scan of the entire data set, thus saving time. The work presented in this paper can be further extended and can be tested with different algorithms and with large data sets of varying size.
REFERENCES
[1] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani and Liadan O'Callaghan, "Clustering Data Streams," IEEE Trans. on Knowledge & Data Engg., 2003.
[2] Enrico Blanzieri and Anton Bryl, "A Survey of Learning-Based Techniques of Email Spam Filtering," Conference on Email and Anti-Spam, 2008.
[3] Jain A.K., M.N. Murthy and P.J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, 1999.
[4] Tian Zhang, Raghu Ramakrishnan and Miron Livny, "BIRCH: An Efficient Data Clustering Method For Very Large Databases," Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1996.
[5] Porter M., "An algorithm for suffix stripping," Proc. Automated Library Information Systems, pp. 130-137, 1980.
[6] Manning C.D., P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[7] Richard O. Duda, Peter E. Hart and David G. Stork, Pattern Classification, Wiley-Interscience, 2nd Edn., Oct. 26, 2000.
[8] http://www.informationretrieval.org/
[9] http://www.aueb.gr/users/ion/publications.html
[10] http://www.cl.cam.ac.uk/users/bwm23/
[11] http://www.wikipedia.org
[12] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edn.
[13] Ajay Gupta and R. Sekar, "An Approach for Detecting Self-Propagating Email Using Anomaly Detection," Springer Berlin / Heidelberg, Vol. 2820/2003.
[14] Anagha Kulkarni and Ted Pedersen, "Name Discrimination and Email Clustering using Unsupervised Clustering and Labeling of Similar Contexts," 2nd Indian International Conference on Artificial Intelligence (IICAI-05), pp. 703-722, 2005.
[15] Bryan Klimt and Yiming Yang, "The Enron Corpus: A New Dataset for Email Classification Research," European Conference on Machine Learning, Pisa, Italy, 2004.
[16] Sahami M., S. Dumais, D. Heckerman and E. Horvitz, "A Bayesian approach to filtering junk e-mail," AAAI'98 Workshop on Learning for Text Categorization, http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf, 1998.
[17] Sculley D. and Gordon V. Cormack, "Filtering Email Spam in the Presence of Noisy User Feedback," CEAS 2008: Proc. of the Fifth Conference on Email and Anti-Spam, Aug. 2008.
[18] Dave DeBarr and Harry Wechsler, "Spam Detection using Clustering, Random Forests, and Active Learning," CEAS 2009: Sixth Conference on Email and Anti-Spam, Mountain View, California, USA, July 16-17, 2009.
[19] Manning C.D., Raghavan P. and Schütze H., "Scoring, Term Weighting, and the Vector Space Model," Introduction to Information Retrieval, Cambridge University Press, Cambridge, England, pp. 109-133, 2008.
[20] Naresh Kumar Nagwani and Ashok Bhansali, "An Object Oriented Email Clustering Model Using Weighted Similarities between Emails Attributes," International Journal of Research and Reviews in Computer Science (IJRRCS), Vol. 1, No. 2, pp. 1-6, Jun. 2010.
Mr. M. Basavaraju completed his Master's in Engineering in Electronics and Communication Engg. from the University Visvesvaraya College of Engg. (Bangalore), Bangalore University, in 1990, and his B.E. from Siddaganga Institute of Technology (Tumkur), Bangalore University, in the year 1982. He has vast teaching experience of 23 years and industrial experience of 7 years. Currently, he is working as Professor and Head of the Computer Science & Engg. Dept., Atria Institute of Technology, Bangalore, Karnataka, India. He is also a research scholar at the Coimbatore Institute of Technology, Coimbatore, doing his research work and progressing towards his Ph.D. in the computer science field from Anna University Coimbatore, India. He has conducted a number of seminars, workshops, conferences and summer courses in various fields of computer science & engineering. His research interests are Data Mining, Computer Networks and Parallel Computing.
Dr. R. Prabhakar obtained his B.Tech. degree from IIT Madras in 1969, his M.S. from Oklahoma State University, USA, and his Ph.D. from Purdue University, USA. Currently he is Professor of Computer Science and Engineering and Secretary of the Coimbatore Institute of Technology, Coimbatore, India. His areas of specialization include Control Systems, CNC Control, Robotics, Computer Graphics, Data Structures, Compilers and Optimization. He has published a number of papers in various national & international journals and conferences of high repute. He has carried out a number of projects at the national & international level. At the same time, he has guided a number of students at the UG, PG and doctoral levels.