Identification of Image-Spam Based On SIFT Image Matching Algorithm
Available at http://www.joics.com
Chundong Wang a,b , Hua Yang a,b , Yinghui Chen a,b , Li Sun a,b
Yan Zhou a,b , Huaibin Wang a,b,∗
a Key Laboratory of Computer Vision and System (Tianjin University of Technology), Ministry of
Education, Tianjin 300191, China
b Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University
of Technology, Tianjin 300191, China
Abstract
This paper proposes a method for efficient identification of image spam using a simplified Scale
Invariant Feature Transform (SIFT) algorithm. The method first strengthens the features of the image
in a spam email with Symmetric Neighborhood Filters (SNF). It then improves the SIFT algorithm by
reducing the dimension of the feature vectors from 128 to 40, which markedly reduces the time needed
to classify an image-based email as regular or spam. Simulation experiments in MATLAB 7.0
demonstrate the method's high accuracy and speed in detection.
Keywords: Image Spam; Symmetric Neighborhood Filters; Feature Extraction Based Sift Algorithm;
Image Matching
1 Introduction
Today, while email is an effective channel for internet marketing, unsolicited commercial email
(spam) has become a serious problem on the internet: it not only causes economic loss but also
threatens the sharing, security, and interactivity of the internet. Recently, the success of text
categorization techniques in email spam detection, such as the Naive Bayesian algorithm[1], has
driven spammers to explore a new variant of spam known as image-based spam email, which is
generated by embedding the text content into images. Worse still, to help image-based spam escape
text-based spam filters, spammers use various obfuscation techniques to produce many variations
from a small number of source spam images, such as adding random noise or straight lines to the
image, or changing only the background of the image while keeping the same text.
⋆
This work is supported by the "863" project plan of China (No. 2007AA01Z450 and No. 2007AA01Z188),
Tianjin Science and Technology Innovation Special Fund (10FDZDGX00400), and Education Science and
Technology Foundation of Tianjin (SB20080053, SB20080055 & 20080805).
∗
Corresponding author.
Email addresses: [email protected] (Chundong Wang), [email protected] (Huaibin Wang).
Consequently, different recognition approaches have been put forward to strengthen anti-spam
detectors. Some early works, such as H. B. Aradhye et al.[2], applied Bayesian classification to
images. To apply classifiers based on the text content of an image, the text area must first be
extracted. Unfortunately, the background of some spam images is complex, with color transitions
and vague edges that can hinder extraction. More recent works focus on identifying image-based
spam directly from image features, for example with image matching techniques. One popular method
for image matching is the SIFT (Scale Invariant Feature Transform) algorithm[8-11]. Notwithstanding
its demonstrated success in distinguishing image-based spam email[3], its precision is not high
and its time cost is not low, owing to the high dimension of the SIFT feature vectors. In addition,
most matching points of the traditional SIFT algorithm are dispersed randomly over the image, so
for spam images whose background has been changed the precision is even lower.
We propose a method for categorizing image spam email using a simplified SIFT algorithm, with
attention paid to reducing the time cost and improving the precision of detecting image-based spam
emails. A circular neighborhood[10] is designed to reduce the dimension of the feature vectors and
thereby save time, and SNF (Symmetric Neighborhood Filters)[4], an image enhancement method, is
used as a preprocessing step so that more matching feature points gather on the text area.
In Sec. 2, we present the flow chart of the proposed method. SNF and the traditional SIFT
methodology are introduced in Sec. 3. We discuss the simplified SIFT algorithm in Sec. 4.
Experiments and results are analyzed in Sec. 5. Finally, we conclude in Sec. 6.
step, simplified SIFT is prepared for feature extraction which is vital to the whole experiment.
All processing is carried out in scale space, to which the image is first converted. The final goal
of the SIFT algorithm is to determine the best feature vectors for the similarity measure and to
fix the dimension of those feature vectors. We design a circular neighborhood in place of the
square neighborhood used by the traditional SIFT algorithm and thereby reduce the dimension from
128 to 40. In this paper, we use the distance-ratio[11] criterion as the similarity measure, and
according to a threshold we judge whether an image email is spam.
(1) First, traverse the whole image to calculate the standard deviations between the core points
and their corresponding symmetric neighborhood points, together with the median of those
standard deviations, denoted $\delta$. The threshold $T$ is then defined as $T = k\delta$,
where $k \le 2$.
(2) The color value of each filtering element is determined by the color distance in HSV space.
For two points $x_1$, $x_2$, the color distance from $x_1$ to $x_2$ is defined as:
$$Cdist_{x_1 \to x_2} = |v_1 - v_2| + |v_1 s_1 \cos(h_1) - v_2 s_2 \cos(h_2)| + |v_1 s_1 \sin(h_1) - v_2 s_2 \sin(h_2)| \quad (1)$$
Here $v_p$, $s_p$, $h_p$, $p = 1, 2$, are the value (brightness), saturation, and hue of the
color, respectively.
The color value of a filtering element is a vector $vl = (v, s, h)$. $vl_i(k) = (v_i, s_i, h_i)$,
$i = 1, 2, 3, 4$, and $vl_c(k) = (v_c, s_c, h_c)$ are the color values of the $i$-th filtering
element and of the core point, respectively; $k$ is the iteration count. In each filtering
element, compute the color distances from the two symmetric neighborhood points to the core
point and keep the point whose distance is shorter, denoted "mind".
For $l_1$, assume $Cdist_{N \to C} < Cdist_{S \to C}$, so that mind = North, i.e. N. If
$\min(Cdist_{S \to C}, Cdist_{N \to C}) < T$, then $vl_1(k+1) = vl_{mind}(k) = vl_N(k)$;
otherwise, if $\min(Cdist_{S \to C}, Cdist_{N \to C}) \ge T$ or
$Cdist_{S \to C} = Cdist_{N \to C}$, then $vl_1(k+1) = vl_c(k)$. Process all the filtering
elements in this way to obtain $vl_i(k+1)$, $i = 1, 2, 3, 4$, then go to step (3).
(3) Update the core value: $vl_c(k+1) = (\overline{vl}(k) + vl_c(k))/2$, where $\overline{vl}(k)$
is the mean value of the $vl_i$, that is $\overline{vl}(k) = \left(\sum_{i=1}^{4} vl_i(k)\right)/4$.
The iterative process continues until 80% of the points of the image are invariant.
3156 C. Wang et al. /Journal of Information & Computational Science 7: 14 (2010) 3153–3160
Table 1: The 3 × 3 template of SNF. NW, N, etc. are short for the corresponding directions.
northwest (NW)    north (N)    northeast (NE)
west (W)          core (C)     east (E)
southwest (SW)    south (S)    southeast (SE)
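The SNF update for one core point can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: it assumes colors are given as (v, s, h) tuples with the hue h in radians, that the nearer neighbor is kept only when its distance to the core is below T, and the function names are hypothetical.

```python
import math

def color_distance(p1, p2):
    """Color distance of Eq. (1) between two HSV colors given as (v, s, h)."""
    v1, s1, h1 = p1
    v2, s2, h2 = p2
    return (abs(v1 - v2)
            + abs(v1 * s1 * math.cos(h1) - v2 * s2 * math.cos(h2))
            + abs(v1 * s1 * math.sin(h1) - v2 * s2 * math.sin(h2)))

def filter_element(core, a, b, threshold):
    """Value of one filtering element from a symmetric pair (a, b), e.g. N and S."""
    da = color_distance(a, core)
    db = color_distance(b, core)
    # Fall back to the core when the pair is tied or neither point is close enough.
    if da == db or min(da, db) >= threshold:
        return core
    return a if da < db else b

def update_core(core, elements):
    """Step (3): average the core with the mean of the four filtering elements."""
    mean = [sum(e[i] for e in elements) / len(elements) for i in range(3)]
    return tuple((m + c) / 2 for m, c in zip(mean, core))
```

With the template of Table 1, the four symmetric pairs would be (N, S), (W, E), (NW, SE), and (NE, SW); `filter_element` is called once per pair and `update_core` once per iteration.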
To keep the descriptor invariant for rotated images, the coordinate axes are rotated to the
direction of the feature point. The gradient magnitude and direction of a feature point are
determined as follows:
$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$$
$$\theta(x, y) = \tan^{-1}\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)} \quad (5)$$
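The gradient computation can be illustrated with a short sketch based on central differences. It assumes the smoothed image L is a nested list indexed as L[x][y] and that (x, y) is not on the border; it uses `math.atan2` in place of a plain arctangent so that a zero horizontal difference does not cause a division by zero, which is a choice of this sketch rather than something stated in the paper.

```python
import math

def gradient(L, x, y):
    """Return (magnitude, direction) of the gradient at interior pixel (x, y)."""
    dx = L[x + 1][y] - L[x - 1][y]   # horizontal central difference
    dy = L[x][y + 1] - L[x][y - 1]   # vertical central difference
    magnitude = math.sqrt(dx * dx + dy * dy)
    direction = math.atan2(dy, dx)   # radians, in (-pi, pi]
    return magnitude, direction
```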
The next step is to form the vectors that describe the feature points, called feature vectors or
feature descriptors. Every key point has exactly one feature vector, formed from the gradient
statistics of the points in a specific neighborhood of the feature point. The dimension of the
feature vectors is vital for reducing the impact of gradient values and illumination variations.
The traditional SIFT algorithm uses a square neighborhood to build the feature vectors, which
gives 128 dimensions and is too costly in time. Owing to the page limit, we do not describe the
square neighborhood method here.
(1) With the feature point as the center, take a circular area of radius 8 and divide it, in
units of 2, into 4 concentric rings. In each of the 4 rings, we select the following 10
directions and compute their accumulated gradient values: 0°, 36°, 72°, 108°, 144°, 180°,
216°, 252°, 288°, 324°.
(2) From the 4 rings, taken from inside to outside, we obtain four 10-dimensional vectors
$T_1, T_2, T_3, T_4$, so the primary feature vector is $T = (T_1, T_2, T_3, T_4)$, where
$T_i = (t_{i1}, t_{i2}, \ldots, t_{i10})$, $i = 1, 2, 3, 4$. The components of
$T_1, T_2, T_3, T_4$ are then cyclically shifted left until the largest component of $T_1$,
denoted $t_{1max}$, comes first, i.e. $t_{11} = t_{1max}$, so that each $T_i$ becomes:
$$T_i = (t_{i,max}, t_{i,max+1}, \ldots, t_{i,10}, t_{i,1}, t_{i,2}, \ldots, t_{i,max-1}), \quad i = 1, 2, 3, 4$$
The new feature vector is expressed as $T' = (t'_1, \ldots, t'_{40})$.
(3) To resist the effect of illumination, the new feature vector $T'$ is normalized:
$\overline{T'} = T'/\|T'\|$, where $\|T'\|$ is its modulus (length). The components of
$\overline{T'}$ that are less than 2 are set to 2.
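The cyclic shift and normalization of steps (2) and (3) can be sketched as below. The sketch assumes the four ring histograms have already been accumulated (10 values per ring, innermost ring first) and that every ring is shifted by the same offset, given by the position of the largest component of the innermost ring; the function name is hypothetical, and the final component adjustment of step (3) is left out.

```python
import math

def build_descriptor(rings):
    """Turn four 10-bin ring histograms into a normalized 40-dimensional vector."""
    # Offset of the largest component of the innermost ring T1.
    shift = rings[0].index(max(rings[0]))
    # Apply the same left cyclic shift to every ring, then concatenate.
    shifted = [ring[shift:] + ring[:shift] for ring in rings]
    flat = [value for ring in shifted for value in ring]
    norm = math.sqrt(sum(value * value for value in flat))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [value / norm for value in flat] if norm > 0 else flat
```

The result has 4 × 10 = 40 components, compared with the 128 of the traditional square-neighborhood descriptor.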
5 Experiments
5.1 Data Set
The simulation experiments were performed in MATLAB 7.0 under the Windows XP operating system.
We collected 300 images from email boxes, comprising 200 normal images and 100 spam images; the
spam images made up the spam sample set. The samples under test were drawn from the same images.
All images were preprocessed by SNF image enhancement, and feature points were extracted by the
simplified SIFT algorithm.
na and nb are the numbers of their feature points. We used the distance-ratio criterion to decide
whether two points match:
$$ratio = d_1/d_2, \qquad \begin{cases} \text{success}, & ratio < \varepsilon \\ \text{failure}, & \text{otherwise} \end{cases} \quad (7)$$
Here, for a point in Fa, d1 is the Euclidean distance between this point and its nearest point in
Fb, and d2 is the distance to the second nearest one. The threshold ε is set to 0.44.
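The distance-ratio test can be sketched as follows. The sketch assumes, as in the usual form of this test, that a match is accepted when d1/d2 falls below ε, and it finds the two nearest neighbors by brute force rather than with an index and BBF search; the function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ratio_match(point, candidates, eps=0.44):
    """Return (index, d1) of the accepted match among candidates, or None."""
    dists = sorted((euclidean(point, c), i) for i, c in enumerate(candidates))
    if len(dists) < 2:
        return None               # the test needs a second nearest neighbor
    (d1, nearest), (d2, _) = dists[0], dists[1]
    if d2 == 0:
        return None               # two identical candidates: ambiguous match
    return (nearest, d1) if d1 / d2 < eps else None
```

A small ratio means the nearest neighbor is much closer than the runner-up, so the match is distinctive; a ratio near 1 means the two best candidates are almost equally good and the match is rejected.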
We built an index, named B-Tree, over all feature points in Fb. Then, for every point Ti of Fa,
the two nearest distances d1, d2 to the points in B-Tree were computed by formula (6) using the
BBF[4] search algorithm. This process is repeated, and the points in Fa and Fb that match
successfully according to formula (7) are stored separately. The similarity measure of two images
is defined as:
$$P(F_a, F_b) = \frac{m}{\varepsilon + \sum_{i=1}^{m} d(T_i)}$$
$$Sim(F_a, F_b) = \frac{P(F_a, F_b)}{P(F_a, F_a)} \quad (8)$$
Here, m is the number of points successfully matched between Fa and Fb, and d(Ti) is the
Euclidean distance between two successfully matched points. Sim(Fa, Fb) measures the similarity:
the closer it is to 1, the higher the similarity. In this paper, we set the threshold on
Sim(Fa, Fb) to 0.5, i.e. if Sim(Fa, Fb) exceeds 0.5, the image is judged to be a spam image,
otherwise a normal image.
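The similarity measure of formula (8) and the final decision can be illustrated as below. The sketch assumes that ε in formula (8) is the same 0.44 used in formula (7), and that the caller supplies the distances of the successful matches for both the Fa-to-Fb and the Fa-to-Fa comparison; the function names are hypothetical.

```python
def p_score(match_distances, eps=0.44):
    """P of formula (8): number of matches over eps plus their summed distances."""
    return len(match_distances) / (eps + sum(match_distances))

def similarity(dist_ab, dist_aa, eps=0.44):
    """Sim(Fa, Fb): P(Fa, Fb) normalized by the self-match score P(Fa, Fa)."""
    return p_score(dist_ab, eps) / p_score(dist_aa, eps)

def is_spam(sim, threshold=0.5):
    """An image is judged as spam when its similarity exceeds the threshold."""
    return sim > threshold
```

For example, if Fa has four feature points that all match themselves at distance 0, and three of them match points in Fb at distance 0.1 each, the similarity is about 0.45 and the image is judged normal.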
Fig. 2: The recall and precision rate under different dimensions (horizontal axis: dimensions).
• is recall rate, ∗ is precision rate.
Recall and precision are two common performance measures for spam detection. Let A be the number
of spam emails judged as spam, B the number of normal emails judged as spam, C the number of spam
emails judged as normal, and D the number of normal emails judged as normal. Then the recall rate
is defined as $\frac{A}{A+C} \times 100\%$ and the precision rate as $\frac{A}{A+B} \times 100\%$.
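As a small illustration of these definitions, the two rates can be computed directly from the counts A, B, and C. The function name is hypothetical, and the counts in the usage note are invented for illustration only, not taken from the experiments.

```python
def recall_precision(a, b, c):
    """Return (recall, precision) in percent.

    a: spam judged as spam, b: normal judged as spam, c: spam judged as normal.
    """
    recall = 100.0 * a / (a + c)
    precision = 100.0 * a / (a + b)
    return recall, precision
```

With the hypothetical counts A = 80, B = 10, C = 20, the recall is 80% and the precision is about 88.9%. The count D does not enter either rate.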
We selected 20 images to test the effect of the simplified SIFT algorithm under different
dimensions of the feature vectors, namely 6, 8, 10, and 12. From Fig. 2 we can see that, as the
dimension increased, the recall rate decreased while the precision rate increased. The reason is
that with more dimensions, more image information is available for matching. Recall and precision
thus pull in opposite directions; at a dimension of 10 × 4 both were acceptable, so we chose 40
dimensions for the feature vectors.
Fig. 3: The recall and precision rate under scale-changing and rotating (horizontal axis:
precision; vertical axis: recall).
To compare the simplified SIFT algorithm with the traditional SIFT algorithm in detecting spam
images under scale change, rotation, and added noise, we selected 20 images that were scaled by
50% or rotated by 45 degrees, and had noise added, for the experiments. During the process the
feature dimension was 40, and the resulting curves are shown in Fig. 3. Tab. 3 shows the time cost
and the number of feature points for 5 images.
From Fig. 3 we can see that the two curves are close. Although the simplified SIFT algorithm
performed slightly worse, it could still detect image spam effectively. Furthermore, Tab. 2 shows
that, although the traditional SIFT algorithm produced more feature points than the simplified
one, the time cost of the simplified algorithm was clearly lower. The advantage of the simplified
SIFT algorithm in time cost is evident: it took about 3.43 seconds on average to detect an image
spam, compared with 6.27 seconds for the traditional SIFT algorithm. From this perspective, it is
worth sacrificing a little recall and precision to save detection time.
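As a worked check on the two reported averages, the speed-up factor and the relative saving in detection time follow directly from 6.27 s and 3.43 s:

```latex
\frac{6.27}{3.43} \approx 1.83,
\qquad
\frac{6.27 - 3.43}{6.27} = \frac{2.84}{6.27} \approx 45.3\%
```

That is, on these figures the simplified algorithm needs a little over half the time of the traditional one per image.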
6 Conclusion
The features extracted by SIFT remain invariant under scale change, rotation, and brightness
change, so the SIFT algorithm is widely used in image matching. Moreover, owing to its stability
under perspective changes, affine changes, and noise, it is popular in image-based spam email
filtering. In this paper, to meet the special requirements of spam filtering on the user side, we
improved the traditional SIFT algorithm by reducing the dimension of the feature vector and
applied it to detecting spam images. The experimental results demonstrated that the proposed SIFT
algorithm greatly reduces the time cost; although its detection performance is slightly lower, the
trade-off is worthwhile.
References
[1] Bin Chen, Shoubin Dong. An Optimized Spam Filtering Method Based on AODE. Journal of
Information and Computational Science, 2009, 6(2):749-756
[2] H.B.Aradhye, G.K.Myers, J.A.Herson, Image analysis for efficient categorization of image-based
spam e-mail, Proc. 8th Int.Conf. on Document Analysis and Recognition,2005,pp.914-91
[3] Chen junwei, Zhang lichun, Lu yue.A spam image filtering system based on user-specified image
content[J]. CAAI Transactions on Intelligent Systems,Nov.2008,3(5):416-412
[4] Liu Xingxing, Wang Zengfu. An algorithm for text extraction in complex color image[J]. PR&AI,
Dec.2006, 19(6):771-775
[5] Zhang Qi, Cao Qi, Bi Duyan, et al.An improved enhancement approach based on anisotropy
differential used for SAR image[J].Journal of OptoelectronicsLaser, Apr.2010,21(4):614-617
[6] Liu Shangping, Chen Ji.Enhancement method for retinal images based on Gabor filter and mor-
phology[J], Journal of OptoelectronicsLaser,Feb.2010,21(2):318-322
[7] Gui Zhiguo, Zhang Pengcheng.Enhancement algorithm for X-ray images based on neighborhood
related information[J].Computer Engineering and Applications, 2010,46(21):175-177
[8] Jia Shijie, Wang Pengxiang, Jiang Haiyang, Zeng Jie.Study of imge matching algorithm based on
SIFT[J].Journal of DaLian Jiao Tong University, Aug.2010,31(2):17-21
[9] Zhang Chunmei, Gong Zhihui, Sun Lei.Improved SIFT feature applied in image matching[J].Computer
Engineering and Applications,2008,44(2):95-97
[10] Wu Huilan, Liu Guodong, Liu Bingguo, et al.Study on the circle center fast accurate locating
technique based on the SIFT[J]. Journal of OptoelectronicsLaser,Nov.2008,19(11):1512-1515
[11] Feng Jia,The research and improvement of SIFT algorithm[D], CNKI, 2010