Journal of Information & Computational Science 7: 14 (2010) 3153–3160

Available at http://www.joics.com

Identification of Image-spam Based on SIFT Image Matching Algorithm ⋆

Chundong Wang a,b , Hua Yang a,b , Yinghui Chen a,b , Li Sun a,b
Yan Zhou a,b , Huaibin Wang a,b,∗
a Key Laboratory of Computer Vision and System (Tianjin University of Technology), Ministry of
Education, Tianjin 300191, China
b Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University
of Technology, Tianjin 300191, China

Abstract

This paper proposes a method for efficient identification of image spam using a simplified Scale
Invariant Feature Transform (SIFT) algorithm. The method first strengthens the features of the
picture in an image-spam email with Symmetric Neighborhood Filters (SNF). In addition, we improve
the SIFT algorithm by reducing the dimension of the feature vectors from 128 to 40, which noticeably
reduces the time needed to identify a regular email or an image-based spam email. Simulation
experiments in MATLAB 7.0 demonstrate the method's high accuracy and speed in detection.

Keywords: Image Spam; Symmetric Neighborhood Filters; Feature Extraction Based on SIFT Algorithm;
Image Matching

1 Introduction
Today, while email is an effective channel for internet marketing, unsolicited commercial email
(spam) has become a serious problem on the internet: it causes economic loss and threatens the
sharing, security, and interactivity of the internet. Recently, the success of text categorization
techniques in email spam detection, such as the Naive Bayesian algorithm [1], has driven spammers
to explore a new variation of spam known as image-based spam email, which is generated by
embedding the text content into images. Worse still, to help image-based spam emails escape
text-based spam filters, spammers use various obfuscation techniques to produce many variations
from a small number of source spam images, such as adding random noise or straight lines to the
spam image, or changing only the background of the image while keeping the same text.

⋆ This work is supported by the "863" project plan of China (No. 2007AA01Z450 and No. 2007AA01Z188),
Tianjin Science and Technology Innovation Special Fund (10FDZDGX00400), and Education Science and
Technology Foundation of Tianjin (SB20080053, SB20080055 & 20080805).
∗ Corresponding author.
Email addresses: [email protected] (Chundong Wang), [email protected] (Huaibin Wang).

1548–7741/ Copyright © 2010 Binary Information Press
December 2010

Consequently, different recognition approaches have been put forward to strengthen anti-spam
detectors. Some early work, such as that of H. B. Aradhye et al. [2], applied Bayesian
classification to images. To employ classifiers based on the text content of an image, the text
area must first be extracted. Unfortunately, the background of some spam images is complex, with
color transitions and vague edges that can hinder extraction. More recent work focuses on
identifying image-based spam directly from image features, for example with image matching
techniques. One popular method for image matching is the SIFT (Scale Invariant Feature Transform)
algorithm [8-11]. Notwithstanding its demonstrated success in distinguishing image-based spam
email [3], its precision is not high and its time cost is not low, owing to the high dimension of
SIFT feature vectors. In addition, most matching points of the traditional SIFT algorithm are
dispersed randomly over the image, so precision is even lower for image spam whose background
has been changed.
We propose a method for categorizing image-spam email using a simplified SIFT algorithm, with
attention to reducing time cost and improving the precision of detecting image-based spam emails.
A circular neighborhood [10] is designed to reduce the dimension of the feature vectors and thus
save time, and SNF (Symmetric Neighborhood Filters) [4], an image enhancement method, is used as
a preprocessing step so that more matching feature points gather on the text area.
In Sec.2, we will present the flow chart of the proposed method. Then SNF and traditional
SIFT methodology are introduced in Sec.3. We discuss the simplified SIFT algorithm in Sec.4.
Experiments and results are analyzed in Sec.5. We finally conclude in Sec.6.

Fig. 1: Flow chart of image spam recognition by SIFT algorithm analysis

2 Simplified SIFT for image-based Spam Email Detection


Figure 1 presents the whole flow chart of image-based spam recognition by the simplified SIFT
algorithm. Because some image spam differs in background but shares the same embedded text, all
collected spam images are preprocessed by SNF image enhancement [4-7] to strengthen the features
of the text area, so that more matching points gather around the text area instead of the
background. Moreover, the possible problems mentioned previously (noise, color transitions, and
vague edges) can be partly reduced by SNF. After the preprocessing step, the simplified SIFT
algorithm performs feature extraction, which is vital to the whole experiment. All of this work is
carried out in scale space, to which the image must first be converted. The final goal of the SIFT
algorithm is to determine the best feature vectors for the similarity measure and to fix the
dimension of the feature vectors. We design a circular neighborhood in place of the square
neighborhood used by the traditional SIFT algorithm and thereby reduce the dimension from 128 to
40. In this paper, we use the distance-ratio [11] criterion for the similarity measure and,
according to a threshold, judge whether an image email is spam.

3 SNF Image Enhancement and SIFT Algorithm

3.1 SNF Image Enhancement


To preserve the authentic edges of text and smooth the interior disturbance points of
monochromatic target areas, SNF [4] applies an iterative operation within a 3 × 3 template, as
shown in Tab. 1. A filtering element is composed of the core point and one pair of symmetric
neighborhood points. From Tab. 1 there are four filtering elements in all, denoted l1, l2, l3, l4,
whose symmetric neighborhood points are (N, S), (W, E), (NW, SE), and (NE, SW), respectively.
The iterative operation is carried out on each of the four filtering elements. For example, for
l1, composed of C and (N, S), the process is as follows:

(1) First, traverse the whole image to calculate the standard deviations between core points and
their corresponding symmetric neighborhood points, and take the median of those standard
deviations, denoted δ. The threshold T is then defined as T = kδ, where k ≤ 2.

(2) The color value of a filtering element is determined by color distance in HSV space. For two
points x1, x2, the color distance from x1 to x2 is defined as:

$$\mathrm{Cdist}_{x_1 \to x_2} = |v_1 - v_2| + |v_1 s_1 \cos(h_1) - v_2 s_2 \cos(h_2)| + |v_1 s_1 \sin(h_1) - v_2 s_2 \sin(h_2)| \quad (1)$$

Here $v_p$, $s_p$, $h_p$, p = 1, 2, denote the value (brightness), saturation, and hue of point
$x_p$, respectively.
The color value of a filtering element is a vector, written vl = (v, s, h). $vl_i(k) = (v_i, s_i, h_i)$,
i = 1, 2, 3, 4, and $vl_c(k) = (v_c, s_c, h_c)$ are the color values of the i-th filtering element
and of the core point, respectively, where k is the iteration index. Within a filtering element,
compute the color distances from the two symmetric neighborhood points to the core point and keep
the point whose distance is shorter, denoted "mind".
For l1, assume $\mathrm{Cdist}_{N \to C} < \mathrm{Cdist}_{S \to C}$; then mind = N. If
$\min(\mathrm{Cdist}_{S \to C}, \mathrm{Cdist}_{N \to C}) < T$, then $vl_1(k+1) = vl_{mind}(k) = vl_N(k)$;
otherwise, if $\min(\mathrm{Cdist}_{S \to C}, \mathrm{Cdist}_{N \to C}) \ge T$ or
$\mathrm{Cdist}_{S \to C} = \mathrm{Cdist}_{N \to C}$, then $vl_1(k+1) = vl_c(k)$. Handle all the
filtering elements in this way to derive $vl_i(k+1)$, i = 1, 2, 3, 4, then go to step (3).

(3) Update the value of the core point: $vl_c(k+1) = (\overline{vl}(k) + vl_c(k))/2$, where
$\overline{vl}(k)$ is the mean value of the $vl_i$, that is,
$\overline{vl}(k) = \left(\sum_{i=1}^{4} vl_i(k)\right)/4$. The iterative process continues until
80% of the points of the image are invariant.

Table 1: Template of SNF (3 × 3). NW, N, etc. are abbreviations of the corresponding directions.

northwest (NW)    north (N)    northeast (NE)
west (W)          core (C)     east (E)
southwest (SW)    south (S)    southeast (SE)
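
The SNF update for a single pixel can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB implementation: the function names are hypothetical, colors are assumed to be (v, s, h) tuples with the hue in radians, and a neighbor is assumed to replace the core only when its color distance is below the threshold T.

```python
import math

def cdist(p, q):
    """Color distance of Eq. (1) between two points given as (v, s, h), h in radians."""
    v1, s1, h1 = p
    v2, s2, h2 = q
    return (abs(v1 - v2)
            + abs(v1 * s1 * math.cos(h1) - v2 * s2 * math.cos(h2))
            + abs(v1 * s1 * math.sin(h1) - v2 * s2 * math.sin(h2)))

def filter_element(core, a, b, T):
    """Value of one filtering element: the nearer neighbor if it lies within T, else the core."""
    da, db = cdist(a, core), cdist(b, core)
    if da == db or min(da, db) >= T:
        return core          # tie, or both neighbors too far from the core
    return a if da < db else b

def update_core(core, pairs, T):
    """One iteration for one pixel: mean of the four element values, averaged with the core."""
    elems = [filter_element(core, a, b, T) for a, b in pairs]
    mean = tuple(sum(e[j] for e in elems) / len(elems) for j in range(3))
    return tuple((mean[j] + core[j]) / 2 for j in range(3))
```

A full filter would call `update_core` for every pixel with its four symmetric pairs and repeat until 80% of the points stop changing.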

3.2 SIFT Algorithm


The SIFT algorithm, put forth by David G. Lowe in 1999, is widely used in image matching
because the image features it extracts are robust to noise, scale change, and illumination
variations.
Assume I(x, y) is the image function. The Gaussian pyramid of scale space is built by the
convolution operation:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \quad (2)$$

where $G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2+y^2)/2\sigma^2}$ is the Gaussian kernel
function. $L(x, y, \sigma)$ is the scale-space function, where σ is the scale factor, which is
multiplied by a constant k from one layer to the next. The DoG (difference-of-Gaussian) pyramid is
further constructed:

$$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma) \quad (3)$$
We select 4 octaves of the Gaussian pyramid, with 5 layers in every octave. In DoG space, for
every layer except the bottom and top ones, we compare each value with its 8 neighboring pixels in
the same layer and the 9 neighboring pixels in each of its two adjacent layers, i.e., 26 pixels in
all, so that all extreme values are found.
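
The 26-neighbor comparison can be sketched as follows. This is only an illustration under stated assumptions: the DoG stack of one octave is assumed to be a nested list indexed as `dog[layer][row][column]`, the function names are hypothetical, and a point is assumed to count as an extremum only when it is strictly larger or strictly smaller than all of its neighbors.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or strictly below all 26 neighbors."""
    center = dog[s][y][x]
    neighbors = [dog[s + ds][y + dy][x + dx]
                 for ds in (-1, 0, 1)      # layer below, same layer, layer above
                 for dy in (-1, 0, 1)
                 for dx in (-1, 0, 1)
                 if (ds, dy, dx) != (0, 0, 0)]
    return all(center > n for n in neighbors) or all(center < n for n in neighbors)

def find_extrema(dog):
    """Scan interior layers and interior pixels; bottom and top layers are skipped."""
    found = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                if is_extremum(dog, s, y, x):
                    found.append((s, y, x))
    return found
```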
The feature points are selected from the set of extreme values. To improve the robustness of the
algorithm, we follow the selection criteria of [11], discarding low-contrast points and unstable
points on edges, and retaining only the points that satisfy the following conditions:
$|D(X_{max})| \ge 0.03$ and $\mathrm{ratio} \le (r+1)^2/r$, where $D(X_{max})$ is the quadratic fit
of the Taylor expansion at the extreme point, and r (set to 10 here, following [11]) is the ratio of
the larger eigenvalue to the smaller eigenvalue of the Hessian matrix H:

$$D(X_{max}) = D + \frac{1}{2}\frac{\partial D^T}{\partial X} X_{max}, \qquad H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{yx} & D_{yy} \end{bmatrix} \quad (4)$$
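
The two selection conditions can be sketched as a small predicate. The text does not define "ratio" explicitly; the sketch assumes Lowe's formulation, in which ratio = Tr(H)²/Det(H), and also assumes $D_{xy} = D_{yx}$. The function and parameter names are hypothetical.

```python
def keep_keypoint(d_refined, dxx, dxy, dyy, contrast_min=0.03, r=10.0):
    """Return True if a candidate passes the contrast test and the edge test.

    d_refined is the fitted value D(X_max); dxx, dxy, dyy are Hessian entries.
    """
    if abs(d_refined) < contrast_min:
        return False                      # low-contrast point
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False                      # curvatures of opposite sign: not a stable extremum
    return trace * trace / det <= (r + 1) ** 2 / r
```

With r = 10 the bound $(r+1)^2/r$ equals 12.1, so a point whose principal curvatures differ by more than a factor of 10 is rejected as an edge response.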

To maintain invariance to image rotation, the coordinate axes are rotated to the direction of each
feature point. The gradient magnitude and direction of feature points are determined as follows:

$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$$
$$\theta(x, y) = \tan^{-1}\big((L(x, y+1) - L(x, y-1))/(L(x+1, y) - L(x-1, y))\big) \quad (5)$$
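
A minimal sketch of the gradient computation of Eq. (5) follows. It assumes the smoothed image L is a nested list indexed as `L[y][x]`, and it uses `math.atan2` in place of the plain arctangent of the quotient so that a zero horizontal difference does not cause a division by zero and the direction covers the full 0° to 360° range; both choices are assumptions of this illustration.

```python
import math

def gradient(L, x, y):
    """Gradient magnitude and direction (degrees in [0, 360)) at an interior pixel."""
    dx = L[y][x + 1] - L[y][x - 1]        # L(x+1, y) - L(x-1, y)
    dy = L[y + 1][x] - L[y - 1][x]        # L(x, y+1) - L(x, y-1)
    magnitude = math.hypot(dx, dy)
    direction = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, direction
```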
The next step is to form the vectors that describe the feature points, called feature vectors or
feature descriptors. Every key point has exactly one feature vector, formed from gradient
statistics of the points in a particular neighborhood of the feature point. The dimension of the
feature vectors is vital for decreasing the impact of gradient values and illumination variations.
The traditional SIFT algorithm uses a square neighborhood to determine the feature vectors, which
yields 128 dimensions, too many for a low time cost. Owing to the page limit, we do not describe
the square neighborhood method here.

4 Simplified SIFT Algorithm


For the simplified SIFT algorithm, the first three steps are the same as in the traditional SIFT
algorithm; the difference is that the dimension is determined by a specially designed circular
neighborhood. The method is as follows:

(1) With the feature point as the center, take a circular area of radius 8 and divide it, in
steps of 2, into 4 concentric rings. In each of the 4 concentric areas, we select the following
10 directions and compute their accumulated gradient values: 0°, 36°, 72°, 108°, 144°, 180°,
216°, 252°, 288°, 324°.
(2) From the 4 rings, taken from inside to outside, we obtain one 10-dimensional vector per ring,
denoted $T_1, T_2, T_3, T_4$, so the primary feature vector is $T = (T_1, T_2, T_3, T_4)$,
where $T_i = (t_{i1}, t_{i2}, \ldots, t_{i10})$, i = 1, 2, 3, 4. The components of
$T_1, T_2, T_3, T_4$ are then cyclically shifted left until the maximum component of $T_1$,
denoted $t_{1max}$, comes first, i.e., $t_{11} = t_{1max}$, so $T_i$ is converted to:

$$T_i = (t_{i,max}, t_{i,max+1}, \ldots, t_{i10}, t_{i1}, t_{i2}, \ldots, t_{i,max-1}), \quad i = 1, 2, 3, 4$$

The new feature vector is expressed as $T' = (t'_1, \ldots, t'_{40})$.
(3) To resist the effect of illumination, the new feature vector $T'$ is normalized:
$T' \leftarrow T'/\|T'\|$, where $\|T'\|$ is its modulus (length). Components of the normalized
$T'$ that are greater than 0.2 are set to 0.2.
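
Steps (2) and the normalization in step (3) can be sketched as below, assuming the accumulated gradient values of step (1) are already available as four lists of ten numbers. The function name is hypothetical, the component thresholding of step (3) is left out, and the sketch assumes that every ring is shifted by the same amount, namely the position of the largest component of $T_1$.

```python
import math

def build_descriptor(rings):
    """rings: 4 lists of 10 accumulated gradient values, ordered from inner to outer ring."""
    shift = rings[0].index(max(rings[0]))                       # position of t_1max in T1
    rotated = [ring[shift:] + ring[:shift] for ring in rings]   # same cyclic left shift per ring
    flat = [value for ring in rotated for value in ring]        # 4 x 10 = 40 components
    norm = math.sqrt(sum(v * v for v in flat))
    if norm == 0:
        return flat                                             # all-zero patch: nothing to scale
    return [v / norm for v in flat]
```

Because the shift is tied to the dominant direction of the innermost ring, rotating the input rings by any number of direction bins leaves the descriptor unchanged as long as that maximum is unique.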

5 Experiments
5.1 Data Set
The simulation experiments were performed in MATLAB 7.0 under the Windows XP operating system.
We collected 300 images from email boxes, including 200 normal images and 100 spam images; the
spam images composed the spam sample set. The samples under test were also drawn from those
images. All the images were preprocessed by SNF image enhancement, and feature points were
extracted by the simplified SIFT algorithm.

5.2 Similarity Measures


The similarity between two feature points was measured by Euclidean distance. For example,
assume $F_a = \{T^a_1, \ldots, T^a_{n_a}\}$ is the feature point set of an image under test and
$F_b = \{T^b_1, \ldots, T^b_{n_b}\}$ is the feature point set of an image in the spam sample set,
where $n_a$ and $n_b$ are the numbers of feature points in the two sets. When the dimension of the
feature vectors $T^a_i$, $T^b_i$ is k, with $T^a_i = (t^a_{i1}, \ldots, t^a_{ik})$ and
$T^b_i = (t^b_{i1}, \ldots, t^b_{ik})$, the Euclidean distance between two feature points is
defined as:

$$d(T^a_i, T^b_i) = \sqrt{\sum_{j=1}^{k} (t^a_{ij} - t^b_{ij})^2} \quad (6)$$

We used the distance-ratio criterion to decide whether two points matched successfully:

$$\mathrm{ratio} = d_1/d_2, \qquad \begin{cases} \text{success}, & \mathrm{ratio} < \varepsilon \\ \text{failure}, & \text{otherwise} \end{cases} \quad (7)$$

Here, for one point in $F_a$, $d_1$ is the Euclidean distance from this point to its nearest point
in $F_b$, and $d_2$ is the distance to the second nearest one. The threshold ε is set to 0.44.
We created an index, named B-Tree, for all feature points in $F_b$. Then, for every point $T_i$
of $F_a$, the two nearest distances $d_1$, $d_2$ to the points in B-Tree were computed by formula
(6) using the BBF [4] search algorithm. The process was repeated, and the points in $F_a$ and
$F_b$ that matched successfully according to formula (7) were stored separately. The similarity
measure of two images is defined as:

$$P(F_a, F_b) = \frac{m}{\varepsilon + \sum_{i=1}^{m} d(T_i)}, \qquad \mathrm{Sim}(F_a, F_b) = \frac{P(F_a, F_b)}{P(F_a, F_a)} \quad (8)$$

Here, m is the number of points successfully matched between $F_a$ and $F_b$, and $d(T_i)$ is the
Euclidean distance between the two points of the i-th successful match. The closer
$\mathrm{Sim}(F_a, F_b)$ is to 1, the higher the similarity. In this paper, we set the threshold on
$\mathrm{Sim}(F_a, F_b)$ to 0.5: if it exceeds 0.5, the image is judged to be a spam image;
otherwise it is judged to be a normal image.
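
The matching and similarity computation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it replaces the indexed BBF search with an exhaustive nearest-neighbor search, it assumes a match succeeds when $d_1/d_2$ falls below ε (the usual form of the distance-ratio test), and all function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance of Eq. (6) between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def match_distances(fa, fb, eps=0.44):
    """Nearest distances d1 of the points of fa that pass the ratio test against fb."""
    matched = []
    for p in fa:
        dists = sorted(euclidean(p, q) for q in fb)   # exhaustive search, not BBF
        if len(dists) < 2:
            continue
        d1, d2 = dists[0], dists[1]
        if d1 < eps * d2:                             # d1/d2 < eps without dividing by zero
            matched.append(d1)
    return matched

def p_score(fa, fb, eps=0.44):
    """P(Fa, Fb) of Eq. (8): m / (eps + sum of matched distances)."""
    d = match_distances(fa, fb, eps)
    return len(d) / (eps + sum(d))

def similarity(fa, fb, eps=0.44):
    """Sim(Fa, Fb) of Eq. (8); returns 0.0 if fa has no self-matches."""
    self_score = p_score(fa, fa, eps)
    return p_score(fa, fb, eps) / self_score if self_score else 0.0
```

An image under test would then be judged spam when `similarity(fa, fb) > 0.5` for some image in the spam sample set.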

5.3 Experiments and Analysis

Fig. 2: The recall and precision rate under different dimensions (horizontal axis: dimensions;
curves: recall rate and precision rate). • is recall rate, ∗ is precision rate.

Recall and precision are two common performance measures for spam detection. Assume A is the
number of spam emails judged as spam, B is the number of normal emails judged as spam, C is the
number of spam emails judged as normal, and D is the number of normal emails judged as normal.
Then the recall rate is defined as $\frac{A}{A+C} \times 100\%$, and the precision rate as
$\frac{A}{A+B} \times 100\%$.
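
As a small worked illustration of these two definitions, the following sketch computes both rates from the counts A, B, and C. The function name is hypothetical and the counts in the usage note are made up for the example; they are not results from the experiments.

```python
def recall_precision(A, B, C):
    """Recall and precision in percent from the counts defined in the text."""
    recall = 100.0 * A / (A + C) if (A + C) else 0.0
    precision = 100.0 * A / (A + B) if (A + B) else 0.0
    return recall, precision
```

For instance, with 8 spam images detected, 2 normal images wrongly flagged, and 2 spam images missed, `recall_precision(8, 2, 2)` gives a recall and a precision of 80% each.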
We selected 20 images to test the effect of the simplified SIFT algorithm under different
feature-vector dimensions: 6, 8, 10, and 12. From Fig. 2 we can see that, as the dimension
increased, the recall rate decreased while the precision rate increased, because more image
information was derived for matching. Recall and precision thus trade off against each other;
when the dimension was 10 × 4, both were acceptable, so we chose 40 dimensions for the feature
vectors.

Fig. 3: The recall and precision rate under scale-changing and rotating (horizontal axis:
precision; vertical axis: recall; curves: the traditional SIFT algorithm and the simplified SIFT
algorithm).

Table 2: Time cost of SIFT and simplified SIFT for 5 images.

         SIFT                                        Simplified SIFT
Image    Number of feature points   Time cost (s)    Number of feature points   Time cost (s)
1        4751                       7.13             3126                       4.21
2        3862                       6.19             2466                       3.16
3        4360                       6.36             2863                       2.74
4        3021                       5.88             2031                       3.33
5        2219                       5.37             1329                       1.98

To test the simplified and traditional SIFT algorithms in detecting spam images under scale
change, rotation, and added noise, we selected 20 images that were scaled by 50% or rotated by
45 degrees and had noise added. During the process, the feature dimension was 40; the resulting
curves are shown in Fig. 3, and Tab. 2 shows the time cost and number of feature points for 5
images.
From Fig. 3 we can see that the two curves were close: although the simplified SIFT algorithm
performed slightly worse, it could still detect image spam effectively. Furthermore, Tab. 2 shows
that the simplified SIFT algorithm extracted fewer feature points than traditional SIFT and had a
clearly lower time cost. The advantage of the simplified SIFT algorithm in time cost was
therefore evident: on average about 3.43 seconds to detect an image spam, compared with 6.27
seconds for the traditional SIFT algorithm. From this perspective, it was worth saving detection
time by sacrificing a little recall and precision.

6 Conclusion
The features extracted by SIFT remain invariant to scale change, rotation, and brightness change,
so the SIFT algorithm is widely used in image matching. Moreover, owing to its stability under
perspective changes, affine changes, and noise, it is popular in image-based spam email
filtering. In this paper, to meet the special requirements of spam filtering on the user side, we
improved the traditional SIFT algorithm by reducing the dimension of the feature vectors and
applied it to detecting spam images. The experimental results demonstrated that the proposed SIFT
algorithm greatly reduced the time cost; although its detection performance was slightly worse,
the trade-off was worthwhile.

References
[1] Bin Chen, Shoubin Dong. An Optimized Spam Filtering Method Based on AODE. Journal of
Information and Computational Science, 2009, 6(2): 749-756
[2] H. B. Aradhye, G. K. Myers, J. A. Herson. Image analysis for efficient categorization of
image-based spam e-mail. Proc. 8th Int. Conf. on Document Analysis and Recognition, 2005, pp. 914-91
[3] Chen Junwei, Zhang Lichun, Lu Yue. A spam image filtering system based on user-specified image
content [J]. CAAI Transactions on Intelligent Systems, Nov. 2008, 3(5): 416-412
[4] Liu Xingxing, Wang Zengfu. An algorithm for text extraction in complex color image [J]. PR&AI,
Dec. 2006, 19(6): 771-775
[5] Zhang Qi, Cao Qi, Bi Duyan, et al. An improved enhancement approach based on anisotropy
differential used for SAR image [J]. Journal of Optoelectronics·Laser, Apr. 2010, 21(4): 614-617
[6] Liu Shangping, Chen Ji. Enhancement method for retinal images based on Gabor filter and
morphology [J]. Journal of Optoelectronics·Laser, Feb. 2010, 21(2): 318-322
[7] Gui Zhiguo, Zhang Pengcheng. Enhancement algorithm for X-ray images based on neighborhood
related information [J]. Computer Engineering and Applications, 2010, 46(21): 175-177
[8] Jia Shijie, Wang Pengxiang, Jiang Haiyang, Zeng Jie. Study of image matching algorithm based on
SIFT [J]. Journal of DaLian Jiao Tong University, Aug. 2010, 31(2): 17-21
[9] Zhang Chunmei, Gong Zhihui, Sun Lei. Improved SIFT feature applied in image matching [J].
Computer Engineering and Applications, 2008, 44(2): 95-97
[10] Wu Huilan, Liu Guodong, Liu Bingguo, et al. Study on the circle center fast accurate locating
technique based on the SIFT [J]. Journal of Optoelectronics·Laser, Nov. 2008, 19(11): 1512-1515
[11] Feng Jia. The research and improvement of SIFT algorithm [D]. CNKI, 2010
