Journal of Information & Computational Science 9: 4 (2012) 1073–1081

Available at http://www.joics.com

Identification of Image-spam Based on Perimetric Complexity Analysis and SIFT Image Matching Algorithm ⋆

Chundong Wang a,b,∗, Hua Yang a,b,∗, Yinghui Chen a,b , Li Sun a,b
Huaibin Wang a,b , Yan Zhou a,b
a Key Laboratory of Computer Vision and System (Tianjin University of Technology), Ministry of
Education, Tianjin 300191, China
b Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University
of Technology, Tianjin 300191, China

Abstract

In this paper, a new method is proposed for identifying image spam with two filter layers: “Perimetric Complexity” analysis followed by a simplified Scale Invariant Feature Transform (SIFT) algorithm. Our earlier work reduced the time cost of detecting image spam by reducing the dimension of the SIFT feature vectors from 128 to 40. To further improve efficiency, “Perimetric Complexity” analysis is added as the first layer; it recognizes image spam by detecting two kinds of noise that result in broken or merged characters. The simplified SIFT algorithm is then applied to the images that remain after the “Perimetric Complexity” analysis. The experimental results demonstrate that the added layer improves the performance of the simplified SIFT algorithm, giving high accuracy and fast detection.

Keywords: Image Spam Identification; Scale Invariant Feature Transform Algorithm; Perimetric
Complexity; Similarity Measure; Euclidean Distance

1 Introduction
Nowadays, email is without doubt an indispensable Internet application. Meanwhile, the flood of unsolicited commercial email (spam) has become a severe problem: it not only causes economic loss but also threatens the sharing, security, and interaction of the Internet. To bypass spam filters, spammers have recently created a new variation of spam, known as image spam, by embedding the text content into images. Image spam has caused severe damage since 2007, because the text categorization techniques used for email spam detection are ineffective against it. Worse still, to help image-based spam escape text-based spam filters, spammers use various obfuscation techniques that produce many variants from a small number of source images, such as adding random noise or straight lines to the spam image, or changing the background of the image while keeping the same text.

⋆ Project supported by the Foundation of Tianjin for Science and Technology Innovation (No. 10FDZDGX00400), the Education Science and Technology Foundation of Tianjin (No. SB20080053 & No. SB20080055).
∗ Corresponding author.
Email addresses: [email protected] (Chundong Wang), [email protected] (Hua Yang).

1548–7741 / Copyright © 2012 Binary Information Press
April 2012

Consequently, different recognition approaches have been put forward to strengthen anti-spam detectors. Some early works, such as H. B. Aradhye et al. [1], applied Bayesian classification after extracting the text from images, which is time-consuming. More recent works classify image spam and normal image emails according to previously extracted image features, such as metadata features and visual features including color, gray level, texture, and so on [2]. Another popular feature extraction method is the SIFT (Scale Invariant Feature Transform) algorithm [5, 6, 10, 11]. Notwithstanding its demonstrated success in distinguishing image spam [3], its precision is limited and its time cost is high because of the high dimension of the SIFT feature vectors. Previous work simplified the SIFT algorithm by reducing the dimension of the feature vectors to save time [4].
To defeat content obfuscation techniques and to improve the precision and recall of the simplified SIFT algorithm, we first detect the obfuscation itself, which identifies the obvious image spam disturbed by noise, and then apply the simplified SIFT algorithm. The degree of obfuscation in an image is evaluated by “Perimetric Complexity”, a measure used in the psychophysics of reading literature. In addition, for the SIFT algorithm, a circular neighborhood [5] is used to reduce the dimension of the feature vectors and so save time.
In Sec. 2, the flow chart of the proposed method is presented. Then “perimetric complexity”
and simplified SIFT algorithm are separately introduced in Sec. 3 and Sec. 4. Experiments and
results are analyzed in Sec. 5. Sec. 6 is the conclusion.

2 Proposed Approach

Fig. 1 presents the overall flow chart of image-based spam recognition by obfuscation detection and the simplified SIFT algorithm. “Perimetric complexity” measures the extent of noise in an image generated by various content obfuscation techniques, such as character rotation, wavy characters, and broken or merged text. The simplified SIFT algorithm has limited precision and recall when the image is degraded by broken or merged characters or by noise components, so “perimetric complexity” analysis is applied as a complementary method to make up for this deficiency. The “perimetric complexity” analysis filters out only those spam images with obvious noise; images that cannot be judged as spam or normal at this stage are passed on to the simplified SIFT algorithm. SIFT feature extraction is carried out in scale space. The goal is to determine the best feature vectors for the similarity measure and to fix their dimension. A circular neighborhood is used instead of the square neighborhood of the traditional SIFT algorithm, which reduces the dimension from 128 to 40. The distance-ratio norm [6] is used for the similarity measure. Finally, whether an image email is spam is decided according to a threshold.

[Figure 1: flow chart. Image data → preprocessing → perimetric complexity analysis. Value < 16 or > 150: spam image. Value in the range [16, 150]: simplified SIFT feature extraction → similarity measure → similarity analysis → normal image or spam.]

Fig. 1: Flow chart of image spam recognition by perimetric complexity and SIFT algorithm analysis

3 Perimetric Complexity
Perimetric complexity, a measure of the complexity of an image, was first used to detect image spam by Battista Biggio et al. [7]. For a clean text image the perimetric complexity is generally in the range (16, 150]; for images containing broken characters or small noise components the value is usually lower than 16, and for images containing merged characters or large noise components it is often over 150. That is why it can be used to identify image spam.
Perimetric complexity is defined as $P^2/A$, where $P$ is the length of the boundary between black and white pixels and $A$ is the area of the black pixels. In practice, $P$ is taken as the number of background pixels that are 4-connected to at least one foreground pixel, i.e. the number of boundary pixels, and $A$ is the number of black pixels. As mentioned above, only two kinds of noise can be detected: one is broken characters or small noise components originating from the background; the other is merged characters or large noise components. Their perimetric complexity measures are denoted $f_1$ and $f_2$ respectively, and are calculated as follows:

(1) First, convert the images into binary images and derive their connected components, labeled by a segment labeling algorithm [8].

(2) Every binary image is subdivided into $p \times q$ blocks of equal size, denoted $C_{ij}$, where $i = 1, \ldots, p$ and $j = 1, \ldots, q$.

(3) Calculate $f_1$. For each block $C_{ij}$, consider only those connected components whose center of mass lies in the block, and calculate $P^2/A$ for each of them. The number of components with $P^2/A$ in the range (16, 150] and a height-to-width ratio in the range (0.25, 2.5] is denoted $c_{ij}$; that is, $c_{ij}$ counts the connected components in block $C_{ij}$ that look like clean characters. Conversely, $n_{ij}$ is the number of connected components whose $P^2/A$ is lower than 16 or whose height-to-width ratio is outside (0.25, 2.5]; $n_{ij}$ counts the components that contain broken characters or small noise. Every considered block has a $c_{ij}$ value, some of which may be 0. Let $c_{10}$ denote the total number of blocks with $c_{ij} > 0$. The noise level $f_{ij}$ of each block is then calculated as:

$$f_{ij} = \frac{n_{ij}}{n_{ij} + c_{ij}} \qquad (1)$$

and $f_1$ is defined as:

$$f_1 = \frac{1}{c_{10}} \sum f_{ij} \qquad (2)$$

(4) Calculate $f_2$. In each block $C_{ij}$, calculate the $P^2/A$ value of every connected component; the share of black pixels belonging to the $k$-th component is:

$$w_k = A_k / A \qquad (3)$$

Here, $A_k$ is the number of black pixels of the $k$-th connected component of the image, and $A$ is the number of black pixels of the whole image. Let $N$ be the total number of connected components (one block may contain several connected areas). Then $f_2$ is computed as:

$$f_2 = \frac{1}{N} \sum_{k=1}^{N} w_k \, \frac{P_k^2}{A_k} \qquad (4)$$
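
Steps (1) to (4) can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB implementation: the function names, the use of 8-connectivity for labeling, the treatment of positions outside the image as background, the choice to sum the block noise values over all non-empty blocks, and the `None` result when no block contains a clean component are all assumptions made for the example.

```python
from collections import deque


def connected_components(img):
    """Return the 8-connected foreground components of a binary image.

    img is a list of rows holding 0 (background) or 1 (black pixel).
    Each component is a list of (row, col) pixels. 8-connectivity is an
    assumption; the paper only cites a labeling algorithm.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny in (cy - 1, cy, cy + 1):
                    for nx in (cx - 1, cx, cx + 1):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def perimetric_complexity(pixels):
    """Return (P, A, P^2/A) for one component.

    P counts the distinct non-component positions that are 4-adjacent to
    the component; positions outside the image count as background.
    """
    foreground = set(pixels)
    boundary = set()
    for y, x in pixels:
        for neighbour in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if neighbour not in foreground:
                boundary.add(neighbour)
    p, a = len(boundary), len(pixels)
    return p, a, p * p / a


def noise_measures(img, p=10, q=10):
    """Return (f1, f2) per Eqs. (1)-(4); f1 is None if no block has c_ij > 0."""
    h, w = len(img), len(img[0])
    components = connected_components(img)
    total_area = sum(len(c) for c in components)   # A: black pixels in the image
    clean, noisy = {}, {}                          # c_ij and n_ij per block
    weighted_sum = 0.0
    for pixels in components:
        _, a_k, pc = perimetric_complexity(pixels)
        rows = [y for y, _ in pixels]
        cols = [x for _, x in pixels]
        # Block that contains the component's center of mass.
        block = (int(sum(rows) / a_k * p / h), int(sum(cols) / a_k * q / w))
        shape = (max(rows) - min(rows) + 1) / (max(cols) - min(cols) + 1)
        shape_ok = 0.25 < shape <= 2.5
        if 16 < pc <= 150 and shape_ok:
            clean[block] = clean.get(block, 0) + 1
        elif pc < 16 or not shape_ok:
            noisy[block] = noisy.get(block, 0) + 1
        weighted_sum += (a_k / total_area) * pc    # w_k * P_k^2 / A_k
    f2 = weighted_sum / len(components) if components else 0.0
    blocks_with_clean = len(clean)                 # c_10
    if blocks_with_clean == 0:
        return None, f2
    f_sum = 0.0
    for block in set(clean) | set(noisy):
        n_ij, c_ij = noisy.get(block, 0), clean.get(block, 0)
        f_sum += n_ij / (n_ij + c_ij)              # Eq. (1)
    return f_sum / blocks_with_clean, f2           # Eq. (2) and Eq. (4)
```

Under this boundary count a solid square of side $n$ has $P = 4n$ and $A = n^2$, so its perimetric complexity is exactly 16, the lower end of the clean-text range.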

4 Simplified SIFT Algorithm


The SIFT algorithm, put forward by David G. Lowe in 1999, is widely used in image matching because the image features it extracts are robust to noise, scale changes, and illumination variations.
Let $I(x, y)$ denote an image. The Gaussian pyramid of the scale space is built by convolution:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \qquad (5)$$

where $G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}$ is the Gaussian kernel function, $L(x, y, \sigma)$ is the scale-space function, and $\sigma$ is the scale factor, which is multiplied by $k$ at each step. The DoG (difference-of-Gaussian) pyramid is then constructed:

$$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma) \qquad (6)$$
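
As an illustration of Eqs. (5) and (6), the sketch below blurs a small image at successive scales and subtracts adjacent layers. It is a plain-Python toy, not the authors' implementation; the function names, the truncation of the kernel at three standard deviations, the border replication, and the default values `sigma=1.6` and `k=2 ** 0.5` are assumptions, since the paper does not state them.

```python
import math


def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian weights, truncated at 3 sigma (an assumption)."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [wgt / total for wgt in weights]


def gaussian_blur(image, sigma):
    """Compute L = G * I with two 1-D passes, replicating border pixels."""
    kernel = gaussian_kernel_1d(sigma)
    r = len(kernel) // 2
    h, w = len(image), len(image[0])

    def clamp(v, upper):
        return min(max(v, 0), upper)

    horizontal = [[sum(kernel[j + r] * image[y][clamp(x + j, w - 1)]
                       for j in range(-r, r + 1))
                   for x in range(w)] for y in range(h)]
    return [[sum(kernel[j + r] * horizontal[clamp(y + j, h - 1)][x]
                 for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


def dog_layers(image, sigma=1.6, k=2 ** 0.5, layers=5):
    """Return layers - 1 difference-of-Gaussian images for one octave, Eq. (6)."""
    blurred = [gaussian_blur(image, sigma * k ** i) for i in range(layers)]
    return [[[hi - lo for hi, lo in zip(row_hi, row_lo)]
             for row_hi, row_lo in zip(blurred[i + 1], blurred[i])]
            for i in range(layers - 1)]
```

With five Gaussian layers per octave, as used in this paper, each octave yields four DoG layers.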

In this paper, the Gaussian pyramid has 4 octaves with 5 layers in each octave. In DoG space, for every layer except the bottom and top ones, each value is compared with its 8 neighbors in the same layer and the 9 neighbors in each of the two adjacent layers, i.e. 26 values in all, so that all local extrema are found.
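
The 26-neighbor comparison can be sketched as follows. The DoG stack is assumed to be a list of equally sized 2-D lists indexed as `dog[layer][row][col]`; the function names and the use of strict inequalities are illustrative choices, not taken from the paper.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or below all 26 neighbours."""
    value = dog[s][y][x]
    neighbours = [dog[s + ds][y + dy][x + dx]
                  for ds in (-1, 0, 1)
                  for dy in (-1, 0, 1)
                  for dx in (-1, 0, 1)
                  if (ds, dy, dx) != (0, 0, 0)]
    return value > max(neighbours) or value < min(neighbours)


def find_extrema(dog):
    """Scan every interior position of every layer except the bottom and top ones."""
    points = []
    for s in range(1, len(dog) - 1):
        height, width = len(dog[s]), len(dog[s][0])
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                if is_extremum(dog, s, y, x):
                    points.append((s, y, x))
    return points
```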
The feature points are selected from the set of extrema. To improve the robustness of the algorithm, the selection criteria of [11] are followed. Low-contrast points and unstable points on edges are discarded; the remaining points must satisfy $|D(X_{max})| \ge 0.03$ and $ratio \le (r + 1)^2 / r$, where $D(X_{max})$ is the quadratic fit from the Taylor expansion at the extreme point, and $r$ (set to 10 here, following [11]) is the ratio of the larger eigenvalue to the smaller eigenvalue of the Hessian matrix $H$:

$$D(X_{max}) = D + \frac{1}{2} \frac{\partial D^T}{\partial X} X_{max}, \qquad H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{yx} & D_{yy} \end{bmatrix} \qquad (7)$$

To keep invariance to image rotation, the coordinate axes are rotated to the direction of each feature point. The gradient magnitude and direction of a feature point are determined as follows:

$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$$
$$\theta(x, y) = \tan^{-1} \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)} \qquad (8)$$
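
The central-difference form of Eq. (8) can be sketched as below. The image is assumed to be a 2-D list indexed as `L[y][x]`, and `math.atan2` is used in place of a plain arctangent so that the quadrant is resolved and a zero horizontal difference does not cause a division by zero; both are choices made for this example.

```python
import math


def gradient(L, x, y):
    """Return (magnitude, direction in radians) at interior pixel (x, y)."""
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    return math.hypot(dx, dy), math.atan2(dy, dx)
```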

The next step is to form the vectors that describe the feature points, called feature vectors. Every key point has a feature vector formed from gradient statistics of the points in a neighborhood of the feature point. The dimension of the feature vectors matters for reducing the impact of gradient values and illumination variations. The traditional SIFT algorithm uses a square neighborhood to build the feature vectors, which gives 128 dimensions and is too costly in time.
For the simplified SIFT algorithm, a circular neighborhood is designed to fix the dimension of the feature vectors. The method is as follows:

(1) With the feature point as the center, take a circular area of radius 8 and divide it, in steps of 2, into 4 concentric rings. In each of the 4 rings, the following 10 directions are selected: 0°, 36°, 72°, 108°, 144°, 180°, 216°, 252°, 288°, 324°. Then compute the accumulated gradient values in these 10 directions.

(2) From the 4 rings, taken from inside to outside, form one 10-dimensional vector per ring, denoted $T_1, T_2, T_3, T_4$, so that the primary feature vector is $T = (T_1, T_2, T_3, T_4)$, where $T_i = (t_{i1}, t_{i2}, \ldots, t_{i10})$, $i = 1, 2, 3, 4$. The components of $T_1, T_2, T_3, T_4$ are then cyclically shifted left until the largest component of $T_1$, denoted $t_{1max}$, is in first position, i.e. $t_{11} = t_{1max}$, so $T_i$ becomes:
$$T_i = (t_{i,max}, t_{i,max+1}, \ldots, t_{i10}, t_{i1}, t_{i2}, \ldots, t_{i,max-1}), \quad i = 1, 2, 3, 4$$
The new feature vector is expressed as $T' = (t'_1, \ldots, t'_{40})$.

(3) To resist the effect of illumination, the new feature vector $T'$ is normalized: $T' = T' / \|T'\|$, where $\|T'\|$ is its modulus (length). The components of $T'$ that are less than 2 are set to 2.
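
Steps (2) and (3) can be sketched as follows, assuming the four 10-bin ring histograms of step (1) have already been accumulated. The function name is hypothetical, the same shift is applied to all four rings as the text describes, and the final clipping of components is left out of the sketch because only the shift and the normalization are illustrated here.

```python
import math


def simplified_descriptor(rings):
    """Build the 40-dimensional vector from 4 ring histograms of 10 bins each.

    rings[0] is the innermost ring. Every ring is cyclically shifted left by
    the position of the largest bin of the innermost ring, then the
    concatenated vector is scaled to unit length.
    """
    shift = rings[0].index(max(rings[0]))
    rotated = [ring[shift:] + ring[:shift] for ring in rings]
    flat = [value for ring in rotated for value in ring]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat
```

Because the shift is tied to the dominant bin of the innermost ring, rotating all four input histograms by the same number of bins leaves the resulting vector unchanged, which is the rotation invariance the step is meant to provide.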

5 Experiments
5.1 Data Set
The simulation experiments were performed with MATLAB 7.0 under the Windows XP operating system. 380 images were collected from email boxes, including 200 normal images and 180 spam images, the latter forming the spam sample set. The samples under test were also drawn from these images. All the images were converted into binary images, and the two kinds of noise were detected by perimetric complexity analysis with p = 10 and q = 10, values chosen partly with reference to [7]. Some image spam was identified after this first step. The remaining unidentified images were then analyzed by the simplified SIFT algorithm, which extracted their feature points.

5.2 Similarity Measures


The similarity between two feature points is measured by the Euclidean distance. For example, assume $F_a = \{T_1^a, \ldots, T_{n_a}^a\}$ is the feature point set of an image under test and $F_b = \{T_1^b, \ldots, T_{n_b}^b\}$ is the feature point set of an image in the spam sample set, where $n_a$ and $n_b$ are the numbers of feature points in the two sets. When the feature vectors have dimension $k$, the Euclidean distance between two feature points is defined as:

$$d(F_a, F_b) = \sqrt{\sum_{j=1}^{k} (T_j^a - T_j^b)^2} \qquad (9)$$

where $T_j^a$ and $T_j^b$ here denote the $j$-th components of the two feature vectors being compared. We use the distance-ratio norm to decide whether two points match:

$$ratio = d_1 / d_2, \qquad \begin{cases} ratio > \varepsilon & \text{success} \\ \text{else} & \text{failure} \end{cases} \qquad (10)$$

For a point in $F_a$, $d_1$ is the Euclidean distance to its nearest point in $F_b$, and $d_2$ is the distance to the second nearest one. The threshold $\varepsilon$ is set to 0.44.
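
The quantities $d_1$, $d_2$ and their ratio can be computed with a brute-force search, sketched below. The exhaustive scan stands in for the indexed BBF search used in the paper, the function names are hypothetical, and returning a ratio of 1.0 when both distances are zero is an assumption. The comparison of the ratio against $\varepsilon$ is left to the caller, following formula (10).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_ratio(point, candidates):
    """Return (index of nearest candidate, d1, d1 / d2) for one query point.

    Brute-force scan over all candidates; at least two are required so that
    a second-nearest distance d2 exists.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate points")
    ranked = sorted((euclidean(point, c), i) for i, c in enumerate(candidates))
    (d1, nearest), (d2, _) = ranked[0], ranked[1]
    ratio = d1 / d2 if d2 > 0 else 1.0   # both distances zero: treated as a tie
    return nearest, d1, ratio
```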
We build an index, named B-Tree, over all feature points in $F_b$. Then, for every point $T_i$ of $F_a$, the two nearest distances $d_1$, $d_2$ to the points in the B-Tree are computed by formula (9) using the BBF [4] search algorithm. This process is repeated, and the points of $F_a$ and $F_b$ that match according to formula (10) are stored separately. The similarity measure of two images is defined as:

$$P(F_a, F_b) = \frac{m}{\varepsilon + \sum_{i=1}^{m} d(T_i)}, \qquad Sim(F_a, F_b) = \frac{P(F_a, F_b)}{P(F_a, F_a)} \qquad (11)$$

Here, $m$ is the number of points successfully matched between $F_a$ and $F_b$, and $d(T_i)$ is the Euclidean distance between the two points of the $i$-th successful match. The closer $Sim(F_a, F_b)$ is to 1, the higher the similarity. In this paper, the threshold on $Sim(F_a, F_b)$ is set to 0.5: if it exceeds 0.5, the image is judged as spam; otherwise it is judged as a normal image.
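
A small sketch of formula (11) is given below. Two points are assumptions rather than statements from the paper: $\varepsilon$ in formula (11) is taken to be the same 0.44 used in formula (10), and $P(F_a, F_a)$ is evaluated by supposing that every one of the $n_a$ points of $F_a$ matches itself at distance 0. The function names are illustrative.

```python
def p_score(match_distances, eps=0.44):
    """P = m / (eps + sum of distances over the m successful matches)."""
    return len(match_distances) / (eps + sum(match_distances))


def similarity(match_distances, n_a, eps=0.44):
    """Sim(F_a, F_b) = P(F_a, F_b) / P(F_a, F_a).

    Assumption: matching F_a against itself yields n_a matches at distance 0.
    """
    if n_a <= 0:
        raise ValueError("F_a must contain at least one feature point")
    return p_score(match_distances, eps) / p_score([0.0] * n_a, eps)


def judged_as_spam(match_distances, n_a, threshold=0.5, eps=0.44):
    """Apply the 0.5 threshold on Sim used in this paper."""
    return similarity(match_distances, n_a, eps) > threshold
```

Under these assumptions an image compared with an identical copy scores 1, and an image with no successful matches scores 0.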

5.3 Experiments and Analysis


Recall and precision are two common performance measures for spam detection. Let A be the number of spam emails judged as spam, B the number of normal emails judged as spam, C the number of spam emails judged as normal, and D the number of normal emails judged as normal. Then the recall rate is defined as $\frac{A}{A+C} \times 100\%$, and the precision rate as $\frac{A}{A+B} \times 100\%$.
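
For concreteness, the two rates can be computed from the counts A, B and C as in this short sketch. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumption.

```python
def recall_rate(a, c):
    """Recall in percent: spam judged as spam over all spam."""
    return 100.0 * a / (a + c) if a + c else 0.0


def precision_rate(a, b):
    """Precision in percent: spam judged as spam over everything judged as spam."""
    return 100.0 * a / (a + b) if a + b else 0.0
```

For example, with A = 90, B = 30 and C = 10, the recall rate is 90% and the precision rate is 75%.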

To test the perimetric complexity for the two kinds of noise, 80 spam images were selected for the experiment, 40 of each kind. I denotes spam images with broken characters or small noise components added, and II denotes those with merged characters or large noise components added. Values for five spam images of each kind are shown in Table 1 and Table 2.

Table 1: Value of f1 and f2 for I

I     f2        f1
1     16.20     0.63
2     14.49     0.66
3     16.18     0.49
4     13.32     0.72
5     36.12     0.33

Table 2: Value of f1 and f2 for II

II    f2        f1
1     131.42    0.015
2     287.56    0.13
3     1352      0.08
4     2423      0.018
5     4431      0.16

Table 1 and Table 2 show that the value of $f_1$ evaluates the percentage of sub-blocks containing obfuscation, while $f_2$ is the average value of $P^2/A$ for an image. When broken characters or small noise components are present, the percentage of noise blocks, i.e. $f_1$, increases, and so does the number of blocks with a $P^2/A$ value lower than 16. The presence of merged characters or large noise components results in a larger number of blocks with a $P^2/A$ value larger than 150. Consequently, spam images are identified in two cases: a high value of $f_1$ together with a value of $f_2$ lower than 16, or a low value of $f_1$ together with a value of $f_2$ much larger than 150.
To compare normal images with spam images in terms of $f_1$ and $f_2$, 160 images were selected for the perimetric complexity test, comprising 80 spam images and 80 normal images. Their values are shown in Fig. 2, from which two differences can be seen. First, the points from normal images are concentrated in a compact range, while those from spam images are scattered over a wide range. Second, the points from normal and spam images roughly fall in different ranges, which is why spam images can be identified by perimetric complexity analysis. At least those spam images whose perimetric complexity values differ obviously from those of normal images can be identified. Nonetheless, some spam and normal images still fall in the same range, so the SIFT algorithm is used for further identification. To ensure the precision rate, the thresholds should be set strictly: an image is identified as spam only if it meets the following requirement: $f_1 > 60\%$ and $f_2 < 15$, or $f_2 > 500$.
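
The first-layer decision rule can be written as a one-line predicate, sketched below. Two readings are assumed: the threshold on f1 is taken as 0.60 on the 0 to 1 scale used in Table 1 and Table 2, and the condition is grouped as (f1 high and f2 low) or (f2 very high). The function and parameter names are hypothetical.

```python
def obvious_spam(f1, f2, f1_min=0.60, f2_low=15.0, f2_high=500.0):
    """First filter layer: True only for images with obvious noise.

    Assumed grouping: (f1 > f1_min and f2 < f2_low) or f2 > f2_high.
    Images for which this returns False go on to the simplified SIFT stage.
    """
    return (f1 > f1_min and f2 < f2_low) or f2 > f2_high
```

Applied to the tabulated values, image 2 of Table 1 (f2 = 14.49, f1 = 0.66) and image 3 of Table 2 (f2 = 1352, f1 = 0.08) are caught by this rule, whereas image 1 of Table 1 (f2 = 16.20) and image 1 of Table 2 (f2 = 131.42) are passed on to the second layer.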
After the obvious spam images have been removed by the perimetric complexity analysis, the remaining images are examined by the simplified SIFT algorithm. The precision and recall rates are shown in Fig. 3. To compare the results with previous work, the remaining images are rescaled by 50% or rotated by 45 degrees.
In Fig. 3, the three curves differ. Obviously, the green curve is better than the other two in precision and recall rate. Our previous work showed that the simplified SIFT algorithm is superior in time cost, but its precision and recall rates are somewhat poorer than those of the traditional SIFT algorithm, which is time-consuming. Adding the perimetric complexity analysis remedies this disadvantage of the simplified SIFT algorithm: the curve shows an increase in precision and recall rates, which become even better than those of the traditional SIFT algorithm.

[Figure 2: scatter plot; horizontal axis P2/A (0 to 1400), vertical axis value of noise (0 to 0.8).]

Fig. 2: The value of f1 and f2

[Figure 3: recall rate (vertical axis, 0 to 1.0) against precision rate (horizontal axis, 0.1 to 1.0) for three methods: traditional SIFT algorithm, simplified SIFT algorithm, and perimetric complexity with simplified SIFT algorithm.]

Fig. 3: The precision and recall rate of three methods

6 Conclusions
Features extracted by SIFT remain invariant to scale changes, rotation, and brightness changes, so the SIFT algorithm is widely used in image matching. Moreover, owing to its stability under perspective changes, affine changes, and noise, it is popular in image-based spam filtering. This paper studies obfuscation techniques and improves the precision and recall rates of the simplified SIFT algorithm by analyzing perimetric complexity values. The main benefit of perimetric complexity is that some images containing the two kinds of noise can be filtered out in the first step, so the burden on the second filter, the simplified SIFT algorithm, is reduced. The experimental results demonstrate that the proposed method, which combines perimetric complexity and the simplified SIFT algorithm, not only makes up for the loss that SIFT suffers from reducing the dimension of its feature vectors but also strengthens the robustness of the simplified algorithm.

References
[1] Bin Chen, Shoubin Dong. An optimized spam filtering method based on AODE [J]. Journal of Information and Computational Science, 6(2), 2009, 749-756
[2] Fen Liu. Research on the Image Spam Filtering Technology Based on Content [D]. Master degree thesis. 2010
[3] H. B. Aradhye, G. K. Myers, J. A. Herson. Image analysis for efficient categorization of image-based spam e-mail [C]. Proc. 8th Int. Conf. on Document Analysis and Recognition, 2005, 914-91
[4] Chundong Wang, Hua Yang, Yinghui Chen et al. Identification of image-spam based on SIFT image matching algorithm [J]. Journal of Computational Information Systems, 7(14), 2010, 3153-3160
[5] Chunmei Zhang, Zhihui Gong, Lei Sun. Improved SIFT feature applied in image matching [J]. Computer Engineering and Applications, 44(2), 2008, 95-97
[6] Min Zuo, Guangping Zeng et al. A connected domain labeling algorithm based on equivalence pair in binary image [J]. Journal of Computer Simulation, 28(1), 2011, 14-16
[7] Battista Biggio, Giorgio Fumera et al. Image spam filtering using visual information [J]. International Conference on Image Analysis and Processing, 2007
[8] Huilan Wu, Guodong Liu et al. Study on the circle center fast accurate locating technique based on the SIFT [J]. Journal of Optoelectronics·Laser, 19(11), Nov. 2008, 1512-1515
[9] Xingxing Liu, Zengfu Wang. An algorithm for text extraction in complex color image [J]. PR&AI, 19(6), Dec. 2006, 771-775
[10] Shijie Jia, Pengxiang Wang et al. Study of image matching algorithm based on SIFT [J]. Journal of DaLian Jiao Tong University, 31(2), Aug. 2010, 17-21
[11] Zhiguo Gui, Pengcheng Zhang. Enhancement algorithm for X-ray images based on neighborhood related information [J]. Computer Engineering and Applications, 46(21), 2010, 175-177
