
Deep Networks for Saliency Detection via Local Estimation and Global Search

Lijun Wang†, Huchuan Lu†, Xiang Ruan‡ and Ming-Hsuan Yang§

†Dalian University of Technology   ‡OMRON Corporation   §University of California at Merced

Abstract

This paper presents a saliency detection algorithm by integrating both local estimation and global search. In the local estimation stage, we detect local saliency by using a deep neural network (DNN-L) which learns local patch features to determine the saliency value of each pixel. The estimated local saliency maps are further refined by exploring the high level object concepts. In the global search stage, the local saliency map together with global contrast and geometric information are used as global features to describe a set of object candidate regions. Another deep neural network (DNN-G) is trained to predict the saliency score of each object region based on the global features. The final saliency map is generated by a weighted sum of salient object regions. Our method presents two interesting insights. First, local features learned by a supervised scheme can effectively capture local contrast, texture and shape information for saliency detection. Second, the complex relationship between different global saliency cues can be captured by deep networks and exploited principally rather than heuristically. Quantitative and qualitative experiments on several benchmark data sets demonstrate that our algorithm performs favorably against the state-of-the-art methods.

Figure 1. Saliency detection by different methods. (a) Original images. (b) Ground truth saliency maps. (c) Saliency maps by a local method [13]. (d) Saliency maps by a global method [7]. (e) Saliency maps by the proposed method.

1. Introduction

Saliency detection, which aims to identify the most important and conspicuous object regions in an image, has received increasingly more interest in recent years. Serving as a preprocessing step, it can efficiently focus on the interesting image regions related to the current task and broadly facilitates computer vision applications such as segmentation, image classification, and compression, to name a few. Although much progress has been made, it remains a challenging problem.

Existing methods mainly formulate saliency detection by a computational model in a bottom-up fashion with either a local or a global view. Local methods [13, 25, 19, 39] compute center-surround differences in a local context for color, texture and edge orientation channels to capture the regions locally standing out from their surroundings. Although being biologically plausible, local models often lack global information and tend to highlight the boundaries of salient objects rather than the interiors (see Figure 1(c)). In contrast, global methods [1, 24, 29] take the entire image into consideration to predict the salient regions, which are characterized by holistic rarity and uniqueness, and thus help detect large objects and uniformly assign saliency values to the contained regions. Unlike local methods, which are sensitive to high frequency image contents such as edges and noise, global methods are less effective when the textured regions of salient objects are similar to the background (see Figure 1(d)). The combination of local and global methods has been explored by a few recent studies, where background prior, center prior, color histograms and other hand-crafted features are utilized in a simple and heuristic way to compute saliency maps.

While the combination of local and global models [32, 36] is technically sound, these methods have two major drawbacks. First, they mainly rely on hand-crafted features which may fail to describe complex image scenarios and object structures. Second, the adopted saliency priors and features are mostly combined based on heuristics, and it is not clear how these features can be better integrated.
Figure 2. Pipeline of our algorithm. (a) Proposed deep network DNN-L (Section 3.1). (b) Local saliency map (Section 3.1). (c) Local saliency map after refinement (Section 3.2). (d) Feature extraction (Section 4.1). (e) Proposed deep network DNN-G (Section 4.2). (f) Sorted object candidate regions (Section 4.2). (g) Final saliency map (Section 4.2).
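As a reading aid, the pipeline of Figure 2 can be summarized by the following sketch. Every callable argument is a hypothetical stand-in for a component described later (local_net for DNN-L, proposal_fn for the object proposal method, feature_fn for the global feature extractor, global_net for DNN-G), and the simple top-K refinement used here is only a placeholder for the procedure of Section 3.2.

```python
import numpy as np

def legs_pipeline(image, local_net, proposal_fn, feature_fn, global_net, k=20):
    """High-level sketch of the Figure 2 pipeline with hypothetical callables:
    local_net maps an image to a pixel-wise saliency map (DNN-L), proposal_fn
    returns binary candidate masks, feature_fn builds the 72-d descriptor of a
    candidate, and global_net predicts [precision, overlap] per candidate."""
    local_map = local_net(image)                                       # (a)-(b) local estimation
    masks = proposal_fn(image)                                         # object proposals
    top = sorted(masks, key=lambda m: (m * local_map).mean(), reverse=True)[:k]
    refined = np.mean(top, axis=0)                                     # (c) placeholder refinement
    feats = np.stack([feature_fn(image, refined, m) for m in masks])   # (d) global features
    phi = global_net(feats)                                            # (e) DNN-G predictions
    conf = phi[:, 0] * phi[:, 1]
    order = np.argsort(conf)[::-1][:k]                                 # (f) sorted candidates
    return sum(conf[i] * masks[i] for i in order) / conf[order].sum()  # (g) final saliency map
```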

In this paper, we propose a novel saliency detection algorithm by combining local estimation and global search (LEGS) to address the above-mentioned issues. In the local estimation stage, we formulate a deep neural network (DNN) based saliency detection method to assign a local saliency value to each pixel by considering its local context. The trained deep neural network, named DNN-L, takes raw pixels as inputs and learns the contrast, texture and shape information of local image patches. The saliency maps generated by DNN-L are further refined by exploring the high level objectness (i.e., generic visual information of objects) to ensure label consistency, and serve as local saliency measurements. In the global search stage, we search for the most salient object regions. A set of candidate object regions is first generated using a generic object proposal method [20]. A feature vector containing global color contrast, geometric information as well as the local saliency measurements estimated by DNN-L is collected to describe each object candidate region. These extracted feature vectors are used to train another deep neural network, DNN-G, to predict the saliency value of each object candidate region from a global perspective. The final saliency map is generated by the sum of salient object regions weighted by their saliency values. Figure 2 shows the pipeline of our algorithm.

Much success has been demonstrated by deep networks in image classification, object detection, and scene parsing. However, the use of DNNs in saliency detection is still limited, since DNNs, mainly fed with image patches, fail to capture the global relationship of image regions and maintain label consistency in a local neighborhood. Our main contribution addresses these issues by proposing an approach to apply DNNs to saliency detection from both local and global perspectives. We demonstrate that the proposed DNN-L is capable of capturing the local contrast, texture as well as shape information, and predicting the saliency value of each pixel without the need for hand-crafted features. The proposed DNN-G can effectively detect global salient regions by using various saliency cues through a supervised learning scheme. Both DNN-L and DNN-G are trained on the same training data set (see Section 5.1 for details). Without additional training, our method generalizes well to the other data sets and performs well against the state-of-the-art approaches.

2. Related Work

In this section, we discuss the related saliency detection methods and their connection to generic object proposal methods. In addition, we also briefly review deep neural networks that are closely related to this work.

Saliency detection methods can be generally categorized into local and global schemes. Local methods measure saliency by computing local contrast and rarity. In the seminal work [13] by Itti et al., center-surround differences across multiple scales of image features are computed to detect local conspicuity. Ma and Zhang [25] utilize color contrast in a local neighborhood as a measure of saliency. In [11], the saliency values are measured by the equilibrium distribution of Markov chains over different feature maps. The methods that consider only local contexts tend to detect high frequency content and suppress the homogeneous regions inside salient objects. On the other hand, global methods detect saliency by using holistic contrast and color statistics of the entire image. Achanta et al. [1] estimate visual saliency by computing the color difference of each pixel with respect to the image mean. Histogram based global contrast and spatial coherence are used in [7] to detect saliency. Liu et al. [24] propose a set of features from both local and global views, which are integrated by a conditional random field to generate a saliency map. In [29], two contrast measures based on the uniqueness and spatial distribution of regions are defined for saliency detection.
Table 1. Architecture details of the proposed deep networks. C: convolutional layer; F: fully connected layer; R: ReLUs; L: local response normalization; D: dropout; S: softmax; Channels: the number of output feature maps; Input size: the spatial size of input feature maps.

DNN-L
Layer           1        2        3        4        5        6 (Output)
Type            C+R+L    C+R      C+R      F+R+D    F+R+D    F+S
Channels        96       256      384      2048     2048     2
Filter size     11x11    5x5      3x3      -        -        -
Pooling size    3x3      2x2      3x3      -        -        -
Pooling stride  2x2      2x2      3x3      -        -        -
Input size      51x51    20x20    8x8      2x2      1x1      1x1

DNN-G
Layer           1        2        3        4        5        6 (Output)
Type            F+R+D    F+R+D    F+R+D    F+R+D    F+R      F
Channels        1024     2048     2048     1024     1024     2
Input size      1x1      1x1      1x1      1x1      1x1      1x1

To identify small high contrast regions, Yan et al. [40] propose a multi-layer approach to analyze saliency cues. A random forest based regression model is proposed in [16] to directly map regional feature vectors to saliency scores. Recently, Zhu et al. [42] present a background measurement scheme to utilize the boundary prior for saliency detection. Although significant advances have been made, most of the above-mentioned methods integrate hand-crafted features heuristically to generate the final saliency map, and do not perform well on challenging images. In contrast, we utilize a deep network (DNN-L) to automatically learn features capturing local saliency, and learn the complex dependencies among global cues using another deep network (DNN-G).

Generic object detection (also known as object proposal) methods [3, 2, 37] aim at generating the locations of all category independent objects in an image and have attracted growing interest in recent years. Existing techniques propose object candidates by either measuring the objectness [2, 5] of an image window or grouping regions in a bottom-up process [37, 20]. The generated object candidates can significantly reduce the search space of category specific object detectors, which in turn helps other modules for recognition and other tasks. As such, generic object detection is closely related to salient object segmentation. In [2], saliency is utilized as an objectness measurement to generate object candidates. Chang et al. [4] use a graphical model to exploit the relationship of objectness and saliency cues for salient object detection. In [23], a random forest model is trained to predict the saliency score of an object candidate. In this work, we propose a DNN-based saliency detection method combining both local saliency estimation and global salient object candidate search.

Deep neural networks have achieved state-of-the-art results in image classification [21, 8, 34], object detection [35, 10, 12] and scene parsing [9, 30]. The success stems from the expressibility and capacity of deep architectures that facilitate learning complex features and models to account for interacted relationships directly from training examples. Since DNNs mainly take image patches as inputs, they tend to fail in capturing long range label dependencies for scene parsing as well as saliency detection. To address this issue, Pinheiro and Collobert [30] use a recurrent convolutional neural network to consider large contexts. In [9], a DNN is applied in a multi-scale manner to learn hierarchical feature representations for scene labeling. We propose to utilize DNNs from both local and global perspectives for saliency detection, where the DNN-L estimates local saliency of each pixel and the DNN-G searches for salient object regions based on global features to enforce label dependencies.

3. Local Estimation

The motivation of local estimation is that local outliers, standing out from their neighbors with different colors or textures, tend to attract human attention. In order to detect these outliers from a local view, we formulate a binary classification problem to determine whether each pixel is salient (1) or non-salient (0) based on its surroundings. We use a deep network, namely DNN-L, to conduct the classification, since DNNs have demonstrated state-of-the-art performance in image classification and do not rely on hand-crafted features. By incorporating object level concepts into local estimation, we present a refinement method to enhance the spatial consistency of local saliency maps.

3.1. DNN based Local Saliency Estimation

Architecture of DNN-L. The proposed DNN-L consists of six layers, with three convolutional layers and three fully connected layers. Each layer contains learnable parameters and consists of a linear transformation followed by a nonlinear mapping, which is implemented by Rectified Linear Units (ReLUs) [28] to accelerate the training process. Local response normalization is applied to the first layer to help generalization. Max pooling is applied to all three convolutional layers for translational invariance. The dropout procedure is used after the first and the second fully connected layers to avoid overfitting. The network takes an RGB image patch of 51 × 51 pixels as an input, and exploits a softmax regression model as the output layer to generate the probabilities of the central pixel being salient and non-salient. The architecture details are listed in Table 1.
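For illustration, the following is a minimal sketch of a DNN-L-like network following the layer sizes in Table 1. It is written in PyTorch purely for readability (the authors' implementation uses Caffe), and the convolution strides and padding are assumptions (stride 1, no padding), chosen so that the spatial sizes match the Input size row of Table 1.

```python
import torch
import torch.nn as nn

class DNNL(nn.Module):
    """Sketch of a DNN-L-like network (Table 1); strides/padding are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(kernel_size=3, stride=2),            # 51x51 -> 20x20
            nn.Conv2d(96, 256, kernel_size=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),            # 20x20 -> 8x8
            nn.Conv2d(256, 384, kernel_size=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=3),            # 8x8 -> 2x2
        )
        self.classifier = nn.Sequential(
            nn.Linear(384 * 2 * 2, 2048), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(2048, 2048), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(2048, 2),                                # salient / non-salient logits
        )

    def forward(self, x):                                      # x: (N, 3, 51, 51)
        return self.classifier(self.features(x).flatten(1))   # softmax applied outside

patch = torch.randn(1, 3, 51, 51)
probs = torch.softmax(DNNL()(patch), dim=1)                    # [P(non-salient), P(salient)]
```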
Training data. For each image in the training set (see also Section 5.1), we collect samples by cropping 51 × 51 RGB image patches in a sliding window fashion with a stride of 10 pixels. To label the training patches, we mainly consider the ground truth saliency values of their central pixels as well as the overlaps between the patches and the ground truth saliency mask. The patch B is labeled as a positive training example if i) the central pixel is salient, and ii) it sufficiently overlaps with the ground truth salient region G: |B ∩ G| ≥ 0.7 × min(|B|, |G|). Similarly, the patch B is labeled as a negative training example if i) the central pixel is located within the background, and ii) its overlap with the ground truth salient region is less than a predefined threshold: |B ∩ G| < 0.3 × min(|B|, |G|). The remaining samples, labeled as neither positive nor negative, are not used. Following [21], we do not pre-process the training samples, except for subtracting the mean values over the training set from each pixel.
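A minimal sketch of this labeling rule is given below, assuming binary ground-truth masks; the function name and the exact mask handling are illustrative rather than the authors' code.

```python
import numpy as np

def label_patch(patch_mask, full_mask):
    """Label one 51x51 patch by the overlap rule of Section 3.1.
    patch_mask: binary ground-truth saliency inside the patch (51x51)
    full_mask:  binary ground-truth saliency of the whole image
    Returns 1 (positive), 0 (negative) or None (discarded)."""
    center_salient = patch_mask[25, 25] > 0          # central pixel of the 51x51 patch
    overlap = patch_mask.sum()                       # |B ∩ G| restricted to the patch
    bound = min(patch_mask.size, full_mask.sum())    # min(|B|, |G|)
    if center_salient and overlap >= 0.7 * bound:
        return 1
    if (not center_salient) and overlap < 0.3 * bound:
        return 0
    return None                                      # neither positive nor negative
```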
Training DNN-L. Given the training patch set {B_i}^{N_L} and the corresponding label set {l_i}^{N_L}, we use the softmax loss with weight decay as the cost function,

    L(θ^L) = − (1/m) Σ_{i=1}^{m} Σ_{j=0}^{1} 1{l_i = j} log P(l_i = j | θ^L) + λ Σ_{k=1}^{6} ||W^L_k||²_F,    (1)

where θ^L is the learnable parameter set of DNN-L, including the weights and biases of all layers; 1{·} is the indicator function; P(l_i = j | θ^L) is the label probability of the i-th training sample predicted by DNN-L; λ is the weight decay parameter; and W^L_k is the weight of the k-th layer. DNN-L is trained using stochastic gradient descent with a batch size of m = 256, momentum of 0.9, and weight decay of 0.0005. The learning rate is initially set to 0.01 and is decreased by a factor of 0.1 when the cost is stabilized. The training process is repeated for 80 epochs. Figure 3(a) illustrates the learned convolutional filters in the first layer, which capture color, contrast, edge and pattern information of a local neighborhood. Figure 3(c) shows the output of the first layer, where locally salient pixels with different features are highlighted by different feature maps.

Figure 3. Visualization of DNN-L. (a) 96 convolutional filters with the size of 11 × 11 × 3 in the first layer. (b) Input image (top) and the local saliency map (bottom) generated by DNN-L. (c) Output feature maps of the first layer by applying DNN-L to the input image in a sliding window manner. (Better viewed at high resolution.)
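To make Eq. (1) concrete, here is a small NumPy sketch of the cost for one mini-batch. The variable names are hypothetical, and the forward pass producing the logits is assumed to be computed elsewhere.

```python
import numpy as np

def dnnl_cost(logits, labels, weights, lam=0.0005):
    """Sketch of Eq. (1): mini-batch softmax log-loss plus weight decay.
    logits: (m, 2) network outputs; labels: (m,) integer labels l_i in {0, 1};
    weights: list of the six layer weight matrices W_k^L (hypothetical handles)."""
    z = logits - logits.max(axis=1, keepdims=True)               # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log P(l_i = j | theta)
    data_term = -log_prob[np.arange(len(labels)), labels].mean()
    decay_term = lam * sum((w ** 2).sum() for w in weights)      # squared Frobenius norms
    return data_term + decay_term
```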
At test stage, we apply DNN-L in a sliding window fashion to the entire image and predict the probability P(l = 1 | θ) for each pixel as its local saliency value. Figure 4(c) demonstrates the generated local saliency maps. Both Figure 3 and Figure 4 show that the proposed local estimation method can effectively learn, rather than design, useful features characterizing local saliency by training DNN-L with local image patches.

Figure 4. Saliency maps by local estimation. (a) Source images. (b) Ground truth. (c) Local saliency maps predicted by DNN-L. (d) Local saliency maps after refinement.
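A naive sketch of this sliding-window evaluation is shown below, assuming a DNN-L-like PyTorch model as sketched earlier. The per-pixel loop and the reflective padding are written for clarity, not efficiency, and are assumptions rather than the authors' implementation.

```python
import numpy as np
import torch

def local_saliency_map(image, model, patch_size=51):
    """Classify every pixel from its surrounding 51x51 patch (clarity over speed).
    image: (H, W, 3) float32 array, already mean-subtracted as in the paper."""
    h, w, _ = image.shape
    r = patch_size // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    saliency = np.zeros((h, w), dtype=np.float32)
    model.eval()
    with torch.no_grad():
        for y in range(h):
            for x in range(w):
                patch = padded[y:y + patch_size, x:x + patch_size]
                inp = torch.from_numpy(patch.copy()).permute(2, 0, 1)[None]
                saliency[y, x] = torch.softmax(model(inp), dim=1)[0, 1].item()
    return saliency                                   # S^L(x, y) in [0, 1]
```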
3.2. Refinement

The local estimation method detects saliency by considering the color, contrast and texture information within a local neighborhood. Thus it may be sensitive to high frequency background noise and fail to maintain spatial consistency. On the other hand, saliency is closely correlated with object-level concepts, i.e., interesting objects easily attract human attention. Based on this observation, we propose to refine the local saliency map by combining low level saliency and high level objectness. To this end, we utilize the geodesic object proposal (GOP) [20] method to extract a set of object segments. The generated object candidates encode informative shape and boundary cues and serve as an over-complete coverage of the objects in an image. Our method searches for a subset of these candidates with high probabilities of being the potential objects according to the local saliency map, and thereby integrates local estimation and generic object proposals as a complementary process.
Figure 5. Top row (left to right): source image, ground truth, local saliency map output by DNN-L, local saliency map after refinement. Bottom row: different object candidate regions with their corresponding accuracy scores A and coverage scores C.

Given an input image, we first generate a set of object candidate masks {O_i}^{N_O} using the GOP method and a saliency map S^L using our local estimation method. To determine the confidence of each segment, we mainly consider two measurements based on the local saliency map, the accuracy score A and the coverage score C, defined by

    A_i = Σ_{x,y} O_i(x, y) × S^L(x, y) / Σ_{x,y} O_i(x, y),    (2)

    C_i = Σ_{x,y} O_i(x, y) × S^L(x, y) / Σ_{x,y} S^L(x, y),    (3)

where O_i(x, y) = 1 indicates that the pixel located at (x, y) of the input image belongs to the i-th object candidate, and O_i(x, y) = 0 otherwise; S^L(x, y) ∈ [0, 1] represents the local saliency value of pixel (x, y).

The accuracy score A_i measures the average local saliency value of the i-th object candidate, whereas the coverage score C_i measures the proportion of the salient area covered by the i-th object candidate. Figure 5 presents an intuitive example for interpreting these two measurements. The yellow candidate region, having a small overlap with the local salient area, is assigned both a low accuracy score and a low coverage score. The red candidate region, covering almost the entire local salient region, has a high coverage score but a low accuracy score. The green candidate region, located inside the local salient region, has a high accuracy score but a low coverage score. Only the optimal blue candidate has a high accuracy score as well as a high coverage score. Based on the above observations, we define the confidence for the i-th candidate by considering both the accuracy score and the coverage score as

    conf^L_i = (1 + β) × A_i × C_i / (β A_i + C_i),    (4)

where we set β = 0.4 to emphasize the impact of the accuracy score on the final confidence. To find a subset of optimal object candidates, we sort all the candidates by their confidences in a descending order. The refined local saliency map is generated by averaging the top K candidate regions (K is set to 20 in all the experiments). Figure 4 shows the local saliency maps before and after refinement.
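The refinement step can be summarized by the following sketch, which scores each proposal with Eqs. (2)-(4) and averages the top-K masks. The small epsilon terms are added here for numerical safety and are not part of the paper.

```python
import numpy as np

def refine_local_map(local_map, proposals, beta=0.4, k=20):
    """Score proposal masks with Eqs. (2)-(4) and average the top-K of them.
    local_map: (H, W) local saliency in [0, 1]; proposals: list of binary masks O_i."""
    total_saliency = local_map.sum() + 1e-8
    scored = []
    for mask in proposals:
        acc = (mask * local_map).sum() / (mask.sum() + 1e-8)        # accuracy A_i, Eq. (2)
        cov = (mask * local_map).sum() / total_saliency             # coverage C_i, Eq. (3)
        conf = (1 + beta) * acc * cov / (beta * acc + cov + 1e-8)   # confidence, Eq. (4)
        scored.append((conf, mask))
    scored.sort(key=lambda t: t[0], reverse=True)
    top = [mask for _, mask in scored[:k]]
    return np.mean(top, axis=0)                                     # refined local saliency map
```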
4. Global Search

Saliency cues such as center and object bias [31, 22], contrast information [38] and background prior [33, 15] have been shown to be effective in previous work. However, these saliency cues are considered independently and combined based on heuristics. For example, the background prior is utilized by treating all pixels within the boundary regions of an image as background, without considering the color statistics of the entire image or the location of the foreground. Instead, we formulate a DNN-based regression method for saliency detection, where various saliency cues are considered simultaneously and their complex dependencies are learned automatically through a supervised learning scheme. For each input image, we first detect local saliency using the proposed local estimation method. A 72-dimensional feature vector is extracted to describe each object candidate generated by the GOP method from a global view. The proposed deep network DNN-G takes the extracted features as inputs and predicts the saliency values of the candidate regions through regression.

4.1. Global Features

The proposed 72-dimensional feature vector covers global contrast features, geometric information, and local saliency measurements of object candidate regions. Global contrast features consist of three components: boundary contrast, image statistic divergence and internal variance, which are computed in the RGB, Lab and HSV color spaces. Given an object candidate region O and using the RGB color space as an example, we compute its RGB histogram h^RGB_O, mean RGB values m^RGB_O, and RGB color variance var^RGB_O over all the pixels within the candidate region. We define the border regions of 15 pixels width in the four directions of the image as boundary regions. Since the boundary regions in different directions may have different appearance, we compute their RGB histograms and mean RGB values separately.

Table 2. Global contrast features of object candidate regions.
Feature     Definition                   Feature     Definition
c1 - c4     χ²(h^RGB_O, h^RGB_B)         c49         χ²(h^RGB_O, h^RGB_I)
c5 - c8     χ²(h^Lab_O, h^Lab_B)         c50         χ²(h^Lab_O, h^Lab_I)
c9 - c12    χ²(h^HSV_O, h^HSV_B)         c51         χ²(h^HSV_O, h^HSV_I)
c13 - c24   d(m^RGB_O, m^RGB_B)          c52 - c54   var^RGB_O
c25 - c36   d(m^HSV_O, m^HSV_B)          c55 - c57   var^Lab_O
c37 - c48   d(m^Lab_O, m^Lab_B)          c58 - c60   var^HSV_O
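For illustration, the following sketch computes the RGB block of Table 2. The histogram bin count is not specified in the paper and is an assumption here, as is the per-channel absolute difference used for d(·, ·) (chosen so that the four boundary regions yield the 12 dimensions c13-c24); the Lab and HSV blocks would be computed analogously.

```python
import numpy as np

def chi2(h1, h2, eps=1e-10):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def rgb_contrast_features(image, mask, border=15, bins=8):
    """Illustrative RGB block of Table 2: boundary contrast (c1-c4, c13-c24),
    divergence from the whole image (c49) and internal variance (c52-c54).
    image: (H, W, 3) array with values in [0, 256); mask: (H, W) binary array."""
    def hist(pixels):                                   # normalized joint RGB histogram
        h, _ = np.histogramdd(pixels.reshape(-1, 3), bins=bins, range=[(0, 256)] * 3)
        return (h / max(h.sum(), 1)).ravel()

    obj = image[mask > 0]                               # pixels of the candidate region O
    borders = [image[:border], image[-border:],         # top and bottom strips
               image[:, :border], image[:, -border:]]   # left and right strips
    h_obj = hist(obj)
    feats = [chi2(h_obj, hist(b)) for b in borders]                    # c1 - c4
    border_means = np.array([b.reshape(-1, 3).mean(axis=0) for b in borders])
    feats += list(np.abs(obj.mean(axis=0) - border_means).ravel())     # c13 - c24
    feats.append(chi2(h_obj, hist(image)))                             # c49
    feats += list(obj.var(axis=0))                                     # c52 - c54
    return np.array(feats)
```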
Table 3. Geometric information and local saliency measurements of object regions.
Geometric Information                     Local Saliency Measurement
g1        Bounding box aspect ratio       s1        Accuracy score A
g2        Bounding box height             s2        Coverage score C
g3        Bounding box width              s3        A × C
g4 - g5   Centroid coordinates            s4        Overlap rate
g6        Major axis length
g7        Minor axis length
g8        Euler number

For representation convenience, we uniformly denote the RGB histograms and mean RGB values of the four boundary regions as h^RGB_B and m^RGB_B, respectively. The RGB histogram of the entire image, h^RGB_I, is also used as an image statistic. The boundary contrast is measured by the chi-square distances χ²(h^RGB_O, h^RGB_B) between the RGB histograms of the candidate and the four boundary regions, and the Euclidean distances d(m^RGB_O, m^RGB_B) between their mean RGB values. The color divergence of the candidate region from the entire image statistic is measured by the chi-square distance χ²(h^RGB_O, h^RGB_I) between the RGB histograms of the candidate region and the entire image. The internal color variance of the candidate region is measured by the RGB color variance var^RGB_O. The global contrast features in the Lab and HSV color spaces are extracted in a similar way. Table 2 summarizes the components of global contrast features.

Geometric information characterizes the spatial distribution of object candidates. We extract the centroid coordinates, the major/minor axis lengths, the Euler number¹ and the shape information of the enclosing bounding box, including its width, height and aspect ratio. All the above features except the Euler number are normalized with respect to the input image size. Table 3 shows the details of the geometric information.

¹The Euler number of an object mask is the total number of objects in the mask minus the total number of holes in those objects.

Local saliency measurements evaluate the saliency value of each candidate region based on the saliency map produced by the local estimation method. Given the refined local saliency map and the object candidate mask, we compute the accuracy score A and the coverage score C using (2)-(3). The overlap rate between the object mask and the local saliency map is also computed (see Table 3 for details).

4.2. Saliency Prediction via DNN-G Regression

The proposed DNN-G consists of six fully connected layers. Each layer carries out a linear transformation followed by ReLUs to accelerate the training process and the dropout operation to avoid overfitting (see Table 1). For each image in the training data set (Section 5.1), around 1200 object regions are generated as training samples using the GOP method. The proposed 72-dimensional global feature vector v is extracted from each candidate region and then preprocessed by subtracting the mean and dividing by the standard deviation of the elements. Given the ground truth saliency map G, a label vector of precision p_i and overlap rate o_i, y_i = [p_i, o_i], is assigned to each object region O_i.

Given the training data set {v_i}^{N_G} and the corresponding label set {y_i}^{N_G}, the network parameters of DNN-G are learned by solving the following optimization problem

    arg min_{θ^G} (1/m) Σ_{i=1}^{m} ||y_i − φ(v_i | θ^G)||²_2 + η Σ_{k=1}^{6} ||W^G_k||²_F,    (5)

where θ^G is the network parameter set; φ(v_i | θ^G) = [φ¹_i, φ²_i] is the output of DNN-G for the i-th training sample; W^G_k is the weight of the k-th layer; and η is the weight decay parameter, which is set to 0.0005. The above optimization problem is solved by using stochastic gradient descent with a batch size m of 1000 and momentum of 0.9. The learning rate is initially set to 0.05 and is decreased by a factor of 0.5 when the cost is stabilized. The training process is repeated for 100 epochs.

At test stage, the network takes the feature vector of the i-th candidate region as an input and predicts its precision and overlap rate by φ(v_i | θ^G). The global confidence score of the candidate region is defined by

    conf^G_i = φ¹_i × φ²_i.    (6)

Denote {Ô_1, ..., Ô_N} as the mask set of all the candidate regions in the input image, sorted by the global confidence scores in a descending order. The corresponding global confidence scores are represented by {conf^G_1, ..., conf^G_N}. The final saliency map is computed by a weighted sum of the top K candidate masks,

    S^G = Σ_{k=1}^{K} conf^G_k × Ô_k / Σ_{k=1}^{K} conf^G_k.    (7)

Although similar in spirit, our global search method is significantly different from [10], [16] and [23] in the following aspects: i) Our method utilizes DNNs to learn the complex dependencies among different visual cues and determines the saliency of a candidate region from a global view, whereas [10] applies a DNN to a bounding box to extract category-specific features. ii) Both [16] and [23] use random forests to predict region saliency based on regional features, where [23] trains the model for each data set. In contrast, we use DNNs for saliency detection and conduct training on one data set (see Section 5.1). iii) Global search is integrated with local estimation in our work, which facilitates more robust saliency detection from both perspectives.
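Below is a sketch of the global search stage under the layer sizes of Table 1 and the stated optimizer settings. PyTorch, the synthetic mini-batch, and the helper function are illustrative stand-ins for the authors' Caffe setup, and the optimizer's weight decay only approximates the η-term of Eq. (5).

```python
import torch
import torch.nn as nn

# DNN-G-like regressor (DNN-G column of Table 1): 72-d features -> [precision, overlap]
dnn_g = nn.Sequential(
    nn.Linear(72, 1024),   nn.ReLU(), nn.Dropout(),
    nn.Linear(1024, 2048), nn.ReLU(), nn.Dropout(),
    nn.Linear(2048, 2048), nn.ReLU(), nn.Dropout(),
    nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 2),
)
optimizer = torch.optim.SGD(dnn_g.parameters(), lr=0.05, momentum=0.9,
                            weight_decay=0.0005)        # plays the role of eta in Eq. (5)
loss_fn = nn.MSELoss()                                   # squared L2 data term of Eq. (5)

# one training step on a synthetic mini-batch of m = 1000 candidate regions
v, y = torch.randn(1000, 72), torch.rand(1000, 2)        # features v_i, labels y_i = [p_i, o_i]
optimizer.zero_grad()
loss_fn(dnn_g(v), y).backward()
optimizer.step()

def global_saliency(masks, feats, k=20):
    """Eq. (6) confidences and the Eq. (7) weighted sum of the top-K masks.
    masks: (N, H, W) float tensor of proposal masks; feats: (N, 72) descriptors."""
    dnn_g.eval()
    with torch.no_grad():
        phi = dnn_g(feats)
    conf = phi[:, 0] * phi[:, 1]                         # conf^G_i, Eq. (6)
    idx = torch.argsort(conf, descending=True)[:k]
    weighted = (conf[idx].view(-1, 1, 1) * masks[idx]).sum(dim=0)
    return weighted / (conf[idx].sum() + 1e-8)           # S^G, Eq. (7)
```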
Figure 6. Distribution of foreground and background regions in different feature spaces, including global contrast features (c1 and c50), geometric information (g4), local saliency measurements (s1 and s2) and the global confidence scores (conf^G) generated by DNN-G.

Table 4. Quantitative results using F-measure and MAE. The best and second best results are shown in red color and blue color.

Data Set     Metric      DRFI   GC     HS     MR     PCA    SVO    UFO    wCtr   CPMC-GBVS  HDCT   LEGS
SOD          F-Measure   0.617  0.433  0.480  0.542  0.498  0.217  0.521  0.567  -          0.511  0.630
SOD          MAE         0.230  0.288  0.301  0.274  0.290  0.414  0.272  0.245  -          0.260  0.205
ECSSD        F-Measure   0.726  0.568  0.631  0.689  0.575  0.237  0.638  0.672  -          0.641  0.775
ECSSD        MAE         0.172  0.218  0.232  0.192  0.252  0.406  0.210  0.178  -          0.204  0.137
PASCAL-S     F-Measure   0.619  0.496  0.536  0.600  0.531  0.266  0.552  0.611  0.654      0.536  0.669
PASCAL-S     MAE         0.195  0.245  0.249  0.219  0.239  0.373  0.227  0.193  0.178      0.226  0.170
MSRA-5000    F-Measure   -      0.704  0.765  0.789  0.707  0.302  0.774  0.788  -          0.773  0.803
MSRA-5000    MAE         -      0.149  0.160  0.130  0.189  0.364  0.145  0.110  -          0.141  0.128
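For reference, a minimal sketch of the two metrics reported in Table 4 (defined in Section 5.1) is given below, assuming saliency maps and ground truth normalized to [0, 1]; capping the adaptive threshold at 1 is an assumption.

```python
import numpy as np

def f_measure_and_mae(saliency, gt, gamma2=0.3):
    """F-measure with the adaptive threshold (twice the mean saliency) and MAE.
    saliency, gt: (H, W) arrays in [0, 1]; gt is (near-)binary."""
    binary = saliency >= min(2 * saliency.mean(), 1.0)      # adaptive threshold
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    f = (1 + gamma2) * precision * recall / max(gamma2 * precision + recall, 1e-8)
    mae = np.abs(saliency - gt).mean()
    return f, mae
```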

5. Experimental Results

5.1. Setup

We evaluate the proposed algorithm on four benchmark data sets: MSRA-5000 [24], SOD [27], ECSSD [40] and PASCAL-S [23]. The MSRA-5000 data set is widely used for saliency detection and covers a large variety of image contents. Most of the images include only one salient object with high contrast to the background. The SOD data set, containing 300 images, is collected from the Berkeley segmentation database. Many images in this data set have multiple salient objects of various sizes and locations. The ECSSD data set contains 1000 images with complex scenes from the Internet and is more challenging. The newly developed PASCAL-S data set is constructed on the validation set of the PASCAL VOC 2012 segmentation challenge. This data set contains 850 natural images with multiple complex objects and cluttered backgrounds. The PASCAL-S data set is arguably one of the most challenging saliency data sets without various design biases (e.g., center bias and color contrast bias). All the data sets contain manually annotated ground truth saliency maps.

Since the MSRA-5000 data set covers various scenarios and the PASCAL-S data set contains images with complex structures, we randomly sample 3000 images from the MSRA-5000 data set and 340 images from the PASCAL-S data set to train the proposed two networks. The remaining images are used for tests. Both horizontal reflection and rescaling (±5%) are applied to all the training images to augment the training data set. The DNNs are implemented using the Caffe framework [14]. The trained models and source code are available at our website².

We evaluate the performance using precision-recall (PR) curves, F-measure and mean absolute error (MAE). The precision and recall of a saliency map are computed by segmenting the salient region with a threshold and comparing the binary map with the ground truth. The PR curves demonstrate the mean precision and recall of saliency maps at different thresholds. The F-measure is defined as F_γ = (1 + γ²) × Precision × Recall / (γ² × Precision + Recall), where Precision and Recall are obtained using twice the mean saliency value of the saliency map as the threshold, and γ² is set to 0.3. The MAE is the average per-pixel difference between saliency maps and the ground truth.

²http://ice.dlut.edu.cn/lu/index.html

5.2. Feature Analysis

Our global search method exploits various saliency cues to describe each object candidate. We present an empirical analysis of the discriminative ability of all the global features based on the distribution of both foreground and background regions in different feature spaces. We generate 500,000 object candidate regions using 510 test images from the PASCAL-S data set. Based on the overlap rate o_i with the ground truth salient region, the i-th candidate region is classified as foreground (o_i > 0.7) or background (o_i < 0.2). The remaining candidate regions (0.2 ≤ o_i ≤ 0.7) are left unused. Figure 6 illustrates the distribution of both foreground and background regions in three types of feature spaces discussed in Section 4.1 and in the global confidence score space generated by DNN-G. More results can be found in the supplementary material.

The distribution plots in Figure 6 show strong overlaps between foreground and background regions in all three types of feature spaces. Foreground and background regions can hardly be separated based on a heuristic combination of these features. Our global search method trains a deep network to learn complex feature dependencies and achieves accurate confidence scores for saliency detection.
Figure 7. Saliency maps. Top, middle and bottom two rows are images from the SOD, ECSSD and PASCAL-S data sets. Columns (left to right): Original, GT, DRFI, GC, HS, MR, PCA, SVO, UFO, wCtr, HDCT, LEGS. GT: ground truth.

Figure 8. PR curves of saliency detection methods on four benchmark data sets: (a) SOD, (b) ECSSD, (c) PASCAL-S, (d) MSRA-5000.

5.3. Performance Comparison

We compare the proposed method (LEGS) with ten state-of-the-art models including SVO [4], PCA [26], DRFI [16], GC [6], HS [40], MR [41], UFO [17], wCtr [42], CPMC-GBVS [23] and HDCT [18]. We use either the implementations or the saliency maps provided by the authors for fair comparison³. Our method performs favorably against the state-of-the-art methods in terms of PR curves (Figure 8), F-measure as well as MAE scores (Table 4) in all three data sets. Figure 7 shows that our method generates more accurate saliency maps in various challenging scenarios. The robust performance of our method can be attributed to the use of DNNs for complex feature and model learning, and the integration of local/global saliency estimation.

³The results of the DRFI method [16] on the MSRA-5000 data set are not reported, since it is also trained on this data set with different training images from ours. The CPMC-GBVS method [23] only provides the saliency maps of the PASCAL-S data set.

6. Conclusions

In this paper, we propose DNNs for saliency detection by combining local estimation and global search. In the local estimation stage, the proposed DNN-L estimates local saliency by learning rich image patch features from local contrast, texture and shape information. In the global search stage, the proposed DNN-G effectively exploits the complex relationships among global saliency cues and predicts the saliency value for each object region. Our method integrates low level saliency and high level objectness through a supervised DNN-based learning scheme. Experimental results on benchmark data sets show that the proposed algorithm can achieve state-of-the-art performance.
Acknowledgements. L. Wang and H. Lu are supported by the Natural Science Foundation of China (NSFC) #61472060 and the Fundamental Research Funds for the Central Universities under Grant DUT14YQ101. M.-H. Yang is supported in part by NSF CAREER Grant #1149783 and NSF IIS Grant #1152576.

References

[1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned salient region detection. In CVPR, pages 1597–1604, 2009.
[2] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. PAMI, 34(11):2189–2202, 2012.
[3] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, pages 3241–3248, 2010.
[4] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai. Fusing generic objectness and visual saliency for salient object detection. In ICCV, pages 914–921, 2011.
[5] M. Cheng, Z. Zhang, W. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, pages 3286–3293, 2014.
[6] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook. Efficient salient region detection with soft image abstraction. In ICCV, pages 1529–1536, 2013.
[7] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu. Global contrast based salient region detection. In CVPR, pages 409–416, 2011.
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
[9] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8):1915–1929, 2013.
[10] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[11] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In NIPS, pages 545–552, 2006.
[12] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, pages 297–312, 2014.
[13] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. PAMI, 20(11):1254–1259, 1998.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint, 2014.
[15] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang. Saliency detection via absorbing Markov chain. In ICCV, pages 1665–1672, 2013.
[16] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li. Salient object detection: A discriminative regional feature integration approach. In CVPR, pages 2083–2090, 2013.
[17] P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by UFO: Uniqueness, focusness and objectness. In ICCV, pages 1976–1983, 2013.
[18] J. Kim, D. Han, Y. Tai, and J. Kim. Salient region detection via high-dimensional color transform. In CVPR, pages 883–890, 2014.
[19] D. A. Klein and S. Frintrop. Center-surround divergence of feature statistics for salient object detection. In ICCV, pages 2214–2219, 2011.
[20] P. Krähenbühl and V. Koltun. Geodesic object proposals. In ECCV, pages 725–739, 2014.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[22] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang. Saliency detection via dense and sparse reconstruction. In ICCV, pages 2976–2983, 2013.
[23] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In CVPR, pages 280–287, 2014.
[24] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect a salient object. PAMI, 33(2):353–367, 2011.
[25] Y.-F. Ma and H.-J. Zhang. Contrast-based image attention analysis by using fuzzy growing. In ACM Multimedia, pages 374–381, 2003.
[26] R. Margolin, A. Tal, and L. Zelnik-Manor. What makes a patch distinct? In CVPR, pages 1139–1146, 2013.
[27] V. Movahedi and J. H. Elder. Design and perceptual validation of performance measures for salient object segmentation. In POCV, pages 49–56, 2010.
[28] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
[29] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung. Saliency filters: Contrast based filtering for salient region detection. In CVPR, pages 733–740, 2012.
[30] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, pages 82–90, 2014.
[31] X. Shen and Y. Wu. A unified approach to salient object detection via low rank matrix recovery. In CVPR, pages 853–860, 2012.
[32] K. Shi, K. Wang, J. Lu, and L. Lin. PISA: Pixelwise image saliency by aggregating complementary appearance contrast measures with spatial priors. In CVPR, pages 2115–2122, 2013.
[33] J. Sun, H. Lu, and S. Li. Saliency detection based on integration of boundary and soft-segmentation. In ICIP, pages 1085–1088, 2012.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint, 2014.
[35] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, pages 2553–2561, 2013.
[36] N. Tong, H. Lu, Y. Zhang, and X. Ruan. Salient object detection via global and local cues. Pattern Recognition, 2014.
[37] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.
[38] Y. Xie and H. Lu. Visual saliency detection based on Bayesian model. In ICIP, pages 645–648, 2011.
[39] Y. Xie, H. Lu, and M.-H. Yang. Bayesian saliency via low and mid level cues. IEEE Transactions on Image Processing, 22(5):1689–1698, 2013.
[40] Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In CVPR, pages 1155–1162, 2013.
[41] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166–3173, 2013.
[42] W. Zhu, S. Liang, Y. Wei, and J. Sun. Saliency optimization from robust background detection. In CVPR, pages 2814–2821, 2014.
