
Semantic Perceptual Image Compression using Deep

Convolution Networks

Aaditya Prakash, Nick Moran, Solomon Garber, Antonella DiLillo and James Storer
Brandeis University
{aprakash,nemtiax,solomongarber,dilant,storer}@brandeis.edu
arXiv:1612.08712v2 [cs.MM] 29 Mar 2017

Abstract
It has long been considered a significant problem to improve the visual quality of lossy image
and video compression. Recent advances in computing power, together with the availability
of large training data sets, have increased interest in applying deep learning cnns
to image recognition and image processing tasks. Here, we present a powerful cnn
tailored to the specific task of semantic image understanding to achieve higher visual quality
in lossy compression. A modest increase in complexity is incorporated into the encoder, which
allows a standard, off-the-shelf jpeg decoder to be used. While jpeg encoding may be
optimized for generic images, the process is ultimately unaware of the specific content of
the image to be compressed. Our technique makes jpeg content-aware by designing and
training a model to identify multiple semantic regions in a given image. Unlike object
detection techniques, our model does not require labeling of object positions and is able to
identify objects in a single pass. We present a new cnn architecture directed specifically at
image compression: by adding a complete set of features for every class and then taking a
threshold over the sum of all feature activations, it generates a map that highlights
semantically-salient regions so that they can be encoded at higher quality than background
regions. Experiments are presented on the Kodak PhotoCD dataset and the MIT Saliency
Benchmark dataset, in which our algorithm achieves higher visual quality for the same
compressed size.1

1 Introduction and Related Work


We employ a Convolutional Neural Network (cnn) tailored to the specific task of
semantic image understanding to achieve higher visual quality in lossy image compres-
sion. We focus on the jpeg standard, which remains the dominant image representa-
tion on the internet and in consumer electronics. Several attempts have been made to
improve upon its lossy image compression, for example WebP [1] and Residual-GRU
[2], but many of these require custom decoders and are not sufficiently content-aware.
We improve the visual quality of standard jpeg by using a higher bit rate to encode
image regions flagged by our model as containing content of interest and lowering the
bit rate elsewhere in the image. With our enhanced jpeg encoder, the quantization of
each region is informed by knowledge of the image content. Human vision naturally
focuses on familiar objects, and is particularly sensitive to distortions of these objects
as compared to distortions of background details [3]. By improving the signal-to-
noise ratio within multiple regions of interest, we improve visual quality of those
regions, while preserving overall PSNR and compression ratio. A second encoding pass
1 Code for the model is available at https://github.com/iamaaditya/image-compression-cnn
Figure 1: Comparison of compression of semantic objects in standard jpeg [left] and our
model [right]

produces a final jpeg encoding that may be decoded with any standard off-the-shelf
jpeg decoder. Measuring visual quality is an ongoing area of research and there is no
consensus among researchers on the proper metric. Yuri et al. [4] showed that PSNR
has severe limitations as an image comparison metric. Richter et al. [5], [6] addressed
structural similarity (SSIM [7] and MS-SSIM [8]) for jpeg and jpeg 2000. We evaluate
on these metrics, as well as VIFP[9], PSNR-HVS[10] and PSNR-HVSM[11], which have
been shown to correlate with subjective visual quality. Figure 1 compares close-up
views of a salient object in a standard jpeg and our new content-aware method.
cnns have been successfully applied to a variety of computer vision tasks [12].
The feature extraction and transfer learning capabilities of cnns are well known
[13], as is their ability to classify images by their most prominent object [14] and
to compute a bounding box [15]. Some success has been obtained in predicting the visual
saliency map of a given image [3],[16]. Previous work has shown that semantic object
detection has a variety of advantages over saliency maps [17], [18]. Semantic detection
recognizes discrete objects and is thus able to generate maps that are more coherent
for human perception. Visual saliency models are based on human eye fixations, and
thus produce results which do not capture object boundaries [16]. This is evident in
the results obtained by Stella et al [19], in which image compression is guided by a
multi-scale saliency map, and the obtained images show blurred edges and soft focus.
We present a cnn designed to locate multiple regions of interest (roi) within a
single image. Our model differs from traditional object detection models like [20],
[15] as these models are restricted to detecting a single salient object in an image.
Our model captures the structure of the depicted scene and thus maintains the integrity of
semantic objects, unlike results produced using human eye fixations [21]. We produce
a single class-invariant feature map by learning separate feature maps for each of a
set of object classes and then summing over the top features. Because this task
does not require precise identification of object boundaries, our system is able to
capture multiple salient regions of an image in a single pass, as opposed to standard
object detection cnns, which require multiple passes over the image to identify and
locate multiple objects. Model training need only be done offline, and encoding with
our model employs a standard jpeg encoder combined with efficient computation
of saliency maps (over 90 images per second using a Titan X Maxwell gpu). A
key advantage of our approach is that its compressed output can be decoded by
any standard off-the-shelf jpeg implementation. This keeps decoding complexity
unchanged, the primary concern for distribution of electronic media.
Section 2 reviews cnn techniques used for object localization, semantic segmen-
tation and class activation maps. We also discuss merits of using our technique over
these methods. Section 3 presents our new model which can generate a map showing
multiple regions of interest. Section 4 shows how we combine this map to make
jpeg semantically aware. Section 5 presents experimental results on a variety of
image datasets and metrics. Section 6 concludes with future areas for research.

2 Review of localization using CNNs


cnns are multi-layered feed-forward architectures where the learned features at
each level are the weights of the convolution filters to be applied to the output of
the previous level. Learning is done via gradient-based optimization [22]. cnns differ
from fully connected neural networks in that the dimensions of the learned convolution
filters are, in general, much smaller than the dimensions of the input image, so the
learned features are forced to be localized in space. Also, the convolution operation
uses the same weight kernel at every image location, so feature detection is spatially
invariant.
Given an image x, and a convolution filter of size n × n, then a convolutional layer
performs the operation shown in equation 1, where W is the learned filter.
$$y_{ij} = \sum_{a=0}^{n} \sum_{b=0}^{n} W_{ab}\, x_{(i+a)(j+b)} \qquad (1)$$
In practice, multiple filters are learned in parallel within each layer, and thus the
output of a convolution layer is a 3-d feature map, where the depth represents the
number of filters. The number of features in a given layer is a design choice, and
may differ from layer to layer. cnns include a max pooling [22] step after every or
every other layer of convolution, in which the height and width of the feature map
(filter response) are reduced by replacing several neighboring activations (coefficients),
generally within a square window, with a single activation equal to the maximum
within that window. This pooling operation is strided, but the size of the pooling
window can be greater than the stride, so windows can overlap. This results in down-
sampling of input data, and filters applied to such a map will have a larger receptive
field (spatial support in the pixel space) for a given kernel size, thus reducing the
number of parameters of the cnn model and allowing the training of much deeper
networks. This does not change the depth of the feature map, but only its width
and height. In practice, pooling windows are typically of size 2 × 2 or 4 × 4, with a
stride of two, which reduces the number of activations by 75%. cnns apply some form
of non-linear operation, such as the sigmoid $(1 + e^{-x})^{-1}$ or the linear rectifier $\max(0, x)$,
to the output of each convolution operation.
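For illustration, the following is a minimal numpy sketch (not the network used in this work) of a single convolution with a learned 3 × 3 filter, a rectifier non-linearity, and 2 × 2 max pooling with stride two; the image and filter values are random stand-ins:

import numpy as np

def conv2d(x, W):
    """Valid convolution of a single-channel image x with an n x n filter W (equation 1)."""
    n = W.shape[0]
    H, Wd = x.shape
    out = np.zeros((H - n + 1, Wd - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * x[i:i + n, j:j + n])
    return out

def relu(x):
    # Linear rectifier max(0, x)
    return np.maximum(0.0, x)

def max_pool(x, size=2, stride=2):
    """Replace each size x size window with its maximum activation."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

image = np.random.rand(32, 32)        # toy single-channel input
kernel = np.random.randn(3, 3)        # one learned 3x3 filter
feature_map = max_pool(relu(conv2d(image, kernel)))   # 30x30 response pooled to 15x15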
Region-based cnns use a moving window to maximize the posterior of the pres-
ence of an object [23]. Faster rcnns [24] have been proposed, but they are still
computationally expensive and are limited to determining the presence or absence of
a single class of object within the entire image. Moving-window methods are able to
produce rectangular bounding boxes, but cannot produce object silhouettes. In con-
trast, recent deep learning models proposed for semantic segmentation [25], [15], [26]
are very good at drawing a close border around the objects. However, these methods
do not scale well to more than a small number of object categories (e.g. 20) [27].
Segmentation methods typically seek to produce a hard boundary for the object, in
which every pixel is labeled as either part of the object or part of the background. In
contrast, class activation mapping produces a fuzzy boundary and is therefore able
to capture pixels which are not part of any particular object, but are still salient to
some spatial interaction between objects. Segmentation techniques are also currently
limited by the requirement for strongly-labeled data for training. Obtaining training
data where the locations of all the objects in the images are tagged is expensive and
not scalable [27]. Our approach only requires image-level labels of object classes,
without pixel-level annotation or bounding-box localization.
In a traditional cnn, there are two fully-connected (non-convolutional) layers as
the final layers of the network. The final layer has one neuron for every class in the
training data, and the final step in the inference is to normalize the activations of the
last layer to sum to one. The second to last layer, however, is fully connected to the
last convolution layer, and a non-linearity is applied to its activations. The authors
of [28], [29] modify this second to last layer to allow for class localization. In their
architecture, the second to last layer is not learned, but consists of one neuron for
each feature map, which has fixed equally-weighted connections to each activation of
its corresponding map. No non-linearity is applied to the outputs of these neurons, so
that each activation in this layer represents the global spatial average of one feature
map from the previous layer. Since the output of this layer is connected directly to
the classification layer, each class will in essence learn a weight for each feature map
from the final convolution layer. Thus, given an image and a class, the classification
weights for that class can be used to re-weight the layers of activations of the final
convolution layer on that image. These activations can be collapsed along the feature
axis to create a class activation map, spatially localizing the best evidence for that
class within that image.
Figure 2 (c) shows an example of such a map, the equation of which is given by
$$M_c(x, y) = \sum_{d \in D} w_d^c \, f_d(x, y) \qquad (2)$$

where $w_d^c$ is the learned weight of class c for feature map d. Training for cam minimizes
the cross entropy between objects’ true probability distribution over classes (all mass
given to the true class) and the predicted distribution, which is obtained as

$$P(c) = \frac{\exp\left(\sum_{x,y} M_c(x, y)\right)}{\sum_{c'} \exp\left(\sum_{x,y} M_{c'}(x, y)\right)} \qquad (3)$$

Since cams are trained to maximize posterior probability for the class, they tend
to only highlight a single most prominent object. This makes them useful for studying
the output of cnns, but not well suited to more general semantic saliency, as real
world images typically contain multiple objects of interest.
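As a concrete illustration of equations 2 and 3, the following numpy sketch assembles a class activation map from the final convolution layer and converts the spatially-summed maps into class probabilities; the dimensions, feature activations, and classifier weights are hypothetical stand-ins rather than values from a trained network:

import numpy as np

D, H, W, C = 512, 14, 14, 256            # feature maps, spatial size, classes (illustrative)
f = np.random.rand(D, H, W)              # activations of the final convolution layer
w = np.random.randn(C, D)                # classification weights, one per (class, feature map)

def class_activation_map(f, w, c):
    """M_c(x, y) = sum_d w_d^c * f_d(x, y)  -- equation 2."""
    return np.tensordot(w[c], f, axes=([0], [0]))    # shape (H, W)

def class_probabilities(f, w):
    """Softmax over the spatially-summed class maps -- equation 3."""
    scores = np.array([class_activation_map(f, w, c).sum() for c in range(w.shape[0])])
    scores -= scores.max()                           # numerical stability
    e = np.exp(scores)
    return e / e.sum()

cam = class_activation_map(f, w, c=0)    # localization evidence for class 0
p = class_probabilities(f, w)            # predicted distribution over classes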

3 Multi-Structure Region of Interest


We have developed a variant of cam which balances the activation for multiple
objects and thus does not suffer from the issues of global average pooling. Our
method, Multi-Structure Region of Interest (MS-ROI), allows us to effectively train on
localization tasks independent of the number of classes. For the purposes of semantic
compression, obtaining a tight bound on the objects is not important. However,
identifying and approximately locating all the objects is critical. We propose a set of
3D feature maps in which each feature map is learned for an individual class, and is
learned independently of the maps for other classes. For L layers, where each layer l
contains $d_l$ features, an image of size n × n, and C classes, this results in a total
activation size of

$$\sum_{l \in L} d_l \times C \times \frac{n}{k^l} \times \frac{n}{k^l}$$
where k is the max pooling stride size. This is computationally very expensive, and
not practical for real world data. cnns designed for full-scale color images have many
filters per layer and are several layers deep. For such networks, learning a model with
that many parameters would be unfeasible in terms of computational requirements.
We propose two techniques to make this idea feasible for large networks: (i) reduce the
number of classes and increase the inter-class variance by combining similar classes,
and (ii) share feature maps across classes to jointly learn lower level features.
Most cnn models are built for classification on the Large Scale Visual Recognition
Challenge, commonly known as ImageNet. ImageNet has one thousand classes and
many of the classes are fine-grained delineations of various types of animals, flowers
and other objects. We significantly reduce the number of classes by collapsing these
sets of similar classes to a single, more general class. This is desirable because, for
the purpose of selecting a class invariant ‘region of interest,’ we do not care about
the differences between Siberian husky and Eskimo dog or between Lace-flower and
Tuberose. As long as objects of these combined classes have similar structure and are
within the same general category, the map produced will be almost identical. Details
of the combined classes used in our model are provided in the Experimental Results
section.
It is self-evident that most images contain only a few classes and thus it is compu-
tationally inefficient to build a separate feature map for every class. More importantly,
many classes have similar lower-level features, even when the number of classes is rel-
atively small. The first few layers of a cnn learn filters which recognize small edges
and motifs [13], which are found across a wide variety of object classes. Therefore,
we propose parameter sharing across the feature maps for different classes. This re-
duces the number of parameters and also allows for the joint learning of these shared,
low-level features.
Although we do not restrict ourselves to a single most-probable class, it is desirable
to eliminate the effects of activations for classes which are not present in the image.
In order to accomplish this, we propose a thresholding operation which discards those
classes whose learned features do not have a sufficiently large total activation when
summed across all features and across the entire image. Let $Z_l^c$ denote the total sum
of the activations of layer l for all feature maps for a given class c. Since our feature
map is a 4-dimensional tensor, $Z_l^c$ can be obtained by summation of this tensor over
the three non-class dimensions:

$$Z_l^c = \sum_{d \in D} \sum_{x,y} f_d^c(x, y) \qquad (4)$$

Figure 2: Comparison of various methods of detecting objects in an image: (a) Original, (b) Human Fixation, (c) CAM, (d) MS-ROI.

Next, we use $Z_l^c$ to filter the classes. Computation of the multi-structure region
of interest is shown below:

$$\widehat{M}(x, y) = \sum_{c \in C} \begin{cases} \sum_{d} f_d^c(x, y) & \text{if } Z_l^c > T \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

We use the symbol $\widehat{M}$ to denote the multi-structure map generated by our proposed
model, in order to contrast it with the map M generated using standard cam
models. $\widehat{M}$ is a sum over all classes whose total activation $Z_l^c$ exceeds a threshold
value T. T is determined during training or chosen as a hyper-parameter for
learning. In practice, it is sufficient to sort the $Z_l^c$, pick the top five classes, and
combine them via a sum weighted by their rank. It should be noted that, because $\widehat{M}$
is no longer a reflection of the class of the image, we use the term ‘region of interest’.
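A minimal numpy sketch of equations 4 and 5, together with the practical top-five rank-weighted variant described above; the per-class feature maps, their dimensions, and the threshold are hypothetical stand-ins:

import numpy as np

C, D, H, W = 20, 64, 14, 14
f = np.random.rand(C, D, H, W)            # per-class feature maps from layer l (stand-ins)

Z = f.sum(axis=(1, 2, 3))                 # equation 4: total activation per class

def ms_roi_threshold(f, Z, T):
    """Equation 5: sum the class maps whose total activation exceeds T."""
    M = np.zeros(f.shape[2:])
    for c in range(f.shape[0]):
        if Z[c] > T:
            M += f[c].sum(axis=0)         # collapse the feature axis for class c
    return M

def ms_roi_top_k(f, Z, k=5):
    """Practical variant: keep the top-k classes and weight them by rank."""
    top = np.argsort(Z)[::-1][:k]
    weights = np.arange(k, 0, -1) / float(k)          # rank weights k/k, ..., 1/k
    return sum(w * f[c].sum(axis=0) for w, c in zip(weights, top))

M_hat = ms_roi_top_k(f, Z)                # multi-structure region-of-interest map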
A comparison of our model (MS-ROI) with cam and human fixation is shown in
Figure 2. Only our model identifies the face of the boy on the right as well as the hands
of both children at the bottom. When doing compression, it is important that we do
not lower the quality of body extremities or other objects which other models may
not identify as critical to the primary object class of the image. If a human annotator
were to paint the areas which should be compressed at better quality, we believe the
annotated area would be closer to that captured by our model.
To train the model to maximize the detection of all objects, instead of using a
softmax function as in equation 3, we use a sigmoid, which does not marginalize the
posterior over the classes. Thus the likelihood of a class c is given by equation 6:

$$P(c) = \frac{1}{1 + \exp(-Z_l^c)} \qquad (6)$$
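A short sketch of this per-class sigmoid likelihood; the accompanying multi-label cross-entropy objective is one plausible training loss under this formulation, not necessarily the exact loss used here:

import numpy as np

def class_likelihoods(Z):
    """Independent sigmoid likelihood per class (equation 6); no softmax normalization."""
    return 1.0 / (1.0 + np.exp(-np.asarray(Z)))

def multilabel_loss(Z, targets):
    """Binary cross-entropy summed over classes -- an assumed objective consistent
    with the sigmoid formulation, not necessarily the paper's exact loss."""
    p = class_likelihoods(Z)
    eps = 1e-12
    return -np.sum(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))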

4 Integrating MS-ROI map with JPEG


We obtain from MS-ROI a saliency value for each pixel in the range [0, 1], where
1 indicates maximum saliency. We then discretize these saliency values into k levels,
where k is a tune-able hyper-parameter. The lowest level contains pixels of saliency
[0, 1/k], the second level contains pixels of saliency (1/k, 2/k], and so forth. We next
select a range of jpeg quality levels, $Q_l$ to $Q_h$. Each saliency level will be compressed
using a Q value drawn from this range, corresponding to that level. In other words,
saliency level n, with saliency range (n/k, (n + 1)/k], will be compressed using

$$Q_n = Q_l + \frac{n \, (Q_h - Q_l)}{k} \qquad (7)$$
For each saliency level n, we obtain a decoded jpeg of the image after encoding at
quality level $Q_n$. For each 8 × 8 block of our output image, we select the block of color
values obtained from the jpeg corresponding to that block’s saliency level. This mosaic
of blocks is finally compressed using a standard jpeg encoder with the desired output
quality to produce a file which can be decoded by any off-the-shelf jpeg decoder.
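The encoding pipeline of this section can be sketched as follows using Pillow; the function name, the block-level saliency aggregation, and the default parameters are illustrative assumptions rather than the released implementation (the quality-to-level mapping follows equation 7):

import io
import numpy as np
from PIL import Image

def semantic_jpeg(image, saliency, k=5, q_low=30, q_high=70, q_out=55):
    """image: PIL RGB image; saliency: HxW array in [0, 1] from MS-ROI."""
    w, h = image.size
    # One decoded JPEG per saliency level, at quality Q_n = Q_l + n*(Q_h - Q_l)/k.
    decoded = []
    for n in range(k):
        q = int(q_low + n * (q_high - q_low) / k)
        buf = io.BytesIO()
        image.save(buf, format="JPEG", quality=q)
        decoded.append(np.array(Image.open(io.BytesIO(buf.getvalue()))))
    # Assign each 8x8 block the pixels from the JPEG matching its saliency level.
    levels = np.clip((saliency * k).astype(int), 0, k - 1)
    out = np.array(image)
    for y in range(0, h, 8):
        for x in range(0, w, 8):
            lvl = int(np.round(levels[y:y + 8, x:x + 8].mean()))
            out[y:y + 8, x:x + 8] = decoded[lvl][y:y + 8, x:x + 8]
    # Re-encode the mosaic with a standard JPEG encoder at the target quality.
    result = Image.fromarray(out)
    result.save("output.jpg", format="JPEG", quality=q_out)
    return result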
Details of our choices for k, $Q_l$ and $Q_h$, as well as target image sizes, are provided
in the next section. A wider range of $Q_l$ and $Q_h$ will tend to produce stronger results,
but at the expense of very poor quality in non-salient regions.

5 Experimental Results
We trained our model with the Caltech-256 dataset [30], which contains 256 classes
of man-made and natural objects, common plants and animals, buildings, etc. We
believe this offers a good balance: it covers more classes than CIFAR-100, which contains
only 100 classes, while avoiding the overly fine-grained classes of ImageNet, which has
1000 classes [31]. For the results reported here, we experimented
with several stacked layers of convolution, as shown in the diagram below:

IMAGE ↦ [CONV → RELU → MAXPOOL]×5 ↦ [MS-ROI]×2 ↦ MAP

MS-ROI refers to the operation shown in equation 5. To obtain the final image
we discretize the heat-map into five levels and use jpeg quality levels Q in increments
of ten from $Q_l = 30$ to $Q_h = 70$. For all experiments, the file size of the standard jpeg
image and the jpeg obtained from our model were kept within ±1% of each other.
On average, salient regions were compressed at Qf = 65, and non-salient regions were
compressed at Q = 45. The overall Q for the final image generated using our model
was Q = 55, whereas for all standard jpeg samples, Q was chosen to be 50.
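An illustrative PyTorch-style sketch of such a stack of convolution blocks followed by per-class maps; the channel widths and the number of collapsed classes are placeholders, not the exact configuration used in our experiments:

import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One CONV -> RELU -> MAXPOOL stage, as in the diagram above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

num_classes = 60   # hypothetical number of collapsed classes

model = nn.Sequential(
    conv_block(3, 64),          # channel widths are placeholders
    conv_block(64, 128),
    conv_block(128, 256),
    conv_block(256, 256),
    conv_block(256, 512),
    nn.Conv2d(512, num_classes, kernel_size=1),   # per-class maps from which MS-ROI is taken
)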
We tested on the Kodak PhotoCD set (24 images) and the MIT dataset (2,000
images). Kodak is a well known dataset consisting primarily of natural outdoor color
images. Figure 3 shows a sample of five of these images, along with the correspond-
ing heatmaps generated by our algorithm; the first four show typical results which
strongly capture the salient content of the images, while the fifth is a rare case of
partial failure, in which the heatmap does not fully capture all salient regions.

Figure 3: Sample of our map for five KODAK images.

Figure 4: PSNR-HVS difference between our model and standard jpeg across various image sizes, ranging from 87×84 to 8705×8400 (higher is better).

The MIT set allows us to compare results across twenty categories. In Table 1 we only
report averaged results across ‘Outdoor Man-made’ and ‘Outdoor Natural’ categories
(200 images), as these categories are likely to contain multiple semantic objects, and
are therefore appropriate for our method. Both datasets contain images of smaller
resolutions, but the effectiveness of perceptual compression is more pronounced for
larger images. Therefore, we additionally selected a very large image of resolution
8705 × 8400, which we scale to a range of sizes to demonstrate the effectiveness of
our system at a variety of resolutions. See Figure 4 for the image sizes used in this
experiment. Both Figure 4 and Figure 5 show the PSNR-HVS difference between our
model and standard jpeg. Positive values indicate our model has higher performance
compared to standard jpeg. In addition to an array of standard quality metrics, we
also report a PSNR value calculated only for those regions our method has identi-
fied as salient, which we term PSNR-S. By examining only regions of high semantic
saliency, this metric demonstrates that our compression method is indeed able to
preserve visual quality in targeted regions, without sacrificing performance on tradi-
tional image-level quality metrics or compression ratio. It should be noted that the
validity of this metric is dependent on the correctness of the underlying saliency map,
and thus should only be interpreted to demonstrate the success of the final image
construction in preserving details highlighted by that map.
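For reference, PSNR-S can be sketched as a standard PSNR computation restricted to pixels marked salient by the MS-ROI map; the binarization threshold used here is an assumption:

import numpy as np

def psnr_s(original, compressed, saliency, threshold=0.5, peak=255.0):
    """PSNR computed only over pixels whose saliency exceeds the threshold."""
    mask = saliency > threshold
    diff = original[mask].astype(np.float64) - compressed[mask].astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)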
The results in Table 1 show the success of our method in maintaining or improving
performance on traditional image quality metrics.
            PSNR-S  PSNR   PSNR-HVS  PSNR-HVSM  SSIM   MS-SSIM  VIFP
Kodak PhotoCD [24 images]
Std jpeg    33.91   34.70  34.92     42.19      0.969  0.991    0.626
Our model   39.16   34.82  35.05     42.33      0.969  0.991    0.629
MIT Saliency Benchmark [Outdoor Man-made + Natural, 200 images]
Std jpeg    36.9    31.84  35.91     45.37      0.893  0.982    0.521
Our model   40.8    32.16  36.32     45.62      0.917  0.990    0.529
Re-sized versions of a very large image, see Figure 4 [20 images]
Std jpeg    35.4    27.46  33.12     43.26      0.912  0.988    0.494
Our model   39.6    28.67  34.63     44.89      0.915  0.991    0.522

Table 1: Results across datasets

Figure 5: PSNR-HVS difference between our model and standard jpeg across the categories of the MIT Saliency dataset (higher is better).

Further, given the efficacy of our method in identifying multiple regions of interest,
the PSNR-S measurements demonstrate the power of our method to produce superior
visual quality in subjectively important regions.
Figure 5 shows the performance of our model across all categories of the MIT
dataset. Performance was strongest in categories like ‘Outdoor Natural’, ‘Outdoor
Man Made’, ‘Action’ and ‘Object’, while categories like ‘Line Drawing’, ‘Fractal’
and ‘Low Resolution’ showed the least improvement. Not surprisingly, the category
‘Pattern’, which lacks semantic objects, is the only category where our model did not
improve upon standard jpeg. Figure 4 shows results on the same image scaled to
different sizes. Because our model benefits from the scale-invariance of cnns, we are
able to preserve performance across a wide range of input sizes.

6 Conclusion and Future research


We have presented a model which can learn to detect multiple objects at any scale
and generate a map of multiple semantically salient image regions. This provides suf-
ficient information to perform variable-quality image compression, without providing
a precise semantic segmentation. Unlike region-based models, our model does not
have to iterate over many windows. We sacrifice exact localization for the ability to
detect multiple salient objects. Our variable compression improves upon visual qual-
ity without sacrificing compression ratio. Encoding requires a single inference over
the pre-trained model, the cost of which is reasonable when performed using a gpu,
along with a standard jpeg encoder. The cost of decoding, which employs a standard,
off-the-shelf jpeg decoder, remains unchanged. We believe it will be possible
to incorporate our approach into other lossy compression methods such as jpeg 2000
and vector quantization, a subject of future work. Improvements to the power of our
underlying cnn, addressing evolving visual quality metrics, and other applications
such as video compression, are also potential areas of future work.

References

[1] G. Ginesu, M. Pintus, and D. D. Giusto, “Objective assessment of the WebP image
coding algorithm,” Signal Processing: Image Communication, vol. 27, no. 8, pp. 867–
874, 2012.
[2] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell,
“Full resolution image compression with recurrent neural networks,” arXiv preprint
arXiv:1608.05148, 2016.
[3] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “SALICON: Saliency in context,” in 2015
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1072–
1080, IEEE, 2015.
[4] L. Kerofsky, R. Vanam, and Y. Reznik, “Perceptual adaptation of objective video qual-
ity metrics,” in Proc. Ninth International Workshop on Video Processing and Quality
Metrics (VPQM), 2015.
[5] T. Richter and K. J. Kim, “A MS-SSIM optimal JPEG 2000 encoder,” in 2009 Data
Compression Conference, pp. 401–410, IEEE, 2009.
[6] T. Richter, “SSIM as global quality metric: a differential geometry view,” in Quality
of Multimedia Experience (QoMEX), 2011 Third International Workshop on, pp. 189–
194, IEEE, 2011.
[7] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment:
from error visibility to structural similarity,” IEEE transactions on image processing,
vol. 13, no. 4, pp. 600–612, 2004.
[8] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image
quality assessment,” in Signals, Systems and Computers, 2004. Conference Record of
the Thirty-Seventh Asilomar Conference on, vol. 2, pp. 1398–1402, IEEE, 2003.
[9] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Trans-
actions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006.
[10] K. Egiazarian, J. Astola, N. Ponomarenko, V. Lukin, F. Battisti, and M. Carli, “New
full-reference quality metrics based on HVS,” in CD-ROM proceedings of the second
international workshop on video processing and quality metrics, Scottsdale, USA, vol. 4,
2006.
[11] N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V. Lukin, “On
between-coefficient contrast masking of DCT basis functions,” in Proceedings of the third
international workshop on video processing and quality metrics, vol. 4, 2007.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep
convolutional neural networks,” in Advances in neural information processing systems,
pp. 1097–1105, 2012.
[13] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,”
in European Conference on Computer Vision, pp. 818–833, Springer, 2014.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2016.
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate
object detection and semantic segmentation,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 580–587, 2014.
[16] M. Kümmerer, L. Theis, and M. Bethge, “Deep Gaze I: Boosting saliency prediction
with feature maps trained on ImageNet,” arXiv preprint arXiv:1411.1045, 2014.
[17] V. Mnih, N. Heess, A. Graves, et al., “Recurrent models of visual attention,” in Ad-
vances in Neural Information Processing Systems, pp. 2204–2212, 2014.
[18] F. Zünd, Y. Pritch, A. Sorkine-Hornung, S. Mangold, and T. Gross, “Content-aware
compression using saliency-driven image retargeting,” in 2013 IEEE International Con-
ference on Image Processing, pp. 1845–1849, IEEE, 2013.
[19] X. Y. Stella and D. A. Lisin, “Image compression based on visual saliency at individual
scales,” in International Symposium on Visual Computing, pp. 157–166, Springer, 2009.
[20] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully
convolutional networks,” arXiv preprint arXiv:1605.06409, 2016.
[21] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu, “Predicting eye fixations using convo-
lutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 362–370, 2015.
[22] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,”
The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
[23] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on
Computer Vision, pp. 1440–1448, 2015.
[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object
detection with region proposal networks,” in Advances in neural information processing
systems, pp. 91–99, 2015.
[25] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic seg-
mentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3431–3440, 2015.
[26] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and
P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings
of the IEEE International Conference on Computer Vision, pp. 1529–1537, 2015.
[27] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL
visual object classes (VOC) challenge,” International journal of computer vision, vol. 88,
no. 2, pp. 303–338, 2010.
[28] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free?-weakly-
supervised learning with convolutional neural networks,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 685–694, 2015.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features
for discriminative localization,” in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2016.
[30] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” 2007.
[31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-
scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
