Paper 3
Semantic Perceptual Image Compression using Deep Convolution Networks
Aaditya Prakash, Nick Moran, Solomon Garber, Antonella DiLillo and James Storer
Brandeis University
{aprakash,nemtiax,solomongarber,dilant,storer}@brandeis.edu
Abstract
It has long been considered a significant problem to improve the visual quality of lossy image
and video compression. Recent advances in computing power, together with the availability
of large training data sets, have increased interest in applying deep convolutional neural
networks (cnns) to image recognition and image processing tasks. Here, we present a powerful
cnn tailored to the specific task of semantic image understanding to achieve higher visual
quality in lossy compression. A modest increase in complexity is incorporated into the encoder,
which allows a standard, off-the-shelf jpeg decoder to be used. While jpeg encoding may be
optimized for generic images, the process is ultimately unaware of the specific content of
the image to be compressed. Our technique makes jpeg content-aware by designing and
training a model to identify multiple semantic regions in a given image. Unlike object
detection techniques, our model does not require labeling of object positions and is able to
identify objects in a single pass. We present a new cnn architecture directed specifically to
image compression: by adding a complete set of features for every class, and then taking a
threshold over the sum of all feature activations, it generates a map that highlights
semantically-salient regions so that they can be encoded at higher quality than background
regions. Experiments are presented on the Kodak PhotoCD dataset and the MIT Saliency
Benchmark dataset, in which our algorithm achieves higher visual quality for the same
compressed size.
Our method produces a final jpeg encoding that may be decoded with any standard off-the-shelf
jpeg decoder. Measuring visual quality is an ongoing area of research, and there is no
consensus among researchers on the proper metric. Kerofsky et al. [4] showed that PSNR
has severe limitations as an image comparison metric. Richter et al. [5, 6] addressed
structural similarity (SSIM [7] and MS-SSIM [8]) for jpeg and jpeg 2000. We evaluate
on these metrics, as well as VIFP[9], PSNR-HVS[10] and PSNR-HVSM[11], which have
been shown to correlate with subjective visual quality. Figure 1 compares close-up
views of a salient object in a standard jpeg and our new content-aware method.
cnns have been successfully applied to a variety of computer vision tasks [12].
The feature extraction and transfer learning capabilities of cnns are well known
[13], as is their ability to classify images by their most prominent object [14] and to
compute a bounding box [15]. Some success has been obtained in predicting the visual
saliency map of a given image [3],[16]. Previous work has shown that semantic object
detection has a variety of advantages over saliency maps [17], [18]. Semantic detection
recognizes discrete objects and is thus able to generate maps that are more coherent
for human perception. Visual saliency models are based on human eye fixations, and
thus produce results which do not capture object boundaries [16]. This is evident in
the results obtained by Stella et al. [19], in which image compression is guided by a
multi-scale saliency map and the resulting images show blurred edges and soft focus.
We present a cnn designed to locate multiple regions of interest (roi) within a
single image. Our model differs from traditional object detection models [20],
[15], which are restricted to detecting a single salient object in an image.
Our model captures the structure of the depicted scene and thus maintains the integrity of
semantic objects, unlike results produced using human eye fixations [21]. We produce
a single class-invariant feature map by learning separate feature maps for each of a
set of object classes and then summing over the top features. Because this task
does not require precise identification of object boundaries, our system is able to
capture multiple salient regions of an image in a single pass, as opposed to standard
object detection cnns, which require multiple passes over the image to identify and
locate multiple objects. Model training need only be done offline, and encoding with
our model employs a standard jpeg encoder combined with efficient computation
of saliency maps (over 90 images per second using a Titan X Maxwell gpu). A
key advantage of our approach is that its compressed output can be decoded by
any standard off-the-shelf jpeg implementation. This maintains the existing
decoding complexity, which is the primary concern for the distribution of electronic media.
Section 2 reviews cnn techniques used for object localization, semantic segmentation
and class activation maps, and discusses the merits of our technique over these methods.
Section 3 presents our new model, which can generate a map showing multiple regions of
interest. Section 4 shows how we combine this map with jpeg encoding to make it
semantically aware. Section 5 presents experimental results on a variety of image
datasets and metrics. Section 6 concludes with future areas for research.
where $w_d^c$ is the learned weight of class $c$ for feature map $d$. Training for cam minimizes
the cross entropy between the object's true probability distribution over classes (all mass
given to the true class) and the predicted distribution, which is obtained as

$$P(c) = \frac{\exp\!\left(\sum_{x,y} M_c(x, y)\right)}{\sum_{c'} \exp\!\left(\sum_{x,y} M_{c'}(x, y)\right)} \qquad (3)$$
Since cams are trained to maximize posterior probability for the class, they tend
to only highlight a single most prominent object. This makes them useful for studying
the output of cnns, but not well suited to more general semantic saliency, as real
world images typically contain multiple objects of interest.
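For concreteness, equation 3 can be computed from a stack of per-class activation maps with a few lines of numpy; the sketch below is illustrative, and the array names are ours:

```python
import numpy as np

def cam_class_posterior(class_maps):
    """class_maps: array of shape (C, H, W), one activation map M_c per class.

    Returns the predicted distribution P(c) of equation 3: a softmax over the
    spatial sums of the per-class maps."""
    scores = class_maps.sum(axis=(1, 2))      # sum_{x,y} M_c(x, y) for each class
    scores -= scores.max()                    # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()      # normalize over classes

# Example: 5 classes on a 14x14 activation grid
probs = cam_class_posterior(np.random.rand(5, 14, 14))
```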
summed across all features and across the entire image. Let $Z_\ell^c$ denote the total sum
of the activations of layer $\ell$ for all feature maps for a given class $c$. Since our feature
map is a 4-dimensional tensor, $Z_\ell^c$ can be obtained by summation of this tensor over
the three non-class dimensions:

$$Z_\ell^c = \sum_{d \in D} \sum_{x, y} f_d^c(x, y) \qquad (4)$$
We use the symbol $M^c$ to denote the multi-structure map generated by our proposed
model, in order to contrast it with the map generated using standard cam models, $M$.
$M^c$ is a sum over all classes whose total activations $Z_\ell^c$ exceed a threshold
value $T$, where $T$ is determined during training or chosen as a hyper-parameter.
In practice, it is sufficient to sort the values $Z_\ell^c$, pick the top five classes, and
combine them via a sum weighted by their rank. It should be noted that, because $M^c$
is no longer a reflection of the class of the image, we use the term ‘region of interest’.
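The construction of $M^c$ can be illustrated with the following numpy sketch; the tensor shapes and the linear rank weights are our own assumptions, since the text above only states that the top five classes are combined via a rank-weighted sum:

```python
import numpy as np

def ms_roi_map(features, top_k=5):
    """features: array of shape (C, D, H, W) holding f_d^c(x, y), i.e. D feature
    maps per class for C classes at a given layer.

    Returns a single (H, W) region-of-interest map."""
    # Z_l^c: sum over the three non-class dimensions (equation 4)
    z = features.sum(axis=(1, 2, 3))                       # shape (C,)
    # Per-class spatial maps, summed over the feature-map dimension
    class_maps = features.sum(axis=1)                      # shape (C, H, W)
    # Keep the top-k classes by Z_l^c and weight them by rank
    # (linear weights k, k-1, ..., 1 are an assumption; the paper only says
    # "a sum weighted by their rank")
    top = np.argsort(z)[::-1][:top_k]
    weights = np.arange(top_k, 0, -1, dtype=float)
    weights /= weights.sum()
    return np.tensordot(weights, class_maps[top], axes=1)  # weighted sum of maps

roi = ms_roi_map(np.random.rand(256, 4, 28, 28))           # e.g. 256 classes, D = 4
```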
A comparison of our model (MS-ROI) with cam and human fixation is shown in
Figure 2. Only our model identifies the face of the boy on the right as well as the hands
of both children at the bottom. When doing compression, it is important that we do
not lower the quality of body extremities or other objects which other models may
not identify as critical to the primary object class of the image. If a human annotator
were to paint the areas which should be compressed at higher quality, we believe the
annotated area would be closer to that captured by our model.
To train the model to maximize the detection of all objects, instead of using a
softmax function as in equation 3, we use a sigmoid, which does not marginalize the
posterior over the classes. Thus the likelihood of a class $c$ is given by equation 6:

$$P(c) = \frac{1}{1 + \exp(-Z_\ell^c)} \qquad (6)$$
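For comparison with the softmax sketch above, the sigmoid likelihood of equation 6 and one plausible multi-label training loss (the exact loss is not spelled out here, so the binary cross-entropy below is an assumption) look as follows:

```python
import numpy as np

def class_likelihoods(z):
    """z: array of shape (C,) holding Z_l^c for each class.

    Returns independent per-class likelihoods P(c) as in equation 6."""
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_loss(z, targets):
    """Binary cross-entropy summed over classes. targets is a 0/1 vector marking
    the classes present in the image; unlike the softmax of equation 3, each class
    is scored independently, so several classes can receive high likelihood."""
    p = np.clip(class_likelihoods(z), 1e-7, 1.0 - 1e-7)
    return -np.sum(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))
```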
$$Q_n = Q_l + \frac{n \, (Q_h - Q_l)}{k} \qquad (7)$$
For each level $l \le n \le h$, we obtain a decoded jpeg of the image after encoding at
quality level $Q_n$. For each 8 × 8 block of our output image, we select the block of color
values obtained by the jpeg corresponding to that block's saliency level. This mosaic
of blocks is finally compressed using a standard jpeg encoder with the desired output
quality to produce a file which can be decoded by any off-the-shelf jpeg decoder.
Details of our choices for $k$, $Q_l$ and $Q_h$, as well as target image sizes, are provided
in the next section. A wider range of $Q_l$ and $Q_h$ will tend to produce stronger results,
but at the expense of very poor quality in non-salient regions.
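As an illustration of this construction, the following sketch uses Pillow and numpy; the helper names and parameter defaults are ours, and the indexing of the quality levels $Q_n$ is one plausible reading of equation 7:

```python
import io
import numpy as np
from PIL import Image

def semantic_jpeg(image, block_levels, q_low=30, q_high=70, k=4,
                  out_quality=55, out_path="out.jpg"):
    """image: PIL RGB image. block_levels: 2-D int array with one saliency level
    in [0, k] per 8x8 block (higher = more salient). Parameter defaults are
    illustrative."""
    # Decode the image at each quality level Q_n = Q_l + n*(Q_h - Q_l)/k (equation 7)
    decoded = []
    for n in range(k + 1):
        q = int(round(q_low + n * (q_high - q_low) / k))
        buf = io.BytesIO()
        image.save(buf, format="JPEG", quality=q)
        decoded.append(np.asarray(Image.open(io.BytesIO(buf.getvalue()))))

    # Build the mosaic: each 8x8 block comes from the version matching its level
    mosaic = np.asarray(image).copy()
    for by in range(block_levels.shape[0]):
        for bx in range(block_levels.shape[1]):
            y, x = by * 8, bx * 8
            mosaic[y:y + 8, x:x + 8] = decoded[block_levels[by, bx]][y:y + 8, x:x + 8]

    # One final pass through a standard jpeg encoder; any jpeg decoder can read it
    Image.fromarray(mosaic).save(out_path, format="JPEG", quality=out_quality)
```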
5 Experimental Results
We trained our model with the Caltech-256 dataset [30], which contains 256 classes
of man-made and natural objects, common plants and animals, buildings, etc. We
believe this offers a good balance between covering more classes than CIFAR-100,
which contains only 100, and avoiding the overly fine-grained classes of ImageNet,
which has 1000 [31]. For the results reported here, we experimented
with several stacked layers of convolution as shown in the diagram below:
$$\text{IMAGE} \;\mapsto\; \big[(\text{CONV} \to \text{RELU})^2 \to \text{MAXPOOL}\big]^{5} \;\mapsto\; \text{MS-ROI} \;\mapsto\; \text{MAP}$$
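A minimal PyTorch sketch of a stack with this shape is given below; the channel widths, the number of feature maps per class, and the 1×1 convolution used to produce them are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # (CONV -> RELU)^2 -> MAXPOOL, matching one block of the diagram above
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class MSROINet(nn.Module):
    """Five stacked conv blocks followed by a 1x1 convolution that emits D feature
    maps per class (the f_d^c of equation 4). Channel widths and D are assumptions."""
    def __init__(self, num_classes=256, maps_per_class=4,
                 widths=(64, 128, 256, 512, 512)):
        super().__init__()
        chans = [3, *widths]
        self.backbone = nn.Sequential(*[conv_block(chans[i], chans[i + 1])
                                        for i in range(len(widths))])
        self.head = nn.Conv2d(widths[-1], num_classes * maps_per_class, kernel_size=1)
        self.num_classes, self.maps_per_class = num_classes, maps_per_class

    def forward(self, x):
        f = self.head(self.backbone(x))
        n, _, h, w = f.shape
        # Reshape to (N, C, D, H, W) so the sums of equation 4 run over dims 2-4
        return f.view(n, self.num_classes, self.maps_per_class, h, w)

out = MSROINet()(torch.randn(1, 3, 224, 224))   # -> shape (1, 256, 4, 7, 7)
```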
MS-ROI refers to the operation shown in equation 5. To obtain the final image,
we discretize the heat-map into five levels and use jpeg quality levels $Q$ in increments
of ten from $Q_l = 30$ to $Q_h = 70$. For all experiments, the file size of the standard jpeg
image and the jpeg obtained from our model were kept within ±1% of each other.
On average, salient regions were compressed at $Q = 65$ and non-salient regions were
compressed at $Q = 45$. The overall $Q$ for the final image generated using our model
was 55, whereas for all standard jpeg samples, $Q$ was chosen to be 50.
We tested on the Kodak PhotoCD set (24 images) and the MIT dataset (2,000
images). Kodak is a well known dataset consisting primarily of natural outdoor color
images. Figure 3 shows a sample of five of these images, along with the corresponding
heatmaps generated by our algorithm; the first four show typical results which
strongly capture the salient content of the images, while the fifth is a rare case of
partial failure, in which the heatmap does not fully capture all salient regions. The
Figure 3: Samples of our maps for five Kodak images.
[Plot: PSNR-HVS difference (0.0 to 1.6) for image resolutions ranging from 87×84 to 8705×8400.]
Figure 4: PSNR-HVS of our model minus standard jpeg across various image sizes (higher is better).
MIT set allows us to compare results across twenty categories. In Table 1 we only
report averaged results across ‘Outdoor Man-made’ and ‘Outdoor Natural’ categories
(200 images), as these categories are likely to contain multiple semantic objects, and
are therefore appropriate for our method. Both datasets contain images of smaller
resolutions, but the effectiveness of perceptual compression is more pronounced for
larger images. Therefore, we additionally selected a very large image of resolution
8705 × 8400, which we scale to a range of sizes to demonstrate the effectiveness of
our system at a variety of resolutions. See Figure 4 for the image sizes used in this
experiment. Both Figure 4 and Figure 5 show the PSNR-HVS difference between our
model and standard jpeg. Positive values indicate our model has higher performance
compared to standard jpeg. In addition to an array of standard quality metrics, we
also report a PSNR value calculated only for those regions our method has identified
as salient, which we term PSNR-S. By examining only regions of high semantic
saliency, this metric demonstrates that our compression method is indeed able to
preserve visual quality in targeted regions, without sacrificing performance on traditional
image-level quality metrics or compression ratio. It should be noted that the
validity of this metric is dependent on the correctness of the underlying saliency map,
and thus should only be interpreted to demonstrate the success of the final image
construction in preserving details highlighted by that map.
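PSNR-S as defined above can be computed with a short masked-PSNR routine; the sketch below is our own reading of the definition (PSNR restricted to the pixels the saliency map marks as salient), not code from the paper:

```python
import numpy as np

def psnr_from_mse(mse, peak=255.0):
    return 10.0 * np.log10(peak * peak / mse)

def psnr_s(original, compressed, salient_mask):
    """original, compressed: uint8 images of identical shape (H, W) or (H, W, 3).
    salient_mask: boolean array of shape (H, W) marking pixels the model deems
    salient. Returns (PSNR-S, ordinary PSNR)."""
    diff = original.astype(np.float64) - compressed.astype(np.float64)
    sq = diff ** 2
    return psnr_from_mse(np.mean(sq[salient_mask])), psnr_from_mse(np.mean(sq))
```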
The results in Table 1 show the success of our method in maintaining or improving
performance on traditional image quality metrics. Further, given the efficacy of our
Table 1: Quality metrics for standard jpeg and our model.

                 PSNR-S   PSNR    PSNR-HVS   PSNR-HVSM   SSIM    MS-SSIM   VIFP
Kodak PhotoCD [24 images]
  Std jpeg       33.91    34.70   34.92      42.19       0.969   0.991     0.626
  Our model      39.16    34.82   35.05      42.33       0.969   0.991     0.629
MIT Saliency Benchmark [Outdoor Man-made + Natural, 200 images]
  Std jpeg       36.9     31.84   35.91      45.37       0.893   0.982     0.521
  Our model      40.8     32.16   36.32      45.62       0.917   0.990     0.529
Re-sized versions of a very large image (see Figure 4) [20 images]
  Std jpeg       35.4     27.46   33.12      43.26       0.912   0.988     0.494
  Our model      39.6     28.67   34.63      44.89       0.915   0.991     0.522
[Plot: PSNR-HVS difference (−0.5 to 1.0) for the twenty MIT categories: Pattern, Jumbled, Natural, Social, Fractal, Action, Inverted, Sketch, Random, Cartoon, Manmade, B&W, Satellite, Affective, Line draw, Low Res, Noisy, Art, Object, Indoor.]
Figure 5: PSNR-HVS of our model minus standard jpeg across the categories of the MIT Saliency
dataset (higher is better).
References

[1] G. Ginesu, M. Pintus, and D. D. Giusto, “Objective assessment of the WebP image
coding algorithm,” Signal Processing: Image Communication, vol. 27, no. 8, pp. 867–
874, 2012.
[2] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell,
“Full resolution image compression with recurrent neural networks,” arXiv preprint
arXiv:1608.05148, 2016.
[3] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “SALICON: Saliency in context,” in 2015
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1072–
1080, IEEE, 2015.
[4] L. Kerofsky, R. Vanam, and Y. Reznik, “Perceptual adaptation of objective video qual-
ity metrics,” in Proc. Ninth International Workshop on Video Processing and Quality
Metrics (VPQM), 2015.
[5] T. Richter and K. J. Kim, “A MS-SSIM optimal jpeg 2000 encoder,” in 2009 Data
Compression Conference, pp. 401–410, IEEE, 2009.
[6] T. Richter, “SSIM as global quality metric: a differential geometry view,” in Quality
of Multimedia Experience (QoMEX), 2011 Third International Workshop on, pp. 189–
194, IEEE, 2011.
[7] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment:
from error visibility to structural similarity,” IEEE transactions on image processing,
vol. 13, no. 4, pp. 600–612, 2004.
[8] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image
quality assessment,” in Signals, Systems and Computers, 2004. Conference Record of
the Thirty-Seventh Asilomar Conference on, vol. 2, pp. 1398–1402, IEEE, 2003.
[9] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Trans-
actions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006.
[10] K. Egiazarian, J. Astola, N. Ponomarenko, V. Lukin, F. Battisti, and M. Carli, “New
full-reference quality metrics based on HVS,” in CD-ROM proceedings of the second
international workshop on video processing and quality metrics, Scottsdale, USA, vol. 4,
2006.
[11] N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V. Lukin, “On
between-coefficient contrast masking of DCT basis functions,” in Proceedings of the third
international workshop on video processing and quality metrics, vol. 4, 2007.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep
convolutional neural networks,” in Advances in neural information processing systems,
pp. 1097–1105, 2012.
[13] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,”
in European Conference on Computer Vision, pp. 818–833, Springer, 2014.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2016.
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate
object detection and semantic segmentation,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 580–587, 2014.
[16] M. Kümmerer, L. Theis, and M. Bethge, “Deep gaze i: Boosting saliency prediction
with feature maps trained on imagenet,” arXiv preprint arXiv:1411.1045, 2014.
[17] V. Mnih, N. Heess, A. Graves, et al., “Recurrent models of visual attention,” in Ad-
vances in Neural Information Processing Systems, pp. 2204–2212, 2014.
[18] F. Zünd, Y. Pritch, A. Sorkine-Hornung, S. Mangold, and T. Gross, “Content-aware
compression using saliency-driven image retargeting,” in 2013 IEEE International Con-
ference on Image Processing, pp. 1845–1849, IEEE, 2013.
[19] X. Y. Stella and D. A. Lisin, “Image compression based on visual saliency at individual
scales,” in International Symposium on Visual Computing, pp. 157–166, Springer, 2009.
[20] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully
convolutional networks,” arXiv preprint arXiv:1605.06409, 2016.
[21] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu, “Predicting eye fixations using convo-
lutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 362–370, 2015.
[22] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,”
The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
[23] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on
Computer Vision, pp. 1440–1448, 2015.
[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object
detection with region proposal networks,” in Advances in neural information processing
systems, pp. 91–99, 2015.
[25] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic seg-
mentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3431–3440, 2015.
[26] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and
P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings
of the IEEE International Conference on Computer Vision, pp. 1529–1537, 2015.
[27] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal
visual object classes (voc) challenge,” International journal of computer vision, vol. 88,
no. 2, pp. 303–338, 2010.
[28] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? Weakly-
supervised learning with convolutional neural networks,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 685–694, 2015.
[29] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features
for discriminative localization,” in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2016.
[30] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” 2007.
[31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-
scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.