0% found this document useful (0 votes)
6 views

Deep Learning-Based Picture-Wise Just Noticeable Distortion Prediction Model for Image Compression

The document presents a deep learning-based model for predicting Picture Wise Just Noticeable Difference (PW-JND) to enhance image compression techniques. The proposed model transforms the PW-JND prediction into a binary classification problem using a perceptually lossy/lossless predictor, achieving a mean accuracy of 92%. This model aims to improve the efficiency of image and video processing by accurately reflecting the total masking effect perceived by the human visual system.

Uploaded by

Shadhu Baba
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

Deep Learning-Based Picture-Wise Just Noticeable Distortion Prediction Model for Image Compression

The document presents a deep learning-based model for predicting Picture Wise Just Noticeable Difference (PW-JND) to enhance image compression techniques. The proposed model transforms the PW-JND prediction into a binary classification problem using a perceptually lossy/lossless predictor, achieving a mean accuracy of 92%. This model aims to improve the efficiency of image and video processing by accurately reflecting the total masking effect perceived by the human visual system.

Uploaded by

Shadhu Baba
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL.

29, 2020 641

Deep Learning-Based Picture-Wise Just


Noticeable Distortion Prediction
Model for Image Compression
Huanhua Liu, Yun Zhang , Senior Member, IEEE, Huan Zhang , Student Member, IEEE, Chunling Fan,
Sam Kwong , Fellow, IEEE, C.-C. Jay Kuo , Fellow, IEEE, and Xiaoping Fan

Abstract— Picture Wise Just Noticeable Difference (PW-JND), another or not. Finally, we propose a sliding window based search
which accounts for the minimum difference of a picture strategy to predict PW-JND based on the prediction results of the
that human visual system can perceive, can be widely used perceptually lossy/lossless predictor. Experimental results show
in perception-oriented image and video processing. However, that the mean accuracy of the perceptually lossy/lossless predictor
the conventional Just Noticeable Difference (JND) models cal- reaches 92%, and the absolute prediction error of the proposed
culate the JND threshold for each pixel or sub-band separately, PW-JND model is 0.79 dB on average, which show the superiority
which may not reflect the total masking effect of a picture accu- of the proposed PW-JND model to the conventional JND models.
rately. In this paper, we propose a deep learning based PW-JND
prediction model for image compression. Firstly, we formulate Index Terms— Just noticeable distortion, convolutional neural
the task of predicting PW-JND as a multi-class classification network, visual perception, image quality assessment.
problem, and propose a framework to transform the multi-class
classification problem to a binary classification problem solved by I. I NTRODUCTION
just one binary classifier. Secondly, we construct a deep learning
based binary classifier named perceptually lossy/lossless predictor
which can predict whether an image is perceptually lossy to T HE Ultra-High-Definition (UHD), Three Dimensional
(3D) [1], and Virtual Reality (VR) images and videos
with the ability to provide a more immersive and realistic
experience than conventional multimedia, are becoming more
Manuscript received July 23, 2018; revised February 4, 2019 and May 27,
2019; accepted July 29, 2019. Date of publication August 13, 2019; and more popular with consumers in streaming services [2].
date of current version September 25, 2019. This work was supported However, the bandwidth and storage required to support
in part by the National Natural Science Foundation of China under UHD, 3D, and VR streaming services are several or even more
Grant 61871372, in part by the Guangdong Natural Science Foundation
for Distinguished Young Scholar under Grant 2016A030306022, in part times the size of that required for the traditional images and
by the Key Project for Guangdong Provincial Science and Technology videos, which has been a bottleneck of the streaming services
Development under Grant 2017B010110014, in part by the Shenzhen Interna- industry. The main-stream of the current image/video cod-
tional Collaborative Research Project under Grant GJHZ20170314155404913,
in part by the Shenzhen Science and Technology Program under Grant ing techniques [3] are signal-processing-based, which mainly
JCYJ20170811160212033, in part by the Guangdong International Science consider the statistical properties of visual content. They are
and Technology Cooperative Research Project under Grant 2018A050506063, becoming difficult to achieve further improvement in reducing
in part by the Membership of Youth Innovation Promotion Associa-
tion, Chinese Academy of Sciences under Grant 2018392, and in part the size of images and videos without perceptual quality
by the Shenzhen Science and Technology Plan Project under Grant degradation. As we know, the ultimate receiver of most visual
JCYJ20180507183823045. The associate editor coordinating the review of content is the Human Visual System (HVS), therefore it
this article and approving it for publication was Prof. Damon M. Chandler.
(Corresponding author: Yun Zhang.) is important to develop image/video processing technologies
H. Liu is with the School of Computer Science and Engineering, incorporating the characteristic of HVS [4] for streaming
Central South University, Changsha 410075, China, and also with the services industry.
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences,
Shenzhen 518055, China (e-mail: [email protected]).
It is well known that humans cannot perceive the small
Y. Zhang is with the Shenzhen Institutes of Advanced Technol- changes in the images/videos due to the psychological and
ogy, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: physiological mechanism of HVS. Therefore the processed
[email protected]).
H. Zhang and C. Fan are with the Shenzhen Institutes of Advanced Technol-
images/videos have visual redundancy which can be removed
ogy, Chinese Academy of Sciences, Shenzhen 518055, China, and also with without any perceptual quality degradation. Just Noticeable
the Shenzhen College of Advanced Technology, University of Chinese Acad- Difference (JND) refers to the minimum distortion HVS
emy of Sciences, Shenzhen 518055, China (e-mail: [email protected];
[email protected]). can perceive, which has been widely used in image/video
S. Kwong is with the Department of Computer Science, City University of processing, e.g., perceptual image/video coding [5]–[7], image
Hong Kong, Hong Kong (e-mail: [email protected]). enhancing [8], and objective quality estimation [9]. The
C.-C. J. Kuo is with the Ming Hsieh Department of Electrical Engineering,
University of Southern California, Los Angeles, CA 90089 USA (e-mail: existing JND models can be divided into two categories:
[email protected]). 1) pixel-domain models [10]–[15] calculate JND threshold for
X. Fan is with the School of Information Technology and Manage- each pixel directly in the pixel domain; 2) sub-band domain
ment, Hunan University of Finance and Economics, Changsha 410205,
China, and also with the School of Automation, Central South University, models transfer pixel domain images to the sub-band domain,
Changsha 410075, China (e-mail: [email protected]). e.g., Discrete Cosine Transformation (DCT), then calculate the
Digital Object Identifier 10.1109/TIP.2019.2933743 JND threshold for each sub-band [16]–[20].
1057-7149 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
642 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

Pixel domain JND models mainly focus on the background quality levels. The distortion of JND image reflects the total
luminance adaptation and spatial contrast masking. In [10], masking effect of a picture accurately, which can be defined as
a novel spatial masking function was introduced, which was Picture Wise Just Noticeable Difference (PW-JND) referring
combined with luminance adaptation to deduce the overall to the minimum difference of a picture that can be perceived
JND thresholds. Wu et al. [11] suggested that there exists by the HVS. Jin et al. [23] constructed the first JND based
disorderly concealment effect resulting in high JND thresholds image quality data set MCL-JCI. Wang et al. [24] proposed
of the disorderly region, and proposed a free energy based a subjective methodology to find the video-wise JND videos
JND model aiming to improve the accuracy of JND threshold and constructed the first video-wise JND based quality data set
estimation of texture regions. In [12], a novel pattern mask- MCL-JCV. Fan et al. [25] studied the PW-JND of symmet-
ing function deduced from luminance contrast and structural rically and asymmetrically compressed stereoscopic images
uncertainty was incorporated into the proposed JND model. for JPEG2000 and H.265 intra coding. Two PW-JND based
Wang et al. [13] proposed an edge profile reconstruction based stereo image quality datasets have been provided: one for
JND model for screen content images. Each edge profile was symmetric compression and one for asymmetric compression.
decomposed into its luminance, contrast, and structure part, Huang et al. [26] proposed a machine learning approach to
each of which was evaluated respectively. Wu et al. [14] predict the mean of group-based JND distribution by using the
proposed an improved pattern masking function based model, extracted features of videos. As every one knows, subjective
where the pattern complexity was calculated as the diversity prediction methods are too time consuming to apply into the
of orientation in local region. In [15], a new JND model was real-world systems. Therefore, it deserves to devise objective
proposed which considers visual saliency. However, the pixel PW-JND prediction method, which is more challenging than
domain JND models can hardly be incorporated into sub-band to estimate the pixel/sub-band JND threshold. There are more
image/video compression systems. factors affecting the PW-JND thresholds, e.g., distortion type,
The sub-band domain JND models mainly focus on Con- contrast masking, and luminance adaptation. In this paper,
trast Sensitive Function (CSF), luminance adaptation, con- we propose a PW-JND model to predict PW-JND for pristine
trast masking, and foveated masking. Wei and Ngan [16] and distorted images. The main contributions of our work can
proposed a CSF-based spatio-temporal JND model, in which be summarized as:
gamma correction was introduced to compensate luminance 1) We formulate the task of predicting PW-JND as a multi-
adaptation effect. In [17], a luminance adaptation JND model class classification problem, and propose a framework
was proposed which took frequency characteristics into the to transform the multi-class classification problem to a
luminance adaption. Bae and Kim [18] proposed a novel binary classification problem.
JND model being applicable to any size of transform kernel, 2) We construct a deep learning based binary classifier
which introduced a new texture complexity metric to measure named perceptually lossy/lossless predictor. It can pre-
contrast masking effect. In [19], foveated masking was incor- dict whether a distorted image is perceptually lossy to
porated into the proposed temporal-foveated masking model its reference or not. The experimental results show that
which also considered the difference between moving and still its mean accuracy reaches 92%.
objects. Recently, learning techniques have also been applied 3) We propose a sliding window based search strategy to
to estimating the JND thresholds. In [20], Ki et al. proposed a predict the PW-JND based on the prediction results of
regression based method to estimate the JND thresholds under the perceptually lossy/lossless predictor.
the distortion with energy reduction. However, the sub-band The paper is organized as follows. In Section II, we present
domain models require a DCT transform, and can hardly esti- the motivation of predicting PW-JND which is formulated as
mate thresholds of the complicated texture regions accurately a multi-class classification problem. Section III presents the
since each block is isolated from its surrounding. framework of the proposed PW-JND model which transforms
The pixel/sub-band domain JND models compute the JND the multi-class classification problem to a binary classification
threshold for each pixel/sub-band separately. By a simple problem. In Section IV, we propose a deep learning based
summation of the estimated JND thresholds, they may not perceptually lossy/lossless predictor which can predict whether
reflect the total masking effect of a picture accurately. The con- a distorted image is perceptually lossy to its reference or not,
tribution of different regions to the image quality is different, and evaluate the performance of the predictor. In Section V,
and some critical regions together with the worst quality ones we propose a sliding window based PW-JND search strategy.
determine the image quality [21]. Moreover, the traditional In Section VI, we report the experimental results. Section VII
JND models mainly focused on the pristine images/videos but concludes this paper.
not on the distorted ones, which limits the application areas,
since the images/videos fed into the real-world applications are II. M OTIVATION AND P ROBLEM F ORMULATION
usually degraded. Recent studies [22]–[25] demonstrated that The traditional Rate-Distortion (R-D) function shown in
humans cannot perceive continuous-scale visual quality that Fig. 1(a) is continuous and convenient for the computation
changes over a range of coding bit rate, and this phenomenon of coding systems. However, the visual quality experience of
was quantified based on the notion of JND. Hu et al. [22] pro- humans is a discrete rather than continuous process. Recent
posed a subjective methodology to find the JND images under studies [22], [24] demonstrated that humans can only dis-
Joint Photographic Experts Group (JPEG) compression, which tinguish several limited quality levels of the image/video
are the transition images between two adjacent perceptual changing in a range of bit rate. A perceptual distortion function

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: DEEP LEARNING-BASED PICTURE-WISE JUST NOTICEABLE DISTORTION PREDICTION MODEL 643

Fig. 2. Application scenarios of PW-JND model. (a) Application in streaming


Fig. 1. Illustration of perceptual distortion of JPEG-compressed images, service. (b) Application in watermark embedding.
(c) to (f) are enlarged patches. (a) The difference between JND based stair
R-D function and the traditional R-D function [24]. (b) Pristine image,
size = 6220 KB, MSE = 0. (c) The third PW-JND image, size = 80 KB, reference can be perceived by humans or not. In MOS/DMOS
MSE = 199.5. (d) The second PW-JND image, size = 159KB, MSE = 99.6.
(e) The first PW-JND image, size = 235 KB, MSE = 70.2. (f) JPEG- evaluating, sample label is an absolute score describing the
compressed image with QF 100, size = 1728 KB, MSE = 3.59. overall image quality. Thirdly, PW-JND prediction model will
mainly be used to predict distorted images of which the
f(R) shown in Fig. 1(a) was proposed, which is a stair step difference cannot be perceived by humans. The conventional
function about bit rate. In f(R), the jump points denoted MOS/DMOS evaluators [31] were mainly used to evaluate
by the circles between two adjacent quality levels are JND perceptually lossy images.
points [22], [23], e.g., the first JND point jumps from the Two applications of PW-JND prediction model are listed
best to the secondary quality level. It can be seen from f(R) in Fig. 2. For streaming media systems, high visual quality
that the bit rate of the compressed images with the same requires large bit rate, and lower bit rate can only provide
perceptual quality vary greatly, and JND points have the lowest low quality visual content. However, higher bit rate than
bit rate in a given quality level. For example, Fig. 1(b) is the what it needs means a waste of storage and bandwidth, but
pristine image with size 6220 KB, Fig. 1(f)-(c) are enlarged lower bit rate will damage the consumers’ visual experience.
patches of JPEG-compressed images coded from Fig. 1(b) Fig. 2(a) shows a typical framework of streaming media
with different QF. Fig. 1(f) is cropped from the compressed system including coding, streaming, decoding, and display
image coded with QF 100, Fig. 1(e)-(c) are cropped from process. In the streaming systems, the PW-JND prediction
the first, second, and third JND images of Fig. 1(b) [23]. model can be used for coding or selecting the images/videos
The size of the associated images of patch Fig. 1(f)-(c) are with the smallest size but best quality, which can help
1728 KB, 235 KB, 159 KB, and 80 KB, and the Mean Squared save the bandwidth without damaging consumers’ experience.
Error (MSE) values are 3.95, 70.2, 99.6, and 199.5 respec- Fig. 2(b) shows a digital watermarking system which includes
tively. The perceptual quality of Fig. 1(e) is nearly equal watermark embedding, watermark extraction, and watermark
to that of Fig. 1(f), but the size of the associated image of verification blocks. Watermark embedding block is responsible
Fig. 1(e) is much smaller than that of Fig. 1(f). We can also for embedding the digital watermark (i.e, ownership) into
see the similar phenomenon from Fig. 1(c) and (d). We define digital media for copyright protection, source tracking, and so
the bit rate of the first, second, and third JND images as the on. The embedded digital watermarks are often required to be
first, second, third PW-JND respectively. Therefore, PW-JND only perceptible under certain conditions for human beings.
prediction can be used to guide coding, which can help save Therefore, PW-JND prediction model can be used to guide
bit rate without perceptual quality degradation. As far as we the embedding process. Although PW-JND prediction model
know, it is the first work to predict PW-JND, which is different has a wide range of applications, there are many challenges
from conventional Mean Option Score (MOS) or Difference in designing an accurate PW-JND model. First of all, the
Mean Option Score (DMOS) predictors, e.g., SSIM (MS- PW-JND has a wide range of values affected by the visual
SSIM) [27], FSIM [28], GMSD [29], and VSI [30]. First content which varies greatly. Secondly, the distortion can be
of all, there is an assumption [22]–[26] that the perception introduced at different stages, e.g., pre-processing, compres-
model on quality of HVS is discrete in PW-JND prediction, sion, and transmission, each of which affects PW-JND thresh-
and PW-JND is the boundary between two adjacent perceptual olds in different manners. Thirdly, we can hardly build an
quality levels of the reference image. However, in conventional accurate mathematical PW-JND model because it is not clear
MOS/DMOS prediction it is continuous. Secondly, in PW-JND about the mechanism of HVS in processing visual signals.
prediction, sample label is a relative value which denotes In order to formulate the problem of predicting PW-JND,
whether the difference between a distorted image and its we define the perceptual distortion function f (R) shown

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
644 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

In the following section, we will introduce how to predict the


i t h PW-JND in detail.

III. P ROPOSED F RAMEWORK OF


PW-JND P REDICTION M ODEL
A. Classification Modeling
In the previous section, we modelled the task of predicting
PW-JND for a given image x as a multi-class classification
problem. The straightforward method to solve the multi-class
classification problem is to construct a multi-class classifier
(x) shown in Fig. 4(a). The circle denotes the multi-class
classifier and rectangles denote the classes. This model is
severely limited to the training data. The multi-class classi-
fication problem was usually converted to a combination of
binary classification problems, e.g., one-versus-all, one-versus-
one, and hierarchy combinations. The hierarchy combination
Fig. 3. The relationship between D and R/K. is the most popular model due to its superior performance.
One-All (O-A) hierarchy and binary hierarchy model are the
in Fig. 1 as most used hierarchy models, which are illustrated in Fig. 4(b)
and Fig. 4(c) respectively. The circles denote binary classifiers

n
and rectangles denote classes. O-A hierarchy needs (m − 1)
f (R) = h i δ(R − bi ), (1)
binary classifiers, which often suffers from the problem of
i=1
sample imbalance. The binary hierarchy needs 2 L−1 binary
where bi is the i t h PW-JND which we need to predict, h i is classifiers, where L is number of the levels and L = log2 m.
the difference between two adjacent quality levels, and δ(·) is For the above models, it is required that each class has
defined as enough data to train the classifiers. Due to the shortage of

0, x > 0 training data samples, the above models can hardly employ
δ(x) = (2) deep learning tools directly which achieved impressive success
1, x ≤ 0.
in both high level [32] and low level [33] computer vision
The PW-JND can be described as bit rate or other metrics, tasks. As shown in Fig. 4(d), we propose a PW-JND model
e.g., Quality Factor (QF), Peak Signal-to-Noise Ratio (PSNR), consisting of an input part, a binary classifier (x, Di sti )
and Structural Similarity Index Measurement (SSIM). There- denoted by the circles, a search strategy, and a output part.
fore, we introduce a monotone increasing function g(R) to The input part comprises a test image x and a distorted image
map R to K domain shown in Fig. 3. K can be continuous set D consisting of distorted images Di sti , where i denotes
(e.g., PSNR) or discrete (e.g., QF). In this paper, K is the the image index belonging to [1, . . . , m]. The binary classifier
index of a distorted image, which denotes that we predict (x, Di sti ) is designed to predict whether a distorted image
the PW-JND in a discrete domain. Now, we replace the Di sti is perceptually lossy from the test image x or not. The
perceptual distortion function f (R) with E(K ) in the discrete search strategy will be used to predict PW-JND based on the
K domain as prediction results of the binary classifier. The output part is
n PW-JND prediction result. The proposed PW-JND model can
E(K ) = h i (K − K i ), K ∈ [1, . . . , m], (3) be used to predict PW-JND for test image x under different
i=1 types of distortion. In this work, we predict PW-JND for x
where K i is the index of the i t h PW-JND we need to predict, under JPEG compression. Di sti is a JPEG-compressed image
and (·) is defined as with QF i , and bigger QF value means higher quality.
 For comparison, the required binary classifiers, comparison
0, x ∈ Z + times, and mean accuracy of the above four models are listed
(x) = (4)
1, x ∈ / Z +, in Table I. From the second column, we can see that the
proposed PW-JND model needs to train only one binary
where Z + denotes positive integer, which is the definition of classifier, which is the biggest advantage. From the third
symbol Z + . It is clear that the K i is a positive integer here column, we can see that the proposed PW-JND model needs m
which ranges from 1 to m. Therefore, the task of predicting computing times in predicting stage, which has the maximum
the i t h PW-JND of image x can be formulated as a multi-class time cost. In order to obtain the mean accuracy, we assume m
classification problem. It can be described as as the number of classes, χi as the probability of class i . For
K i = (x), (5) the multi-class classifier (x), e is assumed as the accuracy.
For O-A hierarchy model, we assume the accuracy of classifier
where the input x is a predicting image, and the output K i 
m
Ci to be ei , and the mean accuracy is about χi ei . For the
is the i t h PW-JND of x, which is considered as a class i=1
label that belongs to [1, . . . , m] (m is the number of classes). binary hierarchy model, e j,q is assumed as the accuracy of the

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: DEEP LEARNING-BASED PICTURE-WISE JUST NOTICEABLE DISTORTION PREDICTION MODEL 645

Fig. 4. Different decompositions of multi-class classification problem. (a) Multi-class classifier. (b) O-A hierarchy decomposition. (c) Binary hierarchy
decomposition. (d) Proposed model.

TABLE I In general, there are several advantages of the proposed


C OMPARISON OF D IFFERENT D ECOMPOSITIONS OVER R EQUIRED PW-JND model. First of all, the proposed model just needs
C LASSIFIERS , C OMPARE T IMES , AND M EAN A CCURACY
to train one binary classifier (perceptually lossy/lossless pre-
dictor), which simplifies the problem of predicting PW-JND.
Moreover, for the proposed perceptually lossy/lossless pre-
dictor, training data are perceptually lossy/lossless samples.
The number of lossy/lossless samples is many times that of
PW-JND samples. It can be said that the proposed model
augments the available samples effectively in an indirect way,
which is helpful for deep learning. Secondly, the proposed
classifier C j,q , where j represents the j t h layer and q repre- search strategy can tolerate some mistakes made by the percep-
sents the q t h node th tually lossy/lossless predictor. Thirdly, the proposed model can
 in the j layer. The mean Laccuracy can predict all PW-JNDs for the test image x, which denotes that
be computed as χi e j,q , j ∈ [1, L], q ∈ [1, 2 ] (L denotes
the levels). e j,q is the accuracy of the binary classifier between the proposed model can predict PW-JND for distorted images.
class i to C1,1 . The mean accuracy of proposed model is about The PW-JND model also has some shortcomings, e.g., it needs
(1 − τ (ek , p, ε)). τ (ek , p, ε) is the error rate which can be m comparison times and a distorted image set D in addition
nearly obtained by to the test image x.
p−1
τ (ek , p, ε) ≈ (1 − ek ) + 3(1 − ek )ek
p−2
B. The Framework of the Proposed PW-JND
+ (2 p + 3)(1 − ek )2 ek Prediction Model
( p − 1)!
+ (1−ek ) p−ε+1 ekε−1 . (6) Fig. 5 shows the proposed framework of PW-JND model
(ε − 1)!( p − ε − 1)! which includes a training and predicting stage. At the
ek is the mean accuracy of the perceptually lossy/lossless training stage, we need to train a patch-based perceptually
predictor Ck . p ( p ≥ 1) and ε (ε ≤ p) are window size lossy/lossless predictor (x, Di sti ) built on Convolutional
and threshold of the sliding window, which will be described Neural Network (CNN), which can predict whether a distorted
in Section V-A in detail. If the accuracy of each classifier in image is perceptually lossy from its reference or not. The
the above models is assumed to be υ and the probability of proposed perceptually lossy/lossless predictor includes the
each class is assumed to be equal, the mean accuracy is υ, following blocks: 1) patch selection module selects the patches
1  
m L from the reference and distorted images; 2) CNN-based feature
m υ, υ, and (1 − τ (υ, p, ε)) for multi-class classifier,
i=1 i=1
extractor is to extract distinguished features from the selected
O-A hierarchy, binary hierarchy, and the proposed model patches; 3) patch feature fusion block is responsible for
respectively. Although the mean accuracy of the proposed concatenating the features that extracted from the distorted and
model is not the highest, it overcomes the limitation of reference image; 4) patch-wise quality measure block mea-
PW-JND data problem and just needs one binary classifier sures the quality of selected patches from the distorted image;
at the cost of more computing time in predicting phase. 5) picture-wise perceptually lossy/lossless predictor uses the

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
646 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

Fig. 6. Architecture of the proposed perceptually lossy/lossyless predictor.

TABLE II
C ONFIGURATIONS OF THE P ERCEPTUALLY
L OSSY /L OSSLESS P REDICTOR N ETWORK

Fig. 5. Framework of the proposed PW-JND model.

patch-wise quality index to classify Di sti as perceptually lossy


or lossless categories.
The predicting stage includes three steps. Firstly, the test
image x should be compressed with different quality factors
to obtain distorted image set D, where the number of quality
the architecture from [34] to build LQAN which consists of
levels depends on the distortion type. In this work, we take
ten Convolutional (Conv.) layers, one Concatenation (Concat)
JPEG coder to compress x with QF ranging from 1 to 100
layer, and two Full Connected (FC) layers. Each convolutional
and obtain D including 100 JPEG-compressed images with
layer is activated by the Rectified Linear Unit (ReLU) [35],
different quality. Secondly, each distorted image Di sti in D
and a max pooling layer follows each two convolutional layers.
will be compared with the test image x by the well trained
A concatenation layer following the ten convolutional layers
perceptually lossy/lossless predictor. The prediction class label
is to concatenate the features learned from the distorted and
of Di sti will be one if Di sti is perceptually lossy from the
reference patches. The two FC layers FC1 and FC2 adopt
test image x, otherwise zero. Thirdly, the prediction results of
dropout regulation with ratio 0.5. The output of the LQAN is
the perceptually lossy/lossless predictor will be fed into the
the quality score of a selected distorted patch, and larger score
search strategy which is designed to predict PW-JND finally.
denotes worse quality here. The GCN is a binary classification
network which includes one FC layer (FC3) activated by
IV. T HE CNN BASED P ERCEPTUALLY sigmoid function. The dimension of FC3 is the same as the
L OSSY /L OSSLESS P REDICTOR patch number N. Each element (a positive value) in FC3 is
A. Convolutional Neural Networks Architecture the weight of a selected patch, which denotes the contribution
The proposed perceptually lossy/lossless predictor based on of the corresponding patch to the whole image quality. The
deep CNN is trained for predicting whether a distorted image weights are initialized the same as [34] and updated with
is perceptually lossy from its reference or not. As shown LQAN simultaneously. The sigmoid function in the last layer
in Fig. 6, the predictor consists of a patch selection strategy, rescales the output among [0, 1]. If the output is larger
a Local Quality Assessment Network (LQAN), and a Global than 0.5, the distorted image is determined as perceptually
Classifier Network (GCN). The LQAN includes the feature lossy from the reference, otherwise lossless.
extractor, patch feature fusion, and local quality measure
block. B. Loss Function
The configurations of the LQAN and GCN are shown Let F(x t , θ ) denotes the end-to-end mapping function,
in Table II. The distorted and its reference image are firstly in which x t is the input image pair and θ is the set of weights.
divided into patches with size M × M, and N patches are Therefore, we need to estimate θ mapping x t to its ground
selected from the reference and distorted image in the same truth label. We optimize θ by minimizing the cross-entropy
location respectively. The patch size M and patch number N loss. Given a set of n training sample pairs (x t , yt ), where
are parameters which will be discussed later. We borrow x t consists of a reference image and a distorted image, and

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: DEEP LEARNING-BASED PICTURE-WISE JUST NOTICEABLE DISTORTION PREDICTION MODEL 647

yt is the ground truth label. We update θ by minimizing the


following loss function
1
n
L(θ ) = − [yt log F(x t , θ ) + (1 − yt )log(1 − F(x t , θ ))].
n
t =1
(7)

C. Dataset Generation
MCL-JCI dataset [23] is the first PW-JND based image
quality dataset. This dataset comprises 50 pristine images, and Fig. 7. Validation accuracy of the perceptually lossy/lossless predictor with
each pristine image has 100 JPEG-compressed images with different patch sizes and numbers of selected patches. (a) Different patch sizes.
different QF ranging from 1 to 100. Most of the pristine (b) Different numbers of selected patches.
images have 4 to 8 PW-JND images found by subjects, each of
which is the transitional image between two adjacent percep- determine M and N, which will be described in detail in the
tual quality levels. In order to train the proposed predictor, following paragraph. Truncated normal initializer was used to
we make use of PW-JND images to generate perceptually initialize the network weights and Adam [36] was taken as
lossy/lossless training samples. A perceptually lossy/lossless the gradient descent optimization method. The learning rate
sample can be described as (x t , yt ) which consists of image was initialized as 1 × 10−4, which decreases as the number of
data x t and label yt : 1) x t consists of the reference r e f k  iterations increases. Each mini-batch contains 4 images, each
and distorted image Di sti , where r e f k  is the k t h PW-JND represented by 32 randomly sampled image patches, which
image with QF value of k  (larger k denotes smaller QF), leads to the effective batch size of 128 patches. All of the
i is the QF value of its corresponding distorted images Di sti , training parameters were updated one time when a mini-batch
which ranges from 1 to (k  − 1). In particular, r e f 0 represents was processed. It is worth noting that the proposed predictor
the pristine image, and the QF value i of its corresponding is a patch-based network which predicts an image-wise result
distorted images ranges 1 to 100. 2) yt is labeled as one by pooling the quality of the selected patches. Therefore,
when i ranges 1 to (k + 1) denoting the QF value of the the selected patches cannot cross mini-batch. We implemented
(k + 1)t h PW-JND image, which denotes Di sti is perceptually the networks based on the Tensorflow 1.2.0 and Python 3.5.2,
lossy to r e f k  , otherwise yt is labeled as zero, which denotes then trained them on the machine with NVIDIA GTX1080Ti
Di sti is perceptually lossless from r e f k  . The first, second, GPU and memory 32G.
third, and fourth ground truth PW-JND in MCL-JCI data set The patch size M and number of patches N are two key
were used as reference images to generate training samples. factors for the proposed perceptually lossy/lossless predictor.
Finally, we generated 5003 positive and 3974 negative image However, there are so many combinations of (M, N) that it
samples. is difficult to find the global optimal combination through
For the convenience of training, the generated data set exhaustive searching methods. Firstly, the number of patches
was split into five subsets. Firstly, the 50 pristine images N was fixed as 32 [34], the other variables was controlled sta-
were randomly divided into five equal groups I1 , I2 , I3 , I4 , ble except patch size. The training and validation set are ran-
and I5 , each of which includes 10 images. Then, the samples domly selected from the generated data set {S1 , S2 , S3 , S4 , S5 }.
generated from the images in the same group were added Six patch sizes, M ∈ [8, 16, 32, 64, 128, 256], were tested.
into one subset. Finally, the generated data set was split as The validation accuracy is shown in Fig. 7(a), where x-axis
{S1 , S2 , S3 , S4 , S5 } according to the division of the pristine denotes patch size and y-axis denotes validation accuracy.
images {I1 , I2 , I3 , I4 , I5 }. There is no overlap among the five It can be seen that the validation accuracy is improved with the
subsets. increase of patch size when the patch size is not beyond 32.
The reason lies in the quality of a larger block is closer to the
D. Validation and Optimal Parameters Determination quality of the entire image. However, the situation is just the
Each image fed into the proposed perceptually loss/lossless opposite when the patch size is larger than 64. The reason may
predictor is represented by N patches with size M × M. be that the parameters of the networks increase exponentially
The patches share the convolutional neural networks, and with the increase of the patch size, which means that we
each patch can be seen as a sample. Therefore, the training need more training data. Secondly, patch size M was fixed as
samples are many times that of image samples. Random 32 (32 × 32), and N ∈ [8, 16, 32, 64, 128, 256] were tested.
cropping data augmentation was used in training, all of the The validation accuracy is shown in Fig. 7(b), where x-axis
patches were randomly cropped from the image every epoch denotes patch size and y-axis denotes validation accuracy. The
to ensure that as many different patches as possible can be accuracy is low when N is 8 and 16, and it reaches a high
used [34]. In validation, the N random patches for each value when N is 32. The reason may be that a very small
image were only sampled once at the beginning of training in number of selected patches (N = 8, 16) can hardly represent
order to avoid noise. In the experiment, patch size M × M the whole image quality. It can also be observed that there is
was set to 32 × 32 and the number of patches N was not a significant gain in validation accuracy with the increase
set to 32. The variable-controlling approach was taken to of N when N is larger than 32. Since a larger N brings more

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
648 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

one time on the entire generated data set including samples


generated from the first to fourth ground truth PW-JND. There
is no need to re-train them in predicting the first to fourth
PW-JND.

V. P ROPOSED PW-JND S EARCH S TRATEGY


A. Sliding Window Based PW-JND Searching
If the accuracy of the perceptually lossy/lossless predictor
is assumed to be 100%, we can obtain the ideal case shown
in Fig. 9(a), in which x-axis represents distorted images
Fig. 8. Training peformance of the perceptually lossy/lossless predictor.
(a) Training loss. (b) Validation accuracy.
Di sti (larger i means better quality), and y-axis denotes the
labels predicted by the perceptually lossy/lossless predictor
TABLE III
(r e f, Di sti ). y  = 1 (or 0) denotes the distorted image Di sti
T HE VALIDATION AND T EST A CCURACY IN
is perceptually lossy (or lossless) from the reference r e f .
F IVE -F OLD C ROSS VALIDATION The distorted image Di stk can be determined as the PW-JND
image and the corresponding index k can be predicted as the
PW-JND, which satisfy: 1) H (v) = 1 when v ∈ [0, . . . , k −1],
where H (v) is defined as
H (v) = (r e f, Di skk−v ); (8)
2) T (n) = 0 when n ∈ [1, . . . , m−k], where T (n) is defined as
T (n) = (r e f, Di skk+n ). (9)
computation time, we selected the number of patches N as 32.
Finally, we selected 32 × 32 as the patch size and 32 as the It is easy to locate the PW-JND image by searching from
number of selected patches in this work. right to left (or left to right) when all of the v (or n)
We randomly chose three subsets for training, one for distorted images are predicted accurately by the perceptually
validation, and the rest one for testing from the generated lossy/lossless predictor. However, it can hardly predict all v
dataset. The training loss is shown in Fig. 8(a), where x-axis (or n) distorted images accurately, since the predictor may
and y-axis denote the training epoch and the training loss have a error prediction. It can easily cause that the predicted
respectively. We can see that the training loss drops down PW-JND thresholds differ greatly from the ground truth.
rapidly at the beginning, and after about 80 training epochs Therefore, we propose a sliding window based PW-JND search
the loss fluctuates slightly in the later training epochs. The method shown in Fig. 9(b) to determine the PW-JND image.
validation prediction accuracy is shown in Fig. 8(b), where We slide the window from right to left, and determine the
x-axis and y-axis denote training epoch and validation accu- distorted image Di stk as the PW-JND image which satisfies
racy respectively. We can see that the validation accuracy rises 
p
rapidly at the beginning and keeps stable later. From the above H (v) ≥ ε, (10)
observations, we can conclude that the predictor converges 0
at last. where p is the window size, and ε is a given threshold. The
proposed PW-JND search strategy can tolerate some mistakes
E. Testing of the Perceptually Lossy/Lossless Predictor of the perceptually lossy/lossless predictor. Take Fig. 9(b) as
In order to further evaluate the generalization capabilities an example, the window size p is set as 5 and ε is set as 4.
of the proposed perceptually lossy/lossless predictor, five- Although the point A and C are predicted incorrectly, it will
fold cross-validation was made in this work. Firstly, three not affect the result that point B will be determined as the
subsets were selected to train the predictor, one for valida- PW-JND image, which is consistent with the ground truth. The
tion, and the other for testing from the generated data set mean accuracy of the proposed PW-JND model can be derived
{S1 , S2 , S3 , S4 , S5 }. Once a stable predictor was well trained, from (6), and we can see that the accuracy of the perceptually
we rotated the test set, and finally we trained five predictors lossy/lossless predictor is a key factor, and thresholds p and
D1 , D2 , D3 , D4 , and D5 . The cross validation results are ε are also important factors.
shown in Table III, where accv and acct denote the mean
validation accuracy and testing accuracy. Str , Sva , St e are B. The Parameter Determination for the Sliding Window
training, validation, and testing set respectively. We can see The window size p and threshold ε affect the performance
from Table III that: 1) All of the validation and test accuracy of the PW-JND search strategy. In order to select a suitable
is over 90%. The mean validation and test accuracy are 91.98% combination of p and ε, we selected D3 as the perceptually
and 91.96% respectively, which denotes that the accuracy of lossy/lossless predictor, I3 as the test image set. The aim is
the predictor is high. 2) The validation accuracy is close to to ensure that the test images are randomly selected from
the corresponding test accuracy, which means a stable perfor- MCL-JCI dataset and the test images have never been seen
mance. In this work, the five predictors D1 -D5 were trained by the perceptually lossy/lossless predictor. We predicted the

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: DEEP LEARNING-BASED PICTURE-WISE JUST NOTICEABLE DISTORTION PREDICTION MODEL 649

Fig. 9. PW-JND image searching. (a) Ideal case. (b) The proposed strategy.

For such candidate parameter combinations, the prediction


accuracy is similar. We select p = 6 and ε = 5 in our
experiment.

VI. E XPERIMENTAL R ESULTS AND A NALYSES


A. Experimental Settings
The performance of the proposed PW-JND model was
evaluated on MCL-JCI data set mentioned in Section IV-C.
The five well trained perceptually lossy/lossless predictors D1
to D5 mentioned in Section IV-E were used to predict PW-JND
for I1 to I5 mentioned in Section IV-C, respectively. The aim
is to ensure that the test images have never been seen by the
perceptually lossy/lossless predictors. The prediction results
Fig. 10. The performance of the proposed sliding window based PW-JND of the 50 images were obtained by combining the prediction
searching strategy with different window sizes and thresholds.
results of the five predictors. It is worth noting that I3 has been
used to select parameters for the proposed PW-JND search
first PW-JND of the test images by fixing window size p, strategy in Section V-B. The five predictors were shared in
then changing ε (ε ≤ p). In order to reduce the computational predicting the first and second PW-JND. In predicting the first
complexity, ten window sizes p ∈ [1, 2, …, 9, 10] were tested PW-JND, the test images x was set to the pristine image, and
in this work. The prediction results are shown in Fig. 10, where the distorted image set D consists of the 100 JPEG-coded
x-axis represents threshold ε and y-axis represents | PSNR|. images. In predicting the second PW-JND, x was set to the
PSNR is the value of ground truth PSNR minus prediction first ground truth PW-JND image, and D consists of all of
PSNR, and | · | is absolute value operator. A smaller | PSNR| the JPEG-compressed images with smaller QF than that of the
denotes a better performance. Each line represents a window first ground truth PW-JND image. The parameters p and ε of
size p ( p = 1 is only one point), and each point denotes a the proposed sliding window based search strategy were set
different threshold in a fixed p. It can be seen that: 1) For the same ( p = 6, ε = 5) in predicting the first and second
each fixed window size p, there is an inflection point with PW-JND.
the smallest | PSNR|. On the left of the inflection point, In order to compare the performance among the pro-
| PSNR| decreases with the increase of ε. The reason may posed PW-JND model and conventional pixel domain JND
be that a very small ε can easily result in an underestimation, models, the Free Energy Principle based Pixel domain JND
which means the predicted PSNR is larger than the ground (FEP_PJND) model [11] and Enhanced Pattern Complexity
truth PSNR and PSNR is negative. The underestimation based Pixel domain JND (EPC_PJND) model [14] were
becomes small with the increase of ε. It means that PSNR selected as comparison models. They estimate the JND thresh-
becomes larger, but | PSNR| becomes smaller. On the right old for each pixel and return a JND thresholds map. Since
of the inflection point, | PSNR| increases with the increase the pixel domain models estimate JND threshold for each
of ε. The reason may be that a large ε may result in an pixel but not the whole image, we devise a method for the
overestimation, which means the predicted PSNR is smaller comparison models to predict PW-JND of the test image x
than the ground truth PSNR and PSNR is positive. The as: 1) Z (x, Di sti ) is designed to predict whether a distorted
overestimation increases when ε becomes bigger, both of image Di sti is perceptually lossy from x or not, of which
PSNR and | PSNR| become larger. The inflection point the function is the same as that of the proposed perceptually
can be seen as the boundary between underestimation and lossy/lossless predictor (x, Di sti ) in Fig. 4(d). 2) Each
overestimation. 2) The | PSNR| of different inflection points JPEG-compressed image in the distorted image set D will be
of different window sizes p (except p = 1, 2, 3) are very close. compared with x, and Z (x, Di sti ) is designed to output 1
It means that there are many different parameter combinations if Di sti is perceptually lossy from x, otherwise 0. 3) The
( p, ε) to be chosen, such as (4, 3), (5, 4), (6, 5) and so on. distorted image with prediction label 1 and the largest QF will

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
650 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

Fig. 11. Performance comparison in predicting the first PW-JND. (a) PSNR. (b) QF.

be determined as the PW-JND for x. Z (x, Di sti ) is defined as means an overestimation, and Q F = 0 denotes the estima-
 tion is consistent with the ground truth. For a further analysis,
0, S ≤ T × λ the absolute difference (|·|) was selected as another evaluation
Z (x, Di sti ) = (11)
1, S > T × λ, index. For example, | QF| is the absolute value of QF.
A larger | QF| means a greater prediction error.
where T is the number of pixels of x, S is the number of the
pixels that changes over the corresponding JND thresholds,
and λ is a given threshold. S is defined as B. Performance of the Proposed PW-JND
Prediction Model Evaluation

m 
n
S= Ui, j , (12) 1) Predicting the First PW-JND: The prediction results of
i=1 j =1 the first PW-JND are plotted in Fig. 11(a) and Fig. 11(b),
where x-axis represents image index, and y-axis in Fig. 11(a)
where i, j is the pixel index, Ui, j describes whether the change
and Fig. 11(b) represent PSNR and QF respectively. From
of pixel Di st (i, j ) is over its JND threshold or not. Ui, j is
Fig. 11(a), it can be observed that: 1) When λ = 0, all of the
defined as
 PSNR values of FEP_PJND and EPC_JND are below x-axis,
0, Di, j ≤ Mi, j which denotes all the PW-JNDs were underestimated. On the
Ui, j = (13) other hand, when λ = 0.1, all of the PSNR values of the two
1, Di, j > Mi, j ,
comparison models are over x-axis, which denotes all the
where i, j is the pixel index, Di, j = r e f i, j − Di sti, j , and PW-JNDs were overestimated. 2) Most of PSNR values of
Mi, j is the estimated JND threshold calculated by pixel the proposed PW-JND are very close to zero, which means the
domain JND models for pixel x i, j . The test image x and proposed model has a high prediction accuracy. 3) Compared
distorted image set D were set the same as the proposed model with the two pixel domain models, all PSNR values of
mentioned in the previous paragraph. The JND map M was the proposed PW-JND are closer to zero, which denotes the
obtained by pixel domain JND model from the test image x. proposed PW-JND model has the highest prediction accuracy.
In predicting the first and second PW-JND, λ was set as 0, From Fig. 11(b), we can also come to the conclusion that the
0.05, and 0.1. Particularly, λ = 0 denotes that as long as proposed model has the highest accuracy.
there is any one pixel changing over its JND threshold in the For a further comparison, the mean and variance of the
distorted image, it will be considered as a perceptually lossy absolute difference are listed in Table IV. From the mean part,
image. it can be observed that the mean of | QF| and | PSNR| of
QF, PSNR, SSIM, Feature Similarity Index Measure- the proposed PW-JND are 8.7 and 0.8 respectively, which are
ment (FSIM), Gradient Magnitude Similarity Deviation the smallest among all of the compared models. The similar
(GMSD) [29], Visual Saliency Index (VSI) [30], and Percep- phenomenon can be obtained in other metrics. Therefore,
tually Weighted Mean Squared Error (PWMSE) [37] metrics we can conclude that the accuracy of the proposed model is
were selected to describe PW-JND. The difference in the above the highest. From the variance part in Table IV, we can see that
metrics between the predicted PW-JND and the ground truth all of variance values of the proposed model are the smallest,
PW-JND was selected as the evaluation index. For example, which denotes the proposed model is the most stable one.
QF is the result of ground truth QF minus prediction QF. Therefore, we can conclude that the proposed model performs
A positive QF denotes an underestimation, a negative QF best in predicting the PW-JND for pristine images.

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: DEEP LEARNING-BASED PICTURE-WISE JUST NOTICEABLE DISTORTION PREDICTION MODEL 651

TABLE IV
C OMPARISON OF THE F IRST PW-JND P REDICTION IN T ERMS OF THE M EAN AND VARIANCE OF A BSOLUTE P REDICTION D IFFERENCE

Fig. 12. Performance comparison in predicting the second PW-JND. (a) PSNR. (b) QF.

2) Predicting the Second PW-JND: The prediction results 3.14, 0.76, and 0.53 × 10−2, which are the smallest compared
of the second PW-JND are plotted in Fig. 12(a) and Fig. 12(b). with the other models. From the variance part, it can be seen
It can be seen that the phenomenon of Fig. 12(a) is very similar that the variances of | QF|, | PSNR|, and | SSIM| of the
to that shown in Fig. 11(a). We can obtain the following con- proposed model are 12.3, 0.65, and 0.29×10−4, which are also
clusions: 1) When λ = 0 all the PSNR were underestimated the smallest. Compared with the two comparison models with
by the two comparison models, and when λ = 0.1 all the different thresholds in different metrics, the mean and variance
PSNR were overestimated by the two comparison models. of the proposed model are the smallest except the variance of
2) The PSNR values of the proposed model are closer to FSIM. Therefore, we can conclude that the proposed model
zero than that of the two comparison models, which denotes performs best among the comparison models.
that the proposed model has the highest accuracy. Another
phenomenon is that the PSNR values are closer to ground C. Visual Quality Comparison
truth compared with the first PW-JND thresholds predicting, Moreover, subjective quality of the prediction results for
which denotes it is easier to predict the second PW-JND “ImageJND_SRC13” and “ImageJND_SRC39” are shown
than to predict the first PW-JND. The reason may be that in Fig. 13 and Fig. 14 respectively, in which the enlarged
the degradation between the distorted image and test image x patches are the most quality sensitive regions. Take Fig. 13 as
is becoming larger. The conclusions also can be convinced an example, (a) is the source image, (b) to (d) are enlarged
by Fig. 12(b), and we will not give a further analysis for patches of QF 100, the first ground truth PW-JND (QF 31), and
Fig. 12(b). the prediction result of the proposed model (QF 35), respec-
The mean and variance of the absolute difference are listed tively. From the first row, we can see that the perceptual quality
in Table V. From the mean part, we can see that the mean of of (b), (c), and (d) are very similar meanwhile the image
| QF|, | PSNR|, and | SSIM| of the proposed PW-JND are size and PSNR of (c) and (d) are very close. It demonstrates

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
652 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

TABLE V
C OMPARISON OF THE S ECOND PW-JND P REDICTION IN T ERMS OF THE M EAN AND VARIANCE OF A BSOLUTE P REDICTION D IFFERENCE

Fig. 14. Visual quality comparison of source image “ImageJND_SRC39”


in MCL-JCI, (b)-(j) are enlarged patches, and {*, *, *} denotes QF, image
size (KB), PSNR (dB) of the associated images. (a) Original image. (b) Image
Fig. 13. Visual quality comparison of source image “ImageJND_SRC13” with best quality under JPEG compression, {100, 1143.4, 49.12}. (c) The first
in MCL-JCI, (b)-(j) are enlarged patches, and {*, *, *} denotes QF, image ground truth PW-JND, {44, 132.9, 35.67}. (d) Proposed, {41, 127.4, 35.41}.
size (KB), PSNR (dB) of the associated images. (a) Original image. (b) Image (e) to (g) are prediction results of EPC_PJND [14] with λ = 0, 0.05, 0.10,
with best quality under JPEG compression, {100, 1334.1, 50.03}. (c) The first {97, 684.8, 47.03}, {23, 90.9, 33.26}, {11, 61.7, 30.20}. (h) to (j) are
ground truth PW-JND, {31, 105.7, 34.67}. (d) Proposed, {35, 113.7, 35.12}. prediction results of FEP_PJND [11] with λ = 0, 0.05, 0.10, {99, 967.9,
(e) to (g) are prediction results of EPC_PJND [14] with λ = 0, 0.05, 0.10, 48.59}, {20, 83.7, 32.72}, {5, 53.3, 28.66}.
{97, 762.4, 45.69}, {21, 85.4, 33.19}, {10, 59.2, 29.96}. (h) to (j) are
prediction results of FEP_PJND [11] with λ = 0, 0.05, 0.10, {99, 1148.2,
TABLE VI
49.63}, {15, 71.8, 31.83}, {7, 51, 28.23}.
T HE RUNNING T IME S PENT IN P REDICTING THE
F IRST PW-JND (S ECONDS )
the effectiveness of the proposed model. In the second row,
(e) to (g) and (h) to (j) are enlarged images of prediction
results of EPC_PJND [14] and FEP_PJND [11] with λ =
0, 0.05, 0.10, respectively. When λ = 0, the perceptual quality
of (e) and (h) is almost the same as that of (b), however
the size of corresponding images is much larger than that
of (d). When λ = 0.05, the distortion of (f) and (i) can be
easily perceived by HVS. When λ = 0.10, the perceptual code “imwrite (image, imageName, ‘jpeg’, ‘Quality’, QF)”
quality of (g) and (j) is unacceptable. From the second row, to compress the test image, and implemented the predictor
we can conclude that the proposed model performs better than on Tensorflow 1.2.0 and Python 3.5.2. All of the tests were
EPC_PJND and FEP_PJND model. Similar phenomenon can finished on a desktop computer with Intel CPU i-7-6700K,
be seen from Fig. 14. GPU GTX1080Ti, and 32G memory. The running time spent
in predicting the first PW-JND for the 50 pristine images
D. Computational Complexity of the PW-JND Model in MCL-JCI dataset is shown in Table VI. The test sets in
The computational complexity of the proposed model the first column are the same as Table III, and each set
mainly includes the time spent in compressing the test image includes ten 1920×1080 images. As the penultimate row
and predicting whether a JPEG-compressed image is per- shows, the mean compressing, predicting, and total time of
ceptually lossy from the test image or not. The time spent predicting the first PW-JNDs for 50 images are 229.38, 406.88,
in PW-JND searching can be ignored. We used MATLAB and 636.26 seconds respectively. We can also see from the

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: DEEP LEARNING-BASED PICTURE-WISE JUST NOTICEABLE DISTORTION PREDICTION MODEL 653

Fig. 15. Bit rate saving versus PSNR and QF of the first prediction PW-JND. (a) PSNR. (b) QF.

last row that predicting the first PW-JND for a pristine image model increases from 0.878 to 0.894 when n increases from
needs about 12.71 seconds (mean) which includes compressing 1 to 9, PSNR increases from 0.175 to 0.451 (see Fig. 15(a)),
time 4.58 seconds (mean) and predicting time 8.13 seconds and QF increases 1.78 from to 4.68 (see Fig. 15(b)). It can be
(mean). From the above observations, it can be seen that the concluded that the proposed model has a high bit rate saving
computational complexity of the proposed model is acceptable. which is close to that of the ground truth. It also has a stable
We will focus on reducing the computational complexity in the performance when the QF varies. From the dotted and solid
future work. line, we can see that the performance of the FEP_PJND and
EPC_PJND model is very similar. The bitrate saving, PSNR,
E. Application in Image Compression and Transmission and QF vary greatly when λ increases from 0.025 to 0.125.
The comparison models need to pay large PSNR (or QF) to
The PW-JND reveals the minimum difference of a picture
achieve a high bit rate saving. If we fix the PSNR (or QF),
that HVS can perceive, which can be used for selecting
the proposed model has a larger bit rate saving. If we fix
parameters in image compression. We take JPEG coder to
the bit rate saving, the proposed model has a lower PSNR
compress the 50 images in MCL-JCI dataset with the predicted
and QF. Therefore, we can conclude that the proposed model
PW-JND (QF). The mean of bit rate saving C and prediction
has a better performance than the two pixel domain models.
error Q were designed as evaluation indexes. C is defined as
  The experimental results of the second PW-JND are plotted
C = Rt e − R p Rt e × 100%, (14) on Fig. 16, where the x-axis, y-axis, lines, and legends are
the same with that in Fig. 15. We set λ the same as the first
where Rt e and R p are the bit rate of the test image x PW-JND case and n ∈ [1, 2, . . . , 5]. The bit rate saving of the
and predicted PW-JND image respectively. In predicting the proposed model is close to that of the ground truth, which is
first PW-JND, x is JPEG-compressed image with QF 100. smaller than that of the first PW-JND case. The reason may
In predicting the second PW-JND, x is the first ground truth be that the JPEG coder compresses image more at a high QF
PW-JND image; Q (Q ∈ PSNR, QF) is defined as level. We can also see from Fig. 16 the proposed model has

Q tr − Q p Q tr > Q p the best performance.
Q= (15) As mentioned in Section I, we can predict PW-JND in
0 Q tr ≤ Q p ,
the bit rate domain (R) or other discrete/continuous domains,
where Q tr and Q p are the ground truth and predicted PW-JND e.g., QF, and PSNR. Therefore, we analyse the correlation
image respectively. When Q tr > Q p , the predicted PW-JND between the prediction and ground truth PW-JNDs in QF,
image has a perceptual loss, which is regarded as a prediction PSNR, and bit rate domain. The scatter plots map of predicted
error. PW-JNDs and ground truth PW-JNDs are shown in Fig. 17,
The experimental results of the first PW-JND are shown where x-axis represents ground truth PW-JNDs, and y-axis
in Fig. 15, where y-axis represents the mean of bit rate represents the predicted PW-JNDs. We can see from Fig. 17
saving, and x-axis represents the mean of PSNR and QF in that: 1) The correlation of QF (R 2 = 0.12) and bit rate
Fig. 15(a) and Fig. 15(b) respectively. In Fig. 15, the diamond prediction (R 2 = 0.615) is low. Especially, the correlation
denotes the ground truth, the red circles from right to left of QF prediction has the lowest correlation. The ground truth
represent the results of adding n ∈ [1, 2, . . . , 9] to the PW-JND in MCL-JCI data set is a statistical value, around
predicted QF of the proposed model. The dotted and solid line which there is an ambiguous region in perceptual quality.
denote the FEP_PJND and EPC_PJND model. The squares The quality difference among the distorted images in such
and triangles represent the results when λ (see (11)) was set region is very hard to distinguish by humans. The width of
to 0.025, 0.05, 0.075, 0.1, and 0.125, respectively. From the the region is determined by image content and subjects, which
red circles, we can see that the bit rate saving of the proposed is big for most images in QF. There is a great possibility

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
654 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

Fig. 16. Bit rate saving versus PSNR and QF of the second prediction PW-JND. (a) PSNR. (b) QF.

Fig. 17. Prediction of the first PW-JND versus ground truth PW-JND in different metrics. (a) PSNR: R 2 = 0.908. (b) Bit rate: R 2 = 0.615. (c) QF:
R 2 = 0.12.

that the predicted QF falls within this region. The wider the Experimental results on comparison with the conventional just
ambiguous region, the greater possibility the predicted QF noticeable difference models demonstrate the effectiveness of
would deviate from the ground truth, thus leading to a lower the proposed model.
correlation of QF prediction. 2) The PSNR has the highest
correlation (R 2 = 0.908), which denotes that we can predict
R EFERENCES
the PW-JNDs in PSNR domain. The predicted PSNR can be
used in different image/video processing algorithms: 1) It can [1] H. Zhang, Y. Zhang, H. Wang, Y.-S. Ho, and S. Feng, “WLDISR:
Weighted local sparse representation-based depth image super-resolution
be used to compress image/video with the lowest bit rate for 3D video system,” IEEE Trans. Image Process., vol. 28, no. 2,
without perceptual quality degradation. 2) It is also helpful for pp. 561–576, Feb. 2019.
streaming system to select the images/videos with the smallest [2] L. Toni and P. Frossard, “Optimal representations for adaptive stream-
size but best quality, which can save the bandwidth without ing in interactive multiview video systems,” IEEE Trans. Multimedia,
vol. 19, no. 12, pp. 2775–2787, Dec. 2017.
damaging consumers’ experience. 3) It can be used to guide [3] Y. Zhang, X. Yang, X. Liu, Y. Zhang, G. Jiang, and S. Kwong, “High-
watermarks embedding, which ensure the impairment of the efficiency 3D depth coding based on perceptual quality of synthesized
embedded digital watermarks cannot perceived by the humans. video,” IEEE Trans. Image Process., vol. 25, no. 12, pp. 5877–5891,
Dec. 2016.
[4] L. Xu et al., “Free-energy principle inspired video quality metric and
its use in video coding,” IEEE Trans. Multimedia, vol. 18, no. 4,
VII. C ONCLUSION pp. 590–602, Apr. 2016.
In this paper, we propose a deep learning method based [5] S.-H. Bae, J. Kim, and M. Kim, “HEVC-based perceptually adaptive
video coding using a DCT-based local distortion detection probability
Picture Wise Just Noticeable Difference (PW-JND) prediction model,” IEEE Trans. Image Process., vol. 25, no. 7, pp. 3343–3357,
model. Firstly, the task of predicting PW-JND is formulated Jul. 2016.
as a multi-class classification problem, which is transformed [6] X. Zhang, S. Wang, K. Gu, W. Lin, S. Ma, and W. Gao, “Just-
noticeable difference-based perceptual optimization for JPEG com-
to a binary classification. Secondly, we construct a deep learn- pression,” IEEE Signal Process. Lett., vol. 24, no. 1, pp. 96–100,
ing based binary classifier named perceptually lossy/lossless Jan. 2017.
predictor to predict whether a distorted image is perceptually [7] Z. Luo, L. Song, S. Zheng, and N. Ling, “H.264/advanced video con-
trol perceptual optimization coding based on JND-directed coefficient
lossy to its reference or not. Finally, we propose a sliding win- suppression,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 6,
dow based PW-JND search strategy to predict the PW-JND. pp. 935–948, Jun. 2013.

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: DEEP LEARNING-BASED PICTURE-WISE JUST NOTICEABLE DISTORTION PREDICTION MODEL 655

[8] H. Su and C. Jung, “Perceptual enhancement of low light images based [32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
on two-step noise suppression,” IEEE Access, vol. 6, pp. 7005–7018, image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
2018. (CVPR), Jun. 2016, pp. 770–778.
[9] T. Zhu and L. Karam, “A no-reference objective image quality metric [33] L. Zhu, Y. Zhang, S. Wang, H. Yuan, S. Kwong, and H.-H. S. Ip, “Con-
based on perceptually weighted local noise,” EURASIP J. Image Video volutional neural network-based synthesized view quality enhancement
Process., vol. 5, pp. 1–8, Jan. 2014. for 3D video coding,” IEEE Trans. Image Process., vol. 27, no. 11,
[10] J. Wu, F. Qin, and M. Shi, “Self-similarity based structural regularity for pp. 5365–5377, Nov. 2018.
just noticeable difference estimation,” J. Vis. Commun. Image Represent., [34] S. Bosse, D. Maniry, K. R. Müller, T. Wiegand, and W. Samek,
vol. 23, no. 6, pp. 845–852, Aug. 2012. “Deep neural networks for no-reference and full-reference image quality
[11] J. Wu, G. Shi, W. Lin, A. Liu, and F. Qi, “Just noticeable difference esti- assessment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219,
mation for images with free-energy principle,” IEEE Trans. Multimedia, Jan. 2018.
vol. 15, no. 7, pp. 1705–1710, Nov. 2013. [35] V. Nair and G. E. Hinton, “Rectified linear units improve restricted
[12] J. Wu, W. Lin, G. Shi, X. Wang, and F. Li, “Pattern masking estimation boltzmann machines,” in Proc. 27th Int. Conf. Mach. Learn., 2010,
in image with structural uncertainty,” IEEE Trans. Image Process., pp. 807–814.
vol. 22, no. 12, pp. 4892–4904, Dec. 2013. [36] D. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
[13] S. Wang, L. Ma, Y. Fang, W. Lin, S. Ma, and W. Gao, “Just noticeable tion,” 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/
difference estimation for screen content images,” IEEE Trans. Image 1412.6980
Process., vol. 25, no. 8, pp. 3838–3851, May 2016. [37] S. Hu, L. Jin, H. Wang, Y. Zhang, S. Kwong, and C.-C. J. Kuo,
[14] J. Wu, L. Li, W. Dong, G. Shi, W. Lin, and C.-C. J. Kuo, “Enhanced “Compressed image quality metric based on perceptually weighted
just noticeable difference model for images with pattern complexity,” distortion,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5594–5608,
IEEE Trans. Image Process., vol. 26, no. 6, pp. 2682–2693, Jun. 2017. Dec. 2015.
[15] H. Hadizadeh, A. Rajati, and I. V. Bajić, “Saliency-guided just noticeable
distortion estimation using the normalized laplacian pyramid,” IEEE
Signal Process. Lett., vol. 24, no. 8, pp. 1218–1222, Aug. 2017.
[16] Z. Wei and K. N. Ngan, “Spatio-temporal just noticeable distortion
profile for grey scale image/video in DCT domain,” IEEE Trans. Circuits
Syst. Video Technol., vol. 19, no. 3, pp. 337–346, Mar. 2009.
[17] S.-H. Bae and M. Kim, “A novel DCT-based JND model for luminance
adaptation effect in DCT frequency,” IEEE Signal Process. Lett., vol. 20, Huanhua Liu received the B.S. degree in computer
no. 9, pp. 893–896, Sep. 2013. science and technology from Changsha University,
[18] S.-H. Bae and M. Kim, “A novel generalized DCT-based JND profile China, in 2009, and the M.S. degree in computer
based on an elaborate CM-JND model for variable block-sized trans- science and technology from the University of South
forms in monochrome images,” IEEE Trans. Image Process., vol. 23, China, China, in 2013. He is currently pursuing
no. 8, pp. 3227–3240, Aug. 2014. the Ph.D. degree with Central South University.
[19] S. Bae and M. Kim, “A DCT-based total JND profile for spatiotemporal Since 2017, he has been with the Shenzhen Insti-
and foveated masking effects,” IEEE Trans. Circuits Syst. Video Technol., tutes of Advanced Technology, Chinese Academy
vol. 27, no. 6, pp. 1196–1207, Jun. 2017. of Sciences, Shenzhen, China. His current research
[20] S. Ki, S.-H. Bae, M. Kim, and H. Ko, “Learning-based just-noticeable- interests include learning-based image/video quality
quantization-distortion modeling for perceptual video coding,” IEEE assessment and video coding.
Trans. Image Process., vol. 27, no. 7, pp. 3178–3193, Jul. 2018.
[21] X. Liu, Y. Zhang, S. Hu, S. Kwong, C.-C. J. Kuo, and Q. Peng,
“Subjective and objective video quality assessment of 3D synthesized
views with texture/depth compression distortion,” IEEE Trans. Image
Process., vol. 24, no. 12, pp. 4847–4861, Dec. 2015.
[22] S. Hu, H. Wang, and C.-C. J. Kuo, “A GMM-based stair quality model
for human perceived JPEG images,” in Proc. IEEE Int. Conf. Acoust.
Yun Zhang (M’12–SM’16) received the B.S. and
Speech Signal Process., Mar. 2016, pp. 1070–1074.
M.S. degrees in electrical engineering from Ningbo
[23] L. Jin et al., “Statistical study on perceived JPEG image quality via
University, Ningbo, China, in 2004 and 2007,
MCL-JCI dataset construction and analysis,” Electron. Imag., vol. 2016,
respectively, and the Ph.D. degree in computer
no. 13, pp. 1–9, Feb. 2016.
science from the Institute of Computing Technol-
[24] H. Wang et al., “VideoSet: A large-scale compressed video quality
ogy (ICT), Chinese Academy of Sciences (CAS),
dataset based on JND measurement,” J. Vis. Commun. Image Represent.,
Beijing, China, in 2010. From 2009 to 2014, he was
vol. 46, pp. 292–302, Jul. 2017.
a Postdoctoral Researcher with the Department of
[25] C. Fan, Y. Zhang, H. Zhang, R. Hamzaouic, and Q. Jiang, “Picture-
Computer Science, City University of Hong Kong,
level just noticeable difference for symmetrically and asymmetrically
Hong Kong. From 2010 to 2017, he was an Assis-
compressed stereoscopic images: Subjective quality assessment study
tant Professor and an Associate Professor with the
and datasets,” J. Vis. Commun. Image Represent., vol. 62, pp. 140–151,
Shenzhen Institutes of Advanced Technology (SIAT), CAS, Shenzhen, China,
Jul. 2019.
where he is currently a Professor. His research interests include video
[26] Q. Huang, H. Wang, S. C. Lim, H. Y. Kim, S. Y. Jeong, and C.-C. J. Kuo,
compression, 3D video processing, and visual perception.
“Measure and prediction of HEVC perceptually lossy/lossless boundary
QP values,” in Proc. IEEE Data Compress. Conf. (DCC), Apr. 2017,
pp. 42–51.
[27] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural
similarity for image quality assessment,” in Proc. 37th Asilomar Conf.
Signals, Syst. Comput., vol. 2, Nov. 2003, pp. 1398–1402.
[28] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity
index for image quality assessment,” IEEE Trans. Image Process., Huan Zhang received the B.S. degree from the
vol. 20, no. 8, pp. 2378–2386, Aug. 2011. Civil Aviation University of China, Tianjin, China,
[29] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude in 2010, and the M.S. degree from Tsinghua Uni-
similarity deviation: A highly efficient perceptual image quality index,” versity, Beijing, China, in 2013. She is currently
IEEE Trans. Image Process., vol. 23, no. 2, pp. 684–695, Feb. 2014. pursuing the Ph.D. degree with the University of
[30] L. Zhang, Y. Shen, and H. Li, “VSI: A visual saliency-induced index Chinese Academy of Sciences, China. Her research
for perceptual image quality assessment,” IEEE Trans. Image Process., interests include image restoration and image/video
vol. 23, no. 10, pp. 4270–4281, Aug. 2014. quality assessment.
[31] L. Xu et al., “Multi-task rank learning for image quality assessment,”
IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 9, pp. 1833–1843,
Sep. 2017.

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.
656 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020

Chunling Fan received the M.S. degree from C.-C. Jay Kuo (F’99) received the B.S. degree in
Nanjing Normal University, Nanjing, in 2011. electrical engineering from National Taiwan Uni-
She is currently pursuing the Ph.D. degree with versity, Taipei, Taiwan, in 1980, and the M.S. and
the Shenzhen Institutes of Advanced Technol- Ph.D. degrees in electrical engineering from the
ogy (SIAT), University of Chinese Academy of Massachusetts Institute of Technology, Cambridge
Sciences, Shenzhen, China. Her research inter- MA, USA, in 1985 and 1987, respectively. He is
ests include image processing and image quality currently the Director of the Multimedia Commu-
assessment. nications Laboratory and a Professor of electrical
engineering, computer science and mathematics with
the Ming-Hsieh Department of Electrical Engineer-
ing, University of Southern California, Los Angeles,
CA, USA. He has coauthored about 200 journal papers, 850 conference
papers, and ten books. His research interests include digital image/video
analysis and modeling, multimedia data compression, communication and
networking, and biological signal/image processing. He is a fellow of The
American Association for the Advancement of Science and The International
Society for Optical Engineers.

Xiaoping Fan received the B.S. degree in electrical


engineering from the Jiangxi University of Tech-
nology (Nanchang University), Nanchang, China,
Sam Kwong (F’13) received the B.S. degree in in 1981, the M.S. degree in traffic information engi-
electrical engineering from the State University of neering and control from Changsha Railway Univer-
New York at Buffalo in 1983, the M.S. degree sity (Central South University), Changsha, Hunan,
in electrical engineering from the University of in 1984, and the Ph.D. degree in control science and
Waterloo, Waterloo, ON, Canada, in 1985, and engineering jointly from the South China University
the Ph.D. degree from the University of Hagen, of Technology, Guangzhou, and The Hong Kong
Germany, in 1996. From 1985 to 1987, he was Polytechnic University, Hong Kong, in 1998. He
a Diagnostic Engineer with Control Data Canada. was a Professor with the School of Information
He joined Bell Northern Research Canada as a mem- Science and Engineering, Central South University. Since 2010, he has been
ber of Scientific Staff. In 1990, he became a Lecturer a Professor with the Laboratory of Networked Systems, Hunan University
with the Department of Electronic Engineering, City of Finance and Economics. He is the author of two books, over 300 journal
University of Hong Kong, Hong Kong, where he is currently a Professor with and conference papers, and 15 inventions. His research interests include robot
the Department of Computer Science. His research interests include video and control, wireless sensor networks, data mining, and intelligent transportation
image coding and evolutionary algorithms. systems.

Authorized licensed use limited to: Lappeenranta-Lahti University of Technology LUT. Downloaded on March 20,2024 at 08:58:35 UTC from IEEE Xplore. Restrictions apply.

You might also like