Deep Learning-Based Picture-Wise Just Noticeable Distortion Prediction Model for Image Compression
Abstract— Picture Wise Just Noticeable Difference (PW-JND), which accounts for the minimum difference of a picture that the human visual system can perceive, can be widely used in perception-oriented image and video processing. However, the conventional Just Noticeable Difference (JND) models calculate the JND threshold for each pixel or sub-band separately, which may not reflect the total masking effect of a picture accurately. In this paper, we propose a deep learning based PW-JND prediction model for image compression. Firstly, we formulate the task of predicting PW-JND as a multi-class classification problem, and propose a framework to transform the multi-class classification problem into a binary classification problem solved by just one binary classifier. Secondly, we construct a deep learning based binary classifier named perceptually lossy/lossless predictor, which can predict whether an image is perceptually lossy to another or not. Finally, we propose a sliding window based search strategy to predict PW-JND based on the prediction results of the perceptually lossy/lossless predictor. Experimental results show that the mean accuracy of the perceptually lossy/lossless predictor reaches 92%, and the absolute prediction error of the proposed PW-JND model is 0.79 dB on average, which shows the superiority of the proposed PW-JND model over the conventional JND models.

Index Terms— Just noticeable distortion, convolutional neural network, visual perception, image quality assessment.

Manuscript received July 23, 2018; revised February 4, 2019 and May 27, 2019; accepted July 29, 2019. Date of publication August 13, 2019; date of current version September 25, 2019. This work was supported in part by the National Natural Science Foundation of China under Grant 61871372, in part by the Guangdong Natural Science Foundation for Distinguished Young Scholar under Grant 2016A030306022, in part by the Key Project for Guangdong Provincial Science and Technology Development under Grant 2017B010110014, in part by the Shenzhen International Collaborative Research Project under Grant GJHZ20170314155404913, in part by the Shenzhen Science and Technology Program under Grant JCYJ20170811160212033, in part by the Guangdong International Science and Technology Cooperative Research Project under Grant 2018A050506063, in part by the Membership of Youth Innovation Promotion Association, Chinese Academy of Sciences under Grant 2018392, and in part by the Shenzhen Science and Technology Plan Project under Grant JCYJ20180507183823045. The associate editor coordinating the review of this article and approving it for publication was Prof. Damon M. Chandler. (Corresponding author: Yun Zhang.)

H. Liu is with the School of Computer Science and Engineering, Central South University, Changsha 410075, China, and also with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: [email protected]).

Y. Zhang is with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: [email protected]).

H. Zhang and C. Fan are with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China, and also with the Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen 518055, China (e-mail: [email protected]; [email protected]).

S. Kwong is with the Department of Computer Science, City University of Hong Kong, Hong Kong (e-mail: [email protected]).

C.-C. J. Kuo is with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089 USA (e-mail: [email protected]).

X. Fan is with the School of Information Technology and Management, Hunan University of Finance and Economics, Changsha 410205, China, and also with the School of Automation, Central South University, Changsha 410075, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIP.2019.2933743

I. INTRODUCTION

The Ultra-High-Definition (UHD), Three Dimensional (3D) [1], and Virtual Reality (VR) images and videos, with the ability to provide a more immersive and realistic experience than conventional multimedia, are becoming more and more popular with consumers in streaming services [2]. However, the bandwidth and storage required to support UHD, 3D, and VR streaming services are several or more times those required for traditional images and videos, which has become a bottleneck of the streaming services industry. The mainstream of the current image/video coding techniques [3] is signal-processing-based, mainly considering the statistical properties of visual content, and it is becoming difficult to achieve further reduction in the size of images and videos without perceptual quality degradation. As we know, the ultimate receiver of most visual content is the Human Visual System (HVS); therefore, it is important to develop image/video processing technologies incorporating the characteristics of the HVS [4] for the streaming services industry.

It is well known that humans cannot perceive small changes in images/videos due to the psychological and physiological mechanisms of the HVS. Therefore, processed images/videos have visual redundancy which can be removed without any perceptual quality degradation. Just Noticeable Difference (JND) refers to the minimum distortion the HVS can perceive, and has been widely used in image/video processing, e.g., perceptual image/video coding [5]–[7], image enhancement [8], and objective quality estimation [9]. The existing JND models can be divided into two categories: 1) pixel-domain models [10]–[15], which calculate the JND threshold for each pixel directly in the pixel domain; and 2) sub-band domain models, which transfer pixel domain images to a sub-band domain, e.g., by the Discrete Cosine Transform (DCT), and then calculate the JND threshold for each sub-band [16]–[20].
Pixel domain JND models mainly focus on background luminance adaptation and spatial contrast masking. In [10], a novel spatial masking function was introduced, which was combined with luminance adaptation to deduce the overall JND thresholds. Wu et al. [11] suggested that there exists a disorderly concealment effect resulting in high JND thresholds in disorderly regions, and proposed a free energy based JND model aiming to improve the accuracy of JND threshold estimation in texture regions. In [12], a novel pattern masking function deduced from luminance contrast and structural uncertainty was incorporated into the proposed JND model. Wang et al. [13] proposed an edge profile reconstruction based JND model for screen content images, in which each edge profile was decomposed into its luminance, contrast, and structure parts, each of which was evaluated separately. Wu et al. [14] proposed an improved pattern masking function based model, where pattern complexity was calculated as the diversity of orientation in a local region. In [15], a new JND model was proposed which considers visual saliency. However, pixel domain JND models can hardly be incorporated into sub-band image/video compression systems.

The sub-band domain JND models mainly focus on the Contrast Sensitivity Function (CSF), luminance adaptation, contrast masking, and foveated masking. Wei and Ngan [16] proposed a CSF-based spatio-temporal JND model, in which gamma correction was introduced to compensate for the luminance adaptation effect. In [17], a luminance adaptation JND model was proposed which took frequency characteristics into account in the luminance adaptation. Bae and Kim [18] proposed a novel JND model applicable to any size of transform kernel, which introduced a new texture complexity metric to measure the contrast masking effect. In [19], foveated masking was incorporated into the proposed temporal-foveated masking model, which also considered the difference between moving and still objects. Recently, learning techniques have also been applied to estimating JND thresholds. In [20], Ki et al. proposed a regression based method to estimate the JND thresholds under distortion with energy reduction. However, the sub-band domain models require a DCT transform, and can hardly estimate the thresholds of complicated texture regions accurately, since each block is isolated from its surroundings.

The pixel/sub-band domain JND models compute the JND threshold for each pixel/sub-band separately. With a simple summation of the estimated JND thresholds, they may not reflect the total masking effect of a picture accurately. Different regions contribute differently to the image quality, and some critical regions together with the worst quality ones determine the overall image quality [21]. Moreover, the traditional JND models mainly focus on pristine images/videos rather than distorted ones, which limits their applications, since the images/videos fed into real-world applications are usually degraded. Recent studies [22]–[25] demonstrated that humans cannot perceive continuous-scale visual quality that changes over a range of coding bit rate, and this phenomenon was quantified based on the notion of JND. Hu et al. [22] proposed a subjective methodology to find the JND images under Joint Photographic Experts Group (JPEG) compression, which are the transition images between two adjacent perceptual quality levels. The distortion of a JND image reflects the total masking effect of a picture accurately, and can be defined as the Picture Wise Just Noticeable Difference (PW-JND), referring to the minimum difference of a picture that can be perceived by the HVS. Jin et al. [23] constructed the first JND based image quality data set, MCL-JCI. Wang et al. [24] proposed a subjective methodology to find video-wise JND videos and constructed the first video-wise JND based quality data set, MCL-JCV. Fan et al. [25] studied the PW-JND of symmetrically and asymmetrically compressed stereoscopic images for JPEG2000 and H.265 intra coding, and provided two PW-JND based stereo image quality datasets: one for symmetric compression and one for asymmetric compression. Huang et al. [26] proposed a machine learning approach to predict the mean of the group-based JND distribution using extracted video features. As is well known, subjective prediction methods are too time consuming to be applied in real-world systems. Therefore, it is worthwhile to devise an objective PW-JND prediction method, which is more challenging than estimating pixel/sub-band JND thresholds, since more factors affect the PW-JND thresholds, e.g., distortion type, contrast masking, and luminance adaptation. In this paper, we propose a PW-JND model to predict the PW-JND for pristine and distorted images. The main contributions of our work can be summarized as follows:

1) We formulate the task of predicting PW-JND as a multi-class classification problem, and propose a framework to transform the multi-class classification problem into a binary classification problem.

2) We construct a deep learning based binary classifier named perceptually lossy/lossless predictor. It can predict whether a distorted image is perceptually lossy to its reference or not. The experimental results show that its mean accuracy reaches 92%.

3) We propose a sliding window based search strategy to predict the PW-JND based on the prediction results of the perceptually lossy/lossless predictor.

The paper is organized as follows. In Section II, we present the motivation of predicting PW-JND, which is formulated as a multi-class classification problem. Section III presents the framework of the proposed PW-JND model, which transforms the multi-class classification problem into a binary classification problem. In Section IV, we propose a deep learning based perceptually lossy/lossless predictor which can predict whether a distorted image is perceptually lossy to its reference or not, and evaluate the performance of the predictor. In Section V, we propose a sliding window based PW-JND search strategy. In Section VI, we report the experimental results. Section VII concludes this paper.

II. MOTIVATION AND PROBLEM FORMULATION

The traditional Rate-Distortion (R-D) function shown in Fig. 1(a) is continuous and convenient for the computation of coding systems. However, the visual quality experience of humans is a discrete rather than a continuous process. Recent studies [22], [24] demonstrated that humans can only distinguish several limited quality levels of an image/video changing over a range of bit rate. A perceptual distortion function
Fig. 4. Different decompositions of the multi-class classification problem. (a) Multi-class classifier. (b) O-A hierarchy decomposition. (c) Binary hierarchy decomposition. (d) Proposed model.
TABLE II: Configurations of the Perceptually Lossy/Lossless Predictor Network
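As an illustration of the patch-based design described in Section IV (patches from the reference and distorted images pass through shared convolutional weights, and the per-patch scores are pooled into a single image-wise decision), a minimal sketch follows. It is written in PyTorch purely for illustration; the paper's implementation used TensorFlow 1.2.0, and the layer widths below are assumptions rather than the exact configuration of Table II.

```python
import torch
import torch.nn as nn

class LossyLosslessPredictor(nn.Module):
    """Sketch of a patch-based lossy/lossless binary predictor."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(          # shared CNN over 32x32 patches
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(2 * 64, 1)        # fuse reference/distorted features

    def forward(self, ref_patches, dist_patches):
        # ref_patches, dist_patches: (B, N, 3, 32, 32)
        b, n = ref_patches.shape[:2]
        f_r = self.features(ref_patches.flatten(0, 1))
        f_d = self.features(dist_patches.flatten(0, 1))
        score = self.head(torch.cat([f_r, f_d], dim=1))  # one logit per patch
        return score.view(b, n).mean(dim=1)              # pool N patches per image
```

The forward pass returns one logit per image; thresholding its sigmoid at 0.5 yields the perceptually lossy/lossless decision.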
Fig. 7. Validation accuracy of the perceptually lossy/lossless predictor with different patch sizes and numbers of selected patches. (a) Different patch sizes. (b) Different numbers of selected patches.

C. Dataset Generation

The MCL-JCI dataset [23] is the first PW-JND based image quality dataset. It comprises 50 pristine images, and each pristine image has 100 JPEG-compressed images with QF ranging from 1 to 100. Most of the pristine images have 4 to 8 PW-JND images found by subjects, each of which is the transitional image between two adjacent perceptual quality levels. In order to train the proposed predictor, we make use of the PW-JND images to generate perceptually lossy/lossless training samples. A perceptually lossy/lossless sample can be described as (x_t, y_t), which consists of image data x_t and label y_t: 1) x_t consists of the reference ref_k and a distorted image Dist_i, where ref_k is the k-th PW-JND image (a larger k corresponds to a smaller QF), and i is the QF value of the corresponding distorted image Dist_i, which ranges from 1 to one less than the QF of ref_k. In particular, ref_0 represents the pristine image, and the QF value i of its corresponding distorted images ranges from 1 to 100. 2) y_t is labeled as one when i ranges from 1 to the QF value of the (k+1)-th PW-JND image, which denotes that Dist_i is perceptually lossy to ref_k; otherwise, y_t is labeled as zero, which denotes that Dist_i is perceptually lossless with respect to ref_k. The first, second, third, and fourth ground truth PW-JND images in the MCL-JCI data set were used as reference images to generate training samples. Finally, we generated 5003 positive and 3974 negative image samples.

For the convenience of training, the generated data set was split into five subsets. Firstly, the 50 pristine images were randomly divided into five equal groups I_1, I_2, I_3, I_4, and I_5, each of which includes 10 images. Then, the samples generated from the images in the same group were added into one subset. Finally, the generated data set was split as {S_1, S_2, S_3, S_4, S_5} according to the division of the pristine images {I_1, I_2, I_3, I_4, I_5}. There is no overlap among the five subsets.

D. Validation and Optimal Parameter Determination

Each image fed into the proposed perceptually lossy/lossless predictor is represented by N patches of size M × M. The patches share the convolutional neural networks, and each patch can be seen as a sample. Therefore, the number of training samples is many times that of the image samples. Random cropping data augmentation was used in training: all of the patches were randomly cropped from the image every epoch to ensure that as many different patches as possible can be used [34]. In validation, the N random patches for each image were sampled only once at the beginning of training in order to avoid noise. In the experiment, the patch size M × M was set to 32 × 32 and the number of patches N was set to 32. The variable-controlling approach was taken to determine M and N, which will be described in detail in the following paragraph. A truncated normal initializer was used to initialize the network weights, and Adam [36] was taken as the gradient descent optimization method. The learning rate was initialized as 1 × 10^−4 and decreases as the number of iterations increases. Each mini-batch contains 4 images, each represented by 32 randomly sampled image patches, which leads to an effective batch size of 128 patches. All of the training parameters were updated once when a mini-batch was processed. It is worth noting that the proposed predictor is a patch-based network which predicts an image-wise result by pooling the quality of the selected patches. Therefore, the selected patches cannot cross mini-batches. We implemented the networks based on TensorFlow 1.2.0 and Python 3.5.2, then trained them on a machine with an NVIDIA GTX1080Ti GPU and 32 GB of memory.

The patch size M and the number of patches N are two key factors for the proposed perceptually lossy/lossless predictor. However, there are so many combinations of (M, N) that it is difficult to find the globally optimal combination through exhaustive search. Firstly, the number of patches N was fixed as 32 [34], and the other variables were held constant except the patch size. The training and validation sets were randomly selected from the generated data set {S_1, S_2, S_3, S_4, S_5}. Six patch sizes, M ∈ [8, 16, 32, 64, 128, 256], were tested. The validation accuracy is shown in Fig. 7(a), where the x-axis denotes the patch size and the y-axis denotes the validation accuracy. It can be seen that the validation accuracy improves with the increase of patch size as long as the patch size is not beyond 32. The reason is that the quality of a larger block is closer to the quality of the entire image. However, the situation is just the opposite when the patch size is larger than 64. The reason may be that the number of network parameters increases rapidly with the patch size, which means that more training data are needed. Secondly, the patch size M was fixed as 32 (32 × 32), and N ∈ [8, 16, 32, 64, 128, 256] were tested. The validation accuracy is shown in Fig. 7(b), where the x-axis denotes the number of selected patches and the y-axis denotes the validation accuracy. The accuracy is low when N is 8 or 16, and it reaches a high value when N is 32. The reason may be that a very small number of selected patches (N = 8, 16) can hardly represent the whole image quality. It can also be observed that there is no significant gain in validation accuracy with the increase of N when N is larger than 32. Since a larger N brings more
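The patch handling described in this subsection can be sketched in a few lines of NumPy, assuming co-located random crops from the reference and distorted images; helper names and array shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def sample_patch_pairs(ref, dist, n=32, m=32, rng=np.random):
    """Crop the same n random m x m windows from both H x W x 3 images."""
    h, w = ref.shape[:2]
    ys = rng.randint(0, h - m + 1, size=n)
    xs = rng.randint(0, w - m + 1, size=n)
    crop = lambda img: np.stack([img[y:y + m, x:x + m] for y, x in zip(ys, xs)])
    return crop(ref), crop(dist)

def minibatch(pairs, rng=np.random):
    """pairs: list of (ref, dist) images -> two (4, 32, 32, 32, 3) arrays,
    i.e., 4 images x 32 patches = an effective batch of 128 patches."""
    idx = rng.choice(len(pairs), size=4, replace=False)
    batches = [sample_patch_pairs(*pairs[i]) for i in idx]
    refs = np.stack([b[0] for b in batches])
    dists = np.stack([b[1] for b in batches])
    return refs, dists   # patches of one image never cross mini-batches
```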
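Likewise, the sample generation of Section IV-C can be sketched as follows; compress(img, qf) is a hypothetical JPEG helper, and the construction of the reference list is a simplified reading of the labeling rule above.

```python
def make_samples(pristine, jnd_qfs, compress):
    """pristine: source image; jnd_qfs: QF values of its ground truth
    PW-JND images in decreasing order. Label 1 = perceptually lossy
    w.r.t. the reference, label 0 = perceptually lossless."""
    samples = []
    # Reference 0 is the pristine image (distorted QFs 1..100); reference
    # k >= 1 is the k-th PW-JND image (distorted QFs below its own QF).
    refs = [(pristine, 101)] + [(compress(pristine, q), q) for q in jnd_qfs[:-1]]
    for k, (ref_img, ref_qf) in enumerate(refs):
        next_jnd_qf = jnd_qfs[k]          # QF of the (k+1)-th PW-JND image
        for i in range(1, ref_qf):        # every distorted QF below the reference
            label = 1 if i <= next_jnd_qf else 0
            samples.append(((ref_img, compress(pristine, i)), label))
    return samples
```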
Fig. 9. PW-JND image searching. (a) Ideal case. (b) The proposed strategy.
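A minimal sketch of such a predictor-driven search follows, assuming hypothetical compress and is_lossy helpers; the window-vote stopping rule below is a simplification used only to absorb isolated prediction errors, not the exact procedure of Fig. 9(b).

```python
def search_pw_jnd(ref, compress, is_lossy, win=3):
    """Scan QF from 99 down to 1 and return the QF at the onset of the
    first window of `win` consecutive lossy decisions."""
    votes = []
    for qf in range(99, 0, -1):
        votes.append(is_lossy(ref, compress(ref, qf)))
        if len(votes) >= win and all(votes[-win:]):
            return qf + win - 1   # highest QF inside the all-lossy window
    return 1                      # predictor never settled on "lossy"
```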
Fig. 11. Performance comparison in predicting the first PW-JND. (a) ΔPSNR. (b) ΔQF.
be determined as the PW-JND for x. Z(x, Dist_i) is defined as

Z(x, Dist_i) = { 0, if S ≤ T × λ;  1, if S > T × λ },   (11)

where T is the number of pixels of x, S is the number of pixels that change beyond their corresponding JND thresholds, and λ is a given threshold. S is defined as

S = Σ_{i=1}^{m} Σ_{j=1}^{n} U_{i,j},   (12)

where i, j is the pixel index and U_{i,j} describes whether the change of pixel Dist(i, j) exceeds its JND threshold or not. U_{i,j} is defined as

U_{i,j} = { 0, if D_{i,j} ≤ M_{i,j};  1, if D_{i,j} > M_{i,j} },   (13)

where D_{i,j} = ref_{i,j} − Dist_{i,j}, and M_{i,j} is the JND threshold estimated by a pixel domain JND model for pixel x_{i,j}. The test image x and the distorted image set D were set the same as for the proposed model mentioned in the previous paragraph. The JND map M was obtained by the pixel domain JND model from the test image x. In predicting the first and second PW-JND, λ was set to 0, 0.05, and 0.1. In particular, λ = 0 denotes that as long as any one pixel changes beyond its JND threshold in the distorted image, it is considered a perceptually lossy image.

The QF, PSNR, SSIM, Feature Similarity Index Measurement (FSIM), Gradient Magnitude Similarity Deviation (GMSD) [29], Visual Saliency Index (VSI) [30], and Perceptually Weighted Mean Squared Error (PWMSE) [37] metrics were selected to describe the PW-JND. The difference in the above metrics between the predicted PW-JND and the ground truth PW-JND was selected as the evaluation index. For example, ΔQF is the result of the ground truth QF minus the predicted QF. A positive ΔQF denotes an underestimation, a negative ΔQF means an overestimation, and ΔQF = 0 denotes that the estimation is consistent with the ground truth. For further analysis, the absolute difference (|·|) was selected as another evaluation index. For example, |ΔQF| is the absolute value of ΔQF. A larger |ΔQF| means a greater prediction error.

B. Performance Evaluation of the Proposed PW-JND Prediction Model

1) Predicting the First PW-JND: The prediction results of the first PW-JND are plotted in Fig. 11(a) and Fig. 11(b), where the x-axis represents the image index, and the y-axes in Fig. 11(a) and Fig. 11(b) represent ΔPSNR and ΔQF respectively. From Fig. 11(a), it can be observed that: 1) When λ = 0, all of the ΔPSNR values of FEP_PJND and EPC_PJND are below the x-axis, which denotes that all the PW-JNDs were underestimated. On the other hand, when λ = 0.1, all of the ΔPSNR values of the two comparison models are above the x-axis, which denotes that all the PW-JNDs were overestimated. 2) Most of the ΔPSNR values of the proposed PW-JND model are very close to zero, which means the proposed model has a high prediction accuracy. 3) Compared with the two pixel domain models, all ΔPSNR values of the proposed PW-JND model are closer to zero, which denotes that the proposed PW-JND model has the highest prediction accuracy. From Fig. 11(b), we can also come to the conclusion that the proposed model has the highest accuracy.

For a further comparison, the mean and variance of the absolute difference are listed in Table IV. From the mean part, it can be observed that the means of |ΔQF| and |ΔPSNR| of the proposed PW-JND model are 8.7 and 0.8 respectively, which are the smallest among all of the compared models. A similar phenomenon can be observed in the other metrics. Therefore, we can conclude that the accuracy of the proposed model is the highest. From the variance part in Table IV, we can see that all of the variance values of the proposed model are the smallest, which denotes that the proposed model is the most stable one. Therefore, we can conclude that the proposed model performs best in predicting the PW-JND for pristine images.
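For reference, the decision rule of Eqs. (11)–(13) used by the pixel domain comparison models transcribes directly to code; the magnitude comparison |ref − dist| > M is an assumption about how the signed difference D is meant to be thresholded.

```python
import numpy as np

def z_lossy(ref, dist, jnd_map, lam):
    """Eqs. (11)-(13): declare the distorted image perceptually lossy
    (return 1) when more than a fraction lam of its T pixels change
    beyond the JND map M of a pixel domain model such as [11] or [14]."""
    d = np.abs(ref.astype(np.float64) - dist.astype(np.float64))  # D_{i,j}
    u = d > jnd_map            # Eq. (13): per-pixel supra-threshold change
    s = int(u.sum())           # Eq. (12): count of supra-threshold pixels
    t = ref.size               # T: number of pixels
    return int(s > t * lam)    # Eq. (11)
```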
TABLE IV: Comparison of the First PW-JND Prediction in Terms of the Mean and Variance of Absolute Prediction Difference
Fig. 12. Performance comparison in predicting the second PW-JND. (a) ΔPSNR. (b) ΔQF.
2) Predicting the Second PW-JND: The prediction results of the second PW-JND are plotted in Fig. 12(a) and Fig. 12(b). It can be seen that the phenomenon in Fig. 12(a) is very similar to that shown in Fig. 11(a). We can draw the following conclusions: 1) When λ = 0, all the ΔPSNR values were underestimated by the two comparison models, and when λ = 0.1, all the ΔPSNR values were overestimated by the two comparison models. 2) The ΔPSNR values of the proposed model are closer to zero than those of the two comparison models, which denotes that the proposed model has the highest accuracy. Another phenomenon is that the ΔPSNR values are closer to the ground truth than in the first PW-JND prediction, which denotes that it is easier to predict the second PW-JND than the first PW-JND. The reason may be that the degradation between the distorted image and the test image x becomes larger. These conclusions are also confirmed by Fig. 12(b), and we will not give a further analysis of Fig. 12(b).

The mean and variance of the absolute difference are listed in Table V. From the mean part, we can see that the means of |ΔQF|, |ΔPSNR|, and |ΔSSIM| of the proposed PW-JND model are 3.14, 0.76, and 0.53 × 10^−2, which are the smallest compared with the other models. From the variance part, it can be seen that the variances of |ΔQF|, |ΔPSNR|, and |ΔSSIM| of the proposed model are 12.3, 0.65, and 0.29 × 10^−4, which are also the smallest. Compared with the two comparison models with different thresholds in different metrics, the mean and variance of the proposed model are the smallest except the variance of ΔFSIM. Therefore, we can conclude that the proposed model performs best among the comparison models.

C. Visual Quality Comparison

Moreover, the subjective quality of the prediction results for "ImageJND_SRC13" and "ImageJND_SRC39" is shown in Fig. 13 and Fig. 14 respectively, in which the enlarged patches are the most quality sensitive regions. Take Fig. 13 as an example: (a) is the source image, and (b) to (d) are enlarged patches of QF 100, the first ground truth PW-JND (QF 31), and the prediction result of the proposed model (QF 35), respectively. From the first row, we can see that the perceptual quality of (b), (c), and (d) is very similar, while the image size and PSNR of (c) and (d) are very close. It demonstrates
TABLE V: Comparison of the Second PW-JND Prediction in Terms of the Mean and Variance of Absolute Prediction Difference
Fig. 15. Bit rate saving versus ΔPSNR and ΔQF for the first predicted PW-JND. (a) ΔPSNR. (b) ΔQF.
last row that predicting the first PW-JND for a pristine image needs about 12.71 seconds (mean), which includes a mean compression time of 4.58 seconds and a mean prediction time of 8.13 seconds. From the above observations, it can be seen that the computational complexity of the proposed model is acceptable. We will focus on reducing the computational complexity in future work.

E. Application in Image Compression and Transmission

The PW-JND reveals the minimum difference of a picture that the HVS can perceive, which can be used for selecting parameters in image compression. We take the JPEG coder to compress the 50 images in the MCL-JCI dataset with the predicted PW-JND (QF). The mean of the bit rate saving ΔC and the prediction error ΔQ were designed as evaluation indexes. ΔC is defined as

ΔC = (R_te − R_p) / R_te × 100%,   (14)

where R_te and R_p are the bit rates of the test image x and the predicted PW-JND image respectively. In predicting the first PW-JND, x is the JPEG-compressed image with QF 100. In predicting the second PW-JND, x is the first ground truth PW-JND image. ΔQ (Q ∈ {PSNR, QF}) is defined as

ΔQ = { Q_tr − Q_p, if Q_tr > Q_p;  0, if Q_tr ≤ Q_p },   (15)

where Q_tr and Q_p are the values of the ground truth and predicted PW-JND images respectively. When Q_tr > Q_p, the predicted PW-JND image has a perceptual loss, which is regarded as a prediction error.

The experimental results of the first PW-JND are shown in Fig. 15, where the y-axis represents the mean of the bit rate saving, and the x-axis represents the mean of ΔPSNR and ΔQF in Fig. 15(a) and Fig. 15(b) respectively. In Fig. 15, the diamond denotes the ground truth, and the red circles from right to left represent the results of adding n ∈ [1, 2, . . . , 9] to the predicted QF of the proposed model. The dotted and solid lines denote the FEP_PJND and EPC_PJND models. The squares and triangles represent the results when λ (see (11)) was set to 0.025, 0.05, 0.075, 0.1, and 0.125, respectively. From the red circles, we can see that the bit rate saving of the proposed model increases from 0.878 to 0.894 when n increases from 1 to 9, ΔPSNR increases from 0.175 to 0.451 (see Fig. 15(a)), and ΔQF increases from 1.78 to 4.68 (see Fig. 15(b)). It can be concluded that the proposed model has a high bit rate saving which is close to that of the ground truth. It also has a stable performance when the QF varies. From the dotted and solid lines, we can see that the performance of the FEP_PJND and EPC_PJND models is very similar. The bit rate saving, ΔPSNR, and ΔQF vary greatly when λ increases from 0.025 to 0.125. The comparison models need to pay a large ΔPSNR (or ΔQF) to achieve a high bit rate saving. If we fix the ΔPSNR (or ΔQF), the proposed model has a larger bit rate saving. If we fix the bit rate saving, the proposed model has a lower ΔPSNR and ΔQF. Therefore, we can conclude that the proposed model has a better performance than the two pixel domain models.

The experimental results of the second PW-JND are plotted in Fig. 16, where the x-axis, y-axis, lines, and legends are the same as those in Fig. 15. We set λ the same as in the first PW-JND case and n ∈ [1, 2, . . . , 5]. The bit rate saving of the proposed model is close to that of the ground truth, and smaller than in the first PW-JND case. The reason may be that the JPEG coder compresses images more at a high QF level. We can also see from Fig. 16 that the proposed model has the best performance.

As mentioned in Section I, we can predict PW-JND in the bit rate domain (R) or other discrete/continuous domains, e.g., QF and PSNR. Therefore, we analyse the correlation between the predicted and ground truth PW-JNDs in the QF, PSNR, and bit rate domains. The scatter plots of the predicted PW-JNDs versus the ground truth PW-JNDs are shown in Fig. 17, where the x-axis represents the ground truth PW-JNDs, and the y-axis represents the predicted PW-JNDs. We can see from Fig. 17 that: 1) The correlation of the QF (R² = 0.12) and bit rate (R² = 0.615) predictions is low; in particular, the QF prediction has the lowest correlation. The ground truth PW-JND in the MCL-JCI data set is a statistical value, around which there is an ambiguous region in perceptual quality. The quality difference among the distorted images in such a region is very hard for humans to distinguish. The width of the region is determined by image content and subjects, and is large for most images in the QF domain. There is a great possibility that the predicted QF falls within this region.
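The evaluation indexes of Eqs. (14) and (15) transcribe to two small helpers; the function names are illustrative.

```python
def bitrate_saving(r_te, r_p):
    """Eq. (14): bit rate saving in percent, given the bit rates of the
    test image (r_te) and the predicted PW-JND image (r_p)."""
    return (r_te - r_p) / r_te * 100.0

def prediction_error(q_tr, q_p):
    """Eq. (15): one-sided error for Q in {PSNR, QF}; zero whenever the
    predicted PW-JND image has no perceptual loss (q_tr <= q_p)."""
    return q_tr - q_p if q_tr > q_p else 0.0
```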
Fig. 16. Bit rate saving versus ΔPSNR and ΔQF for the second predicted PW-JND. (a) ΔPSNR. (b) ΔQF.
Fig. 17. Prediction of the first PW-JND versus ground truth PW-JND in different metrics. (a) PSNR: R² = 0.908. (b) Bit rate: R² = 0.615. (c) QF: R² = 0.12.
The wider the ambiguous region, the greater the possibility that the predicted QF deviates from the ground truth, thus leading to a lower correlation of the QF prediction. 2) The PSNR has the highest correlation (R² = 0.908), which denotes that we can predict the PW-JNDs in the PSNR domain. The predicted PSNR can be used in different image/video processing algorithms: 1) It can be used to compress images/videos with the lowest bit rate without perceptual quality degradation. 2) It is also helpful for streaming systems to select the images/videos with the smallest size but best quality, which can save bandwidth without damaging consumers' experience. 3) It can be used to guide watermark embedding, which ensures that the impairment of the embedded digital watermarks cannot be perceived by humans.

VII. CONCLUSION

In this paper, we propose a deep learning based Picture Wise Just Noticeable Difference (PW-JND) prediction model. Firstly, the task of predicting PW-JND is formulated as a multi-class classification problem, which is transformed into a binary classification problem. Secondly, we construct a deep learning based binary classifier named perceptually lossy/lossless predictor to predict whether a distorted image is perceptually lossy to its reference or not. Finally, we propose a sliding window based PW-JND search strategy to predict the PW-JND. Experimental results in comparison with the conventional just noticeable difference models demonstrate the effectiveness of the proposed model.

REFERENCES

[1] H. Zhang, Y. Zhang, H. Wang, Y.-S. Ho, and S. Feng, "WLDISR: Weighted local sparse representation-based depth image super-resolution for 3D video system," IEEE Trans. Image Process., vol. 28, no. 2, pp. 561–576, Feb. 2019.
[2] L. Toni and P. Frossard, "Optimal representations for adaptive streaming in interactive multiview video systems," IEEE Trans. Multimedia, vol. 19, no. 12, pp. 2775–2787, Dec. 2017.
[3] Y. Zhang, X. Yang, X. Liu, Y. Zhang, G. Jiang, and S. Kwong, "High-efficiency 3D depth coding based on perceptual quality of synthesized video," IEEE Trans. Image Process., vol. 25, no. 12, pp. 5877–5891, Dec. 2016.
[4] L. Xu et al., "Free-energy principle inspired video quality metric and its use in video coding," IEEE Trans. Multimedia, vol. 18, no. 4, pp. 590–602, Apr. 2016.
[5] S.-H. Bae, J. Kim, and M. Kim, "HEVC-based perceptually adaptive video coding using a DCT-based local distortion detection probability model," IEEE Trans. Image Process., vol. 25, no. 7, pp. 3343–3357, Jul. 2016.
[6] X. Zhang, S. Wang, K. Gu, W. Lin, S. Ma, and W. Gao, "Just-noticeable difference-based perceptual optimization for JPEG compression," IEEE Signal Process. Lett., vol. 24, no. 1, pp. 96–100, Jan. 2017.
[7] Z. Luo, L. Song, S. Zheng, and N. Ling, "H.264/advanced video control perceptual optimization coding based on JND-directed coefficient suppression," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 6, pp. 935–948, Jun. 2013.
[8] H. Su and C. Jung, "Perceptual enhancement of low light images based on two-step noise suppression," IEEE Access, vol. 6, pp. 7005–7018, 2018.
[9] T. Zhu and L. Karam, "A no-reference objective image quality metric based on perceptually weighted local noise," EURASIP J. Image Video Process., vol. 5, pp. 1–8, Jan. 2014.
[10] J. Wu, F. Qin, and M. Shi, "Self-similarity based structural regularity for just noticeable difference estimation," J. Vis. Commun. Image Represent., vol. 23, no. 6, pp. 845–852, Aug. 2012.
[11] J. Wu, G. Shi, W. Lin, A. Liu, and F. Qi, "Just noticeable difference estimation for images with free-energy principle," IEEE Trans. Multimedia, vol. 15, no. 7, pp. 1705–1710, Nov. 2013.
[12] J. Wu, W. Lin, G. Shi, X. Wang, and F. Li, "Pattern masking estimation in image with structural uncertainty," IEEE Trans. Image Process., vol. 22, no. 12, pp. 4892–4904, Dec. 2013.
[13] S. Wang, L. Ma, Y. Fang, W. Lin, S. Ma, and W. Gao, "Just noticeable difference estimation for screen content images," IEEE Trans. Image Process., vol. 25, no. 8, pp. 3838–3851, May 2016.
[14] J. Wu, L. Li, W. Dong, G. Shi, W. Lin, and C.-C. J. Kuo, "Enhanced just noticeable difference model for images with pattern complexity," IEEE Trans. Image Process., vol. 26, no. 6, pp. 2682–2693, Jun. 2017.
[15] H. Hadizadeh, A. Rajati, and I. V. Bajić, "Saliency-guided just noticeable distortion estimation using the normalized Laplacian pyramid," IEEE Signal Process. Lett., vol. 24, no. 8, pp. 1218–1222, Aug. 2017.
[16] Z. Wei and K. N. Ngan, "Spatio-temporal just noticeable distortion profile for grey scale image/video in DCT domain," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 3, pp. 337–346, Mar. 2009.
[17] S.-H. Bae and M. Kim, "A novel DCT-based JND model for luminance adaptation effect in DCT frequency," IEEE Signal Process. Lett., vol. 20, no. 9, pp. 893–896, Sep. 2013.
[18] S.-H. Bae and M. Kim, "A novel generalized DCT-based JND profile based on an elaborate CM-JND model for variable block-sized transforms in monochrome images," IEEE Trans. Image Process., vol. 23, no. 8, pp. 3227–3240, Aug. 2014.
[19] S. Bae and M. Kim, "A DCT-based total JND profile for spatiotemporal and foveated masking effects," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 6, pp. 1196–1207, Jun. 2017.
[20] S. Ki, S.-H. Bae, M. Kim, and H. Ko, "Learning-based just-noticeable-quantization-distortion modeling for perceptual video coding," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3178–3193, Jul. 2018.
[21] X. Liu, Y. Zhang, S. Hu, S. Kwong, C.-C. J. Kuo, and Q. Peng, "Subjective and objective video quality assessment of 3D synthesized views with texture/depth compression distortion," IEEE Trans. Image Process., vol. 24, no. 12, pp. 4847–4861, Dec. 2015.
[22] S. Hu, H. Wang, and C.-C. J. Kuo, "A GMM-based stair quality model for human perceived JPEG images," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Mar. 2016, pp. 1070–1074.
[23] L. Jin et al., "Statistical study on perceived JPEG image quality via MCL-JCI dataset construction and analysis," Electron. Imag., vol. 2016, no. 13, pp. 1–9, Feb. 2016.
[24] H. Wang et al., "VideoSet: A large-scale compressed video quality dataset based on JND measurement," J. Vis. Commun. Image Represent., vol. 46, pp. 292–302, Jul. 2017.
[25] C. Fan, Y. Zhang, H. Zhang, R. Hamzaoui, and Q. Jiang, "Picture-level just noticeable difference for symmetrically and asymmetrically compressed stereoscopic images: Subjective quality assessment study and datasets," J. Vis. Commun. Image Represent., vol. 62, pp. 140–151, Jul. 2019.
[26] Q. Huang, H. Wang, S. C. Lim, H. Y. Kim, S. Y. Jeong, and C.-C. J. Kuo, "Measure and prediction of HEVC perceptually lossy/lossless boundary QP values," in Proc. IEEE Data Compress. Conf. (DCC), Apr. 2017, pp. 42–51.
[27] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proc. 37th Asilomar Conf. Signals, Syst. Comput., vol. 2, Nov. 2003, pp. 1398–1402.
[28] L. Zhang, L. Zhang, X. Mou, and D. Zhang, "FSIM: A feature similarity index for image quality assessment," IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, Aug. 2011.
[29] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, "Gradient magnitude similarity deviation: A highly efficient perceptual image quality index," IEEE Trans. Image Process., vol. 23, no. 2, pp. 684–695, Feb. 2014.
[30] L. Zhang, Y. Shen, and H. Li, "VSI: A visual saliency-induced index for perceptual image quality assessment," IEEE Trans. Image Process., vol. 23, no. 10, pp. 4270–4281, Aug. 2014.
[31] L. Xu et al., "Multi-task rank learning for image quality assessment," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 9, pp. 1833–1843, Sep. 2017.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[33] L. Zhu, Y. Zhang, S. Wang, H. Yuan, S. Kwong, and H.-H. S. Ip, "Convolutional neural network-based synthesized view quality enhancement for 3D video coding," IEEE Trans. Image Process., vol. 27, no. 11, pp. 5365–5377, Nov. 2018.
[34] S. Bosse, D. Maniry, K. R. Müller, T. Wiegand, and W. Samek, "Deep neural networks for no-reference and full-reference image quality assessment," IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, Jan. 2018.
[35] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 807–814.
[36] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980
[37] S. Hu, L. Jin, H. Wang, Y. Zhang, S. Kwong, and C.-C. J. Kuo, "Compressed image quality metric based on perceptually weighted distortion," IEEE Trans. Image Process., vol. 24, no. 12, pp. 5594–5608, Dec. 2015.

Huanhua Liu received the B.S. degree in computer science and technology from Changsha University, China, in 2009, and the M.S. degree in computer science and technology from the University of South China, China, in 2013. He is currently pursuing the Ph.D. degree with Central South University. Since 2017, he has been with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. His current research interests include learning-based image/video quality assessment and video coding.

Yun Zhang (M'12–SM'16) received the B.S. and M.S. degrees in electrical engineering from Ningbo University, Ningbo, China, in 2004 and 2007, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, China, in 2010. From 2009 to 2014, he was a Postdoctoral Researcher with the Department of Computer Science, City University of Hong Kong, Hong Kong. From 2010 to 2017, he was an Assistant Professor and an Associate Professor with the Shenzhen Institutes of Advanced Technology (SIAT), CAS, Shenzhen, China, where he is currently a Professor. His research interests include video compression, 3D video processing, and visual perception.

Huan Zhang received the B.S. degree from the Civil Aviation University of China, Tianjin, China, in 2010, and the M.S. degree from Tsinghua University, Beijing, China, in 2013. She is currently pursuing the Ph.D. degree with the University of Chinese Academy of Sciences, China. Her research interests include image restoration and image/video quality assessment.
Chunling Fan received the M.S. degree from Nanjing Normal University, Nanjing, in 2011. She is currently pursuing the Ph.D. degree with the Shenzhen Institutes of Advanced Technology (SIAT), University of Chinese Academy of Sciences, Shenzhen, China. Her research interests include image processing and image quality assessment.

C.-C. Jay Kuo (F'99) received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1980, and the M.S. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 1985 and 1987, respectively. He is currently the Director of the Multimedia Communications Laboratory and a Professor of electrical engineering, computer science, and mathematics with the Ming-Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA. He has coauthored about 200 journal papers, 850 conference papers, and ten books. His research interests include digital image/video analysis and modeling, multimedia data compression, communication and networking, and biological signal/image processing. He is a fellow of The American Association for the Advancement of Science and The International Society for Optical Engineers.