4.1. Datasets
We tested the proposed NSSIM on four popular IQA datasets: CSIQ [25], LIVE II [26], IVC [27] and TID2013 [28]. Each dataset consists of several subsets covering different distortion types; in this paper, we used only the Gaussian blur subsets. Specifically, the CSIQ dataset contains 30 reference images and 150 Gaussian blur images, the LIVE II dataset contains 29 reference images and 145 Gaussian blur images, the IVC dataset contains four reference images and 20 Gaussian blur images, and the TID2013 dataset contains 25 reference images and 125 Gaussian blur images. All blurred images in these datasets were used together with their DMOS as subjective quality scores. Some sample Gaussian blur images used in the experiments are shown in Figure 6 together with their DMOS. It should be noted that the DMOS provided by CSIQ [25] and LIVE II [26] correlate positively with blurriness, while the DMOS provided by IVC [27] and TID2013 [28] change in the opposite direction. As shown in Figure 6, the datasets used for performance evaluation contain blurred images covering a wide variety of natural scenes, and are thus suitable for statistical analysis of the evaluation results.
4.2. Indexes for Evaluation
In order to measure the performance of the proposed model quantitatively, we follow the performance evaluation procedures employed by the Video Quality Experts Group (VQEG) [29]. We evaluate the proposed IQA metric using Spearman’s rank order correlation coefficient (SROCC), the Pearson linear correlation coefficient (PLCC) and the root mean square error (RMSE) as evaluation indexes. SROCC, PLCC and RMSE are respectively defined as
$$\mathrm{SROCC} = 1 - \frac{6\sum_{i=1}^{N}\left(r_{s_i} - r_{o_i}\right)^{2}}{N\left(N^{2} - 1\right)},$$

$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N}\left(s_i - \bar{s}\right)\left(o_i - \bar{o}\right)}{\sqrt{\sum_{i=1}^{N}\left(s_i - \bar{s}\right)^{2}\,\sum_{i=1}^{N}\left(o_i - \bar{o}\right)^{2}}},$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(s_i - o_i\right)^{2}},$$

where $N$ is the number of images, $s_i$ and $o_i$ represent the $i$-th scores given by subjective evaluation and objective evaluation, $\bar{s}$ and $\bar{o}$ represent the mean subjective quality score and the mean objective predicted score, and $r_{s_i}$ and $r_{o_i}$ represent the rank order numbers of $s_i$ and $o_i$, respectively. SROCC is a nonparametric measure of rank correlation that assesses how well the relationship between the subjective quality score and the objective predicted score can be described by a monotonic function, while PLCC measures the linear correlation between them. RMSE is the square root of the mean of the squared differences between the subjective quality scores and the objective predicted scores, and is a frequently used measure of prediction error. Together, these three statistical measures quantify the consistency between the subjective quality scores and the objective predicted scores, which indicates the capability of an IQA method.
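The three indexes can be illustrated with a minimal pure-Python sketch. The function names and the toy scores are assumptions made for illustration only; they are not taken from the evaluated datasets, and the SROCC shortcut formula used here assumes there are no tied scores.

```python
import math

def ranks(values):
    """Return 1-based ranks (smallest value gets rank 1); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def srocc(subjective, objective):
    """Spearman rank order correlation via the rank-difference formula."""
    n = len(subjective)
    rs, ro = ranks(subjective), ranks(objective)
    d2 = sum((a - b) ** 2 for a, b in zip(rs, ro))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def plcc(subjective, objective):
    """Pearson linear correlation coefficient."""
    n = len(subjective)
    ms = sum(subjective) / n
    mo = sum(objective) / n
    cov = sum((s - ms) * (o - mo) for s, o in zip(subjective, objective))
    var_s = sum((s - ms) ** 2 for s in subjective)
    var_o = sum((o - mo) ** 2 for o in objective)
    return cov / math.sqrt(var_s * var_o)

def rmse(subjective, objective):
    """Root of the mean squared difference between the two score lists."""
    n = len(subjective)
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(subjective, objective)) / n)

# Toy scores, made up for illustration (not from CSIQ, LIVE II, IVC or TID2013).
dmos = [0.10, 0.35, 0.50, 0.72, 0.90]
pred = [0.15, 0.30, 0.55, 0.65, 0.95]

print(srocc(dmos, pred), plcc(dmos, pred), rmse(dmos, pred))
```

Negating the predicted scores flips the sign of SROCC and PLCC without changing their magnitude, which is one way to handle datasets whose subjective scores run in the opposite direction to blurriness.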
4.4. Comparison with the State of the Art
We compared the performance of NSSIM against PSNR, the original SSIM [1], and several state-of-the-art NR IQA models, including BRISQUE [4], BLIIND-II [3], MCNN [19], IQA-CWT [7] and SFA [20]. To determine whether the differences between the proposed metric and the existing IQA algorithms are statistically significant, we performed paired-sample t-tests and report the p-values. The null hypothesis in our t-tests is that the pairwise difference between the proposed metric and another metric has a mean equal to zero, i.e., the differences in performance presented in the results are not statistically significant. A p-value < 0.05 indicates rejection of the null hypothesis at the 5% significance level, meaning that the differences are statistically significant. It should be noted that the results of some IQA algorithms, such as NR-CSR [18], MCNN [19], IQA-CWT [7] and SFA [20], are taken from the corresponding references, so their p-values are not shown in the result tables.
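As a minimal sketch of the paired-sample t-test described above, the following code computes the t statistic for two metrics evaluated on the same 10 samples and compares it with the tabulated two-sided 5% critical value for 9 degrees of freedom. The function name and the PLCC values are hypothetical, chosen for illustration only. Where SciPy is available, `scipy.stats.ttest_rel` returns the p-value directly.

```python
import math

# Two-sided critical value of Student's t at the 5% level for 9 degrees of
# freedom (10 paired samples), taken from standard t tables.
T_CRIT_DF9 = 2.262

def paired_t_statistic(a, b):
    """t statistic for H0: the mean of the pairwise differences a_i - b_i is zero.

    Assumes the differences are not all identical (non-zero sample variance).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical PLCC values of two metrics over the same 10 resampled subsets.
metric_a = [0.90, 0.89, 0.91, 0.90, 0.88, 0.92, 0.90, 0.89, 0.91, 0.90]
metric_b = [0.85, 0.86, 0.84, 0.87, 0.85, 0.86, 0.84, 0.85, 0.86, 0.85]

t = paired_t_statistic(metric_a, metric_b)
reject_null = abs(t) > T_CRIT_DF9  # equivalent to p < 0.05 for a two-sided test
print(f"t = {t:.3f}, reject H0 at the 5% level: {reject_null}")
```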
As seen from Table 6, NSSIM achieves a PLCC of 0.8971 and an RMSE of 0.1266 on the CSIQ [25] dataset, better than the SSIM and MCNN metrics. For the t-test, we randomly sampled 100 images from the CSIQ dataset 10 times, so that 10 samples of SROCC, PLCC and RMSE were obtained for each algorithm in the comparison. The p-values in Table 6 show that the t-tests reject the null hypothesis at the 5% significance level, i.e., the alternative hypothesis is accepted that the pairwise difference between NSSIM and the other metrics does not have a mean equal to zero. This confirms that the differences between the metrics are statistically significant.
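The resampling step can be sketched as follows. This is only an illustration under stated assumptions: each subset is drawn without replacement, the scores are synthetic stand-ins for the 150 CSIQ Gaussian blur images, and `resampled_scores` is a hypothetical helper name.

```python
import math
import random

def plcc(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def resampled_scores(subjective, objective, metric, subset_size, repeats, seed=0):
    """Evaluate `metric` on `repeats` random subsets of `subset_size` images.

    Assumption: images within one subset are drawn without replacement.
    """
    rng = random.Random(seed)
    indices = range(len(subjective))
    results = []
    for _ in range(repeats):
        chosen = rng.sample(indices, subset_size)
        results.append(metric([subjective[i] for i in chosen],
                              [objective[i] for i in chosen]))
    return results

# Synthetic stand-in data: 150 subjective scores and noisy predictions.
data_rng = random.Random(42)
dmos = [data_rng.random() for _ in range(150)]
pred = [d + data_rng.gauss(0.0, 0.05) for d in dmos]

samples = resampled_scores(dmos, pred, plcc, subset_size=100, repeats=10)
print(len(samples), min(samples), max(samples))
```

Reusing the same seed for every algorithm under comparison keeps the subsets identical across algorithms, so that the resulting samples are properly paired for the t-test.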
Table 7 shows that NSSIM achieves a PLCC of 0.9689 on the LIVE II [26] dataset, better than the other ten algorithms, while its SROCC of 0.9464 ranks third among the 11 metrics. A paired-sample t-test was applied in the same way as on the CSIQ [25] dataset. The p-values demonstrate that the improvement of the proposed metric in terms of PLCC is statistically significant.
Table 8 indicates that NSSIM achieves a PLCC of 0.9239 and an RMSE of 0.4367, which are the best among all nine algorithms tested on the IVC [27] dataset. We also ran a paired-sample t-test by randomly sampling 15 images from the IVC dataset 10 times. The p-values show that the improvement of the proposed metric in terms of PLCC and RMSE is statistically significant.
As seen from Table 9, the SROCC of NSSIM is 0.8995, which is higher than that of the other FR and NR IQA algorithms on the TID2013 [28] dataset. A paired-sample t-test was performed in the same way as on the CSIQ and LIVE II datasets. The p-values indicate that the SROCC advantage of NSSIM over the other algorithms is statistically significant.
Furthermore, Table 10 gives the means and standard deviations of SROCC, PLCC and RMSE of the tested IQA algorithms over the four datasets. Only IQA algorithms tested on all four datasets are included. It can be seen from Table 10 that our NSSIM metric achieves the highest mean SROCC and the third-best mean PLCC and RMSE. Meanwhile, the t-test results shown in Table 10 fail to reject the null hypothesis at the 5% significance level, indicating that the differences in average performance between NSSIM and the other metrics across the datasets are not statistically significant. This is understandable, since NSSIM does not achieve a significant improvement on every dataset in terms of all indexes. However, considering how the datasets differ in image size, contrast and diversity, the experimental results show that the proposed NSSIM adapts to different image categories. Given that the methods in [3,4,17] all require a prior training procedure, NSSIM offers the best balance between assessment performance and time efficiency.
These test results also show that NSSIM demonstrates the relationship between quantified image naturalness and perceptual image quality. NSSIM provides a simple method to assess image quality without a reference image or prior training on human judgments of blurred images. In addition, compared with the recent NR IQA metrics NR-CSR [18], MCNN [19] and SFA [20], NSSIM is less time-consuming because it requires no training or learning procedure.
4.6. IQA for Blurred Image Restoration
The purpose of image restoration is to reduce or remove the image degradation introduced during acquisition, compression, transmission, processing and reproduction. IQA can be used to evaluate an image restoration algorithm by assessing the quality of the distorted image and of the restored image. Sroubek [30] presented a deconvolution algorithm for decomposition and approximation of space-variant blur using the alternating direction method of multipliers. Kotera [31] proposed a blind deconvolution algorithm using the variational Bayesian approximation with the automatic relevance determination model on the likelihood and on the image and blur priors. In this section, we use the proposed NSSIM to evaluate the performance of image restoration. Two groups of images, each including the original image, the blurred image and the restorations, are evaluated by the proposed NSSIM, by PSNR and SSIM, and by several state-of-the-art NR IQA algorithms. The experimental results are shown in Figure 9, Figure 10, Table 11 and Table 12.
It can be seen from Figure 9 and Table 11 that PSNR and SSIM [1] fail to evaluate the quality of both restorations, since their scores for the restorations are lower than that of the blurred image. BIQI [14] and NIQE [17] succeed in identifying the restored images, but the difference between the two restorations is slight. In contrast, BRISQUE [4], BLIIND-II [3], SFA [20] and the proposed NSSIM achieve better precision, and the differences between the two restorations are distinct. Moreover, the NSSIM-predicted score of the blurred image is 0.0709 ×
, showing a clear difference between the blurred image and the original image. This demonstrates that NSSIM is extremely sensitive to blur. Similarly, Figure 10 and Table 12 also demonstrate that NSSIM is suitable for blur IQA.