NIMA: Neural Image Assessment (TIP 2018)


Automatically learned quality assessment for images has recently become a hot topic due to its usefulness in a wide variety of applications such as evaluating image capture pipelines, storage techniques and sharing media. Despite the subjective nature of this problem, most existing methods only predict the mean opinion score provided by datasets such as AVA and TID2013. Our approach differs from others in that we predict the distribution of human opinion scores using a convolutional neural network. Our architecture also has the advantage of being significantly simpler than other methods with comparable performance. Our proposed approach relies on the success (and retraining) of proven, state-of-the-art deep object recognition networks. Our resulting network can be used to not only score images reliably and with high correlation to human perception, but also to assist with adaptation and optimization of photo editing/enhancement algorithms in a photographic pipeline. All this is done without need of a “golden” reference image, consequently allowing for single-image, semantic- and perceptually-aware, no-reference quality assessment.
Fig. 3: Histograms of ratings from the TID2013 dataset [2]. Left: histogram of mean scores. Middle: histogram of standard deviations. Right: joint histogram of the mean and standard deviation.

D. Tampere Image Database 2013 (TID2013)

TID2013 [2] is curated for evaluation of full-reference perceptual image quality. It contains 3000 images generated from 25 reference (clean) images (Kodak images [20]) with 24 types of distortions at 5 levels each. This leads to 120 distorted images for each reference image, covering distortions such as compression artifacts, noise, blur and color artifacts.

Human ratings of TID2013 images are collected through a forced-choice experiment, in which observers select the better image between two distorted choices. The setup of the experiment allows raters to view the reference image while making a decision. In each experiment, every distorted image is used in random pairwise comparisons. The selected image gets one point, and the other image gets zero points. At the end of the experiment, the sum of the points is used as the quality score associated with an image (this leads to scores ranging from 0 to 9). To obtain the overall mean scores, a total of 985 experiments were carried out.

The mean and standard deviation of TID2013 ratings are shown in Fig. 3. As can be seen in Fig. 3(c), the mean and score deviation values are weakly correlated. A few images from TID2013 are illustrated in Fig. 4 and Fig. 5. All five levels of JPEG compression artifacts and the respective ratings are shown in Fig. 4. Evidently, a higher distortion level leads to a lower mean score. The effect of contrast compression/stretching distortion on the human ratings is demonstrated in Fig. 5. Interestingly, stretching the contrast (Fig. 5(c) and Fig. 5(e)) leads to relatively higher perceptual quality.

Footnote: This is a quite consistent trend for most of the other distortions too (namely noise, blur and color distortions). However, in the case of the contrast change (Fig. 5), this trend is not obvious. This is due to the order of contrast compression/stretching from level 1 to level 5.

Unlike AVA, which includes a distribution of ratings for each image, TID2013 only provides the mean and standard deviation of the opinion scores. Since our proposed method requires training on score probabilities, the score distributions are approximated through maximum entropy optimization [21].

Fig. 4: JPEG artifact example images from the TID2013 dataset [2] with quality score μ (±σ), where μ and σ represent the mean and standard deviation of the score, respectively. The clean image and 5 levels of JPEG compression artifacts are shown: (a) clean image, (b) level 1, μ = 5.73, σ = 0.15, (c) level 2, μ = 5.47, σ = 0.11, (d) level 3, μ = 4.86, σ = 0.11, (e) level 4, μ = 3.0, σ = 0.11, (f) level 5, μ = 1.66, σ = 0.16.

II. PROPOSED METHOD

Our proposed quality and aesthetic predictor stands on a few different classifier architectures, such as VGG16 [17], Inception-v2 [22], and MobileNet [23], for the image quality assessment task. VGG16 consists of 13 convolutional and 3 fully-connected layers; small convolution filters of size 3×3 are used in the deep VGG16 architecture [17]. Inception-v2 [22] is based on the Inception module [24], which allows for parallel use of convolution and pooling operations. Also, in the Inception architecture, traditional fully-connected layers are replaced by average pooling, which leads to a significant reduction in the number of parameters. MobileNet [23] is an efficient deep CNN, mainly designed for mobile vision applications. In this architecture, dense convolutional filters are replaced by depthwise separable filters. This simplification results in smaller and faster CNN models.

We replaced the last layer of the baseline CNN with a fully-connected layer with 10 neurons followed by soft-max activations (shown in Fig. 6). Baseline CNN weights are initialized by training on the ImageNet dataset [15], and then end-to-end training on quality assessment is performed. In this paper, we discuss performance of the proposed model with various baseline CNNs.

In training, input images are rescaled to 256×256, and then a crop of size 224×224 is randomly extracted. This lessens potential over-fitting issues, especially when training on relatively small datasets (e.g. TID2013). It is worth noting that we also tried training with random crops without rescaling; however, the results were not compelling, due to the inevitable change in image composition. Another random data augmentation in our training process is horizontal flipping of the image crops.

Our goal is to predict the distribution of ratings for a given image. The ground truth distribution of human ratings of a given image can be expressed as an empirical probability mass function p = [p_{s_1}, …, p_{s_N}] with s_1 ≤ s_i ≤ s_N, where s_i denotes the i-th score bucket and N denotes the total number of score buckets. In both the AVA and TID2013 datasets N = 10; in AVA, s_1 = 1 and s_N = 10, and in TID2013, s_1 = 0 and s_N = 9. Since Σ_{i=1}^{N} p_{s_i} = 1, p_{s_i} represents the probability of a quality score falling in the i-th bucket. Given the distribution of ratings p, the mean quality score is defined as μ = Σ_{i=1}^{N} s_i × p_{s_i}, and the standard deviation of the score is computed as σ = (Σ_{i=1}^{N} (s_i − μ)² × p_{s_i})^{1/2}. As discussed in the previous section, one can qualitatively compare images by the mean and standard deviation of scores.

Each example in the dataset consists of an image and its ground truth (user) ratings p. Our objective is to find the probability mass function p̂ that is an accurate estimate of p.
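The score statistics defined above follow directly from the predicted distribution. A minimal NumPy sketch — the example distribution is made up for illustration, not taken from the paper:

```python
import numpy as np

def score_stats(p, s):
    """Mean and standard deviation of a score distribution.

    p : probability mass over the N score buckets (sums to 1)
    s : bucket values s_1 < ... < s_N (e.g. 1..10 for AVA)
    """
    p = np.asarray(p, dtype=float)
    s = np.asarray(s, dtype=float)
    mu = np.sum(s * p)                        # mu = sum_i s_i * p_{s_i}
    sigma = np.sqrt(np.sum((s - mu) ** 2 * p))
    return mu, sigma

# Hypothetical distribution over the 10 AVA buckets (scores 1..10)
p = np.array([0.0, 0.01, 0.04, 0.10, 0.25, 0.30, 0.20, 0.07, 0.02, 0.01])
mu, sigma = score_stats(p, np.arange(1, 11))
```

The same two-line computation converts the network's 10-way soft-max output into the mean score used later for ranking and parameter tuning.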
Fig. 5: Example images from the TID2013 dataset [2] with quality score μ (±σ), where μ and σ represent the mean and standard deviation of the score, respectively. The clean image and 5 levels of contrast change distortion are shown: (a) clean image, (b) level 1, μ = 5.67, σ = 0.10, (c) level 2, μ = 6.80, σ = 0.18, (d) level 3, μ = 4.83, σ = 0.16, (e) level 4, μ = 6.69, σ = 0.29, (f) level 5, μ = 3.88, σ = 0.18.

Fig. 6: Modified baseline image classifier network used in our framework. The last layer of the classifier network is replaced by a fully-connected layer to output 10 classes of quality scores. Baseline network weights are initialized by training on the ImageNet dataset [15], and the added fully-connected weights are initialized randomly.

Next, our training loss function is discussed.

A. Loss Function

Soft-max cross-entropy is widely used as a training loss in classification tasks. This loss can be represented as −Σ_{i=1}^{N} p_{s_i} log(p̂_{s_i}) (where p̂_{s_i} denotes the estimated probability of the i-th score bucket), which maximizes the predicted probability of the correct labels. However, in the case of ordered classes (e.g. aesthetic and quality estimation), cross-entropy loss lacks the inter-class relationships between score buckets. One might argue that ordered classes can be represented by a real number and, consequently, can be learned through a regression framework. Yet it has been shown that for ordered classes, classification frameworks can outperform regression models [19], [25]. Hou et al. [19] show that training on datasets with intrinsic ordering between classes can benefit from EMD-based losses. These loss functions penalize mis-classifications according to class distances.

For image quality ratings, classes are inherently ordered as s_1 < … < s_N, and the r-norm distance between classes is defined as ‖s_i − s_j‖_r, where 1 ≤ i, j ≤ N. EMD is defined as the minimum cost to move the mass of one distribution to another. Given the ground truth and estimated probability mass functions p and p̂, with N ordered classes of distance ‖s_i − s_j‖_r, the normalized Earth Mover's Distance can be expressed as [26]:

EMD(p, p̂) = ( (1/N) Σ_{k=1}^{N} |CDF_p(k) − CDF_p̂(k)|^r )^(1/r)    (1)

where CDF_p(k) is the cumulative distribution function Σ_{i=1}^{k} p_{s_i}. It is worth noting that this closed-form solution requires both distributions to have equal mass, i.e. Σ_{i=1}^{N} p_{s_i} = Σ_{i=1}^{N} p̂_{s_i} = 1. As shown in Fig. 6, our predicted quality probabilities are fed to a soft-max function to guarantee that Σ_{i=1}^{N} p̂_{s_i} = 1. Similar to [19], in our training framework r is set to 2 to penalize the Euclidean distance between the CDFs; r = 2 also allows easier optimization when working with gradient descent.

III. EXPERIMENTAL RESULTS

We train two separate models for aesthetic and technical quality assessment on AVA and TID2013. For each case, we split the AVA and TID datasets into train and test sets, such that 20% of the data is used for testing. In this section, performance of the proposed models on the test sets is discussed and compared to existing methods. Then, applications of the proposed technique in photo ranking and image enhancement are explored. Before moving forward, details of our implementation are explained.

The CNNs presented in this paper are implemented in TensorFlow [27], [28]. The baseline CNN weights are initialized by training on ImageNet [15], and the last fully-connected layer is initialized randomly. The momentums are set to 0.9, and a dropout rate of 0.75 is applied to the last layer of the baseline network. The learning rates of the baseline CNN layers and the last fully-connected layer are set to 3×10⁻⁷ and 3×10⁻⁶, respectively. We observed that setting a low learning rate on the baseline CNN layers results in easier and faster optimization when using stochastic gradient descent. Also, after every 10 epochs of training, an exponential decay with decay factor 0.95 is applied to all learning rates.

A. Performance Comparisons

Accuracy, correlation and EMD values of our evaluations of the aesthetic assessment model on AVA are presented in Table I. Most methods in Table I are designed to perform binary classification on the aesthetic scores, and as a result only accuracy evaluations of two-class quality categorization are reported. In the two-class aesthetic categorization task, results from [18] and NIMA(Inception-v2) show the highest accuracy. Also, in terms of rank correlation, NIMA(VGG16) and NIMA(Inception-v2) outperform [14]. NIMA is much cheaper: [18] applies multiple VGG16 nets on image patches to generate a single quality score, whereas the computational complexity of NIMA(Inception-v2) is roughly one pass of Inception-v2 (see Table III).

Our technical quality assessment model on TID2013 is compared to other existing methods in Table II. While most of these methods regress to the mean opinion score, our proposed technique predicts the distribution of ratings as well as the mean opinion score. Correlations between the ground truth and the results of NIMA(VGG16) are close to the state-of-the-art results in [29] and [7]. It is worth highlighting that Bianco et al. [7] feed multiple image crops to a deep CNN, whereas our method takes only the rescaled image.
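The normalized EMD of Eq. (1) has a direct vectorized form, since the CDF is a cumulative sum over the ordered buckets. A minimal NumPy sketch with r = 2, as used in training (the function name and test distributions are ours, not the paper's):

```python
import numpy as np

def emd_loss(p, p_hat, r=2):
    """Normalized Earth Mover's Distance of Eq. (1).

    p, p_hat : ground truth / estimated pmfs over N ordered score
    buckets. Both must sum to 1 for the closed form to hold, which
    the soft-max output layer guarantees for p_hat.
    r = 2 penalizes the Euclidean distance between the two CDFs.
    """
    cdf_p = np.cumsum(p)          # CDF_p(k) = sum_{i<=k} p_{s_i}
    cdf_p_hat = np.cumsum(p_hat)
    # np.mean supplies the 1/N normalization of Eq. (1).
    return np.mean(np.abs(cdf_p - cdf_p_hat) ** r) ** (1.0 / r)
```

The loss is differentiable in `p_hat`, so the same expression can serve as a training objective in an autodiff framework.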
TABLE I: Performance of the proposed method with various architectures in predicting AVA quality ratings [1], compared to the state-of-the-art. Reported accuracy values are based on classification of photos into two classes. LCC (linear correlation coefficient) and SRCC (Spearman's rank correlation coefficient) are computed between predicted and ground truth mean scores and standard deviations of scores. EMD measures closeness of the predicted and ground truth rating distributions with r = 1 in Eq. 1.

Model                 Accuracy (2 classes)  LCC (mean)  SRCC (mean)  LCC (std dev)  SRCC (std dev)  EMD
Murray et al. [1]     66.70%
(name illegible)      71.42%
Lu et al. [30]        74.46%
Lu et al. [16]        75.42%
Kao et al. [31]       76.58%
Wang et al. [32]      76.80%
Mai et al. [10]       77.10%
Kong et al. [14]      77.33%                            0.58
Ma et al. [18]        81.70%
NIMA(MobileNet)       80.36%                0.518       0.510        0.152          0.137
NIMA(VGG16)           80.60%                0.610       0.592        0.205          0.202           0.052
NIMA(Inception-v2)    81.51%                0.636       0.612        0.233          0.218           0.050

B. Photo ranking

Predicted mean scores can be used to rank photos aesthetically. Some test photos from the AVA dataset are ranked in Fig. 7 and Fig. 8; predicted NIMA scores and ground truth AVA scores are shown below each image. Results in Fig. 7 suggest that in addition to image content, other factors such as tone, contrast and composition of photos are important aesthetic qualities. Also, as shown in Fig. 8, besides image semantics, framing and color palette are key qualities in these photos. These aesthetic attributes are closely predicted by our models trained on AVA.

Predicted mean scores are also used to qualitatively rank photos in Fig. 9. These images are part of our TID2013 test set, which contains various types and levels of distortions. Comparing ground truth and predicted scores indicates that our trained model on TID2013 accurately ranks the test images.

C. Image enhancement

Quality and aesthetic scores can be used to perceptually tune image enhancement operators. In other words, maximizing NIMA score as a prior can increase the likelihood of enhancing the perceptual quality of an image. Typically, parameters of enhancement operators such as image denoising and contrast enhancement are selected by extensive experiments under various photographic conditions. Perceptual tuning can be quite expensive and time consuming, especially when human opinion is required. In this section, our proposed models are used to tune a tone enhancement method [37] and an image denoiser [38].

The multi-layer Laplacian technique [37] enhances the local and global contrast of images. We observed that the predicted aesthetic ratings can be improved by contrast adjustments; consequently, our model is able to guide the multi-layer Laplacian filter to find aesthetically near-optimal settings of its parameters. Examples of this type of image editing are presented in Fig. 11, where a combination of detail, shadow and brightness changes is applied to each image. In each example, 6 levels of detail boost, 11 levels of shadow change, and 11 levels of brightness change account for a total of 726 variations. The aesthetic assessment model tends to prefer high-contrast images with boosted details, which is consistent with the ground truth results from AVA illustrated in Fig. 7.

Turbo denoising [38] uses the domain transform [39] as its core filter. Performance of Turbo denoising depends on spatial and range smoothing parameters, and consequently proper tuning of these parameters can effectively boost the performance of the denoiser. We observed that varying the spatial smoothing parameter makes the most significant perceptual difference, and as a result we use our quality assessment model trained on TID2013 to tune this denoiser. Application of our no-reference quality metric as a prior in image denoising is similar to the work of Zhu et al. [40], [41]. Our results are shown in Fig. 12: additive white Gaussian noise with standard deviation 30 is added to each clean image, and Turbo denoising with various spatial parameters is used to denoise the noisy image. To reduce the score deviation, 50 random crops are extracted from each denoised image, and their scores are averaged to obtain the plots in Fig. 12. As can be seen, although the same amount of noise is added to each image, the maximum quality scores correspond to different denoising parameters in each example.
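The perceptual tuning used here amounts to a brute-force sweep: apply the operator at each parameter setting and keep the setting with the highest predicted mean score. A sketch under that reading — `enhance` and `nima_mean_score` are hypothetical stand-ins for the enhancement operator and the trained model:

```python
import itertools

def tune(image, enhance, nima_mean_score,
         detail_levels, shadow_levels, brightness_levels):
    """Pick the parameter triple whose output maximizes the
    predicted mean score.

    enhance(image, d, s, b) and nima_mean_score(image) are
    stand-ins for the actual operator and trained model.
    """
    best_params, best_score = None, float("-inf")
    for d, s, b in itertools.product(detail_levels, shadow_levels,
                                     brightness_levels):
        score = nima_mean_score(enhance(image, d, s, b))
        if score > best_score:
            best_params, best_score = (d, s, b), score
    return best_params, best_score
```

With the 6 detail, 11 shadow, and 11 brightness levels used in the text, this evaluates the operator 6 × 11 × 11 = 726 times per image — the repeated evaluation cost that limits real-time use.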
For relatively smooth images such as Fig. 12(a) and (g), the optimal spatial parameter of Turbo denoising is higher (which implies stronger smoothing) than for the textured image in (i). This is probably due to the relatively high signal-to-noise ratio of (i). In other words, the quality assessment model tends to respect textures and avoid over-smoothing of details. The effect of the denoising parameter can be visually inspected in Fig. 13: while the denoised result in Fig. 13(a) is under-smoothed, (c), (e) and (f) show undesirable over-smoothing effects. The predicted quality scores validate this perceptual observation.

Parameters of the multi-layer Laplacian technique [37] control the amount of detail, shadow, and brightness of an image; Fig. 10 shows a few examples of the multi-layer Laplacian with different sets of parameters.

TABLE II: Performance of the proposed method with various architectures in predicting TID2013 quality ratings [2], compared to the state-of-the-art. LCC (linear correlation coefficient) and SRCC (Spearman's rank correlation coefficient) are computed between predicted and ground truth mean scores and standard deviations of scores. EMD measures closeness of the predicted and ground truth rating distributions with r = 1 in Eq. 1.

Model                 LCC (mean)  SRCC (mean)  LCC (std dev)  SRCC (std dev)  EMD
Moorthy et al. [33]   0.85
Mittal et al. [34]    0.92        0.89
Saad et al. [35]      0.91
Kottayil et al. [36]  0.96
Bianco et al. [7]     0.96
NIMA(MobileNet)       0.78        0.698        0.209          0.18            0.105
NIMA(VGG16)           0.941       0.944        0.538          0.557           0.054
NIMA(Inception-v2)    0.827       0.750        0.470          0.468           0.064

Fig. 7: Ranking examples labelled with the "landscape" tag from the AVA dataset [1] using our proposed aesthetic assessment model NIMA(VGG16). Predicted (and ground truth) scores are shown below each image.

Fig. 8: Ranking examples labelled with the "sky" tag from the AVA dataset [1] using our proposed aesthetic assessment model NIMA(Inception-v2). Predicted (and ground truth) scores are shown below each image.

Fig. 9: Ranking examples from the TID2013 dataset [2] using our proposed quality assessment model NIMA(VGG16). Predicted (and ground truth) scores are shown below each image.

Fig. 10: Predicted aesthetic scores (NIMA(VGG16)) for various parameter settings of the multi-layer Laplacian technique [37]: (a) input (5.52), (b) contrast compression (4.79), (c) boosting details (5.73), (d) increasing brightness (5.52), (e) increasing shadows (5.95).

Fig. 11: Tone enhancement by the multi-layer Laplacian technique [37] guided by our proposed aesthetic assessment model NIMA(VGG16). Predicted aesthetic scores are shown below each image. (Input photos are downloaded from www.farbspiel-photo.com.)

D. Computational Costs

Computational complexity of the NIMA models is compared in Table III. Our TensorFlow inference implementation is tested on an Intel Xeon CPU at 3.5 GHz with 32 GB of memory and 12 cores, and an NVIDIA Quadro K620 GPU. Timings of one pass of the NIMA models on an image of size 224×224×3 are reported in Table III. Evidently, NIMA(MobileNet) is significantly lighter and faster than the other models. This comes at the expense of a slight performance drop (shown in Table I and Table II).

IV. CONCLUSION

In this work we introduced a CNN-based image assessment method which can be trained on both aesthetic and pixel-level quality datasets. Our models effectively predict the distribution of quality ratings, rather than just the mean scores. This leads to more accurate quality prediction with higher correlation to the ground truth ratings. We trained two models for high-level aesthetics and low-level technical qualities, and utilized them to steer the parameters of a few image enhancement operators. Our experiments suggest that these models are capable of guiding denoising and tone enhancement to produce perceptually superior results.

As part of our future work, we will exploit the trained models in other image enhancement applications. Our current experimental setup requires the enhancement operator to be evaluated multiple times, which limits real-time application of the proposed method. One might argue that in the case of an enhancement operator with well-defined derivatives, using NIMA as the loss function is a more efficient approach.

ACKNOWLEDGMENT

We would like to thank Dr. Pascal Getreuer for valuable discussions and helpful advice on approximation of score distributions.

REFERENCES

[1] N. Murray, L. Marchesotti, and F. Perronnin, "AVA: A large-scale database for aesthetic visual analysis," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2408-2415.
[2] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Jin, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti et al., "Color image database TID2013: Peculiarities and preliminary results," in Visual Information Processing (EUVIP), 2013 4th European Workshop on. IEEE, 2013, pp. 106-111.
[3] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[4] W. Xue, L. Zhang, and X. Mou, "Learning without human scores for blind image quality assessment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 995-1002.
[5] L. Kang, P. Ye, Y. Li, and D. Doermann, "Convolutional neural networks for no-reference image quality assessment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[6] S. Bosse, D. Maniry, T. Wiegand, and W. Samek, "A deep neural network for image quality assessment," in Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, 2016, pp. 3773-3777.
[7] S. Bianco, L. Celona, P. Napoletano, and R. Schettini, "On the use of deep learning for blind image quality assessment," arXiv preprint arXiv:1602.05531, 2016.
[8] X. Lu, Z. Lin, X. Shen, R. Mech, and J. Z. Wang, "Deep multi-patch aggregation network for image style, aesthetics, and quality estimation," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 990-998.
[9] Y. Kao, C. Wang, and K. Huang, "Visual aesthetic quality assessment with a regression model," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 1583-1587.
[10] L. Mai, H. Jin, and F. Liu, "Composition-preserving deep photo aesthetics assessment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 497-506.
