
Under review as a conference paper at ICLR 2025

BAYESIAN ENHANCEMENT MODELS FOR ONE-TO-MANY MAPPING IN IMAGE ENHANCEMENT

Anonymous authors
Paper under double-blind review

ABSTRACT
Image enhancement is an ill-posed inverse problem because it admits multiple valid solutions. The loss of information makes accurately reconstructing the original image from observed data challenging, and the quality of the result is often subjective, varying with individual preferences. This poses a one-to-many mapping challenge. To address it, we propose a Bayesian Enhancement Model (BEM) that leverages Bayesian estimation to capture inherent uncertainty and accommodate diverse outputs. To address the noise in predictions of Bayesian Neural Networks (BNNs) for high-dimensional images, we propose a two-stage approach. The first stage utilises a BNN to model reduced-dimensional image representations, while the second stage employs a deterministic network to refine these representations. We further introduce a dynamic Momentum Prior to overcome convergence issues typically faced by BNNs in high-dimensional spaces. Extensive experiments across multiple low-light and underwater image enhancement benchmarks demonstrate the superiority of our method over traditional deterministic models, particularly in real-world applications lacking reference images, highlighting the potential of Bayesian models in handling one-to-many mapping problems.
1 INTRODUCTION
In computer vision, image enhancement refers to the process of improving the perceptual quality, visibility, and overall appearance of an image, which can involve reducing noise, increasing contrast, sharpening details, or correcting colour imbalances. In image enhancement tasks such as low-light image enhancement (LLIE) and underwater image enhancement (UIE), a common challenge arises from dynamic photography conditions, where a single degraded input image can correspond to multiple plausible target images. This phenomenon, known as the one-to-many mapping problem, arises because multiple valid outputs can be generated depending on varying conditions during image capture, such as changes in lighting, exposure, or other factors.

Recent advances in deep learning have shifted image enhancement towards data-driven approaches. Several deep learning-based models (Zamir et al., 2022; Cai et al., 2023) have achieved advanced results by learning mappings between low-quality (LQ) inputs and their high-quality (HQ) counterparts using paired datasets. However, we observe that existing datasets exhibit a one-to-many relationship between their input and target domains. Specifically, there exist cases with at least two image pairs whose inputs are identical or visually indistinguishable, yet whose corresponding targets exhibit notable variations. When such discrepancies arise due to ambiguity in the target domain, a traditional deep neural network, being a deterministic function, struggles to effectively model these one-to-many image pairs. Previous methods employing deterministic neural networks (DNNs) for image enhancement often overlook this class of one-to-many samples, leading to sub-optimal solutions. Figure 1 (middle) demonstrates how a deterministic neural network trained on one-to-many mapping data struggles to predict any specific target, instead producing an averaged output due to "regression toward the mean".

To tackle the inherent ambiguity in image enhancement tasks caused by one-to-many mappings, we adopt a Bayesian framework that models these mappings probabilistically. Rather than relying on a sub-optimal deterministic approach, our method leverages Bayesian inference to sample multiple sets of network weights from a learned distribution, effectively creating a diverse ensemble of deep networks. Each sampled network captures a distinct plausible solution, allowing our model to map a single input to a distribution of possible target outputs. This approach theoretically enables the mapping of all plausible variations, effectively modelling the complex one-to-many relationships present in real-world scenarios.


Figure 1: One-to-Many Mapping. The left panel shows an image crop x associated with multiple targets {y1, ..., y6}. A DNN (middle) trained on such data tends to predict the weighted average of all targets. In contrast, a BNN (right) models the one-to-many relation by producing different outputs according to a learned probability distribution.
While BNNs have shown promise in capturing uncertainty in various tasks (Kendall & Cipolla, 2016; Kendall et al., 2015; 2018; Pang et al., 2020), their potential for addressing the one-to-many mapping problem in image enhancement remains largely under-explored. By incorporating Bayesian inference into the enhancement process, our approach captures uncertainty in dynamic, uncontrolled environments, providing a more flexible and robust solution than traditional deterministic models. However, applying BNNs to these tasks presents significant challenges due to the high dimensionality of image data and the strong 2D spatial correlations between pixels: the weight uncertainty in BNNs often leads to noisy image outputs, while models with high-dimensional weight spaces are prone to underfitting (Dusenberry et al., 2020; Tomczak et al., 2021). To mitigate the noise in BNN predictions, we propose a two-stage approach that combines a BNN and a DNN (Sec. 4). Following this approach, we systematically address these challenges, unleashing the potential of BNNs in low-light and underwater enhancement tasks.

As the first work to explore the feasibility of BNNs for image enhancement, we select tasks in which the one-to-many mapping problem is particularly pronounced, namely LLIE and UIE, to validate our theoretical framework. The main contributions of this paper are summarised as follows:
• We identify the one-to-many mapping between inputs and outputs as a primary bottleneck in image enhancement models for LLIE and UIE, and propose the first Bayesian-based Enhancement Model (BEM) to learn this mapping.

• We introduce a dynamic prior, the Momentum Prior, to mitigate the convergence difficulties typically encountered by BNNs in high-dimensional weight spaces.

• To reduce the complexity of BEM in modelling high-dimensional image data, we propose a two-stage approach that combines the strengths of Bayesian NNs and deterministic NNs.
2 RELATED WORK
Bayesian Deep Learning. BNNs quantify uncertainty by learning distributions over network weights, offering robust predictions (Neal, 2012). Variational Inference (VI) is a common method for approximating these distributions (Graves, 2011; Blundell et al., 2015). Gal & Ghahramani (2016) simplify the implementation of BNNs by interpreting dropout as an approximate Bayesian inference method. Recent advancements show that adding uncertainty only to the final layer can efficiently approximate a full BNN (Harrison et al., 2024). Another line of approaches, such as Krishnan et al. (2020), explores the use of empirical Bayes to specify weight priors in BNNs, enhancing the model's adaptability to diverse datasets. These BNN approaches have shown promise across a range of vision applications, including camera relocalisation (Kendall & Cipolla, 2016) and semantic and instance segmentation (Kendall et al., 2015; 2018). Despite these advances, BNNs remain underutilised in image enhancement tasks.
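To make the dropout-as-approximate-Bayesian-inference idea mentioned above concrete, the sketch below shows Monte Carlo dropout prediction in PyTorch. This is an illustrative example of that technique, not the method proposed in this paper; it assumes a network whose only stochastic layers are dropout layers.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, T=20):
    """Approximate Bayesian prediction via MC dropout (Gal & Ghahramani, 2016)."""
    model.train()                                   # keep dropout active at test time
    preds = torch.stack([model(x) for _ in range(T)])
    model.eval()
    return preds.mean(dim=0), preds.var(dim=0)      # predictive mean and variance
```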
Probabilistic Models in Image Enhancement. Several works have utilised probabilistic models to address different aspects of image enhancement. Jiang et al. (2021) employed GANs to capture features for LLIE, while Fabbri et al. (2018) leveraged CycleGAN (Zhu et al., 2017) to generate synthetic paired datasets, addressing data scarcity in UIE. FUnIE-GAN (Islam et al., 2020) further demonstrated effectiveness in both paired and unpaired UIE training. Anantrasirichai & Bull (2021) applied unpaired learning for LLIE when the scene conditions are known. Wang et al. (2022) applied normalising-flow-based methods to reduce residual noise in LLIE predictions; however, the invertibility constraint limits model complexity. Zhou et al. (2024) mitigated this by integrating normalising flows with codebook techniques, introducing latent normalising flows. Diffusion Models (DMs) have been widely adopted for enhancement tasks (Hou et al., 2024; Tang et al., 2023). While DMs inherently address one-to-many mappings, their high latency for generating a single sample makes producing hundreds of candidates impractical. Owing to this limitation, DM-based methods often prefer to produce an average of multiple targets, as this helps reduce quality fluctuations within a single sampling process, as suggested by Jiang et al. (2023a).
2.1 PRELIMINARIES
In image enhancement, the output of a neural network can be interpreted as the conditional probability distribution of the target image, y ∈ Y, given the degraded input image x ∈ X and the network's weights w: P(y|x, w). Assuming the prediction errors follow a Gaussian distribution, the conditional probability density function (PDF) of the target image y can be modelled as a multivariate Gaussian whose mean is given by the neural network output F(x; w):

P(y | x, w) = \mathcal{N}\big(y \,\big|\, F(x; w), \mathrm{diag}(\sigma^2)\big).   (1)

The network weights w can be learned through maximum likelihood estimation (MLE). Given a dataset of image pairs {x_i, y_i}_{i=1}^N, the MLE estimate w^{MLE} is computed by maximising the log-likelihood of the observed data:

w^{\mathrm{MLE}} = \arg\max_{w} \sum_{i=1}^{N} \log P(y_i \,|\, x_i, w).   (2)

By optimising the objective in Eq. (2), the network F_w learns a deterministic function F_w : X → Y. The deterministic nature of such a mapping implies that whenever y_i ≠ y_j, the condition x_i ≠ x_j must hold. We argue that this deterministic process is inadequate in cases where one input corresponds to multiple plausible targets. In Sec. 3, we delve into methods for addressing this issue.
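As a minimal sketch (assuming a standard PyTorch training setup, not the authors' code), the fixed-variance Gaussian likelihood of Eq. (1) makes the MLE objective of Eq. (2) equivalent to minimising the mean squared error between F(x; w) and y:

```python
import torch.nn.functional as nnf

def mle_step(model, optimizer, x, y):
    """One MLE training step for a deterministic enhancement network."""
    optimizer.zero_grad()
    y_hat = model(x)                  # F(x; w)
    loss = nnf.mse_loss(y_hat, y)     # negative Gaussian log-likelihood up to constants
    loss.backward()
    optimizer.step()
    return loss.item()
```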
3 MODELLING THE ONE-TO-MANY MAPPING

3.1 BAYESIAN ENHANCEMENT MODELS
We introduce uncertainty into the network weights w through Bayesian estimation, thus obtaining a posterior distribution over the weights, w ∼ P(w|y, x). During inference, weights are sampled from this distribution. The posterior distribution over the weights is expressed as:

P(w | y, x) = \frac{P(y | x, w)\, P(w)}{P(y | x)},   (3)

where P(y|x, w) represents the likelihood of observing y given the input x and weights w, P(w) denotes the prior distribution of the weights, and P(y|x) is the marginal likelihood.

Unfortunately, for any neural network the posterior in Eq. (3) cannot be calculated analytically. This makes it impractical to directly sample weights from the true posterior distribution. Instead, we can leverage variational inference (VI) to approximate P(w|y, x) with a more tractable distribution q(w|θ), so that we can draw weight samples w from q(w|θ). As suggested by Hinton & Van Camp (1993); Graves (2011); Blundell et al. (2015), the variational approximation is fitted by minimising the Kullback-Leibler (KL) divergence between the two distributions:

\theta^\star = \arg\min_{\theta} \mathrm{KL}\left[q(w|\theta) \,\|\, P(w|y, x)\right]
            = \arg\min_{\theta} \int q(w|\theta) \log \frac{q(w|\theta)}{P(w)\, P(y|x, w)}\, dw    (applying Eq. (3))   (4)
            = \arg\min_{\theta} \; -\mathbb{E}_{q(w|\theta)}\left[\log P(y|x, w)\right] + \mathrm{KL}\left[q(w|\theta) \,\|\, P(w)\right].

We define the resulting cost function from Eq. (4) as:

\mathcal{L}(x, y) = \underbrace{-\mathbb{E}_{q(w|\theta)}\left[\log P(y|x, w)\right]}_{\text{data-dependent term}} + \underbrace{\mathrm{KL}\left[q(w|\theta) \,\|\, P(w)\right]}_{\text{prior matching term}}.   (5)

The loss function L(x, y) in Eq. (5), also known as the variational free energy, consists of two components: the data-dependent term and the prior matching term. The prior matching term can be approximated using the Monte Carlo method or computed analytically if a closed-form solution exists. The data-dependent term is equivalent to minimising the mean squared error between the input-output pairs in the training data. To optimise L(x, y), the prior distribution P(w) must be defined. In Sec. 3.2, we define a dynamic prior that accelerates convergence and better models complex one-to-many mappings in the data.
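The sketch below illustrates how Eq. (5) can be evaluated in practice. It is a minimal example under two assumptions that are not part of the paper: a PyTorch model whose Bayesian layers resample w ∼ q(w|θ) on every forward pass, and a `kl_divergence()` method that sums the per-layer prior-matching terms.

```python
def variational_free_energy(model, x, y, n_samples=1, kl_weight=1.0):
    """Eq. (5): Monte Carlo data-dependent term + KL[q(w|theta) || P(w)]."""
    data_term = 0.0
    for _ in range(n_samples):                 # each pass draws a fresh weight sample
        y_hat = model(x)
        data_term = data_term + (y_hat - y).pow(2).mean()
    data_term = data_term / n_samples
    return data_term + kl_weight * model.kl_divergence()   # prior matching term
```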
3.2 MOMENTUM PRIOR WITH EXPONENTIAL MOVING AVERAGE
In our preliminary experiments, significant performance degradation was observed when using a naive Gaussian prior (e.g., N(0, I)) or an empirical Bayes prior. To address this, we propose the Momentum Prior, a simple yet effective strategy that uses an exponential moving average to stabilise training by smoothing parameter updates and promoting convergence to better local optima. Suppose the variational posterior is a diagonal Gaussian; then the variational posterior parameters are θ = (µ, σ). A posterior sample of the weights w can be obtained via the reparameterisation trick (Kingma, 2014):

w = \mu + \sigma \circ \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).   (6)
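As a minimal sketch of Eq. (6) (assuming PyTorch; the softplus parameterisation of σ and the layer shape are illustrative choices, not taken from the paper), a linear layer with reparameterised weight sampling can be written as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.rho = nn.Parameter(torch.full((out_features, in_features), -5.0))  # sigma = softplus(rho)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps        # Eq. (6): w = mu + sigma ∘ eps
        return F.linear(x, w)
```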
Rather than confining the model to a fixed prior, we propose a dynamic prior whose parameters are updated as the exponential moving average (EMA) of the variational posterior parameters. Specifically, for the prior P(w) = N(w; µ_t^EMA, (σ_t^EMA)^2 I), the parameters are updated at each minibatch training step t over the training period [0, 1, 2, ..., T] as follows:

\mu_0^{\mathrm{EMA}} = \mathbf{0}, \quad \sigma_0^{\mathrm{EMA}} = \sigma^{o}\,\mathbf{1},
\mu_t^{\mathrm{EMA}} = \beta\,\mu_{t-1}^{\mathrm{EMA}} + (1-\beta)\,\mu_t, \quad t = 1, \dots, T,   (7)
\sigma_t^{\mathrm{EMA}} = \beta\,\sigma_{t-1}^{\mathrm{EMA}} + (1-\beta)\,\sigma_t, \quad t = 1, \dots, T,

where µ_t and σ_t denote the mean and standard deviation of the variational posterior q(w|θ) at training step t, σ^o is a scalar controlling the magnitude of the initial variance of the prior distribution, and β denotes the EMA decay rate. Thereafter, for minibatch optimisation with M image pairs, we update θ = (µ, σ) at step t by minimising the minibatch loss L_mini(x, y), reformulated from Eq. (5) as:

\mathcal{L}_{\mathrm{mini}}(x, y) = \underbrace{-\mathbb{E}_{q(w|\theta)}\left[\log P(y|x, w)\right]}_{\text{data-dependent term}} + \underbrace{\tfrac{1}{M}\,\mathrm{KL}\left[q(w|\theta) \,\|\, P(w)\right]}_{\text{prior matching term}}
= \underbrace{\mathbb{E}_{w \sim q(w|\theta)}\Big[\tfrac{1}{M}\sum_{i}^{M}\|F(x_i; w) - y_i\|_2^2\Big]}_{\text{data-dependent term}} + \underbrace{\log\tfrac{\sigma_t^{\mathrm{EMA}}}{\sigma} + \tfrac{\sigma^2 + (\mu - \mu_t^{\mathrm{EMA}})^2}{2(\sigma_t^{\mathrm{EMA}})^2} - \tfrac{1}{2}}_{\text{prior matching term}},   (8)

where the prior matching term is expressed as the analytical solution of KL[q(w|θ) ∥ P(w)].
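The following sketch shows the two ingredients of Eqs. (7) and (8): the EMA prior update and the closed-form KL divergence between diagonal Gaussians. It assumes flattened tensors `mu`/`sigma` holding the variational posterior parameters; this is illustrative code, not the authors' implementation.

```python
import torch

@torch.no_grad()
def update_momentum_prior(mu_ema, sigma_ema, mu, sigma, beta=0.999):
    """Eq. (7): EMA update of the prior parameters from the current posterior parameters."""
    mu_ema.mul_(beta).add_((1.0 - beta) * mu)
    sigma_ema.mul_(beta).add_((1.0 - beta) * sigma)
    return mu_ema, sigma_ema

def kl_diag_gaussians(mu, sigma, mu_prior, sigma_prior):
    """Closed-form KL[N(mu, sigma^2) || N(mu_prior, sigma_prior^2)], summed over weights."""
    return (torch.log(sigma_prior / sigma)
            + (sigma ** 2 + (mu - mu_prior) ** 2) / (2.0 * sigma_prior ** 2)
            - 0.5).sum()
```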
The momentum prior is motivated by the following reasoning: it begins with a naive Gaussian prior early in training, offering useful inductive biases (Wilson & Izmailov, 2020). However, as training progresses, relying on a fixed simple prior can restrict the network's capacity to fit the data effectively. To overcome this, the momentum prior gradually updates its parameters with empirical information from the data during training. The momentum prior is akin to the momentum teacher (He et al., 2020; Grill et al., 2020) in self-supervised learning but differs by regularising variational posterior parameters instead of student model outputs. This simple idea significantly improves BNN performance on our task. Additionally, the momentum prior also shares similarities with deep learning ensembles (Lakshminarayanan et al., 2017), a key strategy for uncertainty estimation, as per Ashukha et al. (2020). Unlike empirical Bayes (Robbins, 1956; Krishnan et al., 2020), which defines a static prior based on MLE-optimised parameters, our momentum-based strategy incrementally refines the prior during training. This continuous adaptation prevents the model from exploiting shortcut learning when optimising the data-dependent term in Eq. (5), thereby avoiding sub-optimal solutions.


3.3 PREDICTIONS UNDER UNCERTAINTY
After optimising the variational posterior parameters θ⋆ via Eq. (4), predictions are made by sampling weights w from the variational posterior distribution q(w|θ). As shown in Algorithm 1, we sample K sets of network weights {w_k}_{k=1}^K, where each w_k is used to produce a corresponding output ŷ_k via F(x; w_k). A quality metric D is then employed to rank the K candidates and select the most suitable output y^opt, with higher D-values indicating better quality.

Algorithm 1: Prediction
    Input: input x, network F
    Initialisation: best score s_best ← 0
    for k ← 1 to K do
        Sample ϵ_k ∼ N(0, I)
        w_k ← calculate Eq. (6)
        ŷ_k = F(x; w_k)
        if reference y exists then
            s_k = D(ŷ_k, y)        // reference
        else
            s_k = D(ŷ_k)           // no-reference
        if s_k > s_best then
            Update s_best ← s_k; set y^opt ← ŷ_k
    Output: optimal prediction y^opt

The prediction process is described for two cases depending on the availability of a reference:

i) With reference: when a reference image y is available, the quality metric D can be instantiated as the negative mean squared error (MSE) or other perceptual metrics to rank the K candidates, with the best score determining the final output.

ii) Without reference: in the absence of a reference image, the quality metric D(·) can be a no-reference image quality metric, such as negative NIQE (Mittal et al., 2012), UIQM (Panetta et al., 2015), or UCIQE (Yang & Sowmya, 2015). Alternatively, vision-language models such as CLIP (Radford et al., 2021; Wang et al., 2023) can be used to find the best-matching image based on a given textual description. For instance, CLIP's encoders can extract features from a predicted image ŷ_k and a text prompt (e.g., "A bright photo"), denoted as h_k and h_text, respectively. The quality metric D is then defined as their cosine similarity: D(ŷ_k) = h_k^⊤ h_text / (∥h_k∥ ∥h_text∥). We denote the BEM utilising CLIP as BEMCLIP. Meanwhile, our BEM can perform deterministic predictions (i.e., without requiring multiple weight samples) by simply setting w = µ. We refer to this deterministic mode as BEMDeterm.. However, due to its deterministic nature, BEMDeterm., like any deterministic model, is inherently sub-optimal for capturing complex one-to-many mappings.
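A minimal sketch of Algorithm 1 is given below. It assumes a model whose forward pass resamples w ∼ q(w|θ) each time it is called and a `quality` callable implementing the metric D; both names are illustrative, not the authors' API.

```python
import torch

@torch.no_grad()
def predict_under_uncertainty(model, x, quality, K=100, y_ref=None):
    """Sample K weight sets, score each candidate with D, and return the best output."""
    s_best, y_opt = float("-inf"), None
    for _ in range(K):
        y_hat = model(x)                                   # F(x; w_k), w_k ~ q(w | theta)
        s = quality(y_hat, y_ref) if y_ref is not None else quality(y_hat)
        if s > s_best:
            s_best, y_opt = s, y_hat
    return y_opt
```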
4 BNN + DNN: A TWO-STAGE APPROACH
Image data is inherently high-dimensional. While a BNN can be directly applied to model high-dimensional image data, doing so compromises precision due to the complexity involved (see Appendix E for a detailed analysis). To address this issue, we propose to use the BEM to model the one-to-many mapping in a lower-dimensional feature representation of the image, and then project the image features back to the original pixel space using a DNN.
4.1 THE FRAMEWORK
Figure 2: The two-stage pipeline. In Stage I, the BNN with weights w ∼ q(w|θ) is trained by minimising the minibatch loss L_mini(v_y, v̂_y) in Eq. (8). In Stage II, the DNN with weights w_G is trained by minimising the L1 loss, L1(y, ŷ). Solid arrows denote the inference process, dashed arrows the training process for each stage, and the gradient flow is indicated separately.
Figure 2 illustrates our proposed two-stage framework. We apply a reduction function ϕ to compress high-dimensional image data by either statistical summarisation or down-sampling, yielding compact representations v_x = ϕ(x) and v_y = ϕ(y) in a lower-dimensional space. In the first stage, the BEM models the complex one-to-many mapping between v_x and v_y. In the second stage, a DNN G refines the results by taking the first-stage low-dimensional output v̂_y along with the original low-quality image x as inputs, producing a high-quality recovered image. The overall process is formulated as:

\hat{v}_y = F(\phi(x); w), \quad w \sim q(w \,|\, \theta),   (9)
\hat{y} = G(\hat{v}_y, x; w_G),   (10)

where w_G denotes the weights of the second-stage model. We explore two reduction functions: bilinear downsampling and a local 2D histogram. Both are effective; however, bilinear downsampling yields higher scores on full-reference image quality assessment metrics. Since it is also more efficient to compute, we adopt it as the default setting. Further analysis of the reduction function ϕ is provided in Appendix A.

During the training phase of the second-stage model, we use the downsampled features of the target image y along with the low-quality image x as input to the DNN, instead of using the output from the first-stage model. This strategy removes constraints imposed by the first-stage model, thereby allowing the second stage to reach its full potential. Importantly, as illustrated in the inference flow in Figure 2, the inference process remains independent of the target image. Further analysis of the two-stage framework is provided in Appendix E.
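A minimal sketch of Eqs. (9)-(10) at inference time is given below, assuming PyTorch, a first-stage BNN that resamples its weights per forward pass, and a deterministic second-stage network; the function and argument names are illustrative, not from a released implementation.

```python
import torch
import torch.nn.functional as F

def two_stage_inference(bnn_stage1, dnn_stage2, x, scale=1 / 16):
    """Stage I predicts a low-dimensional representation; Stage II maps it back to pixels."""
    v_x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)  # phi(x)
    v_y_hat = bnn_stage1(v_x)        # Eq. (9): v_y_hat = F(phi(x); w), w ~ q(w | theta)
    return dnn_stage2(v_y_hat, x)    # Eq. (10): y_hat = G(v_y_hat, x; w_G)
```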
Backbone Network. For both the first- and second-stage models, we adopt the same backbone network but with different input and output layers. To enable weight uncertainty in the first-stage model, we convert all convolution and linear layers in the backbone network to their Bayesian counterparts, whose weight parameters are obtained via Eq. (6). Inspired by Mamba (Gu & Dao, 2023) and VMamba (Liu et al., 2024b), which feature linear computational complexity for long-sequence modelling, we employ Mamba as the backbone of our BEM. The overall framework is akin to a U-Net. We provide the details of, and experiments with, the backbone in Appendix B.
4.2 SPEEDING UP INFERENCE
Similar to diffusion models, our BEM benefits from multiple inference passes to produce high-quality outputs. However, unlike the sequential denoising process of diffusion models, BEM allows parallel execution. We accelerate inference using two main strategies: I) applying Algorithm 1 only to the first-stage model to generate a low-resolution output, v^opt. With 16× downsampling in the function ϕ, this provides a theoretical 256× speedup. II) Parallelising the K iterations along the batch dimension achieves a speedup proportional to the GPU's parallel computing capability. As illustrated in Figure 3, the accelerated inference speed for image resolutions of 512² and 1024² is on the same level as single-pass inference. However, when the function D does not support parallel execution, the speed decreases proportionally to D's computational complexity. This acceleration strategy introduces a minor degradation in image quality: at K = 100, we observe an average drop of 3.2% in PSNR, while no decrease is noted in UIQM.

Figure 3: Inference speed before and after acceleration. A parallel implementation of D is employed. The model runs on an Nvidia RTX 4090.
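The sketch below illustrates strategy II: folding the K stochastic passes into one batched forward pass. It assumes an implementation in which each batch element receives an independent weight (or noise) sample, and uses illustrative names rather than the paper's code.

```python
import torch

@torch.no_grad()
def parallel_candidates(bnn_stage1, v_x, K=100):
    """Generate K first-stage candidates in a single forward pass along the batch dimension."""
    v_x_rep = v_x.repeat(K, 1, 1, 1)   # (K, C, h, w), one copy per weight sample, given v_x of shape (1, C, h, w)
    return bnn_stage1(v_x_rep)         # assumes each row is produced with its own w_k ~ q(w|theta)
```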
5 EXPERIMENTS
Datasets. We conduct experiments on several low-light image enhancement (LLIE) and underwater image enhancement (UIE) datasets. For LLIE, we evaluate our method on LOL-v1 (Wei et al., 2018) and LOL-v2 (real and synthetic subsets) (Yang et al., 2021), both of which have training and test splits, as well as on the unpaired LIME (Guo et al., 2016), NPE (Wang et al., 2013), MEF (Ma et al., 2015), DICM (Lee et al., 2013), and VV (Vonikakis et al., 2018) datasets. For UIE, we use the UIEB (Li et al., 2019a), U45 (Li et al., 2019b), and UCCS (Liu et al., 2020) datasets. The UIEB dataset is further divided into training, validation (R90), and test (C60) subsets.


Metrics. For paired datasets, we evaluate pixel-level accuracy using PSNR and SSIM, and perceptual quality using LPIPS (Zhang et al., 2018). For real-world datasets, we use NIQE (Mittal et al., 2012) as a no-reference metric. In UIE tasks, we additionally evaluate image quality using UIQM (Panetta et al., 2015) and UCIQE (Yang & Sowmya, 2015).

Settings. All models are trained with the Adam optimiser, starting at a learning rate of 2 × 10^-4 and decaying to 10^-6 using a cosine annealing schedule. The first-stage model is trained for 300K iterations on inputs reduced to a size of 24 × 24 through the function ϕ, while the second-stage model is trained for 150K iterations on inputs of size 128 × 128. The batch size M is set to 8, and ϕ defaults to bilinear downsampling with a 1/16 scaling factor. Unless stated otherwise, K is 100, D in Algorithm 1 is the negative MSE, and σ^o in Eq. (7) is set to 0.05.
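For clarity, the stated optimiser and schedule can be set up as in the sketch below (standard PyTorch utilities; the placeholder network is only there to make the snippet self-contained).

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)    # placeholder network for illustration
total_iters = 300_000               # first-stage iteration count from the settings above

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters, eta_min=1e-6)
# scheduler.step() is called once per training iteration
```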
5.1 QUANTITATIVE RESULTS
Full-reference evaluation offers a limited view of model performance. Even without obvious distributional shifts between training and test sets, test results may not fully reflect the model's generalisation to real-world scenarios. In contrast, no-reference evaluation provides a more practical and meaningful measure of a model's utility in real-world applications.
Table 1: Full-reference evaluation on the LOL-v1 and v2 datasets. The BEM in grey selects the output based on the GT images. The best results are in bold, and the second-best are underlined.

                                              LOL-v1                  LOL-v2-real             LOL-v2-syn
Method                           GT Mean  PSNR↑  SSIM↑  LPIPS↓    PSNR↑  SSIM↑  LPIPS↓    PSNR↑  SSIM↑  LPIPS↓
KinD (Zhang et al., 2019)           ✗     19.66  0.820  0.156     18.06  0.825  0.151     17.41  0.806  0.255
Restormer (Zamir et al., 2022)      ✗     22.43  0.823  0.147     18.60  0.789  0.232     21.41  0.830  0.144
SNR-Net (Xu et al., 2022)           ✗     24.61  0.842  0.151     21.48  0.849  0.157     24.14  0.928  0.056
RetinexFormer (Cai et al., 2023)    ✗     25.16  0.845  0.131     22.80  0.840  0.171     25.67  0.930  0.059
RetinexMamba (Bai et al., 2024)     ✗     24.03  0.827  0.146     22.45  0.844  0.174     25.89  0.935  0.054
LLFlow (Wang et al., 2022)          ✓     25.13  0.872  0.117     26.20  0.888  0.137     24.81  0.919  0.067
GlobalDiff (Hou et al., 2024)       ✓     27.84  0.877  0.091     28.82  0.895  0.095     28.67  0.944  0.047
GLARE (Zhou et al., 2024)           ✓     27.35  0.883  0.083     28.98  0.905  0.097     29.84  0.958  -
BEM (ours)                          ✗     26.83  0.877  0.072     28.89  0.902  0.076     29.22  0.955  0.031
BEM (ours)                          ✓     28.80  0.884  0.069     32.66  0.915  0.060     32.95  0.964  0.026
BEMDeterm. (ours)                   ✓     28.30  0.881  0.072     31.41  0.912  0.064     30.58  0.958  0.033
BEMCLIP (ours)                      ✓     28.43  0.882  0.071     30.01  0.910  0.076     31.51  0.961  0.030
Full-Reference Evaluation. For the LLIE tasks, we present quantitative comparisons with state-of-the-art methods on the LOL-v1 and LOL-v2 datasets, as detailed in Table 1. Our BEM significantly outperforms all previous methods across all metrics. Notably, on LOL-v2-real, BEM achieves an exceptionally high PSNR of 32.66 dB. Although deterministic models are considered sub-optimal for the one-to-many mapping problem, our BEMDeterm. (deterministic mode) still surpasses the previous methods across all benchmarks. We observed that previous methods often struggle to maintain high perceptual quality (measured by LPIPS) while ensuring pixel-level accuracy. However, our BEM excels at both, delivering the highest SSIM (0.877) and the lowest LPIPS (0.072) on LOL-v1. For the UIE tasks, we present quantitative comparisons on the UIEB-R90 dataset, as shown in Table 2. Our BEM outperforms the second-best WFI2-Net by 1.76 dB in PSNR. This superior performance, observed consistently across both LLIE and UIE tasks, highlights BEM's effectiveness and versatility.

Table 2: Quantitative comparisons on the UIEB-R90, UIEB-C60, U45, and UCCS datasets in terms of PSNR, SSIM, UIQM, and UCIQE. Best results are in bold, second best are underlined.

                                  UIEB-R90         UIEB-C60            U45               UCCS
Method                          PSNR↑  SSIM↑    UIQM↑  UCIQE↑    UIQM↑  UCIQE↑    UIQM↑  UCIQE↑
WaterNet (Li et al., 2019a)     21.04  0.860    2.399  0.591     -      -         2.275  0.556
Ucolor (Li et al., 2021)        20.13  0.877    2.482  0.553     3.148  0.586     3.019  0.550
PUIE-MP (Fu et al., 2022)       21.05  0.854    2.524  0.561     3.169  0.569     2.758  0.489
Restormer (Zamir et al., 2022)  23.82  0.903    2.688  0.572     3.097  0.600     2.981  0.542
CECF (Cong et al., 2024)        21.82  0.894    -      -         -      -         -      -
FUnIEGAN (Islam et al., 2020)   19.12  0.832    2.867  0.556     2.495  0.545     3.095  0.529
PUGAN (Cong et al., 2023)       22.65  0.902    2.652  0.566     -      -         2.977  0.536
U-Shape (Peng et al., 2023)     20.39  0.803    2.730  0.560     3.151  0.592     -      -
Semi-UIR (Huang et al., 2023)   22.79  0.909    2.667  0.574     3.185  0.606     3.079  0.554
WFI2-Net (Zhao et al., 2024a)   23.86  0.873    -      -         3.181  0.619     -      -
BEMCLIP (ours)                  24.36  0.921    2.885  0.554     3.266  0.608     3.115  0.558
BEM (ours)                      25.62  0.940    2.931  0.567     3.406  0.620     3.224  0.561
No-Reference Evaluation. For no-reference low-light images, we recover them using Algorithm 1 with D instantiated as the NIQE metric. We then evaluate our method on five unpaired datasets, as shown in Table 3, where we report the NIQE scores of SOTA methods. Our BEM consistently outperforms prior methods across all datasets. For enhancing no-reference underwater images, we instantiate D in Algorithm 1 as the UIQM and UCIQE metrics. We then evaluate our method on the C60, U45 and UCCS test sets. As shown in Table 2, BEM achieves the best UIQM scores across all test sets. With the UCIQE metric, we also achieve the best results on the U45 and UCCS test sets. These results, spanning different tasks and datasets, demonstrate the robustness and effectiveness of our method in real-world applications.

Table 3: No-reference evaluation on LIME, NPE, MEF, DICM and VV, in terms of NIQE↓. The best results are in boldface.

Method                      DICM   LIME   MEF    NPE    VV
ZeroDCE (Guo et al., 2020)  4.58   5.82   4.93   4.53   4.81
KinD (Zhang et al., 2019)   5.15   5.03   5.47   4.98   4.30
RUAS (Liu et al., 2021)     5.21   4.26   3.83   5.53   4.29
LLFlow (Wang et al., 2022)  4.06   4.59   4.70   4.67   4.04
PairLIE (Fu et al., 2023b)  4.03   4.58   4.06   4.18   3.57
RFR (Fu et al., 2023a)      3.75   3.81   3.92   4.13   -
GLARE (Zhou et al., 2024)   3.61   4.52   3.66   4.19   -
CIDNet (Feng et al., 2024)  3.79   4.13   3.56   3.74   3.21
BEM (ours)                  3.77   3.94   3.22   3.85   2.95
BEMDeterm. (ours)           3.55   3.56   3.14   3.72   2.91

5.2 VISUAL ANALYSIS

Predictions of One-to-Many. In Figure 4, we visualise the prediction process of BEM, in which multiple plausible candidates are generated. As shown at the top of the figure, these candidates exhibit apparent visual differences. The best prediction candidate, selected using Algorithm 1, is visually closest to the reference image. For no-reference prediction, we demonstrate that using the CLIP score with the text prompt "A bright photo" results in the brightest image being output. By instantiating D as the NIQE metric, we can avoid generating overexposed predictions, as shown at the bottom right.

Figure 4: Visualisation of the prediction process of BEM with reference (top: full-reference inference) and without reference (bottom: no-reference inference using CLIP brightness and NIQE scores). Zoom in for more details.

Qualitative Comparisons. We visually compare our BEM with twelve state-of-the-art UIE methods, including WaterNet (Li et al., 2019a), PRWNet (Huo et al., 2021), FUnIEGAN (Islam et al., 2020), PUGAN (Cong et al., 2023), MMLE (Zhang et al., 2022), PUIE-MP (Fu et al., 2022), FiveA+ (Jiang et al., 2023b), CLUIE (Li et al., 2023), Semi-UIR (Huang et al., 2023), UColor (Li et al., 2021), DM-Underwater (Tang et al., 2023), and CLIP-UIE (Liu et al., 2024a). As depicted in the first and second rows of Figure 5, our BEM achieves superior removal of underwater turbidity compared to other methods. In deeper ocean images with dominant blueish effects (last row in Figure 5), BEM better enhances visual clarity. Visual comparisons on five unpaired LLIE test sets are shown in Figure 6, where our restored images offer better perceptual improvement. For example, on DICM, our method enhances brightness while effectively avoiding overexposure. These visual improvements align with the superior quantitative results presented in Sec. 5.1. HD visual results are included in Appendix E.

Figure 5: Visual comparisons on the R90, C60 and U45 datasets. Best viewed when zoomed in.

Figure 6: Visual comparisons on the DICM, LIME, MEF, NPE and VV datasets.
469
5.3 ABLATION STUDIES

Single-Stage vs. Two-Stage Approaches. We assess the performance of our two-stage approach by comparing it against a single-stage variant. As discussed in Sec. 4, directly converting a DNN into a BNN typically results in noisy predictions. To generate smooth outputs, our single-stage model retains the last layer of the network as a deterministic layer, the opposite of the Bayesian last layer method (Harrison et al., 2024). While the two-stage approach introduces only marginal additional computational overhead, its performance significantly surpasses that of the single-stage model, as shown in Table 4. This highlights the efficiency and effectiveness of our two-stage approach.

Table 4: Single-stage vs. two-stage approaches on LOL-v1. FLOPs are calculated at an input size of 256×256 pixels.

Model         FLOPs    PSNR↑   SSIM↑
Single Stage  20.41G   24.78   0.852
Two Stages    20.49G   26.83   0.877

Magnitude of Uncertainty. The performance improvements of our BEM primarily stem from its ability to effectively model the one-to-many mapping using BNNs. To support this claim, we evaluate the influence of the variance of the variational posterior on model performance. As shown in Figure 7, except for BEM with σ^o = 0.0001, all BEM instances outperform the DNN. This indicates that by setting a moderate variance in the momentum prior, BEM can significantly surpass its DNN counterpart.
Figure 7: Effect of initial variance values (i.e., σ^o in Eq. 7) on model performance (PSNR, SSIM and LPIPS). The results are obtained by evaluating single-stage models on the LOL-v1 dataset. "Determ." denotes the deterministic baseline model.
Impact of Different Priors. We evaluate the effectiveness of our momentum prior against two baseline priors: a naive Gaussian prior and an empirical Bayes prior. The naive Gaussian prior is defined as P(W) = N(0, 0.1I). The empirical Bayes prior, MOPED (Krishnan et al., 2020), is defined as P(W) = N(w^MLE, 0.1I), where w^MLE represents the maximum likelihood estimate (MLE) of the weights learned by optimising a deterministic network. In the case of the empirical Bayes prior, the mean µ of the variational posterior q(w|θ) is initialised as the MLE of the weights, w^MLE, and the posterior variance σ is set to 0.1 w^MLE, as suggested by Krishnan et al. (2020). As shown in Figure 8, the momentum prior demonstrates a clear advantage over both baselines. While the empirical Bayes prior accelerates training during early iterations, its performance degrades over time due to the fixed nature of the prior. The fixed prior, learned from the same data, can act as a shortcut during the optimisation of the variational posterior parameters, minimising the loss function in Eq. (5) predominantly by reducing the prior matching term KL[q(w|θ) ∥ P(w)]. This behaviour bypasses data-driven learning, ultimately resulting in sub-optimal solutions that do not fully capture the data's inherent uncertainty.

Figure 8: Training curves of one-stage BEMs with different priors (Momentum Prior, Empirical Bayes Prior, Naive Gaussian). The PSNR at each iteration is calculated using the mean weight µ.
6 DISCUSSION AND CONCLUSION
Although BEM demonstrates stronger generalisation capability than DNN-based methods, fully realising its potential will require intentionally collecting target images under diverse capture settings to further increase label diversity. While using small image crops as training data can alleviate the label-diversity problem to some extent, similar to conventional data augmentation strategies in DNNs, this approach has limitations. We leave these aspects for future work. Additionally, the distinction between image enhancement and image restoration is not always well-defined, as some restoration tasks (e.g., image colourisation and de-raining) may also present one-to-many mapping challenges. Consequently, our BEM could be extended to certain image restoration scenarios.

Overall, we identified the one-to-many mapping problem as a key limitation in existing image enhancement tasks and introduced the first Bayesian-based model to address this issue. To facilitate efficient training on high-dimensional data, we proposed a Momentum Prior that dynamically refines the prior distribution during training, enhancing convergence and performance. Our two-stage framework integrates the strengths of BNNs and DNNs, yielding a flexible yet computationally efficient model. Extensive experiments on various image enhancement benchmarks demonstrate significant performance gains over state-of-the-art models, showcasing the potential of Bayesian probabilistic models in handling the inherent ambiguities of image enhancement tasks and paving the way for future research in modelling complex one-to-many mappings in low-level vision tasks.

540
R EFERENCES
541
542 Nantheera Anantrasirichai and David Bull. Contextual colorization and denoising for low-light ultra
543 high resolution sequences. In 2021 IEEE International Conference on Image Processing (ICIP),
544 pp. 1614–1618. IEEE, 2021.
545 Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain
546 uncertainty estimation and ensembling in deep learning. International Conference on Learning
547 Representations (ICLR), 2020.
548
549
Jiesong Bai, Yuhao Yin, and Qiyuan He. Retinexmamba: Retinex-based mamba for low-light image
enhancement. arXiv preprint arXiv:2405.03349, 2024.
550
551 Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in
552 neural network. In International conference on machine learning, pp. 1613–1622. PMLR, 2015.
553
554
Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang. Retinexformer:
One-stage retinex-based transformer for low-light image enhancement. In Proceedings of the
555
IEEE/CVF International Conference on Computer Vision, pp. 12504–12513, 2023.
556
557 Runmin Cong, Wenyu Yang, Wei Zhang, Chongyi Li, Chun-Le Guo, Qingming Huang, and Sam
558 Kwong. Pugan: Physical model-guided underwater image enhancement using gan with dual-
559 discriminators. IEEE Transactions on Image Processing, 32:4472–4485, 2023.
560
Xiaofeng Cong, Jie Gui, and Junming Hou. Underwater organism color fine-tuning via decomposition
561
and guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.
562 1389–1398, 2024.
563
564 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
565 Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An
566
image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.
567
568 Michael Dusenberry, Ghassen Jerfel, Yeming Wen, Yian Ma, Jasper Snoek, Katherine Heller, Balaji
569 Lakshminarayanan, and Dustin Tran. Efficient and scalable bayesian neural nets with rank-1
570 factors. In International conference on machine learning, pp. 2782–2792. PMLR, 2020.
571
Cameron Fabbri, Md Jahidul Islam, and Junaed Sattar. Enhancing underwater imagery using
572
generative adversarial networks. In 2018 IEEE international conference on robotics and automation
573
(ICRA), pp. 7159–7165. IEEE, 2018.
574
575 Yixu Feng, Cheng Zhang, Pei Wang, Peng Wu, Qingsen Yan, and Yanning Zhang. You only
576 need one color space: An efficient network for low-light image enhancement. arXiv preprint
577 arXiv:2402.05809, 2024.
578 Huiyuan Fu, Wenkai Zheng, Xiangyu Meng, Xin Wang, Chuanming Wang, and Huadong Ma.
579 You do not need additional priors or regularizers in retinex-based low-light image enhancement.
580 In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
581 18125–18134, 2023a.
582
583
Zhenqi Fu, Wu Wang, Yue Huang, Xinghao Ding, and Kai-Kuang Ma. Uncertainty inspired under-
water image enhancement. In European conference on computer vision, pp. 465–482. Springer,
584
2022.
585
586 Zhenqi Fu, Yan Yang, Xiaotong Tu, Yue Huang, Xinghao Ding, and Kai-Kuang Ma. Learning a
587 simple low-light image enhancer from paired low-light instances. In Proceedings of the IEEE/CVF
588 conference on computer vision and pattern recognition, pp. 22252–22261, 2023b.
589
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model
590
uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059.
591 PMLR, 2016.
592
593 Alex Graves. Practical variational inference for neural networks. Advances in neural information
processing systems, 24, 2011.

11
Under review as a conference paper at ICLR 2025

594
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena
595 Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar,
596 et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural
597 information processing systems, 33:21271–21284, 2020.
598
599 Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv
600 preprint arXiv:2312.00752, 2023.
601 Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin
602 Cong. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of
603 the IEEE/CVF conference on computer vision and pattern recognition, pp. 1780–1789, 2020.
604
605 Xiaojie Guo, Yu Li, and Haibin Ling. Lime: Low-light image enhancement via illumination map
606
estimation. IEEE Transactions on image processing, 26(2):982–993, 2016.
607 Jiang Hai, Zhu Xuan, Ren Yang, Yutong Hao, Fengzhu Zou, Fang Lin, and Songchen Han. R2rnet:
608 Low-light image enhancement via real-low to real-normal network. Journal of Visual Communica-
609 tion and Image Representation, 90:103712, 2023.
610
611 James Harrison, John Willes, and Jasper Snoek. Variational bayesian last layers. In International
612
Conference on Learning Representations (ICLR), 2024.
613 Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for
614 unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on
615 computer vision and pattern recognition, pp. 9729–9738, 2020.
616
617 Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the
618
description length of the weights. In Proceedings of the sixth annual conference on Computational
learning theory, pp. 5–13, 1993.
619
620 Jinhui Hou, Zhiyu Zhu, Junhui Hou, Hui Liu, Huanqiang Zeng, and Hui Yuan. Global structure-aware
621 diffusion process for low-light image enhancement. Advances in Neural Information Processing
622 Systems, 36, 2024.
623
624
Shirui Huang, Keyan Wang, Huan Liu, Jun Chen, and Yunsong Li. Contrastive semi-supervised
learning for underwater image restoration via reliable bank. In Proceedings of the IEEE/CVF
625
conference on computer vision and pattern recognition, pp. 18145–18155, 2023.
626
627 Fushuo Huo, Bingheng Li, and Xuegui Zhu. Efficient wavelet boost learning-based multi-stage pro-
628 gressive refinement network for underwater image enhancement. In Proceedings of the IEEE/CVF
629 international conference on computer vision, pp. 1944–1952, 2021.
630
Md Jahidul Islam, Youya Xia, and Junaed Sattar. Fast underwater image enhancement for improved
631
visual perception. IEEE Robotics and Automation Letters, 5(2):3227–3234, 2020.
632
633 Hai Jiang et al. Low-light image enhancement with wavelet-based diffusion models. ACM Transac-
634 tions on Graphics (TOG), 42(6):1–14, 2023a.
635
Jingxia Jiang, Tian Ye, Jinbin Bai, Sixiang Chen, Wenhao Chai, Shi Jun, Yun Liu, and Erkang
636
Chen. Five a+ network: You only need 9k parameters for underwater image enhancement. British
637
Machine Vision Conference (BMVC), 2023b.
638
639 Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou,
640 and Zhangyang Wang. Enlightengan: Deep light enhancement without paired supervision. IEEE
641 transactions on image processing, 30:2340–2349, 2021.
642
Alex Kendall and Roberto Cipolla. Modelling uncertainty in deep learning for camera relocalization.
643
In 2016 IEEE international conference on Robotics and Automation (ICRA), pp. 4762–4769. IEEE,
644 2016.
645
646 Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty
647 in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint
arXiv:1511.02680, 2015.

12
Under review as a conference paper at ICLR 2025

648
Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses
649 for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and
650 pattern recognition, pp. 7482–7491, 2018.
651
652 Diederik P Kingma. Auto-encoding variational bayes. International Conference on Learning
653 Representations (ICLR), 2014. URL http://arxiv.org/abs/1312.6114.
654 Ranganath Krishnan, Mahesh Subedar, and Omesh Tickoo. Specifying weight priors in bayesian
655 deep neural networks with empirical bayes. In Proceedings of the AAAI conference on artificial
656 intelligence, volume 34, pp. 4477–4484, 2020.
657
658
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive
uncertainty estimation using deep ensembles. Advances in neural information processing systems,
659
30, 2017.
660
661 Chulwoo Lee, Chul Lee, and Chang-Su Kim. Contrast enhancement based on layered difference
662 representation of 2d histograms. IEEE transactions on image processing, 22(12):5372–5384, 2013.
663
Chongyi Li, Chunle Guo, Wenqi Ren, Runmin Cong, Junhui Hou, Sam Kwong, and Dacheng Tao.
664 An underwater image enhancement benchmark dataset and beyond. IEEE transactions on image
665 processing, 29:4376–4389, 2019a.
666
667 Chongyi Li, Saeed Anwar, Junhui Hou, Runmin Cong, Chunle Guo, and Wenqi Ren. Underwa-
668 ter image enhancement via medium transmission-guided multi-color space embedding. IEEE
669
Transactions on Image Processing, 30:4985–5000, 2021.
670 Hanyu Li, Jingjing Li, and Wei Wang. A fusion adversarial underwater image enhancement network
671 with a public test dataset. arXiv preprint arXiv:1906.06819, 2019b.
672
673
Kunqian Li, Li Wu, Qi Qi, Wenjie Liu, Xiang Gao, Liqin Zhou, and Dalei Song. Beyond single refer-
ence for training: Underwater image enhancement via comparative learning. IEEE Transactions
674
on Circuits and Systems for Video Technology, 33(6):2561–2576, 2023.
675
676 Risheng Liu, Xin Fan, Ming Zhu, Minjun Hou, and Zhongxuan Luo. Real-world underwater
677 enhancement: Challenges, benchmarks, and solutions under natural light. IEEE transactions on
678 circuits and systems for video technology, 30(12):4861–4875, 2020.
679
Risheng Liu, Long Ma, Jiaao Zhang, Xin Fan, and Zhongxuan Luo. Retinex-inspired unrolling with
680 cooperative prior architecture search for low-light image enhancement. In Proceedings of the
681 IEEE/CVF conference on computer vision and pattern recognition, pp. 10561–10570, 2021.
682
683 Shuaixin Liu, Kunqian Li, and Yilin Ding. Underwater image enhancement by diffusion model with
684
customized clip-classifier. arXiv preprint arXiv:2405.16214, 2024a.
685 Yue Liu et al. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024b.
686
687 Kede Ma, Kai Zeng, and Zhou Wang. Perceptual quality assessment for multi-exposure image fusion.
688
IEEE Transactions on Image Processing, 24(11):3345–3356, 2015.
689 Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality
690 analyzer. IEEE Signal processing letters, 20(3):209–212, 2012.
691
692
Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business
Media, 2012.
693
694 Karen Panetta, Chen Gao, and Sos Agaian. Human-visual-system-inspired underwater image quality
695 measures. IEEE Journal of Oceanic Engineering, 41(3):541–551, 2015.
696
Tongyao Pang, Yuhui Quan, and Hui Ji. Self-supervised bayesian deep learning for image recovery
697
with applications to compressive sensing. In Computer Vision–ECCV 2020: 16th European
698
Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp. 475–491. Springer,
699 2020.
700
701 Lintao Peng, Chunli Zhu, and Liheng Bian. U-shape transformer for underwater image enhancement.
IEEE Transactions on Image Processing, 2023.

13
Under review as a conference paper at ICLR 2025

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Ali M Reza. Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 38:35–44, 2004.

Herbert Robbins. An empirical Bayes approach to statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1:157–163, 1956.

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883, 2016.

Yi Tang, Hiroshi Kawasaki, and Takafumi Iwaguchi. Underwater image enhancement by transformer-based diffusion model with non-uniform sampling for skip strategy. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 5419–5427, 2023.

Marcin Tomczak, Siddharth Swaroop, Andrew Foong, and Richard Turner. Collapsed variational bounds for Bayesian neural networks. Advances in Neural Information Processing Systems, 34:25412–25426, 2021.

Vassilios Vonikakis, Rigas Kouskouridas, and Antonios Gasteratos. On the evaluation of illumination compensation algorithms. Multimedia Tools and Applications, 77:9211–9231, 2018.

Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 2555–2563, 2023.

Shuhang Wang, Jin Zheng, Hai-Miao Hu, and Bo Li. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Transactions on Image Processing, 22(9):3538–3548, 2013.

Yudong Wang, Jichang Guo, Huan Gao, and Huihui Yue. UIECˆ2-Net: CNN-based underwater image enhancement using two color space. Signal Processing: Image Communication, 96:116250, 2021.

Yufei Wang, Renjie Wan, Wenhan Yang, Haoliang Li, Lap-Pui Chau, and Alex Kot. Low-light image enhancement with normalizing flow. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 2604–2612, 2022.

Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep Retinex decomposition for low-light enhancement. In British Machine Vision Conference (BMVC), 2018.

Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 33:4697–4708, 2020.

Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia. SNR-aware low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17714–17724, 2022.

Miao Yang and Arcot Sowmya. An underwater color image quality evaluation metric. IEEE Transactions on Image Processing, 24(12):6062–6071, 2015.

Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu. Sparse gradient regularized deep Retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing, 30:2072–2086, 2021.

Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5728–5739, 2022.

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.

Weidong Zhang, Peixian Zhuang, Hai-Han Sun, Guohou Li, Sam Kwong, and Chongyi Li. Underwater image enhancement via minimal color loss and locally adaptive contrast enhancement. IEEE Transactions on Image Processing, 31:3997–4010, 2022.

Yonghua Zhang, Jiawan Zhang, and Xiaojie Guo. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1632–1640, 2019.

Chen Zhao, Weiling Cai, Chenyu Dong, and Chengwei Hu. Wavelet-based Fourier information interaction with frequency diffusion adjustment for underwater image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8281–8291, 2024a.

Chen Zhao, Chenyu Dong, and Weiling Cai. Learning a physical-aware diffusion model based on transformer for underwater image enhancement. arXiv preprint arXiv:2403.01497, 2024b.

Han Zhou, Wei Dong, Xiaohong Liu, Shuaicheng Liu, Xiongkuo Min, Guangtao Zhai, and Jun Chen. GLARE: Low light image enhancement via generative latent feature based codebook retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
A EXPERIMENTS ON REDUCTION FUNCTION ϕ

Regarding the form of the reduction function ϕ in Eq. (9), we consider two instantiations: bilinear downsampling and a local 2D histogram. As illustrated in Figure 9, the local histogram allows the recovered images to preserve more detail than bilinear downsampling, because the number of histogram bins can be configured to avoid losing too much information when the downsampling scale is large.
[Figure 9: schematic of ϕ(·) mapping an H×W×3 input to an (H/16)×(W/16)×3 map via (a) bilinear downsampling and to an (H/16)×(W/16)×18 map via (b) a local histogram.]

Figure 9: With the same downsampling scale, the local histogram offers more precise control over the amount of retained information by adjusting the number of bins (corresponding to the number of channels). In contrast, bilinear downsampling tends to lose excessive details, especially when using larger downsampling strides.
The discrete nature of histograms poses challenges in both prediction accuracy and computational speed. To address this, we approximate the histogram calculation using Kernel Density Estimation (KDE), which significantly improves both computational efficiency and prediction accuracy. As shown in Table 5, while the pixel-level PSNR of the local histogram-based ϕ is slightly lower than that of bilinear downsampling, we attribute this to the larger variance inherent in histogram values, which the model struggles to fit effectively.
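For illustration, the sketch below shows one way such a KDE-based (soft-binned) local histogram could be computed in PyTorch; the function name, the bin layout over [0, 1], and the Gaussian bandwidth are illustrative assumptions rather than the exact implementation of ϕ used here.

```python
import torch
import torch.nn.functional as F

def soft_local_histogram(x, patch=16, bins=10, bandwidth=0.05):
    """KDE approximation of a per-patch, per-channel histogram (soft binning).

    x: image tensor of shape (B, 3, H, W) with values in [0, 1].
    Returns a feature map of shape (B, 3 * bins, H // patch, W // patch).
    """
    B, C, H, W = x.shape
    centres = torch.linspace(0.0, 1.0, bins, device=x.device)        # bin centres
    # Non-overlapping patches: (B, C*patch*patch, L) -> (B, C, L, patch*patch)
    patches = F.unfold(x, kernel_size=patch, stride=patch)
    patches = patches.view(B, C, patch * patch, -1).permute(0, 1, 3, 2)
    # Gaussian kernel between every pixel value and every bin centre.
    diff = patches.unsqueeze(2) - centres.view(1, 1, bins, 1, 1)      # (B, C, bins, L, P)
    weights = torch.exp(-0.5 * (diff / bandwidth) ** 2)
    hist = weights.mean(dim=-1)                                       # average over pixels in a patch
    hist = hist / (hist.sum(dim=2, keepdim=True) + 1e-8)              # normalise over bins
    return hist.reshape(B, C * bins, H // patch, W // patch)

# Example: a 16x downscale with 10 bins yields a (1, 30, 16, 16) map, cf. the 30-channel row in Table 5.
v = soft_local_histogram(torch.rand(1, 3, 256, 256), patch=16, bins=10)
```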
Table 5: Comparisons of different instantiations of ϕ. The PSNR values on LOL-v1 are reported. K is set to 100.

Function ϕ        Downscale   Bins   Channels   PSNR ↑
Bilinear Down     8           N/A    3          25.87
Local Histogram   8           3      9          25.29
Local Histogram   8           10     30         24.96
Local Histogram   8           16     48         24.80
Bilinear Down     16          N/A    3          26.83
Local Histogram   16          10     30         25.89
Local Histogram   16          16     48         25.83
Despite this, we observe that the local histogram approach exhibits slightly better colour representation compared to the bilinear instance. In Figure 10, we present a visual comparison between the two implementations, highlighting that the histogram-based model generates more vivid colours. However, the bilinear downsampling method performs better in restoring details in areas where significant information loss occurs.
[Figure 10: side-by-side crops comparing the histogram-based and bilinear instantiations of ϕ.]

Figure 10: Visual comparison between the local histogram and bilinear downsampling implementations of the reduction function ϕ. The bilinear ϕ demonstrates better restoration capability compared to the histogram-based counterpart. However, the histogram-based ϕ shows better global colour representation. Best viewed when zoomed in.

B INVESTIGATION ON MAMBA BACKBONE

[Figure 11: diagram of (a) the backbone architecture, (b) the VSS block, and (c) the SS2D module.]

Figure 11: Overview of the Mamba backbone architecture, consisting of five feature stages, each comprising L_i VSS blocks. The shortcut connections are implemented using addition. Panel (a) illustrates the hierarchical structure of the backbone. Panel (b) details the VSS block, including its integration with the SS2D module. Panel (c) explains the SS2D mechanism, incorporating Cross-Scan, structured state-space modelling (SSM), and patch merging. Further details about SS2D can be found in Liu et al. (2024b).
Considering Mamba's linear computational complexity for long-sequence modelling, we adopt VMamba (Liu et al., 2024b) to build the backbone of our BEM. The overall framework is akin to a U-Net, but we replace all the Transformer blocks (Dosovitskiy et al., 2020) with Visual State-Space (VSS) blocks, each of which is composed of a 2D Selective Scan (SS2D) module (Liu et al., 2024b) and a feedforward network (FFN). The formulation of the VSS block (Liu et al., 2024b) in layer l can be expressed as

    h_l = SS2D(LN(h_{l-1})) + h_{l-1},
    h_{l+1} = FFN(LN(h_l)) + h_l,                    (11)

where FFN denotes the feedforward network and LN denotes layer normalisation; h_{l-1} and h_{l+1} denote the input and output of the l-th layer, respectively. As shown in Figure 11, the Mamba backbone consists of an input convolutional layer, L_1 + L_2 + L_3 + L_4 + L_5 VSS blocks, and an output convolutional layer. After each downsampling operation, the spatial dimensions of the feature maps are halved, while the number of channels is doubled. Specifically, given an input image with a shape of H × W × 3, the encoding blocks obtain hierarchical feature maps of sizes H × W × C, H/2 × W/2 × 2C and H/4 × W/4 × 4C. In the last two feature stages, the features are upsampled with pixelshuffle layers (Shi et al., 2016). At each scale level, lateral connections link the corresponding blocks in the encoder and decoder.
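For clarity, the pre-norm residual wiring of Eq. (11) can be written as a small PyTorch module; here the SS2D operator is passed in as a placeholder argument (a plain linear layer by default), so the sketch only illustrates the block structure, not the selective-scan itself.

```python
import torch
import torch.nn as nn

class VSSBlock(nn.Module):
    """Pre-norm residual wiring of Eq. (11): an SS2D branch followed by an FFN branch."""

    def __init__(self, dim, mlp_ratio=4.0, ss2d=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder: swap in the real 2D selective-scan module here.
        self.ss2d = ss2d if ss2d is not None else nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, h):                      # h: (B, H*W, C) token sequence
        h = h + self.ss2d(self.norm1(h))       # h_l     = SS2D(LN(h_{l-1})) + h_{l-1}
        h = h + self.ffn(self.norm2(h))        # h_{l+1} = FFN(LN(h_l)) + h_l
        return h
```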
Construct the backbone. We build our backbone by gradually evaluating each configuration of a vanilla Mamba-based UNet. We thoroughly investigate settings including ssm-ratio, block numbers, n_feat and mlp-ratio. The training strategies for all variants are identical. The setting n_feat denotes the number of feature maps in the first conv3×3's output, and d_state denotes the state dimension of the SSM. Note that the established baseline ensures two things: 1) naively introducing additional parameters and FLOPs, e.g., scaling models with more blocks, does not boost performance; 2) a technique that introduces additional parameters to the baseline model clearly demonstrates its effectiveness if the modified model outperforms the baseline.
Table 6: The performance of deterministic Mamba UNet variants with different d_state, ssm-ratio, mlp-ratio, n_feat and block numbers. PSNR and SSIM on LOL-v1 are reported. Since the deterministic networks trained using minibatch optimisation are likely to fit very different targets each time, the results will fluctuate greatly. We train each model five times and report the average performance.

d_state   ssm-ratio   mlp-ratio   n_feat   block numbers   FLOPs (G)   Params (M)   TP img/s   PSNR (dB)   SSIM
1         1           2.66        40       [2,2,2]         14.25       1.23         125        22.45       0.828
1         1           4           40       [2,2,2]         20.41       1.52         78         23.76       0.842
16        1           2.66        40       [2,2,2]         25.50       1.37         84         23.83       0.840
32        1           2.66        40       [2,2,2]         37.49       1.52         61         21.93       0.812
16        2           4           40       [2,2,2]         44.36       2.08         58         23.67       0.830
16        2           4           52       [2,2,2]         65.10       3.37         40         23.21       0.833
16        2           4           40       [2,2,2,2]       54.82       7.77         51         23.44       0.838
1         2           4           40       [2,2,2]         21.87       1.79         82         22.73       0.834
To balance both speed and performance, we selected the model in the second row of Table 6 as the backbone for our BEM. The chosen backbone features a simple architecture with no task-specific modules, enhancing its generalisability and establishing a solid foundation for extending our method to other types of vision tasks.
C CONTROLLABLE LOCAL ENHANCEMENT

Thanks to the interpretability of the lower-dimensional representations in both the spatial and channel dimensions, we can easily achieve local adjustment with a masking strategy. Local adjustment is particularly useful when the input images are unevenly distorted and we want the undistorted regions to remain consistent before and after enhancement. The local adjustment is achieved by applying a mask layer M: y_local = G(γM ⊙ v, x; w_G), where v can be lower-dimensional features extracted from a real image or estimated by the first-stage model via Eq. (9). A scalar γ controls the strength of the enhancement effect. A demonstration of local enhancement is shown in Figure 12.
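A minimal sketch of this masking step is given below, assuming the second-stage network G accepts the (scaled, masked) low-dimensional representation together with the input image; the function signature is a hypothetical interface used for illustration, not the exact one in our implementation.

```python
import torch

def local_enhance(G, x, v, mask, gamma=1.0):
    """Apply the second-stage network only where the mask is active.

    G:     second-stage network, called as G(conditioning, x).
    x:     input image, shape (B, 3, H, W).
    v:     low-dimensional representation, shape (B, C_v, h, w).
    mask:  binary or soft mask M in [0, 1], shape (B, 1, h, w), aligned with v.
    gamma: scalar controlling the enhancement strength.
    """
    v_local = gamma * mask * v      # y_local = G(gamma * M ⊙ v, x)
    return G(v_local, x)
```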

[Figure 12: an image before adjustment, the mask layer, and the image after adjustment.]

Figure 12: The local brightness of an image before adjustment (left) can be edited locally by providing a mask layer (middle). The image after adjustment (right) shows improved brightness in the regions indicated by the mask.
Compared to directly applying the mask to the output, our local enhancement strategy not only reduces the dependency on mask accuracy but also results in smoother transitions at the mask boundaries. This mitigates issues such as excessive roughness or colour inconsistencies between processed and unprocessed regions.
D LABEL DIVERSITY AUGMENTATION

Theoretically, an infinite number of target images could correspond to a single input. However, current paired datasets often lack sufficient label diversity, which may become a bottleneck for BEM model performance.
Table 7: Evaluation of label augmentation strategies for enhancing label diversity. PSNR scores are obtained using single-stage models on LOL-v1.

Model   Gamma Correction   Saturation Shift   CLAHE   PSNR ↑
BEM     –                  –                  –       24.78
BEM     ✓                  –                  –       24.89
BEM     ✓                  ✓                  –       24.93
BEM     ✓                  ✓                  ✓       24.86
DNN     –                  –                  –       24.02
DNN     ✓                  ✓                  ✓       21.58
Without relying on additional data collection, we propose two strategies for augmenting label diversity within existing datasets (a sketch of strategy ii) is given below):

i) When training a deep network, high-resolution images are often divided into smaller crops (e.g., 128×128). Many of these crops may depict the same scene, but, owing to factors such as being captured at different moments in a video or with different capture settings, the corresponding target crops differ in colour or brightness. Using these crops as training inputs therefore naturally increases the effective label diversity of the training data.

ii) Existing labels can be further enriched by applying data augmentation techniques such as random brightness adjustments, saturation shifts, changes in colour temperature, gamma corrections, and histogram equalisation.

Both strategies contribute to increasing label diversity to some extent.
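As a concrete instance of strategy ii), the sketch below derives an alternative target from a ground-truth image using a random gamma, a random saturation shift, and CLAHE (Reza, 2004) via OpenCV; the parameter ranges are illustrative choices and are not the exact settings behind Table 7.

```python
import cv2
import numpy as np

def augment_label(gt_bgr, rng=np.random):
    """Create an alternative target from a ground-truth image (uint8 BGR array)."""
    img = gt_bgr.astype(np.float32) / 255.0

    # Random gamma correction (nonlinear brightness change).
    gamma = rng.uniform(0.8, 1.25)
    img = np.clip(img ** gamma, 0.0, 1.0)

    # Random saturation shift (linear scaling of the S channel in HSV).
    hsv = cv2.cvtColor((img * 255).astype(np.uint8), cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * rng.uniform(0.9, 1.1), 0, 255)
    img = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

    # CLAHE on the luminance channel.
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab[..., 0] = clahe.apply(lab[..., 0])
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```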
In Table 7, we evaluate whether expanding the set of target images using gamma correction, saturation shift, and CLAHE (Reza, 2004) can further improve the model's performance. Among these, the saturation shift is a linear transformation, while gamma correction and CLAHE are nonlinear. We observed that deterministic networks suffered a decline in performance after applying these label augmentation techniques, which can be attributed to DNNs overfitting to local solutions that deviate further from the inference target as the uncertainty in the data increases. In contrast, BEM exhibited a slight increase in PSNR when trained with these augmented labels. For consistency, these augmentation strategies were not applied in other experiments.
E SUPPLEMENTARY VISUALISATIONS
[Figure 13: full-resolution enhancement results from KinD, SNR-Net, RetinexFormer and BEM (ours).]

Figure 13: Visual comparisons with KinD, SNR-Net and RetinexFormer at the images' original resolution. The sample is from the LOL-v2-real dataset.
HD Visualisation for LLIE. To facilitate a closer inspection of enhanced image details, we present high-resolution visual comparisons in Figure 13, where the predictions of state-of-the-art models are displayed at their original resolutions. The high-resolution visualisation reveals that previous state-of-the-art methods tend to exhibit varying degrees of noise artefacts in the enhanced results, significantly degrading perceptual quality. In contrast, our method effectively suppresses these noise artefacts, which are often introduced by low-light conditions. Furthermore, our approach achieves superior detail restoration, while other methods show signs of blurring and detail loss.

More Visualisations for UIE. In Figure 14, we present additional visual comparisons on the U45 and UCCS datasets, demonstrating that our method consistently outperforms PUGAN and PUIE-MP in enhancing various underwater scenes.
[Figure 14: qualitative results of PUGAN, PUIE-MP and BEM (ours) on samples from U45 (top) and UCCS (bottom).]

Figure 14: Visual comparisons with PUGAN and PUIE-MP on the U45 and UCCS test sets.
F MOTIVATION OF THE TWO-STAGE FRAMEWORK

To demonstrate the advantages and necessity of our two-stage BNN-DNN framework, we analyse its performance by comparing it with five other frameworks, as shown in Figure 15. The corresponding results on UIEB and LOL-v1 are presented in Table 8.
[Figure 15: schematics of the six variants: (a) BNN, (b) BNN-v2, (c) DNN, (d) BNN-DNN, (e) DNNdown-DNN, (f) Cascaded DNNs.]

Figure 15: Illustration of six framework variants, including three one-stage models (a, b, and c) on the left and three two-stage models (d, e, and f) on the right. The arrows indicate the inference process, with each framework demonstrating a different architectural design. The square box labelled "Linear" in (e) denotes that the final projection layer is a deterministic linear layer. In (d) and (e), the first and second stages are trained independently, while the two stages of Cascaded DNNs (f) are trained together. Enlarged views highlight key regions for better comparison.
F.1 LIMITATIONS OF ONE-STAGE BNN

For high-dimensional image data, a BNN introduces uncertainty in the prediction of each pixel. As shown in Figure 15 (a-b) and Figure 16, this pixel-level uncertainty manifests as noise in the output image, which negatively impacts both visual perception and certain image quality metrics. Nevertheless, one-stage BNN models still provide better results than pure DNN-based models. Visually, for example, by comparing the enlarged views of Figure 15 (a) and Figure 15 (c), we can observe that the BNN model is capable of recovering the red colour of the top surface of the box, while the DNN fails to do so. To cancel the noise in the enhanced image, we attempted to strengthen the spatial relations between adjacent pixels by keeping the BNN's output layer as a deterministic 3×3 convolutional layer, as shown in Figure 15 (b). However, the denoising effect of this simple method is not satisfactory, and because the deterministic layer is introduced within end-to-end training, the diversity of the model output is reduced.
F.2 RESORTING TO THE TWO-STAGE BNN-DNN FRAMEWORK

In BNN-v2 (b), by removing the uncertainty in the weights of the final convolutional layer, specifically by eliminating the random noise term ϵ ∼ N(0, I) in Eq. (6), we were able to significantly reduce the noise frequency. This leads us to hypothesise that the strong Gaussian-like noise observed in the output of the BNN is primarily caused by the noise term ϵ in each Bayesian layer. Therefore, to eliminate the noise in the output, it becomes necessary to replace the Bayesian layers near the output of the BNN with deterministic layers. However, this is not straightforward, as making the layers near the output deterministic inherently makes the entire output deterministic, effectively neutralising the uncertainty provided by the BNN. To address this, we propose splitting the model into a BNN part and a DNN part and training them separately. This forms the basis of our two-stage BNN-DNN framework.
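To make the role of the noise term ϵ explicit, the sketch below shows a generic mean-field Bayesian convolution using the reparameterisation w = μ + σ ⊙ ϵ with ϵ ∼ N(0, I); it is an illustrative stand-in for the Bayesian layers discussed above, not a verbatim copy of our layer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesConv2d(nn.Module):
    """Mean-field Gaussian convolution: weights are re-sampled at every forward pass."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        shape = (out_ch, in_ch, kernel_size, kernel_size)
        self.mu = nn.Parameter(torch.randn(shape) * 0.02)    # variational mean
        self.rho = nn.Parameter(torch.full(shape, -5.0))     # sigma = softplus(rho)
        self.padding = padding

    def forward(self, x, sample=True):
        if sample:
            sigma = F.softplus(self.rho)
            eps = torch.randn_like(sigma)        # the noise term eps ~ N(0, I)
            weight = self.mu + sigma * eps       # reparameterised weight sample
        else:
            weight = self.mu                     # deterministic layer: drop eps
        return F.conv2d(x, weight, padding=self.padding)
```

Using the deterministic branch (sample=False) only in the layers closest to the output roughly corresponds to the BNN-v2 variant in Figure 15 (b), which suppresses the noise injected near the output but, as noted above, also reduces output diversity when done within end-to-end training.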

Table 8: Comparisons of various one-stage and two-stage frameworks. For two-stage frameworks, the second column specifies whether 16× downsampling is applied to the input in the first stage.

Framework           Downscale (Stage-I)   UIEB-R90 PSNR ↑   SSIM ↑   LOL-v1 PSNR ↑   SSIM ↑
(a) BNN             N/A                   21.72             0.885    22.74           0.818
(b) BNN-v2          N/A                   23.71             0.899    24.78           0.852
(c) DNN             N/A                   20.83             0.864    23.76           0.842
(d) BNN-DNN         ✓                     25.62             0.940    26.83           0.877
(e) DNNdown-DNN     ✓                     20.68             0.812    22.85           0.823
(f) Cascaded DNNs   ✗                     20.95             0.873    23.98           0.827
(g) BNN-DNN         ✗                     17.78             0.689    19.26           0.798
To demonstrate the benefits of our separate training scheme, we compare it with Cascaded DNNs (f), where both stages are trained jointly. As shown in Table 8, the two-stage separate training scheme outperforms the conventional cascaded DNNs. Meanwhile, we conduct an ablation study on the BNN component of the two-stage framework (d). Specifically, we replace the BNN part with a DNN of equivalent size, resulting in the DNNdown-DNN framework (e). By comparing the performance of both frameworks across different datasets, as shown in Table 8, we observe that the BNN-DNN framework outperforms DNNdown-DNN. This result verifies that the primary performance improvement of the two-stage BNN-DNN framework is attributable to the BNN.
F.3 IMPORTANCE OF INPUT DOWNSAMPLING FOR STAGE-I

The input dimensionality reduction in the first stage of our BNN-DNN framework is crucial for the successful training of the second-stage model. This is because the two stages are trained independently, and during the training of the second stage, the predictions from the first stage are replaced with ground-truth (GT) information. Without dimensionality reduction, the training of the second stage becomes invalid, as it would merely result in learning an identity mapping, as evidenced by the result shown in the last row of Table 8. Furthermore, the BNN in the first stage is trained on downsampled, low-resolution images. We found that BNNs are more effective when dealing with these lower-dimensional data. In Table 9, we compare the performance of the BNN trained on 16× downsampled image datasets with its performance on the original-resolution datasets. Our results show that the BNN achieves more accurate predictions when processing lower-resolution images compared to high-resolution images. In contrast, the DNN shows no obvious difference in predictive performance between low-resolution and original-resolution images.

Table 9: Comparing the performance of the one-stage BNN on 16× downsampled image data of LOL-v1 and on the original-resolution LOL-v1.

Model     Dataset            PSNR ↑
BNNdown   16× down LOL-v1    25.43
BNN       LOL-v1             22.74
DNNdown   16× down LOL-v1    22.25
DNN       LOL-v1             23.76

In Figure 16, we compare the enhanced outputs of the one-stage and two-stage models. The one-stage model's output exhibits noticeable noise due to the per-pixel uncertainty predictions of the BNN, whereas the two-stage model produces a noise-free output.

[Figure 16: one-stage and two-stage enhancement results side by side.]

Figure 16: A visual comparison of enhanced images produced by the one-stage BNN-v2 (left) and the two-stage BNN-DNN model (right).
G ANALYSIS OF PREDICTIVE UNCERTAINTY

In this section, we provide a statistical analysis of the diversity of the predictions generated by BEM. Table 10 presents the predictive uncertainty statistics collected from the LOL-v1 dataset. A larger standard deviation indicates higher uncertainty, suggesting that the BEM produces more diverse predictions and better captures the one-to-many mapping nature of the task. The maximum values approximate the upper bound of the BEM's predictive quality, while the minimum values approximate its lower bound.
Table 10: Statistics of predictive uncertainty on LOL-v1. CLIP-IQA (Brightness) indicates the CLIP feature similarity with the text prompt "Bright photo"; likewise, CLIP-IQA (Quality) uses the prompt "Good photo".

Metric                        Maximum   Mean     Median   Minimum   Standard deviation
PSNR                          26.89     22.87    22.97    17.90     1.911
SSIM                          0.876     0.855    0.856    0.819     0.013
CLIP-IQA (Brightness) ×100    93.62     89.63    89.71    84.20     1.689
CLIP-IQA (Quality) ×100       64.34     59.13    59.08    54.22     1.825
CLIP-IQA (Noisiness) ×100     36.17     30.06    30.02    25.08     1.942
Negative NIQE                 -4.647    -4.808   -4.806   -4.971    0.059
As shown in Table 10, the minimum CLIP-IQA values in the LOL dataset are significantly smaller than the maximum values, potentially reflecting the presence of low-quality GT images in the dataset. We hypothesise that these poor-quality GT images significantly impact the performance of deterministic neural networks. However, due to BEM's uncertainty modelling, such low-quality GT images primarily affect the lower bound of BEM's predictive quality, minimising their overall influence on performance.

In Figure 17, we randomly selected an input image from the heterogeneous dataset LSRW (Hai et al., 2023) to analyse the distribution of its prediction results. We observe that, for each metric, although many predictions fall within the central range, they are not overly concentrated. This demonstrates the diversity of the model's predictions.
[Figure 17: violin plots of prediction scores for PSNR, SSIM, CLIP-IQA (Brightness), CLIP-IQA (Quality) and CLIP-IQA (Noisiness).]

Figure 17: Distribution of 500 random predictions generated by the BEM model for a single low-light image across different evaluation metrics, including PSNR, SSIM, and three CLIP-IQA metrics ("Brightness", "Quality", "Noisiness"). Each violin plot visualises the density and range of predictions.
From the uncertainty map (e) in Figure 18, we observe a structured distribution of uncertainty, where regions expected to be in shadow exhibit lower uncertainty, while illuminated areas tend to have higher uncertainty.
[Figure 18: panels (a) Input, (b) GT, (c) Max. (PSNR), (d) Min. (PSNR), (e) Uncertainty.]

Figure 18: Visualisation of BEM outputs showing the input image (a), ground truth (b), the prediction with the highest PSNR (c), the prediction with the lowest PSNR (d), and the uncertainty map (e). The uncertainty is computed as the pixel-wise standard deviation across 500 predicted images.
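In principle, the uncertainty map in panel (e) can be reproduced by repeatedly sampling the model and taking a per-pixel standard deviation, as in the sketch below; `model` is assumed to re-sample its Bayesian weights on every forward call.

```python
import torch

@torch.no_grad()
def uncertainty_map(model, x, num_samples=500):
    """Monte Carlo estimate of the per-pixel predictive mean and standard deviation.

    x: input image tensor of shape (1, 3, H, W).
    Returns (mean, std), each of shape (1, 3, H, W); averaging std over channels
    gives a single-channel uncertainty map as visualised in Figure 18 (e).
    """
    preds = torch.stack([model(x) for _ in range(num_samples)], dim=0)
    return preds.mean(dim=0), preds.std(dim=0)
```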
To investigate how the predictive uncertainty and quality of BEM are influenced by the overall GT quality in the training data, we conduct the following experiments, as detailed in Appendices G.1 and G.2.
G.1 STEP ONE: IDENTIFY LOW-QUALITY GT IMAGES IN TRAINING DATA

To separate training samples with low-quality GT images from the dataset, we initially employed CLIP-IQA (Wang et al., 2023) with the text prompts "Brightness", "Noisiness", and "Quality" to filter out images with low brightness, high noise levels, and poor quality. This automated process was followed by manual refinement to identify and separate poor-quality GT images. Examples of low-quality GT images from the LOL and UIEB training sets are shown in Figure 19 and Figure 20, alongside high-quality GT images for comparison. While the algorithmic filtering reduced excessive subjectivity, the manual refinement process may still introduce some subjective bias. Therefore, the separation results should be treated as indicative rather than definitive.
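As a rough illustration of the automated filtering step, the sketch below scores an image with CLIP using an antonym prompt pair, in the spirit of CLIP-IQA (Wang et al., 2023); it relies on the openai clip package, and the prompts and any cut-off applied to the resulting score are placeholders rather than our exact configuration.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def prompt_pair_score(pil_image, positive="Good photo.", negative="Bad photo."):
    """Return a score in (0, 1): softmax similarity towards the positive prompt."""
    image = preprocess(pil_image).unsqueeze(0).to(device)
    text = clip.tokenize([positive, negative]).to(device)
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.t()      # scaled cosine similarities
    return logits.softmax(dim=-1)[0, 0].item()       # probability of the positive prompt
```

GT images scoring poorly on the brightness/quality prompt pairs (or highly on a noisiness pair) would then be passed to the manual refinement step.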

[Figure 19: grids of low-quality GT images (top) and high-quality GT images (bottom) from the LOL training set.]

Figure 19: Examples of low-quality and high-quality GT images from the LOL training set. The categorisation may be influenced by subjective biases in assessing visual clarity, lighting, and overall image quality.
[Figure 20: grids of low-quality GT images (top) and high-quality GT images (bottom) from the UIEB training set.]

Figure 20: Examples of low-quality and high-quality GT images from the UIEB training set. The categorisation may be influenced by subjective biases in assessing visual clarity, lighting, and overall image quality.
G.2 STEP TWO: IMPACT OF TRAINING DATA QUALITY ON PREDICTIVE PERFORMANCE

When the dataset contains low-quality ground-truth images, BEM generates a distribution of predictive quality, producing both high-quality and low-quality outputs. The probability of generating high-quality outputs is influenced by the proportion of high-quality ground-truth images in the training data.
Specifically, as the proportion of high-quality ground-truth images increases, the probability of sampling high-quality outputs during inference also rises. Consequently, fewer sampling iterations are required to obtain satisfactory enhancement results. Conversely, when the proportion of high-quality ground-truth images is low, more sampling iterations are needed.

To examine whether the proportion of high-quality ground-truth (GT) images in the training data affects the likelihood of generating high-quality outputs, we pose the question: does increasing the share of high-quality images in the training set improve the probability of producing high-quality results?

To test this hypothesis, we conducted the following experiment. First, using the sample separation method described in Sec. G.1, we identified and labelled low-quality GT images in the training dataset. Next, while keeping the total size of the training dataset constant, we systematically replaced low-quality GT images in the LOL-v1 training set with high-quality GT images from the LOL-v2-real dataset. This allowed us to control the proportion of high-quality images in the training data, denoted as τ.

[Figure 21: bar chart of the percentage of high-quality predictions versus the proportion τ of high-quality training images (τ = 50% to 100%).]

Figure 21: Impact of training data quality on BEM. The x-axis represents the proportion of high-quality images in the training dataset (τ), while the y-axis shows the percentage of high-quality predictions obtained after K = 100 sampling times on the test set. Higher proportions of high-quality training data lead to a greater likelihood of generating high-quality predictions. A prediction is classified as high-quality if its CLIP (Quality) score exceeds 0.8.
The results, shown in Figure 21, demonstrate a clear trend: as the proportion of low-quality GT images decreases, the likelihood of generating high-quality outputs increases consistently. When the training dataset consists entirely of high-quality GT images (τ = 100%), BEM achieves significant efficiency, producing a satisfactory enhanced output approximately once every five sampling iterations on average. This highlights the direct relationship between training data quality and the predictive performance of BEM. Nonetheless, the true strength of BEM lies in its ability to generate high-quality enhanced images even when real-world data contains low-quality GT images, thanks to its uncertainty modelling capabilities. The trade-off, however, is the need for more sampling attempts.
H USE CLIP TO PICK OUT A HIGH-QUALITY ENHANCED IMAGE
As illustrated in Figure 22, the ground-truth images in the test set are of low quality. When evaluated using full-reference metrics such as MSE or PSNR, BEM produces outputs like image (b), which closely resemble the low-quality GT image. In contrast, when using CLIP-IQA as a no-reference metric, BEM generates outputs like image (a). Upon observation, image (a) demonstrates superior illumination and clarity compared to image (b) in Figure 22.

Figure 23 illustrates the outputs selected by BEM using the no-reference CLIP metric and the full-reference PSNR metric, alongside the other, unselected predictions. Notably, the results selected by both metrics are visually acceptable.
[Figure 22: two example rows, with PSNR values of 14.78 dB / 23.18 dB and 17.31 dB / 18.12 dB for the left and middle outputs, respectively, alongside the GT.]

Figure 22: A superior enhancement does not necessarily align with the suboptimal ground truth. The left (a) and middle (b) images represent two plausible outputs from BEM, showcasing diverse enhancements; the right images (c) are the GT. The left images are selected using the no-reference CLIP-IQA (Quality) metric, while the middle images are chosen based on the full-reference PSNR metric.
[Figure 23: a grid of sampled BEM predictions with the CLIP-IQA-selected and PSNR-selected outputs highlighted.]

Figure 23: Visualisation of BEM predictions. The pink box highlights the output selected using the no-reference CLIP-IQA ("Brightness", "Noisiness", "Quality") metric, while the blue box highlights the output selected using the full-reference PSNR metric. The input image is from the LSRW dataset (Hai et al., 2023).
In Table 11, we present the results obtained by instantiating the quality metric D in Algorithm 1 as CLIP-IQA with the text prompts "Natural", "Brightness", and "Warm". Notably, we intentionally avoided using "Quality" as the prompt for CLIP, as it tends to select the highest-quality images; given that some GT images in the LOL-v1 dataset are of suboptimal quality, this choice could result in a decrease in full-reference metrics like PSNR.
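A minimal sketch of the sample-and-select procedure is given below: draw K candidate enhancements and keep the one preferred by the chosen quality metric D. Here, quality_metric is a generic stand-in for either the no-reference CLIP-IQA score or a full-reference metric such as PSNR, and is not tied to the exact interface of Algorithm 1.

```python
import torch

@torch.no_grad()
def select_best(model, x, quality_metric, num_samples=100):
    """Sample K enhancements from the BEM and return the one preferred by D."""
    best_score, best_pred = -float("inf"), None
    for _ in range(num_samples):
        y = model(x)                    # each call re-samples the Bayesian weights
        score = quality_metric(y)       # D(y): e.g. CLIP-IQA(y) or PSNR(y, reference)
        if score > best_score:
            best_score, best_pred = score, y
    return best_pred, best_score
```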
I ADDITIONAL RESULTS ON UIEB
Table 11: Additional quantitative results of BEM using CLIP-IQA (denoted as BEMCLIP) on the LOL-v1 and v2 datasets. GT Mean is used to adjust the output brightness. The BEM model using a full-reference quality metric is denoted as BEMfull.

Method       LOL-v1 (PSNR ↑ / SSIM ↑ / LPIPS ↓)   LOL-v2-real (PSNR ↑ / SSIM ↑ / LPIPS ↓)   LOL-v2-syn (PSNR ↑ / SSIM ↑ / LPIPS ↓)
BEM          28.80 / 0.884 / 0.069                32.66 / 0.915 / 0.060                     32.95 / 0.964 / 0.026
BEMCLIP      28.43 / 0.882 / 0.071                30.01 / 0.910 / 0.076                     31.51 / 0.961 / 0.030
BEMDeterm.   28.30 / 0.881 / 0.072                31.41 / 0.912 / 0.064                     30.58 / 0.958 / 0.033
In Table 12, we provide additional results on the validation set of UIEB in terms of FID and LPIPS. The listed methods include UIECˆ2-Net (Wang et al., 2021), Water-Net (Li et al., 2019a), U-color (Li et al., 2021), U-shape (Peng et al., 2023), DM-water (Tang et al., 2023), PA-Diff (Zhao et al., 2024b), and WFI2-net (Zhao et al., 2024a).
Table 12: Results on UIEB in terms of FID and LPIPS.

Method    UIECˆ2-Net   Water-Net   U-color   U-shape   DM-water   PA-Diff   WFI2-net   BEM (ours)
FID ↓     35.06        37.48       38.25     46.11     31.07      28.74     27.85      26.11
LPIPS ↓   0.2033       0.2116      0.2337    0.2264    0.1436     0.1328    0.1248     0.1019
[Figure 24: panels (a), (b), (c).]

Figure 24: (a) Input image; (b) input image after linear brightness adjustment; (c) output of the one-stage BNN. When the input photo is particularly dark, the read noise becomes more prominent after brightness adjustment, making its impact on the output more noticeable. This suggests that the one-stage BNN might amplify such noise unintentionally due to its inherent uncertainty, leading to less desirable output results.