Fast Fourier Color Constancy

This paper presents a new color constancy algorithm called Fast Fourier Color Constancy (FFCC). FFCC achieves lower error rates than previous state-of-the-art methods while being much faster. It addresses the requirements of being fast, working with low quality images, providing uncertainty estimates, and allowing temporal smoothing of estimates over time, making it suitable for use as an automatic white balance algorithm in cameras.

Uploaded by

Hedy Liu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

46 views17 pages

Fast Fourier Color Constancy

Uploaded by

Hedy Liu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

Fast Fourier Color Constancy

Jonathan T. Barron Yun-Ta Tsai

[email protected] [email protected]

Abstract based techniques in computer vision, both problems reduce

arXiv:1611.07596v3 [cs.CV] 13 Aug 2020

to just estimating the “best” illuminant from an image, and

We present Fast Fourier Color Constancy (FFCC), a the question of whether that illuminant is objectively true
color constancy algorithm which solves illuminant esti- or subjectively attractive is just a matter of the data used
mation by reducing it to a spatial localization task on a during training.
torus. By operating in the frequency domain, FFCC pro- Despite their accuracy, modern learning-based color
duces lower error rates than the previous state-of-the-art by constancy algorithms are not immediately suitable as prac-
13 − 20% while being 250 − 3000× faster. This unconven- tical white balance algorithms, as practical white balance
tional approach introduces challenges regarding aliasing, has several requirements besides accuracy:
directional statistics, and preconditioning, which we ad- Speed - An algorithm running in a camera’s viewfinder
dress. By producing a complete posterior distribution over must run at 30 FPS on mobile hardware. But a camera’s
illuminants instead of a single illuminant estimate, FFCC compute budget is precious: demosaicing, face detection,
enables better training techniques, an effective temporal auto exposure, etc, must also run simultaneously and in real
smoothing technique, and richer methods for error analy- time. Spending more than a small fraction (say, 5 − 10%)
sis. Our implementation of FFCC runs at ∼ 700 frames per of a camera’s compute budget on white balance is impracti-
second on a mobile device, allowing it to be used as an ac- cal, suggesting that our speed requirement is closer to 1 − 5
curate, real-time, temporally-coherent automatic white bal- milliseconds per frame.
ance algorithm. Impoverished Input - Most color constancy algorithms are
designed for full resolution, high bit-depth input images, but
operating on such large images is challenging and costly in
1. Intro practice. To be fast, the algorithm must work well on the
small, low bit-depth “preview” images (32 × 24 or 64 × 48
A fundamental problem in computer vision is that of esti-
pixels, 8-bit) which are usually computed by specialized
mating the underlying world that resulted in some observed
camera hardware for this task.
image [1, 5]. One subset of this problem is color constancy:
Uncertainty - In addition to the illuminant, the algorithm
estimating the color of the illuminant of the scene and the
should produce some confidence measure or a complete
colors of the objects in the scene viewed under a white light.
posterior distribution over illuminants, thereby enabling
Despite its apparent simplicity, this problem has yielded a
convenient downstream integration with hand-engineered
great deal of depth and challenge for both the human vision
heuristics or external sources of information.
and computer vision communities [17, 23]. Color constancy
Temporal Coherence - The algorithm should allow the es-
is also a practical concern in the camera industry: produc-
timated illuminant to be smoothed over time, to prevent
ing a natural looking photograph without user intervention
color composition in videos from varying erratically.
requires that the illuminant be automatically estimated and
discounted, a process referred to as “auto white balance” In this paper we present a novel color constancy al-
among practitioners. Though there is a profound histori- gorithm, which we call “Fast Fourier Color Constancy”
cal connection between color constancy and consumer pho- (FFCC). Viewed as a color constancy algorithm, FFCC is
tography (exemplified by Edwin Land, the inventor of both 13 − 20% more accurate than the state of the art on stan-
Retinex theory [27] and the Polaroid instant camera), “color dard benchmarks. Viewed as a prospective white balance al-
constancy” and “white balance” have come to mean differ- gorithm, FFCC addresses our previously described require-
ent things — color constancy aims to recover the veridical ments: Our technique is 250−3000× faster than the state of
world behind an image, while white balance aims to give an the art, and is capable of running at 1.44 milliseconds per
image a pleasant appearance consistent with some aesthetic frame on a standard consumer mobile platform using the
or cultural norm. But with the current ubiquity of learning- thumbnail images already produced by that camera’s hard-

1
it also raises new problems: 1) pixel values are corrupted
with superimposed shapes that make detection difficult, 2)
detections must “wrap” around the edges of this toroidal
image, and 3) instead of an absolute, global location we can
only recover an aliased, incomplete location. FFCC works
by taking the large convolutional problem of CCC (ie, face
detection on A) and aliasing that problem down to a smaller
size where it can be solved efficiently (ie, face detection
on B). We will show that we can learn an effective color
constancy model in the face of the difficulty and ambiguity
(a) Image A (b) Aliased Image B
introduced by aliasing. This convolutional classifier will be
Figure 1: CCC [4] reduces color constancy to a 2D local- implemented and learned using FFTs, because the naturally
ization problem similar to object detection (1a). FFCC re- periodic nature of FFT convolutions resolves the problem of
peatedly wraps this 2D localization problem around a small detections “wrapping” around the edge of toroidal images,
torus (1b), which creates challenges but allows for faster and produces a significant speedup.
illuminant estimation. See the text for details. Our approach to color constancy introduces a number
of issues. The aforementioned periodic ambiguity result-
ing from operating on a torus (which we dub “illuminant
ware. FFCC produces a complete posterior distribution over aliasing”) requires new techniques for recovering a global
illuminants which allows us to reason about uncertainty and illuminant estimate from an aliased estimate (Section 3).
enables simple and effective temporal smoothing. Localizing the centroid of the illuminant on a torus is dif-
We build on the “Convolutional Color Constancy” ficult, requiring that we adopt and extend techniques from
(CCC) approach of [4], which is currently one of the top- the directional statistics literature (Section 4). But our ap-
performing techniques on standard color constancy bench- proach presents a number of benefits. FFCC improves accu-
marks [12, 20, 31]. CCC works by observing that applying racy relative to CCC by 17 − 24% while retaining its flex-
a per-channel gain to a linear RGB image is equivalent to ibility, and allows us to construct priors over illuminants
inducing a 2D translation of the log-chroma histogram of (Section 5). By learning in the frequency-domain we can
that image, which allows color constancy to be reduced to construct a novel method for fast frequency-domain regu-
the task of localizing a signature in log-chroma histogram larization and preconditioning, making FFCC training 20×
space. This reduction is at the core of the success of CCC faster than CCC (Section 6). Our model produces a com-
and, by extension, our FFCC technique; see [4] for a thor- plete unimodal posterior over illuminants as output, allow-
ough explanation. The primary difference between FFCC ing us to construct a Kalman filter-like approach for pro-
is that instead of performing an expensive localization on a cessing videos instead of independent images (Section 7).
large log-chroma plane, we perform a cheap localization on
a small log-chroma torus. 2. Convolutional Color Constancy
At a high level, CCC reduces color constancy to object
detection — in the computability theory sense of “reduce”. Let us review the assumptions made in CCC and inher-
FFCC reduces color constancy to localization on a torus in- ited by our model. Assume that we have a photometrically
stead of a plane, and because this task has no intuitive ana- linear input image I from a camera, with a black level of
logue in computer vision we will attempt to provide one1 . zero and with no saturated pixels2 . Each pixel k’s RGB
Given a large image A on which we would like to perform value in image I is assumed to be the product of that pixel’s
object detection, imagine constructing a smaller n × n im- “true” white-balanced RGB value W (k) and some global
age B in which each pixel in B is the sum of all values in A RGB illumination L shared by all pixels:
separated by a multiple of n pixels in either dimension:  (k)   (k)   
X Ir Wr Lr
B(i, j) = A(i + nk, j + nl) (1) ∀k Ig(k)  = Wg(k)  ◦ Lg  (2)
   
k,l (k) (k) L
Ib Wb b

This amounts to taking A and repeatedly wrapping it around

The task of color constancy is to use the input image I to
a small torus (see Figure 1). Detecting objects this way may
estimate L, and with that produce W (k) = I (k) /L.
yield a speedup as the image being processed is smaller, but
Given a pixel from our input RGB image I (k) , CCC de-
1 We cannot speak to the merit of this idea in the context of object

detection, and we present it here solely to provide an intuition of our work 2 in practice, saturated pixels are identified and removed from all down-

on color constancy stream computation, similarly to how color checker pixels are ignored.
(a) Input Image (b) Histogram (c) Aliased Histogram (d) Aliased Prediction (e) De-aliased Prediction (f) Output Image

Figure 2: An overview of our pipeline demonstrating the problem of illuminant aliasing. Similarly to CCC, we take an input
image (2a) and transform it into a log-chroma histogram (2b, presented in the same format as in [4]). But unlike CCC, our
histograms are small and toroidal, meaning that pixels can “wrap around” the edges (2c, with the torus “unwrapped” once
in every direction). This means that the centroid of a filtered histogram, which would simply be the illuminant estimate in
CCC, is instead an infinite family of possible illuminants (2d). This requires de-aliasing, some technique for disambiguating
between illuminants to select the single most likely estimate (2e, shown as a point surrounded by an ellipse visualizing the
output covariance of our model). Our model’s output (u, v) coordinates in this de-aliased log-chroma space corresponds to
the color of the illuminant, which can then be divided into the input image to produce a white balanced image (2f).

fines two log-chroma measures: with the properties of natural images has a significant effect,
as we will demonstrate.
(k)
u(k) = log Ig(k) /Ir(k) v (k) = log Ig(k) /Ib (3) Similarly to CCC, given an input image I we construct a
histogram N from I, where N (i, j) is the number of pixels
The absolute scale of L is assumed to be unrecoverable, so in I whose log-chroma is near the (u, v) coordinates corre-
estimating L simply requires estimating its log-chroma: sponding to histogram position (i, j):
Lu = log (Lg /Lr ) Lv = log (Lg /Lb ) (4)
u(k) − ulo
X
N (i, j) = mod − i, n < 1
After recovering (Lu , Lv ), assuming that L has a magnitude h
k
of 1 lets us recover the RGB values of the illuminant: (k) !
v − vlo
exp(−Lu ) 1 exp(−Lv ) ∧ mod − j, n < 1 (6)
Lr = Lg = Lb = h
p z z z
2 2
z = exp(−Lu ) + exp(−Lv ) + 1 (5) Where i, j are 0-indexed, n = 64 is the number of bins,
h = 1/32 is the bin size, and (ulo , vlo ) is the starting
Framing color constancy in terms of predicting log-chroma
point of the histogram. Because our histogram is too small
has several small advantages over the standard RGB ap-
to contain the wide spread of colors present in most nat-
proach (2 unknowns instead of 3, better numerical stability,
ural images, we use modular arithmetic to cause pixels to
etc) but the primary advantage of this approach is that us-
“wrap around” with respect to log-chroma (any other stan-
ing log-chroma turns the multiplicative constraint relating
dard boundary condition would violate our convolutional
W and I into an additive constraint [15], and this in turn
assumption and would cause many image pixels to be ig-
enables a convolutional approach to color constancy. As
nored). This means that, unlike standard CCC, a single
shown in [4], color constancy can be framed as a 2D spatial
(i, j) coordinate in the histogram no longer corresponds to
localization task on a log-chroma histogram N , where some
an absolute (u, v) color, but instead corresponds to an infi-
sliding-window classifier is used to filter that histogram and
nite family of (u, v) colors. Accordingly, the centroid of a
the centroid of that filtered histogram is used as the log-
filtered histogram no longer corresponds to the color of the
chroma of the illuminant.
illuminant, but instead is an infinite set of illuminants. We
will refer to this phenomenon as illuminant aliasing. Solv-
3. Illuminant Aliasing
ing this problem requires that we use some technique to de-
We assume the same convolutional premise of CCC, but alias an aliased illuminant estimate3 . A high-level outline of
with one primary difference to improve quality and speed:
3 It is tempting to refer to resolving the illuminant aliasing problem as
we use FFTs to perform the convolution that filters the log-
“anti-aliasing”, but anti-aliasing usually refers to preprocessing a signal to
chroma histogram, and we use a small histogram to make prevent aliasing during some resampling operation, which does not appear
that convolution as fast as possible. This change may seem possible in our framework. “De-aliasing” suggests that we allow aliasing
trivial, but the periodic nature of FFT convolution combined to happen to the input, but then remove the aliasing from the output.
our FFCC pipeline that illustrates illuminant (de-)aliasing model in an end-to-end fashion by propagating the gradients
can be seen in Fig. 2. of some loss computed on the de-aliased illuminant predic-
De-aliasing requires that we use some external information L̂ back onto the learned filters F . The centroid fitting
tion (or some external color constancy algorithm) to dis- in Eq. 11 is performed by fitting a bivariate von Mises dis-
ambiguate between illuminants. An intuitive approach is tribution to a PDF, which we will now explain.
to select the illuminant that causes the average image color
to be as neutral as possible, which we call “gray world de- 4. Differentiable Bivariate von Mises
aliasing”. We compute average log-chroma values (ū, v̄) for
Our architecture requires some mechanism for reducing
the entire image and use this to turn an aliased illuminant
a toroidal PDF P (i, j) to a single estimate of the illumi-
estimate (L̂u , L̂v ) into a de-aliased illuminant (L̂0u , L̂0v ):
nant. Localizing the center of mass of a histogram defined
on a torus is difficult: fitting a bivariate Gaussian may fail
ū = meank u(k) v̄ = meank v (k) (7) when the input distribution “wraps around” the sides of the
0
L̂u L̂

1 L̂u − ū

1
PDF, as shown in Fig. 3. Additionally, for the sake of tem-
0 = u − (nh) + (8) poral smoothing (Section 7) and confidence estimation, we
L̂v L̂v nh L̂v − v̄ 2
want our model to predict a well-calibrated covariance ma-
Another approach, which we call “gray light de-aliasing”, trix around the center of mass of P . This requires that our
is to assume that the illuminant is as close to the center of model be trained end-to-end, which therefore requires that
the histogram as possible. This de-aliasing approach sim- our mean/covariance fitting be analytically differentiable
ply requires carefully setting the starting point of the his- and therefore usable as a “layer” in our learning architec-
togram (ulo , vlo ) such that the true illuminants in natural ture. To address these problems we present a variant of the
scenes all lie within the span of the histogram, and setting bivariate von Mises distribution [28], which we will use to
L̂0 = L̂. We do this by setting ulo and vlo to maximize efficiently localize the mean and covariance of P in a man-
the distance between the edges of the histogram and the ner that allows for easy backpropagation.
bounding box surrounding the ground-truth illuminants in The bivariate von Mises distribution (BVM) is a com-
the training data4 . Gray light de-aliasing is trivial to imple- mon parameterization of a PDF on a torus. There exist
ment but, unlike gray world de-aliasing, it will systemati- several parametrizations which mostly differ in how “con-
cally fail if the histogram is too small to fit all illuminants centration” is represented (“concentration” having a sim-
within its span. ilar meaning to covariance). All of these parametriza-
To summarize the difference between CCC [4] and our tions present problems in our use case: none have closed
approach with regards to illuminant aliasing, CCC (approx- form expressions for maximum likelihood estimators [24],
imately) performs illuminant estimation as follows: none lend themselves to convenient backpropagation, and
all define concentration in terms of angles and therefore

L̂u u require “conversion” to covariance matrices during color
= lo + h arg max (N ∗ F ) (9) de-aliasing. For these reasons we present an alternative
L̂v vlo i,j
parametrization in which we directly estimate a BVM as
Where N ∗ F is performed using a pyramid convolution. a mean µ and covariance Σ in a simple and differentiable
FFCC corresponds to this procedure: closed form expression. Though necessarily approximate,
our estimator is accurate when the distribution is well-
P ← softmax (N ∗ F ) (10) concentrated, which is generally the case for our task.
(µ, Σ) ← fit bvm(P ) (11) Our input is a PDF P (i, j) of size n × n, where i and
j are integers in [0, n − 1]. For convenience we define a
L̂u mapping from i or j to angles in [0, 2π) and the marginal
← de alias(µ) (12)
L̂v distributions of P with respect to i and j:
Where N is a small and aliased toroidal histogram, convolu- 2πi X X
tion is performed with FFTs, and the centroid of the filtered θ(i) = Pi (i) = P (i, j) Pj (j) = P (i, j)
n j i
histogram is estimated and de-aliased as necessary. By con-
structing this pipeline to be differentiable we can train our We also define the marginal expectation of the sine and co-
4 Our histograms are shifted toward green colors rather than centered sine of the angle:
around a neutral color, as cameras are traditionally designed with an more X X
sensitive green channel which enables white balance to be performed by yi = Pi (i) sin(θ(i)) xi = Pi (i) cos(θ(i)) (13)
gaining red and blue up without causing color clipping. Ignoring this prac- i i
tical issue, our approach can be thought of as centering our histograms
around a neutral white light With xj and yj defined similarly.
Using this loss causes our model to produce a well-
calibrated complete posterior of the illuminant instead of
just a single estimate. This posterior will be useful when
processing video sequences (Section 7) and also allows us
to attach confidence estimates to our predictions using the
entropy of Σ (see the appendix).
Our entire system is trained end-to-end, which requires
Figure 3: We fit a bivariate von Mises distribution (shown that every step in BVM fitting and loss computation be an-
in solid blue) to toroidal PDFs P (i, j) to produce an aliased alytically differentiable. See the appendix for the analytical
illuminant estimate. Contrast this with fitting a bivariate gradients for Eqs. 14, 17, and 18, which can be chained to-
Gaussian (shown in dashed red) which treats the PDF as gether to backpropagate the gradient of f (·) onto the input
if it lies on a plane. Both approaches behave similarly if PDF P .
the distribution lies near the center of the unwrapped plane
(left) but fitting a Gaussian fails as the distribution begins to 5. Model Extensions
“wrap around” the edge (middle, right).
The system we have described thus far (compute a peri-
odic histogram of each pixel’s log-chroma, apply a learned
Estimating the mean µ of a BVM from a histogram just FFT convolution, apply a softmax, fit a de-aliased bivariate
requires computing the circular mean in i and j: von Mises distribution) works reasonably well (Model A in
n
Table 1) but does not produce state-of-the-art results. This
u mod 2π atan2(yi , xi ), n
µ = lo + h n (14) is likely because this model reasons about pixels indepen-
vlo mod 2π atan2(yj , xj ), n
dently, ignores all spatial information in the image, and does
Eq. 14 includes gray light de-aliasing, though gray world not consider the absolute color of the illuminant. Here we
de-aliasing can also be applied to µ after fitting. present extensions to the model which address these issues
We can fit the covariance of our model by simply “un- and improve accuracy accordingly.
wrapping” the coordinates of the histogram relative to the As explored in [4], a CCC-like model can be generalized
estimated mean and treating these unwrapped coordinates to a set of “augmented” images provided that these images
as though we are fitting a bivariate Gaussian. We define the are non-negative and “scale with intensity” [14]. This lets
“unwrapped” (i, j) coordinates such that the “wrap around” us apply certain filtering operations to image I and, instead
point on the torus lies as far away from the mean as possible, of constructing a single histogram from our image, con-
or equivalently, such that the unwrapped coordinates are as struct a “stack” of histograms constructed from the image
close to the mean as possible: and its filtered versions. Instead of learning and applying

µu − ulo

n
one filter, we learn a stack of filters and sum across chan-
ī = mod i − + ,n nels after convolution. The general family of augmented
h 2
images used in [4] are expensive to compute, so we instead
µv − vlo n use just the input image I and a local measure of absolute
j̄ = mod j − + ,n (15)
h 2 deviation in the input image:
Our estimated covariance matrix is simply the sample co- 1 1
variance of P (ī, j̄):
X X
1
E(x, y, c) = 8 |I(x, y, c) − I(x + i, y + j, c)| (19)
i=−1 j=−1
X X
E [ī] = Pi (i)ī E [j̄] = Pj (j)j̄ (16)
i j These two features appears to perform similarly to the four
+
 X
2
Pi (i)ī − E [ī]
2
X
P (i, j)īj̄ − E [ī] E [j̄]

features used in [4], while being cheaper to compute.

Σ=h 
 2 i i,j X

 (17) Just as a sliding-window object detector is often invariant
2
X
P (i, j)īj̄ − E [ī] E [j̄] + Pj (j)j̄ 2 − E [j̄]  to the absolute location of an object in an image, the convo-
i,j j
lutional nature of our baseline model makes it invariant to
We regularize the sample covariance matrix slightly by any global shift of the color of the input image. This means
adding a constant = 1 to the diagonal. that our baseline model cannot rely on any statistical regu-
With our estimated mean and covariance we can com- larities of the illumination by, say, modeling black body ra-
pute our loss: the negative log-likelihood of a Gaussian (ig- diation, the specific properties of commonly manufactured
noring scale factors and constants) relative to the true illu- light bulbs, or any varying spectral sensitivity across cam-
minant L∗ : eras. Though CCC does not model illumination directly,
T it appears to indirectly reason about illumination by using
L∗u
∗
−1 Lu
f (µ, Σ) = log |Σ| + − µ Σ − µ (18) the boundary conditions of its pyramid convolution to learn
L∗v L∗v
a regularization term g(Z):

Z ∗ = arg min (f (Z) + g (Z)) (21)

We require that the regularization g(Z) is the weighted sum

(a) Pixel Filter (b) Edge Filter (c) Illum. Gain (d) Illum. Bias of squared periodic convolutions of Z with some filter bank.
In our experiments g(Z) is the weighted sum of the squared
Figure 4: A complete learned model (Model J in Table 1)
difference between adjacent values (similar to a total varia-
shown in centered (u, v) log-chroma space, with bright-
tion loss [30]) and the sum of squared values:
ness indicating larger values. Our learned filters are cen-
tered around the origin (the predicted white point) and our
P 2
g(Z) =λ1 i,j (Z (i, j) − Z (mod(i + 1, n), j))
illuminant gain and bias maps model the black body curve 2
and varying camera sensitivity as two wrap-around line seg- + (Z (i, j) − Z (i, mod(j + 1, n)))
+λ0 i,j Z(i, j)2
P
ments (this dataset consists of images from two different (22)
cameras).
Where λ1 and λ0 are hyperparameters that determine the
strength of each smoothness term. We require that λ0 > 0
a model which is not truly spatially varying and is there- to prevent divide-by-zero issues during preconditioning.
fore sensitive to absolute color. Because a torus has no We use a variant of the standard FFT Fv (·) which bi-
boundaries, our model is invariant to global input color, so jectively maps from some real n × n image to a real n2 -
we must therefore introduce a mechanism for directly rea- dimensional vector, instead of the complex n × n image
soning about illuminants. We use a per-illuminant “gain” produced by a standard FFT (See the appendix for a formal
map G(i, j) and “bias” map B(i, j), which together apply description). With this, we can rewrite Eq. 22 as follows:
a per-illuminant affine transformation to the output of our r
previously-described convolution at (aliased) color (i, j). 1 2 2

w= λ1 |Fv ([1, −1])| + |Fv ([1; −1])| + λ0
The bias B causes our model to prefer certain illuminants n
T 2
over others, while the gain G causes the contribution of the g(Z) = Fv (Z) diag (w) Fv (Z) (23)
convolution at certain colors to be amplified.
Our two extensions (an augmented edge channel and an where the vector w is just some fixed function of the def-
illuminant gain/bias map) let us redefine the P in Eq. 10 as inition of g(Z) and the values of the hyperparameters λ1
! and λ0 . The 2-tap difference filters in Fv ([1, −1]) and
X Fv ([1; −1]) are padded to size (n × n) before the FFT.
P = softmax B + G ◦ (Nk ∗ Fk ) (20) With w we can define a mapping between our 2D image
k
space and a rescaled FFT vector space:
Where {Fk } are the set of learned filters for each augmented
z = w ◦ Fv (Z) (24)
channel’s histogram Nk , G is our learned gain map, and B
is our learned bias map. In practice we actually parametrize Where ◦ is an element-wise product. This mapping lets us
Glog when training and define G = exp(Glog ), which con- rewrite the optimization problem in Eq. 21 as:
straints G to be non-negative. Visualizations of G and B z
and our learned filters can be seen in Fig. 4.

2
Z ∗ = Fv−1 1
w arg min f Fv−1 + kzk (25)
z w
6. Fourier Regularization and Preconditioning
where Fv−1 (·) is the inverse of Fv (·), and division is
Our learned model weights ({Fk }, G, B) are all peri- element-wise. This reparametrization reduces the compli-
odic n × n images. To improve generalization, we want cated regularization of Z to a simple L2 regularization of z,
these weights to be small and smooth. In this section we which has a preconditioning effect.
present the general form of the regularization used during We use this technique during training to reparameterize
training, and we show how this regularization lets us pre- all model components ({Fk }, G, B) as rescaled FFT vec-
condition the optimization problem solved during training tors, each with their own values for λ0 and λ1 . The ef-
to find lower-cost minima in fewer iterations. Because this fect of this can be seen in Fig. 5, where we show the loss
frequency-domain optimization technique applies generally during our two training stages. We compare against naive
to any optimization problem concerning smooth and peri- time-domain optimization (Eq. 21) and non-preconditioned
odic images, we will describe it in general terms. frequency-domain optimization (Eq. 25 with w = 1). Our
Let us construct an optimization problem with respect to preconditioned reformulation exhibits a significant speedup
a single n × n image Z consisting of a data term f (Z) and and finds minima with lower losses.
Logistic Loss BVM Loss
it with an zero-mean isotropic Gaussian (encoding our prior
belief that the illuminant may change over time) and then
multiplying that “fuzzed” Gaussian by the observed Gaus-
sian:
−1 !−1
α 0
Σt+1 = Σt + + Σo (26)
0 α
−1 !−1
α 0
µt+1 = Σt+1 Σt + µt + Σo µo
0 α

Where α is a parameter that defines the expected vari-

Figure 5: Loss traces for our two stages of training, for ance of the illuminant over time. This update resembles
three fold cross validation (each line represents a fold) on a Kalman filter but with a simplified transition model, no
the Gehler-Shi dataset using LBFGS. Our preconditioned control model, and variable observation noise.
frequency domain optimization produces lower minima at This temporal smoothing is not used in our benchmarks,
greater rates than are achieved by non-preconditioned opti- but its effect can be seen in the supplemental video.
mization in the frequency domain or naive optimization in
the time domain. 8. Results
We evaluate our technique using two standard color con-
For all experiments (excluding our “deep” variants, see stancy datasets: the Gehler-Shi dataset [20, 31] and the
the appendix), training is as follows: All model parame- Cheng et al. dataset [12] (see Tables 1 and 2). For the
ters are initialized to 0, then we have a convex pre-training Gehler-Shi dataset we present several ablations and vari-
step which optimizes Eq. 25 where f (·) is a logistic loss ants of our model to show the effect of each design decision
(described in the appendix) using LBFGS for 16 iterations, and to investigate trade-offs between speed and accuracy.
and then we optimize Eq. 25 where f (·) is the non-convex Models labeled “full” were run on 384 × 256 16-bit im-
BVM loss in Eq. 18 using LBFGS for 64 iterations. ages, while models labeled “thumb” were run on 48 × 32
8-bit images, which are the kind of images that a practi-
7. Temporal Smoothing cal white-balance system embedded on a hardware device
might use. Models labeled “4 channel” use the four fea-
Color constancy is usually studied in the context of indi- ture channels used in [4], while models labeled “2 chan-
vidual images, which are assumed to be IID. But a practical nel” use the two channels we present in Section 5. We also
white balance algorithm must run on a video sequence, and present models in which we only use the “pixel channel” I
must enforce some temporal smoothing of the predicted il- or the “edge channel” E as input. All models have a his-
luminant to avoid presenting the viewer with an erratically- togram size of n = 64 except for Models K and L where
varying image in the viewfinder. This smoothing cannot n is varied to show the impact of illuminant aliasing. Two
be too aggressive or else the viewfinder may appear unre- models use “gray world” de-aliasing, and the rest use “gray
sponsive when the illumination changes rapidly (a colorful light” de-aliasing. The former seems slightly less effective
light turning on, the camera quickly moving outdoors, etc). than the latter unless chroma histograms are heavily aliased,
Additionally, when faced with multiple valid hypotheses (a which is why we use it in Model K. Model C only has one
blue wall under white light vs a white wall under blue light, training stage that minimizes logistic loss for 64 iterations,
etc) we may want to use earlier images to resolve ambi- thereby removing the BVM fitting from training. Model E
guities. These desiderata of stability, responsiveness, and fixes G(i, j) = 1 and B(i, j) = 0, thereby removing the
robustness are at odds with each other, and so some com- model’s ability to reason about the absolute color of the il-
promise must be struck. luminant. Model B was trained only to minimize the data
Our task of constructing a temporally coherent illumi- term (ie, λ0 = λ1 = 0 in Eq. 22) while Model D uses L2
nant estimate is aided by the probabilistic nature of the out- regularization but not total variation (ie, λ1 = 0 in Eq. 22).
put of our per-frame model, which produces a posterior dis- Models N, O and P are variants of Model J in which, instead
tribution over illuminants parametrized as a bivariate Gaus- of learning a fixed model ({Fk }, G, B) we express those
sian. Let us assume that we have some ongoing estimate model parameters as the output of a small 2-layer neural
of the illuminant and its covariance (µt , Σt ). Given the network. As inputs to this network we use image metadata,
observed mean and covariance (µo , Σo ) provided by our which allows the model to reason about exposure time and
model we update our ongoing estimate by first convolving camera sensor type, and/or a CNN-produced feature vector
Best Worst Test Train
[35], which allows the model to reason about semantics (see Algorithm Mean Med. Tri.
25% 25%
Avg.
Time Time
the appendix for details). For each experiment we tune all Support Vector Regression [18]
White-Patch [8]
8.08
7.55
6.73
5.68
7.19
6.35
3.35
1.45
14.89
16.12
7.21
5.76
-
0.16
-
-
λ hyperparameters to minimize the “average” error during Grey-world [9]
Edge-based Gamut [22]
6.36
6.52
6.28
5.04
6.28
5.43
2.33
1.90
10.58
13.58
5.73
5.40
0.15
3.6
-
1986
cross-validation, using cyclic coordinate descent. 1st-order Gray-Edge [33] 5.33 4.52 4.73 1.86 10.03 4.63 1.1 -
2nd-order Gray-Edge [33] 5.13 4.44 4.62 2.11 9.26 4.60 1.3 -
Shades-of-Gray [16] 4.93 4.01 4.23 1.14 10.20 3.96 0.47 -
Model P achieves the lowest-error results, with a 20% re- Bayesian [20] 4.82 3.46 3.88 1.26 10.49 3.86 97 764
Yang et al. 2015 [36] 4.60 3.10 - - - - 0.88 -
duction in error on Gehler-Shi compared to the previously General Gray-World [3] 4.66 3.48 3.81 1.00 10.09 3.62 0.91 -
Natural Image Statistics [21] 4.19 3.13 3.45 1.00 9.22 3.34 1.5 10749
best-performing published technique. This improvement in CART-based Combination [6] 3.90 2.91 3.21 1.02 8.27 3.14 - -
Spatio-spectral Statistics [11] 3.59 2.96 3.10 0.95 7.61 2.99 6.9 3159
accuracy also comes with a significant speedup compared to LSRS [19] 3.31 2.80 2.87 1.14 6.39 2.87 2.6 1345
previous techniques: ∼ 30 ms/image for most models, com- Interesection-based Gamut [22]
Pixels-based Gamut [22]
4.20
4.20
2.39
2.33
2.93
2.91
0.51
0.50
10.70
10.72
2.76
2.73
-
-
-
-
pared to the 520 ms of CCC [4] or the 3 seconds (on a GPU) Bottom-up+Top-down [34] 3.48 2.47 2.61 0.84 8.01 2.73 - -
Cheng et al. 2014 [12] 3.52 2.14 2.47 0.50 8.74 2.41 0.24 -
of Shi et al. [32]. Model Q (our fastest model) has an accu- Exemplar-based [26] 2.89 2.27 2.42 0.82 5.97 2.39 - -
Bianco et al. 2015 [7] 2.63 1.98 - - - - - -
racy comparable to [4] and [32] but takes only 1.1 millisec- Corrected-Moment [14] 2.86 2.04 2.22 0.70 6.34 2.25 0.77 584
Chakrabarti et al. 2015 [10] 2.56 1.67 1.89 0.52 6.07 1.91 0.30 -
onds to process an image, making it hundreds or millions Cheng et al. 2015 [13] 2.42 1.65 1.75 0.38 5.87 1.73 0.25 245
CCC [4] 1.95 1.22 1.38 0.35 4.76 1.40 0.52 2168
of times faster than the current state-of-the art. Addition- Shi et al. 2016 [32] 1.90 1.12 1.33 0.31 4.84 1.34 3.0 -
2.88 1.90 2.05 0.50 6.98 2.08 0.0076 117
ally, our model appears to be faster to train than the state- A) FFCC - full, pixel channel only, no illum.
B) FFCC - full 2 channels, no regularization 2.34 1.33 1.55 0.51 5.84 1.70 0.031 96
of-the-art, though training times for prior work are often not C) FFCC - full 2 channels, no BVM loss
D) FFCC - full 2 channels, no total variation
2.16
1.92
1.45
1.11
1.56
1.27
0.76
0.28
4.84
4.89
1.78
1.30
0.031
0.028
62
104
available. All runtimes in Table 1 for our model were com- E) FFCC - full, 2 channels, no illuminant 2.14 1.34 1.52 0.37 5.27 1.53 0.031 94
F) FFCC - full, pixel channel only 2.15 1.33 1.51 0.34 5.35 1.51 0.0063 67
puted on an Intel Xeon CPU E5-2680. Runtimes for the G) FFCC - full, edge channel only 2.02 1.25 1.39 0.34 5.11 1.44 0.026 94
H) FFCC - full, 2 channels, no precond. 2.91 1.99 2.23 0.57 6.74 2.18 0.025 152
“full” model were produced using a Matlab implementa- I) FFCC - full, 2 channels, gray world 1.79 1.01 1.22 0.29 4.54 1.24 0.029 98
J) FFCC - full, 2 channels 1.80 0.95 1.18 0.27 4.65 1.20 0.029 98
tion, while runtimes for the “thumb” model were produced K) FFCC - full, 4 channels, n = 32, gray world 2.69 1.31 1.49 0.37 7.48 1.70 0.068 138
L) FFCC - full, 4 channels, n = 256 1.78 1.05 1.19 0.27 4.46 1.22 0.068 395
using a Halide [29] CPU implementation (our Matlab im- M) FFCC - full, 4 channels 1.78 0.96 1.14 0.29 4.62 1.21 0.070 96
plementation of Model Q takes 2.37 ms/image). Runtimes N) FFCC - full, 2 channels, +semantics[35]
O) FFCC - full, 2 channels, +metadata
1.67
1.65
0.96
0.86
1.13
1.07
0.26
0.24
4.23
4.44
1.15
1.10
-
0.036
-
143
for our “+semantic” models are not presented as we were P) FFCC - full, 2 channels, +metadata +semantics[35]
Q) FFCC - thumb, 2 channels
1.61
2.01
0.86
1.13
1.02
1.38
0.23
0.30
4.27
5.14
1.07
1.37
-
0.0011
-
73
unable to profile [35] accurately (CNN feature computation
appears to dominate runtime). Table 1: Performance on the Gehler-Shi dataset [20, 31].
To demonstrate that our model is a viable automatic We present five error metrics and their average (the geomet-
white balance system for consumer photography, we ran our ric mean) with the lowest error per metric highlighted in
Halide code on a 2016 Google Pixel XL using the thumb- yellow. We present the time (in seconds) for training each
nail images computed by the device’s camera stack. This model and for evaluating a single image, when available.
implementation ran at 1.44ms per image, which is equiva-
lent to 30 frames per second using < 5% of the total com- Algorithm Mean Med. Tri.
Best Worst
Avg.
25% 25%
pute budget, thereby satisfying our previously-stated speed
White-Patch [8] 9.91 7.44 8.78 1.44 21.27 7.24
requirements. A video of our system running in real-time Pixels-based Gamut [22] 5.27 4.26 4.45 1.28 11.16 4.27
on a phone can be found in the supplement. Grey-world [9] 4.59 3.46 3.81 1.16 9.85 3.70
Edge-based Gamut [22] 4.40 3.30 3.45 0.99 9.83 3.45
Shades-of-Gray [16] 3.67 2.94 3.03 0.98 7.75 3.01
Natural Image Statistics [21] 3.45 2.88 2.95 0.83 7.18 2.81
Local Surface Reflectance Statistics [19] 3.45 2.51 2.70 0.98 7.32 2.79
9. Conclusion 2nd-order Gray-Edge [33] 3.36 2.70 2.80 0.89 7.14 2.76
1st-order Gray-Edge [33] 3.35 2.58 2.76 0.79 7.18 2.67
Bayesian [20] 3.50 2.36 2.57 0.78 8.02 2.66
We have presented FFCC, a color constancy algorithm General Gray-World [3] 3.20 2.56 2.68 0.85 6.68 2.63
that produces a 13 − 20% reduction in error and a 250 − Spatio-spectral Statistics [11] 3.06 2.58 2.74 0.87 6.17 2.59
Bright-and-dark Colors PCA [12] 2.93 2.33 2.42 0.78 6.13 2.40
3000× speedup relative to prior work. In doing so we have Corrected-Moment [14] 2.95 2.05 2.16 0.59 6.89 2.21
introduced the concept of convolutional color constancy on Color Dog [2] 2.83 1.77 2.03 0.48 7.04 2.03
a torus, and we have introduced techniques for illuminant Shi et al. 2016 [32] 2.24 1.46 1.68 0.48 6.08 1.74
CCC [4] 2.38 1.48 1.69 0.45 5.85 1.74
de-aliasing and differentiable bivariate von Mises fitting re- Cheng 2015 [13] 2.18 1.48 1.64 0.46 5.03 1.65
quired for this toroidal approach. We have also presented a M) FFCC - full, 4 channels 1.99 1.31 1.43 0.35 4.75 1.44
Q) FFCC - thumb, 2 channels 2.06 1.39 1.53 0.39 4.80 1.53
novel technique for fast Fourier-domain optimization sub-
ject to a certain family of regularizers. FFCC produces a
complete posterior distribution over illuminants, which lets Table 2: Performance on the dataset from Cheng et al.[12],
us assess the model’s confidence and also enables a Kalman in the same format as Table 1, excluding runtimes. As was
filter-like temporal smoothing model. FFCC’s speed, ac- done in [4] we present the average performance (the geo-
curacy, and temporal consistency allows it to be used for metric mean) over all 8 cameras in the dataset.
real-time white balance on a consumer camera.
References [19] S. Gao, W. Han, K. Yang, C. Li, and Y. Li. Efficient
color constancy with local surface reflectance statis-
[1] E. H. Adelson and A. P. Pentland. The perception of tics. ECCV, 2014.
shading and reflectance. Perception As Bayesian In-
ference, 1996. [20] P. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp.
Bayesian color constancy revisited. CVPR, 2008.
[2] N. Banic and S. Loncaric. Color dog - guiding the
[21] A. Gijsenij and T. Gevers. Color constancy using natu-
global illumination estimation to better accuracy. VIS-
ral image statistics and scene semantics. TPAMI, 2011.
APP, 2015.
[22] A. Gijsenij, T. Gevers, and J. van de Weijer. General-
[3] K. Barnard, L. Martin, A. Coath, and B. Funt. A com- ized gamut mapping using image derivative structures
parison of computational color constancy algorithms for color constancy. IJCV, 2010.
— part 2: Experiments with image data. TIP, 2002.
[23] A. Gijsenij, T. Gevers, and J. van de Weijer. Computa-
[4] J. T. Barron. Convolutional color constancy. ICCV, tional color constancy: Survey and experiments. TIP,
2015. 2011.
[5] H. G. Barrow and J. M. Tenenbaum. Recovering In- [24] T. Hamelryck, K. Mardia, and J. Ferkinghoff-
trinsic Scene Characteristics from Images. Academic Borg. Bayesian methods in structural bioinformatics.
Press, 1978. Springer, 2012.
[6] S. Bianco, G. Ciocca, C. Cusano, and R. Schettini. [25] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T.
Automatic color constancy algorithm selection and Barron, F. Kainz, J. Chen, and M. Levoy. Burst pho-
combination. Pattern Recognition, 2010. tography for high dynamic range and low-light imag-
[7] S. Bianco, C. Cusano, and R. Schettini. Color con- ing on mobile cameras. SIGGRAPH Asia, 2016.
stancy using cnns. CVPR Workshops, 2015. [26] H. R. V. Joze and M. S. Drew. Exemplar-based color
constancy and multiple illumination. TPAMI, 2014.
[8] D. H. Brainard and B. A. Wandell. Analysis of the
retinex theory of color vision. JOSA A, 1986. [27] E. H. Land and J. J. McCann. Lightness and retinex
theory. JOSA, 1971.
[9] G. Buchsbaum. A spatial processor model for object
[28] K. V. Mardia. Statistics of directional data. Journal of
colour perception. Journal of the Franklin Institute,
the Royal Statistical Society, Series B, 1975.
1980.
[29] J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy,
[10] A. Chakrabarti. Color constancy by learning to predict
S. Amarasinghe, and F. Durand. Decoupling algo-
chromaticity from luminance. NIPS, 2015.
rithms from schedules for easy optimization of image
[11] A. Chakrabarti, K. Hirakawa, and T. Zickler. Color processing pipelines. SIGGRAPH, 2012.
constancy with spatio-spectral statistics. TPAMI, [30] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total
2012. variation based noise removal algorithms. Physica D:
[12] D. Cheng, D. K. Prasad, and M. S. Brown. Illuminant Nonlinear Phenomena, 1992.
estimation for color constancy: why spatial-domain [31] L. Shi and B. Funt. Re-processed version of
methods work and the role of the color distribution. the gehler color constancy dataset of 568 images.
JOSA A, 2014. http://www.cs.sfu.ca/ colour/data/.
[13] D. Cheng, B. Price, S. Cohen, and M. S. Brown. Effec- [32] W. Shi, C. C. Loy, and X. Tang. Deep specialized
tive learning-based illuminant estimation using simple network for illuminant estimation. ECCV, 2016.
features. CVPR, 2015. [33] J. van de Weijer, T. Gevers, and A. Gijsenij. Edge-
[14] G. D. Finlayson. Corrected-moment illuminant esti- based color constancy. TIP, 2007.
mation. ICCV, 2013. [34] J. van de Weijer, C. Schmid, and J. Verbeek. Us-
[15] G. D. Finlayson and S. D. Hordley. Color constancy ing high-level visual information for color constancy.
at a pixel. JOSA A, 2001. ICCV, 2007.
[16] G. D. Finlayson and E. Trezzi. Shades of gray and [35] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang,
colour constancy. Color Imaging Conference, 2004. J. Philbin, B. Chen, and Y. Wu. Learning fine-grained
image similarity with deep ranking. CVPR, 2014.
[17] D. H. Foster. Color constancy. Vision research, 2011.
[36] K.-F. Yang, S.-B. Gao, and Y.-J. Li. Efficient illumi-
[18] B. V. Funt and W. Xiong. Estimating illumination nant estimation for color constancy using grey pixels.
chromaticity via support vector regression. Color CVPR, 2015.
Imaging Conference, 2004.

A. Pretraining ī (ī − 2 E [ī]) , (ī − E [ī]) (j̄ − E [j̄]) − E [ī] E [j̄]
∇P (i,j) Σ = h2
(ī − E [ī]) (j̄ − E [j̄]) − E [ī] E [j̄] , j̄ (j̄ − 2 E [j̄])
In the paper we described the data term for our loss func- By chaining these gradients together we can backpropagate
tion f (·) which takes a toroidal PDF P (i, j), fits a bivariate the gradient of the loss back onto the input PDF P . Back-
von Mises distribution to P , and then computes the negative propagating through the softmax operation and the convo-
log-likelihood of the true white point L∗ under that distri- lution (and illumination gain/bias) is straightforward and so
bution. This loss is non-convex, and therefore may behave is not detailed here.
erratically in the earliest training iterations. This issue is
compounded by our differentiable BVM fitting procedure, C. Deep Models
which may be inacurate when P has a low concentration,
which is often the case in early iterations. For this reason, In the main paper we stated that Models N, O, and P use
we train our model in two stages: In the “pretraining” stage an alternative parametrization to incorporate external fea-
we replace the data term in our loss function with a more tures during training and testing. This parameterization al-
simple loss: straightforward logistic regression with respect lows our model to reason about things other than simple
to P and some ground-truth PDF P ∗ (Eq. 28), and then in pixel and edge log-chroma histograms, like semantics and
the second training stage we use the data term described in camera metadata. In the basic model presented in the main
the paper while using the output of pretraining to initialize paper, we learn a set of weights ({Fk }, G, B), where these
the model. Because our regularization is also convex, us- weights determine the shape of the filters used during con-
ing this pretraining loss makes our entire optimization prob- volution and the per-color gain/bias applied to the output of
lem convex and therefore straightforward to optimize, and that convolution. Let us abstractly refer to the concatena-
(when coupled with our use of LBFGS for optimization in- tion of these (preconditioned, Fourier-domain) weights as
stead of some SGD-like approach) also makes training de- w, and let the loss contributed by the data term for train-
terministic. ing data instance i be fi (w) (here fi (w) does not just apply
Computing a logistic loss is straightforward: we com- a loss, but first undoes the preconditioning transformation
pute a ground-truth PDF P ∗ from the ground-truth illumi- and maps from our real FFT vector space to a complex 2D
nant L∗ , and then compute a standard logistic loss. FFT). Training our basic model can be thought of as simply
finding X
∗
∗ Lu − ulo arg min fi (w) (31)
P (i, j) = mod − i, n < 1 (27) w
h i
∗
Lv − vlo
To generalize our model, instead of learning a single model
∧ mod − j, n < 1 w, we instead define a feature vector for each training in-
h
X stance xi and learn a mapping from each xi to some wi such
fpretrain (P ) = − P ∗ (i, j) log(P (i, j)) (28) that the loss for all {wi } is minimized. Instead of learning a
i,j single w, we learn the weights in a small 2-layer neural net-
work with a ReLU activation function, where those network
This loss behaves very similarly to the loss used in CCC [4],
weights define the mapping from features to FFCC param-
but it has the added benefit of being convex.
eters. The resulting optimization problem during training
is:
B. Backpropagation X
arg min fi (W2 max(0, W1 xi + b1 ) + b2 )
The bivariate von Mises estimation procedure described W1 ,b1 ,W2 ,b2 i
in the paper can be thought of as a “layer” in a deep learning (32)
architecture, as our end-to-end training procedure requires Like in all other experiments we train using batch L-BFGS,
that we be able to backpropagate through the fitting proce- but instead of the two-stage training used in the shallow
dure and the loss computation. Here we present the gradi- model (a convex “pretraining” loss and a nonconvex final
ents of the critical equations described in the paper. loss), we have only one training stage: 64 iterations of
∗ LBFGS, in which we minimize a weighted sum of the two
Lu losses. Our input vectors {xi } are whitened before training,
∇µ f (µ, Σ) = −2Σ−1 − µ (29)
L∗v and the whitening transformation is absorbed into W1 and
T b1 after training so that unwhitened features can be used at
L∗u
∗
Lu
∇Σ f (µ, Σ) = Σ−1 − Σ−1 − µ − µ Σ−1 test-time. Our weights are initialized to random Gaussian
L∗v L∗v
noise, unlike the shallow model which is initialized to all
 
xi sin(θ(i))−yi cos(θ(i)) zeros. Unlike our “shallow” model, in which w is regu-
nh x2i +yi2
∇P (i,j) µ =  x sin(θ(j))−y
j j cos(θ(j))
 (30) larized during training, for our “deep” models we do not di-
2π x2j +yj2 rectly regularize each wi but instead indirectly regularize all
wi by minimizing the squared 2-norm of each Wi and bi . apply an sRGB gamma curve. These semantic features have
This use of weight decay to regularize our model depends a modest positive effect.
critically on the frequency-domain preconditioning we use,
which causes a simple weight decay to indirectly impose D. Real Bijective FFT
the careful smoothness regularizer that was constructed for
In the paper we describe Fv (Z), a FFT variant that takes
our shallow model. Note that our “deep” model is equiv-
the 2D FFT of a n × n real-valued 2D image Z and then lin-
alent to our “shallow” model if the input vector is empty
earizes it into a real-valued vector with no redundant values.
(ie, xi = [ ]), as b2 would behave equivalently to w in that
Having this FFT-like one-to-one mapping between real 2D
case. We use 4 hidden units for Models N and O, and 8
images and real 1D vectors enables our frequency-domain
hidden units for Model P (which uses the concatenated fea-
preconditioner.
tures from both Models N and O). The magnitude of the
Our modified FFT function is defined as:
noise used for initialization and of the weight decay for each
layer of the network are tuned using cross-validation.
 
Re(F (Z) (0 : n/2, 0))
To produce the “metadata” features used in Models O  Re(F (Z) (0 : n/2, n/2))



and P we use the EXIF tags included in the Gehler-Shi  Re(F (Z) (0 : (n − 1), 1 : (n/2 − 1))) 
dataset. Using external information in this way is unusual Fv (Z) =   
 Im(F (Z) (1 : ( /2 − 1), 0))
n 

in the color constancy literature, which is why this aspect of  Im(F (Z) (1 : (n/2 − 1), n/2 − 1)) 
our model is relegated to just two experiments (all figures Im(F (Z) (0 : (n − 1), 1 : (n/2 − 1)))
and other results do not use external metadata). In contrast, (34)
camera manufacturers spend significant effort considering
sensor spectral properties and other sources of information Where F (Z) (i, j) is the complex number at the zero-
that may be useful when building a white balance system. indexed (i, j) position in the FFT of Z, and Re(·) and Im(·)
For example, knowing that two images came from two dif- extract real and imaginary components, respectively. The
ferent sensors (as is the case in the Gehler-Shi dataset) al- output of Fv (Z) is an n2 -dimensional vector, as it must be
lows for a more careful treatment of absolute color and for our mapping to preserve all FFT coefficients with no re-
black body radiation. And knowing the absolute brightness dundancy. To preserve the scale √ of the FFT through this
of the scene (indicated by the camera’s exposure time, etc) mapping we scale Fv (Z) by 2, ignoring the entries that
can be a useful cue for distinguishing between the bright correspond to:
light of the sun and the relatively low light of man made
light sources. As the improved performance of Model O Re(F (Z) (0, 0))
demonstrates, this other information is indeed informative Re(F (Z) (0, n/2))
and can induce a significant reduction in error. We use a Re(F (Z) (n/2, 0))
compact feature vector that encodes the outer product of the
Re(F (Z) (n/2, n/2)) (35)
exposure settings of the camera and the name of the camera
sensor itself, all extracted from the EXIF tags included in This scaling ensure that the magnitude of Z is preserved:
the public dataset:
2 2
kFv (Z)k = |F (Z)| (36)
xi =vec( (33)
[log(shutter speedi ); log(f numberi ); 1] To compute the inverse of Fv (·) we undo this scaling, undo
the vectorization by filling in a subset of the elements of
×[1Canon1D (camerai ), 1Canon5D (camerai ), 1])
F (Z) from the vector representation, set the other elements
of F (Z) such that Hermitian symmetry holds, and the in-
Note that the Gehler-Shi dataset uses images from two dif-
vert the FFT.
ferent Canon cameras, as reflected here. The log of the shut-
ter speed and F number are chosen as features because, in
E. Results
theory, their difference should be proportional to the log of
the exposure value of the image, which should indicate the Because our model produces a complete posterior dis-
amount of light receiving by the camera sensor. tribution over illuminants in the form of a covariance ma-
The “semantics” features used in Models N and P are trix Σ, each of our illuminant estimates comes with a mea-
simply the output of the CNN model used in [35], which sure of confidence in the form of the entropy: 12 log |Σ|
was run on the pre-whitebalance image after it is center- (ignoring a constant shift). A low entropy suggests a tight
cropped to a square and resized to 256 × 256. Because this concentration of the output distribution, which tends to be
image is in the sensor colorspace, before passing it to the well-correlated with a low error. To demonstrate this we
CNN we scale the green channel by 0.6, apply a CCM, and present a novel error metric, which is twice the area under
the curve formed by ordering all images (the union of all
test-set images from each cross-validation fold) by ascend-
ing entropy and normalizing by the number of images. In
Figure 6 we visualize this error metric and show that our
entropy-ordered error is substantially lower than the mean
error for both of our datasets, which shows that a low en-
tropy is suggestive of a low error. We are not aware of any
other color constancy technique which explicitly predicts a
confidence measure, and so we do not compare against any
existing technique, but it can be demonstrated that if the en- (a) Gehler-Shi dataset [20, 31] (b) Cheng et al. dataset [12]
tropy used to sort error is decorrelated with error (or, equiv- EO Error: 1.287 1.696
alently, if the error cannot be sorted due to the lack of the Mean Error: 1.775 2.121
means to sort it) that entropy-ordered error will on average
be equal to mean error. Figure 6: By sorting each image by the entropy of its pos-
To allow for a better understanding of our model’s per- terior distribution we can show that entropy correlates with
formance, we present images from the Gehler-Shi dataset error. Here we sort the images by ascending entropy and
[20, 31] (Figures 7-16) and the Canon 1Ds MkIII camera plot the cumulative sum of the error, filling in the area un-
from the Cheng et al. dataset [12] (Figures 17-21). There der that curve with gray. If entropy was not correlated with
results were produced using Model J presented in the main error we would expect the area under the curve to match the
paper. For each dataset we perform three-fold cross valida- black line, and if entropy was perfectly correlated with er-
tion, and with that we produce output predictions for each ror then the area under the curve would exactly match the
image along with an error measure (angular RGB error) and dashed red line. We report twice the area under the curve
an entropy measure (the entropy of the covariance matrix of as “entropy-ordered” error (mean error happens to be twice
our predicted posterior distribution over illuminants). The the area under the diagonal line).
images chosen here were selected by sorting images from
each dataset by increasing error and evenly sampling im-
ages according to that ordering (10 from Gehler-Shi, 5 from
the smaller Cheng dataset). This means that the first image Nexus 6 in the HDR+ mode [25] after being white-balanced
in each sequence is the lowest error image, and the last is by Model Q in the main paper (the version designed to run
the highest. The rendered images include the color checker on thumbnail images).
used in creating the ground-truth illuminants used during
training, but it should be noted that these color checkers are
masked out when these images are used during training and F. Color Rendering
evaluation. For each image we present: a) the input image,
b) the predicted bivariate von Mises distribution over illumi- All images are rendered by applying the RGB gains im-
nants, c) our estimated illuminant and white-balanced im- plied by the estimated illuminant, applying some color cor-
age (produced by dividing the estimated illuminant into the rection matrix (CCM) and then applying an sRGB gamma-
input image), and d) the ground-truth illuminant and white- correction function (the Clinear to Csrgb mapping in http:
balanced image. Our log-chroma histograms are visualized //en.wikipedia.org/wiki/SRGB). For each cam-
using gray light de-aliasing to assign each (i, j) coordinate era in the datasets we use we estimate our own CCMs us-
a color, with a blue dot indicating the location/color of the ing the imagery, which we present here. These CCMs do
ground-truth illuminant, a red dot indicating our predicted not affect our illuminant estimation or our results, and are
illuminant µ and a red ellipse indicating the predicted co- only relevant to our visualizations. Each CCM is estimated
variance of the illuminant Σ. The bright lines in the his- through an iterative least-squares process in which we al-
togram indicate the locations where u = 0 or v = 0. ternatingly: 1) estimate the ground-truth RGB gains for
The reported entropy of the covariance Σ corresponds to each image from a camera by solving a least-squares sys-
the spread of the covariance (low entropy = small spread). tem using our current CCM, and 2) use our current gains
We see that our low error predictions tend to have lower to estimate a row-normalized CCM using a constrained
entropies, and vice versa, confirming our analysis in Fig- least-squares solve. Our estimated ground-truth gains are
ure 6. We also see that the ground-truth illuminant tends not used in this paper. For the ground-truth sRGB col-
to lie within the estimated covariance matrix, though not ors of the Macbeth color chart we use the hex values
always for the largest-error images. provided here: http://en.wikipedia.org/wiki/
In Figure 22 we visualize a set of images taken from a ColorChecker#Colors which we linearize.
2.2310 −1.5926 0.3616
" #
GehlerShi, Canon1D −0.1494 1.4544 −0.3050
0.1641 −0.6588 1.4947
1.7494 −0.8470 0.0976
" #
GehlerShi, Canon5D −0.1565 1.4380 −0.2815
0.0786 −0.5070 1.4284
1.7247 −0.7791 0.0544
" #
Cheng, Canon1DsMkIII −0.1436 1.4632 −0.3195
0.0589 −0.4625 1.4037
1.8988 −0.9897 0.0909
" #
Cheng, Canon600D −0.2058 1.6396 −0.4338
0.0749 −0.7030 1.6281
1.4183 −0.2497 −0.1686
" #
Cheng, FujifilmXM1 −0.2230 1.6449 −0.4219
0.0785 −0.5980 1.5195
1.3792 −0.3134 −0.0659
" #
Cheng, NikonD5200 −0.0826 1.3759 −0.2932
0.0483 −0.4553 1.4070
1.6565 −0.4971 −0.1595
" #
Cheng, OlympusEPL6 −0.3335 1.7772 −0.4437
0.0895 −0.7023 1.6128
1.5629 −0.5117 −0.0512
" #
Cheng, PanasonicGX1 −0.2472 1.7590 −0.5117
0.1395 −0.8945 1.7550
1.5770 −0.4351 −0.1419
" #
Cheng, SamsungNX2000 −0.1747 1.5225 −0.3477
0.0573 −0.6397 1.5825
1.5963 −0.5545 −0.0418
" #
Cheng, SonyA57 −0.1343 1.5331 −0.3988
0.0563 −0.4026 1.3463
(a) Input Image (b) Illuminant Posterior (c) Our prediction (d) Ground Truth

Figure 7: A result from the Gehler-Shi dataset using Model J. Error = 0.02°, entropy = −6.48