Fast Fourier Color Constancy
Figure 1: (a) Image A. (b) Aliased Image B. CCC [4] reduces color constancy to a 2D localization problem similar to object detection (1a). FFCC repeatedly wraps this 2D localization problem around a small torus (1b), which creates challenges but allows for faster illuminant estimation. See the text for details.

...ware. FFCC produces a complete posterior distribution over illuminants, which allows us to reason about uncertainty and enables simple and effective temporal smoothing.

We build on the "Convolutional Color Constancy" (CCC) approach of [4], which is currently one of the top-performing techniques on standard color constancy benchmarks [12, 20, 31]. CCC works by observing that applying a per-channel gain to a linear RGB image is equivalent to inducing a 2D translation of the log-chroma histogram of that image, which allows color constancy to be reduced to the task of localizing a signature in log-chroma histogram space. This reduction is at the core of the success of CCC and, by extension, of our FFCC technique; see [4] for a thorough explanation. The primary difference between FFCC and CCC is that instead of performing an expensive localization on a large log-chroma plane, we perform a cheap localization on a small log-chroma torus.

At a high level, CCC reduces color constancy to object detection (in the computability-theory sense of "reduce"). FFCC reduces color constancy to localization on a torus instead of a plane, and because this task has no intuitive analogue in computer vision we will attempt to provide one¹. Given a large image A on which we would like to perform object detection, imagine constructing a smaller n × n image B in which each pixel in B is the sum of all values in A separated by a multiple of n pixels in either dimension:

    B(i, j) = \sum_{k,l} A(i + nk, j + nl)    (1)

Detecting objects on the small image B instead of on A can be much faster, but it also raises new problems: 1) pixel values are corrupted with superimposed shapes that make detection difficult, 2) detections must "wrap" around the edges of this toroidal image, and 3) instead of an absolute, global location we can only recover an aliased, incomplete location. FFCC works by taking the large convolutional problem of CCC (i.e., face detection on A) and aliasing that problem down to a smaller size where it can be solved efficiently (i.e., face detection on B). We will show that we can learn an effective color constancy model in the face of the difficulty and ambiguity introduced by aliasing. This convolutional classifier will be implemented and learned using FFTs, because the naturally periodic nature of FFT convolutions resolves the problem of detections "wrapping" around the edge of toroidal images, and produces a significant speedup.

¹ ...detection, and we present it here solely to provide an intuition of our work on color constancy.
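To make the construction in Eq. (1) concrete, here is a minimal numpy sketch of the aliasing operation; the function name and the assumption that the image dimensions are multiples of n are ours, not part of the method above.

```python
import numpy as np

def alias_image(A, n):
    """Toy version of Eq. (1): B(i, j) = sum_{k, l} A(i + n*k, j + n*l).

    A is a single-channel image whose height and width are assumed to be
    multiples of n; the result B is a small n x n "toroidal" image.
    """
    H, W = A.shape
    assert H % n == 0 and W % n == 0, "pad or crop A to a multiple of n first"
    # Split A into (H/n) x (W/n) tiles of size n x n, then sum the tiles.
    return A.reshape(H // n, n, W // n, n).sum(axis=(0, 2))
```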
Our approach to color constancy introduces a number of issues. The aforementioned periodic ambiguity resulting from operating on a torus (which we dub "illuminant aliasing") requires new techniques for recovering a global illuminant estimate from an aliased estimate (Section 3). Localizing the centroid of the illuminant on a torus is difficult, requiring that we adopt and extend techniques from the directional statistics literature (Section 4). But our approach presents a number of benefits. FFCC improves accuracy relative to CCC by 17-24% while retaining its flexibility, and allows us to construct priors over illuminants (Section 5). By learning in the frequency domain we can construct a novel method for fast frequency-domain regularization and preconditioning, making FFCC training 20× faster than CCC (Section 6). Our model produces a complete unimodal posterior over illuminants as output, allowing us to construct a Kalman-filter-like approach for processing videos instead of independent images (Section 7).

2. Convolutional Color Constancy

Let us review the assumptions made in CCC and inherited by our model. Assume that we have a photometrically linear input image I from a camera, with a black level of zero and with no saturated pixels². Each pixel k's RGB value in image I is assumed to be the product of that pixel's "true" white-balanced RGB value W^{(k)} and some global RGB illumination L shared by all pixels:

    \forall k \quad \begin{bmatrix} I_r^{(k)} \\ I_g^{(k)} \\ I_b^{(k)} \end{bmatrix} = \begin{bmatrix} W_r^{(k)} \\ W_g^{(k)} \\ W_b^{(k)} \end{bmatrix} \circ \begin{bmatrix} L_r \\ L_g \\ L_b \end{bmatrix}    (2)

² In practice, saturated pixels are identified and removed from all downstream computation, similarly to how color checker pixels are ignored.
Figure 2: An overview of our pipeline demonstrating the problem of illuminant aliasing. (a) Input Image, (b) Histogram, (c) Aliased Histogram, (d) Aliased Prediction, (e) De-aliased Prediction, (f) Output Image. Similarly to CCC, we take an input image (2a) and transform it into a log-chroma histogram (2b, presented in the same format as in [4]). But unlike CCC, our histograms are small and toroidal, meaning that pixels can "wrap around" the edges (2c, with the torus "unwrapped" once in every direction). This means that the centroid of a filtered histogram, which would simply be the illuminant estimate in CCC, is instead an infinite family of possible illuminants (2d). This requires de-aliasing, some technique for disambiguating between illuminants to select the single most likely estimate (2e, shown as a point surrounded by an ellipse visualizing the output covariance of our model). Our model's output (u, v) coordinates in this de-aliased log-chroma space correspond to the color of the illuminant, which can then be divided into the input image to produce a white-balanced image (2f).
The goal of color constancy is then to estimate L from the input image I, after which the white-balanced image can be recovered as W^{(k)} = I^{(k)} / L. CCC defines two log-chroma measures:

    u^{(k)} = \log\!\left( I_g^{(k)} / I_r^{(k)} \right) \qquad v^{(k)} = \log\!\left( I_g^{(k)} / I_b^{(k)} \right)    (3)

The absolute scale of L is assumed to be unrecoverable, so estimating L simply requires estimating its log-chroma:

    L_u = \log(L_g / L_r) \qquad L_v = \log(L_g / L_b)    (4)

After recovering (L_u, L_v), assuming that L has a magnitude of 1 lets us recover the RGB values of the illuminant:

    L_r = \frac{\exp(-L_u)}{z} \qquad L_g = \frac{1}{z} \qquad L_b = \frac{\exp(-L_v)}{z} \qquad z = \sqrt{ \exp(-L_u)^2 + \exp(-L_v)^2 + 1 }    (5)
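The conversions in Eqs. (3)-(5) translate directly into code; a small sketch with our own naming:

```python
import numpy as np

def rgb_to_log_chroma(L_rgb):
    """Eq. (4): log-chroma of an RGB illuminant (Eq. (3) is the per-pixel analogue)."""
    Lr, Lg, Lb = L_rgb
    return np.log(Lg / Lr), np.log(Lg / Lb)

def log_chroma_to_rgb(Lu, Lv):
    """Eq. (5): recover the unit-norm RGB illuminant from its log-chroma."""
    z = np.sqrt(np.exp(-Lu) ** 2 + np.exp(-Lv) ** 2 + 1.0)
    return np.array([np.exp(-Lu), 1.0, np.exp(-Lv)]) / z
```

Composing the two functions recovers the illuminant only up to scale, which is exactly the unrecoverable degree of freedom mentioned above.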
Framing color constancy in terms of predicting log-chroma has several small advantages over the standard RGB approach (2 unknowns instead of 3, better numerical stability, etc.) but the primary advantage of this approach is that using log-chroma turns the multiplicative constraint relating W and I into an additive constraint [15], and this in turn enables a convolutional approach to color constancy. As shown in [4], color constancy can be framed as a 2D spatial localization task on a log-chroma histogram N, where some sliding-window classifier is used to filter that histogram and the centroid of that filtered histogram is used as the log-chroma of the illuminant.
illuminant, but instead is an infinite set of illuminants. We
will refer to this phenomenon as illuminant aliasing. Solv-
3. Illuminant Aliasing
ing this problem requires that we use some technique to de-
We assume the same convolutional premise of CCC, but alias an aliased illuminant estimate3 . A high-level outline of
with one primary difference to improve quality and speed:
3 It is tempting to refer to resolving the illuminant aliasing problem as
we use FFTs to perform the convolution that filters the log-
“anti-aliasing”, but anti-aliasing usually refers to preprocessing a signal to
chroma histogram, and we use a small histogram to make prevent aliasing during some resampling operation, which does not appear
that convolution as fast as possible. This change may seem possible in our framework. “De-aliasing” suggests that we allow aliasing
trivial, but the periodic nature of FFT convolution combined to happen to the input, but then remove the aliasing from the output.
our FFCC pipeline that illustrates illuminant (de-)aliasing model in an end-to-end fashion by propagating the gradients
can be seen in Fig. 2. of some loss computed on the de-aliased illuminant predic-
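A sketch of the histogram construction in Eq. (6), assuming u and v are flat arrays of per-pixel log-chroma values; the floor-and-mod binning below is equivalent to the condition in Eq. (6) (names ours):

```python
import numpy as np

def toroidal_histogram(u, v, u_lo, v_lo, n=64, h=1.0 / 32.0):
    """Eq. (6): count pixels into an n x n log-chroma histogram, wrapping
    out-of-range bins around the torus via modular arithmetic."""
    i = np.mod(np.floor((u - u_lo) / h), n).astype(int)
    j = np.mod(np.floor((v - v_lo) / h), n).astype(int)
    N = np.zeros((n, n))
    np.add.at(N, (i, j), 1.0)  # scatter-add; repeated (i, j) pairs accumulate
    return N
```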
De-aliasing requires that we use some external information (or some external color constancy algorithm) to disambiguate between illuminants. An intuitive approach is to select the illuminant that causes the average image color to be as neutral as possible, which we call "gray world de-aliasing". We compute average log-chroma values (\bar{u}, \bar{v}) for the entire image and use this to turn an aliased illuminant estimate (\hat{L}_u, \hat{L}_v) into a de-aliased illuminant (\hat{L}'_u, \hat{L}'_v):

    \bar{u} = \mathrm{mean}_k\, u^{(k)} \qquad \bar{v} = \mathrm{mean}_k\, v^{(k)}    (7)

    \begin{bmatrix} \hat{L}'_u \\ \hat{L}'_v \end{bmatrix} = \begin{bmatrix} \hat{L}_u \\ \hat{L}_v \end{bmatrix} - (nh) \left\lfloor \frac{1}{nh} \begin{bmatrix} \hat{L}_u - \bar{u} \\ \hat{L}_v - \bar{v} \end{bmatrix} + \frac{1}{2} \right\rfloor    (8)
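A sketch of gray world de-aliasing (Eqs. (7)-(8)); writing the rounding as floor(x + 1/2), i.e., snapping to the aliased copy nearest the image's average log-chroma, is our reading of Eq. (8):

```python
import numpy as np

def gray_world_dealias(L_hat_uv, u, v, n=64, h=1.0 / 32.0):
    """Shift the aliased estimate by the multiple of the histogram span (n*h)
    that brings it closest to the mean log-chroma of the image (Eqs. 7-8)."""
    mean_uv = np.array([u.mean(), v.mean()])              # Eq. (7)
    span = n * h
    shift = np.floor((L_hat_uv - mean_uv) / span + 0.5)   # nearest aliased copy
    return L_hat_uv - span * shift                        # Eq. (8)
```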
Another approach, which we call "gray light de-aliasing", is to assume that the illuminant is as close to the center of the histogram as possible. This de-aliasing approach simply requires carefully setting the starting point of the histogram (u_{lo}, v_{lo}) such that the true illuminants in natural scenes all lie within the span of the histogram, and setting \hat{L}' = \hat{L}. We do this by setting u_{lo} and v_{lo} to maximize the distance between the edges of the histogram and the bounding box surrounding the ground-truth illuminants in the training data⁴. Gray light de-aliasing is trivial to implement but, unlike gray world de-aliasing, it will systematically fail if the histogram is too small to fit all illuminants within its span.

⁴ Our histograms are shifted toward green colors rather than centered around a neutral color, as cameras are traditionally designed with a more sensitive green channel, which enables white balance to be performed by gaining red and blue up without causing color clipping. Ignoring this practical issue, our approach can be thought of as centering our histograms around a neutral white light.

To summarize the difference between CCC [4] and our approach with regard to illuminant aliasing, CCC (approximately) performs illuminant estimation as follows:

    \begin{bmatrix} \hat{L}_u \\ \hat{L}_v \end{bmatrix} = \begin{bmatrix} u_{lo} \\ v_{lo} \end{bmatrix} + h \left( \arg\max_{i,j}\, (N * F) \right)    (9)

Where N * F is performed using a pyramid convolution. FFCC corresponds to this procedure:

    P \leftarrow \mathrm{softmax}(N * F)    (10)
    (\mu, \Sigma) \leftarrow \mathrm{fit\_bvm}(P)    (11)
    \begin{bmatrix} \hat{L}_u \\ \hat{L}_v \end{bmatrix} \leftarrow \mathrm{de\_alias}(\mu)    (12)

Where N is a small and aliased toroidal histogram, convolution is performed with FFTs, and the centroid of the filtered histogram is estimated and de-aliased as necessary. By constructing this pipeline to be differentiable we can train our model in an end-to-end fashion by propagating the gradients of some loss computed on the de-aliased illuminant prediction \hat{L} back onto the learned filters F. The centroid fitting in Eq. 11 is performed by fitting a bivariate von Mises distribution to a PDF, which we will now explain.
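A minimal sketch of the inference path in Eqs. (10)-(12). The FFT convolution is periodic by construction; the bivariate von Mises fit of Eq. (11) is replaced here by a simple argmax, and gray-light de-aliasing is assumed, so this is an illustration rather than the full estimator described in Section 4 (names ours):

```python
import numpy as np

def ffcc_inference(N, F, u_lo, v_lo, n=64, h=1.0 / 32.0):
    """Eqs. (10)-(12): filter the toroidal histogram with FFTs, softmax the
    response into a PDF, and read off an (aliased) illuminant estimate."""
    # Eq. (10): circular convolution N * F via the FFT (periodic by construction).
    response = np.real(np.fft.ifft2(np.fft.fft2(N) * np.fft.fft2(F)))
    P = np.exp(response - response.max())
    P /= P.sum()                       # softmax over all n*n bins
    # Stand-in for Eq. (11): use the argmax instead of the differentiable
    # bivariate von Mises fit described in Section 4.
    i, j = np.unravel_index(np.argmax(P), P.shape)
    # Eq. (12) with gray-light de-aliasing: the histogram is positioned so that
    # (u_lo, v_lo) + h * (i, j) is taken directly as the illuminant estimate.
    return np.array([u_lo + h * i, v_lo + h * j]), P
```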
4. Differentiable Bivariate von Mises

Our architecture requires some mechanism for reducing a toroidal PDF P(i, j) to a single estimate of the illuminant. Localizing the center of mass of a histogram defined on a torus is difficult: fitting a bivariate Gaussian may fail when the input distribution "wraps around" the sides of the PDF, as shown in Fig. 3. Additionally, for the sake of temporal smoothing (Section 7) and confidence estimation, we want our model to predict a well-calibrated covariance matrix around the center of mass of P. This requires that our model be trained end-to-end, which therefore requires that our mean/covariance fitting be analytically differentiable and therefore usable as a "layer" in our learning architecture. To address these problems we present a variant of the bivariate von Mises distribution [28], which we will use to efficiently localize the mean and covariance of P in a manner that allows for easy backpropagation.

The bivariate von Mises distribution (BVM) is a common parameterization of a PDF on a torus. There exist several parametrizations which mostly differ in how "concentration" is represented ("concentration" having a similar meaning to covariance). All of these parametrizations present problems in our use case: none have closed-form expressions for maximum likelihood estimators [24], none lend themselves to convenient backpropagation, and all define concentration in terms of angles and therefore require "conversion" to covariance matrices during color de-aliasing. For these reasons we present an alternative parametrization in which we directly estimate a BVM as a mean µ and covariance Σ in a simple and differentiable closed-form expression. Though necessarily approximate, our estimator is accurate when the distribution is well-concentrated, which is generally the case for our task.

Our input is a PDF P(i, j) of size n × n, where i and j are integers in [0, n − 1]. For convenience we define a mapping from i or j to angles in [0, 2π) and the marginal distributions of P with respect to i and j:

    \theta(i) = \frac{2 \pi i}{n} \qquad P_i(i) = \sum_j P(i, j) \qquad P_j(j) = \sum_i P(i, j)

We also define the marginal expectation of the sine and cosine of the angle:

    y_i = \sum_i P_i(i) \sin(\theta(i)) \qquad x_i = \sum_i P_i(i) \cos(\theta(i))    (13)

with x_j and y_j defined similarly.
Figure 3: We fit a bivariate von Mises distribution (shown in solid blue) to toroidal PDFs P(i, j) to produce an aliased illuminant estimate. Contrast this with fitting a bivariate Gaussian (shown in dashed red), which treats the PDF as if it lies on a plane. Both approaches behave similarly if the distribution lies near the center of the unwrapped plane (left), but fitting a Gaussian fails as the distribution begins to "wrap around" the edge (middle, right).

Estimating the mean µ of a BVM from a histogram just requires computing the circular mean in i and j:

    \mu = \begin{bmatrix} u_{lo} \\ v_{lo} \end{bmatrix} + h \begin{bmatrix} \mathrm{mod}\!\left( \frac{n}{2\pi}\, \mathrm{atan2}(y_i, x_i),\; n \right) \\ \mathrm{mod}\!\left( \frac{n}{2\pi}\, \mathrm{atan2}(y_j, x_j),\; n \right) \end{bmatrix}    (14)

Eq. 14 includes gray light de-aliasing, though gray world de-aliasing can also be applied to µ after fitting.

We can fit the covariance of our model by simply "unwrapping" the coordinates of the histogram relative to the estimated mean and treating these unwrapped coordinates as though we are fitting a bivariate Gaussian. We define the "unwrapped" (i, j) coordinates such that the "wrap around" point on the torus lies as far away from the mean as possible, or equivalently, such that the unwrapped coordinates are as close to the mean as possible:

    \bar{i} = \mathrm{mod}\!\left( i - \frac{\mu_u - u_{lo}}{h} + \frac{n}{2},\; n \right) \qquad \bar{j} = \mathrm{mod}\!\left( j - \frac{\mu_v - v_{lo}}{h} + \frac{n}{2},\; n \right)    (15)

Our estimated covariance matrix is simply the sample covariance of P(\bar{i}, \bar{j}):

    \mathrm{E}[\bar{i}] = \sum_i P_i(i)\, \bar{i} \qquad \mathrm{E}[\bar{j}] = \sum_j P_j(j)\, \bar{j}    (16)

    \Sigma = h^2 \begin{bmatrix} \epsilon + \sum_i P_i(i)\, \bar{i}^2 - \mathrm{E}[\bar{i}]^2 & \sum_{i,j} P(i, j)\, \bar{i}\bar{j} - \mathrm{E}[\bar{i}]\, \mathrm{E}[\bar{j}] \\ \sum_{i,j} P(i, j)\, \bar{i}\bar{j} - \mathrm{E}[\bar{i}]\, \mathrm{E}[\bar{j}] & \epsilon + \sum_j P_j(j)\, \bar{j}^2 - \mathrm{E}[\bar{j}]^2 \end{bmatrix}    (17)

We regularize the sample covariance matrix slightly by adding a constant ε = 1 to the diagonal.
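Putting Eqs. (14)-(17) together, a self-contained sketch of the mean/covariance estimator, assuming gray-light de-aliasing and ε = 1 (names ours):

```python
import numpy as np

def fit_bvm(P, u_lo, v_lo, h=1.0 / 32.0, eps=1.0):
    """Estimate the mean (Eq. 14) and covariance (Eqs. 15-17) of a toroidal PDF P."""
    n = P.shape[0]
    idx = np.arange(n)
    theta = 2.0 * np.pi * idx / n
    P_i, P_j = P.sum(axis=1), P.sum(axis=0)
    # Eq. (14): circular means via atan2, mapped to (u, v) with gray-light de-aliasing.
    mu_i = np.mod(n / (2.0 * np.pi) * np.arctan2(P_i @ np.sin(theta), P_i @ np.cos(theta)), n)
    mu_j = np.mod(n / (2.0 * np.pi) * np.arctan2(P_j @ np.sin(theta), P_j @ np.cos(theta)), n)
    mu = np.array([u_lo + h * mu_i, v_lo + h * mu_j])
    # Eq. (15): "unwrap" coordinates so the torus seam is as far from the mean as possible.
    i_bar = np.mod(idx - mu_i + n / 2.0, n)
    j_bar = np.mod(idx - mu_j + n / 2.0, n)
    # Eqs. (16)-(17): sample covariance of the unwrapped coordinates, regularized diagonal.
    Ei, Ej = P_i @ i_bar, P_j @ j_bar
    var_i = eps + P_i @ i_bar ** 2 - Ei ** 2
    var_j = eps + P_j @ j_bar ** 2 - Ej ** 2
    cov_ij = i_bar @ P @ j_bar - Ei * Ej
    Sigma = h ** 2 * np.array([[var_i, cov_ij], [cov_ij, var_j]])
    return mu, Sigma
```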
With our estimated mean and covariance we can compute our loss: the negative log-likelihood of a Gaussian (ignoring scale factors and constants) relative to the true illuminant L*:

    f(\mu, \Sigma) = \log|\Sigma| + \left( \begin{bmatrix} L^*_u \\ L^*_v \end{bmatrix} - \mu \right)^{\mathrm{T}} \Sigma^{-1} \left( \begin{bmatrix} L^*_u \\ L^*_v \end{bmatrix} - \mu \right)    (18)

Using this loss causes our model to produce a well-calibrated complete posterior of the illuminant instead of just a single estimate. This posterior will be useful when processing video sequences (Section 7) and also allows us to attach confidence estimates to our predictions using the entropy of Σ (see the appendix).

Our entire system is trained end-to-end, which requires that every step in BVM fitting and loss computation be analytically differentiable. See the appendix for the analytical gradients for Eqs. 14, 17, and 18, which can be chained together to backpropagate the gradient of f(·) onto the input PDF P.

5. Model Extensions

The system we have described thus far (compute a periodic histogram of each pixel's log-chroma, apply a learned FFT convolution, apply a softmax, fit a de-aliased bivariate von Mises distribution) works reasonably well (Model A in Table 1) but does not produce state-of-the-art results. This is likely because this model reasons about pixels independently, ignores all spatial information in the image, and does not consider the absolute color of the illuminant. Here we present extensions to the model which address these issues and improve accuracy accordingly.

As explored in [4], a CCC-like model can be generalized to a set of "augmented" images provided that these images are non-negative and "scale with intensity" [14]. This lets us apply certain filtering operations to image I and, instead of constructing a single histogram from our image, construct a "stack" of histograms from the image and its filtered versions. Instead of learning and applying one filter, we learn a stack of filters and sum across channels after convolution. The general family of augmented images used in [4] is expensive to compute, so we instead use just the input image I and a local measure of absolute deviation in the input image:

    E(x, y, c) = \frac{1}{8} \sum_{i=-1}^{1} \sum_{j=-1}^{1} \left| I(x, y, c) - I(x + i, y + j, c) \right|    (19)

These two features appear to perform similarly to the four features used in [4], while being cheaper to compute.
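A sketch of the local absolute deviation feature in Eq. (19); the edge-replicated boundary handling is our assumption:

```python
import numpy as np

def local_absolute_deviation(I):
    """Eq. (19): mean absolute difference between each pixel and its 8
    neighbors, computed per channel of an H x W x C linear image."""
    If = I.astype(float)
    Ip = np.pad(If, ((1, 1), (1, 1), (0, 0)), mode="edge")  # boundary handling assumed
    H, W = If.shape[:2]
    E = np.zeros_like(If)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            E += np.abs(If - Ip[1 + di:1 + di + H, 1 + dj:1 + dj + W, :])
    return E / 8.0
```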
Just as a sliding-window object detector is often invariant to the absolute location of an object in an image, the convolutional nature of our baseline model makes it invariant to any global shift of the color of the input image. This means that our baseline model cannot rely on any statistical regularities of the illumination by, say, modeling black body radiation, the specific properties of commonly manufactured light bulbs, or any varying spectral sensitivity across cameras. Though CCC does not model illumination directly, it appears to indirectly reason about illumination by using the boundary conditions of its pyramid convolution to learn a regularization term g(Z):
Each of Figures 7-21 shows (a) Input Image, (b) Illuminant Posterior, (c) Our prediction, (d) Ground Truth.
Figure 7: A result from the Gehler-Shi dataset using Model J. Error = 0.02°, entropy = −6.48
Figure 8: A result from the Gehler-Shi dataset using Model J. Error = 0.26°, entropy = −6.55
Figure 9: A result from the Gehler-Shi dataset using Model J. Error = 0.46°, entropy = −6.91
Figure 10: A result from the Gehler-Shi dataset using Model J. Error = 0.63°, entropy = −6.37
Figure 11: A result from the Gehler-Shi dataset using Model J. Error = 0.83°, entropy = −6.62
Figure 12: A result from the Gehler-Shi dataset using Model J. Error = 1.19°, entropy = −6.71
Figure 13: A result from the Gehler-Shi dataset using Model J. Error = 1.61°, entropy = −6.88
Figure 14: A result from the Gehler-Shi dataset using Model J. Error = 2.35°, entropy = −6.32
Figure 15: A result from the Gehler-Shi dataset using Model J. Error = 3.84°, entropy = −5.28
Figure 16: A result from the Gehler-Shi dataset using Model J. Error = 21.64°, entropy = −4.95
Figure 17: A result from the Cheng dataset using Model J. Error = 0.12°, entropy = −6.82
Figure 18: A result from the Cheng dataset using Model J. Error = 0.64°, entropy = −6.69
Figure 19: A result from the Cheng dataset using Model J. Error = 1.37°, entropy = −6.48
Figure 20: A result from the Cheng dataset using Model J. Error = 2.69°, entropy = −5.82
Figure 21: A result from the Cheng dataset using Model J. Error = 17.85°, entropy = −3.04
Figure 22: A sampling of unedited HDR+ [25] images from a Nexus 6, after being processed with Model Q.