Identity-Preserving Aging of Face Images Via Latent Diffusion Models
* Both authors contributed equally.

Abstract

The performance of automated face recognition systems is inevitably impacted by the facial aging process. However, high-quality datasets of individuals collected over several years are typically small in scale. In this work, we propose, train, and validate the use of latent text-to-image diffusion models for synthetically aging and de-aging face images. Our models succeed with few-shot training and have the added benefit of being controllable via intuitive textual prompting. We observe high degrees of visual realism in the generated images while maintaining biometric fidelity measured by commonly used metrics. We evaluate our method on two benchmark datasets (CelebA and AgeDB) and observe a significant reduction (∼44%) in the False Non-Match Rate compared to existing state-of-the-art baselines.

1. Introduction

Motivation. It is well known that facial aging can significantly degrade the performance of modern automated face recognition systems [16, 19, 29]. Improving the robustness of such systems to aging variations is therefore critical for their lasting practical use. However, building systems that are robust to aging variations requires high-quality longitudinal datasets: images of a large number of individuals collected over several years. Collecting such data constitutes a major challenge in practice. Datasets such as MORPH [5] contain longitudinal samples of only 317 subjects from a total of ∼13K subjects over a period of five years [8]. Other datasets like AgeDB [28] and CACD [10] contain unconstrained images with significant variations in pose, illumination, background, and expression.

An alternative approach to gathering longitudinal data is to digitally simulate face age progression [21]. Approaches include manual age-editing tools, such as YouCam Makeup, FaceApp, and AgingBooth [1, 13]; more recently, GAN-based generative models, such as AttGAN, Cafe-GAN, and Talk-to-Edit [17, 20, 23, 24, 37], have also been used to simulate age progression in face images. However, we find that generative models struggle to correctly model biological aging, which is a complex process affected by genetic, demographic, and environmental factors. Moreover, training high-quality GANs for adjusting facial attributes itself requires a large amount of training data.

Our Approach. Existing generative models often struggle to manipulate the age attribute while preserving facial identity. They also require auxiliary age classifiers and/or extensive training data with longitudinal age variations. To address both of these issues, we propose a new latent generative model for simulating high-quality facial aging while simultaneously preserving biometric identity. The high-level algorithmic idea is to fine-tune latent text-to-image diffusion models (such as Stable Diffusion [32]) with a novel combination of contrastive and biometric losses that help preserve facial identity. See Fig. 1 for an overview of our method.

The proposed method requires: (i) a pre-trained latent diffusion model (see Sec. 2), (ii) a small set (numbering ≈ 20) of training face images of an individual, and (iii) a small auxiliary set (numbering ≈ 600) of image-caption pairs. The pairs contain facial images of individuals and captions indicating their corresponding age. This auxiliary set of image-caption pairs serves as the regularization set. The individuals in the training set and the regularization set are disjoint. We use the training images during fine-tuning to learn the identity-specific information of the individual, and the regularization images with captions to learn the association between an image (face) and its caption (age). Finally, we simulate age regression and progression of the trained individual using a text prompt specifying the target age. See the details of our method in Sec. 3.

Summary. Our main contributions are as follows.

• We adapt latent diffusion models to perform age regression and progression in face images. We introduce two key ideas: an identity-preserving loss (in addition to the perceptual loss), and a small regularization set of image-caption pairs to resolve the limitations posed by existing GAN-based methods.
Figure 1. Overview of the proposed method. The proposed method needs a fixed Regularization Set comprising facial images with age
variations and a variable Training Set comprising facial images of a target individual. The latent diffusion module (comprising a VAE,
U-Net and CLIP-text encoder) learns the concept of age progression from the regularization images and the identity-specific information
from the training images. We integrate biometric and contrastive losses in the network for identity preservation. At inference, the user
prompts the trained model using a rare token associated with the trained target subject and the desired age to perform age editing.
• As a secondary finding, we show that face recognition classifiers may benefit from fine-tuning on generated images with significant age variations, as indicated in [31].

• We conduct experiments on the CelebA and AgeDB datasets and perform evaluations to demonstrate that the synthesized images i) appear visually compelling in terms of aging and de-aging, as judged by qualitative analysis and an automated age predictor, and ii) match the original subject according to both human evaluators and an automated face matcher. We demonstrate that our method outperforms SOTA image editing methods, namely IPCGAN [34], AttGAN [17], and Talk-to-Edit [20].

The rest of the paper is organized as follows. Sec. 2 outlines existing work. Sec. 3 describes the proposed method for simulating facial aging and de-aging. Sec. 4 describes the experimental settings. Sec. 5 presents our findings and analysis. Sec. 6 concludes the paper.

2. Related Work

Previous automated age progression models have used a variety of architectures, including recurrent ones [36] and GANs. [37] uses a hierarchy of discriminators to preserve the reconstruction details, age, and identity. STGAN [24] utilizes selective transfer units that accept the difference between the target and source attribute vectors as input, resulting in more controlled manipulation of the attribute. Cafe-GAN [23] utilizes complementary attention features to focus on the regions pertinent to the target attribute while preserving the remaining details. HRFAE [38] encodes an input image into a set of age-invariant features and an age-specific modulation vector. The age-specific modulation vector re-weights the encoded features depending on the target age and then passes them to a decoder unit that edits the image. CUSP [15] uses a custom structure-preserving module that masks the irrelevant regions for better facial structure preservation in the generated images. The method performs style and content disentanglement while conditioning the generated image on the target age. ChildGAN [9] is inspired by the self-attention GAN and uses one-hot encodings of age and gender labels appended to the noise vector to perform age translation in images of young children.

We focus on three methods in our comparisons. IPCGAN [34] uses a conditional GAN with an identity-preserving module and an age classifier to perform image-to-image style transfer for age editing. AttGAN [17] performs binary facial attribute manipulation by modeling the relationship between the attributes and the latent representation of the face. The network enables high-quality facial attribute editing while controlling the attribute intensity and style. Talk-to-Edit [20] provides fine-grained facial attribute editing via dialog interaction, similar to our approach.
The method uses a language encoder to convert the user’s request into an ‘editing encoding’ that encapsulates information about the degree and direction of change of the target attribute, and seeks user feedback to iteratively edit the desired attribute.

We also highlight two recent methods that use diffusion models for face generation. In DCFace [21], the authors propose a dual-condition synthetic face generator to allow control over simulating intra-class (within the same individual) and inter-class (across different individuals) variations. In [30], the authors explore suitable prompts for generating realistic faces using stable diffusion and investigate their quality. Neither method focuses on identity-preserving, text-guided facial aging and de-aging, which is our goal.

3. Our Proposed Method

Although a suite of age-editing methods exists in the literature as discussed above, the majority of them focus on perceptual quality instead of biometric quality. A subset of latent space manipulation methods struggle with ‘real’ face images and generate unrealistic outputs. Existing works reiterate that age progression is a smooth but non-deterministic process that requires incremental evolution to effectively transition between ages. This motivates the use of diffusion models, which naturally model the underlying data distribution by incrementally adding and removing noise. We start with a brief mathematical overview.

3.1. Preliminaries

Denoising diffusion probabilistic models (DDPMs) [18] perform the following steps: 1) a forward diffusion process (x_0 → x_t) that incrementally adds Gaussian noise, η, sampled from a normal distribution, N(0, I), to the clean data x_0 sampled from a real distribution p(x) over t time steps; and 2) a backward denoising process (x_0 ← x_t) that attempts to recover the clean data from the corrupted or noisy data x_t by approximating the conditional probability distribution p(x_{t−1} | x_t) using a neural network that serves as a noise estimator (the forward arrow denotes noise addition, and the backward arrow denotes noise removal). The forward and backward processes can be considered analogous to VAEs [22].

Note that DDPMs are computationally expensive as the estimated noise has the same dimension as the input. Alternatively, stable diffusion [32] is a class of latent diffusion models that performs diffusion on a relatively lower-dimensional latent representation. Latent diffusion generates high-quality images conditioned on text prompts. It comprises three modules: an autoencoder (VAE), a U-Net, and a text encoder. The encoder in the VAE converts the image into a low-dimensional latent representation that is fed as the input to the U-Net model. The U-Net model estimates the noise needed to recover the high-resolution output from the decoder of the VAE. [32] further added cross-attention layers in the U-Net backbone to use text embeddings as a conditional input, thereby enhancing the model’s generative capability.

In this work, we focus on DreamBooth [33], a latent diffusion model that fine-tunes a text-to-image diffusion framework for re-contextualization of a single subject. To accomplish this, it requires (i) a few images of the subject, and (ii) text prompts containing a unique identifier and the class label of the subject. The class label denotes a collective representation of multiple instances, while the subject corresponds to a specific example belonging to the class. The objective is to associate a unique token or rare identifier with each subject (a specific instance of a class) and then recreate images of the same subject in different contexts as guided by the text prompts. The class label harnesses the prior knowledge of the trained diffusion framework for that specific class. Incorrect or missing class labels may result in inferior outputs [33]. The unique token acts as a reference to the particular subject and needs to be rare enough to avoid conflict with other concepts. The authors use a set of rare tokens corresponding to sequences of 3 or fewer Unicode characters and the T5-XXL tokenizer. See [33] for more details. DreamBooth uses a class-specific prior-preservation loss to increase the variability of generated images while ensuring minimal deviation between the target subject and the output images. The original training loss can be written as follows.

\label{Eqn:SD}
\begin{split}
\mathbb{E}_{\bm{x}, \bm{c}, t}[w_t &\|f_{\theta}(g_t(\bm{x}),c) - \bm{x}\|_2^2 + \\
& \lambda w_{t'} \| f_{\theta}(g_{t'}(\bm{x'}),c_{class}) - \bm{x'}\|_2^2].
\end{split}
(1)

The first term in Eqn. 1 denotes the squared error between the ground-truth images x (training set) and the generated images f_θ(g_t(x), c). Here, f_θ(·, ·) denotes the pre-trained diffusion model (parameterized by θ) that generates images given a noise map and a conditioning vector. The noise map is obtained as g_t(x) = α_t x + σ_t η, where η ∼ N(0, I), and α_t, σ_t, w_t are diffusion control parameters at time step t ∼ U[0, 1]. The conditioning vector c is generated using a text encoder from a user-defined prompt. The second term refers to the prior-preservation component using generated images that represent the prior knowledge of the trained model for the specific class. The term is weighted by a scalar value, λ = 1. The conditioning vector in the second term, c_class, corresponds to the class label.
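For concreteness, the following PyTorch-style sketch illustrates how the two terms of Eqn. 1 could be computed. The callables f_theta, alphas, sigmas, and w are placeholders for the denoising network and its schedules (not part of any specific library); batching and optimizer details are omitted.

```python
import torch

def prior_preservation_loss(f_theta, x, c, x_prior, c_prior, alphas, sigmas, w, lam=1.0):
    # Sketch of Eqn. 1: subject reconstruction term + class prior-preservation term.
    # f_theta(noisy, cond) predicts the clean image; alphas/sigmas/w map a timestep
    # t in [0, 1] to the corresponding schedule values (all hypothetical interfaces).
    def term(imgs, cond):
        b = imgs.shape[0]
        t = torch.rand(b, device=imgs.device)                # t ~ U[0, 1]
        a = alphas(t).view(b, 1, 1, 1)
        s = sigmas(t).view(b, 1, 1, 1)
        noisy = a * imgs + s * torch.randn_like(imgs)         # g_t(x) = alpha_t x + sigma_t eta
        err = (f_theta(noisy, cond) - imgs).pow(2).flatten(1).sum(dim=1)
        return (w(t) * err).mean()

    # lam corresponds to lambda = 1 in the paper; x_prior / c_prior come from the
    # class (regularization) images and the class-label prompt.
    return term(x, c) + lam * term(x_prior, c_prior)
```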
3.2. Methodology

DreamBooth works effectively with the aid of prior preservation for synthesizing images of dogs, cats, cartoons, etc. In this work, however, we focus on human face images that contain intricate structural and textural details. Although the class label ‘person’ can capture human-like features, it may not be adequate to capture identity-specific features that vary across individuals. Therefore, we include an identity-preserving term in the loss function. The identity-preserving component minimizes the distance between the biometric features from the original and generated images as follows.
\label{Eqn:SD_biom}
\begin{split}
\mathbb{E} & _{\bm{x}, \bm{c}, t}[w_t \|f_{\theta}(g_t(\bm{x}),c) - \bm{x}\|_2^2 + \\
& \lambda w_{t'} \| f_{\theta}(g_{t'}(\bm{x'}),c_{class}) - \bm{x'}\|_2^2 + \\
& \lambda_{b}\mathcal{B}( f_{\theta}(g_{t}(\bm{x}),c_{class}), \bm{x})].
\end{split}
(2)

We use this new loss to fine-tune the VAE. The third term in Eqn. 2 refers to the biometric loss computed between the ground-truth image of the subject, x, and the generated image, weighted by λ_b = 0.1. Note that f_θ(g_t(x), c_class) uses the training set (i.e., images of an individual subject), whereas f_θ(g_t'(x'), c_class) uses the regularization set that contains representative images of a class. Here, B(·, ·) computes the L1 distance between the biometric features extracted from a pair of images (close to zero for the same subject, with higher values corresponding to different subjects). We use a pre-trained VGGFace [4] feature extractor, such that B(i, j) = ∥VGGFace(i) − VGGFace(j)∥_1.
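A minimal sketch of this biometric term is given below, assuming `vggface` is any frozen face-embedding network (a stand-in for the VGGFace extractor; preprocessing and embedding dimensions are omitted).

```python
import torch

def biometric_loss(vggface, generated, original):
    # B(., .): L1 distance between face embeddings of the generated and
    # ground-truth images, averaged over the batch.
    with torch.no_grad():
        ref = vggface(original)        # embeddings of the real subject images (no gradient)
    gen = vggface(generated)           # gradients flow back through the generated images
    return (gen - ref).abs().sum(dim=-1).mean()
```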
Now, we turn to target-specific fine-tuning. The implementation used in our work [3, 14] uses a frozen VAE and text encoder while keeping the U-Net model unfrozen. The U-Net denoises the latent representation produced by the encoder of the VAE, g_t(x) = z_t = α_t x + σ_t η. Therefore, we apply an identity-preserving contrastive loss to the latent representation. We adopt the SimCLR [11] framework, which uses a normalized temperature-scaled cross-entropy loss between positive and negative pairs of augmented latent representations, denoted by S(·, ·) in Eqn. 3. We compute the contrastive loss between the latent representations of the noise-free inputs (z_0) and the de-noised outputs (z_t) with a weight term λ_s = 0.1 and a temperature value of 0.5. Refer to [11] for more details. The contrastive loss between the latent representations in the U-Net architecture enables us to fine-tune the diffusion model for each subject as follows.

\label{Eqn:SD_contrast}
\begin{split}
\mathbb{E} & _{\bm{x}, \bm{c}, t}[w_t \|f_{\theta}(g_t(\bm{x}),c) - \bm{x}\|_2^2 + \\
& \lambda w_{t'} \| f_{\theta}(g_{t'}(\bm{x'}),c_{class}) - \bm{x'}\|_2^2 + \lambda_{s} \mathcal{S} (\bm{z}_{t}, \bm{z}_{0})].
\end{split}
(3)
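The S(·, ·) term can be sketched as a standard NT-Xent loss over the batch, with the (z_0, z_t) pair from the same input acting as positives and all other latents as negatives; the code below is an illustrative re-implementation rather than the exact code of [11] or [3, 14].

```python
import torch
import torch.nn.functional as F

def nt_xent(z_denoised, z_clean, tau=0.5):
    # SimCLR-style contrastive loss between de-noised latents (z_t) and
    # noise-free latents (z_0); temperature tau = 0.5 as in the paper.
    a = F.normalize(z_denoised.flatten(1), dim=1)
    b = F.normalize(z_clean.flatten(1), dim=1)
    reps = torch.cat([a, b], dim=0)                           # 2N x D
    sim = (reps @ reps.t()) / tau                             # scaled cosine similarities
    n = a.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))                # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```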
In addition to customizing the losses, we use the regularization set to impart the concept of facial age progression and regression to the latent diffusion model. The regularization set contains representative images of a class, in our case, ‘person’. A regularization set comprising face images selected from the internet would have sufficed if our goal were simply to generate realistic faces, as done in [30]. However, our task involves learning the concept of aging and de-aging, and then applying it to any individual. To accomplish this task, we use face images from different age groups and pair them with one-word captions that indicate the age group of the person depicted in each image. The captions correspond to one of six age groups: ‘child’, ‘teenager’, ‘youngadults’, ‘middleaged’, ‘elderly’, and ‘old’. We could have used numbers as age groups, for example, twenties, forties, or sixties, but we found that a language description is more suitable than a numeric identifier. Another reason for pairing these age descriptions with the images is that we can use the same age identifiers while prompting the diffusion model during inference (photo of a ⟨token⟩ ⟨class label⟩ as ⟨age group⟩). We use the following six prompts during inference: 1) photo of a sks person as child, 2) photo of a sks person as teenager, 3) photo of a sks person as youngadults, 4) photo of a sks person as middleaged, 5) photo of a sks person as elderly, and 6) photo of a sks person as old. We have explored other tokens as well (see Sec. 5.4).
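Assembling these prompts and querying the fine-tuned model is straightforward; the sketch below uses the Hugging Face diffusers API, and the checkpoint path is a hypothetical local directory holding a model fine-tuned as described above.

```python
from diffusers import StableDiffusionPipeline

AGE_GROUPS = ["child", "teenager", "youngadults", "middleaged", "elderly", "old"]
TOKEN, CLASS_LABEL = "sks", "person"

# Hypothetical path to a subject-specific fine-tuned checkpoint.
pipe = StableDiffusionPipeline.from_pretrained("./dreambooth-sks-person").to("cuda")

for age in AGE_GROUPS:
    prompt = f"photo of a {TOKEN} {CLASS_LABEL} as {age}"
    images = pipe(prompt, num_images_per_prompt=8).images   # several samples per prompt
```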
4. Experiments

Setup and implementation details. We conduct experiments using DreamBooth implemented with Stable Diffusion v1.4 [3]. The model uses CLIP’s [2] text encoder trained on laion-aesthetics v2 5+ and a vector-quantized VAE [35] to accomplish the task of age progression. The text encoder stays frozen while training the diffusion model. We use two datasets, namely, CelebA [27] and AgeDB [28]. We use 2,258 face images belonging to 100 subjects from the CelebA [27] dataset, and 659 images belonging to 100 subjects from the AgeDB dataset to form the ‘training set’. CelebA does not contain age information, except for a binary ‘Young’ attribute annotation, so we do not have ground truth for evaluating the generated images synthesized from the CelebA dataset. The AgeDB dataset, on the other hand, comprises images with exact age values. We then select the age group that has the highest number of images and use them as the training set, while the remaining images contribute to the testing set. Therefore, 2,369 images serve as ground truth for evaluation on the AgeDB dataset.

We use a regularization set comprising image-caption pairs, where each face image is associated with a caption indicating its corresponding age label. We use 612 images belonging to 375 subjects from the CelebA-Dialog [20] dataset, where the authors provide fine-grained annotations of age distributions. We convert the distributions into categorical labels to use as captions for the regularization images. We refer to them as {Child: <15 years, Teenager: 15-30 years, Youngadults: 30-40 years, Middleaged: 40-50 years, Elderly: 50-65 years, and Old: >65 years}. We use 612 (102 × 6) images in the subject-disjoint regularization set.
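A small helper of the kind sketched below can perform this conversion; the bin boundaries follow the ranges above, while the boundary handling and the caption template are our own illustrative choices.

```python
def age_to_group(age: float) -> str:
    # Map a numeric age to the one-word caption used for regularization images.
    if age < 15:
        return "child"
    if age < 30:
        return "teenager"
    if age < 40:
        return "youngadults"
    if age < 50:
        return "middleaged"
    if age <= 65:
        return "elderly"
    return "old"

caption = f"photo of a person as {age_to_group(34)}"   # -> "photo of a person as youngadults"
```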
The success of generating high-quality images often depends on effectively prompting the diffusion model during inference. The text prompt at inference time needs a rare token/identifier that is associated with the concept learnt during fine-tuning. We use four different rare tokens, {wzx, sks, ams, ukj} [6], in this work for brevity.

We use the implementation of DreamBooth with stable diffusion in [3] and the following hyperparameters.
Figure 2. Illustration of age-edited images generated from the CelebA dataset (columns: original, child, teenager, youngadults, middleaged, elderly, old).
We adopt a learning rate of 1e-6, number of training steps = 800, embedding dimensionality in the autoencoder = 4, and batch size = 8. The generated images are of size 512 × 512. We use λ = 1, λ_b = 0.1, and λ_s = 0.1 (refer to Eqns. 2 and 3). We generate 8 samples at inference. However, we perform a facial quality assessment using EQFace [26] to limit the number of generated face images to 4, such that each generated image contains a single face with a frontal pose. We adopt a threshold of 0.4 and retain a generated image if its quality exceeds the threshold; otherwise, we discard it. Training each subject requires ∼5-8 mins. on an A100 GPU.
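The post-generation filtering step can be sketched as follows; `quality_fn` stands in for the EQFace quality predictor (plus the single-face, frontal-pose checks), which we do not reproduce here.

```python
def filter_generated(images, quality_fn, threshold=0.4, keep=4):
    # Score each generated image, keep only those above the quality threshold,
    # and retain at most `keep` of the highest-scoring ones.
    scored = sorted(((quality_fn(img), img) for img in images),
                    key=lambda pair: pair[0], reverse=True)
    return [img for score, img in scored if score > threshold][:keep]
```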
We perform a qualitative evaluation of the generated images by conducting a user study involving 26 volunteers. The volunteers are shown a set of 10 face images (originals) and then 10 generated sets; each set contains five images belonging to five age groups (excluding old), resulting in a total of 60 images. They are assigned two tasks: 1) identify the individual from the original set who appears most similar to the subject in the generated set; 2) assign each of the five generated images to the five age groups they are most likely to belong to. We compute the proportion of correct face recognition and age group assessments.

Further, we perform a quantitative evaluation of the generated outputs using the ArcFace [12] matcher (different from the VGGFace used in the identity-preserving biometric loss). We utilize the genuine (intra-class) and imposter (inter-class) scores to compute Detection Error Trade-off (DET) curves and report the False Non-Match Rate (FNMR) at a False Match Rate (FMR) of 0.01% and 0.1%.
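For reference, FNMR at a fixed FMR can be computed from the two score distributions as in the sketch below (a generic NumPy illustration assuming higher scores indicate a better match; this is not the exact evaluation script).

```python
import numpy as np

def fnmr_at_fmr(genuine_scores, imposter_scores, target_fmr=1e-4):
    # Pick the decision threshold so that the fraction of imposter scores accepted
    # (False Match Rate) is approximately target_fmr (1e-4 = 0.01%, 1e-3 = 0.1%),
    # then report the fraction of genuine scores rejected at that threshold.
    thr = np.quantile(np.asarray(imposter_scores), 1.0 - target_fmr)
    return float(np.mean(np.asarray(genuine_scores) < thr))
```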
5. Results

We report the biometric matching performance using the ArcFace matcher between original and modified images in Table 1 for the CelebA dataset. See examples of generated images in Fig. 2. In CelebA, we do not have access to ground truths, so we perform biometric matching with disjoint samples of the subject not used in the training set. We refer to this as the ‘simulation’ result. We achieve the best biometric matching using the initial loss settings of latent diffusion (Eqn. 1). The biometric matching impacts the sim-

Table 1. CelebA simulation results for biometric matching between Original-Modified images. The metrics are False Non-Match Rate (FNMR) at False Match Rate (FMR) = 0.01/0.1%.

Age group      Initial loss (sks)   Initial loss (wzx)   Contrastive loss (sks)   Contrastive loss (wzx)
child          0.49/0.21            0.58/0.27            0.56/0.26                0.60/0.29
teenager       0.23/0.07            0.32/0.12            0.29/0.10                0.34/0.12
youngadults    0.25/0.08            0.30/0.10            0.28/0.08                0.31/0.10
middleaged     0.20/0.07            0.28/0.09            0.27/0.09                0.30/0.10
elderly        0.22/0.07            0.29/0.10            0.25/0.09                0.29/0.11
old            0.24/0.10            0.31/0.12            0.29/0.11                0.32/0.12

Figure 3. (Top:) DET curves of face matching using generated images from the CelebA dataset. (Bottom:) Recognition performance in the table indicating FNMR @ FMR = 0.01/0.1%. The age-edited images are generated using the wzx token with contrastive loss.

Matching scenario         FNMR@FMR=0.01/0.1%
Ori-Ori                   0.14/0.07
Mod-Mod                   0.02/0.01
Ori-Mod (w/o fine-tune)   0.41/0.16
Ori-Mod (w/ fine-tune)    0.03/0.01
Figure 4. Illustration of age-edited images generated from the AgeDB dataset (columns: original, child, teenager, youngadults, middleaged, elderly, old).
Figure 6. Comparison of auxiliary loss functions (VGGFace-based biometric loss vs. contrastive loss) in terms of cosine distance scores computed for genuine pairs using the ArcFace matcher. Contrastive loss produces desirably lower distances between genuine pairs.

We explored different values of λ_b and λ_s, {0.01, 0.1, 1, 10}, and observe that 0.1 produces the best results for both variables.

Figure 8. (Top): Comparison of ‘young’ outputs (columns 2-4) and ‘old’ outputs (columns 5-7) generated by the proposed method with baselines: AttGAN and Talk-to-Edit. The original images are in the first column. (Bottom): False Non-Match Rate (FNMR) at False Match Rate (FMR) = 0.01/0.1%.

Age group      AttGAN       Talk-to-Edit   Proposed
child          -            0.99/0.40      0.56/0.26
teenager       -            1.0/0.50       0.29/0.10
youngadults    0.47/0.20    0.70/0.21      0.28/0.08
middleaged     -            0.51/0.13      0.27/0.09
elderly        -            0.83/0.39      0.25/0.09
old            0.31/0.11    0.56/0.22      0.29/0.11
Average        0.39/0.15    0.76/0.31      0.32/0.12
Figure 10. Impact of token (wzx) and class label (person) on generated images: “photo of a person” (left) vs. “photo of a wzx person”
(right). Note the token is strongly associated with a specific identity belonging to that class.
… old ages. Further, we observe that the method outperforms Talk-to-Edit by an average FNMR of 44% at FMR = 0.01%. The different age groups are simulated using a target value parameter in Talk-to-Edit that varies from 0 to 5, each value representing an age group. However, we observe several cases of distorted or absent outputs from Talk-to-Edit.

… remaining two tokens, and have been used for further evaluation. Note that these tokens are condensed representations provided by the tokenizer that are determined by identifying rare phrases in the vocabulary (see Fig. 9). Additionally, we evaluate the effect of the token and the class label in the prompt in Fig. 10; removing the token results in a lapse in identity-specific features.
5.5. Effect of demographics
We also observed the following effects. Age: The generated images can capture different age groups well if the training set contains images in the middle-aged category. We observe that if the training set comprises mostly elderly images, then the method struggles to render images at the other end of the spectrum, i.e., the child category, and vice versa. We also observe that we obtain visually compelling results of advanced aging when we use ‘elderly’ in the prompt instead of ‘old’. Sex: The generated images translate the training images into older age groups more effectively for men than for women. This can be due to the use of makeup in the training images. Ethnicity: We do not observe any strong effects of ethnicity/race variations in the outputs. See Fig. 11. However, in some cases the proposed method struggles to generate ‘child’ images if most of the training images belong to elderly people or contain facial hair. See Fig. 12.

Figure 11. Examples of generated images pertaining to diverse sex and ethnicity for the ‘child’ group.