CS771 Mini Project-2
Group 95: Cerebro
IIT KANPUR
Introduction
We develop an algorithm to address unsupervised
domain adaptation (UDA) in continual learning (CL)
settings. The goal is to update a model continually to
learn distributional shifts across sequentially arriving
tasks with unlabeled data while retaining the knowledge
about the past learned tasks. Our solution is based on
consolidating the learned internal distribution for
improved model generalization on new domains and
benefiting from experience replay to overcome
catastrophic forgetting.
Challenges in Robust Generalization in Deep Learning
Deep Neural Networks and Domain-Specific Learning:
Deep neural networks (DNNs) are highly effective at identifying intricate patterns in
large datasets, allowing them to automate feature extraction and classification.
However, DNNs often overfit to the source domain, meaning they become too
specialized in the data they were trained on and struggle to perform well on new,
unseen data.
Domain Shift and the Generalization Challenge:
Domain shift occurs when the distribution of data changes between the training
phase and real-world testing, as seen when models are deployed in new
environments or with different data sources.
Under domain shift, DNNs often fail to produce accurate predictions, making
robust generalization a key challenge in fields such as autonomous driving, healthcare
diagnostics, and finance.
Challenges in Continual Learning
What is Continual Learning (CL)?
Continual Learning refers to the ability to learn incrementally from non-stationary information
streams. “Non-stationary” means the data distribution changes continuously; “incremental”
learning means preserving previous knowledge while continuously acquiring new information.
In CL settings, models must adapt to a sequence of tasks or domains over time, each with
unique characteristics.
Traditional supervised learning methods are inefficient for CL as they require fully labeled data
and complete retraining, which is often impractical for evolving data streams.
Challenge in CL
Most current CL algorithms focus on tasks, or domains, that have fully labeled datasets. As a
result, they rely on extensive data labeling for each new domain encountered. However, manual
data annotation is often impractical, as it is both time-consuming and costly.
Challenges in Unsupervised Domain
Adaptation (UDA)
What is UDA?
Unsupervised domain adaptation transfers a model trained on a labeled source domain to an
unlabeled target domain.
Shared Latent Space Alignment: Many UDA methods align the data distributions of the source
and target domains in a shared embedding space, allowing a classifier trained on the source
domain to generalize to the target domain.
Domain Alignment Techniques: Domain alignment can be achieved through generative
adversarial learning or by directly minimizing the distance between the two distributions.
However, unlike in standard UDA, the source data cannot be accessed directly
during continual learning, which makes it challenging to align distributions
without forgetting past knowledge.
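For illustration, the sketch below shows one common way to directly minimize a distance between two embedding distributions: a sliced Wasserstein estimate computed from batches of source and target embeddings. The function and variable names, and the choice of distance, are illustrative assumptions rather than a prescription from the slides.

```python
import torch

def sliced_wasserstein(z_a, z_b, num_projections=64):
    """Sliced Wasserstein distance between two batches of embeddings.

    z_a, z_b: (batch, dim) tensors with equal batch sizes. The two
    distributions are compared through random one-dimensional projections.
    """
    dim = z_a.size(1)
    # Random projection directions, normalized onto the unit sphere.
    theta = torch.randn(dim, num_projections, device=z_a.device)
    theta = theta / theta.norm(dim=0, keepdim=True)
    # Project both batches, sort each 1-D marginal, compare order statistics.
    proj_a = (z_a @ theta).sort(dim=0).values
    proj_b = (z_b @ theta).sort(dim=0).values
    return ((proj_a - proj_b) ** 2).mean()

# Hypothetical usage with an encoder phi and source/target batches:
# loss_align = sliced_wasserstein(phi(x_source), phi(x_target))
```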
Proposed Solution
Our solution learns a discriminative embedding space in which the learned internal
distribution is consolidated to improve generalization on new domains. The encoder maps the
input source distribution to a multi-modal distribution p_J(z) in the embedding space, with
each mode representing a class: data points from a given class are mapped to the
corresponding cluster in the embedding space.
To model the learned distribution, a Gaussian Mixture Model (GMM) with k components is used:

p_J(z) = \sum_{j=1}^{k} \alpha_j \, \mathcal{N}(z \mid \mu_j, \Sigma_j),

where \alpha_j are the mixture weights and \mu_j, \Sigma_j are the mean and covariance of component j.

Since the class labels are available for the source data, the parameters of each mode can be
computed independently using Maximum A Posteriori (MAP) estimation. For each mode j, the
support set S_j consists of the data points belonging to that class. With N source samples in
total and encoder \phi, the MAP estimates for the GMM parameters are

\hat{\alpha}_j = |S_j| / N, \qquad
\hat{\mu}_j = \frac{1}{|S_j|} \sum_{(x_i, y_i) \in S_j} \phi(x_i), \qquad
\hat{\Sigma}_j = \frac{1}{|S_j|} \sum_{(x_i, y_i) \in S_j} \big(\phi(x_i) - \hat{\mu}_j\big)\big(\phi(x_i) - \hat{\mu}_j\big)^\top.
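A minimal sketch of these MAP estimates is shown below, assuming a hypothetical array of encoder outputs (`embeddings`) and integer class `labels` for the source data.

```python
import numpy as np

def fit_internal_gmm(embeddings, labels, num_classes):
    """MAP estimates of the per-class GMM parameters in the embedding space.

    embeddings: (N, d) array of encoder outputs for the labeled source data.
    labels:     (N,) array of integer class labels in [0, num_classes).
    Returns mixture weights, means, and covariances (one mode per class).
    """
    N, d = embeddings.shape
    alphas = np.zeros(num_classes)
    means = np.zeros((num_classes, d))
    covs = np.zeros((num_classes, d, d))
    for j in range(num_classes):
        S_j = embeddings[labels == j]                    # support set of mode j
        alphas[j] = len(S_j) / N                         # mixture weight
        means[j] = S_j.mean(axis=0)                      # mode mean
        covs[j] = np.cov(S_j, rowvar=False, bias=True)   # mode covariance (MLE)
    return alphas, means, covs
```

Pseudo-samples of the internal distribution can then be drawn per mode, e.g. with np.random.multivariate_normal(means[j], covs[j]).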
Theoretical Analysis
In the PAC-learning framework, this theorem provides
a bound on the expected error a classifier will
experience on new, unseen tasks based on previous
learning experiences and data distribution changes
across tasks. Specifically, it considers a set of
possible classifiers (or hypothesis class) and
describes the errors: e_0 for the initial (source) domain, e_t for the target
domains (new tasks), and e_t^J for a pseudo-dataset that approximates target
performance.
The theorem then relates the error on target domains
to several factors, including the pseudo-dataset error,
the difference in data distributions between
consecutive tasks (measured as shifts in feature
distributions), an ideal error bound achievable by an
optimal classifier, and an additional residual error
term. By accounting for these elements, the theorem
provides insight into how well the classifier can adapt
to new tasks while retaining past knowledge, thus
helping guide continual learning by limiting error as
tasks evolve.
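Written schematically (our notation, not the theorem's exact statement), the bound described above has the form

e_t \;\le\; e_t^{J} \;+\; D\big(\phi(p_t), \hat{p}_J\big) \;+\; \sum_{s=2}^{t} D\big(\phi(p_{s-1}), \phi(p_s)\big) \;+\; e^{*} \;+\; \xi,

where e_t^J is the pseudo-dataset error, the second term measures how far the current task's feature distribution is from the internal distribution \hat{p}_J, the summation accumulates the shifts between consecutive task distributions, e^{*} is the error of an optimal classifier in the hypothesis class, and \xi is a residual error term.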
The model is trained by minimizing the objective in Eq. (4), whose first term is the empirical risk on random samples drawn from the internal distribution (together with the replay buffer) and whose final term aligns the current task's embedding distribution with the empirical internal distribution.
Theorem 1 explains why the LDAuCID algorithm is effective: the major terms on the right-hand side of Eq. (5), an
upper bound on the expected error for each task (domain), are continually minimized by LDAuCID. The first
term is minimized because random samples from the internal distribution are used to minimize the empirical error
term, the first term of Eq. (4). The second term is minimized by the third term of Eq. (4), which aligns the task
distribution with the empirical internal distribution in the embedding space at time t. The third term, a summation,
models the cumulative effect of the distribution shifts between consecutive tasks on the bound.
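A minimal sketch of one adaptation step with this structure is given below. The encoder, classifier, replay data, and GMM samples are hypothetical placeholders, and the loss mirrors the description above (empirical risk on internal-distribution and replay samples plus a λ-weighted alignment term) rather than reproducing the exact Eq. (4).

```python
import torch.nn.functional as F

def adaptation_step(encoder, classifier, optimizer,
                    x_target, replay_x, replay_y, gmm_z, gmm_y, lam=0.5):
    """One continual-UDA update: fit samples from the internal distribution and
    the replay buffer while aligning target embeddings with the internal GMM."""
    optimizer.zero_grad()

    # Empirical risk on pseudo-labeled samples drawn from the internal GMM.
    loss_internal = F.cross_entropy(classifier(gmm_z), gmm_y)

    # Empirical risk on the experience-replay buffer (data from past tasks).
    loss_replay = F.cross_entropy(classifier(encoder(replay_x)), replay_y)

    # Alignment between unlabeled target embeddings and the internal samples
    # (uses the sliced_wasserstein helper sketched earlier; equal batch sizes).
    loss_align = sliced_wasserstein(encoder(x_target), gmm_z)

    loss = loss_internal + loss_replay + lam * loss_align
    loss.backward()
    optimizer.step()
    return loss.item()
```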
Empirical Validation
We address lifelong unsupervised domain adaptation (UDA) using four classic
UDA benchmark datasets, adapted for sequential tasks. We adhere to the
one-source, one-target domain setup and standard evaluation protocols.
We use VGG16 as the base model for digit recognition tasks and
DeCAF6 features for the Office-Caltech tasks. For ImageCLEF-DA and
Office-Home, we use ResNet-50 pre-trained on ImageNet as the
backbone. To analyze model learning dynamics over time, we generate
learning curves showing test performance across training epochs,
simulating continual training. After each target domain task, we report
the average classification accuracy and standard deviation on the
target domain over five runs. Initial performance is measured with only
source data to show the impact of domain shift; then, we adapt the model
to the target data using the LDAuCID algorithm.
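As one example of such a backbone setup, a pre-trained ResNet-50 can be split into an encoder and a task-specific head along these lines (a sketch; the exact heads used in the experiments are not detailed here):

```python
import torch.nn as nn
from torchvision import models

# Pre-trained ResNet-50 encoder: drop the ImageNet classification layer so the
# network outputs 2048-dimensional embeddings instead of logits.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
feature_dim = backbone.fc.in_features
backbone.fc = nn.Identity()

# Task-specific head over the embedding space (Office-Home has 65 classes).
classifier = nn.Linear(feature_dim, 65)
```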
Results
Learning Curves: Figure 2 shows the learning curves for eight sequential
UDA tasks, with the model trained for 100 epochs on each task.
Experience replay uses 10 samples per class per domain (a minimal buffer sketch follows this list).
Domain Shift: Initially, domain shift causes a performance drop, but
subsequent tasks show improved performance due to knowledge
transfer.
LDAuCID Effectiveness: LDAuCID boosts performance on all target
tasks, with reduced catastrophic forgetting. However, the Office-Home
dataset, with larger domain gaps, shows less improvement.
Catastrophic Forgetting: The model retains performance on previously
learned tasks, with some forgetting observed in the SVHN dataset.
Comparison with Classic UDA: LDAuCID is compared to classic UDA
methods (Table 1) and often outperforms or is competitive with
methods like ETD and CDAN.
Balanced Datasets: LDAuCID performs particularly well on
balanced datasets like ImageCLEF-DA.
Lifelong UDA: LDAuCID effectively addresses lifelong UDA tasks,
outperforming many classic UDA methods, despite the limitation
of not having access to all source domain data.
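The following is a minimal sketch of a replay buffer that keeps at most 10 samples per class per domain, as in the setup above. Reservoir sampling is used here as a simple selection rule, whereas LDAuCID stores informative samples, so this is an illustrative simplification.

```python
import random
from collections import defaultdict

class ReplayBuffer:
    """Per-class, per-domain buffer keeping at most `capacity` samples each."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.store = defaultdict(list)   # (domain_id, class_id) -> samples
        self.seen = defaultdict(int)     # number of candidates observed so far

    def add(self, domain_id, class_id, sample):
        key = (domain_id, class_id)
        self.seen[key] += 1
        bucket = self.store[key]
        if len(bucket) < self.capacity:
            bucket.append(sample)
        else:
            # Reservoir sampling keeps a uniform subset of everything seen.
            idx = random.randrange(self.seen[key])
            if idx < self.capacity:
                bucket[idx] = sample

    def sample(self, batch_size):
        """Return a random mini-batch of stored samples and their class labels."""
        all_items = [(c, s) for (d, c), items in self.store.items() for s in items]
        batch = random.sample(all_items, min(batch_size, len(all_items)))
        return [s for _, s in batch], [c for c, _ in batch]
```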
Analytic and Ablative Studies
1. Data Representation and Learning
Progress:
Each data point in the embedding
space is represented by a point in a 2D
plot, with colors denoting the ten digit
classes.
The rows in Figure 3 correspond to the
data geometry at different time-steps,
with the second row showing the state
after learning the SVHN and MNIST
datasets.
By inspecting the columns vertically,
the impact of learning multiple tasks
over time can be observed. In
particular, when a new task is added,
the model retains the knowledge
learned in previous tasks, indicated by
the separability of classes across
rows.
2. Catastrophic Forgetting Mitigation:
The model shows stability in retaining knowledge, suggesting that catastrophic
forgetting is being effectively mitigated. Even as new tasks are added (moving to
the next rows in the figure), the learned knowledge does not fade, and the model
adapts to new tasks while preserving previous information.
By the final row of Figure 3, the distributions of all domains align closely,
resembling the internally learned Gaussian Mixture Model (GMM) distribution.
This alignment suggests that the model successfully adapts the target domains
to share the same distribution.
For example, comparing the distribution of the MNIST dataset in the first row
(before adaptation) and the second row (after adaptation) shows that the MNIST
distribution increasingly resembles that of the SVHN dataset, the source domain.
This supports the hypothesis that the model’s domain adaptation mechanism
works as expected.
3. Hyperparameter Sensitivity:
The effect of two hyperparameters, λ and τ, on model performance is also studied,
particularly for the binary UDA task (S → M) and illustrated in Figures 3e and 3f.
λ (Trade-off Parameter): The results show that λ has a minimal effect on performance.
This is because the ERM (Empirical Risk Minimization) loss term is relatively small at the
start of the alignment process due to pre-training on the source domain, making λ less
critical for fine-tuning. The dominant optimization term is the domain-alignment term,
meaning that λ does not need careful adjustment.
τ (Confidence Parameter): The parameter τ controls the confidence threshold used in the
alignment process; setting τ ≈ 1 yields better performance on the target domain. A high
threshold reduces the label pollution caused by outlier samples drawn from the GMM
distribution, which would otherwise degrade domain alignment, so with τ ≈ 1 the model
becomes more robust to such outliers (a minimal sketch of this thresholding follows this list).
4. Empirical Support for Theoretical Analysis:
The observed results align with the theoretical analysis presented earlier (Theorem 1),
confirming the effectiveness of the domain adaptation method.
The overall conclusion from these empirical evaluations is that the LDAuCID method
works as expected, demonstrating effective domain adaptation and mitigated
catastrophic forgetting, as well as improved performance with the proper choice of
hyperparameters.
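As referenced in the hyperparameter discussion above, the sketch below illustrates the confidence-thresholding idea behind τ: pseudo-labeled samples are drawn from the internal GMM and kept only when the posterior responsibility of their generating component exceeds τ. The function and its details are an illustrative assumption, not the exact procedure used in the experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal

def draw_confident_pseudo_samples(alphas, means, covs, num_samples, tau=0.99):
    """Draw pseudo-labeled samples from the internal GMM, keeping only those
    whose posterior responsibility for the generating component exceeds tau."""
    k = len(alphas)
    comps = np.random.choice(k, size=num_samples, p=alphas / alphas.sum())
    kept_z, kept_y = [], []
    for j in comps:
        z = np.random.multivariate_normal(means[j], covs[j])
        # Log joint density of z under each component (numerically stable).
        log_joint = np.array([
            np.log(alphas[c]) +
            multivariate_normal.logpdf(z, means[c], covs[c], allow_singular=True)
            for c in range(k)
        ])
        # Posterior responsibilities via a stable softmax over components.
        resp = np.exp(log_joint - log_joint.max())
        resp /= resp.sum()
        if resp[j] > tau:            # keep only confident, low-pollution samples
            kept_z.append(z)
            kept_y.append(j)
    return np.array(kept_z), np.array(kept_y)
```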
Conclusion
We propose a domain adaptation algorithm for continual learning,
where input distributions are mapped to an internal distribution in
an embedding space via a neural network. Our method aligns
these distributions across tasks, ensuring that new tasks do not
hinder generalization. Catastrophic forgetting is mitigated through
experience replay, which stores and replays informative input
samples for updating the internal distribution. While our approach
uses a simple distribution estimation, we anticipate that better
methods could further enhance performance. Future work will
explore the impact of task order and extend the approach to
incremental learning, allowing for new class discovery after initial
training.