CS771 MINI PROJECT-2

Group 95 : Cerebro

Lifelong Domain Adaptation via Consolidated Internal Distribution

BY: VARSHA PILLAI, KISHORE S, ANIRVAN TYAGI

IIT KANPUR
Introduction
We develop an algorithm to address unsupervised
domain adaptation (UDA) in continual learning (CL)
settings. The goal is to continually update a model to
handle distributional shifts across sequentially arriving
tasks with unlabeled data, while retaining knowledge of
previously learned tasks. Our solution consolidates the
learned internal distribution to improve model
generalization on new domains and uses experience
replay to overcome catastrophic forgetting.
Challenges in Robust Generalization in Deep Learning
Deep Neural Networks and Domain-Specific Learning:

Deep neural networks (DNNs) are highly effective at identifying intricate patterns in
large datasets, allowing them to automate feature extraction and classification.
However, DNNs often overfit to the source domain, meaning they become too
specialized in the data they were trained on and struggle to perform well on new,
unseen data.
Domain Shift and the Generalization Challenge:

Domain shift occurs when the distribution of data changes between the training
phase and real-world testing, as seen when models are deployed in new
environments or with different data sources.
Under domain shift, DNNs tend to fail in producing accurate predictions, making
robust generalization a key challenge in fields like autonomous driving, healthcare
diagnostics, and finance.
Challenges in Continual Learning
What is Continual Learning (CL)?
Continual Learning refers to the ability to learn from non-stationary information streams
incrementally. “Non-stationary” represents continuously changing data distributions.
“Incremental” learning refers to preserving previous knowledge while continuously learning
new information.
In CL settings, models must adapt to a sequence of tasks or domains over time, each with
unique characteristics.
Traditional supervised learning methods are inefficient for CL as they require fully labeled data
and complete retraining, which is often impractical for evolving data streams.

Challenge in CL
Most current CL algorithms focus on tasks, or domains, that have fully labeled datasets. As a
result, they rely on extensive data labeling for each new domain encountered. However, manual
data annotation is often impractical, as it is both time-consuming and costly.
Challenges in Unsupervised Domain Adaptation (UDA)
What is UDA?
Shared Latent Space Alignment: Many UDA methods align the data distributions of both the
source and target domains in a shared embedding space, allowing a classifier trained on the
source domain to generalize to the target domain.
Domain Alignment Techniques: Domain alignment can be achieved through generative
adversarial learning or by directly minimizing the distance between the two distributions.

Challenge of Catastrophic Forgetting


Traditional UDA methods are not suitable for continual learning because they typically require
access to both source and target datasets simultaneously and often handle only a single source
and target domain. Simply updating the model for each new domain can lead to "catastrophic
forgetting," where the network loses information from previously learned domains due to
retroactive interference.
New Proposed Model
The algorithm aims to enable lifelong, unsupervised adaptation of a model to new
domains with only unannotated data. This means the model can continuously learn
from changing environments without labeled data.
Core Idea - Internal Distribution Consolidation: The approach consolidates the internal
representation or distribution learned by the model from the initial source domain (where
labeled data is available). This internal distribution acts as a memory of learned knowledge
and helps the model adapt to new, unlabeled domains.
Multimodal Distribution for Coupled Learning: By treating this internal representation as a
multimodal distribution, the algorithm updates the model to ensure that the knowledge from
the original domain is "coupled" with the new domains it encounters. This coupling allows the
model to generalize effectively across multiple unseen domains.
Addressing Catastrophic Forgetting with Experience Replay: To prevent the model from
forgetting past knowledge (catastrophic forgetting), it saves key representative samples from
previous tasks. When learning new tasks, it replays these samples, reinforcing past
knowledge as it adapts to new data.
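As a rough illustration of the experience-replay component described above, the sketch below keeps a small, class-balanced memory of past-domain samples (the slides later mention 10 samples per class per domain). The class name ReplayBuffer and the closest-to-class-mean selection heuristic are assumptions for illustration; the slides do not specify the exact selection rule.

```python
import numpy as np

class ReplayBuffer:
    """Class-balanced memory of representative samples from past domains.

    Selection heuristic (an assumption, not necessarily the method's rule):
    keep the samples whose embeddings lie closest to their class mean.
    """

    def __init__(self, per_class=10):
        self.per_class = per_class
        self.memory = []  # list of (input, label-or-pseudo-label) pairs

    def add_domain(self, embeddings, inputs, labels):
        # For each class, store the samples nearest the class centroid in Z.
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            center = embeddings[idx].mean(axis=0)
            dists = np.linalg.norm(embeddings[idx] - center, axis=1)
            keep = idx[np.argsort(dists)[: self.per_class]]
            self.memory.extend((inputs[i], labels[i]) for i in keep)

    def sample(self, batch_size, rng=np.random):
        # Draw a random mini-batch of stored samples for replay.
        picks = rng.choice(len(self.memory),
                           size=min(batch_size, len(self.memory)),
                           replace=False)
        xs, ys = zip(*(self.memory[i] for i in picks))
        return np.stack(xs), np.array(ys)
```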
Problem Statement
Source Domain Setup:
We start with a source domain S, where we have a labeled training
dataset $D_S = \{(x_i^s, y_i^s)\}_{i=1}^{N}$. This dataset is drawn from an
unknown distribution $p_S(x)$, and we can train a deep neural network
$f_\theta$ on this data using empirical risk minimization (ERM), which
minimizes the classification error between predicted and actual
labels on the source domain. If the dataset is large and the
network is complex enough, the model will generalize well on new
samples from the same distribution.

Optimum $\theta$ using ERM:

$\hat{\theta} = \arg\min_{\theta}\; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i^s),\, y_i^s\big)$
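A minimal sketch of this ERM step in PyTorch, assuming a generic model and labeled source-domain loader (both hypothetical); it simply minimizes average cross-entropy between predictions and ground-truth source labels.

```python
import torch
import torch.nn as nn

def train_source_erm(model, source_loader, epochs=10, lr=1e-3, device="cpu"):
    """Standard ERM on the labeled source domain: minimize the average
    cross-entropy between predicted and actual labels."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in source_loader:
            x, y = x.to(device), y.to(device)
            loss = ce(model(x), y)   # empirical risk on this mini-batch
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```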


Problem Statement
Challenge of Continual Learning: In continual learning (CL), we face
the additional challenge that the input distribution is not stationary.
Over time, the data distribution may shift, leading to a distributional
gap (or "domain shift") between the training data distribution and
the new, incoming data. If the model is not adapted to these
changes, it may perform poorly on the new, shifted data.
Sequential Target Domains: To simulate real-world scenarios, we
consider a series of target domains $\mathcal{T}_1, \dots, \mathcal{T}_T$, each with an unlabeled
dataset $D_t = \{x_i^t\}_{i=1}^{N_t}$. The samples in each target domain are drawn
from a different distribution $p_t(x)$, meaning that the data
distribution changes over time. Since these datasets are
unlabeled, we cannot use standard ERM methods, which rely on
labeled data for training.
Problem Statement
To address domain shift and catastrophic forgetting, we
decompose our network $f_\theta$ into a deep encoder $\phi_v$ (which maps
data to an embedding space Z) and a classifier $h_w$. After
training on the source domain, the classes are well-separated in
Z. Our goal is to maintain this separation as new, unlabeled
target domains are introduced, allowing the model to generalize
to new domains.
We achieve this by keeping the distribution in Z stable across
domains, i.e., by minimizing the distance between the source and
target distributions in Z:

$\min_{v}\; D\big(\phi_v(p_S(x)),\, \phi_v(p_t(x))\big)$

However, unlike standard UDA methods, we can’t directly access the source
data during continual learning, which makes it challenging to align
distributions without forgetting past knowledge.
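One standard way to estimate such a distance between batches of embeddings is the sliced Wasserstein distance; the sketch below is an illustrative estimator (equal batch sizes assumed), not necessarily the exact metric used by the method.

```python
import torch

def sliced_wasserstein(z_src, z_tgt, n_projections=64):
    """Approximate the Wasserstein distance between two batches of
    embeddings by averaging 1-D Wasserstein distances along random
    projection directions. Assumes z_src and z_tgt have equal batch sizes."""
    d = z_src.shape[1]
    # Random unit directions used to slice the distributions.
    theta = torch.randn(n_projections, d, device=z_src.device)
    theta = theta / theta.norm(dim=1, keepdim=True)
    p_src = z_src @ theta.T   # (n, n_projections)
    p_tgt = z_tgt @ theta.T
    # In 1-D, the Wasserstein distance reduces to comparing sorted samples.
    p_src, _ = torch.sort(p_src, dim=0)
    p_tgt, _ = torch.sort(p_tgt, dim=0)
    return ((p_src - p_tgt) ** 2).mean()
```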
Proposed Solution
The solution involves learning a discriminative embedding space in which
the learned internal distribution is consolidated to improve
generalizability. The encoder maps the input source distribution into a
multimodal distribution $p_J(z)$ in the embedding space, with each mode
representing a class: data points from a specific class are mapped to
the corresponding cluster in the embedding space.
To model the learned distribution, a Gaussian Mixture Model (GMM) with
k components is used:

$p_J(z) = \sum_{j=1}^{k} \alpha_j\, \mathcal{N}\big(z \mid \mu_j, \Sigma_j\big)$

where $\alpha_j$ are the mixture weights, and $\mu_j$ and $\Sigma_j$ are the mean and
covariance of the components, respectively.
Since the class labels are available, the parameters for each mode can
be computed independently using Maximum A Posteriori (MAP) estimates.
For each mode j, the support set $S_j$ consists of the data points
belonging to that class. The MAP estimates for the GMM parameters are:

$\hat{\alpha}_j = \frac{|S_j|}{N}, \qquad \hat{\mu}_j = \frac{1}{|S_j|}\sum_{(x_i, y_i)\in S_j} \phi_v(x_i), \qquad \hat{\Sigma}_j = \frac{1}{|S_j|}\sum_{(x_i, y_i)\in S_j} \big(\phi_v(x_i) - \hat{\mu}_j\big)\big(\phi_v(x_i) - \hat{\mu}_j\big)^{\top}$
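A minimal sketch of these per-class estimates and of drawing pseudo-samples from the resulting GMM. The function names and the small covariance regularizer are assumptions added for illustration; Z is assumed to already hold the encoder outputs.

```python
import numpy as np

def fit_internal_gmm(Z, y, num_classes):
    """MAP/ML estimates of a class-conditional GMM over the embedding
    space: one Gaussian mode per class, fitted from labeled embeddings Z."""
    N, d = Z.shape
    alphas, means, covs = [], [], []
    for j in range(num_classes):
        Sj = Z[y == j]                    # support set of mode j
        alphas.append(len(Sj) / N)        # mixture weight
        mu = Sj.mean(axis=0)
        means.append(mu)
        diff = Sj - mu
        # Small diagonal term added for numerical stability (an assumption).
        covs.append(diff.T @ diff / len(Sj) + 1e-6 * np.eye(d))
    return np.array(alphas), np.array(means), np.array(covs)

def sample_internal_gmm(alphas, means, covs, n_samples, rng=np.random):
    """Draw a labeled pseudo-dataset (z, y) from the consolidated internal GMM."""
    ys = rng.choice(len(alphas), size=n_samples, p=alphas)
    zs = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in ys])
    return zs, ys
```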
Theoretical Analysis
In the PAC-learning framework, this theorem provides
a bound on the expected error a classifier will
experience on new, unseen tasks based on previous
learning experiences and data distribution changes
across tasks. Specifically, it considers a set of
possible classifiers (or hypothesis class) and
describes the errors: $e_0$ for the initial (source) domain,
$e_t$ for target domains (new tasks), and $e_{t,J}$ for a
pseudo-dataset that approximates target performance.
The theorem then relates the error on target domains
to several factors, including the pseudo-dataset error,
the difference in data distributions between
consecutive tasks (measured as shifts in feature
distributions), an ideal error bound achievable by an
optimal classifier, and an additional residual error
term. By accounting for these elements, the theorem
provides insight into how well the classifier can adapt
to new tasks while retaining past knowledge, thus
helping guide continual learning by limiting error as
tasks evolve.
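The bound described verbally above (what the next slide refers to as Eq. 5) can be written schematically as below; the symbols and exact form are assumptions for illustration, not the theorem's verbatim statement.

```latex
% Schematic form of the bound described above (symbols assumed, not verbatim):
%   e_t      expected error on target task t
%   e_{t,J}  error on the pseudo-dataset drawn from the internal distribution
%   W(.,.)   Wasserstein distance between embedding-space distributions
%   e_C      error of an ideal joint classifier;  \xi  residual term
\[
  e_t \;\lesssim\; e_{t,J}
        \;+\; W\!\big(\hat{p}_J,\, \hat{q}_t\big)
        \;+\; \sum_{s=2}^{t} W\!\big(\hat{q}_{s-1},\, \hat{q}_s\big)
        \;+\; e_C \;+\; \xi
\]
```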
Theoretical Analysis
The model is trained by minimizing

$\min_{v,w}\; \underbrace{\frac{1}{N_p}\sum_{i} \mathcal{L}\big(h_w(z_i^p), y_i^p\big)}_{\text{pseudo-samples from } \hat{p}_J} \;+\; \underbrace{\frac{1}{N_b}\sum_{i} \mathcal{L}\big(h_w(\phi_v(x_i^b)), y_i^b\big)}_{\text{experience replay}} \;+\; \underbrace{\lambda\, D\big(\phi_v(\hat{q}_t),\, \hat{p}_J\big)}_{\text{distribution alignment}} \qquad (4)$

where $(z_i^p, y_i^p)$ are pseudo-samples drawn from the internal GMM distribution, $(x_i^b, y_i^b)$ are
experience-replay samples, and $D(\cdot,\cdot)$ denotes the Wasserstein distance (WD).

Theorem 1 explains why the LDAuCID algorithm is effective. The major terms on the right-hand side of Eq. 5, an
upper bound on the expected error for each task (domain), are continually minimized by LDAuCID. The first term
is minimized because random samples from the internal distribution are used to minimize the empirical error
term, i.e., the first term of Eq. (4). The second term is minimized by the third term of Eq. (4) when the task
distribution is aligned with the empirical internal distribution in the embedding space at time t. The third
term, a summation over the distances between consecutive task distributions, models the accumulated effect of
distributional shifts across the task sequence and remains small when consecutive distributions stay aligned.
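Below is a sketch of one adaptation step combining the three ingredients discussed above: classification loss on pseudo-samples drawn from the internal GMM, classification loss on replay samples, and the alignment term between target embeddings and the internal distribution. It reuses the sliced_wasserstein sketch shown earlier; the signature, the λ weighting, and feeding GMM pseudo-samples directly to the classifier are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def adaptation_step(encoder, classifier, x_target, gmm_z, gmm_y,
                    x_replay, y_replay, opt, lam=1.0):
    """One adaptation step on an unlabeled target batch, combining:
      (1) classification loss on pseudo-samples (gmm_z, gmm_y) drawn from
          the consolidated internal GMM in the embedding space,
      (2) classification loss on stored experience-replay samples,
      (3) an alignment term between target embeddings and the GMM
          pseudo-samples (sliced_wasserstein from the earlier sketch,
          equal batch sizes assumed).
    """
    z_t = encoder(x_target)                                   # unlabeled target embeddings
    loss_internal = F.cross_entropy(classifier(gmm_z), gmm_y)                  # term (1)
    loss_replay = F.cross_entropy(classifier(encoder(x_replay)), y_replay)     # term (2)
    loss_align = sliced_wasserstein(gmm_z, z_t)                                # term (3)
    loss = loss_internal + loss_replay + lam * loss_align
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In practice, the pseudo-samples drawn from the GMM would typically be filtered by the confidence parameter τ discussed in the ablation slides, keeping only high-confidence draws.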
Empirical Validation
We address lifelong unsupervised domain adaptation (UDA) using four classic
UDA benchmark datasets, adapted for sequential tasks. We adhere to the
one-source, one-target domain setup and standard evaluation protocols.

Digit Recognition: Using MNIST (M), USPS (U), and SVHN (S), we test on
M → U, U → M, and S → M, plus two sequential tasks: S → M → U and
S → U → M.

ImageCLEF-DA: With 12 shared classes from Caltech-256 (C), ILSVRC 2012
(I), and Pascal VOC 2012 (P), we evaluate on C → I → P.

Office-Home: Consisting of 15,500 images across Artistic (A), Clip Art
(C), Product (P), and Real-World (R) domains, we test A → C → P → R and
R → P → C → A.

Office-Caltech: Using 10 shared classes from Office-31 and Caltech-256,
we test A → C → D → W and W → D → C → A.
Network Structure and Evaluation Protocol:

We use VGG16 as the base model for digit recognition tasks and
Decaf6 features for the Office-Caltech tasks. For ImageCLEF-DA and
Office-Home, we use ResNet-50 pre-trained on ImageNet as the
backbone. To analyze model learning dynamics over time, we generate
learning curves showing test performance across training epochs,
simulating continual training. After each target domain task, we report
the average classification accuracy and standard deviation on the
target domain over five runs. Initial performance is measured with only
source data to show domain shift impact; then, we adapt the model
using the LDAuCID algorithm on the target data.
Results
Learning Curves: Figure 2 shows the learning curves for eight sequential
UDA tasks, with the model trained for 100 epochs on each task.
Experience replay uses 10 samples per class per domain.
Domain Shift: Initially, domain shift causes a performance drop, but
subsequent tasks show improved performance due to knowledge
transfer.
LDAuCID Effectiveness: LDAuCID boosts performance on all target
tasks, with reduced catastrophic forgetting. However, the Office-Home
dataset, with larger domain gaps, shows less improvement.
Catastrophic Forgetting: The model retains performance on previously
learned tasks, with some forgetting observed in the SVHN dataset.
Comparison with Classic UDA: LDAuCID is compared to classic UDA
methods (Table 1) and often outperforms or is competitive with
methods like ETD and CDAN.
Balanced Datasets: LDAuCID performs particularly well on
balanced datasets like ImageCLEF-DA.
Lifelong UDA: LDAuCID effectively addresses lifelong UDA tasks,
outperforming many classic UDA methods, despite the limitation
of not having access to all source domain data.
Analytic and Ablative Studies
1. Data Representation and Learning
Progress:
Each data point in the embedding
space is represented by a point in a 2D
plot, with colors denoting the ten digit
classes.
The rows in Figure 3 correspond to the
data geometry at different time-steps,
with the second row showing the state
after learning the SVHN and MNIST
datasets.
By inspecting the columns vertically,
the impact of learning multiple tasks
over time can be observed. In
particular, when a new task is added,
the model retains the knowledge
learned in previous tasks, indicated by
the separability of classes across
rows.
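Plots of this kind are usually produced by projecting the learned embeddings to 2D; the sketch below uses t-SNE from scikit-learn as one possible projection (the projection method actually used for the figure is not specified here and is an assumption).

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding(Z, y, title=""):
    """Project embedding-space features Z (N, d) to 2D and color points by
    class, reproducing the style of plot described above."""
    z2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(Z)
    plt.figure(figsize=(4, 4))
    sc = plt.scatter(z2d[:, 0], z2d[:, 1], c=y, cmap="tab10", s=5)
    plt.colorbar(sc, label="digit class")
    plt.title(title)
    plt.show()
```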
Analytic and Ablative Studies
1. Catastrophic Forgetting Mitigation:
The model shows stability in retaining knowledge, suggesting that catastrophic
forgetting is being effectively mitigated. Even as new tasks are added (moving to
the next rows in the figure), the learned knowledge does not fade, and the model
adapts to new tasks while preserving previous information.

2. Alignment of Data Distributions:

By the final row of Figure 3, the distributions of all domains align closely,
resembling the internally learned Gaussian Mixture Model (GMM) distribution.
This alignment suggests that the model successfully adapts the target domains
to share the same distribution.
For example, comparing the distribution of the MNIST dataset in the first row
(before adaptation) and the second row (after adaptation) shows that the MNIST
distribution increasingly resembles that of the SVHN dataset, the source domain.
This supports the hypothesis that the model’s domain adaptation mechanism
works as expected.
3. Hyperparameter Sensitivity:
The effect of two hyperparameters, λ and τ, on model performance is also studied,
particularly for the binary UDA task (S → M) and illustrated in Figures 3e and 3f.
λ (Trade-off Parameter): The results show that λ has a minimal effect on performance.
This is because the ERM (Empirical Risk Minimization) loss term is relatively small at the
start of the alignment process due to pre-training on the source domain, making λ less
critical for fine-tuning. The dominant optimization term is the domain-alignment term,
meaning that λ does not need careful adjustment.
τ (Confidence Parameter): The parameter τ, which controls the confidence in the
alignment process, shows that when τ is set to approximately 1, the model performs
better on the target domain. This is due to the reduction of label pollution caused by
outlier samples in the GMM distribution, which can negatively affect domain alignment.
When τ ≈ 1, the model becomes more robust to such outliers.
4. Empirical Support for Theoretical Analysis:
The observed results align with the theoretical analysis presented earlier (Theorem 1),
confirming the effectiveness of the domain adaptation method.
The overall conclusion from these empirical evaluations is that the LDAuCID method
works as expected, demonstrating effective domain adaptation and mitigated
catastrophic forgetting, as well as improved performance with the proper choice of
hyperparameters.
Conclusion
We propose a domain adaptation algorithm for continual learning,
where input distributions are mapped to an internal distribution in
an embedding space via a neural network. Our method aligns
these distributions across tasks, ensuring that new tasks do not
hinder generalization. Catastrophic forgetting is mitigated through
experience replay, which stores and replays informative input
samples for updating the internal distribution. While our approach
uses a simple distribution estimation, we anticipate that better
methods could further enhance performance. Future work will
explore the impact of task order and extend the approach to
incremental learning, allowing for new class discovery after initial
training.