Multimodality in Meta-Learning - A Comprehensive Survey
Yao Ma^{a,b}, Shilin Zhao^{a}, Weixiao Wang^{a}, Yaoman Li^{a,c,*}, Irwin King^{c}
^{a} Lenovo Machine Intelligence Center, Hong Kong Science Park, Hong Kong
^{b} Delft University of Technology, Delft, The Netherlands
^{c} The Chinese University of Hong Kong, Shatin, NT, Hong Kong
Abstract
Meta-learning has gained wide popularity as a training framework that is more data-efficient than traditional machine learning
methods. However, its generalization ability in complex task distributions, such as multimodal tasks, has not been thoroughly
studied. Recently, some studies on multimodality-based meta-learning have emerged. This survey provides a comprehensive
overview of the multimodality-based meta-learning landscape in terms of the methodologies and applications. We first formalize
the definition of meta-learning in multimodality, along with the research challenges in this growing field, such as how to enrich the
input in few-shot learning (FSL) or zero-shot learning (ZSL) in multimodal scenarios and how to generalize the models to new tasks.
We then propose a new taxonomy to discuss typical meta-learning algorithms in multimodal tasks systematically. We investigate
the contributions of related papers and summarize them by our taxonomy. Finally, we propose potential research directions for this
promising field.
Keywords: Meta-Learning, Multimodal, Deep Learning, Few-shot Learning, Zero-shot Learning
loss function L^{task} in Eq. (2). Then the meta-knowledge ω* is learned when evaluating on D_{source}^{val} over all the source tasks:

\omega^{*} = \arg\min_{\omega} \sum_{i=1}^{M} \mathcal{L}^{meta}\left(\theta^{*(i)}(\omega), \omega, D_{source}^{val\,(i)}\right),    (1)

\text{s.t.}\quad \theta^{*(i)}(\omega) = \arg\min_{\theta} \mathcal{L}^{task}\left(\theta, \omega, D_{source}^{train\,(i)}\right).    (2)

Usually, D_{source}^{train} and D_{source}^{val} are called support sets and query sets, respectively. For the set of Q target tasks in the meta-test stage, D_{target} = {(D_{target}^{train}, D_{target}^{test})^{(i)}}_{i=1}^{Q}, the learned prior knowledge ω* from the meta-training stage will be used to train the model on unseen tasks.

The meta-learning paradigm discussed in this survey has two key properties:

• The existence of a task segmentation mechanism. The meta-training and meta-test sets [7] contain task units, where each task is divided into the support set and query set. Prior knowledge accumulated in the meta-training stage is used in the meta-test to evaluate the accuracy of the meta-model.

• The process of constructing an adaptive learner. This refers to the accumulation of meta-knowledge experience by dynamically improving the deviation between the inner base learner and the outer generalized learner. The type of meta-knowledge could be the estimation of initial parameters [14], or an embedding strategy [15]. Ultimately, it guarantees that the adapted knowledge can directly generalize to predicting new tasks instead of adding extra data.
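To make the bi-level structure of Eq. (1) and Eq. (2) concrete, the following minimal sketch shows one meta-training iteration in the episodic setting just described: each source task contributes an inner adaptation on its support set and an outer loss on its query set. It is a generic second-order, MAML-style illustration, not the procedure of any particular surveyed paper; the learning rates are placeholders, and functional_forward denotes a forward pass with explicitly supplied parameters (as provided, e.g., by libraries such as higher or torch.func), which is assumed here for brevity.

    import torch
    import torch.nn.functional as F

    def meta_train_step(model, meta_opt, task_batch, inner_lr=0.01, inner_steps=1):
        """One update of the meta-parameters ω over a batch of source tasks.

        Each task supplies a support set (inner loop, Eq. (2)) and a query set
        (outer loss, Eq. (1)).
        """
        meta_opt.zero_grad()
        outer_loss = 0.0
        for (xs, ys), (xq, yq) in task_batch:            # M source tasks
            # Inner loop: adapt a copy θ(i) of the shared initialization ω.
            fast_weights = {n: p for n, p in model.named_parameters()}
            for _ in range(inner_steps):
                loss = F.cross_entropy(model.functional_forward(xs, fast_weights), ys)
                grads = torch.autograd.grad(loss, list(fast_weights.values()),
                                            create_graph=True)
                fast_weights = {n: p - inner_lr * g
                                for (n, p), g in zip(fast_weights.items(), grads)}
            # Outer loss: evaluate the adapted θ*(i)(ω) on the query set.
            outer_loss = outer_loss + F.cross_entropy(
                model.functional_forward(xq, fast_weights), yq)
        outer_loss.backward()                            # gradient with respect to ω
        meta_opt.step()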
2.3. Formalizing Multimodality

We consider data modalities to be multiple when data is collected through multiple sensors, measurement equipment or acquisition technologies [39]. The output of each sensory channel is represented as a modality, which is usually a single form of dataset associated with a medium of expression, such as vision data from seeing objects and audio data from hearing sounds. The key characteristics constituting multimodality that cover data coming from multiple modalities can be summarized as follows:

• Complementarity. Each modality brings a certain type of added value to the whole, and these added values cannot be inferred or obtained from any other modality in the setting [40].

• Diversity. The meaning is composed of different modalities from different semiotic resources.

• Integrity. The making of meaning involves the overall attention to the potential and limitations of each modality [41].

Our key assumption is to combine multimodal features, such as images, fine-grained descriptions, and auxiliary audio and video information, to force the model to recognize cross-modal discriminative features and promote their usage in meta-learning applications. Under this assumption, we limit the scope of our research to models that use multiple sensors as inputs, thereby excluding studies that build their models on multimodal task distributions with disjoint modes [11, 12] or shifted domains [42].

On the other hand, there are various combinations of multimodal sources in different tasks (see Figure 2). Modalities could be provided in pairs or across tasks. The diversity and integrity of modalities are therefore reflected in the dependence on the semantics of different modalities. New classes with insufficient training data in unimodality can benefit from previously learned features. When one of the modalities is absent, the complementarity between modalities provides a way to conditionally consider auxiliary semantics and support the global meaning. Beyond traditional multimodal learning, to better generalize to new tasks, additional challenges may also include elements that affect the effectiveness of generalization, such as the definition of multimodal tasks, the choice of meta-learning algorithms, and the number of modalities.

3. Proposed Taxonomy
are optimized in different task distributions, within-task modality alignment and cross-task modality alignment will be discussed in Section 4.1.

Learn the Embedding. Metric-based meta-learning methods are widely used as non-parametric algorithms for few-shot problems. The idea is to learn an embedding network that helps project the training and testing points onto the same space to implement similarity comparison. The embedding network is trained on unimodal support sets where the quality of feature embeddings is often restricted. The introduction of multimodality expands the number of original embedding spaces. We emphasize that an embedding network aligning multiple spaces, or adapting to the selection of multiple spaces, can be applied to match any source of multimodal information, with the ability to capture the semantic relationship between input pairs more effectively.

Here, we summarize four widely used approaches originating from Pairwise Networks [47, 48], Prototypical Networks [15], Matching Networks [16] and Relation Networks [49]. For each one in the multimodality-based scenario, modalities are respectively provided as pairs for the pairwise networks, combined to train a new joint prototype, fused to formulate the attention kernel, and related together to form the concatenation for the non-linear operators. Especially for variants of Prototypical Networks, prototypes play different roles in the model due to the different stages of multimodal fusion. Details about the difference between deterministic prototypes and shifted prototypes will be discussed in Section 5.2.

Learn the Generation. Another form of meta-knowledge is data generation, which is widely employed for most unimodal few-shot and zero-shot applications. Since the source of unimodal data is relatively simple and scarce, we exploit prior knowledge of the data to modify the model or execute data augmentation. There are often implicit relationships between multiple modalities, which can help better describe the scarce label characteristics across tasks. Although not many, we notice that some researchers have worked on the knowledge of conditional probability to simulate the distribution of the primary modality conditioned on the auxiliary modality [30, 50]. We summarize the related methods as cross-modal augmentation with generating data across modalities.

4. Learn the Optimization

Optimization-based meta-learning methods aim to solve the inner-level task as an optimization problem [7] and extract the meta-parameter ω* through Eq. (1) to obtain the best performance for generalizing to new tasks within only a few gradient-descent updates. Most of these methods are built on top of the Model-Agnostic Meta-Learning (MAML) [14] framework, which is a popular approach to meta-learn an initial set of neural network weights adapted for fine-tuning on few-shot problems. Reptile [51] generalizes first-order MAML by repeatedly sampling a task and training to minimize the loss in expectation over tasks. More alternatives that fit the meta-optimization process illustrated by Eq. (1) and Eq. (2) also learn a specific inner optimizer by producing the trainable ω* as its own hyperparameters [52]. The learned optimization algorithm [17] will then generalize to unseen tasks to optimize the inner learner directly.
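As a concrete contrast to the bi-level sketch in Section 2, the following minimal sketch illustrates the first-order Reptile update described above: sample a task, take a few plain gradient steps on it, and move the shared initialization toward the adapted weights. The model, loss, and step sizes are placeholders rather than values from any surveyed paper.

    import copy
    import torch

    def reptile_step(model, sample_batch, inner_lr=0.01, inner_steps=5, outer_lr=0.1):
        """One Reptile update: the initialization ω drifts toward task-adapted weights."""
        task_model = copy.deepcopy(model)                     # start from the current ω
        opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                          # ordinary SGD on one sampled task
            x, y = sample_batch()                             # a support batch of that task
            loss = torch.nn.functional.cross_entropy(task_model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Outer update: ω ← ω + ε (θ_task − ω); in practice averaged over several tasks.
        with torch.no_grad():
            for p, p_task in zip(model.parameters(), task_model.parameters()):
                p.add_(outer_lr * (p_task - p))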
Multimodal information can usually be parameterized and wrapped into the inner-level optimization task, and then trained iteratively with the meta-parameter, either for the initialization or for the optimization algorithm. We summarize the relevant studies in terms of where the meta-parameter is used and how it is co-trained with multimodal parameters.

4.1. Parameterized-Modal Initialization

4.1.1. Overview

Meta-learning algorithms such as MAML and Reptile are known to fine-tune parameters of unseen tasks quickly by using gradient descent methods. Since seen and unseen tasks are involved in the learning process across multiple domains, the
diversity of the task space distributions puts forward higher requirements for applying the initialized meta-parameter. One common task distribution comprises tasks with task instances of the same manifold from the same domain, where all tasks are either unimodal or multimodal. In addition, tasks may come from the same domain, but the task instances have different structures or subspaces [46], such as using different modality features to describe the same semantic concept. We have reasons to believe that although the outer learner is still learning initial values that can be shared across different tasks, the generalization ability of these initialized parameters can be applied to more complex and multi-space task distributions. On the other hand, the heterogeneity gap [53] makes it more challenging to align multiple modalities into the same feature space across tasks. Therefore, according to the distribution pattern of parameterized multimodal information in the task space and how it is optimized to be aligned in the inner learner, we divide the current research into two branches, within-task modality alignment and cross-task modality alignment.

For the within-task modality alignment, the model's input comes from the homogeneous task space with different modality characteristics. Each task adopts the same type and number of modality feature spaces, such as paired modalities. The inner learner does not need to pay attention to domain adaptation problems, but only needs to process the alignment of different modality subspaces [46], such as reducing the dimensionality of some modalities, training a hybrid neural network for the mixed modality spaces, and constructing missing modality subspaces aligned with the known modalities [45]. Finally, the meta-learning framework provides an iterative method of optimizing multiple parameters from multiple networks (i.e., inner network, outer network) together. Such a training framework still keeps the original structure of MAML or Reptile, in which the inner optimization is enriched by multiple modalities or other auxiliary training networks related to multimodal scenarios.

Multimodal tasks are usually sampled from a heterogeneous and more complex task distribution for the cross-task modality alignment. Each training task may contain a specific combination of modality subspaces. Most existing model-agnostic meta-learners assume that the tasks are evenly distributed, in which all tasks belong to the same domain and have the same manifold of task instances. However, such heterogeneous task distributions are often more challenging because instances with different task subspaces cannot share the same model structure completely. Therefore, related research examines the common and unique meta-parameters of different task spaces, exploring how to meta-align the knowledge of different subspaces to generalize to new tasks. The learned meta-knowledge ω* is supposed to incorporate task-aware parameters as part of the initialization, where the task-specific information for each type of modality subspace may share the same manifold to adapt to the model-agnostic meta-learners.

4.1.2. Methods

Within-Task Modality Alignment. The parametric nature of the MAML framework enables multimodal problems to be characterized as the internal optimization involved in meta-training, where the inner network can be extended to multiple networks. Ma et al. [45] focus on the implementation details by extending MAML to learn three networks in an integrated way: the main training network, a missing-modality reconstruction network, and a feature regularization network, within the Bayesian meta-learning based model SMIL. No modality is explicitly shared across different tasks during training, but the use of a feature reconstruction network leverages the auxiliary modality to generate an approximation of the missing-modality feature efficiently. The paper highlights that such a method is more efficient than traditional generative methods (e.g., GAN, AE, or VAE) because it does not require full-modality data. The usage of single-modality embeddings also guarantees that the approximation of the full modality is more flexible than in typical methods where feature spaces are regularized by perturbation. Despite the successful collaborative training between modalities, an open challenge for deploying such a Bayesian training framework in real-world scenarios is how to learn a set of proper modality priors for the unknown modalities.
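The following sketch illustrates the general pattern described above, where the inner loop jointly updates a main classifier and an auxiliary network that reconstructs a missing-modality feature from an available one. It is a simplified, hedged illustration of within-task modality alignment, not the actual SMIL algorithm; the encoders, the reconstruction loss, and the weighting factor alpha are assumptions.

    import torch
    import torch.nn.functional as F

    def inner_loss(classifier, reconstructor, img_enc, txt_enc, batch, alpha=0.5):
        """Inner-loop objective that tolerates a missing text modality.

        When the text input is absent, the auxiliary reconstructor predicts the
        text feature from the image feature; both losses are optimized together
        inside the MAML-style inner loop.
        """
        x_img, x_txt, y = batch            # x_txt may be None for missing-modality tasks
        z_img = img_enc(x_img)
        if x_txt is not None:
            z_txt = txt_enc(x_txt)
            recon = F.mse_loss(reconstructor(z_img), z_txt)   # learn to approximate text features
        else:
            z_txt = reconstructor(z_img)                      # fill in the missing modality
            recon = torch.zeros((), device=z_img.device)
        logits = classifier(torch.cat([z_img, z_txt], dim=-1))
        return F.cross_entropy(logits, y) + alpha * recon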
Moreover, the inner networks can be multimodal encoders for different features. Yao et al. [44] try to reduce the risk of unstable results and of knowledge transferred from a single source city. They propose the MetaST network, which learns in a meta-learning manner to represent the initialization of spatio-temporal correlation based on tasks from multiple source cities. The estimated spatio-temporal values are computed by the hybrid model ST-net, which combines a CNN and an LSTM to model each region's spatial dependency and temporal evolution. The parameters of the ST-net are updated in the inner optimization to give representations of different modalities. Meta-parameters such as time dependence, spatial proximity, and regional function can be easily adapted to the target city by fine-tuning and, finally, serve the relevant task predictions. MetaST overcomes the risk of transferring knowledge from a single source and adapts to spatio-temporal sequences in various scenarios rather than only discrete features. However, the type of shared knowledge is still in the state of a black box, which poses a challenge to broader scalability. Figure 4 compares the above parameterized inner learners that could enrich the training of the meta-learning framework.

Figure 4: Network structures for parameterized inner learners. Left [45]: SMIL extends MAML by training the main network, the reconstruction network φ_c, and the regularization network φ_r together. x_i denotes different modalities. φ_c and φ_r together denote the inner optimization parameters, while the main network is responsible for learning the meta-parameter ω*. Right [44]: ST-net is parameterized by θ, which encodes the spatio-temporal correlations. θ denotes the overall parameters from the CNN and LSTM where multimodal information is encoded. The initialization of ω from multiple source cities should be meta-learned.
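As a rough illustration of the kind of hybrid inner network described for ST-net, the sketch below encodes a spatial modality with a small CNN and a temporal modality with an LSTM, then fuses the two representations for a regional prediction. The layer sizes and the fusion-by-concatenation choice are illustrative assumptions, not the published architecture.

    import torch
    import torch.nn as nn

    class SpatioTemporalNet(nn.Module):
        """Minimal CNN + LSTM encoder for spatial maps and temporal sequences."""
        def __init__(self, in_channels=1, seq_features=8, hidden=64):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())                 # spatial dependency
            self.lstm = nn.LSTM(seq_features, hidden, batch_first=True)  # temporal evolution
            self.head = nn.Linear(16 + hidden, 1)                      # e.g., traffic volume of a region

        def forward(self, region_map, region_seq):
            z_spatial = self.cnn(region_map)                           # (B, 16)
            _, (h, _) = self.lstm(region_seq)                          # h: (1, B, hidden)
            z_temporal = h[-1]                                         # (B, hidden)
            return self.head(torch.cat([z_spatial, z_temporal], dim=-1))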
In addition to the application of conventional multimodal task sets, the meta-learning framework can also help improve the training performance of neural network models in other real-world settings, such as helping agents in reinforcement learning (RL) to better adapt to different scenarios and helping solve the catastrophic forgetting of previous tasks in continual learning.

Specifically, Yan et al. [54] adopt the MAML framework in the experiment of indoor navigation from the perspective of RL. Tasks are divided based on various room scenes to help train and test the model using the aggregated features of visual and audio modalities. Different features of visual observations are encoded to aggregate the word embedding from audio features and are then input to the memory network to carry out a sequence of action policies. The good initialization used for navigating different scenes is trained along with the inner
parameters that control the optimization of the agent's interaction loss. It is evidenced that the meta-learning framework makes the model more robust to unseen scenes, but further validation on a real programmable robot is still missing.

Verma et al. [55] introduce a new meta-continual scenario to handle unseen classes that are collected dynamically in a sequential stream. While the main architecture is based on pairing self-gating of attributes and scaled class normalization, the need for balancing even generalization across all tasks during reservoir sampling requires the model to be trained with Reptile in a meta-learning manner. Such an idea of avoiding expensive generative models has implications for applying multimodality-based meta-learning models in a real-world setting, where the one-time adaptation paradigm of streaming data possibly fails. Similar work can be inspired to focus on catastrophic forgetting in sequential data.

We can conclude that for the within-task modality alignment, the methods mainly focus on transforming modality issues into the design of auxiliary learners to enrich the training process. Flexible applications can be proposed as long as the involved multiple modalities can be parameterized and added to the gradient-descent updates.

Cross-Task Modality Alignment. The most common cross-task alignment is used to learn shared knowledge from different domain tasks. Modalities often appear in similar patterns in each task, but the semantic difference of different domains leads to the difficulty of transferring meta-knowledge across tasks. TGMZ [56] addresses the limitations of meta-ZSL models that are mostly optimized on the same data distribution without explicitly alleviating the representation bias and prediction mismatch posed by diverse task distributions. It uses an attribute-conditioned auto-encoder to align multiple task domains in a unified distribution with auxiliary textual modalities. Different from the study [50] which searches optimal parameters for each task, the meta-training relies on the task-specific loss function to compute gradients on support and query sets for obtaining the overall optimal parameters of the different modules: task encoder, task decoder, task discriminator, and task classifier. The performance of TGMZ over ZSL/GZSL has demonstrated the necessity of task alignment in alleviating discrimination in class distributions. This suggests a potentially robust way of training meta-ZSL models by speculating more about the disjoint task distribution.

In addition to the existence of multi-domain labels in image classification applications, other multimodal scenes also include task sources from different domains. MLMUG [59] focuses on the user domain generalization issue that exists in cross-modal retrieval. To allow the model to generalize from source user domains to unknown user domains without any update or fine-tuning, the model first constructs a cross-modal embedding network to learn a shared modality feature space for cross-modal matching, which generates a shared mask used to encode transferable knowledge between different user domains. Then a meta-learning framework is implemented to learn the meta covariant attention module. Unlike the original MAML, the gradients from the meta-training set and meta-test set finally get weighted aggregation. MLMUG has provided a potential training paradigm for applications that cannot obtain unlabelled data instances from unknown fields in the real world, which is possibly more efficient than unsupervised domain adaptation. For applications with large domain spaces, this method gives access to generalized parameter learning instead of expensive joint training.

Despite the success of task-aware models in multiple task domains, different tasks are still limited to the same modality patterns. Another, more generalized task distribution is one with different modality subspaces for different tasks. To align multiple subspaces, HetMAML [57] tries to extend the application of gradient-based meta-models from homogeneous task distributions to heterogeneous tasks, where tasks in the same query set have different numbers of modalities or input structures. Compared with training different types of tasks separately, HetMAML aims to train a unified meta-learner that can simultaneously capture the global meta-parameters about the concept domain shared by all tasks, as well as the customized parameters that characterize each task. A three-module shared
network architecture is proposed to achieve this goal, including multi-channel feature extractors, feature aggregation, and a head module for decision making. By adopting a BRNN structure, different modality-specific feature extractors are iteratively aggregated to enable all types of tasks to be mapped from heterogeneous feature spaces to a unified space. A better generalization from a few examples is reflected not only in the common meta-parameters shared between different types of tasks, but also in the type-specific meta-parameters for each task. HetMAML is a potentially powerful extension of the MAML framework, which not only retains the model-agnostic property of MAML, but also embeds task-aware knowledge into the outer-level update.
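The sketch below illustrates the aggregation idea described above: each available modality is encoded by its own extractor, the resulting features are treated as a variable-length sequence, and a bidirectional RNN pools them into a unified task representation for the decision head. It is a schematic reading of the three-module design, with a GRU, feature sizes, and mean pooling chosen here for illustration rather than taken from the HetMAML paper.

    import torch
    import torch.nn as nn

    class HeterogeneousTaskEncoder(nn.Module):
        """Maps a variable-length set of modality features to one unified embedding."""
        def __init__(self, modality_encoders, feat_dim=128, n_classes=5):
            super().__init__()
            self.encoders = nn.ModuleDict(modality_encoders)   # e.g., {"image": ..., "text": ...}
            self.aggregator = nn.GRU(feat_dim, feat_dim, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * feat_dim, n_classes)

        def forward(self, inputs):
            # inputs: dict of whatever modalities this particular task contains.
            feats = [self.encoders[name](x) for name, x in inputs.items()]  # each (B, feat_dim)
            seq = torch.stack(feats, dim=1)                                  # (B, n_modalities, feat_dim)
            out, _ = self.aggregator(seq)                                    # (B, n_modalities, 2*feat_dim)
            unified = out.mean(dim=1)                                        # pool over modalities
            return self.head(unified)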
Furthermore, CROMA [58] goes further to establish a more generalized meta-learning paradigm based on cross-modal alignment that is trained on different source modalities and then quickly generalizes to the target modality to perform new tasks. Specifically, they jointly train the cross-modal meta-alignment space and the source modality classifier in the meta-training process to perform generalization on alignment and classification tasks by Reptile. The meta-parameters learned in the training process are used as initial parameters for the meta-test process to classify tasks in the target modality. When the labeled data of the target modality is scarce, the joint space learned by the meta-alignment of the source modality and target modality helps transfer knowledge among tasks. Figure 5 compares the meta-alignment representations of heterogeneous task distributions and cross-modal task distributions.

Figure 5: Meta-alignment representations of more generalized task distributions. Left [57]: Heterogeneous task distributions. Each task has its unique subspace describing the commonly shared concept domain. Right [58]: Cross-modal task distributions. Source tasks and target tasks have different modalities to be aligned.

When generalizing gradient-based meta-models to the cross-task modality alignment, the extent of what and how much knowledge should be shared across modalities has been heavily emphasized throughout the existing research. Tasks from both different domains and the same domain concept are parameterized and updated in the outer learning process to adapt to unknown tasks quickly. The major challenges of transferring knowledge under this framework remain open, such as implementing the task-specific loss, the choice between modular and hierarchical network architectures, and the construction of the shared modality space.

4.2. Unified-Modal Optimizer

4.2.1. Overview

The meta-learning algorithm could learn the meta-optimizer by updating the inner learner at each iteration. The most typical one is the LSTM-based meta-learner [17], aiming to use the sequential features of the LSTM model to simulate the cell states during the update process. The updating process approximates the gradient descent method, where the information stored in the forget and input gates during training is improved. The parametrization of the optimizer allows new tasks to be learned from a series of previous tasks. When the multimodal training network is introduced, the meta-learner still keeps the idea of optimizing the optimizer to help update a better multimodal network, depending on the specific structure of the network.

When the initial value is meta-learned for multimodal tasks, the subspace of each modality needs to be learned together through an aligned common space. However, modality input in the real world often changes dynamically, and the adaptation of learning frameworks to different modalities under a sequential distribution is challenging. Recent work has employed the idea of human cognitive learning to address the adaptive training of multimodal models, where various modalities do not need to be aligned during training at a fine-grained level. Therefore, the LSTM-based meta-learner becomes an appropriate alternative to take advantage of previous experience in the sequence to avoid catastrophic forgetting.
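To ground the idea of "optimizing the optimizer", the following minimal sketch shows an LSTM-based update rule in the spirit of the learner described above: at each inner step, an LSTM consumes the current gradient (and loss) of each parameter coordinate and emits its update, so the update rule itself becomes trainable. Treating each coordinate independently and the chosen input features are simplifying assumptions.

    import torch
    import torch.nn as nn

    class LSTMOptimizer(nn.Module):
        """Learned per-coordinate update rule: Δθ = f_LSTM(gradient, loss)."""
        def __init__(self, hidden=20):
            super().__init__()
            self.lstm = nn.LSTMCell(2, hidden)   # inputs per coordinate: [gradient, loss]
            self.out = nn.Linear(hidden, 1)      # emits the parameter update

        def forward(self, grad, loss, state):
            # grad: (n_params, 1); the scalar loss is broadcast to every coordinate.
            inp = torch.cat([grad, loss.expand_as(grad)], dim=1)
            h, c = self.lstm(inp, state)
            return self.out(h), (h, c)           # Δθ and the new recurrent state

    def apply_learned_update(theta, grad, loss, optimizer_net, state):
        """One inner step of the base learner: θ ← θ + Δθ, with Δθ produced by the LSTM."""
        delta, state = optimizer_net(grad.view(-1, 1), loss.view(1, 1), state)
        return theta + delta.view_as(theta), state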
4.2.2. Methods

While most articles apply all the modalities to each task together, another way of thinking is to add the modalities sequentially, dynamically adjusting the performance of each task on the unified multimodal model. Ge and Xiaoyang [53] specifically propose sequential cross-modal learning (SCML) from a novel perspective in terms of learning the optimization of the unified multimodal model. Each modality is learned sequentially by its own feature extractor and then projected
onto the same embedding space by the unified multimodal model. They then extend the LSTM-based meta-learner [60] to effectively optimize the new unified multimodal model based on the old experience well-trained on previous tasks. It is evidenced that, due to the ability of the meta-learner, the updated unified model obtains a slight increase over the ordinary gradient descent method.

From the number of articles that use the LSTM-based meta-learner, we can infer that the application of the LSTM model to deal with dynamic sequential multimodal data is still limited. While mapping multiple modalities to the same embedding space simultaneously has been adopted as the common practice, sequential learning of multiple modalities by the LSTM-based meta-learner is undoubtedly promising.

5. Learn the Embedding

Metric learning approaches learn a non-linear embedding space, where intrinsic class memberships are decided by measuring the distances between points. The learned prior knowledge is often implemented as an embedding network or a projection function that transforms raw inputs into a representation suitable for similarity comparison [7] in a feed-forward manner [15]. Popular embedding networks include the Siamese Network [47], Triplet Network [48], Matching Networks [16], Prototypical Networks [15] and Relation Networks [49]. Other advanced approaches such as graph-based networks [61, 62] are also widely explored to model the relationship between samples. In addition, metrics such as Euclidean distance [63], cosine distance, contrastive loss, and triplet loss [64] are typically used to measure the similarities within pair or triplet samples. The flexible embedding network structures and metric computations provide convenience for introducing multimodality, which allows the feature extraction and interpretation to be realized in various ways.

5.1. Paired-Modal Networks

5.1.1. Overview

Pairwise networks are introduced to take paired examples and learn their shared feature space to discriminate between two classes. For example, the Siamese Network [47] employs two identical neural networks to extract embeddings from a pair of samples and computes a weighted metric to determine the similarity. The Triplet Network [48] extends the networks to three with shared parameters to output the comparison probability. Although the pairwise comparators limit the possibility of training end-to-end networks directly for few-shot problems, the learning framework is still valid to be applied to multimodal datasets. A common technique is to change the identical architectures that share parameters in the pairwise networks to different embedding networks for different modalities. New loss metrics are often proposed along with the networks to perform the matching task between inputs from different modalities.

5.1.2. Methods

As the output of paired networks, a shared feature space allows the use of different modalities under the same distance metric. Liu and Zhang [65] propose STUM to take advantage of the Siamese feature space formation process [66] to learn a shared feature space by different networks for different modalities. Instead of using triplet loss during training, it employs a simple summation of the contrastive loss function to handle the random number of streaming data representations coming from the time relation of inputs. Positive and negative samples group the inputs in one or more modalities that are adapted to non-identical networks for processing. STUM effectively captures the formation process of the feature space within and across the modalities in time-cued data. The experiment on visual modalities has shown that the model can potentially be used for the feature organization of objects represented by multiple modalities. The high performance achieved without supervision also implies a potential extension to fast online learning.

Triplet loss is also often used as a loss variant in combination with paired networks to enhance the robustness of the model. Eloff et al. [69] focus on learning a multimodal matching space from paired images and speech in the one-shot domain. They investigate a Siamese neural network built with the semi-hard triplet mining trick to alleviate the memory issue. Each spoken test query is matched to a visual set according to the embedding metric learned from training on the support set. However, they do not approach the matching problem in a true multimodal space where the mapping happens mutually between two modalities, therefore transforming it into a two-step indirect comparison in the visual embedding space.

To learn a multimodal space in one step, Nortje and Kamper [70] improve upon the above idea to learn a shared embedding space of spoken words and images from only a few paired examples by optimizing the combination of two triplet losses. The model they came up with is MTriplet, which overcomes the weakness of learning representations that rely on unimodal comparisons despite having a support set of speech-image pairs. After preprocessing the two modalities by the corresponding autoencoders, a multimodal triplet network is learned to map inputs of the same class to similar representations measured by a direct cross-modal distance metric.
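A minimal sketch of the direct cross-modal matching objective discussed above is given below: an image encoder and a speech encoder are trained so that an anchor in one modality lies closer to its paired example in the other modality than to a mismatched one. The symmetric combination of the two triplet terms and the margin value are illustrative choices, not the exact MTriplet formulation.

    import torch
    import torch.nn.functional as F

    def cross_modal_triplet_loss(img_enc, spk_enc, img, speech, neg_img, neg_speech, margin=0.5):
        """Symmetric cross-modal triplet objective over paired (image, speech) data."""
        zi = F.normalize(img_enc(img), dim=-1)
        zs = F.normalize(spk_enc(speech), dim=-1)
        zi_neg = F.normalize(img_enc(neg_img), dim=-1)
        zs_neg = F.normalize(spk_enc(neg_speech), dim=-1)

        def triplet(anchor, pos, neg):
            d_pos = (anchor - pos).pow(2).sum(-1)     # squared Euclidean distance to the true pair
            d_neg = (anchor - neg).pow(2).sum(-1)     # ... and to a mismatched sample
            return F.relu(d_pos - d_neg + margin).mean()

        # image anchored against speech, and speech anchored against images
        return triplet(zi, zs, zs_neg) + triplet(zs, zi, zi_neg)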
Even though paired-modal networks have demonstrated the potential for feature space formation, modality patterns extracted from real-world datasets remain an open challenge. When the number of modalities increases from two to many, it is unclear whether the network structure could achieve high performance under conditions such as missing modalities and scarce paired samples.

5.2. Joint-Modal Prototypes

5.2.1. Overview

The term "prototype" refers to the centroid of each class within the dataset. In Prototypical Networks [15], a support set is used to calculate the prototype for each class, and the query samples are classified according to the distance from each prototype. The calculation of the prototype p_c needs to rely on the
averaged embeddings of all the support samples in class c for each episode of training [6]:

p_c = \frac{1}{|S_e^c|} \sum_{(s_i, y_i) \in S_e^c} f_\theta(s_i),    (4)

where S_e^c ⊆ S_e is the subset of the support set belonging to class c, S_e = {(s_i, y_i)}_{i=1}^{N \times K}, and f_θ is the embedding network that needs to be learned. S_e contains K samples for each of the N classes. In general, after acquiring the distances of the query embedding to the embedded prototype p_c in each class, we can obtain the distribution of the query sample over all of the classes of the episode. The meta-objective of the model then becomes to minimize the expectation of the negative log-likelihood of the true class of each query sample [6]:

L(\theta) = \mathbb{E}_{(S_e, Q_e)} \left[ \sum_{t=1}^{J} -\log p_\theta(y_t \mid q_t, S_e) \right],    (5)

where the query set Q_e = {(q_j, y_j)}_{j=1}^{J} contains J samples, and (y_t, q_t) ∈ Q_e and S_e are the sampled query set and support set in each episode of training.
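A minimal sketch of one training episode implementing Eq. (4) and Eq. (5) is shown below: class prototypes are the mean support embeddings, and query samples are classified by softmax over negative Euclidean distances to the prototypes. The embedding network and the distance choice are placeholders consistent with the description above.

    import torch
    import torch.nn.functional as F

    def prototypical_episode_loss(f_theta, support_x, support_y, query_x, query_y, n_classes):
        """Negative log-likelihood over one episode of Prototypical Networks."""
        z_support = f_theta(support_x)                       # (N*K, d)
        z_query = f_theta(query_x)                           # (J, d)
        # Eq. (4): prototype = mean embedding of the support samples of each class.
        prototypes = torch.stack(
            [z_support[support_y == c].mean(dim=0) for c in range(n_classes)])
        # Distances of every query embedding to every prototype.
        dists = torch.cdist(z_query, prototypes)             # (J, N)
        # Eq. (5): softmax over negative distances, then NLL of the true class.
        return F.cross_entropy(-dists, query_y)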
When the prototype embeddings are computed from more than one modality, we observe that the representation of prototypes can be modified according to the way the different modalities are introduced. Depending on whether modalities are fused before being passed to the prototypical networks, we divide the representation of joint-modal prototypes into two categories: deterministic prototypes and shifted prototypes.

Deterministic prototypes use a unimodal vector extracted from the fusion of other models, which usually means that different modalities are trained by different encoders, followed by a connection layer that concatenates the outputs after training. The prototypical networks are expected to use the concatenated unimodal vector to perform the subsequent similarity calculation, so there is no obvious difference from the traditional prototypical networks. This method cares more about multimodal learning than about the meta-learning framework, such as how to remove correlations between modalities and fuse them in a lower-dimensional common subspace.

By contrast, the early fusion of modalities may add noise to the embeddings, since the information retained by different modalities is homogenized in this process. The idea of relying on learned embeddings and their distances to discriminate unseen classes in the prototypical networks has a similar problem structure to alignment, a key challenge in the field of multimodal learning. Alignment focuses on how to identify the direct relationship between elements in two or more different modalities [43]. Existing research usually maps different modalities to the same semantic representation space, and then computes the similarity as a direct relationship measurement. The phenomenon of multimodal representations in the same semantic space allows extending the prototypical networks.

Relevant research often applies unimodal task training, resulting in unimodal feature vectors of the prototypes. Accordingly, some studies have introduced multimodal feature vectors to shift the prototype representations. In that case, we define the method of shifted prototypes as the one that uses the various multimodal vectors directly without prior fusion and builds models upon the ensemble. Figure 6 shows the two types of joint-modal prototypes with two commonly used modalities, image and text, as an illustration.

Figure 6: Representation of joint-modal prototypes. Left: Deterministic prototypes. Modalities are fused through the fusion model and transformed into the joint representation. Each multimodal prototype p_m in each class is projected onto the multimodal embedding space. The query sample x is then compared to different multimodal prototypes to obtain the prediction. Right: Shifted prototypes. New prototypes are generated by conditioning on the original prototypes in one modality. Flexible choices on original averaging prototypes and generated prototypes are introduced. Taking one class p_1 as an example, AM3 [6] adaptively combines p_1^v and p_1^t. Episode-based PGN [67] modifies the objective function by integrating both of the generated prototypes p_1^{v'} and p_1^{t'}. MPN [68] averages the conditioned p_1^{v'} with the original p_1^v. The new prototype is supposed to be anywhere on the green line based on the value of the weighting factor.

5.2.2. Methods

Deterministic Prototypes. The computation of distances between deterministic prototypes in the joint space of different modality features follows the same approach as for traditional unimodal prototypes. The metric of the meta-learning algorithm does not require special modification. Wan et al. [71] propose to learn the prototype for each class of the social relation extracted from the combination of facial image features and text-based features. The unbalanced distribution of social relations in real-life multimodal datasets motivates them to propose prototypical networks trained on a random support set of tuples over different classes of social relations by using FSL techniques. A cross-modal encoder (illustrated as the Fusion Model in Figure 6, Left) then concatenates the normalized feature vectors of the two modalities, learned respectively from a pre-trained language model and FaceNet. Eventually, the
prototypical networks are applied to predict the social relations in the query set by directly averaging the concatenated cross-modal embeddings as the prototype for each class.
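The deterministic-prototype recipe just described can be sketched in a few lines: encode each modality separately, concatenate the normalized features into one joint vector, and average those joint vectors per class exactly as in Eq. (4). The encoders below are placeholders for, e.g., a pre-trained language model and a face or image encoder.

    import torch
    import torch.nn.functional as F

    def deterministic_prototypes(img_enc, txt_enc, support_img, support_txt, support_y, n_classes):
        """Class prototypes computed from concatenated (fused) multimodal embeddings."""
        z = torch.cat([F.normalize(img_enc(support_img), dim=-1),
                       F.normalize(txt_enc(support_txt), dim=-1)], dim=-1)  # joint vector per sample
        return torch.stack([z[support_y == c].mean(dim=0) for c in range(n_classes)])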
Deterministic prototypes are easy to obtain, which allows for accurately encoding different modalities before implementing the meta-learning framework. The early fusion is also accessible to more modalities if the data pattern changes in the future. However, the effectiveness has not been justified theoretically, as most research does not favor deterministic prototypes but tends to use shifted prototypes.

Shifted Prototypes. It is expected that leveraging an auxiliary modality provides the means to inject diversity into the generated sample space for novel classes during training [72, 73, 74]. Pahde et al. [68] aim to obtain more reliable prototypes and enrich the intrinsic feature sparsity in the few-shot training space. By training a generative model that maps text data into the pre-trained visual feature space, a new joint prototype p'_c for each novel class is redefined as the weighted average of both the original visual prototype p_c and the prototype computed from the generated visual feature vectors p_c^G conditioned on the text modality:

p'_c = \frac{p_c + \lambda \, p_c^G}{1 + \lambda},    (6)

p_c^G = \frac{1}{|S_e^c|} \sum_{(s_i, y_i) \in S_e^c} G_t(\Phi_T(t_i)),    (7)

the generated features, which are paired by two conditional variational autoencoders with different modality conditions. The adaptive learning approach originates from:

p'_c = \lambda \, p_c + (1 - \lambda) \, g(t_c),    (8)

where λ is the adaptive mixture coefficient and g(t_c) is the normalized version of the semantic embeddings lying in a space of the same dimension as the visual prototypes. The above papers have demonstrated that the adaptive method is flexible enough to project the possible modalities onto the same embedding space and provides a way to calculate their corresponding relationships. The adaptive method actually blurs the boundary between them by dictating the modality semantics through the weighted interaction to scaffold the importance of auxiliary modalities on the main modalities.
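The convex-combination pattern shared by Eq. (6) and Eq. (8) can be sketched directly: a visual prototype is shifted toward a semantic (text) embedding, with the mixture coefficient either fixed or predicted per class by a small network, in the spirit of AM3-style adaptive mixing. The gating network and its input are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveMixturePrototype(nn.Module):
        """Shift a visual prototype toward a semantic embedding: p' = λ p_v + (1 − λ) g(t)."""
        def __init__(self, sem_dim, proto_dim):
            super().__init__()
            self.g = nn.Linear(sem_dim, proto_dim)        # map semantics into the prototype space
            self.gate = nn.Sequential(nn.Linear(proto_dim, 1), nn.Sigmoid())  # predicts λ per class

        def forward(self, visual_proto, semantic_emb):
            g_t = F.normalize(self.g(semantic_emb), dim=-1)
            lam = self.gate(g_t)                          # adaptive mixture coefficient λ ∈ (0, 1)
            return lam * visual_proto + (1.0 - lam) * g_t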
In addition to modeling the shifted prototypes explicitly in two modalities and then synthesizing a new joint multimodal prototype, Mu et al. [77] also attempt to train an embedding network to implicitly add the influence of shifted prototypes to the objective function. They propose language-shaped learning, which encourages the original embedding function learned by the visual prototypes to decode the class language descriptions. The objective function L_M(θ) of the prototypical training is optimized by jointly minimizing the classification loss and the language model loss conditioned on the averaged prototypes.
the space that is related to the best manifolds, facilitating the classification performance of matching networks. However, generic generative methods mostly transform the FSL into standard supervised learning after generating data distributions or enough samples. They usually start with base classes and then migrate the learned data pattern to generate new samples of the novel classes. One major weakness is that the generative model is trained separately from the classifier, so no explicit prior knowledge can be quickly adapted to novel classes.

The recently emerging meta-learning framework has spawned relevant work on meta-based generative methods to learn an optimal augmentation strategy that can be trained within a few steps given unseen classes. The main idea is to wrap the data augmentation strategy into the optimization steps of the inner meta-learning process [7], with the augmentation parameterized and learned by the outer optimization in the episode training of the meta-representation. MetaGAN [92] explores the impact of generating fake data on the decision boundary and provides ways to integrate the model with MAML or Relation Networks. Wang et al. [81] rely on a simple multi-layer perceptron as a hallucinator learned on the base classes to synthesize new data for unseen classes. However, the generalization ability could still be limited due to the lack of additional modalities that are helpful for the model to identify highly discriminative features across modalities [30] in the low-data regime. Cross-modal data is considered an input that can effectively assist in data augmentation. By mining the relationship between the two modalities, the modality with insufficient information can be enriched.

6.1.2. Methods

Pahde et al. [30] extend the model from Wang et al. [81] in a multimodal and progressive manner by hallucinating images conditioned on the fine-grained textual modality within the few-shot scenario. They propose a text-conditional GAN (tcGAN) to learn the mapping from the textual space to the visual space by a self-paced generative model [74], which ensures that only the top-ranked generated image samples are selected and aggregated with the original novel samples. As shown in Figure 7 (Left), the hallucinator is parameterized as G(x_i^{text}, z_i; w_G) and the discriminator is parameterized as D(x_i^{image}; w_D), both connected to the inner classifier h(x, S_{train}^{aug}; w_C) through the generated S_{train}^{aug}. The parameters w_G and w_D from the generative network and w_C from the few-shot classifier are updated directly by joint back-propagation in the meta-training stage to augment data that are useful for discriminating classes.
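The joint update described above can be sketched as a single training step in which a text-conditioned generator hallucinates extra visual features, a discriminator scores them, and a few-shot classifier is trained on the augmented support set, with one backward pass touching all three parameter sets. This is a heavily simplified, hedged illustration of the pattern, not the tcGAN training procedure; feature-level generation, the loss combination, and the omission of the usual alternating discriminator update are assumptions made for brevity.

    import torch
    import torch.nn.functional as F

    def joint_augmentation_step(G, D, classifier, opt, support_feats, support_txt, support_y, z_dim=64):
        """One simplified joint step: hallucinate features from text, train on the augmented set."""
        z = torch.randn(support_txt.size(0), z_dim)
        fake_feats = G(support_txt, z)                              # text-conditioned hallucination
        g_adv = F.binary_cross_entropy_with_logits(
            D(fake_feats), torch.ones(fake_feats.size(0), 1))       # generator tries to fool D
        aug_feats = torch.cat([support_feats, fake_feats], dim=0)   # augmented support set
        aug_y = torch.cat([support_y, support_y], dim=0)
        cls = F.cross_entropy(classifier(aug_feats), aug_y)
        opt.zero_grad()
        (cls + g_adv).backward()                                    # gradients reach G, D and the classifier
        opt.step()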
The generative model requires optimized generator and discriminator parameters, which are naturally suitable for the inner optimization of the MAML training framework. Verma et al. [50] identify the key challenge of using generative models to train ZSL/GZSL by leveraging auxiliary modalities. They propose a meta-based generative framework that integrates MAML with a generative model conditioned on class attributes for unseen sample generation. The key difference from standard meta-learning is to mimic the ZSL behavior by setting disjoint meta-train and meta-validation partitions for each task. The meta-learning protocol modifies the standard adversarial generation process to provide an efficient discriminator and generator by enhancing their learning capability.

After the data augmentation strategy is parameterized as the inner optimizer to participate in the training, it still relies on the optimization-based meta-learning framework, except that the inner learner no longer pays attention to the learning of multimodal relations. More diversified data generators beyond conditional GANs are expected to be introduced in the future and combined with metric-based meta-learners.

7. Summary of Methods

This section provides a summary of recent works on multimodality-based meta-learning methods. The attributes of each method are summarized in Table 2. Specifically, we highlight the following attributes and discuss some common applications.
Table 2: Summary of multimodality-based meta-learning methods. "V", "T", "S", "Te" and "A" indicate visual, textual, spatial, temporal and audio modalities, respectively. "M" and "F" indicate missing and full modalities, respectively.

• Reference: It denotes the individual papers that propose the method. Table 3 lists the open-source libraries for each reference if available.

• Type: It indicates the type of meta-knowledge learned by the method, including learning the optimization, learning the embedding, and learning the generation.

• Modality: It summarizes the modalities and their patterns employed in the method. Missing modalities apply to those with partial data missing, and full modalities mean that the involved modalities are not missing in the datasets. Most studies address the two most common ones, visual and textual modalities, to analyze image-related applications.

• Dataset: It introduces the datasets that are used to evaluate the method. Many works employ CUB-200, AWA1, and AWA2 as the major datasets to derive multimodal information in the application of image classification. Table 4 lists the open-source datasets used in the papers if available.

• Application: It gives the application fields of the method, most of which are summarized based on the evaluation experiments of the paper. A large number of studies address image classification by using an auxiliary textual modality. Other important areas are classification-based (e.g., sentiment, speech), prediction-based (e.g., traffic volume, water quality), and cross-modal retrieval/matching.

Furthermore, we summarize the advantages and disadvantages of each method in Table 5. The advantage that commonly appears in papers involves the flexible architecture designed for auxiliary modalities or specific tasks. The integration of feature spaces has been widely explored, though requirements for transferring models to multiple modalities or more generalized task distributions still remain.

Table 3: List of open-source libraries for multimodality-based meta-learning methods.

Reference                  Link
Ma et al. [45]             https://github.com/mengmenm/SMIL
Yao et al. [44]            https://github.com/huaxiuyao/MetaST
Liang et al. [58]          https://github.com/peter-yh-wu/xmodal
Eloff et al. [69]          https://github.com/rpeloff/multimodal_one_shot_learning
Nortje and Kamper [70]     https://github.com/LeanneNortje/DirectMultimodalFew-ShotLearning
Wan et al. [71]            https://github.com/sysulic/FL-MSRE
Xing et al. [6]            https://github.com/ElementAI/am3
Mu et al. [77]             https://github.com/jayelm/lsl
Yu et al. [67]             https://github.com/yunlongyu/EPGN
Sung et al. [49]           https://github.com/floodsung/LearningToCompare_FSL
Verma et al. [50]          https://github.com/vkverma01/meta-gzsl

7.1. Common Applications

Image classification: Image classification is one of the most common applications in the field of supervised meta-learning, where predictions are made on unknown images based on a model trained on a set of tasks. With the addition of multiple modalities in FSL/ZSL, the model can better utilize textual or audio information to understand the visual modalities, such as through alignment, matching, and fusion.

Multi-genre classification: Multi-genre classification is a specific type of multi-label classification, which assumes that each sample can be assigned to more than one category. Genres are often accompanied by multiple modalities such as textual attributes, meta-attributes, movie images, etc. The introduction of meta-learning helps to better identify the effects of different modalities on different categories.

Cross-modal retrieval: Cross-modal retrieval is concerned with mutual retrieval across modalities and focuses on getting semantically similar instances in another modality (e.g., text) by searching in one modality (e.g., image). It is often necessary to find a common representation space to compare samples
Table 4: List of open-source datasets used for experimental evaluations of multimodality-based meta-learning methods.

8. Discussions and Conclusions
marize the perfect prototype fusion method. Although different methods have proved that their performance exceeds that of unimodal prototype applications, a striking conclusion we can draw is that there is no exact solution that uniformly solves the problem of modifying multimodal prototypes in different applications. For example, with the adaptive modality mixture mechanism (AM3), many researchers [6, 75, 76] use the same prototypical networks and modality types, but the results are very different. A key aspect that is often overlooked is that the discriminative features that different modalities can provide differ greatly across scenarios. Moreover, it also depends on the dataset itself. Therefore, determining which modality is more important is not a simple procedure. The benefits that multimodal prototypes can provide are not only to expand the possibility of more modality combinations in the embedding space, but may also be affected by the previous encoding models of the multimodal data. However, these factors hardly appear in the current works.

Researchers who have achieved good performance usually distinguish themselves in aspects other than the structure of meta-learning algorithms, such as new data distributions. This philosophy is embodied in learning data generation. Although Pahde et al. [74] and Verma et al. [50] have verified the improved performance brought by data augmentation strategies, there are no further discussions about whether the design of a specific meta-learning architecture containing multimodal tasks can be combined with data augmentation methods to obtain better results. Furthermore, we have noticed the potential of incorporating the meta-based ZSL/GZSL methods into our consideration. These methods divide training classes into support sets and query sets, enabling the augmentation process to learn a robust mapping when there are only auxiliary modalities or few samples. The prejudice against the seen classes during training is thus alleviated.

In conclusion, the academic community has paid much attention to specific applications of multimodality-based meta-learning models. New frameworks are applied to multimodal datasets and compared with the baselines to illustrate their advantages. However, theoretical analyses of the mathematical assumptions, generalized performance, complexity, and convergence of models remain blank.

9. Future Directions

This section provides some insights into future directions based on the current challenges and trends. We believe that the summary provided in this survey can be used as a pivot to help future research progress in specific directions, whether from a modality perspective or a meta-learning algorithm perspective.

Granularity of modalities. The granularity of modalities has not been studied systematically for FSL/ZSL. Previous studies have already differentiated on the mining of semantic granularity. In some studies, only simple word embeddings or a set of several prescribed attributes [6, 68, 78] were used to encode the semantics. Nevertheless, in other studies, richer semantics could be employed as well, such as category labels and richer 'description level' semantic sentences or attributes [75]. Image descriptions with fine-grained text have become a breakthrough point for those studies, which also implies that future research can explore the impact of different modalities, including the granularity's size, on the performance of meta-learning methods.

Missing modalities. As a common data pattern, missing data is more likely to appear in multimodal scenarios in a way that is paired or separated. Considering the inherent task segmentation
in the meta-learning framework, more severe missing modali- modal tasks. We infer that this may be limited by factors such as
ties could also appear in support sets and query sets at the same the complex process of parameterizing multimodalities and net-
time. Existing work focuses on the direct application of the work structures. Future work can try to use the meta-learning
meta-learning framework, paying more attention to parameter models to embed the multimodal dataset into the activation state
training or modality fusion, but ignoring the fact that the lack and predict the test data based on this state.
of modalities in multiple scenes could cause unexpected effects.
Although several studies [45, 76] have tried to integrate auxil-
iary networks to deal with the missing modalities in the whole References
training, a standardized framework is still to be proposed.
[1] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M. S. Lew, Deep learn-
Multiple modalities. Current research, whether in terms ing for visual understanding: A review, Neurocomputing 187 (2016)
of methodology or experimental objectives, mostly limits the 27–48. URL: https://doi.org/10.1016/j.neucom.2015.09.116.
modalities to visual and textual ones, or dual-modality that ap- doi:10.1016/j.neucom.2015.09.116.
pears in pairs. Therefore it is still difficult to apply these meth- [2] A. Shrestha, A. Mahmood, Review of deep learning algorithms and ar-
chitectures, IEEE Access 7 (2019) 53040–53065. URL: https://doi.
ods to multiple modalities other than two. On the one hand, org/10.1109/ACCESS.2019.2912200. doi:10.1109/ACCESS.2019.
well-aligned multimodal datasets often need to be manually 2912200.
collected and artificially processed to be aligned with multiple [3] T. Young, D. Hazarika, S. Poria, E. Cambria, Recent trends in deep learn-
ing based natural language processing [review article], IEEE Comput. In-
modalities. On the other hand, the burden of input parame- tell. Mag. 13 (2018) 55–75. URL: https://doi.org/10.1109/MCI.
ter fusion or joint training will also increase as the modalities 2018.2840738. doi:10.1109/MCI.2018.2840738.
increase. Future research needs to discover ways to construct [4] Z. Song, X. Yang, Z. Xu, I. King, Graph-based semi-supervised learning:
good benchmark multimodal datasets, instead of only relying A comprehensive review, IEEE Transactions on Neural Networks and
Learning Systems (2022) 1–21. doi:10.1109/TNNLS.2022.3155478.
on the datasets that are originally designed for image classi- [5] S. Thrun, L. Y. Pratt, Learning to learn: Introduction and overview, in:
fication. In addition, true multimodal research is supposed to S. Thrun, L. Y. Pratt (Eds.), Learning to Learn, Springer, 1998, pp. 3–17.
be accompanied by more diverse modality distributions, espe- URL: https://doi.org/10.1007/978-1-4615-5529-2_1. doi:10.
cially in scenarios such as single-type tasks, cross-tasks, and 1007/978-1-4615-5529-2\_1.
[6] C. Xing, N. Rostamzadeh, B. N. Oreshkin, P. O. Pinheiro, Adaptive
heterogeneous tasks. It presents challenges to the combination cross-modal few-shot learning, in: H. M. Wallach, H. Larochelle,
of modalities under different tasks and the meta-learning frame- A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, R. Garnett (Eds.), Advances
work. in Neural Information Processing Systems 32: Annual Conference
More than data augmentation. The literature on learning on Neural Information Processing Systems 2019, NeurIPS 2019, De-
cember 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 4848–4858.
the data is currently limited, focusing on data augmentation. URL: https://proceedings.neurips.cc/paper/2019/hash/
Future work needs to include more data objectives that can be d790c9e6c0b5e02c87b375e782ac01bc-Abstract.html.
learned, not only use augmentation methods to learn existing [7] T. M. Hospedales, A. Antoniou, P. Micaelli, A. J. Storkey, Meta-learning
data patterns, but also learn generalized data patterns from sim- in neural networks: A survey, CoRR abs/2004.05439 (2020). URL:
https://arxiv.org/abs/2004.05439. arXiv:2004.05439.
ilar datasets. Apart from the data objectives, whether some data [8] I. Khan, X. Zhang, M. Rehman, R. Ali, A literature survey and em-
learning algorithms such as the GAN-based method [90] can be pirical study of meta-learning for classifier selection, IEEE Access
used for inner learning and optimization in outer learning is still 8 (2020) 10262–10281. URL: https://doi.org/10.1109/ACCESS.
2020.2964726. doi:10.1109/ACCESS.2020.2964726.
an open challenge for the meta-learning framework. [9] R. Vilalta, Y. Drissi, A perspective view and survey of meta-learning,
More than classification. Most studies emphasize the use of multimodal data to improve image classification, treating the visual modality as the main one and text or audio as commonly used auxiliary modalities. Viewed narrowly, this branch of methods usually belongs to semantic-based meta-learning, which uses the semantic features of text to enhance a few-shot classifier through a convex combination of the semantic and visual modalities (see the sketch below). A possible future direction is to use the visual modality as the auxiliary one and perform classification tasks on other modalities, such as text, video, and speech. Such applications place high demands on the model's ability to refine image semantics, since the distinctive semantics extracted from the visual modality must be matched to the semantic space of the other modalities.
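To make the mechanism concrete, the following sketch (our own illustrative code, loosely in the spirit of adaptive cross-modal prototypes [6]; the feature dimensions and random inputs are assumptions) forms each class representation as a convex combination of a visual prototype and a semantic embedding, with the mixing coefficient predicted from the semantic embedding.

# Minimal sketch of a convex combination of visual and semantic class
# representations for few-shot classification; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexFusionPrototypes(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # lambda_c in (0, 1) is predicted per class from its semantic embedding
        self.mixing = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, visual_proto, semantic_emb):
        lam = self.mixing(semantic_emb)                         # (n_way, 1)
        return lam * visual_proto + (1.0 - lam) * semantic_emb  # convex combination

def classify(query_feats, prototypes):
    # nearest-prototype classification via negative squared Euclidean distance
    dists = torch.cdist(query_feats, prototypes) ** 2
    return F.log_softmax(-dists, dim=1)

# Toy usage: a 5-way episode with pre-extracted visual features and
# semantic (e.g. label/word) embeddings projected to the same dimension.
n_way, k_shot, dim = 5, 5, 64
support = torch.randn(n_way, k_shot, dim)
visual_proto = support.mean(dim=1)          # class means, as in prototypical networks [15]
semantic_emb = torch.randn(n_way, dim)
fusion = ConvexFusionPrototypes(dim)
log_probs = classify(torch.randn(8, dim), fusion(visual_proto, semantic_emb))
print(log_probs.argmax(dim=1))

Reversing the roles, i.e., predicting the coefficient from visual features and classifying text, video, or speech queries, would be one way to realize the future direction described above.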
Extend to model-based methods. Previous taxonomies of meta-learning methods often include model-based methods, which synthesize models in a feed-forward manner and learn the meta-knowledge of a single model directly. To the best of our knowledge, there is currently no explicit work that learns such a single model that can be generalized between multi-
doi:10.1016/j.neucom.2015.09.116.
[2] A. Shrestha, A. Mahmood, Review of deep learning algorithms and architectures, IEEE Access 7 (2019) 53040–53065. doi:10.1109/ACCESS.2019.2912200.
[3] T. Young, D. Hazarika, S. Poria, E. Cambria, Recent trends in deep learning based natural language processing [review article], IEEE Comput. Intell. Mag. 13 (2018) 55–75. doi:10.1109/MCI.2018.2840738.
[4] Z. Song, X. Yang, Z. Xu, I. King, Graph-based semi-supervised learning: A comprehensive review, IEEE Trans. Neural Networks Learn. Syst. (2022) 1–21. doi:10.1109/TNNLS.2022.3155478.
[5] S. Thrun, L. Y. Pratt, Learning to learn: Introduction and overview, in: S. Thrun, L. Y. Pratt (Eds.), Learning to Learn, Springer, 1998, pp. 3–17. doi:10.1007/978-1-4615-5529-2_1.
[6] C. Xing, N. Rostamzadeh, B. N. Oreshkin, P. O. Pinheiro, Adaptive cross-modal few-shot learning, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 4848–4858.
[7] T. M. Hospedales, A. Antoniou, P. Micaelli, A. J. Storkey, Meta-learning in neural networks: A survey, CoRR abs/2004.05439 (2020). arXiv:2004.05439.
[8] I. Khan, X. Zhang, M. Rehman, R. Ali, A literature survey and empirical study of meta-learning for classifier selection, IEEE Access 8 (2020) 10262–10281. doi:10.1109/ACCESS.2020.2964726.
[9] R. Vilalta, Y. Drissi, A perspective view and survey of meta-learning, Artif. Intell. Rev. 18 (2002) 77–95. doi:10.1023/A:1019956318069.
[10] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, R. Hadsell, Meta-learning with latent embedding optimization, in: 7th International Conference on Learning Representations (ICLR 2019), OpenReview.net, 2019. URL: https://openreview.net/forum?id=BJgklhAcK7.
[11] H. Sikka, Multimodal modular meta-learning (2020).
[12] R. Vuorio, S. Sun, H. Hu, J. J. Lim, Multimodal model-agnostic meta-learning via task-aware modulation, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 1–12.
[13] Y. Li, I. King, Autograph: Automated graph neural network, in: Neural Information Processing (ICONIP 2020), Part II, volume 12533 of Lecture Notes in Computer Science, Springer, 2020, pp. 189–201. doi:10.1007/978-3-030-63833-7_16.
[14] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 1126–1135. URL: http://proceedings.mlr.press/v70/finn17a.html.
[15] J. Snell, K. Swersky, R. S. Zemel, Prototypical networks for few-shot learning, in: Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017, pp. 4077–4087.
[16] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, D. Wierstra, Matching networks for one shot learning, in: Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 2016, pp. 3630–3638.
[17] S. Ravi, H. Larochelle, Optimization as a model for few-shot learning, in: 5th International Conference on Learning Representations (ICLR 2017), OpenReview.net, 2017. URL: https://openreview.net/forum?id=rJY0-Kcll.
[18] J. Schmidhuber, Evolutionary Principles in Self-Referential Learning. On Learning now to Learn: The Meta-Meta-Meta...-Hook, Diploma thesis, Technische Universitat Munchen, Germany, 1987. URL: http://www.idsia.ch/~juergen/diploma.html.
[19] Y. Bengio, S. Bengio, J. Cloutier, Learning a synaptic learning rule, Citeseer, 1990.
[20] S. Hochreiter, A. S. Younger, P. R. Conwell, Learning to learn using gradient descent, in: Artificial Neural Networks - ICANN 2001, volume 2130 of Lecture Notes in Computer Science, Springer, 2001, pp. 87–94. doi:10.1007/3-540-44668-0_13.
[21] A. S. Younger, S. Hochreiter, P. R. Conwell, Meta-learning with backpropagation, in: IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222), volume 3, IEEE, 2001.
[22] S. Guiroy, V. Verma, C. J. Pal, Towards understanding generalization in gradient-based meta-learning, CoRR abs/1907.07287 (2019). arXiv:1907.07287.
[23] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T. P. Lillicrap, Meta-learning with memory-augmented neural networks, in: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), volume 48 of JMLR Workshop and Conference Proceedings, JMLR.org, 2016, pp. 1842–1850. URL: http://proceedings.mlr.press/v48/santoro16.html.
[24] S. Khodadadeh, L. Bölöni, M. Shah, Unsupervised meta-learning for few-shot image classification, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 10132–10142.
[25] Y. Li, I. King, Architecture search for image inpainting, in: Advances in Neural Networks - ISNN 2019, Part I, volume 11554 of Lecture Notes in Computer Science, Springer, 2019, pp. 106–115. doi:10.1007/978-3-030-22796-8_12.
[26] B. P. Yuhas, M. H. G. Jr., T. J. Sejnowski, Integration of acoustic and visual speech signals using neural networks, IEEE Commun. Mag. 27 (1989) 65–71. doi:10.1109/35.41402.
[27] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Omnipress, 2011, pp. 689–696. URL: https://icml.cc/2011/papers/399_icmlpaper.pdf.
[28] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Association for Computational Linguistics, 2016, pp. 457–468. doi:10.18653/v1/d16-1044.
[29] J. Yu, J. Li, Z. Yu, Q. Huang, Multimodal transformer with multi-view visual representation for image captioning, IEEE Trans. Circuits Syst. Video Technol. 30 (2020) 4467–4480. doi:10.1109/TCSVT.2019.2947482.
[30] F. Pahde, P. Jähnichen, T. Klein, M. Nabi, Cross-modal hallucination for few-shot fine-grained recognition, CoRR abs/1806.05147 (2018). arXiv:1806.05147.
[31] N. Bhatt, A. Thakkar, A. Ganatra, A survey and current research challenges in meta learning approaches based on dataset characteristics, International Journal of Soft Computing and Engineering 2 (2012) 234–247.
[32] C. Lemke, M. Budka, B. Gabrys, Metalearning: a survey of trends and technologies, Artif. Intell. Rev. 44 (2015) 117–130. doi:10.1007/s10462-013-9406-y.
[33] J. Vanschoren, Meta-learning: A survey, CoRR abs/1810.03548 (2018). arXiv:1810.03548.
[34] Y. Wang, Q. Yao, J. T. Kwok, L. M. Ni, Generalizing from a few examples: A survey on few-shot learning, ACM Comput. Surv. 53 (2020) 63:1–63:34. doi:10.1145/3386252.
[35] H. Peng, A comprehensive overview and survey of recent advances in meta-learning, CoRR abs/2004.11149 (2020). arXiv:2004.11149.
[36] W. Yin, Meta-learning for few-shot natural language processing: A survey, CoRR abs/2007.09604 (2020). arXiv:2007.09604.
[37] M. Huisman, J. N. van Rijn, A. Plaat, A survey of deep meta-learning, Artif. Intell. Rev. 54 (2021) 4483–4541. doi:10.1007/s10462-021-10004-4.
[38] A. Doke, M. Gaikwad, Survey on automated machine learning (automl) and meta learning, in: 12th International Conference on Computing Communication and Networking Technologies (ICCCNT 2021), IEEE, 2021, pp. 1–5. doi:10.1109/ICCCNT51525.2021.9579526.
[39] D. Ramachandram, G. W. Taylor, Deep multimodal learning: A survey on recent advances and trends, IEEE Signal Processing Magazine 34 (2017) 96–108.
[40] D. Lahat, T. Adali, C. Jutten, Multimodal data fusion: an overview of methods, challenges, and prospects, Proceedings of the IEEE 103 (2015) 1449–1477.
[41] C. Jewitt, J. Bezemer, K. O'Halloran, Introducing multimodality, Routledge, 2016.
[42] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, T. Lee, E. David, I. Stavness, W. Guo, B. Earnshaw, I. Haque, S. M. Beery, J. Leskovec, A. Kundaje, E. Pierson, S. Levine, C. Finn, P. Liang, WILDS: A benchmark of in-the-wild distribution shifts, in: Proceedings of the 38th International Conference on Machine Learning (ICML 2021), volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 5637–5664. URL: http://proceedings.mlr.press/v139/koh21a.html.
[43] T. Baltrusaitis, C. Ahuja, L. Morency, Multimodal machine learning: A survey and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2019) 423–443. doi:10.1109/TPAMI.2018.2798607.
[44] H. Yao, Y. Liu, Y. Wei, X. Tang, Z. Li, Learning from multiple cities: A meta-learning approach for spatial-temporal prediction, in: The World Wide Web Conference (WWW 2019), ACM, 2019, pp. 2181–2191. doi:10.1145/3308558.3313577.
[45] M. Ma, J. Ren, L. Zhao, S. Tulyakov, C. Wu, X. Peng, SMIL: Multimodal learning with severely missing modality, in: Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021), AAAI Press, 2021, pp. 2302–2310. URL: https://ojs.aaai.org/index.php/AAAI/article/view/16330.
[46] J. Chen, A. Zhang, HetMAML: Task-heterogeneous model-agnostic meta-learning for few-shot learning across modalities, CoRR abs/2105.07889 (2021). arXiv:2105.07889.
[47] G. Koch, R. Zemel, R. Salakhutdinov, et al., Siamese neural networks for one-shot image recognition, in: ICML Deep Learning Workshop, volume 2, Lille, 2015.
[48] E. Hoffer, N. Ailon, Deep metric learning using triplet network, in: 3rd International Conference on Learning Representations (ICLR 2015), Workshop Track Proceedings, 2015. arXiv:1412.6622.
[49] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, T. M. Hospedales, Learning to compare: Relation network for few-shot learning, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), IEEE Computer Society, 2018, pp. 1199–1208. doi:10.1109/CVPR.2018.00131.
[50] V. K. Verma, D. Brahma, P. Rai, Meta-learning for generalized zero-shot learning, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), AAAI Press, 2020, pp. 6062–6069. URL: https://aaai.org/ojs/index.php/AAAI/article/view/6069.
[51] A. Nichol, J. Achiam, J. Schulman, On first-order meta-learning algorithms, CoRR abs/1803.02999 (2018). arXiv:1803.02999.
[52] E. Grefenstette, B. Amos, D. Yarats, P. M. Htut, A. Molchanov, F. Meier, D. Kiela, K. Cho, S. Chintala, Generalized inner loop meta-learning, arXiv preprint arXiv:1910.01727 (2019).
[53] G. Song, X. Tan, Sequential learning for cross-modal retrieval, in: 2019 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2019), IEEE, 2019, pp. 4531–4539. doi:10.1109/ICCVW.2019.00554.
[54] L. Yan, D. Liu, Y. Song, C. Yu, Multimodal aggregation approach for memory vision-voice indoor navigation with meta-learning, in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020), IEEE, 2020, pp. 5847–5854. doi:10.1109/IROS45743.2020.9341398.
[55] V. K. Verma, K. J. Liang, N. Mehta, L. Carin, Meta-learned attribute self-gating for continual generalized zero-shot learning, CoRR abs/2102.11856 (2021). arXiv:2102.11856.
[56] Z. Liu, Y. Li, L. Yao, X. Wang, G. Long, Task aligned generative meta-learning for zero-shot learning, in: Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021), AAAI Press, 2021, pp. 8723–8731. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17057.
[57] J. Chen, A. Zhang, HetMAML: Task-heterogeneous model-agnostic meta-learning for few-shot learning across modalities, in: CIKM '21: The 30th ACM International Conference on Information and Knowledge Management, ACM, 2021, pp. 191–200. doi:10.1145/3459637.3482262.
[58] P. P. Liang, P. Wu, Z. Liu, L. Morency, R. Salakhutdinov, Cross-modal generalization: Learning in low resource modalities via meta-alignment, CoRR abs/2012.02813 (2020). arXiv:2012.02813.
[59] X. Ma, X. Yang, J. Gao, C. Xu, The model may fit you: User-generalized cross-modal retrieval, IEEE Transactions on Multimedia (2021).
[60] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, N. de Freitas, Learning to learn by gradient descent by gradient descent, in: Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 2016, pp. 3981–3989.
[61] J. Kim, T. Kim, S. Kim, C. D. Yoo, Edge-labeling graph neural network for few-shot learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Computer Vision Foundation / IEEE, 2019, pp. 11–20. doi:10.1109/CVPR.2019.00010.
[62] V. G. Satorras, J. B. Estrach, Few-shot learning with graph neural networks, in: 6th International Conference on Learning Representations (ICLR 2018), OpenReview.net, 2018. URL: https://openreview.net/forum?id=BJj6qGbRW.
[63] Q. Luo, L. Wang, J. Lv, S. Xiang, C. Pan, Few-shot learning via feature hallucination with variational inference, in: IEEE Winter Conference on Applications of Computer Vision (WACV 2021), IEEE, 2021, pp. 3962–3971. doi:10.1109/WACV48630.2021.00401.
[64] J. Zhang, C. Zhao, B. Ni, M. Xu, X. Yang, Variational few-shot learning, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), IEEE, 2019, pp. 1685–1694. doi:10.1109/ICCV.2019.00177.
[65] Q. Liu, Y. Zhang, Using sensory time-cue to enable unsupervised multimodal meta-learning, CoRR abs/2009.07879 (2020). arXiv:2009.07879.
[66] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), IEEE Computer Society, 2006, pp. 1735–1742. doi:10.1109/CVPR.2006.100.
[67] Y. Yu, Z. Ji, J. Han, Z. Zhang, Episode-based prototype generating network for zero-shot learning, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), IEEE, 2020, pp. 14032–14041. doi:10.1109/CVPR42600.2020.01405.
[68] F. Pahde, M. M. Puscas, T. Klein, M. Nabi, Multimodal prototypical networks for few-shot learning, in: IEEE Winter Conference on Applications of Computer Vision (WACV 2021), IEEE, 2021, pp. 2643–2652. doi:10.1109/WACV48630.2021.00269.
[69] R. Eloff, H. A. Engelbrecht, H. Kamper, Multimodal one-shot learning of speech and images, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), IEEE, 2019, pp. 8623–8627. doi:10.1109/ICASSP.2019.8683587.
[70] L. Nortje, H. Kamper, Direct multimodal few-shot learning of speech and images, CoRR abs/2012.05680 (2020). arXiv:2012.05680.
[71] H. Wan, M. Zhang, J. Du, Z. Huang, Y. Yang, J. Z. Pan, FL-MSRE: A few-shot learning based approach to multimodal social relation extraction, in: Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021), AAAI Press, 2021, pp. 13916–13923. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17639.
[72] Y. Xian, T. Lorenz, B. Schiele, Z. Akata, Feature generating networks for zero-shot learning, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), IEEE Computer Society, 2018, pp. 5542–5551. doi:10.1109/CVPR.2018.00581.
[73] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, A. Elgammal, A generative adversarial approach for zero-shot learning from noisy texts, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), IEEE Computer Society, 2018, pp. 1004–1013. doi:10.1109/CVPR.2018.00111.
[74] F. Pahde, O. Ostapenko, P. Jähnichen, T. Klein, M. Nabi, Self-paced adversarial training for multimodal few-shot learning, in: IEEE Winter Conference on Applications of Computer Vision (WACV 2019), IEEE, 2019, pp. 218–226. doi:10.1109/WACV.2019.00029.
[75] E. Schwartz, L. Karlinsky, R. S. Feris, R. Giryes, A. M. Bronstein, Baby steps towards few-shot learning with multiple semantics, CoRR abs/1906.01905 (2019). arXiv:1906.01905.
[76] Y. Zhang, S. Huang, X. Peng, D. Yang, Dizygotic conditional variational autoencoder for multi-modal and partial modality absent few-shot learning, CoRR abs/2106.14467 (2021). arXiv:2106.14467.
[77] J. Mu, P. Liang, N. D. Goodman, Shaping visual representations with language for few-shot classification, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Association for Computational Linguistics, 2020, pp. 4823–4830. doi:10.18653/v1/2020.acl-main.436.
[78] R. L. Hu, C. Xiong, R. Socher, Correction networks: Meta-learning for zero-shot learning (2018).
[79] Y. H. Tsai, R. Salakhutdinov, Improving one-shot learning through fusing side information, CoRR abs/1710.08347 (2017). arXiv:1710.08347.
[80] A. Gretton, O. Bousquet, A. Smola, B. Schölkopf, Measuring statistical dependence with Hilbert-Schmidt norms, in: International Conference on Algorithmic Learning Theory, Springer, 2005, pp. 63–77.
[81] Y. Wang, R. B. Girshick, M. Hebert, B. Hariharan, Low-shot learning from imaginary data, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), IEEE Computer Society, 2018, pp. 7278–7286. doi:10.1109/CVPR.2018.00760.
[82] S. Huang, J. Lin, L. Huangfu, Class-prototype discriminative network for generalized zero-shot learning, IEEE Signal Process. Lett. 27 (2020) 301–305. doi:10.1109/LSP.2020.2968213.
[83] Z. Chen, Y. Fu, Y. Wang, L. Ma, W. Liu, M. Hebert, Image deformation meta-networks for one-shot learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Computer Vision Foundation / IEEE, 2019, pp. 8680–8689. doi:10.1109/CVPR.2019.00888.
[84] Z. Chen, Y. Fu, K. Chen, Y. Jiang, Image block augmentation for one-shot learning, in: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), AAAI Press, 2019, pp. 3379–3386. doi:10.1609/aaai.v33i01.33013379.
[85] Z. Chen, Y. Fu, Y. Zhang, Y. Jiang, X. Xue, L. Sigal, Multi-level semantic feature augmentation for one-shot learning, IEEE Trans. Image Process. 28 (2019) 4594–4605. doi:10.1109/TIP.2019.2910052.
[86] J. Liu, F. Chao, C.-M. Lin, Task augmentation by rotating for meta-learning, arXiv preprint arXiv:2003.00804 (2020).
[87] H. Yao, L.-K. Huang, L. Zhang, Y. Wei, L. Tian, J. Zou, J. Huang, et al., Improving generalization in meta-learning via task augmentation, in: International Conference on Machine Learning, PMLR, 2021, pp. 11887–11897.
[88] H. Yao, L. Zhang, C. Finn, Meta-learning with fewer tasks through task interpolation, arXiv preprint arXiv:2106.02695 (2021).
[89] B. Hariharan, R. B. Girshick, Low-shot visual recognition by shrinking and hallucinating features, in: IEEE International Conference on Computer Vision (ICCV 2017), IEEE Computer Society, 2017, pp. 3037–3046. doi:10.1109/ICCV.2017.328.
[90] A. Antoniou, A. J. Storkey, H. Edwards, Data augmentation generative adversarial networks, CoRR abs/1711.04340 (2017). arXiv:1711.04340.
[91] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 214–223. URL: http://proceedings.mlr.press/v70/arjovsky17a.html.
[92] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, Y. Song, MetaGAN: An adversarial approach to few-shot learning, in: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 2018, pp. 2371–2380.