Article
Using Variational Multi-view Learning
for Classification of Grocery Items
Marcus Klasson,1,3,* Cheng Zhang,2,* and Hedvig Kjellström1,*
1Division of Robotics, Perception, and Learning, Lindstedtsvägen 24, 114 28 Stockholm, Sweden
2Microsoft Research Ltd, 21 Station Road, Cambridge CB1 2FB, UK
3Lead Contact
THE BIGGER PICTURE
In recent years, several computer vision-based assistive technologies for helping visually impaired people have been released on the market. We study a special case where visual capability is often important when searching for objects: grocery shopping. To enable assistive vision devices for grocery shopping, data representing the grocery items have to be available. We therefore provide a challenging dataset of smartphone images of grocery items resembling the shopping scenario with an assistive vision device. Our dataset is publicly available to encourage other researchers to evaluate their computer vision models on grocery item classification in real-world environments. The next step would be to deploy the trained models on mobile devices, such as smartphone applications, to evaluate whether the models can perform effectively in real time with human users. This dataset is a step toward enabling these technologies to make everyday life easier for the visually impaired.
SUMMARY
An essential task for computer vision-based assistive technologies is to help visually impaired people to recognize objects in constrained environments, for instance, recognizing food items in grocery stores. In this paper, we introduce a novel dataset with natural images of groceries—fruits, vegetables, and packaged products—where all images have been taken inside grocery stores to resemble a shopping scenario. Additionally, we download iconic images and text descriptions for each item that can be utilized for better representation learning of groceries. We select a multi-view generative model, which can combine the different item information into lower-dimensional representations. The experiments show that utilizing the additional information yields higher accuracies on classifying grocery items than only using the natural images. We observe that iconic images help to construct representations separated by visual differences of the items, while text descriptions enable the model to distinguish between visually similar items by different ingredients.
low-dimensional latent representation that can be used for classification. We also select a variant of VCCA, denoted VCCA-private, which separates shared and private information about each view through factorization of the latent representation (see Extracting Private Information of Views with VCCA-private in Experimental Procedures). Furthermore, we use a standard multi-view autoencoder model called Split Autoencoder (SplitAE)13,14 for benchmarking against VCCA and VCCA-private on classification.

We conduct experiments with SplitAE, VCCA, and VCCA-private on the task of grocery item classification with our dataset (Results). We perform a thorough ablation study over all views in the dataset to demonstrate how each view contributes to enhancing the classification performance and conclude that utilizing the web-scraped views yields better classification results than only using the natural images (see Classification Results in Results). To gain further insights into the results, we visualize the learned latent representations of the VCCA models and discuss how the iconic images and textual descriptions impose different structures on the latent space that are beneficial for classification (see Investigation of the Learned Representations in Results).

This work is an extended version of Klasson et al.,2 in which we first presented this dataset. In this paper, we have added a validation set of natural images from two stores that were not present in the training and test set splits from Klasson et al.2 to avoid overfitting effects. We also demonstrate how the text descriptions can be utilized alone and along with the iconic images in a multi-view setting, while Klasson et al.2 only experimented with the combination of natural and iconic images to build better representations of grocery items. Finally, we decode iconic images from unseen natural images as an alternative way to evaluate the usefulness of the latent representations (see Decoding Iconic Images from Unseen Natural Images in Results). As we only evaluated the decoded iconic images qualitatively in Klasson et al.,2 we have extended the assessment by comparing the quality of the decoded images from different VCCA models with multiple image similarity metrics.

Next we discuss image datasets, food datasets, and multi-view models related to our work:

Image Datasets
Many image datasets used in computer vision have been collected by downloading images from the web.15–26 Some datasets15,18,20,22,24 use search words with the object category in isolation, which typically returns high-quality images where the searched object is large and centered. To collect images from more real-world scenarios, one can instead search for combinations of object categories, which usually returns images of the two searched categories but also of numerous other categories.21,26 The simplest annotation of these images is to provide a class label for the present objects. Occasionally, the dataset can use a hierarchical labeling structure and provide a fine- and coarse-grained label to objects where it is applicable. The annotators can also be asked to increase the possible level of supervision for the objects by, for instance, providing bounding boxes, segmentation masks, keypoints, text captions that describe the scene, and reference images of the objects.17,19,21,22,24,26 Our dataset includes reference (iconic) images of the objects that were web-scraped from a grocery store website. We also downloaded text descriptions that describe general attributes of the grocery items, such as flavor and texture, rather than the whole visual scene. The grocery items have also been labeled hierarchically in a fine- and coarse-grained manner if there exist multiple variants of specific items. For example, fine-grained classes of apples such as Golden Delicious or Royal Gala belong to the coarse-grained class "Apple."

Food Datasets
Recognizing grocery items in their natural environments, such as grocery stores, shelves, and kitchens, has been addressed in numerous previous works.2,27–37 The addressed tasks range over hierarchical classification, object detection, segmentation, and three-dimensional (3D) model generation. Most of these works collect a dataset that resembles shopping or cooking scenarios, whereby the datasets vary in the degree of labeling, different camera views, and the data domain difference between the training and test set. The training sets in the GroZi-120,32 Grocery Products,28 and CAPG-GP27 datasets were obtained by web-scraping product images of single instances on grocery web stores, while the test sets were collected in grocery stores where there can be single and multiple instances of the same item and other different items. The RPC35 and TGFS37 datasets are used for object detection and classification of grocery products, whereby RPC is targeted for automatic checkout systems and TGFS is given the task of recognizing items purchased from self-service vending machines. The BigBIRD33 dataset and datasets from Hsiao et al.29 and Lai et al.31 contain images of grocery items from multiple camera views, segmentation masks, and depth maps for 3D reconstruction of various items. The Freiburg Groceries30 dataset contains images taken with smartphone cameras of items inside grocery stores, while its test set consists of smartphone photos in home environments with single or multiple instances from different kinds of items. The dataset presented in Waltner et al.34 also contains images taken with smartphone cameras inside grocery stores to develop a mobile application for recognizing raw food items and providing details about the item, such as nutrition values and recommendations of similar items. Other works that collected datasets of raw food items, such as fruits and vegetables, focused on the standard image classification task38,39 and on detecting fruits in orchards for robotic harvesting.40,41 Our dataset—the Grocery Store dataset—shares many similarities with the aforementioned works, for instance, all images of groceries being taken in their natural environment, the hierarchical labeling of the classes, and the iconic product images for each item in the dataset. Additionally, we have provided a text description for each item that was web-scraped along with the iconic image. As most grocery item datasets only include packaged products, we have also collected images of different fruit and vegetable classes along with packages in our dataset.
Other examples of food datasets are those with images of food dishes, for which Min et al.42 provide a summary of existing benchmark food datasets. The contents of these datasets range over images of food dishes,43–46 cooking videos,47 recipes,48–50 and restaurant-oriented information.51,52 One major difference between recognizing groceries and food dishes is the appearance of the object categories in the images. For instance, a fruit or vegetable is usually intact and present in the image, while ingredients that are significant for recognizing a food dish may not be visible at all depending on the recipe. However, raw food items and dishes share similarities in recognition, since they can appear with many different visual features in the images compared with packaged groceries, e.g., carton boxes, cans, and bottles, where the object classes have identical shape and texture. Another key difference is the natural environments and scenes where the images of grocery items and food dishes have been taken. Food dish images usually show the food on a plate placed on a table and, occasionally, with cutlery and a glass next to the plate. Images taken in grocery stores can cover many instances of the same item stacked close to each other in shelves, baskets, and refrigerators, while there can be multiple different kinds of items next to each other in a kitchen environment. To summarize, recognizing grocery items and recognizing food dishes are both challenging tasks because examples from the same category can look very different and also appear in various realistic settings in images.

Multi-view Learning Models
There exist many multi-view learning approaches for data fusion of multiple features.3,5,6,13,53–66 A common approach is to obtain a shared latent space for all views with the assumption that each view has been generated from this shared space.64 A popular example of this approach is Canonical Correlation Analysis (CCA),67 which aims to project two sets of variables (views) into a lower-dimensional space so that the correlation between the projections is maximized. Similar methods propose maximizing other alignment objectives for embedding the views, such as ranking losses.5,6,55,63 There exist nonlinear extensions of CCA, e.g., Kernel CCA68 and Deep CCA,69 which optimize their nonlinear feature mappings based on the CCA objective. Deep Canonically Correlated Autoencoders (DCCAE)14 is a Deep CCA model with an autoencoding part, which aims to maximize the canonical correlation between the extracted features as well as to reconstruct the input data. Removing the CCA objective reduces DCCAE to a standard multi-view autoencoder, e.g., Bimodal Autoencoders and Split Autoencoders (SplitAEs),13,14 which only aim to learn a representation that best reconstructs the input data. SplitAEs aim to reconstruct two views from a representation encoded from one of the views. This approach was empirically shown to work better than Bimodal Autoencoders by Ngiam et al.13 in situations where only a single view is present at both training and test times.

Variational CCA (VCCA)3 can be seen as an extension of CCA to deep generative models, but can also be described as a probabilistic version of SplitAEs. VCCA uses the amortized inference procedure from variational autoencoders (VAEs)70,71 to learn the shared latent space by maximizing a lower bound on the data log likelihood of the views. Succeeding works have proposed new learning objectives and inference methods for enabling conditional generation of views, e.g., generating an image conditioned on some text and vice versa.54,58,59,61,62 These approaches rely on fusing the views into the shared latent space as the inference procedure, which often requires tailored training and testing paradigms when views are missing. However, adding information from multiple views may not lead to improved results and can even make the model perform worse on the targeted task.12,13 This is especially the case when views have noisy observations, which complicates learning a shared latent space that combines the commonalities between the views. To avoid disturbing the shared latent space with noise from single views, some works design models that extend the shared latent space with private latent spaces for each view that should contain the view-specific variations to make learning the shared variations easier.3,57,60,65,72 VCCA can be extended to extract shared and private information between different views through factorization of the latent space into shared and private parts. In this paper, we investigate whether the classification performance of grocery items in natural images can be improved by extracting the view-specific variations in the additional views (iconic images and product descriptions) from the shared latent space with this variant of VCCA, called VCCA-private. We experiment with treating each data point as pairs of natural images and either iconic images or text descriptions, as well as triplets of natural images, iconic images, and text descriptions. A difference between how we apply VCCA to our dataset compared with the aforementioned works is that the iconic image and text description are the same for every natural image of a specific class.

RESULTS

In this section, we begin by providing the details about the collected dataset, which we have named the Grocery Store dataset. Furthermore, we illustrate the utility of the additional information in the Grocery Store dataset to classify grocery items in the experiments. We compare SplitAE, VCCA, and VCCA-private with different combinations of views against two standard image classifiers. Additionally, we experiment with a vanilla autoencoder (denoted as AE) and a VAE that post-process the natural image features to train a linear classifier, and compare the performance against the multi-view models. We measure the classification performance on the test set for every model and also compare the classification accuracies when the number of words in the text description varies for the models concerned (see Classification Results in Results). To gain insights into how the additional views affect the learned representations, we visualize the latent spaces of VCCA and VCCA-private with principal component analysis (PCA) and discuss how different views change the structure of the latent space (see Investigation of the Learned Representations in Results). Finally, we show how iconic images can be used for enhancing the interpretability of the classification (see Decoding Iconic Images from Unseen Natural Images in Results), which was also illustrated by Klasson et al.2

The Grocery Store Dataset
In Klasson et al.,2 we collected images from fruit, vegetable, and refrigerated sections with dairy and juice products in 20 different grocery stores. The dataset consists of 5,421 images from 81 different classes. For each class, we have downloaded an iconic image of the item, a text description, and information including country of origin, appreciated weight, and nutrient values of the item from a grocery store website. Some examples of natural images and downloaded iconic images can be seen in Figures 1 and 2, respectively. Furthermore, Table S1 displays a selection of text descriptions with their corresponding iconic image.
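To make the multi-view setup concrete, the sketch below shows one way to pair each natural image with the class-level iconic image and text description before training. This is a minimal illustration in Python, not the released data loader: the CSV column layout and file naming are assumptions and may differ from the repository's actual structure.

```python
import csv
from pathlib import Path

def load_grocery_views(index_csv, iconic_dir, text_dir):
    """Pair natural image paths with class-level iconic images and text descriptions.

    Assumes a hypothetical index CSV with rows of the form
    natural_image_path, class_name, fine_label_id, coarse_label_id,
    and one iconic image / description file per class named after class_name.
    """
    samples = []
    with open(index_csv, newline="") as f:
        for image_path, class_name, fine_id, coarse_id in csv.reader(f):
            samples.append({
                "natural_image": image_path,
                "iconic_image": str(Path(iconic_dir) / f"{class_name}.jpg"),
                "description": (Path(text_dir) / f"{class_name}.txt").read_text().strip(),
                "fine_label": int(fine_id),
                "coarse_label": int(coarse_id),
            })
    return samples
```

Because the iconic image and description are defined per class, every natural image of, e.g., Royal Gala receives the same additional views, which matches how the views are used by the models in this paper.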
Table 1. Classification Accuracies on the Test Set for All Models Given as Percentage for Each Model

Model                          Accuracy (%)     Coarse Accuracy (%)
DenseNet-scratch               67.33 ± 1.35     75.67 ± 1.15
Softmax                        71.67 ± 0.28     83.34 ± 0.32
AEx + Softmax                  70.69 ± 0.82     82.42 ± 0.58
VAEx + Softmax                 69.20 ± 0.46     81.24 ± 0.63
SplitAExy                      70.34 ± 0.56     82.11 ± 0.38
VCCAxy                         70.72 ± 0.56     82.12 ± 0.61
SplitAExi + Softmax            77.68 ± 0.69     87.09 ± 0.53
VCCAxi + Softmax               77.02 ± 0.51     86.46 ± 0.42
VCCA-privatexi + Softmax       73.04 ± 0.56     84.16 ± 0.51
SplitAExiy                     77.43 ± 0.80     87.14 ± 0.57
VCCAxiy                        77.22 ± 0.55     86.54 ± 0.51
VCCA-privatexiy                74.04 ± 0.83     84.59 ± 0.83
SplitAExw + Softmax            76.27 ± 0.66     86.45 ± 0.56
VCCAxw + Softmax               75.37 ± 0.46     86.00 ± 0.32
VCCA-privatexw + Softmax       75.11 ± 0.81     85.91 ± 0.55
SplitAExwy                     75.78 ± 0.84     86.13 ± 0.63
VCCAxwy                        74.72 ± 0.85     85.59 ± 0.78
VCCA-privatexwy                74.92 ± 0.74     85.59 ± 0.67
SplitAExiw + Softmax           77.79 ± 0.48     87.12 ± 0.62
VCCAxiw + Softmax              77.51 ± 0.51     86.69 ± 0.41
SplitAExiwy                    78.18 ± 0.53     87.26 ± 0.46
VCCAxiwy                       77.78 ± 0.45     86.88 ± 0.47

The subscript letters in the model names indicate the data views used in the model. The column Accuracy corresponds to the fine-grained classification accuracy. The column Coarse Accuracy corresponds to classification of a class within the correct parent class. Results are averaged over 10 different random seeds and reported as mean and standard deviation. AE, Autoencoder; VAE, Variational Autoencoder; SplitAE, Split Autoencoder; VCCA, Variational Canonical Correlation Analysis.

We observe that VCCA models compete with their corresponding SplitAE models in the classification task. The main difference between these models is the Kullback-Leibler (KL) divergence74 term in the evidence lower bound (ELBO) that encourages a smooth latent space for VCCA (see Experimental Procedures). In contrast, SplitAE learns a latent space that best reconstructs the input data, which can result in parts of the space that do not represent the observed data. We showcase these differences by plotting the latent representations of SplitAExiwy and VCCAxiwy using PCA in Figure 4. In Figures 4A and 4B, we have plotted the corresponding iconic image for the latent representations. We observe that VCCAxiwy tries to establish a smooth latent space by pushing visually similar items closer to each other while at the same time preventing the representations from spreading too far from the origin. Figures 4C and 4D show the positions of the bell pepper items in the latent spaces, where the color of the point corresponds to the specific bell pepper class. In Figure 4C, we observe that SplitAExiwy has spread out the bell peppers across the space, while VCCAxiwy establishes shorter distances between them in Figure 4D due to the regularization.

We evaluated the classification performance achieved by each SplitAE, VCCA, and VCCA-private model using the text descriptions with different description lengths T. Figure 5 shows the fine-grained classification accuracies for the concerned models. For models using only the text descriptions, the classification accuracies increase as T increases in most cases. Setting T ≥ 32 results in good classification performance, potentially because the text descriptions then become more dissimilar and unique, which the models can exploit to separate the grocery items. The classification accuracies are mostly stable as T varies for the models with the additional iconic images. Since including iconic images significantly increases the classification performance over models using only text descriptions, we conclude that the iconic images are more helpful when we want to classify the grocery items from natural images.

Investigation of the Learned Representations
To gain insights into the effects of each view on the classification performance, we visualize the latent space by plotting the latent representations using PCA. Utilizing the additional views showed similar effects on the structure of the latent spaces from SplitAE and VCCA. Since our main interest lies in representation learning with variational methods, we focus on studying the latent representations of VCCA and VCCA-private. First, we use PCA to visualize the latent representations in two dimensions and plot the corresponding iconic images of the representations (Figure 6). Second, to illustrate the effects of the iconic images and text descriptions on the learned latent space, we focus on two cases of grocery items whereby one of the views helps to separate two different types of classes and the other one does not (Figures 7 and 8). Finally, we look into the shared and private latent spaces learned by VCCA-privatexw and observe that variations in image backgrounds and structures of text sentences have been separated from the shared representations into the private ones.

In Figure 6, we show the latent representations for the VCCA models that were used in Table 1 (see Classification Results in Results). We also plot the PCA projections of the natural image features from the off-the-shelf DenseNet169 in Figure 6A as a baseline. Figures 6B and 6C show the latent spaces learned by VAEx and VCCAxy, which are similar to the DenseNet169 feature space since these models are performing compression of the natural image features into the learned latent space. We observe that these models have divided packages and raw food items into two separate clusters. However, the fruits and vegetables are scattered across their cluster and the packages have been grouped close to each other despite having different colors, e.g., black and white, on the cartons.

The structure of the latent spaces becomes distinctly different for the VCCA models that use either iconic images or text descriptions as an additional view, and we can observe the different structures that the views bring to the learned latent space. In Figures 6D and 6G, we see that visually similar objects, in terms of color and shape, have moved closer together by utilizing iconic images in VCCAxi and VCCAxiy. When using text descriptions in VCCAxw and VCCAxwy, we also observe in the fruit and vegetable cluster that the items are more grouped based on their color in Figures 6E and 6H. Figures 6F and 6I show the latent spaces in VCCAxiw and VCCAxiwy, respectively. These latent spaces are similar to the ones learned by VCCAxi and VCCAxiy in the sense that these latent spaces also group items
Figure 9. Visualizations of the Latent Representations μ_z from VCCAxw and VCCA-privatexw, Followed by μ_{u_w} and μ_{u_x} for VCCA-privatexw
Representations from Variational Canonical Correlation Analysis (VCCA)xw (A) and VCCA-privatexw (B), followed by μ_{u_w} and μ_{u_x} for VCCA-privatexw (C).
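The latent-space figures referred to above are two-dimensional PCA projections of the encoder means μ_z(x). A minimal sketch of that projection step, using scikit-learn and matplotlib with illustrative variable names:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_pca(latent_means, labels, title="Latent representations (PCA)"):
    """Project latent means onto their first two principal components and color by class."""
    coords = PCA(n_components=2).fit_transform(np.asarray(latent_means))
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=8)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title(title)
    plt.tight_layout()
    plt.show()
```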
Table 2. Results on Image Quality of Decoded Iconic Images for the Variational Canonical Correlation Analysis Models Using the Iconic Images

Model       PSNR (↑)        SSIM (↑)       KL (↓)         Accuracy (%)
VCCAxi      20.13 ± 0.05    0.72 ± 0.00    4.43 ± 0.21    77.02 ± 0.51
VCCAxiy     20.12 ± 0.09    0.73 ± 0.00    4.35 ± 0.22    77.22 ± 0.55
VCCAxiw     20.11 ± 0.09    0.73 ± 0.00    4.29 ± 0.24    77.51 ± 0.51
VCCAxiwy    20.16 ± 0.08    0.73 ± 0.00    4.32 ± 0.22    77.78 ± 0.45

The subscript letters in the model names indicate the data views used in the model. ↑ denotes that higher is better, ↓ that lower is better. Peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and Kullback-Leibler divergence (KL) are measured by comparing the true iconic image against the decoded one. Accuracy shows the classification performance for each model and has been taken from Table 1. Data are reported as mean and standard deviation averaged over 10 random seeds for all metrics.

from scratch when the model uses a class label decoder. Note that the encoder for any of the multi-view models can be used for extracting latent representations for new tasks whether the model utilizes the label or not, since the encoder only uses natural images as input.

Iconic Images versus Text Descriptions
In Table 1, the iconic images yielded higher classification accuracies compared with using the text descriptions. This was also evident in Figure 5, where the classification performance remains more or less the same regardless of the text description length T when the models utilize iconic images. We believe that the main reasons for the advantages with iconic images lie in the clear visual features of the items in these images, e.g., their color and shape, which carry much information that is important for image classification tasks. However, we also observed that iconic images and text descriptions can yield different benefits for constructing good representations. In Figure 6, we see that iconic images and text descriptions make the model construct different latent representations of the grocery items. Iconic images structure the representations with respect to color and shape of the items (Figure 7), while the descriptions group items based on their ingredients and flavor (Figure 8). Therefore, the latent representations benefit differently from utilizing the additional views, and a combination of all of them yields the best classification performance, as shown in Table 1. We want to highlight the results in Figure 8, where the model manages to separate juice and yogurt packages based on their text description. Refrigerated items, e.g., milk and juice packages, have in general very similar shapes and the same color if they come from the same brand. There are minor visual differences between items of the same brand that make it possible to differentiate between them, e.g., the picture of the main ingredient on the package and the ingredient description. Additionally, these items can be almost identical depending on which side of the package is visible in the natural image. When utilizing the text descriptions, we add useful information on how to distinguish between visually similar items that have different ingredients and contents. This is highly important for using computer vision models to distinguish between packaged items without having to use other kinds of information, e.g., barcodes.

Text Description Length
We showed that the text descriptions are useful for the classification task, and that careful selection of the description length T is important for achieving the best possible performance (see Classification Results in Results). In Figure 5, we observed that most models achieve significantly better classification performance when the text description length T increases up until T = 32. The reason for this increase is the similarity between the descriptions of items of the same kind or brand, such as milk and juice packages. For instance, in Table S1, the first sentence in the descriptions for the milk packages only differs by the ninth word, which is "organic" for the ecological milk package. This means that their descriptions will be identical when T = 8. Therefore, the descriptions will become more different from each other as we increase T, which helps the model to distinguish between items with similar descriptions. However, the classification accuracies have more or less saturated when setting T > 32, which is also due to the similarity between the descriptions. For example, the bell pepper descriptions in Table S1 only differ by the second word, which describes the color of the bell pepper, i.e., the words "yellow" and "orange." We also see that the third and fourth sentences in the descriptions of the milk packages are identical. The text descriptions typically have words that separate the items in the first or second sentence, whereas the subsequent sentences provide general information on ingredients and how the item can be used in cooking. For items of the same kind but of different colors or brands, e.g., bell peppers or milk packages, respectively, the useful textual information for constructing good representations of grocery items typically comes from a few words in the description that describe features of the item. Therefore, the models yield better classification performance when T is set to include at least the whole first sentence of the description. We could select T more cleverly, e.g., by using different T for different descriptions to make sure that we utilize the words that describe the item or filter out noninformative words for the classification task.
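Truncating every description to its first T words, as discussed above, only needs a small helper; the whitespace tokenization here is an assumption, since the exact text preprocessing is specified in the Supplemental Experimental Procedures.

```python
def truncate_description(description, num_words):
    """Keep only the first `num_words` whitespace-separated words of a text description."""
    return " ".join(description.split()[:num_words])

# With a small cutoff, descriptions that only differ later in the text become identical:
assert truncate_description("one two three four five", 3) == "one two three"
```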
VCCA versus VCCA-Private
The main motivation for using VCCA-private is to use private latent variables for modeling view-specific variations, e.g., image backgrounds and writing styles of text descriptions. This could allow the model to build shared representations that more efficiently combine salient information shared between the views for training better classifiers. This would then remove noise from the shared representation, since the private latent variables are responsible for modeling the view-specific variations. For VCCA-privatexw, we observed that the private latent spaces managed to group different image backgrounds and grocery items with similar text descriptions in Figures 9C and 9D, respectively. This model also performed on par with VCCAxw regarding classification performance in Table 1. However, we also saw in the same table that the VCCA-private models using the iconic image perform poorly on the classification task compared with their VCCA counterparts. The reason why these models fail is a form of "posterior collapse"79 in the encoder for the iconic image, where the encoder starts outputting random noise. We noticed this as the KL divergence term for the private latent variable converged to zero when we trained the models for
Figure 10. Examples of Decoded Iconic Images from VCCAxiwy with Their Corresponding Natural Image and True Iconic Image as well as Predicted Labels and Image Similarity Metrics
The column Classification shows the true label for the natural image (True Label) and the label predicted by the model (Pred. Label). VCCA, Variational Canonical Correlation Analysis; PSNR, peak signal-to-noise ratio; SSIM, structural similarity; KL, Kullback-Leibler divergence; Arla Eco. Sourcream, Arla ecological sour cream; Arla Std. Milk, Arla standard milk.
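PSNR and SSIM for a decoded iconic image can be computed with scikit-image as sketched below. The KL-based score in Table 2 is omitted here because it depends on how the image distributions are modeled, which we do not assume; the paper's exact evaluation pipeline may therefore differ from this sketch.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def iconic_image_quality(true_iconic, decoded_iconic):
    """Compare a decoded iconic image against its ground truth (H x W x 3 arrays in [0, 1])."""
    true_img = np.asarray(true_iconic, dtype=np.float64)
    decoded = np.asarray(decoded_iconic, dtype=np.float64)
    psnr = peak_signal_noise_ratio(true_img, decoded, data_range=1.0)                 # higher is better
    ssim = structural_similarity(true_img, decoded, data_range=1.0, channel_axis=-1)  # higher is better
    return psnr, ssim
```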
high-quality representations. However, we observed that the private latent variables for the web-scraped views became uninformative by modeling noise due to the lack of variations in the additional web-scraped views. This encourages us to explore new methods for extracting salient information from such data views that can be beneficial for downstream tasks.

An evident direction of future work would be to investigate other methods for utilizing the web-scraped views more efficiently. For instance, we could apply pre-trained word representations for the text description, e.g., BERT81 or GloVe,83 to find out whether they enable the construction of representations that can more easily distinguish between visually similar items. Another interesting direction would be to experiment with various data augmentation techniques on the web-scraped views to create view-specific variations without the need for collecting and annotating more data. It is also important to investigate how the model can be extended to recognize multiple items. Finally, we see zero- and few-shot learning63 of new grocery items and transfer learning84 as potential applications for which our dataset can be used for benchmarking of multi-view learning models on classification tasks.

EXPERIMENTAL PROCEDURES

Resource Availability
Lead Contact
Marcus Klasson is the lead contact for this study and can be contacted by email at [email protected].
Materials Availability
There are no physical materials associated with this study.
Data and Code Availability
1. The Grocery Store dataset along with documentation is available at the following Github repository: https://github.com/marcusklasson/GroceryStoreDataset
2. The source code for the multi-view models along with documentation is available at the following Github repository: https://github.com/marcusklasson/vcca_grocerystore

Methods
In this section, we outline the details of the models we use for grocery classification. We begin by introducing autoencoders and SplitAEs.14 We then describe VAEs70 and how they are applied to single-view data, followed by the introduction of VCCA3 and how we adapt it to our dataset. We also discuss a variant of VCCA called VCCA-private,3 which is used for extracting private information about each view in addition to shared information across all views by factorizing the latent space. The graphical model representations of the VAE, VCCA, and VCCA-private models that have been used in this paper are shown in Figure S5. The model names use subscripts to denote the views utilized for learning the shared latent representations. For example, VCCAxi utilizes natural image features x and iconic images i, while VCCAxiwy uses natural image features x, iconic images i, text descriptions w, and class labels y.

Autoencoders and Split Autoencoders
The autoencoding framework can be used for feature extraction and learning latent representations of data in an unsupervised manner.85 It begins with defining a parameterized function called the encoder for extracting features. We denote the encoder as f_φ, where φ includes its parameters, which commonly are the weights and bias vectors of a neural network. The encoder is used for computing a feature vector h = f_φ(x) from the input data x. Another parameterized function g_θ, called the decoder, is also defined, which maps the feature h back into input space, i.e., x̂ = g_θ(h). The encoder and decoder are learned simultaneously to minimize the reconstruction loss between the input and its reconstruction over all training samples. By setting the dimension of the feature vector smaller than the input dimension, i.e., d_h < d_x, the autoencoder can be used for dimensionality reduction, which makes the feature vectors suitable for training linear classifiers in a cheap way.

As in the case of the Grocery Store dataset, we have multiple views available during training, while only the natural image view is present at test time. In this setting, we can use a Split Autoencoder (SplitAE) to extract shared representations by reconstructing all views during training from the one view that is available during the test phase.13,14 As an example, we have the two-view case with x present at both training and test time while y is only available during training. We therefore define an encoder f_φ and two decoders g_{θ_x} and g_{θ_y}, where both decoders input the same representation h = f_φ(x). The objective of the SplitAE is to minimize the sum of the reconstruction losses, which will encourage representations h that best reconstruct both views. The total loss is then

L_SplitAE(θ, φ; x, y) = λ_x L_x(x, g_{θ_x}(h)) + λ_y L_y(y, g_{θ_y}(h)),   (Equation 2)

where θ_x, θ_y ∈ θ and λ_x, λ_y are scaling weights for the reconstruction losses. For images, the reconstruction loss can be the mean squared error, while the cross-entropy loss is commonly used for class labels and text. This architecture can be extended to more than two views by simply using view-specific decoders that input the shared representation extracted from natural images. Note that in the case when the class labels are available, we can use the class label decoder g_{θ_y} as a classifier during test time. Alternatively, we can train a separate classifier with the learned shared representations after the SplitAE has been trained.
tial applications for which our dataset can be used for bench- separate classifier with the learned shared representations after the SplitAE
marking of multi-view learning models on classification tasks. has been trained.
Variational Autoencoders with Only Natural Images
EXPERIMENTAL PROCEDURES The Variational Autoencoder (VAE) is a generative model that can be used for
generating data from single views. Here, we describe how the VAE learns
Resource Availability latent representations of the data and how the model can be used for clas-
Lead Contact sification. VAEs define a joint probability distribution pq ðx; zÞ = pðzÞpq ðxjzÞ,
Marcus Klasson is the lead contact for this study and can be contacted by where pðzÞ is a prior distribution over the latent variables z and pq ðxjzÞ is
email at [email protected]. the likelihood over the natural images x given z. The prior distribution is often
Materials Availability assumed to be an isotropic Gaussian distribution, pðzÞ = N ðz j 0; IÞ, with the
There are no physical materials associated with this study. zero vector 0 as mean and the identity matrix Ias the covariance. The likeli-
Data and Code Availability hood pq ðxjzÞ takes the latent variable z as input and outputs a distribution
1. The Grocery Store dataset along with documentation is available at the parameterized by a neural network with parameters q, which is referred to
following Github repository: https://github.com/marcusklasson/ as the decoder network. A common distribution for natural images is a multi-
GroceryStoreDataset variate Gaussian, pq ðxjzÞ = N ðx j mx ðzÞ; s2x ðzÞ 1IÞ, where mx ðzÞ and s2x ðzÞ are
2. The source code for the multi-view models along with documentation is the means and standard deviations of the pixels respectively outputted from
available at the following Github repository: https://github.com/ the decoder and, 1 denotes element-wise multiplication. We wish to find
marcusklasson/vcca_grocerystore latent variables z that are likely to have generated x, which is done by
approximating the intractable posterior distribution pq ðzjxÞ with a simpler dis-
tribution q4 ðzjxÞ.71 This approximate posterior q4 ðzjxÞ is referred to as the
Methods encoder network, since it is parameterized by a neural network 4, which in-
In this section, we outline the details of the models we use for grocery classi- puts x and outputs a latent variable z. Commonly, we let the approximate
fication. We begin by introducing autoencoders and SplitAEs.14 We then posterior to be Gaussian q4 ðzjxÞ = N ðz j mz ðxÞ; s2z ðxÞ 1IÞ, where the mean
describe VAEs70 and how it is applied to single-view data, followed by the mz ðxÞ and variance s2z ðxÞ are the outputs of the encoder. The latent variable
introduction of VCCA3 and how we adapt it to our dataset. We also discuss z is then sampled using the mean and variance from the encoder with the
a variant of VCCA called VCCA-private,3 which is used for extracting private reparameterization trick.70,86 The goal is to maximize a tractable lower bound
information about each view in addition to shared information across all views on the marginal log likelihood of x using q4 ðzjxÞ:
by factorizing the latent space. The graphical model representations of the
logpq ðxÞ R Lðq; 4; xÞ = Eq4 ðzjxÞ ½ logpq ðxjzÞ DKL ðq4 ðzjxÞ jj pðzÞÞ:
VAE, VCCA, and VCCA-private models that have been used in this paper are
shown in Figure S5. The model names use subscripts to denote the views uti- (Equation 3)
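The sketch below implements the negative of the ELBO in Equation 3 in PyTorch for the natural image features, assuming a Gaussian likelihood with fixed unit variance so that the expected log-likelihood reduces to a squared error up to constants; the architecture is illustrative rather than the one used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureVAE(nn.Module):
    """Gaussian VAE on natural image features; minimizing the returned loss maximizes the ELBO."""

    def __init__(self, x_dim=1664, z_dim=200):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, 512), nn.ReLU())
        self.mu_head = nn.Linear(512, z_dim)
        self.logvar_head = nn.Linear(512, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU(), nn.Linear(512, x_dim))

    def encode(self, x):
        hid = self.hidden(x)
        return self.mu_head(hid), self.logvar_head(hid)            # mu_z(x), log sigma^2_z(x)

    def negative_elbo(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization trick
        recon = F.mse_loss(self.decoder(z), x, reduction="sum")    # -E_q[log p(x|z)] up to constants
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # analytic KL to N(0, I)
        return recon + kl                                          # negative of Equation 3
```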
We can also extend the VAE with a generative classifier by incorporating the class label y in the model.87,88 Hence, the VAE defines a joint distribution p_θ(x, y, z) = p(z) p_{θ_x}(x|z) p_{θ_y}(y|z), where the class label decoder p_{θ_y}(y|z) is used as the final classifier. We therefore aim to maximize the ELBO on the marginal log likelihood over x and y:

log p_θ(x, y) ≥ L(θ, φ; x, y) = λ_x E_{q_φ(z|x)}[log p_{θ_x}(x|z)] + λ_y E_{q_φ(z|x)}[log p_{θ_y}(y|z)] − D_KL(q_φ(z|x) || p(z)).   (Equation 4)

The parameters λ_x and λ_y are used for scaling the magnitudes of the expected values. When predicting the class label for an unseen natural image x*, we can consider multiple output predictions of the class label by sampling K different latent variables for x* from the encoder to determine the final predicted class. For example, we could either average the predicted class scores over the K samples or use the maximum class score from the K samples as the final prediction. In this paper, we compute the average of the predicted class scores using

ŷ = argmax_y (1/K) Σ_{k=1}^{K} p_{θ_y}(y | z^(k)),   z^(k) ~ q_φ(z|x*),   (Equation 5)

where ŷ is the predicted class for the natural image x*. We denote this model as VCCAxy due to its closer resemblance to VCCA than VAE in this paper.
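Equation 5 corresponds to the following sketch, which assumes the hypothetical methods `model.encode(x)` (returning the posterior mean and log-variance) and `model.decoder_y(z)` (returning class logits):

```python
import torch

@torch.no_grad()
def predict_class(model, x_new, num_samples=10):
    """Average class probabilities over K samples z^(k) ~ q_phi(z|x*) and take the argmax (Equation 5)."""
    mu, logvar = model.encode(x_new)
    std = torch.exp(0.5 * logvar)
    avg_probs = 0.0
    for _ in range(num_samples):
        z = mu + std * torch.randn_like(std)                             # z^(k) ~ q_phi(z|x*)
        avg_probs = avg_probs + torch.softmax(model.decoder_y(z), dim=-1)
    return torch.argmax(avg_probs / num_samples, dim=-1)                 # predicted class y_hat
```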
Variational Canonical Correlation Analysis for Utilizing Multi-View Data
In this section, we describe the details of Variational Canonical Correlation Analysis (VCCA)3 for our application. In the Grocery Store dataset, the views can be the natural images, iconic images, text descriptions, or class labels, and we can use any combination of them in VCCA. To illustrate how we can employ this model on the Grocery Store dataset, we let the natural images x and the iconic images i be the two views. We assume that both views x and i have been generated from a single latent variable z. Similarly to VAEs, VCCA defines a joint probability distribution p_θ(x, i, z) = p(z) p_{θ_x}(x|z) p_{θ_i}(i|z). There are now two likelihoods, one for each view, modeled by the decoders p_{θ_x}(x|z) and p_{θ_i}(i|z), represented as neural networks with parameters θ_x and θ_i. Since we want to classify natural images, the other available views in the dataset will be missing when we have received a new natural image. Therefore, the encoder q_φ(z|x) only uses x as input to infer the latent variable z shared across all views, such that we do not have to use inference techniques that handle missing views. With this choice of approximate posterior, we obtain the following ELBO on the marginal log likelihood over x and i that we aim to maximize:

log p_θ(x, i) ≥ L(θ, φ; x, i) = λ_x E_{q_φ(z|x)}[log p_{θ_x}(x|z)] + λ_i E_{q_φ(z|x)}[log p_{θ_i}(i|z)] − D_KL(q_φ(z|x) || p(z)).   (Equation 6)

The parameters λ_x and λ_i are used for scaling the magnitude of the expected values for each view. We provide a derivation of the ELBO for three or more views in Supplemental Experimental Procedures. The representations μ_z(x) from the encoder q_φ(z|x) can be used for training a separate classifier. We can also add a class label decoder p_{θ_y}(y|z) to the model and use Equation 5 to predict the class of unseen natural images.
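A sketch of the negative ELBO in Equation 6 for the two-view case (natural image features x and iconic images i), again treating both likelihoods as Gaussians with unit variance so that the expected log-likelihoods become squared errors up to constants. The encoder and decoder modules are assumed to exist and are named for illustration only.

```python
import torch
import torch.nn.functional as F

def vcca_negative_elbo(encoder, decoder_x, decoder_i, x, iconic, lambda_x=1.0, lambda_i=1.0):
    """Single-sample Monte Carlo estimate of the negative ELBO in Equation 6."""
    mu, logvar = encoder(x)                                       # q_phi(z|x) uses only the natural image view
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)       # one sample from q_phi(z|x)
    recon_x = F.mse_loss(decoder_x(z), x, reduction="sum")        # corresponds to -E_q[log p_theta_x(x|z)]
    recon_i = F.mse_loss(decoder_i(z), iconic, reduction="sum")   # corresponds to -E_q[log p_theta_i(i|z)]
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # D_KL(q_phi(z|x) || N(0, I))
    return lambda_x * recon_x + lambda_i * recon_i + kl
```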
Extracting Private Information of Views with VCCA-Private
In the following section, we show how the VCCA model can be altered to extract shared information between the views as well as view-specific private information to enable more efficient posterior inference. Assuming that a shared latent variable z is sufficient for generating all different views may have its disadvantages. Since the information in the views is rarely fully independent or fully correlated, information only relevant to one of the views will be mixed with the shared information. This may complicate the inference of the latent variables, which potentially can harm the classification performance. To tackle this problem, previous works have proposed learning separate latent spaces for modeling shared and private information of the different views.3,57,65 The shared information should represent the correlations between the views, while the private information represents the independent variations within each view. As an example, the shared information between natural and iconic images is the visual features of the grocery item, while their private information is considered to be the various backgrounds that can appear in the natural images and the different locations of non-white pixels in the iconic images. For the text descriptions, the shared information would be words that describe visual features in the natural images, whereas the private information would be different writing styles for describing grocery items with text. We adapt the approach from Wang et al.3 called VCCA-private and introduce private latent variables for each view along with the shared latent variable. To illustrate how we employ this model on the Grocery Store dataset, we let the natural images x and the text descriptions w be the two views. The joint distribution of this model is written as

p_θ(x, w, z, u_x, u_w) = p_{θ_x}(x | z, u_x) p_{θ_w}(w | z, u_w) p(z) p(u_x) p(u_w),   (Equation 7)

where u_x and u_w are the private latent variables for x and w, respectively. To enable tractable inference of this model, we employ a factorized approximate posterior distribution of the form

q_φ(z, u_x, u_w | x, w) = q_{φ_z}(z | x) q_{φ_x}(u_x | x) q_{φ_w}(u_w | w),   (Equation 8)

where each factor is represented as an encoder network inferring its associated latent variable. With this approximate posterior, the ELBO for VCCA-private is given by

log p_θ(x, w) ≥ L_private(θ, φ; x, w) = λ_x E_{q_{φ_z}(z|x), q_{φ_x}(u_x|x)}[log p_{θ_x}(x | z, u_x)] + λ_w E_{q_{φ_z}(z|x), q_{φ_w}(u_w|w)}[log p_{θ_w}(w | z, u_w)] − D_KL(q_{φ_z}(z|x) || p(z)) − D_KL(q_{φ_x}(u_x|x) || p(u_x)) − D_KL(q_{φ_w}(u_w|w) || p(u_w)).   (Equation 9)

The expectations in L_private(θ, φ; x, w) in Equation 9 can be approximated using Monte Carlo sampling from the approximate posterior in Equation 8. The sampled latent variables are concatenated and then used as input to their corresponding decoder. We let the approximate posteriors over both shared and private latent variables be multivariate Gaussian distributions and their prior distributions be standard isotropic Gaussians N(0, I). The KL divergences in Equation 9 can then be computed analytically. Since only natural images are present during test time and because the shared latent variable z should contain information about similarities between the views, e.g., the object class, we use the encoder q_{φ_z}(z|x) to extract latent representations μ_z(x) for training a separate classifier. As for the VAE and standard VCCA, we can also add a class label decoder p_{θ_y}(y|z) only conditioned on z to the model and use Equation 5 to predict the class of unseen natural images. We evaluated the classification performance of VCCA-private and compared it with the standard VCCA model only using a single shared latent variable (see Classification Results in Results).
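The factorized posterior of Equation 8 and the ELBO of Equation 9 can be sketched as below for the natural image and text views. All encoder and decoder modules are hypothetical placeholders: each encoder returns a mean and log-variance, the image decoder reconstructs the feature vector, and the text decoder is assumed to return per-word logits over a vocabulary.

```python
import torch
import torch.nn.functional as F

def gaussian_sample_and_kl(enc, inp):
    """Reparameterized sample from one posterior factor and its analytic KL to N(0, I)."""
    mu, logvar = enc(inp)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return z, kl

def vcca_private_negative_elbo(enc_z, enc_ux, enc_uw, dec_x, dec_w, x, w_tokens,
                               lambda_x=1.0, lambda_w=1.0):
    """Single-sample estimate of the negative ELBO in Equation 9."""
    z, kl_z = gaussian_sample_and_kl(enc_z, x)       # shared latent, inferred from the natural image
    u_x, kl_ux = gaussian_sample_and_kl(enc_ux, x)   # private latent of the image view
    u_w, kl_uw = gaussian_sample_and_kl(enc_uw, w_tokens)  # text encoder is assumed to embed token indices itself
    recon_x = F.mse_loss(dec_x(torch.cat([z, u_x], dim=-1)), x, reduction="sum")
    word_logits = dec_w(torch.cat([z, u_w], dim=-1))       # (batch, T, vocab) per-word logits (assumed)
    recon_w = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                              w_tokens.reshape(-1), reduction="sum")
    return lambda_x * recon_x + lambda_w * recon_w + kl_z + kl_ux + kl_uw
```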
Experimental Setup
This section briefly describes the network architecture designs and the selection of hyperparameters for the models. See Supplemental Experimental Procedures for full details of the network architectures and hyperparameters that we use.

Processing of Natural Images
We use a DenseNet16973 as the backbone for processing the natural images, since this architecture showed good classification performance in Klasson et al.2 As our first baseline, we customize the output layer of DenseNet169 to our Grocery Store dataset and train it from scratch to classify the natural images. For the second baseline, we train a Softmax classifier on off-the-shelf features from DenseNet169 pre-trained on the ImageNet dataset,15 where we extract 1664-dimensional feature vectors from the average pooling layer before the classification layer in the architecture. Using pre-trained networks as feature extractors for smaller datasets has previously been proved to be a successful approach for classification tasks,89 which makes it a suitable baseline for the Grocery Store dataset. We denote the DenseNet169 trained from scratch and the Softmax classifier trained on off-the-shelf features as DenseNet-scratch and Softmax, respectively, in the Results section.
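Extracting the 1664-dimensional off-the-shelf features with torchvision can be done as in the sketch below; the preprocessing constants are the standard ImageNet statistics and the weight-loading call assumes a recent torchvision version, so the paper's exact pipeline may differ.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms

densenet = models.densenet169(weights=models.DenseNet169_Weights.IMAGENET1K_V1)
densenet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_image):
    """Return the 1664-dimensional vector from the global average pooling before the classifier."""
    x = preprocess(pil_image).unsqueeze(0)
    feature_maps = F.relu(densenet.features(x))      # convolutional feature maps
    pooled = F.adaptive_avg_pool2d(feature_maps, 1)  # global average pooling
    return pooled.flatten(1).squeeze(0)              # shape: (1664,)
```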
Network Architectures
We use the same architectures for SplitAE and VCCA for a fair comparison. We train the models using off-the-shelf features extracted from a pre-trained DenseNet169 for the natural images. No fine-tuning of the DenseNet backbone was used in the experiments, which we leave for future research. The image feature encoder and decoder consist of a single hidden layer, where the encoder outputs the latent representation and the decoder reconstructs the image feature. We use a DCGAN90 generator architecture for the iconic image
AUTHOR CONTRIBUTIONS

Conceptualization, M.K., C.Z., and H.K.; Methodology, M.K., C.Z., and H.K.; Software, M.K.; Validation, M.K.; Investigation, M.K.; Data Curation, M.K.;

REFERENCES

15. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE), pp. 248–255.
16. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., and Zisserman, A. (2010). The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 303–338.
17. Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., and Fei-Fei, L. (2017). Fine-grained car detection for visual census estimation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pp. 4502–4508.
18. Griffin, G., Holub, A., and Perona, P. (2007). Caltech-256 Object Category Dataset, Technical Report 7694 (California Institute of Technology). http://authors.library.caltech.edu/7694.
19. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D.A., et al. (2016). Visual genome: connecting language and vision using crowdsourced dense image annotations. arXiv, 1602.07332.
20. Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images, Technical report (Citeseer).
21. Lin, T.-Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV).
22. Nilsback, M.-E., and Zisserman, A. (2008). Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing.
23. Song, H.O., Xiang, Y., Jegelka, S., and Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition.
24. Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset, Technical Report CNS-TR-2011-001 (California Institute of Technology).
25. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., and Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition.
26. Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. (2014). From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2, 67–78.
27. Geng, W., Han, F., Lin, J., Zhu, L., Bai, J., Wang, S., He, L., Xiao, Q., and Lai, Z. (2018). Fine-grained grocery product recognition by one-shot learning. In Proceedings of the 26th ACM International Conference on Multimedia, pp. 1706–1714.
28. George, M., and Floerkemeier, C. (2014). Recognizing products: A per-exemplar multi-label image classification approach. In European Conference on Computer Vision (ECCV), pp. 440–455.
29. Hsiao, E., Collet, A., and Hebert, M. (2010). Making specific features less discriminative to improve point-based 3d object recognition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE), pp. 2653–2660.
30. Jund, P., Abdo, N., Eitel, A., and Burgard, W. (2016). The Freiburg Groceries Dataset. arXiv, 1611.05799.
31. Lai, K., Bo, L., Ren, X., and Fox, D. (2011). A large-scale hierarchical multi-view RGB-D object dataset. In 2011 IEEE International Conference on Robotics and Automation (IEEE), pp. 1817–1824.
32. Merler, M., Galleguillos, C., and Belongie, S. (2007). Recognizing groceries in situ using in vitro training data. In 2007 IEEE Conference on Computer Vision and Pattern Recognition (IEEE), pp. 1–8.
33. Singh, A., Sha, J., Narayan, K.S., Achim, T., and Abbeel, P. (2014). Bigbird: A large-scale 3D database of object instances. In 2014 IEEE International Conference on Robotics and Automation (IEEE), pp. 509–516.
34. Waltner, G., Schwarz, M., Ladstätter, S., Weber, A., Luley, P., Bischof, H., Lindschinger, M., Schmid, I., and Paletta, L. (2015). Mango—mobile augmented reality with functional eating guidance and food awareness. In International Workshop on Multimedia Assisted Dietary Management.
35. Wei, X.-S., Cui, Q., Yang, L., Wang, P., and Liu, L. (2019). RPC: a large-scale retail product checkout dataset. arXiv, 1901.07249.
36. Winlock, T., Christiansen, E., and Belongie, S. (2010). Toward real-time grocery detection for the visually impaired. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops (IEEE), pp. 49–56.
37. Yu, H., Yanwei, F., and Yu-Gang, J. (2019). Take goods from shelves: a dataset for class-incremental object detection. In ACM International Conference on Multimedia Retrieval (ICMR '19).
38. Muresan, H., and Oltean, M. (2017). Fruit Recognition from Images Using Deep Learning, Technical report (Babes-Bolyai University).
39. Marko, Š. (2013). Automatic Fruit Recognition Using Computer Vision (Mentor: Matej Kristan), Fakulteta za Računalništvo in Informatiko (Univerza v Ljubljani).
40. Bargoti, S., and Underwood, J.P. (2017). Deep fruit detection in orchards. In IEEE International Conference on Robotics and Automation.
41. Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T., and McCool, C. (2016). Deepfruits: a fruit detection system using deep neural networks. Sensors 16, 1222.
42. Min, W., Jiang, S., Liu, L., Rui, Y., and Jain, R. (2019). A survey on food computing. ACM Comput. Surv. (CSUR) 52, 1–36.
43. Bossard, L., Guillaumin, M., and Van Gool, L. (2014). Food-101—mining discriminative components with random forests. In European Conference on Computer Vision.
44. Kawano, Y., and Yanai, K. (2014). Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. In Proceedings of ECCV Workshop on Transferring and Adapting Source Knowledge in Computer Vision (TASK-CV).
45. Min, W., Liu, L., Luo, Z., and Jiang, S. (2019). Ingredient-guided cascaded multi-attention network for food recognition. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1331–1339.
46. Rich, J., Haddadi, H., and Hospedales, T.M. (2016). Towards bottom-up analysis of social food. In Proceedings of the 6th International Conference on Digital Health Conference, pp. 111–120.
47. Damen, D., Doughty, H., Maria Farinella, G., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al. (2018). Scaling egocentric vision: The EPIC-Kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 720–736.
48. Marin, J., Biswas, A., Ofli, F., Hynes, N., Salvador, A., Aytar, Y., Weber, I., and Torralba, A. (2019). Recipe1M+: a dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/tpami.2019.2927476.
49. Salvador, A., Hynes, N., Aytar, Y., Marin, J., Ofli, F., Weber, I., and Torralba, A. (2017). Learning cross-modal embeddings for cooking recipes and food images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
50. Yagcioglu, S., Erdem, A., Erdem, E., and Ikizler-Cinbis, N. (2018). RecipeQA: a challenge dataset for multimodal comprehension of cooking recipes. arXiv, 1809.00812.
51. Beijbom, O., Joshi, N., Morris, D., Saponas, S., and Khullar, S. (2015). Menu-match: Restaurant-specific food logging from images. In 2015 IEEE Winter Conference on Applications of Computer Vision (IEEE), pp. 844–851.
52. Xu, R., Herranz, L., Jiang, S., Wang, S., Song, X., and Jain, R. (2015). Geolocalized modeling for dish recognition. IEEE Trans. Multimedia 17, 1187–1199.
53. Baltrusaitis, T., Ahuja, C., and Morency, L.-P. (2018). Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443.
54. Cremer, C., and Kushman, N. (2018). On the importance of learning aggregate posteriors in multimodal variational autoencoders. 1st Symposium on Advances in Approximate Bayesian Inference.
55. Fu, Y., and Sigal, L. (2016). Semi-supervised vocabulary-informed learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5337–5346.
56. Pieropan, A., Salvi, G., Pauwels, K., and Kjellström, H. (2014). Audio-visual classification and detection of human manipulation actions. In 2014
74. Kullback, S., and Leibler, R.A. (1951). On information and sufficiency. Ann. Math. Stat. 22, 79–86.
75. Wang, Z., Bovik, A.C., Sheikh, H.R., and Simoncelli, E.P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612.
76. Cui, S., and Datcu, M. (2015). Comparison of Kullback-Leibler divergence approximation methods between Gaussian mixture models for satellite
90. Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 1511.06434.
91. Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780.
92. Kingma, D.P., and Ba, J. (2015). Adam: A method for stochastic optimization. In Advances in Neural Information Processing Systems 28.