Visualizing Hierarchies in scRNA-seq Data Using A Density Tree-Biased Autoencoder
Quentin Garrido 1,5,*   Sebastian Damrich 1   Alexander Jäger 1   Dario Cerletti 2,3
Manfred Claassen 4   Laurent Najman 5   Fred A. Hamprecht 1

1 HCI/IWR, Heidelberg University, Germany
2 Institute of Molecular Systems Biology, ETH Zürich, Switzerland
3 Institute of Microbiology, ETH Zürich, Switzerland
4 Internal Medicine I, University Hospital Tübingen, University of Tübingen, Germany
5 Université Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France

arXiv:2102.05892v3 [q-bio.QM] 22 Apr 2022
Figure 1: Schematic method overview. a) High-dimensional data. b) Proposed density tree. After computing the k-means
centroids on the data, we build a tree based on the data density between pairs of centroids. c) DTAE. An autoencoder is used
to learn a representation of our data. This embedding is regularized by the previously computed tree in order to preserve
its hierarchical structure in low-dimensional space. d) The final DTAE embedding. After training of the autoencoder, the
bottleneck layer visualizes the data in low dimension and respects the density structure.
Topological autoencoders (Moor et al., 2020) are conceptually closest to our idea of retaining topological properties during dimension reduction. They compute the MST on all points, which produces less stable results than our density-based approach on cluster centroids.

Variational autoencoders (VAEs) (Kingma and Welling, 2013), a generative AE variant, have also been explored. A popular VAE for scRNA-seq data is scVI (Lopez et al., 2018), which explicitly models batch effects and library sizes. Instead, scVAE (Grønbech et al., 2020) investigates likelihood functions suitable for scRNA-seq data and proposes a clustering model in latent space. DR-A (Lin et al., 2020) applies adversarial training instead of the variational objective. Finally, scvis (Ding et al., 2018) is a VAE tailored to visualization and uses a t-SNE-like regularization term in the latent space. Ivis (Szubert et al., 2019) employs a triplet loss function and a siamese neural network instead of an AE to preserve the nearest-neighbor relations in the visualization.

Both scDeepCluster and scVAE shape the latent space into disconnected clusters, which is orthogonal to our goal of illustrating continuous developmental hierarchies. scVI, scGAE and scDeepCluster work with a latent space dimension larger than two and thus require an additional dimension reduction, typically with t-SNE, to visualize the data.

Algorithm 1 Density tree generation
Require: High-dimensional data X ∈ R^{n×d}
Require: Number of k-means centroids k
procedure GenerateTree(X, k)
    C ← KMeans(X, k)    ▷ O(nkdt), with t the number of iterations
    G = (C, E), the complete graph on our centroids
    for {i, j} a two-element subset of {1, . . . , k} do    ▷ O(k^2)
        d_{i,j} ← 0
    end for
    for i = 1, . . . , |X| do    ▷ O(nk)
        a ← argmin_{j=1,...,k} \|x_i − c_j\|_2    ▷ Nearest centroid
        b ← argmin_{j=1,...,k, j≠a} \|x_i − c_j\|_2    ▷ Second-nearest centroid
        d_{a,b} ← d_{a,b} + 1    ▷ Increase nearest centroids' edge strength
    end for
    T ← MaxSpanningTree(G, d)    ▷ O(k^2 log k)
    return T, d    ▷ Returns the density tree and the edge strengths
end procedure
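The steps of algorithm 1 are straightforward to reproduce with standard Python tooling. The following is a minimal sketch, assuming scikit-learn for k-means and SciPy for the spanning tree; the function and variable names are ours for illustration and do not refer to the authors' released implementation.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.sparse.csgraph import minimum_spanning_tree

    def generate_density_tree(X, k=50, random_state=0):
        """Sketch of Algorithm 1: k-means centroids, Hebbian edge counts,
        maximum spanning tree. Illustrative, not the reference code."""
        km = KMeans(n_clusters=k, random_state=random_state).fit(X)
        centroids = km.cluster_centers_

        # For every sample, find its two nearest centroids and increment the
        # corresponding edge weight: an empirical estimate of the density in
        # the second-order Voronoi region of that centroid pair.
        dists = km.transform(X)                       # (n, k) distances
        nearest_two = np.argsort(dists, axis=1)[:, :2]
        weights = np.zeros((k, k))
        for a, b in nearest_two:
            weights[a, b] += 1
            weights[b, a] += 1

        # SciPy only provides a minimum spanning tree; negating the weights
        # yields the maximum spanning tree over the edge densities. Edges
        # with zero weight (unsupported by data) are ignored automatically.
        mst = minimum_spanning_tree(-weights)
        tree = mst.toarray() < 0                      # boolean adjacency
        return centroids, weights, tree | tree.T

The returned adjacency can later be fed to a breadth-first search to obtain the geodesic distances used for the push-pull loss below.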
Neither of the pure visualization methods aims to bring out the hierarchical properties often present in scRNA-seq datasets. In particular, they do not use the data density to infer lineages. None of them provides a graph summary of the data. Our contribution, however, is to supply the user with a tree-shaped graph summarizing the hierarchies along dense lineages in the data as well as a 2D embedding that respects this tree shape.

3 Methods

3.1 Approximating the High-dimensional scRNA-seq Data with a Tree

To summarize the high-dimensional data in terms of a tree, the minimum spanning tree (MST) on the Euclidean distances is an obvious choice. This route is followed by Moor et al. (2020), who reproduce the MST obtained on their high-dimensional data in their low-dimensional embedding. However, scRNA-seq data can be noisy, and an MST built on all of our data is very sensitive to noise. Therefore, we first run k-means clustering on the original data, yielding more robust centroids for the MST construction and also reducing downstream complexity.

A problem with the Euclidean MST, illustrated in figure 2, is that two centroids can be close in Euclidean space without having many data points between them. In such a case, a Euclidean MST would not capture the skeleton of our original data well. But it is crucial that the extracted tree follows the dense regions of the data if we want to visualize developmental trajectories of differentiating cells: a trajectory is plausible if we observe intermediate cell states and unlikely if there are jumps in the development. By preferring tree edges in high-density regions of the data, we ensure that the computed spanning tree is biologically plausible. Following this rationale, we build the maximum spanning tree on the complete graph over centroids whose edge weights are given by the density of the data along each edge, instead of the minimum spanning tree on Euclidean distance. This results in a tree that (we believe) captures Waddington's hypothesis better than merely considering cumulative differences in expression levels.

To estimate the support that a data sample provides for an edge, we follow Martinetz and Schulten (1994). Consider the complete graph G = (C, E) such that C = {c_1, . . . , c_k} is the set of centroids. In the spirit of Hebbian learning, that is, emphasizing connections that appear frequently, we count, for each edge, how often its incident vertices are the two closest centroids to any given datum.

As pointed out by Martinetz and Schulten (1994), this amounts to an empirical estimate of the integral of the density of observations across the second-order Voronoï region (defined as the set of points having a particular pair of centroids as their two nearest centroids) associated with this pair of cluster centers.
Figure 2: (left, middle) Comparison of the tree built on k-means centroids using Euclidean distance or density weights. The data was generated using the PHATE library (Moon et al., 2019), with 3 branches in 2D. Original data points are transparently overlaid to better visualize their density. While the tree based on the Euclidean distance places connections between centroids that are close but have only a few data points between them (see red ellipse), our tree based on the data density instead includes those edges that lie in high-density regions (see pink ellipse). (right) Complete graph over centroids and its Hebbian edge weights. Null-weight edges, that is, edges not supported by data, are omitted for clarity.
Finally, we compute the maximum spanning tree over these Hebbian edge weights. Our strategy for building the tree is summarized in algorithm 1.

Our data-density based tree follows the true shape of the data more closely than an MST based on the Euclidean distance weights, as illustrated in figure 2. We claim this indicates that it is a better choice for capturing developmental trajectories. Having extracted the tree shape in high dimensions, our goal is to reproduce this tree as closely as possible in our embedding. To this end, we use an autoencoder with encoder f and decoder g and train it with the usual reconstruction loss

    L_{rec} = MSE(X, g(f(X))) = \frac{1}{N} \sum_{x_i \in X} \|x_i - g(f(x_i))\|_2^2 .    (1)

This term is the typical loss function for an autoencoder and ensures that the embedding is as faithful to the original data as possible, forcing it to extract the most salient data features.
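In code, equation (1) amounts to a few lines. A hedged sketch, where f and g stand for any encoder/decoder pair (a concrete architecture is sketched in section 3.3):

    import torch

    def reconstruction_loss(x, f, g):
        # Eq. (1): squared reconstruction error, summed over features
        # and averaged over the N samples.
        x_hat = g(f(x))
        return ((x - x_hat) ** 2).sum(dim=1).mean()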
To move each embedding h_i = f(x_i) into the correct second-order Voronoï region, we combine a push and a pull term. Denote by c_{i,1} and c_{i,2} the two centroids currently closest to h_i in embedding space, and by c'_{i,1} and c'_{i,2} the embedding-space positions of the two centroids closest to x_i in high-dimensional space. A naïve formulation is

    \tilde{L}_{push}(h_i) = -\big( \|h_i - c_{i,1}\|_2 + \|h_i - c_{i,2}\|_2 \big)^2    (2)

    \tilde{L}_{pull}(h_i) = \big( \|h_i - c'_{i,1}\|_2 + \|h_i - c'_{i,2}\|_2 \big)^2    (3)

    \tilde{L}_{push-pull} = \frac{1}{N} \sum_{x_i \in X} \tilde{L}_{push}(f(x_i)) + \tilde{L}_{pull}(f(x_i)).    (4)

The push loss decreases as h_i and the currently closest centroids, c_{i,1} and c_{i,2}, are placed further apart from each other, while the pull loss decreases when h_i gets closer to the correct centroids, c'_{i,1} and c'_{i,2}. Indeed, the push-pull loss term is minimized if and only if each embedding h_i lies in the second-order Voronoï region of those low-dimensional centroids whose high-dimensional counterparts contain the data point x_i in their second-order Voronoï region. In other words, the loss is zero precisely when we reproduce the edge densities from high dimension in low dimension. Note that we let the gradient flow both through the individual embeddings and through the centroids, which are means of embeddings themselves.

This naïve formulation of the push-pull loss has the drawback that it can become very small if all embeddings are nearly collapsed into a single point, which is undesirable for visualization. Therefore, we normalize the contribution of every embedding h_i by the distance between the two correct centroids in embedding space. This prevents the collapsing of embeddings and also ensures that each data point x_i contributes equally, regardless of how far apart c'_{i,1} and c'_{i,2} are. The push-pull loss thus becomes

    L_{push}(h_i) = -\left( \frac{\|h_i - c_{i,1}\|_2 + \|h_i - c_{i,2}\|_2}{\|c'_{i,1} - c'_{i,2}\|_2} \right)^2    (5)

    L_{pull}(h_i) = \left( \frac{\|h_i - c'_{i,1}\|_2 + \|h_i - c'_{i,2}\|_2}{\|c'_{i,1} - c'_{i,2}\|_2} \right)^2    (6)

    L_{push-pull} = \frac{1}{N} \sum_{x_i \in X} L_{push}(f(x_i)) + L_{pull}(f(x_i)).    (7)

So far, we have only used the density information from high-dimensional space for the embedding, but not the extracted density tree itself. The push-pull loss in equation (7) is agnostic to the positions of the involved centroids within the density tree; only their Euclidean distance to the embedding h_i matters. In contrast, the hierarchical structure is important for the biological interpretation of the data: it is much less harmful if an embedding is placed close to two centroids that are on the same branch of the density tree than if it is placed between two different branches. In the first case, cells are merely not ordered correctly within a trajectory, while in the second case we get false evidence for an altogether different pathway.

We tackle this problem by reweighing the push-pull loss with the geodesic distance along the density tree. The geodesic distance d_{geo}(c_a, c_b) with c_a, c_b ∈ C is defined as the number of edges on the shortest path between c_a and c_b in the density tree. Centroids at the ends of different branches of the density tree have a higher geodesic distance than centroids nearby on the same branch. By weighing the push-pull loss contribution of an embedded point by the geodesic distance between its two currently closest centroids, we focus the push-pull loss on embeddings which erroneously lie between different branches. The geodesic distances can be computed quickly in O(k^2) via breadth-first search, and this only has to be done once before training the autoencoder. The final version of our push-pull loss becomes

    L_{push-pull} = \frac{1}{N} \sum_{x_i \in X} d_{geo}(c_{i,1}, c_{i,2}) \cdot \big( L_{push}(f(x_i)) + L_{pull}(f(x_i)) \big).    (8)

Note that the normalized push-pull loss in equation (7) and the geodesically reweighted push-pull loss in (8) both also get minimized if and only if the closest centroids in embedding space correspond to the closest centroids in high-dimensional space.
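The reweighted loss in equation (8) can be sketched in PyTorch as follows. How the embedding-space centroids are recomputed from the k-means assignments, as well as all tensor names, are our assumptions rather than the reference implementation.

    import torch

    def push_pull_loss(h, assign_hd, d_geo):
        """Sketch of Eqs. (5)-(8). Illustrative, not the authors' code.

        h         : (N, 2) embeddings f(X)
        assign_hd : (N, 2) indices of the two nearest centroids of each x_i
                    in high-dimensional space (the 'correct' centroids)
        d_geo     : (k, k) tensor of geodesic distances along the density tree
        """
        k = d_geo.shape[0]
        # Embedding-space centroids are means of the embeddings assigned to
        # them, so gradients flow through points and centroids alike.
        one_hot = torch.zeros(h.shape[0], k, device=h.device)
        one_hot.scatter_(1, assign_hd[:, :1], 1.0)
        centroids = (one_hot.T @ h) / one_hot.sum(0).unsqueeze(1).clamp(min=1)

        near = torch.cdist(h, centroids).topk(2, largest=False).indices
        c1, c2 = centroids[near[:, 0]], centroids[near[:, 1]]              # current
        c1p, c2p = centroids[assign_hd[:, 0]], centroids[assign_hd[:, 1]]  # correct

        scale = (c1p - c2p).norm(dim=1).clamp(min=1e-8)
        push = -(((h - c1).norm(dim=1) + (h - c2).norm(dim=1)) / scale) ** 2
        pull = (((h - c1p).norm(dim=1) + (h - c2p).norm(dim=1)) / scale) ** 2
        geo = d_geo[near[:, 0], near[:, 1]]     # reweighting of Eq. (8)
        return (geo * (push + pull)).mean()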
3.2.3 Compactness loss

The push-pull loss replicates the empirical high-dimensional data density in embedding space by moving the embeddings into the correct second-order Voronoï region, which can be large or unbounded. For optimal visibility of the tree structure, an embedding should not only be in the correct second-order Voronoï region, but lie compactly around the line between its two centroids. To achieve this, we add the compactness loss, which is just another instance of the pull loss:

    L_{comp} = \frac{1}{N} \sum_{x_i \in X} \left( \frac{\|h_i - c'_{i,1}\|_2 + \|h_i - c'_{i,2}\|_2}{\|c'_{i,1} - c'_{i,2}\|_2} \right)^2    (9)

            = \frac{1}{N} \sum_{x_i \in X} L_{pull}(f(x_i)).    (10)

The compactness loss is minimized if the embedding h_i lies exactly between the correct centroids c'_{i,1} and c'_{i,2}, and it has elliptic contour lines with foci at the centroids.

3.2.4 Cosine loss
Since the encoder is a powerful non-linear map, it can introduce artifactual curves in the low-dimensional tree branches. Especially tight turns can impede the visual clarity of the embedding. As a remedy, we propose an optional additional loss term that tends to straighten branches.

Centroids at which the embedding should be straight are the ones within a branch, but not at a branching event of the density tree. The former can easily be identified as the centroids of degree 2.

Let c be a centroid in embedding space of degree 2 with its two neighboring centroids n_{c,1} and n_{c,2}. The branch is straight at c if the two vectors c − n_{c,1} and n_{c,2} − c are parallel or, equivalently, if their cosine is maximal. Denoting by C_2 = {c ∈ C | deg(c) = 2} the set of all centroids of degree 2, considered in embedding space, we define the cosine loss as

    L_{cosine} = \frac{1}{|C_2|} \sum_{c \in C_2} \left( 1 - \frac{(c - n_{c,1}) \cdot (n_{c,2} - c)}{\|c - n_{c,1}\|_2 \, \|n_{c,2} - c\|_2} \right).    (11)

Essentially, it measures the cosine of the angles along the tree branches and becomes minimal if all these angles are zero and the branches are straight.

A generalization of this criterion that deals with noisy edges in the density tree is discussed in section B of the appendix.
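Equation (11) reduces to a cosine similarity between consecutive branch segments. Here is a sketch under the assumption that the degree-2 centroids and their tree neighbors have been precomputed from the density tree:

    import torch
    import torch.nn.functional as F

    def cosine_loss(centroids, deg2_idx, nbr_idx):
        """Sketch of Eq. (11). Illustrative, not the authors' code.

        centroids : (k, 2) centroid positions in embedding space
        deg2_idx  : (m,) indices of centroids of degree 2 in the density tree
        nbr_idx   : (m, 2) indices of their two tree neighbors
        """
        c = centroids[deg2_idx]
        v1 = c - centroids[nbr_idx[:, 0]]   # segment entering the centroid
        v2 = centroids[nbr_idx[:, 1]] - c   # segment leaving the centroid
        return (1.0 - F.cosine_similarity(v1, v2, dim=1)).mean()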
3.2.5 Complete loss function

Combining the four loss terms of the preceding sections, we arrive at our final loss

    L = \lambda_{rec} L_{rec} + \lambda_{push-pull} L_{push-pull} + \lambda_{comp} L_{comp} + \lambda_{cos} L_{cos}.    (12)

The relative importance of the loss terms, especially of L_{comp} and L_{cos}, which control finer aspects of the visualization, might depend on the use case. In practice, we found \lambda_{rec} = \lambda_{push-pull} = \lambda_{comp} = 1 and \lambda_{cos} = 50 to work well. This configuration reduces the number of weights to adjust from four to one.

An ablation study of the different losses' contributions is available in section C of the appendix. Its main conclusion is that while the push-pull and reconstruction losses are sufficient to obtain satisfactory results, the addition of the compactness and cosine losses improves the visualizations further and facilitates reproducibility. Empirically, we found that adding the compactness loss without the cosine loss sometimes leads to discontinuous embeddings; the two loss terms should therefore be added or omitted jointly. If the default loss weights are not satisfactory, we recommend adjusting the cosine loss weight first. To understand how changing the loss parameters may affect the results, please refer to the qualitative results in the ablation study.

3.3 Training procedure

Firstly, we compute the k-means centroids, the edge densities, the density tree and the geodesic distances. This has to be done only once, as an initialization step. Secondly, we pretrain the autoencoder with only the reconstruction loss via stochastic gradient descent on minibatches. This provides a warm start for finetuning the autoencoder with all losses in the third step. During finetuning, all embedding points are needed to compute the centroids in embedding space; therefore, we perform full-batch gradient descent during finetuning. For algorithmic details regarding the training procedure, see supplementary algorithm S1.

We always used k = 50 centroids for k-means clustering in our experiments. This number needs to be high enough so that the tree yields a skeleton of the data, but not so high that the density loses its meaning. k = 50 is a default value that works well in a variety of scenarios. Our autoencoder always has a bottleneck dimension of 2 for visualization. In the experiments, we used layers of the following dimensions: d (the input dimension), 2048, 256, 32, 2, 32, 256, 2048, d. This results in symmetrical encoders and decoders with four layers each. While not necessary in our experiments, if a lighter network is desired, we recommend applying PCA first to reduce the number of input dimensions, or filtering out more genes during preprocessing. We omitted hidden layers of dimension larger than the input. We use fully connected layers with ReLU activations after every layer but the last encoder and decoder layers, and we employ the Adam (Kingma and Ba, 2017) optimizer with learning rate 2 × 10^-4 for pretraining and 1 × 10^-3 for finetuning, unless stated otherwise. We used a batch size of 256 for pretraining in all experiments.
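The architecture described above translates directly into PyTorch. The layer sizes and activations follow the text; the class name and interface are ours.

    import torch.nn as nn

    class DTAELikeAutoencoder(nn.Module):
        """Encoder d -> 2048 -> 256 -> 32 -> 2 with a mirrored decoder;
        ReLU after every layer except the last encoder and decoder layers."""
        def __init__(self, d):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(d, 2048), nn.ReLU(),
                nn.Linear(2048, 256), nn.ReLU(),
                nn.Linear(256, 32), nn.ReLU(),
                nn.Linear(32, 2),              # 2D bottleneck, no activation
            )
            self.decoder = nn.Sequential(
                nn.Linear(2, 32), nn.ReLU(),
                nn.Linear(32, 256), nn.ReLU(),
                nn.Linear(256, 2048), nn.ReLU(),
                nn.Linear(2048, d),            # reconstruction, no activation
            )

        def forward(self, x):
            h = self.encoder(x)
            return h, self.decoder(h)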
4 Results

In this section, we show the performance of our method on toy and real scRNA-seq datasets and compare it to a vanilla autoencoder, to the popular non-parametric methods PCA, Force Atlas 2, UMAP and PHATE, and to the most prevalent neural network-based approaches, SAUCIE, DCA and scVI. For all network-based approaches, we choose a bottleneck of dimension 2 to use them directly for visualization.

4.1 PHATE generated data

We applied our method to an artificial dataset created with the library published alongside Moon et al. (2019) to demonstrate its functionality in a controlled setting.
Figure 3: Results obtained using data generated by the PHATE library. Branches are colored by ground-truth labels.
We generated a toy dataset whose skeleton is a tree with one backbone branch and 9 branches emanating from the backbone, consisting in total of 10,000 points in 100 dimensions. We pretrained for 150 epochs with a learning rate of 10^-3 and finetuned for another 150 epochs with a learning rate of 10^-2.

Figure 3 shows the visualization results. The finetuning significantly improves the results of the pretrained autoencoder, whose visualization collapses the grey and green branches onto the blue branch. All methods other than DCA, scVI and PCA achieve satisfactory results that make the true tree structure of the data evident. While PHATE, UMAP and Force Atlas 2 produce overly crisp branches compared to the PCA result, the reconstruction loss of our autoencoder guards us from collapsing the branches into lines. PHATE appears to overlap the cyan and yellow branches near the backbone, and UMAP introduces artificially curved branches. scVI collapses the green and brown as well as the pink and cyan branches together, giving hard-to-interpret visualizations. The results on this toy dataset demonstrate that our method can embed high-dimensional hierarchical data into 2D and emphasize its tree structure while collapsing less information than state-of-the-art methods. In our method, all branches are easily visible.

4.2 Endocrine pancreatic cell data

We evaluated our method on the data from Bastidas-Ponce et al. (2019). It represents endocrine pancreatic cells at different stages of their development and consists of gene expression information for 36351 cells and 3999 genes. Preprocessing information can be found in Bastidas-Ponce et al. (2019). We pretrained for 300 epochs and used 250 epochs for finetuning.

Figure 4 and supplementary figure S5 depict visualizations of the embryonic pancreas development with different methods. Our method can faithfully reproduce the tree structure of the data, especially for the endocrine subtypes. The visualized hierarchy is biologically plausible, with a particularly clear depiction of the α-, β- and ε-cell branches and a visible, albeit too strong, separation of the δ-cells. This is in agreement with the results from Bastidas-Ponce et al. (2019). UMAP also performs very well and attaches the δ-cells to the main trajectory. However, the α- and β-cell branches are not as prominent as in DTAE. PHATE does not manage to separate the δ- and ε-cells discernibly from the other endocrine subtypes. As on the toy data in figure 3, it produces overly crisp branches for the α- and β-cells. PCA mostly overlays all endocrine subtypes. All methods but the vanilla autoencoder show a clear branch with tip and acinar cells and one via EP and Fev+ cells to the endocrine subtypes, but only DTAE, DCA, SAUCIE and scVI manage to also hint at the more generic trunk and multipotent cells from which these two major branches emanate. However, SAUCIE, DCA and scVI fail to produce a meaningful separation between the α- and β-cell branches. The ductal and Ngn3 low EP cells overlap in all methods.

It is worth noting that the autoencoder alone was not able to visualize meaningful hierarchical properties of the data. However, the density tree-biased finetuning in DTAE made this structure evident, highlighting the benefits of our approach.

In figure 4, we overlay DTAE's embedding with a pruned version of the density tree and see that the visualization closely follows the tree structure around the differentiated endocrine cells. This combined representation of low-dimensional embedding and overlaid density tree further facilitates the identification of branching events, most notably for the α- and β-cells, and shows the full power of our method. It also provides an explanation for the apparent separation of the δ-cells: since there are relatively few δ-cells, they are not represented by a distinct k-means centroid.
Figure 4: Pruned density tree superimposed over embeddings of the endocrine pancreatic cell dataset, colored by cell sub-
types. We use finer labels for the endocrine cells. Darker edges represent denser edges. Only edges with more than 100
points contributing to them are plotted here.
Our method places more k-means centroids in the dense region in the lower right part of DTAE's panel in figure 4 than is appropriate to capture the trajectories, resulting in many small branches. Fortunately, this does not result in an exaggerated tree-shaped visualization that follows every spurious branch, which we hypothesize is thanks to the successful interplay between the tree bias and the reconstruction aim of the autoencoder: if the biological signal encoded in the gene expressions can be reconstructed by the decoder from an embedding with enhanced hierarchical structure, the tree bias shapes the visualization accordingly. Conversely, an inappropriate tree shape is prevented if it would impair the reconstruction. Overall, the density tree recovers the pathways identified in Bastidas-Ponce et al. (2019) to a large extent. Only the trajectory from multipotent via tip to acinar cells includes an unexpected detour via the trunk and ductal cells, which the autoencoder mends by placing the tip next to the multipotent cells.

The density tree also provides useful information in conjunction with other dimension reduction methods. In figure 4, we overlay their visualizations with the pruned density tree by computing the centroids in the respective embedding spaces according to the k-means cluster assignments. The density tree can help to find branching events and gain insights into the hierarchical structure of the data that is visualized with an existing dimension reduction method. For instance, together with the density tree, we can identify the ε-cells as a separate branch and find the location of the branching event into different endocrine subtypes in the UMAP embedding.

4.3 T-cell infection data

We further applied our method to T-cell data of a chronic and an acute infection, which was shared with us by the authors of Cerletti et al. (2020). The data was preprocessed using the method described in Zheng et al. (2017); for more details, see Cerletti et al. (2020). It contains gene expression information for 19029 cells and 4999 genes. While we used the combined dataset to fit all dimension reduction methods, we only visualize the 13707 cells of the chronic infection for which we have phenotype annotations from Cerletti et al. (2020), allowing us to judge visualization quality from a biological viewpoint. We pretrained for 600 epochs and used 250 epochs for finetuning.

Figure 5 and supplementary figure S6 demonstrate that our method makes the tree structure of the data clearly visible. The visualized hierarchy is also biologically significant: the two branches on the right correspond to the memory-like and terminally exhausted phenotypic states, which are identified as the main terminal fates of the differentiation process in Cerletti et al. (2020).
Figure 5: Pruned density tree superimposed over embeddings of the chronic part of the T-cell data, colored by phenotypes.
Darker edges represent denser edges. Only edges with more than 100 points contributing to them are plotted here.
Furthermore, the purple branch at the bottom contains the proliferating cells. Since the cell cycle affects cell transcription significantly, those cells are expected to be distinct from the rest.

It is encouraging that DTAE makes the expected biological structure apparent even without relying on known marker genes or differential cell expression, which were used to obtain the phenotypic annotations in Cerletti et al. (2020).

Interestingly, our method places the branching event towards the memory-like cells in the vicinity of the exhausted cells, as does UMAP, while Cerletti et al. (2020) recognized a trajectory directly from the early-stage cells to the memory-like fate. The exact location of a branching event in a cell differentiation process is difficult to determine precisely. We conjecture that fitting the dimensionality reduction methods on the gene expression measurements of cells from an acute infection, in addition to those from the chronic infection analyzed in Cerletti et al. (2020), provided additional evidence for the trajectory via exhausted cells to the memory-like fate. Unfortunately, an in-depth investigation of this phenomenon is beyond the scope of this methodological paper.

The competing methods expose the tree structure of the data less obviously than DTAE. The finetuning significantly improves the results of the autoencoder, which shows no discernible hierarchical structure. PHATE separates the early cells, the proliferating cells and the rest, but its layout is very tight around the biologically interesting branching event towards memory-like and terminally exhausted cells. PCA exhibits only the coarsest structure and fails to separate the later states visibly. The biological structure is decently preserved in the UMAP visualization, but the hierarchy is less apparent than in DTAE. SAUCIE, scVI and Force Atlas 2 produce results that are very similar to PCA, with later states that are hard to distinguish. DCA produces results that are very similar to the vanilla autoencoder: even though the later states are visible, there is a significant amount of noise in the embedding, making the analysis difficult. Overall, our method outperforms the other visualization methods on this dataset.

In figure 5, we have overlaid our embedding with a pruned version of the density tree and see that DTAE's visualization indeed closely follows the tree structure. It is noteworthy that even the circular behavior of the proliferating cells is accurately captured by a self-overlaid branch, although our tree-based method is not directly designed to extract circular structure.

Figure 5 also shows the other dimension reduction methods in conjunction with the pruned density tree. Reassuringly, we find that all methods embed the tree in a plausible way, i.e., without many self-intersections or oscillating branches. This is evidence that our density tree indeed captures a meaningful tree structure of the data. As for the endocrine pancreas dataset, the density tree can enhance hierarchical structure in visualizations of existing dimension reduction methods.
Type of metric:     Local           Global (Euclidean)    Global (Geodesic)     Voronoï
Method           ARI     k-NN    Pearson  Spearman     Pearson  Spearman    1st order  2nd order   All
DTAE (Ours)      93.75   48.70   85.51    72.91        82.39    87.19       98.24      94.21       82.86
AE               74.83   70.96   87.41    77.20        70.16    73.23       89.83      58.43       75.26
PHATE            84.76   73.48   45.43    46.04        74.15    78.45       85.27      44.04       66.45
UMAP             78.88   87.75   53.42    54.31        79.40    80.12       83.31      55.94       71.64
SAUCIE           89.99   67.43   82.22    78.50        84.03    85.41       96.43      78.58       82.83
DCA              49.79   64.37   76.54    90.95        40.40    65.92       63.26      49.33       62.57
scVI             74.80   54.30   87.82    67.68        75.45    82.75       86.42      57.77       73.37
Force Atlas 2    72.88   72.23   37.28    48.06        35.67    76.65       77.27      43.27       57.91
PCA              60.40   40.78   73.42    66.02        96.44    96.40       80.76      56.82       71.38

Table 1: Relative quantitative performances averaged over all studied datasets. For each metric, we give the best performing method a value of 100 and scale the other results proportionally. The metrics are described in section 4.4; higher values indicate better performance. The rightmost column contains the average relative performance over all metrics. DTAE and SAUCIE have the best performance overall, with DTAE excelling in the Voronoï metrics and ARI.
It, for example, clarifies in the UMAP plot that the pathway towards the terminally exhausted cells is via the exhausted and effector-like cells and not directly via the proliferating cells.

4.4 Quantitative analysis

The purpose of a visualization method is to make the most salient, qualitative properties of a dataset visible. Nevertheless, a quantitative evaluation can support the comparison of visualization methods and provide evidence that the data and its visualization are structurally similar. Unfortunately, there is, to our knowledge, no consensus as to which metric aligns with practitioners' notion of a useful visualization. Hence, no single metric can validate the quality of a method. This is why it is important to use multiple metrics, so that one can hope for a more reliable result.

We selected eight different metrics, some of which have been employed to judge visualization methods before (Moon et al., 2019; Kobak and Berens, 2019; Becht et al., 2019). The first group of metrics considers the local structure. We compute the Adjusted Rand Index (ARI) between a k-means clustering in high and low dimension, and the number of correct neighbors in the k-NN graph in high and low dimension. The next category are global metrics, which rely on distance preservation. Euclidean distances are computed in low dimension, and Euclidean or geodesic distances are computed in high dimension. Then correlations are computed between those distances. Finally, we use Voronoï diagram based metrics. First- or second-order Voronoï diagrams on the k-means centroids are computed using the k-means assignments to obtain the seeds in low-dimensional space. Then the ratio of points placed in the correct Voronoï region is computed. When using the second-order Voronoï diagram with k = 50, there is a bias towards DTAE, since we optimize this criterion. For local and Voronoï diagram based metrics, we have to adjust a parameter k (either for k-means clustering or for a k-NN graph). We vary the value of k between 10 and 100 with a step of 10 and report the area under the curve.
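As an illustration of the Voronoï based scores, the sketch below computes the second-order variant: the fraction of points whose unordered pair of two nearest centroids agrees between the high-dimensional data and the embedding. The function name and tie handling are our assumptions.

    import numpy as np
    from scipy.spatial.distance import cdist

    def second_order_voronoi_score(X_hd, X_2d, centroids_hd, centroids_2d):
        """Share of points lying in the matching second-order Voronoi
        region in high and low dimension (sketch)."""
        def two_nearest(points, centers):
            pair = np.argsort(cdist(points, centers), axis=1)[:, :2]
            return np.sort(pair, axis=1)   # order within the pair is irrelevant
        hd = two_nearest(X_hd, centroids_hd)
        ld = two_nearest(X_2d, centroids_2d)
        return float(np.mean(np.all(hd == ld, axis=1)))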
We report results aggregated over all three datasets in table 1; full results are available in supplementary table S5. This aggregation makes it easier to deduce general patterns of performance across multiple datasets. From the results on all datasets, we can clearly see that DTAE outperforms the other methods on the Voronoï diagram based metrics, in part due to the bias towards them for k = 50. On local metrics, DTAE achieves the best performance on ARI, followed closely by SAUCIE. However, for k-NN preservation, UMAP performs better than the other methods by a significant margin, which is consistent with the criterion it optimizes (Damrich and Hamprecht, 2021). For Euclidean distance preservation, autoencoder-based methods perform the best, with no clear winner overall. For geodesic distance preservation, PCA performs the best, even though it produced poor visualizations. This is in line with previous findings (Kobak and Berens, 2019). Most other methods obtained very similar performance on this metric, making it hard to conclude that any method performs better than another.

In order to more easily compare methods, aggregated performances over all metrics are reported in the rightmost column of table 1. This aggregation makes it easier to evaluate the overall performance of a method when using a wide variety of criteria. We chose the arithmetic mean to combine the results for simplicity's sake. From this, we can see that DTAE and SAUCIE perform significantly better than the other methods.
disconnected density trees by cutting edges below a density threshold. However, if little is known a priori about the structure of the dataset, a more general dimension reduction method might be preferable for initial data exploration.

References

Cannoodt, R. et al (2016). SCORPIUS improves trajectory inference and identifies novel modules in dendritic cell development. Preprint, Bioinformatics.
Cerletti, D. et al (2020). Fate trajectories of CD8+ T cells in chronic LCMV infection. Preprint, Immunology.
Damrich, S. and Hamprecht, F.A. (2021). On UMAP's true loss function. arXiv:2103.14608 [cs, stat].
Ding, J. et al (2018). Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nature Communications, 9(1), 1–13.
Eraslan, G. et al (2019). Single-cell RNA-seq denoising using a deep count autoencoder. Nature Communications, 10(1), 1–14.
Grønbech, C.H. et al (2020). scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics, 36(16), 4415–4422.
Hochgerner, H. et al (2018). Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nature Neuroscience, 21(2), 290–299.
Kingma, D.P. and Ba, J. (2017). Adam: a method for stochastic optimization.
Kobak, D. and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1), 5416.
Maaten, L.v.d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86), 2579–2605.
Street, K. et al (2018). Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics, 19(1), 477.
Szubert, B. et al (2019). Structure-preserving visualisation of high dimensional single-cell datasets. Scientific Reports, 9(1), 1–10.
Tian, T. et al (2019). Clustering single-cell RNA-seq data with a model-based deep learning approach. Nature Machine Intelligence, 1(4), 191–198.
Waddington, C.H. (1957). The strategy of the genes: a discussion of some aspects of theoretical biology. Routledge Library Editions: 20th Century Science. Routledge.
Wolf, F.A. et al (2019). PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biology, 20(1), 59.
Zheng, G.X.Y. et al (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1).
A Training loop algorithm

Algorithm S1 Training loop
Require: Autoencoder (g ◦ f)_θ
Require: Pretraining epochs n_p, batch size b and learning rate α_p
Require: Finetuning epochs n_f and learning rate α_f
Require: Weight parameters for the loss λ_rec, λ_push-pull, λ_comp, λ_cos
 1: T, C, C_2, d_geo ← Initialization(X)
 2: # Pretraining
 3: for t = 0, 1, . . . , n_p do
 4:     for i = 0, 1, . . . , n/b do    ▷ minibatches over the n samples
 5:         Sample a minibatch m from X
 6:         m̂ ← g(f(m))
 7:         L ← L_rec
 8:         θ_{t+1} ← θ_t − α_p ∇L
 9:     end for
10: end for
11: # Finetuning
12: for t = n_p, . . . , n_p + n_f do
13:     h ← f(X)
14:     X̂ ← g(h)
15:     L ← λ_rec L_rec + λ_push-pull L_push-pull + λ_comp L_comp + λ_cos L_cos
16:     θ_{t+1} ← θ_t − α_f ∇L
17: end for
17: end for not make much sense, so it is not an interesting scenario).
We will not study the influence of the weights for every
loss since the default weights of 1 lead to good performance
and this configuration significantly reduces the dimension
of the hyperparameter space. All experiments are described
B Cosine loss generalization in table S1.
The performance will be evaluated both qualitatively and
The definition of a vertex’ degree in a graph as the number quantitatively on all three discussed datasets to demonstrate
of incident edges to it is not perfect, as it does not take into as clearly as possible the impact of every loss term.
account the noisiness of the graph. On real datasets, we may
have stray clusters which lead to noisy edges in the density Experiment Lrec Lpush-pull Lcomp Lcos (weight)
graph. These usually manifest as edges with only one point A X
contributing to them in high dimension. This leads to ver- B X X
tices with an effective degree of 2 that have a higher degree C X X
due to these noisy edges, and are thus ignored by the cosine D X X X
loss. E X X X(50)
To remedy this, we introduce a different definition of de- F X X X X(50)
gree. We consider a threshold t ∈ [0, 100] and define the
degree of a vertex as the smallest number of incident edges Table S1: List of loss parameters for our ablations.
that account for t% of all points contributing to the vertex’s
incident edges. As t gets closer to a hundred, we converge As can be seen in figures S1,S2 and S3, the compactness
to the original definition of degree. loss alone is not sufficient to obtain a good representation
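A small sketch of this generalized degree, under the assumption that the incident edges are sorted by descending weight so that the heaviest edges are counted first:

    import numpy as np

    def generalized_degree(weights, t=100.0):
        """Generalized degree of a single vertex (sketch).

        weights : 1D array of Hebbian weights of the vertex's incident edges
        t       : threshold in [0, 100]; t = 100 recovers the classical degree
        """
        w = np.sort(np.asarray(weights, dtype=float))[::-1]  # heaviest first
        frac = np.cumsum(w) / w.sum()
        # Smallest n whose n heaviest edges hold at least t% of the mass.
        n = int(np.searchsorted(frac, t / 100.0 - 1e-12)) + 1
        return min(n, len(w))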
C Ablation study

In order to better visualize the contributions of each element of our method, we conducted an ablation study of the different loss parameters and evaluated their impact both qualitatively and quantitatively.

C.1 Loss parameters

The first phenomenon that is studied is the influence of dropping loss terms entirely. The reconstruction loss is always kept, since it is necessary for the embeddings to contain salient information about the data. Not all combinations of loss parameters will be studied, but only those that should be interesting (for example, using only the cosine loss does not make much sense, so it is not an interesting scenario). We will not study the influence of the weights for every loss, since the default weights of 1 lead to good performance and this configuration significantly reduces the dimension of the hyperparameter space. All experiments are described in table S1. The performance will be evaluated both qualitatively and quantitatively on all three discussed datasets to demonstrate as clearly as possible the impact of every loss term.

Experiment   L_rec   L_push-pull   L_comp   L_cos (weight)
A            X
B            X                     X
C            X       X
D            X       X             X
E            X       X                      X (50)
F            X       X             X        X (50)

Table S1: List of loss parameters for our ablations.
Type of metric:                       Local          Global (Euclidean)   Global (Geodesic)    Voronoï
Metric                             ARI     k-NN    Pearson  Spearman    Pearson  Spearman    1st order  2nd order

(a) PHATE generated dataset.
λ_pp = 0, λ_comp = 0, λ_cos = 0    34.62   19.09   0.66     0.66        0.56     0.56        70.01      38.98
λ_pp = 0, λ_comp = 1, λ_cos = 0    38.54   23.20   0.80     0.79        0.74     0.73        70.44      35.24
λ_pp = 1, λ_comp = 0, λ_cos = 0    48.19   25.05   0.78     0.76        0.72     0.70        79.68      56.02
λ_pp = 1, λ_comp = 1, λ_cos = 0    48.70   26.40   0.81     0.78        0.74     0.72        79.26      55.93
λ_pp = 1, λ_comp = 0, λ_cos = 50   45.54   22.71   0.80     0.77        0.74     0.72        78.94      52.97
λ_pp = 1, λ_comp = 1, λ_cos = 50   46.00   24.60   0.81     0.80        0.75     0.74        78.85      53.77

(b) Endocrine pancreas dataset.
λ_pp = 0, λ_comp = 0, λ_cos = 0    34.52   4.17    0.81     0.85        0.57     0.62        65.37      30.83
λ_pp = 0, λ_comp = 1, λ_cos = 0    24.88   2.32    0.65     0.69        0.53     0.59        46.30      11.11
λ_pp = 1, λ_comp = 0, λ_cos = 0    43.07   3.07    0.77     0.79        0.71     0.77        73.79      50.68
λ_pp = 1, λ_comp = 1, λ_cos = 0    44.69   2.93    0.73     0.79        0.66     0.75        72.95      46.75
λ_pp = 1, λ_comp = 0, λ_cos = 50   35.64   2.77    0.73     0.75        0.70     0.75        68.44      38.82
λ_pp = 1, λ_comp = 1, λ_cos = 50   39.79   2.85    0.71     0.74        0.71     0.78        69.24      38.04

(c) T-cells dataset.
λ_pp = 0, λ_comp = 0, λ_cos = 0    29.24   2.20    0.40     0.33        0.40     0.42        35.17      4.50
λ_pp = 0, λ_comp = 1, λ_cos = 0    40.65   1.24    0.15     0.17        0.20     0.20        28.75      2.75
λ_pp = 1, λ_comp = 0, λ_cos = 0    29.72   1.28    0.42     0.26        0.36     0.39        47.19      18.63
λ_pp = 1, λ_comp = 1, λ_cos = 0    45.55   1.15    0.38     0.23        0.35     0.38        37.71      16.29
λ_pp = 1, λ_comp = 0, λ_cos = 50   29.24   1.23    0.37     0.24        0.42     0.44        44.73      12.16
λ_pp = 1, λ_comp = 1, λ_cos = 50   37.25   1.15    0.31     0.19        0.40     0.41        38.41      12.85

Table S2: Quantitative results in different scenarios for DTAE's loss weights.
Figure S1: Results of the ablations on the PHATE generated dataset, colored by ground-truth clusters.

Figure S2: Results of the ablations on the T-cell dataset, colored by phenotypes.
As can be seen in figures S1, S2 and S3, the compactness loss alone is not sufficient to obtain a good representation, since it has no repulsive force. The reconstruction loss helps to avoid a total collapse but is not sufficient to prevent a partial collapse, as visible on the endocrine pancreas and T-cell datasets. While the push-pull loss already gives good results when used alone, since the tree structure is visible, adding the compactness loss yields embeddings in which the points lie compactly along the tree. Without the cosine loss, however, this combination can lead to sparse representations, due to the fact that the seeds of second-order Voronoï cells do not necessarily lie in their cell. This means that points will not necessarily be spread out along the line between two centroids, but only lie inside the intersection of the line between the two centroids and their second-order Voronoï cell, which may be much smaller than the full line between the centroids. Using only the push-pull and cosine losses can lead to satisfying results, but the embedding is more spread out than with the compactness loss. Adding the cosine loss makes all the results cleaner and helps with the density of the point cloud. This effect is discussed in the next section.

From a quantitative point of view, adding all of these losses leads to worse performance than just using the push-pull loss alone. Since the compactness and cosine losses are designed with visualization in mind, they can alter the fidelity of the embedding. For example, making the points tighter along the density tree will lead to pairwise distances that are preserved more poorly, which is an effect that we indeed observe in the global metrics in table S2.

Nonetheless, when looking at the aggregated performances in table S3, we can see that all experiments except the one using the compactness loss alone still perform comparably. As such, the increase in qualitative performance stemming from the addition of losses does not come at the expense of the preservation of the data's intrinsic structure. In particular, the push-pull loss alone drastically improves the visualization not only qualitatively, but also quantitatively.

Figure S3: Results of the ablations on the endocrine pancreas dataset, colored by cell types.

Figure S4: Results obtained on the chronic infection subset of the T-cell dataset when varying the cosine loss weight, colored by phenotypes.

                                   Rel. Perf.
λ_pp = 0, λ_comp = 0, λ_cos = 0    81.27
λ_pp = 0, λ_comp = 1, λ_cos = 0    67.60
λ_pp = 1, λ_comp = 0, λ_cos = 0    92.04
λ_pp = 1, λ_comp = 1, λ_cos = 0    90.66
λ_pp = 1, λ_comp = 0, λ_cos = 50   87.24
λ_pp = 1, λ_comp = 1, λ_cos = 50   86.99

Table S3: Relative performance out of a hundred over all datasets and metrics.

C.2 Cosine loss weight

A parameter that is interesting to study in more detail is the cosine loss weight. While most of the other losses have a significant impact on the embeddings, the cosine loss is mostly cosmetic, and it is important to understand its behavior for low and high weights. The cosine loss weight will only be studied on the T-cell dataset, since this is enough to demonstrate its impact on quantitative and qualitative results.

As can be seen in figure S4, the cosine loss straightens the branches for every weight, as intended. However, with higher weights, it also has a density-regularizing effect: as its weight increases, we obtain a more homogeneous and less clumped point cloud. While there is no clear explanation for this behavior, a hypothesis is that a higher weight means that this criterion will be optimized with higher priority during the finetuning. Since the pretraining produces dense embeddings and the cosine loss has no incentive to produce sparse embeddings, this denser structure is kept during training. On the contrary, the push-pull loss can have a sparsifying effect, since the seeds of second-order Voronoï cells do not necessarily lie in their cells. When the cosine loss weight is smaller, this loss is optimized with higher priority, which would lead to the sparser embeddings. All of this is intimately linked to the dynamics of neural network training and not only to the minimizers of each criterion, making a precise study of this process highly complex.

From a quantitative point of view, a slight decrease in performance is visible in table S4 for all metrics except for the preservation of geodesic distances and of first-order Voronoï diagrams. As a result, the overall performance decreases noticeably when increasing the cosine loss weight; see the rightmost column in table S4.

This again illustrates the trade-off between quantitative and qualitative performance: even though a method performs slightly worse quantitatively, it might still produce results that are easier to interpret for humans.
Type of metric:     Local          Global (Euclidean)   Global (Geodesic)    Voronoï
λ_cos            ARI     k-NN    Pearson  Spearman    Pearson  Spearman    1st order  2nd order   All
1                45.83   1.15    0.37     0.24        0.38     0.39        37.30      16.36       95.68
2                44.77   1.11    0.35     0.24        0.40     0.40        37.90      16.39       95.34
5                45.22   1.09    0.37     0.20        0.40     0.45        37.38      14.12       93.30
10               44.29   1.12    0.35     0.18        0.42     0.46        36.40      13.66       91.83
15               43.50   1.14    0.29     0.15        0.44     0.46        38.78      14.59       90.28
20               43.08   1.17    0.33     0.17        0.42     0.46        37.43      14.41       91.74
50               37.25   1.15    0.31     0.19        0.40     0.41        38.41      12.85       87.50

Table S4: Quantitative results on the T-cells dataset when varying the cosine loss weight. The weights for the push-pull and compactness losses are set to one. The rightmost column contains the average performance over all metrics for a given weight.
D High resolution results
Figure S5: Results obtained on the endocrine pancreatic cell dataset, colored by cell types.
Figure S6: Results obtained on the chronic infection subset of the T-cell dataset, colored by phenotypes.
E Complete quantitative results
Table S5: Full quantitative results on all studied datasets. Metrics are described in section 4.4; higher values indicate better performance.