0% found this document useful (0 votes)
36 views

Visualizing Hierarchies in scRNA-seq Data Using A Density Tree-Biased Autoencoder

This document presents a new method called DTAE (Density Tree-biased AutoEncoder) for visualizing hierarchies in single-cell RNA sequencing (scRNA-seq) data. DTAE first builds a density tree from the data to capture its hierarchical structure, then trains an autoencoder to embed the data in 2D while preserving the density tree structure. The authors introduce the density tree and DTAE, compare it to related visualization methods, and demonstrate its ability to qualitatively and quantitatively represent real and synthetic scRNA-seq data hierarchies.

Uploaded by

Romina Turco
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
36 views

Visualizing Hierarchies in scRNA-seq Data Using A Density Tree-Biased Autoencoder

This document presents a new method called DTAE (Density Tree-biased AutoEncoder) for visualizing hierarchies in single-cell RNA sequencing (scRNA-seq) data. DTAE first builds a density tree from the data to capture its hierarchical structure, then trains an autoencoder to embed the data in 2D while preserving the density tree structure. The authors introduce the density tree and DTAE, compare it to related visualization methods, and demonstrate its ability to qualitatively and quantitatively represent real and synthetic scRNA-seq data hierarchies.

Uploaded by

Romina Turco
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 18

Visualizing hierarchies in scRNA-seq data using a density tree-biased

autoencoder
Quentin Garrido 1,5 * Sebastian Damrich 1 Alexander Jäger 1 Dario Cerletti 2,3
Manfred Claassen 4 Laurent Najman 5 Fred A. Hamprecht 1
arXiv:2102.05892v3 [q-bio.QM] 22 Apr 2022

1
HCI/IWR, Heidelberg University, Germany
2
Institute of Molecular Systems Biology, ETH Zürich, Switzerland
3
Institute of Microbiology, ETH Zürich, Switzerland
4
Internal Medicine I, University Hospital Tübingen, University of Tübingen, Germany
5
Université Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France

Abstract tion. In particular, this permits studying the cell develop-


ment through time more precisely.
Motivation: Single cell RNA sequencing (scRNA-seq) Waddington’s popular metaphor likens the development
allows studying the development of cells in unprecedented of cells to marbles rolling down a landscape (Waddington,
detail. Given that many cellular differentiation processes 1957). While cells are all grouped at the top of the hill when
are hierarchical, their scRNA-seq data is expected to they are not yet differentiated (e.g., stem cells), as they start
be approximately tree-shaped in gene expression space. rolling down, they can take multiple paths and end up in
Inference and representation of this tree-structure in two distinct differentiated states, or cell fates.
dimensions is highly desirable for biological interpretation However, for every cell, hundreds or thousands of ex-
and exploratory analysis. pressed genes are recorded, and this data is noisy. To sum-
Results: Our two contributions are an approach for iden- marize such high-dimensional data, it is useful to visualize
tifying a meaningful tree structure from high-dimensional it in two or three dimensions.
scRNA-seq data, and a visualization method respecting the Our goal, then, is to identify the hierarchical (tree) struc-
tree-structure. We extract the tree structure by means of a ture of the scRNA-seq data and subsequently reduce its
density based maximum spanning tree on a vector quan- dimensionality while preserving the extracted hierarchical
tization of the data and show that it captures biological properties. We address this in two steps, illustrated in fig-
information well. We then introduce DTAE, a tree-biased ure 1.
autoencoder that emphasizes the tree structure of the data First, we cluster the scRNA-seq data in high-dimensional
in low dimensional space. We compare to other dimension space to obtain a more concise and robust representa-
reduction methods and demonstrate the success of our tion. Then, we capture the hierarchical structure as a min-
method both qualitatively and quantitatively on real and imum spanning tree (MST) on our cluster centers, with
toy data. edge weights reflecting the data density in high-dimensional
Availability: Our implementation relying on PyTorch space. We dub the resulting tree “density tree”.
(Paszke et al., 2019) and Higra (Perret et al., 2019) is Second, we embed the data to low dimension with an au-
available at https://github.com/hci-unihd/DTAE. toencoder, a type of artificial neural network. In addition to
the usual aim of reconstructing its input, we bias the autoen-
coder to also reproduce the density tree in low-dimensional
space. As a result, the hierarchical properties of the data are
1 Introduction emphasized in our visualization.

Single-cell RNA sequencing (scRNA-seq) data allows ana-


lyzing gene expression profiles at the single-cell level, thus 2 Related Work
granting insights into cell behavior at unparalleled resolu-
There are various methods for visualizing scRNA-seq data
* quentin.garrido[at]edu.esiee.fr and trajectory inference, and many of them have been re-

1
Figure 1: Schematic method overview. a) High-dimensional data. b) Proposed density tree. After computing the k-means
centroids on the data, we build a tree based on the data density between pairs of centroids. c) DTAE. An autoencoder is used
to learn a representation of our data. This embedding is regularized by the previously computed tree in order to preserve
its hierarchical structure in low-dimensional space. d) The final DTAE embedding. After training of the autoencoder, the
bottleneck layer visualizes the data in low dimension and respects the density structure.

viewed for instance in Saelens et al. (2019). We therefore to MONOCLE 2.


mention only some exemplary approaches here. Visualization only. Most visualization methods do not
Graph only. SCORPIUS (Cannoodt et al., 2016) was provide a graph representation of the data. PHATE (Moon
one of the first such methods. It is limited to linear topolo- et al., 2019) is a recent approach which computes diffusion
gies rather than trees. More versatile methods include probabilities on the data before applying multidimensional
SLINGSHOT (Street et al., 2018) and SPADE (Bendall scaling.
et al., 2011; Qiu et al., 2011). In contrast to our work, The general purpose dimension reduction methods
these three methods only provide a graph summary of the t-SNE (Maaten and Hinton, 2008), UMAP (McInnes et al.,
data, but not a 2D scatter plot. Similar to our density tree, 2020; Becht et al., 2019) and ForceAtlas2 (Jacomy et al.,
SLINGSHOT and SPADE determine the hierarchical struc- 2014) are popular for visualizing scRNA-seq data. They
ture of the dataset as a MST on cluster centers. However, aim to layout the kNN graph structure of the data with t-
SLINGSHOT does not consider density. SPADE addresses SNE focusing more on discrete clusters and ForceAtlas2
the data density only by downsampling dense regions to better representing the continuous structure (Böhm et al.,
equalize the data density. In particular, it does not inform 2020). This often works well, but lacks the focus on hierar-
the MST by the actual data density, which can be problem- chies that our method provides. While the continuous focus
atic, as illustrated in figure 2. In contrast, we induce our of ForceAtlas2 seems apt to show differentiation processes
density tree to have edges in high-density regions. PAGA in scRNA-seq datasets, we find that without a specific tree-
(Wolf et al., 2019) produces primarily a graph summary prior the biologically interesting branching events are often
of the data. It first clusters the k nearest neighbor (kNN) poorly resolved.
graph of the data by modularity, and then places edges be- Like DTAE, several recent methods for visualizing
tween clusters of high connectivity. Optionally, a layout of scRNA-seq data rely on neural networks. We describe them
the PAGA graph can serve as initialization to other meth- in the following. Many approaches are extensions of the au-
ods, such as UMAP. In our method, we connect clusters de- toencoder (AE) (Rumelhart et al., 1985), a network which
pending on the data density between two cluster centroids. encodes the data to a lower dimensional latent space from
Moreover, our proposed visualization is directly optimized which it tries to decode the input. A prominent member of
to respect the density tree, while PAGA injects graph infor- this family is DCA (Eraslan et al., 2019) which replaces the
mation only at the initialization of the visualization. usual reconstruction loss by a count-based ZINB loss and
Graph and visualization. MONOCLE 2 (Qiu et al., aims at denoising scRNA-seq data. Its extension scDeep-
2017) is more similar to our method, as it provides both a Cluster (Tian et al., 2019) jointly trains a clustering model
visualization and a hierarchical graph structure on a vector in latent space. SAUCIE (Amodio et al., 2019) is another
quantization of the data. The tree in MONOCLE 2 is in- popular AE method and addresses multiple tasks includ-
ferred in conjunction with the embedding, while we learn ing batch effect removal and clustering, for which it uses
it as a first step in high-dimensional space and consider the a binary hidden layer. In order to exploit more relational
data density explicitly. As a result, our density tree depends information, scGAE (Luo et al., 2021) uses a graph AE
only on the biological data but not the embedding initializa- based on the kNN graph and achieves good visualization
tion or dimension. MONOCLE 2 is conceptually promis- results both for clustered and continuous scRNA-seq data,
ing, but empirically found to be often inferior to other meth- but without our inductive prior of a hierarchical embedding
ods, confer (Moon et al., 2019). Hence, we did not compare or our explicit focus on data density. Topological autoen-

2
coders (Moor et al., 2020) are conceptually closest to our Algorithm 1 Density tree generation
idea of retaining topological properties during dimension Require: High-dimensional data X ∈ Rn×d
reduction. They compute the MST on all points, which pro- Require: Number of k-means centroids k
duces less stable results than our density-based approach on procedure G ENERATE T REE(X, k)
cluster centroids. C ← K M EANS(X, k) . O(nkdt) with t the number
Variational autoencoders (VAEs) (Kingma and Welling, of iterations
2013), a generative AE version, have also been explored. G = (C, E) the complete graph on our centroids
A popular VAE for scRNA-seq data is scVI (Lopez et al., for {i, j} a two-element subset of {1, . . . , k} do .
2018), which explicitly models batch effects and library O(k 2 )
sizes. Instead, scVAE (Grønbech et al., 2020) investigates di,j = 0
likelihood functions suitable for scRNA-seq data and pro- end for
poses a clustering model in latent space. DR-A (Lin et al., for i = 1, . . . , |X| do . O(nk)
2020) apply adverserial training instead of the variational a ← arg min ||xi − cj ||2 . Nearest centroid
objective. Finally, scvis is a VAE tailored to visualization j=1,...,k
(Ding et al., 2018) and uses a t-SNE-like regularization term b ← arg min ||xi − cj ||2 . Second nearest
in the latent space. j=1,...,k
j6=a
Ivis (Szubert et al., 2019) employs a triplet loss function centroid
and a siamese neural network instead of an AE to preserve da,b = da,b + 1 . Increase nearest centroids’
the nearest neighbor relations in the visualization. edge strength
Both scDeepCluster and scVAE shape the latent space end for
into disconnected clusters, which is orthogonal to our goal T ← M AX S PANNING T REE(G, d) . O(k 2 log k)
of illustrating continuous developmental hierarchies. scVI, return T, d . Retains the density tree and the edge
scGAE and scDeepCluster work with a latent space dimen- strengths
sion larger than two and thus require an additional dimen- end procedure
sion reduction, typically with t-SNE, to visualize the data.
Neither of the pure visualization methods aims to bring
out the hierarchical properties often present in scRNA-seq our original data well. But it is crucial that the extracted
dataset. In particular, they do not use the data density to tree follows the dense regions of the data if we want to vi-
infer lineages. None of them provide a graph summary of sualize developmental trajectories of differentiating cells: a
the data. Our contribution, however, is to supply the user trajectory is plausible if we observe intermediate cell states
with a tree-shaped graph summarizing the hierarchies along and unlikely if there are jumps in the development. By
dense lineages in the data as well as a 2D embedding that preferring tree edges in high density regions of the data,
respects this tree shape. we ensure that the computed spanning tree is biologically
plausible. Following this rationale, we build the maximum
spanning tree on the complete graph over centroids whose
3 Methods edge weights are given by the density of the data along each
edge instead of the minimum spanning tree on Euclidean
3.1 Approximating the High-dimensional
distance. This results in a tree that (we believe) captures
scRNA-seq Data with a Tree Waddington’s hypothesis better than merely considering cu-
To summarize the high-dimensional data in terms of a tree, mulative differences in expression levels.
the minimum spanning tree (MST) on the Euclidean dis- To estimate the support that a data sample provides for an
tances is an obvious choice. This route is followed by Moor edge, we follow Martinetz and Schulten (1994). Consider
et al. (2020) who reproduce the MST obtained on their the complete graph G = (C, E) such that C = {c1 , . . . , ck }
high-dimensional data in their low-dimensional embedding. is the set of centroids. In the spirit of Hebbian learning,
However, scRNA-seq data can be noisy, and a MST built on that is, emphasizing connections that appear frequently, we
all of our data is very sensitive to noise. Therefore, we first count, for each edge, how often its incident vertices are the
run k-means clustering on the original data, yielding more two closest centroids to any given datum.
robust centroids for the MST construction and also reducing As pointed out by Martinetz and Schulten (1994) this
downstream complexity. amounts to an empirical estimate of the integral of the den-
A problem with the Euclidean MST, illustrated in fig- sity of observations across the second-order Voronoı̈ region
ure 2, is that two centroids can be close in Euclidean space (defined as the set of points having a particular set of 2 cen-
without having many data points between them. In such a troids as its 2 nearest centroids) associated with this pair of
case, a Euclidean MST would not capture the skeleton of cluster centers. Finally, we compute the maximum span-

3
Figure 2: (left, middle) Comparison of the tree built on k-means centroids using Euclidean distance or density weights.
The data was generated using the PHATE library (Moon et al., 2019), with 3 branches in 2D. Original data points are
transparently overlayed to better visualize their density. While the tree based on the Euclidean distance places connections
between centroids that are close but have only few data points between them (see red ellipse), our tree based on the data
density instead includes those edges that lie in high density regions (see pink ellipse). (right) Complete graph over centroids
and its Hebbian edge weights. Null-weight edges, that is edges not supported by data, are omitted for clarity.

ning tree over these Hebbian edge weights. Our strategy for
building the tree is summarized in algorithm 1.
1 X
Our data-density based tree follows the true shape of the Lrec = MSE(X, g(f (X))) = ||xi − g(f (xi ))||22 .
N
data more closely than a MST based on the Euclidean dis- xi ∈X
tance weights, as illustrated in figure 2. We claim this in- (1)
dicates it being a better choice for capturing developmental This term is the typical loss function for an autoencoder and
trajectories. Having extracted the tree shape in high dimen- ensures that the embedding is as faithful to the original data
sions, our goal is to reproduce this tree as closely as possible as possible, forcing it to extract the most salient data fea-
in our embedding. tures.

3.2.2 Push-Pull Loss


3.2 Density-Tree biased Autoencoder The main loss term that biases the DTAE towards the den-
(DTAE) sity tree is the push-pull loss. It trains the encoder to embed
the data points such that the high-dimensional data density,
We use an autoencoder to faithfully embed the high- and, in particular, the density tree, are reproduced in low-
dimensional scRNA-seq data in a low-dimensional space, dimensional-space.
and bias it such that the topology inferred in high- We find a centroid in embedding space by averaging
dimensional space is respected. An autoencoder is an the embeddings of all points assigned to the corresponding
artificial neural network consisting of two concatenated k-means cluster in high-dimensional space. In this way, we
subnetworks, the encoder f , which maps the input to can easily relate the centroids in high and low dimension,
lower-dimensional space, also called embedding space, and and will simply speak of centroids when the ambient space
the decoder g, which tries to reconstruct the input from is clear from the context.
the lower-dimensional embedding. It can be seen as a To reproduce the density structure in low-dimensional
non-linear generalization of PCA. We visualize the low- space, we want that the closest two high-dimensional cen-
dimensional embeddings hi = f (xi ) and hence choose troids to a point xi ∈ X correspond to the two low-
their dimension to be 2. dimensional centroids that are closest to its embedding
The autoencoder is trained by minimizing the following hi = f (xi ). We denote the latter centroids by ci,1 and ci,2 ,
loss terms, including new ones that bias the autoencoder to and low-dimensional centroids that actually correspond to
also adhere to the tree structure. the closest high-dimensional centroids by c0i,1 and c0i,2 . As
long as c0i,1 , c0i,2 differ from ci,1 and ci,2 , the encoder places
hi next to different centroids than in high-dimensional
3.2.1 Reconstruction Loss space. To improve this, we want to move c0i,1 , c0i,2 and hi
towards each other while separating ci,1 and ci,2 from hi .
The first term of the loss is the reconstruction loss, defined The following preliminary version of our push-pull loss im-
as plements this:

4
geodesic distance dgeo (ca , cb ) with ca , cb ∈ C is defined
2 as the number of edges in the shortest path between ca
L̃push (hi ) = − (||hi − ci,1 ||2 + ||hi − ci,2 ||2 ) (2)
2 and cb in the density tree. Centroids at the end of differ-
L̃pull (hi ) = ||hi − c0i,1 ||2 + ||hi − c0i,2 ||2 (3) ent branches in the density have a higher geodesic distance
1 X than centroids nearby on the same branch. By weighing
L̃push-pull = L̃push (f (xi )) + L̃pull (f (xi )). (4) the push-pull loss contribution of an embedded point by
N
xi ∈X the geodesic distance between its two currently closest cen-
The push loss decreases as hi and the currently closest cen- troids, we focus the push-pull loss on embeddings which
troids, ci,1 and ci,2 , are placed further apart from each other, erroneously lie between different branches.
while the pull loss decreases when hi gets closer to the The geodesic distances can be computed quickly in
correct centroids, c0i,1 and c0i,2 . Indeed, the push-pull loss O(k 2 ) via breadth first search, and this only has to be done
term is minimized if and only if each embedding hi lies in once before training the autoencoder.
the second-order Voronoı̈ region of those low-dimensional The final version of our push-pull loss becomes
centroids whose high-dimensional counterparts contain the
data point xi in their second-order Voronoı̈ region. In other 1 X 
Lpush-pull = dgeo (ci,1 , ci,2 )
words, the loss is zero precisely when we are reproducing N
xi ∈X
the edge densities from high dimension in low dimension. 
Note that we let the gradient flow through both the in- · (Lpush (f (xi )) + Lpull (f (xi ))) .
dividual embeddings and through the centroids, which are (8)
means of embeddings themselves.
This naı̈ve formulation of the push-pull loss has the Note, that the normalized push-pull loss in equation (7) and
drawback that it can become very small if all embeddings the geodesically reweighted push-pull loss in (8) both also
are nearly collapsed into a single point, which is undesir- get minimized if and only if the closest centroids in em-
able for visualization. Therefore, we normalize the contri- bedding space correspond to the closest centroids in high-
bution of every embedding hi by the distance between the dimensional space.
two correct centroids in embedding space. This prevents the
collapsing of embeddings, and also ensures that each data- 3.2.3 Compactness loss
point xi contributes equally, regardless of how far apart c0i,1
and c0i,2 are. The push-pull loss thus becomes The push-pull loss replicates the empirical high-
!2 dimensional data density in embedding space by moving
||hi − ci,1 ||2 + ||hi − ci,2 ||2 the embeddings into the correct second-order Voronoı̈
Lpush (hi ) = − (5)
||c0i,1 − c0i,2 ||2 region, which can be large or unbounded. For optimal
!2 visibility of the tree structure, an embedding should not
||hi − c0i,1 ||2 + ||hi − c0i,2 ||2 only be in the correct second-order Voronoı̈ region, but lie
Lpull (hi ) = (6)
||c0i,1 − c0i,2 ||2 compactly around the line between its two centroids. To
achieve this, we add the compactness loss, which is just
1 X
Lpush-pull = Lpush (f (xi )) + Lpull (f (xi )). (7) another instance of the pull loss
N
xi ∈X
!2
So far, we only used the density information from high- 1 X ||hi − c0i,1 ||2 + ||hi − c0i,2 ||2
Lcomp = (9)
dimensional space for the embedding, but not the extracted N ||c0i,1 − c0i,2 ||2
xi ∈X
density tree itself. The push-pull loss in equation (7) is ag-
1 X
nostic to the positions of the involved centroids within the = Lpull (f (xi )), (10)
density tree, only their Euclidean distance to the embed- N
xi ∈X
ding hi matters. In contrast, the hierarchical structure is
important for the biological interpretation of the data: it is The compactness loss is minimized if the embedding hi is
much less important if an embedding is placed close to two exactly between the correct centroids c0i,1 and c0i,2 and has
centroids that are on the same branch of the density tree elliptic contour lines with foci at the centroids.
than it is if the embedding is placed between two different
branches. In the first case, cells are just not ordered cor- 3.2.4 Cosine loss
rectly within a trajectory, while in the second case we get
false evidence for an altogether different pathway. Since the encoder is a powerful non-linear map, it can
We tackle this problem by reweighing the push-pull loss introduce artifactual curves in the low-dimensional tree
with the geodesic distance along the density tree. The branches. However, especially tight turns can impede the

5
visual clarity of the embedding. As a remedy, we pro- 3.3 Training procedure
pose an optional additional loss term that tends to straighten
branches. Firstly, we compute the k-means centroids, the edge densi-
ties, the density tree, and geodesic distances. This has to be
Centroids at which the embedding should be straight are
done only once as an initialization step. Secondly, we pre-
the ones within a branch, but not at a branching event of
train the autoencoder with only the reconstruction loss via
the density tree. The former can easily be identified as the
stochastic gradient descent on minibatches. This provides a
centroids of degree 2.
warm start for finetuning the autoencoder with all losses in
Let c be a centroid in embedding space of degree 2 with the third step.
its two neighboring centroids nc,1 and nc,2 . The branch is During finetuning, all embedding points are needed to
straight at c if the two vectors c − nc,1 and nc,2 − c are par- compute the centroids in embedding space. Therefore, we
allel or, equivalently, if their cosine is maximal. Denoting perform full-batch gradient descent during finetuning. For
by C2 = {c ∈ C | deg(c) = 2} the set of all centroids algorithmic details regarding the training procedure, confer
of degree 2, considered in embedding space, we define the to supplementary algorithm S1.
cosine loss as We always used k = 50 centroids for k-means clus-
tering in our experiments. This number needs to be
1 X (c − nc,1 ) · (nc,2 − c)
Lcosine = 1 − . (11) high enough so that the tree yields a skeleton of the
|C2 | ||c − nc,1 ||2 ||nc,2 − c||2 data, but not so high that the density loses its mean-
c∈C2
ing. k = 50 is a default value that works well in
Essentially, it measures the cosine of the angles along the a variety of scenarios. Our autoencoder always has a
tree branch and becomes minimal if all these angles are zero bottleneck dimension of 2 for visualization. In the ex-
and the branches straight. periments, we used layers of the following dimensions
d(input dimension), 2048, 256, 32, 2, 32, 256, 2048, d. This
A generalization of this criterion that deals with noisy
results in symmetrical encoders and decoders with four lay-
edges in the density tree is discussed in section B of the
ers. While not necessary in our experiments, if a lighter
appendix.
network is desired, we recommend applying PCA first to
reduce the number of input dimensions, or to filter out more
genes during the preprocessing. We omitted hidden layers
3.2.5 Complete loss function of dimension larger than the input. We use fully connected
layers and ReLU activations after every layer but the last
Combining the four loss terms of the preceding sections, we
encoder and decoder layer and employ the Adam (Kingma
arrive at our final loss
and Ba, 2017) optimizer with learning rate 2 × 10−4 for
pretraining and 1 × 10−3 for finetuning unless stated oth-
L = λrec Lrec + λpush-pull Lpush-pull + λcomp Lcomp + λcos Lcos . erwise. We used a batch size of 256 for pretraining in all
(12) experiments.
The relative importance of the loss terms, especially of
Lcomp and Lcos , which control finer aspects of the visualiza-
tion, might depend on the use-case. In practice, we found 4 Results
λrec = λpush-pull = λcomp = 1 and λcos = 50 to work well.
This configuration reduces the number of weights to adjust In this section, we show the performance of our method on
from four to one. toy and real scRNA-seq datasets and compare it to a vanilla
An ablation study of the different losses’ contribution is autoencoder, as well as to the popular non-parametric meth-
available in section C of the appendix. Its main conclusion ods PCA, Force Atlas 2, UMAP and PHATE and to the most
is that while the push-pull loss and reconstruction loss are prevalent neural network-based approaches, SAUCIE, DCA
sufficient to obtain satisfactory results, the addition of the and scVI. For all network-based approaches, we choose a
compactness and cosine loss helps to improve the visualiza- bottleneck of dimension 2 to directly use them for visual-
tions further and facilitates reproducibility. Empirically, we ization.
found that adding the compactness loss without the cosine
loss sometimes leads to discontinuous embeddings. The
two loss terms should therefore be added or omitted jointly.
4.1 PHATE generated data
If the default loss weights are not satisfactory, we recom- We applied our method to an artificial dataset created with
mend adjusting the cosine loss weight first. To understand the library published alongside Moon et al. (2019), to
how changing the loss parameters may affect the results, demonstrate its functionality in a controlled setting. We
please refer to the qualitative results in the ablation study. generated a toy dataset whose skeleton is a tree with one

6
Figure 3: Results obtained using data generated by the PHATE library. Branches are coloured by groundtruth labels.

backbone branch and 9 branches emanating from the back- tree structure of the data, especially for the endocrine sub-
bone, consisting in total of 10,000 points in 100 dimensions. types. The visualized hierarchy is biologically plausible,
We pretrained for 150 epochs with a learning rate of with a particularly clear depiction of the α-, β- and ε-cell
10−3 and finetuned for another 150 epochs with a learning branches and a visible, albeit too strong, separation of the
rate of 10−2 . δ-cells. This is in agreement with the results from Bastidas-
Figure 3 shows the visualization results. The finetun- Ponce et al. (2019). UMAP also performs very well and
ing significantly improves the results of the pretrained au- attaches the δ-cells to the main trajectory. However, the
toencoder, whose visualisation collapses the grey and green α- and β-cell branches are not as prominent as in DTAE.
branch onto the blue branch. All methods other than DCA, PHATE does not manage to separate the δ- and ε-cells dis-
scVI and PCA achieve satisfactory results that make the true cernibly from the other endocrine subtypes. As on toy data
tree structure of the data evident. While PHATE, UMAP in figure 3, it produces overly crisp branches for the α- and
and Force Atlas 2 produce overly crisp branches compared β-cells. PCA mostly overlays all endocrine subtypes. All
to the PCA result, the reconstruction loss of our autoen- methods but the vanilla autoencoder show a clear branch
coder guards us from collapsing the branches into lines. with tip and acinar cells and one via EP and Fev+ cells
PHATE appears to overlap the cyan and yellow branches to the endocrine subtypes, but only DTAE, DCA, SAUCIE
near the backbone, and UMAP introduces artificially curved and scVI manage to also hint at the more generic trunk and
branches. scVI collapses the green and brown as well as the multipotent cells from which these two major branches em-
pink and cyan branches together, giving hard to interpret vi- anate. However, SAUCIE, DCA and scVI fail to produce a
sualizations. The results on this toy dataset demonstrate that meaningful separation between the α- and β-cell branches.
our method can embed high-dimensional hierarchical data The ductal and Ngn3 low EP cells overlap in all methods.
into 2D and emphasize its tree-structure while avoiding to It is worth noting that the autoencoder alone was not able
collapse too much information compared to state-of-the-art to visualize meaningful hierarchical properties of the data.
methods. In our method, all branches are easily visible. However, the density tree-biased finetuning in DTAE made
this structure evident, highlighting the benefits of our ap-
proach.
4.2 Endocrine pancreatic cell data
In figure 4, we overlay DTAE’s embedding with a pruned
We evaluated our method on the data from Bastidas-Ponce version of the density tree and see that the visualization
et al. (2019). It represents endocrine pancreatic cells at closely follows the tree structure around the differentiated
different stages of their development and consists of gene endocrine cells. This combined representation of low-
expression information for 36351 cells and 3999 genes. dimensional embedding and overlaid density tree further
Preprocessing information can be found in Bastidas-Ponce facilitates the identification of branching events, most no-
et al. (2019). We pretrained for 300 epochs and used 250 tably for the α- and β-cells, and shows the full power of
epochs for finetuning. our method. It also provides an explanation for the appar-
Figure 4 and supplementary figure S5 depicts visual- ent separation of the δ-cells. Since there are relatively few
izations of the embryonic pancreas development with dif- δ-cells, they are not represented by a distinct k-means cen-
ferent methods. Our method can faithfully reproduce the troid.

7
Figure 4: Pruned density tree superimposed over embeddings of the endocrine pancreatic cell dataset, colored by cell sub-
types. We use finer labels for the endocrine cells. Darker edges represent denser edges. Only edges with more than 100
points contributing to them are plotted here.

Our method places more k-means centroids in the dense method. For instance, together with the density tree, we can
region in the lower right part of DTAE’s panel in figure 4 identify the ε-cells as a separate branch and find the location
than is appropriate to capture the trajectories, resulting in of the branching event into different endocrine subtypes in
many small branches. Fortunately, this does not result in the UMAP embedding.
an exaggerated tree-shaped visualization that follows every
spurious branch, which we hypothesize is thanks to the suc-
cessful interplay between the tree bias and the reconstruc- 4.3 T-cell infection data
tion aim of the autoencoder: If the biological signal encoded
in the gene expressions can be reconstructed by the decoder We further applied our method to T-cell data of a chronic
from an embedding with enhanced hierarchical structure, and an acute infection, which was shared with us by the au-
the tree-bias shapes the visualization accordingly. Con- thors of Cerletti et al. (2020). The data was preprocessed
versely, an inappropriate tree-shape is prevented if it would using the method described in Zheng et al. (2017), for more
impair the reconstruction. Overall, the density tree recovers details confer Cerletti et al. (2020). It contains gene expres-
the pathways identified in Bastidas-Ponce et al. (2019) to a sion information for 19029 cells and 4999 genes. While
large extent. Only the trajectory from multipotent via tip to we used the combined dataset to fit all dimension reduc-
acinar cells includes an unexpected detour via the trunk and tion methods, we only visualize the 13707 cells of the
ductal cells, which the autoencoder mends by placing the chronic infection for which we have phenotype annotations
tip next to the multipotent cells. from Cerletti et al. (2020) allowing us to judge visualization
quality from a biological viewpoint. We pretrained for 600
The density tree also provides useful information in con- epochs and used 250 epochs for finetuning.
junction with other dimension reduction methods. In fig- Figure 5 and supplementary figure S6 demonstrate that
ure 4, we overlay their visualizations with the pruned den- our method makes the tree structure of the data clearly vis-
sity tree by computing the centroids in the respective em- ible. The visualized hierarchy is also biologically signif-
bedding spaces according to the k-means cluster assign- icant: The two branches on the right correspond to the
ments. The density tree can help to find branching events memory-like and terminally exhausted phenotypic states,
and gain insights into the hierarchical structure of the data which are identified as the main terminal fates of the differ-
that is visualized with an existing dimension reduction entiation process in Cerletti et al. (2020). Furthermore, the

8
Figure 5: Pruned density tree superimposed over embeddings of the chronic part of the T-cell data, colored by phenotypes.
Darker edges represent denser edges. Only edges with more than 100 points contributing to them are plotted here.

purple branch at the bottom contains the proliferating cells. event towards memory-like and terminally exhausted cells.
Since the cell cycle affects cell transcription significantly, PCA exhibits only the coarsest structure and fails to sepa-
those cells are expected to be distinct from the rest. rate the later states visibly. The biological structure is de-
It is encouraging that DTAE makes the expected bio- cently preserved in the UMAP visualization, but the hierar-
logical structure apparent even without relying on known chy is less apparent than in DTAE. SAUCIE, scVI and Force
marker genes or differential cell expression, which were Atlas 2 produce results that are very similar to PCA, with
used to obtain the phenotypic annotations in Cerletti et al. later states that are hard to distinguish. DCA produces re-
(2020). sults that are very similar to the vanilla autoencoder, where
Interestingly, our method places the branching event to- even though the later states are visible, there is a significant
wards the memory-like cells in the vicinity of the exhausted amount of noise in the embedding, making the analysis dif-
cells, as does UMAP, while Cerletti et al. (2020) recog- ficult. Overall, our method outperforms the other visualiza-
nized a trajectory directly from the early stage cells to the tion methods on this dataset.
memory-like fate. The exact location of a branching event In figure 5, we have overlaid our embedding with a
in a cell differentiation process is difficult to determine pre- pruned version of the density tree and see that DTAE’s vi-
cisely. We conjecture that fitting the dimensionality reduc- sualization indeed closely follows the tree structure. It is
tion methods on the gene expression measurements of cells noteworthy that even the circular behavior of proliferation
from an acute infection in addition to those from the chronic cells is accurately captured by a self-overlaid branch, al-
infection analyzed in Cerletti et al. (2020) provided addi- though our tree-based method is not directly designed to
tional evidence for the trajectory via exhausted cells to the extract circular structure.
memory-like fate. Unfortunately, an in-depth investigation Figure 5 also shows the other dimension reduction meth-
of this phenomenon is beyond the scope of this methodolog- ods in conjunction with the pruned density tree. Reassur-
ical paper. ingly, we find that all methods embed the tree in a plausi-
The competing methods expose the tree-structure of the ble way, i.e., without many self-intersections or oscillating
data less obviously than DTAE. The finetuning significantly branches. This is evidence that our density tree indeed cap-
improves the results from the autoencoder, which shows tures a meaningful tree structure of the data. As for the
no discernible hierarchical structure. PHATE separates the endocrine pancreas dataset, the density tree can enhance hi-
early cells, proliferating cells and the rest. But its layout erarchical structure in visualizations of existing dimension
is very tight around the biologically interesting branching reduction methods. It, for example, clarifies in the UMAP

9
Type of metric Local Global Voronoi
Euclidean Geodesic All
Metric ARI k-NN 1st order 2nd order
Pearson Spearman Pearson Spearman
DTAE (Ours) 93.75 48.70 85.51 72.91 82.39 87.19 98.24 94.21 82.86
AE 74.83 70.96 87.41 77.20 70.16 73.23 89.83 58.43 75.26
PHATE 84.76 73.48 45.43 46.04 74.15 78.45 85.27 44.04 66.45
UMAP 78.88 87.75 53.42 54.31 79.40 80.12 83.31 55.94 71.64
SAUCIE 89.99 67.43 82.22 78.50 84.03 85.41 96.43 78.58 82.83
DCA 49.79 64.37 76.54 90.95 40.40 65.92 63.26 49.33 62.57
scVI 74.80 54.30 87.82 67.68 75.45 82.75 86.42 57.77 73.37
Force Atlas 2 72.88 72.23 37.28 48.06 35.67 76.65 77.27 43.27 57.91
PCA 60.40 40.78 73.42 66.02 96.44 96.40 80.76 56.82 71.38

Table 1: Relative quantitative performances averaged over all studied datasets. For each metric, we give the best performing
method a value of 100 and scale other results proportionally. The metrics are described in section 4.4 and higher values
indicate better performance. The rightmost column contains the average relative performance over all metrics. DTAE and
SAUCIE have the best performance overall, with DTAE excelling in Voronoı̈ metrics and ARI.

plot that the pathway towards the terminally exhausted cells ond order Voronoı̈ diagram with k = 50, there is a bias to-
is via the exhausted and effector like cells and not directly wards DTAE since we optimize this criterion. For local and
via the proliferating cells. Voronoı̈ diagram based metrics, we have to adjust a param-
eter k (either for k-means clustering or for a k-NN graph).
We vary the value of k between 10 and 100 with a step of
4.4 Quantitative analysis 10 and report the area under the curve.

The purpose of a visualization method is to make the most We report results aggregated on all three datasets in ta-
salient, qualitative properties of a dataset visible. Neverthe- ble 1 and full results are available in supplementary ta-
less, a quantitative evaluation can support the comparison ble S5. This aggregation makes it easier to deduce general
of visualization methods and provide evidence that the data patterns of performance among multiple datasets. From the
and its visualization are structurally similar. Unfortunately, results on all datasets, we can clearly see that DTAE out-
there is to our knowledge no consensus as to which metric performs other methods on Voronoı̈ diagram based metrics,
aligns with practitioners’ notion of a useful visualization. in part due to the bias towards them for k = 50. On lo-
Hence, any single metric cannot validate the quality of a cal metrics, DTAE achieves the best performance on ARI,
method. This is why it is important to use multiple metrics, followed closely by SAUCIE. However, for k-NN preserva-
so that one can hope for a more reliable result. tion UMAP performs better than other methods by a signif-
We selected eight different metrics, some of which icant margin which is consistent with the criterion it opti-
have been employed to judge visualization methods be- mizes (Damrich and Hamprecht, 2021). For euclidean dis-
fore (Moon et al., 2019; Kobak and Berens, 2019; Becht tance preservation, autoencoder based methods perform the
et al., 2019). The first group of metric considers the local best, with no clear winner overall. For geodesic distance
structure. We compute the Adjusted Rand Index (ARI) be- preservation, PCA performs the best, even though it pro-
tween a k-means clustering in high and low dimension and duced poor visualizations. This is in line with previous
the number of correct neighbors in the k-NN graph in high findings (Kobak and Berens, 2019). Most other methods
and low dimension. The next category are global metrics, obtained very similar performance on this metric, making
which rely on distance preservation. Euclidean distances it hard to conclude that any method performs better than
are computed in low dimension and euclidean or geodesic another.
distances are computed in high dimension. Then correla- In order to more easily compare methods, aggregated
tions are computed between those distances. Finally, we performances over all metrics are reported in the rightmost
use Voronoı̈ diagram based metrics. First or second order column of table 1. This aggregation makes it easier to evalu-
Voronoı̈ diagrams on the k-means centroids are computed ate the overall performance of a method when using a wide
using the k-means assignments to obtain the seeds in low- variety of criteria. We chose the arithmetic mean to com-
dimensional space. Then the ratio of points placed in the bine the results for simplicity’s sake. From this, we can see
correct Voronoı̈ region is computed. When using the sec- that DTAE and SAUCIE perform significantly better than

10
structure of the dataset, a more general dimension reduction
method might be preferable for initial data exploration.

5.2 Neural network limitations


Artificial neural networks are powerful non-linear func-
tions that can produce impressive results. Unfortunately,
they require the choice of a number of hyperparameters,
such as the dimension of the hidden layers and the learning
rate, making them less end-user friendly than their classical
counterparts.

Figure 6: Failure case: Highly clustered data violates our 6 Conclusion


underlying assumption of a tree structure. Dentate gyrus
data from Hochgerner et al. (2018) with clusters colored by We have introduced a new way of capturing the hierarchical
groundtruth cluster assignments. properties of scRNA-seq data of a developing cell popula-
tion with a density based minimum spanning tree. This tree
is a hierarchical representation of the data that places edges
other methods, with DTAE surpassing SAUCIE by a small in high density regions and thus captures biologically plau-
margin. However, from a qualitative point of view, DTAE sible trajectories. The density tree can be used to inform
produced superior visualizations compared to SAUCIE, as any dimension reduction method about the hierarchical na-
discussed previously. ture of the data.
Overall, DTAE produced excellent results both from a Moreover, we used the density tree to bias an autoen-
quantitative and qualitative point of view, highlighting its coder and were thus able to produce promising visualiza-
usefulness as a visualization method for tree-shaped data. tions exhibiting clearly visible tree-structure both on syn-
thetic and real world scRNA-seq data of developing cell
populations.
5 Limitations
5.1 Hierarchy assumption Funding
Our method is tailored to Waddington’s hierarchical struc- Supported, in part, by Informatics for Life funded by the
ture assumption of developmental cell populations, in Klaus Tschira Foundation.
which the highest data density is along the developmen-
tal trajectory. It produces convincing results in this setting
as shown above. However, if the assumption is violated,
for instance because the dataset contains multiple separate References
developmental hierarchies or a mixture of hierarchies and
distinct clusters of fully differentiated cell fates, the den- Amodio, M. et al (2019). Exploring single-cell data with deep multitasking neural
networks. Nature Methods, 16(11), 1139–1145.
sity tree cannot possibly be a faithful representation of the
dataset. Indeed, in such a case, our method yields a poor Bastidas-Ponce, A. et al (2019). Comprehensive single cell mRNA profiling re-
result. As an example, confer figure 6 with visualizations veals a detailed roadmap for pancreatic endocrinogenesis. Development, 146(12),
dev173849.
of the dentate gyrus dataset from Hochgerner et al. (2018),
preprocessed according to Zheng et al. (2017). This dataset Becht, E. et al (2019). Dimensionality reduction for visualizing single-cell data using
consists of a mostly linear cell trajectory and several dis- UMAP. Nature Biotechnology, 37(1), 38–44. Number: 1 Publisher: Nature
Publishing Group.
tinct clusters of differentiated cells, and consequently does
not meet our model’s assumption. Indeed, DTAE manages Bendall, S.C. et al (2011). Single-cell mass cytometry of differential immune and
to only extract some linear structures, but overall fails on drug responses across a human hematopoietic continuum. Science, 332(6030),
687–696.
this dataset, similarly to PHATE. UMAP seems to produce
the most useful visualization here. Böhm, J.N. et al (2020). Attraction-repulsion spectrum in neighbor embeddings.
One could adapt our method by extracting a forest of arXiv preprint arXiv:2007.08902.

disconnected density trees by cutting edges below a den- Cannoodt, R. et al (2016). SCORPIUS improves trajectory inference and identifies
sity threshold. However, if little is known a priori about the novel modules in dendritic cell development. preprint, Bioinformatics.

11
Cerletti, D. et al (2020). Fate trajectories of CD8 + T cells in chronic LCMV infection. Szubert, B. et al (2019). Structure-preserving visualisation of high dimensional
preprint, Immunology. single-cell datasets. Scientific reports, 9(1), 1–10.

Damrich, S. and Hamprecht, F.A. (2021). On UMAP’s true loss function. Tian, T. et al (2019). Clustering single-cell rna-seq data with a model-based deep
arXiv:2103.14608 [cs, stat]. arXiv: 2103.14608. learning approach. Nature Machine Intelligence, 1(4), 191–198.

Ding, J. et al (2018). Interpretable dimensionality reduction of single cell transcrip- Waddington, C.H. (1957). The strategy of the genes : a discussion of some aspects
tome data with deep generative models. Nature communications, 9(1), 1–13. of theoretical biology. Routledge Library Editions: 20th Century Science. Rout-
ledge.
Eraslan, G. et al (2019). Single-cell rna-seq denoising using a deep count autoen-
coder. Nature communications, 10(1), 1–14. Wolf, F.A. et al (2019). PAGA: graph abstraction reconciles clustering with trajectory
inference through a topology preserving map of single cells. Genome Biology,
Grønbech, C.H. et al (2020). scvae: Variational auto-encoders for single-cell gene 20(1), 59.
expression data. Bioinformatics, 36(16), 4415–4422.
Zheng, G.X.Y. et al (2017). Massively parallel digital transcriptional profiling of
Hochgerner, H. et al (2018). Conserved properties of dentate gyrus neurogenesis single cells. Nature Communications, 8(1).
across postnatal development revealed by single-cell RNA sequencing. Nature
Neuroscience, 21(2), 290–299.

Jacomy, M. et al (2014). Forceatlas2, a continuous graph layout algorithm for handy


network visualization designed for the gephi software. PloS one, 9(6), e98679.

Kingma, D.P. and Ba, J. (2017). Adam: A method for stochastic optimization.

Kingma, D.P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv


preprint arXiv:1312.6114.

Kobak, D. and Berens, P. (2019). The art of using t-SNE for single-cell transcrip-
tomics. Nature Communications, 10(1), 5416.

Lin, E. et al (2020). A deep adversarial variational autoencoder model for dimension-


ality reduction in single-cell rna sequencing analysis. BMC bioinformatics, 21(1),
1–11.

Lopez, R. et al (2018). Deep generative modeling for single-cell transcriptomics.


Nature methods, 15(12), 1053–1058.

Luo, Z. et al (2021). scgae: topology-preserving dimensionality reduction for single-


cell rna-seq data using graph autoencoder. bioRxiv.

Maaten, L.v.d. and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of
Machine Learning Research, 9(86), 2579–2605.

Martinetz, T. and Schulten, K. (1994). Topology representing networks. Neural


Networks, 7(3), 507–522.

McInnes, L. et al (2020). UMAP: Uniform Manifold Approximation and Projection


for Dimension Reduction. arXiv:1802.03426 [cs, stat]. arXiv: 1802.03426.

Moon, K.R. et al (2019). Visualizing structure and transitions in high-dimensional


biological data. Nature Biotechnology, 37(12), 1482–1492.

Moor, M. et al (2020). Topological Autoencoders. arXiv:1906.00722 [cs, math, stat].


arXiv: 1906.00722.

Paszke, A. et al (2019). PyTorch: An Imperative Style, High-Performance Deep


Learning Library. arXiv:1912.01703 [cs, stat]. arXiv: 1912.01703.

Perret, B. et al (2019). Higra: Hierarchical Graph Analysis. SoftwareX, 10, 100335.

Qiu, P. et al (2011). Extracting a cellular hierarchy from high-dimensional cytometry


data with spade. Nature biotechnology, 29(10), 886–891.

Qiu, X. et al (2017). Reversed graph embedding resolves complex single-cell trajec-


tories. Nature Methods, 14(10), 979–982.

Rumelhart, D.E. et al (1985). Learning internal representations by error propagation.


Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.

Saelens, W. et al (2019). A comparison of single-cell trajectory inference methods.


Nature Biotechnology, 37(5), 547–554.

Street, K. et al (2018). Slingshot: cell lineage and pseudotime inference for single-
cell transcriptomics. BMC Genomics, 19(1), 477.

12
A Training loop algorithm
deg(v, t) = min n
n=1...|Γ(v)|
Algorithm S1 Training loop Pn
i=1 WΓ(v)i t
Require: Autoencoder (g ◦ f )θ s.t. P|Γ(v)| ≥
100
j=1 WΓ(v)j
Require: Pretraining epochs np , batch size b and learning
rate αp We can clearly see that when t = 100 we obtain the
Require: Finetuning epochs nf and learning rate αf classical definition of degree. As this generalization has not
Require: Weight parameters for the loss improved the visualization quality drastically, we opted for
λrec ,λpush-pull ,λcomp ,λcos the simpler version of the cosine loss in the main paper.
1: T, C, C2 , dgeo ← I NITIALIZATION (X)
2: #P retraining
3: for t = 0, 1, . . . , np do C Ablation study
4: for i = 0, 1, . . . , np /b do
5: Sample a minibatch m from X In order to better visualize the contributions of each ele-
6: m̂ ← g(f (m)) ment of our method, we conducted an ablation study of the
7: L ← Lrec different loss parameters and evaluated their impact both
8: θt+1 ← θt − αp ∇L qualitatively and quantitatively.
9: end for
10: end for C.1 Loss parameters
11: #F inetuning
12: for t = np , . . . , np + nf do The first phenomenon that is studied is the influence of
13: h ← f (X) dropping loss terms entirely. The reconstruction loss is al-
14: X̂ ← g(h) ways kept since it is necessary for the embeddings to con-
15: L ← λrec Lrec + λpush-pull Lpush-pull + λcomp Lcomp + tain salient information about the data. Not all combinations
λcos Lcos of loss parameters will be studied, but only those that should
16: θt+1 ← θt − αf ∇L be interesting (for example, using only the cosine loss does
17: end for not make much sense, so it is not an interesting scenario).
We will not study the influence of the weights for every
loss since the default weights of 1 lead to good performance
and this configuration significantly reduces the dimension
of the hyperparameter space. All experiments are described
B Cosine loss generalization in table S1.
The performance will be evaluated both qualitatively and
The definition of a vertex’ degree in a graph as the number quantitatively on all three discussed datasets to demonstrate
of incident edges to it is not perfect, as it does not take into as clearly as possible the impact of every loss term.
account the noisiness of the graph. On real datasets, we may
have stray clusters which lead to noisy edges in the density Experiment Lrec Lpush-pull Lcomp Lcos (weight)
graph. These usually manifest as edges with only one point A X
contributing to them in high dimension. This leads to ver- B X X
tices with an effective degree of 2 that have a higher degree C X X
due to these noisy edges, and are thus ignored by the cosine D X X X
loss. E X X X(50)
To remedy this, we introduce a different definition of de- F X X X X(50)
gree. We consider a threshold t ∈ [0, 100] and define the
degree of a vertex as the smallest number of incident edges Table S1: List of loss parameters for our ablations.
that account for t% of all points contributing to the vertex’s
incident edges. As t gets closer to a hundred, we converge As can be seen in figures S1,S2 and S3, the compactness
to the original definition of degree. loss alone is not sufficient to obtain a good representation
More formally put, consider a weighted graph since it has no repulsive force. The reconstruction loss helps
G = (V, E, W ) and a function Γ that returns incident to avoid a total collapse but is not sufficient to prevent a
edges to a given vertex sorted by their weights. This partial collapse, as visible in the endocrine pancreas and the
alternative definition of a vertex’s degree is then: T-cell datasets. While the push-pull loss already gives good

As can be seen in figures S1, S2 and S3, the compactness loss alone is not sufficient to obtain a good representation, since it has no repulsive force. The reconstruction loss helps to avoid a total collapse but is not sufficient to prevent a partial collapse, as visible in the endocrine pancreas and the T-cell datasets. While the push-pull loss already gives good results when used alone, since the tree structure is visible, adding the compactness loss yields embeddings in which the points lie compactly along the tree. Without the cosine loss, however, this combination can lead to sparse representations, because the seeds of second-order Voronoï cells do not necessarily lie in their cells. This means that points will not necessarily be spread out along the line between two centroids, but only lie inside the intersection of that line and their second-order Voronoï cell, which may be much smaller than the full line between the centroids. Using only the push-pull and cosine losses can lead to satisfying results, but the embedding is more spread out than with the compactness loss. Adding the cosine loss makes all the results cleaner and helps with the density of the point cloud. This effect is discussed in the next section.

From a quantitative point of view, adding all of these losses leads to worse performance than just using the push-pull loss alone. Since the compactness and cosine losses are designed with visualization in mind, they can alter the fidelity of the embedding. For example, making the points tighter along the density tree leads to pairwise distances being preserved more poorly, an effect that we indeed observe in the global metrics in table S2.

Nonetheless, when looking at the aggregated performances in table S3, we can see that all experiments except the one using only the compactness loss still perform comparably. As such, the gain in qualitative performance stemming from the added losses does not come at the expense of the preservation of the data's intrinsic structure. In particular, the push-pull loss alone drastically improves the visualization not only qualitatively, but also quantitatively.

                                  Local           Global: Euclidean     Global: Geodesic      Voronoï
Metric                            ARI     k-NN    Pearson   Spearman    Pearson   Spearman    1st order   2nd order
λpp = 0, λcomp = 0, λcos = 0      34.62   19.09   0.66      0.66        0.56      0.56        70.01       38.98
λpp = 0, λcomp = 1, λcos = 0      38.54   23.20   0.80      0.79        0.74      0.73        70.44       35.24
λpp = 1, λcomp = 0, λcos = 0      48.19   25.05   0.78      0.76        0.72      0.70        79.68       56.02
λpp = 1, λcomp = 1, λcos = 0      48.70   26.40   0.81      0.78        0.74      0.72        79.26       55.93
λpp = 1, λcomp = 0, λcos = 50     45.54   22.71   0.80      0.77        0.74      0.72        78.94       52.97
λpp = 1, λcomp = 1, λcos = 50     46.00   24.60   0.81      0.80        0.75      0.74        78.85       53.77

(a) PHATE generated dataset.

                                  Local           Global: Euclidean     Global: Geodesic      Voronoï
Metric                            ARI     k-NN    Pearson   Spearman    Pearson   Spearman    1st order   2nd order
λpp = 0, λcomp = 0, λcos = 0      34.52   4.17    0.81      0.85        0.57      0.62        65.37       30.83
λpp = 0, λcomp = 1, λcos = 0      24.88   2.32    0.65      0.69        0.53      0.59        46.30       11.11
λpp = 1, λcomp = 0, λcos = 0      43.07   3.07    0.77      0.79        0.71      0.77        73.79       50.68
λpp = 1, λcomp = 1, λcos = 0      44.69   2.93    0.73      0.79        0.66      0.75        72.95       46.75
λpp = 1, λcomp = 0, λcos = 50     35.64   2.77    0.73      0.75        0.70      0.75        68.44       38.82
λpp = 1, λcomp = 1, λcos = 50     39.79   2.85    0.71      0.74        0.71      0.78        69.24       38.04

(b) Endocrine pancreas dataset.

                                  Local           Global: Euclidean     Global: Geodesic      Voronoï
Metric                            ARI     k-NN    Pearson   Spearman    Pearson   Spearman    1st order   2nd order
λpp = 0, λcomp = 0, λcos = 0      29.24   2.20    0.40      0.33        0.40      0.42        35.17       4.50
λpp = 0, λcomp = 1, λcos = 0      40.65   1.24    0.15      0.17        0.20      0.20        28.75       2.75
λpp = 1, λcomp = 0, λcos = 0      29.72   1.28    0.42      0.26        0.36      0.39        47.19       18.63
λpp = 1, λcomp = 1, λcos = 0      45.55   1.15    0.38      0.23        0.35      0.38        37.71       16.29
λpp = 1, λcomp = 0, λcos = 50     29.24   1.23    0.37      0.24        0.42      0.44        44.73       12.16
λpp = 1, λcomp = 1, λcos = 50     37.25   1.15    0.31      0.19        0.40      0.41        38.41       12.85

(c) T-cells dataset.

Table S2: Quantitative results in different scenarios for DTAE's loss weights.

Figure S1: Results of the ablations on the PHATE generated dataset, colored by groundtruth clusters.

Figure S2: Results of the ablations on the T-cell dataset, colored by phenotypes.

Figure S3: Results of the ablations on the endocrine pancreas dataset, colored by cell types.

                                  Rel. Perf.
λpp = 0, λcomp = 0, λcos = 0        81.27
λpp = 0, λcomp = 1, λcos = 0        67.60
λpp = 1, λcomp = 0, λcos = 0        92.04
λpp = 1, λcomp = 1, λcos = 0        90.66
λpp = 1, λcomp = 0, λcos = 50       87.24
λpp = 1, λcomp = 1, λcos = 50       86.99

Table S3: Relative performance out of a hundred over all datasets and metrics.
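The appendix does not restate how these relative performances are computed. One plausible scheme, consistent with "out of a hundred over all datasets and metrics", normalizes each (dataset, metric) score by the best score achieved for it and then averages; the sketch below encodes this assumption and is not necessarily the paper's exact procedure:

import numpy as np

def relative_performance(scores):
    # scores: array of shape (n_loss_settings, n_dataset_metric_pairs),
    # where higher is better in every column.
    scores = np.asarray(scores, dtype=float)
    normalized = 100.0 * scores / scores.max(axis=0)  # best setting scores 100
    return normalized.mean(axis=1)  # average over all datasets and metrics

The "All" column of table S4 below presumably reports the same kind of average, restricted to the metrics of the T-cell dataset.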
C.2 Cosine loss weight

A parameter that is interesting to study in more detail is the cosine loss weight. While most of the other losses have a significant impact on the embeddings, the cosine loss is mostly cosmetic, and it is important to understand its behavior for low and high weights. The cosine loss weight will only be studied on the T-cell dataset, since this dataset is enough to demonstrate its impact on quantitative and qualitative results.

As can be seen in figure S4, the cosine loss straightens the branches for every weight, as intended. However, with higher weights, it also has a density-regularizing effect: as its weight increases, we obtain a more homogeneous and less clumped point cloud. While there is no clear explanation for this behavior, a hypothesis is that a higher weight means that this criterion is optimized with higher priority during the finetuning. Since the pretraining produces dense embeddings and the cosine loss has no incentive to produce sparse ones, this denser structure is kept during training. On the contrary, the push-pull loss can have a sparsifying effect, since the seeds of second-order Voronoï cells do not necessarily lie in their cells. When the cosine loss weight is smaller, the push-pull loss is optimized with relatively higher priority, which would lead to the sparser embeddings. All of this is intimately linked to the dynamics of neural network training and not only to the minimizers of each criterion, making a precise study of this process highly complex.

Figure S4: Results obtained on the chronic infection subset of the T-cell dataset when varying the cosine loss weight, colored by phenotypes.

From a quantitative point of view, a slight decrease in performance is visible in table S4 for all metrics except the preservation of geodesic distances and of first-order Voronoï diagrams. As a result, the overall performance decreases noticeably when increasing the cosine loss weight; see the rightmost column in table S4.

This again illustrates the trade-off between quantitative and qualitative performance: even though a method performs slightly worse quantitatively, it might still produce results that are easier for humans to interpret.

                                  Local           Global: Euclidean     Global: Geodesic      Voronoï                  All
Metric                            ARI     k-NN    Pearson   Spearman    Pearson   Spearman    1st order   2nd order
λcos = 1                          45.83   1.15    0.37      0.24        0.38      0.39        37.30       16.36        95.68
λcos = 2                          44.77   1.11    0.35      0.24        0.40      0.40        37.90       16.39        95.34
λcos = 5                          45.22   1.09    0.37      0.20        0.40      0.45        37.38       14.12        93.30
λcos = 10                         44.29   1.12    0.35      0.18        0.42      0.46        36.40       13.66        91.83
λcos = 15                         43.50   1.14    0.29      0.15        0.44      0.46        38.78       14.59        90.28
λcos = 20                         43.08   1.17    0.33      0.17        0.42      0.46        37.43       14.41        91.74
λcos = 50                         37.25   1.15    0.31      0.19        0.40      0.41        38.41       12.85        87.50

Table S4: Quantitative results on the T-cell dataset when varying the cosine loss weight. The weights for the push-pull and compactness losses are set to one. The rightmost column contains the average performance over all metrics for a given weight.

D High resolution results

Figure S5: Results obtained on the endocrine pancreatic cell dataset, colored by cell types.

Figure S6: Results obtained on the chronic infection subset of the T-cell dataset, colored by phenotypes.

E Complete quantitative results

                                  Local           Global: Euclidean     Global: Geodesic      Voronoï
Metric                            ARI     k-NN    Pearson   Spearman    Pearson   Spearman    1st order   2nd order
DTAE (Ours)                       46.00   24.60   0.81      0.80        0.75      0.74        78.85       53.77
AE                                34.67   19.18   0.63      0.64        0.55      0.54        70.56       38.64
PHATE                             51.33   60.44   0.50      0.46        0.54      0.52        71.12       30.40
UMAP                              55.13   67.92   0.53      0.48        0.54      0.51        75.18       46.19
SAUCIE                            56.62   37.61   0.81      0.79        0.75      0.73        82.98       65.07
DCA                               40.41   21.29   0.64      0.64        0.59      0.59        73.83       43.17
scVI                              36.51   19.68   0.69      0.68        0.67      0.67        69.94       36.98
Force Atlas 2                     51.85   64.38   0.59      0.55        0.56      0.54        75.79       46.44
PCA                               39.87   22.36   0.77      0.74        0.67      0.66        73.14       42.53

(a) PHATE generated dataset.

                                  Local           Global: Euclidean     Global: Geodesic      Voronoï
Metric                            ARI     k-NN    Pearson   Spearman    Pearson   Spearman    1st order   2nd order
DTAE (Ours)                       39.79   2.85    0.71      0.74        0.71      0.78        69.24       38.04
AE                                34.12   4.24    0.81      0.84        0.58      0.62        65.12       30.53
PHATE                             30.92   3.70    0.64      0.65        0.71      0.78        57.22       22.27
UMAP                              30.41   4.67    0.57      0.58        0.79      0.82        57.79       21.21
SAUCIE                            38.94   3.69    0.81      0.81        0.71      0.73        69.46       37.93
DCA                               30.93   5.01    0.41      0.78        0.37      0.63        64.29       27.98
scVI                              32.99   3.80    0.67      0.68        0.65      0.67        61.29       27.52
Force Atlas 2                     25.11   3.96    0.28      0.62        0.21      0.74        46.24       13.08
PCA                               21.84   1.94    0.69      0.67        0.87      0.87        48.65       15.50

(b) Endocrine pancreas dataset.

                                  Local           Global: Euclidean     Global: Geodesic      Voronoï
Metric                            ARI     k-NN    Pearson   Spearman    Pearson   Spearman    1st order   2nd order
DTAE (Ours)                       37.25   1.15    0.31      0.19        0.40      0.41        38.41       12.85
AE                                28.87   2.17    0.38      0.32        0.43      0.43        34.84       4.58
PHATE                             32.00   1.25    -0.02     0.02        0.42      0.43        33.70       3.45
UMAP                              23.41   1.52    0.11      0.21        0.46      0.44        29.24       5.28
SAUCIE                            26.86   1.59    0.21      0.25        0.43      0.42        34.30       4.63
DCA                               0.10    1.34    0.45      0.62        0.00      0.26        3.17        1.04
scVI                              28.69   1.26    0.43      0.23        0.38      0.46        33.31       5.67
Force Atlas 2                     23.83   0.93    0.02      0.01        0.05      0.41        28.39       3.09
PCA                               20.82   1.10    0.18      0.16        0.61      0.57        32.30       8.27

(c) T-cells dataset.

Table S5: Full quantitative results on all studied datasets. Metrics are described in section 4.4, and higher values indicate better performance.
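The metrics themselves are defined in section 4.4 of the main text and are not repeated here. Purely as an illustration of the kind of computation behind the global columns, and under our assumption that they correlate pairwise distances before and after embedding, a sketch could look as follows (function and variable names are ours):

from scipy.spatial.distance import pdist
from scipy.stats import pearsonr, spearmanr

def global_distance_scores(X_high, X_emb):
    # Condensed vectors of pairwise Euclidean distances in the
    # high-dimensional gene-expression space and in the 2D embedding.
    # The geodesic variant would instead use shortest-path distances
    # on a neighborhood graph of the high-dimensional data.
    d_high = pdist(X_high)
    d_emb = pdist(X_emb)
    return pearsonr(d_high, d_emb)[0], spearmanr(d_high, d_emb)[0]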
