
bioRxiv preprint doi: https://doi.org/10.1101/2024.06.07.598026; this version posted June 10, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Learning context-aware, distributed gene representations in spatial transcriptomics with SpaCEX

Xiaobo Sun (a,1), Yucheng Xu (b), Wenlin Li (a), Mengqian Huang (a), Ziyi Wang (a), Jing Chen (a), and Hao Wu (c,d,1)

(a) School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430073, China; (b) School of Statistics and Data Sciences, Nankai University, Tianjin, 300071, China; (c) Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China; (d) Key Laboratory of Biomedical Imaging Science and System, Chinese Academy of Sciences, Shenzhen, 518055, China; (1) Corresponding author

This manuscript was compiled on June 8, 2024

Distributed gene representations are pivotal in data-driven genomic research, offering a structured way to understand the complexities of genomic data and providing a foundation for various data analysis tasks. Current gene representation learning methods demand costly pretraining on heterogeneous transcriptomic corpora, making them less approachable and prone to over-generalization. For spatial transcriptomics (ST), there is a plethora of methods for learning spot embeddings but a serious lack of methods for generating gene embeddings from spatial gene profiles. In response, we present SpaCEX, a pioneering, cost-effective self-supervised learning model that generates gene embeddings from ST data by exploiting spatial genomic “contexts” identified as spatially co-expressed gene groups. SpaCEX-generated gene embeddings (SGEs) feature context-awareness, rich semantics, and robustness to cross-sample technical artifacts. Extensive real data analyses reveal the biological relevance of SpaCEX-identified genomic contexts and validate the functional and relational semantics of SGEs. We further develop a suite of SGE-based computational methods for a range of key downstream objectives: identifying disease-associated genes and gene-gene interactions, pinpointing genes with designated spatial expression patterns, enhancing the transcriptomic coverage of FISH-based ST, detecting spatially variable genes, and improving spatial clustering. Extensive real data results demonstrate these methods’ superior performance, thereby affirming the potential of SGEs in facilitating various analytical tasks.

spatial transcriptomics | gene embeddings | genomic contexts | self-supervised learning

Significance Statement

Spatial transcriptomics enables the identification of spatial gene relationships within tissues, providing semantically rich genomic “contexts” for understanding functional interconnections among genes. SpaCEX marks the first endeavor to effectively harness these contexts to yield biologically relevant distributed gene representations. These representations serve as a powerful tool to greatly facilitate the exploration of the genetic mechanisms behind phenotypes and diseases, as exemplified by their utility in key downstream analytical tasks in biomedical research, including identifying disease-associated genes and gene interactions, in silico expanding the transcriptomic coverage of low-throughput, high-resolution ST technologies, pinpointing diverse spatial gene expression patterns (co-expression, spatially variable pattern, and patterns with specific expression levels across tissue domains), and enhancing tissue domain discovery.

Distributed gene representations embed the multifaceted nature of genes within a high-dimensional space, offering profound insights into the complex mechanisms of gene expression, regulation, and interaction, and paving the way for leveraging machine learning techniques in advancing biomedical research (1), disease diagnosis (2), and the discovery of therapeutic targets (3) with unprecedented precision and efficiency.

Spatial transcriptomics (ST), including high-resolution in situ hybridization-based (e.g., SeqFISH (4)) and high-throughput in situ capturing-based (e.g., 10x Visium (5)) technologies, enables the profiling of spatial gene expression in heterogeneous tissues, providing unprecedented opportunities to characterize the spatial distribution of cell types (6) and to delineate spatial tissue organization (5) and gene-gene interactions (7). ST can also reveal spatial genomic “contexts” formed by genes cofunctional in the same biological processes and pathways, given the similarity of these genes in spatial expression patterns within tissues. Such genomic contexts not only provide insights into the molecular mechanisms underlying biological processes and disease development (8), but also allow the learning of distributed gene representations, resembling the learning of word representations from word contexts in linguistic models (1, 9, 10). These gene embeddings provide a foundation for quantitatively characterizing context-specific gene functions and interactions from a spatial perspective, facilitating various analytical endeavors where insights into spatial genetic mechanisms are critical.

However, to the best of our knowledge, there is currently no method for learning gene embeddings from ST data due to challenges in effectively identifying spatial genomic contexts and encoding spatial gene expression patterns. While recent works, including Gene2vec (1), scGPT (9), scFoundation (11), scBERT (12), and geneFormer (13), have been developed to learn gene embeddings from atlas-scale microarray or scRNA-seq data, they do not extend to ST, overlooking crucial spatial gene expression information. Moreover, it has been observed that the extensive pretraining of these models on massive data corpora offers marginal benefits for finetuning downstream tasks (14). This is probably due to the irreconcilable heterogeneities in pretraining data collected across a variety of

X.S. and H.W. conceived and supervised the study. X.S. derived the model and developed the framework. X.S., H.W., W.L. wrote and revised the manuscript. X.S., Y.X., W.L., and Z.W. implemented the framework. Y.X., W.L., M.H., and J.C. conducted the experiments. X.S., Y.X., M.H., Z.W. conducted the analyses. Y.X., W.L., X.S., M.H., and J.C. collected the results and plotted the figures. All authors approved the manuscript.

The authors declare no competing interests.

(1) To whom correspondence may be addressed. E-mail: [email protected]; [email protected]

www.xxxx.org/cgi/doi/10.1073/xxxx.XXXXXXXXXX PNAS | June 8, 2024 | vol. XXX | no. XX | 1–12


Fig. 1. Overview of the distributed gene representation learning with SpaCEX. (A) Outline of learning distributed gene representations from spatial genomic contexts. ST data reveal spatial genomic contexts comprising genes cofunctional in gene pathways and networks since they tend to exhibit similar expression patterns across tissue space. By leveraging these genomic contexts, distributed gene representations can be learned. These representations not only encapsulate gene spatial functional and relational semantics but are also instrumental in facilitating various downstream task-specific objectives. (B) Workflow of SpaCEX. It comprises three modules: In Module I, SpaCEX employs an adapted masked autoencoder (MAE) to learn the representations of gene images, which enhances gene embeddings’ local-context perceptibility. Module II involves modeling gene embeddings using a Student’s t mixture model, with parameters estimated via a MAP-EM algorithm. This module aims to maximize the likelihood of the entire dataset. In Module III, gene embeddings are refined through a self-paced pretext task that aims to identify genomic contexts via iterative pseudo-contrastive learning. Together, Modules II and III constitute a single training epoch, in which gene embeddings’ discriminability is enhanced. The training process continues until either the change in gene assignments falls below a threshold or a predetermined number of epochs is reached. Upon training completion, SpaCEX-generated gene embeddings (SGEs) can be utilized in downstream analytical tasks, such as i) enhancing transcriptomic coverage of FISH-based ST, ii) cross-sample gene alignment, iii) identifying disease-associated genes, iv) identifying disease-associated gene crosstalk, v) pinpointing genes with designated spatial expression patterns, vi) detecting SVGs, vii) improving spatial clustering.
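The workflow in Fig. 1B labels a posterior soft assignment Q, a target distribution P, and a KL(P || Q) term. The exact losses are defined in the paper's Methods, which this excerpt does not reproduce. As a minimal sketch under that caveat, the Python snippet below builds a sharpened target from a soft-assignment matrix using the squared-and-renormalized rule known from deep embedded clustering, which is an assumption here and not necessarily SpaCEX's own rule, and then evaluates the KL divergence. The function names and the toy matrix are hypothetical.

```python
import math


def sharpen_targets(q):
    """Build a target distribution P from soft assignments Q.

    Assumed rule (deep-embedded-clustering style, not taken from the paper):
    p_ik is proportional to q_ik^2 / f_k, where f_k = sum_i q_ik.
    """
    n_components = len(q[0])
    freq = [sum(row[k] for row in q) for k in range(n_components)]
    p = []
    for row in q:
        weights = [row[k] ** 2 / freq[k] for k in range(n_components)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all genes and mixture components."""
    return sum(
        p_ik * math.log((p_ik + eps) / (q_ik + eps))
        for p_row, q_row in zip(p, q)
        for p_ik, q_ik in zip(p_row, q_row)
    )


# Toy soft assignments of three genes to two genomic contexts (hypothetical values).
q = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
p = sharpen_targets(q)
loss = kl_divergence(p, q)
```

In this sketch the target pushes each gene further toward the component it already favors, so minimizing KL(P || Q) with respect to the embeddings makes the assignments progressively more confident.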

technical and biological conditions, which hinder the model’s ability to grasp context-specific gene embedding nuances. In addition, these methods, characterized by their tremendous number of parameters, are prone to sensitivity regarding hyperparameter settings and parameter initializations (14). This sensitivity necessitates repeated pretraining, particularly with the introduction of new data, which, coupled with the prohibitive costs of data collection and model pretraining, renders the updating of pretrained models impractical for the broader research community.

To bridge this gap, we develop a “context-aware, self-supervised learning on Spatially Co-EXpressed genes (SpaCEX)” model. SpaCEX utilizes the spatial genomic context inherent in ST data to generate gene embeddings that accurately represent the condition-specific spatial functional and relational semantics of genes. Technically, SpaCEX treats gene spatial expressions as images and leverages a masked-image model (MIM), which excels in extracting local-context perceptible and holistic visual features (15), to yield initial gene embeddings. These embeddings are iteratively refined through a self-paced pretext task aimed at discerning genomic contexts by contrasting spatial expression patterns among genes, drawing genes with similar spatial expressions closer in the latent embedding space, while distancing those with divergent patterns. This step enhances the discriminability of gene embeddings, facilitating their incorporation of intricate relational semantic structures among genes. The innovative integration of MIM with contrastive learning equips SpaCEX with a superior capability to comprehend and translate spatial gene information, including both expression patterns and inter-gene relationships, into SpaCEX-generated Gene Embeddings (SGEs).

Our study is organized as follows: Firstly, to establish the validity of our method, we demonstrate SpaCEX’s effectiveness in identifying spatially co-expressed gene groups, justifying their role as genomic contexts by showing their biological relevance. In addition, the functional and relational semantics of SGEs undergo rigorous validations. Particularly, we demonstrate SGEs’ robustness to technical variations and their utility in gene alignment and comparison across samples. Secondly, we demonstrate the applicability of SpaCEX and SGEs in key downstream tasks by developing a suite of SGE-based computational methods for: i) the identification of disease-associated genes and gene crosstalk; ii) the de novo expansion of transcriptomic coverage of FISH-based ST data, addressing a longstanding challenge that has restricted the broader application of FISH-based ST data; iii) pinpointing genes with designated spatial expression patterns in tissues; iv) detecting spatially variable genes (SVGs); v) improving spatial clustering. Extensive real data analyses demonstrate that our methods either provide optimal solutions to challenges that have not been adequately addressed, e.g., the first three


Fig. 2. SpaCEX identifies clusters of spatially co-expressed genes within a biologically relevant genomic context. (A) Performance comparison of SpaCEX and four benchmark methods (CNN-PReg, Giotto, SPARK, and STUtility) in grouping co-expressed genes in the 10x-hDLPFC-151673 and ssq-mHippo datasets. The DB indices calculated from Pearson and Euclidean distances are used to measure the overall co-expression (left panel) and spatial coherence (right panel) of the gene clusters, respectively. In both cases, a lower DB index value indicates enhanced co-expression or spatial coherence. (B) Spatial expression patterns of SpaCEX-generated gene groups overlap with the cell type distributions in the 10x-hBC dataset. The leftmost panel displays the manually annotated distributions of ductal carcinoma in situ (DCIS, in yellow), invasive ductal carcinoma (IDC, in red) and benign stroma cells (in original color) in human breast cancer tissues. The right four panels display the aggregated spatial expression (module scores) of three SpaCEX-generated gene groups (C1-C3) and a control cluster (C4) consisting of randomly selected genes. The brightness in each panel is positively correlated with the level of aggregated gene expression. The name of the cell type whose distribution overlaps with the aggregated expression pattern of the gene group is indicated above each panel. (C) GO enrichment and cofunction analyses of genes within C1-C4. The dots represent the 20 most significantly enriched GOBPs in each gene group. The x-axis represents the negative logarithm of adjusted P-values of biological process enrichment significance, while the y-axis represents the ROC AUC scores of the 20 GOBPs in the gene cofunction analysis. Red color indicates benign stroma-related GOBPs, green color the noninvasive cancer-related GOBPs, cyan color invasive cancer-related GOBPs, and purple color other GOBPs. (D) The connectivity between the nodes corresponding to the most significantly enriched GOBPs indicates their functional associations. The node color represents the GOBP’s enrichment significance (negative logarithm of adjusted P-value), with darker colors indicative of lower significance levels. Node size indicates the number of genes involved in the GOBP.
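Panel A relies on the Davies-Bouldin (DB) index evaluated under two different distances. The paper's exact definition is given in its Methods section, which is not part of this excerpt, so the following Python sketch uses the standard textbook DB index with a pluggable distance function. The helper names and the toy gene profiles are hypothetical, and the Pearson distance is taken as one minus the correlation coefficient, which is an assumption.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson_distance(u, v):
    """1 - Pearson correlation; assumes neither vector is constant."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)


def davies_bouldin(profiles, labels, dist):
    """Standard DB index: mean over clusters of the worst (s_i + s_j) / d(c_i, c_j)."""
    clusters = {}
    for x, lab in zip(profiles, labels):
        clusters.setdefault(lab, []).append(x)
    # Centroid = coordinate-wise mean of the cluster members.
    centroids = {
        lab: [sum(col) / len(members) for col in zip(*members)]
        for lab, members in clusters.items()
    }
    # Scatter = mean distance of members to their centroid.
    scatter = {
        lab: sum(dist(x, centroids[lab]) for x in members) / len(members)
        for lab, members in clusters.items()
    }
    keys = list(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
        for i in keys
    ]
    return sum(worst) / len(worst)


# Four toy gene profiles forming two obvious co-expression groups (hypothetical values).
profiles = [[1, 2, 3, 4], [1.1, 2.1, 2.9, 4.2], [4, 3, 2, 1], [4.2, 2.9, 2.1, 1.1]]
db_good_euclid = davies_bouldin(profiles, [0, 0, 1, 1], euclidean)
db_bad_euclid = davies_bouldin(profiles, [0, 1, 0, 1], euclidean)
db_good_pearson = davies_bouldin(profiles, [0, 0, 1, 1], pearson_distance)
```

With these toy profiles, the grouping that matches the co-expression structure yields a much lower index than the mismatched grouping, which is the sense in which a lower DB index indicates better clusters.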

tasks, or outperform established benchmarks as seen in tasks iv and v. These tasks exemplify how SGEs can be effectively employed to address various downstream task-specific objectives, promising SpaCEX’s potential in developing a genomic “language”-based methodological ecosystem.

Results

Overview of SpaCEX. The fundamental idea (Fig. 1A) of SpaCEX is that genes co-functional in gene networks and pathways typically exhibit similar spatial expression patterns in tissues, forming a “genomic context” resembling the word context in natural languages. Through self-supervised learning of the “proximity” of genes in spatial transcriptional activity, as implied by the ST data, we can concurrently identify spatial genomic contexts and acquire semantically meaningful gene embeddings in a data-driven manner. These embeddings, representing spatial gene functions and relationships, can facilitate downstream task-specific objectives. As illustrated in Methods and Fig. 1B, SpaCEX mainly consists of three modules. In Module I, we leverage an adapted masked autoencoder (MAE) (15) to transform gene images into gene embeddings that follow a mixture of multivariate Student’s t distributions (see “Representation learning of spatial gene expression maps” in Methods). With this MIM, SpaCEX gains local-context perceptibility by learning to regenerate masked image patches from the surrounding contexts. In Module II, the gene embeddings are modeled as a Student’s t mixture model (SMM) in the latent feature space, with each mixture component serving as a genomic context comprising spatially co-expressed genes. After the estimation of SMM parameters via a maximum a posteriori (MAP)-EM algorithm, soft assignments of genes to the mixture components are computed (see “SMM-based modeling” in Methods). Module III implements a semi-contrastive learning process, through which SpaCEX gains the ability to


Fig. 3. Validating the relational semantics of SGEs. SGEs are derived from the 10x-hDLPFC-151676 dataset. (A) Hierarchical clustering based on the SGEs of the HLA-I, HLA-II, and KRT-II gene family members. Genes with similar SGEs are positioned in close proximity along the y-axis. The 64 dimensions of the SGEs are represented along the x-axis. Gene families are indicated in different colors on the y-axis. The red color intensity in the diagram positively correlates with the SGE values. (B) Reactome-based pathway enrichment analysis of gene clusters generated by Leiden at various resolutions, based on gene-gene similarity matrices computed from either the SGEs or original gene expression profiles. The x-axis denotes the Leiden resolution, while the y-axis represents the average number of statistically significant enriched gene pathways (or high-confidence “pathway hits”) across the gene clusters. Red and blue spots represent gene clusters derived from the SGEs and original gene expression profiles, respectively. (C) The predictive power of gene-gene interactions with different types of gene embeddings. Here, we showcase the mean accuracies (left panel) and ROC AUC (right panel) scores for Gene-Gene Interaction Predictor Neural Network (GGIPNN)-based predictions of gene-gene interactions using four distinct types of gene embeddings: SGEs, scBERT embeddings, Gene2vec embeddings, and randomly generated embeddings. Refer to Supplementary Note 1.4 for details of creating training and testing datasets. (D) The gene-gene interaction heatmaps are presented to compare the ground truth (top-left) with prediction results using the four types of gene embeddings. For better visualization, the heatmaps only include the top 1,000 genes that exhibit the most interactions with other genes as per ground truth. In these maps, a filled cell indicates the existence of an interaction between the pair of genes in the corresponding row and column, while a blank cell indicates the opposite. The prediction accuracy is indicated on top of each heatmap.
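Panel C summarizes interaction predictions with accuracy and ROC AUC. For readers who want the two metrics spelled out, the Python sketch below computes both from binary labels and predicted scores; it says nothing about how GGIPNN itself is trained or evaluated. The 0.5 decision threshold, the function names, and the toy values are assumptions made for illustration.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of gene pairs whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive pair outranks a negative pair.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = interacting pair) and predicted interaction scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
acc = accuracy(labels, scores)   # 5 of 6 pairs classified correctly
auc = roc_auc(labels, scores)    # 8 of 9 positive-negative pairs ranked correctly
```

Accuracy depends on the chosen threshold, whereas ROC AUC depends only on the ranking of the scores, which is why the two panels can disagree in how far apart the embedding types appear.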

discriminate between spatial gene expressions and grasp the intricate relational semantic structures among genes. This process involves a self-paced, iterative joint optimization of MAE weights and SMM parameters using two loss functions L1 and L2 (see “Self-paced semi-contrastive optimization of gene embeddings” in Methods). Each training epoch begins with L1, updating the MAE weights to maximize the log likelihood of the entire dataset while controlling for macro-factors (e.g., cluster size imbalance) via regularization terms. Following this, L2, designed for discriminatively boosted clustering, further refines gene embeddings and SMM parameters, drawing similar genes closer and distancing dissimilar ones over successive batches. Overall, the training process alternates between Modules II and III until either a predetermined number of training epochs is reached, or the change in gene assignments between successive epochs falls beneath a prespecified threshold.

SpaCEX Identifies Spatially Co-expressed Gene Clusters as Biologically Relevant Genomic Context. The effectiveness of SpaCEX in generating semantically meaningful gene representations hinges on its ability to identify biologically relevant genomic contexts, manifested as clusters of spatially co-expressed genes. To systematically evaluate this ability, we utilize twelve human dorsolateral prefrontal cortex (DLPFC) 10x Visium datasets (10x-hDLPFC) (16) and a mouse hippocampus Slide-seqV2 dataset (ssq-mHippo) (17), as listed in SI Appendix, Table S1. We benchmark SpaCEX against four state-of-the-art competing methods: CNN-PReg (18), Giotto (19), SPARK (20) and STUtility (21) (SI Appendix, Table S2). Our initial assessment involves visualizing spatial expression maps of four randomly selected genes from each of two SpaCEX-identified clusters, chosen to represent high and medium quality clusters, respectively (see “Identifying groups of spatially co-expressed genes” in Methods). SI Appendix, Fig. S1 shows that genes within both clusters exhibit congruent expression patterns. Next, the overall co-expression and spatial coherence of SpaCEX-generated clusters are quantitatively measured using two Davies-Bouldin (DB) indices (see “Evaluation metrics” in Methods). Fig. 2A shows that SpaCEX consistently outperforms the competing methods in both DB indices, demonstrating its effectiveness in identifying spatially co-expressed gene clusters.

To verify the legitimacy of using SpaCEX-identified gene clusters as spatial genomic contexts, we delve into their biological significance through gene pathway enrichment analysis and gene cofunction analysis using the 10x-hBC dataset derived from human breast cancer tissue (20). Our analysis


Fig. 4. SGEs of 2177 housekeeping genes are generated from two healthy human MTG 10x Visium datasets (10x-hMTG-1-1 and 10x-hMTG-18-64), respectively. UMAP embeddings of the same dimension as SGEs for these housekeeping genes are also generated from both datasets. Embedding pairs of identical genes but from different datasets are then subjected to alignment. (A) The network architecture of the SGE alignment network (SAN). (B) PCA plots of SGEs (left) and gene UMAP embeddings (right) after SAN-mediated alignment. (C) Scaled cosine dissimilarities between pairs of aligned SGEs (in blue) versus those between aligned UMAP embeddings (in orange).
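Two quantities recur in this alignment analysis: the mean absolute error that SAN minimizes between paired embeddings, and the cosine dissimilarity used to score alignment discrepancies. The Python sketch below computes both for a pair of vectors. It does not implement SAN, and because the scaling applied in panel C is not specified in this excerpt, the dissimilarity shown is the plain one-minus-cosine form, which is an assumption. The function names and example vectors are hypothetical.

```python
import math


def mean_absolute_error(u, v):
    """Average absolute coordinate-wise difference between two embeddings."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def cosine_dissimilarity(u, v):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical embeddings of the same gene from two samples after alignment.
aligned_a = [1.0, 2.0, 3.0]
aligned_b = [1.0, 3.0, 5.0]
mae = mean_absolute_error(aligned_a, aligned_b)
dissimilarity = cosine_dissimilarity(aligned_a, aligned_b)
```

The mean absolute error is sensitive to the magnitude of the embeddings, whereas the cosine dissimilarity depends only on their direction, so the two measures capture complementary aspects of how well a gene's embeddings agree across samples.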

targets three specific gene clusters (C1, C2, and C3), each comprising over 20 genes and demonstrating the lowest group closeness centrality (SI Appendix, Supporting Text). This centrality metric suggests their expression patterns are most likely to diverge from the majority, probably due to pathological functions. As depicted in Fig. 2B, the aggregated expression patterns (22) (SI Appendix, Supporting Text) of C1 through C3 are associated with the spatial distributions of invasive ductal carcinoma (IDC), ductal carcinoma in situ (DCIS) and benign stroma cells, respectively. In contrast, the aggregated expression of C4, a control cluster comprising 30 randomly selected genes, is dispersed over the spatial map. Additionally, Fig. 2C-D showcase that these clusters are statistically significantly enriched with pathologically/biologically relevant and densely inter-connected gene ontology biological processes (GOBP). The functional coherence of member genes within these clusters is also notable, given that the involvement of a gene in a GOBP can be reliably predicted based on other member genes’ involvement in the same GOBP (Fig. 2C). A more detailed explanation of the methodology and results of this analysis is available in the SI Appendix, Supporting Text. These findings altogether affirm that SpaCEX-identified gene clusters are cofunctional and biologically/pathologically relevant to the context under investigation, thus endorsing their roles as spatial genomic contexts.

SGEs Effectively Capture Fundamental Gene Semantics. We begin this section by validating the functional and relational semantics of SGEs through three analyses (see “Evaluating SpaCEX-generated gene embeddings” in Methods). In the first analysis, hierarchical clustering is performed on SGEs for the KRT-II, HLA-I and HLA-II gene family members derived from human DLPFC tissues (Fig. 3A). We find that SGEs from the same family are clustered together, and those from functionally related gene families (e.g., HLA-I and HLA-II) are positioned in closer proximity within the hierarchy than those from less related families (e.g., KRT-II and HLA-II). In contrast, these gene families are more intermingled when the clustering is based on their original expression profiles (SI Appendix, Fig. S2). In the second analysis, Leiden is utilized to identify gene clusters at various resolutions, using either SGEs or the original gene expression profiles. Subsequently, pathway enrichment analysis against the Reactome database (23) is conducted on these gene clusters to compile their high-confidence “pathway hits”, which are indicative of the gene representations’ efficacy in encoding complex gene-gene connections (9). Fig. 3B showcases that gene clusters derived from SGEs yield a consistently higher average number of “pathway hits” compared to those derived from original gene expression profiles across resolution levels. The third analysis focuses on predicting gene-gene interactions using four types of gene embeddings, including SGEs derived from the 10x-


Fig. 5. Identifying AD-associated genes and gene crosstalk with SGEs. Fig. 5A and 5B are for identifying AD-associated genes, while Fig. 5C and Fig. 5D are for AD-associated gene crosstalk. SGEs of 42 AD-associated genes and 126 non-anchor housekeeping genes are obtained from two healthy MTG 10x Visium datasets (health_0: 10x-hMTG-1-1 and health_1: 10x-hMTG-18-64) alongside an AD MTG 10x Visium dataset (10x-hMTG-2-3). SGE pairs of identical genes but from different datasets are aligned using SAN. The AD and the healthy (health_0) datasets form the study group, while the two healthy datasets form the control group. (A) PCA plots of SGEs in the study (left) and control (right) groups. Round dots and triangles represent housekeeping and AD-associated genes, respectively. The PCA distances between SGE pairs are visually represented by yellow and blue lines for housekeeping and AD-associated genes, respectively. Compared to the control group, blue lines are markedly longer than yellow lines in the study group. (B) Box-plots of scaled cosine dissimilarities between SGE pairs in the study (left) and control (right) groups. Yellow and blue boxes represent housekeeping genes and AD-associated genes, respectively. (C) Gene-gene interactions within each dataset are quantified using a Pearson correlation matrix calculated from SGEs. Alterations in gene-gene interactions between two datasets are measured as a correlation shift matrix representing the absolute differences between the two correlation matrices, which is visualized as a heatmap wherein darker colors indicate larger shifts. Compared to the control group (right), correlation shifts in the study group (left) for gene pairs involving at least one AD-associated gene are significantly larger than those for gene pairs of housekeeping genes only. (D) Scatterplots of correlations between gene pairs involved in the same AD-associated pathways, with each cross representing a gene pair. For the study group (left), y- and x-axes denote gene correlations in the AD and healthy (health_0) datasets, respectively. In the control group (right), these axes represent gene correlations in the health_1 and health_0 datasets, respectively. A t-test is used to assess the statistical significance of differences in gene correlations between the datasets, with P-values indicated on top of each panel.
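The correlation shift matrix described for panel C can be written down directly from the caption: compute a Pearson correlation matrix over gene embeddings in each dataset, then take the element-wise absolute difference. The Python sketch below follows that description with toy three-dimensional embeddings. The function names and values are hypothetical, and any preprocessing the authors apply before computing the correlations (such as the SAN alignment) is not reproduced here.

```python
import math


def pearson(u, v):
    """Pearson correlation of two embedding vectors (assumes non-constant vectors)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def correlation_matrix(embeddings):
    """Gene-by-gene Pearson correlation matrix computed from embeddings."""
    return [[pearson(u, v) for v in embeddings] for u in embeddings]


def correlation_shift(embeddings_a, embeddings_b):
    """Element-wise absolute difference between two correlation matrices."""
    corr_a = correlation_matrix(embeddings_a)
    corr_b = correlation_matrix(embeddings_b)
    return [
        [abs(x - y) for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(corr_a, corr_b)
    ]


# Toy embeddings for three genes in two datasets; only the third gene changes.
dataset_a = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 1.0, 2.0]]
dataset_b = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [1.0, 2.0, 2.5]]
shift = correlation_shift(dataset_a, dataset_b)
```

In this toy case the shift between the first two genes is zero because their embeddings are unchanged, while every pair involving the third gene shows a large shift, mirroring the contrast the caption draws between housekeeping-only pairs and pairs involving an AD-associated gene.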

hDLPFC-151676, scBERT embeddings, Gene2vec embeddings, and randomly generated embeddings. As elaborated in SI Appendix, Supporting Text, SGE-based prediction proves to be the most accurate, as evidenced by its superior accuracy and AUC scores (Fig. 3C), along with a prediction heatmap closely aligning with the ground truth (Fig. 3D). Collectively, these analyses affirm that fundamental functional and relational semantics of genes are encapsulated into SGEs.

SGEs Facilitate Cross-sample Gene Alignment. Given that gene-gene relationships remain relatively stable across datasets under identical conditions and are less prone to technical artifacts like batch effects (24), we posit that SGEs, generated based on gene relational semantics as previously validated, are also resilient to technical artifacts, thus capable of facilitating cross-sample gene alignment. To verify this point, SGEs from two healthy human middle temporal gyrus (MTG) 10x Visium datasets (10x-hMTG-1-1 and 10x-hMTG-18-64) are aligned using our SGE alignment network (SAN), which is a three-layer feedforward neural network (FFN) with nonlinear activation functions (Fig. 4A). SAN learns a mapping function, F, that minimizes the mean absolute errors (MAE) between SGE pairs of 2177 housekeeping genes from two different datasets (see “Data and Code Availability” for where housekeeping genes are acquired). The consistent biological roles of these housekeeping genes suggest that their SGEs should align accurately if they are unaffected by cross-sample technical variations. For comparison, gene expressions from both datasets are also transformed into uniform manifold approximation and projection (UMAP) embeddings with the same dimensionality as SGEs. SAN is then trained to align these UMAP embeddings, which could potentially be confounded by technical variations. The cosine dissimilarity of aligned embedding pairs serves as an indicator of alignment discrepancies. Fig. 4B reveals that discrepancies between SGE pairs are significantly lower than those between pairs of UMAP embeddings, highlighting SGEs’ robustness against cross-sample technical noise. Moreover, by visualizing aligned SGEs and UMAP embeddings on a principal component analysis (PCA) plot (Fig. 4C), we find SGEs from different datasets are more evenly mixed compared to the UMAP embeddings. These observations provide strong evidence that SGEs prioritize capturing genuine gene semantics and are resistant to technical artifacts across samples, thereby facilitating more accurate cross-sample gene alignment.

Identifying Disease-associated Genes and Gene Crosstalk with SGEs. Identifying genes with altered spatial expression and interactions is pivotal for illuminating pathogenic mechanisms underlying disease progression, e.g., the elevated APOE expression within hippocampus in Alzheimer’s Disease (AD) (25) and the intensified interplay between Notch and Wnt path-

[Figure 6. Panel A (10x-hBC): designated patterns (95%, 75%, 35%), denoised designated patterns, identified genes BRCA1, BRCA2, PALB2 and TCEAL4, and domain labels (IDC, tumor edge, others, noise). Panel B (10x-hMTG-2-3): designated patterns (95%, 75%), denoised designated patterns, identified genes TREM2, PSEN1, BIN1 and APOE, and domain labels (Layers 1 to 6, white matter).]

Fig. 6. Identifying disease-associated genes with designated spatial expression patterns. For both Fig. 6A and 6B, the designated gene spatial expression patterns are displayed in the leftmost panel in the first row. The expression level percentiles are indicated within different regions demarcated by dotted lines of varying colors. The rightmost panel in the first row shows expert-curated domain labels. The leftmost panel in the second row represents denoised SpaCEX-SPS-simulated genes that mirror the designated patterns. The remaining panels correspond to denoised spatial maps of identified real genes whose expression patterns resemble the designated ones. (A) The designated expression patterns have a high expression level (95th percentile) in tumor cores (i.e., IDC), a medium expression level (75th percentile) in tumor edges, and a low expression level (35th percentile) elsewhere in the human breast cancer 10x Visium dataset (10x-hBC). (B) The designated expression patterns have a high expression level (95th percentile) in white matter (WM) and a medium expression level (75th percentile) in other cortex layers in the human MTG 10x Visium dataset (10x-hMTG-2-3).
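
The matching step behind this figure, ranking real genes by how closely their SGEs resemble the SGE of a simulated pseudo-gene, can be sketched as follows. This is only an illustration under stated assumptions: the min-max scaling stands in for the paper's "scaled" cosine similarity, whose exact definition is not given in this section, and the gene names and three-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(pseudo_sge, gene_sges):
    """Return (gene, scaled similarity) pairs, most similar first.

    Similarities are min-max scaled to [0, 1]; this scaling is an
    assumption made for the sketch.
    """
    sims = {g: cosine_similarity(pseudo_sge, z) for g, z in gene_sges.items()}
    lo, hi = min(sims.values()), max(sims.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all values are equal
    scaled = {g: (s - lo) / span for g, s in sims.items()}
    return sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SGE of the pseudo-gene and of three real genes.
pseudo = [0.9, 0.1, 0.4]
genes = {
    "gene_a": [0.8, 0.2, 0.5],
    "gene_b": [-0.7, 0.6, 0.1],
    "gene_c": [0.1, 0.9, 0.2],
}
ranking = rank_by_similarity(pseudo, genes)
```

Genes at the top of `ranking` would be the candidates whose spatial maps are then inspected against the designated pattern.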

ways in many cancers (26). We define strategies for identifying disease-related genes as either reference-based or reference-free. The former compares spatial gene expressions in diseased versus healthy tissues to pinpoint differences, while the latter identifies genes exhibiting specific expression patterns within putative pathogenic regions independently of healthy tissue expression benchmarks. Herein, we detail the use of SpaCEX and SGEs in both strategies.

Reference-based. In ST, directly comparing gene expression patterns between conditions is difficult due to variations in tissue slice preparations, technical artifacts, and spatial heterogeneity across slices. Nevertheless, SGEs, which are context-aware embeddings capturing fundamental gene semantics and whose value discrepancies across samples/conditions are reconcilable by SAN-mediated gene alignment as shown in Fig. 4A, allow for the detection of disease-associated alterations in spatial gene expressions and relationships.

To verify this, we generate SGEs from two healthy brain tissue datasets (10x-hMTG-1-1 and 10x-hMTG-18-64) and one AD brain tissue dataset (10x-hMTG-2-3). SAN described in the preceding section is used to align SGE pairs of identical genes from two different datasets. The 2177 housekeeping genes are randomly split into an “anchor” set of 2051 genes for SAN training and a “non-anchor” set of 126 genes reserved for testing. The trained SAN is applied to align SGE pairs from both the non-anchor housekeeping gene set and a set of 42 AD-associated genes reported by previous studies (SI Appendix, Table S3). When aligning SGEs between an AD and a healthy dataset (i.e., 10x-hMTG-1-1), we find that AD-associated genes demonstrate significantly greater PCA distances (Fig. 5A Top) and scaled cosine dissimilarities (Fig. 5B Top) compared to the non-anchor housekeeping genes. In contrast, alignment of SGEs between the two healthy datasets does not show such disparities, either in PCA distances (Fig. 5A Bottom) or in scaled cosine dissimilarity (Fig. 5B Bottom), aligning with expectations that biological semantics of AD-related genes should remain unchanged in healthy conditions. These findings imply that differences in SGEs between healthy and diseased states could signal changes in gene functions associated with the disease. Therefore, genes can be prioritized based on the degree of their SGE dissimilarities, furnishing insights for the identification of disease-associated genes.

SGEs can also provide valuable clues to altered gene crosstalk in disease. We denote the Pearson correlations between SGEs of gene pairs from a healthy brain tissue dataset (10x-hMTG-2-5) as $\rho_{health}$, and from an Alzheimer’s Disease (AD) dataset (10x-hMTG-2-3) as $\rho_{AD}$. Fig. 5C reveals that the absolute difference between these correlations, $\delta\rho = |\rho_{health} - \rho_{AD}|$, signals potential alterations in gene crosstalk, as $\delta\rho$ values are notably higher when at least one gene in the pair is AD-associated compared to pairs of housekeeping genes. This observation is bolstered by a statistically significant increase (P-value=5.78e-06) in correlations between gene pairs that participate in the same AD-related pathways (Fig. 5D Top) in the AD dataset. In contrast, no such shifts in correlation are detected between the two healthy datasets (Fig. 5D Bottom), in line with the expectation that gene relationships in healthy conditions should remain largely stable. Therefore, substantial changes in SGE correlations are indicative of disease-related alterations in gene interactions.

Reference-free. Our reference-free approach involves employing our innovative method, SpaCEX-SPS, to create a pseudo-gene with designated spatial expression patterns (see “SpaCEX-SPS” in Methods), subsequently converted into an SGE. Based on the scaled cosine similarities between the SGEs of real genes and the pseudo-gene, we can pinpoint genes whose spatial expression patterns correspond to the predefined one. Our method is tested on two datasets: a human breast cancer dataset (10x-hBC) and an AD MTG dataset (10x-hMTG-2-3). In the SpaCEX-SPS simulation, “high”-level of gene expression is defined as the 95th percentile of average expressions across all genes, “medium”-level the 75th percentile, and “low”-level the 35th percentile. For the 10x-hBC dataset, we aim to discover genes with high expression within tumor cores

[Figure 7. Panel A: SpaCEX-ETC workflow (10x Visium SGEs, finetuning, generator with encoder, decoder, projection and memory bank, discriminator, real vs. generated SeqFISH). Panel B: spatial maps of Wdr5, Zfp444 and Wnt3a for 10x Visium, Original, SpaCEX-ETC, SpaOTsc, SpaGE and Tangram (color scale from low to high; 10x-mEmb and real sqf-mEmb). Scatterplot axes: Moran’s I of generated sqf-mEmb genes via SpaCEX (x) vs. Moran’s I of real sqf-mEmb genes (y), R = 0.724. Clustering maps: Ground Truth; Original (330 genes, ARI 0.361); enhanced via SpaCEX-ETC (660 genes, ARI 0.442), SpaOTsc (660 genes, ARI 0.226), SpaGE (660 genes, ARI 0.383) and Tangram (660 genes, ARI 0.268).]

Fig. 7. SGE-based enhancement of transcriptomic coverage of FISH-based ST with SpaCEX-ETC. (A) Workflow of SpaCEX-ETC. Initially, genes from an ST dataset with full transcriptomic coverage are encoded into SGEs. In the training phase, SGEs of the genes present in the target FISH-based ST dataset are fed into SpaCEX-ETC’s generator to regenerate their original spatial gene expressions. The generator consists of three components: a multilayer perceptron (MLP)-based encoder, an MLP-based decoder, and a memory bank that ensures the training stability. Following this, a discriminator is trained to distinguish between the original and regenerated genes. Meanwhile, the loss gradients from the generator are backpropagated to update the parameters of the MAE’s encoder within SpaCEX, so that SGEs are more adapted to the specific semantics inherent in the FISH-based ST dataset. Once trained, SpaCEX-ETC can generate genes that are initially absent in the dataset, using their finetuned SGEs. (B) Three genes (Wdr5, Zfp444, and Wnt3a), which are covered by both the SeqFISH (sqf-mEmb) and the 10x-mEmb datasets and differ in their expression levels, are used to evaluate SpaCEX-ETC. These genes’ SGEs are initially obtained from the 10x-mEmb dataset and used to regenerate their expression profiles in the sqf-mEmb dataset. From left to right, we show the three genes’ original spatial expression profiles in the 10x-mEmb dataset, in the sqf-mEmb dataset, alongside their regenerated expression profiles by SpaCEX-ETC and the three benchmark methods. (C) Displayed in the box plot are the Pearson correlation coefficients (PCCs) between the original spatial expression profiles of the 330 genes and those regenerated by SpaCEX-ETC and the benchmark methods. (D) The scatterplot shows spatial variabilities of the 330 genes and their respective regenerated counterparts, with spatial variability quantified using Moran’s I index. The red line in the scatterplot represents a fitted regression line with R=0.724. (E) The transcriptomic coverage of sqf-mEmb is doubled by using SpaCEX-ETC and the benchmark methods to generate an additional 330 genes absent in the original dataset. The leftmost spatial map shows the ground truth tissue domain annotations, the second map shows the spatial clustering results of SpaGCN using the unexpanded gene set, while subsequent maps showcase the spatial clustering results using the gene set augmented by SpaCEX-ETC and the benchmark methods, respectively. The clustering accuracy (i.e., ARI) is shown below each method name.
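
Since panel (D) quantifies spatial variability with Moran’s I, a small self-contained sketch of that index may help. It assumes binary neighbor weights (spots within a fixed radius count as neighbors), which is one common choice and not necessarily the weighting used by the authors; the 4 x 4 grid and the expression values are hypothetical.

```python
def morans_i(values, coords, radius=1.0):
    """Moran's I with binary weights: w_ij = 1 if spots i and j lie within `radius`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            if dx * dx + dy * dy <= radius * radius:
                cross += dev[i] * dev[j]
                weight_sum += 1.0
    total_var = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / total_var)


# Hypothetical 4 x 4 grid of spots.
coords = [(x, y) for x in range(4) for y in range(4)]
# Spatially structured pattern: high on one half of the grid, low on the other.
striped = [1.0 if x < 2 else 0.0 for x, y in coords]
# Alternating pattern: every spot differs from all of its direct neighbors.
checker = [float((x + y) % 2) for x, y in coords]

i_striped = morans_i(striped, coords)
i_checker = morans_i(checker, coords)
```

A spatially coherent pattern such as `striped` gives a clearly positive index, while the alternating `checker` pattern gives a strongly negative one, so genes can be compared by this score before and after regeneration.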
(IDC), medium expression within tumor edges, and low expression elsewhere. Using SpaCEX-SPS, we simulate a gene mirroring these expression patterns. As shown in Fig. 6A, we successfully identify cancer-associated genes such as BRCA1, BRCA2, PALB2, and TCEAL4, all exhibiting the sought-after expression patterns. Similarly, for the 10x-hMTG-2-3 dataset, we pinpoint AD-associated genes like TREM2, PSEN1, BIN1, and APOE, characterized by high expression within the white matter (WM) layer and medium expression within other cortex layers (Fig. 6B). Particularly, the upregulation of BIN1 within the WM of AD brain has been previously reported (27). These outcomes collectively demonstrate the efficacy of utilizing SGEs and SpaCEX-SPS for discerning genes with disease-specific spatial expression profiles. Moreover, our methodology extends beyond disease-associated gene identification to encompass any genes with designated spatial expression patterns.

SGE-based Enhancement of the Transcriptomic Coverage in FISH-based ST. A significant challenge in ST is achieving both full transcriptomic coverage and high resolution. Existing methods like Tangram (28), SpaGE (29), and SpaOTsc (30) augment transcriptomic coverage in high-resolution, FISH-based ST data, but they essentially generate “pseudo-ST” data since they focus on mapping single cells in scRNA-seq onto spatial locations in ST to compensate for genes not profiled in ST with scRNA-seq data, rather than the de novo generation of ST data with inherent spatial semantics. These methods are limited by their underutilization of spatial information in the mapping process, as seen in Tangram and SpaGE, and the introduction of systematic biases from discrepancies between scRNA-seq and ST data such as inconsistencies in data scales. To address this challenge, we introduce SpaCEX-enhanced-transcriptomics-coverage (SpaCEX-ETC), an innovative SGE-based Generative Adversarial Network (GAN) model as detailed in the “SpaCEX-ETC” section in Methods and Fig. 7A. SpaCEX-ETC is predicated on the notion that gene relational semantics should remain largely consistent across different ST data types for the same tissue type. Consequently, the spatial expression profiles of uncovered genes can be extrapolated from those of covered genes, drawing on their semantic relationships inherent in SGEs derived from a full transcriptomic coverage ST dataset (e.g., 10x Visium). We evaluate the effectiveness of SpaCEX-ETC by reproducing the 330 covered genes from a mouse embryo SeqFISH dataset (sqf-mEmb), guided by the SGEs from a mouse embryo 10x Visium dataset (10x-mEmb).
We include Tangram, SpaGE, and SpaOTsc as benchmarks (SI Appendix, Table S2) to reproduce the same gene set from scRNA-seq data. To ease a direct comparison between the original and reproduced genes, we visualize the spatial maps of three genes of different expression levels, including Wdr5, Zfp444, and Wnt3a. Fig. 7B illustrates that SpaCEX-ETC surpasses benchmark methods in accurately reproducing genes, achieving high fidelity in both spatial expression patterns and data scales.

Additionally, Fig. 7C quantitatively demonstrates SpaCEX-ETC’s superiority in generating genes that closely correlate with their actual values. This concordance is further supported by the highly correlated spatial variability (R=0.724) between authentic and SpaCEX-ETC-generated gene expressions, as depicted in Fig. 7D. Finally, we select an additional 330 genes imputed by SpaCEX-ETC, deemed most real-like by the GAN discriminator, to double the transcriptomic coverage of the sqf-mEmb dataset. This augmented gene set undergoes spatial clustering to evaluate the imputed genes’ quality and their analytical utility. Fig. 7E shows that spatial clustering with SpaCEX-ETC-imputed genes achieves a significantly higher accuracy compared to either the original dataset or genes imputed by the benchmark methods. Collectively, these results highlight the efficacy of SGEs in enhancing the transcriptomic coverage of FISH-based ST via a generative approach.

SGE-based SVG Detection. In this section, we introduce a novel computational method, SpaCEX-SVG, that leverages SGEs to detect SVGs from ST datasets, as detailed in the “SpaCEX-SVG” section in Methods. Essentially, SpaCEX-SVG calculates a spatial variability score for each gene based on the similarity of its SGE with those of simulated spatially homogeneous genes. SVGs are then ranked and selected according to these scores. For this assessment, we select the top 3000 SVGs from both the 10x-hDLPFC-151507 and 10x-hBC datasets using SpaCEX-SVG and two benchmark methods: SpatialDE and SPARK-X (31) (SI Appendix, Table S2). Both Moran’s I and Geary’s C indices show that the SVGs selected from the 10x-hDLPFC-151507 dataset by SpaCEX-SVG are more spatially variable than those selected by the benchmark methods (SI Appendix, Fig. S3A). Moreover, one well-documented drawback of the benchmark methods is their inability to effectively rank SVGs based on their spatial variability scores (P- or Q-values) (32). In contrast, SpaCEX-SVG is more sensitive and effective in distinguishing levels of spatial variability among SVGs. This is exemplified in SI Appendix, Fig. S3B, where the top four SVGs selected by SpaCEX-SVG exhibit more noticeable spatial variabilities than those selected by the benchmark methods. A parallel analysis conducted on the 10x-hBC dataset yields similar results (SI Appendix, Fig. S4). These results altogether demonstrate the potential of SGEs for detecting and ranking SVGs in ST datasets.

SGE-improved Spatial Clustering. In this section, we propose SpaCEX-Improved-Spatial-Clustering (SpaCEX-ISC), a novel SGE-based computational method for enhancing spatial clustering, as detailed in the “SpaCEX-ISC” section in Methods. The rationale behind SpaCEX-ISC is that spatial clustering can be effectively improved by optimizing the informational efficiency of spatial transcriptomic data, which involves minimizing redundant information and retaining the most discriminative information presented by feature genes (33). Redundant information among genes can be revealed through their SGE similarity matrix and reduced by only selecting and retaining the most discriminative genes within groups of highly similar ones. The most discriminative genes are those with the highest spatial variability scores, as determined by SpaCEX-SVG. Subsequently, any spatial clustering algorithm, which is SpaGCN in our case, can work with this information-efficient set of feature genes to achieve improved performance. We select two state-of-the-art spatial clustering methods, GraphST and SpaGCN, alongside a baseline method, Leiden, as benchmarks for comparison (SI Appendix, Table S2). In a comprehensive evaluation across twelve 10x-hDLPFC datasets, SpaCEX-ISC consistently outperforms the benchmark methods, as evidenced by its highest Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) scores (SI Appendix, Fig. S5A). SpaCEX-ISC’s superiority over the benchmark methods is further illustrated by its more accurately recovered annotated anatomical cortex layers in the spatial maps of the 10x-hDLPFC-151676 and 10x-hDLPFC-151669 datasets (SI Appendix, Fig. S5B). Finally, SpaCEX-ISC achieves optimal performance across six 10x-hDLPFC datasets when approximately 50%-60% redundant information is excluded (SI Appendix, Fig. S5C).

Discussion

In ST, genomic contexts unveiled as groups of spatially cofunctional and co-expressed genes are instrumental for generating semantically rich gene embeddings, paralleling the concept of word vectorization in natural language processing. Existing foundational models designed to learn gene embeddings from microarray or scRNA-seq data typically rely on massive pretraining, resulting in a weakened sensitivity to context-specific nuances, and fall short of incorporating spatial expression information into the gene vectorization process. In this work, we propose SpaCEX, a novel context-aware, self-supervised learning model that exploits spatial genomic contexts in ST to derive distributed gene representations imbued with spatial gene functional and relational semantics.

We comprehensively evaluate SpaCEX across ST datasets of various tissues, species, and platforms in aspects regarding the model’s legitimacy and its utility in downstream, task-specific applications. To establish the methodological soundness, we initially demonstrate SpaCEX’s adeptness at identifying spatial genomic contexts as groups of cofunctional and co-expressed genes. Subsequent analyses confirm SpaCEX’s ability to generate SGEs that encapsulate essential gene semantics from these genomic contexts, with SGE correlations reflecting gene familial and ontological ties. Notably, SGEs prioritize biological variations over technical noise, enhancing their utility in cross-sample gene alignment, as demonstrated by the accurate alignment of SGEs of functionally stable housekeeping genes. For task-specific applications, we propose a suite of innovative SGE-based methods for identifying disease-associated genes and gene crosstalk, pinpointing genes with designated spatial expression patterns, enhancing the transcriptomic coverage of FISH-based ST, detecting SVGs, and improving spatial clustering. These methods either pioneer solutions to existing problems or markedly surpass established benchmarks.

SpaCEX’s remarkable performance is rooted in four aspects: the effective integration of spatial expression patterns
into SGEs via an image-focused, self-supervised MIM approach; learning relational semantic structures among genes through a flexible and robust SMM-based clustering; a novel combination of MIM with contrastive learning for iterative joint optimization, enhancing the perceptibility and discriminability of SGEs; and the resilience of SGEs to technical noise, ensuring reliable gene alignment across conditions. Overall, SpaCEX not only facilitates the discovery of cofunctional gene modules, like gene networks and pathways, but also generates biologically significant gene embeddings, laying the foundation for a suite of downstream task-specific tools. Thus, SpaCEX promises to contribute to the development of a genomic “language”-based methodological ecosystem. Future improvements for SpaCEX may include enriching the informativeness of SGEs through a multimodal learning approach, integrating diverse gene relational semantics from additional datasets like gene co-expression patterns across cell types observed in scRNA-seq.

Methods

Data Quality Control and Preprocessing. We conform to the conventional procedure for preprocessing ST data, as implemented in the SCANPY package (34). Specifically, we first remove mitochondrial and External RNA Controls Consortium (ERCC) spike-in genes. Then, genes detected in fewer than 10 spots are excluded. To preserve the spatial data integrity, we do not perform quality control on spatial spots. Finally, the gene expression counts are normalized by library size, followed by log-transformation.

Representation Learning of Spatial Gene Expression Maps. As the spatial gene maps can be visualized as gray-scale images, we devise an adapted version of MAE to transform visual features of gene images into embeddings in a latent feature space. A given gene image is first segmented into regular non-overlapping patches, from which a subset of patches is randomly selected, masked and discarded. The remaining patches are fed into the MAE encoder to generate visible patch embeddings. Given that a gene image is gray-scale and often sparse, we use a higher masking ratio (80%) and a lightweight ViT encoder with four transformer blocks and four attention heads rather than the masking ratio (75%) and the ViT-L encoder in the original paper. The visible patch embeddings and trainable tokens of masked patches are then input into the MAE decoder to reconstruct the gene image. We replace the transformer architecture of the original MAE decoder with a convolutional autodecoder to enhance the performance in our case. A more important modification is the addition of a nonlinear projection head to the end of the encoder. This projection head consists of a linear layer, a batch normalization (BN) layer and a Scaled Exponential Linear Unit (SELU) activation layer. Owing to the BN layer and the self-normalizing property of the SELU function, the gene embeddings output from the encoder more closely conform to the mixed Student’s t distribution.

SMM-based Modeling. As Stuhlsatz et al. (35) have demonstrated the capability of deep image encoders in learning visual representations that follow a multivariate Student’s t-distribution, we utilize an SMM to model the distributions of SGEs in a latent feature space, with individual components of the SMM corresponding to distinct gene clusters. The key strength of this modeling is its robustness to outliers, which are assigned reduced weights during the estimation of model parameters. Specifically, let $Z \in \mathbb{R}^{N \times D}$ denote SGEs, where $N$ is the total number of genes and $D$ is the dimension of the feature space. We model the distribution of $Z$ as an SMM parameterized by $\Theta = \{\Theta_k : \pi_k, \mu_k, \Sigma_k, v_k, \forall k \in [1, K]\}$. Here, $K$ represents the total number of gene clusters, while $\pi_k, \mu_k, \Sigma_k, v_k$ represent the weight, mean, covariance matrix and degrees of freedom of the $k$-th component, respectively. The density function of $z_i$ is then formulated as follows:

$$p(z_i \mid \Theta) = \sum_{k=1}^{K} \pi_k \, \Phi(z_i \mid \mu_k, \Sigma_k, v_k) \tag{1}$$

We utilize an Expectation-Maximization (EM) algorithm to iteratively estimate parameters of the SMM. Given a multitude of parameters in the model, a conventional MLE-based EM tends to overfit the data. To mitigate this problem, we introduce priors on the model parameters for regularization: we use a conjugate Dirichlet prior on $\Pi$ and a normal-inverse Wishart (NIW) prior on $\mu_k, \Sigma_k$:

$$\Pi \sim \mathrm{Dir}(\Pi \mid \alpha_0), \qquad \mu_k, \Sigma_k \sim \mathrm{NIW}(\mu_k, \Sigma_k \mid m_0, \kappa_0, S_0, \rho_0), \quad \forall k \in [1, K] \tag{2}$$

To simplify the EM algorithm, we rewrite the Student’s t distribution as a Gaussian scale mixture by introducing an “artificial” hidden variable $\zeta_{i,k}, \forall i \in [1, N], \forall k \in [1, K]$ that follows a Gamma distribution parameterized by $v_k$:

$$\Phi(z_i \mid \mu_k, \Sigma_k, v_k) = \int \mathcal{N}\!\left(z_i \,\Big|\, \mu_k, \frac{\Sigma_k}{\zeta_{i,k}}\right) \Gamma\!\left(\zeta_{i,k} \,\Big|\, \frac{v_k}{2}, \frac{v_k}{2}\right) d\zeta_{i,k} \tag{3}$$

We also introduce a missing variable $\xi_i$ to represent the component membership of $z_i$. Then the posterior complete data log likelihood can be written as:

$$\begin{aligned} \ell_c(\Theta) &= \log P(Z, \zeta, \xi \mid \Theta) \\ &= \sum_{i} \sum_{k} \left[ \mathbb{I}(\xi_i = k) \left( \log \pi_k + \log \Phi(z_i, \zeta_{i,k} \mid \mu_k, \Sigma_k, v_k) \right) \right] \\ &\quad + \log \mathrm{Dir}(\Pi \mid \alpha_0) + \sum_{k} \log \mathrm{NIW}(\mu_k, \Sigma_k \mid m_0, \kappa_0, S_0, \rho_0). \end{aligned} \tag{4}$$

In the $t$-th iteration of the E-step, the expected sufficient statistics $\xi_{i,k}^{(t)}$ and $\zeta_{i,k}^{(t)}$ are derived based on $\Theta^{(t-1)}$. In the subsequent M-step, $\Theta^{(t-1)}$ is updated to $\Theta^{(t)}$ by maximizing the auxiliary function $Q(\Theta, \Theta^{(t-1)}) = E(\ell_c(\Theta) \mid \Theta^{(t-1)})$. Note that $v_k^{(t)}$ is estimated via a Generalized EM (GEM) technique to speed up the calculation without harming its convergence to at least a local optimum. The two steps are alternately conducted until either convergence is achieved or a pre-specified maximum number of iterations is reached. Refer to SI Appendix, Supporting Text for details about the model inference.

Self-paced Pseudo-contrastive Optimization of SGEs. Two loss functions, $L_1$ and $L_2$, are calculated based on clustering results for updating parameters of both representation learning and the SMM through loss gradient backpropagation. This iterative process progressively improves the clustering-oriented image embeddings and clustering results. Upon completing
the inference of SMM parameters $\widetilde{\Theta}$ in each epoch, let $W$ and $\widehat{W}$ represent the parameters of the encoder and decoder of the representation learning model, respectively; an epoch-level loss $L_1$ is calculated for updating parameters of MAE:

$$L_1 = -L_{\ell\ell}(Z \mid \Theta) + \eta_1 \cdot L_{\ell ap} - \eta_2 \cdot L_{size}(Z \mid \Theta) + \eta_3 \cdot L_r(Z, \widehat{W}) \tag{5}$$

Here, $L_{\ell ap}$ is a Laplacian regularization term that promotes the similarities among image embeddings $Z$ to be consistent with a seeding image-image similarity matrix $S$, informing the initial training phase. The derivation of $S$ is detailed in SI Appendix, Supporting Text. $L_{\ell ap}$ is defined as follows:

$$L_{\ell ap} = \mathrm{Tr}\!\left( Z^{T} \left( I - D^{-\frac{1}{2}} S D^{-\frac{1}{2}} \right) Z \right), \tag{6}$$

where $D$ is the degree matrix of $S$, and $\eta$, initially set at 0.5, decays over the training course so that the influence of $S$ is gradually reduced. $L_{\ell\ell}$ represents the log likelihood of the embeddings given the estimated SMM parameters $\widetilde{\Theta}$:

$$L_{\ell\ell} = \sum_{i=1}^{N} \log \left[ \sum_{k} q_{i,k} \right], \tag{7}$$

$$q_{i,k} = \pi_k \, \Phi(z_i \mid \mu_k, \Sigma_k, v_k), \quad \forall i \in [1, N], \forall k \in [1, K]. \tag{8}$$

$L_{size}$ penalizes empty and tiny clusters, while exempting those whose size exceeds a predefined threshold $\upsilon$ so that image assignments are not overly uniform:

$$L_{size} = \sum_{k=1}^{K} -J_k \log J_k, \qquad J_k = \begin{cases} \dfrac{\sum_{i}^{N} q_{i,k}}{N}, & \text{if } J_k \le \upsilon \\ 1, & \text{otherwise} \end{cases} \tag{9}$$

$L_r$ represents the fidelity loss of the reconstructed gene image by the convolutional autodecoder, expressed as:

$$L_r = \sum_{i}^{N} \lVert x_i - \hat{x}_i \rVert_2^2 = \mathrm{Tr}\!\left( (X - \hat{X})(X - \hat{X})^{T} \right), \tag{10}$$

where $\hat{x}_i = f_{decoder}(\widehat{W}, z_i)$. This term supervises the training

and $p_{i,k}$ an auxiliary target distribution that boosts up high-confidence images. After this joint optimization, the training progresses to the next epoch, iterating until the end of the training process. The mathematical derivations of gradients of $L_1$ and $L_2$ with respect to $W$, $\widehat{W}$ and $\Theta$ are detailed in SI Appendix, Supporting Text.

SpaCEX-SPS. This section describes a generalized linear model (GLM)-based method, SpaCEX-spatial pattern simulator (SpaCEX-SPS), for simulating genes with specific spatial expression patterns in a dataset. It uses a negative binomial (NB) distribution to model gene expressions at various spots, incorporating mean and dispersion parameters that relate to the variance and squared coefficient of variation. The model estimates these parameters through regression analysis, predicting spatial expression levels at desired quantiles in specific tissue regions. By ranking and assigning expression levels based on the NB distribution, the method ensures that the simulated genes reflect the spatial structure inherent in the ST dataset, allowing for precise control over the expression pattern of simulated genes across different spatial regions. For further details, see the SI Appendix, Supporting Text.

SpaCEX-ETC. As shown in Fig. 7A, SpaCEX-ETC is a GAN-based model that uses an encoder to transform gene expression matrices into SGEs, which then serve as inputs for a generator. The generator, consisting of an encoder, a decoder, and a memory bank, reconstructs gene expression profiles using an attention mechanism and a continuously updated embedding queue. The discriminator, an MLP-based network, distinguishes between actual and generated gene expressions. The model adjusts the MAE encoder weights in SpaCEX’s Module I through adversarial and reconstruction losses, which enables the SGEs to adapt to the particular semantics inherent in the FISH-based ST dataset. This finetuning facilitates the generation of those genes uncovered in the FISH-based dataset with optimized fidelity. For additional details, refer to the SI Appendix, Supporting Text.
D

642

643 of encoder, guiding it to generate gene embeddings that pre-

SpaCEX-SVG. In our study, we first developed a method that 697
644 serves the local structural integrity of spatial gene expressions.
simulates spatially homogeneous genes using observed spatial 698
645 We set η1 =0.5, η2 =0.1, η3 =0.1. Note, the value of η1 decays
transcriptomics data, applying either Negative Binomial or 699
646 as the training progresses so that the impact of the seeding
Zero-Inflated Negative Binomial distributions. We estimate 700
647 matrix diminishes over the training course.
parameters directly from the data to simulate genes and gener- 701
648 Subsequently, within the same epoch, we utilize a batch-
ate SGEs for both real and simulated genes. These embeddings 702
649 level loss:
enable us to calculate spatial variability scores by comparing 703

650 L2 = Lc (Zb , Θ) + λ1 · Lr (Zb , W

c) + λ2 · Lℓap (Sb , Zb ) [11] real genes to their simulated counterparts using scaled cosine 704

dissimilarity. We then rank the genes by these scores to iden- 705

651 to update MAE and SMM parameters across successive tify those with notable spatial variations. Details provided in 706

652 batches. Here, Lr and Lℓap remains same as in Equations the SI Appendix, Supporting Text. 707

653 Equation (10) and Equation (6) except being calculated on the
654 batch-level. Lc boosts high-confidence images, incrementally SpaCEX-ISC. In our study, we employ the SpaCEX-ISC 708

655 grouping similar instances while separating dissimilar ones: method to optimize the analysis of spatial transcriptomics 709

data. This involves constructing a gene identity matrix G to 710

pi,j log pi,j ,

N K
XX distinguish genes across functional groups, and a similarity 711
Lc = KL(P|Q) = [12]
qi,j
656
matrix S to assess gene relationships based on SGEs. We then 712
i j
apply Shi-Malik spectral clustering on an adjacency matrix 713
657
qi,k / i qi,k 2 derived from S to organize genes into functionally coherent
P 714
qi,k
where qi,k = P , pi,k = P groups. Using SpaCEX-SVG, we calculate spatial variability
q2i,c / i qi,c
658 P 715
q
c i,c c scores for each gene to identify significant spatial expression 716

659 Here, qi,k is same as in Equation 8, qi,k represents the prob- variations. These scores are used to filter out redundant data, 717

660 ability of assigning i-th gene to the k-th SMM component, ensuring that only the most informative gene expressions are 718

Sun et al. PNAS | June 8, 2024 | vol. XXX | no. XX | 11

retained. The processed data are then fed into a graph neural network to generate spot embeddings for clustering, enhancing the clarity and utility of spatial gene expression analysis. For additional details, refer to the SI Appendix, Supporting Text.

Experimental Settings. Detailed experimental settings for the SpaCEX study are documented in the SI Appendix, Supporting Text. These include the methodologies for identifying groups of spatially co-expressed genes, protocols for enrichment analysis, techniques for cofunction analysis of intra-cluster genes, criteria for evaluating SpaCEX-generated gene embeddings, and the metrics used to assess the overall performance of SpaCEX.

Data and Code Availability

All data are available in the main text or the supplementary materials. The mouse hippocampus dataset (ssq-mHippo) can be downloaded from https://singlecell.broadinstitute.org/single_cell/study/SCP815/sensitive-spatial-genome-wide-expression-profiling-at-cellular-resolution#study-summary. The human dorsolateral prefrontal cortex datasets (10x-hDLPFC) are available through the spatialLIBD package (36) at http://spatial.libd.org/spatialLIBD. The human breast cancer dataset (10x-hBC) can be obtained from https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Breast_Cancer_Block_A_Section_1. The three human MTG datasets, including two healthy datasets (10x-hMTG-1-1 and 10x-hMTG-18-64) and an AD dataset (10x-hMTG-2-3), are available in the GEO database (GSE220442) (37). The mouse embryo dataset based on 10x Visium (10x-mEmb) can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE178636. The mouse embryo dataset based on SeqFISH (sqf-mEmb) is obtainable at https://crukci.shinyapps.io/SpatialMouseAtlas/. Detailed descriptions of the datasets can be found in SI Appendix, Table S1. Moreover, we acquired 2,162 human housekeeping genes from the HRT Atlas (38) (https://housekeeping.unicamp.br) and 15 additional housekeeping genes from previous studies (39, 40). SpaCEX is publicly available at https://github.com/WLatSunLab/SpaCEX.

ACKNOWLEDGMENTS. The project is funded by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB38050100) to H.W. X.S. is supported by the Excellent Young Scientist Fund of Wuhan City (Grant No. 21129040740). We also thank Jiadi Lv, Daoli Wang, Suoya Han, Siyu Chen and Yuwei Hu for their help in plotting figures and for participating in discussions.

1. Du J, et al. (2019) Gene2vec: distributed representation of genes based on co-expression. BMC genomics 20:7–15.
2. Li Y, Keqi W, Wang G (2021) Evaluating disease similarity based on gene network reconstruction and representation. Bioinformatics 37(20):3579–3587.
3. Bazaga A, Leggate D, Weisser H (2020) Genome-wide investigation of gene-cancer associations for the prediction of novel therapeutic targets in oncology. Scientific reports 10(1):10787.
4. Moffitt JR, et al. (2018) Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362(6416):eaau5324.
5. Rao N, Clark S, Habern O (2020) Bridging genomics and tissue pathology: 10x genomics explores new frontiers with the visium spatial gene expression solution. Genetic Engineering & Biotechnology News 40(2):50–51.
6. Kleshchevnikov V, et al. (2022) Cell2location maps fine-grained cell types in spatial transcriptomics. Nature biotechnology 40(5):661–671.
7. Tanevski J, Flores ROR, Gabor A, Schapiro D, Saez-Rodriguez J (2022) Explainable multiview framework for dissecting spatial relationships from highly multiplexed data. Genome biology 23(1):97.
8. Chen S, et al. (2021) Spatially resolved transcriptomics reveals unique gene signatures associated with human temporal cortical architecture and alzheimer's pathology. BioRxiv pp. 2021–07.
9. Cui H, et al. (2024) scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature Methods pp. 1–11.
10. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
11. Hao M, et al. (2023) Large scale foundation model on single-cell transcriptomics. bioRxiv pp. 2023–05.
12. Yang F, et al. (2022) scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data. Nature Machine Intelligence 4(10):852–866.
13. Theodoris CV, et al. (2023) Transfer learning enables predictions in network biology. Nature 618(7965):616–624.
14. Boiarsky R, Singh NM, Buendia A, Getz G, Sontag D (2023) A deep dive into single-cell rna sequencing foundation models. bioRxiv pp. 2023–10.
15. He K, et al. (2022) Masked autoencoders are scalable vision learners in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009.
16. Maynard KR, et al. (2021) Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature neuroscience 24(3):425–436.
17. Stickels RR, et al. (2021) Highly sensitive spatial transcriptomics at near-cellular resolution with slide-seqv2. Nature biotechnology 39(3):313–319.
18. Song T, et al. (2022) Detecting spatially co-expressed gene clusters with functional coherence by graph-regularized convolutional neural network. Bioinformatics 38(5):1344–1352.
19. Dries R, et al. (2021) Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome biology 22:1–31.
20. Sun S, Zhu J, Zhou X (2020) Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nature methods 17(2):193–200.
21. Bergenstråhle J, Larsson L, Lundeberg J (2020) Seamless integration of image and molecular analysis for spatial transcriptomics workflows. BMC genomics 21:1–7.
22. Hao Y, et al. (2021) Integrated analysis of multimodal single-cell data. Cell 184(13):3573–3587.
23. Fabregat A, et al. (2018) The reactome pathway knowledgebase. Nucleic acids research 46(D1):D649–D655.
24. Parsana P, et al. (2019) Addressing confounding artifacts in reconstruction of gene co-expression networks. Genome biology 20(1):1–6.
25. Zhang L, Xia Y, Gui Y (2023) Neuronal apoe4 in alzheimer's disease and potential therapeutic targets. Frontiers in Aging Neuroscience 15:1199434.
26. Katoh M, Katoh M (2020) Precision medicine for human cancers with notch signaling dysregulation. International journal of molecular medicine 45(2):279–297.
27. De Rossi P, et al. (2016) Predominant expression of alzheimer's disease-associated bin1 in mature oligodendrocytes and localization to white matter tracts. Molecular neurodegeneration 11:1–21.
28. Biancalani T, et al. (2021) Deep learning and alignment of spatially resolved single-cell transcriptomes with tangram. Nature methods 18(11):1352–1362.
29. Abdelaal T, Mourragui S, Mahfouz A, Reinders MJ (2020) Spage: spatial gene enhancement using scrna-seq. Nucleic acids research 48(18):e107–e107.
30. Cang Z, Nie Q (2020) Inferring spatial and signaling relationships between cells from single cell transcriptomic data. Nature communications 11(1):2084.
31. Zhu J, Sun S, Zhou X (2021) Spark-x: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome biology 22(1):184.
32. Hu J, et al. (2021) Spagcn: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature methods 18(11):1342–1351.
33. Deng T, et al. (2023) A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell rna-seq analysis. Briefings in Bioinformatics 24(2):bbad042.
34. Wolf FA, Angerer P, Theis FJ (2018) Scanpy: large-scale single-cell gene expression data analysis. Genome biology 19:1–5.
35. Stuhlsatz A, Lippel J, Zielke T (2012) Feature extraction with deep neural networks by a generalized discriminant analysis. IEEE transactions on neural networks and learning systems 23(4):596–608.
36. Pardo B, et al. (2022) spatiallibd: an r/bioconductor package to visualize spatially-resolved transcriptomics data. BMC genomics 23(1):434.
37. Chen S, et al. (2022) Spatially resolved transcriptomics reveals genes associated with the vulnerability of middle temporal gyrus in alzheimer's disease. Acta Neuropathologica Communications 10(1):188.
38. Hounkpe BW, Chenou F, de Lima F, De Paula EV (2021) Hrt atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive rna-seq datasets. Nucleic acids research 49(D1):D947–D955.
39. de Jonge HJ, et al. (2007) Evidence based selection of housekeeping genes. PloS one 2(9):e898.
40. Zhang X, Ding L, Sandford AJ (2005) Selection of reference genes for gene expression studies in human neutrophils by real-time pcr. BMC molecular biology 6:1–7.
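As an illustration of two loss terms described in the Methods, the Laplacian regularizer of Equation (6) and the clustering loss of Equation (12), the following NumPy sketch may help. It is a minimal example under stated assumptions, not the SpaCEX implementation: the function names, the dense-matrix formulation, the small numerical guards, and the choice to measure the KL divergence against the row-normalized assignment scores are all assumptions made here for illustration.

```python
import numpy as np


def laplacian_regularizer(Z, S):
    """Tr(Z^T (I - D^{-1/2} S D^{-1/2}) Z) for embeddings Z (N x d), similarity S (N x N)."""
    degrees = S.sum(axis=1)
    # Guard against zero-degree images (an assumption of this sketch).
    inv_sqrt_deg = 1.0 / np.sqrt(np.maximum(degrees, 1e-12))
    normalized_S = inv_sqrt_deg[:, None] * S * inv_sqrt_deg[None, :]
    laplacian = np.eye(S.shape[0]) - normalized_S
    return float(np.trace(Z.T @ laplacian @ Z))


def target_distribution(q):
    """Row-normalize component scores q (N x K), then sharpen them into P."""
    q_norm = q / q.sum(axis=1, keepdims=True)
    # Square each assignment and divide by the soft size of its component.
    weights = q_norm ** 2 / q_norm.sum(axis=0, keepdims=True)
    p = weights / weights.sum(axis=1, keepdims=True)
    return q_norm, p


def clustering_loss(q, eps=1e-12):
    """KL divergence between the target P and the normalized assignments."""
    q_norm, p = target_distribution(q)
    return float(np.sum(p * np.log((p + eps) / (q_norm + eps))))
```

Squaring the normalized scores pushes confident assignments closer to one, while dividing by the per-component totals keeps large components from dominating the target. The regularizer is small when images that are similar under S have nearby degree-scaled embeddings.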

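SpaCEX-SPS and SpaCEX-SVG both simulate gene counts from a negative binomial distribution described by a mean and a dispersion. The sketch below shows one way such counts could be drawn with NumPy. It assumes the common parameterization Var = mean + dispersion * mean^2, so that the squared coefficient of variation is 1/mean + dispersion; the function names and this parameterization are assumptions for illustration, and the regression step that estimates the parameters is not shown.

```python
import numpy as np


def nb_numpy_params(mean, dispersion):
    """Map (mean, dispersion) to the (n, p) arguments NumPy expects.

    Assumes Var = mean + dispersion * mean**2, with mean > 0 and dispersion > 0.
    """
    n = 1.0 / dispersion
    p = n / (n + mean)
    return n, p


def simulate_counts(spot_means, dispersion, seed=0):
    """Draw one NB count per spot, given per-spot means and a shared dispersion."""
    rng = np.random.default_rng(seed)
    n, p = nb_numpy_params(np.asarray(spot_means, dtype=float), dispersion)
    return rng.negative_binomial(n, p)
```

For example, `simulate_counts([1.0, 5.0, 20.0], dispersion=0.5)` returns three non-negative integer counts, one per spot, with larger expected values at spots with larger means.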
