Learning Context-Aware Distributed Gene Representa
Distributed gene representations are pivotal in data-driven genomic research, offering a structured way to understand the complexities of genomic data and providing a foundation for various data analysis tasks. Current gene representation learning methods demand costly pretraining on heterogeneous transcriptomic corpora, making them less approachable and prone to over-generalization. For spatial transcriptomics (ST), there is a plethora of methods for learning spot embeddings but a serious lack of methods for generating gene embeddings from spatial gene profiles. In response, we present SpaCEX, a pioneering, cost-effective self-supervised learning model that generates gene embeddings from ST data by exploiting spatial genomic “contexts” identified as spatially co-expressed gene groups. SpaCEX-generated gene embeddings (SGEs) feature context-awareness, rich semantics, and robustness to cross-sample technical artifacts. Extensive real data analyses reveal the biological relevance of SpaCEX-identified genomic contexts and validate the functional and relational semantics of SGEs. We further develop a suite of SGE-based computational methods for a range of key downstream objectives: identifying disease-associated genes and gene-gene interactions, pinpointing genes with designated spatial expression patterns, enhancing transcriptomic coverage of FISH-based ST, detecting spatially variable genes, and [...]

spatial transcriptomics | gene embeddings | genomic contexts | self-supervised learning

Spatial transcriptomics enables the identification of spatial gene relationships within tissues, providing semantically rich genomic “contexts” for understanding functional interconnections among genes. SpaCEX marks the first endeavor to effectively harness these contexts to yield biologically relevant distributed gene representations. These representations serve as a powerful tool to greatly facilitate the exploration of the genetic mechanisms behind phenotypes and diseases, as exemplified by their utility in key downstream analytical tasks in biomedical research, including identifying disease-associated genes and gene interactions, expanding in silico the transcriptomic coverage of low-throughput, high-resolution ST technologies, pinpointing diverse spatial gene expression patterns (co-expression, spatially variable patterns, and patterns with specific expression levels across tissue domains), and enhancing tissue domain discovery.

X.S. and H.W. conceived and supervised the study. X.S. derived the model and developed the framework. X.S., H.W., and W.L. wrote and revised the manuscript. X.S., Y.X., W.L., and Z.W. implemented the framework. Y.X., W.L., M.H., and J.C. conducted the experiments. X.S., Y.X., M.H., and Z.W. conducted the analyses. Y.X., W.L., X.S., M.H., and J.C. collected the results and plotted the figures. All authors approved the manuscript.
The authors declare no competing interests.
1 To whom correspondence may be addressed. E-mail: [email protected]; [email protected]

Distributed gene representations embed the multifaceted nature of genes within a high-dimensional space, offering profound insights into the complex mechanisms of gene expression, regulation, and interaction, and paving the way for leveraging machine learning techniques in advancing biomedical research (1), disease diagnosis (2), and the discovery of therapeutic targets (3) with unprecedented precision and efficiency.

Spatial transcriptomics (ST), including high-resolution in situ hybridization-based (e.g., SeqFISH (4)) and high-throughput in situ capturing-based (e.g., 10x Visium (5)) technologies, enables the profiling of spatial gene expression in heterogeneous tissues, providing unprecedented opportunities to characterize the spatial distribution of cell types (6) and to delineate spatial tissue organization (5) and gene-gene interactions (7), among others. ST can also reveal spatial genomic “contexts” formed by genes that are cofunctional in the same biological processes and pathways, given the similarity of these genes in spatial expression patterns within tissues. Such genomic contexts not only provide insights into the molecular mechanisms underlying biological processes and disease development (8), but also allow the learning of distributed gene representations, resembling the learning of word representations from word contexts in linguistic models (1, 9, 10). These gene embeddings provide a foundation for quantitatively characterizing context-specific gene functions and interactions from a spatial perspective, facilitating various analytical endeavors where insights into spatial genetic mechanisms are critical.

However, to the best of our knowledge, there is currently no method for learning gene embeddings from ST data, owing to challenges in effectively identifying spatial genomic contexts and encoding spatial gene expression patterns. While recent works, including Gene2vec (1), scGPT (9), scFoundation (11), scBERT (12), and Geneformer (13), have been developed to learn gene embeddings from atlas-scale microarray or scRNA-seq data, they do not extend to ST, overlooking crucial spatial gene expression information. Moreover, it has been observed that the extensive pretraining of these models on massive data corpora offers only marginal benefits when fine-tuning for downstream tasks (14). This is probably due to the irreconcilable heterogeneities in pretraining data collected across a variety of
[Figure 1. Panel A: spatially co-expressed genes form a genomic “context” from which distributed gene representations are learned. Panel B: SpaCEX workflow. Module I: patchify, mask, patch embedding, mask token, reconstructed patch. Module II: SMM inference via MAP-EM. Module III: successive refinement of the target distribution P by minimizing KL(P || Q). Downstream tasks I-VII: enhancing transcriptomic coverage of FISH-based ST (M genes to M̄ genes, M̄ ≫ M), cross-sample gene alignment, identifying disease-associated genes, identifying disease-associated gene crosstalk, identifying genes with specific spatial expression patterns, detecting SVGs, and improving spatial clustering.]
Fig. 1. Overview of distributed gene representation learning with SpaCEX. (A) Outline of learning distributed gene representations from spatial genomic contexts. ST data reveal spatial genomic contexts comprising genes that are cofunctional in gene pathways and networks, since such genes tend to exhibit similar expression patterns across tissue space. By leveraging these genomic contexts, distributed gene representations can be learned. These representations not only encapsulate the spatial functional and relational semantics of genes but are also instrumental in facilitating various downstream task-specific objectives. (B) Workflow of SpaCEX. It comprises three modules. In Module I, SpaCEX employs an adapted masked autoencoder (MAE) to learn the representations of gene images, which enhances the local-context perceptibility of gene embeddings. Module II models gene embeddings using a Student’s t mixture model, with parameters estimated via a MAP-EM algorithm. This module aims to maximize the likelihood of the entire dataset. In Module III, gene embeddings are refined through a self-paced pretext task that aims to identify genomic contexts via iterative pseudo-contrastive learning. Together, Modules II and III constitute a single training epoch, in which the discriminability of gene embeddings is enhanced. The training process continues until either the change in gene assignments falls below a threshold or a predetermined number of epochs is reached. Upon training completion, SpaCEX-generated gene embeddings (SGEs) can be utilized in downstream analytical tasks, such as i) enhancing transcriptomic coverage of FISH-based ST, ii) cross-sample gene alignment, iii) identifying disease-associated genes, iv) identifying disease-associated gene crosstalk, v) pinpointing genes with designated spatial expression patterns, vi) detecting SVGs, and vii) improving spatial clustering.
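The masking step of Module I can be illustrated with a minimal sketch: a toy gene image is split into non-overlapping patches and a random subset is hidden, as a masked autoencoder would do before reconstructing the hidden patches. The 8x8 image, the patch size of 4, and the 75% mask ratio are illustrative assumptions rather than SpaCEX settings, and `patchify` and `random_mask` are hypothetical helper names.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) gene image into non-overlapping (patch x patch) tiles."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tiles = image.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch)  # (num_patches, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean vector; True marks patches hidden from the encoder."""
    n_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=n_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
gene_image = rng.random((8, 8))            # toy spatial expression map (assumed size)
patches = patchify(gene_image, patch=4)    # 4 patches of 16 pixels each
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[~mask]                   # only these would reach the encoder
```

In a full model the encoder would embed `visible`, and a decoder would be trained to reconstruct the masked patches; that part is omitted here.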
technical and biological conditions, which hinder the model’s ability to grasp context-specific gene embedding nuances. In addition, these methods, characterized by their tremendous number of parameters, are sensitive to hyperparameter settings and parameter initializations (14). This [...] the updating of pretrained models impractical for the broader [...]

To bridge this gap, we develop a “context-aware, self-supervised learning on Spatially Co-EXpressed genes (SpaCEX)” model. SpaCEX utilizes the spatial genomic context inherent in ST data to generate gene embeddings that accurately represent the condition-specific spatial functional and relational semantics of genes. Technically, SpaCEX treats gene spatial expressions as images and leverages a masked-image model (MIM), which excels in extracting local-context perceptible and holistic visual features (15), to yield initial gene embeddings. These embeddings are iteratively refined through a self-paced pretext task aimed at discerning genomic contexts by contrasting spatial expression patterns among genes, drawing genes with similar spatial expressions closer in the latent embedding space while distancing those with divergent patterns. This step enhances the discriminability of gene embeddings, facilitating their incorporation of intricate relational semantic structures among genes. The integration of MIM with contrastive learning equips SpaCEX with a superior capability to comprehend and translate spatial gene information, including both expression patterns and inter-gene relationships, into SpaCEX-generated [...] the validity of our method, we demonstrate SpaCEX’s ef[...] biological relevance. In addition, the functional and relational semantics of SGEs undergo rigorous validation. In particular, we demonstrate SGEs’ robustness to technical variations and their utility in gene alignment and comparison across samples. Secondly, we demonstrate the applicability of SpaCEX and SGEs in key downstream tasks by developing a suite of SGE-based computational methods for: i) the identification of disease-associated genes and gene crosstalk; ii) the de novo expansion of transcriptomic coverage of FISH-based ST data, addressing a longstanding challenge that has restricted the broader application of FISH-based ST data; iii) pinpointing genes with designated spatial expression patterns in tissues; iv) detecting spatially variable genes (SVGs); v) improving spatial clustering. Extensive real data analyses demonstrate that our methods either provide optimal solutions to challenges that have not been adequately addressed, e.g., the first three
[Figure 2. Panels A-D. Method legend: SpaCEX, CNN-PReg, Giotto, SPARK, STUtility. Panel B (10x-hBC): Manual Annotation, C1 (IDC), C2 (DCIS), C3 (Benign Stroma), C4 (None). Panel C x-axes: -log10 p-values for C1 (IDC), C2 (DCIS), C3 (Benign Stroma), and C4 (Control). Panel D: node color scale, -log p.adjust; node size scale, number of genes.]
Fig. 2. SpaCEX identifies clusters of spatially co-expressed genes within a biologically relevant genomic context. (A) Performance comparison of SpaCEX and four benchmark methods in grouping co-expressed genes in the 10x-hDLPFC-151673 and ssq-mHippo datasets. DB indices calculated from Pearson and Euclidean distances are used to measure the overall co-expression (left panel) and spatial coherence (right panel) of the gene clusters, respectively. In both cases, a lower DB index value indicates enhanced co-expression or spatial coherence. (B) Spatial expression patterns of SpaCEX-generated gene groups overlap with the cell type distributions in the 10x-hBC dataset. The leftmost panel displays the manually annotated distributions of ductal carcinoma in situ (DCIS, in yellow), invasive ductal carcinoma (IDC, in red), and benign stroma cells (in original color) in human breast cancer tissues. The right four panels display the aggregated spatial expression (module scores) of three SpaCEX-generated gene groups (C1-C3) and a control cluster (C4) consisting of randomly selected genes. The brightness in each panel is positively correlated with the level of aggregated gene expression. The name of the cell type whose distribution overlaps with the aggregated expression pattern of the gene group is indicated above each panel. (C) GO enrichment and cofunction analyses of genes within C1-C4. The dots represent the 20 most significantly enriched GOBPs in each gene group. The x-axis represents the negative logarithm of adjusted P-values of biological process enrichment significance, while the y-axis represents the ROC AUC scores of the 20 GOBPs in the gene cofunction analysis. Red indicates benign stroma-related GOBPs, green noninvasive cancer-related GOBPs, cyan invasive cancer-related GOBPs, and purple other GOBPs. (D) The connectivity between the nodes corresponding to the most significantly enriched GOBPs indicates their functional associations. The node color represents the GOBP’s enrichment significance (negative logarithm of adjusted P-value), with darker colors indicative of lower significance levels. Node size indicates the number of genes involved in the GOBP.
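The Davies-Bouldin (DB) index used in panel A can be sketched as follows for the Pearson-distance case. This is a minimal illustration under stated assumptions: cluster scatter is taken as the mean distance of member genes to the cluster's mean profile, Pearson distance is taken as one minus the Pearson correlation, and the toy sine/cosine profiles are invented for the example. The exact formulation used for SpaCEX is described in “Evaluation metrics” in Methods and may differ.

```python
import numpy as np

def pearson_distance(a, b):
    """1 - Pearson correlation between two expression profiles."""
    a = a - a.mean()
    b = b - b.mean()
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def davies_bouldin(X, labels, dist):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    ids = np.unique(labels)
    centroids = [X[labels == k].mean(axis=0) for k in ids]
    scatter = [np.mean([dist(x, c) for x in X[labels == k]])
               for k, c in zip(ids, centroids)]
    worst = []
    for i in range(len(ids)):
        ratios = [(scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                  for j in range(len(ids)) if j != i]
        worst.append(max(ratios))
    return float(np.mean(worst))

# Toy data (assumed): 10 genes following a sine pattern and 10 following a cosine
# pattern over 50 spots, each with small noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 50, endpoint=False)
genes = np.vstack([np.sin(t) + 0.05 * rng.normal(size=(10, 50)),
                   np.cos(t) + 0.05 * rng.normal(size=(10, 50))])
labels_true = np.repeat([0, 1], 10)    # groups genes by their true pattern
labels_mixed = np.arange(20) % 2       # mixes both patterns in each cluster

db_true = davies_bouldin(genes, labels_true, pearson_distance)
db_mixed = davies_bouldin(genes, labels_mixed, pearson_distance)
```

A grouping that keeps co-expressed genes together (`labels_true`) yields a lower index than one that mixes patterns (`labels_mixed`), which matches the reading that lower values indicate better co-expression.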
tasks, or outperform established benchmarks, as seen in tasks iv and v. These tasks exemplify how SGEs can be effectively employed to address various downstream task-specific objectives, demonstrating SpaCEX’s potential in developing a genomic “language”-based methodological ecosystem.

Results

[...] facilitate downstream task-specific objectives. As illustrated in Methods and Fig. 1B, SpaCEX mainly consists of three modules. In Module I, we leverage an adapted masked autoencoder (MAE) (15) to transform gene images into gene embeddings that follow a mixture of multivariate Student’s t distributions [...] in Methods). With this MIM, SpaCEX gains the local-context [...]
[Figure 3. Panels A-D. Panel A gene families: KRT-II, HLA-I, HLA-II. Heatmap legend: Interaction, Non Interaction.]
Fig. 3. Validating the relational semantics of SGEs. SGEs are derived from the 10x-hDLPFC-151676 dataset. (A) Hierarchical clustering based on the SGEs of the HLA-I, HLA-II, and KRT-II gene family members. Genes with similar SGEs are positioned in close proximity along the y-axis. The 64 dimensions of the SGEs are represented along the x-axis. Gene families are indicated in different colors on the y-axis. The red color intensity in the diagram positively correlates with the SGE values. (B) Reactome-based pathway enrichment analysis of gene clusters generated by Leiden at various resolutions, based on gene-gene similarity matrices computed from either the SGEs or the original gene expression profiles. The x-axis denotes the Leiden resolution, while the y-axis represents the average number of statistically significantly enriched gene pathways (or high-confidence “pathway hits”) across the gene clusters. Red and blue spots represent gene clusters derived from the SGEs and the original gene expression profiles, respectively. (C) The power of different types of gene embeddings to predict gene-gene interactions. Here, we showcase the mean accuracies (left panel) and ROC AUC scores (right panel) for Gene-Gene Interaction Predictor Neural Network (GGIPNN)-based predictions of gene-gene interactions using four distinct types of gene embeddings: SGEs, scBERT embeddings, Gene2vec embeddings, and randomly generated embeddings. Refer to Supplementary Note 1.4 for details of creating the training and testing datasets. (D) Gene-gene interaction heatmaps comparing the ground truth (top-left) with prediction results using the four types of gene embeddings. For better visualization, the heatmaps include only the top 1,000 genes that exhibit the most interactions with other genes as per the ground truth. In these maps, a filled cell indicates the existence of an interaction between the pair of genes in the corresponding row and column, while a blank cell indicates the opposite. The prediction accuracy is indicated on top of each heatmap.
discriminate between spatial gene expressions and grasps the intricate relational semantic structures among genes. This process involves a self-paced, iterative joint optimization of MAE weights and SMM parameters using two loss functions, L1 and L2 (see “Self-paced semi-contrastive optimization of gene embeddings” in Methods). Each training epoch begins with L1, updating the MAE weights to maximize the log likelihood of the entire dataset while controlling for macro-factors (e.g., cluster size imbalance) via regularization terms. Following this, L2, designed for discriminatively boosted clustering, further refines gene embeddings and SMM parameters, drawing similar genes closer and distancing dissimilar ones over successive batches. Overall, the training process alternates between Modules II and III until either a predetermined number of training epochs is reached, or the change in gene assignments between successive epochs falls beneath a prespecified [...]

[...] SpaCEX in generating semantically meaningful gene representations hinges on its ability to identify biologically relevant genomic contexts, manifested as clusters of spatially co-expressed genes. To systematically evaluate this ability, we utilize twelve human dorsolateral prefrontal cortex (DLPFC) 10x Visium datasets (10x-hDLPFC) (16) and a mouse hippocampus Slide-seqV2 dataset (ssq-mHippo) (17), as listed in SI Appendix, Table S1. We benchmark SpaCEX against four state-of-the-art competing methods: CNN-PReg (18), Giotto (19), SPARK (20), and STUtility (21) (SI Appendix, Table S2). Our initial assessment involves visualizing spatial expression maps of four randomly selected genes from each of two SpaCEX-identified clusters, chosen to represent high and medium quality clusters, respectively (see “Identifying groups of spatially co-expressed genes” in Methods). SI Appendix, Fig. S1 shows that genes within both clusters exhibit congruent expression patterns. Next, the overall co-expression and spatial coherence of SpaCEX-generated clusters are quantitatively measured using two Davies-Bouldin (DB) indices (see “Evaluation metrics” in Methods). Fig. 2A shows that [...] clusters as spatial genomic contexts, we delve into their biological significance through gene pathway enrichment analysis and gene cofunction analysis using the 10x-hBC dataset derived from human breast cancer tissue (20). Our analysis
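[Figure 4. Panels A and B. Panel A (SGE alignment network): input Embedding (Disease); MLP; Batch Norm and LeakyReLU (3x); MLP; Batch Norm; Dropout; identity connection (+); MLP; Output; MAE loss against Embedding (Health). Panel B: SGEs and gene UMAP embeddings.]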
Fig. 4. SGEs of 2177 housekeeping genes are generated separately from each of two healthy human MTG 10x Visium datasets (10x-hMTG-1-1 and 10x-hMTG-18-64). UMAP embeddings of the same dimension as the SGEs are also generated for these housekeeping genes from both datasets. Embedding pairs of identical genes from different datasets are then subjected to alignment. (A) The network architecture of the SGE alignment network (SAN). (B) PCA plots of SGEs (left) and gene UMAP embeddings (right) after SAN-mediated alignment. (C) Scaled cosine dissimilarities between pairs of aligned SGEs (in blue) versus those between aligned UMAP embeddings (in orange).
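The dissimilarity measure in panel C can be sketched as a row-wise cosine dissimilarity between paired embeddings of the same genes. The min-max rescaling below is an assumption about what “scaled” means, the 64-dimensional random embeddings are toy stand-ins for aligned SGEs, and `cosine_dissimilarity` and `min_max_scale` are hypothetical helper names.

```python
import numpy as np

def cosine_dissimilarity(A, B):
    """Row-wise 1 - cos(a_i, b_i) for two (n_genes, dim) embedding matrices."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return 1.0 - num / den

def min_max_scale(v):
    """Rescale values to [0, 1]; an assumed form of the 'scaled' dissimilarity."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(5, 64))                       # toy embeddings, dataset 1
emb_b_aligned = emb_a + 0.01 * rng.normal(size=(5, 64))  # well-aligned counterparts
emb_b_unrelated = rng.normal(size=(5, 64))             # poorly aligned counterparts

d_good = cosine_dissimilarity(emb_a, emb_b_aligned)
d_bad = cosine_dissimilarity(emb_a, emb_b_unrelated)
scaled = min_max_scale(np.concatenate([d_good, d_bad]))
```

Smaller values for a gene pair indicate a smaller alignment discrepancy, so well-aligned embeddings cluster near the low end of the scaled range.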
targets three specific gene clusters (C1, C2, and C3), each comprising over 20 genes and demonstrating the lowest group closeness centrality (SI Appendix, Supporting Text). This centrality metric suggests that their expression patterns are the most likely to diverge from the majority, probably due to pathological functions. As depicted in Fig. 2B, the aggregated expression patterns (22) (SI Appendix, Supporting Text) of C1 through C3 are associated with the spatial distributions of invasive ductal carcinoma (IDC), ductal carcinoma in situ (DCIS), and benign stroma cells, respectively. In contrast, the aggregated expression of C4, a control cluster comprising 30 randomly selected genes, is dispersed over the spatial map. Additionally, Fig. 2C-D show that these clusters are statistically significantly enriched with pathologically/biologically relevant and densely inter-connected gene ontology biological processes (GOBP). The functional coherence of member genes within these clusters is also notable, given that the involvement of a gene in a GOBP can be reliably predicted based on other member genes’ involvement in the same GOBP (Fig. 2C). A more detailed explanation of the methodology and results of this analysis is available in the SI Appendix, Supporting Text. Altogether, these findings affirm that SpaCEX-identified gene clusters are cofunctional and biologically/pathologically relevant to the context under investigation, thus endorsing their roles as spatial genomic contexts.

SGEs Effectively Capture Fundamental Gene Semantics. We begin this section by validating the functional and relational semantics of SGEs through three analyses (see “Evaluating SpaCEX-generated gene embeddings” in Methods). In the first analysis, hierarchical clustering is performed on SGEs for the KRT-II, HLA-I, and HLA-II gene family members derived from human DLPFC tissues (Fig. 3A). We find that SGEs from the same family are clustered together, and those from functionally related gene families (e.g., HLA-I and HLA-II) are positioned in closer proximity within the hierarchy than those from less related families (e.g., KRT-II and HLA-II). In contrast, these gene families are more intermingled when the clustering is based on their original expression profiles (SI Appendix, Fig. S2). In the second analysis, Leiden is utilized to identify gene clusters at various resolutions, using either SGEs or the original gene expression profiles. Subsequently, pathway enrichment analysis against the Reactome database (23) is conducted on these gene clusters to compile their high-confidence “pathway hits”, the number of which is indicative of the gene representations’ efficacy in encoding complex gene-gene connections (9). Fig. 3B shows that gene clusters derived from SGEs yield a consistently higher average number of “pathway hits” than those derived from original gene expression profiles across resolution levels. The third analysis focuses on predicting gene-gene interactions using four types [...]
[Figure 5. Panels A-D. Legend: Housekeeping Genes, AD Genes. Control-group panel label: Health 0 and Health 1.]
Fig. 5. Identifying AD-associated genes and gene crosstalk with SGEs. Fig. 5A and 5B are for identifying AD-associated genes, while Fig. 5C and 5D are for AD-associated gene crosstalk. SGEs of 42 AD-associated genes and 126 non-anchor housekeeping genes are obtained from two healthy MTG 10x Visium datasets (health_0: 10x-hMTG-1-1 and health_1: 10x-hMTG-18-64) alongside an AD MTG 10x Visium dataset (10x-hMTG-2-3). SGE pairs of identical genes from different datasets are aligned using SAN. The AD and the healthy (health_0) datasets form the study group, while the two healthy datasets form the control group. (A) PCA plots of SGEs in the study (left) and control (right) groups. Round dots and triangles represent housekeeping and AD-associated genes, respectively. The PCA distances between SGE pairs are visually represented by yellow and blue lines for housekeeping and AD-associated genes, respectively. Compared to the control group, blue lines are markedly longer than yellow lines in the study group. (B) Box-plots of scaled cosine dissimilarities between SGE pairs in the study (left) and control (right) groups. Yellow and blue boxes represent housekeeping genes and AD-associated genes, respectively. (C) Gene-gene interactions within each dataset are quantified using a Pearson correlation matrix calculated from SGEs. Alterations in gene-gene interactions between two datasets are measured as a correlation shift matrix representing the absolute differences between the two correlation matrices, which is visualized as a heatmap wherein darker colors indicate larger shifts. Compared to the control group (right), correlation shifts in the study group (left) for gene pairs involving at least one AD-associated gene are significantly larger than those for gene pairs of housekeeping genes only. (D) Scatterplots of correlations between gene pairs involved in the same AD-associated pathways, with each cross representing a gene pair. For the study group (left), the y- and x-axes denote gene correlations in the AD and healthy (health_0) datasets, respectively. In the control group (right), these axes represent gene correlations in the health_1 and health_0 datasets, respectively. A t-test is used to assess the statistical significance of differences in gene correlations between the datasets, with P-values indicated on top of each panel.
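The correlation shift matrix of panel C follows directly from its definition: a gene-by-gene Pearson correlation matrix is computed from the embeddings of each dataset, and the shift is the element-wise absolute difference. The sketch below uses toy random embeddings in place of SGEs (an assumption), and `correlation_shift` is a hypothetical helper name.

```python
import numpy as np

def correlation_shift(emb_a, emb_b):
    """Absolute difference between the gene-gene Pearson correlation matrices
    computed from two (n_genes, dim) embedding matrices with matching gene order."""
    corr_a = np.corrcoef(emb_a)  # rows are genes, columns are embedding dimensions
    corr_b = np.corrcoef(emb_b)
    return np.abs(corr_a - corr_b)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(4, 64))   # toy embeddings for 4 genes (assumed)
disease = healthy.copy()
disease[0] = rng.normal(size=64)     # only gene 0 changes between conditions

shift = correlation_shift(healthy, disease)
```

In this toy case only the row and column of the altered gene carry non-zero shifts, which mirrors the idea that gene pairs involving a disease-associated gene should show larger shifts than pairs of unchanged genes.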
hDLPFC-151676, scBERT embeddings, Gene2vec embeddings, and randomly generated embeddings. As elaborated in SI Appendix, Supporting Text, SGE-based prediction proves to be the most accurate, as evidenced by its superior accuracy and AUC scores (Fig. 3C), along with a prediction heatmap closely aligning with the ground truth (Fig. 3D). Collectively, these analyses affirm that the fundamental functional and relational semantics of genes are encapsulated in SGEs.

SGEs Facilitate Cross-sample Gene Alignment. Given that gene-gene relationships remain relatively stable across datasets under identical conditions and are less prone to technical artifacts like batch effects (24), we posit that SGEs, generated based on gene relational semantics as previously validated, are also resilient to technical artifacts and thus capable of facilitating cross-sample gene alignment. To verify this point, SGEs from two healthy human middle temporal gyrus (MTG) 10x Visium datasets (10x-hMTG-1-1 and 10x-hMTG-18-64) are aligned using our SGE alignment network (SAN), which is a three-layer [...] minimizes the mean absolute errors (MAE) between SGE pairs of 2177 housekeeping genes from two different datasets (see “Data and Code Availability” for where the housekeeping genes are acquired). The consistent biological roles of these housekeeping genes suggest that their SGEs should align accurately if they are unaffected by cross-sample technical variations. For comparison, gene expressions from both datasets are also transformed into uniform manifold approximation and projection (UMAP) embeddings with the same dimensionality as SGEs. SAN is then trained to align these UMAP embeddings, which could potentially be confounded by technical variations. The cosine dissimilarity of aligned embedding pairs serves as an indicator of alignment discrepancies. Fig. 4B reveals [...] than those between pairs of UMAP embeddings, highlighting SGEs’ robustness against cross-sample technical noise. Moreover, by visualizing aligned SGEs and UMAP embeddings on a principal component analysis (PCA) plot (Fig. 4C), we find that SGEs from different datasets are more evenly mixed than the UMAP embeddings. These observations provide strong evidence that SGEs prioritize capturing genuine gene semantics and are resistant to technical artifacts across samples, thereby facilitating more accurate cross-sample gene alignment.

[...] with SGEs. Identifying genes with altered spatial expression and interactions is pivotal for illuminating pathogenic mechanisms underlying disease progression, e.g., the elevated APOE expression within the hippocampus in Alzheimer’s Disease (AD) (25) and the intensified interplay between Notch and Wnt path-
[Figure 6. Panel A (10x-hBC): Designated Patterns (95%, 75%, 35%), Domains, Identified Genes (BRCA1, BRCA2). Panel B (10x-hMTG-2-3): Designated Patterns (95%, 75%), Domains, Identified Genes (TREM2, PSEN1).]
Fig. 6. Identifying disease-associated genes with designated spatial expression patterns. For both Fig. 6A and 6B, the designated gene spatial expression patterns are displayed in the leftmost panel in the first row. The expression level percentiles are indicated within different regions demarcated by dotted lines of varying colors. The rightmost panel in the first row shows expert-curated domain labels. The leftmost panel in the second row represents denoised SpaCEX-SPS-simulated genes that mirror the designated patterns. The remaining panels correspond to denoised spatial maps of identified real genes whose expression patterns resemble the designated ones. (A) The designated expression pattern has a high expression level (95th percentile) in tumor cores (i.e., IDC), a medium expression level (75th percentile) in tumor edges, and a low expression level (35th percentile) elsewhere in the human breast cancer 10x Visium dataset (10x-hBC). (B) The designated expression pattern has a high expression level (95th percentile) in white matter (WM) and a medium expression level (75th percentile) in other cortex layers in the human MTG 10x Visium dataset (10x-hMTG-2-3).
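One way to picture a designated pattern such as the one in panel A is to assign each tissue domain an expression level taken from a percentile of the per-gene average expression. The sketch below is only an illustration of that idea under stated assumptions: the toy expression matrix, the domain names, and the helper `designated_pattern` are hypothetical, and the actual SpaCEX-SPS simulation described in Methods is not reproduced here.

```python
import numpy as np

def designated_pattern(expr, domains, level_by_domain):
    """Build a per-spot target expression vector.

    expr: (n_genes, n_spots) expression matrix.
    domains: (n_spots,) array of domain labels.
    level_by_domain: maps a domain label to a percentile of the per-gene
        average expression (e.g., 95 for "high").
    """
    gene_means = expr.mean(axis=1)                 # average expression of each gene
    pattern = np.full(len(domains), np.nan)
    for domain, pct in level_by_domain.items():
        pattern[domains == domain] = np.percentile(gene_means, pct)
    return pattern

rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=1.0, size=(20, 6))   # toy data: 20 genes, 6 spots
domains = np.array(["core", "core", "edge", "edge", "other", "other"])
levels = {"core": 95, "edge": 75, "other": 35}         # high, medium, low

pattern = designated_pattern(expr, domains, levels)
```

The resulting vector is highest in the "core" spots, intermediate in the "edge" spots, and lowest elsewhere; a pseudo-gene with this profile could then be embedded and compared against real genes.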
ways in many cancers (26). We define strategies for identifying disease-related genes as either reference-based or reference-free. The former compares spatial gene expressions in diseased versus healthy tissues to pinpoint differences, while the latter identifies genes exhibiting specific expression patterns within putative pathogenic regions independently of healthy tissue expression benchmarks. Herein, we detail the use of SpaCEX and SGEs in both strategies.

[...] value discrepancies across samples/conditions are reconcilable by SAN-mediated gene alignment as shown in Fig. 4A, allow for the detection of disease-associated alterations in spatial gene expressions and relationships.

To verify this, we generate SGEs from two healthy brain tissue datasets (10x-hMTG-1-1 and 10x-hMTG-18-64) and one AD brain tissue dataset (10x-hMTG-2-3). The SAN described in the preceding section is used to align SGE pairs of identical genes from two different datasets. The 2177 housekeeping genes are randomly split into an “anchor” set of 2051 genes for SAN training and a “non-anchor” set of 126 genes reserved for testing. The trained SAN is applied to align SGE pairs from both the non-anchor housekeeping gene set and a set of 42 AD-associated genes reported by previous studies (SI Appendix, Table S3). When aligning SGEs between an AD and a healthy dataset (i.e., 10x-hMTG-1-1), we find that AD-associated genes demonstrate significantly greater PCA distances (Fig. 5A Top) and scaled cosine dissimilarities (Fig. 5B Top) compared to the non-anchor housekeeping genes. In contrast, alignment of SGEs between the two healthy datasets does not show such disparities, either in PCA distances (Fig. 5A Bottom) or in scaled cosine dissimilarity (Fig. 5B Bottom), aligning with expectations that the biological semantics of AD-related genes should remain unchanged in healthy conditions. These findings imply that differences in SGEs between healthy and diseased states could signal changes in gene functions associated with the disease. Therefore, genes can be prioritized based on the degree of their SGE dissimilarities, furnishing insights for the identification of disease-associated genes.

SGEs can also provide valuable clues to altered gene crosstalk in disease. We denote the Pearson correlations between SGEs of gene pairs from a healthy brain tissue [...] housekeeping genes. This observation is bolstered by a statistically significant increase (P-value=5.78e-06) in correlations between gene pairs that participate in the same AD-related pathways (Fig. 5D Top) in the AD dataset. In contrast, no such shifts in correlation are detected between the two healthy datasets (Fig. 5D Bottom), in line with the expectation that gene relationships in healthy conditions should remain largely stable. Therefore, substantial changes in SGE correlations are indicative of disease-related alterations in gene interactions.

Reference-free. Our reference-free approach employs our method, SpaCEX-SPS, to create a pseudo-gene with designated spatial expression patterns (see “SpaCEX-SPS” in Methods), which is subsequently converted into an SGE. Based on the scaled cosine similarities between the SGEs of real genes and that of the pseudo-gene, we can pinpoint genes whose spatial expression patterns correspond to the predefined one. Our method is tested on two datasets: a human breast cancer dataset (10x-hBC) and an AD MTG dataset (10x-hMTG-2-3). In the SpaCEX-SPS simulation, a “high” level of gene expression is defined as the 95th percentile of average expressions across all genes, a “medium” level as the 75th percentile, and a “low” level as the 35th percentile. For the 10x-hBC dataset, we aim to discover genes with high expression within tumor cores
[Figure 7, panels A to E. (A) Schematic of SpaCEX-ETC: SpaCEX SGEs, finetuning, generator with projection and memory bank, discriminator. (B) Spatial maps of Wdr5, Zfp444, and Wnt3a; columns: 10x Visium original, real SeqFISH, SpaCEX-ETC, SpaOTsc, SpaGE, Tangram; color scale from Low to High. (D) Scatterplot of Moran's I of generated sqf-mEmb genes via SpaCEX, R = 0.724. (E) Spatial clustering maps with panel labels: ARI 0.361 (330 genes), ARI 0.442 (660 genes, enhanced via SpaCEX-ETC), ARI 0.226 (660 genes, enhanced via SpaOTsc), ARI 0.383 (660 genes, enhanced via SpaGE), ARI 0.268 (660 genes, enhanced via Tangram).]

Fig. 7. (A) [...] Initially, genes from a ST dataset with [...] FISH-based ST dataset are fed into SpaCEX-ETC’s generator to regenerate their original spatial gene expressions. The generator consists of three components: a multilayer perceptron (MLP)-based encoder, an MLP-based decoder, and a memory bank that ensures training stability. Following this, a discriminator is trained to distinguish between the original and regenerated genes. Meanwhile, the loss gradients from the generator are backpropagated to update the parameters of the MAE’s encoder within SpaCEX, so that SGEs are better adapted to the specific semantics inherent in the FISH-based ST dataset. Once trained, SpaCEX-ETC can generate genes that are initially absent in the dataset, using their finetuned SGEs. (B) Three genes (Wdr5, Zfp444, and Wnt3a), which are covered by both the SeqFISH (sqf-mEmb) and the 10x-mEmb datasets and differ in their expression levels, are used to evaluate SpaCEX-ETC. These genes’ SGEs are initially obtained from the 10x-mEmb dataset and used to regenerate their expression profiles in the sqf-mEmb dataset. From left to right, we show the three genes’ original spatial expression in the 10x-mEmb dataset and in the sqf-mEmb dataset, alongside their expression profiles regenerated by SpaCEX-ETC and the three benchmark methods. (C) The box plot displays the Pearson correlation coefficients (PCCs) between the original and regenerated spatial expression profiles of the 330 genes, for SpaCEX-ETC and the benchmark methods. (D) The scatterplot shows spatial variabilities of the 330 genes and their respective regenerated counterparts, with spatial variability quantified using Moran’s I index. The red line in the scatterplot represents a fitted regression line with R=0.724. (E) The transcriptomic coverage of sqf-mEmb is doubled by using SpaCEX-ETC and the benchmark methods to generate an additional 330 genes absent in the original dataset. The leftmost spatial map shows the ground truth tissue domain annotations, the second map shows the spatial clustering results of SpaGCN using the unexpanded gene set, while subsequent maps show the spatial clustering results using the gene set augmented by SpaCEX-ETC and the benchmark methods, respectively. The clustering accuracy (i.e., ARI) is shown below each method name.
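The per-gene comparison summarized in panel C can be sketched as follows. This is a minimal illustration that assumes both datasets are available as genes-by-spots matrices over the same spots; the function name `per_gene_pcc` and the toy matrices are hypothetical and are not taken from the SpaCEX code base.

```python
import numpy as np

def per_gene_pcc(original, regenerated):
    """Row-wise Pearson correlation between two (n_genes, n_spots) matrices."""
    a = np.asarray(original, dtype=float)
    b = np.asarray(regenerated, dtype=float)
    if a.shape != b.shape:
        raise ValueError("matrices must have the same shape")
    # Center each gene (row) across spots.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    # Genes that are constant across spots have zero variance and yield NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return (a * b).sum(axis=1) / denom

# Toy data (made up): gene 0 is reproduced up to scale, gene 1 is inverted.
original = np.array([[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]])
regenerated = np.array([[2.0, 4.0, 6.0, 8.0], [1.0, 0.0, 1.0, 0.0]])
pcc = per_gene_pcc(original, regenerated)
```

Because the PCC is scale-invariant, it captures agreement in spatial pattern only; agreement in data scale has to be checked separately.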
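As a rough illustration of the spatial variability measure used in panel D, the sketch below computes Moran's I for one gene. The binary distance-threshold weights, the `radius` parameter, and the function name `morans_i` are assumptions made for this example; the text does not state which spatial weighting scheme is used.

```python
import numpy as np

def morans_i(values, coords, radius=1.5):
    """Moran's I of one gene's expression over spatial spots.

    values: (n_spots,) expression values; coords: (n_spots, 2) spot positions.
    Weights are binary (assumed): 1 for two spots at different positions
    that lie within `radius` of each other, 0 otherwise.
    """
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((dist > 0) & (dist <= radius)).astype(float)
    centered = values - values.mean()
    denom = (centered ** 2).sum()
    if w.sum() == 0 or denom == 0:
        return float("nan")  # undefined without neighbors or without variance
    num = (w * np.outer(centered, centered)).sum()
    return float(len(values) / w.sum() * num / denom)

# Toy 6 x 6 grid: a smooth gradient versus a checkerboard pattern.
xs, ys = np.meshgrid(np.arange(6), np.arange(6))
coords = np.column_stack([xs.ravel(), ys.ravel()])
gradient = xs.ravel().astype(float)
checker = ((xs + ys) % 2).ravel().astype(float)
i_gradient = morans_i(gradient, coords, radius=1.0)  # positive autocorrelation
i_checker = morans_i(checker, coords, radius=1.0)    # perfect negative autocorrelation
```

With `radius=1.0` only the four axis-aligned grid neighbors are counted. A comparison like the one in panel D would compute this index for each real gene and its regenerated counterpart and then correlate the two sets of values.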
(IDC), medium expression within tumor edges, and low expression elsewhere. Using SpaCEX-SPS, we simulate a gene mirroring these expression patterns. As shown in Fig. 6A, we successfully identify cancer-associated genes such as BRCA1, BRCA2, PALB2, and TCEAL4, all exhibiting the sought-after expression patterns. Similarly, for the 10x-hMTG-2-3 dataset, we pinpoint AD-associated genes like TREM2, PSEN1, BIN1, and APOE, characterized by high expression within the white matter (WM) layer and medium expression within other cortex layers (Fig. 6B). In particular, the upregulation of BIN1 within the WM of AD brain has been previously reported (27). These outcomes collectively demonstrate the efficacy of utilizing SGEs and SpaCEX-SPS for discerning genes with disease-specific spatial expression profiles. Moreover, our methodology extends beyond disease-associated gene identification to encompass any genes with designated spatial expression patterns.

SGE-based Enhancement of the Transcriptomic Coverage in FISH-based ST. A significant challenge in ST is achieving both full transcriptomic coverage and high resolution. Existing methods like Tangram (28), SpaGE (29), and SpaOTsc (30) augment transcriptomic coverage in high-resolution, FISH-based ST data, but they essentially generate “pseudo-ST” data since they focus on mapping single cells in scRNA-seq onto spatial locations in ST to compensate for genes not profiled in ST with scRNA-seq data, rather than the de novo generation of ST data with inherent spatial semantics. These methods are limited by their underutilization of spatial information in the mapping process, as seen in Tangram and SpaGE, and the introduction of systematic biases from discrepancies between scRNA-seq and ST data such as inconsistencies in data scales. To address this challenge, we introduce SpaCEX-enhanced-transcriptomics-coverage (SpaCEX-ETC), an innovative SGE-based Generative Adversarial Network (GAN) model, as detailed in the “SpaCEX-ETC” section in Methods and Fig. 7A. SpaCEX-ETC is predicated on the notion that gene relational semantics should remain largely consistent across different ST data types for the same tissue type. Consequently, the spatial expression profiles of uncovered genes can be extrapolated from [...] inherent in SGEs derived from a full transcriptomic coverage ST dataset (e.g., 10x Visium). We evaluate the effectiveness of SpaCEX-ETC by reproducing the 330 covered genes from a mouse embryo SeqFISH dataset (sqf-mEmb), guided by the SGEs from a mouse embryo 10x Visium dataset (10x-mEmb).
We include Tangram, SpaGE, and SpaOTsc as benchmarks (SI Appendix, Table S2) to reproduce the same gene set from scRNA-seq data. To ease a direct comparison between the original and reproduced genes, we visualize the spatial maps of three genes of different expression levels, including Wdr5, Zfp444, and Wnt3a. Fig. 7B illustrates that SpaCEX-ETC surpasses benchmark methods in accurately reproducing genes, achieving high fidelity in both spatial expression patterns and data scales.

Additionally, Fig. 7C quantitatively demonstrates SpaCEX-ETC’s superiority in generating genes that closely correlate with their actual values. This concordance is further supported by the highly correlated spatial variability (R=0.724) between authentic and SpaCEX-ETC-generated gene expressions, as depicted in Fig. 7D. Finally, we select an additional 330 genes imputed by SpaCEX-ETC, deemed most real-like by the GAN discriminator, to double the transcriptomic coverage of the sqf-mEmb dataset. This augmented gene set undergoes spatial clustering to evaluate the imputed genes’ quality and their analytical utility. Fig. 7E shows that spatial clustering with SpaCEX-ETC-imputed genes achieves a significantly higher accuracy compared to either the original dataset or genes imputed by the benchmark methods. Collectively, these results [...] coverage of FISH-based ST via a generative approach.

[...] of its SGE with those of simulated spatially homogeneous genes. SVGs are then ranked and selected according to these scores. For this assessment, we select the top 3000 SVGs from both the 10x-hDLPFC-151507 and 10x-hBC datasets using SpaCEX-SVG and two benchmark methods: SpatialDE and SPARK-X (31) (SI Appendix, Table S2). Both Moran’s I and Geary’s C indices show that the SVGs selected from the 10x-hDLPFC-151507 dataset by SpaCEX-SVG are more spatially variable than those selected by the benchmark methods (SI Appendix, Fig. S3A). Moreover, one well-documented drawback of the benchmark methods is their inability to effectively rank SVGs based on their spatial variability scores (P- or Q-values) (32). In contrast, SpaCEX-SVG is more sensitive and effective in distinguishing levels of spatial variability among SVGs. This is exemplified in SI Appendix, Fig. S3B, where the top four SVGs selected by SpaCEX-SVG exhibit more noticeable spatial variabilities than those selected by the benchmark methods. A parallel analysis conducted on the 10x-hBC dataset yields similar results (SI Appendix, Fig. S4). These results altogether demonstrate the potential of SGEs for detecting and ranking SVGs in ST datasets.

SGE-improved Spatial Clustering. In this section, we propose SpaCEX-Improved-Spatial-Clustering (SpaCEX-ISC), a novel SGE-based computational method for enhancing spatial clustering, as detailed in the “SpaCEX-ISC” section in Methods. The rationale behind SpaCEX-ISC is that spatial clustering can be effectively improved by optimizing the informational efficiency of spatial transcriptomic data, which involves minimizing redundant information and retaining the most discriminative information presented by feature genes (33). Redundant information among genes can be revealed through their SGE similarity matrix and reduced by selecting and retaining only the most discriminative genes within groups of highly similar ones. The most discriminative genes are those with the highest spatial variability scores, as determined by SpaCEX-SVG. Subsequently, any spatial clustering algorithm, which is SpaGCN in our case, can work with this information-efficient set of feature genes to achieve improved performance. We select two state-of-the-art spatial clustering methods, GraphST and SpaGCN, alongside a baseline method, Leiden, as benchmarks for comparison (SI Appendix, Table S2). In a comprehensive evaluation across twelve 10x-hDLPFC datasets, SpaCEX-ISC consistently outperforms the benchmark methods, as evidenced by its highest Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) scores (SI Appendix, Fig. S5A). SpaCEX-ISC’s superiority over the benchmark methods is further illustrated by its more accurately recovered annotated anatomical cortex layers in the spatial maps of the 10x-hDLPFC-151676 and 10x-hDLPFC-151669 datasets (SI Appendix, Fig. S5B). Finally, SpaCEX-ISC achieves optimal performance across six 10x-hDLPFC datasets when approximately 50%-60% of redundant information is excluded (SI Appendix, Fig. S5C).

[...] microarray or scRNA-seq data typically rely on massive pretraining, resulting in a weakened sensitivity to context-specific nuances, and fall short of incorporating spatial expression information into the gene vectorization process. In this work, we propose SpaCEX, a novel context-aware, self-supervised learning model that exploits spatial genomic contexts in ST to derive distributed gene representations imbued with spatial gene functional and relational semantics.

We comprehensively evaluate SpaCEX across ST datasets of various tissues, species, and platforms in aspects regarding the model’s legitimacy and its utility in downstream, task-specific applications. To establish the methodological soundness, we initially demonstrate SpaCEX’s adeptness at identifying spatial genomic contexts as groups of cofunctional and co-expressed genes. Subsequent analyses confirm SpaCEX’s ability to generate SGEs that encapsulate essential gene semantics from these genomic contexts, with SGE correlations reflecting gene familial and ontological ties. Notably, SGEs prioritize biological variations over technical noise, enhancing their utility in cross-sample gene alignment, as demonstrated by the accurate alignment of SGEs of functionally stable housekeeping [...] suite of innovative SGE-based methods for identifying disease-associated genes and gene crosstalk, pinpointing genes with designated spatial expression patterns, enhancing the transcriptomic coverage of FISH-based ST, detecting SVGs, and improving spatial clustering. These methods either pioneer solutions to existing problems or markedly surpass established benchmarks.

SpaCEX’s remarkable performance is rooted in four aspects: the effective integration of spatial expression patterns
into SGEs via an image-focused, self-supervised MIM approach; learning relational semantic structures among genes through a flexible and robust SMM-based clustering; a novel combination of MIM with contrastive learning for iterative joint optimization, enhancing the perceptibility and discriminability of SGEs; and the resilience of SGEs to technical noise, ensuring reliable gene alignment across conditions. Overall, SpaCEX not only facilitates the discovery of cofunctional gene modules, like gene networks and pathways, but also generates biologically significant gene embeddings, laying the foundation for a suite [...]

Methods

[...] spots are excluded. To preserve the spatial data integrity, we do not perform quality control on spatial spots. Finally, the gene expression counts are normalized by library size, followed by log-transformation.

[...] the spatial gene maps can be visualized as gray-scale images, we devise an adapted version of MAE to transform visual features of gene images into embeddings in a latent feature space. A given gene image is first segmented into regular non-overlapping patches, from which a subset of patches is [...] are then input into the MAE decoder to reconstruct the gene image. We replace the transformer architecture of the original MAE decoder with a convolutional autodecoder to enhance the performance in our case. A more important modification is the addition of a nonlinear projection head to the end of the encoder. This projection head consists of a linear layer, a batch normalization (BN) layer and a Scaled Exponential Linear Unit (SELU) activation layer. Owing to the BN layer and the self-normalizing property of the SELU function, the gene embeddings output from the encoder more closely conform to [...]

SMM-based Modeling. As Stuhlsatz et al. (35) have demonstrated the capability of deep image encoders in learning visual representations that follow a multivariate Student’s t-distribution, we utilize an SMM to model the distributions of SGEs in a latent feature space, with individual components of the SMM corresponding to distinct gene clusters. The key strength of this modeling is its robustness to outliers, which are assigned reduced weights during the estimation of model parameters. Specifically, let $Z \in \mathbb{R}^{N \times D}$ denote SGEs, where $N$ is the total number of genes and $D$ is the dimension of the feature space. We model the distribution of $Z$ as an SMM parameterized by $\Theta = \{\Theta_k : \pi_k, \mu_k, \Sigma_k, v_k, \forall k \in K\}$. Here, $K$ represents the total number of gene clusters, while $\pi_k$, $\mu_k$, $\Sigma_k$, $v_k$ represent the weight, mean, covariance matrix and degrees of freedom of the $k$-th component, respectively. The density function of $z_i$ is then formulated as follows: [...]

[...] introduce priors on the model parameters for model regularization [...]

$$\Phi(z_i \mid \mu_k, \Sigma_k, v_k) = \int_{0}^{\infty} \mathcal{N}\!\left(z_i \,\middle|\, \mu_k, \frac{\Sigma_k}{\zeta_{i,k}}\right) \Gamma\!\left(\zeta_{i,k} \,\middle|\, \frac{v_k}{2}, \frac{v_k}{2}\right) d\zeta_{i,k}. \qquad [3]$$

We also introduce a missing variable $\xi_i$ to represent the component membership of $z_i$. Then the posterior complete data log likelihood can be written as: [...]

In the $t$-th iteration of the E-step, the expected sufficient statistics $\xi_{i,k}^{(t)}$ and $\zeta_{i,k}^{(t)}$ are derived based on $\Theta^{(t-1)}$. In the subsequent M-step, $\Theta^{(t-1)}$ is updated to $\Theta^{(t)}$ by maximizing the auxiliary function $Q(\Theta, \Theta^{(t-1)}) = E(\ell_c(\Theta) \mid \Theta^{(t-1)})$. Note that $v_k^{(t)}$ is estimated via a Generalized EM (GEM) technique to speed up the calculation without harming its convergence to at least a local optimum. The two steps are conducted alternately until either convergence is achieved or a pre-specified maximum number of iterations is reached. Refer to SI Appendix, Supporting Text for details about the [...]

Self-paced Pseudo-contrastive Optimization of SGEs. Two loss functions, $L_1$ and $L_2$, are calculated based on clustering results for updating parameters of both representation learning and the SMM through loss gradient backpropagation. This iterative process progressively improves the clustering-oriented image embeddings and clustering results. Upon completing
the inference of SMM parameters $\tilde{\Theta}$ in each epoch, let $W$ and $\hat{W}$ represent the parameters of the encoder and decoder of the representation learning model respectively, an epoch-level loss $L_1$ is calculated for updating parameters of MAE: [...]

$$q_{i,k} = \pi_k \Phi(z_i \mid \mu_k, \Sigma_k, v_k), \quad \forall i \in [1, N], \ \forall k \in [1, K]. \qquad [8]$$

$L_{size}$ penalizes empty and tiny clusters, while exempting those whose size exceeds a predefined threshold $\upsilon$ so that image assignments are not overly uniform:

$$L_{size} = \sum_{k=1}^{K} -J_k \log J_k, \qquad J_k = \begin{cases} \dfrac{\sum_{i}^{N} q_{i,k}}{N}, & \text{if } J_k \le \upsilon \\ 1, & \text{otherwise.} \end{cases} \qquad [9]$$

$L_r$ represents the fidelity loss of the reconstructed gene image by the convolutional autodecoder, expressed as: [...]

[...] to update MAE and SMM parameters across successive batches. Here, $L_r$ and $L_{\ell ap}$ remain the same as in Equation (10) and Equation (6) except that they are calculated on the batch level. $L_c$ boosts high-confidence images, incrementally grouping similar instances while separating dissimilar ones: [...]

Here, $q_{i,k}$, as in Equation (8), represents the probability of assigning the $i$-th gene to the $k$-th SMM component, and $p_{i,k}$ is an auxiliary target distribution that boosts high-confidence images. After this joint optimization, the training progresses to the next epoch, iterating until the end of the training process. The mathematical derivations of gradients [...]

SpaCEX-ETC. As shown in Fig. 7A, SpaCEX-ETC is a GAN-based model that uses an encoder to transform gene expression matrices into SGEs, which then serve as inputs for a generator. The generator, consisting of an encoder, a decoder, and a memory bank, reconstructs gene expression profiles using an attention mechanism and a continuously updated embedding queue. The discriminator, an MLP-based network, distinguishes between actual and generated gene expressions. The model adjusts the MAE encoder weights in SpaCEX’s Module I through adversarial and reconstruction losses, which enables [...]

[...] identify those with notable spatial variations. Details are provided in the SI Appendix, Supporting Text.

SpaCEX-ISC. In our study, we employ the SpaCEX-ISC method to optimize the analysis of spatial transcriptomics [...] variations. These scores are used to filter out redundant data, ensuring that only the most informative gene expressions are
retained. The processed data is then fed into a graph neural network to generate spot embeddings for clustering, enhancing the clarity and utility of spatial gene expression analysis. For additional details, refer to the SI Appendix, Supporting Text.

Experimental Settings. Detailed experimental settings for the SpaCEX study are extensively documented in the SI Appendix, Supporting Text. These include the methodologies for identifying groups of spatially co-expressed genes, protocols for enrichment analysis, techniques for cofunction analysis of intra-cluster genes, criteria for evaluating SpaCEX-generated gene embeddings, and the metrics used to assess the overall performance of SpaCEX.

Data and Code Availability

All data are available in the main text or the supplementary materials: The mouse hippocampus dataset (ssq-mHippo) can be downloaded from [...] human dorsolateral prefrontal cortex datasets (10x-hDLPFC) are available through the spatialLIBD package (36) at http://spatial.libd.org/spatialLIBD. The human breast cancer dataset (10x-hBC) can be obtained from [...]expression/datasets/1.1.0/V1_Breast_Cancer_Block_A_Section_1. The three human MTG datasets, including two healthy datasets (10x-hMTG-1-1 and 10x-hMTG-18-64) and an AD dataset (10x-hMTG-2-3), are available in the [...] S1. Moreover, we acquire 2,162 human housekeeping genes from the HRT Atlas (38) (https://housekeeping.unicamp.br) and 15 additional housekeeping genes from previous studies (39, 40). SpaCEX is publicly available at https://github.com/WLatSunLab/SpaCEX.

ACKNOWLEDGMENTS. The project is funded by Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDB38050100) to H.W. X.S. is supported by Excellent Young Scientist Fund of Wuhan City (Grant No. 21129040740). We also thank Jiadi Lv, Daoli Wang, Suoya Han, Siyu Chen and Yuwei Hu for their help in plotting figures and participation in discussions.

1. Du J, et al. (2019) Gene2vec: distributed representation of genes based on co-expression. BMC genomics 20:7–15.
2. Li Y, Keqi W, Wang G (2021) Evaluating disease similarity based on gene network reconstruction and representation. Bioinformatics 37(20):3579–3587.
3. Bazaga A, Leggate D, Weisser H (2020) Genome-wide investigation of gene-cancer associations for the prediction of novel therapeutic targets in oncology. Scientific reports 10(1):10787.
4. Moffitt JR, et al. (2018) Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362(6416):eaau5324.
5. Rao N, Clark S, Habern O (2020) Bridging genomics and tissue pathology: 10x genomics explores new frontiers with the visium spatial gene expression solution. Genetic Engineering & Biotechnology News 40(2):50–51.
6. Kleshchevnikov V, et al. (2022) Cell2location maps fine-grained cell types in spatial transcriptomics. Nature biotechnology 40(5):661–671.
7. Tanevski J, Flores ROR, Gabor A, Schapiro D, Saez-Rodriguez J (2022) Explainable multiview framework for dissecting spatial relationships from highly multiplexed data. Genome biology 23(1):97.
8. Chen S, et al. (2021) Spatially resolved transcriptomics reveals unique gene signatures associated with human temporal cortical architecture and alzheimer’s pathology. BioRxiv pp. 2021–07.
9. Cui H, et al. (2024) scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature Methods pp. 1–11.
10. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
11. Hao M, et al. (2023) Large scale foundation model on single-cell transcriptomics. bioRxiv pp. 2023–05.
12. Yang F, et al. (2022) scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data. Nature Machine Intelligence 4(10):852–866.
13. Theodoris CV, et al. (2023) Transfer learning enables predictions in network biology. Nature 618(7965):616–624.
14. Boiarsky R, Singh NM, Buendia A, Getz G, Sontag D (2023) A deep dive into single-cell rna sequencing foundation models. bioRxiv pp. 2023–10.
15. He K, et al. (2022) Masked autoencoders are scalable vision learners in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009.
16. Maynard KR, et al. (2021) Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature neuroscience 24(3):425–436.
17. Stickels RR, et al. (2021) Highly sensitive spatial transcriptomics at near-cellular resolution with slide-seqv2. Nature biotechnology 39(3):313–319.
18. Song T, et al. (2022) Detecting spatially co-expressed gene clusters with functional coherence by graph-regularized convolutional neural network. Bioinformatics 38(5):1344–1352.
19. Dries R, et al. (2021) Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome biology 22:1–31.
20. Sun S, Zhu J, Zhou X (2020) Statistical analysis of spatial expression patterns for spatially [...]
[...]
23. Fabregat A, et al. (2018) The reactome pathway knowledgebase. Nucleic acids research 46(D1):D649–D655.
24. Parsana P, et al. (2019) Addressing confounding artifacts in reconstruction of gene co-[...]
[...]
[...] methods 18(11):1342–1351.
33. Deng T, et al. (2023) A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell rna-seq analysis. Briefings in Bioinformatics 24(2):bbad042.
34. Wolf FA, Angerer P, Theis FJ (2018) Scanpy: large-scale single-cell gene expression data analysis. Genome biology 19:1–5.
35. Stuhlsatz A, Lippel J, Zielke T (2012) Feature extraction with deep neural networks by a generalized discriminant analysis. IEEE transactions on neural networks and learning systems 23(4):596–608.
36. Pardo B, et al. (2022) spatiallibd: an r/bioconductor package to visualize spatially-resolved transcriptomics data. BMC genomics 23(1):434.
37. Chen S, et al. (2022) Spatially resolved transcriptomics reveals genes associated with the vulnerability of middle temporal gyrus in alzheimer’s disease. Acta Neuropathologica Communications 10(1):188.
38. Hounkpe BW, Chenou F, de Lima F, De Paula EV (2021) Hrt atlas v1. 0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive rna-seq datasets. Nucleic acids research 49(D1):D947–D955.
39. de Jonge HJ, et al. (2007) Evidence based selection of housekeeping genes. PloS one 2(9):e898.
40. Zhang X, Ding L, Sandford AJ (2005) Selection of reference genes for gene expression studies in human neutrophils by real-time pcr. BMC molecular biology 6:1–7.