
Articles

https://doi.org/10.1038/s41551-020-00635-3

A data-driven dimensionality-reduction algorithm for the exploration of patterns in biomedical data

Md Tauhidul Islam and Lei Xing ✉

Dimensionality reduction is widely used in the visualization, compression, exploration and classification of data. Yet a generally applicable solution remains unavailable. Here, we report an accurate and broadly applicable data-driven algorithm for dimensionality reduction. The algorithm, which we named 'feature-augmented embedding machine' (FEM), first learns the structure of the data and the inherent characteristics of the data components (such as central tendency and dispersion), denoises the data, increases the separation of the components, and then projects the data onto a lower number of dimensions. We show that the technique is effective at revealing the underlying dominant trends in datasets of protein expression and single-cell RNA sequencing, computed tomography, electroencephalography and wearable physiological sensors.

Recent technological advances are transforming biomedical science into a digitized, data-intensive discipline. Large-scale and high-dimensional data from biomedical studies, such as single-cell cytometry and transcriptome analysis, sensor experiments, biomarker discovery and drug design, and medical imaging are accumulating at a staggering rate1. Data processing at this scale and dimensionality presents a daunting challenge, especially considering the varying quality of the available biomedical data caused by noise, artifacts, missing information and the batch effect. Accurate and efficient dimensionality reduction is central to reducing data complexity, understanding the local and global structures of the data, generating hypotheses and making optimal data-driven decisions2. In practice, although a large armamentarium of algorithms exists to accomplish dimensionality reduction, most of them rely on specific assumptions about the underlying data structure in low or high dimensions (or in both). For example, principal component analysis (PCA)3 assumes that the data components are orthogonal, and is typically used to find a linear combination of the components to compress high-dimensional data, limiting its adequacy when nonlinear combinations of the components would be more appropriate. Other commonly used methods, such as independent component analysis (ICA)4, t-distributed stochastic neighbourhood embedding (t-SNE)5 and multidimensional scaling (MDS)6, have similar pitfalls and are applicable only to a subset of problems. Furthermore, as they work directly on raw data, these methods are known to be susceptible to noise in the data. In addition, t-SNE and related methods do not preserve the global structure of the data and are therefore limited to data exploration or visualization; methods such as PCA and ICA are used as analysis tools and are not optimized for visualizing data at a low number of dimensions.

Here, to mitigate the limitations of the traditional techniques, we propose a data-driven strategy for dimensionality reduction, which we named the feature-augmented embedding machine (FEM; Fig. 1a). Instead of assuming a data structure at a low or high number of dimensions, we first stratify the high-dimensional data under exploration according to their inherent characteristics (in particular, central tendency and dispersion), and use the information to project the data onto an intermediate number of dimensions using data clustering. The rationale for this step is that data representation can be performed more reliably at higher dimensions, reducing ambiguity and the generation of artifacts. Generally, the data components that are separable at a high number of dimensions should be maximally separated there first; this high-dimensional data processing is valuable for subsequent dimensionality reduction. A salient feature of clustering the data at an intermediate number of dimensions is that, by computing the distance of the data points to their respective cluster centres, the adverse influence of noise in the data is reduced owing to noise subtraction. This step also helps to augment the differentiating features of the data owing to mean-value subtraction in the distance-computation process. The intermediate processing of the data is therefore a key step of FEM, and leads to significantly improved performance with respect to existing techniques. In the final step, we use deep learning to extract the essential features of the clustered data, and project them onto a lower number of dimensions so that they can be more easily visualized.

Seeking distinctive characteristics that are generally applicable to describe any kind of data for raw-data stratification, we turned to central tendency and dispersion as two natural choices7. In brief, a measure of central tendency is a single value that describes the manner in which data cluster around a central value, such as the arithmetic mean, median, mode, geometric mean, trimean and trimmed mean. The dispersion (also called variability, scatter or spread) measures how the data are distributed around the centre value using a distance metric (for example, Euclidean distance, Manhattan distance, Minkowski distance, correlation distance, cosine distance, Kullback–Leibler divergence (KL divergence), Jeffrey divergence or geodesic distance). For example, in Gaussian component mixtures, the data components may have different means, different variances (average square Euclidean distance from the mean) or both. For Rayleigh or Student's t mixtures, the components may have different scale parameters (a function of the average square Euclidean distance from the mean). In general, regardless of the characteristics of the data components or the way they are generated, either a measure of central tendency or of dispersion (or both) should separate them in the original high dimension7.
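The role of these two characteristics is easy to demonstrate on synthetic data. The sketch below (ours, in Python with NumPy; the mixture and all variable names are illustrative, not this paper's code) builds a two-component mixture whose components share a centre but differ in spread, a case in which only a dispersion measure can separate them:

```python
import numpy as np

# Two 50-dimensional Gaussian components with the same central tendency
# (zero mean) but different dispersion (unit versus double variance).
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 50))  # component 1: sigma = 1
b = rng.normal(0.0, 2.0, size=(500, 50))  # component 2: sigma = 2
x = np.vstack([a, b])

# A central-tendency comparison fails: the component means coincide.
print(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))  # close to 0

# A dispersion feature - each point's distance from a common centre -
# separates the components (expected values of roughly 7 versus 14 here).
centre = np.median(x, axis=0)
r = np.linalg.norm(x - centre, axis=1)
print(r[:500].mean(), r[500:].mean())
```

The reverse case (same spread, different centres) is separated by central tendency instead, which is why FEM searches over both families of measures.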

Department of Radiation Oncology, Stanford University, Stanford, CA, USA. ✉e-mail: [email protected]


[Figure 1 appears here: a, schematic of the four FEM steps (subspace selection and projection; selection of central tendency and distance metric; computation of distances from cluster centres; projection onto the reduced dimensions); b–e, scatter plots and histograms of the first two PCA components (PC1, PC2) and FEM components (FEM1, FEM2) for the non-DS/DS and saline/memantine mouse protein-expression datasets.]

Fig. 1 | Workflow of FEM, and discovery of subgroups in protein expression data. a, The operations in each step of FEM. Step 1: the best number of
subspaces and the subspace dimension are selected on the basis of five cluster-quality indices (Methods). The data are projected onto the selected
subspaces. Step 2: the central-tendency measure and distance metric that best describe the subspace-projected data are chosen on the basis of the
values of the cluster-quality indices. Step 3: the distances of the data points from each cluster centre are computed. Step 4: the distance data are projected
onto the reduced dimensions. b–e, Discovery of subgroups in protein-expression data. b, PCA has been used to analyse a protein-expression dataset of
mice with and without DS. The lower-dimensional representations of protein-expression measurements from mice with and without DS from PCA are
distributed similarly, with no clear distinction. c, However, when FEM is used to project the dataset onto its first two components, a lower-dimensional
representation shows differences between mice with and without DS. d,e, Furthermore, PCA (d) and FEM (e) have been used to project the dataset
of treated (with memantine) and untreated (injected with saline) mice onto the first two components. In this case, PCA does not show any difference
between these two types of mice, whereas FEM components clearly separate them.

Computationally, FEM first attempts to increase the separation of the data components on the basis of central tendency. To achieve this, it fits a number of subspaces to the data and projects the data onto these subspaces (Fig. 1a (step 1) and Supplementary Fig. 69). If the data consist of components of different central tendencies, FEM chooses subspaces of lower dimensionality on the basis of cluster-quality indices. Cluster-quality indices measure the quality of clusters through the separation of data points from different clusters and through the aggregation of data points of the same cluster (Supplementary Fig. 69 and Supplementary Information 1 and 11). Normally, the subspace number is larger than the number of data clusters and, as a result, each of the clusters is represented by a number of subspaces. Thus, the projection of the data onto the subspaces ultimately reduces the number of outliers and increases the separation between the centres of the data clusters (Supplementary Information 1). This step also helps FEM to learn the central tendency and the distance metrics in the next step.
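Three of the five cluster-quality indices used by FEM have standard implementations, so scoring a candidate projection can be sketched compactly (a hedged illustration in Python with scikit-learn, not the authors' MATLAB code; homogeneity and separation are omitted, and the helper name is ours):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def cluster_quality(x_projected, n_clusters=5, seed=0):
    # Cluster one candidate projection and report three indices.
    # Higher Silhouette and Calinski-Harabasz values and a lower
    # Davies-Bouldin value indicate better-separated clusters.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(x_projected)
    return (silhouette_score(x_projected, labels),
            davies_bouldin_score(x_projected, labels),
            calinski_harabasz_score(x_projected, labels))
```

In FEM, a smaller candidate subspace dimension is accepted only while a majority of such indices keep improving (see the selection rule in the Methods).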

[Figure 2 appears here: two-dimensional t-SNE, Dhaka and ivis embeddings (axes t-SNE1/2, Dhaka1/2 and Ivis1/2) of the non-DS/DS (a) and saline/memantine (b) protein-expression datasets.]

Fig. 2 | Discovery of subgroups in protein-expression data from mice. a, Visualization of expression data of mice with and without DS by t-SNE, Dhaka
and ivis. b, Protein-expression data from treated and untreated mice, visualized using t-SNE, Dhaka and ivis. For both datasets, t-SNE, Dhaka and ivis did
not differentiate the data classes as efficiently as FEM. t-SNE shows inaccurate and spurious clusters in both cases.

For the next step, to further augment the separation of the data components, FEM learns the type of central-tendency measure and distance metric that best describe the components (Fig. 1a (step 2) and Supplementary Fig. 69). FEM performs this learning in an unsupervised manner by clustering the data with different central-tendency measures and distance metrics. At the end of the calculation, the central-tendency measure and distance metric that result in the best cluster-quality indices are chosen. For deep-learning-based data projection, we compute the distances of each data point to the centres of the clusters (Fig. 1a (step 3), Supplementary Fig. 69 and Supplementary Information 1) and then feed the distance matrix into the projection model (Fig. 1a (step 4) and Supplementary Fig. 69). The use of a deep autoencoder for data projection produces better data-dimensionality reduction compared with other techniques, such as PCA and ICA8. This approach also provides opportunities for dimensionality reduction with different goals by changing the structure and loss functions (a detailed discussion is provided in Supplementary Information 2).

As FEM computes the learned distance from the cluster centres in the second step, the remaining steps work on the distance data rather than on the raw data. This process automatically denoises the data, making FEM inherently more robust to noise and outliers (Supplementary Information 18 and 19). The unique denoising property is particularly valuable for dimensionality reduction in many important biological data-analysis problems (for example, single-cell RNA-sequencing (scRNA-seq) data with dropout noise) in which noise has long been a main concern.

To summarize, FEM reduces data dimensionality through the effective incorporation of high-dimensional-data characteristics, and contributes to data science in the following two aspects: FEM learns the data structure on the basis of the inherent properties of the data components, and provides a mechanism to leverage that information; and it offers a unique way to increase the separation of the data components at a high number of dimensions with a suppressed noise level, leading to substantially improved performance in dimensionality reduction. Note that there are methods available in the literature9–12 for dimensionality reduction that are based on unsupervised or supervised learning of distance metrics. However, these approaches are trained in an end-to-end manner without explicitly considering any inherent data structure. FEM is also unique in its distance-learning method. To date, most distance-learning methods learn only Euclidean-type distances (mostly Mahalanobis). FEM goes beyond this, as it can learn any type of distance metric, such as Euclidean, probabilistic, geodesic and correlation. As FEM learns the inherent data characteristics without any assumption on the data components, and incorporates the information into data-analysis methods, it is generally applicable.
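Steps 2 and 3 (metric selection by cluster quality, followed by re-expressing every point as its vector of distances to the cluster centres) can be sketched as follows (ours, in Python with SciPy/scikit-learn stand-ins for the MATLAB implementation; the candidate-metric list is truncated and the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def distance_features(x, n_clusters=10,
                      metrics=('euclidean', 'cityblock', 'correlation')):
    # Step 2 (simplified): keep the candidate metric whose distance
    # representation gives the best cluster-quality (Silhouette) index.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(x)
    centres = km.cluster_centers_
    best = max(metrics,
               key=lambda m: silhouette_score(cdist(x, centres, metric=m),
                                              km.labels_))
    # Step 3: the N x Q distance matrix that is fed to the autoencoder.
    return cdist(x, centres, metric=best), best
```

Feeding this distance matrix, rather than the raw data, into the projection model is what yields the noise-subtraction behaviour described above.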
Results

FEM discovers subgroups from protein-expression data. First, we used a dataset consisting of protein-expression measurements from mice that had received shock therapy13–15. Down syndrome (DS) developed in some of these mice. We applied both PCA and FEM to the data, and tested whether the first two components of these methods are able to detect any significant difference within the mouse population on the basis of the presence or absence of DS. In Fig. 1b, we show the PCA result. The first two principal components cannot separate the classes at all within the population of mice. The results of FEM on the same dataset are shown in Fig. 1c; it can be observed that the two data components are better separable within the mouse population.

We devised another problem by taking the protein-expression measurements corresponding to treated and untreated mice13–15, and applied PCA and FEM to determine whether the methods can separate these two classes. The treated mice were given memantine, and the untreated mice were injected with saline. The results of the PCA are shown in Fig. 1d, in which we see that PCA cannot show any significant differences between these two types of mice, whereas FEM shows a clear distinction between these groups (Fig. 1e).

In Fig. 2, we show the results of three more methods, namely t-SNE, Dhaka16 and ivis17, for the above two datasets. Figure 2 shows that t-SNE, Dhaka and ivis did not keep the separation of the data classes as efficiently as FEM while projecting the data onto two dimensions. Moreover, t-SNE creates spurious and inaccurate clusters in the low-dimensional representation of the data.

FEM discovers distinct patterns from scRNA-seq data. Next, we used a high-dimensional dataset consisting of single-cell RNA-expression levels of a mixture of bone marrow mononuclear cells (BMMCs) obtained from a patient with leukaemia before and after a stem-cell transplant18.


[Figure 3 appears here: scatter plots and histograms of the first three PCA components (PC1–PC3; a,c) and FEM components (FEM1–FEM3; b,d) for the two-class dataset (patient 1 pre-/post-transplant; a,b) and the four-class dataset (HEK293T, patient 1 post-transplant, CD19+ B cells and healthy 1; c,d).]

Fig. 3 | Visualization of a high-dimensional scRNA-seq dataset in three dimensions using PCA and FEM. a,b, Two classes were analysed using PCA (a) and FEM (b): a pre-transplant sample from patient 1 and a post-transplant sample from patient 1. PCA did not differentiate between these two classes, whereas FEM shows the two classes projected onto different angles. c,d, A dataset consisting of four cell samples (HEK293T cells, a post-transplant sample from patient 1, CD19+ B cells and a sample from a healthy human) was analysed using PCA (c) and FEM (d). FEM performed better than PCA at projecting the data into separate clusters.

All RNA-seq data have been pre-processed, and the dataset was reduced to 500 genes on the basis of higher dispersion (variance divided by the mean) in the data19. We performed both PCA and FEM on the reduced dataset, and show the results in Fig. 3. It is seen from the PCA result (Fig. 3a) that both components follow similar types of distribution in the space covered by the components, and that there is no clustering. However, for FEM, the data components have well-differentiated distributions in the space spanned by the first three components, and there is clustering (Fig. 3b).

From the same dataset, we formulated a problem with the following four classes: HEK293T cells, patient 1 after transplant, CD19+ B cells and a healthy human. The first three principal components are shown in Fig. 3c, in which we see that PCA has the ability to show a distinction between the HEK293T cells and the other three classes. However, the patient, CD19+ B-cell and healthy classes have the same kind of distribution and cannot be separated. In comparison with PCA, for FEM (Fig. 3d), we observed that the four classes had better-differentiated distributions that were better separated from each other in the three dimensions.

For the above two datasets, we present the visualization results of t-SNE, Dhaka and ivis, along with quantitative evaluations of all five methods (Fig. 4). In the visualization of the two-class dataset (Fig. 4a), t-SNE creates four clusters of data, with a number of spurious red data points (representing the pre-treatment sample) spread around the cyan points (representing the post-treatment sample). This phenomenon can also be observed in the case of the four-class dataset (Fig. 4b). For both datasets, ivis creates a different distribution of data classes; however, it does not create distinct clusters for the data classes as efficiently as FEM. Dhaka does not show any distribution differences between data classes for either dataset. The inefficiency of the three methods in reducing the dimensionality of the data can also be seen from the quantitative values of accuracy and normalized mutual information (NMI), as shown in Fig. 4a,b. For both datasets, the highest accuracy and NMI were obtained using FEM.
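The accuracy and NMI values reported here can be reproduced in spirit with the following sketch (ours, in Python with scikit-learn rather than the MATLAB used in the paper); clustering accuracy is computed via an optimal one-to-one matching of clusters to classes, and repeating the call over many seeds gives the mean and s.d. shown in the bar graphs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def accuracy_and_nmi(embedding, labels, seed=0):
    # `labels` must be integer-coded as 0, ..., k - 1.
    k = len(np.unique(labels))
    pred = KMeans(n_clusters=k, n_init=10,
                  random_state=seed).fit_predict(embedding)
    # Confusion matrix between predicted clusters and true classes,
    # then the best cluster-to-class assignment (Hungarian method).
    cm = np.zeros((k, k), dtype=int)
    for p, t in zip(pred, labels):
        cm[p, t] += 1
    rows, cols = linear_sum_assignment(-cm)
    accuracy = cm[rows, cols].sum() / len(labels)
    return accuracy, normalized_mutual_info_score(labels, pred)
```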

[Figure 4 appears here: t-SNE, Dhaka and ivis embeddings of the two-class (a) and four-class (b) scRNA-seq datasets, with bar graphs of accuracy and NMI for PCA, Dhaka, t-SNE, ivis and FEM.]

Fig. 4 | Visualization of a high-dimensional scRNA-seq dataset using t-SNE, Dhaka and ivis. a, The first dataset consists of a pre-transplant sample
from patient 1 and a post-transplant sample from patient 1 (P1). Dimension-reduced data from t-SNE, Dhaka and ivis are shown. b, The second dataset
consists of four cell samples from HEK293T cells, a post-transplant sample from patient 1, a sample from CD19+ B cells and a sample from a healthy
human. Dimension-reduced data from t-SNE, Dhaka and ivis are shown. All three methods did not maintain the efficient separation of the data classes
in a lower-dimensional representation. Although t-SNE and ivis can separate out the data classes partially, their representation is not as good as the
representation of FEM. The quantitative evaluation (accuracy and NMI) of the five methods (PCA, FEM, t-SNE, Dhaka and ivis) for the first and second
dataset is shown in the bar graphs of a and b, respectively. The error bars represent the s.d. of the indices from the mean values for 100 different
initializations of k-means clustering.

Ivis performed better than the other three methods for both datasets. t-SNE performed worse than PCA. Dhaka performed the worst for both datasets.

Supplementary Figures 19–33 provide results on the same RNA-seq dataset, but for the differentiation of different classes using ten prominent dimensionality-reduction methods (PCA, kernel PCA (KPCA)20, ICA4, probabilistic PCA21, autoencoder22, t-SNE, MDS, non-negative matrix factorization23, locally linear embedding (LLE)24 and FEM).

FEM preserves high-dimensional patterns of medical imaging data. Next, we used a dataset retrieved from a set of 53,500 computed tomography (CT) images from 74 different patients (43 male, 31 female)15,25. Each CT image is described by two histograms in polar space. The first histogram contains information on the location of bone structures in the image, and the second histogram contains information about the location of air inclusions inside the body. The final feature vector for analysis was formed by concatenating both histograms. The class labels (the relative location of an image on the axial axis) were created by manually annotating up to ten distinct landmarks in each CT volume with known location. The locations of CT images in between the landmark positions were obtained by interpolation. Class label values range from 0 to 180, where 0 denotes the top of the head and 180 indicates the soles of the feet.

The dimensionality-reduced data for 39 classes (axial positions 52–90) of the CT dataset are shown in Fig. 5a for both PCA and FEM. There are a number of interesting observations. First of all, the change in colour, from red to magenta through yellow, green, cyan, blue and red-violet, corresponds to the decrement of axial height in the actual CT examination (52–90). FEM preserves the relationship of height with the data better than PCA, as, in PCA, we see that green, cyan, blue and red-violet almost overlap with each other. The heights that they represent therefore cannot be separated as well as with FEM. Second, as the difference between two classes represents the same distance in height, they should have similar distances after dimensionality reduction. This was better preserved in the FEM results compared with the results from PCA, in which the span of red is much larger than that of the other colours. Third, the data represent two histograms in polar space. We see that the change in height is actually represented in the FEM results as a circular path (from red to magenta). In PCA, this is not obvious at all. Thus, FEM preserves the actual data characteristics better than PCA. Fourth, FEM provides better spatial differences between classes than PCA. This can be clearly seen when we compare the distances between the green and red-violet points and between the yellow and blue points. These four observations become more evident when analysing the heat maps of normalized mean geodesic distance between data classes in the first two components of PCA and FEM (Fig. 5b). In the heat map of the FEM components (Fig. 5b, bottom), there is nearly a linear relationship between the data points of FEM components from different classes in terms of geodesic distance. As the distance between data points from different classes in physical space increases, the geodesic distance between them increases in FEM-embedding space at a linear rate, and vice versa. Such a linear relationship is not as evident for PCA (Fig. 5b, top).

The other three methods, t-SNE, Dhaka and ivis, did not maintain the relationship of height with the data (Fig. 5c). t-SNE created hundreds of inaccurate clusters, in which no relationship of the height with the data could be found (Fig. 5c, left). Dhaka created a low-dimensional representation that is totally inconsistent with the actual height information. Ivis created large space differences between the red and yellow points, yet it projected the green and blue points close to each other. The failure of these methods to maintain the data structure can be better seen in the heat maps (Fig. 5c). No linear relationship between class distance and geodesic distance can be observed from these results, in contrast to what was found for FEM. It was not possible to compute the geodesic distance, and as such the heat map, from the t-SNE representation, owing to the complete separation of the data classes26.

The preservation of data characteristics and the maximization of space between classes by FEM become clearer in Fig. 5d. These figures show the dimensionality-reduced data with 5 classes, for axial locations 20–60. In the FEM results, if we start from the red points (20) and then go through the green points (30), the blue points (40), the magenta points (50) and the black points (60), we see that this creates a radial path from higher height to lower height. For PCA, t-SNE, Dhaka and ivis, no such radial nature of the data can be seen. For FEM, the data points are also much better clustered in the five classes. This is absent in the results from the other methods.
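The heat maps of normalized mean geodesic distance discussed above can be approximated with a short sketch (ours, in Python; shortest paths on a K = 15 nearest-neighbour graph, the neighbourhood size stated in the Methods). When an embedding splits into fully disconnected clusters, as reported for t-SNE, some path lengths become infinite and the heat map cannot be computed:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def class_geodesic_heatmap(embedding, labels, k=15):
    # Geodesic distance = shortest-path length along the k-NN graph.
    graph = kneighbors_graph(embedding, n_neighbors=k, mode='distance')
    d = shortest_path(graph, method='D', directed=False)  # Dijkstra
    classes = np.unique(labels)
    heat = np.zeros((len(classes), len(classes)))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            heat[i, j] = d[np.ix_(labels == a, labels == b)].mean()
    return heat / heat.max()  # normalized mean geodesic distance
```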


[Figure 5 appears here: a, PCA (PC1/PC2) and FEM (FEM1/FEM2) projections of the 39-class CT dataset; b, heat maps of normalized mean geodesic distance between the 39 classes (class numbers 52–87) for PCA and FEM; c, t-SNE, Dhaka and ivis projections and corresponding heat maps; d, PCA, t-SNE, Dhaka, ivis and FEM projections of the 5-class dataset.]

Fig. 5 | Visualization of high-dimensional CT-scan localization data in two dimensions. a–d, Analysis of two datasets; there are 39 classes in the dataset analysed in a–c, and 5 classes in the dataset analysed in d. Each class corresponds to a distinct axial location in the CT scan. In a–c, axial locations 52–56, 57–60, 61–69, 70–73, 74–80, 81–85 and 86–90 correspond to red, yellow, green, cyan, blue, red-violet and magenta with different shades, respectively. The heat maps of normalized mean geodesic distance among all 39 classes from the first two components of PCA and FEM are shown in b. In d, red, green, blue, magenta and black points correspond to the axial locations 20, 30, 40, 50 and 60, respectively. Data were projected onto the first two principal components (a (top) and d (left)) and FEM components (a (bottom) and d (right)). The dimension-reduced data and heat maps for t-SNE, Dhaka and ivis for the first dataset are shown in c. The dimension-reduced data from t-SNE, Dhaka and ivis for the second dataset are shown in d. FEM components better preserve the radial characteristics of the data, and FEM also better distinguishes between data points of different colours, as is evident from the scatter plots and the heat maps. In the FEM results in d (right), the data are better clustered into the five classes than in the PCA results (d, left). t-SNE, Dhaka and ivis did not retain any relationship between axial locations and data classes in either case.

FEM better classifies data from a wide range of datasets. We chose three more datasets to show the better dimensionality reduction achieved by FEM in terms of classification. These datasets were acquired for emotion classification and human-activity classification, from wearable sensors and smartphones.

The dataset for emotion classification consists of electroencephalogram (EEG) brainwave data processed using a previously described statistical extraction method27. The EEG brainwave data were collected from two human participants (1 male, 1 female) for 3 min per state: positive, neutral and negative.

[Figure 6 appears here: a–c, PCA, t-SNE, Dhaka, ivis and FEM projections for emotion classification (negative/neutral/positive; a), running versus cycling (b) and walking versus other activities (c); d,e, bar graphs of accuracy (d) and NMI (e) for the PredEmotion, MobileHealth and HumanAct datasets.]

Fig. 6 | Classification of biomedical data. a, Emotion classification from EEG data. b,c, Human-activity classification (running and cycling (b); and
walking and other activities (c)) from data from wearable sensors and smartphones, respectively. PCA, t-SNE, Dhaka, ivis and FEM results are shown.
d, Classification results from training a multiclass error-correcting output codes model using support vector machine binary learners. PredEmotion,
emotion classification; mobileHealth, human-activity classification from smartphone data; humanAct, human-activity classification from wearable
sensors data. e, NMI values for different methods. FEM shows better separation of the data distribution for all three datasets. For emotion classification
and human-activity classification from wearable sensors data, PCA, t-SNE, Dhaka and ivis do not show any separation between data clusters. For human
activity classification from smartphone data, PCA and ivis perform better than t-SNE and Dhaka. However, even in this case, the ivis results are spread
and there is less consistency among data points in the clusters. In all three cases, t-SNE produced results with inaccurate and sporadic clusters. The better
dimensionality reduction that was achieved by FEM is reflected in the quantitative evaluation, in which FEM achieved maximum accuracy and NMI index
values in all cases. For d and e, the error bars represent the s.d. of the indices from the mean values for 100 different initializations of k-means clustering.

The dimension-reduced data using PCA, t-SNE, Dhaka, ivis and FEM are shown in Fig. 6a. PCA, t-SNE and Dhaka did not show any kind of distribution difference between classes. Ivis performed better than PCA, t-SNE and Dhaka, but could not separate the data classes as efficiently as FEM. The results of the other seven competing methods are provided in Supplementary Information 9.

Similar scenarios can also be seen for the dataset for human-activity classification from wearable sensors. This dataset comprises body-motion and vital-sign recordings for ten volunteers of diverse profiles while performing several physical activities: standing still, sitting and relaxing, lying down, walking, climbing stairs, waist bends forward, frontal elevation of arms, knees bending (crouching), cycling, jogging, running, and jumping forwards and backwards15,28. The dimensionality-reduction problem is set by choosing data points that correspond to running and cycling from this dataset. PCA, t-SNE, Dhaka, ivis and FEM were used to reduce the data dimensionality. The first two dimensions from these analyses are shown in Fig. 6b. We see that the data points that correspond to the classes running and cycling cannot be separated in the results from PCA, t-SNE, Dhaka and ivis, as the data classes are mixed with each other. For FEM, the data from the two classes are almost linearly separable, as the red points (corresponding to the running class) are clustered on the right of the figure and the cyan points are clustered on the left (Fig. 6b, right).

Another dataset for human-activity classification was built from recordings of 30 participants while they performed daily activities (for example, walking and other activities, such as sitting, standing and lying down). A waist-mounted smartphone (Samsung Galaxy S II) with embedded inertial sensors was used to capture three-axis linear acceleration and three-axis angular velocity at a constant rate of 50 Hz29.


The dimension-reduced data from all of the methods are shown in Fig. 6c. From these figures, it is seen that all five methods are able to separate the data classes, and that the data-cluster quality is better for PCA and FEM. As with the other datasets, t-SNE creates inaccurate data clusters.

The efficacy of FEM in reducing the dimensions of the data can be more easily appreciated from the quantitative evaluation of all five methods for the above three datasets, as shown in Fig. 6d. Both the classification accuracy and the NMI indices of FEM are maximal for all three datasets. For the smartphone-acquired human-activity-classification dataset, all four other methods show results comparable to those of FEM. However, for the first two datasets, FEM provides much better accuracy and NMI.

We performed additional analyses on synthetic data and on datasets from various biomedical fields, and provide the results in the Supplementary Information. The results with synthetic data are presented in Supplementary Information 1. The results from 12 methods, including FEM, for different applications include the following: biomedical feature-based orthopaedic patient classification (Supplementary Information 5), gender classification on the basis of voice data (Supplementary Information 6), heart-disease detection (Supplementary Information 7), chronic-kidney-disease detection (Supplementary Information 8), breast-cancer detection (Supplementary Information 10), splice-junction detection (Supplementary Information 12), cardiotocography classification (Supplementary Information 13), drug-consumption analysis (Supplementary Information 14), lung-cancer detection (Supplementary Information 15) and feature visualization for the simple framework for contrastive learning of visual representations30 (Supplementary Information 20). The results show that, in a few cases, the performances of t-SNE, KPCA, MDS and other methods are comparable to that of FEM. However, no method except for FEM performed as well in all cases.

Discussion

We have described a general data-dimensionality-reduction method, FEM, which, in contrast to widely used algorithms, is driven by data through self-learning. It first learns the data structure and component characteristics, and then increases the separation of the components at a high number of dimensions. Owing to these features, the data components remain well separated even when the dimensionality of the data is reduced. We have provided results of the application of FEM to problems from diverse fields of biomedical research to show the applicability, robustness and accuracy of FEM. In particular, FEM outperforms widely used methods in most cases.

Specifically, we showed that protein-expression data from mice with DS display a different distribution in a low-dimensional representation compared with the data from healthy mice. Similarly, data from the treated mice show a different distribution compared with data from the untreated mice. This analysis can therefore help to detect mice with DS, or the effect of treatment on the mice. Moreover, the fact that some of the data points are closer to each other in a low-dimensional representation may indicate the severity of DS, or a high versus low impact of the treatment. Similarly, the analysis of gene-sequencing data from a patient with leukaemia before and after haematopoietic stem cell transplantation (HSCT) with different types of cell might help with finding relationships between the sequencing data of different cells, as well as the effects of HSCT treatment on different patients at the cellular level. The analysis of CT images may help to find the location of CT slices from only the images, as well as to detect abnormalities in the images.

FEM involves three main operations: (1) K-subspace clustering31, (2) k-centres clustering32 and (3) autoencoding33. In each of the three steps, there are random initializations of parameters for starting the computation. In K-subspace clustering, the orthonormal subspaces are chosen randomly; in k-centres clustering, the centres of the data clusters are chosen randomly; and in autoencoding, the initial neuron weights are chosen randomly. As described previously31–33, random initializations of all three methods give stable results for well-behaved datasets, and the optimization procedures of all three methods are numerically stable. FEM is therefore guaranteed to be numerically stable and is expected to provide sensible results for all initializations. We provide an analysis of the stability of FEM in Supplementary Information 27, in which we analysed with FEM, for 1,000 different initializations, a simulated dataset with two Gaussian data classes of the same mean (0) but different variances (1 and 2). FEM separated the data classes with high accuracy for all 1,000 initializations (the computed accuracy was 99.50%, with an s.d. of 0.6%). We note that, to our knowledge, no other dimensionality-reduction method has the ability to separate these data classes. For this dataset, Supplementary Fig. 72 shows the FEM, PCA and t-SNE visualizations for one initialization.
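A stability check in the spirit of that analysis can be emulated with a few lines (ours, in Python; this mirrors the idea of the Supplementary Information 27 experiment rather than reproducing the authors' code). It re-runs the randomly initialized k-centres step many times and measures how much the labellings agree:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def kmeans_stability(x, n_clusters, n_runs=100):
    # One randomly initialized k-means labelling per seed.
    runs = [KMeans(n_clusters=n_clusters, n_init=1, init='random',
                   random_state=s).fit_predict(x) for s in range(n_runs)]
    # Agreement of every run with the first run, measured by NMI.
    agreement = [normalized_mutual_info_score(runs[0], r) for r in runs[1:]]
    return float(np.mean(agreement)), float(np.std(agreement))
```

For a well-behaved dataset, the mean agreement stays near 1 with a small spread, which is the behaviour that the stability claim for FEM rests on.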
As there is no inherent assumption in FEM, adding constraints on its behaviour may result in application-specific data-dimensionality reduction. As examples, we show in Supplementary Information 2, through reasoning and a number of simulation results, that different dimensionality-reduction techniques, such as PCA, KPCA, ICA, MDS and Isomap, can be built by setting constraints on different steps of FEM. As such, these methods may be considered to be subsets of FEM. The traditional distance-learning methods can also be considered to be subsets of FEM. These methods learn the distance metric only, overlooking the central-tendency measure, whereas FEM learns both. In Supplementary Information 3, we show that the inclusion of a central-tendency measure is imperative and that the performance of dimensionality-reduction methods can be degraded if the central tendency is not selected appropriately.

In our implementation of FEM, we considered only the medoid and centroid as the central tendency, and a limited number of distance metrics. However, other central-tendency measures, such as the geometric mean, trimean and trimmed mean, can also be used on the basis of the applications in a particular field. In the shared code, we provide an implementation of FEM in which any central tendency and distance metric can be included.

A limitation of FEM is that it is computationally more expensive than many other available techniques, as it tries to find the best description of the data structure by computing cluster-quality indices for different central-tendency and dispersion measures. However, it should be noted that this step may be skipped in applications in which this information is known a priori, or when it can be learnt from a representative dataset. Another limitation is the requirement for a sufficient amount of data, so that the actual data characteristics can be learnt. This may prevent the application of FEM in settings in which only small amounts of data are available for analysis. Finally, because it is possible to include only a finite number of central-tendency and distance metrics in FEM, for particular datasets it is possible to miss the actual central tendency or distance metric. Moreover, in many cases, a combination of distance metrics may be appropriate. In such cases, FEM finds the closest distance metric that fits the data and reduces the dimensions on the basis of it.

The computational complexity of subspace clustering in FEM is O(N³) (ref. 34) for N data points. The complexity of k-centres clustering in FEM is O(N × (E × F × P + Q) × D × i) (ref. 35), where P is the desired reduced data dimension, Q is the dimension of the distance matrix (Supplementary Fig. 69), E and F are the numbers of central-tendency and distance metrics considered, D is the data dimensionality and i is the number of iterations necessary for k-centres clustering to reach the solution. Finally, the complexity of the autoencoder is O(e × N × (D × P)), where e denotes the number of epochs for training (Supplementary Information 24). Therefore, the overall complexity of FEM is O(N³ + N × (E × F × P + Q) × D × i + e × N × (D × P)).

By contrast, the complexity of PCA is O(D²N + D³) (Supplementary Information 24). The complexity of t-SNE is O(N²D) (ref. 17). The computational complexity of deep networks is O(Σ_{i=1}^{l} m_i × m_{i−1} × N × e) (ref. 36; ivis), where the network has l hidden layers and m_i nodes in the ith layer, and the training is performed on all N data points (Supplementary Information 24). The computational complexity of the four-layer variational autoencoder Dhaka is O(e × N × (D × m₂ + m₂ × m₃ + m₃ × m₄ + m₄ × P)), where m₂ = 1,024 denotes the number of nodes of the second layer, m₃ = 512 the number of nodes of the third layer and m₄ = 256 the number of nodes of the fourth layer. We have added the computational time required for analysing different numbers of data points of different dimensionality, for simulated data, by FEM and four other techniques in Supplementary Fig. 70. From this figure, we see that, as the number of points increases, the computational time of FEM increases nonlinearly (O(N³)). However, with increments of the data dimensionality, the computational time of FEM increases linearly (O(D)). The main computational burden in FEM comes from learning the central-tendency and distance measures. To reduce the computational burden of FEM, we learnt the central tendency and distance metrics from the first 5,000 points when a dataset in our examples contained more than 5,000 points. Once FEM has learnt the data structure, the rest of the steps of the algorithm are very fast. FEM can be used to analyse very-large-scale data in a limited time (discussed in detail in Supplementary Information 26). Moreover, in Supplementary Information 21, we show that, when there are more than a few hundred data points, the selected central tendency and distance metrics generally do not change.

Methods

Theory. Subspace projection. Let us consider modelling a number of data points {x_j ∈ R^D}, j = 1, …, N, with a union of n ≥ 1 linear subspaces {S_i}, i = 1, …, n, of dimensions d_i = dim(S_i), 0 < d_i ≤ D. The equation of the subspaces can be written as31

S_i = {x ∈ R^D : x = U_i y},  i = 1, …, n    (1)

Here, U_i ∈ R^{D×d_i}, with d_i ≤ D, is a basis for S_i, and y ∈ R^{d_i} is a d_i-dimensional representation of the data points x. We need to determine the subspace bases U_i, i = 1, …, n, and the clustering of the points on the basis of the subspaces.

We used an iterative method of K-subspace clustering, as described by Vidal et al.31, to perform the subspace clustering and then project the data points onto the subspaces (Supplementary Fig. 69, step 1). In this algorithm, we first choose n orthonormal subspaces of dimension d_i randomly, that is, S = {S_1, …, S_n}. We then (1) compute the inner product between each data point (of size 1 × D) and each dimension of a subspace (of size D × 1), resulting in a scalar value37 (this operation between N data points and n subspaces of dimension d_i results in a matrix of size N × d_i × n); (2) compute the norm of the inner products of the data and subspaces, which has size N × n; (3) find, for each data point, the maximum of the norm of the inner products among the subspaces and remember the associated subspace (as such, each data point becomes a member of a subspace); (4) for each subspace, take the member points and perform an eigendecomposition to determine the eigenvectors that correspond to the d_i largest eigenvalues (these d_i eigenvectors form the basis of the subspace); and (5) repeat steps 1–4 until the subspaces in two consecutive iterations have a very small difference (ϵ = 0.001). The subspace difference between consecutive iterations is computed as the maximum value of the absolute difference between the subspaces of the jth and (j − 1)th iterations (pseudocode in Supplementary Information 23), where j is the iteration index. A detailed convergence analysis of K-subspace clustering, and its similarity with k-means clustering, is reported in Vidal et al.31 and Wang et al.38. In the end, we take the projection of the data points onto their parent subspaces (the subspace denoting the cluster that a data point belongs to) to obtain x_s (Supplementary Fig. 69, step 1). x_s = {x_s1, x_s2, …, x_sn}, where x_si, i = 1, …, n, is computed as x_si = x_i S_i S_i^T (page 300 of ref. 39), where x_i denotes the data points that are members of the ith subspace S_i. Here, S_i has a size of D × d_i and x_i has a size of N_i × D, where N_i is the number of data points belonging to the ith subspace cluster. As such, subspace projection in FEM does not change the dimensionality of the data.
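The iterative procedure just described translates almost line for line into code. The following is a hedged sketch (ours, in Python with NumPy; the paper's implementation is in MATLAB, and details such as the handling of underpopulated subspaces are our own choices):

```python
import numpy as np

def k_subspace_clustering(x, n_subspaces, dim, n_iter=100, tol=1e-3, seed=0):
    # Steps 1-5 above: assign each point to the subspace with the largest
    # projection norm, then refit each basis from its member points.
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random orthonormal bases, one D x d_i matrix per subspace.
    bases = [np.linalg.qr(rng.normal(size=(d, dim)))[0]
             for _ in range(n_subspaces)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Norms of the inner products of every point with every basis (N x n).
        norms = np.stack([np.linalg.norm(x @ u, axis=1) for u in bases],
                         axis=1)
        labels = norms.argmax(axis=1)
        new_bases = []
        for i, u in enumerate(bases):
            members = x[labels == i]
            if len(members) < dim:      # keep the old basis if underpopulated
                new_bases.append(u)
                continue
            # Leading right-singular vectors of the member points give the
            # eigenvectors of the d_i largest eigenvalues (step 4).
            _, _, vt = np.linalg.svd(members, full_matrices=False)
            new_bases.append(vt[:dim].T)
        shift = max(np.abs(u - v).max() for u, v in zip(bases, new_bases))
        bases = new_bases
        if shift < tol:                 # epsilon = 0.001 in the text
            break
    # Project each point onto its parent subspace: x_i S_i S_i^T.
    xs = np.empty_like(x)
    for i, u in enumerate(bases):
        xs[labels == i] = x[labels == i] @ u @ u.T
    return xs, labels, bases
```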
We note that the norm of the inner products of data and subspaces in step 2 of our subspace-clustering algorithm computes √(x′₁² + x′₂² + ⋯ + x′_p²), where x′_p is the inner product of x with the pth dimension of each subspace. As an example, when the subspace dimension is 2, this becomes √(x′₁² + x′₂²), where x′₁ and x′₂ (each of size N × 1) are the inner products of x with the first and second dimensions of the subspace.

At each iteration of K-subspace clustering, each data point is labelled as a member of a subspace denoting a cluster; a number of data points are therefore assigned to each subspace. These data points are denoted as 'member points' of the subspace. The 'parent subspace' is the label of a data point, that is, the subspace (or cluster) that the data point belongs to.

Selection of subspace dimension and number. For a subspace dimension d_w, we first compute the following five cluster-quality indices I(d_w, i): Silhouette (i = 1), Davies–Bouldin (i = 2), Calinski–Harabasz (i = 3), homogeneity (i = 4) and separation (i = 5). The calculation of these indices is described in detail in Supplementary Information 11. We then compute I(d_w − n_d, i) for a subspace dimension of d_w − n_d, where n_d is the difference between the subspace dimensions in two consecutive iterations (Supplementary Fig. 69, step 1). Next, the logical decision vector l_v, defined as

l_v(i) = 1 if I(d_w − n_d, i) > I(d_w, i), and l_v(i) = 0 otherwise, for i = 1, 3, 4    (2)

l_v(i) = 1 if I(d_w − n_d, i) < I(d_w, i), and l_v(i) = 0 otherwise, for i = 2, 5    (3)

is evaluated. If the sum of the components of l_v is less than 3, the loop is broken and the subspace dimension of the previous iteration is taken as the optimum dimension, that is, d_i = d_w. Otherwise, the dimension is further reduced by n_d and I(d_w − 2n_d, i) is computed for the new subspace dimension. l_v is computed again by comparing I(d_w − 2n_d, i) and I(d_w − n_d, i), and the decision on further dimension reduction is taken.

The subspace number n was chosen as the quotient of the division of the original data dimensionality D by the subspace dimension d_i, that is, n = D/d_i (Supplementary Fig. 69, step 1), on the basis of a study shown in Supplementary Figs. 51, 52 and 53. In case D/d_i is a fraction, n is chosen as the closest integer (greater or equal). From Supplementary Fig. 53, we see that the subspace number determined in this manner provides the best component separation without distorting the data. By contrast, if the subspace number is kept fixed during subspace clustering, it may result in distortion of the data, as shown in Supplementary Figs. 51 and 52.

Selection of central tendency and distance metric. Let us assume that the subspace-projected data x_s can be described using P centres (c_s) and a distance metric (r_s) so that they can be clustered by a k-centres clustering technique (for example, k-means++32 or partitioning around medoids (PAM)40) (Supplementary Fig. 69, step 2). We can select c_s and r_s using

c_s, r_s = mode(I(i, c, r))    (4)

where 'mode' denotes the most frequent component of I. Here,

I(i, c, r) = argmax_{c ∈ U, r ∈ V} (G_{m,k,i}),  i = 1, 3, 4    (5)

I(i, c, r) = argmin_{c ∈ U, r ∈ V} (G_{m,k,i}),  i = 2, 5    (6)

where U and V are the vectors containing the central-tendency and distance measures. For a specific central-tendency measure (m = 1, …, E, where E is the length of U) and distance-metric type (k = 1, …, F, where F is the length of V), the G_{m,k,i} are the cluster-quality indices, that is, Silhouette (i = 1), Davies–Bouldin (i = 2), Calinski–Harabasz (i = 3), homogeneity (i = 4) and separation (i = 5). If multiple c_s and r_s are obtained from the mode operation, the first set of c_s and r_s is taken as the optimum central-tendency and distance measures.

Distance computation. x_s is clustered into Q clusters with the k-centres clustering technique using the selected central-tendency and distance metrics (Supplementary Fig. 69, step 3). x_d ∈ R^Q refers to a vector whose elements are the distances of x_s from the Q cluster centres. x_d is evaluated by computing the distance (defined by r_s) of x_s from the Q cluster centres (defined by c_s; Supplementary Fig. 69, step 3).

Dimensionality reduction. For dimensionality reduction, we use a deep-learning method named the autoencoder33 (Supplementary Fig. 69, step 4). An autoencoder consists of two components, an encoder and a decoder. When the autoencoder has its simplest form with a single hidden layer, the encoder stage takes the input x_d ∈ R^Q and maps it to a latent space h ∈ R^P, where D ≥ Q > P; h is defined as

h = σ(W x_d + b)    (7)

Here, σ is an element-wise activation function that can be a sigmoid or linear function, W is a weight matrix and b is a bias vector. Weights and biases are usually initialized randomly, and then updated iteratively during training using the backpropagation technique.
After that, the decoder stage of the autoencoder maps the latent variable h to the reconstruction x′_d. This process can be written as

x′_d = σ′(W′ h + b′)    (8)

While training the autoencoder, reconstruction errors such as the squared error are minimized. The squared error (also called the loss) can be expressed as

L(x_d, x′_d) = ||x_d − x′_d||²    (9)

After we replace x′_d in equation (9) using equation (8), we obtain

L(x_d, x′_d) = ||x_d − σ′(W′ h + b′)||²    (10)

If we replace h in equation (10) using equation (7), we obtain the following expression for the loss function:

L(x_d, x′_d) = ||x_d − σ′(W′ (σ(W x_d + b)) + b′)||²    (11)

The resulting latent space h is the desired dimensionality-reduced data. To improve the performance of the autoencoder, two additional terms (an L2-regularization term and a sparsity-regularization term) are generally added to the loss function as follows:

L(x_d, x′_d) = (1/N) ||x_d − σ′(W′ (σ(W x_d + b)) + b′)||² + λ Ω_w + β Ω_s    (12)

where N is the total number of training examples, and λ and β are the coefficients of the L2-regularization and sparsity-regularization terms. The sparsity-regularization term Ω_s is defined as

Ω_s = Σ_{i=1}^{P} KL(ρ || ρ̂_i) = Σ_{i=1}^{P} [ρ log(ρ/ρ̂_i) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_i))]    (13)

Here, P is the dimension reduced to by the autoencoder (the number of neurons in the hidden layer of the autoencoder), and ρ is the sparsity proportion parameter denoting the desired average activation value of the neurons in the autoencoder. The average activation value of the ith neuron is computed as

ρ̂_i = (1/N) Σ_{j=1}^{N} σ(W_i^T x_dj + b_i)    (14)

where x_dj is the jth training example. In equation (12), Ω_w denotes the L2-regularization term and is defined as

Ω_w = (1/2) Σ_{j=1}^{N} Σ_{i=1}^{Q} (W_ji)²    (15)
2 j¼1 i¼1
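For illustration, the regularized loss of equations (12)–(15) can be evaluated as in the following MATLAB sketch. The variable names (Xd, W, b, Wp, bp) are ours, and the L2 term is taken over the encoder weights, which is one reading of equation (15):

% Sketch: regularized autoencoder loss of equations (12)-(15).
% Assumes Xd (D x N data), encoder weights W (P x D) and bias b (P x 1),
% and decoder weights Wp (D x P) and bias bp (D x 1) are given.
sigma   = @(z) 1./(1 + exp(-z));        % logistic sigmoid encoder transfer function
H       = sigma(W*Xd + b);              % latent representation (P x N); b expands across columns
Xrec    = Wp*H + bp;                    % linear decoder output (D x N)
N       = size(Xd, 2);
mseTerm = sum(sum((Xd - Xrec).^2))/N;   % mean squared reconstruction error
rhoHat  = mean(H, 2);                   % average activation per neuron, equation (14)
rho     = 0.05;                         % sparsity proportion
Os = sum(rho*log(rho./rhoHat) + (1 - rho)*log((1 - rho)./(1 - rhoHat)));  % equation (13)
Ow = 0.5*sum(W(:).^2);                  % L2 regularization term, equation (15)
lambda = 0.001; beta = 1.6;
L = mseTerm + lambda*Ow + beta*Os;      % total loss, equation (12)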
Simulation procedures. All of the synthetic data were created in MATLAB (MathWorks). To understand the effect of the number of dimensions on the performance of different dimensionality-reduction techniques (Supplementary Fig. 3), the random number generator was set to its default state (seed = 0, Mersenne Twister generator) in MATLAB.

Implementation and parameter settings. FEM. MATLAB was used to implement the FEM technique. The central tendency measures (members of set U) considered were the centroid (mean in arbitrary dimension) and the medoid (median in arbitrary dimension). The distance metrics (members of set V) considered were Euclidean, square Euclidean, standard Euclidean, city block, Manhattan, Minkowski with exponents of 3 and 4, correlation, Chebychev, Mahalanobis, χ², KL divergence and Jeffrey divergence. Mathematical definitions of some of the distance metrics are provided in Supplementary Information 16.

In the subspace projection step of FEM, the starting subspace dimension ($d_w$) is taken as the closest integer (greater than or equal) to the original data dimensionality divided by 2, and $n_d$ is taken as 3. The subspace dimension is searched provided that $(d_w - u\,n_d) \geq 9$, where $u$ is the search iteration number.
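Under these settings, the candidate subspace dimensions can be enumerated as in the following MATLAB sketch (the variable names are ours; 561 is used only as an example dimensionality, taken from the human-activity feature vector):

% Sketch: candidate subspace dimensions searched by FEM.
D  = 561;            % original data dimensionality (example value)
dw = ceil(D/2);      % starting subspace dimension
nd = 3;
u  = 0;
while dw - u*nd >= 9
    fprintf('candidate subspace dimension: %d\n', dw - u*nd);
    u = u + 1;
end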
We divided the distance metrics into two categories: (1) correlation-type distances, for example, correlation; and (2) Euclidean and probabilistic-type distances: Euclidean, square Euclidean, standard Euclidean, city block, Manhattan, Minkowski with exponents of 3 and 4, Chebychev, Mahalanobis, χ², KL divergence and Jeffrey divergence. To reduce the computational burden of FEM, we first compare the cluster indices for only the correlation and Euclidean distances as follows. We compute the five cluster quality indices from the clustered data (by k-centres clustering) for the Euclidean ($I_{euc}$) and correlation ($I_{corr}$) distances. We then compute the following logical decision vector:

$$l_c(i) = \begin{cases} 1, & \text{if } I_{corr,i} > I_{euc,i} \\ 0, & \text{otherwise} \end{cases} \quad i = 1, 3, 4 \qquad (16)$$

$$l_c(i) = \begin{cases} 1, & \text{if } I_{corr,i} < I_{euc,i} \\ 0, & \text{otherwise} \end{cases} \quad i = 2, 5 \qquad (17)$$

On the basis of this decision vector, the better-performing distance category is selected for the subsequent comparisons (Supplementary Information 21). This further reduced the computational burden of the method.

For computing the cluster quality indices, the data are clustered into P clusters using different central tendency and distance metrics, where P is the number of reduced dimensions (the dimension of the latent space of the autoencoder). To cluster the data by k-centres clustering, the k-means++32 and PAM40 techniques were used. We note that K-subspace clustering cannot be used as a replacement for k-centres clustering (k-means++ or PAM), because only a Euclidean-type distance is computable from a point to a subspace (Supplementary Information 17); a probabilistic or correlation distance is not defined from a point to a subspace.
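A hedged sketch of this pre-selection step is shown below. The helper clusterQuality is hypothetical (standing in for the five cluster quality indices of the paper), and the majority-vote selection at the end is our assumption of how $l_c$ is used:

% Sketch: pre-selection between correlation- and Euclidean-type metrics.
% X is N x D (rows are samples); clusterQuality is a hypothetical helper
% returning the five cluster quality indices for a given partition.
P       = 10;                                        % number of clusters (reduced dimension)
idxEuc  = kmeans(X, P, 'Distance', 'sqeuclidean');   % k-means++ seeding is MATLAB's default
idxCorr = kmeans(X, P, 'Distance', 'correlation');
Ieuc    = clusterQuality(X, idxEuc);                 % 5 x 1 vector of indices
Icorr   = clusterQuality(X, idxCorr);
lc          = zeros(5, 1);
lc([1 3 4]) = Icorr([1 3 4]) > Ieuc([1 3 4]);        % equation (16): higher is better
lc([2 5])   = Icorr([2 5])   < Ieuc([2 5]);          % equation (17): lower is better
useCorrelationType = sum(lc) >= 3;                   % majority vote (our assumption)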
In the case of the autoencoder of FEM, a linear function is used as the decoder transfer function and a logistic sigmoid function as the encoder transfer function. The mean squared error between the input data and the data reconstructed by the decoder is used as the loss function of the autoencoder. The maximum number of epochs is set to 500. The L2 regularization parameter (λ) was chosen as 0.001 and the sparsity proportion parameter (ρ) was chosen as 0.05. The sparsity regularization parameter (β) was chosen as 1.6. Among these parameters, the loss function, encoder transfer function, L2 regularization parameter, sparsity proportion parameter and sparsity regularization parameter are the default parameters set by MATLAB. Setting the encoder transfer function to a sigmoid function and the decoder transfer function to purely linear is an ideal setting for the autoencoder for better performance, as described by Vincent et al.33. 500 epochs were enough to reach the global minimum in all of our analyses. However, these parameters may be tuned for better performance of FEM in specific applications.
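With MATLAB's Deep Learning Toolbox, this configuration corresponds roughly to the following call (a sketch; X holds one sample per column and P is the reduced dimension):

% Sketch: autoencoder step of FEM with the stated settings.
autoenc = trainAutoencoder(X, P, ...
    'MaxEpochs',              500, ...
    'EncoderTransferFunction','logsig', ...   % logistic sigmoid encoder
    'DecoderTransferFunction','purelin', ...  % linear decoder
    'L2WeightRegularization', 0.001, ...      % lambda in equation (12)
    'SparsityRegularization', 1.6, ...        % beta in equation (12)
    'SparsityProportion',     0.05);          % rho in equation (13)
Z = encode(autoenc, X);                       % P x N dimensionality-reduced data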
Competing methods. In KPCA20, the sigma parameter was taken as five times the mean of a vector $d_k$, where $d_k$ contains the minimum values of the columns of the pairwise Euclidean distance matrix. The FastICA technique was used to perform independent component analysis41, and the publicly available MATLAB code of FastICA from http://research.ics.aalto.fi/ica/fastica/ was used. t-SNE, MDS and non-negative matrix factorization as implemented by MATLAB with the default parameters were used to produce the results of these methods. The implementation of LLE was taken from the authors' website (https://cs.nyu.edu/roweis/lle/code.html). The number of nearest neighbours considered in LLE is 15. Probabilistic PCA was implemented according to the method described by Bishop et al.21. The autoencoder as an independent competing method has the same specifications as the autoencoder used as a part of FEM.
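The sigma computation can be sketched as follows (excluding the zero self-distances on the diagonal is our reading of the columnwise minima):

% Sketch: sigma parameter for KPCA.
Dmat = squareform(pdist(X));              % pairwise Euclidean distances (X is N x D)
Dmat(logical(eye(size(Dmat, 1)))) = Inf;  % exclude zero self-distances (our assumption)
dk    = min(Dmat, [], 1);                 % minimum value of each column
sigma = 5*mean(dk);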
In Dhaka, following the work of Rashid et al.16, the epoch number was set to 100 and the batch size was set to 200 if the number of data points was more than 200. For datasets with fewer than 200 data points, the batch size was set to 20. The activation function was set to 'relu' in Dhaka. Following the work of Szubert et al.17, the siamese neural network in ivis was trained on mini-batches of size 128 for 1,000 epochs using the Adam optimizer with a learning rate of 0.001 and the standard parameters (β1 = 0.9, β2 = 0.999). Training was stopped early if the loss did not decrease over 50 consecutive epochs. The nearest-neighbour parameter K was selected as 5. For datasets with fewer than 128 data points, the batch size was set to 20 in ivis. All other parameters in Dhaka and ivis were set to the default values used by the authors of these works. The implementation codes provided by the authors, which are publicly available on GitHub (https://github.com/sabrinar/Dhaka and https://github.com/beringresearch/ivis), were used to evaluate the performance of these methods.

Computation of normalized mean geodesic distance. At first, the neighbourhood points were determined on the manifold of the FEM and PCA components, on the basis of the Euclidean distances $r_q(i, j)$ between pairs of points $i, j$ using K = 15 nearest neighbours. K = 15 nearest neighbours was selected to make sure that all of the points were well connected in the geodesic map; this value has been chosen by several authors for computing geodesic distance, such as de Silva et al.42. We note that, for a very low value of K, the geodesic map can become fragmented and, for very large values, the geodesic map may become noisy. A weighted graph G over the data points was developed that represents these neighbourhood relations, with edges of weight $r_q(i, j)$ between neighbouring points26. Then, the geodesic distances $r_m(i, j)$ between all of the pairs of points are calculated on the manifold by computing their shortest-path distances $r_g(i, j)$ in the graph G. The geodesic distances between the points of each class are averaged to obtain the mean geodesic distance matrix. The normalized mean geodesic distance matrix is computed by dividing the matrix by its maximum value.
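One way to realize this computation in MATLAB is sketched below (Y is the embedding with one point per row, labels the class labels; the variable names are ours):

% Sketch: normalized mean geodesic distance matrix.
K = 15;
[nbrs, d] = knnsearch(Y, Y, 'K', K + 1);   % K nearest neighbours (first match is the point itself)
src = repelem((1:size(Y, 1))', K, 1);
dst = reshape(nbrs(:, 2:end)', [], 1);
w   = reshape(d(:, 2:end)', [], 1);
G   = simplify(graph(src, dst, w));        % weighted neighbourhood graph; duplicate edges merged
rg  = distances(G);                        % all-pairs shortest-path (geodesic) distances
classes = unique(labels);
C = numel(classes);
M = zeros(C);
for a = 1:C
    for b = 1:C
        M(a, b) = mean(rg(labels == classes(a), labels == classes(b)), 'all');
    end
end
M = M/max(M(:));                           % normalized mean geodesic distance matrix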
Computation of accuracy and NMI. We trained a multiclass error-correcting output codes model with support vector machine binary learners, using the cluster labels as the target variable and the embedding coordinates from the different methods as training variables. 50% of the data were used for training and the rest of the data were used for testing. We then used the trained classifier to predict the cluster identities of the testing data and computed the accuracy of these predictions, therefore assessing the ability of each method to separate data clusters. Accuracy was computed as the number of correctly found class labels divided by the total number of class labels. NMI is the normalized mutual information43 between the estimated labels and the true labels, computed following the work of Becht et al.44.
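In MATLAB this corresponds roughly to the following sketch (Z is the embedding, y the labels, assumed numeric or categorical):

% Sketch: classification accuracy on the embedding.
cv    = cvpartition(y, 'HoldOut', 0.5);                 % 50/50 train-test split
mdl   = fitcecoc(Z(training(cv), :), y(training(cv)));  % ECOC model; SVM learners are the default
yHat  = predict(mdl, Z(test(cv), :));
yTest = y(test(cv));
accuracy = sum(yHat == yTest)/numel(yTest);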
Statistical analysis. Data in Figs. 4 and 6 and Supplementary Fig. 8 are presented as mean ± s.d.

Datasets. Protein expression data from mice. This dataset13–15 contains 38 control mice and 34 trisomic mice (DS), totalling 72 mice. Expression levels of 77 protein modifications that produced detectable signals in the nuclear fraction of cortex in the mice were recorded. For each mouse, 15 measurements were registered for each protein. There are therefore 570 measurements for control mice and 510 measurements for trisomic mice, totalling 1,080 measurements per protein. The mice in the dataset were classified on the basis of features such as genotype, behaviour and treatment. According to genotype, mice can be control or trisomic and, according to behaviour, some mice were stimulated to learn and others were not. According to treatment, some mice were injected with the drug memantine and others were not.

All of the mice in the dataset can be classified into eight classes: (1) control mice, stimulated to learn, injected with saline (9 mice); (2) control mice, stimulated to learn, injected with memantine (10 mice); (3) control mice, not stimulated to learn, injected with saline (9 mice); (4) control mice, not stimulated to learn, injected with memantine (10 mice); (5) trisomy mice, stimulated to learn, injected with saline (7 mice); (6) trisomy mice, stimulated to learn, injected with memantine (9 mice); (7) trisomy mice, not stimulated to learn, injected with saline (9 mice); and (8) trisomy mice, not stimulated to learn, injected with memantine (9 mice).

scRNA-seq data. HEK293T (ATCC CRL-11268) cells, BMMCs from two healthy donors and peripheral blood mononuclear cells (PBMCs) from one healthy donor were acquired from the American Type Culture Collection (ATCC) and cultured according to the ATCC guidelines18. B cells were separated from the PBMCs. scRNA-seq libraries were generated from BMMC samples obtained from two patients before and after undergoing HSCT for adult acute myeloid leukaemia (AML) (AML027 and AML035). The amount of RNA per cell type was determined by quantifying (Qubit; Invitrogen) RNA that was extracted (Maxwell RSC simplyRNA Cells Kit) from several different known numbers of cells. Details of the sample processing and RNA-seq acquisition were described by Zheng et al.18. We used a part of the data from these authors. The total number of cells that we used was 29,826, with 32,733 genes in each cell. We used a total of eight classes in our analysis: (1) HEK293T cells (2,885 cells), (2) AML027 post-HSCT (3,965 cells), (3) AML027 pre-HSCT (3,933 cells), (4) AML035 post-HSCT (909 cells), (5) AML035 pre-HSCT (3,592 cells), (6) B cells (10,085 cells), (7) BMMCs from the first healthy human being (1,985 cells) and (8) BMMCs from the second healthy human being (2,472 cells).

The scRNA-seq dataset is a matrix that consists of the number (integer) of appearances of a particular gene sequence in cells. Each row of the matrix corresponds to one cell, whereas each column represents the values of a specific gene. Before applying downstream tasks, each column of the scRNA-seq dataset first needs to be normalized to mean = 0 and scaled to s.d. = 1. This is a well-established protocol and is used in most analyses of scRNA-seq data45.
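In MATLAB, this per-gene standardization is a one-line operation (a sketch; X is the cells × genes count matrix):

% Sketch: per-gene standardization of the count matrix.
Xn = zscore(X);   % zscore standardizes each column to mean 0 and s.d. 1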
EEG brainwave data. This dataset consists of EEG brainwave data processed using the statistical extraction method described by Bird et al.15,27. The data were collected from two human participants (one male and one female) for 3 min in each emotional state: positive, neutral and negative. A Muse EEG headband was used to collect the data, which recorded the TP9, AF7, AF8 and TP10 EEG placements using dry electrodes. Six minutes of resting neutral data were also recorded. The stimuli that were used to evoke the emotions in the participants were as follows: (1) Marley and Me (Twentieth Century Fox), negative, death scene; (2) Up (Walt Disney Pictures), negative, opening death scene; (3) My Girl (Imagine Entertainment), negative, funeral scene; (4) La La Land (Summit Entertainment), positive, opening musical number; (5) Slow Life (BioQuest Studios), positive, nature time lapse; and (6) Funny Dogs (MashupZone), positive, funny dog clips.

Human activity data classification. The Human Activity Recognition dataset15 was built from recordings of 30 participants performing daily activities (such as walking, walking upstairs, walking downstairs, sitting, standing and laying). A waist-mounted smartphone (Samsung Galaxy S II) with embedded inertial sensors was used to capture three-axial linear acceleration and three-axial angular velocity at a constant rate of 50 Hz (ref. 29).

The sensor signals were preprocessed in two steps: (1) noise filters were applied and the signals were then sampled in fixed-width sliding windows of 2.56 s with 50% overlap (128 readings per window); (2) the sensor acceleration signal, which has gravitational and body-motion components, was separated using a Butterworth low-pass filter (with 0.3 Hz cut-off frequency) into body acceleration and gravity (a filtering sketch is given after the list below). From each window of the signal, a vector of features was obtained by calculating variables from the time and frequency domains.

For each record in the dataset, the following attributes were accumulated: (1) triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration; (2) triaxial angular velocity from the gyroscope; (3) a 561-feature vector with time- and frequency-domain variables; (4) the activity label; and (5) an identifier of the individual who performed the experiment.
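The gravity/body-acceleration separation can be sketched as follows (the filter order is our assumption; the 0.3 Hz cut-off and 50 Hz sampling rate are as stated above):

% Sketch: separating gravity from body acceleration.
fs = 50; fc = 0.3;                       % sampling rate and cut-off frequency (Hz)
[bLP, aLP] = butter(3, fc/(fs/2));       % low-pass Butterworth (order 3 is our assumption)
gravity = filtfilt(bLP, aLP, accTotal);  % zero-phase filtering; accTotal columns are axes
bodyAcc = accTotal - gravity;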
Motion detection from inertial measurement unit sensor data. This dataset, which is called the MHEALTH (Mobile HEALTH) dataset15, consists of recordings of body motion and vital signs for ten volunteers of diverse profile while performing the following physical activities: (1) standing still, (2) sitting and relaxing, (3) lying down, (4) walking, (5) climbing stairs, (6) waist bends forward, (7) frontal elevation of arms, (8) knees bending (crouching), (9) cycling, (10) jogging, (11) running and (12) jumping forwards and backwards15,28. Sensors were placed on the individual's chest, right wrist and left ankle to measure the motion experienced by diverse body parts, namely, acceleration, rate of turn and magnetic field orientation. Shimmer2 (ref. 46) wearable sensors were used for the recordings. All sensing modalities were recorded at a sampling rate of 50 Hz, which was considered to be sufficient for capturing human activity. The activities were collected in an out-of-lab environment.

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The main data supporting the results in this study are available within the paper and its Supplementary Information. The protein-expression dataset from mice is available at https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression. The scRNA-seq data of patients with leukaemia are available at https://support.10xgenomics.com/single-cell-gene-expression/datasets. The CT dataset is available at https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis. The datasets for emotion classification and human-activity classification, and the data from wearable sensors and from smartphones, were downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets).

Code availability
The implementation code of the proposed FEM algorithm is available for research uses at https://github.com/tauhidstanford/Feature-augmented-embedding-machine.

Received: 3 February 2020; Accepted: 23 September 2020; Published: xx xx xxxx
References
1. Xing, L., Giger, M. & Min, J. K. Artificial Intelligence in Medicine: Technical Basis and Clinical Applications (Elsevier Science, 2020).
2. Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
3. Jolliffe, I. T. Principal Component Analysis 2nd edn (Springer, 2002).
4. Hyvärinen, A. & Oja, E. Independent component analysis: algorithms and applications. Neural Netw. 13, 411–430 (2000).
5. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
6. Kruskal, J. B. & Wish, M. Multidimensional Scaling (SAGE, 1978).
7. Watkins, J. C., Kishore, R. & Priya, S. An Introduction to the Science of Statistics: From Theory to Implementation 12–19 (Watkins, J. C., 2016).
8. Hinton, G. E. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
9. Pinheiro, P. O. Unsupervised domain adaptation with similarity learning. In Proc. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 8004–8013 (IEEE, 2018).
10. Sohn, K., Shang, W., Yu, X. & Chandraker, M. Unsupervised domain adaptation for distance metric learning. In Proc. International Conference on Learning Representations (ICLR, 2019).
11. Xing, E. P., Jordan, M. I., Russell, S. J. & Ng, A. Y. Distance metric learning with application to clustering with side-information. In Proc. 15th International Conference on Neural Information Processing Systems (Eds Becker, S. et al.) 521–528 (MIT Press, 2002).
12. Suárez, J. L., García, S. & Herrera, F. A tutorial on distance metric learning: mathematical foundations, algorithms and software. Preprint at https://arxiv.org/abs/1812.05944 (2018).
13. Higuera, C., Gardiner, K. J. & Cios, K. J. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE 10, e0129126 (2015).
14. Ahmed, M. M. et al. Protein dynamics associated with failed and rescued learning in the Ts65Dn mouse model of Down syndrome. PLoS ONE 10, e0119491 (2015).
15. Dua, D. & Graff, C. UCI Machine Learning Repository (University of California, Irvine, accessed 15 September 2019); http://archive.ics.uci.edu/ml
16. Rashid, S., Shah, S., Bar-Joseph, Z. & Pandya, R. Dhaka: variational autoencoder for unmasking tumor heterogeneity from single cell genomic data. Bioinformatics https://doi.org/10.1093/bioinformatics/btz095 (2019).
17. Szubert, B., Cole, J. E., Monaco, C. & Drozdov, I. Structure-preserving visualisation of high dimensional single-cell datasets. Sci. Rep. 9, 8914 (2019).
18. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
19. Abid, A., Zhang, M. J., Bagaria, V. K. & Zou, J. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nat. Commun. 9, 2134 (2018).
20. Schölkopf, B., Smola, A. & Müller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10, 1299–1319 (1998).
21. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
22. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2015).
23. Lee, D. D. & Seung, H. S. Algorithms for non-negative matrix factorization. In Proc. 13th International Conference on Neural Information Processing Systems (Eds Leen, T. K. et al.) 556–562 (MIT Press, 2001).
24. Roweis, S. T. & Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000).
25. Graf, F., Kriegel, H.-P., Schubert, M., Pölsterl, S. & Cavallaro, A. 2D image registration in CT images using radial image descriptors. In Proc. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2011 (Eds Fichtinger, G. et al.) 607–614 (Springer, 2011).
26. Tenenbaum, J. B., de Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
27. Bird, J. J., Manso, L. J., Ribeiro, E. P., Ekárt, A. & Faria, D. R. A study on mental state classification using EEG-based brain-machine interface. In Proc. 2018 International Conference on Intelligent Systems (IS) 795–800 (IEEE, 2018).
28. Banos, O. et al. mHealthDroid: a novel framework for agile development of mobile health applications. In Proc. Ambient Assisted Living and Daily Activities (Eds Pecchia, L. et al.) 91–98 (Springer, 2014).
29. Anguita, D., Ghio, A., Oneto, L., Parra, X. & Reyes-Ortiz, J. L. A public domain dataset for human activity recognition using smartphones. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning 437–442 (ESANN, 2013).
30. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. Preprint at http://arxiv.org/abs/2002.05709 (2020).
31. Vidal, R. Subspace clustering. IEEE Signal Process. Mag. 28, 52–68 (2011).
32. Arthur, D. & Vassilvitskii, S. k-means++: the advantages of careful seeding. In Proc. 18th Annual ACM–SIAM Symposium on Discrete Algorithms 1027–1035 (ACM–SIAM, 2007).
33. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. & Manzagol, P.-A. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010).
34. Pourkamali-Anaraki, F., Folberth, J. & Becker, S. Efficient solvers for sparse subspace clustering. Preprint at http://arxiv.org/abs/1804.06291 (2018).
35. Manning, C. D., Raghavan, P. & Schütze, H. Introduction to Information Retrieval (Cambridge University Press, 2008).
36. Stone, J. V. Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning (Sebtel Press, 2019).
37. Lipschutz, S. & Lipson, M. Schaum's Outline of Linear Algebra 4th edn (McGraw-Hill, 2009).
38. Wang, D., Ding, C. & Li, T. K-subspace clustering. In Proc. Machine Learning and Knowledge Discovery in Databases (Eds Buntine, W.) 506–521 (Springer, 2009).
39. Carrell, J. B. Fundamentals of Linear Algebra 412 (2015); https://www.math.ubc.ca/~carrell/NB.pdf
40. Kaufman, L. & Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis (John Wiley & Sons, 1990).
41. Hyvärinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw. 10, 626–634 (1999).
42. de Silva, V. & Tenenbaum, J. B. Global versus local methods in nonlinear dimensionality reduction. In Proc. 15th International Conference on Neural Information Processing Systems 721–728 (MIT Press, 2002).
43. Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010).
44. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
45. Evans, C., Hardin, J. & Stoebel, D. M. Selecting between-sample RNA-seq normalization methods from the perspective of their assumptions. Brief. Bioinform. 19, 776–792 (2017).
46. Burns, A. et al. SHIMMER™—a wireless sensor platform for noninvasive biomedical research. IEEE Sens. J. 10, 1527–1534 (2010).

Acknowledgements
We thank M. B. Khuzani and H. Ren for their advice in improving the manuscript. This work was partially supported by the National Institutes of Health (nos. 1R01 CA223667 and R01CA227713) and by a Faculty Research Award from Google.

Author contributions
L.X. conceived the experiments; M.T.I. conducted the experiments; and M.T.I. analysed the results. Both of the authors reviewed the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41551-020-00635-3.
Correspondence and requests for materials should be addressed to L.X.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature Limited 2020
Reporting Summary
Corresponding author(s): Lei Xing
Last updated by author(s): Sep 10, 2020
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.
Software and code
Data collection: No software was used for data collection.
Data analysis: The custom codes were written in Matlab 2019a. All data analyses were performed in Matlab 2019a. The implementation code of the FEM algorithm is available for research uses at https://github.com/tauhidstanford/Feature-augmented-embedding-machine.
Data
The main data supporting the results in this study are available within the paper and its Supplementary Information. The protein-expression dataset from mice is available at https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression. The single-cell RNA-seq data of leukaemia patients are available at https://support.10xgenomics.com/single-cell-gene-expression/datasets. The CT dataset is available at https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis. The datasets for emotion classification and human-activity classification, and the data from wearable sensors and from smartphones, were downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets).
Field-specific reporting
Best fit for this research: Life sciences. For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf
Life sciences study design
All studies must disclose on these points even when the disclosure is negative.
Sample size: For the protein-expression dataset, 285 cell data from mice with and without Down syndrome were used. For treated and untreated mice, 132 cell data were used. For two-class analysis from scRNA-seq data, 7,898 cell data were used, and for four-class analysis 18,920 cell data were used. For CT data, 19,872 images were used for the 39-class problem and 3,743 images for the 5-class problem.
Data exclusions: No data were excluded from the analyses.
Replication: All experiments were conducted in a manner that produced clean replicates.
Randomization: For clustering, the initializations in kmeans++ and partition around medoids were performed randomly.
Blinding: The analysed datasets were acquired from established databases. No knowledge of the analysed data classes was used in algorithm development; knowledge of the data classes was used only to measure the performance of the developed method.