Dimensionality Reduction
https://doi.org/10.1038/s41551-020-00635-3
Dimensionality reduction is widely used in the visualization, compression, exploration and classification of data. Yet a generally applicable solution remains unavailable. Here, we report an accurate and broadly applicable data-driven algorithm for dimensionality reduction. The algorithm, which we named 'feature-augmented embedding machine' (FEM), first learns the structure of the data and the inherent characteristics of the data components (such as central tendency and dispersion), denoises the data, increases the separation of the components, and then projects the data onto a lower number of dimensions. We show that the technique is effective at revealing the underlying dominant trends in datasets of protein expression and single-cell RNA sequencing, computed tomography, electroencephalography and wearable physiological sensors.

Department of Radiation Oncology, Stanford University, Stanford, CA, USA. ✉e-mail: [email protected]
Recent technological advances are transforming biomedical science into a digitized, data-intensive discipline. Large-scale and high-dimensional data from biomedical studies, such as single-cell cytometry and transcriptome analysis, sensor experiments, biomarker discovery and drug design, and medical imaging are accumulating at a staggering rate1. Data processing at this scale and dimensionality presents a daunting challenge, especially considering the varying quality of the available biomedical data caused by noise, artifacts, missing information and the batch effect. Accurate and efficient dimensionality reduction is central to reducing data complexity, understanding the local and global structures of the data, generating hypotheses and making optimal data-driven decisions2. In practice, although a large armamentarium of algorithms exists to accomplish dimensionality reduction, most of them rely on specific assumptions about the underlying data structure in low or high dimensions (or in both). For example, principal component analysis (PCA)3 assumes that the data components are orthogonal, and is typically used to find a linear combination of the components to compress high-dimensional data, limiting its adequacy when nonlinear combinations of the components would be more appropriate. Other commonly used methods, such as independent component analysis (ICA)4, t-distributed stochastic neighbourhood embedding (t-SNE)5 and multidimensional scaling (MDS)6, have similar pitfalls and are applicable only to a subset of problems. Furthermore, as they work directly on raw data, these methods are known to be susceptible to noise in the data. Moreover, t-SNE and related methods do not preserve the global structure of the data and are therefore limited to data exploration or visualization; methods such as PCA and ICA are used as analysis tools and are not optimized for visualizing data at a low number of dimensions.

Here, to mitigate the limitations of the traditional techniques, we propose a data-driven strategy for dimensionality reduction, which we named feature-augmented embedding machine (FEM; Fig. 1a). Instead of assuming a data structure at a low or high number of dimensions, we first stratify the high-dimensional data under exploration according to their inherent characteristics (in particular, central tendency and dispersion), and use the information to project the data onto an intermediate number of dimensions using data clustering. The rationale for this step is that data representation can be performed more reliably at higher dimensions, reducing ambiguity and the generation of artifacts. Generally, the data components that are separable at a high number of dimensions should be maximally separated there at first; the high-dimensional data processing is valuable for subsequent dimensionality reduction. A salient feature of clustering the data at an intermediate number of dimensions is that, by computing the distance of the data points to their respective cluster centres, the adverse influence of noise in the data is reduced through noise subtraction. This step also helps to augment the differentiating features of the data owing to mean-value subtraction in the distance-computation process. The intermediate processing of the data is therefore a key step of FEM, and leads to significantly improved performance with respect to existing techniques. In the final step, we use deep learning to extract the essential features of the clustered data, and project them onto a lower number of dimensions so that they can be more easily visualized.

Seeking distinctive characteristics that are generally applicable to describe any kind of data for raw-data stratification, we turned to central tendency and dispersion as two natural choices7. In brief, a measure of central tendency is a single value that describes the manner in which data cluster around a central value, such as the arithmetic mean, median, mode, geometric mean, trimean and trimmed mean. The dispersion (also called variability, scatter or spread) measures how the data are distributed around the centre value using a distance metric (for example, Euclidean distance, Manhattan distance, Minkowski distance, correlation distance, cosine distance, Kullback–Leibler divergence (KL divergence), Jeffrey divergence or geodesic distance). For example, in Gaussian component mixtures, the data components may have different means, different variances (average square Euclidean distance from the mean) or both. For Rayleigh or Student's t mixtures, the components may have different scale parameters (a function of the average square Euclidean distance from the mean). In general, regardless of the characteristics of the data components or the way they are generated, either a measure of central tendency or of dispersion (or both) should separate them in the original high dimension7.
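This last point can be made concrete with a small numerical experiment (an illustrative Python sketch of ours, not part of the paper or its code): two high-dimensional Gaussian components with identical means but different variances are indistinguishable by a central-tendency summary, yet are separated almost perfectly by a dispersion summary, namely the Euclidean distance of each point from the shared centre.

import numpy as np

rng = np.random.default_rng(0)
# Two 50-dimensional Gaussian components: identical mean (0), different s.d. (1 versus 2).
x1 = rng.normal(0.0, 1.0, size=(1000, 50))
x2 = rng.normal(0.0, 2.0, size=(1000, 50))

# Central tendency is uninformative here: both sample means are close to 0.
print(np.abs(x1.mean(axis=0)).max(), np.abs(x2.mean(axis=0)).max())

# Dispersion separates the components: distances from the shared centre
# concentrate near sqrt(50)*1 ~ 7.1 and sqrt(50)*2 ~ 14.1, respectively.
d1 = np.linalg.norm(x1, axis=1)
d2 = np.linalg.norm(x2, axis=1)
print(d1.mean(), d2.mean())                    # ~7.0 versus ~14.1
print((d1 < 10.5).mean(), (d2 > 10.5).mean())  # one threshold separates almost all points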
[Fig. 1 appears here: a, workflow schematic of the four FEM steps; b–e, scatter plots of PC1/PC2 and FEM1/FEM2 for mice with and without DS (b,c) and for memantine- versus saline-treated mice (d,e). See caption below.]
Fig. 1 | Workflow of FEM, and discovery of subgroups in protein expression data. a, The operations in each step of FEM. Step 1: the best number of
subspaces and the subspace dimension are selected on the basis of five cluster-quality indices (Methods). The data are projected onto the selected
subspaces. Step 2: the central-tendency measure and distance metric that best describe the subspace-projected data are chosen on the basis of the
values of the cluster-quality indices. Step 3: the distances of the data points from each cluster centre are computed. Step 4: the distance data are projected
onto the reduced dimensions. b–e, Discovery of subgroups in protein-expression data. b, PCA has been used to analyse a protein-expression dataset of
mice with and without DS. The lower-dimensional representations of protein-expression measurements from mice with and without DS from PCA are
distributed similarly, with no clear distinction. c, However, when FEM is used to project the dataset onto its first two components, a lower-dimensional
representation shows differences between mice with and without DS. d,e, Furthermore, PCA (d) and FEM (e) have been used to project the dataset
of treated (with memantine) and untreated (injected with saline) mice onto the first two components. In this case, PCA does not show any difference
between these two types of mice, whereas FEM components clearly separate them.
Computationally, FEM first attempts to increase the separation of the data components on the basis of central tendency. To achieve this, it fits a number of subspaces to the data and projects the data onto these subspaces (Fig. 1a (step 1) and Supplementary Fig. 69). If the data consist of components of different central tendencies, FEM chooses subspaces of lower dimensionality on the basis of cluster-quality indices. Cluster-quality indices measure the quality of clusters through the separation of data points from different clusters and through the aggregation of data points of the same cluster (Supplementary Fig. 69 and Supplementary Information 1 and 11). Normally, the subspace number is larger than the number of data clusters and, as a result, each of the clusters is represented by a number of subspaces. Thus, the projection of the data onto the subspaces ultimately reduces the number of outliers and increases the separation between the centres of the data clusters (Supplementary Information 1).
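The subspace-fitting step is implemented in the authors' MATLAB code; the Python sketch below (ours, assuming scikit-learn, with a silhouette score standing in for the paper's five cluster-quality indices) only conveys the idea: fit several low-dimensional subspaces to the data, project each point onto the subspace of its own group, and keep the subspace number and dimension that yield the best-quality clusters.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def project_onto_subspaces(X, n_subspaces, dim, seed=0):
    """Group the points, fit a dim-dimensional PCA subspace per group, and
    replace each point by its projection onto its group's subspace."""
    labels = KMeans(n_clusters=n_subspaces, n_init=10, random_state=seed).fit_predict(X)
    Xp = np.empty_like(X, dtype=float)
    for k in range(n_subspaces):
        members = labels == k
        q = min(dim, members.sum(), X.shape[1])  # guard against tiny clusters
        pca = PCA(n_components=q).fit(X[members])
        Xp[members] = pca.inverse_transform(pca.transform(X[members]))  # denoised projection
    return Xp, labels

def best_projection(X, subspace_counts=(4, 8, 12), dims=(2, 5, 10)):
    """Keep the (subspace number, dimension) pair whose projected data cluster best."""
    scored = []
    for K in subspace_counts:
        for q in dims:
            Xp, labels = project_onto_subspaces(X, K, q)
            scored.append((silhouette_score(Xp, labels), K, q, Xp))
    return max(scored, key=lambda t: t[0])

In the actual algorithm the subspaces are fitted jointly (K-subspace clustering) and five cluster-quality indices are compared; the sketch mirrors only the interface of the step.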
[Fig. 2 appears here: t-SNE, Dhaka and ivis scatter plots (t-SNE1/2, Dhaka1/2, Ivis1/2) for the non-DS/DS (a) and saline/memantine (b) protein-expression datasets. See caption below.]
Fig. 2 | Discovery of subgroups in protein-expression data from mice. a, Visualization of expression data of mice with and without DS by t-SNE, Dhaka
and ivis. b, Protein-expression data from treated and untreated mice, visualized using t-SNE, Dhaka and ivis. For both datasets, t-SNE, Dhaka and ivis did
not differentiate the data classes as efficiently as FEM. t-SNE shows inaccurate and spurious clusters in both cases.
This step also helps FEM to learn the central tendency and the distance metrics in the next step.

For the next step, to further augment the separation of the data components, FEM learns the type of central-tendency measure and distance metric that best describes the components (Fig. 1a (step 2) and Supplementary Fig. 69). FEM performs this learning in an unsupervised manner by clustering the data with different central-tendency measures and distance metrics. At the end of the calculation, the central-tendency measure and distance metrics that result in the best cluster-quality indices are chosen. For deep-learning-based data projection, we compute the distances of each data point to the centres of the clusters (Fig. 1a (step 3), Supplementary Fig. 69 and Supplementary Information 1) and then feed the distance matrix into the projection model (Fig. 1a (step 4) and Supplementary Fig. 69). The use of a deep autoencoder for data projection produces better data-dimensionality reduction compared with other techniques, such as PCA and ICA8. This approach also provides opportunities for dimensionality reduction with different goals by changing the structure and loss functions (a detailed discussion of which is provided in Supplementary Information 2).
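Steps 2–4 just described can be read as the following Python sketch (ours; scipy and scikit-learn are assumed, a silhouette score again stands in for the paper's cluster-quality indices, and a small bottleneck network trained to reconstruct its input stands in for the paper's deep autoencoder):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score
from sklearn.neural_network import MLPRegressor

def k_centres(X, k, centre_fn, metric, iters=50, seed=0):
    """Lloyd-style clustering with a pluggable central-tendency measure and metric."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = cdist(X, centres, metric=metric).argmin(axis=1)
        centres = np.stack([centre_fn(X[labels == j]) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels, centres

# Candidate central-tendency measures and distance metrics (the paper compares
# centroid and medoid, and a wider set of metrics).
centre_fns = {'centroid': lambda A: A.mean(axis=0),
              'median': lambda A: np.median(A, axis=0)}
metrics = ['euclidean', 'cityblock', 'cosine', 'correlation']

def fem_features(X, k):
    # Step 2: choose the central tendency and metric with the best cluster quality.
    best = (-np.inf, None, None)
    for f in centre_fns.values():
        for m in metrics:
            labels, centres = k_centres(X, k, f, m)
            score = silhouette_score(X, labels, metric=m)
            if score > best[0]:
                best = (score, centres, m)
    _, centres, m = best
    # Step 3: the learned distances to the cluster centres become the new,
    # automatically denoised representation of the data.
    return cdist(X, centres, metric=m)

def embed(D, n_dims=2, seed=0):
    # Step 4: compress the distance matrix through a bottleneck autoencoder and
    # read the embedding off the bottleneck layer.
    ae = MLPRegressor(hidden_layer_sizes=(32, n_dims, 32), activation='tanh',
                      max_iter=2000, random_state=seed).fit(D, D)
    H = D
    for W, b in list(zip(ae.coefs_, ae.intercepts_))[:2]:
        H = np.tanh(H @ W + b)   # forward pass up to the bottleneck
    return H                     # (N, n_dims) embedding

# Usage (X is an (N, D) data array): embedding = embed(fem_features(X, k=8))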
As FEM computes the learned distance from the cluster centres in the second step, the remaining steps work on the distance data rather than on the raw data. This process automatically denoises the data, making FEM inherently more robust to noise and outliers (Supplementary Information 18 and 19). This unique denoising property is particularly valuable for dimensionality reduction in many important biological data-analysis problems (for example, single-cell RNA-sequencing (scRNA-seq) data with dropout noise) in which noise has long been a main concern.

To summarize, FEM reduces data dimensionality through the effective incorporation of high-dimensional-data characteristics, and contributes to data science in the following two aspects: FEM learns the data structure on the basis of the inherent properties of the data components, and provides a mechanism to leverage the information; and it offers a unique way to increase the separation of the data components at a high number of dimensions with a suppressed noise level, leading to a substantially improved performance in dimensionality reduction. Note that there are methods available in the literature9–12 for dimensionality reduction that are based on unsupervised or supervised learning of distance metrics. However, these approaches are trained in an end-to-end manner without explicitly considering any inherent data structure. FEM is also unique in its distance-learning method. To date, most distance-learning methods learn only Euclidean-type distances (mostly Mahalanobis). FEM goes beyond this, as it can learn any type of distance metric, such as Euclidean, probabilistic, geodesic and correlation. As FEM learns the inherent data characteristics without any assumption on the data components, and incorporates the information into data-analysis methods, it is generally applicable.

Results
FEM discovers subgroups from protein-expression data. First, we used a dataset consisting of protein-expression measurements from mice that have received shock therapy13–15. Down syndrome (DS) developed in some of these mice. We applied both PCA and FEM to the data, and tested whether the first two components of these methods are able to detect any significant difference within the mouse population on the basis of the presence or absence of DS. In Fig. 1b, we show the PCA result. The first two principal components cannot separate the classes at all within the population of mice. The results of FEM on the same datasets are shown in Fig. 1c; it can be observed that the two data components are better separable within the mouse population.

We devised another problem by taking protein-expression measurements corresponding to treated and untreated mice13–15, and applied PCA and FEM to determine whether the methods can separate these two classes. The treated mice were given memantine, and the untreated mice were injected with saline. The results of the PCA are shown in Fig. 1d, in which we see that PCA cannot show any significant differences between these two types of mice, whereas FEM shows a clear distinction between these groups (Fig. 1e).

In Fig. 2, we show the results of three more methods, namely t-SNE, Dhaka16 and ivis17, for the above two datasets. Figure 2 shows that t-SNE, Dhaka and ivis did not keep the separation of the data classes as efficiently as FEM while projecting the data onto two dimensions. Moreover, t-SNE creates spurious and inaccurate clusters in the low-dimensional representation of the data.
[Fig. 3 appears here: scatter plots of the first three PCA components (PC1–PC3) and FEM components (FEM1–FEM3) for the two-class (a,b) and four-class (c,d) scRNA-seq datasets. See caption below.]
Fig. 3 | Visualization of a high-dimensional scRNA-seq dataset in three dimensions using PCA and FEM. a,b, Two classes were analysed using PCA (a)
and FEM (b): a pre-transplant sample from patient 1 and a post-transplant sample from patient 1. PCA did not differentiate between these two classes,
whereas FEM shows two classes projected onto different angles. c,d, A dataset consisting of four cell samples (HEK293T cells, a post-transplant sample from patient 1, a sample from CD19+ B cells and a sample from a healthy human) was analysed using PCA (c) and FEM (d). When performance was compared, FEM was better at projecting the data into separate clusters than PCA.
FEM discovers distinct patterns from scRNA-seq data. Next, we used a high-dimensional dataset consisting of single-cell RNA-expression levels of a mixture of bone marrow mononuclear cells (BMMCs) obtained from a patient with leukaemia before and after a stem-cell transplant18. All RNA-seq data have been pre-processed, and the datasets were reduced to 500 genes on the basis of higher dispersion (variance divided by the mean) in the data19.
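This 500-gene reduction can be reproduced in a few lines (a numpy sketch of ours, using the stated definition of dispersion as variance divided by the mean; the authors' exact preprocessing follows ref. 19):

import numpy as np

def top_genes_by_dispersion(counts, n_genes=500):
    """counts: (cells, genes) expression matrix. Keep the n_genes genes with the
    highest dispersion, defined as variance divided by mean expression."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    disp = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    keep = np.argsort(disp)[-n_genes:]
    return counts[:, keep], keep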
We performed both PCA and FEM on the reduced dataset, and show the results in Fig. 3. It is seen from the PCA result (Fig. 3a) that both components follow similar types of distribution in the space covered by the components, and that there is no clustering. However, for FEM, the data components have well-differentiated distributions in the space spanned by the first three components, and there is clustering (Fig. 3b).

From the same dataset, we formulated a problem with the following four classes: HEK293T cells, patient 1 after transplant, CD19+ B cells and a healthy human. The first three principal components are shown in Fig. 3c, in which we see that PCA has the ability to show a distinction between the HEK293T cells and the other three classes. However, the patient, CD19+ B-cell and healthy classes have the same kind of distribution and cannot be separated. In comparison with PCA, for FEM (Fig. 3d), we observed that the four classes had better-differentiated distributions that were better separated from each other in the three dimensions.

For the above two datasets, we present the visualization results of t-SNE, Dhaka and ivis, along with quantitative evaluations of all five methods (Fig. 4). In the visualization of the two-class dataset (Fig. 4a), t-SNE creates four clusters of data, with a number of spurious red data points (representing the pre-treatment sample) spread around the cyan points (representing the post-treatment sample). This phenomenon can also be observed in the case of the four-class dataset (Fig. 4b). For both datasets, ivis creates a different distribution of data classes; however, it does not create distinct clusters for the data classes as efficiently as FEM. Dhaka does not show any distribution differences between data classes for either dataset. The inefficiency of the three methods in reducing the dimensionality of the data can also be seen from the quantitative values of accuracy and normalized mutual information (NMI), as shown in Fig. 4a,b.
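The accuracy and NMI values of Fig. 4 (mean ± s.d. over 100 k-means initializations, as stated in the caption) can be computed along the following lines (a scikit-learn/scipy sketch of the evaluation protocol; 'embedding' and 'y_true' are placeholders for the dimension-reduced data and the known class labels):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy after optimally matching cluster labels to classes (Hungarian method)."""
    C = confusion_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-C)
    return C[rows, cols].sum() / C.sum()

def evaluate(embedding, y_true, n_runs=100):
    k = len(np.unique(y_true))
    acc, nmi = [], []
    for seed in range(n_runs):  # one k-means initialization per run
        y_pred = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(embedding)
        acc.append(clustering_accuracy(y_true, y_pred))
        nmi.append(normalized_mutual_info_score(y_true, y_pred))
    return (np.mean(acc), np.std(acc)), (np.mean(nmi), np.std(nmi))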
[Fig. 4 appears here: t-SNE, Dhaka and ivis scatter plots for the two-class (a) and four-class (b) scRNA-seq datasets, with bar graphs of accuracy and NMI for PCA, Dhaka, t-SNE, ivis and FEM. See caption below.]
Fig. 4 | Visualization of a high-dimensional scRNA-seq dataset using t-SNE, Dhaka and ivis. a, The first dataset consists of a pre-transplant sample
from patient 1 and a post-transplant sample from patient 1 (P1). Dimension-reduced data from t-SNE, Dhaka and ivis are shown. b, The second dataset
consists of four cell samples: HEK293T cells, a post-transplant sample from patient 1, a sample from CD19+ B cells and a sample from a healthy
human. Dimension-reduced data from t-SNE, Dhaka and ivis are shown. None of the three methods maintained the efficient separation of the data classes
in a lower-dimensional representation. Although t-SNE and ivis can separate out the data classes partially, their representation is not as good as the
representation of FEM. The quantitative evaluation (accuracy and NMI) of the five methods (PCA, FEM, t-SNE, Dhaka and ivis) for the first and second
dataset is shown in the bar graphs of a and b, respectively. The error bars represent the s.d. of the indices from the mean values for 100 different
initializations of k-means clustering.
For both datasets, the highest accuracy and NMI were obtained using FEM. Ivis performed better than the other three methods for both datasets. t-SNE performed worse than PCA, and Dhaka performed the worst for both datasets.

Supplementary Figures 19–33 provide results on the same RNA-seq dataset, but for the differentiation of different classes using ten prominent dimensionality-reduction methods (PCA, kernel PCA (KPCA)20, ICA4, probabilistic PCA21, autoencoder22, t-SNE, MDS, non-negative matrix factorization23, locally linear embedding (LLE)24 and FEM).

FEM preserves high-dimensional patterns of medical imaging data. Next, we used a dataset retrieved from a set of 53,500 computed tomography (CT) images from 74 different patients (43 male, 31 female)15,25. Each CT image is described by two histograms in polar space. The first histogram contains information on the location of bone structures in the image, and the second histogram contains information about the location of air inclusions inside the body. The final feature vector for analysis was formed by concatenating both histograms. The class labels (the relative location of an image on the axial axis) were created by manually annotating up to ten distinct landmarks in each CT volume with known location. The locations of the CT images in between the landmark positions were obtained by interpolation. Class label values range from 0 to 180, where 0 denotes the top of the head and 180 indicates the soles of the feet.
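The construction of the features and labels for this dataset is straightforward to express in code (a numpy sketch; the landmark positions below are hypothetical and only illustrate the interpolation):

import numpy as np

# Feature vector: the two polar-space histograms (bone and air) concatenated;
# bone_hist and air_hist are placeholders for the per-image histograms.
# feature = np.concatenate([bone_hist, air_hist])

# Class labels: annotated landmarks give known axial positions for a few slices;
# every slice in between receives a linearly interpolated label
# (0 = top of the head, 180 = soles of the feet).
landmark_slice = np.array([0, 150, 400, 700])        # slice indices of landmarks (hypothetical)
landmark_label = np.array([0.0, 30.0, 90.0, 180.0])  # known axial positions (hypothetical)
labels = np.interp(np.arange(701), landmark_slice, landmark_label)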
The dimensionality-reduced data for 39 classes (axial positions 52–90) of the CT dataset are shown in Fig. 5a for both PCA and FEM. There are a number of interesting observations. First, the change in colour, from red to magenta through yellow, green, cyan, blue and red-violet, corresponds to the decrement of axial height in the actual CT examination (52–90). FEM preserves the relationship of height with the data better than PCA, as, in PCA, we see that green, cyan, blue and red-violet almost overlap with each other. The heights that they represent therefore cannot be separated as well as with FEM. Second, as the difference between two classes represents the same distance in height, they should have similar distances after dimensionality reduction. This was better preserved in the FEM results compared with the results from PCA, in which the span of red is much larger than that of other colours. Third, the data represent two histograms in polar space. We see that the change in height is actually represented in the FEM results as a circular path (from red to magenta). In PCA, this is not obvious at all. Thus, FEM preserves the actual data characteristics better than PCA. Fourth, FEM provides better spatial separation between classes than PCA. This can be clearly seen when we compare the distances between green and red-violet points and between yellow and blue points. These four observations become more evident when analysing the heat maps of normalized mean geodesic distance between data classes in the first two components of PCA and FEM (Fig. 5b). In the heat map of FEM components (Fig. 5b, bottom), there is nearly a linear relationship between the data points of FEM components from different classes in terms of geodesic distance. As the distance between data points from different classes in physical space increases, the geodesic distance between them increases in FEM-embedding space at a linear rate, and vice versa. Such a linear relationship is not as evident for PCA (Fig. 5b, top).

The other three methods, t-SNE, Dhaka and ivis, did not maintain the relationship of height with the data (Fig. 5c). t-SNE created hundreds of inaccurate clusters, in which no relationship of the height with the data could be found (Fig. 5c, left). Dhaka created a low-dimensional representation that is totally inconsistent with the actual height information. Ivis created large space differences between red and yellow points, yet it projected the green and blue points close to each other. The failure of these methods to maintain the data structure can be better seen in the heat maps (Fig. 5c). No linear relationship between class distance and geodesic distance can be observed from these results, in contrast to what was found for FEM. It was not possible to compute the geodesic distance, and therefore the heat map, from the t-SNE representation, owing to the complete separation of the data classes26.

The preservation of data characteristics and the maximization of space between classes by FEM become clearer in Fig. 5d, which shows the dimensionality-reduced data with 5 classes, for axial locations 20–60. In the FEM results, if we start from the red points (20) and then go through the green points (30), the blue points (40), the magenta points (50) and the black points (60), we see that this creates a radial path from higher height to lower height.
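The heat maps of normalized mean geodesic distance (Fig. 5b) can be approximated as follows (a sketch of ours, with the geodesic distance taken, as is conventional, as the shortest-path length through a k-nearest-neighbour graph of the embedded points; scikit-learn and scipy assumed):

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def class_geodesic_heatmap(embedding, labels, n_neighbors=10):
    """Normalized mean geodesic distance between every pair of classes.
    Entries become infinite if the k-NN graph is disconnected, which is why the
    heat map cannot be computed for the fully separated t-SNE clusters."""
    G = kneighbors_graph(embedding, n_neighbors=n_neighbors, mode='distance')
    geo = shortest_path(G, directed=False)
    classes = np.unique(labels)
    H = np.empty((classes.size, classes.size))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            H[i, j] = geo[np.ix_(labels == a, labels == b)].mean()
    return H / H.max()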
[Fig. 5 appears here: PCA (PC1/PC2), FEM (FEM1/FEM2), t-SNE, Dhaka and ivis scatter plots of the CT-localization data (39 classes in a and c; 5 classes in d) and heat maps of normalized mean geodesic distance between class numbers 52–90 (b,c). See caption below.]
Fig. 5 | Visualization of high-dimensional CT-scan localization data in two dimensions. a–d, Analysis of two datasets; there are 39 classes in the dataset
analysed in a–c, and 5 classes in the dataset analysed in d. Each class corresponds to a distinct axial location in the CT scan. In a–c, axial locations 52–56,
57–60, 61–69, 70–73, 74–80, 81–85 and 86–90 correspond to red, yellow, green, cyan, blue, red-violet and magenta with different shades, respectively. The
heat maps of normalized mean geodesic distance among all 39 classes from the first two components of PCA and FEM are shown in b. In d, red, green, blue,
magenta and black points correspond to the axial locations 20, 30, 40, 50 and 60, respectively. Data were projected onto the first two principal components
(a (top) and d (left)) and FEM components (a (bottom) and d (right)). The dimension-reduced data and heat maps for t-SNE, Dhaka and ivis for the first
dataset are shown in c. The dimension-reduced data from t-SNE, Dhaka and ivis for the second dataset are shown in d. FEM components better preserve
the radial characteristics of the data, and FEM also better distinguishes between data points of different colours, as is evident from the scatter plots and the
heat maps. In the FEM results in d (right), the data are better clustered into five classes than in the PCA results (d, left). t-SNE, Dhaka and ivis did
not retain any relationship between axial locations and data classes in both cases.
For PCA, t-SNE, Dhaka and ivis, no such radial nature of the data can be seen. For FEM, the data points are also much better clustered into the five classes. This clustering is absent in the results from the other methods.

FEM better classifies data from a wide range of datasets. We chose three more datasets to show the better dimensionality reduction achieved by FEM in terms of classification. These datasets were acquired for emotion classification and human-activity classification, from wearable sensors and smartphones.
[Fig. 6 appears here: PCA, t-SNE, Dhaka, ivis and FEM scatter plots for the EEG emotion data (a), the wearable-sensor running/cycling data (b) and the smartphone walking/other-activities data (c), with bar graphs of accuracy (d) and NMI (e) for the PredEmotion, MobileHealth and HumanAct datasets. See caption below.]
Fig. 6 | Classification of biomedical data. a, Emotion classification from EEG data. b,c, Human-activity classification (running and cycling (b); and
walking and other activities (c)) from data from wearable sensors and smartphones, respectively. PCA, t-SNE, Dhaka, ivis and FEM results are shown.
d, Classification results from training a multiclass error-correcting output codes model using support vector machine binary learners. PredEmotion,
emotion classification; mobileHealth, human-activity classification from smartphone data; humanAct, human-activity classification from wearable
sensors data. e, NMI values for different methods. FEM shows better separation of the data distribution for all three datasets. For emotion classification
and human-activity classification from wearable sensors data, PCA, t-SNE, Dhaka and ivis do not show any separation between data clusters. For human
activity classification from smartphone data, PCA and ivis perform better than t-SNE and Dhaka. However, even in this case, the ivis results are spread
and there is less consistency among data points in the clusters. In all three cases, t-SNE produced results with inaccurate and sporadic clusters. The better
dimensionality reduction that was achieved by FEM is reflected in the quantitative evaluation, in which FEM achieved maximum accuracy and NMI index
values in all cases. For d and e, the error bars represent the s.d. of the indices from the mean values for 100 different initializations of k-means clustering.
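The classification protocol behind Fig. 6d (a multiclass error-correcting output codes (ECOC) model with SVM binary learners, presumably MATLAB's fitcecoc) has a close scikit-learn counterpart, sketched below; 'embedding' and 'y' are placeholders for the dimension-reduced data and its class labels:

from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

# ECOC reduces the multiclass problem to several binary problems, each solved by
# an SVM; predictions are decoded from the combined binary outputs.
ecoc = OutputCodeClassifier(SVC(kernel='linear'), code_size=2.0, random_state=0)
scores = cross_val_score(ecoc, embedding, y, cv=5)
print(scores.mean(), scores.std())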
The dataset for emotion classification consists of electroencephalogram (EEG) brainwave data processed using a previously described statistical extraction method27. The EEG brainwave data were collected from two human participants (1 male, 1 female) for 3 min per state (positive, neutral and negative). The dimension-reduced data using PCA, t-SNE, Dhaka, ivis and FEM are shown in Fig. 6a. PCA, t-SNE and Dhaka did not show any kind of distribution difference between classes. Ivis performed better than PCA, t-SNE and Dhaka, but could not separate the data classes as efficiently as FEM. The results of the other seven competing methods are provided in Supplementary Information 9.

Similar scenarios can also be seen for the dataset for human-activity classification from wearable sensors. This dataset comprises body-motion and vital-sign recordings of ten volunteers of diverse profiles while performing several physical activities, namely standing still, sitting and relaxing, lying down, walking, climbing stairs, waist bends forward, frontal elevation of arms, knees bending (crouching), cycling, jogging, running, and jumping forwards and backwards15,28. The dimensionality-reduction problem is set by choosing the data points that correspond to running and cycling from this dataset. PCA, t-SNE, Dhaka, ivis and FEM were used to reduce the data dimensionality. The first two dimensions from these analyses are shown in Fig. 6b. We see that the data points that correspond to the running and cycling classes cannot be separated in the results from PCA, t-SNE, Dhaka and ivis, as the data classes are mixed with each other. For FEM, data from the two classes are almost linearly separable, as the red points (corresponding to the running class) are clustered on the right of the figure and the cyan points are clustered on the left (Fig. 6b, right).

Another dataset for human-activity classification was built from recordings of 30 participants while they performed daily activities (for example, walking and other activities, such as sitting, standing and lying down). A waist-mounted smartphone (Samsung Galaxy S II) with embedded inertial sensors was used to capture three-axis linear acceleration and three-axis angular velocity at a constant rate of 50 Hz29.
The dimension-reduced data from all of the methods are shown in Fig. 6c. From these figures, it is seen that all five methods are able to separate the data classes, and that the data-cluster quality is better for PCA and FEM. As with the other datasets, t-SNE creates inaccurate data clusters.

The efficacy of FEM in reducing the dimensions of the data can be more easily appreciated from the quantitative evaluation of all five methods for the above three datasets, as shown in Fig. 6d. Both the classification accuracy and the NMI indices of FEM are maximal for all three datasets. For the smartphone-acquired human-activity-classification dataset, all four methods show results comparable to those of FEM. However, for the first two datasets, FEM provides a much better accuracy and NMI.

We performed additional analyses on synthetic data and on datasets from various biomedical fields, and provide the results in the Supplementary Information. The results with synthetic data are presented in Supplementary Information 1. The results from 12 methods, including FEM, for different applications include the following: biomedical feature-based orthopaedic patient classification (Supplementary Information 5), gender classification on the basis of voice data (Supplementary Information 6), heart-disease detection (Supplementary Information 7), chronic-kidney-disease detection (Supplementary Information 8), breast-cancer detection (Supplementary Information 10), splice-junction detection (Supplementary Information 12), cardiotocography classification (Supplementary Information 13), drug-consumption analysis (Supplementary Information 14), lung-cancer detection (Supplementary Information 15) and feature visualization for a simple framework for contrastive learning of visual representations30 (Supplementary Information 20). The results show that, in a few cases, the performances of t-SNE, KPCA, MDS and other methods are comparable to that of FEM. However, no method except for FEM performed as well in all cases.

Discussion
We have described a general data-dimensionality-reduction method, FEM, which, in contrast to widely used algorithms, is driven by data through self-learning. It first learns the data structure and component characteristics, and then increases the separation of the components at a high number of dimensions. Owing to these features, the data components remain well separated even when the dimensionality of the data is reduced. We have provided results of the application of FEM to problems from diverse fields of biomedical research to show the applicability, robustness and accuracy of FEM. In particular, FEM outperforms widely used methods in most cases.

Specifically, we showed that protein-expression data from mice with DS display a different distribution in a low-dimensional representation compared with the data from healthy mice. Similarly, data from the treated mice show a different distribution compared with data from the untreated mice. This analysis can therefore help to detect mice with DS, or the effect of treatment on the mice. Moreover, the fact that some of the data points are closer to each other in a low-dimensional representation may indicate the severity of DS, or a high versus low impact of the treatment. Similarly, the analysis of gene-sequencing data from a patient with leukaemia before and after haematopoietic stem cell transplantation (HSCT) with different types of cell might help with finding relationships between sequencing data in different cells, as well as the effects of HSCT treatment on different patients at the cellular level. The analysis of CT images may help to find the location of CT slices from only the images, as well as to detect abnormalities in the images.

FEM involves three main operations: (1) K-subspace clustering31, (2) k-centres clustering32 and (3) autoencoding33. In each of the three steps, there are random initializations of parameters for starting the computation. In K-subspace clustering, the orthonormal subspaces are chosen randomly; in k-centres clustering, the centres of data clusters are chosen randomly; in autoencoding, the initial neuron weights are chosen randomly. As described previously31–33, random initializations of all three methods result in stable results for well-behaved datasets, and the optimization procedures for all three methods are numerically stable. FEM is therefore guaranteed to be numerically stable and is expected to provide sensible results for all initializations. We provide an analysis of the stability of FEM in Supplementary Information 27, in which we analysed a simulated dataset with two Gaussian data classes of the same mean (0) but different variances (1 and 2) with FEM for 1,000 different initializations. FEM separates the data classes with high accuracy for all 1,000 initializations (the computed accuracy was 99.50% with 0.6% s.d.). We note that, to our knowledge, no other dimensionality-reduction method has the ability to separate these data classes. For this dataset, Supplementary Fig. 72 shows FEM, PCA and t-SNE visualizations for one initialization.

As there is no inherent assumption in FEM, adding constraints on its behaviour may result in application-specific data-dimensionality reduction. As examples, we show in Supplementary Information 2, through reasoning and a number of simulation results, that different dimensionality-reduction techniques, such as PCA, KPCA, ICA, MDS and Isomap, can be built by setting constraints on different steps of FEM. As such, these methods may be considered to be subsets of FEM. The traditional distance-learning methods can also be considered to be subsets of FEM: these methods learn the distance metric only, overlooking the central-tendency measure, whereas FEM learns both. In Supplementary Information 3, we show that the inclusion of a central-tendency measure is imperative, and that the performance of dimensionality-reduction methods can be degraded if the central tendency is not selected appropriately.

In our implementation of FEM, we considered only the medoid and centroid as the central tendency, and a limited number of distance metrics. However, other central-tendency measures, such as the geometric mean, trimean and trimmed mean, can also be used on the basis of applications in a particular field. In the shared code, we provide an implementation of FEM in which any central tendency and distance metric can be included.

A limitation of FEM is that it is computationally more expensive than many other available techniques, as it tries to find the best description of the data structure by computing cluster-quality indices for different central-tendency and dispersion measures. However, it should be noted that this step may be skipped in applications in which this information is known a priori, or when it can be learnt from a representative dataset. Another limitation is the requirement for a sufficient amount of data, so that the actual data characteristics can be learnt. This may prevent the use of FEM in applications in which only small amounts of data are available for analysis. Finally, because it is possible to include only a finite number of central-tendency and distance metrics in FEM, for particular datasets, the actual central tendency or distance metric may be missed. Moreover, in many cases, a combination of distance metrics may be appropriate. In such cases, FEM finds the closest distance metric that fits the data and reduces the dimensions on the basis of it.

The computational complexity of subspace clustering in FEM is O(N^3) (ref. 34) for N data points. The complexity of k-centres clustering in FEM is O(N × (E × F × P + Q) × D × i) (ref. 35), where P is the desired reduced data dimension, Q is the dimension of the distance matrix (Supplementary Fig. 69), E and F are the numbers of central-tendency and distance metrics considered, D is the data dimensionality and i is the number of iterations necessary for k-centres clustering to reach the solution. In the end, the complexity of the autoencoder is O(e × N × (D × P)), where e denotes the epoch number for training (Supplementary Information 24). Therefore, the overall complexity of FEM is O(N^3 + N × (E × F × P + Q) × D × i + e × N × (D × P)).
Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.
Data analysis: The custom codes were written in Matlab 2019a. All data analyses have been performed in Matlab 2019a. The implementation code of the FEM algorithm is available for research uses at https://github.com/tauhidstanford/Feature-augmented-embedding-machine.
Data
The main data supporting the results in this study are available within the paper and its Supplementary Information. The protein-expression dataset from mice is available at https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression. The single-cell RNA-seq data of leukemia patients are available at https://support.10xgenomics.com/single-cell-gene-expression/datasets. The CT dataset is available at https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis. The datasets for emotion classification and human-activity classification, and the data from wearable sensors and from smartphones, were downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets).
Replication: All experiments were conducted in a manner that produced clean replicates.
Randomization: For clustering, the initializations in kmeans++ and partition around medoids were performed randomly.
Blinding: The analysed datasets were acquired from established databases. No knowledge of the analysed data classes has been used in algorithm development. Knowledge of the data classes has only been used to measure the performance of the developed method.