Original Articles
Analysis of Clinical Flow Cytometric
Immunophenotyping Data by Clustering on
Statistical Manifolds: Treating Flow Cytometry
Data as High-Dimensional Objects
William G. Finn,1* Kevin M. Carter,2 Raviv Raich,3 Lloyd M. Stoolman,1 and Alfred O. Hero2
1 Department of Pathology, University of Michigan, Ann Arbor, Michigan 48109
2 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109
3 School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331
Background: Clinical flow cytometry typically involves the sequential interpretation of two-dimensional
histograms, usually culled from six or more cellular characteristics, following initial selection (gating) of
cell populations based on a different subset of these characteristics. We examined the feasibility of
instead treating gated n-parameter clinical flow cytometry data as objects embedded in n-dimensional
space using principles of information geometry via a recently described method known as Fisher Information
Non-parametric Embedding (FINE).
Methods: After initial selection of relevant cell populations through an iterative gating strategy, we
converted four-color (six-parameter) clinical flow cytometry datasets into six-dimensional probability density
functions, and calculated differences among these distributions using the Kullback-Leibler divergence
(a measurement of relative distributional entropy shown to be an appropriate approximation of
Fisher information distance in certain types of statistical manifolds). Neighborhood maps based on
Kullback-Leibler divergences were projected onto two-dimensional displays for comparison.
Results: By unsupervised clustering, these methods effectively separated cases of acute lymphoblastic
leukemia from cases of expansion of physiologic B-cell precursors (hematogones) within a set of
54 patient samples.
Conclusions: The treatment of flow cytometry datasets as objects embedded in high-dimensional space
(as opposed to sequential two-dimensional analyses) harbors the potential for use as a decision-support
tool in clinical practice or as a means for context-based archiving and searching of clinical flow
cytometry data based on high-dimensional distribution patterns contained within stored list mode
data. Additional studies will be needed to further test the effectiveness of this approach in clinical
practice. © 2008 Clinical Cytometry Society
Key terms: flow cytometry; statistical manifold; information geometry; immunophenotyping; immunophenotype clustering
How to cite this article: Finn WG, Carter KM, Raich R, Stoolman LM, Hero AO. Analysis of clinical flow cytometric
immunophenotyping data by clustering on statistical manifolds: Treating flow cytometry data as high-dimensional
objects. Cytometry Part B 2009; 76B: 1–7.
tively realized by systems that treat single multicolor analyses as individual high-dimensional datasets (1–8).

The analysis of high-dimensional datasets has become more common in the age of applied genomics and proteomics. However, the fact that all measured characteristics of a given analysis can be traced to each individual cell gives the dimensionality of flow cytometry a uniquely spatial characteristic not shared by other proteomic platforms (7,9). Each individual tube analyzed in a routine n-parameter flow cytometry study can be represented conceptually as a single object embedded in n-dimensional space and formed in aggregate by thousands of analyzed cells, each of which displays a unique n-dimensional signature. Just as an ordinary object is better described by its shape and overall appearance than by the measuring of its individual dimensions, one could consider the possibility that flow cytometry data could be better represented by the general shape of a cell population over all of the dimensions analyzed (5). Since we live in three-dimensional space, direct visualization of a four-color (six-dimensional) flow cytometry dataset as a six-dimensional object is not feasible. However, rather than utilizing the interpretation of sequential two-dimensional projections of this six-dimensional object (as is the current norm), analytical methods can be devised for the comparison of separate datasets embedded as unique objects in six-dimensional space.

The analysis of high-dimensional datasets often involves characterizing the manifold within which the data are assumed to be embedded. In layman's terms, the mathematical concept of a manifold could be defined as a smooth space or surface (of any dimensionality) that is nearly "flat" on small scales, and within which geometrical objects may be embedded. Examples could include a sphere, a torus, Euclidean space in general, and indeed our three-dimensional universe. The field of manifold learning involves the discovery of lower dimensional manifolds for objects embedded in higher dimensional space and is often applied to dimensionality reduction of high-dimensional datasets (10).

It is often assumed that high-dimensional datasets can be appropriately represented on Euclidean manifolds (manifolds comprised of points or coordinates embedded within Euclidean space). However, there are many problems in which the data cannot be appropriately represented by a Euclidean manifold, and the model parameters are unspecified and must be learned through the data. In such cases, it may be helpful to assume that the data lie in a manifold composed not of individual spatial coordinates, but of probability density functions. The term statistical manifold has been used to describe such manifolds composed of probability density functions rather than spatial coordinates (11,12).

The emerging field of information geometry involves the analysis of probability distributions as geometric structures within non-Euclidean space and can be applied to the study of statistical manifolds (13). The distance between points or objects on a statistical manifold can be measured by a distance function known as the Fisher information metric (12). However, calculating the Fisher information metric requires knowledge of the underlying parameterization of the assumed manifold, knowledge that is generally not available or feasible in the analysis of flow cytometry datasets.

Recently, Carter et al. described a nonparametric approach to clustering and classification on statistical manifolds using a similarity measurement known as the Kullback-Leibler divergence (commonly referred to as the relative entropy of a probability distribution) as an estimate of the Fisher information distance for statistical manifolds for which parameterization is unknown, and for which individual data points lie in reasonably close proximity (as would generally apply to immunophenotypic analysis of distinct cell populations by multiparameter flow cytometry) (12,14). As a given manifold is more densely sampled, the Kullback-Leibler divergence converges to the Fisher information distance. This approach has been termed Fisher Information Non-parametric Embedding (FINE) (12).

In this study, we attempted to apply these principles to the interpretation of flow cytometry datasets as high-dimensional objects generated by probability density functions embedded on a statistical manifold (as opposed to sequential groups of individual light scatter characteristics or surface antigens). As an initial test of this approach, we chose to compare the immunophenotypic patterns of leukemic B-precursor lymphoblasts against the immunophenotypic patterns of physiologic B-cell precursors (hematogones), since distinction between these often similar cell types is an important and sometimes challenging task that often confronts practicing hematopathologists on the day-to-day diagnostic service (15).

MATERIALS AND METHODS

Case Selection

The use of previously analyzed clinical flow cytometry data for cluster analysis was approved by our Institutional Review Board. The files of the clinical flow cytometry laboratory at the University of Michigan were searched for cases coded as B-precursor acute lymphoblastic leukemia (ALL) based on complete diagnostic assessment including morphologic assessment of marrow, flow cytometric immunophenotyping, and cytogenetic analysis where indicated per World Health Organization diagnostic criteria (16). From this list, cases were selected that had sufficient available list mode data and sufficient cells for analysis, searching back from the most recent cases available. Thirty-one cases of ALL were retrieved for analysis, spanning an approximately 18-month period. For comparison, the flow cytometry database was manually screened for the presence of cases with hematogone hyperplasia, and from this screen 23 cases were retrieved showing prominent hematogone populations, again based on a combination of morphologic assessment, clinical correlation, and flow cytometric immunophenotyping based on previously published descriptions of hematogone immunophenotypes (15).
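The analytical idea introduced above (one probability density function per gated sample, pairwise Kullback-Leibler divergences between those densities, and a two-dimensional projection of the resulting distances) can be illustrated with a minimal Python sketch. It is not the implementation used in this study. The sketch assumes synthetic six-parameter data, a fixed-bandwidth isotropic Gaussian kernel density estimate, the square root of the symmetrized Kullback-Leibler divergence as the distance (which agrees with the Fisher information distance to second order for nearby densities), and classical multidimensional scaling for the projection. It omits any neighborhood-graph or geodesic step. All function names and parameter values are hypothetical.

```python
import numpy as np


def kde_logpdf(points, data, bandwidth):
    """Log density at `points` under an isotropic Gaussian KDE fitted to `data`."""
    d = data.shape[1]
    # Squared Euclidean distance from every evaluation point to every data point.
    sq = ((points[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    log_k = -0.5 * sq / bandwidth**2 - 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    # Numerically stable log of the mean kernel value.
    m = log_k.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_k - m).mean(axis=1, keepdims=True))).ravel()


def symmetric_kl(a, b, bandwidth=1.0):
    """Plug-in estimate of KL(p_a || p_b) + KL(p_b || p_a) from two cell samples."""
    kl_ab = np.mean(kde_logpdf(a, a, bandwidth) - kde_logpdf(a, b, bandwidth))
    kl_ba = np.mean(kde_logpdf(b, b, bandwidth) - kde_logpdf(b, a, bandwidth))
    return max(kl_ab + kl_ba, 0.0)  # clip small negative estimates to zero


def classical_mds(dist, n_components=2):
    """Embed a distance matrix in `n_components` dimensions (classical MDS)."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist**2) @ centering
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))


def embed_samples(samples, bandwidth=1.0):
    """Pairwise sqrt(symmetric KL) distances and their 2-D embedding."""
    n = len(samples)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = np.sqrt(symmetric_kl(samples[i], samples[j], bandwidth))
            dist[i, j] = dist[j, i] = d_ij
    return dist, classical_mds(dist)


# Synthetic stand-in for gated list-mode data: 3 + 3 "patients",
# 200 cells each, 6 parameters (assumed values, not real measurements).
rng = np.random.default_rng(0)
group_a = [rng.normal(0.0, 1.0, size=(200, 6)) for _ in range(3)]
group_b = [rng.normal(1.5, 1.0, size=(200, 6)) for _ in range(3)]
dist, coords = embed_samples(group_a + group_b)
print(np.round(dist, 2))
```

In this toy setting, samples drawn from the same synthetic population lie closer to one another than to samples from the other population. The plug-in estimate is biased in finite samples and sensitive to the bandwidth, so real data would call for more careful density estimation.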
FIG. 3. Contour plots of CD38 versus CD10 expression for several data sets. The top row corresponds to hematogone hyperplasia (HP) cases, and
the bottom row represents acute lymphoblastic leukemia (ALL) cases. The selected patients are those most similar between disease classes, the
centroids of each disease class, and those with little similarity between disease classes, as highlighted in Figure 2.
entire distribution formed by multicolor flow cytometric analysis of cell suspensions. Our study was performed using archived clinical four-color datasets. The power of this approach could be magnified considerably if applied to higher dimensional datasets (10 color and beyond) currently deployed in research settings (22).

To our knowledge, our study is the first to employ the principles of information geometry and statistical manifold embedding in the comparison of flow cytometry results between different patient samples. However, previous studies have described methods that treat flow cytometry output as single high-dimensional datasets rather than as collections of two-dimensional projections. Roederer et al. described systems based on probability binning of n-dimensional data, including the use of an algorithm that identified geographic regions in n-dimensional space that contain significantly more or fewer events than other areas (7,23). They termed this statistical comparison of event numbers in high-dimensional space "frequency difference gating." Zeng et al. and Zamir et al. described approaches with some conceptual similarity to ours but with different methods (2,6). Zamir et al. evaluated single four-color (six-dimensional) flow cytometry assays by converting each of them into a single matrix with the number of rows equal to the number of cells analyzed, and the number of columns equal to the number of measured flow cytometry characteristics (in this case, 6), each normalized to a mean of zero and standard deviation of 1. The matrices were then subjected to statistical clustering methods for the classification of different cell populations within the sample. Although this method was based on the analysis of a six-dimensional dataset as a single entity, it maintained the identity of each cell as a discrete point in the matrix, without conversion to probability density functions as in our study. Zeng et al. used a kernel density estimation method similar to ours to convert high-dimensional flow cytometry datasets into probability density functions, but then used histogram features extracted from each dimension of the probability density function to guide k-means clustering as a means to identify discrete cell populations within a given dataset. Pedreira et al. described a multidimensional classification approach for automated flow cytometry analysis that, like our method, treated flow cytometry datasets as objects embedded in n-dimensional space and did not require the application of an assumed distribution onto the flow cytometry dataset, but did not use the specific principles of information geometry outlined in the current study (8).

There are limitations to the treatment of entire flow cytometry datasets as single high-dimensional distributions. For example, patients with immunophenotypically identical abnormal cell populations would likely be clustered separately depending on the nature of the non-neoplastic background cells or on the sheer percentage of abnormal cells in the sample. For this reason, we chose in this study to purify the cells of interest through an iterative list-mode selection process before application of the manifold learning algorithm. One could argue, however, that the analysis of entire datasets (including both normal and abnormal cell types) would be of potential value, since the nature of the host response may be distinct in a given disease process and may be represented by the immunophenotypic pattern of non-neoplastic cells in the sample. Furthermore, the nature of flow cytometry data allows for the virtual selection of numerous different cell types without preanalytical sorting or isolation, and subsequent analysis of these subsets via manifold learning. A caveat, of course, is that any given process of selection for cell populations of interest could influence the subsequent clustering algorithm, and minor differences in cell selection strategies could harbor the potential to inordinately affect the clustering due to potential inconsistencies in initial data selection. The influence of various preanalytical factors (number of colors in the analysis, presence of normal cell populations, cell selection strategies, etc.) on the performance of this statistical manifold clustering approach will have to be evaluated in expanded prospective studies.

In summary, this study was an attempted demonstration of principle for the analysis of clinical flow cytometry data as individual high-dimensional datasets using the principles of information geometry and statistical manifolds. Such an approach may harbor potential for the development of decision support tools and context-based search capability in clinical flow cytometry laboratories, and for the analysis of flow cytometry data as a proteomic discovery tool. Additional studies will be required to formally assess the potential utility of this approach for such specific applications.

LITERATURE CITED

1. Valet GK, Hoffkes HG. Automated classification of patients with chronic lymphocytic leukemia and immunocytoma from flow cytometric three-color immunophenotypes. Cytometry 1997;30:275–288.
2. Zamir E, Geiger B, Cohen N, Kam Z, Katz BZ. Resolving and classifying haematopoietic bone-marrow cell populations by multi-dimensional analysis of flow-cytometry data. Br J Haematol 2005;129:420–431.
3. Collins GS, Krzanowski WJ. Nonparametric discriminant analysis of phytoplankton species using data from analytical flow cytometry. Cytometry 2002;48:26–33.
4. Boddy L, Wilkins MF, Morris CW. Pattern recognition in flow cytometry. Cytometry 2001;44:195–209.
5. Toedling J, Rhein P, Ratei R, Karawajew L, Spang R. Automated in-silico detection of cell populations in flow cytometry readouts and its application to leukemia disease monitoring. BMC Bioinformatics 2006;7:282.
6. Zeng QT, Pratt JP, Pak J, Ravnic D, Huss H, Mentzer SJ. Feature-guided clustering of multi-dimensional flow cytometry datasets. J Biomed Inform 2007;40:325–331.
7. Roederer M, Hardy RR. Frequency difference gating: A multivariate method for identifying subsets that differ between samples. Cytometry 2001;45:56–64.
8. Pedreira CE, Costa ES, Arroyo ME, Almeida J, Orfao A. A multidimensional classification approach for the automated analysis of flow cytometry data. IEEE Trans Biomed Eng 2008;55:1155–1162.
9. Perez OD, Nolan GP. Phospho-proteomic immune analysis by flow cytometry: From mechanism to translational medicine at the single-cell level. Immunol Rev 2006;210:208–228.
10. Law M. Manifold Learning (Web Page). 2008. Available at: http://www.cse.msu.edu/lawhiu/manifold/. Accessed January 25, 2008.
11. Lee S, Abbott AL, Clark N, Araman P. Active contours on statistical manifolds and texture segmentation. In the IEEE International Conference on Image Processing, IEEE; Genoa, Italy: 2005. pp 828–831.