Correlation Visualization of High Dimensional Data
Ignacio Díaz Blanco, Abel A. Cuadrado Vega, and Alberto B. Diez González
Abstract. Correlation analysis has always been a key technique for understanding data. However, traditional methods apply only to the data set as a whole, providing purely global information on correlations. Correlations usually have a local nature: two variables can be directly correlated at some points of a data set and inversely correlated at others. This situation arises typically in nonlinear processes. In this paper we propose a method to visualize the distribution of local correlations across the whole data set using dimension reduction mappings. The ideas are illustrated through an artificial data example.
1 Introduction
Visualization and dimension reduction techniques have received considerable attention in recent years for the analysis of large sets of multidimensional data [1–3], and particularly for the supervision and condition monitoring of complex industrial processes [4–6]. These techniques make it possible to discover unknown features and relationships of high dimensional data in a visual manner by means of a mapping from a data space D (also called the input space) onto a low dimensional visualization space V, where complex relationships among input variables can be easily represented and visualized while preserving the information significant to a given problem.
Another very useful technique when dealing with high dimensional data is correlation analysis. Correlation analysis is concerned with finding how the components x_1, ..., x_p of the sample data vectors {x_i}, i = 1, ..., n, are mutually related. The standard way to cope with this problem is through the analysis of second order statistics such as the correlation matrix R, whose coefficients r_ij ∈ [−1, 1] describe how variables x_i and x_j are related. These coefficients are the result of a normalized inner product (the cosine) between the vectors formed by the values of x_i and x_j over the whole data set and, in consequence, they provide correlation information of a global nature. However, in many cases data variables can be correlated in different ways in different regions of the data space. This is the case, for instance, of multimodal or nonlinear processes, which behave locally in different ways depending on the working point. Thus, a local description of correlation is needed.
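The global coefficient r_ij described above is just the cosine of the angle between the centered value vectors of two variables. A minimal sketch of this computation (the variable and function names are ours, not the paper's):

```python
import numpy as np

def correlation_matrix(X):
    """Global correlation matrix: normalized inner product (cosine)
    between the centered columns of X, an (n, p) array of n samples
    of p variables."""
    Xc = X - X.mean(axis=0)               # center each variable
    norms = np.linalg.norm(Xc, axis=0)    # length of each centered column
    return (Xc.T @ Xc) / np.outer(norms, norms)

# Two perfectly inversely related variables give r_xy = -1 globally,
# even when a local analysis would tell a richer story.
X = np.array([[0.0, 0.0], [1.0, -1.0], [2.0, -2.0], [3.0, -3.0]])
R = correlation_matrix(X)
```

Note that a single global coefficient like this is exactly what the local analysis of the next section refines.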
In this paper, we suggest a method to combine correlation analysis with the power of dimension reduction visualization methods, such as the Self-Organizing Map (SOM) [7] or the Generative Topographic Map (GTM) [8], making it possible to visualize the local correlations of each pair of variables x_i, x_j through so-called correlation maps defined in the visualization space. The paper is organized as follows. In section 2 the ideas of local covariance and local correlation are introduced, and a method to display the information provided by local second order statistics in the visualization space is proposed. In section 3 the proposed ideas are illustrated through a simple example. Finally, in section 4 some concluding remarks and future research lines are outlined.
2 Correlation Maps
Let {x_i} be the data vectors and let w_i(y) be weights expressing the degree to which data vector x_i belongs to the neighborhood associated with a point y of the visualization space; the width factor σ of the weighting kernel controls the degree of locality, as discussed below. The local mean and local covariance at y are defined as

\[
\mathbf{m}(y) = \frac{\sum_i \mathbf{x}_i \, w_i(y)}{\sum_i w_i(y)} \tag{1}
\]

\[
\mathbf{C}(y) = (c_{ij}) = \frac{\sum_i [\mathbf{x}_i - \mathbf{m}(y)][\mathbf{x}_i - \mathbf{m}(y)]^T \, w_i(y)}{\sum_i w_i(y)} \tag{2}
\]
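Expressions (1) and (2) can be sketched numerically as follows; the Gaussian form of the weights w_i(y) and the use of projected coordinates Y are our assumptions, since the text only names the width factor σ:

```python
import numpy as np

def local_moments(X, Y, y, sigma=1.0):
    """Local mean m(y), eq. (1), and local covariance C(y), eq. (2).

    X : (n, p) data vectors in the data space D.
    Y : (n, q) their coordinates in the visualization space V
        (e.g. SOM best-matching-unit positions) -- an assumption.
    y : query point in V; sigma is the width factor controlling locality.
    """
    # Gaussian weights in V (assumed kernel; the paper only names sigma).
    w = np.exp(-np.sum((Y - y) ** 2, axis=1) / (2.0 * sigma**2))
    m = (w[:, None] * X).sum(axis=0) / w.sum()                # eq. (1)
    D = X - m                                                 # centered data
    C = (w[:, None, None] * D[:, :, None] * D[:, None, :]).sum(axis=0) / w.sum()  # eq. (2)
    return m, C

# With equal weights the local moments reduce to the ordinary sample moments.
X = np.array([[0.0, 0.0], [2.0, 2.0]])
Y = np.zeros((2, 1))
m, C = local_moments(X, Y, np.zeros(1))
```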
Taken independently, the p × p components c_ij(y) of the covariance matrix C(y) can be regarded as local covariance values which describe the local dependency between variables x_i and x_j. Expressions (1) and (2) represent local versions of the sample first and second order moments of the input data distribution around the image of point y in the visualization space, i.e., ψ(y), where the width factor σ is a design parameter related to the degree of locality to be taken into account, allowing a tradeoff to be established between global and local correlations.
The local covariance C(y) described in (2) defines in V a field of covariance matrices from D, each of which provides a local description of the second order statistical features of the data in D lying in the vicinity of ψ(y).
The local correlation matrix R(y) has p × p components

\[
r_{ij}(y) = \frac{c_{ij}(y)}{\sqrt{c_{ii}(y)\,c_{jj}(y)}} \tag{3}
\]

which represent the local correlation coefficient between variable x_i and variable x_j and always lie in the interval [−1, +1], where +1 denotes full direct correlation, 0 denotes absence of correlation, and −1 denotes full inverse correlation.
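The step from the local covariance C(y) to the local correlation R(y) is presumably the usual covariance-to-correlation normalization; a sketch under that assumption:

```python
import numpy as np

def local_correlation(C):
    """Turn a local covariance matrix C(y) into a correlation matrix R(y):
    r_ij = c_ij / sqrt(c_ii * c_jj), so every entry lies in [-1, +1] and
    the diagonal is exactly 1 (the usual normalization; assumed here)."""
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

# A toy local covariance with a perfect inverse linear dependency:
C = np.array([[4.0, -2.0], [-2.0, 1.0]])
R = local_correlation(C)
```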
Both the covariance matrix C(y) and the correlation matrix R(y) are defined for each point y of V. In addition, all the powerful geometrical and statistical interpretations underlying both matrices can be represented in V using scalar quantities. Thus, for instance, each component c_ij(y) or r_ij(y) defines a scalar quantity that can be represented in the same way as SOM planes, using a color code for each pixel y. In the same way, the principal values λ_i(y) of the covariance matrix or the components of the principal vectors u_i(y) can be represented as SOM planes.
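The principal values and principal vectors mentioned above are the pointwise eigen-decomposition of C(y); a minimal sketch for a single point y:

```python
import numpy as np

def principal_fields(C):
    """Principal values lambda_i(y) and principal vectors u_i(y) of a
    local covariance matrix C(y); each lambda_i or each vector component
    can then be color-coded over V like an ordinary SOM plane."""
    lam, U = np.linalg.eigh(C)      # symmetric eigendecomposition, ascending
    return lam[::-1], U[:, ::-1]    # reorder: largest principal value first

# A local covariance whose main local direction is the first axis:
C = np.array([[2.0, 0.0], [0.0, 0.5]])
lam, U = principal_fields(C)
```

Evaluating this at every pixel y of V yields one scalar map per eigenvalue or per vector component, exactly as with the c_ij(y) and r_ij(y) planes.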
This representation provides a unified visualization of the underlying correlations and, in general, of the second order statistical properties. Moreover, it is coherent with other SOM representations such as SOM planes or the u-matrix, providing insight into the pattern of correlation dependencies among variables and revealing the most important features describing the behavior of the underlying process in each data region.
All these ideas are illustrated in figures 1, 2 and 3. A simple 2D data set was used to train both a 1D-SOM and a 2D-SOM. Local covariances were obtained for the 2D-SOM using (1) and (2) and then plotted in both the data space D and the visualization space V. Local correlations were also obtained using (3) to build the correlation maps of r_xx, r_xy, r_yx, r_yy shown in figure 2. A set of points with negative local correlations (corresponding to the right part of the "arc" in the data) can be discovered by looking at the upper left corner of the r_xy plane. Similarly, moderately high correlations appear in the upper right corner of the map, revealing the positive local correlations existing in the left part of the "arc" in the data space. It can also be observed that the graphical information provided by the correlation maps in figure 2 is consistent with that shown in the SOM planes in figure 3, because both are descriptions in the same visualization space V. Finally, as expected, the planes r_xx and r_yy are identically 1, and r_xy = r_yx due to the symmetry of correlation matrices.
[Fig. 1: "1D-SOM in Data Space" (left) and "2D-SOM in Data Space" (right); plot content omitted.]
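The sign change along the "arc" can be reproduced with a self-contained sketch in which the SOM is replaced by a trivial 1-D projection (the first coordinate); the data set, kernel width and query points below are our choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500)
data = np.column_stack([x, 1.0 - x**2])   # an "arc"-shaped 2D data set

def local_rxy(yq, sigma=0.1):
    """Local correlation r_xy at visualization coordinate yq, using the
    first data coordinate itself as the (trivial) projection into V."""
    w = np.exp(-((x - yq) ** 2) / (2.0 * sigma**2))
    m = (w[:, None] * data).sum(axis=0) / w.sum()
    d = data - m
    cxy = (w * d[:, 0] * d[:, 1]).sum() / w.sum()
    cxx = (w * d[:, 0] ** 2).sum() / w.sum()
    cyy = (w * d[:, 1] ** 2).sum() / w.sum()
    return cxy / np.sqrt(cxx * cyy)

# Opposite local correlations on the two halves of the arc: left > 0, right < 0,
# while the global correlation of the full arc blurs the two regimes together.
left, right = local_rxy(-0.5), local_rxy(0.5)
```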
4 Concluding Remarks
We have proposed here a method for the visualization of local second order statistical properties using dimension reduction mappings such as (but not restricted to) the SOM. The proposed idea has strong connections with local model approaches such as [9], where local linear PCA projections are proposed to capture the nonlinear structure of data.
We showed here through an artificial data example how local second order statistical properties can be revealed by means of correlation maps.
[Figure 2: the four correlation map panels r_xx, r_xy, r_yx, r_yy over V, color scale from −1 to 1; plot content omitted.]
Fig. 2. Correlation maps for the 2D-SOM show a region in V (upper left) related to highly negative local correlations and another region (upper right) revealing positive local correlations.
[Figure 3: SOM planes for the 2D-SOM; plot content omitted.]
The ideas proposed in this paper are currently being tested in the steel industry to investigate the effects of several dozen process variables on several quality factors of the processed coils in a tandem mill, with encouraging results.
References
1. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric
framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22, 2000.
2. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by
locally linear embedding. Science, 290:2323–2326, December 22, 2000.
3. Jianchang Mao and Anil K. Jain. Artificial neural networks for feature extrac-
tion and multivariate data projection. IEEE Transactions on Neural Networks,
6(2):296–316, March 1995.
4. David J. H. Wilson and George W. Irwin. RBF principal manifolds for process
monitoring. IEEE Transactions on Neural Networks, 10(6):1424–1434, November
1999.
5. Teuvo Kohonen, Erkki Oja, Olli Simula, Ari Visa, and Jari Kangas. Engineering
applications of the self-organizing map. Proceedings of the IEEE, 84(10):1358–1384,
October 1996.
6. Esa Alhoniemi, Jaakko Hollmén, Olli Simula, and Juha Vesanto. Process mon-
itoring and modeling using the self-organizing map. Integrated Computer Aided
Engineering, 6(1):3–14, 1999.
7. Teuvo Kohonen. Self-Organizing Maps. Springer-Verlag, 1995.
8. Christopher M. Bishop, Markus Svensén, and Christopher K. I. Williams. GTM:
The generative topographic mapping. Neural Computation, 10(1):215–234, 1998.
9. M. Tipping and C. Bishop. Mixtures of probabilistic principal component analyz-
ers. Neural Computation, 11(2):443–482, 1999.
10. Juha Vesanto. SOM-based data visualization methods. Intelligent Data Analysis,
3(2):111–126, 1999.
11. Juha Vesanto and Esa Alhoniemi. Clustering of the self-organizing map. IEEE
Transactions on Neural Networks, 11(3):586–600, May 2000.