1 Introduction
Many complex systems observed in the physical, biological and social sciences
are organized in a nested hierarchical structure, i.e. the elements of the system
can be partitioned in clusters which in turn can be partitioned in subclusters
and so on up to a certain level (Simon, 1962). The hierarchical structure of
interactions among elements strongly affects the dynamics of complex sys-
tems. Therefore a quantitative description of hierarchies of the system is a
key step in the modeling of complex systems (Anderson, 1972). The analy-
sis of multivariate data provides crucial information in the investigation of
al., 2007c).
In the present paper we discuss in a coherent and self-consistent way (i) some
filtering procedures of the correlation matrix based on hierarchical clustering
and the bootstrap validation of hierarchical trees and correlation based net-
works, (ii) the hierarchically nested factor model, (iii) the Kullback-Leibler
distance between the probability density functions of two sets of multivariate
random variables and (iv) the retained information and stability of a filtered
correlation matrix. We apply the discussed concepts to a portfolio of stocks
traded in a financial market. The paper is organized as follows. In Section 2 we
discuss how to obtain hierarchical trees and correlation based trees or networks
from the correlation matrix of a complex system, and we discuss the role
of bootstrap in the statistical validation of hierarchical trees and correlation
based networks. In Section 3 we discuss the definition and the properties of a
factor model with independent factors which are hierarchically nested. In Sec-
tion 4 we present an empirical application of the hierarchically nested factor
model. Section 5 discusses how to quantify the information and stability of a
correlation matrix by using a Kullback-Leibler distance and Section 6 presents
the quantitative comparison of different filtering procedures performed with
the same distance. Section 7 briefly presents some conclusions.
The stock return correlation matrix computed by using T = 748 records is the following (the matrix is symmetric, so only the upper triangle is written):

         1.000  0.413  0.518  0.543  0.529  0.341  0.271  0.231  0.412  0.294
                1.000  0.471  0.537  0.617  0.552  0.298  0.475  0.373  0.270
                       1.000  0.547  0.592  0.400  0.258  0.349  0.370  0.276
                              1.000  0.664  0.422  0.347  0.351  0.414  0.269
    C =                              1.000  0.533  0.344  0.462  0.440  0.318     (1)
                                            1.000  0.305  0.582  0.355  0.245
                                                   1.000  0.193  0.533  0.591
                                                          1.000  0.258  0.166
                                                                 1.000  0.590
                                                                        1.000
where the order of elements of the correlation matrix from left to right and
from top to bottom is the one based on capitalization given above.
The starting point of both procedures is the empirical correlation matrix C. The following procedure performs the ALCA, giving as output a hierarchical tree and a filtered correlation matrix C<_ALCA:

(i) Set B = C.
(ii) Select the maximum correlation b_hk in the correlation matrix B. Note that h and k can be simple elements (i.e. clusters of one element each) or clusters (sets of elements). ∀ i ∈ h and ∀ j ∈ k one sets the elements ρ<_ij of the matrix C<_ALCA as ρ<_ij = ρ<_ji = b_hk.
(iii) Merge cluster h and cluster k into a single cluster, say q. The merging operation identifies a node in the rooted tree connecting clusters h and k at the correlation b_hk.
(iv) Redefine the matrix B:

    b_qj = (n_h b_hj + n_k b_kj) / (n_h + n_k)   if j ∉ h and j ∉ k,
    b_ij = b_ij                                  otherwise,

where n_h and n_k are the numbers of elements in clusters h and k, and repeat from step (ii) until B reduces to a single cluster.
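A minimal sketch of steps (i)-(iv) in code may help; the function name and the toy input below are our own, and the sketch assumes the correlation matrix is given as a square array:

```python
import numpy as np

def alca(C):
    """Average linkage cluster analysis (steps (i)-(iv) of the text).

    Returns the filtered correlation matrix C< in which rho_ij equals the
    correlation b_hk of the node where the clusters of i and j merge.
    """
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    B = C.copy()                            # (i) set B = C
    filtered = np.eye(n)
    clusters = {i: [i] for i in range(n)}   # live clusters: index -> members
    while len(clusters) > 1:
        # (ii) select the maximum off-diagonal correlation b_hk among live clusters
        live = sorted(clusters)
        h, k, best = None, None, -np.inf
        for a in live:
            for b in live:
                if a < b and B[a, b] > best:
                    h, k, best = a, b, B[a, b]
        # set rho_ij = rho_ji = b_hk for all i in h and j in k
        for i in clusters[h]:
            for j in clusters[k]:
                filtered[i, j] = filtered[j, i] = best
        # (iv) redefine B: weighted average of the two merged rows/columns
        nh, nk = len(clusters[h]), len(clusters[k])
        for j in live:
            if j != h and j != k:
                B[h, j] = B[j, h] = (nh * B[h, j] + nk * B[k, j]) / (nh + nk)
        # (iii) merge cluster k into cluster h
        clusters[h] = clusters[h] + clusters[k]
        del clusters[k]
    return filtered
```

On a 3x3 toy matrix the most correlated pair merges first, and the remaining entry of the filtered matrix is the average correlation between the merged cluster and the last element.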
By replacing point (iv) of the above algorithm with the following item,

(iv) Redefine the matrix B:

    b_qj = max(b_hj, b_kj)   if j ∉ h and j ∉ k,
    b_ij = b_ij              otherwise,

one obtains an algorithm performing the SLCA and the associated filtered correlation matrix C<_SLCA.
The hierarchical trees obtained from the sample correlation matrix of Eq. (1)
by applying the ALCA and the SLCA are given in Fig. 1 and in Fig. 2 re-
spectively. A hierarchical tree is a rooted tree, i.e. a tree in which a special
node (the root) is singled out. In our example this node is α1 . In the rooted
tree, we distinguish between leaves and internal nodes. Specifically, vertices of degree 1 represent leaves (vertices labeled 1, 2, ..., 10 in Fig. 1) while vertices of degree greater than 1 represent internal nodes (vertices labeled α_1, α_2, ..., α_9 in Fig. 1). The two trees are slightly different, showing that each clustering method produces a different output, emphasizing different aspects of the sample correlation matrix.
For the sake of comparison, here the ALCA and SLCA filtered correlation ma-
trices are both written with the same order of stocks of the sample correlation
matrix. By comparing the sample and the filtered matrices one immediately notes that the filtered ones contain less information, being defined by a number of distinct correlation coefficients equal to n − 1, whereas the original matrix has n(n − 1)/2 distinct correlation coefficients. The two filtering methods
detect different information. In fact the ALCA uses the average correlation
coefficient between distinct groups of elements whereas the SLCA uses the
maximal correlation. The two choices filter correlation coefficients character-
ized by a different degree of representativeness and statistical reliability.
It is worth noting that the hierarchical methods reveal the sectorial structure of
the considered set of stocks. Specifically, in both cases the stocks belonging to
the energy sector form a cluster. Fig. 1 shows that for the ALCA dendrogram the node α_2 splits the stocks into two sets, one composed of two technology stocks and one composed of the financial stocks plus IBM. For the SLCA
dendrogram the separation of the set in the technology and financial subsectors
is less sharp (see Fig. 2). However in general hierarchical methods perform
quite well in identifying groups of stocks belonging to the same economic
sector (Mantegna, 1999; Bonanno et al., 2001; Coronnello et al., 2005).
In addition to the hierarchical trees and to the related filtered correlation ma-
trices one can also obtain correlation based networks. Here we briefly recall
how to select a correlation based graph out of the complete graph describ-
ing the system. A complete graph is a graph with links connecting all the
elements (or nodes in the graph terminology) of the system of interest. In
correlation based networks a weight, which is monotonically related to the correlation coefficient of each pair of elements, can be associated with each link. Therefore one can immediately associate a weighted complete graph with the correlation matrix among the n elements of interest. A complete graph is too rich in information and therefore a "filtering" (or "pruning") of it can improve its readability. For this reason a procedure can be set up to select a subset of links which are highly informative about the hierarchical structure of the system. By using clustering algorithms as filtering procedures, a certain
number of correlation based graphs have been investigated in the econophysics
literature. Correlation based networks which have been found very useful in
the elucidation of economic properties of stock returns traded in a financial
market are the minimum spanning tree (MST) (Mantegna, 1999), the planar
maximally filtered graph (PMFG) (Tumminello et al., 2005) and the average
linkage minimum spanning tree (ALMST) (Tumminello et al., 2007c). In the
cited cases all the elements of the system are connected within the graph.
Correlation based graphs with elements disconnected from a giant component
can also be obtained starting from the correlation matrix. For example, an
extension from trees to more general graphs generated by selecting the most
correlated links has been proposed in Onnela et al. (2003). However, this last method selects only a subset of the investigated elements, controlled by an arbitrarily chosen threshold.
The MST is a correlation based tree associated with the SLCA. An illustrative
algorithm providing the MST is the following. Let us first recall that the
connected component of a graph g containing the vertex i is the maximal set of
vertices Si (with i included) such that there exists a path in g between all pairs
of vertices belonging to Si . When the element i has no links to other vertices
then Si reduces just to the element i. The starting point of the procedure is
an empty graph g with N vertices. The MST algorithm can be summarized
in 6 steps:
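In code, the construction may be rendered as the following sketch of ours (not the paper's 6-step listing; the function name and interface are assumptions): links are examined in order of decreasing correlation, a link is added only when it joins two distinct connected components, and the merging correlation is recorded in Q for every pair across the two components, exactly as clusters merge in the SLCA.

```python
import numpy as np

def mst_slca(C):
    """Build the MST link list and the SLCA filtered matrix Q from C."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    # all pairs, ordered by decreasing correlation
    pairs = sorted(((C[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    comp = {i: {i} for i in range(n)}   # vertex -> its connected component S_i
    Q = np.eye(n)
    edges = []
    for c, i, j in pairs:
        if comp[i] is not comp[j]:      # distinct components: add the link
            for u in comp[i]:
                for v in comp[j]:
                    Q[u, v] = Q[v, u] = c
            merged = comp[i] | comp[j]  # merge the two components
            for v in merged:
                comp[v] = merged
            edges.append((i, j))
        if len(edges) == n - 1:         # a spanning tree has n - 1 links
            break
    return edges, Q
```

The returned Q coincides with the filtered matrix C<_SLCA of Section 2, illustrating how the merging of connected components mirrors the merging of clusters.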
The resulting graph g is the MST of the system and the matrix Q is the correlation matrix associated with the SLCA. The presented algorithm is not the most popular or the simplest algorithm for the construction of the MST, but it clearly reveals the relation between the SLCA and the MST. Indeed connected
components progressively merging together during the construction of g are
nothing else but clusters progressively merging together in the SLCA. In Fig.
3 we show the MST associated with the considered example. It should be
noted that the correlation based tree contains more information than the hi-
erarchical tree or the filtered correlation matrix. For example, the fact that
the connection between the cluster of two technology stocks (MOT and TXN)
and the cluster of mostly financial stocks (MER, AXP, AIG, BAC and IBM)
occurs through IBM is something which is not contained in the hierarchical
tree but it is present in the MST.
Fig. 1. Average linkage cluster analysis. Illustrative example of a hierarchical tree associated with a system of N = 10 stocks (ticker symbols label the stocks at the bottom of the hierarchical tree; each element of the system is also labeled with an integer number). The color of each line indicates the primary economic sector of the stock: red for technology, blue for energy and green for financial. The labels of the nodes of the hierarchical tree are used in the discussion of the hierarchically nested factor model of Section 3.
By replacing Eq. (4) in step (v) of the above procedure with the corresponding average linkage rule one obtains an algorithm performing the ALCA, and the final Q of the procedure is the corresponding correlation matrix. The obtained tree g, which we termed ALMST (Tumminello et al., 2007c), is a tree naturally associated with such a clustering procedure. The choice of the link at step (iii) of the ALMST construction algorithm does not affect the clustering procedure but specifies the construction of the correlation based tree. More precisely, by selecting any link between nodes u ∈ S_h and p ∈
Sk the matrix Q representing the result of ALCA remains the same in terms
of the hierarchical tree. This degeneracy allows one to consider different rules to select the link between elements u and p at step (iii) of the construction algorithm. Different rules at step (iii) give rise to different correlation based trees. The same observation holds true for the algorithm that generates the MST. This fact implies that in principle one can consider spanning trees which are different from the MST and are still associated with the SLCA. However, we have already recalled that the MST is unique in the sense that, when a Euclidean distance is defined between links of the spanning tree, the MST is the spanning tree of shortest length (West, 2001).

Fig. 2. Single linkage cluster analysis. Illustrative example of a hierarchical tree associated with a system of N = 10 stocks (ticker symbols label the stocks at the bottom of the hierarchical tree). The color of each line indicates the primary economic sector of the stock: red for technology, blue for energy and green for financial.
For the present example the ALMST is essentially indistinguishable from the MST and for this reason we do not display it here. It is worth noting that whereas the hierarchical trees obtained with the ALCA and the SLCA show slight differences, these differences essentially disappear at the level of the associated correlation based trees in the present example.
Starting from the sample correlation matrix one can also obtain correlation
based networks having a structure more complex than a tree. One of such
correlation based networks is the PMFG (Tumminello et al., 2005). This correlation based network is associated with a hierarchical structure, namely the one given by the SLCA, but it presents a graph structure which is richer than that of the MST. In fact, the PMFG has loops and cliques. A clique of k elements is a complete subgraph linking all k elements. Due to topological constraints, only cliques of 3 and 4 elements are allowed in the PMFG.

Fig. 3. Minimum spanning tree associated with the SLCA of the example. The vertices indicate the stocks. Colors indicate the different economic sectors: red for technology, blue for energy and green for financial. The thickness of links is proportional to the bootstrap percentage, whose value is reported close to each link.

To illustrate the
PMFG algorithm, let us first consider a different construction algorithm for
the MST. Following the ordered list Sord of correlation coefficients starting
from the couple of elements with largest correlation one adds a link between
element i and element j if and only if the graph obtained after the link inser-
tion is still a forest or it is a tree. A forest is a disconnected graph in which any
two elements are connected by at most one path, i.e. a disconnected ensemble
of trees. With this procedure, equivalent to the algorithm detailed above, the graph obtained after all links of S_ord are considered is the MST. In direct analogy, Tumminello et al. (2005) introduce a correlation based graph obtained by
connecting elements with largest correlation under the topological constraint
of fixed genus G = 0. The genus is a topologically invariant property of a
surface defined as the largest number of nonintersecting simple closed curves
that can be drawn on the surface without separating it. Roughly speaking, it
is the number of holes in a surface. The construction algorithm for such a graph
is: following the ordered list Sord starting from the couple of elements with
largest correlation one adds a link between element i and element j if and
only if the resulting graph can still be embedded on a plane or a sphere, i.e.
topological surfaces with G = 0. A basic difference of the PMFG with respect to the MST is the number of links, which is N − 1 in the MST and 3(N − 2) in the PMFG. Moreover, the PMFG is a network with loops whereas the MST is a tree. It is worth recalling that Tumminello et al. (2005) have proven that the PMFG always contains the MST.

Fig. 4. Planar maximally filtered graph of the correlation matrix of the considered example. The vertices indicate the stocks. Colors indicate the different economic sectors: red for technology, blue for energy and green for financial. The thickness of links is proportional to the bootstrap percentage, whose value is reported close to each link.
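The greedy construction described above can be sketched in code. This is an illustration of ours (assuming the networkx library is available for the planarity test), not the reference implementation of Tumminello et al. (2005):

```python
import itertools
import numpy as np
import networkx as nx

def pmfg(C):
    """Sketch of the PMFG construction: scan pairs by decreasing correlation
    and keep a link only if the growing graph stays planar (genus G = 0)."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    pairs = sorted(((C[i, j], i, j)
                    for i, j in itertools.combinations(range(n), 2)),
                   reverse=True)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for c, i, j in pairs:
        g.add_edge(i, j, weight=c)
        if not nx.check_planarity(g)[0]:
            g.remove_edge(i, j)              # insertion would break planarity
        if g.number_of_edges() == 3 * (n - 2):
            break                            # the PMFG has exactly 3(N - 2) links
    return g
```

Because every remaining pair is tried, the result is an edge-maximal planar graph, which for N ≥ 3 vertices always has 3(N − 2) links.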
In Fig. 4 we show the PMFG obtained for the considered example. In the
figure the length of the links is not related to the similarity measure between
the two vertices they connect. We are using this kind of representation to put
emphasis on the topological planarity of the network. In fact from the figure
it is evident that there are no crossings of links, and the entire network is
topologically embedded in a plane. By comparing Figs. 3 and 4 we note that the stock MER, which is central in the MST and ALMST, is the only stock participating in all the seven 4-cliques which are observed in the PMFG. In other words, the PMFG allows one to consider more details present in the sample correlation matrix than those selected by the MST or ALMST. For example, the PMFG of Fig. 4 shows that two stocks of the financial sector (AXP and MER) are connected to stocks of both the technology and energy sectors. Such a property was not present in the MST (or ALMST), where only MER was linking two stocks of the two sectors (specifically RD and IBM). The PMFG therefore shows more details on the interrelations present among stocks than the MST. The PMFG has been recently used to
investigate stock return multivariate time series in references Tumminello et
al. (2005), Coronnello et al. (2005) and Tumminello et al. (2007e).
The statistical reliability of hierarchical trees and correlation based graphs cannot be theoretically evaluated, in spite of the fact that the statistical
reliability of the spectral properties of the correlation matrix can be assessed
under the assumption of multivariate normal distribution for the time series
of the elements of the investigated set. In the absence of such a theoretical ap-
proach we have devised a method to evaluate the statistical reliability of nodes
in a hierarchical tree obtained by using a correlation matrix as a similarity
measure and links in a correlation based graph. The method we use is based
on a bootstrap procedure of the time series used to compute the correlation
matrix of the system. The method is detailed in Tumminello et al. (2007d) and
Tumminello et al. (2007c). Here we just sketch the most important aspects of the procedure, which allows one to associate a bootstrap value with each internal node of a hierarchical tree. Consider a system of N time series of length T and suppose the data are collected in a matrix X with N columns and T rows. A bootstrap
data matrix X∗ is formed by randomly sampling T rows from the original
data matrix X allowing multiple sampling of the same row. For each replica
X∗ , the associated correlation matrix C∗ is evaluated and a hierarchical tree
is constructed by hierarchical clustering. A large number (typically 1000) of
independent bootstrap replicas is considered and for each internal node of the
original data hierarchical tree we compute the fraction of bootstrap replicas
(commonly referred to as bootstrap value) preserving the internal node in the
hierarchical tree. Given an internal node αk of the original hierarchical tree,
we say that a bootstrap replica is preserving that node if and only if a node αh∗
in the replica hierarchical tree exists and identifies a branch characterized by
the same leaves identified by αk in the original hierarchical tree. For instance,
we say that the node α3 of the hierarchical tree in Fig. 1 is preserved in some
replica hierarchical tree D ∗ if and only if a node of D ∗ exists such that it
connects all and only the leaves 1, 2, 3, 4, and 5.
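The procedure can be sketched as follows for a single linkage tree; the function names and row-resampling interface are our own assumptions, not the implementation of Tumminello et al. (2007d):

```python
import numpy as np

def slca_branches(C):
    """Branches (frozensets of leaves) of the single linkage hierarchical
    tree obtained from correlation matrix C."""
    n = C.shape[0]
    pairs = sorted(((C[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    comp = {i: frozenset([i]) for i in range(n)}
    branches = set()
    for _, i, j in pairs:
        if comp[i] != comp[j]:              # a merge: record the new branch
            merged = comp[i] | comp[j]
            branches.add(merged)
            for v in merged:
                comp[v] = merged
    return branches

def bootstrap_values(X, n_replicas=1000, seed=0):
    """Fraction of bootstrap replicas preserving each branch of the original
    tree. Rows of X are time records, columns are the N elements."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    original = slca_branches(np.corrcoef(X, rowvar=False))
    counts = dict.fromkeys(original, 0)
    for _ in range(n_replicas):
        Xb = X[rng.integers(0, T, size=T)]  # resample T rows with replacement
        replica = slca_branches(np.corrcoef(Xb, rowvar=False))
        for b in original:
            counts[b] += b in replica       # node preserved in this replica?
    return {b: counts[b] / n_replicas for b in original}
```

With two strongly coupled pairs of elements, the branches grouping each pair are preserved in essentially every replica, i.e. their bootstrap values are close to 1.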
Fig. 5. Average linkage cluster analysis. Illustrative example of the estimation of the bootstrap value associated with each node of the hierarchical tree. The color of the horizontal lines indicates the bootstrap value b of the node according to the following color code: green 0.4 ≤ b < 0.6, cyan 0.6 ≤ b < 0.8, and purple 0.8 ≤ b ≤ 1.0.
The result is a collection of correlation based graphs, for example, in the case of MSTs, {MST*_1, ..., MST*_r}. To associate the so called bootstrap value with a link of the original correlation based graph (in the present example a MST) one evaluates the number of MST*_i in which the link appears and normalizes such a number by the total number of replicas, e.g. r = 1000. The bootstrap value gives information about the reliability of each link of a correlation based graph. It is worth noting that the bootstrap approach does not require knowledge of the data distribution and is therefore particularly useful when dealing with high dimensional systems, where it is difficult to infer the joint probability distribution from data. One might then be tempted to expect that the higher the correlation associated with a link in a correlation based network, the higher the reliability of the link. Tumminello et al. (2007c) show that this hypothesis is not always verified in empirical results for sets of stock returns traded in a financial market. The bootstrap value and the correlation coefficient can be different, indicating a different degree of stability with respect to metric and topological aspects. In Figs. 3 and 4 the bootstrap values associated with
each link are reported in the figure as the number close to each link. For a
detailed discussion about the use of the bootstrap procedure to estimate the
reliability of correlation based graphs see Tumminello et al. (2007c).
where i ∈ {1, ..., N}, η_i = [1 − Σ_{α_h ∈ G(i)} γ²_{α_h}]^{1/2}, and the hth factor f^{(α_h)}(t) and ε_i are i.i.d. random variables with zero mean and unit variance. By fixing the γ parameters as

    γ_{α_1} = √ρ_{α_1},
    γ_{α_h} = √(ρ_{α_h} − ρ_{g(α_h)})   ∀ h = 2, ..., n − 1,            (7)
the model of Eq. (6) is the factor model characterized by a correlation matrix equal to a given matrix C<. It should be noted that, by assuming ρ_{α_1} ≥ 0, all the coefficients γ_{α_h} are non negative real numbers. In Tumminello et al. (2007d) we prove that the correlation ρ<_ij = ρ_{α_k}, where α_k is the node at which the genealogies of i and j first intersect. In fact, the cross correlation ⟨x_i x_j⟩ only depends on the factors f^{(α_h)} which are common to x_i and x_j. Since one associates a factor with each internal node, one needs to identify the internal nodes belonging to both genealogies G(i) and G(j). One can verify that G(i) ∩ G(j) = G(α_k). For example, in Fig. 1 we have G(5) = {α_3, α_2, α_1} and G(6) = {α_7, α_2, α_1}, so that G(5) ∩ G(6) = {α_2, α_1} = G(α_2). By making use of Eqs. (6, 7) the cross correlation between variables x_i and x_j is

    ⟨x_i x_j⟩ = Σ_{α_h ∈ G(α_k)} γ²_{α_h} = ρ_{α_k}.            (8)
For example, with reference to Fig. 1, we have ⟨x_5 x_6⟩ = γ²_{α_2} + γ²_{α_1} = (ρ_{α_2} − ρ_{α_1}) + ρ_{α_1} = ρ_{α_2}. Thus the matrix C< is the correlation matrix associated with
the factor model of Eq. (6). It is worth noting that the existence of a factor
model whose matrix C< is the correlation matrix implies that the matrix C<
is always positive definite if ρα1 ≥ 0.
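As an illustration, one can generate synthetic time series from a small hand-built dendrogram, fix the γ coefficients through Eq. (7), and check that the sample cross correlation of two elements approaches ρ_{α_k} of the node where they merge. The node labels and ρ values below are toy choices of ours, not those of the empirical example:

```python
import numpy as np

# Toy dendrogram: root a1 with rho_a1 = 0.2; its two sons a2 (rho_a2 = 0.6,
# leaves 0 and 1) and a3 (rho_a3 = 0.5, leaves 2 and 3). G(i) is the
# genealogy of leaf i: the internal nodes from the leaf up to the root.
rho = {"a1": 0.2, "a2": 0.6, "a3": 0.5}
parent = {"a2": "a1", "a3": "a1"}
genealogy = {0: ["a2", "a1"], 1: ["a2", "a1"], 2: ["a3", "a1"], 3: ["a3", "a1"]}

# Eq. (7): gamma_{a1} = sqrt(rho_{a1}); gamma_{ah} = sqrt(rho_{ah} - rho_{g(ah)})
gamma = {"a1": np.sqrt(rho["a1"])}
for h in ("a2", "a3"):
    gamma[h] = np.sqrt(rho[h] - rho[parent[h]])

rng = np.random.default_rng(0)
T = 200_000
f = {h: rng.normal(size=T) for h in rho}    # independent unit-variance factors
x = {}
for i, G in genealogy.items():
    eta = np.sqrt(1.0 - sum(gamma[h] ** 2 for h in G))  # idiosyncratic weight
    x[i] = sum(gamma[h] * f[h] for h in G) + eta * rng.normal(size=T)

# Model prediction: <x_i x_j> equals rho of the node where i and j merge,
# e.g. <x0 x1> near 0.6 (node a2) and <x0 x2> near 0.2 (root a1).
C = np.corrcoef(np.array([x[i] for i in range(4)]))
```

With T = 200,000 records the sample correlations reproduce the nested structure to within sampling error.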
In the case in which negative correlations are associated with some nodes
in the dendrogram, it is sometimes possible to suitably modify Eqs. (7) by
introducing multiplicative sign variables in order to get an HNFM describing
the system. The description of the most general case is left for a future work.
Here we just consider the case in which only ρα1 < 0, because this is the case
in the empirical application described in section 4. Let us assume that all the
correlations associated with nodes in the dendrogram are non negative but
ρα1 < 0. Furthermore assume that |ρα1 | < ρα2 (this constraint is satisfied in
the empirical application of Section 4). In order to construct the HNFM, we
divide the elements of the system into two groups. These are the two groups of
elements merging together at root node. The coefficient γα1 shall be different
for elements belonging to different groups. Specifically,

    γ¹_{α_1} = −√|ρ_{α_1}|   for all the elements of the first group,
    γ²_{α_1} = +√|ρ_{α_1}|   for all the elements of the second group,      (9)

    γ_{α_h} = √(ρ_{α_h} − |ρ_{g(α_h)}|)   ∀ h = 2, ..., n − 1.              (10)
We note that the constraint |ρ_{α_1}| < ρ_{α_2} is required by Eq. (10) in order to have real values of the γ coefficients. Note also that |ρ_{g(α_h)}| = ρ_{g(α_h)} ∀ g(α_h) ≠ α_1 and, accordingly, the γ coefficients associated with all the nodes different from the root node and its sons, as given in Eq. (10), coincide with the corresponding coefficients as defined in Eq. (7).
Fig. 6. Hierarchical tree of the set of daily equity returns of 100 highly capitalized stocks traded at the NYSE during the period 2001-2003, obtained by applying the average linkage clustering algorithm to the correlation matrix. Colors are chosen according to the economic sector of the stock in the classification of Yahoo Finance. Specifically these sectors are Basic Materials (violet), Consumer Cyclical (tan), Consumer Non Cyclical (yellow), Energy (blue), Services (cyan), Financial (green), Healthcare (gray), Technology (red), Utilities (magenta), Transportation (brown), Conglomerates (orange) and Capital Goods (light green).
To evaluate the statistical robustness of each node and to simplify the description of the system, a node reduction of the ALCA hierarchical tree of Fig. 6 is performed, yielding the reduced tree shown in Fig. 7.
Fig. 7. Hierarchical tree with 27 internal nodes obtained by node reduction of the ALCA hierarchical tree shown in Fig. 6. Rectangles at the bottom indicate 8 clusters, and the associated symbols label the classification of stocks in terms of economic sectors or sub-sectors according to the classification of Yahoo Finance (see text for the legend). Colors of lines indicate stock sectors as in Fig. 6. The labeled internal nodes are discussed in the text. In the figure we do not comment on clusters composed of only two leaves.
Let us first comment on the properties of the reduced HNFM. In the figure we observe several clusters and sub-clusters. As already noticed in previous studies (Mantegna, 1999; Bonanno et al., 2001; Tumminello et al., 2005), the detected clusters and sub-clusters overlap in part with economic classifications such as, for example, the one provided by Yahoo Finance (as of April 2005). This can be seen in Figs. 6 and 7, where we use this classification to characterize each stock with a specific color. Most of the groups detected by hierarchical clustering are characterized by the same color. For example, financial firms are represented in Figs. 6 and 7 as green lines in the hierarchical tree.
The root of the dendrogram of Fig. 7 is associated with a parameter ρ_{α_1} = −0.004. This value is negative, even if it is not statistically significantly different from zero, given that the error associated with the correlation coefficient is 1/√T = 0.036. The fact that ρ_{α_1} < 0 requires the introduction of sign variables as explained at the end of Section 3. Since ρ_{α_2} = 0.2 > |ρ_{α_1}|, we can use Eqs. (9) and (10) to determine the parameters of the model. The root node α_1 splits the set of stocks into two subsets, one composed of one stock (NEM, a gold mining company, see Table I) and another composed of 99 stocks. The value of ρ_{α_1} is consistent with the interpretation that NEM is uncorrelated with the rest of the stocks. By using Eq. (9) we set γ¹_{α_1} = −0.063 and γ²_{α_1} = +0.063. The second
factor (node) α2 describes the market mean behavior and it is associated with
the parameter γα2 = 0.44. The other 25 factors describe clusters of stocks
that are often significantly homogeneous with respect to the sector activity
of the stocks. In Fig. 7 we have highlighted 8 clusters by using rectangles
at the bottom of the figure. Specifically, F1 is the sub-sector of investment
services and F2 contains the sub-sectors of regional banks and money center
banks. Both F1 and F2 belong to the economic sector of Financial; T and
E are indicating the economic sectors of Technology and Energy respectively;
H1 indicates the sub-sector major drugs of the economic sector Healthcare; S1
and S2 indicate the two sub-sectors of retail and communication services of the
sector of Services respectively. Finally, X is a cluster which is not homogeneous
with respect to sector and sub-sector classification. It comprises stocks in the
sector of basic materials, stocks of the sub-sector constructions of capital goods
and stocks as EMR (classified as technology) and GM (classified as consumer
cyclical).
One prominent example is the group of technology stocks (group T in Fig. 7).
The first two stocks (their ticker symbols are TXN and ADI) from left to right
of the group labeled as T in the reduced HNFM of Fig. 7 are described by the
equation

    x^F_i(t) = γ_{α_23} f^{(α_23)}(t) + γ_{α_4} f^{(α_4)}(t) + Σ_{h=1}^{2} γ_{α_h} f^{(α_h)}(t) + η_F ε_i(t)
             = 0.57 f^{(α_23)}(t) + 0.51 f^{(α_4)}(t) + 0.44 f^{(α_2)}(t) + 0.063 f^{(α_1)}(t) + 0.47 ε_i(t).   (11)
The factors f^{(α_1)}(t) and f^{(α_2)}(t) are common to almost all stocks, whereas f^{(α_4)}(t) and f^{(α_23)}(t) are specific to these stocks. The other four technology
stocks (which are EMC, IBM, MOT and CA) are described by the equation
    x^F_i(t) = γ_{α_4} f^{(α_4)}(t) + Σ_{h=1}^{2} γ_{α_h} f^{(α_h)}(t) + η_F ε_i(t)
             = 0.51 f^{(α_4)}(t) + 0.44 f^{(α_2)}(t) + 0.063 f^{(α_1)}(t) + 0.74 ε_i(t).   (12)
In this last case only the f (α4 ) (t) factor is present in addition to the f (α1 ) (t)
and f (α2 ) (t) factors common to almost all stocks. It is therefore natural to
consider f (α4 ) (t) as a factor characterizing technology stocks whereas f (α23 ) (t)
is an additional factor further characterizing only the two stocks TXN and
ADI. A similar organization in nested clusters is observed in all the groups detected by the reduced HNFM. The number of factors characterizing the various stocks ranges from one to five.
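As a consistency check on Eqs. (11) and (12), the idiosyncratic weights 0.47 and 0.74 follow from the unit variance condition η_F = [1 − Σ_h γ²_{α_h}]^{1/2}:

```python
import math

# Factor loadings read off Eq. (11) (TXN and ADI) and Eq. (12) (EMC, IBM, MOT, CA)
gammas_eq11 = [0.57, 0.51, 0.44, 0.063]
gammas_eq12 = [0.51, 0.44, 0.063]

eta11 = math.sqrt(1.0 - sum(g * g for g in gammas_eq11))
eta12 = math.sqrt(1.0 - sum(g * g for g in gammas_eq12))
print(round(eta11, 2), round(eta12, 2))   # 0.47 0.74, matching the two equations
```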
It is worth comparing Figs. 6 and 7. The comparison shows that the self-consistent reduction of the number of factors allows a robust statistical validation of the groups that are detected from the data analysis. Only the informa-
tion which is statistically robust at the 95% level is retained in the reduced
HNFM. For example, the financial cluster observed at the left end of the hier-
archical tree in Fig. 6 is not robust at the selected confidence level whereas the
two sub-clusters indicated as F1 (LEH, BSC, MER and SCH) and F2 (NCC,
STI, ONE, PNC, BAC, WFC, BK and MEL) in Fig.7 are. This empirical
analysis has shown the usefulness of HNFM in an empirical investigation of
hierarchically organized complex systems.
have

    K(P(Σ_1, X), P(Σ_2, X)) = E_{P(Σ_1, X)} [ log( P(Σ_1, X) / P(Σ_2, X) ) ]
                            = ∫ P(Σ_1, X) log[ P(Σ_1, X) / P(Σ_2, X) ] dX,      (14)

which, for multivariate Gaussian distributed variables, becomes

    K(P(Σ_1, X), P(Σ_2, X)) = (1/2) [ log( |Σ_2| / |Σ_1| ) + tr( Σ_2^{-1} Σ_1 ) − N ],   (15)
where N is the dimension of the space spanned by the X variable and |Σ| indi-
cates the determinant of Σ. From now on we indicate K(P (Σ1, X), P (Σ2, X))
simply with K(Σ1 , Σ2 ). It is worth noting that the Kullback-Leibler distance
takes naturally into account the statistical nature of correlation matrices. In-
deed K(Σ1 , Σ2 ) is well defined only provided that the matrices Σ1 and Σ2
are positive definite. This property is not common to other measures of dis-
tance between matrices. However this property can also be a limitation. The
Kullback-Leibler distance cannot be used to quantify the distance between
semi-positive correlation matrices that are observed, for example, when the
length T of data series is smaller than the number N of elements of the sys-
tem.
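Eq. (15) is straightforward to evaluate numerically. A minimal sketch (the function name is ours), assuming both matrices are positive definite:

```python
import numpy as np

def kl_gaussian(S1, S2):
    """K(P(S1, X), P(S2, X)) of Eq. (15) for zero-mean multivariate
    Gaussians with positive definite covariance matrices S1 and S2."""
    S1 = np.asarray(S1, dtype=float)
    S2 = np.asarray(S2, dtype=float)
    N = S1.shape[0]
    sign1, logdet1 = np.linalg.slogdet(S1)
    sign2, logdet2 = np.linalg.slogdet(S2)
    if sign1 <= 0 or sign2 <= 0:
        raise ValueError("both matrices must be positive definite")
    # (1/2) [ log(|S2|/|S1|) + tr(S2^{-1} S1) - N ]
    return 0.5 * (logdet2 - logdet1 + np.trace(np.linalg.solve(S2, S1)) - N)
```

Note that K vanishes when the two matrices coincide and is not symmetric in its arguments, consistent with Eqs. (16) and (17) below.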
We have obtained the value of the Kullback-Leibler distance between two multivariate distributions as a function of the two corresponding Pearson correlation matrices. We are interested in the case in which one or both correlation matrices are sample correlation matrices and thus are random variables. Since
different realizations of the process give rise to different sample correlation
matrices, a Kullback-Leibler distance having one or two sample correlation
matrices as arguments is a function of one or two random matrices.
By making use of the theory of Wishart matrices, we obtained (Tumminello et al., 2007a) that

    E[K(Σ, C_1)] = (1/2) [ N log(2/T) + Σ_{p=T−N+1}^{T} Γ′(p/2)/Γ(p/2) + N(N+1)/(T−N−1) ],   (16)

    E[K(C_1, Σ)] = (1/2) [ N log(T/2) − Σ_{p=T−N+1}^{T} Γ′(p/2)/Γ(p/2) ],                    (17)

and

    E[K(C_1, C_2)] = (1/2) · N(N+1)/(T−N−1),                                                 (18)
where Γ(x) is the usual Gamma function and Γ′ (x) is the derivative of Γ(x).
It is important to observe that all the expectation values given in Eq.s (16-18)
are independent of Σ, i.e. they are independent of the specific model gener-
ating or describing the data. The independence property implies that (i) the
Kullback-Leibler distance is a good measure of the statistical uncertainty of
correlation matrix which is due to the finite length of data series and (ii) the
expected value of the Kullback-Leibler distance is known also when the un-
derlying model hypothesized to describe the system is unknown. This fact has
important consequences. Suppose one knows that the observed data are well
approximated by a multivariate Gaussian distribution and that one measures
a sample correlation matrix C. In order to remove some unavoidably present
statistical uncertainty, the observer applies a filtering procedure to the data
obtaining the filtered 2 correlation matrix Cf ilt . If the filtering technique is
able to recover the model correlation matrix, i.e. Cf ilt = Σ, the Kullback-
Leibler distance K(C, Cf ilt ) must be equal on average to the value given in
Eq. (17). This expected value is independent on the (unknown) model correla-
tion matrix Σ. Therefore large deviations from this expectation value indicate
that the filtered matrix is not consistent with the true matrix of the system.
If K(C, Cf ilt ) is significantly smaller than the expectation value of Eq. (17)
the filtered matrix is keeping some of the statistical uncertainty due to the
finite length T . If, on the other hand, K(C, Cf ilt ) is significantly larger than
the value of Eq. (17), it means that the filtered matrix is either filtering too
much information or distorting the signal. The distance between K(C, Cf ilt )
and the expected value of Eq. (17) is a measure of the goodness of the filtering
procedure in keeping the maximal amount of information which can be present
in sample correlation matrices estimated with a finite number of records.
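As a concrete illustration, the reference values of Eqs. (16)-(18) can be evaluated numerically. The sketch below is our own, not the source's; the `digamma` helper is a simple recurrence-plus-asymptotic approximation of Γ′(x)/Γ(x), adequate for the arguments arising here.

```python
import math

def digamma(x):
    """Gamma'(x)/Gamma(x) via the recurrence psi(x) = psi(x+1) - 1/x
    and an asymptotic series (a simple approximation, accurate for x >= 6)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x**2) + 1.0 / (120 * x**4)

def expected_kl(N, T):
    """Expected KL distances of Eqs. (16)-(18) for N Gaussian series of length T."""
    # sum of Gamma'(p/2)/Gamma(p/2) for p = T-N+1, ..., T
    dsum = sum(digamma(p / 2.0) for p in range(T - N + 1, T + 1))
    e_sigma_c1 = 0.5 * (N * math.log(2.0 / T) + dsum + N * (N + 1) / (T - N - 1))
    e_c1_sigma = 0.5 * (N * math.log(T / 2.0) - dsum)
    e_c1_c2 = 0.5 * N * (N + 1) / (T - N - 1)
    return e_sigma_c1, e_c1_sigma, e_c1_c2

# The three values depend only on N and T, never on the model Sigma.
print(expected_kl(100, 748))
```

Note that the three reference values require only N and T as inputs, which is exactly the model-independence property discussed above.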
2 In this and in the following sections we use the superscript to indicate the filtering
procedure, whereas in Section 2 we used the subscript.
A second aspect concerns the stability of the filtered correlation matrix obtained
from a sample matrix. Suppose we apply a certain filtering procedure
to the correlation matrices C_1 and C_2 of two independent realizations
of the system, obtaining two filtered correlation matrices C_1^filt and C_2^filt. If it
turns out that K(C_1^filt, C_2^filt) is larger than the expected value of K(C_1, C_2)
given by Eq. (18), one can conclude that the filtering procedure produces
correlation matrices that are less reproducible than the sample correlation matrices,
and therefore the procedure is not suitable for the purpose of filtering robust
information from the empirical correlation matrices C_1 and C_2.
Biroli et al. (2007) have extended the above results on the Kullback-Leibler
distance to a general class of elliptic distributions. Specifically, they considered
random variables xi (i = 1, ..., N) that can be generated by starting from
random variables yi (i = 1, ..., N) following a generic multivariate distribution
and by setting xi = syi , where s is a positive random variable.
As a specific example, which is also relevant for financial data, they considered
the multivariate Student’s t-distribution. In this case the variables yi are taken
from a multivariate normal distribution with correlation matrix Σ and s is
distributed according to
P(s) = \frac{2}{\Gamma(\mu/2)}\exp\left(-\frac{s_0^2}{s^2}\right)\frac{s_0^{\mu}}{s^{1+\mu}},   (19)
where s_0^2 = 2\mu/(\mu - 2), in such a way that s has unit variance. The joint
probability density function for the x_i is
P(x_1, x_2, ..., x_N) = \frac{\Gamma\left(\frac{N+\mu}{2}\right)}{\Gamma(\mu/2)\sqrt{(\mu\pi)^N |\Sigma|}}\,\frac{1}{\left[1 + \frac{1}{\mu}\sum_{i,j} x_i (\Sigma^{-1})_{ij}\, x_j\right]^{\frac{N+\mu}{2}}},   (20)
where the parameter µ gives the number of degrees of freedom of the distribution and
describes the tail behavior of the marginal distribution of any x_i, since
P(x_i) \sim x_i^{-1-\mu}.
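The construction x_i = s y_i described above can be sampled directly. The sketch below is our own illustration: it uses the standard chi-square representation s = \sqrt{\mu/w}, w \sim \chi^2_\mu, which yields a multivariate Student's t with µ degrees of freedom (the normalisation s_0 of Eq. (19) only rescales the marginal variance); Sigma, mu, and T are made-up example values.

```python
import numpy as np

def sample_student_t(Sigma, mu, T, rng):
    """Draw T records of x = s * y, with y Gaussian (correlation Sigma)
    and s = sqrt(mu / w), w ~ chi^2 with mu degrees of freedom."""
    N = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    y = rng.standard_normal((T, N)) @ L.T              # Gaussian with correlation Sigma
    s = np.sqrt(mu / rng.chisquare(mu, size=(T, 1)))   # one common random scale per record
    return s * y                                       # heavy tails, same correlation matrix

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])
X = sample_student_t(Sigma, mu=5.9, T=20000, rng=rng)
C = np.corrcoef(X, rowvar=False)   # Pearson estimator applied to the heavy-tailed data
```

Because the scale s multiplies all components at once, the Pearson correlation of x still estimates Σ, which the sample above confirms for large T.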
Let us assume that the correlation matrix is computed with the Pearson estimator
of the correlation coefficients. Biroli et al. (2007) show that the Kullback-Leibler
distance between two multivariate Student's t-distributions with the
same scaling parameter µ and correlation matrices Σ_1 and Σ_2 is

K(\Sigma_1, \Sigma_2) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} + (N+\mu)\int ds\,P(s)\,\log\frac{1+\mathrm{tr}(\Sigma_2^{-1}\Sigma_1)/(2s)}{1+N/(2s)}\right].   (21)
In the limit µ/N → ∞ this expression coincides with the one obtained above for the
Gaussian case, whereas in the limit µ/N → 0 Biroli et al. (2007) obtain

K(\Sigma_1, \Sigma_2) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} + N\log\frac{\mathrm{tr}(\Sigma_2^{-1}\Sigma_1)}{N}\right].   (22)

When \Sigma_1 \simeq \Sigma_2 the trace of \Sigma_2^{-1}\Sigma_1 is close to N, and an
expansion of the logarithm to first order gives

\frac{N}{2}\log\left[\frac{1}{N}\mathrm{tr}(\Sigma_2^{-1}\Sigma_1)\right] \simeq \frac{N}{2}\left[\frac{1}{N}\mathrm{tr}(\Sigma_2^{-1}\Sigma_1) - 1\right],   (23)
which is exactly the second term of the right hand side of Eq. (15). This calculation
shows that whenever the correlation matrices involved in the Kullback-Leibler
distance are very close to each other (Σ_1 ≃ Σ_2), the expression
of the Kullback-Leibler distance for Student's t-distributions with small
µ coincides, to first order of approximation, with the expression of the
Kullback-Leibler distance obtained for Gaussian random variables.
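This agreement can be checked numerically. The sketch below (our own construction; S1 and S2 are made-up nearby matrices) implements the Gaussian expression of Eq. (15) and the small-µ expression of Eq. (22), and verifies the exact identity K_gauss − K_student = (N/2)(x − 1 − log x) with x = tr(Σ_2^{-1}Σ_1)/N, a quantity that vanishes as Σ_1 → Σ_2.

```python
import math
import numpy as np

def kl_gaussian(S1, S2):
    # Eq. (15): K = (1/2) [ log(|S2|/|S1|) + tr(S2^{-1} S1) - N ]
    N = S1.shape[0]
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1 + np.trace(np.linalg.solve(S2, S1)) - N)

def kl_student_small_mu(S1, S2):
    # Eq. (22): the mu/N -> 0 limit for Student's t variables
    N = S1.shape[0]
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1 + N * math.log(np.trace(np.linalg.solve(S2, S1)) / N))

rng = np.random.default_rng(1)
N = 50
A = rng.standard_normal((N, N)) / math.sqrt(N)
S1 = np.eye(N) + A @ A.T           # a well-conditioned positive definite matrix
S2 = S1 + 0.01 * np.eye(N)         # a nearby matrix, Sigma_1 ~ Sigma_2
g = kl_gaussian(S1, S2)
s = kl_student_small_mu(S1, S2)
```

Since x − 1 − log x ≥ 0, Eq. (22) never exceeds the Gaussian value, with equality exactly when the trace term equals N.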
As a final remark, we note that there is a way to compute the expected value of the
Kullback-Leibler distance which does not make use of the Pearson estimator.
It is known that Pearson's estimator of the correlation matrix is not the
maximum likelihood estimator when the variables are non-Gaussian. In the
case of the Student's t-distribution of Eq. (20) there exists a recursive equation
for the maximum likelihood estimator \bar{C} (Bouchaud and Potters, 2003):
\bar{C}_{ij} = \frac{N+\mu}{T}\sum_{t=1}^{T}\frac{x_i(t)\,x_j(t)}{\mu + \sum_{p,q} x_p(t)\,(\bar{C}^{-1})_{pq}\,x_q(t)}.   (24)
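The fixed point of Eq. (24) can be reached by simple iteration, starting e.g. from the Pearson estimate. The sketch below is our own (the starting point, iteration count, tolerance, and test data are made-up choices, not the source's implementation):

```python
import numpy as np

def student_ml_correlation(X, mu, n_iter=200, tol=1e-10):
    """Solve the fixed-point equation (24) for the ML estimator C-bar by iteration."""
    T, N = X.shape
    C = np.corrcoef(X, rowvar=False)              # Pearson estimate as starting point
    for _ in range(n_iter):
        Cinv = np.linalg.inv(C)
        q = np.einsum('ti,ij,tj->t', X, Cinv, X)  # x(t)^T C^{-1} x(t) for every record t
        w = (N + mu) / (mu + q)                   # per-record weights of Eq. (24)
        C_new = (X * w[:, None]).T @ X / T
        if np.max(np.abs(C_new - C)) < tol:
            return C_new
        C = C_new
    return C

# Made-up example: Student's t data with mu = 6 and a known correlation matrix.
rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
mu, T = 6.0, 5000
y = rng.standard_normal((T, 3)) @ np.linalg.cholesky(Sigma).T
X = np.sqrt(mu / rng.chisquare(mu, size=(T, 1))) * y
C_bar = student_ml_correlation(X, mu)
```

On data drawn from the model of Eq. (20), the iteration converges to a matrix close to the true Σ.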
The KL distance can therefore be used to quantify and compare the performance
of different filtering procedures of correlation matrices (Tumminello et
al., 2007a). A good filtering procedure should have two important properties:
(i) it should remove the "right" amount of noise from the data in order to
recover the signal, and (ii) it should produce filtered matrices which are stable
when one makes different observations of the same system. These two requirements
are often in competition with each other. The procedure we propose to evaluate
the performance of a filtering procedure is the following.
Suppose we are given a data sample X and a filtering procedure of choice. We
generate M bootstrap replicas X_i (i = 1, ..., M) of the data. For each replica X_i
we then compute the sample correlation matrix C_i and apply the filtering
procedure, obtaining the filtered matrix C_i^filt. In order to measure
the stability of the filtering procedure, we consider the average of the quantity
K(C_i^filt, C_j^filt) over the replicas. An optimal filtering procedure should be
perfectly stable (i.e. ⟨K(C_i^filt, C_j^filt)⟩ = 0) because from each realization the
filtering recovers the model matrix. In order to measure the filtered information
we consider the average of K(C_i, C_i^filt) over the replicas. This quantity
measures the information present in the sample correlation matrix C_i that
has been discarded by the filtering procedure. We have seen above that for
Gaussian variables the KL distance ⟨K(C_i, Σ)⟩ is different from zero and
independent of the model Σ (see Eq. (17)). Therefore if our filtering procedure
is recovering the true underlying model we should expect that K(C_i, C_i^filt)
is equal to the right hand side of Eq. (17). We thus have a reference value
for both the stability and the information expected from an optimal filtering,
and these values are independent of the underlying model. We represent
the result of the analysis in a plane where the x axis reports the stability
⟨K(C_i^filt, C_j^filt)⟩ and the y axis reports the information ⟨K(C_i, C_i^filt)⟩. In this
plane the optimal point, labeled Σ, has coordinate x = 0 and y equal to the
right hand side of Eq. (17). A filtering procedure is considered good if the
corresponding point in the stability-information plane is close to Σ.
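The evaluation protocol above can be sketched as follows. This is our own illustration: the Gaussian KL of Eq. (15) is used, and the shrinkage toward the identity is a made-up stand-in for the filter of Eq. (26), which is not reproduced in this excerpt.

```python
import numpy as np

def kl_gauss(S1, S2):
    # Eq. (15) for Gaussian variables: K = (1/2)[log(|S2|/|S1|) + tr(S2^{-1} S1) - N]
    N = S1.shape[0]
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1 + np.trace(np.linalg.solve(S2, S1)) - N)

def shrinkage_filter(C, alpha):
    # hypothetical stand-in for Eq. (26): shrink the sample matrix toward the identity
    return alpha * C + (1.0 - alpha) * np.eye(C.shape[0])

def stability_information(X, filt, M=20, seed=0):
    """Coordinates of a filtering procedure in the stability-information plane."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    Cs, Cf = [], []
    for _ in range(M):
        Xb = X[rng.integers(0, T, size=T)]      # bootstrap replica of the records
        C = np.corrcoef(Xb, rowvar=False)
        Cs.append(C)
        Cf.append(filt(C))
    info = np.mean([kl_gauss(C, F) for C, F in zip(Cs, Cf)])   # <K(C_i, C_i^filt)>
    stab = np.mean([kl_gauss(Cf[i], Cf[j])                     # <K(C_i^filt, C_j^filt)>
                    for i in range(M) for j in range(i + 1, M)])
    return stab, info

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))              # made-up data sample
stab, info = stability_information(X, lambda C: shrinkage_filter(C, 0.5))
```

As a sanity check, a filter that always returns the identity is perfectly stable (x = 0) but discards all the sample information (large y), illustrating the trade-off between the two axes.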
Fig. 8. Stability of the filtered matrix (x axis) against the amount of information
about the correlation matrix that is retained in the filtered matrix (y axis). The
points labeled C^SHR(α) correspond to the shrinkage procedure (see Eq. (26));
the parameter α goes from 0 to 1 as one moves from the bottom right to the
top left corner. The left panel shows the result for a block diagonal model of N = 100
elements divided into 12 groups and simulated for T = 748 points. The right panel shows
the result for a hierarchically nested model of 100 elements following the HNFM
with 23 factors of Tumminello et al. (2007d).
Laloux et al. (1999) propose to modify the null hypothesis so that correlations
can be explained in terms of a one-factor model and σ² = 1 − λ_1/N.
The filtering procedure considered here has been proposed by Potters et al.
(2005) and works as follows. One diagonalizes the correlation matrix and
replaces all the eigenvalues smaller than λ_max in the diagonal matrix with
their average value, so as to preserve the trace. Then one transforms the
modified diagonal matrix back to the standard basis, obtaining a matrix H^RMT
of elements h^RMT_ij. Finally, the filtered correlation matrix C^RMT is the
matrix of elements c^RMT_ij = h^RMT_ij / \sqrt{h^RMT_ii h^RMT_jj}.
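A minimal sketch of this eigenvalue-clipping step follows. It is our own, assuming the simplest null σ² = 1, so that λ_max is the Marchenko-Pastur edge (1 + \sqrt{N/T})²; the one-factor correction of Laloux et al. (1999) would rescale this edge. The test data are made up.

```python
import numpy as np

def rmt_filter(C, T):
    """Eigenvalue clipping in the spirit of Potters et al. (2005)."""
    N = C.shape[0]
    lam_max = (1.0 + np.sqrt(N / T)) ** 2    # random-matrix upper edge for sigma^2 = 1
    w, V = np.linalg.eigh(C)
    bulk = w < lam_max
    if bulk.any():
        w[bulk] = w[bulk].mean()             # average the bulk eigenvalues: trace preserved
    H = (V * w) @ V.T                        # back to the standard basis: H^RMT
    d = np.sqrt(np.diag(H))
    return H / np.outer(d, d)                # c_ij = h_ij / sqrt(h_ii h_jj)

# Made-up one-factor data to exercise the filter.
rng = np.random.default_rng(4)
f = rng.standard_normal((100, 1))
X = 0.7 * f + rng.standard_normal((100, 20))
C_rmt = rmt_filter(np.corrcoef(X, rowvar=False), T=100)
```

The final rescaling restores the unit diagonal that the trace-preserving reconstruction alone does not guarantee.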
Fig. 9. Stability of the filtered matrix (x axis) against the amount of information
about the correlation matrix that is retained in the filtered matrix (y axis) for
N = 100 stocks of the NYSE in the period 2001-2003 (T = 748). The points labeled
C^SHR(α) correspond to the shrinkage procedure (see Eq. (26)); the parameter
α goes from 0 to 1 as one moves from the bottom right to the top left corner. The
point C^SHR(α_K) is obtained for α = α_K = 0.55, the value of the shrinkage
parameter that minimizes the Euclidean distance in the stability-information plane
between the curve corresponding to C^SHR(α) and the point associated with Σ. The
latter point represents the expectation value for a filtering procedure able to recover
the true correlation matrix of the system. See the text for more details.
returns of N = 100 highly capitalized stocks traded at the NYSE in the
period 2001-2003 (T = 748). In the present study, unlike in Tumminello
et al. (2007a), where we worked in the Gaussian approximation, we assume
stock returns to be Student's t-distributed according to Eq. (20). The scaling
parameter µ in the distribution of Eq. (20) is assumed to be the same for all of
the stocks in the portfolio. Accordingly, we determine µ as the average of
the maximum likelihood estimates of µ evaluated independently for each
stock. The result is µ = 5.9 with a standard deviation σ_µ = 1.8. The ratio
µ/N = 0.059 is much smaller than one and therefore ensures that Eq. (22) is a
good approximation of the Kullback-Leibler distance in the present case. In
Fig. 9 we show the performance of different filtering procedures in the
stability-information plane. In the figure the Kullback-Leibler distance is calculated
according to Eq. (22).
The filtering procedures based on RMT, SLCA and ALCA have different
properties in terms of stability and information (Tumminello et al., 2007a).
SLCA is the most stable even if it is the least informative, whereas RMT is
the least stable but the most informative. ALCA has intermediate proper-
ties both with respect to stability and to information. The filtering procedure
based on shrinkage seems to outperform the other filtering techniques for
selected values of the α parameter. In Fig. 9, we also highlight (in red) the
point corresponding to the optimal value α_K = 0.55 of the shrinkage parameter
α in the stability-information plane. This point is obtained by minimizing the
Euclidean distance between the points (0, ⟨K(C_i, Σ)⟩) and
(⟨K(C_i^SHR(α), C_j^SHR(α))⟩, ⟨K(C_i, C_i^SHR(α))⟩) in the stability-information
plane. The point (0, ⟨K(C_i, Σ)⟩) is the one expected for a filtering procedure
able to perfectly detect the correlation matrix Σ of the system. Our
aim is now to estimate ⟨K(C_i, Σ)⟩. By using the estimate of µ, we evaluate
the quantity ⟨K(C_i, Σ)⟩ by exploiting the model independence of
the Kullback-Leibler distance. During our analysis we realized that for
Student's t-distributions the bootstrap replicas, which we actually use to deal
with real data, can introduce a bias with respect to independent simulations.
By performing numerical simulations of Student's t-distributed random variables
characterized by different values of µ, we noticed that this bias does not
appear in the case of normal or close to normal distributions (typically when
µ is greater than 10). In order to overcome the problem when Student's t-distributed
random variables with a low value of µ describe the data better
than Gaussian random variables, we perform 100 independent simulations of
Student's t-distributed data series of length T = 748 (the same as the real data)
according to the model of Eq. (25) with scaling parameter µ = 5.9, and we
construct 100 bootstrap replicas of each simulated data series. Note that
the choice of Eq. (25) does not affect the generality of the results, because the
expectation values of the Kullback-Leibler distance do not depend on the
correlation structure of the model. We indicate the correlation matrix of the simulated
series with C_j (j = 1, ..., 100), and the correlation matrix of the i-th bootstrap
replica associated with C_j with C^b_ji (i = 1, ..., 100). Because the expectation
value of the correlation matrix C^b_ji is C_j and the expectation values of the
Kullback-Leibler distance are model independent, we can estimate the value of
⟨K(C_i, Σ)⟩ as ⟨K(C^b_ji, C_j)⟩, where the average is taken over both indices i and j.
The result we obtain is ⟨K(C^b_ji, C_j)⟩ = 6.01, shown in Fig. 9
as a blue circle. Finally, in order to associate an error bar with this value, we
apply the same procedure used to obtain the estimate of ⟨K(C_i, Σ)⟩ for
values of µ equal to µ_min = µ − σ_µ = 5.9 − 1.8 = 4.1 (providing the value at
the top of the error bar in the figure) and µ_max = µ + σ_µ = 5.9 + 1.8 = 7.7
(providing the value at the bottom of the error bar in the figure). A series of
numerical simulations performed with different models generating the dynamics
of the random variables shows that the bias introduced by bootstrapping data
instead of performing independent simulations is approximately
independent of the actual correlation matrix of the system; for µ = 5.9,
N = 100, and T = 748 the bias is equal to −10.8% ± 1.7%, i.e. the bootstrap
based estimation underestimates ⟨K(C_i, Σ)⟩. However, in the
investigations summarized in Fig. 9 the bias is the same for all of the points
and therefore the comparison of filtering procedures is possible. Our numerical
simulations also show that the value of the bias tends to increase as the value
of µ decreases.
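The bootstrap estimation of ⟨K(C_i, Σ)⟩ described above can be sketched as follows. This is our own illustration at reduced size (N, T, and the replica counts are made up for speed; the text uses N = 100, T = 748 and 100 simulations with 100 replicas each), and it uses Σ = I, which is legitimate because the expectation is model independent. The model of Eq. (25) is not reproduced in this excerpt.

```python
import numpy as np

def kl_small_mu(S1, S2):
    # Eq. (22): Kullback-Leibler distance for Student's t variables with mu/N << 1
    N = S1.shape[0]
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1 + N * np.log(np.trace(np.linalg.solve(S2, S1)) / N))

def bootstrap_kl_estimate(n_sim=20, n_boot=20, N=10, T=200, mu=5.9, seed=5):
    """Average K(C^b, C) over simulated Student's t series and their bootstrap replicas."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_sim):
        # Sigma = identity is enough here: the expectation is model independent
        y = rng.standard_normal((T, N))
        X = np.sqrt(mu / rng.chisquare(mu, size=(T, 1))) * y
        C = np.corrcoef(X, rowvar=False)
        for _ in range(n_boot):
            Xb = X[rng.integers(0, T, size=T)]      # bootstrap replica of the records
            vals.append(kl_small_mu(np.corrcoef(Xb, rowvar=False), C))
    return float(np.mean(vals))

estimate = bootstrap_kl_estimate()
```

With the much smaller N and T used here the numerical value is not comparable to the 6.01 quoted in the text; only the procedure is illustrated.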
It is also worth noting that results very similar to those shown in Fig. 9 can
be obtained by using the expression of Eq. (15) (valid for Gaussian variables)
to evaluate the Kullback-Leibler distance instead of Eq. (22) (valid for
Student's t-distributed variables with µ/N ≪ 1). This fact can be interpreted
by observing that the correlation matrices involved in the Kullback-Leibler
distance do not differ much from one another, and therefore Eq. (22) gives
estimates of the Kullback-Leibler distance similar to those obtained by using
Eq. (15), which is, strictly speaking, only valid for Gaussian variables and
which has been used in Tumminello et al. (2007a). A final remark concerns
the shrinkage technique. We note that the shrinkage parameter α_K = 0.55 is
significantly different from α* = 0.16, the value obtained by minimizing the
Frobenius norm. The point associated with α* = 0.16 (green circle in Fig. 9) in the
stability-information plane is quite far from the point of an ideal filtering procedure
able to perfectly detect the correlation matrix Σ of the system (blue circle in
Fig. 9). This observation suggests that by using the Frobenius distance to
estimate the shrinkage parameter α one puts too much faith in the statistical
robustness of the sample correlation matrix. Conversely, α_K represents
a more reliable estimate of the optimal shrinkage parameter.
7 Conclusions
and graphs provides additional clues about the interrelations among stocks of
different economic sectors and sub-sectors. It is worth noting that this kind of
information is not contained in the hierarchical trees obtained by the ALCA and
SLCA clustering procedures or, equivalently, in the associated ultrametric
correlation matrices. The information obtained from what we call the "filtering
procedure" of the correlation matrix is subject to statistical uncertainty.
For this reason, we discuss a bootstrap methodology able to quantify the
statistical robustness of both the hierarchical trees and the correlation based
trees or graphs.
The hierarchical trees and correlation based trees and graphs associated with
portfolios of stocks traded in financial markets often show clusters of stocks
partitioned into sub-clusters, sub-clusters partitioned into sub-sub-clusters, and
so on down to the level of the single stock. The ubiquity of this observation
has motivated us to develop a hierarchically nested factor model able to fully
describe this property. Our model is a nested factor model characterized by the
same correlation matrix as the empirical set of data. The model is expressed
in a direct and simple form when all the correlation coefficients are positive
or very close to positive (for a precise definition of the limits of validity of this
extension see Section 3). The number of factors of the model is by construction
equal to the number of elements of the system. Again, the selection of the most
statistically reliable factors detected in a real system is obtained by a procedure
based on bootstrap, with a bootstrap threshold selected in a self-consistent way.
References
[1] Anderberg, M.R., 1973. Cluster Analysis for Applications. Academic Press, New
York.
[3] Biroli, G., Bouchaud, J.-P., Potters, M., 2007. Acta Physica Polonica B 38,
4009-4026.
[4] Blatt, M., Wiseman, S., Domany, E., 1996. Phys. Rev. Lett. 76, 3251-3254.
[5] Bonanno, G., Vandewalle, N., Mantegna, R.N., 2000. Phys. Rev. E 62, R7615.
[6] Bonanno, G., Lillo, F., Mantegna, R.N., 2001. Quantitative Finance 1, 96-104.
[7] Bonanno, G., Caldarelli, G., Lillo, F., Mantegna, R.N., 2003. Phys. Rev. E 68,
046130.
[8] Bouchaud, J.-P., Potters, M., 2003. Theory of Financial Risk and Derivative
Pricing. Cambridge University Press, Cambridge, UK.
[9] Coronnello, C., Tumminello, M., Lillo, F., Micciche, S., Mantegna, R.N., 2005.
Acta Phys. Pol. B 36, 2653.
[10] Cover, T.M., Thomas, J. A., 1991. Elements of Information Theory. Wiley
Interscience, New York.
[12] Efron, B., Tibshirani, R.J., 1994. An introduction to the Bootstrap. Chapman
& Hall/CRC, Boca Raton, Florida.
[13] Giada, L., Marsili, M., 2001. Phys. Rev. E 63, 1101.
[14] Hillis, D.M., Bull, J.J., 1993. Syst. Biol. 42, 182-192.
[15] Hutt, A., Uhl, C., Friedrich, R., 1999. Phys. Rev. E 60, 1350.
[16] Kraskov, A., Stogbauer, H., Andrzejak, R.G., Grassberger, P., 2005. Europhys.
Lett., 70, 278-284.
[17] Kullback, S., Leibler, R. A., 1951. Ann. Math. Statist. 22, 79-86.
[18] Laloux L., et al., 1999. Phys. Rev. Lett. 83, 1467.
[19] Ledoit, O., Wolf, M., 2003. J. Empir. Finance 10, 603-621.
[21] Mardia, K.V., Kent, J.T., Bibby, J. M., 1979. Multivariate Analysis. Academic
Press, San Diego, CA.
[22] Mehta, M.L., 1990. Random Matrices. Academic Press, New York.
[23] Miccichè, S., Bonanno, G., Lillo, F., Mantegna, R.N., 2003. Physica A 324, 66.
[24] Onnela, J.-P., Chakraborti, A., Kaski, K., Kertész, J., Kanto, A., 2003. Phys.
Rev. E 68, 056110.
[25] Plerou, V., et al., 1999. Phys. Rev. Lett. 83, 1471.
[26] Potters, M., Bouchaud, J.-P., Laloux, L., 2005. Acta Phys. Pol. B 36, 2767.
[27] Rosenow, B., Plerou, V., Gopikrishnan, P., Stanley, H.E, 2002. Europhys. Lett.
59, 500-506.
[28] Schäfer, J., Strimmer, K., 2005. Stat. Appl. Gen. Mol. Biol. 4 (1), 32.
[29] Simon, H.A., 1962. Proceedings of the American Philosophical Society 106, 467.
[30] Slonim, N., et al., 2005. Proc. Natl. Acad. Sci. USA 102, 18297.
[31] Tola, V., Lillo, F., Gallegati, M., Mantegna, R.N., 2008. Journal of Economic
Dynamics & Control 32, 235.
[33] Tumminello, M., Aste, T., Di Matteo, T., Mantegna, R.N., 2005. Proc. Natl.
Acad. Sci. USA 102, 10421.
[34] Tumminello, M., Lillo, F., Mantegna, R.N., 2007a. Phys. Rev. E 76, 031123.
[35] Tumminello, M., Lillo, F., Mantegna, R.N., 2007b. Acta Phys. Pol. B 38, 4079-
4088.
[36] Tumminello, M., Coronnello, C., Lillo, F., Miccichè, S., Mantegna, R.N., 2007c.
Int. J. Bifurcation Chaos 17, 2319-2329.
[37] Tumminello, M., Lillo, F., Mantegna, R.N., 2007d. EPL 78, 30006.
[38] Tumminello, M., Di Matteo, T. , Aste, T. , Mantegna, R.N., 2007e. Eur. Phys.
J. B 55, 209.
Table 1
1A- Stocks with tick symbol from ABT to IGT. The first column is the tick symbol
in alphabetical order, the second column reports an abbreviation of the economic
sector of the considered company. Specifically, we have Basic Materials (BM), Con-
sumer Cyclical (CC), Consumer Non Cyclical (CNC), Energy (E), Services (S),
Financial (F), Healthcare (H), Technology (T), Utilities (U), Transportation (TR),
Conglomerates (CO) and Capital Goods (CG). The third column indicates the economic
sub-sector of the company. The fourth column reports the company name,
whereas the fifth column is the numerical label of the stock used in Figs. 6 and 7.
tick sector sub-sector name ord
ABT H Major Drugs Abbott Laboratories 72
ADI T Semiconductors Analog Devices Inc 48
AFL F Insurance Accident & Health Aflac Inc 62
AIG F Insurance Prop. & Casualty American Intl Group Inc 16
ALL F Insurance Prop. & Casualty Allstate Corp The 60
AVP CNC Personal & Household Products Avon Products Inc 82
AXP F Consumer Financial Services American Express Company 6
BA CG Aerospace & Defense Boeing Co 36
BAC F Money Center Banks Bank Of America Corp 12
BAX H Medical Equipment & Supplies Baxter International Inc 78
BBY S Retail Technology Best Buy Co Inc 43
BK F Money Center Banks Bank Of New York Inc 14
BLS S Communication Services Bellsouth Corporation 67
BMY H Major Drugs Bristol Myers Squibb Company 74
BNI TR Railroad Burlington Nrthrn Santa Fe Com 38
BSC F Investment Services Bear Stearns Companies Inc 2
BSX H Medical Equipment & Supplies Boston Scientific Corp 97
BUD CNC Beverages Alcoholic Anheuser Busch Cos Inc 87
CA T Software & Programming Computer Associates Intl Inc 52
CAG CNC Food Processing Conagra Foods Inc. 91
CAH H Biotechnology & Drugs Cardinal Health Inc 75
CAT CG Constr. & Agric. Machinery Caterpillar Inc 29
CCU S Broadcasting & Cable TV Clear Channel Communictns Inc 22
CI F Insurance Accident & Health Cigna Corp 63
CL CNC Personal & Household Products Colgate-palmolive Co 80
DD BM Chemical - Plastic & Rubber Du Pont De Nemours E I Co 25
DE CG Constr. & Agric. Machinery Deere Co 30
DHR T Scientific & Technical Instr. Danaher Corp 32
DIS S Broadcasting & Cable TV Walt Disney Co-disney Common 21
DOW BM Chemical - Plastic & Rubber Dow Chemical Co 27
DUK U Electric Utilities Duke Energy Corporation 93
Table 2
1B- Stocks with tick symbol from EMC to LOW. The content of columns is the
same as in Table 1A.
tick sector sub-sector name ord
EMC T Computer Storage Devices Emc Corporation 49
EMR CO Conglomerates Emerson Electric Co 33
FDC T Computer Services First Data Corp 46
FNM F Consumer Financial Services Fannie Mae 58
FON S Communication Services Sprint Corp Fon Group 68
FRE F Consumer Financial Services Freddie Mac D/b/a Voting 59
G CNC Personal & Household Products Gillette Co 83
GCI S Printing & Publishing Gannett Co Inc 19
GD CG Aerospace & Defense General Dynamics Corp 95
GDT H Medical Equipment & Supplies Guidant Corp 98
GDW F S&Ls/Savings Banks Golden West Financial Corp 61
GE CO Conglomerates General Electric 5
GIS CNC Food Processing General Mills Inc 90
GM CC Auto & Truck Manufacturers General Motors Corp 34
GPS S Retail Apparel Gap Inc The 45
HD S Retail Home Improvement Home Depot Inc 39
HDI CC Recreational Products Harley Davidson Inc 24
IBM T Computer Hardware Intl Business Machines Corp 50
IGT S Casinos & Gaming Intl Game Technology 56
IP BM Paper & Paper Products International Paper Co 28
ITW CG Misc. Capital Goods Illinois Tool Works 31
JNJ H Major Drugs Johnson And Johnson 71
K CNC Food Processing Kellogg Co 89
KMB BM Paper & Paper Products Kimberly Clark Corp 81
KO CNC Beverages Non-Alcoholic Coca-cola Co 84
KR S Retail Grocery Kroger Co 94
KRB F Regional Banks M B N A Corp 7
KSS S Retail Department & Discount Kohls Corp 42
LEH F Investment Services Lehman Brothers Holdings 1
LLY H Major Drugs Lilly Eli Co 73
LOW S Retail Home Improvement Lowes Companies Inc 40
Table 3
1C- Stocks with tick symbol from MCD to WMT. The content of columns is the
same as in Table 1A.
tick sector sub-sector name ord
MCD S Restaurants Mcdonalds Corp 99
MDT H Medical Equipment & Supplies Medtronic Inc 77
MEL F Investment Services Mellon Financial Corp 15
MER F Investment Services Merrill Lynch Co Inc 3
MMC F Insurance Miscellaneous Marsh Mclennan Cos Inc 18
MOT T Communication Equipment Motorola Inc 51
MRK H Major Drugs Merck Co Inc 70
NCC F Regional Banks National City Corp 8
NEM BM Gold & Silver Newmont Mining Corp Holding C 100
NOC CG Aerospace & Defense Northrop Grumman Cp Hldg Co 96
OMC S Advertising Omnicom Group Inc 23
ONE F Regional Banks Bank One Corp 10
OXY E Oil & Gas Operations Occidental Petroleum Corp 54
PEP CNC Beverages Non-Alcoholic Pepsico Inc 85
PFE H Major Drugs Pfizer Inc 69
PG CNC Personal & Household Products Procter Gamble Co 79
PGR F Insurance Prop. & Casualty Progressive Corp 17
PNC F Regional Banks Pnc Finl Svcs Grp Inc The 11
PPG BM Chemical Manufacturing Ppg Industries Inc 26
RD E Oil & Gas - Integrated Royal Dutch Pet New 1.25gldrs 55
S S Retail Department & Discount Sears Roebuck Co 44
SBC S Communication Services Sbc Communications Inc 66
SCH F Investment Services Schwab Charles Corp 4
SGP H Major Drugs Schering Plough Corp 76
SLB E Oil Well Services & Equipment Schlumberger Ltd 53
SLE CNC Food Processing Sara Lee Corp 88
SO U Electric Utilities Southern Co 92
STI F Regional Banks Suntrust Banks Inc 9
SYY S Retail Grocery Sysco Corp 86
TRB S Printing & Publishing Tribune Company 20
TXN T Semiconductors Texas Instruments 47
TYC CO Conglomerates Tyco International Ltd New 65
UNP TR Railroad Union Pacific Corporation 37
UTX CO Conglomerates United Technologies Corp 35
WAG S Retail Drugs Walgreen Company 57
WFC F Money Center Banks Wells Fargo Co New 13
WLP F Insurance Accident & Health Wellpoint Hlth Netwks Hldg Co 64
WMT S Retail Department & Discount Wal-mart Stores Inc 41