Discovery of optimal factors in binary data via a novel method of matrix decomposition ✩

R. Belohlavek *, V. Vychodil

Article history: Received 15 December 2007; received in revised form 30 July 2008; available online 18 May 2009.

Dedicated to Professor Rudolf Wille.

Keywords: Binary matrix decomposition; Factor analysis; Binary data; Formal concept analysis; Concept lattice

Abstract. We present a novel method of decomposition of an n × m binary matrix I into a Boolean product A ◦ B of an n × k binary matrix A and a k × m binary matrix B with k as small as possible. Attempts to solve this problem are known from Boolean factor analysis, where I is interpreted as an object–attribute matrix, A and B are interpreted as object–factor and factor–attribute matrices, and the aim is to find a decomposition with a small number k of factors. The method presented here is based on a theorem proved in this paper. It says that optimal decompositions, i.e. those with the least number of factors possible, are those where the factors are formal concepts in the sense of formal concept analysis. Finding an optimal decomposition is an NP-hard problem. However, we present an approximation algorithm for finding optimal decompositions which is based on the insight provided by the theorem. The algorithm avoids the need to compute all formal concepts and significantly outperforms a greedy approximation algorithm for a set covering problem to which the problem of matrix decomposition is easily shown to be reducible. We present results of several experiments with various data sets, including those from the CIA World Factbook and the UCI Machine Learning Repository. In addition, we present further geometric insight, including a description of transformations between the space of attributes and the space of factors.
Matrix decomposition methods provide representations of an object–variable data matrix by a product of two differ-
ent matrices, one describing a relationship between objects and hidden variables or factors, and the other describing a
relationship between factors and the original variables. In this paper, we consider the following problem:
Problem. Given an n × m matrix I with I_ij ∈ {0, 1}, a decomposition of I is sought into a Boolean matrix product A ◦ B of an n × k matrix A with A_il ∈ {0, 1} and a k × m matrix B with B_lj ∈ {0, 1}, with k as small as possible.
✩ Supported by grant No. 201/05/0079 of the Czech Science Foundation, by grant No. 1ET101370417 of GA AV ČR, and by institutional support, research plan MSM 6198959214. This paper is an extended version of: R. Belohlavek, V. Vychodil, On Boolean factor analysis with formal concepts as factors, in: SCIS & ISIS 2006, Tokyo, Japan, pp. 1054–1059.
* Corresponding author at: Thomas J. Watson School of Engineering and Applied Science, Binghamton University, Department of Systems Science and Industrial Engineering, Binghamton, NY, United States.
E-mail addresses: [email protected] (R. Belohlavek), [email protected] (V. Vychodil).
Note that the smallest number k mentioned above is known as the Schein rank of matrix I in Boolean matrix theory, see [15, p. 37]. Recall that a Boolean matrix (or binary matrix) is a matrix whose entries are 0 or 1. The Boolean matrix product A ◦ B is defined by

$$(A \circ B)_{ij} = \bigvee_{l=1}^{k} A_{il} \cdot B_{lj}, \tag{1}$$

where ∨ denotes maximum (the truth function of logical disjunction) and · is the usual product (the truth function of logical conjunction). As an example,
$$\begin{pmatrix}1&1&0&0&0\\1&1&0&0&1\\1&1&1&1&0\\1&0&0&0&1\end{pmatrix}
= \begin{pmatrix}1&0&0&1\\1&0&1&0\\1&1&0&0\\0&0&1&0\end{pmatrix}
\circ \begin{pmatrix}1&1&0&0&0\\0&0&1&1&0\\1&0&0&0&1\\0&1&0&0&0\end{pmatrix}.$$
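For concreteness, the Boolean product (1) can be implemented in a few lines of Python (our illustration, not part of the paper) and checked against the example above:

```python
import numpy as np

def boolean_product(A, B):
    """Boolean matrix product: (A o B)_ij = max over l of A_il * B_lj."""
    A, B = np.asarray(A, dtype=bool), np.asarray(B, dtype=bool)
    # For each (i, j), take OR over l of A[i, l] AND B[l, j].
    return (A[:, :, None] & B[None, :, :]).any(axis=1).astype(int)

A = [[1, 0, 0, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 0]]
B = [[1, 1, 0, 0, 0],
     [0, 0, 1, 1, 0],
     [1, 0, 0, 0, 1],
     [0, 1, 0, 0, 0]]
I = [[1, 1, 0, 0, 0],
     [1, 1, 0, 0, 1],
     [1, 1, 1, 1, 0],
     [1, 0, 0, 0, 1]]
assert (boolean_product(A, B) == np.array(I)).all()
```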
Boolean matrix product represents a non-linear relationship between objects, factors, and original attributes. By this, we mean that if v_1 and v_2 are two 1 × k binary vectors for which v_1 + v_2 (the ordinary sum of vectors) is a binary vector, then (v_1 + v_2) ◦ B is in general different from v_1 ◦ B + v_2 ◦ B (the ordinary sum of matrices). Nevertheless, the relationship between objects, factors, and original attributes specified by I = A ◦ B has a clear verbal description. Namely, let I, A, and B represent an object–attribute relationship (I_ij = 1 means that object i has attribute j), an object–factor relationship (A_il = 1 means that factor l applies to object i), and a factor–attribute relationship (B_lj = 1 means that attribute j is one of the particular manifestations of factor l). Then I = A ◦ B says: “object i has attribute j if and only if there is a factor l such that l applies to i and j is one of the manifestations of l.” From this description, we immediately see that a decomposition I = A ◦ B has the following interpretation:
Factor analysis interpretation. Decomposition of I into A ◦ B corresponds to discovery of k factors which explain the data
represented by I .
Reducing dimensionality of data by mapping the data from a space of directly observable variables into a lower-
dimensional space of new variables is of fundamental importance for understanding and management of data. Classical
methods such as PCA (principal component analysis), ICA (independent component analysis), SVD (singular value decomposition), or FA (factor analysis) are available for data represented by real-valued matrices, see e.g. [12,21]. Non-negative matrix
factorization [18] is an example of a method for decomposition of a constrained matrix into a product of constrained matri-
ces. Namely, the constraint is that the matrix entries of both the input matrix and the output matrices are non-negative. For
data represented by binary matrices, several methods have been proposed. Some of them are based on various extensions of
the methods developed for real-valued matrices, see e.g. [19,26,27,29,33] and also [30] for further references. By and large,
these methods yield a decomposition of a binary matrix into a product of possibly real-valued non-binary matrices which
causes problems regarding interpretation, see e.g. [30].
Directly relevant to our paper are methods which attempt to decompose a binary matrix I into A ◦ B with ◦ being
the Boolean product of binary matrices defined by (1). Early work on this topic includes [23,24,28]. These papers already
contain complexity results showing hardness of problems related to such decompositions. Another source of information on
composition and decomposition of binary matrices is a monograph on binary matrices [15]. These early works seem to be
forgotten in the recent work on decompositions of binary matrices.
The recent work includes an approach using Hopfield-like associative neural networks which has been studied in a series
of papers by Frolov et al., see e.g. [8]. This approach is a heuristic in which the factors correspond to attractors of the neural
network. Little theoretical insight regarding the decomposition is used in this approach. In the introduction, the authors say
[8, p. 698]: “However, there is no general algorithm for the analysis except a brute force search. This method requires an
exponentially increasing number of trials as the dimension of the pattern space increases. It can, therefore, only be used
when this dimension is relatively small.” In this paper, we develop a theoretical insight which leads to quite an efficient
algorithm for finding factors and matrix decompositions and which therefore eliminates the need for a brute force search
mentioned in [8, p. 698]. [10] studies the problem of covering binary matrices with their submatrices containing 1s, which
is relevant to our approach. Namely, as we note in Section 2.6, such a covering is equivalent to an approximation of a
binary matrix by a Boolean product A ◦ B from below. Another paper on this topic is [14], where the authors proposed to use formal concepts as factors for the decomposition problem. [14] does not provide any theoretical insight and does not even say how formal concepts are to be used; basically, it describes results of experiments with small data. [14] can be seen as the inspiration for our paper. We developed the basic insight on the role of formal concepts in decompositions of binary matrices in our conference paper (R. Belohlavek, V. Vychodil, On Boolean factor analysis with formal concepts as factors, in: SCIS & ISIS 2006, Tokyo, Japan, pp. 1054–1059), of which the present paper is an extended version. [22]
presents an algorithm for finding approximate decompositions of binary matrices into a Boolean product of binary matrices
which is based on associations between columns of the matrix. [30] contains several remarks and experiments regarding the
general question of dimensionality of binary data, including comparisons of various methods for decomposition of binary
matrices into products of non-binary matrices. [31] looks at the relationship between several problems related to the decomposition
of binary matrices. [20] discusses problems related to this paper, namely a problem of covering a binary matrix by so-called
pertinent formal concepts which are formal concepts minimizing an entropy function associated to a classification problem
discussed in that paper. Unfortunately, the algorithm presented in [20] is not described clearly enough for us to implement
it and perform comparisons with our algorithms.
We present a novel method for the problem of decomposition of binary matrices described in Section 1.1. We prove an
optimality theorem saying that the decompositions with the least number k of factors are those where the factors are formal
concepts in the sense of formal concept analysis. The constructive aspect of this theorem is that the set of formal concepts
associated to the input matrix provides us with a non-redundant space of factors for optimal decompositions. This finding
provides a new perspective for decompositions of binary matrices and Boolean factor analysis. It opens a way to methods
which are better than the brute force search methods and their variants, see e.g. [8]. An immediate consequence of our
theorem is that the decomposition problem is reducible to the well-known set covering problem. This reduction enables us
to adopt a greedy approximation algorithm which is available for the set covering problem to find approximately optimal
decompositions of binary matrices. We present another greedy approximation algorithm for finding optimal decompositions which significantly outperforms the algorithm for the set covering problem. This algorithm avoids the need to compute the set of all formal concepts associated to the input matrix. We present experiments comparing the performance of the algorithms
as well as experiments demonstrating factorization of real data sets. For this purpose, we use CIA World Factbook data and
UCI Machine Learning Repository data. In addition, we present results on transformations between the space of original at-
tributes and the space of factors. We outline future research topics including an extension of our approach to decomposition
of matrices containing degrees from partially ordered scales such as the unit interval [0, 1].
Formal concept analysis (FCA) is a method for data analysis with applications in various domains [5,9]. Our paper can be
seen as an application of FCA in data preprocessing. Namely, the factors in a binary matrix we are looking for are sought in
the set of all formal concepts associated to the matrix.
Let X = {1, ..., n} and Y = {1, ..., m} be sets of objects and attributes, respectively, and let I be a binary relation between X and Y. The triplet ⟨X, Y, I⟩ is called a formal context. Since a binary relation between X and Y can be represented by an n × m Boolean matrix, we denote both the binary relation and the corresponding Boolean matrix by I. That is, for the entry I_ij of (matrix) I we have I_ij = 1 iff the pair ⟨i, j⟩ belongs to (relation) I, and I_ij = 0 if ⟨i, j⟩ does not belong to (relation) I.

A formal concept of ⟨X, Y, I⟩ is any pair ⟨C, D⟩ of sets C ⊆ X (the so-called extent) and D ⊆ Y (the so-called intent) such that C↑ = D and D↓ = C, where

$$C^{\uparrow} = \{\, y \in Y \mid \text{for each } x \in C:\ \langle x, y\rangle \in I \,\}$$

is the set of all attributes shared by all objects from C, and

$$D^{\downarrow} = \{\, x \in X \mid \text{for each } y \in D:\ \langle x, y\rangle \in I \,\}$$

is the set of all objects sharing all attributes from D. The set of all formal concepts of ⟨X, Y, I⟩ is denoted by B(X, Y, I). That is,

$$\mathcal{B}(X, Y, I) = \{\, \langle C, D\rangle \mid C^{\uparrow} = D,\ D^{\downarrow} = C \,\}.$$
Under the partial order ≤ defined by

⟨C_1, D_1⟩ ≤ ⟨C_2, D_2⟩  iff  C_1 ⊆ C_2 (iff D_2 ⊆ D_1)

for ⟨C_1, D_1⟩, ⟨C_2, D_2⟩ ∈ B(X, Y, I), B(X, Y, I) happens to be a complete lattice, the so-called concept lattice associated to ⟨X, Y, I⟩ [9,32]. That is, for any subset K of B(X, Y, I), there exists the least formal concept in B(X, Y, I) greater than or equal to every formal concept from K (the supremum of K) as well as the greatest formal concept in B(X, Y, I) smaller than or equal to every formal concept from K (the infimum of K). Efficient algorithms for computing B(X, Y, I) exist; a good overview is provided by [17].
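For illustration, the operators ↑ and ↓ and a naive enumeration of B(X, Y, I) can be sketched in Python as follows (a didactic sketch of ours, exponential in the number of attributes; efficient algorithms are surveyed in [17]):

```python
import numpy as np
from itertools import combinations

def up(I, C):
    """C↑: attributes shared by all objects in C (all attributes if C is empty)."""
    return frozenset(j for j in range(I.shape[1]) if all(I[i, j] for i in C))

def down(I, D):
    """D↓: objects having all attributes in D (all objects if D is empty)."""
    return frozenset(i for i in range(I.shape[0]) if all(I[i, j] for j in D))

def concepts(I):
    """All formal concepts ⟨C, D⟩ with C↑ = D and D↓ = C.

    Every extent is of the form D↓ for some D ⊆ Y, so closing every subset
    of attributes enumerates all concepts (fine for tiny examples only).
    """
    I = np.asarray(I)
    found = set()
    attrs = range(I.shape[1])
    for r in range(len(attrs) + 1):
        for D in combinations(attrs, r):
            C = down(I, D)
            found.add((C, up(I, C)))
    return found

I = np.array([[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 1],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 0, 1]])
# With 0-based indexing, ⟨{1,2,3}, {1,2}⟩ from Example 1 appears as ({0,1,2}, {0,1}).
assert (frozenset({0, 1, 2}), frozenset({0, 1})) in concepts(I)
```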
Example 1. Let a binary matrix describing a binary relation I between X = {1, ..., 4} and Y = {1, ..., 5} be given by

$$I = \begin{pmatrix}1&1&0&0&0\\1&1&0&0&1\\1&1&1&1&0\\1&0&0&0&1\end{pmatrix}.$$

Then the pair ⟨{1, 2, 3}, {1, 2}⟩ is a formal concept of B(X, Y, I) because {1, 2, 3}↑ = {1, 2} and {1, 2}↓ = {1, 2, 3}. Further formal concepts include ⟨{1, 2, 3, 4}, {1}⟩, ⟨{2}, {1, 2, 5}⟩, ⟨{2, 4}, {1, 5}⟩, ⟨{3}, {1, 2, 3, 4}⟩, and ⟨∅, {1, 2, 3, 4, 5}⟩.
Consider an n × m binary matrix I . Our aim is to decompose I into a Boolean product I = A ◦ B of binary matrices A
and B of dimensions n × k and k × m, as described above.
We are going to consider decompositions of I into a product

A_F ◦ B_F

of binary matrices A_F and B_F constructed from a set F of formal concepts associated to I. In particular, consider the concept lattice B(X, Y, I) associated to I, with X = {1, ..., n} and Y = {1, ..., m}. Let

F = {⟨A_1, B_1⟩, ..., ⟨A_k, B_k⟩} ⊆ B(X, Y, I),

i.e. F is a set of formal concepts from B(X, Y, I). Denote by A_F and B_F the n × k and k × m binary matrices defined by

$$(A_{\mathcal F})_{il} = \begin{cases}1 & \text{if } i \in A_l,\\ 0 & \text{if } i \notin A_l,\end{cases} \qquad (B_{\mathcal F})_{lj} = \begin{cases}1 & \text{if } j \in B_l,\\ 0 & \text{if } j \notin B_l,\end{cases}$$

for l = 1, ..., k. That is, the lth column (A_F)_{_l} of A_F is the characteristic vector of A_l and the lth row (B_F)_{l_} of B_F is the characteristic vector of B_l.
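In code, assembling A_F and B_F from a set F of concepts is immediate (a sketch of ours, with concepts given as pairs of 0-based index sets):

```python
import numpy as np

def factor_matrices(F, n, m):
    """Build A_F (n x k) and B_F (k x m) from concepts F = [(extent, intent), ...]."""
    k = len(F)
    A = np.zeros((n, k), dtype=int)
    B = np.zeros((k, m), dtype=int)
    for l, (extent, intent) in enumerate(F):
        for i in extent:
            A[i, l] = 1  # lth column = characteristic vector of the extent
        for j in intent:
            B[l, j] = 1  # lth row = characteristic vector of the intent
    return A, B
```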
In the previous example, I = A_F ◦ B_F. The question of whether for every I there is some F ⊆ B(X, Y, I) such that I = A_F ◦ B_F has a positive answer.
Theorem 1 (Universality of formal concepts as factors). For every I there is F ⊆ B(X, Y, I) such that I = A_F ◦ B_F.

Proof. The proof follows from the fact that I_ij = 1 iff there is a formal concept ⟨C, D⟩ ∈ B(X, Y, I) such that i ∈ C and j ∈ D. Therefore, taking F = B(X, Y, I) we have: (A_F ◦ B_F)_ij = 1 iff there is l such that (A_F)_il = 1 and (B_F)_lj = 1 iff there is ⟨C_l, D_l⟩ ∈ B(X, Y, I) with i ∈ C_l and j ∈ D_l iff I_ij = 1. □
Moreover, decompositions using formal concepts as factors are optimal in that they yield the least number of factors possible:

Theorem 2 (Optimality of formal concepts as factors). Let I = A ◦ B for n × k and k × m binary matrices A and B. Then there exists a set F ⊆ B(X, Y, I) of formal concepts of I with

|F| ≤ k

such that for the n × |F| and |F| × m binary matrices A_F and B_F we have

I = A_F ◦ B_F.
Proof. First, it is an easy-to-observe fact that formal concepts of I are just maximal rectangles of matrix I, i.e. maximal submatrices of I which are full of 1s. For instance, in the binary matrix

$$\begin{pmatrix}\mathbf{1}&\mathbf{1}&0&0&0\\\mathbf{1}&\mathbf{1}&0&0&1\\\mathbf{1}&\mathbf{1}&1&1&0\\1&0&0&0&1\end{pmatrix},$$

the bold 1s form a maximal rectangle, with rows 1, 2, 3, and columns 1 and 2. This rectangle corresponds to the formal concept ⟨{1, 2, 3}, {1, 2}⟩ of I. Contrary to that, the rectangle corresponding to rows 1 and 3 and columns 1 and 2 is not maximal because it is contained in the one consisting of the bold 1s.
Second, observe that I = A ◦ B for an n × k matrix A and a k × m matrix B means that I can be seen as a ∨-superposition, i.e. as a union, of k rectangles consisting of 1s (not necessarily maximal ones). Namely, since

$$I_{ij} = A_{i1} \cdot B_{1j} \vee \cdots \vee A_{ik} \cdot B_{kj},$$

I is a union of the rectangles J_l = A_{_l} ◦ B_{l_}, l = 1, ..., k. Note that A_{_l} ◦ B_{l_} is an n × m matrix which results by Boolean multiplication of the lth column A_{_l} of A and the lth row B_{l_} of B. As an example, with I = A ◦ B being
$$\begin{pmatrix}1&1&0&0&0\\1&1&0&0&1\\1&1&1&1&0\\1&0&0&0&1\end{pmatrix}
= \begin{pmatrix}1&0&0&1\\1&0&1&0\\1&1&0&0\\0&0&1&0\end{pmatrix}
\circ \begin{pmatrix}1&1&0&0&0\\0&0&1&1&0\\1&0&0&0&1\\0&1&0&0&0\end{pmatrix},$$

the decomposition can be rewritten as the ∨-superposition

$$\begin{pmatrix}1&1&0&0&0\\1&1&0&0&1\\1&1&1&1&0\\1&0&0&0&1\end{pmatrix}
= \begin{pmatrix}1&1&0&0&0\\1&1&0&0&0\\1&1&0&0&0\\0&0&0&0&0\end{pmatrix}
\vee \begin{pmatrix}0&0&0&0&0\\0&0&0&0&0\\0&0&1&1&0\\0&0&0&0&0\end{pmatrix}
\vee \begin{pmatrix}0&0&0&0&0\\1&0&0&0&1\\0&0&0&0&0\\1&0&0&0&1\end{pmatrix}
\vee \begin{pmatrix}0&1&0&0&0\\0&0&0&0&0\\0&0&0&0&0\\0&0&0&0&0\end{pmatrix}$$

of rectangles J_1, J_2, J_3, J_4, i.e. binary matrices whose 1s form rectangles.
Third, let now I = A ◦ B for an n × k binary matrix A and a k × m binary matrix B, and consider the corresponding rectangles J_1, ..., J_k, i.e. binary matrices whose 1s form rectangles. Each such rectangle J_l is obviously contained in some maximal rectangle J̄_l of I, i.e. (J_l)_ij ≤ (J̄_l)_ij for all i, j. Now, every J̄_l corresponds to a formal concept ⟨C_l, D_l⟩ in that (J̄_l)_ij = 1 iff i ∈ C_l and j ∈ D_l. Since

$$I_{ij} = \bigvee_{l=1}^{k} (J_l)_{ij} \le \bigvee_{l=1}^{k} (\bar J_l)_{ij} \le I_{ij},$$

the ∨-superposition of the maximal rectangles J̄_l yields I. Putting therefore

F = {⟨C_1, D_1⟩, ..., ⟨C_k, D_k⟩},

we get I = A_F ◦ B_F and |F| ≤ k. Note that since two distinct rectangles may be contained in a single maximal rectangle, we may have |F| < k. □
We will see later that the geometric argument behind the proof shows that the problem of finding a decomposition of I into A_F ◦ B_F can be reduced to a particular instance of the set covering optimization problem.

Denoting by (A_F)_{_l} and (B_F)_{l_} the lth column of A_F and the lth row of B_F, respectively, I = A_F ◦ B_F can further be rewritten as

I = (A_F)_{_1} ◦ (B_F)_{1_} ∨ (A_F)_{_2} ◦ (B_F)_{2_} ∨ (A_F)_{_3} ◦ (B_F)_{3_},

which yields a ∨-decomposition of I into maximal rectangles.
With our example, we have

$$\begin{pmatrix}1&1&0&0&0\\1&1&0&0&1\\1&1&1&1&0\\1&0&0&0&1\end{pmatrix}
= \begin{pmatrix}1&1&0&0&0\\1&1&0&0&0\\1&1&0&0&0\\0&0&0&0&0\end{pmatrix}
\vee \begin{pmatrix}0&0&0&0&0\\0&0&0&0&0\\1&1&1&1&0\\0&0&0&0&0\end{pmatrix}
\vee \begin{pmatrix}0&0&0&0&0\\1&0&0&0&1\\0&0&0&0&0\\1&0&0&0&1\end{pmatrix}.$$
Note that the argument used in the proof of Theorem 2 implies that for any decomposition I = A ◦ B, factors correspond to rectangles of 1s (submatrices of I consisting of 1s). To every such factor there corresponds a set C of objects and a set D of attributes. As a result, factors in Boolean factor analysis can in general be regarded as covering objects and applying to attributes. That is, every factor can be interpreted as having its extent C and intent D. The intuitive requirement to include all objects to which the attributes from D apply and to include all attributes shared by all objects from C supports the idea of having maximal C and D and, therefore, of having maximal rectangles. A crucial argument for maximal rectangles as factors is, nevertheless, Theorem 2, together with further practical advantages of maximal rectangles, including computational tractability, cf. [17].
Formal concepts which are both object and attribute concepts are mandatory. That is, denoting by O(X, Y, I) the set of all object concepts ⟨{x}↑↓, {x}↑⟩ (x ∈ X) and by A(X, Y, I) the set of all attribute concepts ⟨{y}↓, {y}↓↑⟩ (y ∈ Y), every concept from O(X, Y, I) ∩ A(X, Y, I) belongs to every set F of formal concepts with I = A_F ◦ B_F.

Proof. If ⟨C, D⟩ ∈ O(X, Y, I) ∩ A(X, Y, I), i.e. ⟨C, D⟩ = ⟨{x}↑↓, {x}↑⟩ for some x ∈ X and ⟨C, D⟩ = ⟨{y}↓, {y}↓↑⟩ for some y ∈ Y, then ⟨C, D⟩ is the only formal concept from B(X, Y, I) for which x ∈ C and y ∈ D. Namely, if ⟨C_1, D_1⟩ is another such formal concept, then from {x} ⊆ C_1 we get C = {x}↑↓ ⊆ C_1↑↓ = C_1. Applying the antitony of ↑, we get D = C↑ ⊇ C_1↑ = D_1. Furthermore, from {y} ⊆ D_1 we get D = {y}↓↑ ⊆ D_1↓↑ = D_1. This gives D = D_1 and, as a consequence, ⟨C, D⟩ = ⟨C_1, D_1⟩. The maximal rectangle corresponding to ⟨C, D⟩ is thus the only one which covers the 1 at the intersection of the row and the column corresponding to x and y, respectively. □
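For illustration, the mandatory concepts can be computed by closing singleton sets of objects and attributes (a Python sketch of ours):

```python
import numpy as np

def up(I, C):
    return frozenset(j for j in range(I.shape[1]) if all(I[i, j] for i in C))

def down(I, D):
    return frozenset(i for i in range(I.shape[0]) if all(I[i, j] for j in D))

def mandatory_concepts(I):
    """Concepts that are both object and attribute concepts.

    Object concept of x:    ({x}^up^down, {x}^up)
    Attribute concept of y: ({y}^down, {y}^down^up)
    Their intersection must occur in every exact factorization of I.
    """
    I = np.asarray(I)
    obj = {(down(I, up(I, {x})), up(I, {x})) for x in range(I.shape[0])}
    att = {(down(I, {y}), up(I, down(I, {y}))) for y in range(I.shape[1])}
    return obj & att
```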
In this section, we present a small illustrative example. Suppose we have records of a collection of patients. For every patient, our record contains the set of the patient's symptoms. For our purpose, we consider a set Y of 8 symptoms, denoted 1, ..., 8, i.e. Y = {1, ..., 8}. The symptoms are described in Table 1. Suppose our collection contains 12 patients. We denote the ith patient by i and put X = {1, ..., 12}. Our record describing patients and symptoms is given by the following 12 × 8 binary matrix I:
Table 1
Symptoms and their descriptions.

Table 2
Formal concepts of data given by patients and their symptoms.

c_i    ⟨A_i, B_i⟩
c_0    ⟨{}, {1, 2, 3, 4, 5, 6, 7, 8}⟩
c_1    ⟨{1, 5, 9, 11}, {1, 2, 3, 5}⟩
c_2    ⟨{2, 4, 12}, {1, 2, 6, 8}⟩
c_3    ⟨{3, 6, 7}, {2, 5, 7}⟩
c_4    ⟨{3, 6, 7, 8, 10}, {7}⟩
c_5    ⟨{1, 3, 5, 6, 7, 9, 11}, {2, 5}⟩
c_6    ⟨{1, 2, 4, 5, 9, 11, 12}, {1, 2}⟩
c_7    ⟨{1, 2, 3, 4, 5, 6, 7, 9, 11, 12}, {2}⟩
c_8    ⟨{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, {}⟩

Fig. 1. Hasse diagram of the concept lattice given by patients and their symptoms.
$$I = \begin{pmatrix}
1&1&1&0&1&0&0&0\\
1&1&0&0&0&1&0&1\\
0&1&0&0&1&0&1&0\\
1&1&0&0&0&1&0&1\\
1&1&1&0&1&0&0&0\\
0&1&0&0&1&0&1&0\\
0&1&0&0&1&0&1&0\\
0&0&0&0&0&0&1&0\\
1&1&1&0&1&0&0&0\\
0&0&0&0&0&0&1&0\\
1&1&1&0&1&0&0&0\\
1&1&0&0&0&1&0&1
\end{pmatrix}$$
That is, rows correspond to patients, columns correspond to symptoms, I_ij = 1 if patient i has symptom j, and I_ij = 0 if patient i does not have symptom j.

Our intention is to find a set F of factor concepts. That is, we want to find some F ⊆ B(X, Y, I) such that I = A_F ◦ B_F. Let us first look at the concept lattice B(X, Y, I). B(X, Y, I) contains 9 formal concepts. That is, B(X, Y, I) = {c_0, ..., c_8}, where each c_i is of the form c_i = ⟨A_i, B_i⟩ with A_i ⊆ X a set of patients and B_i ⊆ Y a set of symptoms such that A_i↑ = B_i and B_i↓ = A_i. That is, B_i is the set of all symptoms common to all patients from A_i, and A_i is the set of all patients sharing all symptoms from B_i. All formal concepts from B(X, Y, I) are listed in Table 2.

For instance, the extent A_3 of the formal concept c_3 consists of patients 3, 6, 7, and the intent B_3 of c_3 consists of attributes 2, 5, 7. The concept lattice B(X, Y, I) equipped with the partial order ≤ (the subconcept–superconcept hierarchy) can be visualized by its Hasse diagram. The Hasse diagram of B(X, Y, I) is shown in Fig. 1. We can see from the diagram that, e.g., c_3 ≤ c_4, i.e., formal concept c_4 is more general than formal concept c_3. This is because the extent A_3 of c_3 is contained in the extent A_4 of c_4, i.e. A_3 ⊆ A_4, meaning that each patient covered by c_3 is also covered by c_4. Equivalently,
the intent B_4 of c_4 is contained in the intent B_3 of c_3, i.e. B_4 ⊆ B_3, meaning that each attribute characteristic for c_4 is also a characteristic attribute of c_3. In a similar way, one can see that c_3 ≤ c_8, c_2 ≤ c_7, etc.

Table 3
Diseases and their symptoms (an excerpt).

Disease       Symptoms
chickenpox    rash
flu           headache, fever, painful limbs, cold
measles       fever, cold, rash
meningitis    headache, fever, stiff neck, vomiting
...

Table 4
Description of formal concepts of data given by patients and their symptoms.
Let us now look at the meaning of the formal concepts c_0, ..., c_8 from B(X, Y, I). One naturally expects that the concepts, which are given by groups of patients (concept extents) and groups of symptoms (concept intents), are related to diseases. This is indeed the case. Table 3 shows an excerpt from a family medical book. We can see that, e.g., formal concept c_1 represents flu because c_1 covers just headache, fever, painful limbs, and cold, which are exactly the characteristic attributes of flu. In the same way, c_2, c_3, and c_4 represent meningitis, measles, and chickenpox. However, B(X, Y, I) also contains formal concepts which can be interpreted as “suspicion of (a disease).” For instance, c_5 can be interpreted as “suspicion of flu or measles” because the intent B_5 of c_5 contains attributes 2 (fever) and 5 (cold), which belong to the characteristic attributes of both flu and measles. Note that B(X, Y, I) also contains an “empty concept” c_0 (this concept is the least concept in the conceptual hierarchy and applies to no patient) and the universal concept c_8 (which applies to all patients). The empty and the universal concepts are usually not interesting. Verbal descriptions of the formal concepts c_0, ..., c_8 are presented in Table 4. Note also that the verbal descriptions and the conceptual hierarchy are compatible. For instance, we have c_1 ≤ c_6, i.e. c_6 represents a weaker (more general) concept than c_1, which corresponds to their descriptions: c_1 is “flu,” c_6 is “suspicion of flu or meningitis.”
Consider now the problem of factorization of I. Our aim is to illustrate the above results. According to Theorem 1, we can factorize I by F = B(X, Y, I). However, since |F| = 9 > 8 = |Y|, this would mean ending up with a number of factors bigger than the number of the original attributes. Therefore, we are looking for some small set F ⊂ B(X, Y, I) of factor concepts. First, as is easily seen, the empty concept and the universal concept can always be disregarded. Furthermore, B(X, Y, I) contains the following object and attribute concepts:

O(X, Y, I) = {c_1, c_2, c_3, c_4},  A(X, Y, I) = {c_0, c_1, c_2, c_4, c_5, c_6, c_7}.

Thus,

O(X, Y, I) ∩ A(X, Y, I) = {c_1, c_2, c_4},

so c_1, c_2, and c_4 are mandatory factor concepts. One further concept is needed to cover the remaining 1s of I, and there are exactly two ways to choose it, yielding

F = {c_1, c_2, c_3, c_4} and F′ = {c_1, c_2, c_4, c_5}.
For F = {c_1, c_2, c_3, c_4}, we obtain

$$A_{\mathcal F} = \begin{pmatrix}
1&0&0&0\\
0&1&0&0\\
0&0&1&1\\
0&1&0&0\\
1&0&0&0\\
0&0&1&1\\
0&0&1&1\\
0&0&0&1\\
1&0&0&0\\
0&0&0&1\\
1&0&0&0\\
0&1&0&0
\end{pmatrix}, \qquad
B_{\mathcal F} = \begin{pmatrix}
1&1&1&0&1&0&0&0\\
1&1&0&0&0&1&0&1\\
0&1&0&0&1&0&1&0\\
0&0&0&0&0&0&1&0
\end{pmatrix},$$

and similarly for A_F′ and B_F′. From the point of view of dimension reduction, instead of in an 8-dimensional space of symptoms (as described by I), the patients are now described in a 4-dimensional space of disease-like concepts (as described by A_F).
Let now I = A ◦ B be a decomposition of I such that A = A_F and B = B_F for a set F ⊆ B(X, Y, I) of formal concepts. For every object i (1 ≤ i ≤ n), we can consider its representation in the m-dimensional Boolean space {0, 1}^m of attributes and its representation in the k-dimensional Boolean space {0, 1}^k of factors. In the space of attributes, the vector representing object i is the ith row I_i_ of I. In the space of factors, the vector representing i is the ith row A_i_ of A. We are going to consider transformations between the space {0, 1}^m of attributes and the space {0, 1}^k of factors.

Let ≤ denote the componentwise partial order on the set {0, 1}^p of binary vectors, i.e. ⟨V_1, ..., V_p⟩ ≤ ⟨W_1, ..., W_p⟩ iff V_1 ≤ W_1, ..., V_p ≤ W_p. Clearly, {0, 1}^p equipped with ≤ forms a Boolean lattice. We are going to consider the mappings g: {0, 1}^m → {0, 1}^k and h: {0, 1}^k → {0, 1}^m defined for P ∈ {0, 1}^m and Q ∈ {0, 1}^k by

$$g(P)_l = \bigwedge_{j=1}^{m} (B_{lj} \rightarrow P_j), \tag{2}$$

$$h(Q)_j = \bigvee_{l=1}^{k} (Q_l \cdot B_{lj}), \tag{3}$$

where → denotes the truth function of classical implication. Then

g(I_i_) = A_i_  and  h(A_i_) = I_i_.

That is, g maps the rows of I to the rows of A and, vice versa, h maps the rows of A to the rows of I.
Proof. h(A_i_) = I_i_ follows directly from I = A ◦ B. To see g(I_i_) = A_i_, note first that since A = A_F and B = B_F, the lth row B_{l_} of B coincides with the characteristic vector c(D_l) of the intent D_l of a formal concept ⟨C_l, D_l⟩ ∈ F, and the lth column A_{_l} of A coincides with the characteristic vector c(C_l). Therefore, using C_l = D_l↓, we get

$$g(I_{i\_})_l = \bigwedge_{j=1}^{m} \big(B_{lj} \rightarrow (I_{i\_})_j\big) = \bigwedge_{j=1}^{m} \big(c(D_l)_j \rightarrow I_{ij}\big) = c\big(D_l^{\downarrow}\big)_i = c(C_l)_i = A_{il},$$

which proves the claim. □

Note that it is essential for the previous result that the decomposition I = A ◦ B uses formal concepts as factors (in fact, what is essential is that the columns of A are extents of formal concepts). The following theorem shows the basic properties of g and h: for P, P_1, P_2 ∈ {0, 1}^m and Q, Q_1, Q_2 ∈ {0, 1}^k,

(4) if P_1 ≤ P_2 then g(P_1) ≤ g(P_2);
(5) if Q_1 ≤ Q_2 then h(Q_1) ≤ h(Q_2);
(6) h(g(P)) ≤ P;
(7) Q ≤ g(h(Q)).

Proof. By routine verification using (2) and (3). (Note that the properties can also be proved by observing that g and h form a so-called isotone Galois connection associated to matrix B, see [11].) □
Corollary 1. For P, P_s ∈ {0, 1}^m (s ∈ S) and Q, Q_t ∈ {0, 1}^k (t ∈ T):

$$g(P) = g\big(h(g(P))\big), \tag{8}$$

$$h(Q) = h\big(g(h(Q))\big), \tag{9}$$

$$g\Big(\bigwedge_{s \in S} P_s\Big) = \bigwedge_{s \in S} g(P_s), \tag{10}$$

$$h\Big(\bigvee_{t \in T} Q_t\Big) = \bigvee_{t \in T} h(Q_t). \tag{11}$$
Observe now that properties (4)–(7) can be regarded as natural requirements on mappings transforming vectors between the attribute and factor spaces: (4) says that the more attributes an object has, the more factors apply to it, while (5) says that the more factors apply, the more attributes an object has. This is in accordance with the factor model implied by the decomposition I = A ◦ B, which uses the Boolean matrix product. Namely, according to this model, an object has an attribute iff there is a factor which applies to the object and is associated to the attribute. Therefore, “more attributes” is positively correlated with “more factors” for an object. (6) corresponds to the idea that the common attributes associated to all the factors which apply to a given object need to be contained in the collection of all attributes possessed by that object. (7) also has a simple meaning, though its verbal description is somewhat cumbersome.
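For a concrete check of these transformations, here is a short Python sketch of ours (not from the paper) implementing (2) and (3) and verifying g(I_i_) = A_i_ and h(A_i_) = I_i_ for the three-factor maximal-rectangle decomposition of the running example:

```python
import numpy as np

def g(P, B):
    """g(P)_l = min over j of (B_lj -> P_j): factor l applies iff its intent lies within P."""
    P, B = np.asarray(P, dtype=bool), np.asarray(B, dtype=bool)
    return (~B | P[None, :]).all(axis=1).astype(int)  # implication B_lj -> P_j

def h(Q, B):
    """h(Q)_j = max over l of (Q_l * B_lj): attributes manifested by some applying factor."""
    Q, B = np.asarray(Q, dtype=bool), np.asarray(B, dtype=bool)
    return (Q[:, None] & B).any(axis=0).astype(int)

I = np.array([[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 1],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 0, 1]])
# Factors: <{1,2,3},{1,2}>, <{3},{1,2,3,4}>, <{2,4},{1,5}> (0-based in the arrays below).
A = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [0, 0, 1]])
B = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 0, 1]])
for i in range(I.shape[0]):
    assert (g(I[i], B) == A[i]).all() and (h(A[i], B) == I[i]).all()
```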
To get a further understanding of the transformations between the space of attributes and the space of factors, for P ∈ {0, 1}^m and Q ∈ {0, 1}^k denote by g⁻¹(Q) the set of all vectors mapped to Q by g, and by h⁻¹(P) the set of all vectors mapped to P by h, i.e.

g⁻¹(Q) = {P ∈ {0, 1}^m | g(P) = Q},
h⁻¹(P) = {Q ∈ {0, 1}^k | h(Q) = P}.
Then we have:
Theorem 6.

(1) g⁻¹(Q) is a convex partially ordered subspace of the attribute space, and h(Q) is the least element of g⁻¹(Q).
(2) h⁻¹(P) is a convex partially ordered subspace of the factor space, and g(P) is the largest element of h⁻¹(P).
Proof. (1) Let P be from g⁻¹(Q), i.e. g(P) = Q. Then, in particular, Q ≤ g(P). Using (5) and (6), h(Q) ≤ h(g(P)) ≤ P. Moreover, using (8) we get Q = g(P) = g(h(g(P))) = g(h(Q)), hence h(Q) ∈ g⁻¹(Q). Therefore, h(Q) is the least vector of g⁻¹(Q). Let now U, W ∈ g⁻¹(Q) and U ≤ V ≤ W. (4) yields Q = g(U) ≤ g(V) ≤ g(W) = Q, hence g(V) = Q, proving that g⁻¹(Q) is convex.

The proof of (2) is analogous. □
Theorem 6 describes the geometry behind g and h: the space {0, 1}^m of attributes and the space {0, 1}^k of factors are partitioned into an equal number of convex subsets. The subsets of the space of attributes have least elements; the subsets of the space of factors have greatest elements. Every element of one of the convex subsets of the space of attributes is mapped by g to the greatest element of the corresponding convex subset of the space of factors. Every element of one of the convex subsets of the space of factors is mapped by h to the least element of the corresponding convex subset of the space of attributes. Theorem 6 is illustrated in Fig. 2.
Theorem 7. The problem of finding a decomposition I = A ◦ B of an n × m binary matrix I into an n × k binary matrix A and a k × m binary matrix B with k as small as possible is NP-hard, and the corresponding decision problem is NP-complete.

Note that this result is also reported in [22,31], where the authors were apparently not familiar with [23,24]. Therefore, unless P = NP, there is no polynomial-time algorithm which solves the factorization problem.

Let us now turn to algorithms. We propose two approximation algorithms for the solution of the problem of decomposition of a binary matrix I. Due to Theorem 2, we look for a smallest set F of factor concepts of I, i.e. for a smallest set F ⊆ B(X, Y, I) of formal concepts of I for which I = A_F ◦ B_F.
Algorithm 1
INPUT: I (Boolean matrix)
OUTPUT: F (set of factor concepts)

set S to B(X, Y, I)
set U to {⟨i, j⟩ | I_ij = 1}
set F to ∅
for each ⟨C, D⟩ ∈ S:
    if ⟨C, D⟩ ∈ O(X, Y, I) ∩ A(X, Y, I):
        add ⟨C, D⟩ to F
        remove ⟨C, D⟩ from S
        for each ⟨i, j⟩ ∈ C × D:
            remove ⟨i, j⟩ from U
while U ≠ ∅:
    select ⟨C, D⟩ ∈ S that maximizes |(C × D) ∩ U|
    add ⟨C, D⟩ to F
    remove ⟨C, D⟩ from S
    for each ⟨i, j⟩ ∈ C × D:
        remove ⟨i, j⟩ from U
return F
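For illustration, the greedy covering loop of Algorithm 1 can be rendered in Python as follows (a sketch of ours; it expects the precomputed full set of formal concepts as (extent, intent) pairs of frozensets and omits the preselection of mandatory concepts):

```python
def algorithm1(I, concepts):
    """Greedy set cover over formal concepts: repeatedly pick the concept
    whose rectangle covers the most still-uncovered 1s of I."""
    n, m = len(I), len(I[0])
    U = {(i, j) for i in range(n) for j in range(m) if I[i][j]}
    S = set(concepts)  # candidate concepts (extent, intent)
    F = []
    while U:
        # Value of a concept = number of uncovered 1s inside its rectangle.
        C, D = max(S, key=lambda cd: len({(i, j) for i in cd[0] for j in cd[1]} & U))
        F.append((C, D))
        S.discard((C, D))
        U -= {(i, j) for i in C for j in D}
    return F
```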
Example 4. Algorithm 1 is based on selecting formal concepts which have the maximal overlap with U. To demonstrate Algorithm 1, consider the following binary matrix:

$$\begin{pmatrix}
1&0&1&0&1&1\\
0&0&1&0&0&0\\
1&1&0&1&1&1\\
0&0&1&0&0&1\\
0&1&1&1&0&1
\end{pmatrix}.$$
The matrix has 16 non-zero entries. We need to cover them with the smallest number of formal concepts (rectangles). In the first step, Algorithm 1 selects the concept ⟨{1, 4, 5}, {3, 6}⟩, whose “value,” given by the size of {1, 4, 5} × {3, 6}, is 6. The rectangle corresponding to ⟨{1, 4, 5}, {3, 6}⟩ is indicated by bold 1s in the following matrix:

$$\begin{pmatrix}
1&0&\mathbf{1}&0&1&\mathbf{1}\\
0&0&1&0&0&0\\
1&1&0&1&1&1\\
0&0&\mathbf{1}&0&0&\mathbf{1}\\
0&1&\mathbf{1}&1&0&\mathbf{1}
\end{pmatrix}.$$
In the next step, the algorithm removes the bold 1s, i.e. we get the new matrix

$$\begin{pmatrix}
1&0&0&0&1&0\\
0&0&1&0&0&0\\
1&1&0&1&1&1\\
0&0&0&0&0&0\\
0&1&0&1&0&0
\end{pmatrix}.$$
We now repeat the procedure with this new matrix. The next concept ⟨C, D⟩ which maximizes |(C × D) ∩ U| is ⟨{3, 5}, {2, 4, 6}⟩. The “value” of this concept is 5 because the overlap of the rectangle {3, 5} × {2, 4, 6} with the remaining 1s in the matrix contains five 1s. After removing the rectangle from the matrix, we have:

$$\begin{pmatrix}
1&0&0&0&1&0\\
0&0&1&0&0&0\\
1&0&0&0&1&0\\
0&0&0&0&0&0\\
0&0&0&0&0&0
\end{pmatrix}.$$
Then, the algorithm selects ⟨{1, 3}, {1, 5, 6}⟩ and removes four 1s from the matrix. Finally, it chooses ⟨{1, 2, 4, 5}, {3}⟩ and removes the remaining 1. The algorithm then stops and returns

F = {⟨{1, 4, 5}, {3, 6}⟩, ⟨{3, 5}, {2, 4, 6}⟩, ⟨{1, 3}, {1, 5, 6}⟩, ⟨{1, 2, 4, 5}, {3}⟩}.

This gives the following factorization A_F ◦ B_F = I:

$$\begin{pmatrix}
1&0&1&1\\
0&0&0&1\\
0&1&1&0\\
1&0&0&1\\
1&1&0&1
\end{pmatrix} \circ \begin{pmatrix}
0&0&1&0&0&1\\
0&1&0&1&0&1\\
1&0&0&0&1&1\\
0&0&1&0&0&0
\end{pmatrix} = \begin{pmatrix}
1&0&1&0&1&1\\
0&0&1&0&0&0\\
1&1&0&1&1&1\\
0&0&1&0&0&1\\
0&1&1&1&0&1
\end{pmatrix}.$$
Algorithm 2
INPUT: I (Boolean matrix)
OUTPUT: F (set of factor concepts)

set U to {⟨i, j⟩ | I_ij = 1}
set F to ∅
while U ≠ ∅:
    set D to ∅
    set V to 0
    while there is j ∉ D such that |D ⊕ j| > V:
        select j ∉ D that maximizes |D ⊕ j|
        set D to (D ∪ {j})↓↑
        set V to |(D↓ × D) ∩ U|
    set C to D↓
    add ⟨C, D⟩ to F
    for each ⟨i, j⟩ ∈ C × D:
        remove ⟨i, j⟩ from U
return F

Here D ⊕ j denotes the set ((D ∪ {j})↓ × (D ∪ {j})↓↑) ∩ U of so-far uncovered 1s covered by the formal concept generated by D ∪ {j}. Algorithm 2 thus grows the intent D of one factor concept attribute by attribute, always adding the attribute which makes the generated concept cover the largest number of uncovered 1s, and stops extending D when no attribute increases the cover.
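A Python sketch of Algorithm 2 (ours, under the reading of D ⊕ j given above; I is a list of 0/1 rows):

```python
def algorithm2(I):
    """Greedy on-demand factors: grow each intent attribute by attribute,
    without ever computing the full set of formal concepts."""
    n, m = len(I), len(I[0])
    down = lambda D: {i for i in range(n) if all(I[i][j] for j in D)}
    up = lambda C: {j for j in range(m) if all(I[i][j] for i in C)}
    U = {(i, j) for i in range(n) for j in range(m) if I[i][j]}
    F = []
    while U:
        D, V = set(), 0
        while True:
            def gain(j):  # |D ⊕ j|: uncovered 1s in the concept generated by D ∪ {j}
                C2 = down(D | {j})
                D2 = up(C2)
                return len({(i, jj) for i in C2 for jj in D2} & U)
            candidates = [j for j in range(m) if j not in D]
            if not candidates:
                break
            best = max(candidates, key=gain)
            if gain(best) <= V:
                break
            D = up(down(D | {best}))        # close D ∪ {best}
            V = len({(i, jj) for i in down(D) for jj in D} & U)
        C = down(D)
        F.append((frozenset(C), frozenset(D)))
        U -= {(i, j) for i in C for j in D}
    return F
```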
Example 5. In general, Algorithms 1 and 2 produce different results. For instance, if we take the binary matrix from Example 4, Algorithm 2 proceeds as follows. First, it selects 1 ∈ Y because |∅ ⊕ 1| = 6, which is the maximum value. Since {1}↓ = {1, 3} and {1}↓↑ = {1, 5, 6}, the first factor concept selected is ⟨{1, 3}, {1, 5, 6}⟩. No further attribute is added to {1, 5, 6} because it would not increase the number of 1s covered by this factor. After removing the corresponding rectangle, we obtain the matrix

$$\begin{pmatrix}
0&0&1&0&0&0\\
0&0&1&0&0&0\\
0&1&0&1&0&0\\
0&0&1&0&0&1\\
0&1&1&1&0&1
\end{pmatrix}.$$
In the next step, attribute 2 is selected and the induced concept ⟨{2}↓, {2}↓↑⟩ = ⟨{3, 5}, {2, 4, 6}⟩ is the second factor concept. The process continues with attributes 3 and 6, so that we finally obtain

F = {⟨{1, 3}, {1, 5, 6}⟩, ⟨{3, 5}, {2, 4, 6}⟩, ⟨{1, 2, 4, 5}, {3}⟩, ⟨{1, 3, 4, 5}, {6}⟩},

inducing the following factorization A_F ◦ B_F = I:

$$\begin{pmatrix}
1&0&1&1\\
0&0&1&0\\
1&1&0&1\\
0&0&1&1\\
0&1&1&1
\end{pmatrix} \circ \begin{pmatrix}
1&0&0&0&1&1\\
0&1&0&1&0&1\\
0&0&1&0&0&0\\
0&0&0&0&0&1
\end{pmatrix} = \begin{pmatrix}
1&0&1&0&1&1\\
0&0&1&0&0&0\\
1&1&0&1&1&1\\
0&0&1&0&0&1\\
0&1&1&1&0&1
\end{pmatrix}.$$
Table 5
Approximation of optimal factorizations.

Table 6
Factorization of randomly generated matrices.

To assess how close the algorithms come to optimal factorizations, we generated random binary matrices, computed the optimal number of factors using a brute-force algorithm, and compared this number with the number of factors found by Algorithm 1 and Algorithm 2. The results are shown in Table 5. Rows of Table 5 correspond to the optimal numbers of factors. The second column of Table 5 shows the average number of factors determined by Algorithm 1. The numbers are written in the form “average value ± standard deviation.” The last column presents results for Algorithm 2. We can see that the performance of the two algorithms is practically the same. The average numbers of factors are close to the optimal value and, in most cases, the optimum lies within (or close to) one standard deviation of the average value. Note that the total number of matrices from which we obtained the results in Table 5 was 80,000.
The next example demonstrates factorization of binary matrices of varying “sparseness”/“density.” The results of this experiment agree with the common intuition that “sparse” and “dense” matrices can be factorized with a small number of factors, whereas matrices with approximately equal numbers of 1s and 0s are not as amenable to factorization, cf. [8]. The values in Table 6 show results of factorization of 15 × 15 matrices of various densities; by density we mean the percentage of 1s appearing in a matrix. Again, the two algorithms are comparable in terms of the number of factor concepts they produce. This time, the rows contain the average numbers of factor concepts found in matrices of a given density. We can see from Table 6 that for randomly generated matrices of medium density, the number of computed factors is nearly the same as the number of the original attributes. On the other hand, sparse matrices can usually be factorized by a considerably smaller number of factors.
We now turn our attention to the performance of the algorithms. Algorithm 1 can deliver better factorizations than Algorithm 2 because it selects factor concepts from the system of all formal concepts associated to the binary matrix I. When the binary matrix is large, however, Algorithm 2 is much faster than Algorithm 1. For instance, we have tested the computing time on several data sets from the UCI Machine Learning Repository [1]. For the well-known MUSHROOM data set, which consists of 8124 objects and 119 attributes, there are 238,710 associated formal concepts. During its computation, Algorithm 1 first computes all the concepts, stores them, and then iterates over the 238,710 concepts multiple times. By contrast, Algorithm 2 just goes over the 119 attributes and does not precompute the concepts. Table 7 compares the efficiency of the two algorithms on the MUSHROOM data set. The algorithms were implemented in ANSI C and executed on an Intel Xeon 4 CPU 3.20 GHz machine with 1 GB RAM. Algorithm 2 outperforms Algorithm 1 both in terms of memory consumption and the time needed to find the factorization. Notice the significant speed-up, which is due to not computing, and iterating multiple times over, the set of all formal concepts associated to the input data.
Table 7
MUSHROOM factorization performance.

          Algorithm 1       Algorithm 2
Time      18 min, 5.66 s    12.47 s
Memory    97 MB RAM         2 MB RAM
In the case of large matrices I, it is particularly appealing to look for approximate decompositions of I. That is, we look for A and B such that I is approximately equal to A ◦ B. We know from the discussion above that for every set F ⊆ B(X, Y, I) of formal concepts, A_F ◦ B_F approximates I from below. That is, A_F ◦ B_F ≤ I, i.e. (A_F ◦ B_F)_ij ≤ I_ij for every row i and column j. Furthermore, by adding further concepts to F, we tighten the approximation: for F ⊆ F′ we have A_F ◦ B_F ≤ A_F′ ◦ B_F′ ≤ I. Intuitively, while exact factorization may require a large number of factors, a considerably smaller number of factors may account for a large portion of the data set. That is, while the requirement that I be equal to A_F ◦ B_F may imply the need for a large F, a considerably smaller F may be enough to meet the weaker requirement that I be approximately equal to A_F ◦ B_F. From the perspective of the results presented in this paper, solutions to the approximate factorization problem can be sought via a slight modification of Algorithm 1 and Algorithm 2. Recall that the algorithms finish the computation when each 1 in the input table is covered by at least one factor. We can modify the halting condition so that the algorithm stops either after a prescribed number of factors has been computed, or as soon as a prescribed portion of the 1s in I is covered.
In either case, we obtain a set F of factor concepts for which A_F ◦ B_F ≤ I. Due to this fact, the closeness of A_F ◦ B_F to I can be assessed as follows. For I and F ⊆ B(X, Y, I), define A(I, F) by

$$A(I, \mathcal F) = \frac{\mathrm{Area}(\mathcal F)}{\mathrm{Area}(\mathcal B(X, Y, I))},$$

where for G ⊆ B(X, Y, I) we put

$$\mathrm{Area}(\mathcal G) = \big|\{\, \langle i, j\rangle \mid (A_{\mathcal G} \circ B_{\mathcal G})_{ij} = 1 \,\}\big|.$$

Hence, Area(G) is the number of 1s in the matrix I covered by the set G of formal concepts. As a consequence, Area(B(X, Y, I)) is the number of 1s in the input matrix. We call A(I, F) the degree of approximation of I by F. Clearly, 100 · A(I, F) is the percentage of 1s in the input matrix I which are covered by the factors from F. Observe that A(I, F) ∈ [0, 1] and, in addition, A(I, F) = 1 iff I = A_F ◦ B_F, i.e. iff the factors completely explain the data. Note that looking for a set F of approximate factors, i.e. one with a sufficiently high A(I, F), is particularly appealing for larger matrices I. Experiments on approximate factorization are presented in Section 3.
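For illustration, the degree of approximation can be computed as follows (a sketch of ours; F is a non-empty list of concepts as (extent, intent) pairs of 0-based index sets):

```python
import numpy as np

def boolean_product(A, B):
    A, B = np.asarray(A, dtype=bool), np.asarray(B, dtype=bool)
    return (A[:, :, None] & B[None, :, :]).any(axis=1)

def approximation_degree(I, F):
    """A(I, F): fraction of the 1s of I covered by the concepts in F."""
    I = np.asarray(I, dtype=bool)
    n, m = I.shape
    A = np.array([[i in C for C, _ in F] for i in range(n)])
    B = np.array([[j in D for j in range(m)] for _, D in F])
    covered = boolean_product(A, B)
    assert not (covered & ~I).any()  # factor concepts approximate I from below
    return covered.sum() / I.sum()
```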
3. Experiments
In this section, we present several examples of factorization using Algorithm 1 and Algorithm 2.
Example 6. The first experiment concerns the analysis of factors which determine attributes of European Union countries. We used data from the Rank Order pages of the CIA World Factbook 2006.¹ The data describe socio-economic indicators such as “GDP per capita,” “energy production/consumption,” “population growth,” “military expenditures,” “industrial production,” etc. The indicators have numerical values; for instance, “GDP per capita” is measured in thousands of USD. Using binarization, we transformed the data to a binary matrix consisting of 27 rows (EU countries) and 235 columns (yes/no attributes). For the binarization, we proceeded as follows. For every socio-economic indicator y, we first computed its basic statistical characteristics (quartiles, sample mean, and standard deviation) over the corresponding sample given by the values for the 27 EU countries. Then, we introduced five Boolean (yes/no) attributes based on these characteristics: for v being a numerical value of indicator y, the five Boolean attributes have value 1 (or 0) depending on whether conditions stated in terms of these characteristics are (or are not) satisfied; for instance, attribute (d) expresses that v lies within two standard deviations of the sample mean.

¹ https://www.cia.gov/library/publications/download/.
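For illustration, a binarization of this kind can be sketched as follows; the five conditions below are our own illustrative assumptions, except that condition (d) follows the description in the text:

```python
import numpy as np

def binarize(values):
    """Turn one numerical indicator into five yes/no attributes per country.

    Conditions (a)-(c) and (e) are illustrative assumptions; (d) follows the
    text: value within two standard deviations of the sample mean.
    """
    v = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(v, [25, 50, 75])
    mean, std = v.mean(), v.std()
    return np.column_stack([
        v <= q1,                      # (a) in the bottom quartile (assumption)
        v <= q2,                      # (b) below the median (assumption)
        v >= q3,                      # (c) in the top quartile (assumption)
        np.abs(v - mean) <= 2 * std,  # (d) close to the average
        v >= mean + 2 * std,          # (e) exceptionally high (assumption)
    ]).astype(int)
```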
The resulting binary matrix serves as testing data for the above-described method of factor analysis. Let us also mention that replacing the original numerical values by the above-mentioned Boolean attributes may help users understand the data. Often, people are not interested in the exact numerical values of socio-economic indicators. Rather, they (intuitively) work with notions like “value is close to the average,” which is captured by the Boolean attribute described by (d). This is particularly true if one needs to draw conclusions regarding the performance of EU countries, policy, etc.
The total number of formal concepts present in the 27 × 235 Boolean matrix is 299,982, i.e. with X denoting the 27 EU countries, Y denoting the 235 Boolean attributes, and I denoting the corresponding Boolean matrix, |B(X, Y, I)| = 299,982. According to Theorem 2, B(X, Y, I) can be considered a space of optimal potential factors. From the formal concepts of B(X, Y, I), Algorithm 1 computes a set F ⊆ B(X, Y, I) of factor concepts which explain the data in I, i.e. a set F for which I = A_F ◦ B_F. Running Algorithm 1, we obtained a set F of 53 factor concepts, i.e. there exists F with

|F| = 53

for which I = A_F ◦ B_F. This means that the 27 × 235 matrix I can be decomposed into a Boolean product of a 27 × 53 binary matrix A_F, representing a relationship between EU countries and the factors, and a 53 × 235 binary matrix B_F, representing a relationship between the factors and the original attributes. Note that the relationship between EU countries and the original attributes (described by I) can be retrieved from the relationship between EU countries and factors (described by A_F) and the relationship between factors and the original attributes (described by B_F) according to: country x has attribute y iff there is a factor f such that f applies to x and y is one of the particular manifestations of f. The transformations between the 235-dimensional space of attributes and the 53-dimensional space of factors are accomplished by the mappings g and h defined by (2) and (3).
Due to space limitations, we cannot list all the attributes and factors. Note, however, that the factors have a natural interpretation, providing us with insight into the data. For instance, the largest factor applies to all the EU countries except France, Germany, Italy, Netherlands, Spain, and the UK. The manifestations of this factor in terms of the original attributes are attribute (d) for “GDP per capita” and attribute (d) for “oil import in bbl/day.” Therefore, the verbal description of this factor (“GDP within two standard deviations of the mean and oil import within two standard deviations of the mean”) has a clear meaning. Note that this factor does not apply to France, Germany, Italy, and the UK due to their high GDP per capita, and does not apply to Netherlands and Spain due to their high oil imports.
Example 7. Intuitively, if we factor-analyze a related subgroup of the EU countries from Example 6, we expect their 235 attributes to be reducible to a still smaller number of factors. For instance, if we focus on the ten countries which joined the EU in 2004, there is a set F of 19 factors, i.e. for the 10 × 235 Boolean matrix I we have I = A_F ◦ B_F where A_F is a 10 × 19 Boolean matrix and B_F is a 19 × 235 Boolean matrix. Therefore, the 235 Boolean attributes are explained using just 19 factors (a 1/12 reduction). The larger reduction in this case can be seen as consistent with the common knowledge that the new EU countries have similar socio-economic characteristics.
Example 8. This example further illustrates the comparison of Algorithm 1 and Algorithm 2 on the MUSHROOM data set. Both algorithms compute the same number of factor concepts, but the sets of factor concepts are different. Nevertheless, the corresponding factors from the two collections cover approximately the same number of 1s in the matrix. Namely, if we form a sequence ⟨x_1, y_1⟩, ..., ⟨x_k, y_k⟩, where x_i and y_i are the numbers of 1s covered by the ith factor produced by Algorithm 1 and Algorithm 2, respectively, and plot the points ⟨x_i, y_i⟩ in a two-dimensional plot (a q–q plot), we can see that the points gather close to the diagonal. Fig. 3 shows this situation (the plot uses axes with a logarithmic scale).
Example 9. The last example illustrates approximate factorization, cf. Section 2.6, of the MUSHROOM data set. Recall that the MUSHROOM data set contains 8124 objects and 119 attributes and that the corresponding binary matrix I contains 238,710 formal concepts, i.e. |B(X, Y, I)| = 238,710. Experiments have shown that most of the information contained in the data set can be represented by a relatively small number of factor concepts, i.e. most 1s in I are covered by a relatively small number of concepts from B(X, Y, I). The results can be seen from the graphs shown in Fig. 4. The left and the right parts describe the results for Algorithm 1 and Algorithm 2, respectively. Notice the similarity of the graphs (cf. the previous example). The graphs show the relationship between the number of factor concepts and the degree of approximation of the original data set. We can see that even a relatively small number of factor concepts achieves a high degree of approximation. For instance, if we take the first 6 factor concepts returned by Algorithm 1, we get A(I, F) · 100% = 51.89%. This can be interpreted as saying that more than half of the data contained in the MUSHROOM data set can be explained by six factors. The growth of the degree of approximation is rapid for the first 10 factor concepts. For instance, the first 2, 3, and 4 factors cover 26.03%, 34.35%, and 41.29% of the incidence data, respectively. If we use 60 factors, 95% of the data is covered.
Fig. 4. Relationship between the number of factors and the approximation degree.

Note that the need for approximate factorization of binary data is quite natural. In the case of larger data sets, users are usually interested in the major dependencies and structures hidden in the data. The approach to approximate Boolean factor analysis can help achieve this goal: a user can specify an approximation degree a% as an additional constraint, meaning that they are interested in factors which explain at least a% of the data. Investigation of further variations of the problem of approximate factorization and development of the corresponding methods is necessary.
We presented a novel approach to factorization of binary data. The approach is based on a theorem presented in this paper which describes a space of optimal factors for decompositions of binary matrices. We presented greedy approximation algorithms for finding a smallest set of factors in the space of optimal factors. The first is based on an approximation algorithm for the set covering optimization problem; the second is a modification of the first which avoids the need to compute the set of all formal concepts. We presented examples demonstrating applications of Boolean factor analysis and the performance of the algorithms. The following is a list of problems left for future research:
– Criteria for good factors: investigation of further criteria for the discovery of factors, including psychologically motivated criteria. It might be desirable to look for F such that the numbers of attributes in the formal concepts' intents are restricted in a suitable way. For instance, one might require that all formal concepts from F have approximately the same number of attributes, i.e. that the level of generality of all factors be approximately the same. Another criterion might be a suitably defined independence of factors.
– Further heuristics for improving Algorithm 1 and Algorithm 2. Such heuristics should be drawn from further theoretical insight.
– Approximate factorization: can we combine the approximation “from below” provided by factor concepts with approximation “from above”? The goal is to stop looking for additional factor concepts once new factor concepts no longer contribute much to covering the yet-uncovered part of I, and then to approximate I from above with a small number of rectangles. Such rectangles are not formal concepts, and their inclusion leads to the inclusion of 1s at positions where there are 0s in I.
– Extension to graded incidence data: Work is in progress on extending the methods from binary matrices to matrices
containing more general entries, such as numbers from the unit interval [0, 1], instead of just 0 and 1, expressing
degrees to which attributes apply to objects. This is possible using an extension of formal concept analysis to the
setting of fuzzy logic, see e.g. [3,4].
– Relationships between BFA using formal concepts and approaches via associative networks, see [2,25] for hints.
– In general, applications of the dimensionality reduction aspect of decompositions of binary matrices in the fields of
knowledge discovery and machine learning.
References

[1] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
[2] R. Belohlavek, Representation of concept lattices by bidirectional associative memories, Neural Comput. 12 (10) (2000) 2279–2290.
[3] R. Belohlavek, Concept lattices and order in fuzzy logic, Ann. Pure Appl. Logic 128 (1–3) (2004) 277–298.
[4] R. Belohlavek, J. Dvorak, J. Outrata, Fast factorization by similarity in formal concept analysis of data with fuzzy attributes, J. Comput. System Sci. 73 (6) (2007) 1012–1022.
[5] C. Carpineto, G. Romano, Concept Data Analysis. Theory and Applications, J. Wiley, 2004.
[6] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, 2nd ed., MIT Press, 2001.
[7] J.H. Correia, G. Stumme, R. Wille, U. Wille, Conceptual knowledge discovery—A human-centered approach, Appl. Artificial Intelligence 17 (3) (2003) 281–302.
[8] A.A. Frolov, D. Húsek, I.P. Muraviev, P.A. Polyakov, Boolean factor analysis by Hopfield-like autoassociative memory, IEEE Trans. Neural Netw. 18 (3) (2007) 698–707.
[9] B. Ganter, R. Wille, Formal Concept Analysis. Mathematical Foundations, Springer, Berlin, 1999.
[10] F. Geerts, B. Goethals, T. Mielikäinen, Tiling databases, in: Proc. DS 2004, in: Lecture Notes in Comput. Sci., vol. 3245, 2004, pp. 278–289.
[11] G. Georgescu, A. Popescu, Non-dual fuzzy connections, Arch. Math. Logic 43 (2004) 1009–1039.
[12] G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd ed., The Johns Hopkins University Press, 1996.
[13] H.H. Harman, Modern Factor Analysis, 2nd ed., The Univ. of Chicago Press, Chicago, 1970.
[14] A. Keprt, V. Snášel, Binary factor analysis with help of formal concepts, in: Proc. CLA 2004, Ostrava, Czech Republic, 2004, pp. 90–101.
[15] K.H. Kim, Boolean Matrix Theory and Applications, M. Dekker, 1982.
[16] W. Kneale, M. Kneale, The Development of Logic, Clarendon Press, Oxford, 1984.
[17] S. Kuznetsov, S. Obiedkov, Comparing performance of algorithms for generating concept lattices, J. Exp. Theor. Artificial Intelligence 14 (2–3) (2002) 189–216.
[18] D. Lee, H. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
[19] J.D. Leeuw, Principal component analysis of binary data. Application to roll-call analysis, available at http://gifi.stat.ucla.edu, 2003.
[20] M. Maddouri, Towards a machine learning approach based on incremental concept formation, Intelligent Data Analysis 6 (2002) 1–15.
[21] R.P. McDonald, Factor Analysis and Related Methods, Lawrence Erlbaum Associates, 1985.
[22] P. Miettinen, T. Mielikäinen, A. Gionis, G. Das, H. Mannila, The discrete basis problem, in: Proc. PKDD 2006, in: Lecture Notes in Artificial Intelligence, vol. 4213, 2006, pp. 335–346.
[23] D.S. Nau, Specificity covering: Immunological and other applications, computational complexity and other mathematical properties, and a computer program, A.M. Thesis, Technical Report CS-1976-7, Computer Sci. Dept., Duke Univ., Durham, NC, 1976.
[24] D.S. Nau, G. Markowsky, M.A. Woodbury, D.B. Amos, A mathematical analysis of human leukocyte antigen serology, Math. Biosci. 40 (1978) 243–270.
[25] R.K. Rajapakse, M. Denham, Fast access to concepts in concept lattices via bidirectional associative memories, Neural Comput. 17 (10) (2005) 2291–2300.
[26] Sajama, A. Orlitsky, Semi-parametric exponential family PCA, in: L.K. Saul, et al. (Eds.), Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, 2005, http://books.nips.cc/papers/files/nips17/NIPS2004_0152.pdf.
[27] A. Schein, L. Saul, L. Ungar, A generalized linear model for principal component analysis of binary data, in: Proc. Int. Workshop on Artificial Intelligence and Statistics, 2003, pp. 14–21.
[28] L.J. Stockmeyer, The set basis problem is NP-complete, IBM Research Report RC5431, Yorktown Heights, NY, 1975.
[29] F. Tang, H. Tao, Binary principal component analysis, in: Proc. British Machine Vision Conference 2006, 2006, pp. 377–386.
[30] N. Tatti, T. Mielikäinen, A. Gionis, H. Mannila, What is the dimension of your binary data?, in: Proc. IEEE Int. Conference on Data Mining, ICDM 2006, IEEE Computer Society, 2006, pp. 603–612.
[31] J. Vaidya, V. Atluri, Q. Guo, The role mining problem: Finding a minimal descriptive set of roles, in: ACM Symposium on Access Control Models and Technologies, 2007, pp. 175–184.
[32] R. Wille, Restructuring lattice theory: An approach based on hierarchies of concepts, in: I. Rival (Ed.), Ordered Sets, Reidel, Dordrecht/Boston, 1982, pp. 445–470.
[33] Z. Zivkovic, J. Verbeek, Transformation invariant component analysis for binary images, in: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006, vol. 1, pp. 254–259.