Inference of Structures of Models of Probabilistic Dependences From Statistical Data
A. S. Balabanov
UDC 007:681.3.00
Problems of reconstruction of the structures of probabilistic dependence models in the class of directed (oriented) acyclic graphs (DAGs) and of mono-flow graphs are considered. (Mono-flow graphs form a subclass of DAGs in which cycles with one collider are prohibited.) The technique of induced (provoked) dependences is investigated, and its application to the identification of model structures is shown. The algorithm Collifinder-M is developed, which identifies all collider variables (i.e., solves an intermediate problem of reconstructing the structure of a mono-flow model). It is shown that a generalization of the technique of induced dependences makes it possible to strengthen well-known rules for identifying the orientation of edges in a DAG model.

Keywords: identification of the structure of a model, DAG model of probabilistic dependences, mono-flow model, conditional independence, collider, d-separation, induced (provoked) dependence.

1. PROBABILISTIC DEPENDENCE MODELS BASED ON ACYCLIC DIRECTED GRAPHS

Probabilistic models of systems of dependences based on graphs are topical subjects of modern investigations at the interfaces between multidimensional statistical analysis, graph theory, information theory, and artificial intelligence. Probabilistic graphical models of dependences play the role of a strict language for the description of uncertainties and for knowledge representation, in particular, in new-generation expert systems. They are an efficient apparatus for the solution of various analytical problems, and the number of their applications in various spheres is quickly increasing. Models based on directed acyclic graphs (DAGs), i.e., DAG models, have come to the fore. The advantages of DAG models are their obviousness, their ability to represent cause-and-effect relations, their compact representation (as to the number of parameters) of systems of dependences, and the computational efficiency of probabilistic inference from evidence [18].
These properties provide an efficient solution of problems of medical and technical diagnosis, speech recognition, prediction of the consequences of decisions made by humans, of the actions of robots and program agents, etc. An advantage of probabilistic graphical models of dependences is also that they can be identified inductively on the basis of statistical data (relying on several conceptual methodological postulates). Therefore, DAG models play an important role in the methodology of knowledge discovery in databases [4, 5, 8-11]. The following two kinds of DAG models have received the widest acceptance: (1) models with nominal variables, i.e., Bayesian networks; (2) linear models with continuous variables, i.e., Gaussian networks. We also call the former models discrete DAG models and the latter ones models of recursive systems of linear structural equations [3-5, 7, 12]. If a digraph G contains an edge x → y, then we call the node x a parent of the node y and the node y a child of the node x. A DAG model is defined as a pair (G, θ), where G is a directed acyclic graph (with a node corresponding to each variable of the model) and θ is a collection of locally specified parameters. In Gaussian networks, the parameters are the coefficients of regression equations. In discrete models, the parameters θ are the conditional probability distributions of variable values.
Institute of Program Systems, National Academy of Sciences of Ukraine, Kiev, Ukraine, [email protected]. Translated from Kibernetika i Sistemnyi Analiz, No. 6, pp. 19-31, November-December 2005. Original article submitted August 12, 2003. 1060-0396/05/4106-0808
Fig. 1. Hierarchy of subclasses of DAG models.
Fig. 2. Example of an MFDG.

In view of this one-to-one correspondence, the terms "variable" and "node of a graph" are used interchangeably in the literature on the subject. The joint probability distribution of all the variables of a Bayesian network is described in the form

p(x1, ..., xn) = ∏_i p(xi | F(xi)), (1)
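Factorization (1) can be illustrated with a small numeric check. The three-node network and all probability tables below are our own hypothetical example, not taken from the article:

```python
# Sketch: evaluating the factorization p(x1,...,xn) = prod_i p(xi | F(xi))
# for a hypothetical three-node Bayesian network x1 -> x3 <- x2.
parents = {"x1": [], "x2": [], "x3": ["x1", "x2"]}

# Conditional probability tables: p(node = 1 | parent values); illustrative.
cpt = {
    "x1": {(): 0.3},
    "x2": {(): 0.6},
    "x3": {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.95},
}

def joint(assign):
    """p(assignment) as the product of local conditionals, per Eq. (1)."""
    p = 1.0
    for node, pa in parents.items():
        key = tuple(assign[q] for q in pa)
        p1 = cpt[node][key]
        p *= p1 if assign[node] == 1 else 1.0 - p1
    return p

# Sanity check: the joint distribution must sum to 1 over all assignments.
total = sum(
    joint({"x1": a, "x2": b, "x3": c})
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
)
print(round(total, 10))  # 1.0
```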
where F(xi) is the set of parents of a variable xi.

Definition 1. A collider in a digraph is a fragment consisting of two adjacent edges of the form x → y ← z. If the collider x → y ← z is part of a path p in a digraph, then y is called a collider variable on the path p. (Note that the variable y can simultaneously be colliderless on some other path.) A colliderless path, or a chain, in a digraph is a path that does not contain any collider. We call a variable (node) of a model a collider variable if it is a collider variable on some path in the corresponding digraph.

Such subclasses of DAG models as mono-flow models, polyforests, and forests are well known. The hierarchy of these models is shown in Fig. 1. Subclasses of DAG models are specified by topological constraints on the structure of the graph of a model. A DAG is a digraph without directed cycles. A mono-flow graph is a digraph in which each adjacency cycle has two or more colliders (Fig. 2). A polyforest is a digraph without adjacency cycles. A forest is a digraph without any cycles or colliders. Thus, the hierarchy of subclasses of DAG models is specified with the help of successively strengthening constraints on the structure of digraphs, namely, (1) absence of cycles with 0 colliders; (2) absence of cycles with 0 or 1 colliders; (3) absence of any cycles; (4) absence of cycles and colliders.

In this article, general problems of identification of DAG models and also distinctive features of mono-flow models are considered. Universal methods of inference of DAG models require a sophisticated exhaustive search. The well-known specialized algorithm [13] for identifying the structures of mono-flow models is practically inefficient, since it uses statistical tests of high order, which are unreliable when only a small amount of data is accessible. We propose a more efficient approach to the identification of the dependence structure.
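The notions just defined are easy to operationalize. The sketch below is our own illustration (the toy graph and all function names are ours, not the article's): it finds the collider variables of a small digraph and checks the mono-flow condition by brute-force enumeration of adjacency cycles.

```python
from itertools import combinations, permutations

# Hypothetical DAG given as a set of directed edges (our toy example).
edges = {("a", "c"), ("b", "c"), ("b", "d"), ("e", "d")}
nodes = {v for e in edges for v in e}

def collider_variables(edges):
    """A node y is a collider variable iff it has two or more parents:
    any two parents x, z give the pattern x -> y <- z on some path."""
    indeg = {}
    for _, y in edges:
        indeg[y] = indeg.get(y, 0) + 1
    return {y for y, d in indeg.items() if d >= 2}

def colliders_on_cycle(cycle, edges):
    """Count collider nodes along an undirected adjacency cycle."""
    n = len(cycle)
    return sum(
        1
        for i in range(n)
        if (cycle[i - 1], cycle[i]) in edges
        and (cycle[(i + 1) % n], cycle[i]) in edges
    )

def is_mono_flow(edges, nodes):
    """Brute force: every adjacency cycle must carry at least two colliders.
    Exponential; only meant for tiny illustrative graphs."""
    adjacent = lambda u, v: (u, v) in edges or (v, u) in edges
    for k in range(3, len(nodes) + 1):
        for subset in combinations(sorted(nodes), k):
            first, rest = subset[0], subset[1:]
            for perm in permutations(rest):
                cyc = (first,) + perm
                if all(adjacent(cyc[i - 1], cyc[i]) for i in range(k)):
                    if colliders_on_cycle(cyc, edges) < 2:
                        return False
    return True

print(sorted(collider_variables(edges)))  # ['c', 'd']
print(is_mono_flow(edges, nodes))         # True: this skeleton has no cycles
```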
2. PROBLEMS OF INDUCTIVE IDENTIFICATION OF STRUCTURES OF DAG MODELS

In actual (socioeconomic, epidemiologic, technical, biological, etc.) investigations, the necessary model is, as a rule, a priori unknown and, at the same time, constructing the model on the basis of expert opinions and subjective conceptions is unacceptable. The identification of a model by objective methods is a topical problem. If the object (system) to be modeled actually exists, then one can identify the sought-for model by statistical methods on the basis of measurement data. In this case, we are frequently forced to use the results of passive observations of the object, since active experiments are inaccessible or are practically, economically, or morally unjustified in the corresponding object domain. Problems of inductive inference (reconstruction) of DAG models from data of passive observations remain the subject of intensive investigations [4, 5, 8, 12, 14, 15].

The majority of inductive methods of identification of structures of DAG models can be considered within the frameworks of two basic approaches, namely, the separation approach and the optimization (or approximation) approach [4, 5, 11, 12, 14, 15]. The optimization (in particular, Bayesian [11, 16]) approach consists of maximizing a criterion of model quality (for example, the a posteriori Bayesian probability) during the selection of a set of parents for each variable. This approach encounters the problem of searching the multidimensional space of possible model structures, which contains many local maxima of the criterion. In the separation approach, statements of conditional independence are elicited (with the help of statistical testing of hypotheses), and the structure of the sought-for model is directly derived from them [4, 7, 12-15]. The advantages of this approach are its speed and obviousness. In this case, the structure of the sought-for model is identified at the first stage and its parameters are determined at the second stage. Well-known methods of this approach are the algorithms PC, SGS, IC, and TPDA [4, 7, 12-14]. It has been found that polyforests and forests (trees) are identified by algorithms of quadratic complexity [11, 17, 18], whereas the problem of identification of a DAG model by optimization methods is NP-hard [19] in the general case. In addition to combinatorial complexity, the problem of reliability exists for model identification methods.

Let us recall the basic concepts that are important for the separation approach. When the values of a collection of variables S are fixed, we express the well-known (from probability theory and statistics [20]) relation of conditional independence of variables x and y by the formula Pr(x ⊥ S ⊥ y), where x, y ∉ S. For discrete variables, the independence Pr(x ⊥ S ⊥ y) means that

p(y | S, x) = p(y | S). (2)

(For frequency estimation of probabilities, the corresponding equality is true in the asymptotic sense.) The set of variables S in this expression is called a separator for the pair of variables x and y.
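In practice, a test of conditional independence of the form (2) for discrete data is commonly realized as a G-test, i.e., a scaled empirical conditional mutual information compared against a chi-square threshold. A self-contained sketch under this assumption (the data-generating chain x → s → y and all names are our own toy example, not the article's code):

```python
import math
import random
from collections import Counter

def g_statistic(samples, x, y, S):
    """G = 2 * sum over cells n(x,y,s) * ln(n(x,y,s)*n(s) / (n(x,s)*n(y,s))),
    i.e., 2N times the empirical conditional mutual information Inf(x, y | S).
    samples: list of dicts mapping variable name -> value."""
    nxys = Counter((r[x], r[y], tuple(r[v] for v in S)) for r in samples)
    nxs = Counter((r[x], tuple(r[v] for v in S)) for r in samples)
    nys = Counter((r[y], tuple(r[v] for v in S)) for r in samples)
    ns = Counter(tuple(r[v] for v in S) for r in samples)
    return sum(
        2.0 * n * math.log(n * ns[sv] / (nxs[(xv, sv)] * nys[(yv, sv)]))
        for (xv, yv, sv), n in nxys.items()
    )

# Toy chain x -> s -> y: x and y are dependent, but independent given s.
random.seed(1)
data = []
for _ in range(5000):
    xv = random.random() < 0.5
    sv = random.random() < (0.8 if xv else 0.2)
    yv = random.random() < (0.9 if sv else 0.1)
    data.append({"x": int(xv), "s": int(sv), "y": int(yv)})

print(g_statistic(data, "x", "y", []))     # large: x and y are associated
print(g_statistic(data, "x", "y", ["s"]))  # small: near its chi-square df
```

In an actual test the statistic would be compared with a chi-square quantile for the appropriate number of degrees of freedom.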
In Gaussian networks, conditional independence is indicated by a zero (negligible) value of the coefficient of partial correlation. The unconditional independence of variables is simply the special case of conditional independence in which the condition is empty, i.e., Pr(x ⊥ ∅ ⊥ y), or, as an equivalent expression in the form of the measure of mutual information, Inf(x, y) = 0. Accordingly, we express the fact of unconditional dependence in the form ¬Pr(x ⊥ ∅ ⊥ y). The following equalities strictly follow from equality (2):

Inf(y, (x, S)) = Inf(y, S) and Inf(x, (y, S)) = Inf(x, S). (3)
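For linear models, the partial correlation just mentioned can be estimated directly from a sample. The simulation below is our own construction: it checks the estimate on a linear collider y = x + z + e with unit variances, for which the exact partial correlation of x and z given y is -0.5 (this value is derived later in the article).

```python
import math
import random

def corr(a, b):
    """Sample Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

def partial_corr(rxz, rxy, ryz):
    """r_{xz.y} = (r_xz - r_xy r_yz) / sqrt((1 - r_xy^2)(1 - r_yz^2))."""
    return (rxz - rxy * ryz) / math.sqrt((1 - rxy ** 2) * (1 - ryz ** 2))

random.seed(7)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]  # D1 = Var(x) = Var(z) = 1
z = [random.gauss(0, 1) for _ in range(n)]
e = [random.gauss(0, 1) for _ in range(n)]  # D2 = Var(e) = 1
y = [xi + zi + ei for xi, zi, ei in zip(x, z, e)]

rxz, rxy, ryz = corr(x, z), corr(x, y), corr(z, y)
print(round(rxz, 2))                          # near 0: x and z independent
print(round(partial_corr(rxz, rxy, ryz), 2))  # near -D1 / (D1 + D2) = -0.5
```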
Thus, the facts of conditional independence adequately reflect the structure of statistical relations between the variables of a system. The relation between the structure of a model and the facts of conditional independence represented in a DAG model is formalized with the help of the criterion of d-separation [1, 3, 4, 7, 14]. Before its definition, we recall several elementary concepts. An (acyclic) path in a digraph is a sequence of adjacent edges (edges of any orientation) without repetition of nodes. A strictly oriented path (orpath) is a path on which all the edges are oriented in the direction of the same end of the path. A node x is called an ancestor of a node y in a digraph if there is an orpath from x to y (in particular, the zero-length path). The ancestor-descendant relation is the transitive closure of the parent-child relation, supplemented with reflexivity. We denote the set of ancestors of a variable x by Anc(x).

Definition 2 [3, 7]. A path p in a DAG model is called d-closed (d-blocked) by a set of nodes S if and only if (1) there exists a node x, x ∈ S, x ∈ p, such that at least one edge of p incident to x is directed out of x (an edge of the form x → or ← x), i.e., x is not a collider variable on p; or (2) there is at least one collider y on the path p such that y ∉ S and the node y is not an ancestor of any node in S, i.e., ∀z ∈ S : y ∉ Anc(z).

The set of nodes S d-separates the nodes x and y if and only if all the paths between x and y are d-closed by the set of nodes S. We denote such a d-separation by the predicate Ds(x ⊥ S ⊥ y). If at least one path between x and y is not d-closed, then we say that the nodes x and y are d-connected. The well-known theorem from [6] (we call it the theorem of DAG-semantics) states that a positive answer to the criterion of d-separation implies the truth of the corresponding statement of conditional independence, i.e., we have

Ds(x ⊥ S ⊥ y) ⇒ Pr(x ⊥ S ⊥ y). (4)
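One standard way to implement the d-separation criterion of Definition 2 is a reachability search over (node, direction) states, the so-called "Bayes-ball" idea. The sketch below follows that approach under our own naming, with an illustrative graph; it is not the article's code.

```python
from collections import deque

def ancestors(node_set, parents):
    """node_set together with all its ancestors."""
    out, stack = set(node_set), list(node_set)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

def d_separated(x, y, S, edges):
    """True iff S d-separates x and y (x, y not in S) in the DAG `edges`."""
    parents, children = {}, {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)
    anc_S = ancestors(S, parents)  # a collider is active iff it lies in here
    # State (v, "down"): the trail entered v through an edge pointing into v;
    # (v, "up"): through an edge pointing out of v (or v is the start node).
    visited, queue = set(), deque([(x, "up")])
    while queue:
        v, d = queue.popleft()
        if (v, d) in visited:
            continue
        visited.add((v, d))
        if v == y:
            return False  # reached y along an active trail
        if d == "up" and v not in S:
            for p in parents.get(v, ()):
                queue.append((p, "up"))
            for c in children.get(v, ()):
                queue.append((c, "down"))
        elif d == "down":
            if v not in S:  # non-collider continuation through v
                for c in children.get(v, ()):
                    queue.append((c, "down"))
            if v in anc_S:  # collider at v is opened by S
                for p in parents.get(v, ()):
                    queue.append((p, "up"))
    return True

edges = {("a", "c"), ("b", "c"), ("c", "d")}  # a -> c <- b, c -> d
print(d_separated("a", "b", set(), edges))    # True: the collider c blocks
print(d_separated("a", "b", {"c"}, edges))    # False: conditioning opens c
print(d_separated("a", "b", {"d"}, edges))    # False: a descendant of c is in S
```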
The theorem of DAG-semantics supports the procedure of reading statements of conditional independence from the graph of a DAG model. We express the result of the statistical test of x and y for conditional independence on the basis of available data D under conditioning on fixed values of S by the predicate T(x ⊥ S ⊥ y) | D. As is commonly accepted, the fact of conditional independence implies a positive answer of a correctly performed statistical test for independence on the sample of data D, i.e., we have

Pr(x ⊥ S ⊥ y) ⇒ T(x ⊥ S ⊥ y) | D. (5)

We call this thesis the pragmatic postulate of testing. The theorem of DAG-semantics and the pragmatic postulate of testing together predetermine the results of the corresponding tests for independence by the rule

Ds(x ⊥ S ⊥ y) ⇒ T(x ⊥ S ⊥ y) | D. (6)
(Henceforth, for brevity, we omit the symbol D in such expressions.) The process of identifying a model from statistical data must be based on the results of testing for independence, i.e., it requires rules of the inverse form. To this end, basic assumptions are necessary. As the methodological basis of methods of identification of DAG models, theorists use the assumption of the faithfulness of the probability distribution of a DAG model [4, 7, 12, 14], which can be expressed in the form

Pr(x ⊥ S ⊥ y) ⇒ Ds(x ⊥ S ⊥ y), (7)
i.e., as the implication inverse to statement (4). The assumption of the faithfulness of the distribution provides the structural and behavioral isomorphism of the model. In particular, the unconditional form of the faithfulness assumption is expressed by the rule

Pr(x ⊥ ∅ ⊥ y) ⇒ Ds(x ⊥ ∅ ⊥ y) (7a)
and claims that the absence of association between variables means the absence of a chain (a colliderless path) between these variables in the digraph of the corresponding model. However, the rules that fully use the faithfulness assumption are reliable only in the asymptotic sense, i.e., when the size of a sample of data is very large. In many practical situations, samples of such a size are unavailable and, hence, the well-known methods and algorithms of the separation approach use more cautious inference rules, in particular, the following rule for identifying the absence of an edge:

(∃S : T(x ⊥ S ⊥ y)) ⇒ ¬(x − y). (8)
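Rule (8) underlies the skeleton-recovery phase of algorithms such as PC: an edge is deleted as soon as any separator is found, with small separators tried first. A minimal sketch of this idea (our own construction; the hand-coded independence oracle stands in for the statistical test T):

```python
from itertools import combinations

def skeleton(variables, indep, max_order=2):
    """Delete the edge x - y as soon as some separator S with T(x | S | y)
    is found, trying small |S| first (rule (8), PC-style ordering)."""
    adjacent = {frozenset(p) for p in combinations(variables, 2)}
    for order in range(max_order + 1):
        for pair in sorted(adjacent, key=sorted):
            x, y = sorted(pair)
            others = [v for v in variables if v not in pair]
            if any(indep(x, set(S), y) for S in combinations(others, order)):
                adjacent = adjacent - {pair}  # no edge x - y
    return adjacent

# Hand-coded independence oracle for the chain a -> b -> c:
# only a and c are independent, and only given {b}.
def indep(x, S, y):
    return {x, y} == {"a", "c"} and "b" in S

print(sorted(tuple(sorted(p)) for p in skeleton(["a", "b", "c"], indep)))
# [('a', 'b'), ('b', 'c')]
```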
The cardinality of the condition S determines the order of the conditional independence Pr(x ⊥ S ⊥ y) and the order of the corresponding test. A principal drawback of the separation approach is the risk of errors in identifying the edges of a model. This risk is explained by the unreliability of the results of testing statements of conditional independence when the size of a sample of data is insufficiently large or the data are noisy. As the cardinality of the condition S increases, the sample of data is, in effect, split. The order of magnitude of the factor of splitting a sample (in discrete models) is ||x|| · ||y|| · ||S||, where ||S|| = ∏_i ||z_i||, z_i ∈ S, and ||x|| is the number of values of a variable x. To reliably reconstruct the true structure of a model, it is necessary to get by with tests of small order whenever possible. This is also desirable in order to decrease the combinatorial exhaustive search in structure identification algorithms (i.e., to reduce the number of tests used). Moreover, the number of statistics necessary for an algorithm increases with the order of the tests [14]. The problem of splitting a sample of data concerns not only the general case of DAG models but also their subclass, namely, the models with the structure of a mono-flow dependence graph (MFDG). In fact, in MFDG-models, as well as in DAG models, the cardinality of separators is, in principle, bounded by nothing but the number of nodes of the graph. The algorithm (proposed in [13]) for identifying the structures of mono-flow models (called "simple" there) performs the following two tests for each pair of variables: a test in the format T(x ⊥ ∅ ⊥ y) and a test in the format T(x ⊥ U \ {x, y} ⊥ y), where U is the set of all variables. Thus, in this algorithm an exponential exhaustive search for separators is avoided.
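The splitting factor named above can be tabulated directly. The short sketch below (our own illustration) shows how quickly the average number of observations per contingency cell collapses as the order of the test grows:

```python
def splitting_factor(card_x, card_y, cards_S):
    """||x|| * ||y|| * prod_i ||z_i||: the number of contingency cells
    over which the sample is spread in a test of order |S|."""
    f = card_x * card_y
    for c in cards_S:
        f *= c
    return f

N = 1000  # hypothetical sample size
for order in range(6):  # all variables binary
    cells = splitting_factor(2, 2, [2] * order)
    print(order, cells, N / cells)  # order, cells, observations per cell
```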
However, when the number of variables in a model is large, the order of tests of the latter format is very high and, hence, such tests are unreliable. Indeed, suppose that two variables are connected not by an edge but by a chain consisting of two or more edges, so that these variables are strongly associated. Then the unconditional independence of these variables will be correctly rejected, and their conditional independence can also be rejected because of the strong splitting of the sample. In that case, the algorithm from [13] will wrongly connect these nodes by an edge.

A specific property of mono-flow models is the absence of cycles in which the number of colliders is less than two, which is formalized by the following axiom from [21]:

(x ∈ Anc(y)) & (y ∈ Anc(z)) ⇒ Ds(x ⊥ y ⊥ z). (9)
Axiom (9) implies the following statements:

x ∈ F(y) & z ∈ F(y) ⇒ Ds(x ⊥ ∅ ⊥ z); (10)

x ∈ F(y) & z ∈ F(y) & r ∈ Anc(z) ⇒ Ds(x ⊥ ∅ ⊥ r). (11)
Hence, if an orpath connects given variables, then the existence of other chains between them is excluded. Using these properties, we can construct a method of reconstruction of mono-flow models that is based on tests for independence of only the zeroth and first orders. Variants of realization of this method are the algorithms of the Genealog series [21, 22]. This method is based on the apparatus of genotypes of variables [21, 23]. However, the binary relation of unconnectedness over all the variables, defined as NC(x, y) ⇔ Ds(x ⊥ ∅ ⊥ y), must be supplied to the input of these algorithms, and such a priori knowledge can be absent. It is obvious that if we use the unconditional form of the faithfulness assumption (7a), then it is possible to identify the relation of (un)connectedness by tests of the zeroth order. But, in practice, what is required is not the asymptotic rule (7a) but the rule of faithfulness of a sampling distribution of probabilities, which is sensitive to the sample size and, hence, unreliable. This is the reason why all the well-known methods (for example, PC, SGS, and IC) use not rule (7) but rule (8), i.e., the assumption of sampling faithfulness is used to identify edges rather than paths. At the same time, in reconstructing the relation of (un)connectedness, the very thing that is necessary is the identification of the existence of chains. The assumption of sampling faithfulness implies the equivalence of statistically significant association and the presence of a chain between the corresponding nodes. This assumption is acceptable only if a sample of data has asymptotic properties (in the sense of associations of variables) or under the condition that long (weak) chains are absent in the corresponding digraph. If all the chains connecting two given variables are sufficiently long (and weak), then these variables are recognized as unconditionally independent as a result of statistical testing in a small sample.
In such a situation, more sophisticated methods of identifying the existence of chains should be used [24]. Thus, a demand arises for a new inductive method of identification of MFDG-models that is less sensitive to the size of a sample of data and does not require a priori knowledge of the relation of (un)connectedness between variables.

3. INDUCED DEPENDENCES IN MONO-FLOW MODELS

To identify the structures of mono-flow models, it is required to identify special patterns in data.

Definition 3. We call the pattern T(x ⊥ ∅ ⊥ z) & ¬T(x ⊥ y ⊥ z) the induced (provoked) dependence between x and z. In this case, we call the variable y the activator of the induced dependence.

Statement 1. If an MFDG contains the collider x → y ← z, then the statement T(x ⊥ ∅ ⊥ z) & ¬T(x ⊥ y ⊥ z) is true.

The truth of the statement is demonstrated in two steps, namely, (1) the premise implies the intermediate statement Pr(x ⊥ ∅ ⊥ z) & ¬Pr(x ⊥ y ⊥ z), and (2) this statement implies the final statement T(x ⊥ ∅ ⊥ z) & ¬T(x ⊥ y ⊥ z). The second step is substantiated as follows. The correctness of the implication Pr(x ⊥ ∅ ⊥ z) ⇒ T(x ⊥ ∅ ⊥ z) is justified by the pragmatic postulate of testing, and its statistical reliability is high in view of the empty condition of independence. The implication ¬Pr(x ⊥ y ⊥ z) ⇒ ¬T(x ⊥ y ⊥ z) is justified by the locality of the considered fragment of the graph of a model (its nodes are connected by two adjacent edges) and the simplicity of the condition (a single variable in the separator position). We pass to the proof of the first step. It follows from the conditions of the statement and from statements (10) and (4) that we have

Pr(x ⊥ ∅ ⊥ z). (12)

It remains to prove that the statement ¬Pr(x ⊥ y ⊥ z) is true. The proof follows from the criterion of d-separation and the faithfulness assumption.
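The induced-dependence pattern of Definition 3 is easy to observe numerically. In the sketch below (our own toy model: a noisy XOR collider x → y ← z), the unconditional mutual information of x and z is near zero while the conditional one, given the activator y, is clearly positive:

```python
import math
import random
from collections import Counter

def mutual_info(pairs):
    """Empirical Inf(a, b) in nats from a list of (a, b) observations."""
    n = len(pairs)
    nab = Counter(pairs)
    na = Counter(a for a, _ in pairs)
    nb = Counter(b for _, b in pairs)
    return sum(
        c / n * math.log(c * n / (na[a] * nb[b])) for (a, b), c in nab.items()
    )

# Toy collider x -> y <- z: y is a noisy XOR of two independent fair bits.
random.seed(3)
data = []
for _ in range(20000):
    x = random.random() < 0.5
    z = random.random() < 0.5
    y = (x ^ z) if random.random() < 0.9 else (random.random() < 0.5)
    data.append((int(x), int(z), int(y)))

inf_xz = mutual_info([(a, b) for a, b, _ in data])
inf_xz_y = sum(  # Inf(x, z | y) = sum over y of p(y) * Inf(x, z | Y = y)
    len(sub) / len(data) * mutual_info([(a, b) for a, b, _ in sub])
    for yv in (0, 1)
    for sub in [[r for r in data if r[2] == yv]]
)
print(round(inf_xz, 3))    # near 0: x and z unconditionally independent
print(round(inf_xz_y, 3))  # clearly positive: dependence induced by y
```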
However, since the criterion of d-separation is an abstract concept unfamiliar to a general audience and Statement 1 plays a basic role in further constructions, we will give two variants of the proof without using this concept. First, the proof can be constructed on the basis of one axiom of DAG models, namely, the axiom of weak transitivity [3], which is expressed in the form

Pr(x ⊥ S ⊥ z) & Pr(x ⊥ S ∪ {y} ⊥ z) ⇒ Pr(x ⊥ S ⊥ y) or Pr(y ⊥ S ⊥ z). (13)
We construct the proof by contradiction. Suppose that, under the conditions of Statement 1, we have

Pr(x ⊥ y ⊥ z). (14)

The existence of the edges (x − y) and (y − z) immediately implies

¬Pr(x ⊥ ∅ ⊥ y), ¬Pr(y ⊥ ∅ ⊥ z). (15)
We put S = ∅; then statements (12) and (14) imply the left side of the axiom of weak transitivity (13), whence we have Pr(x ⊥ ∅ ⊥ y) or Pr(y ⊥ ∅ ⊥ z). However, this contradicts (15). Hence, assumption (14) is incorrect. □

Second, we will construct a self-sufficient (without using axioms) proof of ¬Pr(x ⊥ y ⊥ z) for linear and discrete models individually.

Linear models. In this case, the coefficient of partial correlation is used in the format r_xz·y = (r_xz − r_xy r_yz) / √((1 − r_xy²)(1 − r_yz²)). By virtue of Pr(x ⊥ ∅ ⊥ z), ¬Pr(x ⊥ ∅ ⊥ y), and ¬Pr(y ⊥ ∅ ⊥ z), we obtain r_xz = 0, r_xy ≠ 0, and r_yz ≠ 0. Hence, we have r_xz·y ≠ 0. We obtain r_xz = 0 & r_xz·y ≠ 0, i.e., the sought-for result.

Discrete models with a binary variable y. The variable y assumes two values y^(0) and y^(1). Proving by contradiction, we assume that Pr(x ⊥ y ⊥ z). Then we obtain

∀x, y, z : p(xz | y) = p(x | y) p(z | y). (16)

According to statement (12), we have

p(xz) = p(x) p(z). (17)

Taking into account equality (17) and using identical mathematical transformations of statement (16), we obtain

p(y | xz) = p(y | x) p(y | z) / p(y). (18)

For the binary variable, we have the identity

p(y^(0) | xz) + p(y^(1) | xz) = 1. (19)

We substitute (18) in Eq. (19) and, after mathematical transformations, obtain

(p(y^(0)) − p(y^(0) | x)) (p(y^(0)) − p(y^(0) | z)) = 0; (20)
at the same time, identity (20) is true for any pair of values x and z. The first difference in (20) cannot vanish for all the values of the variable x (this would contradict the existence of the edge (x − y)). Hence, for some value x̃, we have p(y^(0)) ≠ p(y^(0) | x̃). But then, for all the values of the variable z, we must have p(y^(0)) = p(y^(0) | z),
contrary to the existence of the edge (y − z). Thus, in any case, we arrive at a contradiction, which proves ¬Pr(x ⊥ y ⊥ z) and, hence, Statement 1. □

Statement 1 cannot be proved in the general case of a discrete model (there are counterexamples). However, the falsity of Statement 1 means a gross violation of the faithfulness assumption. This is possible only for specific combinations of parameter values, which is almost incredible in practice. We note that any discrete y can always be represented as a binary variable: we can first transform the variable y into binary form and then condition on it. The above proof can then lose its force only in the case when one of the edges (x − y) or (y − z) collapses (the dependence disappears) as a result of the binarization; then Eq. (20) is true and does not lead to a contradiction. But even if the faithfulness assumption is not violated (in the strict sense), the induced dependence can nevertheless be so weak that it cannot be identified in small samples. In linear models, such surprises occur more rarely since, in this case, the value of the coefficient of partial correlation is tied to the strength of the corresponding edges. For example, for the symmetric model y = x + z + e, we can easily obtain r_xz·y = −D1 / (D1 + D2), where D1 is the variance of each of the variables x and z and D2 is the variance of the noise e.

The amount of induced dependence was investigated experimentally. For the structure of the model x → y ← z with binary variables (where z is unconditionally independent of x), we randomly generated about 2500 variants of collections of parameters and computed the value of the conditional mutual information Inf(x, z | y). It turned out that absolutely all the
variants of the model satisfy the inequality Inf(x, z | y) > Inf(x, y) · Inf(z, y), i.e., the induced dependence is stronger than the product of the associations of the edges that form it. Moreover, about 72% of the investigated variants of the model provide a value of the induced dependence that is larger than the value of association for one of the edges (x − y) or (y − z). For several variants of the model, the value of the induced dependence is larger than 0.5. Thus, according to the property of d-separation, unconditionally independent variables become d-connected as a result of conditioning on their common descendant. But when the common child of the variables x and z is conditioned on in an MFDG, these variables become not only d-connected but also significantly associated in a sample of realistic size.

Let Pvk(y) be the set of induced (provoked) dependences for an activator y, i.e., the set of all the pairs of variables whose dependence is induced by the variable y. We define the set of the passive participants of the induction Vpv(y) = {x | ∃z : (x, z) ∈ Pvk(y)}, i.e., the set of the variables that are accessible to the variable y as an activator.

4. IDENTIFICATION OF COLLIDER VARIABLES AND PATTERNS OF INDUCED DEPENDENCES IN MONO-FLOW MODELS

Let Rw(x) be the set of pairs of variables (z, w) such that there are chains (z − x) and (w − x) and there exists no chain between z and w in an MFDG (see Fig. 2). It is easy to see that, for any variable x, the condition (z, w) ∈ Rw(x) is necessary in order that (z, w) ∈ Pvk(x) be true. This follows from the criterion of d-separation, assumption (7), and axiom (9). Hence, we have

(z, w) ∈ Pvk(x) ⇒ (z, w) ∈ Rw(x); (21)

Pvk(x) ⊆ Rw(x); (22)

(z, w) ∈ Rw(x) ⇒ (z, w) ∈ Pvk(x). (23)
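With a threshold ε on mutual information (the naive procedure discussed below in connection with Statement 3), the sets Pvk(y) and Vpv(y) can be assembled as follows. The oracle `inf` and all names here are our own stand-ins, not the article's code:

```python
from itertools import combinations

def pvk(y, variables, inf, eps=0.01):
    """Pvk(y): pairs (z, w) whose dependence is induced by the activator y,
    using the threshold rule Inf(z, w) < eps and Inf(z, w | y) > eps."""
    rest = [v for v in variables if v != y]
    return {
        (z, w)
        for z, w in combinations(rest, 2)
        if inf(z, w) < eps and inf(z, w, y) > eps
    }

def vpv(y, variables, inf, eps=0.01):
    """Vpv(y): the passive participants of the induction by y."""
    return {v for pair in pvk(y, variables, inf, eps) for v in pair}

# Stand-in oracle for the collider a -> c <- b: a and b are independent and
# become dependent given c; adjacent pairs are always dependent.
def inf(z, w, cond=None):
    if {z, w} == {"a", "b"}:
        return 0.3 if cond == "c" else 0.0
    return 0.2

print(pvk("c", ["a", "b", "c"], inf))  # {('a', 'b')}
print(vpv("c", ["a", "b", "c"], inf))  # {'a', 'b'}
```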
If we have x ∈ Anc(y) in an MFDG, then, according to statement (11), we obtain ∀z : (z, y) ∉ Rw(x) and then, according to (21), we have y ∉ Vpv(x). Thus, we obtain (z, w) ∈ Pvk(y) ⇒ y ∉ Anc(z), y ∉ Anc(w). According to Statement 1 and axiom (9), each collider node in a mono-flow model is the activator of at least one induced (provoked) dependence. However, not every activator of an induced dependence is a collider node (a descendant of a collider node may be an activator). The identification of patterns of induced dependence in MFDG-models makes it possible to find all the candidates for collider nodes (so that no collider is missed). The problem then reduces to the elimination of colliderless nodes from the set of activators of induced dependences.

Statement 2. If x ∈ Anc(y) and y is a collider node in an MFDG, then Pvk(x) ∩ Pvk(y) ≠ ∅.

Proof. By the definition of an MFDG, all the parents of the node y are mutually unconditionally independent and, hence, only one of them can lie on an orpath from x to y. We denote this parent by t (see Fig. 2). The node t is a descendant of the node x (or coincides with it). For another parent r ∈ F(y), according to statement (11), we have Ds(x ⊥ ∅ ⊥ r). Then we have (t, r) ∈ Rw(x) and, in view of statement (23), we obtain (t, r) ∈ Pvk(x). At the same time, according to Statement 1, we have (t, r) ∈ Pvk(y). □

Statement 3. Let an orpath from x to y exist in an MFDG, and let all the descendants of the node x on this path be colliderless. Then we have Pvk(y) ⊆ Pvk(x).

Indeed, under the conditions of Statement 3, we have Rw(x) = Rw(y). For all pairs (z, w) ∈ Rw(y), the implication (z, w) ∈ Pvk(y) ⇒ (z, w) ∈ Pvk(x) is justified by the fact that the path between z and x is a part of the path between z and y (similarly for the node w).
At the same time, a pair (z, w) ∈ Pvk(x) can be found such that (z, w) ∉ Pvk(y), since the orpath from x to y may turn out to be so weak that the conditioning of the variable y does not have any statistically significant influence on x and, hence, on the association of z with w. Moreover, Statement 3 plausibly remains true even for a naive procedure of identification of induced dependences with the help of a threshold ε fixed for the amount of mutual information. This procedure is specified as follows: (x, z) ∈ Pvk(y) ⇔ (Inf(x, z) < ε) & (Inf(x, z | y) > ε). In this case (under the conditions of Statement 3), the implication (z, w) ∈ Pvk(y) ⇒ (z, w) ∈ Pvk(x) is intuitively convincing and is empirically supported for binary variables [25]. Namely, a hundred thousand variants of parametrization of a binary model under the conditions of Statement 3 were generated in [25], and it turned out that, in
99.97% of cases, we have Inf(z, w | x) ≥ Inf(z, w | y). In order that a pair (z, w) ∉ Pvk(x) & (z, w) ∈ Pvk(y) be found, it is necessary not only that a variant belong to the 0.03% of cases where Inf(z, w | x) < Inf(z, w | y), but also that the threshold of significance of the corresponding dependence lie between the quantities Inf(z, w | x) and Inf(z, w | y). Thus, Statement 3 is true with very high reliability. □

A similar statement for linear models (in which the coefficient of partial correlation is used instead of mutual information) is proved in [26]. Statement 3 directly implies that if Pvk(y) ⊄ Pvk(x) in an MFDG and there exists an orpath from x to y, then at least one collider node w, w ≠ x, exists on this path. According to the statements established, the colliderless nodes among the activators are exactly the nodes x such that, for each of them, some y can be found satisfying the relation Pvk(x) ⊆ Pvk(y). In order to exactly identify collider and colliderless nodes, it is necessary to test the suspect activators. Let x be a colliderless activator node. Then a collider node q can be found in the set of activators such that there is an orpath from q to x on which all the nodes (except for q) are colliderless (i.e., q is the nearest collider ancestor of x). Next, let (z, w) ∈ Pvk(x) ⊆ Pvk(q). Then all the colliderless paths that connect z or w with the node x pass through q (see Fig. 2). Hence, the node q d-separates the node x from z and w, and we have

T(x ⊥ q ⊥ z, w), where (z, w) ∈ Pvk(x) ⊆ Pvk(q). (24)

From this (by virtue of the well-known law for mutual information), we have

∀z ∈ Vpv(x) ∩ Vpv(q) : Inf(z, x) < Inf(z, q). (25)
On the contrary, if a collider node (in addition to q) lies on the orpath from q to x, then separation of the form (24) takes place not for all (z, w) ∈ Pvk(x). And, finally, if the node q is not an ancestor of x, then separation (24) cannot take place for (z, w) ∈ Pvk(x). Thus, we have justified the statement formulated below.

Statement 4. A node x is colliderless if and only if it is possible to find a node q such that, for all z ∈ Vpv(x), we have T(x ⊥ q ⊥ z).

This statement makes it possible to identify all collider nodes. Note that, with a view to economizing on the number of tests performed, it suffices to check the specified conditions not for all variables z ∈ Vpv(x) but only for one variable, and it is desirable that this variable be strongly associated with the activator x. It is obvious that the variable q need be searched for only in the set of activators. Moreover, according to Statement 3, such a variable q should be searched for among the variables that satisfy the condition Pvk(q) ⊇ Pvk(x). Thus, we obtain the following statement.

Statement 5. A node x is a collider node if it is an activator node and, for z* = argmax{Inf(z, x) | z ∈ Vpv(x)} and for all q, q ≠ x, Pvk(q) ≠ ∅, we have

(Inf(z*, q) ≥ Inf(z*, x)) & (Pvk(q) ⊇ Pvk(x)) ⇒ ¬T(z* ⊥ q ⊥ x). (26)
The statements presented make it possible to justify and construct a computationally efficient algorithm for identifying collider variables in an MFDG [24]. The algorithm Collifinder described in [24] first determines the set of all activators of induced dependences and then eliminates colliderless variables from this set according to Statement 5. However, from the practical viewpoint, it is more convenient to identify induced dependences using a fixed threshold for mutual information. In this case, the question of substantiating the relationship Pvk(q) ⊇ Pvk(x) arises. When the variables x and q are of different lengths, the investigation of the behavior of conditional mutual information is a difficult problem, and unexpected subtleties are not excluded. The relationship Pvk(q) ⊇ Pvk(x) may then prove to be insufficiently reliable; hence, it is advisable to relax it to the condition Pvk(q) ∩ Pvk(x) ≠ ∅ but to collect and use the strongest necessary conditions based on unconditional mutual information. Then we obtain the statement formulated below.

Statement 6. If, in an MFDG, a node x is a colliderless activator and the collider ancestor of x nearest to it is a node q, then we have

Pvk(x) ∩ Pvk(q) ≠ ∅, (27)
x ∉ Vpv(q) & q ∉ Vpv(x), (28)
∀z ∈ Vpv(x): ¬T(z ⊥ ∅ ⊥ q), (29)
∀z ∈ Vpv(x) ∩ Vpv(q): Inf(z, x) < Inf(z, q), (30)
T(z ⊥ q ⊥ x), where z = argmax{Inf(z, x) | z ∈ Vpv(x)}. (31)
The modernized algorithm presented below is based on Statement 6 and identifies collider variables in an MFDG.

Algorithm Collifinder-M
1. Find all the induced dependences and put all the activators in the list L.
2. For all elements x ∈ L do
     For all q ∈ L, q ≠ x, do
       If conditions (27)–(30) are true, then
         If, for z = argmax{Inf(z, x) | z ∈ Vpv(x)}, we have T(z ⊥ q ⊥ x), then
           eliminate the variable x from the list L
     Repeat
   Repeat

As a result of execution of the algorithm Collifinder-M, the collider variables remain in the list L. The efficiency of the algorithm Collifinder-M is a result of its simplicity (the number of tests is less than N³, where N is the number of variables) and its reliability (the algorithm uses independence tests of only the zeroth and first orders).

5. ORIENTATION OF EDGES AND GENERALIZATION OF THE CONCEPT OF INDUCED DEPENDENCE

The majority of methods of statistical inference of DAG models use the following rule of collider orientation of edges:

(x – y – z) & ∃S: T(x ⊥ S ⊥ z) & y ∉ S ⇒ x → y ← z. (32)
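The elimination loop of Collifinder-M can be sketched as follows. The helpers pvk(v) (set of provoked pairs), vpv(v) (variables occurring in those pairs), inf(a, b) (unconditional mutual information), and ci_test(a, b, cond) (independence test of order zero or one) are hypothetical stand-ins for the statistical machinery described in [24]; searching the candidate q over all activators is a simplification.

```python
def collifinder_m(activators, pvk, vpv, inf, ci_test):
    """Return the activator variables that remain as collider candidates."""
    survivors = set(activators)
    for x in activators:
        # z with the strongest association with the activator x.
        z_best = max(vpv(x), key=lambda z: inf(z, x))
        for q in activators:
            if q == x:
                continue
            if not (pvk(x) & pvk(q)):               # (27): overlapping provoked pairs
                continue
            if x in vpv(q) or q in vpv(x):          # (28)
                continue
            if any(ci_test(z, q) for z in vpv(x)):  # (29): every z must depend on q
                continue
            if any(inf(z, x) >= inf(z, q)
                   for z in vpv(x) & vpv(q)):       # (30)
                continue
            if ci_test(z_best, x, cond=q):          # (31): q separates x from z_best
                survivors.discard(x)                # x is colliderless
                break
    return survivors
```

On a toy instance with a collider q and a colliderless activator x downstream of it, the sketch eliminates x and keeps q, mirroring the final state of the list L.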
However, this rule can result in an erroneous orientation of edges. In fact, the generating model can contain a fragment of the form x → y → z, x ← y ← z, or x ← y → z, and the variables x and z can behave as unconditionally independent of each other or can be separated by a separator S, y ∉ S. This behavior of the model can be caused, first, by the weakness of the chain between x and z (the weakness of the transitive dependence) and, second, by the collapse of the transitive dependence (though the latter means a gross violation of the faithfulness assumption). The instrument of induced (provoked) dependence makes it possible to prevent errors in such situations by using the following more cautious and reliable version of the collider orientation rule:

(x – y – z) & ∃S: T(x ⊥ S ⊥ z) & ¬T(x ⊥ S ∪ {y} ⊥ z) ⇒ x → y ← z. (33)
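A minimal sketch of rule (33), assuming a Boolean conditional-independence oracle ci_test(a, b, S) (hypothetical; in practice this is a statistical test against a significance threshold) and a pool others of candidate separator variables:

```python
from itertools import combinations

def orient_as_collider(x, y, z, others, ci_test):
    """Rule (33): orient x -> y <- z only if some separator S with y not in S
    makes x and z independent AND adding y to S reactivates the dependence.
    ci_test(a, b, S) -> True means 'a and b are independent given S'."""
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            S = set(subset)
            if ci_test(x, z, S) and not ci_test(x, z, S | {y}):
                return True  # reactivated dependence found: collider at y
    return False
```

For a weak chain, where x and z test as independent both with and without y in the conditioning set, the reactivation check fails and the rule cautiously declines to orient, whereas naive rule (32) would fire erroneously.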
(In the condition T(x ⊥ S ⊥ z) & ¬T(x ⊥ S ∪ {y} ⊥ z), one can easily see a generalization of the pattern of an induced dependence (the context S is added). In accordance with its physical meaning, this pattern can be called a reactivated or reanimated dependence. We note that, though the condition T(x ⊥ S ⊥ z) & ¬T(x ⊥ S ∪ {y} ⊥ z) recognizes weak chains, it does not protect from other rare traps.) As applied to the identification of mono-flow models, rule (32) reduces to the form

(x – y – z) & T(x ⊥ ∅ ⊥ z) ⇒ x → y ← z. (34)
The instrument of induced dependence suggests that the premise of the rule should be strengthened so that we obtain the rule [24]

(x – y – z) & (x, z) ∈ Pvk(y) ⇒ x → y ← z, (35)

which, like rule (33), is protected from an error when the chain between x and z is weak. The proposed method Collifinder-M identifies collider variables and thereby solves an auxiliary problem of reconstructing the structures of mono-flow dependence models. Knowing all the collider variables and induced dependences, it is possible to construct an algorithm [24] of identification of mono-flow models with complexity comparable to that of the well-known Chow–Liu algorithm for forests (trees) [17]. A generalization of the pattern of an induced dependence allows one to formulate more cautious and reliable versions of the rules of orientation of edges of a model. The results obtained show that, for identification of a model on the basis of statistical data (within the framework of the conditional-independence methodology), a useful tactic is not only the elimination of dependences (i.e., separation) but also the inducing of dependences (i.e., in a sense, the opposite action).
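Rule (35) amounts to a single membership check per chain of the skeleton. In the sketch below, pvk(y) is a hypothetical accessor for the set of dependence pairs provoked by y (the pairs themselves would come from an algorithm such as Collifinder-M):

```python
def orient_monoflow_edges(chains, pvk):
    """Rule (35): for each chain x - y - z of the skeleton, orient
    x -> y <- z exactly when the pair (x, z) is a dependence provoked by y."""
    colliders = []
    for x, y, z in chains:
        # Membership is checked in both orders since the pair is unordered.
        if (x, z) in pvk(y) or (z, x) in pvk(y):
            colliders.append((x, y, z))
    return colliders
```

Unlike rule (34), this never orients a chain merely because x and z happen to test as unconditionally independent; the dependence must actually be provoked by conditioning on y.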
REFERENCES

1. S. L. Lauritzen, Graphical Models, Clarendon Press, Oxford (1996).
2. R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag, Berlin-Heidelberg-New York (1999).
3. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo (1988).
4. P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, MIT Press, Cambridge, MA (2001).
5. D. Heckerman, Bayesian networks for data mining, Data Mining and Knowledge Discovery, 1, No. 1, 79–119 (1997).
6. T. Verma and J. Pearl, Causal networks: semantics and expressiveness, in: R. Shachter, T. S. Levitt, and L. N. Kanal (eds.), Uncertainty in Artificial Intelligence, 4, Elsevier (1990), pp. 69–76.
7. J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge Univ. Press, Cambridge (2000).
8. O. S. Balabanov, Determination of structures of dependences in data: From indirect associations to causality, in: Proc. 2nd Intern. Conf. UkrProg 2002, Probl. Programmirovaniya, Nos. 1–2, 309–316 (2002).
9. F. I. Andon and A. S. Balabanov, Identification of knowledge and research in databases: Approaches, models, methods, and systems (a review), in: Proc. 2nd Intern. Conf. UkrProg 2000, Probl. Programmirovaniya, Nos. 1–2, 513–526 (2000).
10. A. S. Balabanov, Extraction of knowledge from databases: Advanced computer technologies of intellectual analysis of data, Mathematical Machines and Systems, Nos. 1–2, 40–54 (2001).
11. D. Heckerman, D. Geiger, and D. M. Chickering, Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning, 20, 197–243 (1995).
12. R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson, The TETRAD project: Constraint based aids to causal model specification, Multivariate Behavioral Research, 33, No. 1, 65–118 (1998).
13. D. Geiger, A. Paz, and J. Pearl, Learning simple causal structures, Intern. Journ. of Intelligent Systems, 8, No. 2, 231–247 (1993).
14. J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu, Learning Bayesian networks from data: An information-theory based approach, Artificial Intelligence, 137, 43–90 (2002).
15. M. I. Jordan (ed.), Learning in Graphical Models, MIT Press, Cambridge (1999).
16. A. M. Gupal and A. A. Vagis, Learning in Bayesian networks, Problems of Control and Informatics, No. 3, 106–111 (2002).
17. C. K. Chow and C. N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. Inform. Theory, 14, No. 3, 462–467 (1968).
18. O. S. Balabanov, Inductive reconstruction of treelike structures of systems of dependences, Probl. Programmirovaniya, Nos. 1–2, 95–108 (2001).
19. D. M. Chickering, C. Meek, and D. Heckerman, Large-sample learning of Bayesian networks is NP-hard, in: Proc. 19th Conf. on Uncertainty in Artificial Intelligence, Morgan Kaufmann, Acapulco, Mexico (2003), pp. 124–133.
20. A. P. Dawid, Conditional independence in statistical theory (with discussion), Journ. of Royal Statist. Soc., 41-B, 1–31 (1979).
21. A. S. Balabanov, Inductive method of reconstruction of mono-flow probabilistic graphical models of dependencies, Probl. Upravlen. Inf., No. 5, 75–84 (2003).
22. A. S. Balabanov, New method of reconstruction of probabilistic graphical models of dependencies, in: Proc. 1st Intern. Conf. on Inductive Modeling, MKIM-2002, 1, Lviv (2002), pp. 118–124.
23. A. S. Balabanov, Reconstruction of structures of probabilistic dependence systems from data: The apparatus of genotypes of variables, Probl. Upravlen. Inf., No. 2, 91–99 (2003).
24. A. S. Balabanov, Efficient method of identification of dependence structures in statistical data, in: Proc. 4th Intern. Conf. UkrProg 2004, Probl. Programmirovaniya, Nos. 2–3, 312–319 (2004).
25. D. M. Chickering and C. Meek, Monotone DAG Faithfulness: A Bad Assumption, Techn. Rep. MSR-TR-2003-16, Microsoft, Redmond, WA (2003).
26. S. Chaudhuri and T. Richardson, Using the structure of d-connecting paths as a qualitative measure of the strength of dependence, in: Proc. 19th Conf. on Uncertainty in Artificial Intelligence, Part 2, Morgan Kaufmann, Acapulco, Mexico (2003), pp. 116–123.