Bayesian Network Learning With The PC Algorithm: An Improved and Correct Variation

Michail Tsagris
Abstract
PC is a prototypical constraint-based algorithm for learning Bayesian networks, a special case of directed acyclic graphs. An existing variant of it, in the R package pcalg, was developed to make the skeleton phase order independent. In return, it has notably increased execution time. In this paper we clarify that the skeleton phase of PC is indeed order independent. The modification we propose outperforms pcalg's variant of PC by returning correct networks of better quality, as it is less prone to errors, and in some cases it is computationally much cheaper. In addition, we show that pcalg's variant does not always return valid acyclic graphs.
1 Introduction
Learning causal relationships in datasets with many variables (or features) is of high importance
in many scientific fields. Bayesian networks (BN) have been applied for this purpose in many
settings. In clinical set-ups, for example, BNs can be used for disease diagnostic purposes (Bucci
et al., 2011; Zagorecki et al., 2013; Suchánek et al., 2014). In biology they can be used to
discover interaction networks (Isci et al., 2013) or to analyze gene expression data (Friedman
et al., 2000). Other applications include psychology (Glymour, 2001), teaching purposes (Conati
et al., 2002), data mining (Heckerman, 1997), environmental modelling (Aguilera et al., 2011)
and criminology (Baumgartner et al., 2008), to name a few. Finance and insurance are linked with operational risk management (or modelling), another area where BNs have been used (Cowell et al., 2007).
BNs are probabilistic graphical models representing causal relationships between variables.
They offer a network visualization through which one can intuitively (and, in the case of BNs, causally) interpret the relationships among variables. Two main classes of algorithms for BN learning are
constraint-based and score-based methods. Constraint-based learning algorithms, such as PC
(Spirtes and Glymour, 1991; Spirtes et al., 2000) and FCI (Spirtes et al., 2000) employ condi-
tional independence tests to discover the structure of the network, and then orient the edges
by repetitively applying orientation rules. Score-based methods on the other hand (Cooper and
Herskovits, 1992; Heckerman et al., 1995; Chickering, 2002) assign a score on the whole network
and perform a search in the space of BNs to identify a high-scoring network. Furthermore, hy-
brid algorithms exist, such as MMHC (Tsamardinos et al., 2006) which first performs conditional
independence (CI) tests and then uses a scoring method on the reduced space.
In this work we focus on the PC algorithm (PC stands for Peter and Clark, after Peter Spirtes and Clark Glymour, the two researchers who invented it) for BN learning and on a modification of it called
PC-stable (SPC) (Colombo and Maathuis, 2014). SPC is a popular modification of the skeleton phase of PC, available in the R package pcalg (Kalisch et al., 2012), which was suggested as a means of making the skeleton of the original PC order independent. As analyzed later in this work, the original PC is already order independent, hence this variation brought no improvement while being computationally more expensive. What is more, it does not always produce a valid partially oriented DAG (PDAG).
The aforementioned observations encouraged us to revisit the original PC algorithm (Spirtes et al., 2000). We have re-implemented, in the R package MXM (Lagani et al., 2017), the original PC algorithm, modified its orientation rules (we will drop the word "orientation" hereafter and simply refer to them as rules or PC rules) to ensure the validity of the returned graph, and finally attempted to resolve some conflicts which occur.
Extensive simulation studies show that our modification of the PC algorithm, termed MPC hereafter, leads to better results in comparison to SPC. The comparison takes place using a variety of CI tests for continuous, categorical and mixed data, and the results show that MPC not only leads to better results but also, unlike SPC, returns a valid PDAG. In terms of computational cost, the implementation of the skeleton phase of MPC (which is the same as that of PC) is much cheaper than SPC's; for categorical data, for example, it is more than two orders of magnitude faster. Time efficiency, validity of the output graph and fewer errors, in conjunction with its functionality (it works with many types of data), make MPC a more practical algorithm to use than PC-stable.
Section 2 contains brief information about BNs and conditional independence tests. Section
3 summarizes the two algorithms, along with some algorithmic details and our modifications in
the orientation rules. Finally, in Section 4 extensive experiments are presented, and Section 5
concludes the paper.
2 Preliminaries
2.1 Directed acyclic graphs
A graphical model or probabilistic graphical model is a probabilistic model for which a graph
expresses the conditional (in)dependencies between random variables. A directed graph is a
graphical model where arrows indicate a direction. When no cycles are allowed, as in the chain V1 → V2 → V3 for example, the graph is termed a directed acyclic graph (DAG).
The variables are denoted by nodes or vertices in the graph. The parents of a node Vi are the nodes whose arrows point towards Vi and, respectively, the node Vi is the child
of these nodes. For example in Figure 2(a), the nodes X, Z and W are the parents of the node
Y and the node Y is the child of the nodes X, Z and W.
A BN consists of a DAG G over the variables V together with their joint distribution P, linked through the Markov condition: the joint distribution factorizes as

P(V1, ..., Vd) = ∏_{i=1}^{d} P(Vi | Pa(Vi)),

where d is the total number of variables in G and Pa(Vi) denotes the parent set of Vi in G. For the chain V1 → V2 → V3, for instance, this reads P(V1, V2, V3) = P(V1) P(V2 | V1) P(V3 | V2). If all conditional independencies in P are entailed by the Markov condition in G, the BN is called faithful.
Causal sufficiency, i.e. no latent confounders between the measured variables in V, is a necessary assumption made by PC. A causal BN is a BN whose edges are interpreted causally. Specifically, an edge X → Y exists if X is a direct cause of Y in the context of the variables V. For every directed edge Vi → Vj, Vi denotes the parent and Vj the child. A collider is a triplet (Vi, Vk, Vj) where Vi → Vk ← Vj. If there is no edge between Vi and Vj, the node Vk is called an unshielded collider. This translates to dependence between Vi and Vj conditional on Vk, if G and P are faithful to each other (Spirtes et al., 2000).
Typically, multiple BNs encode the same set of conditional independences. Such BNs are called Markov equivalent: two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures (Verma and Pearl, 1991). The set of all Markov equivalent BNs forms a Markov equivalence class. This class can be represented by a complete partially directed acyclic graph (CPDAG), which in addition to directed edges also contains undirected edges. Undirected edges may be oriented either way in some BNs in the Markov equivalence class (although not all combinations are possible), while directed and missing edges are shared among all equivalent networks.
2.2 Conditional independence tests

For continuous data, conditional independence of X and Y given a set of variables Z is tested through the partial Pearson correlation, using Fisher's z transform

T = √(n − |Z| − 3) · (1/2) log[(1 + r_{X,Y|Z}) / (1 − r_{X,Y|Z})],   (1)

where n is the sample size, |Z| the number of conditioning variables in the set Z and r_{X,Y|Z} is the partial Pearson correlation of X and Y conditioning on Z. The same statistic can be computed with the Spearman correlation in place of the Pearson. Asymptotically both test statistics follow the standard normal distribution under the zero correlation assumption. In the R package MXM though, they are calibrated against a t distribution with n − |Z| − 3 degrees of freedom, whose performance is better for small sample sizes.

Equivalently, one may regress X and Y each on the conditioning set Z via linear models and keep the residuals. The conditional Pearson correlation is given via the correlation of these residuals. P-values are obtained as before using the Pearson test statistic (1).
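To make the computation concrete, the following is a minimal sketch of such a test in base R; the function name partial_cor_test and its interface are hypothetical illustrations, not the MXM API.

partial_cor_test <- function(x, y, z = NULL) {
  ## Fisher's z transform of r_{X,Y|Z}, calibrated against a t distribution
  ## with n - |Z| - 3 degrees of freedom (a sketch, not the MXM internals)
  n <- length(x)
  nz <- if (is.null(z)) 0 else NCOL(z)
  if (nz == 0) {
    r <- cor(x, y)
  } else {
    ## partial correlation via the residuals of two linear regressions on Z
    zd <- as.data.frame(z)
    r <- cor(resid(lm(x ~ ., data = zd)), resid(lm(y ~ ., data = zd)))
  }
  stat <- sqrt(n - nz - 3) * 0.5 * log((1 + r) / (1 - r))
  ## two-sided logged p-value, computed on the log scale to avoid underflow
  log(2) + pt(abs(stat), df = n - nz - 3, lower.tail = FALSE, log.p = TRUE)
}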
The G² test statistic,

G² = 2 ∑_{i,j,k} O_{ij|k} log(O_{ij|k} / E_{ij|k}),

is used for categorical variables. The O_{ij|k} are the observed frequencies of the i-th and j-th values of X and Y respectively in the k-th set of values of Z, and the E_{ij|k} are their corresponding expected frequencies. Under the conditional independence assumption, the G² test statistic follows the χ² distribution with (|X| − 1)(|Y| − 1)|Z| degrees of freedom, where |X|, |Y| and |Z| denote the numbers of distinct values of X and Y and of value combinations of Z.
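As an illustration, the sketch below computes the G² statistic over the strata of Z, assuming factors x and y and a single factor z; the helper g2_test is hypothetical, not the MXM implementation.

g2_test <- function(x, y, z) {
  tab <- table(x, y, z)                       ## observed frequencies O_{ij|k}
  stat <- 0
  for (k in seq_len(dim(tab)[3])) {
    O <- tab[, , k]
    if (sum(O) == 0) next                     ## skip empty strata of Z
    E <- outer(rowSums(O), colSums(O)) / sum(O)  ## expected frequencies E_{ij|k}
    keep <- O > 0                             ## treat 0 * log(0) as 0
    stat <- stat + 2 * sum(O[keep] * log(O[keep] / E[keep]))
  }
  df <- (nlevels(x) - 1) * (nlevels(y) - 1) * dim(tab)[3]
  pchisq(stat, df, lower.tail = FALSE)        ## p-value from the chi-squared
}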
The distance correlation,

dcor(X, Y) = dcov(X, Y) / √(dvar(X) dvar(Y)),

where dcov(X, Y) is the distance covariance between X and Y and dvar(·) denotes the distance variance, can also be used. The p-value for zero correlation is calculated via permutations.
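For the unconditional case, one readily available implementation of this permutation test is dcov.test from the energy package; this is an illustration, not necessarily what MXM calls internally.

library(energy)
x <- rnorm(100)
y <- x^2 + rnorm(100)        ## a non-linear dependence
dcov.test(x, y, R = 499)     ## permutation p-value for zero distance covariance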
For mixed data, a CI test between X and Y conditional on Z can be built by fitting two regression models, one for each direction of the pair; the two p-values are then combined in a meta-analytic way which handles the inherited correlation between the two tests.

A faster alternative is to perform one regression model only, which was shown to be asymptotically equivalent to performing both models (Tsagris et al., 2017). In that case, Tsagris et al. (2017) proposed a priority rule for choosing which regression model to implement. For example, when the pair consists of a continuous and a nominal, ordinal etc. variable, the faster model that should be applied is the linear one. In the case of a nominal-ordinal pair, the multinomial logistic regression model should be fitted, as it is faster than the ordinal logistic regression model. A sketch of the one-regression test follows.
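The sketch below illustrates the one-regression test for a continuous X paired with a nominal Y, conditional on a data frame of continuous variables Z; the helper name ci_one_reg and the setup are assumptions for illustration, not the MXM code.

ci_one_reg <- function(x, y, z) {
  ## x: continuous, y: a factor, z: data.frame of conditioning variables;
  ## the priority rule picks the (faster) linear model for x
  full    <- lm(x ~ . + y, data = z)   ## model with y included
  reduced <- lm(x ~ ., data = z)       ## model without y
  anova(reduced, full)[2, "Pr(>F)"]    ## F test for dropping y
}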
The family of symmetric CI tests also covers repeated measurements and clustered data (data from family members, for example). These types of data are handled by generalized linear mixed models or by generalized estimating equations (Demidenko, 2013). This means that our implementation of PC constructs PDAGs for many types of data.
The third heuristic, on the other hand, relies on ordering the p-values of the pairs of variables, and the order of appearance makes no difference. No experiment is necessary to show that the first two heuristics are order dependent, whereas the third one is order-independent. Hence, we can state that the skeleton phase of the (original) PC algorithm is order-independent.
1. The first and most important feature is that we have utilized the often neglected third
heuristic in our implementation in the R package MXM.
2. This heuristic relies on the correct ordering of the pairs of variables. When the p-values are very small, below a threshold value such as 10^{-16}, R rounds them to 0. When it comes to ordering two or more p-values, R (R Core Team, 2016) then orders two or more 0 values arbitrarily. This case is not at all rare, especially with the G² test. The problem is also common in feature selection algorithms which use an ordering of the p-values in order to select the most statistically significant variable. The answer to this problem is the use of the logarithm of the p-values (see the sketch after this list). This of course is not the case for SPC, since it does not rely on any p-value ordering heuristic.
3. When it comes to permutation based CI tests, equal (logged) p-values are a very frequent phenomenon. For this reason, we order the p-values first and then, for the equal p-values, we order by their test statistic value divided by the degrees of freedom of the test. For the partial correlation test this division is no different from taking the test statistic itself, as at each step of the algorithm the number of conditioning variables is the same. For the G² test though this makes a difference, as the degrees of freedom are usually different.
4. The CI tests return a p-value and, using a significance level α (Colombo and Maathuis (2014) suggest values of 0.01 or less), a decision on the edge removal is made. In all of our experiments we have chosen α = 0.01. The α (type I error) denotes the probability of falsely assuming that two variables are not independent when in fact they are. In general, there is a trade-off between the type I error and the type II error (not detecting dependence when in fact there is one, termed β). But with large samples the power of the CI test (1 − β) is high. Thus, it is advisable to have a low significance level. This way, falsely added edges remain bounded at 1% of the total number of edges; this is a worst-case upper bound which is never reached, as the construction of the skeleton using the PC algorithm ensures. In other words, statistical errors can be kept at very low levels if a large sample size is available.
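Points 2 and 3 above can be illustrated in a few lines of base R; the test statistics are hypothetical numbers chosen to trigger the underflow.

stats <- c(40, 45)   ## two (hypothetical) standard normal test statistics
p <- 2 * pnorm(stats, lower.tail = FALSE)
p                    ## both underflow to exactly 0, so their ordering is arbitrary
logp <- log(2) + pnorm(stats, lower.tail = FALSE, log.p = TRUE)
logp                 ## distinct finite values, roughly -804 and -1017
order(logp)          ## the correct ordering is recovered
## tie-breaking as in point 3: order by logged p-value first, then by the
## test statistic divided by its degrees of freedom (stat and df assumed
## available as vectors), e.g. ord <- order(logp, -stat / df)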
Rule 0. For every triplet of variables (Vi, Vj, Vk) such that the pairs Vi, Vj and Vj, Vk are adjacent in G but Vi and Vk are not, orient Vi − Vj − Vk as Vi → Vj ← Vk if Vj ∉ sepset(Vi, Vk).

The remaining rules, as given by Spirtes et al. (2000), are the following.

Rule 1. If Vi → Vj, Vj and Vk are adjacent and Vi and Vk are not adjacent, orient Vj − Vk as Vj → Vk (otherwise a new unshielded collider would be created).

Rule 2. If Vi → Vk → Vj and the edge Vi − Vj exists, orient it as Vi → Vj (otherwise a cycle would be created).

Rule 3. If Vi − Vk → Vj, Vi − Vl → Vj and the edge Vi − Vj exists, with Vk and Vl not adjacent, orient it as Vi → Vj.
Rule 0 is to be applied first. Then, according to Spirtes et al. (2000), the order in which the other three rules are applied does not matter: in the sample limit (sample size going to infinity) and under no statistical errors, the output will be a DAG of the correct Markov equivalence class. In the finite sample case though, statistical errors exist and can lead to the wrong skeleton. Application of the third heuristic requires some attention, and conflicts within the rules almost always appear. Cycle prevention is another issue not heavily addressed in SPC. In the next subsection we try to address some of these issues and point out some algorithmic and CI-test related details.
1. As mentioned in the Preliminaries Section, a DAG, and hence a BN, does not allow cycles. Unfortunately, the four aforementioned rules do not include cycle prevention in their protocol. Whenever a rule is about to be applied, we check whether a cycle would be created; if so, the rule is canceled and the edge is left un-oriented (see the sketch after this list). None of the aforementioned variations of the PC (Abellán et al., 2006; Cano et al., 2008; Colombo and Maathuis, 2014) addresses this issue. Figure 1 shows an example where SPC produces a cycle, namely X1 → X9 → X8 → X1. We highlight that SPC does not always produce acyclic graphs, as we show in the experimentation Section; this was a rare example, yet not impossible to happen, and the R code used to generate these figures is in the Appendix. The difference in the skeleton between MPC and SPC is the edge connecting the nodes X1 and X8.
Figure 1: An example where pcalg produced a partially oriented cyclic directed graph; the first panel shows the true DAG.
2. Unfaithful colliders often emerge. For example, consider the pairs X - Y, W - Y and Z - Y. The first triplet X → Y ← Z holds true (X is not independent of Z given Y), and the second one Z → Y ← W (Z is not independent of W given Y) holds true as well, but at the same time the triplet X → Y ← W that has been created does not hold true (X is independent of W given Y). In this case, similarly to Isozaki (2014), MPC disorients X - Y and W - Y (see Figure 2).
3. Colliders can be created when applying Rule 1. If that is the case, the rule is canceled for that particular node. Figure 3 gives such an example. In (a), Rule 1 would turn Y - Z into Y → Z and W - Z into W → Z. This would create a collider, Z, for Y and W. We direct one edge only (the first seen). Suppose we turned Y - Z into Y → Z. Then Z - W is not allowed to be turned into Z → W, because W would become a collider; in this case Z - W remains as is. In (b), orienting Y − Z and W − Z would create a new collider as well. In this case, both edges are left un-oriented.
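The cycle check of point 1 can be as simple as a depth-first search on the matrix of already directed edges; the helper below is a hypothetical sketch, not the MXM internals, and assumes G is a 0/1 adjacency matrix where G[i, j] = 1 codes the directed edge i → j.

creates_cycle <- function(G, i, j) {
  ## orienting i -> j closes a cycle iff a directed path j ~> i already exists
  visited <- rep(FALSE, nrow(G))
  stack <- j
  while (length(stack) > 0) {
    v <- stack[length(stack)]
    stack <- stack[-length(stack)]
    if (v == i) return(TRUE)          ## reached i: orienting i -> j would cycle
    if (!visited[v]) {
      visited[v] <- TRUE
      stack <- c(stack, which(G[v, ] == 1))   ## push the children of v
    }
  }
  FALSE
}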
The four orientation rules do not include many conflict resolution strategies. Similarly to SPC, we treat the triplets in lexicographical order. The difference is that we do not overwrite the directions, but perform them in a first-come-first-served fashion. We should note that conflict resolution remains an open area of research.

Figure 2: (a) Y is a collider for X and Z and for W and Z, but is falsely considered a collider for X and W. (b) After discovering the mistake, the edges X → Y and W → Y lose their arrows.

Figure 3: Examples where applying Rule 1 would create a new collider: in (a) only the first edge seen is oriented; in (b) both edges are left un-oriented.
To sum up, SPC relies on a modified skeleton phase of the original PC. MPC, on the other hand, leaves the skeleton phase of the original PC unchanged and modifies the application of the orientation rules in two directions: a) preventing cycles and b) preventing the creation of non-existing colliders.
4 Experiments

For the simulation studies, each continuous variable X was generated as a function f(Pa(X)) of its parents plus noise, as follows:

1. Generate samples for each variable in Pa(X) recursively, until samples for each variable are available.

2. Sample the coefficients β of f(Pa(X)) uniformly at random from [−1, −0.1] ∪ [0.1, 1].
In order to generate ordinal variables (for the mixed data scenario), we first generated a continuous variable as previously described and then discretized it into 2-4 categories appropriately (retaining the ordinal scale). Each category contains at least 15% of the observations, while the remaining ones are randomly allocated to all categories. This is identical to having a latent continuous variable (the one generated), but observing its discretized proxy variable with some noise added. Note that, as the discretization is random, any normality of the input continuous variable is not preserved. Finally, ordinal variables in the parent sets are not treated as nominal variables, but simply as continuous ones, and thus only one coefficient is used for them for the purpose of data generation. A sketch of the discretization step follows.
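The helper below is an illustrative sketch of this discretization under the stated 15% constraint; it assigns categories by rank, whereas the actual generator described above also injects noise.

discretize_ordinal <- function(x, k = sample(2:4, 1)) {
  n <- length(x)
  sizes <- rep(floor(0.15 * n), k)   ## at least 15% of observations per category
  left <- n - sum(sizes)             ## remaining observations,
  ## allocated to the categories at random
  sizes <- sizes + tabulate(sample(k, left, replace = TRUE), nbins = k)
  breaks <- c(0, cumsum(sizes))
  ## assign ordered categories by the rank of x, retaining the ordinal scale
  ordered(cut(rank(x, ties.method = "first"), breaks = breaks, labels = FALSE))
}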
After completing the dataset generation, we shuffle the columns of the data matrix (change the order of the variables) and, accordingly, the rows and columns of the adjacency matrix representing the BN, as sketched below. This makes the estimation procedure more difficult, because Rule 0 is applied in lexicographical order, and the order in which we generated the data would otherwise favor the application of this rule.
Many evaluation criteria were employed in order to accurately evaluate the performance of the two algorithms. For the skeleton phase we compared the computational efficiency and the number of tests performed by both algorithms, along with the Hamming distance (HD). The HD between two strings of equal length is the number of positions at which the corresponding symbols are different. In our case, the two strings (one for the estimated skeleton and one for the true skeleton of the BN) are binary, indicating the presence or absence of an edge between each pair of nodes.
The quality of the learned BNs was assessed using the structural Hamming distance (SHD) (Tsamardinos et al., 2006) of the estimated PDAG from the true PDAG. This is defined as the number of operations required to make the estimated graph equal to the true graph. The true PDAG is simply the Markov equivalence graph of the true BN; that is, some edges have been un-oriented, as their direction cannot be statistically decided. The transformation from the DAG to the PDAG is carried out using Chickering's algorithm (Chickering, 1995). The number of times the estimated network is a valid PDAG (no cycles are present) is another crucial measure reported.
4.3 Computational time and number of tests performed during the skeleton
phase
We evaluated the MPC and SPC algorithm in terms of computational time and number of CI
tests performed. For the continuous data case, the generated BNs contained varying numbers of nodes, p = (50, 100, 150, 200, 300, 500, 700, 1000), with 3 and 5 neighbors on average. For
each case we created 30 random BNs, and simulated Gaussian data with various sample sizes,
n = (100, 200, 500, 1000, 2000, 5000). In total, this amounts to 2880 datasets.
As for the categorical data, we simulated datasets with different sample sizes from the INSURANCE network (Binder et al., 1997), which contains only 27 variables (and 52 edges). The SPC
algorithm for continuous data has been implemented in C++, whereas for categorical data it has been implemented in R. On the contrary, our PC algorithm implementation, for both continuous and categorical data, is in C++. Hence, the time comparisons with categorical data are not fair, and this is why we chose to simulate from a BN with few variables and vary the sample size.

Figure 4: Differences in the average HDs between MPC (MXM R package) and SPC (pcalg R package) are presented for a range of sample sizes (panels: 3 and 5 neighbors on average).
SPC performs more tests, thus reporting only the time would not result in a fair comparison. For this reason, for each algorithm we report the total time required divided by the number of tests carried out. Figure 5 presents the ratios of the normalized times required by MPC and SPC with both continuous and categorical data. Overall, we can see that the skeleton phase of MPC is much more computationally efficient than that of SPC.
Figure 5: For each algorithm, MPC and SPC, we have calculated the normalized computational cost required to construct the skeleton, that is, the total time divided by the number of tests executed. The ratio of the normalized times between MPC and SPC appears in these two graphs for BNs with continuous data and (a) 3 and (b) 5 neighbors on average. The bottom graph contains the same information with categorical data obtained from a real BN.
Figure 6: Differences in the average SHDs between MPC and SPC (SHD(MPC) − SHD(SPC)) are presented for a range of sample sizes (panels: 3 and 5 neighbors on average). Negative values are in favor of MPC.

SPC will return an acyclic graph with high probability when the sample size is small. But, as the sample size increases, even with 50 variables, this probability decays.
Figure 7: Mixed data scenario. The average SHD differences between MPC and SPC (SHD(MPC) − SHD(SPC)) are presented for a range of sample sizes (panels: 3 and 5 neighbors on average). Negative values are in favor of MPC.

SPC produces a slightly lower SHD on average, but it returns an acyclic graph only 73.33% of the time. We remind the reader that, in Figure 8, the percentage of times SPC produces an acyclic graph decays as the number of variables increases, and the minimum number of variables we consider there is 50.
Table 1: Summary statistics regarding the SHD and the computational cost (in seconds) of the two algorithms applied to 30 random permutations of the variables of the data generated from the HAILFINDER and ALARM networks.

                 SHD (min, max)            Computational cost (min, max)
                 SPC         MPC           SPC                 MPC
HAILFINDER       (38, 39)    (38, 40)      (206.42, 431.67)    (0.20, 0.27)
ALARM            (57, 57)    (60, 60)      (188.02, 204.84)    (0.18, 0.22)
5 Conclusions

In this paper we showed that the skeleton phase of the original PC algorithm is indeed order independent. Extensive simulation studies showed that the returned skeleton of MPC (which is the same as the skeleton phase of the original PC) and that of SPC have very small differences, which vanish as the sample size increases; yet the former performs half the tests the latter performs and is computationally more efficient. When proper modifications are applied to the orientation rules, a valid PDAG (no cycles) is returned. This essential acyclicity property is not checked by the package pcalg (Kalisch et al., 2012), even though this package is quite popular and has been used by other researchers (Harris and Drton, 2013).
The comparisons between MPC and SPC (Colombo and Maathuis, 2014) showed that the former leads to PDAGs whose SHD is lower when dealing with continuous data. The same held for mixed data (continuous, binary and ordinal), although, due to the increased computational cost required, we did not try high-dimensional settings in that case. Despite the difference in the SHD being smaller there, SPC produces partially oriented graphs which are not acyclic and hence violate a basic and necessary condition of BNs. The BNs examined here contained continuous, categorical and ordinal data, for which the Pearson correlation, the G² test and appropriate regression models, respectively, were used.

Figure 8: Percentage of times there were no cycles in the estimated PDAG when using SPC, for continuous data (top) and mixed data (bottom). The lines correspond to different numbers of variables as the sample size increases. MPC does not appear because it always returns acyclic graphs.
MXM also provides many functionalities to assess the skeleton of the BN. Confidence in the discovered edges can be calculated either theoretically (Triantafillou et al., 2014) or numerically via bootstrap, with a lower limit on the confidence, as proposed by Scutari and Nagarajan (2013). Estimation of the false discovery rate (Tsamardinos and Brown, 2008) and construction of ROC curves are among the utility functions offered.
Our main future research direction is to focus on conflict resolution strategies. This is a key aspect of the MPC rules which can lead to further improvements in the estimated PDAG. More types of data will be handled, making MPC practical and generic. In addition, we plan to examine further the coupling of MMPC (Tsamardinos et al., 2006) with the MPC rules.
Another direction is to substitute the scoring search with rules that are order independent and
produce PDAGs of similar or better quality than MMHC (Tsamardinos et al., 2006).
Appendix
R code used to generate Figure 1.
library(MXM)
library(pcalg)
set.seed(489)
n <- 100 ## sample size
p <- 10 ## number of variables (or nodes)
A <- MXM::rdag2(n, p = p, nei = 4) ## generate data and store the
## true adjacency matrix
id <- c(7, 8, 3, 2, 6, 1, 9, 10, 4, 5)
dat <- A$x[, id] ## re-order the data
g1 <- MXM::pc.skel(dat, method = "pearson", alpha = 0.01) ## skeleton
g1 <- MXM::pc.or(g1)$G ## orientation rules
a1 <- pcalg::pc(suffStat = list(C = cor(dat), n = n), indepTest =
gaussCItest, p = p, alpha = 0.01) ## skeleton and orientation phase
g2 <- a1@graph
g2 <- pcalg::wgtMatrix(g2, transpose = FALSE)
MXM::plotnetwork(2 * A$G)
MXM::plotnetwork(g1)
k1 <- which( g2 == 1 & t(g2) == 0 ) ## edges oriented in one direction
k2 <- which( g2 == 0 & t(g2) == 1 ) ## edges oriented in the other direction
g2[k1] <- 2 ## recode the directed edges into the
g2[k2] <- 3 ## coding plotnetwork expects
colnames(g2) <- rownames(g2) <- colnames(g1)
MXM::plotnetwork(g2)
plot(a1) ## in pcalg’s format
Acknowledgements
I would like to thank Professor Ioannis Tsamardinos for our fruitful conversations, which inspired this paper, Stefanos Fafalios for his constructive comments, and Konstantinos Tsirlis for reading an earlier draft.
The research leading to these results has received funding from the European Research
Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC
Grant Agreement n. 617393.
References
Abellán, J., M. Gómez-Olmedo, S. Moral, et al. (2006). Some Variations on the PC Algorithm.
In Probabilistic Graphical Models, pp. 1–8.
Agresti, A. (2002). Categorical Data Analysis (2nd ed.). Wiley Series in Probability and Statis-
tics. Wiley-Interscience.
Aguilera, P., A. Fernández, R. Fernández, R. Rumí, and A. Salmerón (2011). Bayesian networks
in environmental modelling. Environmental Modelling & Software 26 (12), 1376–1388.
Baba, K., R. Shibata, and M. Sibuya (2004). Partial correlation and conditional correlation as
measures of conditional independence. Australian & New Zealand Journal of Statistics 46 (4),
657–664.
Baumgartner, K., S. Ferrari, and G. Palermo (2008). Constructing Bayesian networks for crim-
inal profiling from limited data. Knowledge-Based Systems 21 (7), 563–572.
Beinlich, I. A., H. J. Suermondt, R. M. Chavez, and G. F. Cooper (1989). The alarm monitoring
system: A case study with two probabilistic inference techniques for belief networks. In AIME
89, pp. 247–256. Springer.
Binder, J., D. Koller, S. Russell, and K. Kanazawa (1997). Adaptive probabilistic networks with
hidden variables. Machine Learning 29 (2), 213–244.
Bucci, G., V. Sandrucci, and E. Vicario (2011). Ontologies and Bayesian networks in medical
diagnosis. In System Sciences (HICSS), 2011 44th Hawaii International Conference on, pp.
1–8. IEEE.
Cano, A., M. Gómez-Olmedo, and S. Moral (2008). A score based ranking of the edges for the
PC algorithm. In Proceedings of the Fourth European Workshop on Probabilistic Graphical
Models, pp. 41–48.
Conati, C., A. Gertner, and K. Vanlehn (2002). Using Bayesian networks to manage uncertainty
in student modeling. User modeling and user-adapted interaction 12 (4), 371–417.
Cooper, G. F. and E. Herskovits (1992). A Bayesian method for the induction of probabilistic
networks from data. Machine learning 9 (4), 309–347.
Cowell, R. G., R. J. Verrall, and Y. Yoon (2007). Modeling operational risk with Bayesian
networks. Journal of Risk and Insurance 74 (4), 795–827.
Demidenko, E. (2013). Mixed models: theory and applications with R. John Wiley & Sons.
Friedman, N., M. Linial, I. Nachman, and D. Pe’er (2000). Using Bayesian networks to analyze
expression data. Journal of computational biology 7 (3-4), 601–620.
Glymour, C. N. (2001). The mind’s arrows: Bayes nets and graphical causal models in psychol-
ogy. MIT press.
Harris, N. and M. Drton (2013). PC algorithm for nonparanormal graphical models. Journal of
Machine Learning Research 14 (1), 3365–3383.
Heckerman, D. (1997). Bayesian networks for data mining. Data mining and knowledge discov-
ery 1 (1), 79–119.
Heckerman, D., D. Geiger, and D. M. Chickering (1995). Learning Bayesian networks: The
combination of knowledge and statistical data. Machine learning 20 (3), 197–243.
Isci, S., H. Dogan, C. Ozturk, and H. H. Otu (2013). Bayesian network prior: network analysis
of biological data using external knowledge. Bioinformatics 30 (6), 860–867.
Isozaki, T. (2014). A robust causal discovery algorithm against faithfulness violation. Informa-
tion and Media Technologies 9 (1), 121–131.
Jensen, A. L. and F. V. Jensen (1996). MIDAS-an influence diagram for management of mildew
in winter wheat. In Proceedings of the Twelfth international conference on Uncertainty in
artificial intelligence, pp. 349–356. Morgan Kaufmann Publishers Inc.
Kalisch, M., M. Mächler, D. Colombo, M. H. Maathuis, P. Bühlmann, et al. (2012). Causal infer-
ence using graphical models with the R package pcalg. Journal of Statistical Software 47 (11).
Lagani, V., G. Athineou, A. Farcomeni, M. Tsagris, and I. Tsamardinos (2017). Feature Selection
with the R Package MXM: Discovering Statistically-Equivalent Feature Subsets. Journal of
Statistical Software 80 (7).
R Core Team (2016). R: A Language and Environment for Statistical Computing. Vienna,
Austria: R Foundation for Statistical Computing.
Scutari, M. (2010). Learning Bayesian Networks with the bnlearn R Package. Journal of Sta-
tistical Software 35 (3).
Scutari, M. and R. Nagarajan (2013). Identifying significant edges in graphical models of molec-
ular networks. Artificial intelligence in medicine 57 (3), 207–217.
Spirtes, P. and C. Glymour (1991). An algorithm for fast recovery of sparse causal graphs. Social
science computer review 9 (1), 62–72.
Spirtes, P., C. N. Glymour, and R. Scheines (2000). Causation, Prediction, and Search. MIT
press.
Suchánek, P., F. Marecki, and R. Bucki (2014). Self-learning Bayesian networks in diagnosis.
Procedia Computer Science 35, 1426–1435.
Szekely, G. J., M. L. Rizzo, et al. (2014). Partial distance correlation with methods for dissimi-
larities. The Annals of Statistics 42 (6), 2382–2412.
Székely, G. J., M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by
correlation of distances. The annals of statistics, 2769–2794.
Tsamardinos, I. and L. E. Brown (2008). Bounding the False Discovery Rate in Local Bayesian
Network Learning. In AAAI, pp. 1100–1105.
Tsamardinos, I., L. E. Brown, and C. F. Aliferis (2006). The max-min hill-climbing Bayesian
network structure learning algorithm. Machine learning 65 (1), 31–78.
Verma, T. and J. Pearl (1991). Equivalence and synthesis of causal models. In Proceedings of
the Sixth Conference on Uncertainty in Artificial Intelligence, pp. 220–227.
Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression.
The Annals of Statistics 15 (2), 642–656.
Zagorecki, A., P. Orzechowski, and K. Holownia (2013). A system for automated general medical
diagnosis using Bayesian networks. MedInfo 192, 461–465.