Bnstruct: An R Package For Bayesian Network Structure Learning With Missing Data
1 Introduction
Bayesian Networks (Pearl [8]) are a powerful tool for probabilistic inference
among a set of variables, modeled using a directed acyclic graph. However,
one often does not have the network, but only a set of observations, and wants
to reconstruct the network that generated the data. The bnstruct package
provides objects and methods for learning the structure and parameters of the
network in various situations, such as in presence of missing data, for which it is
possible to perform imputation (guessing the missing values, by looking at the
data), or the modeling of evolving systems using Dynamic Bayesian Networks.
The package also contains methods for learning using the Bootstrap technique.
Finally, bnstruct has a set of additional tools for using Bayesian Networks, such
as methods to perform belief propagation.
In particular, the absence of some observations in the dataset is a very common
situation in real-life applications such as biology or medicine, yet very little
software is devoted to addressing this problem. bnstruct was developed
mainly with the purpose of filling this void.
This document is intended to show some examples of how bnstruct can be
used to learn and use Bayesian Networks. First we describe how to manage
data sets, how to use them to discover a Bayesian Network, and finally how
to perform some operations on a network. Complete reference for classes and
methods can be found in the package documentation.
1.1 Overview
We provide here some general information needed to understand and properly
use this document. A more thorough introduction to the topic
can be found for example in Koller and Friedman [6].
A variable represents a feature, an event or an entity considered significant and therefore measured.
In a Bayesian Network, each variable is associated with a node. The number
of variables is the size of the network. Each variable has a certain range of
values it can take. If the variable can take any possible value in its range,
it is called a continuous variable; otherwise, if a variable can only take some
values in its range, it is a discrete variable. The number of values a variable
can take is called its cardinality. A continuous variable has infinite cardinality;
in order to deal with it, we restrict its domain to a smaller set of values, so
that it can be treated as a discrete variable. This process is called quantization,
and the number of values the variable can take afterwards is called its number
of levels. With a slight abuse of terminology, we will therefore call the number
of quantization levels of a continuous variable its cardinality.
In many real-life applications and contexts, it often happens that some ob-
servations in the dataset we are studying are absent, for many reasons. In this
case, one may want to “guess” a reasonable value (consistent with the other
observations) that would have been present in the dataset, had the observation
been successful. This inference task is called imputation. The dataset with
the “holes filled” is called the imputed dataset, while the original dataset with
missing values is referred to as the raw dataset. In section 3.2 we show how to
perform imputation in bnstruct.
Another common operation on data is the employment of resampling techniques
in order to estimate properties of a distribution from an approximation of it.
This usually allows one to have more confidence in the results. We implement
the bootstrap technique (Efron and Tibshirani [2]) to generate samples of a
dataset, which can be used on both raw and imputed data.
Structure learning, i.e. discovering from data the DAG that best explains the
observations, is a computationally hard task; exact methods are applicable
only for networks with no more than 20-30 nodes. For larger networks, several
heuristic strategies exist.
The subsequent problem of parameter learning, instead, aims to discover the
conditional probabilities that relate the variables, given the dataset of observa-
tions and the structure of the network.
In addition to structure learning, sometimes it is of interest to estimate a
level of confidence in the presence of an edge in the network. This is what
happens when we apply bootstrap to the problem of structure learning. The
result is not a DAG, but a different entity that we call a weighted partially
directed acyclic graph (WPDAG): an adjacency matrix whose cell of indices
(i, j) contains the number of times that an edge going from node i to node j
appears in the networks obtained from the bootstrap samples.
As the graph obtained when performing structure learning with bootstrap
represents a measure of the confidence on the presence of each edge in the
original network, and not a binary response on the presence of the edge, the
graph is likely to contain undirected edges or cycles.
As the structure learnt is not a DAG but a measure of confidence, it cannot
be used to learn conditional probabilities. Therefore, parameter learning is not
defined in case of network learning with bootstrap.
One of the most relevant operations that can be performed with a Bayesian
Network is inference. Inference is the operation that, given a
set of observed variables, computes the probabilities of the remaining variables
updated according to the new knowledge. Inference answers questions like “How
does the probability for variable Y change, given that variable X is taking value
x0?”.
2 Installation
The latest stable version of the package is available on CRAN https://cran.
r-project.org, and can therefore be installed from an R session using
> install.packages("bnstruct")
3 Data sets
The class that bnstruct provides to manage datasets is BNDataset. It contains
all of the data and the information related to it: raw and imputed data, raw
and imputed bootstrap samples, and variable names and cardinality.
Data can be read from files, which have to be in the format described
below. The actual data has to be in (a text
file containing data in) tabular format, one tuple per row, with the values for
each variable separated by a space or a tab. Values for each variable have to
be numbers, starting from 1 in case of discrete variables. Data files can have a
first row containing the names of the corresponding variables.
In addition to the data file, a header file containing additional information
can also be provided. A header file has to be composed of three rows of
tab-delimited values:
1. a list of the names of the variables, in the same order as in the data file;
2. a list of integers representing the cardinality of the variables, in case of
discrete variables, or the number of levels each variable has to be quantized
in, in case of continuous variables;
3. a list that indicates, for each variable, if the variable is continuous (c or
C), and thus has to be quantized before learning, or discrete (d or D).
If more advanced options are needed when reading a dataset from files,
please refer to the documentation of the read.dataset method. Imputation
and bootstrap are also available as separate routines (impute and bootstrap,
respectively).
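For instance, a dataset can be loaded from a pair of files with the file-based constructor shown later in this document (the file names here are illustrative):

```r
> # load a dataset from a data file and its header file
> dataset <- BNDataset("mydata.data", "mydata.header")
```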
We provide two sample datasets, one with complete data (the Asia network,
Lauritzen and Spiegelhalter [7]) and one with missing values (the Child network,
Spiegelhalter, Dawid, Lauritzen, and Cowell [10]), in the extdata subfolder; the
user can refer to them as an example. The two datasets can be loaded with
the asia() and child() functions, respectively.
It is assumed that the variables in the dataset are reported in the following
order: V1_t1, V2_t1, ..., VN_t1, V1_t2, V2_t2, ..., VN_tk.
3.2 Imputation
A dataset may contain various kinds of missing data, namely unobserved vari-
ables, and unobserved values for otherwise observed variables. We currently
deal only with this second kind of missing data. The process of guessing the
missing values is called imputation.
We provide the impute function to perform imputation.
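A minimal sketch of its use (the imputed data is stored inside the returned BNDataset):

```r
> # fill in the missing values of the raw data
> dataset <- impute(dataset)
```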
3.3 Bootstrap
BNDataset objects also have room for bootstrap samples (Efron and Tibshirani
[2]), i.e. random samples with replacement of the original data with the same
number of observations, both for raw and imputed data. Samples for imputed
data are generated by imputing the corresponding sample of raw data. There-
fore, requesting imputed samples will also generate the raw samples.
We provide the bootstrap method for this.
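A sketch of its use, assuming the method accepts the number of samples and a flag requesting imputation of the samples (parameter names as in the package documentation):

```r
> # generate 100 bootstrap samples, both raw and imputed
> dataset <- bootstrap(dataset, num.boots = 100, imputation = TRUE)
```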
The complete list of methods available for BNDataset objects can be found
in the package documentation; we are not going to cover all of the methods
in this brief series of examples.
For example, one may want to see the dataset.
The show() method is an alias for the print() method, and allows one to print
the state of an object simply by typing its name in an R session.
The main operation that can be done with a BNDataset is to get the data
it contains. The main methods we provide are raw.data and imputed.data,
which provide the raw and the imputed data, respectively. The data must be
present in the object; otherwise, an error will be raised. To avoid an abrupt
termination of the execution in case of error, one may run these methods in
a tryCatch() construct and manage the errors in case they happen. Another
alternative is to test the presence of data before attempting to retrieve it, using
the tester methods has.raw.data and has.imputed.data.
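For instance, data can be retrieved defensively as follows:

```r
> # test before accessing, or catch the error on access
> if (has.raw.data(dataset)) {
+   raw <- raw.data(dataset)
+ }
> imp <- tryCatch(imputed.data(dataset),
+                 error = function(e) NULL)
```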
(console output omitted: the raw data contains NA values and the imputed
data is NULL before imputation is performed; after imputation, the previously
missing values are filled in)
Complete cases of the raw dataset, that is, rows that have no missing data,
can be selected with the complete method. Please note that this method re-
turns a new copy of the original BNDataset with no imputed data or bootstrap
samples (if previously generated), as it is not possible to ensure consistency
among the data. It is possible to restrict the completeness requirement only to a
subset of variables.
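A minimal sketch (restricting the requirement to a subset of variables is done through an additional argument; see the inline documentation for its name and format):

```r
> # new BNDataset containing only the rows with no missing values
> complete.dataset <- complete(dataset)
```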
By default, learning operations on the raw dataset operate with available
cases.
In order to retrieve bootstrap samples, one can use the boots and imp.boots
methods for samples made of raw and imputed data. The presence of raw and
imputed samples can be tested using has.boots and has.imputed.boots. Try-
ing to access a non-existent sample (e.g. imputed sample when no imputation
has been performed, or sample index out of range) will raise an error. The
method num.boots returns the number of samples.
We also provide the boot method to directly access a single sample.
> # get imputed samples
> for (i in 1:num.boots(dataset))
+ print( boot(dataset, i, use.imputed.data = TRUE) )
4 Bayesian Networks
Bayesian Networks are represented using the BN class. A BN object contains
information regarding the variables in the network, the directed acyclic graph
(DAG) representing the structure of the network, the conditional probability
tables entailed by the network, and the weighted partially directed acyclic
graph (WPDAG) representing the structure as learnt using bootstrap samples.
The following code will create a BN object for the Child network, with no
structure nor parameters.
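A minimal sketch, assuming dataset is a BNDataset containing the Child data:

```r
> # empty network initialized from the variables of the dataset
> net <- BN(dataset)
```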
Then we can fill in the fields of net by hand. See the inline help for more
details.
The method of choice to create a BN object is, however, to create it from a
BNDataset using the learn.network method.
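With default parameters this is simply:

```r
> # structure and parameter learning with default settings
> net <- learn.network(dataset)
```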
The learn.network method returns a new BN object, with a new DAG (or
WPDAG, if the structure learning has been performed using bootstrap – more
on this later).
Here we briefly describe the two tasks performed by the method, along with
the main options.
4.1.1 Structure learning
We provide five algorithms to learn the structure of the network, that can be
chosen with the algo parameter. The first is the Silander-Myllymäki (sm) exact
search-and-score algorithm (see Silander and Myllymaki [9]), that performs a
complete evaluation of the search space in order to discover the best network;
this algorithm may take a very long time, and can be inapplicable when discov-
ering networks with more than 25–30 nodes. Even for small networks, users are
strongly encouraged to provide meaningful parameters such as the layering of
the nodes, or the maximum number of parents – refer to the documentation in
the package manual for more details on the method parameters.
The second algorithm is the Max-Min Parents-and-Children (mmpc, see
Tsamardinos, Brown, and Aliferis [11]), a constraint-based heuristic approach
that discovers the skeleton of the network, that is, the set of edges connecting
the variables without discovering their directionality.
Another heuristic algorithm included is a Hill Climbing method (hc, see
Tsamardinos, Brown, and Aliferis [11]).
The fourth algorithm (and the default one) is the Max-Min Hill-Climbing
heuristic (mmhc, Tsamardinos, Brown, and Aliferis [11]), based on the combina-
tion of the previous two algorithms, that performs a statistical sieving of the
search space followed by a greedy evaluation.
The heuristic algorithms provided are considerably faster than the complete
method, at the cost of a (likely) lower quality. Also note that in the case of
a very dense network and lots of observations, the statistical evaluation of the
search space may take a long time.
The last method is the Structural Expectation-Maximization (sem) algo-
rithm (Friedman [3, 4]), for learning a network from a dataset with missing
values. It iterates a sequence of Expectation-Maximization (in order to “fill
in” the holes in the dataset) and structure learning from the guessed dataset,
until convergence. The structure learning used inside SEM, due to compu-
tational reasons, is MMHC. Convergence of SEM can be controlled with the
parameters struct.threshold, param.threshold, max.sem.iterations and
max.em.iterations, for the structure and the parameter convergence and the
maximum number of iterations of SEM and EM, respectively.
Search-and-score methods also need a scoring function to compute an esti-
mated measure of each configuration of nodes. We provide three of the most pop-
ular scoring functions, BDeu (Bayesian-Dirichlet equivalent uniform, default),
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
The scoring function can be chosen using the scoring.func parameter.
> net <- learn.network(dataset,
+ algo = "mmhc",
+ scoring.func = "BDeu",
+ use.imputed.data = TRUE)
Structure learning can also be performed using bootstrap samples, in order
to get a weighted confidence on the presence or absence of an edge in the
structure (Friedman, Goldszmidt, and Wyner [5]). This can be done
by providing the constructor or the learn.network method a BNDataset with
bootstrap samples, and the additional parameter bootstrap = TRUE.
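A sketch of this call, assuming dataset already contains bootstrap samples generated as in Section 3.3:

```r
> # learn a WPDAG instead of a DAG from the bootstrap samples
> net <- learn.network(dataset, bootstrap = TRUE)
```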
> # 8 variables observed over 4 time steps: the dataset contains 32 columns
> dataset <- BNDataset("evolving_system.data",
+ "evolving_system.header",
+ num.time.steps = 4)
> dbn <- learn.dynamic.network(dataset, num.time.steps = 4)
If the BNDataset already contains the information about the number of time
steps, it is also possible to use learn.network, which will recognize the evolving
system and treat the network as a dynamic one.
It is also possible to exploit knowledge about the layering in each time slot by
specifying the layering for just N variables; such layering will be unfolded over
the T time steps. However, in this case, any variable v_x in time slot i will still
be considered a potential parent for a variable v_y in a later time slot j > i,
even if such a relationship is forbidden within a single time slot. The same
applies to layer.struct.
Figure 1: SHD vs. running time of i) available case analysis with MMHC,
ii) kNN imputation followed by MMHC and iii) SEM, with both the BDeu
and the BIC scoring functions, on 20 datasets with 1000 observations and 20%
missingness sampled from the 20-node Child network (25 edges in the original
network).
Figure 2: SHD vs. running time of i) available case analysis with MMHC,
ii) kNN imputation followed by MMHC and iii) SEM, with both the BDeu
and the BIC scoring functions, on 20 datasets with 1000 observations and 20%
missingness sampled from the 37-node Alarm network (46 edges in the original
network).
Figure 3: SHD vs. running time of i) available case analysis with MMHC, ii)
kNN imputation followed by MMHC and iii) SEM, with both the BDeu and
the BIC scoring functions, on 20 datasets with 1000 observations and 20% miss-
ingness sampled from the 70-node Hepar2 network (123 edges in the original
network).
Figure 4: SHD with respect to the original network (vertical axis) vs. time
needed to learn the network (in seconds, on the horizontal axis), for MMHC
with the BDeu and BIC scoring functions on datasets with 100, 1000 and
10000 observations sampled from the 223-node Andes network.
For all three networks the overall best results are obtained using MMHC
with BDeu after dataset imputation. MMHC converges in a few seconds even
for the large Hepar2 network; conversely, SEM takes from a few minutes up to
a few hours, often with worse results. SEM paired with BIC failed to converge
within 24 hours for all the 20 datasets.
We also report the results obtained by the MMHC algorithm over a larger
network, Andes (223 nodes), using BDeu and BIC and three sets of datasets
with 100, 1000 and 10000 observations each. In Figure 4 we show the results
in terms of SHD with respect to the original network, and in terms of conver-
gence time. In this case there is no significant difference in terms of time
between the two scoring functions. The observed trend, with learning from
the datasets with 1000 observations being the slowest, is a bit surprising. It
can be explained by the fact that with fewer data the statistical pruning is
less effective, but the computation of the score for each candidate parent set
is faster; for larger datasets, computing the score for one candidate parent set
is more expensive, but there are fewer candidates to evaluate, as MMPC is
more effective. With 1000 observations we observe an unlucky combination of
relatively high computing time over a larger set of candidates.
In terms of solution quality, the results clearly improve as the size of the
dataset grows. For 10000 observations BDeu finds slightly better networks
than BIC, in terms of similarity with the original one, while for smaller
datasets BIC is significantly more robust. The main difference, however, is
given by the number of observations available: with 100 observations, effec-
tively a p > n case, the quality of the discovered networks is very poor; it
improves as the size of the dataset grows, reaching up to 75% correctness with
10000 observations.
5 Using a network
Once a network is created, it can be used. Here we briefly mention some of
the basic methods provided in order to manipulate a network and access its
components.
First of all, it is surely of interest to obtain the structure of a network. The
bnstruct package provides the dag() and wpdag() methods in order to access
the structure of a network learnt without and with bootstrap (respectively).
> dag(net)
> wpdag(net)
Then we may want to retrieve the parameters, using the cpts() method.
> cpts(net)
When a network has both a DAG and a WPDAG, in order to specify the latter
as the structure to be plotted, the plot.wpdag logical parameter
is provided. As usual, more details are available in the inline documentation of
the method.
> # TFAE
> print(net)
> show(net)
> net
> write.xgmml(net)
5.1 Inference
Inference is performed in bnstruct using an InferenceEngine object. An
InferenceEngine is created directly from a network.
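Observations can be expressed as a list of observed variables and the corresponding observed values; a hypothetical example (the variable names and values here are for illustration only):

```r
> # list of observed variables and their observed values
> obs <- list("observed.vars" = c("Age", "Disease"),
+             "observed.vals" = c(2, 3))
```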
>
> # the following are equivalent:
> engine <- InferenceEngine(net, obs)
>
> # and
> engine <- InferenceEngine(net)
> observations(engine) <- obs
The InferenceEngine class provides methods for belief propagation, that is,
updating the conditional probabilities according to observed values, and for the
Expectation-Maximization (EM) algorithm ([1]), which learns the parameters
of a network from a dataset with missing values trying at the same time to guess
the missing values.
Belief propagation can be done using the belief.propagation method. It
takes an InferenceEngine and an optional list of observations. If no obser-
vations are provided, the engine will use the ones it already contains. The
belief.propagation method returns an InferenceEngine whose updated network
can be retrieved using the updated.bn accessor.
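A minimal sketch, with engine and obs as above:

```r
> # propagate the observations and retrieve the updated network
> engine <- belief.propagation(engine, obs)
> new.net <- updated.bn(engine)
```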
6.1 Basic Learning
First we show how some different learning setups perform on the Child dataset.
We compare the default mmhc-BDeu pair on available case analysis (raw data
with missing values) and on imputed data, and the sem-BDeu pair.
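A possible setup for this comparison, assuming dataset is the BNDataset for the Child data:

```r
> # available case analysis on the raw data
> net.avail <- learn.network(dataset, algo = "mmhc", scoring.func = "BDeu")
> # imputation followed by learning on the imputed data
> dataset <- impute(dataset)
> net.imp <- learn.network(dataset, algo = "mmhc", scoring.func = "BDeu",
+                          use.imputed.data = TRUE)
```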
(plot of the learnt Child network omitted)
> # SEM, BDeu using previous network as starting point
> net <- learn.network(dataset, algo = "sem",
+ scoring.func = "BDeu",
+ initial.network = net,
+ struct.threshold = 10,
+ param.threshold = 0.001)
> plot(net)
(plot of the network learnt with SEM omitted)
6.2 Learning with bootstrap
The second example is about learning with bootstrap. This time we use the
Asia dataset.
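A possible workflow, assuming dataset is the BNDataset for the Asia data (the number of bootstrap samples is illustrative):

```r
> # generate bootstrap samples, then learn a WPDAG from them
> dataset <- bootstrap(dataset, num.boots = 100)
> net <- learn.network(dataset, bootstrap = TRUE)
> plot(net, plot.wpdag = TRUE)
```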
(plot of the WPDAG learnt from the Asia dataset omitted)
In this example we learn a Naive Bayes classifier: the first layer contains the
target variable spam, while the second layer contains the variables associated
with the words. We then specify the presence or absence of edges in each layer,
and among the layers. Edges from lower layers to upper layers are forbidden.
For this we need to define a binary square matrix with as many rows and
columns as the number of layers, so in our example we need a 2x2 matrix. Each
entry mi,j of the matrix contains 0 if no edges can go from variables in layer i
to variables in layer j, and 1 if the presence of such edges is allowed; the matrix
should be upper triangular, and it will be transformed as such if it is not.
As our Naive Bayes network has edges only from the target variable to the
word variables, and not between variables in layer 2, we want edges only in m1,2 ,
so we set that cell to 1 and all the others to 0.
> # artificial dataset generation
> spam <- sample(c(0,1), 1000, prob=c(0.5, 0.5), replace=T)
> buy <- sapply(spam, function(x) {
+ if (x == 0) {
+ sample(c(0,1),1,prob=c(0.8,0.2),replace=T)
+ } else {
+ sample(c(0,1),1,prob=c(0.2,0.8))}
+ })
> med <- sapply(spam, function(x) {
+ if (x == 0) {
+ sample(c(0,1),1,prob=c(0.95,0.05),replace=T)
+ } else {
+ sample(c(0,1),1,prob=c(0.05,0.95))}
+ })
> bns <- sapply(spam, function(x) {
+ if (x == 0) {
+ sample(c(0,1),1,prob=c(0.01,0.99),replace=T)
+ } else {
+ sample(c(0,1),1,prob=c(0.01,0.99))}
+ })
> lea <- sapply(spam, function(x) {
+ if (x == 0) {
+ sample(c(0,1),1,prob=c(0.05,0.95),replace=T)
+ } else {
+ sample(c(0,1),1,prob=c(0.95,0.05))}
+ })
> d <- as.matrix(cbind(spam,buy,med,bns,lea))
> colnames(d) <- c("spam","buy","med","bnstruct","learn")
> library(bnstruct)
> spamdataset <- BNDataset(d, c(T,T,T,T,T),
+ c("spam","buy","med","bnstruct","learn"),
+ c(2,2,2,2,2), starts.from=0)
> n <- learn.network(spamdataset,
+ algo="mmhc",
+ layering=c(1,2,2,2,2),
+ layer.struct=matrix(c(0,0,1,0),
+ nrow=2, ncol=2))
> plot(n)
References
[1] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likeli-
hood from incomplete data via the EM algorithm. Journal of the Royal
Statistical Society, Series B (Methodological), pages 1–38, 1977.
[2] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap.
CRC press, 1994.
[3] Nir Friedman. Learning belief networks in the presence of missing values
and hidden variables. In ICML, volume 97, pages 125–133, 1997.
[4] Nir Friedman. The Bayesian structural EM algorithm. In Proceedings of
the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages
129–138. Morgan Kaufmann Publishers Inc., 1998.
[5] Nir Friedman, Moises Goldszmidt, and Abraham Wyner. Data analysis
with Bayesian networks: A bootstrap approach. In Proceedings of the Fif-
teenth Conference on Uncertainty in Artificial Intelligence, pages 196–205.
Morgan Kaufmann Publishers Inc., 1999.
[6] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles
and techniques. MIT press, 2009.
[7] Steffen L Lauritzen and David J Spiegelhalter. Local computations with
probabilities on graphical structures and their application to expert sys-
tems. Journal of the Royal Statistical Society. Series B (Methodological),
pages 157–224, 1988.
[8] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of
plausible inference. Morgan Kaufmann, 1988.
[9] Tomi Silander and Petri Myllymäki. A simple approach for finding the glob-
ally optimal Bayesian network structure. arXiv preprint arXiv:1206.6875,
2012.
[10] David J Spiegelhalter, A Philip Dawid, Steffen L Lauritzen, and Robert G
Cowell. Bayesian analysis in expert systems. Statistical Science, pages
219–247, 1993.
[11] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-
min hill-climbing Bayesian network structure learning algorithm. Machine
Learning, 65(1):31–78, 2006.