A Comparative Study of The Time-Series Data For Inference of Gene Regulatory Networks Using B-Spline
Abstract: In this paper, we quantitatively analyze how time-series gene expression data affect the inference of gene regulatory networks. Time-series gene data are modeled with the B-Spline algorithm to obtain smoother expression curves, which can further reduce over-fitting. The effect of different sizes of observed time-series data on gene regulatory network inference is analyzed. The stochastic errors introduced into the system by the B-Spline algorithm are evaluated. The precision of parameter estimation is compared across different sizes of time-series data. When the B-Spline is applied to generate continuous curves, simulation results become much more accurate and inference results improve significantly. Both synthetic data and experimental data from microarray measurements are used to demonstrate the effectiveness of the proposed method.
I. INTRODUCTION
GENETIC regulatory networks (GRNs) are collections of DNA segments in a cell which interact with each other and with other substances in the cell, thereby governing gene transcription. With the recent development of high-throughput DNA microarray technology, it has become possible to discover GRNs, which are complex and nonlinear in nature. Specifically, the increasing availability of microarray time-series data makes it possible to characterize dynamic nonlinear regulatory interactions among genes. The modeling, analysis and control of GRNs are critical for finding medicine for gene-related diseases.
One issue in the inference of GRNs is that not enough time-series experimental data are available [1]. This makes many inference processes unrealizable, and many models are restricted by the limited data. The B-Spline algorithm can reconstruct unobserved gene expression data or recover missing data. Compared to the B-Spline algorithm, simple algorithms such as linear interpolation can lead to poor estimates of the missing time-series data, especially when the data are not uniformly sampled [2].
In this paper, the ordinary differential equation (ODE) model using the S-system is adopted to infer GRNs [3]. Different sizes of time-series data are analyzed. The B-Spline algorithm is introduced to process the raw data. The effects of the time-series data on the inference of gene regulatory networks are analyzed using both a synthetic model and microarray experimental expression data.
Haixin Wang and James E. Glover are with the Department of Mathematics and Computer Science, Fort Valley State University, Fort Valley,
Georgia, USA (email: [email protected]).
Lijun Qian is with the Department of Electrical and Computer Engineering, Prairie View A&M University, Prairie View, Texas, USA (email:
[email protected]).
II. INTRODUCTION TO B-SPLINE ALGORITHM
$$\sum_{i=0}^{N-m-1} \cdots$$
where $h = 1, \ldots, k$; $i = 0, \ldots, N-m-1$;
$$\alpha_{h,i} = \frac{t - P_i}{p_{i+k+1-h} - P_i}.$$
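$$S_i(t) = \begin{bmatrix} u^3 & u^2 & u & 1 \end{bmatrix} \frac{1}{6} \begin{bmatrix} -1 & 3 & -3 & 1 \\ 3 & -6 & 3 & 0 \\ -3 & 0 & 3 & 0 \\ 1 & 4 & 1 & 0 \end{bmatrix} \begin{bmatrix} p_{i-1} \\ p_i \\ p_{i+1} \\ p_{i+2} \end{bmatrix} \tag{1}$$
where $u \in [0, 1]$.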
The general process of inserting time-series data via the B-spline algorithm is shown in Algorithm 1.
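As a rough illustration of how the matrix form in Eq. (1) can be used to insert extra points into a time series, the following Python sketch evaluates one uniform cubic B-spline segment and resamples a short series. It is not the paper's Algorithm 1. The function names (`blend_weights`, `segment_value`, `densify`) are hypothetical, and the sketch assumes that the observed expression values are used directly as the control points $p_i$, so the resulting curve smooths the samples rather than passing through them.

```python
# Basis matrix of the uniform cubic B-spline in Eq. (1); the 1/6 factor
# is applied inside blend_weights.
BASIS = [
    [-1.0, 3.0, -3.0, 1.0],
    [3.0, -6.0, 3.0, 0.0],
    [-3.0, 0.0, 3.0, 0.0],
    [1.0, 4.0, 1.0, 0.0],
]


def blend_weights(u):
    """Return the four weights applied to p[i-1], p[i], p[i+1], p[i+2]."""
    powers = [u ** 3, u ** 2, u, 1.0]
    return [
        sum(powers[r] * BASIS[r][c] for r in range(4)) / 6.0
        for c in range(4)
    ]


def segment_value(points, i, u):
    """Evaluate S_i at local parameter u in [0, 1]."""
    weights = blend_weights(u)
    window = points[i - 1:i + 3]  # p[i-1], p[i], p[i+1], p[i+2]
    return sum(w * p for w, p in zip(weights, window))


def densify(points, per_segment):
    """Sample every complete segment at per_segment evenly spaced u values.

    Assumption: the samples are equally spaced in time, as the uniform
    basis matrix requires.
    """
    curve = []
    for i in range(1, len(points) - 2):
        for k in range(per_segment):
            curve.append(segment_value(points, i, k / per_segment))
    return curve
```

The four weights sum to one for every `u`, so each inserted value is a weighted average of four neighbouring samples; this is what gives the smoothing effect, at the cost of not reproducing the original samples exactly.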
[Fig. 1 flowchart blocks: Microarray Experiments; Apply B-Spline Algorithm to Gene Expression Data; Genetic Programming; Kalman Filter]
REGULATORY NETWORKS
In general, modeling gene regulatory networks is a nonlinear identification problem. Assuming there are $N$ genes of interest and $x_i$ denotes the state (such as the microarray reading) of the $i$th gene, the dynamics of the GRN may be modeled as
$$\frac{dx_i}{dt} = f_i(x_1, x_2, \ldots, x_N), \quad i = 1, 2, \ldots, N \tag{2}$$
where the nonlinear functions $f_i$ need to be determined from time-series microarray measurements. In this study, we assume that the functions $(f_i, \forall i)$ are of the form [7]:
$$f_i = \sum_{j=1}^{L_i} \left[ w_{ij}\, \Omega_{ij}(x_1, x_2, \ldots, x_N) \right], \quad i = 1, 2, \ldots, N \tag{3}$$
where $L_i$ is the number of terms in $f_i$, $w_{ij}$ are the parameters to be estimated, and $\Omega_{ij}(x_1, x_2, \ldots, x_N)$ is the $j$th component of the nonlinear function $f_i$. A two-step nested optimization procedure is proposed to identify the nonlinear differential equation for each individual gene. In each iteration, genetic programming (GP) is applied to determine the nonlinear parameters (global optimization), and the corresponding parameters associated with each term are then estimated by Kalman filtering (local optimization). Decomposing the problem into a structural part solved by GP and a parameter optimization part solved by Kalman filtering reduces the complexity significantly and speeds up convergence. In this paper, the S-system is adopted for the function $f_i$. The S-system model is given by [3]:
$$\frac{dx_i}{dt} = \alpha_i \prod_{j=1}^{N} x_j^{g_{i,j}} - \beta_i \prod_{j=1}^{N} x_j^{h_{i,j}}, \quad (i = 1, \ldots, N) \tag{4}$$
[Fig. 1 flowchart blocks: Data Structure; Model Parameters; Entire Model; Evaluation by Fitness Function? (Yes/No)]
Fig. 1. The flowchart for GRN identification using GP, Kalman Filtering and B-Spline
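To make the S-system form of Eq. (4) concrete, the sketch below evaluates its right-hand side and integrates it numerically. It is only an illustration under stated assumptions: the paper does not say which integrator is used, so the classical fourth-order Runge-Kutta scheme here is a stand-in, and the names `s_system_rate` and `simulate` are hypothetical. The rate constants are passed as `alpha` and `beta`, and the kinetic orders as the nested lists `g` and `h`.

```python
def s_system_rate(x, alpha, beta, g, h):
    """Right-hand side of Eq. (4): production term minus degradation term.

    States are assumed to stay positive, since the kinetic orders in g
    and h may be fractional or negative.
    """
    n = len(x)
    rates = []
    for i in range(n):
        production = alpha[i]
        degradation = beta[i]
        for j in range(n):
            production *= x[j] ** g[i][j]
            degradation *= x[j] ** h[i][j]
        rates.append(production - degradation)
    return rates


def simulate(x0, alpha, beta, g, h, dt, steps):
    """Integrate the S-system with fixed-step classical Runge-Kutta (RK4).

    Returns the list of states, starting with the initial state x0.
    """
    def f(x):
        return s_system_rate(x, alpha, beta, g, h)

    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        k1 = f(x)
        k2 = f([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
        k3 = f([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
        k4 = f([xi + dt * k for xi, k in zip(x, k3)])
        x = [
            xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
        ]
        trajectory.append(x)
    return trajectory
```

For instance, a single gene with `alpha=[2.0]`, `beta=[1.0]`, `g=[[0.0]]` and `h=[[1.0]]` reduces Eq. (4) to $dx/dt = 2 - x$, whose trajectory settles at $x = 2$; these values are made up for illustration and are not taken from the paper.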
[Plots: MSE Error and Average CoD, each comparing BSpline and Linear]
Fig. 2.
Fig. 3.
CoD is given by
$$SS_{err} = \sum_{i=0}^{N} (y_i - f_i)^2$$
$$SS_{tot} = \sum_{i=0}^{N} (y_i - \bar{y})^2$$
$$\mathrm{CoD} = 1 - \frac{SS_{err}}{SS_{tot}}$$
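The CoD definition above translates directly into a few lines of Python. The sketch below is a minimal version, assuming `observed` holds the values $y_i$ and `predicted` holds the model outputs $f_i$; the function name is hypothetical.

```python
def coefficient_of_determination(observed, predicted):
    """Return CoD = 1 - SS_err / SS_tot for paired observations and predictions."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_y = sum(observed) / len(observed)
    ss_err = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    if ss_tot == 0.0:
        # All observations are identical, so the ratio is undefined.
        raise ValueError("CoD is undefined for constant observations")
    return 1.0 - ss_err / ss_tot
```

A perfect fit gives a CoD of 1, while a model that always predicts the mean of the observations gives 0; a fit worse than the mean gives a negative value.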
1.2
0.2
= x1.5
1 x2 x2
0.5
= x0.1
1 x1 x2
From Fig. 2, the average CoD of the estimator increases as the amount of sampled data increases. The CoD does not change much beyond a certain number of samples, which implies that the CoD has an upper bound. It is observed that increasing the number of sampling points yields more exact estimates, within the upper bound. Note
[Plot: Concentration Level versus Time for HAP1, CYB2, CYC7, CYT1 and COX5A, with the corresponding B-Spline curves x1 to x5]
[Plot: CoD of Estimated Parameters versus Data Size, comparing BSpline and Linear]
Fig. 4.
Fig. 5.
Fig. 6.