MSB: A mean-shift-based approach for the analysis of structural variation in the genome

Lu-yong Wang (1,4), Alexej Abyzov (2), Jan O. Korbel (2), Michael Snyder (1,2,3), and Mark Gerstein (1,2,4,5)

  1. Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
  2. Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
  3. Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
  4. Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA

Abstract

Genome structural variation includes segmental duplications, deletions, and other rearrangements, and array-based comparative genomic hybridization (array-CGH) is a popular technology for detecting such variation. Drawing relevant conclusions from array-CGH requires computational methods for partitioning the chromosome into segments of elevated, reduced, or unchanged copy number. Several approaches have been described, most of which attempt to explicitly model the underlying distribution of data based on particular assumptions. Often, they estimate model parameters by optimizing likelihood functions through expectation maximization or related approaches; however, this requires good parameter initialization, which in turn requires prespecifying the number of segments. Moreover, convergence is difficult to achieve, since many parameters are required to characterize an experiment. To overcome these limitations, we propose a nonparametric method that has no global criterion to be optimized. Our method involves mean-shift-based (MSB) procedures; it considers the observed array-CGH signal as a sample from a probability density function, uses a kernel-based approach to estimate local gradients for this function, and iteratively follows them to determine local modes of the signal. Overall, our method achieves robust discontinuity-preserving smoothing, thus accurately segmenting chromosomes into regions of duplication and deletion. It does not require the number of segments as input, nor does its convergence depend on this. We successfully applied our method to both simulated data and array-CGH experiments on glioblastoma and adenocarcinoma. We show that it performs at least as well as, and often better than, 10 previously published algorithms. Finally, we show that our approach can be extended to segmenting the signal resulting from the depth-of-coverage of mapped reads from next-generation sequencing.
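
The mode-seeking idea described in the abstract can be illustrated with a minimal sketch of generic mean-shift filtering on a one-dimensional signal. This is not the authors' implementation: the Gaussian kernel, the joint (position, value) formulation, the bandwidth names `h_pos` and `h_val`, the convergence tolerance, and the `min_jump` breakpoint threshold are all assumptions chosen for illustration.

```python
import numpy as np


def mean_shift_smooth(signal, h_pos=5.0, h_val=0.3, max_iter=200, tol=1e-5):
    """Mean-shift filtering of a 1-D signal in the joint (position, value) space.

    Illustrative sketch only; h_pos and h_val are hypothetical bandwidths for
    probe position and signal value (e.g., a log-ratio), respectively.
    """
    signal = np.asarray(signal, dtype=float)
    positions = np.arange(signal.size, dtype=float)
    smoothed = np.empty_like(signal)

    for i in range(signal.size):
        # Start the mode search at the observed point itself.
        x, y = positions[i], signal[i]
        for _ in range(max_iter):
            # Gaussian kernel weights: probes that are close in position AND
            # similar in value dominate, which preserves discontinuities.
            w = np.exp(-0.5 * ((positions - x) / h_pos) ** 2
                       - 0.5 * ((signal - y) / h_val) ** 2)
            total = w.sum()
            new_x = np.dot(w, positions) / total
            new_y = np.dot(w, signal) / total
            # The mean-shift vector points along the estimated density gradient.
            shift = abs(new_x - x) / h_pos + abs(new_y - y) / h_val
            x, y = new_x, new_y
            if shift < tol:
                break
        # Each probe takes the value of the local mode it converged to.
        smoothed[i] = y
    return smoothed


def find_breakpoints(smoothed, min_jump=0.5):
    """Return indices where adjacent smoothed values differ by more than min_jump."""
    jumps = np.abs(np.diff(np.asarray(smoothed, dtype=float)))
    return (np.flatnonzero(jumps > min_jump) + 1).tolist()
```

Calling `mean_shift_smooth` on a noisy step-like signal and passing the result to `find_breakpoints` yields candidate segment boundaries without specifying the number of segments in advance, which is the property the abstract emphasizes. In practice the bandwidths would need to be tuned to the noise level and probe density of the data.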
