Question

Speed and memory of custom function using scanBam

0

Entering edit mode

António Miguel de Jesus Domingues ▴ 510

@antonio-miguel-de-jesus-domingues-5182

Last seen 9 months ago

Germany

Dear BioC team,

I am trying to leverage the power of Rsamtools to summarize reads in a number of ways. For instance, read length (seq) of reads mapping to certain genes. Here is simplified version of one my functions:

library("Rsamtools")

get_seq_lengths <- function(bamPath, coordinates = NULL){
    if (!is.null(coordinates)){
            param1 <- ScanBamParam(
            which = coordinates,
            what = c("seq")
        )
    } else {
        param1 <- ScanBamParam(
            what=c("seq")
        )
    }
    seq_width <- scanBam(bamPath, param = param1)
    seq_width <- width(seq_width[[1]]$seq)
    tmp <- as.data.frame(table(seq_width))
    tmp$Bam <- bamPath
    return(tmp)
}

system.time(res <- get_seq_lengths(bam1))
   user  system elapsed 
  2.572   0.084   9.688

sum(Rsamtools::idxstatsBam(file = bam1)$mapped)
2592550

This process is actually quite fast (and memory efficient) if applied to the entire bam file. However, when filtering using genomic coordinates it is (1) very slow (see bellow for a test with 100.000 coordinates, Granges object) or (2) eat up all the memory (8GB) and eventually crash if all the genomic locations are used (~3M).

system.time(
    get_seq_lengths(
            bam1,
            coordinates = repeats_loc[1:10^5,]
        )
)

   user  system elapsed 
 80.488   0.348  80.913

repeats_loc
GRanges object with 2442918 ranges and 3 metadata columns:
                    seqnames      ranges strand |     type        gene_id
                       <Rle>   <IRanges>  <Rle> | <factor>    <character>
        [1]             chr1       1-397      - |     exon     DNA-3-1_DR
        [2]             chr1     403-462      + |     exon       L2-5_DRe
        [3]             chr1     460-532      + |     exon     hAT-N25_DR
        [4]             chr1     469-614      + |     exon   DNA25TWA1_DR
        [5]             chr1    625-1027      + |     exon       L2-5_DRe
        ...              ...         ...    ... .      ...            ...
  [2442914] chrUn_KN150713v1 12946-13006      - |     exon   Gypsy13-I_DR
  [2442915] chrUn_KN150713v1 13009-13341      - |     exon    ENSPM-2N_DR
  [2442916] chrUn_KN150713v1 13211-13370      + |     exon     hAT-N79_DR
  [2442917] chrUn_KN150713v1 13606-13668      - |     exon Gypsy13-LTR_DR
  [2442918] chrUn_KN150713v1 13927-14167      + |     exon        HE1_DR1
            gene_biotype
             <character>
        [1]         <NA>
        [2]         <NA>
        [3]         <NA>
        [4]         <NA>
        [5]         <NA>
        ...          ...
  [2442914]         <NA>
  [2442915]         <NA>
  [2442916]         <NA>
  [2442917]         <NA>
  [2442918]         <NA>
  -------
  seqinfo: 1061 sequences from an unspecified genome; no seqlengths

My question is if there is a way to optimize this process?

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] rtracklayer_1.40.4   scales_1.0.0         ggplot2_3.0.0       
 [4] data.table_1.11.4    Rsamtools_1.32.2     Biostrings_2.48.0   
 [7] XVector_0.20.0       GenomicRanges_1.32.6 GenomeInfoDb_1.16.0 
[10] IRanges_2.14.10      S4Vectors_0.18.3     BiocGenerics_0.26.0 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18                pryr_0.1.4                 
 [3] compiler_3.5.1              pillar_1.2.3               
 [5] plyr_1.8.4                  bindr_0.1.1                
 [7] bitops_1.0-6                tools_3.5.1                
 [9] zlibbioc_1.26.0             lattice_0.20-35            
[11] tibble_1.4.2                gtable_0.2.0               
[13] pkgconfig_2.0.1             rlang_0.2.2                
[15] Matrix_1.2-14               DBI_1.0.0                  
[17] DelayedArray_0.6.1          bindrcpp_0.2.2             
[19] GenomeInfoDbData_1.1.0      stringr_1.3.1              
[21] withr_2.1.2                 dplyr_0.7.6                
[23] grid_3.5.1                  tidyselect_0.2.4           
[25] Biobase_2.40.0              glue_1.3.0                 
[27] R6_2.2.2                    RMySQL_0.10.15             
[29] XML_3.98-1.15               BiocParallel_1.14.2        
[31] purrr_0.2.5                 magrittr_1.5               
[33] codetools_0.2-15            matrixStats_0.53.1         
[35] GenomicAlignments_1.16.0    SummarizedExperiment_1.10.1
[37] assertthat_0.2.0            colorspace_1.3-2           
[39] stringi_1.2.4               RCurl_1.95-4.11            
[41] lazyeval_0.2.1              munsell_0.5.0

rsamtools scanbamparam bam • 1.3k views

ADD COMMENT • link updated 6.1 years ago by Martin Morgan 25k • written 6.1 years ago by António Miguel de Jesus Domingues ▴ 510

score 1 · Accepted Answer · 2018-09-28

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 3 months ago

United States

The slowness and memory use is because each range is a separate call -- so 2442918 calls instead of one, and creates 2442918 DNAStringSets instead of 1. If there are many ranges I think the strategy should be to make one call and use subsetByOverlaps(), etc on the GRanges, or if concerned about memory make one-call-per-chromosome.

One subtlety of calling one range at a time is that an aligned read that overlaps two queries will be counted / reported twice -- this may or may not be desired.

Asking for "qwidth" rather than "seq" is a more efficient way to extract the length of the aligned read.

There might be less 'low level' work (e.g., creating GRanges for subsetByOverlap) if one used GenomicAlignments::readGAlignments() rather than scanBam.

ADD COMMENT • link 6.1 years ago Martin Morgan 25k

0

Entering edit mode

Sorry for the belated reply, but working on other things got in the way. Thank your for the clarification regarding the separate call for each range - I had no idea it worked that way. For now I am sticking with GenomicAlignments followed by GR. Memory might be an issue down the line, but having reads being report multiple times is also less than ideal.

ADD REPLY • link 6.0 years ago António Miguel de Jesus Domingues ▴ 510