DSRC User Guide
Release 2.0
Contents
1 Introduction
1.1 What is DSRC?
1.2 Main features
1.3 Compression factor and speed
1.4 Contact and support
2 Quickstart
2.1 Download
2.2 Building
2.3 Program usage
3 DSRC integration
3.1 Python API
3.2 Python examples
3.3 C++ API
3.4 C++ examples
1 Introduction
DSRC (DNA Sequence Reads Compression) is an application designed for lossless and lossy
compression of DNA sequencing reads stored in FASTQ format. The first release (0.x) of DSRC
was accompanied by the research paper:
• S. Deorowicz and S. Grabowski: Compression of DNA sequence reads in FASTQ format,
Bioinformatics (2011).
The newest release is described in:
• Ł. Roguski and S. Deorowicz: DSRC 2—Industry-oriented compression of FASTQ files (under
review).
In terms of lossless compression factor (the ability to reduce the file size), DSRC is usually
35–60% better than gzip and 15–30% better than bzip2, currently the most common tools for
storing FASTQ files in compressed form. DSRC is multithreaded, and in the fast mode its speed
usually reaches the I/O limits (e.g., about 500 MB/s with 8 threads). The achieved compression
ratios and speeds are reported in the research papers cited above.
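As a rough illustration of how a compression factor can be measured, the sketch below compares gzip and bzip2 on a small synthetic FASTQ sample using only the Python standard library. It assumes the factor is defined as original size divided by compressed size; the synthetic reads and the helper names are made up for this example, and real sequencing data (and DSRC itself) will give different numbers.

```python
import bz2
import gzip
import random


def make_synthetic_fastq(n_reads, read_len=100, seed=0):
    """Build a synthetic FASTQ text (illustrative data, not real reads)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        lines += ["@read.%d" % i, seq, "+", qual]
    return ("\n".join(lines) + "\n").encode("ascii")


def compression_factor(raw, compressed):
    """Original size divided by compressed size (higher is better)."""
    return len(raw) / len(compressed)


data = make_synthetic_fastq(2000)
gzip_factor = compression_factor(data, gzip.compress(data))
bzip2_factor = compression_factor(data, bz2.compress(data))
print("gzip factor:  %.2f" % gzip_factor)
print("bzip2 factor: %.2f" % bzip2_factor)
```

Running the same measurement on a DSRC archive of the same input would put the three tools on a common scale.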
1.4 Contact and support
2 Quickstart
2.1 Download
The compiled binaries and the C++ and Python libraries, together with usage examples, can be
downloaded from the official Web site: http://sun.aei.polsl.pl/dsrc/
2.2 Building
2.2.1 Linux
To compile DSRC on Linux, use the makefiles provided in the main directory. The default
Makefile builds DSRC in a statically linked variant that uses the boost::thread library
(see footnote 1) for multithreading, so the Boost development libraries must be installed.
Makefile.g++11 uses g++ (≥ 4.8) with its C++11 implementation, which provides multithreading
without the Boost libraries.
To build the DSRC binary, run the following in the main directory:
make bin
The resulting dsrc binary is placed in the bin directory.
2.2.2 Windows
To compile on Windows, use the provided solution files dsrc-vs2k10.sln and dsrc-vs2k12.sln
for Microsoft Visual Studio 2010 and 2012, respectively; they contain configurations for
building the DSRC binary and the C++ library. Building with Visual Studio 2010 requires the
boost::thread library for multithreading, whereas Visual Studio 2012 uses the C++11 threads
implementation. DSRC can also be compiled with MinGW-W64 (64-bit) using the Linux makefiles.
1 Boost libraries can be downloaded from http://www.boost.org/
2.2.3 Mac OSX
To compile DSRC on Mac OSX, use the Makefile.osx file provided in the main directory. This
makefile uses the clang compiler, whose C++11 support provides the multithreading functionality.
To build the DSRC binary, run the following in the main directory:
make -f Makefile.osx bin
The resulting dsrc binary is placed in the bin directory.
2.3 Program usage
Available modes are:
c: compression,
d: decompression.
Available compression options are:
-m<n>  automated compression mode (one of three presets combining the other parameters): 0–2
-o<n>  quality offset, 0 for auto selection, default: 0
-l     use lossy quality mode (Illumina binning scheme), default: false
-c     calculate and check a CRC32 checksum per block (makes compression about twice as slow), default: false
-t<n>  number of processing threads, default: maximum available hardware threads
-s     use stdin/stdout for reading/writing FASTQ data (stderr is used for info/warning messages)
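The per-block CRC32 check enabled by -c can be pictured with a short Python sketch based on zlib.crc32: a checksum is recorded for every block when the data is written and recomputed when it is read back. The block size and the function names below are hypothetical and say nothing about DSRC's actual block layout.

```python
import zlib

BLOCK_SIZE = 1 << 16  # hypothetical block size, not DSRC's actual value


def block_checksums(data, block_size=BLOCK_SIZE):
    """Return one CRC32 value per fixed-size block of data."""
    return [
        zlib.crc32(data[start:start + block_size]) & 0xFFFFFFFF
        for start in range(0, len(data), block_size)
    ]


def verify_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return indices of blocks whose CRC32 differs from the expected one."""
    actual = block_checksums(data, block_size)
    if len(actual) != len(expected):
        raise ValueError("block count mismatch")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = b"@read.1\nACGT\n+\nIIII\n" * 10000
stored = block_checksums(original)

# Flip one byte to simulate corruption inside the second block.
corrupted = bytearray(original)
corrupted[BLOCK_SIZE + 5] ^= 0xFF
print(verify_blocks(original, stored))          # intact data: no mismatches
print(verify_blocks(bytes(corrupted), stored))  # the damaged block is reported
```

Checking per block rather than per file means a mismatch points at the damaged region, at the cost of the extra pass over the data that the option description mentions.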
Usage examples:
3 DSRC integration
DSRC can be easily integrated with applications written in C++ or Python. We provide C++ and
Python libraries with very similar interfaces. Although the method and member names are almost
identical in both cases, the C++ and Python descriptions are given in two separate sections for
clarity.
3.1 Python API
To use the compressor from Python, simply import the pydsrc module (the pydsrc.so file) in
your project. For high performance, the core operations are written in C++ and exported to
Python.
3.1.1 FastqRecord
FastqRecord represents a single DNA sequencing read. The ID, sequence, plus, and quality fields
are accessible as Python strings.
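The four fields correspond to the four text lines of a FASTQ entry. The sketch below models that with a plain Python dataclass; SimpleFastqRecord, its attribute names, and parse_record are hypothetical stand-ins chosen for illustration and are not the pydsrc class or its attribute names.

```python
from dataclasses import dataclass


@dataclass
class SimpleFastqRecord:
    """Hypothetical stand-in for a FASTQ read; not the pydsrc class."""
    id: str
    sequence: str
    plus: str
    quality: str

    def is_consistent(self):
        """Check the basic FASTQ invariants for a single read."""
        return (
            self.id.startswith("@")
            and self.plus.startswith("+")
            and len(self.sequence) == len(self.quality)
        )


def parse_record(lines):
    """Build a record from the four text lines of one FASTQ entry."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly four lines")
    tag, seq, plus, qual = (line.rstrip("\n") for line in lines)
    return SimpleFastqRecord(tag, seq, plus, qual)


rec = parse_record(["@read.1\n", "ACGTN\n", "+\n", "IIII#\n"])
print(rec.sequence, rec.is_consistent())
```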
3.1.2 FastqFile
FastqFile is used for reading and writing FASTQ files sequentially, read by read, where each
read is of type FastqRecord. Figure 3.1 lists the public methods of FastqFile; all methods
throw an exception on error or failure.
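Sequential, read-by-read access of this kind can be pictured with a small pure-Python sketch over file-like objects. The functions read_records and write_record are illustrative assumptions, not part of pydsrc; they raise ValueError on a truncated record, loosely mirroring the exception-on-failure behaviour described above.

```python
import io


def read_records(handle):
    """Yield (id, sequence, plus, quality) tuples, one read at a time."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # clean end of file
        if not lines[3]:
            raise ValueError("truncated FASTQ record")
        yield tuple(line.rstrip("\n") for line in lines)


def write_record(handle, record):
    """Write one four-field record back in FASTQ layout."""
    handle.write("\n".join(record) + "\n")


source = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n")
sink = io.StringIO()
count = 0
for record in read_records(source):
    write_record(sink, record)
    count += 1
print(count)
```

Because only one record is held in memory at a time, the same loop works for inputs far larger than available RAM.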
3.1.3 DsrcArchive
DsrcArchive represents a DSRC archive file and provides compression and decompression
routines. Figure 3.2 lists the public methods of DsrcArchive; all methods throw an exception on
error or failure. Figure 3.3 lists the public properties. Note that properties cannot be set
once DsrcArchive is in processing mode (that is, after compression or decompression routines
have been started).
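The "configure first, then process" rule can be modelled in a few lines of Python. ArchiveSettings and its dna_level property are hypothetical names for this sketch, and raising RuntimeError on a late change is an assumption: the guide states only that properties cannot be set in processing mode, not how pydsrc reacts.

```python
class ArchiveSettings:
    """Hypothetical model of settings that lock once processing starts."""

    def __init__(self):
        self._dna_level = 0
        self._processing = False

    @property
    def dna_level(self):
        return self._dna_level

    @dna_level.setter
    def dna_level(self, value):
        if self._processing:
            raise RuntimeError("cannot change settings while processing")
        self._dna_level = value

    def start(self):
        """Enter processing mode; settings become read-only."""
        self._processing = True

    def finish(self):
        """Leave processing mode; settings can be changed again."""
        self._processing = False


settings = ArchiveSettings()
settings.dna_level = 2          # allowed: processing has not started
settings.start()
try:
    settings.dna_level = 3      # rejected: processing mode is active
    changed = True
except RuntimeError:
    changed = False
print(settings.dna_level, changed)
```

In practice this means all properties should be assigned before the first compression or decompression call.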
3.1.4 DsrcModule
DsrcModule provides automated, parallel compression routines that work on whole files instead
of single records. Figures 3.4 and 3.5 show the available public methods and public properties.
All methods throw an exception on error or failure.
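A whole-file call can be wrapped so that failures are reported rather than propagated. The sketch below assumes the Python class exposes Compress(fastqFilename, dsrcFilename) with the same argument order as the C++ method of that name; the wrapper compress_file and the FakeModule test double are hypothetical and exist only so the example runs without pydsrc installed.

```python
def compress_file(module, fastq_path, dsrc_path):
    """Compress one FASTQ file through a DsrcModule-like object.

    `module` is any object exposing Compress(fastq_path, dsrc_path).
    Returns True on success and False if the call raised an exception.
    """
    try:
        module.Compress(fastq_path, dsrc_path)
    except Exception as exc:  # the guide says methods throw on failure
        print("compression failed:", exc)
        return False
    return True


class FakeModule:
    """Test double used here so the sketch runs without pydsrc installed."""

    def __init__(self):
        self.calls = []

    def Compress(self, fastq_path, dsrc_path):
        if not fastq_path:
            raise ValueError("empty input filename")
        self.calls.append((fastq_path, dsrc_path))


fake = FakeModule()
ok = compress_file(fake, "reads.fastq", "reads.dsrc")
failed = compress_file(fake, "", "reads.dsrc")
print(ok, failed)
```

With the real library, the first argument would presumably be an instance created as pydsrc.DsrcModule(), by analogy with pydsrc.FastqRecord().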
3.2 Python examples
import pydsrc
import pydsrc
3.2.3 Manual compression using DsrcArchive and FastqFile
import pydsrc
# ...
# read all records from FASTQ file and write to DSRC archive
rc = 0
rec = pydsrc.FastqRecord()
while fqfile.ReadNextRecord(rec):
    archive.WriteNextRecord(rec)
    rc += 1
3.2.4 Manual decompression using DsrcArchive and FastqFile
import pydsrc
# ...
# read all records from DSRC archive and write to FASTQ file
rc = 0
rec = pydsrc.FastqRecord()
while archive.ReadNextRecord(rec):
    fqfile.WriteNextRecord(rec)
    rc += 1
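The manual compression and decompression loops have the same shape: read into a reusable record until the source reports the end, and write each record to the sink. A minimal sketch of that pattern as one helper follows. Only ReadNextRecord and WriteNextRecord come from the guide; copy_records, ListReader, and ListWriter are hypothetical in-memory stand-ins used so the example runs without pydsrc.

```python
def copy_records(reader, writer, make_record):
    """Move every record from reader to writer and return the count.

    reader must provide ReadNextRecord(rec) -> bool, writer must provide
    WriteNextRecord(rec); make_record() creates the reusable record.
    """
    count = 0
    rec = make_record()
    while reader.ReadNextRecord(rec):
        writer.WriteNextRecord(rec)
        count += 1
    return count


class ListReader:
    """In-memory stand-in for a record source (not part of pydsrc)."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def ReadNextRecord(self, rec):
        if self._pos == len(self._items):
            return False
        rec["value"] = self._items[self._pos]  # fill the caller's record
        self._pos += 1
        return True


class ListWriter:
    """In-memory stand-in for a record sink (not part of pydsrc)."""

    def __init__(self):
        self.items = []

    def WriteNextRecord(self, rec):
        self.items.append(rec["value"])  # copy out, since rec is reused


writer = ListWriter()
copied = copy_records(ListReader(["r1", "r2", "r3"]), writer, dict)
print(copied, writer.items)
```

With pydsrc, the reader and writer would be the opened FastqFile and DsrcArchive objects (in either direction), and make_record would be pydsrc.FastqRecord.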
3.3 C++ API
To use the DSRC C++ library, include the Dsrc.h header file and link the application against
the libdsrc library (libdsrc.a on Linux, libdsrc.lib on Windows).
3.3.1 FastqRecord
FastqRecord stores the information of a single DNA sequencing read. The ID, sequence, plus,
and quality fields are represented as std::string.
3.3.2 FastqFile
FastqFile is used to read and write FASTQ records, where each record is of FastqRecord type.
Figure 3.6 shows the FastqFile public methods; they throw a std::runtime_error exception
on error.
3.3.3 DsrcArchive
DsrcArchive provides methods to read from and write to a DSRC archive file sequentially,
read by read, where each record is of FastqRecord type. Figure 3.7 shows the DsrcArchive public
methods, while Figure 3.8 shows the public accessors.
3.3.4 DsrcModule
DsrcModule provides automated parallel compression and decompression routines operating on
whole files instead of single records. Figure 3.9 shows the available public methods, and
Figure 3.10 shows the public accessors.
Table 3.7: DsrcArchive public methods
Method            Returns  Parameters           Description
StartCompress     -        std::string          Creates a new DSRC archive and prepares for compression.
WriteNextRecord   -        FastqRecord record   Writes a new record to file.
FinishCompress    -        -                    Finalizes the compression of the archive and performs cleanup.
StartDecompress   -        std::string          Opens a DSRC archive and prepares for decompression.
ReadNextRecord    bool     FastqRecord& record  Reads the next decompressed FASTQ record from archive. Returns false on reaching end-of-file.
FinishDecompress  -        -                    Finalizes the decompression of the archive and performs cleanup.
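The methods form a start, write, finish sequence. One way to make sure the final step is not forgotten is to wrap the sequence in a context manager, sketched here in Python against an archive-like object with the same method names. The helper compression_session and the RecordingArchive stand-in are hypothetical, and whether FinishCompress should still be called after a failed write in the real library is an assumption of this sketch.

```python
from contextlib import contextmanager


@contextmanager
def compression_session(archive, filename):
    """Start compression and call FinishCompress when the block exits."""
    archive.StartCompress(filename)
    try:
        yield archive
    finally:
        archive.FinishCompress()


class RecordingArchive:
    """Stand-in that logs calls so the sketch runs without DSRC."""

    def __init__(self):
        self.log = []

    def StartCompress(self, filename):
        self.log.append("start:" + filename)

    def WriteNextRecord(self, record):
        self.log.append("write:" + record)

    def FinishCompress(self):
        self.log.append("finish")


archive = RecordingArchive()
with compression_session(archive, "reads.dsrc") as out:
    for name in ("r1", "r2"):
        out.WriteNextRecord(name)
print(archive.log)
```

The same shape applies to StartDecompress and FinishDecompress on the reading side.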
Table 3.9: DsrcModule public methods
Method      Returns  Parameters                                           Description
Compress    bool     std::string fastqFilename, std::string dsrcFilename  Compresses the FASTQ file to DSRC archive.
Decompress  bool     std::string dsrcFilename, std::string fastqFilename  Decompresses the DSRC archive to file.
3.4 C++ examples
try
{
    DsrcModule dsrc;
3.4.2 Decompressing DSRC archive using automated DsrcModule module
try
{
    DsrcModule dsrc;
3.4.3 Manual compression using DsrcArchive and FastqFile
        archive.SetDnaCompressionLevel(2);
        archive.SetFastqBufferSizeMB(512);
        // ...
        std::cerr << "Error!\n" << e.what() << std::endl;
        return -1;
    }
    // ...
    }
    std::cout << "Success!\nCompressed records: " << recCount << std::endl;
    return 0;
}
3.4.4 Manual decompression using DsrcArchive and FastqFile
#include <string>
#include <iostream>
// ...
    std::cout << "Success!\nDecompressed records: " << recCount << std::endl;
    return 0;
}