Bioinformatics Class2
Data Quality
Read Mapping
Riddhiman Dhar
Department of Biotechnology
IIT Kharagpur
Data formats
- FASTA format
Illumina
- FASTQ format
PacBio
- earlier: basecall HDF5 format
- currently: BAM format (convertible to FASTQ format)
Nanopore
- HDF5 format
- FAST5 format
FASTQ format
FASTA format with quality scores
@D00733:181:CAH6EANXX:8:2210:1499:2056 1:N:0:GTAGAG
NCTTTGTACTATGACCGATACACTCAACCGGCGAAAAGTGGAACTTGAGAATTGATGTCTTCATCTTATT
ATCTGTCTCTTATACACATCTCCGAGCCCACGAGACGTAGAGGAATCTCGTATGC
+
#<=BBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@GGGGGGGGGGGGGEGGEGG8CEGGG
GGGG
@D00733:181:CAH6EANXX:8:2210:1668:2177 1:N:0:GTAGAG
GCTTTTTTTACGGCAGCATTTTTTTTTCAACTCTGATCGCCCCTTTACTGCTCCCTCCGCCCAAATTCCA
TTGCAGTTCAAATGTATACTGAAAAAAACCCCATTGCTATTGTTAAACAGTGAAC
+
BBCCCGGGGBFGGGGGGEGGGGGGGGGGGGCGFECCGGGGGGGGGGCDGGGGGGGGED>BGG
GGBD@CGGEG>CCFGDCCGCDGG8FCGG=FEGGGGGGGDGDDEGG/6D@/DGGCB//6C9CD/
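The record layout above can be read with a few lines of Python. This is a minimal sketch, assuming the standard unwrapped layout of four lines per record (identifier, sequence, "+" separator, quality); the records on the slide are wrapped to fit the page. The function name read_fastq and the toy reads are made up for illustration.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Toy two-record input (made-up reads, not the records from the slide).
example = StringIO(
    "@read1\nACGTN\n+\nGGGG#\n"
    "@read2\nTTGCA\n+\nBBCCC\n"
)
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, seq, qual)
```

Each base has exactly one quality character, so the length check catches most truncated or corrupted records.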
Header: @D00733:181:CAH6EANXX:8:2210:1668:2177 1:N:0:GTAGAG
Layout: @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x_pos>:<y_pos> <read>:<is filtered>:<control number>:<sample number>

Element           Requirements                                   Description
@                 @                                              Each sequence identifier line starts with @
<instrument>      Characters allowed: a–z, A–Z, 0–9, underscore  Instrument ID
<run number>      Numerical                                      Run number on instrument
<flowcell ID>     Characters allowed: a–z, A–Z, 0–9              Flowcell ID
<lane>            Numerical                                      Lane number
<tile>            Numerical                                      Tile number
<x_pos>           Numerical                                      X coordinate of cluster
<y_pos>           Numerical                                      Y coordinate of cluster
<read>            Numerical                                      Read number: 1 for a single read or Read 1 of a pair, 2 for Read 2 of paired-end
<is filtered>     Y or N                                         Y if the read is filtered (did not pass), N otherwise
<control number>  Numerical                                      0 when none of the control bits are on, otherwise an even number.
                                                                 On HiSeq X and NextSeq systems, control specification is not
                                                                 performed and this number is always 0.
<sample number>   Numerical                                      Sample number from sample sheet (the example header shows the
                                                                 index sequence GTAGAG in this position)
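The element table can be turned into a small parser. This is a sketch only: the class IlluminaHeader and function parse_header are illustrative names, and the last field is kept as text because the example header carries an index sequence there.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control_number: int
    sample: str  # sample number or index sequence, kept as text


def parse_header(line: str) -> IlluminaHeader:
    """Split an Illumina-style identifier line into its named elements."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # The two halves are separated by one space; elements by colons.
    left, right = line[1:].split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left.split(":")
    read, filtered, control, sample = right.split(":")
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), filtered == "Y",
        int(control), sample,
    )


header = parse_header("@D00733:181:CAH6EANXX:8:2210:1668:2177 1:N:0:GTAGAG")
print(header)
```

Headers written by other instruments or older software versions may not have this exact number of elements, in which case the unpacking raises ValueError.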
Quality score – ASCII table
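Each quality character encodes a Phred score Q through its ASCII code, and Q relates to the base-call error probability as P = 10^(-Q/10). The sketch below assumes the Phred+33 offset (the usual encoding for current Illumina FASTQ files); older files may use a different offset. The function names are illustrative.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# First six quality characters of the first record shown earlier.
scores = phred_scores("#<=BBG")
print(scores)
print(error_probability(30))
```

With this offset, "#" is Q2 and "G" is Q38, so a read that is mostly "G" has very confident base calls.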
HDF5 format
- Groups
- Datasets
- Attributes
FAST5 format
FastQC
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Checking NGS Data Quality
Per base quality score
Good data!
Per base quality score
Bad data!
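The idea behind a per base quality plot can be sketched as the average Phred score at each read position. This is a simplification: it reports only the mean, whereas FastQC's plot summarises the spread of scores at each position. It assumes Phred+33 quality strings; the function name and the toy data are made up.

```python
def mean_quality_per_base(qualities, offset=33):
    """Mean Phred score at each read position across all quality strings."""
    totals, counts = [], []
    for qual in qualities:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings of unequal length (made up for illustration).
toy = ["GGG#", "GGB", "BGG#"]
means = mean_quality_per_base(toy)
print(means)
```

A drop in the mean toward the end of the reads, as at the last position here, is the typical sign of poor 3' quality.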
Per sequence quality scores
Good data!
Per sequence quality scores
Not so good!
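A per sequence quality distribution can be approximated by averaging the scores within each read and counting reads per mean score. This is a sketch under the Phred+33 assumption, with an illustrative function name and toy data.

```python
from collections import Counter


def per_sequence_quality_histogram(qualities, offset=33):
    """Count reads by their mean Phred score, rounded down to an integer."""
    histogram = Counter()
    for qual in qualities:
        if not qual:
            continue  # skip empty reads to avoid dividing by zero
        mean_q = sum(ord(ch) - offset for ch in qual) / len(qual)
        histogram[int(mean_q)] += 1
    return histogram


hist = per_sequence_quality_histogram(["GGGG", "GGGB", "####"])
print(sorted(hist.items()))
```

In good data almost all reads fall in one peak at high quality; a second peak at low quality points to a subset of poor reads.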
Per base sequence content
Per sequence GC content
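GC content per read is the fraction of called bases that are G or C. The sketch below ignores N and any other ambiguity codes in the denominator, which is a design choice made here for illustration; the function name is hypothetical.

```python
def gc_fraction(sequence):
    """Fraction of A/C/G/T bases in the read that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # read contains no called bases
    return (seq.count("G") + seq.count("C")) / called


print(gc_fraction("GGCCAATT"))
print(gc_fraction("GCN"))
```

Collecting this value over all reads gives the distribution that is compared against the expected GC content of the organism.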
Per Base N content
Sequence length distribution
Duplicate Sequences
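A simple duplication estimate counts how often each exact read sequence occurs. This sketch treats every copy beyond the first as a duplicate; FastQC's own module differs in detail, so the numbers are not directly comparable. The function name and toy reads are made up.

```python
from collections import Counter


def duplication_summary(sequences):
    """Count total and distinct reads and the fraction that are repeat copies."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    duplicate_fraction = (total - distinct) / total if total else 0.0
    return {
        "total": total,
        "distinct": distinct,
        "duplicate_fraction": duplicate_fraction,
    }


summary = duplication_summary(["ACGT", "ACGT", "ACGT", "TTTT", "GGCC"])
print(summary)
```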
Over-represented k-mers
Adapter content
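Adapter content can be estimated as the fraction of reads that contain the start of a known adapter. The sketch below uses exact matching on the first 12 bases of an adapter; the sequence CTGTCTCTTATACACATCT is assumed here to be the Nextera transposase adapter, and the first toy read is an excerpt of the first FASTQ record shown earlier. Function and constant names are illustrative.

```python
# Assumed adapter sequence (Nextera transposase); replace with the adapter of your kit.
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def adapter_fraction(sequences, adapter=NEXTERA_ADAPTER, min_match=12):
    """Fraction of reads containing the first min_match bases of the adapter."""
    if not sequences:
        return 0.0
    probe = adapter[:min_match]
    hits = sum(1 for seq in sequences if probe in seq)
    return hits / len(sequences)


reads = [
    "TTATTATCTGTCTCTTATACACATCTCCGAGCCCACGAGAC",  # excerpt of record 1, has adapter
    "GCTTTTTTTACGGCAGCATTTTTTTTTCAACTCTGATCGCC",  # start of record 2, no adapter
]
fraction = adapter_fraction(reads)
print(fraction)
```

Adapter sequence inside a read means the insert was shorter than the read length, so those bases should be trimmed before mapping.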
Read trimming tools
Trimmomatic
http://www.usadellab.org/cms/?page=trimmomatic
BBDuk
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
Cutadapt
https://pypi.org/project/cutadapt/1.3/
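What these tools do can be illustrated with a toy trimmer: remove a 3' adapter, then cut low-quality bases from the 3' end. This is a simplified sketch, not the algorithm of Trimmomatic, BBDuk or Cutadapt: it uses exact matching only, assumes Phred+33 qualities, and the function name, thresholds and example reads are made up.

```python
def trim_read(seq, qual, adapter, min_quality=20, min_overlap=5, offset=33):
    """Trim a 3' adapter (exact match) and then trailing low-quality bases."""
    # Step 1: cut at a full adapter occurrence, or at an adapter prefix
    # of at least min_overlap bases hanging off the 3' end of the read.
    cut = seq.find(adapter)
    if cut == -1:
        longest = min(len(adapter) - 1, len(seq))
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                cut = len(seq) - k
                break
    if cut != -1:
        seq, qual = seq[:cut], qual[:cut]
    # Step 2: drop bases from the 3' end while their quality is below threshold.
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]


adapter = "CTGTCTCTTATACACATCT"  # assumed adapter sequence
full = trim_read("ACGTACGT" + adapter, "G" * 27, adapter)
partial = trim_read("ACGTACGTCTGTCTC", "G" * 15, adapter)
low_quality = trim_read("ACGTAC", "GGGG##", adapter)
print(full, partial, low_quality)
```

Real trimmers also tolerate mismatches in the adapter and offer sliding-window quality trimming, so they should be preferred for actual data.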