Genomics
Release 2020.2.0
1 Introduction  3
1.1 The workflow  3
1.2 Learning outcomes  3
2 Tool installation  5
2.1 Install the conda package manager  5
2.2 Creating environments  6
2.3 Install software  6
2.4 General conda commands  7
3 Quality control  9
3.1 Preface  9
3.2 Overview  9
3.3 Learning outcomes  9
3.4 The data  9
3.5 The fastq file format  12
3.6 The QC process  12
3.7 PhiX genome  12
3.8 Adapter trimming  13
3.9 Quality assessment of sequencing reads  14
3.10 Run FastQC and MultiQC on the trimmed data  15
4 Genome assembly  19
4.1 Preface  19
4.2 Overview  19
4.3 Learning outcomes  19
4.4 Before we start  19
4.5 Creating a genome assembly  21
4.6 Assembly quality assessment  22
4.7 Compare the untrimmed data  22
4.8 Further reading  23
4.9 Web links  23
5 Read mapping  25
5.1 Preface  25
5.2 Overview  25
5.3 Learning outcomes  25
5.4 Before we start  25
5.5 Mapping sequence reads to a reference genome  27
5.6 BWA  28
5.7 The sam mapping file-format  29
5.8 Mapping post-processing  29
5.9 Mapping statistics  31
5.10 Sub-selecting reads  32
6 Taxonomic investigation  35
6.1 Preface  35
6.2 Overview  35
6.3 Before we start  35
6.4 Kraken2  37
6.5 Centrifuge  41
6.6 Visualisation (Krona)  44
7 Variant calling  47
7.1 Preface  47
7.2 Overview  47
7.3 Learning outcomes  47
7.4 Before we start  47
7.5 Installing necessary software  49
7.6 Preprocessing  49
7.7 Calling variants  49
7.8 Post-processing  49
8 Genome annotation  55
8.1 Preface  55
8.2 Overview  55
8.3 Learning outcomes  55
8.4 Before we start  55
8.5 Installing the software  57
8.6 Assessment of orthologue presence and absence  58
8.7 Annotation with Augustus  58
8.8 Annotation with Prokka  59
8.9 Interactive viewing  59
10 Variants-of-interest  65
10.1 Preface  65
10.2 Overview  65
10.3 Learning outcomes  65
10.4 Before we start  65
10.5 General comments for identifying variants-of-interest  67
10.6 SnpEff  67
12 Coding solutions  73
12.1 QC  73
12.2 Assembly  74
12.3 Mapping  74
13 Downloads  77
13.1 Tools  77
13.2 Data  77
Bibliography  83
Attention: This is a new revised release, now using data from E. coli to speed the analysis steps up.
The former tutorial based on fungi data can be accessed at https://genomics-fungi.sschmeier.com/.
This is an introductory tutorial for learning computational genomics mostly on the Linux command-line.
You will learn how to analyse next-generation sequencing (NGS) data. The data you will be using is
real research data. The final aim is to identify genome variations in evolved lines of E. coli that can
explain the observed biological phenotypes. Until 2020, Sebastian1 was teaching this material in the
Massey University course Genome Science2.
More information about other bioinformatics material and our past research can be found on the former
webpages of the Schmeier Group3 (https://www.schmeierlab.com).
1 https://www.sschmeier.com
2 https://www.massey.ac.nz/massey/learning/programme-course/course.cfm?course_code=203341
3 https://www.schmeierlab.com
CHAPTER ONE: INTRODUCTION
This is an introductory tutorial for learning genomics mostly on the Linux command-line. Should you
need to refresh your knowledge about either Linux or the command-line, have a look here4 .
In this tutorial you will learn how to analyse next-generation sequencing (NGS) data. The data you
will be using is actual research data. The experiment follows a strategy similar to what is called
an “experimental evolution” experiment [KAWECKI2012], [ZEYL2006]. The final aim is to identify the
genome variations in evolved lines of E. coli that can explain the observed biological phenotype(s).
4 http://linux.sschmeier.com/
CHAPTER TWO: TOOL INSTALLATION
We will use the package/tool managing system conda7 to install some programs that we will use during
the course. It is not installed by default, thus we need to install it first to be able to use it. Let us
download conda8 first:
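A minimal sketch of the download and installation, assuming the standard Miniconda3 installer for 64-bit Linux (alternative links are on the Downloads (page 77) page):

# download the Miniconda3 installer (assumed URL; adjust for your platform)
$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# run the installer and follow the prompts
$ bash Miniconda3-latest-Linux-x86_64.sh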
Note: Should the conda installer download fail, please find links to alternative locations on the
Downloads (page 77) page.
After you have accepted the license agreement, conda10 will be installed. At the end of the installation you will
encounter the following:
...
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>>
Please type “yes” here. This will add some code to your .bashrc init file, which is important to work with
conda11 correctly.
Attention: Please close and reopen the terminal, to complete the installation.
After closing and re-opening the shell/terminal, we should be able to use the conda12 command:
7 http://conda.pydata.org/miniconda.html
8 http://conda.pydata.org/miniconda.html
9 http://conda.pydata.org/miniconda.html
10 http://conda.pydata.org/miniconda.html
11 http://conda.pydata.org/miniconda.html
12 http://conda.pydata.org/miniconda.html
Different tools are packaged in what conda13 calls channels. We need to add some channels to make the
bioinformatics and genomics tools available for installation:
Attention: The order of adding channels is important. Make sure you use the shown order of
commands.
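A sketch of one common setup (with conda, the channel added last receives the highest priority):

$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge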
We create a conda14 environment for some tools. This is useful for working reproducibly, as we can easily
re-create the tool-set with the same version numbers later on.
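For example, an environment called ngs (the name used in the examples below) could be created and activated like this; pinning a Python version is optional:

$ conda create -n ngs python=3
$ conda activate ngs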
So what happens when you type conda activate ngs in a shell? The PATH variable of your shell gets
temporarily manipulated and set to:
$ echo $PATH
/home/guest/miniconda3/bin:/home/guest/miniconda3/condabin:...
$ conda activate ngs
$ echo $PATH
/home/guest/miniconda3/envs/ngs/bin:/home/guest/miniconda3/condabin: ...
Now the shell will look first in your environment’s bin directory and only afterwards in the general conda bin
(/home/guest/miniconda3/condabin). So everything you install globally with conda (without being in
an environment) is still available to you, but it gets overshadowed if a program of the same name is in
/home/guest/miniconda3/envs/ngs/bin and you are in the ngs environment.
To install software into the activated environment, one uses the command conda install.
Note: To tell if you are in the correct conda environment, look at the command-prompt. Do you see
the name of the environment in round brackets at the very beginning of the prompt, e.g. (ngs)? If not,
activate the ngs environment with conda activate ngs before installing the tools.
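For example, assuming the channels above have been added and an environment is active, several tools can be installed in one go (the package names are examples):

$ conda install fastqc multiqc fastp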
13 http://conda.pydata.org/miniconda.html
14 http://conda.pydata.org/miniconda.html
# activate env
$ conda activate [name]
# deactivate env
$ conda deactivate
CHAPTER THREE: QUALITY CONTROL
3.1 Preface
There are many sources of errors that can influence the quality of your sequencing run [ROBASKY2014].
In this quality control section we will use our skills on the command-line interface to deal with the task
of investigating the quality of sequencing data and cleaning it [KIRCHNER2014].
Note: You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
3.2 Overview
The part of the workflow we will work on in this section can be viewed in Fig. 3.1.
First, we are going to download the data we will analyse. Open a shell/terminal.
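A sketch of the download step (the URL below is only a placeholder; use the link given on the Downloads (page 77) page):

# create and enter the analysis directory
$ mkdir ~/analysis
$ cd ~/analysis
# download the data archive; replace the placeholder with the real link
$ wget -O data.tar.gz URL-FROM-DOWNLOADS-PAGE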
# uncompress it
$ tar -xvzf data.tar.gz
Fig. 3.1: The part of the workflow we will work on in this section marked in red.
Note: Should the download fail, download manually from Downloads (page 77). Download the file to
the ~/analysis directory and decompress.
The data is from a paired-end sequencing run (see Fig. 3.2) on an Illumina15 HiSeq [GLENN2011].
Thus, we have two files, one for each end of the sequenced fragment.
If you need to refresh how Illumina16 paired-end sequencing works, have a look at the Illumina
technology webpage17 and this video18.
Attention: The data we are using is “almost” raw data as it came from the machine. This data
has been post-processed in two ways already. All sequences that were identified as belonging to the
PhiX genome have been removed. This process requires some skills we will learn in later sections.
Illumina19 adapters have been removed as well already! The process is explained below and we are
going to run through it anyway.
15 http://illumina.com
16 http://illumina.com
17 http://www.illumina.com/technology/next-generation-sequencing/paired-end-sequencing_assay.html
18 https://youtu.be/HMyCqWhwB8E
19 http://illumina.com
Make use of your newly developed skills on the command-line to investigate the files in the data folder.
Todo:
1. Use the command-line to get some ideas about the file.
2. What kind of files are we dealing with?
3. How many sequence reads are in the file?
4. Assume a genome size of ~4.6 MB. Calculate the coverage based on this formula: C = LN / G
• C: Coverage
• G: is the haploid genome length in bp
• L: is the read length in bp (e.g. 2x150 paired-end = 300)
• N: is the number of reads sequenced
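As a sketch of how these numbers can be obtained on the command line (a fastq record spans four lines, so the read count is the line count divided by four; the 1,000,000 read pairs used below are only a placeholder):

# number of reads in one file = number of lines / 4
$ zcat data/anc_R1.fastq.gz | wc -l
# coverage C = L * N / G with L = 300 (2x150 bp pairs) and G = 4,600,000 bp
$ echo "300 * 1000000 / 4600000" | bc -l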
The data we receive from the sequencing is in fastq format. To remind us what this format entails, we
can revisit the fastq wikipedia-page20 !
A useful tool to decode base qualities can be found here21 .
There are a few steps one needs to do when getting the raw sequencing data from the sequencing facility:
1. Remove PhiX sequences (we are not going to do this)
2. Adapter trimming
3. Quality trimming of reads
4. Quality assessment
PhiX22 is a non-tailed bacteriophage with a single-stranded DNA genome of 5,386 nucleotides.
PhiX is used as a quality and calibration control for sequencing runs23 . PhiX is often added at a low
known concentration, spiked in the same lane along with the sample or used as a separate lane. As the
concentration of the genome is known, one can calibrate the instruments. Thus, PhiX genomic sequences
need to be removed before processing your data further as this constitutes a deliberate contamination
[MUKHERJEE2015]. The steps involve mapping all reads to the “known” PhiX genome, and removing
all of those sequence reads from the data.
20 https://en.wikipedia.org/wiki/FASTQ_format
21 http://broadinstitute.github.io/picard/explain-qualities.html
22 https://en.wikipedia.org/wiki/Phi_X_174
23 http://www.illumina.com/products/by-type/sequencing-kits/cluster-gen-sequencing-reagents/phix-control-v3.html
However, your sequencing provider might not have used PhiX; thus you need to read the protocol
carefully, or just do this step in any case.
Attention: We are not going to do this step here, as the sequencing run we are using did not
use PhiX. Please see the Read mapping (page 25) section on how to map reads against a reference
genome.
The process of sequencing DNA via Illumina24 technology requires the addition of some adapters to the
sequences. These get sequenced as well and need to be removed as they are artificial and do not belong
to the species we try to sequence. Generally speaking we have to deal with a trade-off between accuracy
of adapter removal and speed of the process. Adapter trimming does take some time.
Also, we generally have two different approaches when trimming adapters:
1. We can use a tool that takes an adapter or list of adapters and removes these from each sequence
read.
2. We can use a tool that predicts adapters and removes them from each sequence read.
For the first approach we need to know the adapter sequences that were used during the sequencing of
our samples. Normally, you should ask your sequencing provider, who should be providing this
information to you. Illumina25 itself provides a document26 that describes the adapters used for their different
technologies. Also the FastQC27 tool, we will be using later on, provides a collection of contaminants
and adapters28 .
However, often (sadly) this information is not readily available, e.g. when dealing with public data.
Thus, the second approach can be employed, that is, using a tool that predicts adapters.
Here, we are going to use the second approach with a tool called fastp to trim adapters and do quality
trimming. fastp has a few characteristics which make it a great tool, most importantly: it is pretty fast,
provides good information after the run, and can do quality trimming as well, thus saving us from using
another tool to do this.
Quality trimming of our sequencing reads will remove low-quality base calls from our reads, which is
especially important when dealing with variant identification.
# activate env
$ conda activate qc
$ mkdir trimmed
$ fastp --detect_adapter_for_pe \
--overrepresentation_analysis \
--correction --cut_right --thread 2 \
--html trimmed/anc.fastp.html --json trimmed/anc.fastp.json \
-i data/anc_R1.fastq.gz -I data/anc_R2.fastq.gz \
-o trimmed/anc_R1.fastq.gz -O trimmed/anc_R2.fastq.gz
24 http://illumina.com
25 http://illumina.com
26 https://support.illumina.com/downloads/illumina-customer-sequence-letter.html
27 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
28 https://github.com/csf-ngs/fastqc/blob/master/Contaminants/contaminant_list.txt
Todo:
1. Run fastp also on the evolved samples.
Hint: Should you not get the commands together to trim the evolved samples, have a look at the
coding solutions at Code: fastp (page 73). Should you be unable to run fastp at all to trim the data,
you can download the trimmed dataset here29. Unarchive and uncompress the files with tar -xvzf
trimmed.tar.gz.
$ fastqc --help
SYNOPSIS
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.
If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
29 https://osf.io/m3wpr/download
FastQC30 is a very simple program to run that provides information about sequence read quality.
From the webpage:
“FastQC aims to provide a simple way to do some quality control checks on raw sequence data
coming from high throughput sequencing pipelines. It provides a modular set of analyses
which you can use to give a quick impression of whether your data has any problems of
which you should be aware before doing any further analysis.”
The basic command looks like:
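For example (a sketch; the output directory must exist before the run):

$ mkdir trimmed-fastqc
$ fastqc -o trimmed-fastqc trimmed/*.fastq.gz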
Hint: The result will be a HTML page per input file that can be opened in a web-browser.
Hint: The authors of FastQC31 made some nice help pages explaining each of the plots and results you
expect to see here32 .
3.9.3 MultiQC
MultiQC33 is an excellent tool to put FastQC34 (and other tools’) results of different samples into context.
It compiles all FastQC35 results and fastp stats into one nice web-page.
The use of MultiQC36 is simple. Just provide the command with the directories where multiple results are
stored and it will compile a nice report, e.g.:
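A minimal sketch, using the directory names from this section:

$ multiqc trimmed-fastqc trimmed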
Todo:
1. Create a directory for the results –> trimmed-fastqc
2. Run FastQC on all trimmed files.
3. Visit the FastQC37 website and read about sequencing QC reports for good and bad Illumina38
sequencing runs.
4. Run MultiQC39 on the trimmed-fastqc and trimmed directories
30 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
31 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
32 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/
33 https://multiqc.info/
34 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
35 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
36 https://multiqc.info/
37 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
38 http://illumina.com
39 https://multiqc.info/
5. Compare your results to these examples (Fig. 3.3 to Fig. 3.5) of a particularly bad run (taken from
the FastQC40 website) and write down your observations with regards to your data.
6. What elements in these example figures (Fig. 3.3 to Fig. 3.5) indicate that the example is from a
bad run?
Hint: Should you not get it right, try the commands in Code: FastQC (page 73).
40 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
CHAPTER FOUR: GENOME ASSEMBLY
4.1 Preface
In this section we will use our skills on the command-line interface to create a genome assembly from
sequencing data.
Note: You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
4.2 Overview
The part of the workflow we will work on in this section can be viewed in Fig. 4.1.
$ cd ~/analysis
$ ls -1F
data/
multiqc_data/
multiqc_report.html
trimmed/
trimmed-fastqc/
Attention: If you have not run the previous section Quality control (page 9), you can download the
trimmed data needed for this section here: Downloads (page 77). Download the file to the ~/analysis
directory and decompress. Alternatively on the CLI try:
cd ~/analysis
wget -O trimmed.tar.gz https://osf.io/m3wpr/download
tar xvzf trimmed.tar.gz
Fig. 4.1: The part of the workflow we will work on in this section marked in red.
We want to create a genome assembly for our ancestor. We are going to use the quality trimmed forward
and backward DNA sequences and use a program called SPAdes45 to build a genome assembly.
Todo:
1. Discuss briefly why we are using the ancestral sequences to create a reference genome as opposed
to the evolved line.
We are going to use a program called SPAdes46 for assembling our genome. In a recent evaluation of
assembly software, SPAdes47 was found to be a good choice for fungal genomes [ABBAS2014]. It is also
simple to install and use.
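The general shape of a SPAdes run looks like this (a sketch; the output directory name is just an example):

$ mkdir assembly
$ spades.py -o assembly/spades-default -1 trimmed/anc_R1.fastq.gz -2 trimmed/anc_R2.fastq.gz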
Todo:
1. Run SPAdes49 with default parameters on the ancestor’s trimmed reads
2. Read in the SPAdes50 manual about assembling with 2x150bp reads
3. Run SPAdes51 a second time but use the options suggested at the SPAdes52 manual section 3.453
for assembling 2x150bp paired-end reads. Use a different output directory assembly/spades-150
for this run.
45 http://bioinf.spbau.ru/spades
46 http://bioinf.spbau.ru/spades
47 http://bioinf.spbau.ru/spades
48 http://bioinf.spbau.ru/spades
49 http://bioinf.spbau.ru/spades
50 http://bioinf.spbau.ru/spades
51 http://bioinf.spbau.ru/spades
52 http://bioinf.spbau.ru/spades
53 http://cab.spbu.ru/files/release3.14.0/manual.html#sec3.4
Hint: Should you not get it right, try the commands in Code: SPAdes assembly (trimmed data) (page 74).
Quast54 (QUality ASsessment Tool) [GUREVICH2013] evaluates genome assemblies by computing
various metrics, including:
• N50: length for which the collection of all contigs of that length or longer covers at least 50% of
assembly length
• NG50: like N50, but with respect to the length of the reference genome rather than the assembly
• NA50 and NGA50: like N50 and NG50, but using aligned blocks instead of contigs
• mis-assemblies: mis-assembled and unaligned contigs or contig bases
• genes and operons covered
It is easy with Quast55 to compare these measures among several assemblies. The program can be used
on their website56 .
Run Quast57 with both assembly scaffolds.fasta files to compare the results.
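A sketch of such a comparison (output directory and assembly paths are assumptions based on the SPAdes runs above):

$ quast.py -o assembly/quast assembly/spades-default/scaffolds.fasta assembly/spades-150/scaffolds.fasta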
Todo:
1. Compare the results of Quast58 with regards to the two different assemblies.
2. Which one do you prefer and why?
Todo:
1. To see if our trimming procedure has an influence on our assembly, run the same command you
used on the trimmed data on the original untrimmed data.
2. Run Quast59 on the assembly and compare the statistics to the one derived for the trimmed data
set. Write down your observations.
54 http://quast.bioinf.spbau.ru/
55 http://quast.bioinf.spbau.ru/
56 http://quast.bioinf.spbau.ru/
57 http://quast.bioinf.spbau.ru/
58 http://quast.bioinf.spbau.ru/
59 http://quast.bioinf.spbau.ru/
Hint: Should you not get it right, try the commands in Code: SPAdes assembly (original data) (page 74).
60 https://dx.doi.org/10.6084/m9.figshare.2972323.v1
61 http://bioinf.spbau.ru/spades
62 http://quast.bioinf.spbau.ru/
63 https://rrwick.github.io/Bandage/
CHAPTER FIVE: READ MAPPING
5.1 Preface
In this section we will use our skills on the command-line interface to map our reads from the evolved
line to our ancestral reference genome.
Note: You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
5.2 Overview
The part of the workflow we will work on in this section can be viewed in Fig. 5.1.
After studying this section of the tutorial you should be able to:
1. Explain the process of sequence read mapping.
2. Use bioinformatics tools to map sequencing reads to a reference genome.
3. Filter mapped reads based on quality.
$ cd ~/analysis
# create a mapping result directory
$ mkdir mappings
$ ls -1F
assembly/
data/
mappings/
multiqc_data/
multiqc_report.html
trimmed/
trimmed-fastqc/
Fig. 5.1: The part of the workflow we will work on in this section marked in red.
Attention: If you have not run the previous sections on Quality control (page 9) and Genome
assembly (page 19), you can download the trimmed data and the genome assembly needed for this
section here: Downloads (page 77). Download the files to the ~/analysis directory and decompress.
Alternatively on the CLI try:
cd ~/analysis
wget -O trimmed.tar.gz https://osf.io/m3wpr/download
tar xvzf trimmed.tar.gz
wget -O assembly.tar.gz https://osf.io/t2zpm/download
tar xvzf assembly.tar.gz
We want to map the sequencing reads to the ancestral reference genome. We are going to use the quality
trimmed forward and backward DNA sequences of the evolved line and use a program called BWA70 to
map the reads.
Todo:
1. Discuss briefly why we are using the ancestral genome as a reference genome as opposed to a
genome for the evolved line.
Todo: In the assembly section at “Genome assembly (page 19)”, we created a genome assembly.
However, we actually used sub-sampled data, as otherwise the assemblies would have taken a long time
to finish. To continue, please download the assembly created on the complete dataset (Downloads (page 77)).
Unarchive and uncompress the files with tar -xvzf assembly.tar.gz.
We are going to use a program called BWA71 to map our reads to our genome.
It is simple to install and use.
70 http://bio-bwa.sourceforge.net/
71 http://bio-bwa.sourceforge.net/
5.6 BWA
5.6.1 Overview
BWA72 is a short read aligner, that can take a reference genome and map single- or paired-end sequence
data to it [LI2009]. It requires an indexing step in which one supplies the reference genome and BWA73
will create an index that in the subsequent steps will be used for aligning the reads to the reference
genome. While this step can take some time, the good thing is the index can be reused over and over.
The general command structure of the BWA74 tools we are going to use are shown below:
# indexing
$ bwa index path/to/reference-genome.fa
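The mapping step itself follows the same pattern; a sketch with example file names:

# mapping paired-end reads; the result is written to a sam-file
$ bwa mem path/to/reference-genome.fa read1.fastq.gz read2.fastq.gz > mapping.sam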
Todo: Create a BWA75 index for our reference genome assembly. Attention! Remember which file you
need to submit to BWA76.
Hint: Should you not get it right, try the commands in Code: BWA indexing (page 74).
Note: Should you be unable to run BWA77 indexing on the data, you can download the index from
Downloads (page 77). Unarchive and uncompress the files with tar -xvzf bwa-index.tar.gz.
Now that we have created our index, it is time to map the trimmed sequencing reads of our two evolved
lines to the reference genome.
Todo: Use the correct bwa mem command structure from above and map the reads of the two evolved
lines to the reference genome.
72 http://bio-bwa.sourceforge.net/
73 http://bio-bwa.sourceforge.net/
74 http://bio-bwa.sourceforge.net/
75 http://bio-bwa.sourceforge.net/
76 http://bio-bwa.sourceforge.net/
77 http://bio-bwa.sourceforge.net/
Hint: Should you not get it right, try the commands in Code: BWA mapping (page 74).
BWA78, like most mappers, will produce a mapping file in sam-format. Have a look into the sam-file that
was created. A quick overview of the sam-format can be found here79 and even more
information can be found here80. Briefly, first there are a lot of header lines. Then, for each read that
mapped to the reference, there is one line.
The columns of such a line in the mapping file are described in Table 5.1.
Such a line basically defines the read, the position within the reference genome where the read mapped, and
the quality of the mapping.
Because aligners can sometimes leave unusual SAM flag81 information on SAM records, it is helpful when
working with many tools to first clean up read pairing information and flags with SAMtools82. We are
also going to produce compressed bam output for efficient storing of and access to the mapped reads.
Note, samtools fixmate expects name-sorted input files, which we can achieve with samtools sort -n.
$ samtools sort -n -O sam mappings/evol1.sam | samtools fixmate -m -O bam - mappings/evol1.fixmate.bam
78 http://bio-bwa.sourceforge.net/
79 http://bio-bwa.sourceforge.net/bwa.shtml#4
80 http://samtools.github.io/hts-specs/SAMv1.pdf
81 http://bio-bwa.sourceforge.net/bwa.shtml#4
82 http://samtools.sourceforge.net/
• -m: Add ms (mate score) tags. These are used by markdup (below) to select the best reads to keep.
• -O bam: specifies that we want compressed bam output from fixmate
Attention: The step of sam to bam-file conversion might take a few minutes to finish, depending on
how big your mapping file is.
We will be using the SAM flag83 information later below to extract specific alignments.
Once we have the bam-file, we can also delete the original sam-file as it requires too much space and we can
always recreate it from the bam-file.
$ rm mappings/evol1.sam
5.8.2 Sorting
We are going to use SAMtools85 again to sort the bam-file into coordinate order:
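A sketch, continuing with the file names used above:

$ samtools sort -O bam -o mappings/evol1.sorted.bam mappings/evol1.fixmate.bam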
In this step we remove duplicate reads. The main purpose of removing duplicates is to mitigate the
effects of PCR amplification bias introduced during library construction. It should be noted that this
step is not always recommended. It depends on the research question. In SNP calling it is a good idea
to remove duplicates, as the statistics used in the tools that subsequently call SNPs expect this (most
tools anyway). However, for other research questions that use mapping, you might not want to remove
duplicates, e.g. RNA-seq.
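A sketch of the duplicate removal with SAMtools markdup (-r removes duplicates instead of only flagging them; it relies on the ms tags added by fixmate above):

$ samtools markdup -r mappings/evol1.sorted.bam mappings/evol1.sorted.dedup.bam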
Note: Should you be unable to do the post-processing steps, you can download the mapped data from
Downloads (page 77).
83 http://bio-bwa.sourceforge.net/bwa.shtml#4
84 http://broadinstitute.github.io/picard/explain-flags.html
85 http://samtools.sourceforge.net/
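Basic mapping statistics for the de-duplicated bam-file can be obtained with samtools flagstat; a minimal sketch:

$ samtools flagstat mappings/evol1.sorted.dedup.bam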
Todo: Look at the mapping statistics and understand their meaning86 . Discuss your results. Explain
why we may find mapped reads that have their mate mapped to a different chromosome/contig? Can
they be used for something?
For the sorted bam-file we can get the read depth at all positions of the reference genome, e.g. how many
reads are overlapping each genomic position.
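A sketch with samtools depth (the exact contig name is an assumption; check your assembly, e.g. with grep '>' assembly/scaffolds.fasta, for the real SPAdes header):

# depth at every position, keeping only the contig of interest
$ samtools depth mappings/evol1.sorted.dedup.bam | grep 'NODE_20_' > mappings/NODE20.depth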
Todo: Extract the depth values for contig 20 and load the data into R, calculate some statistics of our
scaffold.
Now we quickly use some R87 to make a coverage plot for contig NODE20. Open an R88 shell by typing R
on the command-line of the shell.
# read the depth table created above (file name assumed from the samtools depth sketch)
x <- read.table('mappings/NODE20.depth', sep='\t', header=FALSE)

# to save a plot
png('mappings/covNODE20.png', width = 1200, height = 500)
plot(x[,2], x[,3], col = ifelse(x[,3] < 20,'red','black'), pch=19, xlab='position', ylab='coverage')
dev.off()
The result plot will be looking similar to the one in Fig. 5.2
Todo: Look at the created plot. Explain why it makes sense that you find relatively bad coverage at the
beginning and the end of the contig.
86 https://www.biostars.org/p/12475/
87 https://www.r-project.org/
88 https://www.r-project.org/
Fig. 5.2: An example coverage plot for a contig, with regions of coverage below 20 reads highlighted in red.
For a more in-depth analysis of the mappings, one can use QualiMap89 [OKO2015].
QualiMap90 examines sequencing alignment data in SAM/BAM files according to the features of the
mapped reads and provides an overall view of the data that helps to detect biases in the sequencing
and/or mapping of the data and eases decision-making for further analysis.
Run QualiMap91 with:
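A sketch (bamqc is the QualiMap mode for analysing a single bam-file):

$ qualimap bamqc -bam mappings/evol1.sorted.dedup.bam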
This will create a report in the mapping folder. See this webpage92 to get help on the sections in the
report.
Todo: Investigate the mapping of the evolved sample. Write down your observations.
It is important to remember that the mapping commands we used above, without additional parameters
to sub-select specific alignments (e.g. for Bowtie293 there are options like --no-mixed, which suppresses
unpaired alignments for paired reads or --no-discordant, which suppresses discordant alignments for
paired reads, etc.), are going to output all reads, including unmapped reads, multi-mapping reads,
unpaired reads, discordant read pairs, etc. in one file. We can sub-select from the output reads we
want to analyse further using SAMtools94 .
89 http://qualimap.bioinfo.cipf.es/
90 http://qualimap.bioinfo.cipf.es/
91 http://qualimap.bioinfo.cipf.es/
92 http://qualimap.bioinfo.cipf.es/doc_html/analysis.html#output
93 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
94 http://samtools.sourceforge.net/
Todo: Explain what concordant and discordant read pairs are. Look at the Bowtie295 manual.
We can select read-pairs that have been mapped in a correct manner (same chromosome/contig, correct
orientation to each other, sensible distance between the reads).
Attention: We show the command here, but we are not going to use it.
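A sketch of such a selection with SAMtools (flag 0x2 marks reads mapped in a proper pair):

$ samtools view -h -b -f 2 mappings/evol1.sorted.dedup.bam > mappings/evol1.sorted.dedup.concordant.bam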
Todo: Our final aim is to identify variants. For a particular class of variants, it is not the best idea to
only focus on concordant reads. Why is that?
In this section we want to sub-select reads based on the quality of the mapping. It seems a reasonable
idea to only keep well-mapping reads. As the SAM-format contains at column 5 the MAPQ value,
which we established earlier is the Phred-scaled “MAPping Quality”, this seems easily achieved. The
formula to calculate the MAPQ value is: MAPQ = -10 * log10(p), where p is the probability that the
read is mapped wrongly. However, there is a problem! While the MAPQ information would be very
helpful indeed, the way that various tools implement this value differs. A good overview can be
found here97. The bottom line is that we need to be aware that different tools use this value in different
ways and it is good to know what information is encoded in the value. Once you dig deeper into
the mechanics of the MAPQ implementation it becomes clear that this is not an easy topic. If you want
to know more about the MAPQ topic, please follow the link above.
For the sake of going forward, we will sub-select reads with at least medium quality as defined by
Bowtie298 :
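A sketch using a MAPQ cut-off of 20 (the resulting file name matches the one used later in the variant-calling chapter):

$ samtools view -h -b -q 20 mappings/evol1.sorted.dedup.bam > mappings/evol1.sorted.dedup.q20.bam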
Hint: I will repeat here a recommendation given at the source link99 above, as it is a good one: If you are
unsure what MAPQ scoring scheme is being used in your own data then you can plot out the MAPQ
distribution in a BAM file using programs like the mentioned QualiMap100 or similar programs. This will
at least show you the range and frequency with which different MAPQ values appear and may help
identify a suitable threshold you may want to use.
95 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
96 http://bio-bwa.sourceforge.net/bwa.shtml#4
97 https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/
98 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
99 https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/
100 http://qualimap.bioinfo.cipf.es/
Todo: Please repeat the whole process for the second evolved strain => mapping and post-processing.
Note: Should you be unable to process the second evolved strain look at the coding solutions here:
Code: Mapping post-processing (page 74)
We could decide to use Kraken2101 like in section Taxonomic investigation (page 35) to classify all
unmapped sequence reads, identify the species they are coming from, and test for contamination.
Let's see how we can get the unmapped portion of the reads from the bam-file:
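A sketch (flag 0x4 marks unmapped reads; the output file names are examples):

# extract reads that did not map
$ samtools view -b -f 4 mappings/evol1.sorted.dedup.bam > mappings/evol1.unmapped.bam
# collate by read name and convert back to fastq
$ samtools collate -u -O mappings/evol1.unmapped.bam | samtools fastq -1 mappings/evol1.unmapped.R1.fastq.gz -2 mappings/evol1.unmapped.R2.fastq.gz -s mappings/evol1.unmapped.single.fastq.gz -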
101 https://www.ccb.jhu.edu/software/kraken2/
102 http://bio-bwa.sourceforge.net/bwa.shtml#4
CHAPTER SIX: TAXONOMIC INVESTIGATION
6.1 Preface
We want to investigate if there are sequences of other species in our collection of sequenced DNA pieces.
We hope that most of them are from our species that we try to study, i.e. the DNA that we have extracted
and amplified. This might be a way of quality control, e.g. have the samples been contaminated? Let's
investigate if we find sequences from other species in our sequence set.
We will use the tool Kraken2105 to assign taxonomic classifications to our sequence reads. Let us see if
we can identify some sequences from other species.
Note: You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
6.2 Overview
The part of the workflow we will work on in this section can be viewed in Fig. 6.1.
$ cd ~/analysis
$ ls -1F
assembly/
data/
mappings/
multiqc_data
trimmed/
trimmed-fastqc/
Attention: If you have not run the previous section Read mapping (page 25), you can download the
unmapped sequencing data needed for this section here: Downloads (page 77). Download the file to
the ~/analysis directory and decompress. Alternatively on the CLI try:
cd ~/analysis
wget -O mappings.tar.gz https://osf.io/g5at8/download
tar xvzf mappings.tar.gz
105 https://www.ccb.jhu.edu/software/kraken2/
Fig. 6.1: The part of the workflow we will work on in this section marked in red.
6.4 Kraken2
We will be using a tool called Kraken2106 [WOOD2014]. This tool uses k-mers to assign a taxonomic
label, in the form of an NCBI Taxonomy107 ID, to the sequence (if possible). The taxonomic label is assigned based
on the similarity of the k-mer content of the sequence in question to the k-mer content of reference genome sequences.
The result is a classification of the sequence in question to the most likely taxonomic label. If the k-mer
content is not similar to any genomic sequence in the database used, it will not assign any taxonomic
label.
6.4.1 Installation
Use conda in the same fashion as before to install Kraken2108 . However, we are going to install kraken
into its own environment:
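A sketch (Bracken, which we use later in this chapter, can go into the same environment):

$ conda create -n kraken kraken2 bracken
$ conda activate kraken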
Now we create a directory where we are going to do the analysis and we will change into that directory
too.
# create dir
$ mkdir kraken
$ cd kraken
Now we need to create or download a Kraken2109 database that can be used to assign the taxonomic
labels to sequences. We opt for downloading the pre-built “minikraken2” database from the Kraken2110
website:
$ curl -O ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz
Attention: Should the download fail, please find links to alternative locations on the Downloads
(page 77) page.
Note: The “minikraken2” database was created from bacterial, viral and archaeal sequences. What are
the implications for us when we are trying to classify our sequences?
106 https://www.ccb.jhu.edu/software/kraken2/
107 https://www.ncbi.nlm.nih.gov/taxonomy
108 https://www.ccb.jhu.edu/software/kraken2/
109 https://www.ccb.jhu.edu/software/kraken2/
110 https://www.ccb.jhu.edu/software/kraken2/
6.4.2 Usage
Now that we have installed Kraken2111 and downloaded and extracted the minikraken2 database, we
can attempt to investigate the sequences we got back from the sequencing provider for species other than
the one they should contain. We call the Kraken2112 tool and specify the database and the fasta-file with the
sequences it should use. The general command structure looks like this:
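In its simplest form (a sketch):

$ kraken2 --db PATH_TO_DB_DIR sequences.fasta > results.kraken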
However, we may have fastq-files, so we need to use --fastq-input which tells Kraken2113 that it is
dealing with fastq-formatted files. The --gzip-compressed flag specifies that the input-files are compressed.
In addition, we are dealing with paired-end data, which we can tell Kraken2114 with the switch --paired.
Here, we are investigating one of the unmapped paired-end read files of the evolved line.
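A sketch of such a run (the database directory is the extracted minikraken2 download from above; the unmapped fastq file names are assumptions carried over from the read-mapping chapter):

$ kraken2 --db minikraken2_v2_8GB_201904_UPDATE --threads 2 --report evol1.report --gzip-compressed --paired ../mappings/evol1.unmapped.R1.fastq.gz ../mappings/evol1.unmapped.R2.fastq.gz > evol1.kraken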
This classification may take a while, depending on how many sequences we are going to classify. The
resulting content of the file evol1.kraken looks similar to the following example:
Each sequence classified by Kraken2115 results in a single line of output. Output lines contain five tab-
delimited fields; from left to right, they are:
1. C/U: one letter code indicating that the sequence was either classified or unclassified.
2. The sequence ID, obtained from the FASTA/FASTQ header.
3. The taxonomy ID Kraken2116 used to label the sequence; this is 0 if the sequence is unclassified
and otherwise should be the NCBI Taxonomy117 identifier.
4. The length of the sequence in bp.
5. A space-delimited list indicating the lowest common ancestor (in the taxonomic tree) mapping of
each k-mer in the sequence. For example, 562:13 561:4 A:31 0:1 562:3 would indicate that:
• the first 13 k-mers mapped to taxonomy ID #562
• the next 4 k-mers mapped to taxonomy ID #561
• the next 31 k-mers contained an ambiguous nucleotide
• the next k-mer was not in the database
• the last 3 k-mers mapped to taxonomy ID #562
111 https://www.ccb.jhu.edu/software/kraken2/
112 https://www.ccb.jhu.edu/software/kraken2/
113 https://www.ccb.jhu.edu/software/kraken2/
114 https://www.ccb.jhu.edu/software/kraken2/
115 https://www.ccb.jhu.edu/software/kraken2/
116 https://www.ccb.jhu.edu/software/kraken2/
117 https://www.ncbi.nlm.nih.gov/taxonomy
We can use the webpage NCBI TaxIdentifier120 to quickly get the names for the taxonomy identifiers.
However, this is impractical as we are potentially dealing with many sequences. Kraken2121 has some
scripts that help us understand our results better.
Because we used the Kraken2122 switch --report FILE, we also got a sample-wide report of all
taxa found. This is much better for getting an overview of what was found.
The first few lines of an example report are shown below.
The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from
left-to-right, are as follows:
1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder,
(F)amily, (G)enus, or (S)pecies. All other ranks are simply “-“.
5. NCBI Taxonomy123 ID
6. The indented scientific name
Note: If you want to compare the taxa content of different samples to one another, one can create a report
whose structure is always the same for all samples, disregarding which taxa are found (obviously the
percentages and numbers will be different).
We can create such a report using the option --report-zero-counts which will print out all taxa (instead
of only those found). We then sort the taxa according to taxa-ids (column 5), e.g. sort -n -k5.
The report is not ordered according to taxa ids and contains all taxa in the database, even if they have
not been found in our sample and are thus zero. The columns are the same as in the former report,
however, we have more rows and they are now differently sorted, according to the NCBI Taxonomy124
id.
118 https://www.ccb.jhu.edu/software/kraken2/
119 https://www.ccb.jhu.edu/software/kraken2/index.shtml?t=manual
120 https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
121 https://www.ccb.jhu.edu/software/kraken2/
122 https://www.ccb.jhu.edu/software/kraken2/
123 https://www.ncbi.nlm.nih.gov/taxonomy
124 https://www.ncbi.nlm.nih.gov/taxonomy
6.4.4 Bracken
Bracken125 stands for Bayesian Re-estimation of Abundance with KrakEN, and is a statistical method
that computes the abundance of species in DNA sequences from a metagenomics sample [LU2017].
Bracken126 uses the taxonomy labels assigned by Kraken2127 (see above) to estimate the number of
reads originating from each species present in a sample. Kraken classifies reads to the best matching
location in the taxonomic tree, but does not estimate abundances of species. Combined with the Kraken
classifier, Bracken129 will produce more accurate species- and genus-level abundance estimates than
Kraken2130 alone.
The use of Bracken131 subsequent to Kraken2132 is optional but might improve on the Kraken2133 results.
Installation
We installed Bracken134 already together with Kraken2135 above, so it should be ready to be used. We
also downloaded the Bracken136 files together with the minikraken2 database above, so we are good to
go.
Usage
• -l S: denotes the level we want to look at. S stands for species but other levels are available.
• -d PATH_TO_DB_DIR: specifies the path to the Kraken2140 database that should be used.
Let us apply Bracken141 to the example above:
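A sketch, using the Kraken2 report produced above as input:

$ bracken -d minikraken2_v2_8GB_201904_UPDATE -i evol1.report -l S -o evol1.bracken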
The important column is the new_est_reads, which gives the newly estimated reads.
6.5 Centrifuge
We can also use another tool by the same group called Centrifuge142 [KIM2017]. This tool uses a novel
indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index,
optimized specifically for the metagenomic classification problem, to assign a taxonomic label, in the form of
an NCBI Taxonomy143 ID, to the sequence (if possible). The result is a classification of the sequence in question
to the most likely taxonomic label. If the search sequence is not similar to any genomic sequence in the
database used, it will not assign any taxonomic label.
Note: I would normally use Kraken2144 and only prefer Centrifuge145 if memory and/or speed are an
issue.
6.5.1 Installation
Now we create a directory where we are going to do the analysis and we will change into that directory
too.
# create dir
$ mkdir centrifuge
$ cd centrifuge
Now we need to create or download a Centrifuge147 database that can be used to assign the taxonomic
labels to sequences. We opt for downloading the pre-built database from the Centrifuge148 website:
$ curl -O ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
142 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
143 https://www.ncbi.nlm.nih.gov/taxonomy
144 https://www.ccb.jhu.edu/software/kraken2/
145 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
146 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
147 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
148 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
Attention: Should the download fail, please find links to alternative locations on the Downloads
(page 77) page.
Note: The database we will be using was created from bacterial and archaeal sequences only. What are
the implications for us when we are trying to classify our sequences?
6.5.2 Usage
Now that we have installed Centrifuge149 and downloaded and extracted the pre-built database, we can
attempt to investigate the sequences we got back from the sequencing provider for species other than the
one they should contain. We call the Centrifuge150 tool and specify the database and the fastq-files with the
sequences it should use. The general command structure looks like this:
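A sketch (the index prefix is the extracted p_compressed+h+v download; the fastq paths are assumptions carried over from the read-mapping chapter):

$ centrifuge -x p_compressed+h+v -1 ../mappings/evol1.unmapped.R1.fastq.gz -2 ../mappings/evol1.unmapped.R2.fastq.gz -S evol1-results.txt --report-file evol1-report.txt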
This classification may take a moment, depending on how many sequences we are going to classify. The
resulting content of the file evol1-results.txt looks similar to the following example:
readID                                        seqID       taxID   score   2ndBestScore  hitLength  queryLength  numMatches
M02810:197:000000000-AV55U:1:1101:15316:8461  cid|1747    1747    1892    0             103        135          1
M02810:197:000000000-AV55U:1:1101:15563:3249  cid|161879  161879  18496   0             151        151          1
M02810:197:000000000-AV55U:1:1101:19743:5166  cid|564     564     10404   10404         117        151          2
M02810:197:000000000-AV55U:1:1101:19743:5166  cid|562     562     10404   10404         117        151          2
Each sequence classified by Centrifuge151 results in a single line of output. Output lines contain eight
tab-delimited fields; from left to right, they are according to the Centrifuge152 website:
1. The read ID from a raw sequencing read.
2. The sequence ID of the genomic sequence, where the read is classified.
3. The taxonomic ID of the genomic sequence in the second column.
4. The score for the classification, which is the weighted sum of hits.
5. The score for the next best classification.
6. An approximate number of base pairs of the read that match the genomic sequence (hitLength).
7. The length of a read or the combined length of mate pairs (queryLength).
8. The number of classifications for this read, indicating how many assignments were made.
149 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
150 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
151 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
152 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
Centrifuge report
The command above creates a Centrifuge153 report automatically for us. It contains an overview of the
identified taxa and their abundances in your supplied sequences (normalised to genomic length):
name taxID taxRank genomeSize numReads numUniqueReads abundance
Pseudomonas aeruginosa 287 species 22457305 1 0 0.0
Pseudomonas fluorescens 294 species 14826544 1 1 0.0
Pseudomonas putida 303 species 6888188 1 1 0.0
Ralstonia pickettii 329 species 6378979 3 2 0.0
Pseudomonas pseudoalcaligenes 330 species 4691662 1 1 0.0171143
Each line contains seven tab-delimited fields; from left to right, they are according to the Centrifuge154
website:
1. The name of a genome, or the name corresponding to a taxonomic ID (the second column) at a
rank higher than the strain.
2. The taxonomic ID.
3. The taxonomic rank.
4. The length of the genome sequence.
5. The number of reads classified to this genomic sequence including multi-classified reads.
6. The number of reads uniquely classified to this genomic sequence.
7. The proportion of this genome normalized by its genomic length.
Kraken-like report
If we would like to generate a report like the one produced by the Kraken2155 tool above, we can do it like this:
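A sketch using the centrifuge-kreport script that ships with Centrifuge:

$ centrifuge-kreport -x p_compressed+h+v evol1-results.txt > evol1-kreport.txt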
0.00 0 0 U 0 unclassified
78.74 163 0 - 1 root
78.74 163 0 - 131567 cellular organisms
78.74 163 0 D 2 Bacteria
54.67 113 0 P 1224 Proteobacteria
36.60 75 0 C 1236 Gammaproteobacteria
31.18 64 0 O 91347 Enterobacterales
30.96 64 0 F 543 Enterobacteriaceae
23.89 49 0 G 561 Escherichia
23.37 48 48 S 562 Escherichia coli
0.40 0 0 S 564 Escherichia fergusonii
0.12 0 0 S 208962 Escherichia albertii
3.26 6 0 G 570 Klebsiella
3.14 6 6 S 573 Klebsiella pneumoniae
0.12 0 0 S 548 [Enterobacter] aerogenes
2.92 6 0 G 620 Shigella
1.13 2 2 S 623 Shigella flexneri
0.82 1 1 S 624 Shigella sonnei
0.50 1 1 S 1813821 Shigella sp. PAMC 28760
0.38 0 0 S 621 Shigella boydii
153 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
154 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
155 https://www.ccb.jhu.edu/software/kraken2/
This gives a similar (not the same) report as the Kraken2156 tool. The report is tab-delimited, with one
line per taxon. The fields of the output, from left-to-right, are as follows:
1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily,
(G)enus, or (S)pecies. All other ranks are simply “-“.
5. NCBI Taxonomy ID
6. The indented scientific name
We use the Krona157 tools to create a nice interactive visualisation of the taxa content of our sample
[ONDOV2011]. Fig. 6.2 shows an example (albeit an artificial one) snapshot of the visualisation Krona158
provides: a snapshot of an interactive web-page similar to the one we try to create.
6.6.1 Installation
First some house-keeping to make the Krona160 installation work. Do not worry too much about what is
happening here.
# we create a directory in our home where the krona database will live
$ mkdir -p ~/krona/taxonomy
156 https://www.ccb.jhu.edu/software/kraken2/
157 https://github.com/marbl/Krona/wiki
158 https://github.com/marbl/Krona/wiki
159 https://github.com/marbl/Krona/wiki
160 https://github.com/marbl/Krona/wiki
6.6.2 Build the taxonomy database
We need to build a taxonomy database for Krona161. However, if this fails, we will skip this step and just download a pre-built one. Let's first try to build one.
$ ktUpdateTaxonomy.sh ~/krona/taxonomy
Attention: Should this fail, we can download a pre-built database from the Downloads (page 77) page via a browser.
# we move the unzipped file to the taxonomy directory we specified in the previous step.
$ mv taxonomy.tab ~/krona/taxonomy
6.6.3 Visualise
Now, we use the tool ktImportTaxonomy from the Krona162 tools to create the html web-page. We first need to build a two-column file (read_id<tab>tax_id) as input for the ktImportTaxonomy tool. We will do this by cutting the relevant columns out of either the Kraken2163 or Centrifuge164 results:
# Kraken2
$ cd kraken
$ cat evol1.kraken | cut -f 2,3 > evol1.kraken.krona
$ ktImportTaxonomy evol1.kraken.krona
$ firefox taxonomy.krona.html
# Centrifuge
$ cd centrifuge
$ cat evol1-results.txt | cut -f 1,3 > evol1-results.krona
$ ktImportTaxonomy evol1-results.krona
$ firefox taxonomy.krona.html
What happens here is that we extract the second and third columns from the Kraken2165 results (or the first and third columns from the Centrifuge results). Afterwards, we feed these to the Krona166 script and open the resulting web-page in a browser. Done!
161 https://github.com/marbl/Krona/wiki
162 https://github.com/marbl/Krona/wiki
163 https://www.ccb.jhu.edu/software/kraken2/
164 http://www.ccb.jhu.edu/software/centrifuge/index.shtml
165 https://www.ccb.jhu.edu/software/kraken2/
166 https://github.com/marbl/Krona/wiki
SEVEN
VARIANT CALLING
7.1 Preface
In this section we will use our genome assembly based on the ancestor and call genetic variants in the
evolved line [NIELSEN2011].
7.2 Overview
The part of the workflow we will work on in this section can be viewed in Fig. 7.1.
$ cd ~/analysis
$ ls -1F
assembly/
data/
kraken/
mappings/
multiqc_data/
trimmed/
trimmed-fastqc/
Attention: If you have not run the previous section on Read mapping (page 25), you can download the mapped data needed for this section here: Downloads (page 77). Download the file to the ~/analysis directory and decompress. Alternatively, on the CLI try:
cd ~/analysis
wget -O mappings.tar.gz https://osf.io/g5at8/download
tar xvzf mappings.tar.gz
Fig. 7.1: The part of the workflow we will work on in this section marked in red.
Tools we are going to use in this section and how to install them, if you have not done so yet:
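The tool list itself is not shown in this copy. A conda-based setup covering the programs used below could look like the following sketch; the environment name and the exact package selection are assumptions.
# create and activate an environment with the variant-calling tools (names are assumptions)
$ conda create -n var samtools bamtools freebayes bcftools vcflib rtg-tools
$ conda activate var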
7.6 Preprocessing
We first need to make an index of our reference genome as this is required by the SNP caller. Given a
scaffold/contig file in fasta-format, e.g. scaffolds.fasta which is located in the directory assembly/,
use SAMtools171 to do this:
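The command is the same as in the Code: Variant calling section (page 75):
$ samtools faidx assembly/scaffolds.fasta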
Furthermore, we need to pre-process our mapping files and create a bam-index file (.bai) for the bam-file we want to work with:
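As in the Code: Variant calling section (page 75), we use bamtools for this:
$ bamtools index -in mappings/evol1.sorted.dedup.q20.bam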
$ mkdir variants
7.7.1 Freebayes
We can call variants with a tool called freebayes172. Given a reference genome scaffold file in fasta format (e.g. scaffolds.fasta), its index in .fai format, a mapping file (.bam) and a mapping index (.bai), we can call variants with freebayes173 like so:
# Now we call variants and pipe the results into a new file
$ freebayes -p 1 -f assembly/scaffolds.fasta mappings/evol1.sorted.dedup.q20.bam > variants/evol1.freebayes.vcf
7.8 Post-processing
171 http://samtools.sourceforge.net/
172 https://github.com/ekg/freebayes
173 https://github.com/ekg/freebayes
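freebayes writes a standard vcf file. A minimal way to inspect it (a sketch) is:
$ less variants/evol1.freebayes.vcf
The header of the file looks similar to the excerpt below: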
##fileformat=VCFv4.2
##fileDate=20200122
##source=freeBayes v1.3.1-dirty
##reference=assembly/scaffolds.fasta
##contig=<ID=NODE_1_length_348724_cov_30.410613,length=348724>
##contig=<ID=NODE_2_length_327290_cov_30.828326,length=327290>
##contig=<ID=NODE_3_length_312063_cov_30.523209,length=312063>
##contig=<ID=NODE_4_length_202800_cov_31.500777,length=202800>
##contig=<ID=NODE_5_length_164027_cov_28.935175,length=164027>
##contig=<ID=NODE_6_length_144088_cov_29.907986,length=144088>
[... the per-variant records that followed here are truncated in this copy; their INFO fields contain annotations such as MQM, ODDS, QA, QR, RO, RPL, RPR, SAF, SAP and SAR, some of which we will use for filtering below ...]
7.8.2 Statistics
Now we can gather some statistics and filter our variant calls.
First, to prepare our vcf-file for querying we need to index it with tabix:
# compress file
$ bgzip variants/evol1.freebayes.vcf
# index
$ tabix -p vcf variants/evol1.freebayes.vcf.gz
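The command that produced the statistics below is not shown here; their layout matches the vcfstats tool from RTG Tools, so, assuming rtg is installed, they can be reproduced with a sketch like:
$ rtg vcfstats variants/evol1.freebayes.vcf.gz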
Location : variants/evol1.freebayes.vcf.gz
Failed Filters : 0
Passed Filters : 35233
SNPs : 55
MNPs : 6
Insertions : 3
Deletions : 5
Indels : 0
Same as reference : 35164
SNP Transitions/Transversions: 0.83 (25/30)
Total Haploid : 69
Haploid SNPs : 55
Haploid MNPs : 6
Haploid Insertions : 3
Haploid Deletions : 5
Haploid Indels : 0
Insertion/Deletion ratio : 0.60 (3/5)
Indel/SNP+MNP ratio : 0.13 (8/61)
However, we can also run BCFtools174 to extract more detailed statistics about our variant calls:
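The bcftools call itself is not shown here; a sketch that produces the .stats file used by plot-vcfstats below would be:
$ bcftools stats variants/evol1.freebayes.vcf.gz > variants/evol1.freebayes.vcf.gz.stats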
$ mkdir variants/plots
$ plot-vcfstats -p variants/plots/ variants/evol1.freebayes.vcf.gz.stats
• -p: the prefix for the output files; add a slash at the end to create a new directory.
174 http://www.htslib.org/doc/bcftools.html
Variant filtration is a big topic in itself [OLSEN2015]. There is no consensus yet, and research on how to best filter variants is ongoing.
We will do some simple filtration procedures here. For one, we can filter out low-quality calls.
Here, we only include variants that have a quality of at least 30.
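The first filtering command is not preserved here (the "or use vcflib" comment below implies an alternative was shown); one way to achieve the same with bcftools (a sketch) is:
$ bcftools view -i 'QUAL>=30' variants/evol1.freebayes.vcf.gz -O z -o variants/evol1.freebayes.q30.vcf.gz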
# or use vcflib
$ zcat variants/evol1.freebayes.vcf.gz | vcffilter -f "QUAL >= 30" | gzip > variants/evol1.freebayes.q30.vcf.gz
• -f "QUAL >= 30": we only include variants that have been called with quality >= 30.
Quick stats for the filtered variants:
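Again assuming RTG Tools, the quick stats could be produced like this (a sketch):
$ rtg vcfstats variants/evol1.freebayes.q30.vcf.gz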
freebayes176 adds some extra information to the vcf-files it creates. This allows for some more detailed filtering. This strategy will NOT work on variants called with e.g. SAMtools177/BCFtools mpileup. Here we filter based on some recommendations from the developer of freebayes178:
$ zcat variants/evol1.freebayes.vcf.gz | vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" | bgzip > variants/evol1.freebayes.filtered.vcf.gz
175 https://github.com/vcflib/vcflib#vcflib
176 https://github.com/ekg/freebayes
177 http://samtools.sourceforge.net/
178 https://github.com/ekg/freebayes
The strategy used here will do for our purposes. However, several more elaborate filtering strategies have been explored, e.g. here179.
Todo: Look at the statistics. One ratio that is mentioned in the statistics is the transition/transversion ratio (ts/tv). Explain what this ratio is and why the observed ratio makes sense.
Todo: Call and filter variants for the second evolved strain, similarly to what was described here for the first strain. Should you be unable to do it, check the code section: Code: Variant calling (page 75).
179 https://github.com/ekg/freebayes#observation-filters-and-qualities
EIGHT
GENOME ANNOTATION
8.1 Preface
In this section you will predict genes and assess your assembly using Augustus182 and BUSCO183 , as well
as Prokka184 .
Note: You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
8.2 Overview
The part of the workflow we will work on in this section can be viewed in Fig. 8.1.
After studying this section of the tutorial you should be able to:
1. Explain how annotation completeness is assessed using orthologues
2. Use bioinformatics tools to perform gene prediction
3. Use genome-viewing software to graphically explore genome annotations and NGS data overlays
$ cd ~/analysis
$ ls -1F
assembly/
data/
kraken/
mappings/
multiqc_data/
trimmed/
trimmed-fastqc/
variants/
182 http://augustus.gobics.de
183 http://busco.ezlab.org
184 https://github.com/tseemann/prokka
Fig. 8.1: The part of the workflow we will work on in this section marked in red.
Attention: If you have not run the previous section Genome assembly (page 19), you can download
the genome assembly needed for this section here: Downloads (page 77). Download the file to the
~/analysis directory and decompress. Alternatively on the CLI try:
cd ~/analysis
wget -O assembly.tar.gz https://osf.io/t2zpm/download
tar xvzf assembly.tar.gz
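The installation command referred to in the next sentence is not shown in this copy. A conda sketch would look like the following; the environment name anno matches the config path used further below, while the package names are assumptions.
$ conda create -n anno augustus busco
$ conda activate anno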
This will install both the Augustus185 [STANKE2005] and the BUSCO186 [SIMAO2015] software, which
we will use (separately) for gene prediction and assessment of assembly completeness, respectively.
Make a directory for the annotation results:
$ mkdir annotation
$ cd annotation
We need to get the database that BUSCO187 will use to assess orthologue presence/absence in our genome annotation. BUSCO188 provides a command to list all available datasets and to download them.
$ busco --list-datasets
################################################
bacteria_odb10
- acidobacteria_odb10
- actinobacteria_phylum_odb10
- actinobacteria_class_odb10
- corynebacteriales_odb10
- micrococcales_odb10
- propionibacteriales_odb10
- streptomycetales_odb10
- streptosporangiales_odb10
- coriobacteriia_odb10
- coriobacteriales_odb10
...
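To fetch the bacterial lineage listed above, BUSCO's download option can be used (a sketch; recent BUSCO versions can also fetch a lineage automatically the first time it is requested):
$ busco --download bacteria_odb10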
$ cp -r ~/miniconda3/envs/anno/config/ .
185 http://augustus.gobics.de
186 http://busco.ezlab.org
187 http://busco.ezlab.org
188 http://busco.ezlab.org
189 http://busco.ezlab.org
BUSCO190 will assess orthologue presence/absence using blastn191, a rapid method of finding close matches in large databases (we will discuss this in lecture). It uses blastn192 to make sure that it does not miss any part of any possible coding sequence. To run the program, we give it (a sketch of the full command follows this list):
• A fasta format input file
• A name for the output files
• The name of the lineage database against which we are assessing orthologue presence absence
(that we downloaded above)
• An indication of the type of annotation we are doing (genomic, as opposed to transcriptomic or
previously annotated protein files).
• The config file to use
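The full command is not shown in this copy; a sketch along the lines of the inputs listed above, in which the output name, the paths and the config handling are assumptions, could look like this:
# make the local Augustus config copy visible to BUSCO/Augustus (assumption)
$ export AUGUSTUS_CONFIG_PATH=./config
# run BUSCO in genome mode against the bacterial lineage
$ busco -i ../assembly/scaffolds.fasta -o busco_out -l bacteria_odb10 -m genome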
Navigate into the output directory you created. There are many directories and files in there containing
information on the orthologues that were found, but here we are only really interested in one: the
summary statistics. This is located in the short_summary*.txt file. Look at this file. It will note the total
number of orthologues found, the number expected, and the number missing. This gives an indication
of your genome completeness.
Todo: Is it necessarily true that your assembly is incomplete if it is missing some orthologues? Why or
why not?
We will use Augustus194 to perform gene prediction. This program implements a hidden Markov model (HMM) to infer where genes lie in the assembly you have made. To run the program you need to give it (a sketch of the command follows the list below):
• Information as to whether you would like the genes called on both strands (or just the forward or
reverse strands)
• A “model” organism on which it can base its HMM parameters (in this case we will use E. coli)
• The location of the assembly file
• A name for the output file, which will be a .gff (general feature format) file.
• We will also tell it to display a progress bar as it moves through the genome assembly.
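A sketch of such a call, in which the species identifier, the paths and the output name are assumptions, might look like this:
$ augustus --species=E_coli_K12 --strand=both --progress=true ../assembly/scaffolds.fasta > augustus.gff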
Note: Should the process of producing your annotation fail, you can download an annotation manually from Downloads (page 77). Remember to unzip the file.
190 http://busco.ezlab.org
191 https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch
192 https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch
193 http://augustus.gobics.de
194 http://augustus.gobics.de
Install Prokka196 :
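The install command itself is not shown here; with conda (an assumption consistent with the rest of this tutorial) it would typically be:
$ conda install prokka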
Run Prokka197 :
$ prokka --kingdom Bacteria --genus Escherichia --species coli --outdir annotation assembly/scaffolds.fasta
Your results will be in the annotation directory with the prefix PROKKA.
We will use the software IGV198 to view the assembly, the gene predictions you have made, and the
variants that you have called, all in one window.
8.9.1 IGV199
$ igv
This will open up a new window. Navigate to that window and open up your genome assembly:
• Genomes -> Load Genome from File
• Load your assembly (scaffolds.fasta), not your gff file.
Load the tracks:
• File -> Load from File
• Load your unzipped vcf file from section: Variant calling (page 47)
• Load your unzipped gff file from this section.
At this point you should be able to zoom in and out to see regions in which there are SNPs or other types
of variants. You can also see the predicted genes. If you zoom in far enough, you can see the sequence
(DNA and protein).
If you have time and interest, you can right click on the sequence and copy it. Open a new browser
window and go to the blastn homepage. There, you can blast your gene of interest (GOI) and see if blast
can assign a function to it.
The end goal of this lab will be for you to select a variant that you feel is interesting (e.g. due to the
gene it falls near or within), and hypothesize as to why that mutation might have increased in frequency
in these evolving populations.
195 https://github.com/tseemann/prokka
196 https://github.com/tseemann/prokka
197 https://github.com/tseemann/prokka
198 http://software.broadinstitute.org/software/igv/
199 http://software.broadinstitute.org/software/igv/
NINE
9.1 Preface
In this section you will use some software to find orthologous genes and do phylogenetic reconstructions.
$ cd ~/analysis
$ ls -1F
annotation/
assembly/
data/
kraken/
mappings/
multiqc_data/
trimmed/
trimmed-fastqc/
variants/
Make a directory for the phylogeny results (in your analysis directory):
$ mkdir phylogeny
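The installation command referred to in the next sentence is not shown here; a conda sketch (environment and package names are assumptions) would be:
$ conda create -n phylo blast mafft raxml iqtree
$ conda activate phylo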
This will install a BLAST202 executable that you can use to remotely query the NCBI database and the
MAFFT203 alignment program that you can use to align sequences. We also install RAxML204 and IQ-
TREE205 , phylogenetic tree inference tools, which use maximum-likelihood (ML) optimality criterion.
We are using the gene gnd (gluconate-6-phosphate dehydrogenase) as an example. gnd is a highly
polymorphic gene within E. coli populations, likely due to interstrain transfer and recombination. This
may be a result of its proximity to the rfb region, which determines O antigen structure.
First, we are going to make a bed-file with the coordinates of the gene:
# these are the coordinates from the contigs-file (yours might be different); we copy them into the vi/nano buffer
Hint: To edit in the vi editor, press "i" (insert) or "a" (append) to start typing. To save, press the escape key and type ":w" (write). To quit, press the escape key and type ":q" (quit).
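The coordinates and the extraction command itself are not shown in this copy. A sketch using bedtools getfasta follows; the tool choice, the contig name and the coordinates are assumptions/placeholders.
# a single-line bed file: contig, start, end (placeholder values)
$ cat gnd.bed
NODE_2_length_327290_cov_30.828326	150000	151407
# extract the region into a fasta file
$ bedtools getfasta -fi ../assembly/scaffolds.fasta -bed gnd.bed -fo gnd.fasta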
Now, we have a fasta-file with exactly one genic region, the one from the gnd gene.
blastn -db nt -query gnd.fasta -remote -evalue 1e-100 -outfmt "6 qseqid sseqid sseq" > gnd_blast_hits.out
• -outfmt "6 qseqid sseqid sseq": tabular output (format 6) containing the query ID, the subject ID and the aligned subject sequence
• -evalue 1e-100: an e-value cutoff for inclusion of results
Next, we are formatting the result into fasta-format using the program awk:
awk 'BEGIN { OFS = "\n" } { print ">"$2, $3 }' gnd_blast_hits.out > gnd_blast_hits.fasta
Append the fasta file of your E. coli gene to this file, using whatever set of commands you wish/know.
For example:
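A minimal way to do this (file names taken from the steps above) is to append your query sequence to the hits file:
$ cat gnd.fasta >> gnd_blast_hits.fasta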
We will use MAFFT207 to perform our alignment on all the sequences in the BLAST208 fasta file. Its syntax is very simple (change the filenames accordingly):
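The call is not shown in this copy; a minimal mafft invocation (the output name matches the alignment file used with IQ-TREE below) would be:
$ mafft gnd_blast_hits.fasta > gnd_blast_hits.aln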
We will use RAxML209 to build our phylogeny. This uses a maximum likelihood method to infer parameters of evolution and the topology of the tree. Again, the syntax of the command is fairly simple (a sketch follows the argument list below), except that you must make sure that you are using the directory in which RAxML210 sits.
The arguments are:
• -s: an alignment file
• -m: a model of evolution. In this case we will use a general time reversible model with gamma
distributed rates (GTR+GAMMA)
• -n: outfile-name
• -p: specify a random number seed for the parsimony inferences
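A sketch of the call; the executable name and the seed value are assumptions, while the run name out matches the *bestTree.out file mentioned further below:
$ raxmlHPC -s gnd_blast_hits.aln -m GTRGAMMA -n out -p 12345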
We can also use IQ-TREE211 , which provides more information than RAxML212 .
iqtree -s gnd_blast_hits.aln
207 https://mafft.cbrc.jp/alignment/software/
208 https://blast.ncbi.nlm.nih.gov/Blast.cgi
209 https://github.com/stamatak/standard-RAxML
210 https://github.com/stamatak/standard-RAxML
211 http://www.iqtree.org/
212 https://github.com/stamatak/standard-RAxML
We will use the online software Interactive Tree of Life (iTOL)213 to visualize the tree. Navigate to this
homepage. Open the file containing your tree (*bestTree.out), copy the contents, and paste into the
web page (in the Tree text box).
You should then be able to zoom in and out to see where your taxon is. To find out its closest relative, you will have to use the NCBI taxa page214.
213 http://itol.embl.de/upload.cgi
214 https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
TEN
VARIANTS-OF-INTEREST
10.1 Preface
In this section we will use our genome annotation of our reference and our genome variants in the
evolved line to find variants that are interesting in terms of the observed biology.
Note: You will encounter some To-do sections at times. Write the solutions and answers into a text-file.
10.2 Overview
The part of the workflow we will work on in this section can be viewed in Fig. 10.1.
After studying this section of the tutorial you should be able to:
1. Identify variants of interests.
2. Understand how the variants might affect the observed biology in the evolved line.
$ cd ~/analysis
$ ls -1F
annotation/
assembly/
data/
kraken/
mappings/
phylogeny/
trimmed/
trimmed-fastqc/
variants/
Fig. 10.1: The part of the workflow we will work on in this section marked in red.
Attention: If you have not run the previous sections on Genome assembly (page 19) and Variant
calling (page 47), you can download the variant calls and the genome assembly needed for this
section here: Downloads (page 77). Download the files to the ~/analysis directory and decompress.
Alternatively on the CLI try:
cd ~/analysis
wget -O assembly.tar.gz https://osf.io/t2zpm/download
tar xvzf assembly.tar.gz
wget -O variants.tar.gz https://osf.io/4nzrm/download
tar xvzf variants.tar.gz
10.6 SnpEff
We will be using SnpEff215 to annotate our identified variants. The tool will tell us which genes we should focus further analyses on.
Tools we are going to use in this section and how to install them, if you have not done so yet:
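The install command is not shown here; a conda sketch (the environment name voi matches the path used below, while the package names are assumptions) would be:
$ conda create -n voi snpeff genometools-genometools
$ conda activate voi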
Make a directory for the results (in your analysis directory) and change into the directory:
$ mkdir voi
$ cd voi
215 http://snpeff.sourceforge.net/index.html
We need to create our own config-file for SnpEff216. First, find out where the snpEff.config is located:
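The original command is not shown; one way to locate the file (a sketch) is:
$ find ~/miniconda3/envs/voi/ -name snpEff.config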
This will give you the path to the snpEff.config. It might look a bit different from the one shown here, depending on the version of SnpEff217 that is installed.
Make a local copy of the snpEff.config and then edit it with an editor of your choice:
$ cp /home/guest/miniconda3/envs/voi/share/snpeff-4.3.1t-3/snpEff.config .
$ nano snpEff.config
Make sure the data directory path in the snpEff.config looks like this:
data.dir = ./data/
#-------------------------------------------------------------------------------
# Databases & Genomes
#
# One entry per genome version.
#
# For genome version 'ZZZ' the entries look like
# ZZZ.genome : Real name for ZZZ (e.g. 'Human')
# ZZZ.reference : [Optional] Comma separated list of URL to site/s where information for building ZZZ database was extracted.
# ZZZ.chrName.codonTable : [Optional] Define codon table used for chromosome 'chrName' (Default: 'codon.Standard')
#
#-------------------------------------------------------------------------------
Add the following two lines in the database section underneath these header lines:
# my genome
mygenome.genome : EColiMut
# create folders
$ mkdir -p ./data/mygenome
Copy our genome assembly to the newly created data folder. The name needs to be sequences.fa or
mygenome.fa:
$ cp ../assembly/scaffolds.fasta ./data/mygenome/sequences.fa
$ gzip ./data/mygenome/sequences.fa
Copy our genome annotation to the data folder. The name needs to be genes.gff (or genes.gtf for
gtf-files).
$ cp ../annotation/PROKKA_12345.gff ./data/mygenome/genes.gff
$ gzip ./data/mygenome/genes.gff
Note: Should this fail, due to gff-format of the annotation, we can try to convert the gff to gtf:
# using genometools
$ gt gff3_to_gtf ../annotation/PROKKA_12345.gff -o ./data/mygenome/genes.gtf
$ gzip ./data/mygenome/genes.gtf
Now, we can use the gtf annotation to build the database:
$ snpEff build -c snpEff.config -gtf22 -v mygenome > snpEff.stdout 2> snpEff.stderr
Now we can use our new SnpEff219 database to annotate some variants, e.g.:
$ snpEff -c snpEff.config mygenome ../variants/evol1.freebayes.filtered.vcf > evol1.freebayes.filtered.anno.vcf
SnpEff220 adds ANN fields to the vcf-file entries that explain the effect of the variant.
Note: If you are unable to do the annotation, you can download an annotated vcf-file from Downloads
(page 77).
10.6.4 Example
Let's look at one entry from the original vcf-file and the annotated one. We are only interested in the 8th column, which contains information regarding the variant. SnpEff221 will add fields here:
# evol2.freebayes.filtered.vcf (the original), column 8
AB=0;ABP=0;AC=1;AF=1;AN=1;AO=37;CIGAR=1X;DP=37;DPB=37;DPRA=0;EPP=10.1116;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=226.923;PAIRED=0.972973;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1155;QR=0;RO=0;RPL=12;RPP=12.9286;RPPR=0;RPR=25;RUN=1;SAF=26;SAP=16.2152;SAR=11;SRF=0;SRP=0;SRR=0;TYPE=snp
# evol2.freebayes.filtered.anno.vcf, column 8
AB=0;ABP=0;AC=1;AF=1;AN=1;AO=37;CIGAR=1X;DP=37;DPB=37;DPRA=0;EPP=10.1116;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=226.923;PAIRED=0.972973;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1155;QR=0;RO=0;RPL=12;RPP=12.9286;RPPR=0;RPR=25;RUN=1;SAF=26;SAP=16.2152;SAR=11;SRF=0;SRP=0;SRR=0;TYPE=snp;ANN=T|missense_variant|MODERATE|HGGMJBFA_02792|GENE_HGGMJBFA_02792|transcript|TRANSCRIPT_HGGMJBFA_02792|protein_coding|1/1|c.773G>A|p.Arg258His|773/1092|773/1092|258/363||WARNING_TRANSCRIPT_NO_START_CODON,T|upstream_gene_variant|MODIFIER|HGGMJBFA_02789|GENE_HGGMJBFA_02789|transcript|TRANSCRIPT_HGGMJBFA_02789|protein_coding||c.-4878G>A|||||4878|,T|upstream_gene_variant|MODIFIER|HGGMJBFA_02790|GENE_HGGMJBFA_02790|transcript|TRANSCRIPT_HGGMJBFA_02790|protein_coding||c.-3568G>A|||||3568|,T|upstream_gene_variant|MODIFIER|HGGMJBFA_02791|GENE_HGGMJBFA_02791|transcript|TRANSCRIPT_HGGMJBFA_02791|protein_coding||c.-442G>A|||||442|,T|upstream_gene_variant|MODIFIER|HGGMJBFA_02794|GENE_HGGMJBFA_02794|transcript|TRANSCRIPT_HGGMJBFA_02794|protein_coding||c.-1864C>T|||||1864|,T|upstream_gene_variant|MODIFIER|HGGMJBFA_02795|GENE_HGGMJBFA_02795|transcript|TRANSCRIPT_HGGMJBFA_02795|protein_coding||c.-3530C>T|||||3530|,T|upstream_gene_variant|MODIFIER|HGGMJBFA_02796|GENE_HGGMJBFA_02796|transcript|TRANSCRIPT_HGGMJBFA_02796|protein_coding||c.-4492C>T|||||4492|,T|downstream_gene_variant|MODIFIER|HGGMJBFA_02793|GENE_HGGMJBFA_02793|transcript|TRANSCRIPT_HGGMJBFA_02793|protein_coding||c.*840G>A|||||840|
219 http://snpeff.sourceforge.net/index.html
220 http://snpeff.sourceforge.net/index.html
221 http://snpeff.sourceforge.net/index.html
When inspecting the second entry, we find that SnpEff222 added annotation information starting with ANN=T|missense_variant|.... If we look a bit more closely, we find that the variant results in an amino acid change from an arginine to a histidine (c.773G>A|p.Arg258His). Arginine is encoded by CGN codons and histidine by CAT/CAC, so the change in the second nucleotide of the codon (G>A) explains the amino acid change. A quick BLAST223 search of the CDS sequence in which the variant was found (extracted from the genes.gff.gz) shows that the closest hit is a DNA-binding transcriptional regulator from several different E. coli strains.
222 http://snpeff.sourceforge.net/index.html
223 https://blast.ncbi.nlm.nih.gov/Blast.cgi
224 https://blast.ncbi.nlm.nih.gov/Blast.cgi
ELEVEN
# activate env
conda activate [name]
# deactivate env
conda deactivate
TWELVE
CODING SOLUTIONS
12.1 QC
Create directory:
mkdir trimmed-fastqc
Run FastQC:
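The FastQC command itself is not shown in this copy; a minimal sketch (output directory and file pattern are assumptions) is:
fastqc -o trimmed-fastqc trimmed/*.fastq.gz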
Run MultiQC:
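The MultiQC command itself is not shown; a minimal sketch (run inside the FastQC output directory, which is an assumption) is:
cd trimmed-fastqc
multiqc .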
firefox multiqc_report.html
12.2 Assembly
12.3 Mapping
#
# Evol 1
#
#
# Evol 2
#
12.4 Variant calling
# index genome
samtools faidx assembly/scaffolds.fasta
mkdir variants
#
# Evol 1
#
# index mappings
bamtools index -in mappings/evol1.sorted.dedup.q20.bam
# calling variants
freebayes -p 1 -f assembly/scaffolds.fasta mappings/evol1.sorted.dedup.q20.bam > variants/evol1.freebayes.vcf
# compress
bgzip variants/evol1.freebayes.vcf
# index
tabix -p vcf variants/evol1.freebayes.vcf.gz
# filtering
zcat variants/evol1.freebayes.vcf.gz | vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" | bgzip > variants/evol1.freebayes.filtered.vcf.gz
#
# Evol 2
#
# index mappings
bamtools index -in mappings/evol2.sorted.dedup.q20.bam
# calling variants
freebayes -p 1 -f assembly/scaffolds.fasta mappings/evol2.sorted.dedup.q20.bam > variants/evol2.freebayes.vcf
# compress
bgzip variants/evol2.freebayes.vcf
# index
tabix -p vcf variants/evol2.freebayes.vcf.gz
# filtering
zcat variants/evol2.freebayes.vcf.gz | vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" | bgzip > variants/evol2.freebayes.filtered.vcf.gz
THIRTEEN
DOWNLOADS
13.1 Tools
13.2 Data
225 https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
226 ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz
227 ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
228 https://www.dropbox.com/s/cwf1qc5zyq65yvn/taxonomy.tab.gz?dl=0
229 https://www.dropbox.com/s/3vu1mct230ewhwl/data.tar.gz?dl=0
230 https://osf.io/2jc4a/download
231 https://www.dropbox.com/s/y3xsggn0glb6ter/trimmed.tar.gz?dl=0
232 https://osf.io/m3wpr/download
233 https://www.dropbox.com/s/h906x9maw879t5s/assembly.tar.gz?dl=0
234 https://osf.io/t2zpm/download
235 https://www.dropbox.com/s/ii3vbdj9yn916k4/mapping_idx.tar.gz?dl=0
236 https://osf.io/tnzrf/download
237 https://www.dropbox.com/s/8bporren0o230oo/mappings.tar.gz?dl=0
238 https://osf.io/g5at8/download
239 https://www.dropbox.com/s/lraiepofsvkl1md/variants.tar.gz?dl=0
240 https://osf.io/4nzrm/download
241 https://www.dropbox.com/s/16p9tb22lsvqxbg/annotation.tar.gz?dl=0
242 https://osf.io/7t4yh/download
243 https://www.dropbox.com/s/yzbu0eealf7xfr1/voi.tar.gz?dl=0
244 https://osf.io/5c6w9/download
LIST OF FIGURES
3.1 The part of the workflow we will work on in this section marked in red. . . . . . . . . . . 10
3.2 Illustration of single-end (SE) versus paired-end (PE) sequencing. . . . . . . . . . . . . . . 11
3.3 Quality score across bases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Quality per tile. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 GC distribution over all sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 The part of the workflow we will work on in this section marked in red. . . . . . . . . . . 20
5.1 The part of the workflow we will work on in this section marked in red. . . . . . . . . . . 26
5.2 An example coverage plot for a contig, with regions with a coverage below 20 reads highlighted in red . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.1 The part of the workflow we will work on in this section marked in red. . . . . . . . . . . 36
6.2 Example of a Krona output webpage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.1 The part of the workflow we will work on in this section marked in red. . . . . . . . . . . 48
7.2 Example of plot-vcfstats output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.1 The part of the workflow we will work on in this section marked in red. . . . . . . . . . . 56
10.1 The part of the workflow we will work on in this section marked in red. . . . . . . . . . . 66
10.2 Results of a BLAST search of the CDS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
BIBLIOGRAPHY
[KAWECKI2012] Kawecki TJ et al. Experimental evolution. Trends in Ecology and Evolution (2012) 27:10. http://dx.doi.org/10.1016/j.tree.2012.06.001
[ZEYL2006] Zeyl C. Experimental evolution with yeast. FEMS Yeast Res, 2006, 685–691. http://doi.org/10.1111/j.1567-1364.2006.00061.x
[GLENN2011] Glenn T. Field guide to next-generation DNA sequencers. Molecular Ecology Resources (2011) 11, 759–769. http://doi.org/10.1111/j.1755-0998.2011.03024.x
[KIRCHNER2014] Kirchner et al. Addressing challenges in the production and analysis of Illumina sequencing data. BMC Genomics (2011) 12:382. http://doi.org/10.1186/1471-2164-12-382
[MUKHERJEE2015] Mukherjee S, Huntemann M, Ivanova N, Kyrpides NC and Pati A. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Standards in Genomic Sciences, 2015, 10:18. http://doi.org/10.1186/1944-3277-10-18
[ROBASKY2014] Robasky et al. The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics (2014) 15, 56–62. http://doi.org/10.1038/nrg3655
[ABBAS2014] Abbas MM, Malluhi QM, Balakrishnan P. Assessment of de novo assemblers for draft genomes: a case study with fungal genomes. BMC Genomics, 2014, 15 Suppl 9:S10. doi: 10.1186/1471-2164-15-S9-S10. Epub 2014 Dec 8. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4290589/
[COMPEAU2011] Compeau PE, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nat Biotechnol, 2011 Nov 8, 29(11):987-91. http://dx.doi.org/10.1038/nbt.2023
[GUREVICH2013] Gurevich A, Saveliev V, Vyahhi N and Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 2013, 29(8):1072-1075. http://bioinformatics.oxfordjournals.org/content/29/8/1072
[NAGARAJAN2013] Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet, 2013 Mar, 14(3):157-67. http://dx.doi.org/10.1038/nrg3367
[SALZBERG2012] Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res, 2012 Mar, 22(3):557-67. http://genome.cshlp.org/content/22/3/557.full?sid=59ea80f7-b408-4a38-9888-3737bc670876
[WICK2015] Wick RR, Schultz MB, Zobel J and Holt KE. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 2015, doi:10.1093/bioinformatics/btv383. http://bioinformatics.oxfordjournals.org/content/early/2015/07/11/bioinformatics.btv383.long
[LI2009] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009, 25(14):1754–1760. https://doi.org/10.1093%2Fbioinformatics%2Fbtp324
[OKO2015] Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 2015, 32(2):292–294. https://doi.org/10.1093/bioinformatics/btv566
[KIM2017] Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res, 2016 Dec, 26(12):1721-1729. https://www.ncbi.nlm.nih.gov/pubmed/27852649
[LU2017] Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science, 2017, 3:e104, doi:10.7717/peerj-cs.104. https://peerj.com/articles/cs-104/
[ONDOV2011] Ondov BD, Bergman NH, and Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 2011, 12(1):385. http://www.ncbi.nlm.nih.gov/pubmed/21961884
[WOOD2014] Wood DE and Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 2014, 15:R46. http://doi.org/10.1186/gb-2014-15-3-r46
[NIELSEN2011] Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genetics, 2011, 12:433-451. http://doi.org/10.1038/nrg2986
[OLSEN2015] Olsen ND et al. Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Front. Genet., 2015, 6:235. https://doi.org/10.3389/fgene.2015.00235
[SIMAO2015] Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV and Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015, 31(19):3210-2. http://doi.org/10.1093/bioinformatics/btv351
[STANKE2005] Stanke M and Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res, 2005, 33(Web Server issue):W465–W467. https://dx.doi.org/10.1093/nar/gki458