Trainer Handout
TRAINER’S MANUAL
Licensing
This work is licensed under a Creative Commons Attribution 3.0 Unported License and
the below text is a summary of the main terms of the full Legal Code (the full licence)
available at http://creativecommons.org/licenses/by/3.0/legalcode. In no way are any
of the following rights affected by the licence:
• Your fair dealing or fair use rights, or other applicable copyright exceptions
and limitations;
• The author’s moral rights;
• Rights other persons may have either in the work itself or in how the work
is used, such as publicity or privacy rights.
Notice - For any reuse or distribution, you must make clear to others the licence
terms of this work.
Contents
Licensing
Contents
Workshop Information
    The Trainers
    Providing Feedback
    Document Structure
    Resources Used
Data Quality
    Key Learning Outcomes
    Resources You’ll be Using
    Useful Links
    Introduction
    Prepare the Environment
    Quality Visualisation
    Read Trimming
Read Alignment
    Key Learning Outcomes
    Resources You’ll be Using
    Useful Links
    Introduction
    Prepare the Environment
    Alignment
    Manipulate SAM output
    Visualize alignments in IGV
    Practice Makes Perfect!
ChIP-Seq
    Key Learning Outcomes
    Resources You’ll be Using
    Introduction
    Prepare the Environment
    Finding enriched areas using MACS
    Viewing results with the Ensembl genome browser
    Annotation: From peaks to biological interpretation
    Motif analysis
    Reference
RNA-Seq
    Key Learning Outcomes
    Resources You’ll be Using
    Introduction
    Prepare the Environment
    Alignment
    Isoform Expression and Transcriptome Assembly
    Differential Expression
    Visualising the CuffDiff expression analysis
    Functional Annotation of Differentially Expressed Genes
    Differential Gene Expression Analysis using edgeR
    References
Post-Workshop Information
    Access to Computational Resources
    Access to Workshop Documents
    Access to Workshop Data
Workshop Information
The Trainers
Providing Feedback
While we endeavour to deliver a workshop with quality content and documentation in a
venue conducive to an exciting, well run hands-on workshop with a bunch of knowledgeable
and likable trainers, we know there are things we could do better.
Whilst we want to know what didn’t quite hit the mark for you, what would be most
helpful, and least depressing, would be for you to provide ways to improve the workshop,
i.e. constructive feedback. After all, if we knew something wasn’t going to work, we
wouldn’t have done it or put it into the workshop in the first place! Remember, we’re
experts in the field of bioinformatics, not experts in the field of biology!
Clearly, we also want to know what we did well! This gives us that “feel good” factor
which will see us through those long days and nights in the lead up to such hands-on
workshops!
With that in mind, we’ll provide three really high tech mechanisms through which you
can provide anonymous feedback during the workshop:
1. A sheet of paper, from a flip-chart, sporting a “happy” face and a “not so happy”
face. Armed with a stack of colourful post-it notes, your mission is to see how many
comments you can stick on the “happy” side!
2. Some empty ruled pages at the back of this handout. Use them for your own personal
notes or for writing specific comments/feedback about the workshop as it progresses.
3. An online post-workshop evaluation survey. We’ll ask you to complete this before
you leave. If you’ve used the blank pages at the back of this handout to make
feedback notes, you’ll be able to provide more specific and helpful feedback with the
least amount of brain-drain!
Document Structure
We have provided you with an electronic copy of the workshop’s hands-on tutorial
documents. We have done this for two reasons: 1) you will have something to take away
with you at the end of the workshop, and 2) you can save time (mis)typing commands on
the command line by using copy-and-paste.
We advise you to use Acrobat Reader to view the PDF. This is because it properly supports
some features we have implemented to ensure that copy-and-paste of commands works as
expected. This includes the appropriate copy-and-paste of special characters like tildes and
hyphens as well as skipping line numbers for easy copy-and-paste of whole code blocks.
While you could fly through the hands-on sessions doing copy-and-paste you will
learn more if you take the time, saved from not having to type all those commands,
to understand what each command is doing!
The following styled code is not to be entered at a terminal; it simply shows you the
syntax of the command. You must use your own judgement to substitute in the correct
arguments, options, filenames etc.
tophat [options]* <index_base> <reads_1> <reads_2>
The following icons are used in the margin, throughout the documentation to help you
navigate around the document more easily:
Important
For reference
Questions to answer
Resources Used
We have provided you with an environment which contains all the tools and data you
need for the duration of this workshop. However, we also provide details about the tools
and data used by each module at the start of the respective module documentation.
Module: Data Quality
Primary Author(s):
Sonika Tyagi [email protected]
Contributor(s):
Nathan S. Watson-Haigh [email protected]
Key Learning Outcomes
• Visualise the quality, and other associated metrics, of reads to decide on filters and
cutoffs for cleaning up data ready for downstream analysis
Tools Used
FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FASTX-Toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Picard
http://picard.sourceforge.net/
Useful Links
FASTQ Encoding
http://en.wikipedia.org/wiki/FASTQ_format#Encoding
Introduction
Going on a blind date with your read set? For a better understanding of the consequences
please check the data quality!
For the purpose of this tutorial we are focusing only on Illumina sequencing, which
uses ’sequencing by synthesis’ technology in a highly parallel fashion. Although Illumina
high throughput sequencing provides highly accurate sequence data, several sequence
artifacts, including base calling errors, small insertions/deletions, poor quality reads
and primer/adapter contamination, are quite common in high throughput sequencing
data. The primary errors are substitution errors. The error rates can vary from 0.5-2.0%,
with errors mainly rising in frequency at the 3’ ends of reads.
One way to investigate sequence data quality is to visualize the quality scores and other
metrics in a compact manner to get an idea about the quality of a read data set. Read
data sets can be improved by post processing in different ways like trimming off low
quality bases, cleaning up any sequencing adapters and removing PCR duplicates. We
can also look at other statistics such as sequence length distribution, base composition,
sequence complexity, presence of ambiguous bases etc. to assess the overall quality of the
data set.
Highly redundant coverage (>15X) of the genome can be used to correct sequencing errors
in the reads before assembly. Various k-mer based error correction methods
exist but are beyond the scope of this tutorial.
Prepare the Environment
From the quality characters alone it is not always possible to tell which encoding is in
use. For example, if the only characters seen in the quality string are (@ABCDEFGHI),
then it is impossible to know if you have really good Phred+33 encoded qualities or really
bad Phred+64 encoded qualities.
For a graphical representation of the different ASCII characters used in the two encoding
schemes see: http://en.wikipedia.org/wiki/FASTQ_format#Encoding.
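A rough command-line check of this (an illustrative sketch only, not a workshop exercise; the toy read below is made up) is to inspect the numeric ASCII range of the quality characters:

```shell
# Toy FASTQ record for illustration (qualities "!!II" are Phred+33).
printf '@read1\nACGT\n+\n!!II\n' > example.fastq

# Extract every 4th line (the quality strings), convert the characters
# to decimal ASCII codes, and report which encoding the range suggests.
awk 'NR % 4 == 0' example.fastq | tr -d '\n' | od -An -tu1 \
  | tr -s ' ' '\n' | sed '/^$/d' | sort -n \
  | awk 'NR == 1 { min = $1 } { max = $1 }
         END { if (min < 59) print "Phred+33 likely";
               else if (max > 74) print "Phred+64 likely";
               else print "ambiguous" }'
# Prints: Phred+33 likely
```

Codes below 59 (‘;’) can only occur in Phred+33 data, while codes above 74 (‘J’) point to Phred+64; a range like @-I falls in the overlap, which is exactly the ambiguity described above.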
Open the Terminal and go to the directory where the data are stored:
cd ~/QC/
pwd
At any time, help can be displayed for FastQC using the following command:
fastqc -h
Quality Visualisation
We have FastQC statistics for both a good quality and a bad quality example file. FastQC
generates results in the form of a zipped and unzipped directory for each input file.
View the FastQC report file of the bad data using a web browser such as firefox.
firefox bad_example_fastqc.html &
The report file will have a Basic Statistics table and various graphs and tables for different
quality statistics. E.g.:
A Phred quality score (or Q-score) expresses an error probability. In particular, it serves
as a convenient and compact way to communicate very small error probabilities. The
probability that base A is wrong, P(~A), is expressed by a quality score, Q(A), according
to the relationship:

Q(A) = -10 log10( P(~A) )

The relationship between the quality score and error probability is demonstrated in the
following table:
Quality score, Q(A)    Error probability, P(~A)    Accuracy of the base call
10                     0.1                         90%
20                     0.01                        99%
30                     0.001                       99.9%
40                     0.0001                      99.99%
50                     0.00001                     99.999%
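The table values follow directly from this relationship; as a quick check (illustrative only, not part of the exercises) the error probabilities can be recomputed with awk:

```shell
# Error probability for a given Phred score: P = 10^(-Q/10).
for q in 10 20 30 40; do
    awk -v q="$q" 'BEGIN { printf "Q%d -> P = %g\n", q, 10^(-q / 10) }'
done
# Prints:
# Q10 -> P = 0.1
# Q20 -> P = 0.01
# Q30 -> P = 0.001
# Q40 -> P = 0.0001
```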
How many sequences were there in your file? What is the read length? 40,000
sequences; read length = 100 bp.
Do the quality score values vary throughout the read length? (hint: look at the
’per base sequence quality’ plot) Yes. Quality scores are dropping towards the end of
the reads.
What is the quality score range you see? 2-40.
At around which position do the scores start falling below Q20? Around the 80 bp
position.
How can we trim the reads to filter out the low quality data? By trimming off the
bases after a fixed position of the read or by trimming off bases based on the quality
score.
Sequencing errors can complicate the downstream analysis, which normally requires that
reads be aligned to each other (for genome assembly) or to a reference genome (for
detection of mutations). Sequence reads containing errors may lead to ambiguous paths
in the assembly or improper gaps. In variant analysis projects sequence reads are aligned
against the reference genome. The errors in the reads may lead to more mismatches than
expected from mutations alone. But if these errors can be removed or corrected, the read
alignments and hence the variant detection will improve. The assemblies will also improve
after pre-processing the reads with errors.
Read Trimming
Read trimming can be done in a variety of different ways. Choose a method which best
suits your data. Here we are giving examples of fixed-length trimming and quality-based
trimming.
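The workshop environment uses the FASTX-Toolkit for these trimming steps; purely as an illustration of the fixed-length idea (toy file names, not the workshop commands), the effect can be sketched with awk:

```shell
# Toy 100-base read (the real exercise uses bad_example.fastq).
s=$(awk 'BEGIN { for (i = 0; i < 100; i++) printf "A" }')
q=$(awk 'BEGIN { for (i = 0; i < 100; i++) printf "I" }')
printf '@read1\n%s\n+\n%s\n' "$s" "$q" > toy.fastq

# Fixed-length trimming: keep only the first 80 bases of each read.
# In a 4-line FASTQ record the sequence and quality are the even lines,
# so truncate those and pass the header lines through untouched.
awk '{ if (NR % 2 == 0) print substr($0, 1, 80); else print }' \
    toy.fastq > toy_trimmed.fastq

# Confirm the new read length:
awk 'NR == 2 { print length($0) }' toy_trimmed.fastq
# Prints: 80
```

fastx_trimmer performs the same task with its -f (first base to keep) and -l (last base to keep) options.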
Run FastQC on the trimmed file and visualise the quality scores of the trimmed file.
fastqc -f fastq bad_example_trimmed01.fastq
firefox bad_example_trimmed01_fastqc.html &
Figure 2: Per base sequence quality plot for the fixed-length trimmed bad_example.fastq
reads.
What values would you use for -f if you wanted to trim off 10 bases at the 5’ end of
the reads? -f 11
Run FastQC on the quality trimmed file and visualise the quality scores.
fastqc -f fastq bad_example_quality_trimmed.fastq
firefox bad_example_quality_trimmed_fastqc.html &
Figure 3: Per base sequence quality plot for the quality-trimmed bad_example.fastq
reads.
How did the quality score range change with the two types of trimming? Some poor
quality bases (Q <20) are still present at the 3’ end of the fixed-length trimmed reads,
and this approach also removes bases that are of good quality.
Quality-based trimming retains the 3’ ends of reads which have good quality scores.
Did the number of total reads change after the two types of trimming? Quality trimming
discarded >1000 reads. However, we retain a lot of maximal length reads which have
good quality all the way to the ends.
What read lengths were obtained after quality based trimming? 50-100.
Reads <50 bp, following quality trimming, were discarded.
Did you observe adapter sequences in the data? No. (Hint: look at the overrepresented
sequences.)
How can you use the -a option with fastqc? (Hint: try fastqc -h.) Adapters can be
supplied in a file for screening.
Adapter Clipping
Sometimes sequence reads may end up containing leftovers of the adapters and primers
used in the sequencing process. It’s good practice to screen your data for this possible
contamination for more sensitive alignment and assembly based analysis.
This is particularly important when read lengths can be longer than the molecules
being sequenced. For example when sequencing miRNAs.
Various QC tools are available to screen and/or clip these adapter/primer sequences
from your data. (e.g. FastQC, FASTX-Toolkit, cutadapt).
Here we are demonstrating fastx_clipper to trim a given adapter sequence.
cd ~/QC
fastx_clipper -h
fastx_clipper -v -Q 33 -l 20 -M 15 -a \
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -i bad_example.fastq -o \
bad_example_clipped.fastq
An alternative tool, not installed on this system, for adapter clipping is fastq-mcf.
A list of adapters is provided in a text file. For more information, see FastqMcf at
http://code.google.com/p/ea-utils/wiki/FastqMcf.
Removing Duplicates
Duplicate reads are the ones having the same start and end coordinates. This may be
the result of technical duplication (too many PCR cycles), or over-sequencing (very
high fold coverage). It is very important to put the duplication level in the context of
your experiment. For example, the duplication level in targeted or re-sequencing projects
may mean something different than in RNA-seq experiments. In RNA-seq experiments
over-sequencing is usually necessary when detecting low abundance transcripts.
The duplication level computed by FastQC is based on sequence identity at the end
of reads. Another tool, Picard, determines duplicates based on identical start and
end positions in SAM/BAM alignment files.
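As a minimal sketch of the sequence-identity idea behind the FastQC number (toy data, illustration only), duplicated sequences can be counted with standard tools:

```shell
# Toy FASTQ with three reads, two of which share the same sequence.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n' > toy.fastq

# Count distinct sequences that occur more than once: pull out the
# sequence lines (line 2 of each record), sort, keep only duplicates.
awk 'NR % 4 == 2' toy.fastq | sort | uniq -d | wc -l | tr -d ' '
# Prints: 1
```

Picard’s MarkDuplicates works on alignment coordinates instead, as described above.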
We will not cover Picard but provide the following for your information.
Picard is a suite of tools for performing many common tasks with SAM/BAM format
files. For more information see the Picard website and information about the various
command-line tools available:
http://picard.sourceforge.net/command-line-overview.shtml
Interested users can use the following general command to run the MarkDuplicates
tool at their leisure. You only need to provide a BAM file for the INPUT argument
(not provided):
cd ~/QC
java -jar /tools/Picard/picard-default/MarkDuplicates.jar \
INPUT=<alignment_file.bam> VALIDATION_STRINGENCY=LENIENT \
OUTPUT=alignment_file.dup METRICS_FILE=alignment_file.metrics \
ASSUME_SORTED=true REMOVE_DUPLICATES=true
Module: Read Alignment
Primary Author(s):
Myrto Kostadima [email protected]
Contributor(s):
Xi Li [email protected]
Key Learning Outcomes
• Perform a simple NGS data alignment task against a reference dataset of interest
• Visualise the alignment via a standard genome browser, e.g. IGV browser
Tools Used
Bowtie
http://bowtie-bio.sourceforge.net/index.shtml
Bowtie 2
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Samtools
http://samtools.sourceforge.net/
BEDTools
http://code.google.com/p/bedtools/
UCSC tools
http://hgdownload.cse.ucsc.edu/admin/exe/
Useful Links
SAM Specification
http://samtools.sourceforge.net/SAM1.pdf
Sources of Data
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-11431
Introduction
The goal of this hands-on session is to perform an unspliced alignment for a small subset
of raw reads. We will align raw sequencing data to the mouse genome using Bowtie and
then we will manipulate the SAM output in order to visualize the alignment on the IGV
browser.
The .fastq file that we will align is called Oct4.fastq. This file is based on Oct4
ChIP-seq data published by Chen et al. (2008). For the sake of time, we will align these
reads to a single mouse chromosome.
Alignment
You already know that there are a number of competing tools for short read alignment,
each with its own set of strengths, weaknesses, and caveats. Here we will try Bowtie, a
widely used ultrafast, memory efficient short read aligner.
Bowtie has a number of parameters in order to perform the alignment. To view them all
type
bowtie --help
Bowtie uses an indexed genome for the alignment in order to keep its memory footprint small.
Because of time constraints we will build the index only for one chromosome of the mouse
genome. For this we need the chromosome sequence in FASTA format. This is stored in a
file named mm10.fa, under the subdirectory bowtie_index.
The indexed chromosome is generated using the command:
bowtie-build bowtie_index/mm10.fa bowtie_index/mm10
This command will output 6 files that constitute the index. These files, which have the
prefix mm10, are stored in the bowtie_index subdirectory. To check whether the files have
been successfully created, type:
ls -l bowtie_index
Now that the genome is indexed we can move on to the actual alignment. The first
argument for bowtie is the basename of the index for the genome to be searched; in our
case this is mm10. We also want to make sure that the output is in SAM format using the
-S parameter. The last argument is the name of the FASTQ file:
bowtie bowtie_index/mm10 -S Oct4.fastq > Oct4.sam
The above command outputs the alignments in SAM format and stores them in the file
Oct4.sam.
In general before you run Bowtie, you have to know what quality encoding your FASTQ
files are in. The FASTQ encodings available in bowtie are selected with options such as
--phred33-quals (Sanger, Phred+33; the default), --phred64-quals (Phred+64),
--solexa-quals and --solexa1.3-quals.
The FASTQ files we are working with are Sanger encoded (Phred+33), which is the
default for Bowtie.
Bowtie will take 2-3 minutes to align the file. This is fast compared to other aligners
which sacrifice some speed to obtain higher sensitivity.
Manipulate SAM output
Can you distinguish between the header of the SAM format and the actual alignments?
The header lines start with the character ‘@’, i.e.:
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:197195432
@PG ID:Bowtie VN:0.12.8 CL:“bowtie bowtie_index/mm10 -S Oct4.fastq”
The actual alignments, meanwhile, start with a read ID, i.e.:
SRR002012.45 0 chr1 etc
SRR002012.48 16 chr1 etc
What kind of information does the header provide?
• @HD: Header line; VN: Format version; SO: the sort order of alignments.
• @SQ: Reference sequence information; SN: reference sequence name; LN: refer-
ence sequence length.
• @PG: Program information; ID: Program record identifier; VN: Program version;
CL: the command line that produced the alignments.
Convert SAM to BAM using samtools view and store the output in the file Oct4.bam.
You have to instruct samtools view that the input is in SAM format (-S), the output
should be in BAM format (-b) and that you want the output to be stored in the file
specified by the -o option:
samtools view -bSo Oct4.bam Oct4.sam
Compute summary stats for the Flag values associated with the alignments using:
samtools flagstat Oct4.bam
Visualize alignments in IGV
When uploading a BAM file into the genome browser, the browser will look for the index
of the BAM file in the same folder where the BAM file is. The index file should have the
same name as the BAM file and the suffix .bai. Finally, to create the index of a BAM
file you need to make sure that the file is sorted according to chromosomal coordinates.
Sort alignments according to chromosomal position and store the result in the file with
the prefix Oct4.sorted:
samtools sort Oct4.bam Oct4.sorted
Index the sorted BAM file:
samtools index Oct4.sorted.bam
The indexing will create a file called Oct4.sorted.bam.bai. Note that you don’t have to
specify the name of the index file when running samtools index; it simply appends a
.bai suffix to the input BAM file.
Another way to visualize the alignments is to convert the BAM file into a bigWig file.
The bigWig format is for display of dense, continuous data and the data will be displayed
as a graph. The resulting bigWig files are in an indexed binary format.
The BAM to bigWig conversion takes place in two steps. Firstly, we convert the BAM
file into a bedgraph, called Oct4.bedgraph, using the tool genomeCoverageBed from
BEDTools. Then we convert the bedgraph into a bigWig binary file called Oct4.bw, using
bedGraphToBigWig from the UCSC tools:
genomeCoverageBed -bg -ibam Oct4.sorted.bam -g \
bowtie_index/mouse.mm10.genome > Oct4.bedgraph
bedGraphToBigWig Oct4.bedgraph bowtie_index/mouse.mm10.genome Oct4.bw
Both of the commands above take as input a file called mouse.mm10.genome that is stored
under the subdirectory bowtie_index. These genome files are tab-delimited and describe
the size of the chromosomes for the organism of interest. When using the UCSC Genome
Browser, Ensembl, or Galaxy, you typically indicate which species/genome build you are
working with. The way you do this for BEDTools is to create a “genome” file, which
simply lists the names of the chromosomes (or scaffolds, etc.) and their size (in basepairs).
BEDTools includes pre-defined genome files for human and mouse in the genomes
subdirectory included in the BEDTools distribution.
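As an illustration of that format (the single entry below reuses the chr1 length from the SAM header shown in the previous section; a real genome file lists every chromosome):

```shell
# A BEDTools "genome" file is just chromosome name and size, separated
# by a tab, one chromosome per line. Single-entry example only.
printf 'chr1\t197195432\n' > example.genome
cat example.genome
# Shows: chr1<TAB>197195432
```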
Now we will load the data into the IGV browser for visualization. In order to launch
IGV double click on the IGV 2.3 icon on your Desktop. Ignore any warnings and when it
opens you have to load the genome of interest.
On the top left of your screen choose from the drop down menu Mus musculus (mm10).
Then in order to load the desired files go to:
File > Load from File
On the pop up window navigate to Desktop > ChIP-seq folder and select the file
Oct4.sorted.bam.
Repeat these steps in order to load Oct4.bw as well.
Select chr1 from the drop down menu on the top left. Right click on the name of
Oct4.bw and choose Maximum under the Windowing Function. Right click again and
select Autoscale.
In order to see the aligned reads of the BAM file, you need to zoom in to a specific region.
For example, look for gene Lemd1 in the search box.
What is the main difference between the visualization of BAM and bigWig files? The
actual alignment of reads that stack to a particular region can be displayed using
the information stored in a BAM format. The bigWig format is for display of dense,
continuous data that will be displayed in the Genome Browser as a graph.
Using the + button on the top right, zoom in to see more of the details of the alignments.
What do you think the different colors mean? The different colors represent the four
nucleotides, e.g. blue is cytosine (C) and red is thymine (T).
Module: ChIP-Seq
Primary Author(s):
Remco Loos, EMBL-EBI [email protected]
Myrto Kostadima [email protected]
Contributor(s):
Xi Li [email protected]
Key Learning Outcomes
• Visualize the peak regions through a genome browser, e.g. Ensembl, and identify
the real peak regions
• Perform functional annotation and detect potential binding sites (motifs) in the
predicted binding regions using a motif discovery tool, e.g. MEME.
Tools Used
MACS
http://liulab.dfci.harvard.edu/MACS/index.html
Ensembl
http://www.ensembl.org
PeakAnalyzer
http://www.ebi.ac.uk/bertone/software
MEME
http://meme.ebi.edu.au/meme/tools/meme
TOMTOM
http://meme.ebi.edu.au/meme/tools/tomtom
DAVID
http://david.abcc.ncifcrf.gov
GOstat
http://gostat.wehi.edu.au
Sources of Data
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-11431
Introduction
The goal of this hands-on session is to perform some basic tasks in the analysis of ChIP-seq
data. In fact, you already performed the first step, alignment of the reads to the genome,
in the previous session. We start from the aligned reads and we will find immuno-enriched
areas using the peak caller MACS. We will visualize the identified regions in a genome
browser and perform functional annotation and motif analysis on the predicted binding
regions.
If you didn’t have time to align the control file called gfp.fastq during the alignment
practical, please do it now. Follow the same steps, from the bowtie alignment step, as for
the Oct4.fastq file.
Consult the MACS help file to see the options and parameters:
macs --help
The input for MACS can be in ELAND, BED, SAM, BAM or BOWTIE formats (you
just have to set the --format option).
Options that you will have to use include:
Look at the output saturation table (Oct4_diag.xls). To open this file, right-click on
it and choose “Open with” and select LibreOffice. Do you think that more sequencing is
necessary?
Open the Excel peak file and view the peak details. Note that the number of tags (column
6) refers to the number of reads in the whole peak region and not the peak height.
Viewing results with the Ensembl genome browser
We have generated bigWig files in advance for you to upload to the Ensembl browser. They
are at the following URL: http://www.ebi.ac.uk/~remco/ChIP-Seq_course/Oct4.bw
To visualise the data:
• Choose some informative name and in the next window choose the colour of your
preference.
• Click Save and close the window to return to the genome browser.
Repeat the process for the gfp control sample, located at:
http://www.ebi.ac.uk/~remco/ChIP-Seq_course/gfp.bw
After uploading, to make sure your data is visible:
• For each of the uploaded *.bw files, confirm that the Wiggle plot style has been
chosen in the Change track style pop-up menu.
What can you say about the profile of Oct4 peaks in this region? There are no
significant Oct4 peaks over the selected region.
Compare it with the H3K4me3 histone modification wig file we have generated at
http://www.ebi.ac.uk/~remco/ChIP-Seq_course/H3K4me3.bw. H3K4me3 has a region
that contains relatively higher peaks than Oct4.
Jump to 1:36066594-36079728 for a sample peak. Do you think the H3K4me3 peak
regions contain one or more modification sites? What about Oct4? Yes. There
are roughly three peaks, which indicate the possibility of more than one
modification site in this region.
For Oct4, no peak can be observed.
MACS generates its peak files in a file format called BED. This is a simple text format
containing genomic locations, specified by chromosome, begin and end positions, and
some more optional information.
See http://genome.ucsc.edu/FAQ/FAQformat.html#format1 for details.
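As a small illustration of the format (the coordinates and score below are invented for the example, not taken from the MACS output):

```shell
# A minimal, made-up BED line: chromosome, start, end, then optional
# fields such as a name and a score.
printf 'chr1\t36066594\t36067000\texample_peak_1\t120.5\n' > toy_peaks.bed

# Because start/end are plain coordinates, peak widths are easy to derive:
awk '{ print $4, "width:", $3 - $2 }' toy_peaks.bed
# Prints: example_peak_1 width: 406
```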
Bed files can also be uploaded to the Ensembl browser.
Try uploading the peak file generated by MACS to Ensembl. Find the first peak in
the file (use the head command to view the beginning of the bed file), and see if the
peak looks convincing to you.
Annotation: From peaks to biological interpretation
The first window allows you to choose between the split application (which we will try
next) and peak annotation. Choose the peak annotation option and click Next.
We would like to find the closest downstream genes to each peak, and the genes that
overlap with the peak region. For that purpose you should choose the NDG option and
click Next.
Fill in the location of the peak file Oct4_peaks.bed, and choose the mouse GTF as the
annotation file. You don’t have to define a symbol file since gene symbols are included in
the GTF file.
Choose the output directory and run the program.
When the program has finished running, you will have the option to generate plots, by
pressing the Generate plots button. This is only possible if R is installed on your
computer, as it is on this system. A PDF file with the plots will be generated in the
output folder. You could generate similar plots with Excel using the output files that
were generated by PeakAnalyzer.
This list of closest downstream genes (contained in the file Oct4_peaks.ndg.bed) can
be the basis of further analysis. For instance, you could look at the Gene Ontology
terms associated with these genes to get an idea of the biological processes that may be
affected. Web-based tools like DAVID (http://david.abcc.ncifcrf.gov) or GOstat
(http://gostat.wehi.edu.au) take a list of genes and return the enriched GO categories.
We can pull out Ensembl Transcript IDs from the Oct4_peaks.ndg.bed file and
write them to another file ready for use with DAVID or GOstat:
cut -f 5 Oct4_peaks.ndg.bed | sed '1 d' > Oct4_peaks.ndg.tid
Motif analysis
It is often interesting to find out whether we can associate the identified binding sites
with a sequence pattern or motif. We will use MEME for motif analysis. The input for
MEME should be a file in FASTA format containing the sequences of interest. In our case,
these are the sequences of the identified peaks that probably contain Oct4 binding sites.
Since many peak-finding tools merge overlapping areas of enrichment, the resulting peaks tend to be much wider than the actual binding sites. Partitioning the enriched loci into a finer-resolution set of individual binding sites, and fetching sequences from the summit regions where binding motifs are most likely to occur, enhances the quality of the motif analysis. Sub-peak summit sequences can be
retrieved directly from the Ensembl database using PeakAnalyzer.
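For reference, a FASTA file simply pairs a `>` header line with the sequence on the following line(s). The record below is a fabricated example of the kind of summit sequence MEME expects (the header name and sequence are invented):

```shell
# Write a one-record FASTA file (header + sequence; the sequence is invented)
printf '>peak_1_summit\nTTTGCATAACAATGGCATTACAATGTCCTG\n' > example.fa

# Each record starts with '>', so counting those lines counts the sequences
grep -c '^>' example.fa   # prints: 1
```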
If you have closed the PeakAnalyzer running window, open it again. If it is still open,
just go back to the first window.
Choose the split peaks utility and click Next. The input consists of files generated by
most peak-finding tools: a file containing the chromosome, start and end locations of the
enriched regions, and a .wig signal file describing the size and shape of each peak. Fill in the locations of both files, Oct4_peaks.bed and the wig file generated by MACS (under the Oct4_MACS_wiggle/treat/ directory), check the option to Fetch subpeak sequences and click Next.
In the next window you have to set some parameters for splitting the peaks.
Since the program has to read large wig files, it will take a few minutes to run. Once the
run is finished, two output files will be produced. The first describes the location of the
sub-peaks, and the second is a FASTA file containing 300 sequences of length 61 bases,
taken from the summit regions of the highest sub-peaks.
• The maximum number of motifs to find (3 by default). For Oct4 one classical motif
is known.
You will receive the results by e-mail. This usually doesn’t take more than a few minutes.
Open the e-mail and click on the link that leads to the HTML results page.
Scroll down until you see the first motif logo. We would like to know if this motif is similar
to any other known motif. We will use TOMTOM for this. Scroll down until you see the
option Submit this motif to. Click the TOMTOM button to compare to known motifs
in motif databases, and on the new page choose to compare your motif to those in the
JASPAR and UniPROBE databases.
Which motif was found to be the most similar to your motif? Sox2
Reference
Chen, X et al.: Integration of external signaling pathways with the core transcriptional
network in embryonic stem cells. Cell 133:6, 1106-17 (2008).
Module: RNA-Seq
Primary Author(s):
Myrto Kostadima, EMBL-EBI [email protected]
Remco Loos, EMBL-EBI [email protected]
Sonika Tyagi, AGRF [email protected]
Contributor(s):
Nathan S. Watson-Haigh [email protected]
Susan M Corley [email protected]
Tools Used
Tophat
http://tophat.cbcb.umd.edu/
Cufflinks
http://cufflinks.cbcb.umd.edu/
Samtools
http://samtools.sourceforge.net/
BEDTools
http://code.google.com/p/bedtools/
UCSC tools
http://hgdownload.cse.ucsc.edu/admin/exe/
IGV
http://www.broadinstitute.org/igv/
edgeR package
http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
CummeRbund manual
http://www.bioconductor.org/packages/release/bioc/vignettes/cummeRbund/
inst/doc/cummeRbund-manual.pdf
Sources of Data
http://www.ebi.ac.uk/ena/data/view/ERR022484
http://www.ebi.ac.uk/ena/data/view/ERR022485
http://www.pnas.org/content/suppl/2008/12/16/0807121105.DCSupplemental
Introduction
The goal of this hands-on session is to perform some basic tasks in the downstream analysis
of RNA-seq data. We will start from RNA-seq data aligned to the zebrafish genome using
Tophat.
We will perform transcriptome reconstruction using Cufflinks and we will compare the gene
expression between two different conditions in order to identify differentially expressed
genes.
In the second part of the tutorial we will also be demonstrating usage of R-based packages
to perform differential expression analysis. We will be using edgeR for the demonstration.
The gene/tag counts generated from the alignment are used as input for edgeR.
All commands entered into the terminal for this tutorial should be from within the ~/RNA-seq directory.
Check that the data directory contains the above-mentioned files by typing:
ls data
Alignment
There are numerous tools for performing short read alignment and the choice of aligner
should be carefully made according to the analysis goals/requirements. Here we will use
Tophat, a widely used ultrafast aligner that performs spliced alignments.
Tophat is based on the Bowtie aligner and uses an indexed genome to speed up the alignment and keep its memory footprint small. The index for the Danio rerio genome has been created for you.
The command to create an index is as follows. You DO NOT need to run this
command yourself - we have done this for you.
bowtie-build genome/Danio_rerio.Zv9.66.dna.fa genome/ZV9
Tophat has a number of parameters in order to perform the alignment. To view them all
type:
tophat --help
Where the last two arguments are the .fastq files of the paired end reads, and the
argument before is the basename of the indexed genome.
The quality values in the FASTQ files used in this hands-on session are Phred+33 encoded.
We explicitly tell tophat of this fact by using the command line argument --solexa-quals.
You can look at the first few reads in the file data/2cells_1.fastq with:
head -n 20 data/2cells_1.fastq
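In Phred+33 encoding, each quality character's ASCII value minus 33 gives the Phred quality score. This can be checked for a single character on the command line:

```shell
# 'I' has ASCII value 73, so its Phred+33 quality score is 73 - 33 = 40
printf 'I' | od -An -tu1 | awk '{ print $1 - 33 }'   # prints: 40
```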
Some other parameters that we are going to use to run Tophat are listed below:
It takes some time (approx. 20 min) to perform tophat spliced alignments, even for
this subset of reads. Therefore, we have pre-aligned the 2cells data for you using the
following command:
You DO NOT need to run this command yourself - we have done this for you.
tophat --solexa-quals -g 2 --library-type fr-unstranded -j \
annotation/Danio_rerio.Zv9.66.spliceSites -o tophat/ZV9_2cells \
genome/ZV9 data/2cells_1.fastq data/2cells_2.fastq
The 6h read alignment will take approx. 20 min to complete. Therefore, we'll take a look at some of the files generated by tophat for the pre-computed 2cells data.
1. Launch IGV by double-clicking the “IGV 2.3.*” icon on the Desktop (ignore any
warnings that you may get as it opens). NOTE: IGV may take several minutes to
load for the first time, please be patient.
2. Choose “Zebrafish (Zv9)” from the drop-down box in the top left of the IGV window. Alternatively, you can load the genome FASTA file.
3. Load the accepted_hits.sorted.bam file by clicking the “File” menu, selecting “Load from File” and navigating to the Desktop/RNA-seq/tophat/ZV9_2cells directory.
4. Rename the track by right-clicking on its name and choosing “Rename Track”. Give
it a meaningful name like “2cells BAM”.
5. Load the junctions.bed from the same directory and rename the track “2cells
Junctions BED”.
6. Load the Ensembl annotations file Danio_rerio.Zv9.66.gtf stored in the RNA-seq/annotation directory.
Can you identify the splice junctions from the BAM file? Splice junctions can be identified in the alignment BAM files. These are the aligned RNA-Seq reads that have skipped bases relative to the reference genome (most likely introns).
Are the junctions annotated for CBY1 consistent with the annotation? Read alignment supports an extended length of exon 5 compared to the gene model (cby1-001).
Are all annotated genes, from both RefSeq and Ensembl, expressed? No. BX000473.1-201 is not expressed.
Once tophat finishes aligning the 6h data you will need to sort the alignments found in
the BAM file and then index the sorted BAM file.
samtools sort tophat/ZV9_6h/accepted_hits.bam \
tophat/ZV9_6h/accepted_hits.sorted
samtools index tophat/ZV9_6h/accepted_hits.sorted.bam
Load the sorted BAM file into IGV, as described previously, and rename the track
appropriately.
Isoform Expression and Transcriptome Assembly
We aim to reconstruct the transcriptome for both samples by using the Ensembl annotation
both strictly and as a guide. In the first case Cufflinks will only report isoforms that are
included in the annotation, while in the latter case it will report novel isoforms as well.
The Ensembl annotation for Danio rerio is available in annotation/Danio_rerio.Zv9.66.gtf.
The general format of the cufflinks command is:
cufflinks [options]* <aligned_reads.(sam|bam)>
Where the input is the aligned reads (either in SAM or BAM format).
Some of the available parameters for Cufflinks that we are going to use to run Cufflinks
are listed below:
-o Output directory.
Perform transcriptome assembly, strictly using the supplied GTF annotations, for the
2cells and 6h data using cufflinks:
# 2cells data (takes approx. 5mins):
cufflinks -o cufflinks/ZV9_2cells_gtf -G \
annotation/Danio_rerio.Zv9.66.gtf -b \
genome/Danio_rerio.Zv9.66.dna.fa -u --library-type fr-unstranded \
tophat/ZV9_2cells/accepted_hits.bam
# 6h data (takes approx. 5mins):
cufflinks -o cufflinks/ZV9_6h_gtf -G annotation/Danio_rerio.Zv9.66.gtf \
-b genome/Danio_rerio.Zv9.66.dna.fa -u --library-type fr-unstranded \
tophat/ZV9_6h/accepted_hits.bam
Cufflinks generates several files in the specified output directory. Here’s a short description
of these files:
So far we have forced cufflinks, by using the -G option, to strictly use the GTF annotations provided, and thus novel transcripts will not be reported. We can get cufflinks to perform a GTF-guided transcriptome assembly by using the -g option instead, in which case novel transcripts will be reported as well.
3. In the search box type ENSDART00000082297 in order for the browser to zoom in to
the gene of interest.
Do you observe any difference between the Ensembl GTF annotations and the GTF-
guided transcripts assembled by cufflinks (the “2cells GTF-Guided Transcripts” track)?
Yes. It appears that the Ensembl annotations may have truncated the last exon.
However, our data also doesn’t contain reads that span between the last two exons.
Differential Expression
One of the stand-alone tools that performs differential expression analysis is Cuffdiff. We use this tool to compare two conditions; for example, the conditions could be control and disease, wild-type and mutant, or two developmental stages.
In our case we want to identify genes that are differentially expressed between two
developmental stages; a 2cells embryo and 6h post fertilization.
The general format of the cuffdiff command is:
Where the input includes a transcripts.gtf file, which is an annotation file of the genome of interest or the cufflinks assembled transcripts, and the aligned reads (either in SAM or BAM format) for the conditions. Some of the Cuffdiff options that we will use to run the program are:
-o Output directory.
We are interested in the differential expression at the gene level. The results are reported
by Cuffdiff in the file cuffdiff/gene_exp.diff. Look at the first few lines of the file
using the following command:
head -n 20 cuffdiff/gene_exp.diff
We would like to see which are the most significantly differentially expressed genes. Therefore we will sort the above file according to the q value (the p value corrected for multiple testing). The result will be stored in a different file called gene_exp_qval.sorted.diff.
sort -t$'\t' -g -k 13 cuffdiff/gene_exp.diff > \
cuffdiff/gene_exp_qval.sorted.diff
Look again at the first few lines of the sorted file by typing:
head -n 20 cuffdiff/gene_exp_qval.sorted.diff
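The -g (general numeric) flag matters here because q-values are often written in scientific notation, which -g orders correctly; a toy illustration:

```shell
# sort -g understands scientific notation, so 1e-06 sorts before 0.05
printf '0.05\n1e-06\n0.3\n' | sort -g | head -n 1   # prints: 1e-06
```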
Copy an Ensembl gene identifier from the first two columns for one of these genes (e.g. ENSDARG00000077178). Now go back to the IGV browser and paste it in the search
box.
What are the various outputs generated by cuffdiff? Hint: Please refer to the Cuffdiff
output section of the cufflinks manual online.
Do you see any difference in the read coverage between the 2cells and 6h con-
ditions that might have given rise to this transcript being called as differentially
expressed?
The coverage on the Ensembl browser is based on raw reads and no normalisation has taken place, in contrast to the FPKM values.
The read coverage of this transcript (ENSDARG00000077178) in the 2cells data set is
much higher than in the 6h data set.
The Cuffquant utility from the Cufflinks suite can be used to generate the count files needed by count-based differential analysis methods such as edgeR and DESeq.
CummeRbund takes the cuffdiff output and populates a SQLite database with the various types of output generated by cuffdiff, e.g. genes, transcripts, transcription start sites, isoforms and CDS regions. The data in this database can be accessed and processed easily. The package comes with a number of built-in plotting functions that are commonly used for visualising expression data. We strongly recommend reading through the Bioconductor manual and user guide of CummeRbund to learn about the functionality of the tool. The reference is provided in the resource section.
Prepare the environment. Go to the cuffdiff output folder and copy the transcripts
file there.
cd ~/RNA-seq/cuffdiff
cp ~/RNA-seq/annotation/Danio_rerio.Zv9.66.gtf ~/RNA-seq/cuffdiff
ls -l
What options would you use to draw a density plot or boxplot for different replicates, if available? (Hint: look at the manual on the Bioconductor website)
densRep<-csDensity(genes(cuff),replicates=T)
brep<-csBoxplot(genes(cuff),replicates=T)
How many differentially expressed genes did you observe? Type summary(sigGenes) at the R prompt to see.
Do these categories make sense given the samples we’re studying? Developmental
Biology
Browse around the DAVID website and check what other information is available.
Cellular component, Molecular function, Biological Processes, Tissue expression,
Pathways, Literature, Protein domains
Differential Gene Expression Analysis using edgeR
How many rows (genes) are retained now? dim(y) would give you 16494.
How many genes were filtered out? 37435 - 16494 = 20941 genes.
Let's have a look at whether the samples cluster by condition. (You should produce a plot
as shown in Figure 4):
plotMDS(y)
We will plot the tagwise dispersion and the common dispersion (You should obtain a plot
as shown in the Figure 5):
plotBCV(y)
We see here that the common dispersion estimates the overall Biological Coefficient of
Variation (BCV) of the dataset averaged over all genes. The common dispersion is 0.02
and the BCV is the square root of the common dispersion (sqrt[0.02] = 0.14). A BCV of
14% is typical for a cell line experiment.
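The square-root relationship can be checked quickly on the command line:

```shell
# BCV = sqrt(common dispersion): sqrt(0.02) is approximately 0.14
awk 'BEGIN { printf "%.2f\n", sqrt(0.02) }'   # prints: 0.14
```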
We now test for differentially expressed genes:
et <- exactTest(y)
Now we will use the topTags function to adjust for multiple testing. We will use the Benjamini-Hochberg ("BH") method and we will produce a table of results:
res <- topTags(et, n=nrow(y$counts), adjust.method="BH")$table
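Conceptually, the BH adjustment multiplies each sorted p-value by n/rank (followed by a monotonicity correction, omitted here). A toy illustration of that first step, outside of R:

```shell
# BH step 1: for p-values sorted ascending, adjusted = p * n / rank
printf '0.01\n0.02\n0.03\n0.04\n' |
awk '{ p[NR] = $1 } END { n = NR; for (i = 1; i <= n; i++) printf "%.3f\n", p[i] * n / i }'
# prints four lines, all 0.040 in this toy case
```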
You can see we have the Ensembl gene identifier in the first column, the log fold change in the second column, then the logCPM, the P-Value and the adjusted P-Value. The Ensembl gene identifier is not as helpful as the gene symbol, so let's add in a column with the gene symbol.
We then use the function getBM to get the gene symbol data we want. This can take about a minute or so to complete.
We see that we have 3 columns: the Ensembl ID, the Entrez Gene ID and the HGNC symbol.
We use the match function to match up our data with the data we have just retrieved
from the database.
idx <- match(ensembl_names, genemap$ensembl_gene_id )
res$entrez <-genemap$entrezgene [ idx ]
res$hgnc_symbol <- genemap$hgnc_symbol [ idx ]
As you can see, we have now added the HGNC symbol and the Entrez ID to our results.
Let’s now make a subset of the most significant upregulated and downregulated genes:
de<-res[res$FDR<0.05, ]
de_upreg <-res[res$FDR<0.05 & res$logFC >0,]
de_downreg <-res[res$FDR<0.05 & res$logFC <0,]
How many differentially expressed genes are there? (Hint: Try str(de)) 4429
How many upregulated genes and downregulated genes do we have? str(de_upreg) = 2345; str(de_downreg) = 2084
You can try running the list through DAVID for functional annotation. We will select the top 3000 genes from the differentially expressed list and write their gene symbols to a separate file.
de_top_3000 <-de[1:3000,]
de_top_gene_symbols <-de_top_3000$hgnc_symbol
write(de_top_gene_symbols, "DE_gene_symbols.txt", sep="\t")
Please note that the output files you are creating are saved in your present working
directory. If you are not sure where you are in the file system try typing pwd on your
command prompt to find out.
References
1. Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics 25, 1105-1111 (2009).
3. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol. 10, R25
(2009).
4. Roberts, A., Pimentel, H., Trapnell, C. & Pachter, L. Identification of novel tran-
scripts in annotated genomes using RNA-Seq. Bioinformatics 27, 2325-2329 (2011).
5. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L. & Pachter, L. Improving RNA-
Seq expression estimates by correcting for fragment bias. Genome Biol. 12, R22
(2011).
6. Robinson MD, McCarthy DJ and Smyth GK. edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatics, 26
(2010).
Module: de novo Genome Assembly
Primary Author(s):
Matthias Haimel [email protected]
Nathan S. Watson-Haigh [email protected]
Contributor(s):
Key Learning Outcomes
• Compile velvet with appropriate compile-time parameters set for a specific analysis
Tools Used
Velvet
http://www.ebi.ac.uk/˜zerbino/velvet/
AMOS Hawkeye
http://apps.sourceforge.net/mediawiki/amos/index.php?title=Hawkeye
gnx-tools
https://github.com/mh11/gnx-tools
FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
R
http://www.r-project.org/
Sources of Data
• ftp://ftp.ensemblgenomes.org/pub/release-8/bacteria/fasta/Staphylococcus/s_aureus_mrsa252/dna/s_aureus_mrsa252.EB1_s_aureus_mrsa252.dna.chromosome.Chromosome.fa.gz
• http://www.ebi.ac.uk/ena/data/view/SRS004748
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022825/SRR022825.fastq.gz
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022823/SRR022823.fastq.gz
• http://www.ebi.ac.uk/ena/data/view/SRX008042
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022852/SRR022852_1.fastq.gz
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022852/SRR022852_2.fastq.gz
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR023/SRR023408/SRR023408_1.fastq.gz
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR023/SRR023408/SRR023408_2.fastq.gz
• http://www.ebi.ac.uk/ena/data/view/SRX000181
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR000/SRR000892/SRR000892.fastq.gz
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR000/SRR000893/SRR000893.fastq.gz
• http://www.ebi.ac.uk/ena/data/view/SRX007709
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022863/SRR022863_1.fastq.gz
• ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022863/SRR022863_2.fastq.gz
Introduction
The aim of this module is to become familiar with performing de novo genome assembly
using Velvet, a de Bruijn graph based assembler, on a variety of sequence data.
Now create sub-directories for this and the two other velvet practicals. All these directories
will be made as sub-directories of a directory for the whole course called NGS. For this
you can use the following commands:
mkdir -p NGS/velvet/{part1,part2,part3}
The -p tells mkdir (make directory) to make any parent directories if they don’t already
exist. You could have created the above directories one-at-a-time by doing this instead:
mkdir NGS
mkdir NGS/velvet
mkdir NGS/velvet/part1
mkdir NGS/velvet/part2
mkdir NGS/velvet/part3
After creating the directories, examine the structure and move into the directory ready
for the first velvet exercise by typing:
ls -R NGS
cd NGS/velvet/part1
pwd
Downloading and Compiling Velvet
Although you will be using the preinstalled version of velvet, it is useful to know how to
compile velvet as some of the parameters you might like to control can only be set at
compile time. You can find the latest version of velvet at:
http://www.ebi.ac.uk/˜zerbino/velvet/
You could go to this URL and download the latest velvet version, or equivalently, you
could type the following, which will download, unpack, inspect, compile and execute your
locally compiled version of velvet:
cd ~/NGS/velvet/part1
pwd
tar xzf ~/NGS/Data/velvet_1.2.10.tgz
ls -R
cd velvet_1.2.10
make
./velveth
The standard output displayed on screen when make runs may contain an error message, but it can be ignored.
Take a look at the executables you have created. They will be displayed as green by the
command:
ls --color=always
The --color switch instructs ls to colour files according to their type. This is often the default, but here we are being explicit. By specifying the value always, we ensure that colouring is always applied, even from a script.
Have a look at the output the command produces and you will see that the MAXKMERLENGTH=31 and CATEGORIES=2 parameters were passed into the compiler.
This indicates that the default compilation was set for de Bruijn graph k-mers of maximum
size 31 and to allow a maximum of just 2 read categories. You can override these, and
other, default configuration choices using command line parameters. Suppose you want to run velvet with a k-mer length of 41 using 3 categories; velvet needs to be recompiled to enable this functionality by typing:
make clean
make MAXKMERLENGTH=41 CATEGORIES=3
./velveth
velvet can also be used to process SOLiD colour space data. To do this you need a further make parameter. Clean away your last compilation and try the following parameters:
make clean
make MAXKMERLENGTH=41 CATEGORIES=3 color
./velveth_de
For a further description of velvet compile and runtime parameters please see the velvet
Manual: https://github.com/dzerbino/velvet/wiki/Manual
Assembling Single-end Reads
http://www.ebi.ac.uk/ena/data/view/SRS004748
To begin with, first move back to the directory you prepared for this exercise, create a new
folder with a suitable name for this part and move into it. There is no need to download
the read files, as they are already stored locally. Instead we will create symlinks to the
files. Continue by copying (or typing):
cd ~/NGS/velvet/part1
mkdir SRS004748
cd SRS004748
pwd
ln -s ~/NGS/Data/SRR022825.fastq.gz ./
ln -s ~/NGS/Data/SRR022823.fastq.gz ./
ls -l
You are ready to process your data with Velvet. There are two main components to Velvet: velveth, which builds a hashed index of the reads, and velvetg, which builds and manipulates the de Bruijn graph and produces the assembly. You can always get further information about the usage of both velvet programs by typing velvetg or velveth in your terminal.
Now run velveth for the reads in SRR022825.fastq.gz and SRR022823.fastq.gz using
the following options:
Once velveth finishes, move into the output directory run_25 and have a look at what velveth has generated so far. The command less allows you to look at output files (press q to quit and return to the command prompt). Here are some other options for looking at file contents:
cd run_25
ls -l
head Sequences
cat Log
What did you find in the folder run_25? Sequences, Roadmaps, Log
Describe the content of the two velveth output files. Sequences: a FASTA version of the provided reads.
Roadmaps: an internal velvet file with basic information about the number of reads and the k-mer size.
What does the Log file store for you? A time stamp, the executed commands, the velvet version + compiler parameters, and the results.
Now move one directory level up and run velvetg on your output directory, with the
commands:
cd ../
time velvetg run_25
Move back into your results directory to examine the effects of velvetg:
cd run_25
ls -l
What extra files do you see in the folder run_25? PreGraph, Graph, stats.txt, contigs.fa, LastGraph
What do you suppose they might represent? PreGraph, Graph, LastGraph: Velvet
internal graph representation at different stages (see manual for more details about
the file format)
stats.txt: tab-delimited description of the nodes of the graph incl. coverage information
contigs.fa: assembly output file
In the Log file in run_25, what is the N50? 4409 bp
Hopefully, we will have discussed what the N50 statistic is by this point. Broadly, it is a median (not average) of a sorted set of sequence lengths, weighted by length. Usually it is the length of the contig whose length, when added to the lengths of all longer contigs, makes a total greater than half the sum of the lengths of all contigs. Easy, but messy - a more formal definition can be found here:
http://www.broadinstitute.org/crd/wiki/index.php/N50
Backup the contigs.fa file and calculate the N50 (and the N25,N75) value with the
command:
cp contigs.fa contigs.fa.0
gnx -min 100 -nx 25,50,75 contigs.fa
Does the value of N50 agree with the value stored in the Log file? No
If not, why do you think this might be? K-mer N50 vs bp N50; contig length cut-off
value, estimated genome length
In order to improve our results, take a closer look at the standard options of velvetg by typing velvetg without parameters. For the moment focus on the two options -cov_cutoff and -exp_cov. Clearly -cov_cutoff will allow you to exclude contigs for which the k-mer coverage is low, implying unacceptably poor quality. The -exp_cov switch is used to give velvetg an idea of the coverage to expect.
If the expected coverage of any contig is substantially in excess of the suggested expected
value, maybe this would indicate a repeat. For further details of how to choose the
parameters, go to “Choice of a coverage cutoff”:
http://wiki.github.com/dzerbino/velvet/
Briefly, the k-mer coverage (and much more information) for each contig is stored in the
file stats.txt and can be used with R to visualize the k-mer coverage distribution. Take
a look at the stats.txt file, start R, load and visualize the data using the following
commands:
R --no-save --no-restore
install.packages('plotrix')
library(plotrix)
data <- read.table("stats.txt", header=TRUE)
weighted.hist(data$short1_cov, data$lgth, breaks=0:50)
After choosing the expected coverage and the coverage cut-off, you can exit R by typing:
q()
The weighted histogram suggests that the expected coverage is around 14 and that everything below 6 is likely to be noise. Some coverage is also represented at around 20, 30 and greater than 50, which might be contamination or repeats (depending on the dataset),
but at the moment this should not worry you. To see the improvements, rerun velvetg, first with -cov_cutoff 6 alone and then, after checking the N50, adding -exp_cov 14 to the command line options. Also keep a copy of the contigs file for comparison:
cd ~/NGS/velvet/part1/SRS004748
time velvetg run_25 -cov_cutoff 6
You were running velvetg with the -exp_cov and -cov_cutoff parameters. Now try to experiment using different cut-offs, expected parameters and also explore other settings (e.g. -max_coverage, -max_branch_length, -unused_reads, -amos_file, -read_trkg or see the velvetg help menu).
Make some notes about the parameters you've played with and the results you obtained. -max_coverage: cut-off value for the upper range (like -cov_cutoff for the lower range)
-max_branch_length: length of branch to look for a bubble
-unused_reads: write unused reads into a file
-amos_file: write an AMOS message file
-read_trkg: track reads (more memory usage) - automatically on for certain operations
AMOS Hawkeye
The -amos_file argument tells velvetg to output the assembly as an AMOS message file (*.afg) which can then be used by tools like Hawkeye from the AMOS suite of tools. Let's create the AMOS message file by running velvetg with some appropriate parameters:
velvetg run_25 -cov_cutoff 6 -exp_cov 14 -amos_file yes
The -exp_cov argument causes Velvet to enable read tracking (-read_trkg yes). Without read tracking enabled, very little read-level information can be output to the AMOS message file. This results in a pretty useless visualisation in Hawkeye! However, since reads are being tracked, the analysis takes longer and uses more memory.
Now convert the AMOS message file velvet_asm.afg into an AMOS bank using bank-transact and view the assembly with AMOS Hawkeye.
Have a look around the interface; in particular, try to look at the “Scaffold View” and “Contig View” of the largest scaffold. You should see something like this:
Figure 7:
If you have time, try running the velvetg command without the -exp_cov argument, create the AMOS bank and see how the assemblies look different in Hawkeye. Here's a hint:
velvetg run_25 -cov_cutoff 6 -amos_file yes
bank-transact -c -b run_25/velvet_asm.bnk -m run_25/velvet_asm.afg
hawkeye run_25/velvet_asm.bnk
In this exercise you will process the single whole genome sequence with velveth and
velvetg, look at the output only and go no further. The main intent of processing
this whole genome is to compute its N50 value. This must clearly be very close to the
ideal N50 value for the short reads assembly and so aid evaluation of that assembly.
To begin, move back to the main directory for this exercise, make a sub-directory for
the processing of this data and move into it. All in one go, this would be:
cd ~/NGS/velvet/part1/
mkdir MRSA252
cd MRSA252
Usually Velvet expects relatively short sequence entries and for this reason has a read limit of 32,767 bp per sequence entry. As the genome size is 2,902,619 bp - longer than the allowed limit - it does not fit into velvet with the standard settings. But, like the maximum k-mer size option, you can tell Velvet at compile time, using LONGSEQUENCES=Y, to expect longer input sequences than usual. We have already prepared the executables, which you can use by typing velveth_long and velvetg_long.
Now, run velveth_long, using the file you either just downloaded or created a symlink
to as the input:
velveth_long run_25 25 -fasta.gz -long \
s_aureus_mrsa252.EB1_s_aureus_mrsa252.dna.chromosome.Chromosome.fa.gz
velvetg_long run_25
If you are doing de novo assembly, pay the extra and get paired-ends: they’re worth
it!
The data you will examine in this exercise is again from Staphylococcus aureus which has
a genome of around 3MBases. The reads are Illumina paired end with an insert size of
350 bp.
The required data can be downloaded from the SRA. Specifically, the run data (SRR022852)
from the SRA Sample SRS004748.
http://www.ebi.ac.uk/ena/data/view/SRS004748
The following exercise focuses on preparing the paired-end FASTQ files ready for Velvet,
using Velvet in paired-end mode and comparing results with Velvet’s ’auto’ option.
First move to the directory you made for this exercise and make a suitably named directory
for the exercise:
cd ~/NGS/velvet/part2
mkdir SRS004748
cd SRS004748
There is no need to download the read files, as they are already stored locally. You will
simply create a symlink to this pre-downloaded data using the following commands:
ln -s ~/NGS/Data/SRR022852_?.fastq.gz ./
top is a program that continually monitors all the processes running on your computer,
showing the resources used by each. Leave this running and refer to it periodically
throughout your Velvet analyses, particularly if they are taking a long time or whenever
your curiosity gets the better of you. You should find that, as this practical progresses,
memory usage will increase significantly.
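If you prefer a one-shot snapshot to an interactive display, similar information can be pulled from ps. (Inside top itself, pressing Shift+M on most Linux top implementations sorts processes by memory use.) A small sketch, assuming GNU ps:

```shell
# One-shot snapshot of the most memory-hungry processes, sorted by %MEM
# (a non-interactive alternative to watching top).
ps aux --sort=-%mem | head -n 6
```

This prints the ps header line followed by the five processes currently using the most memory.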
Now, back to the first terminal, you are ready to run velveth and velvetg. The reads
are -shortPaired and for the first run you should not use any parameters for velvetg.
From this point on, it will be informative to time your runs. This is very easy to do:
just prefix the command to run the program with the command time. This will cause
UNIX to report how long the program took to complete its task.
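For example (any command works; sleep is just a stand-in here):

```shell
# 'time' prints three durations after the wrapped command finishes:
#   real - wall-clock time elapsed
#   user - CPU time spent in user mode
#   sys  - CPU time spent in the kernel on the program's behalf
time sleep 1
```

Note that for multi-threaded programs, user time can exceed real time because CPU time is summed across all cores.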
Set the two stages of velvet running, whilst you watch the memory usage as reported by
top. Time the velvetg stage:
velveth run_25 25 -fmtAuto -create_binary -shortPaired -separate \
SRR022852_1.fastq.gz SRR022852_2.fastq.gz
time velvetg run_25
What do -fmtAuto and -create_binary do? (see help menu) -fmtAuto tries to
detect the correct format of the input files, e.g. FASTA or FASTQ, and whether they
are compressed or not.
-create_binary outputs sequences as a binary file. That means that velvetg can
read the sequences from the binary file more quickly than from the original sequence
files.
Comment on the use of memory and CPU for velveth and velvetg? velveth uses
only one CPU while velvetg uses all possible CPUs for some parts of the calculation.
Next, after saving your contigs.fa file from being overwritten, set the cut-off parameters
that you investigated in the previous exercise and rerun velvetg. Time it and monitor the
use of resources as previously. Start with -cov_cutoff 16 thus:
mv run_25/contigs.fa run_25/contigs.fa.0
time velvetg run_25 -cov_cutoff 16
Up until now, velvetg has ignored the paired-end information. Now try running velvetg
with both -cov_cutoff 16 and -exp_cov 26, but first save your contigs.fa file. By
using -cov_cutoff and -exp_cov, velvetg tries to estimate the insert length, which you
will see in the velvetg output. The command is, of course:
mv run_25/contigs.fa run_25/contigs.fa.1
time velvetg run_25 -cov_cutoff 16 -exp_cov 26
Comment on the time required, use of memory and CPU for velvetg? Runtime
is lower when velvet can reuse previously calculated data. By using -exp_cov, the
memory usage increases.
Which insert length does Velvet estimate? Paired-end library 1 has length: 228,
sample standard deviation: 26
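The length and sample standard deviation reported here are just the ordinary estimators over the observed insert sizes. A toy illustration in shell/awk (made-up insert sizes, not the workshop data):

```shell
# Mean and sample standard deviation (n-1 denominator) of a few
# made-up insert sizes.
printf '200\n220\n240\n' | awk '
  { x[NR] = $1; sum += $1 }
  END {
    mean = sum / NR
    for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
    printf "mean=%.0f sd=%.0f\n", mean, sqrt(ss / (NR - 1))
  }'
```

For these three sizes, the script prints mean=220 sd=20.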
Next try running velvetg in ‘paired-end mode’. This entails running velvetg specifying
the insert length with the parameter -ins_length set to 350. Even though velvet estimates
the insert length, it is always advisable to check/provide the insert length manually, as
velvet can get the statistics wrong due to noise. Just in case, save your last version of
contigs.fa. The commands are:
mv run_25/contigs.fa run_25/contigs.fa.2
time velvetg run_25 -cov_cutoff 16 -exp_cov 26 -ins_length 350
mv run_25/contigs.fa run_25/contigs.fa.3
What is the N50 value for the velvetg runs using the switches:
Base run: 19,510 bp
-cov_cutoff 16: 24,739 bp
-cov_cutoff 16 -exp_cov 26: 61,793 bp
-cov_cutoff 16 -exp_cov 26 -ins_length 350: N50 of 62,740 bp; max 194,649 bp;
total 2,871,093 bp
Try giving the -cov_cutoff and/or -exp_cov parameters the value auto. See the velvetg
help to show you how. The information Velvet prints while running includes the values
it used (coverage cut-off or insert length) when the auto option is given.
What coverage values does Velvet choose (hint: look at the output that Velvet
produces while running)? Median coverage depth = 26.021837
Removing contigs with coverage < 13.010918 . . .
How does the N50 value change? n50 of 68,843 bp; max 194,645 bp; total 2,872,678
bp
Run gnx on all the contigs.fa files you have generated in the course of this exercise. The
command will be:
gnx -min 100 -nx 25,50,75 run_25/contigs.fa*
For which runs are there Ns in the contigs.fa file and why? contigs.fa.2, contigs.fa.3,
contigs.fa
Velvet uses the provided (or inferred) insert length and fills ambiguous regions
with Ns.
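A quick way to check for yourself which files contain Ns (file names as produced earlier; the grep -v '^>' excludes the contig header lines, whose NODE_ names also contain the letter N):

```shell
# Count N characters in the sequence lines of each contigs file.
for f in run_25/contigs.fa*; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  printf '%s: ' "$f"
  grep -v '^>' "$f" | grep -o 'N' | wc -l
done
```

Files produced without paired-end scaffolding information should report zero Ns.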
Comment on the number of contigs and total length generated for each run.
AMOS Hawkeye
We will now output the assembly in the AMOS message format and visualise the assembly
using AMOS Hawkeye.
Run velvetg with appropriate arguments and output the AMOS message file, then
convert it to an AMOS bank and open it in Hawkeye:
time velvetg run_25 -cov_cutoff 16 -exp_cov 26 -ins_length 350 \
-amos_file yes -read_trkg yes
time bank-transact -c -b run_25/velvet_asm.bnk -m run_25/velvet_asm.afg
hawkeye run_25/velvet_asm.bnk
Looking at the scaffold view of a contig, comment on the proportion of “happy mates”
to “compressed mates” and “stretched mates”. Nearly all mates are compressed with
no stretched mates and very few happy mates.
What is the mean and standard deviation of the insert size reported under the
Libraries tab? Mean: 350 bp SD: 35 bp
Look at the actual distribution of insert sizes for this library. Can you explain why
there is a difference between the mean and SD reported in those two places? We
specified -ins_length 350 to the velvetg command. Velvet uses this value in the
AMOS message file it outputs, rather than its own estimate.
You can get AMOS to re-estimate the mean and SD of insert sizes using intra-contig pairs.
First, close Hawkeye and then run the following commands before reopening the AMOS
bank to see what has changed.
asmQC -b run_25/velvet_asm.bnk -scaff -recompute -update -numsd 2
hawkeye run_25/velvet_asm.bnk
Looking at the scaffold view of a contig, comment on the proportion of “happy mates”
to “compressed mates” and “stretched mates”. There are only a few compressed and
stretched mates compared to happy mates. There are similar numbers of stretched
and compressed mates.
What is the mean and standard deviation of the insert size reported under the
Libraries tab? Mean: 226 bp SD: 25 bp
Look at the actual distribution of insert sizes for this library. Does the mean and SD
reported in both places now match? Yes
Can you find a region with an unusually high proportion of stretched, compressed,
incorrectly orientated or linking mates? What might this situation indicate? This
would indicate a possible misassembly and be worthy of further investigation.
Look at the largest scaffold: there are stacks of stretched pairs which span contig
boundaries. This indicates that the gap size has been underestimated during the
scaffolding phase.
For this reason, it is vitally important to perform read QC and quality trimming. In
doing so, we remove errors/noise from the dataset, which in turn means velvet will run
faster, use less memory and produce a better assembly, assuming we haven't
compromised too much on coverage.
To investigate the effect of data quality, we will use the run data (SRR023408) from the
SRA experiment SRX008042. The reads are Illumina paired end with an insert size of 92
bp.
Go back to the main directory for this exercise and create and enter a new directory
dedicated to this phase of the exercise. The commands are:
cd ~/NGS/velvet/part2
mkdir SRX008042
cd SRX008042
Create symlinks to the read data files that we downloaded for you from the SRA:
ln -s ~/NGS/Data/SRR023408_?.fastq.gz ./
We will use FastQC, a tool you should be familiar with, to visualise the quality of our
data. We will use FastQC in the Graphical User Interface (GUI) mode.
Start FastQC and set the process running in the background, by using a trailing &, so we
get control of our terminal back for entering more commands:
fastqc &
Open the two compressed FASTQ files (File -> Open) by selecting them both and
clicking OK. Look at the tabs for both files:
Figure 8:
Are the quality scores the same for both files? Overall yes
Which value varies? Per sequence quality scores
Take a look at the Per base sequence quality for both files. Did you note that it is
not good for either file? The quality scores of both files drop very fast. Qualities of
the REV strand drop faster than those of the FWD strand. This is because the template
has been sitting around while the FWD strand was sequenced.
At which positions would you cut the reads if we did “fixed length trimming”? Looking
at the “Per base quality” and “Per base sequence content”, I would choose around position 27.
Why does the quality deteriorate towards the end of the read? Errors are more likely
in later cycles.
Does it make sense to trim the 5’ start of reads? Looking at the “Per base sequence
content”, yes - there is a clear signal at the beginning.
Which other statistics could you use to support your trimming strategy? “Per base
sequence content”, “Per base GC content”, “Kmer content”, “Per base sequence
quality”
Figure 9:
Once you have decided what your trim points will be, close FastQC. We will use
fastx_trimmer from the FASTX-Toolkit to perform fixed-length trimming. For usage
information see the help:
fastx_trimmer -h
fastx_trimmer is not able to read compressed FASTQ files, so we first need to decompress
the files ready for input.
The suggestion (hopefully not far from your own thoughts?) is that you trim your reads
as follows:
gunzip < SRR023408_1.fastq.gz > SRR023408_1.fastq
gunzip < SRR023408_2.fastq.gz > SRR023408_2.fastq
fastx_trimmer -Q 33 -f 1 -l 32 -i SRR023408_1.fastq -o \
SRR023408_trim1.fastq
fastx_trimmer -Q 33 -f 1 -l 27 -i SRR023408_2.fastq -o \
SRR023408_trim2.fastq
Many NGS read files are large. This means that simply reading and writing files can
become the bottleneck, also known as I/O bound. Therefore, it is often good practice
to avoid unnecessary disk read/write.
We could do what is called pipelining to send a stream of data from one command to
another, using the pipe (|) character, without the need for intermediary files. The
following command would achieve this:
gunzip --to-stdout < SRR023408_1.fastq.gz | fastx_trimmer -Q 33 -f 4 \
-l 32 -o SRR023408_trim1.fastq
gunzip --to-stdout < SRR023408_2.fastq.gz | fastx_trimmer -Q 33 -f 3 \
-l 29 -o SRR023408_trim2.fastq
Now run velveth with a k-mer value of 21 for both the untrimmed and trimmed read
files in -shortPaired mode. Separate the output of the two executions of velveth into
suitably named directories, followed by velvetg:
# untrimmed reads
velveth run_21 21 -fmtAuto -create_binary -shortPaired -separate \
SRR023408_1.fastq SRR023408_2.fastq
time velvetg run_21
# trimmed reads
velveth run_21trim 21 -fmtAuto -create_binary -shortPaired -separate \
SRR023408_trim1.fastq SRR023408_trim2.fastq
time velvetg run_21trim
How long did the two velvetg runs take? run_21: real 3m16.132s; user 8m18.261s;
sys 0m7.317s
run_21trim: real 1m18.611s; user 3m53.140s; sys 0m4.962s
What N50 scores did you achieve? Untrimmed: 11
Trimmed: 15
What were the overall effects of trimming? Time saving, increased N50, reduced
coverage
The evidence is that trimming improved the assembly. The thing to do, surely, is to
run velvetg with -cov_cutoff and -exp_cov. In order to use -cov_cutoff and
-exp_cov sensibly, you need to investigate with R, as you did in the previous exercise,
what parameter values to use. Start up R and produce the weighted histograms:
R --no-save
library(plotrix)
data <- read.table("run_21/stats.txt", header=TRUE)
data2 <- read.table("run_21trim/stats.txt", header=TRUE)
par(mfrow=c(1,2))
weighted.hist(data$short1_cov, data$lgth, breaks=0:50)
weighted.hist(data2$short1_cov, data2$lgth, breaks=0:50)
Figure 10: Weighted k-mer coverage histograms of the paired-end reads pre-trimmed
(left) and post-trimmed (right).
For the untrimmed read histogram (left) there is an expected coverage of around 13 with a
coverage cut-off of around 7. For the trimmed read histogram (right) there is an expected
coverage of around 9 with a coverage cut-off of around 5.
If you disagree, feel free to try different settings, but first quit R before running velvetg:
q()
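Using the estimates just read from the histograms, the velvetg reruns might look like this (a sketch only; substitute your own values if you disagreed with the estimates, and save any contigs.fa files you want to keep first):

```shell
# Rerun velvetg with the histogram-derived parameters
# (untrimmed: cut-off ~7, expected coverage ~13;
#  trimmed:   cut-off ~5, expected coverage ~9).
time velvetg run_21 -cov_cutoff 7 -exp_cov 13
time velvetg run_21trim -cov_cutoff 5 -exp_cov 9
```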
Compare the results produced during the last exercises with each other:
bp Coverage
k-mer Coverage
Table 8:
bp Coverage: 136x (36 bp; 11,374,488) | 95x (37 bp; 7,761,796) | 82x (32 bp; 7,761,796)
k-mer Coverage: 45x | 43x (k=21); 33x (k=25) | 30x (k=21); 20.5x (k=25)
Table 9:
Hybrid Assembly
Like the previous examples, the data you will examine in this exercise is again from
Staphylococcus aureus which has a genome of around 3MB. The reads are 454 single
end and Illumina paired end with an insert size of 170 bp. You already downloaded
the required reads from the SRA in previous exercises. Specifically, the run data
(SRR022863, SRR000892, SRR000893) from the SRA experiments SRX007709 and
SRX000181.
The following exercise focuses on handling 454 long reads and paired-end reads with
velvet and the differences in setting parameters.
First move to the directory you made for this exercise, make a suitably named directory
for the exercise and check that all three files are in place:
cd ~/NGS/velvet/part3
mkdir SRR000892-SRR022863
cd SRR000892-SRR022863
# 454 single end data
ln -s ~/NGS/Data/SRR00089[2-3].fastq.gz ./
# illumina paired end data
ln -s ~/NGS/Data/SRR022863_?.fastq.gz ./
The following command will run for a LONG time. This indicates the amount of
calculation being performed by Velvet to reach a conclusion. Waiting for velvet
to finish would exceed the time available in this workshop, so it is up to you
to either let it run overnight or kill the process by using the key combination
CTRL+c.
velveth run_25 25 -fmtAuto -create_binary -long \
SRR00089?.fastq.gz -shortPaired -separate \
SRR022863_1.fastq.gz SRR022863_2.fastq.gz
time velvetg run_25
What are your conclusions using Velvet in a hybrid assembly? 17 min: time
velvetg run_25
Post-Workshop Information
Access to Computational Resources
We’re ecstatic you’re thinking this way and want to help guide you! However, let’s take
this one step at a time.
The quickest way to dabble is to use a clone of the operating system (OS) you’ve been
using during this workshop. That means you’ll have hassle-free access to a plethora of
pre-installed, pre-configured bioinformatics tools. You could even set it up to contain a
copy of all the workshop data and handouts etc to go through the hands-on practicals in
your own time!
We have created an image file (approx. 10 GBytes in size) of the NGS Training OS for
you to use as you wish:
https://swift.rc.nectar.org.au:8888/v1/AUTH_33065ff5c34a4652aa2fefb292b3195a/
VMs/NGSTrainingV1.2.1.vdi
We would advise one of the following two approaches for making use of it:
• Import it into VirtualBox to setup a virtual machine (VM) on your own computer.
• Instantiate a VM on the NeCTAR Research Cloud.
VirtualBox is software that lets you run an operating system (the guest OS) within another (the host OS). VirtualBox is
available for several different host OSes including MS Windows, OS X, Linux and Solaris
(https://www.virtualbox.org/wiki/Downloads). Once VirtualBox is installed on your
host OS, you can then install a guest OS inside VirtualBox. VirtualBox supports a lot of
different OSes (https://www.virtualbox.org/wiki/Guest_OSes).
Here are the steps to setting up a VM in VirtualBox with our image file:
2. Start VirtualBox and click New to start the Create New Virtual Machine wizard
3. Give the VM a useful name like “NGS Training” and choose Linux and either
Ubuntu or Ubuntu (64-bit) as the OS Type
4. Give the VM access to a reasonable amount of the host OS's memory, i.e. somewhere
near the top of the green. If this value is < 2000 MB, you are likely to have insufficient
memory for your NGS data analysis needs.
5. For the virtual hard disk, select “Use existing hard disk” and browse to and select
the NGSTrainingV1.2.1.vdi file you downloaded.
8. Once booted, log into the VM as either ubuntu (a sudoer user; i.e. has admin rights)
or as ngstrainee (a regular unprivileged user). See table below for passwords.
The online dashboard is a graphical interface for administering (creating, deleting, rebooting etc.) your virtual machines (VMs) on the NeCTAR research cloud.
2. When you see the following page, click the “Log In” button:
Figure 11:
3. At the following screen, simply choose your institution from the dropdown box and
click “Select”. Now follow the on screen prompts and enter your regular institutional
login details.
Figure 12:
4. If you see the following screen, congratulations, you have successfully logged into
the NeCTAR Research Cloud dashboard!
Figure 13:
We will now show you how to instantiate the “NGS Training” image using your own
personal cloud allocation.
1. In the NeCTAR Research Cloud dashboard, click “Images & Snapshots” to list all the
publicly available images from which you can instantiate a VM. Under “Snapshots”,
click the “Launch” button for the latest version of the “NGSTraining” snapshot:
Figure 14:
2. You will now see a “Launch Instances” window where you are required to enter some
details about how you want the VM to be set up before clicking “Launch Instance”.
In the “Launch Instances” pop-up frame choose the following settings:
Server Name A human readable name for your convenience. e.g. “My NGS VM”
Flavor The resources you want to allocate to this VM. I suggest a Medium sized VM
(2 CPUs and 8 GBytes RAM). This will use all your personal allocation, but
anything less will probably be insufficient. You could request a new allocation
of resources if you want to instantiate a larger VM with more memory.
Security Groups Select SSH.
Figure 15:
4. You will be taken to the “Instances” page and you will see the “Status” and “Task”
column for your new VM is “Building” and “Spawning”. Once the “IP Address” cell
is populated, take a note of it as you will need it for configuring the NX Client later
on.
Figure 16:
5. Once the Status and Task for the VM change to “Active” and “None” respectively,
your VM is powered up and is configuring itself. Congratulations, you have now
instantiated a Virtual Machine! If you try to connect to the VM too quickly, you
might not be successful. The OS may still be configuring itself, so give it a few
minutes to finish before continuing.
Figure 17:
Sometimes, the cloud experiences a “hiccup” and a newly instantiated VM will get stuck
in the “Build” and “Spawning” state (step 3) for more than a few minutes. This can be
rectified by terminating the instance and creating a new VM from scratch:
Figure 18:
2. Go back to step 1 of “Instantiating Your Own VM” and create the VM from scratch:
• You have administrator rights on your local computer for installing software.
We show you instructions below for the MS Windows version of the NX Client, but
procedures for other supported OSes (Linux, Mac OSX and Solaris) should be very
similar.
Figure 19:
3. On the ”NX Client for Windows” page, click the ”Download package” button:
Figure 20:
• You know the IP address of the VM you want to remote desktop into.
1. Start the NX Connection Wizard and click ”Next” to advance to the ”Session”
settings page.
Figure 21:
Session A memorable name so you know which VM this session is pointing at. You
could use the same name you chose for the VM you instantiated earlier e.g.
”NGS Training”.
Host Enter the IP address of the VM you instantiated on the NeCTAR Research
Cloud.
Figure 22:
3. Click ”Next” to advance to the ”Desktop” settings page. You should use the ”Unix
GNOME” setting.
Figure 23:
Connecting to a VM
If you just completed the NX Connection Wizard described above, the wizard should have
opened the NX Client window. If not, run the ”NX Client”. You will be presented with a
window like this:
Figure 24:
The ”Login” and ”Password” boxes in the NX Client are for user accounts setup on the
VM. By default our image, from which you instantiated your VM, has two preconfigured
users:
Figure 25:
Unless you know what you are doing, we suggest you use the ngstrainee user account
details to initiate an NX connection to your VM. In less than a minute, you should see an
NX Window showing the desktop of your VM:
Figure 26:
NX Connection Failure
In the event that you don’t get the NX Window with your VM’s desktop displaying inside
it, the most common errors are:
• You failed to select the ”ssh” security group when instantiating the VM. You’ll need
to terminate the instance and create a new VM from scratch
• You failed to select ”Unix GNOME” when you configured the NX Client session.
You’ll need to reconfigure the session using the NX Client
• Your institution’s firewall blocks TCP port 22. You may need to request this port to
be opened by your local network team or configure the NX client to use a proxy
server.
Advanced Configuration
In the session configuration, you can configure the size of the NX Window in which the
desktop of the VM is drawn:
Figure 27:
• Have the NX Window occupy the entire screen, without window decorations. This
is often desirable if you wish to ”hide” the host OS from the person sitting at the
computer running the NX Client.
Trainee Handout
https://github.com/nathanhaigh/ngs_workshop/raw/master/trainee_handout.pdf
Trainer Handout
https://github.com/nathanhaigh/ngs_workshop/raw/master/trainer_handout.pdf
While you’re at it, you may also like to change the timezone of your VM to match that of
your own. To do this simply run the following commands as the ubuntu user:
TZ="Australia/Adelaide"
echo "$TZ" | sudo tee /etc/timezone
sudo dpkg-reconfigure --frontend noninteractive tzdata
For further information about what this script does and possible command line arguments,
see the script’s help:
bash NGS_workshop_deployment.sh -h
For further information about setting up the VM for the workshop, please see:
https://github.com/nathanhaigh/ngs_workshop/blob/master/workshop_deployment/
README.md