Module Exercise C
In this exercise, we will investigate exome sequencing data from a cancer patient for whom NO
MATCHED NORMAL sample is available, with the aim of identifying somatic single nucleotide variants.
We will use the Mutect2 algorithm; its inner workings are described in its preprint (read).
We have installed Mutect2 (download and tutorial are available here: link) on the bioinformatics
server, and the algorithm is available here:
• /student_resources/apps/
Similar to the previous exercise, we have already aligned the sequencing reads to the human genome
(hg19) and the bam files are available here:
• /student_resources/module5/CellLine
In this exercise, we will also use some of the tools and databases that you are familiar with from the
previous exercise as well as new tools which you will work with for the first time. A flow chart of the
overall bioinformatics pipeline, which we will implement through individual exercises, is shown in
Figure 1.
[Figure 1: flow chart of the pipeline. Recoverable labels: Mutect2, -germline-resource, Variant Calling]
Bioinformatics analysis to identify, annotate, and visually inspect SNPs from WES or WGS experiments.
The pipeline is heavily based on GATK (https://gatk.broadinstitute.org/), including Mutect2,
FilterMutectCalls, SelectVariants and Funcotator, together with SnpSift Filter
(https://pcingola.github.io/SnpEff/) and IGV.
C1. Variant-Calling
1. Take a look at the online manual of Mutect2 (read). Can you work out how to run Mutect2 in
tumour-only mode?
This mode runs on a single type of sample, e.g. the tumor or the normal. To create a Panel of Normals
(PoN), call on each normal sample in this mode, then use CreateSomaticPanelOfNormals (read) to
generate the PoN.
gatk Mutect2 \
-R reference.fa \
-I sample.bam \
-O single_sample.vcf.gz
To call mutations on a tumor sample, call in this mode using a PoN and germline resource. After
FilterMutectCalls filtering, consider additional filtering by functional significance with Funcotator
(read).
gatk Mutect2 \
-R reference.fa \
-I sample.bam \
--germline-resource af-only-gnomad.vcf.gz \
--panel-of-normals pon.vcf.gz \
-O single_sample.vcf.gz
2. Execute Mutect2 to identify single nucleotide variants in tumour-only mode (note: this will take
some time to complete, so we are providing a VCF file in class).
Mutect2:
java -Xmx4g -jar /nfs/uts_bioinformatics/student_resources/apps/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar Mutect2
Result:
/nfs/uts_bioinformatics/student_resources/module5/CellLine/ExomSeq_fixReadGroups.gnomad.vcf
/nfs/uts_bioinformatics/student_resources/module5/CellLine/ExomSeq_fixReadGroups.gnomad.vcf.idx
/nfs/uts_bioinformatics/student_resources/module5/CellLine/ExomSeq_fixReadGroups.gnomad.vcf.stats
3. How many variants did you identify (hint: apply tools that you have learned in the past weeks)?
4. VCF files contain information on variant calls in the field FILTER. How many filters did Mutect2
apply to better characterise the quality of its mutation calls (hint: apply tools that you have learned
in the past weeks)?
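For questions 3 and 4, standard text tools are usually enough, because a VCF file is tab-separated text whose header lines start with '#'. The following is a minimal sketch, assuming an uncompressed VCF; the function names are hypothetical helpers and not part of GATK.

```shell
# Count variant records: every line that is not a header line.
count_variants() {
    grep -vc '^#' "$1"
}

# List the filter IDs declared in the header (one ##FILTER=<ID=...> line each).
declared_filters() {
    grep '^##FILTER=' "$1" | sed 's/^##FILTER=<ID=\([^,>]*\).*/\1/'
}

# Tally the values actually used in the FILTER column (column 7).
# A record can carry several filters separated by ';', so split them first.
filter_tally() {
    grep -v '^#' "$1" | cut -f7 | tr ';' '\n' | sort | uniq -c | sort -rn
}

# Example call (file from the Result list of question 2):
# count_variants /nfs/uts_bioinformatics/student_resources/module5/CellLine/ExomSeq_fixReadGroups.gnomad.vcf
```

The header and the FILTER column answer slightly different questions: the header lists every filter the tool can report, while the tally shows which ones were actually assigned to records in this file.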
C2. Variant-Filtering
5. Take a look at the online manual of the tool FilterMutectCalls (read) and its technical description
including the ‘best-practices’ for variant filtering (read). Apply FilterMutectCalls to your variant
calls from C1.2.
FilterMutectCalls applies filters to the raw output of Mutect2. Parameters are contained in
M2FiltersArgumentCollection and described in
https://github.com/broadinstitute/gatk/tree/master/docs/mutect/mutect.pdf. To filter based on sequence
context artifacts, specify the --orientation-bias-artifact-priors [artifact priors tar.gz file] argument one or
more times. This input is generated by LearnReadOrientationModel.
If given a --contamination-table file, e.g. results from CalculateContamination, the tool will additionally
filter variants due to contamination. This argument may be specified with a table for one or more tumor
samples. Alternatively, provide an estimate of the contamination with the --contamination argument.
FilterMutectCalls can also be given one or more --tumor-segmentation files, which are also output by
CalculateContamination.
This tool is featured in the Somatic Short Mutation calling Best Practice Workflow. See
Tutorial#11136 for a step-by-step description of the workflow and Article#11127 for an overview of
what traditional somatic calling entails. For the latest pipeline scripts, see the Mutect2 WDL scripts
directory.
Usage example:
gatk FilterMutectCalls \
-R reference.fasta \
-V somatic.vcf.gz \
--contamination-table contamination.table \
--tumor-segmentation segments.tsv \
-O filtered.vcf.gz
6. Take a look at the updated VCF file. How many filters did FilterMutectCalls apply to better
characterise the quality of its mutation calls? How many variants have passed the filters
implemented by FilterMutectCalls?
7. Take a look at the online manual of the tool SelectVariants (read). Apply SelectVariants to your
variant calls to remove all variants that (i) have not been marked PASS by the tool
FilterMutectCalls and (ii) are not single nucleotide polymorphisms. How many variants do you
find for (i) and (ii)?
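As a cross-check that does not depend on GATK, the PASS and SNP counts asked for in questions 6 and 7 can be approximated with awk. This is a sketch under two assumptions: the VCF is uncompressed, and a SNP is taken to be a record whose REF and every ALT allele are a single base, which may differ from how SelectVariants classifies edge cases. The function names are hypothetical.

```shell
# Count records whose FILTER column (column 7) is exactly PASS.
count_pass() {
    awk -F '\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }' "$1"
}

# Count PASS records that look like SNPs:
# REF is one base and every comma-separated ALT allele is one base.
count_pass_snps() {
    awk -F '\t' '
        !/^#/ && $7 == "PASS" && $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT](,[ACGT])*$/ { n++ }
        END { print n + 0 }
    ' "$1"
}

# Example calls (filtered.vcf is a placeholder for your FilterMutectCalls output):
# count_pass filtered.vcf
# count_pass_snps filtered.vcf
```

If these numbers disagree with the record counts of the files written by SelectVariants, multi-allelic or symbolic alleles are a likely reason and worth inspecting by eye.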
Select a subset of variants from a VCF file: This tool makes it possible to select a subset of variants
based on various criteria in order to facilitate certain analyses. Examples of such analyses include
comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain
requirements, or troubleshooting some unexpected results, to name a few.
There are many different options for selecting subsets of variants from a larger callset:
• Extract one or more samples from a callset based on either a complete sample name or a pattern
match.
• Specify criteria for inclusion that place thresholds on annotation values, e.g. "DP > 1000" (depth
of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These
criteria are written as "JEXL expressions", which are documented in the article about using
JEXL expressions.
• Provide concordance or discordance tracks in order to include or exclude variants that are also
present in other given callsets.
• Select variants based on criteria like their type (e.g. INDELs only), evidence of mendelian
violation, filtering status, allelicity, etc.
There are also several options for recording the original values of certain annotations which are
recalculated when one subsets the new callset, trims alleles, etc.
Input: A variant call set in VCF format from which a subset can be selected.
Example parameters (long name, short name, default value, summary):
• --select-type-to-exclude, -xl-select-type, []: Do not select certain type of variants from the input file
• --select-type-to-include, -select-type, []: Select only a certain type of variants from the input file
• --selectExpressions, -select, []: One or more criteria to use when selecting the data
Usage examples:
• SNPs:
gatk SelectVariants \
-R Homo_sapiens_assembly38.fasta \
-V input.vcf \
--select-type-to-include SNP \
-O output.vcf
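Both criteria from question 7 can be combined in one call. The sketch below uses the jar installed on the server and the SelectVariants option --exclude-filtered, which drops records whose FILTER value is neither PASS nor '.'. The function name, the DRY_RUN switch and the file names in the usage line are hypothetical; the reference must be the hg19 FASTA the reads were aligned to, whose location is not given here.

```shell
# Jar path as given for Mutect2 in question 2.
GATK_JAR=/nfs/uts_bioinformatics/student_resources/apps/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar

# Keep only unfiltered (PASS) SNP records.
# Usage: select_pass_snps reference.fasta filtered.vcf pass_snps.vcf
# With DRY_RUN=1 (the default in this sketch) the command is only printed;
# set DRY_RUN=0 to execute it.
select_pass_snps() {
    ref=$1; in_vcf=$2; out_vcf=$3
    set -- java -Xmx4g -jar "$GATK_JAR" SelectVariants \
        -R "$ref" \
        -V "$in_vcf" \
        --exclude-filtered \
        --select-type-to-include SNP \
        -O "$out_vcf"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Running the two selections as separate calls instead makes it easier to report the counts for (i) and (ii) individually.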
8. SelectVariants currently allows filtering by the VCF field “FILTER”. SnpSift is an easy-to-use
alternative that extends the functionality of SelectVariants and allows filtering by the VCF fields
“FILTER” and “GENOTYPE”. Take a look at the online manual of the tool SnpSift (read).
SnpSift filter: SnpSift filter is one of the most useful SnpSift commands. Using SnpSift filter you can filter VCF
files using arbitrary expressions, for instance "(QUAL > 30) | (exists INDEL) | ( countHet() > 2 )". The actual
expressions can be quite complex, so it allows for a lot of flexibility.
Typical usage
SnpSift:
java -Xmx4G -jar /nfs/uts_bioinformatics/student_resources/apps/snpEff/SnpSift.jar
(hint: SnpSift refers to ‘features’ in the FILTER column using FILTER and to features in the
GENOTYPE column with GEN[N], where N is the index of the genotype, starting at 0. This is an
example of the SnpSift syntax: "( (FILTER = 'PASS') & (GEN[0].AF >= 0.5) & (GEN[0].DP >= 100))").
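Put together, a SnpSift filter call takes the expression first and the input VCF second, and writes the filtered VCF to standard output. The wrapper below is a sketch: the function name, the DRY_RUN switch and the file names are hypothetical, and the thresholds in the usage line are simply those of the hint, not recommended values.

```shell
SNPSIFT_JAR=/nfs/uts_bioinformatics/student_resources/apps/snpEff/SnpSift.jar

# Usage: snpsift_filter 'expression' input.vcf output.vcf
# With DRY_RUN=1 (the default in this sketch) the command is only printed;
# set DRY_RUN=0 to execute it.
snpsift_filter() {
    expression=$1; in_vcf=$2; out_vcf=$3
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "java -Xmx4G -jar $SNPSIFT_JAR filter \"$expression\" $in_vcf > $out_vcf"
    else
        java -Xmx4G -jar "$SNPSIFT_JAR" filter "$expression" "$in_vcf" > "$out_vcf"
    fi
}

# Example call with the expression from the hint:
# snpsift_filter "( (FILTER = 'PASS') & (GEN[0].AF >= 0.5) & (GEN[0].DP >= 100) )" filtered.vcf selected.vcf
```

Quoting the whole expression in double quotes keeps the shell from interpreting the parentheses, the ampersands and the square brackets.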
Funcotator (FUNCtional annOTATOR): Analyzes given variants for their function (as retrieved from a set of
data sources) and produces the analysis in a specified output file. This tool is a functional annotation tool that
allows a user to add annotations to called variants based on a set of data sources, each with its own matching
criteria.
Output
• A VCF or MAF file containing all variants from the input file with added annotations corresponding to
annotations from each data source that matched a given variant according to that data source's matching
criteria.
Usage example:
gatk Funcotator \
--data-sources-path file \
--ref-version version \
--output-file-format VCF \
-R reference.fasta \
-V input.vcf \
-O output.vcf
Funcotator:
java -Xmx64g -jar /nfs/uts_bioinformatics/student_resources/apps/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
Data-sources-path:
/nfs/uts_bioinformatics/student_resources/index/funcotator_dataSources.v1.6.20190124g/
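The jar and data-sources path above can be assembled into a complete call as follows. This is a sketch: hg19 is used for --ref-version because the reads were aligned to hg19, while the function name, the DRY_RUN switch and the file names in the usage line are hypothetical, and the reference FASTA location has to be supplied by you.

```shell
GATK_JAR=/nfs/uts_bioinformatics/student_resources/apps/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
DATA_SOURCES=/nfs/uts_bioinformatics/student_resources/index/funcotator_dataSources.v1.6.20190124g/

# Annotate a VCF and write the result as VCF.
# Usage: funcotate reference.fasta selected.vcf annotated.vcf
# With DRY_RUN=1 (the default in this sketch) the command is only printed;
# set DRY_RUN=0 to execute it.
funcotate() {
    ref=$1; in_vcf=$2; out_vcf=$3
    set -- java -Xmx64g -jar "$GATK_JAR" Funcotator \
        --data-sources-path "$DATA_SOURCES" \
        --ref-version hg19 \
        --output-file-format VCF \
        -R "$ref" \
        -V "$in_vcf" \
        -O "$out_vcf"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Choosing VCF as the output format keeps the annotations in the INFO column, so the same text tools used on the earlier VCF files still apply.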
11. Take a look at the output from Funcotator. What categories of functional mutations were identified
by your analysis?
12. Apply a Linux tool to count the MISSENSE and NONSENSE mutations that were
identified by your analysis.
13. Apply a Linux tool to list the names of all genes that have a MISSENSE mutation. How many
individual genes are potentially affected?
14. The patient has a reported MISSENSE mutation in the NRAS gene. Can you modify the filtering
parameters used in exercise 9 to ensure the mutation in NRAS is identified? Re-run steps 9 and 11
using the new set of parameters and use the resulting file for the next exercises.
#CHROM chr1
POS 115256528
ID .
REF T
ALT A
QUAL .
FILTER PASS
INFO CONTQ=93;DP=18;ECNT=1;GERMQ=25;MBQ=41,41;MFRL=289,274;MMQ=60,60;MPOS=61;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=25.30
FORMAT GT:AD:AF:DP:F1R2:F2R1:SB
##tumor_sample=20 0/1:10,7:0.421:17:3,4:7,3:7,3,3,4
Table 1. VCF entry
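For questions 12 and 13, grep and sed can pull the classification and gene name out of the Funcotator annotations. The sketch below rests on an assumption that should be checked against the ##INFO=<ID=FUNCOTATION line in your file's header: that each record carries a FUNCOTATION=[...] entry in the INFO column whose first '|'-separated field is the gene symbol. The function names are hypothetical.

```shell
# Count MISSENSE and NONSENSE annotations in the variant records.
# -w keeps whole-word matches only, so strings in which the keyword is
# joined to other text by underscores are not counted.
count_classes() {
    grep -v '^#' "$1" | grep -o -w -E 'MISSENSE|NONSENSE' | sort | uniq -c
}

# List the distinct gene symbols of records annotated as MISSENSE.
# ASSUMPTION: the gene symbol is the first '|'-separated field after "FUNCOTATION=[".
missense_genes() {
    grep -v '^#' "$1" | grep -w 'MISSENSE' \
        | sed -n 's/.*FUNCOTATION=\[\([^|]*\)|.*/\1/p' | sort -u
}

# Example calls (annotated.vcf is a placeholder for your Funcotator output):
# count_classes annotated.vcf
# missense_genes annotated.vcf | wc -l
```

Piping the gene list through wc -l gives the number of individual genes, because sort -u has already removed repeated names.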
16. The ClinVar database maintains a list of genome variations and observed phenotypes, i.e. the health
status associated with the observed phenotype, as well as the history of that interpretation (read). A
BED file of this database is available from
/nfs/uts_bioinformatics/student_resources/index/clinvar/human-variant-annotation_clinVar-vcf_GRCh37_clinvar_20190815.bed.
How many of these “known” and how many “novel” SNPs are associated with a disease?
17. Is the known SNP in NRAS associated with Acute Myeloid Leukemia?