Module Exercise C

The document describes a bioinformatics pipeline to identify somatic single nucleotide variants (SNVs) from whole exome sequencing (WES) or whole genome sequencing (WGS) data of a cancer patient without a matched normal sample. It involves using Mutect2 in tumor-only mode to call variants, FilterMutectCalls to filter variants, and SelectVariants and SnpSift to further select and annotate SNVs, and visualize results in IGV. The key steps are: 1) Align reads to reference genome with BWA; 2) Call SNVs in tumor-only mode using Mutect2; 3) Filter variants using FilterMutectCalls; 4) Select and annotate SNVs using SelectVariants

Uploaded by

Robert

Identifying single nucleotide variants in a tumor sample without matched normal samples

In this exercise, we will investigate exome sequencing data from a cancer patient for whom NO
MATCHED NORMAL sample is available, with the aim of identifying somatic single nucleotide variants.
We will use the Mutect2 algorithm; its preprint (read) describes its inner workings. Mutect2 (download
and tutorial are available here: link) is installed on the bioinformatics server and is available here:

• /student_resources/apps/

Similar to the previous exercise, we have already aligned the sequencing reads to the human genome
(hg19) and the bam files are available here:

• /student_resources/module5/CellLine

In this exercise, we will also use some of the tools and databases that you are familiar with from the
previous exercise as well as new tools which you will work with for the first time. A flow chart of the
overall bioinformatics pipeline, which we will implement through individual exercises, is shown in
Figure 1.

[Figure 1: flow chart of the pipeline. Raw Reads (FastQ) → BWA (Map to Reference; Visualise in IGV) → Picard AddOrReplaceReadGroups (Adjust Read Groups) → Mutect2 with --germline-resource (Variant Calling) → Filter SNPs, Extract SNPs and Annotate SNPs (FilterMutectCalls, SelectVariants, SnpSift, Funcotator, BedTools)]

Figure 1. Bioinformatics analysis to identify, annotate, and visually inspect SNPs from WES or WGS experiments.
The pipeline is heavily based on GATK (https://gatk.broadinstitute.org/), including Mutect2,
FilterMutectCalls, SelectVariants and Funcotator, as well as SnpSift Filter
(https://pcingola.github.io/SnpEff/) and IGV.
C1. Variant-Calling
1. Take a look at the online manual of mutect2 (read). Can you work out how to run mutect2 in tumour-
only mode?

(ii) Tumor-only mode

This mode runs on a single type of sample, e.g. the tumor or the normal. To create a Panel of Normals
(PoN), call on each normal sample in this mode, then use CreateSomaticPanelOfNormals (read) to
generate the PoN.

gatk Mutect2 \
-R reference.fa \
-I sample.bam \
-O single_sample.vcf.gz

To call mutations on a tumor sample, call in this mode using a PoN and germline resource. After
FilterMutectCalls filtering, consider additional filtering by functional significance with Funcotator
(read).

gatk Mutect2 \
-R reference.fa \
-I sample.bam \
--germline-resource af-only-gnomad.vcf.gz \
--panel-of-normals pon.vcf.gz \
-O single_sample.vcf.gz

2. Execute mutect2 to identify single nucleotide variants in tumour-only mode (note: this will take
some time to complete, so we are providing a VCF file in class).

Mutect2:
java -Xmx4g -jar /nfs/uts_bioinformatics/student_resources/apps/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar Mutect2

Human Genome hg19:
/nfs/uts_bioinformatics/student_resources/index/genomes/hg19.fa

Germline Variants Gnomad hg19:
/nfs/uts_bioinformatics/student_resources/index/dbsnp/af-only-gnomad.raw.sites.hg19.vcf.gz

Result:
/nfs/uts_bioinformatics/student_resources/module5/CellLine/ExomSeq_fixReadGroups.gnomad.vcf
/nfs/uts_bioinformatics/student_resources/module5/CellLine/ExomSeq_fixReadGroups.gnomad.vcf.idx
/nfs/uts_bioinformatics/student_resources/module5/CellLine/ExomSeq_fixReadGroups.gnomad.vcf.stats

3. How many variants did you identify (hint: apply tools that you have learned in the past weeks)?

4. VCF files contain information on variant calls in the field FILTER. How many filters did mutect2
apply to better characterise the quality of its mutation calls (hint: apply tools that you have learned
in the past weeks)?
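
One possible way to approach questions 3 and 4 with standard Linux tools is sketched below. It assumes an uncompressed VCF file. The function names `count_records` and `count_filter_definitions` are illustrative, and the small VCF is made-up demonstration data, not the course result.

```shell
# Count variant records: every line that does not start with '#'.
count_records() {
    grep -vc '^#' "$1"
}

# Count the filters declared in the header (one ##FILTER line per filter).
count_filter_definitions() {
    grep -c '^##FILTER=' "$1"
}

# Made-up demonstration VCF (hypothetical records, not course data).
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '##FILTER=<ID=PASS,Description="All filters passed">\n' >> "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf 'chr1\t100\t.\tA\tG\t.\t.\tDP=60\n' >> "$demo_vcf"
printf 'chr1\t200\t.\tC\tT\t.\t.\tDP=12\n' >> "$demo_vcf"
printf 'chr2\t300\t.\tG\tGA\t.\t.\tDP=45\n' >> "$demo_vcf"

count_records "$demo_vcf"
count_filter_definitions "$demo_vcf"
```

The header count shows which filters are declared in the file. Whether a filter was actually applied to any record is visible only in the FILTER column (column 7) of the records themselves.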
C2. Variant-Filtering
5. Take a look at the online manual of the tool FilterMutectCalls (read) and its technical description
including the ‘best-practices’ for variant filtering (read). Apply FilterMutectCalls to your variant
calls from C1.2.

Filter variants in a Mutect2 VCF callset:

FilterMutectCalls applies filters to the raw output of Mutect2. Parameters are contained in
M2FiltersArgumentCollection and described in
https://github.com/broadinstitute/gatk/tree/master/docs/mutect/mutect.pdf. To filter based on sequence
context artifacts, specify the --orientation-bias-artifact-priors [artifact priors tar.gz file] argument one or
more times. This input is generated by LearnReadOrientationModel.

If given a --contamination-table file, e.g. results from CalculateContamination, the tool will additionally
filter variants due to contamination. This argument may be specified with a table for one or more tumor
samples. Alternatively, provide an estimate of the contamination with the --contamination argument.
FilterMutectCalls can also be given one or more --tumor-segmentation files, which are also output by
CalculateContamination.

This tool is featured in the Somatic Short Mutation calling Best Practice Workflow. See
Tutorial#11136 for a step-by-step description of the workflow and Article#11127 for an overview of
what traditional somatic calling entails. For the latest pipeline scripts, see the Mutect2 WDL scripts
directory.

Usage example:
gatk FilterMutectCalls \
-R reference.fasta \
-V somatic.vcf.gz \
--contamination-table contamination.table \
--tumor-segmentation segments.tsv \
-O filtered.vcf.gz

6. Take a look at the updated VCF file. How many filters did FilterMutectCalls apply to better
characterise the quality of its mutation calls? How many variants have passed the filters
implemented by FilterMutectCalls?
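
A record in the filtered VCF can carry several filter names at once, separated by semicolons, so tallying the FILTER column needs one extra splitting step. The following is a minimal sketch under that assumption. The function names are illustrative, and the demonstration records and their filter labels are made up.

```shell
# Count records whose FILTER column (column 7) is exactly PASS.
count_pass() {
    awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }' "$1"
}

# Tally each filter name separately; a record can carry several,
# separated by ';' (for example "germline;weak_evidence").
tally_filters() {
    grep -v '^#' "$1" | cut -f7 | tr ';' '\n' | sort | uniq -c | sort -rn
}

# Made-up demonstration VCF (hypothetical records, not course data).
demo_vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$demo_vcf"
printf 'chr1\t100\t.\tA\tG\t.\tPASS\tDP=60\n' >> "$demo_vcf"
printf 'chr1\t200\t.\tC\tT\t.\tgermline\tDP=12\n' >> "$demo_vcf"
printf 'chr2\t300\t.\tG\tA\t.\tgermline;weak_evidence\tDP=45\n' >> "$demo_vcf"
printf 'chr3\t400\t.\tT\tC\t.\tPASS\tDP=80\n' >> "$demo_vcf"

count_pass "$demo_vcf"
tally_filters "$demo_vcf"
```

The number of distinct lines printed by `tally_filters`, leaving out PASS, is the number of filters that were actually assigned to at least one record.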

7. Take a look at the online manual of the tool SelectVariants (read). Apply SelectVariants to your
variant calls to remove all variants that (i) have not been marked PASS by the tool
FilterMutectCalls and (ii) are not single nucleotide polymorphisms. How many variants can you
find for (i) and (ii)?

Select a subset of variants from a VCF file: This tool makes it possible to select a subset of variants
based on various criteria in order to facilitate certain analyses. Examples of such analyses include
comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain
requirements, or troubleshooting some unexpected results, to name a few.

There are many different options for selecting subsets of variants from a larger callset:

• Extract one or more samples from a callset based on either a complete sample name or a pattern
match.
• Specify criteria for inclusion that place thresholds on annotation values, e.g. "DP > 1000" (depth
of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These
criteria are written as "JEXL expressions", which are documented in the article about using
JEXL expressions.
• Provide concordance or discordance tracks in order to include or exclude variants that are also
present in other given callsets.
• Select variants based on criteria like their type (e.g. INDELs only), evidence of mendelian
violation, filtering status, allelicity, etc.
There are also several options for recording the original values of certain annotations which are
recalculated when one subsets the new callset, trims alleles, etc.

Input: A variant call set in VCF format from which a subset can be selected.

Output: A new VCF file containing the selected subset of variants.

Example Parameters:

Argument (short name)                          Default   Description
--select-type-to-exclude (-xl-select-type)     []        Do not select certain type of variants from the input file
--select-type-to-include (-select-type)        []        Select only a certain type of variants from the input file
--selectExpressions (-select)                  []        One or more criteria to use when selecting the data
--exclude-filtered                             false     Don't include filtered sites
--exclude-non-variants                         false     Don't include non-variant sites

Usage examples:
• SNPs:

gatk SelectVariants \
-R Homo_sapiens_assembly38.fasta \
-V input.vcf \
--select-type-to-include SNP \
-O output.vcf

• Chromosome 20 Variants from a GenomicsDB:

gatk SelectVariants \
-R Homo_sapiens_assembly38.fasta \
-V gendb://genomicsDB \
-L 20 \
-O output.chr20.vcf

8. SelectVariants currently allows filtering by the VCF field “FILTER”. SnpSift is an easy to use
alternative that extends the functionality of SelectVariants and allows the filtering by the VCF fields
“FILTER” and “GENOTYPE”. Take a look at the online manual of the tool SnpSift (read).

SnpSift filter: SnpSift filter is one of the most useful SnpSift commands. Using SnpSift filter you can filter VCF
files using arbitrary expressions, for instance "(QUAL > 30) | (exists INDEL) | ( countHet() > 2 )". The actual
expressions can be quite complex, so it allows for a lot of flexibility.

Typical usage

• I want to filter out samples with quality less than 30:


cat variants.vcf | java -jar SnpSift.jar filter " ( QUAL >= 30 )" > filtered.vcf

• …but we also want InDels that have quality 20 or more:


cat variants.vcf | java -jar SnpSift.jar filter "(( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )" > filtered.vcf

9. Apply SnpSift to your variant calls and keep only variants that (i) are single nucleotide
polymorphisms (i.e. as selected by the tool SelectVariants), (ii) have been marked PASS by the
tool FilterMutectCalls, and (iii) have an allele frequency of at least 30% and a sequencing depth of
at least 50 reads. How many variants passed all three filters?

SnpSift:
java -Xmx4G -jar /nfs/uts_bioinformatics/student_resources/apps/snpEff/SnpSift.jar

(hint: SnpSift refers to ‘features’ in the FILTER column using FILTER and to features in the
GENOTYPE column using GEN[N], where N is the index of the genotype (sample), starting at 0. The
following is an example of the SnpSift syntax: "( (FILTER = 'PASS') & (GEN[0].AF >= 0.5) & (GEN[0].DP >= 100))").

C3. Variant-Annotation by function


10. Take a look at the online manual of the tool Funcotator (read) and apply the tool to annotate your
filtered variants with their potential function.

Funcotator (FUNCtional annOTATOR): Analyzes given variants for their function (as retrieved from a set of
data sources) and produces the analysis in a specified output file. This tool is a functional annotation tool that
allows a user to add annotations to called variants based on a set of data sources, each with its own matching
criteria.

Required Inputs

• A reference genome sequence.
• The version of the reference genome sequence being used (e.g. hg19, hg38, etc.).
• A VCF of variant calls to annotate.
• The path to a folder of data sources formatted for use by Funcotator.
• The desired output format for the annotated variants file (either MAF or VCF).

Output: The basic output of Funcotator is:

• A VCF or MAF file containing all variants from the input file with added annotations corresponding to
annotations from each data source that matched a given variant according to that data source's matching
criteria.

Usage example:

gatk Funcotator \
--data-sources-path file \
--ref-version version \
--output-file-format VCF \
-R reference.fasta \
-V input.vcf \
-O output.vcf

Funcotator:
java -Xmx64g -jar /nfs/uts_bioinformatics/student_resources/apps/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar

Data-sources-path:
/nfs/uts_bioinformatics/student_resources/index/funcotator_dataSources.v1.6.20190124g/
11. Take a look at the output from Funcotator. What categories of functional mutations were identified
by your analysis?

12. Apply a Linux tool to identify the number of MISSENSE and NONSENSE mutations that were
identified by your analysis.

13. Apply a Linux tool to identify the names of all genes that have a MISSENSE mutation. How many
individual genes are potentially affected?
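
For questions 12 and 13, one possible approach is sketched below. It assumes Funcotator VCF output in which each record's INFO column contains a `FUNCOTATION=[...]` entry of pipe-delimited fields, with the gene symbol first and the variant classification as its own field; check the `##INFO=<ID=FUNCOTATION` header line of your file for the actual field order. The function names are illustrative and the annotation strings are shortened, made-up examples.

```shell
# Count records carrying a given variant classification, e.g. MISSENSE.
count_class() {  # usage: count_class annotated.vcf CLASS
    grep -v '^#' "$1" | grep -c "|$2|"
}

# List the unique gene symbols (assumed to be the first FUNCOTATION field)
# of records carrying the given classification.
genes_with_class() {  # usage: genes_with_class annotated.vcf CLASS
    grep -v '^#' "$1" | grep "|$2|" |
        sed 's/.*FUNCOTATION=\[\([^|]*\)|.*/\1/' | sort -u
}

# Made-up, shortened annotations (hypothetical, not course data).
demo_vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$demo_vcf"
printf 'chr1\t100\t.\tT\tA\t.\tPASS\tDP=60;FUNCOTATION=[NRAS|hg19|chr1|100|100|MISSENSE||SNP]\n' >> "$demo_vcf"
printf 'chr1\t150\t.\tC\tT\t.\tPASS\tDP=70;FUNCOTATION=[NRAS|hg19|chr1|150|150|MISSENSE||SNP]\n' >> "$demo_vcf"
printf 'chr7\t200\t.\tA\tT\t.\tPASS\tDP=90;FUNCOTATION=[BRAF|hg19|chr7|200|200|MISSENSE||SNP]\n' >> "$demo_vcf"
printf 'chr17\t300\t.\tG\tA\t.\tPASS\tDP=55;FUNCOTATION=[TP53|hg19|chr17|300|300|NONSENSE||SNP]\n' >> "$demo_vcf"

count_class "$demo_vcf" MISSENSE
count_class "$demo_vcf" NONSENSE
genes_with_class "$demo_vcf" MISSENSE
genes_with_class "$demo_vcf" MISSENSE | wc -l
```

Because `grep -c` counts lines, a record is counted once even if the classification string appears in more than one of its fields. Matching on the whole line is a simplification: a stricter version would compare only the classification field.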

14. The patient has a reported MISSENSE mutation in the NRAS gene. Can you modify the filtering
parameters used in exercise 9 to ensure the mutation in NRAS is identified? Re-run steps 9 and 11
using the new set of parameters and use this file for the next exercises.

#CHROM             chr1
POS                115256528
ID                 .
REF                T
ALT                A
QUAL               .
FILTER             PASS
INFO               CONTQ=93;DP=18;ECNT=1;GERMQ=25;MBQ=41,41;MFRL=289,274;MMQ=60,60;MPOS=61;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=25.30
FORMAT             GT:AD:AF:DP:F1R2:F2R1:SB
##tumor_sample=20  0/1:10,7:0.421:17:3,4:7,3:7,3,3,4
Table 1. VCF entry
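
The values that SnpSift addresses as GEN[0].AF and GEN[0].DP sit in the sample column and are labelled by the FORMAT column. The sketch below pairs the two columns with awk and prints the allele frequency and depth of the first sample, using the entry from Table 1 written as a single tab-separated record. The function name `sample_af_dp` is illustrative, and for multi-allelic sites the AF value would be a comma-separated list, which this sketch prints unchanged.

```shell
# Print CHROM, POS, AF and DP of the first sample for each record.
sample_af_dp() {
    awk -F'\t' '!/^#/ {
        n = split($9, key, ":")     # FORMAT keys, e.g. GT:AD:AF:DP
        split($10, val, ":")        # matching values of the first sample
        af = "NA"; dp = "NA"
        for (i = 1; i <= n; i++) {
            if (key[i] == "AF") af = val[i]
            if (key[i] == "DP") dp = val[i]
        }
        print $1, $2, af, dp
    }' "$1"
}

# The Table 1 entry as one tab-separated VCF record.
nras=$(mktemp)
printf 'chr1\t115256528\t.\tT\tA\t.\tPASS\t' > "$nras"
printf 'CONTQ=93;DP=18;ECNT=1;GERMQ=25;MBQ=41,41;MFRL=289,274;MMQ=60,60;MPOS=61;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=25.30\t' >> "$nras"
printf 'GT:AD:AF:DP:F1R2:F2R1:SB\t0/1:10,7:0.421:17:3,4:7,3:7,3,3,4\n' >> "$nras"

sample_af_dp "$nras"    # prints: chr1 115256528 0.421 17
```

Comparing the printed values with the thresholds used in exercise 9 shows which part of the filter expression needs to change.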

C5. Variant-Annotation by clinical phenotype


15. In a previous analysis of the patient’s exome, we identified 289 SNPs; a bed file of their locations
is available from /nfs/uts_bioinformatics/student_resources/module5/CellLine/SNPknown.bed.
How many of these “known” SNPs and how many “novel” SNPs did your analysis identify?
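
BedTools (see Figure 1) is the usual route for this comparison: `bedtools intersect` with `-a` set to your VCF and `-b` set to the bed file reports overlapping records with `-u` and non-overlapping records with `-v`. The coordinate logic behind that overlap test can be sketched with awk alone. BED intervals are 0-based and half-open while VCF positions are 1-based, so a SNP at position POS lies inside a BED interval when start < POS <= end. The function name `split_known_novel` and the demonstration data are made up, and the sketch assumes the bed file has no header lines.

```shell
# Classify each VCF record as known (inside a BED interval) or novel.
split_known_novel() {  # usage: split_known_novel calls.vcf known.bed
    awk -F'\t' '
        # First file (BED): store every interval.
        NR == FNR { n++; chrom[n] = $1; start[n] = $2 + 0; stop[n] = $3 + 0; next }
        # Second file (VCF): skip header lines.
        /^#/ { next }
        {
            hit = 0
            for (i = 1; i <= n; i++)
                if (chrom[i] == $1 && ($2 + 0) > start[i] && ($2 + 0) <= stop[i]) {
                    hit = 1
                    break
                }
            if (hit) known++; else novel++
        }
        END { printf "known=%d novel=%d\n", known, novel }
    ' "$2" "$1"
}

# Made-up demonstration data (hypothetical positions, not course data).
demo_vcf=$(mktemp)
demo_bed=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$demo_vcf"
printf 'chr1\t100\t.\tA\tG\t.\tPASS\tDP=60\n' >> "$demo_vcf"
printf 'chr1\t200\t.\tC\tT\t.\tPASS\tDP=70\n' >> "$demo_vcf"
printf 'chr2\t50\t.\tG\tA\t.\tPASS\tDP=80\n' >> "$demo_vcf"
printf 'chr1\t99\t100\nchr2\t10\t20\n' > "$demo_bed"

split_known_novel "$demo_vcf" "$demo_bed"
```

The same overlap test applies to the ClinVar bed file in the next question, with the known and novel SNP sets as the two inputs.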

16. The ClinVar database maintains a list of genome variations and observed phenotypes, i.e. the health
status associated with the observed phenotype, as well as the history of that interpretation (read). A
bed file of this database is available from
/nfs/uts_bioinformatics/student_resources/index/clinvar/human-variant-annotation_clinVar-vcf_GRCh37_clinvar_20190815.bed.
How many of the “known” and how many of the “novel” SNPs are associated with a disease?

17. Is the known SNP in NRAS associated with Acute Myeloid Leukemia?
