Assignment I
In the past weeks, you have learned how to use Linux commands and Bioinformatics tools to analyse
sequence variation and to quantify gene expression from datasets generated by next-generation
sequencing protocols. In this assignment, your team is tasked with (i) the development of a
bioinformatics pipeline for the analysis of Exome-Seq and RNA-seq datasets and (ii) the application of
this pipeline to interrogate a leukaemia patient sample to confirm known and to identify novel
sequence variants for potential clinical follow-up.
(1) DNA was captured using the Agilent SureSelect Human All Exon Probes and sequencing
was performed using an Illumina HiSeq X Ten machine. The dataset was aligned to the human
genome (hg19) using the tool bowtie2 and the output is available as:
• ExomSeq.bam
• ExomSeq.bamStats
Note that, due to the disease status, a matched normal tissue sample is not available for this
patient. Therefore, we propose to apply the Mutect2 pipeline learned in the course.
(2) Genome sequence variants were previously identified for the patient. These are provided as an
Excel table:
• KnownVariants.xlsx
(3) RNA was extracted with the miRNeasy Mini Kit and sequencing was performed using an
Illumina HiSeq2500 machine. The dataset was aligned to the human genome (hg19) using the
tool bowtie2 and the output is available as:
• RNASeq.bam
• RNASeq.bamStats
This setup is very typical for a real-world project that you might encounter as a bioinformatics
specialist in a research or a diagnostic laboratory. Please find a list of aims to guide your analysis below:
(A) Investigate and analyse all datasets. Special consideration should be given to:
5) Interpretation of results?
Of note, these datasets have not been analysed previously and might lead to useful results that are of
interest to our research group.
Useful Hacks
1. DNA sequence alignment (option mem)
• http://bio-bwa.sourceforge.net/bwa.shtml
• https://software.broadinstitute.org/gatk/documentation/tooldocs/4.beta.4/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php
• https://software.broadinstitute.org/gatk/blog?id=11337
• https://gatkforums.broadinstitute.org/gatk/discussion/24057/how-to-call-somatic-mutations-using-gatk4-mutect2#latest
/home/student_resources/index/af-only-gnomad.raw.sites.hg19.vcf.gz
/home/student_resources/index/af-only-gnomad.raw.sites.hg19.vcf.idx
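Since no matched normal is available, the Mutect2 step would be run on the tumour BAM alone, with the gnomAD file above as germline resource. The sketch below is one possible way to set this up, not the course's reference solution. The reference FASTA path (`hg19.fa`), the output file names and the `run` helper are assumptions for illustration. Depending on the GATK release, further arguments may be required (older 4.0.x versions, for example, expect a tumour sample name), so check the tool documentation for your installed version. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: tumour-only somatic variant calling with GATK4 Mutect2.
# ASSUMPTIONS (hypothetical): REF path and output file names; adjust to your setup.
REF="${REF:-hg19.fa}"        # hypothetical reference FASTA (needs .fai and .dict)
BAM="${BAM:-ExomSeq.bam}"    # must be coordinate-sorted, indexed and carry read groups
GNOMAD="/home/student_resources/index/af-only-gnomad.raw.sites.hg19.vcf.gz"
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print commands, 0 = execute them

# Print a command; run it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

call_variants() {
    # Only the tumour BAM is passed with -I because no matched normal exists.
    run gatk Mutect2 -R "$REF" -I "$BAM" \
        --germline-resource "$GNOMAD" \
        -O ExomSeq.unfiltered.vcf.gz
    # Mark likely artefacts; calls that survive get FILTER=PASS.
    run gatk FilterMutectCalls -R "$REF" \
        -V ExomSeq.unfiltered.vcf.gz \
        -O ExomSeq.filtered.vcf.gz
}

call_variants
```

Reviewing the printed commands before switching to `DRY_RUN=0` is cheap insurance for a job that is expected to run for several hours.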
2.3 Runtime:
• Expect this to run for ~8 hrs (see Linux hacks)
3. Variant Annotation:
3.1 Variant Effects
• McLaren et al, The Ensembl Variant Effect Predictor, Genome Biology 2016; 17(122)
• https://asia.ensembl.org/info/docs/tools/vep/index.html
• https://www.ncbi.nlm.nih.gov/variation/docs/ClinVar_vcf_files/
/home/student_resources/index/clinvar_20190902.vcf
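A basic VEP call on a VCF of variant calls could look like the sketch below. It assumes that a local VEP cache for GRCh37 is installed; the input and output names (`variants.vcf`, `variants.vep.vcf`) and the `run` helper are hypothetical. ClinVar annotations can be attached through VEP's custom annotation option (`--custom`); its exact syntax depends on the VEP release, so it is left out here and should be taken from the documentation linked above. By default the command is only printed; set `DRY_RUN=0` to execute it.

```shell
#!/usr/bin/env bash
# Sketch: annotate variant calls with the Ensembl Variant Effect Predictor.
# ASSUMPTIONS (hypothetical): file names below and an installed GRCh37 cache.
VCF_IN="${VCF_IN:-variants.vcf}"
VCF_OUT="${VCF_OUT:-variants.vep.vcf}"
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print commands, 0 = execute them

# Print a command; run it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

annotate_variants() {
    # --vcf writes VCF output; --cache/--offline use the local cache only.
    run vep -i "$VCF_IN" -o "$VCF_OUT" \
        --vcf --cache --offline --assembly GRCh37 --force_overwrite
}

annotate_variants
```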
4. Other Tools
4.1 Picard
• https://broadinstitute.github.io/picard/
• https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/picard_sam_AddOrReplaceReadGroups.php
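GATK tools such as Mutect2 expect every read to belong to a read group that carries a sample name, and a BAM produced by an aligner may lack these unless they were requested at alignment time. The sketch below first inspects the header and then shows how AddOrReplaceReadGroups could be invoked through the GATK4 wrapper. All read-group values (`patient1`, `lib1`, `unit1`), the output name and the `run` helper are hypothetical placeholders. By default the command is only printed; set `DRY_RUN=0` to execute it.

```shell
#!/usr/bin/env bash
# Sketch: make sure the BAM carries read groups before variant calling.
# ASSUMPTION: all read-group values below are hypothetical placeholders.
IN_BAM="${IN_BAM:-ExomSeq.bam}"
OUT_BAM="${OUT_BAM:-ExomSeq.rg.bam}"
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print commands, 0 = execute them

# Print a command; run it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

add_read_groups() {
    run gatk AddOrReplaceReadGroups \
        -I "$IN_BAM" -O "$OUT_BAM" \
        --RGID patient1 --RGLB lib1 --RGPL ILLUMINA \
        --RGPU unit1 --RGSM patient1 \
        --SORT_ORDER coordinate --CREATE_INDEX true
}

# Inspect the header first: no @RG line means read groups are missing.
# This check is skipped when the BAM or samtools is not available.
if [ -f "$IN_BAM" ] && command -v samtools >/dev/null 2>&1; then
    samtools view -H "$IN_BAM" | grep '^@RG' || echo "no @RG lines found"
fi

add_read_groups
```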
5. Linux
5.1 Execute code in the background: nohup [CODE] &
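As a small illustration of the background pattern, the sketch below starts a stand-in command with `nohup`, sends its output to a log file and records the process ID. The command `sleep 1; echo finished` and the log name `pipeline.log` are placeholders for a real pipeline step.

```shell
#!/usr/bin/env bash
# Sketch: start a long-running command in the background with nohup.
# 'sleep 1; echo finished' is a stand-in for a real pipeline step.
LOG="pipeline.log"

# stdout and stderr go to the log; stdin is detached from the terminal.
nohup sh -c 'sleep 1; echo finished' > "$LOG" 2>&1 < /dev/null &
JOB_PID=$!
echo "started background job with PID $JOB_PID"

# In a real session you could log out now; here we wait so the log can be shown.
wait "$JOB_PID"
cat "$LOG"
```

Redirecting the output explicitly keeps each job's messages in its own file instead of a shared `nohup.out`, which helps when several steps run in parallel.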