Fundamentals of Bioinformatics Project Manual 2022
Fundamentals of Bioinformatics
Project Manual 2022: Mutation Impact
Prediction Methods
September 2022
Table of Contents
Introduction
Aims of the Group Project
Predicting the Impact of Mutations
Timeline
Grading
Discussion Sessions
Sharing your data
Discussion Questions
Discussion Within Your Group:
Discussion with Other Groups:
References
Report grading rubrics
Introduction
Timeline
Week | Date | Activity/Deadline | Exercises (Green boxes)
Grading
This project counts for 40% of the final course grade. The deliverables that are listed below
all have to be handed in and are either graded or pass/fail. You will be assigned to give
feedback on the draft report of another group and your group's draft report will get feedback
from members from other groups. During the final weeks of the course, you will compare
your results to that of other groups. This will help you to write the discussion for your report.
At the end of this manual, each of these deliverables is explained in greater detail. Note that
the group project will also help you to prepare for the exam.
Throughout this manual you will find green boxes like this one. Green boxes contain
exercises that will help you understand the project. You should discuss them with
the Teaching Assistants (TAs), but you are NOT expected to answer them in the
report directly.
Setup
Also, if you decide to share your code through a GitHub repository, remember to make the
repository private. A public repository would be seen as enabling plagiarism.
files and three skeleton scripts. To organize your working directory for the project, let’s create
two subdirectories in your working directory named data and output. This can be done from
the command line with mkdir (also see programming tutorial 1 on Canvas).
From FoB Project in Canvas Modules, download the BLOSUM62.txt file and the
<HGVSdataset>_benchmark.tsv, <HGVSdataset>_sift_scores.tsv, <HGVSdataset>_
polyphen_scores.tsv and <HGVSdataset>_VEP_baseline.tsv files. Place the BLOSUM62.txt
file and the <HGVSdataset>_benchmark.tsv in the data folder. Download the three skeleton
scripts from Canvas (.py files) and place them in your working directory. Note that if you are
going to be working on the compute server, you first need to copy your files there. Please
see the tutorial "Working at home or on your own laptop" to learn how you can move/copy files
to the compute servers.
Create a subdirectory in data called vep, which stands for Ensembl Variant Effect Predictor.
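If you prefer to script this step rather than typing mkdir by hand, the layout described above can be sketched in Python. The helper name make_project_dirs is hypothetical and is not part of the skeleton scripts. The demonstration runs in a temporary directory so that nothing is written to your real working directory.

```python
import tempfile
from pathlib import Path


def make_project_dirs(root):
    """Create data/, data/vep/ and output/ below root and return their paths."""
    root = Path(root)
    wanted = [root / "data", root / "data" / "vep", root / "output"]
    for directory in wanted:
        # parents=True creates missing parent folders,
        # exist_ok=True makes it safe to run the helper more than once.
        directory.mkdir(parents=True, exist_ok=True)
    return wanted


# Demonstration in a throw-away directory (hypothetical usage).
with tempfile.TemporaryDirectory() as tmp:
    created = make_project_dirs(tmp)
    print([path.relative_to(tmp).as_posix() for path in created])
    # prints ['data', 'data/vep', 'output']
```

To set up your actual project, you would call make_project_dirs(".") from your working directory.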
Figure 1. Project workflow. The workflow shows a simple overview of the project. A benchmark will
be performed between the gold standard, VEP, and the methods to compare, baseline, SIFT and
PolyPhen. The scores from the benchmark can then be used to produce a ROC plot for each
predictor.
You will be given four initial .tsv files (tsv stands for tab separated values), one text file and
three skeleton scripts.
- Two of these .tsv files contain the results of PolyPhen-2 and SIFT, a third .tsv
contains the information of the benchmark, and the fourth .tsv file contains the
information that you will need to create your baseline model.
- The .txt file contains the BLOSUM62 matrix that you will need for the baseline model.
- The skeleton scripts have missing blocks of code that you will have to complete; you
can find them between the “START CODING HERE” and “END CODING HERE” markers.
- The first script you will need to complete and run is the
skeleton_script_baseline_model.py, to create a baseline model.
- The skeleton_script_create_roc_plot.py uses the output of the baseline model
along with the three other .tsv files as its input to create ROC plots that
compare the baseline model, SIFT and PolyPhen with the data from ClinVar.
- Finally, you will run the skeleton_script_roc_plot_tsv.py script, which needs the
output .tsv files generated with the previous script from your data and from
your fellow students' data, which will be provided to you three weeks into the
project.
Before starting, remember that you cannot import any packages apart from the ones already
found in the skeleton scripts. That would make the code harder to read for those who have
just started programming, and will most likely be graded as a fail. Note that even though
some students in your group can focus on the programming, everyone in the group should
be able to run the scripts and understand what they aim to do (also to prepare for the exam).
We recommend you start reading the script in the main() function, and then try to read and
understand each function as they are being called in the main function. To make the code
understandable for everyone, make sure to comment what you are doing with the lines of
code you add.
The Benchmark Datasets
ClinVar is a database of how human genomic variants are related to phenotypes of human
disease and supporting clinical evidence for these relationships, managed by the NCBI.
This database has been used to obtain the HGVS IDs of the genomic variants and their
clinical significance (label) you will work with (<HGVSdataset>_benchmark.tsv). Each
group will work with one of the three datasets:
1. old version of the dataset (HGVS_2014_<...>.tsv)
2. short dataset of the up-to-date database (HGVS_2020_small_<...>.tsv)
3. long dataset of the up-to-date database (HGVS_2020_big_<...>.tsv).
Each of these has been mapped to the reference genome GRCh38. The up-to-date
database (HGVS_2020_small_benchmark.tsv and HGVS_2020_big_benchmark.tsv) was
obtained from clinvar_20200629.vcf.gz while the old database (HGVS_2014_benchmark.tsv)
was obtained from clinvar_20141202.vcf.gz. A selection process was done for all of them:
only benign or pathogenic SNPs (Single Nucleotide Polymorphisms) were selected (likely
benign or likely pathogenic SNPs were excluded), to overcome ambiguities. Subsequently,
the following SNPs were filtered out:
- Intron variants
- Synonymous variants (those that lead to synonymous mutations)
- Variants in mitochondrial DNA
- Variants that vary into multiple bases
- Unknown variants.
The three datasets were balanced so that each contains the same number of ‘Benign’ and
‘Pathogenic’ samples.
- Can you check that the benchmark dataset is indeed balanced? Why is this
important?
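One possible way to check the balance is to count how often each label occurs. The sketch below is a stand-alone helper, not part of the skeleton scripts. It assumes that the benchmark file has a header row and that the label sits in the second column; check your own file and adjust label_column if the layout differs. The identifiers in the example data are made up.

```python
import csv
import io
from collections import Counter


def count_labels(handle, label_column=1):
    """Count the labels in a tab-separated file that starts with a header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    return Counter(row[label_column] for row in reader if row)


# Hypothetical miniature benchmark; the real file would be opened with open(...).
example = io.StringIO(
    "HGVS_ID\tLabel\n"
    "variant_1\tBenign\n"
    "variant_2\tPathogenic\n"
    "variant_3\tPathogenic\n"
    "variant_4\tBenign\n"
)
counts = count_labels(example)
print(counts["Benign"], counts["Pathogenic"])  # prints 2 2
print(len(set(counts.values())) == 1)          # prints True (all labels equally frequent)
```

For the real data you would replace the io.StringIO object with an opened file, for example open("data/<HGVSdataset>_benchmark.tsv", newline="").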
In this exercise, we are interested in modifying the gene coding for Apolipoprotein
E (APOE) to reverse a missense mutation from a patient with APOE deficiency.
We do not know how many missense mutations can be pathogenic. Go to
https://www.ncbi.nlm.nih.gov/clinvar and search for APOE. From how many
missense variants can this deficiency originate?
The HGVS Format
The Human Genome Variation Society (HGVS) nomenclature is used worldwide as a
standard language for the description of changes (variants/polymorphisms/mutations) in
RNA and DNA sequences. It is formatted as reference : description – the reference
sequence (e.g., NM_004006.2) is how the variant is referenced in databases, in this case
RefSeq, and it is followed by a description of the variant (e.g., c.4375C>T). These
descriptions are usually given in the context of a specific gene. The first (lowercase) letter
stands for the context of the code: c for coding DNA, g for genomic DNA, r for RNA and p for
protein. The number is the position of the polymorphism in the reference sequence (e.g.,
4375) and the last two letters separated by a > symbol represent the two different
nucleotides that are found in this position.
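To make the structure of such an ID concrete, the sketch below splits a simple substitution into the parts described above. It is only an illustration: the pattern handles plain nucleotide substitutions at an integer position (contexts c, g and r) and does not cover protein-level (p) descriptions or the many other variant types that HGVS allows. The function name parse_hgvs and the field names are hypothetical.

```python
import re

# reference ":" context "." position ref ">" alt, e.g. NM_004006.2:c.4375C>T
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<reference>[^:]+):(?P<context>[cgr])\."
    r"(?P<position>\d+)(?P<ref>[A-Za-z])>(?P<alt>[A-Za-z])$"
)


def parse_hgvs(hgvs_id):
    """Split a simple HGVS substitution into its parts, or raise ValueError."""
    match = HGVS_SUBSTITUTION.match(hgvs_id)
    if match is None:
        raise ValueError(f"not a simple substitution: {hgvs_id!r}")
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


print(parse_hgvs("NM_004006.2:c.4375C>T"))
# prints {'reference': 'NM_004006.2', 'context': 'c', 'position': 4375, 'ref': 'C', 'alt': 'T'}
```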
In this project, you will not have to use VEP directly, as we have selected some of the results
we previously obtained by running the VEP REST API.
Each HGVS ID is a variant of a genomic sequence that can overlap multiple transcripts.
Hence, when using VEP each HGVS ID has as many outputs as transcripts there are, and
each one can take different PolyPhen and SIFT scores. To make things easier for you, we
have selected only one transcript result for each HGVS ID. The transcript with the highest
impact (most deleterious) predicted by SIFT and PolyPhen was the one selected. If several
transcripts have this score, the transcript is selected at random among these ones. If
PolyPhen and SIFT do not agree, that HGVS ID will be skipped to avoid bias towards either
PolyPhen or SIFT.
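The selection rule can be sketched as follows. This is one possible reading of the paragraph above, not the code that was actually used to prepare your files. It assumes that a lower SIFT score and a higher PolyPhen score both mean "more deleterious", and it treats the two tools as agreeing when the same transcript is the most deleterious according to both. The function name and the transcript names are hypothetical.

```python
import random


def select_transcript(transcripts, rng=random):
    """Pick one transcript per HGVS ID, or return None if the ID is skipped.

    transcripts: list of (transcript_id, sift_score, polyphen_score) tuples.
    """
    if not transcripts:
        return None
    lowest_sift = min(sift for _, sift, _ in transcripts)       # most deleterious by SIFT
    highest_polyphen = max(pp for _, _, pp in transcripts)       # most deleterious by PolyPhen
    agreed = [
        name
        for name, sift, pp in transcripts
        if sift == lowest_sift and pp == highest_polyphen
    ]
    if not agreed:
        return None  # SIFT and PolyPhen point to different transcripts: skip this ID
    return rng.choice(agreed)  # ties are broken at random


# Both tools single out tx_A.
print(select_transcript([("tx_A", 0.0, 0.99), ("tx_B", 0.2, 0.50)]))  # prints tx_A
# SIFT prefers tx_A, PolyPhen prefers tx_B, so the ID is skipped.
print(select_transcript([("tx_A", 0.0, 0.50), ("tx_B", 0.2, 0.99)]))  # prints None
```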
This output has been extracted from the VEP web interface when providing three
HGVS IDs. If you have understood the paragraph above, you should know which
transcripts have been selected in your output.
The output is given to you in three separate .tsv files (“<...>” refers to your assigned dataset,
as described above in The Benchmark Datasets):
- <HGVSdataset>_sift_scores.tsv: contains the HGVS IDs* and the SIFT score.
- <HGVSdataset>_polyphen_scores.tsv: contains the HGVS IDs* and PolyPhen
score.
- <HGVSdataset>_VEP_baseline.tsv: contains the HGVS IDs*, Amino acid change
and Codon change.
*Note that the HGVS IDs are the same for the 3 files.
Create Baseline Prediction
The BLOSUM62 matrix is based on frequencies of amino acid substitutions of a collection of
protein alignments with 62% identity. As you know, this matrix is being used in alignment
tools such as BLAST or BLASTP. In this project, you will use the information in BLOSUM62
to obtain an insight into a substitution’s expected impact, and with this create a baseline
impact prediction method. You can find the BLOSUM62 matrix in the BLOSUM62.txt file.
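As an illustration of how a substitution score can be looked up, the sketch below parses a small excerpt of BLOSUM62 into a dictionary. It assumes the common layout in which the first non-comment line lists the column residues and every following line starts with the row residue; the BLOSUM62.txt file you are given and the parsing in the skeleton script may differ. The function name is hypothetical, and how the score is turned into a baseline prediction is left to the skeleton script.

```python
def parse_substitution_matrix(lines):
    """Return {(row_residue, column_residue): score} from a whitespace-separated matrix."""
    matrix = {}
    columns = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comments
        fields = line.split()
        if columns is None:
            columns = fields  # first data line: the column residues
            continue
        row_residue, scores = fields[0], fields[1:]
        for column_residue, score in zip(columns, scores):
            matrix[(row_residue, column_residue)] = int(score)
    return matrix


# Excerpt of BLOSUM62 covering 4 of the 20 amino acids.
EXCERPT = """\
# BLOSUM62 excerpt
   A  R  N  D
A  4 -1 -2 -2
R -1  5  0 -2
N -2  0  6  1
D -2 -2  1  6
"""

blosum = parse_substitution_matrix(EXCERPT.splitlines())
print(blosum[("A", "R")])  # prints -1
print(blosum[("N", "D")])  # prints 1
```

Storing both (row, column) orders keeps the lookup independent of which residue is the reference and which is the mutant.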
Check out the BLOSUM62 matrix in the BLOSUM62.txt file. Do you think the
diagonal values are going to be used on the baseline model with the data set that
we have provided? Why, or why not?
$ python3 skeleton_script_baseline_model.py \
    data/vep/HGVS_2020_small_VEP_baseline.tsv data/BLOSUM62.txt \
    -o data/HGVS_2020_small_baseline_scores.tsv
ROC Plot
Your task here is to create a Receiver Operating Characteristic (ROC) plot by comparing the
results from your predictors to the gold standard data we have obtained from ClinVar. A
ROC plot is a method of visualising the performance of your predictor: it plots the True
Positive Rate (TPR) against the False Positive Rate (FPR). Refer to the lecture on machine
learning and benchmarking for a thorough explanation of what ROC plots are and how they
can be used to evaluate, compare, and refine classification methods. In addition,
http://wikipedia.org/wiki/Receiver_operating_characteristic may be a helpful resource.
Note that, to create a ROC plot of the results, the threshold used to classify variants as
(putatively) benign or damaging is not fixed at a constant value. Instead, in a ROC plot you
calculate the true and false positive rate for every possible threshold spanning the range of
possible values for your method, from 0 until 1 for SIFT or from -2 until 9. For every
threshold, this allows every variant classified by the predictor to be categorised as a True
Positive (TP), False Positive (FP), False Negative (FN) or a True Negative (TN).
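The idea of sweeping the threshold can be sketched in a few lines. This is a toy illustration, not the layout of the skeleton script. It assumes that a higher score means "more damaging" (for SIFT, where low scores indicate a damaging variant, the comparison would have to be reversed), that the labels are the strings 'Pathogenic' and 'Benign', and that both labels occur at least once. The function names and the scores are made up.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN and TN when 'score >= threshold' is called damaging."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_damaging = score >= threshold
        truly_pathogenic = label == "Pathogenic"
        if predicted_damaging and truly_pathogenic:
            tp += 1
        elif predicted_damaging:
            fp += 1
        elif truly_pathogenic:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct score used as threshold."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called damaging
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


toy_scores = [0.9, 0.8, 0.4, 0.2]
toy_labels = ["Pathogenic", "Benign", "Pathogenic", "Benign"]
print(confusion_counts(toy_scores, toy_labels, 0.8))  # prints (1, 1, 1, 1)
print(roc_points(toy_scores, toy_labels))
# prints [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Recounting all variants for every threshold is easy to read but slow for large datasets; sorting the variants by score once and updating the counts while walking through the list would be a faster alternative.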
Complete the blank cells in this confusion matrix. Hint: The conclusion drawn from
the predictor depends on the threshold.
[Table: confusion matrix with blank cells to complete; one cell label reads "(Putative Damaging)"]
When working with ROC plots, the Area Under the Curve (AUC) is often taken as a measure
to evaluate performance. To calculate the AUC you have to approximate the integral of the
function f(x) that describes the shape of the curve of the ROC-plot. We do not know the
function that describes the curve, thus we have to evaluate the integral numerically. A
method that approximates the integral is the trapezoidal rule.
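For two neighbouring points (x0, y0) and (x1, y1) on the curve, the trapezoidal rule takes the area under the connecting line segment, (x1 - x0) * (y0 + y1) / 2, and sums this over all segments. A minimal sketch is shown below. It assumes the ROC coordinates are available as a list of (FPR, TPR) pairs; the function name and the example points are made up, and the skeleton script may store the coordinates differently.

```python
def trapezoid_auc(points):
    """Approximate the area under a curve given as (x, y) pairs."""
    ordered = sorted(points)  # make sure x increases from left to right
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        # width of the segment times the mean of the two heights
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# A random predictor follows the diagonal, which encloses half of the unit square.
print(trapezoid_auc([(0.0, 0.0), (1.0, 1.0)]))  # prints 0.5
# A made-up step-shaped ROC curve.
print(trapezoid_auc([(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]))  # prints 0.75
```

Vertical segments (x1 equal to x0) contribute zero area, so no special case is needed for them.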
Think of a clever way to implement this rule, and complete the provided skeleton script:
skeleton_script_create_roc_plot.py. This script will parse your predictor and benchmark
results, count the number of TP, FP, FN and TN, calculate your ROC plot’s line coordinates,
create the corresponding figure, and integrate the AUC.
Complete and execute the skeleton script skeleton_script_create_roc_plot.py. For a better
understanding of the ROC plot, the script produces a color gradient indicating the score
range.
To obtain the individual ROC plot for one predictor, the optional argument -ipred should be
included once with the .tsv file with the scores of one of your methods (SIFT, PolyPhen or
baseline). The -ibench should be included with the <HGVSdataset>_benchmark.tsv file. You
can use the help function -h or --help for explanation of these and other options. You will
have to specify a path for the output .png file with the argument -o (including the .png
extension). A .tsv file with the ROC x- and y-coordinates will be saved automatically to the
same output directory. For example, to call the script for the PolyPhen ROC plot:
To show the ROC curves of the three predictors in one figure, the script can be run with the
-ipred argument three times, once for each of the three prediction .tsv files (SIFT, PolyPhen and
baseline). (In this case, the ROC plot coordinates file will not be created.) A command line
example is provided below:
- In your ROC plots you will probably find that some of the methods provide a
much more detailed curve than others. Can you explain why this is?
- What will a ROC-plot look like if you have extremely unbalanced data? Will the
ROC-plot be representative?
third script. To test how different data sets can influence the ROC plot results, you will run
the third script skeleton_script_roc_plot_tsv.py.
This script contains only one coding block which is the same you have found in the last script
to calculate the AUC – please use the same code. As inputs, you will use the coordinates
that your fellow students have obtained with the other two datasets by providing paths to
-itsv. These two sets of coordinates will be given to you through Canvas. You will have to run
the code twice, once for each type of dataset. As output, you will get a .png file with the ROC
plot comparing the same dataset on the three predictors, just as with the previous script
(skeleton_script_create_roc_plot.py). This command line illustrates how it can be run:
The ROC plots from this script will be discussed in the discussion session and you will have
to add them to your final report as well.
SIFT and PolyPhen define default thresholds for their score to classify a mutation as
benign or pathogenic. Do you think the default thresholds make sense according to
your ROC plot (look at the FPR and TPR)? What would happen if you changed the
threshold?
Instructions for Submitting Draft Report
(Submit in PDF format via Canvas)
The draft report must contain between 1000 and 1500 words and contain the following sections:
● Abstract
● Introduction
● Methods; and
● (Preliminary) results.
The results section should include a ROC plot and its interpretation. You must clearly state
your research question in the introduction and answer it in the results and discussion
sections. Note that in the section below, and in the rubric (you will receive for the peer
review), more details are provided about what the report should contain.
Please add word counts in square brackets [ ] behind the title of each section, before you
submit. Your draft report needs to be handed in via Canvas, and will be peer reviewed by
students of other groups. Note that you do not yet need to write the discussion and
conclusion sections for your draft.
Everyone should peer review the report of one other group, meaning that each group should
get around 4 peer reviews back. The peer review should be handed in on Canvas. Note that
the peer review should be based on the rubrics provided. You need to write a peer review in
order to pass the course.
Use Cases: Investigating two SNPs in detail
Now we would like you to think more about the biological aspect of impact prediction. You
will do this by comparing two SNPs from the same gene, where one is known to be a benign
SNP and one is known to be a pathogenic SNP. You will report on your findings in a section
called “Use cases”. The section needs to contain the answers to the exercises below and
you may add additional information to support your findings. Also see the rubrics.
Below, a list of three genes is shown, with the corresponding SNPs and variants. Depending
on the dataset you used, you will carry out the following steps for two SNPs from a single gene.
As a first step, go to the Ensembl Variant Effect Predictor (VEP) website. Click on the Web
interface option. In your input data, paste each of your SNPs in HGVS format on a separate
line. You can leave all other settings on default, and click “Run” at the bottom. Please be
aware that a job can take a few minutes to complete. Click on “view results” when your job is
complete. You might need to change the shown columns by clicking on the “Show/hide
columns” button in the blue bar.
Alternatively, you can use the REST API: type the following URL in your browser,
“https://rest.ensembl.org/vep/human/hgvs/”, followed by the HGVS code. See also:
https://rest.ensembl.org/documentation/info/vep_hgvs_ge
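If you would rather query the REST API from Python than from the browser, the sketch below shows one way to build the URL and request the result as JSON. The function names are hypothetical, the HGVS ID is only the notation example from the section on the HGVS format, and the structure of the returned JSON is not shown here. Only the URL is built when the block is run; fetch_vep needs an internet connection and is therefore defined but not called.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

VEP_HGVS_ENDPOINT = "https://rest.ensembl.org/vep/human/hgvs/"


def build_vep_url(hgvs_id):
    """Append the percent-encoded HGVS ID to the VEP endpoint."""
    # ">" is not allowed unescaped in a URL, so it becomes "%3E"; ":" is kept as is.
    return VEP_HGVS_ENDPOINT + quote(hgvs_id, safe=":")


def fetch_vep(hgvs_id, timeout=30):
    """Request the VEP result for one HGVS ID and return the parsed JSON."""
    request = Request(
        build_vep_url(hgvs_id),
        headers={"Content-Type": "application/json"},  # ask for a JSON response
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(build_vep_url("NM_004006.2:c.4375C>T"))
# prints https://rest.ensembl.org/vep/human/hgvs/NM_004006.2:c.4375C%3ET
```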
We will start with comparing the sequence conservation between the two SNPs. The web
server of PolyPhen-2 is the tool we will use for this. Before going to the website, find the rsID
in the VEP output under “Existing variant” for each SNP (be sure to look for the right feature
as indicated in the table above). Use the rsID one by one as input for the query WHESS.db.
If you get multiple options in the results screen, choose the results with the same protein
position as in the VEP output. On the report page you can find the sequence conservation
under the Multiple sequence analysis tab.
- Do you see differences in sequence conservation for the region around the SNP
and for the SNP itself?
On the same report page produced by PolyPhen-2, you can also find a tab called 3D
visualization. Use this to find where in the protein structure the mutation occurs.
- What is the structural environment around the SNP, when focusing on where in
the protein the secondary structure occurs?
- Does the place of the SNP in the protein make sense considering its impact as
predicted by VEP?
Discussion Sessions
In the discussion sessions, you will be matched with students from other groups to discuss
your findings about the Use Cases. The discussion points you need to prepare are the
questions from exercises 10 and 11. Additionally, we want you to say something about the
biological background of the SNPs; this is, however, not required for the report. The
discussion session will be moderated by teachers and TAs.
Make sure you can show figures of the sequence conservation and the structure. This can
be done in a small presentation.
When you execute the create_roc_plot.py script, the coordinates of the ROC plot will
automatically be exported to '[your_custom_plotname]_xy.tsv'. You need to share the .tsv
files generated by the skeleton_script_create_roc_plot.py with other students for your
specific benchmark datasets, for all three methods.
Discussion Questions
Your discussion section in the report should contain the answers to the following questions.
The questions under discussion within your group should be based on your own results, and
the questions under discussion with other groups should be based on your own results and
the results from other groups.
1. Are there any clear differences between the different benchmark datasets in terms of
the AUC and the shape of the ROC curves? What is the effect of having more
benchmark data available? [B1]
2. Is the relative performance the same in all three datasets for SIFT, PolyPhen and the
baseline script? How could you test this? [B2]
Format of the Final Report
The final report must contain the following sections:
The final report must contain between 3000 and 3500 words. We count everything (including
figure text) except references.
Please read the rubric for the final report and make sure you include every requirement
listed.
Handing in the Final Report
The final report must clearly state what each group member contributed to the project.
Individual students' grades may be adjusted according to the reported and observed
differences in workload.
Note that you need to add comments to the code you write, so that all group members can
understand what is going on in the code blocks.
The code will count 10% towards the final project grade, the final report 90%.
CodeGrade
We use CodeGrade, an automatic grading system, to evaluate your code. This means that
coding outside the code block for editing (“START CODING HERE” and “END CODING
HERE”) is generally not allowed (otherwise CodeGrade may not work). Importing additional
packages is also not allowed.
You can submit your code to CodeGrade multiple times to check if your code works correctly.
References
Adzhubei, I. A., Schmidt, S., Peshkin, L., Ramensky, V. E., Gerasimova, A., Bork, P.,
Kondrashov, A. S. and Sunyaev, S. R. (2010) A method and server for predicting damaging
missense mutations. Nature Methods, 7, 248–249.
Vaser, R., Adusumalli, S., Leng, S. N., Sikic, M. and Ng, P. C. (2016) SIFT missense
predictions for genomes. Nature Protocols, 11, 1.
Rubric for the draft report
Below are the grading criteria for the draft report. Keep in mind though that the draft report is
not graded. However, you can use this rubric when giving feedback to the draft report.
Criteria Pts
Scores for the different methods (Materials and Methods) 3 pts
Describe the scores for the different methods.
Rubric for the final report
The rubric for the final report includes the rubric for the draft report, and the additional
grading elements below.
Criteria Pts
Discuss your observations with the impact predicted by VEP (Use Cases) 5 pts
Discuss if you agree with the prediction made by VEP, based on your results from
the sequence conservation and the protein structure.