
Summary Bioinformation Technology

The document provides an overview of key concepts in bioinformatics and genomics including: 1) Genomes contain chromosomes or genes, genes contain coding sequences that encode proteins, and reading frames contain sets of 3 nucleotides that can encode amino acids. 2) Sequencing a whole genome requires determining coverage, or the number of times each base is sequenced, to minimize gaps between non-sequenced areas. 3) Sequence assembly involves merging reads based on overlap to form contigs, then scaffolding to order contigs and fill gaps, with the N50 statistic measuring assembly quality.

Uploaded by

tj

Summary Bioinformation Technology period 1 2018

Lecture 1: Introduction
Genome: the full set of chromosomes or genes in a gamete (contains only one set of dissimilar
chromosomes). A regular diploid somatic cell contains two copies of the genome.
Gene: a DNA-based unit that can exert its effects on the organism through RNA or protein products.
Reading frame: a way of dividing a nucleotide sequence into consecutive sets of 3 bases (codons). Each codon can encode an amino acid (AA).
Open reading frame: the part of a reading frame that has the potential to code for a protein or
peptide. It is a continuous stretch of codons from start codon till stop codon.
Coding sequence (CDS): the portion of a gene’s DNA/RNA that codes for a protein (composed of
exons).

Translation from mRNA to protein goes from 5’ to 3’


An AA chain is read from N-terminal to C-terminal

Prokaryotes don’t have introns. Through an operon structure, a prokaryotic mRNA strand can encode
different proteins separated from each other with non-coding regions.
Eukaryotes have introns. One mRNA strand encodes one protein (alternative splicing is an exception).
The mRNA strand has a cap at the 5’ end and a polyA tail at the 3’ end.

Lecture 2a: Sequence coverage


Sequencing a whole genome at once is impossible because:
- Limited length per read
- Limited output per run
- Complexity in genome architecture

Coverage: Average number of times any given base in the sequenced fragment is sequenced
Coverage = (number of reads * read length) / fragment size
A read of length L can start anywhere except the last L-1 positions (because otherwise reads will go
beyond last nucleotide)

The Poisson distribution can be used to approximate the binomial distribution when p (probability) is very
small and N (number of attempts) is very large.

Gap: non sequenced area


Probability of gap (p, same as percentage of nucleotides in gaps) = e^(-coverage) → coverage = -ln(p)
Number of contigs = N * p = N * e^(-coverage) with N = number of reads
According to the above formula, it seems as if more reads result in more contigs and thus more gaps,
which doesn’t seem logical. However, more reads increase the coverage and thus decrease the
probability of a gap (p). At a coverage above 1 this exponential decrease outweighs the linear increase in N.
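
The coverage formulas above can be tried out with a small sketch. The function names and the example numbers (50,000 reads of 100 bases on a 1 Mb fragment) are assumptions chosen for illustration only; they are not from the lecture.

```python
import math

def coverage(num_reads, read_length, fragment_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / fragment_size

def gap_probability(cov):
    """Fraction of bases expected to fall in a gap: p = e^(-coverage)."""
    return math.exp(-cov)

def expected_contigs(num_reads, cov):
    """Number of contigs = N * e^(-coverage)."""
    return num_reads * math.exp(-cov)

def required_coverage(max_gap_fraction):
    """Invert p = e^(-coverage): coverage = -ln(p)."""
    return -math.log(max_gap_fraction)

# Hypothetical experiment: 50,000 reads of 100 bases on a 1,000,000 base fragment.
cov = coverage(num_reads=50_000, read_length=100, fragment_size=1_000_000)
print("coverage:", cov)
print("fraction of bases in gaps:", gap_probability(cov))
print("expected contigs:", expected_contigs(50_000, cov))
print("coverage needed for 1% gaps:", required_coverage(0.01))
```

As the notes state, these numbers are purely statistical, so in practice a higher coverage than the calculated value is chosen.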

Before you start genome sequencing it is important to determine the desired number of contigs/gaps
and calculate the needed coverage, read length and read number based on this threshold.
Take into account that the above formulas are purely statistical and do not take errors
(contaminations, false positives, false negatives, etc.) into account. In general a higher coverage than
suggested by the formulas needs to be used.

Lecture 2b: Sequence assembly


Analysis of genomics data: sequencing → quality control → assembly → scaffolding → structural
annotation → functional annotation
When you have a known sequence you can use it as template for mapping your assembly, otherwise
you have to make a de novo (new) assembly.

Scaffold: contigs in the right order and location containing possible gaps in between them
N50: a statistic that defines the quality of an assembly. It is defined by the length of the shortest
contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all
contigs.
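
A minimal sketch of the N50 definition above, assuming the assembly is given simply as a list of contig lengths (the function name and the example lengths are made up for illustration):

```python
def n50(contig_lengths):
    """Length of the shortest contig in the set of longest contigs
    that together cover at least 50% of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")

# Total length is 290, so half is 145; 80 + 70 = 150 reaches it.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```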
Fasta format: first line is header starting with “>”, the new line directly after and possible following
lines contain the sequence.
Fastq format: first line is header starting with “@”, the second line contains the sequence, the third
line starts with a “+” and is usually otherwise blank, the fourth line contains ASCII-encoded quality values
for each single symbol in line 2.
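
A small sketch of reading the four-line Fastq layout. It assumes the standard convention that a Fastq header starts with "@" and that qualities use the common Phred+33 ASCII encoding (quality = ASCII code minus 33); the function name and the example record are hypothetical.

```python
def parse_fastq(text):
    """Return a list of (name, sequence, quality_scores) tuples."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lines differ in length")
        # Assumption: Phred+33 encoding, one quality character per base.
        scores = [ord(ch) - 33 for ch in qual]
        records.append((header[1:], seq, scores))
    return records

example = "@read1\nACGT\n+\nII5!\n"
for name, seq, scores in parse_fastq(example):
    print(name, seq, scores)  # read1 ACGT [40, 40, 20, 0]
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 20 means a 1% chance that the base call is wrong.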
Linkage: overlap between different fragments

Greedy assembly: merge pair of reads with most overlap and keep adding reads until you cannot
continue (only works well with small number of reads).
OLC-assembly: find potentially Overlapping fragments. Find the order of the fragments and make a
Layout. Derive the final sequence from the layout through Consensus.
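
The greedy strategy can be sketched as follows. This is a simplified illustration, not a real assembler: it assumes error-free reads from one strand, uses exact suffix/prefix overlaps only, and the minimum overlap of 3 and the example reads are arbitrary choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the largest overlap until none is left."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))  # ['ATGGCGTACGGAT']
```

Every round compares all pairs of contigs, which is one reason the approach only works well for a small number of reads.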

Obtaining different contigs is mainly caused by repeats, sequences that are repeated in different
parts of the sequenced fragment.

Lecture 3: Proteomics
Proteomics identifies proteins in biological systems. Proteins are very diverse because they can have
different locations in a cell, form complexes, have many different modifications, are dynamic in
turnover and have a great variety in activity.

Genomic information is essential in large scale proteomics studies.

You can study many proteins in one sample.

The most important information is obtained from ESI-LC-MS/MS studies (ElectroSpray Ionization
Liquid Chromatography tandem Mass Spectrometry).

Traditionally 2D gels were used to separate different proteins by size and charge. This resulted in a
gel with many different spots unique for proteins. This spot pattern could be seen as fingerprint and
single spots could be cut out and analyzed by MS.
Nowadays 1D gels are sufficient followed by further analysis.

After cutting out a spot: protease digestion → make mass spectrum of fragments → match spectrum
with library spectra.
How to make library: Take sequence from database → artificial digestion → artificial mass
spectrometry → library spectrum

Trypsin is often used for digestion. It cleaves specifically at the C-terminal side of arginine and lysine (generally not when the next residue is proline).
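
An artificial (in silico) trypsin digestion can be sketched with a regular expression. The sketch assumes the commonly used rule "cut after K or R unless the next residue is P" and ignores missed cleavages; the function name and the example sequence are hypothetical. Splitting on an empty match needs Python 3.7 or newer.

```python
import re

def trypsin_digest(protein):
    """Split a protein sequence after every K or R that is not followed by P."""
    # (?<=[KR]) looks back for K or R, (?!P) refuses to cut in front of proline.
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", protein) if pep]

print(trypsin_digest("MKWVTFRPSLLKA"))  # ['MK', 'WVTFRPSLLK', 'A']
```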

Isotopes can result in several peaks for one fragment. For MS the mono-isotopic peak is used (mass
when all carbon atoms are C12, so smaller than the average mass).

Several peptide fragments need to be used in order for a proper identification.


(generally understand picture)
Both positive and negative ion modes are possible depending on functional groups (-OH, -NH3,
-COOH, etc.) in the sample.

MS spectra have m/z on the x-axis → mass divided by the charge obtained during ionization.


Actual mass = (peak value * charge) – mass of added protons
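
The formula above as a short sketch for positive ion mode. The proton mass constant (about 1.007276 Da) is a standard value; the function name and the example peak are assumptions for illustration.

```python
PROTON_MASS = 1.007276  # mass of one proton in Da

def neutral_mass(mz, charge):
    """Actual mass = (peak value * charge) - mass of the added protons."""
    return mz * charge - charge * PROTON_MASS

# Hypothetical doubly charged peak at m/z 500.5.
print(round(neutral_mass(500.5, 2), 6))
```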

When performing an artificial digestion and spectrometry:


- You can choose from many optional post-translational modifications (phosphorylation,
methylation, O18 labeling, modification of reactive –SH group of cysteine, etc.).
- You have to make a decoy database when making a target database by reversing your AA
sequence and performing the same steps (modification, digestion, etc.) on this sample.
- Hits in the decoy database are used to determine the rate of false positives in the real
database.

First choose an E-value threshold for hits with the database and then determine the false discovery
rate with the outputs that match the threshold. False discovery rate = hits with decoy database /
(total hits – hits in decoy database)
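
The false discovery rate formula can be written as a tiny helper. The function name and the example counts are made up; "total hits" is taken here to mean all hits above the threshold in the target and decoy databases together, which is an interpretation of the notes.

```python
def false_discovery_rate(total_hits, decoy_hits):
    """FDR = hits with decoy database / (total hits - hits in decoy database)."""
    target_hits = total_hits - decoy_hits
    if target_hits <= 0:
        raise ValueError("no target hits left above the threshold")
    return decoy_hits / target_hits

# Hypothetical search: 1020 hits above the E-value threshold, 20 of them decoys.
print(false_discovery_rate(total_hits=1020, decoy_hits=20))  # 0.02
```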
Fragmentation of the protonated peptides can result in different types of ions (a, b, c, x, y, z) with different masses.
B-ions have a tendency to degrade and lose their CO-group making them into A-ions.
In a mass spectrum all the different ion peaks are seen together. However you don’t know which
type of ion belongs to a certain peak.

Lecture 4A: Substitution patterns


Mutation rate = number of visible substitutions / (2 * time)

Hidden changes: When a mutated nucleotide mutates further into another nucleotide so you can’t
see that the previous nucleotide was already mutated.
Transition: point mutation between nucleotides that are both either purines (A and G) or pyrimidines
(C, T and U). This occurs relatively often.
Transversion: point mutation where a purine becomes a pyrimidine or the other way around.

The Jukes-Cantor model takes the possibility of hidden changes into account and calculates the actual
dissimilarity based on the observed dissimilarity (assuming all changes are equally likely to occur).
The Kimura model also takes the difference in prevalence of transitions and transversions into account.
Both models don’t take functional constraints into account.
Proteins are even more complicated because different triplets can encode for same AA so not all
mutations are visible in AA sequence (= synonymous mutation).
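
The notes do not give the formulas, so the sketch below uses the standard textbook forms as an assumption: Jukes-Cantor d = -3/4 * ln(1 - 4p/3) with p the observed fraction of differing sites, and Kimura two-parameter d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)) with P and Q the observed fractions of transitions and transversions. Function names and example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}

def transition_transversion_fractions(seq1, seq2):
    """Observed fractions of sites with a transition (P) and a transversion (Q)."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n

def jukes_cantor(p):
    """Distance corrected for hidden changes, all changes equally likely."""
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(P, Q):
    """Distance with separate treatment of transitions and transversions."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

P, Q = transition_transversion_fractions("ACGTACGTAC", "GCGTTCGTAC")
print("observed:", P + Q)
print("Jukes-Cantor:", jukes_cantor(P + Q))
print("Kimura:", kimura_2p(P, Q))
```

Both corrected distances come out larger than the observed dissimilarity, which reflects the hidden changes.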

Lecture 4B: Matrices


Alignment is an important tool to learn about possible functions of a protein based on homology with
well characterized molecules and to help find conserved domains.
In protein alignments some AAs are more similar to each other than others, so a scoring matrix is used.

PAM matrix is based on observed similarity. Derived from global alignment of similar sequences and
observed changes. Score is based on whether a certain change is observed often or not.
PAM matrices have numbers which refer to evolutionary distance, higher number → larger distance.
1 PAM → you expect 1 mutation per 100 AA
BLOSUM matrix is based on local alignment of distantly related sequences (domains with certain
threshold similarity are clustered and analyzed).
BLOSUM matrices have numbers which refer to the threshold similarity value.
BLOSUM62 → sequences that are at least 62% identical are clustered and counted as one when building the matrix.

Overall BLOSUM matrix is more effective (therefore used in many alignment programs like BLAST).
Specific substitution matrices can perform better if you are studying specific protein families.

(Figure: matrices for close relatives versus matrices for distant relatives)

To study possible homology between two sequences, score the sequences based on a model that
assumes that the proteins are related (PAM or BLOSUM) and a model that assumes that the proteins
are unrelated. The unrelated model always gives a higher score except in the case of homology +
optimal alignment.
For the unrelated model, take into account that different AA are present in different frequencies

Unrelated model (expected frequency, e):


Chance of an AA aligning with itself = e_ii = (frequency of AA)^2
Chance of an AA aligning with another AA = e_ij = 2 * frequency of AA 1 * frequency of AA 2
For a whole alignment the total is e1 * e2 * e3 * etc.

Determining observed frequency (q_ij) of substitution based on matrix score and expected frequency:
S_ij = 0.5 * score in matrix for substitution from i to j (i can be the same as j)
S_ij = log2(q_ij / e_ij) → q_ij / e_ij = 2^(0.5 * score in matrix) = how many times more likely the
aligned pair is under the related model than by chance
For sequences longer than 1, S becomes the sum of the individual S_ij’s, so q / e = 2^(0.5 * sum of scores from matrix)
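
A minimal sketch of the log-odds calculation for a short ungapped alignment. It assumes a matrix whose scores are in half-bit units (hence the factor 0.5), and the three scores in the table are given as illustrative BLOSUM62-style values rather than a complete matrix. All names are hypothetical.

```python
# Illustrative half-bit scores for a few residue pairs (not a full matrix).
HALF_BIT_SCORES = {("W", "W"): 11, ("A", "A"): 4, ("I", "L"): 2}

def expected_pair_frequency(freq_i, freq_j, same_residue):
    """Unrelated model: e_ii = f_i^2, e_ij = 2 * f_i * f_j."""
    return freq_i * freq_j if same_residue else 2 * freq_i * freq_j

def pair_score(a, b):
    """Look up a symmetric matrix entry."""
    key = (a, b) if (a, b) in HALF_BIT_SCORES else (b, a)
    return HALF_BIT_SCORES[key]

def alignment_odds(seq1, seq2):
    """Return (score in bits, odds of related versus unrelated model)."""
    total = sum(pair_score(a, b) for a, b in zip(seq1, seq2))
    bits = 0.5 * total
    return bits, 2 ** bits

bits, odds = alignment_odds("WAL", "WAI")
print(bits, odds)  # 8.5 bits, odds of about 362 to 1
```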

E-value: The number of alignments with scores greater than or equal to a given value expected to
occur in a search against a database of known size, based on chance.
P-value: number between 0 and 1 representing the probability of finding at least one alignment with such a score by chance
When E<0.01, P-values are nearly identical to E-values

An E-value of 10 for a match means that in this particular database, you can expect to see 10
matches with a similar or better score, simply by chance alone.

To study raw scores (S):


S = sum of score from similarity matrix – (number of gaps * penaltyA) – (length of gaps * penaltyB)
Raw score depends on kind of matrix and penalty values so has to be converted into bit score (S’)
S’ = (parameterA * S – ln(parameterB))/ln(2)
Parameter A and B depend on search conditions (type of matrix, penalty values, search space, etc.)
Bit scores allow you to compare scores obtained through using different matrices, etc.
E-value can be calculated from bitscore, parameterA, parameterB and database size
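
The conversions can be sketched as below. The bit score formula is the one from the notes, with parameterA called lam and parameterB called k. The E-value and P-value formulas (E = m * n * 2^(-S') and P = 1 - e^(-E), with m the query length and n the database length) are the standard Karlin-Altschul forms and are an assumption here, since the notes only say that the E-value follows from the bit score and database size. The example parameter values are illustrative only.

```python
import math

def bit_score(raw_score, lam, k):
    """S' = (parameterA * S - ln(parameterB)) / ln(2)."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_length, database_length):
    """Assumed standard form: E = m * n * 2^(-S')."""
    return query_length * database_length * 2 ** (-bits)

def p_value(e):
    """Assumed standard form: P = 1 - e^(-E)."""
    return 1 - math.exp(-e)

# Illustrative numbers, not taken from a real search.
bits = bit_score(raw_score=120, lam=0.267, k=0.041)
e = e_value(bits, query_length=300, database_length=5_000_000)
print("bit score:", bits)
print("E-value:", e)
print("P-value:", p_value(e))
```

For small E the P-value and E-value nearly coincide, because 1 - e^(-E) is approximately E when E is close to zero.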

Lecture 4D: The BLAST algorithm


Heuristic = “program cuts corners” so it doesn’t have to align everything with each other → faster
BLAST is heuristic and starts the alignment from seed words

BLAST tool makes local alignment. It starts at some point in the middle and builds on from there.
Alignment stops at the end of a region with strong similarity.
FASTA tool makes global alignment. Alignment is stretched over the entire sequence lengths (causes
gaps).
Local alignment is preferred for database searches, global alignment is preferred for aligning
homologs that are known to share similarity over their whole length.

General workflow of FASTA:


Make seed words (with defined length) of query, try to align seed words to template/database based
on identity score, rank alignments using matrix, join high scoring regions on same template
sequence, find and penalize gaps, optimize alignment by filling up gaps with dynamic programming.

General workflow of BLAST:


Make seed words (with defined length) of query, score seed words using matrix and select words
with score above a certain threshold, add synonymous words to list if they score above the same
threshold, search all sequences in database and select the ones where multiple words from the
extended list are found, from each word alignment extend alignment in both directions to find
alignments with a better score.
BLASTp: a protein query against a protein database.
BLASTn: a nucleotide query against a nucleotide database.
BLASTx: a nucleotide query against a protein database, by first translating the query nucleotide
sequence in all 6 reading frames.
TBLASTn: a protein query against a nucleotide database, by translating each database nucleotide
sequence in all 6 reading frames.
TBLASTx: a nucleotide query against a nucleotide database, by translating each database and query
nucleotide sequence in all 6 reading frames.
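
Translating in all 6 reading frames, as BLASTx, TBLASTn and TBLASTx do, means three frames on the given strand and three on the reverse complement. The sketch below uses the standard genetic code and is only an illustration of the idea, not of how BLAST is implemented; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become X, stops become *."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Frames 1-3 on the forward strand, then frames 1-3 on the reverse strand."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]

print(six_frames("ATGGCCTAA"))  # ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```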

BLAST uses an AA matrix (usually BLOSUM) for scoring, except for BLASTn where you compare nucleotide
with nucleotide (match/mismatch scoring).

Lecture 7: PSI BLAST and protein domains


Proteins usually contain one or more functional regions commonly termed as domains. Identification
of the different domains can provide insights into the function of the protein.

Multiple sequence alignments of sequences containing a certain domain can result in a pattern of
important AAs in this domain.
A profile is a table of position specific scores and gap penalties based on a multiple sequence
alignment to identify possible patterns for a domain in a sequence.

The Hidden Markov model (HMM) is a mathematical model that can be used for pattern recognition.

PSI BLAST can be used to make your own position specific scoring matrix. First do a normal BLASTp
and take the results above a certain threshold. Base your matrix on a multiple sequence alignment of
these sequences. Further search the database with this matrix for new matches and update your set
of sequences used to make your matrix based on new results.
Lecture 8: Best bidirectional hits and FAIR
Provenance of data: what is the background of data. How is the data obtained and how reliable is it.

FAIR data:
Findable: keeping track of data provenance. Easy to find data back by humans and computer systems
Accessible: readable for both humans and computers
Interoperable: annotation according to standards, procedures, protocols
Reusable: directly usable by computational methods

Gene Ontology (GO) terms: defined terms representing gene product properties
- Cellular component (part of the cell or extracellular environment where product is present)
- Molecular function (the activities of a product on molecular level like binding, catalysis, etc.)
- Biological process (the process the activity of the product contributes to)

Homolog: proteins have common ancestor (both orthologs and paralogs)


Ortholog: genes that derive from a single ancestral gene in the last common ancestor
Paralog: the rest of the homologs, which arise through gene duplication (same ancestral gene but in same organism, or not in last ancestor)

The best hit of a given gene used as query in BLAST, is the gene from the target genome that gives
the best match. If you BLAST the gene that was the best hit against the genome of the initial query
gene and this gene comes up as best hit, you have a bidirectional hit.
Bidirectional hits represent a very strong similarity and indicate a possibility for orthology.
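
Finding bidirectional best hits from two sets of BLAST results can be sketched as follows. The input format (a dictionary from (query, target) to score) and all gene names are assumptions for illustration; real BLAST output would first have to be parsed into this shape.

```python
def best_hits(scores):
    """scores maps (query, target) -> score; return the best target per query."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)

# Hypothetical scores of genome A genes against genome B and the other way around.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 195, ("b2", "a1"): 140, ("b2", "a2"): 80}
print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```

In the example a2 finds b2 as its best hit, but b2 prefers a1, so that pair is not bidirectional.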

A clique is a closed circuit of bidirectional hits in different organisms.


Downsides of bidirectional hits:
- Takes a long time when you look at multiple species (quadratic increase with more species).
- It only recovers one orthologous pair (only best hit), so a possible co-ortholog resulting from
a duplication event after speciation will not be found.
- Finding bidirectional hits is harder when the distance between two organisms increases.
- A single mutation can already change the function.

Alternative approach: identification through function instead of evolution, look at domains.

Lecture 9: transcripts and mRNA measurement


RNA is single stranded so it can fold into structures
mRNA gets a 5’ cap and a 3’ poly-A tail, and its introns are spliced out. The cap and tail make the mRNA
more stable and play a role in transportation.

Alternative splicing of the mRNA of a single gene can result in different isoforms.

For most biological questions, protein levels and activity are most informative. mRNA is easier to
measure than proteins and still gives an indication. mRNA levels do not always correspond to protein
levels. Protein levels do not always correspond to protein activity.

mRNA measurements are useful when you compare two different conditions (yeast growth in
aerobic and anaerobic conditions) or when you measure differences over time (cell cycle).

mRNA levels: differ between genes, differ between isoforms, differ between tissues, differ between
developmental stages, vary with cell cycle, vary during the day (circadian rhythm), differ between
individual cells, depend on the environment, are the result of mRNA synthesis and mRNA decay
(speed varies between different mRNAs).

Ways of measuring mRNAs:


Quantitative PCR to measure amounts, only for few genes at a time.
Microarrays to measure between different conditions. Labeled mRNA hybridizes on probes with
corresponding DNA on chips, followed by visualization of labels using fluorescence.

Upsides microarrays: Highly standardized, relatively cheap, easy to handle due to small data size
Downsides microarrays: Gene sequence should be known, no detailed position specific information,
not very quantitative.

Microarrays get used less and less because of a newer, better technique: mRNA sequencing (RNA-seq).
Isolate all RNA from sample, filter mRNA by fishing for the poly-A tail, make double stranded cDNA from
mRNA, fragment and sequence cDNA, map to reference genome or assemble de novo, count reads
to quantify.

Upsides RNA-seq: still works without known reference genome, can identify alternatively spliced
transcripts, can identify single nucleotide polymorphisms (SNPs) in transcripts, quantitative.
Downsides RNA-seq: relatively expensive

Output of RNA-seq is many files of short reads so you have to do a de novo assembly or map the
reads on a reference sequence.
Algorithms can be used for mapping. Take into account that some sequences span over an intron.
Lecture 10: transcript quantification and differential expression
When comparing expression levels of different genes by looking at mapped mRNA sequence reads on
these genes, take the total number of reads per sample and gene size into account.

Reads per kilobase of transcript per million mapped reads (RPKM) =

10^9 * # reads mapped to a region / (total # reads * region length in bases)

Transcripts per million transcripts (TPM) =

10^6 * (# reads mapped to a transcript / transcript length) * (1 / sum of all length-corrected counts)
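
Both normalizations as a small sketch, reading the garbled constants as 10^9 and 10^6. The dictionary-based input and the gene names, counts and lengths are hypothetical.

```python
def rpkm(read_counts, lengths):
    """10^9 * reads mapped to a region / (total reads * region length in bases)."""
    total_reads = sum(read_counts.values())
    return {gene: 1e9 * n / (total_reads * lengths[gene])
            for gene, n in read_counts.items()}

def tpm(read_counts, lengths):
    """10^6 * (reads / length) / sum of all length-corrected counts."""
    rates = {gene: n / lengths[gene] for gene, n in read_counts.items()}
    total_rate = sum(rates.values())
    return {gene: 1e6 * rate / total_rate for gene, rate in rates.items()}

counts = {"geneA": 100, "geneB": 300}     # mapped reads per gene
lengths = {"geneA": 1000, "geneB": 3000}  # transcript length in bases
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

In this example geneB has three times as many reads but is also three times as long, so both genes end up with the same normalized expression. TPM values always add up to one million per sample, which makes them easier to compare between samples.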

Replicate measurements to find variation.

Assume that there is no real difference (null hypothesis), determine the probability of finding the
measured difference in expression in the data by chance, if this probability is low, the null hypothesis
is likely to be wrong so there really is a difference.

Calculate t-value based on your measurements and standard deviation, look up the probability of
finding that t-value randomly using a distribution plot

For RNA-seq many genes are tested at once, so an uncorrected p-value gives many false positives. Instead
use a q-value, which estimates the fraction of false positives within the set of genes that is called
significant based on a certain threshold you choose yourself.

The relative difference in expression is measured as a fold change between two conditions. Usually
the log2 of the fold change is used. A log2 fold change of 1 means log2(situationA/situationB) = 1 so
situationA/situationB = 2^1 = 2 so expression in situationA is twice as high as in situationB.
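
A short sketch of the log2 fold change. The optional pseudocount, which avoids dividing by zero for genes without reads, is an addition that is not in the notes; the function name and the example values are hypothetical.

```python
import math

def log2_fold_change(expr_a, expr_b, pseudocount=0.0):
    """log2(situationA / situationB); the pseudocount is an optional assumption."""
    return math.log2((expr_a + pseudocount) / (expr_b + pseudocount))

print(log2_fold_change(200, 100))  # 1.0: twice as high in situation A
print(log2_fold_change(25, 100))   # -2.0: four times lower in situation A
```

The log scale makes up and down regulation symmetric: a doubling gives +1 and a halving gives -1.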

Enrichment analysis: look at list of up/down regulated genes and see whether a certain type of gene
is present in this list more than you would expect by chance.
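
The notes do not say which test was used in the lecture. One common choice for "more than you would expect by chance" is a hypergeometric test, sketched below under that assumption; the function name and the numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, genes_in_category, list_size, overlap):
    """Probability of finding at least `overlap` category genes in a random
    list of `list_size` genes drawn from `total_genes` genes."""
    upper = min(genes_in_category, list_size)
    favourable = sum(
        comb(genes_in_category, i) * comb(total_genes - genes_in_category, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, list_size)

# Hypothetical: 6000 genes, 100 in the category, 200 regulated genes, 12 of them in the category.
print(enrichment_p_value(6000, 100, 200, 12))
```

With these numbers about 3 category genes are expected by chance (200 * 100 / 6000), so finding 12 gives a very small probability.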

Lecture 11: topological signal sequences


Topological signals give information about the destination of a protein and thus help with annotating
the function of a protein.

Translation → endoplasmic reticulum → Golgi apparatus → secretion

Primary signals are signals with a sequential pattern.


Patched signals have a non-sequential pattern with different regions contributing to the signal patch.
Patched signals are much harder to detect with dedicated software than primary signals.

Co-translational translocation is the process where a polypeptide chain is translated while directly
being transported into the ER. The chain contains a topological signal on the N-terminal (translated
first) of the sequence which can get recognized while the chain is still being synthesized on the
ribosome. Recognition of the signal leads to transfer of the ribosome to a special receptor on the
membrane of the ER where translation continues. The signal peptide gets cleaved off after translation
is complete.
When the polypeptide contains a special hydrophobic part, this part can stay in the ER membrane
while the rest of the sequence is being translated on the outside of the ER resulting in a
transmembrane protein. The part of the protein facing the inside of the ER will eventually face the
outside of the cell.

When studying possible signal sequences for secretion, you number the AAs upstream of the
cleavage site (part of the topological signal) with negative numbers starting from the cleavage site.
The AAs downstream of the cleavage site (part of the eventual protein) get positive numbers.

Recognizing signal sequences with software like SignalP uses different scores. The C-score is an
estimate of the likelihood of a certain position in your input sequence to be the first amino acid of
the mature protein. The S-score is an estimate of the likelihood of a certain position in your input
sequence to be part of the signal peptide. The Y-score is the geometric average of the C-score and
the slope of the S-score, so it peaks where the S-score drops steeply at the end of the signal peptide sequence.
Therefore the Y-score gives the best estimate of where the signal peptide is cleaved.

Transmembrane domains usually consist of an alpha helix with a length of 18-19 AAs. The AAs just
outside of the membrane are usually charged and therefore hydrophilic to keep the domain in place.

Lecture 12: Multiple sequence alignments


MSA is a good tool to find common ancestry between proteins and important conserved AAs.

For MSA, all sequences must be related in a linear fashion (same order of domains and no domain
extensions). Meaningful alignments require homologous sequences as input. When you use MSA to
study possible homology, you have to repeat the alignment several times with different inputs based
on new insights.

The MSA tool ClustalW first looks for the most similar pair among all the input sequences.
Then the next most related sequence is added and so on.

After the MSA the results can be shown in a phylogenetic tree.


You can root a tree with an outgroup, which is a sequence distantly related to all the other sequences.
The length of the branches of a tree indicates how closely proteins are related.

Cladogram: no variation in distance between different sequences so no indication of relatedness.


Phenogram: includes similarity scores depicted as length of the different branches.

Bootstrapping is a statistical technique to determine the reliability of a tree. It randomly resamples
the data and looks whether the resampled data give the same result. If you repeat this many times
(1000) and you often or always get back the same results, you know that your tree is reliable.

Lecture 14: Amino acids I


General structure of an AA and the allowed rotations in the dipeptide plane (φ and ψ).

Secondary structure preferences:


Helix = AMELK
Strand = VITWYF
Turn = PSDNG (pisding without i)

Below are some important properties of certain AAs. Not all AAs are discussed. According to Jacques
Vervoort, who gave this lecture, it is important to know general properties (size, polarity,
hydrophobicity) of all AAs and some important properties of the special ones.

Alanine has a –CH3 as side group which makes it ‘a subset of all other amino acids’. If you want to
mutate a residue but you don’t have an idea what to replace it with, use alanine.
Cysteine has a very reactive S-H group which it can use to form bridges with other cysteines or bind
metals. The group can easily be oxidized.
Aspartic acid often occurs in active sites and is able to bind ions.
Glycine doesn’t have a side chain, making its backbone very flexible allowing it to make turns other
residues cannot make.
Histidine is often seen in active sites. It can easily accept a proton and is able to bind metal ions.
Lysine has a long and flexible side chain and is therefore often found at the surface of a protein.
Leucine is the most abundant AA. The body measures leucine levels before it synthesizes muscle
proteins.
Methionine also has a sulphur group which enables it to bind metal ions. However this sulphur group
is not very reactive unlike cysteine. Furthermore, methionine can function as methyl donor.
Proline does not have a backbone proton but has an extra covalent bond to the amino group giving it
a pre-bent structure making it extra suitable for turns. Turns tend to be at the surface of proteins
so even though proline is very hydrophobic, it is often found at the surface. Also, 20% of all helices
tend to start with a proline.
Arginine has a so-called guanidinium group in its side chain making it rigid and therefore it is more
often found on the inside of a protein.
Tryptophan is the most conserved of all residues due to its big size.

Lecture 15: Amino acids II


Hydrophobic AAs on the inside and hydrophilic AAs on the outside of the protein are preferred.

2&3 were derived experimentally and based on the properties of the single amino acids.
1&4 were derived theoretically, based on observing a dataset of known protein structures.

Alpha helices are formed through hydrogen bonds between the backbone C=O and the backbone
N-H, 4 AAs further in the sequence.
Helices often have a side with mostly hydrophilic side groups and a side with mostly hydrophobic side
groups.
Helices also have a dipole over the length of the helix. All peptides point in the same direction with
their C=O and N-H backbone groups. The N-H group is more positively charged so the N-terminus end
has a slight positive charge compared to the C-terminus end. At the beginning and end of the helix
you often find charged AAs to correct for this dipole.

A beta sheet consists of at least two beta strands that interact with each other.
Hydrogen bonding between backbone C=O and N-H groups of two or more strands results in a sheet.
R-groups extend perpendicular to the plane of H-bonds.
Just like for helices, one side of a strand is mostly hydrophilic and the other side is mostly
hydrophobic. Every other AA in a beta strand points in the same direction and should therefore have
a similar hydrophobicity.
Beta sheets exist in either parallel (two strands both go from N → C) or antiparallel (one strand goes
from N → C while the other goes from C → N) configuration.

Turns are well defined structures with limited length that connect secondary structure elements.

A loop is everything that has no defined structure. A loop is different from a turn.

A Chou Fasman parameter is determined by taking the percentage of a certain AA in a certain
conformation and dividing it by the overall percentage of AAs in that conformation.
For example: you have 50 alanines in your set, 25 of them in alpha helix conformation (50%). Your
total set consists of 1000 amino acids of which 350 are in alpha helix conformation (35%). Therefore
the helix preference parameter for alanine is 50/35 = 1.43.
If the preference parameter is higher than 1 it means your specific residue prefers this secondary
structure. If it is lower than 1 it dislikes this structure.
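
The alanine example above as a short calculation (the function and argument names are hypothetical):

```python
def preference_parameter(aa_in_conformation, aa_total, all_in_conformation, all_total):
    """Fraction of one AA in a conformation divided by the overall fraction
    of residues in that conformation."""
    return (aa_in_conformation / aa_total) / (all_in_conformation / all_total)

# 25 of 50 alanines are helical; 350 of 1000 residues overall are helical.
helix_preference = preference_parameter(25, 50, 350, 1000)
print(round(helix_preference, 2))  # 1.43, above 1, so alanine prefers the helix
```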

Lecture 13: 3D protein structures


Determining complete 3D structure based on protein sequence only is difficult.
Other methods to determine structure: X-ray crystallography, NMR, cryo-EM and comparative
modeling.

X-ray: make crystals of many proteins (that are the same), hit it with X-ray beam and study the
diffraction. This gives information of the electron density in certain region which can be fitted with
the sequence to find the structure. The quality of the results depends strongly on resolution.

NMR: measures interaction between atoms based on distance. Gets complex with larger molecules
like proteins. 2D NMR gives a better overview but still complex. Results are often not clear because
proteins are flexible so distances can vary. It does give information about which parts are flexible.

Cryo-EM: new emerging technology that allows you to study larger proteins and complexes.

The peptide bonds between C=O and N-H have double bond characteristics so do not rotate. The
bonds with the Cα are allowed to rotate (see picture in lecture: amino acids I).
Due to rotation, side groups can be in different orientations compared to each other. Some
orientations are energetically favorable, some aren’t due to steric interactions.

The Ramachandran plot has the rotational degrees of φ and ψ on each axis and gives a prediction
about which orientations of a dipeptide plane are energetically allowed and which are not.
Glycine has no side group so less steric hindrance and more freedom in the allowed orientations.

Lecture 16: Structural comparison


Protein Data Bank (PDB) is the single worldwide archive of experimentally determined protein 3D structures. All information is checked.

Proteins with a low sequence identity score (30%) can still form similar structures.

Vector Alignment Search Tool (VAST) can be used to compare structures. For your protein chain,
locate the secondary structure elements, represent them as individual vectors, align the vectors with
other proteins in the database. (loops are not studied as secondary structure elements because they
often contain the most mutations)
Lecture 17: Comparative protein structure modeling
Determining protein 3D structures is difficult and expensive. Instead you can make hypothetical
models based on sequence and already available information in database.

Ab initio modeling: making model based on predicted folding based on sequence.

Comparative modeling: making model based on other homologous structures in database (template).

Ab initio prediction | Comparative modeling
Applicable to any sequence | Only applicable to sequences that share recognizable similarity to a template structure
Not very accurate | Fairly accurate (comparable to low resolution X-ray)
Only for proteins of <100 residues | Not limited by size
Accuracy and applicability are limited by our understanding of how proteins fold. | Accuracy and applicability are limited by the number of known structures.

Software for making comparative models looks at distances between different AAs in the template
(spatial restraints) and tries to fold the query sequence so that these restraints are also present in
the model.

Typical errors in comparative modeling: incorrect template, misalignment, region without template,
sidechain packing, shifts in aligned regions.

Lecture 18: Quality check of protein model


PROSA is software to analyze your protein model.
Depending on the distance between atoms, the interaction energy can be favorable or unfavorable.
PROSA calculates the interaction energy between different parts of your protein and represents the
results in a plot. The interaction energy should be below zero for a favorable structure.
To smooth the results, an average of a certain amount of residues is taken and shown in the graph.

Also check whether the dipeptide planes of your model fit in the allowed regions of the Ramachandran plot.
