Summary Bioinformation Technology
Lecture 1: Introduction
Genome: the full set of chromosomes or genes in a gamete (contains only one set of dissimilar
chromosomes). A regular somatic cell contains two full sets of genomes.
Gene: a DNA-based unit that can exert its effects on the organism through RNA or protein products.
Reading frame: frame with sets of 3 nucleotide bases. Each set can encode for an AA.
Open reading frame: the part of a reading frame that has the potential to code for a protein or
peptide. It is a continuous stretch of codons from start codon till stop codon.
Coding sequence (CDS): the portion of a gene’s DNA/RNA that codes for a protein (composed of
exons).
Prokaryotes don’t have introns. Through an operon structure, a prokaryotic mRNA strand can encode
different proteins separated from each other with non-coding regions.
Eukaryotes have introns. One mRNA strand encodes one protein (alternative splicing is an exception).
The mRNA strand has a cap at the 5’ end and a polyA tail at the 3’ end.
Coverage: Average number of times any given base in the sequenced fragment is sequenced
Coverage = (number of reads * read length) / fragment size
A read of length L can start anywhere except the last L-1 positions (because otherwise reads will go
beyond last nucleotide)
The Poisson distribution can be used to approximate the binomial distribution when p (probability) is
very small and N (number of attempts) is very large.
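The coverage formula and the Poisson approximation can be tried out with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes that every read covers a given base with probability read length / fragment size (edge effects are ignored).

```python
import math

def coverage(num_reads, read_length, fragment_size):
    """Average number of times a base is sequenced."""
    return num_reads * read_length / fragment_size

def start_positions(read_length, fragment_size):
    """A read of length L cannot start in the last L-1 positions."""
    return fragment_size - (read_length - 1)

def prob_uncovered_binomial(num_reads, read_length, fragment_size):
    """Chance that none of the N reads covers a given base (binomial, k = 0)."""
    p = read_length / fragment_size  # assumed chance that one read covers the base
    return (1 - p) ** num_reads

def prob_uncovered_poisson(num_reads, read_length, fragment_size):
    """Same chance with the Poisson approximation: e^(-coverage)."""
    return math.exp(-coverage(num_reads, read_length, fragment_size))

if __name__ == "__main__":
    n, length, size = 1000, 100, 10000
    print("coverage:", coverage(n, length, size))
    print("binomial:", prob_uncovered_binomial(n, length, size))
    print("poisson: ", prob_uncovered_poisson(n, length, size))
```

With a small p and a large N the two probabilities are nearly the same, which is why the Poisson version is usually used for planning a sequencing run.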
Before you start genome sequencing it is important to determine the desired number of contigs/gaps
and calculate the needed coverage, read length and read number based on this threshold.
Take into account that the above formulas are purely statistical and do not take errors
(contaminations, false positives, false negatives, etc.) into account. In general a higher coverage than
suggested by the formulas needs to be used.
Scaffold: contigs in the right order and location containing possible gaps in between them
N50: a statistic that defines the quality of an assembly. It is defined by the length of the shortest
contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all
contigs.
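The N50 definition can be written out as a short Python function. This is a minimal sketch; the function name and the example contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the shortest contig in the set of longest contigs
    that together cover at least 50% of the total contig length."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig to the shortest one.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

if __name__ == "__main__":
    # Total length 290, so half is 145; 80 + 70 = 150 reaches it.
    print(n50([80, 70, 50, 40, 30, 20]))
```

In this example the contigs of 80 and 70 together pass half of the total length, so the N50 is 70.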
Fasta format: first line is header starting with “>”, the new line directly after and possible following
lines contain the sequence.
Fastq format: first line is header starting with “@”, the second line contains the sequence, the third
line starts with a “+” (optionally followed by the header again), the fourth line contains ASCII-encoded
quality values, one for each symbol in line 2.
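A Fastq record can be read with a few lines of Python. The sketch below assumes four lines per record (header starting with "@", sequence, "+" line, quality line) and the common Phred+33 quality encoding (quality = ASCII code - 33); older files may use a different offset. The function name is made up for illustration.

```python
def parse_fastq(text):
    """Return a list of (name, sequence, quality scores) tuples."""
    lines = [line for line in text.splitlines() if line.strip()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a valid Fastq record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality line differ in length")
        # Assumption: Phred+33 encoding.
        scores = [ord(symbol) - 33 for symbol in qual]
        records.append((header[1:], seq, scores))
    return records

if __name__ == "__main__":
    example = "@read1\nACGT\n+\nII!I\n"
    print(parse_fastq(example))
```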
Linkage: overlap between different fragments
Greedy assembly: merge pair of reads with most overlap and keep adding reads until you cannot
continue (only works well with small number of reads).
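The greedy idea can be sketched in Python as follows. This is a simplified illustration, not a real assembler: it ignores sequencing errors and reverse complements, compares all pairs in every round (so it is slow for many reads), and the minimum overlap of 3 is an arbitrary choice. Function names are made up.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the largest overlap until none is left."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing overlaps any more
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Reads that do not overlap with anything stay behind as separate contigs.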
OLC-assembly: find potentially Overlapping fragments. Find the order of the fragments and make a
Layout. Derive the final sequence from the layout through Consensus.
Obtaining different contigs is mainly caused by repeats, sequences that are repeated in different
parts of the sequenced fragment.
Lecture 3: Proteomics
Proteomics identifies proteins in biological systems. Proteins are very diverse because they can have
different locations in a cell, form complexes, have many different modifications, are dynamic in
turnover and have a great variety in activity.
The most important information is obtained from ESI-LC-MS/MS studies (ElectroSpray Ionization
Liquid Chromatography tandem Mass Spectrometry).
Traditionally 2D gels were used to separate different proteins by size and charge. This resulted in a
gel with many different spots unique for proteins. This spot pattern could be seen as fingerprint and
single spots could be cut out and analyzed by MS.
Nowadays 1D gels are sufficient followed by further analysis.
After cutting out a spot: protease digestion → make mass spectrum of fragments → match spectrum
with library spectra.
How to make library: take sequence from database → artificial digestion → artificial mass
spectrometry → library spectrum.
Trypsin is often used for digestion. It cleaves the peptide bond on the C-terminal side of arginine and
lysine (usually not when the next residue is a proline).
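The artificial digestion step can be sketched in Python. The rule used here (cut after K or R, but not when the next residue is P) is the rule commonly applied for trypsin in silico; missed cleavages are not modelled, and the function name and example sequence are made up.

```python
def trypsin_digest(protein):
    """Split a protein sequence (one-letter code) into tryptic peptides."""
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        # Cut after lysine (K) or arginine (R), unless proline (P) follows.
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides

if __name__ == "__main__":
    print(trypsin_digest("MKWVTFRPLLK"))
```

For a library, the mass of each peptide would then be calculated and stored as an artificial spectrum.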
Isotopes can result in several peaks for one fragment. For MS the mono-isotopic peak is used (mass
when all carbon atoms are C12, so smaller than average mass).
First choose an E-value threshold for hits with the database and then determine the false discovery
rate with the outputs that match the threshold. False discovery rate = hits with decoy database /
(total hits – hits in decoy database)
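As a small worked example, the false discovery rate formula can be put into a Python function. The function name and the numbers are made up for illustration.

```python
def false_discovery_rate(total_hits, decoy_hits):
    """FDR = hits with decoy database / (total hits - hits in decoy database)."""
    target_hits = total_hits - decoy_hits
    if target_hits <= 0:
        raise ValueError("there must be more total hits than decoy hits")
    return decoy_hits / target_hits

if __name__ == "__main__":
    # 1050 hits above the threshold, of which 50 are against the decoy database.
    print(false_discovery_rate(1050, 50))
```

With 50 decoy hits out of 1050 hits in total, about 5% of the remaining 1000 hits are expected to be false.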
Fragmentation of the protonated peptide can result in different types of ions (a, b, c, x, y, z) with
different mass.
b-ions have a tendency to degrade and lose their CO group, turning them into a-ions.
In a mass spectrum all the different ion peaks are seen together. However you don’t know which
type of ion belongs to a certain peak.
Hidden changes: When a mutated nucleotide mutates further into another nucleotide so you can’t
see that the previous nucleotide was already mutated.
Transition: point mutation between nucleotides that are both either purines (A and G) or pyrimidines
(C, T and U). This occurs relatively often.
Transversion: point mutation where a purine becomes a pyrimidine or the other way around.
The Jukes-Cantor model takes the possibility of hidden changes into account and calculates the actual
dissimilarity based on the observed dissimilarity (assuming all changes are equally likely to occur).
The Kimura model also takes the difference in prevalence of transitions and transversions into account.
Neither model takes functional constraints into account.
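Both corrections can be calculated with the standard textbook formulas, which are not given in these notes: Jukes-Cantor d = -3/4 * ln(1 - 4/3 * p) and Kimura (two-parameter) d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where p is the observed fraction of differing sites, P the fraction of transitions and Q the fraction of transversions. The Python sketch below uses made-up function names.

```python
import math

def jukes_cantor_distance(p):
    """Corrected distance from the observed fraction of differing sites p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_distance(transitions, transversions):
    """Corrected distance from the observed fractions of transitions (P)
    and transversions (Q)."""
    a = 1 - 2 * transitions - transversions
    b = 1 - 2 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("sequences are too divergent for this model")
    return -0.5 * math.log(a * math.sqrt(b))

if __name__ == "__main__":
    print(jukes_cantor_distance(0.3))   # larger than the observed 0.3
    print(kimura_distance(0.2, 0.1))    # same 0.3 observed, mostly transitions
```

The corrected distance is always larger than the observed dissimilarity, because hidden changes are added back.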
Proteins are even more complicated because different triplets can encode for same AA so not all
mutations are visible in AA sequence (= synonymous mutation).
PAM matrix is based on observed similarity. Derived from global alignment of similar sequences and
observed changes. Score is based on whether a certain change is observed often or not.
PAM matrices have numbers which refer to evolutionary distance: higher number → larger distance.
1 PAM: you expect 1 mutation per 100 AA.
BLOSUM matrix is based on local alignment of distantly related sequences (domains with certain
threshold similarity are clustered and analyzed).
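BLOSUM matrices have numbers which refer to the threshold similarity value used for clustering.
BLOSUM62: sequences that are at least 62% identical are clustered and counted as one sequence when
the matrix is built, so a higher number corresponds to a smaller evolutionary distance.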
Overall BLOSUM matrix is more effective (therefore used in many alignment programs like BLAST).
Specific substitution matrices can perform better if you are studying specific protein families.
To study possible homology between two sequences, score the sequences based on a model that
assumes that the proteins are related (PAM or BLOSUM) and a model that assumes that the proteins
are unrelated. The unrelated model always gives a higher score except in the case of homology +
optimal alignment.
For the unrelated model, take into account that different AA are present in different frequencies
Determining the observed frequency (qij) of a substitution from the matrix score and the expected
frequency (eij):
Sij = 0.5 * score in matrix for substitution from i to j (i can be the same as j)
Sij = log2(qij / eij), so qij / eij = 2^Sij = 2^(0.5 * score in matrix) = how much more likely the
aligned pair is under the related model than by chance (odds ratio)
For an alignment longer than one position, the total score S is the sum of the individual Sij's, so
the odds ratio of the whole alignment is 2^(0.5 * sum of scores from matrix)
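The conversion from matrix scores to an odds ratio can be sketched in Python. The sketch assumes that the matrix scores are in half-bit units (hence the factor 0.5), as is the case for BLOSUM62; the function names are made up, and the example scores (W-W = 11, A-A = 4, L-I = 2) are BLOSUM62 values.

```python
def odds_ratio(matrix_scores):
    """Odds (related vs. random) for an ungapped alignment,
    given the matrix score of every aligned pair."""
    bits = 0.5 * sum(matrix_scores)  # assumption: scores are in half-bits
    return 2 ** bits

def observed_frequency(matrix_score, expected_frequency):
    """qij = eij * 2^(0.5 * matrix score)."""
    return expected_frequency * odds_ratio([matrix_score])

if __name__ == "__main__":
    print(odds_ratio([4]))          # a single A-A pair
    print(odds_ratio([11, 4, 2]))   # W-W, A-A and L-I aligned
```

A single A-A pair with score 4 is 2^2 = 4 times more likely in related sequences than expected by chance.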
E-value: the number of alignments with scores greater than or equal to a given value expected to
occur in a search against a database of known size, based on chance.
P-value: number between 0 and 1 representing the chance of finding at least one alignment with such a
score by chance (P = 1 - e^(-E)).
When E < 0.01, P-values are nearly identical to E-values.
An E-value of 10 for a match means that in this particular database, you can expect to see 10
matches with a similar or better score simply by chance alone.
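The relation between E-value and P-value used by BLAST is P = 1 - e^(-E), which explains why the two are nearly identical for small E. A small Python sketch (the function name is made up):

```python
import math

def p_value_from_e_value(e_value):
    """Chance of at least one random hit with this score or better."""
    # -expm1(-E) equals 1 - e^(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)

if __name__ == "__main__":
    for e in (0.001, 0.01, 1, 10):
        print(e, p_value_from_e_value(e))
```

For E = 0.01 the P-value is about 0.00995, while for E = 10 the P-value is practically 1.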
BLAST tool makes local alignment. It starts at some point in the middle and builds on from there.
Alignment stops at the end of a region with strong similarity.
Global alignment (for example the Needleman-Wunsch algorithm) stretches the alignment over the entire
sequence lengths (causes gaps). Note that the FASTA search tool, like BLAST, is a heuristic that reports
local alignments.
Local alignment is preferred for database searches, global alignment is preferred for aligning
homologs that are known to share similarity over their whole length.
BLAST uses an AA matrix (usually BLOSUM) for scoring, except for BLASTn where you compare nucleotide
with nucleotide (match/mismatch scoring).
Multiple sequence alignments of sequences containing a certain domain can result in a pattern of
important AAs in this domain.
A profile is a table of position specific scores and gap penalties based on a multiple sequence
alignment to identify possible patterns for a domain in a sequence.
The Hidden Markov model (HMM) is a mathematical model that can be used for pattern recognition.
PSI BLAST can be used to make your own position specific scoring matrix. First do a normal BLASTp
and take the results above a certain threshold. Base your matrix on a multiple sequence alignment of
these sequences. Further search the database with this matrix for new matches and update your set
of sequences used to make your matrix based on new results.
Lecture 8: Best bidirectional hits and FAIR
Provenance of data: what is the background of data. How is the data obtained and how reliable is it.
FAIR data:
Findable: keeping track of data provenance. Data are easy to find again for both humans and computer systems
Accessible: readable for both humans and computers
Interoperable: annotation according to standards, procedures, protocols
Reusable: directly usable by computational methods
Gene Ontology (GO) terms: defined terms representing gene product properties
- Cellular component (part of the cell or extracellular environment where product is present)
- Molecular function (the activities of a product on molecular level like binding, catalysis, etc.)
- Biological process (the process the activity of the product contributes to)
The best hit of a given gene used as query in BLAST, is the gene from the target genome that gives
the best match. If you BLAST the gene that was the best hit against the genome of the initial query
gene and this gene comes up as best hit, you have a bidirectional hit.
Bidirectional hits represent a very strong similarity and indicate a possibility for orthology.
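Finding bidirectional hits from two BLAST runs can be sketched in Python. The sketch assumes that the best hit per query has already been extracted from the BLAST output into two dictionaries; the function name and gene names are made up.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (gene in A, gene in B) that are each other's best hit.

    best_a_to_b: best hit in genome B for every query gene of genome A
    best_b_to_a: best hit in genome A for every query gene of genome B
    """
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_a_to_b.items()
        if best_b_to_a.get(gene_b) == gene_a
    )

if __name__ == "__main__":
    a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
    b_to_a = {"b1": "a1", "b2": "a3"}
    print(bidirectional_best_hits(a_to_b, b_to_a))
```

Here a2 has b2 as best hit, but b2 points back to a3, so only a1-b1 and a3-b2 are bidirectional hits.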
For most biological questions, protein levels and activity are most informative. mRNA is easier to
measure than proteins and still gives an indication. mRNA levels do not always correspond to protein
levels. Protein levels do not always correspond to protein activity.
mRNA measurements are useful when you compare two different conditions (yeast growth in
aerobic and anaerobic conditions) or when you measure differences over time (cell cycle).
mRNA levels: differ between genes, differ between isoforms, differ between tissues, differ between
developmental stages, vary with cell cycle, vary during the day (circadian rhythm), differ between
individual cells, depend on the environment, are the result of mRNA synthesis and mRNA decay
(speed varies between different mRNA’s).
Upsides microarrays: Highly standardized, relatively cheap, easy to handle due to small data size
Downsides microarrays: Gene sequence should be known, no detailed position specific information,
not very quantitative.
Microarrays get used less and less because of a new better technique, mRNA sequencing.
RNA-seq workflow: isolate all RNA from sample, filter mRNA by fishing for the poly-A tail, make
double-stranded cDNA from mRNA, fragment and sequence cDNA, map to reference genome or assemble de
novo, count reads to quantify.
Upsides RNA-seq: still works without known reference genome, can identify alternatively spliced
transcripts, can identify single nucleotide polymorphisms (SNPs) in transcripts, quantitative.
Downsides RNA-seq: relatively expensive
Output of RNA-seq is many files of short reads so you have to do a de novo assembly or map the
reads on a reference sequence.
Algorithms can be used for mapping. Take into account that some sequences span over an intron.
Lecture 10: transcript quantification and differential expression
When comparing expression levels of different genes by looking at mapped mRNA sequence reads on
these genes, take the total number of reads per sample and gene size into account.
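One common way to correct for both factors is RPKM (reads per kilobase of transcript per million mapped reads). It is used here as an example of such a normalization and is not named in these notes; the function name and numbers are made up.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads in the sample."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)

if __name__ == "__main__":
    # Same number of reads, but gene B is twice as long as gene A.
    print(rpkm(1000, 2000, 10_000_000))  # gene A
    print(rpkm(1000, 4000, 10_000_000))  # gene B
```

Both genes have 1000 reads, but after normalization the shorter gene has twice the expression value of the longer one.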
Assume that there is no real difference (null hypothesis), determine the probability of finding the
measured difference in expression in the data by chance, if this probability is low, the null hypothesis
is likely to be wrong so there really is a difference.
Calculate t-value based on your measurements and standard deviation, look up the probability of
finding that t-value randomly using a distribution plot
For RNA-seq you test many genes at the same time, so using a raw p-value is not right (multiple testing).
Instead use a q-value, which gives the expected fraction of false positives within the set of genes that
is called significant at a threshold you choose yourself.
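One widely used way to get such corrected values is the Benjamini-Hochberg procedure, whose adjusted p-values are often reported as q-values. It is shown here as an assumption about what is meant, not as the method from the lecture. A minimal Python sketch with a made-up function name:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

Genes with an adjusted value below the chosen threshold (for example 0.05) are then called differentially expressed.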
The relative difference in expression is measured as a fold change between two conditions. Usually
the log2 of the fold change is used. A log2 fold change of 1 means log2(situationA/situationB) = 1, so
situationA/situationB = 2^1 = 2, so expression in situationA is twice as high as in situationB.
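The log2 fold change is easy to calculate in Python. The function name and expression values below are made up for illustration.

```python
import math

def log2_fold_change(expression_a, expression_b):
    """log2 of the expression ratio between situation A and situation B."""
    return math.log2(expression_a / expression_b)

if __name__ == "__main__":
    print(log2_fold_change(200, 100))  # twice as high in A
    print(log2_fold_change(50, 100))   # half as high in A
```

An advantage of the log2 scale is that it is symmetric: a doubling gives +1 and a halving gives -1.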
Enrichment analysis: look at list of up/down regulated genes and see whether a certain type of gene
is present in this list more than you would expect by chance.
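Whether a gene type occurs more often than expected by chance is often tested with the hypergeometric distribution (equivalent to a one-sided Fisher exact test). That choice of test is an assumption here, since the notes do not name one. A Python sketch with made-up names and numbers:

```python
from math import comb

def enrichment_p_value(population, annotated, selected, overlap):
    """Chance of finding at least `overlap` annotated genes in a random
    list of `selected` genes.

    population: total number of genes measured
    annotated:  genes in the population that have the annotation
    selected:   genes in the up/down regulated list
    overlap:    genes in the list that have the annotation
    """
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

if __name__ == "__main__":
    # 10 genes, 3 annotated; all 3 genes in the list are annotated.
    print(enrichment_p_value(10, 3, 3, 3))
```

In this small example the chance is 1/120, so finding all three annotated genes in the list is unlikely to be a coincidence.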
Co-translational translocation is the process where a polypeptide chain is translated while directly
being transported into the ER. The chain contains a topological signal at the N-terminus (translated
first) of the sequence which can get recognized while the chain is still being synthesized on the
ribosome. Recognition of the signal leads to transfer of the ribosome to a special receptor on the
membrane of the ER where translation continues. The signal peptide gets cleaved off after translation
is complete.
When the polypeptide contains a special hydrophobic part, this part can stay in the ER membrane
while the rest of the sequence is being translated on the outside of the ER resulting in a
transmembrane protein. The part of the protein facing the inside of the ER will eventually face the
outside of the cell.
When studying possible signal sequences for secretion, you number the AAs upstream of the
cleavage site (part of the topological signal) with negative numbers starting from the cleavage site.
The AAs downstream of the cleavage site (part of the eventual protein) get positive numbers.
Recognizing signal sequences with software like SignalP uses different scores. The C-score is an
estimate of the likelihood of a certain position in your input sequence to be the first amino acid of
the mature protein. The S-score is an estimate of the likelihood of a certain position in your input
sequence to be part of the signal peptide. The Y-score is the geometric average of the C-score and
the slope of the S-score, so it peaks at the location where the C-score is high and the S-score drops
steeply, indicating the end of the signal peptide sequence.
Therefore the Y-score gives the best estimate of where the signal peptide is cleaved.
Transmembrane domains usually consist of an alpha helix with a length of 18-19 AAs. The AAs just
outside of the membrane are usually charged and therefore hydrophilic to keep the domain in place.
For MSA, all sequences must be related in a linear fashion (same order of domains and no domain
extensions). Meaningful alignments require homologous sequences as input. When you use MSA to
study possible homology, you have to repeat the alignment several times with different inputs based
on new insights.
The MSA tool ClustalW first looks for the most similar pair among all the input sequences.
Then the next most related sequence is added and so on.
Below are some important properties of certain AAs. Not all AAs are discussed. According to Jacques
Vervoort, who gave this lecture, it is important to know general properties (size, polarity,
hydrophobicity) of all AAs and some important properties of the special ones.
Alanine has a –CH3 as side group which makes it ‘a subset of all other amino acids’. If you want to
mutate a residue but you don’t have an idea what to replace it with, use alanine.
Cysteine has a very reactive S-H group which it can use to form bridges with other cysteines or bind
metals. The group can easily be oxidized.
Aspartic acid often occurs in active sites and is able to bind ions.
Glycine doesn’t have a side chain, making its backbone very flexible allowing it to make turns other
residues cannot make.
Histidine is often seen in active sites. It can easily accept a proton and is able to bind metal ions.
Lysine has a long and flexible side chain and is therefore often found at the surface of a protein.
Leucine is the most abundant AA. The body measures leucine levels before it synthesizes muscle
proteins.
Methionine also has a sulphur group which enables it to bind metal ions. However this sulphur group
is not very reactive unlike cysteine. Furthermore, methionine can function as methyl donor.
Proline does not have a backbone proton but has an extra covalent bond to the amino group giving it
a pre-bent structure making it extra suitable for turns. Turns tend to be at the surface of proteins
so even though proline is very hydrophobic, it is often found at the surface. Also, 20% of all helices
tend to start with a proline.
Arginine has a so-called guanidinium group in its side chain making it rigid and therefore it is more
often found on the inside of a protein.
Tryptophan is the most conserved of all residues due to its big size.
2&3 were derived experimentally and based on the properties of the single amino acids.
1&4 were derived theoretically, based on observing a dataset of known protein structures.
Alpha helices are formed through hydrogen bonds between the backbone C=O and the backbone
N-H, 4 AAs further in the sequence.
Helices often have a side with mostly hydrophilic side groups and a side with mostly hydrophobic side
groups.
Helices also have a dipole over the length of the helix. All peptide bonds point in the same direction
with their C=O and N-H backbone groups. The N-H group is more positively charged so the N-terminus end
has a slight positive charge compared to the C-terminus end. At the beginning and end of the helix
you often find charged AAs to correct for this dipole.
A beta sheet consists of at least two beta strands that interact with each other.
Hydrogen bonding between backbone C=O and N-H groups of two or more strands results in a sheet.
R-groups extend perpendicular to the plane of H-bonds.
Just like for helices, one side of a strand is mostly hydrophilic and the other side is mostly
hydrophobic. Every other AA in a beta strand points in the same direction and should therefore have
a similar hydrophobicity.
Beta sheets exist in either parallel (two strands both go from N → C) or antiparallel (one strand goes
from N → C while the other goes from C → N) configuration.
Turns are well defined structures with limited length that connect secondary structure elements.
A loop is everything that has no defined structure. A loop is different from a turn.
X-ray: make crystals of many proteins (that are the same), hit them with an X-ray beam and study the
diffraction. This gives information on the electron density in a certain region which can be fitted with
the sequence to find the structure. The quality of the results depends strongly on resolution.
NMR: measures interaction between atoms based on distance. Gets complex with larger molecules
like proteins. 2D NMR gives a better overview but is still complex. Results are often not clear because
proteins are flexible so distances can vary. It does give information about which parts are flexible.
Cryo-EM: new emerging technology that allows you to study larger proteins and complexes.
The peptide bonds between C=O and N-H have double bond characteristics so do not rotate. The
bonds with the Cα are allowed to rotate (see picture in lecture: amino acids I).
Due to rotation, side groups can be in different orientations compared to each other. Some
orientations are energetically favorable, some aren’t due to steric interactions.
The Ramachandran plot has the rotational degrees of φ and ψ on each axis and gives a prediction
about which orientations of a dipeptide plane are energetically allowed and which are not.
Glycine has no side group so less steric hindrance and more freedom in the allowed orientations.
Proteins with a low sequence identity score (30%) can still form similar structures.
Vector Alignment Search Tool (VAST) can be used to compare structures. For your protein chain,
locate the secondary structure elements, represent them as individual vectors, align the vectors with
other proteins in the database. (loops are not studied as secondary structure elements because they
often contain the most mutations)
Lecture 17: Comparative protein structure modeling
Determining protein 3D structures is difficult and expensive. Instead you can make hypothetical
models based on sequence and already available information in database.
Software for making comparative models looks at distances between different AAs in the template
(spatial restraints) and tries to fold the query sequence so that these restraints are also present in
the model.
Typical errors in comparative modeling: incorrect template, misalignment, region without template,
sidechain packing, shifts in aligned regions.
Also check whether the dipeptide planes of your model fit in the allowed regions of the Ramachandran plot.