Summary Bioinformation Technology

The document provides an overview of key concepts in bioinformatics and genomics including: 1) Genomes contain chromosomes or genes, genes contain coding sequences that encode proteins, and reading frames contain sets of 3 nucleotides that can encode amino acids. 2) Sequencing a whole genome requires determining coverage, or the number of times each base is sequenced, to minimize gaps between non-sequenced areas. 3) Sequence assembly involves merging reads based on overlap to form contigs, then scaffolding to order contigs and fill gaps, with the N50 statistic measuring assembly quality.

Summary Bioinformation Technology period 1 2018

Lecture 1: Introduction
Genome: the full set of chromosomes or genes in a gamete (contains only one set of dissimilar
chromosomes). A regular somatic cell contains two full sets of genomes.
Gene: a DNA-based unit that can exert its effects on the organism through RNA or protein products.
Reading frame: a way of dividing a nucleotide sequence into consecutive sets of 3 bases (codons). Each codon can encode an AA (amino acid).
Open reading frame: the part of a reading frame that has the potential to code for a protein or
peptide. It is a continuous stretch of codons from start codon till stop codon.
Coding sequence (CDS): the portion of a gene’s DNA/RNA that codes for a protein (composed of
exons).

Translation from mRNA to protein goes from 5’ to 3’

An AA chain is read from N-terminal to C-terminal

Prokaryotes don’t have introns. Through an operon structure, a prokaryotic mRNA strand can encode
different proteins separated from each other with non-coding regions.
Eukaryotes have introns. One mRNA strand encodes one protein (alternative splicing is an exception).
The mRNA strand has a cap at the 5’ end and a polyA tail at the 3’ end.

Lecture 2a: Sequence coverage

Sequencing a whole genome at once is impossible because:
- Limited length per read
- Limited output per run
- Complexity in genome architecture

Coverage: Average number of times any given base in the sequenced fragment is sequenced
Coverage = (number of reads * read length) / fragment size
A read of length L can start anywhere except the last L-1 positions (because otherwise reads will go
beyond last nucleotide)

The Poisson distribution can be used to approximate the binomial distribution when p (probability) is very
small and N (number of attempts) is very large.

Gap: non sequenced area

Probability of gap (p, same as fraction of nucleotides in gaps) = e^-coverage → coverage = -ln(p)
Number of contigs = N * p = N * e^-coverage with N = number of reads
According to above formula, it seems like more reads results in more different contigs so more gaps,
which doesn’t seem logical. However more reads increases the coverage and thus decreases the
probability of a gap, so it lowers p.
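
The coverage and gap formulas above can be tried out with a small sketch. The function names and the example numbers (50,000 reads of 500 bases on a 5 Mb fragment) are made up for illustration and do not come from the lecture.

```python
import math

def coverage(num_reads, read_length, fragment_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / fragment_size

def gap_probability(cov):
    """Probability that a given base is not sequenced at all: e^-coverage."""
    return math.exp(-cov)

def expected_contigs(num_reads, cov):
    """Expected number of contigs: N * e^-coverage."""
    return num_reads * math.exp(-cov)

def coverage_for_gap_fraction(p):
    """Coverage needed so that a fraction p of the bases stays in gaps: -ln(p)."""
    return -math.log(p)

# Hypothetical example values, chosen only for illustration.
c = coverage(50_000, 500, 5_000_000)
print("coverage:", c)
print("gap probability:", gap_probability(c))
print("expected contigs:", expected_contigs(50_000, c))
print("coverage for 0.1% gaps:", coverage_for_gap_fraction(0.001))
```

Doubling the number of reads in this sketch doubles N but also doubles the coverage, so the expected number of contigs still goes down, as explained above.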

Before you start genome sequencing it is important to determine the desired number of contigs/gaps
and calculate the needed coverage, read length and read number based on this threshold.
Take into account that the above formulas are purely statistical and do not take errors
(contaminations, false positives, false negatives, etc.) into account. In general a higher coverage than
suggested by the formulas needs to be used.

Lecture 2b: Sequence assembly

Analysis of genomics data: sequencing → quality control → assembly → scaffolding → structural
annotation → functional annotation
When you have a known sequence you can use it as template for mapping your assembly, otherwise
you have to make a de novo (new) assembly.

Scaffold: contigs in the right order and location containing possible gaps in between them
N50: a statistic that defines the quality of an assembly. It is defined by the length of the shortest
contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all
contigs.
Fasta format: the first line is a header starting with “>”; the line directly after and any following
lines contain the sequence.
Fastq format: the first line is a header starting with “@”, the second line contains the sequence, the third
line starts with a “+” and is usually otherwise empty, and the fourth line contains an ASCII-encoded quality
value for each symbol in line 2.
Linkage: overlap between different fragments
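
The N50 definition above can be sketched as follows. The function name and the contig lengths in the example are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the shortest contig in the set of longest contigs that
    together cover at least 50% of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths: total is 290, so half is 145.
# 80 + 70 = 150 reaches the halfway point, which makes the N50 equal to 70.
print(n50([80, 70, 50, 40, 30, 20]))
```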

Greedy assembly: merge pair of reads with most overlap and keep adding reads until you cannot
continue (only works well with small number of reads).
OLC-assembly: find potentially Overlapping fragments. Find the order of the fragments and make a
Layout. Derive the final sequence from the layout through Consensus.
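
A minimal sketch of the greedy idea described above, assuming error-free reads from one strand only. It ignores reverse complements, sequencing errors and reads contained in other reads, and the quadratic search over all pairs shows why the approach only suits a small number of reads. The names `overlap`, `greedy_assemble` and `min_len` are chosen for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: the remaining reads are separate contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

# Hypothetical reads that overlap by four bases each.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```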

Obtaining different contigs is mainly caused by repeats, sequences that are repeated in different
parts of the sequenced fragment.

Lecture 3: Proteomics
Proteomics identifies proteins in biological systems. Proteins are very diverse because they can have
different locations in a cell, form complexes, have many different modifications, are dynamic in turn-
over and have a great variety in activity.

Genomic information is essential in large scale proteomics studies.

You can study many proteins in one sample.

The most important information is obtained from ESI-LC-MS/MS studies (electrospray ionization,
liquid chromatography, tandem mass spectrometry).

Traditionally 2D gels were used to separate different proteins by size and charge. This resulted in a
gel with many different spots unique for proteins. This spot pattern could be seen as fingerprint and
single spots could be cut out and analyzed by MS.
Nowadays 1D gels are sufficient followed by further analysis.

After cutting out a spot: protease digestion → make mass spectrum of fragments → match spectrum
with library spectra.
How to make library: Take sequence from database → artificial digestion → artificial mass
spectrometry → library spectrum

Trypsin is often used for digestion. It cleaves specifically after (on the C-terminal side of) arginine and lysine.
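
The artificial digestion step can be sketched as below. The function name and the example sequence are made up. The option to skip cleavage when the next residue is proline is a convention used by many digestion tools; it is an assumption added here and is not part of the lecture notes.

```python
def trypsin_digest(protein, skip_before_proline=True):
    """Split a protein sequence after every K or R.

    skip_before_proline: assumption, not from the notes; when True, a K or R
    followed by P is not treated as a cleavage site.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_site = aa in "KR"
        followed_by_p = i + 1 < len(protein) and protein[i + 1] == "P"
        if is_site and not (skip_before_proline and followed_by_p):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R at its end
    return peptides

# Hypothetical sequence, for illustration only.
print(trypsin_digest("AKRPGKLM"))
print(trypsin_digest("AKRPGKLM", skip_before_proline=False))
```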

Isotopes can result in several peaks for one fragment. For MS the mono-isotopic peak is used (the mass
when every atom is its lightest isotope, e.g. all carbon atoms are C12, so smaller than the average mass).

Several peptide fragments need to be used in order for a proper identification.

(generally understand picture)
Both positive and negative ion modes are possible depending on functional groups (-OH, -NH3,
-COOH, etc.) in the sample.

MS spectra have m/z on the x-axis → mass/charge obtained during ionization.

Actual mass = (peak value * charge) – (charge * mass of one proton), because every positive charge comes from one added proton
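
A small sketch of this conversion for positive ion mode, assuming every charge comes from one added proton. The function names and the example peak are made up for illustration.

```python
PROTON_MASS = 1.007276  # mass of one proton in Da

def neutral_mass(mz, charge):
    """Actual (neutral) mass from an observed m/z peak and its charge state."""
    return mz * charge - charge * PROTON_MASS

def mz_value(mass, charge):
    """Expected m/z peak for a given neutral mass and charge state."""
    return (mass + charge * PROTON_MASS) / charge

# Hypothetical peak at m/z 501.0 carrying two protons.
print(neutral_mass(501.0, 2))
```

The same peptide shows up at a different m/z for every charge state, which is why the charge has to be known before the mass can be compared with a library spectrum.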

When performing an artificial digestion and spectrometry:

- You can choose from many optional post-translational modifications (phosphorylation,
methylation, O18 labeling, modification of reactive –SH group of cysteine, etc.).
- You have to make a decoy database when making a target database by reversing your AA
sequence and performing the same steps (modification, digestion, etc.) on this sample.
- Hits in the decoy database are used to determine the rate of false positives in the real
database.

First choose an E-value threshold for hits with the database and then determine the false discovery
rate with the outputs that match the threshold. False discovery rate = hits with decoy database /
(total hits – hits in decoy database)
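
The false discovery rate formula above, written out as a sketch. The function name and the hit counts are made up for illustration.

```python
def false_discovery_rate(total_hits, decoy_hits):
    """Decoy hits divided by the remaining (target) hits above the chosen threshold."""
    target_hits = total_hits - decoy_hits
    if target_hits <= 0:
        raise ValueError("no target hits above the threshold")
    return decoy_hits / target_hits

# Hypothetical search result: 1020 hits pass the E-value threshold, 20 of them in the decoy database.
print(false_discovery_rate(1020, 20))
```
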
Fragmentation of the protonated peptides can result in different types of fragment ions (a, b, c, x, y, z) with different masses.
B-ions have a tendency to degrade and lose their CO-group making them into A-ions.
In a mass spectrum all the different ion peaks are seen together. However you don’t know which
type of ion belongs to a certain peak.

Lecture 4A: Substitution patterns

Mutation rate = number of visible substitutions / (2 * time)

Hidden changes: When a mutated nucleotide mutates further into another nucleotide so you can’t
see that the previous nucleotide was already mutated.
Transition: point mutation between nucleotides that are both either purines (A and G) or pyrimidines
(C, T and U). This occurs relatively often.
Transversion: point mutation where a purine becomes a pyrimidine or the other way around.

Jukes-Cantor model takes the possibility of hidden changes into account and calculates the actual
dissimilarity based on the observed dissimilarity (assuming all changes are equally likely to occur).
The Kimura model also takes the difference in prevalence of transitions and transversions into account.
Neither model takes functional constraints into account.
Proteins are even more complicated because different triplets can encode the same AA, so not all
mutations are visible in the AA sequence (= synonymous mutations).
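
The notes do not give the formulas of the two models. The sketch below uses the standard textbook forms, which is an assumption about what the lecture used: Jukes-Cantor d = -3/4 * ln(1 - 4/3 * p) with p the observed fraction of differing sites, and Kimura two-parameter d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)) with P and Q the observed fractions of transitions and transversions.

```python
import math

def jukes_cantor(p):
    """Estimated substitutions per site from the observed fraction of differing sites p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(transitions, transversions):
    """Estimated substitutions per site from observed fractions of transitions (P) and transversions (Q)."""
    P, Q = transitions, transversions
    a, b = 1 - 2 * P - Q, 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for this model")
    return -0.5 * math.log(a * math.sqrt(b))

# Hypothetical observation: 30% of the sites differ (20% transitions, 10% transversions).
print(jukes_cantor(0.3))
print(kimura_2p(0.2, 0.1))
```

Both estimates come out higher than the observed 0,3 because they add the hidden changes back in.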

Lecture 4B: Matrices

Alignment is an important tool to learn about possible functions of a protein based on homology with
well characterized molecules and to help find conserved domains.
In protein alignments some AAs are more similar to each other than others, so a scoring matrix is used.

PAM matrix is based on observed substitutions. It is derived from global alignments of closely related
sequences. The score is based on whether a certain change is observed often or not.
PAM matrices have numbers which refer to evolutionary distance, higher number → larger distance.
1 PAM → you expect 1 mutation per 100 AA
BLOSUM matrix is based on local alignment of distantly related sequences (domains with certain
threshold similarity are clustered and analyzed).
BLOSUM matrices have numbers which refer to the threshold similarity value, higher number → smaller distance.
BLOSUM62 → sequences that are at least 62% identical are clustered and counted as one sequence when building the matrix.

Overall BLOSUM matrix is more effective (therefore used in many alignment programs like BLAST).
Specific substitution matrices can perform better if you are studying specific protein families.

(Figure: choice of matrix for close relatives versus distant relatives.)

To study possible homology between two sequences, score the sequences based on a model that
assumes that the proteins are related (PAM or BLOSUM) and a model that assumes that the proteins
are unrelated. The unrelated model always gives a higher score except in the case of homology +
optimal alignment.
For the unrelated model, take into account that different AA are present in different frequencies

Unrelated model (expected frequency, e):

Chance of an AA aligning with itself = eii = (frequency of AA i)^2
Chance of an AA aligning with another AA = eij = 2 * frequency of AA i * frequency of AA j
For an alignment of several positions, the total is the product e1 * e2 * e3 * etc.

Determining observed frequency (qij) of substitution based on matrix score and expected frequency (eij):
Sij = 0,5 * score in matrix for substitution from i to j (i can be the same as j)
Sij = log2(qij / eij) → qij / eij = 2^Sij = 2^(0,5 * score from matrix) = how much more likely the aligned
pair is under the related model than under the random model
For sequences longer than 1, S becomes the sum of the individual Sij’s, so the ratio for the whole
alignment is 2^(0,5 * sum of scores from matrix)
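
A sketch of these relations, assuming the matrix scores are in half-bit units (which is what the factor 0,5 above implies). The function names, frequencies and scores in the example are made up for illustration.

```python
import math

def expected_pair_frequency(freq_i, freq_j, same):
    """Expected frequency of an aligned pair under the unrelated model."""
    return freq_i * freq_j if same else 2 * freq_i * freq_j

def odds_ratio(matrix_scores):
    """How much more likely an alignment is under the related model than by chance."""
    s_bits = 0.5 * sum(matrix_scores)  # half-bit matrix scores converted to bits
    return 2 ** s_bits

def matrix_score(q_ij, e_ij):
    """Half-bit matrix score from observed (q) and expected (e) pair frequencies."""
    return round(2 * math.log2(q_ij / e_ij))

# Hypothetical alignment of three positions with matrix scores 11, 2 and 4.
print(odds_ratio([11, 2, 4]))
# Hypothetical pair seen four times more often than expected by chance.
print(matrix_score(0.04, 0.01))
```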

E-value: The number of alignments with scores greater than or equal to a given value expected to
occur in a search against a database of known size, based on chance.
P-value: number between 0 and 1 representing the chance of finding at least one alignment with such a score by chance
When E<0,01, P-values are nearly identical to E-values

An E-value of 10 for a match means that in this particular database, you can expect to see about 10
matches with a similar or better score simply by chance.

To study raw scores (S):

S = sum of score from similarity matrix – (number of gaps * penaltyA) – (length of gaps * penaltyB)
Raw score depends on kind of matrix and penalty values so has to be converted into bit score (S’)
S’ = (parameterA * S – ln(parameterB))/ln(2)
Parameter A and B depend on search conditions (type of matrix, penalty values, search space, etc.)
Bit scores allow you to compare scores obtained through using different matrices, etc.
E-value can be calculated from bitscore, parameterA, parameterB and database size
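
A sketch of the conversion, where `lam` and `k` stand for parameterA and parameterB. The notes do not give the E-value and P-value formulas; the forms used here, E = query length * database length * 2^-S’ and P = 1 - e^-E, are the standard ones and are an assumption about what the lecture used. The parameter values in the example are illustrative.

```python
import math

def bit_score(raw_score, lam, k):
    """S' = (parameterA * S - ln(parameterB)) / ln(2)."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_length, database_length):
    """Expected number of chance alignments with at least this bit score (assumed standard form)."""
    return query_length * database_length * 2 ** (-bits)

def p_value(e):
    """Chance of at least one such alignment occurring by chance (assumed standard form)."""
    return 1 - math.exp(-e)

# Hypothetical search: raw score 120, query of 250 AA, database of 50 million AA.
s_prime = bit_score(120, lam=0.267, k=0.041)
e = e_value(s_prime, 250, 50_000_000)
print(s_prime, e, p_value(e))
```

For small E, 1 - e^-E is almost equal to E, which matches the remark above that P-values and E-values are nearly identical when E<0,01.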

Lecture 4D: The BLAST algorithm

Heuristic = “program cuts corners” so it doesn’t have to align everything with each other → faster
BLAST is heuristic and starts the alignment from seed words

BLAST tool makes local alignment. It starts at some point in the middle and builds on from there.
Alignment stops at the end of a region with strong similarity.
FASTA tool makes global alignment. Alignment is stretched over the entire sequence lengths (causes
gaps).
Local alignment is preferred for database searches, global alignment is preferred for aligning
homologs that are known to share similarity over their whole length.

General workflow of FASTA:

Make seed words (with defined length) of query, try to align seed words to template/database based
on identity score, rank alignments using matrix, join high scoring regions on same template
sequence, find and penalize gaps, optimize alignment by filling up gaps with dynamic programming.

General workflow of BLAST:

Make seed words (with defined length) of query, score seed words using matrix and select words
with score above a certain threshold, add synonymous words to list if they score above the same
threshold, search all sequences in database and select the ones where multiple words from the
extended list are found, from each word alignment extend alignment in both directions to find
alignments with a better score.

BLASTp: a protein query against a protein database.
BLASTn: a nucleotide query against a nucleotide database.
BLASTx: a nucleotide query against a protein database, by first translating the query nucleotide
sequence in all 6 reading frames.
TBLASTn: a protein query against a nucleotide database, by translating each database nucleotide
sequence in all 6 reading frames.
TBLASTx: a nucleotide query against a nucleotide database, by translating each database and query
nucleotide sequence in all 6 reading frames.

BLAST uses an AA matrix (usually BLOSUM) for scoring, except for BLASTn where you compare nucleotide
with nucleotide (match/mismatch scoring).

Lecture 7: PSI BLAST and protein domains

Proteins usually contain one or more functional regions commonly termed as domains. Identification
of the different domains can provide insights into the function of the protein.

Multiple sequence alignments of sequences containing a certain domain can result in a pattern of
important AAs in this domain.
A profile is a table of position specific scores and gap penalties based on a multiple sequence
alignment to identify possible patterns for a domain in a sequence.

The Hidden Markov model (HMM) is a mathematical model that can be used for pattern recognition.

PSI BLAST can be used to make your own position specific scoring matrix. First do a normal BLASTp
and take the results above a certain threshold. Base your matrix on a multiple sequence alignment of
these sequences. Further search the database with this matrix for new matches and update your set
of sequences used to make your matrix based on new results.

Lecture 8: Best bidirectional hits and FAIR
Provenance of data: what is the background of data. How is the data obtained and how reliable is it.

FAIR data:
Findable: keeping track of data provenance. Easy to find data back by humans and computer systems
Accessible: readable for both humans and computers
Interoperable: annotation according to standards, procedures, protocols
Reusable: directly usable by computational methods

Gene ontology (GO) terms: defined terms representing gene product properties
- Cellular component (part of the cell or extracellular environment where product is present)
- Molecular function (the activities of a product on molecular level like binding, catalysis, etc.)
- Biological process (the process the activity of the product contributes to)

Homolog: proteins have common ancestor (both orthologs and paralogs)

Ortholog: genes that derive from a single ancestral gene in the last common ancestor
Paralog: the rest of the homologs (same ancestral gene but in same organism, or not in last ancestor)

The best hit of a given gene used as query in BLAST, is the gene from the target genome that gives
the best match. If you BLAST the gene that was the best hit against the genome of the initial query
gene and this gene comes up as best hit, you have a bidirectional hit.
Bidirectional hits represent a very strong similarity and indicate a possibility for orthology.
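
The bidirectional hit procedure can be sketched as follows, assuming the BLAST results of both directions have already been reduced to (query, subject, bit score) tuples. The function names and gene identifiers are made up for illustration.

```python
def best_hits(hits):
    """Keep the highest scoring subject for every query.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_hits(hits_a_vs_b, hits_b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(hits_a_vs_b)
    best_ba = best_hits(hits_b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical BLAST results between genome A and genome B.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90)]
b_vs_a = [("b1", "a1", 198), ("b2", "a1", 140)]
print(bidirectional_hits(a_vs_b, b_vs_a))
```

In this example a2 finds b2, but b2 finds a1 as its best hit, so only the pair (a1, b1) is bidirectional.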

A clique is a closed circuit of bidirectional hits in different organisms.

Downsides of bidirectional hits:
- Takes a long time when you look at multiple species (quadratic increase with more species).
- It only recovers one orthologous pair (only best hit), so a possible co-ortholog resulting from
a duplication event after speciation will not be found.
- Finding bidirectional hits is harder when the distance between two organisms increases.
- A single mutation can already change the function.

Alternative approach: identification through function instead of evolution, look at domains.

Lecture 9: transcripts and mRNA measurement

RNA is single stranded so it can fold into structures
mRNA gets a 5’ cap and a 3’ poly A tail after splicing of introns. This makes the mRNA more stable
and plays a role in transportation.

Alternative splicing of the mRNA of a single gene can result in different isoforms.

For most biological questions, protein levels and activity are most informative. mRNA is easier to
measure than proteins and still gives an indication. mRNA levels do not always correspond to protein
levels. Protein levels do not always correspond to protein activity.

mRNA measurements are useful when you compare two different conditions (yeast growth in
aerobic and anaerobic conditions) or when you measure differences over time (cell cycle).

mRNA levels: differ between genes, differ between isoforms, differ between tissues, differ between
developmental stages, vary with cell cycle, vary during the day (circadian rhythm), differ between
individual cells, depend on the environment, are the result of mRNA synthesis and mRNA decay
(speed varies between different mRNAs).

Ways of measuring mRNAs:

Quantitative PCR to measure amounts, only for few genes at a time.
Microarrays to measure between different conditions. Labeled mRNA hybridizes on probes with
corresponding DNA on chips, followed by visualization of labels using fluorescence.

Upsides microarrays: Highly standardized, relatively cheap, easy to handle due to small data size
Downsides microarrays: Gene sequence should be known, no detailed position specific information,
not very quantitative.

Microarrays are used less and less because of a newer, better technique: mRNA sequencing (RNA-seq).
Workflow: isolate all RNA from the sample, select mRNA by fishing for the poly-A tail, make double stranded
cDNA from the mRNA, fragment and sequence the cDNA, map the reads to a reference genome or assemble
de novo, count reads to quantify.

Upsides RNA-seq: still works without known reference genome, can identify alternatively spliced
transcripts, can identify single nucleotide mutations (SNP) in transcripts, quantitative.
Downsides RNA-seq: relatively expensive

Output of RNA-seq is many files of short reads so you have to do a de novo assembly or map the
reads on a reference sequence.
Algorithms can be used for mapping. Take into account that some sequences span over an intron.

Lecture 10: transcript quantification and differential expression
When comparing expression levels of different genes by looking at mapped mRNA sequence reads on
these genes, take the total number of reads per sample and gene size into account.

Reads per kilobase of transcript per million mapped reads (RPKM) =

10^9 * # reads mapped to a region / (total # reads * region length)

Transcripts per million transcripts (TPM) =

10^6 * (# reads mapped to a transcript / transcript length) / (sum over all transcripts of # reads mapped / transcript length)
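
Both normalisations can be sketched as below, with RPKM as 10^9 * reads / (total reads * length) and TPM as 10^6 * (reads / length) divided by the sum of reads / length over all transcripts. The function names, gene names, read counts and lengths are made up for illustration.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {g: 1e9 * counts[g] / (total_reads * lengths[g]) for g in counts}

def tpm(counts, lengths):
    """Transcripts per million: length-corrected counts rescaled to sum to one million."""
    rates = {g: counts[g] / lengths[g] for g in counts}
    total_rate = sum(rates.values())
    return {g: 1e6 * rates[g] / total_rate for g in counts}

# Hypothetical sample: read counts per gene and gene lengths in bases.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

geneB has three times as many reads as geneA but is also three times as long, so both normalisations give the two genes the same expression value. TPM values always add up to one million per sample, which makes them easier to compare between samples.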

Replicate measurements to find variation.

Assume that there is no real difference (null hypothesis), determine the probability of finding the
measured difference in expression in the data by chance, if this probability is low, the null hypothesis
is likely to be wrong so there really is a difference.

Calculate t-value based on your measurements and standard deviation, look up the probability of
finding that t-value randomly using a distribution plot

For RNA-seq you test many genes at once, so using a raw p-value gives many false positives (multiple
testing). Instead use a q-value, which estimates the false discovery rate among the genes that are
called significant at a certain threshold you choose yourself.

The relative difference in expression is measured as a fold change between two conditions. Usually
the log2 of the fold change is used. A log2 fold change of 1 means log2(situationA/situationB) = 1 so
situationA/situationB = 2^1 = 2 so expression in situationA is twice as high as in situationB.

Enrichment analysis: look at list of up/down regulated genes and see whether a certain type of gene
is present in this list more than you would expect by chance.

Lecture 11: topological signal sequences

Topological signals give information about the destination of a protein and thus help with annotating
the function of a protein.

Translation → endoplasmic reticulum → Golgi apparatus → secretion

Primary signals are signals with a sequential pattern.

Patched signals have a non-sequential pattern with different regions contributing to the signal patch.
Patched signals are much harder to detect with dedicated software than primary signals.

Co-translational translocation is the process where a polypeptide chain is translated while directly
being transported into the ER. The chain contains a topological signal on the N-terminal (translated
first) of the sequence which can get recognized while the chain is still being synthesized on the
ribosome. Recognition of the signal leads to transfer of the ribosome to a special receptor on the
membrane of the ER where translation continues. The signal peptide gets cleaved off after translation
is complete.
When the polypeptide contains a special hydrophobic part, this part can stay in the ER membrane
while the rest of the sequence is being translated on the outside of the ER resulting in a
transmembrane protein. The part of the protein facing the inside of the ER will eventually face the
outside of the cell.

When studying possible signal sequences for secretion, you number the AAs upstream of the
cleavage site (part of the topological signal) with negative numbers starting from the cleavage site.
The AAs downstream of the cleavage site (part of the eventual protein) get positive numbers.

Recognizing signal sequences with software like SignalP uses different scores. The C-score is an
estimate of the likelihood of a certain position in your input sequence to be the first amino acid of
the mature protein. The S-score is an estimate of the likelihood of a certain position in your input
sequence to be part of the signal peptide. The Y-score is the geometric average of the C-score and
the slope of the S-score, so it peaks where the C-score is high and the S-score drops steeply,
indicating the end of the signal peptide sequence.
Therefore the Y-score gives the best estimate of where the signal peptide is cleaved.

Transmembrane domains usually consist of an alpha helix with a length of 18-19 AAs. The AAs just
outside of the membrane are usually charged and therefore hydrophilic to keep the domain in place.

Lecture 12: Multiple sequence alignments

MSA is a good tool to find common ancestry between proteins and important conserved AAs.

For MSA, all sequences must be related in a linear fashion (same order of domains and no domain
extensions). Meaningful alignments require homologous sequences as input. When you use MSA to
study possible homology, you have to repeat the alignment several times with different inputs based
on new insights.

The MSA tool ClustalW first looks for the most similar pair among all the input sequences.
Then the next most related sequence is added, and so on.

After the MSA the results can be shown in a phylogenetic tree.

You can root a tree with an outgroup, which is a sequence distantly related to all the other sequences.
The length of the branches of a tree indicates how closely proteins are related.

Cladogram: no variation in distance between different sequences so no indication of relatedness.

Phenogram: includes similarity scores depicted as length of the different branches.

Bootstrapping is a statistical technique to determine the reliability of a tree. It randomly resamples
the data and looks whether the resampled data give the same result. If you repeat this many times
(1000) and you often or always get back the same results, you know that your tree is reliable.

Lecture 14: Amino acids I

General structure of an AA and the allowed rotations in the dipeptide plane (φ and ψ).

Secondary structure preferences:

Helix = AMELK
Strand = VITWYF
Turn = PSDNG (pisding without i)

Below are some important properties of certain AAs. Not all AAs are discussed. According to Jacques
Vervoort, who gave this lecture, it is important to know general properties (size, polarity,
hydrophobicity) of all AAs and some important properties of the special ones.

Alanine has a –CH3 as side group which makes it ‘a subset of all other amino acids’. If you want to
mutate a residue but you don’t have an idea what to replace it with, use alanine.
Cysteine has a very reactive S-H group which it can use to form bridges with other cysteines or bind
metals. The group can easily be oxidized.
Aspartic acid often occurs in active sites and is able to bind ions.
Glycine doesn’t have a side chain, making its backbone very flexible allowing it to make turns other
residues cannot make.
Histidine is often seen in active sites. It can easily accept a proton and is able to bind metal ions.
Lysine has a long and flexible side chain and is therefore often found at the surface of a protein.
Leucine is the most abundant AA. The body measures leucine levels before it synthesizes muscle
proteins.
Methionine also has a sulphur group which enables it to bind metal ions. However this sulphur group
is not very reactive unlike cysteine. Furthermore, methionine can function as methyl donor.
Proline does not have a backbone proton but has an extra covalent bond to the amino group giving it
a pre-bent structure making it extra suitable for turns. Turns tend to be at the surface of proteins
so even though proline is very hydrophobic, it is often found at the surface. Also, 20% of all helices
tend to start with a proline.
Arginine has a so-called guanidinium group in its side chain making it rigid and therefore it is more
often found on the inside of a protein.
Tryptophan is the most conserved of all residues due to its big size.

Lecture 15: Amino acids II

Hydrophobic AAs on the inside and hydrophilic AAs on the outside of the protein are preferred.

2&3 were derived experimentally and based on the properties of the single amino acids.
1&4 were derived theoretically, based on observing a dataset of known protein structures.

Alpha helices are formed through hydrogen bonds between the backbone C=O and the backbone N-H,
4 AAs further in the sequence.
Helices often have a side with mostly hydrophilic side groups and a side with mostly hydrophobic side
groups.
Helices also have a dipole over the length of the helix. All peptides point in the same direction with
their C=O and N-H backbone groups. The N-H group is more positively charged so the N-terminus end
has a slight positive charge compared to the C-terminus end. At the beginning and end of the helix
you often find charged AAs to correct for this dipole.

A beta sheet consists of at least two beta strands that interact with each other.
Hydrogen bonding between backbone C=O and N-H groups of two or more strands results in a sheet.
R-groups extend perpendicular to the plane of H-bonds.
Just like for helices, one side of a strand is mostly hydrophilic and the other side is mostly
hydrophobic. Every other AA in a beta strand points in the same direction and should therefore have
a similar hydrophobicity.
Beta sheets exist in either parallel (two strands both go from N → C) or antiparallel (one strand goes
from N → C while the other goes from C → N) configuration.

Turns are well defined structures with limited length that connect secondary structure elements.

A loop is everything that has no defined structure. A loop is different from a turn.

A Chou Fasman parameter is determined by taking the percentage of a certain AA in a certain
conformation and dividing it by the overall percentage of AAs in that conformation.
For example: you have 50 alanines in your set, 25 of them in alpha helix conformation (50%). Your
total set consist of 1000 amino acids of which 350 are in alpha helix conformation (35%). Therefore
the helix preference parameter for alanine is 50/35 = 1,43.
If the preference parameter is higher than 1 it means your specific residue prefers this secondary
structure. If it is lower than 1 it dislikes this structure.
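
The alanine example above, written out as a sketch. The function and argument names are chosen for this illustration.

```python
def preference_parameter(aa_in_conformation, aa_total, all_in_conformation, all_total):
    """Fraction of one AA in a conformation divided by the fraction of all AAs in that conformation."""
    return (aa_in_conformation / aa_total) / (all_in_conformation / all_total)

# Numbers from the example: 25 of 50 alanines in a helix, 350 of 1000 residues in a helix overall.
helix_preference_ala = preference_parameter(25, 50, 350, 1000)
print(round(helix_preference_ala, 2))  # 1.43
```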

Lecture 13: 3D protein structures

Determining complete 3D structure based on protein sequence only is difficult.
Other methods to determine structure: X-ray crystallography, NMR, cryo-EM and comparative
modeling.

X-ray: make crystals of many proteins (that are the same), hit it with X-ray beam and study the
diffraction. This gives information of the electron density in certain region which can be fitted with
the sequence to find the structure. The quality of the results depends strongly on resolution.

NMR: measures interaction between atoms based on distance. Gets complex with larger molecules
like proteins. 2D NMR gives a better overview but still complex. Results are often not clear because
proteins are flexible so distances can vary. It does give information about which parts are flexible.

Cryo-EM: new emerging technology that allows you to study larger proteins and complexes.

The peptide bond between C=O and N-H has double bond character so it does not rotate. The
bonds with the Cα are allowed to rotate (see picture in lecture: amino acids I).
Due to rotation, side groups can be in different orientations compared to each other. Some
orientations are energetically favorable, some aren’t due to steric interactions.

The Ramachandran plot has the rotational degrees of φ and ψ on each axis and gives a prediction
about which orientations of a dipeptide plane are energetically allowed and which are not.
Glycine has no side group so less steric hindrance and more freedom in the allowed orientations.

Lecture 16: Structural comparison

Protein data bank (PDB) is the only database with protein 3D structures. All information is checked.

Proteins with a low sequence identity score (30%) can still form similar structures.

Vector Alignment Search Tool (VAST) can be used to compare structures. For your protein chain,
locate the secondary structure elements, represent them as individual vectors, align the vectors with
other proteins in the database. (loops are not studied as secondary structure elements because they
often contain the most mutations)

Lecture 17: Comparative protein structure modeling
Determining protein 3D structures is difficult and expensive. Instead you can make hypothetical
models based on sequence and already available information in database.

Ab initio modeling: making a model based on predicted folding based on sequence.

Comparative modeling: making a model based on other homologous structures in the database (template).

Ab initio prediction | Comparative modeling
Applicable to any sequence | Only applicable to sequences that share recognizable similarity to a template structure
Not very accurate | Fairly accurate (comparable to low resolution X-ray)
Only for proteins of <100 residues | Not limited by size
Accuracy and applicability are limited by our understanding of how proteins fold. | Accuracy and applicability are limited by the number of known structures.

Software for making comparative models looks at the distances between different AAs in the template
(spatial restraints) and tries to fold the query sequence so that these restraints are also satisfied
in the model.
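
The restraint idea can be sketched as follows: take residue-residue distances from the template and measure how far a candidate model deviates from them. This is a simplified illustration, not how any particular modeling program works. The function names, the use of one coordinate per residue, the 10 Å cutoff and the squared-deviation score are all assumptions.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def derive_restraints(template_coords, cutoff=10.0):
    """Collect pairwise distances from the template as restraints.

    template_coords maps an aligned residue index to an (x, y, z) tuple.
    Only pairs closer than the (assumed) cutoff are kept.
    """
    restraints = {}
    for i, j in combinations(sorted(template_coords), 2):
        d = distance(template_coords[i], template_coords[j])
        if d <= cutoff:
            restraints[(i, j)] = d
    return restraints


def restraint_violation(model_coords, restraints):
    """Sum of squared deviations between model distances and restraints."""
    total = 0.0
    for (i, j), target in restraints.items():
        total += (distance(model_coords[i], model_coords[j]) - target) ** 2
    return total
```

A model that reproduces all template distances scores zero; a modeling program would search for coordinates that keep this kind of score as low as possible.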

Typical errors in comparative modeling: incorrect template, misalignment, regions without a template,
side-chain packing errors, shifts in aligned regions.

Lecture 18: Quality check of protein model

ProSA is software for analyzing the quality of your protein model.
Depending on the distance between atoms, the interaction energy can be favorable or unfavorable.
ProSA calculates the interaction energy between different parts of your protein and shows the
results in a plot along the sequence. The interaction energy should be below zero for a favorable structure.
To smooth the plot, the energy is averaged over a window of a certain number of residues.
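
The smoothing step amounts to a sliding-window average over the per-residue energies. A minimal sketch is given here; the function name `window_average`, the default window of 5 residues and the shrinking window at the chain ends are assumptions, not the actual ProSA settings.

```python
def window_average(energies, window=5):
    """Smooth per-residue energies with a centered sliding window.

    Assumption: near the chain ends the window is truncated to the
    residues that exist, so the output has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")

    half = window // 2
    smoothed = []
    for i in range(len(energies)):
        lo = max(0, i - half)
        hi = min(len(energies), i + half + 1)
        chunk = energies[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A larger window gives a smoother curve, which makes long stretches with positive (unfavorable) energy easier to spot at the cost of hiding single problematic residues.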

Also check whether the φ/ψ combinations (peptide plane orientations) of your model fall within the allowed regions of the Ramachandran plot.

