Lecture3-DNA Data Analysis

Uploaded by

shoyo3918
DNA Data Analysis

Dr. Y. V. Lokeswari
Associate Professor
SSN College of Engineering
DNA Data Analysis - DNA Sequence
• DNA is the basis of heredity.
• It is a polymer made up of small molecules called nucleotides, which can be distinguished by the four
bases: adenine (A), cytosine (C), guanine (G), and thymine (T).
• DNA usually occurs in double strands, and the bases in the two strands are complementary to
each other, i.e., A pairing with T and G pairing with C with hydrogen bonds.
• A single strand of DNA (written in the 5′ to 3′ direction), 5′ AACCGTACC 3′, is paired to a
complementary strand running in the opposite direction: 3′ TTGGCATGG 5′.

DNA = basis of heredity.

It is made of small molecules called nucleotides, which are distinguished by the four bases.
The bases running along the two strands are complementary to each other.
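
The pairing rule (A with T, G with C) and the opposite direction of the two strands can be sketched in a few lines of Python. This is only an illustration; the function names are made up for this example and are not from the lecture.

```python
# Watson-Crick pairing from the slide: A<->T, G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Base-by-base complement, kept in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)

def reverse_complement(strand: str) -> str:
    """The complementary strand rewritten in its own 5'->3' direction."""
    return complement(strand)[::-1]

top = "AACCGTACC"                 # 5' AACCGTACC 3'
print(complement(top))            # paired strand as it sits opposite, read 3'->5'
print(reverse_complement(top))    # the same paired strand, read 5'->3'
```

Sequence databases conventionally write every strand 5′ to 3′, which is why the reversed form is the one usually stored and compared.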
DNA Data Analysis - DNA Sequence
• The transcription process is different in prokaryotes (e.g., bacteria) and eukaryotes (organisms whose
cells possess a nucleus, e.g., fungi, unicellular paramecia, and all plants and animals).
• In prokaryotes, RNA polymerase produces an mRNA transcript directly from the DNA template.
• In eukaryotes, genes in a DNA sequence are not continuous, but instead are broken up into coding
regions (exons, which code for proteins) and noncoding regions (introns).
• The amino acid leucine (Leu) is encoded by six different codons. There are three codons, UAA, UAG,
and UGA, that do not encode any amino acids.
• They are the stop codons that terminate the translation process.
• It is therefore important to decide at which nucleotide translation starts and where it stops. The
stretch of sequence between the start and the stop is called an open reading frame (ORF). Determining
the correct reading frame is an important problem in genomics and bioinformatics.
Transcription differs between prokaryotes (no nucleus) and eukaryotes.

Prokaryotes: RNA polymerase produces an mRNA transcript directly from the DNA template.

Eukaryotes: genes in a DNA sequence are broken into coding (exon) and noncoding (intron) regions.

Example: leucine (Leu) is encoded by six different codons.

Three codons do not code for any amino acid; these are the stop codons that terminate the
translation process.

At which nucleotide should translation start? At which nucleotide should it stop?
Determining the correct reading frame (open reading frame, ORF) is an important problem in
bioinformatics.
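
The reading-frame problem can be made concrete with a small ORF scanner. This is a minimal sketch under stated assumptions: it uses AUG as the start codon (the standard start codon, though the slide does not name it), scans only the three forward frames, and the function name and `min_codons` parameter are invented for the example.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}   # the three stop codons named in the slide
START_CODON = "AUG"                    # assumption: standard start codon, not given in the slide
LEUCINE_CODONS = {"UUA", "UUG", "CUU", "CUC", "CUA", "CUG"}  # the six Leu codons

def find_orfs(mrna: str, min_codons: int = 2):
    """Return (frame, start, end, sequence) for ORFs in the three forward frames.

    An ORF here runs from an AUG to the first in-frame stop codon, inclusive.
    `end` is exclusive; `min_codons` counts the start codon but not the stop.
    """
    mrna = mrna.upper().replace("T", "U")      # accept DNA-style input as well
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                       # open an ORF at the first AUG seen
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3, mrna[start:i + 3]))
                start = None                    # close it and look for the next AUG
    return orfs

print(find_orfs("CCAUGGCUUAAGG"))
```

The same sequence read in frames 0 and 1 contains no AUG at all, which is the point of the slide: the choice of frame decides whether a gene is seen. A fuller scanner would also examine the three frames of the reverse complement.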
DNA Data Analysis - Sequence Comparison and Alignment
• To study the functional and structural information encoded in the sequence.
• This is done by comparing the new sequence with sequences that have already been well studied and
annotated. (Note: compare the unknown with the known.)
• Sequences that are similar probably have the same function, be it a functional role (i.e., ORFs
coding for similar proteins), a regulatory role (i.e., similar regulatory or biochemical pathways), or
structural properties in the case of proteins. (Note: similar sequences suggest the same structure, function (ORFs), and regulation (pathways).)
• Additionally, if two sequences from different organisms are similar, there may be a common ancestor
sequence, and the sequences are then said to be homologous. (Note: similar sequences in different organisms may share a common ancestor; they are homologous.)
• Sequence alignment: comparing two (pairwise alignment) or more (multiple sequence alignment)
sequences by searching for a series of individual characters or character patterns that are in the same
order in the sequences. (Note: string matching.)
• Global alignment tries to align the entire sequence in such a way as to maximize the degree of similarity
between the two sequences. (Note: the goal is to show, one way or another, that the two sequences are similar.)
• In local alignment, the alignment stops at the ends of regions of strong similarity, and a much higher
priority is given to finding these local regions than to extending the alignment to include more
neighboring pairs. (Note: BLAST and FASTA; shorter regions with higher similarity take priority, so the alignment is not extended.)
• Although dynamic programming is an efficient mathematical technique for finding an optimal
alignment, it is still too slow for comparing large numbers of bases. (Note: DP is slow.)
• Implementations of the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms for sequence
alignment are freely available in EMBOSS (European Molecular Biology Open Software Suite).
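
The difference between global and local alignment shows up clearly in the dynamic programming recurrence. The sketch below computes only the best score, without traceback. The scoring values (match +1, mismatch -1, gap -2) and the function name are illustrative assumptions, not the EMBOSS defaults.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False: Needleman-Wunsch (global; leading and trailing gaps are penalised).
    local=True:  Smith-Waterman (local; every cell is floored at 0).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: aligning a prefix against nothing costs gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap       # gap in b
            left = score[i][j - 1] + gap     # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)          # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]

print(alignment_score("GATTACA", "GATACA"))                          # global
print(alignment_score("TTTGATTACATTT", "CCCGATTACACCC", local=True)) # local
```

The table has (len(a)+1) × (len(b)+1) cells, so time and memory grow with the product of the two lengths. That is why heuristic tools such as BLAST and FASTA are used for database-scale searches.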
DNA Data Analysis - Sequence Comparison and Alignment
• In a sequence alignment, the following symbols are used across all sequences in the alignment:
• A dash (-) represents a gap.
• An asterisk (*) indicates identical amino acid residues.
• A colon (:) indicates a conserved substitution.
• A dot (.) indicates a semi-conserved substitution.
Demo on BLAST and ClustalW
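
How such an annotation line is derived from the alignment columns can be sketched as follows. The residue groups below are a small illustrative assumption chosen for the example; they are not the actual group tables used by ClustalW, and the function name is hypothetical.

```python
# Illustrative similarity groups only; NOT the actual tables used by ClustalW.
STRONG_GROUPS = [set("ILVM"), set("KR"), set("DE"), set("ST")]
WEAK_GROUPS = [set("ILVMF"), set("KRH"), set("DENQ"), set("STA")]

def consensus_line(aligned):
    """Build the symbol line printed under a multiple sequence alignment."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            marks.append(" ")        # a gap in the column: no symbol
        elif len(residues) == 1:
            marks.append("*")        # identical residues
        elif any(residues <= g for g in STRONG_GROUPS):
            marks.append(":")        # conserved substitution
        elif any(residues <= g for g in WEAK_GROUPS):
            marks.append(".")        # semi-conserved substitution
        else:
            marks.append(" ")
    return "".join(marks)

alignment = ["MKLV-A", "MRIVSA", "MKLFSG"]
for seq in alignment:
    print(seq)
print(consensus_line(alignment))
```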
DNA Data Analysis - Gene Prediction
• Gene prediction requires the integration of many different signals such as promoter regions, translation
start and stop codons, reading frame periodicities, polyadenylation (polyA) signals, and,
in eukaryotes, intron splicing signals, base compositional bias between codon positions for exons and
introns, and various coding statistics.
• In prokaryotes, gene finding is made simpler by the fact that coding regions are not interrupted by
intervening sequences such as introns.
• 3 approaches of gene prediction:
• Similarity-based, Content-based, and Site-based.
• Similarity-based methods make use of already determined sequences by a comparison of sequence
data.
• Content-based methods determine the overall properties of a sequence in terms of the various coding
statistics.
• Site-based methods determine transcription factor binding sites, polyA signals, start and stop codons,
splice junctions, and other specific subsequences or sequence patterns.
Gene prediction needs the integration of different signals, such as start and stop codons, promoter
regions, and reading frame periodicities.
In prokaryotes, coding regions are not interrupted by introns.
In eukaryotes, coding regions (exons) are interrupted by noncoding regions (introns), which are spliced out.
Three approaches to gene prediction: similarity-based, content-based, and site-based (SCS).
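
Two of the three approaches can be illustrated with very small functions. This is a toy sketch, not a gene finder: the content-based part uses GC content at the third codon position as one example of a coding statistic, and the site-based part scans for a few fixed motifs. AATAAA is a common polyadenylation signal; the function names and the choice of statistic are assumptions made for this example.

```python
import re

def gc3(coding_seq: str) -> float:
    """Content-based statistic: fraction of G or C at the third codon position."""
    thirds = coding_seq.upper()[2::3]
    return sum(base in "GC" for base in thirds) / len(thirds) if thirds else 0.0

def find_sites(dna: str) -> dict:
    """Site-based scan: 0-based positions of a few fixed signal motifs."""
    dna = dna.upper()
    motifs = {
        "start_codon": "ATG",
        "stop_codon": "TAA|TAG|TGA",
        "polyA_signal": "AATAAA",   # common polyadenylation signal motif
    }
    # The lookahead lets finditer report overlapping occurrences too.
    return {name: [m.start() for m in re.finditer(f"(?=(?:{pattern}))", dna)]
            for name, pattern in motifs.items()}

print(gc3("ATGGCAAAC"))
print(find_sites("CCATGAAATAAATAG"))
```

The motif scan ignores the reading frame, so most hits are not real signals. Practical gene finders combine many such weak signals, which is the integration the slide refers to. The similarity-based approach reduces to the sequence alignment methods covered earlier.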
DNA Data Analysis - Gene Prediction
• A study of several coding statistics for the recognition of human, yeast, and C. elegans coding and
noncoding sequences.
DNA Data Analysis - Gene Prediction
One Pyrimidine feature
DNA Data Analysis - Phylogenetic Analysis
• A phylogenetic analysis of a family of related DNA or protein sequences is a determination of how the
family might have been derived during molecular evolution.
• Phylogenetic analysis leads to the construction of an evolutionary tree.
• The evolutionary relationships among the sequences are depicted by placing the sequences as the leaves
of the tree in such a way that the branching relationship in the tree reflects the degree to which
different sequences are related.
• Phylogenetic study performed on a gene family could also aid in the prediction of genes with
equivalent or similar functions.
• Phylogenetic analysis is closely linked to sequence alignment.
• Three methods are commonly used to derive the phylogenetic tree:
• Maximum parsimony method, distance method, and maximum likelihood method.
DNA Data Analysis - Phylogenetic Analysis
• For parsimony analysis, the best results are obtained when the amount of variation among all pairs of
sequences is similar and the amount of variation is small.
• It is not good for reconstructing ancient phylogenies.
• If the variation differs among sequences (some sequences are more similar than others) and the amount of
variation is intermediate, the distance method can be used.
• For the distance method, the concept of genetic distance between two sequences needs to be defined
appropriately, depending on the type of sequences in consideration and on their structural properties.
• Algorithms are also available for converting sequence similarity scores into distance scores.
• The genetic distances between sequences are then used to construct the phylogenetic tree.
• Maximum likelihood methods are particularly useful when the sequences are more variable.
• The method uses probability calculations based on an explicit evolutionary model (e.g., the F84
substitution model in the PHYLIP package or the TN93 substitution model) to find a tree
that best accounts for the variation in the sequences.
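
For the distance method, one simple way to define genetic distance between aligned DNA sequences is the proportion of differing sites (the p-distance), optionally corrected for multiple substitutions at the same site. The sketch below uses the Jukes-Cantor (JC69) correction, d = -(3/4) ln(1 - 4p/3), purely as an example. It is a simpler model than the F84 and TN93 models mentioned above, and the function names are hypothetical.

```python
import math

def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences, skipping gap columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p: float) -> float:
    """JC69 correction: d = -(3/4) * ln(1 - 4p/3); treated as infinite for p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

def distance_matrix(aligned: dict) -> dict:
    """Corrected distance for every unordered pair of named sequences."""
    names = list(aligned)
    return {(m, n): jukes_cantor(p_distance(aligned[m], aligned[n]))
            for i, m in enumerate(names) for n in names[i + 1:]}

seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACCA"}
for pair, d in distance_matrix(seqs).items():
    print(pair, round(d, 4))
```

A matrix like this is the input to tree-building procedures such as neighbor joining or UPGMA, which the phylogenetic packages provide.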
DNA Data Analysis - Phylogenetic Analysis
• Phylogenetic analysis programs are widely available. Two main ones are:
• PHYLIP (available at http://evolution.genetics.washington.edu/phylip.html) and
• PAUP (available at http://www.lms.si.edu/PAUP/).
• Both packages provide the three main methods for phylogenetic analysis.
