Internship Report
Submitted by,
SATHYA PRABHA K
III BCA B
19BCA087
Guided by,
AUGUST 2021
DEPARTMENT OF BCA
COLLEGE OF EXCELLENCE
CERTIFICATE
Peelamedu, Coimbatore-641004
This is to certify that SATHYA PRABHA K (19BCA087) of BCA has undergone Internship
training in SEQUENCE COMPARISON AND PHYLOGENY USING
BIOINFORMATICS during August 2021.
___________________________ _______________________________
Company certificate
WORK DIARY
ACKNOWLEDGEMENT:
I take this opportunity to acknowledge, with great pleasure, deep satisfaction and
gratitude, the contribution of the many individuals who helped in the successful completion
of this project.
My sincere thanks go to all the staff of our department for their constant support
and encouragement.
I express my heartfelt thanks to Mrs. BANU, Proprietor, Accentz Techno Soft, for
providing me with an opportunity to undertake the internship study in her esteemed concern. I
also sincerely thank Mrs. Geethalakshmi, who taught the internship classes in a
disciplined manner that made them easy to understand.
DECLARATION
I hereby declare that this internship study entitled “SEQUENCE COMPARISON AND
PHYLOGENY USING BIOINFORMATICS”, submitted to PSGR Krishnammal College for
Women, Coimbatore, for the award of the Degree of Bachelor of Computer Applications, is a
record of original work done by SATHYA PRABHA K (19BCA087) under the guidance of
Mrs. BHARATHI V., M.C.A., M.Phil., (Ph.D.), Assistant Professor, Department of BCA,
PSGR Krishnammal College for Women, Coimbatore, and that this internship study has not formed
the basis for the award of any Degree/Diploma or similar title to any candidate of any university.
Date :
Endorsed by
CONTENTS
SEQUENCE COMPARISON AND PHYLOGENY USING BIOINFORMATICS
ABSTRACT
This is a study in the field of bioinformatics, which draws on information technology,
computer science, mathematics and statistics. The field deals with detailed data about cells
and cell components. In this internship we learned about many proteins and genes, where to
find the databases that hold them, and how to analyse them. Many tools, both offline and
online, are available for finding the structure, molecular weight, protein type and amino acid
counts of a protein, and the proteins encoded by genes. By comparing unknown sequences, we
can identify which sequences are related to the query sequence. We can also look up the
pathways of molecules in biological databases, which show energy-related components and
other reactions in the human body. Information about drugs, including their usage and
molecular details, can be found as well. Bioinformatics is expected to become an even more
valuable field in the future.
INTRODUCTION TO BIOINFORMATICS:
WHAT DO WE DO IN BIOINFORMATICS?
-Analyse and interpret the various types of biological data:
1. Genomic Sequences (DNA).
2. Transcriptomic Sequences (RNA).
3. Proteomic Sequences (Proteins).
4. Protein Structures (Proteins).
5. RNA Structures (RNA).
-Develop new algorithms and tools to:
1. Access biological information.
2. Handle large datasets.
3. Find relationships between data sources, etc.
PROTEIN CHOSEN AND FASTA FORMAT
BLASTp
INTERPRETATION:
I chose secretin preprotein [Homo sapiens] for Multiple Sequence Alignment.
The BLASTP search for secretin [Homo sapiens] has the RID GSEFZVX801R and the
query ID lcl|Query_679205; the query is 121 residues long and its molecule type is amino
acid. I collected 10 sequences for Multiple Sequence Alignment (the query and the nine hits
below). The query matches secretin [Gorilla gorilla gorilla], secretin [Nomascus leucogenys],
secretin [Piliocolobus tephrosceles], secretin [Sapajus apella], hypothetical protein
DBR06_SOUSAS35410027 [Sousa chinensis], secretin [Ursus maritimus], secretin [Canis lupus
dingo], secretin [Peromyscus leucopus], and secretin isoform X2 [Arvicola amphibius].
Gorilla gorilla gorilla matches 94.40%, Nomascus leucogenys matches 89.68%, Piliocolobus
tephrosceles matches 88.00%, Sapajus apella matches 78.46%, Sousa chinensis matches 69.23%,
Ursus maritimus matches 65.71%, Canis lupus dingo matches 65.20%, Peromyscus leucopus
matches 57.76%, and Arvicola amphibius matches 54.92%. These are the percentages of each
sequence that match the query, secretin preprotein [Homo sapiens].
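The percentages above can be read as percent identity: the number of alignment columns in which the two sequences carry the same residue, divided by the alignment length. The following is a minimal sketch of that calculation, not the code BLAST itself runs. The function name `percent_identity` and the two short sequences are hypothetical and are not the secretin data.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both sequences have the same residue.

    Both inputs must already be aligned (equal length, '-' marks a gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if the residues agree and it is not a gap.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b)
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical toy alignment, used only to illustrate the arithmetic.
query = "MAPRLLL-LA"
subject = "MAPRPLLLLA"
print(f"{percent_identity(query, subject):.2f}%")  # prints 80.00%
```

Here 8 of the 10 columns are identical (one mismatch and one gap), which gives 80.00%.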
INTRODUCTION TO MULTIPLE SEQUENCE ALIGNMENT AND ITS USES:
A Multiple Sequence Alignment (MSA) is the alignment of three or more biological
sequences, generally protein, DNA or RNA. The term refers to a series of algorithmic solutions
for aligning evolutionarily related sequences while taking into account evolutionary events
such as mutations, insertions, deletions and rearrangements under certain conditions. It is
used to study closely related genes or proteins in order to find the evolutionary
relationships between them and to identify shared patterns among functionally or structurally
related genes. In many cases, the input set of query sequences is assumed to have an
evolutionary relationship, by which the sequences share a lineage and are descended from a
common ancestor. MSA has assumed a key role in the comparative structure and function
analysis of biological sequences. It often leads to fundamental biological insight into the
sequence-structure-function relationships of nucleotide or protein sequence families.
Multiple sequence alignments can also be used to create a phylogenetic tree. One reason is
that functional domains known in annotated sequences can be used to align non-annotated
sequences. MSA algorithms can deal with sequences that are quite different but, as in the
pairwise case, when the sequences are very different they may have problems producing a good
alignment. A good alignment places homologous positions, or positions with the same
structure or function, in the same column. Computational algorithms are used to produce and
analyse MSAs because sequences of biologically relevant length are difficult, and often
intractable, to process manually. MSAs require more sophisticated methodologies than
pairwise alignment because they are more computationally complex.
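To make the comparison with pairwise alignment concrete, the sketch below computes the score of a global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). It is only an illustration of the pairwise building block, not the procedure used by ClustalW. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, and the test sequences are hypothetical.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


# Hypothetical toy sequences: three matches and one gap give 3 - 2 = 1.
print(global_alignment_score("ACGT", "AGT"))  # prints 1
```

The table for two sequences has (m+1) x (n+1) cells. Extending the same exact method to many sequences needs one dimension per sequence, which is why MSA programs rely on heuristics instead.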
CLUSTALW:
RESULTS:
MENU IN CLUSTALW:
INTRODUCTION TO PHYLOGENETIC ANALYSIS
Phylogenetic analysis provides an in-depth understanding of how species evolve
through genetic changes. Using phylogenetics, scientists can evaluate the path that connects a
present-day organism with its ancestral origin, and can also predict the genetic divergence
that may occur in the future. Its aims are to construct a visual representation (a tree) that
describes the assumed evolution occurring between and among different groups (individuals,
populations, species, etc.) and to study the reliability of the consensus tree. Phylogenetic
analysis can be useful in comparative genomics, which studies the relationships between the
genomes of different species. In this context, one major application is gene prediction or
gene finding, which means locating specific genetic regions along a genome.
[Figure: rooted phylogenetic tree with labels Root, Ingroup and Outgroup]
TYPES OF TREES:
There are three types of phylogenetic tree, namely the Cladogram, the Phylogram and the
Ultrametric tree. A Cladogram carries no numerical scale: its branch lengths are arbitrary,
so it is only a rough representation of the phylogenetic tree that shows the branching order.
A Phylogram shows the genetic change of the organisms: its branch lengths are proportional
to the amount of change along each lineage. An Ultrametric tree shows time: its branch
lengths are proportional to the time taken, expressed in my, i.e., million years, so that if
a lineage takes 2 my to diverge, the branch may be drawn 2 cm long. The diagram below shows
the types of trees.
All show the same evolutionary relationships, or branching orders, between the taxa.
phylogeny. In the diagram, the arrows point at a polytomy (multifurcation) and a
bifurcation of the evolutionary tree.
Taxonomical Reaction:
There are three possible unrooted trees for four taxa (A, B, C, D). Phylogenetic tree
building (or inference) methods are aimed at discovering which of the possible unrooted trees
is “correct”. We would like this to be the “true” biological tree – that is, one that accurately
represents the evolutionary history of the taxa. However, we must settle for discovering the
computationally correct or optimal tree for the phylogenetic method of choice.
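The count of three unrooted trees for four taxa follows from a standard formula: the number of unrooted bifurcating trees for n taxa is (2n-5)!!, the product of the odd numbers from 3 up to 2n-5. The sketch below evaluates it. The function name is hypothetical and the code is only an illustration of the arithmetic.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees for n_taxa labelled taxa.

    Uses the standard formula (2n - 5)!! = 3 * 5 * ... * (2n - 5), valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("the formula applies to three or more taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n == 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


print(count_unrooted_trees(4))   # prints 3, the case for taxa A, B, C, D
print(count_unrooted_trees(10))  # prints 2027025
```

For the four taxa A, B, C and D the three trees group (A,B) against (C,D), (A,C) against (B,D), and (A,D) against (B,C). With ten sequences, as in this report, there are already more than two million candidate trees, which is why tree-building methods search for an optimal tree instead of checking every one by hand.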
PHYLOGENETIC TREE:
INTERPRETATION:
A phylogenetic tree is a branching diagram showing the evolutionary
relationships among various biological species or other entities, based upon similarities and
differences in their physical or genetic characteristics. All life on Earth is part of a single
phylogenetic tree, indicating common ancestry. I took the secretin sequences of Homo sapiens
[EAX02368.1], Gorilla gorilla gorilla [XP_004050413.2], Nomascus leucogenys
[XP_030657519.1], Piliocolobus tephrosceles [XP_023039225.1], Sapajus apella
[0321227361.2], Sousa chinensis [TEA36508.1], Ursus maritimus [XP_040494734.1], Canis
lupus dingo [XP_025308592], Peromyscus leucopus [XP_028728132] and Arvicola amphibius
[XP_038172384.1] for phylogenetic comparison.
A phylogenetic tree has three parts: the root, which represents the common ancestor of all
the organisms in the tree; the node, which is a junction point in the tree; and the branch,
which is the line that shows the evolutionary relationship between organisms.
The root of the phylogenetic tree is Piliocolobus tephrosceles [XP_023039225.1].
The first branch groups Peromyscus leucopus [XP_028728132] and Arvicola amphibius
[XP_038172384.1], which are evolutionarily related to each other. The second branch groups
Canis lupus dingo [XP_025308592] and Ursus maritimus [XP_040494734.1], which are also
evolutionarily related to each other. The third branch is Sousa chinensis [TEA36508.1]; the
first and second sets are related to this third set. The fourth branch is Sapajus apella
[0321227361.2], which is evolutionarily related to the first three sets. The fifth is the
root of the phylogenetic tree, i.e., Piliocolobus tephrosceles [XP_023039225.1]. The sixth
is Nomascus leucogenys [XP_030657519.1], which is related to the sets above. The final
branch groups Homo sapiens [EAX02368.1] and Gorilla gorilla gorilla [XP_004050413.2],
which are related to all six preceding branches in the phylogenetic tree. This is the
structure of the phylogenetic tree, which shows the evolutionary relationships between the
organisms.