Biotools Pca2
Assignment No: 8
Aim: Review the use of sequence manipulation tools (EMBOSS extractseq) to extract
subsequences.
Theory:
EMBOSS extractseq reads a sequence and writes sub-sequences from it to file. The set of
regions to extract is specified on the command-line or in a file as pairs of start and end
positions. The regions are written in the order in which they are specified. Thus, if the
sequence AAAGGGTTT has been input and the regions: 7-9, 3-4 have been specified, then
the output sequence will be: TTTAG. Optionally, each region may be written out as a separate
sequence.
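The region logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the EMBOSS implementation; the function name extract_regions and its separate flag are hypothetical.

```python
def extract_regions(sequence, regions, separate=False):
    """Return sub-sequences for 1-based, inclusive (start, end) pairs.

    Regions are emitted in the order given, mirroring the behaviour
    described for extractseq. Hypothetical helper, not EMBOSS code.
    """
    pieces = []
    for start, end in regions:
        if not (1 <= start <= end <= len(sequence)):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        # Convert 1-based inclusive coordinates to a Python slice.
        pieces.append(sequence[start - 1:end])
    # Either keep each region as its own sequence or join them in order.
    return pieces if separate else "".join(pieces)


print(extract_regions("AAAGGGTTT", [(7, 9), (3, 4)]))
print(extract_regions("AAAGGGTTT", [(7, 9), (3, 4)], separate=True))
```

With the example from the Theory section, the joined result is TTTAG, and the separate mode returns the two regions as individual sequences.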
Procedure:
1. Accessed EMBOSS-extractseq website (via
https://www.bioinformatics.nl/cgi-bin/emboss/extractseq)
2. Pasted FASTA sequence manually in input text area under Input section > 3.
3. Under Required section > Regions to extract – 4-23 range was provided for extraction
4. Clicked on Run extractseq to obtain output
Output:
Bioinformatics Lab (MSBIN194)
Maulana Abul Kalam Azad University of Technology, West Bengal
Assignment No: 9
Aim: Review the use of sequence manipulation tools (EMBOSS extractalign) to extract
subsets of sequence alignments
Theory:
EMBOSS extractalign allows you to specify one or more regions of a sequence alignment to
extract subsequences from to build up a resulting sub-sequence alignment. extractalign reads
in a sequence alignment and a set of regions of that alignment as specified by pairs of start
and end positions (either on the command-line or contained in a file) using gapped alignment
positions as the coordinates, and writes out the specified regions of the input sequence in the
order in which they have been specified. Thus, if the sequence "AAAGGGTTT" has been
input and the regions: "7-9, 3-4" have been specified, then the output sequence will be:
"TTTAG".
Procedure:
1. Accessed EMBOSS-extractalign website (via
https://www.bioinformatics.nl/cgi-bin/emboss/extractalign)
2. Followed Lab 5 instructions to generate MSA for 3 sequences:
a. Carex littledalei KAF3341700.1
b. Zea mays NP_001131801.1
c. Vigna angularis KAG2409240.1
3. Pasted the MSA result into the input text area under Input section > 3.
4. Under Required section > Regions to extract – 4-23 range was provided for extraction
5. Clicked on Run extractalign to obtain output.
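The column-wise extraction that extractalign performs can be sketched as follows. This is a minimal illustration, not the EMBOSS code: the function name extract_alignment_regions and the toy three-sequence alignment are made up for the example, and coordinates are gapped alignment positions as described in the Theory section.

```python
def extract_alignment_regions(alignment, regions):
    """Cut the same gapped columns out of every aligned sequence.

    alignment: dict mapping sequence name -> gapped sequence (equal lengths).
    regions:   list of 1-based, inclusive (start, end) alignment positions.
    Hypothetical helper for illustration only.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = lengths.pop()
    for start, end in regions:
        if not (1 <= start <= end <= width):
            raise ValueError(f"region {start}-{end} is outside the alignment")
    # Gap characters count as positions, so slicing the gapped strings
    # directly keeps the extracted columns aligned with each other.
    return {
        name: "".join(seq[start - 1:end] for start, end in regions)
        for name, seq in alignment.items()
    }


toy_alignment = {
    "seq1": "AAAGGGTTT",
    "seq2": "AA-GGCTTT",
    "seq3": "AAAG-GTT-",
}
for name, sub in extract_alignment_regions(toy_alignment, [(7, 9), (3, 4)]).items():
    print(name, sub)
```

Because the coordinates refer to alignment columns rather than ungapped residues, every output sequence has the same length and remains a valid sub-alignment.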
Output:
Assignment No: 10
Aim: Review the use of WebLogo tool to visualize sequence motifs and domains of interest
by using inputs from MSA outputs.
Theory:
WebLogo is a web-based application designed to make the generation of sequence logos as
easy and painless as possible. A sequence logo is a graphical representation of an amino acid
or nucleic acid multiple sequence alignment. Each logo consists of stacks of symbols, one
stack for each position in the sequence. The overall height of the stack indicates the sequence
conservation at that position, while the height of symbols within the stack indicates the
relative frequency of each amino or nucleic acid at that position. The width of the stack is
proportional to the fraction of valid symbols in that position. (Positions with many gaps have
thin stacks.) In general, a sequence logo provides a richer and more precise description of, for
example, a binding site, than would a consensus sequence.
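The stack and symbol heights described above can be approximated with a short calculation. The sketch below assumes a nucleotide alignment (4 symbols, so at most 2 bits per position), ignores gap characters, and omits the small-sample correction that WebLogo applies; the function name column_information and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Return (stack height in bits, per-symbol heights) for one column.

    Stack height = log2(alphabet_size) - Shannon entropy of the column.
    Symbol height = relative frequency * stack height.
    Simplified sketch: no small-sample correction, gaps are skipped.
    """
    symbols = [c for c in column.upper() if c != "-"]
    if not symbols:
        return 0.0, {}
    counts = Counter(symbols)
    total = len(symbols)
    freqs = {s: n / total for s, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - entropy
    heights = {s: f * info for s, f in freqs.items()}
    return info, heights


toy_rows = ["ACGT", "ACGA", "ACTC"]  # made-up aligned sequences
for position, column in enumerate(zip(*toy_rows), start=1):
    info, heights = column_information("".join(column))
    print(position, round(info, 3), {s: round(h, 3) for s, h in heights.items()})
```

A fully conserved column reaches the maximum of 2 bits, while a column with all four nucleotides at equal frequency has a height of 0 bits.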
Procedure:
1. Accessed WebLogo website (via https://weblogo.threeplusone.com/create.cgi)
2. Pasted the MSA output from Lab 9 as shown in the figure below.
3. Clicked on Create WebLogo to obtain output.
Output:
• WebLogo generates sequence logos that visually represent the conservation of residues
(nucleotides or amino acids) at each position in an alignment.
• The overall height of each stack indicates the degree of conservation (measured as information
content), while the size of individual letters within a stack reflects the relative frequency of each
residue at that position.
Reference:
Crooks, Gavin E., et al. "WebLogo: a sequence logo generator." Genome Research 14.6 (2004): 1188-1190.
Assignment No: 11
Aim: Review the use of the Primer3 tool for designing PCR primers.
Theory:
Primer3 is a widely used program for designing PCR primers (PCR = "Polymerase Chain
Reaction"). PCR is an essential and ubiquitous tool in genetics and molecular biology.
Primer3 can also design hybridization probes and sequencing primers.
Procedure:
1. Accessed Primer3Web website (via https://primer3.ut.ee/)
2. Pasted nucleotide sequence of the Human amyloid beta precursor protein
(https://www.ncbi.nlm.nih.gov/nuccore/NM_000484.4)
3. Kept all parameters at their default values for demonstration
4. Clicked on Pick Primers
Output:
Assignment No: 12
Aim: Review the use of WSL (Windows Subsystem for Linux) on Windows.
Theory: Windows Subsystem for Linux is a feature of Windows that allows developers to
run a Linux environment without the need for a separate virtual machine or dual booting.
There are two versions of WSL, WSL 1 and WSL 2.
Procedure:
Output:
Assignment No: 13
Aim: Review the use of Linux commands for navigating through file system and performing
basic file operations.
Theory:
Linux is a family of open-source Unix-like operating systems based on the Linux kernel, an
operating system kernel first released on September 17, 1991, by Linus Torvalds. A few
reasons why Bioinformatics requires the know-how about navigating through a Linux system
are:
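1. Many bioinformatics tools are developed for Linux and are available only, or most readily, on it
2. Command-line processing of large biological datasets is generally better supported on Linux
than on Windows
3. Linux distributions ship with, or make it easy to install, programming languages and compilers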
4. Creating a biological analysis pipeline can be done easily in Linux
5. Linux is free
Output:
Assignment No: 14
Aim: Review the use of FGENESH to predict structure of a given eukaryotic gene.
Theory:
FGENESH is an HMM-based gene structure prediction program (multiple genes, both strands). It
was reported as the most accurate program for plant gene identification in Plant Molecular
Biology (2005), 57, 3, 445-460: "Five ab initio programs (FGENESH, GeneMark.hmm,
GENSCAN, GlimmerR and Grail) were evaluated for their accuracy in predicting maize genes.
FGENESH yielded the most accurate and GeneMark.hmm the second most accurate
predictions" (FGENESH identified 11% more correct gene models than GeneMark on a set of
1353 test genes).
Procedure:
1. Accessed FGENESH via SoftBerry website
(http://www.softberry.com/cgi-bin/programs/gfind/fgenesh.pl)
2. Pasted first 128318 nucleotides from Homo sapiens under section Paste nucleotide
sequence here
3. Selected Mouse (generic) from organism dropdown.
4. Clicked search to obtain result
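Taking the first N nucleotides of a FASTA record, as in step 2, can be done by hand or with a small script. The sketch below is one possible way to do it; the function name first_n_nucleotides and the toy record are hypothetical, and only the first record in the text is used.

```python
def first_n_nucleotides(fasta_text, n):
    """Return (header, first n bases) of the first FASTA record in the text.

    Hypothetical helper: sequence lines are joined before slicing, so the
    cut does not depend on how the file is line-wrapped.
    """
    header = None
    chunks = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
            continue
        chunks.append(line)
    return header, "".join(chunks)[:n]


toy_fasta = ">toy record\nACGTAC\nGTACGT\n>second record\nTTTT\n"
header, prefix = first_n_nucleotides(toy_fasta, 8)
print(f">{header} (first {len(prefix)} nt)")
print(prefix)
```

For the run described above, n would be 128318 and the input would be the downloaded genomic FASTA file read into a string.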
Output:
Fgenesh output:
G - predicted gene number, starting from start of sequence;
Str - DNA strand (+ for direct or - for complementary);
Assignment No: 15
Aim: Review the use of FGENESB for prediction of bacterial genes.
Theory:
FGENESB is a suite of bacterial operon and gene prediction programs. Its gene-finding
component is pattern/Markov chain-based, and Softberry describes it as the fastest (the E. coli
genome is annotated in approximately 14 sec) and most accurate ab initio bacterial gene
prediction program available. FGENESB uses genome-specific parameters learned by the
FgenesB-train script, which requires only the DNA sequence of the genome of interest as
input and automatically creates a file with gene prediction parameters for the analyzed
genome. Creating such a file for the E. coli genome from its sequence took only a few minutes.
Procedure:
1. Accessed FGENESB via SoftBerry website
(http://www.softberry.com/cgi-bin/programs/gfindb/fgenesb.pl)
2. Pasted first 125177 nucleotides from Helicobacter pylori genome under section Paste
nucleotide sequence here
3. Selected Helicobacter pylori 26695 from organism dropdown.
4. Provided Table of Genetic code = 11 as it is a bacterial species.
5. Clicked on Process to obtain result
Output:
Assignment No: 16
Theory:
BPROM is a bacterial sigma70 promoter recognition program with about 80% accuracy and
specificity. It is best used in regions immediately upstream of the ORF start for improved gene
and operon prediction in bacteria.
Procedure:
1. Accessed BPROM via SoftBerry website (via
http://www.softberry.com/cgi-bin/programs/gfindb/bprom.pl)
2. Pasted first 49240 nucleotides from Vibrio cholerae strain Amazonia genome under section
Paste nucleotide sequence here
3. Selected Acinetobacter baumannii 26695 from organism dropdown.
4. Clicked on Process to obtain result
Output:
Assignment No: 17
Theory:
The FGENESV algorithm is based on pattern recognition of different types of signals and Markov
chain models of coding regions. The optimal combination of these features is then found by
dynamic programming, and a set of gene models is constructed along the given sequence. There
are two variants of the viral gene prediction program: FGENESV0, which is suited for small
(<10 kb) genomes and uses generic parameters of coding regions, and FGENESV, which learns
genome-specific parameters using the viral genome sequence as input. As additional
parameters, the user can choose the Linear or Circular form of the virus and select an alternative
genetic code (the Standard code is the default): The Bacterial and Plant Plastid Code
(transl_table=11) or The Mold, Protozoan, and Coelenterate Mitochondrial Code and the
Mycoplasma/Spiroplasma Code (transl_table=4).
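The choice of genetic code matters because it changes which codons terminate a reading frame. The sketch below is a toy illustration, not the FGENESV algorithm: it only scans for the first ATG and the next in-frame stop codon. The stop-codon sets follow the NCBI translation tables (in table 4, TGA encodes tryptophan rather than a stop); the function name first_orf and the test sequence are hypothetical.

```python
# Stop codons per NCBI translation table (DNA alphabet).
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},   # Standard code
    11: {"TAA", "TAG", "TGA"},  # Bacterial and Plant Plastid code
    4: {"TAA", "TAG"},          # Table 4: TGA is read as Trp, not stop
}


def first_orf(dna, table=1):
    """Return the ORF from the first ATG to the first in-frame stop codon.

    Returns an empty string if there is no ATG or no in-frame stop.
    Toy scanner for illustration; real gene finders also model signals
    and coding statistics.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    for i in range(start, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS[table]:
            return dna[start:i + 3]
    return ""


toy = "ATGAAATGACCCTAA"  # made-up sequence: ATG AAA TGA CCC TAA
print(len(first_orf(toy, table=11)), first_orf(toy, table=11))
print(len(first_orf(toy, table=4)), first_orf(toy, table=4))
```

Under table 11 the frame ends at the internal TGA, whereas under table 4 the same frame is read through to the final TAA, so the predicted coding region is longer.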
Procedure:
1. Accessed FGENESV0 via SoftBerry website (via
http://www.softberry.com/cgi-bin/programs/gfindb/virus0.pl)
2. Pasted the nucleotide sequence of the MYC proto-oncogene from NCBI
(https://www.ncbi.nlm.nih.gov/nuccore/OP763993.1?report=fasta) under section Paste
nucleotide sequence here
3. Selected Linear
4. Selected The Bacterial and Plant Plastid Code as it is a viral species
5. Clicked on Process to obtain result
Output:
Assignment No: 18
Aim: Review the use of Cytoscape to analyse and visualize given miRNA-TargetFactor
bipartite biological networks
Theory:
Cytoscape is an open-source bioinformatics software platform for visualizing molecular interaction
networks and integrating with gene expression profiles and other state data. Additional features are
available as plugins.
Procedure:
1. Downloaded and installed Cytoscape on the local system (via: https://cytoscape.org/)
2. Clicked File > Import > Network from File (Selected miRNA-TF.csv)
3. Clicked File > Import > Table from File (File containing node type information)
4. Modified and applied different color to each node type from the
Style tab on the left as shown below.
5. Color is applied by selecting the NodeType column and Mapping
Type is selected as Discrete Mapping
6. To perform clustering analysis two separate plugins are executed
after installing them from the AppStore via Apps > App Manager…
a. MCODE
b. CytoHubba
7. MCODE is run from the same menu: Apps > MCODE > Analyze
Current Network and the result is visualized next to the original
network
8. CytoHubba is run from the same menu and visualized as a separate subnetwork:
a. Apps > CytoHubba
b. Node’s score > Calculate > EPC (Best according to CytoHubba publication) > Submit
Output:
Assignment No: 19
Aim: To find the microRNAs predicted to target a gene using miRDB
Theory: miRDB (microRNA Database) is an online resource that focuses on predicting the
target genes of microRNAs (miRNAs) and is widely used for bioinformatics research and
analysis. MicroRNAs are small, non-coding RNA molecules that play crucial roles in
regulating gene expression by binding to messenger RNAs (mRNAs), often leading to mRNA
degradation or inhibition of translation.
Procedure:
1. Go to the miRDB website (https://mirdb.org/).
2. Choose the "Target Search" option for miRNA target analysis.
3. Enter the gene identifier 5922 (specific to your database, e.g., NCBI Entrez ID).
4. Click the search button to retrieve miRNAs predicted to target the entered gene.
5. A list of miRNAs targeting the gene will be displayed. Each miRNA will have
associated scores, indicating the confidence of the prediction (higher scores mean
stronger predictions).
6. Evaluate the potential miRNA-gene interactions based on prediction scores and their
relevance to your research.
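Steps 5 and 6 amount to ranking the predictions and keeping those above a chosen score. The sketch below shows one way to do that on a copied result table. Everything in it is illustrative: the miRNA names and scores are placeholders rather than real miRDB output, the function name filter_predictions is hypothetical, and the cutoff of 80 is an assumption that should be adjusted to the stringency the analysis needs.

```python
def filter_predictions(predictions, min_score=80):
    """Keep predictions with score >= min_score, highest score first.

    predictions: list of dicts with "mirna" and "score" keys
    (hypothetical structure for results copied from the web page).
    """
    kept = [p for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda p: p["score"], reverse=True)


# Placeholder data, not real miRDB results.
example_predictions = [
    {"mirna": "miRNA-A", "score": 97},
    {"mirna": "miRNA-B", "score": 62},
    {"mirna": "miRNA-C", "score": 85},
    {"mirna": "miRNA-D", "score": 51},
]

for hit in filter_predictions(example_predictions):
    print(hit["mirna"], hit["score"])
```

Lowering min_score returns more candidate miRNAs at the cost of including weaker predictions.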
Output:
References:-
MicroRNA discovery and profiling in human embryonic stem cells by deep sequencing of small RNA
libraries. Bar M, Wyman SK, Fritz BR, Qi J, Garg KS, Parkin RK, Kroh EM, Bendoraite A, Mitchell PS, Nelson
AM, Ruzzo WL, Ware C, Radich JP, Gentleman R, Ruohola-Baker H, Tewari M. Stem Cells. 2008
Oct;26(10):2496-505.
miRBase: microRNA sequences, targets and gene nomenclature. Griffiths-Jones S, Grocock RJ, van Dongen
S, Bateman A, Enright AJ. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D140-4.
Assignment No: 20
Aim: To find the transcription factor targets of a gene using hTFtarget
Theory:
hTFtarget is a database designed to systematically explore transcription factor (TF)
targets in humans. It integrates various data sources, such as ChIP-seq experiments,
motif-based predictions, and literature mining, to provide a comprehensive repository of
TF-target interactions. By leveraging multi-omics data and bioinformatics tools, it helps
researchers identify regulatory networks, analyze transcriptional regulation, and
understand TF-gene interactions in different biological contexts.
Procedure:
1. Visit the hTFtarget website.
2. Ensure you have the specific details about the gene, transcription factor, or condition
you're investigating.
3. Enter the target gene's name (e.g., "RASA2") in the search bar.
4. Select the appropriate options for species (e.g., human or mouse) and data type (e.g.,
ChIP-seq, motif-based, etc.).
5. If you are interested in a specific transcription factor (e.g., "NEFL"), enter its name.
Output:
Assignment No: 21
Aim:- To construct a phylogenetic tree using the software MEGA 11.
Theory:- MEGA (Molecular Evolutionary Genetics Analysis) is an integrated tool for
automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based
databases, estimating rates of molecular evolution, and testing evolutionary hypotheses.
Procedure:-
• Install the MEGA 11 software.
• Go to NCBI, choose a nucleotide sequence of any organism and BLAST the sequence. Here I
have chosen a nucleotide sequence of Helicobacter pylori.
• Download the first five sequences from the BLAST result in FASTA format.
• Open MEGA, go to the Align option and click on the Edit/Build Alignment option.
• Select Create a new alignment and click OK.
• Select DNA in the data type box.
• Go to the Edit option and click on Insert Sequence From File.
• Choose the downloaded BLAST sequence file.
• Select the sequences that have opened.
• Go to the Alignment option and click on Align by ClustalW.
• Go to Data, go to Export Alignment and click on MEGA Format.
• Select Construct tree by the Neighbour-Joining method.
• Click Yes to activate the current data in the box.
• Save with a name of your own choice, input a title for the data and click OK. When the
confirmation box opens, click on the No option.
• Go to the home screen of MEGA 11; to make the phylogenetic tree, click on the Phylogeny
option.
• Select the file that was saved in MEGA format. The file is now loaded in the software.
• Click on the TA icon, select only the sequences, go back to the Phylogeny option and again
select Construct tree by the Neighbour-Joining method.
• Select Bootstrap method in the Test of Phylogeny box, set the number of bootstrap replications
in the phylogeny reconstruction box and click on the OK button.
• The phylogenetic tree has been constructed.
Output:-
Assignment No: 22
Aim:- To predict the secondary structure of a protein using PSIPRED, a bioinformatics tool
that analyzes amino acid sequences to identify structural features such as alpha-helices, beta-
strands, and coils.
Theory:- The PSIPRED Workbench provides a range of protein structure prediction
methods. The site can be used interactively via a web browser or programmatically via its
REST API. For high-throughput analyses, downloads of all the algorithms are available.
Amino acid sequences enable: secondary structure prediction, including regions of disorder
and transmembrane helix packing; contact analysis; fold recognition; structure modelling;
and prediction of domains and function. In addition PDB Structure files allow prediction of
protein-metal ion contacts, protein-protein hotspot residues, and membrane protein
orientation.
Procedure:-
1. Select whether the input data is a protein sequence or a PDB structure file.
2. Choose at least one predictive method to run.
3. If "Sequence Data" is selected, enter the amino acid sequence or MSA on the PSIPRED page
(http://bioinf.cs.ucl.ac.uk/psipred/).
4. Alternatively, the sequence can be entered in FASTA format, but the description text will
be ignored by the server. Note that MSA data must be in FASTA format. There is an upper
limit of 1,500 residues on the length of sequences that can be submitted. If the sequence is
longer than this, try breaking it into likely domains before submitting it.
5. If you selected "PDB Structure" then use the Browse button to find the location of the PDB
file you wish to predict.
6. Click the Submit button to begin the prediction process.
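Before submitting, the input can be checked against the length limit mentioned in step 4. The sketch below is a local pre-check only and does not call the PSIPRED server or its REST API; the function name prepare_sequence is hypothetical, and the limit of 1,500 residues is taken from the procedure above.

```python
MAX_RESIDUES = 1500  # upper limit stated in the procedure


def prepare_sequence(text, max_residues=MAX_RESIDUES):
    """Strip a FASTA description line and whitespace, then check the length.

    Returns the bare residue string, or raises ValueError if it exceeds
    max_residues. Hypothetical helper for a single sequence.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # The description text is ignored by the server, so drop it here too.
    residues = "".join(ln for ln in lines if not ln.startswith(">")).upper()
    if len(residues) > max_residues:
        raise ValueError(
            f"sequence has {len(residues)} residues; limit is {max_residues}. "
            "Consider splitting it into likely domains."
        )
    return residues


example = ">example protein (made-up)\nMKTAYIAKQR\nQISFVKSHFS\n"
cleaned = prepare_sequence(example)
print(len(cleaned), cleaned)
```

A sequence that fails the check would need to be split into likely domains, as the procedure suggests, with each part submitted separately.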
Output:-