Module 1_Session 3_Part 1

Uploaded by

mariabrowny33
Copyright
© © All Rights Reserved
Introduction to Bioinformatics Online Course : IBT

Module 1: Introduction to databases and resources

(Session 3)

Part I: Sequence identifiers and the sequence processing pipeline

Introduction to Bioinformatics online course: IBT


Bioinformatics Resources & Databases: Abeir Shalaby
Learning Outcomes
• Objective: Basic DNA sequence analysis

• ILOs : By the end of the session the trainee will be able to


• Define accession numbers and the significance of RefSeq identifiers
• Understand how to find a DNA sequence and save it in the correct format
• Identify features on the sequence and their applications such as coding
regions, restriction enzyme sites, etc.
• Interpret sequence analysis results and understand the biological impact
of functional regions
• Find the six reading frames and choose the correct open reading frame
• Design primers for amplification of a DNA sequence

How to identify genes using bioinformatics tools

Sequencing: Obtain the DNA or RNA sequence data from the organism of
interest through wet lab research using sequencing technologies such as
Sanger sequencing or next-generation sequencing (NGS). This is done first
to obtain the sequence of the gene of interest.

Preprocessing: Clean and process the sequence data to remove any errors or
artefacts introduced during sequencing, as well as any sequences that do
not align with the reference genome or transcriptome.

Gene identification and annotation: A sequence record is called 'annotated'
when biological information is added and linked to a position in the
sequence. The annotated information is represented in the sequence features
table. Annotate the genome or transcriptome to identify potential genes and
their features, such as coding regions, exons, introns, promoters, and
regulatory elements. This can be done using tools such as Ensembl, NCBI's
RefSeq, or the UCSC Genome Browser.
Gene prediction: Use computational methods to predict the location and
structure of genes within the genome or transcriptome. This can be done
using tools such as Augustus, GeneMark, or Glimmer.

Validation: Validate the predicted genes computationally by comparing them
to known genes in related organisms, or experimentally by using techniques
such as RNA sequencing or reverse transcription polymerase chain reaction
(RT-PCR) to confirm their expression.

Functional analysis: Once genes have been identified, further analysis can
be done to determine their functions, interactions, and pathways. This can
be done using resources such as the Gene Ontology (GO) and KEGG Pathway
databases.
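
The gene prediction step can be illustrated with a very naive scan of the six reading frames (three on each strand) for stretches that run from ATG to a stop codon. This is only a sketch for intuition: the function names and the min_len parameter are made up for this example, and real predictors such as Augustus, GeneMark, or Glimmer use statistical models rather than this simple rule.

```python
# Naive six-frame ORF scan. Illustration only; all names here are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ATG...stop ORF in one frame.

    Coordinates are 0-based on the scanned strand, end exclusive.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def six_frame_orfs(seq, min_len=30):
    """Return ORFs from all six frames, longest first.

    Each result is (strand, frame, start, end, orf_sequence), where frame
    is 1, 2, or 3 and coordinates refer to the strand that was scanned.
    """
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame):
                if end - start >= min_len:
                    results.append(
                        (strand, frame + 1, start, end, strand_seq[start:end])
                    )
    return sorted(results, key=lambda r: r[3] - r[2], reverse=True)


if __name__ == "__main__":
    # Toy sequence with a single short ORF in frame 3 of the forward strand.
    for orf in six_frame_orfs("CCATGAAATTTGGGTAACC", min_len=9):
        print(orf)
```

Choosing "the correct" frame is usually a matter of picking the longest ORF and then checking it against known proteins, which is why the results are sorted by length.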
Searching the database for your gene of interest

First, decide which information you want:
- Nucleic acid (NA) sequences or protein sequences
- If NA, genomic sequences or RNA-derived sequences
- All sequences that exist, or only curated ones

Then:
- Retrieve the annotated sequence
- Find and interpret the annotated information represented in the sequence features table (the
different kinds of features, e.g., gene, mRNA, coding region, tRNA)

The characterization of genomic features using computational and experimental methods
is called gene identification or annotation. It answers the following questions:
Which region codes for a protein?
Which DNA strand is used to encode the gene?
Where does the gene start and end?
Where are the exon-intron boundaries?
Where are the regulatory sequences for that gene?
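
Several of these questions can be answered directly from the location string of a feature in the features table. The sketch below assumes the usual flat-file location syntax (for example join(...) for multi-exon features and complement(...) for the reverse strand). The function names are hypothetical, the example location is made up, and rare cases such as mixed-strand features or references to other records are not handled.

```python
import re


def parse_location(location):
    """Return (strand, [(start, end), ...]) from a feature location string.

    Assumes 1-based inclusive "start..end" segments. The '<' and '>'
    markers for partial features are accepted but ignored.
    """
    strand = "-" if "complement(" in location else "+"
    segments = [
        (int(start), int(end))
        for start, end in re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    ]
    if not segments:
        raise ValueError(f"no start..end segments found in {location!r}")
    return strand, segments


def describe_feature(location):
    """Summarise strand, overall span, exons, and introns of a feature."""
    strand, exons = parse_location(location)
    exons = sorted(exons)
    # An intron is the gap between the end of one exon and the start of the next.
    introns = [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "strand": strand,
        "start": exons[0][0],
        "end": exons[-1][1],
        "exons": exons,
        "introns": introns,
    }


if __name__ == "__main__":
    # Made-up location for a two-exon gene on the reverse strand.
    print(describe_feature("complement(join(2691..4571,4918..5163))"))
```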
Data exchange between DDBJ, EMBL and GenBank occurs daily, so it is only
necessary to submit the sequence to one database, whichever one is most
convenient, without regard for where the sequence may be published.

[Figure: data exchange between the three databases; surviving labels: National Institute of Genetics (NIG), EMBL-bank]

GenBank Numbers used to Key or uniquely Identify Entries

• 1. SeqID:
Initially, the Entry Name in the LOCUS line was the only key to a GenBank entry.
This name attempted to reflect the organism and the function of the encoded gene.
Problem: it is impossible to do this systematically and uniquely as new knowledge accumulates,
so these Entry Names now change over time.
• 2. Accession Numbers:
The Accession Number was then introduced as the primary key for referencing an entry in the
database. It always stays with the entry, even when the entry is updated.
a. GenBank accession number: either one letter followed by 5 digits (e.g. X79797) or two letters followed by 6 digits (e.g. AF028831)
b. RefSeq accession number: the newer format, with a two-letter prefix and an underscore (e.g. NC_001140)

The letter prefix reflects which of the three databases (GenBank, EMBL, DDBJ) is the primary database.
Because the databases use so many different IDs, accession numbers need to be mapped in order to move between them.
An EBI tool can be used to convert a GenBank accession number to an EBI accession number:
https://www.ebi.ac.uk/ebisearch/search?db=biotools&query=GenBank
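
As a rough illustration of the accession formats mentioned on this slide (one letter plus five digits, two letters plus six digits, and the RefSeq style with a two-letter prefix and underscore), the sketch below classifies an accession string by pattern. The function name and the labels it returns are invented for this example, and the patterns cover only the formats shown here, not every accession format in use.

```python
import re

# Patterns for the formats shown on the slide only.
ONE_PLUS_FIVE = re.compile(r"^[A-Z]\d{5}$")       # e.g. X79797
TWO_PLUS_SIX = re.compile(r"^[A-Z]{2}\d{6}$")     # e.g. AF028831
REFSEQ_STYLE = re.compile(r"^[A-Z]{2}_\d+$")      # e.g. NC_001140


def classify_accession(accession):
    """Return a label describing the format of an accession string.

    A trailing ".version" suffix, if present, is ignored.
    """
    base = accession.strip().upper().split(".")[0]
    if REFSEQ_STYLE.match(base):
        return "refseq"
    if ONE_PLUS_FIVE.match(base):
        return "genbank-1+5"
    if TWO_PLUS_SIX.match(base):
        return "genbank-2+6"
    return "unknown"


if __name__ == "__main__":
    for acc in ("X79797", "AF028831", "NC_001140", "NM_000646.1"):
        print(acc, classify_accession(acc))
```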

• 3. New SeqID = Accession.Version, plus the unique GenBank Identifier ("gi|" number)
• Accession Number: identifies a sequence record. It does not change even when the sequence is changed.
• Version Number: tracks changes to the sequence itself through an integer extension of the accession
number, called the "Version". The initial version is ".1".
• GI number: each version of a sequence gets a new unique NCBI identifier, the number that follows "gi|".
Any change to the sequence data results in a new GI number, so do not search by GI.

Entrez and BLAST results both present the following formatted text as part of the returned result:

gi|4557284|ref|NM_000646.1|AGLf| [4557284]
GI (GenInfo Identifier)   4557284
Accession Number          NM_000646
Version                   NM_000646.1
LOCUS name                AGLf
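
The formatted identifier line above can be split into its parts with a few lines of code. This is a minimal sketch that assumes the pipe-separated layout shown on the slide (gi, GI number, database tag, Accession.Version, LOCUS name). The function name and the dictionary keys are chosen for this example only.

```python
def parse_gi_line(line):
    """Split a 'gi|...|ref|ACCESSION.VERSION|LOCUS|' line into named parts."""
    fields = line.strip().lstrip(">").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"not a gi-style identifier line: {line!r}")
    accession_version = fields[3]
    # The version is the integer after the dot; the accession is what precedes it.
    accession, _, version = accession_version.partition(".")
    return {
        "gi": int(fields[1]),
        "database": fields[2],
        "accession": accession,
        "version": int(version) if version else None,
        "accession_version": accession_version,
        "locus": fields[4].strip() if len(fields) > 4 else "",
    }


if __name__ == "__main__":
    print(parse_gi_line("gi|4557284|ref|NM_000646.1|AGLf| [4557284]"))
```

Keeping the accession and the version as separate fields makes it easy to ask for "the latest version of NM_000646" without depending on a GI number that changes with every sequence update.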
GeneCards human gene database
• You can use the GeneCards database to look up information on human
genes only: https://www.genecards.org/

RefSeq processing pipeline

Once sequence data are deposited in the public archival databases, they are
available to the automatic RefSeq processing pipelines, which combine
collaboration with authoritative scientists or groups outside NCBI and
curation by biological experts at NCBI.

The RefSeq processing pipelines:

1) The vertebrate curation pipeline: curation of RefSeq records, starting
from an automated BLAST step.

2) The computational genome annotation pipeline: available RefSeq and INSDC
data are aligned to an assembled genome for gene prediction, to define the
annotation models. This produces new MODEL RefSeq records.

3) Extraction from GenBank: all RefSeq bacterial and archaeal genomes, with
the exception of RefSeq Prokaryotic Reference Genomes, are annotated using
NCBI's prokaryotic genome annotation pipeline, with cross-references
(db_xref) to the source GenBank record.

http://www.ncbi.nlm.nih.gov/RefSeq/
RefSeq Accession Numbers and their resources
• Model RefSeq: RNA and protein products that are generated by the eukaryotic genome annotation pipeline.
These records use the accession prefixes XM_, XR_, and XP_.
• Known RefSeq: RNA and protein products that are mainly derived from GenBank cDNA and EST data and
are supported by the RefSeq eukaryotic curation group. These records use the accession prefixes NM_, NR_, and NP_.

mRNAs and Proteins
  NM_123456   Curated mRNA
  NP_123456   Curated protein
  NR_123456   Curated non-coding RNA
  XM_123456   Predicted mRNA
  XP_123456   Predicted protein
  XR_123456   Predicted non-coding RNA
Gene Records
  NG_123456   Reference genomic sequence
Chromosome
  NC_123455   Genomes, human chromosomes, microbial replicons, organelles
  AC_123455   Alternate assemblies
Assemblies
  NT_123456   Contig
  NW_123456   WGS supercontig
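
The prefix table lends itself to a simple lookup. The sketch below copies the prefixes and descriptions from the table above into a dictionary and reports what kind of record an accession refers to, and whether it is a Model (predicted) RefSeq. The function names are invented for this example.

```python
# Prefix descriptions copied from the slide's table.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Genomes, human chromosomes, microbial replicons, organelles",
    "AC": "Alternate assemblies",
    "NT": "Contig",
    "NW": "WGS supercontig",
}

MODEL_PREFIXES = {"XM", "XR", "XP"}


def refseq_prefix(accession):
    """Return the two-letter prefix before the underscore, or None."""
    prefix, underscore, _ = accession.strip().upper().partition("_")
    return prefix if underscore and prefix in REFSEQ_PREFIXES else None


def describe_refseq(accession):
    """Return the table description for a RefSeq accession, or None."""
    prefix = refseq_prefix(accession)
    return REFSEQ_PREFIXES[prefix] if prefix else None


def is_model_refseq(accession):
    """True for accessions produced by the genome annotation pipeline."""
    return refseq_prefix(accession) in MODEL_PREFIXES


if __name__ == "__main__":
    for acc in ("NM_000646.1", "XP_123456", "NC_001140", "X79797"):
        print(acc, "->", describe_refseq(acc), "| model:", is_model_refseq(acc))
```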
RefSeq status codes and level of curation
A combined approach uses both collaborator-supplied sequence information and automated
BLAST analysis to provide an initial RefSeq record.
The initial RefSeq record has a status of INFERRED, PREDICTED, or PROVISIONAL and
may include enhanced feature annotation.
The RefSeq status is given in the COMMENT section in the header of the GenBank file.

• PROVISIONAL - Submitted, but not reviewed.
• PREDICTED - Submitted but not reviewed, and some aspect of the RefSeq record is predicted.
• INFERRED - Not subjected to individual review; predicted by genome sequence analysis,
possibly supported by homology rather than experimental evidence.
• VALIDATED - Additional manual curation, mainly by NCBI staff, to address issues such as
sequencing errors and misassociation with a locus.
• REVIEWED - Additional annotation, a summary description, and other functional information as
available.
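
Because the status appears in the COMMENT section, it can be pulled out of a downloaded record with a small pattern match. The sketch below assumes the comment begins with the status word followed by "REFSEQ" (as in "REVIEWED REFSEQ: ..."). The function name is invented, and the record text is a made-up, abbreviated excerpt used only to exercise the code.

```python
import re

STATUS_CODES = ("PROVISIONAL", "PREDICTED", "INFERRED", "VALIDATED", "REVIEWED")

# Assumption: the COMMENT block starts with "<STATUS> REFSEQ".
STATUS_PATTERN = re.compile(
    r"^COMMENT\s+(" + "|".join(STATUS_CODES) + r")\s+REFSEQ\b",
    re.MULTILINE,
)


def refseq_status(flatfile_text):
    """Return the RefSeq status code found in the COMMENT section, or None."""
    match = STATUS_PATTERN.search(flatfile_text)
    return match.group(1) if match else None


# Hypothetical, abbreviated record used only for this sketch.
EXAMPLE_RECORD = """\
LOCUS       EXAMPLE_0001    1200 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical record used only for this sketch.
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.
FEATURES             Location/Qualifiers
"""

if __name__ == "__main__":
    print(refseq_status(EXAMPLE_RECORD))
```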

• https://www.ncbi.nlm.nih.gov/books/NBK21105/#ch1.Appendix_GenBank_RefSeq_TPA_and_UniP

Retrieving a RefSeq record depending on the biological data type

The increasing size of the database and the diversity of available data
sources have made it convenient to split GenBank into smaller, discrete
divisions.

PRI   Primate            PHG   Bacteriophage
ROD   Rodent             SYN   Synthetic
MAM   Mammalian          EST   Expressed sequence tags
VRT   Vertebrate         PAT   Patent
INV   Invertebrate       STS   Sequence tagged sites
PLN   Plant, fungal      GSS   Genome survey sequence
BCT   Bacterial
RNA   Structural RNA
VRL   Viral
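
The division code of a record appears near the end of its LOCUS line, just before the date. As a minimal sketch, the code below reads that code and looks it up in a dictionary built from the table above (using PLN, GenBank's code for the plant and fungal division). The function name is invented and the LOCUS line is a made-up example.

```python
# Division codes from the table above. PLN is GenBank's plant/fungal code.
DIVISIONS = {
    "PRI": "Primate",
    "ROD": "Rodent",
    "MAM": "Mammalian",
    "VRT": "Vertebrate",
    "INV": "Invertebrate",
    "PLN": "Plant, fungal",
    "BCT": "Bacterial",
    "RNA": "Structural RNA",
    "VRL": "Viral",
    "PHG": "Bacteriophage",
    "SYN": "Synthetic",
    "EST": "Expressed sequence tags",
    "PAT": "Patent",
    "STS": "Sequence tagged sites",
    "GSS": "Genome survey sequence",
}


def locus_division(locus_line):
    """Return (code, description) for the division named on a LOCUS line.

    Assumes the line ends with the division code followed by the date.
    """
    tokens = locus_line.split()
    if len(tokens) < 3 or tokens[0] != "LOCUS":
        raise ValueError(f"not a LOCUS line: {locus_line!r}")
    code = tokens[-2]
    return code, DIVISIONS.get(code, "unknown division")


if __name__ == "__main__":
    # Hypothetical LOCUS line used only for this sketch.
    line = "LOCUS       EXAMPLE01    1200 bp    mRNA    linear   PRI 01-JAN-2000"
    print(locus_division(line))
```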

