Bioinformatics Lab Assignment 2
When doing database searches, your search terms need to be as specific as possible in order to
eliminate large returns of data, some of it useless. A record is an individual file or NCBI “hit”
obtained from a search.
1. Start at the main NCBI page. Use All Databases on the NCBI home page. To retrieve a large
number of results, use “retinol binding protein” as your search term. How many hits did you
find in the Entrez Page?
1. Would you get a different answer without using the quotation marks around your search
term? Why?
2. Now try “retinol binding protein 4” on the All Databases search. How many proteins do you
find in the Entrez Page?
The term “rpb4” in the All Databases search likely refers to the gene encoding the fourth-largest
subunit of RNA polymerase II (Pol II), a crucial enzyme responsible for synthesizing messenger
RNA in eukaryotic cells. In Saccharomyces cerevisiae (baker’s yeast), this gene is known as
RPB4, and in humans, it is encoded by the POLR2D gene.
3. How many proteins do you find using the All Databases Search?
There are 42 proteins found in the All Databases search.
4. In question 3 you are actually looking for the full-length rbp4 for Homo sapiens with
accession number NP_006735.
5. What about searching with this tool do you think still makes you get other hits when you type in
“rbp4 homo sapiens”?
6. What is the full name of this gene’s protein product?
7. Give a brief description of what the protein does. If you quote a record, give me the link you
used.
https://en.wikipedia.org/wiki/Retinol_binding_protein_4
The number of amino acids in retinol-binding proteins varies depending on the specific type of
retinol-binding protein (RBP).
1. Human Serum Retinol-Binding Protein (RBP4): This protein consists of 182 amino
acid residues.
2. Human Cellular Retinol-Binding Protein 1 (CRBP1): This protein comprises 135
amino acid residues.
These variations are due to differences in the specific functions and structures of each RBP type.
9. Are there functional protein domains described for this protein? You will find this in the
conserved domain database. This is either in RefSeq or can be linked from Domains through
the record. List them.
Yes, the RBP4 (Retinol Binding Protein 4) protein contains functional domains that are well-
characterized in the Conserved Domain Database (CDD) and other resources like UniProt and
InterPro.
10. How many amino acids are in the sig_peptide____________? What is the sig peptide? How
many are in the mat_peptide__________? What is the mat peptide?
The signal peptide is a short amino acid sequence at the beginning (N-terminus) of a newly
synthesized protein. It typically consists of 15–30 amino acids.
Its function is to direct the protein to the endoplasmic reticulum (ER) for secretion or
membrane insertion in eukaryotic cells.
After reaching the ER, the signal peptide is usually cleaved off by a signal peptidase.
It is not part of the mature protein.
The mature peptide is the final, functional form of the protein after all processing steps (like
signal peptide removal, cleavage, and folding).
To get the exact number, you'd need the amino acid sequence or information from a database
(e.g., UniProt or NCBI).
Amino acids in the mat_peptide:
You can calculate this number by subtracting the number of amino acids in the signal peptide
from the full precursor protein length (if there are no other propeptides or cleavages).
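The subtraction described above can be sketched in a few lines of Python. This is only an illustration: the function name and the example lengths (200 and 20) are hypothetical placeholders, not values taken from the RBP4 record. The real numbers should be read from the sig_peptide and mat_peptide features of the record itself.

```python
def mature_peptide_length(precursor_len: int, signal_len: int, other_cleaved: int = 0) -> int:
    """Return the number of residues left after processing.

    precursor_len : length of the full precursor protein (amino acids)
    signal_len    : length of the signal peptide that is cleaved off
    other_cleaved : residues removed by any other cleavage (0 if none)
    """
    if min(precursor_len, signal_len, other_cleaved) < 0:
        raise ValueError("lengths must be non-negative")
    mature = precursor_len - signal_len - other_cleaved
    if mature <= 0:
        raise ValueError("cleaved segments are as long as the precursor or longer")
    return mature


# Hypothetical placeholder lengths, NOT taken from a database record.
print(mature_peptide_length(precursor_len=200, signal_len=20))  # 180
```

If the record lists additional propeptides, their combined length would go in the other_cleaved argument.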
11. What does CDS stand for and how many nucleotides are in the CDS for this gene?
CDS stands for coding sequence (sometimes expanded as Coding DNA Sequence). It refers to the
portion of a gene's DNA or RNA that codes for protein; specifically, it is the region that is
translated into amino acids. To determine how many nucleotides are in the CDS for a particular
gene, you need the gene sequence or a record from a database such as NCBI, Ensembl, or a
genome browser. Depending on what is available:
The gene sequence: count the number of nucleotides in the CDS.
The gene name: look up the record and read the CDS length from it.
A FASTA or GenBank file: work out the CDS length from the coordinates of the CDS feature.
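As a rough sketch of the counting described above, the snippet below derives a CDS length in two ways: from a GenBank-style location string, and from a protein length. The function names and the example coordinates are hypothetical and are not taken from the RBP4 record. It assumes 1-based inclusive coordinates and a CDS that includes the stop codon, and it does not handle partial locations marked with < or >.

```python
import re


def cds_length_from_location(location: str) -> int:
    """Sum the lengths of all start..end intervals in a location string.

    Coordinates are treated as 1-based and inclusive, so each interval
    contributes end - start + 1 nucleotides. Works for a simple location
    ("51..353") and for a spliced one ("join(10..19,30..39)").
    """
    intervals = re.findall(r"(\d+)\.\.(\d+)", location)
    if not intervals:
        raise ValueError(f"no start..end interval found in {location!r}")
    return sum(int(end) - int(start) + 1 for start, end in intervals)


def cds_length_from_protein(aa_count: int, include_stop: bool = True) -> int:
    """Three nucleotides per codon, plus one codon for the stop if requested."""
    codons = aa_count + (1 if include_stop else 0)
    return 3 * codons


# Hypothetical example values, for illustration only.
print(cds_length_from_location("51..353"))              # 303
print(cds_length_from_location("join(10..19,30..39)"))  # 20
print(cds_length_from_protein(100))                     # 303
```

The two approaches should agree for a complete CDS: a 100-residue precursor plus a stop codon corresponds to 303 nucleotides.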
12. Can you find any PubMed references for this gene? Give me the link(s) of 3 of these.
13. What does it mean when the record states that it has been “curated by NCBI staff?”
When a record states that it has been “curated by NCBI staff,” it means that the information in
that record has been reviewed, verified, and possibly edited by experts at the National Center
for Biotechnology Information (NCBI). This manual curation ensures greater accuracy,
consistency, and reliability of the data, compared to automatically generated or unreviewed
records.
For instance, in the study of retinol-binding proteins, which are crucial for vitamin A
transport and metabolism, RefSeq offers a comprehensive and accurate reference for the
RBP1 gene and its associated protein. This enables researchers to confidently interpret
sequence variations, understand gene function, and explore potential implications in health
and disease.
RefSeq's utility extends beyond individual gene studies. Its integration with other NCBI
resources facilitates comparative genomics, evolutionary studies, and the development of
diagnostic tools. By providing a stable and consistent coordinate system, RefSeq supports the
accurate reporting of clinical variations and enhances the reproducibility of bioinformatics
analysis.
The RBP4 gene, which encodes retinol binding protein 4, is located on chromosome 10 at
cytoband 10q23.33. Its precise genomic coordinates are 93591694 to 93601744 on the reverse
strand of chromosome 10, according to the GRCh38.p14 human genome assembly.
This gene is situated on the long arm (q arm) of chromosome 10, specifically in the 10q23.33
region. The "q" designation indicates the long arm of the chromosome, distinguishing it from the
short arm, labeled "p".
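To make the coordinates and the cytoband notation concrete, here is a small sketch that computes the length of the genomic span quoted above and splits the cytoband label into its parts. The helper names are made up for this example, and the regular expression only covers simple labels such as 10q23.33.

```python
import re


def span_length(start: int, end: int) -> int:
    """Length of a region given 1-based inclusive coordinates."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


def parse_cytoband(label: str) -> tuple:
    """Split a label like '10q23.33' into (chromosome, arm, band)."""
    match = re.fullmatch(r"(\d+|X|Y)([pq])(\d+(?:\.\d+)?)", label)
    if match is None:
        raise ValueError(f"unrecognised cytoband label: {label!r}")
    chromosome, arm_letter, band = match.groups()
    arm = "short" if arm_letter == "p" else "long"
    return chromosome, arm, band


# Coordinates as quoted in the text for RBP4 on GRCh38.p14.
print(span_length(93_591_694, 93_601_744))  # 10051
print(parse_cytoband("10q23.33"))           # ('10', 'long', '23.33')
```

The span of about 10 kb covers the whole gene region, including introns, so it is much longer than the coding sequence alone.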
1. Entire Chromosome 10
17. Click on Map viewer. What is the accession number of the genomic contig for
RBP4_____________? How many nucleotides does it contain___________? What is a
genomic contig?
A genomic contig (short for contiguous sequence) is a continuous stretch of DNA
sequence that has been assembled from shorter sequence reads during genome
sequencing.
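The idea of assembling a contig from shorter reads can be illustrated with a toy example. The sketch below is a deliberately simplified model, not how real assemblers work: it assumes the reads are error-free, already in left-to-right order, and on the same strand. The read sequences and function names are invented for the illustration.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(reads: list, min_len: int = 3) -> str:
    """Merge ordered, overlapping reads into one contiguous sequence."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            raise ValueError("gap between reads: they cannot form a single contig")
        contig += read[k:]  # append only the part that is not already covered
    return contig


# Invented toy reads.
print(merge_reads(["ATGGCGT", "GCGTACC", "ACCTTGA"]))  # ATGGCGTACCTTGA
```

A gap between reads raises an error here, which mirrors the definition: a contig is contiguous, so it ends wherever the overlapping coverage ends.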
18. Click on the annotation links labeled sv, pr, dl, ev, mm, hm, sts in the pink box in Map
viewer. What does each of these abbreviations stand for?
In NCBI's Map Viewer, the annotation links labeled sv, pr, dl, ev, mm, hm, and sts correspond
to specific tools and resources that provide detailed information about genomic regions. Here's
what each abbreviation stands for:
1. sv – Sequence Viewer
o Displays a graphical representation of the nucleotide sequence for the selected
region, allowing users to view gene structures, exons, and other genomic features.
2. pr – Protein
o Links to the protein sequence(s) associated with the gene or genomic region of
interest, providing insights into the translated product.
3. dl – Download
o Offers options to download sequence data from the specified chromosomal region
in various formats for further analysis.
4. ev – Evidence Viewer
o Shows alignments of RefSeq and GenBank transcript sequences (such as mRNAs
and ESTs) to the genomic contig, highlighting supporting evidence for gene
models.
5. mm – Model Maker
o Provides tools to construct or refine gene models based on available transcript
data and genomic sequence, aiding in the prediction of gene structure.
6. hm – HomoloGene
o Links to the HomoloGene database, which identifies homologous genes across
different species, facilitating comparative genomics studies.
7. sts – Sequence Tagged Site
o Directs to information about Sequence Tagged Sites, which are short, unique
DNA sequences used as landmarks in genetic mapping and marker-assisted
studies.
These tools collectively enhance the functionality of Map Viewer by providing comprehensive
resources for viewing, analyzing, and interpreting genomic data.
19. Does this gene contain introns? If so, how many and where are the splice junctions? Which
link did you use to discover this? There are several, including looking for the gene name in
the genomic contig sequence, or looking in the whole chromosome sequence.
Yes, the RBP4 gene (retinol binding protein 4) contains introns. According to the NCBI Gene
database, the RBP4 gene comprises 8 exons. This suggests the presence of introns between these
exons, as the gene is transcribed into precursor mRNA (pre-mRNA) that includes both exons and
introns.
To identify the exact number and locations of introns, as well as the splice junctions, you can
use the NCBI Genome Data Viewer or the NCBI Gene database. By examining the graphical
representation of the gene there, you can determine the number of introns and the precise
locations of the splice junctions.
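Once the exon coordinates have been read from the viewer, the introns follow directly: each intron fills the gap between two neighbouring exons, so a gene with n exons has n - 1 introns (8 exons would imply 7 introns). A minimal sketch, using made-up exon coordinates rather than the real RBP4 annotation:

```python
def introns_from_exons(exons: list) -> list:
    """Return intron intervals lying between consecutive exons.

    `exons` is a list of (start, end) pairs in 1-based inclusive genomic
    coordinates. Each intron runs from the base after one exon to the base
    before the next, so its two ends mark the splice junctions.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end + 1:
            raise ValueError("exons overlap or touch; no intron between them")
        introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical exon coordinates, for illustration only.
example_exons = [(1, 100), (201, 300), (401, 500)]
print(introns_from_exons(example_exons))  # [(101, 200), (301, 400)]
```

For a gene on the reverse strand, the genomic order of the exons is the reverse of their order in the transcript, but the intron intervals themselves are the same.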
19. Click on the OMIM link. What is a biological consequence of a mutation in this protein for
humans?
A mutation in the RBP4 (Retinol Binding Protein 4) gene can have several biological
consequences in humans, primarily due to its essential role in transporting vitamin A (retinol)
from the liver to peripheral tissues.
RBP4 binds retinol (vitamin A) in the blood and delivers it to cells via interaction with a receptor
called STRA6. Vitamin A is crucial for vision, immune function, healthy skin, and growth.
Case Example:
A loss-of-function mutation in RBP4 can lead to vitamin A deficiency, even if dietary intake is
adequate, because the body cannot transport retinol efficiently. This can result in symptoms like:
Dry eyes
Impaired immunity
Skin issues
Growth retardation in children
20. Can you find your gene in SwissProt (http://us.expasy.org/sprot/) database? Give me the
accession number in SwissProt.
21. What is the advantage of SWISS-PROT vs. NCBI?
Both SWISS-PROT (now part of UniProtKB/Swiss-Prot) and NCBI provide protein sequence
data, but they serve different purposes and offer different strengths.
SWISS-PROT (UniProtKB/Swiss-Prot)
Advantages:
1. Manual Curation
o Every entry is reviewed by experts.
o Errors are corrected, and information is added based on experimental evidence.
2. High-Quality Functional Annotations
o Includes detailed info on:
Protein function
Domain structure
Post-translational modifications
Variants and disease links
3. Non-redundant
o Only one entry per protein per species (no duplicated submissions).
4. Stable and Consistent Format
o Better suited for reliable data mining, modeling, and pathway analysis.
NCBI (RefSeq/GenPept)
Advantages:
1. Broad Coverage
o Includes both curated (RefSeq) and unreviewed submissions (GenBank).
o Contains more recent and raw data, including novel or predicted proteins.
2. Integrated with Genomic Data
o Easily connects with gene locations, mRNA, genomic contigs, and other NCBI
tools (BLAST, Gene, Genome Data Viewer).
3. Rapid Updates
o New sequences are submitted and published quickly—useful for cutting-edge
research.
Feature SWISS-PROT (UniProtKB) NCBI (RefSeq/GenPept)