
E1 - Biological Databases and Data Organization: General Content

Uploaded by

aysepolat7000

E1 - biological databases and data organization

General content
1. SwissProt has 565254 sequence entries and a higher percentage of entries with
evidence at both protein and transcript level than TrEMBL, which has 219174961
sequence entries and low evidence percentages. The distributions differ because
TrEMBL is unreviewed and computationally annotated, whereas SwissProt is
manually annotated and reviewed against the literature by a curator.
- SwissProt

- TrEMBL

2. The total number of annotated active sites is found under features, act_site,
total number. SwissProt contains 168907 annotated active sites.
3. The number of proteins with at least one active site is 102056, found under
number of entries.
4. This may be because most of the proteins in the given model organisms have
already been sequenced, so the graph levels off.

Understand a protein entry


1. Three reviewed proteins from the PRKD1 gene were found in SwissProt. The
serine/threonine-protein kinase D1 was found in Homo sapiens, Mus musculus and
Rattus norvegicus. The protein is shorter in human (912 amino acids) than in
mouse and rat (918 amino acids).

2. The length of the sequence is 912. The downloaded FASTA sequence was pasted
into https://www.browserling.com/tools/letter-frequency, and according to this
online tool the protein contains 79 serines.
3. According to NCBI, the taxonomic identifier for Homo sapiens is 9606, and
approximately 1480000 proteins are listed for it:
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?lvl=0&id=9606. The
human body contains a large number of different proteins, which explains why the
number of proteins in the database is that high.
4. Existence of the protein is supported by experimental evidence at protein level.

5. Manual annotation means that a curator has read the literature about the
protein, checked databases, and reviewed papers.
6. 9 mutations were found under pathology and biotech, and HeLa cells are mentioned
in 4 of them. HeLa cells are a cancer cell line, and cancer is a sign of mutated
cells, which is why the mutations were tested in these cells.
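
The serine count in item 2 can also be reproduced without an online tool. The following is a minimal sketch, assuming the FASTA record is available as a text string; the function name `count_residues` and the toy record are hypothetical and are not the real PRKD1 sequence.

```python
from collections import Counter


def count_residues(fasta_text):
    """Count one-letter amino acid codes in a single FASTA record."""
    # Drop the header line (starts with '>') and blank lines, then join
    # the remaining sequence lines into one string.
    sequence = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )
    return Counter(sequence)


# Toy record for illustration only (hypothetical, not PRKD1).
toy_fasta = """>toy_protein
MSSAPS
LSK
"""

counts = count_residues(toy_fasta)
print("length:", sum(counts.values()))
print("serines:", counts["S"])
```

Applied to the downloaded FASTA file for the human protein, the same function should give the sequence length and the number of serines (the letter S).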

Identifiers and file formats


1. The accession number uniquely identifies the protein entry. It is a stable
identifier: when the sequence changes, the accession number stays the same and
the sequence version number increases.
2. Each line of the data starts with a two-letter code on the left-hand side, which
describes the information on the right-hand side.
3. Machine readable means that the data can be read and interpreted by a
computer. This format lets a program find information much faster than a manual
search.
4. After the sign '>', sp means SwissProt, as it is the database used. Next comes
the accession number (beginning with Q), the unique identifier of the protein.
KPCD1 is the entry name, followed by a code for the species the protein is found
in. After that, the protein name is written out. OS means organism species, OX is
the organism identifier, and GN is the gene name. PE stands for protein existence
and has the value 1, meaning evidence at protein level. SV means sequence version
and has the value 2, meaning this is version 2 of the sequence. The header is
followed by the sequence itself, written in amino acid codes.
https://www.uniprot.org/uniprot/Q15139.fasta
https://www.uniprot.org/uniprot/Q9WTQ1.fasta
https://www.uniprot.org/uniprot/Q62101.fasta
5. The search returns 374 hits in total.
6. According to the database, the 374 entries come from 274 species.

7. Without the quotation marks, the database searches for each word separately and
returns hits for every single word. With quotation marks, the search engine
treats everything between them as one exact phrase.
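
The FASTA header fields described in item 4 can be pulled apart by a program, which is what makes the format machine readable. The following is a minimal sketch using a regular expression; the pattern, the function name `parse_header` and the example header line are assumptions reconstructed from the description in item 4, not copied from the downloaded file.

```python
import re

# Assumed layout of a UniProt FASTA header line:
# >db|accession|entry_name protein name OS=... OX=... GN=... PE=... SV=...
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.+?)\s+OS=(?P<organism>.+?)\s+OX=(?P<taxon_id>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?\s+PE=(?P<existence>\d)\s+SV=(?P<version>\d+)\s*$"
)


def parse_header(line):
    """Split a FASTA header line into named fields (hypothetical helper)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("line does not match the assumed header layout")
    return match.groupdict()


# Example header reconstructed from item 4 (an assumption, for illustration).
header = (
    ">sp|Q15139|KPCD1_HUMAN Serine/threonine-protein kinase D1 "
    "OS=Homo sapiens OX=9606 GN=PRKD1 PE=1 SV=2"
)

fields = parse_header(header)
for key, value in fields.items():
    print(key, "=", value)
```

The GN field is treated as optional in this sketch, since not every entry has a gene name; headers that deviate from the assumed layout raise an error instead of being parsed incorrectly.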
