W9-SIO1003 Practical 4-Questions


SIO1003 Bioinformatics Concepts

Semester 1, Session 2023/2024

Practical 4: BLAST (20 marks)

Name: NUR AIN BASHIRAH BINTI ZAHRUL HAKIM


Matrix No: 23005852
Group: OCC 3

Instructions:
1. Two questions are given. Each question carries 10 marks. Answer all questions.
2. Please send in your answers either in .doc/.docx/.pdf format via Spectrum-UM with
a filename as follows: SIO1003_ID_FIRSTNAME_W9_P4
3. Please also be reminded that plagiarism is an academic offense. If you are found to
have plagiarized your colleague’s work, you will be given 0 marks.
4. Submission Deadline: Next Tuesday 11.59 pm

Brief description about BLAST:


BLAST programs from NCBI are powerful sequence alignment tools widely used in the
analysis of biological sequences. They are also used to look for similar sequences in a
database.
In this exercise you will be given some sequences (DNA, RNA or protein) and you are
required to perform searches and analyses using the NCBI BLAST services
(blast.ncbi.nlm.nih.gov) to address some specific biological questions.
The purpose of these exercises is to help you familiarize yourself with the web
interface and better understand the capabilities of the different BLAST programs.
Exercise 1: Finding an unknown gene.
Your supervisor has conducted a PCR experiment and given you an unknown
sequence for analysis. The sequence is as below:
>Unknown_sequence_1
AAATGAGTTAATAGAATCTTTACAAATAAGAATATACACTTCTGCTTAGGATGATAAT
TGGAGGCAAGTGAATCCTGAGCGTGATTTGATAATGACCTAATAATGATGGGTTTT
ATTTCCAGACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGT
AAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCT
GGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGA
AGCGTCATCAAAGCATGCCAACTAGAAGAGGTAAGAAACTATGTGAAAACTTTTTG
ATTATGCATATGAACCCTTCACACTACCCAAATTATATATTTGGCTCCATATTCAATC
GGTTAGTCTACATATATTTATGTTTCCTCTATGGGTAAGCTACTGTGAATGGATCAA
TTAATAAAACACATGACCTATGCTTTAAGAAGCTTGCAAACACATGAA

Guidelines to conduct sequence analysis


1. Navigate to the main BLAST page ( https://blast.ncbi.nlm.nih.gov/Blast.cgi )
2. Select the appropriate type of BLAST for your sequence
3. Paste the first unknown sequence into the box
(for this activity, you can ignore the search options)
4. Click on the “BLAST” button and wait for the results.
5. Once the results are displayed, notice there are three main headings:
Graphic Summary, Descriptions, and Alignments (these may be expanded so you’ll
have to scroll down).
6. Use these results to answer the questions below.
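Step 2 above asks for the appropriate type of BLAST. As a rough illustration only, the choice between a nucleotide search (blastn) and a protein search (blastp) can be guessed from the letters in the query. The function name `guess_blast_program` and the 0.9 threshold are assumptions made for this sketch; they are not part of any NCBI tool.

```python
# Letters that are typical of DNA/RNA input (N stands for an unknown base).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_blast_program(sequence: str, threshold: float = 0.9) -> str:
    """Return "blastn" for nucleotide-like input, otherwise "blastp".

    The threshold is an arbitrary assumption for this sketch: if at least
    that fraction of the letters are nucleotide codes, the query is treated
    as DNA/RNA.
    """
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_count = sum(ch in NUCLEOTIDE_LETTERS for ch in letters)
    fraction = nucleotide_count / len(letters)
    return "blastn" if fraction >= threshold else "blastp"


if __name__ == "__main__":
    # First line of Unknown_sequence_1 from this practical.
    first_line = "AAATGAGTTAATAGAATCTTTACAAATAAGAATATACACTTCTGCTTAGGATGATAAT"
    print(guess_blast_program(first_line))
```

A heuristic like this can be fooled by short or unusual sequences, so the final choice of program should still be checked by eye.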
Questions:
1. In the Descriptions section, look at the top result, which should be the result with
the highest score. Write down information about the best match:

a. Description
(no need to write the whole thing)
=Homo sapiens CF transmembrane conductance regulator (CFTR),
RefSeqGene (LRG_663) on chromosome 7
b. E value

= 0.0
c. Percent identity
= 503/503 (100%)

d. Query cover

=100%

2. Now scroll down to the Alignments heading. Look at the top result, which should
be the same one. Look at the alignment between your query and the reference.
Do you see any mismatches?
= No, there are no mismatches.

3. How can you judge whether this is a good match?


= It is a good match because the query cover is 100%, the percent identity is
100%, the E value is 0.0, and the overall score is high.

4. What is this gene? Google the name of the gene and write down something
significant you learned about it.
= CFTR gene
This gene encodes a member of the ATP-binding cassette (ABC) transporter
superfamily. The encoded protein functions as a chloride channel, making it
unique among members of this protein family, and controls ion and water
secretion and absorption in epithelial tissues. Channel activation is mediated by
cycles of regulatory domain phosphorylation, ATP-binding by the nucleotide-
binding domains, and ATP hydrolysis. Mutations in this gene cause cystic
fibrosis, the most common lethal genetic disorder in populations of Northern
European descent. The most frequently occurring mutation in cystic fibrosis,
DeltaF508, results in impaired folding and trafficking of the encoded protein.
Multiple pseudogenes have been identified in the human genome.
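
The percent identity reported in Exercise 1 (503/503) is the number of identical positions divided by the alignment length. The sketch below shows that arithmetic on two already-aligned strings. The function names and the toy alignment are assumptions made for illustration; this is not BLAST's own code.

```python
def identity_from_counts(matches: int, alignment_length: int) -> float:
    """Percent identity from counts, e.g. 503 identical positions out of 503."""
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    return 100.0 * matches / alignment_length


def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent identity of two aligned strings of equal length.

    Gaps are written as "-" and count towards the alignment length,
    but a gap never counts as a match.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return identity_from_counts(matches, len(aligned_query))


if __name__ == "__main__":
    # Counts taken from the answer to question 1c.
    print(identity_from_counts(503, 503))
    # Toy alignment (hypothetical) with one mismatch in four positions.
    print(percent_identity("ACGT", "ACGA"))
```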
Exercise 2: Investigating sets of sequences
Assume that you have joined a bioinformatics company as a staff member after your
undergraduate study. There, your supervisor is actively involved in research, and he
has a number of sequences which he obtained from sequencing experiments.
He has given you three sets of sequences to analyze and
obtain some information from.
The given sequences are as in the table below:

SET 1
>Sequence1a
GTAATGTACATAACATTAATGTAATAAAGA

>Sequence1b
ATCACGAGCTTAATTACCATGCCGCGTGAAACCAGCAACC

>Sequence1c
ATGGACTAATGGCTAATCAGCCCATGCTCACACATA

SET 2
>Sequence2a
TTTGGTTGTTCGACGACGGATGCAGAGCTCAGGGAAGTGGGGACGTGTTTTG
GCTATCCT

>Sequence2b
GCGATGCATCAGGATGCATCCTCTGATCTTAGGGTGGTACGAGAAAAATTGA
AGAATGTA

>Sequence2c
GCGGTTCCACAAGACCCTGAGGCGCCTGGTGCCTGACTCGGACGTCCGGTT
CCTCCTCTC

SET 3
>Sequence3a
TAACCTACGGGTGGCCGCAGTGGGGAATATTGCACAATGGACACAAGTCTGA
TGCAGCGACGCCGCGTGGGGGATGAAGGCTTTCGGGTTGTAAACTCCTTTC
AGTACAGAAGAAGCATTTTTGTGACGGTATGTGCAGAAGAAGCGCCGGCTAA
CTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCGAGCGTTGTCCGGAAT
TATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGCCTGCTGTG

>Sequence3b
TGTCCTACGGGGGGCTGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTG
AACCAGCCAAGTCGCGTGAAGGATGACTGTCTTATGGATTGTAAACTTCTTTT
ATACGGGAATAACAAGAGTCACGTGTGGCTCCCTGCATGTACCGTATGAATA
AGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGC
GTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGC

>Sequence3c
GGCCTACGGGGGGCTGCAGTGGGTACGGGCAGACTAGAGTGTGGTAGGGG
TAATTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACAC
CGATGGCGAAGGCAGGTTACTGGGCCATTACTGACGCTGAGGAGCGAAAGC
GTGGGTAGCGAACAGGATTAGATACCCTAGTAGTCT

Guidelines to conduct sequence analysis.

1. Navigate to the main BLAST page ( https://blast.ncbi.nlm.nih.gov/Blast.cgi )


2. Select the appropriate type of BLAST for your sequence
3. Paste the sequences of the first set (Set 1) into the box and click the BLAST button,
leaving the other parameters at their default values.
4. Once the results are displayed, notice the three main headings: Graphic
Summary, Descriptions, and Alignments.
5. Use these results to answer the questions below.
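
Before pasting a set into the search box, it can help to check that the FASTA text is well formed and to note each query length, which is one of the values asked for below. The following is a minimal sketch using only the Python standard library. The function name `parse_fasta` is an assumption made for this illustration, and the input is Set 1 from this practical.

```python
from typing import Dict

SET_1 = """\
>Sequence1a
GTAATGTACATAACATTAATGTAATAAAGA

>Sequence1b
ATCACGAGCTTAATTACCATGCCGCGTGAAACCAGCAACC

>Sequence1c
ATGGACTAATGGCTAATCAGCCCATGCTCACACATA
"""


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without ">") to its sequence.

    Sequence lines that are wrapped over several lines are joined together.
    """
    records: Dict[str, str] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


if __name__ == "__main__":
    for name, sequence in parse_fasta(SET_1).items():
        print(name, len(sequence))
```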

Questions: For each experimental set, answer the following:


1. Description
2. Maximum score
3. E-Value
4. Sequence ID
5. Query cover
6. Query length
7. Which organism do these sequences belong to?
8. What do these sequences have in common?
9. What is your best guess about the original purpose of this experiment?
No | Aspect | Set 1 | Set 2 | Set 3
1 | Description | Bos taurus mitochondrial DNA, D-loop, complete sequence, haplotype: JSH35_1 | Arabidopsis thaliana genome assembly, chromosome: 4 | Trueperella pyogenes strain TN2 16S ribosomal RNA gene, partial sequence
2 | Maximum score | 56.5 bits (30) | 111 bits (60) | 442 bits (239)
3 | E-value | 4e-05 | 8e-21 | 2e-119
4 | Sequence ID | LC314271.2 | LR782545.1 | MT269339.1
5 | Query cover | 100% | 100% | 98%
6 | Query length | 30 | 60 | 63
7 | Which organism do these sequences belong to? | Cattle | Thale cress | Bacteria
8 | What do these sequences have in common? (all sets)
• Each sequence is a fragment of an organism's genetic material.
• They have high query coverage (98% or more).
• Their E-values are significant (very small), meaning the alignments are
most probably correct.
9 | What is your best guess about the original purpose of this experiment? (all sets)
To identify the unknown sequences and determine which species each one
belongs to by comparing them with existing data in the databases.
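
The maximum scores and E-values in the table are linked: for a given search space, a higher bit score means fewer alignments of that quality are expected by chance. A commonly quoted approximation is E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the total database length. The sketch below evaluates that approximation. The database length used here is a made-up assumption, and BLAST itself works with effective lengths, so the printed numbers are not expected to reproduce the E-values in the table exactly.

```python
def expected_hits(bit_score: float, query_length: int, database_length: float) -> float:
    """Approximate E-value: E = m * n * 2**(-S').

    m is the query length, n the database length and S' the bit score.
    """
    if query_length <= 0 or database_length <= 0:
        raise ValueError("lengths must be positive")
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical database size in letters (an assumption for this sketch).
    HYPOTHETICAL_DATABASE_LENGTH = 1e11
    # Bit scores and query lengths for Set 1 and Set 2 from the table above.
    for label, bits, length in [("Set 1", 56.5, 30), ("Set 2", 111.0, 60)]:
        e_value = expected_hits(bits, length, HYPOTHETICAL_DATABASE_LENGTH)
        print(f"{label}: {e_value:.1e}")
```

The point of the sketch is the trend rather than the exact values: each extra bit of score halves the expected number of chance hits.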
