Lab 2
IT Faculty
2024-2025
Bioinformatics Practical Labs
DNA sequencing is essentially the process of bridging the biological object (the DNA molecule) to the
computational object (DNA sequence data).
A DNA sequencer takes tangible DNA molecules as input and outputs sequences in a textual format.
In a FASTA file, the first line starts with a greater-than sign (>), followed by identifying information that includes the name of the organism.
This header line is followed by the entire genome, which consists of many bases like the following:
In our first example we’ll work with the human mitochondrial genome. The mitochondrion is a small structure in cells that
generates energy for the cell to use. FASTA files commonly have the (.fa) extension.
The following function reads a FASTA file, given its file name:
def readFASTAGenome(filename):
    """this function is used to load a FASTA genome
    it takes as input a filename and returns a string
    containing the whole genome.
    """
    genome = ''  # empty string to store the genome.
    # Open the FASTA file in read (r) mode:
    with open(filename, 'r') as f:  # f is the file handle.
        for line in f:
            if line[0] != '>':  # ignore the information line.
                genome += line.rstrip()
    return genome
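As a quick check of the reader, the sketch below writes a tiny made-up FASTA file to a temporary directory and loads it back. The record name `toy_sequence`, the file name `toy.fa` and the bases are invented for illustration only. The function body repeats `readFASTAGenome` so that the block runs on its own.

```python
import os
import tempfile

def readFASTAGenome(filename):
    """Load a FASTA file and return the sequence as one string."""
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            if not line.startswith('>'):  # skip the information line
                genome += line.rstrip()
    return genome

# Hypothetical toy record, not real data.
toy_fasta = ">toy_sequence made-up example\nGATCACAGGT\nCTATCACCCT\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.fa')
    with open(path, 'w') as out:
        out.write(toy_fasta)
    toy_genome = readFASTAGenome(path)

print(toy_genome)       # GATCACAGGTCTATCACCCT
print(len(toy_genome))  # 20
```

The two sequence lines are joined into a single string, and the header line is dropped.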
Note: the first lines in the function are called a “docstring”. A docstring explains the function to
others who use it. In a Jupyter notebook, you can view any docstring by placing the cursor on the function name and pressing
“Shift + Tab”.
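Outside a notebook, the docstring is still reachable from plain Python. The sketch below uses a hypothetical function, `readExample`, that exists only to show where the text is stored. Calling `help(readExample)` in an interactive session would display the same text.

```python
import inspect

def readExample(filename):
    """Hypothetical function used only to
    demonstrate how a docstring is stored.
    """
    return filename

raw_doc = readExample.__doc__             # the text exactly as written
clean_doc = inspect.getdoc(readExample)   # same text with indentation cleaned up
print(clean_doc)
```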
Using a dictionary, we’ll store each base together with its frequency in the genome we just read. Then we’ll use the
collections module from the Python standard library, which achieves the same result with simpler code.
def baseFreqDictionary(genome):
    """this function is used to calculate each base count and
    store the result in a dictionary """
    base_freq = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
    for base in genome:
        base_freq[base] += 1
    return base_freq

# Call our function (hmg is the genome string returned by readFASTAGenome):
bf = baseFreqDictionary(hmg)
print(bf)  # result is {'A': 5125, 'C': 5181, 'G': 2169, 'T': 4094}
Now the same count using the collections library:
import collections
collections.Counter(hmg)  # result is Counter({'G': 2169, 'A': 5125,
                          #                    'T': 4094, 'C': 5181})
Now we’re going to represent this information visually using matplotlib, a third-party plotting library for Python:
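A bar chart of the base counts could be drawn along the following lines. This is a minimal sketch rather than the lab's own plotting code: the helper names `countsForPlot` and `plotBaseCounts`, the axis labels and the toy sequence are all assumptions made for illustration.

```python
import collections

def countsForPlot(genome):
    """Return parallel lists of bases and counts, sorted by base."""
    counts = collections.Counter(genome)
    bases = sorted(counts)
    return bases, [counts[b] for b in bases]

def plotBaseCounts(genome):
    """Draw a bar chart of base counts for the given sequence."""
    # Imported here so the counting helper also works without matplotlib.
    import matplotlib.pyplot as plt
    bases, values = countsForPlot(genome)
    plt.bar(bases, values)
    plt.title('Base Frequencies')
    plt.xlabel('Bases')
    plt.ylabel('Counts')
    plt.show()

toy = 'GATCACAGGTCTATCACCCT'  # made-up sequence, not real data
bases, values = countsForPlot(toy)
print(bases, values)  # ['A', 'C', 'G', 'T'] [5, 7, 3, 5]
```

Calling `plotBaseCounts(hmg)` would draw the chart for the genome loaded earlier.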
The following function reads a FASTQ file. As the comments in the code indicate, each read occupies four lines:
an information line, the sequence line, a placeholder line, and the quality line.
def readFastqGenome(filename):
    """this function is used to load a FASTQ file
    it takes as input a filename and returns two lists
    containing sequences (reads) and qualities.
    """
    # define two empty lists for reads and associated qualities.
    sequences = []
    qualities = []
    with open(filename, 'r') as f:  # open the file in read-only mode.
        while True:
            f.readline()  # ignore the information line.
            seq = f.readline().rstrip()  # read the sequence line.
            f.readline()  # ignore the placeholder line.
            qual = f.readline().rstrip()  # read the quality line.
            if len(seq) == 0:  # we have reached the end of the file.
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities
Now let’s test our function:
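One way to test the reader without the lab's data file is to parse a small hand-written FASTQ file. In the sketch below, the read names, bases and quality strings are all made up, and the function is repeated so that the block runs on its own.

```python
import os
import tempfile

def readFastqGenome(filename):
    """Return (sequences, qualities) read from a FASTQ file."""
    sequences, qualities = [], []
    with open(filename, 'r') as f:
        while True:
            f.readline()                  # information line
            seq = f.readline().rstrip()   # sequence line
            f.readline()                  # placeholder line
            qual = f.readline().rstrip()  # quality line
            if len(seq) == 0:             # end of file
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

# Hypothetical two-read FASTQ content, not real data.
toy_fastq = (
    "@read1\nGATTACA\n+\nIIIIHH#\n"
    "@read2\nCCGTNAA\n+\nJJJ##II\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.fastq')
    with open(path, 'w') as out:
        out.write(toy_fastq)
    toy_seqs, toy_quals = readFastqGenome(path)

print(toy_seqs)   # ['GATTACA', 'CCGTNAA']
print(toy_quals)  # ['IIIIHH#', 'JJJ##II']
```

The later examples use the names `seqs` and `quals` for the two lists returned when the same function is called on the lab's FASTQ file.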
Now we’ll create a histogram to see which quality scores are the most common and which are less
common.
To do that, we need a helper function that converts a Phred33-encoded ASCII character into its
quality score.
def phred33ToQ(p33):
    """this function is used to
    convert a Phred33 character to a quality value
    """
    return ord(p33) - 33  # get the quality value

# test the function:
print('the quality value for # is: ', phred33ToQ('#'))  # result is 2
print('the quality value for J is: ', phred33ToQ('J'))  # result is 41
A low quality score means that we have very low confidence in the base call; it is not likely to be correct.
A Phred quality score Q corresponds to an error probability of 10^(-Q/10), so for Q = 2 there is about a 63% chance that the base at this position is incorrect.
The second value is a high quality score: for Q = 41, the chance that the base at this position is
incorrect is below 0.01%.
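The conversion from a quality score to an error probability can be sketched as follows, using the standard Phred definition Q = -10 log10(p). The helper name `qToErrorProb` is an assumption introduced here, and the character '5' is added only as an intermediate example.

```python
def phred33ToQ(p33):
    """Convert a Phred33-encoded character to a quality value."""
    return ord(p33) - 33

def qToErrorProb(q):
    """Convert a Phred quality Q to the error probability 10^(-Q/10)."""
    return 10 ** (-q / 10.0)

for symbol in '#5J':
    q = phred33ToQ(symbol)
    print(symbol, q, round(qToErrorProb(q), 6))
# prints:
# # 2 0.630957
# 5 20 0.01
# J 41 7.9e-05
```

Each increase of 10 in Q divides the error probability by 10, so Q = 20 means a 1 in 100 chance of error and Q = 40 means 1 in 10,000.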
With the helper function in place, we can now draw the histogram:
import matplotlib.pyplot as plt

def createHist(qualities):
    """ this function counts how many times each quality
    score appears in our file and returns the counts
    as a list indexed by quality score
    """
    hist = [0]*50  # list of 50 zeros, one counter per quality score
    for qual in qualities:  # work through each quality line
        for phred in qual:  # work through each Phred33-encoded character
            q = phred33ToQ(phred)
            hist[q] += 1
    return hist

# quals is the list of quality strings returned by readFastqGenome
h = createHist(quals)
plt.bar(range(len(h)), h)
plt.title('Quality Score Histogram')
plt.xlabel('Quality Values')
plt.ylabel('Frequencies')
plt.show()
Notice that the quality scores range from 2 to 41. There is a large
spike at the lowest quality value, 2. The counts then drop very low and rise gradually
toward the high quality scores.
In the next example we’ll plot the GC content at each position of the read. GC content differs
from species to species.
This helps us see whether the mix of bases changes as we move along the
read; in general it should not change much.
But if there is one particularly bad sequencing cycle, we might see a very different mix of
G’s and C’s relative to the other bases.
The following Python function computes the GC content at each position for the reads
we loaded from the FASTQ file, and we then plot the result.
def drawGCContent(reads):
    """ this function computes, for each read position,
    the fraction of bases that are G or C; it assumes
    reads are at most 100 bases long
    """
    gc = [0]*100  # one G/C counter per position (read length is 100)
    totals = [0]*100  # number of bases seen at each position
    for read in reads:
        for i in range(len(read)):
            if read[i] == 'C' or read[i] == 'G':
                gc[i] += 1
            totals[i] += 1
    for i in range(len(gc)):
        if totals[i] > 0:
            gc[i] /= float(totals[i])
    return gc

# Call our function and draw the result
# (seqs is the list of reads returned by readFastqGenome):
gc = drawGCContent(seqs)
plt.title('GC-Content By Position')
plt.xlabel('Positions')
plt.ylabel('Average GC Content')
plt.plot(range(len(gc)), gc)
plt.show()
Notice that the change is fairly small: the average GC content stays between about 0.56 and 0.62.
As we did with the FASTA file, we’ll find the distribution of bases in these sequences by using the collections
library again:
count = collections.Counter()
for seq in seqs:
    count.update(seq)
print(count)
Besides A, C, G and T, the reads may contain the symbol N. N is used when the sequencer or base caller has no
confidence in a base and declines to make a call, because there is no good evidence to support one base over the others.
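To see how often no call was made, the N counts can be pulled out of the same counter. The sketch below uses three made-up reads in place of the lab data; the names `toy_reads`, `n_fraction` and `reads_with_n` are assumptions introduced for illustration.

```python
import collections

# Hypothetical reads, not real data.
toy_reads = ['GATTACA', 'CCGTNAA', 'NNACGTT']

count = collections.Counter()
for read in toy_reads:
    count.update(read)  # add the bases of this read to the running totals

# Fraction of all bases that are N, and the reads that contain at least one N.
n_fraction = count['N'] / sum(count.values())
reads_with_n = [r for r in toy_reads if 'N' in r]

print(count['N'])             # 3
print(round(n_fraction, 3))   # 0.143
print(reads_with_n)           # ['CCGTNAA', 'NNACGTT']
```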
References:
Bioinformatic lectures by Dr. Eng. Asmaa Shaar.