
Lab 2

AL-Hawash University

IT Faculty

Bioinformatics Practical Labs

2024-2025

DNA Sequencing Manipulation


Introduction to DNA Sequencing:
DNA is the molecule that encodes the genome (all of an organism's genetic information, including its genes).
DNA is represented with an alphabet of four letters (A, C, G and T), which are called bases or nucleotides.
The DNA molecule is shaped like a double helix whose rungs are made up of complementary base pairs:
A is complementary to T and vice versa.
C is complementary to G and vice versa.
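
The pairing rules above can be sketched in a few lines of Python. The names `COMPLEMENT` and `complement` and the example sequence are illustrative assumptions, not part of the lab's own code.

```python
# Illustrative sketch: complementary base pairing (A<->T, C<->G).
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return ''.join(COMPLEMENT[base] for base in seq)

example = 'ACCGT'  # made-up sequence
print(complement(example))  # TGGCA
```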

DNA sequencing is essentially the process of bridging the biological object (the DNA molecule) and the
computational object (DNA sequence data).
A DNA sequencer takes tangible DNA molecules as input and outputs sequences in text format.

Analyzing FASTA Genome Files:


FASTA Files:

The first line starts with a greater-than sign (>), followed by identifying information, including the name of the organism.
The remaining lines contain the entire genome, which consists of many bases.

In our first example we'll work with the human mitochondrial genome. The mitochondrion is a small structure in
cells that generates energy for the cell to use. FASTA files commonly have the (.fa) extension.
The following function reads a FASTA file, given its file name:

def readFASTAGenome(filename):
    """Load a FASTA genome.

    Takes a filename as input and returns a string
    containing the whole genome.
    """
    genome = ''  # empty string to store the genome
    # Open the FASTA file in read ('r') mode:
    with open(filename, 'r') as f:  # f is the file handle
        for line in f:
            if not line.startswith('>'):  # skip the information line
                genome += line.rstrip()
    return genome

#Test our function:


hmg = readFASTAGenome('chrMT.fa')
print("the first characters from our genome:\n",hmg[:100])
print("the length of our genome is equal to: ",len(hmg))
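
To check a reader like this without downloading a genome, one option is to write a tiny FASTA file to a temporary location and read it back. The sketch below is self-contained; the function name `read_fasta_sketch` and the file contents are made up for illustration, and it assumes a single-record file as in the lab.

```python
import os
import tempfile

def read_fasta_sketch(filename):
    """Concatenate all non-header lines of a single-record FASTA file."""
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            if not line.startswith('>'):  # skip the header line
                genome += line.rstrip()
    return genome

# Made-up FASTA content: one header line, then the sequence split over lines.
fasta_text = ">toy_sequence made-up example\nACGTAC\nGGTTAA\nCC\n"

with tempfile.NamedTemporaryFile('w', suffix='.fa', delete=False) as tmp:
    tmp.write(fasta_text)
    path = tmp.name

toy = read_fasta_sketch(path)
os.remove(path)  # clean up the temporary file
print(toy)       # ACGTACGGTTAACC
print(len(toy))  # 14
```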

Note: the first lines in the function are called a “docstring”. They explain the function we wrote to other
readers. In a Jupyter notebook, to see any docstring, place the cursor on the function name and press the
“Shift + Tab“ keys.

Using a dictionary, we'll store each base together with its frequency in the genome we read. Then we'll use a
module called collections, which lets us achieve the same result with simpler code.

def baseFreqDictionary(genome):
    """this function is used to calculate each base count and
    store result in a dictionary """
    base_freq = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
    for base in genome:
        base_freq[base] += 1
    return base_freq

#Call our function:
bf = baseFreqDictionary(hmg)
print(bf) #result is {'A': 5125, 'C': 5181, 'G': 2169, 'T': 4094}
Now using the collections module:

import collections
collections.Counter(hmg)
#result is Counter({'G': 2169, 'A': 5125, 'T': 4094, 'C': 5181})
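
One practical difference between the two approaches is worth noting: a dictionary pre-filled with only A, C, G and T raises a KeyError if the sequence contains any other symbol (for example N), whereas Counter simply adds a new key. A minimal sketch on a made-up sequence (the variable names are illustrative):

```python
import collections

toy = 'ACGTNACGA'  # made-up sequence containing one N

counts = collections.Counter(toy)
print(counts['A'], counts['C'], counts['G'], counts['T'], counts['N'])  # 3 2 2 1 1

# The fixed-key dictionary approach fails on the unexpected symbol:
base_freq = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
try:
    for base in toy:
        base_freq[base] += 1
    failed = False
except KeyError:
    failed = True
print(failed)  # True
```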
Now, we're going to represent this information visually using the matplotlib library for Python:

import matplotlib.pyplot as plt


plt.title('Occurrence of bases in the HMG genome') #graph title
#Draw 4 bars one for each base.
plt.bar(range(len(bf)),list(bf.values()),align='center')
plt.xticks(range(len(bf)),list(bf.keys()) ) # change x label
plt.ylabel("Base Frequency") #add label to y axis
plt.show() #show the figure

Analyzing FASTQ Files:


Sequencing reads from a DNA sequencer are stored in a compact file format called FASTQ. Each read in a
FASTQ file is described by a group of four lines:
The first line is the name of the read, which encodes information such as the experiment it
comes from, the kind of sequencing machine that was used, and where on the slide this particular
cluster was located.
The second line is the sequence of bases as reported by the base caller software.
The third line is a placeholder line.
The fourth line is a sequence of quality values for the corresponding bases in the sequence line, as
reported by the base caller software.
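
The four-line layout can be illustrated by splitting a single record held in a string. The record below is made up for illustration. In FASTQ files the name line begins with '@' and the placeholder line with '+', and the quality line has one character per base.

```python
# A made-up FASTQ record (four lines), for illustration only.
record = """@toy_read_1 made-up name
ACGTN
+
IIH#!
"""

name, seq, placeholder, qual = record.splitlines()
print(name[1:])               # read name without the leading '@'
print(seq)                    # ACGTN
print(len(seq) == len(qual))  # True: one quality character per base
```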

def readFastqGenome(filename):
    """Load reads from a FASTQ file.

    Takes a filename as input and returns two lists
    containing sequences (reads) and qualities.
    """
    # define two empty lists for reads and associated qualities
    sequences = []
    qualities = []
    with open(filename, 'r') as f:  # open file in read-only mode
        while True:
            f.readline()  # ignore information line
            seq = f.readline().rstrip()  # read sequence line
            f.readline()  # ignore placeholder line
            qual = f.readline().rstrip()  # read quality line
            if len(seq) == 0:  # we have reached the end of the file
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities
Now let's test our function:

#test our function:


seqs, quals = readFastqGenome('SRR835775_1.first1000.fastq')
print("the first five group of reads: ", seqs[:5],"\n")
print("the first five group of associated qualities: ", quals[:5])

Now, we'll create a histogram to see which quality scores are the most common and which are less
common.
To do that, we need a helper function that converts an ASCII symbol, that is, a Phred33-encoded
value, to a numeric quality score.

def phred33ToQ(p33):
    """this function is used to
    convert phred33 to quality value
    """
    return ord(p33) - 33 #get quality value

#test the function:
print('the quality value for # is: ',phred33ToQ('#')) #result is 2
print('the quality value for J is: ',phred33ToQ('J')) #result is 41
A low quality score means that we have very low confidence in the base call; it is not likely to be correct.
A Phred quality value Q corresponds to an error probability of 10^(-Q/10), so for Q = 2 there is about a 63%
chance that the base at this position is incorrect.
For the second value (Q = 41) the quality is high, so the chance that the base at this position is
incorrect is very small (about 0.008%).
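
The conversion from a Phred33 character to an error probability can be sketched as follows, using the standard Phred relation P = 10^(-Q/10). The function name `phred33_to_error_prob` is an illustrative assumption, not part of the lab's code.

```python
def phred33_to_error_prob(ch):
    """Convert one Phred33-encoded character to an error probability."""
    q = ord(ch) - 33        # numeric quality value
    return 10 ** (-q / 10)  # standard Phred relation: P = 10^(-Q/10)

for ch in '#J':
    q = ord(ch) - 33
    print(ch, q, phred33_to_error_prob(ch))
```

For '#' (Q = 2) this gives a probability of roughly 0.63, and for 'J' (Q = 41) roughly 0.00008.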
Now that we have everything we need, let's draw the histogram:

def createHist(qualities):
    """ this function is used to calculate histogram for
    our quality scores and how many times each quality
    score appears in our file
    """
    hist = [0] * 50  # list of 50 zeros, one slot per quality value 0..49
    for qual in qualities:  # work through each quality line
        for phred in qual:  # work through each Phred33 encoded value
            q = phred33ToQ(phred)
            hist[q] += 1
    return hist

h = createHist(quals)
plt.bar(range(len(h)), h)
plt.title('Quality Score Histogram')
plt.xlabel('Quality Values')
plt.ylabel('Frequencies')
plt.show()

We can see that the quality scores range from 2 to 41. There is also a big spike at the low quality
value 2. The frequencies then drop very low and increase gradually toward the high quality
scores.
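
As an alternative to a fixed-size list, the same tally can be sketched with collections.Counter, which does not need to know the largest quality value in advance. The quality strings below are made up for illustration.

```python
import collections

toy_quals = ['II#J', '##IJ']  # made-up Phred33 quality strings

# Count how often each numeric quality value occurs across all strings.
hist = collections.Counter(ord(ch) - 33 for q in toy_quals for ch in q)
print(sorted(hist.items()))  # [(2, 3), (40, 3), (41, 2)]
```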
In the next example we'll plot the GC content at each position of the read. GC content differs
from species to species.
This helps us see whether the mix of bases changes as we move along the
read; in general it should not change much.
But if there is one particularly bad sequencing cycle, we might see a very different mix of
G's and C's relative to the other bases.
The following Python function computes the GC content at each position for the reads
we loaded from the FASTQ file, and we then plot the result.

def drawGCContent(reads):
    """ this function is used to calculate, for each read
    position, the fraction of bases that are G or C
    """
    gc = [0] * 100  # one slot per position; assumes reads of length at most 100
    totals = [0] * 100
    for read in reads:
        for i in range(len(read)):
            if read[i] == 'C' or read[i] == 'G':
                gc[i] += 1
            totals[i] += 1  # count every base at this position
    for i in range(len(gc)):
        if totals[i] > 0:
            gc[i] /= float(totals[i])
    return gc
# Call our function and draw the result:
gc = drawGCContent(seqs)
plt.title('GC-Content By Position')
plt.xlabel('Positions')
plt.ylabel('Average GC Content')
plt.plot(range(len(gc)),gc)
plt.show()
Notice that the change is fairly small: the average GC content stays between about 0.56 and 0.62.
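
The per-position calculation can be checked by hand on a few short reads. The sketch below is a variant that sizes its lists from the longest read instead of a fixed length of 100; the function name `gc_by_position` and the reads are made up for illustration.

```python
def gc_by_position(reads):
    """Fraction of G/C bases at each position, sized to the longest read."""
    length = max((len(r) for r in reads), default=0)
    gc = [0] * length
    totals = [0] * length
    for read in reads:
        for i, base in enumerate(read):
            if base in 'GC':
                gc[i] += 1
            totals[i] += 1
    # every position below `length` is covered by at least one read
    return [g / t for g, t in zip(gc, totals)]

toy_reads = ['GCAT', 'GGAT', 'ACA']  # made-up reads of unequal length
toy_gc = gc_by_position(toy_reads)
print(toy_gc)
```

At position 0 two of the three reads have a G, so the first value is 2/3. At position 1 all three reads have a G or C, giving 1.0. The last two positions contain no G or C, giving 0.0.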
As we did with the FASTA file, we'll find the distribution of bases in these sequences by using the
collections module again:

count = collections.Counter()
for seq in seqs:
    count.update(seq)
print(count)

N is used when the sequencer or base caller has no confidence and does not make a call,
because there is no good evidence to support one base over the others.
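
As a small illustration, the share of no-call bases in a set of reads can be computed directly. The reads and variable names below are made up.

```python
toy_reads = ['ACGT', 'ANNT', 'GGCN']  # made-up reads

n_count = sum(read.count('N') for read in toy_reads)  # number of N calls
total = sum(len(read) for read in toy_reads)          # total number of bases
print(n_count, total, n_count / total)  # 3 12 0.25
```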

References:
Bioinformatic lectures by Dr. Eng. Asmaa Shaar.
