Lab 2

The document outlines practical labs for bioinformatics at Al-Hawash University, focusing on DNA sequencing manipulation and analysis of FASTA and FASTQ files. It includes functions for reading genome files, calculating base frequencies, and visualizing data using libraries like matplotlib. Additionally, it discusses quality score analysis and GC-content plotting to assess sequencing quality.

Uploaded by

nano.rajab17

AL-Hawash University

IT Faculty

Bioinformatics Practical Labs

2024-2025

DNA Sequencing Manipulation

Introduction to DNA Sequencing:
DNA is the molecule that encodes the genome (all of an organism's genetic information, including its genes).
A DNA sequence is written with an alphabet of four letters (A, C, G and T), which are called bases or nucleotides.
The DNA molecule is shaped like a double helix whose rungs are made up of complementary base pairs:
A is complementary to T and vice versa.
C is complementary to G and vice versa.
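
The pairing rules above can be sketched as a small lookup table. This is only an illustration: the names COMPLEMENT and complement are hypothetical and are not part of the lab's original code, and the sketch assumes the input contains only the letters A, C, G and T.

```python
# Hypothetical helper (not from the original lab): map each base to its partner.
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def complement(seq):
    """Return the complementary base for every base in seq (A, C, G, T only)."""
    return ''.join(COMPLEMENT[base] for base in seq)

print(complement('AACG'))  # TTGC
```

Applying the function twice gives back the original sequence, which is one quick way to check the table.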

DNA sequencing is essentially the process of bridging the biological object (the DNA molecule) and the
computational object (DNA sequence data).
A DNA sequencer takes physical DNA molecules as input and outputs their sequences in text format.

Analyzing FASTA Genome Files:

FASTA Files:

The first line starts with a greater-than sign (>), followed by identifying information such as the name of the organism.
The remaining lines contain the entire genome, which consists of many bases.
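
To make the layout concrete, here is a minimal sketch using a made-up FASTA record held in a string. The record name and the bases are invented for illustration and do not come from any real genome file.

```python
# A made-up FASTA record (hypothetical name and bases), wrapped over two lines.
example_fasta = """>example_sequence made-up record for illustration
ACGTACGTAC
GGTTAACC
"""

lines = example_fasta.splitlines()
header = lines[0][1:]            # drop the leading '>'
sequence = ''.join(lines[1:])    # join the wrapped sequence lines

print(header)
print(sequence)       # ACGTACGTACGGTTAACC
print(len(sequence))  # 18
```

The sequence is usually wrapped over many lines, so a reader has to join them to recover the whole genome.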

In our first example we’ll work with the human mitochondrial genome. The mitochondrion is a small structure in cells that
generates energy for the cell to use. FASTA files commonly have the (.fa) extension.
The following function reads a FASTA file, given the file name:

def readFASTAGenome(filename):
    """this function is used to load a FASTA genome.
    it takes a filename as input and returns a string
    containing the whole genome.
    """
    genome = ''  # empty string to store the genome.
    # Open the FASTA file in read ('r') mode:
    with open(filename, 'r') as f:  # f is the file handle.
        for line in f:
            if not line.startswith('>'):  # ignore the information line.
                genome += line.rstrip()
    return genome

#Test our function:

hmg = readFASTAGenome('chrMT.fa')
print("the first characters from our genome:\n",hmg[:100])
print("the length of our genome is equal to: ",len(hmg))

Note: the first lines in the function are called a “docstring”. They are used to explain the function
to other readers. In Jupyter Notebook, to see a function's docstring, place the cursor on the function name and press
the “Shift + Tab” keys.
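
Outside Jupyter, a docstring can also be read from the function's __doc__ attribute, or shown with the built-in help() function. The sketch below uses a hypothetical function, gc_fraction, that is not part of the lab's code and assumes a non-empty sequence.

```python
def gc_fraction(seq):
    """Return the fraction of bases in seq that are G or C."""
    # Hypothetical example function; assumes seq is not empty.
    return (seq.count('G') + seq.count('C')) / len(seq)

# The docstring is stored on the function object itself.
print(gc_fraction.__doc__)
```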

Using a dictionary, we’ll store the frequency of each base in the genome we read. Then we’ll use the
collections module, which lets us achieve the same result with simpler code.

def baseFreqDictionary(genome):
    """this function is used to count each base and
    store the result in a dictionary """
    base_freq = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
    for base in genome:
        base_freq[base] += 1  # assumes the genome contains only A, C, G and T
    return base_freq

# Call our function:
bf = baseFreqDictionary(hmg)
print(bf)  # result is {'A': 5125, 'C': 5181, 'G': 2169, 'T': 4094}

Now using the collections module:

import collections
collections.Counter(hmg)  # result is Counter({'G': 2169, 'A': 5125, 'T': 4094, 'C': 5181})
Now we’ll represent this information visually using the matplotlib library for Python:

import matplotlib.pyplot as plt

plt.title('Occurrence of bases in the HMG genome') #graph title
#Draw 4 bars one for each base.
plt.bar(range(len(bf)),list(bf.values()),align='center')
plt.xticks(range(len(bf)),list(bf.keys()) ) # change x label
plt.ylabel("Base Frequency") #add label to y axis
plt.show() #show the figure

Analyzing FASTQ Files:

Sequencing reads from a DNA sequencer are encoded in a compact file format called FASTQ. Each read in a FASTQ
file is described by a group of four lines:
The first line is the name of the read, which encodes information such as the experiment that it
comes from, the kind of sequencing machine that was used, and where on the slide this particular
cluster was located.
The second line is the sequence of bases as reported by the base caller software.
The third line is a placeholder line.
The fourth line is a sequence of qualities for the corresponding bases in the sequence line, as
reported by the base caller software.
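
The four-line layout can be sketched with a single made-up record. The read name, bases and quality characters below are invented for illustration. In standard FASTQ files the name line starts with '@' and the placeholder line starts with '+'.

```python
# A made-up FASTQ record (hypothetical read name, bases, and qualities).
record = """@example_read_1
GATTACA
+
IIIIII#
"""

name, seq, placeholder, qual = record.splitlines()
assert name.startswith('@')         # the read name line starts with '@'
assert placeholder.startswith('+')  # the placeholder line starts with '+'
assert len(seq) == len(qual)        # one quality character per base

# Pair each base with its quality character.
for base, q in zip(seq, qual):
    print(base, q)
```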

def readFastqGenome(filename):
    """this function is used to load a FASTQ file.
    it takes a filename as input and returns two lists
    containing the sequences (reads) and their qualities.
    """
    # define two empty lists for the reads and their associated qualities.
    sequences = []
    qualities = []
    with open(filename, 'r') as f:  # open the file in read-only mode.
        while True:
            f.readline()  # ignore the information line.
            seq = f.readline().rstrip()  # read the sequence line.
            f.readline()  # ignore the placeholder line.
            qual = f.readline().rstrip()  # read the quality line.
            if len(seq) == 0:  # that means we reached the end of the file.
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

Now let’s test our function:

#test our function:

seqs, quals = readFastqGenome('SRR835775_1.first1000.fastq')
print("the first five group of reads: ", seqs[:5],"\n")
print("the first five group of associated qualities: ", quals[:5])

Now we’ll create a histogram to see which quality scores are most common and which are least
common.
To do that we need a helper function that converts a Phred33-encoded ASCII character
into a numeric quality score.

def phred33ToQ(p33):
    """this function is used to
    convert phred33 to quality value
    """
    return ord(p33) - 33  # get quality value

# test the function:
print('the quality value for # is: ', phred33ToQ('#'))  # result is 2
print('the quality value for J is: ', phred33ToQ('J'))  # result is 41
A low quality score means that we have very low confidence in the base call; it is not likely to be correct.
A quality score Q corresponds to an error probability of 10^(-Q/10), so for Q = 2 there is about a 63% chance that the base at this position is incorrect.
The second value is a high quality score (Q = 41), so the chance that the base at this position is
incorrect is very small (less than 0.01%).
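
As a rough check of these numbers, the sketch below converts a quality score to an error probability, assuming the standard Phred definition Q = -10 * log10(p). The function name q_to_error_prob is hypothetical and is not part of the lab's code.

```python
def q_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# '#' and 'J' are the two Phred33 characters tested above.
for symbol in '#J':
    q = ord(symbol) - 33  # same conversion as phred33ToQ
    print(symbol, q, q_to_error_prob(q))
```

Under this definition, Q = 10 means a 1 in 10 chance of error, Q = 20 means 1 in 100, and so on.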
Now that we have everything we need, let’s draw the histogram:

def createHist(qualities):
    """ this function is used to calculate a histogram of
    our quality scores: how many times each quality
    score appears in our file
    """
    hist = [0]*50  # create a list of 50 zeros, one counter per quality score
    for qual in qualities:  # work through each quality line
        for phred in qual:  # work through each Phred33 encoded value
            q = phred33ToQ(phred)
            hist[q] += 1
    return hist

h = createHist(quals)
plt.bar(range(len(h)), h)
plt.title('Quality Score Histogram')
plt.xlabel('Quality Values')
plt.ylabel('Frequencies')
plt.show()

We can see that our quality scores range from 2 to 41. There is also a large
spike at the lowest quality value, 2. The frequencies then drop very low and increase gradually toward
the high quality scores.
In the next example we’ll plot the GC content at each position of the read. GC content differs
from species to species.
This helps us to figure out whether the mix of different bases changes as we move along the
read. In general it won’t change much.
But if we have one particularly bad sequencing cycle, then we might see a very different mix of
G’s and C’s relative to the other bases.
The following Python function calculates the GC content at each read position for the reads
we loaded from the FASTQ file; we then plot the result.

def drawGCContent(reads):
    """ this function is used to calculate the fraction of
    G and C bases at each position across all reads
    (it assumes reads are at most 100 bases long)
    """
    gc = [0]*100  # define a list with 100 items, equal to the read length
    totals = [0]*100
    for read in reads:
        for i in range(len(read)):
            if read[i] == 'C' or read[i] == 'G':
                gc[i] += 1
            totals[i] += 1
    for i in range(len(gc)):
        if totals[i] > 0:
            gc[i] /= float(totals[i])
    return gc
# Call our function and draw the result:
gc = drawGCContent(seqs)
plt.title('GC-Content By Position')
plt.xlabel('Positions')
plt.ylabel('Average GC Content')
plt.plot(range(len(gc)),gc)
plt.show()
Notice that the change is pretty small: the average GC content stays between about 0.56 and 0.62.
As we did with the FASTA file, we’ll find the distribution of bases in these sequences by using the collections
module again:

count = collections.Counter()
for seq in seqs:
    count.update(seq)
print(count)

N is used when the sequencer or base caller has no confidence and does not want to make a call,
because there is no good evidence to support one base over the others.
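
One way to see where such no-calls occur is to count the N's at each read position. The sketch below is only an illustration: the helper count_n_by_position and the example reads are made up and are not part of the lab's code or data.

```python
def count_n_by_position(reads):
    """Count how many reads have an 'N' at each position (hypothetical helper)."""
    longest = max((len(read) for read in reads), default=0)
    counts = [0] * longest
    for read in reads:
        for i, base in enumerate(read):
            if base == 'N':
                counts[i] += 1
    return counts

# Made-up reads for illustration.
example_reads = ['ACGTN', 'NCGTA', 'ACNTN']
print(count_n_by_position(example_reads))  # [1, 0, 1, 0, 2]
```

With real reads, the same function could be called on the list of sequences returned by the FASTQ reader.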

References:
Bioinformatic lectures by Dr. Eng. Asmaa Shaar.
