RIP-Tutorials-bioinformatics

Uploaded by

portlandonline

bioinformatics

#bioinformatics
Table of Contents
About

Chapter 1: Getting started with bioinformatics
Remarks
Examples
Definition
.GFF file parser (as buffer) with filter to keep only rows
Using mapping of DNA sequences to answer biological questions

Chapter 2: Basic Samtools
Examples
Count number of records per reference in bamfile
Convert sam into bam (and back again)

Chapter 3: BLAST
Examples
Create a DNA blastdb
Extract fasta sequences from a nucl blastdb
Install blast on ubuntu
Extract GI and taxid from blastdb

Chapter 4: Common File Formats
Examples
FASTA
Mutation Annotation Format (MAF)
GCT
Sequence Writing In fasta Format

Chapter 5: Linearizing a FASTA sequence.
Examples
Linearize a FASTA sequence with AWK
Reading line by line
Linearize FASTA sequences from Uniprot

Chapter 6: Linearizing a fastq file
Examples
Using Paste
Using Awk

Chapter 7: Sequence analysis
Examples
Calculate the GC% of a sequence

Credits
About
You can share this PDF with anyone you feel could benefit from it. Download the latest version
from: bioinformatics

It is an unofficial and free bioinformatics ebook created for educational purposes. All the content is
extracted from Stack Overflow Documentation, which is written by many hardworking individuals at
Stack Overflow. It is neither affiliated with Stack Overflow nor an official bioinformatics publication.

The content is released under Creative Commons BY-SA, and the list of contributors to each
chapter is provided in the credits section at the end of this book. Images may be copyright of
their respective owners unless otherwise specified. All trademarks and registered trademarks are
the property of their respective company owners.

Use the content presented in this book at your own risk; it is not guaranteed to be correct or
accurate. Please send your feedback and corrections to [email protected]

https://riptutorial.com/ 1
Chapter 1: Getting started with bioinformatics
Remarks
Bioinformatics is an interdisciplinary field that develops methods and software tools for
understanding biological data.

Topics within bioinformatics include:

• Sequence analysis
• Phylogenetics
• Molecular modeling
• Analysis of gene and protein expression

Examples
Definition

(Wikipedia) Bioinformatics is an interdisciplinary field that develops methods and
software tools for understanding biological data. As an interdisciplinary field of science,
bioinformatics combines computer science, statistics, mathematics, and engineering to
analyze and interpret biological data. Bioinformatics has been used for in silico
analyses of biological queries using mathematical and statistical techniques.

.GFF file parser (as buffer) with filter to keep only rows

""" A [GFF parser script][1] in Python for [www.VigiLab.org][2]

Description:
    - Performs buffered (line-by-line) reading and filtering (see: @filter) of a .GFF input file
      (e.g. "[./toy.gff][3]"), keeping only rows whose type field (column 3) equals "transcript".

Args:
    - None (yet)

Returns:
    - None (yet)

Related:
    - [1]: https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/README.md
    - [2]: http://www.vigilab.org/
    - [3]: https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/toy.gff

"""
# gene_to_field: dict whose keys are gene indices 1..n (one per kept row) and whose values are
# dicts holding the 9 fields (columns) of a GFF row. The GFF version is unknown, but an example is:
# https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/toy.gff
gene_to_field = {}

gene_i = 0

with open("./toy.gff", "r") as fi:

    print("Reading GFF file into: gene_to_field (dict), index as such: gene_to_field[gene_i], "
          "where gene_i is between 1-to-n...")

    for line in fi:  # one line at a time; the loop ends when the input .gff file is exhausted

        line = line.rstrip("\n")  # no need for trailing newline chars ("\n")

        if line == "" or line.startswith("#"):  # skip blank lines and "#" comment/directive lines
            continue

        line_split = line.split("\t")  # turn a line of input data into a list, each element =
                                       # different field value, e.g. [...,"transcript",...]

        if line_split[2] != "transcript":  # @filter: rows whose type is not "transcript" are skipped
            continue

        gene_i += 1  # indexing starts from 1 (i.e. [1] = first gene) ends at n

        ##@TEST: sometimes 4.00 instead of 4.0 (trivial) # some @deprecated code, but may be
        ## useful one day
        #if not (str(line_split[5])==str(float(line_split[5]))):
        #    print("oops")
        #    print("\t"+str(line_split[5])+"___"+str(float(line_split[5])))

        # create a dict key, for gene_to_field dict, and set its values according to list elements
        # in line_split
        gene_to_field[gene_i] = {
            "c1_reference_seq": line_split[0],  # e.g. 'scaffold_150'
            "c2_source": line_split[1],         # e.g. 'GWSUNI'
            "c3_type": line_split[2],           # e.g. 'transcript'
            "c4_start": int(line_split[3]),     # e.g. '1372'
            "c5_end": int(line_split[4]),       # e.g. '2031'
            "c6_score": float(line_split[5]) if line_split[5] != "." else None,  # e.g. '45.89' ("." = no score)
            "c7_strand": line_split[6],         # e.g. '+'
            "c8_phase": line_split[7],          # e.g. '.' @Note: codon frame (0,1,2)
            "c9_attributes": line_split[8],     # e.g. <see @gff3.md>
        }

Using mapping of DNA sequences to answer biological questions

Many biological questions can be translated into a DNA sequencing problem. For instance, if you
want to know the expression level of a gene you can: copy its mRNAs into complementary DNA
molecules, sequence each of the resulting DNA molecules, map those sequences back to the
reference genome, and then use the count of alignments overlapping the gene as a proxy for its
expression (see RNA-seq). Other examples include: determining the 3D structure of the genome,
locating histone marks, and mapping RNA-DNA interactions. A list (not up to date) of biological
questions addressed by clever DNA-sequencing methods can be found here.
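
The counting step described above can be sketched in Python. This is only a minimal illustration, not how any particular RNA-seq tool works: the function name count_overlaps, the 1-based inclusive coordinates, and the toy alignment positions are all assumptions made for the example.

```python
def count_overlaps(gene_start, gene_end, alignments):
    """Count alignments (start, end) that overlap the gene interval.

    Coordinates are assumed to be 1-based, inclusive, and on the same reference.
    """
    count = 0
    for aln_start, aln_end in alignments:
        # Two closed intervals overlap when each one starts before the other ends.
        if aln_start <= gene_end and aln_end >= gene_start:
            count += 1
    return count


# Hypothetical gene spanning 1372..2031 and four made-up alignments.
toy_alignments = [(1000, 1100), (1350, 1400), (1500, 1575), (2031, 2100)]
print(count_overlaps(1372, 2031, toy_alignments))  # prints 3
```

Real pipelines also have to decide how to treat spliced reads, strand, and reads that overlap several genes, which this sketch ignores.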

Typically, wet-lab scientists (the people wearing white coats and goggles) design and
perform the experiments to get the sequenced DNA samples. Then, bioinformaticians (the people
using computers and drinking coffee) take these sequences, encoded as FASTQ files, and
map them to a reference genome, saving the results as BAM files.

Going back to our gene expression example, this is how a bioinformatician would generate a BAM
file from a FASTQ file (using a Linux system):

STAR --genomeDir path/to/reference/genome --outSAMtype BAM Unsorted --readFilesIn my_reads.fastq

Where STAR is a splice-aware aligner (necessary for the exon-intron junctions that may be
present in reads derived from mRNA). The --outSAMtype BAM option takes a second word, either
Unsorted or SortedByCoordinate.

PS: Once the mapping results are obtained, the creative part begins. This is where
bioinformaticians devise statistical tests to check whether the data show biologically
meaningful patterns or spurious signals born out of noise.

Read Getting started with bioinformatics online:

https://riptutorial.com/bioinformatics/topic/3960/getting-started-with-bioinformatics

Chapter 2: Basic Samtools
Examples
Count number of records per reference in bamfile

samtools idxstats thing.bam
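
samtools idxstats needs an indexed bam and prints one tab-separated line per reference: reference name, reference length, number of mapped reads, and number of unmapped reads. As a rough sketch of how that output could be consumed, the Python below parses it into a dictionary; the function name parse_idxstats and the sample numbers are made up for illustration.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` output into {reference: (length, mapped, unmapped)}."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        name, length, mapped, unmapped = line.split("\t")
        stats[name] = (int(length), int(mapped), int(unmapped))
    return stats


# Made-up output for illustration; the "*" row collects reads with no reference assigned.
example_output = "chr1\t248956422\t1200\t3\nchr2\t242193529\t950\t0\n*\t0\t0\t42\n"
stats = parse_idxstats(example_output)
print(stats["chr1"][1])  # prints 1200
```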

Convert sam into bam (and back again)

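Samtools can be used to convert between sam and bam:

• -b indicates that the output should be in BAM format

• -S indicates that the input is in SAM format (samtools 1.x detects the input format
automatically and ignores this flag)

• -h includes the header in SAM output

To convert from bam to sam:

samtools view -h thing.bam > thing.sam

And to convert from sam to bam:

samtools view -bS thing.sam > thing.bam

samtools sort -o thing.sorted.bam thing.bam
samtools index thing.sorted.bam

This will produce a sorted, indexed bam: the files thing.sorted.bam and thing.sorted.bam.bai.
Tools that need random access to a bam (for example samtools idxstats) require the index file.
Older samtools releases (0.1.x) used the form samtools sort thing.bam thing, which writes thing.bam.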

Read Basic Samtools online: https://riptutorial.com/bioinformatics/topic/6886/basic-samtools

Chapter 3: BLAST
Examples
Create a DNA blastdb

In order to compare query sequences against reference sequences, you must create a blastdb of
your reference(s). This is done using makeblastdb which is included when you install blast.

makeblastdb -in <input fasta> -dbtype nucl -out <label for database>

So if you had a file reference.fasta containing the following records:

>reference_1
ATCGATAAA
>reference_2
ATCGATCCC

You would run the following:

makeblastdb -in reference.fasta -dbtype nucl -out my_database

This would create the following files:

• my_database.nhr
• my_database.nin
• my_database.nsq

Note, the database files are labelled with the -out argument.

Extract fasta sequences from a nucl blastdb

You can extract fasta sequences from a blastdb constructed from a fasta file using blastdbcmd, which
is installed alongside makeblastdb as part of BLAST+.

blastdbcmd -entry all -db <database label> -out <outfile>

If you had a database called my_database which contained the files my_database.nhr, my_database.nsq,
my_database.nin and you wanted your fasta output file to be called reference.fasta you would run
the following:

blastdbcmd -entry all -db my_database -out reference.fasta

Install blast on ubuntu

sudo apt-get install ncbi-blast+

You can check the version that will be installed in advance here:

http://packages.ubuntu.com/xenial/ncbi-blast+

Extract GI and taxid from blastdb

Data can be extracted from a blastdb using blastdbcmd which should be included in a blast
installation. You can specify from the options below as part of -outfmt what metadata to include
and in what order.

From the man page:

-outfmt <String>
Output format, where the available format specifiers are:
%f means sequence in FASTA format
%s means sequence data (without defline)
%a means accession
%g means gi
%o means ordinal id (OID)
%i means sequence id
%t means sequence title
%l means sequence length
%h means sequence hash value
%T means taxid
%X means leaf-node taxids
%e means membership integer
%L means common taxonomic name
%C means common taxonomic names for leaf-node taxids
%S means scientific name
%N means scientific names for leaf-node taxids
%B means BLAST name
%K means taxonomic super kingdom
%P means PIG

The example snippet shows how gi and taxid can be extracted from blastdb. The NCBI
16SMicrobial (ftp) blastdb was chosen for this example:

# Example:
# blastdbcmd -db <db label> -entry all -outfmt "%g %T" -out <outfile>
blastdbcmd -db 16SMicrobial -entry all -outfmt "%g %T" -out 16SMicrobial.gi_taxid.tsv

Which will produce a file 16SMicrobial.gi_taxid.tsv that looks like this:

939733319 526714
636559958 429001
645319546 629680
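
Because the -outfmt string above separates the two fields with a space, each output line can be split on whitespace. The following Python is a small sketch of loading such a file into a lookup table; parse_gi_taxid is a hypothetical helper name, and the sample lines are the three shown above.

```python
def parse_gi_taxid(lines):
    """Map GI (str) -> taxid (str) from lines such as '939733319 526714'."""
    gi_to_taxid = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gi, taxid = fields
        gi_to_taxid[gi] = taxid
    return gi_to_taxid


sample = ["939733319 526714", "636559958 429001", "645319546 629680"]
mapping = parse_gi_taxid(sample)
print(mapping["636559958"])  # prints 429001
```

With a real file, the same function can be given an open file handle, since iterating over a handle yields lines.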

Read BLAST online: https://riptutorial.com/bioinformatics/topic/5371/blast

Chapter 4: Common File Formats
Examples
FASTA

The FASTA file format is used for representing one or more nucleotide or amino acid sequences
as a continuous string of characters. Sequences are annotated with a comment line, which starts
with the > character, that precedes each sequence. The comment line is typically formatted in a
uniform way, dictated by the sequence's source database or generating software. For example:

>gi|62241013|ref|NP_001014431.1| RAC-alpha serine/threonine-protein kinase [Homo sapiens]
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQCQLMKTERPRP
NTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDFRSGSPSDNSGAEEMEVSLAK
PKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKILKKEVIVAKDEVAHTLTENRVLQNSRHPFL
TALKYSFQTHDRLCFVMEYANGGELFFHLSRERVFSEDRARFYGAEIVSALDYLHSEKNVVYRDLKLENL
MLDKDGHIKITDFGLCKEGIKDGATMKTFCGTPEYLAPEVLEDNDYGRAVDWWGLGVVMYEMMCGRLPFY
NQDHEKLFELILMEEIRFPRTLGPEAKSLLSGLLKKDPKQRLGGGSEDAKEIMQHRFFAGIVWQHVYEKK
LSPPFKPQVTSETDTRYFDEEFTAQMITITPPDQDDSMECVDSERRPHFPQFSYSASGTA

The above example illustrates the amino acid sequence of an isoform of the human AKT1 gene,
as fetched from the NCBI protein database. The header line specifies that this sequence may be
identified with the GI ID 62241013 and the RefSeq protein accession NP_001014431.1. This protein is named
RAC-alpha serine/threonine-protein kinase and is derived from the species Homo sapiens.
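
As a sketch of how the pieces of such a header could be pulled apart, the Python below splits the pipe-delimited identifiers from the free-text description. It assumes the NCBI-style layout shown above (alternating database tag and identifier); other databases format their headers differently, and parse_ncbi_header is a name invented for this example.

```python
def parse_ncbi_header(header):
    """Split an NCBI-style 'gi|...|ref|...| description' FASTA header line."""
    if not header.startswith(">"):
        raise ValueError("FASTA header lines start with '>'")
    identifiers, _, description = header[1:].partition(" ")
    parts = identifiers.split("|")
    # parts alternates database tag and identifier,
    # e.g. ['gi', '62241013', 'ref', 'NP_001014431.1', '']
    ids = dict(zip(parts[0::2], parts[1::2]))
    return ids, description.strip()


ids, description = parse_ncbi_header(
    ">gi|62241013|ref|NP_001014431.1| RAC-alpha serine/threonine-protein kinase [Homo sapiens]"
)
print(ids["ref"])  # prints NP_001014431.1
```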

Mutation Annotation Format (MAF)

The MAF file format is a tab-delimited text file format intended for describing somatic DNA
mutations detected in sequencing results, and is distinct from the Multiple Alignment Format file
type, which is intended for representing aligned nucleotide sequences. Column headers and
ordering may sometimes vary between files of different sources, but the names and orders of
columns, as defined in the specification, are the following:

Hugo_Symbol
Entrez_Gene_Id
Center
NCBI_Build
Chromosome
Start_Position
End_Position
Strand
Variant_Classification
Variant_Type
Reference_Allele
Tumor_Seq_Allele1
Tumor_Seq_Allele2
dbSNP_RS
dbSNP_Val_Status
Tumor_Sample_Barcode
Matched_Norm_Sample_Barcode
Match_Norm_Seq_Allele1
Match_Norm_Seq_Allele2
Tumor_Validation_Allele1
Tumor_Validation_Allele2
Match_Norm_Validation_Allele1
Match_Norm_Validation_Allele2
Verification_Status
Validation_Status
Mutation_Status
Sequencing_Phase
Sequence_Source
Validation_Method
Score
BAM_File
Sequencer
Tumor_Sample_UUID
Matched_Norm_Sample_UUID

Many MAF files, such as those available from the TCGA, also contain additional columns
expanding on the variant annotation. These columns can include reference nucleotide transcript
IDs for corresponding genes, representative codon or amino acid changes, QC metrics, population
statistics, and more.
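
Since a MAF file is tab-delimited with a header row, it can be read with Python's csv module. The sketch below tallies rows by Variant_Classification. The three data rows are invented, only a few of the columns listed above are included, and skipping '#'-prefixed lines assumes the file may start with a comment such as a version line.

```python
import csv
import io
from collections import Counter


def count_variant_classes(handle):
    """Count rows per Variant_Classification in a tab-delimited MAF stream."""
    # Skip '#'-prefixed comment lines (e.g. a version line) before the header row.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return Counter(row["Variant_Classification"] for row in reader)


# Made-up three-row excerpt for illustration only.
toy_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tVariant_Classification\n"
    "GENE_A\t1\t100\tMissense_Mutation\n"
    "GENE_B\t2\t200\tSilent\n"
    "GENE_C\t3\t300\tMissense_Mutation\n"
)
counts = count_variant_classes(toy_maf)
print(counts["Missense_Mutation"])  # prints 2
```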

GCT

The GCT file format is a tab-delimited text file format used for describing processed gene
expression or RNAi data, typically derived from microarray chip analysis. This data is arranged
with a single annotated gene or probe per line, and a single chip sample per column (beyond the
annotation columns). For example:

#1.2
22215 2
Name Description Tumor_One Normal_One
1007_s_at DDR1 -0.214548 -0.18069
1053_at RFC2 0.868853 -1.330921
117_at HSPA6 1.124814 0.933021
121_at PAX8 -0.825381 0.102078
1255_g_at GUCA1A -0.734896 -0.184104
1294_at UBE1L -0.366741 -1.209838

In this example, the first line specifies the version of the GCT file specification, which in this case
is 1.2. The second line specifies the number of rows of data (22215) and the number of samples (2).
The header row specifies two annotation columns (Name for the chip probe set identifiers and
Description for the gene symbols the probe set covers) and the names of the samples being
assayed (Tumor_One and Normal_One). Each row of data beyond the header lists a single probe set
identifier (in this case, Affymetrix gene chip probe sets), its corresponding gene symbol (if one
exists), and the normalized values for each sample. Sample data values will vary based upon
assay type and normalization methods, but are typically signed floating point numeric values.
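
The layout described above (version line, dimensions line, header row, then data rows) can be read with a few lines of Python. This is a sketch that assumes tab-separated fields and numeric sample values; parse_gct is a hypothetical name, and the toy input reuses two rows from the example with the row count adjusted to match.

```python
def parse_gct(text):
    """Parse GCT text into (version, (n_rows, n_samples), sample_names, data)."""
    lines = text.splitlines()
    version = lines[0].strip()  # e.g. '#1.2'
    n_rows, n_samples = (int(x) for x in lines[1].split()[:2])
    header = lines[2].split("\t")
    samples = header[2:]  # columns after Name and Description
    data = {}
    for line in lines[3:]:
        if not line.strip():
            continue
        fields = line.split("\t")
        name, description = fields[0], fields[1]
        data[name] = (description, [float(v) for v in fields[2:]])
    return version, (n_rows, n_samples), samples, data


toy_gct = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tTumor_One\tNormal_One\n"
    "1007_s_at\tDDR1\t-0.214548\t-0.18069\n"
    "1053_at\tRFC2\t0.868853\t-1.330921\n"
)
version, shape, samples, data = parse_gct(toy_gct)
print(samples)  # prints ['Tumor_One', 'Normal_One']
```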

Sequence Writing In fasta Format

This is a Python example function for writing a sequence in fasta format.

Parameters:

• filename(String) - A file name for writing sequence in fasta format (the .fa extension is appended).
• seq(String) - A DNA or RNA sequence.
• id(String) - The ID of the given sequence.
• desc(String) - A short description of the given sequence.

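import math

def save_fsta(filename, seq, id, desc):
    # Append one record to <filename>.fa, wrapping the sequence at 80 characters per line.
    with open(filename + '.fa', "a") as fo:
        header = '>' + str(id) + ' <' + desc + '>\n'  # fasta header lines start with '>'
        fo.write(header)
        count = math.ceil(len(seq) / 80)  # number of 80-character lines needed
        for i in range(count):
            fo.write(seq[80*i:80*(i+1)] + '\n')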

Another way is using textwrap:

import textwrap

def save_fasta(filename, seq, id, desc):
    # Overwrite <filename>.fa with a single record, wrapped at 80 characters per line.
    filename += '.fa'
    with open(filename, 'w') as f:
        f.write('>' + id + ' <' + desc + '>\n')
        for x in textwrap.wrap(seq, 80):
            f.write(x + '\n')

Read Common File Formats online: https://riptutorial.com/bioinformatics/topic/4034/common-file-formats

Chapter 5: Linearizing a FASTA sequence.
Examples
Linearize a FASTA sequence with AWK

Reading line by line

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

One can read this awk script as:

• If the current line ($0) starts like a fasta header (^>), we print a newline if this is
not the first sequence ((N>0?"\n":"")), followed by the line itself ($0), followed by a
tab (\t). Then we move on to the next line (next;).
• If the current line ($0) does not start like a fasta header, the second block applies: it has
no pattern, so it matches every remaining line. We just print the whole line without a trailing newline.
• At the end (END) we print a single newline to terminate the last sequence.
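
For comparison, the same logic can be sketched in Python. This is an illustrative equivalent rather than part of the original recipe; linearize_fasta is a made-up name, and unlike the awk script it silently ignores any sequence lines that appear before the first header.

```python
def linearize_fasta(lines):
    """Yield one 'header<TAB>sequence' string per FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header + "\t" + "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Emit the last record once the input is exhausted.
        yield header + "\t" + "".join(chunks)


toy = [">reference_1\n", "ATCG\n", "ATAAA\n", ">reference_2\n", "ATCGATCCC\n"]
for record in linearize_fasta(toy):
    print(record)
```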

Linearize FASTA sequences from Uniprot

Download and linearize the first 10 FASTA sequences from UniProt:

$ curl -s
"ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta
|\
gunzip -c |\
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\
head

>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) GN=FV3-001R PE=4 SV=1	MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRY
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) GN=FV3-002L PE=4 SV=1	MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCARIKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDT
>sp|Q197F8|002R_IIV3 Uncharacterized protein 002R OS=Invertebrate iridescent virus 3 GN=IIV3-002R PE=4 SV=1	MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNREQALAERYPELQTSEPSEDYSGPVESLELLPLEIKLDIMQYLSWEQISWCKHPW
>sp|Q197F7|003L_IIV3 Uncharacterized protein 003L OS=Invertebrate iridescent virus 3 GN=IIV3-003L PE=4 SV=1	MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDTSLNARSLTTTPSLTTCTPPSLAACTPPTSLGMVDSPPHINPPRRIGTLCFDFG
>sp|Q6GZX2|003R_FRG3G Uncharacterized protein 3R OS=Frog virus 3 (isolate Goorha) GN=FV3-003R PE=3 SV=1	MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSDFKTVLGSALLAVERDMVHVVPKY
>sp|Q6GZX1|004R_FRG3G Uncharacterized protein 004R OS=Frog virus 3 (isolate Goorha) GN=FV3-004R PE=4 SV=1	MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
>sp|Q197F5|005L_IIV3 Uncharacterized protein 005L OS=Invertebrate iridescent virus 3 GN=IIV3-005L PE=3 SV=1	MRYTVLIALQGALLLLLLIDDGQGQSPYPYPGMPCNSSRQCGLGTCVHSRCAHCSSDGTLCSPEDPTMVWPCCPESSCQLVVGLPSLVNHYNCLPNQCTDSSQ
>sp|Q6GZX0|005R_FRG3G Uncharacterized protein 005R OS=Frog virus 3 (isolate Goorha) GN=FV3-005R PE=4 SV=1	MQNPLPEVMSPEHDKRTTTPMSKEANKFIRELDKKPGDLAVVSDFVKRNTGKRLPIGKRSNLYVRICDLSGTIYMGETFILESWEELYLPEPTKMEVLGTLES
>sp|Q91G88|006L_IIV6 Putative KilA-N domain-containing protein 006L OS=Invertebrate iridescent virus 6 GN=IIV6-006L PE=3 SV=1	MDSLNEVCYEQIKGTFYKGLFGDFPLIVDKKTGCFNATKLCVLGGKRFVDWNKTLRSKKLIQYYETRCDIKTESLLYEIKGDNNDEITKQITGTYLPKEFILD
>sp|Q6GZW9|006R_FRG3G Uncharacterized protein 006R OS=Frog virus 3 (isolate Goorha) GN=FV3-006R PE=4 SV=1	MYKMYFLKDQKFSLSGTIRINDKTQSEYGSVWCPGLSITGLHHDAIDHNMFEEMETEIIEYLGPWVQAEYRRIKG

Read Linearizing a FASTA sequence. online:

https://riptutorial.com/bioinformatics/topic/4194/linearizing-a-fasta-sequence-

Chapter 6: Linearizing a fastq file
Examples
Using Paste

$ gunzip -c input.fastq.gz | paste - - - - | head

@IL31_4368:1:1:996:8507/2	TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA	+	FFCEFFFEEFFFFFFFEFFEFFFEFCFC<EEFEFFFCEFF<;EEFF=FEE?FCE
@IL31_4368:1:1:996:21421/2	CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC	+	>DBDDB,B9BAA4AAB7BB?7BBB=91;+*@;5<87+*=/*@@?9=73=.7)7*
@IL31_4368:1:1:997:10572/2	GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG	+	E?=EECE<EEEE98EEEEAEEBD??BE@AEAB><EEABCEEDEC<<EBDA=DEE
@IL31_4368:1:1:997:15684/2	CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC	+	EEEEDEEE9EAEEDEEEEEEEEEECEEAAEEDEE<CD=D=*BCAC?;CB,<D@,
@IL31_4368:1:1:997:15249/2	AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT	+	EDEEC;EEE;EEE?EECE;7AEEEEEE07EECEA;D6D>+EE4E7EEE4;E=EA
@IL31_4368:1:1:997:6273/2	ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG	+	EEAAFFFEEFEFCFAFFAFCCFFEFEF>EFFFFB?ABA@ECEE=<F@DE@DDF;
@IL31_4368:1:1:997:1657/2	CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT	+	A;0A?AA+@A<7A7019/<65,3A;'''07<A=<=>?7=?6&)'9('*%,>/(<
@IL31_4368:1:1:997:5609/2	TCACTATCAGAAACAGAATGTATAACTTCCAAATCAGTAGGAAACACAAGGAAA	+	AEECECBEC@A;AC=<AEEEEAEEEE>AC,CE?ECCE9EAEC4E:<C>AC@EE)
@IL31_4368:1:1:997:14262/2	TGTTTTTTCTTTTTCTTTTTTTTTTGACAGTGCAGAGATTTTTTATCTTTTTAA	+	97'<2<.64.?7/3(891?=(6??6+<6<++/*..3(:'/'9::''&(1<>.(,
@IL31_4368:1:1:998:19914/2	GAATGAAAGCAGAGACCCTGATCGAGCCCCAGAAAGATACACCTCCAGATTTTA	+	C?=CECE4CD<?8@==;EBE<=0@:@@92@???6<991>.<?A=@5?@99;971

Using Awk

$ gunzip -c input.fastq.gz | awk '{printf("%s%s",$0,((NR+1)%4==1?"\n":"\t"));}' | head
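
Both commands rely on each fastq record occupying exactly four lines (identifier, sequence, separator, qualities). A Python sketch of the same grouping is given below; linearize_fastq is a hypothetical name, the toy record is invented, and the sketch does not handle fastq files whose sequences are wrapped over several lines.

```python
from itertools import islice


def linearize_fastq(lines):
    """Yield each 4-line FASTQ record as one tab-separated string."""
    it = iter(lines)
    while True:
        # Take the next four lines; an empty list means the input is exhausted.
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            break
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield "\t".join(record)


# Two made-up records for illustration.
toy_fastq = ["@read1\n", "ACGT\n", "+\n", "FFFF\n", "@read2\n", "TTGA\n", "+\n", "EEEE\n"]
for row in linearize_fastq(toy_fastq):
    print(row)
```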

Read Linearizing a fastq file online: https://riptutorial.com/bioinformatics/topic/4286/linearizing-a-fastq-file

Chapter 7: Sequence analysis
Examples
Calculate the GC% of a sequence

In molecular biology and genetics, GC-content (or guanine-cytosine content, GC% in short) is the
percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine (from a
possibility of four different ones, also including adenine and thymine).

Using BioPython (recent releases, in which gc_fraction replaces the older GC function and
Bio.Alphabet no longer exists):

>>> from Bio.Seq import Seq
>>> from Bio.SeqUtils import gc_fraction
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC')
>>> 100 * gc_fraction(my_seq)
46.875

Using BioRuby:

bioruby> require 'bio'

bioruby> seq = Bio::Sequence::NA.new("atgcatgcaaaa")
==> "atgcatgcaaaa"
bioruby> seq.gc_percent

==> 33

Using R:

# Load the SeqinR package.

library("seqinr")
mysequence <- s2c("atgcatgcaaaa")
GC(mysequence)

# [1] 0.3333333

Using Awk:

echo atgcatgcaaaa |\
awk '{dna=$0; gsub(/[^GCSgcs]/,""); print dna,": GC=",length($0)/length(dna)}'

# atgcatgcaaaa : GC= 0.333333
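
Using plain Python (a minimal sketch with no external library; gc_percent is a name chosen for this example, and unlike the awk version it counts only G and C, not the ambiguity code S):

```python
def gc_percent(seq):
    """Return GC content as a percentage of the sequence length."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))  # prints 46.875
```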

Read Sequence analysis online: https://riptutorial.com/bioinformatics/topic/4015/sequence-analysis

Credits
S. No  Chapters                              Contributors
1      Getting started with bioinformatics   BioGeek, Community, hello_there_andy, Marcelo, Pierre
2      Basic Samtools                        amblina
3      BLAST                                 amblina
4      Common File Formats                   Razik, woemler
5      Linearizing a FASTA sequence.         Pierre
6      Linearizing a fastq file              Pierre
7      Sequence analysis                     BioGeek, Pierre, zx8754
