RIP-Tutorials-bioinformatics
#bioinformatics
Table of Contents
About
Remarks
Examples
Definition
.GFF file parser (as buffer) with filter to keep only rows
Examples
Chapter 3: BLAST
Examples
Examples
FASTA
GCT
Examples
Examples
Using Paste
Using Awk
Examples
Credits
About
You can share this PDF with anyone you feel could benefit from it. Download the latest version
from: bioinformatics
It is an unofficial and free bioinformatics ebook created for educational purposes. All the content is
extracted from Stack Overflow Documentation, which is written by many hardworking individuals at
Stack Overflow. It is neither affiliated with Stack Overflow nor official bioinformatics.
The content is released under Creative Commons BY-SA, and the list of contributors to each
chapter is provided in the credits section at the end of this book. Images may be copyright of
their respective owners unless otherwise specified. All trademarks and registered trademarks are
the property of their respective company owners.
Use the content presented in this book at your own risk; it is not guaranteed to be correct or
accurate. Please send your feedback and corrections to info@zzzprojects.com
https://riptutorial.com/ 1
Chapter 1: Getting started with bioinformatics
Remarks
Bioinformatics is an interdisciplinary field that develops methods and software tools for
understanding biological data.
• Sequence analysis
• Phylogenetics
• Molecular modeling
• Analysis of gene and protein expression
Examples
Definition
.GFF file parser (as buffer) with filter to keep only rows
"""
Description:
    - Performs buffered reading and filtering (see: @filter) of a .GFF input file
      (e.g. "[./toy.gff][3]") to keep only rows whose type field (column 3) is equal to
      "transcript".
Args:
    - None (yet)
Returns:
    - None (yet)
Related:
    - [1]: https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/README.md
    - [2]: http://www.vigilab.org/
    - [3]: https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/toy.gff
"""
# gene_to_field: dict whose keys are integer row indices (gene_i) of the kept rows and whose
# values are dicts of the 9 fields (cols) of a gff row; the gff version is unknown, but an
# example is: https://github.com/a1ultima/vigilab_intergeneShareGFF/blob/master/toy.gff
gene_to_field = {}
gene_i = 0
with open("./toy.gff", "r") as fi:
    while True:  # breaks once there are no more lines in the input .gff file, see "@break"
        line = fi.readline()
        if not line:  # @break: readline() returns "" at the end of the file
            break
        # turn a line of input data into a list, each element = different field value,
        # e.g. [...,"transcript",...]
        line_split = line.rstrip("\n").split("\t")
        # @filter: skip comment/malformed lines and rows whose type is not "transcript"
        if len(line_split) != 9 or line_split[2] != "transcript":
            continue
        ##@TEST: sometimes 4.00 instead of 4.0 (trivial) # some @deprecated code, but may be useful one day
        #if not (str(line_split[5])==str(float(line_split[5]))):
        #    print("oops")
        #    print("\t"+str(line_split[5])+"___"+str(float(line_split[5])))
        # create a dict key, for gene_to_field dict, and set its values according to list
        # elements in line_split
        gene_to_field[gene_i] = {
            "c1_reference_seq": line_split[0],  # e.g. 'scaffold_150'
            "c2_source": line_split[1],         # e.g. 'GWSUNI'
            "c3_type": line_split[2],           # e.g. 'transcript'
            "c4_start": int(line_split[3]),     # e.g. '1372'
            "c5_end": int(line_split[4]),       # e.g. '2031'
            "c6_score": float(line_split[5]),   # e.g. '45.89'
            "c7_strand": line_split[6],         # e.g. '+'
            "c8_phase": line_split[7],          # e.g. '.' @Note: codon frame (0,1,2)
            "c9_attributes": line_split[8],     # e.g. <see @gff3.md>
        }
        gene_i += 1
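The row-to-dictionary mapping above can also be wrapped in a small function that works on one line at a time. This is only a sketch under stated assumptions: the name parse_gff_row is hypothetical, lines starting with "#" are treated as comments, and a score of "." (on which float() would raise) is mapped to None rather than converted.

```python
def parse_gff_row(line, keep_type="transcript"):
    """Return a dict for one GFF row, or None if the row is filtered out."""
    # Assumption: blank lines and lines starting with '#' carry no feature data.
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    # Keep only well-formed rows (9 tab-separated fields) of the requested type.
    if len(fields) != 9 or fields[2] != keep_type:
        return None
    return {
        "c1_reference_seq": fields[0],
        "c2_source": fields[1],
        "c3_type": fields[2],
        "c4_start": int(fields[3]),
        "c5_end": int(fields[4]),
        # Assumption: "." means "no score", so it is stored as None.
        "c6_score": None if fields[5] == "." else float(fields[5]),
        "c7_strand": fields[6],
        "c8_phase": fields[7],
        "c9_attributes": fields[8],
    }


# Made-up rows, using the example values from the comments above.
kept = parse_gff_row("scaffold_150\tGWSUNI\ttranscript\t1372\t2031\t45.89\t+\t.\tg1.t1\n")
dropped = parse_gff_row("scaffold_150\tGWSUNI\tgene\t1372\t2031\t.\t+\t.\tg1\n")
print(kept["c4_start"], kept["c6_score"], dropped)
```

Returning None for filtered rows keeps the caller's loop simple: it can skip anything falsy and store the rest.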
Many biological questions can be translated into a DNA sequencing problem. For instance, if you
want to know the expression level of a gene you can: copy its mRNAs into complementary DNA
molecules, sequence each of the resulting DNA molecules, map those sequences back to the
reference genome, and then use the count of alignments overlapping the gene as a proxy of its
expression (see RNA-seq). Other examples include: determining the 3D structure of the genome,
locating histone marks, and mapping RNA-DNA interactions. A list (not up to date) of biological
questions addressed by clever DNA-sequencing methods can be found here.
Typically, the wet-lab scientists (the people wearing white coats and goggles) will design and
perform the experiments to get the sequenced DNA samples. Then, the bioinformaticians (the people
using computers and drinking coffee) will take these sequences, encoded as FASTQ files, and
map them to a reference genome, saving the results as BAM files.
Going back to our gene expression example, this is how a bioinformatician would generate a BAM
file from a FASTQ file (using a Linux system):
Here STAR is a splice-tolerant aligner, which is necessary because reads derived from mRNA may
span exon junctions and therefore align to the genome with intron-sized gaps.
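The mapping command itself is not shown above. As a rough illustration only, the sketch below assembles a typical STAR invocation in Python. The helper name star_command, the paths and the thread count are hypothetical, and the flags are the commonly used ones from the STAR manual as far as I recall them, so they should be checked against the installed version before use.

```python
import shlex


def star_command(genome_dir, fastq, prefix="sample_", threads=4):
    """Build (but do not run) a STAR command line as a list of arguments."""
    return [
        "STAR",
        "--runThreadN", str(threads),        # number of worker threads
        "--genomeDir", genome_dir,           # directory holding a prebuilt STAR index
        "--readFilesIn", fastq,              # input reads (FASTQ)
        "--outSAMtype", "BAM", "SortedByCoordinate",  # write a coordinate-sorted BAM
        "--outFileNamePrefix", prefix,       # prefix for all output files
    ]


# Hypothetical paths, for illustration only.
cmd = star_command("path/to/star_index", "my_reads.fastq")
print(shlex.join(cmd))

# To actually run it (requires STAR and a genome index to be available):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes the command easy to log and inspect before any compute time is spent.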
PS: Once the mapping results are obtained, the creative part begins. This is where
bioinformaticians devise statistical tests to check whether the data show biologically
meaningful patterns or spurious signals born out of noise.
Chapter 2: Basic Samtools
Examples
Count number of records per reference in bamfile
This will produce a sorted, indexed BAM, creating the files thing.bam and thing.bam.bai. Tools
that need random access to a BAM require its index file.
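Once an index exists, per-reference record counts can be read from the output of samtools idxstats, which prints one tab-separated line per reference: name, sequence length, mapped reads, unmapped reads. The sketch below only parses such text. The function name is hypothetical and the example numbers are made up.

```python
def records_per_reference(idxstats_text):
    """Sum mapped and unmapped read counts for each reference in idxstats output."""
    counts = {}
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        # Columns: reference name, sequence length, mapped reads, unmapped reads.
        name, _length, mapped, unmapped = line.split("\t")
        counts[name] = int(mapped) + int(unmapped)
    return counts


# Made-up idxstats-style text; the final "*" line holds reads with no reference.
example_idxstats = "chr1\t1000\t12\t1\nchr2\t800\t5\t0\n*\t0\t0\t3\n"
per_reference = records_per_reference(example_idxstats)
print(per_reference)
```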
Chapter 3: BLAST
Examples
Create a DNA blastdb
In order to compare query sequences against reference sequences, you must create a blastdb of
your reference(s). This is done using makeblastdb, which is included when you install BLAST.
makeblastdb -in <input fasta> -dbtype nucl -out <label for database>
For example, given a FASTA file containing the following records:
>reference_1
ATCGATAAA
>reference_2
ATCGATCCC
running makeblastdb with -out my_database creates the following files:
• my_database.nhr
• my_database.nin
• my_database.nsq
Note that the database files are named after the label given to the -out argument.
You can extract FASTA sequences from a blastdb that was built from a FASTA file using blastdbcmd,
which should be installed alongside makeblastdb.
If you had a database called my_database, consisting of the files my_database.nhr, my_database.nsq
and my_database.nin, and you wanted your FASTA output file to be called reference.fasta, you would
run the following:
blastdbcmd -db my_database -entry all -out reference.fasta
If you install BLAST from the Ubuntu package repositories (package ncbi-blast+), you can check in
advance which version will be installed here:
http://packages.ubuntu.com/xenial/ncbi-blast+
Data can be extracted from a blastdb using blastdbcmd, which should be included in a BLAST
installation. The -outfmt option takes a format string built from the specifiers below, which
determines what metadata is included and in what order.
-outfmt <String>
Output format, where the available format specifiers are:
%f means sequence in FASTA format
%s means sequence data (without defline)
%a means accession
%g means gi
%o means ordinal id (OID)
%i means sequence id
%t means sequence title
%l means sequence length
%h means sequence hash value
%T means taxid
%X means leaf-node taxids
%e means membership integer
%L means common taxonomic name
%C means common taxonomic names for leaf-node taxids
%S means scientific name
%N means scientific names for leaf-node taxids
%B means BLAST name
%K means taxonomic super kingdom
%P means PIG
The example snippet shows how gi and taxid can be extracted from blastdb. The NCBI
16SMicrobial (ftp) blastdb was chosen for this example:
# Example:
# blastdbcmd -db <db label> -entry all -outfmt "%g %T" -out <outfile>
blastdbcmd -db 16SMicrobial -entry all -outfmt "%g %T" -out 16SMicrobial.gi_taxid.tsv
939733319 526714
636559958 429001
645319546 629680
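The resulting two-column file is easy to load for later lookups. A minimal sketch, assuming each line holds exactly a gi and a taxid separated by whitespace, as in the output above. The function name read_gi_taxid is hypothetical.

```python
def read_gi_taxid(lines):
    """Map gi -> taxid from lines formatted as '<gi> <taxid>'."""
    gi_to_taxid = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            # Skip blank or unexpected lines rather than guessing their meaning.
            continue
        gi, taxid = parts
        gi_to_taxid[gi] = taxid
    return gi_to_taxid


# The three output lines shown above.
sample_lines = ["939733319 526714\n", "636559958 429001\n", "645319546 629680\n"]
gi_map = read_gi_taxid(sample_lines)
print(gi_map["636559958"])

# With the real file: gi_map = read_gi_taxid(open("16SMicrobial.gi_taxid.tsv"))
```

Identifiers are kept as strings because they are labels, not quantities to compute with.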
Chapter 4: Common File Formats
Examples
FASTA
The FASTA file format is used for representing one or more nucleotide or amino acid sequences
as a continuous string of characters. Sequences are annotated with a comment line, which starts
with the > character, that precedes each sequence. The comment line is typically formatted in a
uniform way, dictated by the sequence's source database or generating software. For example:
The above example illustrates the amino acid sequence of an isoform of the human AKT1 gene,
as fetched from the NCBI protein database. The header line specifies that this sequence may be
identified with the GI ID 62241013 and the RefSeq protein accession NP_001014431.1. This protein is
named RAC-alpha serine/threonine-protein kinase and comes from the species Homo sapiens.
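A header line of this kind can be split into its parts programmatically. The sketch below assumes the older NCBI "gi|...|ref|...|" header layout. The header string in it is reconstructed from the description above rather than copied from the database, and the function name is hypothetical.

```python
def parse_ncbi_defline(defline):
    """Split an NCBI-style FASTA header into identifiers, title and organism."""
    if not defline.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Everything up to the last '|' is a run of tag|value pairs (gi|..., ref|...).
    ids, _, rest = defline[1:].rpartition("|")
    fields = ids.split("|")
    record = dict(zip(fields[0::2], fields[1::2]))
    title = rest.strip()
    organism = None
    # Assumption: the organism, when present, is the trailing [bracketed] part.
    if title.endswith("]") and "[" in title:
        title, _, organism = title[:-1].rpartition("[")
        title = title.strip()
    record["title"] = title
    record["organism"] = organism
    return record


# Reconstructed from the identifiers named in the text; not a verbatim database header.
header = ">gi|62241013|ref|NP_001014431.1| RAC-alpha serine/threonine-protein kinase [Homo sapiens]"
parsed = parse_ncbi_defline(header)
print(parsed)
```

Header conventions differ between databases and have changed over time, so a parser like this should be matched to the files actually in use.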
The MAF file format is a tab-delimited text file format intended for describing somatic DNA
mutations detected in sequencing results, and is distinct from the Multiple Alignment Format file
type, which is intended for representing aligned nucleotide sequences. Column headers and
ordering may sometimes vary between files of different sources, but the names and orders of
columns, as defined in the specification, are the following:
Hugo_Symbol
Entrez_Gene_Id
Center
NCBI_Build
Chromosome
Start_Position
End_Position
Strand
Variant_Classification
Variant_Type
Reference_Allele
Tumor_Seq_Allele1
Tumor_Seq_Allele2
dbSNP_RS
dbSNP_Val_Status
Tumor_Sample_Barcode
Matched_Norm_Sample_Barcode
Match_Norm_Seq_Allele1
Match_Norm_Seq_Allele2
Tumor_Validation_Allele1
Tumor_Validation_Allele2
Match_Norm_Validation_Allele1
Match_Norm_Validation_Allele2
Verification_Status
Validation_Status
Mutation_Status
Sequencing_Phase
Sequence_Source
Validation_Method
Score
BAM_File
Sequencer
Tumor_Sample_UUID
Matched_Norm_Sample_UUID
Many MAF files, such as those available from the TCGA, also contain additional columns
expanding on the variant annotation. These columns can include reference nucleotide transcript
IDs for corresponding genes, representative codon or amino acid changes, QC metrics, population
statistics, and more.
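Because a MAF file is tab-delimited with a header row, it can be read with a standard CSV reader keyed by column name. A minimal sketch, assuming that lines starting with "#" (such as a version line) precede the header and can be skipped. The two data rows are made up and use only a few of the columns listed above.

```python
import csv
import io


def read_maf(handle):
    """Return the rows of a MAF file as dictionaries keyed by column name."""
    # Assumption: any line starting with '#' is a comment and not part of the table.
    rows = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))


# Made-up example with a subset of the specified columns.
example_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\tVariant_Classification\n"
    "GENE_A\t1\t100\t100\tMissense_Mutation\n"
    "GENE_B\t2\t250\t251\tFrame_Shift_Del\n"
)
mutations = read_maf(example_maf)
for row in mutations:
    print(row["Hugo_Symbol"], row["Variant_Classification"])
```

Reading by column name rather than position tolerates the reordered or additional columns mentioned above.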
GCT
The GCT file format is a tab-delimited text file format used for describing processed gene
expression or RNAi data, typically derived from microarray chip analysis. This data is arranged
with a single annotated gene or probe per line, and a single chip sample per column (beyond the
annotation columns). For example:
#1.2
22215 2
Name Description Tumor_One Normal_One
1007_s_at DDR1 -0.214548 -0.18069
1053_at RFC2 0.868853 -1.330921
117_at HSPA6 1.124814 0.933021
121_at PAX8 -0.825381 0.102078
1255_g_at GUCA1A -0.734896 -0.184104
1294_at UBE1L -0.366741 -1.209838
In this example, the first line specifies the version of the GCT file specification, which in this case
is 1.2. The second line specifies the number of rows of data (22215) and the number of samples (2).
The header row specifies two annotation columns (Name for the chip probe set identifiers and
Description for the gene symbols the probe set covers) and the names of the samples being
assayed (Tumor_One and Normal_One). Each row of data beyond the header lists a single probe set
identifier (in this case, Affymetrix gene chip probe sets), its corresponding gene symbol (if one
exists), and the normalized values for each sample. Sample data values will vary based upon
assay type and normalization methods, but are typically signed floating point numeric values.
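The layout described above (version line, dimensions line, header row, data rows) maps directly onto a small parser. This is a sketch rather than a full implementation of the specification: the function name is hypothetical, fields are assumed to be tab-separated, and all values are assumed to be numeric.

```python
def parse_gct(text):
    """Parse GCT text into (version, sample names, {probe: (description, values)})."""
    lines = text.splitlines()
    version = lines[0].lstrip("#").strip()                 # e.g. "1.2"
    n_rows, n_samples = (int(x) for x in lines[1].split()[:2])
    header = lines[2].split("\t")
    samples = header[2:]                                   # after Name and Description
    data = {}
    for line in lines[3:]:
        if not line.strip():
            continue
        name, description, *values = line.split("\t")
        data[name] = (description, [float(v) for v in values])
    # Check the declared dimensions against what was actually read.
    if len(samples) != n_samples or len(data) != n_rows:
        raise ValueError("dimensions line does not match the table")
    return version, samples, data


# Two rows taken from the example above, with the row count adjusted to match.
example_gct = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tTumor_One\tNormal_One\n"
    "1007_s_at\tDDR1\t-0.214548\t-0.18069\n"
    "1053_at\tRFC2\t0.868853\t-1.330921\n"
)
gct_version, gct_samples, gct_data = parse_gct(example_gct)
print(gct_version, gct_samples, gct_data["1053_at"])
```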
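Parameters:
• filename: output path, without the .fa extension
• seq: the sequence string
• id: the sequence identifier
• desc: a free-text description
def save_fsta(filename, seq, id, desc):
    # Append one FASTA record to <filename>.fa, wrapping the sequence at 80 characters per line.
    with open(filename + '.fa', "a") as fo:
        # The header (comment) line must start with '>'
        fo.write('>' + str(id) + ' ' + desc + '\n')
        for i in range(0, len(seq), 80):
            fo.write(seq[i:i + 80] + '\n')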
import textwrap
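The lone import textwrap line above hints at a variant built on the standard-library textwrap module. The rest of that snippet is not present here, so the following is only a guess at the idea, with a hypothetical function name: it formats one record as a string and leaves writing it to the caller.

```python
import textwrap


def format_fasta_record(seq_id, desc, seq, width=80):
    """Return one FASTA record as text, wrapping the sequence at `width` characters."""
    header = ">" + str(seq_id) + " " + desc
    # break_on_hyphens=False keeps gap characters ('-') from influencing line breaks.
    body = textwrap.wrap(seq, width=width, break_on_hyphens=False)
    return "\n".join([header] + body) + "\n"


# Made-up sequence of 170 residues, to show the wrapping.
record_text = format_fasta_record("seq1", "example sequence", "ACGT" * 42 + "AC")
print(record_text, end="")
```

Separating formatting from file handling makes the wrapping logic easy to test without touching the disk.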
Chapter 5: Linearizing a FASTA sequence.
Examples
Linearize a FASTA sequence with AWK
• If the current line ($0) starts like a FASTA header (^>), we print a newline if this is
not the first sequence (N>0?"\n":""), followed by the line itself ($0) and a tab (\t).
Then we move on to the next line (next;).
• If the current line ($0) does not start like a FASTA header, the default awk pattern applies:
we just print the whole line without a trailing newline.
• At the end (END), we print a single newline to terminate the last sequence.
$ curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta" |\
gunzip -c |\
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\
head
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSDFKTVLGSALLAVERDMVHVVPKY
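The same logic as the awk program can be written as a Python generator, which may be easier to reuse inside a larger script. A minimal sketch: the function name is hypothetical, and any lines before the first header are ignored.

```python
def linearize_fasta(lines):
    """Yield one 'header<TAB>sequence' string per FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the last record at the end of the input.
    if header is not None:
        yield header + "\t" + "".join(chunks)


# The two small records from the BLAST chapter, with the first sequence split over two lines.
fasta_lines = [">reference_1\n", "ATCGA\n", "TAAA\n", ">reference_2\n", "ATCGATCCC\n"]
linearized = list(linearize_fasta(fasta_lines))
for row in linearized:
    print(row)
```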
Chapter 6: Linearizing a fastq file
Examples
Using Paste
@IL31_4368:1:1:996:8507/2 TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA + FFCEFFFEEFFFFFFFEFFEFFFEFCFC<EEFEFFFCEFF<;EEFF=FEE?FCE
@IL31_4368:1:1:996:21421/2 CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC + >DBDDB,B9BAA4AAB7BB?7BBB=91;+*@;5<87+*=/*@@?9=73=.7)7*
@IL31_4368:1:1:997:10572/2 GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG + E?=EECE<EEEE98EEEEAEEBD??BE@AEAB><EEABCEEDEC<<EBDA=DEE
@IL31_4368:1:1:997:15684/2 CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC + EEEEDEEE9EAEEDEEEEEEEEEECEEAAEEDEE<CD=D=*BCAC?;CB,<D@,
@IL31_4368:1:1:997:15249/2 AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT + EDEEC;EEE;EEE?EECE;7AEEEEEE07EECEA;D6D>+EE4E7EEE4;E=EA
@IL31_4368:1:1:997:6273/2 ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG + EEAAFFFEEFEFCFAFFAFCCFFEFEF>EFFFFB?ABA@ECEE=<F@DE@DDF;
@IL31_4368:1:1:997:1657/2 CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT + A;0A?AA+@A<7A7019/<65,3A;'''07<A=<=>?7=?6&)'9('*%,>/(<
@IL31_4368:1:1:997:5609/2 TCACTATCAGAAACAGAATGTATAACTTCCAAATCAGTAGGAAACACAAGGAAA + AEECECBEC@A;AC=<AEEEEAEEEE>AC,CE?ECCE9EAEC4E:<C>AC@EE)
@IL31_4368:1:1:997:14262/2 TGTTTTTTCTTTTTCTTTTTTTTTTGACAGTGCAGAGATTTTTTATCTTTTTAA + 97'<2<.64.?7/3(891?=(6??6+<6<++/*..3(:'/'9::''&(1<>.(,
@IL31_4368:1:1:998:19914/2 GAATGAAAGCAGAGACCCTGATCGAGCCCCAGAAAGATACACCTCCAGATTTTA + C?=CECE4CD<?8@==;EBE<=0@:@@92@???6<991>.<?A=@5?@99;971
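Output like the above, one record per line, can also be produced in Python by grouping the input four lines at a time. This sketch assumes strictly four-line FASTQ records (no wrapped sequence or quality lines). The function name is hypothetical and the example read is made up.

```python
from itertools import islice


def linearize_fastq(lines):
    """Yield one tab-joined string per four-line FASTQ record."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = list(islice(stripped, 4))   # header, sequence, '+', qualities
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield "\t".join(record)


# Two made-up reads.
fastq_lines = ["@read1\n", "ACGT\n", "+\n", "FFFF\n", "@read2\n", "TTGA\n", "+\n", "EEEE\n"]
flat_reads = list(linearize_fastq(fastq_lines))
for row in flat_reads:
    print(row)
```

With each record on a single line, ordinary line-based tools (sort, grep, shuffling) can operate on whole reads without splitting them.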
Using Awk
@IL31_4368:1:1:996:8507/2 TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA + FFCEFFFEEFFFFFFFEFFEFFFEFCFC<EEFEFFFCEFF<;EEFF=FEE?FCE
@IL31_4368:1:1:996:21421/2 CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC + >DBDDB,B9BAA4AAB7BB?7BBB=91;+*@;5<87+*=/*@@?9=73=.7)7*
@IL31_4368:1:1:997:10572/2 GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG + E?=EECE<EEEE98EEEEAEEBD??BE@AEAB><EEABCEEDEC<<EBDA=DEE
@IL31_4368:1:1:997:15684/2 CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC + EEEEDEEE9EAEEDEEEEEEEEEECEEAAEEDEE<CD=D=*BCAC?;CB,<D@,
@IL31_4368:1:1:997:15249/2 AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT + EDEEC;EEE;EEE?EECE;7AEEEEEE07EECEA;D6D>+EE4E7EEE4;E=EA
@IL31_4368:1:1:997:6273/2 ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG + EEAAFFFEEFEFCFAFFAFCCFFEFEF>EFFFFB?ABA@ECEE=<F@DE@DDF;
@IL31_4368:1:1:997:1657/2 CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT + A;0A?AA+@A<7A7019/<65,3A;'''07<A=<=>?7=?6&)'9('*%,>/(<
@IL31_4368:1:1:997:5609/2 TCACTATCAGAAACAGAATGTATAACTTCCAAATCAGTAGGAAACACAAGGAAA + AEECECBEC@A;AC=<AEEEEAEEEE>AC,CE?ECCE9EAEC4E:<C>AC@EE)
@IL31_4368:1:1:997:14262/2 TGTTTTTTCTTTTTCTTTTTTTTTTGACAGTGCAGAGATTTTTTATCTTTTTAA + 97'<2<.64.?7/3(891?=(6??6+<6<++/*..3(:'/'9::''&(1<>.(,
@IL31_4368:1:1:998:19914/2 GAATGAAAGCAGAGACCCTGATCGAGCCCCAGAAAGATACACCTCCAGATTTTA + C?=CECE4CD<?8@==;EBE<=0@:@@92@???6<991>.<?A=@5?@99;971
Read Linearizing a fastq file online: https://riptutorial.com/bioinformatics/topic/4286/linearizing-a-fastq-file
Chapter 7: Sequence analysis
Examples
Calculate the GC% of a sequence
In molecular biology and genetics, GC-content (or guanine-cytosine content, GC% for short) is the
percentage of nitrogenous bases in a DNA molecule that are either guanine or cytosine (out of
four possible bases, the others being adenine and thymine).
Using BioPython:
Using BioRuby:
==> 33
Using R:
# [1] 0.3333333
Using Awk:
echo atgcatgcaaaa |\
awk '{dna=$0; gsub(/[^GCSgcs]/,""); print dna,": GC=",length($0)/length(dna)}'
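The awk one-liner translates directly into plain Python without any library. A minimal sketch with a hypothetical function name: like the awk version, it counts G, C and the ambiguity code S (G or C) in either case, and divides by the full sequence length.

```python
def gc_fraction(dna):
    """Return the fraction of bases in `dna` that are G, C or S (case-insensitive)."""
    if not dna:
        raise ValueError("empty sequence")
    strong = sum(1 for base in dna if base in "GCSgcs")
    return strong / len(dna)


# Same input as the awk example: 4 of the 12 bases are g or c.
gc = gc_fraction("atgcatgcaaaa")
print(round(gc, 7))  # 0.3333333
```

Multiplying the result by 100 gives the percentage form reported by some tools.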
Credits
S. No  Chapters                       Contributors
3      BLAST                          amblina
4      Common File Formats            Razik, woemler
5      Linearizing a FASTA sequence.  Pierre
6      Linearizing a fastq file       Pierre