Sequence File Formats

The document discusses several common file formats used in bioinformatics to store biological sequence data, including FASTA, Multi-FASTA, GCG, EMBL, GenBank, and PHYLIP formats. It provides examples of sequences stored in each format and describes the key features and conventions used in each format.

Uploaded by

Pragya Mukherjee


File Formats

Different programs require that information be given to them in a formal manner, using particular keywords and a particular ordering. Such a specification is called a file format. Bioinformatics data are stored in databases in specific file formats, and many such formats are in use.

FASTA

FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985. The name is pronounced "fast A". In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier.

Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence.

>gi|5524211|gb|AAD44166.1| cytochrome b
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMAT
AFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNL
VEWIWGGFSVDKATLNRFFAFHFILPFTMVALAG
VHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDF
LGLLILILLLLLLALLSPDMLGDPDNHMPADPLN
TPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSI
VILGLMPFLHTSKHRSMMLRPLSQALFWTLTMD
LLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPI
AGX
IENY

Multi-FASTA format consists of multiple FASTA format sequences.

>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCA
GCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCG
GCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGC
CGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCT
TTCCAGACGCGGGAT
CTCCCCTCCCC
>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTAT
CGTTAATGGGAACTTC
AGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCA
ATCTGCACAGAGCCAG
TCTGCACACC
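The FASTA rules described above (a ">" description line, optional identifier and description, blank lines and gap symbols ignored) can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `parse_fasta`, the `GAP_CHARS` constant and the toy records are assumptions made for the example.

```python
from typing import Iterator, Tuple

# Spaces and gap symbols that the text says are ignored inside sequence data.
GAP_CHARS = " -_.\t"
_STRIP_GAPS = str.maketrans("", "", GAP_CHARS)


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line.translate(_STRIP_GAPS))
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# Toy multi-FASTA input (hypothetical data, for illustration only).
example = """>sequence1 first toy record
ACTCCCCGTG CGCG

>sequence2
CAGT-CCGG
CAGC
"""
records = list(parse_fasta(example))
for ident, desc, seq in records:
    print(ident, repr(desc), seq)
```

Because a new ">" line simply closes the previous record, the same loop handles single-sequence and multi-FASTA files.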

Example

A sequence file in GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dots (".."). This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package. The GCG format is not used much nowadays.

An example sequence in GCG format is:

ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
AB000263  Length: 368  Check: 4514  ..
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
     gacctgaa
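As a rough sketch of how the ".." divider line can be used, the Python snippet below splits single-sequence GCG text into annotation and sequence and reads the length and checksum from the divider. The function name `parse_gcg` and the toy record (including its placeholder checksum of 0) are assumptions for illustration; the checksum is only read, not verified.

```python
import re


def parse_gcg(text: str) -> dict:
    """Split single-sequence GCG text at the first line ending in '..'."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no divider line ending with '..' found")
    divider = lines[i]
    length = re.search(r"Length:\s*(\d+)", divider)
    check = re.search(r"Check:\s*(\d+)", divider)
    # Everything after the divider is sequence; drop whitespace and
    # any position numbers.
    sequence = re.sub(r"[\s\d]", "", "".join(lines[i + 1:]))
    return {
        "identifier": divider.split()[0],
        "length": int(length.group(1)) if length else None,
        "check": int(check.group(1)) if check else None,
        "annotation": lines[:i],
        "sequence": sequence,
    }


# Toy record (hypothetical data; the checksum value is a placeholder).
example = """DE   toy record, not real data
toy1  Length: 12  Check: 0  ..
       1  acaagatgcc at
"""
record = parse_gcg(example)
print(record["identifier"], record["length"], record["sequence"])
```

Comparing `len(record["sequence"])` with the declared length is a cheap sanity check for truncated or damaged files.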

The new GCG-RSF can contain several sequences in one file. This format should only be used if the file was created with the GCG package.

The programs of the Staden suite of biological analysis software accept sequences in Staden format. A typical Staden format file is:

GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCATTA
CGACGTAGATGCTAGCTGACTCGATGCAGTACGTAGTAGCTGCTG
CTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACT
CGATGC

Staden formatted sequence files contain the sequence and nothing else.

A sequence file in EMBL format can contain several sequences. One sequence entry starts with an identifier line ("ID"), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:

ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
     gacctgaa
//
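The ID / SQ / "//" structure of an EMBL entry lends itself to a small state machine. The Python sketch below is one possible way to pull the identifier and sequence out of each entry; the function name `parse_embl` and the toy entries are assumptions, and all annotation lines other than ID are simply skipped.

```python
from typing import Iterator, Tuple


def parse_embl(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) for each entry in EMBL-style text."""
    identifier, chunks, in_sequence = None, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # Two slashes end the current entry.
            if identifier is not None:
                yield identifier, "".join(chunks)
            identifier, chunks, in_sequence = None, [], False
        elif in_sequence:
            # Keep letters only, dropping blanks and position numbers.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            parts = line.split()
            identifier = parts[1].rstrip(";") if len(parts) > 1 else ""
        elif line.startswith("SQ"):
            in_sequence = True


# Two toy entries (hypothetical data, for illustration only).
example = """ID   TOY0001 standard; RNA; PRI; 12 BP.
XX
DE   toy entry, not real data
XX
SQ   Sequence 12 BP;
     acaagatgcc at        12
//
ID   TOY0002 standard; DNA; PRI; 4 BP.
SQ   Sequence 4 BP;
     ggcc        4
//
"""
entries = list(parse_embl(example))
print(entries)
```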

A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").

An example sequence in GenBank format is:

LOCUS       AB000263   368 bp   mRNA   linear   PRI 05-FEB-1999
DEFINITION  Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION   AB000263
ORIGIN
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
     gacctgaa
//
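A GenBank file can be read in the same spirit, keyed on the LOCUS, ORIGIN and "//" lines. The following Python sketch collects the locus name, the one-line definition and the sequence of each record; `parse_genbank` and the toy record are illustrative assumptions, and multi-line definitions or feature tables are not handled.

```python
def parse_genbank(text: str) -> list:
    """Return one dict per record with locus, definition and sequence."""
    records, current, in_origin = [], None, False
    for line in text.splitlines():
        if line.startswith("//"):
            # Two slashes end the current record.
            if current is not None:
                current["sequence"] = "".join(current["sequence"])
                records.append(current)
            current, in_origin = None, False
        elif in_origin and current is not None:
            # Sequence lines: keep letters, drop blanks and position numbers.
            current["sequence"].append(
                "".join(ch for ch in line if ch.isalpha())
            )
        elif line.startswith("LOCUS"):
            parts = line.split()
            current = {
                "locus": parts[1] if len(parts) > 1 else "",
                "definition": "",
                "sequence": [],
            }
        elif line.startswith("DEFINITION") and current is not None:
            current["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    return records


# Toy record (hypothetical data, for illustration only).
example = """LOCUS       TOY0001   12 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  toy record, not real data.
ACCESSION   TOY0001
ORIGIN
        1 acaagatgcc at
//
"""
genbank_records = parse_genbank(example)
print(genbank_records)
```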

A sequence file in IG format can contain several sequences, each consisting of a number of comment lines that must begin with a semicolon (";"), a line with the sequence name (it may not contain spaces!) and the sequence itself terminated with the termination character '1' for linear or '2' for circular sequences.

An example sequence in IG format is:

; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC
CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC
CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC
CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG
TTTAATTACAGACCTGAA1
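The IG layout (comment lines, a name line, then sequence ending in '1' or '2') can be sketched in Python as below. The function name `parse_ig` and the toy records are assumptions for illustration only.

```python
def parse_ig(text: str) -> list:
    """Return a dict per IG record: comments, name, sequence, topology."""
    topology = {"1": "linear", "2": "circular"}
    records, comments, name, chunks = [], [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if name is None:
            if line.startswith(";"):
                comments.append(line[1:].strip())
            else:
                name = line  # first non-comment line is the sequence name
            continue
        if line[-1] in topology:
            # The termination character closes the record.
            chunks.append(line[:-1])
            records.append({
                "comments": comments,
                "name": name,
                "sequence": "".join(chunks).replace(" ", ""),
                "topology": topology[line[-1]],
            })
            comments, name, chunks = [], None, []
        else:
            chunks.append(line)
    return records


# Two toy records (hypothetical data, for illustration only).
example = """; toy comment
; another comment
TOY0001
ACAAGATGCC
ATTG1
; second record
TOY0002
GGCC2
"""
ig_records = parse_ig(example)
print(ig_records)
```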

Clustal format files contain the word CLUSTAL at the beginning. Sequences can be interleaved (as in the example below) or sequential. Note: the multiple sequence alignment program ClustalW (and ClustalX) produces Clustal format files by default, but you can specify under "output format options" if you want your results in a different format.

An example Clustal file:

CLUSTAL W (1.74) multiple sequence alignment

seq1  -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
seq2  ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
seq3  ----KRIYKKIFKEIHSGLSTKNGVKDRYQNDGDNYFQLREDWWTANRSTVWKALTCSD
seq4  ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
seq5  --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
seq6  ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
seq7  -------------------------------------------------KELWEALTCSR
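For interleaved Clustal files, each block repeats the sequence names, so a reader only has to concatenate the chunks per name. A minimal Python sketch follows; the function name `parse_clustal` and the toy alignment are assumptions, and lines that start with whitespace (such as conservation lines) are skipped.

```python
def parse_clustal(text: str) -> dict:
    """Map each sequence name to its full aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("file does not start with the word CLUSTAL")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and lines that start with whitespace.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Append this block's chunk to what was collected so far.
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


# Toy interleaved alignment (hypothetical data, for illustration only).
example = "\n".join([
    "CLUSTAL W (1.74) multiple sequence alignment",
    "",
    "seq1  --KSKERYKD",
    "seq2  YEGLTTANGX",
    "         *  .   ",
    "",
    "seq1  ENGG",
    "seq2  KEYY",
])
clustal_alignment = parse_clustal(example)
print(clustal_alignment)
```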

In PHYLIP format, the first line of the input file contains the number of species and the number of characters, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species. An example PHYLIP format file:
7 123
seq1      ---------- ---------- ---KSKERYK DENGGNYFQL
seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE
seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH
seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA
seq5      ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS
seq6      ------FSKN IX-QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH
seq7      ---------- ---------- ---------- ---------- ---------K

          TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
          TVWKAITCGA P-GDASYFHA
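The header line and the ten-character name field described above can be read with a few lines of Python. The sketch below assumes an interleaved layout in which continuation blocks carry no names; the function name `parse_phylip_interleaved` and the toy alignment are assumptions for illustration.

```python
def parse_phylip_interleaved(text: str) -> dict:
    """Parse interleaved PHYLIP text into {name: sequence}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # First line: number of species and number of characters.
    n_species, n_chars = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for line in lines[1:1 + n_species]:
        names.append(line[:10].strip())          # ten-character name field
        seqs.append(line[10:].replace(" ", ""))
    # Remaining lines continue the sequences in the same species order.
    for k, line in enumerate(lines[1 + n_species:]):
        seqs[k % n_species] += line.replace(" ", "")
    for name, seq in zip(names, seqs):
        if len(seq) != n_chars:
            raise ValueError(
                f"{name}: expected {n_chars} characters, got {len(seq)}"
            )
    return dict(zip(names, seqs))


# Toy alignment (hypothetical data); names are padded to ten characters.
example = "\n".join([
    "2 14",
    "seq1".ljust(10) + "--KSKERYKD",
    "seq2".ljust(10) + "YEGLTTANGX",
    "",
    " " * 10 + "ENGG",
    " " * 10 + "KEYY",
])
phylip_alignment = parse_phylip_interleaved(example)
print(phylip_alignment)
```

The length check against the header catches truncated or damaged files early.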

MSF

MSF-formatted multiple sequence files are most often created when using programs of the GCG suite. MSF files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file. You can specify a single sequence or many sequences within an MSF file.

Some of the hallmarks of an MSF-formatted file are the same as those of a single-sequence GCG format file:

File type line (optional): the file begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. Do not edit or delete the file type line if it is present.

Dividing line (required): a line that contains the number of bases or residues in the sequence, the date the file was created and, importantly, two dots (..) which act as a divider between the descriptive information and the following sequence information.

MSF files contain some other information as well:

Name/Weight (required): the name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable).

Separating line (required): must include two slashes (//) to divide the name/weight information from the sequence alignment.

Multiple sequence alignment: each sequence named in the above Name/Weight lines is included. The alignment allows you to view the relationship among sequences.

!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list

 Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 6430

                   GapWeight: 8
             GapLengthWeight: 2

 hsp70.msf  MSF: 743  Type: P  October 6, 1998 18:23  Check: 7784 ..

 Name: S11448  Len: 743  Check: 3635  Weight: 1.00
 Name: S06443  Len: 743  Check: 5861  Weight: 1.00
 Name: S29261  Len: 743  Check: 7748  Weight: 1.00

//

        1                                                   50
S11448  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S06443  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S29261  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT
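The MSF layout (Name/Weight lines, a "//" separating line, then the alignment) can be read roughly as follows in Python. This is a sketch under simplifying assumptions: the function name `parse_msf` and the toy file are hypothetical, checksums and weights are ignored, and gap characters such as "~" are kept as they appear.

```python
import re


def parse_msf(text: str) -> dict:
    """Map each name from the Name/Weight lines to its aligned sequence."""
    # The separating line '//' divides the header from the alignment.
    header, sep, body = text.partition("\n//")
    if not sep:
        raise ValueError("no '//' separating line found")
    names = re.findall(r"Name:\s*(\S+)", header)
    alignment = {name: "" for name in names}
    for line in body.splitlines():
        fields = line.split()
        # Coordinate lines such as '1 50' do not start with a known name.
        if fields and fields[0] in alignment:
            alignment[fields[0]] += "".join(fields[1:])
    return alignment


# Toy MSF file (hypothetical data; checksum values are placeholders).
example = "\n".join([
    "!!AA_MULTIPLE_ALIGNMENT 1.0",
    " toy.msf  MSF: 14  Type: P  Check: 0 ..",
    "",
    " Name: seqA  Len: 14  Check: 0  Weight: 1.00",
    " Name: seqB  Len: 14  Check: 0  Weight: 1.00",
    "",
    "//",
    "",
    "      1        14",
    "seqA  ~~MTFDGAIG IDLG",
    "seqB  ~~~~MGKIIG IDLG",
])
msf_alignment = parse_msf(example)
print(msf_alignment)
```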

A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C
W = A T
B = G T C
D = G A T
H = A C T
V = G C A
N = A G C T (any)
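One practical use of this code table is searching for motifs that contain ambiguity codes. The Python sketch below turns such a motif into a regular expression by expanding each code into the bases listed above; the names `IUPAC` and `iupac_to_regex` and the motif "GAYN" are assumptions chosen for the example.

```python
import re

# Nucleotide codes and the bases each one stands for, as listed above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def iupac_to_regex(pattern: str) -> re.Pattern:
    """Compile a motif with ambiguity codes into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]  # raises KeyError for an unknown code
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


# Hypothetical motif: G, A, any pyrimidine, then any base.
motif = iupac_to_regex("GAYN")
print(motif.pattern)
print(bool(motif.fullmatch("GATA")), bool(motif.fullmatch("GAGA")))
```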
