Sequence File Formats
The specification for arranging and ordering data in a file is called a file format. Bioinformatics data are stored in databases in specific file formats, and many such formats are in use.
FASTA
FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985. FASTA is pronounced "fast A". In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences.
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier.
Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence.
Example
>gi|5524211|gb|AAD44166.1| cytochrome b
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMAT
AFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNL
VEWIWGGFSVDKATLNRFFAFHFILPFTMVALAG
VHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDF
LGLLILILLLLLLALLSPDMLGDPDNHMPADPLN
TPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSI
VILGLMPFLHTSKHRSMMLRPLSQALFWTLTMD
LLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPI
AGX
IENY
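The FASTA rules above (a ">" description line, identifier up to the first space, blank lines and gap symbols ignored) can be sketched as a small reader. This is a minimal illustration, not a reference implementation; the function names `split_header` and `parse_fasta` and the shortened sample record are assumptions made for the example.

```python
def split_header(header):
    """Split a description line (without '>') into identifier and description."""
    ident, _, desc = header.partition(" ")
    return ident, desc.strip()


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if header is not None:
                records.append(split_header(header) + ("".join(chunks),))
            header, chunks = line[1:], []
        elif header is not None:
            # drop spaces and gap symbols (dashes, underscores, periods)
            chunks.append("".join(c for c in line if c not in " -_."))
    if header is not None:
        records.append(split_header(header) + ("".join(chunks),))
    return records


# Shortened, hypothetical sample based on the cytochrome b header above.
sample = """>gi|5524211|gb|AAD44166.1| cytochrome b
LCLYTHIGRN IYYGSY

LYSETW-NTG
"""
records = parse_fasta(sample)
print(records[0])
```

Reading the whole text into memory keeps the sketch short; for large files a line-by-line generator would be a more natural choice.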
GCG
A sequence file in GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dots (".."). This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package. The GCG format is not used much nowadays.
An example sequence in GCG format is:
ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
AB000263 Length: 368 Check: 4514 ..
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
     gacctgaa
The new GCG-RSF can contain several sequences in one file. This format should only be used if the file was created with the GCG package.
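For the single-sequence GCG layout described above, the key step is locating the line that ends with ".." and reading the identifier, length and checksum from it. The sketch below does only that; it does not recompute or verify the checksum. The function name `parse_gcg`, the record name `DEMO1` and the checksum value in the sample are hypothetical.

```python
import re


def parse_gcg(text):
    """Return (name, declared_length, checksum, sequence) from single-sequence GCG text."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no divider line ending with '..' found")
    divider = lines[i]
    name = divider.split()[0]
    length = int(re.search(r"Length:\s*(\d+)", divider).group(1))
    check = int(re.search(r"Check:\s*(\d+)", divider).group(1))
    # Keep letters only, which drops spaces and any position numbers.
    sequence = "".join(c for c in "".join(lines[i + 1:]) if c.isalpha())
    return name, length, check, sequence


# Hypothetical, shortened record; the checksum is a placeholder, not a real value.
sample = """ID   DEMO1 standard; DNA; 20 BP.
XX
DEMO1  Length: 20  Check: 1234  ..
acaagatgcc attgtccccc
"""
name, length, check, sequence = parse_gcg(sample)
print(name, length, check, len(sequence) == length)
```

Comparing the declared length with the number of letters actually read is a cheap sanity check for truncated files.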
Staden
The programs of the Staden suite of biological analysis software accept sequences in Staden format. A typical Staden format file is:
GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCATTA
CGACGTAGATGCTAGCTGACTCGATGCAGTACGTAGTAGCTGCTG
CTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACT
CGATGC
Staden formatted sequence files contain the sequence and nothing else.
EMBL
A sequence file in EMBL format can contain several sequences. One sequence entry starts with an identifier line ("ID"), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
     gacctgaa
//
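Because an EMBL file can hold several entries, a reader has to track three markers: "ID" opens an entry, "SQ" starts the sequence, and "//" closes it. The following is a minimal sketch under that description; `parse_embl` and the two `DEMO` entries are hypothetical names, and annotation lines are simply skipped.

```python
def parse_embl(text):
    """Return a list of (identifier, sequence) pairs, one per entry."""
    records = []
    ident, chunks, in_sequence = None, [], False
    for line in text.splitlines():
        if in_sequence:
            if line.startswith("//"):
                # end of entry: store it and reset the state
                records.append((ident, "".join(chunks)))
                ident, chunks, in_sequence = None, [], False
            else:
                # keep letters only (drops spaces and any position counts)
                chunks.append("".join(c for c in line if c.isalpha()))
        elif line.startswith("ID"):
            ident = line.split()[1].rstrip(";")
        elif line.startswith("SQ"):
            in_sequence = True
    return records


# Two shortened, hypothetical entries in one file.
sample = """ID   DEMO1 standard; RNA; PRI; 20 BP.
XX
SQ   Sequence 20 BP;
     acaagatgcc attgtccccc
//
ID   DEMO2 standard; RNA; PRI; 8 BP.
SQ   Sequence 8 BP;
     gacctgaa
//
"""
entries = parse_embl(sample)
print(entries)
```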
GenBank
A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
An example sequence in GenBank format is:
LOCUS       AB000263 368 bp mRNA linear PRI 05-FEB-1999
DEFINITION  Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION   AB000263
ORIGIN
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
     gacctgaa
//
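The LOCUS line carries the declared sequence length, so a GenBank reader can compare it with what it finds between ORIGIN and "//". The sketch below assumes the LOCUS fields appear in the order shown in the example (name, then length); `parse_genbank`, the dictionary keys and the `DEMO1` record are hypothetical.

```python
def parse_genbank(text):
    """Return one dict per record with locus name, declared length and sequence."""
    records, current, in_sequence = [], None, False
    for line in text.splitlines():
        if in_sequence:
            if line.startswith("//"):
                current["sequence"] = "".join(current["sequence"])
                records.append(current)
                current, in_sequence = None, False
            else:
                # letters only: removes spaces and leading position numbers
                current["sequence"].append("".join(c for c in line if c.isalpha()))
        elif line.startswith("LOCUS"):
            fields = line.split()
            current = {
                "locus": fields[1],
                "declared_length": int(fields[2]),
                "sequence": [],
            }
        elif line.startswith("ORIGIN") and current is not None:
            in_sequence = True
    return records


# Shortened, hypothetical record.
sample = """LOCUS       DEMO1 20 bp mRNA linear PRI 05-FEB-1999
DEFINITION  Hypothetical demo record.
ORIGIN
        1 acaagatgcc attgtccccc
//
"""
gb_records = parse_genbank(sample)
for rec in gb_records:
    print(rec["locus"], len(rec["sequence"]) == rec["declared_length"])
```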
IG
A sequence file in IG format can contain several sequences, each consisting of a number of comment lines that must begin with a semicolon (";"), a line with the sequence name (which may not contain spaces) and the sequence itself, terminated with the character "1" for linear or "2" for circular sequences.
An example sequence in IG format is:
; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC
CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC
CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC
CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG
TTTAATTACAGACCTGAA1
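The IG layout (semicolon comments, a name line, then sequence ending in "1" or "2") maps onto a short state machine. This is a sketch under the description above only; `parse_ig`, the dictionary keys and the second sample record `PLASMID` are hypothetical.

```python
def parse_ig(text):
    """Return one dict per IG record: name, comments, sequence, topology."""
    records = []
    comments, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if name is None:
            if line.startswith(";"):
                comments.append(line[1:].strip())
            else:
                name = line  # first non-comment line is the sequence name
            continue
        if line[-1] in "12":
            # terminator reached: '1' means linear, '2' means circular
            chunks.append(line[:-1])
            records.append({
                "name": name,
                "comments": comments,
                "sequence": "".join(chunks),
                "topology": "linear" if line[-1] == "1" else "circular",
            })
            comments, name, chunks = [], None, []
        else:
            chunks.append(line)
    return records


sample = """; comment
; comment
AB000263
ACAAGATGCC
ATTGT1
; second record
PLASMID
GGCC2
"""
ig_records = parse_ig(sample)
for rec in ig_records:
    print(rec["name"], rec["topology"], rec["sequence"])
```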
Clustal
Clustal format files contain the word "clustal" at the beginning. Sequences can be interleaved (as in the example below) or sequential. Note: the multiple sequence alignment programs ClustalW and ClustalX produce Clustal format files by default, but you can choose a different format for your results under "output format options".
PHYLIP
In PHYLIP format, the first line of the input file contains the number of species and the number of characters, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks) and continuing with the characters for that species. An example PHYLIP format file:
7 123
seq1      ---------- ---------- ---KSKERYK DENGGNYFQL
seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE
seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH
seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA
seq5      ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS
seq6      ------FSKN IX-QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH
seq7      ---------- ---------- ---------- ---------- ---------K

          TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
          TVWKAITCGA P-GDASYFHA
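The header line and the fixed ten-character name field are what make this layout machine-readable. The sketch below assumes an interleaved file in which only the first block carries names and row k of every later block continues species k; that convention, the function name `parse_phylip_interleaved` and the two-species sample are assumptions, not taken from the text above.

```python
def parse_phylip_interleaved(text):
    """Return (n_species, n_chars, {name: sequence}) for interleaved input."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_chars = map(int, lines[0].split())
    names, seqs = [], []
    for ln in lines[1:1 + n_species]:
        names.append(ln[:10].strip())          # ten-character name field
        seqs.append(ln[10:].replace(" ", ""))  # blanks between blocks are dropped
    # Later blocks carry no names; row k of each block continues species k.
    for k, ln in enumerate(lines[1 + n_species:]):
        seqs[k % n_species] += ln.replace(" ", "")
    return n_species, n_chars, dict(zip(names, seqs))


# Hypothetical two-species alignment; ljust(10) pads names to ten characters.
sample = "\n".join([
    "2 15",
    "seq1".ljust(10) + "ACGTACGTAC",
    "seq2".ljust(10) + "ACGTTCGTAC",
    "",
    "GGGTT",
    "GGCTT",
])
n_species, n_chars, alignment = parse_phylip_interleaved(sample)
print(n_species, n_chars, alignment)
```

Checking that every sequence has exactly `n_chars` characters after parsing is a simple way to catch a damaged or truncated file.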
MSF
MSF formatted multiple sequence files are most often created when using programs of the GCG suite. MSF files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file. You can specify a single sequence or many sequences within an MSF file.
Some of the hallmarks of an MSF formatted sequence are the same as those of a single-sequence GCG format file:
File type (optional). The file begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. Do not edit or delete the file type if it is present.
Dividing line (required). This line contains the number of bases or residues in the sequence, when the file was created and, importantly, two dots (..) that act as a divider between the descriptive information and the following sequence information.
MSF files contain some other information as well:
Name/Weight (required). The name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable).
Separating line (required). Must include two slashes (//) to divide the name/weight information from the sequence alignment.
Multiple sequence alignment. Each sequence named in the Name/Weight lines above is included. The alignment allows you to view the relationship among sequences.
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 8
GapLengthWeight: 2
hsp70.msf MSF: 743 Type: P October 6, 1998 18:23 Check: 7784 ..
Name: S11448 Len: 743 Check: 3635 Weight: 1.00
Name: S06443 Len: 743 Check: 5861 Weight: 1.00
Name: S29261 Len: 743 Check: 7748 Weight: 1.00
//
        1                                                   50
S11448  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S06443  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S29261  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT
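The parts of an MSF file listed above (Name/Weight lines, the "//" separator, then the alignment rows) can be read with a short routine. This is a sketch that assumes the Name/Len/Check/Weight fields appear in the order shown in the example; `parse_msf`, `NAME_LINE` and the dictionary keys are hypothetical, and the sample is a shortened version of the hsp70 example.

```python
import re

NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def parse_msf(text):
    """Return {name: {"len", "check", "weight", "aligned"}} from MSF text."""
    lines = text.splitlines()
    # The "//" line separates the Name/Weight block from the alignment.
    split_at = next(i for i, ln in enumerate(lines) if ln.strip() == "//")
    entries = {}
    for ln in lines[:split_at]:
        m = NAME_LINE.search(ln)
        if m:
            entries[m.group(1)] = {
                "len": int(m.group(2)),
                "check": int(m.group(3)),
                "weight": float(m.group(4)),
                "aligned": "",
            }
    for ln in lines[split_at + 1:]:
        parts = ln.split()
        # Rows start with a known sequence name; position rulers are skipped.
        if parts and parts[0] in entries:
            entries[parts[0]]["aligned"] += "".join(parts[1:])
    return entries


# Shortened sample: only two sequences and the first 20 alignment columns.
sample = "\n".join([
    "!!AA_MULTIPLE_ALIGNMENT 1.0",
    "hsp70.msf MSF: 743 Type: P October 6, 1998 18:23 Check: 7784 ..",
    "Name: S11448 Len: 743 Check: 3635 Weight: 1.00",
    "Name: S29261 Len: 743 Check: 7748 Weight: 1.00",
    "//",
    "        1                   20",
    "S11448  ~~~~~~MTFD GAIGIDLGTT",
    "S29261  ~~~~~~~~MG KIIGIDLGTT",
])
msf = parse_msf(sample)
for seq_name, info in msf.items():
    print(seq_name, info["check"], info["aligned"])
```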
Nucleic acid codes
A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C
W = A T
B = G T C
D = G A T
H = A C T
V = G C A
N = A G C T (any)
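One practical use of the code table is testing whether a concrete sequence is compatible with a pattern written in ambiguity codes. The sketch below turns each code into a regular-expression character class; the names `IUPAC`, `to_regex` and `matches` are hypothetical, and the mapping is copied from the table above.

```python
import re

# Each code maps to the set of bases it can stand for (from the table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "AGCT",
}


def to_regex(pattern):
    """Translate a pattern of codes into a regular expression string."""
    return "".join("[" + IUPAC[code] + "]" for code in pattern.upper())


def matches(pattern, sequence):
    """True if every position of sequence is allowed by the pattern."""
    return re.fullmatch(to_regex(pattern), sequence.upper()) is not None


print(to_regex("RYN"))
print(matches("RYN", "ATG"), matches("RYN", "CTG"))
```

Here "RYN" accepts "ATG" (A is a purine, T is a pyrimidine, G is any base) but rejects "CTG", because C is not a purine.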