
Sequence File Formats

The document discusses several common file formats used in bioinformatics to store biological sequence data, including FASTA, Multi-FASTA, GCG, EMBL, GenBank, and PHYLIP. It gives an example sequence in each format and describes the key features and conventions of each.

Uploaded by

Pragya Mukherjee
Copyright
© Attribution Non-Commercial (BY-NC)

File Formats

Different programs require that information be supplied to them in a formal manner, using particular keywords and a particular ordering; such a specification is called a file format. Bioinformatics data are stored in databases in specific file formats, and many such formats are in use.

FASTA

FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985. The name is pronounced "fast A". In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier.

Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence.

>gi|5524211|gb|AAD44166.1| cytochrome b
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMAT
AFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNL
VEWIWGGFSVDKATLNRFFAFHFILPFTMVALAG
VHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDF
LGLLILILLLLLLALLSPDMLGDPDNHMPADPLN
TPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSI
VILGLMPFLHTSKHRSMMLRPLSQALFWTLTMD
LLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPI
AGX
IENY

Multi-FASTA format consists of multiple FASTA format sequences.


>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCA
GCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCG
GCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGC
CGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCT
TTCCAGACGCGGGAT
CTCCCCTCCCC
>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTAT
CGTTAATGGGAACTTC
AGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCA
ATCTGCACAGAGCCAG
TCTGCACACC
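The FASTA and Multi-FASTA conventions described above (a ">" header line, an identifier followed by an optional description, ignored blank lines and spaces) can be sketched in Python roughly as follows. The function names `parse_fasta` and `_make_record` and the toy records are illustrative assumptions, not part of any particular library, and this sketch only strips whitespace from sequence lines.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each record."""
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Remove internal spaces so wrapped lines join into one sequence.
            chunks.append("".join(line.split()))
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The first word after ">" is the identifier; the rest is the description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


# Toy input (hypothetical records, shortened for illustration).
example = """>sequence1
ACTCCCCGTGCG
CGCCCGG

>sequence2 toy description
CAGTCC GGCAGC
"""

records = list(parse_fasta(example))
for identifier, description, sequence in records:
    print(identifier, repr(description), len(sequence))
```

Because each ">" line starts a new record, the same function handles a single-sequence FASTA file and a Multi-FASTA file without change.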

Example

A sequence file in GCG format contains exactly one sequence. It begins with annotation lines, and the start of the sequence is marked by a line ending with two dots (".."). This line also contains the sequence identifier, the sequence length and a checksum. This format should only be used if the file was created with the GCG package. The GCG format is not used much nowadays.

An example sequence in GCG format is:

ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
AB000263 Length: 368 Check: 4514 ..
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
gacctgaa
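As a rough illustration of the layout described above, the following Python sketch splits a GCG file at the line ending in ".." and reads the identifier, length and checksum from that line. The function name `parse_gcg`, the toy record and its checksum value are assumptions made for this example; the checksum is only read, not recomputed.

```python
import re


def parse_gcg(text: str) -> dict:
    """Split a single-sequence GCG file at the line that ends with '..'."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no divider line ending in '..' was found")

    divider = lines[index]
    length = re.search(r"Length:\s*(\d+)", divider)
    check = re.search(r"Check:\s*(\d+)", divider)
    body = " ".join(lines[index + 1:])
    return {
        "annotation": lines[:index],
        "name": divider.split()[0],
        "length": int(length.group(1)) if length else None,
        "check": int(check.group(1)) if check else None,
        # Keep letters only; spaces and any position numbers are dropped.
        "sequence": re.sub(r"[^A-Za-z]", "", body),
    }


# Toy record: the name and the checksum value are made up for illustration.
gcg_example = """DE toy record, not a real database entry
TOY0001 Length: 20 Check: 1234 ..
acaagatgcc attgtccccc
"""

parsed = parse_gcg(gcg_example)
print(parsed["name"], parsed["length"], parsed["sequence"])
```

Comparing the declared length with `len(parsed["sequence"])` is a cheap consistency check on a file that may have been edited by hand.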

The new GCG-RSF can contain several sequences in one file. This format should only be used if the file was created with the GCG package.

The programs of the Staden suite of biological analysis software accept sequences in Staden format. A typical Staden format file is:

GGTACGTAGTAGCTGCTGCTACGTGCGCTAGCT
AGTACGTCATTA
CGACGTAGATGCTAGCTGACTCGATGCAGTACG
TAGTAGCTGCTG
CTACGTGCGCTAGCTAGTACGTCACGACGTAGA
TGCTAGCTGACT
CGATGC

Staden formatted sequence files contain the sequence and nothing else.

A sequence file in EMBL format can contain several sequences. One sequence entry starts with an identifier line ("ID"), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").

An example sequence in EMBL format is:

ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
gacctgaa
//
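The entry structure described above (an "ID" line, further two-letter annotation lines, an "SQ" line, then sequence data up to "//") might be read with a small Python sketch like the one below. The function name `parse_embl` and the toy entry are assumptions for illustration; real EMBL entries carry many more line types than this handles specially.

```python
def parse_embl(text: str) -> list:
    """Return one dict per entry; each entry ends at a '//' line."""
    entries, fields, seq_parts, in_sequence = [], {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            if fields or seq_parts:
                fields["sequence"] = "".join(seq_parts)
                entries.append(fields)
            fields, seq_parts, in_sequence = {}, [], False
        elif in_sequence:
            # Keep letters only, dropping spaces and any position numbers.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("SQ"):
            in_sequence = True
        elif line[:2].strip() and line[:2] != "XX":
            # Two-letter line code, then the annotation text ("XX" is a spacer).
            fields.setdefault(line[:2], []).append(line[2:].strip())
    return entries


# Toy entry (hypothetical identifier, shortened sequence).
embl_example = """ID   TOY0001 standard; RNA; PRI; 20 BP.
XX
AC   TOY0001;
XX
DE   toy record, not a real database entry
XX
SQ   Sequence 20 BP;
     acaagatgcc attgtccccc        20
//
"""

entries = parse_embl(embl_example)
print(entries[0]["ID"], entries[0]["sequence"])
```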

A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").

An example sequence in GenBank format is:

LOCUS AB000263 368 bp mRNA linear PRI 05-FEB-1999
DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION AB000263
ORIGIN
acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
gacctgaa
//
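Since a GenBank file can hold several records, a reader has to track where each one starts ("LOCUS"), where its sequence starts ("ORIGIN") and where it ends ("//"). The Python sketch below shows one way this could be done; `parse_genbank` and the two toy records are hypothetical, and annotation lines other than LOCUS are simply skipped.

```python
def parse_genbank(text: str) -> list:
    """Return a list of {'locus': ..., 'sequence': ...} dicts."""
    records, current, in_origin = [], None, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            parts = line.split()
            current = {"locus": parts[1] if len(parts) > 1 else "", "sequence": ""}
            in_origin = False
        elif current is None:
            continue  # ignore anything before the first LOCUS line
        elif line.startswith("//"):
            records.append(current)
            current, in_origin = None, False
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif in_origin:
            # Keep letters only, dropping spaces and leading position numbers.
            current["sequence"] += "".join(ch for ch in line if ch.isalpha())
    return records


# Two toy records (hypothetical names and dates).
genbank_example = """LOCUS       TOY0001    20 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  toy record, not a real database entry.
ORIGIN
        1 acaagatgcc attgtccccc
//
LOCUS       TOY0002    10 bp    mRNA    linear   PRI 01-JAN-2000
ORIGIN
        1 ggcctcctgc
//
"""

genbank_records = parse_genbank(genbank_example)
for record in genbank_records:
    print(record["locus"], len(record["sequence"]))
```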

A sequence file in IG format can contain several sequences, each consisting of a number of comment lines that must begin with a semicolon (";"), a line with the sequence name (it may not contain spaces!) and the sequence itself, terminated with the termination character '1' for linear or '2' for circular sequences.

An example sequence in IG format is:

; comment
; comment
AB000263
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCC
GGGGCCACGGCCACCGCTGCCCTGCC
CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAG
GAAGCGGCAGGAATAAGGAAAAGCAGC
CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGC
CAGTGCCGGGCCCCTCATAGGAGAGG
AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCC
AGCAATCCGCGCGCCGGGACAGAATGCC
CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAA
ACCTCACCCATGAATGCTCACGCAAG
TTTAATTACAGACCTGAA1
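The IG rules above (semicolon comments, a name line, then sequence data ending in '1' or '2') translate into a small state machine. The following Python sketch is one possible reading of those rules; the name `parse_ig` and the toy records are assumptions for illustration.

```python
def parse_ig(text: str) -> list:
    """Return (name, comments, sequence, topology) tuples from IG text."""
    records, comments, name, parts = [], [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if name is None:
            # Before the name line: collect comments, then take the name.
            if line.startswith(";"):
                comments.append(line[1:].strip())
            else:
                name = line
            continue
        # After the name line: sequence data until a '1' or '2' terminator.
        terminator = line[-1] if line[-1] in "12" else None
        data = line[:-1] if terminator else line
        parts.append("".join(data.split()))
        if terminator:
            topology = "linear" if terminator == "1" else "circular"
            records.append((name, comments, "".join(parts), topology))
            comments, name, parts = [], None, []
    return records


# Two toy records (hypothetical names, shortened sequences).
ig_example = """; comment
; comment
TOY0001
ACAAGATGCC
ATTGTCCCCC1
; second toy record
TOY0002
GGCCTCCTGC2
"""

ig_records = parse_ig(ig_example)
for record in ig_records:
    print(record[0], record[3], len(record[2]))
```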

Clustal format files contain the word CLUSTAL at the beginning. Sequences can be interleaved (as in the example below) or sequential. Note: the multiple sequence alignment programs ClustalW and ClustalX produce Clustal format files by default, but you can specify a different format under "output format options" if you want your results in another format.

An example Clustal file:

CLUSTAL W (1.74) multiple sequence alignment


seq1    -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
seq2    ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
seq3    ----KRIYKKIFKEIHSGLSTKNGVKDRYQN-DGDNYFQLREDWWTANRSTVWKALTCSD
seq4    ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
seq5    --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
seq6    ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
seq7    -------------------------------------------------KELWEALTCSR
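In an interleaved Clustal file each sequence name appears once per block, so a reader has to concatenate the pieces that share a name. The Python sketch below illustrates this under simple assumptions: the first line starts with "CLUSTAL", each alignment line is a name followed by a segment, and lines that begin with whitespace (such as conservation lines) are skipped. The name `parse_clustal` and the toy alignment are hypothetical.

```python
def parse_clustal(text: str) -> dict:
    """Return {name: aligned sequence} from interleaved Clustal text."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("missing CLUSTAL header line")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and lines that start with whitespace.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]
        # Append this block's segment to whatever was read for the name so far.
        alignment[name] = alignment.get(name, "") + segment
    return alignment


# Toy alignment in two interleaved blocks (made up for illustration).
clustal_example = """CLUSTAL W (1.74) multiple sequence alignment

seq1    --KSKERYKD
seq2    YEGLTTANGX
                 .

seq1    ENGG
seq2    KEYY
"""

clustal_alignment = parse_clustal(clustal_example)
for name, sequence in clustal_alignment.items():
    print(name, sequence)
```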

In PHYLIP format, the first line of the input file contains the number of species and the number of characters, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species. An example PHYLIP format file:

7 123
seq1      ---------- ---------- ---KSKERYK DENGGNYFQL REDWWDANRE
seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE
seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH
seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA
seq5      ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS
seq6      ------FSKN IX--QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH
seq7      ---------- ---------- ---------- ---------- ---------K

          TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
          TVWKAITCGA P-GDASYFHA
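The two rules stated above, a header line with the counts and a name field of exactly ten characters, are easiest to see when writing a file. The Python sketch below writes a small alignment in sequential PHYLIP layout (each sequence on one line, in blocks of ten characters). The function name `write_phylip`, the `block` parameter and the toy alignment are assumptions for illustration.

```python
def write_phylip(alignment: dict, block: int = 10) -> str:
    """Return sequential PHYLIP text for {name: aligned sequence}."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_chars = lengths.pop()
    # Header line: number of species, then number of characters.
    lines = [f"{len(alignment)} {n_chars}"]
    for name, seq in alignment.items():
        # Names occupy exactly ten characters: pad short ones, cut long ones.
        label = name[:10].ljust(10)
        blocks = [seq[i:i + block] for i in range(0, n_chars, block)]
        lines.append(label + " ".join(blocks))
    return "\n".join(lines) + "\n"


# Toy alignment of two 15-character rows (made up for illustration).
toy_alignment = {
    "seq1": "---KSKERYKDENGG",
    "seq2": "YEGLTTANGXKEYYQ",
}

phylip_text = write_phylip(toy_alignment)
print(phylip_text, end="")
```

Truncating names to ten characters can make two names collide, so a fuller implementation would probably check that the shortened labels are still unique.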

MSF

MSF-formatted multiple sequence files are most often created when using programs of the GCG suite. MSF files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file. You can specify a single sequence or many sequences within an MSF file.

Some of the hallmarks of an MSF-formatted file are the same as those of a single-sequence GCG format file:

File type line (optional). The file begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. Do not edit or delete the file type line if it is present.

Dividing line (required). This line contains the number of bases or residues in the sequence, when the file was created and, importantly, two dots (..), which act as a divider between the descriptive information and the following sequence information.

MSF files contain some other information as well:

Name/Weight (required). The name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable).

Separating line (required). Must include two slashes (//) to divide the name/weight information from the sequence alignment.

Multiple sequence alignment. Each sequence named in the above Name/Weight lines is included. The alignment allows you to view the relationship among sequences.

!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 8
GapLengthWeight: 2
hsp70.msf MSF: 743 Type: P October 6, 1998 18:23 Check: 7784 ..
Name: S11448 Len: 743 Check: 3635 Weight: 1.00
Name: S06443 Len: 743 Check: 5861 Weight: 1.00
Name: S29261 Len: 743 Check: 7748 Weight: 1.00
//
1 50
S11448 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S06443 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S29261 ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT
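The three-part layout of an MSF file (descriptive header up to "..", Name/Weight lines up to "//", then the alignment) can be sketched in Python as below. The function name `parse_msf`, the toy file and its checksum values are assumptions for illustration; checksums are read but not verified.

```python
import re

NAME_LINE = re.compile(
    r"Name:\s*(\S+).*?Len:\s*(\d+).*?Check:\s*(\d+).*?Weight:\s*([\d.]+)"
)


def parse_msf(text: str) -> dict:
    """Return {name: {'len', 'check', 'weight', 'sequence'}} from MSF text."""
    lines = text.splitlines()
    divider = next(
        (i for i, l in enumerate(lines) if l.rstrip().endswith("..")), None
    )
    separator = next((i for i, l in enumerate(lines) if l.strip() == "//"), None)
    if divider is None or separator is None or separator < divider:
        raise ValueError("expected a '..' divider line followed by a '//' line")

    entries = {}
    # Name/Weight lines sit between the '..' divider and the '//' separator.
    for line in lines[divider + 1:separator]:
        match = NAME_LINE.search(line)
        if match:
            entries[match.group(1)] = {
                "len": int(match.group(2)),
                "check": int(match.group(3)),
                "weight": float(match.group(4)),
                "sequence": "",
            }
    # Alignment lines start with a known name; coordinate lines are skipped.
    for line in lines[separator + 1:]:
        fields = line.split()
        if fields and fields[0] in entries:
            entries[fields[0]]["sequence"] += "".join(fields[1:])
    return entries


# Toy file: names and checksum values are made up for illustration.
msf_example = """!!AA_MULTIPLE_ALIGNMENT 1.0
toy.msf MSF: 20 Type: P Check: 1111 ..
 Name: seqA Len: 20 Check: 2222 Weight: 1.00
 Name: seqB Len: 20 Check: 3333 Weight: 1.00
//
        1                   20
seqA ~~~~~~MTFD GAIGIDLGTT
seqB ~~~~~~~~MG KIIGIDLGTT
"""

msf = parse_msf(msf_example)
for name, entry in msf.items():
    print(name, entry["len"], entry["sequence"])
```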

A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C
W = A T
B = G T C
D = G A T
H = A C T
V = G C A
N = A G C T (any)
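The code table above maps naturally onto a lookup dictionary. The short Python sketch below uses it to test whether a base is allowed by a code and to list every concrete sequence that an ambiguous pattern stands for. The names `IUPAC`, `matches` and `expand` are illustrative choices, not an existing API.

```python
from itertools import product

# Each code maps to the bases it can stand for, as in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(code: str, base: str) -> bool:
    """Return True if the concrete base is allowed by the ambiguity code."""
    return base.upper() in IUPAC[code.upper()]


def expand(pattern: str) -> list:
    """List every concrete sequence that an ambiguous pattern can represent."""
    choices = (IUPAC[code] for code in pattern.upper())
    return ["".join(combo) for combo in product(*choices)]


print(matches("R", "G"))   # R covers G and A
print(matches("Y", "A"))   # Y covers T and C only
print(expand("AYR"))
```

The number of expansions multiplies with each ambiguous position, so `expand` is only practical for short patterns.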
