The Variant Call Format and VCFtools

The document discusses the variant call format (VCF), a standardized format for storing DNA polymorphism data, developed primarily for the 1000 Genomes Project. VCF allows for the efficient storage and retrieval of variant information across multiple samples, and is accompanied by VCFtools, a software suite for processing VCF files. The document outlines the structure of VCF files, including mandatory fields, types of variations, and the capabilities of VCFtools for data manipulation and analysis.

Vol. 27 no. 15 2011, pages 2156–2158


BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btr330

Sequence analysis Advance Access publication June 7, 2011

The variant call format and VCFtools


Petr Danecek1,† , Adam Auton2,† , Goncalo Abecasis3 , Cornelis A. Albers1 , Eric Banks4 ,
Mark A. DePristo4 , Robert E. Handsaker4 , Gerton Lunter2 , Gabor T. Marth5 ,
Stephen T. Sherry6 , Gilean McVean2,7 , Richard Durbin1,∗ and 1000 Genomes Project
Analysis Group‡
1 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, 2 Wellcome Trust Centre
for Human Genetics, University of Oxford, Oxford OX3 7BN, UK, 3 Center for Statistical Genetics, Department of
Biostatistics, University of Michigan, Ann Arbor, MI 48109, 4 Program in Medical and Population Genetics, Broad
Institute of MIT and Harvard, Cambridge, MA 02141, 5 Department of Biology, Boston College, MA 02467, 6 National
Institutes of Health National Center for Biotechnology Information, MD 20894, USA and 7 Department of Statistics,
University of Oxford, Oxford OX1 3TG, UK
Associate Editor: John Quackenbush

ABSTRACT
Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Availability: http://vcftools.sourceforge.net
Contact: [email protected]

Received on October 28, 2010; revised on May 4, 2011; accepted on May 28, 2011

∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
‡ http://www.1000genomes.org

1 INTRODUCTION
One of the main uses of next-generation sequencing is to discover variation among large populations of related samples. Recently, a format for storing next-generation read alignments has been standardized by the SAM/BAM file format specification (Li et al., 2009). This has significantly improved the interoperability of next-generation tools for alignment, visualization and variant calling. We propose the variant call format (VCF) as a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed with the primary intention to represent human genetic variation, but its use is not restricted to diploid genomes and can be used in different contexts as well. Its flexibility and user extensibility allows representation of a wide variety of genomic variation with respect to a single reference sequence.

Although generic feature format (GFF) has recently been extended to standardize storage of variant information in genome variant format (GVF) (Reese et al., 2010), this is not tailored for storing information across many samples. We have designed the VCF format to be scalable so as to encompass millions of sites with genotype data and annotations from thousands of samples. We have adopted a textual encoding, with complementary indexing, to allow easy generation of the files while maintaining fast data access. In this article, we present an overview of the VCF and briefly introduce the companion VCFtools software package. A detailed format specification and the complete documentation of VCFtools are available at the VCFtools web site.

2 METHODS
2.1 The VCF
2.1.1 Overview of the VCF A VCF file (Fig. 1a) consists of a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with characters ‘##’, and a TAB delimited field definition line, starting with a single ‘#’ character. The meta-information header lines provide a standardized description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can be also used to provide information about the means of file creation, date of creation, version of the reference sequence, software used and any other information relevant to the history of the file. The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER) and a semicolon separated list of additional, user extensible annotation (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column is used to define the information contained within each subsequent genotype column, which consists of a colon separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Figure 1a indicates that the subsequent entries contain information regarding the genotype, genotype quality and read depth for each sample. All data lines are TAB delimited and the number of fields in each data line must match the number of fields in the header line. It is strongly recommended that all annotation tags used are declared in the VCF header section.
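The file layout described in Section 2.1.1 can be illustrated with a minimal Python sketch. This is not the VCFtools implementation. The function name parse_vcf and the two-sample fragment in EXAMPLE_VCF are hypothetical and invented for illustration; the fragment is not the content of Figure 1a. The sketch assumes uncompressed text input and leaves all values as strings apart from POS.

```python
import io

# Hypothetical two-sample VCF fragment, invented for illustration only.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2\n"
    "1\t100\trs1\tA\tG,T\t50\tPASS\tDP=14;DB\tGT:GQ:DP\t0|1:48:8\t1/2:43:6\n"
)


def parse_vcf(handle):
    """Return (meta_lines, column_names, records) from an uncompressed VCF stream."""
    meta, columns, records = [], None, []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information line: keep the text after the '##' prefix.
            meta.append(line[2:])
        elif line.startswith("#"):
            # Field definition line: TAB delimited column names.
            columns = line[1:].split("\t")
        else:
            if columns is None:
                raise ValueError("data line found before the #CHROM line")
            fields = line.split("\t")
            if len(fields) != len(columns):
                raise ValueError("field count does not match the header line")
            rec = dict(zip(columns, fields))
            rec["POS"] = int(rec["POS"])        # 1-based position
            rec["ALT"] = rec["ALT"].split(",")  # comma separated alternate alleles
            info = {}
            if rec["INFO"] != ".":              # a dot marks a missing value
                for item in rec["INFO"].split(";"):
                    key, _, value = item.partition("=")
                    info[key] = value if value else True  # flags carry no value
            rec["INFO"] = info
            if "FORMAT" in rec:
                # FORMAT names the colon separated fields of every sample column.
                keys = rec["FORMAT"].split(":")
                for sample in columns[9:]:
                    rec[sample] = dict(zip(keys, rec[sample].split(":")))
            records.append(rec)
    return meta, columns, records


meta, columns, records = parse_vcf(io.StringIO(EXAMPLE_VCF))
print(columns[:8])
print(records[0]["INFO"], records[0]["SAMPLE1"])
```

Raising an error when the field count differs from the header line mirrors the requirement stated above. A real parser would also use the types declared in the meta-information lines to convert values.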

© The Author(s) 2011. Published by Oxford University Press.


This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Fig. 1. (a) Example of valid VCF. The header lines ##fileformat and #CHROM are mandatory, the rest is optional but strongly recommended. Each line of the
body describes variants present in the sampled population at one genomic position or region. All alternate alleles are listed in the ALT column and referenced
from the genotype fields as 1-based indexes to this list; the reference haplotype is designated as 0. For multiploid data, the separator indicates whether the
data are phased (|) or unphased (/). Thus, the two alleles C and G at the positions 2 and 5 in this figure occur on the same chromosome in SAMPLE1. The
first data line shows an example of a deletion (present in SAMPLE1) and a replacement of two bases by another base (SAMPLE2); the second line shows a
SNP and an insertion; the third a SNP; the fourth a large structural variant described by the annotation in the INFO column, where the coordinate is that of the base
before the variant. (b–f) Alignments and VCF representations of different sequence variants: SNP, insertion, deletion, replacement and a large deletion. The
REF column shows the reference bases replaced by the haplotype in the ALT column. The coordinate refers to the first reference base. (g) Users are advised
to use the simplest representation possible and the lowest coordinate in cases where the position is ambiguous.
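
The allele indexing described in the caption can be illustrated with a short Python sketch. The function name decode_genotype and the example alleles are hypothetical and chosen only for illustration; this is not code from VCFtools. Index 0 selects the REF allele, index 1 the first ALT allele and so on, a dot is treated as a missing call, and the separator reports whether the genotype is phased.

```python
import re


def decode_genotype(gt, ref, alts):
    """Translate a GT string such as '0|1' into allele sequences.

    Returns (alleles, phased). A missing call ('.') is reported as None.
    """
    lookup = [ref] + list(alts)   # index 0 = REF, 1.. = ALT in listed order
    phased = "|" in gt            # '|' phased, '/' unphased
    alleles = []
    for index in re.split(r"[|/]", gt):
        alleles.append(None if index == "." else lookup[int(index)])
    return alleles, phased


# Hypothetical site with REF = A and ALT = G,T.
print(decode_genotype("0|1", "A", ["G", "T"]))
print(decode_genotype("1/2", "A", ["G", "T"]))
print(decode_genotype("./.", "A", ["G", "T"]))
```

The number of decoded alleles reflects the ploidy of the call, so a haploid genotype such as "1" yields a single allele.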

2.1.2 Conventions and reserved keywords The VCF specification includes several common keywords with standardized meaning. The following list gives some examples of the reserved tags.

Genotype columns:
• GT, genotype, encodes alleles as numbers: 0 for the reference allele, 1 for the first allele listed in the ALT column, 2 for the second allele listed in ALT and so on. The number of alleles suggests the ploidy of the sample and the separator indicates whether the alleles are phased (‘|’) or unphased (‘/’) with respect to other data lines (Fig. 1).
• PS, phase set, indicates that the alleles of genotypes with the same PS value are listed in the same order.
• DP, read depth at this position.
• GL, genotype likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
• GQ, genotype quality, probability that the genotype call is wrong under the condition that the site is variant. Note that the QUAL column

gives an overall quality score for the assertion made in ALT that the site is variant or no variant.

INFO column:
• DB, dbSNP membership;
• H3, membership in HapMap3;
• VALIDATED, validated by follow-up experiment;
• AN, total number of alleles in called genotypes;
• AC, allele count in genotypes, for each ALT allele, in the same order as listed;
• SVTYPE, type of structural variant (DEL for deletion, DUP for duplication, INV for inversion, etc. as described in the specification);
• END, end position of the variant;
• IMPRECISE, indicates that the position of the variant is not known accurately; and
• CIPOS/CIEND, confidence interval around POS and END positions for imprecise variants.

Missing values are represented with a dot. For practical reasons, the VCF specification requires that the data lines appear in their chromosomal order. The full format specification is available at the VCFtools web site.

2.1.3 Variation types VCF is flexible and allows virtually any type of variation to be expressed by listing both the reference haplotype (the REF column) and the alternate haplotypes (the ALT column). This permits redundancy such that the same event can be expressed in multiple ways by including different numbers of reference bases or by combining two adjacent SNPs into one haplotype (Fig. 1g). Users are advised to follow recommended practice whenever possible: one reference base for SNPs and insertions, and one alternate base for deletions. The lowest possible coordinate should be used in cases where the position is ambiguous. When comparing or merging indel variants, the variant haplotypes should be reconstructed and reconciled, such as in the Figure 1g example, although the exact nature of the reconciliation can be arbitrary. For larger, more complex variants, quoting large sequences becomes impractical, and in these cases the annotations in the INFO column can be used to describe the variant (Fig. 1f). The full VCF specification also includes a set of recommended practices for describing complex variants.

2.1.4 Compression and indexing Given the large number of variant sites in the human genome and the number of individuals the 1000 Genomes Project aims to sequence (Durbin et al., 2010), VCF files are usually stored in a compact binary form, compressed by bgzip, a program which utilizes the zlib-compatible BGZF library (Li et al., 2009). Files compressed by bgzip can be decompressed by the standard gunzip and zcat utilities. Fast random access can be achieved by indexing genomic position using tabix, a generic indexer for TAB-delimited files. Both programs, bgzip and tabix, are part of the samtools software package and can be downloaded from the SAMtools web site (http://samtools.sourceforge.net).

2.2 VCFtools software package
VCFtools is an open-source software package for parsing, analyzing and manipulating VCF files. The software suite is broadly split into two modules. The first module provides a general Perl API, and allows various operations to be performed on VCF files, including format validation, merging, comparing, intersecting, making complements and basic overall statistics. The second module consists of a C++ executable primarily used to analyze SNP data in VCF format, allowing the user to estimate allele frequencies, levels of linkage disequilibrium and various quality control metrics. Further details of VCFtools can be found on the web site (http://vcftools.sourceforge.net/), where the reader can also find links to alternative tools for VCF generation and manipulation, such as the GATK toolkit (McKenna et al., 2010).

3 CONCLUSIONS
We describe a generic format for storing the most prevalent types of sequence variation. The format is highly flexible, and can be adapted to store a wide variety of information. It has already been adopted by a number of large-scale projects, and is supported by an increasing number of software tools.

Funding: Medical Research Council, UK; British Heart Foundation (grant RG/09/012/28096); Wellcome Trust (grants 090532/Z/09/Z and 075491/Z/04); National Human Genome Research Institute (grants 54 HG003067, R01 HG004719 and U01 HG005208); Intramural Research Program of the National Institutes of Health, the National Library of Medicine.

Conflict of Interest: none declared.

REFERENCES
Durbin,R.M. et al. (2010) A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073.
Li,H. et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079.
McKenna,A.H. et al. (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.
Reese,M.G. et al. (2010) A standard variation file format for human genome sequences. Genome Biol., 11, 20796305.
