Tutorial Genomics
Thibaut Jombart and Caitlin Collins
Imperial College London
MRC Centre for Outbreak Analysis and Modelling
Abstract
Genome-wide SNP data can quickly become challenging to analyse on a standard
computer. The package adegenet [1] for the R software [2] implements a representation of
these data with unprecedented efficiency using the classes SNPbin and genlight, which
can require up to 60 times less RAM than the usual representation using allele frequencies.
This vignette introduces these classes and illustrates how these objects can be handled
and analyzed in R.
Contents
1 Introduction
2 Classes of objects
2.1 SNPbin: storage of single genomes
2.2 genlight: storage of multiple genomes
1 Introduction
Modern sequencing technologies now make complete genomes more widely accessible. The
resulting amounts of genetic data pose challenges in terms of storing and handling the
data, making tools developed for classical genetic markers such as microsatellites
impractical on standard computers. adegenet provides new object classes
dedicated to handling genome-wide polymorphism (SNPs) with minimum random access
memory (RAM) requirements.
Two new formal classes have been implemented: SNPbin, used to store genome-wide
SNPs for one individual, and genlight, which stores the same information for multiple
individuals. Information represented this way is binary: only biallelic SNPs can be stored
and analyzed using these classes. However, these objects are otherwise very flexible, and
can incorporate different levels of ploidy across individuals within a single dataset. In this
vignette, we present these object classes and show how their content can be handled
and analyzed.
2 Classes of objects
2.1 SNPbin: storage of single genomes
The class SNPbin is the core representation of biallelic SNPs, and allows data to be stored
with unprecedented efficiency. The essential idea is to code binary SNPs not as integers,
but as bits. This operation is tricky in R, which has no native bit type and stores raw data as
bytes (series of 8 bits). However, the class SNPbin handles this transparently using subroutines
written in C. Considerable efforts have been made so that the user does not have to dig into the
complex internal structure of the objects, and can handle SNPbin objects as easily as possible.
Like genind and genpop objects, SNPbin is a formal "S4" class. The structure of these
objects is detailed in the dedicated manpage (?SNPbin). Like all S4 objects, instances of the
class SNPbin are composed of slots accessible using the @ operator. This content is generic
(it is the same for all instances of the class), and returned by:
library(adegenet)
getClassDef("SNPbin")
• n.loc: the number of SNPs stored in the object.
• NA.posi: position of the missing data (NAs).
• label: an optional label for the individual.
• ploidy: the ploidy level of the genome.
New objects are created using new, with these slots as arguments. If no argument is
provided, an empty object is created:
new("SNPbin")
In practice, only the snp information and possibly the ploidy have to be provided; various
formats are accepted for the snp component, but the simplest is a vector of integers (or
numerics) indicating the number of second alleles at each locus. The argument snp, if provided
alone, does not have to be named:
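The construction call itself is not reproduced in this copy. A minimal sketch, assuming the seven-SNP vector that as.integer(x) returns further down in this section, would be:

```r
library(adegenet)

# Assumed input: counts of the second allele at 7 loci (chosen to match the
# as.integer(x) output shown later in this section)
x <- new("SNPbin", c(0, 1, 1, 2, 0, 0, 1))

nLoc(x)     # number of SNPs stored
ploidy(x)   # ploidy guessed from the largest value in the input
```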
If not provided, the ploidy is detected from the data and determined as the largest number
in the input vector. Obviously, in many cases this will not be adequate, but ploidy can always
be rectified afterwards; for instance:
x
## /// SNPBIN OBJECT /////////
## 7 SNPs coded as bits, size: 1.4 Kb
## Ploidy: 2
## 0 (0 %) missing data
ploidy(x) <- 3
x
## /// SNPBIN OBJECT /////////
## 7 SNPs coded as bits, size: 1.4 Kb
## Ploidy: 3
## 0 (0 %) missing data
The internal coding of the objects is cryptic, and not meant to be accessed directly:
x@snp
## [[1]]
## [1] 08
##
## [[2]]
## [1] 4e
as.integer(x)
## [1] 0 1 1 2 0 0 1
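To give a feel for what coding SNPs as bits means, the sketch below packs the same seven genotypes into raw bytes using base R only. It assumes one binary layer per ploidy level (here "count >= 2" and "count >= 1") and the bit order used by packBits. This happens to reproduce the bytes 08 and 4e printed above, but it is only an illustration and not a description of the C routines used by adegenet. The helper name pack_snp is hypothetical.

```r
snp <- c(0L, 1L, 1L, 2L, 0L, 0L, 1L)

# Hypothetical helper: pack a 0/1 vector into bytes, padding with zeros
# up to a multiple of 8 as required by packBits(type = "raw")
pack_snp <- function(bits) {
  pad <- (8 - length(bits) %% 8) %% 8
  packBits(c(as.integer(bits), rep(0L, pad)), type = "raw")
}

layer2 <- pack_snp(snp >= 2)  # loci carrying two copies of the second allele
layer1 <- pack_snp(snp >= 1)  # loci carrying at least one copy

layer2
layer1

# Decoding: unpack both layers and add them up, then drop the padding
decoded <- as.integer(rawToBits(layer2)) + as.integer(rawToBits(layer1))
decoded[seq_along(snp)]
```

Seven diploid SNPs fit in two bytes here, whereas the integer vector needs four bytes per SNP.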
The main interest of this representation is its efficiency in terms of storage. For instance:
## 3.8 Mb
print(object.size(x),unit="auto")
## 123.4 Kb
Here, we converted a million SNPs into a SNPbin object, which turns out to be about 32 times
smaller than the original data. However, the information in dat and x is strictly identical:
identical(as.integer(x),dat)
## [1] TRUE
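A factor of about 32 is what one would expect when a 32-bit integer is replaced by a single bit. The following base R check of this arithmetic is independent of adegenet; the variable names are illustrative, and the simulated vector is only assumed to resemble the one used above.

```r
set.seed(1)
dat <- sample(0:1, 1e6, replace = TRUE)   # one million haploid SNPs, as integers

packed <- packBits(dat, type = "raw")     # 8 SNPs per byte

print(object.size(dat), units = "auto")
print(object.size(packed), units = "auto")

# Ratio of the two sizes
size.ratio <- as.numeric(object.size(dat)) / as.numeric(object.size(packed))
size.ratio

# No information is lost by packing
identical(as.integer(rawToBits(packed)), dat)
```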
The advantage of this storage is therefore that it is extremely compact, which makes it possible to
analyse big datasets using standard computers.
While SNPbin objects are the very means by which we store data efficiently, in practice
we need to analyze several genomes at a time. This is made possible by the class genlight,
which relies on SNPbin but allows for storing data from several genomes at a time.
2.2 genlight: storage of multiple genomes
Like SNPbin, genlight is a formal S4 class. The slots of instances of this class are described
by:
getClassDef("genlight")
As can be seen, these objects allow for storing more information in addition to vectors
of SNP frequencies. More precisely, their content is (see ?genlight for more details):
• gen: SNP data for different individuals, each stored as a SNPbin; loci have to be
identical across all individuals.
• loc.all: (optional) alleles of the loci separated by ’/’ (e.g. ’a/t’, ’g/c’, etc.).
• chromosome: (optional) a factor indicating the chromosome to which the SNPs belong.
Like SNPbin objects, genlight objects are created using the constructor new, providing content
for the slots above as arguments. When none is provided, an empty object is created:
new("genlight")
The most important information to provide is obviously the genotypes (argument gen);
these can be provided as:
• a list of integer vectors representing the number of second alleles at each locus.
• a matrix / data.frame of integers, with individuals in rows and SNPs in columns.
• a list of SNPbin objects.
Ploidy has to be consistent across loci for a given individual, but individuals do not
have to have the same ploidy, so that it is possible to have haploid, diploid, and tetraploid
individuals in the same dataset; for instance:
ploidy(x)
As for SNPbin, genlight objects can be converted back to integer vectors, stored as
matrices or lists:
as.list(x)
## $indiv1
## [1] 1 1 0 1 1 0
##
## $indiv2
## [1] 2 1 1 0 0 0
##
## $toto
## [1] 2 2 0 0 4 4
as.matrix(x)
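The code that builds this object is not reproduced in this copy. A sketch consistent with the as.list(x) output above, assuming the genotypes are passed as a named list of vectors, could be:

```r
library(adegenet)

# Genotypes taken from the as.list(x) output above
x <- new("genlight",
         list(indiv1 = c(1, 1, 0, 1, 1, 0),
              indiv2 = c(2, 1, 1, 0, 0, 0),
              toto   = c(2, 2, 0, 0, 4, 4)),
         parallel = FALSE)

ploidy(x)      # one value per individual, guessed from each vector's maximum
as.matrix(x)   # individuals in rows, SNPs in columns
```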
In practice, genlight objects can be handled as if they were matrices of integers as the
one above returned by as.matrix. However, they offer the advantage of efficient storage of
the information; for instance, we can simulate 50 individuals typed for 100,000 SNPs each
(including occasional NAs):
## 38.2 Mb
## @ind.names: 50 individual labels
## @other: a list containing: elements without names
object.size(dat)/object.size(x)
## 55.8 bytes
Here again, the storage of the data is much more efficient in genlight than using integers:
the converted data occupy 56 times less memory than the original data.
The advantage of this storage is therefore being extremely compact, and allowing to
analyse very large datasets using standard computers. Obviously, usual computations
demand data to be at one moment coded as numeric values (as opposed to bits). However,
most usual computations can be achieved by only converting one or two genomes back to
numeric values at a time, therefore keeping RAM requirements low, albeit at a possible cost
of increased computational time. This, however, is minimized in three ways:
3. handling smaller objects, thereby decreasing the possibly high computational time
taken by memory allocation.
While this makes implementing methods more complicated, in practice routines are
implemented so as to minimize the amount of data converted back to integers, use C code
where possible, and use multiple cores if the package parallel is installed and multiple cores
are available. Fortunately, these underlying technical issues are hidden from the user, and
one merely needs to know how to manipulate genlight objects using a few key functions to
be able to analyze data.
Available accessors are documented in ?genlight. Most of them are identical to accessors
for genind and genpop objects, such as:
• nLoc: returns the number of loci (SNPs).
## [[1]]
## [1] 1 2 1 1 2 0 1 1 0 0
##
## [[2]]
## [1] 1 2 2 2 1 2 0 1 2 2
##
## [[3]]
## [1] 1 2 2 2 1 2 0 1 2 0
indNames(x)
## NULL
## [1] "individual 1" "individual 2" "individual 3"
locNames(x)
locNames(x) <- paste("SNP",1:nLoc(x),sep=".")
as.matrix(x)
## SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7 SNP.8 SNP.9 SNP.10
## individual 1 1 2 1 1 2 0 1 1 0 0
## individual 2 1 2 2 2 1 2 0 1 2 2
## individual 3 1 2 2 2 1 2 0 1 2 0
Accessors are meant to be clever about replacement, meaning that they try hard to
prevent replacement with inconsistent values. For instance, in object x:
if we try to set information about the chromosomes of the SNPs, the instruction:
will generate an error because the provided factor does not match the number of loci (10),
while:
chr(x) <- rep("chr-1", 10)
x
chr(x)
## [1] chr-1 chr-1 chr-1 chr-1 chr-1 chr-1 chr-1 chr-1 chr-1 chr-1
## Levels: chr-1
is a valid replacement.
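The behaviour described above can be checked with a short sketch. The genotypes are those of the 3 x 10 matrix shown earlier; the length-7 vector is an arbitrary example of an inconsistent replacement, and wrapping the call in tryCatch is our own addition so that the script keeps running after the error.

```r
library(adegenet)

# Toy object with 3 individuals and 10 SNPs
x <- new("genlight",
         list(c(1, 2, 1, 1, 2, 0, 1, 1, 0, 0),
              c(1, 2, 2, 2, 1, 2, 0, 1, 2, 2),
              c(1, 2, 2, 2, 1, 2, 0, 1, 2, 0)),
         parallel = FALSE)

# A vector of the wrong length (7 instead of 10) is expected to be rejected
res <- tryCatch({ chr(x) <- rep("chr-1", 7); "accepted" },
                error = function(e) "rejected")
res

# A vector with one entry per locus is accepted
chr(x) <- rep("chr-1", 10)
levels(chr(x))
```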
## @chromosome: factor storing chromosomes of the SNPs
## @other: a list containing: elements without names
as.matrix(x)
## SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7 SNP.8 SNP.9 SNP.10
## individual 1 1 2 1 1 2 0 1 1 0 0
## individual 2 1 2 2 2 1 2 0 1 2 2
## individual 3 1 2 2 2 1 2 0 1 2 0
as.matrix(x[c(1,3),])
## SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7 SNP.8 SNP.9 SNP.10
## individual 1 1 2 1 1 2 0 1 1 0 0
## individual 3 1 2 2 2 1 2 0 1 2 0
as.matrix(x[, c(TRUE,FALSE)])
as.matrix(x[1:2, c(1,1,1,2,2,2,3,3,3)])
Moreover, one can split data into blocks of SNPs using seploc. This can be achieved by
specifying either a number of blocks (argument n.block) or the size of the blocks (argument
block.size). The function also allows for randomizing the distribution of the SNPs in
the blocks (argument random=TRUE), which is especially useful to replace computations that
cannot be achieved on the whole dataset with parallelized computations performed on random
blocks (for parallelization, remove the argument parallel=FALSE). For instance:
## // Optional content
## @ind.names: 3 individual labels
## @loc.names: 10 locus labels
## @chromosome: factor storing chromosomes of the SNPs
## @other: a list containing: elements without names
as.matrix(x)
## SNP.1 SNP.2 SNP.3 SNP.4 SNP.5 SNP.6 SNP.7 SNP.8 SNP.9 SNP.10
## individual 1 1 2 1 1 2 0 1 1 0 0
## individual 2 1 2 2 2 1 2 0 1 2 2
## individual 3 1 2 2 2 1 2 0 1 2 0
## $block.1
## /// GENLIGHT OBJECT /////////
##
## // 3 genotypes, 5 binary SNPs, size: 7.4 Kb
## 0 (0 %) missing data
##
## // Basic content
## @gen: list of 3 SNPbin
##
## // Optional content
## @ind.names: 3 individual labels
## @loc.names: 5 locus labels
## @chromosome: factor storing chromosomes of the SNPs
## @other: a list containing: elements without names
##
##
## $block.2
## /// GENLIGHT OBJECT /////////
##
## // 3 genotypes, 5 binary SNPs, size: 7.4 Kb
## 0 (0 %) missing data
##
## // Basic content
## @gen: list of 3 SNPbin
##
## // Optional content
## @ind.names: 3 individual labels
## @loc.names: 5 locus labels
## @chromosome: factor storing chromosomes of the SNPs
## @other: a list containing: elements without names
lapply(seploc(x, n.block=2, parallel=FALSE),as.matrix)
## $block.1
## SNP.1 SNP.2 SNP.3 SNP.4 SNP.5
## individual 1 1 2 1 1 2
## individual 2 1 2 2 2 1
## individual 3 1 2 2 2 1
##
## $block.2
## SNP.6 SNP.7 SNP.8 SNP.9 SNP.10
## individual 1 0 1 1 0 0
## individual 2 2 0 1 2 2
## individual 3 2 0 1 2 0
## $block.1
## SNP.10 SNP.9 SNP.1 SNP.8 SNP.7
## individual 1 0 0 1 1 1
## individual 2 2 2 1 1 0
## individual 3 0 2 1 1 0
##
## $block.2
## SNP.4 SNP.3 SNP.6 SNP.2 SNP.5
## individual 1 1 1 0 2 2
## individual 2 2 2 2 2 1
## individual 3 2 2 2 2 1
file.show(system.file("files/exampleSnpDat.snp",package="adegenet"))
Otherwise, this file is also accessible from the adegenet website (section ’Documents’). A
complete description of the .snp format is provided in the comment section of the file.
While this section can be left empty, these two lines have to be present for the format to be
valid. Each meta-information is stored using two lines, the first starting as:
>> name-of-the-information
and the second containing the information itself, each item separated by a single space. Any
label can be used, but some specific names will be recognized and interpreted by the parser:
• position: the following line contains integers giving the position of the SNPs on the
sequence
• allele: character strings representing the two alleles of each locus, separated by "/"
• ploidy: integers indicating the ploidy of each individual; alternatively, one single
integer if all individuals have the same ploidy
• chromosome: character strings indicating the chromosome on which the SNPs are located
> label-of-the-individual
and the second being integers corresponding to the number of second alleles at each locus,
without separators; missing data are coded as ’-’.
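As an illustration of this coding, the base R sketch below converts one genotype line into a vector of allele counts. The helper parse_genotype is hypothetical and is not the parser used by read.snp; it only assumes single-digit counts and ’-’ for missing data, as described above.

```r
# Hypothetical helper: turn one genotype line of a .snp file into integers
parse_genotype <- function(line) {
  chars <- strsplit(line, "", fixed = TRUE)[[1]]  # one character per locus
  chars[chars == "-"] <- NA                       # '-' marks missing data
  as.integer(chars)
}

parse_genotype("1020")
parse_genotype("10-0")
```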
.snp files can be read in R using read.snp, which converts data into genlight objects.
The function reads data in chunks of several individuals (minimum 1, no maximum besides
RAM constraints) at a time, which allows one to read massive datasets with negligible RAM
requirements (albeit at a cost in computational time). The argument chunkSize indicates
the number of genomes read at a time; larger values mean reading data faster but require
more RAM. We can illustrate read.snp using the example file mentioned above. The
non-comment part of the file reads:
[...]
>> position
1 8 11 43
>> allele
a/t g/c a/c t/a
>> population
Brit Brit Fren monster NA
>> ploidy
2
> foo
1020
> bar
0012
> toto
10-0
> Nyarlathotep
0120
> an even longer label but OK since on a single line
1100
We read the file in using:
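The call is not reproduced in this copy. A sketch, assuming the example file located with system.file earlier and an arbitrary chunk size of two genomes, would be:

```r
library(adegenet)

# Example file shipped with the package
path <- system.file("files/exampleSnpDat.snp", package = "adegenet")

# chunkSize = 2 is an arbitrary choice: two genomes are read at a time
obj <- read.snp(path, chunkSize = 2, parallel = FALSE)
```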
##
## Reading biallelic SNP data file into a genlight object...
##
##
## Reading comments...
##
## Reading general information...
##
## Reading 5 genotypes...
## ...
## Checking consistency...
##
## Building final object...
##
## ...done.
obj
as.matrix(obj, parallel=FALSE)
alleles(obj)
obj@pop
indNames(obj)
## [1] "foo"
## [2] "bar"
## [3] "toto"
## [4] "Nyarlathotep"
## [5] "an even longer label but OK since on a single line"
Note that system.file is generally not needed: it is only used in this example to access a
file installed alongside the package. Usual calls to read.snp simply take the path to the
.snp file as their first argument.
Like read.snp, read.PLINK has the advantage of reading data in chunks of a few
individuals (down to a single one at a time, no upper limit), which minimizes the amount of
memory needed to read information before its conversion to genlight; however, using more
chunks also means more computational time, since the procedure has to re-read the same file
several times. Note that meta-information about the loci (stored in a .map file) can also be read
alongside a .raw file using the argument map.file. Alternatively, such information can be
added to a genlight object afterwards using extract.PLINKmap.
The function fasta2genlight extracts SNPs from alignments in fasta format (file
extensions ’.fasta’, ’.fas’, or ’.fa’). Like read.snp and read.PLINK, fasta2genlight
processes data in chunks of individuals so as to reduce memory requirements. It first scans the
whole file for polymorphic positions, and then extracts all biallelic SNPs from the alignment.
fasta2genlight is illustrated like read.snp using a toy dataset distributed alongside
the package. The file is first located using system.file, and then processed using
fasta2genlight:
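The two calls are not reproduced in this copy. A sketch, assuming the toy alignment is the file files/usflu.fasta installed with the package and using an arbitrary chunk size:

```r
library(adegenet)

# Toy alignment shipped with the package (file name assumed)
myPath <- system.file("files/usflu.fasta", package = "adegenet")

# chunkSize = 10 is an arbitrary choice; parallel = FALSE keeps the run on one core
flu <- fasta2genlight(myPath, chunkSize = 10, parallel = FALSE)
```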
##
## Converting FASTA alignment into a genlight object...
##
##
## Looking for polymorphic positions...
## .....................................................................................
## Extracting SNPs from the alignment...
## .....................................................................................
## Building final object...
##
## ...done.
flu
flu is a genlight object containing SNPs of 80 isolates of seasonal influenza (H3N2) sampled
within the US over the last two decades; sequences correspond to the hemagglutinin (HA)
segment. Besides genotypes, flu contains the positions of the SNPs and the alleles at each
retained locus. Names of the loci are constructed as the combination of both:
head(position(flu), 20)
## [1] 7 12 31 32 36 37 44 45 52 60 62 72 73 78 96 99 105
## [18] 108 121 128
head(alleles(flu), 20)
## [1] "a/g" "c/t" "t/c" "t/c" "t/c" "c/a" "t/c" "c/t" "a/g" "c/t" "g/t"
## [12] "c/a" "a/g" "a/g" "a/g" "c/t" "a/g" "g/a" "c/a" "a/g"
head(locNames(flu), 20)
It is usually informative to assess the position of the polymorphic sites within the genome;
this is very easily done in R, using density with an appropriate bandwidth:
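The plotting code is not reproduced in this copy. The sketch below shows the general approach in base R on hypothetical positions; with the real data, pos would be replaced by position(flu). The bandwidth of 10 base pairs and the colours are assumptions.

```r
set.seed(1)
# Hypothetical SNP positions on a 1700-bp segment (stand-in for position(flu))
pos <- sort(sample(1:1700, 250))

# bw = 10 is an assumed bandwidth, in base pairs
temp <- density(pos, bw = 10)

# Empty frame, then the density as a filled polygon and one tick per SNP
plot(temp, type = "n", xlab = "Position in the alignment",
     main = "Location of the SNPs", xlim = c(0, 1700))
polygon(c(temp$x, rev(temp$x)), c(temp$y, rep(0, length(temp$x))),
        col = "lightblue")
points(pos, rep(0, length(pos)), pch = "|", col = "blue")
```

A smaller bandwidth gives a spikier curve that follows individual SNPs, while a larger one smooths out local hotspots.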
[Figure: Location of the SNPs. Y axis: Density; SNP positions are marked as ticks along the X axis.]
[Figure: Distribution of SNPs in the genome. Y axis: density.]
snpposi.plot(position(flu), genome.size=1700)
[Figure: Distribution of SNPs in the genome, by codon position (1, 2, 3). Y axis: density.]
In this case, SNPs seem to be distributed fairly homogeneously across the HA segment, with
a few possible hotspots of polymorphism between positions 400 and 700. This can be tested
using snpposi.test:
snpposi.test(position(flu), genome.size=1700)
## Monte-Carlo test
## Call: as.randtest(sim = sim, obs = obs, alter = "less")
##
## Observation: 2
##
## Based on 999 replicates
## Simulated p-value: 0.495
## Alternative hypothesis: less
##
## Std.Obs Expectation Variance
## -0.9904786 2.4849850 0.2397543
Note that retaining only biallelic sites may cause minor loss of information, as sites with
more than 2 alleles are discarded from the data. It is however possible to ask fasta2genlight
to keep track of the number of alleles for each site of the original alignment, by specifying:
The output object flu now contains the number of alleles of each position, stored in the
other slot:
head(other(flu)$nb.all.per.loc, 20)
## [1] 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1
100*mean(unlist(other(flu))>1)
## [1] 17.81305
About 18% of the sites are polymorphic, which is fairly high. This is not entirely
surprising, given that the HA segment of influenza is known for its high mutation rate.
What is the nature of this polymorphism?
[Figure: Distribution of the number of alleles per locus. X axis: Number of alleles (1 to 4); Y axis: Number of sites.]
Most polymorphic loci are biallelic, but a few loci with 3 or 4 alleles were lost. We can
estimate the loss of information very simply:
##
## 2 3 4
## 90.4 8.3 1.3
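The computation behind this table is not reproduced in this copy. The sketch below shows one way to obtain such percentages in base R. The vector nb.all is hypothetical and stands in for other(flu)$nb.all.per.loc, so the numbers differ from those above.

```r
# Hypothetical numbers of alleles per site (stand-in for other(flu)$nb.all.per.loc)
nb.all <- c(rep(1, 80), rep(2, 16), rep(3, 3), rep(4, 1))

temp <- table(nb.all)

# Keep polymorphic sites only (drop the class "1") and express as percentages
pct <- round(100 * temp[-1] / sum(temp[-1]), 1)
pct
```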
In this case, 90.4% of the polymorphic sites were biallelic, the others being essentially
triallelic. This is probably a fairly exceptional situation due to the high mutation rate of the
HA segment.
multivariate approaches. Some examples below are illustrated using toy datasets generated
using the function glSim.
args(glSim)
The new version of glSim contains new optional arguments to allow for greater complexity
in the simulated data.
In addition, the user has the option to simulate background group structure between k
ancestral populations of relative size pop.freq, generated by altering allele frequencies in
the ’non-structural’ SNPs. The factorisation of individuals into these k groups will then
occupy the slot @other$ancestral.pops. If the user wishes to sort the genotypes according
to the ancestral populations (rather than the dichotomous structural groups), this can be
accomplished by setting sort.pop = TRUE.
The non-structural SNPs, with or without ancestral population structure, can also
be generated with or without linkage disequilibrium (LD). LD is controlled by the arguments
LD = TRUE, block.minsize, block.maxsize, and theta, a dilution parameter (set to 0 for
strongest LD and 0.5 for weakest-but-present LD).
4.2 Basic analyses
4.2.1 Plotting genlight objects
Basic features of the data may also be inferred by simply looking at the data. genlight
objects can be plotted using glPlot, or simply plot (both names actually correspond to the
same function). This function displays the data as images, representing numbers of second
alleles using colours. For instance, we can have a feel for the amount and location of missing
data in the influenza dataset (see previous section) fairly easily:
glPlot(flu, posi="topleft")
[Figure: glPlot of the influenza data. X axis: SNP index.]
The white stretches in the first 30 SNPs observed around individual 70 indicate missing
data. There are only a few missing data, and they only concern a couple of individuals.
In some simple cases, some biological structures might also be apparent in such plots. For
instance, we can generate data for 100 diploid individuals belonging to 5 separate populations
(i.e. with independent allele frequencies):
x <- glSim(100, 1000, k=5, block.maxsize=200, ploidy=2,
sort.pop=TRUE)
glPlot(x, col=bluepal(3))
[Figure: glPlot of the simulated data. X axis: SNP index; Y axis: Individual index.]
Note that both population structures (vertical blocks) and patterns of LD between
contiguous sites (horizontal blocks) are easy to spot on the above figure. Of course, data
visualization merely is a preliminary approach to the data. More detailed analysis can be
achieved using both standard and ad hoc procedures as detailed below.
All these procedures are named using the prefix gl (for genlight), and can therefore be
listed by typing gl and pressing the TAB key twice. They are (see ?glMean):
• glSum: computes the sum of second alleles for each SNP.
• glNA: computes the number of missing values in each locus.
• glMean: computes the mean of second alleles, i.e. second allele frequencies for each
SNP.
• glVar: computes the variance of the second allele frequency for each SNP.
• glDotProd: computes the dot products between all pairs of individuals, with possible
centring and scaling.
For instance, one can easily derive the distribution of allele frequencies using:
myFreq <- glMean(flu)
hist(myFreq, proba=TRUE, col="gold", xlab="Allele frequencies",
main="Distribution of (second) allele frequencies")
temp <- density(myFreq)
lines(temp$x, temp$y*1.8,lwd=3)
[Figure: Distribution of (second) allele frequencies. X axis: Allele frequencies.]
In biallelic loci, one allele is always entirely redundant with the other, so it is generally
sufficient to analyse a single allele per locus. However, the distribution of allele frequencies
may be more interpretable by restoring its native symmetry:
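The corresponding code is not reproduced in this copy. Restoring the symmetry amounts to adding, for each locus, the frequency of the first allele (one minus the frequency of the second). A base R sketch on hypothetical frequencies follows; with the real data, myFreq would be glMean(flu).

```r
set.seed(1)
# Hypothetical second-allele frequencies (stand-in for glMean(flu))
myFreq <- rbeta(300, 0.3, 1.5)

# Add the frequencies of the first allele: p and 1 - p for every locus
symFreq <- c(myFreq, 1 - myFreq)

hist(symFreq, probability = TRUE, col = "darkseagreen3",
     xlab = "Allele frequencies",
     main = "Distribution of allele frequencies")
```

By construction the resulting distribution is symmetric around 0.5.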
[Figure: Distribution of allele frequencies after restoring symmetry. X axis: Allele frequencies.]
While a large number of loci are nearly fixed (frequencies close to 0 or 1), there is an
appreciable number of alleles with intermediate frequencies, which are therefore likely to
contain interesting biological signal. More generally and perhaps more importantly, this
figure may also cast light on a well-known social phenomenon occurring mainly in young
people attending noisy kinds of conferences:
We can indeed wonder whether the gesture usually referred to as the ’devil sign’ is not
actually a reference to the usual shape of SNPs frequency distributions. It is still unclear,
however, how many geneticists do attend metal gigs, although recent observations suggest
they would be more frequent in grindcore events than in classical heavy metal shows.
Besides these considerations, we can also map missing data across loci as we have done
for SNP positions in the US influenza dataset (see previous section) using glNA and density:
head(glNA(flu),20)
[Figure: Location of the missing values (NAs). Y axis: Density.]
Here, the few missing values are all located at the beginning of the alignment, probably
reflecting heterogeneity in DNA amplification during the sequencing process. In larger
datasets, such simple investigation can give crucial insights about the quality of the data
and the existence of possible sequencing biases.
Let us illustrate this procedure using 40 simulated individuals with 10,000 SNPs each:
x <- glSim(40, 1e4, LD=FALSE, parallel=FALSE)
x
seploc is used to create a list of smaller objects (here, 10 blocks of 1,000 SNPs):
## [1] "list"
names(x)
x[1:2]
## $block.1
## /// GENLIGHT OBJECT /////////
##
## // 40 genotypes, 1,000 binary SNPs, size: 60.2 Kb
## 0 (0 %) missing data
##
## // Basic content
## @gen: list of 40 SNPbin
## @ploidy: ploidy of each individual (range: 1-1)
##
## // Optional content
## @other: a list containing: ancestral.pops
##
##
## $block.2
## /// GENLIGHT OBJECT /////////
##
## // 40 genotypes, 1,000 binary SNPs, size: 60.2 Kb
## 0 (0 %) missing data
##
## // Basic content
## @gen: list of 40 SNPbin
## @ploidy: ploidy of each individual (range: 1-1)
##
## // Optional content
## @other: a list containing: ancestral.pops
dist is used within an lapply loop to compute pairwise distances between individuals for
each block:
## [1] "list"
names(lD)
class(lD[[1]])
## [1] "dist"
lD is a list of distance matrices (dist objects) between pairs of individuals. The general
distance matrix is obtained by summing these:
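The three steps described above (splitting, block-wise distances, summation) can be sketched as follows. The calls are reconstructed from the text and the printed output rather than copied from the original code, and the name blocks is ours (the text reuses x for the list of blocks).

```r
library(adegenet)

# 40 simulated individuals with 10,000 SNPs each
x <- glSim(40, 1e4, LD = FALSE, parallel = FALSE)

# 1) split the loci into 10 blocks of 1,000 SNPs
blocks <- seploc(x, n.block = 10, parallel = FALSE)

# 2) pairwise distances between individuals, one block at a time
lD <- lapply(blocks, function(e) dist(as.matrix(e)))

# 3) add up the block-wise distance matrices
D <- Reduce("+", lD)
```

Only one block is converted to integers at a time, which keeps memory use low. Note that adding Euclidean distances block by block gives an additive distance, not the Euclidean distance computed on the full matrix; the latter would require summing squared distances and taking the square root.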
And we could now carry on further analyses, such as a neighbor-joining tree using the
ape package:
library(ape)
plot(nj(D), type="fan")
title("A simple NJ tree of simulated genlight data")
[Figure: A simple NJ tree of simulated genlight data.]
• alleles: in this case we consider that each individual represents a sample of alleles, with
a sample size equalling the ploidy for each locus
This distinction is most of the time overlooked when analysing genetic data. As a matter
of fact, it does not matter when all individuals have the same ploidy. For instance, if we take
the following data:
x
as.matrix(x)
## 1 2 3 4
## a 0 0 1 1
## b 1 1 0 0
## c 1 1 1 1
and assume that all individuals are haploid, then computing e.g. the allele frequencies is
straightforward (they all equal 2/3):
glMean(x)
## 1 2 3 4
## 0.6666667 0.6666667 0.6666667 0.6666667
## // Optional content
## @ind.names: 3 individual labels
## @loc.names: 4 locus labels
## @other: a list containing: elements without names
as.matrix(x)
## 1 2 3 4
## a 0 0 2 2
## b 1 1 0 0
## c 1 1 1 1
ploidy(x)
## a b c
## 2 1 1
What are the allele frequencies in this case? Well, it depends on what we mean by ’allele
frequency’.
Is it the frequency of the alleles in the population? In this case, the unit of observation
is the allele. We have a total of 4 samples for each locus (since ’a’ is diploid, it actually
represents two samples) and the frequencies are 1/2, 1/2, 3/4, 3/4. Note, however, that this
assumes that alleles are randomly associated within individuals (pangamy).
Or is it the frequency of the alleles within the individuals? In this case, the unit of
observation is the individual, and the vector of allele frequencies represents the ’average
individual’. We first need to convert each individual vector into relative frequencies (i.e.,
divide by their respective ploidy), and then compute the average frequency across individuals,
which ends up with 2/3 for each locus:
## 1 2 3 4
## 0.6666667 0.6666667 0.6666667 0.6666667
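Both definitions can be worked through by hand in base R, using the genotype matrix and ploidy values shown above. The object names geno, pl, freq.allele and freq.indiv are ours.

```r
# Genotypes (second-allele counts) and ploidy from the example above
geno <- rbind(a = c(0, 0, 2, 2),
              b = c(1, 1, 0, 0),
              c = c(1, 1, 1, 1))
colnames(geno) <- 1:4
pl <- c(a = 2, b = 1, c = 1)

# Allele as unit: copies of the second allele / total number of alleles sampled
freq.allele <- colSums(geno) / sum(pl)

# Individual as unit: per-individual relative frequencies, then their average
# (geno / pl divides each row by the ploidy of the corresponding individual)
freq.indiv <- colMeans(geno / pl)

freq.allele
freq.indiv
```

The two vectors differ only because ’a’ is diploid; with equal ploidy across individuals they would coincide.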
The procedures designed for genlight objects seen above (glMean, glNA, etc.) allow for
this distinction to be made. The option alleleAsUnit is a logical indicating whether the
observation unit is the allele (TRUE, default) or the individual (FALSE). For instance:
as.matrix(x)
## 1 2 3 4
## a 0 0 2 2
## b 1 1 0 0
## c 1 1 1 1
glMean(x, alleleAsUnit=TRUE)
## 1 2 3 4
## 0.50 0.50 0.75 0.75
glMean(x, alleleAsUnit=FALSE)
## 1 2 3 4
## 0.6666667 0.6666667 0.6666667 0.6666667
set.seed(1)
toy <- lapply(1:10, function(i)
sample(c(1:2, NA), 10,
prob = c(.45, .45, .1),
replace = TRUE))
toy
## [[1]]
## [1] 2 2 1 NA 2 1 NA 1 1 2
##
## [[2]]
## [1] 2 2 1 2 1 1 1 NA 2 1
##
## [[3]]
## [1] NA 2 1 2 2 2 2 2 1 2
##
## [[4]]
## [1] 1 1 1 2 1 1 1 2 1 2
##
## [[5]]
## [1] 1 1 1 1 1 1 2 1 1 1
##
## [[6]]
## [1] 1 1 2 2 2 2 2 1 1 2
##
## [[7]]
## [1] NA 2 1 2 1 2 1 1 2 1
##
## [[8]]
## [1] 2 1 2 2 1 1 1 2 1 NA
##
## [[9]]
## [1] 2 1 2 2 1 2 1 2 2 2
##
## [[10]]
## [1] 2 2 1 1 1 1 1 2 1 1
## allele counts
as.matrix(x)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 2 1 NA 2 1 NA 1 1 2
## [2,] 2 2 1 2 1 1 1 NA 2 1
## [3,] NA 2 1 2 2 2 2 2 1 2
## [4,] 1 1 1 2 1 1 1 2 1 2
## [5,] 1 1 1 1 1 1 2 1 1 1
## [6,] 1 1 2 2 2 2 2 1 1 2
## [7,] NA 2 1 2 1 2 1 1 2 1
## [8,] 2 1 2 2 1 1 1 2 1 NA
## [9,] 2 1 2 2 1 2 1 2 2 2
## [10,] 2 2 1 1 1 1 1 2 1 1
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2.000 2 1 1.777778 2 1 1.333333 1.000000 1 2.000000
## [2,] 2.000 2 1 2.000000 1 1 1.000000 1.555556 2 1.000000
## [3,] 1.625 2 1 2.000000 2 2 2.000000 2.000000 1 2.000000
## [4,] 1.000 1 1 2.000000 1 1 1.000000 2.000000 1 2.000000
## [5,] 1.000 1 1 1.000000 1 1 2.000000 1.000000 1 1.000000
## [6,] 1.000 1 2 2.000000 2 2 2.000000 1.000000 1 2.000000
## [7,] 1.625 2 1 2.000000 1 2 1.000000 1.000000 2 1.000000
## [8,] 2.000 1 2 2.000000 1 1 1.000000 2.000000 1 1.555556
## [9,] 2.000 1 2 2.000000 1 2 1.000000 2.000000 2 2.000000
## [10,] 2.000 2 1 1.000000 1 1 1.000000 2.000000 1 1.000000
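Comparing the two matrices above, the second appears to be the first with each NA replaced by the mean of the observed values in its column (for example 13/8 = 1.625 in column 1). The code producing it is not reproduced in this copy; a base R sketch of this replacement on the first three columns is:

```r
# First three columns of the allele-count matrix shown above
m <- cbind(c(2, 2, NA, 1, 1, 1, NA, 2, 2, 2),
           c(2, 2, 2, 1, 1, 1, 2, 1, 1, 2),
           c(1, 1, 1, 1, 1, 2, 1, 2, 2, 1))

# Replace each NA by the mean of the observed values in its column
imputed <- apply(m, 2, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

imputed[, 1]
```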
[Figure: barplot of eigenvalues.]
When nf (number of retained factors) is not specified, the function displays the barplot of
eigenvalues of the analysis and asks the user for a number of retained principal components.
glPca returns a list with the class glPca containing the eigenvalues, principal components
and loadings of the analysis:
pca1
## Class: list of type glPca
## Call ($call):glPca(x = flu)
##
## Eigenvalues ($eig):
## 11.385 4.019 1.391 1.275 0.636 0.569 ...
##
## Principal components ($scores):
## matrix with 80 rows (individuals) and 4 columns (axes)
##
## Principal axes ($loadings):
## matrix with 274 rows (SNPs) and 4 columns (axes)
In addition to usual graphics, glPca objects can be displayed using scatter (which produces
a scatterplot of the principal components (PCs)) and loadingplot (which plots the allele
contributions, i.e. squared loadings). The scatterplot is obtained by:
scatter(pca1, posi="bottomright")
title("PCA of the US influenza data\n axes 1-2")
[Figure: PCA of the US influenza data, axes 1-2. Scatterplot of the isolates, labelled by accession number, with eigenvalues inset.]
The first PC suggests the existence of two clades in the data, while the second one shows
groups of closely related isolates arranged along a cline of genetic differentiation. This
structure is confirmed by a simple neighbour-joining (NJ) tree:
library(ape)
tre <- nj(dist(as.matrix(flu)))
tre
##
## Phylogenetic tree with 80 tips and 78 internal nodes.
##
## Tip labels:
## CY013200, CY013781, CY012128, CY013613, CY012160, CY012272, ...
##
## Unrooted; includes branch lengths.
[Figure: plot of the neighbour-joining tree, with tips labelled by isolate accession number.]
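The tree-building step above chains three operations: conversion of the data to a matrix of allele counts, computation of Euclidean distances between individuals, and neighbour-joining. A minimal sketch of the last two steps on simulated data is given below; the toy matrix and the names snp, d and toy_tre are hypothetical and stand in for as.matrix(flu).

```r
library(ape)

set.seed(123)

# Hypothetical toy data: 6 haploid isolates x 30 biallelic SNPs (0/1).
snp <- matrix(rbinom(6 * 30, size = 1, prob = 0.4), nrow = 6,
              dimnames = list(paste0("isolate", 1:6), NULL))

# Euclidean distances between rows; for 0/1 data the squared distance
# equals the number of SNPs at which two isolates differ.
d <- dist(snp)

# Neighbour-joining tree built from the distance matrix.
toy_tre <- nj(d)

plot(toy_tre, type = "unrooted")
```

An unrooted NJ tree with n tips has n - 2 internal nodes, which matches the 80 tips and 78 internal nodes reported for the influenza data.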
The correspondence between both analyses can be better assessed using colors based on
PCs; this is achieved by colorplot:
[Figure: colorplot of the isolates on PC1 (x-axis) and PC2 (y-axis), with an inset barplot of eigenvalues.]
[Figure: tree titled "NJ tree of the US influenza data".]
As expected, the two approaches give congruent and complementary results: NJ is
better at showing clusters of closely related isolates, while the cline of genetic
differentiation is much clearer in the PCA.
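The principle behind colors based on PCs can be sketched in base R: each retained PC is rescaled to the interval [0, 1] and used as one channel of an RGB color, so that individuals close to each other on the PCs receive similar colors. This is only an illustration of the idea on simulated data, not the implementation used by colorplot; the helper rescale01 and the object names are hypothetical.

```r
set.seed(1)

# Hypothetical toy data: 20 individuals x 50 biallelic SNPs (0/1).
snp <- matrix(rbinom(20 * 50, size = 1, prob = 0.3), nrow = 20)

# Standard centred PCA, keeping the first two PCs.
pca <- prcomp(snp, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# Rescale a numeric vector to [0, 1].
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# PC1 drives the red channel, PC2 the green channel; blue is fixed at 0.
channels <- apply(scores, 2, rescale01)
pc_cols <- rgb(channels[, 1], channels[, 2], 0)

# Individuals plotted on the PCs, colored by their position.
plot(scores, col = pc_cols, pch = 20, cex = 3, xlab = "PC1", ylab = "PC2")
abline(h = 0, v = 0, col = "grey")
```

A color vector of this kind, with one entry per individual, could then be passed to tiplabels(pch = 20, col = pc_cols) from ape to mark the tips of a tree built on the same individuals, provided the colors are reordered to match the tip labels.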
dapc1
## #################################################
## # Discriminant Analysis of Principal Components #
## #################################################
## class: dapc
## $call: dapc.genlight(x = x, n.pca = 10, n.da = 1)
##
## $n.pca: 10 first PCs of PCA used
## $n.da: 1 discriminant functions saved
## $var (proportion of conserved variance): 0.121
##
## $eig (eigenvalues): 361.6
##
##   vector    length content
## 1 $eig      1      eigenvalues
## 2 $grp      100    prior group assignment
## 3 $prior    2      prior group probabilities
## 4 $assign   100    posterior group assignment
## 5 $pca.cent 10050  centring vector of PCA
## 6 $pca.norm 10050  scaling vector of PCA
## 7 $pca.eig  99     eigenvalues of PCA
##
##   data.frame    nrow  ncol content
## 1 $tab          100   10   retained PCs of PCA
## 2 $means        2     10   group means
## 3 $loadings     10    1    loadings of variables
## 4 $ind.coord    100   1    coordinates of individuals (principal components)
## 5 $grp.coord    2     1    coordinates of groups
## 6 $posterior    100   2    posterior membership probabilities
## 7 $pca.loadings 10050 10   PCA loadings of original variables
## 8 $var.contr    10050 1    contribution of original variables
Note that for cross-validation (xvalDapc) you will need to extract the table of allele
frequencies first using tab.
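A possible way to follow this note is sketched below. It assumes a simulated genlight object created with glSim (smaller than the one analysed here, to keep the run short), that pop(x) holds the simulated groups, and that the genlight method of tab accepts NA.method = "mean"; the values chosen for n.pca.max and n.rep are arbitrary and only meant to keep the example fast.

```r
library(adegenet)

set.seed(1)

# Assumed stand-in for the data analysed here: 100 individuals,
# 1000 unstructured SNPs followed by 50 structured SNPs.
x <- glSim(100, 1e3, 50)

# Extract the table of alleles as a plain matrix, replacing any
# missing value by the mean of the corresponding SNP.
mat <- tab(x, NA.method = "mean")

# Cross-validation of DAPC on the extracted table; 'grp' is the
# group factor stored in the genlight object.
xval <- xvalDapc(mat, grp = pop(x), n.pca.max = 20, n.rep = 10,
                 xval.plot = FALSE)

# The returned list stores the cross-validation results and a DAPC
# fitted with the retained number of PCs.
names(xval)
```

Extracting the full table gives up the memory savings of the genlight class, so this approach is best kept for datasets that fit in RAM as an ordinary matrix.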
For the last 50 structured SNPs (located at the end of the alignment), the two groups
of individuals have different (random) distributions of allele frequencies, while they share
the same distributions at all other loci. DAPC still discriminates the two groups reasonably well:
scatter(dapc1,scree.da=FALSE, bg="white", posi.pca="topright", legend=TRUE,
txt.leg=paste("group", 1:2), col=c("red","blue"))
[Figure: density of individuals from group 1 and group 2 along discriminant function 1.]
While the composition plot confirms that groups are not entirely disentangled...
[Figure: composition plot showing the membership probability of each individual for group 1 and group 2.]
... the loading plot identifies pretty well the most discriminating alleles:
loadingplot(dapc1$var.contr, thres=1e-3)
[Figure: "Loading plot" of the loadings (y-axis) of all variables (x-axis); labelled loci above the threshold are concentrated between 10001 and 10050, with a few lower peaks elsewhere.]
And we can zoom in to the contributions of the last 100 SNPs to make sure that the tail
indeed corresponds to the 50 last structured loci:
loadingplot(tail(dapc1$var.contr[,1],100), thres=1e-3)
[Figure: "Loading plot" of the loadings (y-axis) of the last 100 variables (x-axis); all labelled loci lie between 10001 and 10050.]
Here, we indeed identified the structured region of the genome fairly well.
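The same check can be made numerically rather than visually, by listing the loci whose contribution exceeds the threshold and comparing them with the known structured region. The sketch below uses a simulated vector of contributions with hypothetical values (contrib, n_loci, n_struct); with the real analysis, the vector would be dapc1$var.contr[,1].

```r
set.seed(7)

# Hypothetical contributions: 950 unstructured loci with small values,
# followed by 50 structured loci with larger values.
n_loci <- 1000
n_struct <- 50
contrib <- c(runif(n_loci - n_struct, min = 0, max = 5e-4),
             runif(n_struct, min = 2e-3, max = 1e-2))
names(contrib) <- seq_len(n_loci)

# Same threshold as in the loading plots above.
thres <- 1e-3

# Positions of the loci whose contribution exceeds the threshold.
flagged <- which(contrib > thres)

# Do all flagged loci fall within the last 'n_struct' positions?
in_tail <- all(flagged > n_loci - n_struct)
```

In this toy example the separation is perfect by construction; with real contributions, some structured loci may fall below the threshold and a few unstructured ones above it, as the first loading plot shows.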
References
[1] Jombart, T. (2008) adegenet: a R package for the multivariate analysis of genetic markers.
Bioinformatics 24: 1403-1405.
[2] R Development Core Team (2011). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.