
DNA Copy Number and Loss of Heterozygosity Analysis Algorithms

The document describes algorithms in GenomeStudio software for detecting copy number variants and loss of heterozygosity from microarray data. It explains the CNV Region Report, cnvPartition, and LOH Score algorithms, including how cnvPartition models copy numbers and estimates likelihoods to identify copy number changes.


Technical Note: Informatics

DNA Copy Number and Loss of Heterozygosity Analysis Algorithms
Detection of copy-number variants and chromosomal aberrations in GenomeStudio® software.

Introduction

Illumina has developed several algorithms for detecting copy number variants (CNVs) and other structural variants from microarray data (Table 1). These algorithms are available as individual software plug-ins for the GenomeStudio Genotyping Module and can be downloaded from the Illumina support page: http://support.illumina.com/downloads/genomestudio-2-0-plug-ins.html. The plug-ins are used within the CNV Analysis workbench, and results can be visualized within the GenomeStudio Full Data Table, in the Illumina Genome Viewer (IGV), or in a CNV region display window. This technical note describes the function of these algorithms and how they can be employed to analyze a chromosomal region of interest.

Table 1: GenomeStudio Copy Number Algorithms

Algorithm               Function
CNV Region Report       Generates three separate CNV reports
cnvPartition            Calculates copy numbers with confidence scores and generates CNV regions
Homozygosity Detector   Autobookmarks samples with extended tracts of homozygosity (single-sample analysis only)
LOH Score               Estimates the likelihood of a region exhibiting LOH

CNV Region Report

CNV Region Report is a software plug-in for GenomeStudio that
generates three separate CNV reports.
• Standard Report: Lists each CNV and loss of heterozygosity (LOH) region for each selected sample.
• Allele-Specific Copy Number Report: Estimates the allele-specific copy number for each probe entry (e.g., A- or AAB). In the output file, the CN_GTYPE column is calculated using the CNV Value (as determined by cnvPartition), the B Allele Frequency data, the GTYPE (genotype column), and the theoretical B Allele Frequency normal distributions for each copy number.
• PLINK CNV Input Report: Creates input files for some of the CNV features of the PLINK genome-wide association study (GWAS) and CNV analysis application¹.

cnvPartition

The goal of the cnvPartition algorithm is to identify regions of the genome that are aberrant in copy number using two Infinium® assay outputs: the log R ratio (LRR) and B allele frequency (BAF). Because LRR is the logged ratio of observed probe intensity to expected intensity, any deviations from zero in this metric are evidence for copy number change. BAF is the proportion of hybridized sample that carries the B allele as designated by the Infinium assay. In a normal sample, discrete BAFs of 0.0, 0.5, and 1.0 are expected for each locus (representing AA, AB, and BB).

Deviations from this expectation are indicative of aberrant copy number. For example, if a locus has a BAF of 0.66, this might indicate that there are two copies of the B allele and one copy of the A allele present in the sample (2/(2 + 1) ≈ 0.66). Analyzing both of these metrics provides stronger resolution for detecting true copy number changes.

Copy Number Estimation

cnvPartition models LRRs and BAFs for each of 14 different copy number scenarios as simple bivariate Gaussian distributions (Table 2). Modeling copy number in this way allows for computation of a preliminary copy number estimate for each assayed locus by comparing its observed LRR and BAF to values predicted from each of the fourteen models. Specifically, the likelihood of observing a given LRR and BAF under each of the 14 models is calculated. For example, to compute the likelihood of a particular LRR and BAF given a genotype of AAB (L_AAB), the AAB parameters from Table 2 and the normal density are used:

$$L_{AAB} = \frac{1}{0.18\sqrt{2\pi}} \exp\left(-\frac{(LRR - 0.3)^2}{2(0.18^2)}\right) + \frac{1}{0.03\sqrt{2\pi}} \exp\left(-\frac{(BAF - \frac{1}{3})^2}{2(0.03^2)}\right)$$

Likelihoods are also computed for the other model genotypes listed in Table 2, with the exception of the homozygous deletion (DD). For homozygous deletions, a very low LRR is expected, but the BAF may be any value between zero and one. Therefore, the likelihood of a double deletion (L_DD) is calculated by the equation:

$$L_{DD} = \frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(LRR - (-5))^2}{2(2^2)}\right)$$
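The per-genotype likelihood calculation described under Copy Number Estimation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and dictionary names are hypothetical, the parameters are copied from Table 2, and the LRR and BAF terms are combined with "+" exactly as the L_AAB formula is printed in this note (a strictly independent bivariate density would multiply them instead). It is not the plug-in's actual source code.

```python
import math

# Table 2 parameters: genotype -> (LRR mean, LRR SD, BAF mean, BAF SD).
GENOTYPE_MODELS = {
    "A": (-0.45, 0.18, 0.0, 0.03),
    "B": (-0.45, 0.18, 1.0, 0.03),
    "AA": (0.0, 0.18, 0.0, 0.03),
    "AB": (0.0, 0.18, 0.5, 0.03),
    "BB": (0.0, 0.18, 1.0, 0.03),
    "AAA": (0.3, 0.18, 0.0, 0.03),
    "AAB": (0.3, 0.18, 1 / 3, 0.03),
    "ABB": (0.3, 0.18, 2 / 3, 0.03),
    "BBB": (0.3, 0.18, 1.0, 0.03),
    "AAAA": (0.75, 0.18, 0.0, 0.03),
    "AAAB": (0.75, 0.18, 0.25, 0.03),
    "ABBB": (0.75, 0.18, 0.75, 0.03),
    "BBBB": (0.75, 0.18, 1.0, 0.03),
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))


def genotype_likelihood(genotype, lrr, baf):
    """Likelihood of an observed (LRR, BAF) pair under one modeled genotype."""
    if genotype == "DD":
        # Homozygous deletion: only the LRR term is used (mean -5, SD 2).
        return normal_pdf(lrr, -5.0, 2.0)
    lrr_mean, lrr_sd, baf_mean, baf_sd = GENOTYPE_MODELS[genotype]
    # The two terms are added, following the L_AAB formula as printed.
    return normal_pdf(lrr, lrr_mean, lrr_sd) + normal_pdf(baf, baf_mean, baf_sd)
```

Calling `genotype_likelihood("AAB", 0.3, 1 / 3)` evaluates the AAB model at its own mean, which is where that model's likelihood is largest.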

Table 2: Genotypes Modeled by cnvPartition

Genotype  CN  LRR Mean  LRR SD  BAF Mean  BAF SD
DD        0   -5        2       NA        NA
A         1   -0.45     0.18    0         0.03
B         1   -0.45     0.18    1         0.03
AA        2   0         0.18    0         0.03
AB        2   0         0.18    0.5       0.03
BB        2   0         0.18    1         0.03
AAA       3   0.3       0.18    0         0.03
AAB       3   0.3       0.18    1/3       0.03
ABB       3   0.3       0.18    2/3       0.03
BBB       3   0.3       0.18    1         0.03
AAAA      4   0.75      0.18    0         0.03
AAAB      4   0.75      0.18    0.25      0.03
ABBB      4   0.75      0.18    0.75      0.03
BBBB      4   0.75      0.18    1         0.03

Parameters for each of the fourteen genotypes considered by cnvPartition are shown. BAFs are modeled as a uniform distribution between zero and one for homozygous deletions (DD). All other distributions are modeled with Gaussian distributions with the given parameters. The genotype AABB is not modeled since this would represent two independent duplication events and rarely occurs in nature. (CN = copy number, DD = double deletion, SD = standard deviation)

These likelihoods are then summarized by five composite copy number likelihoods:

$$
\begin{aligned}
L_0 &= L_{DD} \\
L_1 &= L_A + L_B \\
L_2 &= L_{AA} + L_{AB} + L_{BB} \\
L_3 &= L_{AAA} + L_{AAB} + L_{ABB} + L_{BBB} \\
L_4 &= L_{AAAA} + L_{AAAB} + L_{ABBB} + L_{BBBB}
\end{aligned}
$$

where L_k denotes the likelihood of copy number k for integer values of k and the likelihood of a genotype for non-numeric values of k. The preliminary copy number estimate (X) is defined as the average of the five modeled copy numbers, weighted by their respective likelihoods:

$$X = \frac{L_1 + 2L_2 + 3L_3 + 4L_4}{L_0 + L_1 + L_2 + L_3 + L_4}$$

Breakpoint Identification

Preliminary copy number estimates are the inputs to the core partitioning algorithm. The goal of partitioning is to identify regions of the genome where the values of X are consistently higher or lower than 2, the expected value for a diploid sample. To find an aberrant region, the algorithm orders the X values by their position along a chromosome and searches for the indexes i and j such that the values X_i ... X_j are maximally different from those outside this region. Thus, the algorithm seeks to maximize |Z_ij| over all i and j with i < j, defined by the equations:

$$S_i = X_1 + \dots + X_i, \quad 1 \le i \le n$$

$$Z_{ij} = \left[\frac{1}{j - i} + \frac{1}{n - j + i}\right]^{-1/2} \times \left[\frac{S_j - S_i}{j - i} - \frac{S_n - S_j + S_i}{n - j + i}\right]$$

where n is the number of loci assayed on the chromosome.

An exhaustive search through all pairs of i and j scales quadratically with n and is therefore an inefficient process for use with Illumina whole-genome genotyping products²,³. To simplify the calculations required, cnvPartition uses a sliding window strategy to maximize |Z_ij|, but where j = i + w, with w the defined window size. After the optimal window is found, the algorithm attempts to extend the window in both directions to maximize the value of |Z_ij| further. As implemented, the algorithm repeats this procedure for w = 4, 8, 16, and 32, then reports the i and j corresponding to the maximal |Z_ij| found. When a maximally different segment is found, |Z_ij| is compared to a pre-determined threshold (default is 6). If the threshold is exceeded, the boundaries are noted and the algorithm is applied recursively to the regions between 1 and i, i+1 and j, and j+1 through n. The threshold of 6 was chosen as a default because it minimizes false positives, particularly for short aberrations.

Copy Number Assignment to Partitioned Regions

The partitioning procedure results in a set of putative breakpoints scattered across the genome. The next step is to assign a copy number for each region lying between two consecutive breakpoints. To do this, L_0, L_1, L_2, L_3, and L_4 for each locus within the region are used. For each putative copy number k (0–4), the logarithms of L_k are summed over all loci in the region. The k with the highest sum is the copy number assigned to this region. For regions with copy numbers other than 2, the algorithm also assigns a confidence score for the copy number that is called. The confidence score is defined as the sum of all logged likelihoods in the region for the assigned copy number minus the sum of all log L_2 values for loci in the region.

Additional Usage Notes

• Regions with copy number = 1 on the X or Y chromosome for males are filtered from the CNV results.
• Probes that are designated as Intensity Only are treated differently than normal probes. The B Allele Frequency is ignored for these probes.
• Y probes are not considered for samples designated as female.

Homozygous Region Detection

cnvPartition also includes a homozygosity detection algorithm that runs separately from the partitioning algorithm already described. This algorithm only runs on copy number 2 regions by default. Therefore, it is sometimes called a Copy Neutral LOH Detector. The logic is similar to that used in the homozygosity detector autobookmarking plug-in (see the Homozygosity Detector section). However, additional logic has been added to simplify usage for the end user. Instead of adjusting the ChiSquare threshold as in the autobookmarking plug-in, the user can simply adjust the MinHomozygousRegionSize configuration parameter. By default, this is set to 10 Mb based on empirical testing.
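As an illustration of the Copy Number Estimation and Breakpoint Identification steps, the following Python sketch computes the weighted estimate X for a locus and scans fixed-width windows (j = i + w) for the largest |Z_ij|. The function names are hypothetical. The sketch deliberately omits the window-extension step, the threshold comparison, and the recursion described in the text, and it should be read as a simplified model rather than the cnvPartition implementation.

```python
from itertools import accumulate


def preliminary_copy_number(l0, l1, l2, l3, l4):
    """Likelihood-weighted average copy number X for one locus."""
    return (l1 + 2 * l2 + 3 * l3 + 4 * l4) / (l0 + l1 + l2 + l3 + l4)


def z_statistic(prefix, i, j):
    """Z_ij for the j - i loci after index i, given prefix sums with prefix[0] = 0."""
    n = len(prefix) - 1
    inside = j - i
    outside = n - j + i
    scale = (1 / inside + 1 / outside) ** -0.5
    mean_inside = (prefix[j] - prefix[i]) / inside
    mean_outside = (prefix[n] - prefix[j] + prefix[i]) / outside
    return scale * (mean_inside - mean_outside)


def best_window(x, widths=(4, 8, 16, 32)):
    """Return (|Z|, i, j) for the fixed-width window with the largest |Z_ij|."""
    prefix = [0.0] + list(accumulate(x))
    n = len(x)
    best = (0.0, None, None)
    for w in widths:
        if w >= n:
            continue  # a window covering every locus has nothing to compare against
        for i in range(n - w + 1):
            z = abs(z_statistic(prefix, i, i + w))
            if z > best[0]:
                best = (z, i, i + w)
    return best
```

Prefix sums make each Z_ij evaluation constant time, so one pass per window width is linear in the number of loci, which is the efficiency argument the text makes against the exhaustive quadratic search.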

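The rule from Copy Number Assignment to Partitioned Regions (sum the log likelihoods per copy number, pick the largest sum, and score non-diploid calls against the copy number 2 sum) can be sketched as follows. The function name and the input layout are hypothetical. The use of natural logarithms and the small floor that avoids log(0) are assumptions, since this note does not state the logarithm base or how zero likelihoods are handled.

```python
import math

LIKELIHOOD_FLOOR = 1e-300  # assumption: guards against log(0); not from the source


def assign_copy_number(locus_likelihoods):
    """Assign a copy number to one region.

    locus_likelihoods holds one (L0, L1, L2, L3, L4) tuple per locus in the region.
    Returns (copy_number, confidence); confidence is None when copy number is 2.
    """
    log_sums = [0.0] * 5
    for likelihoods in locus_likelihoods:
        for k, value in enumerate(likelihoods):
            log_sums[k] += math.log(max(value, LIKELIHOOD_FLOOR))
    copy_number = max(range(5), key=lambda k: log_sums[k])
    if copy_number == 2:
        return copy_number, None
    # Confidence: log-likelihood sum of the call minus the copy number 2 sum.
    return copy_number, log_sums[copy_number] - log_sums[2]
```

Under these assumptions, a region of ten loci that each favor copy number 3 over copy number 2 by a factor of 15 would receive a confidence of 10 × ln 15.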
Experimental Features

Several experimental features are included in cnvPartition v3.2.0. These features are all disabled by default but can be activated by adjusting the associated configuration files. Configuration details for each feature are available in the cnvPartition user document⁵. These features include:
• LOH detection for the entire genome. Previous versions only allowed LOH detection to be run on regions with a copy number of 2.
• Log R Ratio adjustment for the Y chromosome. Y SNPs are clustered using only males, so the log R Ratio indicates that the copy number = 2. The adjustment lowers all log R Ratios on Y. If Y chromosome SNP clusters are already adjusted, additional adjustment could provide inconsistent results.
• Support for highly amplified genomes. These are common with many cancer samples. Log R Ratios are adjusted upward based on calculation of average genomic ploidy.
• A GC wave adjustment based on linear regression of LRR vs. GC content in probes. GC waves can be present when using incorrectly quantified DNA in the Infinium Assay, or they might be present in regions of high or low GC content.

CNV Display in GenomeStudio

The copy number values are then used to create CNV regions and bookmarks in GenomeStudio for visualization of aberrant regions. The cnvPartition algorithm incorporates two user-definable thresholds for optimization of CNV detection. The Confidence Threshold allows users to filter out CNV regions that have low confidence values. The default of 35 was determined empirically using normal HapMap samples on the Illumina Human1M BeadChip. The Probe Gap Size Threshold allows users to filter out CNV regions that are in large probe gaps, such as centromeres. The default of 1,000,000 (1 Mb) was determined empirically to help prevent CNV regions from being falsely detected across centromeres and other large gaps.

Algorithms for Automated Bookmarking

GenomeStudio can use several automated bookmarking algorithms. These plug-in algorithms automatically scan data for the presence of structural aberrations and CNVs. Each algorithm employs a different strategy to search for aberrations and CNVs. They are intended to assist with visually categorizing various types of aberrations present in samples of interest. Automated bookmarking algorithms can be used as data-mining tools for the discovery of new regions, or to verify known regions of interest. Bookmarks can be edited whether they are created manually or generated using the autobookmarking tool. All bookmarks can be exported to share with other users. The Homozygosity Detector is an autobookmarking plug-in available for GenomeStudio.

Homozygosity Detector

The homozygosity detector algorithm can be used to autobookmark samples with extended tracts of homozygosity (single-sample analysis only). Homozygosity tracts may result from inbreeding, large-scale gene conversion, uniparental disomy (UPD), or chromosomal deletions. Other factors such as population history or low recombination rates may also contribute to creating extended regions of LOH. This algorithm uses SNP frequencies to calculate the expectation that a single SNP is homozygous in a single sample. The algorithm then calculates the χ² value of the observation of zero heterozygotes in N SNPs versus the expected number of heterozygotes given the frequencies of the SNPs. In this algorithm, there is a fixed cutoff significance (χ² = 23.5), which corresponds to 50 contiguous SNPs, each with a minor allele frequency (MAF) of 0.2.

This algorithm requires that each LOH region contains at least 50 homozygous SNPs by default. All LOH regions with more than 50 SNPs and χ² > 23.5 are bookmarked.

Homozygosity Detector Algorithm Process

1. The χ² threshold is preset to 23.5, but users can change this value.
2. The minimum number of SNPs per region is preset to 50, but users can change this value.
3. The allele frequencies are calculated for each SNP before analysis. The allele frequencies are used to calculate the expectation that each SNP is heterozygous in a single sample, assuming Hardy-Weinberg equilibrium.
4. Next, for each sample, all of the genotypes on each chromosome are scanned, and all of the contiguous regions without heterozygous genotypes are located.
5. For each of these regions, the expected number of heterozygotes is calculated by the equation:

$$E_{het} = \sum_{i=1}^{N} 2 f_i (1 - f_i)$$

where N is the number of homozygous SNPs and f_i is the frequency of either of the SNP alleles in the general population. The χ² value is given as:

$$\chi^2 = \frac{(N_{hom} - E_{hom})^2}{E_{hom}} + \frac{(N_{het} - E_{het})^2}{E_{het}}$$

where N_hom and N_het are the number of homozygous and heterozygous genotypes respectively, and E_hom and E_het are the expected number of homozygous and heterozygous genotypes. By definition, there are no heterozygotes (N_het = 0, N_hom = N, and E_hom = N − E_het), so the χ² value can be simplified to:

$$\chi^2 = \frac{N E_{het}}{N - E_{het}}$$

where N is the number of SNPs with genotype calls.
6. Each segment that is more significant than the predefined or user-supplied χ² threshold value and has more SNPs than the predefined or user-supplied minimum number of SNPs is bookmarked.

LOH Score Statistical Algorithm

The LOH Score column plug-in reports the likelihood of loss of heterozygosity (LOH) existing in a region of interest. The LOH Score algorithm scans data sets to determine and identify the presence of LOH. Variations in LOH score can be plotted in the Chromosome Heat Map or in the Illumina Genome Viewer (IGV).
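The Homozygosity Detector process (locate runs without heterozygous calls, compute the expected heterozygote count, and apply the simplified χ² formula) can be sketched in Python as follows. The function names are hypothetical and the input is assumed to contain only AA, AB, and BB calls for one chromosome of one sample, with one allele frequency per SNP. Because the text states the minimum both as "at least 50" and "more than 50" SNPs, the sketch uses "at least", which is an assumption.

```python
def homozygous_runs(genotypes):
    """Yield (start, end) pairs, end exclusive, of maximal runs with no AB calls."""
    start = None
    for idx, genotype in enumerate(genotypes):
        if genotype == "AB":
            if start is not None:
                yield start, idx
                start = None
        elif start is None:
            start = idx
    if start is not None:
        yield start, len(genotypes)


def chi_square(freqs):
    """Simplified chi-square for a run with zero observed heterozygotes."""
    n = len(freqs)
    expected_het = sum(2 * f * (1 - f) for f in freqs)
    return n * expected_het / (n - expected_het)


def bookmark_regions(genotypes, freqs, chi2_threshold=23.5, min_snps=50):
    """Return (start, end, chi2) for runs passing both thresholds."""
    regions = []
    for start, end in homozygous_runs(genotypes):
        chi2 = chi_square(freqs[start:end])
        # Assumption: ">=" for the SNP count; the source text is ambiguous.
        if end - start >= min_snps and chi2 > chi2_threshold:
            regions.append((start, end, chi2))
    return regions
```

With 50 SNPs that each have an allele frequency of 0.2, `chi_square` gives 50 × 16 / 34 ≈ 23.5, which agrees with the default cutoff quoted in the text.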

Figure 1: LOH Score Equation

$$LOHScore = \log_{10} \frac{\prod_{i=1}^{N} P_i(gt_i \mid LOH)}{\prod_{i=1}^{N} P_i(gt_i \mid NoLOH)}$$

$$P_i(gt_i \mid LOH) = \begin{cases} P_{error}, & gt_i = AB \\ 1 - P_{error}, & gt_i = AA \text{ or } BB \end{cases}$$

$$P_i(gt_i \mid NoLOH) = \begin{cases} hetfreq, & gt_i = AB \\ 1 - hetfreq, & gt_i = AA \text{ or } BB \end{cases}$$

P_error = genotyping error rate (P_error = 0.001)
hetfreq = mean heterozygote frequency
gt_i = genotype of locus i

If a chromosomal region is lost and LOH is observed, the only expected genotypes are AA and BB. In this case, AB would be observed only as a result of genotyping error. If there is no LOH, all three genotypes are possible.

The LOH score is a measure of the likelihood that a SNP is exhibiting LOH around a window over all N SNPs, where N is the number of SNPs in a user-designated window size centered at the chromosomal position of the SNP. The equation used to determine the LOH Score is shown in Figure 1. The recommended window size depends upon the density of probes on the product in use, as defined by the following equation:

$$\text{Minimum Window Size} = \frac{\text{Factor Number}}{\text{Number of Probes}}$$

where Factor Number = 88,000.

The window size for a specific workspace may require optimization depending upon the type of aberration under examination and the quality of the data.

If the number of heterozygote SNPs matches the prediction in the specified window, the LOH score is 0. The LOH score increases if there is an unexpectedly low number of heterozygote SNPs in the window. Because of this design, the algorithm is based purely on genotype calls and heterozygote frequencies. Taking both of these into account, the LOH Score algorithm is a log odds ratio of the probability of a region exhibiting LOH versus not exhibiting LOH. Because levels of LOH can vary from sample to sample and region to region, it is difficult to assign LOH score thresholds that always positively identify regions exhibiting LOH. However, the LOH score is a valuable calculation that can be used to detect chromosomal aberrations.

An odds ratio is defined as the ratio of the number of subjects in a group with an event to the number of subjects without an event. The log odds ratio for each of these hypotheses is computed using the cluster file to estimate heterozygote allele frequencies for every SNP, assuming that genotyping calls are independent.

The LOH Score algorithm does not incorporate haplotype structures and assumes that heterozygote frequencies in the training set are representative of frequencies in the population under study. Therefore, the LOH score is a generalization of what may be occurring in a region of interest. In single-sample mode, the reference is a cluster file and it is possible that copy-neutral LOH detected may be due to haplotype block structure in the data.

A diverse panel of 270 HapMap samples, including Caucasian, Han Chinese, Japanese, and Yoruba HapMap populations, is used to create the default cluster file and to calculate heterozygote frequencies. Heterozygosity rates are estimated for the combined group. If the population under study is not represented well by this group, it is beneficial to create a new cluster file based on the data from the unique population. Independent of the platform used, false positives in the LOH score may be due to some SNPs being rare in the studied population but common in the diversity panel used to create the cluster file.

LOH Score Example

To illustrate the flow of this algorithm, consider this simplified example. For a window containing N SNPs, all resulting in homozygote calls with heterozygote frequencies of h = 0.1, let the genotyping error be e = 0.001. The likelihood of LOH occurring is (1-e)^N. The likelihood of no LOH is (1-h)^N. Therefore, the log odds ratio is log10{(1-e)^N/(1-h)^N}, which is the same as N{log10(1-e) - log10(1-h)}, which approximately equals N(h-e)/2.3 (using ln(1-x) ≈ -x for small x and ln 10 ≈ 2.3). The log odds of LOH versus No LOH therefore grow in a roughly linear fashion with the number of consecutive homozygotes.

Conversely, if that stretch contains M heterozygote calls, the likelihood of LOH decreases and the equation is adjusted to (1-e)^(N-M) × e^M, because heterozygotes in a region with LOH occur only through genotyping errors. The likelihood of No LOH also changes and becomes (1-h)^(N-M) × h^M. The log odds ratio now becomes approximately equal to {(N-M)(h-e)/2.3} + M{log10(e) - log10(h)}, which equals {(N-M)(h-e)/2.3} - 2M. In this case, the odds have diminished. Setting this expression to zero gives M/(N-M) = (h-e)/(2 × 2.3) ≈ 0.02, so with these values the odds of both hypotheses are equal when roughly 2% of the SNPs in the window receive a heterozygote call. If more heterozygote calls are produced, the log odds ratio becomes negative as it becomes less likely that these observations come from a region with LOH.

It is important to remember that there are usually unknown haplotype structures and population-dependent heterozygote frequencies that may play a role in the accuracy of the LOH score. However, this score is provided as a starting point to determine whether a particular stretch of homozygotes contains LOH.
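The LOH Score equation in Figure 1 and the worked example can be checked numerically with a short Python sketch. The function name is hypothetical, and passing one heterozygote frequency per SNP is an assumption about how hetfreq is supplied; the default error rate of 0.001 is the value given in Figure 1.

```python
import math


def loh_score(genotypes, het_freqs, p_error=0.001):
    """Log10 odds of LOH versus no LOH for one window of genotype calls."""
    score = 0.0
    for genotype, het_freq in zip(genotypes, het_freqs):
        if genotype == "AB":
            # Under LOH a heterozygote arises only from genotyping error.
            score += math.log10(p_error) - math.log10(het_freq)
        else:
            score += math.log10(1 - p_error) - math.log10(1 - het_freq)
    return score


# A window of 100 homozygous calls with h = 0.1, as in the example.
all_homozygous = loh_score(["AA"] * 100, [0.1] * 100)

# The same window with two heterozygote calls lowers the score by roughly 2 each.
with_two_hets = loh_score(["AB"] * 2 + ["AA"] * 98, [0.1] * 100)
```

Summing logarithms rather than multiplying probabilities avoids numeric underflow for large windows and makes the roughly linear growth with the number of homozygous calls explicit.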

Summary

GenomeStudio provides several methods to analyze SNP and probe intensity data to identify chromosomal regions with LOH and copy number variations. The software plug-ins described in this technical note are freely available to GenomeStudio users to provide extended functionality.

Automated bookmarking algorithms save time by automatically scanning and categorizing samples. Researchers can use cnvPartition to find and calculate copy numbers, or Homozygosity Detector to identify extended tracts of LOH.

The LOH Score algorithm provides statistical information about chromosomal aberrations of interest. This information includes the probability of LOH existing. This algorithm can be used to identify interesting regions in large sample sets quickly or to analyze a more refined region further.

The open architecture of Illumina GenomeStudio software allows for customized and advanced analysis tools for the downstream analysis of Illumina DNA Analysis BeadChip Genotyping data. The plug-ins described in this document can be downloaded from the Illumina support page: http://support.illumina.com/downloads/genomestudio-2-0-plug-ins.html.

References

1. pngu.mgh.harvard.edu/~purcell/plink/
2. Olshen AB, Venkatraman ES, Lucito R, Wigler M (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557–572.
3. Venkatraman ES, Olshen AB (2007) A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23: 657–663.

Illumina • 1.800.809.4566 toll-free (U.S.) • +1.858.202.4566 tel • [email protected] • www.illumina.com


FOR RESEARCH USE ONLY
© 2007–2014 Illumina, Inc. All rights reserved.
Illumina, BeadArray, GenomeStudio, Infinium, the pumpkin orange color, and the Genetic Energy streaming bases design
are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property
of their respective owners.
Pub. No. 970-2007-008 Current as of 04 January 2017
