
Seq Quantification App Note

This application note discusses the detection and quantification of sequence variants from Sanger sequencing traces, emphasizing the importance of peak height data in identifying minor alleles. It reviews two software applications, QSVanalyzer and ab1PeakReporter, which facilitate the analysis of sequencing data to extract quantitative information about DNA sequences. The document outlines workflows for using these tools and highlights their utility in analyzing somatic mutations and other genetic variations.


APPLICATION NOTE Sanger Sequencing Data Analysis

Detection and Quantification of Sequence Variants from Sanger Sequencing Traces

Determination of minor alleles by analyzing peak height data
The introduction of semi-automated fluorescent dye-terminator DNA sequencing using capillary electrophoresis (also known as CE or Sanger sequencing) has revolutionized the life and medical sciences by unraveling complete genomes and elucidating the genetic structures of many organisms. The primary information and value of the DNA sequencing process is the identification of the nucleotides and of possible sequence variants. A largely unknown and unexplored feature of fluorescent Sanger sequencing traces is the quantitative information embedded therein. With the growing need for quantifying somatic mutations in tumor tissue, emerging mutations in viral genomes conferring drug resistance, or the amount of methylation at a particular CpG locus, it is desirable to exploit the potential of the quantitative information obtained from sequencing traces.

In this application note we review two freely available software applications that help to extract and present the peak height data of Sanger sequencing traces for quantitative data analysis.

The composite electropherogram and the challenge of mixed basecalling
DNA basecalling software programs analyze fluorescent Sanger sequencing traces and reveal the base identities of a DNA sample along with quality values (phred scores), which indicate the reliability of the basecall. In a typical PCR-based sequencing project that uses DNA from a diploid organism, both copies of an allele are sequenced simultaneously. Compared to the hypothetical signal n resulting from a single allele, the observed signal is actually the result of both alleles combined, or 2n.

An individual peak in a sequencing trace representing a homozygous base is a composite mixture of two identical bases, each contributing approximately half of the fluorescent signal in relative fluorescent units (RFU) to a given peak height (see Figure 1). Hence, the loss (e.g., by amplification drop-out) of one allele will typically lead to a drop of signal by half (illustrated in Figure 2).

Figure 1: Homozygosity. The peak signal is the sum of the peak signals from the two haploid input DNAs. (Panels: "What you sequence", two G alleles with signal n each; "What you see in the electropherogram", one G peak with signal 2n.)

Figure 2: Allele Drop-Out. The peak signal is only approximately half of the signal expected in the case of homozygosity. (Panels: "What you sequence", a single G allele with signal n; "What you see in the electropherogram", one G peak with signal "1n".)

In the case of heterozygosity at an allele, the resulting peak pair migrates at the same or a similar position as a mixed base. The signal strength of each component is approximately half of the homozygous counterpart. Ideally, the two heterozygous peaks appear to be of equal height (see Outcome I in Figure 3), but in reality they may occur somewhat unbalanced (Outcome II or III) depending on the DNA strand sequenced and the sequence-dependent context. This complicates the determination of peak height ratios. However, this imbalance phenomenon is typically highly reproducible for a given allele from sample to sample and can be accounted for using homozygous control samples (see text).

Figure 3: Heterozygosity: sequencing a heterozygous allele may ideally present in an electropherogram as a balanced peak pair (Outcome I) or may appear somewhat imbalanced (Outcome II or III). The specific outcome for a given peak pair is typically highly reproducible and depends on the local sequence context. (Panels: "What you sequence", one G allele and one A allele with signal n each; "What you see in the electropherogram", a mixed base R (G+A) with total signal 2n in Outcomes I, II, and III.)

The simple principle that the proportion of each of two sequence variants in a mixture determines the relative heights of the peaks that represent each variant in a sequence electropherogram has inspired Ian Carr and colleagues from the University of Leeds Institute of Molecular Medicine to develop a software application that exploits the quantitative information embedded in a sequencing trace.

Homozygous and heterozygous sequence variants are readily detected by commercial and public domain sequence analysis software packages. However, minor sequence variants such as those found in somatic mutations in tumor tissue or in emerging mutations in subpopulations of microbial or viral organisms often elude detection because the abundance of the minor allele is too low to trigger a (mixed) basecall.

The heights of the primary and secondary peaks in a mixed-base situation are the most important attributes for basecalling. If the peak height ratio of a secondary to a primary peak drops below 30% (or another user-set threshold), the secondary peak is usually not considered and therefore not called out as a mixed base.

In this application note we will review the paper and the QSVanalyzer software published by Ian Carr et al. from the University of Leeds, UK, and recommend its utility for the detection and quantification of sequence variants.

We also describe a new bioinformatics utility, ab1PeakReporter, which is available on the Life Technologies web site. The utility provides numerical peak height data of Sanger sequencing traces, allowing the quantitative analysis of peak height data. To that end, we show how minor alleles can be quantified by polynomial regression analysis using Microsoft Excel® software.

Inferring Allelic Variant Ratios using QSVanalyzer
In 2009, Carr et al. published a paper describing the QSVanalyzer desktop application in the journal Bioinformatics. QSVanalyzer enables the high-throughput quantification of the proportions of DNA sequences containing single-nucleotide sequence variants (SNVs) from fluorescent Sanger sequencing traces. The paper is open access and can be downloaded with supplementary data from [1]. The QSVanalyzer application, including the original sequencing trace files used in the study, can be downloaded from http://dna.leeds.ac.uk/qsv/ .

In the paper, Carr et al. demonstrated the utility of the method for estimation of copy number proportions (CNPs) for various quantitative sequence variant (QSV) types such as common regular SNPs, paralogous sequence variants (PSV) and SNPs in the background of copy number variation (CNV).

Figure 4: Overview of the QSVanalyzer workflow.
Step 1:
• Move .ab1 files for QSV analysis into a project folder
• Must include homozygous counterparts for (heterozygous) allele(s) of interest
Step 2:
• Open .ab1 file of sample with allele of interest
• Inspect electropherogram and select peak(s) of interest and 5′ reference peaks
Step 3:
• Run Batch command and select project folder as input source
• QSV analysis is executed and a report folder with results is deposited in the project folder
Step 4:
• Open result folder "QSV Data" located in project folder
• Review results
Optional: if calibrator files (dilution series) are included in the data set, use file "QSV_ratios.xls" as source for quantitation analysis by polynomial regression using the Microsoft Excel® software LINEST function (see text and Figure 7).

An important concept presented in the paper is the normalization of electropherograms: fluorescent dideoxynucleotide terminators are incorporated dependent on their sequence context and may appear imbalanced in heterozygous mixed bases (see Figure 3). Further, the amount of template DNA and other factors affect the absolute peak height. Therefore, relative (rather than absolute) peak heights are determined by comparing the variant nucleotide's peak height to that of an invariant nucleotide located 5′ (upstream), where one can assume a neutral sequence background, i.e., no variant-introduced effects. The software also corrects for the background baseline signal in each trace and subtracts the allele-specific "background noise" from the relative peak height for a final normalized peak height (NPH). To calculate the QSV ratio, the program needs two reference sequences, each containing the homozygous allele of one of the two variants.

Figure 5: Electropherogram viewer with a heterozygous allele "S (for C and G)" near scan # 900.

Figure 6: Output reports of the QSVanalyzer application. (A) Widget of the electropherogram accompanied by peak heights of the area. (B) Comprehensive Excel-readable table with raw and reference-adjusted data. (C) Final Quantitative Sequence Variant (QSV) report with adjusted peak heights (see Carr et al. for details).
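
The normalization idea described above can be sketched in a few lines of Python. This is a simplified illustration only and not the QSVanalyzer implementation: the function names are hypothetical, the step that scales by the homozygous control is an assumption made here for the sketch, and all RFU values are invented. The exact procedure is described in Carr et al. [1].

```python
def relative_peak_height(variant_rfu, reference_rfu):
    """Variant peak height relative to an invariant 5' reference peak."""
    if reference_rfu <= 0:
        raise ValueError("reference peak height must be positive")
    return variant_rfu / reference_rfu


def normalized_peak_height(rel_sample, rel_background, rel_homozygous):
    """Subtract allele-specific background, then scale to the homozygous control.

    rel_background: relative height seen for this allele in a control that
                    lacks it (homozygous for the other variant).
    rel_homozygous: relative height seen in a control homozygous for it.
    The scaling step is an assumption of this sketch.
    """
    span = rel_homozygous - rel_background
    if span <= 0:
        raise ValueError("homozygous control must exceed background")
    return (rel_sample - rel_background) / span


def variant_proportion(nph_a, nph_b):
    """Proportion of variant A given the normalized heights of A and B."""
    total = nph_a + nph_b
    if total <= 0:
        raise ValueError("normalized peak heights must sum to a positive value")
    return nph_a / total


# Invented example values (RFU); the 5' reference peak is 1000 RFU in the sample.
rel_c = relative_peak_height(480.0, 1000.0)
rel_g = relative_peak_height(650.0, 1000.0)
nph_c = normalized_peak_height(rel_c, rel_background=0.02, rel_homozygous=1.17)
nph_g = normalized_peak_height(rel_g, rel_background=0.05, rel_homozygous=1.05)
proportion_c = variant_proportion(nph_c, nph_g)
```

With these invented inputs the estimated proportion of the C variant comes out close to 0.4. Working with relative heights means that a sample with more template DNA, and therefore taller peaks overall, gives the same result.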

A detailed, illustrated guide for use of the application can be found at http://dna.leeds.ac.uk/qsv/guide/ and a set of original .ab1 sample files with a differential dilution series is available at http://dna.leeds.ac.uk/qsv/download.

Figure 5 shows the electropherogram viewer with a heterozygous allele "S (for C and G)" near scan # 900 along with the QSV information setup window where this and other alleles of interest are selected for subsequent batch analysis of QSV ratios (step 2 of the workflow).

Figure 6 shows QSVanalyzer results for a mixed DNA sample that contained a pre-mixed ratio of 40% variant A (C nucleotide, blue trace) and 60% variant B (G nucleotide, black trace). QSVanalyzer reported a ratio of 0.4 to 0.6 for A:B (see report 6C).

What is the limit of detection for minor alleles?
The Carr paper provides a web link to sets of original sequencing files from three dilution series, each consisting of 11 samples in triplicate with nominal sequence variant proportions 10:0, 9:1, 8:2, 7:3, 6:4, 5:5, 4:6, 3:7, 2:8, 1:9, and 0:10. We have used these sequencing traces to ask whether it would be possible to detect a minority allele at 10% and distinguish it significantly from background noise. The QSVanalyzer application was used to process the data set "G in CG" provided by the authors, and the output table shown in Figure 6B was opened with Microsoft Excel® software.

To that end we entered the "raw" peak height data for variant B (see Figure 6B, column C) into column A of a new spreadsheet and entered the admixed values of this particular allele (0, 10, 20%, etc.) into column B. Next, we created a scatter plot of the data and used the trend line function in Excel with a polynomial of order 3 for curve fitting. This operation typically yielded a good correlation coefficient (>0.98). We also checked the "Display equation on chart" box, which shows the components of the polynomial in the graph. Next we applied the LINEST function in Excel to solve the polynomial equation so that we can calculate, for a given peak height value (measured as RFU), the corresponding allele proportion (column C).

Figure 7: A simple polynomial equation calculator using the LINEST function in Microsoft Excel® allows the estimation of allele proportion in % (column C) in relation to peak height in RFU (column A).

The required formulas and steps for this are shown in the box in Figure 7. By entering an RFU value into cell A35 the estimated allele proportion is obtained: in the case of 43 RFU it is 5%. Since in this particular experiment and for the given allele the background noise is around 5 RFU (average of 0, 2, 13 RFU), it is conceivable that the peak for a minority allele of 5% proportion is potentially detectable.

Note the excellent correlation of averaged measured values (column D) with the theoretical proportions of allele amount in this dilution series (column B). The peak height of a (hypothetical or real) allele of interest at 43 RFU entered into cell A35 was calculated to correspond to an allele proportion of 5%, which may be distinguishable from background (approximately 5 RFU; see cells A2–A4).

Taken together, quantification of minor alleles in the 5–10% range may be feasible for at least one allele of an allele pair, provided that the experimental system is sufficiently supported with replicates and controlled with calibrator samples. Sequencing and data analysis of the opposite DNA strand may provide further information and resolution.
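
The same kind of calibration can be sketched outside of Excel. The following Python example fits a third-order polynomial by ordinary least squares, which is the calculation that the trend line and LINEST steps perform, and then estimates an allele proportion from a peak height. It is a minimal sketch under stated assumptions: the function names are hypothetical, and the calibrator values are synthetic numbers generated from a known cubic, not the dilution series data of Carr et al.

```python
def fit_polynomial(xs, ys, order=3):
    """Least-squares polynomial fit via the normal equations.

    Returns (coeffs, scale): coefficients (lowest power first) of a
    polynomial in u = x / scale; scaling keeps the system well conditioned.
    """
    scale = max(abs(x) for x in xs) or 1.0
    us = [x / scale for x in xs]
    n = order + 1
    # Normal equations: a[i][j] = sum(u^(i+j)), b[i] = sum(y * u^i)
    a = [[sum(u ** (i + j) for u in us) for j in range(n)] for i in range(n)]
    b = [sum(y * u ** i for u, y in zip(us, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs, scale


def predict(coeffs, scale, x):
    """Evaluate the fitted polynomial at peak height x (RFU)."""
    u = x / scale
    return sum(c * u ** i for i, c in enumerate(coeffs))


# Synthetic calibrators (NOT measured data): peak height in RFU versus
# allele proportion in %, generated from a known cubic for illustration.
calib_rfu = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
calib_percent = [115 * u - 20 * u ** 2 + 5 * u ** 3
                 for u in (x / 1000 for x in calib_rfu)]

coeffs, scale = fit_polynomial(calib_rfu, calib_percent, order=3)
estimate = predict(coeffs, scale, 500)  # proportion estimated for 500 RFU
```

In practice the calibrators would be the measured replicate peak heights of the dilution series, and the quality of the fit should be checked before the curve is used to estimate unknowns.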

The ab1PeakReporter utility provides quantitative information of fluorescent Sanger sequencing peak traces
Numerical data describing the raw and processed sequencing traces are embedded in the .ab1 file but are not readily visible using common sequence analysis software. The architecture of the .ab1 file is described in detail in a white paper [2].

Figure 8: Data from polynomial regression analysis of peak height data of a particular allele containing defined proportions; only values for 0% and 10% are listed (dilution series data provided by Carr et al. 2009). RFU = relative fluorescent units = peak height.

To meet the need for quantitative information from Sanger sequencing traces we have developed a basic utility that reads an .ab1 sequencing file and exports the trace data in various numerical formats.

The ab1PeakReporter utility can be accessed via https://apps.lifetechnologies.com/ab1peakreporter

(Logging in to your Life Technologies customer account is required to use the tool.)

The ab1PeakReporter tool extracts and presents the numerical information from Sanger sequencing traces into an Excel-readable file so that base peak characteristics (reflecting, e.g., allele proportions) can be studied quantitatively using downstream software such as spreadsheet processors.

Figure 9: The ab1PeakReporter workflow. (Steps shown: send the .ab1 file, e.g., "My Sanger Sequence File.ab1", to the cloud; receive the output file; extract the .zip file to .csv and open it in Excel; analyze the numerical sequencing data in Excel or another application.)

A batch of up to 96 .ab1 sequencing files can be uploaded into the ab1PeakReporter tool and processed, then exported as a zip file back to a local drive. The zip file is extracted and opened as an Excel-readable .csv file.

The tool is very simple to use:

1. Browse for .ab1 sample files (up to 96) to upload.
2. (Optional) Enter a value between 0 and 100 to set a detection threshold for a secondary peak; a default value of "0" will detect all peaks including background; a value of "5" will detect and list all secondary peaks over 5% in addition to the primary peak.
3. Select the sample files to analyze. Up to 96 files can be processed at a time.
4. Click the Start button.
5. You will be prompted to open or save a zip folder with the analyzed results to a location on a local computer drive.
6. Extracting the zip folder will yield .csv files that can be opened using Microsoft Excel® or other spreadsheet processing software.
7. Use the data for customized downstream analysis such as the determination of allele ratios (e.g., methylated vs. un-methylated CpG residues in bisulfite-converted DNA) or quantification of minor alleles.

Figure 10: User interface of the .ab1Tracer application.

The (your sample name here)_Alldata.csv file (Figure 11) lists all peak height values of all 4 nucleotide traces at all scans along with primary base identification at the location of the amplitude. Rows 1–16 contain a header with basic sample file and run information.

Below row 16 the following components are listed:
Column A: the scan number
Column B: primary peak as identified by KB™ Basecaller
Columns C–F: continuous peak heights for nucleotide traces G, A, T, C, respectively
Column G: the phred Quality value

Figure 11: The comprehensive ab1PeakReporter AllData Table (_Alldata.csv file).

Re-create the electropherogram plot in Excel
The electropherogram can be generated using the line graph plot function in Excel by selecting cells A–F or B–F, then going to tab "Insert" and selecting "Chart > Line" (Figure 12). The electropherogram aids in visual interpretation of ambiguous loci, assisting in distinguishing genuine peaks from background noise or artifacts.
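
For readers who prefer a script to a spreadsheet, the column layout described above can be read with Python's standard csv module. This is a hedged sketch, not a documented interface of the tool: it assumes that the first 16 rows are header rows, that columns A–G are laid out exactly as listed above, and that uncalled scans carry a dash in the basecall column. The function names are hypothetical.

```python
import csv
import io

HEADER_ROWS = 16                  # assumption: rows 1-16 are header rows
CHANNELS = ("G", "A", "T", "C")   # order of columns C-F as described in the text


def parse_alldata(csv_text):
    """Parse the text of an _Alldata.csv file into a list of per-scan records."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    records = []
    for row in rows[HEADER_ROWS:]:
        if len(row) < 7:
            continue  # skip short or empty lines
        try:
            scan = int(row[0])
            heights = {ch: float(v) for ch, v in zip(CHANNELS, row[2:6])}
        except ValueError:
            continue  # skip rows that do not hold numeric trace data
        qv = row[6].strip()
        records.append({
            "scan": scan,
            "base": row[1].strip(),
            "heights": heights,
            "qv": int(qv) if qv.isdigit() else None,
        })
    return records


def basecalled_only(records):
    """Keep only scans that carry a basecall (drops dash or empty entries)."""
    return [r for r in records if r["base"] not in ("", "-", "\u2013")]
```

A typical use would be `records = parse_alldata(open(path, newline="").read())`, where `path` points to an extracted .csv file, followed by `basecalled_only(records)` to obtain one record per called base.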

Applying filters aids in data exploration
The next step is to filter out the uncalled scans; this will enable customized display of data and is a great aid in exploring the data. To set filters click on row 16, go to tab "Data" and select "Filter" (Figure 13).

The Filter tool allows selective display of data by (un-)checking individual data points, sorting, and various ranges and rank formats in its "Number Filters" section.

Figure 12: Creating line plots of electropherograms. 1) Select cells A–F or B–F, 2) go to tab "Insert" and 3) select Line in the Chart section.

Figure 13: Applying Filters to the data. 1) Click on row 16, 2) go to tab "Data" and 3) select Filter.

Condensing the table to basecalled data only
Using the Filter tool, the table can be condensed to display only the basecalled data points. This feature is useful for transferring data into a database for subsequent archiving, and further exploratory or statistical analysis.

To condense the data table to basecalled data points only, 1) click the Filter icon in column B, "BaseCall", 2) uncheck the "–" box, 3) click the "OK" button (Figure 14).

Find loci of interest with "Sequence Window"
To facilitate identification of an allele (nucleotide) of interest (e.g., a SNP) in the table, the "Sequence Window" can be used to display each base in the center of a string of 7 nucleotides (Figure 15). Use the 7-base string as input in Excel's "Search" and "Find" functions. Use a * character if the base in the middle of the string is unknown or N, Y, R, etc. This string of 7 nucleotides can also be used in Sequence Analysis or Sequence Scanner software to readily find a peak of interest.
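
The Sequence Window lookup can be mimicked in a short script. The sketch below builds the 7-nucleotide window around each base and searches for a pattern in which * stands for one unknown base; the function names are hypothetical, and treating * as a single-position wildcard is an assumption that matches the use described above (Excel's own * wildcard can match longer runs).

```python
def sequence_windows(bases, flank=3):
    """Return (center_index, window) for every base with full flanks on both sides."""
    return [(i, bases[i - flank:i + flank + 1])
            for i in range(flank, len(bases) - flank)]


def find_windows(bases, pattern, wildcard="*"):
    """Indices of bases whose surrounding window matches the pattern.

    The pattern must have odd length; the wildcard matches any single base.
    """
    if len(pattern) % 2 == 0:
        raise ValueError("pattern length must be odd so it has a center base")
    flank = len(pattern) // 2
    hits = []
    for center, window in sequence_windows(bases, flank):
        if all(p == wildcard or p == b for p, b in zip(pattern, window)):
            hits.append(center)
    return hits


# Illustrative basecall string containing the window ACCTGTA
basecalls = "GGACCTGTACC"
hits = find_windows(basecalls, "ACC*GTA")
```

The returned indices refer to positions in the basecall string; combined with a table of basecalled rows they point to the scan number of the base of interest.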

Ratio of maximal signals in a 7-scan window
In columns I–L (Figure 15) the peak height ratios in a 7-scan window between the maximal signal of the primary (i.e., basecalled) peak and the maximal signal of either G, A, T, or C, respectively, are shown. What exactly are these numbers?

The peak height ratio is calculated from the maximum of the heights measured at the scan of the amplitude and the 3 scans upstream and downstream (hence 7-scan) of that particular location.

Figure 14: Condensing the data table to basecalled data points only.

Figure 15: The SequenceWindow lists 7-nucleotide strings (e.g., ACCTGTA), and can be used to facilitate finding a base of interest.

Figure 16 shows an example of the MaxSig7Scan Ratio calculation: at scan location 799 a peak was called "N"; the highest peak was an "A" trace with 555 RFU, followed by a "T" trace with 438 RFU and C (14 RFU) and G (12 RFU) in a 7-scan window (highlighted in yellow) which is centered at the peak location and extends for 3 more scans on either side (3+1+3 = 7 scans). The ratio is calculated by dividing the peak height of the "primary" (i.e., highest) peak by the basecalled or highest peak height of either peak trace (G in column I, A in column J, T in column K, and C in column L) in the 7-scan window. One caveat: in traces with sub-optimal spacing or mobility overlap between adjacent bases it is possible that the trailing or leading slope of a legitimate adjacent peak is considered in this calculation, which may produce an artificially higher ratio.

Figure 16: Ratios of maximal signal in a 7-scan window.

Ratio of signal in a 1-scan window at the basecall location
In columns M–P the peak height ratios in a 1-scan window between the signal of the primary (i.e., basecalled) peak and the signal of either G, A, T, or C, respectively, are shown (Figure 17).
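
The two window measurements can be illustrated with a small Python sketch. The trace values below are invented so that the 7-scan maxima reproduce the numbers quoted for scan 799 (A 555, T 438, C 14, G 12 RFU). The function names are hypothetical, and the direction of the ratio (each trace divided by the primary peak, giving values between 0 and 1) is an assumption of this sketch; the columns written by ab1PeakReporter may be defined differently.

```python
CHANNELS = ("G", "A", "T", "C")


def window_max(trace, center, half_window):
    """Maximum signal within center +/- half_window scans (clipped at the ends)."""
    lo = max(0, center - half_window)
    hi = min(len(trace), center + half_window + 1)
    return max(trace[lo:hi])


def peak_ratios(traces, center, half_window=3):
    """Return (primary_channel, ratios) with ratios relative to the primary peak.

    half_window=3 gives the 7-scan window; half_window=0 gives the 1-scan window.
    """
    maxima = {ch: window_max(traces[ch], center, half_window) for ch in CHANNELS}
    primary = max(maxima, key=maxima.get)
    if maxima[primary] <= 0:
        raise ValueError("no positive signal in window")
    ratios = {ch: maxima[ch] / maxima[primary] for ch in CHANNELS}
    return primary, ratios


def secondary_peaks(ratios, primary, threshold_percent=0.0):
    """Channels other than the primary whose ratio exceeds the threshold (in %)."""
    return [ch for ch in CHANNELS
            if ch != primary and ratios[ch] * 100 > threshold_percent]


# Invented trace segment (RFU); index 4 plays the role of the basecall scan.
traces = {
    "A": [100, 300, 480, 540, 555, 530, 470, 290, 90],
    "T": [80, 200, 350, 438, 430, 400, 300, 180, 70],
    "C": [5, 9, 12, 14, 13, 10, 8, 6, 4],
    "G": [4, 8, 10, 11, 12, 9, 7, 5, 3],
}
primary7, ratios7 = peak_ratios(traces, center=4, half_window=3)
primary1, ratios1 = peak_ratios(traces, center=4, half_window=0)
above_5_percent = secondary_peaks(ratios7, primary7, threshold_percent=5.0)
```

In this invented example the T trace peaks one scan before the basecall location, so the 1-scan ratio for T is slightly lower than the 7-scan ratio, which illustrates why the two measurements can differ.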
This ratio measurement is taken in a narrow window of one scan only. It may miss the amplitude of a secondary peak (a peak under the primary peak) if it lies outside this window. Therefore, both ratio measurements (7-scan and 1-scan) should be considered and possibly combined (averaged), if necessary or warranted.

Figure 17: The MaxSigAtPeak peak height ratio is determined by dividing the peak height of the primary peak (highest peak) by peak heights of either peak trace at the location (scan number) of the basecall.

Data from the KB Basecaller (v1.4.1 and higher)
Columns Q, R, S, T are populated with the amplitude and sequence output data from the KB™ Basecaller (Figure 18). Note that the amplitudes of the primary and secondary peak may differ from the original signal strength (RFU) shown in columns C–F. This is due to the mobility and other noise correction of the trace data during the basecalling process.

Column G lists the Quality value (phd or phred score) of the basecall (Figure 19).

Figure 18: Amplitudes and basecalls of primary and secondary peak as determined by KB™ Basecaller.

Quality Values
The QV is a per-base estimate of the KB Basecaller accuracy.

The QVs are calibrated on a scale corresponding to:

QV = -10 × log10(Pe)

where Pe is the probability of error.
The KB Basecaller generates QVs from 1 to 99.

Quality Value    Probability the basecall is incorrect
10               10%
20               1%
30               0.1%
40               0.01%
50               0.001%

• Typical high-quality pure bases have QVs between 20–50
• Typical high-quality mixed bases have QVs between 10–20

Figure 19: The Quality values indicate the probability of an incorrect basecall of primary peak.
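
The relationship between QV and error probability is easy to check numerically. The following short Python sketch simply restates the formula above; the function names are chosen for illustration.

```python
import math


def error_probability(qv):
    """Probability that a basecall is wrong, from QV = -10 * log10(Pe)."""
    return 10 ** (-qv / 10)


def quality_value(pe):
    """Quality value for a given error probability Pe (0 < Pe <= 1)."""
    if not 0 < pe <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(pe)
```

For example, `error_probability(20)` corresponds to the 1% row of the table, and `quality_value(0.001)` gives a QV of about 30.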
Measuring allele proportions by peak height ratios
To demonstrate the utility of the tool we have prepared genomic DNA mixtures of normal and mutant TP53 gene (exon 11) at various proportions and determined the peak height ratios between minor and major allele using the ab1PeakReporter tool. Figure 20 shows that in this particular allele situation the peak height ratios obtained from both channels (1-scan window or 7-scan window) correlated quite well up to 15%. A 5% level of mutant allele was clearly distinguishable from 0% (normal control; Figure 21).

Figure 20: Peak height ratios.

Figure 21: Sequencing electropherograms of DNA mixtures prepared at various ratios of wt and mutant allele "697" in exon 11 of the human p53 gene as viewed in Sequence Scanner software. Note that the mutant allele was "called" as "S" at the 25% and 50% level but not below these ratios using the KB Basecaller. (Panels: 0%, 2.5%, 5%, 7.5%, 10%, 15%, 25%, 50%.)
Table 1: Summary of features available in the QSVanalyzer application and the ab1PeakReporter tool.

Number of alleles
  QSVanalyzer: Limited to predefined positions
  ab1PeakReporter: All bases in trace file

Number of .ab1 files that can be analyzed
  QSVanalyzer: Multiple (maximum # not specified). QSV analysis requires presence of homozygous controls for either variant
  ab1PeakReporter: 96 (maximum upload per processing)

Table of peak height data of primary and secondary peaks
  QSVanalyzer: Yes (see Figure 6, columns B and C)
  ab1PeakReporter: Yes (requires that .ab1 file is analyzed with KB™ Basecaller v1.4)

Compatible data files
  QSVanalyzer: .ab1, .scf
  ab1PeakReporter: .ab1

Peak traces displayed
  QSVanalyzer: Yes, in comprehensive HTML report and in separate window as .png file
  ab1PeakReporter: No, but can be created as a line graph in Excel using .csv file with complete data points

Output reports
  QSVanalyzer: Folder with trace data, comprehensive QSV report in HTML and table (.xls) with raw and normalized peak heights and ratios
  ab1PeakReporter: Zip folder with .csv file that opens in Microsoft Excel®

Suitable for quantitative assessment of SNPs, paralogous variant analysis and copy number variants
  QSVanalyzer: Yes (see Carr et al. 2009 for details)
  ab1PeakReporter: Delivers peak height data and peak height ratios for customized downstream analysis

Suitable for methylation analysis (sequencing of bisulfite-converted DNA)?
  QSVanalyzer: Can potentially provide allele ratios CpG to TpG (UpG). Delivers normalized peak height data for customized downstream analysis
  ab1PeakReporter: Delivers peak height data for customized downstream analysis

Suitable for minor allele quantification (somatic or emerging mutations)?
  QSVanalyzer: Possible when used with appropriate calibrator controls, replicates and customized data analysis (polynomial regression), see Figure 7
  ab1PeakReporter: Delivers peak height data for customized downstream analysis

Conclusions
This application note shows tools and methods for extracting and using peak height data from fluorescent Sanger sequencing traces for determination of allele ratios or allele quantification. Table 1 summarizes the features of the two software applications presented.

References
[1] Carr IM, Robinson JI, Dimitriou R et al. (2009) Bioinformatics 25(24):3244–3250. http://bioinformatics.oxfordjournals.org/content/25/24/3244.long

[2] White paper: Applied Biosystems Genetic Analysis Data File Format. http://www6.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf
Find out more at lifetechnologies.com
For Research Use Only. Not for use in diagnostic procedures. ©2013 Life Technologies Corporation. All rights reserved. The trademarks
mentioned herein are the property of Life Technologies Corporation and/or its affiliate(s) or their respective owners. Excel is a registered
trademark of Microsoft Corporation. CO07793 1113
