Bioinformatics Tools for Detection and Clinical Interpretation of Genomic Variations
Contributors
Xianquan Zhan, Tian Zhou, Tingting Cheng, Miaolong Lu, Gaston K. Mazandu, Emile R. Chimusa,
Ephifania Geza, Milaine Seuneu, Juliano Lino Ferreira, Leila Ferreira, Thelma Sáfadi, Tesfahun Alemu
Setotaw, Osman Ugur Sezerman, Kok-Siong Poon, Evelyn Siew-Chuan Koay, Julian Wei-Tze Tang
Individual chapters of this publication are distributed under the terms of the Creative Commons
Attribution 3.0 Unported License which permits commercial use, distribution and reproduction of
the individual chapters, provided the original author(s) and source publication are appropriately
acknowledged. If so indicated, certain images may not be included under the Creative Commons
license. In such cases users will need to obtain permission from the license holder to reproduce
the material. More details and guidelines concerning content reuse and adaptation can be found at
http://www.intechopen.com/copyright-policy.html.
Notice
Statements and opinions expressed in the chapters are those of the individual contributors and not
necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of
information contained in the published chapters. The publisher assumes no responsibility for any
damage or injury to persons or property arising out of the use of any materials, instructions, methods
or ideas contained in the book.
Preface
Chapter 1
The Bioinformatics Tools for Discovery of Genetic Diversity by Means
of Elastic Net and Hurst Exponent
by Leila Maria Ferreira, Thelma Sáfadi, Tesfahun Alemu Setotaw
and Juliano Lino Ferreira
Chapter 2
Bioinformatics Workflows for Genomic Variant Discovery, Interpretation
and Prioritization
by Osman Ugur Sezerman, Ege Ulgen, Nogayhan Seymen
and Ilknur Melis Durasi
Chapter 3
Orienting Future Trends in Local Ancestry Deconvolution Models
to Optimally Decipher Admixed Individual Genome Variations
by Gaston K. Mazandu, Ephifania Geza, Milaine Seuneu
and Emile R. Chimusa
Chapter 4
Recognition of Multiomics-Based Molecule-Pattern Biomarker for
Precise Prediction, Diagnosis, and Prognostic Assessment in Cancer
by Xianquan Zhan, Tian Zhou, Tingting Cheng and Miaolong Lu
Chapter 5
HCV Genotyping with Concurrent Profiling of Resistance-Associated
Variants by NGS Analysis
by Kok-Siong Poon, Julian Wei-Tze Tang and Evelyn Siew-Chuan Koay
Preface
Genomic variations are the basis for phenotypic variations of individual organisms
of the same species. These phenotypic variations could be of clinical importance in
humans and medically relevant organisms. Therefore, detection of genomic variations
and interpretation of their phenotypic effects and pathogenic potential have
become a growing field in both biomedical research and clinical medicine.
Chapter 3 discusses local ancestry deconvolution and dating admixture events and
the possible gaps in the knowledge that lead to the current challenges. Chapter 4
addresses the value of multiomics-based molecular patterns and the concept of
pattern recognition and pattern biomarkers in cancer diagnosis and prognosis. It
also explores the application of these concepts in personalized medicine. Chapter 5
addresses the genetic diversity of the hepatitis C virus and discusses its genotyping
and concurrent variant profiling, as identification of resistance-associated variants
of this virus determines the choice of anti-viral regimes in infected patients.
We would like to thank all the authors for their contributions and their time in
preparing this valuable collection. We would also like to extend our thanks to Mr. Luka
Cvjetković for his great help during the editing of this book and to IntechOpen for
their commitment and support.
Abstract
1. Introduction
Bioinformatics Tools for Detection and Clinical Interpretation of Genomic Variations
2. Wavelet
The Bioinformatics Tools for Discovery of Genetic Diversity by Means of Elastic Net and Hurst…
DOI: http://dx.doi.org/10.5772/intechopen.82755
3. Wavelet transform
Wavelet analysis has emerged as a useful tool for spectral analysis owing to its
time-frequency localization, which makes it suitable for complex and non-stationary
signals. The wavelet transform has contributed significantly to the study of many
processes/signals in virtually all areas of earth science [4].
A wavelet is a mathematical function. To qualify as a wavelet, the total area under
its curve must equal zero, and its energy must be finite (the function must be regular
and well localized). A further practical requirement is that the wavelet transform and
its inverse can be computed quickly and easily.
Application areas of wavelets include computer vision, data compression,
fingerprint compression at the FBI, recovery of data affected by noise, detection of
similar behavior, musical tones, astronomy, meteorology, numerical image
processing, and many others.
The wavelet transform maps a function defined in the time domain into
another function defined in both the time domain and the frequency domain. It is defined as

$$W(a, b) = \int_{-\infty}^{\infty} f(t)\, \frac{1}{\sqrt{|a|}}\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt, \tag{1}$$

where ψ* denotes the complex conjugate of ψ. Defining

$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\!\left(\frac{t-b}{a}\right), \tag{2}$$

we may rewrite the transform as the inner product of the functions f(t)
and ψ_{a,b}(t):

$$W(a, b) = \langle f(t), \psi_{a,b}(t) \rangle = \int_{-\infty}^{\infty} f(t)\, \psi_{a,b}^{*}(t)\, dt. \tag{3}$$
The function ψ(t), which equals ψ_{1,0}(t), is called the mother wavelet, while the
other functions ψ_{a,b}(t) are called daughter wavelets. The parameter b indicates that
the function ψ(t) has been translated along the t axis by a distance b; it is
therefore the translation parameter. The parameter a changes the scale, dilating (if
a > 1) or compressing (if a < 1) the wavelet formed by the function. Consequently, the
parameter a is known as the scaling parameter.
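As a rough numerical illustration of the transform defined above, the following sketch approximates W(a, b) by a Riemann sum in plain Python. The Mexican hat mother wavelet, the test signal, and the function names are assumptions chosen for illustration only; they are not taken from the chapter.

```python
import math

def mexican_hat(t):
    """Real-valued Mexican hat mother wavelet (unnormalized); real, so psi* = psi."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def cwt_point(signal, dt, a, b, psi=mexican_hat):
    """Riemann-sum approximation of W(a, b) = integral of f(t) |a|^(-1/2) psi*((t - b)/a) dt."""
    total = 0.0
    for k, f_t in enumerate(signal):
        t = k * dt                      # sample time of the k-th observation
        total += f_t * psi((t - b) / a) # translated and scaled wavelet
    return total * dt / math.sqrt(abs(a))

# Hypothetical test signal: a single smooth bump centred at t = 5.
dt = 0.01
signal = [math.exp(-((k * dt - 5.0) ** 2)) for k in range(1000)]

w_at_bump = cwt_point(signal, dt, a=1.0, b=5.0)   # wavelet aligned with the bump
w_far_away = cwt_point(signal, dt, a=1.0, b=1.0)  # wavelet translated away from it

# A constant signal yields a coefficient near zero because the wavelet has zero area.
w_constant = cwt_point([1.0] * 1000, dt, a=1.0, b=5.0)
```

Varying b slides the wavelet along the signal, while varying a stretches or compresses it, which is how the coefficient responds to features at different positions and scales.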
4. Wavelet analysis
There are many types of wavelet transform. Depending on the application, one may
be preferable to the others. Wavelet analysis is carried out by successively applying
the wavelet transform with several values of the parameters a and b, which represents
the decomposition of the signal into components localized in time and frequency
according to these parameters. Each wavelet has better or worse localization in the
frequency and time domains, so the analysis can be done with the wavelet best suited to
the desired result. Wavelet analysis is thus an analysis in which the resolution level
is set by the index a.
Discrete wavelets: among them are the Daubechies wavelet, the Cohen-Daubechies-Feauveau
wavelet (occasionally referred to as CDF N/P or Daubechies biorthogonal wavelets),
Beylkin [5], BNC wavelets, Coiflet, Mathieu wavelet, Haar
wavelet, binomial-QMF, Villasenor wavelet, Legendre wavelet, and symlet.
Continuous wavelets: (1) the real-valued wavelets are the Mexican hat wavelet,
Hermitian wavelet, beta wavelet, Hermitian hat wavelet, and Shannon wavelet, and
(2) the complex-valued wavelets are the Shannon wavelet, Morlet wavelet, complex
Mexican hat wavelet, and modified Morlet wavelet.
In recent decades, research using wavelet methods has grown steadily. One of the
great advantages of this method is its computational efficiency, that is, the
analyses are carried out virtually in real time. The method is applicable in numerous
areas of science, such as physics, mathematics, engineering, and genetics, among others.
The wavelet transform is a method for viewing and characterizing a signal.
Mathematically, it is represented by a function oscillating in time or space.
Characteristically, it uses sliding windows that expand or compress to capture low- and
high-frequency signals, respectively [2]. It originated in the field of seismic study,
where it was used to describe the disturbances arising from a seismic impulse [6].
Among the wavelet techniques, we have the discrete non-decimated wavelet
transform (NDWT), whose main characteristic is that it can work with signals/sequences
of any length.
In this procedure, the coefficients are translation invariant, that is, the choice
of origin is irrelevant since all the observations are used in the analysis,
a condition that does not hold for the discrete decimated wavelet transform
(DWT).
Recently, discrete wavelet transforms have been used to locate genes in
genome sequences [7], to find long-range correlations, to find periodicities
in DNA sequences [8], and to analyze G + C patterns [9].
Cluster analysis is often used to handle DNA sequences efficiently.
A wavelet-based feature vector model was proposed for clustering DNA
sequences [10].
The distinguishing feature of the NDWT is that it keeps the same amount of data in the
even and odd decimations at each scale and continues to do so at each subsequent
scale. Let D0 denote the even (dyadic) decimation, D1 the odd decimation, H the
high-pass filter, and L the low-pass filter. Consider, for example, an input vector
(y1, …, yn). Apply and keep both D0Hy and D1Hy, the even- and odd-indexed observations
of the wavelet-filtered sequence. Each of these sequences has length n/2. Consequently,
the total number of wavelet coefficients from both decimations at the finest scale is
2 × n/2 = n [11].
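The bookkeeping described above can be sketched for a single scale as follows. This is a minimal illustration, assuming Haar filters and periodic boundary handling (both are assumptions made here, not choices stated in the chapter); the function names are hypothetical.

```python
import math

# Haar analysis filters, used here only as the simplest possible example.
LOW = [1 / math.sqrt(2), 1 / math.sqrt(2)]    # L: low-pass filter
HIGH = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # H: high-pass filter

def circular_filter(y, taps):
    """Filter y with periodic boundary handling; the output has the same length as y."""
    n = len(y)
    return [sum(c * y[(i + k) % n] for k, c in enumerate(taps)) for i in range(n)]

def ndwt_level(y):
    """One NDWT scale: keep both the even (D0) and odd (D1) decimations of Hy."""
    hy = circular_filter(y, HIGH)
    d0_hy = hy[0::2]                  # even-indexed filtered observations
    d1_hy = hy[1::2]                  # odd-indexed filtered observations
    ly = circular_filter(y, LOW)      # undecimated smooth part (Ly)
    return d0_hy, d1_hy, ly

y = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # hypothetical input vector, n = 8
d0_hy, d1_hy, ly = ndwt_level(y)

# Shifting the input by one sample only swaps the roles of D0 and D1,
# which is the translation invariance mentioned in the text.
shifted = y[1:] + y[:1]
s0_hy, s1_hy, _ = ndwt_level(shifted)
```

Each decimation holds n/2 coefficients, so keeping both gives n coefficients at this scale, whereas a decimated DWT would keep only `d0_hy`.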
5. GC content
6. Daubechies wavelet
Ingrid Daubechies is a Belgian physicist and mathematician. Daubechies was the first
woman to be president of the International Mathematical Union (2011–2014). She is best
known for her work with wavelets in image compression.
Daubechies received the Louis Empain Prize for Physics in 1984, awarded once
every 5 years to a Belgian scientist on the basis of work done before the age of 29.
Between 1992 and 1997, she was a fellow of the MacArthur Foundation,
and in 1993 she was elected to the American Academy of Arts and
Sciences. In 1994, she received the American Mathematical Society Steele Prize for
Exposition for her book Ten Lectures on Wavelets and was invited to give a
plenary lecture in Zurich at the International Congress of Mathematicians. In 1997, she
was awarded the AMS Ruth Lyttle Satter Prize (see
http://www.ams.org/profession/prizes-awards/pabrowse#year=1997). In 1998, she was
elected to the United States National Academy of Sciences (see
http://nas.nasonline.org/site/Dir/1753239219?pg=vprof&mbr=1001102&returl=http%3A%2F%2Fwww.nasonline.org%2Fsite%2FDir%2F1753239219%3Fpg%3Dsrch%26view%3Dbasic&retmk=search_again_link)
and received the Golden Jubilee Award for Technological Innovation from the IEEE
Information Theory Society
(https://www.itsoc.org/honors/golden-jubilee-awards-for-technological-innovation).
She became a foreign member of the Royal Netherlands Academy of Arts and Sciences in
1999 (see https://www.knaw.nl/en/members/foreign-members/4013).
In 2000, Daubechies became the first woman to receive the National
Academy of Sciences Award in Mathematics, presented every 4 years for excellence
in published mathematical research. The prize honored her for fundamental
discoveries on wavelets and wavelet expansions and for her role in
making wavelet methods a practical basic tool of applied mathematics.
This achievement is presented at https://www.knaw.nl/en/members/foreign-members/4013.
She was also awarded the Basic Research Award of the German
Eduard Rhein Foundation (see
https://web.archive.org/web/20110718233021/http://www.eduard-rhein-stiftung.de/html/Preistraeger_e.html
and
https://web.archive.org/web/20110718234059/http://www.eduard-rhein-stiftung.de/html/2000/G00_e.html)
and the NAS Prize in Mathematics (see
https://web.archive.org/web/20101229195210/http://www.nasonline.org/site/PageServer?pagename=AWARDS_mathematics).
In general, the Daubechies wavelets are chosen to have the highest number A of
vanishing moments (this does not imply the best smoothness) for a given support
width 2A − 1 [3]. Two naming schemes are in use: DN, which refers to the length or
number of taps, and dbA, which refers to the number of vanishing moments. Thus db2 and
D4 are the same wavelet transform.
Among the 2^(A−1) possible solutions of the algebraic equations for the
moment and orthogonality conditions, the one chosen is that whose scaling filter has
extremal phase. The wavelet transform is also easy to put into practice
using the fast wavelet transform. Daubechies wavelets are widely used
to solve a broad range of problems, for example, self-similarity properties of a
signal or fractal problems and signal discontinuities, among others.
Daubechies wavelets are not defined in terms of the resulting
scaling and wavelet functions; in fact, these functions cannot be written down
in closed form.
In this construction, the scaling sequence (low-pass filter) and the
wavelet sequence (band-pass filter) are normalized so that the scaling sequence has
sum equal to 2 and both sequences have sum of squares equal to 2. In some applications,
they are instead normalized so that the scaling sequence has sum √2; in that case both
sequences and all shifts of them by an even number of coefficients are orthonormal to
each other.
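These properties can be checked numerically for the shortest non-trivial member of the family. The sketch below uses the well-known closed-form D4 (db2) scaling coefficients in the √2 normalization and derives the wavelet sequence with the usual alternating-flip relation; variable names are illustrative.

```python
import math

s3 = math.sqrt(3.0)

# D4 (db2) scaling sequence, normalized so that its sum is sqrt(2).
h = [c / (4.0 * math.sqrt(2.0)) for c in (1 + s3, 3 + s3, 3 - s3, 1 - s3)]

# Wavelet sequence from the alternating-flip relation g[k] = (-1)^k * h[N-1-k].
g = [((-1) ** k) * h[len(h) - 1 - k] for k in range(len(h))]

sum_h = sum(h)                                   # approximately sqrt(2)
sum_sq_h = sum(c * c for c in h)                 # approximately 1
shift2_h = h[0] * h[2] + h[1] * h[3]             # approximately 0: orthogonal to its shift by 2
cross_hg = sum(a * b for a, b in zip(h, g))      # approximately 0: h and g are orthogonal
moment0 = sum(g)                                 # approximately 0: zeroth moment vanishes
moment1 = sum(k * c for k, c in enumerate(g))    # approximately 0: first moment vanishes (A = 2)
```

The two vanishing moments of `g` correspond to A = 2, and the filter length of 4 taps explains why the same wavelet is called both db2 and D4.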
8. Scalogram
9. Cluster analysis
Figure 1.
Elastic net standard scheme.
genomic selection methods, reducing the penalties of each approach as in the elastic
net, which enables the fitting of a given statistical model. The elastic net is
therefore an excellent methodology for genome analysis and has been used in several
studies, such as [33–36].
Recently, differences among tuberculosis strains were evaluated using the elastic
net [34]. In that study, 10 genome sequences of Mycobacterium
tuberculosis were assessed with a window size of 10,000 bp, combining the NDWT
and the elastic net. The study covered 10 strains: 2 drug resistant,
6 drug susceptible, 1 multidrug resistant, and 1 extensively
drug resistant. The clustering obtained in that analysis proved to be quite
adequate.
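To make the idea of a fixed window size concrete, the sketch below turns a nucleotide sequence into a numeric signal by computing the GC fraction in consecutive non-overlapping windows. That GC content is the windowed quantity is an assumption made here for illustration; the function name and the toy sequence are hypothetical, and a real analysis would use a window such as 10,000 bp.

```python
def gc_content_windows(sequence, window_size):
    """Fraction of G and C bases in consecutive non-overlapping windows.

    A trailing partial window, if any, is ignored.
    """
    seq = sequence.upper()
    values = []
    for start in range(0, len(seq) - window_size + 1, window_size):
        window = seq[start:start + window_size]
        gc = window.count("G") + window.count("C")
        values.append(gc / window_size)
    return values

# Toy sequence of 28 bases with a window of 4 bases (illustrative values only).
toy = "ATGCGCGCATATATATGGCCGGCCATAT"
signal = gc_content_windows(toy, window_size=4)
```

The resulting list of window values is the kind of one-dimensional signal to which a wavelet transform can then be applied.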
Figure 2.
Hurst exponent pattern interpretation of the index value.
12. Conclusion
We strongly believe that exploring the genetic variability of any organism using
wavelets coupled with the elastic net and/or the Hurst exponent is a valuable
and interesting approach. It is not difficult, and it can easily be implemented in
the free R software. Used in this way, it lends reliability and robustness to the
results. These bioinformatics tools therefore offer more possibilities for examining
the genetic divergence of living organisms.
Conflict of interest
Author details
© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References
[1] Crowley PM. A guide to wavelets for economists. Journal of Economic
Surveys. 2007;21:207-267. DOI: 10.1111/j.1467-6419.2006.00502.x
[2] Percival DB, Walden AT. Wavelet Methods for Time Series Analysis. 1st ed.
Cambridge: Cambridge University Press; 2000. 594 p. DOI: 10.1017/CBO9780511841040
[7] Ning J, Moore CN, Nelson JC. Preliminary wavelet analysis of genomic
sequences. In: Proceedings of the IEEE Computer Society Conference
on Bioinformatics CSB ’03. Stanford, California: IEEE; 2003. pp. 509-510
[10] Bao J, Yuan RY. A wavelet-based feature vector model for DNA
clustering. Genetics and Molecular Research. 2015;14:19163-19172. DOI:
10.4238/2015.December.29.26
[11] Nason G. Wavelet Methods in Statistics with R. 1st ed. New York:
Springer-Verlag; 2008. 259 p. DOI: 10.1007/978-0-387-75961-6
[15] Tryon RC. Cluster Analysis: Correlation Profile and Orthometric
(Factor) Analysis for the Isolation of Unities in Mind and Personality. 1st ed.
Ann Arbor: Edwards Brothers; 1939. 122 p
Chapter 2
Abstract
1. Introduction
Whole genome sequencing (WGS) and whole exome sequencing (WES) are next-generation
sequencing (NGS) technologies that determine the full and protein-coding
genomic sequence of an organism, respectively. Deep sequencing of
genomes improves understanding of clinical interpretation of genomic variations.
Analyzing NGS data with the aim of understanding the impact and the importance
of genomic variations in health and disease conditions is crucial for taking
personalized medicine applications one step further.
One of the main obstacles to reaching the full potential of WES/WGS in
personalized medicine is bioinformatics analysis, which mostly requires strong
computational power. Analysis of WES/WGS data with publicly or commercially
available algorithms and tools requires a proper computational infrastructure in
addition to at least a basic understanding of NGS technologies. Second, almost all
publicly available algorithms and tools focus on a single aspect of the entire process
and do not provide a workflow that can aid the researcher from start to finish.
Lastly, there are no gold standards for translating WES/WGS into clinical knowl-
edge, since different diseases need different strategies for the basic analysis to
obtain the genomic variants as well as further analyses, including disease-specific
interpretation and prioritization of the variants.
A comprehensive workflow that can be applied for WES/WGS data analysis is
composed of the following steps:
a. Quality control
b. Sequence alignment
c. Post-alignment processing
d. Variant discovery
e. Downstream analyses
Figure 1.
An example single-sample variant discovery workflow. Each step is labelled in the black rectangles. The most
widely used tools for each operation are also presented.
Bioinformatics Workflows for Genomic Variant Discovery, Interpretation and Prioritization
DOI: http://dx.doi.org/10.5772/intechopen.85524
Detection of genomic variations beginning from raw read data is a multistep task
that can be executed using numerous tools and resources. The workflow outlined in
the introduction section is laid out in detail in this section, including the best
practice recommendations and common pitfalls.
The raw data from a sequencing machine are most widely provided as FASTQ
files, which include sequence information, similar to FASTA files, but additionally
contain further information, including sequence quality information.
A FASTQ file consists of blocks, corresponding to reads, and each block consists
of four elements in four lines (Figure 2).
The first line contains a sequence identifier and includes an optional description
of sequencing information (such as machine ID, lane, tile, etc.). The raw sequence
letters are presented in line 2. The third line begins with a “+” sign and optionally
contains the same sequence identifier. The last line encodes the quality score for the
sequence in line 2 in the form of ASCII characters. While specific scoring schemes
might differ among platforms, the Phred score (Q_Phred = −10 log10 P, where P is the
probability of misreading any given base) is the most widely used.
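The four-line block structure and the Phred relation can be illustrated with a short parser. This is a minimal sketch: it assumes the common Phred+33 ASCII offset (other offsets exist on older platforms), and the function names and the example read are hypothetical.

```python
def parse_fastq(lines, offset=33):
    """Yield (identifier, sequence, phred_scores) for each four-line FASTQ block."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(rows) - 3, 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ block starting at record line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for %s" % header)
        # Each quality character encodes one Phred score as an ASCII code minus the offset.
        scores = [ord(ch) - offset for ch in qual]
        yield header[1:], seq, scores

def error_probability(q):
    """Invert Q = -10 * log10(P) to recover the misread probability P."""
    return 10 ** (-q / 10.0)

# Hypothetical single-read example.
record = ["@read1 hypothetical description", "GATTACA", "+", "IIIIII#"]
name, seq, scores = next(parse_fastq(record))
```

Here the character `I` decodes to a score of 40 (an error probability of 1 in 10,000), while `#` decodes to 2, a very unreliable base call.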
Figure 2.
Example FASTQ file format.
In general, the raw sequence data acquired from a sequencing provider is not
immediately ready to be used for variant discovery. The first and most important
step of the WES/WGS analysis workflow following data acquisition is the quality
control (QC) step. QC is the process of improving raw data by removing any
identifiable errors from it. By performing QC at the beginning of the analysis,
the chances of encountering contamination, bias, error, and missing data are
minimized.
The QC process is a cyclical process, in which (i) the quality is evaluated, (ii) QC
is stopped if the quality is adequate, and (iii) a data altering step (e.g., trimming of
low-quality reads, removal of adapters, etc.) is performed, and then the QC is
repeated beginning from step (i).
The most commonly used tool for evaluating and visualizing the quality of
FASTQ data is FastQC (Babraham Bioinformatics, n.d.), which provides comprehensive
information about data quality, including but not limited to per base
sequence quality scores, GC content information, sequence duplication levels, and
overrepresented sequences (Figure 3). Alternatives to FastQC include, but are not
limited to, fastqp, NGS QC Toolkit, PRINSEQ, and QC-Chain.
Below, QC approaches for the most commonly encountered data quality issues
are discussed: adapter contamination and low-quality measurements toward the 5′
and 3′ ends of reads.
Adapters are ligated to the 5′ and 3′ ends of each single DNA molecule during
sequencing. These adapter sequences hold barcoding sequences, forward/reverse
primers, and the binding sequences to immobilize the fragments to the flow cell and
allow bridge amplification. Since the adapter sequences are synthetic and are not
seen in any genomic sequence, adapter contamination often leads to NGS alignment
errors and an increased number of unaligned reads. Hence, any adapter sequences
need to be removed before mapping. In addition to adapter removal, trimming can
be performed to discard low-quality bases, which generally occur at the 5′ and
3′ ends of reads.
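The two operations described above can be sketched in a few lines. This is a deliberately simplified illustration: it matches the adapter exactly (dedicated trimmers tolerate mismatches and partial adapters), trims only the 3′ end, and assumes Phred+33 qualities. The adapter string, the threshold of 20, and the function names are assumptions chosen for the example.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter sequence."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual            # no adapter found, read unchanged
    return seq[:pos], qual[:pos]

def trim_low_quality_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Hypothetical read: 8 genomic bases followed by adapter read-through.
adapter = "AGATCGGAAGAGC"
seq = "ACGTACGT" + adapter
qual = "IIIIII##" + "I" * len(adapter)

seq1, qual1 = trim_adapter(seq, qual, adapter)        # removes the adapter
seq2, qual2 = trim_low_quality_3prime(seq1, qual1)    # removes the two low-quality bases
```

Running adapter removal before quality trimming matters in this example, because the low-quality bases only become the 3′ end once the adapter is gone.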
There is an abundance of tools for QC, namely, Trimmomatic [1], CutAdapt [2],
AlienTrimmer [3], Skewer [4], BBDuk [5], Fastx Toolkit [6], and Trim Galore [7].
Figure 3.
An example FastQC result.
In addition to these stand-alone tools, R packages for QC, such as PIQA and
ShortRead, are also available.
While QC is the most important step of NGS analysis, one must keep in mind
that once basic corrections (such as the ones described above) are made, no amount
of further QC can produce a radically better outcome. QC cannot simply turn bad
data into good data. Moreover, it is also important to remember that because QC
may also introduce error that can affect the analysis, it is vital never to perform
error correction on data that does not need it.
In order to find the exact locations of reads, each must be aligned to a reference
genome. Efficiency and accuracy are crucial in this step because large quantities of
reads could take days to align and a low-accuracy alignment would cause inadequate
analyses. For humans, the most current and widely used reference sequences
are GRCh37 (hg19) and GRCh38 (hg38). As with any bioinformatics problem,
there are a great number of tools for aligning sequences to the reference
genome, including BWA [8], Bowtie2 [9], Novoalign [10], and MUMmer [11].
After aligning, a Sequence Alignment Map (SAM) file is produced. This file
contains the reads aligned to the reference. The binary version of a SAM file is
termed a Binary Alignment Map (BAM) file, and BAM files are utilized for random-
access purposes. The SAM/BAM file consists of a header and an alignment section.
The header section contains contigs of aligned reference sequence, read groups
(carrying platform, library, and sample information), and (optionally) data
processing tools applied to the reads. The alignment section includes information on
the alignments of reads.
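The split between header and alignment sections can be illustrated with a small reader for SAM text. This sketch relies only on the layout of the SAM format (header lines start with `@`; alignment lines have 11 mandatory tab-separated fields); the function names and the example lines are hypothetical, and production code would normally use a dedicated library instead.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

def parse_sam_line(line):
    """Parse one alignment line into a dict of the 11 mandatory fields plus tags."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("alignment line has fewer than 11 fields")
    rec = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        rec[key] = int(rec[key])
    rec["is_unmapped"] = bool(rec["FLAG"] & 0x4)   # flag bit 0x4: read unmapped
    rec["is_reverse"] = bool(rec["FLAG"] & 0x10)   # flag bit 0x10: reverse strand
    rec["tags"] = parts[11:]                       # optional TAG:TYPE:VALUE fields
    return rec

def split_header_and_alignments(lines):
    """Separate '@' header lines from alignment records."""
    header = [ln for ln in lines if ln.startswith("@")]
    alignments = [parse_sam_line(ln) for ln in lines
                  if ln.strip() and not ln.startswith("@")]
    return header, alignments

# Hypothetical minimal SAM content.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA\tLB:lib1",
    "read1\t16\tchr1\t10001\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0",
]
header, alignments = split_header_and_alignments(sam_lines)
```

In the example header, the `@SQ` line describes a reference contig and the `@RG` line carries the platform, library, and sample information mentioned above.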
in further analyses. The most widely used tool for BQSR is provided by the Genome
Analysis Toolkit (GATK) [13].
After these post-alignment data processing operations, an analysis-ready BAM
file is obtained.
In this section, approaches for the discovery of germline SNV and indels are
discussed. In the following sections, approaches for the discovery of somatic short
variants and of structural variations are outlined.
Following data processing steps, the reads are ready for downstream analyses,
and the following step is most frequently variant calling. Variant calling is the
process of identifying differences between the sequencing reads resulting from
NGS experiments and a reference genome. Countless variant callers have been and
are being developed for accomplishing this challenging task, as alignment and
sequencing artifacts complicate the process of variant calling. For recent studies
comparing different variant callers, see [14–16]. Methods for detecting short
variants can be broadly categorized into “probabilistic methods” and “heuristic-based
algorithms.” In probabilistic methods, the distribution of the observed data is
modeled, and then Bayesian statistics is utilized to calculate genotype probabilities.
In contrast, in heuristic-based algorithms, variant calls are made based on a number
of heuristic factors, such as read quality cutoffs, minimum allele counts, and bounds
on read depth. Whereas heuristic-based algorithms are not as widely used, they can
be robust to outlying data that violate the assumptions of probabilistic models.
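The probabilistic approach can be sketched for a single biallelic site. The example below is a strongly simplified model, not the method of any particular caller: it assumes independent reads, a diploid genotype, sequencing errors that swap the reference and alternate alleles, and hypothetical genotype priors.

```python
import math

def genotype_posteriors(bases, quals, ref, alt, priors=(0.998, 0.001, 0.001)):
    """Posterior probabilities for genotypes carrying 0, 1, or 2 copies of alt.

    bases: observed base calls at the site; quals: their Phred scores.
    priors: hypothetical prior probabilities for (hom-ref, het, hom-alt).
    """
    log_posts = []
    for alt_copies, prior in enumerate(priors):
        frac_alt = alt_copies / 2.0
        log_p = math.log(prior)
        for base, q in zip(bases, quals):
            err = min(10 ** (-q / 10.0), 0.5)   # Phred score to error probability
            p_alt = frac_alt * (1 - err) + (1 - frac_alt) * err
            if base == alt:
                log_p += math.log(p_alt)
            elif base == ref:
                log_p += math.log(1 - p_alt)
            # Bases matching neither allele are ignored in this sketch.
        log_posts.append(log_p)
    top = max(log_posts)
    weights = [math.exp(lp - top) for lp in log_posts]  # stabilized normalization
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical pileups at one site with reference A and alternate G.
het_like = genotype_posteriors("AAAAAGGGGG", [30] * 10, ref="A", alt="G")
hom_alt_like = genotype_posteriors("GGGGGGGGGG", [30] * 10, ref="A", alt="G")
hom_ref_like = genotype_posteriors("AAAAAAAAAA", [30] * 10, ref="A", alt="G")
```

A heuristic caller would instead apply fixed rules to the same pileup, for example requiring a minimum number of high-quality reads supporting the alternate allele.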
The most widely used state-of-the-art variant callers include, but are not limited
to, GATK-HaplotypeCaller [13], SOAPsnp [17], SAMTools [18], bcftools [18],
Strelka [19], FreeBayes [20], Platypus [21], and DeepVariant [22]. We would like to
emphasize that for WES/WGS, a combination of different variant callers outper-
forms any single method [23].
Following the variant calling step, raw SNV and indels in the Variant Call
Format (VCF) are obtained. These should then be filtered either through applying
hard filters to the data or through a more complex approach such as GATK’s Variant
Quality Score Recalibration (VQSR).
Hard filtering is applied by filtering via thresholds for metrics such as
QualByDepth, FisherStrand, RMSMappingQuality, MappingQualityRankSumTest,
ReadPosRankSumTest, and StrandOddsRatio.
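Hard filtering amounts to comparing each variant's annotations against fixed cutoffs. The sketch below parses the INFO column of a VCF record and reports which rules fail, using the usual INFO keys for the metrics listed above (QD, FS, MQ, MQRankSum, ReadPosRankSum, SOR). The threshold values are commonly quoted starting points for SNVs and should be treated as illustrative assumptions, not recommendations from this chapter; the function names are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'QD=25.3;FS=1.2;DB' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        elif item:
            info[item] = True       # flag-type annotation without a value
    return info

# (INFO key, comparison that triggers the filter, threshold) -- illustrative values.
SNV_RULES = [
    ("QD", "<", 2.0),
    ("FS", ">", 60.0),
    ("MQ", "<", 40.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
    ("SOR", ">", 3.0),
]

def failed_filters(info, rules=SNV_RULES):
    """Return the list of rules a variant fails; an empty list means it passes."""
    failed = []
    for key, op, threshold in rules:
        if key not in info:
            continue                # annotation absent, rule cannot be evaluated
        value = float(info[key])
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            failed.append(f"{key}{op}{threshold}")
    return failed

good = parse_info("QD=25.3;FS=1.2;MQ=60.0;MQRankSum=0.1;ReadPosRankSum=0.5;SOR=0.7")
bad = parse_info("QD=1.1;FS=75.0;MQ=58.0;SOR=0.9")
good_result = failed_filters(good)
bad_result = failed_filters(bad)
```

Keeping the rules in a data structure rather than in code makes it easy to use different cutoffs for SNVs and indels.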
VQSR, on the other hand, relies on machine learning to identify annotation
profiles of variants that are likely to be real. It requires a large training dataset
(a minimum of 30 WES samples, or at least one WGS sample if possible) and well-curated
sets of known variants. The aim is to assign a well-calibrated probability to each
variant call, creating accurate variant quality scores that are then used for filtering.
The accuracy of variant calling is also affected by coverage. Coverage can be
broadly defined as the number of unique reads that include a given nucleotide.
Coverage is affected by the accuracy of alignment algorithms and by the
“mappability” of reads. Coverage can be utilized for both the filtration of variants
and for a general evaluation of the sequencing experiment. Tools for assessing
coverage information include GATK [13], BEDTools [24], Sambamba [25], and
RefCov [26].
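The definition of coverage given above can be made concrete with a small per-base depth calculation. This sketch assumes reads are given as 0-based half-open (start, end) intervals on a single contig and that duplicate reads have already been removed; the function name and the example intervals are hypothetical.

```python
def per_base_coverage(reads, region_start, region_end):
    """Depth at each position of [region_start, region_end) from read intervals."""
    length = region_end - region_start
    diff = [0] * (length + 1)       # difference array: +1 where a read starts, -1 where it ends
    for start, end in reads:
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:                 # read overlaps the region
            diff[lo] += 1
            diff[hi] -= 1
    depth, running = [], 0
    for i in range(length):
        running += diff[i]
        depth.append(running)
    return depth

# Hypothetical aligned reads; the last one falls outside the region of interest.
reads = [(0, 5), (2, 8), (4, 10), (20, 25)]
depth = per_base_coverage(reads, 0, 10)
mean_depth = sum(depth) / len(depth)
```

The same per-base depths can serve both purposes mentioned above: a minimum-depth cutoff for individual variant sites and summary statistics, such as the mean depth, for the experiment as a whole.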
While the abovementioned are all variant annotation tools, it might be wise to put
GEMINI in a different category as it has other built-in tools to make further analysis
of the variants easier.
Figure 4.
An example somatic variant discovery workflow. Each step is labelled in the black rectangles. Most widely used
tools for each operation are also presented. As can be seen in the diagram, the processing steps until the variant
calling step are performed for both the normal and tumor data, separately.
ii. Split read-based methods use the incompletely mapped read from each read
pair to identify small CNVs. Split read-based tools include AGE, Pindel,
SLOPE, and SRiC.
iii. Read depth-based methods detect CNVs by counting the number of reads
mapped to each genomic region. Tools using this approach include GATK,
SegSeq, CNV-seq, RDXplorer, BIC-seq, CNAseq, cn.MOPS, jointSLM,
ReadDepth, rSW-seq, CNVnator, CNVnorm, CMDS, mrCaNaVar, CNVeM,
and cnvHMM.
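The read depth-based idea can be sketched as follows: count reads per fixed-size bin, normalize by library size, and compare a sample with a control on a log2 scale. This is a bare-bones illustration and not the algorithm of any listed tool; the bin size, pseudocount, gain/loss thresholds, and function names are assumptions.

```python
import math

def bin_counts(read_starts, region_length, bin_size):
    """Count reads whose start position falls in each fixed-size bin."""
    n_bins = (region_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // bin_size] += 1
    return counts

def log2_ratios(sample_counts, control_counts, pseudocount=1.0):
    """Library-size-normalized log2(sample / control) for each bin."""
    s_total, c_total = sum(sample_counts), sum(control_counts)
    ratios = []
    for s, c in zip(sample_counts, control_counts):
        s_norm = (s + pseudocount) / s_total
        c_norm = (c + pseudocount) / c_total
        ratios.append(math.log2(s_norm / c_norm))
    return ratios

def call_bins(ratios, gain=0.4, loss=-0.6):
    """Label each bin using hypothetical log2-ratio thresholds."""
    return ["gain" if r >= gain else "loss" if r <= loss else "neutral" for r in ratios]

# Hypothetical data: uniform control, sample with doubled depth in the third bin.
control_starts = list(range(400))
sample_starts = list(range(400)) + list(range(200, 300))
control = bin_counts(control_starts, region_length=400, bin_size=100)
sample = bin_counts(sample_starts, region_length=400, bin_size=100)
calls = call_bins(log2_ratios(sample, control))
```

Real tools add steps this sketch omits, such as GC-content and mappability correction and segmentation of adjacent bins.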
CNVs more challenging. To overcome this challenge, numerous tools have been
developed. Widely used tools for detecting specifically somatic CNVs include
ADTEx [75], CONTRA [76], cn.MOPS [77], ExomeCNV [78], VarScan2 [62],
SynthEx [79], Control-FREEC [80], GATK [13], and CloneCNA [81].
Perhaps the most challenging process in WES/WGS analysis is the clinical inter-
pretation of genomic variations. While WES/WGS is rapidly becoming a routine
approach for the diagnosis of monogenic and complex disorders and personalized
treatment of such disorders, it is still challenging to interpret the vast amount of
genomic variation data detected through WES/WGS [97].
There exist numerous standardized widely accepted guidelines for the evalua-
tion of genomic variations obtained through NGS such as the American College of
Medical Genetics and Genomics (ACMG), the EuroGentest, and the European
Society of Human Genetics guidelines. These provide standards and guidelines for
the interpretation of genomic variations and include evidence-based recommenda-
tions on aspects including the use of literature and database and the use of in silico
predictors, criteria for variant interpretation, and reporting.
In addition to variant-dependent annotations such as allele frequency (e.g., in
1000 Genomes [29], ExAC [30], gnomAD [31]), the predicted effect on the protein, and
evolutionary conservation, the interpretation of genomic variants requires
disease-dependent inquiries such as mode of inheritance, co-segregation of the
variant with disease within families, prior association of the
variant/gene with disease, investigation of clinical actionability, and pathway-based
analysis.
Databases such as ClinVar [33], HGV databases [98], OMIM [99], COSMIC
[100], and CIViC [101] are excellent resources that can aid interpretations of
clinical significance of germline and somatic variants for reported conditions. The
availability of shared genetic data in such databases makes it possible to identify
patients with similar conditions and aid the clinician to make a conclusive diagnosis.
While one may perform interpretation of genomic variations completely manu-
ally after annotation and filtering of variants, there are several tools to aid in
Bioinformatics Workflows for Genomic Variant Discovery, Interpretation and Prioritization
DOI: http://dx.doi.org/10.5772/intechopen.85524
4. Conclusion
We attempted to describe the best practices for variant discovery, outlining the
fundamental aspects. We hope to have provided a basic understanding of WES/WGS
analysis, as we believe that awareness of the steps involved, and of the challenges
at each step, is important for understanding how each piece may affect the
downstream steps (and eventually the interpretation). As emphasized throughout
the chapter, substantial (or even minor) changes at any step can fundamentally
alter the outcomes in the later stages.
While there is no definite gold standard for the interpretation of genomic vari-
ations, we attempted to briefly describe the currently available and widely used
guidelines, tools, and resources for clinical evaluation of genomic variations.
In the coming years, with advancements in bioinformatics, increasing
cooperation between clinicians and bioinformaticians, and large-scale efforts
(such as IRDiRC [132], TCGA [133], and ICGC [134]), we expect a greater
focus on developing novel tools for the clinical interpretation of genomic
variations. Cooperation between multiple disciplines is vital to improve the
existing approaches as well as to develop novel approaches and resources.
Author details
Osman Ugur Sezerman*, Ege Ulgen, Nogayhan Seymen and Ilknur Melis Durasi
Department of Biostatistics and Medical Informatics, Acibadem Mehmet Ali
Aydinlar University, Istanbul, Turkey
© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References
[22] Poplin R, Chang P-C, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983-987. DOI: 10.1038/nbt.4235
[23] Bao R, Huang L, Andrade J, et al. Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer Informatics. 2014;13(Suppl 2):67-82
[24] Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-842
[25] Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: Fast processing of NGS alignment formats. Bioinformatics. 2015;31(12):2032-2034
[26] Available from: http://gmt.genome.wustl.edu/gmt-refcov
[27] Robinson JT, Thorvaldsdóttir H, Winckler W, et al. Integrative genomics viewer. Nature Biotechnology. 2011;29(1):24-26
[28] Sherry ST, Ward MH, Kholodov M, et al. dbSNP: The NCBI database of genetic variation. Nucleic Acids Research. 2001;29(1):308-311
[29] Auton A, Brooks LD, Durbin RM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68-74
[30] Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285-291
[31] Available from: https://gnomad.broadinstitute.org/
[32] Tate JG, Bamford S, Jubb HC, et al. COSMIC: The catalogue of somatic
[33] Landrum MJ, Lee JM, Riley GR, et al. ClinVar: Public archive of relationships among sequence variation and human phenotype. Nucleic Acids Research. 2014;42(Database issue):D980-D985
[34] Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Research. 2003;31(13):3812-3814
[35] Adzhubei IA, Schmidt S, Peshkin L, et al. A method and server for predicting damaging missense mutations. Nature Methods. 2010;7(4):248-249
[36] Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Research. 2009;19(9):1553-1561
[37] Schwarz JM, Rödelsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nature Methods. 2010;7:575-576
[38] Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: Application to cancer genomics. Nucleic Acids Research. 2011;39(17):e118
[39] Shihab HA, Gough J, Cooper DN, et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Human Mutation. 2013;34(1):57-65
[40] Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Computational Biology. 2010;6(12):e1001025
[41] Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral
[82] Sedlazeck FJ, Dhroso A, Bodian DL, Paschall J, Hermes F, Zook JM. Tools for annotation and comparison of structural variation. F1000Res. 2017;6:1795
[83] Fan X, Abbott TE, Larson D, Chen K. BreakDancer: Identification of genomic structural variation from paired-end read mapping. Current
[91] Ritz A, Bashir A, Sindi S, Hsu D, Hajirasouliha I, Raphael BJ. Characterization of structural variants with single molecule and hybrid sequencing approaches. Bioinformatics. 2014;30(24):3458-3466
[92] Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25(21):2865-2871
[93] Hart SN, Sarangi V, Moore R, et al. SoftSearch: Integration of multiple sequence features to identify breakpoints of structural variations. PLoS One. 2013;8(12):e83356
[94] Zeitouni B, Boeva V, Janoueix-Lerosey I, et al. SVDetect: A tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics. 2010;26(15):1895-1896
[95] Chen K, Chen L, Fan X, Wallis J, Ding L, Weinstock G. TIGRA: A targeted iterative graph routing assembler for breakpoint assembly. Genome Research. 2014;24(2):310-317
[96] Guan P, Sung WK. Structural variation detection using next-generation sequencing data: A comparative technical review. Methods. 2016;102:36-49
[101] Griffith M, Spies NC, Krysiak K, et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nature Genetics. 2017;49(2):170-174
[102] Available from: www.ingenuity.com
[103] Available from: https://www.illumina.com/products/by-type/informatics-products/basespace-variant-interpreter.html
[104] Available from: https://www.bioz.com/result/VariantStudio%20variant/product/Illumina
[105] Li MX, Gui HS, Kwan JS, Bao SY, Sham PC. A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases. Nucleic Acids Research. 2012;40(7):e53
[106] Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: Patient phenotyping software for clinical and research use. Human Mutation. 2013;34(8):1057-1065
[117] Tarca AL, Draghici S, Khatri P, et al. A novel signaling pathway impact analysis. Bioinformatics. 2008;25(1):75-82
[118] Ulgen E, Ozisik O, Sezerman OU. pathfindR: An R Package for Pathway Enrichment Analysis Utilizing Active Subnetworks. bioRxiv. 2018
[127] Szklarczyk D, Morris JH, Cook H, et al. The STRING database in 2017: Quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Research. 2016;45(D1):D362-D368
[128] Cerami EG, Gross BE, Demir E, et al. Pathway commons, a web resource
Chapter 3
Abstract
1. Introduction
Figure 1.
The evolution of local ancestry deconvolution from 2003 to 2017.
Orienting Future Trends in Local Ancestry Deconvolution Models to Optimally Decipher…
DOI: http://dx.doi.org/10.5772/intechopen.82764
Figure 2.
A partial worldwide admixture painting map. The figure shows several worldwide admixed populations
with patterns identified through papers on population structure published from 2008 to 2018. Population
migrations within and between continents have resulted in different admixed populations ranging from
one- to five-way admixtures.
As pointed out previously, existing local ancestry inference models can be cate-
gorized into two main groups based on whether the model makes use of admixture/
background linkage disequilibrium (LD) or not.
LD-based models account for LD in local ancestry deconvolution, and due to the
importance of LD in disease mapping, the first local ancestry methods fall into this
category. They assume that ancestry along an admixed individual's genome follows a
first-order Markov chain, meaning that the immediate past state captures all the
information on past states [19]. LD-based models further assume that, at every
site, the observed admixed genotypes are generated by the unobserved ancestry;
hence, the Hidden Markov Model (HMM) and its extensions are used to infer the
unobserved (hidden) states. To deconvolute ancestry along the admixed genome,
these models therefore have three components, namely the initial, transition,
and observation (or emission) probability models. Due to uncertainty and the
number of parameters involved, LD-based methods use Markov Chain Monte Carlo
(MCMC), forward-backward, or Viterbi algorithms to determine the hidden ancestry
sequence for a given individual. Falush et al. and Patterson et al. modeled the
ancestry switch between ancestral populations at a given site,
$X_t \in \{1, \ldots, K\}$, by representing the first marker as

\[ P(X_1 = k) = q_k \tag{1} \]

and the transition probability between consecutive markers as

\[ P(X_t = k' \mid X_{t-1} = k) = e^{-d_t r}\,\delta_{k'=k} + \left(1 - e^{-d_t r}\right) q_{k'} \tag{2} \]

where $\delta_{k'=k}$ is the indicator function, $d_t$ the genetic distance
between sites $t$ and $t-1$, and $q_k$ the proportion of ancestry contributed by
candidate ancestral population $k$, such that $q = (q_1, \ldots, q_K)$ is the vector
of ancestry inherited from each ancestral population. On haploid data, the
probability of a recombination event is $1 - e^{-d_t r}$, meaning that the
probability of no recombination is $e^{-d_t r}$ [8, 20]. LD-based methods can be
subdivided into admixture LD-based methods and admixture and background LD
methods. Note that admixture LD arises when ancestry at nearby markers is
inherited together, whereas background LD is the LD within ancestral populations;
the latter depends highly on population history (i.e., it is generated by genetic
drift and population bottlenecks).
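The initial and transition models described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that a recombination event (probability 1 - exp(-d_t * r)) redraws the ancestry from the proportions q and that the ancestry is otherwise copied from the previous site. The function name and the numeric values are hypothetical and are not taken from any of the cited tools.

```python
import math

def transition_matrix(q, d_t, r):
    """Return T with T[k][k2] = P(X_t = k2 | X_{t-1} = k).

    Assumption for this sketch: with probability exp(-d_t * r) there is no
    recombination and the ancestry is copied; otherwise a new ancestry is
    drawn from the ancestry proportions q (which must sum to 1).
    """
    no_recomb = math.exp(-d_t * r)
    K = len(q)
    return [
        [no_recomb * (1.0 if k == k2 else 0.0) + (1.0 - no_recomb) * q[k2]
         for k2 in range(K)]
        for k in range(K)
    ]

# Illustrative values: two ancestries, 0.01 Morgans between sites, r = 10.
T = transition_matrix([0.7, 0.3], d_t=0.01, r=10.0)
```

Each row of T sums to one, and as the distance between markers shrinks to zero the matrix approaches the identity, which matches the intuition that tightly linked markers share ancestry.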
Admixture LD-based methods account for the LD that results from the admixture
process but do not model background LD. They include the early methods, for
example, STRUCTURE V2 [20], ANCESTRYMAP [8], and ADMIXMAP [9], which are based
on the Bayesian framework. Early methods rely on ancestry informative markers
(AIMs), that is, markers that show significant differences in frequency between
ancestral populations. Admixture LD-based models assume that markers are
independent and that the global and ancestral allele frequencies are known. They
integrate HMM with MCMC, and their ancestry switch model, that is, the initial
and transition models, is as in Eqs. (1) and (2), respectively. Since admixture
LD-based methods do not model background LD, their observation model depends on only
the allele frequency of the ancestry at that site. For instance, assuming K = 2,
Patterson et al. defined the emission probability by

\[
P(Y_t = y \mid X_t = n_a) =
\begin{cases}
\dbinom{2}{n_a}\, p_k^{n_a} (1 - p_k)^{2 - n_a} & \text{if } n_a = 0 \text{ or } 1 \\[8pt]
\begin{pmatrix}
(1 - p_1)(1 - p_2) \\
p_1 (1 - p_2) + p_2 (1 - p_1) \\
p_1 p_2
\end{pmatrix} & \text{if } n_a = 2
\end{cases}
\tag{3}
\]
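To make the structure of such an emission model concrete, the sketch below computes the probability of a diploid genotype from the allele frequencies of the ancestries carried by the two chromosomes. It is an illustrative reading rather than Patterson et al.'s implementation: it assumes the two chromosomes are independent draws, so that two chromosomes of the same ancestry give a binomial distribution and two chromosomes of different ancestries give the three mixed terms. The function and argument names are hypothetical.

```python
def genotype_emission(y, p_a, p_b):
    """P(genotype = y copies of the allele), y in {0, 1, 2}.

    p_a and p_b are the allele frequencies in the ancestral populations of
    the first and second chromosome (assumed independent in this sketch).
    """
    if y == 0:
        return (1.0 - p_a) * (1.0 - p_b)
    if y == 1:
        return p_a * (1.0 - p_b) + p_b * (1.0 - p_a)
    if y == 2:
        return p_a * p_b
    raise ValueError("genotype must be 0, 1 or 2")

# Same ancestry on both chromosomes: binomial with frequency 0.2.
same = [genotype_emission(y, 0.2, 0.2) for y in (0, 1, 2)]
# One chromosome from each population (frequencies 0.2 and 0.9).
mixed = [genotype_emission(y, 0.2, 0.9) for y in (0, 1, 2)]
```

In both cases the three probabilities sum to one, which is a quick sanity check when wiring an emission model into an HMM.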
Since biological data often have dependences that violate the independence
assumption of the standard HMM, admixture LD-based methods are often not
realistic. To relax the independence assumption, the HMM is extended to a
Markov HMM (MHMM), factorial HMM (FHMM), hierarchical HMM (HHMM), or two-layer
HMM, or to other multivariate statistical models such as the multivariate normal
distribution (MVN); unlike the early methods, these approaches use rich ancestral
haplotype data. This is the case for SABER [10], SWITCH [25], HAPAA [26],
HAPMIX [4], MULTIMIX [27], ALLOY [28], and ELAI [29]. MHMMs were the first HMM
extension in local ancestry inference. They were first implemented in SABER and
later in SWITCH. SABER was the first method to model background LD in genetic
ancestry inference. MHMM assumes that the current observed haplotype depends on
both the current ancestry
and the immediate past observation. The MHMM differs from the admixture LD-based
HMM in that, when the ancestry does not switch between sites $t-1$ and $t$, the
MHMM observation model depends on the joint distribution of alleles at the two
sites [6, 30], defined as follows [10]:

\[
P(Y_t = c \mid Y_{t-1} = d, X_t = k, X_{t-1} = k') = B_t(c, d, k', k) =
\begin{cases}
\tilde{B}_{k,t}(c, d) & \text{for } k' = k \\
B_{k,t}(c) & \text{otherwise}
\end{cases}
\tag{4}
\]

where $\tilde{B}_{k,t}(c, d)$ is the probability of having allele $c$ at marker $t$
given that there was allele $d$ at marker $t-1$, and $B_{k,t}(c)$ is the frequency
of allele $c$ at marker $t$ in population $k$. If the ancestry switches, the
observation model is therefore like that of the models in Section 2.1.1.1. The
transition model of SABER accounts for differences in admixture times, as occur
in the realistic case of continuous gene flow, where populations contribute their
genetic material to the admixture in different generations [10]. Tang et al.
defined the probability of switching from ancestry $i$ to ancestry $j$ between
consecutive sites as

\[
A_{ij} =
\begin{cases}
\dfrac{q_i\, g_i^2}{\sum_{k=1}^{K} q_k g_k} - g_i, & \text{for } i = j, \\[10pt]
\dfrac{q_j\, g_i\, g_j}{\sum_{k=1}^{K} q_k g_k}, & \text{otherwise}
\end{cases}
\tag{5}
\]
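As a numerical check on Eq. (5), the sketch below builds the matrix A under the reading that the diagonal entries are q_i g_i^2 / sum_k(q_k g_k) - g_i and the off-diagonal entries are q_j g_i g_j / sum_k(q_k g_k). The interpretation of g_k as a per-population admixture time in generations is an assumption made for this example, as are the function name and the input values.

```python
def switch_matrix(q, g):
    """Build A from ancestry proportions q and per-population values g.

    Assumed reading of Eq. (5):
      A[i][j] = q[j] * g[i] * g[j] / sum_k(q[k] * g[k])   for i != j
      A[i][i] = q[i] * g[i] ** 2 / sum_k(q[k] * g[k]) - g[i]
    """
    K = len(q)
    norm = sum(qk * gk for qk, gk in zip(q, g))
    A = [[q[j] * g[i] * g[j] / norm for j in range(K)] for i in range(K)]
    for i in range(K):
        A[i][i] -= g[i]
    return A

# Illustrative values: two populations with different admixture times.
A = switch_matrix(q=[0.6, 0.4], g=[10.0, 5.0])
row_sums = [sum(row) for row in A]
```

Under this reading every row of A sums to zero and the off-diagonal entries are non-negative, so A behaves like a matrix of switching rates per unit genetic distance rather than a matrix of probabilities. Converting it to probabilities over a given distance would require an additional step that is not shown here.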
Non-LD methods model neither background nor admixture LD. They either
remove SNPs in LD, which is the case for LAMP [11] and WINPOP [31], or use all
SNPs (linked and unlinked) without modeling LD, which is the case for EILA
[14], RFMIX [32], and LOTER [15]. Since MHMMs have a large number of parameters
and do not model LD explicitly, LAMP [11], an algorithmic approach that divides
the genome into windows of SNPs, emerged in 2008. LAMP is fast and robust, and
for two-way admixtures it can infer local ancestry even without proxy ancestral
genotypes. It uses the naive Bayes classifier and a clustering algorithm
known as iterative conditional modes. LAMP estimates the most probable
ancestry at a site by applying a majority vote for each SNP [11]. Although
accuracy is compromised, LAMP does not suffer from the challenges of HMMs and
their extensions. However, LAMP underperforms in closely related populations,
and hence it was extended to WINPOP [31], a dynamic programming algorithm.
Unlike LAMP, WINPOP assumes at least one recombination event within each window
and varies the window length depending on the genetic distance between
populations. Hence, WINPOP and LAMP outperform other methods in closely and
distantly related populations, respectively. Both LAMP and WINPOP assume
unlinked markers and discard SNPs in LD.
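The per-SNP majority vote mentioned above can be illustrated with a short sketch. It assumes that an ancestry label has already been assigned to each window (LAMP obtains these by clustering, which is not reproduced here) and that windows overlap, so that each SNP is covered by several windows. The function name and the window layout are hypothetical.

```python
from collections import Counter

def majority_vote(n_snps, windows):
    """Assign each SNP the ancestry that most of its covering windows carry.

    windows: list of (start, end, ancestry) tuples with end exclusive.
    SNPs not covered by any window are reported as None.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, ancestry in windows:
        for snp in range(start, min(end, n_snps)):
            votes[snp][ancestry] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Five overlapping windows over eight SNPs, labelled "A" or "B".
windows = [(0, 4, "A"), (2, 6, "A"), (3, 7, "B"), (4, 8, "B"), (5, 8, "B")]
calls = majority_vote(8, windows)
```

With these windows the first four SNPs are called "A" and the last four "B", so the ancestry switch is placed where the balance of overlapping windows changes.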
As the availability of admixed sequence data increased, Maples et al. proposed a
discriminative approach to estimate local ancestry, RFMIX [32]. A discriminative
approach estimates the posterior probability directly and not via the joint
probability distribution. In contrast to generative ancestry inference models,
RFMIX uses the information contained in admixed individuals. This is
advantageous when few reference panels have been genotyped, as is the case for
Native Americans [32]. RFMIX uses conditional random fields (CRFs) parametrized
on random forests. It outperforms other methods in multi-way admixtures,
possibly because it models phase switch errors. In 2013, EILA [14], a
multivariate statistics-based method, was proposed to increase inference power
by addressing three common challenges in local ancestry inference: the
assumption of SNP independence, difficulties in identifying breakpoints, and the
use of three genotype values. Instead of raw genotypes, EILA uses a numerical
score between 0 and 1 that measures how close SNPs are to the ancestral
populations. Breakpoints are a challenge to identify, but EILA identifies them
by fused quantile regression, facilitating the use of the estimates in admixture
dating. Finally, k-means classifiers are used to infer ancestry using all
genotyped SNPs [14].
Recently, LOTER [15], a software package that deconvolves local ancestry in
multi-way admixtures for a wide range of species, was proposed. LOTER can
account for phase errors in two-way admixtures only. It facilitates the local
ancestry inference process and its application in non-model species [15]. Unlike
other methods, LOTER needs neither biological parameters, such as admixture time
and recombination rate, nor statistical parameters, such as the number of hidden
states and misfit probabilities, to deconvolve ancestry [15]. Although it uses
the Li and Stephens copying model [33] as in LAMPLD/LAMPHAP, LOTER is a
nonprobabilistic approach formulated as an optimization problem. Its solution is
obtained through dynamic programming.
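The idea of solving a copying model by dynamic programming can be sketched as follows. This is not LOTER's objective or code: it is a toy example that assumes binary haplotypes, counts one unit of cost per mismatch with the copied reference haplotype, and charges a fixed, hypothetical `penalty` each time the copied haplotype changes. The ancestry at a site is then the population label of the haplotype copied there.

```python
def copy_path(h, refs, labels, penalty=1.0):
    """Label each site of haplotype h with the population it is copied from.

    Minimizes (number of mismatches) + penalty * (number of switches between
    reference haplotypes) with a Viterbi-style dynamic program.
    """
    n, m = len(h), len(refs)
    cost = [float(refs[j][0] != h[0]) for j in range(m)]
    back = []
    for i in range(1, n):
        best_prev = min(range(m), key=cost.__getitem__)
        new_cost, ptr = [], []
        for j in range(m):
            # Either keep copying haplotype j or switch from the cheapest one.
            if cost[j] <= cost[best_prev] + penalty:
                prev, c = j, cost[j]
            else:
                prev, c = best_prev, cost[best_prev] + penalty
            new_cost.append(c + float(refs[j][i] != h[i]))
            ptr.append(prev)
        cost = new_cost
        back.append(ptr)
    # Trace the cheapest path back from the last site.
    j = min(range(m), key=cost.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [labels[j] for j in path]

refs = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
labels = ["pop1", "pop2"]
calls = copy_path([0, 0, 0, 1, 1, 1], refs, labels, penalty=1.0)
```

A larger penalty makes switches rarer, so an isolated mismatch is absorbed as noise instead of being called as a short ancestry tract.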
Finally, the existing LD-based and non-LD-based local ancestry inference models
are summarized in Table 1, extracted from Geza et al. [34].

| Software | Multi-way | Account LD | LD model | Biological/statistical parameters | Reference populations | Admixed populations | Year of publication |
|---|---|---|---|---|---|---|---|
| STRUCTURE V2* | ✓ | ✓ | HMM | Markers and ancestry proportions | Unphased | Unphased | August 2003 |
| ANCESTRYMAP* | ✗ | ✓ | HMM | Physical map, recombination and ancestry proportions | Unphased | Unphased | May 2004 |
| ADMIXMAP* | ✓ | ✓ | HMM | Physical map and ancestry proportions | Unphased | Unphased | May 2004 |
| SABER | ✓ | ✓ | MHMM | Physical map or recombination distance | Phased/unphased | Phased/unphased | July 2006 |
| “LAMP” | ✓ | ✗ | ✗ | Admixture generations, LD threshold, and physical map | Unphased | Unphased | February 2008 |
| HAPAA | ✓ | ✓ | HHMM | Admixture generations and genetic divergence | Phased | Phased | February 2008 |
| GEDI-ADMX | ✓ | ✓ | Fixed size FHMM | Admixed and ancestral SNPs (physical map) | Phased | Unphased | May 2009 |
| HAPMIX | ✗ | ✓ | HHMM | Genetic map, mutation rate, and admixed and ancestral SNPs | Phased | Unphased | June 2009 |
| LAMPLD | ✓ | ✓ | HHMM | Number of hidden states, window size, and physical map | Phased | Unphased | May 2012 |
| SUPPORTMIX* | ✓ | ✓ | HMM | Admixture generations and genetic map | Phased | Phased | June 2012 |
| PCADMIX* | ✓ | ✓ | Windows of blocks of SNPs | Genetic map and window size | Phased | Phased | August 2012 |
| mSPECTRUM | ✓ | ✓ | | SNPs, mutation and recombination rate | Phased | Phased | August 2012 |
| MULTIMIX | ✓ | ✓ | MVN | Genetic map, legend file, and misfitting probabilities | Phased/unphased | Phased/unphased | November 2012 |
| ALLOY | ✓ | ✓ | Non-homogeneous VLMC | Markers, ancestral proportions, admixture generations, and genetic map | Phased | Phased | February 2013 |
| RFMIX | ✓ | ✗ | ✗ | Genetic map, window size, and admixture generations | Phased | Phased | August 2013 |
| EILA | ✓ | ✗ | ✗ | Physical map | Unphased (no missing values) | Unphased (no missing values) | November 2013 |
| ELAI | ✓ | ✓ | Two layer HMM | Admixture generations, lower and upper cluster | Phased/unphased | Phased/unphased | May 2014 |

Table 1.
Existing 20 ancestry deconvolution tools: ✓ indicates the ability of the software to perform a specified task, ✗ indicates the inapplicability of the task by a particular tool. Unless explicitly
specified, LD refers to background LD.

Several models are now available to determine the date of admixture events in
a given admixed genome. Some models use the breakpoints of haplotypes, while
others focus on the ancestry blocks. Models based on ancestry blocks are
formulated using either empirical criteria or variants associated with a
specific population. In order to determine the average length of the admixture
blocks, these methods then assign ancestry on predefined windows using either
wavelet transformation or conditional random fields [35]. On the other hand,
some models rely on the rapid decrease in haplotype block sizes to estimate the
date of the admixture event [36]. This suggests that, in general, models used
for dating admixture events can be subdivided into two main classes [17, 18],
namely those based on LD and those based on the haplotype distribution, as
mentioned earlier.
with $\theta_u$, $u \in \{1, 2, 3\}$, the mutation parameter; $h$ represents the
haplotype site in the chromosomal offspring; the function $t_{vw}$ is an indicator
function that takes the value 1 if individual $w$ coming from offspring $x$ has the
same haplotype as the ancestral population $v$ and 0 otherwise; and $P$ is the
probability of inheriting a pair of haplotypes [4]. The number of generations
since admixture is given by

\[ G = \frac{C}{4\gamma(1 - \gamma)\zeta} \tag{9} \]

where $\zeta$ is the total Morgan length, $\gamma$ the proportion of admixture, and
$C$ the observed number of breakpoints [4].
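As a worked illustration of Eq. (9), read here as G = C / (4 * gamma * (1 - gamma) * zeta), the sketch below evaluates the estimate for made-up inputs. The function name, argument names, and the numbers are hypothetical and serve only to show the arithmetic.

```python
def generations_since_admixture(n_breakpoints, admixture_proportion, morgan_length):
    """Evaluate G = C / (4 * gamma * (1 - gamma) * zeta).

    n_breakpoints: observed number of ancestry breakpoints (C)
    admixture_proportion: proportion of admixture (gamma), strictly in (0, 1)
    morgan_length: total genetic length in Morgans (zeta), positive
    """
    gamma = admixture_proportion
    if not 0.0 < gamma < 1.0:
        raise ValueError("admixture proportion must lie strictly between 0 and 1")
    if morgan_length <= 0.0:
        raise ValueError("Morgan length must be positive")
    return n_breakpoints / (4.0 * gamma * (1.0 - gamma) * morgan_length)

# Illustrative numbers: 140 breakpoints, equal admixture, 35 Morgans.
G = generations_since_admixture(140, 0.5, 35.0)
```

With these inputs the denominator is 4 * 0.5 * 0.5 * 35 = 35, so the estimate is 4 generations. For a fixed number of breakpoints, admixture proportions further from 0.5 shrink the denominator and therefore give a larger estimate of G.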
On the other hand, Pugach et al. [17] employed the wavelet transform to design
a haplotype block approach. The aim of this method is to derive the time of admix-
ture of a given population using the simple hybrid isolation model. It proceeds in
two main steps. First, it obtains a signal of admixture from the admixed data using
the principal component technique. The second step consists in deriving the date of
admixture using the signal obtained in the first step [17].
Pool and Nielsen also built a haplotype-based approach. It used precautionary
ancestral populations to infer the date of admixture from the genome of an
admixed population [39]. It assumed that, after $g$ generations, the lengths of
the ancestral haplotypes follow an exponential distribution given by

\[ f(\lambda) = g\, e^{-g\lambda} \]

where $\lambda$ is the length of haplotypes. The mean of this distribution is
therefore equal to $1/g$.
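Because an exponential distribution with mean 1/g has rate g, the number of generations can be estimated from observed haplotype (tract) lengths as the reciprocal of their mean. The sketch below illustrates this, assuming lengths measured in Morgans and the density g * exp(-g * length). It is a simplified illustration, not Pool and Nielsen's estimator, and the function names and tract lengths are invented for the example.

```python
import math

def exponential_density(length, g):
    """Density g * exp(-g * length) of a tract of the given length."""
    return g * math.exp(-g * length)

def estimate_generations(tract_lengths):
    """Estimate g as 1 / (mean tract length), the exponential rate estimate."""
    if not tract_lengths:
        raise ValueError("at least one tract length is required")
    mean_length = sum(tract_lengths) / len(tract_lengths)
    return 1.0 / mean_length

# Illustrative tract lengths in Morgans (mean 0.2, so g is about 5).
g_hat = estimate_generations([0.1, 0.3, 0.2])
```

Shorter tracts imply a larger g, which reflects the expectation that recombination breaks ancestral haplotypes into smaller pieces with every generation since admixture.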
Further methods include that of Gravel, developed in 2012 for the identification
of multiple ancestral populations in a given admixture dataset [40]. Jin et al.
[41] also came up with a similar method to explain admixture dynamics. The
method incorporates several models, including the gradual admixture (GA), hybrid
isolation (HI), and continuous gene flow (CGF) models [41], which can be
extended to GA-Isolation (GA-I) and CGF-Isolation (CGF-I) by considering
isolation after admixture [42]. Hellenthal et al. [43], on the other hand, built
upon the work of Lawson et al. [44] on dating admixture. This method considers
the genome of an admixed individual to be a set of DNA chunks coming from other
individuals. It proceeds in two stages. The first stage consists in dividing the
genome into chunks and matching each of them to the proper ancestral individual;
this stage is achieved with the help of a Hidden Markov Model. The second stage
consists in identifying haplotypes and determining their respective ancestral
population [43, 44]. The admixture event and its date are then derived by
fitting the decay of the ancestral haplotypes with an exponential distribution
curve. Ni et al. developed a method based on the observation that the estimated
date of admixture events depends on the model used. Their method uses the
likelihood ratio test to identify the best model for inferring the date of
admixture. Furthermore, they are able to estimate several admixture events with
the given optimal model [35].
Finally, different existing models and tools for dating admixture events are
summarized in Table 2 extracted from Chimusa et al. [35].
Ancestry_HMM HI No No https://github.com/
russcd/
Table 2.
Existing dating admixture genomic tools.
Although several models exist to deconvolve local ancestry, most studies that
evaluate such models have shown that deviations in local ancestry estimates
still exist in multi-way admixtures. Deviations in local ancestry also result
from genetic drift, miscalling of true ancestry, and genotyping errors. However,
the signals from these factors affect the whole genome, while that of unmodelled
natural selection affects particular regions. For example, Chen et al., using
four local ancestry inference models to scan for disease-related loci through
admixture mapping, showed that although all of them are LD-based and divide the
genome into windows of contiguous SNPs, MULTIMIX and LAMPLD estimates differed
in almost 20% of the analyzed SNPs. This results from differences in the
biological and statistical parameters they require and in the mathematical
approaches they use. Another association study by Chimusa et al. [45] also
pointed out that admixture mapping is still limited by inaccuracies in multi-way
local ancestry deconvolution when they evaluated one non-LD-based and one
LD-based local ancestry model, WINPOP and LAMPLD, respectively.
Inaccuracies in local ancestry estimates may result from the use of statistical
or biological parameters in the estimation process, which are not always
accurate when provided. They could also be due to the dependence of models on
reference panels, which are few for some populations and not sampled at all for
others, as is the case for Native Americans. Moreover, the history of other
admixed populations is not well known. When applied to ancient admixtures,
existing methods may yield spurious estimates, as they were designed for recent
admixtures. Existing methods do not account for natural selection; hence, some
deviations exist in regions that are under selection [45]. Also, most of them
are benchmarked for three-way admixtures.
Since each model was introduced to address a particular challenge faced by the
models before it, it is to be expected that no model or tool can achieve the
best performance in all admixture scenarios without trading estimate accuracy
for computational speed. According to the studies surveyed by Geza et al. [34],
more than 50% of the studies that either introduced a model or evaluated methods
for association mapping showed that LAMPLD/LAMPHAP outperforms most LD-based
methods. The only LD-based method that outperformed LAMPLD is ELAI; however,
this result comes from the only study that compared ELAI with other models. In
cases where LD-based models were compared to non-LD-based models, RFMIX
outperformed LAMPLD in three cases highlighted in [34], and RFMIX also performed
best in a separate study aiming to determine the place of admixture of an
admixed population. This could be because RFMIX can deconvolve ancestry in
closely related populations [46]. However, in a recent assessment of RFMIX and
LOTER, LOTER performed better in ancient admixtures [15].
Generally, each model is implemented as a separate local ancestry deconvolution
tool, existing as individual scripts that require unique inputs and produce
unique outputs. This challenges researchers with a limited computational
background; there is thus a lack of a unified framework that requires standard,
easy-to-manipulate input files and outputs results in a form that is easy to
process for further applications. In conclusion, for informed decisions on
models and algorithms, existing models or tools should be assessed within a
unified framework. This would allow them to be tested on different admixture
scenarios while incorporating most state-of-the-art LD-based and non-LD-based
models.
The evolution of human populations and the history of their mixture have been
deciphered using statistical and computational methods. These methods have been
found to perform well when dealing with a single-point admixture event in
two-way admixed populations [35]. However, like any method, they have pitfalls
as well as advantages when estimating admixture dates in some cases. It is
challenging to fit real admixed populations (in contexts of more than three-way
admixture) to the existing models for dating admixture events, for several
reasons, including reliance on optimal local ancestry estimates and accurate
ancestry breakpoints. This suggests that there is still a need to design an
integrative or a new model for dating admixture events in current multi-way
admixed populations, to further advance our understanding of human demographics
and movement and to facilitate admixture mapping and estimation of the age of a
disease locus contributing to disease risk.
In addition, it has been found that the mixture exponential decay model
over-estimates the date of older admixture events [35], and it was suggested to
detect at most three admixture events. As mentioned earlier, Ni et al. [47]
dealt with the optimization of the method used in dating admixture. They took
several models into account, but their technique is not effective in the
estimation of ancient and multiple admixture events [35, 47]. Several practical
considerations can further limit these approaches, including the use of proxy
ancestral populations in the estimations, which could bias the result, for
instance with a low sample size or inappropriate proxy ancestral populations
[35], and the requirement for accurate LD patterns, ancestry haplotype
distributions, and a large sample size of the admixed population. Thus, there is
a need for an adequate model for inferring different dates of admixture events
and matching real admixture history using proxy ancestry-based methods [35].
4. Conclusions
Acknowledgements
Some of the authors are supported in part by the National Institutes of Health
(NIH) Common Fund [grant numbers U24HG006941 (H3ABioNet) and
1U01HG007459–01 (SADaCC)]. One of the authors is fully funded by the Organization
for Women in Science for the Developing World (OWSD) and Swedish
International Development Cooperation Agency (Sida). The content of this
publication is solely the responsibility of the authors and does not necessarily
represent the official views of the funders.
Orienting Future Trends in Local Ancestry Deconvolution Models to Optimally Decipher…
DOI: http://dx.doi.org/10.5772/intechopen.82764
Conflict of interest
Author details
1 African Institute for Mathematical Sciences (AIMS), Cape Town, South Africa
© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
Bioinformatics Tools for Detection and Clinical Interpretation of Genomic Variations
References
[1] Cavalli-Sforza LL, Feldman MW. The application of molecular genetic approaches to the study of human evolution. Nature Genetics. 2003;33:266-275
[2] Koehl A. Estimating Ancestry and Genetic Diversity in Admixed Populations. Thesis. The University of New Mexico; 2016
[3] Yang JJ, Cheng C, Devidas M, Cao X, Fan Y, Campana D, et al. Ancestry and pharmacogenomics of relapse in acute lymphoblastic leukemia. Nature Genetics. 2011;43(3):237-241
[4] Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genetics. 2009;5(6):e1000519
[5] Thornton TA, Bermejo JL. Local and global ancestry inference and applications to genetic association analysis for admixed populations. Genetic Epidemiology. 2014;38(S1):S5-S12
[6] Liu Y, Nyunoya T, Leng S, et al. Softwares and methods for estimating genetic ancestry in human populations. Human Genomics. 2013;7(1):1
[7] Bhatia G, Patterson N, Pasaniuc B, et al. Genome-wide comparison of African-ancestry populations from CARe and other cohorts reveals signals of natural selection. American Journal of Human Genetics. 2011;89:368-381
[8] Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, et al. Methods for high-density admixture mapping of disease genes. The American Journal of Human Genetics. 2004;74(5):979-1000
[9] Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM. Design and analysis of admixture mapping studies. The American Journal of Human Genetics. 2004;74(5):965-978
[10] Tang H, Coram M, Wang P, Zhu X, Risch N. Reconstructing genetic ancestry blocks in admixed individuals. The American Journal of Human Genetics. 2006;79(1):1-12
[11] Sankararaman S, Sridhar S, Kimmel G, Halperin E. Estimating local ancestry in admixed populations. The American Journal of Human Genetics. 2008;82(2):290-303
[12] Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, et al. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics. 2012;28(10):1359-1367
[13] Omberg L, Salit J, Hackett N, Fuller J, Matthew R, Chouchane L, et al. Inferring genome-wide patterns of admixture in Qataris using fifty-five ancestral populations. BMC Genetics. 2012;13(1):49
[14] Yang JJ, Li J, Buu A, Williams LK. Efficient inference of local ancestry. Bioinformatics. 2013;29(21):2750-2756
[15] Dias-Alves T, Mairal J, Blum MG. Loter: A software package to infer local ancestry for a wide range of species. Molecular Biology and Evolution. 2018;35(7):msy126
[16] Cheng R, Lim J, Samocha K, et al. Genome-wide association studies and the problem of relatedness among advanced intercross lines and other highly recombinant populations. Genetics. 2010;185:1033-1044
[17] Pugach I, Matveyev R, Wollstein A, et al. Dating the age of admixture via
[19] Murphy KP. Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts, London: MIT Press; 2012
[20] Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics. 2003;164(4):1567-1587
[21] Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics. 2012;99(6):323-329
[22] Seldin MF, Pasaniuc B, Price AL. New approaches to disease mapping in admixed populations. Nature Reviews Genetics. 2011;12(8):523-528
[23] Hu Y, Willer C, Zhan X, Kang HM, Abecasis G. Accurate local-ancestry inference in exome-sequenced admixed individuals via off-target sequence reads. The American Journal of Human Genetics. 2013;93(5):891-899
[24] Brisbin A, Bryc K, Byrnes J, Zakharia F, Omberg L, Degenhardt J, et al. PCAdmix: Principal components-based assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations. Human Biology. 2012;84(4):343
[25] Sankararaman S, Kimmel G, Halperin E, Jordan MI. On the inference of ancestries in admixed populations. Genome Research. 2008;18(4):668-675
[26] Sundquist A, Fratkin E, Do CB, Batzoglou S. Effect of genetic divergence in identifying ancestral
[28] Rodriguez JM, Bercovici S, Elmore M, Batzoglou S. Ancestry inference in complex admixtures via variable-length Markov chain linkage models. Journal of Computational Biology. 2013;20(3):199-211
[29] Guan Y. Detecting structure of haplotypes and local ancestry. Genetics. 2014;196(3):625-642
[30] Padhukasahasram B. Inferring ancestry from population genomic data and its applications. Frontiers in Genetics. 5:204
[31] Paşaniuc B, Sankararaman S, Kimmel G, Halperin E. Inference of locus-specific ancestry in closely related populations. Bioinformatics. 2009;25(12):i213-i221
[32] Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: A discriminative modeling approach for rapid and robust local-ancestry inference. The American Journal of Human Genetics. 2013;93(2):278-288
[33] Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165(4):2213-2233
[34] Geza E, Mugo J, Mulder NJ, Wonkam A, Chimusa ER, Mazandu GK. A comprehensive survey of models for dissecting local ancestry deconvolution in human genome. Briefings in Bioinformatics. 2018. DOI: 10.1093/bib/bby044
[35] Chimusa ER, Defo J, Thami PK, Awany D, Mulisa DD, Allali I, et al. Dating admixture events is unsolved problem in multi-way admixed populations. Briefings in Bioinformatics. 2018:1-58. https://doi.org/10.1093/bib/bby112
[36] Moorjani P, Thangaraj K, Patterson N, et al. Genetic evidence for recent population mixture in India. Human Genetics. 2013;93:422-438
[37] Moorjani P, Patterson N, Hirschhorn J, et al. The history of African gene flow into Southern Europeans, Levantines, and Jews. PLoS Genetics. 2011;7:e1001373
[38] Pickrell J, Reich D. Toward a new history and geography of human genes informed by ancient DNA. Trends in Genetics. 2014;30:377-389
using dense haplotype data. PLoS Genetics. 2012;8:e1002453
[45] Chimusa ER, Zaitlen N, Daya M, Møller M, van Helden PD, Mulder NJ, et al. Genome-wide association study of ancestry-specific TB risk in the South African coloured population. Human Molecular Genetics. 2014;23(3):796-809
[46] Xue J, Lencz T, Darvasi A, Pe'er I, Carmi S. The time and place of European admixture in Ashkenazi Jewish history. PLoS Genetics. 2017;13(4):e1006644
[47] Ni X, Yuan K, Yang X, et al. Inference of multiple-wave admixtures by length distribution of ancestral tracks. Heredity (Edinb). 2018;121:52-63
Chapter 4
Recognition of Multiomics-Based
Molecule-Pattern Biomarker for
Precise Prediction, Diagnosis, and
Prognostic Assessment in Cancer
Xianquan Zhan, Tian Zhou, Tingting Cheng and Miaolong Lu
Abstract
1. Introduction
Recognition of Multiomics-Based Molecule-Pattern Biomarker for Precise Prediction, Diagnosis…
DOI: http://dx.doi.org/10.5772/intechopen.84221
individual differences between tumor patients. For example, liver and kidney
function, age, physical condition, psychological status and personal lifestyle
are other important factors that affect tumor progression and treatment [30].
Many treatment plans were designed according to the doctor’s experience and
applied the same therapy model to different cancer patients in the clinic.
Because it ignores tumor heterogeneity, the “one-size-fits-all” therapeutic
model could not fully achieve the expected curative effect [4]. Tumor
heterogeneity is thereby becoming an important factor that hinders effective
treatment and cancer research.
The molecular mechanisms of cancer initiation and progression do not involve
only one intracellular signaling pathway [31]. Several studies have indicated that
the phosphoinositide 3-kinase/protein kinase B (PI3K/Akt), mitogen-activated protein
kinase (MAPK), and signal transducer and activator of transcription 3 (STAT3)
pathways are activated in obesity-associated colon cancer. Mammalian target
of rapamycin (mTOR), which lies downstream of both PI3K/Akt and MAPK, is highly
activated [32]. Activated mTOR in turn inhibits the PI3K/Akt pathway
and further activates the STAT3 pathway [33]. When mTOR is inhibited, the
activity of PI3K/Akt may increase markedly because the feedback inhibition of
mTOR on PI3K activity is relieved [34]. Therefore, it is necessary to suppress
the expression of mTOR and PI3K simultaneously for the treatment of obesity-related
cancer [4]. It follows that the interactions and interrelationships among multiple
signaling pathways deserve more attention, and that a single signaling molecule
or biomarker is unreliable for the prediction, diagnosis, and treatment of cancer.
So far, there are many kinds of treatments for cancer, including surgery,
radiotherapy, and systemic treatments such as cytotoxic chemotherapy, hormonal
therapy, immunotherapy, and targeted therapies [35]. Personalized or individualized
variations are related to human healthcare, and the relationship is shown in
Figure 1. Human healthcare involves three primary stages: prediction/prevention,
early-stage diagnosis/early-stage therapy, and late-stage diagnosis/late-stage
therapy. Personalized or individualized variations can be used as
biomarkers for prediction, and the assessment of preventive response then
reflects the results of preventive treatments. Personalized or individualized
variations can also serve as diagnostic biomarkers and further guide cancer
therapy. The assessment of therapeutic response, known as prognostic assessment,
applies to both early-stage and late-stage therapy and reveals the influence of
therapeutic intervention. Of the three stages, prediction/prevention is the most
significant because it keeps people healthy and allows them to be treated in
time once cancer occurs. Early-stage diagnosis/therapy is the better approach to
block and repress the progression of cancer when the preventive strategy has failed.
Late-stage diagnosis/therapy is also called the clinical diagnosis and treatment of a
cancer. Unfortunately, most cancer cases are found at a late stage. To avoid
this problem and improve people’s health, many researchers
concentrate on exploring biomarkers for prediction/prevention and early-stage
diagnosis/therapy of cancer [4]. According to functional classification,
biomarkers are divided into two categories (Table 1): (i) those serving as the
mechanism and therapeutic targets, and (ii) those devoted to prediction, diagnostic
testing, and prognostic assessment. The first kind of biomarker is relevant to the
initiation and development of disease, and directly indicates the mechanism and
pathogenesis of the disease. Commonly, it is a pivotal site in cell signaling
pathways, such as P53 in nasopharyngeal carcinoma (NPC) [36]. The other kind of
biomarker does not need to be causal to the occurrence and development of the
disease, but it must have specificity and a sufficient amount of change to be
easily detected. Based on Bayes’ rule, three or more key molecules can form a
molecule-pattern biomarker to improve the accuracy of cancer diagnosis and therapy
[9, 37]. In summary, owing to the complex pathophysiological basis of cancer,
recognition of molecule-pattern biomarkers for precise prediction, diagnosis, and
prognostic assessment in cancer is urgently needed to move closer to realizing
precision medicine (PM) and PPPM.
Figure 1.
Variations involved in each aspect of healthcare. Reproduced from Hu et al. [4], with permission from BioMed
Central open access article, copyright 2013.
Type I | This type of biomarker has a causal relationship with disease, is associated with the initiation and development of disease, and can directly address the pathogenesis of disease. | Contributes to the mechanism and therapeutic targets of disease.
Type II | This type of biomarker does not need a causal relationship with the occurrence and development of disease, but requires specificity and a certain amount of change to be easily detected. | Contributes to the prediction, diagnosis, and prognostic assessment.
Table 1.
Concept and categories of biomarkers [9].
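The way Bayes’ rule lets several molecules reinforce one another, as described above, can be illustrated with a minimal sketch. It assumes that the markers are conditionally independent given disease status, which is a simplification, and every number in it (prior probability, sensitivities, specificities) is hypothetical rather than taken from the cited studies.

```python
def posterior_probability(prior, markers, observed_positive):
    """Combine marker results with Bayes' rule in odds form.

    markers: list of (sensitivity, specificity) pairs, one per molecule.
    observed_positive: list of booleans, True if that marker tested positive.
    Assumes conditional independence between markers (illustrative only).
    """
    odds = prior / (1.0 - prior)
    for (sensitivity, specificity), positive in zip(markers, observed_positive):
        if positive:
            likelihood_ratio = sensitivity / (1.0 - specificity)
        else:
            likelihood_ratio = (1.0 - sensitivity) / specificity
        odds *= likelihood_ratio
    return odds / (1.0 + odds)

# Hypothetical prior probability of disease and three hypothetical markers.
prior = 0.05
markers = [(0.80, 0.90), (0.70, 0.85), (0.75, 0.80)]

single = posterior_probability(prior, markers[:1], [True])
pattern = posterior_probability(prior, markers, [True, True, True])
print(f"one marker positive: {single:.3f}; three-marker pattern positive: {pattern:.3f}")
```

Under these assumed values the posterior probability rises from about 0.30 with one positive marker to about 0.88 when all three are positive, which is the sense in which a pattern of molecules can be more informative than any single one.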
Based on the central dogma, genetic changes influence RNA expression and
cause alterations of proteins. Taking into account changes in metabolites and
tumor heterogeneity as well, all of these variations in the genome, transcriptome,
proteome, metabolome, and radiome are measured with the corresponding
omics methodologies: genomics, transcriptomics, proteomics, metabolomics,
and radiomics. Multiomics-generated biomarkers can make up integrative
molecule-pattern biomarkers and support pattern recognition for cancer treatment.
This section mainly addresses how the five omics approaches mentioned above,
combined with computational biology and systems biology, contribute to the
development of cancer precision medicine (Figure 2) [9].
Figure 2.
Different levels of omics-based pattern biomarkers. Modified from Cheng and Zhan [9], with permission from
Springer open access article, copyright 2017.
3.1 Genomics
3.2 Transcriptomics
Based on the central dogma of genetics, DNA replicates itself and is transcribed
to form mRNAs, which are finally translated into proteins. The mRNA serves as a
bridge between gene and protein in biological processes and links genome and
phenotype. Once a variation occurs in the sequence of an mRNA, the amino acid
sequence of the protein may be correspondingly altered. Therefore, understanding
transcriptomics is important for addressing the functional elements of the genome
and understanding the development of cancer. The key goal of transcriptomics is to
classify all types of transcripts, reveal the transcriptional structure of genes, and
quantify the expression level of each transcript during development and under
different conditions. Nowadays, many methods have been developed for the study of
the transcriptome, such as hybridization- or sequence-based approaches [54]. In
general, the hybridization-based approach incubates fluorescently labeled
complementary DNA (cDNA), obtained by reverse transcription of different mRNAs, with
a microarray containing genes of interest; the array is then digitized with a
dedicated scanner and image analysis, and finally the gene name, clone identifier,
and intensity values are acquired [55]. Furthermore, genomic tiling microarrays have
been found to provide a more accurate view of the transcriptional activities within
a genome [56]. However, there are some disadvantages, such as reliance on current
knowledge of the genome sequence, high background levels owing to
cross-hybridization, and a limited dynamic range of detection resulting from both
background and saturation of signals [57, 58]. The sequence-based strategy detects
the cDNA sequence directly and does not depend on probes. With the development of
the high-throughput DNA sequencing technology of NGS, a new method for mapping and
quantifying the transcriptome has emerged, named RNA-seq. It possesses many
advantages, for instance, high throughput, high sensitivity, high resolution, and
no reconstructions. RNA-seq is able to analyze
3.3 Proteomics
3.4 Metabolomics
3.5 Radiomics
Figure 3.
Application of pattern biomarker in personalized medicine or precision medicine.
5. Conclusion
Acknowledgements
The authors acknowledge the financial support from the Xiangya Hospital
Funds for Talent Introduction (to X.Z.), the Hunan Provincial “Hundred Talent
Plan” program (to X.Z.), the National Natural Science Foundation of China
(Grant Nos. 81572278 and 81272798 to X.Z.), the China “863” Plan Project (Grant No.
2014AA020610-1 to X.Z.), and the Hunan Provincial Natural Science Foundation of
China (Grant No. 14JJ7008 to X.Z.).
Conflict of interest
Authors’ contributions
Z.T. analyzed references and wrote the manuscript draft of the book chapter.
C.T. and M.L. participated in the collection of references and analysis of data.
X.Z. conceived the concept, designed the book chapter, critically revised/wrote the
book chapter, coordinated the work, and was responsible for the correspondence work
and financial support.
Author details
© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References
[1] Block KI, Gyllenhaal C, Lowe L, Amedei A, ARMR A, Amin A, et al. A broad-spectrum integrative design for cancer prevention and therapy. Seminars in Cancer Biology. 2015;35(Suppl):S276-S304. DOI: 10.1016/j.semcancer.2015.09.007
[2] Friedl P, Alexander S. Cancer invasion and the microenvironment: Plasticity and reciprocity. Cell. 2011;147:992-1009. DOI: 10.1016/j.cell.2011.11.016
[3] Maximo V, Lima J, Prazeres H, Soares P, Sobrinho-Simoes M. The biology and the genetics of Hurthle cell tumors of the thyroid. Endocrine-Related Cancer. 2012;19:R131-R147. DOI: 10.1530/ERC-11-0354
[4] Hu R, Wang X, Zhan X. Multi-parameter systematic strategies for predictive, preventive and personalised medicine in cancer. The EPMA Journal. 2013;4:2. DOI: 10.1186/1878-5085-4-2
[5] Kang M, Buckley YM, Lowe AJ. Testing the role of genetic factors across multiple independent invasions of the shrub scotch broom (Cytisus scoparius). Molecular Ecology. 2007;16:4662-4673
[6] Jobling MA. The impact of recent events on human genetic diversity. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences. 2012;367:793-799. DOI: 10.1098/rstb.2011.0297
[8] Hoth M. CRAC channels, calcium, and cancer in light of the driver and passenger concept. Biochimica et Biophysica Acta. 2016;1863:1408-1417. DOI: 10.1016/j.bbamcr.2015.12.009
[9] Cheng T, Zhan X. Pattern recognition for predictive, preventive, and personalized medicine in cancer. The EPMA Journal. 2017;8:51-60. DOI: 10.1007/s13167-017-0083-9
[10] Wagner PD, Srivastava S. New paradigms in translational science research in cancer biomarkers. Translational Research. 2012;159:343-353. DOI: 10.1016/j.trsl.2012.01.015
[11] Canonica GW, Bachert C, Hellings P, Ryan D, Valovirta E, Wickman M, et al. Allergen immunotherapy (AIT): A prototype of precision medicine. World Allergy Organization Journal. 2015;8:31. DOI: 10.1186/s40413-015-0079-7
[12] Biomarkers Definitions Working Group. Biomarkers and surrogate endpoints: Preferred definitions and conceptual framework. Clinical Pharmacology and Therapeutics. 2001;69:89-95. DOI: 10.1067/mcp.2001.113989
[13] Zhai XH, Yu JK, Yang FQ, Zheng S. Identification of a new protein biomarker for colorectal cancer diagnosis. Molecular Medicine Reports. 2012;6:444-448. DOI: 10.3892/mmr.2012.923
[14] Taylor DR, Pavord ID. Biomarkers in the assessment and management of airways diseases. Postgraduate Medical Journal. 2008;84:628-634; quiz 633. DOI: 10.1136/pgmj.2008.069864
[16] Lu M, Zhan X. The crucial role of multiomic approach in cancer research and clinically relevant outcomes. The EPMA Journal. 2018;9:77-102. DOI: 10.1007/s13167-018-0128-8
[19] Sheltzer JM, Torres EM, Dunham MJ, Amon A. Transcriptional consequences of aneuploidy. Proceedings of the National Academy of Sciences of the United States of America. 2012;109:12644-12649. DOI: 10.1073/pnas.1209227109
[20] Gould CM, Courtneidge SA. Regulation of invadopodia by the tumor microenvironment. Cell Adhesion & Migration. 2014;8:226-235
[21] Hanahan D, Weinberg RA. Hallmarks of cancer: The next generation. Cell. 2011;144:646-674. DOI: 10.1016/j.cell.2011.02.013
[22] Zhan X, Desiderio DM. The use of variations in proteomes to predict, prevent, and personalize treatment for clinically nonfunctional pituitary adenomas. The EPMA Journal. 2010;1:439-459. DOI: 10.1007/s13167-010-0028-z
[23] Longo DL. Tumor heterogeneity and personalized medicine. The New England Journal of Medicine. 2012;366:956-957. DOI: 10.1056/NEJMe1200656
[24] Moreno CS, Evans CO, Zhan X, Okor M, Desiderio DM, Oyesiku NM. Novel molecular signaling and
[27] Julien S, Merino-Trigo A, Lacroix L, Pocard M, Goéré D, Mariani P, et al. Characterization of a large panel of patient-derived tumor xenografts representing the clinical heterogeneity of human colorectal cancer. Clinical Cancer Research. 2012;18:5314-5328. DOI: 10.1158/1078-0432.CCR-12-0372
[28] Damia G, D'Incalci M. Genetic instability influences drug response in cancer cells. Current Drug Targets. 2010;11:1317-1324
[29] Marusyk A, Almendro V, Polyak K. Intra-tumour heterogeneity: A looking glass for cancer? Nature Reviews. Cancer. 2012;12:323-334. DOI: 10.1038/nrc3261
[30] George O, Koob GF. Individual differences in prefrontal cortex function and the transition from drug use to drug dependence. Neuroscience and Biobehavioral Reviews. 2010;35:232-247. DOI: 10.1016/j.neubiorev.2010.05.002
[31] Zhan X, Desiderio DM. Signaling pathway networks mined from human pituitary adenoma proteomics data. BMC Medical Genomics. 2010;3:13. DOI: 10.1186/1755-8794-3-13
[32] Laplante M, Sabatini DM. mTOR signaling in growth control and disease.
[58] Wu X, Weng L, Li X, Guo C, Pal SK, Jin JM, et al. Identification of a
[65] Ponting CP, Oliver PL, Reik W. Evolution and functions of long
Chapter 5
Abstract
1. Introduction
Due to the genetic diversity of the hepatitis C virus (HCV), accurate genotyping
remains challenging despite the use of modern molecular techniques. In
addition to the six widely recognised HCV genotypes, a newly identified genotype
(GT) 7 was reported in 2015 [1]. Molecular methods including reverse hybridization,
real-time PCR and Sanger sequencing are commonly utilised for HCV genotyping and
subtyping in clinical laboratories. HCV genotype and subtype (ST) have been
critical factors in decision-making for administering interferon-based therapies for
the past decade [2]. According to the latest AASLD guidelines [3], determination of
viral characteristics including GT, ST and the profile of resistance-associated
variants (RAVs) is important in assigning regimens of direct-acting antivirals
(DAAs) to HCV patients.
To help achieve the best clinical management of HCV patients, a routine diagnostic
laboratory should aim to minimise the reporting of non-informative HCV
genotyping results that arise from inherent limitations of the diagnostic platform
of choice. In general, about 2–8.5% of HCV-positive samples have been reported to
carry “indeterminate” GTs by several commercial assays [4–9]. To tackle
uncertainties in determining HCV GT and ST, Sanger sequencing can be used to resolve
indeterminate or discordant GT or ST results produced by commercial assays
[10, 11]. Although it provides definitive genotyping information most of
the time, unfavourable features of Sanger sequencing, including low throughput,
time-consuming procedures and relatively high costs, pose a barrier to its
routine adoption as a first-line genotyping method. With the advent of
next-generation sequencing (NGS), the limitations of probe-based genotyping assays
and Sanger sequencing for HCV genotyping can be overcome. NGS provides a
high-resolution means for direct sequence-based interrogation of the HCV genome.
Moreover, NGS also allows concurrent profiling of RAVs, a value-added feature that
is highly relevant for the clinical management of HCV infection with appropriate
use of DAAs.
In the present study, the Sentosa SQ HCV genotyping assay (hereinafter referred
to as Vela NGS) (Vela Diagnostics, Singapore), which primarily interrogates the
NS5B region of HCV GTs 1–6 by Ion Torrent-based NGS technology, was evaluated
in comparison to the VERSANT HCV Genotype 2.0 Assay (hereinafter referred
to as LiPA) (Siemens Healthineers, Erlangen, Germany). HCV indeterminate GTs
previously reported in clinical samples by LiPA were resolved using the Vela NGS
assay with further confirmation by Sanger sequencing. Information on RAVs was also
obtained from deeply sequenced NS3, NS5A and NS5B regions in samples classified
as HCV 1a and 1b using Vela NGS.
2. Study design
This study was performed on residual sera or plasma from 222 clinical specimens
previously received for routine genotyping using the VERSANT HCV Genotype 2.0
Line Probe Assay (Siemens Healthineers, Erlangen, Germany). All samples were stored
at -80°C post-LiPA analysis and were only thawed prior to re-analysis by NGS and
Sanger sequencing. All samples were de-identified for anonymisation purposes, and
hence, the treatment histories remain unknown and cannot be traced. These were all
residual samples, which would otherwise be discarded, and were used for the purposes
of assay validation only. In such situations, ethics approval is not normally required, as
no sample could be linked back to the original patient after anonymisation.
In this study, NGS was performed using the Sentosa SQ HCV Genotyping Assay
(4 × 16) (Vela Diagnostics, Singapore) according to the manufacturer’s
instructions. The workflow started with automated extraction of total nucleic acids
from 530 μL of sera or plasma using the Sentosa SX Virus Total Nucleic Acid Plus II
kit (Vela Diagnostics) on the Sentosa SX101 (Vela Diagnostics). PCR amplification of
the HCV NS3, NS5A and NS5B regions was performed on a Veriti 96-Well Thermal Cycler
(Applied Biosystems, CA, USA). In every individual run, a pooled library containing
barcoded amplicons of 15 clinical samples and one system control was prepared by
the Sentosa SX101. The pooled library was subjected to sequencing template
preparation and enrichment on the Sentosa ST401 (Vela Diagnostics). Sequencing data generated
HCV Genotyping with Concurrent Profiling of Resistance-Associated Variants by NGS Analysis
DOI: http://dx.doi.org/10.5772/intechopen.84577
Total nucleic acids were extracted from 200 μL sera or plasma using EZ1 Virus
Mini Kit v2.0 (QIAGEN, Hilden, Germany) on Biorobot EZ1 (QIAGEN). Using
VERSANT HCV Genotype 2.0 Line Probe Assay (LiPA) (Siemens Healthineers), a
one-step reverse transcription-polymerase chain reaction (RT-PCR) amplifying the
5’UTR and core regions was performed on GeneAmp PCR System 9700 (Applied
Biosystems). Reverse hybridisation, washing and colour development steps were
performed on Autoblot 3000H (Fujirebio Europe, Gent, Belgium). For GT and ST
determination, band patterns were manually scored by aligning the strips to an
interpretation chart provided by the manufacturer.
3. Results
3.1 Concordance between results generated by the Vela NGS and Versant
platforms at GT and ST levels
The Vela NGS results at both GT and ST levels are tabulated in Table 1 for the
170 clinical samples with GT and/or ST results from LiPA. Perfect (100%)
concordance at the HCV genotype level was achieved for GT 2 (N = 13), GT 3 (N = 55)
and GT 5 (N = 7). Of the samples reported by LiPA as GT 1 (N = 40), 20% (N = 8) gave
discrepant results when compared to Vela NGS. These samples had previously been
classified by LiPA as either GT 1a with core inconclusive, GT 1b with 96.1%
homology, GT 1b with core inconclusive, or GT 1b with core not available, due to
their unconventional band patterns. There was no discrepancy among samples
firmly reported as GT 1a and GT 1b by LiPA. In samples reported as GT 4 (N = 16)
3.2 Verification of contig sequences generated by the Vela NGS in samples with
discordant results
Of the 170 samples tested, there were 104 agreements at both GT and ST levels,
49 partial agreements at the genotype but not the subtype level, and 17 discordant
results between LiPA and Vela NGS (Table 1). At the GT level, the calculated
Cohen’s Kappa was 0.869 (95% confidence interval: 0.810–0.928), suggesting good
strength of agreement between the two assays. The 66 NGS contig sequences of
samples with partial agreement or discordant results were submitted for online
analysis in the Los Alamos hepatitis C sequence database. The HCV GT and ST called by
Vela NGS were verified in all 66 contigs.
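For readers who wish to compute this kind of agreement statistic themselves, the following is a minimal sketch of Cohen’s Kappa from a square table of paired assay calls. The confidence interval uses a common large-sample approximation of the standard error; the chapter does not state which method was used for its interval, and the 2 × 2 counts below are hypothetical and are not the study data.

```python
import math

def cohen_kappa(confusion):
    """Return (kappa, (ci_low, ci_high)) for a square matrix of counts.

    confusion[i][j] = number of samples called category i by assay A
    and category j by assay B. The 95% interval uses the approximation
    SE = sqrt(po * (1 - po) / (n * (1 - pe) ** 2)).
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (p_observed - p_expected) / (1.0 - p_expected)
    se = math.sqrt(p_observed * (1.0 - p_observed) / (n * (1.0 - p_expected) ** 2))
    return kappa, (kappa - 1.96 * se, kappa + 1.96 * se)

# Hypothetical counts for two assays and two categories (not the study data).
example = [[20, 5],
           [10, 15]]
kappa, (ci_low, ci_high) = cohen_kappa(example)
print(f"kappa = {kappa:.3f}, 95% CI {ci_low:.3f} to {ci_high:.3f}")
```

Applied to a genotype-level cross-tabulation such as Table 1, the matrix would have one row and one column per genotype category, including any indeterminate category.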
HCV genotyping and subtyping results were found to be reproducible for a panel
of 5 samples with different HCV GT/ST, including 1a, 1b, 2a, 3a and 3b, tested in
triplicate within a single run on the Vela NGS platform (Figure 1a). For inter-run
reproducibility testing (Figure 1b), GT and ST results were consistently reported in
another panel of 7 samples, including 1a, 1b, 2b, 3a, 4d, 5a and 6n, which were
repeatedly tested in three separate runs on different days. Details of viral load
and median coverage of the targeted NS5B region are depicted in Figure 1a and b.
In the current Vela NGS assay, variants differing from the wild-type sequence are
detectable at a defined list of codons for HCV 1a and 1b. The 16 target codons in the
NS3 gene are 36, 41, 43, 54, 55, 80, 109, 122, 132 (1a only), 138, 155, 156, 158, 168,
170 (1b only) and 175 (1b only). For NS5A, variants at nine codons are detectable:
28 (1a only), 30 (1a only), 31, 32, 54 (1b only), 58, 62 (1b only), 92 and 93. Eight
codons in the NS5B gene are also covered in this assay: 414, 419, 422, 423, 495,
499 (1b only), 554 and 559.
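The codon panel listed above can be written down as a small lookup table, which makes the subtype restrictions easy to check. The sketch below is illustrative only: the names `TARGET_CODONS`, `parse_variant` and `is_covered` are hypothetical and are not part of the Vela software. It only answers whether a codon falls within the panel described in the text, not whether a particular amino acid substitution is reported by the assay.

```python
import re

# Target codons as listed in the text. None means the codon is covered for
# both HCV 1a and 1b; otherwise the value names the only subtype covered.
TARGET_CODONS = {
    "NS3": {36: None, 41: None, 43: None, 54: None, 55: None, 80: None,
            109: None, 122: None, 132: "1a", 138: None, 155: None,
            156: None, 158: None, 168: None, 170: "1b", 175: "1b"},
    "NS5A": {28: "1a", 30: "1a", 31: None, 32: None, 54: "1b", 58: None,
             62: "1b", 92: None, 93: None},
    "NS5B": {414: None, 419: None, 422: None, 423: None, 495: None,
             499: "1b", 554: None, 559: None},
}


def parse_variant(name):
    """Split a name such as 'Q80K' into (wild_type, codon, substitution)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", name)
    if match is None:
        raise ValueError(f"unrecognised variant name: {name!r}")
    wild_type, codon, substitution = match.groups()
    return wild_type, int(codon), substitution


def is_covered(gene, variant, subtype):
    """Return True if the variant's codon is in the panel for this subtype."""
    _, codon, _ = parse_variant(variant)
    panel = TARGET_CODONS.get(gene, {})
    if subtype not in ("1a", "1b") or codon not in panel:
        return False
    restriction = panel[codon]
    return restriction is None or restriction == subtype


print(is_covered("NS3", "Q80K", "1a"))
print(is_covered("NS5B", "V499A", "1a"))
```

The second call illustrates a subtype restriction: codon 499 of NS5B is listed as covered for HCV 1b only.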
Of 13 GT 1a samples (Table 2), five were found to carry at least one target variant
in the NS3 gene. Notably, two samples carried the Q80K RAV. For NS5A, the M28A
variant was detected in one sample in which NS3 Q80K was also present. None of the
GT 1a samples was found to carry any of the target variants in the NS5B gene.
Of 18 HCV 1b samples (Table 2), five were found to carry at least one target
variant in the NS3 gene, and twelve carried at least one target variant
in the NS5A gene. For NS5B, the P495A and V499A variants were detected in one
and eight samples, respectively. Notably, four samples carried at
least one target variant in each of the NS3, NS5A and NS5B genes.
HCV Genotyping with Concurrent Profiling of Resistance-Associated Variants by NGS Analysis
DOI: http://dx.doi.org/10.5772/intechopen.84577
Table 1.
Comparison of GT and ST distribution in 170 samples tested by both LiPA and Vela NGS.
Figure 1.
Precision studies on the Vela NGS. (a) Intra-run and (b) inter-run reproducibility on median read depth
were tested on 5 and 7 clinical specimens, respectively. For RAV analysis, variants were called with reproducible
frequency (c) within a run (intra-run) and (d) between runs (inter-run).
Table 2.
List of resistance-associated variants (RAVs) identified in GT 1a and 1b samples by Vela NGS.
Table 3.
Comparison of genotyping results produced by the Vela NGS and Sanger sequencing methods in 40 specimens
with indeterminate genotypes by LiPA.
and V499A were also repeatedly identified in the NS5A and NS5B genes of the GT 1b
sample, respectively. Variant frequencies of the three variants were highly
reproducible within the run (Figure 1c).
In the inter-run reproducibility study, NS3 S122G and NS5B V499A variants were
tested. Variant frequencies of the two variants were found to be highly reproducible
among the three separate runs (Figure 1d).
4. Discussion
The application of NGS assays to analyse quasispecies HCV genomes has been
increasing in recent years. Several laboratory-developed NGS assays have been
previously described in the literature for phylogenetic studies [14], outbreak
investigation [15, 16], characterisation of HCV full genome [17, 18] and identification
of HCV GT and ST in clinical samples [19, 20]. However, there are few reports
of the adoption of NGS assays in routine HCV genotyping. In 2016, Vela NGS became
available as a CE-IVD certified commercial kit for diagnostic use in clinical
laboratories. In this study, we report the performance characteristics of Vela NGS in
comparison to the widely used LiPA assay for HCV genotyping.
The performance of Vela NGS in determining the HCV GT and ST in clinical
specimens has been discussed in several previous studies [21–23]. Perfect
agreement at GT level was observed between Vela NGS and LiPA in a study by Manee
et al. [21]. Samples with unclear ST results in GTs 2, 3, 4 and 6 reported by LiPA
were each assigned a specific subtype after being subjected to Vela NGS analysis.
Dirani et al. [22] also performed a direct comparison of GT and ST calling between
Vela NGS and LiPA for samples from patients infected with HCV GTs 1, 2,
3 and 4, and found a high concordance (>99%) at GT level between the two tests.
Vela NGS was also found to perform better than LiPA in assigning HCV STs among
the four GTs [22]. In another study by Rodriguez et al. [23],
Vela NGS achieved high concordance rates with Sanger sequencing in assigning GTs
1 to 6, 1a and 1b STs, and other STs for GTs 4, 5 and 6. Discrepant calls at ST level
between Vela NGS and Sanger sequencing were mainly found among HCV GTs 1 and 2;
Sanger sequencing was used as the reference method to sequence the 286 bp segment of
NS5B on which phylogenetic analysis was performed.
In the present study, discrepant results were mainly observed in samples reported by
LiPA as GT 1b with incomplete or missing bands at the core region. In this particular
result group, Vela NGS assigned GT 6 with various STs. This observation
was not unexpected, as the LiPA interpretation chart specifies that GT
6 (STs c-l) cannot be differentiated from STs 1a and 1b without additional
information from the core region sequence. Among LiPA GT 4 samples, all ST 4h were
reassigned as GT 6 by Vela NGS. Some geographical regions, for example Southeast
Asia, where GT 6 is highly prevalent [24], could thus be impacted more by this
misclassification with the use of the LiPA method.
In contrast to LiPA, which primarily utilises the 5’UTR for GTs 1-6 and the core region
to discriminate GT 6 STs c-l from 1a and 1b, Vela NGS targets the non-
structural genes implicated in both accurate genotyping/subtyping and resistance to
DAAs. LiPA is known to be poor at detecting and identifying recombinant forms
of HCV [25]. Because Vela NGS also interrogates only sub-genomic regions, recombinant
forms may pose a problem for this platform as well, despite the application of NGS
technology. HCV recombinant forms can be accurately detected by sequencing the
recombination breakpoint junctions or the whole HCV genome [26]. Mixed infections, on
the other hand, can be detected by Vela NGS: in our study, one previously
LiPA-indeterminate sample was reported by Vela NGS to have a mixed infection
with HCV 2a and 3a. This NGS finding was confirmed by Sanger sequencing, in
which overlapping electropherograms were observed for NS5B.
Vela NGS offers information on RAVs in HCV 1a- or 1b-positive samples;
such profiling is useful when prescribing DAA regimens and for detecting
baseline or emerging RAVs. Targeted assays have previously been developed to
identify a specific RAV [27, 28]. RAVs present at baseline at a variant frequency
of at least 15% are known to confer resistance to certain DAAs [29],
and may therefore affect the effectiveness of DAA treatment [30]. Vela NGS
targets relevant RAVs in three non-structural gene segments (NS3, NS5A and NS5B) of
HCV 1a and 1b. Although the RAV profiling is comprehensive, it is not exhaustive
owing to the assay design, and any baseline RAV present in any of these DAA target
genes can affect therapeutic effectiveness [31]. In our study, four HCV 1b samples were
found to harbour variants in all three genes (NS3, NS5A and NS5B) concurrently.
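The 15% variant-frequency criterion and the search for samples with variants in all three DAA target genes can be sketched as a simple filter. This is an illustration under stated assumptions: the `VariantCall` record, the function names and all sample identifiers and frequencies are hypothetical, and the text does not state that the Vela NGS software applies this threshold in this way.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantCall:
    sample: str
    gene: str         # "NS3", "NS5A" or "NS5B"
    variant: str      # for example "Q80K"
    frequency: float  # variant frequency, in percent of reads


RAV_THRESHOLD = 15.0  # percent, the baseline level discussed in the text
DAA_GENES = frozenset({"NS3", "NS5A", "NS5B"})


def above_threshold(calls, threshold=RAV_THRESHOLD):
    """Keep only the calls at or above the variant-frequency threshold."""
    return [c for c in calls if c.frequency >= threshold]


def samples_with_all_genes(calls, threshold=RAV_THRESHOLD):
    """List samples that carry a qualifying variant in every DAA target gene."""
    genes_by_sample = defaultdict(set)
    for call in above_threshold(calls, threshold):
        genes_by_sample[call.sample].add(call.gene)
    return sorted(s for s, genes in genes_by_sample.items() if DAA_GENES <= genes)


# Hypothetical calls with invented frequencies, NOT the data of Table 2.
calls = [
    VariantCall("S1", "NS3", "S122G", 99.0),
    VariantCall("S1", "NS5A", "Y93H", 42.5),
    VariantCall("S1", "NS5B", "V499A", 98.0),
    VariantCall("S2", "NS3", "Q80K", 97.0),
    VariantCall("S2", "NS5A", "M28A", 6.0),  # below the threshold
]
print(samples_with_all_genes(calls))
```

Keeping the threshold as a parameter makes it straightforward to examine how the number of flagged samples changes with a different cut-off.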
5. Conclusions
In conclusion, the genotyping results of the Vela NGS were found to be highly
concordant with those of the LiPA method. Vela NGS refined the ST assignment in
GT 6 and resolved previously indeterminate GTs reported by LiPA. Technically, the
HCV Vela NGS showed consistent intra- and inter-run reproducibility in
terms of GT and ST calling and RAV identification. Detection of infections with
multiple HCV GTs or STs is feasible by Vela NGS. Because the assay design
relies on investigating HCV sub-genomic regions, HCV recombinant strains
may still be missed. Deep sequencing allows sensitive identification of
RAVs in the GT 1a and 1b NS3, NS5A and NS5B regions, but the list of target RAVs
is not exhaustive. We also suggest that the RAV detection spectrum be
extended to cover GTs other than HCV 1a and 1b, namely GTs 2-6.
Acknowledgements
We thank Cui-Wen Chua, Mui-Joo Khoo and Lily Chiu of the Department of
Laboratory Medicine at the National University Hospital, Singapore, for their
technical assistance in performing the NGS and LiPA analysis. We also thank Vela
Diagnostics Singapore for funding the NGS reagents in this study.
Conflict of interest
Author details
© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References
[1] Murphy DG, Sablon E, Chamberland J, Fournier E, Dandavino R, et al. Hepatitis C virus genotype 7, a new genotype originating from central Africa. Journal of Clinical Microbiology. 2015;53:967-972
[2] Ghany MG, Strader DB, Thomas DL, Seeff LB. Diagnosis, management, and treatment of hepatitis C: An update. Hepatology. 2009;49:1335-1374
[3] AASLD-IDSA HCV Guidance Panel. Hepatitis C guidance 2018 update: AASLD-IDSA recommendations for testing, managing, and treating hepatitis C virus infection. Clinical Infectious Diseases. 2018
[4] Germer JJ, Majewski DW, Rosser M, Thompson A, Mitchell PS, et al. Evaluation of the TRUGENE HCV 5'NC genotyping kit with the new GeneLibrarian module 312 for genotyping of hepatitis C virus from clinical specimens. Journal of Clinical Microbiology. 2003;41:4855-4857
[5] Verbeeck J, Stanley MJ, Shieh J, Celis L, Huyck E, et al. Evaluation of Versant hepatitis C virus genotype assay (LiPA) 2.0. Journal of Clinical Microbiology. 2008;46:1901-1906
[6] González V, Gomes-Fernandes M, Bascuñana E, Casanovas S, Saludes V, et al. Accuracy of a commercially available assay for HCV genotyping and subtyping in the clinical practice. Journal of Clinical Virology. 2013;58:249-253
[7] Némoz B, Roger L, Leroy V, Poveda JD, Morand P, et al. Evaluation of the cobas® GT hepatitis C virus genotyping assay in G1-6 viruses including low viral loads and LiPA failures. PLoS One. 2018;13:e0194396
[8] Fernández-Caballero JA, Alvarez M, Chueca N, Pérez AB, García F. The cobas® HCV GT is a new tool that accurately identifies Hepatitis C virus genotypes for clinical practice. PLoS One. 2017;12:e0175564
[9] Benedet M, Adachi D, Wong A, Wong S, Pabbaraju K, Tellier R, et al. The need for a sequencing-based assay to supplement the Abbott m2000 RealTime HCV Genotype II assay: A 1 year analysis. Journal of Clinical Virology. 2014;60:301-304
[10] Larrat S, Poveda JD, Coudret C, Fusillier K, Magnat N, et al. Sequencing assays for failed genotyping with the versant hepatitis C virus genotype assay (LiPA), version 2.0. Journal of Clinical Microbiology. 2013;51:2815-2821
[11] Chueca N, Rivadulla I, Lovatti R, Reina G, Blanco A, et al. Using NS5B sequencing for hepatitis C virus genotyping reveals discordances with commercial platforms. PLoS One. 2016;11:e0153754
[12] Quer J, Gregori J, Rodríguez-Frias F, Buti M, Madejon A, et al. High-resolution hepatitis C virus subtyping using NS5B deep sequencing and phylogeny, an alternative to current methods. Journal of Clinical Microbiology. 2015;53:219-226
[13] Kuiken C, Yusim K, Boykin L, Richardson R. The Los Alamos hepatitis C sequence database. Bioinformatics. 2005;1:379-384
[14] Gonçalves Rossi LM, Escobar-Gutierrez A, Rahal P. Multiregion deep sequencing of hepatitis C virus: An improved approach for genetic relatedness studies. Infection, Genetics and Evolution. 2016;38:138-145
[15] Escobar-Gutiérrez A, Vazquez-Pichardo M, Cruz-Rivera M, Rivera-Osorio P, Carpio-Pedroza JC, et al. Identification of hepatitis C virus
Edited by Ali Samadikuchaksaraei and Morteza Seifi
Genomic variations and phenotypic diversity are closely linked and form the
underlying mechanism for development of many human diseases. This book addresses
the methods of detection, analysis, and interpretation of genomic variations in
clinically relevant scenarios. If your research or clinical practice involves handling
of genomic sequencing data, this book is for you. Topics covered include: methods
for identifying genetic diversity, the workflow for analyzing whole exome and whole
genome sequencing data, local ancestry deconvolution models, the value of molecular
patterns and pattern biomarkers in cancer diagnosis and prognosis, and genotyping
and profiling resistance-associated variants of hepatitis C.
ISBN 978-1-83881-844-9
ISBN 978-1-78923-799-3
Published in London, UK
© 2019 IntechOpen
© Bilge Yurtsever / iStock