Computing and Visualizing GSVD
Rui Luo
University of Utah
UUCS-19-003
School of Computing
University of Utah
Salt Lake City, UT 84112 USA
29 April 2019
Abstract
The Human Genome Project has been completed, but barriers remain between researchers
who study genetic sequences and clinicians who treat cancers. First, reproducibility in
genetic studies is low because of differing sequencing techniques and batch effects.
Second, clinicians who do not have a computational background find existing computational
methods difficult to interpret. To narrow these gaps, a computational model should be
developed that finds the significant genes in a genome and separates batch and
experimental effects from biological effects. The proposed solution is to use the
generalized singular value decomposition (GSVD) to reveal genetic patterns on the
transformation of genes, and to separate the tumor-exclusive genotype from experimental
inconsistencies.
Here we developed a toolkit for computing and visualizing the GSVD in Python.
COMPUTING AND VISUALIZING THE GENERALIZED
SINGULAR VALUE DECOMPOSITION IN PYTHON
by
Rui Luo
Bachelor of Science
in
Computer Science
School of Computing
The University of Utah
May 2019
Copyright © Rui Luo 2019
All Rights Reserved
CONTENTS
ABSTRACT
LIST OF FIGURES
CHAPTERS
1. INTRODUCTION
2. BACKGROUND
  2.1 The GSVD as a comparative spectral decomposition
    2.1.1 Generalized fractions and angular distances
    2.1.2 Computing the GSVD via QR decomposition and the SVD
  2.2 Genomic signal processing case study
3. METHODS
  3.1 Computing the GSVD via QR decomposition and the SVD
    3.1.1 Computing QR decomposition
    3.1.2 Computing the SVD
  3.2 Testing the GSVD raster visualization
  3.3 Computing and visualizing the generalized fractions and angular distances
    3.3.1 Computing and visualizing the generalized fractions
    3.3.2 Computing and visualizing angular distances
  3.4 Computing and visualizing Kaplan-Meier survival analysis
    3.4.1 Lifelines versus scikit-survival
    3.4.2 Extending lifelines visualization
  3.5 Creating boxplot displays
    3.5.1 Mapping columns (attributes) to similar groups
4. RESULTS
  4.1 TCGA astrocytoma tumor and patient-matched normal DNA copy-number datasets
  4.2 Visualization of the GSVD in this case study
  4.3 Visualization of the bar charts in this case study
  4.4 Visualization of the Kaplan-Meier survival analyses in this case study
  4.5 Visualization of the boxplots in this case study
  4.6 Improvement of computational time of the GSVD in Python relative to Mathematica
5. CONCLUSIONS
REFERENCES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Sequencing human genomes has become less expensive. However, scientists still struggle
to extract the information they expect from the results of genetic sequencing. For
example, to find out which parts of the genome are significant, scientists need to
perform a large number of trials across the whole genome. One example of this difficulty
is the regularly observed patterns of copy-number variations in cancer. Although these
recurring DNA alterations have been observed to play an important role in the study of
cancer, they have not been applied to actual clinical use.
Genomic signal processing tools are built to remove the barriers to translating genetic
sequencing results into usable information. These tools uncover the underlying patterns
in sequencing results and help clinicians better interpret recurring DNA alterations.
Most of the research relevant to genome-wide patterns of DNA copy-number variations
does not focus on the subtypes of these variations. However, the subtypes are the
indicators, and they can increase the accuracy of predictions. Including the tools
mentioned above in research experiments makes the results more persuasive.
Our lab’s genomic signal processing has been implemented in Mathematica, and the
functions used there can be extended into a toolkit. While Mathematica computes matrix
decompositions correctly, there is no evidence that it uses highly optimized LAPACK
routines. The decompositions contained in the NumPy linear algebra library, however, are
based on these LAPACK routines. This suggests that we may be able to obtain performance
improvements by implementing the GSVD in Python with the NumPy library. An additional
benefit of Python is that many cloud computing pipelines interface well with it, whereas
few are compatible with Mathematica. Finally, because Python is available at no cost and
makes it easy to create and share libraries, it offers a platform for our lab’s research
to reach a broader audience.
CHAPTER 2
BACKGROUND
\[
D_i = U_i \Sigma_i V^T = \sum_{n=1}^{N} \sigma_{i,n} \, u_{i,n} \otimes v_n^T, \qquad i = 1, 2, \tag{2.1}
\]
where $U_i \in \mathbb{R}^{M_i \times N}$ are real and column-wise orthonormal and
$V^T \in \mathbb{R}^{N \times N}$ is real, invertible, and with normalized rows. The $2N$
positive generalized singular values are arranged in
$\Sigma_i = \mathrm{diag}(\sigma_{i,n}) \in \mathbb{R}^{N \times N}$ in a decreasing order
of the ratio $\sigma_{1,n}/\sigma_{2,n}$. The GSVD is unique up to phase factors of $\pm 1$
of each triplet of corresponding column and row basis vectors, i.e., $u_{i,n}$ and $v_n$,
except in degenerate subspaces defined by subsets of pairs of generalized singular values
of equal ratios, i.e., $\sigma_{1,n}/\sigma_{2,n}$. The GSVD generalizes the SVD from one
to two matrices. Like the SVD, the GSVD is a mathematical building block of algorithms,
e.g., for solving the problem of constrained least squares in algebra [5], and theories,
e.g., for describing oscillations near equilibrium in classical mechanics [6].
between their rows. The authors [16] defined the significance of the row basis vector
$v_n^T$ and the corresponding column basis vector $u_{i,n}$ in the corresponding matrix
$D_i$, i.e., the “generalized fraction” $p_{i,n}$, to be proportional to the corresponding
generalized singular value $\sigma_{i,n}$, and the “generalized normalized Shannon
entropy” of $D_i$ to be proportional to the arithmetic mean of $p_{i,n} \log p_{i,n}$. The
authors [16] defined the significance of $v_n$ and $u_{1,n}$ in $D_1$ relative to that of
$v_n$ and $u_{2,n}$ in $D_2$, i.e., the “GSVD angular distance,” to be a function of the
ratio $\sigma_{1,n}/\sigma_{2,n}$ that, from the cosine-sine decomposition, is related to
an angle,
Note that the angular distances $\theta_n$ are different from the principal angles
corresponding to canonical correlations, just as the GSVD is different from canonical
correlations analysis (CCA) [15].
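As a minimal sketch of the quantities defined in this section, the NumPy function below
computes generalized fractions, normalized Shannon entropies, and angular distances from
two vectors of generalized singular values. The exact normalizations are assumptions,
because the defining equations are not reproduced in this text: the fractions are taken
as squared singular values normalized to sum to one, the entropy is scaled by log N so
that it lies between 0 and 1, and the angle is taken as arctan(σ1/σ2) − π/4, which gives
±π/4 for ratios much larger or much smaller than 1 and 0 for a ratio of 1. The name
gsvd_significance is hypothetical and is not the toolkit's function.

```python
import numpy as np

def gsvd_significance(sigma1, sigma2):
    """Return (p1, p2, d1, d2, theta) for two vectors of positive
    generalized singular values of equal length N >= 2.

    Assumed definitions (not confirmed by the text):
      p_i     = sigma_i**2 / sum(sigma_i**2)
      d_i     = -sum(p_i * log(p_i)) / log(N)
      theta_n = arctan(sigma_1n / sigma_2n) - pi/4
    """
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    def fractions(s):
        return s**2 / np.sum(s**2)

    def entropy(p):
        nonzero = p[p > 0]  # treat 0 * log(0) as 0
        return float(-np.sum(nonzero * np.log(nonzero)) / np.log(p.size))

    p1, p2 = fractions(sigma1), fractions(sigma2)
    # arctan2(y, x) equals arctan(y / x) for positive x and y
    theta = np.arctan2(sigma1, sigma2) - np.pi / 4
    return p1, p2, entropy(p1), entropy(p2), theta

# Made-up singular values: first pair exclusive to dataset 1,
# second pair common to both, third pair exclusive to dataset 2.
sigma1 = np.array([3.0, 1.0, 0.1])
sigma2 = np.array([0.1, 1.0, 3.0])
p1, p2, d1, d2, theta = gsvd_significance(sigma1, sigma2)
```

With these made-up values, the first angle is close to +π/4, the second is 0, and the
third is close to −π/4.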
A unique row basis vector $v_n^T$ that is significant in either $D_1$ or $D_2$, and with
an angular distance of $\theta_n \approx \pm\pi/4$, which corresponds to a ratio of
$\sigma_{1,n}/\sigma_{2,n} \gg 1$ or $\ll 1$, respectively, is mathematically approximately
exclusive to either $D_1$ or $D_2$, and for consistency should be interpreted with the
corresponding column basis vector $u_{1,n}$ or $u_{2,n}$ to represent phenomena exclusive
to either the first or the second dataset. A unique row basis vector $v_n^T$ that is
significant in both $D_1$ and $D_2$, and with an angular distance of $\theta_n \approx 0$,
which corresponds to $\sigma_{1,n}/\sigma_{2,n} \approx 1$, is mathematically common to
$D_1$ and $D_2$, and should be interpreted with both $u_{1,n}$ and $u_{2,n}$ to represent
phenomena common to both datasets.
The GSVD is mathematically invariant under the exchange of the two matrices or the
reordering of the pairs of matched columns or the rows, and is therefore also blind to
the labels of the matrices, the columns, and the rows. These labels are used only to
interpret the row and column basis vectors in terms of the phenomena recorded in the
datasets.
where $R$ is upper triangular [4, 5]. Since $D_1$ and $D_2$ have full column rank, $Q_1$
and $Q_2$ also have full column rank, $V_{Q_1}^T$ is orthonormal, and $\Sigma_{Q_1}$ is
positive diagonal. It follows from Eq. (2.3) that the diagonal
$\Sigma_{Q_2} = (I - \Sigma_{Q_1}^2)^{1/2}$ is also positive, and that
$U_{Q_2} = Q_2 V_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}$ is column-wise orthonormal,
\[
I = Q^T Q = Q_1^T Q_1 + Q_2^T Q_2. \tag{2.4}
\]
That is, the SVD of $Q_1$ also defines an SVD of $Q_2$, where the singular values are
arranged in $\Sigma_{Q_2}$ in an increasing order, because the singular values of $Q_1$
are arranged in $\Sigma_{Q_1}$ in a decreasing order.
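The step from the identity I = Q^T Q to the orthonormality of the columns of U_Q2 can be
spelled out as follows. This is a sketch that assumes the SVD of Q1 is written
Q1 = U_Q1 Σ_Q1 V_Q1^T with V_Q1 square and orthonormal, a notation inferred from the
symbols used in this section.

```latex
\begin{aligned}
Q_2^T Q_2 &= I - Q_1^T Q_1
           = I - V_{Q_1} \Sigma_{Q_1}^2 V_{Q_1}^T
           = V_{Q_1} (I - \Sigma_{Q_1}^2) V_{Q_1}^T, \\
U_{Q_2}^T U_{Q_2}
          &= (I - \Sigma_{Q_1}^2)^{-1/2} \, V_{Q_1}^T (Q_2^T Q_2) V_{Q_1} \,
             (I - \Sigma_{Q_1}^2)^{-1/2} = I, \\
Q_2       &= U_{Q_2} (I - \Sigma_{Q_1}^2)^{1/2} V_{Q_1}^T .
\end{aligned}
```

The last line is an SVD of Q2 that shares the right basis vectors of Q1, which is why a
single SVD suffices for both matrices.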
It follows from Eq. (2.4) then that the SVD of $Q_1$ factorizes $D_1$ and $D_2$ into the
GSVD,
\[
\begin{aligned}
U_1 &= U_{Q_1}, \\
U_2 &= U_{Q_2}, \\
\Sigma_1 &= \Sigma_{Q_1} \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{1/2}, \\
\Sigma_2 &= (I - \Sigma_{Q_1}^2)^{1/2} \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{1/2}, \\
V^T &= \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{-1/2} V_{Q_1}^T R,
\end{aligned} \tag{2.5}
\]
where $U_1$ and $U_2$ are column-wise orthonormal, $\Sigma_1$ and $\Sigma_2$ are positive
diagonal, and $V^T$, identical in both factorizations, has normalized rows. The positive
generalized singular values are arranged in
$\Sigma_1 \Sigma_2^{-1} = \Sigma_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}$ in a decreasing order.
The QR decomposition is unique and, from Eq. (2.5), the uniqueness properties of the
GSVD follow from the uniqueness properties of the SVDs of $Q_1$ and $Q_2$.
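The construction in this section can be sketched in NumPy as follows. This is an
illustration under stated assumptions, not the toolkit's implementation: it assumes
that the QR decomposition is taken of the two matrices stacked row-wise (the form
suggested by I = Q1^T Q1 + Q2^T Q2), that both matrices have full column rank with at
least as many rows as columns, and the function name gsvd_qr_svd is hypothetical.

```python
import numpy as np

def gsvd_qr_svd(d1, d2):
    """GSVD of two column-matched matrices via a stacked QR and one SVD.

    Returns (u1, s1, u2, s2, vt) with d1 = u1 @ diag(s1) @ vt and
    d2 = u2 @ diag(s2) @ vt, where the rows of vt have unit norm.
    """
    m1 = d1.shape[0]
    # Thin QR of the stacked matrix; q splits into q1 (top) and q2 (bottom).
    q, r = np.linalg.qr(np.vstack([d1, d2]))
    q1, q2 = q[:m1], q[m1:]

    # SVD of q1; NumPy returns singular values in decreasing order.
    u_q1, s_q1, vt_q1 = np.linalg.svd(q1, full_matrices=False)

    # The same right basis diagonalizes q2, with values sqrt(1 - s_q1**2).
    s_q2 = np.sqrt(1.0 - s_q1**2)
    u_q2 = (q2 @ vt_q1.T) / s_q2  # divides each column by its singular value

    # Normalize the rows of vt_q1 @ r and move the norms into the sigmas.
    w = vt_q1 @ r
    row_norms = np.sqrt(np.sum(w**2, axis=1))
    vt = w / row_norms[:, None]
    return u_q1, s_q1 * row_norms, u_q2, s_q2 * row_norms, vt

# Small random example with 4 matched columns and different row counts.
rng = np.random.default_rng(0)
d1 = rng.standard_normal((8, 4))
d2 = rng.standard_normal((6, 4))
u1, s1, u2, s2, vt = gsvd_qr_svd(d1, d2)
```

Because the singular values of Q1 come back in decreasing order, the ratios s1 / s2 are
in decreasing order as well, so no additional sorting is needed in this sketch.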
CHAPTER 3
METHODS
The function used to visualize the generalized fractions is displayed below. The func-
tion calls the generalized_fractions function described above internally to compute the
fractions. The fractions are then joined with an integer list that is the same length as the
generalized fractions array to preserve the fractions’ original location. This list is displayed
in a horizontal bar chart, with the fractions’ corresponding integer position as its label.
In order to visualize the angular distances, we took the variables d_1 and d_2 from
the GSVD as input to the function angular_distances. Then we indexed the angular
distances array by its length and plotted the array in the form of a horizontal bar chart.
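The bar-chart preparation described in the two paragraphs above can be sketched as
follows. The function names top_ranked and plot_barh are hypothetical and are not the
toolkit's functions; the sketch assumes Matplotlib for drawing, imports it only inside
the plotting function, and pairs each value with its 1-based position so that the
original location survives sorting.

```python
import numpy as np

def top_ranked(values, top=10):
    """Pair each value with its 1-based position and return the `top`
    largest values, in decreasing order, with their positions."""
    values = np.asarray(values, dtype=float)
    positions = np.arange(1, values.size + 1)
    order = np.argsort(-values, kind="stable")[:top]
    return positions[order], values[order]

def plot_barh(positions, values, xlabel, ax=None):
    """Draw a horizontal bar chart labeled by original position."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to draw
    if ax is None:
        _, ax = plt.subplots()
    y = np.arange(len(values))
    ax.barh(y, values)
    ax.set_yticks(y)
    ax.set_yticklabels([str(p) for p in positions])
    ax.invert_yaxis()  # first entry at the top
    ax.set_xlabel(xlabel)
    return ax

# Made-up fractions for four basis vectors.
pos, vals = top_ranked([0.10, 0.50, 0.15, 0.25], top=3)
```

For angular distances, which are plotted in index order rather than by size, the same
plot_barh function could be called with positions 1 through N and the unsorted array.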
The Kaplan-Meier objects have a built-in plot method. Each Kaplan-Meier object in
km_objs is plotted on the subplot given by the ax parameter, using our default styles
specified by the colors parameter and the censor_style dictionary parameter. A group
that includes too few patients can be removed with remove_list. The annotation_coords
and patch_color parameters are used to annotate the plot area and increase the amount
of information that can be conveyed.
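To make concrete what each Kaplan-Meier object estimates before it is plotted, the
product-limit estimator can be written from scratch in NumPy. This sketch is not the
lifelines implementation and does not use the lifelines API; the function name and the
example data are illustrative only.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Product-limit estimate of the survival function.

    durations: follow-up time of each patient
    observed:  True if the event was observed, False if censored
    Returns the distinct event times and the survival fraction just
    after each of them.
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])

    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)  # still under observation at t
        events = np.sum((durations == t) & observed)
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Five made-up patients; the third and fifth are censored.
times, surv = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
```

Censored patients never trigger a drop in the curve, but they do count toward the number
at risk until their follow-up ends, which is why censoring marks matter when the curves
are read.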
CHAPTER 4
RESULTS
The data are composed of two datasets, one tumor dataset and one normal dataset. In each
dataset, the rows contain the copy-number variations along the genomic coordinate,
whereas the columns correspond to the patients. Each column represents the DNA
copy-number variations of one patient, and each row represents the DNA copy-number
variations at one location on the genome.
During the data preprocessing step, only patients who have both a normal profile and a
tumor profile are included in the datasets. Thus, the columns of the two datasets are
matched. This is an important criterion for applying the GSVD to these datasets. Both
the tumor and the normal copy-number variations are needed in order to identify patterns
that are exclusive to the tumor, exclusive to the normal, or common to both. If the two
datasets do not cover the same group of patients, the results will no longer be
comparative, and patterns in the tumor dataset cannot be separated from patterns in the
normal dataset. The rows of the two datasets, in contrast, do not have to match exactly.
As long as the rows have similar meanings, the datasets are valid as input to the GSVD.
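The column-matching step can be sketched as below. This is a minimal illustration, not
the preprocessing code used in the study: the function name match_patients and the
patient identifiers are hypothetical, and identifiers are assumed to be unique within
each dataset.

```python
import numpy as np

def match_patients(tumor, tumor_ids, normal, normal_ids):
    """Keep only patients present in both datasets and put the columns
    of both matrices in the same patient order."""
    common, t_idx, n_idx = np.intersect1d(
        tumor_ids, normal_ids, return_indices=True
    )
    return tumor[:, t_idx], normal[:, n_idx], common

# Made-up data: rows are genomic bins, columns are patients.
tumor_ids = ["P3", "P1", "P2"]
tumor = np.array([[30.0, 10.0, 20.0],
                  [31.0, 11.0, 21.0]])
normal_ids = ["P2", "P4", "P3"]
normal = np.array([[2.0, 4.0, 3.0],
                   [2.5, 4.5, 3.5],
                   [2.9, 4.9, 3.9]])

tumor_m, normal_m, patients = match_patients(tumor, tumor_ids, normal, normal_ids)
```

The two matrices in this example keep different numbers of rows (two and three bins),
while their columns end up in the same patient order.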
[Figure 4.1 image not reproduced. Recoverable labels: tumor bins and normal bins
(chromosomes 1 through 22 and X), tissue type, column basis vectors, row basis vectors,
angular distance (ticks at 0, π/10, and π/6), and a color scale from −c to c.]
Figure 4.1. The GSVD of the WGS read-count profiles of patient-matched astrocytoma
tumor and normal DNA. The GSVD is depicted in a raster display with relative WGS
read-count, i.e., DNA copy-number amplification (red), no change (black), and deletion
(green). This GSVD depiction is denoted as approximate, even though the GSVD of
Eq. (2.1) is exact, because only the 1st through the 5th and the 81st through the 85th
rows and corresponding tumor and normal column basis vectors and generalized singular
values are explicitly shown. The angular distances of Eq. (2.2) are depicted in a bar chart.
The red and green contrasts for the datasets $D_i$, the dataset-specific column basis
vectors $U_i$ and generalized singular values $\Sigma_i$, and the dataset-shared row basis
vectors $V^T$, are c = 1, 750 and 0.0005, and 5, respectively.
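[Figure 4.2 image not reproduced. Horizontal bar charts of generalized fractions (axis
from 0 to 0.3), labeled by row basis vector in decreasing order of fraction:
(a) tumor: 1, 2, 7, 11, 82, 5, 10, 3, 66, 73;
(b) normal: 85, 82, 83, 84, 73, 66, 79, 81, 77, 80.]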
Figure 4.2. The most significant row basis vectors uncovered by the GSVD of the WGS
astrocytoma tumor and normal datasets. (a) The 10 largest generalized fractions in the
WGS astrocytoma tumor dataset are depicted in a bar chart, showing that the two most
tumor-exclusive row basis vectors, i.e., the first and second, are also the first and second
most significant in the tumor dataset and capture ≈29% and 8% of the information, re-
spectively. The corresponding generalized normalized Shannon entropy is 0.78. (b) The 10
largest generalized fractions in the normal dataset are depicted in a bar chart, showing that
the most normal-exclusive row basis vector, i.e., the 85th, is also the most significant in the
normal dataset and captures ≈23% of the information. The 82nd row basis vector, which
is approximately common to both datasets, is the second and fifth most significant and
captures ≈14% and 2% of the information in the normal and tumor datasets, respectively.
[Figure 4.3 image not reproduced. Kaplan-Meier curves; vertical axis: fraction of
surviving patients from the WGS astrocytoma set (0 to 1); horizontal axis: survival time
in months (0 to 120). Each curve is annotated with N and O values.
(a) Agilent GBM (Corr.): P-value = 9.3 × 10^-6, hazard ratio = 5.3; groups Low, High.
(b) Agilent GBM/Age: P-value = 1.7 × 10^-5, hazard ratios = 3.9/2.7; groups Low/<50,
High/<50, Low/>50, High/>50.
(c) Agilent GBM/Grade: P-value = 1.2 × 10^-4, hazard ratios = 3.1/2.3; groups Low/II,
Low/III, High/III, Low/IV, High/IV.
(d) MGMT Methylation: P-value = 1.2 × 10^-3, hazard ratio = 3.6; Yes: N=40, O=22;
No: N=17, O=11.
(e) IDH1 Mutation: P-value = 4.1 × 10^-6, hazard ratio = 8.9; Yes: N=24, O=9;
No: N=51, O=35.]
Figure 4.3. Survival analyses of the WGS astrocytoma patients. The classifications of the
85 patients based upon (a) the Agilent GBM pattern and, in addition, (b) age or (c) grade,
or (d) MGMT promoter methylation or (e) IDH1 mutation, are depicted in KM curves
highlighting median survival time differences (yellow) with the corresponding log-rank
P-values and Cox hazard ratios.
[Figure 4.4 image not reproduced. Boxplots, panels (a) and (b); vertical axis: relative
DNA copy number, from −0.004 to 0.004.]
Figure 4.4. The first tumor and 85th normal column basis vectors are correlated with
the fractional guanine-cytosine (GC) content across the tumor and normal genomes.
The distributions of the copy numbers listed in the (a) first tumor and (b) 85th normal
column basis vectors between tumor and normal bins, respectively, of >50% and ≤50% GC
content are depicted in boxplots with the corresponding Mann-Whitney-Wilcoxon (MWW)
P-values.
[Figure 4.5 image not reproduced. Boxplots; vertical axis: relative DNA copy number,
from −0.2 to 0.4. (a) Genomic characterization center: Other (N=40), BI (N=45).
(b) Tissue source site: Other (N=80), CS (N=5).]
Figure 4.5. The first and 85th row basis vectors are correlated with experimental batches.
The distributions of the copy numbers listed in the (a) first and (b) 85th row basis vec-
tors between genomic characterization centers and tissue source sites, respectively, are
depicted in boxplots with the corresponding MWW P-values.
CHAPTER 5
CONCLUSIONS
As the case study indicated, the genomic signal processing toolkit developed in this
research facilitates speedy analysis of large-scale data with no loss of accuracy.
Today’s datasets are enormous. Instead of leaving users to wait a long time for results,
this toolkit speeds up the process while maintaining accuracy. Additionally, the toolkit
improves visualization over existing Python libraries. By taking advantage of existing
Python libraries and extending their features, it makes the data easier to interpret
visually. The toolkit is also reusable and can be applied to multiple datasets. Its
computation and visualization of the GSVD are not limited to genomic data and can be
applied to many other types of data. Python was chosen to implement this toolkit because
it makes many cloud computing tasks and the handling of large datasets more efficient.
It is anticipated that this toolkit can be used to analyze extremely large datasets
hosted in cloud storage.
REFERENCES
[2] C. C. Paige and M. A. Saunders, SIAM J. Numer. Anal. 18, 398 (1981).
[4] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed. Cambridge, UK: Cambridge
University Press (2012).
[5] G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed. Baltimore, MD: Johns
Hopkins University Press (2012).
[6] H. Goldstein, Classical Mechanics, 2nd ed. Reading, MA: Addison-Wesley (1980).
[7] O. Alter, P. O. Brown, and D. Botstein, Proc. Natl. Acad. Sci. USA 100, 3351 (2003).
[9] S. P. Ponnapalli, G. H. Golub, and O. Alter, in Stanford University and Yahoo! Research
Workshop on Algorithms for Modern Massive Datasets (June 21–24, 2006).
[10] S. P. Ponnapalli, M. A. Saunders, C. F. Van Loan, and O. Alter, PLoS One 6, e28072
(2011).
[11] P. Sankaranarayanan, T. E. Schomay, K. A. Aiello, and O. Alter, PLoS One 10, e0121396
(2015).
[13] L. N. Trefethen and D. Bau, III, Numerical Linear Algebra. Philadelphia, PA: SIAM
(1997).
[14] A. Edelman, T. A. Arias, and S. T. Smith, SIAM J. Matrix Anal. Appl. 20, 303 (1998).