Computing and Visualizing the Generalized Singular Value Decomposition in Python

Rui Luo
University of Utah

UUCS-19-003

School of Computing
University of Utah
Salt Lake City, UT 84112 USA

29 April 2019

Abstract

The human genome project has been completed, but there are barriers between
researchers who study the genetic sequences and clinicians who treat cancers. First
of all, there is low reproducibility in genetic studies, caused by different sequencing
techniques and batch effects. Secondly, it is difficult for clinicians who do not have a
computational background to interpret existing computational methods. To minimize
these disconnections, a computational model should be developed to find the signifi-
cant genes in a genome that separate batch and experimental effects from biological
effects. The proposed solution is to use the generalized singular value decomposition
(GSVD) to reveal genetic patterns on the transformation of genes, and to separate the
tumor-exclusive genotype from experimental inconsistencies.
Here we developed a Python toolkit that improves both the computation and the
visualization of the GSVD.

COMPUTING AND VISUALIZING THE GENERALIZED
SINGULAR VALUE DECOMPOSITION IN PYTHON

by
Rui Luo

A thesis submitted to the faculty of

The University of Utah
in partial fulfillment of the requirements for the degree of

Bachelor of Science
in
Computer Science

School of Computing
The University of Utah
May 2019
Copyright © Rui Luo 2019
All Rights Reserved
CONTENTS

ABSTRACT
LIST OF FIGURES

CHAPTERS

1. INTRODUCTION

2. BACKGROUND
2.1 The GSVD as a comparative spectral decomposition
2.1.1 Generalized fractions and angular distances
2.1.2 Computing the GSVD via QR decomposition and the SVD
2.2 Genomic signal processing case study

3. METHODS
3.1 Computing the GSVD via QR decomposition and the SVD
3.1.1 Computing QR decomposition
3.1.2 Computing the SVD
3.2 Testing the GSVD raster visualization
3.3 Computing and visualizing the generalized fractions and angular distances
3.3.1 Computing and visualizing the generalized fractions
3.3.2 Computing and visualizing angular distances
3.4 Computing and visualizing Kaplan-Meier survival analysis
3.4.1 Lifelines versus scikit-survival
3.4.2 Extending lifelines visualization
3.5 Creating boxplot displays
3.5.1 Mapping columns (attributes) to similar groups

4. RESULTS
4.1 TCGA astrocytoma tumor and patient-matched normal DNA copy-number datasets
4.2 Visualization of the GSVD in this case study
4.3 Visualization of the bar charts in this case study
4.4 Visualization of the Kaplan-Meier survival analyses in this case study
4.5 Visualization of the boxplots in this case study
4.6 Improvement of computational time of the GSVD in Python relative to Mathematica

5. CONCLUSIONS

REFERENCES
LIST OF FIGURES

4.1 The GSVD
4.2 Bar charts
4.3 Kaplan-Meier survival analyses
4.4 Boxplots of the column basis vectors
4.5 Boxplots of the row basis vectors
CHAPTER 1

INTRODUCTION

Sequencing human genomes has become less expensive. However, scientists still have
difficulty deriving the information they expect from the results of genetic sequencing.
For example, scientists need to analyze the genome to find out which parts of it are
significant, and determining the potentially significant regions requires a large number
of trials across the whole genome. One example of this difficulty is the regularly
observed patterns of copy-number variations in cancer. Although these recurring DNA
alterations have been observed to play an important role in the study of cancer, they
have not been applied in actual clinical use.
Genomic signal processing tools are built to remove the barriers to translating genetic
sequencing results into usable information. These tools uncover the underlying patterns
in sequencing results and help clinicians better interpret recurring DNA alterations.
Most of the research on genome-wide patterns of DNA copy-number variations does not
focus on the subtypes of these variations. However, the subtypes are themselves
indicators and can increase the accuracy of predictions. Including the tools mentioned
above in research experiments makes the results more persuasive.
Our lab’s genomic signal processing has been implemented in Mathematica, and the
functions used there can be extended into a toolkit. While Mathematica computes matrix
decompositions correctly, there is no evidence that it uses highly optimized LAPACK
routines. The decompositions in the NumPy linear algebra library, however, are based on
these LAPACK routines. This suggests that implementing the GSVD in Python with the NumPy
library may improve performance. An additional benefit of Python is that many cloud
computing pipelines interface well with it, whereas few are compatible with Mathematica.
Finally, because Python is available at no cost and makes it easy to create and share
libraries, it offers a platform for our lab’s research to reach a broader audience.
CHAPTER 2

BACKGROUND

2.1 The GSVD as a comparative spectral decomposition

Given two column-matched but row-independent real matrices $D_i \in \mathbb{R}^{M_i \times N}$,
each with full column rank $N \le M_i$, the GSVD is an exact simultaneous factorization [1–4],

$$
D_i = U_i \Sigma_i V^T = \sum_{n=1}^{N} \sigma_{i,n} \, u_{i,n} \otimes v_n^T, \qquad i = 1, 2, \tag{2.1}
$$

where $U_i \in \mathbb{R}^{M_i \times N}$ are real and column-wise orthonormal and
$V^T \in \mathbb{R}^{N \times N}$ is real, invertible, and with normalized rows. The $2N$
positive generalized singular values are arranged in
$\Sigma_i = \mathrm{diag}(\sigma_{i,n}) \in \mathbb{R}^{N \times N}$ in a decreasing order of
the ratio $\sigma_{1,n}/\sigma_{2,n}$. The GSVD is unique up to phase factors of $\pm 1$ of each
triplet of corresponding column and row basis vectors, i.e., $u_{i,n}$ and $v_n$, except in
degenerate subspaces defined by subsets of pairs of generalized singular values of equal
ratios, i.e., $\sigma_{1,n}/\sigma_{2,n}$. The GSVD generalizes the SVD from one to two
matrices. Like the SVD, the GSVD is a mathematical building block of algorithms, e.g., for
solving the problem of constrained least squares in algebra [5], and theories, e.g., for
describing oscillations near equilibrium in classical mechanics [6].

2.1.1 Generalized fractions and angular distances

The authors [16] formulated the GSVD as a comparative spectral decomposition that
can simultaneously identify the similarity and dissimilarity between two column-matched
but row-independent matrices, and, therefore, create a single coherent model from two
datasets recording different aspects of interrelated phenomena [7, 8]. This formulation
[9–12] is possible because the GSVD is exact, exists, and has uniqueness properties that
directly generalize those of the SVD [13, 14] (see Eq. (2.1)). The only assumption is that
there exists a one-to-one mapping between the columns of the matrices but not necessarily
between their rows. The authors [16] defined the significance of the row basis vector
$v_n^T$ and the corresponding column basis vector $u_{i,n}$ in the corresponding matrix
$D_i$, i.e., the “generalized fraction” $p_{i,n}$, to be proportional to the corresponding
generalized singular value $\sigma_{i,n}$, and the “generalized normalized Shannon entropy”
of $D_i$ to be proportional to the arithmetic mean of $p_{i,n} \log p_{i,n}$. The authors
[16] defined the significance of $v_n$ and $u_{1,n}$ in $D_1$ relative to that of $v_n$ and
$u_{2,n}$ in $D_2$, i.e., the “GSVD angular distance,” to be a function of the ratio
$\sigma_{1,n}/\sigma_{2,n}$ that, from the cosine-sine decomposition, is related to an angle,

$$
-\pi/4 < \theta_n = \arctan(\sigma_{1,n}/\sigma_{2,n}) - \pi/4 < \pi/4. \tag{2.2}
$$

Note that the angular distances $\theta_n$ are different from the principal angles
corresponding to canonical correlations, just as the GSVD is different from canonical
correlations analysis (CCA) [15].
A unique row basis vector $v_n^T$ that is significant in either $D_1$ or $D_2$, and with an
angular distance of $\theta_n \approx \pm\pi/4$, which corresponds to a ratio of
$\sigma_{1,n}/\sigma_{2,n} \gg 1$ or $\ll 1$, respectively, is mathematically approximately
exclusive to either $D_1$ or $D_2$, and for consistency should be interpreted with the
corresponding column basis vector $u_{1,n}$ or $u_{2,n}$ to represent phenomena exclusive to
either the first or the second dataset. A unique row basis vector $v_n^T$ that is significant
in both $D_1$ and $D_2$, and with an angular distance of $\theta_n \approx 0$, which
corresponds to $\sigma_{1,n}/\sigma_{2,n} \approx 1$, is mathematically common to $D_1$ and
$D_2$, and should be interpreted with both $u_{1,n}$ and $u_{2,n}$ to represent phenomena
common to both datasets.
The GSVD is mathematically invariant under the exchange of the two matrices or the
reordering of the pairs of matched columns or the rows, and is therefore blind to the
labels of the matrices, the columns, and the rows. These labels are used only to interpret
the row and column basis vectors in terms of the phenomena recorded in the datasets.

2.1.2 Computing the GSVD via QR decomposition and the SVD

The GSVD is numerically stably computed by using the QR decomposition of the
appended $D_1$ and $D_2$, followed by the SVD of the block of the column-wise orthonormal
$Q$ that corresponds to $D_1$, i.e., $Q_1$,

$$
\begin{bmatrix} D_1 \\ D_2 \end{bmatrix} = QR
= \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R
= \begin{bmatrix} U_{Q_1} \Sigma_{Q_1} V_{Q_1}^T R \\ Q_2 R \end{bmatrix}
= \begin{bmatrix} U_{Q_1} \Sigma_{Q_1} V_{Q_1}^T R \\ U_{Q_2} \Sigma_{Q_2} V_{Q_1}^T R \end{bmatrix}, \tag{2.3}
$$

where $R$ is upper triangular [4, 5]. Since $D_1$ and $D_2$ are with full column rank, then
$Q_1$ and $Q_2$ are also with full column rank, $V_{Q_1}^T$ is orthonormal, and
$\Sigma_{Q_1}$ is positive diagonal. It follows from Eq. (2.3) that the diagonal
$\Sigma_{Q_2} = (I - \Sigma_{Q_1}^2)^{1/2}$ is also positive, and that
$U_{Q_2} = Q_2 V_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}$ is column-wise orthonormal,

$$
\begin{aligned}
I &= Q^T Q = Q_1^T Q_1 + Q_2^T Q_2 \\
  &= V_{Q_1} \Sigma_{Q_1}^2 V_{Q_1}^T + Q_2^T Q_2, \\
\Sigma_{Q_2}^2 &= I - \Sigma_{Q_1}^2 = (Q_2 V_{Q_1})^T (Q_2 V_{Q_1}) > 0, \\
U_{Q_2}^T U_{Q_2} &= I \\
  &= [Q_2 V_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}]^T [Q_2 V_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}].
\end{aligned} \tag{2.4}
$$

That is, the SVD of $Q_1$ also defines an SVD of $Q_2$, where the singular values are
arranged in $\Sigma_{Q_2}$ in an increasing order, because the singular values of $Q_1$ are
arranged in $\Sigma_{Q_1}$ in a decreasing order.
It follows from Eq. (2.4) then that the SVD of $Q_1$ factorizes $D_1$ and $D_2$ into the GSVD,

$$
\begin{aligned}
U_1 &= U_{Q_1}, \\
U_2 &= U_{Q_2}, \\
\Sigma_1 &= \Sigma_{Q_1} \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{1/2}, \\
\Sigma_2 &= (I - \Sigma_{Q_1}^2)^{1/2} \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{1/2}, \\
V^T &= \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{-1/2} V_{Q_1}^T R,
\end{aligned} \tag{2.5}
$$

where $U_1$ and $U_2$ are column-wise orthonormal, $\Sigma_1$ and $\Sigma_2$ are positive
diagonal, and $V^T$, identical in both factorizations, has normalized rows. The positive
generalized singular values are arranged in
$\Sigma_1 \Sigma_2^{-1} = \Sigma_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}$ in a decreasing order.

The QR decomposition is unique and, from Eq. (2.5), the uniqueness properties of the
GSVD follow from the uniqueness properties of the SVDs of $Q_1$ and $Q_2$.
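
The factorization of Eqs. (2.3)–(2.5) can be sketched in a few lines of NumPy. The
function below is an illustrative sketch and not the toolkit's implementation: the name
gsvd, its argument names, and the order of its return values are assumptions made for
this example, and it assumes that both inputs have full column rank so that no singular
value of Q1 equals one.

```python
import numpy as np

def gsvd(d1, d2):
    """Sketch of the GSVD of two column-matched matrices via QR and the SVD."""
    m1 = d1.shape[0]
    # Reduced QR decomposition of the vertically appended matrices, Eq. (2.3).
    q, r = np.linalg.qr(np.vstack([d1, d2]))
    q1, q2 = q[:m1], q[m1:]
    # SVD of the block of Q that corresponds to D1.
    u1, s_q1, vt_q1 = np.linalg.svd(q1, full_matrices=False)
    # Sigma_Q2 = (I - Sigma_Q1^2)^(1/2) and U_Q2 = Q2 V_Q1 Sigma_Q2^(-1), Eq. (2.4).
    s_q2 = np.sqrt(1.0 - s_q1**2)
    u2 = (q2 @ vt_q1.T) / s_q2
    # The row norms of V_Q1^T R equal {diag[(V_Q1^T R)(V_Q1^T R)^T]}^(1/2), Eq. (2.5).
    x = vt_q1 @ r
    norms = np.linalg.norm(x, axis=1)
    sigma1 = s_q1 * norms
    sigma2 = s_q2 * norms
    vt = x / norms[:, None]
    return u1, u2, sigma1, sigma2, vt
```

Multiplying the factors back together, e.g., u1 @ np.diag(sigma1) @ vt, should reproduce
D1 up to floating-point rounding, which gives a quick check of the exactness stated in
Eq. (2.1).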

2.2 Genomic signal processing case study

DNA alterations had been observed in astrocytoma for decades. A copy-number geno-
type predictive of a survival phenotype was discovered only by using the generalized sin-
gular value decomposition (GSVD) formulated as a comparative spectral decomposition.
In this case study, the authors [16] used the GSVD to compare whole-genome sequenc-
ing (WGS) profiles of patient-matched astrocytoma and normal DNA. First, the GSVD un-
covered a genome-wide pattern of copy-number alterations, which was bounded by pat-
terns recently uncovered by the GSVDs of microarray-profiled patient-matched glioblas-
toma (GBM) and, separately, lower-grade astrocytoma and normal genomes. Like the
microarray patterns, the WGS pattern was correlated with an approximately one-year
median survival time. By filling in gaps in the microarray patterns, the WGS pattern
revealed that this biologically consistent genotype encoded for transformation via the
Notch together with the Ras and Shh pathways. Second, like the GSVDs of the microarray
profiles, the GSVD of the WGS profiles separated the tumor-exclusive pattern from normal
copy-number variations and experimental inconsistencies, including the WGS technology-
specific effects of guanine-cytosine content variations across the genomes that were corre-
lated with experimental batches. Third, by identifying the biologically consistent phe-
notype among the WGS-profiled tumors, the GBM pattern proved to be a technology-
independent predictor of survival and response to chemotherapy and radiation, statis-
tically better than the patient’s age and tumor’s grade, the best other indicators, and
MGMT promoter methylation and IDH1 mutation. The authors [16] concluded that by
using the complex structure of the data, comparative spectral decompositions underlay a
mathematically universal description of the genotype-phenotype relations in cancer that
other methods missed.
Here we demonstrate a computation and visualization toolkit by applying it to the data
from this case study.
CHAPTER 3

METHODS

3.1 Computing the GSVD via QR decomposition and the SVD

The QR decomposition and the SVD are the two main steps in computing the GSVD
efficiently. The QR decomposition is performed on the stacked datasets (named matrix1 in
the toolkit code), and the SVD is computed separately on the rows of Q that correspond
to D1 and to D2.

3.1.1 Computing QR decomposition

The input data can have missing values. We ensured that all Nulls or NAs had been
dropped from the input matrices D1 and D2. The matrices D1 and D2 were then concatenated
vertically to form an (n1 + n2) by m matrix for QR decomposition, where n1 and n2 are the
numbers of rows of D1 and D2, respectively, and m is the number of columns, which is the
same in both datasets. To perform QR decomposition on the stacked D1 and D2, we used the
linalg.qr function from the NumPy package in Python. NumPy utilizes the LAPACK
algorithms, which are highly optimized for linear algebra. To take advantage of this
optimization, we used the NumPy linalg package rather than another library or our own
implementation. The output of the linalg.qr function is the matrix Q with orthonormal
columns and the upper triangular matrix R.
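
A minimal sketch of this preprocessing and stacking step follows. The small matrices, the
helper name drop_incomplete_rows, and the choice to drop whole rows that contain a missing
value are assumptions for illustration; dropping rows (rather than columns) keeps the
columns of the two matrices matched.

```python
import numpy as np

def drop_incomplete_rows(d):
    """Keep only the rows of d that contain no missing (NaN) entries."""
    return d[~np.isnan(d).any(axis=1)]

# Hypothetical column-matched inputs with a few missing entries.
d1 = np.array([[1.0, 2.0], [np.nan, 1.0], [0.5, -1.0], [2.0, 0.0]])
d2 = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, np.nan]])

d1 = drop_incomplete_rows(d1)            # n1 by m
d2 = drop_incomplete_rows(d2)            # n2 by m
stacked = np.vstack([d1, d2])            # (n1 + n2) by m
q, r = np.linalg.qr(stacked)             # Q has orthonormal columns, R is upper triangular
```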

3.1.2 Computing the SVD

The matrix Q was split into two matrices, one of size n1 by m corresponding to the
dimensions of D1 and the other of size n2 by m corresponding to D2. These matrices are
referred to as Q1 and Q2, respectively. Using the linalg.svd function from the NumPy
package, we decomposed the matrix Q1 into U1, Σ1, and V1T and the matrix Q2 into U2, Σ2,
and V2T.
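
The split and the two SVDs can be sketched as below, using a random stand-in for Q; the
sizes and variable names are assumptions for illustration. Because Q has orthonormal
columns, the squared singular values of Q1 and Q2 pair up to sum to one, as in Eq. (2.4).
NumPy returns both sets in decreasing order, so the values of Q2 are reversed before
pairing.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m = 6, 5, 3

# Stand-in for the Q factor of the stacked datasets.
q, _ = np.linalg.qr(rng.standard_normal((n1 + n2, m)))
q1, q2 = q[:n1], q[n1:]                  # blocks that correspond to D1 and D2

u1, s1, v1t = np.linalg.svd(q1, full_matrices=False)
u2, s2, v2t = np.linalg.svd(q2, full_matrices=False)

# Pair the largest singular value of Q1 with the smallest of Q2, and so on.
cs_check = s1**2 + s2[::-1]**2
```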

3.2 Testing the GSVD raster visualization

To utilize the GSVD and find underlying patterns from it, we developed the raster
visualization for the GSVD. My task was to test whether the visualization correctly
depicted the datasets. The raster_display function can be applied to any decomposition.
The positional parameters axes, matrix, row_basis_vectors, probe_num, and step are
specific to each visualization. The raster_display function can be applied to D1 and D2
as well as to the results of the GSVD, i.e., U1, U2, Σ1, Σ2, and V T, to visualize all the
patterns in the GSVD and the original datasets.
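
The color mapping behind such a raster display can be sketched as follows. This is not the
toolkit's raster_display function: the helper name raster_rgb and its contrast parameter
are hypothetical, and the red, black, and green convention is taken from the caption of
Figure 4.1.

```python
import numpy as np

def raster_rgb(matrix, contrast):
    """Map values to RGB: positive to red, zero to black, negative to green.

    Values are divided by `contrast` and saturate at +/- contrast.
    """
    scaled = np.clip(np.asarray(matrix, dtype=float) / contrast, -1.0, 1.0)
    rgb = np.zeros(scaled.shape + (3,))
    rgb[..., 0] = np.clip(scaled, 0.0, None)    # red channel for positive values
    rgb[..., 1] = np.clip(-scaled, 0.0, None)   # green channel for negative values
    return rgb
```

The returned array has shape (rows, columns, 3) with values between 0 and 1, which is a
format that matplotlib's Axes.imshow accepts directly.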

3.3 Computing and visualizing the generalized fractions and angular distances

3.3.1 Computing and visualizing the generalized fractions
The generalized fractions were computed as follows. The first step was to multiply each
element of the diagonal with itself and store the results in the variable fractions.
Then we summed these squared elements and stored the sum in the variable
total_fractions. The function returns the array fractions divided by total_fractions.
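
A minimal sketch of these steps is given below. The variable names fractions and
total_fractions follow the description above; accepting either a diagonal matrix or a
vector, and the helper generalized_entropy with its normalization by log N, are
assumptions added for illustration.

```python
import numpy as np

def generalized_fractions(sigma):
    """Squared generalized singular values, normalized to sum to one."""
    sigma = np.asarray(sigma, dtype=float)
    diagonal = np.diag(sigma) if sigma.ndim == 2 else sigma
    fractions = diagonal * diagonal
    total_fractions = fractions.sum()
    return fractions / total_fractions

def generalized_entropy(fractions):
    """Shannon entropy of the fractions, assumed here to be normalized by log N."""
    fractions = np.asarray(fractions, dtype=float)
    nonzero = fractions[fractions > 0]
    return float(-np.sum(nonzero * np.log(nonzero)) / np.log(fractions.size))
```

With this normalization the entropy is 0 when a single fraction carries all the
information and 1 when all fractions are equal.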

The function used to visualize the generalized fractions calls the
generalized_fractions function described above internally to compute the fractions. The
fractions are then joined with an integer list of the same length as the generalized
fractions array to preserve each fraction’s original position. This list is displayed in
a horizontal bar chart, with each fraction’s corresponding integer position as its label.
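
The pairing of fractions with their positions can be sketched as below. The helper name
top_fractions, the 1-based positions, and the selection of the largest entries (as in
the 10 largest fractions of Figure 4.2) are assumptions for illustration.

```python
import numpy as np

def top_fractions(fractions, count=10):
    """Return (position, fraction) pairs for the largest fractions, largest first."""
    fractions = np.asarray(fractions, dtype=float)
    positions = np.arange(1, fractions.size + 1)   # 1-based row basis vector indices
    order = np.argsort(fractions)[::-1][:count]    # indices of the largest fractions
    return list(zip(positions[order].tolist(), fractions[order].tolist()))
```

The positions can then serve as tick labels and the fractions as bar widths in a
horizontal bar chart, for example with matplotlib's Axes.barh.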

3.3.2 Computing and visualizing angular distances

The computation of the angular distances follows Eq. (2.2). The variable distances
contains an array of angular distances to be plotted.
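
Eq. (2.2) translates directly into NumPy, as in the sketch below. The function and
argument names follow the text; the use of np.arctan2, which equals arctan of the ratio
for positive inputs and avoids an explicit division, is a choice made for this example.

```python
import numpy as np

def angular_distances(d_1, d_2):
    """Angular distances arctan(sigma_1 / sigma_2) - pi/4 of Eq. (2.2)."""
    d_1 = np.asarray(d_1, dtype=float)
    d_2 = np.asarray(d_2, dtype=float)
    distances = np.arctan2(d_1, d_2) - np.pi / 4
    return distances
```

Equal generalized singular values give an angular distance of zero, whereas a ratio much
larger or much smaller than one approaches π/4 or −π/4, respectively.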

In order to visualize the angular distances, we took the variables d_1 and d_2 from
the GSVD as input to the function angular_distances. Then we indexed the angular
distances array by its length and plotted the array in the form of a horizontal bar chart.
3.4 Computing and visualizing Kaplan-Meier survival analysis

3.4.1 Lifelines versus scikit-survival
Lifelines and scikit-survival are two Python libraries for survival analysis. Lifelines
has more features built into it, including its own plot method, and has been extensively
tested against R and SAS. The scikit-survival library depends on the sklearn module and
is not extensively supported. Lifelines was chosen for its support and capabilities.

3.4.2 Extending lifelines visualization

Here we initialized the Kaplan-Meier objects by instantiating KaplanMeierFitter, which
is a lifelines class. KaplanMeierFitter takes time and event data and fits the survival
function to the data. Helper functions not shown here generate Boolean arrays that
separate the time and event data into user-defined subgroups. These Boolean arrays are
generated based on the event column and the group column of a dataset provided by the
user.
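
To make the fitting step concrete, the sketch below computes a Kaplan-Meier
(product-limit) estimate by hand for one subgroup selected with a Boolean array. It is a
hand-written illustration, not the lifelines implementation; the function name
kaplan_meier and the small dataset are hypothetical.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate: event times and the survival fraction after each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])         # sorted times with observed events
    survival = []
    fraction = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)               # patients still under observation
        observed = np.sum((times == t) & events)   # events observed at time t
        fraction *= 1.0 - observed / at_risk
        survival.append(fraction)
    return event_times, np.array(survival)

# Hypothetical survival times (months), event flags (False = censored), and groups.
times = np.array([5, 8, 8, 12, 20, 30])
events = np.array([1, 1, 0, 1, 0, 1], dtype=bool)
group = np.array(["High", "High", "Low", "High", "Low", "Low"])

high = group == "High"                             # Boolean array for one subgroup
high_times, high_survival = kaplan_meier(times[high], events[high])
```

With lifelines, the corresponding fit for the same subgroup would be
KaplanMeierFitter().fit(times[high], event_observed=events[high]).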

The Kaplan-Meier objects have a built-in plot method. Each Kaplan-Meier object in
km_objs is plotted on the subplot given by the ax parameter, with our default styles
specified by the colors parameter and the censor_style dictionary parameter. If a group
includes too few patients, it can be removed with remove_list. The parameters
annotation_coords and patch_color are used to annotate the plot area and increase the
amount of information that can be conveyed.

3.5 Creating boxplot displays

3.5.1 Mapping columns (attributes) to similar groups
Before displaying patterns in the column and row basis vectors in a boxplot format, the
data have to be separated into two groups based on annotations. The user supplies
annotations that separate the data into two groups based on a column in a pandas
dataframe. The pandas factorize method is then used to replace the strings with integer
codes. By using these integer codes, the mapping step can work independently of the
specific annotation strings in the input column.
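
The mapping step can be sketched with pandas.factorize as below. The dataframe, the
column name center, and the vector values are hypothetical and serve only to illustrate
the grouping.

```python
import numpy as np
import pandas as pd

# Hypothetical annotation table with one string-valued entry per patient.
annotations = pd.DataFrame({"center": ["BI", "Other", "BI", "Other", "BI"]})
# Hypothetical entries of one row basis vector, one per patient.
vector = np.array([0.31, -0.12, 0.28, -0.05, 0.40])

# factorize returns integer codes and the unique labels in order of first appearance.
codes, labels = pd.factorize(annotations["center"])

# Split the vector into one array per group by comparing against the integer codes.
groups = {label: vector[codes == code] for code, label in enumerate(labels)}
```

Each array in groups can then be passed to a boxplot routine such as matplotlib's
Axes.boxplot.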
CHAPTER 4

RESULTS

Case study of the TCGA astrocytoma datasets

4.1 TCGA astrocytoma tumor and patient-matched normal DNA copy-number datasets

The Cancer Genome Atlas (TCGA) astrocytoma datasets are obtained from the Genomic
Data Commons (GDC).
https://portal.gdc.cancer.gov/projects
http://www.alterlab.org/astrocytoma_genotype-phenotype/

The data are composed of two datasets: one tumor dataset and one normal dataset. In each
dataset, rows correspond to positions along the genomic coordinate, whereas columns
correspond to patients. Each column holds the DNA copy-number variations of one patient,
and each row holds the DNA copy-number variations at one location on the genome.
During the data preprocessing step, only patients who have both a normal profile and
a tumor profile are included in the datasets. Thus, the columns of the two datasets are
the same. This is an important criterion for applying the GSVD to these datasets. Both
the tumor and the normal copy-number variation information are needed in order to
identify patterns that are exclusive to the tumor, exclusive to the normal, or common to
both. If the two datasets do not cover the same group of patients, the results are no
longer comparative, and the patterns in the tumor and normal datasets cannot be
separated. The rows of the two datasets, in contrast, do not have to match exactly. As
long as the rows have similar meanings, the datasets are valid as input to the GSVD.
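
The patient-matching criterion can be sketched with pandas as below. The dataframes and
the patient identifiers are hypothetical; only the idea of keeping the shared columns, in
the same order in both datasets, is taken from the text.

```python
import pandas as pd

# Hypothetical profiles: rows are genomic bins, columns are patient identifiers.
tumor = pd.DataFrame([[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]], columns=["P1", "P2", "P3"])
normal = pd.DataFrame([[0.1, 0.0, 0.0]], columns=["P3", "P1", "P4"])

# Keep only patients present in both datasets, in the tumor dataset's column order.
normal_ids = set(normal.columns)
shared = [patient for patient in tumor.columns if patient in normal_ids]
tumor_matched = tumor[shared]
normal_matched = normal[shared]
```

After this step the two matrices have identical columns, while their numbers of rows are
still free to differ.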

4.2 Visualization of the GSVD in this case study

After computing the GSVD as described in section 3.1 and visualizing it as described in
sections 3.2 and 3.3, we generated Figure 4.1 below. This figure matches the previously
generated figure in [16]. We directly compared the matrices U, Σ, and V T by subtracting
the Mathematica matrices from the corresponding ones in Python. The absolute value of
the difference is less than the default machine precision of 1e−16 in both Mathematica
and Python. We also subtracted the angular distances computed by Mathematica from the
corresponding ones in Python. The absolute difference is less than the default machine
precision of 1e−16.
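
An element-wise comparison of this kind can be sketched as follows. The helper names and
the small matrices are hypothetical. Because the GSVD is unique only up to a factor of ±1
per basis vector, the sketch aligns signs before subtracting; whether the original
comparison needed this step is not stated. NumPy reports the double-precision machine
epsilon through np.finfo(float).eps, which is 2^-52, or about 2.2e−16.

```python
import numpy as np

def align_signs(reference, vectors):
    """Flip each column of `vectors` so that it points the same way as in `reference`."""
    signs = np.sign(np.sum(reference * vectors, axis=0))
    return vectors * signs

def max_abs_difference(a, b):
    """Largest absolute element-wise difference between two arrays."""
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

# Hypothetical column basis vectors; the second column differs only in sign.
u_reference = np.array([[0.6, 0.8], [0.8, -0.6]])
u_python = u_reference * np.array([1.0, -1.0])

difference = max_abs_difference(u_reference, align_signs(u_reference, u_python))
machine_epsilon = np.finfo(float).eps
```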

[Figure 4.1 image: raster displays of Dtumor, Utumor, Σtumor, and V T (top) and of Dnormal,
Unormal, and Σnormal (bottom), with a bar chart of the angular distances of the row basis
vectors. Axes: patients, tissue types, tumor bins and normal bins (chromosomes 1–22 and X),
column basis vectors, and row basis vectors. Color scale: log2 DNA copy number from −c to
c. Angular distance ticks at 0, ±π/10, and ±π/6.]

Figure 4.1. The GSVD of the WGS read-count profiles of patient-matched astrocytoma
tumor and normal DNA. The GSVD is depicted in a raster display with relative WGS
read-count, i.e., DNA copy-number amplification (red), no change (black), and deletion
(green). This GSVD depiction is denoted as approximate, even though the GSVD of
Eq. (2.1) is exact, because only the 1st through the 5th and the 81st through the 85th
rows and corresponding tumor and normal column basis vectors and generalized singular
values are explicitly shown. The angular distances of Eq. (2.2) are depicted in a bar chart.
The red and green contrasts for the datasets Di , the dataset-specific column basis vectors
Ui and generalized singular values Σi , and the dataset-shared row basis vectors V T , are
c = 1, 750 and 0.0005, and 5, respectively.

4.3 Visualization of the bar charts in this case study

The bar charts in Figure 4.2 were generated using the functions described in section
3.3. The charts match the previously generated figures in [16]. We compared the
generalized fractions by subtracting the Mathematica generalized fractions from the
corresponding ones in Python. The absolute value of the difference is less than the
default machine precision of 1e−16.

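[Figure 4.2 image: horizontal bar charts of (a) the tumor generalized fractions, with
d1 = 0.78, and (b) the normal generalized fractions, with d2 = 0.75. Horizontal axis:
generalized fraction from 0 to 0.3. Vertical axis: row basis vectors, listed from most to
least significant as (a) 1, 2, 7, 11, 82, 5, 10, 3, 66, 73 and (b) 85, 82, 83, 84, 73, 66,
79, 81, 77, 80.]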

Figure 4.2. The most significant row basis vectors uncovered by the GSVD of the WGS
astrocytoma tumor and normal datasets. (a) The 10 largest generalized fractions in the
WGS astrocytoma tumor dataset are depicted in a bar chart, showing that the two most
tumor-exclusive row basis vectors, i.e., the first and second, are also the first and second
most significant in the tumor dataset and capture ≈29% and 8% of the information, re-
spectively. The corresponding generalized normalized Shannon entropy is 0.78. (b) The 10
largest generalized fractions in the normal dataset are depicted in a bar chart, showing that
the most normal-exclusive row basis vector, i.e., the 85th, is also the most significant in the
normal dataset and captures ≈23% of the information. The 82nd row basis vector, which
is approximately common to both datasets, is the second and fifth most significant and
captures ≈14% and 2% of the information in the normal and tumor datasets, respectively.

4.4 Visualization of the Kaplan-Meier survival analyses in this case study

The extensions of the lifelines Kaplan-Meier survival analyses can be seen in Figure 4.3.
These extensions, which are described in section 3.4, allow more information to be
displayed. The figure matches the previously generated figures in [16]. We compared the
p-values and hazard ratios by subtracting the Mathematica values from the corresponding
ones in Python. The absolute values of the differences are less than the default machine
precision of 1e−16. Additionally, the number of patients and the number of events
observed in each group, as well as the median survival time and the survivor function,
are the same in Mathematica and Python for each survival analysis in Figure 4.3.

[Figure 4.3 image: Kaplan-Meier curves in five panels, each plotting the fraction of
surviving patients from the WGS astrocytoma set against survival time (months), with the
number of patients (N) and the number of observed events (O) annotated for each group.
(a) Agilent GBM (Corr.): P-value = 9.3 × 10^-6, hazard ratio = 5.3.
(b) Agilent GBM/Age: P-value = 1.7 × 10^-5, hazard ratios = 3.9/2.7.
(c) Agilent GBM/Grade: P-value = 1.2 × 10^-4, hazard ratios = 3.1/2.3.
(d) MGMT Methylation: P-value = 1.2 × 10^-3, hazard ratio = 3.6.
(e) IDH1 Mutation: P-value = 4.1 × 10^-6, hazard ratio = 8.9.]

Figure 4.3. Survival analyses of the WGS astrocytoma patients. The classifications of the
85 patients based upon (a) the Agilent GBM pattern and, in addition, (b) age or (c) grade,
or (d) MGMT promoter methylation or (e) IDH1 mutation, are depicted in KM curves
highlighting median survival time differences (yellow) with the corresponding log-rank
P-values and Cox hazard ratios.

4.5 Visualization of the boxplots in this case study

The column and corresponding row basis vectors can be interpreted by boxplot
visualization. Figures 4.4 and 4.5 were generated with the methods described in section
3.5. These images match the previously generated ones in [16]. We compared the p-values
by subtracting the Mathematica p-values from the corresponding ones in Python. The
absolute values of the differences are less than the default machine precision of 1e−16.
The values of the median, the first and third quartiles, and the whiskers match between
Python and Mathematica.

[Figure 4.4 image: boxplots of relative DNA copy number (vertical axis, −0.004 to 0.004)
grouped by GC content (GC > 50% versus GC ≤ 50%).
(a) Tumor column basis vector 1: P-value = 9.1 × 10^-145743; N = 308329 and N = 2518708.
(b) Normal column basis vector 85: P-value = 1.6 × 10^-160948; N = 308326 and N = 2519826.]

Figure 4.4. The first tumor and 85th normal column basis vectors are correlated with
the fractional guanine-cytosine (GC) content across the tumor and normal genomes.
The distributions of the copy numbers listed in the (a) first tumor and (b) 85th normal
column basis vectors between tumor and normal bins, respectively, of >50% and ≤50% GC
content are depicted in boxplots with the corresponding Mann-Whitney-Wilcoxon (MWW)
P-values.

[Figure 4.5 image: boxplots of relative DNA copy number (vertical axis, −0.2 to 0.4).
(a) Row basis vector 1, grouped by genomic characterization center: Other (N = 40) versus
BI (N = 45), P-value = 3.7 × 10^-6.
(b) Row basis vector 85, grouped by tissue source site: Other (N = 80) versus CS (N = 5),
P-value = 8.2 × 10^-3.]

Figure 4.5. The first and 85th row basis vectors are correlated with experimental batches.
The distributions of the copy numbers listed in the (a) first and (b) 85th row basis
vectors between genomic characterization centers and tissue source sites, respectively,
are depicted in boxplots with the corresponding MWW P-values.
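
The MWW P-values quoted in the captions of Figures 4.4 and 4.5 compare the entries of a basis vector between two groups. A minimal sketch of such a test is shown below, assuming SciPy's `scipy.stats.mannwhitneyu`. Whether the toolkit uses this function is not stated here, and the two samples are synthetic stand-ins rather than the astrocytoma data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic stand-ins for basis-vector entries in two groups,
# e.g., bins with >50% versus <=50% GC content (hypothetical data).
high_gc = rng.normal(loc=0.5, scale=1.0, size=500)
low_gc = rng.normal(loc=0.0, scale=1.0, size=500)

# Two-sided Mann-Whitney-Wilcoxon rank-sum test.
result = mannwhitneyu(high_gc, low_gc, alternative="two-sided")
p_value = result.pvalue
print(f"U = {result.statistic:.1f}, P-value = {p_value:.3g}")
```

Note that P-values as small as those in Figure 4.4 lie far below the smallest positive double-precision number, so a direct call like this one would return 0 for them. Reporting such values requires working with the logarithm of the P-value, which this sketch does not attempt.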

4.6 Improvement of computational time of the GSVD in Python relative to Mathematica
With over four billion entries in the astrocytoma tumor and normal datasets, Mathematica
takes about 10 minutes to compute the GSVD, whereas Python takes only about two minutes,
roughly a fivefold speedup. These times include reading in the datasets and performing the
GSVD. This improvement suggests that the Python toolkit can process datasets of this size
more efficiently than the previous Mathematica implementation.
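
A timing of this kind can be sketched as follows. This is not the toolkit's code: `gsvd_via_qr` is a hypothetical helper that follows the QR-then-SVD route named in sections 2.1.2 and 3.1, the random matrices stand in for reading the real datasets, and both matrices are assumed to have full column rank with at least as many rows as columns.

```python
import time
import numpy as np

def gsvd_via_qr(d1, d2):
    """GSVD of two column-matched matrices via one QR and one SVD.

    Returns u1, s1, u2, s2, vt such that
    d1 = u1 @ diag(s1) @ vt and d2 = u2 @ diag(s2) @ vt.
    """
    m1 = d1.shape[0]
    # Thin QR of the stacked matrix [d1; d2].
    q, r = np.linalg.qr(np.vstack([d1, d2]))
    q1, q2 = q[:m1], q[m1:]
    # SVD of the top block: q1 = u1 @ diag(s1) @ wt.
    u1, s1, wt = np.linalg.svd(q1, full_matrices=False)
    # Because q1.T @ q1 + q2.T @ q2 = I, the columns of q2 @ w are
    # orthogonal with norms s2, where s1**2 + s2**2 = 1.
    q2w = q2 @ wt.T
    s2 = np.linalg.norm(q2w, axis=0)
    u2 = q2w / s2
    vt = wt @ r  # shared right basis
    return u1, s1, u2, s2, vt

rng = np.random.default_rng(0)

start = time.perf_counter()
d1 = rng.standard_normal((2000, 50))  # stand-in for reading the tumor dataset
d2 = rng.standard_normal((1500, 50))  # stand-in for reading the normal dataset
u1, s1, u2, s2, vt = gsvd_via_qr(d1, d2)
elapsed = time.perf_counter() - start

print(f"load + GSVD: {elapsed:.3f} s")
```

Placing the timer around both the loading step and the decomposition mirrors the way the times above are reported.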
CHAPTER 5

CONCLUSIONS

As the case study indicates, the genomic signal processing toolkit developed in this
research enables fast analysis of large-scale data with no loss of accuracy. Today’s
datasets are enormous, and this toolkit shortens the wait for results while maintaining
accuracy. Additionally, the toolkit improves visualization over existing Python libraries:
by building on those libraries and extending their features, it makes the data easier to
interpret visually. The toolkit is also reusable and can be applied to multiple datasets.
Its applications are not limited to genomic data; the computation and visualization of the
GSVD can be applied to many other types of data as well. Python was chosen to implement
this toolkit because it is well suited to cloud computing tasks and to working with large
datasets. It is anticipated that this toolkit can be used to analyze extremely large
datasets hosted in cloud storage.
REFERENCES

[1] C. F. Van Loan, SIAM J. Numer. Anal. 13, 76 (1976).

[2] C. C. Paige and M. A. Saunders, SIAM J. Numer. Anal. 18, 398 (1981).

[3] S. Friedland, SIAM J. Matrix Anal. Appl. 27, 434 (2005).

[4] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed. Cambridge, UK: Cambridge
University Press (2012).

[5] G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed. Baltimore, MD: Johns
Hopkins University Press (2012).

[6] H. Goldstein, Classical Mechanics, 2nd ed. Reading, MA: Addison-Wesley (1980).

[7] O. Alter, P. O. Brown, and D. Botstein, Proc. Natl. Acad. Sci. USA 100, 3351 (2003).

[8] O. Alter, G. H. Golub, P. O. Brown, and D. Botstein, in Miami Nature Biotechnology
Winter Symposium on Cell Cycle, Chromosomes and Cancer (January 31–February 4, 2004).

[9] S. P. Ponnapalli, G. H. Golub, and O. Alter, in Stanford University and Yahoo! Research
Workshop on Algorithms for Modern Massive Datasets (June 21–24, 2006).

[10] S. P. Ponnapalli, M. A. Saunders, C. F. Van Loan, and O. Alter, PLoS One 6, e28072
(2011).

[11] P. Sankaranarayanan, T. E. Schomay, K. A. Aiello, and O. Alter, PLoS One 10, e0121396
(2015).

[12] K. A. Aiello, C. A. Maughan, T. E. Schomay, S. P. Ponnapalli, H. A. Hanson, and
O. Alter, in 2018 AACR Annual Meeting (April 14–18, 2018), doi: 10.1158/1538-7445.AM2018-4267.

[13] L. N. Trefethen and D. Bau, III, Numerical Linear Algebra. Philadelphia, PA: SIAM
(1997).

[14] A. Edelman, T. A. Arias, and S. T. Smith, SIAM J. Matrix Anal. Appl. 20, 303 (1998).

[15] L. M. Ewerbring and F. T. Luk, J. Comput. Appl. Math. 27, 37 (1989).

[16] K. A. Aiello, S. P. Ponnapalli, and O. Alter, APL Bioeng. 2, 031909 (2018).

