High-dimensional Microarray Data Analysis: Cancer Gene Diagnosis and Malignancy Indexes by Microarray

This book discusses the advancements in cancer gene diagnosis through high-dimensional microarray data analysis, emphasizing the inadequacies of traditional statistical methods. It introduces a new discriminant theory and methods that successfully classify cancer and normal patients, highlighting the importance of small Matryoshka structures in gene analysis. The authors aim to improve cancer gene diagnosis and malignancy indexes while encouraging collaboration among medical researchers and statisticians for further validation and application of their findings.

Preface

This book extends the possibility of cancer gene diagnosis by presenting many results.
Medical researchers have tried to identify oncogenes from genetic data such as
microarrays since 1970, but they did not obtain precise results because statistical
discriminant analysis was useless for their research. In 2017, we explained our
surprising results to a Japanese genetics expert. He told us: “Since NIH reported that
microarrays are useless for cancer gene diagnosis, many researchers believe that this
theme has ended. Therefore, you should terminate your research.” I regret that I
started the study only in 2015. If we could have shown our results before the NIH
report, we believe that microarray-based genetic diagnosis would by now have
contributed to cancer control. Some statisticians focused on this theme as a new field
of “big or high-dimensional data analysis,” which differs from the analysis of small
samples (small n and small p). However, they offered three excuses for the difficulty
of the research. Although highly reliable data collected by physicians were easy to
use, they did not obtain a definite result. Discriminant analysis is the most useful
method for classifying two groups, such as cancer and normal patients or two different
cancers. However, because statistical discriminant functions are utterly useless for
this task, medical researchers use cluster analysis such as the “self-organizing map”
(SOM). They seem to have used discriminant analysis in the early stages of their
studies but probably judged it to be utterly useless.

In this book, as a successful application of “high-dimensional data analysis” using
microarrays, we show concretely that the new discriminant theory (Theory) is the most
suitable for cancer gene analysis and diagnosis, in addition to small samples (small n
and small p). The fatal problem of conventional studies is that they do not recognize
that the two classes are entirely separable in the high-dimensional gene space (Fact3).
There was no research on the discrimination of linearly separable data (LSD) other than
ours. Most researchers did not understand that only H-SVM and the Revised IP-Optimal
Linear Discriminant Function (Revised IP-OLDF, RIP) can find Fact3, and that other
LDFs, including LASSO, cannot discriminate microarrays correctly. This indicates that
only mathematical programming (MP)-based LDFs can find Fact3. Statistical discriminant
functions are useless for cancer gene analysis. Therefore, researchers relying on them
could not clearly define the “signal” in the high-dimensional gene space. Without the
correct signal, they cannot select cancer genes from a microarray or separate oncogenes
from noise. We call the linearly separable gene space and its linearly separable
subspaces Matryoshkas. A microarray (the big Matryoshka) contains many small
Matryoshkas. Moreover, RIP and the Matryoshka feature selection method (Method2) can
decompose a microarray into many Small Matryoshkas (SMs) and a noise gene subspace
(Fact4). Because SVMs are defined by quadratic programming (QP), they cannot decompose
a microarray into many SMs: QP finds only one optimal H-SVM over the whole region. To
find an optimal subspace (SM), H-SVM would have to survey all possible models, which is
NP-hard. If we call the smallest Matryoshka a cancer basic gene set (BGS), LINGO
Program3 can find many SMs and LINGO Program4 can find many BGSs. At first, because
each SM (or BGS) consists of few genes, we expected that statistical methods could
analyze these small samples and yield many useful results for cancer gene diagnosis.
However, although the NMs of logistic regression are zero for all SMs of the six
microarrays, the other methods do not reveal linear separability (Problem6). After many
trials, we produced signal data made of the RIP discriminant scores of all SMs, instead
of using the genes included in each SM or BGS. LINGO Program3 decomposes Alon’s
microarray into 64 SMs, and LINGO Program4 decomposes it into 130 BGSs. The RatioSVs of
the 64 SMs and the 130 BGSs lie in [2.33%, 26.76%] and [0.00%, 0.9%], respectively.
Because the RatioSVs of all 64 SMs are 2.33% or more, we judge that SMs are useful for
cancer gene diagnosis. On the other hand, BGSs are useless for cancer gene diagnosis
because the RatioSVs of all 130 BGSs are 0.9% or less. We expect that BGSs are as
important for cancer gene research as Yamanaka’s four genes are in iPS research. That
is, when a normal patient develops cancer, RIP discriminates the two classes clearly.
However, statistical discriminant functions cannot discriminate the two classes
(Problem6), for two reasons. The first is that they cannot, in theory, discriminate
LSD. The second is that all RatioSVs of the BGSs are tiny. When other genes are added
to the BGSs to form the 64 SMs, the SVs separate the two classes very easily. This
result suggests that SMs are more suitable for cancer gene diagnosis than BGSs. As a
future task, we must clarify the classification and roles of the many SMs and BGSs
(Problem7).
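
The decomposition into small Matryoshkas and a noise subspace described above can be
illustrated with a toy sketch. It is not the author's LINGO Program3 or RIP: as a
labeled assumption, a perceptron stands in for the exact MNM = 0 test (a False result
only means no separating hyperplane was found within the epoch limit), and a greedy
column elimination stands in for RIP's feature selection. The function names and the
toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def separates(X: Matrix, y: Sequence[int], cols: Sequence[int],
              max_epochs: int = 1000) -> bool:
    """Perceptron check: True if a separating hyperplane on `cols` is found.

    Assumption: this heuristic replaces an exact MNM = 0 test. True is always
    a valid certificate of linear separability; False is only "not found".
    """
    w = [0.0] * len(cols)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for row, label in zip(X, y):
            score = b + sum(wj * row[c] for wj, c in zip(w, cols))
            if label * score <= 0:  # misclassified or on the hyperplane
                w = [wj + label * row[c] for wj, c in zip(w, cols)]
                b += label
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def shrink(X: Matrix, y: Sequence[int], cols: Sequence[int]) -> List[int]:
    """Greedily drop columns while the remaining ones still separate the classes."""
    kept = list(cols)
    for c in list(cols):
        trial = [k for k in kept if k != c]
        if trial and separates(X, y, trial):
            kept = trial
    return kept


def matryoshka_decompose(X: Matrix, y: Sequence[int]) -> Tuple[List[List[int]], List[int]]:
    """Peel off separable column subsets until the remainder is not separable."""
    remaining = list(range(len(X[0])))
    subsets: List[List[int]] = []
    while remaining and separates(X, y, remaining):
        small = shrink(X, y, remaining)
        subsets.append(small)
        remaining = [c for c in remaining if c not in small]
    return subsets, remaining  # remaining plays the role of the noise subspace


# Hypothetical toy data: 6 cases x 4 "genes"; classes are +1 and -1.
# Columns 0 and 2 each separate the classes; columns 1 and 3 do not.
X_TOY = [
    [1.0, 0.3, 2.0, 0.5],
    [1.2, 0.9, 2.5, 0.1],
    [0.9, 0.5, 2.2, 0.9],
    [-1.0, 0.4, -2.0, 0.6],
    [-1.1, 0.8, -2.4, 0.2],
    [-0.8, 0.6, -2.1, 0.8],
]
Y_TOY = [1, 1, 1, -1, -1, -1]

subsets, noise = matryoshka_decompose(X_TOY, Y_TOY)
print("separable subsets:", subsets, "noise columns:", noise)
```

In this sketch each peeled-off subset is minimal under the greedy order, so it is
closer in spirit to a BGS than to an SM; the book's programs obtain SMs and BGSs by
integer programming, which this example does not attempt.
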

This book proposes cancer gene diagnosis and malignancy indexes based on the analysis of
all SMs obtained from six microarrays. However, the malignancy indexes need to be
verified by medical professionals. Therefore, we disclose the LINGO programs and explain
many statistical results used for verification in this book. These results benefit
statistical researchers and statistical education because many people can easily enter
this field by using our successful examples of “high-dimensional data analysis.” Also,
because it makes full use of our statistical knowledge, this book can serve as a
guidebook for data analysis. Moreover, the seven problems and four facts, which no one
in statistics has pointed out before, will undoubtedly help improve your practical data
analysis skills. We hope that many people, such as medical researchers, statisticians,
and users of statistics, will contribute to cancer gene diagnosis and produce useful
results. However, although many engineers in fields such as pattern recognition and
machine learning tackled Problem5, they did not succeed either. This is very strange
because they were free from the restriction of the normal distribution.

Chapter 1 introduces a novel theory of discriminant analysis and its application to the
genetic analysis of cancer from a new perspective (New Theory of Discriminant Analysis
After R. Fisher, Springer 2016). I graduated from university in 1971 and participated in
the development project of the “Electrocardiogram Automatic Diagnosis System” at the
Osaka Prefectural Adult Disease Center. Dr. Nomura, the project leader, gave us the
theme of developing diagnostic logic to separate normal symptoms from several abnormal
symptoms by discriminant analysis. After four years, the discriminant study remained
entirely inferior to the empirical branching logic developed by Dr. Nomura. The reason
is that statistical discriminant theory is useless when, as is often the case, the data
used for medical diagnosis do not follow a normal distribution. This failure motivated
me to research a new discriminant theory. Then, based on many empirical studies,
including medical data, up to 2015, I established a new discriminant theory. I first
showed the relationship between the number of misclassifications (NM) and the
discriminant coefficients (Fact1). From this fact, we could explain many defects of NM
(Problem1). We developed IP-OLDF and Revised IP-OLDF (RIP) based on the minimum NM (MNM)
criterion instead of NM. I found that MNM decreases monotonically (Fact2). Also, for the
Swiss banknote data with six variables, MNM = 0 for the two variables (X4, X6). In other
words, we can completely distinguish genuine from counterfeit notes. Because of the
monotonic decrease of MNM, the 16 models containing these two variables have MNM = 0,
and the MNMs of the remaining 47 models are at least one. This was the first
discriminant study of LSD, which is essential for the genetic analysis of cancer
(Problem2). There are two other problems: the deficiencies of generalized inverse
matrices (Problem3) and the fact that discriminant theory is not inferential statistics
(Problem4). Because both problems have little relation to cancer gene analysis, we do
not explain them in detail in this book. Six research groups in the USA published papers
on the genetic diagnosis of cancer using microarrays between 1999 and 2004 and released
the microarrays on the Internet. When we discriminated these microarrays by RIP during
the 54 days from 25 October to 20 December 2015, we found that all six MNMs are zero
(Problem5). No researcher had been able to solve this problem since 1970 because the
existing discriminant theory was useless. That is, cancer and normal patients are
entirely separable in the high-dimensional gene space; in other words, the microarrays
are LSD (Fact3). Based on Fact2, we found that the gene space has a Matryoshka structure
containing many SMs with MNM = 0. We developed the Matryoshka feature selection method
(Method2). RIP and Method2 could decompose the microarrays into many SMs (or BGSs)
(Fact4). Having completed the research theme pursued since 1971, we published “New
Theory of Discriminant Analysis After R. Fisher” with Springer (2016). In Chap. 1,
Method2 decomposes the Swiss banknote data and the Japanese car data into several SMs.
In other words, Method2 is a general-purpose method for both high-dimensional data and
ordinary data. Furthermore, the chapter shows how RIP and Revised LP-OLDF can easily
produce many SMs. The reason why H-SVM using QP cannot obtain SMs can be understood from
the basics of MP. That is, cancer gene analysis cannot be done with statistical
discriminant functions based on the normal distribution, whereas it is easy for MP-based
LDFs. Using LINGO Program3, introduced in Chap. 10, we can divide an arbitrary
microarray or ordinary dataset into SMs. We analyze these SMs by statistical methods and
propose the genetic diagnosis of cancer in Chap. 2 and the following chapters.
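
The MNM criterion and the contrast with H-SVM can be written compactly as two
optimization problems. The preface gives no formulas, so the notation below is an
assumption: it follows the form in which an MNM-minimizing integer program and a
hard-margin SVM are commonly stated, with x_i the gene vector of case i, y_i = ±1 its
class, e_i a 0/1 misclassification indicator, and M a large positive constant.

```latex
\begin{aligned}
\text{RIP:} \quad & \min_{\mathbf{b},\, b_0,\, \mathbf{e}} \ \sum_{i=1}^{n} e_i \\
& \text{s.t.} \quad y_i \left( \mathbf{x}_i^{\top} \mathbf{b} + b_0 \right) \ge 1 - M e_i,
  \qquad e_i \in \{0, 1\}, \quad i = 1, \dots, n \\[6pt]
\text{H-SVM:} \quad & \min_{\mathbf{b},\, b_0} \ \tfrac{1}{2} \lVert \mathbf{b} \rVert^{2} \\
& \text{s.t.} \quad y_i \left( \mathbf{x}_i^{\top} \mathbf{b} + b_0 \right) \ge 1,
  \qquad i = 1, \dots, n
\end{aligned}
```

Under this reading, MNM is the optimal value of the first problem, and MNM = 0 means
that every e_i can be zero, which is exactly the condition that the data are LSD. The
H-SVM problem is feasible only for LSD and, being a convex QP, has a single optimal
coefficient vector, which is consistent with the remark that QP finds only one optimal
H-SVM on the whole region.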

Chapter 2 introduces cancer gene diagnosis using SMs (From Cancer Gene Analysis to
Cancer Gene Diagnosis. 2017). To evaluate the many SMs found by Method2, we created a
statistic called RatioSV. Like MNM, it is an essential statistic for LSD discrimination.
In Alon’s dataset (Proc. Natl. Acad. Sci. USA 96: 6745–6750, 1999), RIP found 130 BGSs
in addition to 64 SMs. The 130 SVs of the BGSs separated cancer and normal patients with
RatioSVs of less than 1%. The 64 SVs of the SMs separated the two groups with RatioSVs
from 2.4% to 26.8%. Although these results indicate that discriminating the SMs is easy,
no researcher had succeeded since 1970. BGSs are vital for the study of oncogene
combinations, but we judged that they are not useful for cancer gene diagnosis. Because
each SM is a small sample (small n and small p), we expected standard statistical
methods to be useful for analyzing SMs. However, only logistic regression achieved
NM = 0 for all SMs. With the other statistical methods, the two groups often overlapped
(Problem6). Therefore, we created new data with the RIP discriminant scores (RipDSs) as
variables and showed that these signal data are a true signal in the microarrays. With
this breakthrough, we could carry out the analysis by standard statistical methods using
the signal data. In particular, PCA and cluster analysis separate the two groups
completely. We also found that the first principal component of PCA represents a
malignancy index of cancer, as does the DS of each SM. Because these results need
medical verification, we published the book on Amazon to call for cooperation from the
six research groups. However, we received no replies, probably for the following
reasons: (1) the six projects may have ended after 2004, (2) the groups did not access
this book and our papers because we are unknown in the medical field, and (3) a Kindle
book is not an academic journal. In Chap. 2, we outline the results of cluster analysis
and PCA obtained from the six microarrays. From Chap. 3 onward, we examine our claim
about the signal by many approaches.
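
The two ideas in this paragraph, RatioSV and signal data built from discriminant scores,
can be sketched in a few lines. The preface does not define RatioSV, so the formula used
here is an assumption: the gap between the two support vectors (2, since the scores are
taken to be scaled so that the SVs sit at −1 and +1) divided by the range of the
discriminant scores, in percent. The function names, the model tuples, and the toy
numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

# A model is (column indices, coefficients, intercept) of one linear discriminant function.
Model = Tuple[Sequence[int], Sequence[float], float]


def discriminant_scores(X: Sequence[Sequence[float]], model: Model) -> List[float]:
    """Discriminant score of every case for one linear discriminant function."""
    cols, weights, bias = model
    return [bias + sum(w * row[c] for w, c in zip(weights, cols)) for row in X]


def ratio_sv(scores: Sequence[float]) -> float:
    """Assumed RatioSV (%): SV gap of 2 (scores at -1 and +1) over the score range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("scores must not all be equal")
    return 100.0 * 2.0 / (hi - lo)


def signal_data(X: Sequence[Sequence[float]], models: Sequence[Model]) -> List[List[float]]:
    """New data set: cases stay as rows, one column of scores per model (per SM)."""
    columns = [discriminant_scores(X, m) for m in models]
    return [list(row) for row in zip(*columns)]


# Hypothetical toy data: 4 cases x 2 genes, and one made-up model per gene.
X_TOY = [[2.0, 8.0], [1.0, 2.0], [-1.0, -2.0], [-3.0, -10.0]]
MODELS: List[Model] = [([0], [1.0], 0.0), ([1], [0.5], 0.0)]

signal = signal_data(X_TOY, MODELS)
for j, model in enumerate(MODELS):
    column = [row[j] for row in signal]
    print(f"model {j}: scores={column} RatioSV={ratio_sv(column):.2f}%")
```

A wide score range relative to the SV gap gives a small RatioSV, which matches the
preface's reading that the BGSs (under 1%) separate the classes only narrowly while the
SMs (a few percent up to about 27%) separate them more comfortably. PCA or cluster
analysis would then be run on the rows of `signal` rather than on the raw genes.
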

Chapter 3 explains the cancer gene diagnosis of Alon’s dataset by comparing the 39 SMs
found by Revised LP-OLDF with the 56 SMs found by RIP. As of 2017, only RIP and Revised
LP-OLDF had been confirmed to decompose the datasets into different combinations of SMs.
Therefore, if the 39 SMs obtained by Revised LP-OLDF, which needs only a short
calculation time, are useful for the genetic analysis of cancer, using them is
preferable to using the 56 SMs obtained by RIP. We therefore analyzed, compared, and
evaluated both sets by RatioSV and various statistical methods. In conclusion, both sets
gave almost the same results in every analysis.

In Chap. 4, we attempt what we have not done so far. One such task is the evaluation of
the signal and noise separated by RIP and Revised LP-OLDF. For this purpose, we analyze
Alon’s microarray (2000 genes). RIP finds 62 SMs (1968 genes) and a noise subspace
(32 genes). Revised LP-OLDF finds 32 SMs (1005 genes) and a noise subspace (995 genes).
Although we have analyzed individual SMs so far, we have not evaluated the signal
subspace and the noise subspace as wholes. When we discriminate the signal and noise
subspaces by RIP, we confirm that the MNM of the signal subspace is 0 and that of the
noise subspace is one or more. In addition, many normal cases lie on SV = −1, and many
cancer cases lie on SV = 1. This shows that many cases are concentrated on two points in
the high-dimensional signal subspace. Revised LP-OLDF decomposes the microarray into a
signal subspace of 1005 genes, consisting of 32 SMs, and a noise subspace of 995 genes.
Both the signal and the noise subspaces have NM = 0, which indicates that Revised
LP-OLDF cannot separate the SMs from the noise subspace. The reason is that Revised
LP-OLDF cannot achieve NM = 0 for all of the linearly separable subspaces (Fact1). We
examine the correlations of the genes contained in the signal subspace and find that
they are all fairly high and positive. Moreover, we explain why statistical methods
cannot find Fact3.

Chapters 5 to 9 introduce the cancer gene diagnosis of the other five datasets: the
Golub dataset (Science 286(5439): 531–537, 1999), the Shipp dataset (Nature Medicine
8(1): 68–74, 2002), the Chiaretti dataset (Blood 103: 2771–2778, 2004), the Singh
dataset (Cancer Cell 1(2): 203–209, 2002), and the Tian dataset (The New England Journal
of Medicine 349: 2483–2494, 2003). Each chapter presents different verification results
to explain Problem6 and Problem7.

In Chap. 10, we discuss three LINGO programs. The first is the LINGO sample model
developed by Schrage, which we explain using common data such as the Swiss banknote
data, the Japanese automobile data, and the iris data. Because high-dimensional gene
datasets are unfamiliar to users of statistics, the barrier to entry is high. Explaining
genetic diagnosis with common data makes the approach familiar even to general users of
statistics. In particular, the Swiss banknote data and the Japanese automobile data are
LSD, but their RatioSVs are very small, less than 0.1%, as with the BGSs. This contrasts
with the genetic diagnosis. With these programs, RIP can easily decompose not only
microarrays but also other data. This will be useful as a new research theme of
classifying many variables, for example in research on marketing, exam questions, and
product characteristics. We are released from the curse of high-dimensional data and
show that the Theory can solve the six problems of discriminant analysis.

Research Gate: https://www.researchgate.net/profile/Shuichi_Shinmura
Research Map (Japanese Researchers DB): https://researchmap.jp/read0049917/
Economic Department HP: http://sun.econ.seikei.ac.jp/~shinmura/
Please refer to Research Gate for updates to this book.

Musashino, Japan
Shuichi Shinmura
Emeritus Professor of Seikei University

Acknowledgements

Our research was made possible by powerful software: LINGO, supported by LINDO Systems
Inc., and JMP, supported by the JMP Japan Division of SAS Institute Japan Ltd.
I wish to acknowledge the following researchers who contributed to my research
for this book.
Linus Schrage, Kevin Cunningham, Mark Wiley (LINDO Systems Inc.); Hitoshi
Ichikawa (LINDO Japan); John Sall, Noriki Inoue, Kyoko Takenaka (JMP Division
of SAS Institute Inc.); Takaichirou Suzuki, Akira Ooshima, Yutaka Nomura (Osaka
International Cancer Institute, Center for Adult Diseases); Naoji Tsuda (SCS);
Hiromi Wada (Kyoto University); Toshio Fukuzumi (Gene Science); Aki Ishii,
Masahiro Mizuta, Mika Sato-Ilic, Atsuhiro Hayashi, Ian B Jeffery, Kazunori
Yamaguchi, Michiko Watanabe.
I also wish to thank my family: Reiko, Makiko, Hideki, Kana, and Yasuhiro
Watanabe. Moreover, I am grateful for the legacy of my late father, Otojirou
Shinmura, who supported the research.

Contents

1 New Theory of Discriminant Analysis and Cancer Gene Analysis
1.1 Introduction
1.2 Fundamental of Theory
1.2.1 The Motivation of Our Research
1.2.2 IP-OLDF Based on MNM Criterion and Two Facts
1.2.3 Simple Example
1.2.4 Ordinary LP Solution
1.3 Five Serious Problems and Three Excuses
1.3.1 Four Problems
1.3.2 Problem5
1.3.3 Three Excuses of Cancer Gene Analysis
1.4 Four OLDFs and MNM Instead of NM
1.4.1 Revised IP-OLDF and the Defects of Number of Misclassifications
1.4.2 Revised LP-OLDF and Revised IPLP-OLDF
1.4.3 Hard-Margin SVM (H-SVM)
1.4.4 Soft-Margin SVM (S-SVM)
1.4.5 Statisticians Claim for MP-Based LDFs
1.5 Matryoshka Feature Selection Method (Method2) and RatioSV
1.5.1 Method2
1.5.2 RatioSV: Measurement of the Degree of Linear Separability
1.5.3 Six Famous Microarrays
1.5.4 How to Develop Method2 (a Surprising 54-Day Research Diary)
1.5.5 Results of Six Microarrays
1.5.6 The Reason for Natural Feature Selection
1.5.7 Two New Facts
1.6 Validation of Method2 by Common Data
1.6.1 Matryoshka Structure of Swiss Banknote Data
1.6.2 Validation of LINGO Program3 Results
1.6.3 Validation of Method2 by Japanese 44 Cars Data
1.6.4 Examination of Duplicate Data
1.7 Conclusion
References

2 Overview of Cancer Gene Diagnosis
2.1 Introduction
2.2 Cancer Gene Diagnosis
2.3 Analysis of 64 SMs Obtained by Alon’s Microarray
2.3.1 Analysis of 64 SMs
2.3.2 Analysis of RipDS8 by Standard Statistical Methods
2.4 Analysis of 64 RipDSs Data
2.4.1 Examination of 64 RipDSs and RatioSV of RIP
2.4.2 Ward Cluster Analysis of RipDSs New Data
2.4.3 PCA Results of New Data
2.5 The 130 BGSs of Alon’s Microarray
2.5.1 Results by Standard Statistical Methods
2.5.2 Examination of RipDSs of 130 BGSs
2.5.3 Examination of RipDSs New Data by PCA and Cluster Analysis
2.5.4 Summary
2.6 Other Five Microarrays
2.6.1 Singh’s Microarray
2.6.2 Golub Microarray
2.6.3 Tian’s Microarray
2.6.4 Chiaretti Microarray
2.6.5 Shipp Microarray
2.7 Conclusion
References

3 Cancer Gene Diagnosis of Alon’s microarray by RIP and Revised LP-OLDF
3.1 Introduction
3.2 Outlook of This Chapter
3.2.1 Alon’s microarray
3.2.2 Examination of the Iteration Option of LINGO Program3
3.3 Comparison of 39 SMs by Revised LP-OLDF and 56 SMs by RIP
3.3.1 Result of 39 SMs by Revised LP-OLDF
3.3.2 Result of 56 SMs by RIP
3.3.3 Comparison of Three Results
3.4 Three Signal Data Using 39 SMs Found by Revised LP-OLDF
3.4.1 Signal Data Made by 39 RipDSs Using 39 SMs Found by Revised LP-OLDF
3.4.2 Signal Data Made by 39 LpDSs Using 39 SMs Found by Revised LP-OLDF
3.4.3 Signal Data Made by 39 HsvmDSs Using 39 SMs Found by Revised LP-OLDF
3.5 Analysis of Three Signal Data Using 56 SMs Found by RIP
3.5.1 Signal Data Made by 56 RipDSs Using 56 SMs Found by RIP
3.5.2 Signal Data Made by 56 LpDSs Using 56 SMs Found by RIP
3.5.3 Signal Data Made by 56 HsvmDSs Using 56 SMs Found by RIP
3.6 Conclusion
References

4 Further Examinations of SMs: Defect of Revised LP-OLDF and Correlations of Genes
4.1 Introduction
4.2 Detail Survey of Signal and Noise Subspaces Found by RIP
4.2.1 Confirmation of Signal and Noise Subspaces Found by RIP
4.2.2 Detail Survey of Signal and Noise Subspaces Found by RIP
4.2.3 Basic Structure of Six Microarrays
4.3 Detail Survey of Signal and Noise Subspaces Found by Revised LP-OLDF
4.3.1 Confirmation of Signal and Noise Subspaces Found by Revised LP-OLDF
4.3.2 Detail Survey of Signal and Noise Subspaces Separated by Revised LP-OLDF
4.4 Analysis of 62 SMs Found by RIP
4.4.1 Examination of RatioSV and NM
4.4.2 Correlations of 62 RipDSs
4.5 Validation of SM13 and SM62
4.5.1 RatioSVs and Outliers of SM13 and SM62
4.5.2 T-Tests of Mean’s Difference Between the Tumor and Normal Subjects in SM13 and SM62
4.5.3 PCA and Cluster Analysis of SM13 and SM62
4.5.4 Examination of Correlation of 37 Genes Included in SM13
4.6 The Reason Why Standard Statistical Methods Could not Find Fact6
4.7 Another Problem Suggested by Linus Schrage
4.8 Comparison of Our Research with iPS Cell Research and Problem6
4.9 Conclusion
References

5 Cancer Gene Diagnosis of Golub et al. Microarray
5.1 Introduction
5.2 Validation of SM Found by the RIP and Revised LP-OLDF
5.2.1 Verification of the Number of Iterations of Revised LP-OLDF and RIP
5.2.2 Analysis of Signal Subspace and Noise Subspace Obtained by Revised LP-OLDF
5.2.3 Analysis of Signal Subspace and Noise Subspace Obtained by RIP
5.3 Analysis of 179 SMs of Golub et al. Microarray (2018)
5.3.1 Validation of 179 SMs by Six MP-Based LDFs and Discriminant Functions
5.3.2 Correlation Coefficient of Discriminant Score of 179 RIPs
5.4 Verification of SM3 and SM179
5.4.1 RatioSV of SM3 and SM179
5.4.2 T-Test of RIP3 and RIP179
5.4.3 BGS and Yamanaka’s Four Genes of IPS Research
5.4.4 PCA and Cluster Analysis
5.5 Analysis of Signal Data Made by 179 RipDSs
5.5.1 Cluster Analysis and PCA of RipDSs Signal Data
5.5.2 Cluster Analysis and PCA of HsvmDSs Signal Data
5.5.3 Analysis of Transposed Data
5.6 Conclusions
References

6 Cancer Gene Diagnosis of Shipp et al. Microarray
6.1 Introduction
6.2 Validation of SM Found by the RIP and Revised LP-OLDF
6.2.1 Verification of the Number of Iterations of Revised LP-OLDF and RIP
6.2.2 Analysis of Signal and Noise Subspaces Obtained by Revised LP-OLDF
6.2.3 Analysis of 237 SMs and Noise Spaces Obtained by RIP
6.3 Analysis of 237 SMs of Shipp et al. Microarray (2018)
6.3.1 Validation of 237 SMs by Six MP-Based LDFs and Discriminant Functions
6.3.2 Correlation Coefficients of 237 RipDSs
6.3.3 Examination of Three RipDSs with a Correlation of 1
6.4 Analysis of 30 RipDSs of 30 SMs and 18 HsvmDSs of 18 SMs
6.4.1 Cluster Analysis
6.4.2 Principal Component Analysis (PCA)
6.5 Analysis of Transposed Data
6.6 Conclusions
References

7 Cancer Gene Diagnosis of Singh et al. Microarray
7.1 Introduction
7.2 Problem6 of Cancer Gene Analysis
7.3 Examination of RipDSs and SMs
7.3.1 Correlations of 139 RipDSs
7.3.2 PCA of Signal Data Made by 139 RipDSs
7.3.3 How to Categorize 139 RipDSs
7.4 Analysis of 139 SMs of Singh et al. Microarray (2018)
7.4.1 Validation of 139 SMs by Six MP-Based LDFs and Discriminant Functions
7.4.2 Analysis of Signal Data Using 139 SMs Found by RIP
7.4.3 Transposed Data of RipDSs Using 139 SMs Found by RIP
7.5 Conclusions
References
8 Cancer Gene Diagnosis of Tian et al. Microarray . . . . . . . . . . . . . 329
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
8.2 Examination of Revised LP-OLDF Discriminant Scores
and SMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.2.1 Correlation of 104 LpDSs . . . . . . . . . . . . . . . . . . . . . 331
8.2.2 PCA of 104 LpDSs . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.2.3 How to Categorize Many 104 LpDSs . . . . . . . . . . . . . 335
8.3 Analysis of 104 SMs of Tian et al. Microarray (2018) . . . . . . . 341
8.4 Analysis of Three Signal Data Made by 104 DSs . . . . . . . . . . 345
8.4.1 Cluster Analysis of Three Signal Data . . . . . . . . . . . . 346
8.4.2 PCA of Three Signal Data . . . . . . . . . . . . . . . . . . . . . 348
8.4.3 PCA of Transpose Signal Data . . . . . . . . . . . . . . . . . 356
8.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
9 Cancer Gene Diagnosis of Chiaretti et al. Microarray . . . . . . . . . . 359
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
9.2 Examination of Discriminant Scores of 124 SMs Found
by Revised LP-OLDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
9.2.1 Correlation of 124 LpDSs . . . . . . . . . . . . . . . . . . . . . 361
9.2.2 PCA Analysis of Signal Data Made by 124 LpDSs . . . 363
9.2.3 How to Categorize 124 LpDSs . . . . . . . . . . . . . . . . . 364
9.3 Validation of 124 SMs by Six MP-Based LDFs
and Discriminant Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 367
9.4 Analysis of Three Signal Data of 124 RipDSs, LpDSs,
and HsvmDSs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
9.4.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . 372
9.4.2 Clustering of LpDSs and HsvmDSs Signal Data . . . . . 378
9.5 PCA Analysis of Signal Data . . . . . . . . . . . . . . . . . . . . . . . . . 380
9.6 PCA (Transposed Signal Data) . . . . . . . . . . . . . . . . . . . . . . . . 386
9.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10 LINGO Programs of Cancer Gene Analysis . . . . . . . . . . . . . . . . 393
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
10.2 LINGO Sample Model (DiscrmSwiss.lng) . . . . . . . . . . . . . . . 395
10.2.1 Original DiscrmSwiss.lng . . . . . . . . . . . . . . . . . . . . 395
10.2.2 Modified DiscrmSwiss.lng . . . . . . . . . . . . . . . . . . . . 398
10.2.3 Japanese Cars Data . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.2.4 Thank You for the Fabulous Model Creator Linus . . . 404
10.2.5 Iris Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
10.3 Six MP-Based LDFs and LINGO Models . . . . . . . . . . . . . . . 405
10.4 LINGO Program3 of Method2 . . . . . . . . . . . . . . . . . . . . . . . 409
10.5 Validation Method2 by LINGO Program1 Using
Common Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
10.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Abbreviations

Cancer gene    In our cancer gene analysis, we use the term cancer gene instead of oncogene.
Cancer gene analysis    RIP, Revised LP-OLDF, and H-SVM find that six microarrays are LSD (Fact3). Moreover, Method2 decomposes each microarray into many SMs and a noise subspace (Fact3) by RIP and Revised LP-OLDF, but not by H-SVM. We call these analyses cancer gene analysis. It is important to note that statistical analyses are useless for this task
Cancer gene diagnosis    If we make three types of signal data from RIP DSs (RipDSs), LpDSs, and HsvmDSs, statistical methods can analyze these signal data and find many malignancy indexes that open a new frontier of cancer gene diagnosis
Common data    The iris data, the student data, the CPD data, the Swiss banknote data, the Japanese automobile data, and the pass/fail determination using examination data
CP    A convex polyhedron in the discriminant coefficient space
HsvmDSs    H-SVM discriminant scores
LDF    Linear discriminant functions such as Fisher’s LDF, logistic regression, four OLDFs, and three SVMs
LOO    Leave-one-out method
LpDSs    Revised LP-OLDF discriminant scores
LSD    Linearly separable data, the MNM (minimum number of misclassifications) of which is zero

Matryoshka    All linearly separable spaces and subspaces
Matryoshka structure    The microarray is a big Matryoshka that includes smaller Matryoshkas. The monotonic decrease of MNM is the same idea as the Matryoshka structure
Method1    The 100-fold cross-validation for small samples
Method2    The Matryoshka feature selection method, which can discriminate the common data and the microarrays. It can find SMs and decompose an LSD into many SMs
OCP    An optimal CP, the NM of which is the MNM
Oncogenes    The term used for cancer genes found by physicians
PCA    Principal component analysis
Prin1    The first principal component
QDF    A quadratic discriminant function
RDA    A regularized discriminant analysis
RipDSs    RIP discriminant scores
Signal data    Data made by RIP, Revised LP-OLDF, and H-SVM (from RipDSs, LpDSs, and HsvmDSs)
SOM    Self-organizing map
Standard statistical methods    One-way ANOVA with t-test, correlation analysis, univariate analysis, hierarchical cluster analysis, principal component analysis (PCA), QDF, Fisher’s LDF, and logistic regression
Statistical discriminant functions    Fisher’s LDF, QDF, RDA, LASSO, and logistic regression. However, only logistic regression can discriminate all SMs correctly. The other discriminant functions fail on LSD and are useless for this purpose
Symbols

Our Research Theme: Discrimination of Two Classes (n Cases and p Variables) by Eight LDFs and QDF

LDF    Linear discriminant function $f(x) = b_1 x_1 + \cdots + b_p x_p + c$
DS    Discriminant score $f(x_i)$ of the $i$th case $x_i$, for $i = 1, \ldots, n$
Extended DS    $y_i f(x_i)$, where $y_i = -1$ for class1 and $y_i = 1$ for class2
RipDSs    RIP discriminant scores
LpDSs    Revised LP-OLDF discriminant scores
HsvmDSs    H-SVM discriminant scores
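
The symbols above can be illustrated with a minimal Python sketch. It is not one of the book's LINGO programs: the function names, the toy data, and the coefficients are hypothetical, and treating a case with a non-positive extended DS as misclassified is an assumption made here for illustration only.

```python
from typing import List, Sequence


def discriminant_score(x: Sequence[float], b: Sequence[float], c: float) -> float:
    """Return f(x) = b1*x1 + ... + bp*xp + c for a single case x."""
    if len(x) != len(b):
        raise ValueError("x and b must both have length p")
    return sum(bj * xj for bj, xj in zip(b, x)) + c


def extended_scores(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    b: Sequence[float],
    c: float,
) -> List[float]:
    """Return the extended DS y_i * f(x_i), with y_i = -1 (class1) or 1 (class2)."""
    return [yi * discriminant_score(xi, b, c) for xi, yi in zip(X, y)]


def count_misclassified(ext: Sequence[float]) -> int:
    """Count cases whose extended DS is not positive.

    Assumption: a case lying exactly on the hyperplane (extended DS of 0)
    is counted as not correctly classified.
    """
    return sum(1 for e in ext if e <= 0)


# Hypothetical toy data: n = 4 cases, p = 2 variables.
X = [[1.0, 1.0], [2.0, 1.0], [4.0, 5.0], [5.0, 5.0]]
y = [-1, -1, 1, 1]          # class1 = -1, class2 = 1
b = [1.0, 1.0]              # hypothetical discriminant coefficients
c = -6.0                    # hypothetical constant term

ext = extended_scores(X, y, b, c)
print(ext)                       # extended DS of each case
print(count_misclassified(ext))  # number of misclassifications for this LDF
```

If some LDF yields zero misclassifications, as this hypothetical one does on the toy data, the data are linearly separable in the sense of the LSD entry in the Abbreviations.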

Book0    Optimum Linear Discriminant Functions, 2010, JUSE Press, Ltd.

Book1    New Theory of Discriminant Analysis After R. Fisher: Advanced Research by the Feature Selection Method for Microarray Data, 2016, Springer. The theory consists of two facts, two methods, and four optimal linear discriminant functions (OLDFs), and it solves five problems. Four OLDFs and three SVMs are solved by LINGO Program1. Method1 is solved by LINGO Program2. Method2 is solved by LINGO Program3.
Book2    From Cancer Gene Analysis to Cancer Gene Diagnosis, 2017, Amazon Kindle version.
Book3    High-dimensional Microarray Data Analysis: Cancer Gene Diagnosis and Malignancy Indexes by Microarray, 2019, Springer.
