High-Dimensional Microarray Data Analysis: Cancer Gene Diagnosis and Malignancy Indexes by Microarray
This book extends the possibility of cancer gene diagnosis on the basis of many
results. Medical researchers have tried to identify oncogenes from genetic data such
as microarrays since 1970, but they did not obtain precise results because
statistical discriminant analysis was useless for their research. In 2017, we
explained our surprising results to a Japanese genetics expert. He told us: “Since
NIH reported that microarrays are useless for cancer gene diagnosis, many
researchers have believed that this theme has ended. Therefore, you should terminate
your research.” I regret that I did not start this study until 2015. If we could
have shown our results before the NIH report, we believe that microarray genetic
diagnosis would be contributing to cancer control by now. Some statisticians focused
on this research theme as a new field of “big or high-dimensional data analysis,”
which differs from small-sample analysis (small n and small p). However, they
offered three excuses for the difficulty of the research. Although highly reliable
data collected by physicians were easy to use, they did not obtain a definite
result. Discriminant analysis is the most useful method for classifying two groups,
such as cancer and normal patients or two different cancers. However, because
statistical discriminant functions are utterly useless for this purpose, medical
researchers use cluster analysis such as the “self-organizing map” (SOM). They seem
to have used discriminant analysis in the early stages of their studies but
probably judged it to be utterly useless.
In this book, as a successful application of “high-dimensional data analysis” to
microarrays, we show concretely that the new theory of discriminant analysis
(Theory) is well suited to cancer gene analysis and diagnosis, in addition to small
samples (small n and small p). The fatal problem of conventional studies is that
their authors did not know that the two classes are entirely separable in the
high-dimensional gene space (Fact3). There was no research on the discrimination of
linearly separable data (LSD) other than ours. Most researchers did not understand
that only the hard-margin SVM (H-SVM) and the Revised IP-Optimal Linear
Discriminant Function (Revised IP-OLDF, RIP) can find Fact3, and that other LDFs,
including LASSO, cannot discriminate microarrays correctly. This fact indicates that
only mathematical programming (MP)-based LDFs can find Fact3. Statistical
discriminant functions are useless for cancer gene analysis. Therefore, researchers
could not clearly define the “signal” in the high-dimensional genetic space. Without
a correct signal, they cannot select cancer genes from a microarray or filter
oncogenes from noise. We call the linearly separable gene space and its linearly
separable subspaces Matryoshkas.
A microarray (a big Matryoshka) contains many small Matryoshkas. Moreover, RIP and
the Matryoshka feature selection method (Method2) can decompose a microarray into
many Small Matryoshkas (SMs) and a noise gene subspace (Fact4). Because SVMs are
defined by quadratic programming (QP), they cannot decompose a microarray into many
SMs: QP finds only one optimal H-SVM over the whole region. To find an optimal
subspace (SM), H-SVM would have to survey all possible models, which is NP-hard. If
we call the smallest Matryoshka a cancer basic gene set (BGS), LINGO Program3 can
find many SMs and LINGO Program4 can find many BGSs. At first, because each SM (or
BGS) consists of a few genes, we expected that statistical methods could analyze
these small samples and yield many useful results for cancer gene diagnosis.
However, although the number of misclassifications (NM) of logistic regression is
zero for all SMs of the six microarrays, other methods do not reveal the linear
separability (Problem6).
After many trials, we produced signal data made of the RIP discriminant scores of
all SMs instead of the genes included in each SM or BGS. LINGO Program3 decomposes
Alon’s microarray into 64 SMs, and LINGO Program4 decomposes it into 130 BGSs. The
RatioSVs of the 64 SMs and the 130 BGSs fall in the ranges [2.33%, 26.76%] and
[0.00%, 0.9%], respectively. Because all 64 RatioSVs of the SMs are 2.33% or more,
we judge that SMs are useful for cancer gene diagnosis. On the other hand, BGSs are
not useful for cancer gene diagnosis because all 130 of their RatioSVs are 0.9% or
less. We nevertheless expect BGSs to be as important for cancer gene research as
Yamanaka’s four genes are in iPS research. That is, when a normal patient develops
cancer, RIP discriminates the two classes clearly. However, statistical
discriminant functions cannot discriminate the two classes (Problem6), for two
reasons. First, they cannot discriminate LSD theoretically. Second, all RatioSVs of
the BGSs are tiny. When other genes are added to the BGSs to form the 64 SMs, the
SVs separate the two classes very easily. This result suggests that SMs are more
suitable for cancer gene diagnosis than BGSs. As a future task, we must clarify the
classification and roles of the many SMs and BGSs (Problem7).
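The decomposition of a microarray into SMs and a noise subspace (Fact4) can be
pictured as a peeling loop. The sketch below is only an illustration under stated
assumptions and is not LINGO Program3: the function names are hypothetical, and
`single_gene_separator` is a toy stand-in for RIP that only recognizes single genes
whose values do not overlap between the two classes.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Data = Dict[str, List[float]]  # gene name -> expression values, one per case
Finder = Callable[[Data, Sequence[int], Sequence[str]], Optional[List[str]]]


def single_gene_separator(data: Data, labels: Sequence[int],
                          genes: Sequence[str]) -> Optional[List[str]]:
    """Toy stand-in for RIP (an assumption, not the book's solver).

    Returns a one-gene subset that separates the classes on its own:
    every class +1 value lies strictly above, or strictly below,
    every class -1 value. Returns None when no such gene remains.
    """
    for g in genes:
        pos = [v for v, y in zip(data[g], labels) if y == 1]
        neg = [v for v, y in zip(data[g], labels) if y == -1]
        if min(pos) > max(neg) or max(pos) < min(neg):
            return [g]
    return None


def matryoshka_decompose(data: Data, labels: Sequence[int],
                         find_separable_subset: Finder
                         ) -> Tuple[List[List[str]], List[str]]:
    """Peel off separable gene subsets until the remainder is not separable."""
    remaining = list(data)
    small_matryoshkas: List[List[str]] = []
    while remaining:
        sm = find_separable_subset(data, labels, remaining)
        if sm is None:          # nothing separable is left: this is the noise
            break
        small_matryoshkas.append(sm)
        remaining = [g for g in remaining if g not in sm]
    return small_matryoshkas, remaining


# Six cases: three cancer (+1) and three normal (-1), four made-up genes.
labels = [1, 1, 1, -1, -1, -1]
data = {
    "g1": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # separates: cancer values higher
    "g2": [1.0, 5.0, 2.0, 4.0, 3.0, 6.0],  # classes overlap
    "g3": [0.1, 0.2, 0.3, 0.9, 0.8, 0.7],  # separates: cancer values lower
    "g4": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # constant, carries no signal
}
sms, noise = matryoshka_decompose(data, labels, single_gene_separator)
print(sms)    # [['g1'], ['g3']]
print(noise)  # ['g2', 'g4']
```

A real solver would return multi-gene subsets (the genes with nonzero discriminant
coefficients), but the outer loop would keep the same shape: find a separable
subset, remove it, and repeat until the rest can no longer be separated.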
This book proposes a cancer gene diagnosis and malignancy indexes based on the
analysis of all SMs obtained from six microarrays. However, the malignancy indexes
need to be verified by medical professionals. Therefore, in this book we disclose
the LINGO programs and explain the many statistical results used for verification.
These results benefit statistical researchers and statistical education because
many people can easily enter this field by following our successful examples of
“high-dimensional data analysis.” Also, because it makes full use of our
statistical knowledge, this book can serve as a guidebook to data analysis.
Moreover, the seven problems and four facts that no one else has pointed out in
statistics should help improve your practical data analysis skills. We expect many
people, such as medical researchers, statisticians, and statistical users, to
contribute to cancer gene diagnosis and produce useful results. Many engineers in
fields such as pattern recognition and machine learning also tried Problem5, but
they did not succeed either. This is very strange because they were free from the
restriction of the normal distribution.
Chapter 1 introduces a new theory of discriminant analysis and its application to
the genetic analysis of cancer from a new perspective (New Theory of Discriminant
Analysis After R. Fisher, Springer 2016). I graduated from university in 1971 and
joined the development project for the “Electrocardiogram Automatic Diagnosis
System” at the Osaka Prefectural Adult Disease Center. Dr. Nomura, the project
leader, gave us the task of developing diagnostic logic that separates normal
symptoms from several abnormal symptoms by discriminant analysis. After four years
of study, discriminant analysis remained clearly inferior to the empirical
branching logic developed by Dr. Nomura. The reason is that statistical
discriminant theory is useless when, as with much of the data used for medical
diagnosis, the data do not follow a normal distribution. This failure motivated me
to research a new discriminant theory. Then, based on many empirical studies,
including studies of medical data, carried out until 2015, I established a new
discriminant theory. I first showed the relationship between the number of
misclassifications (NM) and the discriminant coefficients (Fact1). From this fact,
we could explain many defects of NM (Problem1). We developed IP-OLDF and Revised
IP-OLDF (RIP) based on the minimum NM (MNM) criterion instead of NM. I found that
MNM decreases monotonically as variables are added to a model (Fact2). Also, for
the Swiss banknote data with six variables, MNM = 0 for the two variables (X4, X6).
In other words, we can completely distinguish between genuine and counterfeit
notes. By the monotonic decrease of MNM, the 16 models that contain these two
variables have MNM = 0, and the remaining 47 models have MNM of one or more. This
was the first discriminant study of LSD, which is essential for the genetic
analysis of cancer (Problem2). There are two other problems: the deficiencies of
generalized inverse matrices (Problem3) and the fact that discriminant theory is
not inferential statistics (Problem4). Because both problems have little to do with
cancer gene analysis, we do not explain them in detail in this book. Six research
groups in the USA published papers on the genetic
diagnosis of cancer using microarrays between 1999 and 2004 and released their
microarrays on the Internet. When we discriminated these microarrays with RIP over
54 days, from October 25 to December 20, 2015, we found that all six MNMs are zero
(Problem5). No researcher had been able to solve this problem since 1970 because
the existing discriminant theory was useless. That is, cancer and normal patients
are entirely separable in the high-dimensional genetic space; in other words, the
microarrays are LSD (Fact3). Based on Fact2, we found that the gene space has a
Matryoshka structure containing many SMs with MNM = 0, and we developed the
Matryoshka feature selection method (Method2). RIP and Method2 can decompose
microarrays into many SMs (or BGSs) (Fact4). Having completed the research theme
that I began in 1971, we published “New Theory of Discriminant Analysis After
R. Fisher” with Springer (2016). In Chap. 1, Method2 decomposes the Swiss banknote
data and the Japanese car data into several SMs. In other words, Method2 is a
general-purpose method for both high-dimensional and ordinary data. Furthermore,
the chapter shows how RIP and Revised LP-OLDF can easily produce many SMs. The
reason why H-SVM, which uses QP, cannot obtain SMs follows from basic knowledge of
MP. That is, cancer gene analysis cannot be done with a statistical discriminant
function based on the normal distribution, whereas it is easy for MP-based LDFs.
Using LINGO Program3, introduced in Chap. 10, we can divide any microarray or
ordinary dataset into SMs. We analyze these SMs by statistical methods and propose
a genetic diagnosis of cancer in Chap. 2 and the following chapters.
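The counts 16 and 47 in the Swiss banknote example can be checked by enumerating
all models, that is, all non-empty subsets of the six variables. The short sketch
below does only this counting. That the 16 models containing (X4, X6) have MNM = 0
follows from Fact2, while the MNM values of the other 47 models are an empirical
result reported in the text and are not computed here. The variable names X1 to X6
are assumed labels for the six variables.

```python
from itertools import combinations

variables = ["X1", "X2", "X3", "X4", "X5", "X6"]
core = {"X4", "X6"}  # the two-variable model with MNM = 0, per the text

# Every non-empty subset of the six variables is one candidate model.
models = [set(c)
          for r in range(1, len(variables) + 1)
          for c in combinations(variables, r)]

# By the monotonic decrease of MNM, a model that contains the core
# inherits MNM = 0, because adding variables cannot increase MNM.
separable = [m for m in models if core <= m]
others = [m for m in models if not core <= m]

print(len(models), len(separable), len(others))  # 63 16 47
```

The 16 comes from the free choice of the other four variables (2 to the power 4),
and 47 is what remains of the 63 models.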
Chapter 2 introduces cancer gene diagnosis using SMs (From Cancer Gene Analysis to
Cancer Gene Diagnosis, 2017). To evaluate the many SMs found by Method2, we created
a statistic called RatioSV. Like MNM, it is an essential statistic for LSD
discrimination. In Alon’s dataset (Proc. Natl. Acad. Sci. USA 96: 6745–6750, 1999),
RIP found 130 BGSs in addition to 64 SMs. The SVs of the 130 BGSs separate cancer
and normal patients with RatioSVs of less than 1%, and the SVs of the 64 SMs
separate the two groups with RatioSVs from 2.4% to 26.8%. Although these results
indicate that the discrimination of SMs is easy, no researcher has succeeded in it
since 1970. BGSs are vital for the study of oncogene combinations, but we judged
that they are not useful for cancer gene diagnosis. Because each SM is a small
sample (small n and small p), we expected standard statistical methods to be useful
for analyzing SMs. However, only logistic regression achieved NM = 0 for all SMs;
with other statistical methods, the two groups often overlapped (Problem6).
Therefore, we created new data with the RIP discriminant scores (RipDSs) as
variables and showed that this signal data is a true signal in microarrays. After
this breakthrough, we carried out the analysis by standard statistical methods
using the signal data. In particular, PCA and cluster analysis separate the two
groups completely. We also found that the first principal component of PCA
represents a malignancy index of cancer, as does the DS of each SM. Because these
results need to be verified medically, we published the book on Amazon to call for
cooperation from the six research groups. However, we received no response,
possibly for the following reasons: (1) the six projects may have ended after 2004,
(2) the groups did not access this book and our papers because we are unknown in
the medical field, and (3) the Kindle version is not an academic journal. In
Chap. 2, we outline the results of cluster analysis and PCA obtained from the six
microarrays. From Chap. 3 onward, we examine our claim about the signal by many
approaches.
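This preface does not give the formula for RatioSV. The sketch below assumes one
plausible reading: the distance between the two support vectors (SV = −1 and
SV = 1, so a width of 2) expressed as a percentage of the range of the discriminant
scores. The function names and the toy scores are hypothetical, and the definition
should be checked against Chap. 2 before it is relied on.

```python
from typing import Sequence


def is_separated(scores: Sequence[float], labels: Sequence[int]) -> bool:
    """True when every case lies on or outside its own support vector.

    Labels are +1 or -1; a case is correctly placed when its
    discriminant score times its label is at least 1.
    """
    return all(s * y >= 1 for s, y in zip(scores, labels))


def ratio_sv(scores: Sequence[float]) -> float:
    """Assumed RatioSV: SV distance (2) as a percentage of the score range."""
    score_range = max(scores) - min(scores)
    if score_range <= 0:
        raise ValueError("discriminant scores must not all be equal")
    return 100.0 * 2.0 / score_range


# Toy discriminant scores for three normal (-1) and three cancer (+1) cases.
labels = [-1, -1, -1, 1, 1, 1]
scores = [-4.0, -1.0, -2.5, 1.0, 3.0, 6.0]

print(is_separated(scores, labels))  # True
print(ratio_sv(scores))              # 20.0
```

Under this reading, a larger RatioSV means the empty window between the two classes
takes up a larger share of the score range, which matches the statement that SMs
(2.4% to 26.8%) are easier to discriminate than BGSs (less than 1%).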
Chapter 3 explains the cancer gene diagnosis of the Alon dataset by comparing the
39 SMs found by Revised LP-OLDF with the 56 SMs found by RIP. In 2017, we confirmed
that only RIP and Revised LP-OLDF could decompose the datasets, each into a
different combination of SMs. If the 39 SMs obtained by Revised LP-OLDF, which has
a short calculation time, are useful for the genetic analysis of cancer, they are
more practical than the 56 SMs obtained by RIP. Therefore, we analyzed both sets
with RatioSV and various statistical methods and compared and evaluated the
results. In conclusion, almost the same results were obtained in every analysis.
In Chap. 4, we try things that we have not done so far. One is the evaluation of
the signal and noise separated by RIP and Revised LP-OLDF. For this purpose, we
analyze Alon’s microarray (2000 genes). RIP finds 62 SMs (1968 genes) and a noise
subspace (32 genes). Revised LP-OLDF finds 32 SMs (1005 genes) and a noise subspace
(995 genes). Although we have analyzed individual SMs so far, we have not evaluated
the signal subspace and the noise subspace as wholes. When we discriminate the
signal and noise subspaces found by RIP, we confirm that the MNM of the signal
subspace is 0 and that of the noise subspace is one or more. In addition, many
normal cases lie on SV = −1, and many cancer cases lie on SV = 1. This shows that
many cases are concentrated at two points in the high-dimensional signal subspace.
Revised LP-OLDF decomposes the microarray into a signal subspace of 32 SMs with
1005 genes and a noise subspace with 995 genes. Both of these subspaces have
NM = 0, which indicates that Revised LP-OLDF cannot separate the SMs from the noise
subspace. This is why Revised LP-OLDF cannot make NM = 0 for all of the linearly
separable subspaces (Fact1). We examine the correlations among the genes contained
in the signal subspace and find that they are all fairly high and positive.
Moreover, we explain why statistical methods cannot find Fact3.
In Chaps. 5 to 9, we introduce the cancer gene diagnosis of the other five
datasets: the Golub dataset (Science 286(5439): 531–537, 1999), the Shipp dataset
(Nature Medicine 8(1): 68–74, 2002), the Chiaretti dataset (Blood 103: 2771–2778,
2004), the Singh dataset (Cancer Cell 1(2): 203–209, 2002), and the Tian dataset
(The New England Journal of Medicine 349: 2483–2494, 2003). Each chapter shows
different verification results to explain Problem6 and Problem7.
In Chap. 10, we discuss three LINGO programs. The first is the LINGO sample model
developed by Schrage, which we explain using common data such as the Swiss banknote
data, the Japanese automobile data, and the iris data. Because high-dimensional
gene datasets are unfamiliar to statistical users, the barrier to entry is high.
Explaining genetic diagnosis with common data makes the approach familiar even to
general statistical users. In particular, the Swiss banknote data and the Japanese
automobile data are LSD, but their RatioSVs are very small, less than 0.1%, as with
the BGSs. This contrasts with the genetic diagnosis. With these programs, not only
microarrays but also other data can easily be decomposed by RIP. This will be
useful in research on marketing, exam questions, and product characteristics, as a
new research theme of classifying many variables. We are thus released from the
curse of high-dimensional data, and we show that the Theory can solve six problems
of discriminant analysis.
Research Gate: https://www.researchgate.net/profile/Shuichi_Shinmura
Research Map (Japanese Researchers DB): https://researchmap.jp/read0049917/
Economic Department HP: http://sun.econ.seikei.ac.jp/~shinmura/
Please refer to Research Gate for updates to this book.
We were able to carry out our research thanks to leading software: LINGO, supported
by LINDO Systems Inc., and JMP, supported by the JMP Japan Division of SAS
Institute Japan Ltd.
I wish to acknowledge the following researchers who contributed to my research
for this book.
Linus Schrage, Kevin Cunningham, Mark Wiley (LINDO Systems Inc.); Hitoshi
Ichikawa (LINDO Japan); John Sall, Noriki Inoue, Kyoko Takenaka (JMP Division
of SAS Institute Inc.); Takaichirou Suzuki, Akira Ooshima, Yutaka Nomura (Osaka
International Cancer Institute, Center for Adult Diseases); Naoji Tsuda (SCS);
Hiromi Wada (Kyoto University); Toshio Fukuzumi (Gene Science); Aki Ishii,
Masahiro Mizuta, Mika Sato-Ilic, Atsuhiro Hayashi, Ian B Jeffery, Kazunori
Yamaguchi, Michiko Watanabe.
I also wish to thank my family: Reiko, Makiko, Hideki, Kana, and Yasuhiro
Watanabe. Moreover, I am grateful for the legacy of my late father, Otojirou
Shinmura, who supported the research.