Generate bioinformatics data using Generative Adversarial Network: A Review
Sharmilan S
Informatics Institute of Technology
All content following this page was uploaded by Sharmilan S on 17 December 2017.
Abstract – Data is the most important part of machine learning. In the bioinformatics field the sensitivity of the data is high, and as a result access to the data for secondary purposes (e.g., research) involves many legal and ethical issues. Consequently, in many bioinformatics research projects collecting the data consumes more time than the development phase. Some research has tried to address the legal and ethical issues by anonymising the data using encryption, de-identification and perturbation of potentially identifiable attributes. To some extent those solutions limit data breaches, but on the other hand anonymised data does not perform well in analysis and mining tasks. Recently, generative adversarial networks (GANs) have become a research focus of artificial intelligence. The goal of GANs is to estimate the underlying distribution of real data samples and generate new samples from that distribution. Here, the researchers review GANs in bioinformatics for generating data sets, presenting examples of current research. To provide a useful and comprehensive perspective, the research is categorized both by bioinformatics data and by GAN architecture and flow. Additionally, the issues of using GANs to generate bioinformatics data sets are discussed and future research directions are suggested. The researchers believe that this review will provide valuable insights for researchers applying GANs to generate bioinformatics data sets.

I INTRODUCTION

Access to data is one of the bottlenecks in the development of machine learning solutions to domain-specific problems. The availability of standard datasets (with associated tasks) has helped to advance the capabilities of learning systems in multiple tasks. However, in the bioinformatics and medical fields it is hard to collect standard datasets in large amounts [33]. For example, in medical, defense, security and some other fields the sensitivity of the data is high, so access to the data is highly controlled.

The exponential growth of the amount of biological data available raises two problems: on one hand, efficient information storage and management and, on the other hand, the extraction of useful information from these data. The second problem requires the development of tools and methods capable of transforming all these heterogeneous data into biological knowledge about the underlying mechanism [19]. Medical data can be divided into two main types, text and image. Many researchers have worked on predicting or analysing medical issues or diseases using images [19] [33]. At the same time, many researchers have worked on prediction using patient records or electronic health records [8] [36] [33]. In that case the sensitivity and privacy of the data come into play, as most health records contain personal details.

A prediction model depends entirely on its data sets. Attribute selection is done by examining the data set. The collected data is preprocessed and used for model training and testing, so the model learns only what those data sets provide. If the gathered data is of poor quality or is class imbalanced, the trained model will be inefficient [27]. To improve this whole process, the collected data should be balanced and of good quality, and it should cover a wide range of inputs.

Modern machine learning methods based on deep neural network architectures require large amounts of training data to achieve the best possible results [25]. But due to privacy and legal issues it is not possible to access real patient data at large scale. Even when access to some data sets is obtained, the quality and distribution of the data are often poor [8] [33] [19] [36]. This paper therefore discusses the issues and difficulties in bioinformatics research in terms of access to and availability of data. It also discusses recent research that uses GANs to generate various things, including data sets, and how GANs are improving the field of bioinformatics in terms of generating data samples.

II BIOINFORMATICS

Bioinformatics is an interdisciplinary field that develops software tools and machine learning models to understand biological data. As an interdisciplinary area of science, it combines computer science and mathematics to analyse and understand biological data. Because it is a wide and complicated area, bioinformatics has several subfields. The most popular ones are sequence analysis, gene and protein expression, analysis of cellular organization, structural bioinformatics, and network and systems biology
[26]. These can be divided further into sub-areas. For example, DNA sequencing, sequence assembly, genome annotation, computational evolutionary biology and analysis of mutations in cancer are sub-areas of sequence analysis [26]. The National Center for Biotechnology Information reports that there are three main scientific applications of bioinformatics, which it categorizes as evolutionary biology, protein modeling and genome mapping [29]. As the field has improved dramatically, over the past decades the quantity and quality of biological information has skyrocketed.

As bioinformatics involves mathematics and computer programming, advances in machine learning models and artificial intelligence have greatly improved the bioinformatics field as well. Deep learning has advanced rapidly since the early 2000s and now demonstrates state-of-the-art performance in various fields, including bioinformatics [20]. The invention of GANs has also helped bioinformatics researchers to generate biomedical data and images to address some complicated areas of bioinformatics [33] [11] [19] [41]. Using machine learning and data mining models, bioinformatics has solved many complicated problems, such as predicting diseases at an early stage, calculating patient risk levels at an early stage, and even modeling and remodeling RNA and DNA [26] [28]. Researchers have developed applications and tools to detect or identify various types of cancer, brain tumors, diabetes, heart attacks, etc. [26] [29]. In conclusion, the field of bioinformatics has absorbed the advances in computing and AI and used them effectively in biology and medicine.

III BIOINFORMATICS DATA

A wide array of biomedical data is generated and made available to healthcare experts and researchers for the purposes of research. However, due to the diverse nature of medical data, it is difficult to analyze and predict outcomes [28]. For many diseases, symptoms and causes differ from region to region or even country to country. They also vary with many non-medical parameters such as climate, behavior and culture [33] [29] [43]. So data collected in one specific place and used for research may not be applicable to another place with different non-medical parameters. When a researcher tries to access medical data, many important parameters are hidden to protect patient privacy, such as date of birth, birth place, addictions and some diseases like HIV [30] [17]. Often the collected data does not cover all the classes to be predicted, or there is no data at all for some rare cases. A supervised model trained on this type of data to predict or classify will not be able to predict rare cases, or will predict them wrongly. So the efficiency and accuracy of the model go down due to the class imbalance.

A Quality and quantity of the data
Generally, a successful decision support or prediction system needs a good amount of quality data. The data can be collected as domain knowledge or as real patient data sets. The first approach is more expensive, and knowledge of good quality and in sufficient quantity must be collected [33] [32]. The second is easier to obtain, but the amount of data needed is the issue. During the big data boom, the health care industry, like other industries, came to understand the importance of data and started to collect and store it for future work. There has even been research on unifying all medical data into one central system to solve data diversity issues as well [35]. But a researcher who tries to access the data faces many legal and ethical issues, as these data are sensitive because they are collected from patients.

The amount of data required to train a supervised model is not a constant. It depends on the complexity of the problem and of the model [25] [10] [32]. Most bioinformatics research problems are complex and sensitive, so researchers need to build efficient models that provide high-accuracy predictions. Also, a model built with a nonlinear algorithm needs more data samples for training and testing than one built with a linear algorithm [25].

B Lack of data in terms of quality and quantity
Data mining and machine learning are typically associated with solving real-world problems that are characterized by a large amount of data. However, in practice, collecting large amounts of data in the medical field is infeasible. Although data mining could make important advances in this field, several challenges must be addressed [11]. Existing works that apply data mining to small datasets have shown the following challenges:
• The overfitting problem. A classification decision based on a small number of instances is susceptible to overfitting, because of the lack of samples representative of the whole data distribution [27].
• Poor classification performance caused by the small amount of data [13].
• Noise. Noisy data instances lead to unclear class boundaries and reduce overall classification accuracy.
New research and innovations are needed in data mining technology to address these challenges [11]. Class imbalance is another major problem with data, especially in the medical field. In an imbalanced data set, the class having more instances is called the major class, while the one having relatively fewer instances is called the minor class [27] [10]. Most machine learning algorithms work best when the number of instances of each class is roughly equal. When the number of instances of one class far exceeds the other, problems arise. In such situations most classifiers are
biased towards the major class and hence show very poor classification rates on minor classes. It is also possible that a classifier predicts everything as the major class and ignores the minor class, as it does not have enough evidence for the minor class [27].

IV LEGAL AND ETHICAL ISSUES

One reason behind limited access stems from the fact that EHR data are composed of personal identifiers, which in combination with potentially sensitive medical information, induce privacy concerns. As a result, access to such data for secondary purposes (e.g., research) is regulated, as well as controlled by the health care organizations that are at risk if data are misused or breached [17]. The review process by legal departments and institutional review boards can take months, with no guarantee of access [11]. This process limits timely opportunities to use data and may slow advances in biomedical knowledge and patient care. Health care organizations often aim to mitigate privacy risks through the practice of de-identification [22], typically through the perturbation of potentially identifiable attributes (e.g., dates of birth) via generalization, suppression or randomization. The data are then made available for research uses [21]. But most bioinformatics research in disease prediction or risk analysis needs a wide range of data that includes exact dates of birth and residential details. So if the data are randomized, the accuracy and efficiency may not reach a high level.

V OTHER ISSUES WHILE ACCESSING DATA

In many countries, such as Sri Lanka, there is still no large-scale EHR management [40]. For some diseases and medical cases, such as maternal cases, autism and many mental diseases, there are still no recorded data sets, only domain knowledge [36] [15] [38]. So if bioinformatics research requires building machine learning models from data, the researchers have to collect and record the data themselves, and the time allocated for research is diverted to collecting and storing the data [36] [40]. Researchers are usually not from the medical field, so their domain knowledge is limited compared with doctors and medical specialists. They may therefore miss some important attributes while building a model, as they consider only the data and the attribute variance. In conclusion, many developed countries have issues with privacy and storing patient data, while some developing countries are still figuring out ways to collect and store the data. For the latter countries there is no history of previous data, only current cases.

VI DATA BREACH AND HACKING

Nowadays the big issue facing the health care industry is data breaches, as it stores sensitive patient data including personal details. A total of 40 major health care record breaches were reported around the world in 2017 up to October [14]. Data breaches have also occurred in places where the data was shared for secondary purposes such as research. To limit privacy issues in the event of a data breach, health care organizations use information randomization and generalization techniques. However, this approach is not impregnable to attacks, such as linkage via residual information to re-identify the individuals to whom the data correspond [24]. Anonymising and sharing patient data is also a new trend in the health care industry, but anonymising the data is time consuming [24]. Researchers are also unable to predict outcomes or symptoms by region or particular place, because the residential data are anonymised.

Recently, research has been done on generating data samples to overcome these issues. In recent years, automatically generated data has advanced to the point of creating entirely fake records based on some real record samples. This type of solution can resolve the data access issues as well as the data privacy and security issues.

VII GENERATIVE ADVERSARIAL NETWORKS

GANs are neural networks that learn to create synthetic data similar to some known input data. For instance, researchers have generated convincing images from photographs of everything from bedrooms to album covers, and they display a remarkable ability to reflect higher-order semantic logic [2] [1] [37] [19] [11] [7]. The GAN was invented by Ian Goodfellow [16]. It was first introduced in 2014, and afterwards many GAN variants were introduced by researchers for different tasks [23]. The concept is simple: anyone who wants to improve a skill, especially in games, competes with an opponent better than themselves. They then analyze what went wrong and at which point, and afterwards use that knowledge to improve their skills. In the same way, in a GAN the generator network always competes with a high-accuracy discriminator, learns from its mistakes and improves its accuracy, so at some point the generator will beat the discriminator [16]. The efficiency and accuracy of the generator depend on how powerful the discriminator is, so a GAN must always have a powerful discriminator [23] [4].

A Architecture and Flow
The architecture of a GAN consists of two models: a discriminator and a generator. The task of the generator model is to learn and generate things such as images and sound; the other model is the discriminator.
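As a rough illustration of the two roles described above, the following minimal sketch shows a generator that maps random noise to a synthetic sample and a discriminator that maps a sample to a probability of being real. It is not taken from any of the reviewed works: the one-dimensional data, the affine form of `generator`, the logistic form of `discriminator` and all parameter values (`a`, `b`, `w`, `c`) are assumptions chosen only to keep the example small, and no training is performed.

```python
import math
import random


def generator(z, a=1.5, b=4.0):
    """Map a noise value z to a synthetic 1-D sample (hypothetical affine model)."""
    return a * z + b


def discriminator(x, w=1.0, c=-2.0):
    """Return the estimated probability that x is a real sample (hypothetical logistic model)."""
    return 1.0 / (1.0 + math.exp(-(w * x + c)))


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
noise = [rng.gauss(0.0, 1.0) for _ in range(5)]

# The generator turns noise into candidate samples ...
fake_samples = [generator(z) for z in noise]
# ... and the discriminator scores each one between 0 (fake) and 1 (real).
scores = [discriminator(x) for x in fake_samples]

for x, s in zip(fake_samples, scores):
    print(f"sample={x:.3f}  D(sample)={s:.3f}")
```

In a real GAN both functions would be neural networks whose parameters are updated during training; here the parameters are fixed, so the sketch only shows how data moves between the two models.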
The discriminator classifies the generated samples as real or fake based on what it has learned. The discriminator model determines whether a given image looks like a real image from the dataset or like an artificially created image. It is basically a binary classifier that can take the form of a normal convolutional neural network. Over the course of many training iterations, the weights and biases in the discriminator and the generator are trained through backpropagation. The discriminator learns to tell "real" data apart from "fake" data created by the generator. Once a generated sample is classified as fake, the generator receives that feedback from the discriminator and starts generating a new sample, and this continues until the discriminator is fooled by the generator into classifying generated samples as real. The sample structure of a GAN is given in the diagram below.

Equation (2) achieves its minimum value at the discriminator given by the equation below.

$$D_G^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \tag{3}$$

This is the best solution for the discriminator D. Based on equation 4, the discriminator of a GAN estimates the ratio of two probability densities. D(x) denotes the probability that x is real data, so when the input x comes from the real data the discriminator tries to make D(x) approach 1. At the same time, if the input comes from G(z), where G(z) denotes the generated data, the discriminator tries to make D(G(z)) approach 0 [16], while the generator G tries to make it approach 1. Since it is a minimax game between G and D, the loss function of G is ObjG(θG) = −ObjD(θD, θG). Therefore, the optimization of GAN can be formulated as a minimax problem: