By
POOJA GUPTA
(PhD18018)
NEW DELHI – 110020
SEPTEMBER, 2023
INFORMATION FUSION USING CONVOLUTIONAL TRANSFORM LEARNING
By
POOJA GUPTA
PhD18018
A Thesis
Submitted in partial fulfilment of the requirements for the degree of
Doctor of Philosophy
NEW DELHI – 110020
SEPTEMBER, 2023
Certificate
This is to certify that the thesis titled Information Fusion using Convolutional Transform Learning, submitted by Pooja Gupta (PhD18018), is a record of original work carried out under my supervision. In my opinion, the thesis has reached the standard fulfilling the requirements of the degree of Doctor of Philosophy.
The results contained in this thesis have not been submitted in part or full to any other university or institute for the award of any degree or diploma.
September, 2023
There are many real-world problems pertaining to the need for the fusion of information from multiple sources. Consider, for example, the problem of demand forecasting, which requires estimating the power consumption at a future point given the information available until the current instant. For building-level forecasting, the inputs are usually power consumption, weather (temperature, humidity), and occupancy. This is a crucial problem in smart grids, ranging from planning electricity generation to preventing non-technical losses. Likewise, many such real-world examples can be cast as multi-channel information fusion problems. Thus, we need techniques whereby this varied information from multiple sources can be combined/fused to predict value(s) that contribute significantly to future decision making.
A bounty of techniques has been proposed so far for multi-channel fusion, yet hardly any of them is an end-to-end fusion formulation. A few such solutions are based on deep learning and Statistical Machine Learning (SML) algorithms. However, existing deep learning solutions typically involve Convolutional Neural Networks (CNNs). CNNs might not guarantee distinct filters; hence, quality representations might not be obtained, which could lead to redundancy. Secondly, CNNs are supervised and therefore require large labelled datasets, which are not readily available in every domain. Lastly, SML algorithms are largely prone to overfitting, as they rely heavily on the quality of the input features. Thus, end-to-end, multi-channel, unsupervised as well as supervised Convolutional Transform Learning (CTL) based solutions are proposed that bridge all these gaps. The problems targeted lie in multiple domains, including financial, biomedical, and multiview image and text datasets.
Firstly, this dissertation proposes unsupervised multi-channel fusion solutions to two problems in the financial domain - stock trading (trend prediction/classification) and stock forecasting (price prediction/regression), both of which involve time-series data. The proposed approach preserves the true univariate nature of the time-series instead of treating it as a 2D matrix/image, as other frameworks do. Also, the given solution is highly efficient: a single framework is trained, and the obtained features can be utilized for both classification and regression tasks. The latter benefit cannot be achieved with CNNs.
Secondly, multiple information fusion problems are solved by supervised frameworks based on CTL and deep learning paradigms. Specifically, one of the frameworks caters to the problem of stock trading; it eliminates the issue of dead ReLU and guarantees more diverse representations, helping obtain better performance than state-of-the-art techniques. The latter has been validated via a fair comparison with CNN, which the proposed method supersedes. Next, an information fusion solution is given that is a supervised, jointly trained and optimized approach based on CTL and Decision Forest (DF) for predicting Drug-Drug Interactions that could lead to Adverse Drug Reactions (ADRs), instead of utilizing the two components in a piecemeal fashion.
Lastly, this thesis contributes to solving the multiview clustering fusion problem while handling the challenge of data-constrained scenarios. It involves multiview datasets from the image and text categories. A joint optimization of Deep CTL (DCTL) and K-Means clustering is proposed. It avoids the piecemeal approach and learns representations from the clustering perspective with the help of the K-Means clustering loss.
Dedication
Acknowledgements
I owe a big thanks to my advisor Prof. Angshul Majumdar, who is behind this day when I can call myself a researcher and can prepend Dr. to my name. His constant guidance, support and friendly behaviour have helped me sail through my journey of research. He has always provided an outstanding environment of research and infrastructure, and given his time and effort for brainstorming discussions. Prof. Angshul has always lent his ears to my problems in tough times and helped me to overcome them with his words of encouragement and motivation. I am truly blessed to have him as a supervisor, a mentor and a friend.
I would like to thank the Indraprastha Institute of Information Technology Delhi, which gave me the opportunity to be part of it as a researcher. The institute has also provided excellent infrastructure and proactively solved any issues in accessing necessary resources; I especially thank Mr. Adarsh Kumar Agarwal from the IT helpdesk, among others. I also want to thank my Internal Committee members Dr. Pushpendra Singh and Dr. Sanat Biswas, and collaborators Dr. Emilie Chouzenoux, Dr. Giovanni Chierchia and Dr. Ronita Bardhan, for providing their insightful comments in the research assessments and collaboration works respectively.
Next, I want to thank my Grandparents, Late Shri. Ram Nath Gupta and Late
Smt. Shanti Devi Gupta for continuously showering blessings. I am thankful
to my parents (Dr. Satish Chandra Gupta and Mrs. Sunita Gupta) for working
hard to provide me with the privilege of having a good life and attaining this
prestigious degree. I am also grateful to my in-laws (Mr. Ashok Kumar Gupta
and Mrs. Manju Gupta) for understanding my ambitions and supporting me
post-marriage during my journey. I especially want to thank my husband, Ankur,
for his care, patience, encouragement, and unwavering support throughout this
journey. My daughter Vedanshi, born recently, has been an integral part of this
journey, for she has been my lucky charm. She has always made me feel alive in
the time of melancholy, for which I feel really blessed. I am also grateful to my
brother Rishabh and brother-in-law Akshay Kumar Gupta for their never-ending
love, motivation and support.
I also thank my labmates - Jyoti Maggu, Aanchal Mongia, Priyadarshini Rai,
Shalini Sharma, Anurag Goel, Shikha Singh, Megha Gupta Gaur, Vanika Singhal
and Kriti Gupta for their companionship and never-ending tea-time stories. I
humbly acknowledge their help and support. I also present my sincere gratitude
to my friends from other labs Anand Singh, Gunjan Singh, Dhananjay Kimothi,
Charul Paliwal, Saurabh Aggarwal, Neetesh Pandey, Ashwini Teertha, Smriti
Chawla and Sarita for always supporting me. Last but not least, I thank
my all time friends Naina Gupta, Sonal Goel, Shalini Sheoran, Pragati Sharma,
Pradyumn Nand, Mona Nandwani, Love Chopra, Prachi Luthra Chopra, Rachit
Rakhyani and Bani Rakhyani for standing by my side always.
(POOJA GUPTA)
Contents
Abstract i
Dedication iii
Acknowledgements iv
List of Tables x
1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Probabilistic approach . . . . . . . . . . . . . . . . . . 4
1.2.2 Machine Learning-based Frameworks . . . . . . . . . . 6
1.2.3 Fuzzy based systems . . . . . . . . . . . . . . . . . . . 8
1.2.4 Deep Learning based fusion approaches . . . . . . . . . 10
1.3 Datasets Descriptions . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 National Stock Exchange (NSE) dataset . . . . . . . . . 13
1.3.2 Past 22 years stock data . . . . . . . . . . . . . . . . . 13
1.3.3 Drug-Drug Interaction Data . . . . . . . . . . . . . . . 14
1.3.4 Multi-view datasets . . . . . . . . . . . . . . . . . . . . . 16
1.4 Research Contributions . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
List of Abbreviations 1
3.1 SuperDeConFuse: A supervised deep convolutional transform
based fusion framework for financial trading systems . . . . . . 54
3.1.1 Literature Review - Stock Trading . . . . . . . . . . . . 54
3.1.2 Proposed Formulation . . . . . . . . . . . . . . . . . . 56
3.1.3 Optimization algorithm . . . . . . . . . . . . . . . . . . 59
3.1.4 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 60
3.1.5 Experimental Evaluation . . . . . . . . . . . . . . . . . 62
3.1.6 Results and Analysis . . . . . . . . . . . . . . . . . . . 66
3.2 DeConDFFuse: Predicting Drug-Drug Interaction using joint
Deep Convolutional Transform Learning and Decision Forest
fusion framework . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.2.1 Literature Review - DDI . . . . . . . . . . . . . . . . . 85
3.2.2 Proposed Formulation . . . . . . . . . . . . . . . . . . 88
3.2.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . 97
3.2.4 Results and Analysis . . . . . . . . . . . . . . . . . . . 100
3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5 Conclusion 129
5.1 Summary of Contribution . . . . . . . . . . . . . . . . . . . . . 129
5.1.1 Unsupervised multi-channel CTL based fusion frame-
works - ConFuse(shallow) and DeConFuse(Deep) . . . . 129
5.1.2 Supervised multi-channel fusion frameworks - SuperDe-
ConFuse and DeConDFFuse . . . . . . . . . . . . . . . 130
5.1.3 Multiview Clustering Framework based on CTL - De-
ConFCluster . . . . . . . . . . . . . . . . . . . . . . . 131
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
References 135
List of Tables
3.6 Summary of Financial Results for Stock Trading . . . . . . . 76
3.7 Ablation Study performance for BUY Class . . . . . . . . . . 79
3.8 Ablation Study performance for HOLD Class . . . . . . . . . 80
3.9 Ablation Study performance for SELL Class . . . . . . . . . 80
3.10 Ablation Study performance weighted results . . . . . . . . . 80
3.11 Ablation Study Financial Results . . . . . . . . . . . . . . . . 81
3.12 Comparative Summary Results for Stock Trading for window sizes 5, 10, 20 . . . . . . . . . . . . . . . . . . . . . . . . 82
3.13 DDI Prediction DeConDFFuse Architecture Details . . . . . . . 99
3.14 DDI Prediction Results . . . . . . . . . . . . . . . . . . . . . 101
3.15 Comparative Results with DeConDFFuse and Piecemeal approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
List of Figures
3.4 Visualization of channel-wise features Xc for SDCF versus a
standard CNN for one sample of stock BSELINFRA.BO (with
16 × 1 as the shape of the features obtained and resized to 8 × 2
for better visualization) . . . . . . . . . . . . . . . . . . . . . . 77
3.5 Evolution of the loss during training for a few stock examples of
the proposed model with (a) CTL 1 layer, (b) CTL 2 layers, (c)
CTL 3 layers and (d) CTL 4 layers. . . . . . . . . . . . . . . . . 83
3.6 Each node n ∈ N of the tree performs routing decisions via function dₙ(·). The black path shows an exemplary routing of a sample x along a tree to reach leaf ℓ₄, which has probability µ_{ℓ₄} = d₁(x) d̄₂(x) d̄₅(x). Image taken from [1]. . . . . . . . . . . 91
3.7 Illustration of how to implement a deep neural decision forest (DNDF). Top: Deep CNN with a variable number of layers, subsumed via parameters θ. FC block: Fully Connected layer used to provide functions fn(·; θ), described in Eq. 3.8.
Each output of fn is brought in correspondence with a split
node in a tree, eventually producing the routing (split) decisions
dn (x) = σ(fn (x)). The order of the assignments of output units
to decision nodes can be arbitrary (the one shown allows a sim-
ple visualization). The circles at the bottom correspond to leaf
nodes, holding probability distributions πℓ . Image taken from [1]. 93
3.8 DDI prediction using combined DeConFuse and decision for-
est architecture- DeConDFFuse. Here C = 2, the number of
networks/channels via each of which a drug in the drug pair
is passed along with its bioactivity descriptors/ features vector,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.9 Confusion matrices for different benchmarks and the proposed
method- DeConDFFuse . . . . . . . . . . . . . . . . . . . . . . 102
3.10 Loss plot with the proposed method - DeConDFFuse. . . . . . . 105
3.11 Confusion matrices for the proposed method - DeConDFFuse
and Piecemeal approach . . . . . . . . . . . . . . . . . . . . . 106
4.2 Overview of the proposed DeConFCluster architecture. C rep-
resents the number of DeepCTL networks/channels, L is the
number of DCTL layers, Mℓc is the filter size and Fℓc is the
number of filters of the respective layer ℓ and channel c. . . . . . 119
4.3 Loss Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.4 Ablation Studies Result Plots on λ, µ . . . . . . . . . . . . . . . 125
4.5 Ablation Studies Result Plots on K-Means Regularizer . . . . . 126
Chapter 1
Introduction
Information Fusion (IF) refers to integrating information from multiple sources, which may be diverse and sometimes conflicting as well. This integration produces specific and comprehensive unified information about an entity.
Let us consider the demand forecasting problem that requires estimating the power consumption at a future point given the information available until the current instant. Usually, in this respect, the inputs at the building level are power consumption, weather (temperature, humidity), occupancy etc. It is pertinent to solve this problem as it is a crucial aspect in smart grids, ranging from planning electricity generation to preventing non-technical losses.
Consider another example, the problem of blood pressure estimation. The inputs are usually signals from sensors such as the photoplethysmogram (PPG) [3], and the goal is to estimate the systolic and diastolic pressures. Other applications fuse inputs such as weather, traffic and power consumption, etc. In the same domain, the work [6] deals with the problem of forecasting taxi demand in event areas. It is done by fusing publicly available data from multiple sources.
Image fusion is another area where the information from two or more images is combined into a single, more informative image. It is widely applied in medical imaging. For example, to improve the functional and spatial information content, Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images are fused using intensity-based methods. Another example is multi-sensor data fusion applied in the medical domain, which uses fused video displays and related modalities.
Similarly, there are other domains that we will not discuss at length but briefly mention where IF plays its role: opinion mining based on sentiment analysis [9], stock price prediction [10], drug-drug interaction [11], human activity recognition, etc. This thesis focuses on frameworks that learn better representations for solving problems in the analysis of such multi-source data. The tasks under the supervised category are regression and classification, and clustering under the unsupervised category.
1.2 Background

Various kinds of solutions have been proposed for solving the problems in IF, from probabilistic methods and statistical machine learning to deep learning. We will briefly discuss these categories below.

1.2.1 Probabilistic approach

Probabilistic approaches have been widely adopted for fusion in Intelligent Transportation Systems (ITS). According to one of the fusion-based ITS surveys [14], most of the studies had used probabilistic techniques (81 out of 135 articles studied in the survey were based on probabilistic fusion). It is worth mentioning that these include the Kalman Filter (KF) algorithm and its variations, e.g., Extended Kalman Filter (EKF) [15–21], Sequential Kalman Filter (SKF) [22] etc. The kinds of applications under ITS covered here concerning KF are car or vehicle positioning [15, 22] in a smart city, vehicle localization [16, 19, 20], moving object detection and tracking [18], navigation [21] etc.
Opinion Mining (OM) is another area of application where IF finds its scope and where probabilistic models are applied [23, 24]. The study in [25] presents an "Enterprise IF" framework that fuses information relevant to an enterprise's business. The latter includes client feedback and any noteworthy news about events that could affect it. Also, it sometimes involves the corporate's own data for analysis. Thus, such a framework depends on multiple sources of information - news sourced from platforms like Twitter, feedback sourced from customers, and feeds from specific blogs. For this purpose, they use a "blackboard architecture" of processing nodes. The study's authors observed a dip in sales of a given product after higher negative feedback. They stated that even though their analysis was ex-post, the unstructured data mining synchronized with sales data could have provided insights to perform better marketing campaigns and find a better market niche.
The Internet of Things (IoT) is also an area where IF finds its application. Consider a smart home, i.e., where data is collected through different sensors installed in the home. One such problem that can be solved is knowing about a person's occupancy while preserving privacy. In the study [27], sensor data, including temperature, humidity, light and CO2, are used to detect the occupancy in a room. The authors fuse the sensor evidence and threshold the resulting PMA value for a final decision [27]. However, Dempster-Shafer's theory poses its own difficulties: one must estimate the probability density function and define a priori probabilities. Also, it becomes hard when dealing with complex data.
1.2.2 Machine Learning-based Frameworks

Traditional machine learning algorithms have also been extensively used to solve many IF-based problems. In the study [29], the task was to classify incorrect driving behavior using multiple inputs, including the driver's driving operation behavior, steering wheel angle, brake force, and throttle position. Also, it considers road conditions and then classifies using these inputs via the Adaboost algorithm. Another work fuses inputs at two levels - the feature and score fusion levels - through the Naive Bayes algorithm. One of the applications of ITS and IoT is the vacant parking spot detection problem in urban environments. In view of the same, the work in [31] employs machine learning based fusion.
Sentiment classification is more challenging than document topic classification, as the latter has specific keywords that do not require context/emotion understanding. Studies exist where fusion is performed via Naive Bayes, Maximum Entropy Classifier and SVM, where SVM superseded the other two [32]. In the healthcare sector, one study deals with diagnosing stroke patients and classifying the stroke as ischemic stroke or hemorrhagic stroke, using features such as wavelet packet energy, fuzzy entropy and hierarchical theory. Further, SVM, Decision Tree (DT) and Random Decision Forest (RDF) are used as the stroke signal classification models.
Fault detection in motors also requires the fusion of information. In this regard, Banerjee et al. [34] proposed a hybrid method for fault detection based on multi-sensor data fusion with SVM, Short Term Fourier Transform (STFT) and a time-domain analysis. Another work addresses the high false alarm rate of the conventional Adaboost License Plate (LP) detector [35]; the latter is enhanced via a color-checking module and an SVM detector that checks the image patch for an LP. Another domain of IF application is finance, specifically the stock market. One such task is to predict stock price movement. In study [36], the authors gathered historical stock market data and derived technical indicators to build a rich knowledge base. Using this data, the authors generated features and utilized three ML models - DT, SVM, and Artificial Neural Networks (ANNs) - for stock price movement prediction.
We can see that many domains have used IF based on traditional ML algorithms. However, these algorithms have limited non-linear mapping and fitting capability. Also, they are highly dependent on the quality of features and hence may not be able to build a good relation between the inputs and the outputs.
1.2.3 Fuzzy based systems

IF solutions are also based on fuzzy logic. According to one of the surveys [37], image fusion and fuzzy-based intelligent health and medical systems account for a large share of this research; in particular, fuzzy-based image fusion has emerged as a hot topic of research. For the same, the work in [38] observes that traditional weighted fuzzy logic is not adapted to raw data due to invalid data readings. Improving upon the same, the work in [39] uses K-Means clustering in addition to fuzzy logic.
Another fuzzy-based fusion approach at the feature level is adopted for intrusion detection in [40], which overcomes the data imperfection challenge of IF. Under ITS, for high-speed heavy vehicles, a Global Positioning System (GPS) based navigation method is developed by the authors in [41]. The work used fuzzy logic to fuse the GPS and odometric sensors. Next, under the same category of ITS, to avoid congestion, a fusion framework combines the Inertial Navigation System (INS) and the GPS [42]; it uses the Extended KF (EKF) and an Input-Delayed observer. Another application is to monitor elderly people in their homes and detect if they fall. To detect the same, the study in [43] proposed a data fusion approach based on fuzzy logic with a set of rules directed by medical knowledge. Fuzzy fusion has also been used in the financial domain, due to the complex nature of the data: the study in [44] developed a framework to predict daily stock price movements, where the authors deployed and fused multiple predictors. Thus, fusion solutions in several domains are based on fuzzy logic. Nevertheless, the challenge with fuzzy systems is designing the right set of rules, which can be hard at times.
1.2.4 Deep Learning based fusion approaches
Deep Learning (DL) has been widely used for analyzing multi-channel/multi-sensor signals. It facilitates the automated learning of features, versus the hand-crafted features required by traditional machine learning algorithms; thus, it saves the human effort of the latter task. Also, it can learn the complex mappings between the input and output variables.
In many DL studies, all the sensors are stacked one after the other to form a matrix, and a 2-D CNN is used to analyze the sensor signals. For example, in the study [12], the authors use the previously mentioned framework with input from multiple sensors. A shortcoming of the study [12] is that temporal modeling is not included. This shortcoming is overcome in [45], where a 2-D CNN is used on a time-series window. These windows are processed by a GRU in the final step, and hence time-series modeling is achieved. In other works, the fusion happened at the feature level versus the raw signal level like in [12, 45].
Traffic flow prediction is also a use case of information fusion, as studied via the fusion of multiple traffic data sources. Researchers have likewise used CNNs for object detection from moving vehicle camera images [48]. The DL and IF combination has also been applied in detecting anomaly-based intrusion, where the authors applied five different fusion rules to verify system effectiveness [49].
Fusion has also been observed in solving problems pertaining to the biomedical domain. In one such work, the authors employed the combination of CNNs and LSTMs for enriching features; they utilized morphological and temporal information from ECG. The authors in another work [51] provided the best adaptation for patients with irregular astigmatism, taking images as input. Another task that requires IF is studied in [52]; it does not take audio data alone as input for the task, but proposes models with different levels of early and late fusion. There are studies where multi-channel image dataset fusion has also been investigated. In [53], a fusion scheme is proposed for processing color and depth information (via 3-D and 2-D representations). In another work, the authors have fused hyperspectral data (high spatial resolution) with Lidar data; an improvement in analysis tasks was observed with the help of the fusion of deeply learned features (from CNN) with handcrafted features via a fully connected layer.
However, one issue with CNNs is that these are primarily supervised and require large labeled datasets. But the labeled datasets are only present in abundance for a few domains. Moreover, CNN-based learning involves learning of filters, yet CNNs may not guarantee distinct filters, as the loss functions involved generally do not impose any distinctiveness constraint. This has even been shown via experiments discussed later in chapters 2 and 3.
This thesis proposes solutions based on CTL for three types of problems; the tasks dealt with are regression, classification and multiview clustering. While proposing solutions, we had the chance to explore and apply them to multiple domains, which made us realize that the solutions are generic enough to be applied to fusion problems other than those utilized in this thesis.

1.3 Datasets Descriptions

The different datasets used in the problems presented in this thesis are described below.

1.3.1 National Stock Exchange (NSE) dataset
This is a real dataset from India's National Stock Exchange (NSE). The dataset contains information on 150 symbols between 2014 and 2018; these stocks were chosen after filtering out stocks with less than three years of data. The companies available in the dataset are from various sectors such as IT (e.g., TCS, INFY), banking (e.g., ICICIBANK), coal and petroleum (e.g., OIL, ONGC), steel (e.g., JSWSTEEL), power (e.g., POWERGRID, GAIL), etc. There are two signals for each sample in the dataset - BUY and SELL: the former indicates whether to buy the stock and the latter whether to sell it.
1.3.2 Past 22 years stock data

This dataset consists of 15 Indian stocks that fall under the NSE and the Bombay Stock Exchange (BSE), taken from publicly available Yahoo finance symbols data. The stock symbols ending with .NS fall under NSE and those with .BO under BSE. The data comprises day-wise readings for the past 22 years, i.e., from 1998 to 2019. It was collected internally using an in-built Python module that queries the Web through the Yahoo API end-point. At the time of data collection, the year 2019 was still ongoing; hence, the data was only partially available for 2019. Also, there were some missing values for some raw features. Thus, the data for 2019 has not been used in the experiments, for simplicity. The dataset includes stocks from multiple sectors, such as Indian consumer goods, banking and energy. Each day is labeled as BUY, HOLD or SELL, represented numerically by 0, 1 and 2. BUY and SELL have the same roles as explained in section 1.3.1, and HOLD signifies that we do nothing on a given day, i.e., we keep the stock with us and neither buy nor sell that symbol.
1.3.3 Drug-Drug Interaction Data

The DDI data is from Stanford's BioSNAP dataset, which contains a network of drug-drug interactions approved by the U.S. Food and Drug Administration. It is assumed that all other interactions are absent; known interactions are marked by 1 and the others by 0. The SMILES values of the drugs are first determined using compound IDs taken from the dataset via DrugBank.ca. Since the SMILES values are not available for all the drugs (retrieved using DrugBank IDs), the number of drugs in the dataset got reduced to 1368 and, accordingly, the number of interactions. Further, only drugs that have at least 10 known-to-interact interactions with other drugs have been retained. So, there are finally 1059 drugs and their interactions.
Thereafter, the bioactivity descriptors of each drug are extracted via the Signaturizer tool [58] using the determined SMILES value. This tool provides bioactivity descriptors for small-molecule drugs, covering all the drugs present in the Chemical Checker (CC). The latter further covers the source databases DrugBank.ca and ChEMBL. It has a pre-trained Siamese Neural Network via which, given the SMILES value of a drug, 25 different types of bioactivity descriptors, each of size 128, can be inferred. There are broadly five categories of bioactivity descriptors, each with five sub-categories (e.g., A1 to A5), thus 25 different descriptor types in total. Since three of these 25 bioactivity descriptor types are used, each a fixed-size vector of length 128, each drug has 384 (128 × 3) features. Thus, the final dataset comprises 1059 unique drugs with 384 bioactivity descriptors/features each.
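As a minimal illustration of this feature assembly (the arrays below are placeholders, not the Signaturizer API), the 384-dimensional per-drug vector is simply the concatenation of the three 128-dimensional descriptors:

import numpy as np

n_drugs = 1059
# Placeholder descriptor matrices standing in for three Signaturizer
# outputs, one per chosen bioactivity descriptor type, each (n_drugs, 128).
desc_a = np.random.randn(n_drugs, 128)
desc_b = np.random.randn(n_drugs, 128)
desc_c = np.random.randn(n_drugs, 128)

features = np.concatenate([desc_a, desc_b, desc_c], axis=1)
assert features.shape == (n_drugs, 384)   # 128 x 3 features per drug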
1.3.4 Multi-view datasets

Here the proposed approach was tested on the various multiview clustering (MVC) datasets listed below:

• 100Leaves: This dataset contains 100 plant species with 16 samples per species. Thus, there are 100 clusters and 1600 total samples. For each sample, a shape descriptor, fine-scale margin and texture histogram are given [59].

• ALOI: The Amsterdam Library of Object Images (ALOI) dataset consists of 11025 images of 100 small objects. Every image is represented using four features, namely - color similarity, HSV, RGB, and Haralick features [60].

• Mfeat: The Mfeat dataset is from the UCI repository and contains 2000 samples of handwritten digits described by six feature sets.

• WebKB: It consists of 203 web pages with four classes, collected from university computer science department websites.

The complete statistics of all the datasets mentioned above can be referred from Table 1.1.
Table 1.1: Statistics of the considered MVC datasets

Dataset     #Samples   #Classes   #Views
100Leaves   1600       100        3
ALOI        11025      100        4
Mfeat       2000       10         6
WebKB       203        4          3
1.4 Research Contributions

This thesis has three main objectives: 1. To propose more accurate algorithms for multi-channel information fusion; 2. To propose unsupervised alternatives to deep learning frameworks that are largely supervised, thus eliminating the need for large labeled datasets; and 3. To propose methods that ensure that the learned filters are distinct and hence non-redundant.
First, unsupervised fusion frameworks are proposed based on Convolutional Transform Learning (CTL). The excellent learning ability of convolutional filters for data analysis is well acknowledged, and CTL learns such filters in an unsupervised manner [56, 67]. Therefore, the proposed framework is (i) a deep version of CTL; (ii) a fusion formulation on top of the proposed CTL representation; (iii) one whose learned filters are distinct, and hence more interpretable representations are obtained. The proposed techniques, ConFuse [68] and DeConFuse [69], have been applied to the problems of stock forecasting and trading. Comparison with state-of-the-art methods (based on CNN and LSTM network) shows the superiority of our approaches for performing reliable feature extraction.
Second, two supervised frameworks have been proposed that are based on DCTL - SuperDeConFuse and DeConDFFuse. The former offers all the benefits of the CTL approach discussed previously. Moreover, its design is such that it facilitated the removal of the non-linear activation located between the convolutional layers and the fully-connected layer, as well as the one located between the latter and the output layer, thus handling the problem of dead neurons. This was achieved via non-negativity regularization on the aforementioned layer outputs and filters during the training phase. Further, this technique has been applied to the problem of stock forecasting and trading. The second framework, DeConDFFuse, predicts drug-drug interactions by jointly training multi-channel DCTL based networks and a Decision Forest (DF), rather than following a piecemeal approach.
Lastly, a multiview clustering framework based on CTL is proposed that takes multiview data as input, namely DeConFCluster [71]. The framework jointly trains DCTL networks and the K-Means clustering module; thus, the representations are distinct and more effective, as these are also guided by the K-Means clustering loss, and the framework outperforms the state-of-the-arts.
For quick reference, a summary of all the proposed models is given in Table 1.2, and each one's advantages and disadvantages are discussed in Table 1.3. More details of each of these models are discussed in subsequent chapters.
1.5 Acronyms
Let us introduce here all the acronyms used in the following chapters, for quick reference:
AR Annualized Returns
CC Chemical Checker
CE Cross Entropy
DCDF DeConDFFuse
DF Decision Forest
DL Deep Learning
DT Decision Tree
ECG Electrocardiogram
EEG Electroencephalogram
EM Expectation Maximization
GARCH Generalized Autoregressive Conditional Heteroskedasticity
IF Information Fusion
IT Information Technology
KF Kalman Filter
KG Knowledge Graphs
LP License Plate
MFNN Multi-Filters Neural Networks
ML Machine Learning
OM Opinion Mining
SDCF SuperDeConFuse
SELU Scaled Exponential Linear Unit
TA Technical Analysis
TL Transform Learning
Chapter 2
Unsupervised Multi-channel CTL based Fusion Frameworks - ConFuse and DeConFuse

Deep Learning (DL) paradigms currently solve several problems. Most of the frameworks in DL are based on CNNs, which are largely supervised. For supervised learning, labeled data are needed in abundance, which is in dearth for some domains. Moreover, CNNs may not learn distinct filters; this has been checked experimentally, and its details can be referred to in this chapter and in Chapter 3 later. Additionally, it has been observed that the problems concerning time-series data in stock forecasting treat the data as a 2-D image matrix versus univariate data, which is the true nature of a time-series. We will learn about the said issue in more detail in subsequent sections. Thus, there is a need for a framework that addresses these gaps.
The excellent learning ability of convolutional filters for data analysis is well acknowledged [61–66]. The framework proposed in this chapter (i) provides a shallow and a deep version of the CTL approach; (ii) has an unsupervised fusion formulation, where channel-wise features are learned via CTL and fused via TL; (iii) rests on a mathematically sound optimization strategy for performing the learning task; and (iv) learns distinct filters that consequently yield more interpretable representations. The proposed frameworks are named ConFuse (shallow) and DeConFuse (deep) and are applied to the problems of stock forecasting and trading. Comparison with state-of-the-art methods (based on CNN and LSTM network) shows the superiority of our approaches for performing reliable feature extraction.
This chapter is organized into sections, with section 2.1 discussing the related work and section 2.2 the proposed algorithms. The experimental evaluations and results are discussed in sections 2.3 and 2.4 respectively, followed by the discussion in section 2.5.
2.1 Related Work

Let us briefly review and discuss CNN-based methods for time-series analysis; for a more detailed review, the interested reader can peruse [72]. In this section, the main focus is on studies about stock forecasting, as it is the use case for experimental validation.
2.1.1 CNN-based time-series analysis

The traditional choice for processing time series with a neural network is the recurrent architecture, for which Long Short-Term Memory (LSTM) [73] and Gated Recurrent Unit (GRU) [74] networks have been proposed. However, due to the complexity of training such networks via backpropagation through time, these have been progressively replaced with 1D CNNs [75]. For example, in [76], a generic time-series analysis framework was built based on LSTM, with performance assessed on the UCR time-series classification datasets [77]. A later study from the same group [78], based on 1D CNN, showed considerable improvement over the prior model on the same datasets.
Many studies convert 1D time-series data into matrix form in order to use 2D CNNs: the time-series is stacked within a given time window, and the resulting matrix is processed as an image. The 2D CNN model has been prevalent in stock forecasting. In [81], the said techniques have been used on stock prices for forecasting. A slightly different input is used in [82]: instead of using the standard stock variables (open, close, high, low and NAV), it uses high-frequency data for forecasting major market movements. Elsewhere, a 2D CNN has been used for modeling Exchange Traded Funds (ETFs). It has been seen that the 2D CNN model performs the same as LSTM or the standard multi-layer perceptron. Self-Supervised Learning (SSL) based models are also emerging currently for cases when no labels for the data are available: the unsupervised problem is turned supervised by predicting pseudo labels, and then training happens. There are a few works that utilize and propose solutions based on SSL for stock trading prediction [86–89]. However, such techniques are resource-intense and, just like CNNs, these SSL based learning paradigms do not have distinctiveness guarantees.
2.1.2 Convolutional Transform Learning

CTL was introduced in the seminal paper [56]. Since the proposed framework is based on the said recent work, it is presented in detail to make this chapter self-contained. CTL learns a set of filters $(t_m)_{1 \le m \le M}$ operated on observed samples $(s^{(k)})_{1 \le k \le K}$ to generate a set of features $(x_m^{(k)})_{1 \le m \le M,\, 1 \le k \le K}$. Formally, the inherent learning model is expressed through the convolutions $t_m * s^{(k)} \approx x_m^{(k)}$. A penalty $\psi$ was imposed on the features for improving representation ability and limiting overfitting, keeping the features in the same line as CNN models. Training then consisted of learning the filters and the features by solving:

$$\underset{(t_m)_m,\,(x_m^{(k)})_{m,k}}{\text{minimize}}\;\; \frac{1}{2}\sum_{k=1}^{K}\sum_{m=1}^{M}\Big(\big\|t_m * s^{(k)} - x_m^{(k)}\big\|_2^2 + \psi\big(x_m^{(k)}\big)\Big) + \mu\sum_{m=1}^{M}\|t_m\|_2^2 - \lambda\log\det\big([t_1|\dots|t_M]\big). \tag{2.2}$$
The regularization term "$\mu\|\cdot\|_F^2 - \lambda\log\det$" ensured that the learned filters were distinct, which was not guaranteed otherwise. Problem (2.2) can be rewritten in matrix form as

$$\underset{T,\,X}{\text{minimize}}\;\; \frac{1}{2}\|T * S - X\|_F^2 + \Psi(X) + \mu\|T\|_F^2 - \lambda\log\det(T), \tag{2.3}$$

where $T = [t_1|\dots|t_M]$, $S = [s^{(1)}|\dots|s^{(K)}]^\top$, and $X = \big([x_1^{(k)}|\dots|x_M^{(k)}]\big)_{1 \le k \le K}$.
The cost function in Problem (2.2) could thus be compactly rewritten as¹

$$F(T, X) = \frac{1}{2}\|T * S - X\|_F^2 + \Psi(X) + \mu\|T\|_F^2 - \lambda\log\det(T), \tag{2.4}$$

which is amenable to alternating minimization in the variables $T$ and $X$. More precisely, set a Hilbert space $(\mathcal{H}, \|\cdot\|)$, and define the function $\varphi : \mathcal{H} \to\, ]-\infty, +\infty]$; its proximity operator at $\tilde{x}$ is

$$\operatorname{prox}_{\varphi}(\tilde{x}) = \arg\min_{x \in \mathcal{H}}\; \varphi(x) + \frac{1}{2}\|x - \tilde{x}\|^2. \tag{2.5}$$
The filters and features were then updated via the alternating proximal algorithm:

For $n = 0, 1, \dots$
$$\begin{aligned} T^{[n+1]} &= \operatorname{prox}_{\gamma_1 F(\cdot,\, X^{[n]})}\big(T^{[n]}\big), \\ X^{[n+1]} &= \operatorname{prox}_{\gamma_2 F(T^{[n+1]},\, \cdot)}\big(X^{[n]}\big), \end{aligned} \tag{2.6}$$

with initializations $T^{[0]}, X^{[0]}$ and $\gamma_1, \gamma_2$ positive constants. For more details on the derivations and the convergence guarantees, the readers can refer to [56].
¹Note that T is not necessarily a square matrix. By abuse of notation, the "log-det" of a rectangular matrix was defined through its singular values.
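To make this scheme concrete, the following is a minimal PyTorch sketch in the spirit of (2.2)-(2.6), simplified to gradient steps plus the prox of Ψ (taken here as the non-negativity indicator, whose prox is a clamp at zero); all sizes, step sizes and the use of automatic differentiation are illustrative assumptions, not the thesis's implementation:

import torch

torch.manual_seed(0)

# Toy CTL setup: K univariate samples of length N, M filters of size F.
K, N, M, F = 32, 64, 8, 5
S = torch.randn(K, 1, N)                        # samples s^(k)
T = torch.randn(M, 1, F, requires_grad=True)    # filters t_m
X = torch.zeros(K, M, N, requires_grad=True)    # features x_m^(k)
mu, lam, gamma = 1e-2, 1e-4, 1e-2

def cost(T, X):
    conv = torch.nn.functional.conv1d(S, T, padding=F // 2)  # t_m * s^(k)
    fit = 0.5 * ((conv - X) ** 2).sum()                      # data-fidelity term
    Tm = T.reshape(M, F)
    # "log det" of the rectangular filter matrix via its singular values
    logdet = torch.log(torch.linalg.svdvals(Tm)).sum()
    return fit + mu * (Tm ** 2).sum() - lam * logdet

for _ in range(200):
    loss = cost(T, X)
    loss.backward()
    with torch.no_grad():
        T -= gamma * T.grad              # gradient step on the filters
        X -= gamma * X.grad
        X.clamp_(min=0)                  # prox of Psi = non-negativity
    T.grad.zero_()
    X.grad.zero_()

The $-\lambda\log\det$ term penalizes small singular values of the filter matrix, which is what keeps the learned filters from collapsing onto each other.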
2.1.3 Updates of T

The detailed expressions of the updates of T and X in (2.6) can be found in [56].

2.2 Proposed Approach

2.2.1 ConFuse

The first step of the proposed ConFuse framework was to learn, for each channel $c \in \{1, \dots, C\}$, a distinct set of convolutional filters $(T^{(c)})_{1 \le c \le C}$ and associated features $(X^{(c)})_{1 \le c \le C}$, by solving a CTL-based formulation:

$$\underset{T^{(c)},\,X^{(c)}}{\text{minimize}}\;\; \frac{1}{2}\sum_{k=1}^{K}\Big(\big\|S_k^{(c)} T^{(c)} - X_k^{(c)}\big\|_F^2 + \Psi\big(X_k^{(c)}\big)\Big) + \mu\big\|T^{(c)}\big\|_F^2 - \lambda\log\det\big(T^{(c)}\big). \tag{2.7}$$

Then, the learned channel-wise features were stacked as $X_k = \big[X_k^{(1)\top}\big|\dots\big|X_k^{(C)\top}\big]^\top$ and fused through a transform-learning based fully connected layer:

$$\underset{\widetilde{T},\,Z}{\text{minimize}}\;\; \frac{1}{2}\sum_{k=1}^{K}\big\|\widetilde{T} X_k - Z_k\big\|_F^2 + \iota_+(Z) + \mu\big\|\widetilde{T}\big\|_F^2 - \lambda\log\det\big(\widetilde{T}\big), \tag{2.8}$$
where $\widetilde{T}$ denotes the fusion-stage transform (not assumed to be convolutional), $Z$ is the row-wise concatenation of the fusion-stage features $(Z_k)_{1 \le k \le K}$, and $\iota_+$ is the indicator function of the positive orthant, equal to zero if all the entries of $Z$ are non-negative and $+\infty$ otherwise.
However, the disjoint resolution of Problems (2.7) and (2.8) might lead to a suboptimal fusion. Instead, a collaborative strategy was proposed where all the variables are learned in an end-to-end fashion, by expressing the features as a function of the filters:

$$\widehat{X}_k(T) = \Big[\widehat{X}_k^{(c)}(T)\Big]_{1 \le c \le C} = \Big[\Phi\big(S_k^{(c)} T^{(c)}\big)\Big]_{1 \le c \le C}, \tag{2.9}$$

with $\Phi$ the proximity operator of $\Psi$ [94]. For example, if $\Psi$ was the indicator function of the positive orthant, then $\Phi$ identified with the famous rectified linear unit (ReLU) activation function. Many other examples are provided in [94].
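As a quick numerical check of this prox/activation correspondence (generic PyTorch, not code from the thesis):

import torch

x = torch.linspace(-2, 2, 9)
prox = torch.clamp(x, min=0)             # prox of the positive-orthant indicator
relu = torch.nn.functional.relu(x)       # the ReLU activation
assert torch.equal(prox, relu)           # the two coincide elementwise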
Consequently, it was proposed to plug Equation (2.9) into Problem (2.8), leading to the final ConFuse formulation:

$$\underset{T,\,\widetilde{T},\,Z}{\text{minimize}}\;\; \frac{1}{2}\sum_{k=1}^{K}\big\|\widetilde{T}\,\widehat{X}_k(T) - Z_k\big\|_F^2 + \iota_+(Z) + \mu\big\|\widetilde{T}\big\|_F^2 + \mu\|T\|_F^2 - \lambda\Big(\log\det\big(\widetilde{T}\big) + \sum_{c=1}^{C}\log\det\big(T^{(c)}\big)\Big). \tag{2.10}$$
Although Problem (2.10) was still nonconvex, this new formulation had two notable advantages. First, it was remarked that, as soon as the involved activation function was smooth, all terms of the cost function in (2.10) were differentiable, except the indicator function. Thus, the accelerated stochastic projected gradient descent method Adam [95] could be employed; the latter uses automatic differentiation. Second, any activation function could be plugged into the proposed model (2.9), for instance the Scaled Exponential Linear Unit (SELU) [96] or Leaky ReLU [97]. This flexibility played a key role in the performance. The complete architecture is shown in Figure 2.1. Note that the proposed approach was completely unsupervised. The non-negativity constraint was imposed on the fused features Z to avoid trivial solutions. Regarding the representation filters stacked in matrices (T, T̃), the log-det regularization imposed full rank on those; thus, it helped to enforce the diversity of the learned filters.
Figure 2.1: General view of the ConFuse architecture. C = 5 represents the number of DeepCTL networks/channels,
F1c = 5 × 1 is the filter size and M1c = 4 is the number of filters for all the channels.
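For illustration, a hedged sketch of the ConFuse forward computation of (2.9)-(2.10) - channel-wise convolutional transforms, the activation Φ, then the linear fusion transform - with all sizes chosen arbitrarily:

import torch
import torch.nn.functional as F

C, K, N, M, Fs, out_dim = 5, 16, 64, 4, 5, 10
S = torch.randn(C, K, 1, N)                      # one 1-D series per channel
T = [torch.randn(M, 1, Fs) for _ in range(C)]    # channel filters T^(c)
T_fuse = torch.randn(C * M * N, out_dim)         # fusion transform

def confuse_forward(S, T, T_fuse, phi=torch.selu):
    feats = []
    for c in range(C):
        Xc = phi(F.conv1d(S[c], T[c], padding=Fs // 2))  # X^(c) = Phi(S^(c) T^(c))
        feats.append(Xc.reshape(K, -1))                  # flatten per sample
    X = torch.cat(feats, dim=1)                          # stack channel features
    return torch.clamp(X @ T_fuse, min=0)                # fused features Z >= 0

Z = confuse_forward(S, T, T_fuse)
print(Z.shape)  # torch.Size([16, 10])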
2.2.2 DeConFuse

In this framework, the ConFuse architecture was extended with more convolutional layers based on CTL, and the result is called DeConFuse. Here, there were as many transforms as the number of CTL layers. Thus, a different set of convolutional filters $T_1^{(c)}, \dots, T_L^{(c)}$ and features $X_1^{(c)}, \dots, X_L^{(c)}$ was learned for each channel, linked through

$$X_\ell = \phi_\ell\big(T_\ell * X_{\ell-1}\big), \qquad \ell \in \{1, \dots, L\}, \tag{2.11}$$

where $X_0 = S$ and $\phi_\ell$ is a given activation function for layer $\ell$. Further, these features were processed in the same manner as in the ConFuse architecture, i.e., by solving
$$\underset{T,\,X,\,\widetilde{T},\,Z}{\text{minimize}}\;\; \underbrace{F_{\text{fusion}}\big(\widetilde{T}, Z, X\big) + \sum_{c=1}^{C} F_{\text{conv}}\big(T_1^{(c)}, \dots, T_L^{(c)}, X^{(c)} \,\big|\, S^{(c)}\big)}_{J(T,\, X,\, \widetilde{T},\, Z)} \tag{2.12}$$
where

$$F_{\text{conv}}(T_1, \dots, T_L, X \mid S) = \frac{1}{2}\big\|T_L * \phi_{L-1}\big(T_{L-1} * \dots \phi_1(T_1 * S)\big) - X\big\|_F^2 + \Psi(X) + \sum_{\ell=1}^{L}\big(\mu\|T_\ell\|_F^2 - \lambda\log\det(T_\ell)\big), \tag{2.13}$$
and

$$F_{\text{fusion}}\big(\widetilde{T}, Z, X\big) = \frac{1}{2}\sum_{c=1}^{C}\big\|Z - \operatorname{flat}\big(X^{(c)}\big)\,\widetilde{T}_c\big\|_F^2 + \iota_+(Z) + \sum_{c=1}^{C}\big(\mu\big\|\widetilde{T}_c\big\|_F^2 - \lambda\log\det\big(\widetilde{T}_c\big)\big), \tag{2.14}$$

where the operator "flat" transforms $X^{(c)}$ into a matrix in which each row contains the features of one sample. The resulting DeConFuse architecture is depicted in Figure 2.2.
As for the solution of Problems (2.10) and (2.12), it was remarked that all terms of the cost function are differentiable, except the indicator function of the non-negativity constraint. It was thus proposed to solve Problems
Figure 2.2: General view of the DeConFuse architecture. C = 5 represents the number of DeepCTL networks/chan-
nels, L = 2 is the number of DCTL layers, Mℓc is the filter size and Fℓc is the number of filters of the respective
layer ℓ and channel c.
(2.10) and (2.12) by employing the projected gradient descent, whose iterations read:

For $n = 0, 1, \dots$
$$\begin{aligned} T^{[n+1]} &= T^{[n]} - \gamma\,\nabla_T J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}\big) \\ X^{[n+1]} &= P_+\Big(X^{[n]} - \gamma\,\nabla_X J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}\big)\Big) \\ \widetilde{T}^{[n+1]} &= \widetilde{T}^{[n]} - \gamma\,\nabla_{\widetilde{T}} J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}\big) \\ Z^{[n+1]} &= P_+\Big(Z^{[n]} - \gamma\,\nabla_Z J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}\big)\Big) \end{aligned} \tag{2.15}$$
with initializations $T^{[0]}, X^{[0]}, \widetilde{T}^{[0]}, Z^{[0]}$, $\gamma > 0$, and $P_+ = \max\{\cdot, 0\}$. In practice,
the accelerated strategies [98] were used within each step of this algorithm to
speed up learning.
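A hedged sketch of these projected updates with Adam as the accelerated gradient step; `J` below is a stand-in differentiable cost (a real implementation would plug in the sum of (2.13) and (2.14)), and all shapes are placeholders:

import torch

params = {name: torch.randn(8, 8, requires_grad=True)
          for name in ("T", "X", "T_tilde", "Z")}
opt = torch.optim.Adam(params.values(), lr=1e-3)

def J(p):
    # Stand-in for the cost of Problem (2.12); replace with Fconv + Ffusion.
    return sum((v ** 2).sum() for v in p.values())

for n in range(100):
    opt.zero_grad()
    J(params).backward()
    opt.step()                            # accelerated gradient step
    with torch.no_grad():
        params["X"].clamp_(min=0)         # P+ projection on the features
        params["Z"].clamp_(min=0)         # P+ projection on the fused features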
Note that the activations were not restricted to ReLU in equations (2.9) and (2.11); instead, more advanced ones were used, such as SELU [96]. It can be observed from Tables 2.3 and 2.5 that although the ReLU activation performed better in the case of ConFuse, SELU was preferred for DeConFuse: as each convolution layer was added, more of the resultant values from the convolution with the filters were believed to be negative. Since the ReLU activation function sets the negative values to zero, the resulting values were not as distinct as those obtained via SELU. The latter does not set negative values to zero but maps them near zero [96], and hence also prevents the dead neuron issue, unlike ReLU. It proved beneficial for the deeper architecture.
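A tiny generic illustration of the difference (not thesis code): ReLU zeroes every negative response, whereas SELU keeps them distinct:

import torch

x = torch.tensor([-3.0, -1.0, -0.1, 0.5, 2.0])
print(torch.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(torch.selu(x))  # negatives stay distinct (they saturate toward about -1.76)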
2.3 Experimental Evaluation

Stock forecasting is a regression problem aiming at estimating the price of a stock at a future date (the next day for the given problem) given inputs till the current date. Stock trading is a classification problem, where the decision to buy or sell a stock has to be taken at each time. The two problems are related by the fact that simple logic dictates that if the price of a stock at a later date is expected to increase, the stock must be bought, and if the stock price is expected to go down, the stock must be sold.
Five raw inputs were used for both tasks, namely open price, close price,
high price, low price and net asset value (NAV). One could compute technical
indicators based on the raw inputs [81] but, in keeping with the essence of true
representation learning, it was deliberately chosen to stay with those raw values.
Each of the pipelines produced a flattened output. The flattened outputs were
then concatenated and fed into the Transform Learning layer acting as the fully
connected layer (Fig. 2.2) for fusion. The processing pipeline ended with a single output node. The node was binary (buy/sell) for classification and real-valued for regression. The comparison with state-of-the-art time-series analysis models,
namely TimeNet [76] and ConvTimeNet [78], was carried out. In the former, the individual processing pipelines are based on LSTM, and on 1D CNN in the latter. The complete architectural details and hyperparameters for ConFuse, DeConFuse and the compared models are given in Table 2.1.
Table 2.1: Description of compared models with hyperparameters

ConFuse
  Architecture: 5 × [ layer1: 1D Conv(1, 4, 5, 1, 2)¹ + Activation (e.g., ReLU) ]; 1 × [ layer2: Transform Learning ]
  Other parameters: Learning Rate = 0.001, µ = 0.01, λ = 0.0001; Optimizer: Adam with (β1, β2) = (0.9, 0.999), weight_decay = 5e-5, epsilon = 1e-8

DeConFuse
  Architecture: 5 × [ layer1: 1D Conv(1, 4, 5, 1, 2)¹ + Maxpool(2, 2)² + SELU; layer2: 1D Conv(5, 8, 3, 1, 1)¹ ]
  Other parameters: Learning Rate = 0.001 (forecasting), 0.0005 (trading); Optimizer: Adam with (β1, β2) = (0.9, 0.999), weight_decay = 5e-5, epsilon = 1e-8

TimeNet
  Architecture: 5 × [ layer1: LSTM unit(1, 12, 2, True)⁴; layer2: Global Average Pooling; layer3: Fully Connected; for trading, added layer4: Softmax ]
  Other parameters: Optimizer: Adam with (β1, β2) = (0.9, 0.999), weight_decay = 5e-5, epsilon = 1e-8

¹ (in_planes, out_planes, kernel_size, stride, padding)
² (kernel_size, stride)
³ SC - Skip-Connection
⁴ (input_size, hidden_size, #layers, bidirectional)
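To make Table 2.1 concrete, here is a hedged PyTorch sketch of a single DeConFuse channel branch with the listed layer parameters (an illustration only, not the released implementation; note the table's layer-2 in_planes of 5 conflicts with layer-1's 4 output planes, so 4 is used to keep the sketch runnable):

import torch
import torch.nn as nn

class DeConFuseBranch(nn.Module):
    """One channel branch following the DeConFuse row of Table 2.1."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 4, kernel_size=5, stride=1, padding=2),   # layer1
            nn.MaxPool1d(kernel_size=2, stride=2),                 # Maxpool(2, 2)
            nn.SELU(),
            nn.Conv1d(4, 8, kernel_size=3, stride=1, padding=1),   # layer2
        )

    def forward(self, x):                  # x: (batch, 1, window_length)
        return self.net(x)

branch = DeConFuseBranch()
print(branch(torch.randn(16, 1, 20)).shape)   # torch.Size([16, 8, 10])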
The frameworks have been applied on the NSE dataset of 150 symbols, as described in section 1.3.1.
Table 2.2: Forecasting Results (MAE)
The architectures were tuned to yield the best performance, and the weights were randomly initialized.
Firstly, the experiments were performed on the stock forecasting problem: the generated unsupervised features from the proposed architecture were fed into an external regressor, and the evaluation measures the mean absolute error (MAE) between the predicted and actual stock prices for all 150 stocks. Root Mean Squared Error (RMSE) could also have been computed in place of MAE, but MAE was chosen here, as MAE has lower sample variance and is more interpretable than RMSE. The MAE for individual stocks is computed for each of the close price, open price, high price, low price and net asset value.
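For reference, a minimal sketch of this per-stock metric computation (generic NumPy with made-up numbers standing in for predictions):

import numpy as np

actual = np.array([101.2, 102.5, 100.8, 103.1])     # e.g., close prices
predicted = np.array([100.9, 102.9, 101.1, 102.6])

mae = np.mean(np.abs(predicted - actual))           # mean absolute error
rmse = np.sqrt(np.mean((predicted - actual) ** 2))  # for comparison
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}")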
2.4 Results and Analysis

Results Analysis for ConFuse

The testing was done with six different activation functions for ConFuse. For a concise summary of results, Table 2.3 shows the average values over all stocks.

Table 2.3: Summary Forecasting Results (MAE) with ConFuse

It was found that the results for the stock forecasting problem were exceptionally good. For most tested activation functions, ConFuse has an MAE more than one order of magnitude lower than the state-of-the-art. The regression performance is also plotted in Figure 2.3 for two randomly chosen stocks, where it is clearly observed that the predicted close prices are very close to the actual ones.
Figure 2.3: Regression performance for two stocks: (a) AMARAJABAT, (b) JSWENERGY.
Results Analysis for DeConFuse

Summary results are presented in Table 2.4. Interested readers can see the detailed results for all 150 stocks in the Appendix section of the paper [69]. Table 2.4 shows that the MAE values reached by the proposed DeConFuse solution for the four first prices (open, close, high, low) are extremely good for all of the 150 stocks, and the NAV prediction performs well for 128 stocks. For the remaining 22 stocks, there are 13 stocks, highlighted in red, for which DeConFuse did not give the lowest MAE, but it was still very close to the best.
It can be observed that with both the shallow (ConFuse) and deep (DeConFuse) versions, the forecasting performance supersedes the state-of-the-art. Further, going deep did better than the shallow version for a majority of the stocks.
Next, the stock trading, i.e., classification, performance was evaluated. For this task, the unsupervised features were fed to an external classifier, a
Random Decision Forest (RDF). The results were reported in terms of metrics
- precision, recall, F1 score, and area under the ROC curve (AUC). From the financial viewpoint, annualized returns (AR) were also calculated using the predicted and the true labels, termed Predicted AR and True AR respectively. The latter metric is important from the financial perspective: the closer the Predicted AR is to the True AR, the better the quality of the predictions. To calculate the same, the starting capital used for every stock was Rs. 1,00,000 and the transaction charges were Rs. 10 per transaction. The metrics are defined as follows:

• Accuracy: the proportion of correct predictions, i.e.

$$\text{Accuracy} = \frac{1}{m}\sum_{i=1}^{m} \frac{TP_i + TN_i}{TP_i + TN_i + FN_i + FP_i} \tag{2.17}$$

where $TP_i$ = True Positives, $TN_i$ = True Negatives, $FP_i$ = False Positives, $FN_i$ = False Negatives, $m$ is the total number of classes in the dataset, and $i$ ranges from 1 to $m$.
• Precision: also known as the positive predictive value (PPV), measures the fraction of true positives among predicted positives:

$$\text{Precision (PPV)} = \frac{TP}{TP + FP} \tag{2.18}$$

• Recall: also known as sensitivity, measures the fraction of actual positives that are correctly identified:

$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN} \tag{2.19}$$

• F1 score: the harmonic mean of precision and recall,

$$F_\beta = (1 + \beta^2)\cdot\frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}, \tag{2.20}$$

here with $\beta = 1$.

• AUC ROC: the ROC curve plots classification performance at different thresholds using two parameters: True Positive Rate (TPR) and False Positive Rate (FPR). Here TPR is a synonym for recall, and the False Positive Rate (FPR) is defined as follows:

$$\text{FPR} = \frac{FP}{FP + TN} \tag{2.21}$$

AUC stands for "Area under the ROC Curve"; that is, AUC measures the entire two-dimensional area underneath the ROC curve.

• Annualized Returns (AR): the yearly return obtained by executing the buy/sell decisions. Here, transaction charges = Rs. 10/- and Start Capital = Rs. 1,00,000/-.
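These metrics can be reproduced with standard tooling; a small sketch using scikit-learn's metric functions on toy labels (not the thesis's evaluation script):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])           # toy buy(0)/sell(1) labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])
y_score = np.array([.2, .9, .4, .1, .8, .6, .7, .9])  # classifier scores

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # Eq. (2.20) with beta = 1
print(roc_auc_score(y_true, y_score))   # area under the ROC curve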
The results can be referred from Table 2.5. For the stock trading problem, ConFuse performed better than the benchmarks with the SELU and ReLU activations, and reached a similar performance to the benchmarks with the remaining ones.
Table 2.5: Trading Results with ConFuse
The classification performance in detail for all 150 symbols can be referred from the paper's Appendix section. Certain results from that table are highlighted in bold or red. The first set of results, marked in bold, are the ones where one of the techniques for each metric gave the best performance for each stock. The proposed solution DeConFuse gave the best results for 89 stocks for the precision score, 85 stocks for the recall score, 125 stocks for the F1 score, and 91 stocks for the AUC metric.
The other set, marked in red, highlights the cases where DeConFuse did not perform the best but performed nearly equal (here, a difference of a maximum of 0.05 in the metric is considered) to the best performance given by one of the benchmarks, i.e., DeConFuse gave the next best performance. It was noticed that there are 24 stocks for which DeConFuse gave the next best precision metric value; likewise, 18 stocks in case of recall, 22 stocks for F1 score, 26 stocks for AUC values, and 1 stock in case of AR. Overall, DeConFuse reached a very satisfying performance over the benchmark techniques. The trading results are summarized in Table 2.6.
Some empirical convergence plots of Adam, when using ConFuse and DeConFuse with SELU, can be seen in Figure 2.4; they depicted a stable decrease of the training loss.
Further, the representations, both the channel-wise features Xc and the final fused representation Z, were analyzed for one randomly chosen stock, here ANDHRABANK. The visualizations are displayed in Figure 2.5 for one sample of the mentioned stock. It can be seen from the figure that the heatmaps for all the channel-wise features Xc and fused features Z are less redundant and have more variations. Thus, it can be implied that one of the factors behind this variation could be the distinct filters that are learned and that transform the data to produce the varied representations.
Figure 2.4: Loss plots with (a) ConFuse and (b) DeConFuse.
(a) Channel X1 - Close Price; (b) Channel X2 - Open Price; (c) Channel X3 - High Price; (d) Channel X4 - Low Price; (e) Channel X5 - Net Asset Value; (f) Z - Fused features
Figure 2.5: Visualization of channel-wise features Xc and fused representations Z for DeConFuse for one sample of
stock ANDHRABANK (with 8 × 2 as the shape of the features obtained for each channel Xc and flattened features
of shape 40 × 1 for Z)
2.5 Discussion
Shallow and deep fusion based end-to-end frameworks for processing 1D multi-channel data were proposed. Unlike other deep learning models, these frameworks are unsupervised. They are based on a novel deep version of the recently proposed CTL model. The proposed models have been applied to stock forecasting and trading problems, leading to very good performance. The overall empirical convergence behaviour was satisfactory as well.
A notable benefit of the approach is the reusability of features learned in an unsupervised fashion. For example, consider the problems that are addressed: for traditional deep learning-based models, one needs to re-train deep networks separately for regression and classification. But here the learned final features can be reused, without the requirement of re-training, for the specific tasks. This has advantages in other areas as well. For example, one can either do ischemia detection, i.e., detect whether one is having a stroke at the current time instant (from EEG), or predict its occurrence at a future instant. With standard deep learning, two networks need to be trained and tuned to tackle these two problems. With the proposed methods, there is no need for this double effort.
Since stock data is quite volatile, a minor improvement matters in the problems pertaining to this domain. Thus, the better results of the proposed frameworks over the benchmarks are beneficial for the system. However, the AUC ROC values can be improved further in the future, as those have scope for improvement in this two-class problem. Also, in the future, the frameworks can be extended to other fusion problems.
Chapter 3
Supervised Multi-channel Fusion Frameworks - SuperDeConFuse and DeConDFFuse

In the last chapter, the unsupervised frameworks based on CTL were discussed that bridged the gaps that CNNs have. However, the question that comes next is - if we have labeled datasets, are CNN based models sufficient for supervised learning? It has been observed that CNNs have emerged as the recommended solution in many such scenarios. But the issue with CNNs is that supervised learning through them does not ensure distinct filters; hence, the feature maps might have redundancy. Additionally, there is a dead neuron problem with CNNs, which is tied to the kind of activation function chosen, and mostly happens when ReLU is used. A dead neuron can be an even bigger problem: if every neuron in a specific hidden layer is dead, it cuts the gradient to the previous layer, resulting in zero gradients to the layers behind it. Thus, the weights would not be updated, and the learning will be improper. It can be mitigated using lower learning rates, so that a big gradient doesn't set a big negative weight and bias in a ReLU neuron. Another solution is to use other activation functions like Leaky ReLU, which allows the neurons outside the active interval to leak some gradient backward. But sometimes these fixes just do not suffice.
Therefore, the two issues discussed above open up the scope for developing supervised frameworks that can tackle them jointly. The first proposed framework, SuperDeConFuse (SDCF), jointly trains and optimizes multiple CTL based channels under a cross-entropy loss. Thus, representations are not learned just via CTL but are also directed by the classification loss - Cross Entropy. It has been applied to the stock trading problem. The other framework, DeConDFFuse, jointly trains multi-channel CTL networks and a Decision Forest (DF); it deals with the drug-drug interaction problem. Here, the representations are learned via CTL and DF, which yields better performance. Both frameworks are detailed in the following sections.
3.1 SuperDeConFuse: A supervised deep convolutional transform based fusion framework for financial trading systems

3.1.1 Literature Review - Stock Trading

Let us briefly review here some of the works that have proposed solutions for the stock trading problem. The problem of stock trading has been one of the most difficult problems for researchers in finance data processing and for speculators. The struggles are mainly due to the uncertainties and noise of the samples.
In the literature, different methodologies have been applied to stock data for predicting future trading strategies (e.g., buy and sell decisions). These include deep learning models (e.g., CNN, LSTM) and self-supervised learning based approaches, as well as statistical and feature-based techniques.
Statistical methods are probably the methods that are most universally used for this task. Examples include the use of sequential statistical models, such as ARMA [100], ARCH [101], GARCH [102] and [103], and the Kalman filter [104]. Feature-based techniques are also popular: technical indicators, wavelets, etc., have been used in past studies to extract the features from the data [105]. Text mining can also be used to process financial analysis from newspapers [106]. The features are then input to machine learning models, for example, SVM and ANN.
Further studies have proposed hybrid machine learning models using multiple base classifiers operating on a common input and a meta classifier learning from the base classifiers' outputs to obtain more precise stock return and risk predictions. A combination of SVM and a weighted KNN model for predicting stock market indices is proposed in [110]. Another study [111] combines statistical and probabilistic Bayesian learning and the machine learning model ANN for the same. However, in all these approaches, the learned relationship between historical data and future value prediction may lack interpretation because of the "black-box" property. Thus, the performance of these methods is directly related to the quality of the handcrafted input features.
Deep learning based models have also been extensively used for solving the stock trading problem; recurrent networks are considered among the most appropriate models for time-series analysis. LSTM is one such recurrent variant, and studies have applied LSTM on the technical indicators for the prediction. However, despite the great performance of such recurrent models, the complexity of training them
has encouraged the users to search for more tractable models and solutions.
Among these are CNNs, which have been used profusely and have performed well in stock time-series forecasting, especially 2-D CNNs. The studies pertaining to CNNs [79–85] have been discussed in the previous chapter; most of them employ the 2-D CNN model, since these studies model an inherently 1D time series as an image. Self-Supervised Learning (SSL) based models are also emerging currently for cases when no labels for the data are available: the unsupervised problem is turned supervised by predicting pseudo labels, and then training happens. There are a few works that utilize and propose solutions based on SSL for stock trading prediction [86–89]. However, such techniques are resource-intense and, just like CNNs, these SSL based learning paradigms do not have distinctiveness guarantees.
3.1.2 Proposed Formulation

The proposed SuperDeConFuse (SDCF) framework is discussed in this section. A crucial element of the latter is the recently introduced CTL [56]. The details of CTL have already been covered in section 2.1.2; extending it to the deep version and the fusion part is covered with DeConFuse, described in section 2.2.2. Now, let us move to the proposed framework, which is an extension of these approaches to handle a multi-layer architecture that is trained in a supervised fashion.
This framework took the channels of input data samples to separate branches of convolutional CTL layers; the channel-wise features obtained were thus decoupled. In order to couple (i.e., fuse) them, these were concatenated and transformed into coupled features via transform learning. These features were then fed to another linear fully-connected layer, and the obtained features were finally inputted to the softmax layer, which yielded the probabilities for the classes. The complete architecture is shown in Figure 3.1.
Figure 3.1: General SuperDeConFuse Architecture. The architecture is tested for L = 1, 2, 3, 4 layers and C = 5.
Here M11 × 1, . . . , MLC × 1 represents the kernel size used in each layer ℓ ∈ {1, . . . , L}. Here, maxpooling is not
performed after layer 4 due to the small window size/input sequence length.
Mathematically, as in DeConFuse, a different set of convolutional filters $T_1^{(c)}, \dots, T_L^{(c)}$ and features $X^{(c)}$ were learned for each channel $c \in \{1, \dots, C\}$. The linear (not convolutional) transforms $\widetilde{T} = (\widetilde{T}_c)_{1 \le c \le C}$ were also learned to fuse the channel-wise features $X = (X^{(c)})_{1 \le c \le C}$, along with the corresponding fused features $Z$, at the same time. The latter task was carried out through the fusion term

$$F_{\text{fusion}}\big(\widetilde{T}, Z, X\big) = \frac{1}{2}\sum_{c=1}^{C}\big\|Z - \operatorname{flat}\big(X^{(c)}\big)\,\widetilde{T}_c\big\|_F^2 + \Psi(Z) + \sum_{c=1}^{C}\big(\mu\big\|\widetilde{T}_c\big\|_F^2 - \lambda\log\det\big(\widetilde{T}_c\big)\big), \tag{3.1}$$

where the operator "flat" transforms $X^{(c)}$ into a matrix where each row contains the features of one sample. On top of the fusion stage, a linear classifier was learned, which took the input features $Z$ and yielded the class probabilities. The cross-entropy (CE) loss associated with the final classification is given by

$$F_{\text{CE}}(\theta, Z \mid y) = \sum_{k=1}^{K} \log\Bigg(\sum_{v=1}^{V} e^{\,z_k^\top(\theta_v - \theta_{y_k})}\Bigg), \tag{3.2}$$

where $V$ is the number of classes, $\theta_v$ is the $v$-th column of matrix $\theta$, $z_k^\top$ is the $k$-th row of matrix $Z$, and $y_k \in \{1, \dots, V\}$ is the label of the $k$-th sample.
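Equation (3.2) can be verified numerically; the toy sketch below (with arbitrary shapes) rewrites it via a log-sum-exp and checks that it agrees with the standard cross-entropy loss:

import torch
import torch.nn.functional as F

K, d, V = 6, 10, 3                    # samples, feature size, classes
Z = torch.randn(K, d)                 # fused features z_k
theta = torch.randn(d, V)             # classifier weights
y = torch.randint(0, V, (K,))         # labels y_k

logits = Z @ theta                    # z_k^T theta_v for all v

# Direct evaluation of Eq. (3.2): sum_k log sum_v exp(z_k^T(theta_v - theta_{y_k}))
fce = (torch.logsumexp(logits, dim=1) - logits[torch.arange(K), y]).sum()

# Same quantity via the standard cross-entropy loss
assert torch.allclose(fce, F.cross_entropy(logits, y, reduction="sum"))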
The log-det penalty broke symmetry and enforced the diversity in the learned transforms; in contrast, the CE loss steered the fused features toward class separability.
3.1.3 Optimization algorithm

It was chosen to find a local minimizer of the resulting non-convex problem through the following projected gradient scheme:

For $n = 0, 1, \dots$
$$\begin{aligned} T^{[n+1]} &= T^{[n]} - \gamma\,\nabla_T J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big) \\ X^{[n+1]} &= P_+\Big(X^{[n]} - \gamma\,\nabla_X J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big)\Big) \\ \widetilde{T}^{[n+1]} &= \widetilde{T}^{[n]} - \gamma\,\nabla_{\widetilde{T}} J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big) \\ Z^{[n+1]} &= P_+\Big(Z^{[n]} - \gamma\,\nabla_Z J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big)\Big) \\ \theta^{[n+1]} &= \theta^{[n]} - \gamma\,\nabla_\theta J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big) \end{aligned} \tag{3.3}$$
The scheme was initialized with random matrices $T^{[0]}, X^{[0]}, \widetilde{T}^{[0]}, Z^{[0]}, \theta^{[0]}$, and a suitable step size $\gamma > 0$ was chosen. The gradient step was numerically evaluated with the accelerated scheme initially introduced for the ADAM method in [95]. The advantages of this strategy have been discussed in the previous chapter.
3.1.4 Preprocessing

Before proceeding with the experimental setup, the labeling process for the dataset and the training details are discussed here. The specifics of the dataset can be found in section 1.3.2.
In the labeling phase, the labels were manually assigned to the daily close prices as Buy (0), Hold (1) and Sell (2). The labels were determined by performing a grid search on the list of holding percentages to identify the percentage change for which the stocks should be held to maximize the annualized returns for the given stock, as detailed in Algorithm 1.
In general, the sliding walk forward validation technique is used as the cross-
validation technique in the case of time-series data, also shown in Figure 3.2. As
can be seen from Figure 3.2, ten years of data for training have been used and
the subsequent one year of data for testing, i.e., the stock data from 1998-2007
was for training and the year 2008 for testing. Then the training window was
slid by one year which implied that it was next trained from 1999-2008 and
tested on the following year 2009 data and this period is called the horizon.
In summary, it was trained for ten years, tested for the next year, slid it by a
one year horizon, and again trained and tested it until 2018. Thus, 11 years
Algorithm 1: Labelling Method
Input: CP - array of daily close prices for the current stock/symbol S
Parameters: X - array of K holding percentages; NUMDAYS - number of days for the current symbol, i.e., len(CP); Labels - 2D array of size K x NUMDAYS
Output: FinalLabels - labelled dataset for S
1: AR = [ ] // of size K
2: for k = 0, 1, 2, . . . , K − 1 do
3:   for n = 0, . . . , NUMDAYS − 2 do
4:     change = abs(((CP[n + 1] − CP[n]) / CP[n]) ∗ 100) // CP[n+1] is the next day's closing price
5:     if change > X[k] then
6:       if CP[n + 1] > CP[n] then
7:         label = "Sell"
8:       else
9:         label = "Buy"
10:      end if
11:    else
12:      label = "Hold"
13:    end if
14:    Labels[k].append(label)
15:  end for
16:  ar = AnnualisedReturn(Labels[k], CP)
17:  AR.append(ar)
18: end for
19: maxAr = Max(AR), maxIndex = index(Max(AR))
20: HoldPercentage = X[maxIndex]
21: FinalLabels = Labels[maxIndex]
22: return FinalLabels
23: Repeat steps 1 to 22 for all the stocks/symbols in the dataset.
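A runnable Python sketch of Algorithm 1 is given below; annualised_return stands in for the AnnualisedReturn routine assumed by the algorithm and is passed in rather than defined here.

import numpy as np

def label_stock(CP, holding_percentages, annualised_return):
    """Grid-search the holding percentage maximizing annualized returns
    and return the corresponding Buy/Hold/Sell labels (Algorithm 1)."""
    best_labels, best_ar = None, -np.inf
    for x in holding_percentages:
        labels = []
        for n in range(len(CP) - 1):
            change = abs((CP[n + 1] - CP[n]) / CP[n]) * 100
            if change > x:
                labels.append('Sell' if CP[n + 1] > CP[n] else 'Buy')
            else:
                labels.append('Hold')
        ar = annualised_return(labels, CP)
        if ar > best_ar:
            best_ar, best_labels = ar, labels
    return best_labels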
Thus, 11 years of data from 2008-2018 were used as test data. This way, there were 11 models, and the set of hyperparameters was selected that gave the best results across all 11 models. The hyperparameters tuned include µ, λ, kernel sizes, number of filters/kernels, learning rate, weight decay of the Adam optimizer, batch size, and number of epochs. Additionally, the weights of each yearly model were learned afresh under this sliding technique to analyze the robustness of the architecture. In other words, the model performance was calculated every time a year's data became available for testing
and the previous year's test data was absorbed into training. The training and the test data were standardized using the Normalizer from the Python scikit-learn library, as the prices and the NAV lie on different scales.
Figure 3.2: Sliding walk-forward validation technique used for hyperparameters tuning
The experiments were carried out on the real-world problem of stock trading, where a decision to buy, hold or sell a stock has to be taken at each time instant. The decision rule is: if the price of a stock at a later date is expected to increase, the stock must be bought; if the stock price is expected to go down, the stock must be sold; and if there is no change in the price, the stock should be held, i.e., do nothing until the price increases. This was done in a way to maximize the annualized returns from the stock for the company's profit, as mentioned in the labeling process.
Five raw inputs were used: open price, close price, high price, low price and net asset value (NAV). It was chosen to stay with the raw values. One could instead compute technical indicators based on the raw inputs [105], but raw values were preferred here to keep up with the true nature of representation learning. Each of the five inputs was processed through a separate pipeline. Each pipeline produced a flattened output (Figure 3.1). These flattened outputs were then concatenated and fed for fusion into the Transform Learning layer acting as a fully connected layer (Figure 3.1). Further, this was connected to another linear fully connected layer, and finally there was a softmax function.
The softmax function gave the classification output, which consisted of the class probabilities.
The architecture was extended by adding CTL layers up to four layers, resulting in four different deep SDCF architectures. The details of all four architectures are briefed in Table 3.1. Maxpooling halves the input sequence length/window size/time steps with every operation. Thus, after three layers, the size was reduced to a value that prevented the use of a maxpooling operation after the 4th CTL layer; hence, the 4-CTL-layers SDCF architecture has no maxpooling after layer 4. Also, for making a prediction on any day, the past ten days were analyzed through the model, denoted as Time Steps in Figure 3.1. Additionally, the stock trading signal was not predicted for the first ten days of every test year, so that a complete window of past observations was always available.
Table 3.1: Hyperparameters for the different instances of the proposed SDCF network (see Figure 3.1 for the general overview) used in the experimental section.

Common hyperparameters (all instances, each with 5 channel pipelines): LearningRate = 0.001, λ = 0.01, µ = 0.0001, epochs = 100; Optimizer: Adam with parameters (β1, β2) = (0.9, 0.999), weight_decay = 1e-4, epsilon = 1e-8.

SDCF 1L:
  layer1: 1D Conv(1, 16, 3, 1, 1)^1 + Maxpool(2, 2)^2
  layer2: Fully Connected (TL)^3
  layer3: Fully Connected (Linear)
  Softmax

SDCF 2L:
  layer1: 1D Conv(1, 8, 3, 1, 1)^1 + SELU + Maxpool(2, 2)^2
  layer2: 1D Conv(8, 16, 3, 1, 1)^1 + Maxpool(2, 2)^2
  Softmax

SDCF 3L:
  layer1: 1D Conv(1, 4, 11, 1, 5)^1 + SELU + Maxpool(2, 2)^2
  layer2: 1D Conv(4, 8, 7, 1, 3)^1 + SELU + Maxpool(2, 2)^2
  layer3: 1D Conv(8, 16, 3, 1, 1)^1 + Maxpool(2, 2)^2
  Softmax

SDCF 4L:
  layer1: 1D Conv(1, 4, 13, 1, 6)^1 + SELU + Maxpool(2, 2)^2
  layer2: 1D Conv(4, 8, 11, 1, 5)^1 + SELU + Maxpool(2, 2)^2
  layer3: 1D Conv(8, 16, 9, 1, 4)^1 + SELU + Maxpool(2, 2)^2
  layer4: 1D Conv(16, 32, 5, 1, 2)^1
  Softmax

^1 (in_planes, out_planes, kernel_size, stride, padding)
^2 (kernel_size, stride)
^3 TL - Transform Learning; L - #CTL layers
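For illustration, a minimal PyTorch sketch of one of the C = 5 channel pipelines of SDCF 3L, following the shapes in Table 3.1, is shown below. This only reproduces the layer shapes; in the actual framework the convolutional filters are learned with the CTL objective (log-det and Frobenius penalties) rather than as a plain CNN.

import torch.nn as nn

# One channel pipeline of SDCF 3L; input shape (batch, 1, window_size)
channel_pipeline = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=11, stride=1, padding=5), nn.SELU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Conv1d(4, 8, kernel_size=7, stride=1, padding=3), nn.SELU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Conv1d(8, 16, kernel_size=3, stride=1, padding=1),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Flatten(),  # flattened output, later concatenated across channels for fusion
)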
The comparison was made with three state-of-the-art time-series analysis models, out of which two techniques presented models proposed specifically for financial stock trading - CNN-TA [105] and MFNN [113] - and the last technique presented a generic model for time-series data - FCN (Fully Convolutional Network) [75]. The latter was used to understand how generic the proposed model is when compared against both specific stock-trading and general time-series models. In all these techniques, the processing pipelines were based on CNN. Other than CNN, MFNN [113] was also based on an RNN type of network - LSTM. In [105], the data used was not raw but processed as technical indicator values and passed as an image, hence using 2D CNN, whereas in FCN [75] the data was processed via 1D CNN. The same hyperparameters for the benchmark techniques were used as given in the respective studies, except for FCN, which was best tuned for the used dataset. The proposed model was also compared to a simple 1D CNN with a 3-convolutional-layers-deep architecture using the same hyperparameters, with kernel sizes F_ℓ and padding size F_ℓ/2. The difference lay in the objective function of the convolutional learning in the two techniques, i.e., the 3-layers-deep SDCF and the 3-layers-deep simple 1D CNN. This was done to analyze the performance gain attributable to CTL; the CNN was given 3 convolutional layers since the results were best for SDCF at this depth.
The predictions from every year, totaling 11 years, were saved, and metrics were computed to analyze the performance of the SDCF model. Two sets of metrics were computed, namely (i) classification metrics and (ii) financial metrics.
Classification metrics analyze the performance from the classification point of view. The weighted F1 score, precision and recall were calculated to account for the class imbalance of every stock. Note that, in such a case, the F1 score is not equivalent to the harmonic mean of precision and recall since it is weighted.
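A minimal sketch of how such weighted metrics can be computed with scikit-learn is shown below; the label arrays are illustrative placeholders.

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 1, 2, 1, 1]   # 0 - Buy, 1 - Hold, 2 - Sell (illustrative labels)
y_pred = [0, 1, 2, 2, 1, 1]
# 'weighted' averaging weighs each class by its support, so the reported F1
# is no longer the harmonic mean of the reported precision and recall
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted', zero_division=0)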
Financial metrics compare the proposed framework and the state-of-the-art from the financial point of view. For this purpose, Annualized Returns (AR) were computed using the predictions from all the models. The AR value was calculated the same way as in [105]. The starting capital was taken in Indian rupees to calculate the AR values since the dataset had all Indian stocks; note, however, that the chosen metric is versatile and could be used with any currency.
Let us now turn to the classification results of the proposed models. The framework was tested for a shallow version - 1 CTL layer - and
deeper versions - 2, 3 and 4 CTL layers. The features generated from the fully connected layers were passed to the softmax, after which the probabilities for all the classes were obtained. The class with the maximum probability was selected as the predicted label. The performance was calculated for every class; specifically, F1 score, precision and recall were computed for the BUY, HOLD and SELL classes. The summary results for the BUY, SELL and HOLD classes are given in Tables 3.2, 3.4 and 3.3, and the global results in Table 3.5. The detailed results can be referred to from the appendix section of the paper [57].
Certain results are highlighted in bold or red. In the first set of results, marked bold, one or more techniques give the best (greater than or equal) performance for each metric. There are 8 stocks for which the proposed model performed greater than or equal to the benchmark techniques for the F1 score in the case of the BUY class. Following the same, the SDCF gave greater than or equal performance for 13 stocks for precision and 5 stocks for recall under the BUY class; similarly, for 7 stocks for the F1 score, 7 stocks for precision and 5 stocks for recall in the HOLD class, and for 7 stocks for the F1 score, 11 stocks for precision and 6 stocks for recall in the case of the SELL class. A similar analysis was carried out for the simple 1D CNN against the proposed model. It was found that CNN gave greater than or equal performance for 2 stocks for each metric under the BUY class; similarly, there are 6, 1 and 9 stocks for the HOLD class and 2 stocks each for the F1 score, precision and recall under the SELL class.
Additionally, the other set of results, in red, indicates the cases where one of the proposed model versions gave the similar/next-best performance within a 0.02 error difference - err_dif, say - after one of the benchmarks, i.e., 0.0 < err_dif ≤ 0.02. Adhering to the same, it was observed that for the BUY class there is 1 stock each for the F1 score, precision and recall metrics. Likewise, for the HOLD class there are 7, 4 and 5 stocks for the F1 score, precision and recall metrics, respectively; and for the SELL class, 1 stock each for the F1 score and recall metrics. Although the results for CNN have not been highlighted where it gave similar/next-best performance, the corresponding statistics are presented here: for CNN, there are 2 and 3 stocks for the F1 score and precision under the HOLD class. These statistics indicate that the performance of the proposed model is better than CNN for all three classes - BUY, HOLD and SELL.
Table 3.3: Summary of HOLD Class Classification Results for Stock Trading
Table 3.4: Summary of SELL Class Classification Results for Stock Trading
From the summary results in the above tables, the metrics for which the model gave the best average values are the average F1 score and precision for the BUY class, the average F1 score and recall for the HOLD class, and the average F1 score and precision for the SELL class. The F1 score is an important metric here, as it is the harmonic mean of precision and recall.
As can be observed, the performance for the HOLD class decreased when increasing the number of layers of the SDCF model. However, there is an increase in the correct identification of BUY and SELL points, despite the fact that BUY and SELL points appear far less frequently than HOLD points. This is actually more crucial for the financial system, as it directly influences the returns. It also indicated that the model captured all three classes, i.e., BUY, HOLD and SELL, well.
This was also indicated in the confusion matrices, given in Figure 3.3 for each depth of the SDCF architecture. With more layers, the model started to identify the BUY and SELL points more correctly. The HOLD signal had more false positives with the shallow architecture (SDCF 1L), which decreased with the increase in layer number; this was essential for the system in order to classify the other class points correctly.
Figure 3.3: Confusion matrices corresponding to the different number of CTL layers of the architecture: a) 1 layer
of CTL (shallow version), b) 2 layers of CTL (deep version), c) 3 layers of CTL (deep version) and d) 4 layers of
CTL (deep version) where 0 - BUY, 1 - HOLD, 2 - SELL signals.
Additionally, the overall weighted F1 score, precision and recall metric values were calculated for all the stocks under consideration. The reason for computing weighted values was to incorporate the class imbalance of every stock. The detailed results can be referred from the appendix section of the paper [57], and summary results are given in Table 3.5. Again, the results comprise two sets of values marked in bold or red, with the same err_dif of 0.02. There are 6, 9 and 5 stocks for the metrics F1 score, precision and recall for which the model performed greater than or equal to the performance given by the state-of-the-arts. Also, there are 6, 3 and 6 stocks for the metrics F1 score, precision and recall, respectively, for which the model gave
the next-best performance under 0.02 err_dif. Although the BUY and SELL classes' performance with the 4-CTL-layers-deep architecture is better than the benchmarks compared against, the overall performance from the average weighted metrics is suggestive of the good performance of the 3-layers-deep architecture, as explained later.
Again analyzing explicitly for CNN, there are 4, 2 and 7 stocks with greater than or equal performance, and 3, 2 and 3 stocks with similar/next-best performance for the F1 score, precision and recall metrics, respectively. As can be seen from these statistics, the proposed model gives better results under both the greater-than-or-equal and next-best/similar criteria, except that the number of stocks for the recall metric is slightly higher with CNN under greater-than-or-equal performance. However, the next-best performance statistic for the recall metric is much better than CNN's. The overall average performance is good with the proposed model as compared to the benchmarks and CNN, which can also be seen from Table 3.5.
Table 3.5: Summary of Weighted Classification Results for Stock Trading
For a deeper understanding, the financial results were analyzed to assess the quality of predictions made by the SDCF model. As explained earlier, the AR values were calculated from the predictions generated by each technique for every stock over 11 years. The AR values were also calculated with the true labels for every stock over the same period. Finally, the absolute difference/error between the AR values from the predictions and the AR values from the true labels was computed. These absolute differences were averaged over all the stocks, yielding the so-called Mean Absolute Error (MAE). The detailed results are given in the corresponding table of the paper [57]. With the proposed model, 5 stocks have the best performance, whereas CNN-TA has 1 stock, and MFNN and FCN have 2 stocks each. Overall, the performance is good with the proposed model, as is also evident from the summary results in Table 3.6, where
the mean of the absolute difference/error (MAE) between the true AR and the predicted AR is reported. Also, there are 3 stocks for which the proposed model gave performance equal to the other benchmark techniques. This set of results illustrated that, despite the higher capability of identifying the BUY and SELL points with the 4-layers-deep CTL, the AR values are better predicted with the 3-layers-deep architecture.
With respect to CNN, there are only 2 stocks for which CNN performs better than the benchmarks and the proposed models, and 3 stocks for which it gave an equal performance. Thus, from the combined (greater-than-or-equal and next-best/similar), average classification and financial results, CNN is less performant than the proposed model. This also indicated that the quality of predictions made with the SDCF model is better than CNN's, as the identified class labels give AR values quite close to the true AR values. The same held against all the benchmarks. The statistics presented here can be deduced from Table 3.6.
Table 3.6: Summary of Financial Results for Stock Trading

Method      MAE of AR
SDCF 1L     22.5613
SDCF 2L     20.7227
SDCF 3L     20.5067
SDCF 4L     22.8287
CNN         21.1140
FCN         23.7720
CNN-TA      22.1380
MFNN        22.3040
To further compare the supervised learning of the regular CNN and the SDCF framework, the channel-wise features X_c obtained after the last maxpool layer of the 3-convolutional-layers-deep versions of both frameworks were visualized. Figure 3.4 shows the visualizations of these features.
Figure 3.4: Visualization of channel-wise features X_c for SDCF versus a standard CNN for one sample of stock BSELINFRA.BO; panels (a)-(e) correspond to channels X_1 Close Price, X_2 Open Price, X_3 High Price, X_4 Low Price and X_5 NAV (features of shape 16 × 1, resized to 8 × 2 for better visualization). The top row shows the features generated by the proposed SDCF network.
As seen from Figure 3.4, the heatmaps of the channels corresponding to the prices (Close, Open, High and Low) show no variation in the case of CNN, as compared to the SDCF architecture. While CNN shows some variation in the features learned for NAV, the features are still better learned with SDCF. Also, the darker the color in the heatmap, the larger the negative exponent of the value; hence, in the case of CNN, the values are very small, almost diminishing to zero. This corroborated the fact that the filters learned with the proposed model are distinct, owing to the added "log-det" term, which in turn gives different features with significantly less redundancy. Thus, the visualizations of these channel-wise features also support the claim of better supervised training with the SDCF framework than with CNN.
In this section, the ablation study performed is discussed. The network was trained in a piecemeal fashion here. The motive for performing this study was to understand the behavior of the network without the benefit of joint training. Since it is piecemeal, it was carried out in two parts. In the first part, the network learned the fused representations by minimizing the following objective function:
\[
F_{\mathrm{fusion}}(\tilde{T}, Z, X) = \frac{1}{2}\Big\|Z - \sum_{c=1}^{C}\mathrm{flat}(X^{(c)})\tilde{T}_c\Big\|_F^2 + \Psi(Z) + \sum_{c=1}^{C}\Big(\mu\|\tilde{T}_c\|_F^2 - \lambda\log\det(\tilde{T}_c)\Big) \tag{3.4}
\]
It is the same as equation 3.1, and hence all the variables here mean the same. Then these learned Z's are fed into a shallow single-layer neural network trained with the cross-entropy loss
\[
F_{\mathrm{CE}}(\theta, Z \mid y) = \sum_{k=1}^{K} \log \sum_{v=1}^{V} e^{z_k^\top(\theta_v - \theta_{y_k})}. \tag{3.5}
\]
This is the same as equation 3.2, and thus all variables mean the same. However, the difference is that here θ is not learned jointly with Z; instead, two separate pipelines learn each of these variables individually. The hyperparameters used for both parts are the same as those used in the proposed architecture, with the Adam optimizer. The results corresponding to the classification and financial analyses are given under the heading "piecemeal" in Tables 3.7, 3.8, 3.9, 3.10 and 3.11, respectively.
Table 3.7: Ablation Study performance for BUY Class
Table 3.8: Ablation Study performance for HOLD Class
Table 3.11: Ablation Study Financial Results

Method      MAE of AR
SDCF 1L     22.5613
SDCF 2L     20.7227
SDCF 3L     20.5067
SDCF 4L     22.8287
piecemeal   23.4073
From the results in Tables 3.7, 3.8, 3.9, 3.10 and 3.11, it can be clearly seen that the piecemeal version did not perform well as compared to the proposed solution, except for the HOLD class and the recall value of the weighted summary results, where it gave values only slightly better than the proposed ones. Despite these slightly higher values, the piecemeal approach did not recognize BUY and SELL points as efficiently as the proposed method, SDCF, which is critical for the system. This also suggests that the joint supervised solution involving the cross-entropy loss guided better representation learning. Therefore, the proposed solution's joint training is justified and important for the system to recognize the critical BUY and SELL points, as well as to recognize the HOLD points efficiently.
Experiments for two additional window sizes, namely 5 and 20, have also been performed. In order to avoid extensive space utilization, only the comparative summary results are reported for the weighted F1 score (classification metric) and
AR (financial metric) in Table 3.12 for window sizes 5 and 20, along with the summarized results for window size 10. The proposed method yielded the best results overall; although in one case (window size 20) it did not reach better results in terms of weighted F1, it still performed well financially for the same scenario. Furthermore, CNN-TA could not be run for small window sizes (such as 5) and hence cannot be deemed an all-purpose, go-to method. Small window sizes are crucial for highly non-stationary stocks, and the inability of a technique to handle them is a significant limitation. Overall, the proposed model performed better than the benchmarks and CNN both classification-wise and financially; specifically, it gave the best performance with the 3-CTL-layers-deep architecture. Training loss plots are also displayed for a few stocks, namely INDRAMEDCO.BO and NATIONALUM.BO, in Figure 3.5, for both the shallow and the deeper versions. It can be seen that the training loss decreased to the point of stability for each example considered.
Table 3.12: Comparative Summary Results for Stock Trading for window sizes 5,10,20
Figure 3.5: Evolution of the loss during training for a few stock examples of the proposed model with (a) CTL 1
layer, (b) CTL 2 layers, (c) CTL 3 layers and (d) CTL 4 layers.
3.2 DeConDFFuse: Predicting Drug-Drug Interaction using Joint Deep Convolutional Transform Learning and Decision Forest Fusion Framework
This section presents another proposed supervised framework, again based on CTL. Briefly, the framework jointly trains CTL based network pipelines, fuses them with TL and finally passes the result through a Decision Forest (DF). The technique has been applied to Drug-Drug Interaction (DDI) prediction. DDIs are the adverse changes, effects or reactions of one drug due to the recent or concurrent use of another drug(s). For example, the drug Ceftriaxone should be avoided in children less than 28 days old if they are receiving or expected to receive calcium-containing intravenous products, owing to fatal reactions resulting from crystalline deposits in the lungs and kidneys, as reported in [114]. Such reactions from DDIs are known as adverse drug reactions (ADRs). ADRs are responsible for threats to a person's life and inadvertently increase overall healthcare costs.
According to the studies [115, 116], ADRs contribute to more than 20% of clinical trial failures and are considered the highest load in the modern drug discovery process. Serious ADRs can cause severe disability and even death in patients. Also, from the study [116], it is observed that approximately 3.6% of all hospitalizations are related to ADRs. From the financial perspective, the annual cost of drug-related morbidity in the United States (US) was estimated at $528.4 billion in 2016, equivalent to 16% of total US healthcare expenditures that year [117]. It thus becomes pertinent to identify, in an exhaustive manner, the DDIs that could cause ADRs. This might not avoid all unanticipated drug interactions, but it can help lower the drug development costs.
Several strategies have been investigated in the literature, which are discussed below. They broadly include similarity-based and statistical machine learning (SML) models, graph models, deep learning models and matrix factorization models.
The work [119] proposes similarity-based models that compute similarity scores between drugs to predict interactions. Researchers have also explored Bayesian learning models [120] under statistical learning paradigms. Another work [121] uses a sparse feature learning ensemble method with linear neighborhood regularization, exploiting drug data such as enzymes and pathways. In [122], ML algorithms like Naive Bayes, Decision Tree, Random Forest, Logistic Regression, and XGBoost were used with cross-validation, with SMILE values and interaction features based on the CYP450 group as input.
Such techniques rely on hand-crafted input features rather than learning them, restricting them to some fixed set. Furthermore, overfitting stays a significant issue with these techniques due to their restrictive non-linear mapping and fitting capability.
Knowledge graphs (KGs) have also been explored for DDI prediction. With the advent of the availability of biomedical data, researchers are moving toward KGs to populate and complete the available biomedical information. This is done with the help of the large structured databases and texts available publicly [123]. Some works have used the combination of the DDI matrix and KG for the said DDI prediction as well. In [124], the proposed framework integrates KG-based drug representations; the integrated representations, combined via concatenation, are passed through a neural network to get the final DDI predictions. The study [125] proposes attention-based RNNs - LSTMs - for DDI prediction. In [126], deep neural networks based on the attention technique are utilized for predicting DDIs.
Further studies combine KGs and DL to predict DDIs. In the study [127], the DDI matrix and KG form the input to a DL network with CNN and LSTM components. Let us also mention recent studies that present matrix factorization as the solution to predict DDIs [128, 129]. Here, the input is the DDI matrix and the similarity scores between the drugs; the pair for which the DDI is to be predicted is represented through these similarity scores. Some works use Triple Matrix Factorization as well [130, 131]. Another study learned unified drug representations via encoders, then applied four operators on the learned drug embeddings to represent drug-drug pairs, and finally used an RF classifier to train models for predicting DDIs [132].
In the proposed framework, a decision forest yielding the final DDI predictions is jointly optimized with the learning of features. Such a solution has been successfully used before in the Deep Neural Decision Forest (DNDF) framework [1], which involves a CNN with the cross-entropy loss in its optimization objective and a softmax at the end of its architecture. Here, instead, the goal was to guide the supervision through a random forest, and the features were inherited not from a CNN but from the DeConFuse network based on deep CTL and fused through linear transform learning. The latter's advantage is that it promotes distinct filters/transforms, which is not guaranteed with CNNs. Such a benefit helped in learning quality representations,
further guided by the predictions from the decision forest, whose parameters are jointly optimized. This joint optimization was helpful, as the representations learned are guided not only by the deep CTL and fusion objectives but also by the decision forest. Thus, the representations that would otherwise have been fed to the Decision Forest (DF) as plain input were learned better through the joint end-to-end solution, which is not possible with a piecemeal approach. This double-guided (deep CTL based fusion + DF) supervised framework is detailed next.
The details of the DeConFuse network can be referred from Chapter 2. Let us briefly discuss the other component, the DNDF framework, and finally give the details of the combined architecture.
The DNDF framework from [1] is introduced in this section; it is the last brick of the DDI pipeline. It differs from conventional deep neural networks in that it outputs the final predictions from a decision forest, whose split (decision) and leaf (prediction) node parameters are jointly and globally optimized. The decision functions are driven by the outputs of the lower layers of deep CNNs. This reduces the uncertainty on the routing decisions of a sample taken at the split nodes, such that the globally defined loss function is minimized. For the leaf nodes, the optimal predictions are obtained by minimizing a convex objective function, which does not require step size selection. The details follow.
Consider a classification problem with input space χ and finite output space Y. A decision tree comprises decision (or split) nodes and prediction (or leaf) nodes. Decision nodes, indexed by N, are the internal nodes of the tree, and prediction nodes, indexed by L, are the terminal/leaf nodes of the tree. Each decision node n ∈ N holds a decision function d_n(·; θ), parameterized by θ, which routes the samples along the tree branches. When a sample x ∈ χ reaches a decision node n, it is directed to the left or right sub-tree based on the output of the function d_n(x; θ). The routing here is probabilistic: the routing direction is the output of a Bernoulli random variable with mean d_n(x; θ). When the sample ends in a leaf node ℓ, the related tree prediction is given by the class distribution π_ℓ held by that leaf. In the case of stochastic routings, the leaf predictions are averaged by the probability of reaching each leaf. Thus, the final prediction for a sample x from tree D with decision nodes parameterized by θ is
\[
(\forall y \in Y)\quad P_D[\,y \mid x, \theta, \pi\,] = \sum_{\ell \in L} \pi_{\ell y}\, \mu_\ell(x \mid \theta) \tag{3.6}
\]
where π = (π_ℓ)_{ℓ∈L}. Here above, μ_ℓ(x | θ) is the routing function providing the probability that sample x will reach leaf ℓ. Note that Σ_ℓ μ_ℓ(x | θ) = 1 for any x ∈ χ.
For an explicit form of the routing function, the following binary relations, which depend on the tree's structure, are introduced: ℓ ↙ n, which is true if ℓ belongs to the left sub-tree of node n, and n ↘ ℓ, which is true if ℓ belongs to the right sub-tree of node n. The routing function then reads
\[
\mu_\ell(x \mid \theta) = \prod_{n \in N} d_n(x; \theta)^{\mathbb{1}_{\ell \swarrow n}}\, \bar{d}_n(x; \theta)^{\mathbb{1}_{n \searrow \ell}} \tag{3.7}
\]
where d̄_n(x; θ) = 1 − d_n(x; θ), and 1_P denotes the indicator function of the argument P. The decision functions deliver a stochastic routing through
\[
d_n(x; \theta) = \sigma(f_n(x; \theta)), \tag{3.8}
\]
where σ is the sigmoid function and f_n(·; θ) is a real-valued output unit of the underlying deep network. Although the product in (3.7) runs over all nodes, only the decision nodes along the path from the root node to the leaf ℓ contribute to μ_ℓ, because for all other nodes 1_{ℓ↙n} and 1_{n↘ℓ} are both 0 (with the convention x^0 = 1).
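A small sketch of the routing computation (3.6)-(3.7) for a full binary tree is given below, assuming the decision probabilities are stored level by level; it illustrates the mechanics rather than reproducing the thesis code.

import torch

def leaf_probabilities(d):
    # d: (batch, N) decision probabilities d_n(x) = sigmoid(f_n(x)) for a full
    # binary tree stored level by level; returns mu: (batch, n_leaves), Eq. (3.7)
    batch = d.shape[0]
    mu = torch.ones(batch, 1)
    begin, level_size = 0, 1
    while begin < d.shape[1]:
        d_level = d[:, begin:begin + level_size]
        # each node splits its mass: left child gets d_n, right child gets 1 - d_n
        mu = torch.stack((mu * d_level, mu * (1 - d_level)), dim=2).reshape(batch, -1)
        begin += level_size
        level_size *= 2
    return mu  # rows sum to 1, as noted below Eq. (3.6)

def tree_prediction(d, pi):
    # pi: (n_leaves, n_classes) leaf distributions; returns P_D[y | x] of Eq. (3.6)
    return leaf_probabilities(d) @ pi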
Figure 3.6: Each node n ∈ N of the tree performs routing decisions via the function d_n(·). The black path shows an exemplary routing of a sample x along the tree to reach leaf ℓ_4, which has probability µ_{ℓ4} = d_1(x) d̄_2(x) d̄_5(x). Image taken from [1].
A forest is an ensemble of k decision trees F = {D_1, . . . , D_k}; its prediction for a sample x averages the outputs of the individual trees as follows:
\[
(\forall y \in Y)\quad P_F[\,y \mid x, \theta, \pi\,] = \frac{1}{k} \sum_{h=1}^{k} P_{D_h}[\,y \mid x, \theta, \pi\,]. \tag{3.9}
\]
Note that the tree parameters are omitted for notational convenience.
Learning a decision tree, whose model is explained in the previous sections, requires estimating the decision node parameterization θ and the leaf predictions π. This is done following the Empirical Risk minimization principle with respect to a given data set T ⊂ χ × Y under the log-loss (also known as the cross-entropy loss), i.e., minimizers of the following risk term are searched:
\[
F_{\mathrm{tree}}(\theta; \pi; T) = -\frac{1}{|T|} \sum_{(x,y) \in T} \log\big(P_D[\,y \mid x, \theta, \pi\,]\big) \tag{3.10}
\]
The forest is learned by considering the ensemble of trees F, where all trees can possibly share the parameters in θ. Still, each tree can have a different structure, with a different set of decision functions and independent leaf predictions π. An illustration of the forest of decision trees, taking the parameters θ and π, can be referred to from Figure 3.7. Thus, for the forest, the following empirical risk is minimized:
\[
F_{\mathrm{forest}}(\theta; \pi; T) = -\frac{1}{|T|} \sum_{(x,y) \in T} \log\big(P_F[\,y \mid x, \theta, \pi\,]\big). \tag{3.11}
\]
Figure 3.7: Illustration of how to implement a deep neural decision forest (DNDF). Top: deep CNN with a variable number of layers, subsumed via parameters θ. FC block: fully connected layer used to provide the functions f_n(·; θ) described in Eq. 3.8. Each output of f_n is brought in correspondence with a split node in a tree, eventually producing the routing (split) decisions d_n(x) = σ(f_n(x)). The order of the assignments of output units to decision nodes can be arbitrary (the one shown allows a simple visualization). The circles at the bottom correspond to leaf nodes, holding probability distributions π_ℓ. Image taken from [1].
The risk term depends on θ through the decision functions, on which the routing probabilities depend as shown in equation 3.8. The risk is minimized with respect to θ for a given π by employing the Stochastic Gradient Descent (SGD) approach:
\[
\theta^{(t+1)} = \theta^{(t)} - \eta \frac{\partial R}{\partial \theta}(\theta^{(t)}, \pi; B)
= \theta^{(t)} - \frac{\eta}{|B|} \sum_{(x,y) \in B} \frac{\partial L}{\partial \theta}(\theta^{(t)}, \pi; x, y) \tag{3.12}
\]
where η > 0 is the learning rate and B ⊂ T is a random mini-batch. A momentum term was also used to smooth out the gradients' variations, not shown explicitly here. The gradient of the loss L with respect to θ decomposes by the chain rule through the decision functions; its component for the output unit f_n reads
\[
\frac{\partial L}{\partial f_n(x; \theta)}(\theta, \pi; x, y) = d_n(x; \theta)\, A_{n_r} - \bar{d}_n(x; \theta)\, A_{n_l}, \tag{3.14}
\]
where n_l and n_r represent the left and right children of node n, respectively, and A_m is a quantity computed over L_m ⊂ L, the set of leaves held by the sub-tree rooted in node m.
Learning prediction nodes
After the learning of θ from the previous section, let us turn to the prediction nodes. The risk term in 3.10 is minimized with respect to π when θ is fixed, which is a convex optimization problem with a global solution; all the leaf nodes are estimated jointly. The update iterations read
\[
\pi_{\ell y}^{(t+1)} = \frac{1}{Z_\ell^{(t)}} \sum_{(x, y') \in T} \frac{\mathbb{1}_{y = y'}\, \pi_{\ell y}^{(t)}\, \mu_\ell(x \mid \theta)}{P_T[\,y \mid x, \theta, \pi^{(t)}\,]} \tag{3.16}
\]
for all ℓ ∈ L and y ∈ Y, where Z_ℓ^{(t)} is a normalizing factor ensuring that Σ_y π_{ℓy}^{(t+1)} = 1. The initial π^{(0)} is chosen randomly, typically π_{ℓy}^{(0)} = |Y|^{−1}. Here, the update steps for π were interleaved with the SGD updates for θ.
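A vectorized sketch of the leaf update (3.16) is shown below; it is an illustrative implementation, with pi, mu and y as assumed tensor names.

import torch
import torch.nn.functional as F

def update_leaves(pi, mu, y, eps=1e-12):
    # pi: (n_leaves, n_classes) current leaf distributions
    # mu: (batch, n_leaves) routing probabilities; y: (batch,) integer labels
    p = mu @ pi                                  # tree prediction P_T for all classes
    p_y = p.gather(1, y.unsqueeze(1)) + eps      # P_T[y | x] for the true label
    onehot = F.one_hot(y, pi.shape[1]).float()   # realizes the indicator 1_{y = y'}
    new_pi = pi * ((mu / p_y).T @ onehot)        # numerator of Eq. (3.16)
    return new_pi / (new_pi.sum(dim=1, keepdim=True) + eps)  # normalize via Z_l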
In the proposed framework, instead of utilizing the features from a CNN network, it was proposed to inherit the representations learned from the DeConFuse network and use them in the decision forest; i.e., the decision forest was jointly trained and optimized with DeConFuse. Channel-wise representations X^{(c)} were learned for each channel c ∈ {1, 2}, and a common representation Z was finally learned from the channel-wise representations X^{(c)}.
The representation Z was passed to the DF, where a feature mask was applied, i.e., features were randomly selected from the representation to participate in each decision tree's routing process; the selected features were fed to the linear fully connected layer parameterized by θ, given by the function f_n(x; θ_n) = θ_n^⊤ x. The number of features involved was controlled by the feature ratio. After that, the sigmoid activation was applied, as given in Eq. 3.8. Then the routing function was computed and the prediction probabilities were calculated, yielding a probability for each class from each tree. Finally, the probabilities from all the trees of the forest F were averaged to get the outcome probability for each of the classes, 0 and 1 in this case. The negative log-likelihood loss was computed and back-propagated, guiding the
learning of the parameters θ. The objective function for this framework, combining the ideas of DeConFuse and DF, can be deduced from (2.10) and (3.11):
\[
\underset{T, X, \tilde{T}, Z, \theta, \pi}{\text{minimize}} \;\; \underbrace{F_{\mathrm{fusion}}(\tilde{T}, Z, X) + \sum_{c=1}^{C} F_{\mathrm{conv}}(T_1^{(c)}, \ldots, T_L^{(c)}, X^{(c)} \mid S^{(c)}) + F_{\mathrm{forest}}(\theta; \pi; T_Z)}_{J(T, X, \tilde{T}, Z, \theta, \pi)}. \tag{3.17}
\]
Hereabove, the dataset TZ was built with the learned features Z and the known
labels. Note that there was no positivity constraint anymore on the learned
representations Z.
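A schematic composition of the joint objective (3.17) could look as follows; fusion_loss, conv_loss and forest_nll are placeholder names for routines evaluating F_fusion, F_conv and F_forest, and are not defined here.

def joint_objective(T, X, T_tilde, Z, theta, pi, S, y,
                    fusion_loss, conv_loss, forest_nll):
    # Eq. (3.17): fusion term + per-channel deep CTL terms + decision forest NLL
    J = fusion_loss(T_tilde, Z, X)
    for c in range(len(S)):              # C = 2 channels, one per drug in the pair
        J = J + conv_loss(T[c], X[c], S[c])
    return J + forest_nll(theta, pi, Z, y)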
3.2.3 Experimental Setup
The inputs are the DDI matrix and the bioactivity descriptors/feature vectors of each drug, as explained in section 1.3.3. The DDI matrix dataset was divided into training and testing datasets. All the drugs are kept in the training data, such that there are 95 samples per drug. Out of these 95 samples, 60% are 1-interactions for each drug (not exceeding half of the 95, i.e., min(60% of the 1-interactions, 95//2)), and the remaining are samples from the 0-interactions. The remaining samples of 0- and 1-interactions per drug are kept in the testing data. All the training and test data samples from each interaction category per drug are selected randomly. Also, only one pair per interaction is kept, from either the upper or the lower triangle of the DDI matrix. Thus, each training and testing sample is a drug pair together with its interaction label.
During training, a drug pair was passed as the input for a sample. For each drug in
a pair, 1D feature vectors, i.e., bioactivity descriptors, were fed to the individual deep CTL pipelines. Thus, the input S gathered the bioactivity descriptors/1D feature vectors of size 384 for each channel corresponding to each drug. Since the feature vectors for each drug in the drug pair are 1D, 1D convolutions were used in each deep CTL network. The two networks' learned features/representations X^{(c)} were flattened and concatenated. These features were then passed to the linear
Transform Learning layer that acted as a fully connected layer, where the transform T̃ and the fused representation Z were learned. The features of Z were then selectively sent, by applying the feature mask, to the decision forest. The final predictions were output by averaging the predictions from each tree in the forest. The complete architecture is shown in Figure 3.8, and all the architectural and training hyperparameters are listed in Table 3.13. There, the atom ratio signifies the number of features to be kept in the representation Z, and the feature ratio is the fraction of features randomly selected from Z for each tree parameterized by θ.
Figure 3.8: DDI prediction using the combined DeConFuse and decision forest architecture - DeConDFFuse. Here C = 2 networks/channels, via each of which a drug of the drug pair (inputs S1 = Drug 1 and S2 = Drug 2) is passed along with its bioactivity descriptors/feature vector. Each channel applies two convolutional (CTL) layers with SELU and maxpooling, after which the channel features are flattened and concatenated, passed through the Transform Learning fully connected layer, and fed to the decision forest that yields the output predictions.
The proposed method was compared against three state-of-the-art benchmarks, namely:
• KGNN: this technique builds a Knowledge Graph (KG) and passes the DDI matrix and KG to a Graph Neural Network (GNN).
Table 3.13: DDI Prediction DeConDFFuse Architecture Details

Parameter                       Value
Layer-wise hyperparameters
  Layer1 - Convolution (CTL)    (1, 16, 3, 1, 1)^1
  Maxpool                       (2, 2)^2
  Layer2 - Convolution (CTL)    (1, 32, 3, 1, 1)^1
  Maxpool                       (2, 2)^2
  atom ratio                    0.75
Decision Forest hyperparameters
  #Trees                        90
  tree depth                    10
  feature ratio                 0.5
Other model hyperparameters
  Epochs                        75
  Learning Rate                 0.01
  µ                             1e-05
  λ                             0.0001
  batch size                    4096
  weight decay                  1e-05
Optimizer hyperparameters
  Optimizer Used                Adam
  Ams grad                      True
  Learning rate                 0.01
  betas                         (0.9, 0.999)
  eps                           1e-08

^1 (in_planes, out_planes, kernel_size, stride, padding)
^2 (kernel_size, stride)
• Conv-LSTM: this technique used the DDI matrix and KG as input to a DL network with a CNN and an LSTM. The KG was input to the network in the form of embeddings; the comparison here is against the embedding that gave the best results in that work (ComplEx, cf. Table 3.14).
• Graph Embedding DDI: this technique used the KG and DDI matrix as input but experimented with many different types of embedding techniques. Each of these embeddings, one by one, was passed to machine learning techniques like Random Decision Forest (RF), Gaussian Naive Bayes (GNB), and Logistic Regression (LR) [123]. Here also, the embedding type Skip-Gram was used, which gave the best results in their study.
For all the benchmarks - KGNN, Conv-LSTM and Graph Embedding DDI - the same DrugBank IDs were used as present in the considered training and testing samples. Since these methods rely on KGs, the bioactivity descriptors/features were not used; instead, the KG and embeddings were recreated for the dataset used here.
The prediction results were evaluated using the classification metrics AUPRC, AUC-ROC, F1 score, precision, recall, and accuracy. Except for AUPRC, all these metrics have been explained previously in chapter 2; thus, only AUPRC is described here.
AUPRC: it is the area under the curve constructed by calculating and plotting the precision against the recall for a single classifier at a variety of thresholds. The higher the AUPRC score, the better a classifier performs on the given task. It can be computed as the weighted mean of the precisions at each threshold, with the increase in recall from the previous threshold used as the weight.
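A one-line computation with scikit-learn is sketched below; the arrays are illustrative placeholders.

from sklearn.metrics import average_precision_score

y_true  = [0, 0, 1, 1, 0, 1]                # known-to-interact = 1
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probability of class 1
# average precision = sum_n (R_n - R_{n-1}) * P_n, the threshold-weighted
# summary of the precision-recall curve described above
auprc = average_precision_score(y_true, y_score)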
All the metrics except accuracy were computed as weighted metrics, since there was a huge class imbalance between the 0 and 1 labels. Table 3.14 reports the results for all the methods, including the Conv-LSTM variant with the ComplEx embedding and the proposed method (Ours). Also, confusion matrices were computed for each method; they are displayed in Figure 3.9.
From Table 3.14, it is seen that the benchmark Graph DDI gave the best values in terms of accuracy, F1 score and recall, while the proposed method performed well in terms of all the classification metrics used for evaluation. In fact, the next-best performance in terms of accuracy and F1 was given by the proposed method. Despite the highest F1, accuracy and recall values, Graph DDI failed to achieve the highest values for AUC-ROC and AUPRC, which are considered
Figure 3.9: Confusion matrices for the different benchmarks (including the KGNN variants Sum, Concat and Aggregate) and the proposed method - DeConDFFuse.
more relevant and important metrics for performance evaluation in the case of binary classification. The reason for this can be observed with the help of the confusion matrices in Figure 3.9. It can be seen that, with the proposed method, the highest number of known-to-interact interactions (1) were predicted correctly compared to all the benchmarks. Also, except for Graph DDI, the false positives, i.e., 0-interactions classified as 1, were fewer with the proposed method than with the other two benchmarks. Predicting the known-to-interact interactions correctly is more important for preventing ADRs, as explained before, and this is what Graph DDI did not achieve at all, or achieved nearly negligibly. Thus, the proposed method predicted the critical interactions better than any other benchmark. For the false positives too, it gave a good performance compared to the other benchmarks, except for Graph DDI; the latter is the reason why Graph DDI has three metric values higher than the proposed method. Though the number of false positives with the proposed approach was higher than with Graph DDI, it is not necessarily the case that these false positives were completely incorrect, because the 0-interactions do not signify that the drugs certainly do not interact; the interactions may simply be unknown.
In contrast, Graph DDI mostly predicted the 0 (not-known-to-interact) interactions, which goes against the study's objective, i.e., to identify the known-to-interact DDIs so as to avoid ADRs. With the proposed method, both types of interactions were classified reasonably well. It was also the most stable method, owing to the quality representations learned from the CTL based network. Each of the benchmarks was carefully examined. In KGNN, a KG and a Graph Neural Network are utilized; the latter had a large number of parameters and a high computation cost, since the neural network connects the neurons in each layer with every neuron in the preceding and following layers, and it has no distinctiveness guarantee on the kind of weights learned. The poorest performance, from Graph DDI, was due to the lack of learning ability of the traditional machine learning algorithms applied on top of the embeddings derived from the KG. Lastly, the Conv-LSTM framework carries all the disadvantages of CNNs discussed earlier.
The optimizer used for updating all the parameters of the framework, except the probability distributions π of the decision forest, is Adam, with the associated hyperparameters such as the learning rate, betas and eps mentioned in Table 3.13. Loss plots for the proposed technique can be referred to from Figure 3.10. It can be seen that, with the Adam optimizer, the loss decreased steadily to the point of stability.
Figure 3.10: Loss plot with the proposed method - DeConDFFuse.
The experiments were further carried out to understand the architecture in detail by comparing two versions - DeConDFFuse and a piecemeal version. The latter first trains the DCTL-based fusion network to learn the representations Z; these learned representations Z are then fed to a separate Random Decision Forest (RDF) module, where Z is treated as regular input to the system along with the labels. The hyperparameters of the DCTL networks are the same as in the proposed solution; for the RDF part, the hyperparameters were tuned to their best values. Both systems' performance was evaluated using the same metrics, and the results are reported in Table 3.15.
Table 3.15: Comparative Results with the DeConDFFuse (Ours) and Piecemeal approaches
From the above table, it can be clearly seen that the performance of the proposed solution is better than that of the piecemeal version. This can be further visualized with the help of the confusion matrices given below in Fig. 3.11.
Figure 3.11: Confusion matrices for the proposed method - DeConDFFuse and Piecemeal approach
On comparing the two versions, i.e., the proposed framework and the piecemeal approach, it is evident that the piecemeal version is not as good. Its prediction of 0-interactions is slightly better; however, the critical point is to predict the known-to-interact interactions (1) that are responsible for ADRs, and these are poorly predicted with the piecemeal approach compared to the proposed approach. Computation-metric wise too, the results are better with the proposed model. It can be concluded that the joint optimization and training of the Decision Forest (DF) with the DCTL-based fusion networks results in better representations, due to the added guidance from the DF that back-propagates the error from the DF classifier to the previous layers; this guidance is missing in the piecemeal approach, where the representations Z are treated as normal, fixed input.
3.3 Discussion
In this chapter, two supervised frameworks, SuperDeConFuse (SDCF) and DeConDFFuse (DCDF), were discussed, both based on CTL and extending DeConFuse. First, the SuperDeConFuse network was presented, a deep fusion end-to-end framework for processing stock trading data, leading to very good performance. In particular, the classification results are better with the proposed SDCF model than with the 1-D CNN approach. Also, the features X_c visualized for each channel and each method indicated better feature learning with SDCF. The results have shown that the presented solution (SDCF) is superior to CNN and the other state-of-the-art benchmarks.
Currently, the shortcoming of the model is that it takes slightly more time than CNN for its training. Thus, techniques that reduce the time complexity of the proposed framework will be investigated in the future to make it more efficient from this viewpoint.
The performance of SDCF showed that it is an effective tool for predicting stocks. However, stock price prediction is not the final goal of a trading system. As a next step, the use of the proposed algorithm will be investigated to study if it can generate actionable calls such as 'buy stock XYZ at a price ABC' or 'sell stock ZYX at a price CBA.' It will be interesting to see if this algorithm can make such predictions given a time horizon in the future. If possible, the present algorithm will also be extended to composite strategies such as 'pairs (longs and shorts).'
Second, DeConDFFuse was presented, a joint end-to-end framework for processing 1D multi-channel drug data. Unlike other deep learning models that separately use conventional machine learning algorithms like RDF, the proposed framework is jointly optimized and is not piecemeal. It has been applied to the binary classification task of DDI prediction, leading to good performance. The advantages of the proposed framework are the benefits of CTL applied to the two drugs of a drug pair, with representation learning that is additionally guided by the jointly optimized decision forest.
In the future, there is scope to improve performance by reducing the number of false positives. Also, the current solution to the DDI problem considers the event when two drugs are administered together; however, combinations of more than two drugs are routinely used. Thus, another extension will be the capability of the proposed framework to handle combinations of more than two drugs, which can be achieved within the proposed architecture by increasing the number of channels with the number of drugs. Lastly, although the proposed framework was designed for DDI prediction, it can be applied to other biomedical interaction problems, and such areas can be explored in the future. To conclude, the frameworks presented are supervised versions of CTL, jointly trained and optimized with additional guidance modules, and each of the proposed frameworks offers scope for improvement, as discussed above for each separately.
Chapter 4
DeConFCluster: Deep Convolutional Transform Learning based Multiview Clustering Fusion Framework
With the rapid increase in data collection sources and volume, the exploration of multiview data has gained attention. Multiview data refers to data collected from the same source but from different angles, or with different modalities or features, for example: media with different content; the same statement labeled with different tags by different individuals; and the same image captured using different features. Multiview data is richer and more informative, but more complex, than single-view data. In multiview data, the data belonging to each view carries information related to the same underlying subjects. In clustering, data instances are grouped into several groups or clusters based on the various features of the data. Multiview Clustering (MVC) accordingly leverages the multiple views of the data to perform the grouping of data instances into possible clusters.
Multiview data knowledge extraction is vital in big data mining and analytics nowadays. In this regard, many recent works suggest CNN based clustering frameworks, typically built on autoencoders. In such a work, the clustering loss is included after the encoder network, which entails the additional training of a decoder network and hence incurs extra computational cost. Furthermore, several approaches apply clustering algorithms like K-Means in a piecemeal fashion, which may lead to suboptimal representations. Hence, a Deep CTL based multiview clustering fusion framework - DeConFCluster - is introduced, which bridges all the gaps mentioned earlier and is evaluated on standard multiview image and text datasets. The results demonstrate that the proposed framework outperforms the
state-of-the-art multiview deep clustering approaches. This chapter is further organized into sections, with section 4.1 discussing the current related works for MVC. Next, the proposed formulation is explained in section 4.2. The experimental evaluations and the results are discussed in sections 4.3 and 4.4, respectively.
Multiview clustering groups subjects into subgroups using multiview data and is one of the problems that fall under big data analytics. Recently, many solutions have been proposed to perform it. These are broadly classified into two categories, generative and discriminative. Generative approaches try to model the underlying data distribution: they use generative models, with each model representing an individual view, and then find the clustering solution. In contrast, discriminative approaches directly optimize a clustering objective. The generative category includes EM based and mixture models, which can be further categorized. One such work adopts a multinomial distribution for the document clustering problem. Similarly, based on different assumptions and criteria, two versions of the multiview EM algorithm for finite mixture models are proposed in [143]. Using Convex Mixture Models, the work in [144] could find the global optimum; it also avoided the initialization and local-optima issues that otherwise require multiple executions of the algorithm. The major issue with EM based algorithms is their slow convergence; another concern is that, in some scenarios, the E-step and M-step can be unmanageable.
Another line of work comprises spectral clustering based approaches. These obtain a common clustering result under the assumption that the same or a similar eigenvector matrix is shared among all views. There are two popular variants: the first is co-training based spectral clustering, applicable when both labeled and unlabelled data are available; the second is co-regularized spectral clustering, whose objective function generally penalizes the difference between the predictors from each view, obtained in general by making each view's eigenvectors agree with a consensus. Another widely used tool is Non-negative
Matrix Factorization (NMF), which seeks two non-negative matrix factors called the basis and the indicator. In the case of MVC, some studies propose learning a common indicator matrix across the views [154, 155] for NMF. Some works propose using multiview K-Means clustering to deal with extensive data; one such work adopted a common indicator matrix across different views. Besides, another work used a categorical utility function to measure the similarity between the indicator matrix from each view and the common indicator matrix, and proposed a consensus clustering scheme.
Also, there are methods in which a direct view combination via kernels is used, e.g., defining a kernel for each view and then combining these kernels in a convex combination [161, 162]. All the methods mentioned earlier have achieved satisfactory performance on the clustering task. However, it may be challenging to handle data with high-dimensional features and nonlinear properties using the above-stated methods, since they mostly adopt shallow and linear embedding functions to reveal the structures underlying the views.
Recently, graph based MVC has also gained momentum. The authors in [59] proposed a solution wherein the graph matrices of multiple views are combined into a unified graph matrix by generating the Similarity Induced Graph (SIG) matrices for all the available views. Then, a rank constraint is applied on the graph Laplacian matrix, and the connected components produced from the unified graph give the final number of clusters.
Deep learning has emerged as a highly utilized technique for almost all real-world problems and is also used in the case of MVC. In [163], multiple autoencoders are utilized on multiview data to generate multiple latent representations, and heterogeneous graph learning is applied to fuse the generated latent representations, followed by a K-Means module for the final clusters. Further, another work extracts consensus information from multiple views and collaboratively learns the deep representations. There are also works applying Graph Convolutional Network (GCN) based MVC to fully benefit from the features embedded in attributed multiview graph data; the work in [166] used a GCN as an encoder with the most reliable view as input. In another work, autoencoders are used to learn a shared feature representation among the different views [167]. Here, the issue is the additional weight training incurred by the decoder network; regularization can be used to prevent the trivial solution, but even with the mentioned solution there are chances of over-fitting. Further, CNNs do not guarantee the learning of distinct filters. In contrast, in this chapter, CTL is combined with a fusion and clustering framework and trained in a joint end-to-end fashion. A framework is thus introduced that jointly trains and optimizes the DeConFuse and K-Means clustering modules, addressing the aforementioned shortcomings.
This section presents the proposed formulation for MVC. It extends the previously established works - the Deep CTL based K-Means framework (DCKM) and the DeConFuse fusion framework. The latter has already been discussed in Chapter 2; next, there is a brief discussion of the other prior method, DCKM, followed by the proposed formulation.
Figure 4.1: DCKM architecture. L represents the number of DCTL layers, M_{ℓc} the filter size and F_{ℓc} the number of filters of the respective layer ℓ and channel c.
4.2.1 DCKM
This framework extended the Deep CTL (DCTL) approach by adding a K-Means loss at the end. The DCTL approach from Section 2.2.2 was thereby leveraged to perform single-view clustering [168]. The loss formulation appends a K-Means clustering penalty to the DCTL objective (akin to the clustering term in (4.2) below), where β is the regularization weight associated with the K-Means clustering loss and H is the matrix of cluster indicators.
4.2.2 Proposed Formulation
In this section, the proposed DeConFCluster formulation is discussed. Previously, the framework DCKM [168] combined DCTL [67] with K-Means; here, the DeConFuse network is combined with K-Means, embedding the K-Means clustering loss as was done in DCKM [168]. The difference is the fusion step, which was not present in DCKM. The framework jointly trains and globally optimizes the DeConFuse network and the K-Means module. There are as many channels as there are views of the dataset, and each channel is processed by a DCTL network. This amounts to learning distinct transforms (T_c)_{1≤c≤C} and thus distinct and interpretable representations (X_c)_{1≤c≤C} for each channel input (S_c)_{1≤c≤C}. These channel-wise representations are further fused using TL [90] to learn a common representation Z and transform T̃. This completes the first module of the architecture. The representations are then fed as input to the second part of the framework, the K-Means clustering module, which gives the clustering results. Thus, the representations learned are also guided by the K-Means loss. The learning problem reads:
\[
\underset{T, X, \tilde{T}, Z, H}{\text{minimize}} \;\; F_{\mathrm{fusion}}(\tilde{T}, Z, X) + \sum_{c=1}^{C} F_{\mathrm{conv}}(T_1^{(c)}, \ldots, T_L^{(c)}, X^{(c)} \mid S^{(c)}) + \beta \|Z - Z H^\top (H H^\top)^{-1} H\|_F^2 \tag{4.2}
\]
1
P.Gupta, A. Goel, A. Majumdar, E. Chouzenoux and G. Chierchia, “DeConFCluster: Deep Convolutional Transform
Learning based Multiview Clustering Fusion Framework". 2023. Submitted in IEEE TNNLS
The complete architecture of DeConFCluster is summarized in Fig. 4.2.
Figure 4.2: Overview of the proposed DeConFCluster architecture. C represents the number of DeepCTL networks/channels, L is the number of DCTL layers, M_{ℓc} is the filter size and F_{ℓc} is the number of filters of the respective layer ℓ and channel c.
All the variables were learned in an end-to-end fashion. Typically, SGD could
be used as an optimizer for all the variables except H. This latter variable was
updated directly via K-Means clustering [170] at each iteration using the current
Z estimate as an input.
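A sketch of the clustering penalty in (4.2) is given below, with H recomputed from the current Z via scikit-learn's K-Means, as an illustration of the update described above (tensor shapes are assumed, with Z stored as features × samples, and every cluster assumed non-empty so that H H^T is invertible).

import torch
from sklearn.cluster import KMeans

def kmeans_penalty(Z, n_clusters, beta):
    # Z: (features, samples); H: (k, samples) one-hot cluster indicators
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        Z.detach().T.cpu().numpy())
    H = torch.eye(n_clusters)[torch.as_tensor(assign)].T.to(Z.device)
    # Z H^T (H H^T)^{-1} H replaces each column of Z by its cluster mean
    proj = Z @ H.T @ torch.linalg.inv(H @ H.T) @ H
    return beta * torch.norm(Z - proj, p='fro') ** 2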
The experiments were conducted on four multiview clustering datasets - 100leaves, ALOI, Mfeat and WebKB - which have already been discussed in section 1.3.4. Next, let us explain the network configuration. Each channel was designated for one of the views of the multiview dataset. The representations learned from these channels' networks were concatenated to pass through a fully connected layer learned via TL, yielding the fused representation Z. Finally, the clusters were obtained by inputting this representation into the K-Means module. The pipeline is shown in Fig. 4.2. The Stochastic Gradient Descent (SGD) algorithm was used as the optimizer, with λ = 0.01, µ = 0.0001 and weight decay 0.001 for all the datasets. There was also another hyperparameter, feature_ratio, which indicated the percentage of features kept in the final representation Z. All the other hyperparameters' values were grid-searched, and the ones that gave the best results were set as the final values; these values can be referred from Table 4.1.
Table 4.1: DeConFCluster hyperparameters for MVC Datasets
The results were compared with four state-of-the-art works, briefly described below:
• MCGL: a graph based learning method. Starting graphs were learned using the different views' data points and further optimized with a rank constraint on the Laplacian matrix, and then fused into a global graph. The global graph was learned with the same rank constraint on its Laplacian matrix. The cluster indicators were obtained from the global graph directly, without conducting any graph-cut technique or K-Means clustering [171].
• GMC: in this approach, each view was weighted, and the SIG matrices and the unified graph matrix were jointly learned [59].
• SiMVC & CoMVC: the work in [172] proposed these two models, the second of which aligned the distributions of the views by adding a contrastive module and selective view alignment through view prioritization. Therefore, the experiments here were conducted with the CoMVC part.
4.4 Results and Analysis
For evaluating clustering performance, metrics such as Accuracy, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) are commonly employed [173, 174]. Thus, the proposed model is evaluated with these metrics, defined as follows.
• Normalized Mutual Information (NMI): NMI is a normalized measure of the similarity between two labelings of the same data instances. It is given by
\[
\mathrm{NMI} = \frac{I(l, c)}{\max(H(l), H(c))} \tag{4.3}
\]
where I(l, c) denotes the mutual information between the true labels l and the predicted cluster labels c, and H(·) denotes the entropy.
• Adjusted Rand Index (ARI): ARI measures the similarity between two clusterings by considering all pairs of data instances that are assigned to the same or different clusters in the actual and predicted labels. The range of ARI is [−1, 1]; the higher the ARI value, the better the clustering. It is given by
\[
\mathrm{ARI} = \frac{RI - E}{\max(RI) - E} \tag{4.4}
\]
where RI = Rand Index and E is the Expected Rand Index Value for random
where a = the number of times a pair of elements belongs to the same cluster
and
X ni X nj
N
E=( ( )× ( ))/( ) (4.6)
2 2 2
samples in cluster j.
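As a concrete sketch of how both metrics are computed in practice, the snippet below evaluates equation 4.3 via scikit-learn’s NMI with average_method='max' (which normalizes by max(H(l), H(c))), and computes the pair-counting form of equations 4.4–4.6 from the contingency table, checking it against scikit-learn’s adjusted_rand_score. The toy labels are made up for illustration.

import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

labels_true = np.array([0, 0, 1, 1, 2, 2])   # toy ground-truth labels
labels_pred = np.array([0, 0, 1, 2, 2, 2])   # toy predicted clusters

# NMI with max-normalization, matching equation (4.3)
nmi = normalized_mutual_info_score(labels_true, labels_pred, average_method="max")

# ARI from the contingency table (pair-counting instantiation of (4.4)-(4.6))
M = contingency_matrix(labels_true, labels_pred)   # counts n_ij
index = comb(M, 2).sum()                           # RI term: sum_ij C(n_ij, 2)
a = comb(M.sum(axis=1), 2).sum()                   # same-cluster pairs, true labels
b = comb(M.sum(axis=0), 2).sum()                   # same-cluster pairs, predicted labels
E = a * b / comb(M.sum(), 2)                       # expected index, equation (4.6)
ari = (index - E) / ((a + b) / 2 - E)              # equation (4.4)

assert np.isclose(ari, adjusted_rand_score(labels_true, labels_pred))
print(f"NMI = {nmi:.3f}, ARI = {ari:.3f}")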
The results of the proposed model and the benchmarks on all four datasets are reported in Table 4.2. It can be observed from Table 4.2 that, for all datasets, the proposed model performed better than the state-of-the-art methods, except for the NMI values on Mfeat and ALOI and the ARI in the case of ALOI. It is worth noting that the proposed technique performed well in the case of WebKB and ALOI, both of which have few samples relative to the number of clusters to be identified. On Mfeat and ALOI it still reached good Accuracy values, and good ARI values on Mfeat. Thus, the proposed method performed well for the challenging datasets and only slightly worse for the easier ones, making its overall performance favorable.
Table 4.2: Clustering Results. All metrics in (%)

Also, the convergence plots for all the datasets were generated; these can be referred to in Fig. 4.3. Using SGD as the optimizer, it could be clearly inferred that the given solution converged to a point of stability. The SGD parameters, such as mini-batch size and learning rate, are given in Table 4.1 for all the considered datasets.
This section presents the results of the three ablation studies performed for all the datasets. The first experiment varied the values of the regularizers λ and µ associated with the log-det and Frobenius-norm penalty terms in the CTL and TL formulations, equations 2.11 and 2.14, respectively.
Figure 4.4: Ablation Studies Result Plots on λ, µ
Several combinations of values were evaluated for the two penalty regularizers, down to (10⁻⁴, 10⁻⁵). The results can be referred to in Table 4.3 and are also displayed graphically for all three metrics, Accuracy, NMI and ARI, in Fig. 4.4. It can be clearly concluded from the results that the penalization terms play an important role: when varying the regularizers of these penalizations, the performance is robust for three of the datasets. However, the degraded results for Mfeat at lower values of these regularizers demonstrate that the penalties help to learn better representations and hence should be part of the formulation.
Table 4.3: Ablation Studies Results on λ, µ
Figure 4.5: Ablation Studies Result Plots on K-Means Regularizer

Secondly, experiments were carried out with the regularizer β associated with the K-Means clustering loss in equation 4.2. The values of β lie in the range [0, 1], specifically (0.0, 0.1, 0.3, 0.5, 0.8, 1.0). The results are presented both in tabular form and graphically, and can be referred to in Table 4.4 and Fig. 4.5, respectively. It could be observed that, for all the datasets, the performance was better with a non-zero β, which signified that the K-Means loss is an important term of the final objective.
This inference was further validated by the third experiment, where the results were computed using a piecemeal version of the proposed model: first, the representations were learned from the DeConFuse networks alone, and then passed through the K-Means clustering module to get the final clusters, i.e., with β = 0. The results can be referred to in Table 4.5, from which it was clearly inferred that the jointly trained model outperformed the piecemeal variant.
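Under the same assumed names as the earlier training sketch, the piecemeal baseline amounts to the following: train without the K-Means term (β = 0), then cluster the learned representations once, post hoc. The train_deconfuse routine is hypothetical.

from sklearn.cluster import KMeans

# Piecemeal baseline: no joint K-Means loss during training (beta = 0).
model = train_deconfuse(views, beta=0.0)      # hypothetical training routine
Z = model(views).T.detach().cpu().numpy()     # N x d learned representations

labels = KMeans(n_clusters=K).fit_predict(Z)  # clustering applied separately, post hoc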
4.5 Discussion
In this chapter, the DeConFCluster framework for multiview clustering was discussed. The proposed framework jointly trains the DCTL-based DeConFuse networks and the K-Means clustering module. A notable advantage of this framework is that it does not have the additional overhead of learning the weights of a decoder network. It also performs well in data-constrained scenarios where the number of data instances is low and the number of classes is high, for example, in the case of 100leaves and WebKB. Moreover, the CTL losses enforce distinct filters and thus, in turn, help to learn more interpretable filters. Overall, the framework demonstrated higher clustering scores as compared to the current state-of-the-art MVC frameworks, with only a few metrics slightly lower in the case of Mfeat and ALOI.
Chapter 5
Conclusion
The proposed works in this thesis focused on modeling various prediction problems as multi-channel fusion problems. The frameworks proposed are based on the recently established technique CTL, and hence are variants that deal in the analysis domain, covering both unsupervised and supervised settings. In this section, the chapter-wise contributions are briefly summarized.

First, unsupervised fusion frameworks were modeled as both shallow and deep architectures based on CTL, namely ConFuse and DeConFuse, respectively. These frameworks were applied to the problems of stock trading and stock forecasting. The most significant advantage of these frameworks was that they avoided the effort of re-training the network that is required with most other techniques, especially CNNs. In summary, the same representations were utilized for both regression and classification without two different trainings.
Second, supervised fusion frameworks were proposed based on CTL. The first framework involved fusion via TL over the representations learned from the individual CTL-based channels; further, there was a linear fully connected layer followed by the cross-entropy loss. This framework was applied to the stock trading problem. Its performance showed that the proposed method eliminated the issue of dead neurons and did not require employing an activation function between the last convolutional layer and the fully connected (fusion) layer learned via TL, unlike what is required in CNNs. All the other advantages, like distinct filters and interpretable representations, are also present here, since the frameworks are based on CTL and TL. The learned representations were even compared with those from CNNs, and the features and results from the proposed frameworks were found to be better.
The second supervised framework, DeConDFFuse, extended the DeConFuse network and jointly trained and optimized it with a Decision Forest (DF). Again, all the benefits of CTL are applicable here as well. Additionally, it extracted individual and cross-channel features of the drugs, finding the most relevant features of the drugs that interact with each other. The results from this framework are superior to the benchmarks, indicating the benefit of employing it for the DDI prediction task.
Lastly, the proposed DeConFCluster framework performs the multiview clustering task utilizing representations obtained by fusing the individual views’ representations; thus, it learns both individual-view and shared information. It comprises the DeConFuse network and a K-Means module that are jointly trained and optimized. Therefore, the representations learned are beneficial, since they have the advantages of CTL and are also well guided through the K-Means loss. The same is observed in the clustering results. Further, the framework prevented the additional training of a decoder network that is otherwise required, and it performed well in scenarios where the number of data instances is low and the number of classes is high.
It is believed that the proposed algorithms are generic and can be used not only for the kinds of problems discussed throughout this dissertation but also in other research fields where one can formulate the problem to be solved as a multi-channel fusion problem. Both unsupervised and supervised frameworks have been proposed; thus, the proposed frameworks can solve both genres of problems.

Several directions remain for future work. The proposed solutions were applied to stock prediction problems that required day-wise predictions; however, they could be extended to finer-grained trading decisions such as BUY and SELL signals at shorter horizons. Also, the proposed solutions have dealt with 1D data so far and, thus, could be extended to 2D data, for example, to estimate dense depth maps from sparse maps; such a fusion using the proposed CTL-based frameworks is a promising direction. Similarly, the supervised framework was employed for DDI prediction, but it is believed that it can also be used to predict different types of drug-related interactions. Also, the problem targeted currently involves two drugs administered together, whereas many times more than two are administered together in a real scenario; the latter can be easily handled with the proposed network.
Moreover, the unsupervised frameworks are developed to extract features for performing the dual tasks of regression and classification, and the same features could serve other tasks as well, like anomaly detection. Anomaly detection requires finding and identifying outliers to prevent fraud, adversary attacks, network intrusions, etc. Likewise, the clustering framework can even be utilized to extract such features to find clusters, for example, to segment customers pertaining to a particular market. In all such applications, one can analyze the available data and extract meaningful representations to perform the mentioned tasks using the proposed frameworks. Also, the unsupervised (ConFuse and DeConFuse) and supervised frameworks (SuperDeConFuse and DeConDFFuse) currently operate on views that are similar in nature. Nevertheless, extending them to heterogeneous modalities, like image, text, video, and audio information, can be considered; for example, the Caltech-UCSD Birds dataset provides both image and text information.
Finally, semi-supervised extensions of the proposed frameworks are of immense importance, as these will help perform tasks involving partially labeled data; this basically reduces expenses on manual annotation and cuts down on data labeling effort.
References
monitoring using ECG and PPG for personal healthcare,” Journal of Medical
with Applications, vol. 110, pp. 352 – 362, 2018. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0957417418303646
[6] F. Rodrigues, I. Markou, and F. C. Pereira, “Combining time-series
and textual data for taxi demand prediction in event areas: A deep
learning approach,” Information Fusion, vol. 49, pp. 120 – 129, 2019.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253517308175
[7] S. Daneshvar and H. Ghassemian, “MRI and PET image fusion by combining IHS and retina-inspired models,” Information Fusion, vol. 11, no. 2, pp. 114–123, 2010.
S1566253515000536
B. Zhao, Y. Xiong, and D.-Q. Wei, “MDF-SA-DDI: predicting drug–drug
https://doi.org/10.1093/bib/bbab421
learning for data fusion,” Information Fusion, vol. 57, pp. 115–129, 2020.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253519303902
[14] C. Ounoughi and S. Ben Yahia, “Data fusion for ITS: A systematic
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253522001087
[15] I. Belhajem, Y. Ben Maissa, and A. Tamtaoui, “A robust low cost approach
for real time car positioning in a smart city using extended kalman fil-
806–811.
[16] I. Belhajem, Y. M. Ben, and A. Tamtaoui, “Improving vehicle localization
in a smart city with low cost sensor networks and support vector machines,”
and Autonomous Systems, vol. 74, pp. 128–147, 2015. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0921889015001529
fusion system for moving object detection and tracking in urban driving
’12. New York, NY, USA: Association for Computing Machinery, 2012, p.
ization in vehicular ad hoc networks using data fusion and v2v communi-
cle navigation,” IEEE Transactions on Intelligent Transportation Systems,
2004.
[24] Q. Miao, Q. Li, and D. Zeng, “Mining fine grained opinions by using
[26] T. Rohe, A.-C. Ehlis, and U. Noppeney, “The neural dynamics of hier-
[27] N. Nesa and I. Banerjee, “Iot-based sensor data fusion for occupancy
sensing using dempster–shafer evidence theory for smart buildings,” IEEE
https://www.sciencedirect.com/science/article/pii/S1566253518304731
[29] S.-H. Chen, J.-S. Pan, K. Lu, and H. Xu, “Driving behavior analysis of
[33] F. Li, Y. Fan, X. Zhang, C. Wang, F. Hu, W. Jia, and H. Hui, “Multi-
feature fusion method based on eeg signal and its application in stroke
//www.sciencedirect.com/science/article/pii/S0020025512004185
[35] Z. Yao and W. Yi, “License plate detection based on multistage information
fusion,” Information Fusion, vol. 18, pp. 78–85, 2014. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S1566253513000663
https://www.sciencedirect.com/science/article/pii/S0957417417301331
//www.sciencedirect.com/science/article/pii/S030645732200214X
decomposition,” IEEE Journal of Biomedical and Health Informatics,
[39] F. Wang, K. Wang, and F. Jiang, “An improved fusion method of fuzzy
[40] S. Xiao, Y. Zhang, X. Liu, and J. Gao, “Alert fusion based on cluster and
https://doi.org/10.1109/ICHIT.2008.197
https://www.sciencedirect.com/science/article/pii/S1568494616300771
Signal Processing, vol. 41, no. 1, pp. 239–253, 2013. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0888327013002963
in 2011 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE
pii/S0377221716301096
unified deep learning framework for time-series mobile sensing data pro-
[46] Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. Zhao, “Time series classification
[47] B. Pu, Y. Liu, N. Zhu, K. Li, and K. Li, “ED-ACNN: Novel attention
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1568494620306268
detection from a vehicle using deep learning network and future integration
with multi-sensor fusion algorithm,” in WCX™ 17 SAE World Congress, March 2017.
Letters, vol. 24, no. 12, pp. 1795–1803, 2003. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0167865503000047
[50] X. Chen, J. Chen, G. Cheng, and T. Gong, “Topics and trends in artificial
intelligence assisted human brain research,” PLoS ONE, vol. 15, no. 4,
2020.
“Multi-view deep learning for rigid gas permeable lens base curve fit-
2016.
[54] Y. Chen, C. Li, P. Ghamisi, X. Jia, and Y. Gu, “Deep fusion of remote
sensing data for accurate classification,” IEEE Geoscience and Remote
https://www.sciencedirect.com/science/article/pii/S0957417420309349
3932, 2021.
via multi-manifold regularized nonnegative matrix factorization,” in Pro-
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,
paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
Representations, 2015.
Computer Society, jun 2015, pp. 1–9. [Online]. Available: https:
//doi.ieeecomputersociety.org/10.1109/CVPR.2015.7298594
[66] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
https://www.sciencedirect.com/science/article/pii/S0957417423007406
“Deconfcluster: Deep convolutional transform learning based multiview
“Deep learning for time series classification: a review,” Data Mining and
https://doi.org/10.1162/neco.1997.9.8.1735
[75] Z. Wang, W. Yan, and T. Oates, “Time series classification from scratch
classification,” in International Joint Conference on Neural Networks
A. Iosifidis, “Forecasting stock prices from the limit order book using con-
stock trading model with 2-D CNN trend detection,” in 2017 IEEE Sympo-
2017.
[85] H. Bauschke, R. Burachik, P. Combettes, and D. Luke, Fixed-Point Al-
[87] Y. Soun, J. Yoo, M. Cho, J. Jeon, and U. Kang, “Accurate stock movement
2022 IEEE International Conference on Big Data (Big Data), 2022, pp.
1691–1700.
backward splitting, and regularized Gauss–Seidel methods,” Mathematical
[94] P. Combettes and J.-C. Pesquet, “Deep neural network structures solv-
https://arxiv.org/abs/1808.07526.
2013.
[98] S. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and
[99] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
[100] C. Kocak, “Arma( p,q ) type high order fuzzy time series forecast method
based on fuzzy logic relations,” Applied Soft Computing, vol. 58, pp.
92–103, 2017.
[101] G. Zumbach and L. Fernández, “Option pricing with realistic ARCH processes,”
[102] Z. Lin, “Modelling and forecasting the stock market volatility of SSE com-
model,” Emerging Market Review, vol. 33, pp. 140 – 154, 2017.
[104] R. Bisoi and P. Dash, “A hybrid evolutionary dynamic neural network for
stock market trend analysis and prediction using unscented kalman filter,”
[106] F. Ming, F. Wong, Z. Liu, and M. Chiang, “Stock market prediction from
WSJ: text mining via sparse matrix factorization,” in 2014 IEEE International
163–173, 2017.
2017.
short-term stock prices using ensemble methods and online data sources,”
Expert Systems with Applications, vol. 112, pp. 258 – 273, 2018.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417418303622
[110] Y. Chen and Y. Hao, “A feature weighted support vector machine and
Systems with Applications, vol. 80, pp. 340–355, 2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0957417417301367
5501–5506, 2013.
using recurrent neural network and technical indicators,” Neural Compu-
[113] W. Long, Z. Lu, and L. Cui, “Deep learning-based feature engineering for
Textbook of Pediatrics.
[120] M. Yu, S. Kim, Z. Wang, S. Hall, and L. Li, “A bayesian meta-analysis on
https://www.sciencedirect.com/science/article/pii/S0020025519304116
antagonist using hybrid chemical features,” Cells, vol. 10, no. 11, 2021.
p. 726, 2019.
science/article/pii/S0020025521009294
[125] S. K. Sahu and A. Anand, “Drug-drug interaction extraction from
https://www.sciencedirect.com/science/article/pii/S1532046418301606
https://www.sciencedirect.com/science/article/pii/S1532046418302144
cal deep matrix factorization for "in silico" antiviral repositioning: Appli-
[130] J.-Y. Shi, H. Huang, J.-X. Li, P. Lei, Y.-N. Zhang, K. Dong, and S.-M.
Yiu, “Tmfuf: a triple matrix factorization-based unified framework for
[131] J.-Y. Shi, H. Huang, J.-X. Li, P. Lei, Y.-N. Zhang, and S.-M. Yiu, “Predict-
ing comprehensive drug-drug interactions for new drugs via triple matrix
108–117.
S1046202319303421
[133] X. Lin, Z. Quan, Z.-J. Wang, T. Ma, and X. Zeng, “Kgnn: Knowledge
//www.sciencedirect.com/science/article/pii/S0020025517307776
Systems with Applications, vol. 186, p. 115810, 2021. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0957417421011787
science/article/pii/S0957417419301186
https://www.sciencedirect.com/science/article/pii/S0020025518303487
https://www.sciencedirect.com/science/article/pii/S0957417421003171
S0957417405002496
Systems with Applications, vol. 84, pp. 281–289, 2017. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0957417417303202
[143] X. Yi, Y. Xu, and C. Zhang, “Multi-view em algorithm for finite mixture
Inc., 2007.
[146] J. Sun, J. Lu, T. Xu, and J. Bi, “Multi-view sparse co-clustering via prox-
ceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37,
http://arxiv.org/abs/1707.09866
rithm based on global and local structure preserving,” IEEE Access, vol. 9,
[150] Y. Ye, X. Liu, J. Yin, and E. Zhu, “Co-regularized kernel k-means for multi-
[152] C. Zhang, H. Fu, Q. Hu, X. Cao, Y. Xie, D. Tao, and D. Xu, “Generalized
Analysis and Machine Intelligence, vol. 42, no. 1, pp. 86–99, 2020.
clustering by squeezing hybrid knowledge from cross view and each view,”
932–944, 2013.
[156] X. Cai, F. Nie, and H. Huang, “Multi-view k-means clustering on big data,”
Trans. Knowl. Discov. Data, vol. 12, no. 4, apr 2018. [Online]. Available:
https://doi.org/10.1145/3182384
12th ACM International Conference Knowledge Discovery Data Mining
[162] W. Shao, L. He, C.-t. Lu, and P. S. Yu, “Online multi-view clustering with
2021.
[164] J. Xu, Y. Ren, G. Li, L. Pan, C. Zhu, and Z. Xu, “Deep embedded multi-
of graph convolutional networks with laplacian rank constraints,” Neural
[167] S. Fan, X. Wang, C. Shi, E. Lu, K. Lin, and B. Wang, “One2multi graph
pp. 211–215.
arXiv:1512.07548, 2015.
[171] K. Zhan, C. Zhang, J. Guan, and J. Wang, “Graph learning for multiview
clustering,” IEEE Transactions on Cybernetics, vol. 48, no. 10, pp. 2887–
2895, 2018.
(CVPR 2021), Los Alamitos, CA, USA, jun 2021, pp. 1255–1265.
[173] X. Peng, S. Xiao, J. Feng, W.-Y. Yau, and Z. Yi, “Deep subspace clustering
[174] X. Peng, J. Feng, J. T. Zhou, Y. Lei, and S. Yan, “Deep subspace clustering,”