By
POOJA GUPTA
(PhD18018)
NEW DELHI – 110020
SEPTEMBER, 2023
INFORMATION FUSION USING CONVOLUTIONAL TRANSFORM LEARNING
By
POOJA GUPTA
PhD18018
A Thesis
Submitted in partial fulfilment of the requirements for the degree of
Doctor of Philosophy
NEW DELHI – 110020
SEPTEMBER, 2023
Certificate
This is to certify that the thesis titled Information Fusion using Convolutional Transform Learning, submitted by Pooja Gupta (PhD18018), is a record of original work carried out under my supervision. In my opinion, the thesis has reached the standard fulfilling the requirements of the degree of Doctor of Philosophy.
The results contained in this thesis have not been submitted in part or full to any other university or institute for the award of any degree or diploma.
September, 2023
There are many real-world problems pertaining to the need for the fusion of information from multiple sources. Consider, for example, the problem of demand forecasting, which requires estimating the power consumption at a future point given the information available until the current instant. For building-level forecasting, the inputs are usually power consumption, weather (temperature, humidity), and occupancy. This is a crucial problem in smart grids, ranging from planning electricity generation to preventing non-technical losses. Likewise, many such real-world examples can be cast as multi-channel information fusion problems. Thus, we need techniques whereby this varied information from multiple sources can be combined/fused to predict value(s) that contribute significantly to future decision making.
A bounty of techniques has been proposed so far for multi-channel fusion, yet hardly any of them is an end-to-end fusion formulation. A few such solutions are based on deep learning and Statistical Machine Learning (SML) algorithms. However, existing deep learning solutions typically involve Convolutional Neural Networks (CNNs). CNNs might not guarantee distinct filters; hence, quality representations might not be obtained, which could lead to redundancy. Secondly, CNNs are supervised and therefore require large labelled datasets, which are not readily available in every domain. Lastly, SML algorithms are largely prone to overfitting, as they rely heavily on the quality of the input features. Thus, end-to-end, multi-channel, unsupervised as well as supervised Convolutional Transform Learning (CTL) based solutions are proposed that bridge all these gaps. The problems targeted lie in multiple domains, including financial, biomedical, and multiview image and text datasets.
Firstly, this dissertation proposes unsupervised multi-channel fusion solutions to two problems in the financial domain - stock trading (trend prediction/classification) and stock forecasting (price prediction/regression), both of which involve time-series data. The proposed approach preserves the true univariate nature of the time-series instead of treating it as a 2D matrix/image, as other frameworks do. Also, the given solution is highly efficient: a single framework is trained, and the obtained features can be utilized for both classification and regression tasks. The latter benefit cannot be achieved with CNNs.
Secondly, multiple information fusion problems are solved by supervised frameworks based on CTL and deep learning paradigms. Specifically, one of the frameworks caters to the problem of stock trading; it eliminates the issue of dead ReLU and guarantees more diverse representations, helping obtain better performance than state-of-the-art techniques. The latter has been validated via a fair comparison with CNN, which the proposed method supersedes. Next, an information fusion solution is given that is a supervised, jointly trained and optimized approach based on CTL and Decision Forest (DF) for predicting Drug-Drug Interactions that could lead to Adverse Drug Reactions (ADRs), instead of utilizing the two components in a piecemeal fashion.
Lastly, this thesis contributes to solving the multiview clustering fusion problem while handling the challenge of data-constrained scenarios. It involves multiview datasets from the image and text categories. A joint optimization of Deep CTL (DCTL) and K-Means clustering is proposed. It avoids the piecemeal approach and learns representations from the clustering perspective with the help of the K-Means clustering loss.
Dedication
Acknowledgements
I owe a big thanks to my advisor Prof. Angshul Majumdar, who is behind this day when I can call myself a researcher and can prepend Dr. to my name. His constant guidance, support and friendly behaviour have helped me sail through my journey of research. He has always provided an outstanding environment of research and infrastructure, and given his time and effort for brainstorming discussions. Prof. Angshul has always lent his ears to my problems in tough times and helped me to overcome them with his words of encouragement and motivation. I am truly blessed to have him as a supervisor, a mentor and a friend.
I would like to thank the Indraprastha Institute of Information Technology Delhi, which gave me the opportunity to be part of it as a researcher. The institute has also provided excellent infrastructure and proactively solved any issues in accessing necessary resources; I especially thank Mr. Adarsh Kumar Agarwal from the IT helpdesk, among others. I also want to thank my Internal Committee members Dr. Pushpendra Singh and Dr. Sanat Biswas, and collaborators Dr. Emilie Chouzenoux, Dr. Giovanni Chierchia and Dr. Ronita Bardhan, for providing their insightful comments in the research assessments and collaboration works respectively.
Next, I want to thank my Grandparents, Late Shri. Ram Nath Gupta and Late
Smt. Shanti Devi Gupta for continuously showering blessings. I am thankful
to my parents (Dr. Satish Chandra Gupta and Mrs. Sunita Gupta) for working
hard to provide me with the privilege of having a good life and attaining this
prestigious degree. I am also grateful to my in-laws (Mr. Ashok Kumar Gupta
and Mrs. Manju Gupta) for understanding my ambitions and supporting me
post-marriage during my journey. I especially want to thank my husband, Ankur,
for his care, patience, encouragement, and unwavering support throughout this
journey. My daughter Vedanshi, born recently, has been an integral part of this
journey, for she has been my lucky charm. She has always made me feel alive in
the time of melancholy, for which I feel really blessed. I am also grateful to my
brother Rishabh and brother-in-law Akshay Kumar Gupta for their never-ending
love, motivation and support.
I also thank my labmates - Jyoti Maggu, Aanchal Mongia, Priyadarshini Rai,
Shalini Sharma, Anurag Goel, Shikha Singh, Megha Gupta Gaur, Vanika Singhal
and Kriti Gupta for their companionship and never-ending tea-time stories. I
humbly acknowledge their help and support. I also present my sincere gratitude
to my friends from other labs Anand Singh, Gunjan Singh, Dhananjay Kimothi,
Charul Paliwal, Saurabh Aggarwal, Neetesh Pandey, Ashwini Teertha, Smriti
Chawla and Sarita for always supporting me. Last but not least, I thank
my all time friends Naina Gupta, Sonal Goel, Shalini Sheoran, Pragati Sharma,
Pradyumn Nand, Mona Nandwani, Love Chopra, Prachi Luthra Chopra, Rachit
Rakhyani and Bani Rakhyani for standing by my side always.
(POOJA GUPTA)
Contents
Abstract i
Dedication iii
Acknowledgements iv
List of Tables x
1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Probabilistic approach . . . . . . . . . . . . . . . . . . 4
1.2.2 Machine Learning-based Frameworks . . . . . . . . . . 6
1.2.3 Fuzzy based systems . . . . . . . . . . . . . . . . . . . 8
1.2.4 Deep Learning based fusion approaches . . . . . . . . . 10
1.3 Datasets Descriptions . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 National Stock Exchange (NSE) dataset . . . . . . . . . 13
1.3.2 Past 22 years stock data . . . . . . . . . . . . . . . . . 13
1.3.3 Drug-Drug Interaction Data . . . . . . . . . . . . . . . 14
1.3.4 Multi-view datasets . . . . . . . . . . . . . . . . . . . . . 16
1.4 Research Contributions . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
List of Abbreviations 1
3.1 SuperDeConFuse: A supervised deep convolutional transform
based fusion framework for financial trading systems . . . . . . 54
3.1.1 Literature Review - Stock Trading . . . . . . . . . . . . 54
3.1.2 Proposed Formulation . . . . . . . . . . . . . . . . . . 56
3.1.3 Optimization algorithm . . . . . . . . . . . . . . . . . . 59
3.1.4 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 60
3.1.5 Experimental Evaluation . . . . . . . . . . . . . . . . . 62
3.1.6 Results and Analysis . . . . . . . . . . . . . . . . . . . 66
3.2 DeConDFFuse: Predicting Drug-Drug Interaction using joint
Deep Convolutional Transform Learning and Decision Forest
fusion framework . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.2.1 Literature Review - DDI . . . . . . . . . . . . . . . . . 85
3.2.2 Proposed Formulation . . . . . . . . . . . . . . . . . . 88
3.2.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . 97
3.2.4 Results and Analysis . . . . . . . . . . . . . . . . . . . 100
3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5 Conclusion 129
5.1 Summary of Contribution . . . . . . . . . . . . . . . . . . . . . 129
5.1.1 Unsupervised multi-channel CTL based fusion frame-
works - ConFuse(shallow) and DeConFuse(Deep) . . . . 129
5.1.2 Supervised multi-channel fusion frameworks - SuperDe-
ConFuse and DeConDFFuse . . . . . . . . . . . . . . . 130
5.1.3 Multiview Clustering Framework based on CTL - De-
ConFCluster . . . . . . . . . . . . . . . . . . . . . . . 131
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
References 135
List of Tables
3.6 Summary of Financial Results for Stock Trading . . . . . . . 76
3.7 Ablation Study performance for BUY Class . . . . . . . . . . 79
3.8 Ablation Study performance for HOLD Class . . . . . . . . . 80
3.9 Ablation Study performance for SELL Class . . . . . . . . . 80
3.10 Ablation Study performance weighted results . . . . . . . . . 80
3.11 Ablation Study Financial Results . . . . . . . . . . . . . . . . 81
3.12 Comparative Summary Results for Stock Trading for window sizes 5, 10, 20 . . . . . . . . . . . . . . . . . . . . . . . . 82
3.13 DDI Prediction DeConDFFuse Architecture Details . . . . . . . 99
3.14 DDI Prediction Results . . . . . . . . . . . . . . . . . . . . . 101
3.15 Comparative Results with DeConDFFuse and Piecemeal approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
List of Figures
3.4 Visualization of channel-wise features Xc for SDCF versus a
standard CNN for one sample of stock BSELINFRA.BO (with
16 × 1 as the shape of the features obtained and resized to 8 × 2
for better visualization) . . . . . . . . . . . . . . . . . . . . . . 77
3.5 Evolution of the loss during training for a few stock examples of
the proposed model with (a) CTL 1 layer, (b) CTL 2 layers, (c)
CTL 3 layers and (d) CTL 4 layers. . . . . . . . . . . . . . . . . 83
3.6 Each node n ∈ N of the tree performs routing decisions via function dₙ(·). The black path shows an exemplary routing of a sample x along a tree to reach leaf ℓ₄, which has probability µ_{ℓ₄} = d₁(x) d̄₂(x) d̄₅(x). Image taken from [1]. . . . . . . . . . . 91
3.7 Illustration of how to implement a deep neural decision forest (DNDF). Top: Deep CNN with a variable number of layers, subsumed via parameters θ. FC block: Fully Connected layer used to provide functions fn(·; θ), described in Eq. 3.8.
Each output of fn is brought in correspondence with a split
node in a tree, eventually producing the routing (split) decisions
dn (x) = σ(fn (x)). The order of the assignments of output units
to decision nodes can be arbitrary (the one shown allows a sim-
ple visualization). The circles at the bottom correspond to leaf
nodes, holding probability distributions πℓ . Image taken from [1]. 93
3.8 DDI prediction using combined DeConFuse and decision for-
est architecture- DeConDFFuse. Here C = 2, the number of
networks/channels via each of which a drug in the drug pair
is passed along with its bioactivity descriptors/ features vector,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.9 Confusion matrices for different benchmarks and the proposed
method- DeConDFFuse . . . . . . . . . . . . . . . . . . . . . . 102
3.10 Loss plot with the proposed method - DeConDFFuse. . . . . . . 105
3.11 Confusion matrices for the proposed method - DeConDFFuse
and Piecemeal approach . . . . . . . . . . . . . . . . . . . . . 106
4.2 Overview of the proposed DeConFCluster architecture. C rep-
resents the number of DeepCTL networks/channels, L is the
number of DCTL layers, Mℓc is the filter size and Fℓc is the
number of filters of the respective layer ℓ and channel c. . . . . . 119
4.3 Loss Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.4 Ablation Studies Result Plots on λ, µ . . . . . . . . . . . . . . . 125
4.5 Ablation Studies Result Plots on K-Means Regularizer . . . . . 126
Chapter 1
Introduction
Information Fusion (IF) refers to integrating information from multiple sources, which may be diverse and sometimes conflicting as well. This integration produces specific and comprehensive unified information about an entity.
Let us consider the demand forecasting problem that requires estimating the power consumption at a future point given the information available until the current instant. Usually, in this respect, the inputs at the building level are power consumption, weather (temperature, humidity), occupancy etc. It is pertinent to solve this problem as it is a crucial aspect in smart grids, ranging from planning electricity generation to preventing non-technical losses.
Consider another example, the problem of blood pressure estimation. The inputs are usually signals from sensors such as the photoplethysmogram (PPG) [3], and the goal is to estimate the systolic and diastolic pressures. Other applications fuse inputs such as weather, traffic and power consumption, etc. In the same domain, the work [6] deals with the problem of forecasting taxi demand in event areas. It is done by fusing publicly available data from multiple sources.
Image fusion is another area where the information from two or more images is combined into a single, more informative image. It is widely applied in medical imaging. For example, to improve the functional and spatial information content, Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images are fused using intensity-based methods. Another example is multi-sensor data fusion applied in the medical domain, which uses fused video displays and related modalities.
Similarly, there are other domains that we will not discuss at length but briefly mention where IF plays its role: opinion mining based on sentiment analysis [9], stock price prediction [10], drug-drug interaction [11], human activity recognition, etc. This thesis focuses on frameworks that learn better representations for solving problems in the analysis of such multi-source data. The tasks under the supervised category are regression and classification, and clustering under the unsupervised category.
1.2 Background

Various kinds of solutions have been proposed for solving the problems in IF, from probabilistic methods and statistical machine learning to deep learning. We will briefly discuss these categories below.

1.2.1 Probabilistic approach

Probabilistic approaches have been widely adopted for fusion in Intelligent Transportation Systems (ITS). According to one of the fusion-based ITS surveys [14], most of the studies had used probabilistic techniques (81 out of 135 articles studied in the survey were based on probabilistic fusion). It is worth mentioning that these include the Kalman Filter (KF) algorithm and its variations, e.g., Extended Kalman Filter (EKF) [15–21], Sequential Kalman Filter (SKF) [22] etc. The kinds of applications under ITS covered here concerning KF are car or vehicle positioning [15, 22] in a smart city, vehicle localization [16, 19, 20], moving object detection and tracking [18], navigation [21] etc.
Opinion Mining (OM) is another area of application where IF finds its scope and where probabilistic models are applied [23, 24]. The study in [25] presents an "Enterprise IF" framework that fuses information relevant to an enterprise's business. The latter includes client feedback and any noteworthy news about events that could affect it. Also, it sometimes involves the corporate's own data for analysis. Thus, such a framework depends on multiple sources of information - news sourced from platforms like Twitter, feedback sourced from customers, and feeds from specific blogs. For this purpose, they use a "blackboard architecture" of processing nodes. The study's authors observed a dip in sales of a given product after higher negative feedback. They stated that even though their analysis was ex-post, the unstructured data mining synchronized with sales data could have provided insights to perform better marketing campaigns and find a better market niche.
The Internet of Things (IoT) is also an area where IF finds its application. Consider a smart home, i.e., where data is collected through different sensors installed in the home. One such problem that can be solved is knowing about a person's occupancy while preserving privacy. In the study [27], sensor data, including temperature, humidity, light and CO2, are used to detect the occupancy in a room. The authors fuse the sensor evidence and threshold the resulting PMA value for a final decision [27]. However, Dempster-Shafer's theory poses its own difficulties: one must estimate the probability density function and define a priori probabilities. Also, it becomes hard when dealing with complex data.
1.2.2 Machine Learning-based Frameworks

Traditional machine learning algorithms have also been extensively used to solve many IF-based problems. In the study [29], the task was to classify incorrect driving behavior using multiple inputs, including the driver's driving operation behavior, steering wheel angle, brake force, and throttle position. Also, it considers road conditions and then classifies using these inputs via the Adaboost algorithm. Another work fuses inputs at two levels - the feature and score fusion levels - through the Naive Bayes algorithm. One of the applications of ITS and IoT is the vacant parking spot detection problem in urban environments. In view of the same, the work in [31] employs machine learning based fusion.
Sentiment classification is more challenging than document topic classification, as the latter has specific keywords that do not require context/emotion understanding. Studies exist where fusion is performed via Naive Bayes, Maximum Entropy Classifier and SVM, where SVM superseded the other two [32]. In the healthcare sector, one study deals with diagnosing stroke patients and classifying the stroke as ischemic stroke or hemorrhagic stroke, using features such as wavelet packet energy, fuzzy entropy and hierarchical theory. Further, SVM, Decision Tree (DT) and Random Decision Forest (RDF) are used as the stroke signal classification models.
Fault detection in motors also requires the fusion of information. In this regard, Banerjee et al. [34] proposed a hybrid method for fault detection based on multi-sensor data fusion with SVM, Short Term Fourier Transform (STFT) and a time-domain analysis. Another work addresses the high false alarm rate of the conventional Adaboost License Plate (LP) detector [35]; the latter is enhanced via a color-checking module and an SVM detector that checks the image patch for an LP. Another domain of IF application is finance, specifically the stock market. One such task is to predict stock price movement. In study [36], the authors gathered historical stock market data and derived technical indicators to build a rich knowledge base. Using this data, the authors generated features and utilized three ML models - DT, SVM, and Artificial Neural Networks (ANNs) - for stock price movement prediction.
We can see that many domains have used IF based on traditional ML algorithms. However, these algorithms have limited non-linear mapping and fitting capability. Also, they are highly dependent on the quality of features and hence may not be able to build a good relation between the inputs and the outputs.
1.2.3 Fuzzy based systems

IF solutions are also based on fuzzy logic. According to one of the surveys [37], image fusion and fuzzy-based intelligent health and medical systems account for a large share of this research; in particular, fuzzy-based image fusion has emerged as a hot topic of research. For the same, the work in [38] observes that traditional weighted fuzzy logic is not adapted to raw data due to invalid data readings. Improving upon the same, the work in [39] uses K-Means clustering in addition to fuzzy logic.
Another fuzzy-based fusion approach at the feature level is adopted for intrusion detection in [40], which overcomes the data imperfection challenge of IF. Under ITS, for high-speed heavy vehicles, a Global Positioning System (GPS) based navigation method is developed by the authors in [41]. The work used fuzzy logic to fuse the GPS and odometric sensors. Next, under the same category of ITS, to avoid congestion, a fusion framework combines the Inertial Navigation System (INS) and the GPS [42]; it uses the Extended KF (EKF) and an Input-Delayed observer. Another application is to monitor elderly people in their homes and detect if they fall. To detect the same, the study in [43] proposed a data fusion approach based on fuzzy logic with a set of rules directed by medical knowledge. Fuzzy fusion has also been used in the financial domain, due to the complex nature of the data: the study in [44] developed a framework to predict daily stock price movements, where the authors deployed and fused multiple predictors. Thus, fusion solutions in several domains are based on fuzzy logic. Nevertheless, the challenge with fuzzy systems is designing the right set of rules, which can be hard at times.
1.2.4 Deep Learning based fusion approaches
Deep Learning (DL) has been widely used for analyzing multi-channel/multi-sensor signals. It facilitates the automated learning of features, versus the hand-crafted features required by traditional machine learning algorithms; thus, it saves the human effort of the latter task. Also, it can learn the complex mappings between the input and output variables.
In many DL studies, all the sensors are stacked one after the other to form a matrix, and a 2-D CNN is used to analyze the sensor signals. For example, in the study [12], the authors use the previously mentioned framework with input from multiple sensors. A shortcoming of the study [12] is that temporal modeling is not included. This shortcoming is overcome in [45], where a 2-D CNN is used on a time-series window. These windows are processed by a GRU in the final step, and hence time-series modeling is achieved. In other works, the fusion happened at the feature level versus the raw signal level like in [12, 45].
Traffic flow prediction is also a use case of information fusion, as studied via the fusion of multiple traffic data sources. Researchers have likewise used CNNs for object detection from moving vehicle camera images [48]. The DL and IF combination has also been applied in detecting anomaly-based intrusion, where the authors applied five different fusion rules to verify system effectiveness [49].
Fusion has also been observed in solving problems pertaining to the biomedical domain. In one such work, the authors employed the combination of CNNs and LSTMs for enriching features; they utilized morphological and temporal information from ECG. The authors in another work [51] provided the best adaptation for patients with irregular astigmatism, taking images as input. Another task that requires IF is studied in [52]; it does not take audio data alone as input for the task, but proposes models with different levels of early and late fusion. There are studies where multi-channel image dataset fusion has also been investigated. In [53], a fusion scheme is proposed for processing color and depth information (via 3-D and 2-D representations). In another work, the authors have fused hyperspectral data (high spatial resolution) with Lidar data; an improvement in analysis tasks was observed with the help of the fusion of deeply learned features (from CNN) with handcrafted features via a fully connected layer.
However, one issue with CNNs is that these are primarily supervised and require large labeled datasets. But the labeled datasets are only present in abundance for a few domains. Moreover, CNN-based learning involves learning of filters, yet CNNs may not guarantee distinct filters, as the loss functions involved generally do not impose any distinctiveness constraint. This has even been shown via experiments discussed later in chapters 2 and 3.
This thesis proposes solutions based on CTL for three types of problems; the tasks dealt with are regression, classification and multiview clustering. While proposing solutions, we had the chance to explore and apply them to multiple domains, which made us realize that the solutions are generic enough to be applied to fusion problems other than those utilized in this thesis.

1.3 Datasets Descriptions

The different datasets used in the problems presented in this thesis are described below.

1.3.1 National Stock Exchange (NSE) dataset
This is a real dataset from India's National Stock Exchange (NSE). The dataset contains information on 150 symbols between 2014 and 2018; these stocks were chosen after filtering out stocks with less than three years of data. The companies available in the dataset are from various sectors such as IT (e.g., TCS, INFY), banking (e.g., ICICIBANK), coal and petroleum (e.g., OIL, ONGC), steel (e.g., JSWSTEEL), power (e.g., POWERGRID, GAIL), etc. There are two signals for each sample in the dataset - BUY and SELL: the former indicates whether to buy the stock and the latter whether to sell it.
1.3.2 Past 22 years stock data

This dataset consists of 15 Indian stocks that fall under the NSE and the Bombay Stock Exchange (BSE), taken from publicly available Yahoo finance symbols data. The stock symbols ending with .NS fall under NSE and those with .BO under BSE. The data comprises day-wise readings for the past 22 years, i.e., from 1998 to 2019. It was collected internally using an in-built Python module that queries the Web through the Yahoo API end-point. At the time of data collection, the year 2019 was still ongoing; hence, the data was only partially available for 2019. Also, there were some missing values for some raw features. Thus, the data for 2019 has not been used in the experiments, for simplicity. The dataset includes stocks from multiple sectors, such as Indian consumer goods, banking and energy. Each day is labeled as BUY, HOLD or SELL, represented numerically by 0, 1 and 2. BUY and SELL have the same roles as explained in section 1.3.1, and HOLD signifies that we do nothing on a given day, i.e., we keep the stock with us and neither buy nor sell that symbol.
1.3.3 Drug-Drug Interaction Data

The DDI data is from Stanford's BioSNAP dataset, which contains a network of drug-drug interactions approved by the U.S. Food and Drug Administration. It is assumed that all other interactions are absent; known interactions are marked by 1 and the others by 0. The SMILES values of the drugs are first determined using compound IDs taken from the dataset via DrugBank.ca. Since the SMILES values are not available for all the drugs (retrieved using DrugBank IDs), the number of drugs in the dataset got reduced to 1368 and, accordingly, the number of interactions. Further, only drugs that have at least 10 known-to-interact interactions with other drugs have been retained. So, there are finally 1059 drugs and their interactions.
Thereafter, the bioactivity descriptors of each drug are extracted via the Signaturizer tool [58] using the determined SMILES value. This tool provides bioactivity descriptors for small-molecule drugs, covering all the drugs present in the Chemical Checker (CC). The latter further covers the source databases DrugBank.ca and ChEMBL. It has a pre-trained Siamese Neural Network via which, given the SMILES value of a drug, 25 different types of bioactivity descriptors, each of size 128, can be inferred. There are broadly five categories of bioactivity descriptors, each with five sub-categories (e.g., A1 to A5), thus 25 different descriptor types in total. Since three of these 25 bioactivity descriptor types are used, each a fixed-size vector of length 128, each drug has 384 (128 × 3) features. Thus, the final dataset comprises 1059 unique drugs with 384 bioactivity descriptors/features each.
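As a minimal illustration of this feature assembly (the arrays below are placeholders, not the Signaturizer API), the 384-dimensional per-drug vector is simply the concatenation of the three 128-dimensional descriptors:

import numpy as np

n_drugs = 1059
# Placeholder descriptor matrices standing in for three Signaturizer
# outputs, one per chosen bioactivity descriptor type, each (n_drugs, 128).
desc_a = np.random.randn(n_drugs, 128)
desc_b = np.random.randn(n_drugs, 128)
desc_c = np.random.randn(n_drugs, 128)

features = np.concatenate([desc_a, desc_b, desc_c], axis=1)
assert features.shape == (n_drugs, 384)   # 128 x 3 features per drug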
1.3.4 Multi-view datasets

Here the proposed approach was tested on the various multiview clustering (MVC) datasets listed below:

• 100Leaves: This dataset contains 100 plant species with 16 samples per species. Thus, there are 100 clusters and 1600 total samples. For each sample, a shape descriptor, fine-scale margin and texture histogram are given [59].

• ALOI: The Amsterdam Library of Object Images (ALOI) dataset consists of 11025 images of 100 small objects. Every image is represented using four features, namely - color similarity, HSV, RGB, and Haralick features [60].

• Mfeat: The Mfeat dataset is from the UCI repository and contains 2000 samples of handwritten digits described by six feature sets.

• WebKB: It consists of 203 web pages with four classes, collected from university computer science department websites.

The complete statistics of all the datasets mentioned above can be referred from Table 1.1.
Table 1.1: Statistics of the considered MVC datasets

Dataset     #Samples   #Classes   #Views
100Leaves   1600       100        3
ALOI        11025      100        4
Mfeat       2000       10         6
WebKB       203        4          3
1.4 Research Contributions

This thesis has three main objectives: 1. To propose more accurate algorithms for multi-channel information fusion; 2. To propose unsupervised alternatives to deep learning frameworks that are largely supervised, thus eliminating the need for large labeled datasets; and 3. To propose methods that ensure that the learned filters are distinct and hence non-redundant.
First, unsupervised fusion frameworks are proposed based on Convolutional Transform Learning (CTL). The excellent learning ability of convolutional filters for data analysis is well acknowledged, and CTL learns such filters in an unsupervised manner [56, 67]. Therefore, the proposed framework is (i) a deep version of CTL; (ii) a fusion formulation on top of the proposed CTL representation; (iii) one whose learned filters are distinct, and hence more interpretable representations are obtained. The proposed techniques, ConFuse [68] and DeConFuse [69], have been applied to the problems of stock forecasting and trading. Comparison with state-of-the-art methods (based on CNN and LSTM network) shows the superiority of our approaches for performing reliable feature extraction.
Second, two supervised frameworks have been proposed that are based on DCTL - SuperDeConFuse and DeConDFFuse. The former offers all the benefits of the CTL approach discussed previously. Moreover, its design is such that it facilitated the removal of the non-linear activation located between the convolutional layers and the fully-connected layer, as well as the one located between the latter and the output layer, thus handling the problem of dead neurons. This was achieved via non-negativity regularization on the aforementioned layer outputs and filters during the training phase. Further, this technique has been applied to the problem of stock forecasting and trading. The second framework, DeConDFFuse, predicts drug-drug interactions by jointly training multi-channel DCTL based networks and a Decision Forest (DF), rather than following a piecemeal approach.
Lastly, a multiview clustering framework based on CTL is proposed that takes multiview data as input, namely DeConFCluster [71]. The framework jointly trains DCTL networks and the K-Means clustering module; thus, the representations are distinct and more effective, as these are also guided by the K-Means clustering loss, and the framework outperforms the state-of-the-arts.
For quick reference, a summary of all the proposed models is given in Table 1.2, and each one's advantages and disadvantages are discussed in Table 1.3. More details of each of these models are discussed in subsequent chapters.
1.5 Acronyms
Let us introduce here all the acronyms used in the following chapters, for quick reference:
AR Annualized Returns
CC Chemical Checker
CE Cross Entropy
DCDF DeConDFFuse
DF Decision Forest
DL Deep Learning
DT Decision Tree
ECG Electrocardiogram
EEG Electroencephalogram
EM Expectation Maximization
GARCH Generalized Autoregressive Conditional Heteroskedasticity
IF Information Fusion
IT Information Technology
KF Kalman Filter
KG Knowledge Graphs
LP License Plate
MFNN Multi-Filters Neural Networks
ML Machine Learning
OM Opinion Mining
SDCF SuperDeConFuse
SELU Scaled Exponential Linear Unit
TA Technical Analysis
TL Transform Learning
Chapter 2
Unsupervised Multi-channel CTL based Fusion Frameworks - ConFuse and DeConFuse

Deep Learning (DL) paradigms currently solve several problems. Most of the frameworks in DL are based on CNNs, which are largely supervised. For supervised learning, labeled data are needed in abundance, which is in dearth for some domains. Moreover, CNNs may not learn distinct filters; this has been checked experimentally, and its details can be referred to in this chapter and in Chapter 3 later. Additionally, it has been observed that the problems concerning time-series data in stock forecasting treat the data as a 2-D image matrix versus univariate data, which is the true nature of a time-series. We will learn about the said issue in more detail in subsequent sections. Thus, there is a need for a framework that addresses these gaps.
The excellent learning ability of convolutional filters for data analysis is well acknowledged [61–66]. The framework proposed in this chapter (i) provides a shallow and a deep version of the CTL approach; (ii) has an unsupervised fusion formulation, where channel-wise features are learned via CTL and fused via TL; (iii) rests on a mathematically sound optimization strategy for performing the learning task; and (iv) learns distinct filters that consequently yield more interpretable representations. The proposed frameworks are named ConFuse (shallow) and DeConFuse (deep) and are applied to the problems of stock forecasting and trading. Comparison with state-of-the-art methods (based on CNN and LSTM network) shows the superiority of our approaches for performing reliable feature extraction.
This chapter is organized into sections, with section 2.1 discussing the related work and section 2.2 the proposed algorithms. The experimental evaluations and results are discussed in sections 2.3 and 2.4 respectively, followed by the discussion in section 2.5.
2.1 Related Work

Let us briefly review and discuss CNN-based methods for time-series analysis; for a more detailed review, the interested reader can peruse [72]. In this section, the main focus is on studies about stock forecasting, as it is the use case for experimental validation.
2.1.1 CNN-based time-series analysis

The traditional choice for processing time series with a neural network is the recurrent architecture, for which Long Short-Term Memory (LSTM) [73] and Gated Recurrent Unit (GRU) [74] networks have been proposed. However, due to the complexity of training such networks via backpropagation through time, these have been progressively replaced with 1D CNNs [75]. For example, in [76], a generic time-series analysis framework was built based on LSTM, with performance assessed on the UCR time-series classification datasets [77]. A later study from the same group [78], based on 1D CNN, showed considerable improvement over the prior model on the same datasets.
Many studies convert 1D time-series data into matrix form in order to use 2D CNNs: the time-series is stacked within a given time window, and the resulting matrix is processed as an image. The 2D CNN model has been prevalent in stock forecasting. In [81], the said techniques have been used on stock prices for forecasting. A slightly different input is used in [82]: instead of using the standard stock variables (open, close, high, low and NAV), it uses high-frequency data for forecasting major market movements. Elsewhere, a 2D CNN has been used for modeling Exchange Traded Funds (ETFs). It has been seen that the 2D CNN model performs the same as LSTM or the standard multi-layer perceptron. Self-Supervised Learning (SSL) based models are also emerging currently for cases when no labels for the data are available: the unsupervised problem is turned supervised by predicting pseudo labels, and then training happens. There are a few works that utilize and propose solutions based on SSL for stock trading prediction [86–89]. However, such techniques are resource-intense and, just like CNNs, these SSL based learning paradigms do not have distinctiveness guarantees.
2.1.2 Convolutional Transform Learning

CTL was introduced in the seminal paper [56]. Since the proposed framework is based on the said recent work, it is presented in detail to make this chapter self-contained. CTL learns a set of filters $(t_m)_{1 \le m \le M}$ operated on observed samples $(s^{(k)})_{1 \le k \le K}$ to generate a set of features $(x_m^{(k)})_{1 \le m \le M,\, 1 \le k \le K}$. Formally, the inherent learning model is expressed through the convolutions $t_m * s^{(k)} \approx x_m^{(k)}$. A penalty $\psi$ was imposed on the features for improving representation ability and limiting overfitting, keeping the features in the same line as CNN models. Training then consisted of learning the filters and the features by solving:

$$\underset{(t_m)_m,\,(x_m^{(k)})_{m,k}}{\text{minimize}}\;\; \frac{1}{2}\sum_{k=1}^{K}\sum_{m=1}^{M}\Big(\big\|t_m * s^{(k)} - x_m^{(k)}\big\|_2^2 + \psi\big(x_m^{(k)}\big)\Big) + \mu\sum_{m=1}^{M}\|t_m\|_2^2 - \lambda\log\det\big([t_1|\dots|t_M]\big). \tag{2.2}$$
The regularization term "$\mu\|\cdot\|_F^2 - \lambda\log\det$" ensured that the learned filters were distinct, which was not guaranteed otherwise. Problem (2.2) can be rewritten in matrix form as

$$\underset{T,\,X}{\text{minimize}}\;\; \frac{1}{2}\|T * S - X\|_F^2 + \Psi(X) + \mu\|T\|_F^2 - \lambda\log\det(T), \tag{2.3}$$

where $T = [t_1|\dots|t_M]$, $S = [s^{(1)}|\dots|s^{(K)}]^\top$, and $X = \big([x_1^{(k)}|\dots|x_M^{(k)}]\big)_{1 \le k \le K}$.
The cost function in Problem (2.2) could thus be compactly rewritten as¹

$$F(T, X) = \frac{1}{2}\|T * S - X\|_F^2 + \Psi(X) + \mu\|T\|_F^2 - \lambda\log\det(T), \tag{2.4}$$

which is amenable to alternating minimization in the variables $T$ and $X$. More precisely, set a Hilbert space $(\mathcal{H}, \|\cdot\|)$, and define the function $\varphi : \mathcal{H} \to\, ]-\infty, +\infty]$; its proximity operator at $\tilde{x}$ is

$$\operatorname{prox}_{\varphi}(\tilde{x}) = \arg\min_{x \in \mathcal{H}}\; \varphi(x) + \frac{1}{2}\|x - \tilde{x}\|^2. \tag{2.5}$$
The filters and features were then updated via the alternating proximal algorithm:

For $n = 0, 1, \dots$
$$\begin{aligned} T^{[n+1]} &= \operatorname{prox}_{\gamma_1 F(\cdot,\, X^{[n]})}\big(T^{[n]}\big), \\ X^{[n+1]} &= \operatorname{prox}_{\gamma_2 F(T^{[n+1]},\, \cdot)}\big(X^{[n]}\big), \end{aligned} \tag{2.6}$$

with initializations $T^{[0]}, X^{[0]}$ and $\gamma_1, \gamma_2$ positive constants. For more details on the derivations and the convergence guarantees, the readers can refer to [56].
¹Note that T is not necessarily a square matrix. By abuse of notation, the "log-det" of a rectangular matrix was defined through its singular values.
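To make this scheme concrete, the following is a minimal PyTorch sketch in the spirit of (2.2)-(2.6), simplified to gradient steps plus the prox of Ψ (taken here as the non-negativity indicator, whose prox is a clamp at zero); all sizes, step sizes and the use of automatic differentiation are illustrative assumptions, not the thesis's implementation:

import torch

torch.manual_seed(0)

# Toy CTL setup: K univariate samples of length N, M filters of size F.
K, N, M, F = 32, 64, 8, 5
S = torch.randn(K, 1, N)                        # samples s^(k)
T = torch.randn(M, 1, F, requires_grad=True)    # filters t_m
X = torch.zeros(K, M, N, requires_grad=True)    # features x_m^(k)
mu, lam, gamma = 1e-2, 1e-4, 1e-2

def cost(T, X):
    conv = torch.nn.functional.conv1d(S, T, padding=F // 2)  # t_m * s^(k)
    fit = 0.5 * ((conv - X) ** 2).sum()                      # data-fidelity term
    Tm = T.reshape(M, F)
    # "log det" of the rectangular filter matrix via its singular values
    logdet = torch.log(torch.linalg.svdvals(Tm)).sum()
    return fit + mu * (Tm ** 2).sum() - lam * logdet

for _ in range(200):
    loss = cost(T, X)
    loss.backward()
    with torch.no_grad():
        T -= gamma * T.grad              # gradient step on the filters
        X -= gamma * X.grad
        X.clamp_(min=0)                  # prox of Psi = non-negativity
    T.grad.zero_()
    X.grad.zero_()

The $-\lambda\log\det$ term penalizes small singular values of the filter matrix, which is what keeps the learned filters from collapsing onto each other.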
2.1.3 Updates of T

The detailed expressions of the updates of T and X in (2.6) can be found in [56].

2.2 Proposed Approach

2.2.1 ConFuse

The first step of the proposed ConFuse framework was to learn, for each channel $c \in \{1, \dots, C\}$, a distinct set of convolutional filters $(T^{(c)})_{1 \le c \le C}$ and associated features $(X^{(c)})_{1 \le c \le C}$, by solving a CTL-based formulation:

$$\underset{T^{(c)},\,X^{(c)}}{\text{minimize}}\;\; \frac{1}{2}\sum_{k=1}^{K}\Big(\big\|S_k^{(c)} T^{(c)} - X_k^{(c)}\big\|_F^2 + \Psi\big(X_k^{(c)}\big)\Big) + \mu\big\|T^{(c)}\big\|_F^2 - \lambda\log\det\big(T^{(c)}\big). \tag{2.7}$$

Then, the learned channel-wise features were stacked as $X_k = \big[X_k^{(1)\top}\big|\dots\big|X_k^{(C)\top}\big]^\top$ and fused through a transform-learning based fully connected layer:

$$\underset{\widetilde{T},\,Z}{\text{minimize}}\;\; \frac{1}{2}\sum_{k=1}^{K}\big\|\widetilde{T} X_k - Z_k\big\|_F^2 + \iota_+(Z) + \mu\big\|\widetilde{T}\big\|_F^2 - \lambda\log\det\big(\widetilde{T}\big), \tag{2.8}$$
where $\widetilde{T}$ denotes the fusion-stage transform (not assumed to be convolutional), $Z$ is the row-wise concatenation of the fusion-stage features $(Z_k)_{1 \le k \le K}$, and $\iota_+$ is the indicator function of the positive orthant, equal to zero if all the entries of $Z$ are non-negative and $+\infty$ otherwise.
However, the disjoint resolution of Problems (2.7) and (2.8) might lead to a suboptimal fusion. Instead, a collaborative strategy was proposed where all the variables are learned in an end-to-end fashion, by expressing the features as a function of the filters:

$$\widehat{X}_k(T) = \Big[\widehat{X}_k^{(c)}(T)\Big]_{1 \le c \le C} = \Big[\Phi\big(S_k^{(c)} T^{(c)}\big)\Big]_{1 \le c \le C}, \tag{2.9}$$

with $\Phi$ the proximity operator of $\Psi$ [94]. For example, if $\Psi$ was the indicator function of the positive orthant, then $\Phi$ identified with the famous rectified linear unit (ReLU) activation function. Many other examples are provided in [94].
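As a quick numerical check of this prox/activation correspondence (generic PyTorch, not code from the thesis):

import torch

x = torch.linspace(-2, 2, 9)
prox = torch.clamp(x, min=0)             # prox of the positive-orthant indicator
relu = torch.nn.functional.relu(x)       # the ReLU activation
assert torch.equal(prox, relu)           # the two coincide elementwise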
Consequently, it was proposed to plug Equation (2.9) into Problem (2.8), leading to the final ConFuse formulation:

$$\underset{T,\,\widetilde{T},\,Z}{\text{minimize}}\;\; \frac{1}{2}\sum_{k=1}^{K}\big\|\widetilde{T}\,\widehat{X}_k(T) - Z_k\big\|_F^2 + \iota_+(Z) + \mu\big\|\widetilde{T}\big\|_F^2 + \mu\|T\|_F^2 - \lambda\Big(\log\det\big(\widetilde{T}\big) + \sum_{c=1}^{C}\log\det\big(T^{(c)}\big)\Big). \tag{2.10}$$
Although Problem (2.10) was still nonconvex, this new formulation had two notable advantages. First, it was remarked that, as soon as the involved activation function was smooth, all terms of the cost function in (2.10) were differentiable, except the indicator function. Thus, the accelerated stochastic projected gradient descent method Adam [95] could be employed; the latter uses automatic differentiation. Second, any activation function could be plugged into the proposed model (2.9), for instance the Scaled Exponential Linear Unit (SELU) [96] or Leaky ReLU [97]. This flexibility played a key role in the performance. The complete architecture is shown in Figure 2.1. Note that the proposed approach was completely unsupervised. The non-negativity constraint was imposed on the fused features Z to avoid trivial solutions. Regarding the representation filters stacked in matrices (T, T̃), the log-det regularization imposed full rank on those; thus, it helped to enforce the diversity of the learned filters.
Figure 2.1: General view of the ConFuse architecture. C = 5 represents the number of DeepCTL networks/channels,
F1c = 5 × 1 is the filter size and M1c = 4 is the number of filters for all the channels.
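For illustration, a hedged sketch of the ConFuse forward computation of (2.9)-(2.10) - channel-wise convolutional transforms, the activation Φ, then the linear fusion transform - with all sizes chosen arbitrarily:

import torch
import torch.nn.functional as F

C, K, N, M, Fs, out_dim = 5, 16, 64, 4, 5, 10
S = torch.randn(C, K, 1, N)                      # one 1-D series per channel
T = [torch.randn(M, 1, Fs) for _ in range(C)]    # channel filters T^(c)
T_fuse = torch.randn(C * M * N, out_dim)         # fusion transform

def confuse_forward(S, T, T_fuse, phi=torch.selu):
    feats = []
    for c in range(C):
        Xc = phi(F.conv1d(S[c], T[c], padding=Fs // 2))  # X^(c) = Phi(S^(c) T^(c))
        feats.append(Xc.reshape(K, -1))                  # flatten per sample
    X = torch.cat(feats, dim=1)                          # stack channel features
    return torch.clamp(X @ T_fuse, min=0)                # fused features Z >= 0

Z = confuse_forward(S, T, T_fuse)
print(Z.shape)  # torch.Size([16, 10])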
2.2.2 DeConFuse

In this framework, the ConFuse architecture was extended with more convolutional layers based on CTL, and the result is called DeConFuse. Here, there were as many transforms as the number of CTL layers. Thus, a different set of convolutional filters $T_1^{(c)}, \dots, T_L^{(c)}$ and features $X_1^{(c)}, \dots, X_L^{(c)}$ was learned for each channel, linked through

$$X_\ell = \phi_\ell\big(T_\ell * X_{\ell-1}\big), \qquad \ell \in \{1, \dots, L\}, \tag{2.11}$$

where $X_0 = S$ and $\phi_\ell$ is a given activation function for layer $\ell$. Further, these features were processed in the same manner as in the ConFuse architecture, i.e., by solving
$$\underset{T,\,X,\,\widetilde{T},\,Z}{\text{minimize}}\;\; \underbrace{F_{\text{fusion}}\big(\widetilde{T}, Z, X\big) + \sum_{c=1}^{C} F_{\text{conv}}\big(T_1^{(c)}, \dots, T_L^{(c)}, X^{(c)} \,\big|\, S^{(c)}\big)}_{J(T,\, X,\, \widetilde{T},\, Z)} \tag{2.12}$$
where

$$F_{\text{conv}}(T_1, \dots, T_L, X \mid S) = \frac{1}{2}\big\|T_L * \phi_{L-1}\big(T_{L-1} * \dots \phi_1(T_1 * S)\big) - X\big\|_F^2 + \Psi(X) + \sum_{\ell=1}^{L}\big(\mu\|T_\ell\|_F^2 - \lambda\log\det(T_\ell)\big), \tag{2.13}$$
and

$$F_{\text{fusion}}\big(\widetilde{T}, Z, X\big) = \frac{1}{2}\sum_{c=1}^{C}\big\|Z - \operatorname{flat}\big(X^{(c)}\big)\,\widetilde{T}_c\big\|_F^2 + \iota_+(Z) + \sum_{c=1}^{C}\big(\mu\big\|\widetilde{T}_c\big\|_F^2 - \lambda\log\det\big(\widetilde{T}_c\big)\big), \tag{2.14}$$

where the operator "flat" transforms $X^{(c)}$ into a matrix in which each row contains the features of one sample. The resulting DeConFuse architecture is depicted in Figure 2.2.
As for the solution of Problems (2.10) and (2.12), it was remarked that all terms of the cost function are differentiable, except the indicator function of the non-negativity constraint. It was thus proposed to solve Problems
Figure 2.2: General view of the DeConFuse architecture. C = 5 represents the number of DeepCTL networks/chan-
nels, L = 2 is the number of DCTL layers, Mℓc is the filter size and Fℓc is the number of filters of the respective
layer ℓ and channel c.
(2.10) and (2.12) by employing the projected gradient descent, whose iterations read:

For $n = 0, 1, \dots$
$$\begin{aligned} T^{[n+1]} &= T^{[n]} - \gamma\,\nabla_T J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}\big) \\ X^{[n+1]} &= P_+\Big(X^{[n]} - \gamma\,\nabla_X J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}\big)\Big) \\ \widetilde{T}^{[n+1]} &= \widetilde{T}^{[n]} - \gamma\,\nabla_{\widetilde{T}} J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}\big) \\ Z^{[n+1]} &= P_+\Big(Z^{[n]} - \gamma\,\nabla_Z J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}\big)\Big) \end{aligned} \tag{2.15}$$
with initializations $T^{[0]}, X^{[0]}, \widetilde{T}^{[0]}, Z^{[0]}$, $\gamma > 0$, and $P_+ = \max\{\cdot, 0\}$. In practice,
the accelerated strategies [98] were used within each step of this algorithm to
speed up learning.
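A hedged sketch of these projected updates with Adam as the accelerated gradient step; `J` below is a stand-in differentiable cost (a real implementation would plug in the sum of (2.13) and (2.14)), and all shapes are placeholders:

import torch

params = {name: torch.randn(8, 8, requires_grad=True)
          for name in ("T", "X", "T_tilde", "Z")}
opt = torch.optim.Adam(params.values(), lr=1e-3)

def J(p):
    # Stand-in for the cost of Problem (2.12); replace with Fconv + Ffusion.
    return sum((v ** 2).sum() for v in p.values())

for n in range(100):
    opt.zero_grad()
    J(params).backward()
    opt.step()                            # accelerated gradient step
    with torch.no_grad():
        params["X"].clamp_(min=0)         # P+ projection on the features
        params["Z"].clamp_(min=0)         # P+ projection on the fused features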
Note that the activations were not restricted to ReLU in equations (2.9) and (2.11); instead, more advanced ones were used, such as SELU [96]. It can be observed from Tables 2.3 and 2.5 that although the ReLU activation performed better in the case of ConFuse, SELU was preferred for DeConFuse: as each convolution layer was added, more of the resultant values from the convolution with the filters were believed to be negative. Since the ReLU activation function sets the negative values to zero, the resulting values were not as distinct as those obtained via SELU. The latter does not set negative values to zero but maps them near zero [96], and hence also prevents the dead neuron issue, unlike ReLU. It proved beneficial for the deeper architecture.
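A tiny generic illustration of the difference (not thesis code): ReLU zeroes every negative response, whereas SELU keeps them distinct:

import torch

x = torch.tensor([-3.0, -1.0, -0.1, 0.5, 2.0])
print(torch.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(torch.selu(x))  # negatives stay distinct (they saturate toward about -1.76)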
2.3 Experimental Evaluation

Stock forecasting is a regression problem aiming at estimating the price of a stock at a future date (the next day for the given problem) given inputs till the current date. Stock trading is a classification problem, where the decision to buy or sell a stock has to be taken at each time. The two problems are related by the fact that simple logic dictates that if the price of a stock at a later date is expected to increase, the stock must be bought, and if the stock price is expected to go down, the stock must be sold.
Five raw inputs were used for both tasks, namely open price, close price,
high price, low price and net asset value (NAV). One could compute technical
indicators based on the raw inputs [81] but, in keeping with the essence of true
representation learning, it was deliberately chosen to stay with those raw values.
Each of the pipelines produced a flattened output. The flattened outputs were
then concatenated and fed into the Transform Learning layer acting as the fully
connected layer (Fig. 2.2) for fusion. The processing pipeline ended with a single output node. The node was binary (buy/sell) for classification and real-valued for regression. The comparison with state-of-the-art time-series analysis models,
namely TimeNet [76] and ConvTimeNet [78], was carried out. In the former, the individual processing pipelines are based on LSTM, and on 1D CNN in the latter. The complete architectural details and hyperparameters for ConFuse, DeConFuse and the compared models are given in Table 2.1.
Table 2.1: Description of compared models with hyperparameters

ConFuse
  Architecture: 5 × [ layer1: 1D Conv(1, 4, 5, 1, 2)¹ + Activation (e.g., ReLU) ]; 1 × [ layer2: Transform Learning ]
  Other parameters: Learning Rate = 0.001, µ = 0.01, λ = 0.0001; Optimizer: Adam with (β1, β2) = (0.9, 0.999), weight_decay = 5e-5, epsilon = 1e-8

DeConFuse
  Architecture: 5 × [ layer1: 1D Conv(1, 4, 5, 1, 2)¹ + Maxpool(2, 2)² + SELU; layer2: 1D Conv(5, 8, 3, 1, 1)¹ ]
  Other parameters: Learning Rate = 0.001 (forecasting), 0.0005 (trading); Optimizer: Adam with (β1, β2) = (0.9, 0.999), weight_decay = 5e-5, epsilon = 1e-8

TimeNet
  Architecture: 5 × [ layer1: LSTM unit(1, 12, 2, True)⁴; layer2: Global Average Pooling; layer3: Fully Connected; for trading, added layer4: Softmax ]
  Other parameters: Optimizer: Adam with (β1, β2) = (0.9, 0.999), weight_decay = 5e-5, epsilon = 1e-8

¹ (in_planes, out_planes, kernel_size, stride, padding)
² (kernel_size, stride)
³ SC - Skip-Connection
⁴ (input_size, hidden_size, #layers, bidirectional)
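To make Table 2.1 concrete, here is a hedged PyTorch sketch of a single DeConFuse channel branch with the listed layer parameters (an illustration only, not the released implementation; note the table's layer-2 in_planes of 5 conflicts with layer-1's 4 output planes, so 4 is used to keep the sketch runnable):

import torch
import torch.nn as nn

class DeConFuseBranch(nn.Module):
    """One channel branch following the DeConFuse row of Table 2.1."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 4, kernel_size=5, stride=1, padding=2),   # layer1
            nn.MaxPool1d(kernel_size=2, stride=2),                 # Maxpool(2, 2)
            nn.SELU(),
            nn.Conv1d(4, 8, kernel_size=3, stride=1, padding=1),   # layer2
        )

    def forward(self, x):                  # x: (batch, 1, window_length)
        return self.net(x)

branch = DeConFuseBranch()
print(branch(torch.randn(16, 1, 20)).shape)   # torch.Size([16, 8, 10])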
The frameworks have been applied on the NSE dataset of 150 symbols, as described in section 1.3.1.
Table 2.2: Forecasting Results (MAE)
The architectures were tuned to yield the best performance, and the weights were randomly initialized.
Firstly, the experiments were performed on the stock forecasting problem: the generated unsupervised features from the proposed architecture were fed into an external regressor, and the evaluation measures the mean absolute error (MAE) between the predicted and actual stock prices for all 150 stocks. Root Mean Squared Error (RMSE) could also have been computed in place of MAE, but MAE was chosen here, as MAE has lower sample variance and is more interpretable than RMSE. The MAE for individual stocks is computed for each of the close price, open price, high price, low price and net asset value.
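For reference, a minimal sketch of this per-stock metric computation (generic NumPy with made-up numbers standing in for predictions):

import numpy as np

actual = np.array([101.2, 102.5, 100.8, 103.1])     # e.g., close prices
predicted = np.array([100.9, 102.9, 101.1, 102.6])

mae = np.mean(np.abs(predicted - actual))           # mean absolute error
rmse = np.sqrt(np.mean((predicted - actual) ** 2))  # for comparison
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}")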
2.4 Results and Analysis

Results Analysis for ConFuse

The testing was done with six different activation functions for ConFuse. For a concise summary of results, Table 2.3 shows the average values over all stocks.

Table 2.3: Summary Forecasting Results (MAE) with ConFuse

It was found that the results for the stock forecasting problem were exceptionally good. For most tested activation functions, ConFuse has an MAE more than one order of magnitude lower than the state-of-the-art. The regression performance is also plotted in Figure 2.3 for two randomly chosen stocks, where it is clearly observed that the predicted close prices are very close to the actual ones.
Figure 2.3: Regression performance for two stocks: (a) AMARAJABAT, (b) JSWENERGY.
Results Analysis for DeConFuse

Summary results are presented in Table 2.4. Interested readers can see the detailed results for all 150 stocks in the Appendix section of the paper [69]. Table 2.4 shows that the MAE values reached by the proposed DeConFuse solution for the four first prices (open, close, high, low) are extremely good for all of the 150 stocks, and the NAV prediction performs well for 128 stocks. For the remaining 22 stocks, there are 13 stocks, highlighted in red, for which DeConFuse did not give the lowest MAE, but it was still very close to the best.
It can be observed that with both the shallow (ConFuse) and deep (DeConFuse) versions, the forecasting performance supersedes the state-of-the-art. Further, going deep did better than the shallow version for a majority of the stocks.
Next, the stock trading, i.e., classification, performance was evaluated. For this task, the unsupervised features were fed to an external classifier, a
Random Decision Forest (RDF). The results were reported in terms of metrics
- precision, recall, F1 score, and area under the ROC curve (AUC). From the financial viewpoint, annualized returns (AR) were also calculated using the predicted and the true labels, termed Predicted AR and True AR respectively. The latter metric is important from the financial perspective: the closer the Predicted AR is to the True AR, the better the quality of the predictions. To calculate the same, the starting capital used for every stock was Rs. 1,00,000 and the transaction charges were Rs. 10 per transaction. The metrics are defined as follows:

• Accuracy: the proportion of correct predictions, i.e.

$$\text{Accuracy} = \frac{1}{m}\sum_{i=1}^{m} \frac{TP_i + TN_i}{TP_i + TN_i + FN_i + FP_i} \tag{2.17}$$

where $TP_i$ = True Positives, $TN_i$ = True Negatives, $FP_i$ = False Positives, $FN_i$ = False Negatives, $m$ is the total number of classes in the dataset, and $i$ ranges from 1 to $m$.
• Precision: also known as the positive predictive value (PPV), measures the fraction of true positives among predicted positives:

$$\text{Precision (PPV)} = \frac{TP}{TP + FP} \tag{2.18}$$

• Recall: also known as sensitivity, measures the fraction of actual positives that are correctly identified:

$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN} \tag{2.19}$$

• F1 score: the harmonic mean of precision and recall,

$$F_\beta = (1 + \beta^2)\cdot\frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}, \tag{2.20}$$

here with $\beta = 1$.

• AUC ROC: the ROC curve plots classification performance at different thresholds using two parameters: True Positive Rate (TPR) and False Positive Rate (FPR). Here TPR is a synonym for recall, and the False Positive Rate (FPR) is defined as follows:

$$\text{FPR} = \frac{FP}{FP + TN} \tag{2.21}$$

AUC stands for "Area under the ROC Curve"; that is, AUC measures the entire two-dimensional area underneath the ROC curve.

• Annualized Returns (AR): the yearly return obtained by executing the buy/sell decisions. Here, transaction charges = Rs. 10/- and Start Capital = Rs. 1,00,000/-.
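These metrics can be reproduced with standard tooling; a small sketch using scikit-learn's metric functions on toy labels (not the thesis's evaluation script):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])           # toy buy(0)/sell(1) labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])
y_score = np.array([.2, .9, .4, .1, .8, .6, .7, .9])  # classifier scores

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # Eq. (2.20) with beta = 1
print(roc_auc_score(y_true, y_score))   # area under the ROC curve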
The results can be referred from Table 2.5. For the stock trading problem, ConFuse performed better than the benchmarks with the SELU and ReLU activations, and reached a similar performance to the benchmarks with the remaining ones.
Table 2.5: Trading Results with ConFuse
The classification performance in detail for all 150 symbols can be referred from the paper's Appendix section. Certain results from that table are highlighted in bold or red. The first set of results, marked in bold, are the ones where one of the techniques for each metric gave the best performance for each stock. The proposed solution DeConFuse gave the best results for 89 stocks for the precision score, 85 stocks for the recall score, 125 stocks for the F1 score, and 91 stocks for the AUC metric.
The other set, marked in red, highlights the cases where DeConFuse did not perform the best but performed nearly equal (here, a difference of a maximum of 0.05 in the metric is considered) to the best performance given by one of the benchmarks, i.e., DeConFuse gave the next best performance. It was noticed that there are 24 stocks for which DeConFuse gave the next best precision metric value; likewise, 18 stocks in case of recall, 22 stocks for F1 score, 26 stocks for AUC values, and 1 stock in case of AR. Overall, DeConFuse reached a very satisfying performance over the benchmark techniques. The trading results are summarized in Table 2.6.
Some empirical convergence plots of Adam, when using ConFuse and DeConFuse with SELU, can be seen in Figure 2.4; they depicted a stable decrease of the training loss.
Further, the representations, both the channel-wise features Xc and the final fused representation Z, were analyzed for one randomly chosen stock, here ANDHRABANK. The visualizations are displayed in Figure 2.5 for one sample of the mentioned stock. It can be seen from the figure that the heatmaps for all the channel-wise features Xc and fused features Z are less redundant and have more variations. Thus, it can be implied that one of the factors behind this variation could be the distinct filters that are learned and that transform the data to produce the varied representations.
Figure 2.4: Loss plots with (a) ConFuse and (b) DeConFuse.
(a) Channel X1 - Close Price; (b) Channel X2 - Open Price; (c) Channel X3 - High Price; (d) Channel X4 - Low Price; (e) Channel X5 - Net Asset Value; (f) Z - Fused features
Figure 2.5: Visualization of channel-wise features Xc and fused representations Z for DeConFuse for one sample of
stock ANDHRABANK (with 8 × 2 as the shape of the features obtained for each channel Xc and flattened features
of shape 40 × 1 for Z)
2.5 Discussion
Shallow and deep fusion based end-to-end frameworks for processing 1D multi-channel data were proposed. Unlike other deep learning models, these frameworks are unsupervised. They are based on a novel deep version of the recently proposed CTL model. The proposed models have been applied to stock forecasting and trading problems, leading to very good performance. The overall empirical convergence behaviour was satisfactory as well.
A notable benefit of the approach is the reusability of features learned in an unsupervised fashion. For example, consider the problems that are addressed: for traditional deep learning-based models, one needs to re-train deep networks separately for regression and classification. But here the learned final features can be reused, without the requirement of re-training, for the specific tasks. This has advantages in other areas as well. For example, one can either do ischemia detection, i.e., detect whether one is having a stroke at the current time instant (from EEG), or predict its occurrence at a future instant. With standard deep learning, two networks need to be trained and tuned to tackle these two problems. With the proposed methods, there is no need for this double effort.
Since stock data is quite volatile, a minor improvement matters in the problems pertaining to this domain. Thus, the better results of the proposed frameworks over the benchmarks are beneficial for the system. However, the AUC ROC values can be improved further in the future, as those have scope for improvement in this two-class problem. Also, in the future, the frameworks can be extended to other fusion problems.
Chapter 3
Supervised Multi-channel Fusion Frameworks - SuperDeConFuse and DeConDFFuse

In the last chapter, the unsupervised frameworks based on CTL were discussed that bridged the gaps that CNNs have. However, the question that comes next is - if we have labeled datasets, are CNN based models sufficient for supervised learning? It has been observed that CNNs have emerged as the recommended solution in many such scenarios. But the issue with CNNs is that supervised learning through them does not ensure distinct filters; hence, the feature maps might have redundancy. Additionally, there is a dead neuron problem with CNNs, which is tied to the kind of activation function chosen, and mostly happens when ReLU is used. A dead neuron can be an even bigger problem: if every neuron in a specific hidden layer is dead, it cuts the gradient to the previous layer, resulting in zero gradients to the layers behind it. Thus, the weights would not be updated, and the learning will be improper. It can be mitigated using lower learning rates, so that a big gradient doesn't set a big negative weight and bias in a ReLU neuron. Another solution is to use other activation functions like Leaky ReLU, which allows the neurons outside the active interval to leak some gradient backward. But sometimes these fixes just do not suffice.
Therefore, the two issues discussed above open up the scope for developing supervised frameworks that can tackle them jointly. The first proposed framework, SuperDeConFuse (SDCF), jointly trains and optimizes multiple CTL based channels under a cross-entropy loss. Thus, representations are not learned just via CTL but are also directed by the classification loss - Cross Entropy. It has been applied to the stock trading problem. The other framework, DeConDFFuse, jointly trains multi-channel CTL networks and a Decision Forest (DF); it deals with the drug-drug interaction problem. Here, the representations are learned via CTL and DF, which yields better performance. Both frameworks are detailed in the following sections.
3.1 SuperDeConFuse: A supervised deep convolutional transform based fusion framework for financial trading systems

3.1.1 Literature Review - Stock Trading

Let us briefly review here some of the works that have proposed solutions for the stock trading problem. The problem of stock trading has been one of the most difficult problems for researchers in finance data processing and for speculators. The struggles are mainly due to the uncertainties and noise of the samples.
In the literature, different methodologies have been applied to stock data for predicting future trading strategies (e.g., buy and sell decisions). These include deep learning models (e.g., CNN, LSTM) and self-supervised learning based approaches, as well as statistical and feature-based techniques.
Statistical methods are probably the methods that are most universally used for this task. Examples include the use of sequential statistical models, such as ARMA [100], ARCH [101], GARCH [102] and [103], and the Kalman filter [104]. Feature-based techniques are also popular: technical indicators, wavelets, etc., have been used in past studies to extract the features from the data [105]. Text mining can also be used to process financial analysis from newspapers [106]. The features are then input to machine learning models, for example, SVM and ANN.
Further studies have proposed hybrid machine learning models using multiple base classifiers operating on a common input and a meta classifier learning from the base classifiers' outputs to obtain more precise stock return and risk predictions. A combination of SVM and a weighted KNN model for predicting stock market indices is proposed in [110]. Another study [111] combines statistical and probabilistic Bayesian learning and the machine learning model ANN for the same. However, in all these approaches, the learned relationship between historical data and future value prediction may lack interpretation because of the "black-box" property. Thus, the performance of these methods is directly related to the quality of the handcrafted input features.
Deep learning based models have also been extensively used for solving the stock trading problem; recurrent networks are considered among the most appropriate models for time-series analysis. LSTM is one such recurrent variant, and studies have applied LSTM on the technical indicators for the prediction. However, despite the great performance of such recurrent models, the complexity of training them
has encouraged the users to search for more tractable models and solutions.
Among these are CNNs, which have been used profusely and have performed well in stock time-series forecasting, especially 2-D CNNs. The studies pertaining to CNNs [79–85] have been discussed in the previous chapter; most of them employ the 2-D CNN model, since these studies model an inherently 1D time series as an image. Self-Supervised Learning (SSL) based models are also emerging currently for cases when no labels for the data are available: the unsupervised problem is turned supervised by predicting pseudo labels, and then training happens. There are a few works that utilize and propose solutions based on SSL for stock trading prediction [86–89]. However, such techniques are resource-intense and, just like CNNs, these SSL based learning paradigms do not have distinctiveness guarantees.
3.1.2 Proposed Formulation

The proposed SuperDeConFuse (SDCF) framework is discussed in this section. A crucial element of the latter is the recently introduced CTL [56]. The details of CTL have already been covered in section 2.1.2; extending it to the deep version and the fusion part is covered with DeConFuse, described in section 2.2.2. Now, let us move to the proposed framework, which is an extension of these approaches to handle a multi-layer architecture that is trained in a supervised fashion.
This framework took the channels of input data samples to separate branches of convolutional CTL layers; the channel-wise features obtained were thus decoupled. In order to couple (i.e., fuse) them, these were concatenated and transformed into coupled features via transform learning. These features were then fed to another linear fully-connected layer, and the obtained features were finally inputted to the softmax layer, which yielded the probabilities for the classes. The complete architecture is shown in Figure 3.1.
Figure 3.1: General SuperDeConFuse Architecture. The architecture is tested for L = 1, 2, 3, 4 layers and C = 5.
Here M11 × 1, . . . , MLC × 1 represents the kernel size used in each layer ℓ ∈ {1, . . . , L}. Here, maxpooling is not
performed after layer 4 due to the small window size/input sequence length.
Mathematically, as in DeConFuse, a different set of convolutional filters $T_1^{(c)}, \dots, T_L^{(c)}$ and features $X^{(c)}$ were learned for each channel $c \in \{1, \dots, C\}$. The linear (not convolutional) transforms $\widetilde{T} = (\widetilde{T}_c)_{1 \le c \le C}$ were also learned to fuse the channel-wise features $X = (X^{(c)})_{1 \le c \le C}$, along with the corresponding fused features $Z$, at the same time. The latter task was carried out through the fusion term

$$F_{\text{fusion}}\big(\widetilde{T}, Z, X\big) = \frac{1}{2}\sum_{c=1}^{C}\big\|Z - \operatorname{flat}\big(X^{(c)}\big)\,\widetilde{T}_c\big\|_F^2 + \Psi(Z) + \sum_{c=1}^{C}\big(\mu\big\|\widetilde{T}_c\big\|_F^2 - \lambda\log\det\big(\widetilde{T}_c\big)\big), \tag{3.1}$$

where the operator "flat" transforms $X^{(c)}$ into a matrix where each row contains the features of one sample. On top of the fusion stage, a linear classifier was learned, which took the input features $Z$ and yielded the class probabilities. The cross-entropy (CE) loss associated with the final classification is given by

$$F_{\text{CE}}(\theta, Z \mid y) = \sum_{k=1}^{K} \log\Bigg(\sum_{v=1}^{V} e^{\,z_k^\top(\theta_v - \theta_{y_k})}\Bigg), \tag{3.2}$$

where $V$ is the number of classes, $\theta_v$ is the $v$-th column of matrix $\theta$, $z_k^\top$ is the $k$-th row of matrix $Z$, and $y_k \in \{1, \dots, V\}$ is the label of the $k$-th sample.
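Equation (3.2) can be verified numerically; the toy sketch below (with arbitrary shapes) rewrites it via a log-sum-exp and checks that it agrees with the standard cross-entropy loss:

import torch
import torch.nn.functional as F

K, d, V = 6, 10, 3                    # samples, feature size, classes
Z = torch.randn(K, d)                 # fused features z_k
theta = torch.randn(d, V)             # classifier weights
y = torch.randint(0, V, (K,))         # labels y_k

logits = Z @ theta                    # z_k^T theta_v for all v

# Direct evaluation of Eq. (3.2): sum_k log sum_v exp(z_k^T(theta_v - theta_{y_k}))
fce = (torch.logsumexp(logits, dim=1) - logits[torch.arange(K), y]).sum()

# Same quantity via the standard cross-entropy loss
assert torch.allclose(fce, F.cross_entropy(logits, y, reduction="sum"))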
The log-det penalty broke symmetry and enforced the diversity in the learned transforms; in contrast, the CE loss steered the fused features toward class separability.
3.1.3 Optimization algorithm

It was chosen to find a local minimizer of the resulting non-convex problem through the following projected gradient scheme:

For $n = 0, 1, \dots$
$$\begin{aligned} T^{[n+1]} &= T^{[n]} - \gamma\,\nabla_T J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big) \\ X^{[n+1]} &= P_+\Big(X^{[n]} - \gamma\,\nabla_X J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big)\Big) \\ \widetilde{T}^{[n+1]} &= \widetilde{T}^{[n]} - \gamma\,\nabla_{\widetilde{T}} J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big) \\ Z^{[n+1]} &= P_+\Big(Z^{[n]} - \gamma\,\nabla_Z J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big)\Big) \\ \theta^{[n+1]} &= \theta^{[n]} - \gamma\,\nabla_\theta J\big(T^{[n]}, X^{[n]}, \widetilde{T}^{[n]}, Z^{[n]}, \theta^{[n]}\big) \end{aligned} \tag{3.3}$$
The scheme was initialized with random matrices $T^{[0]}, X^{[0]}, \widetilde{T}^{[0]}, Z^{[0]}, \theta^{[0]}$, and a suitable step size $\gamma > 0$ was chosen. The gradient step was numerically evaluated with the accelerated scheme initially introduced for the ADAM method in [95]. The advantages of this strategy have been discussed in the previous chapter.
3.1.4 Preprocessing

Before proceeding with the experimental setup, the labeling process for the dataset and the training details are discussed here. The specifics of the dataset can be found in section 1.3.2.
In the labeling phase, the labels were manually assigned to the daily close prices as Buy (0), Hold (1) and Sell (2). The labels were determined by performing a grid search on the list of holding percentages to identify the percentage change for which the stocks should be held to maximize the annualized returns for the given stock, as detailed in Algorithm 1.
In general, the sliding walk forward validation technique is used as the cross-
validation technique in the case of time-series data, also shown in Figure 3.2. As
can be seen from Figure 3.2, ten years of data for training have been used and
the subsequent one year of data for testing, i.e., the stock data from 1998-2007
was for training and the year 2008 for testing. Then the training window was
slid by one year which implied that it was next trained from 1999-2008 and
tested on the following year 2009 data and this period is called the horizon.
In summary, it was trained for ten years, tested for the next year, slid it by a
one year horizon, and again trained and tested it until 2018. Thus, 11 years
Algorithm 1: Labelling Method
Input: CP - array of daily close prices for the current stock/symbol S
Parameters: X - array of K holding percentages; NUMDAYS - number of days for the current symbol, i.e., len(CP); Labels - 2D array of size K x NUMDAYS
Output: FinalLabels - labelled dataset for S
1: AR = [ ] // of size K
2: for k = 0, 1, 2, . . . , K − 1 do
3:   for n = 0, . . . , NUMDAYS − 2 do
4:     change = abs(((CP[n + 1] − CP[n]) / CP[n]) ∗ 100) // CP[n+1] is the next day's closing price
5:     if change > X[k] then
6:       if CP[n + 1] > CP[n] then
7:         label = "Sell"
8:       else
9:         label = "Buy"
10:      end if
11:    else
12:      label = "Hold"
13:    end if
14:    Labels[k].append(label)
15:  end for
16:  ar = AnnualisedReturn(Labels[k], CP)
17:  AR.append(ar)
18: end for
19: maxAr = Max(AR), maxIndex = index(Max(AR))
20: HoldPercentage = X[maxIndex]
21: FinalLabels = Labels[maxIndex]
22: return FinalLabels
23: Repeat steps 1 to 22 for all the stocks/symbols in the dataset.
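A runnable Python sketch of Algorithm 1 is given below; annualised_return stands in for the AnnualisedReturn routine assumed by the algorithm and is passed in rather than defined here.

import numpy as np

def label_stock(CP, holding_percentages, annualised_return):
    """Grid-search the holding percentage maximizing annualized returns
    and return the corresponding Buy/Hold/Sell labels (Algorithm 1)."""
    best_labels, best_ar = None, -np.inf
    for x in holding_percentages:
        labels = []
        for n in range(len(CP) - 1):
            change = abs((CP[n + 1] - CP[n]) / CP[n]) * 100
            if change > x:
                labels.append('Sell' if CP[n + 1] > CP[n] else 'Buy')
            else:
                labels.append('Hold')
        ar = annualised_return(labels, CP)
        if ar > best_ar:
            best_ar, best_labels = ar, labels
    return best_labels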
Thus, 11 years of data from 2008-2018 were used as test data. This way, there were 11 models, and the set of hyperparameters was selected that gave the best results across all 11 models. The hyperparameters tuned include µ, λ, kernel sizes, number of filters/kernels, learning rate, weight decay of the Adam optimizer, batch size, and number of epochs. Additionally, the weights of each yearly model were learned afresh under this sliding technique to analyze the robustness of the architecture. In other words, the model performance was calculated every time a year's data became available for testing
and the previous year's test data was absorbed into training. The training and the test data were standardized using the Normalizer from the Python scikit-learn library, as the prices and the NAV lie on different scales.
Figure 3.2: Sliding walk-forward validation technique used for hyperparameters tuning
The experiments were carried out on the real-world problem of stock trading, where a decision to buy, hold or sell a stock has to be taken at each time instant. The decision rule is: if the price of a stock at a later date is expected to increase, the stock must be bought; if the stock price is expected to go down, the stock must be sold; and if there is no change in the price, the stock should be held, i.e., do nothing until the price increases. This was done in a way to maximize the annualized returns from the stock for the company's profit, as mentioned in the labeling process.
Five raw inputs were used: open price, close price, high price, low price and net asset value (NAV). It was chosen to stay with the raw values. One could instead compute technical indicators based on the raw inputs [105], but raw values were preferred here to keep up with the true nature of representation learning. Each of the five inputs was processed through a separate pipeline. Each pipeline produced a flattened output (Figure 3.1). These flattened outputs were then concatenated and fed for fusion into the Transform Learning layer acting as a fully connected layer (Figure 3.1). Further, this was connected to another linear fully connected layer, and finally there was a softmax function.
The softmax function gave the classification output, which consisted of the class probabilities.
The architecture was extended by adding CTL layers up to four layers, resulting in four different deep SDCF architectures. The details of all four architectures are briefed in Table 3.1. Maxpooling halves the input sequence length/window size/time steps with every operation. Thus, after three layers, the size was reduced to a value that prevented the use of a maxpooling operation after the 4th CTL layer; hence, the 4-CTL-layers SDCF architecture has no maxpooling after layer 4. Also, for making a prediction on any day, the past ten days were analyzed through the model, denoted as Time Steps in Figure 3.1. Additionally, the stock trading signal was not predicted for the first ten days of every test year, so that a complete window of past observations was always available.
Table 3.1: Hyperparameters for the different instances of the proposed SDCF network (see Figure 3.1 for the general overview) used in the experimental section.

Common hyperparameters (all instances, each with 5 channel pipelines): LearningRate = 0.001, λ = 0.01, µ = 0.0001, epochs = 100; Optimizer: Adam with parameters (β1, β2) = (0.9, 0.999), weight_decay = 1e-4, epsilon = 1e-8.

SDCF 1L:
  layer1: 1D Conv(1, 16, 3, 1, 1)^1 + Maxpool(2, 2)^2
  layer2: Fully Connected (TL)^3
  layer3: Fully Connected (Linear)
  Softmax

SDCF 2L:
  layer1: 1D Conv(1, 8, 3, 1, 1)^1 + SELU + Maxpool(2, 2)^2
  layer2: 1D Conv(8, 16, 3, 1, 1)^1 + Maxpool(2, 2)^2
  Softmax

SDCF 3L:
  layer1: 1D Conv(1, 4, 11, 1, 5)^1 + SELU + Maxpool(2, 2)^2
  layer2: 1D Conv(4, 8, 7, 1, 3)^1 + SELU + Maxpool(2, 2)^2
  layer3: 1D Conv(8, 16, 3, 1, 1)^1 + Maxpool(2, 2)^2
  Softmax

SDCF 4L:
  layer1: 1D Conv(1, 4, 13, 1, 6)^1 + SELU + Maxpool(2, 2)^2
  layer2: 1D Conv(4, 8, 11, 1, 5)^1 + SELU + Maxpool(2, 2)^2
  layer3: 1D Conv(8, 16, 9, 1, 4)^1 + SELU + Maxpool(2, 2)^2
  layer4: 1D Conv(16, 32, 5, 1, 2)^1
  Softmax

^1 (in_planes, out_planes, kernel_size, stride, padding)
^2 (kernel_size, stride)
^3 TL - Transform Learning; L - #CTL layers
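For illustration, a minimal PyTorch sketch of one of the C = 5 channel pipelines of SDCF 3L, following the shapes in Table 3.1, is shown below. This only reproduces the layer shapes; in the actual framework the convolutional filters are learned with the CTL objective (log-det and Frobenius penalties) rather than as a plain CNN.

import torch.nn as nn

# One channel pipeline of SDCF 3L; input shape (batch, 1, window_size)
channel_pipeline = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=11, stride=1, padding=5), nn.SELU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Conv1d(4, 8, kernel_size=7, stride=1, padding=3), nn.SELU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Conv1d(8, 16, kernel_size=3, stride=1, padding=1),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Flatten(),  # flattened output, later concatenated across channels for fusion
)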
The comparison was made with three state-of-the-art time-series analysis models, out of which two techniques presented models proposed specifically for financial stock trading - CNN-TA [105] and MFNN [113] - and the last technique presented a generic model for time-series data - FCN (Fully Convolutional Network) [75]. The latter was used to understand how generic the proposed model is when compared against both specific stock-trading and general time-series models. In all these techniques, the processing pipelines were based on CNN. Other than CNN, MFNN [113] was also based on an RNN type of network - LSTM. In [105], the data used was not raw but processed as technical indicator values and passed as an image, hence using 2D CNN, whereas in FCN [75] the data was processed via 1D CNN. The same hyperparameters for the benchmark techniques were used as given in the respective studies, except for FCN, which was best tuned for the used dataset. The proposed model was also compared to a simple 1D CNN with a 3-convolutional-layers-deep architecture using the same hyperparameters, with kernel sizes F_ℓ and padding size F_ℓ/2. The difference lay in the objective function of the convolutional learning in the two techniques, i.e., the 3-layers-deep SDCF and the 3-layers-deep simple 1D CNN. This was done to analyze the performance gain attributable to CTL; the CNN was given 3 convolutional layers since the results were best for SDCF at this depth.
The predictions from every year, totaling 11 years, were saved, and metrics were computed to analyze the performance of the SDCF model. Two sets of metrics were computed, namely (i) classification metrics and (ii) financial metrics.
Classification metrics analyze the performance from the classification point of view. The weighted F1 score, precision and recall were calculated to account for the class imbalance of every stock. Note that, in such a case, the F1 score is not equivalent to the harmonic mean of precision and recall since it is weighted.
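A minimal sketch of how such weighted metrics can be computed with scikit-learn is shown below; the label arrays are illustrative placeholders.

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 1, 2, 1, 1]   # 0 - Buy, 1 - Hold, 2 - Sell (illustrative labels)
y_pred = [0, 1, 2, 2, 1, 1]
# 'weighted' averaging weighs each class by its support, so the reported F1
# is no longer the harmonic mean of the reported precision and recall
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted', zero_division=0)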
Financial metrics compare the proposed framework and the state-of-the-art from the financial point of view. For this purpose, Annualized Returns (AR) were computed using the predictions from all the models. The AR value was calculated the same way as in [105]. The starting capital was taken in Indian rupees to calculate the AR values since the dataset had all Indian stocks; note, however, that the chosen metric is versatile and could be used with any currency.
Let us now turn to the classification results of the proposed models. The framework was tested for a shallow version - 1 CTL layer - and
deeper versions - 2, 3 and 4 CTL layers. The features generated from the fully connected layers were passed to the softmax, after which the probabilities for all the classes were obtained. The class with the maximum probability was selected as the predicted label. The performance was calculated for every class; specifically, F1 score, precision and recall were computed for the BUY, HOLD and SELL classes. The summary results for the BUY, SELL and HOLD classes are given in Tables 3.2, 3.4 and 3.3, and the global results in Table 3.5. The detailed results can be referred to from the appendix section of the paper [57].
Certain results are highlighted in bold or red. In the first set of results, marked bold, one or more techniques give the best (greater than or equal) performance for each metric. There are 8 stocks for which the proposed model performed greater than or equal to the benchmark techniques for the F1 score in the case of the BUY class. Following the same, the SDCF gave greater than or equal performance for 13 stocks for precision and 5 stocks for recall under the BUY class; similarly, for 7 stocks for the F1 score, 7 stocks for precision and 5 stocks for recall in the HOLD class, and for 7 stocks for the F1 score, 11 stocks for precision and 6 stocks for recall in the case of the SELL class. A similar analysis was carried out for the simple 1D CNN against the proposed model. It was found that CNN gave greater than or equal performance for 2 stocks for each metric under the BUY class; similarly, there are 6, 1 and 9 stocks for the HOLD class and 2 stocks each for the F1 score, precision and recall under the SELL class.
Additionally, the other set of results, in red, indicates the cases where one of the proposed model versions gave the similar/next-best performance within a 0.02 error difference - err_dif, say - after one of the benchmarks, i.e., 0.0 < err_dif ≤ 0.02. Adhering to the same, it was observed that for the BUY class there is 1 stock each for the F1 score, precision and recall metrics. Likewise, for the HOLD class there are 7, 4 and 5 stocks for the F1 score, precision and recall metrics, respectively; and for the SELL class, 1 stock each for the F1 score and recall metrics. Although the results for CNN have not been highlighted where it gave similar/next-best performance, the corresponding statistics are presented here: for CNN, there are 2 and 3 stocks for the F1 score and precision under the HOLD class. These statistics indicate that the performance of the proposed model is better than CNN for all three classes - BUY, HOLD and SELL.
Table 3.3: Summary of HOLD Class Classification Results for Stock Trading
Table 3.4: Summary of SELL Class Classification Results for Stock Trading
From the summary results in the above tables, the metrics for which the model gave the best average values are the average F1 score and precision for the BUY class, the average F1 score and recall for the HOLD class, and the average F1 score and precision for the SELL class. The F1 score is an important metric here, as it is the harmonic mean of precision and recall.
As can be observed, the performance for the HOLD class decreased when increasing the number of layers of the SDCF model. However, there is an increase in the correct identification of BUY and SELL points, despite the fact that BUY and SELL points appear far less frequently than HOLD points. This is actually more crucial for the financial system, as it directly influences the returns. It also indicated that the model captured all three classes, i.e., BUY, HOLD and SELL, well.
This was also indicated in the confusion matrices, given in Figure 3.3 for each depth of the SDCF architecture. With more layers, the model started to identify the BUY and SELL points more correctly. The HOLD signal had more false positives with the shallow architecture (SDCF 1L), which decreased with the increase in layer number; this was essential for the system in order to classify the other class points correctly.
Figure 3.3: Confusion matrices corresponding to the different number of CTL layers of the architecture: a) 1 layer
of CTL (shallow version), b) 2 layers of CTL (deep version), c) 3 layers of CTL (deep version) and d) 4 layers of
CTL (deep version) where 0 - BUY, 1 - HOLD, 2 - SELL signals.
Additionally, the overall weighted F1 score, precision and recall metric values were calculated for all the stocks under consideration. The reason for computing weighted values was to incorporate the class imbalance of every stock. The detailed results can be referred from the appendix section of the paper [57], and summary results are given in Table 3.5. Again, the results comprise two sets of values marked in bold or red, with the same err_dif of 0.02. There are 6, 9 and 5 stocks for the metrics F1 score, precision and recall for which the model performed greater than or equal to the performance given by the state-of-the-arts. Also, there are 6, 3 and 6 stocks for the metrics F1 score, precision and recall, respectively, for which the model gave
the next-best performance under 0.02 err_dif. Although the BUY and SELL classes' performance with the 4-CTL-layers-deep architecture is better than the benchmarks compared against, the overall performance from the average weighted metrics is suggestive of the good performance of the 3-layers-deep architecture, as explained later.
Again analyzing explicitly for CNN, there are 4, 2 and 7 stocks with greater than or equal performance, and 3, 2 and 3 stocks with similar/next-best performance for the F1 score, precision and recall metrics, respectively. As can be seen from these statistics, the proposed model gives better results under both the greater-than-or-equal and next-best/similar criteria, except that the number of stocks for the recall metric is slightly higher with CNN under greater-than-or-equal performance. However, the next-best performance statistic for the recall metric is much better than CNN's. The overall average performance is good with the proposed model as compared to the benchmarks and CNN, which can also be seen from Table 3.5.
Table 3.5: Summary of Weighted Classification Results for Stock Trading
For a deeper understanding, the financial results were analyzed to assess the quality of predictions made by the SDCF model. As explained earlier, the AR values were calculated from the predictions generated by each technique for every stock over 11 years. The AR values were also calculated with the true labels for every stock over the same period. Finally, the absolute difference/error between the AR values from the predictions and the AR values from the true labels was computed. These absolute differences were averaged over all the stocks, yielding the so-called Mean Absolute Error (MAE). The detailed results are given in the corresponding table of the paper [57]. With the proposed model, 5 stocks have the best performance, whereas CNN-TA has 1 stock, and MFNN and FCN have 2 stocks each. Overall, the performance is good with the proposed model, as is also evident from the summary results in Table 3.6, where
the mean of the absolute difference/error (MAE) between the true AR and the predicted AR is reported. Also, there are 3 stocks for which the proposed model gave performance equal to the other benchmark techniques. This set of results illustrated that, despite the higher capability of identifying the BUY and SELL points with the 4-layers-deep CTL, the AR values are better predicted with the 3-layers-deep architecture.
With respect to CNN, there are only 2 stocks for which CNN performs better than the benchmarks and the proposed models, and 3 stocks for which it gave an equal performance. Thus, from the combined (greater-than-or-equal and next-best/similar), average classification and financial results, CNN is less performant than the proposed model. This also indicated that the quality of predictions made with the SDCF model is better than CNN's, as the identified class labels give AR values quite close to the true AR values. The same held against all the benchmarks. The statistics presented here can be deduced from Table 3.6.
Table 3.6: Summary of Financial Results for Stock Trading

Method      MAE of AR
SDCF 1L     22.5613
SDCF 2L     20.7227
SDCF 3L     20.5067
SDCF 4L     22.8287
CNN         21.1140
FCN         23.7720
CNN-TA      22.1380
MFNN        22.3040
To further compare the supervised learning of the regular CNN and the SDCF framework, the channel-wise features X_c obtained after the last maxpool layer of the 3-convolutional-layers-deep versions of both frameworks were visualized. Figure 3.4 shows the visualizations of these features.
Figure 3.4: Visualization of channel-wise features X_c for SDCF versus a standard CNN for one sample of stock BSELINFRA.BO; panels (a)-(e) correspond to channels X_1 Close Price, X_2 Open Price, X_3 High Price, X_4 Low Price and X_5 NAV (features of shape 16 × 1, resized to 8 × 2 for better visualization). The top row shows the features generated by the proposed SDCF network.
As seen from Figure 3.4, the heatmaps of the channels corresponding to the prices (Close, Open, High and Low) show no variation in the case of CNN, as compared to the SDCF architecture. While CNN shows some variation in the features learned for NAV, the features are still better learned with SDCF. Also, the darker the color in the heatmap, the larger the negative exponent of the value; hence, in the case of CNN, the values are very small, almost diminishing to zero. This corroborated the fact that the filters learned with the proposed model are distinct, owing to the added "log-det" term, which in turn gives different features with significantly less redundancy. Thus, the visualizations of these channel-wise features also support the claim of better supervised training with the SDCF framework than with CNN.
In this section, the ablation study performed is discussed. The network was trained in a piecemeal fashion here. The motive for performing this study was to understand the behavior of the network without the benefit of joint training. Since it is piecemeal, it was carried out in two parts. In the first part, the network learned the fused representations by minimizing the following objective function:
\[
F_{\mathrm{fusion}}(\tilde{T}, Z, X) = \frac{1}{2}\Big\|Z - \sum_{c=1}^{C}\mathrm{flat}(X^{(c)})\tilde{T}_c\Big\|_F^2 + \Psi(Z) + \sum_{c=1}^{C}\Big(\mu\|\tilde{T}_c\|_F^2 - \lambda\log\det(\tilde{T}_c)\Big) \tag{3.4}
\]
It is the same as equation 3.1, and hence all the variables here mean the same. Then these learned Z's are fed into a shallow single-layer neural network trained with the cross-entropy loss
\[
F_{\mathrm{CE}}(\theta, Z \mid y) = \sum_{k=1}^{K} \log \sum_{v=1}^{V} e^{z_k^\top(\theta_v - \theta_{y_k})}. \tag{3.5}
\]
This is the same as equation 3.2, and thus all variables mean the same. However, the difference is that here θ is not learned jointly with Z; instead, two separate pipelines learn each of these variables individually. The hyperparameters used for both parts are the same as those used in the proposed architecture, with the Adam optimizer. The results corresponding to the classification and financial analyses are given under the heading "piecemeal" in Tables 3.7, 3.8, 3.9, 3.10 and 3.11, respectively.
Table 3.7: Ablation Study performance for BUY Class
Table 3.8: Ablation Study performance for HOLD Class
Table 3.11: Ablation Study Financial Results

Method      MAE of AR
SDCF 1L     22.5613
SDCF 2L     20.7227
SDCF 3L     20.5067
SDCF 4L     22.8287
piecemeal   23.4073
From the results in Tables 3.7, 3.8, 3.9, 3.10 and 3.11, it can be clearly seen that the piecemeal version did not perform well as compared to the proposed solution, except for the HOLD class and the recall value of the weighted summary results, where it gave values only slightly better than the proposed ones. Despite these slightly higher values, the piecemeal approach did not recognize BUY and SELL points as efficiently as the proposed method, SDCF, which is critical for the system. This also suggests that the joint supervised solution involving the cross-entropy loss guided better representation learning. Therefore, the proposed solution's joint training is justified and important for the system to recognize the critical BUY and SELL points, as well as to recognize the HOLD points efficiently.
Experiments for two additional window sizes, namely 5 and 20, have also been performed. In order to avoid extensive space utilization, only the comparative summary results are reported for the weighted F1 score (classification metric) and
AR (financial metric) in Table 3.12 for window sizes 5 and 20, along with the summarized results for window size 10. The proposed method yielded the best results overall; although in one case (window size 20) it did not reach better results in terms of weighted F1, it still performed well financially for the same scenario. Furthermore, CNN-TA could not be run for small window sizes (such as 5) and hence cannot be deemed an all-purpose, go-to method. Small window sizes are crucial for highly non-stationary stocks, and the inability of a technique to handle them is a significant limitation. Overall, the proposed model performed better than the benchmarks and CNN both classification-wise and financially; specifically, it gave the best performance with the 3-CTL-layers-deep architecture. Training loss plots are also displayed for a few stocks, namely INDRAMEDCO.BO and NATIONALUM.BO, in Figure 3.5, for both the shallow and the deeper versions. It can be seen that the training loss decreased to the point of stability for each example considered.
Table 3.12: Comparative Summary Results for Stock Trading for window sizes 5,10,20
Figure 3.5: Evolution of the loss during training for a few stock examples of the proposed model with (a) CTL 1
layer, (b) CTL 2 layers, (c) CTL 3 layers and (d) CTL 4 layers.
3.2 DeConDFFuse: Predicting Drug-Drug Interaction using Joint Deep Convolutional Transform Learning and Decision Forest Fusion Framework
This section presents another proposed supervised framework, again based on CTL. Briefly, the framework jointly trains CTL based network pipelines, fuses them with TL and finally passes the result through a Decision Forest (DF). The technique has been applied to Drug-Drug Interaction (DDI) prediction. DDIs are the adverse changes, effects or reactions of one drug due to the recent or concurrent use of another drug(s). For example, the drug Ceftriaxone should be avoided in children less than 28 days old if they are receiving or expected to receive calcium-containing intravenous products, owing to fatal reactions resulting from crystalline deposits in the lungs and kidneys, as reported in [114]. Such reactions from DDIs are known as adverse drug reactions (ADRs). ADRs are responsible for threats to a person's life and inadvertently increase overall healthcare costs.
According to the studies [115, 116], ADRs contribute to more than 20% of clinical trial failures and are considered the highest load in the modern drug discovery process. Serious ADRs can cause severe disability and even death in patients. Also, from the study [116], it is observed that approximately 3.6% of all hospitalizations are related to ADRs. From the financial perspective, the annual cost of drug-related morbidity in the United States (US) was estimated at $528.4 billion in 2016, equivalent to 16% of total US healthcare expenditures that year [117]. It thus becomes pertinent to identify, in an exhaustive manner, the DDIs that could cause ADRs. This might not avoid all unanticipated drug interactions, but it can help lower the drug development costs.
Several strategies have been investigated in the literature, which are discussed below. They broadly include similarity-based and statistical machine learning (SML) models, graph models, deep learning models and matrix factorization models.
The work [119] proposes similarity-based models that compute similarity scores between drugs to predict interactions. Researchers have also explored Bayesian learning models [120] under statistical learning paradigms. Another work [121] uses a sparse feature learning ensemble method with linear neighborhood regularization, exploiting drug data such as enzymes and pathways. In [122], ML algorithms like Naive Bayes, Decision Tree, Random Forest, Logistic Regression, and XGBoost were used with cross-validation, with SMILE values and interaction features based on the CYP450 group as input.
Such techniques rely on hand-crafted input features rather than learning them, restricting them to some fixed set. Furthermore, overfitting stays a significant issue with these techniques due to their restrictive non-linear mapping and fitting capability.
Knowledge graphs (KGs) have also been explored for DDI prediction. With the advent of the availability of biomedical data, researchers are moving toward KGs to populate and complete the available biomedical information. This is done with the help of the large structured databases and texts available publicly [123]. Some works have used the combination of the DDI matrix and KG for the said DDI prediction as well. In [124], the proposed framework integrates KG-based drug representations; the integrated representations, combined via concatenation, are passed through a neural network to get the final DDI predictions. The study [125] proposes attention-based RNNs - LSTMs - for DDI prediction. In [126], deep neural networks based on the attention technique are utilized for predicting DDIs.
Further studies combine KGs and DL to predict DDIs. In the study [127], the DDI matrix and KG form the input to a DL network with CNN and LSTM components. Let us also mention recent studies that present matrix factorization as the solution to predict DDIs [128, 129]. Here, the input is the DDI matrix and the similarity scores between the drugs; the pair for which the DDI is to be predicted is represented through these similarity scores. Some works use Triple Matrix Factorization as well [130, 131]. Another study learned unified drug representations via encoders, then applied four operators on the learned drug embeddings to represent drug-drug pairs, and finally used an RF classifier to train models for predicting DDIs [132].
In the proposed framework, a decision forest yielding the final DDI predictions is jointly optimized with the learning of features. Such a solution has been successfully used before in the Deep Neural Decision Forest (DNDF) framework [1], which involves a CNN with the cross-entropy loss in its optimization objective and a softmax at the end of its architecture. Here, instead, the goal was to guide the supervision through a random forest, and the features were inherited not from a CNN but from the DeConFuse network based on deep CTL and fused through linear transform learning. The latter's advantage is that it promotes distinct filters/transforms, which is not guaranteed with CNNs. Such a benefit helped in learning quality representations,
further guided by the predictions from the decision forest, whose parameters are jointly optimized. This joint optimization was helpful, as the representations learned are guided not only by the deep CTL and fusion objectives but also by the decision forest. Thus, the representations that would otherwise have been fed to the Decision Forest (DF) as plain input were learned better through the joint end-to-end solution, which is not possible with a piecemeal approach. This double-guided (deep CTL based fusion + DF) supervised framework is detailed next.
The details of the DeConFuse network can be referred from Chapter 2. Let us briefly discuss the other component, the DNDF framework, and finally give the details of the combined architecture.
The DNDF framework from [1] is introduced in this section; it is the last brick of the DDI pipeline. It differs from conventional deep neural networks in that it outputs the final predictions from a decision forest, whose split (decision) and leaf (prediction) node parameters are jointly and globally optimized. The decision functions are driven by the outputs of the lower layers of deep CNNs. This reduces the uncertainty on the routing decisions of a sample taken at the split nodes, such that the globally defined loss function is minimized. For the leaf nodes, the optimal predictions are obtained by minimizing a convex objective function, which does not require step size selection. The details follow.
Consider a classification problem with input space χ and finite output space Y. A decision tree comprises decision (or split) nodes and prediction (or leaf) nodes. Decision nodes, indexed by N, are the internal nodes of the tree, and prediction nodes, indexed by L, are the terminal/leaf nodes of the tree. Each decision node n ∈ N holds a decision function d_n(·; θ), parameterized by θ, which routes the samples along the tree branches. When a sample x ∈ χ reaches a decision node n, it is directed to the left or right sub-tree based on the output of the function d_n(x; θ). The routing here is probabilistic: the routing direction is the output of a Bernoulli random variable with mean d_n(x; θ). When the sample ends in a leaf node ℓ, the related tree prediction is given by the class distribution π_ℓ held by that leaf. In the case of stochastic routings, the leaf predictions are averaged by the probability of reaching each leaf. Thus, the final prediction for a sample x from tree D with decision nodes parameterized by θ is
\[
(\forall y \in Y)\quad P_D[\,y \mid x, \theta, \pi\,] = \sum_{\ell \in L} \pi_{\ell y}\, \mu_\ell(x \mid \theta) \tag{3.6}
\]
where π = (π_ℓ)_{ℓ∈L}. Here above, μ_ℓ(x | θ) is the routing function providing the probability that sample x will reach leaf ℓ. Note that Σ_ℓ μ_ℓ(x | θ) = 1 for any x ∈ χ.
For an explicit form of the routing function, the following binary relations, which depend on the tree's structure, are introduced: ℓ ↙ n, which is true if ℓ belongs to the left sub-tree of node n, and n ↘ ℓ, which is true if ℓ belongs to the right sub-tree of node n. The routing function then reads
\[
\mu_\ell(x \mid \theta) = \prod_{n \in N} d_n(x; \theta)^{\mathbb{1}_{\ell \swarrow n}}\, \bar{d}_n(x; \theta)^{\mathbb{1}_{n \searrow \ell}} \tag{3.7}
\]
where d̄_n(x; θ) = 1 − d_n(x; θ), and 1_P denotes the indicator function of the argument P. The decision functions deliver a stochastic routing through
\[
d_n(x; \theta) = \sigma(f_n(x; \theta)), \tag{3.8}
\]
where σ is the sigmoid function and f_n(·; θ) is a real-valued output unit of the underlying deep network. Although the product in (3.7) runs over all nodes, only the decision nodes along the path from the root node to the leaf ℓ contribute to μ_ℓ, because for all other nodes 1_{ℓ↙n} and 1_{n↘ℓ} are both 0 (with the convention x^0 = 1).
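A small sketch of the routing computation (3.6)-(3.7) for a full binary tree is given below, assuming the decision probabilities are stored level by level; it illustrates the mechanics rather than reproducing the thesis code.

import torch

def leaf_probabilities(d):
    # d: (batch, N) decision probabilities d_n(x) = sigmoid(f_n(x)) for a full
    # binary tree stored level by level; returns mu: (batch, n_leaves), Eq. (3.7)
    batch = d.shape[0]
    mu = torch.ones(batch, 1)
    begin, level_size = 0, 1
    while begin < d.shape[1]:
        d_level = d[:, begin:begin + level_size]
        # each node splits its mass: left child gets d_n, right child gets 1 - d_n
        mu = torch.stack((mu * d_level, mu * (1 - d_level)), dim=2).reshape(batch, -1)
        begin += level_size
        level_size *= 2
    return mu  # rows sum to 1, as noted below Eq. (3.6)

def tree_prediction(d, pi):
    # pi: (n_leaves, n_classes) leaf distributions; returns P_D[y | x] of Eq. (3.6)
    return leaf_probabilities(d) @ pi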
Figure 3.6: Each node n ∈ N of the tree performs routing decisions via the function d_n(·). The black path shows an exemplary routing of a sample x along the tree to reach leaf ℓ_4, which has probability µ_{ℓ4} = d_1(x) d̄_2(x) d̄_5(x). Image taken from [1].
A forest is an ensemble of k decision trees F = {D_1, . . . , D_k}; its prediction for a sample x averages the outputs of the individual trees as follows:
\[
(\forall y \in Y)\quad P_F[\,y \mid x, \theta, \pi\,] = \frac{1}{k} \sum_{h=1}^{k} P_{D_h}[\,y \mid x, \theta, \pi\,]. \tag{3.9}
\]
Note that the tree parameters are omitted for notational convenience.
Learning a decision tree, whose model is explained in the previous sections, requires estimating the decision node parameterization θ and the leaf predictions π. This is done following the Empirical Risk minimization principle with respect to a given data set T ⊂ χ × Y under the log-loss (also known as the cross-entropy loss), i.e., minimizers of the following risk term are searched:
\[
F_{\mathrm{tree}}(\theta; \pi; T) = -\frac{1}{|T|} \sum_{(x,y) \in T} \log\big(P_D[\,y \mid x, \theta, \pi\,]\big) \tag{3.10}
\]
The forest is learned by considering the ensemble of trees F, where all trees can possibly share the parameters in θ. Still, each tree can have a different structure, with a different set of decision functions and independent leaf predictions π. An illustration of the forest of decision trees, taking the parameters θ and π, can be referred to from Figure 3.7. Thus, for the forest, the following empirical risk is minimized:
\[
F_{\mathrm{forest}}(\theta; \pi; T) = -\frac{1}{|T|} \sum_{(x,y) \in T} \log\big(P_F[\,y \mid x, \theta, \pi\,]\big). \tag{3.11}
\]
Figure 3.7: Illustration of how to implement a deep neural decision forest (DNDF). Top: deep CNN with a variable number of layers, subsumed via parameters θ. FC block: fully connected layer used to provide the functions f_n(·; θ) described in Eq. 3.8. Each output of f_n is brought in correspondence with a split node in a tree, eventually producing the routing (split) decisions d_n(x) = σ(f_n(x)). The order of the assignments of output units to decision nodes can be arbitrary (the one shown allows a simple visualization). The circles at the bottom correspond to leaf nodes, holding probability distributions π_ℓ. Image taken from [1].
The risk term depends on θ through the decision functions, on which the routing probabilities depend as shown in equation 3.8. The risk is minimized with respect to θ for a given π by employing the Stochastic Gradient Descent (SGD) approach:
\[
\theta^{(t+1)} = \theta^{(t)} - \eta \frac{\partial R}{\partial \theta}(\theta^{(t)}, \pi; B)
= \theta^{(t)} - \frac{\eta}{|B|} \sum_{(x,y) \in B} \frac{\partial L}{\partial \theta}(\theta^{(t)}, \pi; x, y) \tag{3.12}
\]
where η > 0 is the learning rate and B ⊂ T is a random mini-batch. A momentum term was also used to smooth out the gradients' variations, not shown explicitly here. The gradient of the loss L with respect to θ decomposes by the chain rule through the decision functions; its component for the output unit f_n reads
\[
\frac{\partial L}{\partial f_n(x; \theta)}(\theta, \pi; x, y) = d_n(x; \theta)\, A_{n_r} - \bar{d}_n(x; \theta)\, A_{n_l}, \tag{3.14}
\]
where n_l and n_r represent the left and right children of node n, respectively, and A_m is a quantity computed over L_m ⊂ L, the set of leaves held by the sub-tree rooted in node m.
Learning prediction nodes
After the learning of θ from the previous section, let us turn to the prediction nodes. The risk term in 3.10 is minimized with respect to π when θ is fixed, which is a convex optimization problem with a global solution; all the leaf nodes are estimated jointly. The update iterations read
\[
\pi_{\ell y}^{(t+1)} = \frac{1}{Z_\ell^{(t)}} \sum_{(x, y') \in T} \frac{\mathbb{1}_{y = y'}\, \pi_{\ell y}^{(t)}\, \mu_\ell(x \mid \theta)}{P_T[\,y \mid x, \theta, \pi^{(t)}\,]} \tag{3.16}
\]
for all ℓ ∈ L and y ∈ Y, where Z_ℓ^{(t)} is a normalizing factor ensuring that Σ_y π_{ℓy}^{(t+1)} = 1. The initial π^{(0)} is chosen randomly, typically π_{ℓy}^{(0)} = |Y|^{−1}. Here, the update steps for π were interleaved with the SGD updates for θ.
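A vectorized sketch of the leaf update (3.16) is shown below; it is an illustrative implementation, with pi, mu and y as assumed tensor names.

import torch
import torch.nn.functional as F

def update_leaves(pi, mu, y, eps=1e-12):
    # pi: (n_leaves, n_classes) current leaf distributions
    # mu: (batch, n_leaves) routing probabilities; y: (batch,) integer labels
    p = mu @ pi                                  # tree prediction P_T for all classes
    p_y = p.gather(1, y.unsqueeze(1)) + eps      # P_T[y | x] for the true label
    onehot = F.one_hot(y, pi.shape[1]).float()   # realizes the indicator 1_{y = y'}
    new_pi = pi * ((mu / p_y).T @ onehot)        # numerator of Eq. (3.16)
    return new_pi / (new_pi.sum(dim=1, keepdim=True) + eps)  # normalize via Z_l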
In the proposed framework, instead of utilizing the features from a CNN network, it was proposed to inherit the representations learned from the DeConFuse network and use them in the decision forest; i.e., the decision forest was jointly trained and optimized with DeConFuse. Channel-wise representations X^{(c)} were learned for each channel c ∈ {1, 2}, and a common representation Z was finally learned from the channel-wise representations X^{(c)}.
The representation Z was passed to the DF, where a feature mask was applied, i.e., features were randomly selected from the representation to participate in each decision tree's routing process; the selected features were fed to the linear fully connected layer parameterized by θ, given by the function f_n(x; θ_n) = θ_n^⊤ x. The number of features involved was controlled by the feature ratio. After that, the sigmoid activation was applied, as given in Eq. 3.8. Then the routing function was computed and the prediction probabilities were calculated, yielding a probability for each class from each tree. Finally, the probabilities from all the trees of the forest F were averaged to get the outcome probability for each of the classes, 0 and 1 in this case. The negative log-likelihood loss was computed and back-propagated, guiding the
learning of the parameters θ. The objective function for this framework, combining the ideas of DeConFuse and DF, can be deduced from (2.10) and (3.11):
\[
\underset{T, X, \tilde{T}, Z, \theta, \pi}{\text{minimize}} \;\; \underbrace{F_{\mathrm{fusion}}(\tilde{T}, Z, X) + \sum_{c=1}^{C} F_{\mathrm{conv}}(T_1^{(c)}, \ldots, T_L^{(c)}, X^{(c)} \mid S^{(c)}) + F_{\mathrm{forest}}(\theta; \pi; T_Z)}_{J(T, X, \tilde{T}, Z, \theta, \pi)}. \tag{3.17}
\]
Hereabove, the dataset TZ was built with the learned features Z and the known
labels. Note that there was no positivity constraint anymore on the learned
representations Z.
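A schematic composition of the joint objective (3.17) could look as follows; fusion_loss, conv_loss and forest_nll are placeholder names for routines evaluating F_fusion, F_conv and F_forest, and are not defined here.

def joint_objective(T, X, T_tilde, Z, theta, pi, S, y,
                    fusion_loss, conv_loss, forest_nll):
    # Eq. (3.17): fusion term + per-channel deep CTL terms + decision forest NLL
    J = fusion_loss(T_tilde, Z, X)
    for c in range(len(S)):              # C = 2 channels, one per drug in the pair
        J = J + conv_loss(T[c], X[c], S[c])
    return J + forest_nll(theta, pi, Z, y)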
3.2.3 Experimental Setup
The inputs are the DDI matrix and the bioactivity descriptors/feature vectors of each drug, as explained in section 1.3.3. The DDI matrix dataset was divided into training and testing datasets. All the drugs are kept in the training data, such that there are 95 samples per drug. Out of these 95 samples, 60% are 1-interactions for each drug (not exceeding half of the 95, i.e., min(60% of the 1-interactions, 95//2)), and the remaining are samples from the 0-interactions. The remaining samples of 0- and 1-interactions per drug are kept in the testing data. All the training and test data samples from each interaction category per drug are selected randomly. Also, only one pair per interaction is kept, from either the upper or the lower triangle of the DDI matrix. Thus, each training and testing sample is a drug pair together with its interaction label.
During training, a drug pair was passed as the input for a sample. For each drug in
a pair, 1D feature vectors, i.e., bioactivity descriptors, were fed to the individual deep CTL pipelines. Thus, the input S gathered the bioactivity descriptors/1D feature vectors of size 384 for each channel corresponding to each drug. Since the feature vectors for each drug in the drug pair are 1D, 1D convolutions were used in each deep CTL network. The two networks' learned features/representations X^{(c)} were flattened and concatenated. These features were then passed to the linear
Transform Learning layer that acted as a fully connected layer, where the transform T̃ and the fused representation Z were learned. The features of Z were then selectively sent, by applying the feature mask, to the decision forest. The final predictions were output by averaging the predictions from each tree in the forest. The complete architecture is shown in Figure 3.8, and all the architectural and training hyperparameters are listed in Table 3.13. There, the atom ratio signifies the number of features to be kept in the representation Z, and the feature ratio is the fraction of features randomly selected from Z for each tree parameterized by θ.
Figure 3.8: DDI prediction using the combined DeConFuse and decision forest architecture - DeConDFFuse. Here C = 2 networks/channels, via each of which a drug of the drug pair (inputs S1 = Drug 1 and S2 = Drug 2) is passed along with its bioactivity descriptors/feature vector. Each channel applies two convolutional (CTL) layers with SELU and maxpooling, after which the channel features are flattened and concatenated, passed through the Transform Learning fully connected layer, and fed to the decision forest that yields the output predictions.
The proposed method was compared against three state-of-the-art benchmarks, namely:
• KGNN: this technique builds a Knowledge Graph (KG) and passes the DDI matrix and KG to a Graph Neural Network (GNN).
Table 3.13: DDI Prediction DeConDFFuse Architecture Details

Parameter                       Value
Layer-wise hyperparameters
  Layer1 - Convolution (CTL)    (1, 16, 3, 1, 1)^1
  Maxpool                       (2, 2)^2
  Layer2 - Convolution (CTL)    (1, 32, 3, 1, 1)^1
  Maxpool                       (2, 2)^2
  atom ratio                    0.75
Decision Forest hyperparameters
  #Trees                        90
  tree depth                    10
  feature ratio                 0.5
Other model hyperparameters
  Epochs                        75
  Learning Rate                 0.01
  µ                             1e-05
  λ                             0.0001
  batch size                    4096
  weight decay                  1e-05
Optimizer hyperparameters
  Optimizer Used                Adam
  Ams grad                      True
  Learning rate                 0.01
  betas                         (0.9, 0.999)
  eps                           1e-08

^1 (in_planes, out_planes, kernel_size, stride, padding)
^2 (kernel_size, stride)
• Conv-LSTM: this technique used the DDI matrix and KG as input to a DL network with a CNN and an LSTM. The KG was input to the network in the form of embeddings; the comparison here is against the embedding that gave the best results in that work (ComplEx, cf. Table 3.14).
• Graph Embedding DDI: this technique used the KG and DDI matrix as input but experimented with many different types of embedding techniques. Each of these embeddings, one by one, was passed to machine learning techniques like Random Decision Forest (RF), Gaussian Naive Bayes (GNB), and Logistic Regression (LR) [123]. Here also, the embedding type Skip-Gram was used, which gave the best results in their study.
For all the benchmarks - KGNN, Conv-LSTM and Graph Embedding DDI - the same DrugBank IDs were used as present in the considered training and testing samples. Since these methods rely on KGs, the bioactivity descriptors/features were not used; instead, the KG and embeddings were recreated for the dataset used here.
The prediction results were evaluated using the classification metrics AUPRC, AUC-ROC, F1 score, precision, recall, and accuracy. Except for AUPRC, all these metrics have been explained previously in chapter 2; thus, only AUPRC is described here.
AUPRC: it is the area under the curve constructed by calculating and plotting the precision against the recall for a single classifier at a variety of thresholds. The higher the AUPRC score, the better a classifier performs on the given task. It can be computed as the weighted mean of the precisions at each threshold, with the increase in recall from the previous threshold used as the weight.
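A one-line computation with scikit-learn is sketched below; the arrays are illustrative placeholders.

from sklearn.metrics import average_precision_score

y_true  = [0, 0, 1, 1, 0, 1]                # known-to-interact = 1
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probability of class 1
# average precision = sum_n (R_n - R_{n-1}) * P_n, the threshold-weighted
# summary of the precision-recall curve described above
auprc = average_precision_score(y_true, y_score)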
All the metrics except accuracy were computed as weighted metrics, since there was a huge class imbalance between the 0 and 1 labels. Table 3.14 reports the results for all the methods, including the Conv-LSTM variant with the ComplEx embedding and the proposed method (Ours). Also, confusion matrices were computed for each method; they are displayed in Figure 3.9.
From Table 3.14, it is seen that the benchmark Graph DDI gave the best values in terms of accuracy, F1 score and recall, while the proposed method performed well in terms of all the classification metrics used for evaluation. In fact, the next-best performance in terms of accuracy and F1 was given by the proposed method. Despite the highest F1, accuracy and recall values, Graph DDI failed to achieve the highest values for AUC-ROC and AUPRC, which are considered
Figure 3.9: Confusion matrices for the different benchmarks (including the KGNN variants Sum, Concat and Aggregate) and the proposed method - DeConDFFuse.
more relevant and important metrics for performance evaluation in the case of binary classification. The reason for this can be observed with the help of the confusion matrices in Figure 3.9. It can be seen that, with the proposed method, the highest number of known-to-interact interactions (1) were predicted correctly compared to all the benchmarks. Also, except for Graph DDI, the false positives, i.e., 0-interactions classified as 1, were fewer with the proposed method than with the other two benchmarks. Predicting the known-to-interact interactions correctly is more important for preventing ADRs, as explained before, and this is what Graph DDI did not achieve at all, or achieved nearly negligibly. Thus, the proposed method predicted the critical interactions better than any other benchmark. For the false positives too, it gave a good performance compared to the other benchmarks, except for Graph DDI; the latter is the reason why Graph DDI has three metric values higher than the proposed method. Though the number of false positives with the proposed approach was higher than with Graph DDI, it is not necessarily the case that these false positives were completely incorrect, because the 0-interactions do not signify that the drugs certainly do not interact; the interactions may simply be unknown.
In contrast, Graph DDI mostly predicted the 0 (not-known-to-interact) interactions, which goes against the study's objective, i.e., to identify the known-to-interact DDIs so as to avoid ADRs. With the proposed method, both types of interactions were classified reasonably well. It was also the most stable method, owing to the quality representations learned from the CTL based network. Each of the benchmarks was carefully examined. In KGNN, a KG and a Graph Neural Network are utilized; the latter had a large number of parameters and a high computation cost, since the neural network connects the neurons in each layer with every neuron in the preceding and following layers, and it has no distinctiveness guarantee on the kind of weights learned. The poorest performance, from Graph DDI, was due to the lack of learning ability of the traditional machine learning algorithms applied on top of the embeddings derived from the KG. Lastly, the Conv-LSTM framework carries all the disadvantages of CNNs discussed earlier.
The optimizer used for updating all the parameters of the framework, except the probability distributions π of the decision forest, is Adam, with the associated hyperparameters such as the learning rate, betas and eps mentioned in Table 3.13. Loss plots for the proposed technique can be referred to from Figure 3.10. It can be seen that, with the Adam optimizer, the loss decreased steadily to the point of stability.
Figure 3.10: Loss plot with the proposed method - DeConDFFuse.
The experiments were further carried out to understand the architecture in detail by comparing two versions - DeConDFFuse and a piecemeal version. The latter first trains the DCTL-based fusion network to learn the representations Z; these learned representations Z are then fed to a separate Random Decision Forest (RDF) module, where Z is treated as regular input to the system along with the labels. The hyperparameters of the DCTL networks are the same as in the proposed solution; for the RDF part, the hyperparameters were tuned to their best values. Both systems' performance was evaluated using the same metrics, and the results are reported in Table 3.15.
Table 3.15: Comparative Results with the DeConDFFuse (Ours) and Piecemeal approaches
From the above table, it can be clearly seen that the performance of the proposed solution is better than that of the piecemeal version. This can be further visualized with the help of the confusion matrices given below in Fig. 3.11.
Figure 3.11: Confusion matrices for the proposed method - DeConDFFuse and Piecemeal approach
On comparing the two versions, i.e., the proposed framework and the piecemeal approach, it is evident that the piecemeal version is not as good. Its prediction of 0-interactions is slightly better; however, the critical point is to predict the known-to-interact interactions (1) that are responsible for ADRs, and these are poorly predicted with the piecemeal approach compared to the proposed approach. Computation-metric wise too, the results are better with the proposed model. It can be concluded that the joint optimization and training of the Decision Forest (DF) with the DCTL-based fusion networks results in better representations, due to the added guidance from the DF that back-propagates the error from the DF classifier to the previous layers; this guidance is missing in the piecemeal approach, where the representations Z are treated as normal, fixed input.
3.3 Discussion
In this chapter, two supervised frameworks, SuperDeConFuse (SDCF) and DeConDFFuse (DCDF), were discussed, both based on CTL and extending DeConFuse. First, the SuperDeConFuse network was presented, a deep fusion end-to-end framework for processing stock trading data, leading to very good performance. In particular, the classification results are better with the proposed SDCF model than with the 1-D CNN approach. Also, the features X_c visualized for each channel and each method indicated better feature learning with SDCF. The results have shown that the presented solution (SDCF) is superior to CNN and the other state-of-the-art benchmarks.
Currently, the shortcoming of the model is that it takes slightly more time than CNN for its training. Thus, techniques that reduce the time complexity of the proposed framework will be investigated in the future to make it more efficient from this viewpoint.
The performance of SDCF showed that it is an effective tool for predicting stocks. However, stock price prediction is not the final goal of a trading system. As a next step, the use of the proposed algorithm will be investigated to study if it can generate actionable calls such as 'buy stock XYZ at a price ABC' or 'sell stock ZYX at a price CBA.' It will be interesting to see if this algorithm can make such predictions given a time horizon in the future. If possible, the present algorithm will also be extended to composite strategies such as 'pairs (longs and shorts).'
Second, DeConDFFuse was presented, a joint end-to-end framework for processing 1D multi-channel drug data. Unlike other deep learning models that separately use conventional machine learning algorithms like RDF, the proposed framework is jointly optimized and is not piecemeal. It has been applied to the binary classification task of DDI prediction, leading to good performance. The advantages of the proposed framework are the benefits of CTL applied to the two drugs of a drug pair, with representation learning that is additionally guided by the jointly optimized decision forest.
In the future, there is scope to improve performance by reducing the number of false positives. Also, the current solution to the DDI problem considers the event when two drugs are administered together; however, combinations of more than two drugs are routinely used. Thus, another extension will be the capability of the proposed framework to handle combinations of more than two drugs, which can be achieved within the proposed architecture by increasing the number of channels with the number of drugs. Lastly, although the proposed framework was designed for DDI prediction, it can be applied to other biomedical interaction problems, and such areas can be explored in the future. To conclude, the frameworks presented are supervised versions of CTL, jointly trained and optimized with additional guidance modules, and each of the proposed frameworks offers scope for improvement, as discussed above for each separately.
Chapter 4
DeConFCluster: Deep Convolutional Transform Learning based Multiview Clustering Fusion Framework
With the rapid increase in data collection sources and volume, the exploration of multiview data has gained attention. Multiview data refers to data collected from the same source but from different angles, or with different modalities or features, for example: media with different content; the same statement labeled with different tags by different individuals; and the same image captured using different features. Multiview data is richer and more informative, but more complex, than single-view data. In multiview data, the data belonging to each view carries information related to the same underlying subjects. In clustering, data instances are grouped into several groups or clusters based on the various features of the data. Multiview Clustering (MVC) accordingly leverages the multiple views of the data to perform the grouping of data instances into possible clusters.
Multiview data knowledge extraction is vital in big data mining and analytics nowadays. In this regard, many recent works suggest CNN based clustering frameworks, typically built on autoencoders. In such a work, the clustering loss is included after the encoder network, which entails the additional training of a decoder network and hence incurs extra computational cost. Furthermore, several approaches apply clustering algorithms like K-Means in a piecemeal fashion, which may lead to suboptimal representations. Hence, a Deep CTL based multiview clustering fusion framework - DeConFCluster - is introduced, which bridges all the gaps mentioned earlier and is evaluated on standard multiview image and text datasets. The results demonstrate that the proposed framework outperforms the
state-of-the-art multiview deep clustering approaches. This chapter is further organized into sections, with section 4.1 discussing the current related works for MVC. Next, the proposed formulation is explained in section 4.2. The experimental evaluations and the results are discussed in sections 4.3 and 4.4, respectively.
Multiview clustering groups subjects into subgroups using multiview data and is one of the problems that fall under big data analytics. Recently, many solutions have been proposed to perform it. These are broadly classified into two categories, generative and discriminative. Generative approaches try to model the underlying data distribution: they use generative models, with each model representing an individual view, and then find the clustering solution. In contrast, discriminative approaches directly optimize a clustering objective. The generative category includes EM based and mixture models, which can be further categorized. One such work adopts a multinomial distribution for the document clustering problem. Similarly, based on different assumptions and criteria, two versions of the multiview EM algorithm for finite mixture models are proposed in [143]. Using Convex Mixture Models, the work in [144] could find the global optimum; it also avoided the initialization and local-optima issues that otherwise require multiple executions of the algorithm. The major issue with EM based algorithms is their slow convergence; another concern is that, in some scenarios, the E-step and M-step can be unmanageable.
Another line of work comprises spectral clustering based approaches. These obtain a common clustering result under the assumption that the same or a similar eigenvector matrix is shared among all views. There are two popular variants: the first is co-training based spectral clustering, applicable when both labeled and unlabelled data are available; the second is co-regularized spectral clustering, whose objective function generally penalizes the difference between the predictors from each view, obtained in general by making each view's eigenvectors agree with a consensus. Another widely used tool is Non-negative
Matrix Factorization (NMF), which seeks two non-negative matrix factors called the basis and the indicator. In the case of MVC, some studies propose learning a common indicator matrix across the views [154, 155] for NMF. Some works propose using multiview K-Means clustering to deal with extensive data; one such work adopted a common indicator matrix across different views. Besides, another work used a categorical utility function to measure the similarity between the indicator matrix from each view and the common indicator matrix, and proposed a consensus clustering scheme.
Also, there are methods in which a direct view combination via kernels is used, e.g., defining a kernel for each view and then combining these kernels in a convex combination [161, 162]. All the methods mentioned earlier have achieved satisfactory performance on the clustering task. However, it may be challenging to handle data with high-dimensional features and nonlinear properties using the above-stated methods, since they mostly adopt shallow and linear embedding functions to reveal the structures underlying the views.
Recently, graph based MVC has also gained momentum. The authors in [59] proposed a solution wherein the graph matrices of multiple views are combined into a unified graph matrix by generating the Similarity Induced Graph (SIG) matrices for all the available views. Then, a rank constraint is applied on the graph Laplacian matrix, and the connected components produced from the unified graph give the final number of clusters.
Deep learning has emerged as a highly utilized technique for almost all real-world problems and is also used in the case of MVC. In [163], multiple autoencoders are utilized on multiview data to generate multiple latent representations, and heterogeneous graph learning is applied to fuse the generated latent representations, followed by a K-Means module for the final clusters. Further, another work extracts consensus information from multiple views and collaboratively learns the deep representations. There are also works applying Graph Convolutional Network (GCN) based MVC to fully benefit from the features embedded in attributed multiview graph data; the work in [166] used a GCN as an encoder with the most reliable view as input. In another work, autoencoders are used to learn a shared feature representation among the different views [167]. Here, the issue is the additional weight training incurred by the decoder network; regularization can be used to prevent the trivial solution, but even with the mentioned solution there are chances of over-fitting. Further, CNNs do not guarantee the learning of distinct filters. In contrast, in this chapter, CTL is combined with a fusion and clustering framework and trained in a joint end-to-end fashion. A framework is thus introduced that jointly trains and optimizes the DeConFuse and K-Means clustering modules, addressing the aforementioned shortcomings.
This section presents the proposed formulation for MVC. It extends the previously established works - the Deep CTL based K-Means framework (DCKM) and the DeConFuse fusion framework. The latter has already been discussed in Chapter 2; next, there is a brief discussion of the other prior method, DCKM, followed by the proposed formulation.
Figure 4.1: DCKM architecture. L represents the number of DCTL layers, M_{ℓc} the filter size and F_{ℓc} the number of filters of the respective layer ℓ and channel c.
4.2.1 DCKM
This framework extended the Deep CTL (DCTL) approach by adding a K-Means loss at the end. The DCTL approach from Section 2.2.2 was thereby leveraged to perform single-view clustering [168]. The loss formulation appends a K-Means clustering penalty to the DCTL objective (akin to the clustering term in (4.2) below), where β is the regularization weight associated with the K-Means clustering loss and H is the matrix of cluster indicators.
4.2.2 Proposed Formulation
In this section, the proposed DeConFCluster formulation is discussed. Previously, the framework DCKM [168] combined DCTL [67] with K-Means; here, the DeConFuse network is combined with K-Means, embedding the K-Means clustering loss as was done in DCKM [168]. The difference is the fusion step, which was not present in DCKM. The framework jointly trains and globally optimizes the DeConFuse network and the K-Means module. There are as many channels as there are views of the dataset, and each channel is processed by a DCTL network. This amounts to learning distinct transforms (T_c)_{1≤c≤C} and thus distinct and interpretable representations (X_c)_{1≤c≤C} for each channel input (S_c)_{1≤c≤C}. These channel-wise representations are further fused using TL [90] to learn a common representation Z and transform T̃. This completes the first module of the architecture. The representations are then fed as input to the second part of the framework, the K-Means clustering module, which gives the clustering results. Thus, the representations learned are also guided by the K-Means loss. The learning problem reads:
\[
\underset{T, X, \tilde{T}, Z, H}{\text{minimize}} \;\; F_{\mathrm{fusion}}(\tilde{T}, Z, X) + \sum_{c=1}^{C} F_{\mathrm{conv}}(T_1^{(c)}, \ldots, T_L^{(c)}, X^{(c)} \mid S^{(c)}) + \beta \|Z - Z H^\top (H H^\top)^{-1} H\|_F^2 \tag{4.2}
\]
1
P.Gupta, A. Goel, A. Majumdar, E. Chouzenoux and G. Chierchia, “DeConFCluster: Deep Convolutional Transform
Learning based Multiview Clustering Fusion Framework". 2023. Submitted in IEEE TNNLS
The complete architecture of DeConFCluster is summarized in Fig. 4.2.
Figure 4.2: Overview of the proposed DeConFCluster architecture. C represents the number of DeepCTL networks/channels, L is the number of DCTL layers, M_{ℓc} is the filter size and F_{ℓc} is the number of filters of the respective layer ℓ and channel c.
All the variables were learned in an end-to-end fashion. Typically, SGD could
be used as an optimizer for all the variables except H. This latter variable was
updated directly via K-Means clustering [170] at each iteration using the current
Z estimate as an input.
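A sketch of the clustering penalty in (4.2) is given below, with H recomputed from the current Z via scikit-learn's K-Means, as an illustration of the update described above (tensor shapes are assumed, with Z stored as features × samples, and every cluster assumed non-empty so that H H^T is invertible).

import torch
from sklearn.cluster import KMeans

def kmeans_penalty(Z, n_clusters, beta):
    # Z: (features, samples); H: (k, samples) one-hot cluster indicators
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        Z.detach().T.cpu().numpy())
    H = torch.eye(n_clusters)[torch.as_tensor(assign)].T.to(Z.device)
    # Z H^T (H H^T)^{-1} H replaces each column of Z by its cluster mean
    proj = Z @ H.T @ torch.linalg.inv(H @ H.T) @ H
    return beta * torch.norm(Z - proj, p='fro') ** 2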
The experiments were conducted on four multiview clustering datasets - 100leaves, ALOI, Mfeat and WebKB - which have already been discussed in section 1.3.4. Next, let us explain the network configuration. Each channel was designated for one of the views of the multiview dataset. The representations learned from these channels' networks were concatenated to pass through a fully connected layer learned via TL, yielding the fused representation Z. Finally, the clusters were obtained by inputting this representation into the K-Means module. The pipeline is shown in Fig. 4.2. The Stochastic Gradient Descent (SGD) algorithm was used as the optimizer, with λ = 0.01, µ = 0.0001 and weight decay 0.001 for all the datasets. There was also another hyperparameter, feature_ratio, which indicated the percentage of features kept in the final representation Z. All the other hyperparameters' values were grid-searched, and the ones that gave the best results were set as the final values; these values can be referred from Table 4.1.
Table 4.1: DeConFCluster hyperparameters for MVC Datasets
The results were compared with four state-of-the-art works, briefly described below:
• MCGL: a graph based learning method. Starting graphs were learned using the different views' data points and further optimized with a rank constraint on the Laplacian matrix, and then fused into a global graph. The global graph was learned with the same rank constraint on its Laplacian matrix. The cluster indicators were obtained from the global graph directly, without conducting any graph-cut technique or K-Means clustering [171].
• GMC: in this approach, each view was weighted, and the SIG matrices and the unified graph matrix were jointly learned [59].
• SiMVC & CoMVC: the work in [172] proposed these two models, the second of which aligned the distributions of the views by adding a contrastive module and selective view alignment through view prioritization. Therefore, the experiments here were conducted with the CoMVC part.
4.4 Results and Analysis
For evaluating clustering performance, metrics such as Accuracy, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) are commonly employed [173, 174]. Thus, the proposed model is evaluated with these metrics, defined as follows.
• Normalized Mutual Information (NMI): NMI is a normalized measure of the similarity between two labelings of the same data instances. It is given by
\[
\mathrm{NMI} = \frac{I(l, c)}{\max(H(l), H(c))} \tag{4.3}
\]
where I(l, c) denotes the mutual information between the true labels l and the predicted cluster labels c, and H(·) denotes the entropy.
• Adjusted Rand Index (ARI): ARI measures the similarity between two clusterings by considering all pairs of data instances that are assigned to the same or different clusters in the actual and predicted labels. The range of ARI is [−1, 1]; the higher the ARI value, the better the clustering. It is given by
\[
\mathrm{ARI} = \frac{RI - E}{\max(RI) - E} \tag{4.4}
\]
where RI = Rand Index and E is the Expected Rand Index Value for random
where a = the number of times a pair of elements belongs to the same cluster
and
X ni X nj
N
E=( ( )× ( ))/( ) (4.6)
2 2 2
samples in cluster j.
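As a concrete sketch of how both metrics are computed in practice, the snippet below evaluates equation 4.3 via scikit-learn’s NMI with average_method='max' (which normalizes by max(H(l), H(c))), and computes the pair-counting form of equations 4.4–4.6 from the contingency table, checking it against scikit-learn’s adjusted_rand_score. The toy labels are made up for illustration.

import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

labels_true = np.array([0, 0, 1, 1, 2, 2])   # toy ground-truth labels
labels_pred = np.array([0, 0, 1, 2, 2, 2])   # toy predicted clusters

# NMI with max-normalization, matching equation (4.3)
nmi = normalized_mutual_info_score(labels_true, labels_pred, average_method="max")

# ARI from the contingency table (pair-counting instantiation of (4.4)-(4.6))
M = contingency_matrix(labels_true, labels_pred)   # counts n_ij
index = comb(M, 2).sum()                           # RI term: sum_ij C(n_ij, 2)
a = comb(M.sum(axis=1), 2).sum()                   # same-cluster pairs, true labels
b = comb(M.sum(axis=0), 2).sum()                   # same-cluster pairs, predicted labels
E = a * b / comb(M.sum(), 2)                       # expected index, equation (4.6)
ari = (index - E) / ((a + b) / 2 - E)              # equation (4.4)

assert np.isclose(ari, adjusted_rand_score(labels_true, labels_pred))
print(f"NMI = {nmi:.3f}, ARI = {ari:.3f}")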
The results of the proposed model and the benchmarks on all four datasets are reported in Table 4.2. It can be observed from Table 4.2 that, for all datasets, the proposed model performed better than the state-of-the-art methods, except for the NMI values on Mfeat and ALOI and the ARI in the case of ALOI. It is worth noting that the proposed technique performed well in the case of WebKB and ALOI, both of which have few samples relative to the number of clusters to be identified. On Mfeat and ALOI it still reached good Accuracy values, and good ARI values on Mfeat. Thus, the proposed method performed well for the challenging datasets and only slightly worse for the easier ones, making its overall performance favorable.
Table 4.2: Clustering Results. All metrics in (%)

Also, the convergence plots for all the datasets were generated; these can be referred to in Fig. 4.3. Using SGD as the optimizer, it could be clearly inferred that the given solution converged to a point of stability. The SGD parameters, such as mini-batch size and learning rate, are given in Table 4.1 for all the considered datasets.
This section presents the results of the three ablation studies performed for all the datasets. The first experiment varied the values of the regularizers λ and µ associated with the log-det and Frobenius-norm penalty terms in the CTL and TL formulations, equations 2.11 and 2.14, respectively.
Figure 4.4: Ablation Studies Result Plots on λ, µ
Several combinations of values were evaluated for the two penalty regularizers, down to (10⁻⁴, 10⁻⁵). The results can be referred to in Table 4.3 and are also displayed graphically for all three metrics, Accuracy, NMI and ARI, in Fig. 4.4. It can be clearly concluded from the results that the penalization terms play an important role: when varying the regularizers of these penalizations, the performance is robust for three of the datasets. However, the degraded results for Mfeat at lower values of these regularizers demonstrate that the penalties help to learn better representations and hence should be part of the formulation.
Table 4.3: Ablation Studies Results on λ, µ
Figure 4.5: Ablation Studies Result Plots on K-Means Regularizer

Secondly, experiments were carried out with the regularizer β associated with the K-Means clustering loss in equation 4.2. The values of β lie in the range [0, 1], specifically (0.0, 0.1, 0.3, 0.5, 0.8, 1.0). The results are presented both in tabular form and graphically, and can be referred to in Table 4.4 and Fig. 4.5, respectively. It could be observed that, for all the datasets, the performance was better with a non-zero β, which signified that the K-Means loss is an important term of the final objective.
This inference was further validated by the third experiment, where the results were computed using a piecemeal version of the proposed model: first, the representations were learned from the DeConFuse networks alone, and then passed through the K-Means clustering module to get the final clusters, i.e., with β = 0. The results can be referred to in Table 4.5, from which it was clearly inferred that the jointly trained model outperformed the piecemeal variant.
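Under the same assumed names as the earlier training sketch, the piecemeal baseline amounts to the following: train without the K-Means term (β = 0), then cluster the learned representations once, post hoc. The train_deconfuse routine is hypothetical.

from sklearn.cluster import KMeans

# Piecemeal baseline: no joint K-Means loss during training (beta = 0).
model = train_deconfuse(views, beta=0.0)      # hypothetical training routine
Z = model(views).T.detach().cpu().numpy()     # N x d learned representations

labels = KMeans(n_clusters=K).fit_predict(Z)  # clustering applied separately, post hoc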
4.5 Discussion
In this chapter, the DeConFCluster framework for multiview clustering was discussed. The proposed framework jointly trains the DCTL-based DeConFuse networks and the K-Means clustering module. A notable advantage of this framework is that it does not have the additional overhead of learning the weights of a decoder network. It also performs well in data-constrained scenarios where the number of data instances is low and the number of classes is high, for example, in the case of 100leaves and WebKB. Moreover, the CTL losses enforce distinct filters and thus, in turn, help to learn more interpretable filters. Overall, the framework demonstrated higher clustering scores as compared to the current state-of-the-art MVC frameworks, with only a few metrics slightly lower in the case of Mfeat and ALOI.
Chapter 5
Conclusion
The proposed works in this thesis focused on modeling various prediction problems as multi-channel fusion problems. The frameworks proposed are based on the recently established technique CTL, and hence are variants that deal in the analysis domain, covering both unsupervised and supervised settings. In this section, the chapter-wise contributions are briefly summarized.

First, unsupervised fusion frameworks were modeled as both shallow and deep architectures based on CTL, namely ConFuse and DeConFuse, respectively. These frameworks were applied to the problems of stock trading and stock forecasting. The most significant advantage of these frameworks was that they avoided the effort of re-training the network that is required with most other techniques, especially CNNs. In summary, the same representations were utilized for both regression and classification without two different trainings.
Second, supervised fusion frameworks were proposed based on CTL. The first framework involved fusion via TL over the representations learned from the individual CTL-based channels; further, there was a linear fully connected layer followed by the cross-entropy loss. This framework was applied to the stock trading problem. Its performance showed that the proposed method eliminated the issue of dead neurons and did not require employing an activation function between the last convolutional layer and the fully connected (fusion) layer learned via TL, unlike what is required in CNNs. All the other advantages, like distinct filters and interpretable representations, are also present here, since the frameworks are based on CTL and TL. The learned representations were even compared with those from CNNs, and the features and results from the proposed frameworks were found to be better.
The second supervised framework, DeConDFFuse, extended the DeConFuse network and jointly trained and optimized it with a Decision Forest (DF). Again, all the benefits of CTL are applicable here as well. Additionally, it extracted individual and cross-channel features of the drugs, finding the most relevant features of the drugs that interact with each other. The results from this framework are superior to the benchmarks, indicating the benefit of employing it for the DDI prediction task.
Lastly, the proposed DeConFCluster framework performs the multiview clustering task utilizing representations obtained by fusing the individual views’ representations; thus, it learns both individual-view and shared information. It comprises the DeConFuse network and a K-Means module that are jointly trained and optimized. Therefore, the representations learned are beneficial, since they have the advantages of CTL and are also well guided through the K-Means loss. The same is observed in the clustering results. Further, the framework prevented the additional training of a decoder network that is otherwise required, and it performed well in scenarios where the number of data instances is low and the number of classes is high.
It is believed that the proposed algorithms are generic and can be used not only for the kinds of problems discussed throughout this dissertation but also in other research fields where one can formulate the problem to be solved as a multi-channel fusion problem. Both unsupervised and supervised frameworks have been proposed; thus, the proposed frameworks can solve both genres of problems.

Several directions remain for future work. The proposed solutions were applied to stock prediction problems that required day-wise predictions; however, they could be extended to finer-grained trading decisions such as BUY and SELL signals at shorter horizons. Also, the proposed solutions have dealt with 1D data so far and, thus, could be extended to 2D data, for example, to estimate dense depth maps from sparse maps; such a fusion using the proposed CTL-based frameworks is a promising direction. Similarly, the supervised framework was employed for DDI prediction, but it is believed that it can also be used to predict different types of drug-related interactions. Also, the problem targeted currently involves two drugs administered together, whereas many times more than two are administered together in a real scenario; the latter can be easily handled with the proposed network.
Moreover, the unsupervised frameworks are developed to extract features for performing the dual tasks of regression and classification, and the same features could serve other tasks as well, like anomaly detection. Anomaly detection requires finding and identifying outliers to prevent fraud, adversary attacks, network intrusions, etc. Likewise, the clustering framework can even be utilized to extract such features to find clusters, for example, to segment customers pertaining to a particular market. In all such applications, one can analyze the available data and extract meaningful representations to perform the mentioned tasks using the proposed frameworks. Also, the unsupervised (ConFuse and DeConFuse) and supervised frameworks (SuperDeConFuse and DeConDFFuse) currently operate on views that are similar in nature. Nevertheless, extending them to heterogeneous modalities, like image, text, video, and audio information, can be considered; for example, the Caltech-UCSD Birds dataset provides both image and text information.
Finally, semi-supervised extensions of the proposed frameworks are of immense importance, as these will help perform tasks involving partially labeled data; this basically reduces expenses on manual annotation and cuts down on data labeling effort.
References
monitoring using ECG and PPG for personal healthcare,” Journal of Medical
with Applications, vol. 110, pp. 352 – 362, 2018. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0957417418303646
[6] F. Rodrigues, I. Markou, and F. C. Pereira, “Combining time-series
and textual data for taxi demand prediction in event areas: A deep
learning approach,” Information Fusion, vol. 49, pp. 120 – 129, 2019.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253517308175
[7] S. Daneshvar and H. Ghassemian, “MRI and PET image fusion by combining IHS and retina-inspired models,” Information Fusion, vol. 11, no. 2, pp. 114–123, 2010.
S1566253515000536
B. Zhao, Y. Xiong, and D.-Q. Wei, “MDF-SA-DDI: predicting drug–drug
https://doi.org/10.1093/bib/bbab421
learning for data fusion,” Information Fusion, vol. 57, pp. 115–129, 2020.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253519303902
[14] C. Ounoughi and S. Ben Yahia, “Data fusion for ITS: A systematic
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253522001087
[15] I. Belhajem, Y. Ben Maissa, and A. Tamtaoui, “A robust low cost approach
for real time car positioning in a smart city using extended kalman fil-
806–811.
[16] I. Belhajem, Y. M. Ben, and A. Tamtaoui, “Improving vehicle localization
in a smart city with low cost sensor networks and support vector machines,”
and Autonomous Systems, vol. 74, pp. 128–147, 2015. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0921889015001529
fusion system for moving object detection and tracking in urban driving
’12. New York, NY, USA: Association for Computing Machinery, 2012, p.
ization in vehicular ad hoc networks using data fusion and v2v communi-
cle navigation,” IEEE Transactions on Intelligent Transportation Systems,
2004.
[24] Q. Miao, Q. Li, and D. Zeng, “Mining fine grained opinions by using
[26] T. Rohe, A.-C. Ehlis, and U. Noppeney, “The neural dynamics of hier-
[27] N. Nesa and I. Banerjee, “Iot-based sensor data fusion for occupancy
sensing using dempster–shafer evidence theory for smart buildings,” IEEE
https://www.sciencedirect.com/science/article/pii/S1566253518304731
[29] S.-H. Chen, J.-S. Pan, K. Lu, and H. Xu, “Driving behavior analysis of
[33] F. Li, Y. Fan, X. Zhang, C. Wang, F. Hu, W. Jia, and H. Hui, “Multi-
feature fusion method based on eeg signal and its application in stroke
//www.sciencedirect.com/science/article/pii/S0020025512004185
[35] Z. Yao and W. Yi, “License plate detection based on multistage information
fusion,” Information Fusion, vol. 18, pp. 78–85, 2014. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S1566253513000663
https://www.sciencedirect.com/science/article/pii/S0957417417301331
//www.sciencedirect.com/science/article/pii/S030645732200214X
decomposition,” IEEE Journal of Biomedical and Health Informatics,
[39] F. Wang, K. Wang, and F. Jiang, “An improved fusion method of fuzzy
[40] S. Xiao, Y. Zhang, X. Liu, and J. Gao, “Alert fusion based on cluster and
https://doi.org/10.1109/ICHIT.2008.197
https://www.sciencedirect.com/science/article/pii/S1568494616300771
Signal Processing, vol. 41, no. 1, pp. 239–253, 2013. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0888327013002963
in 2011 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE
pii/S0377221716301096
unified deep learning framework for time-series mobile sensing data pro-
[46] Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. Zhao, “Time series classification
[47] B. Pu, Y. Liu, N. Zhu, K. Li, and K. Li, “ED-ACNN: Novel attention
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1568494620306268
detection from a vehicle using deep learning network and future integration
with multi-sensor fusion algorithm,” in WCX™ 17 SAE World Congress, March 2017.
Letters, vol. 24, no. 12, pp. 1795–1803, 2003. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0167865503000047
[50] X. Chen, J. Chen, G. Cheng, and T. Gong, “Topics and trends in artificial
intelligence assisted human brain research,” PLoS ONE, vol. 15, no. 4,
2020.
“Multi-view deep learning for rigid gas permeable lens base curve fit-
2016.
[54] Y. Chen, C. Li, P. Ghamisi, X. Jia, and Y. Gu, “Deep fusion of remote
sensing data for accurate classification,” IEEE Geoscience and Remote
https://www.sciencedirect.com/science/article/pii/S0957417420309349
3932, 2021.
via multi-manifold regularized nonnegative matrix factorization,” in Pro-
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,
paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
Representations, 2015.
Computer Society, jun 2015, pp. 1–9. [Online]. Available: https:
//doi.ieeecomputersociety.org/10.1109/CVPR.2015.7298594
[66] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
https://www.sciencedirect.com/science/article/pii/S0957417423007406
“Deconfcluster: Deep convolutional transform learning based multiview
“Deep learning for time series classification: a review,” Data Mining and
https://doi.org/10.1162/neco.1997.9.8.1735
[75] Z. Wang, W. Yan, and T. Oates, “Time series classification from scratch
classification,” in International Joint Conference on Neural Networks
A. Iosifidis, “Forecasting stock prices from the limit order book using con-
stock trading model with 2-D CNN trend detection,” in 2017 IEEE Sympo-
2017.
[85] H. Bauschke, R. Burachik, P. Combettes, and D. Luke, Fixed-Point Al-
[87] Y. Soun, J. Yoo, M. Cho, J. Jeon, and U. Kang, “Accurate stock movement
2022 IEEE International Conference on Big Data (Big Data), 2022, pp.
1691–1700.
backward splitting, and regularized Gauss–Seidel methods,” Mathematical
[94] P. Combettes and J.-C. Pesquet, “Deep neural network structures solv-
https://arxiv.org/abs/1808.07526.
2013.
[98] S. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and
[99] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
[100] C. Kocak, “Arma( p,q ) type high order fuzzy time series forecast method
based on fuzzy logic relations,” Applied Soft Computing, vol. 58, pp.
92–103, 2017.
[101] G. Zumbach and L. Fernández, “Option pricing with realistic ARCH processes,”
[102] Z. Lin, “Modelling and forecasting the stock market volatility of SSE com-
model,” Emerging Market Review, vol. 33, pp. 140 – 154, 2017.
[104] R. Bisoi and P. Dash, “A hybrid evolutionary dynamic neural network for
stock market trend analysis and prediction using unscented kalman filter,”
[106] F. Ming, F. Wong, Z. Liu, and M. Chiang, “Stock market prediction from
WSJ: text mining via sparse matrix factorization,” in 2014 IEEE International
163–173, 2017.
2017.
short-term stock prices using ensemble methods and online data sources,”
Expert Systems with Applications, vol. 112, pp. 258 – 273, 2018.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417418303622
[110] Y. Chen and Y. Hao, “A feature weighted support vector machine and
Systems with Applications, vol. 80, pp. 340–355, 2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0957417417301367
5501–5506, 2013.
using recurrent neural network and technical indicators,” Neural Compu-
[113] W. Long, Z. Lu, and L. Cui, “Deep learning-based feature engineering for
Textbook of Pediatrics.
[120] M. Yu, S. Kim, Z. Wang, S. Hall, and L. Li, “A bayesian meta-analysis on
https://www.sciencedirect.com/science/article/pii/S0020025519304116
antagonist using hybrid chemical features,” Cells, vol. 10, no. 11, 2021.
p. 726, 2019.
science/article/pii/S0020025521009294
[125] S. K. Sahu and A. Anand, “Drug-drug interaction extraction from
https://www.sciencedirect.com/science/article/pii/S1532046418301606
https://www.sciencedirect.com/science/article/pii/S1532046418302144
cal deep matrix factorization for "in silico" antiviral repositioning: Appli-
[130] J.-Y. Shi, H. Huang, J.-X. Li, P. Lei, Y.-N. Zhang, K. Dong, and S.-M.
Yiu, “Tmfuf: a triple matrix factorization-based unified framework for
[131] J.-Y. Shi, H. Huang, J.-X. Li, P. Lei, Y.-N. Zhang, and S.-M. Yiu, “Predict-
ing comprehensive drug-drug interactions for new drugs via triple matrix
108–117.
S1046202319303421
[133] X. Lin, Z. Quan, Z.-J. Wang, T. Ma, and X. Zeng, “Kgnn: Knowledge
//www.sciencedirect.com/science/article/pii/S0020025517307776
Systems with Applications, vol. 186, p. 115810, 2021. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0957417421011787
science/article/pii/S0957417419301186
https://www.sciencedirect.com/science/article/pii/S0020025518303487
https://www.sciencedirect.com/science/article/pii/S0957417421003171
S0957417405002496
Systems with Applications, vol. 84, pp. 281–289, 2017. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0957417417303202
[143] X. Yi, Y. Xu, and C. Zhang, “Multi-view em algorithm for finite mixture
Inc., 2007.
[146] J. Sun, J. Lu, T. Xu, and J. Bi, “Multi-view sparse co-clustering via prox-
ceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37,
http://arxiv.org/abs/1707.09866
rithm based on global and local structure preserving,” IEEE Access, vol. 9,
[150] Y. Ye, X. Liu, J. Yin, and E. Zhu, “Co-regularized kernel k-means for multi-
[152] C. Zhang, H. Fu, Q. Hu, X. Cao, Y. Xie, D. Tao, and D. Xu, “Generalized
Analysis and Machine Intelligence, vol. 42, no. 1, pp. 86–99, 2020.
clustering by squeezing hybrid knowledge from cross view and each view,”
932–944, 2013.
[156] X. Cai, F. Nie, and H. Huang, “Multi-view k-means clustering on big data,”
Trans. Knowl. Discov. Data, vol. 12, no. 4, apr 2018. [Online]. Available:
https://doi.org/10.1145/3182384
12th ACM International Conference Knowledge Discovery Data Mining
[162] W. Shao, L. He, C.-t. Lu, and P. S. Yu, “Online multi-view clustering with
2021.
[164] J. Xu, Y. Ren, G. Li, L. Pan, C. Zhu, and Z. Xu, “Deep embedded multi-
of graph convolutional networks with laplacian rank constraints,” Neural
[167] S. Fan, X. Wang, C. Shi, E. Lu, K. Lin, and B. Wang, “One2multi graph
pp. 211–215.
arXiv:1512.07548, 2015.
[171] K. Zhan, C. Zhang, J. Guan, and J. Wang, “Graph learning for multiview
clustering,” IEEE Transactions on Cybernetics, vol. 48, no. 10, pp. 2887–
2895, 2018.
(CVPR 2021), Los Alamitos, CA, USA, jun 2021, pp. 1255–1265.
[173] X. Peng, S. Xiao, J. Feng, W.-Y. Yau, and Z. Yi, “Deep subspace clustering
[174] X. Peng, J. Feng, J. T. Zhou, Y. Lei, and S. Yan, “Deep subspace clustering,”