Authors:
Quentin Portes¹; José Mendès Carvalho¹; Julien Pinquier² and Frédéric Lerasle³
Affiliations:
¹Renault Software Lab, Toulouse, France; ²IRIT, Paul Sabatier University, CNRS, Toulouse, France; ³LAAS-CNRS, Paul Sabatier University, Toulouse, France
Keyword(s):
Sentiment Analysis, Deep Learning, Multimodal, Fusion, Embedded System, Cockpit Monitoring.
Abstract:
Multimodal neural networks for sentiment analysis use video, text and audio. Processing these three modalities tends to produce computationally heavy models. In the embedded context, all resources, and computational resources in particular, are restricted. In this paper, we design models that address these two antagonistic constraints. We focus on reducing the number of model input features and the size of the different neural network architectures. The main contribution of this paper is the design of a specific 3D Residual Network instead of a basic 3D convolution. Our experiments are conducted on the well-known MOSI dataset (Multimodal Corpus of Sentiment Intensity). The objective is to achieve results similar to the state of the art. Our best multimodal approach reaches an F1 score of 80% with the number of parameters reduced by a factor of 2.2 and the memory load reduced by a factor of 13.8, compared to the state of the art. We designed five models: one for each modality (i.e., video, audio and text) and one for each fusion technique. The two high-level multimodal fusions presented in this paper are based on evidence theory and on a neural network approach.
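To illustrate the difference between a basic 3D convolution and a 3D residual block, the following is a minimal PyTorch-style sketch; the channel sizes, strides and usage tensor shapes are assumptions for illustration, not the authors' published architecture.

```python
# Minimal sketch of a 3D residual block (illustrative; not the paper's exact model).
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3x3x3 convolutions with a skip connection, as in standard ResNets."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the output shape changes.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv3d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm3d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

# Usage on a video clip tensor: (batch, channels, frames, height, width).
x = torch.randn(1, 3, 16, 112, 112)          # hypothetical clip shape
block = ResidualBlock3D(3, 16, stride=2)
print(block(x).shape)                         # torch.Size([1, 16, 8, 56, 56])
```

The skip connection adds the input back to the convolution output, which eases optimization of deeper stacks without increasing the parameter count of the convolutions themselves.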
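Since evidence theory is named as one of the two high-level fusion techniques, the following is a self-contained sketch of Dempster's rule of combination over two modality mass functions. The two-class frame and the mass values are hypothetical and only illustrate the combination mechanism, not the paper's actual evidence assignment.

```python
# Sketch of Dempster's rule of combination for two modalities (illustrative).
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions (frozenset -> mass) with Dempster's rule."""
    combined: dict = {}
    conflict = 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc  # mass falling on the empty set
    # Normalise by the non-conflicting mass (1 - K).
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

NEG, POS = frozenset({"neg"}), frozenset({"pos"})
BOTH = NEG | POS  # ignorance: mass on the whole frame of discernment

# Hypothetical per-modality beliefs (e.g., from video and audio classifiers).
m_video = {NEG: 0.6, POS: 0.3, BOTH: 0.1}
m_audio = {NEG: 0.5, POS: 0.2, BOTH: 0.3}
print(dempster_combine(m_video, m_audio))
# -> NEG ~0.73, POS ~0.23, BOTH ~0.04
```

Unlike a learned fusion layer, this combination needs no extra parameters, which is attractive under the embedded-resource constraints the abstract describes.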