Abstract
Most sentiment analysis studies focus on classifying the sentiment of individual frames in a video, ignoring the spatio-temporal information carried by the sequence of frames as well as the accompanying text and audio. Multiple kernel learning, a recent focus of kernel machine learning, is well suited to handling multiple modalities. However, standard multiple kernel learning tends to ignore base features with weak discriminative power and therefore cannot make full use of the base features of different modalities. This paper puts forward a novel multi-modal fusion model for sentiment analysis, in which a multiple kernel learning algorithm based on a convolutional margin-dimension constraint is proposed for feature fusion. A 3D convolutional neural network is used to extract the features of the visual information, and the margin-dimension-constrained multiple kernel learning algorithm is used to fuse the visual, text, and audio sentiment features. Experiments conducted on the MOUD and IEMOCAP sentiment databases show that the proposed model outperforms existing models in the field of multi-modal sentiment analysis.
1 Introduction
Affective computing is an emerging research field aimed at enabling intelligent systems to recognize, infer, and explain human sentiments. The field is inherently interdisciplinary, integrating principles from computer science, psychology, social science, and cognitive science [1]. With the proliferation of social networks and online news platforms, multimedia content, including text, images, and videos, has become a critical medium for communication. The ability to automatically monitor and analyze the reputation of brands or competitors through this multimedia content is essential for understanding online subjective judgments of products and brands.
To address this need, researchers have extensively studied sentiment analysis techniques over the past decade, focusing on extracting and analyzing emotions and opinions from unstructured documents. Traditionally, these studies have concentrated on text content using text mining and natural language processing (NLP) technologies. They often analyze user comments, browsing history, user tags, and social network relations to derive insights. For instance, Xu et al. proposed a semi-supervised Laplacian eigenmap to reduce detection errors and enhance the emotional classification of texts [2]. However, with the rise of social networks, users increasingly express their opinions through videos and images, necessitating sentiment analysis that encompasses multiple data sources.
Most current studies in this field focus on multimodal emotion recognition, utilizing both visual and auditory information. However, research on truly integrated multimodal emotion analysis remains limited. While media objects like images and videos are integral to online content, sentiment analysis through visual methods alone has been relatively constrained. A closely related field is visual aesthetic analysis, which involves methods for automatically estimating the perceived quality of images [3]. For instance, a method was developed to construct a large-scale semantic concept ontology, correlating strong emotional responses to images, such as “beautiful flowers” [4]. In addition, Khan et al. utilized Clifford geometric algebra to enhance the segmentation of medical images, demonstrating the potential for advanced visual processing techniques in sentiment analysis [5].
The integration of multimodal emotion recognition can be viewed as the fusion of diverse information types. Multimodal fusion combines data from various modalities to improve analysis accuracy and decision-making. This approach has garnered significant attention due to its broad applications. Multimodal data fusion provides supplementary information and enhances overall results or decision accuracy [6]. Researchers have primarily explored two fusion types: feature-level fusion and decision-level fusion. Feature-level fusion typically involves concatenating feature vectors from different modalities to form comprehensive feature vectors, though this method necessitates careful feature and parameter design.
Recent studies indicate that multimodal emotion recognition can achieve superior performance. However, most sentiment recognition studies either focus on single-modality feature extraction or use simplistic concatenation of features from different modalities. Effectively integrating different forms of features remains a significant challenge. Multiple Kernel Learning (MKL) has emerged as a hotspot in kernel machine learning, offering solutions to the limitations of single kernel functions in handling heterogeneous or irregular data. MKL algorithms combine multiple kernel functions to achieve better results in complex scenarios [7, 8].
In this study, we propose a novel MKL algorithm based on margin-dimension constraints of convolution. This algorithm learns the weights of different basis features according to their recognition capabilities, integrating visual emotion features extracted by a 3D convolutional neural network (CNN) with text and audio emotion features to enhance video emotion recognition accuracy. Our model flexibly and stably handles multi-source and heterogeneous datasets, offering high generalization ability and significantly improving social-sentiment classification accuracy.
The contributions of this study are twofold:
(1) We exploit a 3D CNN to extract visual information features from videos, effectively capturing spatial and temporal information. The time dimension embedded in the 3D CNN enhances the extraction of these features, improving recognition performance.
(2) We propose an MKL algorithm based on convolution margin-dimension constraints, which fuses visual, text, and audio emotion features in videos. By assigning smaller weights to less discriminative features, this approach leverages the complementary strengths of different modalities, thereby enhancing video-emotion classification in social networks.
2 Related Work
Recent research on video modality recognition has produced a number of effective deep learning models, such as 3D-CNN [9,10,11,12] and LSTM [13, 14], for capturing and integrating features across temporal and spatial dimensions. MKL has been adopted as a superior method for handling heterogeneous data sources in recent work [15,16,17]. Kernel functions themselves have been studied extensively [21,22,23,24,25], and a large body of research [26,27,28,29,30,31,32,33] shows that the success of SVM has promoted the development of kernel learning.
To improve the performance of modality recognition in video, deep learning algorithms have been widely used in recent years to learn video representations. Compared with motion recognition in two-dimensional images, human actions in video are inherently three-dimensional, because a video sequence involves both the visual appearance and the motion dynamics of objects [9]. A 3D convolutional neural network deep learning model [10] was therefore proposed, in which 3D convolutions extract image features along both the temporal and spatial dimensions to capture the motion information in an image sequence. Simonyan et al. [11] aimed to obtain appearance information from still frames and motion information between adjacent frames, and proposed a two-stream ConvNet architecture consisting of spatial and temporal networks. A supervised multi-view feature learning framework has also been proposed to handle different views from a unified perspective [12].
Recent studies have shown that long short-term memory networks can learn when to forget the previous hidden state and when to update it by integrating memory cells [13]. Video activity recognition has been implemented by feeding frame-level CNN sequence features into a Long Short-Term Memory (LSTM) network, setting new records on multiple benchmark datasets [14]. Although multimodal emotion recognition can achieve better performance, most research in sentiment recognition either extracts features from a single modality or simply concatenates features from different modalities. How to effectively integrate different forms of features is still an open question. The most popular method of feature-level fusion is a simple concatenation of feature vectors from different modalities into a large feature vector. However, this fusion method requires careful design of feature and parameter selection, such as feature size, which amounts to manual feature selection.
In some cases, the sample size is large or the data contains heterogeneous information, that is, the data comes from different sources. For example, in the field of reservoir modeling, semi-supervised support vector regression has been shown to be able to convert rock properties into model distributions in a fluid environment. However, computing a linear regression in a high-dimensional space with a single kernel for data mapping makes it difficult to reproduce multi-scale, unstable structures and cannot accurately reflect the correlations in the data. Although this complexity can be alleviated to some extent by semi-supervised learning on unlabeled data, correlations at multiple scales still cannot be reflected accurately. Therefore, MKL has gradually become a hot topic in current research.
With the development of MKL, researchers have used MKL to fuse features of different modalities. Kernel methods [15,16,17] have attracted wide attention thanks to the development and application of support vector machine (SVM) theory [18]. The adoption of kernel functions makes it easy to extend linear SVMs to nonlinear SVMs. The core idea is to use a much simpler kernel function operation, which avoids both the complex inner product calculation in the feature space and the explicit design of the feature space itself [19, 20]. In fact, the study of kernel functions has a long history. Anceschi et al. [21] studied the relationship between functions of positive and negative type and integral equation theory. Aronszajn [22] developed the theory of reproducing kernel Hilbert spaces around 1950. In 1964, when Aizerman et al. [23] studied the potential function method for pattern recognition, they used Mercer's theorem to interpret the kernel function as an inner product in a feature space and introduced it into machine learning. However, the potential of kernel methods was not fully exploited at that time. In 1992, Boser et al. [24] proposed the SVM, a learning method that realizes structural risk minimization based on VC theory. Since SVMs showed excellent performance in text categorization, they quickly became a mainstream technology in machine learning. Support vector machines fall into two main categories: support vector classification (SVC) and support vector regression (SVR). An SVM is a learning method that works in a high-dimensional feature space and expresses the prediction function over a subset of the training samples, the support vectors. SVMs can represent complex gray-level structures with few support vectors, thus providing a new mechanism for image compression. A method [25] has also been proposed to introduce multi-channel data relations into the SVM optimization process, encoding different attributes of the training data as pairwise data relationships in a graph structure.
The success of SVM has promoted the rapid popularization and development of kernel methods, which have gradually penetrated many fields of machine learning, such as regression estimation [26], pattern classification [27], probability density estimation [28], and subspace analysis [29]. Typical examples include the kernel principal component analysis (KPCA) used by Zhou et al. [29], the kernel Fisher discriminant analysis (KFD) implemented by Cai et al. [30], the kernel discriminant analysis (KDA) proposed by Maldonado et al. [31], the kernel canonical correlation analysis (KCCA) proposed by Amarnath et al. [32], and the kernel independent component analysis (KICA) proposed by Chen et al. [33]. However, all of these methods are based on a single kernel, and the performance of kernel functions varies greatly across situations because of the characteristics of different kernels. In view of these problems, kernel methods have been further improved and widely applied in many fields, and in recent years a great deal of research has been devoted to the combinatorial kernel method, namely MKL.
The multiple kernel model [34, 35] is a more flexible kernel learning model. Recent theory and applications have shown that replacing a single kernel with multiple kernels can enhance the interpretability of decision making. Lanckriet et al. [36] proposed multiple kernel learning in 2004, combining several kernel functions to make up for the shortcomings of the single-kernel SVM and laying a solid foundation for the practical application of MKL. Chen et al. [37] proposed a scene classification algorithm based on multi-kernel fusion, which treats the test samples independently. In the same year, Zhou et al. [38] used a multi-kernel SVM model to solve multi-modal tasks, providing a reference for multi-feature classification. In 2016, Wang et al. [39] proposed a discriminative multiple kernel learning method for hyperspectral image classification.
This section reviewed recent advancements in video modality recognition, emphasizing the effectiveness of deep learning models such as 3D-CNN and LSTM in capturing and integrating features across temporal and spatial dimensions. The emergence of MKL as a superior method for handling heterogeneous data sources and improving multimodal feature fusion was highlighted, showcasing its application in various fields of machine learning. The development and success of SVM and subsequent kernel methods underscore the importance of MKL in addressing the limitations of single kernel approaches and enhancing the performance of sentiment analysis systems.
3 Proposed Multi-Modal Fusion Framework
This section puts forward a multi-modal fusion framework, which is an extension of the 3DCLS framework [2]. Compared with that framework, the proposed one employs a multiple-kernel fusion method rather than a single softmax function. Figure 1 illustrates the feature-extraction methods of the different modalities in the convolution-based margin-dimension-constrained MKL model, which combines the emotional features of audio, video and text. For the visual information in video, the model uses a 3D CNN to extract the emotional features; a CNN is used to extract the emotional features of the text in the video, and the openSMILE tool is used to extract the emotional features of the audio. Finally, the margin-dimension-constrained MKL fuses the three kinds of emotional features. The model learns the weights of the different basis features according to their recognition ability and, when constructing the optimal combined kernel, assigns smaller weights to basis features with less discriminative power, so as to make full use of the complementary features of the different modalities.
Fig. 1 Overview of the proposed approach. The figure illustrates the proposed multi-modal sentiment analysis framework, which integrates visual, textual, and auditory features to improve sentiment-classification accuracy. The framework processes video inputs through feature extraction and kernel-based fusion steps that leverage the strengths of multiple data modalities
The framework begins with video input, which undergoes 3D convolutional neural network (3D convNets) processing to extract visual features, capturing both spatial and temporal information from the video data. Concurrently, audio features are extracted using the openSMILE tool following initial pre-processing. Text features are derived from the video content using a separate CNN.
These multi-modal features, visual, auditory, and textual, are then mapped into feature vectors and processed through individual kernels: Kernel 1, Kernel 2, and Kernel 3. Each kernel is responsible for a specific feature set. The outputs from these kernels are integrated using an MKL algorithm, which effectively combines the different modalities by assigning appropriate weights to each feature set, enhancing the discriminative power of the model.
The final stage involves combining these weighted features into a single, cohesive kernel, which is then used for sentiment classification. This comprehensive approach allows the model to utilize the diverse information from all three modalities, resulting in more accurate and robust sentiment-classification outcomes. The diagram effectively demonstrates the flow and fusion of multi-modal data from raw input to final classification output, showcasing the system’s capability to handle complex, heterogeneous data sources for sentiment analysis.
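As an illustration of this kernel-level fusion step, the sketch below (illustrative Python, not the authors' code) builds one RBF base kernel per modality and combines them with a weighted sum, which is the form of combined kernel the MKL stage produces; the feature dimensions and weight values are placeholders.

```python
# Per-modality RBF base kernels combined with a weighted sum, i.e. the
# combined kernel K = sum_k W_k * K_k used for classification.
import numpy as np

def rbf_kernel(X, Y, lam=0.1):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-lam * sq)

def combined_kernel(modal_features, weights, lam=0.1):
    """Weighted sum of per-modality base kernels."""
    return sum(w * rbf_kernel(X, X, lam) for X, w in zip(modal_features, weights))

# Example: visual, text and audio features for the same 100 utterances;
# in the model, the weights would come from the MKL training step.
visual, text, audio = (np.random.randn(100, d) for d in (4096, 300, 384))
K = combined_kernel([visual, text, audio], weights=[0.5, 0.3, 0.2])  # 100 x 100
```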
| Study | Method | Dataset | Advantages | Disadvantages |
|---|---|---|---|---|
| Xu et al. (2020) | Semi-supervised Laplacian eigenmap | Custom dataset | Reduces emotion detection errors; effective text emotion classification | Only applicable to text sentiment analysis |
| Khan et al. (2019) | Clifford geometric algebra for image segmentation | Medical images | Improves image-segmentation accuracy | Limited to medical image processing |
| Lin et al. (2021) | Combined visual, auditory, and linguistic features | Custom dataset | High accuracy due to comprehensive multimodal information | Requires significant computational resources |
| Proposed method | Multiple kernel learning and 3D convolutional neural network | MOUD and IEMOCAP | Highly integrated multimodal features; high sentiment-classification accuracy | High model complexity; needs optimization for computational efficiency |
3.1 3D Convolutional Neural Network
A convolutional neural network is a feedforward neural network in which artificial neurons respond to surrounding units, making it well suited to large-scale image processing. Convolutional neural networks, built from convolutional and pooling layers, have made great progress in image recognition. With the emergence of video in social networks, researchers have used convolutional neural networks to detect the actions and behaviors of people in video. A two-dimensional CNN is suitable for static images, but the visual information in video carries temporal as well as spatial information. For learning spatio-temporal features, a simple and effective approach is to use a deep 3D convolutional network to recognize actions on a large-scale supervised video dataset. Compared with a two-dimensional CNN, C3D adds a time dimension, so it can effectively extract spatial and temporal features, retain the temporal information of the input, and achieve better recognition results. In this paper, we determine the structure of the 3D CNN through experimental comparison: eight convolutional layers, five pooling layers and two fully connected layers. The 2D and 3D convolution frameworks are shown in Fig. 2 [2].
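For concreteness, the following is a minimal PyTorch sketch of a C3D-style network with eight convolutional layers, five max-pooling layers, two fully connected layers and 3 × 3 × 3 kernels, as described above; the channel widths and the 16-frame 112 × 112 input size follow the common C3D layout and are assumptions, since the paper does not list them.

```python
# A C3D-style network: 8 conv layers (3x3x3 kernels), 5 max-pooling
# layers, 2 fully connected layers and a softmax output layer.
import torch
import torch.nn as nn

class C3D(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        def block(cin, cout, n_conv, pool):
            layers = []
            for i in range(n_conv):
                layers += [nn.Conv3d(cin if i == 0 else cout, cout,
                                     kernel_size=3, padding=1), nn.ReLU()]
            return layers + [nn.MaxPool3d(kernel_size=pool, stride=pool)]
        self.features = nn.Sequential(
            *block(3, 64, 1, (1, 2, 2)),   # conv1: pool only spatially
            *block(64, 128, 1, 2),         # conv2
            *block(128, 256, 2, 2),        # conv3a, conv3b
            *block(256, 512, 2, 2),        # conv4a, conv4b
            *block(512, 512, 2, 2))        # conv5a, conv5b
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 3 * 3, 4096), nn.ReLU(), nn.Dropout(0.5),  # fc6
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),             # fc7
            nn.Linear(4096, num_classes))                                  # output layer

    def forward(self, x):  # x: (batch, 3 channels, 16 frames, 112, 112)
        return self.classifier(self.features(x))

logits = C3D()(torch.randn(2, 3, 16, 112, 112))  # -> shape (2, 3)
```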
3.2 Multiple Kernel Learning with Margin-Dimension Constraints
The idea of classification is to find a dividing hyperplane that separates the samples of different categories. In the sample space, the dividing hyperplane can be represented by a linear equation, as shown in Eq. (1):
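In the standard form consistent with the definitions that follow, the hyperplane is

$$\omega \cdot x + b = 0 \quad (1)$$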
where \(\omega\) is the normal vector that determines the direction of the hyperplane. \(b\) is the offset term that determines the distance between the hyperplane and the origin.
In the actual classification task, we may not be able to find a hyperplane that correctly separates the samples in the original space. Therefore, the sample data need to be mapped from the original space to a higher-dimensional feature space. Equation (1) then becomes Eq. (2), where \(\Phi (x)\) denotes the feature vector obtained by mapping \(x\).
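In this mapped form, Eq. (2) reads

$$\omega \cdot \Phi (x) + b = 0 \quad (2)$$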
When separating different categories, we need to find the dividing hyperplane with the largest margin, that is, to maximize the value of \(\gamma\).
To maximize the value of \(\gamma\), only \(\left\| w \right\|^{ - 1}\) needs to be maximized, which is equivalent to minimizing \(\left\| w \right\|^{2}\). Thus, the objective cost function \(f\) of MKL and its constraint conditions can be obtained as shown in formula (4):
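A standard soft-margin form consistent with the penalty \(C\) and slack variables \(\xi_{i}\) introduced in the next paragraph is

$$\min_{w,\,b,\,\xi}\; f = \frac{1}{2}\left\| w \right\|^{2} + C\sum\nolimits_{i} \xi_{i}\,, \quad \text{s.t.}\;\; y_{i}\left( w \cdot \Phi (x_{i}) + b \right) \ge 1 - \xi_{i}\,,\;\; \xi_{i} \ge 0 \quad (4)$$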
In practice, it is difficult to find a suitable kernel function that makes the samples linearly separable in the feature space, and even when such a kernel function is found, it may cause overfitting. Therefore, we introduce a penalty factor \(C\). Given the misclassification penalty \(C\), the margin is maximized while the hinge loss on the training set is minimized. When \(C\) is infinite, Eq. (4) forces all samples to satisfy the constraint \(y_{i} (w \cdot \Phi (x_{i} ) + b) \ge 1\); when \(C\) takes a finite value, Eq. (4) allows some samples to violate their constraints. Each sample has a corresponding slack variable \(\xi_{i} \ge 0\), indicating the extent to which that sample violates the constraint.
The optimization problem of formula (4) can be solved through its dual problem, formula (5), where \(a_{i}\) is the Lagrangian multiplier and \(\sum\nolimits_{k} u_{k} W_{k}\) is a constant. Given a set of basic features and their associated basic kernels \(K_{k}\), we seek the best kernel combination \(K = \sum\nolimits_{k} W_{k} K_{k}\), where \(W_{k}\) is the weight of the \(k\)-th basic feature. The kernel combination \(K\) approximates the best tradeoff between discriminability and invariance for a particular application. The weights \(W_{k}\) are \(l_{1}\)-regularized because we want to find a minimal invariant set; therefore, most of the weights will be set to 0, depending on the parameter \(u\), which encodes our prior preference for each descriptor.
In the second iteration step, with \(a_{i}\) fixed, the projected gradient descent method is used to find the updated feature weights \(W_{k}\), as shown in formulas (6) and (7).
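A minimal sketch of this alternating scheme is given below; it assumes binary labels, uses scikit-learn's precomputed-kernel SVM for the \(a_{i}\) step and a simplex projection for the \(l_{1}\)-constrained weight update, and the SimpleMKL-style gradient, step size and stopping threshold are illustrative rather than the exact formulas (6) and (7).

```python
# Alternating optimization for K = sum_k W_k * K_k: fix the weights and
# solve the SVM dual for a_i, then take a projected gradient step on the
# l1-constrained weights.
import numpy as np
from sklearn.svm import SVC

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1), 0.0)

def mkl_weights(kernels, y, C=1.0, step=0.1, iters=50, tol=1e-3):
    W = np.full(len(kernels), 1.0 / len(kernels))       # start from uniform weights
    for _ in range(iters):
        K = sum(w * Kk for w, Kk in zip(W, kernels))    # combined kernel
        svc = SVC(C=C, kernel="precomputed").fit(K, y)  # step 1: fix W, solve for a_i
        beta = np.zeros(len(y))                         # beta_i = a_i * y_i
        beta[svc.support_] = svc.dual_coef_.ravel()
        grad = np.array([-0.5 * beta @ Kk @ beta for Kk in kernels])
        W_new = project_simplex(W - step * grad)        # step 2: projected gradient on W
        if np.max(np.abs(W_new - W)) < tol:             # stop when the weights stabilize
            return W_new
        W = W_new
    return W
```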
MKL tends to select the basis features with greater discriminative ability and to ignore those with lower discriminative power. The basis features selected by MKL usually show clear differences in the high-dimensional space. For basis features from different modalities, the kernel parameters associated with the optimal high-dimensional space may differ significantly. Therefore, traditional MKL cannot simultaneously utilize the maximum discriminative ability of each basis feature. To solve this problem, we need to enlarge the acceptable range of the margin value \(\gamma\). Figure 3 shows the margin constraint. In Fig. 3(a), the hyperplane of basic feature “a” separates the solid-point class from the triangle class with a large margin, while in Fig. 3(b) the hyperplane of basic feature “b” separates the two classes with a small margin. Since the hyperplane of basic feature “a” has a larger margin than that of basic feature “b”, basic feature “a” is more discriminative than basic feature “b” for classifying solid points and triangles.
Therefore, we propose to change the formula of the margin value \(\gamma\) to (8), so that the separation margin of each basic feature in the high-dimensional space provides a rough measure of that feature's discriminative ability. Although coarse, these measurements can effectively guide MKL when searching for the optimal feature combination. The separation margin of each basic feature can be calculated using Eq. (8) as the inverse square root of the objective cost function.
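In this form, the margin of the \(k\)-th basic feature can be written, for instance, as

$$\gamma_{k} = \frac{1}{\sqrt{f_{k}}} \quad (8)$$

where \(f_{k}\) denotes the value of the objective cost function obtained when the corresponding base kernel is used alone.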
After acquiring the visual features of the video with the 3D CNN, the text and audio features of the video need to be combined with them. In the fusion process, we use formula (8) to calculate the margin value between different categories and select the Gaussian radial basis function (RBF) kernel to map the input samples into the high-dimensional feature space, solving the linearly inseparable problem. The RBF is a radially symmetric scalar function, usually defined as a function of the Euclidean distance from any point in space to a center, and its effect is usually local. The RBF kernel can be expressed as follows:
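In the notation defined below, it takes the usual Gaussian form

$$K_{RBF}\left( x_{i}, x_{j} \right) = \exp\!\left( -\lambda \sum\limits_{q=1}^{N} \left( x_{i,q} - x_{j,q} \right)^{2} \right) \quad (9)$$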
where \(x_{i}\) and \(x_{j}\) represent the \(i\)-th sample and the \(j\)-th sample, respectively; \(x_{i,q}\) and \(x_{j,q}\) represent the \(q\)-th element in the feature vector; \(N\) is the sample feature dimension; \(\lambda\) is the parameter of the kernel function, when the value of \(\lambda\) increases, the kernel value is reduced.
In the fusion process, the basic features have different dimensions because they come from different modalities. As a result, each basic feature has a different optimal \(\lambda\) value, so MKL cannot simultaneously exploit the maximum discriminative power of all the basic features from the different modalities. We therefore normalize the RBF kernel to eliminate the influence of the feature dimension on the selection of \(\lambda\), so that all basic features have similar optimal \(\lambda\) values and the maximum discriminative power of all the basic features from the multiple modalities can be utilized. The dimension-standardized RBF kernel is expressed as shown in formula (10):
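One natural dimension-standardized form, scaling the squared distance by the feature dimension \(N\), is

$$K_{N}\left( x_{i}, x_{j} \right) = \exp\!\left( -\frac{\lambda}{N} \sum\limits_{q=1}^{N} \left( x_{i,q} - x_{j,q} \right)^{2} \right) \quad (10)$$

so that the effective kernel width no longer depends on how many dimensions a modality's features have.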
3.3 The Margin Dimension Constrained Multiple Kernel Learning Model Based on Convolution
To make full use of the emotional features of the different modalities, we use a 3D CNN to extract the emotional features of the visual information, a CNN to extract the emotional features of the text in the video, and openSMILE to extract the emotional features of the audio; finally, MKL based on the margin-dimension constraint is used to fuse the three kinds of emotion features. To exploit the features of the three modalities, the convolution-based margin-dimension-constrained MKL model shown in Fig. 4 is used. There are two levels of MKL in the model. First, column-level kernel functions measure the similarity of samples on each subset of text and audio features, denoted \(K_{Col}^{m} = \sum\nolimits_{k} C_{k} K_{k}\). In this way, the discriminative power of the text and audio feature subsets is exploited in different kernels and integrated to generate the best combined kernel for each feature subset. Then, a linear combination integrates the multiple combined kernels generated by MKL on the three feature subsets, which is the row-level MKL, denoted \(K_{Row} = \sum\nolimits_{m} R_{m} K_{Col}^{m}\). Under this multiple-kernel framework, the representation of a sample in the feature space is thus transformed into the selection of basic kernels and weight coefficients, and the information contained in the different feature subsets is mined and integrated into the final classification kernel.
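The two-level combination can be sketched as follows; the weights \(C_{k}\) and \(R_{m}\) appear as fixed placeholders here, whereas in the model they are learned by the margin-dimension-constrained MKL.

```python
# Two-level kernel combination: column-level MKL builds one combined
# kernel per feature subset, then a row-level linear combination merges
# the per-subset kernels into the final classification kernel.
import numpy as np

def column_kernel(base_kernels, col_weights):
    """K_col^m = sum_k C_k * K_k over the base kernels of one feature subset."""
    return sum(c * K for c, K in zip(col_weights, base_kernels))

def row_kernel(column_kernels, row_weights):
    """K_row = sum_m R_m * K_col^m over the feature subsets."""
    return sum(r * K for r, K in zip(row_weights, column_kernels))

# Three feature subsets (e.g. text, audio, visual), each contributing a
# few linear base kernels over 50 samples; the weights are placeholders.
n = 50
subsets = [[X @ X.T for X in (np.random.randn(n, d) for d in (300, 100, 50))]
           for _ in range(3)]
col_kernels = [column_kernel(ks, [0.5, 0.3, 0.2]) for ks in subsets]
K_final = row_kernel(col_kernels, [0.4, 0.3, 0.3])  # fed to the SVM classifier
```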
The pseudo code of the CMDMKL model proposed in this paper is shown in Algorithm 1. The algorithm combines the emotional features of text, visual and audio information, and the margin-dimension constraint solves the problem that traditional MKL cannot utilize the maximum discriminative ability of each basic feature simultaneously. In the model, margin values are used to distinguish different types of samples, and the dual problem is used to find the maximum margin. Smaller weights are assigned when combining basic features with less discriminative power, so that the complementary features of the different modalities are fully exploited.
After the fusion of different modal emotions, we propose a multimodal fusion framework that also incorporates decision-level fusion. The pseudo-code of the multi-modal fusion framework is shown in Algorithm 2. First, the emotional features of the different modalities are extracted; two modalities are merged and predicted using the CMDMKL model, and the result is then combined with the remaining modality by decision-level fusion. During fusion, \(w_{1}\) and \(w_{2}\) are the weights of \(M_{12}\) and \(M_{3}\), with \(w_{1} + w_{2} = 1\). The value of \(w_{1}\) is increased from 0.1 to 0.9 in steps of 0.1 while \(w_{2}\) correspondingly decreases from 0.9 to 0.1, and the best fusion weights are determined experimentally.
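An illustrative version of this grid search is sketched below; the class-probability arrays, label array and function name are placeholders rather than the paper's implementation.

```python
# Grid search for the decision-level weights: w1 weights the CMDMKL
# prediction of the fused pair M12, w2 = 1 - w1 weights the remaining
# modality M3, and the pair with the best validation accuracy is kept.
import numpy as np

def fuse_decisions(p_m12, p_m3, y_val):
    """Scan w1 over {0.1, ..., 0.9}; return the best (w1, w2, accuracy)."""
    best = (None, None, -1.0)
    for w1 in np.arange(0.1, 1.0, 0.1):
        w2 = 1.0 - w1
        fused = w1 * p_m12 + w2 * p_m3               # weighted class probabilities
        acc = np.mean(np.argmax(fused, axis=1) == y_val)
        if acc > best[2]:
            best = (round(w1, 1), round(w2, 1), acc)
    return best

# Example with 200 validation utterances and 3 sentiment classes.
rng = np.random.default_rng(0)
p12, p3 = rng.dirichlet(np.ones(3), 200), rng.dirichlet(np.ones(3), 200)
y = rng.integers(0, 3, 200)
print(fuse_decisions(p12, p3, y))
```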
4 Experimental Results and Analysis
The MOUD and IEMOCAP datasets are processed to extract visual, textual and audio sentiment features, and then the proposed CMDMKL model is used to fuse the three features to obtain the combined sentiment-classification accuracy.
Accuracy is one of the most common classification evaluation metrics; it measures the degree to which a classifier classifies correctly and refers to the proportion of correctly classified samples among all samples. The advantage of this metric is that it gives an overall measure of the classifier's performance across all categories. It is often used to roughly characterize the performance of a classifier during training, and it is more informative for multi-class tasks than for binary classification tasks. Mathematically, it is expressed as:
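Using the confusion-matrix counts defined below, this is the standard expression

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$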
where TP denotes the positive samples predicted by the model as positive, TN the negative samples predicted as negative, FP the negative samples predicted as positive, and FN the positive samples predicted as negative.
4.1 Dataset
MOUD: The Multimodal Opinion Utterances Dataset (MOUD) was compiled by Morency et al. [40], focusing on sentiment analysis in a social media context. The dataset includes video content such as product reviews and video recommendations, sourced from various social media platforms. The video samples were generated through interactions among 15 male and 65 female participants, aged between 20 and 60. Each participant's interaction was recorded, resulting in videos with a duration ranging from 2 to 5 min.
The videos were standardized to a 360 × 480 MP4 format for consistency. For analytical purposes, each video was divided into shorter segments. Each segment was manually annotated with one of three emotional labels: positive, negative, or neutral. This annotation process involved careful evaluation to ensure accurate emotional representation, making MOUD a valuable resource for training and evaluating sentiment analysis models, particularly in a multimodal framework.
IEMOCAP: The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset is a comprehensive multimodal dataset developed to facilitate research in emotion recognition [41]. It comprises approximately 12 h of video data, recorded over several sessions involving two-way interactions between professional actors—both male and female. These sessions were meticulously scripted and acted out to elicit a wide range of emotional responses. The dataset includes a diverse array of emotional expressions, categorized into nine emotion labels: happy, sad, neutral, angry, surprised, excited, frustration, disgust, and fear. For this study, we focused on four primary emotions: angry, happy, sad, and neutral. Each video segment was annotated by at least two or three raters to ensure the reliability and accuracy of the emotion labels. The annotations were based on a majority voting system, where the final label was assigned if a consensus was reached among the raters. The detailed and diverse emotional annotations in IEMOCAP make it an essential resource for developing and testing advanced multimodal emotion recognition systems.
The MOUD and IEMOCAP datasets provide rich, multimodal resources for emotion recognition research, encompassing a broad spectrum of emotional expressions across various contexts and interaction types. The careful annotation and segmentation in both datasets enable the development of robust, generalizable sentiment analysis models that can effectively interpret complex emotional cues from multiple modalities.
4.2 Visual Emotion Classification
For the 3D CNN, in order to better aggregate the temporal information in video, we experimented with different kernel time depths. We tried two kinds of network architectures: uniform time depth and varying time depth. In the uniform-time-depth networks, all convolutional layers share the same kernel time depth; we built four such networks with the kernel time depth d set to 1, 3, 5, and 7, respectively. When d is set to 1, the architecture is equivalent to a 2D CNN. In the varying-time-depth networks, different convolutional layers have different kernel time depths; we built two such networks, one with the time depths of the five convolution layers increasing as 3, 3, 5, 5 and 7, and the other with them decreasing as 7, 5, 5, 3 and 3.
We trained the two proposed networks on the MOUD visual dataset. Figures 5 and 6 show the results of network experiments based on different time depths. Figure 5 shows the network results with a uniform kernel time depth. It can be seen in the figure that when the time-depth d is set to 3, the accuracy of visual sentiment classification is the highest. When the time-depth d is set to 1, the accuracy is the lowest because the network does not consider the temporal features in the video data, and the motion modeling information is lost. Figure 6 shows the experimental results of transforming the kernel time depth. It can be seen in the figure that when we change the kernel time depth of the convolutional layer, the classification accuracy is reduced, but the difference is small. Through different network architecture comparisons, when the time depth of each layer of convolutional layers is set to 3, the accuracy is the highest. Therefore, the 3D CNN used in this paper has 8 convolutional layers, 5 maximum pooling layers and 2 fully connected layers, and the convolution kernel is 3 × 3 × 3.
For the MOUD dataset, we clip each video into 16-frame sequences and use each fragment as input. The C3D network used in this paper has 8 convolutional layers, 5 max-pooling layers and 2 fully connected layers, followed by a softmax output layer. Figure 7 shows the sentiment-classification accuracy achieved by the C3D network at different numbers of iterations. As the number of iterations increases, the accuracy increases, reaching 87.23%.
4.3 Experimental Analysis of CMDMKL Fusion Model
After feature extraction of the audio, text and visual information, we concatenated the features into long feature vectors and used the proposed CMDMKL for feature-level fusion. Figure 8 shows the experimental results of emotional classification of the multi-modal data using different fusion algorithms. As the figure shows, the fusion classification results improve as the training dataset grows. In the initial stage of training-sample growth, classification accuracy improves rapidly; once the amount of data increases beyond a certain point, the growth slows down and occasionally even turns negative. Fusing text, audio and visual information with the proposed CMDMKL algorithm yields the highest classification accuracy, with CMDMKL reaching 96.25%, compared with 88.90% for SVM and 93.24% for MKL.
When experimenting with the MOUD dataset, we analyzed the convergence of the algorithm, that is, how the objective value \(f\) changes with the number of iterations. In Fig. 9, we compare the convergence behavior of the CMDMKL and MKL methods. The stopping criterion for both methods is that the change between two consecutive steps is less than a threshold of 0.001. Compared with traditional MKL, CMDMKL converges faster as the number of iterations increases within the limited iteration range. In addition, the objective value of CMDMKL converges to a stable value in fewer than 5 iterations.
Table 1 shows the results produced by CMDMKL fusion of different modalities. Using the 3D CNN to extract the visual emotional features of MOUD yields the best single-modality classification result, reaching 87.23%. Using a convolutional neural network to extract and classify the text sentiment features in MOUD videos gives an accuracy of 81.67%. The audio features in MOUD were extracted with openSMILE and classified with an SVM, achieving an accuracy of 76.30%. When two modalities are merged, the best classification accuracy is obtained by fusing the text and visual emotion features, reaching 95.84%; fusing the visual and audio features gives 91.36%, and fusing the text and audio features gives 95.46%. When the three modalities of audio, text and vision are fused, the accuracy is higher than that of any two-modality fusion, reaching 96.25%, which is significantly higher than the accuracy of existing frameworks.
As with the MOUD dataset, we performed the same experimental analysis on the IEMOCAP dataset. Table 2 shows the results produced by fusing different modalities. Among the single-modality classifiers, the visual modality provides the most accurate classification. When visual and text information are merged using CMDMKL, the sentiment-classification accuracy reaches 81.22%. When visual information is combined with audio information, the accuracy is lower than that of visual-text fusion, reaching 76.35%; combining text with audio information yields 76.33%. When the three modalities of vision, text and audio are fused using the proposed CMDMKL, the emotion classification accuracy of the video reaches 83.46%.
4.4 Experimental Analysis Based on Hybrid Fusion Model
The proposed CMDMKL algorithm can achieve higher emotion recognition accuracy. Therefore, after two modal emotion features are fused by the CMDMKL model, a hybrid model is used to fuse the three emotion modalities, so as to make full use of the advantages of the feature-level and decision-level fusion strategies and to overcome their respective shortcomings.
Figure 10 shows how the sentiment-classification accuracy varies with the fusion-feature weighting factor \(w\) after multi-modal fusion on the MOUD and IEMOCAP datasets. \(w\) is the weight of the fused \(M_{12}\) module (modalities \(M_{1}\) and \(M_{2}\) fused by CMDMKL) when it is combined with \(M_{3}\), i.e., \(wM_{12} + (1 - w)M_{3}\). Figure 10(a) shows the results of the multi-modal experiment on the MOUD dataset. In the figure, VA + T indicates that the visual information V and the audio information A are fused using CMDMKL and then combined with the text information T at the decision level. Compared with the VT + A and AT + V modes, this combination achieves the highest emotion recognition accuracy, because the easily confused parts of the visual information can be compensated for by features such as sound intensity and audio frequency. When the weight \(w = 0.6\), the best emotion recognition performance is achieved, reaching 94.98%. For the VT + A mode, the best performance, 94.17%, is obtained at \(w = 0.7\); for the AT + V mode, the best accuracy, 93.34%, is obtained at \(w = 0.5\).
Figure 10(b) shows the results of the multi-modal experiment on the IEMOCAP dataset. The IEMOCAP dataset is generated through dialog, during which the emotions of the respondent are influenced by the speaker: the speaker tends to be happy when answering a question correctly, but if unsure whether the answer is correct, the emotional expression becomes unclear. Following the results in Fig. 10(a), we fuse the visual information V with the audio information A using CMDMKL and then combine the result with the text information T at the decision level. The relationship between the classification accuracy of each emotion and the fusion-feature weight factor \(w\) is shown in Fig. 10(b). As can be seen from the figure, apart from the angry emotion, the recognition of the other emotions is strongly affected by the weight \(w\). When \(w = 0.7\), the recognition accuracy of angry is the highest, and the text information reduces the ambiguity between angry and happy in the audio and visual information: when expressing anger or happiness, people tend to raise their voices and exaggerate their facial movements, which leads to confusion between these two emotions. For the happy emotion, the text emotional characteristics recognize it well, so its accuracy after fusion is the highest. In addition, the visual features of the neutral emotion differ from those of happy and angry, so its recognition rate is also good. The sad emotion is easily confused with angry and neutral in both the text and the audio features, so its performance is the lowest.
Experiments were performed on the MOUD and IEMOCAP datasets using the hybrid model; Table 3 shows the results on the MOUD dataset, and Table 4 shows the results on the IEMOCAP dataset. As can be seen from Tables 3 and 4, the best accuracy is obtained when the visual and audio information are fused and then combined with the text information. Compared with CMDMKL fusion alone, the accuracy of multi-class classification improves, while the accuracy of binary classification decreases.
5 Conclusion
With the extensive use of video in social networks, more and more users are accustomed to expressing their emotions through video, and consumers are likewise used to recording their comments and opinions on products in video and uploading them to social media. The auditory data in a video conveys the tone of the speaker and the visual data conveys facial expressions, both of which help to understand the emotional state of the user, while the text information in the video also provides emotional cues. For these reasons, we believe that future research should be directed more toward multi-modal data processing.
In this paper, we use a 3D CNN to obtain the visual emotion features in video and propose an MKL multimodal fusion framework based on the convolutional margin-dimension constraint to fuse text, visual and audio information, effectively improving the emotion classification accuracy of video. In future work, we will focus on improving the accuracy of emotion detection through different neural network configurations, optimizing the model, reducing its time complexity, and ensuring the scalability and stability of the framework, so as to build a better, faster and more reliable multimodal emotion analysis model.
Data Availability
Data availability is not applicable to this article as no new data were created or analyzed in this study.
References
Wang, Z., Wang, Y., Zhang, J., Hu, C., Yin, Z., Song, Y.: Spatial-temporal feature fusion neural network for EEG-based emotion recognition. IEEE Trans. Instrum. Meas. 71, 390–313 (2022)
Xu, G.X., Li, W.F., Liu, J.: A social emotion classification approach using multi-model fusion. Futur. Gener. Comput. Syst. 102, 347–356 (2020)
Yang, J., Gao, X., Li, L., Wang, X., Ding, J.: SOLVER: scene-object interrelated visual emotion reasoning network. IEEE Trans. Image Process. 30, 8686–8701 (2021)
He, T., Zhang, L., Guo, J., Yi, Z.: Multilabel classification by exploiting data-driven pair-wise label dependence. Int. J. Intell. Syst. 35(9), 1375–1396 (2021)
Khan, P., Xu, G., Latif, M., et al.: UAV’s agricultural image segmentation predicated by clifford geometric algebra. IEEE Access. 7, 38442–38450 (2019)
Li, Z., Fan, Y., Jiang, B., Lei, T., Liu, W.: A survey on sentiment analysis and opinion mining for social multimedia. Multim. Tools Appl . 78(6), 6939–6967 (2019)
D’mello, S., Kory, J.: A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. 47(3), 1–36 (2015)
Subrahmanya, N., Shin, Y.: Sparse multiple kernel learning for signal processing applications. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 788–798 (2010)
Tadas, B., Chaitanya, A., Louis-Philippe, M.: Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2019)
Ji, S., Xu, W., Yang, M.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
Simonyan, K., Zisserman, A.A.: Two-stream convolutional networks for action recognition in videos. Adv. Neural. Inf. Process. Syst. 1, 568–576 (2014)
Yang, M., Deng, C., Nie, F.: Adaptive-Weighting discriminative regression for multi-view classification. Pattern Recogn. 88, 236–245 (2019)
Yue, N.J., Hausknecht, M., Vijayanarasimhan, S.: Beyond short snippets: deep networks for video classification. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Zhang, W., He, X., Lu, W.: Exploring discriminative representations for image emotion recognition with CNNs. IEEE Trans. Multim. 22(2), 515–523 (2020)
Zhang, W., Ma, B., Liu, K., Huang, R.: Video-based pedestrian reidentification by adaptive spatio-temporal appearance model. IEEE Trans. Image Process. 26(4), 2042–2054 (2017)
Guan, R., Wang, X., Marchese, M., Yang, M., Liang, Y., Yang, C.: Feature space learning model. J. Ambient. Intell. Humaniz. Comput. 10(5), 2029–2040 (2019)
Muller, K., Mika, S., Ratsch, G.: An introduction to kernel based learning algorithms. IEEE Trans. Neural Netw. 12(2), 181–201 (2001)
Vapnik, V.: The nature of statistical learning theory. IEEE Trans. Neural Netw. 8(6), 1564 (1997)
Williamson, R.C., Smola, A.J., Scholkopf, B.: Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Trans. Inf. Theory 47(6), 2516–2532 (2001)
Liu, J., Zheng, S., Xu, G., Lin, M.: Cross-domain sentiment aware word embeddings for review sentiment analysis. Int. J. Mach. Learn. Cybern. 12(2), 343–354 (2020)
Anceschi, F., Polidoro, S., Ragusa, M.A.: Moser’s estimates for degenerate Kolmogorov equations with non-negative divergence lower order coefficients. Nonlinear Anal. 189, 111568 (2019)
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)
Aizerman, A., Braverman, E.M., Rozonoer, L.I.: Theoretical foundations of the potential function method in pattern recognition. Autom. Remote. Control. 25, 821–837 (1964)
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. 144–152 (1992)
Vasileios, M., Anastasios, T., Ioannis, P.: Exploiting multiplex data relationships in support vector machines. Pattern Recogn. 85, 70–77 (2019)
Liu, J., Tang, S., Xu, G., Ma, C., Lin, M.: A novel configuration tuning method based on feature selection for hadoop mapreduce. IEEE Access. 8, 63862–63871 (2020)
Arefnezhad, S., Samiee, S., Eichberger, A.: Applying deep neural networks for multi-level classification of driver drowsiness using Vehicle-based measures. Expert Syst. Appl. 162, 113778 (2020)
Liu, J., Chen, S., Wu, T., Zhang, H.: A novel hot data identification mechanism for NAND flash memory. IEEE Trans. Consum. Electron. 61(4), 463–469 (2015)
Zhou, T., Peng, Y.: Kernel principal component analysis-based Gaussian process regression modelling for high-dimensional reliability analysis. Comput. Struct. 241(1), 1–22 (2020)
Cai, D., He, X., Han, J.: Speed up kernel discriminant analysis. VLDB J. 20(1), 21–33 (2011)
Maldonado, S., Weber, R., Basak, J.: Simultaneous feature selection and classification using kernel-penalized support vector machines. Inf. Sci. 181(1), 115–128 (2011)
Amarnath, B., Balamurugan, S., Alias, A.: Review on feature selection techniques and its impact for effective data classification using UCI machine learning repository dataset. J. Eng. Sci. Technol. 11(11), 1639–1646 (2016)
Chen, Y.R., Tao, X., Xiong, C., Yang, J.: An improved method of two stage linear discriminant analysis. KSII Trans. Internet Inf. Syst. 12(3), 1243–1263 (2018)
Niranjan, S., Shin, Y.C.: Sparse multiple kernel learning for signal processing applications. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 788–798 (2010)
Close, R., Wilson, J., Gader, P.: A Bayesian approach to localized multi-kernel learning using the relevance vector machine. In: Proceedings of the 2011 IEEE International Geoscience and Remote Sensing Symposium. 103–1106 (2011)
Lanckriet, G.R.G., De Bie, T., Cristianini, N., Jordan, M.I.: A statistical framework for genomic data fusion. Bioinformatics 20(16), 2626–2635 (2004)
Chen, T., Liu, C.J., Zou, H.: A multi-instance multi-label scene classification method based on multi-kernel fusion. In: Proceedings of the 2015 SAI Intelligent Systems Conference. 782–787 (2015)
Zhou, Y., Cui, X., Hu, Q.: Improved multi-kernel SVM for multi-modal and imbalanced dialogue act classification. In: Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN) (2015)
Wang, Q., Gu, Y., Tuia, D.: Discriminative multiple kernel learning for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 54(7), 3912–3927 (2016)
Morency, L.P., Mihalcea, R., Doshi, P.: Towards multimodal sentiment analysis: harvesting opinions from the web. Proceedings of the 13th International Conference on Multimodal Interfaces. ACM, pp. 169–176 (2011)
Busso, C., Bulut, M., Lee, C.C.: IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–339 (2008)
Funding
This work was supported by the National Natural Science Foundation of China (Grant No. 61772099), the Chongqing Research Program of Basic Research and Frontier Technology (Grant No. cstc2021jcyj-msxmX0530).
Author information
Authors and Affiliations
Contributions
Jun Liu contributed the conceptualization, methodology and wrote the final draft. Zhihao Wang carried out experiments and wrote the original draft. The remaining authors contributed to validating the ideas, carrying out additional analyses and reviewing this paper. All authors read and approved the manuscript.
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, J., Wang, Z., Wan, G. et al. A Novel Multi-modal Sentiment Analysis Based on Multiple Kernel Learning with Margin-Dimension Constraint. Int J Comput Intell Syst 17, 207 (2024). https://doi.org/10.1007/s44196-024-00624-3