Multimodal Sentiment Analysis
ABSTRACT
During real-life interactions, people naturally gesture and modulate their voices to emphasize specific points or to express their emotions. With the recent growth of social websites such as YouTube, Facebook, and Amazon, video reviews are emerging as a new source of multimodal and natural opinions that has been left almost untapped by automatic opinion analysis techniques. This paper presents a method for multimodal sentiment classification, which can identify the sentiment expressed in utterance-level visual datastreams. Using a new multimodal dataset consisting of sentiment-annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% compared to the best performing individual modality.
INTRODUCTION OF THE PROJECT
We explore the addition of speech and visual modalities to text analysis in order to identify the sentiment expressed in video reviews. Given the non-homogeneous nature of full video reviews, which typically include a mixture of positive, negative, and neutral statements, we decided to perform our experiments and analyses at the utterance level. This is in line with earlier work on text-based sentiment analysis, where it has been observed that full-document reviews often contain both positive and negative comments, which led to a number of methods addressing opinion analysis at the sentence level. Our results show that relying on the joint use of linguistic, acoustic, and visual modalities allows us to better sense the sentiment being expressed, as compared to the use of only one modality at a time. Another important aspect of this paper is the introduction of a new multimodal opinion dataset annotated at the utterance level which is, to our knowledge, the first of its kind. In our work, this dataset enabled a wide range of multimodal sentiment analysis experiments, addressing the relative importance of modalities and individual features.
MODULES USED:
1. CREATING THE MOUD DATASET
2. PRE-SCREENING OF VIDEOS
3. FEATURE EXTRACTION PROCESS
4. LINGUISTIC ANALYSIS
5. ACOUSTIC ANALYSIS
6. VISUAL ANALYSIS
7. EXPERIMENTS AND RESULTS
1. CREATING THE MOUD DATASET
In this module, we collect a set of videos from the social media website YouTube, using several keywords likely to lead to a product review or recommendation, such as favourite products, favourite movies, and favourite music. The keywords are not targeted at a specific product type; instead, we used a variety of product names so that the dataset has some degree of generality within the broad domain of product reviews.
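A minimal sketch of this collection step, assuming the YouTube Data API v3 via the google-api-python-client library (the source does not specify how the search was performed); the keyword list and the API key placeholder are illustrative.

```python
from googleapiclient.discovery import build

# Hypothetical API key placeholder; the source does not describe how the
# YouTube search was actually issued.
API_KEY = "YOUR_API_KEY"
KEYWORDS = ["favourite products", "favourite movies", "favourite music"]

def search_review_videos(query, max_results=50):
    """Return (video_id, title) pairs for a keyword search on YouTube."""
    youtube = build("youtube", "v3", developerKey=API_KEY)
    response = youtube.search().list(
        q=query, part="snippet", type="video", maxResults=max_results
    ).execute()
    return [
        (item["id"]["videoId"], item["snippet"]["title"])
        for item in response.get("items", [])
    ]

candidates = [video for kw in KEYWORDS for video in search_review_videos(kw)]
```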
2. PRE-SCREENING OF VIDEOS
From all the videos returned by the YouTube search, we selected only those in which the speaker is in front of the camera, the face is clearly visible with a minimal amount of occlusion during the recording, and there is no background music or animation.
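One way to automate part of this screening is to check that a frontal face is detected in most frames; the sketch below uses OpenCV's Haar cascade detector, which is an assumed helper rather than a tool named in the source.

```python
import cv2

def face_visible_ratio(video_path, sample_every=15):
    """Fraction of sampled frames in which a frontal face is detected."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    cap = cv2.VideoCapture(video_path)
    checked = with_face = frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            checked += 1
            with_face += int(len(faces) > 0)
        frame_idx += 1
    cap.release()
    return with_face / max(checked, 1)

# Keep a candidate video only if a face is visible in most sampled frames,
# e.g. face_visible_ratio(path) >= 0.9.
```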
3. FEATURE EXTRACTION PROCESS
This module covers the process of automatically extracting linguistic, acoustic, and visual features from the video reviews. First, we obtain the stream corresponding to each modality, followed by the extraction of a representative set of features for each modality.

4. LINGUISTIC ANALYSIS
In this module, we build a vocabulary consisting of all the words, including stopwords, occurring in the transcriptions of the training set. We then remove those words that have a frequency below 10 (a value determined empirically on a small development set). The remaining words represent the unigram features, which are then associated with a value corresponding to the frequency of the unigram inside each utterance transcription.
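The unigram feature construction described above can be sketched as follows; the whitespace tokenization and the helper names are illustrative assumptions, not part of the original pipeline.

```python
from collections import Counter

MIN_FREQ = 10  # frequency threshold, determined empirically on a development set

def build_vocabulary(train_transcriptions):
    """Keep every word (stopwords included) occurring at least MIN_FREQ times."""
    counts = Counter(
        word for text in train_transcriptions for word in text.lower().split()
    )
    return sorted(word for word, count in counts.items() if count >= MIN_FREQ)

def unigram_features(utterance, vocabulary):
    """Represent an utterance transcription as a vector of unigram frequencies."""
    counts = Counter(utterance.lower().split())
    return [counts[word] for word in vocabulary]
```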
5. ACOUSTIC ANALYSIS
We compute a set of acoustic features, including prosody, energy, voicing probabilities, spectrum, and cepstral features, along with estimates for the emotion categories anger, contempt, disgust, fear, joy, sadness, surprise, and neutral. For the analysis, we use a sampling rate of 30 frames per second. The features extracted for each utterance are averaged over all the valid frames, which are automatically identified.
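A minimal sketch of per-utterance acoustic feature extraction, using librosa as the feature extractor (no specific toolkit is named above); pitch, energy, voicing probability, and MFCCs stand in for the prosodic, energy, voicing, and cepstral features, and averaging over voiced frames mirrors the averaging over valid frames described in this module.

```python
import numpy as np
import librosa

def acoustic_features(wav_path):
    """Pitch, energy, voicing probability, and MFCCs averaged over voiced frames."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )
    rms = librosa.feature.rms(y=y)[0]                    # frame-level energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # cepstral features
    n = min(len(voiced_flag), len(rms), mfcc.shape[1])
    valid = voiced_flag[:n]                  # voiced frames stand in for "valid" frames
    return np.concatenate([
        [np.nanmean(f0[:n][valid])],         # mean pitch (prosody)
        [rms[:n][valid].mean()],             # mean energy
        [voiced_prob[:n][valid].mean()],     # mean voicing probability
        mfcc[:, :n][:, valid].mean(axis=1),  # mean cepstral coefficients
    ])
```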
6. VISUAL ANALYSIS
The most widely used system for measuring and describing facial behaviors is the Facial Action Coding System (FACS), which allows for the description of face muscle activities through the use of a set of Action Units (AUs).
The smile feature provides an estimate of the presence of a smile. Head pose detection consists of three-dimensional estimates of the head orientation, i.e., yaw, pitch, and roll. These features provide information about changes in smiles and face positions while uttering positive and negative opinions.
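Assuming per-frame smile and head-pose estimates have already been exported by a facial analysis tool (for example OpenFace, which is not named in the source), per-utterance visual features can be obtained by averaging over the frames where tracking succeeded; the column names below are hypothetical.

```python
import pandas as pd

# Hypothetical column names for per-frame estimates exported by a facial
# analysis tool; adjust to match the actual tool's output format.
FEATURE_COLS = ["smile", "yaw", "pitch", "roll"]

def visual_features(csv_path):
    """Average per-frame smile and head-pose estimates over valid frames."""
    frames = pd.read_csv(csv_path)
    valid = frames[frames["success"] == 1]   # frames where face tracking succeeded
    return valid[FEATURE_COLS].mean().to_numpy()
```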
7. EXPERIMENTS AND RESULTS
From the dataset, we remove utterances labeled as neutral, thus keeping only the positive and negative utterances with valid visual features. The removal of neutral utterances is done for two main reasons. First, the number of neutral utterances in the dataset is rather small. Second, previous work in subjectivity and sentiment analysis has demonstrated that a layered approach (where neutral statements are first separated from opinion statements, followed by a separation between positive and negative statements) works better than a single three-way classification.
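To make the resulting experimental setup concrete, the sketch below drops neutral utterances and evaluates a classifier on the concatenated linguistic, acoustic, and visual feature vectors; the linear SVM and 10-fold cross-validation are illustrative assumptions, not details stated in this section.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def fuse_and_evaluate(linguistic, acoustic, visual, labels):
    """Drop neutral utterances, concatenate the three modality matrices
    (one row per utterance), and report mean cross-validated accuracy."""
    labels = np.asarray(labels)
    keep = labels != "neutral"                       # keep positive/negative only
    X = np.hstack([linguistic, acoustic, visual])[keep]
    y = labels[keep]
    clf = make_pipeline(StandardScaler(), LinearSVC())
    return cross_val_score(clf, X, y, cv=10).mean()
```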