Assignment 2
Short Abstract of Problem - In the digital age, moviegoers increasingly rely on online
platforms like IMDb to make informed decisions about their cinematic choices. IMDb houses a
vast collection of user-generated movie reviews, reflecting the diverse spectrum of audience
sentiment. The challenge is to automate the process of sentiment analysis for these reviews,
categorizing them as either positive or negative. The objective is to develop a robust sentiment
analysis model tailored to IMDb movie reviews, aiming to streamline the understanding of
public opinion regarding films. This analysis harnesses natural language processing techniques to
provide a practical toolset for movie enthusiasts and industry professionals.
Dataset Details
1. Title - (Q1) MBiLSTMGloVe: Embedding GloVe knowledge into the corpus using
multi-layer BiLSTM deep learning model for social media sentiment analysis
Link - Research Paper
Short Abstract - This research paper presents the 'MBiLSTMGloVe' model, a deep learning
approach enriched with GloVe word embeddings, designed for social media sentiment
analysis. The model employs a multi-layer Bidirectional Long Short-Term Memory
(MBiLSTM) architecture to capture nuanced sentiments in short texts. Experimental results
on IMDb datasets show that it outperforms traditional models. This model
underscores the value of deep understanding in sentiment classification.
2. Title - (Q2) RoBERTa-GRU: A Hybrid Deep Learning Model for Enhanced Sentiment
Analysis
Link - Research Paper
Short abstract - This research paper introduces an innovative hybrid model for sentiment
analysis. It harnesses the robust attention mechanism of RoBERTa (Robustly Optimized
BERT Pretraining Approach) alongside the sequence learning capabilities of GRU (Gated
Recurrent Units) to effectively handle sentiment classification tasks. To combat the common
challenge of imbalanced datasets in sentiment analysis, a data augmentation technique
involving word embeddings and oversampling of minority classes is proposed. The model's
comprehensive evaluation on various sentiment analysis datasets demonstrates its superior
accuracy and effectiveness, highlighting its potential as a valuable tool in sentiment analysis
tasks.
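The oversampling idea mentioned above can be illustrated with a minimal sketch. This is plain random oversampling, not the paper's embedding-based augmentation; the function name `oversample_minority` and the toy reviews are assumptions made for illustration only.

```python
import random
from collections import Counter


def oversample_minority(texts, labels, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw the missing examples (with replacement) from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy data: three positive reviews, one negative review.
texts = ["great film", "loved it", "superb acting", "dull plot"]
labels = ["pos", "pos", "pos", "neg"]
balanced_texts, balanced_labels = oversample_minority(texts, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that duplicated reviews do not leak into the evaluation data.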
Dataset Description
The IMDb Movie Reviews dataset is a valuable resource for sentiment analysis and natural
language processing. It contains 50,000 movie reviews, each labelled as either positive or
negative, making it widely employed in sentiment analysis model development and evaluation.
This dataset encompasses diverse text content, reflecting user opinions on movies, and features
varying writing styles and subjective viewpoints. Researchers and data scientists frequently use it
to train models capable of categorizing reviews by sentiment. However, challenges may arise
due to text variability and potential class imbalances. This dataset finds extensive use in
academic and industry projects for text classification and sentiment analysis.
Size of data - The IMDb movie reviews dataset typically contains around 50,000 reviews.
Features of data – Each entry in the dataset usually consists of at least two main features:
• Text Review: The movie review itself, which is the primary source of information for
sentiment analysis.
• Sentiment Label: A binary label (e.g., positive or negative) indicating the sentiment
of the review.
Challenges on data:
• Class Imbalance: The dataset might have an imbalance between positive and negative
sentiment reviews, which can affect model training and evaluation.
• Noisy Text: User-generated content often contains misspellings, slang, and informal
language, making it challenging for natural language processing.
• Contextual Understanding: Some reviews may contain nuanced opinions that require
context to interpret accurately.
• Data Pre-processing: Cleaning and tokenizing text data can be computationally
intensive for a massive dataset.
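Two of the challenges above, class imbalance and noisy text, can be checked with a few lines of code. The sketch below is a minimal illustration using only the standard library. The helper names `clean_review` and `class_balance`, the sample reviews, and the assumption that raw reviews may contain HTML line-break tags are all illustrative and not taken from the dataset documentation.

```python
import re
from collections import Counter


def clean_review(text):
    """Lowercase the review, drop HTML tags, and split into simple tokens."""
    text = re.sub(r"<[^>]+>", " ", text.lower())   # remove tags such as <br />
    text = re.sub(r"[^a-z0-9']+", " ", text)       # keep letters, digits, apostrophes
    return text.split()


def class_balance(labels):
    """Return the share of each sentiment label in the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical sample rows: (text review, sentiment label).
reviews = [
    ("A GREAT movie!<br />Loved it.", "positive"),
    ("Sooo boring... don't bother", "negative"),
]
tokens = [clean_review(text) for text, _ in reviews]
balance = class_balance([label for _, label in reviews])
print(tokens)
print(balance)
```

Checking the label shares before training shows whether class imbalance is actually present in the copy of the dataset being used.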
Literature Review
1.1 Methodology: This paper introduces a framework for opinion mining that focuses on
enhancing feature extraction in both temporal and spatial dimensions of text data. The
methodology combines several techniques, including the Spatial-RNN Unit (SRU) for
temporal feature extraction, a Multi-head Attention Mechanism for capturing various
levels of natural language features, and Dilated Convolution to enhance spatial
feature extraction.
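As an illustration of the dilated convolution component only, the sketch below applies a one-dimensional kernel with gaps between its taps. It is a generic textbook formulation (no padding, cross-correlation as used in deep learning libraries), not the paper's implementation; the function name `dilated_conv1d` and the toy input are assumptions.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Slide the kernel over the signal, reading every `dilation`-th element.

    A larger dilation widens the receptive field without adding weights.
    """
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


x = [1, 2, 3, 4, 5, 6]
w = [1, 0, -1]
print(dilated_conv1d(x, w, dilation=1))  # [-2, -2, -2, -2]
print(dilated_conv1d(x, w, dilation=2))  # [-4, -4]
```

With a dilation of 2, the same three-weight kernel covers five input positions instead of three, which is how dilation captures longer-range patterns in a token sequence.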
2. Paper Title -2: An Improved African Vulture Optimization Algorithm for Feature
Selection Problems and Its Application of Sentiment Analysis on Movie Reviews
Journal Name, Year and Quartile: Big Data and Cognitive Computing, 2022, Q2
Paper Link: Research Paper
2.1 Methodology: This paper introduces the Binary African Vulture Optimization
Algorithm (BAVOA-v1) as a solution for feature selection in the context of big data.
The methodology of BAVOA-v1 comprises several key components. It begins with
initializing a population of binary solutions, each representing a subset of features
from the dataset. A fitness function evaluates these subsets, seeking to balance
classification accuracy using a chosen machine learning classifier and the
minimization of selected features to prevent overfitting. The algorithm employs
exploration and exploitation strategies such as mutation, crossover, and bitwise
operations to iteratively improve feature subsets. A selection mechanism chooses the
best solutions among the population for the next generation, and the algorithm
terminates after a predetermined number of iterations, returning the best-selected
feature subset.
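The general loop described above (binary population, fitness function, variation, selection, fixed iteration count) can be sketched as follows. This is a simplified stand-in that uses only bit-flip mutation and elitist selection. It does not implement the vulture-specific update rules or the crossover step. The weight `alpha`, the function names, and the toy accuracy function are assumptions for illustration.

```python
import random


def fitness(mask, accuracy_fn, alpha=0.9):
    """Lower is better: weighted sum of error rate and share of features kept."""
    if not any(mask):
        return 1.0  # an empty feature subset is treated as the worst case
    error = 1.0 - accuracy_fn(mask)
    ratio = sum(mask) / len(mask)
    return alpha * error + (1 - alpha) * ratio


def flip_mutation(mask, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]


def select_features(n_features, accuracy_fn, pop_size=10, iterations=30, seed=0):
    rng = random.Random(seed)
    # Each solution is a 0/1 mask over the features.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(iterations):
        mutants = [flip_mutation(m, 1.0 / n_features, rng) for m in population]
        candidates = population + mutants
        candidates.sort(key=lambda m: fitness(m, accuracy_fn))
        population = candidates[:pop_size]  # keep the best solutions
    return min(population, key=lambda m: fitness(m, accuracy_fn))


def toy_accuracy(mask):
    """Hypothetical classifier stand-in: only features 0 and 2 are useful."""
    return 0.5 + 0.25 * mask[0] + 0.25 * mask[2]


best = select_features(6, toy_accuracy)
print(best, fitness(best, toy_accuracy))
```

In a real setting, `toy_accuracy` would be replaced by the validation accuracy of the chosen classifier trained on the selected features, which is the expensive part of each fitness evaluation.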
3.1 Methodology: The paper introduces the Ensemble Random Forest-Based XGBoost
(ERF-XGB) approach, designed for accurate binary classification of sentiment in
e-commerce product reviews. The methodology encompasses various stages, including
data pre-processing, feature selection using the Harris Hawk Optimization (HHO)
algorithm, sentiment classification, and hyperparameter tuning. The core innovation
lies in ERF-XGB's ability to combine ratings prediction and sentiment analysis,
making it a comprehensive solution.
3.3 Open problem / challenges unhandled: The paper highlights several open
challenges in the field of sentiment analysis, including fine-grained and
aspect-based sentiment analysis, sarcasm and irony detection, cross-lingual
sentiment analysis, multimodal sentiment analysis and fusion, modelling
sentiment changes over time, explainable AI (XAI), inference efficiency,
user-generated abbreviations and misspellings, and zero-shot and few-shot
learning.
4. Paper Title -4: Prediction of Movie Success based on Machine Learning and Twitter
Sentiment Analysis using Internet Movie Database Data
Journal Name, Year and Quartile: Indonesian Journal of Electrical Engineering and
Computer Science, 2023, Q3
4.1 Methodology: This paper presents a hybrid approach for predicting the success of
movies, involving two primary components: rating prediction and sentiment analysis.
The methodology considers various movie features, uses supervised machine
learning models for rating prediction, and applies a range of sentiment analysis
models. The final movie rating is determined by combining the prediction and
sentiment analysis scores using different weight combinations.
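The weighted combination described above can be sketched as a convex combination of the two scores. The exact formula used in the paper is not given here, so this form is an assumption, as are the function names, the 10-point rating scale, and the idea of turning the share of positive reviews into a rating.

```python
def sentiment_to_rating(positive_fraction, scale=10.0):
    """Map the share of positive reviews (0..1) onto the rating scale (assumed)."""
    return positive_fraction * scale


def combine_scores(predicted_rating, sentiment_rating, weight):
    """Blend the model's predicted rating with the sentiment-based rating.

    `weight` is the share given to the predicted rating; the rest goes
    to the sentiment score.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * predicted_rating + (1.0 - weight) * sentiment_rating


predicted = 7.0                       # hypothetical output of the rating model
sentiment = sentiment_to_rating(0.8)  # hypothetical: 80% of reviews are positive
for w in (0.3, 0.5, 0.7):
    print(w, round(combine_scores(predicted, sentiment, w), 2))
```

Trying several weights, as in the loop above, mirrors the idea of comparing different weight combinations and keeping the one that predicts held-out ratings best.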
4.2 Advantages or challenges handled: The paper's hybrid model tackles the complex
task of movie success prediction by integrating both movie features and sentiment
expressed in movie reviews. It achieves a more accurate prediction compared to
individual models, offering a comprehensive solution for this particular application.
4.3 Open problem / challenges unhandled: While the paper provides a promising
approach, some open challenges are yet to be addressed. These include assessing the
generalizability of the framework to other domains, further optimizing resource
efficiency, and exploring the challenges of deploying the hybrid model in large-scale,
real-world applications, particularly when dealing with massive datasets and
distributed computing environments.
5.3 Open problem / challenges unhandled: While the framework demonstrates promise,
several open challenges remain unaddressed. These include further optimizing feature
selection and classification components, evaluating its generalizability to different
domains and languages, addressing the complexities of real-time processing for
tweets, and exploring the intricacies of multilingual tweet analysis.