Predicting Movie Ratings With Multimodal Data: Yichen Yang Ruoyun Ma Min Haeng Cho
1. Introduction

Can we predict the success of a movie based on information about the film available prior to theatrical release? To answer this question, we use open-source multimodal data released by IMDB and TMDB, such as movie posters and synopses. By applying image processing to visual data extracted from posters and natural language processing to textual data, we aim to predict movie ratings.

Although predicting movie ratings has been an active research area, a rather unexplored avenue of research in this application is the exclusive use of information available prior to theatrical release as features to predict movie ratings. Moviegoers base their decisions whether to watch a pre-released movie on limited information about the film, such as posters, synopsis, genre, cast and crew. Therefore, it behooves researchers to ask whether it is possible to predict movie ratings by only using information available prior to theatrical release as features, and if so, which features have the strongest predictive power. To predict movie ratings, we conduct an ablation study of various visual and textual features and evaluate their contribution to prediction accuracy. Specifically, we use linear and ridge regression, decision trees and random forest, SVR and neural networks as our learning algorithms.

Our motivation for exploring this question is twofold. First, we aim to provide consumers with film recommendations by predicting movie ratings accurately prior to theatrical release. Second, we hope to offer insight on the determining factors of film ratings that will guide producers through film promotion.

2. Related Work

Previous research has shown remarkable interest and advances in predicting movie ratings. As part of the 2009 Netflix Prize competition, an open contest for the best algorithm to predict user ratings for films streamed on the platform, researchers utilized singular value decomposition (SVD) to predict users’ movie ratings based on their previous ratings history [1]. By utilizing a Bayesian approach, they effectively mitigated overfitting in SVD. In a different context, researchers utilized both surface and textual features, such as the number of tweets and YouTube comments for each film, to predict ratings [2]. Despite strong predictive performance, the model is limited: it cannot predict movie ratings prior to release, since features from social media are extracted only after theatrical release. Others developed a supervised latent Dirichlet allocation model (LDA), shown to have more predictive power than unsupervised LDA, to predict ratings from movie reviews [3]. Others used data mining to build a model on interesting relations between different attributes and assign a weight for each feature in every movie, enhancing prediction accuracy [4]. Most recently, researchers extracted and utilized visual features from movie trailers to predict ratings [5]. Although results are preliminary, the approach is novel and we thus adopt this method to extract visual features from posters. Indeed, research in predicting movie ratings using state-of-the-art techniques and various features has been active and ongoing.

3. Dataset and Features

We utilized open-source data, “Movie Genre from its Poster Dataset” [6] and “The Movie Dataset” [7], from Kaggle. We selected posters, synopses, cast, crew, runtime and genre as input features and IMDB film ratings as the prediction objective.

We split features into three categories: images (posters), text (synopses), and others (cast, crew, genre, and runtime). For posters, we transformed the pixel dimensions from 900x600 to the input resolution 224x224 for ResNet34. Besides feeding pixels into our model, we manually extracted 13 visual features. Specifically, we considered posters in both RGB format and HSB format, and extracted the mean and standard deviation of red, green, blue, hue, saturation, brightness (as suggested by [5]), as well as the number of human faces using OpenCV (as suggested by [8]). For synopses, we used spaCy [9] for tokenization and only kept words that appeared in at least 20 movies. For cast and crew, we extracted the main director and top three leading actors for each movie, and kept directors and actors involved in at least 5 movies in our dataset. The result is shown in Table 1.

Table 1. Processed Data by Groups
Data      Type         Dimension  Example
Genre     Categorical  23         Action
Runtime   Numerical    1          100 (minutes)
Actors    Categorical  1603       Robert Downey Jr.
Director  Categorical  492        Steven Spielberg
Poster    Numerical    13         Number of faces = 1
Synopses  Categorical  3884       “innocence”

After filtering out movies released before 1980 and those whose original language is not English, we had 19429 movies in our sample. We then did a train-validation-test split at the 70%-15%-15% ratio, with 13600 training, 2914 validation and 2915 test data points. We standardized all the numerical features, performed feature selection and hyperparameter tuning using the validation set, and did model selection with the test set.

4. Methods

4.1. Linear Regression Models

For our baseline algorithm, we used linear regression, which models the relationship between independent and dependent variables by finding a linear correlation and minimizing the sum of the squares of the differences between predicted and actual values. It is among the most common, fundamental and simple methods used to solve regression problems.

To mitigate overfitting that might arise from fitting a simple regression model to our data, we also used ridge regression. This regularization method adds an l2-norm penalty on the parameters to the cost function to penalize large regression coefficients. In general, ridge regression helps solve multicollinearity, or high intercorrelations among features, and efficiently improves model prediction accuracy.

4.2. Decision Trees and Random Forest

Since a nonlinear model could provide a better fit to our data, we also experimented with decision trees and random forest, an ensemble learning method that aggregates outputs from a multitude of decision trees. A decision tree contains internal nodes corresponding to input features, and leaves, which represent the output value following certain paths from the root. Random forest combines bagging, in which each tree is trained on a bootstrap sample of the data, with random feature subsets: at each node of a decision tree, the best split is selected from a random subset of the input features. It is extensively used for non-linear problems due to strong stability and efficient reduction of overfitting.

4.3. Support Vector Regression

Support Vector Machine (SVM) is commonly used as a baseline in many machine learning problems. SVM defines a margin from the hyperplane, where points inside the margin incur loss, while Support Vector Regression (SVR) defines a margin of ε-distance from the hyperplane such that data points inside the boundary are error-free. For our project, we used the radial basis function as the kernel.

4.4. Neural Networks

To effectively capture information from the multimodal data, we used a convolutional neural network (CNN), used to scale down the magnitude of patterns from original inputs, as a primary tool to analyze posters and perform natural language processing on synopses. We did word embedding on synopses and applied kernels on the embedding. We also applied a residual neural network, ResNet34 [10], known for mitigating the problem of vanishing gradients via shortcuts between layers, to poster data.

4.5. Feature Importance

To evaluate the predictive power of different features, we employed permutation feature importance (FI) [11], a technique that measures the importance of each feature. The intuition is that if a feature is not useful for predicting an outcome, then permuting its values will not result in a significant reduction in a model’s performance. For each feature, permutation FI is defined by:

$$\mathrm{FI} = \mathrm{err}_{\mathrm{perm}} - \mathrm{err}_{\mathrm{orig}}$$

where $\mathrm{err}_{\mathrm{orig}}$ is the baseline model error with all the original features, and $\mathrm{err}_{\mathrm{perm}}$ is the model error when a certain feature is permuted.

Also known as Random Forest FI, mean decrease impurity (MDI) is another FI metric in a random forest model that computes the extent to which each feature decreases the weighted impurity on average while training the model. We evaluated each individual feature and group of features on the train set using MDI FI and permutation FI, respectively.

5. Experiments and Results

5.1. Metrics

We used the Mean Square Error (MSE) and R^2 as our metrics. MSE was also used as the training loss, and R^2 represents the proportion of the variance in film ratings explained by the features.

5.2. Individual Models

We first assessed individual contributions of the three feature categories (text, images, others) in predicting film ratings. When we trained word embedding and a CNN model (kernel sizes = 2, 3, 4, with 100 kernels each) on the synopses as our only feature, we had an R^2 of 0.136. This implies that synopses can be used to predict movie ratings. However, when we trained a ResNet34 model only on 224x224 posters, we either had overfitting (small regularization) or saw no significant improvement over simply predicting movie ratings using the mean (big regularization). Thus, a complicated model is not suitable for predicting movie ratings with posters. We therefore manually extracted 13 visual features instead of directly feeding the posters to our model. Combining these 13 visual features into a linear regression model, we had an R^2 of 0.015, similar to the result in [5]. Thus, visual features in movie posters have some, though limited, power to predict movie ratings. Table 2 displays the results of all the individual models.

Table 2. Preliminary results on text and image data
Data                            Method                Valid MSE        Valid R^2
Overview Only                   Word Embedding + CNN  1.2900           0.136
Poster Only                     ResNet34              Tend to overfit  Tend to overfit
Extracted Features from Poster  Linear Regression     1.4699           0.015

5.3. Combined Models

We set the results from linear regression as our baseline, and used validation data to tune hyperparameters. Additionally, we used L2 regularization on linear regression, and results show that the best alpha is 10. For decision tree regression, the optimal maximum depth was 8. For random forest regression, we set the maximum depth to be 32 and used 100 trees. For support vector regression, the optimal penalty parameter was 1 and the optimal ε was 0.1. Finally, we combined results from the CNN for the textual features and non-textual features into 5 fully connected layers for our final neural network. The regularization techniques we used for the neural network are dropout, L2 penalty, and early stopping. The results are shown in Table 3.

Table 3. Model MSE and R^2
Method                     Test MSE  Test R^2
Linear Regression          0.9302    0.3745
Ridge Regression           0.8775    0.4099
Decision Tree Regression   0.8959    0.3975
Random Forest Regression   0.8546    0.4253
Support Vector Regression  0.8542    0.4256
Neural Network             0.8765    0.4109

We found that all three feature categories contribute to the prediction of IMDB scores. Our current results show that random forest regression and support vector regression have the best performance, each explaining over 42 percent of the variance in IMDB movie scores.

6. Discussion

6.1. Best Model

Adding L2 regularization improved regression performance. SVR and random forest regression have similar R^2, but SVR is significantly slower and harder to interpret. Considering accuracy, efficiency and interpretability, random forest was our best model.

Compared to the previous related work discussed in Section 2, which used movie trailers and genre to predict film ratings and achieved an MSE of 0.88 [5], we have a smaller MSE because our model had more fea-
Figure 1. Model performance of different number of trees (R^2 and MSE curves for the train and dev sets).
Figure 2. Model performance of different max depth (R^2 and MSE curves for the train and dev sets).
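Curves like those in Figures 1 and 2 can be produced by refitting a random forest for each candidate value of one hyperparameter and recording the train and dev metrics. The following is only a minimal sketch of such a sweep, assuming scikit-learn is the library in use (the report does not say). The helper name `sweep_random_forest`, the synthetic data, and the small grids are all hypothetical stand-ins, not the authors' code or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def sweep_random_forest(X_train, y_train, X_dev, y_dev, param_name, values, **fixed):
    """Fit one forest per value of `param_name`; record train/dev MSE and R^2."""
    rows = []
    for value in values:
        model = RandomForestRegressor(random_state=0, **{param_name: value}, **fixed)
        model.fit(X_train, y_train)
        row = {param_name: value}
        for split, X, y in (("train", X_train, y_train), ("dev", X_dev, y_dev)):
            pred = model.predict(X)
            row[f"{split}_mse"] = mean_squared_error(y, pred)
            row[f"{split}_r2"] = r2_score(y, pred)
        rows.append(row)
    return rows


# Synthetic stand-in data (assumption): 400 samples, 10 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)
X_train, y_train, X_dev, y_dev = X[:300], y[:300], X[300:], y[300:]

# One sweep per figure: number of trees (depth fixed), then max depth (trees fixed).
tree_curve = sweep_random_forest(X_train, y_train, X_dev, y_dev,
                                 "n_estimators", [5, 20, 50], max_depth=8)
depth_curve = sweep_random_forest(X_train, y_train, X_dev, y_dev,
                                  "max_depth", [2, 4, 8], n_estimators=20)
for row in tree_curve + depth_curve:
    print(row)
```

A widening gap between the train and dev curves as depth grows is the usual sign of overfitting, which is the kind of trade-off such plots are meant to expose.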
Table 4. Feature Importance Ranking (Top 5)
Poster      Genre        Actor               Director
hue sd      documentary  Stephen Baldwin     Tyler Perry
hue         horror       Thomas Kretschmann  Woody Allen
saturation  drama        Lauren Bacall       Craig Moss
blue sd     animation    Manisha Koirala     Uwe Boll
green sd    action       Angie Everhart      Steven R. Monroe
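Rankings such as those in Table 4 rest on per-feature importance scores. As an illustration of the permutation FI from Section 4.5 (model error after permuting a feature minus the baseline error), here is a minimal standard-library sketch. The function names, the toy model, and the synthetic data are hypothetical. Averaging over `n_repeats` shuffles is an added assumption to reduce noise, not something the report states.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Return FI = err_perm - err_orig for every feature column of X.

    `predict` maps a list of feature rows to a list of predictions.
    """
    rng = random.Random(seed)
    err_orig = mse(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        perm_errors = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            perm_errors.append(mse(y, predict(X_perm)))
        err_perm = sum(perm_errors) / n_repeats
        importances.append(err_perm - err_orig)
    return importances


# Toy stand-in for a fitted model (assumption): uses feature 0, ignores feature 1.
def toy_predict(rows):
    return [2.0 * row[0] for row in rows]


data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [2.0 * row[0] + data_rng.gauss(0, 0.1) for row in X]

fi = permutation_importance(toy_predict, X, y)
print(fi)  # feature 0 gets a large positive score; feature 1 scores numerically zero
```

Sorting features by these scores in descending order gives a ranking of the same kind as the table. To score a group of features, the same idea applies with all columns in the group permuted together.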
8. Contributions
Yichen Yang and Ruoyun Ma co-wrote the report and ran experiments. Min Haeng Cho performed the literature review and co-wrote and finalized the report. We all created the poster together. The code is available on GitHub at https://github.com/JeffJeffy/CS229Project.
References
[1] Y. J. Lim and Y. W. Teh, “Variational bayesian approach to movie rating prediction,” in Proceedings of KDD Cup and Workshop, vol. 7, 2007, pp. 15–21.
[2] A. Oghina, M. Breuss, M. Tsagkias, and M. De Rijke, “Predicting imdb movie ratings using social media,” in European Conference on Information Retrieval. Springer, 2012, pp. 503–507.
[3] J. D. Mcauliffe and D. M. Blei, “Supervised topic models,” in Advances in Neural Information Processing Systems, 2008, pp. 121–128.
[4] J. Ahmad, P. Duraisamy, A. Yousef, and B. Buckles, “Movie success prediction using data mining,” in Institute of Electrical and Electronics Engineers, 2017.
[5] F. B. Moghaddam, M. Elahi, R. Hosseini, C. Trattner, and M. Tkalčič, “Predicting movie popularity and ratings with visual features,” in 2019 14th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP). IEEE, 2019, pp. 1–6.
[6] KaggleInc, “Movie genre from its poster,” https://www.kaggle.com/neha1703/movie-genre-from-its-poster.
[7] Kaggle, “The movies dataset,” https://www.kaggle.com/rounakbanik/the-movies-dataset.
[8] C. Sun, “Predict movie rating,” https://nycdatascience.com/blog/student-works/web-scraping/movie-rating-prediction/, 2016.
[9] M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing,” 2017, to appear.
[10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] A. Fisher, C. Rudin, and F. Dominici, “Model class reliance: Variable importance measures for any machine learning model class, from the “rashomon” perspective,” arXiv preprint arXiv:1801.01489, 2018.