Performance Analysis of Various Activation Functions Using LSTM Neural Network For Movie Recommendation Systems
ANDRÉ BROGÄRD
PHILIP SONG
Abstract
Recommendation systems have grown in importance and popularity in many different areas. This thesis focuses on recommendation systems for movies. Recurrent neural networks using LSTM blocks have shown some success for movie recommendation systems. Research has indicated that changing the activation functions in LSTM blocks can improve performance, measured as prediction accuracy. In this study we compare four activation functions (the hyperbolic tangent, sigmoid, ELU and SELU functions) used in LSTM blocks and how they impact the prediction accuracy of the neural network. Specifically, they are applied to the block input and the block output of the LSTM blocks. Our results indicate that the hyperbolic tangent, which is the default, and the sigmoid function perform about the same, whereas the ELU and SELU functions perform worse. Further research is needed to identify other activation functions that could improve the prediction accuracy and to improve certain aspects of our methodology.
Sammanfattning
Recommendation systems have grown in importance and popularity in many different areas. This thesis focuses on recommendation systems for movies. Recurrent neural networks with LSTM blocks have shown some success for movie recommendation systems. Previous research has indicated that changing the activation functions results in improved predictions. In this study we compare four different activation functions (hyperbolic tangent, sigmoid, ELU and SELU) applied in LSTM blocks and how they affect the predictions of the neural network. They are applied specifically to the block input and block output of the LSTM blocks. Our results indicate that the hyperbolic tangent function, which is the default choice, and the sigmoid function perform equally well, while ELU and SELU both perform worse. Further research is needed to identify other activation functions and to improve several parts of the methodology.
Contents
1 Introduction
1.1 Problem Statement
1.2 Scope
2 Background
2.1 Artificial Neural Networks
2.2 Multilayer Perceptron ANN
2.3 Recurrent Neural Network
2.4 Long Short-Term Memory
2.4.1 LSTM Architecture
2.4.2 Activation Functions
2.5 Metrics
2.6 Related work
3 Methods
3.1 Dataset
3.2 Implementation
3.3 Evaluation
4 Results
5 Discussion
5.1 Result
5.2 Improvements
6 Conclusions
Bibliography
Chapter 1
Introduction
With more online movie platforms becoming available, people have a lot of
movie content to choose from. According to a study from Ericsson, people
spend up to one hour per day searching for movie content [1]. Seeking to min-
imize this time, movie recommendation systems have been developed using
Artificial Intelligence [2].
Recommendation systems aim to solve the problem of information overload, which prevents users from finding interesting items, by filtering information [3]. One such approach is collaborative filtering (CF), where the interests of similar users are considered [3]. Popular approaches to CF include the use of neural networks, and in [4] it is demonstrated that CF can be reformulated as a sequence prediction problem and solved with recurrent neural networks (RNN).
Long Short-Term Memory (LSTM), an RNN built from LSTM blocks, was designed to address the vanishing gradient problem of standard RNNs and has shown improved performance [5]. LSTM has been applied in several recommendation systems [6] targeted at both entertainment (movies, music, videos) and e-commerce settings and has outperformed state-of-the-art models in many cases.
In [4] an LSTM neural network was applied to the top-N recommendation problem, using the default choice of activation functions and recommending the 10 movies the user would be most interested in seeing next. The rating of a movie was ignored; only the sequence of watched movies was considered. It was observed that extra features such as age, rating or sex did not lead to an increase in accuracy. Both the Movielens and Netflix datasets were used, and LSTM outperformed all baseline models in nearly all metrics.
This study will use the same framework as in [4]. Since there has been success in switching activation functions [7], the study will compare different choices of activation functions in LSTM blocks and their impact on prediction accuracy.
1.2 Scope
The implementation of LSTM is the same as in [4], with small modifications. This study therefore only considers this type of LSTM applied to the top-N recommendation problem. In [4] the number of features is limited to three (user id, movie id and timestamp), and it is further concluded that additional features such as sex or age do not improve the accuracy of the models unless they are all combined. We limit the features identically.
Only the Movielens 1M dataset will be used in this study because of limited computational resources. Additionally, only the hyperbolic tangent, sigmoid, ELU and SELU activation functions will be tested, as they have shown promising results in previous work.
Chapter 2
Background
The forget, input and output gates of each LSTM block are defined by equations 2.1-2.3 respectively. The block input $\tilde{C}_t$, defined in equation 2.4, is a tanh layer which together with the input gate decides what information will be stored in the cell state. The cell state $C_t$ is updated from the old cell state $C_{t-1}$ at time $t$. $W$ and $U$ are weight matrices and $b$ is a bias vector. Finally, the hidden state $h_t$ is the block output at time $t$.
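Equations 2.1-2.4 are not reproduced here; as a reference sketch, a standard LSTM formulation consistent with the notation above is:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad \text{(forget gate)}$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad \text{(input gate)}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad \text{(output gate)}$$
$$\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C) \qquad \text{(block input)}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad \text{(cell state update)}$$
$$h_t = o_t \odot \tanh(C_t) \qquad \text{(block output)}$$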
Sigmoid function
The sigmoid function has a range of [0, 1] and is illustrated in figure 2.3. The formula is given by:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The ELU function is given by $\mathrm{ELU}(x) = x$ for $x > 0$ and $\alpha(e^x - 1)$ for $x \le 0$. In figure 2.5 the $\alpha$ parameter is set to 1, giving it a range of $(-1, \infty)$. The SELU function is defined by:
$$\mathrm{SELU}(x) = \begin{cases} \lambda x & x > 0 \\ \lambda\alpha(e^x - 1) & x \le 0 \end{cases}$$

where
$$\lambda = 1.0507009873554804934193349852946, \qquad \alpha = 1.6732632423543772848170429916717$$
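As an illustration (not taken from the thesis code), the four activation functions compared in this study can be sketched in NumPy as follows:

```python
import numpy as np

# Illustrative sketch of the four activation functions compared in this study.
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x):
    # SELU is an ELU with a fixed alpha, scaled by lambda
    return SELU_LAMBDA * np.where(x > 0, x, SELU_ALPHA * (np.exp(x) - 1.0))
```

Note that SELU is simply the ELU with a particular choice of $\alpha$, scaled by $\lambda$.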
2.5 Metrics
These are the same metrics used in [4] and are thus identically defined. They are used to evaluate different qualities of a recommendation system.
• Sps. The Short-term Prediction Success captures the ability of the method to predict the next item. It is 1 if the next item is present in the recommendations and 0 otherwise.
• Recall. The usual metric for top-N recommendation; it captures the ability of the method to make long-term predictions.
• User coverage. The fraction of users who received at least one correct recommendation. Average recall (and precision) hides the distribution of success among users: a high recall could still mean that many users received no correct recommendations at all.
• Item coverage. The number of distinct items that were correctly recommended. It captures the capacity of the method to make diverse, successful recommendations.
Observe that these metrics are all computed for a recommendation system that always produces ten recommendations for each user; a small sketch of how they could be computed is given below.
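The following is a minimal sketch of how these four metrics could be computed for a system that returns ten recommendations per user. The data structures and names are hypothetical, not taken from the framework of [4].

```python
# Hedged sketch of the four metrics for a recommender that returns
# ten recommendations per user.
def evaluate_metrics(recommendations, next_item, held_out):
    """recommendations: {user: list of 10 recommended item ids}
    next_item:          {user: the single next item in the user's test sequence}
    held_out:           {user: set of all items in the user's test sequence}"""
    n_users = len(recommendations)
    sps_hits = 0
    recall_sum = 0.0
    covered_users = 0
    correctly_recommended = set()

    for user, recs in recommendations.items():
        recs = set(recs)
        hits = recs & held_out[user]
        sps_hits += int(next_item[user] in recs)        # short-term prediction success
        recall_sum += len(hits) / len(held_out[user])   # recall for this user
        covered_users += int(len(hits) > 0)             # user got >= 1 correct item
        correctly_recommended |= hits                   # distinct correct items

    return {
        "sps": sps_hits / n_users,
        "recall": recall_sum / n_users,
        "user_coverage": covered_users / n_users,
        "item_coverage": len(correctly_recommended),
    }
```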
Chapter 3
Methods
3.1 Dataset
The dataset used is Movielens 1M. The dataset contains many possible features that are not considered in the model; only the user id, movie id and timestamp are treated as features. Preprocessing is included in the LSTM implementation by [4].
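As an illustration of the kind of preprocessing involved (the actual preprocessing is performed by the framework of [4]; the file path below is an assumption), the Movielens 1M ratings could be reduced to these three features and turned into per-user movie sequences with pandas:

```python
import pandas as pd

# Hedged sketch: load Movielens 1M ratings (UserID::MovieID::Rating::Timestamp).
ratings = pd.read_csv(
    "ml-1m/ratings.dat", sep="::", engine="python",
    names=["user_id", "movie_id", "rating", "timestamp"],
)

# Keep only the three features used in this study and order each user's
# movies by time to form a watch sequence.
ratings = ratings[["user_id", "movie_id", "timestamp"]]
sequences = (
    ratings.sort_values("timestamp")
           .groupby("user_id")["movie_id"]
           .apply(list)
)
```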
3.2 Implementation
The modifications to the original code by [4] can be found in the authors' fork of the original repository on GitHub: github.com/andrebrogard/sequence-based-recommendations. The only modification made is the option to specify which activation functions to apply to the individual gates of the LSTM blocks when training and testing the model.
The framework's default is the sigmoid function for the input, output and forget gates. In our tests, we compare four different activation functions applied identically to the block input and block output, namely the hyperbolic tangent, sigmoid, ELU and SELU functions.
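As an illustration of the distinction between the block input/output activation and the gate activations (the thesis uses the Theano-based framework of [4], not Keras, so the snippet below is only a hedged sketch of the same idea):

```python
import tensorflow as tf

# In a Keras LSTM layer, `activation` corresponds to the block input and
# block output activation varied in this study, while `recurrent_activation`
# is the function applied to the input, forget and output gates, left here
# at its sigmoid default.
def build_model(n_items, embedding_dim=100, units=20, block_activation="tanh"):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(n_items, embedding_dim),
        tf.keras.layers.LSTM(units,
                             activation=block_activation,      # tanh, sigmoid, elu or selu
                             recurrent_activation="sigmoid"),  # gates kept at the default
        tf.keras.layers.Dense(n_items, activation="softmax"),  # score over all items
    ])

# Example: a model whose block input/output use SELU (item count is a placeholder).
model = build_model(n_items=4000, block_activation="selu")
```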
3.3 Evaluation
Metrics
The metrics used are identical to those of [4] and capture the same properties, in order to make the results comparable. They are all calculated in the context where the recommendation system makes ten recommendations. See section 2.5 for their definitions.
Number of tests
Training will be conducted 15 times on the dataset for each activation function, in order to capture the variance and obtain a fair comparison. The models are then evaluated according to the metrics above.
Chapter 4
Results
Figures 4.1-4.4 show the mean sps, recall, user coverage and item coverage respectively across intermediate epochs from 1 to 102. All results are evaluated on the test data using models saved at each intermediate epoch. Each activation function was, as described, used to train a model 15 times, and the mean of all metrics has been computed over these runs. Table 4.1 shows the mean and the standard deviation of the results over the 15 models.
Both ELU and SELU perform worse than the sigmoid and hyperbolic tangent functions across all metrics. Additionally, ELU always performs worse than SELU. The hyperbolic tangent and sigmoid functions are similar in their performance, with a slight advantage for the hyperbolic tangent only in the recall metric.
An observation shared between most activation functions and metrics is that the models do not seem to improve significantly beyond around 20 epochs. In the recall and sps metrics all activation functions instead decrease. The SELU function decreases in all metrics after around 50 epochs, whereas the ELU function decreases after around 20 epochs.
Figure 4.1: The mean sps across intermediate epochs. Evaluated on the test
data.
Figure 4.2: The mean recall across intermediate epochs. Evaluated on the test
data.
Figure 4.3: The mean user coverage across intermediate epochs. Evaluated on
the test data.
Figure 4.4: The mean item coverage across intermediate epochs. Evaluated
on the test data.
Chapter 5
Discussion
5.1 Result
The ELU and SELU functions seem to have had a negative impact on the models, as they did not achieve the same accuracy as the hyperbolic tangent and sigmoid functions. Both functions were less accurate for short-term as well as long-term recommendations, fewer users received a correct recommendation, and fewer items were ever recommended. Interestingly, the sigmoid and hyperbolic tangent functions displayed no significant difference in any metric, and the SELU function achieved the highest mean sps value of all activation functions at around 50 epochs, before it started decreasing.
The ELU function displayed the lowest mean and the highest standard deviation in nearly all metrics. This further indicates that ELU was not a good choice of activation function. Moreover, SELU had a lower mean but a similar standard deviation to the sigmoid and hyperbolic tangent functions. We believe this is a promising property of the SELU function, as it appears to be as stable as the sigmoid and hyperbolic tangent functions.
The sigmoid function yields better results in sps and item coverage than the hyperbolic tangent. Additionally, the standard deviation is slightly lower for the sigmoid function in those two metrics. Thus, according to our results, the sigmoid function could be a substitute for the default function.
The metrics associated with the hyperbolic tangent function should be comparable with the results of [4], because the same framework is used and similar tests were performed. For a layer size of 20 neurons, as used here, they reported better results; their mean sps for the hyperbolic tangent function on the same dataset was well over 30% at around 100 epochs. Furthermore, it was not until around 100 epochs that their model stopped improving. Our results show that most activation functions had already attained their maximum sps at around 20 epochs. Had we observed a smoother learning curve, we would have had more convincing results for the SELU and ELU functions.
5.2 Improvements
The choice of neural network parameters may explain the difference in re-
sults compared to [4], especially the learning rate could affect the models. It
could contribute to the fact that our models reach maximum value quicker and
hinders it from achieving similar results. We use the default learning rate pa-
rameters of the framework for RNN, which uses Adam, which might explain
the difference compared to [4]. Furthermore, the layer size, which was 20 neu-
rons in this study, should have been varied as in [4] to better observe possible
differences in learning rate. What neural network parameters to use should be
considered more carefully in future work.
In each LSTM block, the block input and block output activation functions were the only ones changed, while the activation functions in the three gates (input, forget and output gates) were kept as the sigmoid function (the default). In [7], by contrast, 23 activation functions were applied to the three gates. The activation functions that showed the best performance in that study were not tested here because of time constraints. We did not observe a significant advantage for any activation function compared to the default. For future work, more comprehensive experiments evaluating more activation functions should be performed.
The study in [4] uses two datasets: Movielens 1M and Netflix. In this study, only Movielens 1M is used because of time constraints. Therefore, our results could be closely tied to the structure of this specific dataset. In future work, more datasets need to be considered.
The performance of each activation function is evaluated strictly on accuracy using each metric; the temporal aspect was overlooked. Because our tests did not record how long the networks were trained, whether an activation function achieves better accuracy in a shorter time was not evaluated. To better evaluate an activation function, future work should not overlook the temporal aspect.
Chapter 6
Conclusions
Bibliography