Recommender System Unit II
CONTENTS
INTRODUCTION
WHAT IS A RECOMMENDER SYSTEM
TYPES OF RECOMMENDER SYSTEM
WHY DO WE NEED A RECOMMENDER SYSTEM
Everyone loves movies, irrespective of age, gender, race, color, or geographical location. In a
way, we are all connected to each other through this medium.
What is most interesting, though, is how unique our choices and combinations of movie
preferences are. Some people like genre-specific movies, be it thriller, romance, or sci-fi,
while others focus on lead actors and directors.
Taking all that into account, it is very difficult to generalize about a movie and say that
everyone would like it. Even so, similar movies tend to be liked by a specific segment of
society.
This is where we as data scientists come in: we extract insight from the behavioral patterns of
the audience and from the attributes of the movies themselves. So, without further ado, let's
jump right into the basics of a recommendation system.
In [35]:
# Import Libraries
import pandas as pd
import numpy as np
In [6]:
# LOADING THE DATASET:
# The following loads the file into a DataFrame using pandas' read_excel() method
In [18]:
rating_df=pd.read_excel('ratings1.xlsx')
In [19]:
# Let us print the first five records.
In [20]:
rating_df.head()
Out[20]:
   userId  movieId  rating   timestamp
0       1      296     5.0  1147880044
1       1      306     3.5  1147868817
2       1      307     5.0  1147868828
3       1      665     5.0  1147878820
4       1      899     3.5  1147868510
The timestamp column will not be used in this example, so it can be dropped from the dataframe.
In [21]:
rating_df.drop( 'timestamp', axis = 1, inplace = True )
In [22]:
# The number of unique users in the dataset can be found using the unique() method on the userId column.
In [23]:
len(rating_df.userId.unique())
Out[23]:
526
In [24]:
# Similarly, the number of unique movies in the dataset is
In [25]:
len( rating_df.movieId.unique() )
Out[25]:
7312
Before proceeding further, we need to create a pivot table or matrix and represent users as rows
and movies as columns. The values of the matrix will be the ratings the users have given to
those movies.
As there are 526 users and 7312 movies, the matrix will have size 526 x 7312. It will be very
sparse, because a cell is filled only when the user has actually rated that movie.
Movies that a user has not yet watched and rated are represented as NaN.
The pandas DataFrame has a pivot() method, which takes the following three parameters:
1. index: Column value to be used as DataFrame’s index. So, it will be userId column of
rating_df.
2. columns: Column values to be used as DataFrame’s columns. So, it will be movieId
column of rating_df.
3. values: Column to use for populating DataFrame’s values. So, it will be rating column of
rating_df
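To make the shape and sparsity of such a matrix concrete, here is a minimal sketch on a toy table. The frame `toy_ratings` and its values are hypothetical and only mimic the layout of rating_df; the density figure is simply the share of cells that hold a rating.

```python
import pandas as pd

# Hypothetical toy ratings, in the same long format as rating_df.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [5.0, 3.5, 4.0, 2.0, 4.5],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(toy_matrix)

# Fraction of cells that actually hold a rating.
density = toy_matrix.notna().sum().sum() / toy_matrix.size
print(f"density = {density:.3f}")
```

On the real data the same ratio would be far smaller, since each user rates only a small part of the catalogue.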
In [26]:
# pivot() already labels the rows with the sorted userId values,
# so the index does not need to be reset and reassigned.
user_movies_df = rating_df.pivot( index = 'userId', columns = 'movieId', values = 'rating' )
In [27]:
# Let us print the first 5 rows and first 15 columns.
In [28]:
user_movies_df.iloc[0:5, 0:15]
Out[28]:
movieId 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
The DataFrame contains NaN for those entries where the user has not rated the movie.
We can impute those NaNs with 0 using the following code.
In [29]:
user_movies_df.fillna( 0, inplace = True)
user_movies_df.iloc[0:5, 0:10]
Out[29]:
movieId 1 2 3 4 5 6 7 8 9 10
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [30]:
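from sklearn.metrics import pairwise_distances
# Cosine similarity is 1 minus the cosine distance between each pair of user rows.
user_sim = 1 - pairwise_distances( user_movies_df.values, metric = "cosine" )
# Store the result in a DataFrame labelled by user id on both axes.
user_sim_df = pd.DataFrame( user_sim, index = user_movies_df.index, columns = user_movies_df.index )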
In [31]:
# We can print the similarity between first 5 users by using the following code.
In [32]:
user_sim_df.iloc[0:5, 0:5]
Out[32]:
1 2 3 4 5
1 1.000000 0.040863 0.061306 0.040815 0.015609
2 0.040863 1.000000 0.179009 0.197496 0.158202
3 0.061306 0.179009 1.000000 0.357750 0.061448
4 0.040815 0.197496 0.357750 1.000000 0.065825
5 0.015609 0.158202 0.061448 0.065825 1.000000
In [33]:
# The dimensions of the matrix are available in the shape attribute of the user_sim_df matrix.
In [34]:
user_sim_df.shape
Out[34]:
(526, 526)
The shape of the user_sim_df matrix shows that it contains the cosine similarity between all
possible pairs of users.
Each cell represents the cosine similarity between two specific users. For example, the
similarity between userId 1 and userId 5 is 0.015609.
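As a worked illustration of what a single cell holds, the sketch below computes the cosine similarity of two rating vectors by hand: the dot product divided by the product of the vector lengths. The vectors `u` and `v` are hypothetical, and `cosine_similarity` is a helper written for this example, not a function from the notebook.

```python
import numpy as np

# Hypothetical rating vectors for two users over the same five movies
# (0 means "not rated", as after the fillna step).
u = np.array([5.0, 0.0, 3.0, 0.0, 4.0])
v = np.array([4.0, 0.0, 0.0, 2.0, 5.0])

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_uv = cosine_similarity(u, v)
print(round(sim_uv, 4))
```

Only movies rated by both users contribute to the dot product, so users with few movies in common end up with a similarity close to 0.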
The diagonal of the matrix shows the similarity of a user with themselves (i.e., 1.0). This is
expected, as each user is most similar to himself or herself. But we need the algorithm to find
other users who are similar to a specific user. So, we will set the diagonal values to 0.0.
In [36]:
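np.fill_diagonal( user_sim, 0 )
# Rebuild the DataFrame so the zeroed diagonal shows up even if pandas copied the array.
user_sim_df = pd.DataFrame( user_sim, index = user_movies_df.index, columns = user_movies_df.index )
user_sim_df.iloc[0:5, 0:5]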
Out[36]:
1 2 3 4 5
1 0.000000 0.040863 0.061306 0.040815 0.015609
2 0.040863 0.000000 0.179009 0.197496 0.158202
3 0.061306 0.179009 0.000000 0.357750 0.061448
4 0.040815 0.197496 0.357750 0.000000 0.065825
5 0.015609 0.158202 0.061448 0.065825 0.000000
All diagonal values are set to 0, which helps to avoid selecting self as the most similar user.
In [38]:
user_sim_df.idxmax(axis=1)[0:5]
Out[38]:
1 267
2 186
3 494
4 195
5 167
dtype: int64
The above result shows user 267 is most similar to user 1, user 186 is most similar to user 2, and
so on.
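The two steps above (zeroing the diagonal, then taking the row-wise idxmax) can be sketched on a small matrix. The similarity values and user ids here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 3 similarity matrix for users 1, 2 and 3.
toy_sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
])
user_ids = [1, 2, 3]

# Without this step every user would be their own nearest neighbour.
np.fill_diagonal(toy_sim, 0)

toy_sim_df = pd.DataFrame(toy_sim, index=user_ids, columns=user_ids)

# For each row, the column label holding the largest similarity.
nearest = toy_sim_df.idxmax(axis=1)
print(nearest)
```

Note that idxmax returns the column label, so the result is a user id only when the columns are labelled by user id.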
To dive a little deeper to understand the similarity, let us print the similarity values between user
2 and users ranging from 331 to 340.
In [39]:
user_sim_df.iloc[1:2, 330:340]
Out[39]:
331 332 333 334 335 336 337 338 339 340
2 0.10862 0.121669 0.090911 0.069766 0.123996 0.033399 0.052103 0.0083 0.110686 0.112213
The output shows that the cosine similarity between userId 2 and userId 335 is 0.123996, the
highest among these ten users. But why is user 335 more similar to user 2 than the others in
this range? This can be explained intuitively if we can verify that the two users have watched
several movies in common and rated them very similarly. For this, we need to read the movies
dataset, which contains the movie id along with the movie name.
In [40]:
movies_df = pd.read_csv( "movies.csv")
In [41]:
# We will print the first 5 movie details using the following code.
In [42]:
movies_df[0:5]
Out[42]:
   movieId  title                               genres
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                      Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)  Comedy
In [43]:
# The genres column is dropped from the DataFrame, as it is not going to be used in this analysis
In [45]:
movies_df.drop( 'genres', axis = 1, inplace = True )
In [50]:
def get_user_similar_movies( user1, user2 ):
    # An inner join on movieId between the two users' ratings
    # gives the movies both have rated.
    common_movies = rating_df[rating_df.userId == user1].merge(
        rating_df[rating_df.userId == user2], on = "movieId", how = "inner" )
    # Attach the movie titles and return the result.
    return common_movies.merge( movies_df, on = 'movieId' )
To find out which movies user 2 and user 335 have watched in common and how they have rated
each of them, we keep only the movies that both have rated at least 4, which limits the number
of movies printed.
In [53]:
common_movies = get_user_similar_movies( 2, 335 )
In [54]:
common_movies[(common_movies.rating_x >= 4.0) &
((common_movies.rating_y >= 4.0))]
Out[54]:
   userId_x  movieId  rating_x  userId_y  rating_y  title
0         2      260       5.0       335       5.0  Star Wars: Episode IV - A New Hope (1977)
1         2      318       5.0       335       5.0  Shawshank Redemption, The (1994)
2         2      356       4.5       335       4.0  Forrest Gump (1994)
3         2     1196       5.0       335       5.0  Star Wars: Episode V - The Empire Strikes Back...
4         2     1197       5.0       335       5.0  Princess Bride, The (1987)
5         2     1210       5.0       335       5.0  Star Wars: Episode VI - Return of the Jedi (1983)
6         2     5418       5.0       335       4.0  Bourne Identity, The (2002)
From the table we can see that users 2 and 335 have 7 movies in common that both rated 4 or
higher, and that their ratings are on almost the same scale. Their preferences seem to be very
similar. How about users with dissimilar behavior? Let us check users 2 and 338, whose cosine
similarity is 0.0083.
In [55]:
common_movies = get_user_similar_movies( 2, 338 )
common_movies
Out[55]:
   userId_x  movieId  rating_x  userId_y  rating_y  title
0         2      588       2.0       338       3.5  Aladdin (1992)
1         2    35836       0.5       338       5.0  40-Year-Old Virgin, The (2005)
Users 2 and 338 have only two movies in common and have rated very differently. They indeed
are very dissimilar.
In [56]:
rating_mat = rating_df.pivot( index = 'movieId', columns = 'userId', values = 'rating' ).reset_index( drop = True )
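The cell that turns rating_mat into the movie similarity matrix is not shown in this notebook. The following is a minimal sketch of how such a matrix could be built, on hypothetical toy data. It assumes the missing ratings are filled with 0 and that similarity is the Pearson correlation between movie rows; correlation is a guess, based on the negative similarity value that appears in a later output, which cosine similarity on non-negative ratings cannot produce. `np.corrcoef` is used here in place of `pairwise_distances` to keep the example light, and all `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings in the same long format as rating_df.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 20, 20, 30, 30],
    "rating":  [5.0, 4.0, 4.5, 4.0, 1.0, 5.0, 4.0],
})

# Movies as rows, users as columns; unrated cells are filled with 0.
toy_mat = toy_ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)

# Pearson correlation between every pair of movie rows.
toy_movie_sim = np.corrcoef(toy_mat.values)

# Keep the movieId labels on both axes (an assumption of this sketch).
toy_movie_sim_df = pd.DataFrame(toy_movie_sim, index=toy_mat.index, columns=toy_mat.index)
print(toy_movie_sim_df.round(3))
```

Unlike the notebook's rating_mat, this sketch keeps movieId as the index, so a movie's row can be looked up by its id rather than by position.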
Now, the following code is used to print similarity between the first 5 movies.
In [57]:
movie_sim_df.iloc[0:5, 0:5]
Out[57]:
0 1 2 3 4
0 1.000000 0.137878 0.207511 0.128774 0.150345
1 0.137878 1.000000 0.107603 0.118175 0.109820
2 0.207511 0.107603 1.000000 0.217580 0.374952
3 0.128774 0.118175 0.217580 1.000000 0.293146
4 0.150345 0.109820 0.374952 0.293146 1.000000
In [58]:
movie_sim_df.shape
Out[58]:
(7312, 7312)
There are 7312 movies, and the dimension of the matrix (7312, 7312) shows that the similarity
is calculated for all pairs of movies.
In [63]:
def get_similar_movies( movieid, topN = 5 ):
    # Get the index of the movie record in movies_df.
    # Note: this assumes row i of movie_sim_df describes the movie in row i of movies_df.
    movieidx = movies_df[movies_df.movieId == movieid].index[0]
    movies_df['similarity'] = movie_sim_df.iloc[movieidx]
    top_n = movies_df.sort_values( ['similarity'], ascending = False )[0:topN]
    return top_n
The above function get_similar_movies() takes a movie id as an argument and returns the movies
that are most similar to it.
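Because the function looks the movie up by its row position in movies_df, its results are only meaningful when movies_df and the similarity matrix list the same movies in the same order. If movies.csv contains titles that have no ratings, the positions drift apart. A minimal sketch of a lookup that matches on movieId labels instead is given below; the catalogue, the similarity values, and the helper `similar_movies_by_id` are all hypothetical.

```python
import pandas as pd

# Hypothetical catalogue: movie 20 has no ratings, so it is absent
# from the similarity matrix below.
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title": ["A", "B", "C", "D"],
})

# Hypothetical similarity matrix labelled by movieId on both axes.
ids = [10, 30, 40]
toy_movie_sim_df = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=ids, columns=ids,
)

def similar_movies_by_id(movieid, top_n=2):
    """Return the top_n most similar movies, matched on movieId labels."""
    # Take the movie's row by label and leave out the movie itself.
    sims = toy_movie_sim_df.loc[movieid].drop(movieid).rename("similarity")
    ranked = sims.sort_values(ascending=False).head(top_n).reset_index()
    ranked.columns = ["movieId", "similarity"]
    # Attach titles by joining on movieId rather than on row position.
    return ranked.merge(toy_movies, on="movieId")

result = similar_movies_by_id(10)
print(result)
```

Dropping the queried movie from its own row also keeps it from appearing as its own best match.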
Let us see how the similarities play out by finding the movies that are similar to The
Godfather, and check whether the results make sense. The movie id for The Godfather is 858.
In [67]:
movies_df[movies_df.movieId == 858]
Out[67]:
movieId title
840 858 Godfather, The (1972)
In [68]:
get_similar_movies(858)
Out[68]:
movieId title similarity
3371 3468 Hustler, The (1961) 1.0
3308 3403 Raise the Titanic (1980) 1.0
202 204 Under Siege 2: Dark Territory (1995) 1.0
1573 1632 Smile Like Yours, A (1997) 1.0
3188 3281 Brandon Teena Story, The (1998) 1.0
Let us find out which movies are similar to the movie Dumb and Dumber.
In [69]:
movies_df[movies_df.movieId == 231]
Out[69]:
movieId title similarity
228 231 Dumb & Dumber (Dumb and Dumber) (1994) -0.003201
In [70]:
get_similar_movies(231)
Out[70]:
movieId title similarity
228 231 Dumb & Dumber (Dumb and Dumber) (1994) 1.000000
757 773 Touki Bouki (1973) 0.630712
1136 1164 2 ou 3 choses que je sais d'elle (2 or 3 Thing... 0.608290
390 395 Desert Winds (1995) 0.608290
545 551 Nightmare Before Christmas, The (1993) 0.533183
Since not every user rates every movie, the matrix has many missing values and is therefore
sparse. The missing ratings are filled with 0 so that every cell has a value when the matrices
are multiplied.
For example, two users may both give high ratings to a certain movie because it stars their
favorite actor or actress, or because it belongs to the action genre. Such hidden factors are
called latent features.
In the example rating table (not reproduced here), user 1 and user 3 both give high ratings to
movie 2 and movie 3.
Hence, through matrix factorization, we can discover these latent features and use them to
predict a rating based on the similarity in users' preferences and interactions.
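A minimal sketch of the idea follows: a small rating matrix is approximated by the product of a user-factor matrix P and a movie-factor matrix Q, fitted by stochastic gradient descent. Everything here is an assumption made for illustration: the ratings in `R`, the choice of two latent features, the learning rate and the penalty. The sketch also fits only the observed ratings, which is one common variant, rather than treating the 0 entries as real ratings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings: rows are users 1-4, columns are movies 1-4,
# and 0 marks a missing rating.
R = np.array([
    [1.0, 5.0, 4.0, 2.0],
    [4.0, 1.0, 0.0, 5.0],
    [0.0, 4.0, 5.0, 1.0],
    [2.0, 5.0, 4.0, 0.0],
])
observed = R > 0

k = 2                                   # number of latent features (assumed)
P = rng.normal(scale=0.1, size=(4, k))  # user factors
Q = rng.normal(scale=0.1, size=(4, k))  # movie factors

def observed_rmse(P, Q):
    """Root mean squared error over the observed ratings only."""
    err = (R - P @ Q.T)[observed]
    return float(np.sqrt(np.mean(err ** 2)))

rmse_before = observed_rmse(P, Q)

lr, reg = 0.01, 0.02                    # learning rate and L2 penalty (assumed)
for _ in range(2000):
    for u, i in zip(*np.nonzero(observed)):
        e = R[u, i] - P[u] @ Q[i]       # error on one observed rating
        p_old = P[u].copy()
        P[u] += lr * (e * Q[i] - reg * P[u])
        Q[i] += lr * (e * p_old - reg * Q[i])

rmse_after = observed_rmse(P, Q)
print(rmse_before, rmse_after)

# The product fills in every cell, including the ones that were missing.
print(np.round(P @ Q.T, 2))
```

The rows of P and Q play the role of the latent features described above: each user and each movie is summarized by k numbers, and a predicted rating is their dot product.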
Consider a scenario in which user 4 has not rated movie 4. We would like to know whether user 4
would like movie 4.
The method is to find other users whose preferences are similar to those of user 4, take the
ratings they gave to movie 4, and use them to predict whether user 4 would like the movie or
not.
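The neighbour-based prediction just described can be sketched as a similarity-weighted average. The rating matrix is hypothetical, cosine similarity is used as the similarity measure (as in the user-based section earlier), and the weighted average is one simple way to combine the neighbours' ratings, not the only one.

```python
import numpy as np

# Hypothetical ratings: rows are users 1-4, columns are movies 1-4,
# and 0 marks a missing rating (user 4 has not rated movie 4).
R = np.array([
    [1.0, 5.0, 4.0, 2.0],
    [4.0, 1.0, 0.0, 5.0],
    [0.0, 4.0, 5.0, 1.0],
    [2.0, 5.0, 4.0, 0.0],
])
target_user, target_movie = 3, 3        # zero-based positions

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Other users who have rated the target movie.
neighbours = [u for u in range(R.shape[0])
              if u != target_user and R[u, target_movie] > 0]

# Their similarity to the target user and their ratings of the movie.
weights = np.array([cosine(R[target_user], R[u]) for u in neighbours])
ratings = np.array([R[u, target_movie] for u in neighbours])

# Similarity-weighted average of the neighbours' ratings.
predicted = float(weights @ ratings / weights.sum())
print(round(predicted, 2))
```

In this toy data the users most similar to user 4 gave movie 4 low ratings, so the predicted rating is low as well.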
https://towardsdatascience.com/recommendation-system-matrix-factorization-d61978660b4b