Recommender System Unit Ii

The document discusses recommender systems. It defines a recommender system as a system that predicts a user's preferences for items and recommends the highest rated items. It describes two main types of recommender systems: content-based filtering, which recommends items similar to those a user liked based on item attributes, and collaborative filtering, which recommends items liked by similar users. It also explains that recommender systems are needed because the internet provides too many options for users to easily find items they will like.

Uploaded by

Mahi Rockzz

RECOMMENDER SYSTEM¶

CONTENTS¶
INTRODUCTION
WHAT IS A RECOMMENDER SYSTEM
TYPES OF RECOMMENDER SYSTEM
WHY DO WE NEED A RECOMMENDER SYSTEM

Everyone loves movies, irrespective of age, gender, race, color, or geographical location. In a
way, we are all connected to each other through this amazing medium.
What is most interesting is how unique our choices and combinations are in terms
of movie preferences. Some people like genre-specific movies, be it thriller, romance, or sci-fi,
while others focus on lead actors and directors.
When we take all that into account, it is astoundingly difficult to generalize about a movie and say
that everyone would like it. Even so, similar movies tend to be liked by a
specific part of society.
This is where we as data scientists come into play and extract the juice out of the
behavioral patterns of not only the audience but also the movies themselves. So, without
further ado, let’s jump right into the basics of a recommendation system.

II.1 WHAT IS A RS:


A recommender system (RS) is a system capable of predicting a user's future preference for a set
of items and recommending the top items. Simply put, a recommendation system is a filtration
program whose prime goal is to predict the “rating” or “preference” of a user towards a
domain-specific item or items. In our case, the domain-specific item is a movie, so the main focus
of our recommendation system is to filter and predict only those movies which a user would
prefer, given some data about the user.

II. I.A CONTENT BASED FILTERING¶


This filtration strategy is based on the data provided about the items. The algorithm recommends
products that are similar to the ones that a user has liked in the past. This similarity (generally
cosine similarity) is computed from the data we have about the items as well as the user’s past
preferences. For example, if a user likes a movie such as ‘The Prestige’, we can recommend
other movies starring ‘Christian Bale’, movies in the ‘Thriller’ genre, or movies
directed by ‘Christopher Nolan’. The recommendation system checks the
past preferences of the user and finds the film ‘The Prestige’, then uses the information
available in the database, such as the lead actors, the director, the genre, and the
production house, to find movies similar to it.
Disadvantages
Different products do not get much exposure to the user.
Businesses cannot expand, as the user does not try different types of products.
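
As a minimal sketch of the idea described above, the following example scores movies against a liked movie using cosine similarity over item attributes. The attribute vectors, the helper names `cosine_sim` and `recommend_similar`, and the choice of four attributes are illustrative assumptions; a real system would derive the attributes from the item database.

```python
import numpy as np

# Hypothetical item attributes (illustrative only). Columns are:
# [Thriller, Comedy, directed by Christopher Nolan, stars Christian Bale]
titles = ["The Prestige", "The Dark Knight", "Dumb and Dumber", "Memento"]
features = np.array([
    [1, 0, 1, 1],   # The Prestige
    [1, 0, 1, 1],   # The Dark Knight
    [0, 1, 0, 0],   # Dumb and Dumber
    [1, 0, 1, 0],   # Memento
], dtype=float)

def cosine_sim(a, b):
    """Cosine of the angle between two attribute vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_similar(liked_title, top_n=2):
    """Rank all other movies by attribute similarity to the liked movie."""
    liked = features[titles.index(liked_title)]
    scores = [(t, cosine_sim(liked, f))
              for t, f in zip(titles, features) if t != liked_title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

recommendations = recommend_similar("The Prestige")
print(recommendations)
```

Because only item attributes are compared, the ranking never leaves the neighbourhood of what the user already liked, which is the exposure problem listed under the disadvantages.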
II.I. B COLLABORATIVE FILTERING¶
Collaborative filtering (CF) is based on the notion of similarity (or distance). If two users A
and B have purchased the same products and rated them similarly on a common rating scale, then
A and B can be considered similar in their buying and preference behaviour. Hence, if A buys a
new product and rates it highly, that product can be recommended to B. Alternatively, the products
that A has already bought and rated highly can be recommended to B, if B has not already bought them.
There are two types of collaborative filtering algorithms:
II. I. B. a) USER BASED COLLABORATIVE FILTERING
II. I. B. b) ITEM BASED COLLABORATIVE FILTERING

III. WHY DO WE NEED RS:¶


One key reason why we need a recommender system in modern society is that people have too
many options to choose from due to the prevalence of the Internet. In the past, people used to shop
in a physical store, where the number of items available is limited. For instance, the number of
movies that can be placed in a Blockbuster store depends on the size of that store. By contrast,
the Internet now allows people to access abundant resources online. Netflix, for example, has an
enormous collection of movies. Although the amount of available information has increased, a new
problem has arisen: people have a hard time selecting the items they actually want to see. This is
where the recommender system comes in.

USER BASED COLLABORATIVE FILTERING:¶
The basic idea here is to find users whose past preference patterns are similar to those of user
‘A’, and then recommend to ‘A’ the items liked by those similar users which ‘A’ has not
encountered yet.
This is achieved by building a matrix of the items each user has rated/viewed/liked/clicked,
depending upon the task at hand, computing the similarity score between users, and finally
recommending items that the concerned user is not aware of but that similar users know and like.
For example, if user ‘A’ likes ‘Batman Begins’, ‘Justice League’ and ‘The Avengers’ while
user ‘B’ likes ‘Batman Begins’, ‘Justice League’ and ‘Thor’, then they have similar interests,
because these movies all belong to the superhero genre. So there is a high
probability that user ‘A’ would like ‘Thor’ and user ‘B’ would like ‘The Avengers’.
Disadvantages
People are fickle-minded, i.e., their tastes change from time to time. As this algorithm is based
on user similarity, it may pick up initial similarity patterns between two users who after a while
have completely different preferences.
There are usually many more users than items, so the user matrices become very large, are
difficult to maintain, and need to be recomputed regularly.
This algorithm is very susceptible to shilling attacks, where fake user profiles with
biased preference patterns are used to manipulate key decisions.

In [35]:
# Import Libraries
import pandas as pd
import numpy as np

In [6]:
#LOADING THE DATASET:
#The following loads the file into a DataFrame using pandas’ read_excel() method

In [18]:
rating_df=pd.read_excel('ratings1.xlsx')

In [19]:
# Let us print the first five records.

In [20]:
rating_df.head()

Out[20]:
   userId  movieId  rating   timestamp
0       1      296     5.0  1147880044
1       1      306     3.5  1147868817
2       1      307     5.0  1147868828
3       1      665     5.0  1147878820
4       1      899     3.5  1147868510

The timestamp column will not be used in this example, so it can be dropped from the dataframe.

In [21]:
rating_df.drop( 'timestamp', axis = 1, inplace = True )

In [22]:
# The number of unique users in the dataset can be found using the unique() method on the userId column.

In [23]:
len(rating_df.userId.unique())

Out[23]:
526
In [24]:
# Similarly, the number of unique movies in the dataset is

In [25]:
len( rating_df.movieId.unique() )

Out[25]:
7312

Before proceeding further, we need to create a pivot table or matrix and represent users as rows
and movies as columns. The values of the matrix will be the ratings the users have given to
those movies.
As there are 526 users and 7312 movies, we will have a matrix of size 526 x 7312. The matrix
will be very sparse, as only the cells for movies that a user has actually rated will be filled.

Movies that a user has not watched and rated yet will be represented as NaN.
The pandas DataFrame has a pivot method which takes the following three parameters:
1. index: Column value to be used as DataFrame’s index. So, it will be userId column of
rating_df.
2. columns: Column values to be used as DataFrame’s columns. So, it will be movieId
column of rating_df.
3. values: Column to use for populating DataFrame’s values. So, it will be rating column of
rating_df
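
As a minimal sketch of how these three parameters interact, the example below pivots a tiny hypothetical ratings table (the values in `toy_ratings` are made up for illustration) and measures how sparse the resulting matrix is.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as rating_df
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [5.0, 3.5, 4.0, 2.0, 4.5],
})

# Rows become users, columns become movies, cells hold the rating
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(toy_matrix)

# Fraction of cells that are empty (NaN), i.e. how sparse the matrix is
sparsity = toy_matrix.isna().to_numpy().mean()
print(f"sparsity: {sparsity:.2f}")
```

Each (user, movie) pair that does not appear in the long table becomes a NaN cell, which is why the full 526 x 7312 matrix is mostly empty.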

In [26]:
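# pivot() sorts the userId index, so the row labels restored below must be sorted as well
user_movies_df = rating_df.pivot( index='userId', columns='movieId', values="rating" ).reset_index(drop=True)
user_movies_df.index = np.sort( rating_df.userId.unique() )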

In [27]:
# Let us print the first 5 rows and first 15 columns.

In [28]:
user_movies_df.iloc[0:5, 0:15]

Out[28]:
movieId 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

The DataFrame contains NaN for those entries where a user has not rated a movie.
We can impute those NaNs with 0 using the following code. Note that this treats an unrated
movie as a rating of 0.
In [29]:
user_movies_df.fillna( 0, inplace = True)
user_movies_df.iloc[0:5, 0:10]

Out[29]:
movieId 1 2 3 4 5 6 7 8 9 10
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Calculating Cosine Similarity between Users¶
Each row in user_movies_df represents a user. If we compute the similarity between rows, it will
represent the similarity between those users. sklearn.metrics.pairwise_distances can be used to
compute distance between all pairs of users. pairwise_distances() takes a metric parameter for
what distance measure to use.
We will be using cosine similarity for finding similarity. Cosine similarity closer to 1 means users
are very similar and closer to 0 means users are very dissimilar. The following code can be used
for calculating the similarity.

In [30]:
from sklearn.metrics import pairwise_distances

# pairwise_distances() returns the cosine distance, so 1 - distance gives the cosine similarity
user_sim = 1 - pairwise_distances( user_movies_df.values, metric="cosine" )

# Store the results in a DataFrame
user_sim_df = pd.DataFrame( user_sim )

# Set the index and column names to the user ids, in the same order as the rows of user_movies_df
user_sim_df.index = user_movies_df.index
user_sim_df.columns = user_movies_df.index
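
To make the measure concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy: the dot product of two rating vectors divided by the product of their lengths. The three rating vectors and the helper name `cosine_similarity` are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np

# Hypothetical rating vectors for three users over the same four movies (0 = not rated)
user_a = np.array([5.0, 4.0, 0.0, 0.0])
user_b = np.array([4.0, 5.0, 0.0, 1.0])
user_c = np.array([0.0, 0.0, 5.0, 4.0])

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine_similarity(user_a, user_b)  # users who rated the same movies highly
sim_ac = cosine_similarity(user_a, user_c)  # users with no rated movie in common
print(sim_ab, sim_ac)
```

Users A and B overlap on the movies they rated highly, so their similarity is close to 1, while A and C share no rated movie and their similarity is 0.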

In [31]:
# We can print the similarity between first 5 users by using the following code.

In [32]:
user_sim_df.iloc[0:5, 0:5]

Out[32]:
1 2 3 4 5
1 1.000000 0.040863 0.061306 0.040815 0.015609
2 0.040863 1.000000 0.179009 0.197496 0.158202
3 0.061306 0.179009 1.000000 0.357750 0.061448
4 0.040815 0.197496 0.357750 1.000000 0.065825
5 0.015609 0.158202 0.061448 0.065825 1.000000

In [33]:
# The total dimension of the matrix is available in the shape variable of the user_sim_df matrix.

In [34]:
user_sim_df.shape

Out[34]:
(526, 526)

The shape of the user_sim_df matrix shows that it contains the cosine similarity between all
possible pairs of users.
Each cell represents the cosine similarity between two specific users. For example, the
similarity between userid 1 and userid 5 is 0.015609.

The diagonal of the matrix shows the similarity of a user with themselves (i.e., 1.0), since
each user is most similar to himself or herself. But we need the algorithm to find other users who
are similar to a specific user. So, we will set the diagonal values to 0.0.

In [36]:
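np.fill_diagonal( user_sim, 0 )
# Rebuild the DataFrame so the result does not depend on it sharing memory with user_sim
user_sim_df = pd.DataFrame( user_sim, index = user_sim_df.index, columns = user_sim_df.columns )
user_sim_df.iloc[0:5, 0:5]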

Out[36]:
1 2 3 4 5
1 0.000000 0.040863 0.061306 0.040815 0.015609
2 0.040863 0.000000 0.179009 0.197496 0.158202
3 0.061306 0.179009 0.000000 0.357750 0.061448
4 0.040815 0.197496 0.357750 0.000000 0.065825
5 0.015609 0.158202 0.061448 0.065825 0.000000

All diagonal values are set to 0, which helps to avoid selecting self as the most similar user.

Filtering Similar Users¶


To find the most similar users, the label of the maximum value in each row can be retrieved. For
example, the most similar users to the first 5 users, with userid 1 to 5, can be obtained using
the following code:

In [38]:
user_sim_df.idxmax(axis=1)[0:5]

Out[38]:
1 267
2 186
3 494
4 195
5 167
dtype: int64

The above result shows user 267 is most similar to user 1, user 186 is most similar to user 2, and
so on.
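
Once the nearest neighbour is known, a recommendation can be derived from it. The following is a minimal sketch under stated assumptions: the matrix `toy_user_movies`, the helper `recommend_from_neighbour`, and the 4.0 threshold are all hypothetical, and only the single most similar user is consulted rather than a weighted group of neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix (0 = not rated), same layout as user_movies_df
toy_user_movies = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [5.0, 4.5, 4.0, 0.0],
     [0.0, 0.0, 1.0, 5.0]],
    index=[1, 2, 3], columns=[101, 102, 103, 104],
)

# Cosine similarity between every pair of rows, with self-similarity zeroed out
m = toy_user_movies.to_numpy()
norms = np.linalg.norm(m, axis=1, keepdims=True)
sim = (m @ m.T) / (norms @ norms.T)
np.fill_diagonal(sim, 0)
toy_sim = pd.DataFrame(sim, index=toy_user_movies.index, columns=toy_user_movies.index)

def recommend_from_neighbour(user_id, min_rating=4.0):
    """Movies the most similar user rated >= min_rating that user_id has not rated."""
    neighbour = toy_sim.loc[user_id].idxmax()
    own = toy_user_movies.loc[user_id]
    theirs = toy_user_movies.loc[neighbour]
    picks = theirs[(own == 0) & (theirs >= min_rating)]
    return neighbour, list(picks.index)

print(recommend_from_neighbour(1))
```

Relying on one neighbour keeps the sketch short; averaging the ratings of several similar users, weighted by similarity, is a common refinement.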

To dive a little deeper to understand the similarity, let us print the similarity values between user
2 and users ranging from 331 to 340.

In [39]:
user_sim_df.iloc[1:2, 330:340]

Out[39]:
331 332 333 334 335 336 337 338 339 340
2 0.10862 0.121669 0.090911 0.069766 0.123996 0.033399 0.052103 0.0083 0.110686 0.112213

The output shows that the cosine similarity between userid 2 and userid 335 is 0.123996, the
highest among these ten users. But why is user 335 more similar to user 2 than the others? This
can be explained intuitively if we can verify that the two users have watched several movies in
common and rated them very similarly. For this, we need to read the movies dataset, which
contains the movie id along with the movie name.

Loading the Movies Dataset¶


Movie information is contained in the file movies.csv. Each line of this file contains the movieid,
the movie name, and the movie genre.

In [40]:
movies_df = pd.read_csv( "movies.csv")

In [41]:
# We will print the first 5 movie details using the following code.

In [42]:
movies_df[0:5]

Out[42]:
   movieId  title                               genres
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                      Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)  Comedy

In [43]:
# The genres column is dropped from the DataFrame, as it is not going to be used in this analysis

In [45]:
movies_df.drop( 'genres', axis = 1, inplace = True )

Finding Common Movies of Similar Users¶
The following method takes userids of two users and returns the common movies they have
watched and their ratings

In [50]:
def get_user_similar_movies( user1, user2 ):
    # Inner join between the movies rated by the two users gives
    # the movies they have in common.
    common_movies = rating_df[rating_df.userId == user1].merge(
        rating_df[rating_df.userId == user2], on = "movieId", how = "inner" )

    # Join the above result set with the movie details
    return common_movies.merge( movies_df, on = 'movieId' )

To find the movies that user 2 and user 335 have watched in common and how they have rated
each of them, we call the method and keep only the movies that both have rated at least 4, to
limit the number of movies printed.

In [53]:
common_movies = get_user_similar_movies( 2, 335 )

In [54]:
common_movies[(common_movies.rating_x >= 4.0) &
((common_movies.rating_y >= 4.0))]

Out[54]:
   userId_x  movieId  rating_x  userId_y  rating_y  title
0         2      260       5.0       335       5.0  Star Wars: Episode IV - A New Hope (1977)
1         2      318       5.0       335       5.0  Shawshank Redemption, The (1994)
2         2      356       4.5       335       4.0  Forrest Gump (1994)
3         2     1196       5.0       335       5.0  Star Wars: Episode V - The Empire Strikes Back...
4         2     1197       5.0       335       5.0  Princess Bride, The (1987)
5         2     1210       5.0       335       5.0  Star Wars: Episode VI - Return of the Jedi (1983)
6         2     5418       5.0       335       4.0  Bourne Identity, The (2002)
From the table we can see that users 2 and 335 have 7 movies in common that both rated 4.0 or
higher, and they have rated them on almost the same scale. Their preferences seem to be very
similar. How about users with dissimilar behavior? Let us check users 2 and 338, whose cosine
similarity is 0.0083.

In [55]:
common_movies = get_user_similar_movies( 2, 338 )
common_movies

Out[55]:
   userId_x  movieId  rating_x  userId_y  rating_y  title
0         2      588       2.0       338       3.5  Aladdin (1992)
1         2    35836       0.5       338       5.0  40-Year-Old Virgin, The (2005)

Users 2 and 338 have only two movies in common and have rated very differently. They indeed
are very dissimilar.

Challenges with User-Based Similarity¶


Finding user similarity does not work for new users. We need to wait until the new user buys a
few items and rates them.
Only then can users with similar preferences be found and recommendations be made
based on that.
This is called the cold start problem in recommender systems. It can be overcome by using
item-based similarity.
Item-based similarity is based on the notion that if two items have been bought by


ITEM BASED COLLABORATIVE FILTERING:¶
The concept in this case is to find similar movies instead of similar users, and then recommend
movies similar to those that user ‘A’ has liked in the past.
This is done by finding every pair of items that were rated/viewed/liked/clicked by the same
user, measuring the similarity of those pairs across all users who
rated/viewed/liked/clicked both, and finally recommending items based on the similarity scores.
For example, we take two movies ‘A’ and ‘B’, look at the ratings given by all users who have
rated both, and use the similarity of these ratings to decide how similar the two movies are.
If most of these common users have rated ‘A’ and ‘B’ similarly, it is highly probable that ‘A’
and ‘B’ are similar. Therefore, someone who has watched and liked ‘A’ should be recommended
‘B’, and vice versa.
Advantages over User-based Collaborative Filtering
Unlike people’s tastes, movies don’t change.
There are usually far fewer items than people, so the matrices are easier to maintain and
compute.
Shilling attacks are much harder because items cannot be faked.

Calculating Cosine Similarity between Movies¶
In this approach, we need to create a pivot table, where the rows represent movies, columns
represent users, and the cells in the matrix represent ratings the users have given to the movies.
So, the pivot() method will be called with movieId as index and userId as columns as described
below:

In [56]:
rating_mat = rating_df.pivot( index='movieId', columns='userId', values='rating' ).reset_index(drop = True)

# Fill all NaNs with 0
rating_mat.fillna(0, inplace = True)

# Find the correlation between movies (1 - correlation distance)
movie_sim = 1 - pairwise_distances( rating_mat.values, metric="correlation" )

# Store the results in a DataFrame; the diagonal (each movie compared with itself) stays at 1.0
movie_sim_df = pd.DataFrame( movie_sim )

Now, the following code is used to print similarity between the first 5 movies.

In [57]:
movie_sim_df.iloc[0:5, 0:5]

Out[57]:
0 1 2 3 4
0 1.000000 0.137878 0.207511 0.128774 0.150345
1 0.137878 1.000000 0.107603 0.118175 0.109820
2 0.207511 0.107603 1.000000 0.217580 0.374952
3 0.128774 0.118175 0.217580 1.000000 0.293146
4 0.150345 0.109820 0.374952 0.293146 1.000000

The shape of the above similarity matrix is

In [58]:
movie_sim_df.shape

Out[58]:
(7312, 7312)

There are 7312 movies, and the dimension of the matrix (7312, 7312) shows that the similarity is
calculated for all pairs of those movies.

Finding Most Similar Movies¶


In the following code, we write a method get_similar_movies() which takes a movieid as a
parameter and returns the most similar movies according to the similarity matrix movie_sim_df
computed above (which uses correlation).
Note that the movieid and the index of the movie record in movies_df are not the same. We need
to find the index of the movie record from the movieid and use that to find similarities in
movie_sim_df.
It takes another parameter, topN, to specify how many similar movies will be returned.

In [63]:
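def get_similar_movies( movieid, topN = 5 ):
    # Get the index of the movie record in movies_df
    movieidx = movies_df[movies_df.movieId == movieid].index[0]
    # Caution: this assumes row i of movie_sim_df describes the movie in row i of movies_df,
    # which holds only if movies_df lists exactly the rated movies in ascending movieId order
    movies_df['similarity'] = movie_sim_df.iloc[movieidx]
    top_n = movies_df.sort_values( ['similarity'], ascending = False )[0:topN]
    return top_n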

The above method get_similar_movies() takes movie id as an argument and returns other
movies which are similar to it.
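
The lookup above depends on the row positions of movies_df and movie_sim_df lining up. One way to avoid relying on positions, sketched here on hypothetical data, is to keep movieId as the label of the similarity matrix and look rows up by label. The tables `toy_ratings` and `toy_movies` and the helper `similar_movies_by_id` are illustrative assumptions; `np.corrcoef` is used to obtain the Pearson correlation between rows, which is the quantity that 1 minus the correlation distance yields.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; movieIds are deliberately non-contiguous
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 50, 90, 10, 50, 90, 10, 50, 50, 90],
    "rating":  [5.0, 4.5, 1.0, 4.0, 5.0, 2.0, 4.5, 4.0, 1.0, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 50, 90],          # movie 20 has no ratings at all
    "title":   ["Movie A", "Movie B", "Movie C", "Movie D"],
})

# Keep movieId as the row label instead of resetting the index to 0..n-1
item_mat = toy_ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)

# np.corrcoef treats each row as a variable, so this is movie-to-movie correlation
item_sim = pd.DataFrame(np.corrcoef(item_mat.to_numpy()),
                        index=item_mat.index, columns=item_mat.index)

def similar_movies_by_id(movie_id, top_n=2):
    """Look up similarities by movieId label and drop the movie itself."""
    scores = item_sim.loc[movie_id].drop(movie_id).rename("similarity")
    top = scores.sort_values(ascending=False).head(top_n).reset_index()
    return top.merge(toy_movies, on="movieId")

result = similar_movies_by_id(10)
print(result)
```

With labels in place, an unrated movie such as the hypothetical movie 20 simply has no row in the similarity matrix and cannot shift the positions of the others.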
Let us find out how the similarities play out by finding out movies which are similar to the movie
Godfather. And if it makes sense at all! The movie id for the movie Godfather is 858.

In [67]:
movies_df[movies_df.movieId == 858]

Out[67]:
movieId title
840 858 Godfather, The (1972)

In [68]:
get_similar_movies(858)

Out[68]:
      movieId  title                                 similarity
3371     3468  Hustler, The (1961)                   1.0
3308     3403  Raise the Titanic (1980)              1.0
202       204  Under Siege 2: Dark Territory (1995)  1.0
1573     1632  Smile Like Yours, A (1997)            1.0
3188     3281  Brandon Teena Story, The (1998)       1.0

Let us find out which movies are similar to the movie Dumb and Dumber.

In [69]:
movies_df[movies_df.movieId == 231]

Out[69]:
movieId title similarity
228 231 Dumb & Dumber (Dumb and Dumber) (1994) -0.003201

In [70]:
get_similar_movies(231)

Out[70]:
movieId title similarity
228 231 Dumb & Dumber (Dumb and Dumber) (1994) 1.000000
757 773 Touki Bouki (1973) 0.630712
1136 1164 2 ou 3 choses que je sais d'elle (2 or 3 Thing... 0.608290
390 395 Desert Winds (1995) 0.608290
545 551 Nightmare Before Christmas, The (1993) 0.533183

Introduction to Matrix Factorization¶


Matrix factorization is a way to generate latent features by expressing a matrix of interactions
between two kinds of entities as the product of two smaller matrices. In collaborative filtering,
matrix factorization is applied to identify the relationship between the item and user entities.
With users’ ratings of the shop items as input, we would like to predict how the users would rate
the items, so the users can get recommendations based on the prediction.
Assume we have a ratings table of 5 users and 5 movies, where the ratings are
integers ranging from 1 to 5. The matrix is given by the table below.

[Table: ratings given by 5 users to 5 movies]

Since not every user gives ratings to all the movies, there are many missing values in the matrix,
which results in a sparse matrix. The missing values are filled with 0 so that the matrix can be
used in the multiplication.
Latent features are hidden factors behind the ratings. For example, two users may give high
ratings to a certain movie because it stars their favorite actor or actress, or because it is an
action movie.
From the table above, we can find that user 1 and user 3 both give high ratings to movie 2 and
movie 3.
Hence, with matrix factorization, we are able to discover these latent features and predict
a rating based on the similarity in users’ preferences and interactions.
Consider a scenario in which user 4 did not give a rating to movie 4. We would like to know if
user 4 would like movie 4.
The method is to discover other users whose preferences are similar to those of user 4, take the
ratings they gave to movie 4, and predict whether user 4 would like movie 4 or not.
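
As a minimal sketch of this idea, the example below factorizes a small ratings matrix into user factors `P` and movie factors `Q` with stochastic gradient descent, training only on the observed ratings. The ratings in `R` are hypothetical, chosen only to be consistent with the description above (user 4 has not rated movie 4; users 1 and 3 rate movies 2 and 3 highly), and the number of latent features, the learning rate, and the regularization strength are assumptions rather than values from the source.

```python
import numpy as np

# Hypothetical 5 users x 5 movies ratings (1-5); 0 marks a missing rating
R = np.array([
    [5, 4, 4, 0, 1],
    [4, 0, 5, 1, 1],
    [5, 5, 4, 1, 0],
    [1, 1, 0, 0, 5],
    [0, 1, 1, 5, 4],
], dtype=float)
observed = R > 0

rng = np.random.default_rng(0)
k = 2                                    # number of latent features (an assumption)
P = rng.normal(scale=0.1, size=(5, k))   # user factors
Q = rng.normal(scale=0.1, size=(5, k))   # movie factors

lr, reg = 0.01, 0.02                     # learning rate and L2 regularization (assumptions)
for _ in range(5000):
    for u, i in zip(*np.nonzero(observed)):
        err = R[u, i] - P[u] @ Q[i]      # error on one observed rating
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])

pred = P @ Q.T                           # dense matrix of predicted ratings
rmse = np.sqrt(np.mean((R[observed] - pred[observed]) ** 2))
print(f"RMSE on observed ratings: {rmse:.3f}")
print(f"Predicted rating of user 4 for movie 4: {pred[3, 3]:.2f}")
```

Skipping the missing cells during training, instead of treating them as ratings of 0, is a design choice of this sketch: it keeps the zeros from pulling the predictions for unrated movies towards 0.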
https://towardsdatascience.com/recommendation-system-matrix-factorization-d61978660b4b
