17BIT024
A PROJECT REPORT
Submitted by
SANDHIYA G (17BIT013)
KAUSHIKA S (17BIT024)
SUJITHRA R (17BIT005)
June 2021
BONAFIDE CERTIFICATE
SIGNATURE
Dr. M. Alamelu
HEAD OF THE DEPARTMENT
Associate Professor
Information Technology

SIGNATURE
Ms. S. Sathyavathi
SUPERVISOR
Assistant Professor
Information Technology
DECLARATION
1. INTRODUCTION
2. LITERATURE SURVEY
3. DATASET
4. PROPOSED SYSTEM
5. SYSTEM DESIGN
6. ALGORITHMS
6.1 K MEANS CLUSTERING ALGORITHM
6.2 K NEAREST NEIGHBOUR ALGORITHM
6.3 AFFINITY PROPAGATION CLUSTERING ALGORITHM
7. SYSTEM REQUIREMENTS
8. RESULT
9. CODING
10. SNAPSHOTS
11. CONCLUSION
12. FUTURE WORK
13. REFERENCE
ABSTRACT
2. LITERATURE SURVEY

1. Movie recommendation System using clustering Algorithm and Pattern recognition network
Year: 2018
Objective: This paper proposed a machine learning approach to recommend movies to users, using the K-means clustering algorithm to separate similar users and creating a neural network for each cluster.
Dataset: MovieLens (publicly available); 11 target class; 146; 12000 users
Algorithm: K-Means clustering and Neural Network
Result: The system showed 95% accuracy on average in predicting ratings from new user data, which can be used to analyze which movie should be recommended to new users.
Conclusion: This proves that our system is a valid one for rating prediction in the field of movies. This ensures that our system can deal with different types of users with diverse attitudes towards movies.

2. TV series recommendation using fuzzy inference system, K-Means clustering and Adaptive neuro fuzzy inference system
Year: 2017
Objective: Predict what rating a user might give to a certain TV series by analyzing information about the user and the TV series.
Dataset: Movie data collected from MovieLens; TV series from IMDB.
Algorithm: K-Means clustering; Adaptive neuro fuzzy inference system (ANFIS)
Result: The first TV series recommendation system that considers the number of TV series as an input.
Conclusion: The result is promising as the average rating is significantly lower, but more research can improve the result even further.

3. Analysis of Movie Recommendation Systems; with and without considering the low rated movies
Year: 2020
Objective: To show that low-rated movies are not significant in finding the movie predictions, so it is advisable to ignore them while calculating movie predictions.
Dataset: Movie-Lens-100k.
Algorithm: Collaborative filtering; Pearson correlation coefficient
Result: Movie predictions for the user with user-Id 254: it is observed that there is no significant difference between the predictions. The negligible difference between the predictions shows that the effect of removing low-rated movies is negligible and hence they can be removed.
Conclusion: This proves that movies that have never got an above-average rating do not contribute significantly to movie recommendations, and it is suggested to ignore such movies.

4. Group Recommendation System for Facebook
Year: 2008
Objective: To improve quality of service for Facebook users, we developed GRS to find the most suitable group to join by matching users' profiles with groups' identity. Facebook social network groups can be identified based on their members' profiles.
Dataset: The dataset used in this research was collected using the Facebook Platform.
Algorithm: Hierarchical clustering and decision tree.
Result: With clustering, accuracy can be improved by 9%.
Conclusion: The main concept behind the GRS can be used in many different applications. One is an information distribution system based on profile features of users. As the social networking community expands exponentially, it will become a challenge to distribute the right information to the right person. So if we know the identity of the user's groups, we can ensure that the user receives the information he/she prefers.
3. DATASET
5. SYSTEM DESIGN
In our system, the first and foremost step is data gathering. Once
the dataset has been collected, it must be pre-processed, since it is
real-world data and may contain many missing values, mismatched
entries, and so on. The dataset is pre-processed using suitable
functions and then passed on to the following steps. The data is now
ready for the chosen algorithms: here we use the K-Means clustering
algorithm, the K-Nearest Neighbors algorithm, and the Affinity
Propagation clustering algorithm. After applying each algorithm
separately to the dataset, we predict and recommend 20 movies that
the user has not yet watched, and finally we conclude by comparing
the results of each algorithm.
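The pre-processing step described above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual pipeline: the table contents and the helper name preprocess are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the movie and rating tables (values are illustrative only).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ratings = pd.DataFrame({
    "userId":  [10, 10, 10, 11, 12],
    "movieId": [1, 1, 2, 3, 99],          # movieId 99 has no matching movie
    "rating":  [4.0, 4.0, None, 3.5, 5.0],
})

def preprocess(movies, ratings):
    """Hypothetical helper: drop duplicate rows and missing ratings, then join."""
    cleaned = ratings.drop_duplicates().dropna(subset=["rating"])
    # The inner join discards ratings whose movieId is not in the movie table.
    return pd.merge(movies, cleaned, how="inner", on="movieId")

dataset = preprocess(movies, ratings)
print(dataset)
```

Only the rows that survive all three checks (no duplicate, no missing rating, a matching movie) reach the algorithms.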
6. ALGORITHMS
6.1 K MEANS CLUSTERING ALGORITHM
For the simple task of finding the nearest neighbors between two sets
of data, the unsupervised algorithms within sklearn.neighbors can be
used. Because the query set matches the training set, the nearest
neighbor of each point is the point itself, at a distance of zero.
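A small sketch of this behaviour, assuming scikit-learn is installed; the six sample points are made up for the illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Six illustrative 2-D points; the same array is used for fitting and querying.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)
distances, indices = nbrs.kneighbors(X)

# Column 0 holds each point's closest neighbour: the point itself, at distance 0.
print(indices[:, 0])
print(distances[:, 0])
```

The second column of indices and distances then gives the nearest neighbour other than the point itself.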
BRUTE FORCE
|x+y|≤|x|+|y|
With this setup, a single distance calculation between a test point and
the centroid is sufficient to determine a lower and upper bound on the
distance to all points within the node. Because of the spherical
geometry of the ball tree nodes, it can out-perform a KD-tree in high
dimensions, though the actual performance is highly dependent on the
structure of the training data. In scikit-learn, ball-tree-based neighbors
searches are specified using the keyword algorithm = 'ball_tree', and
are computed using the class BallTree. Alternatively, the user can
work with the BallTree class directly.
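The bound given by the triangle inequality can be checked numerically. The sketch below treats a whole random point set as a single ball (centroid plus radius) and then queries a BallTree directly; the data, the query point, and the variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))        # illustrative training points
query = np.array([2.0, 0.0, -1.0])       # illustrative test point

# Describe the node by its centroid and the radius that encloses every point.
centroid = points.mean(axis=0)
radius = np.linalg.norm(points - centroid, axis=1).max()

# One distance computation (query to centroid) bounds the distance to all points.
d_qc = np.linalg.norm(query - centroid)
lower = max(0.0, d_qc - radius)
upper = d_qc + radius

dists = np.linalg.norm(points - query, axis=1)
print(lower <= dists.min(), dists.max() <= upper)

# Working with the BallTree class directly: 3 nearest neighbours of the query.
tree = BallTree(points, leaf_size=10)
dist, ind = tree.query(query.reshape(1, -1), k=3)
print(ind[0], dist[0])
```

If the lower bound of a node already exceeds the best distance found so far, the search can skip every point inside that node.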
7. SYSTEM REQUIREMENTS
HARDWARE REQUIREMENTS
System : i7 Processor
RAM : 16 GB
SOFTWARE REQUIREMENTS
Operating system : Windows 10
Tool : Anaconda Navigator (64-bit)
Scripting Tool : Jupyter Notebook
Language : Python 3.8
8. RESULT
The datasets used are Rate.csv and Movie.csv, which are taken from
MovieLens. Together they contain 100004 ratings given by 671 users
for 9077 movies, and each user has rated at least 20 movies. The
image below shows the movies recommended to the users. After
obtaining the results, we compare them to check which system is more
efficient for recommendation.
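To give a feel for how sparse this data is, the figures quoted above can be turned into a density estimate. This is simple arithmetic on the stated counts, not a measurement taken from the files.

```python
# Counts as quoted in the text above.
n_ratings, n_users, n_movies = 100004, 671, 9077

possible = n_users * n_movies            # every user rating every movie
density = n_ratings / possible           # fraction of the matrix that is filled
avg_per_user = n_ratings / n_users       # mean number of ratings per user

print(f"possible ratings: {possible}")
print(f"density: {density:.2%}")
print(f"average ratings per user: {avg_per_user:.1f}")
```

Fewer than 2 in every 100 user-movie pairs carry a rating, which is why the algorithms work on co-rated or clustered subsets rather than on the full matrix.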
9. CODING
import pandas as pd

# Load the MovieLens movie and rating tables and join them on movieId.
movies = pd.read_csv('Movie.csv')
ratings = pd.read_csv('Rate.csv')
dataset = pd.merge(movies, ratings, how='inner', on='movieId')
dataset.head()
print('The dataset contains: ', len(ratings), ' ratings of ', len(movies), ' movies.')
dataset.shape
dataset.nunique()

# Count the distinct users and movies that appear in the ratings.
unique_user = ratings.userId.nunique(dropna=True)
unique_movie = ratings.movieId.nunique(dropna=True)
print("number of unique users:")
print(unique_user)
print("number of unique movies:")
print(unique_movie)

# Remove duplicate rows and check for missing values.
dataset = dataset.drop_duplicates()
print(dataset)
dataset.describe()
dataset.isnull()
dataset.isnull().sum()

# Split each 'genres' string on '|' and collect the leading genre of every movie.
a = pd.DataFrame([genres.split('|') for genres in dataset.genres])
b = a[0].unique()

# Add one 0/1 indicator column per genre.
for i in b:
    dataset[i] = 0
dataset.head(2000)
for i in b:
    # regex=False so that names such as '(no genres listed)' are matched literally.
    dataset.loc[dataset['genres'].str.contains(i, regex=False), i] = 1
dataset.head(2000)
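As an alternative to the explicit loops, pandas can build the genre indicator columns in one call. The sketch below is a possible variant on toy data, not the code used in the project; the three sample rows are invented.

```python
import pandas as pd

# Illustrative movie table using the same '|'-separated genre format.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Comedy", "Comedy", "(no genres listed)"],
})

# One 0/1 column per distinct genre, covering every genre of every movie.
genre_flags = movies["genres"].str.get_dummies(sep="|")
encoded = pd.concat([movies, genre_flags], axis=1)
print(encoded)
```

Unlike taking only the first genre of each movie, this covers genres that never appear in the leading position.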
print(genre_ratings)
genre_ratings.columns = column_names
return genre_ratings
return most_rated_movies_users_selection
n_movies = 30
n_users = 18
most_rated_movies_users_selection = sort_by_rating_density(user_movie_ratings, n_movies, n_users)
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

def draw_movies_heatmap(most_rated_movies_users_selection, axis_labels=True):
    fig = plt.figure(figsize=(15, 4))
    ax = plt.gca()
    # Draw heatmap
    heatmap = ax.imshow(most_rated_movies_users_selection, interpolation='nearest', vmin=0, vmax=5, aspect='auto')
    if axis_labels:
        ax.set_yticks(np.arange(most_rated_movies_users_selection.shape[0]), minor=False)
        ax.set_xticks(np.arange(most_rated_movies_users_selection.shape[1]), minor=False)
        ax.invert_yaxis()
        ax.xaxis.tick_top()
        labels = most_rated_movies_users_selection.columns.str[:40]
        ax.set_xticklabels(labels, minor=False)
        ax.set_yticklabels(most_rated_movies_users_selection.index, minor=False)
        plt.setp(ax.get_xticklabels(), rotation=90)
    else:
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
    ax.grid(False)
    ax.set_ylabel('User id')
    # Place the color bar in its own axes to the right of the heatmap
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.05)
    # Color bar
    cbar = fig.colorbar(heatmap, ticks=[5, 4, 3, 2, 1, 0], cax=cax)
    cbar.ax.set_yticklabels(['5 stars', '4 stars', '3 stars', '2 stars', '1 stars', '0 stars'])
    plt.show()

draw_movies_heatmap(most_rated_movies_users_selection)
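from scipy.sparse import csr_matrix
# pd.SparseDataFrame was removed in pandas 1.0, so build the sparse matrix from the dense frame (unrated = 0).
sparse_ratings = csr_matrix(most_rated_movies_1k.fillna(0).values)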
# Order movies by mean rating and users by number of ratings (reindex_axis was removed in pandas 1.0).
d = d.reindex(columns=d.mean().sort_values(ascending=False).index)
d = d.reindex(d.count(axis=1).sort_values(ascending=False).index)
d = d.iloc[:max_users, :max_movies]
n_users_in_plot = d.shape[0]
# We're only selecting to show clusters that have more than 9 users, otherwise, they're less interesting
if len(d) > 9:
    print('cluster # {}'.format(cluster_id))
    print('# of users in cluster: {}.'.format(n_users_in_cluster), '# of users in plot: {}'.format(n_users_in_plot))
    fig = plt.figure(figsize=(15, 4))
    ax = plt.gca()
    ax.invert_yaxis()
    ax.xaxis.tick_top()
    labels = d.columns.str[:40]
    ax.set_yticks(np.arange(d.shape[0]), minor=False)
    ax.set_xticks(np.arange(d.shape[1]), minor=False)
    ax.set_xticklabels(labels, minor=False)
    ax.get_yaxis().set_visible(False)
    # Heatmap
    heatmap = plt.imshow(d, vmin=0, vmax=5, aspect='auto')
    ax.set_xlabel('movies')
    ax.set_ylabel('User id')
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.05)
    # Color bar
    cbar = fig.colorbar(heatmap, ticks=[5, 4, 3, 2, 1, 0], cax=cax)
    cbar.ax.set_yticklabels(['5 stars', '4 stars', '3 stars', '2 stars', '1 stars', '0 stars'])
    plt.show()
import helper
import importlib
importlib.reload(helper)
user_id = 2
user_2_ratings = cluster.loc[user_id, :]
user_2_unrated_movies = user_2_ratings[user_2_ratings.isnull()]
avg_ratings = pd.concat([user_2_unrated_movies, cluster.mean()], axis=1, join='inner').loc[:,0]
avg_ratings.sort_values(ascending=False)[:20]
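The recommendation step above (average the cluster's ratings for the movies a user has not rated, then take the top entries) can be tried on a toy cluster. The ratings and the function name recommend_from_cluster are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Illustrative cluster: rows are users, columns are movies, NaN = not rated.
cluster = pd.DataFrame(
    {"Movie A": [5.0, 4.0, np.nan],
     "Movie B": [np.nan, 2.0, 3.0],
     "Movie C": [np.nan, 5.0, 4.0],
     "Movie D": [1.0, np.nan, 2.0]},
    index=pd.Index([2, 7, 9], name="userId"),
)

def recommend_from_cluster(cluster, user_id, top_n=20):
    """Hypothetical helper: rank a user's unrated movies by the cluster's mean rating."""
    user_ratings = cluster.loc[user_id]
    unrated = user_ratings[user_ratings.isnull()].index
    # DataFrame.mean skips NaN, so each average uses only users who rated the movie.
    return cluster[unrated].mean().sort_values(ascending=False).head(top_n)

recs = recommend_from_cluster(cluster, user_id=2, top_n=2)
print(recs)
```

For user 2 the unrated movies are Movie B and Movie C, and Movie C ranks first because its cluster average (4.5) is higher than Movie B's (2.5).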
import math
import pickle
import sys
from time import sleep

import numpy as np
from sklearn.cluster import AffinityPropagation

train_data = []
test_data = []
users = {}
movies = {}

def get_data():
    # The columns are assumed to be userId, movieId, rating, with one header row.
    train_data = np.genfromtxt("dataset.csv", delimiter=',', skip_header=1)
    user_id = sorted(set(train_data[:, 0]))
    movie_id = sorted(set(train_data[:, 1]))
    # Map each raw id to a row/column index of the user-item matrix.
    users = {j: i for i, j in enumerate(user_id)}
    movies = {j: i for i, j in enumerate(movie_id)}
    # Start from zeros (0 = not rated); np.empty would leave arbitrary values.
    user_item = np.zeros((len(user_id), len(movie_id)))
    for row in train_data:
        user_item[users[row[0]], movies[row[1]]] = row[2]
    return train_data, users, movies, user_item
def find_similarity():
    movie_ids = list(movies.keys())
    n = len(movie_ids)
    movie_sim = np.zeros([n, n])
    for st, m1 in enumerate(movie_ids):
        if st % 1000 == 0:
            print('in movie', st)
        # Mean rating of m1 and the users who rated it (0 means "not rated").
        r1 = np.average(user_item[:, movies[m1]])
        u_m1 = np.where(user_item[:, movies[m1]] != 0)
        for j in range(st, n):
            m2 = movie_ids[j]
            r2 = np.average(user_item[:, movies[m2]])
            u_m2 = np.where(user_item[:, movies[m2]] != 0)
            # Users who rated both movies.
            u = list(set(u_m1[0]).intersection(set(u_m2[0])))
            if len(u) != 0:
                co_ratings = user_item[np.ix_(u, [int(movies[m1]), int(movies[m2])])]
                num = sum((co_ratings[:, 0] - r1) * (co_ratings[:, 1] - r2))
                den = (sum((co_ratings[:, 0] - r1) ** 2) ** 0.5) * (sum((co_ratings[:, 1] - r2) ** 2) ** 0.5)
                # Guard against a zero denominator, which would otherwise produce NaN.
                corr = num / den if den != 0 else 0.0
                movie_sim[st][j] = corr
                if j != st:
                    movie_sim[j][st] = corr
    return movie_sim
def compute_reco(act_user, act_mov):
    user = users[act_user]
    movie = movies[act_mov]
    # Movies in the same cluster as the query movie, excluding the query itself.
    clus = clus_labels[movie]
    clus_movie = np.where(clus_labels == clus)[0]
    clus_movie = clus_movie[clus_movie != movie]
    # Movies in that cluster that the user has already rated.
    user_rated_movies = np.where(user_item[user] != 0)
    rated_movies = list(set(clus_movie).intersection(set(user_rated_movies[0])))
    ratings = user_item[user, rated_movies]
    # Each row: [1 if not yet watched else 0, movie index, rating]
    reco_ratings = np.zeros([len(clus_movie), 3])
    for j, m in enumerate(clus_movie):
        if m in user_rated_movies[0]:
            pred_rating = user_item[user, m]
            reco_ratings[j] = [0, m, pred_rating]
        else:
            # Similarity-weighted average of the user's ratings within the cluster.
            sim = sim_mat[m, rated_movies]
            pred_rating = np.dot(ratings, sim) * 1.0 / sum(sim)
            if not np.isfinite(pred_rating):
                # No usable similarity weights: fall back to 0 before rounding.
                pred_rating = 0
            else:
                pred_rating = round(pred_rating * 2) / 2  # nearest half star
            reco_ratings[j] = [1, m, pred_rating]
    not_watched_ind = np.where(reco_ratings[:, 0] == 1)
    if len(not_watched_ind[0]) > 10:
        not_watched = reco_ratings[not_watched_ind[0]]
        reco_movies = not_watched[np.argsort(not_watched[:, 2])][::-1]
        reco_movies = reco_movies[0:10]
    elif len(reco_ratings) > 10:
        reco_movies = reco_ratings[np.argsort(reco_ratings[:, 2])][::-1]
        reco_movies = reco_movies[0:10]
    else:
        reco_movies = reco_ratings[np.argsort(reco_ratings[:, 2])][::-1]
    # Map matrix indices back to the original movie ids (dict.iteritems is Python 2 only).
    index_to_movie = {value: key for key, value in movies.items()}
    final_list = [index_to_movie[int(i[1])] for i in reco_movies]
    return final_list
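The prediction inside compute_reco is a similarity-weighted average rounded to the nearest half star. A stand-alone sketch of that single step follows; the function name predict_rating and the sample numbers are assumptions made for the illustration.

```python
import numpy as np

def predict_rating(ratings, similarities):
    """Hypothetical helper: weighted average of ratings, rounded to the nearest 0.5."""
    ratings = np.asarray(ratings, dtype=float)
    similarities = np.asarray(similarities, dtype=float)
    total = similarities.sum()
    if total == 0:
        # No usable weights: fall back to 0, as the listing does for NaN.
        return 0.0
    pred = float(np.dot(ratings, similarities) / total)
    return round(pred * 2) / 2

# The user rated two cluster movies 5.0 and 3.0; the candidate movie is far
# more similar to the first (0.8) than to the second (0.2).
print(predict_rating([5.0, 3.0], [0.8, 0.2]))
```

The weighted average here is 4.6, which rounds to 4.5 on the half-star scale.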
prompt = ("Do you want to compute similarity matrix and cluster?\n"
          " Enter Y - To compute the components \n"
          " Enter N - To use precomputed components\n"
          " Enter ex to stop execution \nInput - ")
while True:
    inp_choice = input(prompt)
    if inp_choice.lower() == 'y':
        train_data, users, movies, user_item = get_data()
        print("Data read complete\n Computing similarity...")
        sim_mat = find_similarity()
        print("Similarity matrix computed\n Movies being clustered...")
        np.savetxt("sim_mat_Pearson.csv", sim_mat, delimiter=',')
        af = AffinityPropagation(verbose=True, affinity="precomputed").fit(sim_mat)
        clus_labels = af.labels_
        print("Movies clustered")
        break
    elif inp_choice.lower() == 'n':
        print("Loading precomputed components...")
        data2 = []
        with open("Pickle_file", "rb") as f:
            # The first pickled object is the number of objects that follow.
            for _ in range(pickle.load(f)):
                data2.append(pickle.load(f))
        sim_mat = np.genfromtxt("sim_mat_Pearson.csv", delimiter=",")
        train_data = data2[0]
        # test_data = data2[1]
        users = data2[1]
        movies = data2[2]
        user_item = data2[3]
        clus_labels = data2[4]
        print("\nPrecomputed components loaded")
        break
    elif inp_choice.lower() == "ex":
        sys.exit("Program stopped as requested")
    else:
        print("Invalid input")
        continue

# dict.keys() cannot be sliced in Python 3, so convert to a list first.
print("\n1st 100 User ids =", list(users.keys())[0:100])
print("\n")
print("1st 100 Movie ids =", list(movies.keys())[0:100])
while True:
    inp = input("\nEnter user id and movie id separated by comma- ")
    if inp == "":
        break
    act_user, act_mov = inp.split(',')
    act_user = int(act_user)
    act_mov = int(act_mov)
    final_list = compute_reco(act_user, act_mov)
    print("Top %d movies for user %d similar to movie %d \n" % (len(final_list), act_user, act_mov))
    print(final_list)
    inp1 = input("Do you want to continue? Y/N - ")
    if inp1.lower() != 'y':
        break
computeRecommedationMatrix()
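The clustering call used in the listing, AffinityPropagation with affinity="precomputed", expects a square similarity matrix. The sketch below runs it on a small, invented 4x4 matrix; the values are assumptions, and the resulting labels depend on the data and on the scikit-learn version, so no particular output is claimed.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical item-item similarity matrix: items 0-1 and items 2-3 resemble each other.
sim = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])

# With affinity="precomputed", fit() takes similarities directly instead of feature vectors.
af = AffinityPropagation(affinity="precomputed").fit(sim)
labels = af.labels_   # one cluster label per item
print(labels)
```

Unlike K-Means, the number of clusters is not passed in; it follows from the similarities and the preference value, which defaults to the median similarity.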
10. SNAPSHOTS
11. CONCLUSION
[Chart: execution time, efficiency, and accuracy, plotted on a 0% to 100% scale]
Neural networks and deep learning have been very popular in many
different fields over the last couple of years, and they also appear
to be helpful for solving recommendation system problems.
Another point that Ben Allison brought up is the need to see what
would happen if a customer was shown a sub-optimal
recommendation. This is a reinforcement learning approach, since
the goal in this case is to show customers a recommendation and
then record what the customer does. At times, customers can be
recommended something that does not seem like the best option,
just to see how they react, which will improve the learning in the
long term.
Most businesses will have some use for recommender systems, and we
encourage everyone to learn more about this fascinating area.
13. REFERENCE