KNN Recommendation
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/dataset-1/u.item
/kaggle/input/dataset-1/u.data
/kaggle/input/dataset-1/u.user
Introduction
I re-created an experiment that explores collaborative filtering techniques, specifically using the
KNNBasic algorithm for recommendation. I loaded a dataset from a plain text file and then
trained the model on it. To assess its performance, I employed cross-validation, evaluating
metrics such as RMSE and MAE.
Throughout the process, I followed several steps: loading the data, training the model, making predictions, and generating recommendations. My algorithm is based on user-based collaborative filtering with cosine similarity. Additionally, I visualized the model errors and retrieved the nearest neighbors of an item.
1. Loading data
In [5]: # Path to dataset file
file_path = os.path.expanduser('/kaggle/input/dataset-1/u.data')
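The notebook later calls cross_validate(algo, data, ...), so the dataset must be loaded here; a minimal sketch of that step, assuming Surprise's standard Reader for the tab-separated MovieLens u.data file (these imports also cover KNNBasic and cross_validate, which are used below):

from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import cross_validate

# u.data rows are tab-separated: user id, item id, rating, timestamp
reader = Reader(line_format = 'user item rating timestamp', sep = '\t')
data = Dataset.load_from_file(file_path, reader = reader)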
2. Train the model
KNNBasic estimates a user's rating of an item as the similarity-weighted average of the ratings that the user's nearest neighbors gave to that item:

$$\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)}$$

where $N_i^k(u)$ denotes the (at most) k nearest neighbors of user u that have rated item i.
In [10]: # Use k-NN algorithm with user-based collaborative filtering and cosine similarity
kk = 50
sim_options = {'name': 'cosine', 'user_based': True}
algo = KNNBasic(k = kk, sim_options = sim_options, verbose = True)
We're setting up a recommendation system using a k-Nearest Neighbors (k-NN) algorithm for
collaborative filtering. Collaborative filtering is a method for making predictions about what a
user might like based on preferences from similar users.
Firstly, we're using the k-NN algorithm. It's a straightforward but powerful approach where we
find the 'k' nearest neighbors to a target user based on their past ratings. Then, we use these
neighbors' ratings to predict what the target user might like. It's a method that's often used in
recommendation systems because of its simplicity and effectiveness.
Next, we're opting for user-based collaborative filtering. This means we're recommending items
to a user based on the preferences of users who are similar to them. If two users have similar
tastes, they're likely to enjoy similar items. It's intuitive and often yields good results.
For measuring similarity between users, we're using cosine similarity. This metric calculates the
cosine of the angle between two vectors, providing a measure of similarity that's unaffected by
the magnitude of the vectors. It's suitable for recommendation systems because it focuses on
the direction of preferences rather than their magnitude.
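As a toy illustration of that idea (not from the notebook), cosine similarity between two users' rating vectors can be computed directly with NumPy:

import numpy as np

# Ratings two hypothetical users gave the same five movies
u = np.array([5.0, 3.0, 0.0, 4.0, 1.0])
v = np.array([4.0, 2.0, 0.0, 5.0, 2.0])

# cos(u, v) = (u . v) / (||u|| * ||v||)
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos_sim, 4))  # ~0.96: the two taste profiles point the same way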
The specific parameters we've chosen, like kk = 50 and sim_options, allow us to fine-tune the
algorithm. For example, kk = 50 specifies that we'll consider the 50 nearest neighbors, and
sim_options configures the similarity measure to cosine similarity. These parameters help
balance prediction accuracy and computational efficiency.
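For contrast (an illustrative aside, not part of this experiment), the same sim_options dict can switch the metric or flip to item-based filtering:

# Hypothetical alternatives: Pearson correlation, or item-item cosine
sim_pearson = {'name': 'pearson', 'user_based': True}
sim_item = {'name': 'cosine', 'user_based': False}
algo_alt = KNNBasic(k = kk, sim_options = sim_item)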
In [11]: # Run 5-fold cross-validation and print results
cv = cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv = 5, verbose = True)
It's nice to see that the algorithm's performance doesn't vary much across different parts of the
dataset. The standard deviations for RMSE and MAE are pretty small, which tells us that it's
giving consistent results across the board.
The errors (RMSE and MAE) aren't too high, which means the model is making decent guesses about what users might like. The RMSE values average around 1.0169, and the MAE values hover around 0.8042.
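Those averages can be read straight off the dict that cross_validate returns; a quick check (not part of the original output):

import numpy as np

# Per-fold scores live under the 'test_rmse' and 'test_mae' keys
print('Mean RMSE:', np.mean(cv['test_rmse']))
print('Mean MAE :', np.mean(cv['test_mae']))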
In [12]: import matplotlib.pyplot as plt

# Plot the per-fold RMSE and MAE scores returned by cross_validate
plt.plot(cv['test_rmse'], label = 'RMSE')
plt.plot(cv['test_mae'], label = 'MAE')

# Chart setup
plt.title("Model Errors", fontsize = 12)
plt.xlabel("CV", fontsize = 10)
plt.ylabel("Error", fontsize = 10)
plt.legend()
plt.show()
3. Make some predictions
In [13]: # Without real rating
p1 = algo.predict(uid = '13', iid = '181', verbose = True)
user: 13         item: 181        r_ui = None   est = 4.04   {'actual_k': 50, 'was_impossible': False}
user: 196        item: 302        r_ui = 4.00   est = 4.02   {'actual_k': 50, 'was_impossible': False}
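The second line above was produced by a cell that isn't shown here; a plausible reconstruction, with uid, iid, and r_ui read directly off that output:

# With real rating (values taken from the printed output above)
p2 = algo.predict(uid = '196', iid = '302', r_ui = 4, verbose = True)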
In [16]: # Return two mappings to convert raw ids into movie names and movie names into raw ids
def read_item_names(file_path):
    rid_to_name = {}
    name_to_rid = {}
    with open(file_path, encoding = 'ISO-8859-1') as f:
        for line in f:
            rid, name = line.split('|')[:2]
            rid_to_name[rid] = name
            name_to_rid[name] = rid
    return rid_to_name, name_to_rid
# we are using the above function as it allows for easy conversion between raw movie IDs and movie names
In [17]: # Read the mappings raw id <-> movie name
item_filepath = '/kaggle/input/dataset-1/u.item'
rid_to_name, name_to_rid = read_item_names(item_filepath)
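The cell that produced the inner-id list below isn't shown; a sketch of how a neighbor lookup is typically done in Surprise, using a hypothetical movie title. Note that since our sim_options set user_based = True, the similarity matrix is user-user, so genuine item neighbors would normally require user_based = False:

# Hypothetical lookup: movie name -> raw id -> inner id -> 10 nearest neighbors
raw_id = name_to_rid['Toy Story (1995)']        # illustrative title
inner_id = algo.trainset.to_inner_iid(raw_id)
neighbor_inner_ids = algo.get_neighbors(inner_id, k = 10)
print(neighbor_inner_ids)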
Out[20]: [13, 44, 54, 91, 96, 100, 102, 106, 117, 148]
from collections import defaultdict

def get_top_n(predictions, n = 10):
    # First map the list of predictions to each user
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n
The get_top_n function efficiently generates personalized recommendations for users based
on predictions made by a recommendation algorithm. It achieves this by first grouping the
predictions according to user IDs, creating a mapping that associates each user with a list of
predicted ratings for different items. This initial organization lays the groundwork for
subsequent steps in the recommendation process.
In [25]: # Then predict ratings for all pairs (u, i) that are NOT in the training set
trainset = data.build_full_trainset()
algo.fit(trainset)
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
top_n = 10
top_pred = get_top_n(predictions, n = top_n)
# User raw Id
uid_list = ['196']
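To turn these predictions into readable recommendations, the raw ids can be mapped back to titles via rid_to_name; a minimal sketch of that final step:

# Print the top-n recommended movie titles for each user in uid_list
for uid in uid_list:
    for iid, est in top_pred[uid]:
        print(rid_to_name[iid], '-', round(est, 2))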