Assignment 15 (Mini Project)

The document outlines a mini project for developing a content-based movie recommendation system using scikit-learn, focusing on Content-Based Filtering with TF-IDF and Cosine Similarity. It details the process of feature extraction from movie data, data preprocessing, and the implementation of similarity calculations to recommend similar movies. Additionally, it introduces the use of the difflib module for handling user input variations when searching for movie titles.
Assignment No :15(Mini project)

Movie Recommendation
This notebook develops a content-based movie recommendation system using scikit-learn.
1. Introduction to Recommendation Systems
A recommendation system suggests items (movies, books, music, etc.) to users based on
their past preferences or on the properties of the items themselves. There are different
types of recommendation systems:
Content-Based Filtering – Recommends items similar to what the user likes.
Collaborative Filtering – Recommends items based on user behavior and preferences of
similar users.
Hybrid Filtering – A combination of both approaches.
This project implements Content-Based Filtering using TF-IDF and Cosine Similarity.

2. Content-Based Filtering
This method suggests movies similar to a given movie by analyzing its features (e.g., genres,
keywords, tagline, cast, and director). The similarity between movies is calculated based on
their descriptions.
Steps in Content-Based Filtering:
1)Select Relevant Features – The important attributes (genres, keywords, tagline, cast,
director) are extracted.
2)Preprocess Data – Handle missing values, clean text, and combine features.
3)Convert Text to Numerical Form – Use TF-IDF Vectorization to convert textual data into
numerical vectors.
4)Compute Similarity – Use Cosine Similarity to measure the closeness between movies.
5)Recommend Similar Movies – Retrieve and display movies with the highest similarity
scores.
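
The steps listed above can be sketched end to end on a few made-up rows. This is only an illustration: the toy DataFrame, its titles, and the reduced feature list are assumptions and are not taken from movie_dataset.csv.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical rows, not from movie_dataset.csv)
toy = pd.DataFrame({
    "title": ["Space War", "Star Battle", "Love in Paris"],
    "genres": ["Action Science Fiction", "Action Science Fiction", "Romance Comedy"],
    "keywords": ["space war alien", "space battle alien", None],
    "director": ["A. Director", "B. Director", "C. Director"],
})

# Select features, replace missing values, and combine them into one string per movie
features = ["genres", "keywords", "director"]
for col in features:
    toy[col] = toy[col].fillna("")
combined = toy[features].agg(" ".join, axis=1)

# Convert the text to TF-IDF vectors
vectors = TfidfVectorizer().fit_transform(combined)

# Compute pairwise cosine similarity between all movies
sim = cosine_similarity(vectors)

# Rank the other movies against the first one
scores = pd.Series(sim[0], index=toy["title"]).drop("Space War")
best_match = scores.idxmax()
print(scores.sort_values(ascending=False))
```

In this toy set the two science fiction rows share most of their terms, so "Star Battle" ranks above "Love in Paris" for "Space War".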

3. Key Concepts Used in the Notebook


(a) TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a technique to convert text data into numerical values. It weights each word by
how often it appears in a document and by how rare it is across all documents.
Formula:

Akhila Ohmkumar(Roll.No:03)
\[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]
Where:
1)TF (Term Frequency) – How often a word appears in a document.
2)IDF (Inverse Document Frequency) – Gives importance to rare words by reducing the
weight of common words.

Example:
1)"Action movie with a great storyline."
2)"Comedy movie with a hilarious cast."
3)"Action thriller with a suspenseful plot."
The word "movie" appears in two of the three sentences, so its weight is lower, whereas
"thriller" appears in only one, so its weight is higher.
In this notebook, TF-IDF is applied to the combined features of movies.
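
To make the example concrete, the three sentences can be passed through scikit-learn's TfidfVectorizer and the learned IDF weights compared. This is a minimal sketch using the vectorizer's default settings, which are an assumption here because the notebook's own vectorizer cell is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Action movie with a great storyline.",
    "Comedy movie with a hilarious cast.",
    "Action thriller with a suspenseful plot.",
]

# Default settings lowercase the text and drop one-letter tokens such as "a"
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Map each vocabulary term to its learned IDF weight
idf = {term: vectorizer.idf_[col] for term, col in vectorizer.vocabulary_.items()}

for term in ("with", "movie", "thriller"):
    print(f"{term:10s} idf = {idf[term]:.3f}")
```

The fewer sentences a term appears in, the larger its IDF: "with" (all three sentences) gets the smallest weight, "movie" (two) a larger one, and "thriller" (one) the largest.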

Import the Libraries


import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Load the dataset


# Load dataset
movies = pd.read_csv('movie_dataset.csv')
movies.head() # Display first few rows

index budget genres \


0 0 237000000 Action Adventure Fantasy Science Fiction
1 1 300000000 Adventure Fantasy Action
2 2 245000000 Action Adventure Crime
3 3 250000000 Action Crime Drama Thriller
4 4 260000000 Action Adventure Science Fiction

homepage id \
0 http://www.avatarmovie.com/ 19995
1 http://disney.go.com/disneypictures/pirates/ 285
2 http://www.sonypictures.com/movies/spectre/ 206647
3 http://www.thedarkknightrises.com/ 49026
4 http://movies.disney.com/john-carter 49529

keywords original_language \
0 culture clash future space war space colony so... en
1 ocean drug abuse exotic island east india trad... en
2 spy based on novel secret agent sequel mi6 en
3 dc comics crime fighter terrorist secret ident... en
4 based on novel mars medallion space travel pri... en

original_title \
0 Avatar
1 Pirates of the Caribbean: At World's End
2 Spectre
3 The Dark Knight Rises
4 John Carter

overview popularity ... runtime \
0 In the 22nd century, a paraplegic Marine is di... 150.437577 ... 162.0
1 Captain Barbossa, long believed to be dead, ha... 139.082615 ... 169.0
2 A cryptic message from Bond’s past sends him o... 107.376788 ... 148.0
3 Following the death of District Attorney Harve... 112.312950 ... 165.0
4 John Carter is a war-weary, former military ca... 43.926995 ... 132.0

spoken_languages status \
0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released
1 [{"iso_639_1": "en", "name": "English"}] Released
2 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released
3 [{"iso_639_1": "en", "name": "English"}] Released
4 [{"iso_639_1": "en", "name": "English"}] Released

tagline \
0 Enter the World of Pandora.
1 At the end of the world, the adventure begins.
2 A Plan No One Escapes
3 The Legend Ends
4 Lost in our world, found in another.

title vote_average vote_count \


0 Avatar 7.2 11800
1 Pirates of the Caribbean: At World's End 6.9 4500
2 Spectre 6.3 4466
3 The Dark Knight Rises 7.6 9106
4 John Carter 6.1 2124

cast \
0 Sam Worthington Zoe Saldana Sigourney Weaver S...
1 Johnny Depp Orlando Bloom Keira Knightley Stel...
2 Daniel Craig Christoph Waltz L\u00e9a Seydoux ...
3 Christian Bale Michael Caine Gary Oldman Anne ...
4 Taylor Kitsch Lynn Collins Samantha Morton Wil...

crew director
0 [{'name': 'Stephen E. Rivkin', 'gender': 0, 'd... James Cameron
1 [{'name': 'Dariusz Wolski', 'gender': 2, 'depa... Gore Verbinski
2 [{'name': 'Thomas Newman', 'gender': 2, 'depar... Sam Mendes
3 [{'name': 'Hans Zimmer', 'gender': 2, 'departm... Christopher Nolan
4 [{'name': 'Andrew Stanton', 'gender': 2, 'depa... Andrew Stanton

[5 rows x 24 columns]

Feature Engineering
In this movie recommendation system, features are the attributes used to compare and
recommend similar movies. These features are extracted from the dataset and converted
into numerical vectors for similarity calculations.

1. What is Feature Engineering?

Feature engineering is the step of choosing and preparing the characteristics of each
movie that describe its content. These features are what the system compares when
recommending similar movies.
In this notebook, the selected features are:
Genres – The type/category of the movie (e.g., Action, Comedy, Drama).
Keywords – Important words that describe the movie’s themes (e.g., Space, War, Love).
Tagline – A short promotional phrase for the movie.
Cast – The main actors and actresses in the movie.
Director – The name of the movie’s director.
#selecting the relevant features for recommendation
selected_features = ['genres','keywords','tagline','cast','director']
print(selected_features)

['genres', 'keywords', 'tagline', 'cast', 'director']

# replacing the null values with an empty string
for feature in selected_features:
    movies[feature] = movies[feature].fillna('')

movies.isnull().sum()

index 0
budget 0
genres 0
homepage 3091
id 0
keywords 0
original_language 0

original_title 0
overview 3
popularity 0
production_companies 0
production_countries 0
release_date 1
revenue 0
runtime 2
spoken_languages 0
status 0
tagline 0
title 0
vote_average 0
vote_count 0
cast 0
crew 0
director 0
dtype: int64

# combining all the 5 selected features
combined_features = (movies['genres'] + ' ' + movies['keywords'] + ' '
                     + movies['tagline'] + ' ' + movies['cast'] + ' '
                     + movies['director'])
print(combined_features)

0 Action Adventure Fantasy Science Fiction cultu...


1 Adventure Fantasy Action ocean drug abuse exot...
2 Action Adventure Crime spy based on novel secr...
3 Action Crime Drama Thriller dc comics crime fi...
4 Action Adventure Science Fiction based on nove...
...
4798 Action Crime Thriller united states\u2013mexic...
4799 Comedy Romance A newlywed couple's honeymoon ...
4800 Comedy Drama Romance TV Movie date love at fir...
4801 A New Yorker in Shanghai Daniel Henney Eliza...
4802 Documentary obsession camcorder crush dream gi...
Length: 4803, dtype: object

Convert Text to Feature Vectors (TF-IDF)
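
The cell for this step does not appear in this copy, although the later similarity code relies on a variable named feature_vectors. A minimal sketch of what such a cell might look like, assuming a TfidfVectorizer with default settings and using a short stand-in list in place of the notebook's combined_features Series:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the notebook's combined_features Series (hypothetical rows)
combined_features = [
    "Action Adventure space war James Cameron",
    "Adventure Fantasy ocean pirate Gore Verbinski",
    "Action Crime spy secret agent Sam Mendes",
]

# Assumption: default vectorizer settings; the original cell may have differed
vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per movie and one column per vocabulary term
feature_vectors = vectorizer.fit_transform(combined_features)

print(feature_vectors.shape)
```

On the full dataset the matrix would have 4803 rows, one per movie, which matches the (4803, 4803) shape of the similarity matrix computed from it.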

Cosine Similarity
1.What It Is: Cosine similarity measures how similar two vectors are in a multi-dimensional
space. It is widely used in text analysis, recommendation systems, and machine learning.
2.How It Works: The similarity is computed as the cosine of the angle between the vectors.
In general it ranges from -1 to 1. Because TF-IDF vectors have no negative components, the
values here range from 0 (no shared terms) to 1 (same direction, as when a movie is
compared with itself).

3.Formula:
\[ \text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|} \]
Where \( A \) and \( B \) are vectors, \( \cdot \) is the dot product, and \( \|A\| \) is the magnitude of
vector \( A \).
4.Applications: In this movie recommendation system, cosine similarity is calculated
between the TF-IDF feature vectors of the movies, built from their combined genres,
keywords, tagline, cast, and director, to find the closest matches.
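
As a small check of the formula, the value can be computed by hand with NumPy and compared with scikit-learn's cosine_similarity. The two vectors below are arbitrary illustrative values, not data from the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two arbitrary example vectors
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 0.0])

# Dot product divided by the product of the magnitudes
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity expects 2-D input: one row per vector
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, library)
```

Here the dot product is 4 and each magnitude is the square root of 5, so both computations give 4/5 = 0.8 up to floating-point rounding.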
#getting the similarity scores using cosine similarity

similarity = cosine_similarity(feature_vectors)

print(similarity)

[[1. 0.07219487 0.037733 ... 0. 0. 0. ]


[0.07219487 1. 0.03281499 ... 0.03575545 0. 0. ]
[0.037733 0.03281499 1. ... 0. 0.05389661 0. ]
...
[0. 0.03575545 0. ... 1. 0. 0.02651502]
[0. 0. 0.05389661 ... 0. 1. 0. ]
[0. 0. 0. ... 0.02651502 0. 1. ]]

similarity.shape

(4803, 4803)

def get_recommendations(title, cosine_sim, df, indices, top_n=5):
    """
    Given a movie title, return the top_n most similar movies based on cosine
    similarity.

    `indices` maps each title to its row position in `df`; the notebook does
    not show how it is built.
    """
    # Check if the title exists in indices
    if title not in indices:
        return f"Movie titled '{title}' not found in the dataset."

    # Get the index of the movie
    idx = indices[title]

    # Pair every movie's position with its similarity to the chosen movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top_n most similar movies, excluding the movie itself
    sim_scores = sim_scores[1:top_n + 1]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the movie details (column names as they appear in the dataset)
    return df.iloc[movie_indices][['title', 'genres', 'keywords', 'tagline']]
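
The function takes an indices argument that is never constructed in the notebook. One plausible form, shown here purely as an assumption, is a pandas Series that maps each title to its row position. The sketch below builds such a lookup on made-up rows and uses it in a compact helper; the names top_similar and text are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical rows)
df = pd.DataFrame({
    "title": ["Space War", "Star Battle", "Love in Paris", "Paris Nights"],
    "text": ["space war alien action", "space battle alien action",
             "romance paris comedy", "romance paris drama"],
})
cosine_sim = cosine_similarity(TfidfVectorizer().fit_transform(df["text"]))

# Assumed shape of `indices`: title -> row position,
# keeping only the first row for any duplicated title
indices = pd.Series(df.index, index=df["title"])
indices = indices[~indices.index.duplicated(keep="first")]


def top_similar(title, top_n=2):
    """Hypothetical helper: titles of the top_n most similar rows."""
    if title not in indices:  # `in` on a Series checks its index labels
        return []
    idx = indices[title]
    order = cosine_sim[idx].argsort()[::-1]  # positions, highest score first
    order = [i for i in order if i != idx][:top_n]  # drop the movie itself
    return df["title"].iloc[order].tolist()


print(top_similar("Love in Paris", top_n=1))
```

Filtering out the query row by position, rather than skipping the first sorted entry, avoids dropping the wrong movie when another row ties with a similarity of 1.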

Recommend Movies
To recommend movies based on user input:

difflib
difflib is a Python standard-library module for comparing sequences, finding similarities
between strings, and performing approximate matching. In this movie recommendation system,
difflib is useful for handling user input variations when searching for a movie.

Why Use difflib in a Recommendation System?


When a user types a movie name, they might:
Misspell the title (e.g., "Interstelar" instead of "Interstellar").
Use different cases (e.g., "inception" vs. "Inception").
Forget exact spacing or punctuation (e.g., "Avengers Endgame" vs. "Avengers: Endgame").
difflib helps by finding the closest matching movie title in the dataset, even if the user's
input is slightly incorrect.
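
The three kinds of input variation can be tried on a short list of titles. This is a minimal sketch: the title list and the helper name closest_title are illustrative assumptions, while get_close_matches and its n and cutoff parameters are part of the standard library.

```python
import difflib

# Illustrative title list (not the full dataset)
titles = ["Interstellar", "Inception", "Avengers: Endgame", "Avatar"]


def closest_title(query, candidates):
    """Hypothetical helper: best fuzzy match for query, or None."""
    # n=1 keeps only the best candidate; cutoff=0.6 is the library default
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None


for query in ["Interstelar", "inception", "Avengers Endgame", "zzzz"]:
    print(query, "->", closest_title(query, titles))
```

Matching is case-sensitive, so a query that differs in case on every letter may fall below the cutoff. Lowercasing both the query and the candidate titles before matching is one way to handle that.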
import difflib  # Import the difflib module

# Example code
movie_name = input('Enter your favourite movie name: ')

list_of_all_titles = movies['title'].tolist()

# Use difflib to find the closest match
find_close_match = difflib.get_close_matches(movie_name, list_of_all_titles)

if find_close_match:
    close_match = find_close_match[0]
    print(f"Closest match found: {close_match}")

    # Get the index of the matched movie
    index_of_the_movie = movies[movies.title == close_match]['index'].values[0]
    print(f"Index of the matched movie: {index_of_the_movie}")
else:
    print(f"No close match found for '{movie_name}'.")

Enter your favourite movie name: Pirates of the Caribbean: At World's End

Closest match found: Pirates of the Caribbean: At World's End


Index of the matched movie: 1
