Aryan Gupta Project Report
Project Report
on
by
Aryan Gupta
Enrolment No. A620145024009
Professor
June 2025
Page 4 of 24 - Integrity Submission Submission ID trn:oid:::16158:101130041
Amity Institute of Information and Technology
Amity University Madhya Pradesh, Gwalior
DECLARATION
Aryan Gupta
Date: (Enrolment No. – A620145024009)
CERTIFICATE
This is to certify that the minor project entitled “Movie Recommendation System” by Aryan
Gupta (Enrolment No. A620145024009) is a bona fide record of the project carried out by him
under my supervision and guidance in partial fulfilment of the requirements for the award of
the Degree of Master of Computer Applications in the Department of Amity Institute of
Information and Technology, Amity University Madhya Pradesh, Gwalior. Neither this
project nor any part of it has been submitted for any degree or academic award elsewhere.
Date:
ACKNOWLEDGEMENT
I am very thankful to the Hon'ble Lt Gen. V. K. Sharma AVSM (Retd.), Pro Chancellor,
Amity University Madhya Pradesh, for allowing me to carry out my project. I take pride in
acknowledging respected Prof. (Dr.) R. S. Tomar, Vice Chancellor, Amity University Madhya
Pradesh, for his valuable support. I would also like to thank Prof. (Dr.) M. P. Kaushik, Pro-Vice
Chancellor (Research), Amity University Madhya Pradesh, for his support. I extend my sincere
thanks to Prof. (Dr.) Vikas Thada, HOI, Amity School of Engineering and Technology, Amity
University Madhya Pradesh, for his guidance and support in the selection of appropriate labs
for my project. I am also very grateful to my supervisor, Dr. Devendra Kumar Mishra, Associate
Professor, Amity School of Engineering and Technology, Amity University Madhya Pradesh,
for the constant guidance and encouragement provided in this endeavour. I am
also thankful to the whole staff of ASET, AUMP for their teaching in their
respective fields. Finally, I thank everyone who contributed to this work in every possible manner.
My heartfelt thanks go to my family and friends for their kind help and suggestions.
ABSTRACT
The system is implemented using Python and runs in a Google Colab environment, utilizing
powerful libraries such as Pandas, NumPy, and Scikit-learn. A recommendation function is
developed to return the top N most similar movies for any user-given input title. This provides
an efficient and scalable approach to movie recommendations without the need for explicit
user data or ratings.
TMDB Dataset
Table of Contents
Declaration by student
Certificate by supervisor (Forwarded by HOD/HOI)
Acknowledgement
Abstract
List of Figures
List of Abbreviations
1: Introduction
2: Literature Review
3: System Analysis
4: System Design
5: Implementation
6: Testing
8: References
1. Introduction
In today's digital age, the entertainment sector has seen a notable increase in the
production and accessibility of films across platforms. Thousands of films are available
online, and viewers often struggle to find content that suits their interests. This has
increased the demand for systems that can suggest appropriate content. A Movie
Recommendation System is one such application: it helps users find movies they are
likely to enjoy based on certain criteria or patterns.
This is a content-based movie recommendation system built using Python and run in a Google
Colab environment. The system is created to scan through a movie dataset and recommend
movies that are similar to the one in question on the basis of textual data like genres,
keywords, and overviews. In contrast to collaborative filtering algorithms that need user
ratings, this system is solely based on content similarity, hence, not dependent on user
behavior or interactions.
The data used in this project is the TMDB 5000 Movies Dataset, which contains extensive
metadata regarding movies like their title, genre, keywords, and description. TF-IDF (Term
Frequency-Inverse Document Frequency) is a technique applied to convert textual content
into numerical vectors. Cosine similarity is then employed to calculate how similar the
movies are to one another in terms of content.
The final system lets the user enter any film title and get a list of the movies that are
closest to it in theme and plot. It is a basis for more sophisticated recommendation
engines and can be extended to include user behavior data, ratings, or hybrid models.
Overall, the project seeks to show how natural language processing and machine learning
methods can address real-world problems in the entertainment and user
personalization domain.
The system utilizes the TMDB 5000 Movie Dataset, which contains metadata such as movie
title, description, genre, cast and crew, and keywords. Instead of relying on user ratings or
interactions, this content-based method relies on the descriptive characteristics of the movies.
Based on the analysis of the text-based characteristics, the system determines and
The process consists of several steps. The data is first cleaned and pre-processed to
handle missing values and to concatenate the textual features (e.g., overview, genres,
and keywords) into a single feature. The concatenated text is then vectorized with
TF-IDF, which weights each word by its importance relative to the entire dataset.
Finally, cosine similarity is computed between the TF-IDF vectors of every pair of
movies, producing a similarity matrix that powers the recommendation logic.
The basic operation of the system is to take a movie name as input and produce a list of
the most content-relevant movies. The method is robust, requires no user interaction data,
and works for new users, which addresses the cold-start problem to some extent.
Used in Google Colab with libraries like Pandas, Scikit-learn, and Numpy, this project
demonstrates the usability of natural language processing and vector space models on real-
including hybrid models that combine content-based and collaborative filtering approaches to
achieve higher accuracy.
2. Literature Review
Recommendation systems are now a key field of study in machine learning and
information retrieval. With the growing availability of digital content, particularly in
entertainment, recommendation systems help users find content without sifting through
vast databases by hand. This literature review introduces the ideas and existing studies on
movie recommendation systems, emphasizing content-based filtering techniques and their
evolution.
Content-Based Filtering: These systems suggest items similar to those the user liked in the
past. They use item features (genre, keywords, cast) to compute similarity, and they remain
effective when user interaction data is missing or sparse.
Collaborative Filtering: These systems recommend items based on the ratings of other, similar
users. While effective, they suffer from cold-start issues and require a lot of user data.
Pazzani and Billsus (2007) explained content-based systems that learn user preferences from
item attributes using machine learning algorithms. The paper explained how text attributes
(e.g., movie plot, actors) can be translated to a feature vector used to calculate similarity.
In film recommendation, Basu, Hirsh & Cohen (1998) illustrated filtering with
metadata such as actors, genre, and keywords. Their approach contributed to
early ideas of tag-based filtering.
Current systems employ TF-IDF vectorization, a statistical technique for determining how
important a word is to a document relative to a collection of documents. Combined with
cosine similarity, it forms the core of most content-based systems. It is widely used
because it is simple and effective, as demonstrated by Salton & Buckley (1988) in their
work on vector space models.
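To make the weighting concrete, the following is a minimal sketch that computes TF-IDF by hand on three toy descriptions. It assumes the textbook form tf(t, d) * log(N / df(t)). The documents and the helper name `tf_idf` are made up for illustration, and Scikit-learn's `TfidfVectorizer` applies a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Toy "documents" (hypothetical movie descriptions, not from the TMDB dataset).
docs = [
    "space alien war",
    "space station rescue",
    "romantic comedy wedding",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df).

    Only meaningful for terms that occur somewhere in the corpus.
    """
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf

# "space" occurs in two of three documents, "alien" in only one,
# so "alien" receives the higher weight in the first document.
w_space = tf_idf("space", tokenized[0])
w_alien = tf_idf("alien", tokenized[0])
print(round(w_space, 3), round(w_alien, 3))
```

A term that appears in fewer documents is treated as more distinctive, which is why rare keywords tend to dominate the similarity between two movies.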
Systems and Platforms in Place
Commercial services like IMDb, Amazon Prime, and Netflix use sophisticated
recommendation systems. Netflix uses a hybrid method that combines
collaborative filtering, deep learning, and NLP, while IMDb focuses on user reviews and
metadata. These systems provide evidence of the success of content-based approaches in
real-world applications.
In summary, the literature supports the view that content-based filtering, aided by
natural language processing and vector space models, is a sound method for designing
scalable and accurate recommendation systems. The same approach is used in this project
to design an efficient movie recommendation engine based on publicly available movie metadata.
3. System Analysis
System analysis is a fundamental component of any software development effort. It assists in
understanding the problem, determining system requirements, studying feasibility, and
identifying the technologies to be adopted. The primary objective of this project is to
develop a content-based Movie Recommendation System that, given a movie title, suggests
similar movies using text features such as genres, keywords, and overviews.
With the entertainment sector growing rapidly, audiences are exposed to
thousands of films on different platforms. With that many options, choosing is often
disorganized and leads to decision fatigue. Users cannot realistically scroll through every
title to pick movies to their liking, and the lack of suggestions leaves the user experience
incomplete. There is therefore a need for an intelligent recommendation system that
suggests relevant films automatically based on content similarity.
The function of this system is to:
o Recommend films of a similar nature to a selected film.
o Demonstrate how machine learning techniques, namely NLP and similarity measures, can be
used to solve real problems.
Functional Requirements:
o Upload and load a movie dataset with metadata.
o Clean and preprocess data fields.
o Identify the key features to compare (overview, genres, keywords).
o Apply a recommendation algorithm based on TF-IDF and cosine similarity.
o Provide movie suggestions according to user input.
Non-Functional Requirements:
o Platform: Google Colab
o Libraries: Pandas, NumPy, Scikit-learn
o Dataset: TMDB 5000 Movie Dataset
o Techniques: TF-IDF Vectorization, Cosine Similarity
This system analysis shows the systematic method used to define the root problem, set the
system boundary, and select appropriate technology. The analysis indicates that the system
is feasible and useful for educational purposes within the academic community, and that it
can serve as a foundation for future development such as hybrid recommendations or web
deployment.
4. System Design
System design is the blueprint of a software system that lays out the structure, components,
and flow of data. It bridges the gap between the system requirements identified during
analysis and the implementation phase. The design of the Movie Recommendation System
focuses on functionality, modularity, and performance, using machine learning and natural
language processing techniques.
The architecture of the system can be broken down into the following layers:
1. Input Layer:
o Accepts a movie name as input from the user.
o Validates the movie name against the dataset.
2. Processing Layer:
o Data preprocessing: Removes null values, cleans text data, and handles
missing fields.
o Feature extraction: Uses metadata such as genres, keywords, and overview.
o Vectorization: Applies TF-IDF vectorizer to convert text into numerical
format.
o Similarity calculation: Uses cosine similarity to compare movies based on
vectorized data.
3. Output Layer:
o Retrieves the top N similar movies.
o Displays recommendations with titles in ranked order.
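The validation step in the input layer can be sketched as follows. This is a minimal illustration, assuming titles are matched case-insensitively and that close matches are offered for misspelled input. The function name `validate_title` and the sample titles are hypothetical and are not taken from the project code.

```python
import difflib

def validate_title(user_input, known_titles):
    """Return the dataset title matching user_input, or raise KeyError.

    Matching ignores case and surrounding whitespace. If nothing matches,
    the error message lists up to three similar titles as suggestions.
    """
    lookup = {title.lower(): title for title in known_titles}
    key = user_input.strip().lower()
    if key in lookup:
        return lookup[key]
    suggestions = difflib.get_close_matches(key, list(lookup), n=3, cutoff=0.6)
    hint = ", ".join(lookup[s] for s in suggestions) or "none"
    raise KeyError(f"'{user_input}' not found; close matches: {hint}")

# Hypothetical title list standing in for the dataset's title column.
titles = ["Avatar", "Titanic", "The Dark Knight"]
print(validate_title("  avatar ", titles))
```

Normalizing the input before the lookup keeps the processing layer simple, since it can then assume that every title it receives exists in the dataset.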
Modules Description
Design Considerations
5. Implementation
The implementation phase involves converting the system design into a functional product
using appropriate technologies. For the Movie Recommendation System, the
implementation focuses on handling data efficiently, computing accurate recommendations,
and delivering a responsive user interface. This section outlines how the major components
were developed and integrated using Python and Streamlit.
The system uses the TMDB 5000 Movies Dataset, which includes detailed information such
as movie titles, genres, keywords, and overviews. These fields are essential for content-based
filtering.
Steps performed:
This preprocessing ensured that the data was clean, uniform, and ready for vectorization.
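A preprocessing pass of this kind might look like the sketch below. It assumes the raw CSV layout in which `genres` and `keywords` are stored as JSON strings, and it uses two made-up rows in place of the real file. The helper `names_from_json` is hypothetical; only the column name `combined_features` comes from the code in this report.

```python
import json
import pandas as pd

# Two toy rows shaped like the TMDB 5000 CSV, where genres and keywords
# are stored as JSON strings (values here are made up for illustration).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "overview": ["A pilot explores a distant moon.", None],
    "genres": ['[{"id": 1, "name": "Action"}, {"id": 2, "name": "Fantasy"}]', "[]"],
    "keywords": ['[{"id": 9, "name": "space"}]', None],
})

def names_from_json(cell):
    """Turn a JSON list of {"name": ...} objects into a space-separated string."""
    if not isinstance(cell, str):
        return ""  # missing value (None/NaN)
    return " ".join(item["name"] for item in json.loads(cell))

# Fill missing overviews, flatten the JSON columns, then join everything
# into the single text column used for vectorization.
movies["overview"] = movies["overview"].fillna("")
for col in ("genres", "keywords"):
    movies[col] = movies[col].apply(names_from_json)
movies["combined_features"] = (
    movies["overview"] + " " + movies["genres"] + " " + movies["keywords"]
).str.strip()

print(movies["combined_features"].tolist())
```

A row with no usable text ends up with an empty string, which later becomes an all-zero TF-IDF vector and therefore has zero similarity to every other movie.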
2. Feature Vectorization
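from sklearn.feature_extraction.text import TfidfVectorizer
# Build TF-IDF vectors from the combined text, ignoring English stop words.
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['combined_features'])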
This matrix was then used to compare the content similarity between movies.
3. Similarity Calculation
The system calculates cosine similarity between TF-IDF vectors to find how similar one
movie is to others. Cosine similarity measures the cosine of the angle between two vectors,
which is a standard way to measure document similarity in NLP tasks.
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
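As an illustration of what this call computes for each pair of rows, the sketch below evaluates cosine similarity directly from its definition, cos(a, b) = (a . b) / (|a| |b|). The three weight vectors and the helper name `cosine` are made up for the example and do not come from the project.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical TF-IDF weights for three movies over a four-term vocabulary.
m1 = np.array([0.5, 0.5, 0.0, 0.0])
m2 = np.array([0.5, 0.0, 0.5, 0.0])
m3 = np.array([0.0, 0.0, 0.0, 1.0])

print(cosine(m1, m2))  # the two movies share one term
print(cosine(m1, m3))  # the two movies share no terms
```

Because `TfidfVectorizer` L2-normalizes each row by default, the dot product of two rows of the TF-IDF matrix already equals their cosine similarity.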
4. Recommendation Logic
Using the movie title selected by the user, the system fetches the corresponding index and
retrieves the most similar movies based on the cosine similarity scores. These are sorted in
descending order to provide the top N recommendations.
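The report does not reproduce the recommendation function itself, so the following is only a sketch of one common way to write it, assuming unique titles and a title-to-index lookup built with a pandas Series. The toy dataset is made up. The name `recommend_movie` matches the calls shown in the testing section, while the `indices` lookup is an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the preprocessed dataset (titles and text are made up).
movies = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "combined_features": [
        "space alien planet war",
        "space alien invasion",
        "wedding romance comedy",
        "planet soldiers army",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(
    movies["combined_features"]
)
cosine_sim = cosine_similarity(tfidf_matrix)

# Map each title to its row position; an unknown title raises KeyError.
indices = pd.Series(movies.index, index=movies["title"])

def recommend_movie(title, n=5):
    """Return the titles of the n movies most similar to `title`."""
    idx = indices[title]
    scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, highest first, and drop the movie itself.
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    top = [i for i, _ in scores if i != idx][:n]
    return movies["title"].iloc[top].tolist()

print(recommend_movie("Alpha", 2))  # prints ['Beta', 'Delta']
```

Excluding the query movie is needed because every movie has a similarity of 1 with itself and would otherwise always rank first.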
All necessary components (movies, similarity matrix, and TF-IDF vectorizer) were saved
using Python’s pickle module to ensure fast loading and reuse without recomputation.
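Saving and reloading the components could be done along the lines of the sketch below. The helper names `save_artifacts` and `load_artifacts` and the file name are hypothetical, and small built-in objects stand in for the real DataFrame, similarity matrix, and vectorizer, which can be pickled the same way.

```python
import os
import pickle
import tempfile

def save_artifacts(path, **artifacts):
    """Pickle all keyword arguments into one file as a dict."""
    with open(path, "wb") as fh:
        pickle.dump(artifacts, fh)

def load_artifacts(path):
    """Load the dict written by save_artifacts."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Small stand-ins for the real objects (made-up values).
titles = ["Alpha", "Beta"]
similarity = [[1.0, 0.3], [0.3, 1.0]]

path = os.path.join(tempfile.mkdtemp(), "artifacts.pkl")
save_artifacts(path, titles=titles, similarity=similarity)
restored = load_artifacts(path)
print(restored["titles"], restored["similarity"][0][1])
```

Pickle files should only be loaded from a trusted source, because unpickling can execute arbitrary code.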
6. Testing
Testing and validation are crucial to ensure that the system performs as intended and delivers
accurate, reliable results. In the case of the Movie Recommendation System, testing was
conducted at various levels — from data loading and transformation to similarity computation
and frontend functionality. The goal was to verify that the recommendation engine functions
correctly and the user interface behaves as expected under different input conditions.
1. Unit Testing
Data Preprocessing Module: Ensured that missing values were handled and
combined features were generated correctly.
Vectorization Module: Verified that the TF-IDF vectorizer processed the text without
errors and produced the expected matrix shape.
Recommendation Function: Checked if the similarity scores were correctly
calculated and returned the appropriate number of recommendations.
assert len(recommend_movie('Avatar', 5)) == 5
This test ensures that the function returns exactly five recommendations for a valid input.
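The matrix-shape check described for the vectorization module could be written as in the sketch below. It assumes that the expected shape is one row per movie and one column per vocabulary term, and it uses a made-up three-line corpus in place of the `combined_features` column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the combined_features column (made-up text).
corpus = ["space alien war", "space station rescue", "wedding comedy"]

tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(corpus)

# One row per document, one column per distinct vocabulary term.
n_rows, n_cols = tfidf_matrix.shape
assert n_rows == len(corpus)
assert n_cols == len(tfidf.vocabulary_)
print(tfidf_matrix.shape)
```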
2. Exception Handling
Special attention was given to handling invalid or unknown movie titles. If a user enters a title
not present in the dataset, the system does not crash but instead displays a friendly error
message:
try:
    print(recommend_movie(user_movie, 5))
except KeyError:
    print(f"Sorry, the movie '{user_movie}' was not found in the database.")
3. Integration Testing
After integrating the backend with the Streamlit frontend, tests were conducted to ensure
seamless communication between components. It was verified that user selections were
correctly passed to the backend and that the recommendations were accurately displayed on
the web interface.
4. Functional Testing
The full workflow — from loading the app to receiving recommendations — was tested for
multiple movie titles, including popular and obscure films. In all cases, the system returned
logically similar recommendations based on content.
5. Performance Testing
Although the system is relatively lightweight, performance testing was conducted to confirm
that:
6. Validation
The system was validated by comparing its results with human expectations. For example,
when a user selects the movie Avatar, the recommended movies include other science fiction
or fantasy films with similar themes or visuals, indicating that the system is working
accurately.
The Movie Recommendation System developed in this project demonstrates how machine
learning and natural language processing techniques can be effectively applied to personalize
user experiences in the digital entertainment industry. By leveraging the TMDB 5000 Movies
Dataset and combining it with TF-IDF vectorization and cosine similarity, the system
successfully identifies and suggests movies that share thematic and descriptive similarities
with the user’s selected title.
The system eliminates the need for user-generated data such as ratings or reviews, making it
ideal for situations where user data is sparse or unavailable. The content-based approach
ensures that recommendations are grounded in actual metadata like genres, keywords, and
plot overviews, thereby maintaining consistency and logical relevance in the results.
Furthermore, the implementation using Streamlit makes the project accessible and user-
friendly, allowing seamless interaction through a web-based interface. The dropdown movie
selector, real-time recommendation engine, and error-handling mechanisms contribute to a
positive user experience. Overall, the system is lightweight, efficient, and highly scalable for
future integration into larger platforms or services.
This project not only fulfilled its goal of delivering a working movie recommendation engine
but also provided valuable insights into data preprocessing, vectorization, model building, and
web deployment. The modular architecture ensures that each component is independent and
can be updated or replaced without affecting the entire system.
Future Scope
While the current system performs well, there is significant room for enhancement and
expansion. Some of the key areas for future development include:
3. Visual and Audio Metadata: Analyzing trailers, posters, and soundtracks could
enrich the content-based approach.
4. User Login and History Tracking: Building user profiles and storing viewing history
would allow more personalized and dynamic recommendations.
5. Mobile App Integration: Deploying the system as a mobile application would
improve accessibility and increase user engagement.
6. Multilingual Support: Adding support for regional movies and languages could
broaden the system’s appeal to diverse audiences.
In conclusion, this project serves as a strong foundation for a scalable and intelligent
recommendation engine, with vast potential for real-world application and academic research.
8. References
2. Streamlit Inc. (2024). Streamlit – The fastest way to build and share data apps.
https://streamlit.io
3. TMDB (The Movie Database). TMDB 5000 Movies Dataset. Retrieved from
https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
4. Pandas Development Team. (2024). pandas: Powerful Python data analysis toolkit.
https://pandas.pydata.org
5. NumPy Developers. (2024). NumPy: The fundamental package for scientific computing
with Python. https://numpy.org
6. Google Colab. (2024). Google Colaboratory – A research tool for machine learning
education and research. https://colab.research.google.com