Faids Final Report.. 1 1
Submitted By
1. 927622BAD045-RITHANI KS
2. 927622BAD046-RITHISH RM
3. 927622BAD047-SANGAVI SA
4. 927622BAD048-SANJAJ SS
CERTIFICATE
This is to certify that the project entitled “MOVIE RATING PREDICTOR WITH USER
partial fulfilment of the requirement for the award of the degree of Bachelor of Technology in
Submitted on
Verified by,
Soundarya-Corporate Trainer
INTRODUCTION
The "Movie Rating Predictor using Python" project introduces a sophisticated approach to
leveraging machine learning for predicting the ratings of movies. In the ever-evolving landscape
of the film industry, the ability to anticipate a movie's success before its release is a crucial aspect
of decision-making for filmmakers, studios, and streaming platforms. This project encompasses a
comprehensive pipeline, including data collection, preprocessing, exploratory data analysis, and
the implementation of a regression model that considers a range of features, such as genres, cast,
crew, budget, and user reviews. The predictive model, once trained and evaluated, serves as a
valuable tool for assessing the potential audience reception of a movie. By combining the power
of data science and machine learning, this project offers a practical solution to the challenges faced
in the entertainment industry, providing insights that can influence production and distribution
strategies. The documentation accompanying the project ensures transparency and usability,
making it a versatile and valuable asset for stakeholders in the film domain.
OBJECTIVES
1. Data Collection:
Gather a comprehensive dataset with details like genres, cast, crew, budget, release date, and user
reviews for movies.
2. Data Preprocessing:
Clean and preprocess the dataset, handling missing values and outliers. Apply feature engineering
to extract meaningful patterns.
5. Evaluation Metrics:
Assess model accuracy using metrics like Mean Squared Error, Root Mean Squared Error, and
R-squared.
8. Deployment:
Deploy the trained model for real-time predictions, either locally or on a cloud platform, making
the system practical for decision-making in the film industry.
9. Documentation:
Provide comprehensive documentation, including a user guide, code
documentation, and a detailed project report. Ensure transparency in methodology
and results for understanding and future development.
10. Benefits and Significance:
Highlight practical applications, emphasizing how the project benefits filmmakers, studios, and
streaming platforms by guiding decision-making in the dynamic film industry.
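The metrics named in objective 5 can be computed directly from actual and predicted ratings. The following is a minimal sketch using NumPy; the function name `regression_metrics` and the sample ratings are hypothetical and serve only to illustrate the formulas.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # Mean Squared Error: average of the squared residuals
    mse = float(np.mean(residuals ** 2))
    # Root Mean Squared Error: same units as the ratings themselves
    rmse = float(np.sqrt(mse))
    # R-squared: 1 minus (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot  # undefined when all true values are identical
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical ratings on a 1-5 scale, for illustration only
actual = [4, 3, 5, 2, 4]
predicted = [3.5, 3.0, 4.5, 2.5, 4.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

A lower MSE or RMSE and an R-squared closer to 1 indicate predictions that track the actual ratings more closely.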
EXISTING SYSTEM
Existing movie rating systems play a crucial role in guiding viewers' choices and
influencing industry decisions. Prominent platforms such as IMDb, Rotten Tomatoes, and
Metacritic provide a comprehensive overview of a movie's reception, incorporating both professional
critics' opinions and audience reviews.
IMDb, one of the most well-known databases for movies and television, allows users to rate
movies on a scale of 1 to 10. These individual user ratings contribute to an overall score displayed
on the platform. Rotten Tomatoes takes a different approach, reporting the percentage of critics'
reviews that are positive. Metacritic, on the other hand, calculates a
weighted average score from critics' reviews.
The methodologies employed by these platforms vary, reflecting the diverse preferences and
expectations of their user bases. While IMDb relies heavily on user ratings, Rotten Tomatoes
incorporates both critics and audience perspectives. Metacritic takes a balanced approach,
assigning weights to reviews from different sources to calculate its composite score.
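The three aggregation styles described above can be illustrated with a short sketch. The functions and all numbers below are hypothetical simplifications for illustration; they are not the platforms' actual formulas, which this report does not specify.

```python
def simple_mean(ratings):
    """Plain average of user ratings (a simplified stand-in for a user-rating score)."""
    return sum(ratings) / len(ratings)

def percent_positive(review_is_positive):
    """Share of reviews marked positive, expressed as a percentage."""
    return 100.0 * sum(review_is_positive) / len(review_is_positive)

def weighted_average(scores, weights):
    """Weighted mean of scores; the weights need not sum to 1."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Hypothetical inputs, for illustration only
user_ratings = [7, 8, 9, 6]                   # ratings on a 1-10 scale
critic_verdicts = [True, True, False, True]   # positive / negative reviews
critic_scores = [80, 60, 90]                  # scores on a 0-100 scale
critic_weights = [2.0, 1.0, 1.0]              # first critic counts double

print(simple_mean(user_ratings))                        # 7.5
print(percent_positive(critic_verdicts))                # 75.0
print(weighted_average(critic_scores, critic_weights))  # 77.5
```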
Movie rating systems often extend beyond mere numerical scores. They include detailed reviews,
user comments, and additional information about the cast, crew, genres, and release date. These
features contribute to a richer user experience, allowing individuals to make informed decisions
about which movies to watch.
Behind the scenes, recommendation algorithms are frequently employed to enhance user
engagement. These algorithms utilize collaborative filtering or content-based approaches to
suggest movies tailored to individual preferences. They analyze user behavior, historical ratings,
and movie features to predict what a viewer might enjoy. Streaming platforms like Netflix and
Amazon Prime Video leverage such algorithms to offer personalized content recommendations.
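As a rough illustration of collaborative filtering, the sketch below estimates a user's rating for an unseen movie as a similarity-weighted average of other users' ratings. The rating matrix, the function names, and the choice of cosine similarity are assumptions made for this example; production recommenders are considerably more elaborate.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_rating(ratings, user, item):
    """User-based collaborative filtering.

    ratings is a users x movies array in which 0 means "not rated".
    The estimate is the similarity-weighted average of the ratings that
    other users gave to the same movie.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue  # skip the target user and users who have not rated the movie
        sim = cosine_similarity(ratings[user], ratings[other])
        weighted_sum += sim * ratings[other, item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else 0.0

# Hypothetical ratings: rows are users, columns are movies
ratings = np.array([
    [5, 4, 0],   # user 0 has not rated movie 2
    [5, 4, 4],   # user 1 has tastes close to user 0
    [1, 2, 5],   # user 2 has different tastes
], dtype=float)

estimate = predict_rating(ratings, user=0, item=2)
print(round(estimate, 2))
```

Because user 1 is more similar to user 0 than user 2 is, the estimate lies closer to user 1's rating of 4 than to user 2's rating of 5. Treating unrated entries as zeros when computing similarity is a simplification.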
In the film industry, movie ratings and reviews play a pivotal role in shaping audience perceptions
and influencing box office success. High ratings on platforms like Rotten Tomatoes can attract
more viewers, positively impacting a movie's commercial performance. Conversely, negative
reviews may lead to lower audience turnout.
It's important to note that the landscape of movie rating systems is dynamic, with ongoing
developments and innovations. Individual studios and streaming platforms may also have
proprietary systems for predicting a movie's success based on various factors, including audience
demographics, viewing habits, and social media engagement.
PROPOSED SYSTEM
The proposed "Movie Rating Predictor using Python" system represents an innovative
approach to forecasting movie ratings through the application of machine learning techniques. This
system is designed to address the growing demand for accurate pre-release assessments of a
movie's potential success, providing valuable insights for filmmakers, studios, and streaming
platforms.
4. User-Friendly Interface:
Optionally, develop an intuitive and user-friendly interface allowing stakeholders to input movie
details easily. This interface facilitates real-time predictions, making the system accessible and
practical for industry professionals.
7. Evaluation Metrics and Model Interpretability:
Utilize robust evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error
(RMSE), and R-squared to measure the accuracy of the predictive model. Additionally, focus on
enhancing model interpretability, ensuring that stakeholders can understand the factors influencing
the predictions.
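One possible way to approach the interpretability goal in item 7 is to inspect the feature importances of a tree-based model. The sketch below uses scikit-learn's `RandomForestClassifier` on synthetic data; the data, the feature names, and the rule that generates the labels are assumptions for illustration and do not come from the project's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 300
feature_names = ["MovieID", "Age", "Occupation"]

# Synthetic features named after the columns used in the sample code
X = np.column_stack([
    rng.integers(1, 4000, n_samples),   # MovieID
    rng.integers(1, 57, n_samples),     # Age
    rng.integers(0, 21, n_samples),     # Occupation
])
# Synthetic label that depends on Age only, so Age should dominate
y = (X[:, 1] > 30).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one non-negative value per feature, summing to 1
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Reporting importances of this kind alongside each prediction is one way to show stakeholders which inputs the model relies on.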
MODULES
8. Deployment Module:
Handles the deployment of the trained model for real-time predictions. Depending on the project
requirements, the deployment can be done locally or on a cloud platform for scalability.
9. Documentation Module:
Prepares comprehensive documentation, including a user guide, code documentation, and a
detailed project report. This module ensures transparency in the project's methodology, making it
easy for users and developers to understand and utilize the system.
Each module contributes to the overall functionality of the system, facilitating a structured
and organized development process for the "Movie Rating Predictor using Python" project.
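For the Deployment Module, a common first step is to save the trained model to disk so that a separate process can load it and serve predictions. The sketch below uses `joblib`, which is installed alongside scikit-learn; the training rows, the file name `rating_model.joblib`, and the choice of a decision tree are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows: [MovieID, Age, Occupation] -> rating
X = [[1, 25, 4], [2, 35, 7], [3, 18, 10], [4, 45, 1]]
y = [4, 3, 5, 2]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Persist the fitted model to a file
path = os.path.join(tempfile.mkdtemp(), "rating_model.joblib")
joblib.dump(model, path)

# Later, possibly in another process or on a server: load and predict
loaded = joblib.load(path)
prediction = loaded.predict([[2, 35, 7]])
print(prediction)
```

The same saved file could be loaded locally or inside a cloud-hosted service, which matches the two deployment options described above.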
BLOCK DIAGRAM
OUTPUT SCREENSHOTS
SAMPLE CODES
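#Imports used by the sample code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
#Name the columns of the ratings dataset
df_rating.columns =['ID','MovieID','Ratings','TimeStamp']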
df_rating.dropna(inplace=True)
df_rating.head()
#Data acquisition of the users dataset
df_user = pd.read_csv("../input/users.dat",sep='::',engine='python')
df_user.columns =['UserID','Gender','Age','Occupation','Zip-code']
df_user.dropna(inplace=True)
df_user.head()
df = pd.concat([df_movie, df_rating,df_user], axis=1)
df.head()
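The `pd.concat(..., axis=1)` call above places the three tables side by side and aligns rows by index position rather than by movie or user identifier. If the intent is to attach each rating to its movie and its user, a key-based join is one alternative. The sketch below uses small hypothetical tables; the `UserID` column in the ratings table is an assumption (the report names that column `ID`).

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
movies = pd.DataFrame({
    "MovieID": [1, 2],
    "MovieName": ["Toy Story (1995)", "Jumanji (1995)"],
})
ratings = pd.DataFrame({
    "UserID": [10, 11, 10],
    "MovieID": [2, 1, 1],
    "Ratings": [3, 5, 4],
})
users = pd.DataFrame({"UserID": [10, 11], "Age": [25, 35]})

# Join ratings to movies on MovieID, then to users on UserID,
# so every row describes one rating with its movie and user details
merged = (
    ratings
    .merge(movies, on="MovieID", how="inner")
    .merge(users, on="UserID", how="inner")
)
print(merged)
```

With a join of this kind, each row of the combined table refers to a single consistent (user, movie, rating) triple.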
#2. Perform the Exploratory Data Analysis (EDA) for the users dataset
#Visualize user age distribution
df['Age'].value_counts().plot(kind='barh',alpha=0.7,figsize=(10,10))
plt.show()
df.Age.plot.hist(bins=25)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('Age')
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
df['age_group'] = pd.cut(df.Age, range(0, 81, 10), right=False, labels=labels)
df[['Age', 'age_group']].drop_duplicates()[:10]
#Visualize overall rating by users
df['Ratings'].value_counts().plot(kind='bar',alpha=0.7,figsize=(10,10))
plt.show()
groupedby_movieName = df.groupby('MovieName')
groupedby_rating = df.groupby('Ratings')
groupedby_uid = df.groupby('UserID')
#groupedby_age = df.loc[most_50.index].groupby(['MovieName', 'age_group'])
movies = df.groupby('MovieName').size().sort_values(ascending=True)[:1000]
print(movies)
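#Find and visualize the user rating of the movie “Toy Story”
#Assumption: the title is stored as 'Toy Story (1995)'; adjust it to match the dataset
ToyStory_data = groupedby_movieName.get_group('Toy Story (1995)')
plt.figure(figsize=(10,10))
plt.scatter(ToyStory_data['MovieName'],ToyStory_data['Ratings'])
plt.title('Plot showing the user rating of the movie “Toy Story”')
plt.show()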
#Find and visualize the viewership of the movie “Toy Story” by age group
ToyStory_data[['MovieName','age_group']]
#Find and visualize the top 25 movies by viewership rating
top_25 = df[:25]  #first 25 rows; df[25:] would skip them instead
top_25['Ratings'].value_counts().plot(kind='barh',alpha=0.6,figsize=(7,7))
plt.show()
#Visualize the rating data by user of user id = 2696
userid_2696 = groupedby_uid.get_group(2696)
userid_2696[['UserID','Ratings']]
#First 500 extracted records
first_500 = df[:500].copy()
first_500.dropna(inplace=True)
#Use the following features:movie id,age,occupation
features = first_500[['MovieID','Age','Occupation']].values
#Use the ratings as the prediction target
labels = first_500['Ratings'].values
#Create train and test data set
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)
#Create a histogram for movie
df.MovieID.plot.hist(bins=25)
plt.title("Distribution of MovieID")
plt.xlabel('MovieID')
plt.ylabel('Count')
plt.show()
#Create a histogram for age
df.Age.plot.hist(bins=25)
plt.title("Distribution of Age")
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
#Create a histogram for occupation
df.Occupation.plot.hist(bins=25)
plt.title("Distribution of Occupation")
plt.xlabel('Occupation')
plt.ylabel('Count')
plt.show()
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(train, train_labels)
Y_pred = logreg.predict(test)
acc_log = round(logreg.score(train, train_labels) * 100, 2)
acc_log
# Support Vector Machines
svc = SVC()
svc.fit(train, train_labels)
Y_pred = svc.predict(test)
acc_svc = round(svc.score(train, train_labels) * 100, 2)
acc_svc
# K Nearest Neighbors Classifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(train, train_labels)
Y_pred = knn.predict(test)
acc_knn = round(knn.score(train, train_labels) * 100, 2)
acc_knn
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(train, train_labels)
Y_pred = gaussian.predict(test)
acc_gaussian = round(gaussian.score(train, train_labels) * 100, 2)
acc_gaussian
# Perceptron
perceptron = Perceptron()
perceptron.fit(train, train_labels)
Y_pred = perceptron.predict(test)
acc_perceptron = round(perceptron.score(train, train_labels) * 100, 2)
acc_perceptron
# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(train, train_labels)
Y_pred = linear_svc.predict(test)
acc_linear_svc = round(linear_svc.score(train, train_labels) * 100, 2)
acc_linear_svc
# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(train, train_labels)
Y_pred = sgd.predict(test)
acc_sgd = round(sgd.score(train, train_labels) * 100, 2)
acc_sgd
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(train, train_labels)
Y_pred = decision_tree.predict(test)
acc_decision_tree = round(decision_tree.score(train, train_labels) * 100, 2)
acc_decision_tree
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(train, train_labels)
Y_pred = random_forest.predict(test)
random_forest.score(train, train_labels)
acc_random_forest = round(random_forest.score(train, train_labels) * 100, 2)
acc_random_forest
#Collect the scores of all models in one table, best first
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
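The scores above are computed with `score(train, train_labels)`, so they measure accuracy on the same data the models were fitted to. A minimal sketch, using synthetic data with random labels (an assumption for illustration, not the report's dataset), shows how training accuracy can differ from accuracy on the held-out split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# Synthetic features and random ratings 1-5: there is no real signal to learn
features = rng.integers(0, 50, size=(500, 3))
labels = rng.integers(1, 6, size=500)

train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)

tree = DecisionTreeClassifier(random_state=0).fit(train, train_labels)

# Accuracy on the rows the tree has already seen
train_acc = tree.score(train, train_labels)
# Accuracy on rows held back from training
test_acc = tree.score(test, test_labels)

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

An unpruned tree can memorise its training rows, so the first figure comes out high even though the labels are random, while the second stays near chance. Scoring each model with `score(test, test_labels)` would therefore give a more conservative basis for comparing them.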
FUTURE ENHANCEMENT
6. Cross-Platform Compatibility:
Extend the system to predict ratings not only for traditional cinema releases but also for content
released on various streaming platforms. Consider differences in user behavior and preferences
across different platforms.
8. Interpretability Enhancements:
Implement tools for better model interpretability. This would help users understand how different
features contribute to the predicted ratings, fostering trust in the system.
9. Integration with Production Budget Analysis:
Expand the system to consider the relationship between movie ratings and production budgets.
This would provide insights into the cost-effectiveness of producing high-rated movies.
These future enhancements aim to elevate the functionality and performance of the "Movie
Rating Predictor" system, making it more adaptive, user-friendly, and capable of providing
valuable insights to the film industry.
CONCLUSION
In conclusion, the "Movie Rating Predictor using Python" project presents a promising and
innovative solution for the film industry, leveraging the power of data science and machine
learning to predict movie ratings with accuracy. The project's journey from data collection and
preprocessing to model implementation, evaluation, and potential deployment offers a
comprehensive approach to addressing the challenges faced by filmmakers, studios, and streaming
platforms in predicting the success of their productions.
The objectives set forth in the project,
including data preprocessing, exploratory data analysis, and the implementation of a robust
machine learning model, are designed to create a system that not only predicts movie ratings but
also provides valuable insights into the underlying factors influencing audience reception. The
incorporation of user interfaces, continuous model improvement, and deployment mechanisms
enhances the practicality and usability of the system, making it a valuable tool for decision-makers
in the industry.
The benefits and significance of the project lie in its potential to streamline decision-making
processes, allocate resources efficiently, and guide marketing and distribution strategies.
The transparent documentation accompanying the project ensures that users and developers alike
can understand the methodology, results, and potential areas for improvement.
Looking forward,
the project's future enhancements, including the incorporation of streaming data, advanced feature
engineering, and user feedback loops, demonstrate a commitment to staying adaptive in a dynamic
industry. These improvements aim to make the system even more sophisticated, personalized, and
capable of providing nuanced predictions in an ever-evolving landscape.
In essence, the "Movie
Rating Predictor using Python" project stands as a testament to the practical applications of
technology in the entertainment industry. As it evolves with future enhancements, it has the
potential to become an indispensable tool for industry professionals seeking data-driven insights
to navigate the complexities of movie production and distribution.
ABSTRACT
The "Movie Rating Predictor using Python" project aims to develop a machine learning-based
system that accurately predicts the rating of movies by analyzing various features such as
genres, cast, crew, budget, release date, and user reviews. The project encompasses data collection,
preprocessing, exploratory data analysis, and the implementation of a regression model for
predicting movie ratings. The machine learning model is trained and evaluated using metrics like
Mean Squared Error and R-squared. Optionally, a user-friendly interface can be created to allow
users to input movie details for real-time predictions. The project's significance lies in its practical
application for filmmakers, studios, and streaming platforms, offering a tool to estimate a movie's
potential success before release. The comprehensive documentation includes a user guide, code
documentation, and a project report detailing methodology, results, and potential future
enhancements. Overall, the "Movie Rating Predictor" project represents a valuable integration of
data science and machine learning in the entertainment industry.