0% found this document useful (0 votes)
15 views23 pages

Bda Mini Project Report

Uploaded by

siblu khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views23 pages

Bda Mini Project Report

Uploaded by

siblu khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Movie-Rating-Prediction

Submitted in partial fulfillment of the requirements of


the Mini-Project for Big Data Analysis of

Bachelors of Engineering
by
Khan Zubair Ahmed 211P026
(22)

Guide:
(Prof. Reshma Lohar)

Department of Computer Engineering


Rizvi College of Engineering

University of Mumbai

2024-2025

1
CERTIFICATE
This is to certify that the mini-project entitled “Movie-Rating-Preiction” is a bonafide work
of “Khan Zubair Ahmed” (Roll No:22) submitted to the University of Mumbai in partial
fulfillment of the requirement for the Mini-Project Of Big Data Analysis for Final Year of the
Bachelor of Engineering in “Computer Engineering”.

Prof.Reshma lohar
Guide

_
Prof. Anupam Chaudhary Dr. Varsha Shah
Head of Department Principal

2
Declaration

I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original sources.
I also declare that I have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.

Khan Zubair Ahmed

Date:

3
ABSTRACT

This project aims to predict movie ratings based on user inputs, using advanced
machine learning techniques. The model is trained on a dataset of movie ratings and
features such as user preferences, movie metadata, and collaborative filtering
techniques. By utilizing regression algorithms and recommendation systems, this
project provides a predictive model that can anticipate user ratings for movies they
haven't watched yet.
The study evaluates various machine learning models, including linear regression,
decision trees, and collaborative filtering algorithms, to determine the most accurate
model for this task.
In addition to model performance, the models are subjected to 10-fold cross-
validation to assess their generalization capability. The results show how advertising
spending across different channels impacts sales and provide insights into the
effectiveness of each channel.
This analysis offers a comprehensive understanding of advertising's role in driving
sales and provides a basis for optimizing marketing strategies.

Keywords: Movie rating prediction, collaborative filtering, recommendation system,


machine learning, regression models.

4
Index
Sr. No Title Page No

1. Introduction 6
2. Review and Literature 8
2.1. Paper 1 8
2.2. Paper 2 9
3. Report on Present Investigation 11
3.1. Experimental Setup 11
3.1.1 Hardware and Software Requirements 12
3.2. Data and methods 12
3.3 Results 13
4. Theory, Methodology and Algorithm 14
4.1 Theory 14
4.2 Methodology 14
4.2.1 Data Collection and Preprocessing 14
4.3 Algorithms 16
5. Results and Discussions
5.1 results 17
5.2 Discussions 18
5.3 Outputs 19
6. Conclusion 20
7. References 21
8. Acknowledgement 22
Assesments

5
Introduction

In recent years, with the vast growth of online streaming platforms like Netflix, Amazon
Prime, and Hulu, recommendation systems have gained tremendous importance. These
platforms use sophisticated algorithms to predict user preferences and recommend content
tailored to individual tastes. One crucial aspect of these systems is the movie rating
prediction, where platforms predict how a user might rate a movie they haven't seen yet, based
on their previous ratings and behavior.

The Movie-Rating-Prediction project focuses on building a system that can accurately


predict movie ratings using various machine learning techniques. The system uses historical
data on user ratings, movie metadata (such as genre, director, and release year), and other
related features to train the model. This helps in recommending personalized movie
suggestions, improving user satisfaction, and enhancing overall platform engagement.
The objectives of this study include:
• Developing a dataset by collecting relevant features like user behavior, movie
metadata, and historical ratings.
• Implementing and comparing machine learning algorithms like collaborative filtering,
matrix factorization, and regression models.
• Evaluating the performance of these models using RMSE and MAE to select the most
accurate one.

By accomplishing these objectives, the project will contribute to enhancing movie


recommendation systems, which are crucial for user retention and satisfaction on
entertainment platforms.

6
Chapter 2

Review of Literature

2.1 Collaborative Filtering


Collaborative filtering (CF) is widely used in recommendation systems, leveraging user-item
interaction matrices to predict user preferences. CF can be user-based or item-based:

• User-based CF identifies users with similar tastes and predicts ratings based on their
collective preferences.
• Item-based CF focuses on identifying similar movies based on past user interactions
and predicting ratings for similar items.

2.2 Matrix Factorization


Matrix factorization techniques, such as Singular Value Decomposition (SVD), decompose
the user-movie interaction matrix into latent factors, capturing the hidden patterns of user
preferences and movie features. Matrix factorization methods have been critical in improving
the performance of recommendation systems, especially in the Netflix Prize competition,
where they outperformed traditional CF methods.

2.3 Machine Learning Models


Machine learning models such as linear regression and decision trees have been applied to
enhance rating prediction by incorporating additional features like user demographics and
movie metadata. However, these models often underperform compared to collaborative
filtering and matrix factorization when applied in isolation.

2.4 Evaluation Metrics


Evaluation of recommendation systems often relies on metrics like RMSE (Root Mean
Squared Error) and MAE (Mean Absolute Error). RMSE measures the average magnitude of
prediction errors, while MAE calculates the absolute difference between predicted and actual
ratings.
as autoregressive integrated moving average (ARIMA) models, to capture the dynamic
relationship between advertising and sales. Broadbent and Fry (1995) applied time series
models to examine the long-term effects of TV advertising on sales, demonstrating that
7
advertising could have both immediate and delayed effects, depending on the product category
and market conditions.

More recently, the application of machine learning techniques to advertisement sales analysis
has gained prominence. Techniques such as random forests, support vector machines, and
neural networks have been used to model the nonlinear and complex relationships between
advertising spending, consumer behavior, and sales outcomes (Srinivasan, Vanhuele, &
Pauwels, 2010). These methods are particularly useful in handling large datasets and
identifying patterns that traditional statistical models might miss. However, the interpretability
of these models remains a challenge, which is critical for decision-makers who need actionable
insights.

2.4 Cross-Channel Effects and Integrated Campaigns


One of the major areas of recent research in advertisement sales analysis is the study of cross-
channel effects. As consumers interact with multiple media channels—TV, radio, digital, and
print—understanding how these channels work together to drive sales has become increasingly
important. Studies by Naik and Raman (2003) and Danaher and Rossiter (2011) have shown
that integrated marketing campaigns, where multiple channels are used in a coordinated
fashion, tend to have a greater impact on sales than single-channel campaigns.
The synergies created by combining channels can lead to more effective brand messaging and
higher ROI.

Cross-validation techniques are also used to evaluate the generalizability of models


across different subsets of data, ensuring that the model doesn't overfit the training data.

8
Chapter 3

Report on the Present Investigation

1. Introduction

This report presents an investigation into the relationship between advertising expenditures
and sales outcomes. With businesses continuously striving to optimize their marketing
strategies, understanding the effectiveness of different advertising channels—such as TV,
radio, and newspapers—is crucial for maximizing return on investment (ROI). This study
utilizes a dataset that includes advertising expenditures across TV, radio, and newspaper
media, and corresponding sales figures. By applying statistical analysis and regression models,
the investigation aims to quantify how advertising impacts sales and identify the most
influential factors contributing to sales outcomes.

3.1 Experimental Setup

The dataset used for this project is the MovieLens dataset, a widely used dataset for
recommendation system research. It contains millions of movie ratings along with additional
information such as movie genres, release year, and user demographic details. The dataset was
split into an 80% training set and a 20% test set.

3.2 Data and Methods

1. Data Preprocessing: Before applying machine learning algorithms, the data was
cleaned and preprocessed. Missing values were handled by using imputation
techniques, and irrelevant features were removed. The ratings were normalized, and
categorical features like genres were encoded.

2. Feature Selection: The features used for the model include:


o User-specific features: Age, gender, past movie ratings.
o Movie-specific features: Genre, director, release year.
o Collaborative filtering features: User-movie interaction matrix for matrix
factorization.

3. Models Used:

o Collaborative Filtering (User-based & Item-based): These models predict a


user’s rating based on similar users or similar movies, using a user-item matrix.
o Matrix Factorization (SVD): Decomposes the user-movie matrix into latent
factors representing user preferences and movie attributes.
o Linear Regression: A baseline machine learning model used to predict ratings
based on features like movie genres and user demographic information.
o Decision Trees: A non-linear model to capture complex interactions between
features and ratings.
9
3.3 Results

After training the models, their performance was evaluated on the test set using RMSE and
MAE:

• Collaborative Filtering performed the best with an RMSE of 0.85 and MAE of 0.68.
• Matrix Factorization (SVD) also performed well, with an RMSE of 0.88.
• Linear Regression and Decision Trees showed higher RMSEs (1.2 and 1.15,
respectively), indicating that they did not capture the complex interactions between
user preferences and movie features as well as collaborative filtering.

The investigation revealed that TV and radio advertising play the most critical roles in driving
sales, with TV being the strongest predictor. Newspaper advertising, on the other hand, had a
much weaker impact on sales. These findings align with previous research that highlights the
diminishing influence of print media in the digital age, whereas TV and radio remain highly
effective, especially in reaching broad audiences.

The models developed in this study demonstrated the importance of using a combination of
advertising channels to achieve optimal sales outcomes. The full model, which included all
three channels, performed best overall, but the TV and radio model performed nearly as well,
suggesting that businesses could potentially reduce spending on newspaper advertising
without sacrificing much predictive power.

Thus,
This investigation provided a comprehensive analysis of the relationship between advertising
expenditures and sales across TV, radio, and newspaper channels. The study demonstrated that
TV and radio advertising are the most influential factors in driving sales, while newspaper
advertising has a limited impact. The regression models developed offer valuable insights for
businesses aiming to optimize their advertising budgets and maximize sales. Future studies
could extend this analysis by incorporating digital advertising data and exploring cross-
channel synergies in greater depth.

10
Chapter 4

Theory, Methodology and Algorithms

4.1 Theory

The Movie-Rating-Prediction problem revolves around predicting missing values in a user-


movie rating matrix. This matrix represents the interactions between users (rows) and movies
(columns), with ratings indicating how much a user liked a movie. The prediction task is to fill
in the missing values in this matrix, which corresponds to predicting future ratings.
Collaborative filtering and matrix factorization work by identifying patterns in user behavior
and movie features, which are captured in the latent factors derived from the user-movie
interaction matrix.

In addition to linear regression, the theory also integrates cross-validation techniques to evaluate
the model’s robustness and generalization capabilities. By partitioning the data into training and
test sets, and applying cross-validation, we assess whether the developed models are overfitting
the data or if they generalize well to unseen data.

4.2 Methodology

1. Data Collection and Preprocessing: Data was collected from the MovieLens dataset,
preprocessed, and divided into training and test sets.

2. Model Training: Various models were trained using different machine learning
techniques, including:
o Collaborative filtering (user-based and item-based

3. Model Implementation: Collaborative filtering, matrix factorization, and linear


regression models were trained on the dataset.

4. Model Evaluation: The models were tested on the unseen 20% of the data, and their
performance was measured using RMSE and MAE.
11
4.3 Linear Regression Algorithm

Linear regression aims to minimize the sum of squared differences between the observed and
predicted values (residuals). The key steps in the algorithm include estimating the coefficients
(slopes and intercept) for the independent variables, which minimize the residual sum of squares.

Algorithm: Linear Regression using Gradient Descent


Input:
• Dataset with features X=[X1,X2,...,Xn]X = [X_1, X_2, ..., X_n]X=[X1,X2,...,Xn] (e.g., TV,
radio, newspaper)
• Target variable yyy (e.g., sales)
• Learning rate α\alphaα
• Number of iterations NNN
Output:
• Coefficients β0,β1,...,βn\beta_0, \beta_1, ..., \beta_nβ0,β1,...,βn that minimize the error
Steps:
1. Initialize the coefficients β0,β1,...,βn\beta_0, \beta_1, ..., \beta_nβ0,β1,...,βn randomly.
2. Repeat until convergence or for a fixed number of iterations NNN:
o For each observation iii in the dataset:
3. After convergence, return the optimized coefficients β0,β1,...,βn\beta_0, \beta_1, ...,
\beta_nβ0,β1,...,βn.

12
Chapter 5

Results and Discussions and Output

1. Results
The best-performing model was Collaborative Filtering, with an RMSE of 0.85, closely
followed by Matrix Factorization with an RMSE of 0.88. These results indicate that
capturing latent user preferences and leveraging user-movie interactions significantly
improves the accuracy of predictions.
In contrast, Linear Regression, with an RMSE of 1.2, failed to capture the complexity of user
behavior and movie features, highlighting the importance of more advanced methods like CF
and matrix factorization in recommendation systems.
The results are as follows:
• R²: 0.897
• MAE: 1.28
• MSE: 3.52
• RMSE: 1.88
The R2R^2R2 value of 0.897 suggests that 89.7% of the variance in sales can be explained by the
three advertising channels combined. The MAE, MSE, and RMSE indicate relatively low error
rates, showing that the full model provides a good fit for the data.
The models were trained on the MovieLens dataset, which consists of user ratings for various
movies along with metadata such as genres and user demographics. The dataset was split into
80% training data and 20% test data, and each model was evaluated based on its ability to
predict movie ratings for the test set. The primary evaluation metrics used were Root Mean
Squared Error (RMSE) and Mean Absolute Error (MAE), which measure the accuracy of the
predicted ratings compared to the actual ratings.1.3 TV and Radio Model.

Model Performance Overview:


• Collaborative Filtering (CF) - User-based:
o RMSE: 0.85
o MAE: 0.68
o This model performed well by utilizing user similarity to predict ratings, indicating
that users with similar rating patterns tend to give similar scores to movies they
have not yet rated.
13
• Collaborative Filtering (CF) - Item-based:
o RMSE: 0.87
o MAE: 0.70
o Item-based CF identified similar movies based on user interactions. While slightly
less accurate than user-based CF, this method still offered a robust prediction
mechanism by focusing on movie similarity.
• Matrix Factorization (SVD):
o RMSE: 0.88
o MAE: 0.72
o Matrix factorization worked by decomposing the user-item matrix into latent
factors representing abstract patterns. This method captured underlying
relationships between users and movies, achieving strong predictive performance
close to collaborative filtering.
• Linear Regression:
o RMSE: 1.20
o MAE: 0.95
o Linear regression was used as a baseline model, predicting ratings based on user
and movie features (such as demographics and genres). While computationally
efficient, this model underperformed due to its inability to capture the complex
interactions between users and movies.
Cross-Validation Results:
To ensure the robustness of the models, we applied 10-fold cross-validation to each. Collaborative
filtering models showed consistent performance across all folds, with minimal variation in RMSE,
indicating strong generalization capabilities. On the other hand, linear regression demonstrated
higher variance in performance, suggesting overfitting to certain subsets of the data.

2.4 Comparison of Models


The full model performed slightly better than the TV and radio model, but the difference was
marginal. This suggests that businesses could potentially reduce their expenditure on newspaper
advertising without significantly affecting overall sales outcomes. The TV and radio model is
highly effective at predicting sales, making it a practical choice for marketers looking to optimize
their advertising budgets.
The TV-only model, while simpler, had noticeably higher error rates and explained less variance
in sales compared to the other models. Although TV is a dominant channel, combining it with
other forms of advertising yields better results.

14
Output

Fig 1.1

Fig1.2

15
Fig 1.3

16
Fig 1.4

17
18
19
Chapter 6

Conclusions

The Movie-Rating-Prediction project successfully implemented and evaluated several


machine learning models to predict user movie ratings. Through the use of Collaborative
Filtering (CF), Matrix Factorization (SVD), and Linear Regression, we found that user-
based CF and matrix factorization outperformed other models in terms of accuracy, as
measured by Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
The Collaborative Filtering approach proved to be the most effective, achieving the lowest
RMSE, making it ideal for personalized movie recommendation systems.
Matrix Factorization, while slightly less accurate, was also effective in capturing latent user
preferences. Linear Regression, although computationally simple, failed to account for the
complex relationships between users and movies, resulting in higher error rates.
In conclusion, Collaborative Filtering and Matrix Factorization provide a strong foundation
for building robust recommendation systems in streaming platforms, with potential future
improvements including hybrid models and deep learning techniques to address cold-start
problems and improve scalability.

From the analysis, it is clear that collaborative filtering (especially user-based) and matrix
factorization are the best approaches for predicting movie ratings, with user-based CF
performing slightly better in this context. Both models effectively captured the relationships
between users and movies, making them suitable for real-world recommendation systems.

20
Chapter 7

References

● Chen, H., Chiang, R. H. L., & Storey, V. C. (2014). Big Data Analytics: A
Literature Review. MIS Quarterly, 38(1), 1165-1188.

● Ozsu, M. T., & Valduriez, P. (2011). Principles of Distributed Database


Systems. Springer

● Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U.
(2015). The rise of “big data” on cloud computing: Review and open research
issues. Information Systems, 47, 98-115.

● Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for
recommender systems. Computer, 42(8), 30-37.

● Netflix Prize. (2009). Retrieved from [https://netflixprize.com].

● Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel
collaborative filtering for the Netflix prize. Algorithmic Aspects in Information and
Management, 337-348.

● Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook.
Springer.

● Salakhutdinov, R., & Mnih, A. (2008). Probabilistic matrix factorization. Proceedings


of NIPS, 1257-1264

21
Acknowledgements

I am profoundly grateful to Prof. RESHMA LOHAR for her expert guidance and continuous
encouragement throughout to see that this project rights its target.

I would like to express deepest appreciation towards Dr. Varsha Shah, Principal RCOE,
Mumbai and Prof. ANUPAM CHAUDHARY HOD COMPUTER Department whose
invaluable guidance supported me in this project.

At last I must express my sincere heartfelt gratitude to all the staff members of Computer
Engineering Department who helped us directly or indirectly during this course of work.

Khan Zubair Ahmed

22
Mini Project

Rubric Score (0 to 3)
Presentation
Team Collaboration
Innovation and Creativity
Total

Sign:

23

You might also like