BDA Mini Project Report
Bachelor of Engineering
by
Khan Zubair Ahmed 211P026
(22)
Guide:
(Prof. Reshma Lohar)
University of Mumbai
2024-2025
CERTIFICATE
This is to certify that the mini-project entitled “Movie-Rating-Prediction” is a bonafide work
of “Khan Zubair Ahmed” (Roll No: 22) submitted to the University of Mumbai in partial
fulfillment of the requirement for the Mini-Project of Big Data Analysis for the Final Year of the
Bachelor of Engineering in “Computer Engineering”.
Prof. Reshma Lohar
Guide
Prof. Anupam Chaudhary, Head of Department
Dr. Varsha Shah, Principal
Declaration
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original sources.
I also declare that I have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.
Date:
ABSTRACT
This project aims to predict movie ratings based on user inputs, using advanced
machine learning techniques. The model is trained on a dataset of movie ratings and
features such as user preferences, movie metadata, and collaborative filtering
techniques. By utilizing regression algorithms and recommendation systems, this
project provides a predictive model that can anticipate user ratings for movies they
haven't watched yet.
The study evaluates various machine learning models, including linear regression,
decision trees, and collaborative filtering algorithms, to determine the most accurate
model for this task.
In addition to model performance, the models are subjected to 10-fold cross-
validation to assess their generalization capability. The results show how advertising
spending across different channels impacts sales and provide insights into the
effectiveness of each channel.
This analysis offers a comprehensive understanding of advertising's role in driving
sales and provides a basis for optimizing marketing strategies.
Index
Sr. No Title
1. Introduction
2. Review of Literature
2.1 Paper 1
2.2 Paper 2
3. Report on Present Investigation
3.1 Experimental Setup
3.1.1 Hardware and Software Requirements
3.2 Data and Methods
3.3 Results
4. Theory, Methodology and Algorithm
4.1 Theory
4.2 Methodology
4.2.1 Data Collection and Preprocessing
4.3 Algorithms
5. Results and Discussions
5.1 Results
5.2 Discussions
5.3 Outputs
6. Conclusion
7. References
8. Acknowledgement
Assessments
Chapter 1
Introduction
In recent years, with the vast growth of online streaming platforms like Netflix, Amazon
Prime, and Hulu, recommendation systems have gained tremendous importance. These
platforms use sophisticated algorithms to predict user preferences and recommend content
tailored to individual tastes. One crucial aspect of these systems is movie rating
prediction, in which a platform predicts how a user might rate a movie they haven't seen yet,
based on their previous ratings and behavior.
Chapter 2
Review of Literature
• User-based collaborative filtering (CF) identifies users with similar tastes and predicts
ratings based on their collective preferences.
• Item-based CF identifies similar movies based on past user interactions and predicts a
user's rating for a movie from the ratings that user gave to similar movies.
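The user-based variant can be illustrated with a minimal sketch. It assumes cosine similarity over co-rated movies and a similarity-weighted average as the prediction rule; the report does not state which similarity measure was used, and the toy ratings, user IDs, and function names below are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    common = [m for m in a if m in b]
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def predict_user_based(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = 0.0
    den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[movie]
        den += abs(sim)
    return num / den if den else None

# Toy data (hypothetical, not taken from MovieLens): user -> {movie: rating}
ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 5, "B": 3, "C": 4},
    "u3": {"A": 1, "B": 5, "C": 2},
}

# u1 rates like u2, so the prediction for movie C is pulled toward u2's rating of 4.
pred = predict_user_based(ratings, "u1", "C")
print(round(pred, 2))
```

Item-based CF follows the same pattern with the roles swapped: similarities are computed between movies, and the prediction averages the target user's own ratings of similar movies.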
More recently, the application of machine learning techniques to advertisement sales analysis
has gained prominence. Techniques such as random forests, support vector machines, and
neural networks have been used to model the nonlinear and complex relationships between
advertising spending, consumer behavior, and sales outcomes (Srinivasan, Vanhuele, &
Pauwels, 2010). These methods are particularly useful in handling large datasets and
identifying patterns that traditional statistical models might miss. However, the interpretability
of these models remains a challenge, which is critical for decision-makers who need actionable
insights.
Chapter 3
1. Introduction
This report presents an investigation into the relationship between advertising expenditures
and sales outcomes. With businesses continuously striving to optimize their marketing
strategies, understanding the effectiveness of different advertising channels—such as TV,
radio, and newspapers—is crucial for maximizing return on investment (ROI). This study
utilizes a dataset that includes advertising expenditures across TV, radio, and newspaper
media, and corresponding sales figures. By applying statistical analysis and regression models,
the investigation aims to quantify how advertising impacts sales and identify the most
influential factors contributing to sales outcomes.
The dataset used for this project is the MovieLens dataset, a widely used dataset for
recommendation system research. It contains millions of movie ratings along with additional
information such as movie genres, release year, and user demographic details. The dataset was
split into an 80% training set and a 20% test set.
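As an illustration of the 80/20 split, the sketch below shuffles rating records and holds out one fifth of them. The report does not say whether the split was random or how it was seeded, so the shuffling, the seed, and the `(user_id, movie_id, rating)` record layout are assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical rating records: (user_id, movie_id, rating)
rows = [(u, m, 3.0) for u in range(10) for m in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

A fixed seed keeps the split reproducible, so different models are compared on the same held-out ratings.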
1. Data Preprocessing: Before applying machine learning algorithms, the data was
cleaned and preprocessed. Missing values were handled by using imputation
techniques, and irrelevant features were removed. The ratings were normalized, and
categorical features like genres were encoded.
3. Models Used:
After training the models, their performance was evaluated on the test set using RMSE and
MAE:
• Collaborative Filtering performed the best with an RMSE of 0.85 and MAE of 0.68.
• Matrix Factorization (SVD) also performed well, with an RMSE of 0.88.
• Linear Regression and Decision Trees showed higher RMSEs (1.2 and 1.15,
respectively), indicating that they did not capture the complex interactions between
user preferences and movie features as well as collaborative filtering.
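For reference, the two metrics can be computed as in the following sketch. The rating values are made up for illustration and are unrelated to the figures reported above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: average size of the error in rating points."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Illustrative values only (not results from this project)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

RMSE is never smaller than MAE on the same predictions, which matches the pattern in the figures above (0.85 against 0.68 for collaborative filtering).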
The investigation revealed that TV and radio advertising play the most critical roles in driving
sales, with TV being the strongest predictor. Newspaper advertising, on the other hand, had a
much weaker impact on sales. These findings align with previous research that highlights the
diminishing influence of print media in the digital age, whereas TV and radio remain highly
effective, especially in reaching broad audiences.
The models developed in this study demonstrated the importance of using a combination of
advertising channels to achieve optimal sales outcomes. The full model, which included all
three channels, performed best overall, but the TV and radio model performed nearly as well,
suggesting that businesses could potentially reduce spending on newspaper advertising
without sacrificing much predictive power.
Thus, this investigation provided a comprehensive analysis of the relationship between advertising
expenditures and sales across TV, radio, and newspaper channels. The study demonstrated that
TV and radio advertising are the most influential factors in driving sales, while newspaper
advertising has a limited impact. The regression models developed offer valuable insights for
businesses aiming to optimize their advertising budgets and maximize sales. Future studies
could extend this analysis by incorporating digital advertising data and exploring cross-
channel synergies in greater depth.
Chapter 4
4.1 Theory
In addition to linear regression, the theory also integrates cross-validation techniques to evaluate
the model’s robustness and generalization capabilities. By partitioning the data into training and
test sets, and applying cross-validation, we assess whether the developed models are overfitting
the data or if they generalize well to unseen data.
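A minimal sketch of k-fold cross-validation is given below, using k = 10 as stated in the abstract. The `fit` and `score` callables and the mean-rating baseline are hypothetical stand-ins for the actual models, and the folds are taken in order without shuffling to keep the example short; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, fit, score, k=10):
    """Average the held-out score over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Hypothetical baseline model: always predict the mean training rating.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def score_mae(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)

xs = list(range(50))
ys = [1.0, 2.0, 3.0, 4.0, 5.0] * 10
avg_mae = cross_validate(xs, ys, fit_mean, score_mae, k=10)
print(avg_mae)
```

A held-out score that is much worse than the training score would point to overfitting; similar scores across folds suggest the model generalizes.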
4.2 Methodology
1. Data Collection and Preprocessing: Data was collected from the MovieLens dataset,
preprocessed, and divided into training and test sets.
2. Model Training: Various models were trained using different machine learning
techniques, including:
o Collaborative filtering (user-based and item-based)
4. Model Evaluation: The models were tested on the unseen 20% of the data, and their
performance was measured using RMSE and MAE.
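The preprocessing mentioned in step 1 (imputation, rating normalization, genre encoding) could look like the sketch below. The report does not specify the exact techniques, so mean imputation, min-max scaling over an assumed 0.5 to 5.0 rating scale, and multi-hot encoding of pipe-separated genre strings are all assumptions, as are the function names.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values, lo=0.5, hi=5.0):
    """Scale ratings to [0, 1]; lo and hi are the assumed rating bounds."""
    return [(v - lo) / (hi - lo) for v in values]

def encode_genres(genre_strings, sep="|"):
    """Multi-hot encode genre strings such as 'Action|Comedy'."""
    vocab = sorted({g for s in genre_strings for g in s.split(sep)})
    rows = [[1 if g in s.split(sep) else 0 for g in vocab]
            for s in genre_strings]
    return vocab, rows

# Illustrative inputs (hypothetical values)
ratings = impute_mean([4.0, None, 2.0])
scaled = min_max_normalize([0.5, 5.0, 2.75])
vocab, encoded = encode_genres(["Action|Comedy", "Drama", "Comedy"])

print(ratings)
print(scaled)
print(vocab, encoded)
```

If ratings are normalized before training, predictions have to be mapped back to the original scale before RMSE and MAE are reported in rating points.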
4.3 Linear Regression Algorithm
Linear regression aims to minimize the sum of squared differences between the observed and
predicted values (residuals). The key steps in the algorithm include estimating the coefficients
(slopes and intercept) for the independent variables, which minimize the residual sum of squares.
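To make the least-squares idea concrete, the sketch below fits a single-feature model with the closed-form solution. This is a simplification: the models in this report use several independent variables, which would require solving the normal equations or using a library. The data points are made up.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residual_sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Illustrative data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)  # 2.0 1.0
print(residual_sum_of_squares(xs, ys, slope, intercept))  # 0.0
```

Any other choice of slope or intercept gives a larger residual sum of squares on the same data, which is what "minimizing the residuals" means here.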
Chapter 5
1. Results
The best-performing model was Collaborative Filtering, with an RMSE of 0.85, closely
followed by Matrix Factorization with an RMSE of 0.88. These results indicate that
capturing latent user preferences and leveraging user-movie interactions significantly
improves the accuracy of predictions.
In contrast, Linear Regression, with an RMSE of 1.2, failed to capture the complexity of user
behavior and movie features, highlighting the importance of more advanced methods like CF
and matrix factorization in recommendation systems.
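The idea of latent user preferences can be sketched with a small matrix factorization model trained by stochastic gradient descent. This is an illustration under assumptions, not the implementation used in the project: the report does not describe how its SVD model was fitted, and the hyperparameters, function names, and toy ratings below are hypothetical.

```python
import math
import random

def predict(mu, P, Q, u, i):
    """Predicted rating: global mean plus dot product of latent factors."""
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.05, reg=0.02,
             epochs=200, seed=0):
    """Fit user factors P and item factors Q on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    P = [[rng.uniform(-0.5, 0.5) for _ in range(n_factors)]
         for _ in range(n_users)]
    Q = [[rng.uniform(-0.5, 0.5) for _ in range(n_factors)]
         for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(mu, P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q

def rmse(ratings, mu, P, Q):
    """RMSE of the model on a list of (user, item, rating) triples."""
    se = sum((r - predict(mu, P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))

# Toy data: users 0-1 like items 0-1, users 2-3 like items 2-3.
# Two ratings, (0, 1) and (3, 2), are left out as unobserved.
ratings = [(u, i, 5.0 if (u < 2) == (i < 2) else 1.0)
           for u in range(4) for i in range(4)
           if (u, i) not in {(0, 1), (3, 2)}]

mu, P, Q = train_mf(ratings, n_users=4, n_items=4)
zeros_u = [[0.0, 0.0] for _ in range(4)]
zeros_i = [[0.0, 0.0] for _ in range(4)]
baseline_rmse = rmse(ratings, mu, zeros_u, zeros_i)  # predict the mean only
trained_rmse = rmse(ratings, mu, P, Q)
print(baseline_rmse, trained_rmse)
```

On this toy data the learned factors separate the two taste groups, so the training error falls well below the predict-the-mean baseline. A linear model on raw features has no per-user and per-movie latent vectors of this kind.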
The results are as follows:
• R²: 0.897
• MAE: 1.28
• MSE: 3.52
• RMSE: 1.88
The R² value of 0.897 suggests that 89.7% of the variance in sales can be explained by the
three advertising channels combined. The MAE, MSE, and RMSE indicate relatively low error
rates, showing that the full model provides a good fit for the data.
The models were trained on the MovieLens dataset, which consists of user ratings for various
movies along with metadata such as genres and user demographics. The dataset was split into
80% training data and 20% test data, and each model was evaluated based on its ability to
predict movie ratings for the test set. The primary evaluation metrics used were Root Mean
Squared Error (RMSE) and Mean Absolute Error (MAE), which measure the accuracy of the
predicted ratings compared to the actual ratings.
1.3 TV and Radio Model
Output
Fig 1.1
Fig 1.2
Fig 1.3
Fig 1.4
Chapter 6
Conclusions
From the analysis, it is clear that collaborative filtering (especially user-based) and matrix
factorization are the best approaches for predicting movie ratings, with user-based CF
performing slightly better in this context. Both models effectively captured the relationships
between users and movies, making them suitable for real-world recommendation systems.
Chapter 7
References
● Chen, H., Chiang, R. H. L., & Storey, V. C. (2014). Big Data Analytics: A
Literature Review. MIS Quarterly, 38(1), 1165-1188.
● Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U.
(2015). The rise of “big data” on cloud computing: Review and open research
issues. Information Systems, 47, 98-115.
● Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for
recommender systems. Computer, 42(8), 30-37.
● Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel
collaborative filtering for the Netflix prize. Algorithmic Aspects in Information and
Management, 337-348.
● Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender Systems Handbook.
Springer.
Acknowledgements
I am profoundly grateful to Prof. Reshma Lohar for her expert guidance and continuous
encouragement throughout, which ensured that this project reached its target.
I would like to express my deepest appreciation to Dr. Varsha Shah, Principal, RCOE,
Mumbai, and Prof. Anupam Chaudhary, HOD, Computer Department, whose
invaluable guidance supported me in this project.
Lastly, I must express my sincere, heartfelt gratitude to all the staff members of the Computer
Engineering Department who helped me directly or indirectly during the course of this work.
Mini Project
Rubric Score (0 to 3)
Presentation
Team Collaboration
Innovation and Creativity
Total
Sign: