Project Problem Statement
Statistics with R Project by Mohit Narang 222113 & Neerh Bordoloi 221098
Overview
Dataset
The dataset used in this project was downloaded from
[IMDb datasets](https://datasets.imdbws.com). It consists of multiple tab-separated values (TSV) files
containing various types of information about movies, such as titles, ratings, genres,
crew members, and more. We drew on the following seven files when building the
recommendation system; some feed the recommendations directly, others provide context:
1. name.basics.tsv: This file contains information about people in the film industry,
including their names, birth and death years, and the titles they've been involved with.
Although not used directly for recommendations, it provides additional context about
crew members.
2. title.akas.tsv: This file includes alternate titles for movies across different regions and
languages. It was used to account for variations in movie titles, ensuring comprehensive
matching when users input movie names.
3. title.basics.tsv: This file contains primary information about movies, such as title,
release year, and genre. It served as the main source for filtering movies by type (e.g.,
movies only) and extracting genres for recommendation purposes. The genre column
was further split to allow multi-genre analysis.
4. title.crew.tsv: This file provides details on the directors and writers of movies. While
not directly used for recommendations, it could enhance future iterations of the system
by incorporating crew-based filtering.
7. title.ratings.tsv: This file includes user ratings and the number of votes each movie
has received. It was critical for ranking movies by average rating and number of votes to
generate meaningful recommendations.
Each of these files was processed using R packages like `readr` for reading the data
and `dplyr` for filtering, merging, and cleaning the information to ensure quality
recommendations.
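The processing steps described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code, which is not reproduced here. The column names (`tconst`, `titleType`, `genres`, `averageRating`, `numVotes`) follow IMDb's published file layout, and `\N` is IMDb's marker for missing values; the rows themselves are made up and written to temporary files so the sketch runs without the real downloads.

```r
library(readr)
library(dplyr)
library(tidyr)

# Tiny stand-ins for title.basics.tsv and title.ratings.tsv (made-up rows).
basics_path <- tempfile(fileext = ".tsv")
ratings_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres",
  "tt0000001\tmovie\tAlpha\t1999\tAction,Drama",
  "tt0000002\tshort\tBeta\t2001\tComedy",
  "tt0000003\tmovie\tGamma\t\\N\tDrama"
), basics_path)
writeLines(c(
  "tconst\taverageRating\tnumVotes",
  "tt0000001\t7.5\t1200",
  "tt0000002\t6.1\t300",
  "tt0000003\t8.2\t50"
), ratings_path)

# na = "\\N" turns IMDb's missing-value marker into NA;
# quote = "" keeps stray quotation marks inside titles literal.
basics <- read_tsv(basics_path, na = "\\N", quote = "", col_types = "cccic")
ratings <- read_tsv(ratings_path, na = "\\N", quote = "", col_types = "cdi")

movies <- basics %>%
  filter(titleType == "movie") %>%        # keep movies only
  separate_rows(genres, sep = ",") %>%    # one row per (movie, genre) pair
  inner_join(ratings, by = "tconst") %>%  # attach average rating and vote count
  arrange(desc(averageRating), desc(numVotes))

print(movies)
```

With the real files, the two `writeLines()` calls would be dropped and the paths would point at the downloaded `title.basics.tsv` and `title.ratings.tsv`.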
Tools and Technologies
- R: The primary programming language used for data analysis and building the
application.
- Shiny: Used for creating the interactive web-based movie recommendation system.
- dplyr: For data manipulation, filtering, and merging datasets.
- readr: To read TSV files and handle data import.
- tidyr: To transform and tidy the data, including splitting genres.
- ggplot2: For creating visualizations to enhance data presentation.
- stringr: For handling string operations during data preprocessing.
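To show how `dplyr` and `stringr` might work together in the recommendation step, here is a hedged sketch. The helper `recommend_similar()`, its `min_votes` threshold, and the sample rows are all hypothetical; the report does not spell out the exact matching or ranking rules, so this assumes a simple approach: match the user's input case-insensitively, then rank other movies that share a genre by average rating and vote count.

```r
library(dplyr)
library(stringr)
library(tibble)

# Made-up long-format table: one row per (movie, genre), ratings attached.
movies <- tribble(
  ~tconst,     ~primaryTitle, ~genres,  ~averageRating, ~numVotes,
  "tt0000001", "Alpha",       "Action", 7.5,            1200,
  "tt0000001", "Alpha",       "Drama",  7.5,            1200,
  "tt0000003", "Gamma",       "Drama",  8.2,            50,
  "tt0000004", "Delta",       "Drama",  6.9,            5000,
  "tt0000005", "Epsilon",     "Comedy", 9.0,            8000
)

# Hypothetical helper: recommend titles sharing a genre with the matched movie.
recommend_similar <- function(movies, query, min_votes = 100, n = 5) {
  # Case-insensitive literal match on the user's input
  matched <- movies %>%
    filter(str_detect(primaryTitle, fixed(query, ignore_case = TRUE)))
  if (nrow(matched) == 0) {
    # No match: return an empty table with the same output columns
    return(matched[0, c("primaryTitle", "averageRating", "numVotes")])
  }

  movies %>%
    filter(genres %in% matched$genres,      # shares at least one genre
           !tconst %in% matched$tconst,     # exclude the queried movie itself
           numVotes >= min_votes) %>%       # assumed cutoff for thinly rated titles
    distinct(tconst, primaryTitle, averageRating, numVotes) %>%
    arrange(desc(averageRating), desc(numVotes)) %>%
    head(n) %>%
    select(primaryTitle, averageRating, numVotes)
}

recs <- recommend_similar(movies, "alpha")
print(recs)
```

In a Shiny app, a function like this would typically be called from the server logic with the user's text input as `query`, and its result rendered as a table.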
Expected Outcomes
Upon completing this project, we gained valuable skills and insights, including:
Conclusion