DSLAB5
Dataset: https://www.kaggle.com/datasets/rajugc/imdb-top-250-movies-dataset
Loading the required libraries
# Load required libraries
library(dplyr)
library(ggplot2)
library(tidyr)
21BCE2455 NIKHILESH MEHER
Cleaning and exploration:
# Handle outliers
# Identify outliers with boxplots or histograms, then decide whether to remove
# or transform them.
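# For example, remove rating outliers on both sides with the interquartile range (IQR) method:
rating_iqr <- IQR(imdb_data$rating, na.rm = TRUE)
rating_lower_bound <- quantile(imdb_data$rating, 0.25, na.rm = TRUE) - 1.5 * rating_iqr
rating_upper_bound <- quantile(imdb_data$rating, 0.75, na.rm = TRUE) + 1.5 * rating_iqr
# Keep only ratings inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
imdb_data <- imdb_data %>%
  filter(rating >= rating_lower_bound, rating <= rating_upper_bound)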
Distribution of ratings:
# Create a histogram to visualize the distribution of ratings
ggplot(imdb_data, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Movie Ratings on IMDB",
       x = "Rating", y = "Frequency") +
  theme_minimal()
Ratings vs. year
# Create a scatterplot to explore the relationship between release year and rating
ggplot(imdb_data, aes(x = year, y = rating)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + # Add a linear regression line
  labs(title = "Relationship Between Release Year and Rating",
       x = "Year", y = "Rating") +
  theme_minimal()
Rating by genre
# Group the data by genre and calculate the average rating for each genre
genre_ratings <- imdb_data %>%
  group_by(genre) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE))
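The grouping above treats every distinct value of genre as one group. If the genre column stores several genres per movie in one comma-separated string (an assumption about the file layout, e.g. "Crime,Drama"), each movie first needs to be split into one row per genre. The following is a minimal sketch of that idea using tidyr::separate_rows(); the data frame movies and its values are made-up stand-ins, not rows from the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for imdb_data: made-up titles and ratings,
# assuming genre is stored as a comma-separated string.
movies <- data.frame(
  name   = c("Movie A", "Movie B", "Movie C"),
  rating = c(9.0, 8.5, 8.0),
  genre  = c("Crime,Drama", "Drama", "Action,Crime"),
  stringsAsFactors = FALSE
)

genre_ratings_split <- movies %>%
  separate_rows(genre, sep = ",") %>%   # one row per movie-genre pair
  mutate(genre = trimws(genre)) %>%     # drop stray spaces around genre names
  group_by(genre) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE),
            n_movies   = n(),
            .groups    = "drop") %>%
  arrange(desc(avg_rating))             # highest-rated genre first

print(genre_ratings_split)
```

Keeping n_movies next to the average helps when reading the result, since a genre with only one or two movies can top the ranking by chance.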
To visualize correlations between different movie attributes using a
heatmap, you can calculate the correlation matrix using the cor()
function and then plot the matrix as a heatmap. Here's how you can
do it:
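A minimal sketch of the correlation step, assuming the attributes of interest are the columns year, rating and run_time (the names used in the summary) and that all three are numeric. In the raw file run_time may be stored as text and would need converting first. The data frame movies holds illustrative stand-in values only.

```r
# Hypothetical stand-in for imdb_data with illustrative values only;
# year, rating and run_time (in minutes) are assumed to be numeric columns.
movies <- data.frame(
  year     = c(1957, 1972, 1993, 1994, 2008),
  rating   = c(9.0, 9.2, 9.0, 9.3, 9.0),
  run_time = c(96, 175, 195, 142, 152)
)

# Pairwise Pearson correlations; rows with missing values are dropped.
correlation_matrix <- cor(movies[, c("year", "rating", "run_time")],
                          use = "complete.obs")

print(round(correlation_matrix, 2))
```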
# Convert the correlation matrix to a dataframe
correlation_df <- as.data.frame(correlation_matrix)
correlation_df$attributes <- rownames(correlation_df)
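One way to continue from correlation_df is to reshape it to long format and draw it with geom_tile(). The sketch below is an illustration under assumptions, not the original plotting code: it rebuilds a small correlation matrix from stand-in values so that it runs on its own, fixes the fill scale to the range -1 to 1, and prints each coefficient on its tile so weak correlations remain readable.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in data with illustrative values only.
movies <- data.frame(
  year     = c(1957, 1972, 1993, 1994, 2008),
  rating   = c(9.0, 9.2, 9.0, 9.3, 9.0),
  run_time = c(96, 175, 195, 142, 152)
)
correlation_matrix <- cor(movies)
correlation_df <- as.data.frame(correlation_matrix)
correlation_df$attributes <- rownames(correlation_df)

# Long format: one row per (attributes, variable) pair.
correlation_long <- correlation_df %>%
  pivot_longer(cols = -attributes,
               names_to = "variable",
               values_to = "correlation")

heatmap_plot <- ggplot(correlation_long,
                       aes(x = variable, y = attributes, fill = correlation)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(correlation, 2))) +   # show the value on each tile
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap of Movie Attributes",
       x = NULL, y = NULL, fill = "Correlation") +
  theme_minimal()

print(heatmap_plot)
```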
If the correlation matrix contains values close to zero, the variables are at most weakly
linearly correlated. On a diverging colour scale centred at zero, those tiles take the neutral
midpoint colour, so the heatmap appears mostly blank or gray.
Summary of findings and observations:
The distribution of movie ratings on IMDB is roughly normal, with a peak around the 7-8
rating range.
There is a weak positive relationship between release year and rating, indicating that
newer movies tend to have slightly higher ratings.
Among the genres, documentaries tend to have the highest average ratings, while
horror movies have the lowest.
The correlation heatmap reveals weak correlations between the attributes "year",
"rating", and "run_time", suggesting limited interdependence among these variables.
Lab Assessment 6: Interactive Visualization in R
Objectives:
Tasks:
1. Import and clean data: Download and import the data using read.csv(),
handling missing values and ensuring data types are appropriate.
2. Interactive visualization:
   o Create an interactive bar chart with ggplot2 and the plotly package.
   o Use plotly::ggplotly() to convert the ggplot object into an
     interactive plotly object.
   o Map delay categories (e.g., carrier delay, weather delay) to unique
     colors and bar labels.
   o Use sliders or dropdown menus to allow users to filter data:
        - Filter by origin or destination airport using a dropdown menu.
        - Filter by delay reason using a slider or checkbox group.
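The bar chart, colour mapping and ggplotly() conversion described in the tasks could be sketched as follows. The assessment text does not name a dataset, so the data frame delays, its column names and all minute values are made-up placeholders chosen for illustration.

```r
library(ggplot2)
library(plotly)

# Hypothetical flight-delay summary; all names and numbers are placeholders.
delays <- data.frame(
  origin        = rep(c("JFK", "LAX", "ORD"), each = 2),
  delay_reason  = rep(c("Carrier delay", "Weather delay"), times = 3),
  delay_minutes = c(120, 45, 90, 30, 150, 80)
)

# Static ggplot2 bar chart: one colour per delay category.
# The text aesthetic is only used by plotly for the hover label.
delay_bars <- ggplot(delays,
                     aes(x = origin, y = delay_minutes, fill = delay_reason,
                         text = paste0(delay_reason, ": ", delay_minutes, " min"))) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Carrier delay" = "steelblue",
                               "Weather delay" = "darkorange")) +
  labs(title = "Delay Minutes by Origin Airport and Reason",
       x = "Origin airport", y = "Delay (minutes)", fill = "Delay reason") +
  theme_minimal()

# Convert the ggplot object into an interactive plotly object.
delay_bars_interactive <- ggplotly(delay_bars, tooltip = "text")
delay_bars_interactive
```

Clicking a legend entry in the resulting plot hides or shows that delay reason, which already works as a simple filter. A dropdown for origin or destination airport needs an extra control (for example a plotly updatemenus layout or a Shiny input) and is not shown here.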
Bonus task:
Add a map visualization to the interactive display, highlighting the origins and
destinations of delayed flights using colour or marker size based on delay
severity.
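A possible starting point for the bonus map, using plotly's plot_geo() with marker size and colour driven by delay severity. The data frame airports is hypothetical: the coordinates are approximate and the avg_delay values are made-up placeholders, not measured delays.

```r
library(plotly)

# Hypothetical airport table: approximate coordinates, placeholder delays.
airports <- data.frame(
  airport   = c("JFK", "LAX", "ORD"),
  lat       = c(40.64, 33.94, 41.97),
  lon       = c(-73.78, -118.41, -87.91),
  avg_delay = c(35, 20, 50)
)

# Larger, darker markers indicate a higher average delay.
delay_map <- plot_geo(airports, lat = ~lat, lon = ~lon) %>%
  add_markers(
    size      = ~avg_delay,
    color     = ~avg_delay,
    colors    = "Reds",
    text      = ~paste0(airport, ": ", avg_delay, " min average delay"),
    hoverinfo = "text"
  ) %>%
  layout(title = "Average Delay by Origin Airport",
         geo   = list(scope = "usa"))

delay_map
```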
Deliverables: