DSLAB5

The document discusses analyzing airline flight delay data using interactive visualizations in R. It describes cleaning the OpenFlights dataset and creating an interactive bar chart to visualize delays by airline with filters for airports and delay reasons. Additionally, it mentions adding an interactive map to highlight delayed flights between origins and destinations.

Uploaded by

nikhileshmeher24

Tasks:

Dataset: https://www.kaggle.com/datasets/rajugc/imdb-top-250-movies-dataset
Loading the required libraries
# Load required libraries
library(dplyr)
library(ggplot2)
library(tidyr)

Importing the IMDB top 250 movies dataset


# Import the data
imdb_data <- read.csv("D:/IMDB Top 250 Movies.csv", stringsAsFactors = FALSE)
print(imdb_data)
imdb_data$rating <- as.numeric(imdb_data$rating)

21BCE2455 NIKHILESH MEHER
Cleaning and exploration:

# Check for missing values
missing_values <- imdb_data %>%
  summarise_all(~ sum(is.na(.)))
print(missing_values)

# Check for outliers
# For numeric variables like rating, year, run_time, etc., you can use summary statistics
# or visualize distributions
summary(imdb_data$rating)
summary(imdb_data$year)
summary(imdb_data$run_time)

# Handle missing values
# Depending on the context, you can choose to drop rows with missing values or impute
# them with mean/median values
# For example, to drop rows with missing values:
imdb_data <- imdb_data %>%
  drop_na()
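
The comment above mentions imputation as an alternative to dropping rows. A minimal sketch of median imputation follows, assuming a small made-up data frame (`toy`) in place of `imdb_data`; the column values are illustrative only.

```r
library(dplyr)

# Toy data standing in for imdb_data (values are made up for illustration)
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  rating = c(8.1, NA, 8.5, 9.0),
  year   = c(1994, 2008, NA, 1972),
  stringsAsFactors = FALSE
)

# Replace NA in every numeric column with that column's median
toy_imputed <- toy %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))

print(toy_imputed)
```

Unlike drop_na(), this keeps all rows, which matters for a dataset of only 250 movies.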

# Handle outliers
# You can identify outliers using boxplots or histograms and decide whether to remove
# or transform them
# For example, to remove outliers in rating using the interquartile range (IQR) method:
rating_iqr <- IQR(imdb_data$rating)
rating_lower_bound <- quantile(imdb_data$rating, 0.25) - 1.5 * rating_iqr
rating_upper_bound <- quantile(imdb_data$rating, 0.75) + 1.5 * rating_iqr
imdb_data <- imdb_data %>%
  filter(rating >= rating_lower_bound, rating <= rating_upper_bound)

# Check for inconsistencies
# For categorical variables like genre, you can check for unique values and their
# frequencies
unique_genres <- unique(imdb_data$genre)
genre_counts <- imdb_data %>%
  count(genre)
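
Counting the genre column directly treats each distinct string as its own category. If a movie lists several genres in one comma-separated string (an assumption about the file, shown here with made-up rows), a sketch like the following counts each genre separately using tidyr::separate_rows().

```r
library(dplyr)
library(tidyr)

# Toy rows; the comma-separated genre format is an assumption about the data
toy <- data.frame(
  name   = c("A", "B", "C"),
  genre  = c("Crime,Drama", "Drama", "Action, Crime"),
  rating = c(9.2, 8.8, 8.5),
  stringsAsFactors = FALSE
)

genre_counts_long <- toy %>%
  separate_rows(genre, sep = ",") %>%   # one row per movie-genre pair
  mutate(genre = trimws(genre)) %>%     # drop stray spaces around names
  count(genre, sort = TRUE)

print(genre_counts_long)
```

The same long format can feed the per-genre average rating, at the cost of counting a multi-genre movie once per genre.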

# Display cleaned data


head(imdb_data)

Distribution of ratings:
# Create a histogram to visualize the distribution of ratings
ggplot(imdb_data, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Movie Ratings on IMDB",
       x = "Rating", y = "Frequency") +
  theme_minimal()

Ratings vs. year

# Create a scatterplot to explore the relationship between release year and rating
ggplot(imdb_data, aes(x = year, y = rating)) +
  geom_point() +
  labs(title = "Relationship Between Release Year and Rating",
       x = "Year", y = "Rating") +
  theme_minimal() +
  geom_smooth(method = "lm", se = FALSE) # Add a linear regression line

# Calculate and display the correlation coefficient


correlation_coefficient <- cor(imdb_data$year, imdb_data$rating)
print(paste("Correlation coefficient:", round(correlation_coefficient, 2)))
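
A correlation coefficient on its own does not say whether the relationship could be due to chance. As a hedged sketch, base R's cor.test() reports the same Pearson estimate together with a p-value; the `toy` data frame below is made up and only stands in for `imdb_data`.

```r
# Toy year/rating pairs (illustrative values, not taken from the dataset)
toy <- data.frame(
  year   = c(1972, 1994, 1999, 2008, 2010, 2019),
  rating = c(9.2, 9.3, 8.8, 9.0, 8.8, 8.5)
)

# Pearson correlation test: estimate, confidence interval and p-value
test_result <- cor.test(toy$year, toy$rating, method = "pearson")

cat("r =", round(unname(test_result$estimate), 2),
    "| p-value =", signif(test_result$p.value, 3), "\n")
```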

Rating by genre

# Group the data by genre and calculate the average rating for each genre
genre_ratings <- imdb_data %>%
  group_by(genre) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE))

# Create a bar chart to compare ratings across different genres
ggplot(genre_ratings, aes(x = reorder(genre, avg_rating), y = avg_rating)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Average Rating by Genre",
       x = "Genre", y = "Average Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

To visualize correlations between different movie attributes using a
heatmap, you can calculate the correlation matrix using the cor()
function and then plot the matrix as a heatmap. Here's how you can
do it:

# Convert necessary columns to numeric


imdb_data$year <- as.numeric(imdb_data$year)
imdb_data$rating <- as.numeric(imdb_data$rating)
imdb_data$run_time <- as.numeric(imdb_data$run_time)

# Check for missing values and handle them if necessary


# For simplicity, you can drop rows with missing values
imdb_data <- na.omit(imdb_data)

# Calculate the correlation matrix


correlation_matrix <- cor(imdb_data[, c("year", "rating", "run_time")])
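
One caveat with the conversion above: as.numeric() returns NA for any text that is not a plain number, and na.omit() then drops those rows. If run_time is stored as text such as "2h 22m" (an assumption about the file, not something the report confirms), a helper along these lines could convert it to minutes before the correlation step. The name `parse_run_time` is hypothetical.

```r
# Convert strings such as "2h 22m", "3h" or "45m" to total minutes.
# Hypothetical helper; the "Xh Ym" format is an assumption about the data.
parse_run_time <- function(x) {
  x <- as.character(x)
  extract_number <- function(pattern) {
    hits <- regmatches(x, regexec(pattern, x))
    # A successful match has two elements: full match and the captured digits
    vapply(hits, function(m) if (length(m) == 2) as.numeric(m[2]) else 0,
           numeric(1))
  }
  hours   <- extract_number("([0-9]+)[[:space:]]*h")
  minutes <- extract_number("([0-9]+)[[:space:]]*m")
  total   <- hours * 60 + minutes
  # Strings with neither an hour nor a minute part stay missing
  total[!grepl("[0-9]+[[:space:]]*[hm]", x)] <- NA_real_
  total
}

run_time_minutes <- parse_run_time(c("2h 22m", "3h", "45m", "Not Available"))
print(run_time_minutes)
```

Applied as imdb_data$run_time <- parse_run_time(imdb_data$run_time) before na.omit(), this would keep rows that a direct as.numeric() call turns into NA.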

# Convert the correlation matrix to a dataframe
correlation_df <- as.data.frame(correlation_matrix)
correlation_df$attributes <- rownames(correlation_df)

# Reshape the dataframe for plotting


correlation_df <- tidyr::gather(correlation_df, key = "attribute", value = "correlation", -attributes)

# Plot the heatmap


ggplot(correlation_df, aes(x = attribute, y = attributes, fill = correlation)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Correlation Heatmap",
       x = "Attribute", y = "Attribute", fill = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(correlation_matrix)
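
The white-to-steelblue gradient above cannot distinguish negative from positive correlations. As an alternative sketch, a diverging scale centred on zero with the value printed in each tile is easier to read. The built-in mtcars columns stand in for the movie attributes here, and the names `corr_long`, `row_attr` and `col_attr` are illustrative.

```r
library(ggplot2)
library(tidyr)

# Built-in mtcars columns stand in for year, rating and run_time
corr <- cor(mtcars[, c("mpg", "hp", "wt")])

# Reshape the matrix to long format: one row per pair of attributes
corr_long <- as.data.frame(corr)
corr_long$row_attr <- rownames(corr_long)
corr_long <- pivot_longer(corr_long, cols = -row_attr,
                          names_to = "col_attr", values_to = "correlation")

heatmap_plot <- ggplot(corr_long,
                       aes(x = col_attr, y = row_attr, fill = correlation)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(correlation, 2))) +
  # Diverging palette: negative = red, zero = white, positive = blue
  scale_fill_gradient2(low = "firebrick", mid = "white", high = "steelblue",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap", x = "Attribute", y = "Attribute",
       fill = "Correlation") +
  theme_minimal()

print(heatmap_plot)
```

Fixing the limits to [-1, 1] keeps the colors comparable across datasets instead of stretching the palette over whatever range the data happens to cover.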

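If the correlation matrix contains values close to zero, it indicates weak or no linear correlation
between the variables, and those tiles appear pale under the white-to-steelblue scale. Gray tiles
mean something different: ggplot2 draws missing (NA) correlations in gray, which can happen, for
example, when a column contains NA values.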

Summary of findings and observations:

The distribution of movie ratings on IMDB is roughly normal, with a peak around the 7-8
rating range.
There is a weak positive relationship between release year and rating, indicating that
newer movies tend to have slightly higher ratings.
Among the genres, documentaries tend to have the highest average ratings, while
horror movies have the lowest.
The correlation heatmap reveals weak correlations between the attributes "year",
"rating", and "run_time", suggesting limited interdependence among these variables.

Lab Assessment 6: Interactive Visualization in R

Title: Analyzing Airline Flight Delays

Data: You can use the OpenFlights dataset from Kaggle.

Objectives:

• Visualize flight delays by airline using an interactive bar chart.
• Enable users to filter data by specific airports or delay reasons.

Tasks:

1. Import and clean data: Download and import the data using read.csv(),
handling missing values and ensuring data types are appropriate.
2. Interactive visualization:
   o Create an interactive bar chart with ggplot2 and the plotly package.
   o Use plotly::ggplotly() to convert the ggplot object into an
     interactive plotly object.
   o Map delay categories (e.g., carrier delay, weather delay) to unique
     colors and bar labels.
   o Use sliders or dropdown menus to allow users to filter data:
      • Filter by origin or destination airport using a dropdown menu.
      • Filter by delay reason using a slider or checkbox group.
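
The tasks above could be sketched roughly as follows. This is one possible approach, not a reference solution: it uses shiny for the dropdown and checkbox group (the task itself only names ggplot2 and plotly), and the `delays` data frame, its column names, and the `filter_delays` helper are all hypothetical, with made-up values.

```r
library(dplyr)
library(ggplot2)
library(plotly)
library(shiny)

# Hypothetical long-format delay data; names and values are illustrative
delays <- data.frame(
  airline = rep(c("AA", "DL", "UA"), each = 4),
  origin  = rep(c("JFK", "LAX"), times = 6),
  reason  = rep(c("carrier", "carrier", "weather", "weather"), times = 3),
  minutes = c(30, 45, 10, 20, 25, 15, 40, 5, 50, 35, 15, 30),
  stringsAsFactors = FALSE
)

# Keep one origin airport and the selected delay reasons, then total per airline
filter_delays <- function(data, airport, reasons) {
  data %>%
    filter(origin == airport, reason %in% reasons) %>%
    group_by(airline, reason) %>%
    summarise(total_minutes = sum(minutes), .groups = "drop")
}

ui <- fluidPage(
  selectInput("airport", "Origin airport",
              choices = sort(unique(delays$origin))),
  checkboxGroupInput("reasons", "Delay reason",
                     choices = unique(delays$reason),
                     selected = unique(delays$reason)),
  plotlyOutput("delay_chart")
)

server <- function(input, output, session) {
  output$delay_chart <- renderPlotly({
    req(input$reasons)  # wait until at least one reason is ticked
    plot_data <- filter_delays(delays, input$airport, input$reasons)
    p <- ggplot(plot_data,
                aes(x = airline, y = total_minutes, fill = reason)) +
      geom_col(position = "dodge") +
      labs(title = "Delay Minutes by Airline",
           x = "Airline", y = "Total delay (minutes)", fill = "Delay reason") +
      theme_minimal()
    ggplotly(p)  # convert the ggplot object to an interactive plotly object
  })
}

# Launch the app only in an interactive R session
if (interactive()) runApp(shinyApp(ui, server))
```

Keeping the filtering in a plain function separates the data logic from the widgets, so it can be checked without starting the app, for example filter_delays(delays, "JFK", "carrier").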

Bonus task:

• Add a map visualization to the interactive display, highlighting the origins and
destinations of delayed flights using colour or marker size based on delay
severity.
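
For the bonus task, one option is the leaflet package (an assumption; the task does not name a mapping library). The sketch below sizes and colors one marker per airport by average delay. The `airport_delays` data frame is hypothetical: the coordinates are approximate and the delay values are made up.

```r
library(leaflet)

# Hypothetical per-airport summary: approximate coordinates, made-up delays
airport_delays <- data.frame(
  airport   = c("JFK", "LAX", "ORD"),
  lat       = c(40.64, 33.94, 41.97),
  lng       = c(-73.78, -118.41, -87.91),
  avg_delay = c(42, 18, 27),
  stringsAsFactors = FALSE
)

# Map average delay (minutes) onto a yellow-to-red palette
severity_palette <- colorNumeric(palette = "YlOrRd",
                                 domain = airport_delays$avg_delay)

delay_map <- leaflet(airport_delays) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    radius = ~ 4 + avg_delay / 4,           # larger marker = longer delay
    color = ~ severity_palette(avg_delay),  # redder marker = longer delay
    stroke = FALSE, fillOpacity = 0.8,
    label = ~ paste0(airport, ": ", avg_delay, " min")
  ) %>%
  addLegend(position = "bottomright", pal = severity_palette,
            values = ~avg_delay, title = "Avg delay (min)")

# Render the widget only in an interactive session
if (interactive()) print(delay_map)
```

In a shiny app the same object could be shown with leafletOutput() and renderLeaflet() next to the bar chart.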

Deliverables:

• R code for data cleaning, visualization, and interactivity.
• A functional interactive visualization that allows users to filter data and explore
trends.
• A concise summary of your findings and observations.
