IMDb+Movie+Assignment Stub
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sort the dataframe with the 'profit' column as reference using the
# 'sort_values' function. Make sure to set the argument
# 'ascending' to False
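A minimal sketch of the sort described above, run on a small made-up dataframe. Only the 'profit' column name comes from the assignment; the 'Title' column and all the numbers are illustrative assumptions.

```python
import pandas as pd

# Toy data with made-up values; 'Title' is a hypothetical column name.
movies = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "profit": [12.5, -3.0, 40.0],
})

# Sort by 'profit' with the largest value first.
# reset_index(drop=True) renumbers the rows after sorting.
movies_sorted = movies.sort_values(by="profit", ascending=False).reset_index(drop=True)
print(movies_sorted)
```

Note that ascending takes the boolean False, not the string 'False'.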
The dataset contains the 100 best-performing movies from 2010 to 2016.
However, the scatter plot tells a different story: some movies have
negative profit. Good movies do sometimes incur losses, but there appear to be quite a
few movies with losses here. What could be the reason behind this? Let's have a closer look
by finding the movies with negative profit.
#Find the movies with negative profit
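One way to pick out the loss-making rows is a boolean mask on the 'profit' column. The sketch below uses made-up data, and the 'Title' column is a hypothetical name.

```python
import pandas as pd

# Toy data with made-up values; 'Title' is a hypothetical column name.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "profit": [12.5, -3.0, 40.0, -110.2],
})

# The comparison yields a boolean Series; indexing with it keeps only
# the rows where profit is below zero. Sorting puts the largest loss first.
neg_profit = movies[movies["profit"] < 0].sort_values(by="profit")
print(neg_profit)
```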
Checkpoint 1: Can you spot the movie 'Tangled' in the dataset? Although it is
one of the highest-grossing movies of all time, it has negative
profit as per this result. If you cross-check the gross values of this movie (link:
https://www.imdb.com/title/tt0398286/), you can see that the gross in the dataset
accounts only for the domestic gross and not the worldwide gross. This is true for many
other movies in the list as well.
# Find the movies where the difference between the Metacritic and IMDb
# ratings is < 0.5 and the average rating is >= 8 (sorted in descending order)
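A hedged sketch of this filter follows. The column names 'IMDb_rating', 'MetaCritic' and 'Avg_rating' are assumptions, as is the premise that both ratings are already on the same 10-point scale and that the difference is meant as an absolute difference; adjust these to match your dataframe.

```python
import pandas as pd

# Toy data with made-up values; all column names here are assumptions.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "IMDb_rating": [8.4, 8.1, 7.6, 8.8],
    "MetaCritic": [8.1, 9.0, 7.5, 8.6],  # assumed already rescaled to 0-10
})

# Average of the two ratings.
movies["Avg_rating"] = (movies["IMDb_rating"] + movies["MetaCritic"]) / 2

# Condition 1: the two ratings differ by less than 0.5 in absolute value.
close = (movies["IMDb_rating"] - movies["MetaCritic"]).abs() < 0.5
# Condition 2: the average rating is at least 8.
high = movies["Avg_rating"] >= 8

# Combine the masks with & and sort by average rating, highest first.
universal = movies[close & high].sort_values(by="Avg_rating", ascending=False)
print(universal)
```

Each condition is built as its own mask so it can be inspected separately before the two are combined.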
Checkpoint 2: Can you spot a Star Wars movie in your final dataset?
Having this condition ensures that you aren't getting any unpopular actor in your trio
(since the total likes calculated in the previous question don't tell you anything about the
individual popularity of each actor in the trio).
You can do a manual inspection of the top 5 popular trios you have found in the previous
subtask and check how many of those trios satisfy this condition. Also, which is the most
popular trio after applying the condition above? Write your answers in the markdown cell
provided below.
Write your answers below.
• No. of trios that satisfy the above condition: (your answer here)
• Most popular trio after applying the condition: (your answer here)
Optional: Even though you are finding this out by a manual inspection of the dataframe,
can you also achieve it through some if-else statements? You can try
this out in your own time after you are done with the assignment.
# Your answer here (optional and not graded)
# Add the grouped data frames and store it in a new data frame
2. Make the second heatmap to see how the average number of votes from females
varies across the genres. Use a seaborn heatmap for this analysis. The X-axis should
contain the four age groups for females, i.e., CVotesU18F, CVotes1829F,
CVotes3044F, and CVotes45AF. The Y-axis will have the genres, and the annotations
in the heatmap give the average number of votes for that female age group.
3. Make sure that you plot these heatmaps side by side using subplots so that you
can easily compare the two genders and derive insights.
4. Write any three of your inferences from this plot. You can also make use of the previous bar
plot here for better insights. Refer to this link:
https://seaborn.pydata.org/generated/seaborn.heatmap.html. You might have to
plot something similar to the fifth chart on this page (you have to plot two such
heatmaps side by side).
5. Repeat subtasks 1 to 4, but now instead of taking the CVotes-related columns, you
need to do the same process for the Votes-related columns. These heatmaps will
show you how the two genders have rated movies across various genres.
You might need the link below for formatting your heatmap:
https://stackoverflow.com/questions/56942670/matplotlib-seaborn-first-and-last-row-cut-in-half-of-heatmap-plot
• Note: Use the genre_top10 dataframe for this subtask
# 1st set of heat maps for CVotes-related columns
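The side-by-side layout described in the subtasks above can be sketched as follows. This is only an illustration under stated assumptions: the male column names are inferred by analogy with the female ones, genre_top10 is assumed to be indexed by genre and to already hold the per-genre averages, and every number is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Female column names come from the assignment text; the male names are assumed.
male_cols = ["CVotesU18M", "CVotes1829M", "CVotes3044M", "CVotes45AM"]
female_cols = ["CVotesU18F", "CVotes1829F", "CVotes3044F", "CVotes45AF"]

# Stand-in for genre_top10: rows are genres, values are made-up averages.
genre_top10 = pd.DataFrame(
    [[10.0, 400.0, 300.0, 60.0, 5.0, 120.0, 80.0, 15.0],
     [8.0, 350.0, 280.0, 70.0, 6.0, 150.0, 110.0, 25.0],
     [4.0, 200.0, 160.0, 40.0, 7.0, 140.0, 90.0, 20.0]],
    index=["Sci-Fi", "Drama", "Romance"],
    columns=male_cols + female_cols,
)

# One row, two columns of axes: males on the left, females on the right.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.heatmap(genre_top10[male_cols], annot=True, fmt=".0f", ax=axes[0])
axes[0].set_title("Males")
sns.heatmap(genre_top10[female_cols], annot=True, fmt=".0f", ax=axes[1])
axes[1].set_title("Females")
fig.tight_layout()
fig.savefig("cvotes_heatmaps.png")
```

Passing ax=axes[i] is what places each heatmap in its own subplot; the same pattern should carry over to the Votes-related columns by swapping the column lists.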
Inferences: A few inferences that can be drawn from the heatmap above are that males have
voted more than females, and that Sci-Fi appears to be the most popular genre among the 18-29 age
group irrespective of gender. What more can you infer from the two heatmaps that
you have plotted? Write your three inferences/observations below:
• Inference 1:
• Inference 2:
• Inference 3:
# 2nd set of heat maps for Votes-related columns
Inferences: Sci-Fi appears to be the highest rated genre in the age group of U18 for both
males and females. Also, females in this age group have rated it a bit higher than the males
in the same age group. What more can you infer from the two heatmaps that you have
plotted? Write your three inferences/observations below:
• Inference 1:
• Inference 2:
• Inference 3:
2. Now make a boxplot that shows how the number of votes from US voters, i.e.
CVotesUS, varies for the US and non-US movies. Make use of the column IFUS to
make this plot. Similarly, make another subplot that shows how non-US voters have
voted for the US and non-US movies by plotting CVotesnUS for both the US and
non-US movies. Write any two of your inferences/observations from these plots.
3. Do a similar analysis again, but with the ratings. Make a boxplot that shows how the
ratings from US voters, i.e. VotesUS, vary for the US and non-US movies.
Similarly, make another subplot that shows how VotesnUS varies for the US and
non-US movies. Write any two of your inferences/observations from these plots.
Note: Use the movies dataframe for this subtask. Make use of this documentation to format your
boxplot: https://seaborn.pydata.org/generated/seaborn.boxplot.html
# Creating IFUS column
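A rough sketch of building the IFUS column and drawing the two CVotes boxplots. The 'Country' column name and the 'USA' / 'non-USA' labels are assumptions, since the exact definition is not shown here, and the data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the movies dataframe; 'Country' is an assumed column name.
movies = pd.DataFrame({
    "Country": ["USA", "UK", "USA", "France", "USA", "UK"],
    "CVotesUS": [900, 300, 1200, 150, 800, 400],
    "CVotesnUS": [700, 650, 900, 500, 600, 720],
})

# Keep 'USA' where the country is the USA, otherwise label the row 'non-USA'.
movies["IFUS"] = movies["Country"].where(movies["Country"] == "USA", "non-USA")

# Two boxplots side by side: US voters on the left, non-US voters on the right.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(x="IFUS", y="CVotesUS", data=movies, ax=axes[0])
sns.boxplot(x="IFUS", y="CVotesnUS", data=movies, ax=axes[1])
fig.tight_layout()
fig.savefig("cvotes_boxplots.png")
```

The same two-subplot pattern should work for the ratings by replacing CVotesUS and CVotesnUS with VotesUS and VotesnUS.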
• Inference 1:
• Inference 2:
# Box plot - 2: VotesUS(y) vs IFUS(x)
• Inference 1:
• Inference 2:
3. Write your inferences. You can also try to relate it with the heatmaps you did in the
previous subtasks.
# Sorting by CVotes1000
# Bar plot
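A small sketch of the sort-then-plot step, assuming genre_top10 is indexed by genre and has a CVotes1000 column. The values below are made up for illustration and are not results from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for genre_top10 with made-up values.
genre_top10 = pd.DataFrame(
    {"CVotes1000": [520.0, 610.0, 300.0]},
    index=["Drama", "Sci-Fi", "Romance"],
)

# Sort so the genre with the most votes from the top 1000 voters comes first.
genre_sorted = genre_top10.sort_values(by="CVotes1000", ascending=False)

# Bar plot with one bar per genre, in sorted order.
ax = genre_sorted["CVotes1000"].plot(kind="bar", figsize=(8, 4))
ax.set_xlabel("Genre")
ax.set_ylabel("CVotes1000")
plt.tight_layout()
plt.savefig("cvotes1000_bar.png")
```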
Checkpoint 6: The genre Romance seems to be the least popular among the top 1000
voters.
With the above subtask, your assignment is over. In your free time, do explore the dataset
further on your own and see what kind of other insights you can get across various other
columns.