Data Storytelling Final Project
Final Project
By: Dorothy Kunth
As a data science consultant for Spectacular Studios, a large production company, the tasks
are to define a problem compelling enough for the executive team to act on and to develop
the storyline of what the data can be expected to tell from the analysis.
How do these affect the recommendation to produce highly popular and financially successful
movies in the next three years?
2. IMDb's 83 million registered users do not fully represent the world's total
moviegoing audience. Not all moviegoers are registered IMDb users.
3. Other factors that affect people's decision to see a movie, such as professional
critic reviews, IMDb user reviews, actors, directors, plot summary and word-of-mouth
advertising, are not present in the feature set. Of these, plot summary and
word-of-mouth advertising are not possible to measure.
Limitations and Biases
Data Preprocessing:
1. The genres are stored as a stringified list of dictionaries that lists all the genres
and hybrid genres of each movie, which can run to about 5-6 genres. During
preprocessing, the genres were converted into a list of at most 3 genres.
2. The production countries are stored as a stringified list of the countries where each
movie was produced. Some movies are international co-productions between 5-6
countries. During preprocessing, the production countries were converted
into a list of at most 3 countries.
3. Less than 1% of values are missing in the following features:
vote_average (6), vote_count (6), revenue (6), popularity (5), language (11) and
production countries (3).
Out of the 45,466 records, fewer than 1% had missing values; because this
share is so small, the missing values were ignored.
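The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the helper name `parse_names`, the `'name'` key inside each dictionary, the sample row and the column labels in `missing` are all assumptions made for the example. Only the cap of 3 items, the missing-value counts and the record total come from the notes above.

```python
import ast

MAX_ITEMS = 3  # cap stated in the preprocessing notes


def parse_names(raw, max_items=MAX_ITEMS):
    """Turn a stringified list of dicts into a list of at most max_items names.

    Assumes each dict carries a 'name' key (hypothetical format).
    """
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # missing or malformed entry
    if not isinstance(items, list):
        return []
    names = [d["name"] for d in items if isinstance(d, dict) and "name" in d]
    return names[:max_items]


# Hypothetical row in the assumed format (not taken from the dataset)
raw_genres = (
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, "
    "{'id': 10751, 'name': 'Family'}, {'id': 12, 'name': 'Adventure'}]"
)
genres = parse_names(raw_genres)
print(genres)

# Missing-value counts as reported in the notes; column labels are illustrative
missing = {
    "vote_average": 6,
    "vote_count": 6,
    "revenue": 6,
    "popularity": 5,
    "language": 11,
    "production_countries": 3,
}
TOTAL_RECORDS = 45466

# Share of missing values per feature, in percent
missing_pct = {col: 100 * n / TOTAL_RECORDS for col, n in missing.items()}
for col, pct in missing_pct.items():
    print(f"{col}: {pct:.3f}% missing")
```

With these counts, every feature falls well below the 1% threshold mentioned above; the largest share, for language, is about 0.02%.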
Insights 1: The popularity distribution of the Top 250 movies by IMDb score is right-skewed,
with 8.8% of observations being high outliers. These outliers are retained because removing
them would significantly affect the analysis.
Insights 2: The popularity distribution of the Top 250 movies by estimated profit is
right-skewed, with 8.0% of observations being high outliers. These outliers are retained
because removing them would significantly affect the analysis.
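One common way to quantify a share of high outliers is Tukey's rule, which flags values above Q3 + 1.5 × IQR. The notes do not state which outlier definition was used, so the sketch below is an assumption rather than the project's method. The function name `high_outlier_share` and the sample scores are made up for illustration.

```python
import statistics


def high_outlier_share(values, k=1.5):
    """Return (upper_fence, share of values above it) using Tukey's rule.

    Assumption: outliers are defined as values above Q3 + k * IQR.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    n_high = sum(v > upper_fence for v in values)
    return upper_fence, n_high / len(values)


# Made-up right-skewed popularity scores (not from the dataset)
popularity = [5, 6, 7, 8, 8, 9, 10, 11, 12, 13,
              14, 15, 16, 18, 20, 22, 25, 60, 120, 210]

fence, share = high_outlier_share(popularity)
print(f"upper fence: {fence:.3f}, high outliers: {share:.1%}")

# A mean well above the median is a rough sign of right skew
print(statistics.mean(popularity) > statistics.median(popularity))
```

Applied to the real popularity column, the same function would give a figure comparable to the 8.8% and 8.0% shares reported above, provided the same outlier definition was used.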
Insights 3
IMDb uses proprietary algorithms that take into account several measures of
popularity; the primary measure is what people are looking at on IMDb. IMDb
records and sums pageviews, which form part of the foundation of its popularity
rankings.
In the feature set, popularity is expressed not as a ranking but as a score, which we
assumed to be the number of user visits and pageviews expressed in millions.
Next Steps
1. Identify potential data sources for popularity ranking, professional critic reviews, IMDb user review
ratings, movie actors and directors.
2. Run a follow-up analysis on a more recent dataset, ideally one that is only weeks old.
3. Since the profitability of a film studio depends crucially on picking the right film projects, and box office
revenue is highly concentrated in a small number of very successful films, the analysis suggests the
following next steps:
● Consider movie projects in the genres or hybrid genres of Drama, Crime, Romance,
Comedy, Adventure, Action, Science Fiction, Fantasy and Animation.
● Produce movie projects in the English language.
Thank you!