Imdb Scrape v3
% imdb_scrape.m
% written by Jason G. Fleischer
%
%
% Based on an argument we had on The Facebook, I want to test whether the
% mid 80s (specifically 84-86) were unusually good at generating
% "classic/iconic" movies, or whether I just think so because those were my
% teen years. I mean, come on, Ghostbusters? The Breakfast Club? Aliens?
% Stand By Me?
% To address this question I'm going to use MATLAB to scrape data off of
% IMDB and look at the top 100 movies in terms of US Box office. I'm going
% to try to figure out how to use various combinations of the average user
% rating and/or # of rating votes as a measure of 'classic-ness', and look
% at how that measure changes from year to year.
%
% AT THIS POINT, UNLESS YOU LIKE READING CODE YOU SHOULD
% PROBABLY SKIP AHEAD TO THE RESULTS
%
% Note that my solution is hard-coded to features of the
% current IMDB html... it's going to fail if they change anything.
% Please pardon the messiness of this code, and know that this is probably
% the best code commenting I've done in years :)
%
% This code is tested and working in MATLAB R2013a and IMDB's website as of
% March 13, 2015
%
% Please feel free to use and adapt this code. I would like to hear from
% you if you have a different analysis/viewpoint on this data:
% [email protected]
% let's not even worry about films that are less than 10 years old; it's
% impossible to decide if they are classic yet
years=1964:2004;
data={}; % data includes title for later exploration
% for convenience we'll use these matrices for raw summary statistics
allvotes=[];
allratings=[];
xi=0;
for xx=years,
xi=xi+1;
disp(['Scraping ' num2str(xx)]);
baseurl='http://www.imdb.com/search/title?at=0&sort=boxoffice_gross_us';
s1=urlread(sprintf('%s&title_type=feature&year=%d,%d',baseurl,xx,xx));
s2=urlread(sprintf('%s&start=51&title_type=feature&year=%d,%d',baseurl,xx,xx));
% imdb only serves 50 movies on a page; combine two pages to get the
% required data
allscrape=[s1 s2];
% this string marks the beginning of a film's entry in the html
indxs=strfind(allscrape,'wlb_wrapper');
% on each page we get 50 of these wlb_wrappers for movies plus one extra
% at the end of the page
yi=0;
%% RESULTS
%
% first questions: how do the distributions look for rating and votes?
figure; hist(allvotes(:)); title('Histogram of # votes');
xlabel('value'); ylabel('count');
figure; hist(allratings(:)); title('Histogram of ratings');
xlabel('value'); ylabel('count');
figure; plotyy(1964:2004,mean(allratings),1964:2004,mean(allvotes));
legend('mean ratings','mean number of votes'); xlabel('release year')
% The mean number of votes increases year on year! I'm guessing this is
% because more people are discovering IMDB every year, and they vote on the
% movies they have seen that year. This means that there is no easy
% threshold criterion to define a classic by # of votes.
% This is terrible because I'd hoped votes would be the way to
% quantify this. It's clear that people's ratings of movies can be very
% multi-modal: true fans love Star Trek movies, everyone else finds them
% mostly blah. I figured that lots of votes would indicate that people
% cared about a movie
%
% Interestingly, mean ratings are higher for older films, even as the number
% of votes drops tremendously. Perhaps only classic film buffs and true fans
% vote on films that far back?
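One way to make vote counts comparable across years is to divide out each year's average. This is only a minimal sketch, not something the original analysis does: it assumes allvotes has one row per film and one column per year (as the mean(allvotes) plot implies). The toy numbers and the name relvotes are made up for illustration.

```matlab
% Toy vote counts: 3 films (rows) x 2 years (columns). The later year has
% ten times the traffic, mimicking the upward trend in the plot.
toyvotes = [100 1000; 300 3000; 200 2000];

% Divide each column by that year's mean, so a value of 1 means
% "an average number of votes for its release year".
% bsxfun is used because R2013a has no implicit expansion.
relvotes = bsxfun(@rdivide, toyvotes, mean(toyvotes));
disp(relvotes)
```

After this rescaling the two toy years look identical, even though the raw counts differ by a factor of ten.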
Let's look at ratings. To give you an idea of how IMDB ratings look, here
are some 1986 films in each rating band:
5-6: 9 1/2 Weeks, The Golden Child, Maximum Overdrive
6-7: Top Gun, Pretty in Pink, Short Circuit
7-8: Ferris Bueller, Blue Velvet, Transformers: The Movie
8+: Aliens, Platoon, Stand by Me
In other words, high ratings probably don't correlate much with highbrow,
which suggests that, depending on your tastes, IMDB ratings may not be
good predictors of an iconic/classic movie.
figure; plot(1964:2004,sum(allratings>6.5),1964:2004,sum(allratings>7), ...
    1964:2004,sum(allratings>7.5),1964:2004,sum(allratings>8));
legend('# films rating > 6.5','# films rating > 7','# films rating > 7.5', ...
    '# films rating > 8')
xlabel('release year')
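% list the 1993 films rated above 7.5; the indexing assumes data holds 100
% entries per year, in the same order as the columns of allratings
minds=find(allratings(:,1993-1964+1)>7.5);
ms=data(((1993-1964)*100+1):((1994-1964)*100));
ms{minds}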
title               votes    rating   year
Jurassic Park       462302   8.0      1993
The Fugitive        192146   7.8      1993
Schindler's List    715139   8.9      1993
Philadelphia        160708   7.7      1993
(missing)
Groundhog Day       362506   8.1      1993
Tombstone            84615   7.8      1993
Falling Down        122604   7.6      1993
The Piano            58157   7.6      1993
Carlito's Way       147449   7.9      1993
(missing)
The Sandlot          49868   7.8      1993
(missing)
(missing)
(missing)
Iron Monkey          12177   7.6      1993
True Romance        144833   8.0      1993
On to the questions: how can we extract "classic-ness" out of this data set?
How can we separate Groundhog Day (clearly a classic!) from The Sandlot
(which, as good a movie as it is, just doesn't meet my personal standards)?
Most importantly, how can we remove Iron Monkey-like results from the
above list? It might be a good kung-fu flick, but nobody in this country
saw it in spite of Tarantino's backing, and it isn't a classic IMHO.
I will argue that we need a metric that takes into account these things:
1) Ratings
2) Votes (get rid of low-vote Iron Monkey fanboy noise in the ratings)
3) The upward trend of vote #s with year.
Voilà... we will use the mean rating of the most-voted-on movies in each
year. I did some complicated stuff to account for year-to-year variability
in how many movies lived out there in the long tail of the vote
distribution: taking only movies at or above the 90th percentile of votes,
or using 1.5 * inter-quartile range. But it turned out that just using the
top 10 vote-getters produced an essentially identical graph:
The mid 80s are better than the early 80s, but still nothing
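The code behind that graph isn't shown, so here is a minimal sketch of the top-vote-getters metric. It assumes allratings and allvotes are films-by-years matrices, as the earlier plotting code implies. The toy matrices and the names topN and topmean are invented for illustration and are not the original implementation.

```matlab
% Toy stand-ins for allratings/allvotes: rows are films, columns are years.
% These numbers are invented so the snippet runs by itself.
toyratings = [8.0 6.0; 7.0 9.0; 5.0 7.5; 6.5 4.0];
toyvotes   = [900 100;  50 800; 700 600;  10  20];

topN = 2;                        % the analysis above uses the top 10
nYears = size(toyvotes, 2);
topmean = zeros(1, nYears);      % mean rating of the topN most-voted films
for k = 1:nYears
    [~, order] = sort(toyvotes(:, k), 'descend');    % most-voted first
    topmean(k) = mean(toyratings(order(1:topN), k));
end
disp(topmean)
```

Because the cut is made by rank within each year rather than by a fixed vote count, it drops low-vote films and is unaffected by the year-on-year growth in votes.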
In conclusion, I couldn't figure out how to use this data to show what I
know to be true: the 1980s are superior movie years. No matter how they
were massaged, the analyses continue to point to the superiority of the
mid 90s over the mid 80s in producing "classic/iconic" movies. These
results are clearly counter to ground truth, and thus I conclude that
IMDB's DB must have been corrupted by hackers. Thanks Obama.