Imdb Scrape v1
% imdb_scrape.m
% written by Jason G. Fleischer
%
%
% Based on an argument we had on The Facebook, I want to test whether
% the mid 80s (specifically 84-86) were unusually good at generating
% "classic/iconic" movies, or whether I just think so because those were my
% teen years. I mean, come on, Ghostbusters? The Breakfast Club? Aliens?
% Stand By Me? To address this question I'm going to use MATLAB to scrape
% data off of IMDB and look at the top 100 movies per year by US box office.
% I'm going to try to figure out how to use various combinations of the
% average user rating and/or # of rating votes as a measure of
% 'classic-ness', and look at how that measure changes from year to year.
%
% Note that my solution is hard-coded to features of the
% current IMDB HTML, so it's going to fail if they change anything.
% Please pardon the messiness of this code, and know that this is probably
% the best code commenting I've done in years :)
%
% This code is tested and working in MATLAB R2013a and IMDB's website as of
% March 13, 2015
%
% Please feel free to use and adapt this code. I would like to hear from
% you if you have a different analysis/viewpoint on this data.
years=1964:2004; % skip films less than 10 years old; it's impossible to decide yet if they are classic
data={}; % data includes title for later exploration
allvotes=[]; % for convenience we'll use these matrices for raw summary statistics
allratings=[];
xi=0;
for xx=years,
xi=xi+1;
disp(['Scraping ' num2str(xx)]);
s1=urlread(sprintf(['http://www.imdb.com/search/title?at=0&sort=boxoffice_gross_us' ...
    '&title_type=feature&year=%s,%s'],num2str(xx),num2str(xx)));
s2=urlread(sprintf(['http://www.imdb.com/search/title?at=0&sort=boxoffice_gross_us' ...
    '&start=51&title_type=feature&year=%s,%s'],num2str(xx),num2str(xx)));
allscrape=[s1 s2]; % IMDB only serves 50 movies per page, so combine two pages to get the required data
indxs=strfind(allscrape,'wlb_wrapper'); % this string marks the beginning of a film's entry in the HTML
% on each page we get 50 of these wlb_wrappers for movies plus one extra at the end of the page
yi=0;
for yy=[1:50 52:101] % skip the end-of-page wlb_wrapper
yi=yi+1;
first=indxs(yy);
last=indxs(yy+1);
%%
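% Aside: a minimal sketch of the marker-slicing idea used in the scraping
% loop above, run on a made-up string rather than real IMDB HTML. The toy
% text, the 'rating=' field, and the demo* variable names are assumptions
% for illustration only; the real page layout is not reproduced here.
demoPage=['wlb_wrapper A rating=7.5 ' ...
          'wlb_wrapper B rating=8.1 ' ...
          'wlb_wrapper end-of-page'];
demoIdx=strfind(demoPage,'wlb_wrapper'); % start index of every marker
demoRatings=zeros(1,numel(demoIdx)-1);
for kk=1:numel(demoIdx)-1 % the last marker only closes the final entry
    chunk=demoPage(demoIdx(kk):demoIdx(kk+1)-1); % text of one entry
    tok=regexp(chunk,'rating=([\d.]+)','tokens','once'); % capture the number
    demoRatings(kk)=str2double(tok{1});
end
disp(demoRatings); % one recovered rating per entry
%%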
% first questions: how do the distributions look for rating and votes?
figure; hist(allvotes(:)); title('Histogram of # votes');
xlabel('value'); ylabel('count');
figure; hist(allratings(:)); title('Histogram of ratings');
xlabel('value'); ylabel('count');
figure; plotyy(1964:2004,mean(allratings),1964:2004,mean(allvotes));
legend('mean ratings','mean number of votes'); xlabel('release year')
% The mean number of votes increases year on year! I'm guessing this is
% because more people are discovering IMDB every year, and they vote on the
% movies they have seen that year. This means that there is no easy
% threshold criterion to define a classic by # of votes.
% This is terrible because I'd hoped votes would be the way to
% quantify this. It's clear that people's ratings of movies can be very
% multi-modal: true fans love Star Trek movies, everyone else finds them
% mostly blah. I figured that lots of votes would indicate that people
% cared about a movie.
%
% Interestingly, mean ratings are higher for older films, even as the number
% of votes drops tremendously. Maybe only classic film buffs and true fans
% vote on films that far back?
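%
% Aside: a toy check of why mean(allvotes) gives one value per year. This
% assumes, as the plotting call above implies, that rows index films and
% columns index years. The demo matrix and names are made up.
demoVotes=[100 200 300;
           300 400 500];
demoMean=mean(demoVotes); % mean works down each column: one mean per year
disp(demoMean);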
%
% Let's look at ratings. To give you an idea of how IMDB ratings look,
% here are some 1986 films in each rating band:
%   5-6: 9 1/2 Weeks, The Golden Child, Maximum Overdrive
%   6-7: Top Gun, Pretty in Pink, Short Circuit
%   7-8: Ferris Bueller, Blue Velvet, Transformers: The Movie
%   8+:  Aliens, Platoon, Stand by Me
% In other words, high ratings probably don't correlate much with highbrow,
% which suggests that, depending on what your tastes are, IMDB
% ratings may not be good predictors of an iconic/classic movie.
figure; plot(1964:2004,sum(allratings>6.5),1964:2004,sum(allratings>7), ...
    1964:2004,sum(allratings>7.5),1964:2004,sum(allratings>8));
legend('# films rating > 6.5','# films rating > 7','# films rating > 7.5', ...
    '# films rating > 8')
xlabel('release year')
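% Aside: a toy version of the counting used in the plot above.
% A comparison such as ratings>T yields a logical matrix, and sum() adds
% down each column, giving the number of films per year above the
% threshold. The demo matrix is made up (rows = films, columns = years,
% an assumed layout).
demoRatings2=[6.0 7.2 8.1;
              7.6 7.9 6.4;
              8.3 5.5 7.7];
demoCount=sum(demoRatings2>7.5); % films per year rated above 7.5
disp(demoCount);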
% ans =
%      title: 'Jurassic Park'
%      votes: 462302
%     rating: 8
%       year: 1993
%
% ans =
%      title: 'The Fugitive'
%      votes: 192146
%     rating: 7.8000
%       year: 1993
%
% ans =
%      title: 'Schindler's List'
%      votes: 715139
%     rating: 8.9000
%       year: 1993
%
% ans =
%      title: 'Philadelphia'
%      votes: 160708
%     rating: 7.7000
%       year: 1993
%
% ans =
%      title:
%      votes:
%     rating:
%       year:
%
% ans =
%      title: 'Groundhog Day'
%      votes: 362506
%     rating: 8.1000
%       year: 1993
%
% ans =
%      title: 'Tombstone'
%      votes: 84615
%     rating: 7.8000
%       year: 1993
%
% ans =
%      title: 'Falling Down'
%      votes: 122604
%     rating: 7.6000
%       year: 1993
%
% ans =
%      title: 'The Piano'
%      votes: 58157
%     rating: 7.6000
%       year: 1993
%
% ans =
%      title: 'Carlito's Way'
%      votes: 147449
%     rating: 7.9000
%       year: 1993
%
% ans =
%      title:
%      votes:
%     rating:
%       year:
%
% ans =
%      title: 'The Sandlot'
%      votes: 49868
%     rating: 7.8000
%       year: 1993
%
% ans =
%      title:
%      votes:
%     rating:
%       year:
%
% ans =
%      title:
%      votes:
%     rating:
%       year:
%
% ans =
%      title: 'A Bronx Tale'
%      votes: 87700
%     rating: 7.8000
%       year: 1993
%
% ans =
%      title: 'Iron Monkey'
%      votes: 12177
%     rating: 7.6000
%       year: 1993
%
% ans =
%      title: 'True Romance'
%      votes: 144833
%     rating: 8
%       year: 1993
%
% Well crap. The mid 80s are better than the early 80s, but still nothing
% on 1993.
%
% In conclusion, I couldn't figure out how to use this data to show what I
% know to be true: the 1980s are superior movie years. No matter how they
% were massaged, the analyses continue to point to the superiority of the
% mid 90s over the mid 80s in producing "classic/iconic" movies. These
% results are clearly counter to ground truth, and thus I conclude that
% IMDB's DB must have been corrupted by hackers. Thanks Obama.