Imdb Scrape v3

I wanted to test a theory that the mid 80s (specifically 84-86) were unusually good at generating "classic/iconic" movies. Or maybe I just think so because that's my teen years.

Uploaded by

Jason Fleischer

%%

% imdb_scrape.m
% written by Jason G. Fleischer
%
%
% Based on an argument we had on The Facebook, I want to test a theory that
% the mid 80s (specifically 84-86) were unusually good at generating
% "classic/iconic" movies or if I just think so because that's my teen years.
% I mean, come on, Ghostbusters? The Breakfast Club? Alien? Stand By Me?
% To address this question I'm going to use MATLAB to scrape data off of
% IMDB and look at the top 100 movies in terms of US Box office. I'm going
% to try to figure out how to use various combinations of the average user
% rating and/or # of rating votes as a measure of 'classic-ness', and look
% at how that measure changes from year to year.
%
% AT THIS POINT, UNLESS YOU LIKE READING CODE YOU SHOULD
% PROBABLY SKIP AHEAD TO THE RESULTS
%
% Note that my solution is hard-coded to features of the
% current IMDB html... it's going to fail if they change anything.
% Please pardon the messiness of this code, and know that this is probably
% the best code commenting I've done in years :)
%
% This code is tested and working in MATLAB R2013a and IMDB's website as of
% March 13, 2015
%
% Please feel free to use and adapt this code. I would like to hear from
% you if you have a different analysis/viewpoint on this data:
% [email protected]
years=1964:2004; % let's not even worry about films that are less than 10
                 % years old; it's impossible to decide if they are classic yet
data={}; % data includes title for later exploration
allvotes=[]; % for convenience we'll use these matrices for raw summary
             % statistics
allratings=[];
xi=0;
for xx=years
    xi=xi+1;
    disp(['Scraping ' num2str(xx)]);
    % imdb only serves 50 movies on a page, so combine two pages to get
    % the required data
    s1=urlread(sprintf('http://www.imdb.com/search/title?at=0&sort=boxoffice_gross_us&title_type=feature&year=%s,%s',num2str(xx),num2str(xx)));
    s2=urlread(sprintf('http://www.imdb.com/search/title?at=0&sort=boxoffice_gross_us&start=51&title_type=feature&year=%s,%s',num2str(xx),num2str(xx)));
    allscrape=[s1 s2];
    % this string marks the beginning of a film's entry in the html
    indxs=strfind(allscrape,'wlb_wrapper');
    % on each page we get 50 of these wlb_wrappers for movies plus one extra
    % at the end of the page
    yi=0;
    for yy=[1:50 52:101] % skip the end-of-page wlb_wrapper
        yi=yi+1;
        first=indxs(yy);
        last=indxs(yy+1);
        toParse=allscrape(first:last); % substring we will parse for the film info
        % the title lies at the beginning, right between a </span> and the
        % next <span>
        tinds=strfind(toParse,'span');
        temp=toParse(tinds(1)+5:tinds(2)-2); % remove the spans
        titl=strtrim(temp);
        % rating is here
        rind=strfind(toParse,'Users rated this');
        rating=str2num(toParse(rind+16:rind+19));
        % the number of votes lies a fixed number of characters after the
        % rating because the format is always:
        % Users rated this X.Y/10 (ZZZ,ZZZ votes)
        vind1=rind+25;
        vind2=strfind(toParse(vind1:(vind1+20)),'votes')+vind1-3;
        votesStr=toParse(vind1:vind2); % this is the number in ZZZ,ZZZ format
        remove=strfind(votesStr,','); % get rid of the commas
        keep=setdiff(1:length(votesStr),remove);
        votes=str2num(votesStr(keep)); % numerical format
        allvotes(yi,xi)=votes;
        allratings(yi,xi)=rating;
        record.title=titl;
        record.rating=rating;
        record.votes=votes;
        record.year=xx;
        data{end+1}=record;
    end
end
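
% The fixed-offset parsing in the loop above depends on the exact character
% layout of "Users rated this X.Y/10 (ZZZ,ZZZ votes)". As a hedged
% alternative (not what this script actually ran), a regular expression can
% pull both numbers out in one pass. The sample string below is a
% hypothetical stand-in for the scraped html, not real IMDB output.

```matlab
% Hypothetical sample in the format described above (not scraped html)
sample = 'Users rated this 8.1/10 (362,506 votes)';

% Capture the rating and the comma-grouped vote count as two tokens
pat = 'Users rated this ([0-9.]+)/10 \(([0-9,]+) votes\)';
tok = regexp(sample, pat, 'tokens', 'once');

if isempty(tok)
    rating = NaN; votes = NaN;   % pattern not found: flag it, don't error
else
    rating = str2double(tok{1});
    votes  = str2double(strrep(tok{2}, ',', ''));   % drop the commas
end
fprintf('rating=%.1f votes=%d\n', rating, votes);
```

% A film with no rating would give NaN here instead of an indexing error,
% which makes missing entries easier to spot afterwards.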

%% RESULTS
%
% first questions: how do the distributions look for rating and votes?
figure; hist(allvotes(:)); title('Histogram of # votes');
xlabel('value'); ylabel('count');
figure; hist(allratings(:)); title('Histogram of ratings');
xlabel('value'); ylabel('count');

%
% Answer: ratings seem close to normally distributed, # votes is
% nowhere near... very exponential-ish. I've looked at the distribution of
% votes in individual years as well, and it's pretty much always like that,
% every year, as well as across all years. Even worse, there is a disturbing
% non-stationarity in the votes data:

figure; plotyy(1964:2004,mean(allratings),1964:2004,mean(allvotes));
legend('mean ratings','mean number of votes'); xlabel('release year')

% The mean number of votes increases year on year! I'm guessing this is
% because more people are discovering IMDB every year, and they vote on the
% movies they have seen that year. This means that there is no easy
% threshold criterion to define a classic by # of votes.
% This is terrible because I'd hoped votes would be the way to
% quantify this. It's clear that people's ratings of movies can be very
% multi-modal: true fans love Star Trek movies, everyone else finds them
% mostly blah. I figured that lots of votes would indicate that people
% cared about a movie.
%
% Interestingly, the mean ratings go up as we go further into the past, even
% as the number of votes drops tremendously. Maybe only classic film buffs
% and true fans vote that far back?
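
% One possible way around the upward trend in votes (a sketch only, and not
% something this analysis actually does) is to standardize vote counts
% within each release year, so a film is compared only against films from
% the same year. The matrix demoVotes below is synthetic and merely mimics
% the films-by-years layout of allvotes.

```matlab
% Synthetic stand-in for allvotes (rows = films, columns = years).
% The second "year" has 10x the votes, imitating the upward trend.
demoVotes = [ 100   1000;
              200   2000;
              400   4000;
             3200  32000];

logv = log10(demoVotes);          % compress the long tail of the vote counts
mu   = mean(logv, 1);             % per-year mean (one value per column)
sd   = std(logv, 0, 1);           % per-year standard deviation
z    = bsxfun(@rdivide, bsxfun(@minus, logv, mu), sd);   % within-year z-score

disp(z);
```

% In this toy example both columns end up with the same z-scores, because
% the second year is just a scaled copy of the first; a single threshold on
% z would then mean the same thing in either year.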
%
% let's look at ratings... to give you an idea of how IMDB ratings look,
% here are some 1986 films that get the following ratings:
%   5-6: 9 1/2 Weeks, The Golden Child, Maximum Overdrive
%   6-7: Top Gun, Pretty in Pink, Short Circuit
%   7-8: Ferris Bueller, Blue Velvet, Transformers: The Movie
%   8+:  Aliens, Platoon, Stand by Me
% In other words, high ratings probably don't correlate much with highbrow,
% which suggests that depending on what your tastes are, IMDB
% ratings may not be good predictors of an iconic/classic movie

figure; plot(1964:2004,sum(allratings>6.5),1964:2004,sum(allratings>7),...
    1964:2004,sum(allratings>7.5),1964:2004,sum(allratings>8));
legend('# films rating > 6.5','# films rating > 7','# films rating > 7.5',...
    '# films rating > 8')
xlabel('release year')

% Looking at thresholded ratings, we can see that a "hardline" stance on
% defining a classic (>8) puts it down to quite a constant (and low) level of
% achievement across years. Using a lesser threshold results in peaks and
% valleys from year to year, and big trends such as observed previously
% where the dim past gets "grade inflation"
%
% this view of user ratings suggests two things:
% 1. It's less that 84-86 were good, and more that the early 80s were
%    terrible
% 2. There's a very interesting bump in 1993 (mostly 7.5-8 movies) and
%    another bump in 1995 (>8 movies).
%
% Here's all 7.5+ movies in 1993:

minds=find(allratings(:,1993-1964+1)>7.5);
ms=data(((1993-1964)*100+1):((1994-1964)*100));
ms{minds}

% title                             votes    rating   year
% Jurassic Park                     462302   8.0      1993
% The Fugitive                      192146   7.8      1993
% Schindler's List                  715139   8.9      1993
% Philadelphia                      160708   7.7      1993
% The Nightmare Before Christmas    196302   8.0      1993
% Groundhog Day                     362506   8.1      1993
% Tombstone                          84615   7.8      1993
% Falling Down                      122604   7.6      1993
% The Piano                          58157   7.6      1993
% Carlito's Way                     147449   7.9      1993
% The Joy Luck Club                  11908   7.6      1993
% The Sandlot                        49868   7.8      1993
% In the Name of the Father          89483   8.1      1993
% The Remains of the Day             41066   7.9      1993
% A Bronx Tale                       87700   7.8      1993
% Iron Monkey                        12177   7.6      1993
% True Romance                      144833   8.0      1993
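
% The 1993 listing above is pulled out with index arithmetic that assumes
% exactly 100 records per year, stored in year order. A hedged alternative
% is to filter on the year and rating fields directly. The records in demo
% below are made up for illustration; only the field names (title, rating,
% votes, year) follow the scraper.

```matlab
% Tiny hypothetical data cell with the same record fields as the scraper
demo = {};
r.title='Film A'; r.rating=8.1; r.votes=1000; r.year=1993; demo{end+1}=r;
r.title='Film B'; r.rating=6.9; r.votes=5000; r.year=1993; demo{end+1}=r;
r.title='Film C'; r.rating=7.9; r.votes=200;  r.year=1994; demo{end+1}=r;

recs   = [demo{:}];                              % cell of structs -> struct array
pick   = [recs.year]==1993 & [recs.rating]>7.5;  % logical mask over all records
titles = {recs(pick).title};                     % titles of the matching films

disp(titles);
```

% This keeps working if a year ends up with fewer than 100 parsed films,
% at the cost of building one struct array from the whole data cell.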

% Whew!! Still with me? It looks like my theory is pretty wrong, but then
% again, what good is a theory if you don't go to bat for it? I'll make
% one more argument that I hope you'll find attractive. Again we fall back
% on the questions: how can we extract "classic-ness" out of this data set?
% How can we separate Groundhog Day (clearly a classic!) from The Sandlot
% (which, as good a movie as it is, just doesn't meet my personal standards)?
% Most importantly, how can we remove Iron Monkey-like results from the
% above list? It might be a good kung-fu flick, but nobody in this country
% saw it in spite of Tarantino's backing, and it isn't a classic IMHO.
% I will argue that we need a metric that takes into account these things:
% 1) Ratings
% 2) Votes (get rid of low-vote Iron Monkey fanboy noise in the ratings)
% 3) the upward trend of vote #s with year.
% Voila... we will use the mean rating of the most-voted-on movies in each
% year. I did some complicated stuff to account for year-to-year variability
% in how many movies lived out there in the long tail of the vote
% distribution: taking only the 90th percentile+ voted movies, or using 1.5 *
% inter-quartile range. But it turned out that just using the top 10
% vote-getters produced an essentially-identical graph:

for xx=1:length(years)
    [dummy,ginds]=sort(allvotes(:,xx),'descend'); % rank films by vote count
    ts(xx)=mean(allratings(ginds(1:10),xx)); % mean rating of the 10 most-voted
end
figure; plot(1964:2004,ts); legend('Mean rating of top 10 vote-getters each year');
% Well crap. The mid 80s are better than the early 80s, but still nothing
% on 1993.
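
% For comparison, here is a hedged sketch of what a "90th percentile+"
% cut on votes could look like using only sort. It is a guess at the idea,
% not the code that was actually run. With exactly 100 films per year, a
% rank-based 90th-percentile cut keeps about 10 films, which may be why the
% top-10 version looked essentially identical. All data below is synthetic.

```matlab
% Synthetic stand-ins for one year's column of allvotes / allratings
nFilms      = 100;
demoVotes   = (1:nFilms)';          % votes 1..100, so the ranks are obvious
demoRatings = 5 + demoVotes/50;     % made-up ratings that rise with votes

sortedVotes = sort(demoVotes, 'ascend');
cutoff = sortedVotes(ceil(9*nFilms/10));   % rank-based 90th percentile
keep   = demoVotes > cutoff;               % films strictly above the cut

nKept      = sum(keep);
meanRating = mean(demoRatings(keep));
fprintf('%d films kept, mean rating %.2f\n', nKept, meanRating);
```

% Ties in the vote counts, or a year with fewer parsed films, would change
% how many films survive the cut, which is one place the two approaches
% could differ.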

%
% In conclusion, I couldn't figure out how to use this data to show what I
% know to be true: the 1980s are superior movie years. No matter how they
% were massaged, the analyses continue to point to the superiority of the
% mid 90s over the mid 80s in producing "classic/iconic" movies. These
% results are clearly counter to ground truth, and thus I conclude that
% IMDB's DB must have been corrupted by hackers. Thanks Obama.
