
Volume 10, Issue 2, February – 2025 International Journal of Innovative Science and Research Technology

ISSN No: 2456-2165 | https://doi.org/10.5281/zenodo.14987459

Predicting Film Box Office Performance Using Wikipedia Edit Data

Niraj Patel¹
¹St. Clair College

Publication Date: 2025/03/11

Abstract: This study explores the potential of Wikipedia edit data as a predictor of opening box office revenues for films
released in the US. After analyzing films from 2007 to 2011, we developed a predictive model based on Wikipedia article
edits using gradient boosting trees as the primary algorithm. Our model incorporates features such as the frequency of
Wikipedia edits, the size and content of article revisions, and the revenues of similar films. The results demonstrate that
Wikipedia activity can serve as a rough indicator of film popularity, though the model’s predictive accuracy is limited. We
find that Wikipedia-based features, particularly edit runs and content changes, significantly contribute to the model’s
performance, achieving an R² of 0.54 for films released in 2012. This suggests that while Wikipedia data offers valuable
insights into social interest, it is best used in conjunction with other predictors for more reliable revenue estimates.

How to Cite: Niraj Patel (2025). Predicting Film Box Office Performance Using Wikipedia Edit Data. International Journal of Innovative Science and Research Technology, 10(2), 1951-1956. https://doi.org/10.5281/zenodo.14987459

I. INTRODUCTION

 Wikipedia as a Gauge of Social Interest

Fig 1 Count of Wikipedia article edits for the films used in this paper’s training dataset over the 4 weeks prior to each film’s
respective release date, bucketed by days before the release date that the edits occurred. This graph shows the uptick in editing
activity that typically accompanies a film’s release.

 According to its article about itself (as of this writing), Wikipedia is "a collaboratively edited, multilingual, free Internet encyclopedia" launched in January 2001. [6] Its articles can be edited by anyone, either anonymously (though the editor's IP address is logged) or with a registered user account. The edit history of each article is saved with a timestamp. Interested users can view any past version of an article, and an article's edit history exhibits an evolving record of Wikipedia's "knowledge" of its subject.

IJISRT25FEB802 www.ijisrt.com 1951


 As such, Wikipedia's edit history can be viewed as a barometer of social interest. For example, when a person is in the news, editing activity in his or her article often spikes. In fact, Wikipedia has template warnings indicating when an article is likely to be in flux due to a relevant current event. Edit activity on Wikipedia, in this sense, is akin to mentions on social networks like Facebook or Twitter, although perhaps with a smaller participating audience (although many people read Wikipedia, not nearly so many participate in its creation).

 One area where we can try to gauge the degree to which Wikipedia activity reflects social interest is film box office performance. Films have relatively well-defined release dates prior to which we can measure activity on Wikipedia. They also have well-defined, measurable outcomes - revenues at the ticket booth - that are clearly sensitive to popular interest. Theater owners obviously have a direct financial interest in knowing how well a film is going to perform. Advertisers and publicists, sellers of tie-in products, and film journalists have a slightly more indirect but still strong interest; they will want to know how they should spend their time and money. Can we use Wikipedia to usefully predict films' opening box office performances?

II. FORMULATION OF PROBLEM AND DATA SOURCES

 The specific question I set out to answer was how accurately, with Wikipedia's help, we can predict the domestic per-theater box office gross of a film released widely in the US over the first three days of its release.

 Of course, Wikipedia's highly open policy means that it contains a stunning breadth of information from contributors with wide-ranging expertise, and that said information is sometimes unreliable. For an example that was in the news not long before this paper was written, see [5], or for Wikipedia's own list of Wikipedia hoaxes, see [4].

 Films traditionally open on Friday, and their "opening" often refers to their gross over the first Friday, Saturday, and Sunday that they are playing. However, there are plenty of non-Friday openings. Consequently, I've stated the problem in terms of the first three days' worth of grosses.

 The data sources I used to answer this question were:

 Box Office Mojo (http://www.boxofficemojo.com/) - contains detailed box office data. I used it to select the universe of films to analyze and as my source for theatrical release dates, number of opening theaters, and revenues. There is no API - I scraped the data with the Python package Beautiful Soup.

 Rotten Tomatoes (http://www.rottentomatoes.com/) - a popular movie review aggregator. I used it to obtain descriptive information about films: genres, runtime, MPAA rating, cast and directors, and so on. It offers an API if you register for a key (which is free as of the present writing).

 Wikipedia (English-language) (http://en.wikipedia.org/) - MediaWiki, the name of the web application upon which Wikipedia is based, offers an API; no registration or key is necessary.

 Much of the work involved in data retrieval and formatting was to ensure that data retrieved from these three sources corresponded to the same film; data from Rotten Tomatoes and Wikipedia was obtained by using their APIs' search functionalities, which can lead to incorrect hits if you are not careful. For example, we want to make sure that Rotten Tomatoes data for the 2012 film "The Lucky One" is not mapped to the 2008 film "The Lucky Ones," or that for the 2010 film "Salt" we do not examine the Wikipedia article for salt, the mineral.

 The universe of films I considered consisted of those listed on Box Office Mojo as having opened in at least 1000 theaters. I manually excluded a handful of films that were re-released or were limited-engagement special features. I trained my algorithms on films released between 2007 and 2011, inclusive. In total, 689 films were in the training dataset. Data from films as far back as 2002 were used for some of the feature calculations; see the next section for more details. I tested my algorithm on films released in 2012, of which there were 124.

 Box Office Mojo data had to be scraped from HTML, but the HTML was regular and consistent. Rotten Tomatoes has a nice JSON-based API for data retrieval, but its ranking of returns is quirky, sometimes retrieving obscure films or films with similar names (example: Oliver Stone's 2008 biopic "W." was unfindable through a search query, even through the website's front end; I had to go to Stone's Rotten Tomatoes page just to find the relevant web page). Wikipedia has a nice API and solid, consistent lookup, which is all the more impressive given that it contains articles on anything, not just films.

III. FEATURES

A. Descriptive Features

 The descriptive features considered were the year of release, runtime, MPAA rating, whether the film was released on a Friday, and membership in genres as defined by Rotten Tomatoes. Rotten Tomatoes has 18 genre labels. A film can belong to any number of these genres.
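To make the descriptive features concrete, the sketch below flattens a film's descriptive data into a fixed-length numeric vector of the kind a tree model can consume. The genre subset, field names, and encoding are my own illustration (the actual Rotten Tomatoes list has 18 genre labels), not the paper's code:

```python
# Illustrative subset of genre labels; the real Rotten Tomatoes list has 18.
GENRES = ["Action & Adventure", "Comedy", "Drama", "Horror"]
MPAA_RATINGS = ["G", "PG", "PG-13", "R", "NC-17"]

def descriptive_features(film):
    """Flatten descriptive data into a fixed-length numeric row."""
    row = [film["year"], film["runtime"], 1 if film["friday_release"] else 0]
    # Genre membership is multi-label: a film can be in any number of genres.
    row += [1 if g in film["genres"] else 0 for g in GENRES]
    # MPAA rating is one-hot: exactly one indicator is set.
    row += [1 if film["mpaa"] == r else 0 for r in MPAA_RATINGS]
    return row

film = {"year": 2012, "runtime": 143, "friday_release": True,
        "genres": {"Action & Adventure"}, "mpaa": "PG-13"}
print(descriptive_features(film))
# [2012, 143, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
```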


B. Wikipedia-Based Features

 For each Wikipedia article, I measured the number of edit runs that occurred during the period 0 to 7 days prior to midnight on the day of the film's release, as well as during the period 7 to 28 days prior. I defined an edit run as a sequence of consecutive edits from the same author (identified by IP address if anonymous). Sometimes, on Wikipedia, the same author commits several edits in a row, presumably as part of a single effort to edit the page, which I wanted to correspondingly treat as a single edit. I generally found this to be a slight improvement over raw edit count in terms of predictive power.

 I also extracted a few features from the content of the article revisions themselves. One feature I used was the average size, in bytes, of revisions in the 28-day window. Other features were obtained by scanning the text of the revisions for certain textual patterns. One was a count of the number of article section headings, another was a count of the number of external file references (typically an image or sound file inserted into the article), and the last was a case-insensitive search for the word "IMAX".

C. Revenues of Similar Films

 A natural approach to predicting the box office performance of a film is to look at comparable films; in particular, the natural benchmark for a sequel is its predecessor. To this end, I created a feature consisting of revenues of "similar" films released in the five years preceding each film's release (hence, data as far back as 2002 was involved, even though the training dataset extended only as far back as 2007). The five-year window was arbitrary, but I think it forms a reasonable basis for comparing expected box office performance.
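The edit-run counting described above can be sketched in Python. This is an illustrative helper of my own (not the paper's code), assuming revisions arrive as chronologically sorted (timestamp, author) pairs, with anonymous authors represented by their IP address:

```python
from datetime import datetime, timedelta

def count_edit_runs(revisions, release_midnight, start_days, end_days):
    """Count edit runs starting in [start_days, end_days) days before release.

    A run is a maximal sequence of consecutive revisions by the same author,
    so several back-to-back edits by one editor count only once.
    """
    lo = release_midnight - timedelta(days=end_days)
    hi = release_midnight - timedelta(days=start_days)
    runs, prev_author = 0, object()  # sentinel that matches no real author
    for timestamp, author in revisions:
        new_run = author != prev_author
        prev_author = author
        if new_run and lo <= timestamp < hi:
            runs += 1
    return runs

release = datetime(2012, 5, 4)  # midnight on the release day
revisions = [
    (datetime(2012, 4, 10, 9, 0), "203.0.113.7"),  # falls in 7-28 day window
    (datetime(2012, 4, 30, 12, 0), "EditorA"),     # 0-7 day window...
    (datetime(2012, 4, 30, 12, 5), "EditorA"),     # ...same run as above
    (datetime(2012, 5, 1, 18, 0), "EditorB"),
]
print(count_edit_runs(revisions, release, 0, 7))   # 2
print(count_edit_runs(revisions, release, 7, 28))  # 1
```

The paper does not specify how runs straddling a window boundary are attributed; here a run is credited to the window containing its first edit.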

Fig 2 Example similarity scores for "The Avengers" (2012).
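Similarity scores like those in Figure 2 can be sketched as below. The helper names are mine, and the weighted-average form of the revenue feature is an assumption on my part - the paper says only that earlier films' revenues were "weighted by similarity":

```python
from math import sqrt

def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def film_similarity(film, other):
    """Geometric mean of Jaccard similarities over genres and cast/directors."""
    return sqrt(jaccard(film["genres"], other["genres"]) *
                jaccard(film["people"], other["people"]))

def similar_revenue_feature(film, earlier_films):
    """Similarity-weighted average of earlier films' opening revenues
    (assumed form; the paper specifies only similarity weighting)."""
    pairs = [(film_similarity(film, f), f["revenue"]) for f in earlier_films]
    total_weight = sum(w for w, _ in pairs)
    if total_weight == 0:
        return 0.0
    return sum(w * r for w, r in pairs) / total_weight
```

Under this sketch, a sequel sharing all of its genres and listed cast with its predecessor gets similarity 1.0, so the predecessor's opening revenue dominates the feature, matching the paper's intuition that a sequel's natural benchmark is its predecessor.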

 Similarity between two films was defined as the geometric mean of the Jaccard4 similarity measures of the films' 1) Rotten Tomatoes genre information and 2) Rotten Tomatoes cast/director information. The Rotten Tomatoes API only returns the first few starring members of each film's cast, so the metric is not distorted by differing cast sizes. Directors were always treated as single people, even if there were co-directors, so for our purposes, the Coen brothers, for example, count as a single person.

 The feature incorporated into the algorithms was, for each film, the opening revenue of all other films in our universe released up to five years prior to that film, weighted by similarity. See Figure 2 for an example of similarity scores for one of the films in the test dataset.

IV. ANALYSIS AND PREDICTION

 I tried a few different prediction algorithms; the one that proved the most effective on the test set, as measured by R2, was gradient boosting trees.5 Gradient boosting is a general predictive technique pioneered by Jerome Friedman of Stanford in which a predictive formula is generated by summing so-called "weak predictors" that are sequentially fit to the gradient of a specified loss function (for example, squared error). The overall model may be accurate and robust even if each individual weak predictor is very simplistic. Gradient boosting trees refers to gradient boosting with decision trees as the weak predictors. For details, see Friedman's article [2], and also Wikipedia's own page on gradient boosting [3].

 I used the Python statistical package scikit-learn's implementation of gradient boosting trees, using the default learning rate and least squares as my loss function. There are a few other model parameters that can be controlled by the user; the most important ones are the number of estimators (the number of weak predictors to fit) and the depth of the trees (how many leaves are in each decision tree - this parameterizes the complexity of each individual weak predictor).

 Adapting the example in scikit-learn's documentation [1], I calculated the R2 of gradient boosting trees at different iterations and tree depths. I fit the model using different parameterizations to the test data. Figure 3 illustrates the results and shows that this model fits the test data best with about 100 iterations (this is, in fact, scikit-learn's default value) and a very simple 2-leaf functional form for its weak predictors.

Fig 3 R2 of gradient boosting tree models on the test dataset as a function of the number of estimator iterations. The different curves represent different numbers of leaves in the weak learner decision trees. The simplest weak learner, a 2-leaf tree, performs the best. Using stochastic gradient boosting trees, in which a subsample of the features is used to fit the decision trees, improved the high-leaf models to some degree. This suggests that the inferior performance of the higher-leaf models may be due to overfitting.

 Using a gradient boosting tree model with 100 estimators and two leaves in each weak learner and training on films from 2007 to 2011, as mentioned previously, I was able to achieve an R2 of 0.5400 on the 2012 dataset. The predictions and results are listed in an appendix at the end of this paper. Figure 4 shows a scatter of predictions and actual values.

 The frequency with which a feature is used in the model's decision trees is representative of its importance in generating predictions; highly relevant features will be frequently involved in trees, and irrelevant features will be involved rarely or not at all. Table 1 shows the top 10 features. Several features had frequencies of 0, in particular the boolean variables for several of the genre categories, indicating that they could have been completely omitted without impacting the outcomes of this model.

4 The Jaccard similarity of two sets A and B is defined as |A ∩ B| / |A ∪ B|.

5 Random forests and ordinary linear regression performed worse, but not by much. Despite the clearly non-normal distribution of the revenue per theater (it has a positive skew), I did not have better success with a generalized linear regression than with ordinary linear regression.

Fig 4 Predicted values vs. actual values.
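To make the boosting procedure concrete, here is a toy pure-Python version of gradient boosting with 2-leaf regression stumps and squared-error loss: each stump is fit to the current residuals (the negative gradient of squared error) and added to the ensemble with a shrinkage factor. This is a sketch of the technique, not the paper's actual model, which used scikit-learn; all function names are mine:

```python
def fit_stump(X, residuals):
    """Fit a 2-leaf regression stump minimizing squared error on residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate split between adjacent values
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]  # (feature index, threshold, left value, right value)

def fit_gbt(X, y, n_estimators=100, learning_rate=0.1):
    """Sequentially fit stumps to residuals of the squared-error loss."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - p for yi, p in zip(y, preds)]
        j, t, lv, rv = fit_stump(X, residuals)
        stumps.append((j, t, lv, rv))
        preds = [p + learning_rate * (lv if row[j] <= t else rv)
                 for row, p in zip(X, preds)]
    return base, learning_rate, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * (lv if row[j] <= t else rv)
                      for j, t, lv, rv in stumps)
```

In scikit-learn, roughly the same configuration would be GradientBoostingRegressor(n_estimators=100, max_leaf_nodes=2) with the default learning rate and squared-error loss.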

Table 1 Top 10 Features in the Gradient Boosting Tree Model.


Feature Frequency (%)
Wikipedia edit runs 7-28 days prior 18.31
Film runtime 14.60
Opening per-theater revenue of similar films 13.30
Wikipedia frequency of headers/subheaders 12.07
Wikipedia edit runs 0-7 days prior 10.97
Wikipedia average size of revisions 9.73
Wikipedia frequency of word “IMAX” 5.07
Wikipedia frequency of external files 4.62
Is comedy 3.74
MPAA rating is PG-13 3.17
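Frequencies like those in Table 1 can be computed by counting which feature each weak learner's split uses; features never chosen get 0% and could be dropped. A self-contained sketch with made-up split data (the split list and feature names are illustrative only):

```python
from collections import Counter

def feature_frequency(splits, feature_names):
    """Percent of tree splits using each feature, rounded to 2 decimals.

    splits: one feature index per weak learner's split in the fitted model.
    """
    counts = Counter(splits)
    total = len(splits)
    return {name: round(100.0 * counts[i] / total, 2)
            for i, name in enumerate(feature_names)}

# Hypothetical feature indices chosen by ten 2-leaf weak learners:
splits = [0, 0, 1, 2, 0, 1, 0, 0, 1, 2]
print(feature_frequency(splits, ["edit_runs_7_28", "runtime", "similar_revenue"]))
# {'edit_runs_7_28': 50.0, 'runtime': 30.0, 'similar_revenue': 20.0}
```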

 The importance of the Wikipedia data in this model can also be seen by removing the Wikipedia features and rerunning the model, which produces a considerably lower R2 of 0.3434.

V. CONCLUSION AND AVENUES FOR FURTHER EXPLORATION

 While the results above do show that Wikipedia activity has some ability to predict box office returns, I do not think the model in this paper is precise enough to be used as anything but a very rough forecasting tool. Wikipedia is just one possible source of data for quantifying social interest; social networks such as Twitter or Facebook are another; frequency of appearance in news headlines is another. There are many conceivable metrics to gauge popular interest in seeing a film, and a comprehensive model would include data from many sources.

 In particular, a many-source approach will help overcome the biases that any one source would have. Although Wikipedia is widely known, read, and edited by a wide variety of people, it will still be biased to whatever extent Wikipedia editors do not reflect the population of people who go to the movies. It is my opinion that the best way to improve this model would be to obtain more measurements of popular interest, particularly from data sources whose audiences overlap little with Wikipedia editors - measurements of interest among moviegoing demographics that use the Internet relatively infrequently, for example.

 Nevertheless, the partial success in predicting box office revenues with Wikipedia demonstrates that it is one potential source of data to consider when gauging interest - and not just in films, but anywhere popular interest is a concern. Wikipedia could be used as input for predictions related to interest in news and current events, ticket sales for events other than films, investor sentiments, and many other areas.

6 In fact, I found that the number of opening theaters itself has significant predictive power on per-theater revenue. I omitted it mainly because I wanted to specifically examine Wikipedia's ability to measure social interest.

Table 2 2012 Predictions and Errors, Sorted by Actual Revenue per theater.
Title Actual Predicted Error (actual - predicted)
Marvel’s The Avengers 47698 26452 21247
The Hunger Games 36871 22247 14624
The Dark Knight Rises 36532 19194 17338
The Twilight Saga: Breaking Dawn Part 2 34660 11890 22770
Skyfall 25211 31496 -6285
The Hobbit: An Unexpected Journey 20919 18152 2767
Dr. Seuss’ The Lorax 18830 7018 11812
The Amazing Spider-Man 17176 21054 -3877
Ted 16800 8127 8673
Think Like a Man 16693 5536 11157

Table 3 2012 Predictions and Errors, Sorted by Actual Revenue per theater (Part 1).
Title Actual Predicted Error
Abraham Lincoln: Vampire Hunter 5247 5668 -421
The Cabin in the Woods 5245 6979 -1734
Sparkle 5189 4511 677
Mirror Mirror 5032 3589 1444
Red Dawn 4916 7430 -2514
The Three Stooges 4892 5981 -1089
Rise of the Guardians 4869 8725 -3856
End of Watch 4818 2503 2315
Cloud Atlas 4787 8046 -3259
Step Up Revolution 4570 4409 162

Table 4 2012 Predictions and Errors, Sorted by Actual Revenue per theater (Part 2).
Title Actual Predicted Error
Alex Cross 4489 3955 533
That’s My Boy 4440 6258 -1818
Parental Guidance 4392 4140 252
Diary of a Wimpy Kid: Dog Days 4312 5826 -1514
The Dictator 4245 7210 -2965
The Secret World of Arrietty 4235 6930 -2695
The Man with the Iron Fists 4235 6053 -1818
One For the Money 4207 4619 -411
Rock of Ages 4161 7405 -3244
ParaNorman 4108 6899 -2791

REFERENCES

[1]. "Ensemble methods." Retrieved 13 Jan 2012. http://scikit-learn.org/stable/modules/ensemble.html
[2]. Friedman, Jerome H. (19 Apr 2001). "Greedy Function Approximation: A Gradient Boosting Machine." Retrieved 10 Jan 2012. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf
[3]. "Gradient boosting." Retrieved 13 Jan 2012. http://en.wikipedia.org/wiki/Gradient_boosting
[4]. "List of hoaxes on Wikipedia." Retrieved 10 Jan 2012. http://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wikipedia
[5]. Pfeiffer, Eric (4 Jan 2013). "War is over: Imaginary 'Bicholim' conflict removed from Wikipedia after five years." Retrieved 10 Jan 2013.
[6]. "Wikipedia." Retrieved 10 Jan 2012. http://en.wikipedia.org/wiki/Wikipedia
