Predicting Film Box Office Performance Using Wikipedia Edit Data
Abstract: This study explores the potential of Wikipedia edit data as a predictor of opening box office revenues for films
released in the US. After analyzing films from 2007 to 2011, we developed a predictive model based on Wikipedia article
edits using gradient boosting trees as the primary algorithm. Our model incorporates features such as the frequency of
Wikipedia edits, the size and content of article revisions, and the revenues of similar films. The results demonstrate that
Wikipedia activity can serve as a rough indicator of film popularity, though the model’s predictive accuracy is limited. We
find that Wikipedia-based features, particularly edit runs and content changes, significantly contribute to the model’s
performance, achieving an R² of 0.54 for films released in 2012. This suggests that while Wikipedia data offers valuable
insights into social interest, it is best used in conjunction with other predictors for more reliable revenue estimates.
How to Cite: Niraj Patel (2025). Predicting Film Box Office Performance Using Wikipedia Edit Data. International Journal of
Innovative Science and Research Technology, 10(2), 1951-1956. https://doi.org/10.5281/zenodo.14987459
I. INTRODUCTION
Fig 1 Count of Wikipedia article edits for the films used in this paper’s training dataset over the 4 weeks prior to each film’s
respective release date, bucketed by days before the release date that the edits occurred. This graph shows the uptick in editing
activity that typically accompanies a film’s release.
According to its article about itself (as of this writing), Wikipedia is “a collaboratively edited, multilingual, free Internet encyclopedia” launched in January 2001. [6] Its articles can be edited by anyone, either anonymously (though the editor’s IP address is logged) or with a registered user account. The edit history of each article is saved with a timestamp. Interested users can view any past version of an article, and an article’s edit history exhibits an evolving record of Wikipedia’s “knowledge”¹ of its subject.
B. Wikipedia-Based Features

For each Wikipedia article, I measured the number of edit runs that occurred during the period 0 to 7 days prior to midnight on the day of the film’s release, as well as during the period 7 to 28 days prior. I defined an edit run as a sequence of consecutive edits from the same author (identified by IP address if anonymous). On Wikipedia, the same author sometimes commits several edits in a row, presumably as part of a single effort to edit the page, and I wanted to treat such a sequence as a single edit. I generally found this to be a slight improvement over raw edit count in terms of predictive power.

I also extracted a few features from the content of the article revisions themselves. One feature was the average size, in bytes, of revisions in the 28-day window. Other features were obtained by scanning the text of the revisions for certain textual patterns: a count of article section headings, a count of external file references (typically an image or sound file inserted into the article), and a case-insensitive search for the word “IMAX”.

C. Revenues of Similar Films

A natural approach to predicting the box office performance of a film is to look at comparable films; in particular, the natural benchmark for a sequel is its predecessor. To this end, I created a feature consisting of revenues of “similar” films released in the five years preceding each film’s release (hence, data as far back as 2002 was involved, even though the training dataset extended only as far back as 2007). The five-year window was arbitrary, but I think it forms a reasonable bound when comparing expected box office performance.
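The edit-run counts and pattern-based features described in Section B can be sketched in Python as follows. This is a minimal illustration rather than the code used for this paper. The function names, the regular expressions for MediaWiki headings and file links, the choice to date a run by its first edit, the half-open windows, and the treatment of the “IMAX” search as a yes/no flag are all assumptions made for the example, and the edit history shown is hypothetical.

```python
import re
from datetime import datetime, timedelta
from itertools import groupby

# Assumed MediaWiki markup patterns; the paper does not state its exact patterns.
HEADING = re.compile(r"^={2,}[^=\n].*?={2,}[ \t]*$", re.MULTILINE)
FILE_REF = re.compile(r"\[\[(?:File|Image):", re.IGNORECASE)
IMAX = re.compile(r"imax", re.IGNORECASE)


def edit_runs(edits):
    """Collapse consecutive edits by the same author into runs.

    `edits` is a list of (timestamp, author) tuples, where author is a user
    name or, for anonymous edits, an IP address. Returns one
    (start_timestamp, author) tuple per run, in chronological order.
    """
    ordered = sorted(edits, key=lambda edit: edit[0])
    return [(next(group)[0], author)
            for author, group in groupby(ordered, key=lambda edit: edit[1])]


def count_runs(edits, release_midnight, min_days, max_days):
    """Count runs that start between max_days and min_days before release.

    Assumption: a run is dated by its first edit, and the window is half-open
    so that the 0-7 and 7-28 day windows never count the same run twice.
    """
    earliest = release_midnight - timedelta(days=max_days)
    latest = release_midnight - timedelta(days=min_days)
    return sum(1 for start, _ in edit_runs(edits) if earliest <= start < latest)


def content_features(wikitext):
    """Pattern-based features from the raw text of a single revision."""
    return {
        "section_headings": len(HEADING.findall(wikitext)),
        "file_references": len(FILE_REF.findall(wikitext)),
        "mentions_imax": IMAX.search(wikitext) is not None,
    }


# Hypothetical edit history for a film released on 2012-05-04.
release = datetime(2012, 5, 4)
history = [
    (datetime(2012, 4, 20, 12, 0), "bob"),
    (datetime(2012, 4, 20, 12, 1), "bob"),          # same run as the edit above
    (datetime(2012, 5, 1, 10, 0), "alice"),
    (datetime(2012, 5, 1, 10, 5), "alice"),         # same run as the edit above
    (datetime(2012, 5, 2, 9, 0), "203.0.113.7"),    # anonymous editor
    (datetime(2012, 5, 3, 9, 0), "alice"),          # new run: another author intervened
]
runs_last_week = count_runs(history, release, 0, 7)
runs_prior_weeks = count_runs(history, release, 7, 28)
```

Collapsing runs before windowing means that a burst of small saves by one author contributes a single count, which is the behavior the edit-run definition is meant to capture.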
I tried a few different prediction algorithms; the one that proved the most effective on the test set, as measured by R², was gradient boosting trees.⁵ Gradient boosting is a general predictive technique pioneered by Jerome Friedman of Stanford in which a predictive formula is [...]

⁵ Random forests and ordinary linear regression performed worse, but not by much. Despite the clearly non-normal distribution of the revenue per theater (it has a positive skew), I did not have better success with a generalized linear regression than with ordinary linear regression.
[...] that can be controlled by the user; the most important ones are the number of estimators (the number of weak predictors to fit) and the depth of the trees (how many leaves are in each decision tree; this parameterizes the complexity of each individual weak predictor).

Adapting the example in scikit-learn’s documentation [1], I calculated the R² of gradient boosting trees at different iterations and tree depths. I fit the model using different parameterizations to the test data. Figure 3 illustrates the results and shows that this model fits the test data best with about 100 iterations (this is, in fact, scikit-learn’s default value) and a very simple 2-leaf functional [...]

Fig 3 R² of gradient boosting tree models on the test dataset as a function of the number of estimator iterations. The different curves represent different numbers of leaves in the weak learner decision trees. The simplest weak learner, a 2-leaf tree, performs the best. Using stochastic gradient boosting trees, in which a subsample of the features is used to fit the decision trees, improved the high-leaf models to some degree. This suggests that the inferior performance of the higher-leaf models may be due to overfitting.
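To make the roles of the number of estimators and the 2-leaf weak learner concrete, the following is a from-scratch sketch of least-squares gradient boosting with 2-leaf trees (stumps), recording R² after every iteration. It is written in plain Python for illustration and is not scikit-learn’s implementation, which is what the paper actually used. All function names, the learning rate of 0.1, and the synthetic data are assumptions made for the example; none of the numbers it produces relate to the film dataset.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot


def fit_stump(X, residuals):
    """Find the single split (a 2-leaf tree) with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]  # (feature index, threshold, left value, right value)


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_boosted_stumps(X, y, n_estimators=100, learning_rate=0.1):
    """Each new stump is fit to the residuals left by the previous ones."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, X)]
    return base, learning_rate, stumps


def staged_r_squared(model, X, y):
    """R² on (X, y) after each boosting iteration."""
    base, learning_rate, stumps = model
    predictions = [base] * len(y)
    scores = []
    for stump in stumps:
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, X)]
        scores.append(r_squared(y, predictions))
    return scores


# Synthetic data for illustration only (not the film dataset).
indices = range(40)
X = [[float(i % 7), float(i % 3)] for i in indices]
y = [10.0 * (x0 > 3) + 5.0 * (x1 > 0) + 0.3 * ((i * 37) % 11 - 5)
     for i, (x0, x1) in zip(indices, X)]
X_train, y_train, X_test, y_test = X[:30], y[:30], X[30:], y[30:]

model = fit_boosted_stumps(X_train, y_train, n_estimators=100)
train_scores = staged_r_squared(model, X_train, y_train)
test_scores = staged_r_squared(model, X_test, y_test)
best_iteration = max(range(len(test_scores)), key=test_scores.__getitem__) + 1
print(f"best held-out R² {max(test_scores):.3f} at iteration {best_iteration}")
```

Plotting `test_scores` against the iteration number gives a curve of the kind shown in Figure 3. Training R² can only rise as stumps are added, so it is the held-out curve that reveals when further iterations stop helping.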
The importance of the Wikipedia data in this model can also be seen by removing the Wikipedia features and rerunning the model, which produces a considerably lower R² of 0.3434.

⁶ In fact, I found that the number of opening theaters itself has significant predictive power on per-theater revenue. I omitted it mainly because I wanted to specifically examine Wikipedia’s ability to measure social interest.

V. CONCLUSION AND AVENUES FOR FURTHER EXPLORATION

While the results above do show that Wikipedia activity has some ability to predict box office returns, I do not think the model in this paper is precise enough to be used as anything but a very rough forecasting tool. Wikipedia is just one possible source of data for quantifying social interest; social networks such as Twitter or Facebook are another; frequency of appearance in news headlines is another. There are many conceivable metrics to gauge popular interest in seeing a film, and a comprehensive model would include data from many sources.

In particular, a many-source approach will help overcome the biases that any one source would have. Although Wikipedia is widely known, read, and edited by a wide variety of people, it will still be biased to whatever extent [...]
Table 2 2012 Predictions and Errors, Sorted by Actual Revenue per theater.
Title Actual Predicted Error (actual - predicted)
Marvel’s The Avengers 47698 26452 21247
The Hunger Games 36871 22247 14624
The Dark Knight Rises 36532 19194 17338
The Twilight Saga: Breaking Dawn Part 2 34660 11890 22770
Skyfall 25211 31496 -6285
The Hobbit: An Unexpected Journey 20919 18152 2767
Dr. Seuss’ The Lorax 18830 7018 11812
The Amazing Spider-Man 17176 21054 -3877
Ted 16800 8127 8673
Think Like a Man 16693 5536 11157
Table 3 2012 Predictions and Errors, Sorted by Actual Revenue per theater (Part 1).
Title Actual Predicted Error
Abraham Lincoln: Vampire Hunter 5247 5668 -421
The Cabin in the Woods 5245 6979 -1734
Sparkle 5189 4511 677
Mirror Mirror 5032 3589 1444
Red Dawn 4916 7430 -2514
The Three Stooges 4892 5981 -1089
Rise of the Guardians 4869 8725 -3856
End of Watch 4818 2503 2315
Cloud Atlas 4787 8046 -3259
Step Up Revolution 4570 4409 162
Table 4 2012 Predictions and Errors, Sorted by Actual Revenue per theater (Part 2).
Title Actual Predicted Error
Alex Cross 4489 3955 533
That’s My Boy 4440 6258 -1818
Parental Guidance 4392 4140 252
Diary of a Wimpy Kid: Dog Days 4312 5826 -1514
The Dictator 4245 7210 -2965
The Secret World of Arrietty 4235 6930 -2695
The Man with the Iron Fists 4235 6053 -1818
One For the Money 4207 4619 -411
Rock of Ages 4161 7405 -3244
ParaNorman 4108 6899 -2791
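As a worked check of the error column, the sketch below recomputes error = actual - predicted for the Table 4 rows and summarizes them with a mean absolute error. The variable names and the choice of mean absolute error as the summary are assumptions made for illustration. Where a recomputed error differs from the listed one by 1, a plausible explanation is that the published figures were rounded from unrounded values, though the paper does not say so.

```python
# (title, actual, predicted, listed error) copied from Table 4.
table_4 = [
    ("Alex Cross", 4489, 3955, 533),
    ("That's My Boy", 4440, 6258, -1818),
    ("Parental Guidance", 4392, 4140, 252),
    ("Diary of a Wimpy Kid: Dog Days", 4312, 5826, -1514),
    ("The Dictator", 4245, 7210, -2965),
    ("The Secret World of Arrietty", 4235, 6930, -2695),
    ("The Man with the Iron Fists", 4235, 6053, -1818),
    ("One For the Money", 4207, 4619, -411),
    ("Rock of Ages", 4161, 7405, -3244),
    ("ParaNorman", 4108, 6899, -2791),
]

# Error convention from Table 2: actual minus predicted, so a negative
# error means the model over-predicted revenue per theater.
recomputed = [actual - predicted for _, actual, predicted, _ in table_4]
listed = [error for _, _, _, error in table_4]

largest_gap = max(abs(r - l) for r, l in zip(recomputed, listed))
mean_absolute_error = sum(abs(e) for e in recomputed) / len(recomputed)
over_predicted = sum(1 for e in recomputed if e < 0)

print(f"largest gap between recomputed and listed error: {largest_gap}")
print(f"mean absolute error for these rows: {mean_absolute_error:.1f}")
print(f"over-predicted films: {over_predicted} of {len(table_4)}")
```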