Keywords
publishing, copyright, sci-hub, open access, intellectual property, piracy
This article is included in the Research on Research, Policy & Culture gateway.
Over the course of the 20th century, the academic publishing market was radically transformed. What used to be a small, decentralized marketplace, occupied by university presses and educational publishers, is now a global, highly profitable enterprise, dominated by commercial publishers1. This development is seen as the outcome of a multifactorial process: the inability of libraries to resist price increases, the passivity of researchers who do not directly bear the costs, and the merging of publishing companies together led to an oligopoly2.
In response to these developments and rising subscription costs, the Open Access movement set out to reclaim the process of academic publishing3. Besides the academic and economic impact, the potential societal impact of Open Access publishing is getting more attention4,5, and large funding bodies seem to agree, as more and more of them are adopting Open Access policies6–8. These efforts seem to have an effect: a 2014 study of scholarly publishing in the English language found that, while the adoption of Open Access varies between scholarly disciplines, an average of around 24% of scholarly documents are freely accessible on the web9.
Another response to these shifts in the academic publishing world is what has been termed Guerilla Open Access1, Bibliogifts10 or Black Open Access11: in short, the use of semi-legal or outright illegal ways of accessing scholarly publications, such as peer2peer file sharing (for example the use of #icanhazpdf on Twitter10) or centralized web services like Sci-Hub/LibGen12.
Sci-Hub in particular, which started in 2011, has moved into the spotlight in recent years. According to founder Alexandra Elbakyan, the website uses donated library credentials of contributors to circumvent publishers’ paywalls and thus downloads large parts of their collections13. This clear violation of copyright not only led to a lawsuit by Elsevier against Elbakyan14, but also to her being called "the Robin Hood of Science"15, with both sparking further interest in Sci-Hub.
Despite this, there has been little research into how Sci-Hub is used and what kind of materials are being accessed through it. A 2014 study looked at content provided through LibGen10. In 2016 Sci-Hub released data on ~28 million downloads made through the service16. This data was subsequently analyzed to see in which countries the website is being used, which publishers are most frequent13, how downloading publications through Sci-Hub relates to socio-economic factors, such as being based in a research institution17, and how it impacts interlibrary loans12.
In March 2017 Sci-Hub released the list of ~62 million Digital Object Identifiers (DOIs) of the content it has stored. This study is the first to utilize both the data on which publications are downloaded through Sci-Hub and the complete corpus available through it. This allows a data-driven approach to evaluate what is stored in the Sci-Hub universe, how the actual use of the service differs from that, and what different use cases people might have for Sci-Hub.
The data on the roughly 62 million DOIs indexed by Sci-Hub was taken from the dataset released on 2017-03-1918. In addition, the data on the 28 million downloads made through Sci-Hub between September 2015 and February 201616 was matched to the complete corpus of DOIs. This made it possible to quantify how often each object listed in Sci-Hub was actually requested by its user base.
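The matching step described above can be illustrated with a minimal sketch. It is not the code used for this study (that is archived with the dataset); the function names and the toy DOIs are hypothetical, and lowercasing is used here only because DOI names are case-insensitive.

```python
from collections import Counter


def normalize_doi(doi):
    # DOI names are case-insensitive, so compare them in lowercase.
    return doi.strip().lower()


def downloads_per_doi(corpus_dois, download_log):
    """Count how often each DOI in the corpus appears in the download log.

    Returns (counts, unmatched): counts maps every corpus DOI to its number
    of downloads (0 if never requested); unmatched is the number of log
    entries whose DOI is not part of the corpus.
    """
    log_counts = Counter(normalize_doi(d) for d in download_log)
    corpus = {normalize_doi(d) for d in corpus_dois}
    counts = {doi: log_counts.get(doi, 0) for doi in corpus}
    unmatched = sum(n for doi, n in log_counts.items() if doi not in corpus)
    return counts, unmatched


# Hypothetical toy data, not taken from the Sci-Hub dumps.
corpus_dois = ["10.1000/a", "10.1000/b", "10.1000/c"]
download_log = ["10.1000/a", "10.1000/A", "10.1000/c", "10.9999/x"]

counts, unmatched = downloads_per_doi(corpus_dois, download_log)
print(counts, unmatched)
```

Keeping corpus entries with zero downloads in the result is what later allows statements about the share of the corpus that is never requested.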
The corresponding information on the publisher, the year of publication, and the journal in which each item was published was retrieved from doi.org, using the RubyGem Terrier (v1.0.2, https://github.com/Authorea/terrier). The metadata for each of the 62 million DOIs in Sci-Hub was acquired between 2017-03-20 and 2017-03-31. In order to save time, the DOIs of the 28 million downloads were then matched to the superset of already resolved DOIs of the complete Sci-Hub catalog. In both cases, DOIs that could not be resolved were excluded from further analysis, but they are included in the dataset released with this article.
For each publisher, the number of papers downloaded was compared to the expected number of downloads, given the publisher’s presence in the whole Sci-Hub database. For this, the relative contribution to the database was calculated for each publisher, excluding all missing data. The number of actual downloads was then compared to the expected number of downloads using a binomial test. All p-values were corrected for multiple testing with False Discovery Rate19, and results with a post-correction p<0.05 were accepted as significant.
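As an illustration of this comparison, the following sketch runs an exact two-sided binomial test per publisher and then adjusts the p-values. It rests on several assumptions: the text only states "False Discovery Rate", so the Benjamini-Hochberg procedure is assumed here; the publisher names and counts are made-up toy values; and the function names are hypothetical. The pure-Python test is only practical for small counts; at the scale of millions of downloads a statistics library would be needed.

```python
from math import comb


def binom_pmf(k, n, p):
    # Probability of exactly k successes in n trials with success rate p.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(k, n, p):
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are at most as likely as
    the observed one (with a small tolerance for floating-point ties).
    """
    observed = binom_pmf(k, n, p)
    total = sum(
        pm
        for pm in (binom_pmf(i, n, p) for i in range(n + 1))
        if pm <= observed * (1 + 1e-7)
    )
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy counts: items per publisher in corpus and downloads.
corpus = {"A": 500, "B": 300, "C": 200}
downloads = {"A": 70, "B": 20, "C": 10}

total_corpus = sum(corpus.values())
total_downloads = sum(downloads.values())
names = sorted(corpus)

raw = [
    binom_test_two_sided(downloads[n], total_downloads, corpus[n] / total_corpus)
    for n in names
]
adjusted = benjamini_hochberg(raw)

for name, p_adj in zip(names, adjusted):
    expected = total_downloads * corpus[name] / total_corpus
    direction = "over" if downloads[name] > expected else "under"
    print(f"{name}: expected {expected:.0f}, observed {downloads[name]}, "
          f"{direction}represented, adjusted p = {p_adj:.3g}")
```

The expected number of downloads for a publisher is simply its share of the corpus multiplied by the total number of downloads; the test asks how surprising the observed count is under that share.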
Of the 61,940,926 DOIs listed in the Sci-Hub data dump, a total of 46,931,934 DOIs could be resolved (75.77%). Manual inspection of the unresolvable ~24% shows that nearly all of these could not be resolved because they are not available via doi.org, and not because of a technical error in the procedure to resolve them (e.g. lack of internet connection). For the data on the downloads made through Sci-Hub, 21,515,195 downloads could be resolved out of 27,819,965 total downloads (77.34%).
To estimate the age distribution of the publications listed in Sci-Hub, and which fraction of these publications is actually requested by the people using Sci-Hub, the respective datasets were tabulated according to the year of publication, see Figure 1. While over 95% of the publications listed in Sci-Hub were published after 1950, there is nevertheless a long tail, reaching back to the 1619 edition of Descriptio cometæ20.
Figure 1: Red bars denote the years 1914, 1918, 1939 and 1945. Bottom: Number of publications downloaded by year of publication.
As a general trend, the number of publications listed in Sci-Hub increases from year to year. Two notable exceptions are the periods of the two World Wars, at the end of which the number of publications dropped to pre-1906 and pre-1926 levels, respectively (red bars in Figure 1).
When it comes to the publications downloaded by Sci-Hub users, the skew towards recent publications is even more extreme. Over 95% of all downloads are of publications published after 1982, with ~35% of the downloaded publications being less than 2 years old at the time they are accessed (i.e. published after 2013). Despite this, there is also a long tail of publications being accessed, with articles published even in the 1600s being amongst the downloads, and 0.04% of all downloads being made for publications released prior to 1900.
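Cutoff statements of the kind "over 95% were published after year X" can be derived from a per-year tabulation. The sketch below shows one way to do this; the function name and the toy list of publication years are hypothetical and are not drawn from the Sci-Hub data.

```python
from collections import Counter


def cutoff_year(years, share=0.95):
    """Return the latest year y such that at least `share` of all items
    have a publication year >= y, or None for an empty input."""
    counts = Counter(years)
    total = sum(counts.values())
    cumulative = 0
    # Accumulate counts from the most recent year backwards.
    for year in sorted(counts, reverse=True):
        cumulative += counts[year]
        if cumulative / total >= share:
            return year
    return None


# Hypothetical toy data: 100 publication years.
years = [2015] * 35 + [2010] * 40 + [1990] * 21 + [1950] * 3 + [1890] * 1

cutoff = cutoff_year(years, share=0.95)
pre_1900_share = sum(1 for y in years if y < 1900) / len(years)
print(cutoff, pre_1900_share)
```

Applying the same function to the corpus and to the download log separately is what makes the two age distributions directly comparable.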
The complete released database contains ~177,000 journals, with ~60% of these having at least a single paper downloaded. The number of articles per journal likely follows an exponential function, for both the total number of publications listed on Sci-Hub as well as the number of downloaded articles (see Supplementary Figure S1), with <10% of the journals being responsible for >50% of the total content in Sci-Hub. The skew for the downloaded content is even more extreme, with <1% of all journals getting over 50% of all downloads.
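The concentration figures above (which fraction of journals accounts for more than half of the content) can be computed from per-journal counts as in the following sketch. The function name and the toy counts are hypothetical; the same calculation would apply to publishers instead of journals.

```python
def fraction_of_groups_for_share(counts, share=0.5):
    """Return the smallest fraction of groups (e.g. journals) whose
    combined count exceeds `share` of the total, taking the largest
    groups first. Returns None if `counts` is empty."""
    sizes = sorted(counts.values(), reverse=True)
    total = sum(sizes)
    cumulative = 0
    for n_groups, size in enumerate(sizes, start=1):
        cumulative += size
        if cumulative > share * total:
            return n_groups / len(sizes)
    return None


# Hypothetical toy data: articles per journal.
articles_per_journal = {"J1": 60, "J2": 20, "J3": 10, "J4": 5, "J5": 5}

fraction = fraction_of_groups_for_share(articles_per_journal, share=0.5)
print(fraction)
```

A smaller returned fraction means a stronger skew, so comparing the value for the corpus with the value for the downloads quantifies how much more concentrated usage is than content.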
Contrasting the 20 most frequent journals in the complete database with the 20 most downloaded ones (Figure 2), one observes a clear shift not only in the distribution but also in the ranking, with the most abundant journal of the whole corpus not appearing in the 20 most downloaded journals. In addition, chemical journals appear to be overrepresented in the downloads (12 journals), compared to the complete corpus (7 journals), with no other discipline showing an increase amongst the 20 most frequent journals.
Looking at the data on a publisher level, there are ~1,700 different publishers, with ~1,000 having at least a single paper downloaded. Both corpus and downloaded publications are heavily skewed towards a small set of publishers, with the 9 most abundant publishers accounting for ~70% of the complete corpus and ~80% of all downloads, respectively (see Supplementary Figure S2).
Given the background frequency in the complete corpus, the download numbers were compared to the expected numbers using a binomial test. After false discovery rate correction for multiple testing, 982 publishers differed significantly from the expected download numbers, with 201 publishers having more downloads than expected and 781 being underrepresented. Interestingly, while some big publishers like Elsevier and Springer Nature are amongst the overly downloaded publishers, many of the large publishers, like Wiley-Blackwell and the Institute of Electrical and Electronics Engineers (IEEE), are being downloaded less than expected given their portfolio (Figure 3).
Earlier investigations into the data provided through Sci-Hub and LibGen focused largely either on the material being accessed13 or on the data stored in these resources10. This study is the first to make use of both the whole corpus of Sci-Hub and the data on how this corpus is being accessed by its users.
Comparing actual usage with the background set of articles shows that recently published articles are highly sought after, giving some evidence that embargoes prior to making publications Open Access seem to be becoming less effective. These findings are in line with prior research into the motivations for crowd-sourced, peer2peer academic file sharing21. While embargoes have an impact on the use of those publications22, these hurdles are increasingly being overcome by Black Open Access11, as provided by Sci-Hub.
While a good part of the literature available through Sci-Hub seems to be rarely accessed, the long tail of publications, especially older ones, seems to be put to use - albeit at a lower frequency. With DOIs that are unresolvable due to issues on publishers’ sides23, and with Open Access publications that disappear behind accidental paywalls24, this use for Black Open Access might play an important role and needs to be investigated more closely. It is worth noting that all analyses related to the number of downloads are limited to the six-month period between September 2015 and February 2016, and do not necessarily reflect the complete use of Sci-Hub.
Looking at the disproportionately frequented journals, one finds that 12 of the 20 most downloaded journals can broadly be classified as being within the subject area of chemistry. This effect has also been seen in a prior study looking at the downloads made from Sci-Hub in the United States12. In addition, publishers with a focus on chemistry and engineering are also amongst the most highly accessed and overrepresented. While it is unclear whether this imbalance is due to a lack of access through university libraries, it is noteworthy that both disciplines have a traditionally high number of graduates who go into industry. The 2013 Survey of Doctorate Recipients of the National Center for Science and Engineering Statistics (NCSES) of the United States finds that 50% of chemistry graduates and 58% of engineering graduates move to private, for-profit industry, while only 32% and 27%, respectively, stay at educational institutions25. In comparison, in the life sciences these numbers are nearly switched, with 52% of graduates staying at educational institutions, which presumably offer more access to the scientific literature.
The prior analysis of the roughly 28 million downloads made through Sci-Hub painted a bleak picture of the diversity of actors in the academic publishing space, with around 1/3 of all articles downloaded being published by Elsevier13. The analysis presented here puts this into perspective with the whole space of academic publishing available through Sci-Hub, in which Elsevier is also the dominant force with ~24% of the whole corpus. The general picture of a few publishers dominating the market, with around 50% of all publications being published by only 3 companies, is even more pronounced at the usage level than in the complete corpus, perpetuating the trend of the rich getting richer. Only 11% of all publishers, amongst them already dominant companies, are downloaded more often than expected, while publications of 45% of all publishers are downloaded significantly less often than expected.
The analyses presented here suggest that Sci-Hub is used for a variety of reasons, by different populations. While most usage is biased towards getting access to recent publications, there is a subset of users interested in getting historical academic literature. Compared to the complete corpus, Sci-Hub seems to be a convenient resource, especially for engineers and chemists, as the overrepresentation shows. Lastly, when it comes to the representation of publishers, the Sci-Hub data shows that the academic publishing field is even more of an oligopoly in terms of actual usage when compared to the amount of literature published. Further analysis of how, by whom and where Sci-Hub is used will undoubtedly shed more light onto the practice of academic publishing around the globe.
All the data used in this study, as well as the code to analyze the data and create the figures, is archived on Zenodo as Data and Scripts for Looking into Pandora’s Box: The Content of Sci-Hub and its Usage (DOI, 10.5281/zenodo.472493)26.
The analysis code can also be found on GitHub at http://www.github.com/gedankenstuecke/scihub.
The author uses Sci-Hub regularly in his own research. Otherwise the author declares no competing financial, personal, or professional interests.
The author wants to acknowledge Alexandra Elbakyan, for releasing both data sets used in this study. Further thanks go to John Bohannon, who analyzed and helped release the initial data on downloads from Sci-Hub, and to Athina Tzovara and Philipp Bayer, for fruitful discussion of this manuscript as well as the statistics and analyses involved.
Supplementary Figure S1: Top: The distribution of publications per journal in the whole corpus, sorted in ascending order of articles. Bottom: The distribution of downloads per journal, sorted in ascending order of downloads.
Supplementary Figure S2: The proportion of the whole content as aggregated by publisher, both for the corpus (top) and downloads (bottom). Sorted by number of publications in the respective dataset. Only the 9 most frequent publishers are listed, smaller ones are grouped as other.
Hey Ernesto,
thanks for the interest in the paper!
I just tried to download the data from Zenodo using the link you gave in your comment and it worked on my end without any issues (with Chrome, using the University's internet connection in my office). So either it was a temporary issue with Zenodo or the issue must be somehow with your connection.
I vaguely remember that someone had issues with Zenodo and their connection as well at some point. Could you try another connection for the download? Otherwise I'd be happy to find another way to get the data to you, i.e. if it helps I can deposit the data somewhere else for comparison.
Cheers,
Bastian