{{Short description|Nonprofit web crawling and archive organization}}
{{Infobox dot-com company
| name
| company_type
| traded_as
| foundation
| dissolved =
| location
| incorporated =
| founder
| chairman
| president =
| industry =
| key_people = [[Peter Norvig]], [[Nova Spivack]], [[Carl Malamud]], [[Kurt Bollacker]], [[Joi Ito]]
| parent =
| num_employees =
| subsid =
| ipv6 =
| url = {{url|commoncrawl.org}}
| alexa =
| language =
| launch_date =
| license=[[Apache 2.0]] (software) {{clarify|reason=dataset license?|date=November 2024}}
}}
'''Common Crawl''' is a [[nonprofit organization|nonprofit]] [[501(c) organization#501.28c.29.283.29|501(c)(3)]] organization that [[web crawler|crawls]] the web and freely provides its archives and datasets to the public.<ref name=latimes>{{cite news |title=Tech entrepreneur Gil Elbaz made it big in L.A.|author=Rosanna Xia|work=Los Angeles Times|date=February 5, 2012|access-date=July 31, 2014|url=
Common Crawl was founded by [[Gil Elbaz]].<ref name=twist>{{cite news |title=Startups - Gil Elbaz and Nova Spivack of Common Crawl - TWiST #222|publisher=This Week In Startups|date=January 10, 2012}}</ref> Advisors to the non-profit include [[Peter Norvig]] and [[Joi Ito]].<ref name=technologyreview>{{cite news
The Common Crawl dataset includes copyrighted work and is distributed from the US under [[fair use]] claims. Researchers in other countries have used techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other [[Jurisdiction|legal jurisdictions]].<ref>{{Cite journal |last=Schäfer |first=Roland |title=CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws |url=https://aclanthology.org/L16-1712 |journal=Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) |date=May 2016 |location=Portorož, Slovenia |publisher=European Language Resources Association (ELRA) |pages=4501}}</ref>
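The sentence-shuffling idea mentioned above can be illustrated with a minimal Python sketch. It is not the method of the cited paper: the function name <code>shuffle_sentences</code>, the naive regular-expression sentence splitter, the fixed seed, and the sample text are all assumptions made for illustration.

```python
import random
import re


def shuffle_sentences(text: str, seed: int = 0) -> list[str]:
    """Split text into sentences and return them in a pseudo-random order."""
    # Naive splitter (an assumption): break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rng.shuffle(sentences)  # shuffles the list in place
    return sentences


# Hypothetical sample text, not taken from any crawl
sample = "The crawl ran in April. It covered many sites. Most pages were in English."
shuffled = shuffle_sentences(sample)
print(shuffled)
```

The shuffled output keeps every sentence but discards the original order, so the source document cannot be read back as a continuous text.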
English is the primary language for 46% of documents in the March 2023 version of the Common Crawl dataset. The next most common primary languages are German, Russian, Japanese, French, Spanish and Chinese, each with less than 6% of documents.<ref>{{Cite web |title=Statistics of Common Crawl Monthly Archives by commoncrawl |url=https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html |access-date=2023-04-02 |website=commoncrawl.github.io}}</ref>
==History==
[[Amazon Web Services]] began hosting Common Crawl's archive through its Public Data Sets program in 2012.<ref name=semanticweb_1>{{cite news |title=Common Crawl
The organization began releasing [[metadata]] files and the text output of the crawlers alongside [[
In December 2012, [[blekko]] donated to Common Crawl search engine [[metadata]] blekko had gathered from crawls it conducted from February to October 2012.<ref name=semanticweb_3>{{cite news |author=Jennifer Zaino |date=December 18, 2012 |title=Blekko Data Donation Is
In 2013, Common Crawl began using the [[Apache Software Foundation
A filtered version of Common Crawl was used to train OpenAI's [[GPT-3]] language model, announced in 2020.<ref>{{Cite arXiv |last1=Brown |first1=Tom |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |date=2020-06-01
==Timeline of Common Crawl data==
The following data have been collected from the official Common Crawl Blog and Common Crawl's API.<ref>{{Cite web|url=https://index.commoncrawl.org/collinfo.json|title=Collection info - Common Crawl}}</ref>
{| class="wikitable"
|-
! Crawl date !! Size in [[Tebibyte|TiB]] !! Billions of pages !! Comments
|-
|April 2024
|386
|2.7
|Crawl conducted from April 12 to April 24, 2024
|-
|February/March 2024
|425
|3.16
|Crawl conducted from February 20 to March 5, 2024
|-
|December 2023
|454
|3.35
|Crawl conducted from November 28 to December 12, 2023
|-
|June 2023
|390
|3.1
|Crawl conducted from May 27 to June 11, 2023
|-
|April 2023
|400
|3.1
|Crawl conducted from March 20 to April 2, 2023
|-
|February 2023
|400
|3.15
|Crawl conducted from January 26 to February 9, 2023
|-
|December 2022
|420
|3.35
|Crawl conducted from November 26 to December 10, 2022
|-
|October 2022
|-
|August 2018
|
|2.65
|
|-
| March 2014 || 223 || 2.8 || First Nutch crawl
|-
| 2008-2009 || ? || ? || Crawl conducted from May 2008 through January 2009
|}
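As a rough illustration of how the two numeric columns relate, the following Python sketch derives an average archive size per page from a few rows of the table above. The helper name <code>kib_per_page</code> is hypothetical, and the figures are only as precise as the rounded table values.

```python
TIB = 2**40  # bytes in one tebibyte

# (crawl, size in TiB, billions of pages), copied from rows of the table above
crawls = [
    ("April 2024", 386, 2.7),
    ("February/March 2024", 425, 3.16),
    ("December 2023", 454, 3.35),
    ("March 2014", 223, 2.8),
]


def kib_per_page(size_tib: float, billions_of_pages: float) -> float:
    """Average archive size per page, in KiB (hypothetical helper)."""
    return size_tib * TIB / (billions_of_pages * 1e9) / 1024


for name, size_tib, pages in crawls:
    print(f"{name}: {kib_per_page(size_tib, pages):.1f} KiB per page")
```

Because both inputs are rounded, the result is best read as an order-of-magnitude estimate of per-page size rather than an exact figure.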
==Norvig Web Data Science Award==
In collaboration with [[SURFsara]], Common Crawl sponsors the Norvig Web Data Science Award, a competition open to students and researchers in [[Benelux]].<ref name=ccaward>{{cite web |title=The Norvig Web Data Science Award|author=Lisa Green|publisher=Common Crawl|date=November 15, 2012|access-date=July 31, 2014|url=http://commoncrawl.org/the-norvig-web-data-science-award/}}</ref><ref name=dtlsaward>{{cite web|title=Norvig Web Data Science Award 2014|publisher=Dutch Techcentre for Life Sciences|access-date=July 31, 2014|url=http://www.dtls.nl/dtl/news/norvig-web-data-science-award-2014.html|archive-url=https://web.archive.org/web/20140815035946/http://www.dtls.nl/dtl/news/norvig-web-data-science-award-2014.html|archive-date=August 15, 2014|url-status=dead}}</ref> The award is named for [[Peter Norvig]], who also chairs its judging committee.<ref name=ccaward />
== Colossal Clean Crawled Corpus ==
{{Anchor|Colossal Clean Crawled Corpus}}
Google's version of the Common Crawl is called the Colossal Clean Crawled Corpus, or C4 for short. It was constructed for the training of the [[T5 (language model)|T5 language model series]] in 2019.<ref name=":0">{{Cite journal |last=Raffel |first=Colin |last2=Shazeer |first2=Noam |last3=Roberts |first3=Adam |last4=Lee |first4=Katherine |last5=Narang |first5=Sharan |last6=Matena |first6=Michael |last7=Zhou |first7=Yanqi |last8=Li |first8=Wei |last9=Liu |first9=Peter J. |date=2020 |title=Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer |url=http://jmlr.org/papers/v21/20-074.html |journal=Journal of Machine Learning Research |volume=21 |issue=140 |pages=1–67 |issn=1533-7928}}</ref> Concerns have been raised over copyrighted content in C4.<ref>{{Cite news |last=Hern |first=Alex |date=2023-04-20 |title=Fresh concerns raised over sources of training material for AI systems |url=https://www.theguardian.com/technology/2023/apr/20/fresh-concerns-training-material-ai-systems-facist-pirated-malicious |access-date=2023-04-21 |work=The Guardian |language=en-GB |issn=0261-3077}}</ref>
==References==
*[https://github.com/commoncrawl/ Common Crawl GitHub Repository] with the crawler, libraries and example code
*[https://groups.google.com/forum/?fromgroups#!forum/common-crawl Common Crawl Discussion Group]
*[https://commoncrawl.org
[[Category:Internet-related organizations]]