GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information

Citation Author(s):
Umair Qazi, Qatar Computing Research Institute
Muhammad Imran, Qatar Computing Research Institute
Ferda Ofli, Qatar Computing Research Institute
Submitted by:
Muhammad Imran
Last updated:
Wed, 06/24/2020 - 15:39
DOI:
10.21227/et8d-w881

Abstract

We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over a period of 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. As geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user locations and tweet content and to derive their geolocation information at different granularity levels using Nominatim (OpenStreetMap) data. In terms of geographical coverage, the dataset spans 218 countries and 47K cities in the world. The tweets in the dataset come from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.

The dataset was collected using more than 800 multilingual keywords and hashtags. The complete list of keywords can be downloaded from here: https://crisisnlp.qcri.org/covid19 

For more details, please refer to this paper: https://arxiv.org/abs/2005.11177

Explore interesting trends in GeoCoV19 dataset using our new service: https://covid19-trends.qcri.org/

Instructions: 

GeoCoV19 Dataset Description 

The GeoCoV19 dataset comprises several TAR files, which contain zip files representing daily data. Each zip file contains JSON records in the following format:

{ "tweet_id": "122365517305623353", "created_at": "Sat Feb 01 17:11:42 +0000 2020", "user_id": "335247240", "geo_source": "user_location", "user_location": { "country_code": "br" }, "geo": {}, "place": { }, "tweet_locations": [ { "country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria" }, { "country_code": "us" }, { "country_code": "ru", "state": "Voronezh Oblast", "county": "Petropavlovsky District" }, { "country_code": "at", "state": "Upper Austria", "county": "Braunau am Inn" }, { "country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria" }, { "country_code": "cn" }, { "country_code": "in", "state": "Himachal Pradesh", "county": "Jubbal" } ] }

Description of all the fields in the above JSON 

Each JSON record in the Geo file has the following eight keys (a short parsing sketch follows the list):

1. tweet_id: the Twitter-provided ID of the tweet

2. created_at: the Twitter-provided "created_at" date and time in UTC

3. user_id: the Twitter-provided user ID

4. geo_source: this field contains one of four values: (i) coordinates, (ii) place, (iii) user_location, or (iv) tweet_text. The value depends on which of these fields are available, with priority given to the most accurate one; the priority order is coordinates, place, user_location, and tweet_text. For instance, when a tweet has GPS coordinates, the value is "coordinates" even if all other location fields are present. If a tweet has no GPS, place, or user_location information, the value is "tweet_text", provided the tweet text mentions at least one location.

The remaining keys can contain the following location_json inside them. Sample location_json: {"country_code":"us","state":"California","county":"San Francisco","city":"San Francisco"}. Depending on the available granularity, the country_code, state, county, or city keys may be missing from the location_json.

5. user_location: it can contain a "location_json" as described above or an empty JSON {}. This field is derived from the "location" metadata of a Twitter user's profile, which represents the user-declared location as free text. We resolve this text to a location.

6. geo: represents the "geo" field provided by Twitter. We resolve the provided latitude and longitude values to a location. It can contain a "location_json" as described above or an empty JSON {}.

7. tweet_locations: this field can contain an array of "location_json" objects as described above, e.g., [location_json1, location_json2], or an empty array []. This field uses the tweet content (i.e., the actual tweet message) to find toponyms. A tweet message can mention several different locations (i.e., toponyms), which is why an array is used to represent all the toponyms in a tweet. For instance, the tweet "The UK has over 65,000 #COVID19 deaths. More than Qatar, Pakistan, and Norway." contains four location mentions, and the tweet_locations array represents these four separately.

8. place: it can contain a "location_json" as described above or an empty JSON {}. It represents the Twitter-provided "place" field.
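
To illustrate how these eight keys fit together, the following is a minimal Python sketch that tallies tweets per country from one daily file of JSON records, preferring the most accurate source in the geo_source priority order described above. The file name geo_2020-02-01.json and the one-record-per-line layout are illustrative assumptions; adapt them to the actual contents extracted from the TAR/zip files.

import json
from collections import Counter

# Priority order from the geo_source description: geo (coordinates) > place > user_location > tweet_text
def best_country(record):
    """Return the most reliable country_code for one GeoCoV19 record, or None."""
    for key in ("geo", "place", "user_location"):
        location = record.get(key) or {}
        if location.get("country_code"):
            return location["country_code"]
    # Fall back to the first toponym found in the tweet text, if any
    for location in record.get("tweet_locations") or []:
        if location.get("country_code"):
            return location["country_code"]
    return None

counts = Counter()
# geo_2020-02-01.json is a hypothetical file name; one JSON record per line is assumed
with open("geo_2020-02-01.json", encoding="utf-8") as fh:
    for line in fh:
        country = best_country(json.loads(line))
        if country:
            counts[country] += 1

print(counts.most_common(10))

The sketch aggregates at country granularity only; the finer-grained state, county, and city keys can be aggregated the same way when present.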

 

Tweet hydrators (see the hydration sketch after this list):

CrisisNLP (Java): https://crisisnlp.qcri.org/#resource8

Twarc (Python): https://github.com/DocNow/twarc#dehydrate

DocNow Hydrator (desktop application): https://github.com/docnow/hydrator
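
For example, with Twarc the rehydration loop can be a few lines of Python, as in the sketch below (assuming the twarc v1 API, valid Twitter API credentials, and a plain-text file with one tweet ID per line; ids.txt and hydrated.jsonl are placeholder file names):

import json
from twarc import Twarc  # pip install twarc (v1 API assumed)

# Placeholder credentials; replace with your own Twitter API keys
t = Twarc(consumer_key="...", consumer_secret="...",
          access_token="...", access_token_secret="...")

# ids.txt: one tweet_id per line, taken from the GeoCoV19 JSON records
with open("ids.txt") as ids, open("hydrated.jsonl", "w", encoding="utf-8") as out:
    # Twarc.hydrate() yields full tweet objects for the IDs that are still available
    for tweet in t.hydrate(line.strip() for line in ids):
        out.write(json.dumps(tweet) + "\n")

Tweets that have been deleted or made private since collection cannot be rehydrated, so the output typically contains fewer records than the list of input IDs.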

If you have any questions, feel free to contact us at [email protected] or [email protected]

Comments

Can you please tell me how I can get the sentiment labels in this dataset?

Submitted by imran khan on Tue, 07/07/2020 - 11:04

Thanks for your question. This dataset does not have sentiment labels. However, you can use any multilingual sentiment classifier to determine tweets' sentiment polarity. 

Submitted by Muhammad Imran on Tue, 07/07/2020 - 16:03

If you don't mind, can you give some references for a "sentiment classifier"? I searched all over the internet, and the references I found were not as good as I wanted.

Thank you.

Submitted by imran khan on Wed, 07/08/2020 - 07:05

Probably the following references would be helpful:

Severyn, A., & Moschitti, A. (2015, August). Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 959-962).

Giachanou, A., & Crestani, F. (2016). Like it or not: A survey of twitter sentiment analysis methods. ACM Computing Surveys (CSUR), 49(2), 1-41.

Submitted by Muhammad Imran on Tue, 07/14/2020 - 15:21

How much time will it take me to hydrate the complete dataset?

Submitted by Somodo Non on Thu, 07/23/2020 - 02:20

I think it depends on how many parallel threads one uses to call the Twitter API. Parallel calls will significantly reduce the rehydration time.

Submitted by Muhammad Imran on Tue, 08/18/2020 - 10:15

Do you have datasets for May and June as well?

Submitted by Hyun Kim on Thu, 07/23/2020 - 04:08

Yes, we have been collecting data for May, June, July, and onwards. We need to process it before sharing it; it may take some time, though.

Submitted by Muhammad Imran on Tue, 08/18/2020 - 10:17

How can I get the sentiment labels for this dataset?

Submitted by Abdullah Matin on Sun, 08/23/2020 - 04:47

Hi, I didn't see the original tweets in this dataset; without them I cannot apply sentiment analysis. Could you also include the tweet text in your dataset?

Submitted by Yimei Fan on Thu, 11/12/2020 - 21:45

Do you have any statistics on COVID-related tweets per country that you can share?

Submitted by Davide Morselli on Wed, 01/27/2021 - 05:42

Thank you, Muhammad, for this dataset, but I did not find the tweet text. Does the dataset have the tweet text?

Submitted by Soha Mohamed on Wed, 03/17/2021 - 18:40
