Twitter Data Scraping Jupyter Notebook Text Instruction
Abstract
This tutorial walks you through installing Anaconda, GetOldTweets3, and details how to scrape
data and then manipulate it within Excel to prepare the dataset for analysis.
Twitter Data Scraping Tutorial Amy Larner Giroux
Table of Contents
Introduction
Install Anaconda with Jupyter Notebook
Install GetOldTweets3 Library
Launch the Data Scraper Jupyter Notebook
Run the Data Scraper
    Initialization of the Process
    Text-based Query
    Username Query
Open the Dataset in Excel and Prepare it for Analysis
    Unique ID Changes
    Splitting Date and Time
    Fixing the Mentions Error
Concluding Thoughts
Introduction
This Twitter Data Scraping tutorial will step you through the process of setting up a Python
environment and how to use the supplied Jupyter Notebook to collect tweet data. The instructions
and screen shots are shown in a Microsoft Windows environment, but the programs used also exist
for Mac and Linux operating systems.
The concept behind using this Python data scraper is to remove the need for you to register for a
Twitter developer account, and also to give you access to all past Twitter data. Many of the other
methods of tweet collection limit you to retrieving only the past week’s tweets and you would need
to plan far ahead and set up a retriever, such as TAGS, to collect data over time.
This tutorial and the methods it will teach you will allow you to retrieve historical Twitter data from
any point since Twitter’s inception. The following is a snapshot of the first tweets of the developers
when Twitter launched in 2006. These tweets were retrieved using the methods detailed in this
tutorial.
This tutorial has a set of outcomes, and the instructional material is organized around these
goals:
1. Install Anaconda with Jupyter Notebook
2. Install GetOldTweets3 library
3. Launch the data scraper Jupyter Notebook
4. Run the data scraper
5. Open the dataset in Excel and prepare it for analysis
Some of the steps to follow are embedded in the main text of the instructional material, while others
are in the captions of the figures. Bolded text has been used to draw your attention to items to do.
In some of the illustrations, a yellow cursor is visible to indicate what to select.
Installation Type – If you share your computer with other people and have admin
privileges, select All Users, otherwise leave the default. Click Next
Completion Screen – Uncheck the tutorial/learn more boxes and select Finish
This completes the “Installation of Anaconda and Jupyter Notebook” section of the tutorial.
The web app for the notebook will launch in your default browser and display folder navigation
options. It defaults to the Desktop and if you unzipped the notebook there you should see it in the
list. If you placed it elsewhere, click on the folder next to the Desktop link and navigate to the
correct location.
This completes the “Launch the Data Scraper Jupyter Notebook” section of the tutorial.
While a cell is running, the brackets to its left display an asterisk (*). Since this cell is only two
lines of code, you may blink and miss the asterisk. You will notice it more later when the data is
being scraped.
Text-based Query
As each of the appropriate code cells is executed, the code is loaded into memory and becomes
available to other code within the notebook. The next code cell in the notebook contains a function
that runs the query on the Twitter data and creates a CSV file that contains the results.
This function will be run (loaded into memory) before we execute the code that defines what our
search parameters will contain.
By looking at the comments within this code cell, you will see that 4 parameters are passed to the
function: text_query (the search terms), start_date (beginning of date range), end_date (ending of
date range) and a count. The count constrains the number of tweets requested through the Twitter
API. There is a variable limit for the number of tweets you can ask for in a single query. Some
documentation says you can retrieve up to 18,000 per query. Typically, I can retrieve about 10,000
every 15-20 minutes without the process failing.
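The notebook's actual function is not reproduced in this text, but its general shape (accept the four parameters, fetch the tweets, write the fields out as rows) can be sketched with the Python standard library. Everything below is illustrative, not the notebook's real code: the fetch_tweets placeholder stands in for the GetOldTweets3 call, and the field names simply mirror the CSV columns described later in the tutorial.

```python
import csv

# Hypothetical stand-in for the notebook's query call; the real function
# uses the GetOldTweets3 library to fetch tweets from Twitter.
def fetch_tweets(text_query, start_date, end_date, count):
    return [
        {"id": 1239000000000000001, "datetime": "2020-03-15 09:30:00",
         "text": "Example tweet", "user": "someuser", "to": "",
         "retweets": 3, "favorites": 7, "mentions": "@other",
         "hashtags": "#covid19"},
    ]

def text_query_to_csv(text_query, start_date, end_date, count, path):
    """Run the query and save the results as a CSV file."""
    tweets = fetch_tweets(text_query, start_date, end_date, count)
    fields = ["id", "datetime", "text", "user", "to",
              "retweets", "favorites", "mentions", "hashtags"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()  # column names become the CSV header row
        writer.writerows(tweets)

text_query_to_csv("#covid19 realDonaldTrump",
                  "2020-03-01", "2020-03-16", 10000, "tweets.csv")
```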
As you work with this scraper and find you have criteria that push against this limit, think about
breaking the queries up by day, by a single hashtag, etc. to reduce the size of the dataset retrieved in
a single query. You can combine multiple datasets in Excel afterwards as described later in this
tutorial.
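One way to act on this advice is to generate the single-day date ranges programmatically rather than by hand. The helper below is an illustration, not part of the notebook; it turns a range into per-day (since, until) pairs, with until exclusive, matching how Twitter's until_date behaves as described in the caveats that follow.

```python
from datetime import date, timedelta

def daily_ranges(since, until):
    """Yield (since, until) pairs covering one day each.
    Both arguments are 'YYYY-MM-DD' strings; until is exclusive."""
    day = date.fromisoformat(since)
    stop = date.fromisoformat(until)
    while day < stop:
        yield (day.isoformat(), (day + timedelta(days=1)).isoformat())
        day += timedelta(days=1)

# Three single-day windows: 13, 14, and 15 March 2020
pairs = list(daily_ranges("2020-03-13", "2020-03-16"))
```

Each pair can then be fed to the query cell one at a time, and the resulting CSV files combined in Excel afterwards.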
Text query code cell – Click within this cell to make it active and Run (Ctrl-Enter)
Once the text query code cell has been run, we will set the criteria for the tweet scraping and retrieve
the data.
In the code cell below, you will see the 4 parameters to set. In this example, I am retrieving tweets
that use the hashtag #covid19 and include the text realDonaldTrump. This combination query looks
for Trump in any context: username, tweet content, or mentions, regardless of whether someone
used the @ username or the # hashtag symbol.
The date parameters require some caveats.
1. The until_date (Twitter’s variable name) needs to be your end date + 1. In the example
below, the last date in the range that the API will send back will be 15 March 2020.
2. The Twitter API will return query results from the until_date back towards the since_date
(i.e., end date to start date). This means that if your query hits the count limit before the
query finishes traversing all of the dates in your range, you may get, in this example, say
5,000 tweets from 15 March 2020, 5,000 tweets from 14 March 2020, and none from the rest
of the days in the range. You will need to examine your dataset if you are querying over
multiple days to ensure that all your requested data is retrieved. If you do not get all the
expected data, run single days individually and combine the data afterwards.
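The second caveat can be checked programmatically once the CSV is in hand. The sketch below is not part of the notebook; it lists the days in a range that have no tweets at all, given the dates pulled from the Datetime column.

```python
from datetime import date, timedelta

def missing_days(tweet_dates, since, until):
    """Return the days in [since, until) that have no tweets.
    tweet_dates is an iterable of 'YYYY-MM-DD' strings taken from
    the date portion of the CSV's Datetime column."""
    have = set(tweet_dates)
    gaps = []
    day = date.fromisoformat(since)
    stop = date.fromisoformat(until)
    while day < stop:
        if day.isoformat() not in have:
            gaps.append(day.isoformat())
        day += timedelta(days=1)
    return gaps

# Mirroring the example above: only 14-15 March came back, so the
# first thirteen days of the range are missing.
gaps = missing_days(["2020-03-14", "2020-03-15"],
                    "2020-03-01", "2020-03-16")
```

Any day that appears in the gaps list should be re-queried individually and merged into the dataset.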
As this function takes time to run, you will notice the [*] displayed as the code executes, and you
will see that it is complete when the asterisk is replaced with a number. This number is the cell's
execution count, which increments each time a cell is run during the session.
You will also notice a CSV file will be saved to your desktop. The name of the file will be your
text_query and the number (in thousands) of the count you requested (not the actual count of
returned tweets).
If the process has an error, the asterisk will be replaced by a number, but your CSV file will not
appear. If you look below the code cell, you will find the error information. As mentioned
previously, Twitter constrains the number of queries. If you stay within the 10,000 count and a
15–20 minute interval between large queries (ones that approach the 10,000-tweet boundary), you
shouldn’t have any issues.
If you see this error, you are running too many large queries too quickly. Wait 20 minutes and try
again.
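If you want to automate the waiting, a generic retry pattern can wrap the query call. This is an illustrative sketch only; the notebook itself does not include retry logic, and the 20-minute default simply mirrors the advice above.

```python
import time

def run_with_retry(query_fn, *, wait_seconds=20 * 60, attempts=2):
    """Call query_fn(); if it raises, wait and try again.
    After the final attempt, re-raise the error."""
    for attempt in range(attempts):
        try:
            return query_fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(wait_seconds)  # back off before retrying
```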
Username Query
The remaining two code cells in the notebook are used for queries by Twitter username. To collect
the tweets of a specific user, the code cell that runs the query by username needs to be loaded into
memory, as we did for the text-based query function.
The difference here is that the parameter is the username instead of the text criteria.
The last cell in the notebook is the function to set the query by username parameters and run the
actual retrieval. Modify the username, count, and date parameters as necessary for your research.
Open the Dataset in Excel and Prepare it for Analysis
The table below describes the content of these columns and some comments outlining manipulation
we will do to make them more useful/visually understandable before saving the data as an Excel
spreadsheet.
Col. Content   Comments
A    count     A zero-based sequence number counting the tweets in the file
B    ID        The unique ID of the tweet. This column is useful for de-duplicating data that
               has been collected via multiple queries; the technique is outlined later in the
               tutorial. When the CSV is first opened, this column appears in exponential
               notation.
C    Datetime  The date/time of the tweet. Typically, I am interested in just the date; how to
               split this column into its constituent parts is explained later.
D    Text      The textual content of the tweet, without emoji (these were excluded)
E    User      The username of the account sending the tweet
F    To        The username(s) the tweet was sent to in a reply
G    Retweets  The number of times the tweet was retweeted
H    Favorites The number of times the tweet was liked
I    Mentions  Other Twitter usernames mentioned in the tweet. The column in the example
               currently shows an Excel error of #NAME?; this is explained below.
J    Hashtags  A list of the hashtags included in the tweet
Understanding Digital Culture: Humanist Lenses for Internet Research
NEH Summer Institute, University of Central Florida, 1–5 June 2020
Unique ID Changes
The ID is shown in exponential notation. To change it, highlight the column, click on the format
dropdown and select Number, and then use the decrease-decimal button to remove the two
decimal places.
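The exponential display arises because Excel parses the long ID as a floating-point number. A Python analogy (not Excel itself, and the ID value is made up) shows why a 19-digit ID cannot survive that conversion intact:

```python
# A 19-digit tweet ID exceeds 2**53, the range in which 64-bit
# floats can represent every integer exactly.
tweet_id = 1239000123456789123           # hypothetical tweet ID
as_float = float(tweet_id)               # what a spreadsheet stores internally
assert int(as_float) != tweet_id         # precision has been lost
assert tweet_id > 2**53                  # outside the exact-integer range
```

Excel similarly keeps at most 15 significant digits, so if exact IDs matter to you (for example, to rebuild tweet URLs), importing the column as Text is the safer route; for sorting and de-duplicating within one spreadsheet, the Number format described above is sufficient.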
When combining multiple datasets in one spreadsheet, you will want to remove any duplicates. Copy
and paste the rows into one spreadsheet as below:
Remove Duplicates
When the Remove Duplicates dialog is shown, make sure that the My data has headers is
checked, and then uncheck (Column A) as that column is not unique.
Results of Deduplication
If you have merged multiple files together, Column A is no longer unique. You may delete the
column. The figures throughout the rest of the tutorial still have Column A in place.
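If you prefer to de-duplicate outside Excel, the same first-occurrence-wins logic is easy to express in Python. This sketch is illustrative: the sample rows use a simplified subset of the real columns, with the tweet ID in column B (index 1), as in the table above.

```python
def deduplicate(rows):
    """Keep the first occurrence of each tweet ID (column index 1).
    rows is a list of CSV rows, including the header row."""
    header, body = rows[0], rows[1:]
    seen = set()
    unique = [header]
    for row in body:
        if row[1] not in seen:    # first time we see this tweet ID
            seen.add(row[1])
            unique.append(row)
    return unique

merged = [
    ["count", "id", "text"],
    ["0", "1001", "first"],
    ["1", "1002", "second"],
    ["0", "1001", "first"],       # duplicate from a second query
]
# clean keeps the header plus the two unique tweets
clean = deduplicate(merged)
```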
Splitting Date and Time
First, select the Datetime column and change it to a Text format using the dropdown in the Home
tab. This action will help to retain the YYYY-MM-DD format of the date portion of the field.
Next, insert two empty columns to the right of the Datetime column as shown above. The Text to
Column function needs these columns to hold the separated data.
On the Data tab, select the Text to Columns function.
The Text to Columns Wizard will step you through the process of splitting the Datetime field. The
wizard will default to Delimited, which is fine since the two fields are separated by a space.
Step 1 of Text to Column – make sure Delimited is selected and click Next
Step 2 of Text to Column – check the box for Space and click Next
The third screen of the Text to Column wizard will require multiple changes. We need to set the
destination to the two columns we inserted and set the data format for the new columns.
Step 3(b) – highlight the two new columns (D & E) and once the =$D:$E appears in the bar, click
the down arrow to return to the wizard
Next, we need to format the two new columns so that the date format is retained.
Step 3(c) – Use Ctrl-click to select both columns in the preview and then click
on Text
After completing the three parts of Step 3, click on Finish and enter column names for the two
new columns.
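What the wizard does can be mirrored in Python: the Datetime value has a single space between the date and the time, so one split on that space yields both parts. A minimal illustrative sketch:

```python
def split_datetime(value):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into its date and time
    parts, mirroring Text to Columns with a space delimiter."""
    day, time_part = value.split(" ", 1)
    return day, time_part

# day is "2020-03-15", time_part is "09:30:00"
day, time_part = split_datetime("2020-03-15 09:30:00")
```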
Fixing the Mentions Error
The #NAME? error in the Mentions column appears because Excel tries to interpret the cell
contents as a formula. Use Find and Replace (Ctrl-H) to remove the offending equal sign from the
column.
Find and Replace the equal sign to fix the Mentions formula error
Concluding Thoughts
By following the methods outlined in this tutorial, you will be able to create a dataset of tweets that
can be used as input for a textual analysis program such as Orange (https://orange.biolab.si/).
Once Anaconda and Jupyter Notebook have been installed, creating new datasets is as simple as
changing the text or user query criteria in the notebook and running the code cells as needed. It is
not a complicated process and the installation of the software is straightforward.
Scraping Twitter data is a simple process:
1. Decide whether to query by text criteria or username.
2. Run the first cell in the notebook to load the libraries.
3. If using a text-based query:
a. Run the “Using a text-based search to collect tweets” code cell
b. Modify the search parameters in the “Text query process” code cell and then run the
cell
4. If using a username-based query:
a. Run the “Using a username-based search to collect tweets” code cell
b. Modify the search parameters in the “Username query process” code cell and then run
the cell
5. Modify your CSV file to prepare it for data analysis.