Notes Regarding the Use of BeautifulSoup: Python
The sample code for this course and the examples in the textbook use BeautifulSoup to parse HTML.
Both the textbook and this class work with BeautifulSoup 4 (the bs4 package).
Using BeautifulSoup 4
If you want to use our samples "as is", download our Python 3 version of BeautifulSoup from
http://www.py4e.com/code3/bs4.zip
You must unzip this into a "bs4" folder and make that folder a sub-folder of the folder where you
put our sample code, such as:
http://www.py4e.com/code3/urllinks.py
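As a quick check that the bs4 folder is in the right place, here is a minimal sketch along the lines of
the urllinks.py sample (the URL below is just a placeholder; use any page you like):

import urllib.request
from bs4 import BeautifulSoup

# Fetch a page and parse it with BeautifulSoup
url = 'http://www.dr-chuck.com/page1.htm'  # placeholder URL - substitute any page
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Print the href attribute of every anchor tag on the page
for tag in soup('a'):
    print(tag.get('href', None))

If the import of bs4 fails, the bs4 folder is not in the same folder as the script you are running.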
Data Sources
This is a set of data sources curated by the instructional staff. Feel free to suggest new data sources in the forums.
The initial list was provided by Kevyn Collins-Thompson from the University of Michigan School of Information.
Rdatasets - a collection of datasets originally distributed with R packages:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
The Academic Torrents site has a growing number of datasets, including a few text collections that might be of
interest (Wikipedia, email, Twitter, academic, etc.) for current or future projects.
http://academictorrents.com/browse.php?cat=6
Google Books n-gram corpus
http://aws.amazon.com/datasets/41740
Common Crawl - web crawl data:
https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
Award project using Common Crawl: http://norvigaward.github.io/entries.html
Python example: http://www.freelancer.com/projects/Python-Data-Processing/Python-script-for-CommonCrawl.html
Business/commercial data - Yelp Search API (external link):
http://www.yelp.com/developers/documentation/v2/search_api
Note: Yelp announced (June 28, 2017) that API v2 would be deprecated on June 30, 2018.
Internet Archive (huge, ever-growing archive of the Web going back to the 1990s), external link:
http://archive.org/help/json.php
Wikidata:
https://www.wikidata.org/wiki/Wikidata:Main_Page
World Food Facts
http://world.openfoodfacts.org/data
Data USA - a variety of census data
https://datausa.io/
Centers for Disease Control and Prevention (CDC) - a variety of data sets related to COVID
https://data.cdc.gov/browse
U.S. Government open data - datasets from 75 agencies and subagencies
https://data.gov/
NASA data portal - space and earth science
https://data.nas.nasa.gov/
https://data.nasa.gov/
Email Corpus Project
This week we do the first half of a project to download, process, and visualize an email corpus from
the Sakai open source project from 2004-2011:
http://mbox.dr-chuck.net/
This is a large amount of data, and it requires significant cleanup before we can make sense of it and
visualize it.
Important: You do not have to download all of the data to complete this project. Depending on your
Internet connection, downloading nearly a gigabyte of data might be impossible. All we ask is that you
download a small subset of the data and run the steps to process it.
Here is the software we will be using to retrieve and process the email data:
https://www.py4e.com/code3/gmane.zip
If you have a fast network connection with no bandwidth charges, you can download all the data, but
it may take well over 24 hours to pull everything. The good news is that because there are separate
crawl, clean, model, and visualize steps, you can start and stop the crawl process as often as you like
and run the other steps on the data that has been downloaded so far.
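To give a feel for what the crawl step does, here is a simplified sketch. This is not the actual gmane.py
code from gmane.zip, and the mailing-list name and message-numbering URL pattern shown are
assumptions made for illustration; it simply retrieves a few messages from the archive and appends
them to a local file:

import urllib.request

# Simplified, restartable-style crawl: fetch a few messages and append
# them to a plain text file.  The real crawler in gmane.zip stores the
# messages in a database so it can pick up where it left off.
baseurl = 'http://mbox.dr-chuck.net/'
listname = 'sakai.devel'  # assumed mailing list name for illustration

for start in range(1, 6):
    # Assumed URL pattern: list name plus a message-number range
    url = baseurl + listname + '/' + str(start) + '/' + str(start + 1)
    print('Retrieving', url)
    text = urllib.request.urlopen(url).read().decode(errors='ignore')
    with open('messages.txt', 'a') as handle:
        handle.write(text)

Because each message is fetched one at a time, stopping this loop and restarting it later only costs you
the messages you have not yet retrieved, which is why the crawl step in the project can be run in short
sessions.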