
Python

Notes Regarding the Use of BeautifulSoup

The sample code for this course and the examples in the textbook use BeautifulSoup to parse HTML. The examples in both the textbook and this class work with BeautifulSoup 3.

Using BeautifulSoup 3

If you want to use our samples "as is", download our Python 3 version of BeautifulSoup 3 from:

http://www.py4e.com/code3/bs4.zip

You must unzip this into a "bs4" folder, and that folder must be a sub-folder of the folder where you put our sample code, such as:

http://www.py4e.com/code3/urllinks.py
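
After unzipping, you end up with one folder that contains both the "bs4" sub-folder and the sample scripts such as urllinks.py; because Python searches the script's own directory for imports, the package works without a system-wide install. The following is a minimal sketch in the spirit of urllinks.py, not its exact contents; the prompt text and variable names are illustrative:

import urllib.request
from bs4 import BeautifulSoup

# Read HTML from a user-supplied URL
url = input('Enter URL - ')
html = urllib.request.urlopen(url).read()

# Parse the page with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')

# Print the href attribute of every anchor tag on the page
for tag in soup('a'):
    print(tag.get('href', None))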

List of Data Sources (Instructional Staff Curated)

This is a set of data sources curated by the instructional staff. Feel free to suggest new data sources in the forums. The initial list was provided by Kevyn Collins-Thompson of the University of Michigan School of Information.

Long general-purpose list of datasets:

• https://vincentarelbundock.github.io/Rdatasets/datasets.html
The Academic Torrents site has a growing number of datasets, including a few text collections (Wikipedia, email, Twitter, academic, etc.) that might be of interest for current or future projects.

• http://academictorrents.com/browse.php?cat=6
Google Books n-gram corpus:

• http://books.google.com/ngrams
• Dataset: http://aws.amazon.com/datasets/8172056142375670
Common Crawl - currently 6 billion Web documents (81 TB), available as an Amazon S3 Public Data Set:

• http://aws.amazon.com/datasets/41740
• https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
• Award project using Common Crawl: http://norvigaward.github.io/entries.html
• Python example: http://www.freelancer.com/projects/Python-Data-Processing/Python-script-for-CommonCrawl.html
Business/commercial data - Yelp:

• http://www.yelp.com/developers/documentation/v2/search_api
• Upcoming deprecation of Yelp API v2 on June 30, 2018 (posted by Yelp, Jun 28, 2017)
Internet Archive (huge, ever-growing archive of the Web going back to the 1990s):

• http://archive.org/help/json.php
WikiData:

• https://www.wikidata.org/wiki/Wikidata:Main_Page
World Food Facts:

• http://world.openfoodfacts.org/data
Data USA - a variety of census data:

• https://datausa.io/
Centers for Disease Control - a variety of data sets related to COVID:

• https://data.cdc.gov/browse
U.S. Government open data - datasets from 75 agencies and subagencies:

• https://data.gov/
NASA data portal - space and earth science:

• https://data.nas.nasa.gov/
• https://data.nasa.gov/

Spidering and Modeling Email Data

Introduction

This week we do the first half of a project to download, process, and visualize an email corpus from the Sakai open source project, covering 2004-2011:

http://mbox.dr-chuck.net/

This is a large amount of data, and it requires significant cleanup before we can visualize it.

Important: You do not have to download all of the data to complete this project. Depending on your Internet connection, downloading nearly a gigabyte of data might be impractical or even impossible. All we ask is that you download a small subset of the data and run the steps to process it.

Here is the software we will be using to retrieve and process the email data:

https://www.py4e.com/code3/gmane.zip

If you have a fast network connection with no bandwidth charge, you can download all the data, but pulling everything may take well over 24 hours. The good news is that because there are separate crawl, clean, model, and visualize steps, you can start and stop the crawl process as often as you like and run the other steps on whatever data has been downloaded so far, as the sketch below illustrates.
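
To make that restartability concrete, here is a minimal sketch of a resumable crawl loop in the spirit of the spider in gmane.zip. It is illustrative only: the database file name, the table layout, and the message URL pattern are assumptions rather than a description of the provided code.

import sqlite3
import urllib.request

# Open (or create) a local database; everything committed so far
# survives an interrupted run. File name and schema are assumptions.
conn = sqlite3.connect('content.sqlite')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS messages
               (id INTEGER PRIMARY KEY, body TEXT)''')

# Resume from the highest message id already stored
cur.execute('SELECT MAX(id) FROM messages')
start = (cur.fetchone()[0] or 0) + 1

# Fetch a small batch per run; rerun the script to continue the crawl
for msgid in range(start, start + 100):
    url = 'http://mbox.dr-chuck.net/sakai.devel/%d/%d' % (msgid, msgid + 1)
    text = urllib.request.urlopen(url).read().decode(errors='replace')
    cur.execute('INSERT INTO messages (id, body) VALUES (?, ?)',
                (msgid, text))
    conn.commit()  # commit after every message so progress is never lost

conn.close()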
