Workshop 2B: Web Scraping with BeautifulSoup 4 (COMP20008 Elements of Data Processing)
During this task you will learn to scrape the web using BeautifulSoup4 and save the results to a .csv file. The typical procedure is:
1. Check that scraping is allowed (terms of use and robots.txt).
2. Download the pages with an HTTP library such as requests.
3. Parse the HTML with BeautifulSoup4.
4. Extract the data you need from the parsed tree.
5. Save the results to a .csv file.
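To make the procedure concrete, here is a minimal sketch of the whole pipeline, using https://example.com as a stand-in target and link extraction as a placeholder task (step 1, checking that scraping is allowed, is covered below):

import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")         # step 2: download the page
soup = BeautifulSoup(response.text, "html.parser")     # step 3: parse the HTML

# step 4: extract the data you need (here: all link texts and targets)
rows = [(a.text, a.get("href")) for a in soup.find_all("a")]

# step 5: save the results to a .csv file
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    writer.writerows(rows)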
Your task is to determine and plot the number of books published per year by O'Reilly about Web Development. First, browse
http://shop.oreilly.com and go to the Web Development section. Here are a few things to look out for:
Is the information you need displayed on the page? If the required information is not displayed on the page, you most likely
won't be able to scrape it. (Sometimes information is hidden from the user in the form of HTML comments, but that is not the case
here.)
How many items are displayed per page? Are all the items books? In our case we are only interested in the first 30 books.
However, if we wanted to scrape the entire website, for all categories, we would need a different stopping criterion as well (how many
books are there?). Now, pay attention to the URL as you browse different pages.
Does the URL change? How many pages are there? Using this information you should be able to infer what needs to change in the URL
to crawl the entire target category.
Is scraping allowed? Whenever you want to scrape data from a website you should first check whether it has some sort of access
policy (e.g. http://oreilly.com/terms/). It is best to check for a robots.txt file, which tells web crawlers how to behave. (Take a look at
eBay's robots.txt for a restrictive example.) The important lines in O'Reilly's robots.txt are:
Crawl-delay: 30
Request-rate: 1/30
The first tells us that we should wait 30 seconds between requests; the second that we should request only one page every 30 seconds:
two different ways of saying the same thing. Not following these terms will lead to our crawler being banned!
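Rather than reading robots.txt by hand, you can let Python's standard-library urllib.robotparser parse it for you. A minimal sketch (the category URL below is only illustrative, and crawl_delay/request_rate require Python 3.6+):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://shop.oreilly.com/robots.txt")
rp.read()

# May a generic crawler ("*") fetch this page at all?
print(rp.can_fetch("*", "http://shop.oreilly.com/category/browse-subjects/web-development.do"))

# Crawl-delay and Request-rate, if the file specifies them
print(rp.crawl_delay("*"))     # e.g. 30
print(rp.request_rate("*"))    # e.g. RequestRate(requests=1, seconds=30)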
def book_info(td):
    """
    Given a BeautifulSoup <td> Tag representing a book,
    extract the book's details and return a dict
    """
    # your code here:
    return {
        "title": title,
        "authors": authors,
        "price": price,
        "date": date,
    }

print(book_info(tds[0]))
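If you get stuck, here is one possible implementation. Be warned that the CSS class names used below (thumbheader, AuthorName, directorydate, price) are assumptions about how the shop's HTML is structured; inspect the live page with your browser's developer tools and adjust the selectors to whatever you actually find:

import re

def book_info(td):
    """
    One possible implementation. The class names below are assumptions
    about the shop's markup; verify them against the live HTML.
    """
    title = td.find("div", "thumbheader").a.text
    by_author = td.find("div", "AuthorName").text     # e.g. "By Jane Doe, John Roe"
    authors = [name.strip() for name in re.sub("^By ", "", by_author).split(",")]
    date = td.find("span", "directorydate").text.strip()
    price = td.find("span", "price").text.strip()
    return {
        "title": title,
        "authors": authors,
        "price": price,
        "date": date,
    }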
4. Scrape the website for all the books.
In [ ]: from time import sleep
        books = []
        num_pages = 5  # should be able to find 30 books in 5 pages
        base_url = ""
        for page_num in range(1, num_pages + 1):
            # your code here:
            print("Scraping page", page_num, ",", len(books), "books found so far")
            # now we wait as requested in robots.txt
            sleep(30)
The previous code block might take a while to run: with the 30-second delay between requests, scraping 5 pages takes about two and a half minutes.
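For reference, one way to fill in the loop body is sketched below. The base_url pattern (page number appended as a query parameter) and the video filter are assumptions to verify against the live site; book_info is the function from the previous step.

import requests
from bs4 import BeautifulSoup
from time import sleep

books = []
num_pages = 5
# Assumed URL pattern: the page number gets appended to this query string.
base_url = ("http://shop.oreilly.com/category/browse-subjects/"
            "web-development.do?sortby=publicationDate&page=")

def is_video(td):
    # Assumed markup: videos carry a single price label starting with "Video"
    pricelabels = td("span", "pricelabel")
    return len(pricelabels) == 1 and pricelabels[0].text.strip().startswith("Video")

for page_num in range(1, num_pages + 1):
    page = requests.get(base_url + str(page_num)).text
    soup = BeautifulSoup(page, "html.parser")
    for td in soup("td", "thumbtext"):      # one <td> per listing (assumed class)
        if not is_video(td):                # keep books, skip videos
            books.append(book_info(td))
    print("Scraping page", page_num, ",", len(books), "books found so far")
    sleep(30)                               # respect Crawl-delay: 30

Once books is populated, counting and plotting the publications per year takes a few lines with collections.Counter and matplotlib (this assumes each scraped date string ends with a four-digit year, e.g. "March 2014"):

from collections import Counter
import matplotlib.pyplot as plt

# Assumes the year is the last whitespace-separated token of each date string
year_counts = Counter(int(book["date"].split()[-1]) for book in books)

years = sorted(year_counts)
plt.plot(years, [year_counts[y] for y in years])
plt.xlabel("year")
plt.ylabel("number of books")
plt.title("Web Development books published per year")
plt.show()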