Workshop 2B: Web Scraping with BeautifulSoup 4 (COMP20008 Elements of Data Processing)
During this task you will learn to scrape the web using BeautifulSoup4 and save the results to a .csv file. The typical procedure is:
1. Check that scraping is allowed (terms of use and robots.txt).
2. Download the pages with an HTTP library such as requests.
3. Parse the HTML with BeautifulSoup4.
4. Extract the data you need from the parsed tree.
5. Save the results to a .csv file.
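To make the procedure concrete, here is a minimal sketch of the whole pipeline, using https://example.com as a stand-in target and link extraction as a placeholder task (step 1, checking that scraping is allowed, is covered below):

import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")         # step 2: download the page
soup = BeautifulSoup(response.text, "html.parser")     # step 3: parse the HTML

# step 4: extract the data you need (here: all link texts and targets)
rows = [(a.text, a.get("href")) for a in soup.find_all("a")]

# step 5: save the results to a .csv file
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    writer.writerows(rows)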
Your task is to determine and plot the number of books published per year by O'Reilly about Web Development. First, browse
http://shop.oreilly.com and go to the Web Development section. Here are a few things to look out for:
Is the information you need displayed on the page? If the required information is not displayed on the page, you most likely
won't be able to scrape it. (Sometimes information is hidden from the user in the form of HTML comments, but that is not the case
here.)
How many items are displayed per page? Are all the items books? In our case we are only interested in the first 30 books.
However, if we wanted to scrape the entire website, for all categories, we would need a different stopping criterion as well (how many
books are there?). Now, pay attention to the URL as you browse different pages.
Does the URL change? How many pages are there? Using this information you should be able to infer what needs to change in the URL
to crawl the entire target category.
Is scraping allowed? Whenever you want to scrape data from a website you should first check whether it has some sort of access
policy (e.g. http://oreilly.com/terms/). It is best to check for a robots.txt file, which tells web crawlers how to behave. (Take a look at
eBay's robots.txt for a restrictive example.) The important lines in O'Reilly's robots.txt are:
Crawl-delay: 30
Request-rate: 1/30
The first tells us that we should wait 30 seconds between requests; the second that we should request only one page every 30 seconds:
two different ways of saying the same thing. Not following these terms will lead to our crawler being banned!
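Rather than reading robots.txt by hand, you can let Python's standard-library urllib.robotparser parse it for you. A minimal sketch (the category URL below is only illustrative, and crawl_delay/request_rate require Python 3.6+):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://shop.oreilly.com/robots.txt")
rp.read()

# May a generic crawler ("*") fetch this page at all?
print(rp.can_fetch("*", "http://shop.oreilly.com/category/browse-subjects/web-development.do"))

# Crawl-delay and Request-rate, if the file specifies them
print(rp.crawl_delay("*"))     # e.g. 30
print(rp.request_rate("*"))    # e.g. RequestRate(requests=1, seconds=30)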
def book_info(td):
    """
    Given a BeautifulSoup <td> Tag representing a book,
    extract the book's details and return a dict
    """
    # your code here:
    return {
        "title": title,
        "authors": authors,
        "price": price,
        "date": date,
    }

print(book_info(tds[0]))
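If you get stuck, here is one possible implementation. Be warned that the CSS class names used below (thumbheader, AuthorName, directorydate, price) are assumptions about how the shop's HTML is structured; inspect the live page with your browser's developer tools and adjust the selectors to whatever you actually find:

import re

def book_info(td):
    """
    One possible implementation. The class names below are assumptions
    about the shop's markup; verify them against the live HTML.
    """
    title = td.find("div", "thumbheader").a.text
    by_author = td.find("div", "AuthorName").text     # e.g. "By Jane Doe, John Roe"
    authors = [name.strip() for name in re.sub("^By ", "", by_author).split(",")]
    date = td.find("span", "directorydate").text.strip()
    price = td.find("span", "price").text.strip()
    return {
        "title": title,
        "authors": authors,
        "price": price,
        "date": date,
    }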
4. Scrape the website for all the books.
In [ ]: from time import sleep
        books = []
        num_pages = 5  # should be able to find 30 books in 5 pages
        base_url = ""
        for page_num in range(1, num_pages + 1):
            # your code here:
            print("Scraping page", page_num, ",", len(books), "books found so far")
            # now we wait as requested in robots.txt
            sleep(30)
The previous code block might take a while to run: with the 30-second delay between requests, scraping 5 pages takes about two and a half minutes.
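For reference, one way to fill in the loop body is sketched below. The base_url pattern (page number appended as a query parameter) and the video filter are assumptions to verify against the live site; book_info is the function from the previous step.

import requests
from bs4 import BeautifulSoup
from time import sleep

books = []
num_pages = 5
# Assumed URL pattern: the page number gets appended to this query string.
base_url = ("http://shop.oreilly.com/category/browse-subjects/"
            "web-development.do?sortby=publicationDate&page=")

def is_video(td):
    # Assumed markup: videos carry a single price label starting with "Video"
    pricelabels = td("span", "pricelabel")
    return len(pricelabels) == 1 and pricelabels[0].text.strip().startswith("Video")

for page_num in range(1, num_pages + 1):
    page = requests.get(base_url + str(page_num)).text
    soup = BeautifulSoup(page, "html.parser")
    for td in soup("td", "thumbtext"):      # one <td> per listing (assumed class)
        if not is_video(td):                # keep books, skip videos
            books.append(book_info(td))
    print("Scraping page", page_num, ",", len(books), "books found so far")
    sleep(30)                               # respect Crawl-delay: 30

Once books is populated, counting and plotting the publications per year takes a few lines with collections.Counter and matplotlib (this assumes each scraped date string ends with a four-digit year, e.g. "March 2014"):

from collections import Counter
import matplotlib.pyplot as plt

# Assumes the year is the last whitespace-separated token of each date string
year_counts = Counter(int(book["date"].split()[-1]) for book in books)

years = sorted(year_counts)
plt.plot(years, [year_counts[y] for y in years])
plt.xlabel("year")
plt.ylabel("number of books")
plt.title("Web Development books published per year")
plt.show()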