Web Scraping
Beautiful Soup. Parses HTML, the format that web pages are written in.
Selenium. Launches and controls a web browser. Selenium is able to fill in forms
and simulate mouse clicks in this browser.
mapit.py with the webbrowser Module
The webbrowser module’s open() function can launch a new browser to a
specified URL.
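A sketch of how a mapit.py script might use this (the Google Maps URL format and the maps_url helper name are assumptions for illustration):

```python
import sys
import webbrowser

def maps_url(address):
    # Build a Google Maps search URL for the given street address.
    # (URL format assumed for illustration.)
    return 'https://www.google.com/maps/place/' + address.replace(' ', '+')

if __name__ == '__main__' and len(sys.argv) > 1:
    # Join the command-line arguments into one address string and
    # open the resulting URL in a new browser tab.
    webbrowser.open(maps_url(' '.join(sys.argv[1:])))
```

Run as, say, `python mapit.py 870 Valencia St` to open the map in the default browser.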
The urllib package is the URL handling module for Python. It is used to fetch
URLs (Uniform Resource Locators). Its urlopen function can fetch URLs
using a variety of different protocols.
Urllib is a package that collects several modules for working with URLs,
such as:
•urllib.request for opening and reading URLs
•urllib.parse for parsing URLs
•urllib.error for the exceptions raised by urllib.request
•urllib.robotparser for parsing robots.txt files
Retrieving web pages with urllib
Reading web pages is made simpler in Python by the urllib library.
Using urllib, you can treat a web page much like a file.
You simply indicate which web page you would like to retrieve and
urllib handles all of the HTTP protocol and header details.
Exercise: Write a program to retrieve the data from http://data.pr4e.org/romeo.txt
and compute the frequency of each word in the file using urllib.
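One possible solution sketch for this exercise (word_counts and count_url are illustrative helper names):

```python
import urllib.request

def word_counts(text):
    # Split the text on whitespace and count each word.
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def count_url(url):
    # Fetch the whole page, decode the bytes to a string,
    # and count the words in it.
    with urllib.request.urlopen(url) as fhand:
        return word_counts(fhand.read().decode())

# Example (requires network access):
# print(count_url('http://data.pr4e.org/romeo.txt'))
```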
Parsing HTML and scraping the web
One of the common uses of urllib in Python is to scrape the web.
Web scraping: writing a program that pretends to be a web browser and retrieves
pages, then examines the data in those pages looking for patterns.
For example:
A search engine such as Google will look at the source of one web page and extract the links
to other pages and retrieve those pages, extracting links, and so on.
Using this technique, Google spiders its way through nearly all of the pages on the web.
Google also uses the frequency of links from pages to determine how “important” a page is
and how high (rank) the page should appear in its search results.
Example: extract all the links from a given URL using a regular
expression.
One simple way to parse HTML is to use regular expressions to
repeatedly search for and extract substrings that match a
particular pattern.
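A sketch of this regex approach (extract_links and links_from are illustrative helper names; the pattern is deliberately simple and only matches double-quoted http(s) links):

```python
import re
import urllib.request

def extract_links(html):
    # Grab every double-quoted href value that starts with http or
    # https. A rough pattern: it will miss single-quoted, unquoted,
    # and relative links.
    return re.findall(r'href="(https?://[^"]+)"', html)

def links_from(url):
    # Retrieve the page, then search its HTML for link substrings.
    with urllib.request.urlopen(url) as fhand:
        return extract_links(fhand.read().decode())

# Example (requires network access):
# for link in links_from('http://www.dr-chuck.com/page1.htm'):
#     print(link)
```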
Reading binary files using urllib
Sometimes you want to retrieve a web page containing a non-text (or binary) file such as an image or
video file. The data in these files is generally not useful to print out, but you can easily copy the data
from a URL to a local file on your hard disk using urllib.
The requests module doesn’t come with Python, so you’ll have to install it first.
From the command line, run pip install requests.
By calling type() on requests.get()’s return value, you can see that it returns
a Response object, which contains the response that the web server gave for
your request
The raise_for_status() method
Calling raise_for_status() on the Response object raises an exception if there
was an error downloading the file, and does nothing if the download succeeded.
The iter_content() method returns “chunks” of the content on each iteration through
the loop.
Each chunk is of the bytes data type, and you get to specify how many bytes each
chunk will contain.
One hundred thousand bytes is generally a good size, so pass 100000 as the
argument to iter_content().
write() method
The write() method returns the number of bytes written to the file.
To review, here’s the complete process for downloading and saving
a file:
1. Call requests.get() to download the file.
2. Call open() with 'wb' to create a new file in write binary mode.
3. Loop over the Response object’s iter_content() method.
4. Call write() on each iteration to write the content to the file.
5. Call close() to close the file.
Parsing HTML with the BeautifulSoup Module
Beautiful Soup is a module for extracting information from an HTML page (and
is much better for this purpose than regular expressions).
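A small sketch of parsing with Beautiful Soup (the HTML snippet is made up for illustration; the bs4 module must first be installed with pip install beautifulsoup4):

```python
import bs4

# Parse a small HTML string; select() finds elements by CSS selector.
html = '<p class="intro">Hello</p><a href="http://example.com">link</a>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# Get the text inside the <p class="intro"> element.
print(soup.select('p.intro')[0].getText())        # Hello

# Collect the href attribute of every <a> element.
print([a.get('href') for a in soup.select('a')])  # ['http://example.com']
```

Unlike a regular expression, the parser understands the structure of the HTML, so nesting, attribute order, and whitespace do not break the extraction.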
Selenium allows you to interact with web pages in a much more advanced way
than Requests and Beautiful Soup; but because it launches a web browser, it is
a bit slower and hard to run in the background if, say, you just need to
download some files from the Web.
Launching Selenium Controlled Browser
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>
>>> browser.get('http://inventwithpython.com')