
Chapter 11.

Web Scraping

Name: Vedant Badiger


USN: 1BI19CS182
Web scraping
Web scraping is the term for using a program to download and
process content from the Web. For example, Google runs many web
scraping programs to index web pages for its search engine.

Web Scraping Modules

•webbrowser. Comes with Python and opens a browser to a specific page.
•Requests. Downloads files and web pages from the Internet.
•Beautiful Soup. Parses HTML, the format that web pages are written in.
•Selenium. Launches and controls a web browser. Selenium is able to fill in forms and simulate mouse clicks in this browser.
mapit.py with the webbrowser Module
The webbrowser module’s open() function can launch a new browser to a
specified URL.
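A minimal example, assuming Amazon India's homepage as the target URL:

>>> import webbrowser
>>> webbrowser.open('http://amazon.in/')
True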

A web browser tab will open to the URL http://amazon.in/.


Python Urllib Module

The urllib package is the URL-handling module for Python. It is used to fetch URLs (Uniform Resource Locators). It uses the urlopen function and can fetch URLs using a variety of different protocols.
urllib is a package that collects several modules for working with URLs, such as:
•urllib.request for opening and reading URLs
•urllib.parse for parsing URLs
•urllib.error for the exceptions raised by urllib.request
•urllib.robotparser for parsing robots.txt files
Retrieving web pages with urllib
Reading web pages is made simpler in Python by using the urllib library.
Using urllib, you can treat a web page much like a file.
You simply indicate which web page you would like to retrieve and
urllib handles all of the HTTP protocol and header details.

Example
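A minimal sketch, following the standard urllib pattern from Python for Everybody (the URL points to a small plain-text file):

import urllib.request

# Open the URL and treat the response much like a file handle
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    # Each line arrives as bytes; decode to a string and strip the newline
    print(line.decode().strip())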

The output is each line of romeo.txt printed to the screen.
Write a program to retrieve the data for http://data.pr4e.org/romeo.txt and
compute the frequency of each word in the file using urllib.
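A sketch of one possible solution, reusing the read loop above with a dictionary of counts:

import urllib.request

counts = dict()
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    words = line.decode().split()          # bytes -> str, then split into words
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)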
Parsing HTML and scraping the web
One of the common uses of urllib in Python is to scrape the web.
Web scraping: a program pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.

Example:
A search engine such as Google will look at the source of one web page, extract the links to other pages, retrieve those pages, extract their links, and so on.

Using this technique, Google spiders its way through nearly all of the pages on the web.

Google also uses the frequency of links from pages to determine how “important” a page is
and how high (rank) the page should appear in its search results.
Example: extracting all the links from a given URL using a regular expression.
One simple way to parse HTML is to use regular expressions to
repeatedly search for and extract substrings that match a
particular pattern.
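A minimal sketch in this style (the pattern is a simplification: it only matches absolute http/https links quoted with double quotes):

import re
import urllib.request

url = input('Enter URL: ')
html = urllib.request.urlopen(url).read()   # the raw page as bytes
# Find every href="..." attribute holding an absolute URL
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())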
Reading binary files using urllib
Sometimes you want to retrieve a web page containing a non-text (or binary) file such as an image or
video file. The data in these files is generally not useful to print out, but you can easily make a copy of
a URL to a local file on your hard disk using urllib.
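A minimal sketch, assuming a cover image from the Python for Everybody site as the binary file:

import urllib.request

# Read the entire image into memory as one bytes object
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')   # 'wb' = write binary mode
fhand.write(img)
fhand.close()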

The above program reads all of the data in at once across the network and stores it in the variable img in the main memory of your computer.
This will work if the size of the file is less than the size of the memory of your computer.
Downloading Files from the Web with the requests Module
The requests module lets you easily download files from the Web without having
to worry about complicated issues such as network errors, connection problems,
and data compression.

The requests module doesn’t come with Python, so you’ll have to install it first.
From the command line, run pip install requests.

Downloading a Web Page with the requests.get() Function

The requests.get() function takes a string of a URL to download.

By calling type() on requests.get()'s return value, you can see that it returns a Response object, which contains the response that the web server gave for your request.
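For example, downloading a plain-text copy of Romeo and Juliet (the URL comes from the Automate the Boring Stuff site):

>>> import requests
>>> res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
>>> type(res)
<class 'requests.models.Response'>
>>> res.status_code == requests.codes.ok
True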
The raise_for_status() method

The raise_for_status() method is a good way to ensure that a program halts if a bad download occurs: it raises an exception if there was an error downloading the file and does nothing if the download succeeded.
This is a good thing: you want your program to stop as soon as some unexpected error happens.
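For example, requesting a page that does not exist and then calling raise_for_status() (output abbreviated):

>>> import requests
>>> res = requests.get('https://inventwithpython.com/page_that_does_not_exist')
>>> res.raise_for_status()
Traceback (most recent call last):
  ...
requests.exceptions.HTTPError: 404 Client Error: Not Found ...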
The iter_content() method

The iter_content() method returns “chunks” of the content on each iteration through
the loop.

Each chunk is of the bytes data type, and you get to specify how many bytes each
chunk will contain.

One hundred thousand bytes is generally a good size, so pass 100000 as the
argument to iter_content().
write() method
The write() method returns the number of bytes written to the file.
To review, here’s the complete process for downloading and saving
a file:
1. Call requests.get() to download the file.
2. Call open() with 'wb' to create a new file in write binary mode.
3. Loop over the Response object’s iter_content() method.
4. Call write() on each iteration to write the content to the file.
5. Call close() to close the file.
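A minimal sketch that puts all five steps together (the URL and filename are illustrative):

import requests

res = requests.get('https://automatetheboringstuff.com/files/rj.txt')  # 1. download
res.raise_for_status()                        # stop here if the download failed
playFile = open('RomeoAndJuliet.txt', 'wb')   # 2. create a file in write binary mode
for chunk in res.iter_content(100000):        # 3. loop over 100,000-byte chunks
    playFile.write(chunk)                     # 4. write each chunk to the file
playFile.close()                              # 5. close the file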
Parsing HTML with the BeautifulSoup Module

Beautiful Soup is a module for extracting information from an HTML page (and
is much better for this purpose than regular expressions).

The BeautifulSoup module's name is bs4 (for Beautiful Soup, version 4). To install it, you will need to run pip install beautifulsoup4 from the command line. (Check out Appendix A for instructions on installing third-party modules.)
While beautifulsoup4 is the name used for installation, to import Beautiful Soup you run import bs4.

Syntax: BeautifulSoup(document, parser)


Creating a BeautifulSoup Object from HTML
The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will
parse.
The bs4.BeautifulSoup() function returns a BeautifulSoup object.
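A sketch of building the object from a downloaded page, in the style of Automate the Boring Stuff (the parser argument follows the Syntax line above):

>>> import requests, bs4
>>> res = requests.get('https://nostarch.com')
>>> res.raise_for_status()
>>> noStarchSoup = bs4.BeautifulSoup(res.text, 'html.parser')
>>> type(noStarchSoup)
<class 'bs4.BeautifulSoup'>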
Controlling the Browser with the selenium Module

The selenium module lets Python directly control the browser by programmatically clicking links and filling in login information, almost as though there is a human user interacting with the page.

Selenium allows you to interact with web pages in a much more advanced way
than Requests and Beautiful Soup; but because it launches a web browser, it is
a bit slower and hard to run in the background if, say, you just need to
download some files from the Web.
Launching a Selenium-Controlled Browser
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>

>>> browser.get('http://inventwithpython.com')

When webdriver.Firefox() is called, the Firefox web browser starts up.

Calling type() on the value returned by webdriver.Firefox() reveals it's of the WebDriver data type.

And calling browser.get('http://inventwithpython.com') directs the browser to http://inventwithpython.com/.
