Web Scraping
Beautiful Soup. Parses HTML, the format that web pages are written in.
Selenium. Launches and controls a web browser. Selenium is able to fill in forms
and simulate mouse clicks in this browser.
mapit.py with the webbrowser Module
The webbrowser module’s open() function can launch a new browser to a
specified URL.
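A sketch of how a mapit.py script might use this (the Google Maps URL format and the maps_url helper name are assumptions for illustration):

```python
import sys
import webbrowser

def maps_url(address):
    # Build a Google Maps search URL for the given street address.
    # (URL format assumed for illustration.)
    return 'https://www.google.com/maps/place/' + address.replace(' ', '+')

if __name__ == '__main__' and len(sys.argv) > 1:
    # Join the command-line arguments into one address string and
    # open the resulting URL in a new browser tab.
    webbrowser.open(maps_url(' '.join(sys.argv[1:])))
```

Run as, say, `python mapit.py 870 Valencia St` to open the map in the default browser.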
The urllib package is the URL handling module for Python. It is used to fetch
URLs (Uniform Resource Locators). Its urlopen function can fetch URLs
using a variety of different protocols.
Urllib is a package that collects several modules for working with URLs,
such as:
•urllib.request for opening and reading URLs
•urllib.parse for parsing URLs
•urllib.error for the exceptions raised by urllib.request
•urllib.robotparser for parsing robots.txt files
Retrieving web pages with urllib
Reading web pages is made simpler in Python by the urllib library.
Using urllib, you can treat a web page much like a file.
You simply indicate which web page you would like to retrieve and
urllib handles all of the HTTP protocol and header details.
Exercise: Write a program to retrieve the data from http://data.pr4e.org/romeo.txt
and compute the frequency of each word in the file using urllib.
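One possible solution sketch for this exercise (word_counts and count_url are illustrative helper names):

```python
import urllib.request

def word_counts(text):
    # Split the text on whitespace and count each word.
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def count_url(url):
    # Fetch the whole page, decode the bytes to a string,
    # and count the words in it.
    with urllib.request.urlopen(url) as fhand:
        return word_counts(fhand.read().decode())

# Example (requires network access):
# print(count_url('http://data.pr4e.org/romeo.txt'))
```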
Parsing HTML and scraping the web
One of the common uses of urllib in Python is to scrape the web.
Web scraping: writing a program that pretends to be a web browser and retrieves
pages, then examines the data in those pages looking for patterns.
For example:
A search engine such as Google will look at the source of one web page and extract the links
to other pages and retrieve those pages, extracting links, and so on.
Using this technique, Google spiders its way through nearly all of the pages on the web.
Google also uses the frequency of links from pages to determine how “important” a page is
and how high (rank) the page should appear in its search results.
Example: extract all the links from a given URL using a regular
expression.
One simple way to parse HTML is to use regular expressions to
repeatedly search for and extract substrings that match a
particular pattern.
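A sketch of this regex approach (extract_links and links_from are illustrative helper names; the pattern is deliberately simple and only matches double-quoted http(s) links):

```python
import re
import urllib.request

def extract_links(html):
    # Grab every double-quoted href value that starts with http or
    # https. A rough pattern: it will miss single-quoted, unquoted,
    # and relative links.
    return re.findall(r'href="(https?://[^"]+)"', html)

def links_from(url):
    # Retrieve the page, then search its HTML for link substrings.
    with urllib.request.urlopen(url) as fhand:
        return extract_links(fhand.read().decode())

# Example (requires network access):
# for link in links_from('http://www.dr-chuck.com/page1.htm'):
#     print(link)
```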
Reading binary files using urllib
Sometimes you want to retrieve a web page containing a non-text (or binary) file such as an image or
video file. The data in these files is generally not useful to print out, but you can easily copy the data
from a URL to a local file on your hard disk using urllib.
The requests module doesn’t come with Python, so you’ll have to install it first.
From the command line, run pip install requests.
By calling type() on requests.get()’s return value, you can see that it returns
a Response object, which contains the response that the web server gave for
your request
The raise_for_status() method
Calling raise_for_status() on the Response object raises an exception if there
was an error downloading the file, and does nothing if the download succeeded.
The iter_content() method returns “chunks” of the content on each iteration through
the loop.
Each chunk is of the bytes data type, and you get to specify how many bytes each
chunk will contain.
One hundred thousand bytes is generally a good size, so pass 100000 as the
argument to iter_content().
write() method
The write() method returns the number of bytes written to the file.
To review, here’s the complete process for downloading and saving
a file:
1. Call requests.get() to download the file.
2. Call open() with 'wb' to create a new file in write binary mode.
3. Loop over the Response object’s iter_content() method.
4. Call write() on each iteration to write the content to the file.
5. Call close() to close the file.
Parsing HTML with the BeautifulSoup Module
Beautiful Soup is a module for extracting information from an HTML page (and
is much better for this purpose than regular expressions).
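A small sketch of parsing with Beautiful Soup (the HTML snippet is made up for illustration; the bs4 module must first be installed with pip install beautifulsoup4):

```python
import bs4

# Parse a small HTML string; select() finds elements by CSS selector.
html = '<p class="intro">Hello</p><a href="http://example.com">link</a>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# Get the text inside the <p class="intro"> element.
print(soup.select('p.intro')[0].getText())        # Hello

# Collect the href attribute of every <a> element.
print([a.get('href') for a in soup.select('a')])  # ['http://example.com']
```

Unlike a regular expression, the parser understands the structure of the HTML, so nesting, attribute order, and whitespace do not break the extraction.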
Selenium allows you to interact with web pages in a much more advanced way
than Requests and Beautiful Soup; but because it launches a web browser, it is
a bit slower and hard to run in the background if, say, you just need to
download some files from the Web.
Launching Selenium Controlled Browser
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>
>>> browser.get('http://inventwithpython.com')