EXPERIMENT - X
Description:
Web scraping is the process of extracting data from websites. In Python, web scraping can be performed
using libraries such as BeautifulSoup, requests, and Selenium. Here's a step-by-step guide demonstrating
how to perform web scraping using the requests and BeautifulSoup libraries.
In today’s digital world, data is the key to unlocking valuable insights, and much of this data is available
on the web. But how do you gather large amounts of data from websites efficiently? That’s where Python
web scraping comes in. Web scraping has emerged as a powerful technique for gathering information
from the vast expanse of the internet.
In this tutorial, we’ll explore various Python libraries and modules commonly used for web scraping and
look at why Python 3 is the preferred choice for this task. Along the way, you will also see how to use
powerful tools like BeautifulSoup, Scrapy, and Selenium to scrape a website.
Table of Contents
• Requests Module
• BeautifulSoup Library
• Selenium
• Lxml
• Urllib Module
• PyAutoGUI
Requests Module
The requests library is used to make HTTP requests to a specific URL and returns the response.
Python requests provides built-in functionality for managing both the request and the response.
The requests module has several built-in methods for making HTTP requests to a specified URI using
GET, POST, PUT, PATCH, or HEAD requests. An HTTP request is meant either to retrieve data from a
specified URI or to push data to a server. It works as a request-response protocol between a client and a
server. Here we will be using the GET request. The GET method is used to retrieve information from the
given server using a given URI, sending the encoded user information appended to the page request.
EX-CODE:
import requests
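A minimal sketch of such a GET request, using the GeeksforGeeks page targeted later in this experiment (the printed slice of HTML is purely for illustration):

import requests

# Send a GET request to the target page and store the response
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# A status code of 200 indicates the request succeeded
print(r.status_code)

# Show the first few hundred characters of the returned HTML
print(r.text[:300])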
BeautifulSoup Library
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and
modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take
much code to write an application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to
UTF-8. You don’t have to think about encodings unless the document doesn’t specify an encoding and
Beautiful Soup can’t detect one; then you just have to specify the original encoding. Beautiful Soup sits
on top of popular Python parsers like lxml and html5lib, allowing you to try different parsing strategies
or trade speed for flexibility.
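To make those navigation and search methods concrete, here is a small self-contained sketch; the HTML string is made up purely for illustration:

from bs4 import BeautifulSoup

# A tiny, made-up HTML document to demonstrate parse-tree navigation
html_doc = '<html><body><h1>Demo</h1><a href="/one">First</a><a href="/two">Second</a></body></html>'

# Parse with Python's built-in parser; 'lxml' can be swapped in for speed
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.h1.text)              # navigating: the <h1> element's text
print(soup.find('a')['href'])    # searching: the first <a> tag's href attribute
for link in soup.find_all('a'):  # searching: every <a> tag in the tree
    print(link.text)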
Example
Importing Libraries: The code imports the requests library for making HTTP requests and the
BeautifulSoup class from the bs4 library for parsing HTML.
Making a GET Request: It sends a GET request to ‘https://www.geeksforgeeks.org/python-programming-
language/’ and stores the response in the variable r.
Checking Status Code: It prints the status code of the response, typically 200 for success.
Parsing the HTML: The HTML content of the response is parsed using BeautifulSoup and stored in the
variable soup.
Printing the Prettified HTML: It prints the prettified version of the parsed HTML content for readability
and analysis.
EX-CODE:
import requests
from bs4 import BeautifulSoup
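Putting the steps above together, a minimal sketch of the full example (the output is the prettified page HTML):

import requests
from bs4 import BeautifulSoup

# Make a GET request to the target page and store the response in r
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Check the status code of the response (typically 200 for success)
print(r.status_code)

# Parse the HTML content of the response and store it in soup
soup = BeautifulSoup(r.content, 'html.parser')

# Print the prettified version of the parsed HTML for readability
print(soup.prettify())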
Lxml
The lxml module in Python is a powerful library for processing XML and HTML documents. It provides
high-performance XML and HTML parsing capabilities along with a simple, Pythonic API. lxml is
widely used in Python web scraping due to its speed, flexibility, and ease of use.
• We import the html module from lxml along with the requests module for sending HTTP requests.
• We define the URL of the website we want to scrape.
• We send an HTTP GET request to the website using the requests.get() function and retrieve the HTML
content of the page.
• We parse the HTML content using the html.fromstring() function from lxml, which returns an HTML
element tree.
• We use XPath expressions to extract specific elements from the HTML tree. In this case, we’re
extracting the text content of all the <a> (anchor) elements on the page.
• We iterate over the extracted link titles and print them out.
EX-CODE:
import requests
from lxml import html

# Define the URL of the website we want to scrape (reusing the page from the previous example)
url = 'https://www.geeksforgeeks.org/python-programming-language/'

# Send an HTTP request to the website and retrieve the HTML content
response = requests.get(url)
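Completing the steps described above, a self-contained sketch (the request lines are repeated so it runs on its own; the XPath //a/text() matches the anchor-text extraction described in the step list):

import requests
from lxml import html

# Retrieve the page (same request as above, repeated for self-containment)
url = 'https://www.geeksforgeeks.org/python-programming-language/'
response = requests.get(url)

# Parse the HTML content into an lxml element tree
tree = html.fromstring(response.content)

# XPath expression extracting the text content of all <a> (anchor) elements
link_titles = tree.xpath('//a/text()')

# Iterate over the extracted link titles and print them out
for title in link_titles:
    print(title)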
Result:
Verified Web Scraping using Python.