Increase the speed of Web Scraping in Python using HTTPX module
Last Updated: 30 Jul, 2024
In this article, we will look at how to speed up web scraping that uses the requests module by switching to the HTTPX module with AsyncIO, which lets us fetch the requests concurrently.
You should be familiar with Python; knowledge of the requests module or of web scraping is a bonus.
Required Modules
For this tutorial, we will use four modules:
- time
- requests
- httpx
- asyncio
pip install httpx
pip install requests
time and asyncio come pre-installed with Python, so there is no need to install them.
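As an optional sanity check, you can confirm that both libraries are importable and print their installed versions:
Python
import httpx
import requests

# Both libraries expose a __version__ attribute we can print
print("httpx version:", httpx.__version__)
print("requests version:", requests.__version__)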
Using the requests module to measure the time taken
First, we will fetch the URLs in the traditional way using the get() method of the requests module, and then use the time module to check the total time consumed.
Python
import time
import requests


def fetch_urls():
    # Wikipedia pages to fetch one after another
    urls = [
        "https://en.wikipedia.org/wiki/Badlands",
        "https://en.wikipedia.org/wiki/Canyon",
        "https://en.wikipedia.org/wiki/Cave",
        "https://en.wikipedia.org/wiki/Cliff",
        "https://en.wikipedia.org/wiki/Coast",
        "https://en.wikipedia.org/wiki/Continent",
        "https://en.wikipedia.org/wiki/Coral_reef",
        "https://en.wikipedia.org/wiki/Desert",
        "https://en.wikipedia.org/wiki/Forest",
        "https://en.wikipedia.org/wiki/Geyser",
        "https://en.wikipedia.org/wiki/Mountain_range",
        "https://en.wikipedia.org/wiki/Peninsula",
        "https://en.wikipedia.org/wiki/Ridge",
        "https://en.wikipedia.org/wiki/Savanna",
        "https://en.wikipedia.org/wiki/Shoal",
        "https://en.wikipedia.org/wiki/Steppe",
        "https://en.wikipedia.org/wiki/Tundra",
        "https://en.wikipedia.org/wiki/Valley",
        "https://en.wikipedia.org/wiki/Volcano",
        "https://en.wikipedia.org/wiki/Artificial_island",
        "https://en.wikipedia.org/wiki/Lake"
    ]
    # Send a GET request to each URL sequentially and collect the status codes
    res = [requests.get(addr).status_code for addr in urls]
    print(set(res))


# Time the sequential fetch
start = time.time()
fetch_urls()
end = time.time()
print("Total Consumed Time", end - start)
First we import the requests and time modules, then create a function called fetch_urls() containing a list of Wikipedia links (you can use any number of working links). In the list comprehension assigned to res, we call the get() method of the requests module on each link and store the returned status_code in a list. Finally we print the set of res. The reason for converting the list to a set is that if every site is reachable, every request returns status code 200, so the set collapses to a single value and printing it takes almost no time (the goal is to spend as little time as possible on anything other than fetching).
Outside the function, we use the time() method of the time module to record the start and end times, call the function in between, and finally print the total time consumed.
Output:
We can see from the output that it consumed a total of 12.6422558 seconds.
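As a small aside, a minimal variation of the timing code (assuming the fetch_urls() function defined above) can use time.perf_counter(), which provides a higher-resolution clock and is generally preferred for short benchmarks:
Python
import time

# perf_counter() is a monotonic, high-resolution clock, so it is well suited
# to measuring the elapsed time of a short benchmark like this one
start = time.perf_counter()
fetch_urls()  # the function defined in the previous snippet
end = time.perf_counter()
print("Total Consumed Time", end - start)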
Using HTTPX with AsyncIO
In a Jupyter Notebook, be aware that the event loop is managed differently than in a standalone script, because the notebook already runs its own event loop. Specifically, you should use nest_asyncio to patch the event loop so the asynchronous code runs correctly in the notebook.
Here's how you can modify the code to work smoothly in a Jupyter Notebook:
- Import nest_asyncio and apply it to patch the event loop.
- Ensure the asynchronous function is run properly within the notebook environment.
Python
import time
import asyncio
import httpx
import nest_asyncio

# Patch the event loop for Jupyter Notebook
nest_asyncio.apply()


async def fetch_httpx():
    # The same Wikipedia pages, now fetched concurrently
    urls = [
        "https://en.wikipedia.org/wiki/Badlands",
        "https://en.wikipedia.org/wiki/Canyon",
        "https://en.wikipedia.org/wiki/Cave",
        "https://en.wikipedia.org/wiki/Cliff",
        "https://en.wikipedia.org/wiki/Coast",
        "https://en.wikipedia.org/wiki/Continent",
        "https://en.wikipedia.org/wiki/Coral_reef",
        "https://en.wikipedia.org/wiki/Desert",
        "https://en.wikipedia.org/wiki/Forest",
        "https://en.wikipedia.org/wiki/Geyser",
        "https://en.wikipedia.org/wiki/Mountain_range",
        "https://en.wikipedia.org/wiki/Peninsula",
        "https://en.wikipedia.org/wiki/Ridge",
        "https://en.wikipedia.org/wiki/Savanna",
        "https://en.wikipedia.org/wiki/Shoal",
        "https://en.wikipedia.org/wiki/Steppe",
        "https://en.wikipedia.org/wiki/Tundra",
        "https://en.wikipedia.org/wiki/Valley",
        "https://en.wikipedia.org/wiki/Volcano",
        "https://en.wikipedia.org/wiki/Artificial_island",
        "https://en.wikipedia.org/wiki/Lake"
    ]
    # One asynchronous client is reused for all the requests
    async with httpx.AsyncClient() as httpx_client:
        # Create the request coroutines without awaiting them individually...
        req = [httpx_client.get(addr) for addr in urls]
        # ...then await them all concurrently
        result = await asyncio.gather(*req)


# Time the concurrent fetch
start = time.time()
await fetch_httpx()  # Use 'await' to run the async function directly in a notebook cell
end = time.time()
print("Total Consumed Time using HTTPX:", end - start)
We have to use asyncio together with HTTPX, otherwise we cannot send the requests concurrently. HTTPX has a built-in asynchronous client, AsyncClient, which we use here; to use it inside a function, that function has to be asynchronous. We open AsyncClient() as httpx_client inside an async with block and use it to send requests concurrently to the same links used earlier. Because the client's get() calls are awaitable, we use asyncio.gather() with await to collect all the responses and store them in result. (You can print the responses too, but since the intention is to spend as little time as possible on anything other than fetching, they are not printed here.)
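If you do want to look at the responses, a hypothetical one-line addition (not part of the original code) inside fetch_httpx(), placed directly after the asyncio.gather() call, prints the set of status codes the same way the requests version did:
Python
        # inside fetch_httpx(), right after: result = await asyncio.gather(*req)
        # each item in result is an httpx.Response, so status_code works as in requests
        print({resp.status_code for resp in result})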
Then, outside the function, we record the start and end times and run the function in between; in a notebook cell we can await it directly, whereas in a standalone script we would call it with asyncio.run(), as sketched below. Finally we print the total time consumed.
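For completeness, here is a minimal sketch of how the same function would be run from a standalone script (assuming the fetch_httpx() coroutine defined above is in the same file). Outside a notebook there is no event loop already running, so asyncio.run() is used and the nest_asyncio patch is not needed:
Python
import asyncio

if __name__ == "__main__":
    # In a plain .py script, asyncio.run() creates and manages the event loop for us
    asyncio.run(fetch_httpx())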
Output:
As we can see from the output, the total time consumed has decreased by nearly 6 times. This difference will vary from run to run: if we send requests to the same URLs again and again, both the requests version and the HTTPX version take less time than the previous run, and the gap between them tends to grow even larger.
Here that difference reached nearly 10 times, i.e. HTTPX with AsyncIO was nearly 10 times faster than requests.