Multithreaded crawler in Python
Last Updated: 09 Jan, 2023
In this article, we will describe how to build a simple multithreaded crawler in Python.
Modules Needed
bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. To install this library, type the following command in IDE/terminal.
pip install bs4
requests: This library allows you to send HTTP/1.1 requests very easily. To install this library, type the following command in IDE/terminal.
pip install requests
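To confirm that both libraries are installed and importable, a quick sanity check like the one below fetches a page and prints its title (example.com is used here purely as a placeholder target):
Python3
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it to verify that requests and bs4 work together
response = requests.get("https://www.example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string if soup.title else "No <title> found")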
Stepwise implementation
Step 1: First, import all the libraries needed for the crawler. If you're using Python 3, you should already have all of them except BeautifulSoup and requests, so install those two with the commands specified above if you haven't already.
Python3
import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests
Step 2: Create the main program: create an object of the MultiThreadedCrawler class, pass the seed URL to its parameterized constructor, and call the run_web_crawler() method.
Python3
if __name__ == '__main__':
    cc = MultiThreadedCrawler("https://www.geeksforgeeks.org/")
    cc.run_web_crawler()
    cc.info()
Step 3: Create a class named MultiThreadedCrawler. In its constructor, assign the base URL to the instance variable seed_url, then build the root URL from the seed URL's scheme and network location using urlparse (see the small sketch after the code below).
To crawl the frontier concurrently, use multithreading: create a ThreadPoolExecutor with max_workers=5, so that up to 5 threads execute at a time. To avoid visiting the same web page twice, maintain the history of visited pages in a set data structure.
Finally, create a queue to store the URLs of the crawl frontier and put the seed URL in as the first item.
Python3
class MultiThreadedCrawler:

    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.root_url = '{}://{}'.format(urlparse(self.seed_url).scheme,
                                         urlparse(self.seed_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set([])
        self.crawl_queue = Queue()
        self.crawl_queue.put(self.seed_url)
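As a quick illustration of how the root URL is derived, urlparse() splits a link into its components; the URL below is just an example:
Python3
from urllib.parse import urlparse

parsed = urlparse("https://www.geeksforgeeks.org/some-article/")
print(parsed.scheme)   # https
print(parsed.netloc)   # www.geeksforgeeks.org

# The root URL is rebuilt from the scheme and network location
print('{}://{}'.format(parsed.scheme, parsed.netloc))   # https://www.geeksforgeeks.org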
Step 4: Create a method named run_web_crawler(), to keep on adding the link to frontier and extracting the information use an infinite while loop and display the name of the currently executing process.
Get the URL from crawl frontier, for lookup assign timeout as 60 seconds and check whether the current URL is already visited or not. If not visited already, Format the current  URL and add it to scraped_pages set to store in the history of visited pages and choose from a pool of threads and pass scrape page and target URL.
Python3
def run_web_crawler(self):
    while True:
        try:
            print("\n Name of the current executing process: ",
                  multiprocessing.current_process().name, '\n')
            target_url = self.crawl_queue.get(timeout=60)
            if target_url not in self.scraped_pages:
                print("Scraping URL: {}".format(target_url))
                self.scraped_pages.add(target_url)
                job = self.pool.submit(self.scrape_page, target_url)
                job.add_done_callback(self.post_scrape_callback)
        except Empty:
            return
        except Exception as e:
            print(e)
            continue
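The submit/add_done_callback pattern used above can be seen on its own in the sketch below; the worker function and its argument are made up purely for illustration:
Python3
from concurrent.futures import ThreadPoolExecutor

def fake_worker(url):
    # Stand-in for scrape_page: just return something derived from the input
    return len(url)

def on_done(future):
    # Called automatically once fake_worker finishes;
    # future.result() holds its return value
    print("Result:", future.result())

pool = ThreadPoolExecutor(max_workers=2)
job = pool.submit(fake_worker, "https://www.example.com")
job.add_done_callback(on_done)
pool.shutdown(wait=True)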
Step 5: Place the request with requests.get(), using a timeout tuple of (3, 30): 3 seconds to establish the connection (the handshake) and 30 seconds to receive the response. If the request succeeds, return the response object; if it raises a RequestException, return None (the failure path is sketched after the code below).
Python3
def scrape_page(self, url):
    try:
        res = requests.get(url, timeout=(3, 30))
        return res
    except requests.RequestException:
        return
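A small sketch of the failure path, using a deliberately unreachable host (the domain below is made up so the request is guaranteed to fail):
Python3
import requests

try:
    requests.get("https://nonexistent.invalid/", timeout=(3, 30))
except requests.RequestException as exc:
    # Connection failures and timeouts both derive from requests.RequestException
    print("Request failed:", type(exc).__name__)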
Step 6: Create a method named scrape_info() and pass the web page data into BeautifulSoup, which organizes the messy web data by fixing bad HTML and presenting it in an easily traversable structure. Note that the "html5lib" parser used here requires the html5lib package (pip install html5lib); Python's built-in "html.parser" would also work.
Using BeautifulSoup, extract the text of all the paragraph tags present in the HTML document (calling the soup object directly, as soup('p'), is shorthand for soup.find_all('p'); see the sketch after the code below).
Python3
def scrape_info(self, html):
    soup = BeautifulSoup(html, "html5lib")
    web_page_paragraph_contents = soup('p')
    text = ''
    for para in web_page_paragraph_contents:
        if not ('https:' in str(para.text)):
            text = text + str(para.text).strip()
    print('\n <-----Text Present in The WebPage is--->\n', text, '\n')
    return
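A minimal sketch of the shorthand used above, run on a made-up HTML snippet:
Python3
from bs4 import BeautifulSoup

html = "<p>First paragraph.</p><p>See https://www.example.com</p>"
soup = BeautifulSoup(html, "html.parser")

# soup('p') is equivalent to soup.find_all('p')
for para in soup('p'):
    print(para.text)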
Step 7: Create a method named parse_links(). Using BeautifulSoup, extract all the anchor tags present in the HTML document: soup.find_all('a', href=True) returns a list of all anchor tags that carry an href attribute. Store them in a list named Anchor_Tags. For each anchor tag in the list, retrieve the value of its href attribute using link['href'] and check whether the retrieved URL is absolute or relative:
- Relative URL: a URL without the protocol name and root URL.
- Absolute URL: a URL with the protocol name, root URL, and document name.
If it is a relative URL, convert it to an absolute URL with urljoin(), using the base URL and the relative URL (see the sketch after the code below). Then check whether the URL has already been visited; if not, put it in the crawl queue.
Python3
def parse_links(self, html):
    soup = BeautifulSoup(html, 'html.parser')
    Anchor_Tags = soup.find_all('a', href=True)
    for link in Anchor_Tags:
        url = link['href']
        if url.startswith('/') or url.startswith(self.root_url):
            url = urljoin(self.root_url, url)
            if url not in self.scraped_pages:
                self.crawl_queue.put(url)
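A quick sketch of how urljoin() resolves a relative path against the base URL (the paths here are made up for illustration):
Python3
from urllib.parse import urljoin

base = "https://www.geeksforgeeks.org"

# A relative path is joined onto the base URL
print(urljoin(base, "/python-programming-language/"))
# https://www.geeksforgeeks.org/python-programming-language/

# An already-absolute URL is returned unchanged
print(urljoin(base, "https://www.geeksforgeeks.org/about/"))
# https://www.geeksforgeeks.org/about/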
Step 8: In the callback attached to each job, get the result of the completed request. If the response is valid (status code 200), call parse_links() to extract the links and scrape_info() to extract the content, passing the response text to each.
Python3
def post_scrape_callback(self, res):
    result = res.result()
    if result and result.status_code == 200:
        self.parse_links(result.text)
        self.scrape_info(result.text)
Below is the complete implementation:
Python3
import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests


class MultiThreadedCrawler:

    def __init__(self, seed_url):
        self.seed_url = seed_url
        # Root URL built from the scheme and network location of the seed URL
        self.root_url = '{}://{}'.format(urlparse(self.seed_url).scheme,
                                         urlparse(self.seed_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set([])
        self.crawl_queue = Queue()
        self.crawl_queue.put(self.seed_url)

    def parse_links(self, html):
        # Extract every anchor tag and queue unvisited links from the same site
        soup = BeautifulSoup(html, 'html.parser')
        Anchor_Tags = soup.find_all('a', href=True)
        for link in Anchor_Tags:
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.crawl_queue.put(url)

    def scrape_info(self, html):
        # Collect the text of all paragraph tags that do not contain links
        soup = BeautifulSoup(html, "html5lib")
        web_page_paragraph_contents = soup('p')
        text = ''
        for para in web_page_paragraph_contents:
            if not ('https:' in str(para.text)):
                text = text + str(para.text).strip()
        print('\n <---Text Present in The WebPage is --->\n', text, '\n')
        return

    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        try:
            # 3-second connect timeout, 30-second read timeout
            res = requests.get(url, timeout=(3, 30))
            return res
        except requests.RequestException:
            return

    def run_web_crawler(self):
        while True:
            try:
                print("\n Name of the current executing process: ",
                      multiprocessing.current_process().name, '\n')
                target_url = self.crawl_queue.get(timeout=60)
                if target_url not in self.scraped_pages:
                    print("Scraping URL: {}".format(target_url))
                    self.current_scraping_url = "{}".format(target_url)
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

    def info(self):
        print('\n Seed URL is: ', self.seed_url, '\n')
        print('Scraped pages are: ', self.scraped_pages, '\n')


if __name__ == '__main__':
    cc = MultiThreadedCrawler("https://www.geeksforgeeks.org/")
    cc.run_web_crawler()
    cc.info()
Output: while running, the crawler prints the name of the executing process, each URL it scrapes, and the paragraph text extracted from every page. Once the queue has stayed empty for 60 seconds the loop ends, and info() prints the seed URL together with the set of scraped pages.
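One practical note: as written, the crawler keeps going until the queue has stayed empty for 60 seconds, and the ThreadPoolExecutor is never shut down explicitly. Below is a minimal sketch of one possible variation of run_web_crawler() that caps the number of pages submitted for scraping and releases the worker threads when the loop ends; the max_pages parameter is an assumption for illustration and is not part of the original class:
Python3
# Sketch only: a possible variation of run_web_crawler() for the class above.
# max_pages is an assumed parameter, not part of the original code.
def run_web_crawler(self, max_pages=50):
    while len(self.scraped_pages) < max_pages:
        try:
            target_url = self.crawl_queue.get(timeout=60)
            if target_url not in self.scraped_pages:
                self.scraped_pages.add(target_url)
                job = self.pool.submit(self.scrape_page, target_url)
                job.add_done_callback(self.post_scrape_callback)
        except Empty:
            break
        except Exception as e:
            print(e)
            continue
    # Wait for any in-flight jobs to finish, then free the worker threads
    self.pool.shutdown(wait=True)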