Multithreaded crawler in Python
Last Updated: 09 Jan, 2023
In this article, we will describe how to build a simple multithreaded crawler in Python.
Modules Needed
bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. To install this library, type the following command in IDE/terminal.
pip install bs4
requests: This library allows you to send HTTP/1.1 requests very easily. To install this library, type the following command in IDE/terminal.
pip install requests
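To confirm that both libraries are installed and importable, a quick sanity check like the one below fetches a page and prints its title (example.com is used here purely as a placeholder target):
Python3
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it to verify that requests and bs4 work together
response = requests.get("https://www.example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string if soup.title else "No <title> found")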
Stepwise implementation
Step 1: First, import all the libraries needed for the crawler. If you're using Python 3, you should already have all of them except BeautifulSoup and requests, so install those two with the commands specified above if you haven't already.
Python3
import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests
Step 2: Create the main program: create an object of the MultiThreadedCrawler class, pass the seed URL to its parameterized constructor, and call the run_web_crawler() method.
Python3
if __name__ == '__main__':
    cc = MultiThreadedCrawler("https://www.geeksforgeeks.org/")
    cc.run_web_crawler()
    cc.info()
Step 3: Create a class named MultiThreadedCrawler. In its constructor, assign the base URL to the instance variable seed_url, then build the root URL from the seed URL's scheme and network location using urlparse (see the small sketch after the code below).
To crawl the frontier concurrently, use multithreading: create a ThreadPoolExecutor with max_workers=5, so that up to 5 threads execute at a time. To avoid visiting the same web page twice, maintain the history of visited pages in a set data structure.
Finally, create a queue to store the URLs of the crawl frontier and put the seed URL in as the first item.
Python3
class MultiThreadedCrawler:

    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.root_url = '{}://{}'.format(urlparse(self.seed_url).scheme,
                                         urlparse(self.seed_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set([])
        self.crawl_queue = Queue()
        self.crawl_queue.put(self.seed_url)
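As a quick illustration of how the root URL is derived, urlparse() splits a link into its components; the URL below is just an example:
Python3
from urllib.parse import urlparse

parsed = urlparse("https://www.geeksforgeeks.org/some-article/")
print(parsed.scheme)   # https
print(parsed.netloc)   # www.geeksforgeeks.org

# The root URL is rebuilt from the scheme and network location
print('{}://{}'.format(parsed.scheme, parsed.netloc))   # https://www.geeksforgeeks.org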
Step 4: Create a method named run_web_crawler(), to keep on adding the link to frontier and extracting the information use an infinite while loop and display the name of the currently executing process.
Get the URL from crawl frontier, for lookup assign timeout as 60 seconds and check whether the current URL is already visited or not. If not visited already, Format the current  URL and add it to scraped_pages set to store in the history of visited pages and choose from a pool of threads and pass scrape page and target URL.
Python3
def run_web_crawler(self):
    while True:
        try:
            print("\n Name of the current executing process: ",
                  multiprocessing.current_process().name, '\n')
            target_url = self.crawl_queue.get(timeout=60)
            if target_url not in self.scraped_pages:
                print("Scraping URL: {}".format(target_url))
                self.scraped_pages.add(target_url)
                job = self.pool.submit(self.scrape_page, target_url)
                job.add_done_callback(self.post_scrape_callback)
        except Empty:
            return
        except Exception as e:
            print(e)
            continue
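The submit/add_done_callback pattern used above can be seen on its own in the sketch below; the worker function and its argument are made up purely for illustration:
Python3
from concurrent.futures import ThreadPoolExecutor

def fake_worker(url):
    # Stand-in for scrape_page: just return something derived from the input
    return len(url)

def on_done(future):
    # Called automatically once fake_worker finishes;
    # future.result() holds its return value
    print("Result:", future.result())

pool = ThreadPoolExecutor(max_workers=2)
job = pool.submit(fake_worker, "https://www.example.com")
job.add_done_callback(on_done)
pool.shutdown(wait=True)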
Step 5: Place the request with requests.get(), using a timeout tuple of (3, 30): 3 seconds to establish the connection (the handshake) and 30 seconds to receive the response. If the request succeeds, return the response object; if it raises a RequestException, return None (the failure path is sketched after the code below).
Python3
def scrape_page(self, url):
    try:
        res = requests.get(url, timeout=(3, 30))
        return res
    except requests.RequestException:
        return
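A small sketch of the failure path, using a deliberately unreachable host (the domain below is made up so the request is guaranteed to fail):
Python3
import requests

try:
    requests.get("https://nonexistent.invalid/", timeout=(3, 30))
except requests.RequestException as exc:
    # Connection failures and timeouts both derive from requests.RequestException
    print("Request failed:", type(exc).__name__)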
Step 6: Create a method named scrape_info() and pass the web page data into BeautifulSoup, which organizes the messy web data by fixing bad HTML and presenting it in an easily traversable structure. Note that the "html5lib" parser used here requires the html5lib package (pip install html5lib); Python's built-in "html.parser" would also work.
Using BeautifulSoup, extract the text of all the paragraph tags present in the HTML document (calling the soup object directly, as soup('p'), is shorthand for soup.find_all('p'); see the sketch after the code below).
Python3
def scrape_info(self, html):
    soup = BeautifulSoup(html, "html5lib")
    web_page_paragraph_contents = soup('p')
    text = ''
    for para in web_page_paragraph_contents:
        if not ('https:' in str(para.text)):
            text = text + str(para.text).strip()
    print('\n <-----Text Present in The WebPage is--->\n', text, '\n')
    return
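A minimal sketch of the shorthand used above, run on a made-up HTML snippet:
Python3
from bs4 import BeautifulSoup

html = "<p>First paragraph.</p><p>See https://www.example.com</p>"
soup = BeautifulSoup(html, "html.parser")

# soup('p') is equivalent to soup.find_all('p')
for para in soup('p'):
    print(para.text)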
Step 7: Create a method named parse_links(). Using BeautifulSoup, extract all the anchor tags present in the HTML document: soup.find_all('a', href=True) returns a list of all anchor tags that carry an href attribute. Store them in a list named Anchor_Tags. For each anchor tag in the list, retrieve the value of its href attribute using link['href'] and check whether the retrieved URL is absolute or relative:
- Relative URL: a URL without the protocol name and root URL.
- Absolute URL: a URL with the protocol name, root URL, and document name.
If it is a relative URL, convert it to an absolute URL with urljoin(), using the base URL and the relative URL (see the sketch after the code below). Then check whether the URL has already been visited; if not, put it in the crawl queue.
Python3
def parse_links(self, html):
    soup = BeautifulSoup(html, 'html.parser')
    Anchor_Tags = soup.find_all('a', href=True)
    for link in Anchor_Tags:
        url = link['href']
        if url.startswith('/') or url.startswith(self.root_url):
            url = urljoin(self.root_url, url)
            if url not in self.scraped_pages:
                self.crawl_queue.put(url)
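A quick sketch of how urljoin() resolves a relative path against the base URL (the paths here are made up for illustration):
Python3
from urllib.parse import urljoin

base = "https://www.geeksforgeeks.org"

# A relative path is joined onto the base URL
print(urljoin(base, "/python-programming-language/"))
# https://www.geeksforgeeks.org/python-programming-language/

# An already-absolute URL is returned unchanged
print(urljoin(base, "https://www.geeksforgeeks.org/about/"))
# https://www.geeksforgeeks.org/about/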
Step 8: In the callback attached to each job, get the result of the completed request. If the response is valid (status code 200), call parse_links() to extract the links and scrape_info() to extract the content, passing the response text to each.
Python3
def post_scrape_callback(self, res):
    result = res.result()
    if result and result.status_code == 200:
        self.parse_links(result.text)
        self.scrape_info(result.text)
Below is the complete implementation:
Python3
import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests


class MultiThreadedCrawler:

    def __init__(self, seed_url):
        self.seed_url = seed_url
        # Root URL built from the scheme and network location of the seed URL
        self.root_url = '{}://{}'.format(urlparse(self.seed_url).scheme,
                                         urlparse(self.seed_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set([])
        self.crawl_queue = Queue()
        self.crawl_queue.put(self.seed_url)

    def parse_links(self, html):
        # Extract every anchor tag and queue unvisited links from the same site
        soup = BeautifulSoup(html, 'html.parser')
        Anchor_Tags = soup.find_all('a', href=True)
        for link in Anchor_Tags:
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.crawl_queue.put(url)

    def scrape_info(self, html):
        # Collect the text of all paragraph tags that do not contain links
        soup = BeautifulSoup(html, "html5lib")
        web_page_paragraph_contents = soup('p')
        text = ''
        for para in web_page_paragraph_contents:
            if not ('https:' in str(para.text)):
                text = text + str(para.text).strip()
        print('\n <---Text Present in The WebPage is --->\n', text, '\n')
        return

    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        try:
            # 3-second connect timeout, 30-second read timeout
            res = requests.get(url, timeout=(3, 30))
            return res
        except requests.RequestException:
            return

    def run_web_crawler(self):
        while True:
            try:
                print("\n Name of the current executing process: ",
                      multiprocessing.current_process().name, '\n')
                target_url = self.crawl_queue.get(timeout=60)
                if target_url not in self.scraped_pages:
                    print("Scraping URL: {}".format(target_url))
                    self.current_scraping_url = "{}".format(target_url)
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

    def info(self):
        print('\n Seed URL is: ', self.seed_url, '\n')
        print('Scraped pages are: ', self.scraped_pages, '\n')


if __name__ == '__main__':
    cc = MultiThreadedCrawler("https://www.geeksforgeeks.org/")
    cc.run_web_crawler()
    cc.info()
Output: while running, the crawler prints the name of the executing process, each URL it scrapes, and the paragraph text extracted from every page. Once the queue has stayed empty for 60 seconds the loop ends, and info() prints the seed URL together with the set of scraped pages.
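One practical note: as written, the crawler keeps going until the queue has stayed empty for 60 seconds, and the ThreadPoolExecutor is never shut down explicitly. Below is a minimal sketch of one possible variation of run_web_crawler() that caps the number of pages submitted for scraping and releases the worker threads when the loop ends; the max_pages parameter is an assumption for illustration and is not part of the original class:
Python3
# Sketch only: a possible variation of run_web_crawler() for the class above.
# max_pages is an assumed parameter, not part of the original code.
def run_web_crawler(self, max_pages=50):
    while len(self.scraped_pages) < max_pages:
        try:
            target_url = self.crawl_queue.get(timeout=60)
            if target_url not in self.scraped_pages:
                self.scraped_pages.add(target_url)
                job = self.pool.submit(self.scrape_page, target_url)
                job.add_done_callback(self.post_scrape_callback)
        except Empty:
            break
        except Exception as e:
            print(e)
            continue
    # Wait for any in-flight jobs to finish, then free the worker threads
    self.pool.shutdown(wait=True)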