In this article, we will learn about web scraping using the Scrapy framework available in Python.
What is web scraping?
Web scraping is used to obtain data from a website with the help of a crawler/scraper. Web scraping comes in handy for extracting data from web pages that do not offer the functionality of an API. In Python, web scraping can be done with the help of various modules, namely Beautiful Soup, Scrapy, and lxml.
Here we will discuss web scraping using the Scrapy module.
For that, we first need to install Scrapy.
Type the following in the terminal or command prompt:
>>> pip install scrapy
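Once the installation completes, the setup can be verified from the same terminal (the exact version number printed will depend on your environment):
>>> scrapy version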
As Scrapy is a framework, we need to run an initializing command:
>>> scrapy startproject tutpts
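This generates a project skeleton roughly like the following (the exact set of files may vary slightly between Scrapy versions):

tutpts/
    scrapy.cfg
    tutpts/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

The spider scripts we write go inside the spiders/ directory.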
Here we create a web crawler/spider to fetch the data out of a website.
To build the crawler, we create a separate script named tutptscraw.py inside the project's spiders/ directory, and in it we declare a class for extracting content. Here we name the web crawler, and using scrapy.Request we fetch the data from the given URL.
Generator functions are used, which yield the fetched data.
Example
import scrapy

class ExtractUrls(scrapy.Spider):
    name = "fetch"

    # generator function
    def start_requests(self):
        # enter the URL
        urls = ['https://www.tutorialspoint.com/index.htm/', ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
Here all the data encapsulated within the anchor tags is fetched using the Request function. As Scrapy is a framework, it also ships with an interactive shell in which we can test these functionalities.
To activate the Scrapy shell, we use the following command:
scrapy shell "https://www.tutorialspoint.com/index.htm/"
Now we fetch data from the anchor tags using selectors, i.e. either CSS or XPath:
response.css('a')
links = response.css('a').extract()
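Since selectors can also be written as XPath expressions, the same href attributes can be pulled with an equivalent query (this assumes the same response object that the shell provides):

response.xpath('//a/@href').extract()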
To get all the links available on the web page, we create a parse method. Scrapy internally filters out requests to previously visited URLs, which avoids fetching the same page twice and reduces the time needed to produce the results.
import scrapy

class ExtractUrls(scrapy.Spider):
    name = "fetch"

    # generator function
    def start_requests(self):
        # enter the URL
        urls = ['https://www.tutorialspoint.com/index.htm/', ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse function
    def parse(self, response):
        title = response.css('title::text').extract_first()
        # get the href of every anchor tag
        links = response.css('a::attr(href)').extract()
        for link in links:
            yield {
                'title': title,
                'links': link
            }
            if 'tutorialspoint' in link:
                yield scrapy.Request(url=link, callback=self.parse)
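To run the spider, we use the crawl command with the name declared in the class ("fetch"); the -o option exports the yielded items to a file, where links.json is just an illustrative file name:

scrapy crawl fetch -o links.json

This command must be run from inside the project directory created by startproject.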
Conclusion
In this tutorial, we learned about the implementation of a web crawler using the Scrapy module in Python.