In this article, we will learn about web scraping using the Scrapy framework available in Python.
What is web scraping?
Web scraping is used to obtain data from a website with the help of a crawler/scraper. Web scraping comes in handy for extracting data from web pages that do not offer the functionality of an API. In Python, web scraping can be done with the help of various modules, namely Beautiful Soup, Scrapy, and lxml.
Here we will discuss web scraping using the Scrapy module.
For that, we first need to install Scrapy.
Type the following in the terminal or command prompt:
>>> pip install scrapy
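Once the installation completes, the setup can be verified from the same terminal (the exact version number printed will depend on your environment):
>>> scrapy version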
As Scrapy is a framework, we need to run an initializing command:
>>> scrapy startproject tutpts
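This generates a project skeleton roughly like the following (the exact set of files may vary slightly between Scrapy versions):

tutpts/
    scrapy.cfg
    tutpts/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

The spider scripts we write go inside the spiders/ directory.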
Here we create a web crawler/spider to fetch the data out of a website.
To build the crawler, we create a separate script named tutptscraw.py inside the project's spiders/ directory, and in it we declare a class for extracting content. Here we name the web crawler, and using scrapy.Request we fetch the data from the given URL.
Generator functions are used, which yield the fetched data.
Example
import scrapy

class ExtractUrls(scrapy.Spider):
    name = "fetch"

    # generator function
    def start_requests(self):
        # enter the URL
        urls = ['https://www.tutorialspoint.com/index.htm/', ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
Here all the data encapsulated within the anchor tags is fetched using the Request function. As Scrapy is a framework, it also ships with an interactive shell in which we can test these functionalities.
To activate the Scrapy shell, we use the following command:
scrapy shell "https://www.tutorialspoint.com/index.htm/"
Now we fetch data from the anchor tags using selectors, i.e. either CSS or XPath:
response.css('a')
links = response.css('a').extract()
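Since selectors can also be written as XPath expressions, the same href attributes can be pulled with an equivalent query (this assumes the same response object that the shell provides):

response.xpath('//a/@href').extract()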
To get all the links available on the web page, we create a parse method. Scrapy internally filters out requests to previously visited URLs, which avoids fetching the same page twice and reduces the time needed to produce the results.
import scrapy

class ExtractUrls(scrapy.Spider):
    name = "fetch"

    # generator function
    def start_requests(self):
        # enter the URL
        urls = ['https://www.tutorialspoint.com/index.htm/', ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse function
    def parse(self, response):
        title = response.css('title::text').extract_first()
        # get the href of every anchor tag
        links = response.css('a::attr(href)').extract()
        for link in links:
            yield {
                'title': title,
                'links': link
            }
            if 'tutorialspoint' in link:
                yield scrapy.Request(url=link, callback=self.parse)
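To run the spider, we use the crawl command with the name declared in the class ("fetch"); the -o option exports the yielded items to a file, where links.json is just an illustrative file name:

scrapy crawl fetch -o links.json

This command must be run from inside the project directory created by startproject.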
Conclusion
In this tutorial, we learned about the implementation of a web crawler using the Scrapy module in Python.