Using Scrapy in PyCharm
We live in a world that relies on data, massive amounts of data, and this data is used in many
areas of business. Much of it is available on the internet for people to read and compare through sites
that specialize in the type of data they’re interested in, but browsing this way isn’t very efficient, not to
mention time-consuming and difficult to reuse in other programs. Web scraping makes extracting the data
you need fast and efficient, saving it in formats that can be used in other programs.
The purpose of this article is to get us up and running with Scrapy quickly. While Scrapy
can handle both CSS and XPath selectors to get the data we want, we’ll be using CSS. The site
we’re going to scrape is ‘Books to Scrape’, using Python, the Web Developer Tools in Firefox,
PyCharm, and the Python package Scrapy.
Create a new project in PyCharm. I’ve named my project ‘scrapingProject’, but you can name it
whatever you like; it will take some time to create. Once the project is created, click on the
Terminal tab and type in pip install scrapy :
Open a new Python file and enter the following:
# Import library
import scrapy

# Create Spider class
class booksToScrape(scrapy.Spider):
    # Name of spider
    name = 'books'

    # Website you want to scrape
    start_urls = [
        'http://books.toscrape.com'
    ]

    # Parses the website
    def parse(self, response):
        pass
We’re going to be scraping the title and price from ‘Books to Scrape’, so let’s open Firefox
and visit the site. Right-click on the title of a book and select ‘Inspect’ from the context
menu.
Inspecting the Website to Be Scraped
Inspecting the site, we see that the title of the book is located in an <a> tag nested
under an <h3> tag. To make sure this will give us all the titles on the page, use the
‘Search’ in the Inspector. We don’t have to use the whole path to get all the titles on the
page; use a[title] in the search. The a identifies the tag, and [title] matches only
<a> elements that have a title attribute. There will be 20 results found on the page, and by
pressing ‘Enter’ you can cycle through all the book titles on this page.
To find out if this selector will work in Scrapy, we’re going to use the Scrapy shell. Go back
to the PyCharm Terminal and enter scrapy shell to bring up the shell; this allows us to
interact directly with the page. Retrieve the web page using
fetch('http://books.toscrape.com'):
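In the shell, we can try the selector with .get(), which returns the first match. Roughly what the session looks like (a sketch with the output trimmed; the exact book depends on the page):

>>> fetch('http://books.toscrape.com')
>>> response.css('a[title]').get()
'<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the Attic</a>'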
Close, but we’re getting only one title, and not just the title but also the catalogue link with it.
We need to tell Scrapy to grab just the title text of all the books on this page. To do this
we’ll use ::text to get the title text and .getall() to get all the books. The new
command is response.css('a[title]::text').getall() :
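Run in the shell, that should produce something like this (a sketch, output trimmed; the titles may differ if the site changes):

>>> response.css('a[title]::text').getall()
['A Light in the Attic', 'Tipping the Velvet', 'Soumission', ...]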
Much better; we now have just the titles from the page. Let’s see if we can make it look
better by using a for loop:
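A minimal sketch of that loop in the shell, printing each title on its own line:

>>> for title in response.css('a[title]::text').getall():
...     print(title)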
That works, so now let’s add it to the spider. Just copy the commands and place them inside
the parse() method:
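The parse() method should now look roughly like this (a sketch; while we develop the spider, the loop just prints each title):

    # Parses the website
    def parse(self, response):
        # Print every book title on the page
        for title in response.css('a[title]::text').getall():
            print(title)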
Crawling 101
Now that we have the titles, we need the prices. Using the same method as before, right-click
on the price and inspect it.
The selector we want for the price of a book is .price_color . Using the previous commands,
we just swap out 'a[title]' for '.price_color' . Using the Scrapy shell we get this:
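A sketch of what the shell should return (output trimmed; prices on the live site may vary):

>>> response.css('.price_color::text').getall()
['£51.77', '£53.74', '£50.10', ...]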
Now that we have the selectors needed to grab just the titles and prices from the page, we need to
find the common element holding them together. While looking at the earlier elements,
you may have noticed that they’re grouped under .product_pod with other attributes.
To separate these elements from the others, we’ll just tweak the code a bit:
for i in response.css('.product_pod'):
    title = i.css('a[title]::text').getall()
    price = i.css('.price_color::text').getall()
    print(title, price)
As you can see, we’re iterating over the tag that the title and price elements are grouped under
and calling their separate selectors on each item. While the print() command will print results to
the terminal screen, they can’t be saved to an output file like .csv or .json. To save the
results to a file you need to use the yield keyword:
yield {
    'Title': title,
    'Price': price
}
Now the spider is ready to crawl the site and grab just the titles and prices. It should look
like this:
# Import library
import scrapy

# Create Spider class
class booksToScrape(scrapy.Spider):
    # Name of spider
    name = 'books'

    # Website you want to scrape
    start_urls = [
        'http://books.toscrape.com'
    ]

    # Parses the website
    def parse(self, response):
        # Book information cell
        for i in response.css('.product_pod'):
            # Attributes
            title = i.css('a[title]::text').getall()
            price = i.css('.price_color::text').getall()
            # Output
            yield {
                'Title': title,
                'Price': price
            }
Let’s crawl the site and see what we get. I’ll be using scrapy crawl books -o
Books.csv from the terminal.
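Note that scrapy crawl only works from inside a project created with scrapy startproject. If you just created a single spider file, you can run that file directly instead; this sketch assumes the file is named booksToScrape.py, so substitute your own file name:

scrapy runspider booksToScrape.py -o Books.csv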
We now have the data we were after and can use it in other programs. Granted, this isn’t
much data; it’s being used to demonstrate how the tool works. You can use this spider to
explore the other elements on the page.
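For example, each .product_pod also carries the star rating as a class on its p.star-rating tag and the stock status in an .availability tag; these selectors are assumptions you should verify in the Inspector first. A sketch of grabbing the rating inside the same loop:

    # Assumed markup: <p class="star-rating Three">; returns 'star-rating Three'
    rating = i.css('p.star-rating::attr(class)').get()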
Conclusion
Scrapy isn’t easy to learn, and many are discouraged by it. I wanted to give those interested in it
a quick way to start using it and see how it works. Scrapy is capable of so much more; I’ve
just scratched the surface with what I wrote about it. To learn more, check the official
documentation.