Using Scrapy in PyCharm
We live in a world that relies on data, massive amounts of data, and this data is used in many
areas of business. Much of it is available on the internet for people to read and compare through sites
that specialize in the type of data they’re interested in, but browsing this way isn’t very efficient, not to
mention time-consuming and difficult to reuse in other programs. Web scraping makes extracting the data
you need fast and efficient, saving it in formats that can be used in other programs.
The purpose of this article is to get us up and running with Scrapy quickly. While Scrapy
can handle both CSS and XPath selectors to get the data we want, we’ll be using CSS. The site
we’re going to scrape is ‘Books to Scrape’, using Python, the Web Developer Tools in Firefox,
PyCharm, and the Python package Scrapy.
Create a new project in PyCharm. I’ve named my project ‘scrapingProject’, but you can name it
whatever you like; it will take some time to create. Once the project is created, click on the
Terminal tab and type in pip install scrapy :
Open a new Python file and enter the following:
# Import library
import scrapy

# Create Spider class
class booksToScrape(scrapy.Spider):
    # Name of spider
    name = 'books'

    # Website you want to scrape
    start_urls = [
        'http://books.toscrape.com'
    ]

    # Parses the website
    def parse(self, response):
        pass
We’re going to be scraping the title and price from ‘Books to Scrape’, so let’s open Firefox
and visit the site. Right-click on the title of a book and select ‘Inspect’ from the context
menu.
Inspecting the Website to Be Scraped
Inspecting the site, we see that the title of the book is located in an <a> tag nested
under an <h3> tag. To make sure this will give us all the titles on the page, use the
‘Search’ in the Inspector. We don’t have to use the whole path to get all the titles on the
page; use a[title] in the search. The a identifies the tag, and [title] matches only
<a> elements that have a title attribute. There will be 20 results found on the page, and by
pressing ‘Enter’ you can cycle through all the book titles on this page.
To find out if this selector will work in Scrapy, we’re going to use the Scrapy shell. Go back
to the PyCharm Terminal and enter scrapy shell to bring up the shell; this allows us to
interact directly with the page. Retrieve the web page using
fetch('http://books.toscrape.com'):
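In the shell, we can try the selector with .get(), which returns the first match. Roughly what the session looks like (a sketch with the output trimmed; the exact book depends on the page):

>>> fetch('http://books.toscrape.com')
>>> response.css('a[title]').get()
'<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the Attic</a>'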
Close, but we’re getting only one title, and not just the title but also the catalogue link with it.
We need to tell Scrapy to grab just the title text of all the books on this page. To do this
we’ll use ::text to get the title text and .getall() to get all the books. The new
command is response.css('a[title]::text').getall() :
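Run in the shell, that should produce something like this (a sketch, output trimmed; the titles may differ if the site changes):

>>> response.css('a[title]::text').getall()
['A Light in the Attic', 'Tipping the Velvet', 'Soumission', ...]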
Much better; we now have just the titles from the page. Let’s see if we can make it look
better by using a for loop:
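A minimal sketch of that loop in the shell, printing each title on its own line:

>>> for title in response.css('a[title]::text').getall():
...     print(title)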
That works, so now let’s add it to the spider. Just copy the commands and place them inside
the parse() method:
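The parse() method should now look roughly like this (a sketch; while we develop the spider, the loop just prints each title):

    # Parses the website
    def parse(self, response):
        # Print every book title on the page
        for title in response.css('a[title]::text').getall():
            print(title)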
Crawling 101
Now that we have the titles, we need the prices. Using the same method as before, right-click
on the price and inspect it.
The selector we want for the price of a book is .price_color . Using the previous commands,
we just swap out 'a[title]' for '.price_color' . Using the Scrapy shell we get this:
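A sketch of what the shell should return (output trimmed; prices on the live site may vary):

>>> response.css('.price_color::text').getall()
['£51.77', '£53.74', '£50.10', ...]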
Now that we have the selectors needed to grab just the titles and prices from the page, we need to
find the common element holding them together. While looking at the earlier elements,
you may have noticed that they’re grouped under .product_pod with other attributes.
To separate these elements from the others, we’ll just tweak the code a bit:
for i in response.css('.product_pod'):
    title = i.css('a[title]::text').getall()
    price = i.css('.price_color::text').getall()
    print(title, price)
As you can see, we’re iterating over the tag that the title and price elements are grouped under
and calling their separate selectors on each item. While the print() command will print results to
the terminal screen, they can’t be saved to an output file like .csv or .json. To save the
results to a file you need to use the yield keyword:
yield {
    'Title': title,
    'Price': price
}
Now the spider is ready to crawl the site and grab just the titles and prices. It should look
like this:
# Import library
import scrapy

# Create Spider class
class booksToScrape(scrapy.Spider):
    # Name of spider
    name = 'books'

    # Website you want to scrape
    start_urls = [
        'http://books.toscrape.com'
    ]

    # Parses the website
    def parse(self, response):
        # Book information cell
        for i in response.css('.product_pod'):
            # Attributes
            title = i.css('a[title]::text').getall()
            price = i.css('.price_color::text').getall()
            # Output
            yield {
                'Title': title,
                'Price': price
            }
Let’s crawl the site and see what we get. I’ll be using scrapy crawl books -o
Books.csv from the terminal.
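Note that scrapy crawl only works from inside a project created with scrapy startproject. If you just created a single spider file, you can run that file directly instead; this sketch assumes the file is named booksToScrape.py, so substitute your own file name:

scrapy runspider booksToScrape.py -o Books.csv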
We now have the data we were after and can use it in other programs. Granted, this isn’t
much data; it’s being used to demonstrate how the tool works. You can use this spider to
explore the other elements on the page.
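For example, each .product_pod also carries the star rating as a class on its p.star-rating tag and the stock status in an .availability tag; these selectors are assumptions you should verify in the Inspector first. A sketch of grabbing the rating inside the same loop:

    # Assumed markup: <p class="star-rating Three">; returns 'star-rating Three'
    rating = i.css('p.star-rating::attr(class)').get()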
Conclusion
Scrapy isn’t easy to learn, and many are discouraged by it. I wanted to give those interested in it
a quick way to start using it and see how it works. Scrapy is capable of so much more; I’ve
just scratched the surface with what I wrote about it. To learn more, check the official
documentation.