Scrapy Beginners Series Part 1 - First Scrapy Spider - ScrapeOps
There are lots of articles online showing you how to make your first basic Scrapy spider. However, there
are very few that walk you through the full process of building a production-ready Scrapy spider.
To address this, we are doing a 5-Part Scrapy Beginner Guide Series, where we're going to build a
Scrapy project end-to-end, from building the scrapers to deploying them on a server and running them every day.
Part 1: Basic Scrapy Spider - We will go over the basics of Scrapy, and build our first Scrapy spider.
(This Tutorial)
Part 2: Cleaning Dirty Data & Dealing With Edge Cases - Web data can be messy, unstructured,
and have lots of edge cases. In this tutorial we will make our spider robust to these edge cases,
using Items, Itemloaders and Item Pipelines. (Part 2)
Part 3: Storing Our Data - There are many different ways we can store the data that we scrape, from
databases and CSV files to JSON format and S3 buckets. We will explore several different ways we
can store the data and talk about their pros, cons and in which situations you would use them.
(Part 3)
Part 4: User Agents & Proxies - Make our spider production ready by managing our user agents &
IPs so we don't get blocked. (Part 4)
Part 5: Deployment, Scheduling & Running Jobs - Deploying our spider on a server, and monitoring
and scheduling jobs via ScrapeOps. (Part 5)
For this beginner series, we're going to be using one of the simplest scraping architectures. A single
spider is given a start URL, then crawls the site, parses and cleans the data from the HTML responses,
and stores the data, all in the same process.
This architecture is suitable for the majority of hobby and small scraping projects. However, if you are
scraping business-critical data at larger scales then we would use different scraping architectures. We
will cover these in other Scrapy series.
If you prefer video tutorials, then check out the video version of this article.
For this series, we will be scraping the products from Chocolate.co.uk as it will be a good example of
how to approach scraping an e-commerce store. Plus, who doesn't like chocolate!
What Is Scrapy?
Developed by the co-founders of Zyte, Pablo Hoffman and Shane Evans, Scrapy is a Python framework
specifically designed for web scraping.
Using Scrapy you can easily build highly scalable scrapers that will retrieve a page's HTML, parse and
process the data, and store it in the file format and location of your choice.
There are other Python libraries also used for web scraping:
Python Requests/BeautifulSoup: Good for small-scale web scraping where the data is returned in
the HTML response. You would need to build your own spider management functionality to manage
concurrency, retries, data cleaning and data storage.
Python Requests-HTML: Combining Python requests with a parsing library, Requests-HTML is a middle-
ground between the Python Requests/BeautifulSoup combo and Scrapy.
Python Selenium: Use if you are scraping a site that only returns the target data after the JavaScript
has rendered, or if you need to interact with page elements to get the data.
Python Scrapy has lots more functionality and is great for large-scale scraping right out of the box.
You just need to customise it in your settings file or add in one of the many Scrapy extensions and
middlewares that developers have open sourced.
The learning curve is initially steeper than using the Python Requests/BeautifulSoup combo, however, it
will save you a lot of time in the long run when deploying production scrapers and scraping at scale.
Depending on the operating system of your machine these commands will be slightly different.
MacOS or Linux
First, navigate to your project folder, then create and activate a virtual environment for the project:
$ cd /scrapy_tutorials
$ python3 -m venv venv
$ source venv/bin/activate
Windows
Install virtualenv in your Windows command shell, PowerShell, or other terminal you are using:

pip install virtualenv

Then navigate to the folder where you want to create the virtual environment, create it, and activate it:

cd /scrapy_tutorials
virtualenv venv
venv\Scripts\activate
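With your virtual environment activated, install Scrapy itself so the commands below will work:

pip install scrapy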
To make sure everything is working, type the command scrapy into your command line. You should get
an output like this:
$ scrapy

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider
The first thing we need to do is create our Scrapy project. This project will hold all the code for our
scrapers.
So in this case, as we're going to be scraping a chocolate website, we will call our project
chocolatescraper . But you can use any project name you would like.
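We can create the project with Scrapy's startproject command:

scrapy startproject chocolatescraper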
To help us understand what we've just done, and how Scrapy structures its projects, we're going to pause
for a second.
First, we're going to see what the scrapy startproject chocolatescraper command just did. Enter the following
commands into your command line:
$ cd /chocolatescraper
$ tree
├── scrapy.cfg
└── chocolatescraper
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
└── __init__.py
We won't be using most of these files in this beginner project, but we will give a quick explanation of
each one, as they all have a special purpose:
settings.py is where all your project settings are contained, like activating pipelines, middlewares,
etc. Here you can change the delays, concurrency, and lots more things.
items.py is a model for the extracted data. You can define a custom model (like a ProductItem)
that will inherit the Scrapy Item class and contain your scraped data (see the sketch after this list).
pipelines.py is where the item yielded by the spider gets passed. It's mostly used to clean the text
and connect to file outputs or databases (CSV, JSON, SQL, etc).
middlewares.py is useful when you want to modify how the request is made and how Scrapy handles
the response.
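For example, here is a minimal sketch of the kind of model you could define in items.py (a hypothetical ProductItem; we won't actually use Items until Part 2 of this series):

import scrapy

class ProductItem(scrapy.Item):
    # fields we plan to scrape for each product
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()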
Okay, we’ve created the general project structure. Now, we’re going to create our spider that will do the
scraping.
Scrapy provides a number of different spider types; in this tutorial we will cover the most
common one, the generic Spider. Here are some of the spider types available:
Spider - Takes a list of start_urls and scrapes each one with a parse method.
CrawlSpider - Designed to crawl a full website by following any links it finds.
SitemapSpider - Designed to extract URLs from a sitemap.
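To create our spider, we run Scrapy's genspider command from inside the project folder, giving it the name of the spider and the domain we will be scraping (both of which you can see reflected in the template below):

scrapy genspider chocolatespider chocolate.co.uk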
A new spider will now have been added to your spiders folder, and it should look like this:
import scrapy


class ChocolatespiderSpider(scrapy.Spider):
    name = 'chocolatespider'
    allowed_domains = ['chocolate.co.uk']
    start_urls = ['http://chocolate.co.uk/']

    def parse(self, response):
        pass
Here we see that the genspider command has created a template spider for us to use in the form of a
Spider class. This spider class contains:
name - a class attribute that gives a name to the spider. We will use this when running our spider
later: scrapy crawl <spider_name> .
allowed_domains - a class attribute that tells Scrapy that it should only ever scrape pages of the
chocolate.co.uk domain. This prevents the spider from going rogue and scraping lots of websites. This
is optional.
start_urls - a class attribute that tells Scrapy the first url it should scrape. We will be changing this
in a bit.
parse - the parse function is called after a response has been received from the target website.
Next, we will update the start_urls to point at the page we actually want to scrape, the chocolate.co.uk products page:

import scrapy


class ChocolatespiderSpider(scrapy.Spider):
    name = 'chocolatespider'
    allowed_domains = ['chocolate.co.uk']
    start_urls = ['https://www.chocolate.co.uk/collections/all']
Next, we need to create our CSS selectors to parse the data we want from the page. To do this, we will
use Scrapy Shell.
One of the great features of Scrapy is that it comes with a built-in shell that allows you to quickly test
and debug your XPath & CSS selectors. Instead of having to run your full scraper to see if your XPath or
CSS selectors are correct, you can enter them directly into your terminal and see the result.
To open the Scrapy shell, enter the following into your command line:

scrapy shell
Note: If you would like to use IPython as your Scrapy shell (much more powerful and provides smart
auto-completion and colorized output), then make sure you have IPython installed:
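pip install ipython

Then tell Scrapy to use IPython by setting shell to ipython in your scrapy.cfg file: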
## scrapy.cfg
[settings]
default = chocolatescraper.settings
shell = ipython
With our Scrapy shell open, you should see something like this:
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

In [1]:
To create our CSS selectors we will be testing them on the following page:
https://www.chocolate.co.uk/collections/all
The first thing we want to do is fetch the main products page of the chocolate site in our Scrapy shell.
fetch('https://www.chocolate.co.uk/collections/all')
In [1]: fetch('https://www.chocolate.co.uk/collections/all')
2021-12-22 13:28:56 [scrapy.core.engine] INFO: Spider opened
2021-12-22 13:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.chocolate.co.uk/robots.txt> (referer: None)
2021-12-22 13:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.chocolate.co.uk/collections/all> (referer: None)
As we can see, we successfully retrieved the page from chocolate.co.uk , and Scrapy shell has
automatically saved the HTML response in the response variable.
In [2]: response
Out[2]: <200 https://www.chocolate.co.uk/collections/all>
Find Product CSS Selectors
To find the correct CSS selectors to parse the product details we will first open the page in our browser's
DevTools.
Open the website, then open the developer tools console (right click on the page and click inspect).
Using the inspect element, hover over a product and look at the ids and classes on the individual
products.
In this case we can see that each box of chocolates has its own special component called
product-item . We can just use this to reference our products.
Now using our Scrapy shell we can see if we can extract the product information using this class.
response.css('product-item')
We can see that it has found all the elements that match this selector.
In [3]: response.css('product-item')
Out[3]:
[<Selector xpath='descendant-or-self::product-item' data='<product-item class="product-item pro...'>,
 <Selector xpath='descendant-or-self::product-item' data='<product-item class="product-item " r...'>,
 <Selector xpath='descendant-or-self::product-item' data='<product-item class="product-item " r...'>,
 <Selector xpath='descendant-or-self::product-item' data='<product-item class="product-item " r...'>,
 <Selector xpath='descendant-or-self::product-item' data='<product-item class="product-item " r...'>,
 <Selector xpath='descendant-or-self::product-item' data='<product-item class="product-item pro...'>,
 ...
To just get the first product we use .get() appended to the end of the command.
response.css('product-item').get()
This returns all the HTML in this node of the DOM tree.
In [4]: response.css('product-item').get()
Out[4]: '<product-item class="product-item product-item--sold-out" reveal><div class="product-item__image-wrapper product-item__image-wrapper--multiple"><div class="product-item__label-list label-list"><span class="label label--custom">New</span> <span class="label label--subdued">Sold out</span></div><a href="/products/100-dark-hot-chocolate-flakes" class="product-item__aspect-ratio aspect-ratio " style="padding-bottom: 100.0%; --aspect-ratio: 1.0">\n
...
Now that we have found the DOM node that contains the product items, we will get all of them, save
them into a products variable, and then loop through the items and extract the data we need.
products = response.css('product-item')
The products variable is now a list of all the products on the page.
We can check the length of the products variable to see how many products there are.
len(products)
In [6]: len(products)
Out[6]: 24
Now let's extract the name, price and url of each product from the list of products.
The products variable is a list of products. When we update our spider code, we will loop through this
list, however, to find the correct selectors we will test the CSS selectors on the first element of the list,
products[0] .
Product Name - To get the name of the first product, we select the title element's text:

product = products[0]
product.css('a.product-item-meta__title::text').get()
In [5]: product.css('a.product-item-meta__title::text').get()
Out[5]: '100% Dark Hot Chocolate Flakes'
Product Price - Next, let's get the product price:

product.css('span.price').get()
You can see that the data returned for the price has lots of extra HTML. We'll get rid of this in the next
step.
In [6]: product.css('span.price').get()
Out[6]: '<span class="price">\n              <span class="visually-hidden">Sale price</span>£8.50</span>'
To remove the extra span tags from our price we can use the .replace() method. The replace
method can be useful when we need to clean up data.
Here we're going to replace the <span> sections with empty quotes '' :
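product.css('span.price').get().replace('<span class="price">\n              <span class="visually-hidden">Sale price</span>','').replace('</span>','')

This is the same cleaning chain we will use in the spider below. Chaining two .replace() calls strips the surrounding <span> tags and leaves just the price text (the exact whitespace inside the first string has to match the page's HTML, so treat the spacing here as approximate).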
Product URL - Next, let's see how we can extract the product URL for each individual product. To do that
we can use the attrib attribute on the end of product.css('div.product-item-meta a') :
product.css('div.product-item-meta a').attrib['href']
Updated Spider
Now, that we've found the correct CSS selectors let's update our spider. Exit Scrapy shell with the
exit() command.
Then update your spider so it looks like this:

import scrapy


class ChocolatespiderSpider(scrapy.Spider):
    name = 'chocolatespider'
    allowed_domains = ['chocolate.co.uk']
    start_urls = ['https://www.chocolate.co.uk/collections/all']

    def parse(self, response):
        # here we are looping through the products and extracting the name, price & url
        products = response.css('product-item')
        for product in products:
            # here we put the data returned into the format we want to output for our csv or json file
            yield {
                'name': product.css('a.product-item-meta__title::text').get(),
                'price': product.css('span.price').get().replace('<span class="price">\n              <span class="visually-hidden">Sale price</span>','').replace('</span>',''),
                'url': product.css('div.product-item-meta a').attrib['href'],
            }
Now that we have a spider we can run it by going to the top level of our Scrapy project and running the
following command:
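scrapy crawl chocolatespider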
It will run, and you should see the logs on your screen. Here are the final stats:
'item_scraped_count': 24,
'log_count/DEBUG': 26,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 12, 22, 13, 43, 54, 142916)}
2021-12-22 14:43:54 [scrapy.core.engine] INFO: Spider closed (finished)
We can see from the above stats that our spider scraped 24 Items: 'item_scraped_count': 24 .
If we want to save the data to a JSON file we can use the -O option, followed by the name of the file.
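For example, to save the scraped data to a JSON file (the filename here is just an example):

scrapy crawl chocolatespider -O myscrapeddata.json

The uppercase -O overwrites any existing file, while the lowercase -o appends to it.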
The next logical step is to go to the next page, if there is one, and scrape the item data from that too!
Here's how we do that.
First, let's open our Scrapy shell again, fetch the page and find the correct selector to get the next page
button.
scrapy shell
fetch('https://www.chocolate.co.uk/collections/all')
And then get the href attribute that contains the url to the next page.
response.css('[rel="next"] ::attr(href)').get()
Now, we just need to update our spider to request this page after it has parsed all items from a page.
import scrapy


class ChocolateSpider(scrapy.Spider):
    name = 'chocolatespider'
    start_urls = ['https://www.chocolate.co.uk/collections/all']

    def parse(self, response):
        products = response.css('product-item')
        for product in products:
            # here we put the data returned into the format we want to output for our csv or json file
            yield {
                'name': product.css('a.product-item-meta__title::text').get(),
                'price': product.css('span.price').get().replace('<span class="price">\n              <span class="visually-hidden">Sale price</span>','').replace('</span>',''),
                'url': product.css('div.product-item-meta a').attrib['href'],
            }

        # get the href of the next page button and, if it exists, request that page too
        next_page = response.css('[rel="next"] ::attr(href)').get()
        if next_page is not None:
            next_page_url = 'https://www.chocolate.co.uk' + next_page
            yield response.follow(next_page_url, callback=self.parse)
Here we see that our spider now finds the URL of the next page and, if it isn't None, appends it to the
base URL and makes another request.
Now in our Scrapy stats we see that we have scraped 5 pages, and extracted 73 items:
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2021, 12, 22, 14, 10, 42, 621084)}
2021-12-22 15:10:45 [scrapy.core.engine] INFO: Spider closed (finished)
Next Steps
We hope you now have enough of the basics to get up and running scraping a simple ecommerce site with
the above tutorial.
If you would like the code from this example, please check it out on GitHub here!
In Part 2 of the series we will work on Cleaning Dirty Data & Dealing With Edge Cases. Web data can be
messy, unstructured, and have lots of edge cases, so we will make our spider robust to these edge cases,
using Items, Itemloaders and Item Pipelines.
Need a Free Proxy? Then check out our Proxy Comparison Tool that allows you to compare the pricing,
features and limits of every proxy provider on the market so you can find the one that best suits your
needs. Including the best free plans.