Scrapy Beginners Series Part 3 - Storing Data With Scrapy - ScrapeOps
In Part 3 we explore how to save scraped data in the files and formats that cover most common use
cases. We'll look at how to save the data to a CSV or JSON file, as well as how to save it to a
database or an S3 bucket.
Part 1: Basic Scrapy Spider - We will go over the basics of Scrapy, and build our first Scrapy spider.
(Part 1)
Part 2: Cleaning Dirty Data & Dealing With Edge Cases - Web data can be messy, unstructured,
and have lots of edge cases. In this tutorial we will make our spider robust to these edge cases,
using Items, Item Loaders and Item Pipelines. (Part 2)
Part 3: Storing Our Data - There are many different ways we can store the data that we scrape, from
databases and CSV files to JSON files and S3 buckets. We will explore several of them and talk about
their pros and cons, and the situations in which you would use each.
(This Tutorial)
Part 4: User Agents & Proxies - Make our spider production ready by managing our user agents &
IPs so we don't get blocked. (Part 4)
https://scrapeops.io/python-scrapy-playbook/scrapy-beginners-guide-storing-data/ 1/9
07/07/2024, 16:17 Scrapy Beginners Series Part 3 - Storing Data With Scrapy | ScrapeOps
Part 5: Deployment, Scheduling & Running Jobs - Deploying our spider on a server, and monitoring
and scheduling jobs via ScrapeOps. (Part 5)
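In this tutorial, Part 3: Storing Data With Scrapy, we're going to cover Scrapy Feed Exporters, saving
data to a JSON or CSV file, saving data to Amazon S3 storage, and saving data to MySQL and PostgreSQL
databases.
With the intro out of the way let's get down to business.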
Scrapy already has a way to save the data in several different formats. Scrapy calls these
ready-to-go export methods Feed Exporters.
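Out of the box Scrapy can save/export the scraped data in several formats, including JSON, JSON Lines,
CSV and XML.
The generated files can then be saved to the following places using a Feed Exporter: the local
filesystem of the machine the scraper is running on, an FTP server, Amazon S3, Google Cloud Storage,
or standard output.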
In this guide we're going to give examples of how you can use Feed Exporters to store your data in
different file formats and locations. However, there are many more ways you can store data with
Scrapy.
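To save the data in the simplest way for a one-off job, pass the -o option followed by an output file
name to the scrapy crawl command, for example scrapy crawl <spider_name> -o <file_name>.csv for a CSV
file, or the same command with a .json file name for a JSON file.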
You can also decide whether to overwrite or append the data to the output file.
When using the crawl or runspider commands, the lowercase -o option appends new data to an existing
output file, while the uppercase -O option overwrites it. (Be sure to remember the difference, as the
two are easy to mix up!)
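The same choice can also be made in the project settings instead of on the command line. The following is a minimal sketch of a FEEDS setting, assuming a Scrapy version that supports the overwrite feed option; the output file names are illustrative placeholders, not names from this series.

```python
# settings.py (sketch): each key is an output location, each value its options.
# The file names are placeholders chosen for this example.
FEEDS = {
    # Rewritten from scratch on every run (the equivalent of -O)
    "chocolate_data.csv": {"format": "csv", "overwrite": True},
    # Appended to on every run (the equivalent of -o); JSON Lines is used here
    # because appending to a plain .json file would not leave valid JSON
    "chocolate_data.jsonl": {"format": "jsonlines", "overwrite": False},
}
```

With a setting like this in place, running the spider without -o or -O should write both files on each run.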
You can check out how to set up an S3 bucket with Amazon here:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/setting-up-s3.html
First we need to install botocore, an external Python library created by Amazon to help with
connecting to S3. It can be installed with pip install botocore.
Now that we have that installed, we can save the file to S3 by passing the URI of your Amazon S3
bucket as the output location, in the form s3://aws_key:aws_secret@<bucket_name>/<file_path>.csv:csv.
You will need to replace aws_key and aws_secret with your own Amazon key and secret, and put in your
own bucket name and file path. The :csv at the end specifies the format; this could also be :json
or :xml .
You can also save the aws_key & aws_secret in your project settings file:
AWS_ACCESS_KEY_ID = 'myaccesskeyhere'
AWS_SECRET_ACCESS_KEY = 'mysecretkeyhere'
Note: When saving data with this method the AWS S3 Feed Exporter uses delayed file delivery. This
means that the file is first temporarily saved locally to the machine the scraper is running on and then
it's uploaded to AWS once the spider has completed the job.
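Once the credentials live in the settings file, the S3 location can be declared there too, so the key and secret never appear on the command line. This is a minimal sketch; the bucket name and file path are placeholders, and the credential values are the dummy ones shown above.

```python
# settings.py (sketch): credentials plus an S3 feed, with placeholder values
AWS_ACCESS_KEY_ID = "myaccesskeyhere"
AWS_SECRET_ACCESS_KEY = "mysecretkeyhere"

FEEDS = {
    # "mybucket" and the path are hypothetical; use your own bucket and path.
    # No key or secret is needed in the URI because they are set above.
    "s3://mybucket/path/to/chocolate_data.csv": {"format": "csv"},
}
```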
Here we'll show you how to save the data to MySQL and PostgreSQL databases. To do this we'll be using
Item Pipelines again.
For this we are presuming that you already have a database set up called chocolate_scraping .
For more information on setting up a MySQL or Postgres database, check out the official MySQL and
PostgreSQL documentation.
To save the data to the databases we're again going to be using the Item Pipelines. If you don't know
what they are please check out part 2 of this series where we go through how to use Scrapy Item
Pipelines!
The first step in our new Item Pipeline class, as you may expect, is to connect to our MySQL database
and the table in which we will be storing our scraped data.
If you already have MySQL installed on your computer, you might only need the connector package,
which can be installed with pip install mysql-connector-python.
Then create an Item Pipeline in our pipelines.py file that will connect to the database.
import mysql.connector


class SavingToMySQLPipeline(object):

    def __init__(self):
        self.create_connection()

    def create_connection(self):
        # Connect to the local chocolate_scraping database
        self.connection = mysql.connector.connect(
            host = 'localhost',
            user = 'root',
            password = '123456',
            database = 'chocolate_scraping'
        )
        # Cursor used to run SQL statements
        self.curr = self.connection.cursor()
Now that we are connecting to the database, for the next part we need to save each chocolate product
we scrape into our database item by item as they are processed by Scrapy.
To do that we will use Scrapy's process_item() method (which runs after each item is scraped)
and then create a new method called store_in_db in which we will run the MySQL command to store
the Item data into our chocolate_products table.
import mysql.connector


class SavingToMySQLPipeline(object):

    def __init__(self):
        self.create_connection()

    def create_connection(self):
        self.connection = mysql.connector.connect(
            host = 'localhost',
            user = 'root',
            password = '123456',
            database = 'chocolate_scraping'
        )
        self.curr = self.connection.cursor()
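The listing above only shows the connection. As a rough sketch of how process_item() and store_in_db could be added to the same class: the column names and item fields (name, price, url) are assumptions made for illustration, so match them to your own table and Item. The driver import is moved inside create_connection here only so the sketch can be loaded on a machine without the driver installed.

```python
class SavingToMySQLPipeline:

    def __init__(self):
        self.create_connection()

    def create_connection(self):
        # Imported here (not at module level) only so this sketch loads
        # even where mysql-connector-python is not installed
        import mysql.connector
        self.connection = mysql.connector.connect(
            host="localhost",
            user="root",
            password="123456",
            database="chocolate_scraping",
        )
        self.curr = self.connection.cursor()

    def process_item(self, item, spider):
        # Scrapy calls this once for every item the spider yields
        self.store_in_db(item)
        # Return the item so any later pipelines still receive it
        return item

    def store_in_db(self, item):
        # Assumption: the table has name, price and url columns and the
        # item carries fields with the same names
        self.curr.execute(
            "INSERT INTO chocolate_products (name, price, url) VALUES (%s, %s, %s)",
            (item["name"], item["price"], item["url"]),
        )
        # Save the insert permanently
        self.connection.commit()
```

Passing the values as a separate tuple, rather than formatting them into the SQL string, lets the driver handle quoting and escaping.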
Before trying to run our pipeline we mustn't forget to add the pipeline to our ITEM_PIPELINES in our
project settings.py file.
ITEM_PIPELINES = {
    'chocolatescraper.pipelines.PriceToUSDPipeline': 100,
    'chocolatescraper.pipelines.DuplicatesPipeline': 200,
    'chocolatescraper.pipelines.SavingToMySQLPipeline': 300,
}
To save the data to a PostgreSQL database the main thing we need to do is to update how the
connection is created. To do so we will install the Python package psycopg2 (pip install psycopg2).
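import psycopg2


class SavingToPostgresPipeline(object):

    def __init__(self):
        self.create_connection()

    def create_connection(self):
        self.connection = psycopg2.connect(
            host="localhost",
            database="chocolate_scraping",
            user="root",
            password="123456")
        self.curr = self.connection.cursor()

    def process_item(self, item, spider):
        self.store_in_db(item)
        # Return the item so any later pipelines still receive it
        return item

    def store_in_db(self, item):
        # Assumption: the column and item field names (name, price, url) are
        # placeholders; match them to your own table and Item
        try:
            self.curr.execute(
                "insert into chocolate_products (name, price, url) values (%s, %s, %s)",
                (item["name"], item["price"], item["url"])
            )
        except Exception as e:
            print(e)
        # Commit outside the except clause so successful inserts are saved
        self.connection.commit()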
Again before trying to run our pipeline we mustn't forget to add the pipeline to our ITEM_PIPELINES in
our project settings.py file.
ITEM_PIPELINES = {
    'chocolatescraper.pipelines.PriceToUSDPipeline': 100,
    'chocolatescraper.pipelines.DuplicatesPipeline': 200,
    'chocolatescraper.pipelines.SavingToPostgresPipeline': 300,
}
After running our spider again we should be able to see the data in our database if, after logging
into the database, we run a simple select command such as select * from chocolate_products;
Next Steps
We hope you now have a good understanding of how to save the data you've scraped into the file or
database you need! If you have any questions leave them in the comments below and we'll do our best
to help out!
If you would like the code from this example please check it out on GitHub.
The next tutorial covers how to make our spider production ready by managing our user agents & IPs so
we don't get blocked. (Part 4)
Need a Free Proxy? Then check out our Proxy Comparison Tool, which allows you to compare the pricing,
features and limits of every proxy provider on the market, including the best free plans, so you can
find the one that best suits your needs.
Comment from AdmiralLuke (a year ago):
I was having trouble inserting the scraped data into the postgresql database table with this code since I copied and
pasted it. I realized the problem was "self.connection.commit()" was within the exception clause when it shouldn't
be. Unindenting it solved the problem and everything worked after.