Sequentially Crawl Website Using Scrapy

Asked 6 years, 5 months ago · Active 6 years, 5 months ago · Viewed 888 times
Is there a way to tell Scrapy to stop crawling based on a condition met in a second-level page? Right now I am using CloseSpider() to accomplish this, but the problem is that the URLs to be parsed are already queued by the time I start crawling the second-level pages, and I do not know how to remove them from the queue. Is there a way to sequentially crawl the list of links and then be able to stop in parseDetailPage? I am doing the following:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    # WPUtil, WoodPeckerItem, self.end_url and parseDetailPage are defined
    # elsewhere in the project; the class wrapper is restored for context.
    class WoodPeckerSpider(BaseSpider):
        name = "woodpecker"
        start_urls = ["http://sfbay.craigslist.org/sof/"]

        def __init__(self):
            self.job_in_range = True

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            results = hxs.select('//blockquote[@id="toc_rows"]')
            items = []
            if results:
                links = results.select('.//p[@class="row"]/a/@href')
                for link in links:
                    nextUrl = link.extract()
                    # Stop scheduling once the cut-off URL is reached
                    if nextUrl == self.end_url:
                        break
                    if WPUtil.validateUrl(nextUrl):
                        item = WoodPeckerItem()
                        item['url'] = nextUrl
                        # Pass the item along to the detail-page callback
                        items.append(Request(nextUrl, meta={'item': item},
                                             callback=self.parseDetailPage))
            else:
                self.log('Could not parse the document')
            return items
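Note that raising CloseSpider() triggers a graceful shutdown: requests already handed to the downloader still complete, which with high concurrency can look as if the queue is being crawled regardless. Lowering concurrency narrows that window. A minimal settings sketch using standard Scrapy settings (the values are illustrative):

    # settings.py
    CONCURRENT_REQUESTS = 1             # hand one request to the downloader at a time
    CONCURRENT_REQUESTS_PER_DOMAIN = 1  # keep the per-domain limit consistent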
Tagged: scrapy
asked Feb 19 '13 at 3:46, edited Feb 19 '13 at 23:48 – Praveer
Do you mean that you would like to stop the spider and later resume it without re-parsing the URLs that have already been parsed? If so, try the JOBDIR setting, which persists the request queue to a directory on disk.
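For reference, a minimal sketch of how the setting is typically used; the spider name and directory are illustrative:

    # Run the crawl with persistence enabled
    scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1

    # Stop with Ctrl-C (once), then resume later with the same command
    scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1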
answered Feb 22 '13 at 6:55 – Java Xu
I want to stop crawling entirely once a condition is satisfied in the parseDetail page, not resume it. The problem I am facing is that the queue already has a bunch of URLs which Scrapy is going to crawl irrespective of raising CloseSpider. – Praveer Feb 25 '13 at 20:18
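One way to get the behavior the comment asks for is to schedule the detail requests one at a time instead of all at once in parse: each callback yields the next request only after the current page passes the check, so the scheduler queue stays empty and CloseSpider stops the crawl immediately. A minimal sketch using the newer Scrapy API (response.follow); should_stop() stands in for the asker's actual condition:

    import scrapy
    from scrapy.exceptions import CloseSpider

    class SequentialSpider(scrapy.Spider):
        name = "sequential"
        start_urls = ["http://sfbay.craigslist.org/sof/"]

        def parse(self, response):
            # Collect every listing link up front, but request only the first;
            # the rest travel along in meta and are scheduled one by one.
            links = response.xpath(
                '//blockquote[@id="toc_rows"]//p[@class="row"]/a/@href').getall()
            if links:
                yield response.follow(links[0], callback=self.parse_detail,
                                      meta={'pending': links[1:]})

        def parse_detail(self, response):
            if self.should_stop(response):
                # Nothing else is queued, so the spider halts right here.
                raise CloseSpider('stop condition met on detail page')
            yield {'url': response.url}
            pending = response.meta['pending']
            if pending:
                # Chain to the next listing only after this one succeeded.
                yield response.follow(pending[0], callback=self.parse_detail,
                                      meta={'pending': pending[1:]})

        def should_stop(self, response):
            # Placeholder for the real condition (e.g. a posting-date check).
            return False

The trade-off is throughput: the crawl is strictly serial, and an errback would be needed to keep the chain alive if a detail request fails.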