Sequentially Crawl Website Using Scrapy

Asked 6 years, 5 months ago · Active 6 years, 5 months ago · Viewed 888 times
Is there a way to tell Scrapy to stop crawling based on a condition met in a second-level page? Right now I am using CloseSpider() to accomplish this, but the problem is that the URLs to be parsed are already queued by the time I start crawling the second-level pages, and I do not know how to remove them from the queue. Is there a way to sequentially crawl the list of links and then be able to stop in parseDetailPage? I am doing the following:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    # WPUtil, WoodPeckerItem, self.end_url and parseDetailPage are defined
    # elsewhere in the project; the class wrapper is restored for context.
    class WoodPeckerSpider(BaseSpider):
        name = "woodpecker"
        start_urls = ["http://sfbay.craigslist.org/sof/"]

        def __init__(self):
            self.job_in_range = True

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            results = hxs.select('//blockquote[@id="toc_rows"]')
            items = []
            if results:
                links = results.select('.//p[@class="row"]/a/@href')
                for link in links:
                    nextUrl = link.extract()
                    # Stop scheduling once the cut-off URL is reached
                    if nextUrl == self.end_url:
                        break
                    if WPUtil.validateUrl(nextUrl):
                        item = WoodPeckerItem()
                        item['url'] = nextUrl
                        # Pass the item along to the detail-page callback
                        items.append(Request(nextUrl, meta={'item': item},
                                             callback=self.parseDetailPage))
            else:
                self.log('Could not parse the document')
            return items
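Note that raising CloseSpider() triggers a graceful shutdown: requests already handed to the downloader still complete, which with high concurrency can look as if the queue is being crawled regardless. Lowering concurrency narrows that window. A minimal settings sketch using standard Scrapy settings (the values are illustrative):

    # settings.py
    CONCURRENT_REQUESTS = 1             # hand one request to the downloader at a time
    CONCURRENT_REQUESTS_PER_DOMAIN = 1  # keep the per-domain limit consistent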
Tagged: scrapy
asked Feb 19 '13 at 3:46, edited Feb 19 '13 at 23:48 – Praveer
Do you mean that you would like to stop the spider and later resume it without re-parsing the URLs that have already been parsed? If so, try the JOBDIR setting, which persists the request queue to a directory on disk.
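For reference, a minimal sketch of how the setting is typically used; the spider name and directory are illustrative:

    # Run the crawl with persistence enabled
    scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1

    # Stop with Ctrl-C (once), then resume later with the same command
    scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1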
answered Feb 22 '13 at 6:55 – Java Xu
I want to stop crawling entirely once a condition is satisfied in the parseDetail page, not resume it. The problem I am facing is that the queue already has a bunch of URLs which Scrapy is going to crawl irrespective of raising CloseSpider. – Praveer Feb 25 '13 at 20:18
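One way to get the behavior the comment asks for is to schedule the detail requests one at a time instead of all at once in parse: each callback yields the next request only after the current page passes the check, so the scheduler queue stays empty and CloseSpider stops the crawl immediately. A minimal sketch using the newer Scrapy API (response.follow); should_stop() stands in for the asker's actual condition:

    import scrapy
    from scrapy.exceptions import CloseSpider

    class SequentialSpider(scrapy.Spider):
        name = "sequential"
        start_urls = ["http://sfbay.craigslist.org/sof/"]

        def parse(self, response):
            # Collect every listing link up front, but request only the first;
            # the rest travel along in meta and are scheduled one by one.
            links = response.xpath(
                '//blockquote[@id="toc_rows"]//p[@class="row"]/a/@href').getall()
            if links:
                yield response.follow(links[0], callback=self.parse_detail,
                                      meta={'pending': links[1:]})

        def parse_detail(self, response):
            if self.should_stop(response):
                # Nothing else is queued, so the spider halts right here.
                raise CloseSpider('stop condition met on detail page')
            yield {'url': response.url}
            pending = response.meta['pending']
            if pending:
                # Chain to the next listing only after this one succeeded.
                yield response.follow(pending[0], callback=self.parse_detail,
                                      meta={'pending': pending[1:]})

        def should_stop(self, response):
            # Placeholder for the real condition (e.g. a posting-date check).
            return False

The trade-off is throughput: the crawl is strictly serial, and an errback would be needed to keep the chain alive if a detail request fails.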