
Sequentially crawl website using scrapy

Asked 6 years, 5 months ago Active 6 years, 5 months ago Viewed 888 times

Is there a way to tell scrapy to stop crawling based upon a condition found on a 2nd-level page? I am doing the following:

1. I have a start_url to begin with (1st-level page)
2. I have a set of urls extracted from the start_url using parse(self, response)
3. Then I queue the links using Request with callback as parseDetailPage(self, response)
4. Under parseDetailPage (2nd-level page) I come to know whether I can stop crawling or not

Right now I am using CloseSpider() to accomplish this, but the problem is that the urls to be parsed are already queued by the time I start crawling the second-level pages, and I do not know how to remove them from the queue. Is there a way to sequentially crawl the list of links and then be able to stop in parseDetailPage?

global job_in_range
start_urls = []
start_urls.append("http://sfbay.craigslist.org/sof/")

def __init__(self):
    self.job_in_range = True

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//blockquote[@id="toc_rows"]')
    items = []
    if results:
        links = results.select('.//p[@class="row"]/a/@href')
        for link in links:
            if link is self.end_url:
                break
            nextUrl = link.extract()
            isValid = WPUtil.validateUrl(nextUrl)
            if isValid:
                item = WoodPeckerItem()
                item['url'] = nextUrl
                item = Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage)
                items.append(item)
    else:
        self.error.log('Could not parse the document')
    return items

def parseDetailPage(self, response):
    if self.job_in_range is False:
        raise CloseSpider('End date reached - No more crawling for ' + self.name)
    hxs = HtmlXPathSelector(response)
    print response
    body = hxs.select('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.select('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1]
    if item['jobTitle'] is 'Admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
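For contrast, here is a minimal, Scrapy-free sketch of the "sequential" behaviour the question asks for: visit one URL at a time and only move to the next one after the stop condition has been checked, so nothing unwanted is ever queued. `fetch`, the URL list, and the `jobTitle` condition are all illustrative stand-ins, not Scrapy APIs:

```python
def crawl_sequentially(urls, fetch, should_stop):
    """Visit urls one at a time; stop as soon as should_stop(page) is True."""
    visited = []
    for url in urls:
        page = fetch(url)      # stand-in for downloading and parsing a detail page
        visited.append(url)
        if should_stop(page):  # the 2nd-level condition from parseDetailPage
            break              # nothing further is ever queued
    return visited

# Example: stop when a page's title is "Admin".
pages = {
    "u1": {"jobTitle": "Dev"},
    "u2": {"jobTitle": "Admin"},
    "u3": {"jobTitle": "Dev"},
}
result = crawl_sequentially(
    ["u1", "u2", "u3"],
    fetch=pages.get,
    should_stop=lambda p: p["jobTitle"] == "Admin",
)
print(result)  # ['u1', 'u2']; 'u3' was never queued
```

In real Scrapy the same effect is usually approximated by yielding only one Request at a time and deciding in the detail callback whether to yield the next one.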
scrapy

asked Feb 19 '13 at 3:46, edited Feb 19 '13 at 23:48 by Praveer


1 Answer

Do you mean that you would like to stop the spider and resume it without re-parsing the urls which have already been parsed? If so, you may try setting JOBDIR. This setting keeps the request queue in a specified directory on disk.
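For reference, Scrapy's persistence setting is spelled JOBDIR. A minimal settings fragment (the directory name below is just an example):

```python
# settings.py fragment: persist the scheduler queue and dupefilter state
# to disk so a stopped crawl can later resume where it left off.
JOBDIR = 'crawls/woodpecker-1'  # directory name is an example
```

Equivalently it can be passed on the command line, e.g. `scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1` (the spider name here is assumed).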
answered Feb 22 '13 at 6:55 by Java Xu

I want to stop crawling entirely once a condition is satisfied in parseDetailPage, and not resume it. The problem I am facing is that the queue already has a bunch of urls which scrapy is going to crawl irrespective of raising CloseSpider. – Praveer Feb 25 '13 at 20:18

Which CloseSpider did you use? scrapy.contrib.closespider.CloseSpider? or scrapy.exceptions.CloseSpider? – Java Xu Feb 26 '13 at 8:04
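On the distinction the comment raises: scrapy.exceptions.CloseSpider is an exception raised from inside a callback, while the CloseSpider extension is driven by settings such as CLOSESPIDER_ITEMCOUNT. A stand-alone illustration of the exception flavour, using a local stand-in class since Scrapy itself is not imported here:

```python
class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider (reason attribute only)."""
    def __init__(self, reason="cancelled"):
        self.reason = reason
        super().__init__(reason)

def parse_detail_page(item):
    # Mirror the question's logic: stop once an out-of-range item is seen.
    if item["jobTitle"] == "Admin":
        raise CloseSpider("End date reached")
    return item

try:
    parse_detail_page({"jobTitle": "Admin"})
except CloseSpider as exc:
    print(exc.reason)  # prints "End date reached"
```

Note that in Scrapy the exception only stops the spider gracefully; requests already handed to the scheduler may still be processed before shutdown completes.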
