Description
scrapy_playwright tries to imitate a web browser, so it downloads all resources (images, scripts, stylesheets, etc.). Given that this information is downloaded, is it possible to grab the payload from specific requests shown in the network tab, specifically from the fetch/XHR tab?
For example (minimal reproducible code):
import scrapy
from scrapy_playwright.page import PageCoroutine


class testSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            cookies={"foo": "bar", "asdf": "qwerty"},
            meta={
                "playwright": True,
                "playwright_page_coroutines": [
                    PageCoroutine("wait_for_selector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                    PageCoroutine("screenshot", path="scroll.png", full_page=True),
                ],
            },
        )

    def parse(self, response):
        pass
Produces the following output:
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=2> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=3> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=4> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
...
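One way this could be done (a sketch, not a confirmed scrapy-playwright recipe) is to attach a handler to Playwright's page "request" event and record the query parameters of every request whose resource type is xhr. The names record_xhr and xhr_params below are invented for illustration; resource_type and url are real attributes of Playwright's Request object:

```python
from urllib.parse import parse_qs, urlsplit

# Collected query parameters for each XHR request the page makes.
# record_xhr / xhr_params are illustrative names, not library API.
xhr_params = []

async def record_xhr(request):
    # "request" is expected to be a playwright Request object;
    # resource_type and url are attributes of that class.
    if request.resource_type == "xhr":
        xhr_params.append(parse_qs(urlsplit(request.url).query))
```

With access to the underlying page object, this handler could be registered with Playwright's page.on("request", record_xhr) before the scroll happens; how to get at the page object depends on the scrapy-playwright version in use.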
This is very similar to Firefox, which lets you copy the URL and its parameters. The aim is to store the URL parameters in a list each time Playwright downloads a specific request URL. With a more complex website I would have to find specific request URLs and grab their parameters.
Something like:
if response.meta['resource_type'] == 'xhr':
    print(parameters(response.meta['resource_type_urls']))
This is a pseudo-example to express what I want to get; parameters
would be a function that grabs the URL parameters.
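Such a parameters function (the name is taken from the pseudo-example above, not from any library) can be written with the standard library alone:

```python
from urllib.parse import parse_qs, urlsplit

def parameters(url):
    """Return the query-string parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)

# parameters("http://quotes.toscrape.com/api/quotes?page=2")
# returns {"page": ["2"]}
```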
Or perhaps it works like this:
if response.meta['resource_type'] == 'xhr':
    print(response.meta['parameters'])
However, saving this into response.meta
will likely bloat the results if I have a large number of URLs per resource type, since the URL parameters are fairly large dicts.
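One way to keep the stored data small would be to filter down to the specific endpoint of interest and keep only its parameters, discarding every other resource URL. The prefix below is an assumption based on the log output earlier in this issue, and wanted_params is an invented helper name:

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical endpoint prefix, taken from the XHR URLs in the debug log.
WANTED = "http://quotes.toscrape.com/api/quotes"

def wanted_params(urls):
    # Keep query parameters only for URLs under the wanted endpoint,
    # instead of storing every resource URL the page downloaded.
    return [parse_qs(urlsplit(u).query) for u in urls if u.startswith(WANTED)]
```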
I'm convinced this data is available, since it is downloaded; I just do not know how to get it.