Grabbing parameters from default requests #61

@lime-n

Description

scrapy-playwright tries to imitate a web browser, so it downloads all resources (images, scripts, stylesheets, etc.). Given that this information is downloaded, is it possible to grab the payload of specific requests shown in the network tab, specifically the Fetch/XHR tab?

For example (minimal reproducible code):

import scrapy
from scrapy_playwright.page import PageCoroutine

class testSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            cookies={"foo": "bar", "asdf": "qwerty"},
            meta={
                "playwright": True,
                "playwright_page_coroutines": [
                    PageCoroutine("wait_for_selector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                    PageCoroutine("screenshot", path="scroll.png", full_page=True),
                ],
            },
        )

    def parse(self, response):
        pass

Produces the following output:

2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=2> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=3> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=4> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
...

This is very similar to Firefox's network tab, which lets you copy a request URL and its parameters. The aim is to store the URL parameters in a list each time Playwright downloads a matching request URL. On a more complex website I would have to find specific request URLs and grab their parameters.

Something like:

if response.meta['resource_type'] == 'xhr':
    print(parameters(response.meta['resource_type_urls']))

This is a pseudo-example to express what I want to get; parameters would be a function that extracts the URL parameters.
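A helper like that `parameters` function could be sketched with the standard library alone (the function name comes from the pseudo-example above; it is not part of scrapy-playwright):

```python
from urllib.parse import parse_qs, urlparse

def parameters(url: str) -> dict:
    """Return the query-string parameters of a URL as a dict of lists."""
    return parse_qs(urlparse(url).query)

# One of the XHR URLs from the log output above:
print(parameters("http://quotes.toscrape.com/api/quotes?page=2"))
# → {'page': ['2']}
```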

Or perhaps it works like this:

if response.meta['resource_type'] == 'xhr':
    print(response.meta['parameters'])

However, saving all of this into response.meta would likely bloat the results if there is a large number of URLs per resource type, since the URL-parameter dicts can be fairly large.

  • I'm convinced this data is available, since it is downloaded; I just do not know how to get at it.
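For what it's worth, one possible approach (a sketch, not tested against this site) is to attach a handler for Playwright's "request" page event via scrapy-playwright's playwright_page_event_handlers meta key, filter on resource_type, and collect the parsed query parameters. The collection logic itself needs nothing beyond the standard library; the demo below substitutes a stand-in object for Playwright's Request, whose url and resource_type attributes are real:

```python
from types import SimpleNamespace
from urllib.parse import parse_qs, urlparse

collected_params = []  # one dict of query parameters per matching request

def collect_xhr_params(request):
    """Handler for Playwright's "request" page event: record XHR query params."""
    if request.resource_type == "xhr":
        collected_params.append(parse_qs(urlparse(request.url).query))

# In the spider, the handler would be wired up via scrapy-playwright's
# documented meta keys, roughly:
#   meta={
#       "playwright": True,
#       "playwright_page_event_handlers": {"request": collect_xhr_params},
#   }

# Quick demo with a stand-in for the Playwright Request object:
collect_xhr_params(SimpleNamespace(
    resource_type="xhr",
    url="http://quotes.toscrape.com/api/quotes?page=2",
))
print(collected_params)
# → [{'page': ['2']}]
```

This keeps the parameters out of response.meta entirely, which sidesteps the size concern above: the list lives on the spider (or module) and only holds the parsed dicts, not the full responses.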
