Description
What is (possibly) going wrong?
Observing a mismatch between the (scrapy) Response headers and the Response body.
How to reproduce?
- Crawl a spider that would yield a single (scrapy) `Request` configured to be handled by `scrapy-playwright` as described in the readme.
- In the `"playwright_page_methods"` meta, include a `PageMethod` as follows:
  - [a] Pass to the `method` argument either the string `"click"` or a `Callable` which `await`-calls the (playwright) `Locator.click` method.
  - [b] In the `PageMethod` or `Locator` (as chosen in [a]), pass to the `selector` argument the selector of a hyperlink to a different page.
- [tldr;] Make Playwright click on a hyperlink to a page different from the one it originally landed on (after any redirects).
- [note] One needs to ensure that their code waits for the completion of browser navigation to the linked page (until at least `"commit"`).
What's happening in the background? (afaik)
In a simple example not involving any redirects, I believe the first (original) Request explicitly sent by the spider creates a second, implicit request to the URL referred to by the hyperlink when the Playwright clicks on the link it received in response to the first request.
In a more nuanced example, the first request gets (HTTP 302) redirected to a second request, and Playwright creates a third, implicit request when it clicks on the link it received in response to the second request.
The first example of this I noticed in the wild involved four requests. Everything happened as described in the three-request example right above this paragraph, but the response to the third request was (yet another) HTTP 302 redirect, causing a fourth and final request to be sent.
Summary so far...
There are 4 requests in the chain, Req. 1 ... Req. 4. There are 2 pages, the <Interim_Page> which contains the link the Playwright clicks on, and the <Requested_Page> that the spider parses.
Req. 1 --[ redirect ]--> Req. 2 --[ ok ]--> <Interim_Page> --[ click ]--> Req. 3 --[ redirect ]--> Req. 4 --[ ok ]--> <Requested_Page>
What is the result?
I am observing that the resulting (scrapy) Response contains the body of the response to Req. 4, but the headers of the response to Req. 2. This confuses me, and maybe also confuses some scrapy middleware, unless my understanding of the situation is incomplete.
What do I like about the result?
The final resulting body should continue to correspond to Req. 4 by default as it is right now, in my opinion. Reasoning:
- Generally, when the server has set up such a redirection pattern, the final data that it would send in response to `Req. 4` would probably be highly similar to what it would have sent in response to `Req. 1` had it skipped this redirection as a result of knowing that the client was returning to a page that they had previously visited in that same session. This validates my opinion that it is usually beneficial for the response `body` by default to correspond to the last and final request in the chain, which was `Req. 4` in this case.
- It seems like `scrapy-playwright` (or perhaps the `RedirectMiddleware` controls this behavior) reacts to the HTTP 302 redirect responses in a reasonable and expected manner. The system is being made to believe that the response `body` for `Req. 2` was the response `body` for `Req. 1`, which makes sense because the browser redirected in response to a 302. The system is also told the response `body` for `Req. 4` was the response `body` for `Req. 3` for that same reason, as it should be. At least in my scenario, the `PageMethod` click between `Req. 2` and `Req. 3` was similarly intended, aimed at making progress towards the `<Requested_Page>`, so it should continue to be treated similarly to a redirect by default. That is, the click should forward the responsibility of setting the `body`, just as a redirect would.
- The only response `body` that, from my observations, ends up reaching the spider's parser was the one in response to `Req. 4`, aka `<Requested_Page>`, which further validates my point that the `Response` object should continue to refer to that same `body`.
- This behavior works seamlessly with the core of scrapy as discussed right above in [3.], and also with the `HttpCacheMiddleware`. To elaborate, the `response_url` in the cached `meta` file is that of the response to `Req. 4`, so the fact that the cached `response_body` file contains the response `body` of none other than `Req. 4` promotes cache data consistency.
- Finally, if a spider has instructions for executing a `PageMethod` to click on a link to a different page even before the current page is parsed, then I am happy with the default assumption that the author intended the click as a redirect unless explicitly stated otherwise by them.
There may be cases that I haven't encountered where one might want the configurability to not treat such clicks as redirects. However, I would like to keep the scope of this issue I'm raising limited to clearing my confusion around or fixing the mismatch between the body and the headers. The request for overriding the default behavior around body can be raised as a separate feature request if folks start to find the lack of it bothersome enough.
What should we change about the result?
The headers should match the body.
If every cog of this system wants to treat those 4 requests as a single chain, then we should embrace the chain. Why stop forwarding the responsibility of setting `headers` at `Req. 2` when we continue to forward the responsibility of setting `body` through `Req. 4`?

When I inspected the value of the `Response` object and first noticed a `body` matched with `headers` from a different response, I wondered if it was a bug. I had no clue what was going on until my spider re-crawled many times, and I read through the logs and cache, monitored the network tab, and hypothesized the above "What's happening in the background? (afaik)" section.
Let us get the responsibility of setting `headers` forwarded through the "click" of a hyperlink (to another route) facilitated by scrapy-playwright, such that the final `headers` associated with the (scrapy) `Response` are those of the final response in the chain, and thus match the `body` associated with that same response as well.
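Until that change lands, one conceivable workaround is to capture the final document response's headers on the Playwright side via the `"playwright_page_event_handlers"` meta key from the scrapy-playwright readme. This is a hedged sketch, not a vetted fix; the handler name and the `final` dict are my own, and the filtering on `resource_type == "document"` assumes the final navigation is a top-level document load.

```python
# Hypothetical workaround sketch: record the headers of each main-frame
# document response as Playwright sees it. Because the dict is overwritten on
# every document response, it ends up holding the headers of the final
# response in the chain (Req. 4 in the issue's example).
final = {}


async def record_document_headers(response):
    # Playwright fires "response" for every subresource too;
    # keep only top-level document responses.
    if response.request.resource_type == "document":
        final["url"] = response.url
        final["headers"] = await response.all_headers()


# In the spider, attach the handler through the request meta, e.g.:
# meta={
#     "playwright": True,
#     "playwright_page_event_handlers": {"response": record_document_headers},
# }
```

The spider can then compare `final["headers"]` against `response.headers` in its callback to confirm (or sidestep) the mismatch.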
Why does this matter?
- The Response 2 `headers` being in the same (scrapy) `Response` object as the Response 4 `body` just sounds wrong. I know that `HttpCacheMiddleware` also then stores the data wrong, with the cached `response_headers` file matching neither the cached `response_body` file nor the `response_url` in the cached `meta` file. There may be layers of scrapy, or some custom spider code one might write, that could malfunction from breaking the assumption that the `headers` match the `body` and the `url`.
- Particularly, I saw that this messed up the `RFC2616` cache policy in the `HttpCacheMiddleware`. For example, say Response 2 contained cache `headers` indicating that caching should be avoided, since the `<Interim_Page>` was an on-the-fly response to the client's session state. However, say Response 4, the `<Requested_Page>`, was a relatively static resource, so Response 4 contained cache `headers` indicating that the resource should be cached for some time. With the Response 2 `headers` being incorrectly respected for the Response 4 resource `body`, the static resource was mistaken for a dynamic resource by scrapy's caching layer. The cache policy, which depends on accurate cache `headers` to reduce the spider's network request rate against the server, became less effective. This has the potential danger of angering the server administrators over repeatedly requesting static resources that their server encourages clients to cache.
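The caching failure mode above can be illustrated with a toy freshness check. This is not scrapy's actual `RFC2616` policy implementation, just a crude stand-in, and the two header values are hypothetical.

```python
# Toy illustration (NOT scrapy's actual RFC2616 cache policy) of how judging a
# response body by the wrong response's Cache-Control header flips the
# caching decision.

def is_cacheable(cache_control: str) -> bool:
    """Crude freshness check: cacheable iff a positive max-age is given
    and neither no-store nor no-cache is present."""
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return False
    for d in directives:
        if d.startswith("max-age="):
            return int(d.split("=", 1)[1]) > 0
    return False


# Hypothetical header values for the two responses in the issue's scenario:
interim_cc = "no-store"         # Response 2: dynamic <Interim_Page>
requested_cc = "max-age=86400"  # Response 4: static <Requested_Page>

# The static <Requested_Page> body should be cached...
assert is_cacheable(requested_cc)
# ...but judging it by the <Interim_Page> headers wrongly forbids caching:
assert not is_cacheable(interim_cc)
```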
Version Details
scrapy-playwright==0.0.42
Scrapy==2.12.0
playwright==1.50.0
python==3.13.2