Description
Some websites can freeze crawling, for example http://www.hemehealth.com.
When image requests are aborted, the crawl can freeze: PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT has no effect and the errback is never called, so the page cannot be closed.
Plain Playwright raises playwright._impl._errors.TimeoutError in the same situation:
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as pw:

        async def abort(route):
            # Abort image requests
            if route.request.resource_type in ["image"]:
                await route.abort()

        browser = await pw.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()
        page.set_default_timeout(3000)
        try:
            await page.route("**/*", abort)
            await page.goto("http://www.hemehealth.com")
            title = await page.title()
            print(title)
            await page.close()
            await context.close()
            await browser.close()
        except Exception:
            await page.close()
            await context.close()
            await browser.close()


asyncio.run(main())
File "/usr/local/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 558, in wrap_api_call
raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TimeoutError: Page.goto: Timeout 3000ms exceeded.
Call log:
- navigating to "http://www.hemehealth.com/", waiting until "load"
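For comparison, one way to regain control in plain Playwright even when its own timeout does not fire is to wrap the navigation in asyncio.wait_for. This is only a sketch (the 10-second value is arbitrary), and cancelling a pending Playwright call this way may leave the page in an inconsistent state:

import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()

        async def abort(route):
            if route.request.resource_type in ["image"]:
                await route.abort()

        await page.route("**/*", abort)
        try:
            # asyncio-level safety net around the navigation, independent of
            # Playwright's own navigation timeout (10 s is an arbitrary choice)
            await asyncio.wait_for(page.goto("http://www.hemehealth.com"), timeout=10)
            print(await page.title())
        except Exception as exc:
            print(f"navigation failed or timed out: {exc!r}")
        finally:
            await page.close()
            await context.close()
            await browser.close()


asyncio.run(main())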
With scrapy-playwright, the crawl freezes and the errback is never called, so the page cannot be closed:
import scrapy
from playwright.async_api import Page
from scrapy.crawler import CrawlerProcess


class ExampleSpider(scrapy.Spider):
    name = "example"
    custom_settings = {
        "PLAYWRIGHT_ABORT_REQUEST": lambda req: req.resource_type in ["image"],
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "CONCURRENT_REQUESTS": 1,
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
        "RETRY_ENABLED": False,
        "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": 3000,
    }

    def start_requests(self):
        urls = ["http://www.hemehealth.com"]
        for url in urls:
            yield scrapy.Request(
                url=url,
                meta={
                    "playwright": True,
                    "playwright_context": url,
                    "playwright_include_page": True,
                },
                callback=self.parse,
                errback=self.close,
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()
        html = await page.content()
        await page.close()
        await page.context.close()

    async def close(self, failure):
        # Expected to run on navigation errors, but never reached here
        page = failure.request.meta["playwright_page"]
        await page.close()
        await page.context.close()


process = CrawlerProcess()
process.crawl(ExampleSpider)
process.start()
2025-08-07 14:18:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-08-07 14:18:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-08-07 14:18:54 [scrapy-playwright] INFO: Starting download handler
2025-08-07 14:18:54 [scrapy-playwright] INFO: Starting download handler
2025-08-07 14:18:54 [scrapy-playwright] INFO: Launching browser chromium
2025-08-07 14:18:55 [scrapy-playwright] INFO: Browser chromium launched
2025-08-07 14:18:55 [scrapy-playwright] DEBUG: Browser context started: 'http://www.hemehealth.com' (persistent=False, remote=False)
2025-08-07 14:18:55 [scrapy-playwright] DEBUG: [Context=http://www.hemehealth.com] New page created, page count is 1 (1 for all contexts)
2025-08-07 14:18:55 [scrapy-playwright] DEBUG: [Context=http://www.hemehealth.com] Request: <GET http://www.hemehealth.com/> (resource type: document)
2025-08-07 14:18:55 [scrapy-playwright] DEBUG: [Context=http://www.hemehealth.com] Response: <200 http://www.hemehealth.com/>
........
2025-08-07 14:18:55 [scrapy-playwright] DEBUG: [Context=http://www.hemehealth.com] Response: <200 http://www.hemehealth.com/js/jquery.min.js>
2025-08-07 14:18:55 [scrapy-playwright] DEBUG: [Context=http://www.hemehealth.com] Response: <200 http://www.hemehealth.com/js/bootstrap.bundle.js>
This may be the same issue as #266 (comment).
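As a possible mitigation to experiment with (not verified against this site), the playwright_page_goto_kwargs meta key should allow resolving the navigation on "domcontentloaded" instead of the default "load"; whether that avoids the hang when image requests are aborted is an open question. A minimal sketch, reusing the settings from the spider above:

import scrapy


class ExampleWorkaroundSpider(scrapy.Spider):
    # Assumes the same custom_settings as ExampleSpider above;
    # only the request meta changes.
    name = "example_workaround"

    def start_requests(self):
        yield scrapy.Request(
            url="http://www.hemehealth.com",
            meta={
                "playwright": True,
                "playwright_include_page": True,
                # Resolve the navigation earlier than the default "load" event
                "playwright_page_goto_kwargs": {"wait_until": "domcontentloaded"},
            },
            callback=self.parse,
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        self.logger.info(await page.title())
        await page.close()

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()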