tools: Update scraper.go to fix crawling depth mechanism #1408
Fixed the issue of bypassing Colly's built-in crawling depth control
I noticed that a previous PR implemented custom code to control crawling depth. However, Colly already provides a built-in depth control mechanism that we can use directly.
The current implementation has a critical issue: the crawler recursively follows every link found on each page, crawling ever deeper. Instead of stopping at a reasonable depth, the program eventually times out when the context deadline is exceeded.
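A minimal sketch of what relying on Colly's built-in control could look like (the start URL, depth value, and handler bodies are illustrative assumptions, not code from this PR):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	// colly.MaxDepth caps how many links deep the crawler follows.
	// The depth value 2 here is purely illustrative.
	c := colly.NewCollector(
		colly.MaxDepth(2),
	)

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// e.Request.Visit inherits and increments the current request's
		// depth; once MaxDepth is reached it returns colly.ErrMaxDepth
		// and the link is simply skipped, so no manual depth counter
		// is needed.
		_ = e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting", r.URL, "at depth", r.Depth)
	})

	// Hypothetical start URL for the sketch.
	if err := c.Visit("https://example.com/"); err != nil {
		log.Fatal(err)
	}
}
```

Note that depth tracking only works when links are followed via `e.Request.Visit` (which carries the depth forward), not by calling the collector's top-level `Visit` with a raw URL from inside a handler.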
PR Checklist
- [ ] Commit messages follow the `module: short description` convention (e.g. `memory: add interfaces for X, Y` or `util: add whizzbang helpers`).
- [ ] Relevant issues are referenced (e.g. `Fixes #123`).
- [ ] Code passes `golangci-lint` checks.