@GuoYuefei GuoYuefei commented Sep 28, 2025

Fixed the issue of bypassing Colly's built-in crawling depth control

I noticed that in a previous PR, custom code was implemented to control crawling depth. However, Colly actually provides built-in depth control mechanisms that we can utilize directly.

The current implementation has a critical issue: the crawler recursively follows every link found on each page and keeps descending to deeper levels. The crawl therefore ends only when the context deadline is exceeded and the program times out, rather than stopping at a reasonable depth.
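To illustrate the failure mode described above, here is a minimal, self-contained sketch in plain Go. It does not use Colly and is not the code from this PR; the `site` map, the `crawl` function, and the `resetDepth` flag are all hypothetical names made up for the example. It models a crawler whose requests carry a depth counter, and shows how the depth limit stops working if followed links do not inherit the depth of the page they were found on. For reference, Colly exposes the limit through the `colly.MaxDepth` collector option, and as far as I understand it, following a link with `e.Request.Visit` carries the depth forward, while calling `Visit` on the collector itself starts again from depth 1.

```go
package main

import "fmt"

// site is a hypothetical in-memory link graph standing in for real pages.
var site = map[string][]string{
	"/":        {"/a", "/b"},
	"/a":       {"/a/1"},
	"/b":       {"/b/1"},
	"/a/1":     {"/a/1/x"},
	"/b/1":     nil,
	"/a/1/x":   {"/a/1/x/y"},
	"/a/1/x/y": nil,
}

// request pairs a URL with the depth at which it was discovered.
type request struct {
	url   string
	depth int
}

// crawl walks the graph breadth-first from start and returns the pages it
// visited, in visit order. The start page has depth 1, and maxDepth == 0
// means "no limit". When resetDepth is true, every followed link is treated
// as a fresh top-level visit (depth 1), which models following links in a
// way that discards the parent's depth.
func crawl(start string, maxDepth int, resetDepth bool) []string {
	seen := map[string]bool{}
	var visited []string
	queue := []request{{url: start, depth: 1}}

	for len(queue) > 0 {
		r := queue[0]
		queue = queue[1:]

		if seen[r.url] {
			continue
		}
		// The depth check: reject requests deeper than the limit.
		if maxDepth > 0 && r.depth > maxDepth {
			continue
		}
		seen[r.url] = true
		visited = append(visited, r.url)

		for _, link := range site[r.url] {
			next := r.depth + 1
			if resetDepth {
				next = 1 // depth information is lost, so the limit never triggers
			}
			queue = append(queue, request{url: link, depth: next})
		}
	}
	return visited
}

func main() {
	fmt.Println(len(crawl("/", 2, false))) // prints 3: "/", "/a", "/b"
	fmt.Println(len(crawl("/", 2, true)))  // prints 7: the whole graph
}
```

With depth propagated, a limit of 2 stops after the start page and its direct links. With depth reset on every link, the same limit is effectively ignored and the crawl only ends when it runs out of pages, or, on a real site, when something external such as a context deadline cuts it off.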



PR Checklist

  • Read the Contributing documentation.
  • Read the Code of conduct documentation.
  • Name your Pull Request title clearly, concisely, and prefixed with the name of the primarily affected package you changed according to Good commit messages (such as memory: add interfaces for X, Y or util: add whizzbang helpers).
  • Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. Fixes #123).
  • Describes the source of new concepts.
  • References existing implementations as appropriate.
  • Contains test coverage for new functions.
  • Passes all golangci-lint checks.

Fixed the error of bypassing Colly's crawl-depth restriction mechanism

GuoYuefei commented Sep 28, 2025

#210
When searching issues and PRs, this was the only similar PR I found, but it was closed.

GuoYuefei commented

@tmc I have tested it myself. When you have time, please help review it. Thank you very much.🌹🌹🌹

GuoYuefei commented

#1318
