tools: Update scraper.go to fix crawling depth mechanism #1408
Fixed the issue of bypassing Colly's built-in crawling depth control
I noticed that a previous PR implemented custom code to control crawling depth. However, Colly already provides a built-in depth control mechanism that we can use directly.
The current implementation has a critical issue: the crawler recursively follows every link found on each page, crawling ever deeper. Instead of stopping at a reasonable depth, the program eventually times out when the context deadline is exceeded.
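A minimal sketch of what relying on Colly's built-in control could look like (the start URL, depth value, and handler bodies are illustrative assumptions, not code from this PR):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	// colly.MaxDepth caps how many links deep the crawler follows.
	// The depth value 2 here is purely illustrative.
	c := colly.NewCollector(
		colly.MaxDepth(2),
	)

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// e.Request.Visit inherits and increments the current request's
		// depth; once MaxDepth is reached it returns colly.ErrMaxDepth
		// and the link is simply skipped, so no manual depth counter
		// is needed.
		_ = e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting", r.URL, "at depth", r.Depth)
	})

	// Hypothetical start URL for the sketch.
	if err := c.Visit("https://example.com/"); err != nil {
		log.Fatal(err)
	}
}
```

Note that depth tracking only works when links are followed via `e.Request.Visit` (which carries the depth forward), not by calling the collector's top-level `Visit` with a raw URL from inside a handler.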
PR Checklist
- [ ] Commit messages follow the `module: short description` convention (e.g. `memory: add interfaces for X, Y` or `util: add whizzbang helpers`).
- [ ] Relevant issues are referenced (e.g. `Fixes #123`).
- [ ] Code passes `golangci-lint` checks.