
Conversation

Contributor

@guillaumemichel guillaumemichel commented Apr 9, 2025

Reprovides are currently spiky, since all keys are reprovided at once (for both the normal and the accelerated DHT client). If new keys are added during a reprovide, the reprovider waits until the current reprovide has finished before advertising the newly added keys to the DHT, so new keys are blocked behind reprovides.

This PR splits the (re)provide queue into a provide queue and a reprovide queue, allowing provides to be advertised immediately, even while a reprovide is in progress.
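For illustration, here is a minimal sketch of a dispatcher that always drains the provide queue before touching the reprovide queue. The names are hypothetical, not the actual boxo/provider API:

```go
package sketch

import "context"

type cid string

// dispatch always drains the provide queue before taking reprovide work,
// so new keys are never blocked behind a long-running reprovide.
func dispatch(ctx context.Context, provides, reprovides <-chan cid, announce func(cid)) {
	for {
		// Fast path: if a new provide is waiting, handle it first.
		select {
		case c := <-provides:
			announce(c)
			continue
		case <-ctx.Done():
			return
		default:
		}
		// Otherwise block until work arrives on either queue.
		select {
		case c := <-provides:
			announce(c)
		case c := <-reprovides:
			announce(c)
		case <-ctx.Done():
			return
		}
	}
}
```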

ipfs/kubo#10777

Depending on either:

@guillaumemichel guillaumemichel requested a review from a team as a code owner April 9, 2025 08:57
@guillaumemichel guillaumemichel marked this pull request as draft April 9, 2025 13:33
@guillaumemichel guillaumemichel marked this pull request as ready for review April 10, 2025 08:46
@guillaumemichel guillaumemichel changed the title from "provider: prioritize new provides" to "provider: dedicated provide queue" Apr 10, 2025
Contributor

@gammazero gammazero left a comment

Provide collects batches of CIDs before providing them. This means there may be some delay while filling a batch, especially if each new CID arrives just before the pause-detection timer expires. It seems like it might be better to send new provides immediately.

One problem with doing that is that duplicate CIDs would not be removed, because the batch is currently a map of CIDs. Consider keeping the map to ignore duplicates, but publishing each unique CID immediately. Some additional logic may be needed to keep the map from growing too large, which may mean making it an LRU cache.
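As a sketch of that suggestion, the following uses an LRU cache (here github.com/hashicorp/golang-lru, with an arbitrary size) to drop recently seen duplicates while publishing each unique CID immediately. The names are illustrative, not the provider's actual types:

```go
package sketch

import (
	lru "github.com/hashicorp/golang-lru"
)

type instantProvider struct {
	seen    *lru.Cache   // bounded, unlike an ever-growing map
	publish func(string) // announces one CID to the DHT
}

func newInstantProvider(publish func(string)) (*instantProvider, error) {
	seen, err := lru.New(1024) // arbitrary size for the sketch
	if err != nil {
		return nil, err
	}
	return &instantProvider{seen: seen, publish: publish}, nil
}

// provide drops recently seen duplicates and publishes unique CIDs
// immediately, with no batching or pause-detection timer.
func (p *instantProvider) provide(c string) {
	if p.seen.Contains(c) {
		return
	}
	p.seen.Add(c, struct{}{})
	p.publish(c)
}
```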

Also, we may need a test showing that providing does not have to wait for a reprovide.

Otherwise, this all looks good.

@guillaumemichel
Contributor Author

Added instant provides, but I haven't added tests yet.

Instant provides are now the default, with a "safe (?)" default of 128 concurrent provides; batched provides can still be used if the corresponding option is set. A warning is logged if the queue of CIDs waiting to be provided instantly isn't drained fast enough. I made instant provides the default because before #641, all CIDs were provided instantly, so the behavior is expected to be similar to what it used to be.

It adds quite a bit of complexity :(

In any case, the provide operation is resource intensive: each provide triggers sending a message to 20 remote peers. With batching and the accelerated DHT client, we can group the messages we send to a remote peer, to avoid opening and closing a connection to the same peer many times. But this comes at the expense of timing.

Instant provides are "instant", but the node is expected to open and close more connections, even though the same payload is sent to the same peers.
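A minimal sketch of what such a bounded instant-provide loop could look like, assuming a 128-slot semaphore and a hypothetical provideOne function (this is not the actual implementation):

```go
package sketch

import (
	"context"
	"log"
)

const maxWorkers = 128 // the "safe (?)" default discussed above

// runProvides caps concurrent provides at maxWorkers and warns when
// the queue is not drained fast enough (all slots busy).
func runProvides(ctx context.Context, queue <-chan string, provideOne func(context.Context, string)) {
	sem := make(chan struct{}, maxWorkers)
	for c := range queue {
		select {
		case sem <- struct{}{}: // a slot is free, proceed immediately
		default:
			log.Printf("provide queue backed up: all %d workers busy", maxWorkers)
			sem <- struct{}{} // block until a slot frees up
		}
		go func(c string) {
			defer func() { <-sem }()
			provideOne(ctx, c)
		}(c)
	}
}
```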

Some open questions:

  1. Should we get rid of batched providing altogether? It may seem great for nodes with limited networking capabilities, but it requires running the accelerated DHT client (which is resource intensive) to be efficient.
  2. If we keep batched providing, should we add a configurable number of workers, so that new provides don't have to wait for the previous batch provide to finish before being provided?
  3. I think it is safe behavior to bound the default number of provide workers, but it is hard to come up with the right number. I expect accelerated DHT client provides to take ~1s, which means the number of workers is approximately the number of CIDs per second that the system can provide (e.g. 128 workers ≈ 128 CIDs/s).

@gammazero let me know if you see anything that can be simplified further

@gammazero
Contributor

gammazero commented Apr 11, 2025

  1. Should we get rid of batched providing altogether? It may seem great for nodes with limited networking capabilities, but it requires running the accelerated DHT client (which is resource intensive) to be efficient.

If using a worker pool, I do not see a need for batched provides. The size of the worker pool will limit the number of CIDs read into memory and processed.

For lower-priority reprovides, batching could be used to limit the number of CIDs in memory, but that can also be achieved with a worker pool. It would probably be good to have a smaller worker pool for reprovides than for provides. Maybe even limit the reprovide pool size to one or two.

Batching used to be the way CIDs were deduplicated. That has been replaced by the deduplication cache in this PR, so there really does not appear to be any remaining utility in batching. Once #901 is fixed, there will also be no need for the deduplication cache.

  2. If we keep batched providing, should we add a configurable number of workers, so that new provides don't have to wait for the previous batch provide to finish before being provided?

If we have a limited-size worker pool, there does not seem to be a need for batching, since the size of the pool limits the number of CIDs loaded into memory for (re)providing. The pool size should be adjustable, and a smaller pool should be used for reprovides by default.
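A rough sketch of that two-pool layout, with hypothetical names and illustrative sizes:

```go
package sketch

import "context"

// startPool launches size workers that all pull from the same queue.
func startPool(ctx context.Context, size int, queue <-chan string, work func(string)) {
	for i := 0; i < size; i++ {
		go func() {
			for {
				select {
				case c := <-queue:
					work(c)
				case <-ctx.Done():
					return
				}
			}
		}()
	}
}

// Usage: a large pool for fresh provides, a tiny one for reprovides,
// so background reprovides can never starve new provides.
//
//	startPool(ctx, 128, provides, announce)
//	startPool(ctx, 2, reprovides, announce)
```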

  3. I think it is safe behavior to bound the default number of provide workers, but it is hard to come up with the right number. I expect accelerated DHT client provides to take ~1s, which means the number of workers is approximately the number of CIDs per second that the system can provide.

If each provide sends a CID to 20 remote peers (requiring 20 connections) and the worker pool has 50 workers, that is 1000 connections in use. Perhaps the pool size can be informed by the maximum (or a reasonable) number of peer connections, leaving enough capacity to handle incoming peer connections. Does the resource manager give us some clue?
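As a back-of-the-envelope sketch, the pool size could be derived from a connection budget instead of a magic number. The 20-dials-per-provide figure comes from the discussion above; everything else is an assumption:

```go
package main

import "fmt"

// poolSize derives a provide worker count from a connection budget,
// assuming each provide dials about 20 peers (the DHT replication factor).
func poolSize(connBudget, dialsPerProvide int) int {
	if n := connBudget / dialsPerProvide; n > 0 {
		return n
	}
	return 1
}

func main() {
	// Reserving 1000 connections for providing gives a pool of 50 workers,
	// matching the arithmetic above.
	fmt.Println(poolSize(1000, 20)) // 50
}
```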

@guillaumemichel
Contributor Author

The main benefit of batching is that it reduces the number of connections we need to open. Basically it will:

  1. select all the keys from the batch that should be allocated to a DhtServer
  2. open the connection to DhtServer
  3. send all provides to DhtServer (PUT RPC)
  4. close the connection to DhtServer
  5. repeat with the next peer

Whereas if there is no batching, the provider will open connections to the 20 closest nodes to a CID, send them the PUT request, and continue with the next CID. As a result, the node will dial the same nodes many times within a short period if the number of provided CIDs is large enough.

Hence batching helps reduce the number of dials. Note that batch providing only works with the accelerated DHT client.
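A rough sketch of the per-peer grouping (hypothetical names; the real accelerated DHT client does this internally):

```go
package sketch

type peerID string

// groupByServer collects, for every DHT server, all CIDs whose provider
// records it should store, so each server is dialed exactly once.
// closestPeers would come from the routing table kept by the
// accelerated DHT client.
func groupByServer(cids []string, closestPeers func(string) []peerID) map[peerID][]string {
	batches := make(map[peerID][]string)
	for _, c := range cids {
		for _, p := range closestPeers(c) { // ~20 peers per CID
			batches[p] = append(batches[p], c)
		}
	}
	return batches
}

// The caller then opens one connection per server, sends every PUT RPC
// in its batch, closes the connection, and moves on to the next peer.
```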


Perhaps the pool size can be informed by the maximum (or a reasonable) number of peer connections, leaving enough capacity to handle incoming peer connections

This is tricky, because the provider has no visibility into the number of (concurrent) connections opened by the DHT. It is true that for each provide, the libp2p node will try to open at least 20 connections. However, the normal DHT client limits the number of concurrent connections per request to 10, and the accelerated DHT client has 20 workers. Both default parameters can be overridden by implementations.

The libp2p resource manager will probably prevent us from opening too many connections (if enabled).

Contributor

@gammazero gammazero left a comment

See comments. I think we can get rid of batched provides.

@gammazero
Contributor

Whereas if there is no batching ...
OK, so a little more than simple batching, to make connection use more efficient.

@guillaumemichel
Contributor Author

As observed in #901, CIDs are added to the queue multiple times, and deduplication mechanisms were removed in this PR, so each CID may be advertised up to 3 times.

A quick fix is #910, but in the long run we won't need it if we can solve #901.

Contributor

@gammazero gammazero left a comment

LGTM

@guillaumemichel guillaumemichel merged commit 95aa3f0 into main Apr 24, 2025
19 of 20 checks passed
@guillaumemichel guillaumemichel deleted the prioritize-new-provides branch April 24, 2025 07:50
@lidel lidel mentioned this pull request Apr 24, 2025