
Conversation


@cuviper cuviper commented Oct 23, 2025

On HashTable<T, A>:

```rust
pub fn num_buckets(&self) -> usize;
pub fn find_bucket_index(&self, hash: u64, eq: impl FnMut(&T) -> bool) -> Option<usize>;
pub fn get_bucket_entry(&mut self, index: usize) -> Result<OccupiedEntry<'_, T, A>, AbsentEntry<'_, T, A>>;
pub unsafe fn get_bucket_entry_unchecked(&mut self, index: usize) -> OccupiedEntry<'_, T, A>;
pub fn get_bucket(&self, index: usize) -> Option<&T>;
pub unsafe fn get_bucket_unchecked(&self, index: usize) -> &T;
pub fn get_bucket_mut(&mut self, index: usize) -> Option<&mut T>;
pub unsafe fn get_bucket_unchecked_mut(&mut self, index: usize) -> &mut T;
pub fn iter_buckets(&self) -> IterBuckets<'_>;
pub fn iter_hash_buckets(&self, hash: u64) -> IterHashBuckets<'_>;
```

On OccupiedEntry<'_, T, A>:

```rust
pub fn bucket_index(&self) -> usize;
```

IterHashBuckets<'_> implements Clone, Debug, Default, Iterator<Item = usize>, and FusedIterator.
IterBuckets<'_> implements all those and also ExactSizeIterator.
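
For illustration, a minimal usage sketch of the proposed methods (the `RandomState` hashing setup is just for the example):

```rust
use hashbrown::HashTable;
use std::hash::{BuildHasher, RandomState};

fn main() {
    let mut table: HashTable<(u32, &str)> = HashTable::new();
    let state = RandomState::new();
    let hash = |k: &u32| state.hash_one(k);

    table.insert_unique(hash(&1), (1, "one"), |&(k, _)| hash(&k));

    // Find the bucket index once, then address the entry by index later
    // without re-probing by hash.
    let index = table
        .find_bucket_index(hash(&1), |&(k, _)| k == 1)
        .expect("just inserted");
    assert!(index < table.num_buckets());
    assert_eq!(table.get_bucket(index), Some(&(1, "one")));
}
```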

Closes #613, although I chose different names.

@cuviper cuviper requested a review from Amanieu October 23, 2025 01:29

cuviper commented Oct 23, 2025

although I chose different names.

I'm still willing to change if other names are preferred.


@clarfonthey clarfonthey left a comment


I think that get_bucket and get_bucket_mut are good, but buckets should be renamed to num_buckets to make it clear it's returning a count and not an iterator.

In terms of find_bucket_index and get_bucket_entry, I left some comments. But if you wanted to merge the smaller set first, I'm fine with that.


@clarfonthey clarfonthey left a comment

The reason will be displayed to describe this comment to others. Learn more.

Also, not required at all, but since a lot of code using the HashTable API is going to be unsafe anyway, it might be worth adding get_bucket_unchecked and get_bucket_unchecked_mut as well, which don't return Option.


moulins commented Oct 23, 2025

Does get_bucket_entry really need to take a hash? The Entry structs currently store the full hash, but I suspect the code could be refactored to only store a raw::control::Tag, which can then be fetched from the hashtable storage when constructing an OccupiedEntry.


cuviper commented Oct 23, 2025

Also, not required at all, but since a lot of code using the HashTable API is going to be unsafe anyway, it might be worth adding get_bucket_unchecked and get_bucket_unchecked_mut as well, which don't return Option.

I personally think HashTable should emphasize safe usage, but I guess that line has already been breached with get_disjoint_unchecked_mut. These would be easy to add if desired.

Does get_bucket_entry really need to take a hash? The Entry structs currently store the full hash, but I suspect the code could be refactored to only store a raw::control::Tag, which can then be fetched from the hashtable storage when constructing an OccupiedEntry.

Ooh, I like that! We couldn't support the full Option<Entry> suggested above, but this avoids all my concerns about breaking the hash probe sequence if there's no new hash involved. I think we also only need this Tag in VacantEntry, because OccupiedEntry::remove can just get it from the control byte.


cuviper commented Oct 23, 2025

Calling out this additional benefit of the Tag change, as I wrote in the commit:

Also, since OccupiedEntry is now smaller, enum Entry will be the
same size as VacantEntry by using a niche for the discriminant.
(Although this is not guaranteed by the compiler.)
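
A quick way to observe this (only an observation; as noted, the niche layout is not guaranteed, and the element type here is arbitrary):

```rust
use core::mem::size_of;
use hashbrown::hash_table::{Entry, VacantEntry};

fn main() {
    // With OccupiedEntry carrying no extra payload, the enum discriminant
    // can live in a niche, so these two sizes come out equal in practice.
    println!(
        "Entry: {}, VacantEntry: {}",
        size_of::<Entry<'static, u32>>(),
        size_of::<VacantEntry<'static, u32>>()
    );
}
```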

@clarfonthey

re: unsafe methods, I guess the main concern is whether the bounds checks can easily be optimised out. While there isn't the same robust codegen testing here as in the compiler, maybe it would be worth poking around on godbolt and seeing how find_bucket_index followed by one of the relevant methods is handled.
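
For example, a pair of functions one could feed to godbolt for that comparison (hypothetical helper names, written against the API proposed here):

```rust
use hashbrown::HashTable;

// Checked: the index comes straight from find_bucket_index, but get_bucket
// still re-checks bounds and bucket fullness.
pub fn lookup_checked(table: &HashTable<u64>, hash: u64, key: u64) -> Option<&u64> {
    let index = table.find_bucket_index(hash, |&v| v == key)?;
    Some(table.get_bucket(index).unwrap())
}

// Unchecked: the caller relies on the index pointing at a full bucket of this table.
pub fn lookup_unchecked(table: &HashTable<u64>, hash: u64, key: u64) -> Option<&u64> {
    let index = table.find_bucket_index(hash, |&v| v == key)?;
    Some(unsafe { table.get_bucket_unchecked(index) })
}
```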


cuviper commented Oct 23, 2025

I've added those two unchecked methods.

seeing how find_bucket_index followed by one of the relevant methods is handled.

Using cargo asm I can confirm that the optimizer does not see that the checks are redundant after find_bucket_index -- compared to get_bucket_unchecked(index), the get_bucket(index).unwrap() version adds:

```diff
+       inc rdi
+       cmp r10, rdi
+       jae .LBB6_11
+       cmp byte ptr [rcx + r10], 0
+       js .LBB6_11
        add rax, -8
        pop rcx
        .cfi_def_cfa_offset 8
        ret
```

... where .LBB6_11 calls unwrap_failed.


clarfonthey commented Oct 23, 2025

I guess that another benefit to bumping MSRV would be the ability to add assert_unchecked here to potentially help, but I'm not 100% sure it actually would in this case. It's very finicky in cases like this.

Even without an MSRV bump, I guess you could use the nightly feature as a temporary measure.


cuviper commented Oct 24, 2025

I assume you mean adding that in find_bucket_index?

    pub fn find_bucket_index(&self, hash: u64, eq: impl FnMut(&T) -> bool) -> Option<usize> {
        match self.raw.find(hash, eq) {
            Some(bucket) => Some(unsafe {
                let index = self.raw.bucket_index(&bucket);
                core::hint::assert_unchecked(index < self.raw.buckets());
                core::hint::assert_unchecked(self.raw.is_bucket_full(index));
                index
            }),
            None => None,
        }
    }

It does remove the unwrap in my simple back-to-back test, but it also pessimizes the unchecked case a little, recomputing the final address. Different variations of that assertion had the same effect.

```diff
+        neg r9
+        lea rax, [rax + 8*r9]
         add rax, -8
         pop rcx
         .cfi_def_cfa_offset 8
         ret
```

In a real scenario I expect you'd also have other interim code, and who knows whether the assertion could still be propagated through it. I don't think we should bother until there's a real measured benefit, and anyway even the unhinted checked case is still much cheaper than the full hash probe you would have needed before!


cuviper commented Oct 24, 2025

RE MSRV, I recently bumped indexmap to 1.82, which is just over a year old.


Amanieu commented Nov 1, 2025

Is there any downside to having get_bucket_mut always return an OccupiedEntry? It's strictly more flexible than returning a &mut T, and this API is only intended for advanced use cases anyway. Same with the unchecked version.

Separately, we probably want to add a method to find the next non-empty bucket after X. This would address the use cases in #420 and #382.
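
For reference, that lookup can already be expressed (inefficiently) on top of the methods in this PR; a hypothetical sketch, where a real method would presumably scan control bytes instead:

```rust
use hashbrown::HashTable;

// Hypothetical helper: index of the first non-empty bucket at or after `start`,
// or None if no full bucket remains.
fn next_full_bucket<T>(table: &HashTable<T>, start: usize) -> Option<usize> {
    (start..table.num_buckets()).find(|&i| table.get_bucket(i).is_some())
}
```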


cuviper commented Nov 1, 2025

Is there any downside to having get_bucket_mut always return an OccupiedEntry? It's strictly more flexible than returning a &mut T and this API is only intended for advanced use cases anyways. Same with the unchecked version.

i.e. instead of the separate get_bucket_entry, I suppose? At first I needed the hash value for that, which was a downside if you only wanted &mut T, but now I guess they are functionally equivalent.

I suppose the downside now is just that the caller would have to use an extra .get_mut() or .into_mut() on the entry if that's all they want, and hope that the compiler does optimize the unused stuff away. It probably will though.

Hmm, if OccupiedEntry implemented Deref<Target = T> and DerefMut, then my experimental branch in indexmap wouldn't even need to change the way it's currently using get_bucket_mut(). That could be nice, but it's not required.
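
A rough sketch of that Deref idea (not part of this PR; it assumes the existing OccupiedEntry::get and get_mut accessors and would live inside hashbrown):

```rust
use core::ops::{Deref, DerefMut};

impl<T, A: Allocator> Deref for OccupiedEntry<'_, T, A> {
    type Target = T;

    // Borrow the occupied bucket's value, same as OccupiedEntry::get.
    fn deref(&self) -> &T {
        self.get()
    }
}

impl<T, A: Allocator> DerefMut for OccupiedEntry<'_, T, A> {
    // Mutably borrow the value, same as OccupiedEntry::get_mut.
    fn deref_mut(&mut self) -> &mut T {
        self.get_mut()
    }
}
```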


cuviper commented Nov 1, 2025

Note that there is a similar trio in find, find_mut, and find_entry as well.


cuviper commented Nov 1, 2025

I went ahead and made that change to try it -- easy enough to drop that commit if we change our minds. :)


Amanieu commented Nov 1, 2025

Hmm, on second thought it might be better to keep get_bucket_mut returning &mut T and have get_bucket_entry returning Result<OccupiedEntry, AbsentEntry> for the same reason as find_entry (reasoning explained here). I prefer keeping the symmetry with the find_ methods.

The only missing functionality here is the ability to insert a value just by bucket index, but it's not clear how useful this would be since the only way to get a valid index for this is by re-using the index of a previous insertion of the same key.


Amanieu commented Nov 1, 2025

Also possibly a bucket variant of iter_hash which returns all the full buckets that may match the given hash.

@cuviper cuviper force-pushed the table-bucket branch 3 times, most recently from 069761b to 5796895 on November 1, 2025 18:37

cuviper commented Nov 1, 2025

  • Reverted get_bucket_mut (and unchecked) back to &mut T
  • Changed get_bucket_entry to return Result<OccupiedEntry, AbsentEntry>
  • Added get_bucket_entry_unchecked returning OccupiedEntry
  • Added iter_buckets and iter_hash_buckets

The only missing functionality here is the ability to insert a value just by bucket index, but it's not clear how useful this would be since the only way to get a valid index for this is by re-using the index of a previous insertion of the same key.

I don't think we should support this, as it would let you insert values in places that wouldn't be probed by the same hash, especially if we're not making any guarantees about index stability after removals. You can do something like:

    let (old, vacant) = table.get_bucket_entry(index).unwrap().remove();
    let new_index = vacant.insert(new).bucket_index();

... but I think even there we're not promising that the new index will be the same. Supposing we were to shift items to avoid tombstones, then the vacant entry might already represent a different bucket.

Comment on lines +1076 to +1077
/// The order in which the iterator yields indices is unspecified
/// and may change in the future.

I think it would actually be fine to guarantee increasing order here, but the underlying raw::FullBucketsIndices is explicitly disclaiming that.

Comment on lines +1208 to +1209
/// The order in which the iterator yields indices is unspecified
/// and may change in the future.

This one should definitely be unspecified though, since it follows the probe order.

@cuviper cuviper force-pushed the table-bucket branch 2 times, most recently from 6ae99e1 to d043b0f on November 1, 2025 19:17

cuviper commented Nov 1, 2025

Another possible method is get_disjoint_buckets_mut (and unchecked) -- I'm willing to implement that if there's interest, but I don't want this PR to get out of hand. :)

`VacantEntry` now stores a `Tag` instead of a full `hash: u64`. This
means that `OccupiedEntry` doesn't need to store anything, because it
can get that tag from the control byte for `remove -> VacantEntry`.
The `get_bucket_entry` method doesn't need a hash argument either.

Also, since `OccupiedEntry` is now smaller, `enum Entry` will be the
same size as `VacantEntry` by using a niche for the discriminant.
(Although this is not _guaranteed_ by the compiler.)

cuviper commented Nov 4, 2025

Rebased to resolve conflicts with the InsertSlot removal.

@clarfonthey

FWIW, I was looking into what it would take to implement HashMap and HashSet on top of HashTable, and it does look like these APIs would be necessary for that, at least in part. I haven't looked again since the number of methods doubled, but I think it's a good change.
