Merged
Changes from all commits
277 commits
5fe5666
fixes missed
itazap Oct 10, 2025
51e62e1
gemma test fix
itazap Oct 10, 2025
0e5dbdf
refactor
itazap Oct 14, 2025
9136d3c
rm legacy from llama
itazap Oct 14, 2025
ab77f57
added renaming
itazap Oct 14, 2025
36bc3ef
add _model
itazap Oct 14, 2025
c4f045c
update legacy
itazap Oct 14, 2025
c80dd1d
update legacy
itazap Oct 14, 2025
790c092
fix docstring
itazap Oct 14, 2025
f4d956a
always load blank, then set _tokenizer if we have it
itazap Oct 14, 2025
b2c320c
new toks
itazap Oct 15, 2025
0c3caff
update all berttokenizer based models
itazap Oct 15, 2025
d43412a
apply feedback - delete bert duplicates
itazap Oct 16, 2025
48eeb50
more models --> fast only
itazap Oct 17, 2025
d3a3cbd
more convert_slow models
itazap Oct 20, 2025
493f9e0
fix common test refs
itazap Oct 20, 2025
a51cea0
updating fast only tokenizers
itazap Oct 20, 2025
d9c1ec3
openai and pegasus
itazap Oct 21, 2025
d879bc3
enable sentencepiecebackend
itazap Oct 22, 2025
ca51029
more models
itazap Oct 22, 2025
132c617
code gen
itazap Oct 22, 2025
ed5bf86
t5
itazap Oct 22, 2025
158b444
code gen tests
itazap Oct 22, 2025
64eaf88
speecht5
itazap Oct 22, 2025
95f48d3
mbart
itazap Oct 22, 2025
f3248d2
mbart50
itazap Oct 22, 2025
f3dd103
more models
itazap Oct 22, 2025
c66037d
more models
itazap Oct 23, 2025
cb5e08b
layouglmv2
itazap Oct 23, 2025
3159033
update tests
itazap Oct 24, 2025
a14a45d
update tests
itazap Oct 24, 2025
7ca10f8
update tests
itazap Oct 25, 2025
f5cbc49
pretrainedtokenizer
itazap Oct 27, 2025
72e8043
whisper
itazap Oct 28, 2025
3cd8e5b
whisper
itazap Oct 28, 2025
4bf2b85
layoutxlm and storing backends
itazap Oct 28, 2025
2ef0fd3
refactor sentencepiecebackend and additional_special_tokens
itazap Oct 29, 2025
5c7d347
renaming tokenization_utils --> tokenization_python
itazap Oct 29, 2025
fcf67ff
udpate tests
itazap Oct 30, 2025
a8ccf16
bert test
itazap Oct 30, 2025
ccca98e
blenderbot
itazap Oct 30, 2025
c118c10
clip
itazap Oct 30, 2025
0f74081
codegen
itazap Oct 30, 2025
a11dba7
code_llama
itazap Oct 30, 2025
b678cde
cohere
itazap Oct 30, 2025
ea9a546
deberata, deberat v2, funnel
itazap Oct 30, 2025
ffbdecf
gpt2
itazap Oct 30, 2025
9f08ade
batch update tests
itazap Oct 30, 2025
a7cd5c0
pegasus qwen2 roberta
itazap Oct 30, 2025
b5b3cd9
more models
itazap Oct 31, 2025
1250bcc
layout tests
itazap Oct 31, 2025
cf72cae
some renaming
itazap Oct 31, 2025
4fafdcc
fix references to utils_fast
itazap Oct 31, 2025
236f9f1
fix refs
itazap Oct 31, 2025
cd743bf
fix refs
itazap Oct 31, 2025
0e7e593
fix refs
itazap Oct 31, 2025
2af6d2c
fix refs
itazap Oct 31, 2025
b58b7b1
fix refs
itazap Oct 31, 2025
518dcaf
fix refs
itazap Oct 31, 2025
0f2f4b6
fix refs
itazap Oct 31, 2025
c849148
fix some tests
itazap Nov 2, 2025
0d54bbd
regression
itazap Nov 2, 2025
81a140a
fix refs
itazap Nov 2, 2025
61366d6
fix refs
itazap Nov 3, 2025
4374a66
missed the most crucial file in my last commit
itazap Nov 3, 2025
df383d7
fix refs
itazap Nov 4, 2025
b8035ec
fix refs
itazap Nov 4, 2025
37e1b92
fix refs
itazap Nov 4, 2025
9b45774
batch encode fix
itazap Nov 4, 2025
a24856d
fix some tests
itazap Nov 6, 2025
1868870
BC for batch_decode bc too many refs
itazap Nov 6, 2025
35dd250
more tests
itazap Nov 6, 2025
b0428f3
fix more tests
itazap Nov 6, 2025
8fe6873
fix for processors
itazap Nov 6, 2025
c1e0e46
fixing more models
itazap Nov 10, 2025
79568cd
deleted mbart50 by accident
itazap Nov 10, 2025
cfa159a
seamless m4t
itazap Nov 10, 2025
5854f4c
albert fix
itazap Nov 10, 2025
714a856
whisper
itazap Nov 11, 2025
c016f11
layout3
itazap Nov 11, 2025
2e3e178
attempt to fix cached tokenizers on CI
itazap Nov 11, 2025
03e3ab9
trying another fix on CI
itazap Nov 11, 2025
2c30d79
again try to work around CI
itazap Nov 11, 2025
98f51d5
bertweet
itazap Nov 11, 2025
96f0517
tapas
itazap Nov 11, 2025
c26f54b
mbart50
itazap Nov 12, 2025
da0bbf0
luke
itazap Nov 12, 2025
494ef3e
mluke
itazap Nov 13, 2025
39bb884
markuplm
itazap Nov 13, 2025
960dfcf
markuplm
itazap Nov 13, 2025
54992a0
fix some more auto tests
itazap Nov 13, 2025
d0383bd
some random model failures
itazap Nov 14, 2025
a969c6b
mistralcommontestser
itazap Nov 14, 2025
2bf4a13
more fixes
itazap Nov 14, 2025
e88322f
ref fix
itazap Nov 14, 2025
cfb0100
siglip
itazap Nov 14, 2025
0fd1066
marian
itazap Nov 14, 2025
02c524c
plbart
itazap Nov 14, 2025
820191e
update utils toks
itazap Nov 14, 2025
0cd714d
seamless m4t
itazap Nov 16, 2025
8a412bc
roc bert
itazap Nov 17, 2025
e8c3258
udpate byt5 test
itazap Nov 17, 2025
85a3b1f
xlm
itazap Nov 17, 2025
45e718f
esm
itazap Nov 17, 2025
96fc467
roformer
itazap Nov 17, 2025
7727e3b
code llama
itazap Nov 17, 2025
6795515
biogpt
itazap Nov 17, 2025
2f49a39
m2m100
itazap Nov 17, 2025
a42e7a8
dpr and flaubert
itazap Nov 18, 2025
33634be
xlm and speech to text
itazap Nov 18, 2025
ca5e389
tok backend pass object
itazap Nov 18, 2025
25021d4
tokenizer object pass
itazap Nov 18, 2025
69610fe
wav2vec2
itazap Nov 18, 2025
51799ca
wav2vec2
itazap Nov 18, 2025
f23abc3
cpmant
itazap Nov 18, 2025
88f0db5
update utils tokenizers
itazap Nov 18, 2025
077e6f8
cpmant
itazap Nov 18, 2025
e004b56
bartpho
itazap Nov 18, 2025
e069763
test apply chat template assistant mask
itazap Nov 18, 2025
9df9cfc
apply chat template video
itazap Nov 18, 2025
dc9b1ae
apply chat template assistant mask
itazap Nov 18, 2025
4c05e9d
test torch
itazap Nov 18, 2025
5c209a4
update from slow in base and fix donut processor errors
itazap Nov 19, 2025
d8a8db8
auto to point to tokenizers backend, fix kosmos2
itazap Nov 19, 2025
6b40d91
some non model fixes for old slow models that no longer have their ow…
itazap Nov 19, 2025
976265b
missed file from last commit
itazap Nov 19, 2025
b6ca8b2
idefics2
itazap Nov 19, 2025
5c72105
fixup
ArthurZucker Nov 19, 2025
964b461
fixup
ArthurZucker Nov 19, 2025
0381407
pretrained tokenizer fast test update
itazap Nov 19, 2025
887b477
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker Nov 19, 2025
f4c46ab
stash
ArthurZucker Nov 19, 2025
efbbb04
Merge branch 'one_tokenizer' of github.com:huggingface/transformers i…
ArthurZucker Nov 19, 2025
71ef282
bad merged
ArthurZucker Nov 19, 2025
a5b018c
cherry pick more stuff that did not merge well
ArthurZucker Nov 19, 2025
8ea91f6
fix gptsw3
ArthurZucker Nov 19, 2025
1947894
nit warn for now
ArthurZucker Nov 19, 2025
20a06ff
update error raising
ArthurZucker Nov 19, 2025
aa197a0
just ran fixup
ArthurZucker Nov 19, 2025
63c7c1c
bring back bert legacy
ArthurZucker Nov 19, 2025
5895bab
fix
ArthurZucker Nov 19, 2025
6b8217b
nit
ArthurZucker Nov 19, 2025
184ed58
fix 56 errors on blenderbotsmall?
ArthurZucker Nov 19, 2025
09e4021
18 for blenderbotsmall
ArthurZucker Nov 19, 2025
a8c299e
tok auto
itazap Nov 19, 2025
1259052
missed clip
itazap Nov 19, 2025
06e3485
fix tests
itazap Nov 20, 2025
3a95bf1
something missed
itazap Nov 20, 2025
05d5c08
token healing
itazap Nov 20, 2025
78f4e58
tok common tests update - nonmodel
itazap Nov 20, 2025
8fbaf83
try to fix non-model test in test_tokenization_utils
itazap Nov 20, 2025
fd40b1b
fix hub tests
itazap Nov 20, 2025
70330b8
try to fix hub tests
itazap Nov 20, 2025
7c78007
custom vocab related fixed
itazap Nov 20, 2025
ca1f6b0
bert jap
itazap Nov 20, 2025
dd3ae59
BERT JAP
itazap Nov 20, 2025
2e1893f
rename bert legacy to bert legacy
itazap Nov 20, 2025
f4be6a9
Wav2vec2
itazap Nov 20, 2025
919103a
fix in tok python to update total vocab size - fixes speech t5
itazap Nov 20, 2025
c452f92
blender bot small
itazap Nov 20, 2025
6d167eb
forgot test file
itazap Nov 20, 2025
025722b
test failures
itazap Nov 21, 2025
7d1d0d3
marian
itazap Nov 21, 2025
dfb67a4
gpt2 tiktoken
itazap Nov 21, 2025
51da6b2
big bird / marian
itazap Nov 21, 2025
c611058
udop
itazap Nov 21, 2025
cc4a972
forgot couple changes
itazap Nov 21, 2025
51202da
test_serve fix
itazap Nov 21, 2025
ca988b9
missing import
itazap Nov 21, 2025
f5bc69e
a couple processors fixes
itazap Nov 21, 2025
c67de10
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker Nov 24, 2025
045bbff
style partly
ArthurZucker Nov 24, 2025
75662fd
fix to fetch tests ci
itazap Nov 24, 2025
8d248a3
Revert branch back to commit f5bc69ef state
itazap Nov 24, 2025
4c29924
revert branch to styling
itazap Nov 24, 2025
189cabd
update mistral after merge
itazap Nov 24, 2025
e02741c
fixes for non model tests
itazap Nov 25, 2025
b828ae1
some processor test fixes
itazap Nov 26, 2025
83b579c
more processor test fixes
itazap Nov 26, 2025
2ce27bc
more processor fixes
itazap Nov 26, 2025
881b97c
hub tests
itazap Nov 26, 2025
2e28b3d
python tok utils
itazap Nov 26, 2025
925d187
fix hub test
itazap Nov 26, 2025
6624231
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker Nov 26, 2025
437321b
make style for now
ArthurZucker Nov 26, 2025
cd4d3ac
remove problemattic fic copies
ArthurZucker Nov 26, 2025
5c5864f
python utils/check_copies.py --fix_and_overwrite
ArthurZucker Nov 26, 2025
2f13c13
more styling
ArthurZucker Nov 26, 2025
1e1aa11
fixup
ArthurZucker Nov 26, 2025
5eeb1fe
silence docstirng
ArthurZucker Nov 26, 2025
dea8e1e
fix import?
ArthurZucker Nov 26, 2025
452d6d8
fix imports
ArthurZucker Nov 26, 2025
e650205
add the local test as well
ArthurZucker Nov 26, 2025
3dd1716
throw spm error
itazap Nov 26, 2025
e700dfa
llamas
itazap Nov 26, 2025
ce23d67
fix a couple tests
itazap Nov 26, 2025
ff1bf36
broke ci
itazap Nov 26, 2025
0bdfeae
broke ci
itazap Nov 26, 2025
a137649
broke ci
itazap Nov 26, 2025
366597c
broke ci
itazap Nov 26, 2025
22887b1
add logs to debug gemma on ci
itazap Nov 26, 2025
73819f4
gemma and llama
itazap Nov 26, 2025
c24c997
gemma
itazap Nov 26, 2025
551a959
revert las commit
itazap Nov 26, 2025
a18e84d
gemma debug
itazap Nov 26, 2025
c23ee13
gemma debug
itazap Nov 26, 2025
93187b3
gemma
itazap Nov 26, 2025
81428ef
safely import spiece backend
itazap Nov 27, 2025
eb95c2e
tok tests
itazap Nov 27, 2025
24d89c4
check none
itazap Nov 27, 2025
e2c4434
setup and qual
itazap Nov 27, 2025
7a737b7
ruff
itazap Nov 27, 2025
a19c90c
del dev files
itazap Nov 27, 2025
18e7484
tok auto
itazap Nov 27, 2025
3cdd8ee
fill docstrings
itazap Nov 27, 2025
50756c4
update auto
itazap Nov 27, 2025
6bccb46
blenderbot small nit
itazap Nov 27, 2025
a76015a
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker Nov 27, 2025
4afb570
add migration guide
ArthurZucker Nov 27, 2025
be1d95a
move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor`
ArthurZucker Nov 27, 2025
fad31d7
rename MistralCommonTokenizer to MistralCommonB ackend
ArthurZucker Nov 27, 2025
d4aff20
Merge branch 'one_tokenizer' of github.com:huggingface/transformers i…
ArthurZucker Nov 27, 2025
3ab4bec
nit
ArthurZucker Nov 27, 2025
0c1a40a
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker Nov 27, 2025
30f1640
fix failures
ArthurZucker Nov 27, 2025
f2a1482
fixup
ArthurZucker Nov 27, 2025
d8010f8
remoove one old test
ArthurZucker Nov 27, 2025
82e5675
mark the slow one as slow
ArthurZucker Nov 27, 2025
088fc39
very small fixes
ArthurZucker Nov 27, 2025
f677ddf
update auto mapping for missing ones
ArthurZucker Nov 27, 2025
d30e46b
fixup lorsd
ArthurZucker Nov 27, 2025
ad24f43
fixup doc and stuff
ArthurZucker Nov 27, 2025
ebfe7f1
should be the final fixe
ArthurZucker Nov 27, 2025
c4a743d
processing update
ArthurZucker Nov 27, 2025
f81a966
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker Nov 27, 2025
9a5638d
update
ArthurZucker Nov 27, 2025
7c32dfb
FIX or brute AI fix the llava test
ArthurZucker Nov 27, 2025
c520a66
style
ArthurZucker Nov 27, 2025
718b2f0
slow?
ArthurZucker Nov 27, 2025
20d9036
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker Nov 27, 2025
8f536c2
fix is offline mode?
ArthurZucker Nov 27, 2025
e96c18b
fix mt5
ArthurZucker Nov 27, 2025
5ce65b8
One tok utils (#42462)
itazap Nov 27, 2025
4418e8a
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker Nov 27, 2025
7f9954a
fix cohere
ArthurZucker Nov 27, 2025
bfa5fd0
Merge branch 'one_tokenizer' of github.com:huggingface/transformers i…
ArthurZucker Nov 27, 2025
4dce834
?
ArthurZucker Nov 27, 2025
fcdc9bb
up
ArthurZucker Nov 27, 2025
a5a3a7c
am I dumbb?
ArthurZucker Nov 27, 2025
0244be9
grumble
ArthurZucker Nov 27, 2025
201 changes: 201 additions & 0 deletions MIGRATION_GUIDE_V5.md
@@ -74,6 +74,207 @@ While this is being implemented, expect varying levels of support across differe

Linked PR: https://github.com/huggingface/transformers/pull/41580




## Tokenization

Just as we moved towards a single backend library for model definition, we want `Tokenizer` to be a lot more intuitive.
With v5, you can now initialize an empty `LlamaTokenizer` and train it directly on your new task!

Defining a new tokenizer object should be as simple as this:
```python
from transformers import TokenizersBackend, generate_merges
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        # Default to a minimal vocabulary containing only the special tokens
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab

        # Derive merges from the vocabulary if they are not provided
        if merges is not None:
            self._merges = merges
        else:
            self._merges = generate_merges(self._vocab)

        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, unk_token=str(unk_token), fuse_unk=True)
        )
        # Llama-style pre-tokenization: prepend "▁" only to the first word
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme="first", split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )
```

And now, calling `Llama5Tokenizer()` gives you an empty, trainable tokenizer that follows the definition chosen by the authors of `Llama5` (which does not exist yet :wink:).

The above is the main motivation behind refactoring tokenization: we want people to instantiate a tokenizer just like they would a model, empty or not, with exactly the components they defined.

### Non-tokenizers
If your tokenizer is not common, or you just don't want to rely on `sentencepiece` nor `tokenizers`, you can import the `PythonBackend` (previously `PreTrainedTokenizer`), which has all the API and logic for added tokens, encoding and decoding with them, etc.

If you want even fewer features, you can use the common `PreTrainedTokenizerBase` mixin, which mostly defines the `transformers` tokenizer API: `encode`, `decode`, `vocab_size`, `get_vocab`, `convert_tokens_to_ids`, `convert_ids_to_tokens`, `from_pretrained`, `save_pretrained`, etc.
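
For illustration, here is a minimal sketch of such a backend-free tokenizer. It assumes `PythonBackend` keeps the same override points as the former `PreTrainedTokenizer` (`_tokenize`, `_convert_token_to_id`, `_convert_id_to_token`, `get_vocab`); the class and its vocabulary are hypothetical:

```python
from transformers import PythonBackend


class ToyWhitespaceTokenizer(PythonBackend):
    """Toy tokenizer that splits on whitespace; unknown words map to <unk>."""

    def __init__(self, vocab=None, unk_token="<unk>", **kwargs):
        # Assumption: special tokens are still passed to the base class via kwargs
        self._vocab = vocab or {str(unk_token): 0}
        self._ids_to_tokens = {i: t for t, i in self._vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self._vocab)

    def get_vocab(self):
        return dict(self._vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self._vocab.get(token, self._vocab[str(self.unk_token)])

    def _convert_id_to_token(self, index):
        return self._ids_to_tokens.get(index, str(self.unk_token))
```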

### Backend Architecture Changes

**Moving away from "slow" vs "fast" tokenizers:**

Previously, transformers maintained two parallel implementations for many tokenizers:
- "Slow" tokenizers (`tokenization_<model>.py`) - Python-based implementations, often using [SentencePiece](https://github.com/google/sentencepiece) as the backend.
- "Fast" tokenizers (`tokenization_<model>_fast.py`) - Rust-based implementations using the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library.

In v5, we consolidate to a single tokenizer file per model: `tokenization_<model>.py`. This file will use the most appropriate backend available:

1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general its performance is better, and it also offers many more features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, etc.
2. **SentencePieceBackend**: For models requiring SentencePiece.
3. **PythonBackend**: Pure Python implementations.
4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library (previously `MistralCommonTokenizer`).

The `AutoTokenizer` automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use `AutoTokenizer.from_pretrained()` as before. It also keeps transformers future-proof and modular, making it easy to support additional backends.
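
For instance (using `t5-small` purely as an example checkpoint; the concrete class you get depends on the files in the repo and the libraries installed):

```python
from transformers import AutoTokenizer

# Same call as in v4; the backend is chosen based on the files shipped with the repo
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# The concrete tokenizer class subclasses one of the backends listed above
print(type(tokenizer).__name__)
print([cls.__name__ for cls in type(tokenizer).__mro__])
```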


### API Changes

**1. Direct tokenizer initialization with vocab and merges:**

In v5, you can now initialize tokenizers directly with vocabulary and merges, enabling training custom tokenizers from scratch:

```python
# v5: Initialize a blank tokenizer for training
from transformers import LlamaTokenizer

# Create a tokenizer with a custom vocabulary and merges
# (every merge result must itself be a token in the vocabulary)
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "h": 3, "e": 4, "l": 5, "o": 6, "he": 7, "ll": 8, "hell": 9, "hello": 10}
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)

# Or initialize a blank tokenizer to train on your own dataset
tokenizer = LlamaTokenizer() # Creates a blank Llama-like tokenizer
```
However, you can no longer pass a vocabulary file path, since that use case is handled by `from_pretrained`.
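
A blank tokenizer can then be trained on your own corpus. A minimal sketch, assuming `train_new_from_iterator` carries over from the v4 fast tokenizers with the same signature:

```python
from transformers import LlamaTokenizer

corpus = ["hey how are you?", "fine, thanks!", "tokenizers are fun"]

blank = LlamaTokenizer()  # blank, Llama-style tokenizer
trained = blank.train_new_from_iterator(corpus, vocab_size=128)

print(trained.tokenize("hey how are you?"))
trained.save_pretrained("./my-llama5-tokenizer")
```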

**2. Simplified decoding API:**

The `batch_decode` method has been unified with `decode`. Both single and batch decoding now use the same method:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))
```
Gives:
```diff
- 'hey how are you?</s> fine</s>'
+ ['hey how are you?</s>', 'fine</s>']
```

This change is mostly because people get a `list[list[int]]` out of `generate`, then naturally reach for `decode` (to mirror the `encode` they used) and, previously, would get:
```python
...: tokenizer.decode([[1,2], [1,4]])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[2], line 4
2 tokenizer = AutoTokenizer.from_pretrained("t5-small")
3 inputs = ["hey how are you?", "fine"]
----> 4 tokenizer.decode([[1,2], [1,4]])

File /raid/arthur/transformers/src/transformers/tokenization_utils_base.py:3948, in PreTrainedTokenizerBase.decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
3945 # Convert inputs to python lists
3946 token_ids = to_py_obj(token_ids)
-> 3948 return self._decode(
3949 token_ids=token_ids,
3950 skip_special_tokens=skip_special_tokens,
3951 clean_up_tokenization_spaces=clean_up_tokenization_spaces,
3952 **kwargs,
3953 )

File /raid/arthur/transformers/src/transformers/tokenization_utils_fast.py:682, in PreTrainedTokenizerFast._decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
680 if isinstance(token_ids, int):
681 token_ids = [token_ids]
--> 682 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
684 clean_up_tokenization_spaces = (
685 clean_up_tokenization_spaces
686 if clean_up_tokenization_spaces is not None
687 else self.clean_up_tokenization_spaces
688 )
689 if clean_up_tokenization_spaces:

TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
```

**3. Unified encoding API:**

`encode_plus` is deprecated → call the tokenizer directly via `__call__` instead.
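
A before/after sketch (using `t5-small` purely as an example checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# v4
# enc = tokenizer.encode_plus("hey how are you?", return_tensors="pt")

# v5: call the tokenizer directly
enc = tokenizer("hey how are you?", return_tensors="pt")
print(enc.keys())  # dict_keys(['input_ids', 'attention_mask'])
```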

**4. `apply_chat_template` returns `BatchEncoding`:**

Previously, `apply_chat_template` returned `input_ids` for backward compatibility. In v5, it now consistently returns a `BatchEncoding` dict like other tokenizer methods:

```python
# v5
messages = [
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi there!"}
]

# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys()) # dict_keys(['input_ids', 'attention_mask'])
```

#### Removed legacy configuration file saving:

- `special_tokens_map.json` - special tokens are now stored in `tokenizer_config.json`.
- `added_tokens.json` - added tokens are now stored in `tokenizer.json`.
- `added_tokens_decoder` is only stored when there is no `tokenizer.json`.

When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format.
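
As a quick check, saving a tokenizer and listing the output directory should reflect the consolidated format (a sketch; the exact file set varies per model and backend):

```python
import os

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.save_pretrained("./t5-tokenizer")

print(sorted(os.listdir("./t5-tokenizer")))
# Expect tokenizer.json and tokenizer_config.json; new saves no longer write
# a separate special_tokens_map.json or added_tokens.json.
```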

### Model-Specific Changes

Several models that had identical tokenizers now import from their base implementation:

- **LayoutLM** → uses BertTokenizer
- **LED** → uses BartTokenizer
- **Longformer** → uses RobertaTokenizer
- **LXMert** → uses BertTokenizer
- **MT5** → uses T5Tokenizer
- **MVP** → uses BartTokenizer

These files will eventually be removed.

**Removed T5-specific workarounds:**

The internal `_eventually_correct_t5_max_length` method has been removed. T5 tokenizers now handle max length consistently with other models.

### Testing Changes

Model-specific tokenization test files now focus on integration tests.
Common tokenization API tests (e.g., `add_tokens`, `encode`, `decode`) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.


For legacy implementations, the original BERT Python tokenizer code (including `WhitespaceTokenizer`, `BasicTokenizer`, etc.) is preserved in `bert_legacy.py` for reference purposes.

**Linked PRs:**
- https://github.com/huggingface/transformers/issues/40938
- https://github.com/huggingface/transformers/pull/40936
- https://github.com/huggingface/transformers/pull/41626


## Library-wide changes with lesser impact

### `use_auth_token`
6 changes: 1 addition & 5 deletions docs/source/en/internal/tokenization_utils.md
@@ -18,8 +18,7 @@ rendered properly in your Markdown viewer.

This page lists all the utility functions used by the tokenizers, mainly the class
[`~tokenization_utils_base.PreTrainedTokenizerBase`] that implements the common methods between
[`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] and the mixin
[`~tokenization_utils_base.SpecialTokensMixin`].
[`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`].

Most of those are only useful if you are studying the code of the tokenizers in the library.

@@ -29,9 +28,6 @@ Most of those are only useful if you are studying the code of the tokenizers in
- __call__
- all

## SpecialTokensMixin

[[autodoc]] tokenization_utils_base.SpecialTokensMixin

## Enums and namedtuples

15 changes: 13 additions & 2 deletions docs/source/en/main_classes/tokenizer.md
@@ -28,8 +28,7 @@ The base classes [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`]
implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and
"Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library
(downloaded from HuggingFace's AWS S3 repository). They both rely on
[`~tokenization_utils_base.PreTrainedTokenizerBase`] that contains the common methods, and
[`~tokenization_utils_base.SpecialTokensMixin`].
[`~tokenization_utils_base.PreTrainedTokenizerBase`] that contains the common methods.

[`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] thus implement the main
methods for using all the tokenizers:
@@ -98,6 +97,18 @@ loaded very simply into 🤗 transformers. Take a look at the [Using tokenizers
- push_to_hub
- all

## PythonBackend

[[autodoc]] PythonBackend

## TokenizersBackend

[[autodoc]] TokenizersBackend

## SentencePieceBackend

[[autodoc]] SentencePieceBackend

## BatchEncoding

[[autodoc]] BatchEncoding
6 changes: 4 additions & 2 deletions docs/source/en/model_doc/bert.md
@@ -100,11 +100,13 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
## BertTokenizer

[[autodoc]] BertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## BertTokenizerLegacy

[[autodoc]] BertTokenizerLegacy

## BertTokenizerFast

[[autodoc]] BertTokenizerFast
2 changes: 0 additions & 2 deletions docs/source/en/model_doc/big_bird.md
@@ -104,9 +104,7 @@ print(f"The predicted token is: {predicted_token}")
## BigBirdTokenizer

[[autodoc]] BigBirdTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## BigBirdTokenizerFast
2 changes: 0 additions & 2 deletions docs/source/en/model_doc/blenderbot-small.md
@@ -68,9 +68,7 @@ the left.
## BlenderbotSmallTokenizer

[[autodoc]] BlenderbotSmallTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## BlenderbotSmallTokenizerFast
2 changes: 0 additions & 2 deletions docs/source/en/model_doc/blenderbot.md
@@ -84,12 +84,10 @@ An example:
## BlenderbotTokenizer

[[autodoc]] BlenderbotTokenizer
- build_inputs_with_special_tokens

## BlenderbotTokenizerFast

[[autodoc]] BlenderbotTokenizerFast
- build_inputs_with_special_tokens

## BlenderbotModel

4 changes: 0 additions & 4 deletions docs/source/en/model_doc/bloom.md
@@ -63,10 +63,6 @@ See also:
[[autodoc]] BloomConfig
- all

## BloomTokenizerFast

[[autodoc]] BloomTokenizerFast
- all

## BloomModel

2 changes: 0 additions & 2 deletions docs/source/en/model_doc/camembert.md
@@ -122,9 +122,7 @@ print(f"The predicted token is: {predicted_token}")
## CamembertTokenizer

[[autodoc]] CamembertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## CamembertTokenizerFast
2 changes: 0 additions & 2 deletions docs/source/en/model_doc/clip.md
@@ -99,9 +99,7 @@ print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_
## CLIPTokenizer

[[autodoc]] CLIPTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## CLIPTokenizerFast
4 changes: 0 additions & 4 deletions docs/source/en/model_doc/code_llama.md
@@ -167,16 +167,12 @@ visualizer("""def func(a, b):
## CodeLlamaTokenizer

[[autodoc]] CodeLlamaTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## CodeLlamaTokenizerFast

[[autodoc]] CodeLlamaTokenizerFast
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- update_post_processor
- save_vocabulary
1 change: 0 additions & 1 deletion docs/source/en/model_doc/codegen.md
@@ -77,7 +77,6 @@ hello_world()
## CodeGenTokenizer

[[autodoc]] CodeGenTokenizer
- create_token_type_ids_from_sequences
- save_vocabulary

## CodeGenTokenizerFast
12 changes: 4 additions & 8 deletions docs/source/en/model_doc/cohere.md
@@ -129,14 +129,10 @@ visualizer("Plants create energy through a process known as")

[[autodoc]] CohereConfig

## CohereTokenizerFast

[[autodoc]] CohereTokenizerFast
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- update_post_processor
- save_vocabulary
## CohereTokenizer

[[autodoc]] CohereTokenizer


## CohereModel

2 changes: 0 additions & 2 deletions docs/source/en/model_doc/convbert.md
@@ -62,9 +62,7 @@ ConvBERT training tips are similar to those of BERT. For usage tips refer to [BE
## ConvBertTokenizer

[[autodoc]] ConvBertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## ConvBertTokenizerFast