
Commit 05c0e1d

itazap and ArthurZucker authored

rm slow tokenizers (#40936)
* fixes missed
* gemma test fix
* refactor
* rm legacy from llama
* added renaming
* add _model
* update legacy
* update legacy
* fix docstring
* always load blank, then set _tokenizer if we have it
* new toks
* update all berttokenizer based models
* apply feedback - delete bert duplicates
* more models --> fast only
* more convert_slow models
* fix common test refs
* updating fast only tokenizers
* openai and pegasus
* enable sentencepiecebackend
* more models
* code gen
* t5
* code gen tests
* speecht5
* mbart
* mbart50
* more models
* more models
* layouglmv2
* update tests
* update tests
* update tests
* pretrainedtokenizer
* whisper
* whisper
* layoutxlm and storing backends
* refactor sentencepiecebackend and additional_special_tokens
* renaming tokenization_utils --> tokenization_python
* udpate tests
* bert test
* blenderbot
* clip
* codegen
* code_llama
* cohere
* deberata, deberat v2, funnel
* gpt2
* batch update tests
* pegasus qwen2 roberta
* more models
* layout tests
* some renaming
* fix references to utils_fast
* fix refs
* fix refs
* fix refs
* fix refs
* fix refs
* fix refs
* fix refs
* fix some tests
* regression
* fix refs
* fix refs
* missed the most crucial file in my last commit
* fix refs
* fix refs
* fix refs
* batch encode fix
* fix some tests
* BC for batch_decode bc too many refs
* more tests
* fix more tests
* fix for processors
* fixing more models
* deleted mbart50 by accident
* seamless m4t
* albert fix
* whisper
* layout3
* attempt to fix cached tokenizers on CI
* trying another fix on CI
* again try to work around CI
* bertweet
* tapas
* mbart50
* luke
* mluke
* markuplm
* markuplm
* fix some more auto tests
* some random model failures
* mistralcommontestser
* more fixes
* ref fix
* siglip
* marian
* plbart
* update utils toks
* seamless m4t
* roc bert
* udpate byt5 test
* xlm
* esm
* roformer
* code llama
* biogpt
* m2m100
* dpr and flaubert
* xlm and speech to text
* tok backend pass object
* tokenizer object pass
* wav2vec2
* wav2vec2
* cpmant
* update utils tokenizers
* cpmant
* bartpho
* test apply chat template assistant mask
* apply chat template video
* apply chat template assistant mask
* test torch
* update from slow in base and fix donut processor errors
* auto to point to tokenizers backend, fix kosmos2
* some non model fixes for old slow models that no longer have their own tokenizer file as they are the same as bert
* missed file from last commit
* idefics2
* fixup
* fixup
* pretrained tokenizer fast test update
* stash
* bad merged
* cherry pick more stuff that did not merge well
* fix gptsw3
* nit warn for now
* update error raising
* just ran fixup
* bring back bert legacy
* fix
* nit
* fix 56 errors on blenderbotsmall?
* 18 for blenderbotsmall
* tok auto
* missed clip
* fix tests
* something missed
* token healing
* tok common tests update - nonmodel
* try to fix non-model test in test_tokenization_utils
* fix hub tests
* try to fix hub tests
* custom vocab related fixed
* bert jap
* BERT JAP
* rename bert legacy to bert legacy
* Wav2vec2
* fix in tok python to update total vocab size - fixes speech t5
* blender bot small
* forgot test file
* test failures
* marian
* gpt2 tiktoken
* big bird / marian
* udop
* forgot couple changes
* test_serve fix
* missing import
* a couple processors fixes
* style partly
* fix to fetch tests ci
* Revert branch back to commit f5bc69e state
* revert branch to styling
* update mistral after merge
* fixes for non model tests
* some processor test fixes
* more processor test fixes
* more processor fixes
* hub tests
* python tok utils
* fix hub test
* make style for now
* remove problemattic fic copies
* python utils/check_copies.py --fix_and_overwrite
* more styling
* fixup
* silence docstirng
* fix import?
* fix imports
* add the local test as well
* throw spm error
* llamas
* fix a couple tests
* broke ci
* broke ci
* broke ci
* broke ci
* add logs to debug gemma on ci
* gemma and llama
* gemma
* revert las commit
* gemma debug
* gemma debug
* gemma
* safely import spiece backend
* tok tests
* check none
* setup and qual
* ruff
* del dev files
* tok auto
* fill docstrings
* update auto
* blenderbot small nit
* add migration guide
* move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor`
* rename MistralCommonTokenizer to MistralCommonBackend
* nit
* fix failures
* fixup
* remoove one old test
* mark the slow one as slow
* very small fixes
* update auto mapping for missing ones
* fixup lorsd
* fixup doc and stuff
* should be the final fixe
* processing update
* update
* FIX or brute AI fix the llava test
* style
* slow?
* fix is offline mode?
* fix mt5
* One tok utils (#42462)
* consolidate python and utils tokenization files, they are copies
* ruff and ref
* Format
* fix cohere
* ?
* up
* am I dumbb?
* grumble

---------

Co-authored-by: Arthur <[email protected]>
1 parent 01c5159 · commit 05c0e1d

File tree

443 files changed · +16908 −55397 lines changed

MIGRATION_GUIDE_V5.md

Lines changed: 201 additions & 0 deletions
@@ -74,6 +74,207 @@ While this is being implemented, expect varying levels of support across differe
Linked PR: https://github.com/huggingface/transformers/pull/41580

## Tokenization

Just as we moved towards a single backend library for model definition, we want `Tokenizer` to be a lot more intuitive.
With v5, you can now initialize an empty `LlamaTokenizer` and train it directly on your new task!

Defining a new tokenizer object should be as simple as this:

```python
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab

        if merges is not None:
            self._merges = merges
        else:
            self._merges = generate_merges(self._vocab)

        self._tokenizer = Tokenizer(BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True))
        # Llama-style pre-tokenization: prepend "▁" to the first word only
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme="first", split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )
```

And now, if you call `Llama5Tokenizer()`, you get an empty, trainable tokenizer that follows the definition of the authors of `Llama5` (it does not exist yet :wink:).

The above is the main motivation for refactoring tokenization: we want people to instantiate a tokenizer just as they would a model, empty or not, with exactly what they defined. A blank tokenizer can then be trained with the usual `tokenizers` trainers, as sketched below.
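
As a rough sketch of the training step (the trainer API is the standard 🤗 `tokenizers` one; reaching into the underlying Rust tokenizer via `_tokenizer` is an assumption based on the class definition above):

```python
from tokenizers.trainers import BpeTrainer

tokenizer = Llama5Tokenizer()  # blank, trainable tokenizer

# Train the underlying Rust tokenizer on a plain-text corpus
trainer = BpeTrainer(vocab_size=32000, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer._tokenizer.train(["my_corpus.txt"], trainer=trainer)
```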

### Non-tokenizers

If your tokenizer is not common, or you just don't want to rely on `sentencepiece` or `tokenizers`, you can import the `PythonBackend` (previously `PreTrainedTokenizer`), which has all the API and logic for added tokens, and for encoding and decoding with them.

If you want even fewer features, you can use the common `PreTrainedTokenizerBase` mixin, which mostly defines the `transformers` tokenizer API: `encode`, `decode`, `vocab_size`, `get_vocab`, `convert_tokens_to_ids`, `convert_ids_to_tokens`, `from_pretrained`, `save_pretrained`, etc. A minimal `PythonBackend` subclass is sketched below.
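
For illustration, a minimal whitespace tokenizer on top of `PythonBackend` might look like this. This is a sketch: it assumes `PythonBackend` keeps the hook names of the old `PreTrainedTokenizer` (`_tokenize`, `_convert_token_to_id`, `_convert_id_to_token`), since it is that class renamed.

```python
from transformers import PythonBackend


class WhitespaceTokenizer(PythonBackend):
    """Toy tokenizer that splits on whitespace; no sentencepiece/tokenizers needed."""

    def __init__(self, vocab, unk_token="<unk>", **kwargs):
        self.vocab = dict(vocab)
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self.vocab.get(token, self.vocab[str(self.unk_token)])

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, str(self.unk_token))
```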

### Backend Architecture Changes

**Moving away from "slow" vs "fast" tokenizers:**

Previously, transformers maintained two parallel implementations for many tokenizers:

- "Slow" tokenizers (`tokenization_<model>.py`) - Python-based implementations, often using [SentencePiece](https://github.com/google/sentencepiece) as the backend.
- "Fast" tokenizers (`tokenization_<model>_fast.py`) - Rust-based implementations using the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library.

In v5, we consolidate to a single tokenizer file per model: `tokenization_<model>.py`. This file will use the most appropriate backend available:

1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it performs better, and it also offers many features that are commonly adopted across the ecosystem, such as handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, etc.
2. **SentencePieceBackend**: for models requiring SentencePiece.
3. **PythonBackend**: pure Python implementations.
4. **MistralCommonBackend**: relies on `MistralCommon`'s tokenization library (previously `MistralCommonTokenizer`).

The `AutoTokenizer` automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use `AutoTokenizer.from_pretrained()` as before. This keeps transformers future-proof and modular, making it easy to support additional backends.
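
In practice nothing changes in user code; for example:

```python
from transformers import AutoTokenizer

# Same call as before; the backend is selected automatically
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer).__name__)  # the resolved class; which backend it subclasses depends on the repo files
```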

### API Changes

**1. Direct tokenizer initialization with vocab and merges:**

In v5, you can now initialize tokenizers directly with a vocabulary and merges, enabling training custom tokenizers from scratch:

```python
# v5: Initialize a blank tokenizer for training
from transformers import LlamaTokenizer

# Create a tokenizer with custom vocabulary and merges
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)

# Or initialize a blank tokenizer to train on your own dataset
tokenizer = LlamaTokenizer()  # Creates a blank Llama-like tokenizer
```

However, you can no longer pass a vocab file directly, since that use-case is covered by `from_pretrained`.

**2. Simplified decoding API:**

The `batch_decode` method has been unified with `decode`. Both single and batch decoding now use the same method:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))
```

Gives:

```diff
- 'hey how are you?</s> fine</s>'
+ ['hey how are you?</s>', 'fine</s>']
```
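
Note that `batch_decode` itself is kept as an alias for backward compatibility (as the commit message above puts it, "BC for batch_decode bc too many refs"), so code such as `tokenizer.batch_decode(generated_ids)` keeps working.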

This is mostly because people get a `list[list[int]]` out of `generate`, then reach for `decode` (the counterpart of `encode`) and would get:

```python
...: tokenizer.decode([[1,2], [1,4]])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 4
      2 tokenizer = AutoTokenizer.from_pretrained("t5-small")
      3 inputs = ["hey how are you?", "fine"]
----> 4 tokenizer.decode([[1,2], [1,4]])

File /raid/arthur/transformers/src/transformers/tokenization_utils_base.py:3948, in PreTrainedTokenizerBase.decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3945 # Convert inputs to python lists
   3946 token_ids = to_py_obj(token_ids)
-> 3948 return self._decode(
   3949     token_ids=token_ids,
   3950     skip_special_tokens=skip_special_tokens,
   3951     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   3952     **kwargs,
   3953 )

File /raid/arthur/transformers/src/transformers/tokenization_utils_fast.py:682, in PreTrainedTokenizerFast._decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    680 if isinstance(token_ids, int):
    681     token_ids = [token_ids]
--> 682 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    684 clean_up_tokenization_spaces = (
    685     clean_up_tokenization_spaces
    686     if clean_up_tokenization_spaces is not None
    687     else self.clean_up_tokenization_spaces
    688 )
    689 if clean_up_tokenization_spaces:

TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
```

**3. Unified encoding API:**

`encode_plus` is deprecated → call the tokenizer directly via `__call__` instead.
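
For example:

```python
# v4 (deprecated)
encoding = tokenizer.encode_plus("hello world", return_tensors="pt")

# v5
encoding = tokenizer("hello world", return_tensors="pt")
```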

**4. `apply_chat_template` returns `BatchEncoding`:**

Previously, `apply_chat_template` returned `input_ids` for backward compatibility. In v5, it now consistently returns a `BatchEncoding` dict like other tokenizer methods:

```python
# v5
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
]

# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys())  # dict_keys(['input_ids', 'attention_mask'])
```
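
If your code relied on the old return value, index into the returned `BatchEncoding`:

```python
# v4: input_ids = tokenizer.apply_chat_template(messages)
# v5:
input_ids = tokenizer.apply_chat_template(messages)["input_ids"]
```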

#### Removed legacy configuration file saving:

- `special_tokens_map.json` - special tokens are now stored in `tokenizer_config.json`.
- `added_tokens.json` - added tokens are now stored in `tokenizer.json`.
- `added_tokens_decoder` is only stored when there is no `tokenizer.json`.

When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format.
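
A quick way to see the new layout (illustrative; the exact file list depends on the model and backend):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./my_tokenizer")
# v5 writes tokenizer_config.json (now including special tokens) and tokenizer.json
# (now including added tokens); special_tokens_map.json and added_tokens.json are gone
```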

### Model-Specific Changes

Several models that had identical tokenizers now import from their base implementation:

- **LayoutLM** → uses BertTokenizer
- **LED** → uses BartTokenizer
- **Longformer** → uses RobertaTokenizer
- **LXMert** → uses BertTokenizer
- **MT5** → uses T5Tokenizer
- **MVP** → uses BartTokenizer

These duplicate files will eventually be removed.

**Removed T5-specific workarounds:**

The internal `_eventually_correct_t5_max_length` method has been removed. T5 tokenizers now handle max length consistently with other models.

### Testing Changes

Model-specific tokenization test files now focus on integration tests.
Common tokenization API tests (e.g., `add_tokens`, `encode`, `decode`) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.

For legacy implementations, the original BERT Python tokenizer code (including `WhitespaceTokenizer`, `BasicTokenizer`, etc.) is preserved in `bert_legacy.py` for reference purposes.

**Linked PRs:**

- https://github.com/huggingface/transformers/issues/40938
- https://github.com/huggingface/transformers/pull/40936
- https://github.com/huggingface/transformers/pull/41626
## Library-wide changes with lesser impact
78279

### `use_auth_token`

docs/source/en/internal/tokenization_utils.md

Lines changed: 1 addition & 5 deletions
```diff
@@ -18,8 +18,7 @@ rendered properly in your Markdown viewer.

 This page lists all the utility functions used by the tokenizers, mainly the class
 [`~tokenization_utils_base.PreTrainedTokenizerBase`] that implements the common methods between
-[`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] and the mixin
-[`~tokenization_utils_base.SpecialTokensMixin`].
+[`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`].

 Most of those are only useful if you are studying the code of the tokenizers in the library.

@@ -29,9 +28,6 @@ Most of those are only useful if you are studying the code of the tokenizers in
     - __call__
     - all

-## SpecialTokensMixin
-
-[[autodoc]] tokenization_utils_base.SpecialTokensMixin

 ## Enums and namedtuples
```

docs/source/en/main_classes/tokenizer.md

Lines changed: 13 additions & 2 deletions
```diff
@@ -28,8 +28,7 @@ The base classes [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`]
 implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and
 "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library
 (downloaded from HuggingFace's AWS S3 repository). They both rely on
-[`~tokenization_utils_base.PreTrainedTokenizerBase`] that contains the common methods, and
-[`~tokenization_utils_base.SpecialTokensMixin`].
+[`~tokenization_utils_base.PreTrainedTokenizerBase`] that contains the common methods.

 [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] thus implement the main
 methods for using all the tokenizers:
@@ -98,6 +97,18 @@ loaded very simply into 🤗 transformers. Take a look at the [Using tokenizers
     - push_to_hub
     - all

+## PythonBackend
+
+[[autodoc]] PythonBackend
+
+## TokenizersBackend
+
+[[autodoc]] TokenizersBackend
+
+## SentencePieceBackend
+
+[[autodoc]] SentencePieceBackend
+
 ## BatchEncoding

 [[autodoc]] BatchEncoding
```

docs/source/en/model_doc/bert.md

Lines changed: 4 additions & 2 deletions
```diff
@@ -100,11 +100,13 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
 ## BertTokenizer

 [[autodoc]] BertTokenizer
-    - build_inputs_with_special_tokens
     - get_special_tokens_mask
-    - create_token_type_ids_from_sequences
     - save_vocabulary

+## BertTokenizerLegacy
+
+[[autodoc]] BertTokenizerLegacy
+
 ## BertTokenizerFast

 [[autodoc]] BertTokenizerFast
```

docs/source/en/model_doc/big_bird.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -104,9 +104,7 @@ print(f"The predicted token is: {predicted_token}")
 ## BigBirdTokenizer

 [[autodoc]] BigBirdTokenizer
-    - build_inputs_with_special_tokens
     - get_special_tokens_mask
-    - create_token_type_ids_from_sequences
     - save_vocabulary

 ## BigBirdTokenizerFast
```

docs/source/en/model_doc/blenderbot-small.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -68,9 +68,7 @@ the left.
 ## BlenderbotSmallTokenizer

 [[autodoc]] BlenderbotSmallTokenizer
-    - build_inputs_with_special_tokens
     - get_special_tokens_mask
-    - create_token_type_ids_from_sequences
     - save_vocabulary

 ## BlenderbotSmallTokenizerFast
```

docs/source/en/model_doc/blenderbot.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -84,12 +84,10 @@ An example:
 ## BlenderbotTokenizer

 [[autodoc]] BlenderbotTokenizer
-    - build_inputs_with_special_tokens

 ## BlenderbotTokenizerFast

 [[autodoc]] BlenderbotTokenizerFast
-    - build_inputs_with_special_tokens

 ## BlenderbotModel
```

docs/source/en/model_doc/bloom.md

Lines changed: 0 additions & 4 deletions
```diff
@@ -63,10 +63,6 @@ See also:
 [[autodoc]] BloomConfig
     - all

-## BloomTokenizerFast
-
-[[autodoc]] BloomTokenizerFast
-    - all

 ## BloomModel
```

docs/source/en/model_doc/camembert.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -122,9 +122,7 @@ print(f"The predicted token is: {predicted_token}")
 ## CamembertTokenizer

 [[autodoc]] CamembertTokenizer
-    - build_inputs_with_special_tokens
     - get_special_tokens_mask
-    - create_token_type_ids_from_sequences
     - save_vocabulary

 ## CamembertTokenizerFast
```

docs/source/en/model_doc/clip.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -99,9 +99,7 @@ print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_
 ## CLIPTokenizer

 [[autodoc]] CLIPTokenizer
-    - build_inputs_with_special_tokens
     - get_special_tokens_mask
-    - create_token_type_ids_from_sequences
     - save_vocabulary

 ## CLIPTokenizerFast
```
