* fixes missed
* gemma test fix
* refactor
* rm legacy from llama
* added renaming
* add _model
* update legacy
* update legacy
* fix docstring
* always load blank, then set _tokenizer if we have it
* new toks
* update all berttokenizer based models
* apply feedback - delete bert duplicates
* more models --> fast only
* more convert_slow models
* fix common test refs
* updating fast only tokenizers
* openai and pegasus
* enable sentencepiecebackend
* more models
* code gen
* t5
* code gen tests
* speecht5
* mbart
* mbart50
* more models
* more models
* layouglmv2
* update tests
* update tests
* update tests
* pretrainedtokenizer
* whisper
* whisper
* layoutxlm and storing backends
* refactor sentencepiecebackend and additional_special_tokens
* renaming tokenization_utils --> tokenization_python
* udpate tests
* bert test
* blenderbot
* clip
* codegen
* code_llama
* cohere
* deberata, deberat v2, funnel
* gpt2
* batch update tests
* pegasus qwen2 roberta
* more models
* layout tests
* some renaming
* fix references to utils_fast
* fix refs
* fix refs
* fix refs
* fix refs
* fix refs
* fix refs
* fix refs
* fix some tests
* regression
* fix refs
* fix refs
* missed the most crucial file in my last commit
* fix refs
* fix refs
* fix refs
* batch encode fix
* fix some tests
* BC for batch_decode bc too many refs
* more tests
* fix more tests
* fix for processors
* fixing more models
* deleted mbart50 by accident
* seamless m4t
* albert fix
* whisper
* layout3
* attempt to fix cached tokenizers on CI
* trying another fix on CI
* again try to work around CI
* bertweet
* tapas
* mbart50
* luke
* mluke
* markuplm
* markuplm
* fix some more auto tests
* some random model failures
* mistralcommontestser
* more fixes
* ref fix
* siglip
* marian
* plbart
* update utils toks
* seamless m4t
* roc bert
* udpate byt5 test
* xlm
* esm
* roformer
* code llama
* biogpt
* m2m100
* dpr and flaubert
* xlm and speech to text
* tok backend pass object
* tokenizer object pass
* wav2vec2
* wav2vec2
* cpmant
* update utils tokenizers
* cpmant
* bartpho
* test apply chat template assistant mask
* apply chat template video
* apply chat template assistant mask
* test torch
* update from slow in base and fix donut processor errors
* auto to point to tokenizers backend, fix kosmos2
* some non model fixes for old slow models that no longer have their own tokenizer file as they are the same as bert
* missed file from last commit
* idefics2
* fixup
* fixup
* pretrained tokenizer fast test update
* stash
* bad merged
* cherry pick more stuff that did not merge well
* fix gptsw3
* nit warn for now
* update error raising
* just ran fixup
* bring back bert legacy
* fix
* nit
* fix 56 errors on blenderbotsmall?
* 18 for blenderbotsmall
* tok auto
* missed clip
* fix tests
* something missed
* token healing
* tok common tests update - nonmodel
* try to fix non-model test in test_tokenization_utils
* fix hub tests
* try to fix hub tests
* custom vocab related fixed
* bert jap
* BERT JAP
* rename bert legacy to bert legacy
* Wav2vec2
* fix in tok python to update total vocab size - fixes speech t5
* blender bot small
* forgot test file
* test failures
* marian
* gpt2 tiktoken
* big bird / marian
* udop
* forgot couple changes
* test_serve fix
* missing import
* a couple processors fixes
* style partly
* fix to fetch tests ci
* Revert branch back to commit f5bc69e state
* revert branch to styling
* update mistral after merge
* fixes for non model tests
* some processor test fixes
* more processor test fixes
* more processor fixes
* hub tests
* python tok utils
* fix hub test
* make style for now
* remove problemattic fic copies
* python utils/check_copies.py --fix_and_overwrite
* more styling
* fixup
* silence docstirng
* fix import?
* fix imports
* add the local test as well
* throw spm error
* llamas
* fix a couple tests
* broke ci
* broke ci
* broke ci
* broke ci
* add logs to debug gemma on ci
* gemma and llama
* gemma
* revert las commit
* gemma debug
* gemma debug
* gemma
* safely import spiece backend
* tok tests
* check none
* setup and qual
* ruff
* del dev files
* tok auto
* fill docstrings
* update auto
* blenderbot small nit
* add migration guide
* move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor`
* rename MistralCommonTokenizer to MistralCommonB ackend
* nit
* fix failures
* fixup
* remoove one old test
* mark the slow one as slow
* very small fixes
* update auto mapping for missing ones
* fixup lorsd
* fixup doc and stuff
* should be the final fixe
* processing update
* update
* FIX or brute AI fix the llava test
* style
* slow?
* fix is offline mode?
* fix mt5
* One tok utils (#42462)
* consolidate python and utils tokenization files, they are copies
* ruff and ref
* Format
* fix cohere
* ?
* up
* am I dumbb?
* grumble
---------
Co-authored-by: Arthur <[email protected]>
And now if you call `Llama5Tokenizer()` you just get an empty, trainable tokenizer that follows the definition of the authors of `Llama5` (it does not exist yet :wink:).
The above is the main motivation behind refactoring tokenization: we want people to instantiate a tokenizer just like they would a model, empty or not, with exactly what they defined.
### Non-tokenizers
If your tokenizer is not common, or you just don't want to rely on `sentencepiece` nor `tokenizers`, you can import `PythonBackend` (previously `PreTrainedTokenizer`), which has all the API and logic for added tokens, encoding and decoding with them, etc.

If you want even fewer features, you can use the common `PreTrainedTokenizerBase` mixin, which mostly defines the `transformers` tokenizer API: `encode`, `decode`, `vocab_size`, `get_vocab`, `convert_tokens_to_ids`, `convert_ids_to_tokens`, `from_pretrained`, `save_pretrained`, etc.
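As a rough sketch of what that looks like (assuming `PythonBackend` keeps the same subclassing hooks as the old `PreTrainedTokenizer`, i.e. `_tokenize`, `_convert_token_to_id`, `_convert_id_to_token`, `get_vocab`), a tiny whitespace tokenizer could be written as:

```python
from transformers import PythonBackend  # previously PreTrainedTokenizer

class WhitespaceTokenizer(PythonBackend):
    """Toy word-level tokenizer: splits on whitespace, no sentencepiece/tokenizers needed."""

    def __init__(self, vocab=None, unk_token="<unk>", **kwargs):
        # Build the vocab before calling super().__init__, which wires up the added-token logic.
        self._vocab = dict(vocab) if vocab else {unk_token: 0}
        self._ids_to_tokens = {i: t for t, i in self._vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self._vocab)

    def get_vocab(self):
        return dict(self._vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self._vocab.get(token, self._vocab[self.unk_token])

    def _convert_id_to_token(self, index):
        return self._ids_to_tokens.get(index, self.unk_token)
```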
### Backend Architecture Changes
**Moving away from "slow" vs "fast" tokenizers:**
Previously, transformers maintained two parallel implementations for many tokenizers:
- "Slow" tokenizers (`tokenization_<model>.py`) - Python-based implementations, often using [SentencePiece](https://github.com/google/sentencepiece) as the backend.
- "Fast" tokenizers (`tokenization_<model>_fast.py`) - Rust-based implementations using the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library.
In v5, we consolidate to a single tokenizer file per model: `tokenization_<model>.py`. This file will use the most appropriate backend available:
1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general its performance is better, and it also offers many more features that are commonly adopted across the ecosystem, such as handling additional tokens, easily updating the tokenizer's state, automatic parallelization, etc.
2. **SentencePieceBackend**: For models requiring SentencePiece.
3. **PythonBackend**: Pure Python implementations.
4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library (previously `MistralCommonTokenizer`).
The `AutoTokenizer` automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use `AutoTokenizer.from_pretrained()` as before. It also keeps transformers future-proof and modular, making it easy to support future backends.
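For example (a minimal sketch; the checkpoint name is only illustrative), loading looks exactly like it did before, and you can inspect which backend class was selected from the returned object:

```python
from transformers import AutoTokenizer

# Same call as in v4; the backend (TokenizersBackend, SentencePieceBackend, ...) is
# picked automatically from the files on the Hub and the installed dependencies.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(type(tokenizer).__name__)    # concrete tokenizer class / backend that was chosen
print(tokenizer("Hello world!"))   # encoding API is unchanged
```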
### API Changes
**1. Direct tokenizer initialization with vocab and merges:**
In v5, you can now initialize tokenizers directly with vocabulary and merges, enabling training custom tokenizers from scratch:
```python
# v5: Initialize a blank tokenizer for training
from transformers import LlamaTokenizer

# Create a tokenizer with custom vocabulary and merges
# (illustrative values; in practice, pass the vocab/merges produced by your own training)
vocab = {"<unk>": 0, "l": 1, "o": 2, "lo": 3}
merges = [("l", "o")]
tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
```
Previously, `apply_chat_template` returned `input_ids` for backward compatibility. In v5, it now consistently returns a `BatchEncoding` dict like other tokenizer methods:
```python
# v5
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"}
]

# Now returns BatchEncoding with input_ids, attention_mask, etc.
# (assuming `tokenizer` was loaded earlier, e.g. with AutoTokenizer.from_pretrained)
outputs = tokenizer.apply_chat_template(messages)
input_ids = outputs["input_ids"]
```
The following files are no longer written when saving a tokenizer:

- `special_tokens_map.json` - special tokens are now stored in `tokenizer_config.json`.
- `added_tokens.json` - added tokens are now stored in `tokenizer.json`.
- `added_tokens_decoder` is only stored when there is no `tokenizer.json`.
When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format.
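A quick way to see the consolidated layout (a sketch; the checkpoint is only an example) is to save any tokenizer and list the resulting directory:

```python
import os
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./my-tokenizer")

# Expect tokenizer_config.json (now holding the special tokens) and tokenizer.json
# (now holding added tokens); special_tokens_map.json and added_tokens.json are no
# longer written.
print(sorted(os.listdir("./my-tokenizer")))
```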
### Model-Specific Changes
Several models that had identical tokenizers now import from their base implementation:
- **LayoutLM** → uses BertTokenizer
- **LED** → uses BartTokenizer
- **Longformer** → uses RobertaTokenizer
- **LXMert** → uses BertTokenizer
- **MT5** → uses T5Tokenizer
- **MVP** → uses BartTokenizer
We will eventually remove these duplicate files entirely.
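Existing imports keep working. Assuming the re-export simply points at (or subclasses) the base implementation, a check like the sketch below should confirm the shared tokenizer; the checkpoint name is only illustrative:

```python
from transformers import BertTokenizer, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

# LayoutLM now reuses the BERT tokenizer implementation, so the loaded instance
# should also be a BertTokenizer (adjust if the alias is structured differently).
print(isinstance(tokenizer, BertTokenizer))
```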
**Removed T5-specific workarounds:**
The internal `_eventually_correct_t5_max_length` method has been removed. T5 tokenizers now handle max length consistently with other models.
### Testing Changes
Model-specific tokenization test files now focus on integration tests.
Common tokenization API tests (e.g., `add_tokens`, `encode`, `decode`) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.
For legacy implementations, the original BERT Python tokenizer code (including `WhitespaceTokenizer`, `BasicTokenizer`, etc.) is preserved in `bert_legacy.py` for reference purposes.