rm slow tokenizers #40936
Conversation
ArthurZucker left a comment
Nice!
@require_tokenizers
def test_added_token_are_matched_longest_first(self):
    if not self.test_slow_tokenizer:
        self.skipTest(reason="This test is only for slow tokenizers")

    tokenizers = self.get_tokenizers(fast=False)
should be moved to sentencepiece as well
yes it's in test_sentencepiece_backend_mixin.py
words = ["Wonderful", "no", "inspiration", "example", "with", "subtoken"]
text = " ".join(words)
batch_size = 3

encoding = tokenizer_r.encode_plus(text, add_special_tokens=False)

batch_encoding = tokenizer_r([text] * batch_size, add_special_tokens=False)
num_tokens = len(encoding["input_ids"])

last_word_index = len(words) - 1
last_token_index = num_tokens - 1
last_batch_index = batch_size - 1
last_char_index = len(text) - 1

# words, tokens
self.assertEqual(len(encoding.words(0)), num_tokens)
self.assertEqual(max(encoding.words(0)), last_word_index)
self.assertEqual(min(encoding.words(0)), 0)
self.assertEqual(len(batch_encoding.words(last_batch_index)), num_tokens)
self.assertEqual(max(batch_encoding.words(last_batch_index)), last_word_index)
self.assertEqual(min(batch_encoding.words(last_batch_index)), 0)
self.assertEqual(len(encoding.tokens(0)), num_tokens)

# Assert token_to_word
self.assertEqual(encoding.token_to_word(0), 0)
self.assertEqual(encoding.token_to_word(0, 0), 0)
self.assertEqual(encoding.token_to_word(last_token_index), last_word_index)
self.assertEqual(encoding.token_to_word(0, last_token_index), last_word_index)
self.assertEqual(batch_encoding.token_to_word(1, 0), 0)
self.assertEqual(batch_encoding.token_to_word(0, last_token_index), last_word_index)
self.assertEqual(batch_encoding.token_to_word(last_batch_index, last_token_index), last_word_index)

# Assert word_to_tokens
self.assertEqual(encoding.word_to_tokens(0).start, 0)
self.assertEqual(encoding.word_to_tokens(0, 0).start, 0)
self.assertEqual(encoding.word_to_tokens(last_word_index).end, last_token_index + 1)
self.assertEqual(encoding.word_to_tokens(0, last_word_index).end, last_token_index + 1)
self.assertEqual(batch_encoding.word_to_tokens(1, 0).start, 0)
self.assertEqual(batch_encoding.word_to_tokens(0, last_word_index).end, last_token_index + 1)
self.assertEqual(
    batch_encoding.word_to_tokens(last_batch_index, last_word_index).end, last_token_index + 1
)

# Assert token_to_chars
self.assertEqual(encoding.token_to_chars(0).start, 0)
self.assertEqual(encoding.token_to_chars(0, 0).start, 0)
self.assertEqual(encoding.token_to_chars(last_token_index).end, last_char_index + 1)
self.assertEqual(encoding.token_to_chars(0, last_token_index).end, last_char_index + 1)
self.assertEqual(batch_encoding.token_to_chars(1, 0).start, 0)
self.assertEqual(batch_encoding.token_to_chars(0, last_token_index).end, last_char_index + 1)
self.assertEqual(
    batch_encoding.token_to_chars(last_batch_index, last_token_index).end, last_char_index + 1
)
indeed rust takes care of these by itself; the other part can be tested in the sentencepiece file
seems like we only tested tokenizer_r (aka rust) here, and spiece / slow never supported token_to_chars, word_to_tokens, etc.
tests/test_tokenization_common.py (outdated)
self.skipTest(
    reason="This test is now in TokenizersBackendTesterMixin - it tests tokenizers-backend API, not transformers code"
)
indeed, that's for the tokenizers overlay
# Check the changes
for token in special_tokens_list:
and this can go to trash as well
ArthurZucker left a comment
A good start:
# Requires: from tokenizers import Regex, Tokenizer, decoders, normalizers, pre_tokenizers, processors
# and: from tokenizers.models import BPE
def __init__(self, vocab, merges):
    self.tokenizer = Tokenizer(
        BPE(
            vocab=vocab,
            merges=merges,
            dropout=None,
            unk_token=None,
            continuing_subword_prefix="",
            end_of_word_suffix="",
            fuse_unk=False,
            byte_fallback=False,
        )
    )
    self.tokenizer.normalizer = normalizers.NFC()
    self.tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
        [
            pre_tokenizers.Split(
                Regex(
                    r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
                ),
                behavior="isolated",
                invert=False,
            ),
            pre_tokenizers.ByteLevel(
                add_prefix_space=getattr(self.original_tokenizer, "add_prefix_space", False),
                use_regex=False,
            ),
        ]
    )
    self.tokenizer.decoder = decoders.ByteLevel()
    self.tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

Ideally I think we can even just do this, without defining the functions separately.
The only upside would have been that we can use modular for less copy pasting, but it's so small that I want to have this explicit, without extra abstraction!
logger.info(
    "Falling back to PreTrainedSentencePieceTokenizer since tokenizer.model file was found "
    "but no config or tokenizer class could be determined."
)
IDK if we want to fall back here! I think if tokenizer.json is not found -> we convert tokenizer.model to tokenizer.json, unless the user enforces sentencepiece
enforce it by passing something like tokenizer_backend="sentencepiece", for example?
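i.e. roughly (the tokenizer_backend kwarg is the proposal being discussed here, not an existing argument; the repo name is a placeholder):

from transformers import AutoTokenizer

# Force the sentencepiece backend instead of converting tokenizer.model to tokenizer.json
tok = AutoTokenizer.from_pretrained("some-org/some-model", tokenizer_backend="sentencepiece")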
def _tokenizer(self) -> Tokenizer:
    return Tokenizer(Unigram(self._vocab_scores, unk_id=self._unk_id(), byte_fallback=True))
yep that's good, tho I think we might want to abstract
def _model(self) -> Model:
    return Unigram(...)
    return output

def _decoder(self, replacement=None, add_prefix_space=None):
    return decoders.Sequence([decoders.Replace("▁", " "), decoders.ByteFallback(), decoders.Fuse()])
and then finally a function that shows how we build the final tokenizer. I think we want __init__ to make self.tokenizer = Tokenizer(model=self._model(), decoder=self._decoder, etc)
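i.e. something like this rough sketch of the suggestion (not final code; it assumes the _model, _normalizer and _decoder helpers discussed above, and tokenizers.Tokenizer is imported):

def __init__(self, **kwargs):
    # Assemble the backend tokenizer from the per-model building blocks.
    self.tokenizer = Tokenizer(self._model())
    self.tokenizer.normalizer = self._normalizer()
    self.tokenizer.decoder = self._decoder()
    super().__init__(**kwargs)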
    """Tokenizer configuration for this tokenizer."""
    return Tokenizer(BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True, byte_fallback=True, dropout=None))

def _vocab(self):
should be Initial vocab or something
no legacy in general! (we most probably want to hide the bad default) so the super class will support changing this, but the real llama tokenizer does not use legacy
def _normalizer(self):
    """Normalizer configuration for this tokenizer."""
    return normalizers.NFC()
nice
ArthurZucker left a comment
Second part of the review, very nice work on de-bloating already
very nice
self._special_tokens_map["additional_special_tokens"] = []  # BC default to empty list

# Directly set hidden values to allow init with tokens not yet in vocab
for key in list(kwargs.keys()):
we can keep this as a TODO, but with the new logic that was added, we already have the self.xxx_token and self.xxx_token_id so IDK if additional_special_tokens is even useful. Let's leave it for later anyway
if not isinstance(value, (list, tuple)) or not all(isinstance(t, (str, AddedToken)) for t in value):
    raise ValueError(f"Tokens {value} for key {key} should all be str or AddedToken instances")
new_tokens = [
    (AddedToken(t, rstrip=False, lstrip=False, normalized=False, special=True) if isinstance(t, str) else t)
    for t in value
    if replace_additional_special_tokens or str(t) not in self.additional_special_tokens
]
if replace_additional_special_tokens and new_tokens:
I would kind of want to get rid of this and put it only in spm, because tokenizers just supports tokenizer.special_tokens which gives all special tokens -> duplicated info with the additional special tokens
    return all_toks

seen = set()
all_toks = []
for value in self.special_tokens_map.values():
same here, would leave as abstract and rely on tokenizers' special_tokens attr if we can!
@classmethod
def convert_added_tokens(cls, obj: Union[AddedToken, Any], save=False, add_type_field=True):
    if isinstance(obj, dict) and "__type" in obj and obj["__type"] == "AddedToken":
        obj.pop("__type")
        return AddedToken(**obj)
    if isinstance(obj, AddedToken) and save:
        obj = obj.__getstate__()
        if add_type_field:
            obj["__type"] = "AddedToken"
        else:
            # Don't save "special" for previous tokenizers
            obj.pop("special")
    return obj
I don't remember why we use this one? Only for SPM, no?
) -> BatchEncoding:
    # Input validation (from _call_one)
    def _is_valid_text_input(t):
I think (but I might be wrong here) that tokenizers does the typechecking itself as well
    self.assertEqual(tokens, EXPECTED_TOKENS)

def test_integration_expected_token_ids(self):
    for tok in self.tokenizers:
        self.assertEqual(tok.encode(input_string), expected_token_ids)
this is just missing a decode test
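e.g. something along these lines, assuming the same fixtures (input_string, expected_token_ids) as the encode test above:

def test_integration_expected_decoded_text(self):
    # Round-trip check: decoding the reference ids should give the input back.
    for tok in self.tokenizers:
        self.assertEqual(tok.decode(expected_token_ids, skip_special_tokens=True), input_string)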
overall LGTM!
    str(unk_token): 3,
}

self._merges = merges if merges is not None else generate_merges(self._vocab)
you actually should never generate merges out of the bos/pad/eos/unk tokens! so the merge generation should happen before
is it all special tokens or just these 4? in convert_slow_tokenizer it currently indexes the vocab[3:]
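A minimal sketch of the "generate merges, but skip the special tokens" idea (a hypothetical helper, not the actual transformers implementation):

def generate_merges_excluding_specials(vocab, special_tokens):
    # Only non-special vocab entries are allowed to participate in merge rules.
    mergeable = {tok: idx for tok, idx in vocab.items() if tok not in special_tokens}
    merges = []
    for token in sorted(mergeable, key=mergeable.get):
        for i in range(1, len(token)):
            left, right = token[:i], token[i:]
            if left in mergeable and right in mergeable:
                merges.append((left, right))
    return merges

# generate_merges_excluding_specials({"<unk>": 0, "a": 1, "b": 2, "ab": 3}, {"<unk>"})
# -> [("a", "b")]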
self.add_tokens(list(self.all_special_tokens), special_tokens=True)
self.update_post_processor()
both can probably be called from the TokenizerBackend class, wdyt? As in, we are adding the post-processor logic to all of them, and special tokens already need to be added by default?
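A rough sketch of that idea (whether this actually lives in TokenizerBackend is the open question; method names are the ones already used in this PR):

class TokenizerBackend:
    def _post_init(self):
        # Shared post-init hook: register the declared special tokens and
        # refresh the template post-processor, so subclasses don't repeat this.
        self.add_tokens(list(self.all_special_tokens), special_tokens=True)
        self.update_post_processor()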
    sub_texts = "".join(sub_texts)

    return sub_texts.replace(SPIECE_UNDERLINE, " ")

self._post_init()
you can also just call
self.add_tokens(list(self.all_special_tokens), special_tokens=True)
but adding tokens has historically been done in the super call!
* consolidate python and utils tokenization files, they are copies
* ruff and ref
* Format
Tokenization
Just as we moved towards a single backend library for model definition, we want `Tokenizer` to be a lot more intuitive. With v5, you can now initialize an empty `LlamaTokenizer` and train it directly on your new task! Defining a new tokenizer object should be as simple as this:
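(The original code block here was lost in extraction; the following is a rough sketch of the idea. The class and base-class names, and the exact constructor signature, are illustrative assumptions rather than the final API.)

from tokenizers import Tokenizer, decoders, normalizers
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):  # base-class name is assumed
    def __init__(self, vocab=None, merges=None, **kwargs):
        # With no vocab/merges this is an empty, trainable tokenizer.
        self._tokenizer = Tokenizer(
            BPE(vocab=vocab or {}, merges=merges or [], byte_fallback=True, fuse_unk=True, dropout=None)
        )
        self._tokenizer.normalizer = normalizers.NFC()
        self._tokenizer.decoder = decoders.ByteLevel()
        super().__init__(**kwargs)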
And now if you call `Llama5Tokenizer()` you just get an empty, trainable tokenizer that follows the definition of the authors of `Llama5` (it does not exist yet 😉). The above is the main motivation for refactoring tokenization: we want people to instantiate a tokenizer just like they would a model, empty or not, with exactly what they defined.
Non-tokenizers
If your tokenizer is not common, or you just don't want to rely on `sentencepiece` nor `tokenizers`, you can import the `PythonBackend` (previously `PreTrainedTokenizer`), which has all the API and logic for added tokens, encoding and decoding with them, etc. If you want even fewer features, you can use the common `PreTrainedTokenizerBase` mixin, which mostly defines the `transformers` tokenizer API: `encode`, `decode`, `vocab_size`, `get_vocab`, `convert_tokens_to_ids`, `convert_ids_to_tokens`, `from_pretrained`, `save_pretrained`, etc.

Backend Architecture Changes
Moving away from "slow" vs "fast" tokenizers:
Previously, transformers maintained two parallel implementations for many tokenizers:
- "Slow" tokenizers (`tokenization_<model>.py`) - Python-based implementations, often using SentencePiece as the backend.
- "Fast" tokenizers (`tokenization_<model>_fast.py`) - Rust-based implementations using the 🤗 tokenizers library.

In v5, we consolidate to a single tokenizer file per model: `tokenization_<model>.py`. This file will use the most appropriate backend available, for example MistralCommon's tokenization library (previously `MistralCommonTokenizer`).

The `AutoTokenizer` automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use `AutoTokenizer.from_pretrained()` as before. This keeps transformers future-proof and modular, so future backends can easily be supported.

API Changes
1. Direct tokenizer initialization with vocab and merges:
In v5, you can now initialize tokenizers directly with vocabulary and merges, enabling training custom tokenizers from scratch:
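A hedged sketch of what this enables (the exact tokenizer class and keyword names may differ from the final API):

# A tiny illustrative vocab/merges pair passed straight to the tokenizer class.
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "l": 3, "o": 4, "lo": 5}
merges = [("l", "o")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)  # class name illustrative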
But you can no longer pass a vocab file, as that is covered by the `from_pretrained` use-case.

2. Simplified decoding API:

The `batch_decode` method has been unified with `decode`. Both single and batch decoding now use the same method:
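A sketch of what this looks like in practice (the token ids and decoded outputs shown in the comments are illustrative):

ids = tokenizer("Hello world", add_special_tokens=False)["input_ids"]                 # list[int]
batch_ids = tokenizer(["Hello world", "Bye"], add_special_tokens=False)["input_ids"]  # list[list[int]]

tokenizer.decode(ids)        # -> "Hello world"
tokenizer.decode(batch_ids)  # -> ["Hello world", "Bye"], one string per sequence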
This is mostly because people get `list[list[int]]` out of `generate`, but then they would use `decode` (because they had used `encode`) and the result was not what they expected.

3. Unified encoding API:
The `encode_plus` method is deprecated → call the tokenizer directly via `__call__`.
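For instance:

# Before (deprecated):
enc = tokenizer.encode_plus("Hello world", add_special_tokens=True)

# Now: just call the tokenizer.
enc = tokenizer("Hello world", add_special_tokens=True)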
4. `apply_chat_template` returns `BatchEncoding`:

Previously, `apply_chat_template` returned `input_ids` for backward compatibility. In v5, it now consistently returns a `BatchEncoding` dict like other tokenizer methods.

Removed legacy configuration file saving:

- `special_tokens_map.json` - special tokens are now stored in `tokenizer_config.json`.
- `added_tokens.json` - added tokens are now stored in `tokenizer.json`.
- `added_tokens_decoder` is only stored when there is no `tokenizer.json`.

When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format.