rm slow tokenizers #40936
Status: Merged
277 commits
5fe5666
fixes missed
itazap 51e62e1
gemma test fix
itazap 0e5dbdf
refactor
itazap 9136d3c
rm legacy from llama
itazap ab77f57
added renaming
itazap 36bc3ef
add _model
itazap c4f045c
update legacy
itazap c80dd1d
update legacy
itazap 790c092
fix docstring
itazap f4d956a
always load blank, then set _tokenizer if we have it
itazap b2c320c
new toks
itazap 0c3caff
update all berttokenizer based models
itazap d43412a
apply feedback - delete bert duplicates
itazap 48eeb50
more models --> fast only
itazap d3a3cbd
more convert_slow models
itazap 493f9e0
fix common test refs
itazap a51cea0
updating fast only tokenizers
itazap d9c1ec3
openai and pegasus
itazap d879bc3
enable sentencepiecebackend
itazap ca51029
more models
itazap 132c617
code gen
itazap ed5bf86
t5
itazap 158b444
code gen tests
itazap 64eaf88
speecht5
itazap 95f48d3
mbart
itazap f3248d2
mbart50
itazap f3dd103
more models
itazap c66037d
more models
itazap cb5e08b
layouglmv2
itazap 3159033
update tests
itazap a14a45d
update tests
itazap 7ca10f8
update tests
itazap f5cbc49
pretrainedtokenizer
itazap 72e8043
whisper
itazap 3cd8e5b
whisper
itazap 4bf2b85
layoutxlm and storing backends
itazap 2ef0fd3
refactor sentencepiecebackend and additional_special_tokens
itazap 5c7d347
renaming tokenization_utils --> tokenization_python
itazap fcf67ff
udpate tests
itazap a8ccf16
bert test
itazap ccca98e
blenderbot
itazap c118c10
clip
itazap 0f74081
codegen
itazap a11dba7
code_llama
itazap b678cde
cohere
itazap ea9a546
deberata, deberat v2, funnel
itazap ffbdecf
gpt2
itazap 9f08ade
batch update tests
itazap a7cd5c0
pegasus qwen2 roberta
itazap b5b3cd9
more models
itazap 1250bcc
layout tests
itazap cf72cae
some renaming
itazap 4fafdcc
fix references to utils_fast
itazap 236f9f1
fix refs
itazap cd743bf
fix refs
itazap 0e7e593
fix refs
itazap 2af6d2c
fix refs
itazap b58b7b1
fix refs
itazap 518dcaf
fix refs
itazap 0f2f4b6
fix refs
itazap c849148
fix some tests
itazap 0d54bbd
regression
itazap 81a140a
fix refs
itazap 61366d6
fix refs
itazap 4374a66
missed the most crucial file in my last commit
itazap df383d7
fix refs
itazap b8035ec
fix refs
itazap 37e1b92
fix refs
itazap 9b45774
batch encode fix
itazap a24856d
fix some tests
itazap 1868870
BC for batch_decode bc too many refs
itazap 35dd250
more tests
itazap b0428f3
fix more tests
itazap 8fe6873
fix for processors
itazap c1e0e46
fixing more models
itazap 79568cd
deleted mbart50 by accident
itazap cfa159a
seamless m4t
itazap 5854f4c
albert fix
itazap 714a856
whisper
itazap c016f11
layout3
itazap 2e3e178
attempt to fix cached tokenizers on CI
itazap 03e3ab9
trying another fix on CI
itazap 2c30d79
again try to work around CI
itazap 98f51d5
bertweet
itazap 96f0517
tapas
itazap c26f54b
mbart50
itazap da0bbf0
luke
itazap 494ef3e
mluke
itazap 39bb884
markuplm
itazap 960dfcf
markuplm
itazap 54992a0
fix some more auto tests
itazap d0383bd
some random model failures
itazap a969c6b
mistralcommontestser
itazap 2bf4a13
more fixes
itazap e88322f
ref fix
itazap cfb0100
siglip
itazap 0fd1066
marian
itazap 02c524c
plbart
itazap 820191e
update utils toks
itazap 0cd714d
seamless m4t
itazap 8a412bc
roc bert
itazap e8c3258
udpate byt5 test
itazap 85a3b1f
xlm
itazap 45e718f
esm
itazap 96fc467
roformer
itazap 7727e3b
code llama
itazap 6795515
biogpt
itazap 2f49a39
m2m100
itazap a42e7a8
dpr and flaubert
itazap 33634be
xlm and speech to text
itazap ca5e389
tok backend pass object
itazap 25021d4
tokenizer object pass
itazap 69610fe
wav2vec2
itazap 51799ca
wav2vec2
itazap f23abc3
cpmant
itazap 88f0db5
update utils tokenizers
itazap 077e6f8
cpmant
itazap e004b56
bartpho
itazap e069763
test apply chat template assistant mask
itazap 9df9cfc
apply chat template video
itazap dc9b1ae
apply chat template assistant mask
itazap 4c05e9d
test torch
itazap 5c209a4
update from slow in base and fix donut processor errors
itazap d8a8db8
auto to point to tokenizers backend, fix kosmos2
itazap 6b40d91
some non model fixes for old slow models that no longer have their ow…
itazap 976265b
missed file from last commit
itazap b6ca8b2
idefics2
itazap 5c72105
fixup
ArthurZucker 964b461
fixup
ArthurZucker 0381407
pretrained tokenizer fast test update
itazap 887b477
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker f4c46ab
stash
ArthurZucker efbbb04
Merge branch 'one_tokenizer' of github.com:huggingface/transformers i…
ArthurZucker 71ef282
bad merged
ArthurZucker a5b018c
cherry pick more stuff that did not merge well
ArthurZucker 8ea91f6
fix gptsw3
ArthurZucker 1947894
nit warn for now
ArthurZucker 20a06ff
update error raising
ArthurZucker aa197a0
just ran fixup
ArthurZucker 63c7c1c
bring back bert legacy
ArthurZucker 5895bab
fix
ArthurZucker 6b8217b
nit
ArthurZucker 184ed58
fix 56 errors on blenderbotsmall?
ArthurZucker 09e4021
18 for blenderbotsmall
ArthurZucker a8c299e
tok auto
itazap 1259052
missed clip
itazap 06e3485
fix tests
itazap 3a95bf1
something missed
itazap 05d5c08
token healing
itazap 78f4e58
tok common tests update - nonmodel
itazap 8fbaf83
try to fix non-model test in test_tokenization_utils
itazap fd40b1b
fix hub tests
itazap 70330b8
try to fix hub tests
itazap 7c78007
custom vocab related fixed
itazap ca1f6b0
bert jap
itazap dd3ae59
BERT JAP
itazap 2e1893f
rename bert legacy to bert legacy
itazap f4be6a9
Wav2vec2
itazap 919103a
fix in tok python to update total vocab size - fixes speech t5
itazap c452f92
blender bot small
itazap 6d167eb
forgot test file
itazap 025722b
test failures
itazap 7d1d0d3
marian
itazap dfb67a4
gpt2 tiktoken
itazap 51da6b2
big bird / marian
itazap c611058
udop
itazap cc4a972
forgot couple changes
itazap 51202da
test_serve fix
itazap ca988b9
missing import
itazap f5bc69e
a couple processors fixes
itazap c67de10
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker 045bbff
style partly
ArthurZucker 75662fd
fix to fetch tests ci
itazap 8d248a3
Revert branch back to commit f5bc69ef state
itazap 4c29924
revert branch to styling
itazap 189cabd
update mistral after merge
itazap e02741c
fixes for non model tests
itazap b828ae1
some processor test fixes
itazap 83b579c
more processor test fixes
itazap 2ce27bc
more processor fixes
itazap 881b97c
hub tests
itazap 2e28b3d
python tok utils
itazap 925d187
fix hub test
itazap 6624231
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker 437321b
make style for now
ArthurZucker cd4d3ac
remove problemattic fic copies
ArthurZucker 5c5864f
python utils/check_copies.py --fix_and_overwrite
ArthurZucker 2f13c13
more styling
ArthurZucker 1e1aa11
fixup
ArthurZucker 5eeb1fe
silence docstirng
ArthurZucker dea8e1e
fix import?
ArthurZucker 452d6d8
fix imports
ArthurZucker e650205
add the local test as well
ArthurZucker 3dd1716
throw spm error
itazap e700dfa
llamas
itazap ce23d67
fix a couple tests
itazap ff1bf36
broke ci
itazap 0bdfeae
broke ci
itazap a137649
broke ci
itazap 366597c
broke ci
itazap 22887b1
add logs to debug gemma on ci
itazap 73819f4
gemma and llama
itazap c24c997
gemma
itazap 551a959
revert las commit
itazap a18e84d
gemma debug
itazap c23ee13
gemma debug
itazap 93187b3
gemma
itazap 81428ef
safely import spiece backend
itazap eb95c2e
tok tests
itazap 24d89c4
check none
itazap e2c4434
setup and qual
itazap 7a737b7
ruff
itazap a19c90c
del dev files
itazap 18e7484
tok auto
itazap 3cdd8ee
fill docstrings
itazap 50756c4
update auto
itazap 6bccb46
blenderbot small nit
itazap a76015a
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker 4afb570
add migration guide
ArthurZucker be1d95a
move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor`
ArthurZucker fad31d7
rename MistralCommonTokenizer to MistralCommonB ackend
ArthurZucker d4aff20
Merge branch 'one_tokenizer' of github.com:huggingface/transformers i…
ArthurZucker 3ab4bec
nit
ArthurZucker 0c1a40a
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker 30f1640
fix failures
ArthurZucker f2a1482
fixup
ArthurZucker d8010f8
remoove one old test
ArthurZucker 82e5675
mark the slow one as slow
ArthurZucker 088fc39
very small fixes
ArthurZucker f677ddf
update auto mapping for missing ones
ArthurZucker d30e46b
fixup lorsd
ArthurZucker ad24f43
fixup doc and stuff
ArthurZucker ebfe7f1
should be the final fixe
ArthurZucker c4a743d
processing update
ArthurZucker f81a966
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker 9a5638d
update
ArthurZucker 7c32dfb
FIX or brute AI fix the llava test
ArthurZucker c520a66
style
ArthurZucker 718b2f0
slow?
ArthurZucker 20d9036
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker 8f536c2
fix is offline mode?
ArthurZucker e96c18b
fix mt5
ArthurZucker 5ce65b8
One tok utils (#42462)
itazap 4418e8a
Merge branch 'main' of github.com:huggingface/transformers into one_t…
ArthurZucker 7f9954a
fix cohere
ArthurZucker bfa5fd0
Merge branch 'one_tokenizer' of github.com:huggingface/transformers i…
ArthurZucker 4dce834
?
ArthurZucker fcdc9bb
up
ArthurZucker a5a3a7c
am I dumbb?
ArthurZucker 0244be9
grumble
ArthurZucker
New file added (diff hunk `@@ -0,0 +1,185 @@`):
```python
# Copyright 2025 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utilities for creating fast tokenizers from scratch.
"""

from typing import Optional

from tokenizers import AddedToken, Regex, Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE, Unigram

from .utils import is_protobuf_available, is_sentencepiece_available, logging, requires_backends
```
```python
def _get_prepend_scheme(add_prefix_space: bool, original_tokenizer) -> str:
    if add_prefix_space:
        prepend_scheme = "always"
        if not getattr(original_tokenizer, "legacy", True):
            prepend_scheme = "first"
    else:
        prepend_scheme = "never"
    return prepend_scheme
```
```python
def generate_merges(vocab, vocab_scores: Optional[dict[str, float]] = None):
    reverse = vocab_scores is not None
    vocab_scores = dict(vocab_scores) if reverse else vocab

    merges = []
    for merge, piece_score in vocab_scores.items():
        local = []
        for index in range(1, len(merge)):
            piece_l, piece_r = merge[:index], merge[index:]
            if piece_l in vocab and piece_r in vocab:
                local.append((piece_l, piece_r, piece_score))
        local = sorted(local, key=lambda x: (vocab[x[0]], vocab[x[1]]))
        merges.extend(local)

    merges = sorted(merges, key=lambda val: (val[2], len(val[0]), len(val[1])), reverse=reverse)
    merges = [(val[0], val[1]) for val in merges]
    return merges
```python
class SentencePieceExtractor:
    """
    Extractor implementation for SentencePiece trained models. https://github.com/google/sentencepiece
    """

    def __init__(self, model: str):
        requires_backends(self, "sentencepiece")
        from sentencepiece import SentencePieceProcessor

        self.sp = SentencePieceProcessor()
        self.sp.Load(model)

    def extract(self, vocab_scores=None) -> tuple[dict[str, int], list[tuple]]:
        """
        By default, returns the vocab and merges in their original order; passing `vocab_scores`
        orders the merges by piece score instead.
        """
        sp = self.sp
        vocab = {sp.id_to_piece(index): index for index in range(sp.GetPieceSize())}

        # let's get the vocab_scores
        vocab_scores = {sp.id_to_piece(i): sp.get_score(i) for i in range(sp.GetPieceSize())}

        merges = generate_merges(vocab, vocab_scores)

        return vocab, merges
```
```python
class SpmTokenizer:
    """
    Base SentencePiece tokenizer that can be instantiated with model-specific arguments.
    """

    def __init__(
        self,
        handle_byte_fallback: bool = True,
        legacy: bool = False,
        add_prefix_space: bool = True,
        special_tokens: Optional[dict] = None,
        vocab: Optional[callable] = None,
        unk_id: Optional[callable] = None,
        normalizer: Optional[callable] = None,
        pre_tokenizer: Optional[callable] = None,
        decoder: Optional[callable] = None,
        post_processor: Optional[callable] = None,
        tokenizer: Optional[callable] = None,
    ):
        self.handle_byte_fallback = handle_byte_fallback
        self.legacy = legacy
        self.add_prefix_space = add_prefix_space
        self.special_tokens = special_tokens or {}
        # Store user-provided callables under private names to avoid clashing with methods
        self._vocab_fn = vocab
        self._unk_id_fn = unk_id
        self._normalizer_fn = normalizer
        self._pre_tokenizer_fn = pre_tokenizer
        self._decoder_fn = decoder
        self._post_processor_fn = post_processor
        self._tokenizer_fn = tokenizer

    def vocab(self):
        if self._vocab_fn is not None:
            return self._vocab_fn()
        # Return empty vocab for training
        return []

    def unk_id(self):
        if self._unk_id_fn is not None:
            return self._unk_id_fn()
        return 0  # Default unk_id

    def tokenizer(self):
        # Always create an empty trainable tokenizer
        minimal_vocab = [("<unk>", 0.0)]
        return Tokenizer(Unigram(minimal_vocab, unk_id=self.unk_id(), byte_fallback=self.handle_byte_fallback))

    def normalizer(self):
        if self._normalizer_fn is not None:
            return self._normalizer_fn()
        _normalizers = [
            normalizers.Strip(left=False, right=True),
            normalizers.Replace(Regex(" {2,}"), "▁"),
        ]
        return normalizers.Sequence(_normalizers)

    def pre_tokenizer(self, replacement, add_prefix_space):
        if self._pre_tokenizer_fn is not None:
            return self._pre_tokenizer_fn(replacement, add_prefix_space)

        prepend_scheme = _get_prepend_scheme(add_prefix_space, self)
        return pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme)

    def decoder(self, replacement, add_prefix_space):
        if self._decoder_fn is not None:
            return self._decoder_fn(replacement, add_prefix_space)

        prepend_scheme = _get_prepend_scheme(add_prefix_space, self)
        return decoders.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme)

    def post_processor(self):
        if self._post_processor_fn is not None:
            return self._post_processor_fn()
        return None

    def create_tokenizer(self) -> Tokenizer:
        """Create and return the configured empty trainable tokenizer."""
        if self._tokenizer_fn is not None:
            tokenizer = self._tokenizer_fn()
        else:
            tokenizer = self.tokenizer()

        # Assemble the tokenizer
        normalizer = self.normalizer()
        if normalizer is not None:
            tokenizer.normalizer = normalizer

        replacement = "▁"
        add_prefix_space = self.add_prefix_space

        pre_tokenizer = self.pre_tokenizer(replacement, add_prefix_space)
        if pre_tokenizer is not None:
            tokenizer.pre_tokenizer = pre_tokenizer

        tokenizer.decoder = self.decoder(replacement, add_prefix_space)
        post_processor = self.post_processor()
        if post_processor:
            tokenizer.post_processor = post_processor

        return tokenizer


__all__ = ["SpmTokenizer", "_get_prepend_scheme"]
```
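Every pipeline component of `SpmTokenizer` (vocab, normalizer, pre-tokenizer, decoder, post-processor) can be injected as a constructor callable, falling back to a built-in default method when omitted. A minimal sketch of that override pattern with a toy class (not the real API, and no `tokenizers` dependency):

```python
class ConfigurableComponent:
    """Toy illustration of SpmTokenizer's callable-override pattern."""

    def __init__(self, normalizer=None):
        # The user-supplied callable is stored under a private name so it
        # does not clash with the method of the same name.
        self._normalizer_fn = normalizer

    def normalizer(self):
        if self._normalizer_fn is not None:
            return self._normalizer_fn()
        return "default-normalizer"


# No override: the method's built-in default is used.
print(ConfigurableComponent().normalizer())                             # default-normalizer
# Model-specific override injected at construction time.
print(ConfigurableComponent(normalizer=lambda: "custom").normalizer())  # custom
```

This keeps one generic class serving many model-specific configurations, instead of one converter subclass per model as with the removed slow-tokenizer converters.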