Description
System Info
- `transformers` version: 4.57.3
- Platform: Windows-11-10.0.26200-SP0
- Python version: 3.12.9
- Huggingface_hub version: 0.36.0
- Safetensors version: 0.7.0
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cpu (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
- torch: 2.9.1
- torchcodec: 0.8.1
- ffmpeg: version 8.0.1-full_build-www.gyan.dev
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- `uv init`
- `uv add torch torchcodec transformers`
- Run the following example from https://huggingface.co/docs/transformers/main/en/task_summary#automatic-speech-recognition:

```python
from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition", model="openai/whisper-small"
)
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
```

Then I get the following error:
$ uv run ./main.py
Device set to use cpu
Traceback (most recent call last):
File "C:\Users\user\transformers-debug\main.py", line 6, in <module>
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 275, in __call__
return super().__call__(inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1459, in __call__
return next(
^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 126, in __next__
item = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 271, in __next__
processed = self.infer(next(self.iterator), **self.params)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torch\utils\data\dataloader.py", line 732, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torch\utils\data\dataloader.py", line 788, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 33, in fetch
data.append(next(self.dataset_iter))
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 188, in __next__
processed = next(self.subiterator)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 381, in preprocess
import torchcodec
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\__init__.py", line 10, in <module>
from . import decoders, samplers # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\decoders\__init__.py", line 7, in <module>
from .._core import AudioStreamMetadata, VideoStreamMetadata
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\_core\__init__.py", line 8, in <module>
from ._metadata import (
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\_core\_metadata.py", line 16, in <module>
from torchcodec._core.ops import (
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\_core\ops.py", line 84, in <module>
load_torchcodec_shared_libraries()
File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\_core\ops.py", line 69, in load_torchcodec_shared_libraries
raise RuntimeError(
RuntimeError: Could not load libtorchcodec. Likely causes:
1. FFmpeg is not properly installed in your environment. We support
versions 4, 5, 6, and 7 on all platforms, and 8 on Mac and Linux.
2. The PyTorch version (2.9.1+cpu) is not compatible with
this version of TorchCodec. Refer to the version compatibility
table:
https://github.com/pytorch/torchcodec?tab=readme-ov-file#installing-torchcodec.
3. Another runtime dependency; see exceptions below.
The following exceptions were raised as we tried to load libtorchcodec:
[start of libtorchcodec loading traceback]
FFmpeg version 8: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core8.dll
FFmpeg version 7: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core7.dll
FFmpeg version 6: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core6.dll
FFmpeg version 5: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core5.dll
FFmpeg version 4: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core4.dll
[end of libtorchcodec loading traceback].

I think the cause is that on Windows, even when ffmpeg and torchcodec are installed, a plain `import torchcodec` raises an error unless you use conda or manually call `os.add_dll_directory("path/to/ffmpeg/dll/dir")` for shared builds of ffmpeg.
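For reference, this is roughly the manual workaround I mean (a minimal sketch; `C:\ffmpeg\bin` is a placeholder for wherever the shared FFmpeg DLLs actually live):

```python
import os

# On Windows, Python (3.8+) no longer searches PATH for dependent DLLs,
# so the directory containing the FFmpeg DLLs has to be registered explicitly
# before torchcodec tries to load its libtorchcodec_core*.dll libraries.
os.add_dll_directory(r"C:\ffmpeg\bin")  # placeholder path, adjust to your install

import torchcodec  # succeeds only once the FFmpeg DLLs can be resolved
```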
So the following part raises an error (because `is_torchcodec_available()` reports torchcodec as available, but the import itself fails):

transformers/src/transformers/pipelines/automatic_speech_recognition.py, lines 375 to 376 at cac0a28:

```python
if is_torchcodec_available():
    import torchcodec
```
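One way to make this robust would be something along these lines (just a sketch of the idea, not the actual transformers code; it assumes `is_torchcodec_available` is importable from `transformers.utils`, and the final flag stands in for whatever the pipeline already does when torchcodec is not installed):

```python
from transformers.utils import is_torchcodec_available

# Hypothetical defensive variant: only treat torchcodec as usable if
# importing it actually succeeds, otherwise behave as if it were absent.
torchcodec = None
if is_torchcodec_available():
    try:
        import torchcodec
    except (ImportError, OSError, RuntimeError):
        # e.g. the libtorchcodec_core*.dll libraries cannot be loaded on Windows
        torchcodec = None

use_torchcodec = torchcodec is not None
```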
Expected behavior
No error should be raised. For comparison, without torchcodec installed, I get:
$ uv run ./main.py
Device set to use cpu
`return_token_timestamps` is deprecated for WhisperFeatureExtractor and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it.
Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.

This output is fine.