
Automatic Speech Recognition pipeline raises error when torchcodec is installed but not usable #42499

@litagin02

Description


System Info

  • transformers version: 4.57.3

  • Platform: Windows-11-10.0.26200-SP0

  • Python version: 3.12.9

  • Huggingface_hub version: 0.36.0

  • Safetensors version: 0.7.0

  • Accelerate version: not installed

  • Accelerate config: not found

  • DeepSpeed version: not installed

  • PyTorch version (accelerator?): 2.9.1+cpu (NA)

  • Tensorflow version (GPU?): not installed (NA)

  • Flax version (CPU?/GPU?/TPU?): not installed (NA)

  • Jax version: not installed

  • JaxLib version: not installed

  • Using distributed or parallel set-up in script?: No

  • torch: 2.9.1

  • torchcodec: 0.8.1

  • ffmpeg: version 8.0.1-full_build-www.gyan.dev

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. uv init
  2. uv add torch torchcodec transformers
  3. Run the following: https://huggingface.co/docs/transformers/main/en/task_summary#automatic-speech-recognition
from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition", model="openai/whisper-small"
)
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")

Then I get the following error:

$ uv run ./main.py
Device set to use cpu
Traceback (most recent call last):
  File "C:\Users\user\transformers-debug\main.py", line 6, in <module>
    transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 275, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1459, in __call__
    return next(
           ^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 126, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 271, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torch\utils\data\dataloader.py", line 732, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torch\utils\data\dataloader.py", line 788, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 33, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 188, in __next__
    processed = next(self.subiterator)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 381, in preprocess
    import torchcodec
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\__init__.py", line 10, in <module>
    from . import decoders, samplers  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\decoders\__init__.py", line 7, in <module>
    from .._core import AudioStreamMetadata, VideoStreamMetadata
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\_core\__init__.py", line 8, in <module>
    from ._metadata import (
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\_core\_metadata.py", line 16, in <module>
    from torchcodec._core.ops import (
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\_core\ops.py", line 84, in <module>
    load_torchcodec_shared_libraries()
  File "C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\_core\ops.py", line 69, in load_torchcodec_shared_libraries
    raise RuntimeError(
RuntimeError: Could not load libtorchcodec. Likely causes:
          1. FFmpeg is not properly installed in your environment. We support
             versions 4, 5, 6, and 7 on all platforms, and 8 on Mac and Linux.
          2. The PyTorch version (2.9.1+cpu) is not compatible with
             this version of TorchCodec. Refer to the version compatibility
             table:
             https://github.com/pytorch/torchcodec?tab=readme-ov-file#installing-torchcodec.
          3. Another runtime dependency; see exceptions below.
        The following exceptions were raised as we tried to load libtorchcodec:

[start of libtorchcodec loading traceback]
FFmpeg version 8: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core8.dll
FFmpeg version 7: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core7.dll
FFmpeg version 6: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core6.dll
FFmpeg version 5: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core5.dll
FFmpeg version 4: Could not load this library: C:\Users\user\transformers-debug\.venv\Lib\site-packages\torchcodec\libtorchcodec_core4.dll
[end of libtorchcodec loading traceback].

I think the cause is that, on Windows, even when ffmpeg and torchcodec are both installed, a plain `import torchcodec` raises an error unless you use conda or manually call `os.add_dll_directory("path/to/ffmpeg/dll/dir")` for shared builds of ffmpeg.
So the following part raises an error (because `is_torchcodec_available()` reports torchcodec as available, but importing it still fails):

if is_torchcodec_available():
    import torchcodec
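A more defensive guard would verify that torchcodec is actually importable rather than merely installed. This is only a sketch of that idea, not the actual transformers implementation; the helper name and the fallback branch are hypothetical:

```python
def torchcodec_importable() -> bool:
    """Return True only if torchcodec can actually be imported,
    not merely if the package is installed. On Windows, torchcodec
    raises RuntimeError at import time when its FFmpeg DLLs cannot
    be loaded, so we catch RuntimeError in addition to ImportError."""
    try:
        import torchcodec  # noqa: F401
        return True
    except (ImportError, RuntimeError):
        return False


if torchcodec_importable():
    import torchcodec
else:
    # Hypothetical fallback: decode audio with another backend
    # (e.g. soundfile/librosa) instead of failing the whole pipeline.
    pass
```

Catching `RuntimeError` here matters because, as the traceback above shows, the failure happens inside torchcodec's own `load_torchcodec_shared_libraries()` at import time, not as a clean `ImportError`.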

Expected behavior

No error should be raised. Indeed, without torchcodec installed, I get

$ uv run ./main.py
Device set to use cpu
`return_token_timestamps` is deprecated for WhisperFeatureExtractor and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it.
Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.

which is fine.
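For reference, a user-side workaround on Windows is to register the FFmpeg DLL directory before importing torchcodec. The sketch below assumes a shared FFmpeg build; the path is hypothetical and must be adjusted to the actual install location:

```python
import os
import sys


def register_ffmpeg_dlls(dll_dir: str) -> bool:
    """On Windows (Python 3.8+), extension modules no longer resolve
    DLL dependencies via PATH, so the FFmpeg DLL directory must be
    registered explicitly. Returns True if the directory was added."""
    if sys.platform == "win32" and os.path.isdir(dll_dir):
        os.add_dll_directory(dll_dir)
        return True
    return False


# Hypothetical location of the shared FFmpeg build's DLLs:
register_ffmpeg_dlls(r"C:\ffmpeg\bin")
# import torchcodec  # should now be able to load libtorchcodec_core*.dll
```

This works because `os.add_dll_directory` adds the directory to the search path Windows uses when loading the `libtorchcodec_core*.dll` libraries listed in the traceback.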
