torchaudio 0.13.0 Release Note
Highlights
TorchAudio 0.13.0 release includes:
- Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
- New datasets and metadata mode for the SUPERB benchmark
- Custom language model support for CTC beam search decoding
- StreamWriter for audio and video encoding
[Beta] Source Separation Models and Bundles
Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)
The TorchAudio v0.13 release includes the following features:
- MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
- Hybrid Demucs model architecture (docs)
- Three factory functions suitable for different sample rate ranges
- Pre-trained pipelines (docs) and tutorial
SDR Results of pre-trained pipelines on MUSDB-HQ test set
| Pipeline | All | Drums | Bass | Other | Vocals |
|---|---|---|---|---|---|
| HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
| HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |
* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.
Special thanks to @adefossez for the guidance.
The ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds a pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6 dB SDR improvement and 15.3 dB Si-SNR improvement on the Libri2Mix test set.
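Both pipelines follow the same SourceSeparationBundle interface. Below is a minimal sketch of running the Hybrid Demucs bundle on a hypothetical stereo mixture tensor (the CONVTASNET_BASE_LIBRI2MIX bundle follows the same pattern, at 8 kHz with mono input):

import torch
import torchaudio

bundle = torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model()

# Hypothetical stereo mixture at the bundle's sample rate (44.1 kHz).
mixture = torch.randn(1, 2, bundle.sample_rate * 5)  # (batch, channels, frames)

with torch.inference_mode():
    sources = model(mixture)  # (batch, num_sources, channels, frames)

# HDemucs separates the mixture into drums, bass, other, and vocals.
drums, bass, other, vocals = sources[0]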
[Beta] Datasets and Metadata Mode for SUPERB Benchmarks
With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.
Datasets with metadata functionality:
- LIBRISPEECH (docs)
- LibriMix (docs)
- QUESST14 (docs)
- SPEECHCOMMANDS (docs)
- (new) FluentSpeechCommands (docs)
- (new) Snips (docs)
- (new) IEMOCAP (docs)
- (new) VoxCeleb1 (Identification, Verification)
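For example, with LIBRISPEECH, get_metadata returns the same fields as regular indexing, but with the relative path of the audio file in place of the decoded waveform. A minimal sketch (the dataset root is hypothetical):

from torchaudio.datasets import LIBRISPEECH

dataset = LIBRISPEECH("path/to/data", url="test-clean", download=True)

# Same fields as dataset[0], but no waveform is loaded or decoded.
filepath, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset.get_metadata(0)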
[Beta] Custom Language Model support in CTC Beam Search Decoding
In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. This release adds functionality for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.
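A minimal sketch of the wrapper, assuming a hypothetical language_model object that scores a token given a decoding state (adapted from the custom LM example in the decoder tutorial):

from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState

class CustomLM(CTCDecoderLM):
    # Python wrapper around a hypothetical `language_model` object.
    def __init__(self, language_model):
        CTCDecoderLM.__init__(self)
        self.language_model = language_model
        self.sil = -1   # index of the silence token in the language model
        self.states = {}

    def start(self, start_with_nothing: bool = False):
        state = CTCDecoderLMState()
        self.states[state] = self.language_model.score(None, self.sil)
        return state

    def score(self, state: CTCDecoderLMState, token_index: int):
        outstate = state.child(token_index)
        if outstate not in self.states:
            self.states[outstate] = self.language_model.score(state, token_index)
        return outstate, self.states[outstate]

    def finish(self, state: CTCDecoderLMState):
        return self.score(state, self.sil)

An instance can then be passed to the decoder factory, e.g. ctc_decoder(..., lm=CustomLM(my_lm)).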
[Beta] StreamWriter
torchaudio.io.StreamWriter is a class for encoding media, including audio and video. It can handle a wide variety of codecs, chunk-by-chunk encoding, and GPU encoding.
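A minimal sketch of chunk-by-chunk audio encoding (the output path and chunk contents are illustrative):

import torch
from torchaudio.io import StreamWriter

writer = StreamWriter(dst="output.wav")
writer.add_audio_stream(sample_rate=16000, num_channels=1)

with writer.open():
    for _ in range(10):
        # Each chunk has shape (frames, channels); here, 0.1 seconds of silence.
        chunk = torch.zeros(1600, 1)
        writer.write_audio_chunk(0, chunk)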
Backward-incompatible changes
- [BC-breaking] Fix momentum in transforms.GriffinLim (#2568)
  The `GriffinLim` implementations in `transforms` and `functional` used the `momentum` parameter differently, resulting in inconsistent results between the two implementations. The `transforms.GriffinLim` usage of `momentum` is updated to resolve this discrepancy.
- Make `torchaudio.info` decode audio to compute `num_frames` if it is not found in metadata (#2740)
  In such cases, `torchaudio.info` may now return non-zero values for `num_frames`.
Bug Fixes
- Fix random Gaussian generation (#2639)
  `torchaudio.compliance.kaldi.fbank` with the dither option produced a different output from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This release updates it to correctly use a random Gaussian instead.
- Update download link for speech commands (#2777)
  The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. The download link has been updated to correctly download the whole dataset.
New Features
IO
- Add metadata to source stream info (#2461, #2464)
- Add utility function to fetch FFmpeg library versions (#2467)
- Add YUV444P support to StreamReader (#2516)
- Add StreamWriter (#2628, #2648, #2505)
- Support in-memory decoding via Tensor wrapper in StreamReader (#2694)
- Add StreamReader Tensor Binding to src (#2699)
- Add StreamWriter media device/streaming tutorial (#2708)
- Add StreamWriter tutorial (#2698)
Ops
- Add ITU-R BS.1770-4 loudness recommendation (#2472)
- Add convolution operator (#2602)
- Add additive noise function (#2608)
Models
- Hybrid Demucs model implementation (#2506)
- Docstring change for Hybrid Demucs (#2542, #2570)
- Add NNLM support to CTC Decoder (#2528, #2658)
- Move hybrid demucs model out of prototype (#2668)
- Move conv_tasnet_base doc out of prototype (#2675)
- Add custom lm example to decoder tutorial (#2762)
Pipelines
- Add SourceSeparationBundle to prototype (#2440, #2559)
- Adding pipeline changes, factory functions to HDemucs (#2547, #2565)
- Create tutorial for HDemucs (#2572)
- Add HDEMUCS_HIGH_MUSDB (#2601)
- Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (#2669)
- Move Hybrid Demucs pipeline to beta (#2673)
- Update description of HDemucs pipelines
Datasets
- Add fluent speech commands (#2480, #2510)
- Add musdb dataset and tests (#2484)
- Add VoxCeleb1 dataset (#2349)
- Add metadata function for LibriSpeech (#2653)
- Add Speech Commands metadata function (#2687)
- Add metadata mode for various datasets (#2697)
- Add IEMOCAP dataset (#2732)
- Add Snips Dataset (#2738)
- Add metadata for Librimix (#2751)
- Add file name to returned item in Snips dataset (#2775)
- Update IEMOCAP variants and labels (#2778)
Improvements
IO
- Replace `runtime_error` exception with `TORCH_CHECK` (#2550, #2551, #2592)
- Refactor StreamReader (#2507, #2508, #2512, #2530, #2531, #2533, #2534)
- Refactor sox C++ (#2636, #2663)
- Delay the import of kaldi_io (#2573)
Ops
- Speed up resample with kernel generation modification (#2553, #2561)
The kernel generation for resampling is optimized in this release. The following table illustrates the performance improvements from the previous release for the `torchaudio.functional.resample` function using the sinc resampling method, on a `float32` tensor with two channels and one second duration.
CPU
| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
| 0.12 | 0.386 | 0.534 | 31.8 | 12.1 |
CUDA
| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
| 0.12 | 0.524 | 0.334 | 64.4 | 22.8 |
- Add normalization parameter on spectrogram and inverse spectrogram (#2554)
- Replace assert with raise for ops (#2579, #2599)
- Replace CHECK_ by TORCH_CHECK_ (#2582)
- Fix argument validation in TorchAudio filtering (#2609)
Models
- Switch to flashlight decoder from upstream (#2557)
- Add dimension and shape check (#2563)
- Replace assert with raise in models (#2578, #2590)
- Migrate CTC decoder code (#2580)
- Enable CTC decoder in Windows (#2587)
Datasets
- Replace assert with raise in datasets (#2571)
- Add unit test for LibriMix dataset (#2659)
- Add gtzan download note (#2763)
Tutorials
- Tweak tutorials (#2630, #2733)
- Update ASR inference tutorial (#2631)
- Update and fix tutorials (#2661, #2701)
- Introduce IO section to getting started tutorials (#2703)
- Update HW video processing tutorial (#2739)
- Update tutorial author information (#2764)
- Fix typos in tacotron2 tutorial (#2761)
- Fix fading in hybrid demucs tutorial (#2771)
- Fix leaking matplotlib figure (#2769)
- Update resampling tutorial (#2773)
Recipes
- Use lazy import for joblib (#2498)
- Revise LibriSpeech Conformer RNN-T recipe (#2535)
- Fix bug in Conformer RNN-T recipe (#2611)
- Replace bg_iterator in examples (#2645)
- Remove obsolete examples (#2655)
- Fix LibriSpeech Conformer RNN-T eval script (#2666)
- Replace IValue::toString()->string() with IValue::toStringRef() (#2700)
- Improve wav2vec2/hubert model for pre-training (#2716)
- Improve hubert recipe for pre-training and fine-tuning (#2744)
WER improvement on LibriSpe...
torchaudio 0.12.1 Release Note
This is a minor release, which is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements, and documentation updates. There are no new features added.
Bug Fix
Improvement
- #2552 Remove unused boost source code
- #2527 Improve speech enhancement tutorial
- #2544 Update forced alignment tutorial
- #2595 Update data augmentation tutorial
For the full feature of v0.12, please refer to the v0.12.0 release note.
v0.12.0
TorchAudio 0.12.0 Release Notes
Highlights
TorchAudio 0.12.0 includes the following:
- CTC beam search decoder
- New beamforming modules and methods
- Streaming API
[Beta] CTC beam search decoder
To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight (GitHub). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.
For usage details, please check out the documentation and ASR inference tutorial.
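A minimal sketch using the downloadable LibriSpeech files; the emissions tensor here is random, standing in for real acoustic model output:

import torch
from torchaudio.models.decoder import ctc_decoder, download_pretrained_files

files = download_pretrained_files("librispeech-4-gram")

decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,   # pass lm=None to decode without a language model
    nbest=3,
    beam_size=50,
)

# Stand-in for acoustic model output: (batch, frames, num_tokens).
with open(files.tokens) as f:
    num_tokens = sum(1 for _ in f)
emissions = torch.randn(1, 100, num_tokens).log_softmax(-1)

hypotheses = decoder(emissions)
best = hypotheses[0][0]  # top hypothesis for the first sample; best.words holds the decoded words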
[Beta] New beamforming modules and methods
To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they:
- Use power spectral density (PSD) and relative transfer function (RTF) matrices as inputs instead of time-frequency masks. The module can be integrated with neural networks that directly predict complex-valued STFT coefficients of speech and noise.
- Add `reference_channel` as an input argument in the forward method, to allow users to select the reference channel in model training or dynamically change the reference channel in inference.
Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional. These include `psd`, `mvdr_weights_souden`, `mvdr_weights_rtf`, `rtf_evd`, `rtf_power`, and `apply_beamforming`.
For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.
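A minimal sketch of the Souden variant on random data (shapes follow the documented (channel, freq, time) convention; in practice the masks would come from a neural network):

import torch
import torchaudio.functional as F
from torchaudio.transforms import SoudenMVDR

specgram = torch.randn(4, 201, 100, dtype=torch.cfloat)  # multi-channel complex STFT
mask_speech = torch.rand(201, 100)   # time-frequency mask for speech
mask_noise = 1.0 - mask_speech       # time-frequency mask for noise

psd_speech = F.psd(specgram, mask_speech)  # (freq, channel, channel)
psd_noise = F.psd(specgram, mask_noise)

mvdr = SoudenMVDR()
enhanced = mvdr(specgram, psd_speech, psd_noise, reference_channel=0)  # (freq, time)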
[Beta] Streaming API
StreamReader is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to
- Decode various audio and video formats, including MP4 and AAC.
- Handle various input forms, such as local files, network protocols, microphones, webcams, screen captures and file-like objects.
- Iterate over and decode media chunk-by-chunk, while changing the sample rate or frame rate.
- Apply various audio and video filters, such as low-pass filter and image scaling.
- Decode video with Nvidia's hardware-based decoder (NVDEC).
For usage details, please check out the documentation and tutorials:
- Media Stream API - Pt.1
- Media Stream API - Pt.2
- Online ASR with Emformer RNN-T
- Device ASR with Emformer RNN-T
- Accelerated Video Decoding with NVDEC
† To use StreamReader, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.
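A minimal sketch of chunked decoding from a local file (the path and chunk sizes are illustrative):

from torchaudio.io import StreamReader

streamer = StreamReader(src="input.mp4")
streamer.add_basic_audio_stream(frames_per_chunk=16000, sample_rate=16000)
streamer.add_basic_video_stream(frames_per_chunk=30, frame_rate=30)

for audio_chunk, video_chunk in streamer.stream():
    # audio_chunk: (frames, channels); video_chunk: (frames, channels, height, width)
    ...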
Backwards-incompatible changes
I/O
- MP3 decoding is now handled by FFmpeg in sox_io backend. (#2419, #2428)
  - FFmpeg is now used as a fallback in the sox_io backend, and MP3 decoding is now handled by FFmpeg. To load MP3 audio with `torchaudio.load`, please install a compatible version of FFmpeg (version 4 when using an official binary distribution).
  - Note that, whereas the previous MP3 decoding scheme pads the output audio, the new scheme does not. As a consequence, the new version returns shorter audio tensors.
  - `torchaudio.info` now returns `num_frames=0` for MP3.
Models
- Change underlying implementation of RNN-T hypothesis to tuple (#2339)
  - In release 0.11, `Hypothesis` subclassed `namedtuple`. Containers of `namedtuple` instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, `Hypothesis` has been modified in release 0.12 to instead alias `tuple`. This affects `RNNTBeamSearch`, as it accepts and returns a list of `Hypothesis` instances.
Bug Fixes
Ops
- Fix return dtype in MVDR module (#2376)
  - In release 0.11, the MVDR module converts the dtype of the input spectrum to `complex128` to improve the precision and robustness of downstream matrix computations. The output dtype, however, was not correctly converted back to the original dtype. In release 0.12, we fix the output dtype to be consistent with the original input dtype.
Build
- Fix Kaldi submodule integration (#2269)
- Pin jinja2 version for build_docs (#2292)
- Use sourceforge url to fetch zlib (#2297)
New Features
I/O
- Add Streaming API (#2041, #2042, #2043, #2044, #2045, #2046, #2047, #2111, #2113, #2114, #2115, #2135, #2164, #2168, #2202, #2204, #2263, #2264, #2312, #2373, #2378, #2402, #2403, #2427, #2429)
- Add YUV420P format support to Streaming API (#2334)
- Support specifying decoder and its options (#2327)
- Add NV12 format support in Streaming API (#2330)
- Add HW acceleration support on Streaming API (#2331)
- Add file-like object support to Streaming API (#2400)
- Make FFmpeg log level configurable (#2439)
- Set the default ffmpeg log level to FATAL (#2447)
Ops
- New beamforming methods (#2227, #2228, #2229, #2230, #2231, #2232, #2369, #2401)
- New MVDR modules (#2367, #2368)
- Add and refactor CTC lexicon beam search decoder (#2075, #2079, #2089, #2112, #2117, #2136, #2174, #2184, #2185, #2273, #2289)
- Add lexicon free CTC decoder (#2342)
- Add Pretrained LM Support for Decoder (#2275)
- Move CTC beam search decoder to beta (#2410)
Datasets
Improvements
I/O
Ops
- Raise error for resampling int waveform (#2318)
- Move multi-channel modules to a separate file (#2382)
- Refactor MVDR module (#2383)
Models
- Add an option to use Tanh instead of ReLU in RNNT joiner (#2319)
- Support GroupNorm and re-ordering Convolution/MHA in Conformer (#2320)
- Add extra arguments to hubert pretrain factory functions (#2345)
- Add feature_grad_mult argument to HuBERTPretrainModel (#2335)
Datasets
Performance
- Make PitchShift faster by caching the resampling kernel (#2441)
  The following table illustrates the performance improvement over the previous release by comparing the time in msecs it takes `torchaudio.transforms.PitchShift`, after its first call, to perform the operation on a `float32` tensor with two channels and 8000 frames, resampled to 44.1 kHz, across various shift steps.
| TorchAudio Version | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0.12 | 2.76 | 5 | 1860 | 223 |
| 0.11 | 6.71 | 161 | 8680 | 1450 |
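A minimal sketch illustrating the caching behavior (input shapes match the benchmark setup):

import torch
import torchaudio.transforms as T

pitch_shift = T.PitchShift(sample_rate=8000, n_steps=4)
waveform = torch.randn(2, 8000)

_ = pitch_shift(waveform)        # first call builds and caches the resampling kernel
shifted = pitch_shift(waveform)  # subsequent calls reuse the cached kernel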
Tests
- Add complex dtype support in functional autograd test (#2244)
- Refactor torchscript consistency test in functional (#2246)
- Add unit tests for PyTorch Lightning modules of emformer_rnnt recipes (#2240)
- Refactor batch consistency test in functional (#2245)
- Run smoke tests on regular PRs (#2364)
- Refactor smoke test executions (#2365)
- Move seed to setup (#2425)
- Remove possible manual seeds from test files (#2436)
Build
- Revise the parameterization of third party libraries (#2282)
- Use zlib v1.2.12 with GitHub source (#2300)
- Fix ffmpeg integration for ffmpeg 5.0 (#2326)
- Use custom FFmpeg libraries for torchaudio binary distributions (#2355)
- Adding m1 builds to torchaudio (#2421)
Other
- Add download utility specialized for torchaudio (#2283)
- Use module-level `__getattr__` to implement delayed initialization (#2377)
- Update build_doc job to use Conda CUDA package (#2395)
- Update I/O initialization (#2417)
- Add Python 3.10 (build and test) (#2224)
- Retrieve version from version.txt (#2434)
- Disable OpenMP on mac (#2431)
Examples
Ops
- Add CTC decoder example for librispeech (#2130, #2161)
- Fix LM, arguments in CTC decoding script (#2235, #2315)
- Use pretrained LM API for decoder example (#2317)
Pipelines
- Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles (#2203)
- Refactor eval and pipeline_demo scripts in emformer_rnnt (#2238)
- Refactor pipeline_demo script in emformer_rnnt recipes (#2239)
- Add EMFORMER_RNNT_BASE_MUSTC into pipeline demo script (#2248)
Tests
- Add unit tests for Emformer RNN-T LibriSpeech recipe (#2216)
- Add fixed random seed for Emformer RNN-T recipe test (#2220)
Training recipes
v0.11.0
torchaudio 0.11.0 Release Note
Highlights
TorchAudio 0.11.0 release includes:
- Emformer (paper) RNN-T components, training recipe, and pre-trained pipeline for streaming ASR
- Voxpopuli pre-trained pipelines
- HuBERTPretrainModel for training HuBERT from scratch
- Conformer model for speech recognition
- Drop Python 3.6 support
[Beta] Emformer RNN-T
To support streaming ASR use cases, the release adds implementations of Emformer (docs), an RNN-T model that uses Emformer (emformer_rnnt_base), and an RNN-T beam search decoder (RNNTBeamSearch). It also includes a pipeline bundle (EMFORMER_RNNT_BASE_LIBRISPEECH) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech, which in whole allow for performing streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.
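A minimal sketch of non-streaming inference with the bundle (the audio file is hypothetical and assumed to be 16 kHz mono):

import torch
import torchaudio

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
feature_extractor = bundle.get_feature_extractor()
decoder = bundle.get_decoder()           # RNNTBeamSearch
token_processor = bundle.get_token_processor()

waveform, sample_rate = torchaudio.load("speech.wav")

with torch.no_grad():
    features, length = feature_extractor(waveform.squeeze(0))
    hypotheses = decoder(features, length, 10)  # beam width of 10

print(token_processor(hypotheses[0][0]))  # best hypothesis as text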
[Beta] HuBERT Pretrain Model
The masked prediction training of HuBERT model requires the masked logits, unmasked logits, and feature norm as the outputs. The logits are for cross-entropy losses and the feature norm is for penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch.
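A minimal sketch of a pretraining forward pass on random data; the label count assumes the convolutional feature extractor's ~20 ms frame rate (49 frames for one second at 16 kHz), and in practice the labels come from clustering (e.g. k-means over MFCC features):

import torch
from torchaudio.models import hubert_pretrain_base

model = hubert_pretrain_base()

waveforms = torch.randn(2, 16000)        # (batch, time)
labels = torch.randint(0, 100, (2, 49))  # pseudo-labels, one per output frame
lengths = torch.tensor([16000, 16000])

logit_m, logit_u, feature_penalty = model(waveforms, labels, lengths)
# logit_m / logit_u: logits over masked / unmasked frames, for the cross-entropy losses
# feature_penalty: feature norm used for the penalty loss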
[Beta] Conformer (paper)
The release adds an implementation of Conformer (docs), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.
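A minimal sketch of a forward pass (hyperparameters are illustrative):

import torch
from torchaudio.models import Conformer

conformer = Conformer(
    input_dim=80,
    num_heads=4,
    ffn_dim=128,
    num_layers=4,
    depthwise_conv_kernel_size=31,
)

features = torch.randn(10, 300, 80)     # (batch, frames, input_dim)
lengths = torch.randint(1, 300, (10,))  # valid length of each sequence
output, output_lengths = conformer(features, lengths)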
Backward-incompatible changes
Ops
- Removed deprecated `F.magphase`, `F.angle`, `F.complex_norm`, and `T.ComplexNorm` (#1934, #1935, #1942)
  - Utility functions for pseudo complex types were deprecated in 0.10, and they are now removed in 0.11. For the details of this migration plan, please refer to #1337.
- Dropped pseudo complex support from `F.spectrogram`, `T.Spectrogram`, `F.phase_vocoder`, and `T.TimeStretch` (#1957, #1958)
  - Support for the pseudo complex type was deprecated in 0.10, and it is now removed in 0.11. For the details of this migration plan, please refer to #1337.
- Removed deprecated `create_fb_matrix` (#1998)
  - `create_fb_matrix` was replaced by `melscale_fbanks` in release 0.10. It is removed in 0.11. Please use `melscale_fbanks`.
Datasets
- Removed deprecated VCTK (#1825)
  - The original VCTK archive file is no longer accessible. Please migrate to the `VCTK_092` class for the latest version of the dataset.
- Removed deprecated dataset utils (#1826)
  - Undocumented methods `diskcache_iterator` and `bg_iterator` were deprecated in 0.10. They are removed in 0.11. Please cease using them.
Models
- Removed unused dimension from pretrained Wav2Vec2 ASR (#1914)
  - The final linear layer of Wav2Vec2 ASR models included dimensions (`<s>`, `<pad>`, `</s>`, `<unk>`) that were not related to ASR tasks and were not used. These dimensions were removed.
Build
- Dropped support for Python3.6 (#2119, #2139)
  - Following the end of life of Python 3.6, torchaudio dropped support for it.
New Features
RNN-T Emformer
- Introduced Emformer (#1801)
- Added Emformer RNN-T model (#2003)
- Added RNN-T beam search decoder (#2028)
- Cleaned up Emformer module (#2091)
- Added pretrained Emformer RNN-T streaming ASR inference pipeline (#2093)
- Reorganized RNN-T components in prototype module (#2110)
- Added integration test for Emformer RNN-T LibriSpeech pipeline (#2172)
- Registered RNN-T pipeline global stats constants as buffers (#2175)
- Refactored RNN-T factory function to support num_symbols argument (#2178)
- Fixed output shape description in RNN-T docstrings (#2179)
- Removed invalid token blanking logic from RNN-T decoder (#2180)
- Updated stale prototype references (#2189)
- Revised RNN-T pipeline streaming decoding logic (#2192)
- Cleaned up Emformer (#2207)
- Applied minor fixes to Emformer implementation (#2252)
Conformer
- Introduced Conformer (#2068)
- Removed subsampling and positional embedding logic from Conformer (#2171)
- Moved ASR features out of prototype (#2187)
- Passed bias and dropout args to Conformer convolution block (#2215)
- Adjusted Conformer args (#2223)
Datasets
- Added DR-VCTK dataset (#1819)
Models
- Added HuBERT pretrain model to enable training from scratch (#2064)
- Added feature mean square value to HuBERT Pretrain model output (#2128)
Pipelines
- Added wav2vec2 ASR French pretrained from voxpopuli (#1919)
- Added wav2vec2 ASR Spanish pretrained model from voxpopuli (#1924)
- Added wav2vec2 ASR German pretrained model from voxpopuli (#1953)
- Added wav2vec2 ASR Italian pretrained model from voxpopuli (#1954)
- Added wav2vec2 ASR English pretrained model from voxpopuli (#1956)
Build
- Added CUDA-11.5 builds to torchaudio (#2067)
Improvements
I/O
- Fixed load behavior for 24-bit input (#2084)
Ops
- Added OpenMP support (#1761)
- Improved MVDR stability (#2004)
- Relaxed dtype for MVDR (#2024)
- Added warnings in mu_law* for the wrong input type (#2034)
- Added parameter p to TimeMasking (#2090)
- Removed unused vars from RNN-T loss (#2142)
- Removed complex32 dtype in F.griffinlim (#2233)
Datasets
- Deprecated data utils (#2073)
- Updated URLs for libritts (#2074)
- Added subset support for TEDLIUM release3 dataset (#2157)
Models
- Replaced dropout with Dropout (#1815)
- Inplace initialization of RNN weights (#2010)
- Updated to xavier_uniform and avoid legacy data.uniform_ initialization (#2018)
- Allowed Tacotron2 decode batch_size 1 examples (#2156)
Pipelines
- Added tool to convert voxpopuli model (#1923)
- Refactored wav2vec2 pipeline util (#1925)
- Allowed the customization of axis exclusion for ASR head (#1932)
- Tweaked wav2vec2 checkpoint conversion tool (#1938)
- Added melkwargs setting for MFCC in HuBERT pipeline (#1949)
Documentation
- Added 0.10.0 to version compatibility matrix (#1862)
- Removed MACOSX_DEPLOYMENT_TARGET (#1880)
- Updated intersphinx inventory (#1893)
- Updated compatibility matrix to include LTS version (#1896)
- Updated CONTRIBUTING with doc conventions (#1898)
- Added anaconda stats to README (#1910)
- Updated README.md (#1916)
- Added citation information (#1947)
- Updated CONTRIBUTING.md (#1975)
- Doc fixes (#1982)
- Added tutorial to CONTRIBUTING (#1990)
- Fixed docstring (#2002)
- Fixed minor typo (#2012)
- Updated audio augmentation tutorial (#2082)
- Added Sphinx gallery automatically (#2101)
- Disabled matplotlib warning in tutorial rendering (#2107)
- Updated prototype documentations (#2108)
- Added custom CSS to make signatures appear in multi-line (#2123)
- Updated prototype pipeline documentation (#2148)
- Tweaked documentation (#2152)
Tests
- Refactored integration test (#1922)
- Enabled integration tests on CI (#1939)
- Removed facebook folder in wav2vec unit tests (#2015)
- Temporarily skipped threadpool test (#2025)
- Revised Griffin-Lim transform test to reduce execution time (#2037)
- Fixed CircleCI test failures (#2069)
- Do not auto-skip tests on CI (#2127)
- Relaxed absolute tolerance for Kaldi compat tests (#2165)
- Added tacotron2 unit test with different batch_size (#2176)
Build
- Updated GPU resource class (#1791)
- Updated the main version to 0.11.0 (#1793)
- Updated windows cuda installer 11.1.0 to 11.1.1 (#1795)
- Renamed build_tools to tools (#1812)
- Limit Windows GPU testing to CUDA-11.3 only (#1842)
- Used cu113 for unittest_windows_gpu (#1853)
- USE_CUDA in windows and reduce one vcvarsall (#1854)
- Check torch installation before building package (#1867)
- Install tools from conda instead of brew (#1873)
- Cleaned up setup.py (#1900)
- Moved TorchAudio conda package to use pytorch-mutex (#1904)
- Updated smoke test docker image (#1905)
- Fixed formatting CIRCLECI_TAG when building docs (#1915)
- Fetch third party sources automatically (#1966)
- Disabled SPHINXOPT=-W for local env (#2013)
- Improved installing nightly pytorch (#2026)
- Improved cuda installation on windows (#2032)
- Refactored the library loading mechanism (#2038)
- Cleaned up libtorchaudio customization logic (#2039)
- Refactored and functionize the library definition (#2040)
- Introduced helper function to define extension (#2077)
- Standardized the location of third-party source code (#2086)
- Show lint diff with color (#2102)
- Updated third party submodule setup (#2132)
- Suppressed stderr from subprocess in setup.py (#2133)
- Fixed header include (#2135)
- Updated ROCM version 4.1 -> 4.3.1 and 4.5 (#2186)
- Added "cu102" back (#2190)
- Pinned flake8 version (#2191)
Style
- Removed trailing whitespace (#1803)
- Fixed style checks (#1913)
- Resolved lint warning (#1971)
- Enabled CLANGFORMAT (#1999)
- Fixed style checks in examples/tutorials (#2006)
- OSS config for lint checks (#2066)
- Excluded sphinx-gallery examples (#2071)
- Reverted linting exemptions introduced in #2071 (#2087)
- Applied arc lint to pytorch audio (#2096)
- Enforced lint checks and fix/mute lint errors (#2116)...
torchaudio v0.10.2 Minor release
This is a minor release compatible with PyTorch 1.10.2.
There is no feature change in torchaudio from 0.10.1. For the full feature of v0.10, please refer to the v0.10.0 release notes.
torchaudio 0.10.1 Release Note
This is a minor release, which is compatible with PyTorch 1.10.1 and includes small bug fixes, improvements, and documentation updates. There are no new features added.
Bug Fix
- #2050 Allow whitespace as `TORCH_CUDA_ARCH_LIST` delimiter
Improvement
- #2054 Fetch third party source code automatically
  The build process now fetches third party source code (git submodules and cmake external projects)
- #2059 Improve documentation
For the full feature of v0.10, please refer to the v0.10.0 release note.
v0.10.0
torchaudio 0.10.0 Release Note
Highlights
torchaudio 0.10.0 release includes:
- New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
- Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
- New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
- CUDA-enabled binaries
[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights
HuBERT model architectures (“base”, “large”, and “extra large” configurations) are added. In addition, support for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning, and HuBERT is added.
These pretrained weights can be used for feature extractions and downstream task adaptation.
>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...

Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD

[Beta] Tacotron2 and TTS Pipeline
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). These TTS pipelines are composed of multiple models and specific data processing, so to make the associated objects easy to use, a notion of bundle is introduced. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)

[Beta] RNN Transducer Loss
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and torchscript support, and can be run on both CPU and GPU; the GPU path has a custom CUDA kernel implementation for improved performance.
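A minimal sketch on random data; shapes follow the documented convention, with logits of shape (batch, max source length, max target length + 1, number of classes):

import torch
from torchaudio.functional import rnnt_loss

logits = torch.randn(2, 50, 11, 20, requires_grad=True)
targets = torch.randint(1, 20, (2, 10), dtype=torch.int32)
logit_lengths = torch.tensor([50, 50], dtype=torch.int32)
target_lengths = torch.tensor([10, 10], dtype=torch.int32)

loss = rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
loss.backward()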
[Beta] MVDR Beamforming
This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. There are three solutions (ref_channel, stv_evd, stv_power), and it supports single-channel and multi-channel masks (multi-channel masks are averaged within the method). It provides an online option that recursively updates the parameters for streaming audio.
Please refer to the MVDR tutorial.
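A minimal sketch on random six-channel audio (in practice the masks would come from a neural network):

import torch
from torchaudio.transforms import MVDR, Spectrogram

stft = Spectrogram(n_fft=400, power=None)   # complex-valued spectrogram
mvdr = MVDR(ref_channel=0, solution="ref_channel", online=False)

waveform = torch.randn(6, 16000)            # (channel, time)
specgram = stft(waveform)                   # (channel, freq, time)
mask_speech = torch.rand(specgram.shape[-2:])
mask_noise = 1.0 - mask_speech

enhanced = mvdr(specgram, mask_speech, mask_noise)  # (freq, time)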
GPU Build
This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.
Additional Features
torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.
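For instance, a single filter can now be applied across a batched input in one call; a minimal sketch with illustrative coefficients:

import torch
import torchaudio.functional as F

waveform = torch.randn(3, 2, 8000)   # batched input: (batch, channel, time)

b_coeffs = torch.tensor([0.4, 0.2, 0.9])   # numerator coefficients
a_coeffs = torch.tensor([1.0, 0.1, 0.3])   # denominator coefficients

filtered = F.lfilter(waveform, a_coeffs, b_coeffs)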
Backward Incompatible Changes
I/O
- Default to PCM_16 for flac on soundfile backend (#1604)
  - When saving FLAC format with the “soundfile” backend, `PCM_24` (the previous default) could cause warping. The default has been changed to `PCM_16`, which does not suffer from this.
Ops
- Default to native complex type when returning raw spectrogram (#1549)
  - When `power=None`, `torchaudio.functional.spectrogram` and `torchaudio.transforms.Spectrogram` now default to `return_complex=True`, which returns a Tensor of native complex type (such as `torch.cfloat` and `torch.cdouble`). To use a pseudo complex type, pass the resulting tensor to `torch.view_as_real`.
- Remove deprecated kaldi.resample_waveform (#1555)
  - Please use `torchaudio.functional.resample`.
- Replace waveform with specgram in SlidingWindowCmn (#1859)
  - The argument name was corrected to `specgram`.
- Ensure integer input frequencies for resample (#1857)
- Sampling rates were silently cast to integers in the resampling implementation, so it now requires integer sampling rate inputs to ensure expected resampling quality.
Wav2Vec2
- Update `extract_features` of Wav2Vec2Model (#1776)
  - The previous implementation returned outputs from the convolutional feature extractor. To match the behavior of the original fairseq implementation, the method was changed to return the outputs of the intermediate transformer layers. To achieve the original behavior, please use `Wav2Vec2Model.feature_extractor()`.
- Move fine-tune specific module out of wav2vec2 encoder (#1782)
  - The internal structure of `Wav2Vec2Model` was updated. The `Wav2Vec2Model.encoder.read_out` module is moved to `Wav2Vec2Model.aux`. If you have a serialized state dict, please replace the key `encoder.read_out` with `aux`.
- Updated wav2vec2 factory functions for more customizability (#1783, #1804, #1830)
  - The signatures of the wav2vec2 factory functions have changed. The `num_out` parameter has been renamed to `aux_num_out`, and other parameters are added before it. Please update code from `wav2vec2_base(num_out)` to `wav2vec2_base(aux_num_out=num_out)`.
Deprecations
- Add `melscale_fbanks` and deprecate `create_fb_matrix` (#1653)
  - As `linear_fbanks` is introduced, `create_fb_matrix` is renamed to `melscale_fbanks`. The original `create_fb_matrix` is now deprecated. Please use `melscale_fbanks`.
- Deprecate `VCTK` dataset (#1810)
  - This dataset has been taken down and is no longer available. Please use the `VCTK_092` dataset.
- Deprecate data utils (#1809)
  - `bg_iterator` and `diskcache_iterator` are known to not improve the throughput of data loaders. Please cease their usage.
New Features
Models
Tacotron2
- Add Tacotron2 model (#1621, #1647, #1844)
- Add Tacotron2 loss function (#1764)
- Add Tacotron2 inference method (#1648, #1839, #1849)
- Add phoneme text preprocessing for Tacotron2 (#1668)
- Move Tacotron2 out of prototype (#1714)
HuBERT
Pretrained Weights and Pipelines
- Add pretrained weights for wavernn (#1612)
- Add Tacotron2 pretrained models (#1693)
- Add HUBERT pretrained weights (#1821, #1824)
- Add pretrained weights from wav2vec2.0 and XLSR papers (#1827)
- Add customization support to wav2vec2 labels (#1834)
- Default pretrained weights to eval mode (#1843)
- Move wav2vec2 pretrained models to pipelines module (#1876)
- Add TTS bundle/pipelines (#1872)
- Fix vocoder interface (#1895)
- Fix Phonemizer download (#1897)
RNN Transducer Loss
- Add reduction parameter for RNNT loss (#1590)
- Rename RNNT loss C++ parameters (#1602)
- Rename transducer to RNNT (#1603)
- Remove gradient variable from RNNT loss Python code (#1616)
- Remove reuse_logits_for_grads option for RNNT loss (#1610)
- Remove fused_log_softmax option from RNNT loss (#1615)
- RNNT loss resolve null gradient (#1707)
- Move RNNT loss out of prototype (#1711)
MVDR Beamforming
- Add MVDR module to example (#1709)
- Add normalization to steering vector solutions in MVDR Module (#1765)
- Move MVDR and PSD modules to transforms (#1771)
- Add MVDR beamforming tutorial to example directory (#1768)
Ops
- Add edit_distance (#1601)
- Add PitchShift to functional and transform (#1629)
- Add LFCC feature to transforms (#1611)
- Add InverseSpectrogram to transforms and functional (#1652)
Datasets
Improvements
I/O
- Make buffer size for function info configurable (#1634)
Ops
torchaudio 0.9.1 Minor bugfix release
This release depends on pytorch 1.9.1
No functional changes other than minor updates to CI rules.
v0.9.0
torchaudio 0.9.0 Release Note
Highlights
torchaudio 0.9.0 release includes:
- Lots of performance improvements (filtering, resampling, spectral operations)
- Popular wav2vec2.0 model architecture.
- Improved autograd support.
[Beta] Wav2Vec2.0 Model
This release includes model architectures from the wav2vec 2.0 paper, with utility functions that allow importing pretrained model parameters published on fairseq and Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android, and iOS. Please check out our C++, Android, and iOS examples. The following snippets illustrate how to create a deployable model.
# Import fine-tuned model from Hugging Face Hub
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model

original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)

# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model

original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)

# Build uninitialized model and load state dict
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
from torchaudio.models import wav2vec2_base

model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())

# Quantize / script / optimize for mobile
quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")

Filtering Improvement
The internal implementation of lfilter has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad variants.
The following table illustrates the performance improvements compared against previous releases. lfilter was applied to float32 tensors with one channel and varying numbers of frames.
| torchaudio version | 256 frames | 512 frames | 1024 frames |
|---|---|---|---|
| 0.9 | 0.282 | 0.381 | 0.564 |
| 0.8 | 0.493 | 0.780 | 1.37 |
| 0.7 | 5.42 | 10.8 | 22.3 |

Unit: msec
Complex Tensor Migration
torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent the real and imaginary parts. In PyTorch 1.6, new dtypes, such as torch.cfloat and torch.cdouble, were introduced to represent complex values natively. (In the following, we refer to torchaudio’s original convention as pseudo complex types, and PyTorch’s native dtypes as native complex types.)
As the native complex types have become mature and stable, torchaudio has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing/receiving native complex type directly. Users can choose to keep using the pseudo complex type or opt in to use native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For the detail of this migration plan, please refer to #1337.
Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to perform the operation on a float32 Tensor with two channels and 256 frames.
CPU

| torchaudio version | Spectrogram | TimeStretch | GriffinLim |
|---|---|---|---|
| 0.9 | 0.229 | 12.6 | 3320 |
| 0.8 | 0.283 | 126 | 5320 |

Unit: msec

CUDA

| torchaudio version | Spectrogram | TimeStretch | GriffinLim |
|---|---|---|---|
| 0.9 | 0.195 | 0.599 | 36 |
| 0.8 | 0.219 | 0.687 | 60.2 |

Unit: msec
Improved Autograd Support
Along with the work of Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure the autograd support. Now the following operations are guaranteed to support autograd up to second order.
Functionals
- `lfilter`
- `allpass_biquad`
- `biquad`
- `band_biquad`
- `bandpass_biquad`
- `bandreject_biquad`
- `bass_biquad`
- `equalizer_biquad`
- `treble_biquad`
- `highpass_biquad`
- `lowpass_biquad`
Transforms
- `AmplitudeToDB`
- `ComputeDeltas`
- `Fade`
- `GriffinLim`
- `TimeMasking`
- `FrequencyMasking`
- `MFCC`
- `MelScale`
- `MelSpectrogram`
- `Resample`
- `SpectralCentroid`
- `Spectrogram`
- `SlidingWindowCmn`
- `TimeStretch`*
- `Vol`
NOTE:
- Autograd tests for transforms also cover the following functionals:
  - `amplitude_to_DB`
  - `spectrogram`
  - `griffinlim`
  - `resample`
  - `phase_vocoder`*
  - `mask_along_axis_iid`
  - `mask_along_axis`
  - `gain`
  - `spectral_centroid`
- `torchaudio.transforms.TimeStretch` and `torchaudio.functional.phase_vocoder` call `atan2`, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.
[Beta] Resampling Improvement
In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.
- Kaiser window has been added for a wider range of resampling quality.
- `rolloff` parameter has been added for anti-aliasing control.
- `torchaudio.transforms.Resample` precomputes the kernel using `float64` precision and caches it for even faster operation.
- A new entry point, `torchaudio.functional.resample`, has been added, and the original entry point, `torchaudio.compliance.kaldi.resample_waveform`, is deprecated.
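A minimal sketch exercising the new options (input is random; parameter values are illustrative):

import torch
import torchaudio.functional as F

waveform = torch.randn(2, 16000)   # two channels, one second at 16 kHz

resampled = F.resample(
    waveform,
    orig_freq=16000,
    new_freq=8000,
    resampling_method="kaiser_window",  # new Kaiser window option
    rolloff=0.94,                       # anti-aliasing control
)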
The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on a float32 tensor with two channels and one-second duration.
CPU
| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.9 | 0.192 | 0.559 | 0.478 | 0.467 |
| 0.8 | 0.537 | 0.753 | 43.9 | 17.6 |

Unit: msec
CUDA
...| torchaudio version | 8k → 16k | 16k → 8k | 16k → 44.1k | 44.1k → 16k |