torchaudio 0.13.0 Release Note
Highlights
TorchAudio 0.13.0 release includes:
- Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
- New datasets and metadata mode for the SUPERB benchmark
- Custom language model support for CTC beam search decoding
- StreamWriter for audio and video encoding
[Beta] Source Separation Models and Bundles
Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)
The TorchAudio v0.13 release includes the following features:
- MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
- Hybrid Demucs model architecture (docs)
- Three factory functions suitable for different sample rate ranges
- Pre-trained pipelines (docs) and tutorial
SDR Results of pre-trained pipelines on MUSDB-HQ test set
| Pipeline | All | Drums | Bass | Other | Vocals |
|---|---|---|---|---|---|
| HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
| HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |
* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.
Special thanks to @adefossez for the guidance.
The ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds a pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6 dB SDR improvement and 15.3 dB Si-SNR improvement on the Libri2Mix test set.
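Both pipelines follow the same SourceSeparationBundle interface. Below is a minimal sketch of running the Hybrid Demucs bundle on a hypothetical stereo mixture tensor (the CONVTASNET_BASE_LIBRI2MIX bundle follows the same pattern, at 8 kHz with mono input):

import torch
import torchaudio

bundle = torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model()

# Hypothetical stereo mixture at the bundle's sample rate (44.1 kHz).
mixture = torch.randn(1, 2, bundle.sample_rate * 5)  # (batch, channels, frames)

with torch.inference_mode():
    sources = model(mixture)  # (batch, num_sources, channels, frames)

# HDemucs separates the mixture into drums, bass, other, and vocals.
drums, bass, other, vocals = sources[0]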
[Beta] Datasets and Metadata Mode for SUPERB Benchmarks
With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.
Datasets with metadata functionality:
- LIBRISPEECH (docs)
- LibriMix (docs)
- QUESST14 (docs)
- SPEECHCOMMANDS (docs)
- (new) FluentSpeechCommands (docs)
- (new) Snips (docs)
- (new) IEMOCAP (docs)
- (new) VoxCeleb1 (Identification, Verification)
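For example, with LIBRISPEECH, get_metadata returns the same fields as regular indexing, but with the relative path of the audio file in place of the decoded waveform. A minimal sketch (the dataset root is hypothetical):

from torchaudio.datasets import LIBRISPEECH

dataset = LIBRISPEECH("path/to/data", url="test-clean", download=True)

# Same fields as dataset[0], but no waveform is loaded or decoded.
filepath, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset.get_metadata(0)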
[Beta] Custom Language Model support in CTC Beam Search Decoding
In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. This release adds functionality for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.
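A minimal sketch of the wrapper, assuming a hypothetical language_model object that scores a token given a decoding state (adapted from the custom LM example in the decoder tutorial):

from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState

class CustomLM(CTCDecoderLM):
    # Python wrapper around a hypothetical `language_model` object.
    def __init__(self, language_model):
        CTCDecoderLM.__init__(self)
        self.language_model = language_model
        self.sil = -1   # index of the silence token in the language model
        self.states = {}

    def start(self, start_with_nothing: bool = False):
        state = CTCDecoderLMState()
        self.states[state] = self.language_model.score(None, self.sil)
        return state

    def score(self, state: CTCDecoderLMState, token_index: int):
        outstate = state.child(token_index)
        if outstate not in self.states:
            self.states[outstate] = self.language_model.score(state, token_index)
        return outstate, self.states[outstate]

    def finish(self, state: CTCDecoderLMState):
        return self.score(state, self.sil)

An instance can then be passed to the decoder factory, e.g. ctc_decoder(..., lm=CustomLM(my_lm)).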
[Beta] StreamWriter
torchaudio.io.StreamWriter is a class for encoding media, including audio and video. It can handle a wide variety of codecs, chunk-by-chunk encoding, and GPU encoding.
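A minimal sketch of chunk-by-chunk audio encoding (the output path and chunk contents are illustrative):

import torch
from torchaudio.io import StreamWriter

writer = StreamWriter(dst="output.wav")
writer.add_audio_stream(sample_rate=16000, num_channels=1)

with writer.open():
    for _ in range(10):
        # Each chunk has shape (frames, channels); here, 0.1 seconds of silence.
        chunk = torch.zeros(1600, 1)
        writer.write_audio_chunk(0, chunk)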
Backward-incompatible changes
- [BC-breaking] Fix momentum in transforms.GriffinLim (#2568)
  The `GriffinLim` implementations in `transforms` and `functional` used the `momentum` parameter differently, resulting in inconsistent results between the two implementations. The `transforms.GriffinLim` usage of `momentum` is updated to resolve this discrepancy.
- Make `torchaudio.info` decode audio to compute `num_frames` if it is not found in metadata (#2740)
  In such cases, `torchaudio.info` may now return non-zero values for `num_frames`.
Bug Fixes
- Fix random Gaussian generation (#2639)
  `torchaudio.compliance.kaldi.fbank` with the dither option produced a different output from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This release updates it to correctly use a random Gaussian instead.
- Update download link for speech commands (#2777)
  The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. The download link has been updated to correctly download the whole dataset.
New Features
IO
- Add metadata to source stream info (#2461, #2464)
- Add utility function to fetch FFmpeg library versions (#2467)
- Add YUV444P support to StreamReader (#2516)
- Add StreamWriter (#2628, #2648, #2505)
- Support in-memory decoding via Tensor wrapper in StreamReader (#2694)
- Add StreamReader Tensor Binding to src (#2699)
- Add StreamWriter media device/streaming tutorial (#2708)
- Add StreamWriter tutorial (#2698)
Ops
- Add ITU-R BS.1770-4 loudness recommendation (#2472)
- Add convolution operator (#2602)
- Add additive noise function (#2608)
Models
- Hybrid Demucs model implementation (#2506)
- Docstring change for Hybrid Demucs (#2542, #2570)
- Add NNLM support to CTC Decoder (#2528, #2658)
- Move hybrid demucs model out of prototype (#2668)
- Move conv_tasnet_base doc out of prototype (#2675)
- Add custom lm example to decoder tutorial (#2762)
Pipelines
- Add SourceSeparationBundle to prototype (#2440, #2559)
- Adding pipeline changes, factory functions to HDemucs (#2547, #2565)
- Create tutorial for HDemucs (#2572)
- Add HDEMUCS_HIGH_MUSDB (#2601)
- Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (#2669)
- Move Hybrid Demucs pipeline to beta (#2673)
- Update description of HDemucs pipelines
Datasets
- Add fluent speech commands (#2480, #2510)
- Add musdb dataset and tests (#2484)
- Add VoxCeleb1 dataset (#2349)
- Add metadata function for LibriSpeech (#2653)
- Add Speech Commands metadata function (#2687)
- Add metadata mode for various datasets (#2697)
- Add IEMOCAP dataset (#2732)
- Add Snips Dataset (#2738)
- Add metadata for Librimix (#2751)
- Add file name to returned item in Snips dataset (#2775)
- Update IEMOCAP variants and labels (#2778)
Improvements
IO
- Replace `runtime_error` exception with `TORCH_CHECK` (#2550, #2551, #2592)
- Refactor StreamReader (#2507, #2508, #2512, #2530, #2531, #2533, #2534)
- Refactor sox C++ (#2636, #2663)
- Delay the import of kaldi_io (#2573)
Ops
- Speed up resample with kernel generation modification (#2553, #2561)
The kernel generation for resampling is optimized in this release. The following table illustrates the performance improvements from the previous release for the `torchaudio.functional.resample` function using the sinc resampling method, on a `float32` tensor with two channels and one second duration.
CPU
| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
| 0.12 | 0.386 | 0.534 | 31.8 | 12.1 |
CUDA
| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
| 0.12 | 0.524 | 0.334 | 64.4 | 22.8 |
- Add normalization parameter on spectrogram and inverse spectrogram (#2554)
- Replace assert with raise for ops (#2579, #2599)
- Replace CHECK_ by TORCH_CHECK_ (#2582)
- Fix argument validation in TorchAudio filtering (#2609)
Models
- Switch to flashlight decoder from upstream (#2557)
- Add dimension and shape check (#2563)
- Replace assert with raise in models (#2578, #2590)
- Migrate CTC decoder code (#2580)
- Enable CTC decoder in Windows (#2587)
Datasets
- Replace assert with raise in datasets (#2571)
- Add unit test for LibriMix dataset (#2659)
- Add gtzan download note (#2763)
Tutorials
- Tweak tutorials (#2630, #2733)
- Update ASR inference tutorial (#2631)
- Update and fix tutorials (#2661, #2701)
- Introduce IO section to getting started tutorials (#2703)
- Update HW video processing tutorial (#2739)
- Update tutorial author information (#2764)
- Fix typos in tacotron2 tutorial (#2761)
- Fix fading in hybrid demucs tutorial (#2771)
- Fix leaking matplotlib figure (#2769)
- Update resampling tutorial (#2773)
Recipes
- Use lazy import for joblib (#2498)
- Revise LibriSpeech Conformer RNN-T recipe (#2535)
- Fix bug in Conformer RNN-T recipe (#2611)
- Replace bg_iterator in examples (#2645)
- Remove obsolete examples (#2655)
- Fix LibriSpeech Conformer RNN-T eval script (#2666)
- Replace IValue::toString()->string() with IValue::toStringRef() (#2700)
- Improve wav2vec2/hubert model for pre-training (#2716)
- Improve hubert recipe for pre-training and fine-tuning (#2744)
WER improvement on LibriSpe...
torchaudio 0.12.1 Release Note
This is a minor release, which is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements, and documentation updates. There are no new features added.
Bug Fix
Improvement
- #2552 Remove unused boost source code
- #2527 Improve speech enhancement tutorial
- #2544 Update forced alignment tutorial
- #2595 Update data augmentation tutorial
For the full feature of v0.12, please refer to the v0.12.0 release note.
v0.12.0
TorchAudio 0.12.0 Release Notes
Highlights
TorchAudio 0.12.0 includes the following:
- CTC beam search decoder
- New beamforming modules and methods
- Streaming API
[Beta] CTC beam search decoder
To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight (GitHub). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.
For usage details, please check out the documentation and ASR inference tutorial.
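A minimal sketch using the downloadable LibriSpeech files; the emissions tensor here is random, standing in for real acoustic model output:

import torch
from torchaudio.models.decoder import ctc_decoder, download_pretrained_files

files = download_pretrained_files("librispeech-4-gram")

decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,   # pass lm=None to decode without a language model
    nbest=3,
    beam_size=50,
)

# Stand-in for acoustic model output: (batch, frames, num_tokens).
with open(files.tokens) as f:
    num_tokens = sum(1 for _ in f)
emissions = torch.randn(1, 100, num_tokens).log_softmax(-1)

hypotheses = decoder(emissions)
best = hypotheses[0][0]  # top hypothesis for the first sample; best.words holds the decoded words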
[Beta] New beamforming modules and methods
To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they:
- Use power spectral density (PSD) and relative transfer function (RTF) matrices as inputs instead of time-frequency masks. The module can be integrated with neural networks that directly predict complex-valued STFT coefficients of speech and noise.
- Add `reference_channel` as an input argument in the forward method, to allow users to select the reference channel in model training or dynamically change the reference channel in inference.
Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional. These include `psd`, `mvdr_weights_souden`, `mvdr_weights_rtf`, `rtf_evd`, `rtf_power`, and `apply_beamforming`.
For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.
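A minimal sketch of the Souden variant on random data (shapes follow the documented (channel, freq, time) convention; in practice the masks would come from a neural network):

import torch
import torchaudio.functional as F
from torchaudio.transforms import SoudenMVDR

specgram = torch.randn(4, 201, 100, dtype=torch.cfloat)  # multi-channel complex STFT
mask_speech = torch.rand(201, 100)   # time-frequency mask for speech
mask_noise = 1.0 - mask_speech       # time-frequency mask for noise

psd_speech = F.psd(specgram, mask_speech)  # (freq, channel, channel)
psd_noise = F.psd(specgram, mask_noise)

mvdr = SoudenMVDR()
enhanced = mvdr(specgram, psd_speech, psd_noise, reference_channel=0)  # (freq, time)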
[Beta] Streaming API
StreamReader is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to
- Decode various audio and video formats, including MP4 and AAC.
- Handle various input forms, such as local files, network protocols, microphones, webcams, screen captures and file-like objects.
- Iterate over and decode media chunk-by-chunk, while changing the sample rate or frame rate.
- Apply various audio and video filters, such as low-pass filter and image scaling.
- Decode video with Nvidia's hardware-based decoder (NVDEC).
For usage details, please check out the documentation and tutorials:
- Media Stream API - Pt.1
- Media Stream API - Pt.2
- Online ASR with Emformer RNN-T
- Device ASR with Emformer RNN-T
- Accelerated Video Decoding with NVDEC
† To use StreamReader, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.
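A minimal sketch of chunked decoding from a local file (the path and chunk sizes are illustrative):

from torchaudio.io import StreamReader

streamer = StreamReader(src="input.mp4")
streamer.add_basic_audio_stream(frames_per_chunk=16000, sample_rate=16000)
streamer.add_basic_video_stream(frames_per_chunk=30, frame_rate=30)

for audio_chunk, video_chunk in streamer.stream():
    # audio_chunk: (frames, channels); video_chunk: (frames, channels, height, width)
    ...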
Backwards-incompatible changes
I/O
- MP3 decoding is now handled by FFmpeg in sox_io backend. (#2419, #2428)
  - FFmpeg is now used as a fallback in the sox_io backend, and MP3 decoding is now handled by FFmpeg. To load MP3 audio with `torchaudio.load`, please install a compatible version of FFmpeg (version 4 when using an official binary distribution).
  - Note that, whereas the previous MP3 decoding scheme pads the output audio, the new scheme does not. As a consequence, the new version returns shorter audio tensors.
  - `torchaudio.info` now returns `num_frames=0` for MP3.
Models
- Change underlying implementation of RNN-T hypothesis to tuple (#2339)
  - In release 0.11, `Hypothesis` subclassed `namedtuple`. Containers of `namedtuple` instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, `Hypothesis` has been modified in release 0.12 to instead alias `tuple`. This affects `RNNTBeamSearch`, as it accepts and returns a list of `Hypothesis` instances.
Bug Fixes
Ops
- Fix return dtype in MVDR module (#2376)
  - In release 0.11, the MVDR module converts the dtype of the input spectrum to `complex128` to improve the precision and robustness of downstream matrix computations. The output dtype, however, was not correctly converted back to the original dtype. In release 0.12, we fix the output dtype to be consistent with the original input dtype.
Build
- Fix Kaldi submodule integration (#2269)
- Pin jinja2 version for build_docs (#2292)
- Use sourceforge url to fetch zlib (#2297)
New Features
I/O
- Add Streaming API (#2041, #2042, #2043, #2044, #2045, #2046, #2047, #2111, #2113, #2114, #2115, #2135, #2164, #2168, #2202, #2204, #2263, #2264, #2312, #2373, #2378, #2402, #2403, #2427, #2429)
- Add YUV420P format support to Streaming API (#2334)
- Support specifying decoder and its options (#2327)
- Add NV12 format support in Streaming API (#2330)
- Add HW acceleration support on Streaming API (#2331)
- Add file-like object support to Streaming API (#2400)
- Make FFmpeg log level configurable (#2439)
- Set the default ffmpeg log level to FATAL (#2447)
Ops
- New beamforming methods (#2227, #2228, #2229, #2230, #2231, #2232, #2369, #2401)
- New MVDR modules (#2367, #2368)
- Add and refactor CTC lexicon beam search decoder (#2075, #2079, #2089, #2112, #2117, #2136, #2174, #2184, #2185, #2273, #2289)
- Add lexicon free CTC decoder (#2342)
- Add Pretrained LM Support for Decoder (#2275)
- Move CTC beam search decoder to beta (#2410)
Datasets
Improvements
I/O
Ops
- Raise error for resampling int waveform (#2318)
- Move multi-channel modules to a separate file (#2382)
- Refactor MVDR module (#2383)
Models
- Add an option to use Tanh instead of ReLU in RNNT joiner (#2319)
- Support GroupNorm and re-ordering Convolution/MHA in Conformer (#2320)
- Add extra arguments to hubert pretrain factory functions (#2345)
- Add feature_grad_mult argument to HuBERTPretrainModel (#2335)
Datasets
Performance
- Make PitchShift faster by caching the resampling kernel (#2441)
  The following table illustrates the performance improvement over the previous release by comparing the time in msecs it takes `torchaudio.transforms.PitchShift`, after its first call, to perform the operation on a `float32` tensor with two channels and 8000 frames, resampled to 44.1 kHz, across various shift steps.
| TorchAudio Version | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0.12 | 2.76 | 5 | 1860 | 223 |
| 0.11 | 6.71 | 161 | 8680 | 1450 |
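A minimal sketch illustrating the caching behavior (input shapes match the benchmark setup):

import torch
import torchaudio.transforms as T

pitch_shift = T.PitchShift(sample_rate=8000, n_steps=4)
waveform = torch.randn(2, 8000)

_ = pitch_shift(waveform)        # first call builds and caches the resampling kernel
shifted = pitch_shift(waveform)  # subsequent calls reuse the cached kernel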
Tests
- Add complex dtype support in functional autograd test (#2244)
- Refactor torchscript consistency test in functional (#2246)
- Add unit tests for PyTorch Lightning modules of emformer_rnnt recipes (#2240)
- Refactor batch consistency test in functional (#2245)
- Run smoke tests on regular PRs (#2364)
- Refactor smoke test executions (#2365)
- Move seed to setup (#2425)
- Remove possible manual seeds from test files (#2436)
Build
- Revise the parameterization of third party libraries (#2282)
- Use zlib v1.2.12 with GitHub source (#2300)
- Fix ffmpeg integration for ffmpeg 5.0 (#2326)
- Use custom FFmpeg libraries for torchaudio binary distributions (#2355)
- Adding m1 builds to torchaudio (#2421)
Other
- Add download utility specialized for torchaudio (#2283)
- Use module-level `__getattr__` to implement delayed initialization (#2377)
- Update build_doc job to use Conda CUDA package (#2395)
- Update I/O initialization (#2417)
- Add Python 3.10 (build and test) (#2224)
- Retrieve version from version.txt (#2434)
- Disable OpenMP on mac (#2431)
Examples
Ops
- Add CTC decoder example for librispeech (#2130, #2161)
- Fix LM, arguments in CTC decoding script (#2235, #2315)
- Use pretrained LM API for decoder example (#2317)
Pipelines
- Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles (#2203)
- Refactor eval and pipeline_demo scripts in emformer_rnnt (#2238)
- Refactor pipeline_demo script in emformer_rnnt recipes (#2239)
- Add EMFORMER_RNNT_BASE_MUSTC into pipeline demo script (#2248)
Tests
- Add unit tests for Emformer RNN-T LibriSpeech recipe (#2216)
- Add fixed random seed for Emformer RNN-T recipe test (#2220)
Training recipes
v0.11.0
torchaudio 0.11.0 Release Note
Highlights
TorchAudio 0.11.0 release includes:
- Emformer (paper) RNN-T components, training recipe, and pre-trained pipeline for streaming ASR
- Voxpopuli pre-trained pipelines
- HuBERTPretrainModel for training HuBERT from scratch
- Conformer model for speech recognition
- Drop Python 3.6 support
[Beta] Emformer RNN-T
To support streaming ASR use cases, the release adds implementations of Emformer (docs), an RNN-T model that uses Emformer (emformer_rnnt_base), and an RNN-T beam search decoder (RNNTBeamSearch). It also includes a pipeline bundle (EMFORMER_RNNT_BASE_LIBRISPEECH) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech, which in whole allow for performing streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.
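A minimal sketch of non-streaming inference with the bundle (the audio file is hypothetical and assumed to be 16 kHz mono):

import torch
import torchaudio

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
feature_extractor = bundle.get_feature_extractor()
decoder = bundle.get_decoder()           # RNNTBeamSearch
token_processor = bundle.get_token_processor()

waveform, sample_rate = torchaudio.load("speech.wav")

with torch.no_grad():
    features, length = feature_extractor(waveform.squeeze(0))
    hypotheses = decoder(features, length, 10)  # beam width of 10

print(token_processor(hypotheses[0][0]))  # best hypothesis as text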
[Beta] HuBERT Pretrain Model
The masked prediction training of HuBERT model requires the masked logits, unmasked logits, and feature norm as the outputs. The logits are for cross-entropy losses and the feature norm is for penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch.
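A minimal sketch of a pretraining forward pass on random data; the label count assumes the convolutional feature extractor's ~20 ms frame rate (49 frames for one second at 16 kHz), and in practice the labels come from clustering (e.g. k-means over MFCC features):

import torch
from torchaudio.models import hubert_pretrain_base

model = hubert_pretrain_base()

waveforms = torch.randn(2, 16000)        # (batch, time)
labels = torch.randint(0, 100, (2, 49))  # pseudo-labels, one per output frame
lengths = torch.tensor([16000, 16000])

logit_m, logit_u, feature_penalty = model(waveforms, labels, lengths)
# logit_m / logit_u: logits over masked / unmasked frames, for the cross-entropy losses
# feature_penalty: feature norm used for the penalty loss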
[Beta] Conformer (paper)
The release adds an implementation of Conformer (docs), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.
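A minimal sketch of a forward pass (hyperparameters are illustrative):

import torch
from torchaudio.models import Conformer

conformer = Conformer(
    input_dim=80,
    num_heads=4,
    ffn_dim=128,
    num_layers=4,
    depthwise_conv_kernel_size=31,
)

features = torch.randn(10, 300, 80)     # (batch, frames, input_dim)
lengths = torch.randint(1, 300, (10,))  # valid length of each sequence
output, output_lengths = conformer(features, lengths)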
Backward-incompatible changes
Ops
- Removed deprecated `F.magphase`, `F.angle`, `F.complex_norm`, and `T.ComplexNorm` (#1934, #1935, #1942)
  - Utility functions for pseudo complex types were deprecated in 0.10, and they are now removed in 0.11. For the details of this migration plan, please refer to #1337.
- Dropped pseudo complex support from `F.spectrogram`, `T.Spectrogram`, `F.phase_vocoder`, and `T.TimeStretch` (#1957, #1958)
  - Support for the pseudo complex type was deprecated in 0.10, and it is now removed in 0.11. For the details of this migration plan, please refer to #1337.
- Removed deprecated `create_fb_matrix` (#1998)
  - `create_fb_matrix` was replaced by `melscale_fbanks` in release 0.10. It is removed in 0.11. Please use `melscale_fbanks`.
Datasets
- Removed deprecated VCTK (#1825)
  - The original VCTK archive file is no longer accessible. Please migrate to the `VCTK_092` class for the latest version of the dataset.
- Removed deprecated dataset utils (#1826)
  - Undocumented methods `diskcache_iterator` and `bg_iterator` were deprecated in 0.10. They are removed in 0.11. Please cease using them.
Models
- Removed unused dimension from pretrained Wav2Vec2 ASR (#1914)
  - The final linear layer of Wav2Vec2 ASR models included dimensions (`<s>`, `<pad>`, `</s>`, `<unk>`) that were not related to ASR tasks and were not used. These dimensions were removed.
Build
- Dropped support for Python3.6 (#2119, #2139)
  - Following the end of life of Python 3.6, torchaudio dropped support for it.
New Features
RNN-T Emformer
- Introduced Emformer (#1801)
- Added Emformer RNN-T model (#2003)
- Added RNN-T beam search decoder (#2028)
- Cleaned up Emformer module (#2091)
- Added pretrained Emformer RNN-T streaming ASR inference pipeline (#2093)
- Reorganized RNN-T components in prototype module (#2110)
- Added integration test for Emformer RNN-T LibriSpeech pipeline (#2172)
- Registered RNN-T pipeline global stats constants as buffers (#2175)
- Refactored RNN-T factory function to support num_symbols argument (#2178)
- Fixed output shape description in RNN-T docstrings (#2179)
- Removed invalid token blanking logic from RNN-T decoder (#2180)
- Updated stale prototype references (#2189)
- Revised RNN-T pipeline streaming decoding logic (#2192)
- Cleaned up Emformer (#2207)
- Applied minor fixes to Emformer implementation (#2252)
Conformer
- Introduced Conformer (#2068)
- Removed subsampling and positional embedding logic from Conformer (#2171)
- Moved ASR features out of prototype (#2187)
- Passed bias and dropout args to Conformer convolution block (#2215)
- Adjusted Conformer args (#2223)
Datasets
- Added DR-VCTK dataset (#1819)
Models
- Added HuBERT pretrain model to enable training from scratch (#2064)
- Added feature mean square value to HuBERT Pretrain model output (#2128)
Pipelines
- Added wav2vec2 ASR French pretrained from voxpopuli (#1919)
- Added wav2vec2 ASR Spanish pretrained model from voxpopuli (#1924)
- Added wav2vec2 ASR German pretrained model from voxpopuli (#1953)
- Added wav2vec2 ASR Italian pretrained model from voxpopuli (#1954)
- Added wav2vec2 ASR English pretrained model from voxpopuli (#1956)
Build
- Added CUDA-11.5 builds to torchaudio (#2067)
Improvements
I/O
- Fixed load behavior for 24-bit input (#2084)
Ops
- Added OpenMP support (#1761)
- Improved MVDR stability (#2004)
- Relaxed dtype for MVDR (#2024)
- Added warnings in mu_law* for the wrong input type (#2034)
- Added parameter p to TimeMasking (#2090)
- Removed unused vars from RNN-T loss (#2142)
- Removed complex32 dtype in F.griffinlim (#2233)
Datasets
- Deprecated data utils (#2073)
- Updated URLs for libritts (#2074)
- Added subset support for TEDLIUM release3 dataset (#2157)
Models
- Replaced dropout with Dropout (#1815)
- Inplace initialization of RNN weights (#2010)
- Updated to xavier_uniform and avoid legacy data.uniform_ initialization (#2018)
- Allowed Tacotron2 decode batch_size 1 examples (#2156)
Pipelines
- Added tool to convert voxpopuli model (#1923)
- Refactored wav2vec2 pipeline util (#1925)
- Allowed the customization of axis exclusion for ASR head (#1932)
- Tweaked wav2vec2 checkpoint conversion tool (#1938)
- Added melkwargs setting for MFCC in HuBERT pipeline (#1949)
Documentation
- Added 0.10.0 to version compatibility matrix (#1862)
- Removed MACOSX_DEPLOYMENT_TARGET (#1880)
- Updated intersphinx inventory (#1893)
- Updated compatibility matrix to include LTS version (#1896)
- Updated CONTRIBUTING with doc conventions (#1898)
- Added anaconda stats to README (#1910)
- Updated README.md (#1916)
- Added citation information (#1947)
- Updated CONTRIBUTING.md (#1975)
- Doc fixes (#1982)
- Added tutorial to CONTRIBUTING (#1990)
- Fixed docstring (#2002)
- Fixed minor typo (#2012)
- Updated audio augmentation tutorial (#2082)
- Added Sphinx gallery automatically (#2101)
- Disabled matplotlib warning in tutorial rendering (#2107)
- Updated prototype documentations (#2108)
- Added custom CSS to make signatures appear in multi-line (#2123)
- Updated prototype pipeline documentation (#2148)
- Tweaked documentation (#2152)
Tests
- Refactored integration test (#1922)
- Enabled integration tests on CI (#1939)
- Removed facebook folder in wav2vec unit tests (#2015)
- Temporarily skipped threadpool test (#2025)
- Revised Griffin-Lim transform test to reduce execution time (#2037)
- Fixed CircleCI test failures (#2069)
- Do not auto-skip tests on CI (#2127)
- Relaxed absolute tolerance for Kaldi compat tests (#2165)
- Added tacotron2 unit test with different batch_size (#2176)
Build
- Updated GPU resource class (#1791)
- Updated the main version to 0.11.0 (#1793)
- Updated windows cuda installer 11.1.0 to 11.1.1 (#1795)
- Renamed build_tools to tools (#1812)
- Limit Windows GPU testing to CUDA-11.3 only (#1842)
- Used cu113 for unittest_windows_gpu (#1853)
- USE_CUDA in windows and reduce one vcvarsall (#1854)
- Check torch installation before building package (#1867)
- Install tools from conda instead of brew (#1873)
- Cleaned up setup.py (#1900)
- Moved TorchAudio conda package to use pytorch-mutex (#1904)
- Updated smoke test docker image (#1905)
- Fixed formatting CIRCLECI_TAG when building docs (#1915)
- Fetch third party sources automatically (#1966)
- Disabled SPHINXOPT=-W for local env (#2013)
- Improved installing nightly pytorch (#2026)
- Improved cuda installation on windows (#2032)
- Refactored the library loading mechanism (#2038)
- Cleaned up libtorchaudio customization logic (#2039)
- Refactored and functionize the library definition (#2040)
- Introduced helper function to define extension (#2077)
- Standardized the location of third-party source code (#2086)
- Show lint diff with color (#2102)
- Updated third party submodule setup (#2132)
- Suppressed stderr from subprocess in setup.py (#2133)
- Fixed header include (#2135)
- Updated ROCM version 4.1 -> 4.3.1 and 4.5 (#2186)
- Added "cu102" back (#2190)
- Pinned flake8 version (#2191)
Style
- Removed trailing whitespace (#1803)
- Fixed style checks (#1913)
- Resolved lint warning (#1971)
- Enabled CLANGFORMAT (#1999)
- Fixed style checks in examples/tutorials (#2006)
- OSS config for lint checks (#2066)
- Excluded sphinx-gallery examples (#2071)
- Reverted linting exemptions introduced in #2071 (#2087)
- Applied arc lint to pytorch audio (#2096)
- Enforced lint checks and fix/mute lint errors (#2116)...
torchaudio v0.10.2 Minor release
This is a minor release compatible with PyTorch 1.10.2.
There is no feature change in torchaudio from 0.10.1. For the full feature of v0.10, please refer to the v0.10.0 release notes.
torchaudio 0.10.1 Release Note
This is a minor release, which is compatible with PyTorch 1.10.1 and includes small bug fixes, improvements, and documentation updates. There are no new features added.
Bug Fix
- #2050 Allow whitespace as `TORCH_CUDA_ARCH_LIST` delimiter
Improvement
- #2054 Fetch third party source code automatically
  The build process now fetches third party source code (git submodules and cmake external projects)
- #2059 Improve documentation
For the full feature of v0.10, please refer to the v0.10.0 release note.
v0.10.0
torchaudio 0.10.0 Release Note
Highlights
torchaudio 0.10.0 release includes:
- New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
- Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
- New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
- CUDA-enabled binaries
[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights
HuBERT model architectures (“base”, “large”, and “extra large” configurations) are added. In addition, support for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning, and HuBERT is added.
These pretrained weights can be used for feature extractions and downstream task adaptation.
>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...

Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD

[Beta] Tacotron2 and TTS Pipeline
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). These TTS pipelines are composed of multiple models and specific data processing, so to make the associated objects easy to use, a notion of bundle is introduced. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)

[Beta] RNN Transducer Loss
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and torchscript support, and can be run on both CPU and GPU; the GPU path has a custom CUDA kernel implementation for improved performance.
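A minimal sketch on random data; shapes follow the documented convention, with logits of shape (batch, max source length, max target length + 1, number of classes):

import torch
from torchaudio.functional import rnnt_loss

logits = torch.randn(2, 50, 11, 20, requires_grad=True)
targets = torch.randint(1, 20, (2, 10), dtype=torch.int32)
logit_lengths = torch.tensor([50, 50], dtype=torch.int32)
target_lengths = torch.tensor([10, 10], dtype=torch.int32)

loss = rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
loss.backward()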
[Beta] MVDR Beamforming
This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. There are three solutions (ref_channel, stv_evd, stv_power), and it supports single-channel and multi-channel masks (multi-channel masks are averaged within the method). It provides an online option that recursively updates the parameters for streaming audio.
Please refer to the MVDR tutorial.
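A minimal sketch on random six-channel audio (in practice the masks would come from a neural network):

import torch
from torchaudio.transforms import MVDR, Spectrogram

stft = Spectrogram(n_fft=400, power=None)   # complex-valued spectrogram
mvdr = MVDR(ref_channel=0, solution="ref_channel", online=False)

waveform = torch.randn(6, 16000)            # (channel, time)
specgram = stft(waveform)                   # (channel, freq, time)
mask_speech = torch.rand(specgram.shape[-2:])
mask_noise = 1.0 - mask_speech

enhanced = mvdr(specgram, mask_speech, mask_noise)  # (freq, time)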
GPU Build
This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.
Additional Features
torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.
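For instance, a single filter can now be applied across a batched input in one call; a minimal sketch with illustrative coefficients:

import torch
import torchaudio.functional as F

waveform = torch.randn(3, 2, 8000)   # batched input: (batch, channel, time)

b_coeffs = torch.tensor([0.4, 0.2, 0.9])   # numerator coefficients
a_coeffs = torch.tensor([1.0, 0.1, 0.3])   # denominator coefficients

filtered = F.lfilter(waveform, a_coeffs, b_coeffs)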
Backward Incompatible Changes
I/O
- Default to PCM_16 for flac on soundfile backend (#1604)
  - When saving FLAC format with the “soundfile” backend, `PCM_24` (the previous default) could cause warping. The default has been changed to `PCM_16`, which does not suffer from this.
Ops
- Default to native complex type when returning raw spectrogram (#1549)
  - When `power=None`, `torchaudio.functional.spectrogram` and `torchaudio.transforms.Spectrogram` now default to `return_complex=True`, which returns a Tensor of native complex type (such as `torch.cfloat` and `torch.cdouble`). To use a pseudo complex type, pass the resulting tensor to `torch.view_as_real`.
- Remove deprecated kaldi.resample_waveform (#1555)
  - Please use `torchaudio.functional.resample`.
- Replace waveform with specgram in SlidingWindowCmn (#1859)
  - The argument name was corrected to `specgram`.
- Ensure integer input frequencies for resample (#1857)
- Sampling rates were silently cast to integers in the resampling implementation, so it now requires integer sampling rate inputs to ensure expected resampling quality.
Wav2Vec2
- Update `extract_features` of Wav2Vec2Model (#1776)
  - The previous implementation returned outputs from the convolutional feature extractor. To match the behavior of the original fairseq implementation, the method was changed to return the outputs of the intermediate transformer layers. To achieve the original behavior, please use `Wav2Vec2Model.feature_extractor()`.
- Move fine-tune specific module out of wav2vec2 encoder (#1782)
  - The internal structure of `Wav2Vec2Model` was updated. The `Wav2Vec2Model.encoder.read_out` module is moved to `Wav2Vec2Model.aux`. If you have a serialized state dict, please replace the key `encoder.read_out` with `aux`.
- Updated wav2vec2 factory functions for more customizability (#1783, #1804, #1830)
  - The signatures of the wav2vec2 factory functions have changed. The `num_out` parameter has been renamed to `aux_num_out`, and other parameters are added before it. Please update code from `wav2vec2_base(num_out)` to `wav2vec2_base(aux_num_out=num_out)`.
Deprecations
- Add `melscale_fbanks` and deprecate `create_fb_matrix` (#1653)
  - As `linear_fbanks` is introduced, `create_fb_matrix` is renamed to `melscale_fbanks`. The original `create_fb_matrix` is now deprecated. Please use `melscale_fbanks`.
- Deprecate `VCTK` dataset (#1810)
  - This dataset has been taken down and is no longer available. Please use the `VCTK_092` dataset.
- Deprecate data utils (#1809)
  - `bg_iterator` and `diskcache_iterator` are known to not improve the throughput of data loaders. Please cease their usage.
New Features
Models
Tacotron2
- Add Tacotron2 model (#1621, #1647, #1844)
- Add Tacotron2 loss function (#1764)
- Add Tacotron2 inference method (#1648, #1839, #1849)
- Add phoneme text preprocessing for Tacotron2 (#1668)
- Move Tacotron2 out of prototype (#1714)
HuBERT
Pretrained Weights and Pipelines
- Add pretrained weights for wavernn (#1612)
- Add Tacotron2 pretrained models (#1693)
- Add HUBERT pretrained weights (#1821, #1824)
- Add pretrained weights from wav2vec2.0 and XLSR papers (#1827)
- Add customization support to wav2vec2 labels (#1834)
- Default pretrained weights to eval mode (#1843)
- Move wav2vec2 pretrained models to pipelines module (#1876)
- Add TTS bundle/pipelines (#1872)
- Fix vocoder interface (#1895)
- Fix Phonemizer download (#1897)
RNN Transducer Loss
- Add reduction parameter for RNNT loss (#1590)
- Rename RNNT loss C++ parameters (#1602)
- Rename transducer to RNNT (#1603)
- Remove gradient variable from RNNT loss Python code (#1616)
- Remove reuse_logits_for_grads option for RNNT loss (#1610)
- Remove fused_log_softmax option from RNNT loss (#1615)
- RNNT loss resolve null gradient (#1707)
- Move RNNT loss out of prototype (#1711)
MVDR Beamforming
- Add MVDR module to example (#1709)
- Add normalization to steering vector solutions in MVDR Module (#1765)
- Move MVDR and PSD modules to transforms (#1771)
- Add MVDR beamforming tutorial to example directory (#1768)
Ops
- Add edit_distance (#1601)
- Add PitchShift to functional and transform (#1629)
- Add LFCC feature to transforms (#1611)
- Add InverseSpectrogram to transforms and functional (#1652)
Datasets
Improvements
I/O
- Make buffer size for function info configurable (#1634)
Ops
torchaudio 0.9.1 Minor bugfix release
This release depends on pytorch 1.9.1
No functional changes other than minor updates to CI rules.
v0.9.0
torchaudio 0.9.0 Release Note
Highlights
torchaudio 0.9.0 release includes:
- Lots of performance improvements (filtering, resampling, spectral operations)
- Popular wav2vec2.0 model architecture.
- Improved autograd support.
[Beta] Wav2Vec2.0 Model
This release includes model architectures from the wav2vec 2.0 paper, with utility functions that allow importing pretrained model parameters published on fairseq and Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android, and iOS. Please check out our C++, Android, and iOS examples. The following snippets illustrate how to create a deployable model.
# Import fine-tuned model from Hugging Face Hub
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model

original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)

# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model

original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)

# Build uninitialized model and load state dict
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
from torchaudio.models import wav2vec2_base

model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())

# Quantize / script / optimize for mobile
quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")

Filtering Improvement
The internal implementation of lfilter has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad variants.
The following table illustrates the performance improvements compared against previous releases. lfilter was applied to float32 tensors with one channel and varying numbers of frames.
| torchaudio version | 256 frames | 512 frames | 1024 frames |
|---|---|---|---|
| 0.9 | 0.282 | 0.381 | 0.564 |
| 0.8 | 0.493 | 0.780 | 1.37 |
| 0.7 | 5.42 | 10.8 | 22.3 |

Unit: msec
Complex Tensor Migration
torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent the real and imaginary parts. In PyTorch 1.6, new dtypes, such as torch.cfloat and torch.cdouble, were introduced to represent complex values natively. (In the following, we refer to torchaudio’s original convention as pseudo complex types, and PyTorch’s native dtypes as native complex types.)
As the native complex types have become mature and stable, torchaudio has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing/receiving native complex type directly. Users can choose to keep using the pseudo complex type or opt in to use native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For the detail of this migration plan, please refer to #1337.
Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to perform the operation on a float32 Tensor with two channels and 256 frames.
CPU

| torchaudio version | Spectrogram | TimeStretch | GriffinLim |
|---|---|---|---|
| 0.9 | 0.229 | 12.6 | 3320 |
| 0.8 | 0.283 | 126 | 5320 |

Unit: msec

CUDA

| torchaudio version | Spectrogram | TimeStretch | GriffinLim |
|---|---|---|---|
| 0.9 | 0.195 | 0.599 | 36 |
| 0.8 | 0.219 | 0.687 | 60.2 |

Unit: msec
Improved Autograd Support
Along with the work of Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure the autograd support. Now the following operations are guaranteed to support autograd up to second order.
Functionals
- `lfilter`
- `allpass_biquad`
- `biquad`
- `band_biquad`
- `bandpass_biquad`
- `bandreject_biquad`
- `bass_biquad`
- `equalizer_biquad`
- `treble_biquad`
- `highpass_biquad`
- `lowpass_biquad`
Transforms
- `AmplitudeToDB`
- `ComputeDeltas`
- `Fade`
- `GriffinLim`
- `TimeMasking`
- `FrequencyMasking`
- `MFCC`
- `MelScale`
- `MelSpectrogram`
- `Resample`
- `SpectralCentroid`
- `Spectrogram`
- `SlidingWindowCmn`
- `TimeStretch`*
- `Vol`
NOTE:
- Autograd tests for transforms also cover the following functionals:
  - `amplitude_to_DB`
  - `spectrogram`
  - `griffinlim`
  - `resample`
  - `phase_vocoder`*
  - `mask_along_axis_iid`
  - `mask_along_axis`
  - `gain`
  - `spectral_centroid`
- `torchaudio.transforms.TimeStretch` and `torchaudio.functional.phase_vocoder` call `atan2`, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.
[Beta] Resampling Improvement
In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.
- Kaiser window has been added for a wider range of resampling quality.
- `rolloff` parameter has been added for anti-aliasing control.
- `torchaudio.transforms.Resample` precomputes the kernel using `float64` precision and caches it for even faster operation.
- A new entry point, `torchaudio.functional.resample`, has been added, and the original entry point, `torchaudio.compliance.kaldi.resample_waveform`, is deprecated.
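A minimal sketch exercising the new options (input is random; parameter values are illustrative):

import torch
import torchaudio.functional as F

waveform = torch.randn(2, 16000)   # two channels, one second at 16 kHz

resampled = F.resample(
    waveform,
    orig_freq=16000,
    new_freq=8000,
    resampling_method="kaiser_window",  # new Kaiser window option
    rolloff=0.94,                       # anti-aliasing control
)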
The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on a float32 tensor with two channels and one-second duration.
CPU
| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.9 | 0.192 | 0.559 | 0.478 | 0.467 |
| 0.8 | 0.537 | 0.753 | 43.9 | 17.6 |

Unit: msec
CUDA
...| torchaudio version | 8k → 16k | 16k → 8k | 16k → 44.1k | 44.1k → 16k |