Commit 01c5159

update with more recent tts models (#42328)
* update with more recent tts models
* fix pipeline
1 parent a099b27 commit 01c5159

File tree

1 file changed: +6 -6 lines

docs/source/en/tasks/text-to-speech.md

Lines changed: 6 additions & 6 deletions
@@ -19,18 +19,18 @@ rendered properly in your Markdown viewer.
 [[open-in-colab]]
 
 Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple
-languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as
+languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as [Dia](../model_doc/dia), [CSM](../model_doc/csm),
 [Bark](../model_doc/bark), [MMS](../model_doc/mms), [VITS](../model_doc/vits) and [SpeechT5](../model_doc/speecht5).
 
-You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Bark,
+You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
 can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
-Here's an example of how you would use the `"text-to-speech"` pipeline with Bark:
+Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
 
 ```py
 >>> from transformers import pipeline
 
->>> pipe = pipeline("text-to-speech", model="suno/bark-small")
->>> text = "[clears throat] This is a test ... and I just took a long pause."
+>>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
+>>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
 >>> output = pipe(text)
 ```
 
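For context on the snippet above: the `"text-to-speech"` pipeline returns a dictionary whose `"audio"` key holds the generated waveform as a NumPy array and whose `"sampling_rate"` key holds its sampling rate. A minimal sketch for writing that output to a WAV file so it can be played back, assuming `scipy` is installed (the file name is arbitrary and not part of this commit):

```py
>>> import numpy as np
>>> import scipy.io.wavfile

>>> # output["audio"] is the raw waveform; squeeze a possible (1, num_samples) batch axis before writing
>>> waveform = np.squeeze(output["audio"])
>>> scipy.io.wavfile.write("dia_sample.wav", rate=output["sampling_rate"], data=waveform)
```

The same two keys are returned regardless of which checkpoint backs the pipeline, so this pattern also applies to the other models mentioned in the diff.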

@@ -45,7 +45,7 @@ For more examples on what Bark and other pretrained TTS models can do, refer to
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 
 If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers
-are [SpeechT5](model_doc/speecht5) and [FastSpeech2Conformer](model_doc/fastspeech2_conformer), though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.
+are [SpeechT5](model_doc/speecht5), [FastSpeech2Conformer](model_doc/fastspeech2_conformer), [Dia](model_doc/dia) and [CSM](model_doc/csm), though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.
 
 The remainder of this guide illustrates how to:
 
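On the speaker-embedding point above: with SpeechT5 the speaker identity is supplied as an x-vector tensor at generation time rather than being baked into the checkpoint. A minimal sketch of that idea through the `"text-to-speech"` pipeline, assuming the `microsoft/speecht5_tts` checkpoint and the `Matthijs/cmu-arctic-xvectors` dataset of precomputed x-vectors used elsewhere in the 🤗 documentation (the dataset index is arbitrary):

```py
>>> import torch
>>> from datasets import load_dataset
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="microsoft/speecht5_tts")

>>> # Pick one precomputed 512-dimensional x-vector and add a batch dimension
>>> embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
>>> speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

>>> # The embedding is forwarded to the model and conditions the voice of the generated speech
>>> output = pipe(
...     "Hello, this sentence will be spoken in the selected speaker's voice.",
...     forward_params={"speaker_embeddings": speaker_embedding},
... )
```

Swapping in a different x-vector changes the generated voice without any fine-tuning, which is what the paragraph above means by multi-speaker support.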