docs/source/en/tasks/text-to-speech.md (6 additions, 6 deletions)
```diff
@@ -19,18 +19,18 @@ rendered properly in your Markdown viewer.
 [[open-in-colab]]
 
 Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple
-languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as
+languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as [Dia](../model_doc/dia), [CSM](../model_doc/csm),
 [Bark](../model_doc/bark), [MMS](../model_doc/mms), [VITS](../model_doc/vits) and [SpeechT5](../model_doc/speecht5).
 
-You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Bark,
+You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
 can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
-Here's an example of how you would use the `"text-to-speech"` pipeline with Bark:
+Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
```
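The pipeline example that the changed line points to lives in the doc page itself and is not part of this diff. As a rough sketch of the pattern it refers to, here is a `"text-to-speech"` pipeline call using the `suno/bark-small` checkpoint that the previous wording was written around; a Dia checkpoint would be passed as `model` in the same way, but its exact Hub name is not shown in this diff:

```python
from transformers import pipeline

# "text-to-speech" is an alias of the "text-to-audio" pipeline.
pipe = pipeline("text-to-speech", model="suno/bark-small")

# Bark-style prompts can include non-verbal cues such as [clears throat] or [laughs].
output = pipe("[clears throat] Hello, this is a test of the text-to-speech pipeline.")

# The pipeline returns the raw waveform together with its sampling rate.
print(output["sampling_rate"], output["audio"].shape)
```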
```diff
 If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers
-are [SpeechT5](model_doc/speecht5) and [FastSpeech2Conformer](model_doc/fastspeech2_conformer), though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.
+are [SpeechT5](model_doc/speecht5), [FastSpeech2Conformer](model_doc/fastspeech2_conformer), [Dia](model_doc/dia) and [CSM](model_doc/csm), though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.
```
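For the multi-speaker point above, a minimal sketch of how SpeechT5 consumes an x-vector speaker embedding through the pipeline; the `microsoft/speecht5_tts` checkpoint and the `Matthijs/cmu-arctic-xvectors` dataset are the ones the SpeechT5 docs usually rely on, and index 7306 simply picks one precomputed speaker vector:

```python
import torch
from datasets import load_dataset
from transformers import pipeline

synthesiser = pipeline("text-to-speech", model="microsoft/speecht5_tts")

# SpeechT5 expects a 512-dimensional x-vector speaker embedding;
# this dataset ships precomputed ones for the CMU ARCTIC speakers.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

speech = synthesiser(
    "Hello, my dog is cooler than you!",
    forward_params={"speaker_embeddings": speaker_embedding},
)
print(speech["sampling_rate"], speech["audio"].shape)
```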