Commit ca33467

tsavina and yatarkan authored
Text-to-speech use case page (#3017)
## Description

This PR adds a new Speech Generation use case page to the OpenVINO GenAI documentation.

Preview: [Text-to-speech use case](https://tsavina.github.io/openvino.genai/docs/use-cases/speech-generation/)

## Ticket

CVS-169351

## Checklist:

- [ ] Tests have been updated or added to cover the new code - N/A
- [ ] This patch fully addresses the ticket. <!--- If follow-up pull requests are needed, specify in description. -->
- [ ] I have made corresponding changes to the documentation. <!-- Run github.com/\<username>/openvino.genai/actions/workflows/deploy_gh_pages.yml on your fork with your branch as a parameter to deploy a test version with the updated content. Replace this comment with the link to the built docs. -->

---------

Co-authored-by: Yaroslav Tarkan <[email protected]>
1 parent c37d5a4 commit ca33467

7 files changed, +202 -0 lines changed
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+import CodeBlock from '@theme/CodeBlock';
+
+<CodeBlock language="cpp" showLineNumbers>
+{`#include "audio_utils.hpp"
+#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"
+
+int main(int argc, char* argv[]) {
+    std::string model_path = argv[1];
+    ov::genai::Text2SpeechPipeline pipeline(model_path, "${props.device || 'CPU'}");
+
+    auto result = pipeline.generate("Hello OpenVINO GenAI");
+
+    auto waveform_size = result.speeches[0].get_size();
+    auto waveform_ptr = result.speeches[0].data<const float>();
+    auto bits_per_sample = result.speeches[0].get_element_type().bitwidth();
+    utils::audio::save_to_wav(waveform_ptr, waveform_size, "output_audio.wav", bits_per_sample);
+
+    return 0;
+}
+`}
+</CodeBlock>
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+import CodeBlock from '@theme/CodeBlock';
+
+<CodeBlock language="python" showLineNumbers>
+{`import openvino_genai
+import soundfile as sf
+
+pipeline = openvino_genai.Text2SpeechPipeline(model_path, "${props.device || 'CPU'}")
+
+# Generate audio using the default speaker
+result = pipeline.generate("Hello OpenVINO GenAI")
+# speech tensor contains the waveform of the spoken phrase
+speech = result.speeches[0]
+sf.write("output_audio.wav", speech.data[0], samplerate=16000)
+`}
+</CodeBlock>
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+import CodeExampleCPP from './_code_example_cpp.mdx';
+import CodeExamplePython from './_code_example_python.mdx';
+
+
+## Run Model Using OpenVINO GenAI
+
+The [`Text2SpeechPipeline`](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.Text2SpeechPipeline.html) is the main object for generating speech from text.
+It automatically loads the TTS model and vocoder from the converted model directory.
+
+<LanguageTabs>
+    <TabItemPython>
+        <Tabs groupId="device">
+            <TabItem label="CPU" value="cpu">
+                <CodeExamplePython device="CPU" />
+            </TabItem>
+            <TabItem label="GPU" value="gpu">
+                <CodeExamplePython device="GPU" />
+            </TabItem>
+        </Tabs>
+    </TabItemPython>
+    <TabItemCpp>
+        <Tabs groupId="device">
+            <TabItem label="CPU" value="cpu">
+                <CodeExampleCPP device="CPU" />
+            </TabItem>
+            <TabItem label="GPU" value="gpu">
+                <CodeExampleCPP device="GPU" />
+            </TabItem>
+        </Tabs>
+    </TabItemCpp>
+</LanguageTabs>
+
+:::tip
+Switch between CPU and GPU by changing only the device name; no other code changes are needed.
+:::
+
+## Additional Usage Options
+
+:::tip
+Check out the [Python](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/speech_generation/text2speech.py) and [C++](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/cpp/speech_generation/text2speech.cpp) speech generation samples.
+:::
+
+### Use Speaker Embedding File
+
+To generate speech using the SpeechT5 TTS model, you can specify a target voice by providing a speaker embedding file.
+
+This file must contain 512 32-bit floating-point values that represent the voice characteristics of the target speaker. The model will use these characteristics to synthesize the input text in the specified voice.
+
+If no speaker embedding is provided, the model uses the default built-in speaker.
+
+You can generate a speaker embedding using the [create_speaker_embedding.py](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/speech_generation/create_speaker_embedding.py) script. This script records 5 seconds of audio from your microphone and extracts a speaker embedding vector from the recording.
+
+```bash
+python create_speaker_embedding.py
+```
+
+<LanguageTabs>
+    <TabItemPython>
+        ```python
+        import openvino_genai
+        import openvino as ov
+        import numpy as np
+        import soundfile as sf
+
+        pipeline = openvino_genai.Text2SpeechPipeline(model_path, "CPU")
+
+        speaker_embedding = np.fromfile(args.speaker_embedding_file_path, dtype=np.float32).reshape(1, 512)
+        speaker_embedding = ov.Tensor(speaker_embedding)
+        result = pipeline.generate("Hello OpenVINO GenAI", speaker_embedding)
+
+        speech = result.speeches[0]
+        sf.write("output_audio.wav", speech.data[0], samplerate=16000)
+        ```
+    </TabItemPython>
+    <TabItemCpp>
+        ```cpp
+        #include "openvino/genai/speech_generation/text2speech_pipeline.hpp"
+        #include "audio_utils.hpp"
+
+        int main(int argc, char* argv[]) {
+            std::string model_path = argv[1];
+            ov::genai::Text2SpeechPipeline pipeline(model_path, "CPU");
+
+            auto speaker_embedding = utils::audio::read_speaker_embedding(speaker_embedding_path);
+            auto result = pipeline.generate("Hello OpenVINO GenAI", speaker_embedding);
+
+            auto waveform_size = result.speeches[0].get_size();
+            auto waveform_ptr = result.speeches[0].data<const float>();
+            auto bits_per_sample = result.speeches[0].get_element_type().bitwidth();
+            utils::audio::save_to_wav(waveform_ptr, waveform_size, "output_audio.wav", bits_per_sample);
+
+            return 0;
+        }
+        ```
+    </TabItemCpp>
+</LanguageTabs>
+
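Reviewer note on the speaker embedding format: the page describes the file as 512 32-bit floats, and the Python snippet reads it with `np.fromfile(..., dtype=np.float32)`, which suggests a raw binary file with no header. The sketch below checks that layout using only the standard library. The function names are hypothetical, and little-endian byte order is an assumption of this sketch, not something the page states.

```python
import struct
from pathlib import Path

EMBEDDING_SIZE = 512   # number of values stated on the page
BYTES_PER_VALUE = 4    # one 32-bit float


def save_speaker_embedding(path, values):
    """Write exactly 512 floats as raw little-endian float32 (assumed layout)."""
    if len(values) != EMBEDDING_SIZE:
        raise ValueError(f"expected {EMBEDDING_SIZE} values, got {len(values)}")
    Path(path).write_bytes(struct.pack(f"<{EMBEDDING_SIZE}f", *values))


def load_speaker_embedding(path):
    """Read a raw float32 embedding file and return a list of 512 floats."""
    raw = Path(path).read_bytes()
    expected_bytes = EMBEDDING_SIZE * BYTES_PER_VALUE
    if len(raw) != expected_bytes:
        raise ValueError(f"expected {expected_bytes} bytes, got {len(raw)}")
    return list(struct.unpack(f"<{EMBEDDING_SIZE}f", raw))
```

A file that passes this check is 2048 bytes long, which is a quick way to spot a truncated or mis-typed embedding before handing it to the pipeline.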
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+---
+sidebar_position: 7
+---
+import OptimumCLI from '@site/src/components/OptimumCLI';
+import ConvertModelSection from '../_shared/_convert_model.mdx';
+import RunModelSection from './_sections/_run_model/index.mdx';
+
+
+# Speech Generation Using SpeechT5
+
+:::info Note
+Currently, the speech generation pipeline supports the SpeechT5 TTS model.
+The generated audio signal is a single-channel (mono) waveform with a sampling rate of 16 kHz.
+:::
+
+<ConvertModelSection>
+    Download a model (e.g. [speecht5_tts](https://huggingface.co/microsoft/speecht5_tts)) and its vocoder from Hugging Face and convert them to OpenVINO format.
+    SpeechT5 requires specifying a vocoder via `--model-kwargs`:
+
+    <OptimumCLI model='microsoft/speecht5_tts' outputDir='speecht5_tts' weightFormat='int4' modelKwargs={{
+        vocoder: "microsoft/speecht5_hifigan"
+    }} />
+
+    See all supported [Speech Generation Models](/docs/supported-models/#speech-generation-models).
+</ConvertModelSection>
+
+<RunModelSection />
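Reviewer note on the output format: the info box says the pipeline produces a mono waveform sampled at 16 kHz. The page's own examples save it with `soundfile` (Python) or `save_to_wav` (C++). As an illustration of what those two properties mean for the file, here is a minimal standard-library sketch that writes float samples as 16-bit PCM. The helper names are hypothetical, and the assumption that samples lie in [-1.0, 1.0] is mine, not the page's.

```python
import struct
import wave

SAMPLE_RATE = 16000  # Hz, as stated in the note
NUM_CHANNELS = 1     # mono


def write_mono_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples (assumed to be in [-1.0, 1.0]) as 16-bit PCM mono WAV."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(NUM_CHANNELS)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Playback length of a mono waveform with the given number of samples."""
    return num_samples / sample_rate
```

At 16 kHz mono, one second of speech is 16000 samples, so `duration_seconds(len(waveform))` gives a quick sanity check on the generated audio length.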

site/src/components/OptimumCLI/index.tsx

Lines changed: 6 additions & 0 deletions
@@ -6,6 +6,7 @@ type OptimumCLIProps = {
   weightFormat?: 'fp32' | 'fp16' | 'int8' | 'int4';
   task?: string;
   trustRemoteCode?: boolean;
+  modelKwargs?: Record<string, string>;
 };
 
 export default function OptimumCLI({
@@ -14,6 +15,7 @@ export default function OptimumCLI({
   weightFormat,
   task,
   trustRemoteCode,
+  modelKwargs,
 }: OptimumCLIProps): React.JSX.Element {
   const args = [`--model ${model}`];
   if (weightFormat) {
@@ -25,6 +27,10 @@ export default function OptimumCLI({
   if (trustRemoteCode) {
     args.push('--trust-remote-code');
   }
+  if (modelKwargs) {
+    const kwargsString = JSON.stringify(modelKwargs);
+    args.push(`--model-kwargs '${kwargsString}'`);
+  }
   return (
     <CodeBlock language="bash">{`optimum-cli export openvino ${args.join(
       ' '
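Reviewer note on the `modelKwargs` serialization: the component turns the object into compact JSON and wraps it in single quotes for the shell. The Python sketch below mirrors that idea to show the command the SpeechT5 page should render. `build_export_command` is a hypothetical helper, the argument order is an assumption (the full component is not visible in this diff), and `shlex.quote` is used here in place of hand-written quotes so that values containing a single quote would still be escaped.

```python
import json
import shlex


def build_export_command(model, output_dir, weight_format=None, model_kwargs=None):
    """Assemble an `optimum-cli export openvino` command line as one string."""
    args = ["optimum-cli", "export", "openvino", "--model", model]
    if weight_format:
        args += ["--weight-format", weight_format]
    if model_kwargs:
        # Compact separators match the output of JavaScript's JSON.stringify.
        kwargs_json = json.dumps(model_kwargs, separators=(",", ":"))
        args += ["--model-kwargs", kwargs_json]
    args.append(output_dir)
    # Quote each argument so the JSON payload survives shell parsing.
    return " ".join(shlex.quote(arg) for arg in args)
```

For the props used on the SpeechT5 page, `build_export_command("microsoft/speecht5_tts", "speecht5_tts", "int4", {"vocoder": "microsoft/speecht5_hifigan"})` yields a command whose JSON argument is `'{"vocoder":"microsoft/speecht5_hifigan"}'`.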
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+import Button from '@site/src/components/Button';
+import { LanguageTabs, TabItemCpp, TabItemPython } from '@site/src/components/LanguageTabs';
+
+import UseCaseCard from './UseCaseCard';
+
+import CodeExampleCpp from '@site/docs/use-cases/speech-generation/_sections/_run_model/_code_example_cpp.mdx';
+import CodeExamplePython from '@site/docs/use-cases/speech-generation/_sections/_run_model/_code_example_python.mdx';
+
+export const SpeechGeneration = () => (
+  <UseCaseCard>
+    <UseCaseCard.Title>Speech Generation Using SpeechT5</UseCaseCard.Title>
+    <UseCaseCard.Description>
+      Convert text to speech using SpeechT5 TTS models.
+    </UseCaseCard.Description>
+    <UseCaseCard.Features>
+      <li>Generate natural and expressive speech from text prompts</li>
+      <li>Use speaker embeddings for personalized voice synthesis</li>
+    </UseCaseCard.Features>
+    <UseCaseCard.Code>
+      <LanguageTabs>
+        <TabItemPython>
+          <CodeExamplePython />
+        </TabItemPython>
+        <TabItemCpp>
+          <CodeExampleCpp />
+        </TabItemCpp>
+      </LanguageTabs>
+    </UseCaseCard.Code>
+    <UseCaseCard.Actions>
+      <Button label="Explore Use Case" link="docs/use-cases/speech-generation" variant="primary" />
+      <Button label="View Code Samples" link="docs/samples" variant="primary" outline />
+    </UseCaseCard.Actions>
+  </UseCaseCard>
+);

site/src/pages/_sections/UseCasesSection/index.tsx

Lines changed: 2 additions & 0 deletions
@@ -5,6 +5,7 @@ import Heading from '@theme/Heading';
 import Link from '@docusaurus/Link';
 import { ImageGeneration } from './components/image-generation';
 import { ImageProcessing } from './components/image-processing';
+import { SpeechGeneration } from './components/speech-generation';
 import { SpeechRecognition } from './components/speech-recognition';
 import { TextGeneration } from './components/text-generation';
 import { TextRerank } from './components/text-rerank';
@@ -19,6 +20,7 @@ export const UseCasesSection = () => (
     <TextGeneration />
     <ImageGeneration />
     <SpeechRecognition />
+    <SpeechGeneration />
     <ImageProcessing />
     <TextEmbedding />
     <TextRerank />
