Hi! This is a great library, thanks for open-sourcing it.
Is it possible to extract embeddings from this model that can then be clustered for speaker identification? E.g. could I take the output of the encoder here before the combined embedding is created?
```python
speech_embeds = self.audio_encoder(speech)
```
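To make the idea concrete, here's a rough sketch of what I have in mind: mean-pool the encoder output over time to get one vector per utterance, then cluster by cosine similarity. The shapes, the numpy stand-ins for the real `speech_embeds`, and the similarity threshold are all just guesses on my part, not anything from this repo:

```python
import numpy as np

# Stand-in for encoder outputs; in the real model each utterance would come
# from speech_embeds = self.audio_encoder(speech), shape assumed [T, D].
rng = np.random.default_rng(0)
spk_a = rng.normal(0.0, 0.1, size=(2, 50, 8)) + np.ones(8)  # two utterances, "speaker A"
spk_b = rng.normal(0.0, 0.1, size=(2, 50, 8)) - np.ones(8)  # two utterances, "speaker B"
utterances = np.concatenate([spk_a, spk_b])

# Mean-pool over the time axis: one fixed-size embedding per utterance.
embeds = utterances.mean(axis=1)

# L2-normalise so dot products are cosine similarities.
embeds /= np.linalg.norm(embeds, axis=1, keepdims=True)
sim = embeds @ embeds.T

# Greedy threshold clustering: same speaker if cosine similarity > 0.5.
labels = [-1] * len(embeds)
next_label = 0
for i in range(len(embeds)):
    if labels[i] == -1:
        labels[i] = next_label
        next_label += 1
    for j in range(i + 1, len(embeds)):
        if labels[j] == -1 and sim[i, j] > 0.5:
            labels[j] = labels[i]

print(labels)  # hoping utterances from the same speaker share a label
```

Is something along these lines sensible with this encoder, or is its output not speaker-discriminative at that point in the pipeline?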
I'm new to speech processing, so please forgive me if that's a daft question. Thanks!