
A simple Gradio web app for transcribing .wav audio files using Google Cloud Speech-to-Text, deployed with Modal. Easily extendable for agentic AI workflows and HuggingFace model integration.

MT-RD/gradio-modal-gcp-speech-ui

Gradio-Powered Speech-to-Text UI with Google Cloud STT and Modal Deployment

A modern, user-friendly web interface for transcribing speech from audio files using Google Cloud Speech-to-Text (STT), built with Gradio for the frontend and Modal for scalable serverless backend execution.

🚀 Features

✅ Currently Implemented

  • Drag-and-drop audio file upload via clean Gradio interface
  • Multiple audio format support (WAV, MP3, M4A, OGG, FLAC, AAC, WMA)
  • File validation system with format and size checking (100MB limit)
  • Audio preview functionality with built-in player
  • Detailed file information display (name, size, format, upload time)
  • Enhanced error handling with troubleshooting guidance
  • Professional UI/UX with welcome messages and status indicators
  • Comprehensive audio analysis with librosa integration
  • Audio quality metrics (duration, sample rate, channels, RMS energy)
  • Robust file format handling with graceful fallbacks
  • GCP client foundation with authentication structure
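
The audio quality metrics listed above (duration, sample rate, channels, RMS energy) can be illustrated with a minimal sketch. This is not the project's librosa-based analysis: it is a standard-library stand-in that handles only 16-bit PCM WAV files, and the function name `wav_metrics` and the returned keys are assumptions made for the example.

```python
import math
import struct
import wave


def wav_metrics(path):
    """Return duration, sample rate, channel count and RMS energy for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        sample_width = wav.getsampwidth()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    # Unpack little-endian signed 16-bit samples and scale them to [-1.0, 1.0).
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[: count * 2])
    rms = math.sqrt(sum((s / 32768.0) ** 2 for s in samples) / count) if count else 0.0

    return {
        "duration_s": n_frames / sample_rate,  # frames per channel / frames per second
        "sample_rate_hz": sample_rate,
        "channels": channels,
        "rms_energy": rms,  # computed over all interleaved channel samples
    }
```

A very low RMS value usually indicates a silent or near-silent recording, which is worth flagging before a file is sent for transcription.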

🚧 In Development

  • Complete GCP Speech-to-Text integration for transcription
  • Audio quality validation against GCP requirements
  • Serverless processing with Modal for scalable transcription
  • Real-time progress tracking during transcription

🎯 Planned Features

  • Multiple deployment options: Local, Modal Web, and HuggingFace Spaces
  • Language selection interface with 50+ supported languages
  • Batch processing capabilities for multiple files
  • Transcription export (TXT, JSON, SRT formats)
  • Audio duration detection and advanced file analysis

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌────────────────────┐
│   Gradio UI     │────│   Modal Backend  │────│  Google Cloud STT  │
│   (Frontend)    │    │   (Serverless)   │    │      (API)         │
└─────────────────┘    └──────────────────┘    └────────────────────┘
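
The three-stage flow in the diagram can be sketched as a UI handler that delegates to an injected backend. Everything here is hypothetical: `make_ui_handler`, `Transcriber` and `fake_backend` are names invented for illustration, and the stub backend merely stands in for a Modal function that would call Google Cloud STT.

```python
from typing import Callable

# A transcriber takes raw audio bytes and returns text.
Transcriber = Callable[[bytes], str]


def make_ui_handler(backend: Transcriber) -> Callable[[str], str]:
    """Build the function a UI component could call with the uploaded file path."""

    def handle_upload(path: str) -> str:
        with open(path, "rb") as handle:
            audio = handle.read()
        try:
            return backend(audio)
        except Exception as exc:
            # Surface backend failures as a readable message instead of a traceback.
            return f"Transcription failed: {exc}"

    return handle_upload


def fake_backend(audio: bytes) -> str:
    """Stand-in for the serverless function that would call the speech API."""
    return f"<{len(audio)} bytes received>"
```

Keeping the backend behind a plain callable means the frontend can be tested locally with a stub and later pointed at the deployed function without changing the handler.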

📁 Project Structure

gradio-modal-gcp-speech-ui/
├── src/
│   ├── gradio_app/          # Gradio frontend application
│   ├── modal_functions/     # Modal serverless functions
│   ├── gcp_client/          # Google Cloud STT client
│   └── utils/               # Shared utilities
├── config/
│   ├── settings.py          # Application configuration
│   └── gcp_config.py        # Google Cloud configuration
├── docs/
│   ├── setup/               # Setup and deployment guides
│   └── api/                 # API documentation
├── tests/                   # Test files
├── samples/                 # Sample audio files for testing
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Modern Python project configuration
└── README.md                # This file

🛠️ Installation & Setup

Prerequisites

  • Python 3.8+
  • Google Cloud Platform account
  • Modal account and API key
  • FFmpeg (for audio processing)

1. Clone and Setup Environment

# Clone the repository
git clone <your-repo-url>
cd gradio-modal-gcp-speech-ui

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Google Cloud Setup

Follow the detailed setup guide in docs/setup/gcp-setup.md to:

  • Create a GCP project
  • Enable Speech-to-Text API
  • Create service account credentials
  • Set up authentication

3. Modal Setup

# Install Modal CLI and authenticate
pip install modal
modal token new

4. Environment Configuration

# Copy environment template
cp .env.example .env

# Edit .env with your credentials
# - GCP service account key path
# - Modal API key
# - Other configuration options
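
To check that the .env file is complete before starting the app, a small loader along these lines may help. It is a sketch under assumptions: the variable names in .env.example are not shown here, so the use of `GOOGLE_APPLICATION_CREDENTIALS` (the variable Google client libraries conventionally read for a service account key path) is a guess about this project, and `load_env_file` and `missing_settings` are hypothetical helpers.

```python
def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict, skipping blanks and comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values, required=("GOOGLE_APPLICATION_CREDENTIALS",)):
    """Return the required keys that are absent or empty (key names are assumptions)."""
    return [key for key in required if not values.get(key)]
```

Calling `missing_settings(load_env_file())` at startup gives a clear list of what still needs to be filled in, rather than an authentication error at the first API call.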

🚀 Usage

Current Local Development (Step 2 - UI Ready)

# Quick start with automated setup
./setup.sh

# Or manual setup:
# 1. Create virtual environment
python -m venv venv
source venv/bin/activate

# 2. Install dependencies
make install

# 3. Run the application
make run
# Application will be available at http://127.0.0.1:7860

# For development mode with auto-reload
make dev

Testing Current Features

  1. Upload Audio Files: Test with WAV, MP3, M4A, OGG, FLAC, AAC, or WMA files
  2. File Validation: Try invalid formats or large files (>100MB) to see error handling
  3. Audio Preview: Use the built-in player to preview uploaded audio
  4. File Information: View detailed file metadata and processing status

Future Usage (After GCP Integration)

# Configure environment
cp .env.example .env
# Edit .env with your GCP credentials

# Deploy to Modal (Step 4)
modal deploy src/modal_functions/speech_processor.py

# Deploy to HuggingFace Spaces (Step 4)
# See docs/setup/huggingface-deployment.md

📊 Supported Audio Formats

  • WAV (recommended for best quality)
  • MP3 (most common format)
  • M4A (Apple audio format)
  • OGG (open source format)
  • FLAC (lossless compression)
  • AAC (Advanced Audio Coding)
  • WMA (Windows Media Audio)

File Size Limit: 100MB per file
Validation: Automatic format and size checking with detailed error messages
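
The format and size checks described above can be sketched roughly as follows. This is an illustration, not the project's validator: `validate_audio_file` is a hypothetical name, the check is extension-based only, and "100MB" is assumed to mean 100 MiB.

```python
import os

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".aac", ".wma"}
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # assumption: "100MB" means 100 MiB


def validate_audio_file(path, max_bytes=MAX_FILE_SIZE_BYTES):
    """Return (ok, message) for a candidate upload, checking extension first, then size."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        allowed = ", ".join(sorted(e.lstrip(".").upper() for e in ALLOWED_EXTENSIONS))
        return False, f"Unsupported format '{ext or 'none'}'. Supported formats: {allowed}."
    if not os.path.isfile(path):
        return False, "File not found."
    size = os.path.getsize(path)
    if size == 0:
        return False, "File is empty."
    if size > max_bytes:
        limit_mb = max_bytes / (1024 * 1024)
        return False, f"File is {size / (1024 * 1024):.1f} MB; the limit is {limit_mb:.0f} MB."
    return True, "OK"
```

Returning a message alongside the boolean lets the UI show the reason for a rejection directly to the user.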

🎯 Roadmap

✅ Completed (Phase 1-2)

  • Project structure and build system
  • Comprehensive configuration management
  • Basic Gradio interface with audio upload
  • File validation and error handling
  • Audio preview and file information display
  • Enhanced UI/UX with professional feedback

✅ Completed (Phase 3A - GCP Foundation)

  • GCP client package structure (src/gcp_client/)
  • Comprehensive audio analysis with librosa integration
  • Audio format validation for all 7 supported upload formats (WAV, MP3, M4A, OGG, FLAC, AAC, WMA)
  • Detailed audio metrics extraction (duration, sample rate, channels)
  • Robust error handling with user-friendly messages
  • AudioProcessor with graceful loading fallbacks
  • SpeechToTextClient authentication foundation

🚧 In Progress (Phase 3B)

  • Complete GCP Speech-to-Text client implementation
  • Audio quality validation against GCP requirements
  • Synchronous and asynchronous transcription methods

📋 Planned (Phase 4-5)

  • Modal backend integration and deployment
  • HuggingFace Spaces deployment
  • Language selection interface (50+ languages)
  • Audio duration detection with librosa
  • Batch processing capabilities
  • Transcription export (TXT, JSON, SRT)
  • HuggingFace ASR model comparison
  • LLM-powered transcription post-processing
  • Real-time streaming transcription
  • Custom vocabulary support
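
For the planned SRT export, the subtitle format itself is simple enough to sketch. The example assumes transcription results are available as (start seconds, end seconds, text) tuples, which is a guess about the eventual data shape; `format_srt_timestamp` and `to_srt` are hypothetical names.

```python
def format_srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start_s, end_s, text) tuples as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)
```

The same segment list could feed the TXT export (joined text) and the JSON export (serialized tuples), so all three formats can share one intermediate representation.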

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

🏷️ Tags

#gradio #modal #speech-to-text #google-cloud #asr #ai #serverless #python
