A modern, user-friendly web interface for transcribing speech from audio files using Google Cloud Speech-to-Text (STT), built with Gradio for the frontend and Modal for scalable serverless backend execution.
- Drag-and-drop audio file upload via clean Gradio interface
- Multiple audio format support (WAV, MP3, M4A, OGG, FLAC, AAC, WMA)
- File validation system with format and size checking (100MB limit)
- Audio preview functionality with built-in player
- Detailed file information display (name, size, format, upload time)
- Enhanced error handling with troubleshooting guidance
- Professional UI/UX with welcome messages and status indicators
- Comprehensive audio analysis with librosa integration
- Audio quality metrics (duration, sample rate, channels, RMS energy)
- Robust file format handling with graceful fallbacks
- GCP client foundation with authentication structure
- Complete GCP Speech-to-Text integration for transcription
- Audio quality validation against GCP requirements
- Serverless processing with Modal for scalable transcription
- Real-time progress tracking during transcription
- Multiple deployment options: Local, Modal Web, and HuggingFace Spaces
- Language selection interface with 50+ supported languages
- Batch processing capabilities for multiple files
- Transcription export (TXT, JSON, SRT formats)
- Audio duration detection and advanced file analysis
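The audio quality metrics listed above (duration, sample rate, channels, RMS energy) can be illustrated with a minimal sketch. The project itself uses librosa; this example swaps in Python's standard-library `wave` module so it runs without extra dependencies, and it only handles 16-bit PCM WAV files. The `AudioMetrics` and `analyze_wav` names are hypothetical and are not taken from the project's code.

```python
import math
import struct
import wave
from dataclasses import dataclass


@dataclass
class AudioMetrics:
    duration_s: float
    sample_rate: int
    channels: int
    rms: float  # RMS energy, normalized to the 0.0-1.0 range


def analyze_wav(path: str) -> AudioMetrics:
    """Read a 16-bit PCM WAV file and compute basic quality metrics.

    Hypothetical helper: a stand-in for the librosa-based analysis,
    not the project's actual implementation.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)

    # Unpack little-endian signed 16-bit samples (channels are interleaved).
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    if samples:
        rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    else:
        rms = 0.0

    # One frame holds one sample per channel, so frames / rate is the duration.
    return AudioMetrics(n_frames / sample_rate, sample_rate, channels, rms)
```

A very low RMS value usually points to a silent or near-silent recording, which is worth flagging before sending the file for transcription.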
┌───────────────┐    ┌────────────────┐    ┌──────────────────┐
│   Gradio UI   │───▶│ Modal Backend  │───▶│ Google Cloud STT │
│  (Frontend)   │    │  (Serverless)  │    │      (API)       │
└───────────────┘    └────────────────┘    └──────────────────┘
gradio-modal-gcp-speech-ui/
├── src/
│   ├── gradio_app/          # Gradio frontend application
│   ├── modal_functions/     # Modal serverless functions
│   ├── gcp_client/          # Google Cloud STT client
│   └── utils/               # Shared utilities
├── config/
│   ├── settings.py          # Application configuration
│   └── gcp_config.py        # Google Cloud configuration
├── docs/
│   ├── setup/               # Setup and deployment guides
│   └── api/                 # API documentation
├── tests/                   # Test files
├── samples/                 # Sample audio files for testing
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Modern Python project configuration
└── README.md                # This file
- Python 3.8+
- Google Cloud Platform account
- Modal account and API key
- FFmpeg (for audio processing)
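Before installing, it can help to confirm the local prerequisites. The following is a minimal sketch that checks the Python version and looks for FFmpeg on the `PATH`; the `check_prerequisites` function is hypothetical and not part of this repository. It does not check the GCP or Modal accounts, which cannot be verified locally without credentials.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8), tools=("ffmpeg",)):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper for illustration only.
    """
    problems = []

    # sys.version_info compares element-wise against a plain tuple.
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {wanted}+ required, found {found}")

    # shutil.which returns None when the executable is not on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```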
# Clone the repository
git clone <your-repo-url>
cd gradio-modal-gcp-speech-ui
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Follow the detailed setup guide in docs/setup/gcp-setup.md to:
- Create a GCP project
- Enable Speech-to-Text API
- Create service account credentials
- Set up authentication
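Once the service account key is downloaded, a quick sanity check can catch a wrong path or the wrong kind of file early. This sketch assumes the key is referenced through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, which is the standard variable read by Google's client libraries; whether this project's `.env` uses the same name is an assumption, and `check_gcp_credentials` is a hypothetical helper. It only inspects the file locally and does not contact GCP.

```python
import json
import os


def check_gcp_credentials(env=os.environ):
    """Return the key file path if it looks like a service account key.

    Hypothetical helper: validates the file shape only, not the
    credentials themselves.
    """
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Credentials file not found: {path}")

    with open(path, encoding="utf-8") as fh:
        key = json.load(fh)

    # Service account key files carry a top-level "type" field.
    if key.get("type") != "service_account":
        raise ValueError("File is not a service account key")
    return path
```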
# Install Modal CLI and authenticate
pip install modal
modal token new

# Copy environment template
cp .env.example .env
# Edit .env with your credentials
# - GCP service account key path
# - Modal API key
# - Other configuration options

# Quick start with automated setup
./setup.sh
# Or manual setup:
# 1. Create virtual environment
python -m venv venv
source venv/bin/activate
# 2. Install dependencies
make install
# 3. Run the application
make run
# Application will be available at http://127.0.0.1:7860
# For development mode with auto-reload
make dev

- Upload Audio Files: Test with WAV, MP3, M4A, OGG, FLAC, AAC, or WMA files
- File Validation: Try invalid formats or large files (>100MB) to see error handling
- Audio Preview: Use the built-in player to preview uploaded audio
- File Information: View detailed file metadata and processing status
# Configure environment
cp .env.example .env
# Edit .env with your GCP credentials
# Deploy to Modal (Step 4)
modal deploy src/modal_functions/speech_processor.py
# Deploy to HuggingFace Spaces (Step 4)
# See docs/setup/huggingface-deployment.md

- WAV (recommended for best quality)
- MP3 (most common format)
- M4A (Apple audio format)
- OGG (open source format)
- FLAC (lossless compression)
- AAC (Advanced Audio Coding)
- WMA (Windows Media Audio)
File Size Limit: 100MB per file
Validation: Automatic format and size checking with detailed error messages
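The format and size checks described above can be sketched as follows. This is a minimal illustration, not the project's validator: `validate_audio_file` and the constant names are hypothetical, the check looks only at the file extension (not the actual audio encoding), and it assumes "100MB" means 100 × 1024 × 1024 bytes.

```python
import os

# The seven extensions listed in this README.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".aac", ".wma"}

# Assumption: "100MB" is interpreted as 100 MiB.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024


def validate_audio_file(path):
    """Return (ok, message) for an uploaded file path.

    Hypothetical helper for illustration only.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        label = ext if ext else "(none)"
        supported = ", ".join(sorted(ALLOWED_EXTENSIONS))
        return False, f"Unsupported format {label}. Supported: {supported}"

    if not os.path.isfile(path):
        return False, "File not found"

    size = os.path.getsize(path)
    if size == 0:
        return False, "File is empty"
    if size > MAX_FILE_SIZE_BYTES:
        size_mb = size / (1024 * 1024)
        return False, f"File is {size_mb:.1f}MB; the limit is 100MB"

    return True, "OK"
```

Returning a message alongside the boolean lets the UI show the reason for a rejection directly, in line with the detailed error messages mentioned above.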
- Project structure and build system
- Comprehensive configuration management
- Basic Gradio interface with audio upload
- File validation and error handling
- Audio preview and file information display
- Enhanced UI/UX with professional feedback
- GCP client package structure (src/gcp_client/)
- Comprehensive audio analysis with librosa integration
- Audio format validation for all 7 accepted upload formats (WAV, MP3, M4A, OGG, FLAC, AAC, WMA)
- Detailed audio metrics extraction (duration, sample rate, channels)
- Robust error handling with user-friendly messages
- AudioProcessor with graceful loading fallbacks
- SpeechToTextClient authentication foundation
- Complete GCP Speech-to-Text client implementation
- Audio quality validation against GCP requirements
- Synchronous and asynchronous transcription methods
- Modal backend integration and deployment
- HuggingFace Spaces deployment
- Language selection interface (50+ languages)
- Audio duration detection with librosa
- Batch processing capabilities
- Transcription export (TXT, JSON, SRT)
- HuggingFace ASR model comparison
- LLM-powered transcription post-processing
- Real-time streaming transcription
- Custom vocabulary support
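Of the export formats in the list above (TXT, JSON, SRT), SRT has the strictest layout: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the text, and a blank line between cues. The sketch below shows one way to produce it. The `(start_seconds, end_seconds, text)` segment tuples and both function names are hypothetical; how the project groups transcription results into timed segments is not covered here.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start_s, end_s, text) tuples as SRT cues.

    Hypothetical helper: the segment shape is an assumption made
    for this example.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text.strip()}\n")
    # Joining with a newline leaves one blank line between cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0, 1.5, "Hello"), (1.5, 3, "world")]))
```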
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Check the documentation
- Report issues on GitHub Issues
- Join discussions in GitHub Discussions
#gradio #modal #speech-to-text #google-cloud #asr #ai #serverless #python