STT Whisper
Overview
The STT Whisper LOP provides real-time speech-to-text transcription using OpenAI’s Whisper model through the faster-whisper library. This operator supports multiple languages, various model sizes, Voice Activity Detection (VAD) filtering, and both streaming and push-to-talk operating modes.
Requirements
- Python Environment: Requires a configured Python virtual environment with faster-whisper installed
- ChatTD Component: Must be configured with a valid Python virtual environment path
- Audio Input: Expects 16kHz float32 audio data via the ReceiveAudioChunk() method
- Optional: CUDA-compatible GPU for accelerated processing

Input

- Audio Stream: Receives audio chunks via the ReceiveAudioChunk() method (16kHz, float32)
Output
- Transcription DAT: Real-time transcription text output
- Segments DAT: Detailed segment information with timestamps and text
- Status Information: Engine status and transcription state
Parameters
Faster Whisper

- op('stt_whisper').par.Status (str): Current operational status of the transcription system. Default: None
- op('stt_whisper').par.Active (toggle): Enable/disable transcription processing. Default: false
- op('stt_whisper').par.Enginestatus (str): Current status of the Whisper engine. Default: None
- op('stt_whisper').par.Initialize (pulse): Initialize the Whisper transcription engine. Default: None
- op('stt_whisper').par.Shutdown (pulse): Shut down the Whisper transcription engine. Default: None
- op('stt_whisper').par.Initializeonstart (toggle): Automatically initialize the engine when the operator starts. Default: false
- op('stt_whisper').par.Cleartranscript (pulse): Clear the current transcription history. Default: None
- op('stt_whisper').par.Copytranscript (pulse): Copy the current transcript to the system clipboard. Default: None
- op('stt_whisper').par.Smartchunking (toggle): Use intelligent chunking based on voice activity detection. Default: true
- op('stt_whisper').par.Pausesensitivity (float): Sensitivity for detecting pauses in speech (0 = less sensitive, 1 = more sensitive). Default: 0.1
- op('stt_whisper').par.Maxchunkduration (float): Maximum duration in seconds for audio chunks before forced processing. Default: 8.0
- op('stt_whisper').par.Chunkduration (float): Target duration in seconds for audio chunks in streaming mode. Default: 0.8
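All of the parameters above are scriptable from Python. A minimal sketch, assuming the operator is reachable at the path 'stt_whisper':

```python
# Sketch: read status and drive the transcript utility pulses from a script.
stt = op('stt_whisper')

print(stt.par.Status.eval())        # current operational status
print(stt.par.Enginestatus.eval())  # current engine status

stt.par.Copytranscript.pulse()      # copy the transcript to the clipboard
stt.par.Cleartranscript.pulse()     # then clear the transcription history
```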
VAD / Filter
- op('stt_whisper').par.Phrasestoavoid (str): Comma-separated list of phrases to filter out of the transcription. Default: None
- op('stt_whisper').par.Customspellings (str): Custom prompt to guide spelling and terminology. Default: None
- op('stt_whisper').par.Usevad (toggle): Enable Voice Activity Detection filtering. Default: true
- op('stt_whisper').par.Vadthreshold (float): Threshold for voice activity detection (0 = detect everything, 1 = only clear speech). Default: 0.5
- op('stt_whisper').par.Vadminsilence (int): Minimum silence duration in milliseconds to consider as a pause. Default: 250
- op('stt_whisper').par.Beamsearchsize (int): Beam search size for transcription quality (higher = better quality, slower). Default: 5

About

- op('stt_whisper').par.Bypass (toggle): Bypass the operator. Default: false
- op('stt_whisper').par.Showbuiltin (toggle): Show built-in TouchDesigner parameters. Default: false
- op('stt_whisper').par.Version (str): Current version of the operator. Default: None
- op('stt_whisper').par.Lastupdated (str): Date of last update. Default: None
- op('stt_whisper').par.Creator (str): Operator creator. Default: None
- op('stt_whisper').par.Website (str): Related website or documentation. Default: None
- op('stt_whisper').par.Chattd (op): Reference to the ChatTD operator for configuration. Default: None
Basic Setup
1. Configure Python Environment: Ensure ChatTD is configured with a Python virtual environment that has faster-whisper installed
2. Initialize Engine: Click “Initialize Whisper” or enable “Initialize On Start”
3. Select Model: Choose an appropriate model size based on your accuracy vs. performance needs
4. Choose Language: Select the target language for transcription
5. Enable Transcription: Toggle “Transcription Active” to start processing (a scripted version follows this list)
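The same steps can be scripted; a minimal sketch, assuming the operator path 'stt_whisper' and the parameter names documented above (model and language selection stay on the parameter page):

```python
# Scripted version of the basic setup (sketch, not a definitive recipe).
stt = op('stt_whisper')

stt.par.Initializeonstart = True  # auto-initialize on future project starts
stt.par.Initialize.pulse()        # or initialize the engine right now

print(stt.par.Enginestatus.eval())  # confirm the engine is ready before activating

stt.par.Active = True             # start processing incoming audio
```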
Operating Modes
Stream Mode (Live)

- Continuous real-time transcription
- Processes audio as it arrives
- Uses smart chunking for natural phrase boundaries
- Ideal for live conversations and real-time applications
Push to Talk Mode
- Accumulates audio while “Active” is pressed
- Transcribes the entire buffer when “Active” is released
- Better for discrete speech segments
- Reduces processing overhead for intermittent use (a scripted example follows this list)
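Push-to-talk maps directly onto toggling the Active parameter from an input event. A sketch with hypothetical helper names; wire them to any key or button callback:

```python
# Hypothetical push-to-talk helpers (names are illustrative).
stt = op('stt_whisper')

def ptt_pressed():
    stt.par.Active = True   # hold: audio accumulates in the buffer

def ptt_released():
    stt.par.Active = False  # release: the entire buffer is transcribed
```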
Model Selection Guide
| Model Size | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|
| Tiny | Fastest | Basic | Low | Real-time, low-resource |
| Base | Fast | Good | Medium | Balanced performance |
| Medium | Moderate | Very Good | High | Quality transcription |
| Large v3 | Slow | Excellent | Very High | Maximum accuracy |
Performance Optimization
- GPU Acceleration: Set Device to “CUDA” for NVIDIA GPUs
- Compute Type: Use “FP16” for GPU, “INT8” for CPU optimization
- Chunk Duration: Shorter chunks for lower latency, longer for better accuracy
- VAD Filtering: Enable to reduce processing of non-speech audio (a configuration sketch follows this list)
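A configuration sketch leaning toward low latency. Chunkduration, Usevad, and Beamsearchsize are documented in the parameter reference above; Device and Computetype are assumed names for the model-page settings mentioned here and may differ in your build:

```python
# Low-latency leaning configuration (sketch; verify names against your build).
stt = op('stt_whisper')

stt.par.Chunkduration = 0.5  # shorter chunks: lower latency, less context
stt.par.Usevad = True        # skip non-speech audio entirely
stt.par.Beamsearchsize = 1   # greedy decoding is fastest

# Assumed parameter names for the device settings described above:
# stt.par.Device = 'cuda'          # NVIDIA GPU acceleration
# stt.par.Computetype = 'float16'  # FP16 on GPU; use 'int8' on CPU
```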
Language Support
The operator supports 99 languages including:

- Major Languages: English, Chinese, Spanish, French, German, Russian, Japanese, Korean
- Regional Variants: English-only models available for better English performance
- Automatic Detection: Leave language as “auto” for automatic detection
Advanced Features
Voice Activity Detection (VAD)

- Filters out non-speech audio segments
- Reduces false transcriptions from background noise
- Configurable threshold and silence duration (a tuning sketch follows this list)
- Improves overall transcription quality
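A tuning sketch using the VAD parameters documented above; the values are illustrative starting points for a noisy room, not defaults from the source:

```python
# VAD tuning for a noisy environment (illustrative values).
stt = op('stt_whisper')

stt.par.Usevad = True
stt.par.Vadthreshold = 0.7   # closer to 1: only clear speech passes
stt.par.Vadminsilence = 400  # ms of silence before a pause is declared
```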
Smart Chunking
- Automatically detects natural speech pauses
- Creates chunks at phrase boundaries
- Improves transcription coherence
- Reduces word splitting across chunks (a tuning sketch follows this list)
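Chunking is controlled by the Smartchunking, Pausesensitivity, Chunkduration, and Maxchunkduration parameters documented above. A sketch favoring complete phrases (values illustrative):

```python
# Favor complete phrases over minimum latency (illustrative values).
stt = op('stt_whisper')

stt.par.Smartchunking = True
stt.par.Pausesensitivity = 0.3   # split more readily at natural pauses
stt.par.Maxchunkduration = 10.0  # but never buffer longer than 10 seconds
```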
Custom Filtering
- Phrases to Avoid: Filter out specific unwanted phrases
- Custom Spellings: Guide spelling and terminology
- Beam Search: Adjust quality vs. speed tradeoff (a sketch follows this list)
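These controls map to the Phrasestoavoid, Customspellings, and Beamsearchsize parameters. A sketch; the phrase list and terminology prompt are example values only:

```python
# Filtering and vocabulary sketch (example values).
stt = op('stt_whisper')

stt.par.Phrasestoavoid = 'thank you for watching, subscribe'  # comma-separated
stt.par.Customspellings = 'TouchDesigner, ChatTD, CHOP, DAT'  # guide terminology
stt.par.Beamsearchsize = 5  # raise for quality, lower for speed
```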
Integration Examples
With Audio Input

```python
# Send audio data to the STT operator
import numpy as np

stt_op = op('stt_whisper')
audio_data = np.array(audio_samples, dtype=np.float32)  # 16kHz mono samples
stt_op.ReceiveAudioChunk(audio_data)
```
Reading Transcription
```python
# Get the current transcription
transcription_dat = op('stt_whisper/transcription_out')
current_text = transcription_dat.text

# Get segment information
segments_dat = op('stt_whisper/segments_out')
for row in segments_dat.rows()[1:]:  # Skip header
    start_time, end_time, text = row
    print(f"{start_time}-{end_time}: {text}")
```
Troubleshooting
Common Issues

Engine Won’t Initialize

- Check Python virtual environment configuration in ChatTD
- Ensure faster-whisper is installed in the environment
- Verify model download location (a diagnostic sketch follows this list)

Poor Transcription Quality

- Try a larger model size
- Adjust VAD threshold
- Check audio input quality (16kHz recommended)
- Enable custom spellings for domain-specific terms

High CPU/Memory Usage

- Use a smaller model size
- Enable GPU acceleration
- Adjust chunk duration
- Use INT8 compute type for CPU

Delayed Transcription

- Reduce chunk duration
- Disable smart chunking for immediate processing
- Check system performance and model size
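For initialization problems, the status parameters documented above can be dumped from the textport. A minimal diagnostic sketch:

```python
# Print the operator's self-reported state (diagnostic sketch).
stt = op('stt_whisper')

print('Status:       ', stt.par.Status.eval())
print('Engine status:', stt.par.Enginestatus.eval())
print('Active:       ', stt.par.Active.eval())
print('ChatTD ref:   ', stt.par.Chattd.eval())  # must point at a configured ChatTD
```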
Performance Tips
- GPU Usage: CUDA acceleration can provide a 3-10x speed improvement
- Model Caching: Models are cached after first load for faster subsequent initialization
- Batch Processing: Push-to-talk mode is more efficient for non-continuous use
- Resource Management: Shut down the engine when not in use to free resources (a sketch follows this list)
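The resource-management tip can be scripted: deactivate and shut down the engine whenever transcription is idle. A sketch:

```python
# Free model memory when transcription is not needed (sketch).
stt = op('stt_whisper')

stt.par.Active = False    # stop processing incoming audio
stt.par.Shutdown.pulse()  # unload the engine and free its resources
# Later, stt.par.Initialize.pulse() brings it back; cached models reload faster.
```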
Research & Licensing
OpenAI
OpenAI is a leading AI research organization focused on developing artificial general intelligence (AGI) that benefits humanity. Their research spans natural language processing, computer vision, and speech recognition, with a commitment to open science and responsible AI development.
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is designed to be robust to accents, background noise, and technical language, making it suitable for real-world applications requiring reliable speech-to-text conversion.
Technical Details
- Transformer Architecture: Encoder-decoder model with attention mechanisms
- Multilingual Training: Support for 99 languages with varying degrees of accuracy
- Multiple Model Sizes: From tiny (39M parameters) to large (1550M parameters)
Research Impact
- Democratized Speech Recognition: Open-source model accessible to researchers and developers
- Multilingual Capabilities: Breakthrough in cross-lingual speech recognition
- Industry Adoption: Widely used in production applications worldwide
Citation
```bibtex
@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}
```
Key Research Contributions
- Large-scale weak supervision training on 680,000 hours of multilingual data
- Zero-shot transfer capabilities across languages and domains
- Robust performance approaching human-level accuracy in speech recognition
License
MIT License - This model is freely available for research and commercial use.