STT Whisper
STT Whisper v1.2.1 [ September 2, 2025 ]
- Added CHOP channels for parity across all STT operators
- TCP IPC mode for robust worker communication
- Auto worker reattachment on TouchDesigner restart
- TCP heartbeat system for connection monitoring
- Segments parameter for transcript segmentation
- Menu cleanup and improved parameter organization
Overview
The STT Whisper LOP provides real-time speech-to-text transcription using OpenAI’s Whisper model through the faster-whisper library. This operator supports multiple languages, various model sizes, Voice Activity Detection (VAD) filtering, and both streaming and push-to-talk operating modes.
Requirements
- Python Environment: Requires a configured Python virtual environment with faster-whisper installed
- ChatTD Component: Must be configured with a valid Python virtual environment path
- Audio Input: Receives 16kHz float32 audio chunks via the ReceiveAudioChunk() method (see the sketch below)
- Optional: CUDA-compatible GPU for accelerated processing
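A minimal sketch of feeding audio from a script. It assumes an Audio Device In CHOP named audiodevin1, already resampled to 16kHz, and that ReceiveAudioChunk() accepts a 1-D float32 NumPy array; the exact argument type is not documented here.

```python
import numpy as np

audio = op('audiodevin1')   # assumed CHOP name, resampled to 16kHz
stt = op('stt_whisper')

# numpyArray() returns one row per channel; take channel 0 as mono.
chunk = audio.numpyArray()[0].astype(np.float32)
stt.ReceiveAudioChunk(chunk)
```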
Output
- Transcription DAT: Real-time transcription text output
- Segments DAT: Detailed segment information with timestamps and text
- Status Information: Engine status and transcription state
Parameters
Page: Faster Whisper
op('stt_whisper').par.Status (Str) - Default: None
op('stt_whisper').par.Active (Toggle) - Default: None
op('stt_whisper').par.Copytranscript (Pulse) - Default: None
op('stt_whisper').par.Enginestatus (Str) - Default: None
op('stt_whisper').par.Initialize (Pulse) - Default: None
op('stt_whisper').par.Shutdown (Pulse) - Default: None
op('stt_whisper').par.Initializeonstart (Toggle) - Default: None
op('stt_whisper').par.Segments (Toggle) - Default: None
op('stt_whisper').par.Transcriptionfile (File) - Default: None
op('stt_whisper').par.Smartchunking (Toggle) - Default: None
op('stt_whisper').par.Pausesensitivity (Float) - Default: 0.1, Range: 0 to 1
op('stt_whisper').par.Maxchunkduration (Float) - Default: 8.0, Range: 3 to 15
op('stt_whisper').par.Chunkduration (Float) - Default: 0.8, Range: 0.5 to 5
op('stt_whisper').par.Cleartranscript (Pulse) - Default: None
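These parameters can also be driven from a script; a brief sketch using the names listed above (the unit comments are assumptions inferred from the documented ranges):

```python
stt = op('stt_whisper')
stt.par.Smartchunking = True     # chunk at natural speech pauses
stt.par.Pausesensitivity = 0.2   # documented range 0 to 1
stt.par.Maxchunkduration = 8.0   # documented range 3 to 15 (seconds assumed)
stt.par.Chunkduration = 0.8      # documented range 0.5 to 5 (seconds assumed)
stt.par.Cleartranscript.pulse()  # pulse parameters fire via pulse()
```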
Page: VAD
op('stt_whisper').par.Phrasestoavoid (Str) - Default: None
op('stt_whisper').par.Customspellings (Str) - Default: None
op('stt_whisper').par.Usevad (Toggle) - Default: None
op('stt_whisper').par.Vadthreshold (Float) - Default: 0.5, Range: 0 to 1
op('stt_whisper').par.Vadminsilence (Int) - Default: 250, Range: 50 to 2000
op('stt_whisper').par.Beamsearchsize (Int) - Default: 5, Range: 1 to 20
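A sketch of enabling VAD from a script with the parameter names above; the unit comment on Vadminsilence is an assumption based on the documented range:

```python
stt = op('stt_whisper')
stt.par.Usevad = True        # filter non-speech audio before transcription
stt.par.Vadthreshold = 0.5   # speech probability cutoff, range 0 to 1
stt.par.Vadminsilence = 250  # range 50 to 2000; assumed milliseconds
stt.par.Beamsearchsize = 5   # larger beams trade speed for quality
```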
Page: Install/Settings
op('stt_whisper').par.Installdependencies (Pulse) - Default: None
op('stt_whisper').par.Downloadmodel (Pulse) - Default: None
op('stt_whisper').par.Downloadprogress (Float) - Default: None
op('stt_whisper').par.Monitorworkerlogs (Toggle) - Default: None
op('stt_whisper').par.Autoreattachoninit (Toggle) - Default: None
op('stt_whisper').par.Forceattachoninit (Toggle) - Default: None
Page: About
op('stt_whisper').par.Bypass (Toggle) - Default: None
op('stt_whisper').par.Showbuiltin (Toggle) - Default: None
op('stt_whisper').par.Showicon (Toggle) - Default: None
op('stt_whisper').par.Version (Str) - Default: None
op('stt_whisper').par.Lastupdated (Str) - Default: None
op('stt_whisper').par.Creator (Str) - Default: None
op('stt_whisper').par.Website (Str) - Default: None
op('stt_whisper').par.Chattd (OP) - Default: None
Basic Setup
- Configure Python Environment: Ensure ChatTD is configured with a Python virtual environment that has faster-whisper installed
- Initialize Engine: Click “Initialize Whisper” or enable “Initialize On Start”
- Select Model: Choose an appropriate model size based on your accuracy vs. performance needs
- Choose Language: Select the target language for transcription
- Enable Transcription: Toggle “Transcription Active” to start processing (a scripted equivalent follows)
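The same setup can be scripted. A minimal sketch using the Faster Whisper page parameters; the model and language parameters are not in the reference above, so they are omitted here:

```python
stt = op('stt_whisper')
stt.par.Initializeonstart = True  # start the engine with the project
stt.par.Initialize.pulse()        # or initialize on demand
stt.par.Active = True             # begin processing incoming audio
```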
Operating Modes
Stream Mode (Live)
- Continuous real-time transcription
- Processes audio as it arrives
- Uses smart chunking for natural phrase boundaries
- Ideal for live conversations and real-time applications
Push to Talk Mode
- Accumulates audio while “Active” is pressed
- Transcribes the entire buffer when “Active” is released
- Better for discrete speech segments
- Reduces processing overhead for intermittent use (see the sketch below)
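A push-to-talk binding could look like the following sketch, assuming a Button COMP monitored by a Panel Execute DAT:

```python
# Panel Execute DAT callbacks on a button COMP (assumed wiring):
# hold to accumulate audio, release to transcribe the buffer.

def onOffToOn(panelValue):
    op('stt_whisper').par.Active = True    # start accumulating
    return

def onOnToOff(panelValue):
    op('stt_whisper').par.Active = False   # release triggers transcription
    return
```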
Model Selection Guide
| Model Size | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|
| Tiny | Fastest | Basic | Low | Real-time, low-resource |
| Base | Fast | Good | Medium | Balanced performance |
| Medium | Moderate | Very Good | High | Quality transcription |
| Large v3 | Slow | Excellent | Very High | Maximum accuracy |
Performance Optimization
- GPU Acceleration: Set Device to “CUDA” for NVIDIA GPUs
- Compute Type: Use “FP16” for GPU, “INT8” for CPU optimization
- Chunk Duration: Shorter chunks for lower latency, longer for better accuracy
- VAD Filtering: Enable to reduce processing of non-speech audio (see the sketch below)
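A scripted version of these settings; note that Device and Computetype are assumed internal parameter names, as they do not appear in the parameter reference above:

```python
stt = op('stt_whisper')
stt.par.Device = 'cuda'          # assumed name; GPU inference on NVIDIA
stt.par.Computetype = 'float16'  # assumed name; FP16 on GPU, INT8 on CPU
stt.par.Chunkduration = 0.6      # shorter chunks lower latency
stt.par.Usevad = True            # skip non-speech audio
```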
Language Support
The operator supports 99 languages including:
- Major Languages: English, Chinese, Spanish, French, German, Russian, Japanese, Korean
- Regional Variants: English-only models available for better English performance
- Automatic Detection: Leave language as “auto” for automatic detection
Advanced Features
Voice Activity Detection (VAD)
- Filters out non-speech audio segments
- Reduces false transcriptions from background noise
- Configurable threshold and silence duration
- Improves overall transcription quality
Smart Chunking
- Automatically detects natural speech pauses
- Creates chunks at phrase boundaries
- Improves transcription coherence
- Reduces word splitting across chunks
Custom Filtering
- Phrases to Avoid: Filter out specific unwanted phrases
- Custom Spellings: Guide pronunciation and terminology
- Beam Search: Adjust the quality vs. speed tradeoff (see the sketch below)
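Both filters are plain string parameters; in this sketch the comma-separated multi-entry format is an assumption:

```python
stt = op('stt_whisper')
# Suppress phrases Whisper tends to hallucinate during silence
# (comma-separated format assumed):
stt.par.Phrasestoavoid = 'thanks for watching, please subscribe'
# Steer the decoder toward domain-specific terms (format assumed):
stt.par.Customspellings = 'TouchDesigner, ChatTD'
stt.par.Beamsearchsize = 8  # documented range 1 to 20
```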
Integration Examples
With Audio Input
- Connect an Audio Device In CHOP to the input of the stt_whisper operator.
- Set the Mode to Stream (Live).
- Toggle Transcription Active to On (a scripted version follows).
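The same wiring can be done in a script; a sketch assuming the Audio Device In CHOP is named audiodevin1 (the Mode parameter’s internal name is not listed above, so it is left to the UI):

```python
audio = op('audiodevin1')  # assumed operator name
stt = op('stt_whisper')

# Connect the CHOP to the operator's first input, then start transcribing.
stt.inputConnectors[0].connect(audio)
stt.par.Active = True
```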
Reading Transcription
- View the full transcript in the transcription_out DAT.
- View the segmented transcript in the segments_out DAT (a reading sketch follows).
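Reading the outputs from a script; this sketch assumes both DATs live inside the component and that segments_out is a table with a header row (the column layout is an assumption):

```python
stt = op('stt_whisper')

transcript = stt.op('transcription_out').text  # full running transcript
print(transcript)

segments = stt.op('segments_out')              # table DAT of segments
for r in range(1, segments.numRows):           # skip the assumed header row
    print(segments[r, 0], segments[r, 1])      # e.g. timestamp and text
```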
Troubleshooting
Common Issues
Engine Won’t Initialize
- Check Python virtual environment configuration in ChatTD
- Ensure faster-whisper is installed in the environment
- Verify model download location
Poor Transcription Quality
- Try a larger model size
- Adjust VAD threshold
- Check audio input quality (16kHz recommended)
- Enable custom spellings for domain-specific terms
High CPU/Memory Usage
- Use smaller model size
- Enable GPU acceleration
- Adjust chunk duration
- Use INT8 compute type for CPU
Delayed Transcription
- Reduce chunk duration
- Disable smart chunking for immediate processing
- Check system performance and model size
Performance Tips
- GPU Usage: CUDA acceleration can provide a 3-10x speed improvement
- Model Caching: Models are cached after the first load for faster subsequent initialization
- Batch Processing: Push-to-talk mode is more efficient for non-continuous use
- Resource Management: Shut down the engine when not in use to free resources (see the sketch below)
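A small sketch of the resource-management tip, shutting the engine down when transcription is inactive:

```python
stt = op('stt_whisper')
if not stt.par.Active.eval():   # nothing is being transcribed
    stt.par.Shutdown.pulse()    # free the model and worker resources
```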
Research & Licensing
OpenAI
OpenAI is a leading AI research organization focused on developing artificial general intelligence (AGI) that benefits humanity. Their research spans natural language processing, computer vision, and speech recognition, with a commitment to open science and responsible AI development.
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is designed to be robust to accents, background noise, and technical language, making it suitable for real-world applications requiring reliable speech-to-text conversion.
Technical Details
- Transformer Architecture: Encoder-decoder model with attention mechanisms
- Multilingual Training: Support for 99 languages with varying degrees of accuracy
- Multiple Model Sizes: From tiny (39M parameters) to large (1550M parameters)
Research Impact
- Democratized Speech Recognition: Open-source model accessible to researchers and developers
- Multilingual Capabilities: Breakthrough in cross-lingual speech recognition
- Industry Adoption: Widely used in production applications worldwide
Citation
@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}
Key Research Contributions
- Large-scale weak supervision training on 680,000 hours of multilingual data
- Zero-shot transfer capabilities across languages and domains
- Robust performance approaching human-level accuracy in speech recognition
License
MIT License - This model is freely available for research and commercial use.