STT Whisper
STT Whisper v1.2.1 [ September 2, 2025 ]
- Added CHOP channels for parity across all STT operators
- TCP IPC mode for robust worker communication
- Auto worker reattachment on TouchDesigner restart
- TCP heartbeat system for connection monitoring
- Segments parameter for transcript segmentation
- Menu cleanup and improved parameter organization
Overview
The STT Whisper LOP provides real-time speech-to-text transcription using OpenAI’s Whisper model through the faster-whisper library. This operator supports multiple languages, various model sizes, Voice Activity Detection (VAD) filtering, and both streaming and push-to-talk operating modes.
Requirements
- Python Environment: Requires a configured Python virtual environment with faster-whisper installed
- ChatTD Component: Must be configured with a valid Python virtual environment path
- Audio Input: Expects 16kHz float32 audio chunks via the ReceiveAudioChunk() method
- Optional: CUDA-compatible GPU for accelerated processing
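As a sketch of that audio path: the snippet below, placed in an Execute DAT, forwards each frame of samples from an Audio Device In CHOP. The CHOP name audiodevin1, the mono downmix, and the assumption that ReceiveAudioChunk() accepts a 1-D float32 NumPy array are illustrative, not confirmed API details.
```python
# Sketch: feeding 16kHz audio to the operator once per frame (Execute DAT).
# Assumes ReceiveAudioChunk() accepts a 1-D float32 NumPy array.
import numpy as np

def onFrameStart(frame):
    audio = op('audiodevin1')                # Audio Device In CHOP at 16kHz
    samples = audio.numpyArray()             # shape: (channels, samples)
    mono = samples.mean(axis=0).astype(np.float32)  # downmix to mono
    op('stt_whisper').ReceiveAudioChunk(mono)
    return
```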
Output
- Transcription DAT: Real-time transcription text output
- Segments DAT: Detailed segment information with timestamps and text
- Status Information: Engine status and transcription state
Parameters
Page: Faster Whisper
| Parameter | Type | Default | Range |
|---|---|---|---|
| op('stt_whisper').par.Status | Str | None | |
| op('stt_whisper').par.Active | Toggle | None | |
| op('stt_whisper').par.Copytranscript | Pulse | None | |
| op('stt_whisper').par.Enginestatus | Str | None | |
| op('stt_whisper').par.Initialize | Pulse | None | |
| op('stt_whisper').par.Shutdown | Pulse | None | |
| op('stt_whisper').par.Initializeonstart | Toggle | None | |
| op('stt_whisper').par.Segments | Toggle | None | |
| op('stt_whisper').par.Transcriptionfile | File | None | |
| op('stt_whisper').par.Smartchunking | Toggle | None | |
| op('stt_whisper').par.Pausesensitivity | Float | 0.1 | 0 to 1 |
| op('stt_whisper').par.Maxchunkduration | Float | 8.0 | 3 to 15 |
| op('stt_whisper').par.Chunkduration | Float | 0.8 | 0.5 to 5 |
| op('stt_whisper').par.Cleartranscript | Pulse | None | |
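These parameters can also be driven from scripts. A minimal sketch, assuming only the names in the table above:
```python
# Sketch: configuring chunking behavior from Python
w = op('stt_whisper')
w.par.Smartchunking = True      # chunk at natural speech pauses
w.par.Pausesensitivity = 0.2    # 0 to 1 (effect of higher values assumed, check the tooltip)
w.par.Maxchunkduration = 10.0   # seconds, within the 3 to 15 range
w.par.Cleartranscript.pulse()   # Pulse parameters are fired with .pulse()
```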
Page: VAD
| Parameter | Type | Default | Range |
|---|---|---|---|
| op('stt_whisper').par.Phrasestoavoid | Str | None | |
| op('stt_whisper').par.Customspellings | Str | None | |
| op('stt_whisper').par.Usevad | Toggle | None | |
| op('stt_whisper').par.Vadthreshold | Float | 0.5 | 0 to 1 |
| op('stt_whisper').par.Vadminsilence | Int | 250 | 50 to 2000 |
| op('stt_whisper').par.Beamsearchsize | Int | 5 | 1 to 20 |
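A similar sketch for this page (names from the table above; what each value does is covered under Advanced Features below):
```python
# Sketch: enabling and tuning VAD from Python
w = op('stt_whisper')
w.par.Usevad = True
w.par.Vadthreshold = 0.6     # 0 to 1; higher rejects more background noise
w.par.Vadminsilence = 500    # milliseconds of silence, within 50 to 2000
w.par.Beamsearchsize = 5     # larger beams trade speed for quality
```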
Page: Install/Settings
| Parameter | Type | Default |
|---|---|---|
| op('stt_whisper').par.Installdependencies | Pulse | None |
| op('stt_whisper').par.Downloadmodel | Pulse | None |
| op('stt_whisper').par.Downloadprogress | Float | None |
| op('stt_whisper').par.Monitorworkerlogs | Toggle | None |
| op('stt_whisper').par.Autoreattachoninit | Toggle | None |
| op('stt_whisper').par.Forceattachoninit | Toggle | None |
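A sketch of first-run setup using these parameters (the 0-to-1 progress range is an assumption):
```python
# Sketch: one-time install and model download
w = op('stt_whisper')
w.par.Installdependencies.pulse()      # install faster-whisper into the venv
w.par.Downloadmodel.pulse()            # fetch the selected model
print(w.par.Downloadprogress.eval())   # progress readout (assumed 0..1)
```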
Page: About
| Parameter | Type | Default |
|---|---|---|
| op('stt_whisper').par.Bypass | Toggle | None |
| op('stt_whisper').par.Showbuiltin | Toggle | None |
| op('stt_whisper').par.Showicon | Toggle | None |
| op('stt_whisper').par.Version | Str | None |
| op('stt_whisper').par.Lastupdated | Str | None |
| op('stt_whisper').par.Creator | Str | None |
| op('stt_whisper').par.Website | Str | None |
| op('stt_whisper').par.Chattd | OP | None |
Basic Setup
- Configure Python Environment: Ensure ChatTD is configured with a Python virtual environment that has faster-whisper installed
- Initialize Engine: Click “Initialize Whisper” or enable “Initialize On Start”
- Select Model: Choose the appropriate model size based on your accuracy vs. performance needs
- Choose Language: Select the target language for transcription
- Enable Transcription: Toggle “Transcription Active” to start processing
Operating Modes
Stream Mode (Live)
- Continuous real-time transcription
- Processes audio as it arrives
- Uses smart chunking for natural phrase boundaries
- Ideal for live conversations and real-time applications
Push to Talk Mode
- Accumulates audio while “Active” is pressed
- Transcribes the entire buffer when “Active” is released
- Better for discrete speech segments
- Reduces processing overhead for intermittent use
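For example, a momentary button channel can drive the Active toggle through a CHOP Execute DAT; a minimal sketch:
```python
# CHOP Execute DAT: hold-to-talk from a momentary button channel.
# Audio accumulates while val is 1 and is transcribed when it drops to 0.
def onValueChange(channel, sampleIndex, val, prev):
    op('stt_whisper').par.Active = bool(val)
    return
```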
Model Selection Guide
| Model Size | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|
| Tiny | Fastest | Basic | Low | Real-time, low-resource |
| Base | Fast | Good | Medium | Balanced performance |
| Medium | Moderate | Very Good | High | Quality transcription |
| Large v3 | Slow | Excellent | Very High | Maximum accuracy |
Performance Optimization
- GPU Acceleration: Set Device to “CUDA” for NVIDIA GPUs
- Compute Type: Use “FP16” for GPU, “INT8” for CPU optimization
- Chunk Duration: Shorter chunks for lower latency, longer for better accuracy
- VAD Filtering: Enable to reduce processing of non-speech audio
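For reference, these settings mirror options in the underlying faster-whisper library; a sketch of the library-level equivalents (not the operator’s internals):
```python
# faster-whisper equivalents of the Device / Compute Type settings
from faster_whisper import WhisperModel

gpu_model = WhisperModel("base", device="cuda", compute_type="float16")  # FP16 on GPU
cpu_model = WhisperModel("base", device="cpu", compute_type="int8")      # INT8 on CPU
segments, info = gpu_model.transcribe("speech.wav", vad_filter=True)
```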
Language Support
The operator supports 99 languages, including:
- Major Languages: English, Chinese, Spanish, French, German, Russian, Japanese, Korean
- English-Only Variants: English-only models (e.g. base.en) are available for better English performance
- Automatic Detection: Leave the language set to “auto” for automatic detection
Advanced Features
Voice Activity Detection (VAD)
- Filters out non-speech audio segments
- Reduces false transcriptions from background noise
- Configurable threshold and silence duration
- Improves overall transcription quality
Smart Chunking
- Automatically detects natural speech pauses
- Creates chunks at phrase boundaries
- Improves transcription coherence
- Reduces word splitting across chunks
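As an illustration of the idea only (not the operator’s actual implementation), a chunker can scan short RMS windows and cut at the first quiet stretch:
```python
# Illustrative pause detection for chunk boundaries (thresholds assumed)
import numpy as np

def first_pause(chunk, sr=16000, win_s=0.03, rms_thresh=0.01):
    """Return the sample index of the first low-energy window, or None."""
    n = int(sr * win_s)
    for start in range(0, len(chunk) - n, n):
        window = chunk[start:start + n]
        if np.sqrt(np.mean(window ** 2)) < rms_thresh:
            return start
    return None
```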
Custom Filtering
- Phrases to Avoid: Filter out specific unwanted phrases
- Custom Spellings: Guide pronunciation and terminology
- Beam Search: Adjust quality vs. speed tradeoff
Integration Examples
With Audio Input
- Connect an Audio Device In CHOP to the input of the stt_whisper operator.
- Set the Mode to Stream (Live).
- Toggle Transcription Active to On.
Reading Transcription
- View the full transcript in the transcription_out DAT.
- View the segmented transcript in the segments_out DAT.
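A sketch of reading both DATs from a script (the DAT paths and the segments_out column layout are assumptions; check your network and the header row):
```python
# Sketch: pulling transcript text into Python
transcript = op('stt_whisper/transcription_out').text
print(transcript)

segs = op('stt_whisper/segments_out')
for row in segs.rows()[1:]:   # skip header; columns assumed: start, end, text
    start, end, text = row[0].val, row[1].val, row[2].val
    print(f'[{start}-{end}] {text}')
```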
Troubleshooting
Common Issues
Engine Won’t Initialize
- Check Python virtual environment configuration in ChatTD
- Ensure faster-whisper is installed in the environment
- Verify the model download location
Poor Transcription Quality
- Try a larger model size
- Adjust the VAD threshold
- Check audio input quality (16kHz recommended)
- Enable custom spellings for domain-specific terms
High CPU/Memory Usage
- Use a smaller model size
- Enable GPU acceleration
- Adjust the chunk duration
- Use the INT8 compute type for CPU
Delayed Transcription
- Reduce the chunk duration
- Disable smart chunking for immediate processing
- Check system performance and model size
Performance Tips
- GPU Usage: CUDA acceleration can provide 3-10x speed improvement
- Model Caching: Models are cached after first load for faster subsequent initialization
- Batch Processing: Push-to-talk mode is more efficient for non-continuous use
- Resource Management: Shutdown the engine when not in use to free resources
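For the last tip, one approach is a Timer CHOP that pulses Shutdown after a stretch of inactivity; a sketch using the standard Timer callback:
```python
# Timer Callbacks DAT: free the engine when an inactivity timer completes
def onDone(timerOp, segment, interrupt):
    op('stt_whisper').par.Shutdown.pulse()
    return
```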
Research & Licensing
OpenAI
OpenAI is a leading AI research organization focused on developing artificial general intelligence (AGI) that benefits humanity. Their research spans natural language processing, computer vision, and speech recognition, with a commitment to open science and responsible AI development.
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is designed to be robust to accents, background noise, and technical language, making it suitable for real-world applications requiring reliable speech-to-text conversion.
Technical Details
- Transformer Architecture: Encoder-decoder model with attention mechanisms
- Multilingual Training: Support for 99 languages with varying degrees of accuracy
- Multiple Model Sizes: From tiny (39M parameters) to large (1550M parameters)
Research Impact
- Democratized Speech Recognition: Open-source model accessible to researchers and developers
- Multilingual Capabilities: Breakthrough in cross-lingual speech recognition
- Industry Adoption: Widely used in production applications worldwide
Citation
@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}
Key Research Contributions
- Large-scale weak supervision training on 680,000 hours of multilingual data
- Zero-shot transfer capabilities across languages and domains
- Robust performance approaching human-level accuracy in speech recognition
License
MIT License - This model is freely available for research and commercial use.