STT Whisper

Recent Updates

  • Added CHOP channels for parity across all STT operators
  • TCP IPC mode for robust worker communication
  • Auto worker reattachment on TouchDesigner restart
  • TCP heartbeat system for connection monitoring
  • Segments parameter for transcript segmentation
  • Menu cleanup and improved parameter organization

The STT Whisper LOP provides real-time speech-to-text transcription using OpenAI’s Whisper model through the faster-whisper library. This operator supports multiple languages, various model sizes, Voice Activity Detection (VAD) filtering, and both streaming and push-to-talk operating modes.

Requirements

  • Python Environment: Requires a configured Python virtual environment with faster-whisper installed
  • ChatTD Component: Must be configured with a valid Python virtual environment path
  • Audio Input: Expects 16kHz float32 audio data via the ReceiveAudioChunk() method
  • Optional: CUDA-compatible GPU for accelerated processing
Inputs

  • Audio Stream: Receives audio chunks via the ReceiveAudioChunk() method (16kHz, float32); see the feeding sketch after the outputs list below

Outputs

  • Transcription DAT: Real-time transcription text output
  • Segments DAT: Detailed segment information with timestamps and text
  • Status Information: Engine status and transcription state
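
For programmatic feeding, audio can be pushed from Python. A minimal sketch: ReceiveAudioChunk() is named in this document, but the float32 NumPy array argument and the 'audiodevin1' source CHOP are assumptions.

# Feed one chunk of samples to the operator, e.g. from an Execute DAT
# that runs every frame. 'audiodevin1' is a hypothetical 16 kHz mono
# Audio Device In CHOP; the float32 array argument type is an assumption.
import numpy as np

chunk = op('audiodevin1')[0].numpyArray().astype(np.float32)  # channel 0 samples
op('stt_whisper').ReceiveAudioChunk(chunk)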
Parameters

Parameter | Python | Type | Default | Range
Status (Status) | op('stt_whisper').par.Status | Str | None | -
Transcription Active (Active) | op('stt_whisper').par.Active | Toggle | None | -
Copy Transcript to Clipboard (Copytranscript) | op('stt_whisper').par.Copytranscript | Pulse | None | -
Whisper Status (Enginestatus) | op('stt_whisper').par.Enginestatus | Str | None | -
Initialize Whisper (Initialize) | op('stt_whisper').par.Initialize | Pulse | None | -
Shutdown Whisper (Shutdown) | op('stt_whisper').par.Shutdown | Pulse | None | -
Initialize On Start (Initializeonstart) | op('stt_whisper').par.Initializeonstart | Toggle | None | -
Operating Mode (Mode) | op('stt_whisper').par.Mode | Menu | Pushtotalk | -
Output Segments (out1) (Segments) | op('stt_whisper').par.Segments | Toggle | None | -
Model Size (Modelsize) | op('stt_whisper').par.Modelsize | Menu | tiny.en | -
Language (Language) | op('stt_whisper').par.Language | StrMenu | en | -
Device (Device) | op('stt_whisper').par.Device | Menu | auto | -
Compute Type (Computetype) | op('stt_whisper').par.Computetype | Menu | default | -
Transcription File (Transcriptionfile) | op('stt_whisper').par.Transcriptionfile | File | None | -
Smart VAD Chunking (Smartchunking) | op('stt_whisper').par.Smartchunking | Toggle | None | -
Pause Sensitivity (Pausesensitivity) | op('stt_whisper').par.Pausesensitivity | Float | 0.1 | 0 to 1
Max Chunk Duration (sec) (Maxchunkduration) | op('stt_whisper').par.Maxchunkduration | Float | 8.0 | 3 to 15
Chunk Duration (sec) (Chunkduration) | op('stt_whisper').par.Chunkduration | Float | 0.8 | 0.5 to 5
Clear Transcript (Cleartranscript) | op('stt_whisper').par.Cleartranscript | Pulse | None | -
Phrases to Avoid (Phrasestoavoid) | op('stt_whisper').par.Phrasestoavoid | Str | None | -
Custom Spellings (Prompt) (Customspellings) | op('stt_whisper').par.Customspellings | Str | None | -
Use VAD Filter (Usevad) | op('stt_whisper').par.Usevad | Toggle | None | -
VAD Threshold (Vadthreshold) | op('stt_whisper').par.Vadthreshold | Float | 0.5 | 0 to 1
VAD Min Silence (ms) (Vadminsilence) | op('stt_whisper').par.Vadminsilence | Int | 250 | 50 to 2000
Beam Search Size (Beamsearchsize) | op('stt_whisper').par.Beamsearchsize | Int | 5 | 1 to 20
Install Dependencies (Installdependencies) | op('stt_whisper').par.Installdependencies | Pulse | None | -
Download Model (Downloadmodel) | op('stt_whisper').par.Downloadmodel | Pulse | None | -
Download Progress (Downloadprogress) | op('stt_whisper').par.Downloadprogress | Float | None | -
Worker Logging Level (Workerlogging) | op('stt_whisper').par.Workerlogging | Menu | OFF | -

Worker Connection Settings

IPC Mode (Ipcmode) | op('stt_whisper').par.Ipcmode | Menu | tcp | -
Monitor Worker Logs (stderr) (Monitorworkerlogs) | op('stt_whisper').par.Monitorworkerlogs | Toggle | None | -
Auto Reattach On Init (Autoreattachoninit) | op('stt_whisper').par.Autoreattachoninit | Toggle | None | -
Force Attach (Skip PID Check) (Forceattachoninit) | op('stt_whisper').par.Forceattachoninit | Toggle | None | -
Bypass (Bypass) | op('stt_whisper').par.Bypass | Toggle | None | -
Show Built-in Parameters (Showbuiltin) | op('stt_whisper').par.Showbuiltin | Toggle | None | -
Show Icon (Showicon) | op('stt_whisper').par.Showicon | Toggle | None | -
Version (Version) | op('stt_whisper').par.Version | Str | None | -
Last Updated (Lastupdated) | op('stt_whisper').par.Lastupdated | Str | None | -
Creator (Creator) | op('stt_whisper').par.Creator | Str | None | -
Website (Website) | op('stt_whisper').par.Website | Str | None | -
ChatTD Operator (Chattd) | op('stt_whisper').par.Chattd | OP | None | -
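
All Pulse parameters above can be driven from Python with .pulse(). For example, a scripted model download with a progress poll; whether Downloadprogress reports a 0 to 1 value is an assumption.

stt = op('stt_whisper')
stt.par.Downloadmodel.pulse()            # same as clicking "Download Model"
print(stt.par.Downloadprogress.eval())   # poll later; 0 to 1 range assumed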
Quick Start

  1. Configure Python Environment: Ensure ChatTD is configured with a Python virtual environment that has faster-whisper installed
  2. Initialize Engine: Click “Initialize Whisper” or enable “Initialize On Start”
  3. Select Model: Choose an appropriate model size for your accuracy vs. performance needs
  4. Choose Language: Select the target language for transcription
  5. Enable Transcription: Toggle “Transcription Active” to start processing (a scripted version of these steps is sketched below)
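
The same steps can be scripted. A minimal sketch using the parameter names from the table above, run from a Text DAT after ChatTD is configured:

stt = op('stt_whisper')
stt.par.Modelsize = 'tiny.en'   # pick a model size from the menu
stt.par.Language = 'en'         # target language
stt.par.Initialize.pulse()      # same as clicking "Initialize Whisper"
stt.par.Active = True           # start processing audio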
Operating Modes

Stream (Live) Mode

  • Continuous real-time transcription
  • Processes audio as it arrives
  • Uses smart chunking for natural phrase boundaries
  • Ideal for live conversations and real-time applications
Push-to-Talk Mode

  • Accumulates audio while “Active” is pressed
  • Transcribes the entire buffer when “Active” is released
  • Better for discrete speech segments
  • Reduces processing overhead for intermittent use (see the callback sketch below)
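
Push-to-talk pairs naturally with a key or controller button. A minimal sketch for a CHOP Execute DAT attached to a Keyboard In CHOP watching a single key channel; the CHOP setup itself is assumed:

def onOffToOn(channel, sampleIndex, val, prev):
    # Key pressed: start accumulating audio.
    op('stt_whisper').par.Active = True
    return

def onOnToOff(channel, sampleIndex, val, prev):
    # Key released: transcribe the accumulated buffer.
    op('stt_whisper').par.Active = False
    return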
Model Selection

Model Size | Speed | Accuracy | Memory | Use Case
Tiny | Fastest | Basic | Low | Real-time, low-resource
Base | Fast | Good | Medium | Balanced performance
Medium | Moderate | Very Good | High | Quality transcription
Large v3 | Slow | Excellent | Very High | Maximum accuracy
Performance Tuning

  • GPU Acceleration: Set Device to “CUDA” for NVIDIA GPUs
  • Compute Type: Use “FP16” for GPU, “INT8” for CPU optimization
  • Chunk Duration: Shorter chunks for lower latency, longer for better accuracy
  • VAD Filtering: Enable to reduce processing of non-speech audio (a combined sketch follows this list)
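
A combined sketch of these settings; the exact menu tokens for Device and Compute Type (e.g. 'cuda', 'float16', 'int8') follow faster-whisper's conventions and are assumptions, so check the operator's actual menu entries:

stt = op('stt_whisper')
stt.par.Device = 'cuda'          # assumed menu token for NVIDIA GPUs
stt.par.Computetype = 'float16'  # FP16 on GPU; use 'int8' on CPU (tokens assumed)
stt.par.Chunkduration = 0.8      # shorter = lower latency, longer = better accuracy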

Language Support

The operator supports 99 languages, including:

  • Major Languages: English, Chinese, Spanish, French, German, Russian, Japanese, Korean
  • Regional Variants: English-only models available for better English performance
  • Automatic Detection: Set Language to “auto” to detect the spoken language automatically
Voice Activity Detection (VAD)

  • Filters out non-speech audio segments
  • Reduces false transcriptions from background noise
  • Configurable threshold and silence duration
  • Improves overall transcription quality (configuration sketch below)
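
A minimal VAD configuration using the documented parameters and their defaults:

stt = op('stt_whisper')
stt.par.Usevad = True          # drop non-speech audio before transcription
stt.par.Vadthreshold = 0.5     # speech probability threshold (0 to 1)
stt.par.Vadminsilence = 250    # ms of silence required to end a speech region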
Smart VAD Chunking

  • Automatically detects natural speech pauses
  • Creates chunks at phrase boundaries
  • Improves transcription coherence
  • Reduces word splitting across chunks (see the sketch below)
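
A sketch of the related parameters at their documented defaults; whether a higher Pause Sensitivity cuts more eagerly on pauses is an assumption:

stt = op('stt_whisper')
stt.par.Smartchunking = True       # chunk at natural phrase boundaries
stt.par.Pausesensitivity = 0.1     # pause detection sensitivity (direction assumed)
stt.par.Maxchunkduration = 8.0     # hard cap per chunk, in seconds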
Custom Vocabulary and Quality

  • Phrases to Avoid: Filter out specific unwanted phrases
  • Custom Spellings: Guide the spelling of names and domain terminology
  • Beam Search: Adjust the quality vs. speed tradeoff (example values below)
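
A sketch with illustrative values for these parameters; the formatting the operator expects for the string fields (e.g. comma separation) is an assumption:

stt = op('stt_whisper')
stt.par.Customspellings = 'TouchDesigner, ChatTD, faster-whisper'  # domain terms for the prompt
stt.par.Phrasestoavoid = 'thanks for watching'                     # filtered from output
stt.par.Beamsearchsize = 5                                         # higher = better quality, slower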
Example: Live Transcription

  1. Connect an Audio Device In CHOP to the input of the stt_whisper operator.
  2. Set the Mode to Stream (Live).
  3. Toggle Transcription Active to On.

To view the results:

  1. View the full transcript in the transcription_out DAT.
  2. View the segmented transcript in the segments_out DAT (a script for reading both follows).
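
Scripts can read the same DATs directly. A minimal sketch, assuming transcription_out and segments_out sit inside the component and that the segments table has start/end/text columns under a header row (the layout is an assumption):

full_text = op('stt_whisper/transcription_out').text  # entire transcript

segments = op('stt_whisper/segments_out')
for row in segments.rows()[1:]:  # skip the assumed header row
    start, end, text = row[0].val, row[1].val, row[2].val
    print(f'[{start} - {end}] {text}')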
Troubleshooting

  1. Engine Won’t Initialize

    • Check Python virtual environment configuration in ChatTD
    • Ensure faster-whisper is installed in the environment
    • Verify model download location
  2. Poor Transcription Quality

    • Try a larger model size
    • Adjust VAD threshold
    • Check audio input quality (16kHz recommended)
    • Enable custom spellings for domain-specific terms
  3. High CPU/Memory Usage

    • Use smaller model size
    • Enable GPU acceleration
    • Adjust chunk duration
    • Use INT8 compute type for CPU
  4. Delayed Transcription

    • Reduce chunk duration
    • Disable smart chunking for immediate processing
    • Check system performance and model size
Performance Notes

  • GPU Usage: CUDA acceleration can provide a 3-10x speed improvement
  • Model Caching: Models are cached after the first load for faster subsequent initialization
  • Batch Processing: Push-to-talk mode is more efficient for non-continuous use
  • Resource Management: Shut down the engine when not in use to free resources

Research & Licensing

OpenAI

OpenAI is a leading AI research organization focused on developing artificial general intelligence (AGI) that benefits humanity. Their research spans natural language processing, computer vision, and speech recognition, with a commitment to open science and responsible AI development.

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is designed to be robust to accents, background noise, and technical language, making it suitable for real-world applications requiring reliable speech-to-text conversion.

Technical Details

  • Transformer Architecture: Encoder-decoder model with attention mechanisms
  • Multilingual Training: Support for 99 languages with varying degrees of accuracy
  • Multiple Model Sizes: From tiny (39M parameters) to large (1550M parameters)

Research Impact

  • Democratized Speech Recognition: Open-source model accessible to researchers and developers
  • Multilingual Capabilities: Breakthrough in cross-lingual speech recognition
  • Industry Adoption: Widely used in production applications worldwide

Citation

@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}

Key Research Contributions

  • Large-scale weak supervision training on 680,000 hours of multilingual data
  • Zero-shot transfer capabilities across languages and domains
  • Robust performance approaching human-level accuracy in speech recognition

License

MIT License - This model is freely available for research and commercial use.