STT Whisper

The STT Whisper LOP provides real-time speech-to-text transcription using OpenAI’s Whisper model through the faster-whisper library. This operator supports multiple languages, various model sizes, Voice Activity Detection (VAD) filtering, and both streaming and push-to-talk operating modes.

Requirements

  • Python Environment: Requires a configured Python virtual environment with faster-whisper installed
  • ChatTD Component: Must be configured with a valid Python virtual environment path
  • Audio Input: Expects 16kHz float32 audio data via the ReceiveAudioChunk() method
  • Optional: CUDA-compatible GPU for accelerated processing

Inputs

  • Audio Stream: Receives audio chunks via the ReceiveAudioChunk() method (16kHz, float32)

Outputs

  • Transcription DAT: Real-time transcription text output
  • Segments DAT: Detailed segment information with timestamps and text
  • Status Information: Engine status and transcription state

Parameters

Status (Status) op('stt_whisper').par.Status str
Current operational status of the transcription system
Default: None

Transcription Active (Active) op('stt_whisper').par.Active toggle
Enable/disable transcription processing
Default: false

Operating Mode (Mode) op('stt_whisper').par.Mode menu
Choose between continuous streaming or push-to-talk transcription
Default: Stream

Whisper Status (Enginestatus) op('stt_whisper').par.Enginestatus str
Current status of the Whisper engine
Default: None

Initialize Whisper (Initialize) op('stt_whisper').par.Initialize pulse
Initialize the Whisper transcription engine
Default: None

Shutdown Whisper (Shutdown) op('stt_whisper').par.Shutdown pulse
Shut down the Whisper transcription engine
Default: None

Initialize On Start (Initializeonstart) op('stt_whisper').par.Initializeonstart toggle
Automatically initialize the engine when the operator starts
Default: false

Model Size (Modelsize) op('stt_whisper').par.Modelsize menu
Select the Whisper model size (larger models are more accurate but slower)
Default: medium

Language (Language) op('stt_whisper').par.Language strmenu
Target language for transcription (supports 99 languages)
Default: en

Device (Device) op('stt_whisper').par.Device menu
Choose the processing device for transcription
Default: auto

Compute Type (Computetype) op('stt_whisper').par.Computetype menu
Select the compute precision type for performance optimization
Default: default

Clear Transcript (Cleartranscript) op('stt_whisper').par.Cleartranscript pulse
Clear the current transcription history
Default: None

Copy Transcript to Clipboard (Copytranscript) op('stt_whisper').par.Copytranscript pulse
Copy the current transcript to the system clipboard
Default: None

Smart VAD Chunking (Smartchunking) op('stt_whisper').par.Smartchunking toggle
Use intelligent chunking based on voice activity detection
Default: true

Pause Sensitivity (Pausesensitivity) op('stt_whisper').par.Pausesensitivity float
Sensitivity for detecting pauses in speech (0 = less sensitive, 1 = more sensitive)
Default: 0.1

Max Chunk Duration (sec) (Maxchunkduration) op('stt_whisper').par.Maxchunkduration float
Maximum duration for audio chunks before forced processing
Default: 8.0

Chunk Duration (sec) (Chunkduration) op('stt_whisper').par.Chunkduration float
Target duration for audio chunks in streaming mode
Default: 0.8

Phrases to Avoid (Phrasestoavoid) op('stt_whisper').par.Phrasestoavoid str
Comma-separated list of phrases to filter out of the transcription
Default: None

Custom Spellings (Prompt) (Customspellings) op('stt_whisper').par.Customspellings str
Custom prompt to guide spelling and terminology
Default: None

Use VAD Filter (Usevad) op('stt_whisper').par.Usevad toggle
Enable Voice Activity Detection filtering
Default: true

VAD Threshold (Vadthreshold) op('stt_whisper').par.Vadthreshold float
Threshold for voice activity detection (0 = detect everything, 1 = only clear speech)
Default: 0.5

VAD Min Silence (ms) (Vadminsilence) op('stt_whisper').par.Vadminsilence int
Minimum silence duration in milliseconds to count as a pause
Default: 250

Beam Search Size (Beamsearchsize) op('stt_whisper').par.Beamsearchsize int
Beam search size for transcription quality (higher = better quality, slower)
Default: 5

Bypass (Bypass) op('stt_whisper').par.Bypass toggle
Bypass the operator
Default: false

Show Built-in Parameters (Showbuiltin) op('stt_whisper').par.Showbuiltin toggle
Show built-in TouchDesigner parameters
Default: false

Version (Version) op('stt_whisper').par.Version str
Current version of the operator
Default: None

Last Updated (Lastupdated) op('stt_whisper').par.Lastupdated str
Date of the last update
Default: None

Creator (Creator) op('stt_whisper').par.Creator str
Operator creator
Default: None

Website (Website) op('stt_whisper').par.Website str
Related website or documentation
Default: None

ChatTD Operator (Chattd) op('stt_whisper').par.Chattd op
Reference to the ChatTD operator for configuration
Default: None

Usage

  1. Configure Python Environment: Ensure ChatTD is configured with a Python virtual environment that has faster-whisper installed
  2. Initialize Engine: Click “Initialize Whisper” or enable “Initialize On Start”
  3. Select Model: Choose an appropriate model size based on your accuracy vs. performance needs
  4. Choose Language: Select the target language for transcription
  5. Enable Transcription: Toggle “Transcription Active” to start processing (see the scripted sketch below)
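
The same setup can be scripted. A minimal sketch using the parameters documented above ('medium' and 'en' are the documented defaults):

stt = op('stt_whisper')
stt.par.Modelsize = 'medium'  # accuracy vs. performance tradeoff
stt.par.Language = 'en'       # or 'auto' for automatic detection
stt.par.Initialize.pulse()    # same as clicking "Initialize Whisper"
stt.par.Active = True         # start processing incoming audio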

Operating Modes

Streaming Mode

  • Continuous real-time transcription
  • Processes audio as it arrives
  • Uses smart chunking for natural phrase boundaries
  • Ideal for live conversations and real-time applications

Push-to-Talk Mode

  • Accumulates audio while “Transcription Active” is on
  • Transcribes the entire buffer when “Transcription Active” is turned off (see the sketch below)
  • Better for discrete speech segments
  • Reduces processing overhead for intermittent use
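
A minimal push-to-talk sketch. The Mode menu token 'Pushtotalk' used here is an assumption; check the actual Mode menu entries in your build:

stt = op('stt_whisper')
stt.par.Mode = 'Pushtotalk'  # assumed menu token for push-to-talk
stt.par.Active = True        # press: start accumulating audio
# ... user speaks ...
stt.par.Active = False       # release: the whole buffer is transcribed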

Model Comparison

Model Size | Speed    | Accuracy  | Memory    | Use Case
Tiny       | Fastest  | Basic     | Low       | Real-time, low-resource
Base       | Fast     | Good      | Medium    | Balanced performance
Medium     | Moderate | Very Good | High      | Quality transcription
Large v3   | Slow     | Excellent | Very High | Maximum accuracy

Performance Optimization

  • GPU Acceleration: Set Device to “CUDA” for NVIDIA GPUs
  • Compute Type: Use “FP16” for GPU, “INT8” for CPU optimization
  • Chunk Duration: Shorter chunks for lower latency, longer for better accuracy
  • VAD Filtering: Enable to reduce processing of non-speech audio
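
A configuration sketch for the first two tips. The tokens 'cuda', 'float16', and 'int8' follow faster-whisper's naming; verify them against the actual Device and Compute Type menu entries:

stt = op('stt_whisper')
# NVIDIA GPU: CUDA device with FP16 precision
stt.par.Device = 'cuda'          # assumed menu token
stt.par.Computetype = 'float16'  # assumed menu token
# CPU-only alternative: INT8 quantization
# stt.par.Device = 'cpu'
# stt.par.Computetype = 'int8'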

Language Support

The operator supports 99 languages, including:

  • Major Languages: English, Chinese, Spanish, French, German, Russian, Japanese, Korean
  • Regional Variants: English-only models are available for better English performance
  • Automatic Detection: Leave language as “auto” for automatic detection
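
For automatic detection, the Language parameter accepts the “auto” token noted above:

op('stt_whisper').par.Language = 'auto'  # let Whisper detect the spoken language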

Voice Activity Detection

  • Filters out non-speech audio segments
  • Reduces false transcriptions from background noise
  • Configurable threshold and silence duration
  • Improves overall transcription quality

Smart VAD Chunking

  • Automatically detects natural speech pauses
  • Creates chunks at phrase boundaries
  • Improves transcription coherence
  • Reduces word splitting across chunks
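
Both features are driven by the parameters documented earlier. A tuning sketch using the documented defaults:

stt = op('stt_whisper')
stt.par.Usevad = True           # filter non-speech before transcription
stt.par.Vadthreshold = 0.5      # 0 = detect everything, 1 = only clear speech
stt.par.Vadminsilence = 250     # ms of silence counted as a pause
stt.par.Smartchunking = True    # cut chunks at natural phrase boundaries
stt.par.Pausesensitivity = 0.1  # raise for more aggressive chunking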

Customization

  • Phrases to Avoid: Filter out specific unwanted phrases
  • Custom Spellings: Guide pronunciation and terminology
  • Beam Search: Adjust quality vs. speed tradeoff
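
For example (the phrase and spelling values here are illustrative only):

stt = op('stt_whisper')
stt.par.Phrasestoavoid = 'thank you for watching, like and subscribe'  # comma-separated filter list
stt.par.Customspellings = 'TouchDesigner, ChatTD, CHOP, DAT'           # prompt that guides spelling
stt.par.Beamsearchsize = 5                                             # higher = better quality, slower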

Python Examples

# Send audio data to the STT operator (expects 16kHz float32 mono samples)
import numpy as np

stt_op = op('stt_whisper')
audio_data = np.array(audio_samples, dtype=np.float32)  # audio_samples: your incoming 16kHz buffer
stt_op.ReceiveAudioChunk(audio_data)

# Get the current transcription
transcription_dat = op('stt_whisper/transcription_out')
current_text = transcription_dat.text

# Get segment information
segments_dat = op('stt_whisper/segments_out')
for row in segments_dat.rows()[1:]:  # skip the header row
    start_time, end_time, text = [cell.val for cell in row]
    print(f"{start_time}-{end_time}: {text}")
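
To react to new text as it arrives, one option is a DAT Execute DAT pointed at the transcription output (a sketch; 'transcription_out' follows the path used above):

# DAT Execute DAT callback; fires whenever the watched table changes
def onTableChange(dat):
    # dat is the transcription DAT being monitored
    print('New transcription:', dat.text)
    return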

Troubleshooting

  1. Engine Won’t Initialize

    • Check Python virtual environment configuration in ChatTD
    • Ensure faster-whisper is installed in the environment
    • Verify model download location
  2. Poor Transcription Quality

    • Try a larger model size
    • Adjust VAD threshold
    • Check audio input quality (16kHz recommended)
    • Enable custom spellings for domain-specific terms
  3. High CPU/Memory Usage

    • Use smaller model size
    • Enable GPU acceleration
    • Adjust chunk duration
    • Use INT8 compute type for CPU
  4. Delayed Transcription

    • Reduce chunk duration
    • Disable smart chunking for immediate processing
    • Check system performance and model size

Performance Notes

  • GPU Usage: CUDA acceleration can provide 3-10x speed improvement
  • Model Caching: Models are cached after first load for faster subsequent initialization
  • Batch Processing: Push-to-talk mode is more efficient for non-continuous use
  • Resource Management: Shutdown the engine when not in use to free resources
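
A lifecycle sketch for the last point, using the documented Initialize and Shutdown pulses:

stt = op('stt_whisper')
stt.par.Initialize.pulse()  # load the model (cached after the first load)
# ... transcribe while needed ...
stt.par.Active = False      # stop processing
stt.par.Shutdown.pulse()    # free model and GPU/CPU memory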

Research & Licensing

OpenAI

OpenAI is a leading AI research organization focused on developing artificial general intelligence (AGI) that benefits humanity. Their research spans natural language processing, computer vision, and speech recognition, with a commitment to open science and responsible AI development.

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is designed to be robust to accents, background noise, and technical language, making it suitable for real-world applications requiring reliable speech-to-text conversion.

Technical Details

  • Transformer Architecture: Encoder-decoder model with attention mechanisms
  • Multilingual Training: Support for 99 languages with varying degrees of accuracy
  • Multiple Model Sizes: From tiny (39M parameters) to large (1550M parameters)

Research Impact

  • Democratized Speech Recognition: Open-source model accessible to researchers and developers
  • Multilingual Capabilities: Breakthrough in cross-lingual speech recognition
  • Industry Adoption: Widely used in production applications worldwide

Citation

@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}

Key Research Contributions

  • Large-scale weak supervision training on 680,000 hours of multilingual data
  • Zero-shot transfer capabilities across languages and domains
  • Robust performance approaching human-level accuracy in speech recognition

License

MIT License - This model is freely available for research and commercial use.