STT Whisper
Overview
The STT Whisper LOP provides real-time speech-to-text transcription using OpenAI’s Whisper model through the faster-whisper library. This operator supports multiple languages, various model sizes, Voice Activity Detection (VAD) filtering, and both streaming and push-to-talk operating modes.
Requirements
- Python Environment: Requires a configured Python virtual environment with faster-whisper installed
- ChatTD Component: Must be configured with a valid Python virtual environment path
- Audio Input: Expects 16kHz float32 audio data via the ReceiveAudioChunk() method
- Optional: CUDA-compatible GPU for accelerated processing

Input

- Audio Stream: Receives audio chunks via the ReceiveAudioChunk() method (16kHz, float32)
Output
- Transcription DAT: Real-time transcription text output
- Segments DAT: Detailed segment information with timestamps and text
- Status Information: Engine status and transcription state
Parameters
Faster Whisper

- op('stt_whisper').par.Status (str): Current operational status of the transcription system. Default: None
- op('stt_whisper').par.Active (toggle): Enable/disable transcription processing. Default: false
- op('stt_whisper').par.Enginestatus (str): Current status of the Whisper engine. Default: None
- op('stt_whisper').par.Initialize (pulse): Initialize the Whisper transcription engine. Default: None
- op('stt_whisper').par.Shutdown (pulse): Shut down the Whisper transcription engine. Default: None
- op('stt_whisper').par.Initializeonstart (toggle): Automatically initialize the engine when the operator starts. Default: false
- op('stt_whisper').par.Cleartranscript (pulse): Clear the current transcription history. Default: None
- op('stt_whisper').par.Copytranscript (pulse): Copy the current transcript to the system clipboard. Default: None
- op('stt_whisper').par.Smartchunking (toggle): Use intelligent chunking based on voice activity detection. Default: true
- op('stt_whisper').par.Pausesensitivity (float): Sensitivity for detecting pauses in speech (0 = less sensitive, 1 = more sensitive). Default: 0.1
- op('stt_whisper').par.Maxchunkduration (float): Maximum duration in seconds for audio chunks before forced processing. Default: 8.0
- op('stt_whisper').par.Chunkduration (float): Target duration in seconds for audio chunks in streaming mode. Default: 0.8
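All of the parameters above are scriptable from Python. A minimal sketch, assuming the operator is reachable at the path 'stt_whisper':

```python
# Sketch: read status and drive the transcript utility pulses from a script.
stt = op('stt_whisper')

print(stt.par.Status.eval())        # current operational status
print(stt.par.Enginestatus.eval())  # current engine status

stt.par.Copytranscript.pulse()      # copy the transcript to the clipboard
stt.par.Cleartranscript.pulse()     # then clear the transcription history
```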
VAD / Filter
- op('stt_whisper').par.Phrasestoavoid (str): Comma-separated list of phrases to filter out of the transcription. Default: None
- op('stt_whisper').par.Customspellings (str): Custom prompt to guide spelling and terminology. Default: None
- op('stt_whisper').par.Usevad (toggle): Enable Voice Activity Detection filtering. Default: true
- op('stt_whisper').par.Vadthreshold (float): Threshold for voice activity detection (0 = detect everything, 1 = only clear speech). Default: 0.5
- op('stt_whisper').par.Vadminsilence (int): Minimum silence duration in milliseconds to consider as a pause. Default: 250
- op('stt_whisper').par.Beamsearchsize (int): Beam search size for transcription quality (higher = better quality, slower). Default: 5

About

- op('stt_whisper').par.Bypass (toggle): Bypass the operator. Default: false
- op('stt_whisper').par.Showbuiltin (toggle): Show built-in TouchDesigner parameters. Default: false
- op('stt_whisper').par.Version (str): Current version of the operator. Default: None
- op('stt_whisper').par.Lastupdated (str): Date of last update. Default: None
- op('stt_whisper').par.Creator (str): Operator creator. Default: None
- op('stt_whisper').par.Website (str): Related website or documentation. Default: None
- op('stt_whisper').par.Chattd (op): Reference to the ChatTD operator for configuration. Default: None
Basic Setup
1. Configure Python Environment: Ensure ChatTD is configured with a Python virtual environment that has faster-whisper installed
2. Initialize Engine: Click “Initialize Whisper” or enable “Initialize On Start”
3. Select Model: Choose an appropriate model size based on your accuracy vs. performance needs
4. Choose Language: Select the target language for transcription
5. Enable Transcription: Toggle “Transcription Active” to start processing (a scripted version follows this list)
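The same steps can be scripted; a minimal sketch, assuming the operator path 'stt_whisper' and the parameter names documented above (model and language selection stay on the parameter page):

```python
# Scripted version of the basic setup (sketch, not a definitive recipe).
stt = op('stt_whisper')

stt.par.Initializeonstart = True  # auto-initialize on future project starts
stt.par.Initialize.pulse()        # or initialize the engine right now

print(stt.par.Enginestatus.eval())  # confirm the engine is ready before activating

stt.par.Active = True             # start processing incoming audio
```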
Operating Modes
Stream Mode (Live)

- Continuous real-time transcription
- Processes audio as it arrives
- Uses smart chunking for natural phrase boundaries
- Ideal for live conversations and real-time applications
Push to Talk Mode
- Accumulates audio while “Active” is pressed
- Transcribes the entire buffer when “Active” is released
- Better for discrete speech segments
- Reduces processing overhead for intermittent use (a scripted example follows this list)
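Push-to-talk maps directly onto toggling the Active parameter from an input event. A sketch with hypothetical helper names; wire them to any key or button callback:

```python
# Hypothetical push-to-talk helpers (names are illustrative).
stt = op('stt_whisper')

def ptt_pressed():
    stt.par.Active = True   # hold: audio accumulates in the buffer

def ptt_released():
    stt.par.Active = False  # release: the entire buffer is transcribed
```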
Model Selection Guide
| Model Size | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|
| Tiny | Fastest | Basic | Low | Real-time, low-resource |
| Base | Fast | Good | Medium | Balanced performance |
| Medium | Moderate | Very Good | High | Quality transcription |
| Large v3 | Slow | Excellent | Very High | Maximum accuracy |
Performance Optimization
- GPU Acceleration: Set Device to “CUDA” for NVIDIA GPUs
- Compute Type: Use “FP16” for GPU, “INT8” for CPU optimization
- Chunk Duration: Shorter chunks for lower latency, longer for better accuracy
- VAD Filtering: Enable to reduce processing of non-speech audio (a configuration sketch follows this list)
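A configuration sketch leaning toward low latency. Chunkduration, Usevad, and Beamsearchsize are documented in the parameter reference above; Device and Computetype are assumed names for the model-page settings mentioned here and may differ in your build:

```python
# Low-latency leaning configuration (sketch; verify names against your build).
stt = op('stt_whisper')

stt.par.Chunkduration = 0.5  # shorter chunks: lower latency, less context
stt.par.Usevad = True        # skip non-speech audio entirely
stt.par.Beamsearchsize = 1   # greedy decoding is fastest

# Assumed parameter names for the device settings described above:
# stt.par.Device = 'cuda'          # NVIDIA GPU acceleration
# stt.par.Computetype = 'float16'  # FP16 on GPU; use 'int8' on CPU
```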
Language Support
The operator supports 99 languages including:

- Major Languages: English, Chinese, Spanish, French, German, Russian, Japanese, Korean
- Regional Variants: English-only models available for better English performance
- Automatic Detection: Leave language as “auto” for automatic detection
Advanced Features
Voice Activity Detection (VAD)

- Filters out non-speech audio segments
- Reduces false transcriptions from background noise
- Configurable threshold and silence duration (a tuning sketch follows this list)
- Improves overall transcription quality
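A tuning sketch using the VAD parameters documented above; the values are illustrative starting points for a noisy room, not defaults from the source:

```python
# VAD tuning for a noisy environment (illustrative values).
stt = op('stt_whisper')

stt.par.Usevad = True
stt.par.Vadthreshold = 0.7   # closer to 1: only clear speech passes
stt.par.Vadminsilence = 400  # ms of silence before a pause is declared
```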
Smart Chunking
- Automatically detects natural speech pauses
- Creates chunks at phrase boundaries
- Improves transcription coherence
- Reduces word splitting across chunks (a tuning sketch follows this list)
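Chunking is controlled by the Smartchunking, Pausesensitivity, Chunkduration, and Maxchunkduration parameters documented above. A sketch favoring complete phrases (values illustrative):

```python
# Favor complete phrases over minimum latency (illustrative values).
stt = op('stt_whisper')

stt.par.Smartchunking = True
stt.par.Pausesensitivity = 0.3   # split more readily at natural pauses
stt.par.Maxchunkduration = 10.0  # but never buffer longer than 10 seconds
```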
Custom Filtering
- Phrases to Avoid: Filter out specific unwanted phrases
- Custom Spellings: Guide spelling and terminology
- Beam Search: Adjust quality vs. speed tradeoff (a sketch follows this list)
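These controls map to the Phrasestoavoid, Customspellings, and Beamsearchsize parameters. A sketch; the phrase list and terminology prompt are example values only:

```python
# Filtering and vocabulary sketch (example values).
stt = op('stt_whisper')

stt.par.Phrasestoavoid = 'thank you for watching, subscribe'  # comma-separated
stt.par.Customspellings = 'TouchDesigner, ChatTD, CHOP, DAT'  # guide terminology
stt.par.Beamsearchsize = 5  # raise for quality, lower for speed
```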
Integration Examples
With Audio Input

```python
# Send audio data to the STT operator
import numpy as np

stt_op = op('stt_whisper')
audio_data = np.array(audio_samples, dtype=np.float32)  # 16kHz mono samples
stt_op.ReceiveAudioChunk(audio_data)
```
Reading Transcription
```python
# Get the current transcription
transcription_dat = op('stt_whisper/transcription_out')
current_text = transcription_dat.text

# Get segment information
segments_dat = op('stt_whisper/segments_out')
for row in segments_dat.rows()[1:]:  # Skip header
    start_time, end_time, text = row
    print(f"{start_time}-{end_time}: {text}")
```
Troubleshooting
Common Issues

Engine Won’t Initialize

- Check Python virtual environment configuration in ChatTD
- Ensure faster-whisper is installed in the environment
- Verify model download location (a diagnostic sketch follows this list)

Poor Transcription Quality

- Try a larger model size
- Adjust VAD threshold
- Check audio input quality (16kHz recommended)
- Enable custom spellings for domain-specific terms

High CPU/Memory Usage

- Use a smaller model size
- Enable GPU acceleration
- Adjust chunk duration
- Use INT8 compute type for CPU

Delayed Transcription

- Reduce chunk duration
- Disable smart chunking for immediate processing
- Check system performance and model size
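For initialization problems, the status parameters documented above can be dumped from the textport. A minimal diagnostic sketch:

```python
# Print the operator's self-reported state (diagnostic sketch).
stt = op('stt_whisper')

print('Status:       ', stt.par.Status.eval())
print('Engine status:', stt.par.Enginestatus.eval())
print('Active:       ', stt.par.Active.eval())
print('ChatTD ref:   ', stt.par.Chattd.eval())  # must point at a configured ChatTD
```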
Performance Tips
- GPU Usage: CUDA acceleration can provide a 3-10x speed improvement
- Model Caching: Models are cached after first load for faster subsequent initialization
- Batch Processing: Push-to-talk mode is more efficient for non-continuous use
- Resource Management: Shut down the engine when not in use to free resources (a sketch follows this list)
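The resource-management tip can be scripted: deactivate and shut down the engine whenever transcription is idle. A sketch:

```python
# Free model memory when transcription is not needed (sketch).
stt = op('stt_whisper')

stt.par.Active = False    # stop processing incoming audio
stt.par.Shutdown.pulse()  # unload the engine and free its resources
# Later, stt.par.Initialize.pulse() brings it back; cached models reload faster.
```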
Research & Licensing
OpenAI
OpenAI is a leading AI research organization focused on developing artificial general intelligence (AGI) that benefits humanity. Their research spans natural language processing, computer vision, and speech recognition, with a commitment to open science and responsible AI development.
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is designed to be robust to accents, background noise, and technical language, making it suitable for real-world applications requiring reliable speech-to-text conversion.
Technical Details
- Transformer Architecture: Encoder-decoder model with attention mechanisms
- Multilingual Training: Support for 99 languages with varying degrees of accuracy
- Multiple Model Sizes: From tiny (39M parameters) to large (1550M parameters)
Research Impact
- Democratized Speech Recognition: Open-source model accessible to researchers and developers
- Multilingual Capabilities: Breakthrough in cross-lingual speech recognition
- Industry Adoption: Widely used in production applications worldwide
Citation
```bibtex
@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}
```
Key Research Contributions
- Large-scale weak supervision training on 680,000 hours of multilingual data
- Zero-shot transfer capabilities across languages and domains
- Robust performance approaching human-level accuracy in speech recognition
License
MIT License - This model is freely available for research and commercial use.