STT Kyutai Operator

The STT Kyutai operator provides real-time speech-to-text transcription using Kyutai’s neural speech models, which grew out of their Moshi speech-text foundation model. It builds on Kyutai’s research in full-duplex spoken dialogue systems to deliver accurate, low-latency transcription inside TouchDesigner workflows.

Built on Kyutai’s Moshi - a 7B parameter speech-text foundation model designed for real-time dialogue - this operator provides access to cutting-edge speech recognition with a theoretical latency as low as 160ms. The operator supports both English and French, with semantic Voice Activity Detection (VAD), streaming transcription, and configurable audio processing.

  • Advanced Neural Models: Uses Kyutai’s 1B and 2.6B parameter models for high-quality transcription
  • Multi-Language Support: English and French with automatic language detection
  • Semantic VAD: Intelligent voice activity detection with pause prediction
  • Real-time Streaming: Continuous transcription with low latency
  • Flexible Audio Processing: Configurable chunk duration and temperature settings
  • Comprehensive Logging: Detailed VAD events, state tracking, and performance metrics
  • GPU Acceleration: CUDA support for improved performance

  • ChatTD Operator: Required for Python environment management and async operations
  • Python Dependencies:
    • moshi (Kyutai’s core library)
    • julius (Audio processing)
    • torch (PyTorch for neural inference)
    • huggingface_hub (Model downloading)
  • Hardware: CUDA-compatible GPU recommended for optimal performance

  • Audio Input: Receives audio chunks via ReceiveAudioChunk() method (24kHz float32 format)
  • Transcription Text: transcription_out - Continuous text output
  • Segments Table: segments_out - Individual segments with timing and confidence
  • VAD Step Data: vad_step_out - Pause predictions at different intervals
  • VAD Events: vad_events_out - Speech start/end events
  • VAD State: vad_state_out - Current voice activity state
Model Size (Modelsize) op('stt_kyutai').par.Modelsize Menu
Default: stt-1b-en_fr
Options: stt-1b-en_fr, stt-2.6b-en

Language (Language) op('stt_kyutai').par.Language Menu
Default: auto
Options: auto, en, fr

STT Kyutai (Enginestatus) op('stt_kyutai').par.Enginestatus String
Default: "" (Empty String)

Initialize STT Kyutai (Initialize) op('stt_kyutai').par.Initialize Pulse
Default: false

Shutdown STT Kyutai (Shutdown) op('stt_kyutai').par.Shutdown Pulse
Default: false

Worker Logging Level (Workerlogging) op('stt_kyutai').par.Workerlogging Menu
Default: OFF
Options: OFF, CRITICAL, ERROR, WARNING, INFO, DEBUG

Transcription Active (Active) op('stt_kyutai').par.Active Toggle
Default: false

Chunk Duration (sec) (Chunkduration) op('stt_kyutai').par.Chunkduration Float
Default: 0.0

Temperature (Temperature) op('stt_kyutai').par.Temperature Float
Default: 0.0

Use Semantic VAD (Usevad) op('stt_kyutai').par.Usevad Toggle
Default: false

VAD Threshold (Vadthreshold) op('stt_kyutai').par.Vadthreshold Float
Default: 0.0

VAD Debug Logging (Vaddebugging) op('stt_kyutai').par.Vaddebugging Toggle
Default: false

Max VAD Table Rows (Vadmaxrows) op('stt_kyutai').par.Vadmaxrows Integer
Default: 0

Device (Device) op('stt_kyutai').par.Device Menu
Default: auto
Options: auto, cpu, cuda

Dependencies Available (Installdependencies) op('stt_kyutai').par.Installdependencies Pulse
Default: false

Download Model (Downloadmodel) op('stt_kyutai').par.Downloadmodel Pulse
Default: false

Initialize On Start (Initializeonstart) op('stt_kyutai').par.Initializeonstart Toggle
Default: false

Clear History (Cleartranscript) op('stt_kyutai').par.Cleartranscript Pulse
Default: false

Copy Transcript to Clipboard (Copytranscript) op('stt_kyutai').par.Copytranscript Pulse
Default: false
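
All of these parameters can also be read and driven from Python. A minimal sketch, using the op('stt_kyutai') path from the parameter references above:

stt = op('stt_kyutai')

# Read the current engine status string
print('Engine status:', stt.par.Enginestatus.eval())

# Toggle live transcription on or off
stt.par.Active = True

# Pulse-style parameters are triggered with .pulse()
stt.par.Cleartranscript.pulse()   # clear the accumulated transcript
stt.par.Copytranscript.pulse()    # copy the transcript to the clipboard
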
  1. Setup Dependencies:

    • Click “Install Dependencies” if the button shows missing requirements
    • Wait for installation to complete and restart TouchDesigner
  2. Initialize the Engine:

    • Select desired model size (1B for multilingual, 2.6B for English-only)
    • Click “Initialize STT Kyutai”
    • If the model is missing, choose to download it
  3. Start Transcription:

    • Enable “Transcription Active”
    • Send audio chunks using ReceiveAudioChunk(audio_array)
    • Monitor transcription output in the transcription_out DAT (a scripted version of these steps is sketched below)
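
The same steps can be scripted. A minimal sketch that assumes dependencies and the selected model are already installed; initialization may take a moment, so confirm readiness via the Enginestatus parameter before enabling transcription:

stt = op('stt_kyutai')

# Choose the model and language before initializing
stt.par.Modelsize = 'stt-1b-en_fr'
stt.par.Language = 'auto'

# Start the engine, then watch stt.par.Enginestatus for readiness
stt.par.Initialize.pulse()

# Once the engine is ready, enable live transcription
stt.par.Active = True
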
# Enable VAD with custom threshold
op('stt_kyutai').par.Usevad = True
op('stt_kyutai').par.Vadthreshold = 0.7
op('stt_kyutai').par.Vaddebugging = True
# Monitor VAD events
vad_events = op('stt_kyutai').op('vad_events_out')
for row in range(1, vad_events.numRows):
    event_type = vad_events[row, 'Event_Type'].val
    timestamp = vad_events[row, 'Timestamp'].val
    if event_type == 'speech_start':
        print(f"Speech detected at {timestamp}s")
# Process audio from Audio Device In CHOP
audio_device = op('audiodevicein1')
stt_kyutai = op('stt_kyutai')
# Convert CHOP data to numpy array for processing
def process_audio():
    if audio_device.numSamples > 0:
        # Get audio as numpy array (24kHz float32)
        audio_data = audio_device['chan1'].numpyArray()
        stt_kyutai.ReceiveAudioChunk(audio_data)

# Call this from a Timer CHOP or Execute DAT
# Configure for French transcription
op('stt_kyutai').par.Modelsize = 'stt-1b-en_fr'
op('stt_kyutai').par.Language = 'fr'
op('stt_kyutai').par.Temperature = 0.1 # Lower temperature for more accurate French
# Configure for high-accuracy English
op('stt_kyutai').par.Modelsize = 'stt-2.6b-en'
op('stt_kyutai').par.Language = 'en'
op('stt_kyutai').par.Temperature = 0.0 # Deterministic output

Connect STT Kyutai to Agent operators for voice-controlled AI:

# Monitor transcription and trigger agent
transcription_dat = op('stt_kyutai').op('transcription_out')
agent_op = op('agent1')
def on_transcription_change():
    text = transcription_dat.text.strip()
    if text and len(text) > 10:  # Minimum length threshold
        agent_op.SendMessage(text)

Combine with audio analysis for enhanced processing:

# Use VAD state for audio routing
vad_state = op('stt_kyutai').op('vad_state_out')
audio_switch = op('switch1')
def update_audio_routing():
    if vad_state.numRows > 1:
        is_speech = vad_state[1, 'Is_Speech'].val == 'True'
        audio_switch.par.index = 1 if is_speech else 0

Create visual feedback for transcription:

# Visualize VAD confidence
vad_step_dat = op('stt_kyutai').op('vad_step_out')
confidence_chop = op('constant1')
def update_vad_visualization():
    if vad_step_dat.numRows > 1:
        latest_row = vad_step_dat.numRows - 1
        pause_2s = float(vad_step_dat[latest_row, 'Pause_2.0s'].val)
        confidence_chop.par.value0 = 1.0 - pause_2s  # Invert for speech confidence
  • 1B Model: Use for multilingual scenarios, lower latency, moderate accuracy
  • 2.6B Model: Use for English-only, higher accuracy, acceptable latency
  • Language Setting: Use “auto” for mixed-language content, specific language for better accuracy
  • GPU Usage: Enable CUDA for better performance with larger models
  • Chunk Duration: Keep at default (80ms) for optimal Kyutai model performance
  • VAD Threshold: Adjust based on ambient noise levels (0.3-0.7 typical range)
  • Temperature: Use 0.0-0.2 for deterministic results, 0.3-0.5 for more natural variation
  • Sample Rate: Ensure audio is 24kHz (Kyutai’s native rate)
  • Format: Use float32 format for best quality
  • Buffering: Process audio in consistent chunks for smooth operation (see the buffering sketch after this list)
  • Noise Handling: Use VAD to filter out non-speech audio
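
A minimal buffering sketch along the lines of the tips above: it accumulates incoming samples and forwards fixed 80ms chunks (1920 samples at 24kHz, the frame size noted in the technical notes below). The fixed chunk size and the module-level buffer are illustrative assumptions, not requirements of the operator:

import numpy as np

stt = op('stt_kyutai')
CHUNK_SAMPLES = 1920  # 80ms at 24kHz (assumed chunk size)
_buffer = np.zeros(0, dtype=np.float32)  # persists between calls when placed in a Text DAT

def feed_audio(samples):
    """Append new float32 samples and forward complete fixed-size chunks."""
    global _buffer
    _buffer = np.concatenate([_buffer, np.asarray(samples, dtype=np.float32)])
    while _buffer.size >= CHUNK_SAMPLES:
        stt.ReceiveAudioChunk(_buffer[:CHUNK_SAMPLES])
        _buffer = _buffer[CHUNK_SAMPLES:]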

Engine Won’t Initialize

  • Check that all dependencies are installed
  • Verify model is downloaded locally
  • Ensure ChatTD Python environment is configured
  • Check device compatibility (CUDA drivers for GPU)

Poor Transcription Quality

  • Verify audio sample rate is 24kHz
  • Check microphone quality and positioning
  • Adjust VAD threshold for ambient noise
  • Use appropriate model for language content

High Latency

  • Use 1B model for lower latency
  • Enable GPU acceleration if available
  • Reduce VAD debugging if enabled
  • Check system resource availability

Missing Audio

  • Verify ReceiveAudioChunk() is being called
  • Check audio format (float32 required)
  • Ensure “Transcription Active” is enabled
  • Monitor VAD state for speech detection

“Dependencies missing”

  • Click “Install Dependencies” button
  • Restart TouchDesigner after installation
  • Check ChatTD Python environment configuration

“Model not found”

  • Click “Download Model” to fetch from HuggingFace
  • Check internet connection and HuggingFace access
  • Verify sufficient disk space for model storage
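
The download can also be triggered from a script; a minimal sketch:

stt = op('stt_kyutai')
stt.par.Modelsize = 'stt-2.6b-en'  # select the model you want to fetch
stt.par.Downloadmodel.pulse()      # download it from HuggingFace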

“Worker process failed”

  • Check Python environment and dependencies
  • Review worker logging output for specific errors
  • Verify CUDA installation for GPU usage
  • Try CPU device if GPU fails

“Audio format error”

  • Ensure audio is float32 format
  • Convert audio to 24kHz sample rate
  • Check audio array dimensions and type
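
A small sketch of coercing an incoming buffer to 1-D float32 before calling ReceiveAudioChunk. The mixdown and PCM16 scaling are generic NumPy steps and assume a channel-major layout; resampling to 24kHz is not shown and requires a proper resampler if your source rate differs:

import numpy as np

def to_float32_mono(audio):
    """Coerce a numeric array to 1-D float32 (does not resample)."""
    audio = np.asarray(audio)
    if audio.ndim > 1:
        audio = audio.mean(axis=0)  # naive mixdown, assumes shape (channels, samples)
    if audio.dtype == np.int16:
        audio = audio.astype(np.float32) / 32768.0  # scale PCM16 to [-1, 1]
    return np.ascontiguousarray(audio, dtype=np.float32)

# Example use before forwarding a chunk:
# op('stt_kyutai').ReceiveAudioChunk(to_float32_mono(raw_audio))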

The operator includes advanced Voice Activity Detection:

  • Pause Prediction: Predicts pauses at 0.5s, 1.0s, 2.0s, and 3.0s intervals (see the sketch after this list)
  • Speech Events: Detects speech start/end with confidence scores
  • State Tracking: Maintains current speech/silence state
  • Debugging: Detailed logging for VAD parameter tuning
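
A sketch of reading the latest pause predictions from vad_step_out. Only the Pause_2.0s column name appears in the examples above; the remaining columns are assumed to follow the same Pause_ prefix, so the sketch discovers them from the table header:

vad_step = op('stt_kyutai').op('vad_step_out')

def latest_pause_predictions():
    """Return the most recent pause probabilities keyed by column name."""
    if vad_step.numRows < 2:
        return {}
    headers = [c.val for c in vad_step.row(0)]
    last = vad_step.numRows - 1
    return {h: float(vad_step[last, h].val) for h in headers if h.startswith('Pause_')}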

Uses Kyutai’s Delayed Streams Modeling:

  • Fixed Frame Size: 80ms frames (1920 samples at 24kHz)
  • Continuous Context: Maintains context across audio chunks
  • Low Latency: Optimized for real-time applications
  • Memory Efficient: Manages memory usage automatically

Automatic model downloading and caching:

  • HuggingFace Integration: Downloads models from official repositories
  • Local Caching: Stores models locally for offline use
  • Version Management: Handles model updates and compatibility
  • Storage Optimization: Efficient model storage and loading
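
If the models are fetched through huggingface_hub’s default cache (an assumption; the operator’s exact storage location is not documented here), the local cache can be inspected with a standard huggingface_hub call:

from huggingface_hub import scan_cache_dir

# List cached repositories and their size on disk
for repo in scan_cache_dir().repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")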

This operator provides professional-grade speech-to-text capabilities for TouchDesigner workflows, enabling sophisticated audio processing and AI integration scenarios.

Research & Licensing

Kyutai Research Foundation

Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.

Moshi: Speech-Text Foundation Model

Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.

Technical Details

  • 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
  • Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
  • Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
  • Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
  • Streaming Neural Codec: Uses Mimi codec for efficient audio compression at 1.1 kbps

  • Frame Rate: 12.5 Hz operation (80ms frames)
  • Compression: 24 kHz audio down to 1.1 kbps bandwidth
  • Streaming: Fully causal and streaming with 80ms latency
  • Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
  • Architecture: Transformer-based encoder/decoder with adversarial training

  • STT-1B-EN/FR: 1B parameter model supporting English and French with 0.5s delay
  • STT-2.6B-EN: 2.6B parameter English-only model with 2.5s delay for higher accuracy
  • Multilingual Support: Automatic language detection for mixed-language scenarios
  • Quantization: INT8 and INT4 quantized versions for efficient deployment

Research Impact

  • Real-time Dialogue: Enables natural conversation with minimal latency
  • Full-duplex Communication: Supports interruptions and overlapping speech
  • Semantic Understanding: Advanced VAD with pause prediction capabilities
  • Production Ready: Rust, Python, and MLX implementations for various platforms

Citation

@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and
  Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}

Key Research Contributions

  • Full-duplex spoken dialogue with dual-stream modeling
  • Ultra-low latency speech processing (160ms theoretical)
  • Streaming neural audio codec (Mimi) with 1.1 kbps compression
  • Semantic voice activity detection with pause prediction
  • Production-ready implementations across multiple platforms

License

CC-BY 4.0 - This model is freely available for research and commercial use.