STT Kyutai Operator

Recent Updates

  • Added CHOP channels for parity across all STT operators
  • TCP IPC mode for robust worker communication
  • Auto worker reattachment on TouchDesigner restart
  • TCP heartbeat system for connection monitoring
  • Segments parameter for transcript segmentation
  • Menu cleanup and improved parameter organization

The STT Kyutai operator provides real-time speech-to-text transcription using Kyutai’s neural models, which grew out of Moshi, their speech-text foundation model for full-duplex spoken dialogue. It delivers high accuracy and low latency for professional TouchDesigner workflows.

Built on the Moshi lineage (a 7B parameter speech-text foundation model designed for real-time dialogue), the operator offers theoretical latency as low as 160ms. It supports English and French with semantic Voice Activity Detection (VAD), streaming transcription, and configurable audio processing.

Key Features

  • Advanced Neural Models: Uses Kyutai’s 1B and 2.6B parameter models for high-quality transcription
  • Multi-Language Support: English and French with automatic language detection
  • Semantic VAD: Intelligent voice activity detection with pause prediction
  • Real-time Streaming: Continuous transcription with low latency
  • Flexible Audio Processing: Configurable chunk duration and temperature settings
  • Comprehensive Logging: Detailed VAD events, state tracking, and performance metrics
  • GPU Acceleration: CUDA support for improved performance
Requirements

  • ChatTD Operator: Required for Python environment management and async operations
  • Python Dependencies (a manual install sketch follows this list):
    • moshi (Kyutai’s core library)
    • julius (Audio processing)
    • torch (PyTorch for neural inference)
    • huggingface_hub (Model downloading)
  • Hardware: CUDA-compatible GPU recommended for optimal performance
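If the Install Dependencies button fails, the packages above can be installed by hand. A minimal sketch, assuming you run it with the external Python interpreter that ChatTD manages (not TouchDesigner's embedded one):

```python
# Manual fallback for the Install Dependencies button: install the listed
# packages into the currently running Python environment.
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install",
                       "moshi", "julius", "torch", "huggingface_hub"])
```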
Inputs & Outputs

  • Audio Input: Receives audio chunks via the ReceiveAudioChunk() method (24kHz float32 format)
  • Transcription Text: transcription_out - Continuous text output
  • Segments Table: segments_out - Individual segments with timing and confidence
  • VAD Step Data: vad_step_out - Pause predictions at different intervals
  • VAD Events: vad_events_out - Speech start/end events
  • VAD State: vad_state_out - Current voice activity state
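A hedged sketch of reading these outputs from a script. The DAT names come from the list above; the paths (assumed here to be children of the component) depend on your network layout:

```python
# Read the transcript and the segment table from the operator's output DATs.
# Paths are assumptions; adjust to where the DATs live in your network.
transcript_dat = op('stt_kyutai/transcription_out')
segments_dat = op('stt_kyutai/segments_out')

print(transcript_dat.text)                  # full running transcript
for r in range(1, segments_dat.numRows):    # row 0 assumed to be a header
    print(segments_dat[r, 0].val, segments_dat[r, 1].val)
```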
Parameters

  • Status (Status): op('stt_kyutai').par.Status, Str. Default: None
  • Transcription Active (Active): op('stt_kyutai').par.Active, Toggle. Default: None
  • Copy Transcript to Clipboard (Copytranscript): op('stt_kyutai').par.Copytranscript, Pulse. Default: None
  • STT Kyutai (Enginestatus): op('stt_kyutai').par.Enginestatus, Str. Default: None
  • Initialize STT Kyutai (Initialize): op('stt_kyutai').par.Initialize, Pulse. Default: None
  • Shutdown STT Kyutai (Shutdown): op('stt_kyutai').par.Shutdown, Pulse. Default: None
  • Initialize On Start (Initializeonstart): op('stt_kyutai').par.Initializeonstart, Toggle. Default: None
  • Output Segments (out1) (Segments): op('stt_kyutai').par.Segments, Toggle. Default: None
  • Chunk Duration (sec) (Chunkduration): op('stt_kyutai').par.Chunkduration, Float. Default: 0.1. Range: 0.1 to 5
  • Temperature (Temperature): op('stt_kyutai').par.Temperature, Float. Default: 0.0. Range: 0 to 1
  • Model Size (Modelsize): op('stt_kyutai').par.Modelsize, Menu. Default: stt-1b-en_fr
  • Language (Language): op('stt_kyutai').par.Language, Menu. Default: auto
  • Clear Transcript (Cleartranscript): op('stt_kyutai').par.Cleartranscript, Pulse. Default: None
  • Dependencies Available (Installdependencies): op('stt_kyutai').par.Installdependencies, Pulse. Default: None

Worker Connection Settings

  • IPC Mode (Ipcmode): op('stt_kyutai').par.Ipcmode, Menu. Default: tcp
  • Monitor Worker Logs (stderr) (Monitorworkerlogs): op('stt_kyutai').par.Monitorworkerlogs, Toggle. Default: None
  • Auto Reattach On Init (Autoreattachoninit): op('stt_kyutai').par.Autoreattachoninit, Toggle. Default: None
  • Worker Logging Level (Workerlogging): op('stt_kyutai').par.Workerlogging, Menu. Default: OFF
  • Device (Device): op('stt_kyutai').par.Device, Menu. Default: auto
  • Download Model (Downloadmodel): op('stt_kyutai').par.Downloadmodel, Pulse. Default: None
  • Bypass (Bypass): op('stt_kyutai').par.Bypass, Toggle. Default: None
  • Show Built-in Parameters (Showbuiltin): op('stt_kyutai').par.Showbuiltin, Toggle. Default: None
  • Show Icon (Showicon): op('stt_kyutai').par.Showicon, Toggle. Default: None
  • Version (Version): op('stt_kyutai').par.Version, Str. Default: None
  • Last Updated (Lastupdated): op('stt_kyutai').par.Lastupdated, Str. Default: None
  • Creator (Creator): op('stt_kyutai').par.Creator, Str. Default: None
  • Website (Website): op('stt_kyutai').par.Website, Str. Default: None
  • ChatTD Operator (Chattd): op('stt_kyutai').par.Chattd, OP. Default: None
Quick Start

  1. Set Up Dependencies:

    • Click “Install Dependencies” if the button shows missing requirements
    • Wait for installation to complete and restart TouchDesigner
  2. Initialize the Engine:

    • Select desired model size (1B for multilingual, 2.6B for English-only)
    • Click “Initialize STT Kyutai”
    • If model is missing, choose to download it
  3. Start Transcription:

    • Enable “Transcription Active”
    • Send audio chunks using ReceiveAudioChunk(audio_array) (see the sketch after this list)
    • Monitor transcription output in the transcription_out DAT
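A minimal sketch of the audio feed in step 3, assuming an Audio Device In CHOP already resampled to 24 kHz mono and a script (for example an Execute DAT) that calls this function every frame; the CHOP name is hypothetical:

```python
import numpy as np

def send_audio_chunk():
    stt = op('stt_kyutai')
    audio_in = op('audiodevin1')  # hypothetical 24 kHz mono source CHOP
    if not stt.par.Active.eval():
        return
    # Channel samples as a float32 array, the format ReceiveAudioChunk() expects
    chunk = np.asarray(audio_in[0].vals, dtype=np.float32)
    stt.ReceiveAudioChunk(chunk)
```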
For example, to transcribe a single known language with the multilingual model:

  1. Select the 1B EN/FR model from the Model Size dropdown.
  2. Select your desired language from the Language dropdown.
  3. Initialize the engine and start transcription.
Performance Tips

  • 1B Model: Use for multilingual scenarios, lower latency, moderate accuracy
  • 2.6B Model: Use for English-only, higher accuracy, acceptable latency
  • Language Setting: Use “auto” for mixed-language content, specific language for better accuracy
  • GPU Usage: Enable CUDA for better performance with larger models
  • Chunk Duration: Keep at the default (0.1 s); the engine consumes audio internally in fixed 80ms frames
  • Temperature: Use 0.0-0.2 for deterministic results, 0.3-0.5 for more natural variation
  • Sample Rate: Ensure audio is 24kHz (Kyutai’s native rate)
  • Format: Use float32 format for best quality
  • Buffering: Process audio in consistent chunks for smooth operation
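A hedged sketch applying these tips programmatically. Parameter names match the table above, but the 'en' menu value is an assumption (the documented default is 'auto'):

```python
# Configure for low-latency, deterministic English transcription.
stt = op('stt_kyutai')
stt.par.Modelsize = 'stt-1b-en_fr'   # lower-latency 1B model
stt.par.Language = 'en'              # assumed menu value; default is 'auto'
stt.par.Chunkduration = 0.1          # seconds (the default)
stt.par.Temperature = 0.0            # deterministic decoding
stt.par.Initialize.pulse()           # then enable Transcription Active
```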

Troubleshooting

Engine Won’t Initialize

  • Check that all dependencies are installed
  • Verify model is downloaded locally
  • Ensure ChatTD Python environment is configured
  • Check device compatibility (CUDA drivers for GPU)

Poor Transcription Quality

  • Verify audio sample rate is 24kHz
  • Check microphone quality and positioning
  • Use appropriate model for language content

High Latency

  • Use 1B model for lower latency
  • Enable GPU acceleration if available
  • Check system resource availability

Missing Audio

  • Verify ReceiveAudioChunk() is being called
  • Check audio format (float32 required)
  • Ensure “Transcription Active” is enabled

“Dependencies missing”

  • Click “Install Dependencies” button
  • Restart TouchDesigner after installation
  • Check ChatTD Python environment configuration

“Model not found”

  • Click “Download Model” to fetch from HuggingFace
  • Check internet connection and HuggingFace access
  • Verify sufficient disk space for model storage

“Worker process failed”

  • Check Python environment and dependencies
  • Review worker logging output for specific errors
  • Verify CUDA installation for GPU usage
  • Try CPU device if GPU fails

“Audio format error”

  • Ensure audio is float32 format
  • Convert audio to 24kHz sample rate
  • Check audio array dimensions and type
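If your source audio is not already 24 kHz float32, convert it before it reaches ReceiveAudioChunk(). A self-contained sketch using linear interpolation (a dedicated resampler such as julius gives better quality; this is only illustrative):

```python
import numpy as np

def to_24k_float32(audio, source_rate):
    """Convert a 1-D sample buffer to 24 kHz float32 via linear interpolation."""
    audio = np.asarray(audio, dtype=np.float32)
    if source_rate == 24000:
        return audio
    n_out = int(len(audio) * 24000 / source_rate)
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio).astype(np.float32)
```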

Uses Kyutai’s Delayed Streams Modeling:

  • Fixed Frame Size: 80ms frames (1920 samples at 24kHz)
  • Continuous Context: Maintains context across audio chunks
  • Low Latency: Optimized for real-time applications
  • Memory Efficient: Manages memory usage automatically
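The frame figures above fit together as follows:

```python
SAMPLE_RATE = 24000                            # Hz, Kyutai's native rate
FRAME_SECONDS = 0.080                          # fixed 80 ms frame
samples_per_frame = round(SAMPLE_RATE * FRAME_SECONDS)
assert samples_per_frame == 1920               # matches the figure above
frame_rate_hz = SAMPLE_RATE / samples_per_frame
assert frame_rate_hz == 12.5                   # the 12.5 Hz rate cited below
```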

Automatic model downloading and caching:

  • HuggingFace Integration: Downloads models from official repositories
  • Local Caching: Stores models locally for offline use
  • Version Management: Handles model updates and compatibility
  • Storage Optimization: Efficient model storage and loading
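Conceptually, the Download Model pulse fetches the checkpoint from HuggingFace and caches it locally. A direct huggingface_hub sketch (illustrative only, not the operator's actual code; the repo id is Kyutai's published stt-1b-en_fr checkpoint):

```python
from huggingface_hub import snapshot_download

# Downloads on first call, then reuses the local HuggingFace cache.
local_path = snapshot_download("kyutai/stt-1b-en_fr")
print("model cached at:", local_path)
```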

This operator provides professional-grade speech-to-text capabilities for TouchDesigner workflows, enabling sophisticated audio processing and AI integration scenarios.

Research & Licensing

Kyutai Research Foundation

Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.

Moshi: Speech-Text Foundation Model

Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.

Technical Details

  • 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
  • Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
  • Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
  • Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
  • Streaming Neural Codec: Uses Mimi codec for efficient audio compression at 1.1 kbps
Mimi Neural Codec

  • Frame Rate: 12.5 Hz operation (80ms frames)
  • Compression: 24 kHz audio down to 1.1 kbps bandwidth
  • Streaming: Fully causal and streaming with 80ms latency
  • Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
  • Architecture: Transformer-based encoder/decoder with adversarial training
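The 1.1 kbps figure follows from the frame rate and Mimi's quantizer layout; the 8-codebook by 11-bit split is an assumption based on the Moshi paper, not stated above:

```python
frame_rate_hz = 12.5           # 80 ms frames, from the list above
codebooks = 8                  # residual quantizers (assumed from the paper)
bits_per_codebook = 11         # 2**11 = 2048 entries per codebook (assumed)
bitrate_bps = frame_rate_hz * codebooks * bits_per_codebook
assert bitrate_bps == 1100.0   # i.e. 1.1 kbps
```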
STT Models

  • STT-1B-EN/FR: 1B parameter model supporting English and French with 0.5s delay
  • STT-2.6B-EN: 2.6B parameter English-only model with 2.5s delay for higher accuracy
  • Multilingual Support: Automatic language detection for mixed-language scenarios
  • Quantization: INT8 and INT4 quantized versions for efficient deployment

Research Impact

  • Real-time Dialogue: Enables natural conversation with minimal latency
  • Full-duplex Communication: Supports interruptions and overlapping speech
  • Semantic Understanding: Advanced VAD with pause prediction capabilities
  • Production Ready: Rust, Python, and MLX implementations for various platforms

Citation

@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and
  Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}

Key Research Contributions

  • Full-duplex spoken dialogue with dual-stream modeling
  • Ultra-low latency speech processing (160ms theoretical)
  • Streaming neural audio codec (Mimi) with 1.1 kbps compression
  • Semantic voice activity detection with pause prediction
  • Production-ready implementations across multiple platforms

License

CC-BY 4.0 - This model is freely available for research and commercial use.