STT Kyutai Operator
STT Kyutai v1.2.1 [ September 2, 2025 ]
- Added CHOP channels for parity across all STT operators
- TCP IPC mode for robust worker communication
- Auto worker reattachment on TouchDesigner restart
- TCP heartbeat system for connection monitoring
- Segments parameter for transcript segmentation
- Menu cleanup and improved parameter organization
Overview
The STT Kyutai operator provides real-time speech-to-text transcription using Kyutai’s neural models, derived from their Moshi speech-text foundation model. It builds on Kyutai’s research in full-duplex spoken dialogue systems to deliver accurate, low-latency transcription for professional TouchDesigner workflows.
Built on Kyutai’s Moshi foundation model - a 7B parameter speech-text foundation model designed for real-time dialogue - this operator provides access to cutting-edge speech recognition technology with theoretical latency as low as 160ms. The operator supports both English and French languages with advanced semantic Voice Activity Detection (VAD), streaming transcription, and comprehensive audio processing capabilities.
Key Features
- Advanced Neural Models: Uses Kyutai’s 1B and 2.6B parameter models for high-quality transcription
- Multi-Language Support: English and French with automatic language detection
- Semantic VAD: Intelligent voice activity detection with pause prediction
- Real-time Streaming: Continuous transcription with low latency
- Flexible Audio Processing: Configurable chunk duration and temperature settings
- Comprehensive Logging: Detailed VAD events, state tracking, and performance metrics
- GPU Acceleration: CUDA support for improved performance
Requirements
- ChatTD Operator: Required for Python environment management and async operations
- Python Dependencies (a quick import check is sketched after this list):
  - moshi (Kyutai’s core library)
  - julius (Audio processing)
  - torch (PyTorch for neural inference)
  - huggingface_hub (Model downloading)
- Hardware: CUDA-compatible GPU recommended for optimal performance
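If you want to verify the environment outside the operator UI, a minimal import check (assuming the ChatTD-managed interpreter is the one executing your scripts) looks like this:

```python
# Minimal sketch: verify the required packages import from the ChatTD-managed
# Python environment (run from a Text DAT or the textport).
import importlib

for pkg in ("moshi", "julius", "torch", "huggingface_hub"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: OK ({getattr(mod, '__version__', 'unknown version')})")
    except ImportError as exc:
        print(f"{pkg}: MISSING ({exc})")
```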
Input/Output
Inputs
- Audio Input: Receives audio chunks via the ReceiveAudioChunk() method (24kHz float32 format)
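A minimal sketch of feeding one chunk from Python; the CHOP name audiodevin1 and the promoted ReceiveAudioChunk() method call are assumptions:

```python
# Sketch: push one chunk of audio into the operator. Assumes a mono 24 kHz
# source CHOP named 'audiodevin1' and that ReceiveAudioChunk() is exposed as a
# promoted extension method on the component.
import numpy as np

samples = np.array(op('audiodevin1')['chan1'].vals, dtype=np.float32)  # 1-D float32
op('stt_kyutai').ReceiveAudioChunk(samples)
```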
Outputs
- Transcription Text: transcription_out - Continuous text output (see the reading sketch below)
- Segments Table: segments_out - Individual segments with timing and confidence
- VAD Step Data: vad_step_out - Pause predictions at different intervals
- VAD Events: vad_events_out - Speech start/end events
- VAD State: vad_state_out - Current voice activity state
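A sketch of reading these outputs from a script; the internal DAT paths and the segments_out column layout are assumptions, so adjust them to your network:

```python
# Sketch: reading the output DATs from Python. The internal paths and the
# segments_out column layout are assumptions; adjust to your network.
transcript = op('stt_kyutai/transcription_out').text
print(transcript)

segments = op('stt_kyutai/segments_out')
for row in segments.rows()[1:]:          # skip the header row
    print([cell.val for cell in row])    # e.g. segment text, timing, confidence
```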
Parameters
Page: KyutaiSTT
op('stt_kyutai').par.Status - Str - Default: None
op('stt_kyutai').par.Active - Toggle - Default: None
op('stt_kyutai').par.Copytranscript - Pulse - Default: None
op('stt_kyutai').par.Enginestatus - Str - Default: None
op('stt_kyutai').par.Initialize - Pulse - Default: None
op('stt_kyutai').par.Shutdown - Pulse - Default: None
op('stt_kyutai').par.Initializeonstart - Toggle - Default: None
op('stt_kyutai').par.Segments - Toggle - Default: None
op('stt_kyutai').par.Chunkduration - Float - Default: 0.1 - Range: 0.1 to 5
op('stt_kyutai').par.Temperature - Float - Default: 0.0 - Range: 0 to 1
op('stt_kyutai').par.Cleartranscript - Pulse - Default: None
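These parameters can also be set and pulsed from scripts. A minimal sketch using only the names listed on this page:

```python
# Sketch: configuring and driving the operator from a script, using only the
# parameter names listed on this page.
stt = op('stt_kyutai')

stt.par.Chunkduration = 0.1      # seconds of audio per processing chunk
stt.par.Temperature = 0.0        # deterministic decoding
stt.par.Segments = True          # emit per-segment rows to segments_out

stt.par.Initialize.pulse()       # start the engine / worker
stt.par.Active = True            # begin transcribing incoming audio
# ...later
stt.par.Cleartranscript.pulse()  # reset the accumulated transcript
```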
Page: Install/Settings
op('stt_kyutai').par.Installdependencies - Pulse - Default: None
op('stt_kyutai').par.Monitorworkerlogs - Toggle - Default: None
op('stt_kyutai').par.Autoreattachoninit - Toggle - Default: None
op('stt_kyutai').par.Downloadmodel - Pulse - Default: None
Page: About
op('stt_kyutai').par.Bypass - Toggle - Default: None
op('stt_kyutai').par.Showbuiltin - Toggle - Default: None
op('stt_kyutai').par.Showicon - Toggle - Default: None
op('stt_kyutai').par.Version - Str - Default: None
op('stt_kyutai').par.Lastupdated - Str - Default: None
op('stt_kyutai').par.Creator - Str - Default: None
op('stt_kyutai').par.Website - Str - Default: None
op('stt_kyutai').par.Chattd - OP - Default: None
Usage Examples
Basic Real-time Transcription
- Setup Dependencies:
  - Click “Install Dependencies” if the button shows missing requirements
  - Wait for installation to complete and restart TouchDesigner
- Initialize the Engine:
  - Select desired model size (1B for multilingual, 2.6B for English-only)
  - Click “Initialize STT Kyutai”
  - If the model is missing, choose to download it
- Start Transcription:
  - Enable “Transcription Active”
  - Send audio chunks using ReceiveAudioChunk(audio_array), as in the sketch below
  - Monitor transcription output in the transcription_out DAT
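One way to wire the Start Transcription step together is an Execute DAT that forwards each frame’s audio. This is a sketch rather than the operator’s prescribed integration; the CHOP name and the promoted method call are assumptions:

```python
# Sketch: an Execute DAT that forwards each frame's audio for the
# "Start Transcription" step. Assumes an Audio Device In CHOP 'audiodevin1'
# resampled to 24 kHz mono and a promoted ReceiveAudioChunk() method; use a
# dedicated buffering network if you need sample-accurate chunking.
import numpy as np

def onFrameEnd(frame):
    stt = op('stt_kyutai')
    if not stt.par.Active.eval():
        return
    samples = np.array(op('audiodevin1')['chan1'].vals, dtype=np.float32)
    stt.ReceiveAudioChunk(samples)
    return
```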
Language-Specific Configuration
- Select the 1B EN/FR model from the Model Size dropdown.
- Select your desired language from the Language dropdown.
- Initialize the engine and start transcription (a scripted variant is sketched below).
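A scripted variant of the same setup; because the parameter names behind the Model Size and Language dropdowns are not listed on the parameter pages above, the names and menu values here are hypothetical placeholders:

```python
# Hedged sketch: 'Modelsize' and 'Language' (and their menu values) are
# hypothetical placeholders - check the operator's parameter dialog for the
# real names before using this.
stt = op('stt_kyutai')
stt.par.Modelsize = '1b-en_fr'   # hypothetical menu value for the 1B EN/FR model
stt.par.Language = 'fr'          # hypothetical menu value for French
stt.par.Initialize.pulse()
stt.par.Active = True
```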
Best Practices
Model Selection
- 1B Model: Use for multilingual scenarios, lower latency, moderate accuracy
- 2.6B Model: Use for English-only, higher accuracy, acceptable latency
- Language Setting: Use “auto” for mixed-language content, specific language for better accuracy
Performance Optimization
- GPU Usage: Enable CUDA for better performance with larger models (a quick availability check is sketched after this list)
- Chunk Duration: Keep at the default for optimal Kyutai model performance (the model processes fixed 80ms frames internally)
- Temperature: Use 0.0-0.2 for deterministic results, 0.3-0.5 for more natural variation
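Before relying on GPU acceleration, you can confirm CUDA is actually visible to the Python environment with torch, which is already a dependency:

```python
# Sketch: confirm CUDA is visible to the Python environment before relying on
# GPU acceleration (torch is already a dependency of this operator).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```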
Audio Quality
- Sample Rate: Ensure audio is 24kHz, Kyutai’s native rate (a resampling sketch follows this list)
- Format: Use float32 format for best quality
- Buffering: Process audio in consistent chunks for smooth operation
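If your source audio is not already 24kHz, you can resample it before sending. A sketch using julius, one of the operator’s dependencies; the helper name to_24k is only for illustration:

```python
# Sketch: resample incoming audio to Kyutai's native 24 kHz with julius (one of
# the operator's dependencies). The helper name to_24k is just for illustration.
import numpy as np
import torch
import julius

def to_24k(samples: np.ndarray, source_rate: int) -> np.ndarray:
    x = torch.from_numpy(samples.astype(np.float32))
    return julius.resample_frac(x, source_rate, 24000).numpy()

chunk_48k = np.zeros(4800, dtype=np.float32)   # e.g. 100 ms captured at 48 kHz
chunk_24k = to_24k(chunk_48k, 48000)           # -> 2400 samples at 24 kHz
```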
Troubleshooting
Common Issues
Engine Won’t Initialize
- Check that all dependencies are installed
- Verify model is downloaded locally
- Ensure ChatTD Python environment is configured
- Check device compatibility (CUDA drivers for GPU)
Poor Transcription Quality
- Verify audio sample rate is 24kHz
- Check microphone quality and positioning
- Use appropriate model for language content
High Latency
- Use 1B model for lower latency
- Enable GPU acceleration if available
- Check system resource availability
Missing Audio
- Verify ReceiveAudioChunk() is being called
- Check audio format (float32 required)
- Ensure “Transcription Active” is enabled
Error Messages
“Dependencies missing”
- Click “Install Dependencies” button
- Restart TouchDesigner after installation
- Check ChatTD Python environment configuration
“Model not found”
- Click “Download Model” to fetch from HuggingFace
- Check internet connection and HuggingFace access
- Verify sufficient disk space for model storage
“Worker process failed”
- Check Python environment and dependencies
- Review worker logging output for specific errors
- Verify CUDA installation for GPU usage
- Try CPU device if GPU fails
“Audio format error”
- Ensure audio is float32 format
- Convert audio to 24kHz sample rate
- Check audio array dimensions and type
Advanced Features
Streaming Architecture
Uses Kyutai’s Delayed Streams Modeling:
- Fixed Frame Size: 80ms frames (1920 samples at 24kHz)
- Continuous Context: Maintains context across audio chunks
- Low Latency: Optimized for real-time applications
- Memory Efficient: Manages memory usage automatically
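The 1920-sample figure follows directly from the sample rate and frame duration; a quick sanity check:

```python
# Sanity check: the 80 ms frame size at 24 kHz is exactly 1920 samples.
SAMPLE_RATE = 24_000      # Hz, Kyutai's native rate
FRAME_DURATION = 0.080    # seconds per frame (12.5 Hz frame rate)

samples_per_frame = int(SAMPLE_RATE * FRAME_DURATION)
assert samples_per_frame == 1920
print(f"{samples_per_frame} samples per {FRAME_DURATION * 1000:.0f} ms frame")
```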
Model Management
Automatic model downloading and caching:
- HuggingFace Integration: Downloads models from official repositories
- Local Caching: Stores models locally for offline use
- Version Management: Handles model updates and compatibility
- Storage Optimization: Efficient model storage and loading
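If you want to pre-cache a checkpoint yourself rather than using the Download Model pulse, huggingface_hub’s snapshot_download can do it; the repo id in this sketch is an assumption about which repository the operator resolves internally:

```python
# Hedged sketch: pre-caching a checkpoint with huggingface_hub. The "Download
# Model" pulse normally does this; the repo id below is Kyutai's publicly listed
# 1B EN/FR checkpoint and may differ from what the operator resolves internally.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="kyutai/stt-1b-en_fr")
print("Model cached at:", local_path)
```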
This operator provides professional-grade speech-to-text capabilities for TouchDesigner workflows, enabling sophisticated audio processing and AI integration scenarios.
Research & Licensing
Kyutai Research Foundation
Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.
Moshi: Speech-Text Foundation Model
Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.
Technical Details
Moshi Foundation Model
- 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
- Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
- Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
- Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
Mimi Neural Codec
- Streaming Neural Codec: Uses Mimi codec for efficient audio compression at 1.1 kbps
- Frame Rate: 12.5 Hz operation (80ms frames)
- Compression: 24 kHz audio down to 1.1 kbps bandwidth
- Streaming: Fully causal and streaming with 80ms latency
- Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
- Architecture: Transformer-based encoder/decoder with adversarial training
STT Models
- STT-1B-EN/FR: 1B parameter model supporting English and French with 0.5s delay
- STT-2.6B-EN: 2.6B parameter English-only model with 2.5s delay for higher accuracy
- Multilingual Support: Automatic language detection for mixed-language scenarios
- Quantization: INT8 and INT4 quantized versions for efficient deployment
Research Impact
- Real-time Dialogue: Enables natural conversation with minimal latency
- Full-duplex Communication: Supports interruptions and overlapping speech
- Semantic Understanding: Advanced VAD with pause prediction capabilities
- Production Ready: Rust, Python, and MLX implementations for various platforms
Citation
@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}
Key Research Contributions
- Full-duplex spoken dialogue with dual-stream modeling
- Ultra-low latency speech processing (160ms theoretical)
- Streaming neural audio codec (Mimi) with 1.1 kbps compression
- Semantic voice activity detection with pause prediction
- Production-ready implementations across multiple platforms
License
CC-BY 4.0 - This model is freely available for research and commercial use.