STT Kyutai Operator
Overview
The STT Kyutai operator provides real-time speech-to-text transcription using Kyutai’s neural models, which grew out of their Moshi speech-text foundation model. Building on Kyutai’s research into full-duplex spoken dialogue systems, it delivers accurate, low-latency transcription for TouchDesigner workflows.
Built on the same foundations as Kyutai’s Moshi - a 7B parameter speech-text foundation model designed for real-time dialogue - this operator exposes a speech recognition stack with a theoretical framework latency as low as 160ms (the STT models themselves add 0.5s-2.5s of delay depending on the variant). The operator supports both English and French with semantic Voice Activity Detection (VAD), streaming transcription, and comprehensive audio processing capabilities.
Key Features
- Advanced Neural Models: Uses Kyutai’s 1B and 2.6B parameter models for high-quality transcription
- Multi-Language Support: English and French with automatic language detection
- Semantic VAD: Intelligent voice activity detection with pause prediction
- Real-time Streaming: Continuous transcription with low latency
- Flexible Audio Processing: Configurable chunk duration and temperature settings
- Comprehensive Logging: Detailed VAD events, state tracking, and performance metrics
- GPU Acceleration: CUDA support for improved performance
Requirements
- ChatTD Operator: Required for Python environment management and async operations
- Python Dependencies (see the install sketch after this list):
  - moshi (Kyutai’s core library)
  - julius (audio processing)
  - torch (PyTorch for neural inference)
  - huggingface_hub (model downloading)
- Hardware: CUDA-compatible GPU recommended for optimal performance
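For reference, the “Install Dependencies” pulse can be reproduced manually. A minimal sketch, assuming it is run from a regular Python prompt in the same environment that ChatTD is configured to use (not from inside TouchDesigner, where sys.executable points at TouchDesigner itself):

# Manual equivalent of the "Install Dependencies" pulse.
# Run from the Python interpreter that ChatTD manages, not inside TouchDesigner.
import subprocess
import sys

subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "moshi", "julius", "torch", "huggingface_hub",
])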
Input/Output
Inputs
- Audio Input: Receives audio chunks via the ReceiveAudioChunk() method (24kHz float32 format)
Outputs
- Transcription Text: transcription_out - Continuous text output
- Segments Table: segments_out - Individual segments with timing and confidence (see the reading sketch after this list)
- VAD Step Data: vad_step_out - Pause predictions at different intervals
- VAD Events: vad_events_out - Speech start/end events
- VAD State: vad_state_out - Current voice activity state
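The table-style outputs can be read with standard DAT cell indexing, while transcription_out is read as plain text (see the agent example further below). A small sketch that dumps the segments table without assuming specific column names, since only the VAD tables’ columns are documented here:

# Print every data row of segments_out as a dict keyed by its header row
segments = op('stt_kyutai').op('segments_out')

if segments.numRows > 1:
    headers = [cell.val for cell in segments.row(0)]
    for r in range(1, segments.numRows):
        values = [cell.val for cell in segments.row(r)]
        print(dict(zip(headers, values)))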
Parameters
Page: KyutaiSTT
- op('stt_kyutai').par.Enginestatus: String - Default: "" (Empty String)
- op('stt_kyutai').par.Initialize: Pulse - Default: false
- op('stt_kyutai').par.Shutdown: Pulse - Default: false
- op('stt_kyutai').par.Active: Toggle - Default: false
- op('stt_kyutai').par.Chunkduration: Float - Default: 0.0
- op('stt_kyutai').par.Temperature: Float - Default: 0.0
- op('stt_kyutai').par.Usevad: Toggle - Default: false
- op('stt_kyutai').par.Vadthreshold: Float - Default: 0.0
- op('stt_kyutai').par.Vaddebugging: Toggle - Default: false
- op('stt_kyutai').par.Vadmaxrows: Integer - Default: 0
- op('stt_kyutai').par.Installdependencies: Pulse - Default: false
- op('stt_kyutai').par.Downloadmodel: Pulse - Default: false
- op('stt_kyutai').par.Initializeonstart: Toggle - Default: false
- op('stt_kyutai').par.Cleartranscript: Pulse - Default: false
- op('stt_kyutai').par.Copytranscript: Pulse - Default: false
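These parameters can also be driven from scripts: in TouchDesigner, pulse parameters are fired with .pulse() and toggles are assigned directly. A minimal sketch (the operator path assumes the default name stt_kyutai):

stt = op('stt_kyutai')

# Fire pulse parameters
stt.par.Initialize.pulse()

# Set toggle parameters
stt.par.Active = True
stt.par.Usevad = True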
Usage Examples
Basic Real-time Transcription
- Setup Dependencies:
  - Click “Install Dependencies” if the button shows missing requirements
  - Wait for installation to complete and restart TouchDesigner
- Initialize the Engine:
  - Select the desired model size (1B for multilingual, 2.6B for English-only)
  - Click “Initialize STT Kyutai”
  - If the model is missing, choose to download it
- Start Transcription:
  - Enable “Transcription Active”
  - Send audio chunks using ReceiveAudioChunk(audio_array)
  - Monitor transcription output in the transcription_out DAT
Voice Activity Detection
# Enable VAD with custom threshold
op('stt_kyutai').par.Usevad = True
op('stt_kyutai').par.Vadthreshold = 0.7
op('stt_kyutai').par.Vaddebugging = True
# Monitor VAD events
vad_events = op('stt_kyutai').op('vad_events_out')
for row in range(1, vad_events.numRows):
    event_type = vad_events[row, 'Event_Type'].val
    timestamp = vad_events[row, 'Timestamp'].val
    if event_type == 'speech_start':
        print(f"Speech detected at {timestamp}s")
Audio Processing Integration
# Process audio from Audio Device In CHOP
audio_device = op('audiodevicein1')
stt_kyutai = op('stt_kyutai')
# Convert CHOP data to numpy array for processing
def process_audio():
    if audio_device.numSamples > 0:
        # Get audio as numpy array (24kHz float32)
        audio_data = audio_device['chan1'].numpyArray()
        stt_kyutai.ReceiveAudioChunk(audio_data)
# Call this from a Timer CHOP or Execute DAT
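One way to drive process_audio() regularly is a CHOP Execute DAT watching a Timer CHOP. The callback below uses TouchDesigner’s standard CHOP Execute signature; the DAT name holding process_audio() is a placeholder for wherever you defined it:

# CHOP Execute DAT attached to a Timer CHOP (hypothetical operator names)
def onOffToOn(channel, sampleIndex, val, prev):
    # 'audio_utils' is assumed to be the Text DAT containing process_audio()
    op('audio_utils').module.process_audio()
    return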
Language-Specific Configuration
# Configure for French transcription
op('stt_kyutai').par.Modelsize = 'stt-1b-en_fr'
op('stt_kyutai').par.Language = 'fr'
op('stt_kyutai').par.Temperature = 0.1  # Lower temperature for more accurate French
# Configure for high-accuracy English
op('stt_kyutai').par.Modelsize = 'stt-2.6b-en'
op('stt_kyutai').par.Language = 'en'
op('stt_kyutai').par.Temperature = 0.0  # Deterministic output
Integration Examples
With Agent Workflows
Connect STT Kyutai to Agent operators for voice-controlled AI:
# Monitor transcription and trigger agent
transcription_dat = op('stt_kyutai').op('transcription_out')
agent_op = op('agent1')
def on_transcription_change():
    text = transcription_dat.text.strip()
    if text and len(text) > 10:  # Minimum length threshold
        agent_op.SendMessage(text)
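To invoke on_transcription_change() automatically, one option is a DAT Execute DAT watching one of the operator’s table outputs (segments_out, for example). onTableChange is the standard DAT Execute callback; the DAT holding the function above is again a placeholder:

# DAT Execute DAT monitoring segments_out (hypothetical wiring)
def onTableChange(dat):
    # 'agent_bridge' is assumed to be the Text DAT containing on_transcription_change()
    op('agent_bridge').module.on_transcription_change()
    return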
With Audio Analysis
Combine with audio analysis for enhanced processing:
# Use VAD state for audio routing
vad_state = op('stt_kyutai').op('vad_state_out')
audio_switch = op('switch1')
def update_audio_routing():
    if vad_state.numRows > 1:
        is_speech = vad_state[1, 'Is_Speech'].val == 'True'
        audio_switch.par.index = 1 if is_speech else 0
With Real-time Visualization
Create visual feedback for transcription:
# Visualize VAD confidence
vad_step_dat = op('stt_kyutai').op('vad_step_out')
confidence_chop = op('constant1')
def update_vad_visualization():
    if vad_step_dat.numRows > 1:
        latest_row = vad_step_dat.numRows - 1
        pause_2s = float(vad_step_dat[latest_row, 'Pause_2.0s'].val)
        confidence_chop.par.value0 = 1.0 - pause_2s  # Invert for speech confidence
Best Practices
Model Selection
- 1B Model: Use for multilingual scenarios, lower latency, moderate accuracy
- 2.6B Model: Use for English-only, higher accuracy, acceptable latency
- Language Setting: Use “auto” for mixed-language content, specific language for better accuracy
Performance Optimization
- GPU Usage: Enable CUDA for better performance with larger models
- Chunk Duration: Keep at default (80ms) for optimal Kyutai model performance
- VAD Threshold: Adjust based on ambient noise levels (0.3-0.7 typical range)
- Temperature: Use 0.0-0.2 for deterministic results, 0.3-0.5 for more natural variation
Audio Quality
- Sample Rate: Ensure audio is 24kHz, Kyutai’s native rate (a resampling sketch follows this list)
- Format: Use float32 format for best quality
- Buffering: Process audio in consistent chunks for smooth operation
- Noise Handling: Use VAD to filter out non-speech audio
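If your source audio is not already 24kHz float32, convert it before calling ReceiveAudioChunk(). A minimal sketch using numpy and scipy (both assumed to be available in your Python environment; the 48kHz source rate is only an example):

import numpy as np
from math import gcd
from scipy.signal import resample_poly

def to_kyutai_format(audio, source_rate=48000, target_rate=24000):
    # Kyutai models expect mono float32 at 24kHz
    audio = np.asarray(audio, dtype=np.float32)
    if source_rate != target_rate:
        g = gcd(target_rate, source_rate)
        audio = resample_poly(audio, target_rate // g, source_rate // g).astype(np.float32)
    return audio

# Example: op('stt_kyutai').ReceiveAudioChunk(to_kyutai_format(chunk, 48000))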
Troubleshooting
Common Issues
Engine Won’t Initialize
- Check that all dependencies are installed
- Verify model is downloaded locally
- Ensure ChatTD Python environment is configured
- Check device compatibility (CUDA drivers for GPU)
Poor Transcription Quality
- Verify audio sample rate is 24kHz
- Check microphone quality and positioning
- Adjust VAD threshold for ambient noise
- Use appropriate model for language content
High Latency
- Use 1B model for lower latency
- Enable GPU acceleration if available
- Reduce VAD debugging if enabled
- Check system resource availability
Missing Audio
- Verify ReceiveAudioChunk() is being called
- Check audio format (float32 required)
- Ensure “Transcription Active” is enabled
- Monitor VAD state for speech detection
Error Messages
“Dependencies missing”
- Click “Install Dependencies” button
- Restart TouchDesigner after installation
- Check ChatTD Python environment configuration
“Model not found”
- Click “Download Model” to fetch from HuggingFace
- Check internet connection and HuggingFace access
- Verify sufficient disk space for model storage
“Worker process failed”
- Check Python environment and dependencies
- Review worker logging output for specific errors
- Verify CUDA installation for GPU usage
- Try CPU device if GPU fails
“Audio format error”
- Ensure audio is float32 format
- Convert audio to 24kHz sample rate
- Check audio array dimensions and type
Advanced Features
Semantic VAD
The operator includes advanced Voice Activity Detection (a reading sketch follows the list below):
- Pause Prediction: Predicts pauses at 0.5s, 1.0s, 2.0s, and 3.0s intervals
- Speech Events: Detects speech start/end with confidence scores
- State Tracking: Maintains current speech/silence state
- Debugging: Detailed logging for VAD parameter tuning
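The pause predictions can be read back from vad_step_out much like the visualization example above. The sketch below assumes the columns follow the Pause_<interval>s naming seen there; check the header row of your vad_step_out DAT before relying on it:

# Read the latest pause predictions from vad_step_out (column names assumed)
vad_step = op('stt_kyutai').op('vad_step_out')

def latest_pause_predictions():
    if vad_step.numRows < 2:
        return {}
    row = vad_step.numRows - 1
    return {
        interval: float(vad_step[row, f'Pause_{interval}'].val)
        for interval in ('0.5s', '1.0s', '2.0s', '3.0s')
    }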
Streaming Architecture
Uses Kyutai’s Delayed Streams Modeling (a frame-chunking sketch follows the list below):
- Fixed Frame Size: 80ms frames (1920 samples at 24kHz)
- Continuous Context: Maintains context across audio chunks
- Low Latency: Optimized for real-time applications
- Memory Efficient: Manages memory usage automatically
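The frame arithmetic is simple: 24000 samples/s * 0.080 s = 1920 samples per frame. A small sketch that slices a numpy buffer into frame-sized chunks, in case your audio source delivers larger blocks than you want to hand to ReceiveAudioChunk() at once:

import numpy as np

SAMPLE_RATE = 24000
FRAME_SAMPLES = int(SAMPLE_RATE * 0.080)  # 1920 samples per 80ms frame

def iter_frames(audio: np.ndarray):
    # Yield consecutive full frames; keep any remainder for the next audio block
    for start in range(0, len(audio) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        yield audio[start:start + FRAME_SAMPLES]

# Example: for frame in iter_frames(block): op('stt_kyutai').ReceiveAudioChunk(frame)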
Model Management
Automatic model downloading and caching (a manual download sketch follows the list below):
- HuggingFace Integration: Downloads models from official repositories
- Local Caching: Stores models locally for offline use
- Version Management: Handles model updates and compatibility
- Storage Optimization: Efficient model storage and loading
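The “Download Model” pulse normally handles this, but models can also be fetched manually with huggingface_hub. A sketch, assuming the repository id mirrors the model name used in the configuration examples above; check Kyutai’s HuggingFace organization for the exact id:

from huggingface_hub import snapshot_download

# Download (or reuse the local cache of) the 1B English/French STT model
local_path = snapshot_download(repo_id="kyutai/stt-1b-en_fr")
print(f"Model files cached at: {local_path}")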
This operator provides professional-grade speech-to-text capabilities for TouchDesigner workflows, enabling sophisticated audio processing and AI integration scenarios.
Research & Licensing
Kyutai Research Foundation
Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.
Moshi: Speech-Text Foundation Model
Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.
Technical Details
- 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
- Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
- Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
- Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
- Streaming Neural Codec: Uses the Mimi codec for efficient audio compression at 1.1 kbps
  - Frame Rate: 12.5 Hz operation (80ms frames)
  - Compression: 24 kHz audio down to 1.1 kbps bandwidth
  - Streaming: Fully causal and streaming with 80ms latency
  - Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
  - Architecture: Transformer-based encoder/decoder with adversarial training
- STT model variants:
  - STT-1B-EN/FR: 1B parameter model supporting English and French with 0.5s delay
  - STT-2.6B-EN: 2.6B parameter English-only model with 2.5s delay for higher accuracy
  - Multilingual Support: Automatic language detection for mixed-language scenarios
  - Quantization: INT8 and INT4 quantized versions for efficient deployment
Research Impact
- Real-time Dialogue: Enables natural conversation with minimal latency
- Full-duplex Communication: Supports interruptions and overlapping speech
- Semantic Understanding: Advanced VAD with pause prediction capabilities
- Production Ready: Rust, Python, and MLX implementations for various platforms
Citation
@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}
Key Research Contributions
- Full-duplex spoken dialogue with dual-stream modeling
- Ultra-low latency speech processing (160ms theoretical)
- Streaming neural audio codec (Mimi) with 1.1 kbps compression
- Semantic voice activity detection with pause prediction
- Production-ready implementations across multiple platforms
License
CC-BY 4.0 - This model is freely available for research and commercial use.