TTS Kyutai Operator
TTS Kyutai v1.1.0 (September 2, 2025)
- TCP IPC mode for robust worker communication
- Auto worker reattachment on TouchDesigner restart
- TCP heartbeat system for connection monitoring
- Sophisticated audio saving with metadata and versioning
- Clear audio method and manual save functionality
- Enhanced audio buffering and CHOP updates
Overview
The TTS Kyutai operator provides real-time text-to-speech synthesis using Kyutai’s neural voice models, built on their Moshi speech-text foundation model. It draws on Kyutai’s research into full-duplex spoken dialogue systems to deliver high voice quality and low latency in professional TouchDesigner workflows.
Moshi is a 7B-parameter speech-text foundation model designed for real-time dialogue, with a theoretical latency as low as 160 ms. Through it, the operator offers natural-sounding speech synthesis with multiple voice options, streaming capabilities, and integrated audio playback.
Key Features
- High-Quality Neural Synthesis: Uses Kyutai’s advanced TTS models for natural speech generation
- Multiple Voice Options: Extensive voice library with different speakers and emotional expressions
- Streaming Synthesis: Real-time audio generation with progressive output
- Integrated Audio Playback: Built-in audio device management and playback
- Voice Search: Intelligent voice filtering and selection system
- Flexible Configuration: Adjustable synthesis parameters and audio settings
- Model Management: Automatic downloading and caching of TTS models and voices
Requirements
- ChatTD Operator: Required for Python environment management and async operations
- Python Dependencies:
  - moshi (Kyutai’s core library)
  - torch (PyTorch for neural inference)
  - huggingface_hub (model downloading)
- Hardware: CUDA-compatible GPU recommended for optimal performance
- Audio System: Audio output device for playback
Input/Output
Inputs
- Text Input: Text to be synthesized via the Input Text parameter or the SynthesizeText() method (see the sketch below)
Outputs
- Audio Output: store_output CHOP - generated audio at 24kHz
- Synthesis Log: synthesis_log - history of synthesis operations
- Text Queue: text_queue - queue of texts to be synthesized
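A minimal scripted example, assuming the operator is named tts_kyutai. SynthesizeText() is the method named above; the internal path to store_output is an assumption, so check your component layout:

```python
# Queue a line of text for synthesis on the TTS Kyutai operator.
tts = op('tts_kyutai')
tts.SynthesizeText('Hello from TouchDesigner.')

# Generated audio lands in the store_output CHOP at 24kHz.
# The relative path here is an assumption about the component's internals.
audio = tts.op('store_output')
if audio is not None:
    debug(f'{audio.numChans} channel(s), {audio.numSamples} samples buffered')
```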
Parameters
Page: KyutaiTTS
- op('tts_kyutai').par.Status (Str) - Default: None
- op('tts_kyutai').par.Texttospeech (Pulse) - Default: None
- op('tts_kyutai').par.Inputtext (Str) - Default: None
- op('tts_kyutai').par.Initialize (Pulse) - Default: None
- op('tts_kyutai').par.Shutdown (Pulse) - Default: None
- op('tts_kyutai').par.Initializeonstart (Toggle) - Default: None
- op('tts_kyutai').par.Appendtooutput (Toggle) - Default: None
- op('tts_kyutai').par.Voicesearch (Str) - Default: None
- op('tts_kyutai').par.Enginestatus (Str) - Default: None
- op('tts_kyutai').par.Streamingmode (Toggle) - Default: None
- op('tts_kyutai').par.Temperature (Float) - Default: 0.0 - Range: 0 to 1
- op('tts_kyutai').par.Cfgcoef (Float) - Default: 0.0 - Range: 0.5 to 4
- op('tts_kyutai').par.Paddingbetween (Int) - Default: 0 - Range: 0 to 5
- op('tts_kyutai').par.Clearqueue (Pulse) - Default: None
- op('tts_kyutai').par.Stopsynth (Pulse) - Default: None
- op('tts_kyutai').par.Clearaudio (Pulse) - Default: None
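A sketch of driving these parameters from a script, assuming the operator sits at op('tts_kyutai'); the exact strings reported by Enginestatus are an assumption:

```python
# Set the input text and trigger synthesis, mirroring the UI workflow.
tts = op('tts_kyutai')
tts.par.Inputtext = 'Welcome to the show.'
tts.par.Texttospeech.pulse()  # equivalent to clicking "Generate Speech"

# Inspect the engine state before queuing more work.
debug('Engine status:', tts.par.Enginestatus.eval())
```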
Page: Playback
- op('tts_kyutai').par.Resetpulse (Pulse) - Default: None
- op('tts_kyutai').par.Audioactive (Toggle) - Default: true
- op('tts_kyutai').par.Volume (Float) - Default: 1.0 - Range: 0 to 1
- op('tts_kyutai').par.Autosavetodisk (Toggle) - Default: None
- op('tts_kyutai').par.Folder (Folder) - Default: None
- op('tts_kyutai').par.Name (Str) - Default: None
- op('tts_kyutai').par.Autoversion (Toggle) - Default: None
- op('tts_kyutai').par.Savefile (Pulse) - Default: None
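A sketch of configuring versioned auto-saving and forcing a manual save with the Playback parameters above; the folder path is illustrative:

```python
# Save every synthesized clip to disk with auto-incremented filenames.
tts = op('tts_kyutai')
tts.par.Autosavetodisk = True
tts.par.Folder = project.folder + '/tts_output'  # any writable folder
tts.par.Name = 'line'
tts.par.Autoversion = True  # appends a version number per take
tts.par.Savefile.pulse()    # manual save of the current audio buffer
```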
Page: Install/Settings
- op('tts_kyutai').par.Installdependencies (Pulse) - Default: None
- op('tts_kyutai').par.Modelrepo (Str) - Default: None
- op('tts_kyutai').par.Downloadmodel (Pulse) - Default: None
- op('tts_kyutai').par.Voicerepo (Str) - Default: None
- op('tts_kyutai').par.Downloadvoices (Pulse) - Default: None
- op('tts_kyutai').par.Monitorworkerlogs (Toggle) - Default: None
- op('tts_kyutai').par.Autoreattachoninit (Toggle) - Default: None
- op('tts_kyutai').par.Forceattachoninit (Toggle) - Default: None
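One-time setup can also be pulsed from a script; a sketch assuming each step is allowed to finish before the next (watch the Status parameter between pulses):

```python
# Install dependencies, then fetch the model and voice repository.
tts = op('tts_kyutai')
tts.par.Installdependencies.pulse()  # restart TouchDesigner after this completes
tts.par.Downloadmodel.pulse()
tts.par.Downloadvoices.pulse()
```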
Page: About
- op('tts_kyutai').par.Bypass (Toggle) - Default: None
- op('tts_kyutai').par.Showbuiltin (Toggle) - Default: None
- op('tts_kyutai').par.Version (Str) - Default: None
- op('tts_kyutai').par.Lastupdated (Str) - Default: None
- op('tts_kyutai').par.Creator (Str) - Default: None
- op('tts_kyutai').par.Website (Str) - Default: None
- op('tts_kyutai').par.Chattd (OP) - Default: None
Usage Examples
Basic Text-to-Speech
1. Setup Dependencies:
   - Click “Install Dependencies” if the button shows missing requirements
   - Wait for installation to complete and restart TouchDesigner
2. Initialize the Engine:
   - Click “Download Model” to fetch the TTS model
   - Click “Download Voices” to get the voice repository
   - Click “Initialize TTS Kyutai” to start the engine
3. Synthesize Speech:
   - Enter text in the “Input Text” parameter
   - Select a voice from the Voice menu
   - Click “Generate Speech”
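The same flow can be scripted. A minimal sketch, assuming repeated SynthesizeText() calls are queued in order (the text_queue output above suggests this behavior):

```python
# Initialize the engine once, then queue several lines of dialogue.
tts = op('tts_kyutai')
tts.par.Initialize.pulse()

lines = [
    'First line of narration.',
    'Second line, synthesized after the first.',
]
for line in lines:
    tts.SynthesizeText(line)  # pending texts show up in the text_queue output
```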
Streaming Synthesis
- Enable Streaming Mode on the KyutaiTTS page.
- Enter text in the Input Text parameter.
- Pulse Generate Speech.
- The audio is generated in chunks and played back as it arrives.
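The scripted equivalent, as a sketch using the parameters from the reference above:

```python
# Enable streaming so playback starts before synthesis finishes.
tts = op('tts_kyutai')
tts.par.Streamingmode = True
tts.par.Inputtext = 'Streaming delivers audio in chunks as they are generated.'
tts.par.Texttospeech.pulse()
```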
Best Practices
Voice Selection
- Voice Characteristics: Choose voices that match your content’s tone and audience
- Language Matching: Ensure voice language matches your text content
- Emotional Context: Select voices with appropriate emotional expressions
- Consistency: Use consistent voices for coherent user experiences
Synthesis Quality
- Temperature Settings: Use 0.0-0.3 for consistent results, 0.4-0.7 for natural variation
- CFG Coefficient: Use 1.5-3.0 for balanced voice adherence (see the sketch after this list)
- Padding: Add padding between segments for natural speech flow
- Text Preparation: Clean and format text for optimal synthesis
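A starting configuration reflecting these ranges, as a sketch (the values are illustrative, not tuned):

```python
# Conservative quality settings per the recommendations above.
tts = op('tts_kyutai')
tts.par.Temperature = 0.2   # 0.0-0.3 for consistent delivery
tts.par.Cfgcoef = 2.0       # 1.5-3.0 for balanced voice adherence
tts.par.Paddingbetween = 1  # short gap between queued segments
```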
Performance Optimization
- GPU Usage: Enable CUDA for faster synthesis with larger models
- Streaming Mode: Use streaming for real-time applications
- Queue Management: Clear queues regularly to prevent memory buildup (scripted below)
- Device Selection: Choose appropriate audio devices for your use case
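Queue housekeeping can be scripted with the pulses from the parameter reference; a minimal reset sketch:

```python
# Stop in-flight synthesis and flush pending texts and buffered audio.
tts = op('tts_kyutai')
tts.par.Stopsynth.pulse()
tts.par.Clearqueue.pulse()
tts.par.Clearaudio.pulse()
```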
Audio Management
- Sample Rate: Kyutai TTS outputs at 24kHz - match your audio pipeline
- Volume Control: Set appropriate volume levels for your environment
- Device Compatibility: Test with different audio devices and drivers
- Buffer Management: Clear audio buffers when switching contexts
Troubleshooting
Common Issues
Engine Won’t Initialize
- Check that all dependencies are installed
- Verify models and voices are downloaded
- Ensure ChatTD Python environment is configured
- Check device compatibility (CUDA drivers for GPU)
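A quick diagnostic sketch that reads the status parameters named in the reference above (the exact status strings are implementation-defined, so treat them as opaque):

```python
# Dump the operator's reported state when initialization stalls.
tts = op('tts_kyutai')
debug('Status:', tts.par.Status.eval())
debug('Engine status:', tts.par.Enginestatus.eval())
tts.par.Monitorworkerlogs = True  # surface worker-side errors in the logs
```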
Poor Voice Quality
- Verify voice repository is properly downloaded
- Check CFG coefficient settings (too high/low can degrade quality)
- Ensure text is properly formatted and clean
- Try different temperature settings
No Audio Output
- Check audio device selection and availability
- Verify volume settings and audio active state
- Test with different audio drivers
- Check system audio configuration
Synthesis Errors
- Review worker logging output for specific errors
- Check text content for unsupported characters
- Verify voice file exists and is accessible
- Try different voices or synthesis parameters
Error Messages
“Dependencies missing”
- Click “Install Dependencies” button
- Restart TouchDesigner after installation
- Check ChatTD Python environment configuration
“Model not found”
- Click “Download Model” to fetch from HuggingFace
- Check internet connection and HuggingFace access
- Verify sufficient disk space for model storage
“Voice repository not found”
- Click “Download Voices” to fetch voice repository
- Check internet connection and download completion
- Verify voice repository path and permissions
“Worker process failed”
- Check Python environment and dependencies
- Review worker logging output for specific errors
- Verify CUDA installation for GPU usage
- Try CPU device if GPU fails
“Audio device error”
- Check audio device availability and permissions
- Try different audio drivers (DirectSound vs ASIO)
- Verify audio device is not in use by other applications
- Check system audio configuration
Advanced Features
Voice Repository Management
The operator includes comprehensive voice management:
- Automatic Discovery: Scans voice repository for available speakers
- Search Functionality: Filters voices by name, emotion, or characteristics (see the sketch below)
- Dynamic Loading: Loads voice embeddings on demand
- Cache Management: Efficiently manages voice data in memory
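Filtering can be driven from the Voicesearch parameter; a sketch (the matching syntax is an assumption - plain substrings are a safe starting point):

```python
# Narrow the voice list to entries matching a search term.
tts = op('tts_kyutai')
tts.par.Voicesearch = 'happy'  # e.g. filter by emotion or speaker name
```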
Streaming Architecture
Advanced streaming capabilities:
- Progressive Output: Audio frames are generated and output continuously
- Low Latency: Optimized for real-time applications
- Buffer Management: Intelligent audio buffer handling
- Frame-by-Frame Processing: Granular control over audio generation
Model Management
Sophisticated model handling:
- HuggingFace Integration: Seamless model downloading and caching
- Version Control: Handles model updates and compatibility
- Storage Optimization: Efficient model storage and loading
- Multi-Model Support: Can work with different TTS model architectures
Audio Processing
Professional audio features:
- 24kHz Output: High-quality audio generation
- Multi-Device Support: Works with various audio interfaces
- Driver Flexibility: Support for DirectSound and ASIO drivers
- Real-time Processing: Optimized for live audio applications
This operator provides professional-grade text-to-speech capabilities for TouchDesigner workflows, enabling sophisticated voice synthesis and audio generation scenarios.
Research & Licensing
Kyutai Research Foundation
Kyutai is an AI research organization focused on advancing speech and language technologies. Their flagship model, Moshi, is a speech-text foundation model that enables real-time conversational AI with high quality and responsiveness.
Moshi: Speech-Text Foundation Model
Kyutai's Moshi is a speech-text foundation model that processes two audio streams simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.
Technical Details
- 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
- Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
- Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
- Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay; see the worked numbers after this list)
- Streaming Neural Codec: Uses the Mimi codec for efficient audio compression
  - Frame Rate: 12.5 Hz operation (80ms frames)
  - Compression: 24 kHz audio down to 1.1 kbps bandwidth
  - Streaming: Fully causal and streaming with 80ms latency
  - Quality: Outperforms existing non-streaming codecs such as SpeechTokenizer and SemantiCodec
  - Architecture: Transformer-based encoder/decoder with adversarial training
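As a check on these figures: at 12.5 Hz each frame spans 1/12.5 s, and the theoretical latency is one frame plus the acoustic delay:

```python
# Reproduce the frame-size and latency figures quoted above.
frame_rate_hz = 12.5
frame_ms = 1000 / frame_rate_hz            # 80.0 ms per frame
acoustic_delay_ms = 80
latency_ms = frame_ms + acoustic_delay_ms  # 160.0 ms theoretical latency
print(frame_ms, latency_ms)
```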
Research Impact
- Real-time Dialogue: Enables natural conversation with minimal latency
- Full-duplex Communication: Supports interruptions and overlapping speech
- Natural Prosody: Advanced modeling of speech rhythm, stress, and intonation
- Production Ready: Rust, Python, and MLX implementations for various platforms
Citation
@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}
Key Research Contributions
- Full-duplex spoken dialogue with dual-stream modeling
- Ultra-low latency speech synthesis (160ms theoretical)
- Streaming neural audio codec (Mimi) with 1.1 kbps compression
- Natural prosody generation with semantic understanding
- Production-ready implementations across multiple platforms
License
CC-BY 4.0 - This model is freely available for research and commercial use.