TTS Kyutai Operator
Overview
The TTS Kyutai operator provides real-time text-to-speech synthesis using neural voice models from Kyutai's Moshi speech-text foundation model. It draws on Kyutai's research in full-duplex spoken dialogue systems to deliver high voice quality and low-latency performance in TouchDesigner workflows.
Built on Moshi - a 7B-parameter speech-text foundation model designed for real-time dialogue - the operator reaches theoretical latencies as low as 160ms. It offers natural-sounding speech synthesis with multiple voice options, streaming output, and integrated audio playback.
Key Features
- High-Quality Neural Synthesis: Uses Kyutai's advanced TTS models for natural speech generation
- Multiple Voice Options: Extensive voice library with different speakers and emotional expressions
- Streaming Synthesis: Real-time audio generation with progressive output
- Integrated Audio Playback: Built-in audio device management and playback
- Voice Search: Intelligent voice filtering and selection system
- Flexible Configuration: Adjustable synthesis parameters and audio settings
- Model Management: Automatic downloading and caching of TTS models and voices
Requirements
- ChatTD Operator: Required for Python environment management and async operations
- Python Dependencies: moshi (Kyutai's core library), torch (PyTorch for neural inference), huggingface_hub (model downloading)
- Hardware: CUDA-compatible GPU recommended for optimal performance
- Audio System: Audio output device for playback
Input/Output
Inputs
Section titled “Inputs”- Text Input: Text to be synthesized via parameter or
SynthesizeText()
method
Outputs
- Audio Output: store_output CHOP - Generated audio at 24kHz
- Synthesis Log: synthesis_log - History of synthesis operations
- Text Queue: text_queue - Queue of texts to be synthesized
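A minimal access sketch for these inputs and outputs, using the internal operator names listed above (the synthesis_log and text_queue outputs are assumed here to be table DATs, as the log example later on this page suggests):

# Minimal I/O access sketch
tts_op = op('tts_kyutai')

# Input: queue a line of text for synthesis
tts_op.SynthesizeText("Testing the Kyutai TTS outputs.")

# Outputs: generated audio, synthesis history, and the pending text queue
audio = tts_op.op('store_output')   # CHOP, 24kHz audio
log = tts_op.op('synthesis_log')    # table DAT, one row per synthesis
queue = tts_op.op('text_queue')     # table DAT, texts waiting to be spoken

print(f"samples: {audio.numSamples}, log rows: {log.numRows}, queued: {queue.numRows}")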
Parameters
Page: KyutaiTTS
op('tts_kyutai').par.Modelrepo
String - Default: "" (Empty String)
op('tts_kyutai').par.Voicerepo
String - Default: "" (Empty String)
op('tts_kyutai').par.Synthesize
Pulse - Default: false
op('tts_kyutai').par.Inputtext
String - Default: "" (Empty String)
op('tts_kyutai').par.Voicesearch
String - Default: "" (Empty String)
op('tts_kyutai').par.Enginestatus
String - Default: "" (Empty String)
op('tts_kyutai').par.Initialize
Pulse - Default: false
op('tts_kyutai').par.Shutdown
Pulse - Default: false
op('tts_kyutai').par.Temperature
Float - Default: 0.0
op('tts_kyutai').par.Cfgcoef
Float - Default: 0.0
op('tts_kyutai').par.Paddingbetween
Integer - Default: 0
op('tts_kyutai').par.Installdependencies
Pulse - Default: false
op('tts_kyutai').par.Downloadmodel
Pulse - Default: false
op('tts_kyutai').par.Downloadvoices
Pulse - Default: false
op('tts_kyutai').par.Initializeonstart
Toggle - Default: false
op('tts_kyutai').par.Clearqueue
Pulse - Default: false
op('tts_kyutai').par.Stopsynth
Pulse - Default: false
op('tts_kyutai').par.Streamingmode
Toggle - Default: false
Page: Playback
op('tts_kyutai').par.Audioactive
Toggle - Default: true
op('tts_kyutai').par.Volume
Float - Default: 1.0
op('tts_kyutai').par.Clearaudio
Pulse - Default: false
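These are regular TouchDesigner parameters, so they can be set or pulsed from Python. A minimal sketch (the values are illustrative, not recommendations):

# Parameter access sketch - same op()/par syntax as any TouchDesigner operator
tts = op('tts_kyutai')

# Synthesis settings
tts.par.Temperature = 0.2       # sampling temperature
tts.par.Cfgcoef = 2.0           # classifier-free guidance coefficient
tts.par.Paddingbetween = 1      # padding between queued segments
tts.par.Streamingmode = True    # stream audio as it is generated

# Playback settings
tts.par.Audioactive = True
tts.par.Volume = 0.8

# Pulse-type parameters are triggered with .pulse()
tts.par.Inputtext = "Hello from a script."
tts.par.Synthesize.pulse()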
Usage Examples
Basic Text-to-Speech
- Setup Dependencies:
  - Click "Install Dependencies" if the button shows missing requirements
  - Wait for installation to complete and restart TouchDesigner
- Initialize the Engine:
  - Click "Download Model" to fetch the TTS model
  - Click "Download Voices" to get the voice repository
  - Click "Initialize TTS Kyutai" to start the engine
- Synthesize Speech:
  - Enter text in the "Input Text" parameter
  - Select a voice from the Voice menu
  - Click "Synthesize Text" to generate speech
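The same workflow can be driven from Python. A sketch using the pulse parameters listed earlier; the downloads and initialization run asynchronously, so in practice you would wait for Enginestatus to report ready between steps rather than firing them back to back ("Ready" is the status string used elsewhere on this page):

# One-time setup sketch - each pulse kicks off an async task; watch Enginestatus
tts = op('tts_kyutai')

tts.par.Installdependencies.pulse()   # step 1: install Python dependencies
tts.par.Downloadmodel.pulse()         # step 2a: fetch the TTS model
tts.par.Downloadvoices.pulse()        # step 2b: fetch the voice repository
tts.par.Initialize.pulse()            # step 2c: start the engine

# step 3: once the engine reports ready, synthesize
if tts.par.Enginestatus.eval() == "Ready":
    tts.par.Inputtext = "Hello from TouchDesigner."
    tts.par.Synthesize.pulse()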
Voice Selection and Search
# Search for specific voice types
op('tts_kyutai').par.Voicesearch = 'happy'   # Find happy voices
op('tts_kyutai').par.Voicesearch = 'female'  # Find female voices
op('tts_kyutai').par.Voicesearch = ''        # Show all voices
# Select voice programmatically
tts_op = op('tts_kyutai')
available_voices = tts_op.par.Voice.menuNames
if 'expresso/ex03-ex01_happy_001_channel1_334s.wav' in available_voices:
    tts_op.par.Voice = 'expresso/ex03-ex01_happy_001_channel1_334s.wav'
Streaming Synthesis
# Enable streaming mode for real-time output
tts_op = op('tts_kyutai')
tts_op.par.Streamingmode = True
tts_op.par.Temperature = 0.3  # Add some variation
tts_op.par.Cfgcoef = 2.0      # Strong voice adherence

# Synthesize with streaming
tts_op.par.Inputtext = "This is a streaming synthesis example."
tts_op.par.Synthesize.pulse()

# Monitor audio output
store_output = tts_op.op('store_output')
print(f"Audio samples: {store_output.numSamples}")
Programmatic Text Synthesis
# Use the SynthesizeText method for external control
tts_op = op('tts_kyutai')

def speak_text(text):
    if tts_op.par.Enginestatus.eval() == "Ready":
        tts_op.SynthesizeText(text)
    else:
        print("TTS engine not ready")

# Synthesize multiple texts
texts = [
    "Hello, welcome to TouchDesigner.",
    "This is the Kyutai TTS operator.",
    "Enjoy high-quality speech synthesis!"
]

for text in texts:
    speak_text(text)
Audio Configuration
# Configure audio output
tts_op = op('tts_kyutai')
tts_op.par.Audioactive = True
tts_op.par.Volume = 0.8
tts_op.par.Driver = 'asio'  # Use ASIO for low latency

# Monitor synthesis log
synthesis_log = tts_op.op('synthesis_log')
for row in range(1, synthesis_log.numRows):
    time = synthesis_log[row, 'Time'].val
    text = synthesis_log[row, 'Text'].val
    status = synthesis_log[row, 'Status'].val
    print(f"{time}: {text} - {status}")
Integration Examples
With Agent Workflows
Connect TTS Kyutai to Agent operators for voice responses:
# Agent response synthesis
agent_op = op('agent1')
tts_op = op('tts_kyutai')

def on_agent_response(response_text):
    if response_text and tts_op.par.Enginestatus.eval() == "Ready":
        tts_op.SynthesizeText(response_text)

# Monitor agent output and synthesize responses
agent_output = agent_op.op('conversation_out')
# Connect this to agent's response callback
With Interactive Systems
Create responsive voice interfaces:
# Interactive voice feedback systemdef handle_user_input(user_action): tts_op = op('tts_kyutai')
responses = { 'welcome': "Welcome to the interactive system!", 'help': "You can ask me anything. I'm here to help.", 'goodbye': "Thank you for using the system. Goodbye!" }
if user_action in responses: tts_op.SynthesizeText(responses[user_action])
With Data Sonification
Convert data to speech announcements:
# Data-driven speech synthesis
def announce_data_changes(data_value, threshold):
    tts_op = op('tts_kyutai')

    if data_value > threshold:
        message = f"Alert: Value has exceeded threshold at {data_value:.2f}"
        tts_op.SynthesizeText(message)
    elif data_value < threshold * 0.5:
        message = f"Notice: Value has dropped to {data_value:.2f}"
        tts_op.SynthesizeText(message)
Best Practices
Voice Selection
- Voice Characteristics: Choose voices that match your content's tone and audience
- Language Matching: Ensure voice language matches your text content
- Emotional Context: Select voices with appropriate emotional expressions
- Consistency: Use consistent voices for coherent user experiences
Synthesis Quality
- Temperature Settings: Use 0.0-0.3 for consistent results, 0.4-0.7 for natural variation (see the sketch after this list)
- CFG Coefficient: Use 1.5-3.0 for balanced voice adherence
- Padding: Add padding between segments for natural speech flow
- Text Preparation: Clean and format text for optimal synthesis
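A sketch that applies the ranges above as two presets; the exact values are a starting point, not a rule:

# Two illustrative presets based on the ranges above
tts = op('tts_kyutai')

def preset_consistent():
    tts.par.Temperature = 0.1    # 0.0-0.3: stable, repeatable delivery
    tts.par.Cfgcoef = 2.5        # 1.5-3.0: balanced voice adherence
    tts.par.Paddingbetween = 2   # extra padding between segments

def preset_natural():
    tts.par.Temperature = 0.5    # 0.4-0.7: more natural variation
    tts.par.Cfgcoef = 1.8
    tts.par.Paddingbetween = 1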
Performance Optimization
- GPU Usage: Enable CUDA for faster synthesis with larger models
- Streaming Mode: Use streaming for real-time applications
- Queue Management: Clear queues regularly to prevent memory buildup (see the sketch after this list)
- Device Selection: Choose appropriate audio devices for your use case
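A queue-housekeeping sketch that clears the pending queue when it grows too large, assuming text_queue is a table DAT with one row per pending text (the threshold is an arbitrary example):

# Queue housekeeping sketch
tts = op('tts_kyutai')
queue = tts.op('text_queue')

if queue.numRows > 20:            # too many pending texts
    tts.par.Stopsynth.pulse()     # stop the current synthesis
    tts.par.Clearqueue.pulse()    # drop everything still queued
    print("TTS queue cleared")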
Audio Management
- Sample Rate: Kyutai TTS outputs at 24kHz - match your audio pipeline (see the check after this list)
- Volume Control: Set appropriate volume levels for your environment
- Device Compatibility: Test with different audio devices and drivers
- Buffer Management: Clear audio buffers when switching contexts
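A quick sample-rate check, assuming store_output is a CHOP whose rate attribute reflects the generated audio:

# Sample-rate sanity check - Kyutai TTS generates 24kHz audio
audio = op('tts_kyutai').op('store_output')

if int(audio.rate) != 24000:
    print(f"Warning: store_output is {audio.rate} Hz, expected 24000 Hz; "
          "add a Resample CHOP or adjust your audio pipeline accordingly.")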
Troubleshooting
Common Issues
Engine Won't Initialize
- Check that all dependencies are installed
- Verify models and voices are downloaded
- Ensure ChatTD Python environment is configured
- Check device compatibility (CUDA drivers for GPU)
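A small diagnostic sketch that prints the current engine state before retrying initialization (the status strings are whatever Enginestatus reports; "Ready" is the value used elsewhere on this page):

# Initialization diagnostic sketch
tts = op('tts_kyutai')
status = tts.par.Enginestatus.eval()
print(f"Kyutai TTS engine status: {status!r}")

if status != "Ready":
    # Re-run initialization and re-check the status afterwards
    tts.par.Initialize.pulse()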
Poor Voice Quality
- Verify voice repository is properly downloaded
- Check CFG coefficient settings (too high/low can degrade quality)
- Ensure text is properly formatted and clean
- Try different temperature settings
No Audio Output
- Check audio device selection and availability
- Verify volume settings and audio active state
- Test with different audio drivers
- Check system audio configuration
Synthesis Errors
- Review worker logging output for specific errors
- Check text content for unsupported characters
- Verify voice file exists and is accessible
- Try different voices or synthesis parameters
Error Messages
"Dependencies missing"
- Click “Install Dependencies” button
- Restart TouchDesigner after installation
- Check ChatTD Python environment configuration
“Model not found”
- Click “Download Model” to fetch from HuggingFace
- Check internet connection and HuggingFace access
- Verify sufficient disk space for model storage
“Voice repository not found”
- Click “Download Voices” to fetch voice repository
- Check internet connection and download completion
- Verify voice repository path and permissions
“Worker process failed”
- Check Python environment and dependencies
- Review worker logging output for specific errors
- Verify CUDA installation for GPU usage
- Try CPU device if GPU fails
“Audio device error”
- Check audio device availability and permissions
- Try different audio drivers (DirectSound vs ASIO)
- Verify audio device is not in use by other applications
- Check system audio configuration
Advanced Features
Voice Repository Management
The operator includes comprehensive voice management (a filtering sketch follows this list):
- Automatic Discovery: Scans voice repository for available speakers
- Search Functionality: Filters voices by name, emotion, or characteristics
- Dynamic Loading: Loads voice embeddings on demand
- Cache Management: Efficiently manages voice data in memory
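A filtering sketch, assuming discovered voices are exposed through the Voice menu as in the earlier examples; this is a manual alternative to the Voicesearch parameter:

# Voice filtering sketch
tts = op('tts_kyutai')

def find_voices(term):
    term = term.lower()
    return [v for v in tts.par.Voice.menuNames if term in v.lower()]

happy_voices = find_voices('happy')
if happy_voices:
    tts.par.Voice = happy_voices[0]
print(f"{len(happy_voices)} matching voices")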
Streaming Architecture
Advanced streaming capabilities (a monitoring sketch follows this list):
- Progressive Output: Audio frames are generated and output continuously
- Low Latency: Optimized for real-time applications
- Buffer Management: Intelligent audio buffer handling
- Frame-by-Frame Processing: Granular control over audio generation
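A monitoring sketch for progressive output, written for the onFrameEnd callback of an Execute DAT and assuming store_output grows while streaming:

# Progressive-output monitor sketch (place in an Execute DAT)
last_samples = 0

def onFrameEnd(frame):
    global last_samples
    audio = op('tts_kyutai').op('store_output')
    if audio.numSamples > last_samples:
        # New audio frames arrived since the last TouchDesigner frame
        print(f"streaming: +{audio.numSamples - last_samples} samples")
    last_samples = audio.numSamples
    return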
Model Management
Sophisticated model handling:
- HuggingFace Integration: Seamless model downloading and caching
- Version Control: Handles model updates and compatibility
- Storage Optimization: Efficient model storage and loading
- Multi-Model Support: Can work with different TTS model architectures
Audio Processing
Professional audio features:
- 24kHz Output: High-quality audio generation
- Multi-Device Support: Works with various audio interfaces
- Driver Flexibility: Support for DirectSound and ASIO drivers
- Real-time Processing: Optimized for live audio applications
This operator provides professional-grade text-to-speech capabilities for TouchDesigner workflows, enabling sophisticated voice synthesis and audio generation scenarios.
Research & Licensing
Kyutai Research Foundation
Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.
Moshi: Speech-Text Foundation Model
Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.
Technical Details
- 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
- Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
- Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
- Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
- Streaming Neural Codec: Uses Mimi codec for efficient audio compression at 1.1 kbps
- Frame Rate: 12.5 Hz operation (80ms frames)
- Compression: 24 kHz audio down to 1.1 kbps bandwidth
- Streaming: Fully causal and streaming with 80ms latency
- Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
- Architecture: Transformer-based encoder/decoder with adversarial training
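A quick check of how the frame-rate, latency, and bitrate figures above fit together (pure arithmetic, independent of the operator):

# Frame size from the 12.5 Hz frame rate, and the resulting theoretical latency
frame_rate_hz = 12.5
frame_ms = 1000 / frame_rate_hz           # 80 ms per frame
acoustic_delay_ms = 80                    # Mimi's acoustic delay
latency_ms = frame_ms + acoustic_delay_ms # 160 ms theoretical latency
bits_per_frame = 1100 / frame_rate_hz     # 1.1 kbps -> 88 bits per 80 ms frame
print(f"frame: {frame_ms:.0f} ms, latency: {latency_ms:.0f} ms, {bits_per_frame:.0f} bits/frame")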
Research Impact
- Real-time Dialogue: Enables natural conversation with minimal latency
- Full-duplex Communication: Supports interruptions and overlapping speech
- Natural Prosody: Advanced modeling of speech rhythm, stress, and intonation
- Production Ready: Rust, Python, and MLX implementations for various platforms
Citation
@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}
Key Research Contributions
- Full-duplex spoken dialogue with dual-stream modeling
- Ultra-low latency speech synthesis (160ms theoretical)
- Streaming neural audio codec (Mimi) with 1.1 kbps compression
- Natural prosody generation with semantic understanding
- Production-ready implementations across multiple platforms
License
CC-BY 4.0 - This model is freely available for research and commercial use.