TTS Kyutai Operator

The TTS Kyutai operator provides real-time text-to-speech synthesis using Kyutai’s advanced neural voice models from their groundbreaking Moshi speech-text foundation model. This operator leverages Kyutai’s state-of-the-art research in full-duplex spoken dialogue systems, delivering exceptional voice quality and ultra-low latency performance for professional TouchDesigner workflows.

Built on Kyutai’s Moshi foundation model - a 7B-parameter speech-text model designed for real-time dialogue - this operator provides access to cutting-edge speech synthesis with a theoretical latency as low as 160ms. It offers high-quality, natural-sounding synthesis with multiple voice options, streaming output, and integrated audio playback.

  • High-Quality Neural Synthesis: Uses Kyutai’s advanced TTS models for natural speech generation
  • Multiple Voice Options: Extensive voice library with different speakers and emotional expressions
  • Streaming Synthesis: Real-time audio generation with progressive output
  • Integrated Audio Playback: Built-in audio device management and playback
  • Voice Search: Intelligent voice filtering and selection system
  • Flexible Configuration: Adjustable synthesis parameters and audio settings
  • Model Management: Automatic downloading and caching of TTS models and voices

Requirements

  • ChatTD Operator: Required for Python environment management and async operations
  • Python Dependencies:
    • moshi (Kyutai’s core library)
    • torch (PyTorch for neural inference)
    • huggingface_hub (Model downloading)
  • Hardware: CUDA-compatible GPU recommended for optimal performance
  • Audio System: Audio output device for playback

Inputs and Outputs

  • Text Input: Text to be synthesized via parameter or SynthesizeText() method
  • Audio Output: store_output CHOP - Generated audio at 24kHz
  • Synthesis Log: synthesis_log - History of synthesis operations
  • Text Queue: text_queue - Queue of texts to be synthesized
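
These outputs can be inspected directly from the Textport or a script. A minimal sketch follows; whether text_queue carries a header row is an assumption here.

# Inspect the operator's outputs
tts_op = op('tts_kyutai')

audio = tts_op.op('store_output')      # generated audio CHOP, 24 kHz
print(f"Audio: {audio.numSamples} samples at {audio.rate} Hz")

log = tts_op.op('synthesis_log')       # synthesis history table
print(f"Synthesis log entries: {log.numRows - 1}")   # minus the header row

queue = tts_op.op('text_queue')        # pending texts
print(f"Queued texts: {queue.numRows - 1}")          # assumes a header row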

Parameters

Model Repository (Modelrepo)  op('tts_kyutai').par.Modelrepo  String
  Default: "" (Empty String)
Voice Repository (Voicerepo)  op('tts_kyutai').par.Voicerepo  String
  Default: "" (Empty String)
Synthesize Text (Synthesize)  op('tts_kyutai').par.Synthesize  Pulse
  Default: false
Input Text (Inputtext)  op('tts_kyutai').par.Inputtext  String
  Default: "" (Empty String)
Search Voices (Voicesearch)  op('tts_kyutai').par.Voicesearch  String
  Default: "" (Empty String)
TTS Kyutai (Enginestatus)  op('tts_kyutai').par.Enginestatus  String
  Default: "" (Empty String)
Initialize TTS Kyutai (Initialize)  op('tts_kyutai').par.Initialize  Pulse
  Default: false
Shutdown TTS Kyutai (Shutdown)  op('tts_kyutai').par.Shutdown  Pulse
  Default: false
Device (Device)  op('tts_kyutai').par.Device  Menu
  Default: auto    Options: auto, cpu, cuda
Temperature (Temperature)  op('tts_kyutai').par.Temperature  Float
  Default: 0.0
CFG Coefficient (Cfgcoef)  op('tts_kyutai').par.Cfgcoef  Float
  Default: 0.0
Padding Between (sec) (Paddingbetween)  op('tts_kyutai').par.Paddingbetween  Integer
  Default: 0
Dependencies Available (Installdependencies)  op('tts_kyutai').par.Installdependencies  Pulse
  Default: false
Download Model (Downloadmodel)  op('tts_kyutai').par.Downloadmodel  Pulse
  Default: false
Download Voices (Downloadvoices)  op('tts_kyutai').par.Downloadvoices  Pulse
  Default: false
Initialize On Start (Initializeonstart)  op('tts_kyutai').par.Initializeonstart  Toggle
  Default: false
Worker Logging Level (Workerlogging)  op('tts_kyutai').par.Workerlogging  Menu
  Default: OFF    Options: OFF, CRITICAL, ERROR, WARNING, INFO, DEBUG
Clear Queue (Clearqueue)  op('tts_kyutai').par.Clearqueue  Pulse
  Default: false
Stop Synthesis (Stopsynth)  op('tts_kyutai').par.Stopsynth  Pulse
  Default: false
Streaming Mode (Streamingmode)  op('tts_kyutai').par.Streamingmode  Toggle
  Default: false
Audio Device Settings  Header
Active (Audioactive)  op('tts_kyutai').par.Audioactive  Toggle
  Default: true
Driver (Driver)  op('tts_kyutai').par.Driver  Menu
  Default: default    Options: default, asio
Device (Audiodevice)  op('tts_kyutai').par.Audiodevice  Menu
  Default: default
Volume (Volume)  op('tts_kyutai').par.Volume  Float
  Default: 1.0
Clear Audio Buffers (Clearaudio)  op('tts_kyutai').par.Clearaudio  Pulse
  Default: false

Quick Start

  1. Setup Dependencies:

    • Click “Install Dependencies” if the button shows missing requirements
    • Wait for installation to complete and restart TouchDesigner
  2. Initialize the Engine:

    • Click “Download Model” to fetch the TTS model
    • Click “Download Voices” to get the voice repository
    • Click “Initialize TTS Kyutai” to start the engine
  3. Synthesize Speech:

    • Enter text in “Input Text” parameter
    • Select a voice from the Voice menu
    • Click “Synthesize Text” to generate speech
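
The same sequence can also be driven from a script. A minimal sketch using the parameters described above; downloads and initialization run asynchronously, so in practice run each step after the previous one has finished.

# Scripted version of the quick-start steps above
tts_op = op('tts_kyutai')

# 1. Fetch the TTS model and the voice repository
tts_op.par.Downloadmodel.pulse()
tts_op.par.Downloadvoices.pulse()

# 2. Start the engine (or enable Initialize On Start instead)
tts_op.par.Initialize.pulse()

# 3. Synthesize once the engine reports it is ready
if tts_op.par.Enginestatus.eval() == "Ready":
    tts_op.par.Inputtext = "Hello from a script."
    tts_op.par.Synthesize.pulse()
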
# Search for specific voice types
op('tts_kyutai').par.Voicesearch = 'happy' # Find happy voices
op('tts_kyutai').par.Voicesearch = 'female' # Find female voices
op('tts_kyutai').par.Voicesearch = '' # Show all voices
# Select voice programmatically
tts_op = op('tts_kyutai')
available_voices = tts_op.par.Voice.menuNames
if 'expresso/ex03-ex01_happy_001_channel1_334s.wav' in available_voices:
    tts_op.par.Voice = 'expresso/ex03-ex01_happy_001_channel1_334s.wav'
# Enable streaming mode for real-time output
tts_op = op('tts_kyutai')
tts_op.par.Streamingmode = True
tts_op.par.Temperature = 0.3 # Add some variation
tts_op.par.Cfgcoef = 2.0 # Strong voice adherence
# Synthesize with streaming
tts_op.par.Inputtext = "This is a streaming synthesis example."
tts_op.par.Synthesize.pulse()
# Monitor audio output
store_output = tts_op.op('store_output')
print(f"Audio samples: {store_output.numSamples}")
# Use the SynthesizeText method for external control
tts_op = op('tts_kyutai')
def speak_text(text):
    if tts_op.par.Enginestatus.eval() == "Ready":
        tts_op.SynthesizeText(text)
    else:
        print("TTS engine not ready")
# Synthesize multiple texts
texts = [
"Hello, welcome to TouchDesigner.",
"This is the Kyutai TTS operator.",
"Enjoy high-quality speech synthesis!"
]
for text in texts:
speak_text(text)
# Configure audio output
tts_op = op('tts_kyutai')
tts_op.par.Audioactive = True
tts_op.par.Volume = 0.8
tts_op.par.Driver = 'asio' # Use ASIO for low latency
# Monitor synthesis log
synthesis_log = tts_op.op('synthesis_log')
for row in range(1, synthesis_log.numRows):
    time = synthesis_log[row, 'Time'].val
    text = synthesis_log[row, 'Text'].val
    status = synthesis_log[row, 'Status'].val
    print(f"{time}: {text} - {status}")

Connect TTS Kyutai to Agent operators for voice responses:

# Agent response synthesis
agent_op = op('agent1')
tts_op = op('tts_kyutai')
def on_agent_response(response_text):
    if response_text and tts_op.par.Enginestatus.eval() == "Ready":
        tts_op.SynthesizeText(response_text)
# Monitor agent output and synthesize responses
agent_output = agent_op.op('conversation_out')
# Connect this to agent's response callback
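
One way to wire this up is a DAT Execute DAT pointed at the agent's conversation_out table. This is a sketch that assumes the newest response sits in the last row and that the column is named 'message' (adjust to the actual table layout).

# Place in a DAT Execute DAT whose DATs parameter points to conversation_out
def onTableChange(dat):
    tts_op = op('tts_kyutai')
    if dat.numRows < 2:
        return
    # 'message' is a hypothetical column name - match it to the real table
    response_text = dat[dat.numRows - 1, 'message'].val
    if response_text and tts_op.par.Enginestatus.eval() == "Ready":
        tts_op.SynthesizeText(response_text)
    return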

Create responsive voice interfaces:

# Interactive voice feedback system
def handle_user_input(user_action):
    tts_op = op('tts_kyutai')
    responses = {
        'welcome': "Welcome to the interactive system!",
        'help': "You can ask me anything. I'm here to help.",
        'goodbye': "Thank you for using the system. Goodbye!"
    }
    if user_action in responses:
        tts_op.SynthesizeText(responses[user_action])
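
As a usage sketch, the function above could be triggered from Button COMPs through a Panel Execute DAT. The button names here are hypothetical, and handle_user_input is assumed to be accessible from the callback's scope.

# Place in a Panel Execute DAT watching the relevant Button COMPs
def onOffToOn(panelValue):
    # Hypothetical mapping from button names to actions
    button_actions = {
        'button_welcome': 'welcome',
        'button_help': 'help',
        'button_goodbye': 'goodbye',
    }
    action = button_actions.get(panelValue.owner.name)
    if action:
        handle_user_input(action)
    return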

Convert data to speech announcements:

# Data-driven speech synthesis
def announce_data_changes(data_value, threshold):
    tts_op = op('tts_kyutai')
    if data_value > threshold:
        message = f"Alert: Value has exceeded threshold at {data_value:.2f}"
        tts_op.SynthesizeText(message)
    elif data_value < threshold * 0.5:
        message = f"Notice: Value has dropped to {data_value:.2f}"
        tts_op.SynthesizeText(message)
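
To drive this from live data, a CHOP Execute DAT can call the function whenever the monitored channel changes. The channel wiring and threshold value below are assumptions for the sketch.

# Place in a CHOP Execute DAT watching the data channel to announce
THRESHOLD = 10.0  # hypothetical threshold for this example

def onValueChange(channel, sampleIndex, val, prev):
    announce_data_changes(val, THRESHOLD)
    return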

Best Practices

  • Voice Characteristics: Choose voices that match your content’s tone and audience
  • Language Matching: Ensure voice language matches your text content
  • Emotional Context: Select voices with appropriate emotional expressions
  • Consistency: Use consistent voices for coherent user experiences
  • Temperature Settings: Use 0.0-0.3 for consistent results, 0.4-0.7 for natural variation (applied in the sketch after this list)
  • CFG Coefficient: Use 1.5-3.0 for balanced voice adherence
  • Padding: Add padding between segments for natural speech flow
  • Text Preparation: Clean and format text for optimal synthesis
  • GPU Usage: Enable CUDA for faster synthesis with larger models
  • Streaming Mode: Use streaming for real-time applications
  • Queue Management: Clear queues regularly to prevent memory buildup
  • Device Selection: Choose appropriate audio devices for your use case
  • Sample Rate: Kyutai TTS outputs at 24kHz - match your audio pipeline
  • Volume Control: Set appropriate volume levels for your environment
  • Device Compatibility: Test with different audio devices and drivers
  • Buffer Management: Clear audio buffers when switching contexts
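
A minimal sketch applying the synthesis-parameter recommendations above; the exact values are examples within the suggested ranges.

# Apply synthesis settings within the recommended ranges
tts_op = op('tts_kyutai')
tts_op.par.Temperature = 0.2     # 0.0-0.3 for consistent, repeatable output
tts_op.par.Cfgcoef = 2.0         # 1.5-3.0 for balanced voice adherence
tts_op.par.Paddingbetween = 1    # one second of padding between queued segments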

Troubleshooting

Engine Won’t Initialize

  • Check that all dependencies are installed
  • Verify models and voices are downloaded
  • Ensure ChatTD Python environment is configured
  • Check device compatibility (CUDA drivers for GPU)

Poor Voice Quality

  • Verify voice repository is properly downloaded
  • Check CFG coefficient settings (too high/low can degrade quality)
  • Ensure text is properly formatted and clean
  • Try different temperature settings

No Audio Output

  • Check audio device selection and availability
  • Verify volume settings and audio active state
  • Test with different audio drivers
  • Check system audio configuration

Synthesis Errors

  • Review worker logging output for specific errors
  • Check text content for unsupported characters
  • Verify voice file exists and is accessible
  • Try different voices or synthesis parameters

“Dependencies missing”

  • Click “Install Dependencies” button
  • Restart TouchDesigner after installation
  • Check ChatTD Python environment configuration

“Model not found”

  • Click “Download Model” to fetch from HuggingFace
  • Check internet connection and HuggingFace access
  • Verify sufficient disk space for model storage

“Voice repository not found”

  • Click “Download Voices” to fetch voice repository
  • Check internet connection and download completion
  • Verify voice repository path and permissions

“Worker process failed”

  • Check Python environment and dependencies
  • Review worker logging output for specific errors
  • Verify CUDA installation for GPU usage
  • Try CPU device if GPU fails

“Audio device error”

  • Check audio device availability and permissions
  • Try different audio drivers (DirectSound vs ASIO)
  • Verify audio device is not in use by other applications
  • Check system audio configuration

The operator includes comprehensive voice management:

  • Automatic Discovery: Scans voice repository for available speakers
  • Search Functionality: Filters voices by name, emotion, or characteristics
  • Dynamic Loading: Loads voice embeddings on demand
  • Cache Management: Efficiently manages voice data in memory

Advanced streaming capabilities:

  • Progressive Output: Audio frames are generated and output continuously
  • Low Latency: Optimized for real-time applications
  • Buffer Management: Intelligent audio buffer handling
  • Frame-by-Frame Processing: Granular control over audio generation

Sophisticated model handling:

  • HuggingFace Integration: Seamless model downloading and caching
  • Version Control: Handles model updates and compatibility
  • Storage Optimization: Efficient model storage and loading
  • Multi-Model Support: Can work with different TTS model architectures
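
As a rough illustration of the HuggingFace integration: the real download logic runs inside the operator's worker process, but this sketch shows the underlying huggingface_hub call, using whatever repo ids the Modelrepo and Voicerepo parameters hold.

# Sketch: fetching model and voice repositories with huggingface_hub
from huggingface_hub import snapshot_download

tts_op = op('tts_kyutai')
model_repo = tts_op.par.Modelrepo.eval()
voice_repo = tts_op.par.Voicerepo.eval()

if model_repo:
    model_path = snapshot_download(repo_id=model_repo)   # downloads once, then served from cache
    print(f"Model files cached at: {model_path}")
if voice_repo:
    voice_path = snapshot_download(repo_id=voice_repo)
    print(f"Voice files cached at: {voice_path}")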

Professional audio features:

  • 24kHz Output: High-quality audio generation
  • Multi-Device Support: Works with various audio interfaces
  • Driver Flexibility: Support for DirectSound and ASIO drivers
  • Real-time Processing: Optimized for live audio applications
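
Because generated audio is fixed at 24kHz, it is worth confirming the rate and duration before mixing it into a pipeline running at a different sample rate. A minimal check:

# Check the generated audio's sample rate and duration
audio = op('tts_kyutai').op('store_output')
duration = audio.numSamples / audio.rate if audio.rate else 0.0
print(f"{audio.numSamples} samples at {audio.rate} Hz = {duration:.2f} s")
if audio.rate != 24000:
    print("Unexpected sample rate - Kyutai TTS normally outputs 24 kHz")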

This operator provides professional-grade text-to-speech capabilities for TouchDesigner workflows, enabling sophisticated voice synthesis and audio generation scenarios.

Research & Licensing

Kyutai Research Foundation

Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.

Moshi: Speech-Text Foundation Model

Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.

Technical Details

  • 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
  • Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
  • Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
  • Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
  • Streaming Neural Codec: Uses Mimi codec for efficient audio compression at 1.1 kbps

Mimi Neural Codec

  • Frame Rate: 12.5 Hz operation (80ms frames)
  • Compression: 24 kHz audio down to 1.1 kbps bandwidth
  • Streaming: Fully causal and streaming with 80ms latency
  • Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
  • Architecture: Transformer-based encoder/decoder with adversarial training

Research Impact

  • Real-time Dialogue: Enables natural conversation with minimal latency
  • Full-duplex Communication: Supports interruptions and overlapping speech
  • Natural Prosody: Advanced modeling of speech rhythm, stress, and intonation
  • Production Ready: Rust, Python, and MLX implementations for various platforms

Citation

@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and
  Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}

Key Research Contributions

  • Full-duplex spoken dialogue with dual-stream modeling
  • Ultra-low latency speech synthesis (160ms theoretical)
  • Streaming neural audio codec (Mimi) with 1.1 kbps compression
  • Natural prosody generation with semantic understanding
  • Production-ready implementations across multiple platforms

License

CC-BY 4.0 - This model is freely available for research and commercial use.