
TTS Kyutai Operator

  • TCP IPC mode for robust worker communication
  • Auto worker reattachment on TouchDesigner restart
  • TCP heartbeat system for connection monitoring
  • Sophisticated audio saving with metadata and versioning
  • Clear audio method and manual save functionality
  • Enhanced audio buffering and CHOP updates

The TTS Kyutai operator provides real-time text-to-speech synthesis using Kyutai's neural voice models, built on their Moshi speech-text foundation model. It draws on Kyutai's research into full-duplex spoken dialogue systems to deliver high voice quality and low-latency performance for professional TouchDesigner workflows.

Built on Moshi - a 7B-parameter speech-text foundation model designed for real-time dialogue - the operator provides access to cutting-edge speech synthesis with theoretical latency as low as 160ms. It offers natural-sounding synthesis with multiple voice options, streaming output, and integrated audio playback.

  • High-Quality Neural Synthesis: Uses Kyutai’s advanced TTS models for natural speech generation
  • Multiple Voice Options: Extensive voice library with different speakers and emotional expressions
  • Streaming Synthesis: Real-time audio generation with progressive output
  • Integrated Audio Playback: Built-in audio device management and playback
  • Voice Search: Intelligent voice filtering and selection system
  • Flexible Configuration: Adjustable synthesis parameters and audio settings
  • Model Management: Automatic downloading and caching of TTS models and voices
  • ChatTD Operator: Required for Python environment management and async operations
  • Python Dependencies:
    • moshi (Kyutai’s core library)
    • torch (PyTorch for neural inference)
    • huggingface_hub (Model downloading)
  • Hardware: CUDA-compatible GPU recommended for optimal performance
  • Audio System: Audio output device for playback
  • Text Input: Text to be synthesized via parameter or SynthesizeText() method
  • Audio Output: store_output CHOP - Generated audio at 24kHz
  • Synthesis Log: synthesis_log - History of synthesis operations
  • Text Queue: text_queue - Queue of texts to be synthesized
Status (Status): op('tts_kyutai').par.Status (Str). Default: None
Generate Speech (Texttospeech): op('tts_kyutai').par.Texttospeech (Pulse). Default: None
Input Text (Inputtext): op('tts_kyutai').par.Inputtext (Str). Default: None
Initialize TTS Kyutai (Initialize): op('tts_kyutai').par.Initialize (Pulse). Default: None
Shutdown TTS Kyutai (Shutdown): op('tts_kyutai').par.Shutdown (Pulse). Default: None
Initialize On Start (Initializeonstart): op('tts_kyutai').par.Initializeonstart (Toggle). Default: None
Extend Current Audio (Appendtooutput): op('tts_kyutai').par.Appendtooutput (Toggle). Default: None
Voice (Voice): op('tts_kyutai').par.Voice (StrMenu). Default: None
Search Voices (Voicesearch): op('tts_kyutai').par.Voicesearch (Str). Default: None
TTS Kyutai (Enginestatus): op('tts_kyutai').par.Enginestatus (Str). Default: None
Streaming Mode (Streamingmode): op('tts_kyutai').par.Streamingmode (Toggle). Default: None
Temperature (Temperature): op('tts_kyutai').par.Temperature (Float). Default: 0.0. Range: 0 to 1
CFG Coefficient (Cfgcoef): op('tts_kyutai').par.Cfgcoef (Float). Default: 0.0. Range: 0.5 to 4
Padding Between (sec) (Paddingbetween): op('tts_kyutai').par.Paddingbetween (Int). Default: 0. Range: 0 to 5
Clear Queue (Clearqueue): op('tts_kyutai').par.Clearqueue (Pulse). Default: None
Stop Synthesis (Stopsynth): op('tts_kyutai').par.Stopsynth (Pulse). Default: None
Clear Audio Buffers (Clearaudio): op('tts_kyutai').par.Clearaudio (Pulse). Default: None

Audio Device Settings

Reset Playback (Resetpulse): op('tts_kyutai').par.Resetpulse (Pulse). Default: None
Active (Audioactive): op('tts_kyutai').par.Audioactive (Toggle). Default: true
Driver (Driver): op('tts_kyutai').par.Driver (Menu). Default: default
Device (Audiodevice): op('tts_kyutai').par.Audiodevice (Menu). Default: default
Volume (Volume): op('tts_kyutai').par.Volume (Float). Default: 1.0. Range: 0 to 1
Auto Save To Disk (Autosavetodisk): op('tts_kyutai').par.Autosavetodisk (Toggle). Default: None
Save Folder (Folder): op('tts_kyutai').par.Folder (Folder). Default: None
Base Name (Name): op('tts_kyutai').par.Name (Str). Default: None
File Type (Filetype): op('tts_kyutai').par.Filetype (Menu). Default: wav
Auto Version Files (Autoversion): op('tts_kyutai').par.Autoversion (Toggle). Default: None
Save Current Audio (Savefile): op('tts_kyutai').par.Savefile (Pulse). Default: None
Dependencies Available (Installdependencies): op('tts_kyutai').par.Installdependencies (Pulse). Default: None
Model Repository (Modelrepo): op('tts_kyutai').par.Modelrepo (Str). Default: None
Download Model (Downloadmodel): op('tts_kyutai').par.Downloadmodel (Pulse). Default: None
Voice Repository (Voicerepo): op('tts_kyutai').par.Voicerepo (Str). Default: None
Download Voices (Downloadvoices): op('tts_kyutai').par.Downloadvoices (Pulse). Default: None

Worker Connection Settings

IPC Mode (Ipcmode): op('tts_kyutai').par.Ipcmode (Menu). Default: tcp
Monitor Worker Logs (stderr) (Monitorworkerlogs): op('tts_kyutai').par.Monitorworkerlogs (Toggle). Default: None
Auto Reattach On Init (Autoreattachoninit): op('tts_kyutai').par.Autoreattachoninit (Toggle). Default: None
Force Attach (Skip PID Check) (Forceattachoninit): op('tts_kyutai').par.Forceattachoninit (Toggle). Default: None
Worker Logging Level (Workerlogging): op('tts_kyutai').par.Workerlogging (Menu). Default: OFF
Device (Device): op('tts_kyutai').par.Device (Menu). Default: auto
Bypass (Bypass): op('tts_kyutai').par.Bypass (Toggle). Default: None
Show Built-in Parameters (Showbuiltin): op('tts_kyutai').par.Showbuiltin (Toggle). Default: None
Version (Version): op('tts_kyutai').par.Version (Str). Default: None
Last Updated (Lastupdated): op('tts_kyutai').par.Lastupdated (Str). Default: None
Creator (Creator): op('tts_kyutai').par.Creator (Str). Default: None
Website (Website): op('tts_kyutai').par.Website (Str). Default: None
ChatTD Operator (Chattd): op('tts_kyutai').par.Chattd (OP). Default: None
  1. Setup Dependencies:

    • Click “Install Dependencies” if the button shows missing requirements
    • Wait for installation to complete and restart TouchDesigner
  2. Initialize the Engine:

    • Click “Download Model” to fetch the TTS model
    • Click “Download Voices” to get the voice repository
    • Click “Initialize TTS Kyutai” to start the engine
  3. Synthesize Speech:

    • Enter text in the “Input Text” parameter
    • Select a voice from the Voice menu
    • Pulse “Generate Speech” to synthesize
  1. Enable Streaming Mode on the KyutaiTTS page.
  2. Enter text in the Input Text parameter.
  3. Pulse Generate Speech.
  4. The audio will be generated in chunks and played back as it arrives.
  • Voice Characteristics: Choose voices that match your content’s tone and audience
  • Language Matching: Ensure voice language matches your text content
  • Emotional Context: Select voices with appropriate emotional expressions
  • Consistency: Use consistent voices for coherent user experiences
  • Temperature Settings: Use 0.0-0.3 for consistent results, 0.4-0.7 for natural variation
  • CFG Coefficient: Use 1.5-3.0 for balanced voice adherence
  • Padding: Add padding between segments for natural speech flow
  • Text Preparation: Clean and format text for optimal synthesis
  • GPU Usage: Enable CUDA for faster synthesis with larger models
  • Streaming Mode: Use streaming for real-time applications
  • Queue Management: Clear queues regularly to prevent memory buildup
  • Device Selection: Choose appropriate audio devices for your use case
  • Sample Rate: Kyutai TTS outputs at 24kHz - match your audio pipeline
  • Volume Control: Set appropriate volume levels for your environment
  • Device Compatibility: Test with different audio devices and drivers
  • Buffer Management: Clear audio buffers when switching contexts
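The tuning guidance above can be captured in a small helper. A sketch using the parameter names from this page; the specific values sit inside the recommended ranges quoted above, and the function itself is illustrative, not part of the operator:

```python
def apply_recommended_settings(tts, consistent=True):
    """Apply the tuning guidance above to a tts_kyutai operator.

    `tts` is the operator (e.g. op('tts_kyutai') inside TouchDesigner).
    consistent=True favors repeatable output (0.0-0.3 temperature);
    False allows natural variation (0.4-0.7).
    """
    tts.par.Temperature = 0.2 if consistent else 0.6
    tts.par.Cfgcoef = 2.0        # inside the 1.5-3.0 balanced-adherence range
    tts.par.Paddingbetween = 1   # one second between queued segments
    return tts
```
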

Engine Won’t Initialize

  • Check that all dependencies are installed
  • Verify models and voices are downloaded
  • Ensure ChatTD Python environment is configured
  • Check device compatibility (CUDA drivers for GPU)

Poor Voice Quality

  • Verify voice repository is properly downloaded
  • Check CFG coefficient settings (too high/low can degrade quality)
  • Ensure text is properly formatted and clean
  • Try different temperature settings

No Audio Output

  • Check audio device selection and availability
  • Verify volume settings and audio active state
  • Test with different audio drivers
  • Check system audio configuration

Synthesis Errors

  • Review worker logging output for specific errors
  • Check text content for unsupported characters
  • Verify voice file exists and is accessible
  • Try different voices or synthesis parameters

“Dependencies missing”

  • Click “Install Dependencies” button
  • Restart TouchDesigner after installation
  • Check ChatTD Python environment configuration

“Model not found”

  • Click “Download Model” to fetch from HuggingFace
  • Check internet connection and HuggingFace access
  • Verify sufficient disk space for model storage

“Voice repository not found”

  • Click “Download Voices” to fetch voice repository
  • Check internet connection and download completion
  • Verify voice repository path and permissions

“Worker process failed”

  • Check Python environment and dependencies
  • Review worker logging output for specific errors
  • Verify CUDA installation for GPU usage
  • Try CPU device if GPU fails

“Audio device error”

  • Check audio device availability and permissions
  • Try different audio drivers (DirectSound vs ASIO)
  • Verify audio device is not in use by other applications
  • Check system audio configuration

The operator includes comprehensive voice management:

  • Automatic Discovery: Scans voice repository for available speakers
  • Search Functionality: Filters voices by name, emotion, or characteristics
  • Dynamic Loading: Loads voice embeddings on demand
  • Cache Management: Efficiently manages voice data in memory

Advanced streaming capabilities:

  • Progressive Output: Audio frames are generated and output continuously
  • Low Latency: Optimized for real-time applications
  • Buffer Management: Intelligent audio buffer handling
  • Frame-by-Frame Processing: Granular control over audio generation
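The frame-by-frame buffering described above can be sketched with a simple accumulator. This is not the operator's internal implementation, just an illustration of progressive chunked output at Kyutai's 80 ms / 24 kHz frame size:

```python
FRAME_SAMPLES = 1920  # 80 ms at 24 kHz (24000 * 0.080)

class StreamBuffer:
    """Accumulates incoming samples and releases fixed-size frames."""

    def __init__(self):
        self._pending = []

    def push(self, samples):
        """Add samples; return every complete 1920-sample frame now available."""
        self._pending.extend(samples)
        frames = []
        while len(self._pending) >= FRAME_SAMPLES:
            frames.append(self._pending[:FRAME_SAMPLES])
            self._pending = self._pending[FRAME_SAMPLES:]
        return frames
```

Partial chunks stay buffered until enough samples arrive to complete a frame, which is how progressive output can start playback before synthesis finishes.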

Sophisticated model handling:

  • HuggingFace Integration: Seamless model downloading and caching
  • Version Control: Handles model updates and compatibility
  • Storage Optimization: Efficient model storage and loading
  • Multi-Model Support: Can work with different TTS model architectures

Professional audio features:

  • 24kHz Output: High-quality audio generation
  • Multi-Device Support: Works with various audio interfaces
  • Driver Flexibility: Support for DirectSound and ASIO drivers
  • Real-time Processing: Optimized for live audio applications

This operator provides professional-grade text-to-speech capabilities for TouchDesigner workflows, enabling sophisticated voice synthesis and audio generation scenarios.

Research & Licensing

Kyutai Research Foundation

Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.

Moshi: Speech-Text Foundation Model

Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.

Technical Details

  • 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
  • Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
  • Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
  • Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
  • Streaming Neural Codec: Uses Mimi codec for efficient audio compression at 1.1 kbps
  • Frame Rate: 12.5 Hz operation (80ms frames)
  • Compression: 24 kHz audio down to 1.1 kbps bandwidth
  • Streaming: Fully causal and streaming with 80ms latency
  • Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
  • Architecture: Transformer-based encoder/decoder with adversarial training
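The latency and frame figures above are mutually consistent, which a quick calculation confirms (all numbers come straight from the bullets; the script is just arithmetic):

```python
SAMPLE_RATE = 24_000  # Hz, Mimi's audio rate
FRAME_RATE = 12.5     # Hz, codec frame rate

frame_ms = 1000 / FRAME_RATE                   # 80.0 ms per frame
samples_per_frame = SAMPLE_RATE / FRAME_RATE   # 1920 samples per frame
theoretical_latency_ms = frame_ms + 80         # + 80 ms acoustic delay = 160 ms

print(frame_ms, samples_per_frame, theoretical_latency_ms)
```
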

Research Impact

  • Real-time Dialogue: Enables natural conversation with minimal latency
  • Full-duplex Communication: Supports interruptions and overlapping speech
  • Natural Prosody: Advanced modeling of speech rhythm, stress, and intonation
  • Production Ready: Rust, Python, and MLX implementations for various platforms

Citation

@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and
  Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}

Key Research Contributions

  • Full-duplex spoken dialogue with dual-stream modeling
  • Ultra-low latency speech synthesis (160ms theoretical)
  • Streaming neural audio codec (Mimi) with 1.1 kbps compression
  • Natural prosody generation with semantic understanding
  • Production-ready implementations across multiple platforms

License

CC-BY 4.0 - This model is freely available for research and commercial use.