TTS Kyutai Operator
Overview
The TTS Kyutai operator provides real-time text-to-speech synthesis using neural voice models from Kyutai's Moshi speech-text foundation model. It draws on Kyutai's research in full-duplex spoken dialogue systems to deliver high voice quality and low-latency performance in TouchDesigner workflows.
Built on Moshi - a 7B-parameter speech-text foundation model designed for real-time dialogue - the operator reaches theoretical latencies as low as 160ms. It offers natural-sounding speech synthesis with multiple voice options, streaming output, and integrated audio playback.
Key Features
- High-Quality Neural Synthesis: Uses Kyutai's advanced TTS models for natural speech generation
- Multiple Voice Options: Extensive voice library with different speakers and emotional expressions
- Streaming Synthesis: Real-time audio generation with progressive output
- Integrated Audio Playback: Built-in audio device management and playback
- Voice Search: Intelligent voice filtering and selection system
- Flexible Configuration: Adjustable synthesis parameters and audio settings
- Model Management: Automatic downloading and caching of TTS models and voices
Requirements
- ChatTD Operator: Required for Python environment management and async operations
- Python Dependencies: moshi (Kyutai's core library), torch (PyTorch for neural inference), huggingface_hub (model downloading)
- Hardware: CUDA-compatible GPU recommended for optimal performance
- Audio System: Audio output device for playback
Input/Output
Inputs
Section titled “Inputs”- Text Input: Text to be synthesized via parameter or
SynthesizeText()
method
Outputs
- Audio Output: store_output CHOP - Generated audio at 24kHz
- Synthesis Log: synthesis_log - History of synthesis operations
- Text Queue: text_queue - Queue of texts to be synthesized
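A minimal access sketch for these inputs and outputs, using the internal operator names listed above (the synthesis_log and text_queue outputs are assumed here to be table DATs, as the log example later on this page suggests):

# Minimal I/O access sketch
tts_op = op('tts_kyutai')

# Input: queue a line of text for synthesis
tts_op.SynthesizeText("Testing the Kyutai TTS outputs.")

# Outputs: generated audio, synthesis history, and the pending text queue
audio = tts_op.op('store_output')   # CHOP, 24kHz audio
log = tts_op.op('synthesis_log')    # table DAT, one row per synthesis
queue = tts_op.op('text_queue')     # table DAT, texts waiting to be spoken

print(f"samples: {audio.numSamples}, log rows: {log.numRows}, queued: {queue.numRows}")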
Parameters
Page: KyutaiTTS
op('tts_kyutai').par.Modelrepo
String - Default: "" (Empty String)
op('tts_kyutai').par.Voicerepo
String - Default: "" (Empty String)
op('tts_kyutai').par.Synthesize
Pulse - Default: false
op('tts_kyutai').par.Inputtext
String - Default: "" (Empty String)
op('tts_kyutai').par.Voicesearch
String - Default: "" (Empty String)
op('tts_kyutai').par.Enginestatus
String - Default: "" (Empty String)
op('tts_kyutai').par.Initialize
Pulse - Default: false
op('tts_kyutai').par.Shutdown
Pulse - Default: false
op('tts_kyutai').par.Temperature
Float - Default: 0.0
op('tts_kyutai').par.Cfgcoef
Float - Default: 0.0
op('tts_kyutai').par.Paddingbetween
Integer - Default: 0
op('tts_kyutai').par.Installdependencies
Pulse - Default: false
op('tts_kyutai').par.Downloadmodel
Pulse - Default: false
op('tts_kyutai').par.Downloadvoices
Pulse - Default: false
op('tts_kyutai').par.Initializeonstart
Toggle - Default: false
op('tts_kyutai').par.Clearqueue
Pulse - Default: false
op('tts_kyutai').par.Stopsynth
Pulse - Default: false
op('tts_kyutai').par.Streamingmode
Toggle - Default: false
Page: Playback
op('tts_kyutai').par.Audioactive
Toggle - Default: true
op('tts_kyutai').par.Volume
Float - Default: 1.0
op('tts_kyutai').par.Clearaudio
Pulse - Default: false
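These are regular TouchDesigner parameters, so they can be set or pulsed from Python. A minimal sketch (the values are illustrative, not recommendations):

# Parameter access sketch - same op()/par syntax as any TouchDesigner operator
tts = op('tts_kyutai')

# Synthesis settings
tts.par.Temperature = 0.2       # sampling temperature
tts.par.Cfgcoef = 2.0           # classifier-free guidance coefficient
tts.par.Paddingbetween = 1      # padding between queued segments
tts.par.Streamingmode = True    # stream audio as it is generated

# Playback settings
tts.par.Audioactive = True
tts.par.Volume = 0.8

# Pulse-type parameters are triggered with .pulse()
tts.par.Inputtext = "Hello from a script."
tts.par.Synthesize.pulse()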
Usage Examples
Basic Text-to-Speech
- Setup Dependencies:
  - Click "Install Dependencies" if the button shows missing requirements
  - Wait for installation to complete and restart TouchDesigner
- Initialize the Engine:
  - Click "Download Model" to fetch the TTS model
  - Click "Download Voices" to get the voice repository
  - Click "Initialize TTS Kyutai" to start the engine
- Synthesize Speech:
  - Enter text in the "Input Text" parameter
  - Select a voice from the Voice menu
  - Click "Synthesize Text" to generate speech
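The same workflow can be driven from Python. A sketch using the pulse parameters listed earlier; the downloads and initialization run asynchronously, so in practice you would wait for Enginestatus to report ready between steps rather than firing them back to back ("Ready" is the status string used elsewhere on this page):

# One-time setup sketch - each pulse kicks off an async task; watch Enginestatus
tts = op('tts_kyutai')

tts.par.Installdependencies.pulse()   # step 1: install Python dependencies
tts.par.Downloadmodel.pulse()         # step 2a: fetch the TTS model
tts.par.Downloadvoices.pulse()        # step 2b: fetch the voice repository
tts.par.Initialize.pulse()            # step 2c: start the engine

# step 3: once the engine reports ready, synthesize
if tts.par.Enginestatus.eval() == "Ready":
    tts.par.Inputtext = "Hello from TouchDesigner."
    tts.par.Synthesize.pulse()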
Voice Selection and Search
# Search for specific voice types
op('tts_kyutai').par.Voicesearch = 'happy'   # Find happy voices
op('tts_kyutai').par.Voicesearch = 'female'  # Find female voices
op('tts_kyutai').par.Voicesearch = ''        # Show all voices
# Select voice programmatically
tts_op = op('tts_kyutai')
available_voices = tts_op.par.Voice.menuNames
if 'expresso/ex03-ex01_happy_001_channel1_334s.wav' in available_voices:
    tts_op.par.Voice = 'expresso/ex03-ex01_happy_001_channel1_334s.wav'
Streaming Synthesis
# Enable streaming mode for real-time output
tts_op = op('tts_kyutai')
tts_op.par.Streamingmode = True
tts_op.par.Temperature = 0.3  # Add some variation
tts_op.par.Cfgcoef = 2.0      # Strong voice adherence

# Synthesize with streaming
tts_op.par.Inputtext = "This is a streaming synthesis example."
tts_op.par.Synthesize.pulse()

# Monitor audio output
store_output = tts_op.op('store_output')
print(f"Audio samples: {store_output.numSamples}")
Programmatic Text Synthesis
# Use the SynthesizeText method for external control
tts_op = op('tts_kyutai')

def speak_text(text):
    if tts_op.par.Enginestatus.eval() == "Ready":
        tts_op.SynthesizeText(text)
    else:
        print("TTS engine not ready")

# Synthesize multiple texts
texts = [
    "Hello, welcome to TouchDesigner.",
    "This is the Kyutai TTS operator.",
    "Enjoy high-quality speech synthesis!"
]

for text in texts:
    speak_text(text)
Audio Configuration
# Configure audio output
tts_op = op('tts_kyutai')
tts_op.par.Audioactive = True
tts_op.par.Volume = 0.8
tts_op.par.Driver = 'asio'  # Use ASIO for low latency

# Monitor synthesis log
synthesis_log = tts_op.op('synthesis_log')
for row in range(1, synthesis_log.numRows):
    time = synthesis_log[row, 'Time'].val
    text = synthesis_log[row, 'Text'].val
    status = synthesis_log[row, 'Status'].val
    print(f"{time}: {text} - {status}")
Integration Examples
With Agent Workflows
Connect TTS Kyutai to Agent operators for voice responses:
# Agent response synthesis
agent_op = op('agent1')
tts_op = op('tts_kyutai')

def on_agent_response(response_text):
    if response_text and tts_op.par.Enginestatus.eval() == "Ready":
        tts_op.SynthesizeText(response_text)

# Monitor agent output and synthesize responses
agent_output = agent_op.op('conversation_out')
# Connect this to agent's response callback
With Interactive Systems
Create responsive voice interfaces:
# Interactive voice feedback systemdef handle_user_input(user_action): tts_op = op('tts_kyutai')
responses = { 'welcome': "Welcome to the interactive system!", 'help': "You can ask me anything. I'm here to help.", 'goodbye': "Thank you for using the system. Goodbye!" }
if user_action in responses: tts_op.SynthesizeText(responses[user_action])
With Data Sonification
Convert data to speech announcements:
# Data-driven speech synthesis
def announce_data_changes(data_value, threshold):
    tts_op = op('tts_kyutai')

    if data_value > threshold:
        message = f"Alert: Value has exceeded threshold at {data_value:.2f}"
        tts_op.SynthesizeText(message)
    elif data_value < threshold * 0.5:
        message = f"Notice: Value has dropped to {data_value:.2f}"
        tts_op.SynthesizeText(message)
Best Practices
Voice Selection
- Voice Characteristics: Choose voices that match your content's tone and audience
- Language Matching: Ensure voice language matches your text content
- Emotional Context: Select voices with appropriate emotional expressions
- Consistency: Use consistent voices for coherent user experiences
Synthesis Quality
- Temperature Settings: Use 0.0-0.3 for consistent results, 0.4-0.7 for natural variation (see the sketch after this list)
- CFG Coefficient: Use 1.5-3.0 for balanced voice adherence
- Padding: Add padding between segments for natural speech flow
- Text Preparation: Clean and format text for optimal synthesis
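A sketch that applies the ranges above as two presets; the exact values are a starting point, not a rule:

# Two illustrative presets based on the ranges above
tts = op('tts_kyutai')

def preset_consistent():
    tts.par.Temperature = 0.1    # 0.0-0.3: stable, repeatable delivery
    tts.par.Cfgcoef = 2.5        # 1.5-3.0: balanced voice adherence
    tts.par.Paddingbetween = 2   # extra padding between segments

def preset_natural():
    tts.par.Temperature = 0.5    # 0.4-0.7: more natural variation
    tts.par.Cfgcoef = 1.8
    tts.par.Paddingbetween = 1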
Performance Optimization
- GPU Usage: Enable CUDA for faster synthesis with larger models
- Streaming Mode: Use streaming for real-time applications
- Queue Management: Clear queues regularly to prevent memory buildup (see the sketch after this list)
- Device Selection: Choose appropriate audio devices for your use case
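A queue-housekeeping sketch that clears the pending queue when it grows too large, assuming text_queue is a table DAT with one row per pending text (the threshold is an arbitrary example):

# Queue housekeeping sketch
tts = op('tts_kyutai')
queue = tts.op('text_queue')

if queue.numRows > 20:            # too many pending texts
    tts.par.Stopsynth.pulse()     # stop the current synthesis
    tts.par.Clearqueue.pulse()    # drop everything still queued
    print("TTS queue cleared")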
Audio Management
- Sample Rate: Kyutai TTS outputs at 24kHz - match your audio pipeline (see the check after this list)
- Volume Control: Set appropriate volume levels for your environment
- Device Compatibility: Test with different audio devices and drivers
- Buffer Management: Clear audio buffers when switching contexts
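A quick sample-rate check, assuming store_output is a CHOP whose rate attribute reflects the generated audio:

# Sample-rate sanity check - Kyutai TTS generates 24kHz audio
audio = op('tts_kyutai').op('store_output')

if int(audio.rate) != 24000:
    print(f"Warning: store_output is {audio.rate} Hz, expected 24000 Hz; "
          "add a Resample CHOP or adjust your audio pipeline accordingly.")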
Troubleshooting
Common Issues
Engine Won't Initialize
- Check that all dependencies are installed
- Verify models and voices are downloaded
- Ensure ChatTD Python environment is configured
- Check device compatibility (CUDA drivers for GPU)
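A small diagnostic sketch that prints the current engine state before retrying initialization (the status strings are whatever Enginestatus reports; "Ready" is the value used elsewhere on this page):

# Initialization diagnostic sketch
tts = op('tts_kyutai')
status = tts.par.Enginestatus.eval()
print(f"Kyutai TTS engine status: {status!r}")

if status != "Ready":
    # Re-run initialization and re-check the status afterwards
    tts.par.Initialize.pulse()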
Poor Voice Quality
- Verify voice repository is properly downloaded
- Check CFG coefficient settings (too high/low can degrade quality)
- Ensure text is properly formatted and clean
- Try different temperature settings
No Audio Output
- Check audio device selection and availability
- Verify volume settings and audio active state
- Test with different audio drivers
- Check system audio configuration
Synthesis Errors
- Review worker logging output for specific errors
- Check text content for unsupported characters
- Verify voice file exists and is accessible
- Try different voices or synthesis parameters
Error Messages
"Dependencies missing"
- Click “Install Dependencies” button
- Restart TouchDesigner after installation
- Check ChatTD Python environment configuration
“Model not found”
- Click “Download Model” to fetch from HuggingFace
- Check internet connection and HuggingFace access
- Verify sufficient disk space for model storage
“Voice repository not found”
- Click “Download Voices” to fetch voice repository
- Check internet connection and download completion
- Verify voice repository path and permissions
“Worker process failed”
- Check Python environment and dependencies
- Review worker logging output for specific errors
- Verify CUDA installation for GPU usage
- Try CPU device if GPU fails
“Audio device error”
- Check audio device availability and permissions
- Try different audio drivers (DirectSound vs ASIO)
- Verify audio device is not in use by other applications
- Check system audio configuration
Advanced Features
Voice Repository Management
The operator includes comprehensive voice management (a filtering sketch follows this list):
- Automatic Discovery: Scans voice repository for available speakers
- Search Functionality: Filters voices by name, emotion, or characteristics
- Dynamic Loading: Loads voice embeddings on demand
- Cache Management: Efficiently manages voice data in memory
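A filtering sketch, assuming discovered voices are exposed through the Voice menu as in the earlier examples; this is a manual alternative to the Voicesearch parameter:

# Voice filtering sketch
tts = op('tts_kyutai')

def find_voices(term):
    term = term.lower()
    return [v for v in tts.par.Voice.menuNames if term in v.lower()]

happy_voices = find_voices('happy')
if happy_voices:
    tts.par.Voice = happy_voices[0]
print(f"{len(happy_voices)} matching voices")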
Streaming Architecture
Advanced streaming capabilities (a monitoring sketch follows this list):
- Progressive Output: Audio frames are generated and output continuously
- Low Latency: Optimized for real-time applications
- Buffer Management: Intelligent audio buffer handling
- Frame-by-Frame Processing: Granular control over audio generation
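A monitoring sketch for progressive output, written for the onFrameEnd callback of an Execute DAT and assuming store_output grows while streaming:

# Progressive-output monitor sketch (place in an Execute DAT)
last_samples = 0

def onFrameEnd(frame):
    global last_samples
    audio = op('tts_kyutai').op('store_output')
    if audio.numSamples > last_samples:
        # New audio frames arrived since the last TouchDesigner frame
        print(f"streaming: +{audio.numSamples - last_samples} samples")
    last_samples = audio.numSamples
    return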
Model Management
Sophisticated model handling:
- HuggingFace Integration: Seamless model downloading and caching
- Version Control: Handles model updates and compatibility
- Storage Optimization: Efficient model storage and loading
- Multi-Model Support: Can work with different TTS model architectures
Audio Processing
Professional audio features:
- 24kHz Output: High-quality audio generation
- Multi-Device Support: Works with various audio interfaces
- Driver Flexibility: Support for DirectSound and ASIO drivers
- Real-time Processing: Optimized for live audio applications
This operator provides professional-grade text-to-speech capabilities for TouchDesigner workflows, enabling sophisticated voice synthesis and audio generation scenarios.
Research & Licensing
Kyutai Research Foundation
Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.
Moshi: Speech-Text Foundation Model
Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.
Technical Details
- 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
- Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
- Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
- Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
- Streaming Neural Codec: Uses Mimi codec for efficient audio compression at 1.1 kbps
- Frame Rate: 12.5 Hz operation (80ms frames)
- Compression: 24 kHz audio down to 1.1 kbps bandwidth
- Streaming: Fully causal and streaming with 80ms latency
- Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
- Architecture: Transformer-based encoder/decoder with adversarial training
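A quick check of how the frame-rate, latency, and bitrate figures above fit together (pure arithmetic, independent of the operator):

# Frame size from the 12.5 Hz frame rate, and the resulting theoretical latency
frame_rate_hz = 12.5
frame_ms = 1000 / frame_rate_hz           # 80 ms per frame
acoustic_delay_ms = 80                    # Mimi's acoustic delay
latency_ms = frame_ms + acoustic_delay_ms # 160 ms theoretical latency
bits_per_frame = 1100 / frame_rate_hz     # 1.1 kbps -> 88 bits per 80 ms frame
print(f"frame: {frame_ms:.0f} ms, latency: {latency_ms:.0f} ms, {bits_per_frame:.0f} bits/frame")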
Research Impact
- Real-time Dialogue: Enables natural conversation with minimal latency
- Full-duplex Communication: Supports interruptions and overlapping speech
- Natural Prosody: Advanced modeling of speech rhythm, stress, and intonation
- Production Ready: Rust, Python, and MLX implementations for various platforms
Citation
@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}
Key Research Contributions
- Full-duplex spoken dialogue with dual-stream modeling
- Ultra-low latency speech synthesis (160ms theoretical)
- Streaming neural audio codec (Mimi) with 1.1 kbps compression
- Natural prosody generation with semantic understanding
- Production-ready implementations across multiple platforms
License
CC-BY 4.0 - This model is freely available for research and commercial use.