STT Kyutai Operator
Overview
The STT Kyutai operator provides real-time speech-to-text transcription using Kyutai’s neural models, which grew out of their Moshi speech-text foundation model. Building on Kyutai’s research into full-duplex spoken dialogue systems, it delivers accurate, low-latency transcription for TouchDesigner workflows.
Built on the same foundations as Kyutai’s Moshi - a 7B parameter speech-text foundation model designed for real-time dialogue - this operator exposes a speech recognition stack with a theoretical framework latency as low as 160ms (the STT models themselves add 0.5s-2.5s of delay depending on the variant). The operator supports both English and French with semantic Voice Activity Detection (VAD), streaming transcription, and comprehensive audio processing capabilities.
Key Features
- Advanced Neural Models: Uses Kyutai’s 1B and 2.6B parameter models for high-quality transcription
- Multi-Language Support: English and French with automatic language detection
- Semantic VAD: Intelligent voice activity detection with pause prediction
- Real-time Streaming: Continuous transcription with low latency
- Flexible Audio Processing: Configurable chunk duration and temperature settings
- Comprehensive Logging: Detailed VAD events, state tracking, and performance metrics
- GPU Acceleration: CUDA support for improved performance
Requirements
- ChatTD Operator: Required for Python environment management and async operations
- Python Dependencies (see the install sketch after this list):
  - moshi (Kyutai’s core library)
  - julius (audio processing)
  - torch (PyTorch for neural inference)
  - huggingface_hub (model downloading)
- Hardware: CUDA-compatible GPU recommended for optimal performance
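For reference, the “Install Dependencies” pulse can be reproduced manually. A minimal sketch, assuming it is run from a regular Python prompt in the same environment that ChatTD is configured to use (not from inside TouchDesigner, where sys.executable points at TouchDesigner itself):

# Manual equivalent of the "Install Dependencies" pulse.
# Run from the Python interpreter that ChatTD manages, not inside TouchDesigner.
import subprocess
import sys

subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "moshi", "julius", "torch", "huggingface_hub",
])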
Input/Output
Inputs
- Audio Input: Receives audio chunks via the ReceiveAudioChunk() method (24kHz float32 format)
Outputs
- Transcription Text: transcription_out - Continuous text output
- Segments Table: segments_out - Individual segments with timing and confidence (see the reading sketch after this list)
- VAD Step Data: vad_step_out - Pause predictions at different intervals
- VAD Events: vad_events_out - Speech start/end events
- VAD State: vad_state_out - Current voice activity state
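The table-style outputs can be read with standard DAT cell indexing, while transcription_out is read as plain text (see the agent example further below). A small sketch that dumps the segments table without assuming specific column names, since only the VAD tables’ columns are documented here:

# Print every data row of segments_out as a dict keyed by its header row
segments = op('stt_kyutai').op('segments_out')

if segments.numRows > 1:
    headers = [cell.val for cell in segments.row(0)]
    for r in range(1, segments.numRows):
        values = [cell.val for cell in segments.row(r)]
        print(dict(zip(headers, values)))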
Parameters
Page: KyutaiSTT
- op('stt_kyutai').par.Enginestatus: String - Default: "" (Empty String)
- op('stt_kyutai').par.Initialize: Pulse - Default: false
- op('stt_kyutai').par.Shutdown: Pulse - Default: false
- op('stt_kyutai').par.Active: Toggle - Default: false
- op('stt_kyutai').par.Chunkduration: Float - Default: 0.0
- op('stt_kyutai').par.Temperature: Float - Default: 0.0
- op('stt_kyutai').par.Usevad: Toggle - Default: false
- op('stt_kyutai').par.Vadthreshold: Float - Default: 0.0
- op('stt_kyutai').par.Vaddebugging: Toggle - Default: false
- op('stt_kyutai').par.Vadmaxrows: Integer - Default: 0
- op('stt_kyutai').par.Installdependencies: Pulse - Default: false
- op('stt_kyutai').par.Downloadmodel: Pulse - Default: false
- op('stt_kyutai').par.Initializeonstart: Toggle - Default: false
- op('stt_kyutai').par.Cleartranscript: Pulse - Default: false
- op('stt_kyutai').par.Copytranscript: Pulse - Default: false
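These parameters can also be driven from scripts: in TouchDesigner, pulse parameters are fired with .pulse() and toggles are assigned directly. A minimal sketch (the operator path assumes the default name stt_kyutai):

stt = op('stt_kyutai')

# Fire pulse parameters
stt.par.Initialize.pulse()

# Set toggle parameters
stt.par.Active = True
stt.par.Usevad = True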
Usage Examples
Basic Real-time Transcription
- Setup Dependencies:
  - Click “Install Dependencies” if the button shows missing requirements
  - Wait for installation to complete and restart TouchDesigner
- Initialize the Engine:
  - Select the desired model size (1B for multilingual, 2.6B for English-only)
  - Click “Initialize STT Kyutai”
  - If the model is missing, choose to download it
- Start Transcription:
  - Enable “Transcription Active”
  - Send audio chunks using ReceiveAudioChunk(audio_array)
  - Monitor transcription output in the transcription_out DAT
Voice Activity Detection
# Enable VAD with custom threshold
op('stt_kyutai').par.Usevad = True
op('stt_kyutai').par.Vadthreshold = 0.7
op('stt_kyutai').par.Vaddebugging = True
# Monitor VAD events
vad_events = op('stt_kyutai').op('vad_events_out')
for row in range(1, vad_events.numRows):
    event_type = vad_events[row, 'Event_Type'].val
    timestamp = vad_events[row, 'Timestamp'].val
    if event_type == 'speech_start':
        print(f"Speech detected at {timestamp}s")
Audio Processing Integration
# Process audio from Audio Device In CHOP
audio_device = op('audiodevicein1')
stt_kyutai = op('stt_kyutai')
# Convert CHOP data to numpy array for processing
def process_audio():
    if audio_device.numSamples > 0:
        # Get audio as numpy array (24kHz float32)
        audio_data = audio_device['chan1'].numpyArray()
        stt_kyutai.ReceiveAudioChunk(audio_data)
# Call this from a Timer CHOP or Execute DAT
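One way to drive process_audio() regularly is a CHOP Execute DAT watching a Timer CHOP. The callback below uses TouchDesigner’s standard CHOP Execute signature; the DAT name holding process_audio() is a placeholder for wherever you defined it:

# CHOP Execute DAT attached to a Timer CHOP (hypothetical operator names)
def onOffToOn(channel, sampleIndex, val, prev):
    # 'audio_utils' is assumed to be the Text DAT containing process_audio()
    op('audio_utils').module.process_audio()
    return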
Language-Specific Configuration
# Configure for French transcription
op('stt_kyutai').par.Modelsize = 'stt-1b-en_fr'
op('stt_kyutai').par.Language = 'fr'
op('stt_kyutai').par.Temperature = 0.1  # Lower temperature for more accurate French
# Configure for high-accuracy English
op('stt_kyutai').par.Modelsize = 'stt-2.6b-en'
op('stt_kyutai').par.Language = 'en'
op('stt_kyutai').par.Temperature = 0.0  # Deterministic output
Integration Examples
With Agent Workflows
Connect STT Kyutai to Agent operators for voice-controlled AI:
# Monitor transcription and trigger agent
transcription_dat = op('stt_kyutai').op('transcription_out')
agent_op = op('agent1')
def on_transcription_change():
    text = transcription_dat.text.strip()
    if text and len(text) > 10:  # Minimum length threshold
        agent_op.SendMessage(text)
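To invoke on_transcription_change() automatically, one option is a DAT Execute DAT watching one of the operator’s table outputs (segments_out, for example). onTableChange is the standard DAT Execute callback; the DAT holding the function above is again a placeholder:

# DAT Execute DAT monitoring segments_out (hypothetical wiring)
def onTableChange(dat):
    # 'agent_bridge' is assumed to be the Text DAT containing on_transcription_change()
    op('agent_bridge').module.on_transcription_change()
    return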
With Audio Analysis
Combine with audio analysis for enhanced processing:
# Use VAD state for audio routing
vad_state = op('stt_kyutai').op('vad_state_out')
audio_switch = op('switch1')
def update_audio_routing():
    if vad_state.numRows > 1:
        is_speech = vad_state[1, 'Is_Speech'].val == 'True'
        audio_switch.par.index = 1 if is_speech else 0
With Real-time Visualization
Create visual feedback for transcription:
# Visualize VAD confidence
vad_step_dat = op('stt_kyutai').op('vad_step_out')
confidence_chop = op('constant1')
def update_vad_visualization():
    if vad_step_dat.numRows > 1:
        latest_row = vad_step_dat.numRows - 1
        pause_2s = float(vad_step_dat[latest_row, 'Pause_2.0s'].val)
        confidence_chop.par.value0 = 1.0 - pause_2s  # Invert for speech confidence
Best Practices
Model Selection
- 1B Model: Use for multilingual scenarios, lower latency, moderate accuracy
- 2.6B Model: Use for English-only, higher accuracy, acceptable latency
- Language Setting: Use “auto” for mixed-language content, specific language for better accuracy
Performance Optimization
- GPU Usage: Enable CUDA for better performance with larger models
- Chunk Duration: Keep at default (80ms) for optimal Kyutai model performance
- VAD Threshold: Adjust based on ambient noise levels (0.3-0.7 typical range)
- Temperature: Use 0.0-0.2 for deterministic results, 0.3-0.5 for more natural variation
Audio Quality
- Sample Rate: Ensure audio is 24kHz, Kyutai’s native rate (a resampling sketch follows this list)
- Format: Use float32 format for best quality
- Buffering: Process audio in consistent chunks for smooth operation
- Noise Handling: Use VAD to filter out non-speech audio
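If your source audio is not already 24kHz float32, convert it before calling ReceiveAudioChunk(). A minimal sketch using numpy and scipy (both assumed to be available in your Python environment; the 48kHz source rate is only an example):

import numpy as np
from math import gcd
from scipy.signal import resample_poly

def to_kyutai_format(audio, source_rate=48000, target_rate=24000):
    # Kyutai models expect mono float32 at 24kHz
    audio = np.asarray(audio, dtype=np.float32)
    if source_rate != target_rate:
        g = gcd(target_rate, source_rate)
        audio = resample_poly(audio, target_rate // g, source_rate // g).astype(np.float32)
    return audio

# Example: op('stt_kyutai').ReceiveAudioChunk(to_kyutai_format(chunk, 48000))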
Troubleshooting
Common Issues
Engine Won’t Initialize
- Check that all dependencies are installed
- Verify model is downloaded locally
- Ensure ChatTD Python environment is configured
- Check device compatibility (CUDA drivers for GPU)
Poor Transcription Quality
- Verify audio sample rate is 24kHz
- Check microphone quality and positioning
- Adjust VAD threshold for ambient noise
- Use appropriate model for language content
High Latency
- Use 1B model for lower latency
- Enable GPU acceleration if available
- Reduce VAD debugging if enabled
- Check system resource availability
Missing Audio
- Verify ReceiveAudioChunk() is being called
- Check audio format (float32 required)
- Ensure “Transcription Active” is enabled
- Monitor VAD state for speech detection
Error Messages
“Dependencies missing”
- Click “Install Dependencies” button
- Restart TouchDesigner after installation
- Check ChatTD Python environment configuration
“Model not found”
- Click “Download Model” to fetch from HuggingFace
- Check internet connection and HuggingFace access
- Verify sufficient disk space for model storage
“Worker process failed”
- Check Python environment and dependencies
- Review worker logging output for specific errors
- Verify CUDA installation for GPU usage
- Try CPU device if GPU fails
“Audio format error”
- Ensure audio is float32 format
- Convert audio to 24kHz sample rate
- Check audio array dimensions and type
Advanced Features
Semantic VAD
The operator includes advanced Voice Activity Detection (a reading sketch follows the list below):
- Pause Prediction: Predicts pauses at 0.5s, 1.0s, 2.0s, and 3.0s intervals
- Speech Events: Detects speech start/end with confidence scores
- State Tracking: Maintains current speech/silence state
- Debugging: Detailed logging for VAD parameter tuning
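The pause predictions can be read back from vad_step_out much like the visualization example above. The sketch below assumes the columns follow the Pause_<interval>s naming seen there; check the header row of your vad_step_out DAT before relying on it:

# Read the latest pause predictions from vad_step_out (column names assumed)
vad_step = op('stt_kyutai').op('vad_step_out')

def latest_pause_predictions():
    if vad_step.numRows < 2:
        return {}
    row = vad_step.numRows - 1
    return {
        interval: float(vad_step[row, f'Pause_{interval}'].val)
        for interval in ('0.5s', '1.0s', '2.0s', '3.0s')
    }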
Streaming Architecture
Uses Kyutai’s Delayed Streams Modeling (a frame-chunking sketch follows the list below):
- Fixed Frame Size: 80ms frames (1920 samples at 24kHz)
- Continuous Context: Maintains context across audio chunks
- Low Latency: Optimized for real-time applications
- Memory Efficient: Manages memory usage automatically
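The frame arithmetic is simple: 24000 samples/s * 0.080 s = 1920 samples per frame. A small sketch that slices a numpy buffer into frame-sized chunks, in case your audio source delivers larger blocks than you want to hand to ReceiveAudioChunk() at once:

import numpy as np

SAMPLE_RATE = 24000
FRAME_SAMPLES = int(SAMPLE_RATE * 0.080)  # 1920 samples per 80ms frame

def iter_frames(audio: np.ndarray):
    # Yield consecutive full frames; keep any remainder for the next audio block
    for start in range(0, len(audio) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        yield audio[start:start + FRAME_SAMPLES]

# Example: for frame in iter_frames(block): op('stt_kyutai').ReceiveAudioChunk(frame)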
Model Management
Automatic model downloading and caching (a manual download sketch follows the list below):
- HuggingFace Integration: Downloads models from official repositories
- Local Caching: Stores models locally for offline use
- Version Management: Handles model updates and compatibility
- Storage Optimization: Efficient model storage and loading
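The “Download Model” pulse normally handles this, but models can also be fetched manually with huggingface_hub. A sketch, assuming the repository id mirrors the model name used in the configuration examples above; check Kyutai’s HuggingFace organization for the exact id:

from huggingface_hub import snapshot_download

# Download (or reuse the local cache of) the 1B English/French STT model
local_path = snapshot_download(repo_id="kyutai/stt-1b-en_fr")
print(f"Model files cached at: {local_path}")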
This operator provides professional-grade speech-to-text capabilities for TouchDesigner workflows, enabling sophisticated audio processing and AI integration scenarios.
Research & Licensing
Kyutai Research Foundation
Kyutai is a leading AI research organization focused on advancing speech and language technologies. Their flagship achievement, the Moshi model, represents a breakthrough in speech-text foundation models, enabling truly real-time conversational AI with unprecedented quality and responsiveness.
Moshi: Speech-Text Foundation Model
Kyutai's Moshi is a revolutionary speech-text foundation model that processes two streams of audio simultaneously - one for the user and one for the AI assistant. This dual-stream architecture enables full-duplex conversation with natural interruptions and overlapping speech, closely mimicking human conversation patterns.
Technical Details
- 7B Parameter Architecture: Large-scale transformer model optimized for speech processing
- Dual-Stream Processing: Simultaneous modeling of user and assistant audio streams
- Inner Monologue: Predicts text tokens corresponding to its own speech for improved generation quality
- Ultra-Low Latency: Theoretical latency of 160ms (80ms frame size + 80ms acoustic delay)
- Streaming Neural Codec: Uses the Mimi codec for efficient audio compression at 1.1 kbps
  - Frame Rate: 12.5 Hz operation (80ms frames)
  - Compression: 24 kHz audio down to 1.1 kbps bandwidth
  - Streaming: Fully causal and streaming with 80ms latency
  - Quality: Outperforms existing non-streaming codecs like SpeechTokenizer and SemantiCodec
  - Architecture: Transformer-based encoder/decoder with adversarial training
- STT model variants:
  - STT-1B-EN/FR: 1B parameter model supporting English and French with 0.5s delay
  - STT-2.6B-EN: 2.6B parameter English-only model with 2.5s delay for higher accuracy
  - Multilingual Support: Automatic language detection for mixed-language scenarios
  - Quantization: INT8 and INT4 quantized versions for efficient deployment
Research Impact
- Real-time Dialogue: Enables natural conversation with minimal latency
- Full-duplex Communication: Supports interruptions and overlapping speech
- Semantic Understanding: Advanced VAD with pause prediction capabilities
- Production Ready: Rust, Python, and MLX implementations for various platforms
Citation
@techreport{kyutai2024moshi,
  title={Moshi: a speech-text foundation model for real-time dialogue},
  author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
  year={2024},
  eprint={2410.00037},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2410.00037},
}
Key Research Contributions
- Full-duplex spoken dialogue with dual-stream modeling
- Ultra-low latency speech processing (160ms theoretical)
- Streaming neural audio codec (Mimi) with 1.1 kbps compression
- Semantic voice activity detection with pause prediction
- Production-ready implementations across multiple platforms
License
CC-BY 4.0 - This model is freely available for research and commercial use.