Voice Agent
v1.0.0 (new)
The Voice Realtime LOP is a single operator that talks to any realtime voice-to-voice provider. Provider modules (`provider_gemini_live.py`, `provider_openai_realtime.py`, `provider_xai_grok.py`, `provider_hume_evi.py`) drop into the operator like TTS/STT providers: swap backends from the Provider menu, with no per-provider operator required. It replaces the older `gemini_live` monolith (now deprecated) and mirrors the unified-operator pattern used by `tts` and `stt_*`.
Key Features
- One operator, four cloud providers: switch backends from a single menu
- Session modes: `continuous`, `one_turn`, `push_to_talk`
- Session resumption: native handle (Gemini, Hume) or transcript replay (OpenAI, xAI)
- Disk-persisted session history with optional auto-resume
- External tool orchestration via the same `Tool` sequence used by the Agent LOP
- Built-in tools: `end_conversation` and `output_text_content`
- Live per-minute cost ballpark, running session cost, and optional `Costbudget` cap
- Live-streaming user transcript (row rewrites in place as the user speaks)
- Profiles + Skills injection into the system prompt
- Affect / emotion signals (Hume EVI: 48 prosody dimensions on a dedicated signals CHOP)
Providers at a glance
| Provider | Model families | Audio out | Tools | Notable |
|---|---|---|---|---|
| Gemini Live | gemini-3.1-flash-live-preview, gemini-2.5-flash-native-audio-preview-12-2025 | 24 kHz | Sync only on 3.x, async on 2.5 | Native session resumption, video-in, Google Search grounding |
| OpenAI Realtime | gpt-realtime, gpt-realtime-mini | 24 kHz | Streamed (one item per call) | Token-metered idle, long session cap |
| xAI Grok Voice | grok-voice-* | 24 kHz | Streamed | Flat per-minute pricing (wallclock-metered) |
| Hume EVI | EVI 3 | 48 kHz | Streamed | Prosody/affect side-channel, voice cloning supported |
All providers use the same interface — code written for one works for all four.
Requirements
- API keys flow through ChatTD’s Key Manager. Store a key per provider under its server name (`gemini`, `openai`, `xai`, `hume`) or paste it into the Apikey parameter on the provider sub-page.
- Python dependencies are declared per provider in its `DEPENDENCIES` constant. The first time you switch to a provider whose deps are missing, the backend page surfaces an install pulse. Gemini needs `google-genai`, OpenAI needs `openai`, and xAI and Hume use raw `websockets` (already pinned).
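As a rough illustration of the dependency check described above, the sketch below tests whether each entry of a `DEPENDENCIES`-style list can be imported. The function name `missing_dependencies` and the `IMPORT_NAMES` mapping are hypothetical and are not taken from the operator's code; only the package names come from this page.

```python
from importlib.util import find_spec

# Hypothetical mapping from pip package name to importable module name.
# The package names come from the Requirements list; the mapping is an assumption.
IMPORT_NAMES = {
    "google-genai": "google.genai",
    "openai": "openai",
    "websockets": "websockets",
}


def missing_dependencies(dependencies):
    """Return the entries of a DEPENDENCIES-style list that cannot be imported."""
    missing = []
    for package in dependencies:
        module = IMPORT_NAMES.get(package, package.replace("-", "_"))
        try:
            found = find_spec(module) is not None
        except (ImportError, ValueError):
            # A missing parent package (e.g. "google") raises instead of returning None.
            found = False
        if not found:
            missing.append(package)
    return missing
```

A non-empty result is the situation in which the backend page would be expected to offer the install pulse.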
Account & billing
Realtime voice can get expensive fast: 10 minutes of continuous voice-to-voice on Gemini 3.1 runs ~$2.50 at current paid-tier rates. Check your tier and set Costbudget before a long session.
| Provider | Tier URL | Pricing reference |
|---|---|---|
| Gemini Live | <https://aistudio.google.com/usage> | <https://ai.google.dev/gemini-api/docs/pricing> |
| OpenAI Realtime | <https://platform.openai.com/usage> | <https://openai.com/api/pricing/> |
| xAI Grok | <https://console.x.ai/usage> | <https://x.ai/api#pricing> |
| Hume EVI | <https://beta.hume.ai/settings/billing> | <https://www.hume.ai/pricing> |
The Pricing parameter on the Voice page shows the current provider + model ballpark as soon as you select them (e.g. ~in 0.005 USD/min out 0.018 USD/min). Sessioncost accumulates live as the session runs. Pulse Resetcostmeter to zero it.
Models & Pricing
Gemini Live
| Model ID | Status | Audio in/out | Video in | Notes |
|---|---|---|---|---|
| `gemini-3.1-flash-live-preview` | Preview (newest, default) | ~$0.005 / $0.018 per min | ~$0.002/min | Acoustic nuance, thinkingLevel, sync function calling |
| `gemini-2.5-flash-native-audio-preview-12-2025` | Preview | ~$0.005 / $0.018 per min | ~$0.005/min | Native audio, async tools |
Voice-chat ballpark on 3.1: ~$0.30/min continuous voice-to-voice.
OpenAI Realtime
| Model ID | Status | Audio in/out | Notes |
|---|---|---|---|
| `gpt-realtime` | GA | Token-metered | Long session cap, native tool interruption |
| `gpt-realtime-mini` | GA | Token-metered, cheaper | Lower quality, same interface |
xAI Grok Voice
Flat ~$0.05/min wallclock (idle time bills). Use `one_turn` mode to avoid paying for dead air.
Hume EVI
~$0.04–0.07/min wallclock depending on voice. 30-minute session cap. Ships an `onAffect` callback with per-turn prosody scores (Joy, Surprise, Admiration, etc.).
The Model parameter is an editable menu — type a custom ID if a provider ships a new model before this operator is updated.
Input/Output
Inputs
- Input 1 (Audio CHOP): Microphone audio. Typically a mono CHOP from an Audio Device In, fed to the operator via `Micin` on the Voice page. The EXT resamples to each provider’s required rate automatically (`SAMPLE_RATE_IN` constant per provider).
Outputs
- Output 1: Conversation table DAT (role, message, id, timestamp, type, metadata, session_id)
- Output 2: Current audio playback CHOP (`store_output`)
- Output 3: Full session audio CHOP (`full_audio`)
- Output 4: Text output DAT (content from the `output_text_content` tool, when enabled)
- signals CHOP: Common channels (`connected`, `model_ready`, `worker_active`, `cost_in_seconds`, `cost_out_seconds`) plus any provider-specific channels declared via that provider’s `SIGNAL_CHANNELS` dict (Hume: 48 affect dimensions prefixed `hume_evi_affect_*`).
Session modes (Voice page)
- `continuous` (default): Connect opens the session and keeps it alive full-duplex until Disconnect or the `end_conversation` tool fires.
- `one_turn`: Connect opens the session for one exchange. After the assistant’s first turn-final text the EXT either holds the socket and disarms the mic (token-metered providers: Gemini, OpenAI) or disconnects and writes the trace (wallclock-metered providers: xAI, Hume). The next Connect re-arms. Use for discrete voice prompts when you don’t want to pay for idle time.
- `push_to_talk`: Session stays open, but the `Talk` toggle gates whether mic audio flows. Bind `Talk` to a Keyboard In or MIDI In CHOP for walkie-talkie-style interactions.
The Sessionstate readout shows where the session is: disconnected / connecting / active / armed / refreshing / ending.
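The gating rules for the three session modes can be summarised in a few lines. This is only a sketch of the behavior described above, not the EXT's actual code; the function names, the short provider keys, and the returned action strings are all hypothetical.

```python
# Provider groupings as described for one_turn mode (keys are illustrative).
TOKEN_METERED = {"gemini", "openai"}   # hold the socket after the first reply
WALLCLOCK_METERED = {"xai", "hume"}    # disconnect after the first reply


def mic_is_open(mode, talk=False, first_reply_done=False):
    """Return True when mic audio should flow to the provider."""
    if mode == "continuous":
        return True
    if mode == "one_turn":
        return not first_reply_done      # disarmed once the first reply lands
    if mode == "push_to_talk":
        return talk                      # the Talk toggle is the gate
    raise ValueError(f"unknown session mode: {mode!r}")


def after_first_reply(provider):
    """What one_turn does once the assistant's first turn-final text arrives."""
    if provider in TOKEN_METERED:
        return "hold_socket_disarm_mic"
    if provider in WALLCLOCK_METERED:
        return "disconnect_write_trace"
    raise ValueError(f"unknown provider: {provider!r}")
```

In both `one_turn` after the first reply and `push_to_talk` with Talk off, the gate is closed while the session may still be open, which corresponds to the `armed` state.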
Session resumption
On disconnect the EXT writes a sibling JSON trace (`voice_<timestamp>_<hash>.json`) to Sessiontracedir (defaults to `project.folder/voice_sessions/`). The trace holds the resume handle (provider-specific), the transcript, and the end reason.
On the next connect the EXT picks a resume source in this order:
- Loadsessionfile (file path) — explicit one-shot
- Resumelast (toggle) — newest trace in the directory
- None — starts fresh
Resumption strategy is per-provider:
- Gemini Live / Hume EVI → native handle. Zero replay cost.
- OpenAI Realtime / xAI Grok → transcript replay (last `Maxreplayrows` messages, default 20, user+assistant only). Replay cost grows with history, and the Sessionresume readout says so when replay fires.
Tool-call / tool-result rows are dropped from replay to avoid lying to the model about output it didn’t produce.
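The resume-source order and the replay shaping can be sketched as two small functions. This assumes each transcript row is a dict with `role` and `type` keys matching the conversation DAT columns, and that tool rows carry a `tool_call` or `tool_result` type; the function names and the `(modified_time, path)` trace tuples are hypothetical.

```python
def pick_resume_source(load_session_file, resume_last, traces):
    """Pick the trace to resume from, in the documented priority order.

    traces is a list of (modified_time, path) tuples for the trace directory.
    """
    if load_session_file:                # 1. explicit one-shot file
        return load_session_file
    if resume_last and traces:           # 2. newest trace in the directory
        return max(traces)[1]
    return None                          # 3. start fresh


def shape_replay(rows, max_replay_rows=20):
    """Keep user/assistant text rows only, then the last max_replay_rows of them.

    max_replay_rows is assumed to be at least 1, matching the parameter range.
    """
    spoken = [
        row for row in rows
        if row["role"] in ("user", "assistant")
        and row.get("type") not in ("tool_call", "tool_result")
    ]
    return spoken[-max_replay_rows:]
```

Filtering before truncating means the cap counts only rows that are actually replayed.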
Tool Integration
Voice Realtime consumes tools from other LOPs using the same pattern as the Agent LOP. It does not expose a GetTool() method.
Connecting external tools
- On the Tools page, enable Use LOP Tools.
- In the External Op Tools sequence, add a block and drag the tool operator into the OP field.
- Set Mode per tool:
  - enabled: blocks until the tool completes before the model continues.
  - enabled_nonblocking: fires and forgets. Safe on Gemini 2.5 and OpenAI; on Gemini 3.x the model runs sync regardless.
  - disabled: skipped.
- Connect the session. The model calls tools as needed and folds the results into its response.
Built-in tools
- `end_conversation`: on when Allow model to end conversation is enabled. The model can close gracefully on goodbyes.
- `output_text_content`: on when Output text is enabled. The model can display text (code, data, URLs) in the fourth output DAT without reading it aloud.
- Google Search grounding (Gemini only): turn on `Enablegrounding` on the Gemini Live page to add `google_search` as a tool.
Tool-call rendering in chat_viewer is automatic: paired tool_call / tool_result rows collapse into a single expandable entry by metadata.call_id.
Streaming modes (Voice page)
`Streamingmode` controls how assistant/user transcripts land in the conversation DAT:
- `live` (default): one row per turn, rewritten in place as deltas arrive. Best UX for live captions. `chat_viewer` re-renders in place via stable row ids.
- `coalesce`: one row per turn, written only on turn-final. Cleanest log; no streaming jitter.
- `append`: one row per delta. Debug-heavy. Avoid for long sessions.
xAI Grok emits no user-delta stream — on xAI, live degrades to coalesce automatically for user text.
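To make the difference between the three streaming modes concrete, here is a small stand-alone sketch that writes transcript deltas into a list standing in for the conversation DAT. The class `TranscriptWriter` is hypothetical, and it assumes deltas are incremental text fragments rather than cumulative snapshots.

```python
class TranscriptWriter:
    """Toy model of how deltas land in the conversation table per mode."""

    def __init__(self, mode="live"):
        if mode not in ("live", "coalesce", "append"):
            raise ValueError(f"unknown streaming mode: {mode!r}")
        self.mode = mode
        self.rows = []        # stands in for the conversation DAT
        self._buffers = {}    # turn_id -> text accumulated so far

    def on_delta(self, turn_id, role, delta, final=False):
        text = self._buffers.get(turn_id, "") + delta
        self._buffers[turn_id] = text
        if self.mode == "append":
            # One row per delta.
            self.rows.append({"id": turn_id, "role": role, "message": delta})
        elif self.mode == "live" or final:
            # live: rewrite the turn's row on every delta.
            # coalesce: reach this branch only on turn-final.
            self._upsert(turn_id, role, text)
        if final:
            self._buffers.pop(turn_id, None)

    def _upsert(self, turn_id, role, text):
        for row in self.rows:
            if row["id"] == turn_id:
                row["message"] = text   # stable id, rewritten in place
                return
        self.rows.append({"id": turn_id, "role": role, "message": text})
```

With two deltas for one turn, `live` and `coalesce` both end with a single row, while `append` ends with two.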
Cost control
- Pricing: per-minute ballpark for the active provider + model, refreshed on change.
- Sessioncost: running session spend (accumulated as audio seconds, derived via SAMPLE_RATE, × provider pricing).
- Costbudget: hard cap in USD. When `Sessioncost` exceeds `Costbudget` the EXT disconnects and fires `onError` with `source='budget'`. Set to 0 to disable.
- Resetcostmeter: pulse to zero the session cost meter (does not reset the budget).
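The arithmetic behind the cost meter and the budget cap can be sketched as follows. This is an assumption about the shape of the calculation (audio seconds in each direction times a per-minute rate), not the EXT's actual accounting; both function names are hypothetical.

```python
def session_cost(seconds_in, seconds_out, rate_in_per_min, rate_out_per_min):
    """Estimate spend in USD from audio seconds and per-minute rates."""
    return (seconds_in / 60.0) * rate_in_per_min + (seconds_out / 60.0) * rate_out_per_min


def budget_exceeded(cost_usd, cost_budget):
    """A Costbudget of 0 disables the cap; otherwise trip when spend exceeds it."""
    return cost_budget > 0 and cost_usd > cost_budget
```

Because token-metered providers may bill for more than raw audio seconds, a figure computed this way should be read as a lower-bound ballpark.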
Profiles & Skills
- Profiles page: scan a folder of JSON profile files and pick one from the menu; the system prompt + model + voice + tool toggles apply on connect.
- Skills page: scan a folder of JSON skills; each skill’s system-prompt chunk is appended to the session instructions.
Both pages mirror the agent LOP’s layout and share the same profile/skill file format.
Callbacks
Wire custom logic on the Callbacks page. The Callbackdat Text DAT receives a stub with every callback signature: onSessionStart, onSessionEnd, onAssistantText, onUserText, onToolCall, onToolResult, onAudioIn, onAudioOut, onProviderChange, onError, onAffect (Hume only), and onUserSpeechStarted / onUserSpeechEnded where the provider supplies them.
Toggle Printcallbacks to log every callback fire to the textport while developing.
Usage Examples
Basic voice conversation
- Select Provider on the Voice page. Pulse Scanproviders if the menu is empty.
- On the provider sub-page, pick a Model and Voice. Check the Pricing readout on the Voice page.
- Paste an API key into Apikey (or store it under the provider’s server name in ChatTD Key Manager).
- Pulse Connect. Watch Sessionstate flip to `active`.
- Speak into the mic. Assistant audio plays through the Playback-page device.
- Pulse Disconnect when done; the session trace is written.
Resuming the last session
- Enable Resumelast on the Playback page before connecting.
- Pulse Connect. The Sessionresume readout shows which path fired (`Resumed via native handle (gemini_live)` or `Replayed 20 messages (replay, cost grows with history)`).
- The conversation DAT pre-populates with the previous transcript; the provider is handed either the resume token or the replayed messages.
Budget-capped demo
- Set `Costbudget = 0.50` on the Tools page.
- Connect and converse.
- The EXT disconnects the moment spend crosses $0.50 and writes an error row to the conversation DAT.
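The same demo could be driven from a script instead of the UI. The helper below is a hypothetical sketch: it takes the operator object (inside TouchDesigner that would be `op('voice_agent')`) and relies only on the `Costbudget` and `Connect` parameters listed on this page plus the standard `pulse()` call on a pulse parameter.

```python
def start_budgeted_session(agent, budget_usd):
    """Set the cost cap, then connect.

    agent is the operator object, e.g. op('voice_agent') inside TouchDesigner.
    """
    if budget_usd < 0:
        raise ValueError("budget_usd must be >= 0 (0 disables the cap)")
    agent.par.Costbudget = budget_usd   # USD; the documented range is 0 to 10
    agent.par.Connect.pulse()           # same as pulsing Connect in the UI
```

Inside TouchDesigner this might be called as `start_budgeted_session(op('voice_agent'), 0.50)`.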
Session refresh
The EXT auto-refreshes sessions as they approach the provider’s MAX_SESSION_S cap (Gemini: 900s, OpenAI: 3600s, xAI: 3600s, Hume: 1800s), or immediately on provider-emitted goaway. Controls on the Voice page:
- Auto-Refresh Session (default on): arm the deadline coordinator.
- Refresh Warning (s) (default 30): seconds before cap at which `onStatus('expiring')` fires. Set 0 to disable the warning and refresh only at cap.
Per-provider behavior routes off RESUMPTION:
- Native (Gemini Live, Hume EVI): resume handle captured via `get_persistable_state` is re-injected into the new `start_session`. The server carries the full history; no client-side replay.
- Replay (OpenAI Realtime, xAI Grok): new session primed via `prime_history` with the shaped transcript, capped by `Maxreplayrows`. Token cost grows with transcript length.
- None: session ends cleanly; `onSessionEnd` fires with `end_reason='cap'` and `onStatus('expired')` follows. No reconnect.
During a refresh the Session State badge reads refreshing, the mic pump is paused, and a system row ▶ Refreshed session (<mode>) — reason: <cap|goaway> is appended to the conversation DAT. Conversation Cost accumulates across refreshes; Session Cost resets per leg. Costbudget is enforced against the conversation-wide total.
In-flight user audio at the moment of refresh is dropped (v1) — the audio buffer isn’t carried across the reconnect. Speaker-out finishes its current buffer since playback is decoupled from the session.
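The deadline arithmetic is simple enough to sketch. The caps and resumption strategies below are the ones listed in this section; the function name `refresh_plan`, the short provider keys, and the returned dict shape are hypothetical, and the sketch does not claim to mirror the coordinator's internals.

```python
# Values taken from the Session refresh section above.
MAX_SESSION_S = {"gemini": 900, "openai": 3600, "xai": 3600, "hume": 1800}
RESUMPTION = {"gemini": "native", "openai": "replay", "xai": "replay", "hume": "native"}


def refresh_plan(provider, refresh_warning_s=30):
    """Return when the expiring warning fires, the session cap, and the strategy.

    A refresh_warning_s of 0 disables the warning, so warn_at_s is None.
    """
    cap = MAX_SESSION_S[provider]
    if refresh_warning_s == 0:
        warn_at = None
    else:
        warn_at = max(0, cap - refresh_warning_s)
    return {"warn_at_s": warn_at, "cap_s": cap, "strategy": RESUMPTION[provider]}
```

For Gemini with the default warning of 30 seconds, the warning would land 870 seconds into the session.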
Known limitations
- Voice cloning UI is not implemented in v1 even though Hume declares `SUPPORTS_VOICE_CLONING=True`.
- xAI Grok user-text deltas are not emitted by the provider, so only final user transcripts land in the DAT.
Troubleshooting
- Sessioncost stuck at `$0.00000`: the active provider’s `pricing(model_id)` returned nothing for the selected model. Verify the Model ID is in the provider’s pricing map.
- Mic audio not flowing: check `Sessionstate`. If it’s `armed`, the gate is closed: you’re in `push_to_talk` without Talk on, or in `one_turn` after the first reply. Turn Talk on, or pulse Connect to re-arm after a `one_turn` reply.
- “No key for server ‘gemini’”: open ChatTD Key Manager and add a key under the server name, or paste into Apikey on the provider sub-page.
- 1007 / 1008 close codes on Gemini: usually a dtype or rate mismatch on mic input. The provider asserts int16 little-endian and rate = `SAMPLE_RATE_IN`. Upstream resampling should handle it, but check the mic CHOP is mono.
- Replay costs a lot: lower Maxreplayrows or switch to a native-resumption provider (Gemini, Hume). Replay cost scales with transcript length.
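When debugging a 1007 close, it can help to see what an int16 little-endian frame looks like. The helper below is a stand-alone sketch using only the standard library; the name `float_to_pcm16le` is hypothetical and this is not the EXT's conversion code. It assumes mono float samples in the range -1.0 to 1.0.

```python
import struct


def float_to_pcm16le(samples):
    """Convert mono float samples in [-1.0, 1.0] to int16 little-endian bytes."""
    # Clip first so out-of-range input cannot overflow the int16 range.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

The result is always two bytes per sample, so an odd byte length or float32 data in a frame is a sign that the conversion was skipped upstream.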
Parameters
op('voice_agent').par.Connect Pulse
- Default: False
op('voice_agent').par.Disconnect Pulse
- Default: False
op('voice_agent').par.Talk Toggle Push-to-talk gate. Only used when Session Mode = push_to_talk. True → mic audio streams to provider; False → mic is muted (session stays open).
- Default: False
op('voice_agent').par.Sessionstate Str disconnected | connecting | active | armed | refreshing | ending. "armed" = session open, mic gated off (one_turn waiting for next arm, PTT with Talk=off, or Active=off). "refreshing" = session refresh coordinator is tearing down and restarting around MAX_SESSION_S / GoAway.
- Default: "" (Empty String)
op('voice_agent').par.Autosessionrefresh Toggle Auto-refresh the session when the provider's MAX_SESSION_S cap is approached or on provider GoAway. Native-resumption providers (Gemini, Hume) carry an opaque handle across the refresh. Replay providers (OpenAI, xAI) rehydrate the transcript via prime_history. Turn off for clean expiry: the session ends and onSessionEnd fires with end_reason=cap.
- Default: True
op('voice_agent').par.Refreshwarning Int Seconds before the session cap at which to emit onStatus(expiring). Set 0 to disable the warning and refresh only at cap.
- Default: 30
- Range: 0 to 300
- Slider Range: 0 to 300
op('voice_agent').par.Scanproviders Pulse
- Default: False
op('voice_agent').par.Providersfolder Folder
- Default: "" (Empty String)
op('voice_agent').par.Micin CHOP
- Default: "" (Empty String)
op('voice_agent').par.Inputtext Str
- Default: "" (Empty String)
op('voice_agent').par.Sendtext Pulse
- Default: False
op('voice_agent').par.Pricing Str
- Default: "" (Empty String)
op('voice_agent').par.Conversationcost Str Running USD total across all session refreshes in the current conversation. Resets on Connect (fresh session) or Reset Cost Meter, but preserved across auto-refresh events.
- Default: "" (Empty String)
op('voice_agent').par.Sessioncost Str
- Default: "" (Empty String)
op('voice_agent').par.Resetcostmeter Pulse
- Default: False
op('voice_agent').par.Enableconvdat Toggle Append transcript rows to the internal conversation DAT (role | message | id | timestamp | type | metadata | session_id). chat_viewer reads from this.
- Default: True
Gemini Live
op('voice_agent').par.Apikey Str Google AI Studio API key (routed via key_manager). Get one at https://aistudio.google.com/api-keys
- Default: "" (Empty String)
op('voice_agent').par.Languagecode Str BCP-47 language code for speech (optional). Leave blank to auto-detect.
- Default: "" (Empty String)
op('voice_agent').par.Systemprompt Str System instruction sent at session open via send_client_content.
- Default: "" (Empty String)
op('voice_agent').par.Enableusertranscription Toggle Emit partial + final user speech transcripts.
- Default: True
op('voice_agent').par.Enableoutputtranscription Toggle Emit partial + final assistant speech transcripts.
- Default: True
op('voice_agent').par.Enablegrounding Toggle Enable Google Search tool for grounded answers. Not compatible with custom function tools in the same session.
- Default: False
op('voice_agent').par.Enablesessionresumption Toggle Server issues periodic resumption handles; provider reconnects within ~10 min on disconnect.
- Default: True
op('voice_agent').par.Thinkingbudget Int Gemini 2.5 reasoning token budget. Ignored on 3.x.
- Default: 0
- Range: 0 to 8192
- Slider Range: 0 to 8192
Playback
op('voice_agent').par.Resetplayback Pulse
- Default: False
op('voice_agent').par.Playbackactive Toggle
- Default: True
op('voice_agent').par.Volume Float
- Default: 1.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('voice_agent').par.Play Pulse
- Default: False
op('voice_agent').par.Pause Pulse
- Default: False
op('voice_agent').par.Stop Pulse
- Default: False
op('voice_agent').par.Replay Pulse
- Default: False
op('voice_agent').par.Sessiontracing Toggle Write a JSON trace of every session to disk on disconnect. Format: <tracedir>/<YYYYMMDD_HHMMSS>_<provider>_<model>.json. Holds transcript + resume handle + cost + timestamps. Required for Resume Last / Load Session.
- Default: True
op('voice_agent').par.Sessiontracedir Folder Folder to write traces into. Blank = <project>/voice_sessions/.
- Default: "" (Empty String)
op('voice_agent').par.Resumelast Toggle On Connect, auto-load the newest trace matching the active Provider+Model and resume. Native-resumption providers (Gemini, Hume) hand the server an opaque handle; the server has the full state. Replay providers (OpenAI, xAI) re-feed the transcript, capped by Maxreplayrows; the model re-reads its prior turns, so audio continuity is lost and token cost grows with transcript length.
- Default: False
op('voice_agent').par.Loadsessionfile File Override Resume Last with a specific trace file. Loaded on the next Connect. Clear to return to newest-matching behavior.
- Default: "" (Empty String)
op('voice_agent').par.Maxreplayrows Int Replay providers only. Caps how many prior messages are rehydrated into the new session. Higher = more context + higher cost per reconnect.
- Default: 20
- Range: 1 to 200
- Slider Range: 1 to 200
op('voice_agent').par.Sessionresume Str What happened on the most recent Connect: a native handle round-trip, a replay of N messages, or a fresh session.
- Default: "" (Empty String)
op('voice_agent').par.Sessionid Str Identifier of the active (or most recent) session. Matches the trace filename.
- Default: "" (Empty String)
op('voice_agent').par.Usetools Toggle Enable external tool operators via Tool sequence blocks
- Default: True
op('voice_agent').par.Allowendconversation Toggle Assistant can call end_conversation to hang up. It speaks its closing line first; the EXT disconnects on the tool call.
- Default: True
op('voice_agent').par.Outputtext Toggle Assistant can display text without speaking it aloud, useful for code, data, or long blocks.
- Default: False
op('voice_agent').par.Approvaltimeout Int Auto-deny after N seconds (0 = wait forever)
- Default: 0
- Range: 0 to 600
- Slider Range: 0 to 600
op('voice_agent').par.Pendingtools Str
- Default: "" (Empty String)
op('voice_agent').par.Approvetools Pulse
- Default: False
op('voice_agent').par.Denytools Pulse
- Default: False
op('voice_agent').par.Costbudget Float Session cost limit in USD (0 = unlimited). When exceeded, the session is disconnected and onError fires with source=budget.
- Default: 0.0
- Range: 0 to 10
- Slider Range: 0 to 10
op('voice_agent').par.Tool Sequence
- Default: 0
op('voice_agent').par.Tool0toolop OP
- Default: "" (Empty String)
Skills
op('voice_agent').par.Skillsfolder Folder
- Default: "" (Empty String)
op('voice_agent').par.Skillscomp OP
- Default: "" (Empty String)
op('voice_agent').par.Scanskills Pulse
- Default: False
op('voice_agent').par.Skillscount Str
- Default: "" (Empty String)
Profiles
op('voice_agent').par.Profilesfolder Folder
- Default: "" (Empty String)
op('voice_agent').par.Scanprofiles Pulse
- Default: False
op('voice_agent').par.Applyprofiles Pulse
- Default: False
op('voice_agent').par.Profile Sequence
- Default: 0
op('voice_agent').par.Displayname Str Friendly name for UI, dashboards, event sinks, and agent swarm traces. Profiles may set this value.
- Default: "" (Empty String)
op('voice_agent').par.Displaycolorr RGB Identity color for the operator tile, compact panels, dashboards, and profile-driven UI.
- Default: 0.98
- Range: 0 to 1
- Slider Range: 0 to 1
op('voice_agent').par.Displaycolorg RGB Identity color for the operator tile, compact panels, dashboards, and profile-driven UI.
- Default: 0.52
- Range: 0 to 1
- Slider Range: 0 to 1
op('voice_agent').par.Displaycolorb RGB Identity color for the operator tile, compact panels, dashboards, and profile-driven UI.
- Default: 0.02
- Range: 0 to 1
- Slider Range: 0 to 1
Callbacks
op('voice_agent').par.Callbackdat DAT
- Default: "" (Empty String)
op('voice_agent').par.Printcallbacks Toggle
- Default: False
Lifecycle
op('voice_agent').par.Installdependencies Pulse
- Default: False
op('voice_agent').par.Initialize Pulse
- Default: False
op('voice_agent').par.Shutdown Pulse
- Default: False
op('voice_agent').par.Enginestatus Str
- Default: "" (Empty String)
op('voice_agent').par.Active Toggle
- Default: False
Changelog
v1.0.0 (2026-05-02)
- Release update
v0.3.0
- Session refresh coordinator: tracks MAX_SESSION_S per provider, fires auto-refresh at cap and on provider GoAway
- Per-RESUMPTION routing: native handle replay (Gemini Live, Hume EVI), transcript replay via prime_history (OpenAI Realtime, xAI Grok), or clean expiry (none)
- Autosessionrefresh bool + Refreshwarning int on Voice page; onStatus(expiring|refreshing|expired) callbacks around refresh lifecycle
- Conversationcost readout accumulates USD across refreshes; Sessioncost resets per leg; Costbudget enforced against conversation total
- Sessionstate badge gains `refreshing`; mic pump gated off during refresh; in-flight user audio dropped (v1)
- provider_hume_evi RESUMPTION flipped from `replay` to `native` to match wire behavior (chat_group_id carries conversation memory server-side)
- docs/guide.md: Session refresh section replaces "not yet active" known-limitation
v0.2.0
First shipped provider: Gemini Live. Built fresh against the locked Group B interface; does not port the gemini_live monolith. Addresses every audit finding from notes/realtime_voice_primitive/001_session_log.md.
- `operator/provider_gemini_live.py`: `PROVIDER_TYPE='realtime'`, `TRANSPORT='ws'`, `KEY_SERVER='gemini'`, `RESUMPTION='native'`, `SUPPORTS_VIDEO_IN=True`, `FULL_DUPLEX=False`, `MAX_SESSION_S=900`, `SAMPLE_RATE_IN=16000`, `SAMPLE_RATE_OUT=24000`, `FRAME_MS=None`, `DEPENDENCIES=['google-genai']`.
- Model menu: `gemini-3.1-flash-live-preview` (default), `gemini-2.5-flash-native-audio-preview-12-2025`, `gemini-2.0-flash-live-001`.
- 2026 voice roster (22 voices): Achernar, Algenib, Algieba, Aoede, Charon, Despina, Erinome, Fenrir, Kore, Laomedeia, Leda, Orus, Puck, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi.
- `pricing(model_id)` returns normalized per-minute costs with `tier_unverified=True` (free-tier reachability is the open question in 001 → Group G).
- Audit fixes baked in:
  - Mid-session input uses `send_realtime_input(audio=…)` / `(text=…)` / `(video=…)`. No `session.send(...)` wrapper anywhere.
  - Tool responses never send `scheduling=NON_BLOCKING`; sync-only across 3.x and 2.5 by design (3.x footgun).
  - `thinking_config` plumbed: `thinking_level` (3.x) or `thinking_budget` (2.5), auto-selected per model family.
  - `turn_coverage` set explicitly on the `RealtimeInputConfig` (default `TURN_INCLUDES_AUDIO_ACTIVITY_AND_ALL_VIDEO`).
  - Push-to-talk turn mode sets `activity_handling=START_OF_ACTIVITY_INTERRUPTS` for consistent barge-in.
  - `session_resumption` gated by `Enablesessionresumption` toggle (not always-on).
  - `GoAway` surfaced via `on_status('goaway', …)` instead of being swallowed.
  - `send_audio_frame` asserts bytes + int16-aligned length; no silent float32 → 1007 close.
  - Canonical `async for message in session.receive()` loop; no manual `while self.conversation_active` receive walk.
  - Session state (socket, receive task, resumption handle) lives on the `GeminiConn` object; the module stays stateless.
- Tool schema conversion: OpenAI-style tool declarations from `ToolManager.parse_tools` are wrapped in `types.Tool(function_declarations=[types.FunctionDeclaration(...)])`. Google Search grounding is a separate `types.Tool(google_search=...)` added when `Enablegrounding` is on.
- EXT: `_collect_session_pars` now also passes `tools` (list of OpenAI-style tool definitions) via the reserved `pars['tools']` key. Provider template comment updated to document the reserved key.
- Cost ballpark: Gemini 3.1 paid tier ~$0.005/min in + $0.018/min out audio (~$18/hr voice chat). Free tier reachability via standard API key unverified.
v0.1.0
Initial scaffold of the unified realtime voice-to-voice operator.
- Submodule created with extends `["util-base-lop", "util-agent-core", "util-speech-template", "util-chained-callbacks"]`.
- `VoiceRealtimeEXT` subclasses `SpeechTemplate` (`SPEECH_TYPE='voice-realtime'`) and mixes in `ChainedCallbacksExt`.
- Manually invokes the template's TTS playback infra (`_setup_tts_outputs`, `_setup_playback_parameters`) so speaker-out works without editing `util-speech-template`.
- Wires `ProviderRegistry(owner, 'voice-realtime')`, `VoiceRealtimeCallbacks` (extends `ProviderCallbacks` with `on_tool_call` / `on_user_speech_started` / `on_user_speech_ended`), and `ToolManager` from `util-agent-core`.
- Base parameters: `Provider`, `Scanproviders`, `Providersfolder`, `Initialize`, `Shutdown`, `Active`, `Enginestatus`, `Micin` (CHOP), `Inputtext`, `Sendtext`, `Pricing`. `Endpointurl` auto-added when the active provider declares `KEY_SERVER=None`.
- Async session lifecycle: `Initialize` → `provider.start_session(pars, callbacks)` via TDAsyncIO; mic pump drains `audio_buffer` (filled by inherited `ReceiveAudioChunk`) into `provider.send_audio_frame` at the provider's `FRAME_MS` cadence; provider audio routes to the reused TTS playback chain.
- Tool calls dispatch through `ToolManager` (Tool sequence parameters created manually in TD per the agent operator convention) with results routed back via `provider.send_tool_response`.
- Callbacks page: `Callbackdat`, `Printcallbacks`. Callbacks fired: `onSessionStart`, `onSessionEnd`, `onAssistantText`, `onUserText`, `onToolCall`, `onAudioIn`, `onAudioOut`, `onProviderChange`, `onError`.
- `provider_template.py` documents the locked Group B interface: required constants (`PROVIDER_TYPE='realtime'`, `SPEECH_TYPE='voice-realtime'`, `TRANSPORT`, `SAMPLE_RATE_IN/OUT`, `FRAME_MS`, `MAX_SESSION_S`, `RESUMPTION`, `SUPPORTS_VIDEO_IN`, `SUPPORTS_VOICE_CLONING`, `FULL_DUPLEX`, `KEY_SERVER`, `DEPENDENCIES`); helpers `get_parameters`, `pricing(model_id)`, `voices`, `is_available`; async API `start_session`, `send_audio_frame`, `send_text`, `send_tool_response`, `end_session`, optional `send_video_frame`.
- No providers ship in this version. First provider (Gemini Live) lands in 0.2.0 (Group D).