AI Voice Agents

    Dualplex Voice AI Architecture: How Full-Duplex Agents Handle Interruption, Barge-In, and Sub-Second Response (2026)

    June 10, 2026·8 min read·By VocaIQ Team

    Dualplex Voice AI Architecture: How Full-Duplex Agents Handle Interruption, Barge-In, and Sub-Second Response (2026)

    When a caller interrupts a voice AI agent mid-sentence, the system has roughly 200 milliseconds to detect the speech onset, cancel the outgoing audio, discard the in-flight TTS buffer, re-evaluate context, and begin generating a new response. Classical sequential pipeline architectures cannot do this: they were designed assuming one party speaks, then the other. That assumption fails on real phone calls. The architectural pattern that removes this assumption is called full-duplex bidirectional audio processing, and its most mature productized form in 2026 is what one platform calls Dualplex. This article explains what full-duplex means in practice, why it is necessary for AI voice agents that handle real callers, and what the engineering stack behind it actually looks like.

    Why Full-Duplex Architecture Matters in 2026

    The AI voice agent market has grown sharply since 2024. A widely cited proof point: Avoca AI raised more than $125 million at a $1 billion valuation in April 2026 to build AI front-office infrastructure for service businesses, per the company's press release. That fundraise validates the category, but it also raises the engineering bar: institutional money is entering the space, and the commodity tier of "AI answers your phone" products is already crowded. The differentiator is not the capability to answer calls but the quality of the conversation once the call connects.

    Latency research consistently confirms that callers are unforgiving. Contact centers report 40 percent more hang-ups when voice agents take longer than one second to respond, according to Telnyx's production benchmark across real PSTN circuits. Delays above 800ms cause callers to begin talking over the agent before the agent finishes, which breaks the turn-taking structure and signals to the caller that the system is not human. Retell AI's evaluation of 200-plus test calls observed this threshold directly: configurations exceeding 900ms were excluded from production consideration because callers reliably started interrupting before the agent could complete a sentence. The Deepgram Voice AI Buyer's Guide 2026 notes that 8 to 12 percent of callers abandon calls in contact center deployments once latency exceeds 600ms.

    These are not edge cases. They describe the normal behavior of any caller who is not waiting patiently for a robotic system. Real callers interrupt, self-correct, change topic mid-sentence, and respond to auditory cues with overlapping speech. An architecture that cannot accommodate this is not a voice agent; it is an interactive voice response menu with a better script.

    The Three Architecture Patterns: Pipeline, Speech-to-Speech, and Hybrid

    Three dominant patterns exist in deployed voice AI systems in 2026. Understanding them is necessary to understand what full-duplex adds and what it costs.

    Pattern 1: Cascading sequential pipeline (STT to LLM to TTS). Audio is fully transcribed by a speech-to-text engine. The complete transcript is passed to a large language model. The LLM generates a full text response. The response is passed to a text-to-speech engine. Audio is played. Each stage waits for the previous to complete. End-to-end latency ranges from 800ms to 2,000ms in typical deployments, according to SuperMIA's 1,500-call comparison across 12 platforms. This is the easiest architecture to build and debug: components are independently swappable and mature tooling exists for each. The cost is pure latency. For use cases where the conversation is slow-paced and the caller is patient, it is acceptable. For live reception and dispatch scenarios, it is not.

    Pattern 2: End-to-end speech-to-speech (multimodal audio-in, audio-out). A single multimodal model processes raw audio input and produces audio output without intermediate text conversion. OpenAI's Realtime API and Google's Gemini 2.0 native audio mode operate on this principle. Latency drops to 300 to 600ms end-to-end. The tradeoff is voice quality and model choice: the model that handles audio natively does not necessarily produce the most natural or expressive speech output, and operator control over voice persona is limited. Debugging is also harder because there is no intermediate text representation to inspect.

    Pattern 3: Hybrid streaming pipeline. STT, LLM, and TTS stages run in parallel using streaming interfaces. Partial transcripts are streamed from the STT engine to the LLM before the caller finishes speaking. The LLM begins generating tokens before the utterance is complete. The TTS engine begins synthesizing audio from the first LLM tokens. End-to-end latency targets 500 to 900ms. This is the dominant pattern for platforms such as Vapi, Retell AI, and PolyAI. It preserves component independence, enables barge-in handling at the STT layer, and provides a text channel for debugging and logging.

    Each pattern has a fundamental constraint: none of them natively handles the case where the caller speaks while the agent is also speaking, which is the definition of conversational overlap and the engineering problem that full-duplex architecture is designed to solve.

    Full-Duplex Defined: Simultaneous Listen-and-Speak with Overlap Awareness

    In telephony, full-duplex refers to a channel that transmits in both directions simultaneously. In the context of AI voice agents, full-duplex means the agent's audio pipeline is active for both input (listening to the caller) and output (generating and playing speech) at the same time. This is distinct from half-duplex, where the system either listens or speaks but not both, and from simple voice activity detection (VAD) patterns where the system silently monitors for speech onset while playing audio but cannot react until the current TTS buffer empties.

    A Dualplex implementation, as it is currently deployed in practice, combines the low-latency inference path of a realtime multimodal model with a separate, high-quality TTS engine. The signal flow is approximately as follows: the caller's audio stream is fed continuously into a realtime STT and LLM processor that operates with sub-200ms first-token latency. At the same time, the TTS engine holds an active rendering session that can receive token interruptions. When the system detects speech onset from the caller while it is speaking, it fires a cancellation signal to the TTS session, flushes the audio buffer, and routes the new partial transcript to the LLM context window for re-evaluation. The LLM generates a new response from the updated context, and the TTS session resumes from the new token stream.

    The critical engineering requirement is that the cancellation and re-routing path must complete in under the caller's perceptual threshold for awkward silence, which human conversation research places at roughly 200ms for same-language speakers. Telnyx's benchmark data cites cross-cultural research placing the fastest natural inter-turn gap at around 200ms, with Japanese speakers at minus 7ms (overlapping speech) and Danish speakers averaging 469ms. A voice AI system that reacts to interruption in more than 400ms will already feel delayed to a significant portion of callers.

    The Dualplex pattern specifically introduced in the platform documentation (v1.5.0, September 2025, matured with GPT Realtime 1.5 in v1.7.7, March 2026) addresses the voice quality problem in pure speech-to-speech approaches: it uses the realtime STT and LLM path from OpenAI's Realtime models for natural, low-latency inference, then routes the output through ElevenLabs TTS for premium voice synthesis. This separates the inference latency problem (handled by the realtime model) from the voice quality problem (handled by ElevenLabs), giving the system 300 to 600ms end-to-end response latency while maintaining access to ElevenLabs' full voice library and cloning capabilities.

    The Latency Budget: Where the Milliseconds Go

    Understanding the latency budget is necessary for any engineer evaluating voice AI infrastructure. The total perceived delay from end of caller utterance to start of agent audio has several components, each with its own optimization ceiling.

    In a typical stitched pipeline connecting separate vendors for SIP, STT, LLM, and TTS, the component breakdown looks like this based on data from multiple production benchmarks:

    • Network ingress and SIP signaling: 100 to 200ms in a typical multi-vendor stitched stack. Can be reduced to 30 to 60ms with co-located infrastructure.
    • Speech-to-text (ASR) processing: 150 to 300ms for a well-optimized streaming STT engine. Deepgram Nova-3 achieves a 54.2 percent reduction in word error rate for streaming compared to competitors on noisy audio, per Deepgram's own benchmarks, which translates not just to accuracy but to faster partial-transcript availability.
    • Endpoint detection: The determination of when the caller has finished speaking adds 50 to 150ms depending on the VAD model and silence threshold configuration. Aggressive endpoint detection reduces this delay but increases the rate of false endpoints, causing the agent to start responding before the caller finishes. Conservative detection increases delay and frustrates callers who pause mid-thought.
    • LLM inference (time to first token): 150 to 400ms for optimized realtime models; 300 to 1,000ms for standard pipeline models. The first token is what matters for TTS onset, not the full completion time.
    • TTS time-to-first-audio (TTFA): Cartesia Sonic 3 achieves 90ms TTFA using State Space Model architecture rather than transformers, per the Inworld AI TTS benchmark. ElevenLabs achieves sub-100ms in optimized configurations. Standard neural TTS implementations range from 200 to 500ms TTFA.
    • Audio delivery and final network: 30 to 100ms depending on infrastructure proximity.

    The ITU-T G.114 standard establishes 150ms one-way delay as optimal for high-quality real-time voice traffic. Achieving sub-600ms end-to-end in a multi-vendor stack requires every component to hit its optimized target simultaneously. A single underperforming component, a slow LLM API response at the 95th percentile load, or a cross-region TTS API call, blows the budget. This is why the Telnyx benchmark showed that stitched stacks produced 800 to 1,650ms at the p95 percentile even when average-case performance appeared acceptable: tail latency from any one component cascades to every subsequent stage.

    For full-duplex interruption handling, the latency budget has an additional constraint: the barge-in detection and buffer cancellation path must complete before the caller perceives that they have been ignored. This path is outside the normal STT-LLM-TTS pipeline and must be implemented as a separate real-time event handler. Systems that share the same event loop for TTS rendering and barge-in detection will introduce jitter.

    Engineering Challenges: What Makes Full-Duplex Hard

    Full-duplex voice AI introduces specific engineering problems that do not exist in half-duplex or simple VAD-then-respond architectures.

    Echo cancellation. When the agent is speaking through the caller's telephone handset or speaker, the agent's own audio is picked up by the caller's microphone and re-transmitted back to the system. Without acoustic echo cancellation (AEC), the system will detect its own speech as caller input and attempt to interrupt itself. Standard telephony AEC handles physical echoes but not the full-duplex case where the caller speaks simultaneously with the agent: the two audio streams must be separated cleanly at the codec level before VAD can operate correctly.

    Confidence-weighted partial transcripts. Streaming STT engines produce partial transcripts continuously, but early partial tokens have low confidence. A full-duplex system that acts on every partial token will generate false barge-in events and premature LLM re-evaluations. The correct approach involves a confidence threshold layer that weighs partial transcript stability before routing to the LLM, introducing a small intentional delay to reduce false positives. The tradeoff between responsiveness and false-positive rate is a configurable parameter, not a solved constant.

    Mid-utterance TTS cancellation. When a barge-in event is confirmed, the TTS audio buffer must be cancelled cleanly. If the cancellation is abrupt, the caller hears a truncated word or a click artifact. A well-implemented cancellation finds a natural phoneme boundary in the buffer before cutting, which requires the TTS engine to expose buffer position metadata in real time. Not all TTS APIs provide this.

    Context window management after interruption. When the caller interrupts, the LLM must be re-evaluated with a context that includes the partial agent utterance (what the agent had said before being interrupted), the caller's interrupting utterance, and the conversation history to that point. Agents that discard the partial utterance context will lose coherence; agents that include too much partial text will produce confused responses. The correct behavior, as observed in Retell AI's proprietary turn-taking model during independent testing, is to acknowledge the interruption and resume on topic without repeating the prior sentence.

    Language-switch detection mid-call. In multilingual deployments, a caller may switch languages mid-utterance. The STT engine must detect this without waiting for a full-sentence endpoint and the system must swap TTS voice profile and LLM prompt language instruction simultaneously. In a standard pipeline this requires a full context reset; in a streaming full-duplex system it requires a hot-swap of the active inference chain without dropping audio frames.

    Trade-offs and Honest Constraints

    Full-duplex and Dualplex-pattern architectures are not strictly superior in all deployment scenarios. Several real constraints apply.

    • Voice model selection is narrower. Dualplex mode routes through the realtime LLM stack for inference, which limits the available voice models to those compatible with the realtime API. Custom-cloned voices or voices available only in the standard TTS library may not be accessible in Dualplex mode, requiring a fallback to the sequential pipeline for certain deployments.
    • Debugging complexity is higher. Because multiple pipeline stages run in parallel and the barge-in handler operates as an independent event channel, production issues can involve race conditions between the TTS renderer, the barge-in detector, and the LLM context manager. Log correlation across these components requires per-turn latency tracking at the processing step level, not just aggregate call-level metrics.
    • Beta-stage caveats. The Dualplex pattern introduced in September 2025 carried a beta designation as of March 2026. Beta designation in a production telephony context means potential instability at edge cases such as simultaneous double-interruption, rapid context switching in noisy environments, and high concurrent load. Any deployment should include staging-environment validation before production rollout to enterprise clients.
    • False barge-in rates increase at lower confidence thresholds. More responsive barge-in handling comes at the cost of more frequent false interruptions triggered by background noise, caller hesitation sounds, or short affirmative utterances such as "uh-huh". Tuning requires call sample analysis per deployment context, not a single universal threshold.
    • Echo cancellation quality depends on the telephony provider layer. The cleanliness of the full-duplex audio stream is constrained by the AEC implementation at the SIP and codec layer. Systems running over carriers with inconsistent AEC quality will see higher false barge-in rates regardless of how well the AI layer is configured.

    Architecture and Latency Benchmark Comparison

    Architecture Pattern Typical End-to-End Latency Interruption Handling Voice Quality Debug Complexity Representative Platforms
    Sequential pipeline (STT to LLM to TTS) 800ms to 2,000ms None (waits for TTS completion) High (any TTS engine) Low Early Bland AI, Synthflow, most DIY builds
    Speech-to-Speech (end-to-end multimodal) 300ms to 600ms Model-native; limited tuning Model-dependent; limited voice choice High (no text layer) OpenAI Realtime API, Sierra AI
    Hybrid streaming pipeline 500ms to 900ms STT-layer detection; configurable High (any TTS engine) Medium Vapi, Retell AI, PolyAI, SuperMIA
    Dualplex (Realtime LLM + premium TTS) 300ms to 600ms (target) Full barge-in with ElevenLabs TTS cancel Premium (ElevenLabs library + cloning) High (parallel event channels) Proprietary; not available as a commodity
    Rosie AI via Bland AI (sequential) 800ms to 1,500ms (independent tests) Limited; callers talk over agent at this latency Bland AI custom synthesis Low Rosie AI
    Goodcall (streaming) Claimed under 300ms; independent tests approx. 600ms Real-time streaming; 6 voices only 6 voices total, no cloning Medium Goodcall
    CallBird AI (pipeline) Average 1,200ms (published) Not disclosed Not disclosed Not disclosed CallBird AI

    Sources for benchmark data: SuperMIA 1,500-call platform comparison, Coval.dev Voice AI Platform Comparison 2026, Telnyx production PSTN benchmark, Retell AI 200-call evaluation. Competitor latency figures reflect independent test measurements, not vendor-claimed averages.

    Common Mistakes When Evaluating Full-Duplex Voice AI

    • Measuring average latency instead of the 95th percentile. A system that averages 400ms but hits 1,200ms at p95 under load will produce a bad caller experience more often than the average suggests. Coval.dev specifically recommends measuring at the 95th percentile for this reason. Vendor-reported average-case figures are almost always lower than production p95 numbers.
    • Treating vendor latency claims as end-to-end measurements. Several platforms publish sub-300ms latency figures that apply only to a single pipeline component, such as the STT stage, not the full call-to-first-audio round trip. Deepgram's sub-300ms figure refers to the STT API component; full end-to-end latency including LLM and TTS is longer. Always request or measure full round-trip time from end of caller utterance to start of agent audio.
    • Testing in clean audio conditions only. Endpoint detection and VAD accuracy degrade with background noise, accents, and compressed audio (G.711 vs. G.722 codec differences). A system that performs at 400ms latency with clean studio audio may miss endpoints or generate false barge-ins in production with PSTN noise at 55 to 65 dB.
    • Confusing full-duplex capability with full-duplex tuning. A platform may support barge-in detection technically but ship with conservative default thresholds that effectively disable it in most real-call scenarios. Evaluate with actual barge-in test calls, not feature documentation alone.
    • Ignoring the echo cancellation layer. Full-duplex performance is only as good as the AEC quality at the telephony layer. A high-quality AI pipeline behind a low-quality SIP provider with poor AEC will produce persistent false barge-in events that degrade the conversation worse than a half-duplex system with stable turn detection.

    How VocaIQ Approaches Dualplex Architecture

    VocaIQ's managed AI voice agent deployments use Dualplex mode as the default for production call handling, routing through the GPT Realtime 1.5 inference path for low-latency processing and ElevenLabs TTS for premium voice output. The result is a 300 to 600ms total response latency that places conversational performance in the excellent-to-good tier on published benchmarks, where callers do not perceive an artificial pause. This is the premium class voice agent callers do not realize is not a person.

    The underlying platform supports 18 named LLM models and versions spanning OpenAI GPT-5.4 through GPT Realtime variants and Google Gemini 3.1 Flash Live, allowing VocaIQ to route calls to the appropriate model per use case: GPT-5.4 Nano for high-volume FAQ calls, GPT-5.4 for complex medical intake or legal qualification, Gemini 3.1 Flash Live for multilingual calls requiring fast context switching. Mid-call language switching across 100-plus languages is supported natively, handling the caller who opens in English and shifts to French or Mandarin without a system restart. Deployments scale to 1,000 or more concurrent calls, with ISO 27001, ISO 9001, HIPAA, and GDPR compliance, and call data does not train models. VocaIQ's managed service handles all Dualplex configuration, threshold tuning, and LLM selection, so operators receive production-calibrated performance without managing the underlying architecture directly. More at vocaiq.ai.

    Bottom Line

    Full-duplex voice AI architecture is not a marketing term: it is the engineering precondition for a voice agent that behaves like a human conversationalist rather than a sophisticated IVR. The core requirement is a system that actively monitors caller audio while generating its own output and can cancel, re-route, and resume within the 200 to 400ms window that human perception allows for natural turn-taking. Achieving this requires parallel event handling across the STT, VAD, LLM context, and TTS layers, plus clean echo cancellation at the telephony foundation. The Dualplex pattern, combining realtime LLM inference with a premium TTS engine, represents the current practical state of the art for balancing sub-600ms latency with production voice quality. Sequential pipelines remain viable for complex reasoning tasks where latency tolerance is higher, but any deployment that needs to handle real callers who interrupt, correct, and overlap requires a full-duplex capable architecture as its baseline.

    Frequently Asked Questions

    What is the difference between full-duplex and half-duplex in voice AI?

    In voice AI, half-duplex means the agent either listens or speaks but not both simultaneously. When the agent is playing audio, it cannot detect or respond to caller speech until playback completes. Full-duplex means both channels are active simultaneously: the agent monitors incoming caller audio in real time even while it is generating and playing output audio. This enables barge-in, natural interruption, and overlap-aware turn-taking, behaviors that are normal in human conversation but impossible in a half-duplex system.

    What is Dualplex and how does it differ from speech-to-speech architecture?

    Dualplex is a hybrid architecture that combines the low-latency inference path of a realtime multimodal model (which handles both listening and processing simultaneously) with a separate high-quality text-to-speech engine such as ElevenLabs for audio output. Pure speech-to-speech architectures use a single model for all processing, which limits voice quality and model choice. Dualplex separates inference latency (handled by the realtime model) from voice quality (handled by the premium TTS engine), achieving 300 to 600ms end-to-end latency while maintaining access to a full voice library and cloning capabilities.

    Why does voice AI latency above 600ms cause problems?

    Human conversation operates with inter-turn gaps averaging around 200ms. Research and production benchmarks consistently show that delays above 500ms begin triggering listener anxiety, delays above 800ms cause callers to start talking over the agent before it finishes, and delays above 1,000ms generate hang-ups at measurable rates. Above 600ms, 8 to 12 percent of callers in contact center environments abandon the call entirely. These are not edge cases: they reflect normal caller behavior when the conversation rhythm feels unnatural.

    What engineering problems does full-duplex voice AI introduce?

    Full-duplex adds several engineering challenges absent from simpler architectures: acoustic echo cancellation to prevent the agent from detecting its own audio output as caller input; confidence-weighted partial transcript handling to avoid false barge-in events from low-confidence STT partials; mid-utterance TTS cancellation without audible artifacts; context window management after interruption so the LLM has coherent state; and language-switch detection mid-call for multilingual deployments. Each of these requires separate engineering beyond the standard STT-LLM-TTS pipeline.

    How should I measure voice AI latency to get accurate production numbers?

    Measure full round-trip latency from the end of the caller's utterance to the start of agent audio output, not individual component latency. Always measure at the 95th percentile under realistic concurrent load, not average-case in clean conditions. Test with PSTN audio conditions, including background noise and codec compression, not studio audio. Vendor-published latency claims almost always represent best-case component measurements, not production p95 end-to-end figures. Independent benchmarks such as the Coval.dev platform comparison and Telnyx's 100-concurrent-call PSTN test are more representative than vendor documentation alone.

    Can a voice AI system handle mid-call language switching with full-duplex architecture?

    Yes, but it requires coordinated hot-swapping across the STT engine language model, the TTS voice profile, and the LLM prompt language instruction, all within a single streaming context without dropping audio frames or resetting the conversation history. In a standard sequential pipeline, language switching typically requires a full context reset with a noticeable pause. In a properly implemented full-duplex streaming architecture with multilingual STT, language detection can trigger mid-utterance and the voice switch can complete before the next agent turn begins. This capability is available in deployments with ElevenLabs' 100-plus language library and a multilingual-capable realtime LLM such as Gemini 3.1 Flash Live.

    Frequently Asked Questions

    What is the difference between full-duplex and half-duplex in voice AI?

    In voice AI, half-duplex means the agent either listens or speaks but not both simultaneously. When the agent is playing audio, it cannot detect or respond to caller speech until playback completes. Full-duplex means both channels are active simultaneously: the agent monitors incoming caller audio in real time even while it is generating and playing output audio. This enables barge-in, natural interruption, and overlap-aware turn-taking, behaviors that are normal in human conversation but impossible in a half-duplex system.

    What is Dualplex and how does it differ from speech-to-speech architecture?

    Dualplex is a hybrid architecture that combines the low-latency inference path of a realtime multimodal model with a separate high-quality text-to-speech engine such as ElevenLabs for audio output. Pure speech-to-speech architectures use a single model for all processing, which limits voice quality and model choice. Dualplex separates inference latency from voice quality, achieving 300 to 600ms end-to-end latency while maintaining access to a full voice library and cloning capabilities.

    Why does voice AI latency above 600ms cause problems?

    Human conversation operates with inter-turn gaps averaging around 200ms. Research and production benchmarks show that delays above 500ms begin triggering listener anxiety, delays above 800ms cause callers to start talking over the agent before it finishes, and delays above 1,000ms generate hang-ups at measurable rates. Above 600ms, 8 to 12 percent of callers in contact center environments abandon the call entirely.

    What engineering problems does full-duplex voice AI introduce?

    Full-duplex adds several engineering challenges: acoustic echo cancellation to prevent the agent from detecting its own audio output as caller input; confidence-weighted partial transcript handling to avoid false barge-in events; mid-utterance TTS cancellation without audible artifacts; context window management after interruption so the LLM has coherent state; and language-switch detection mid-call for multilingual deployments. Each requires separate engineering beyond the standard STT-LLM-TTS pipeline.

    How should I measure voice AI latency to get accurate production numbers?

    Measure full round-trip latency from the end of the caller's utterance to the start of agent audio output, not individual component latency. Always measure at the 95th percentile under realistic concurrent load, not average-case in clean conditions. Test with PSTN audio conditions including background noise and codec compression. Vendor-published latency claims almost always represent best-case component measurements, not production p95 end-to-end figures.

    Can a voice AI system handle mid-call language switching with full-duplex architecture?

    Yes, but it requires coordinated hot-swapping across the STT engine language model, the TTS voice profile, and the LLM prompt language instruction, all within a single streaming context without dropping audio frames or resetting conversation history. In a properly implemented full-duplex streaming architecture with multilingual STT, language detection can trigger mid-utterance and the voice switch can complete before the next agent turn begins.

    Ready to see AI receptionist in action?

    Talk to Alex, our live AI receptionist, in the next 60 seconds.

    Try Live Demo →
    dualplexvoice-aiai-voice-agenttechnicalai-receptionistfull-duplexreal-time-voice