Voice AI & Networking Glossary
Essential terminology for building real-time conversational AI systems
A
AI Agent
Voice AI: An autonomous software system that processes voice input, understands intent, and generates natural spoken responses in real-time conversations.
Agentic AI
Voice AI: An AI system that autonomously pursues multi-step goals, making decisions and invoking tools without a human prompt at each step. Unlike single-turn agents, agentic systems plan, retry, and self-correct across extended interactions — handling an entire workflow within one call rather than routing between departments.
B
Barge-in
Voice AI: The ability for a user to interrupt an AI agent while it is speaking, causing the agent to stop and listen to the new input. Essential for natural conversation flow.
Branded Calling
Networking: A carrier-level service that displays a verified business name and logo on the recipient's phone screen before they answer. Unlike application-layer caller ID, branded calling relies on carrier infrastructure and STIR/SHAKEN attestation to authenticate identity — making it resistant to spoofing and critical for AI agents that place outbound calls.
C
CDN (Content Delivery Network)
Infrastructure: A geographically distributed network of servers that delivers content to users from the nearest location, reducing latency and improving performance.
Codec
Speech Processing: Short for coder-decoder, a codec compresses and decompresses audio data to reduce bandwidth while maintaining quality.
Co-located Inference
Infrastructure: Running AI model inference (STT, LLM, TTS) in the same physical facility where voice calls terminate, eliminating inter-provider network hops. When inference and telephony share a facility, audio never traverses the public internet between processing stages — reducing round-trip overhead to near zero.
Compound Availability
Architecture: The effective uptime of a multi-vendor system, calculated by multiplying each vendor's individual availability. Five vendors each at 99.9% uptime yield a compound availability of ~99.5% — roughly 3.6 hours of downtime per month, compared to about 43 minutes for a single vendor at 99.9%. Each additional vendor boundary is an independent failure domain that degrades overall system reliability.
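The figures follow directly from multiplying per-vendor availabilities; a quick sketch to verify, using the five-vendor, 99.9% scenario from the definition:

```python
def compound_availability(availabilities):
    """Multiply per-vendor availabilities to get effective system uptime."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

HOURS_PER_MONTH = 720  # 30-day month

# Five vendors, each at 99.9% uptime
compound = compound_availability([0.999] * 5)
downtime_hours = (1 - compound) * HOURS_PER_MONTH

print(f"compound availability: {compound:.4%}")    # ~99.50%
print(f"monthly downtime: {downtime_hours:.1f} h") # ~3.6 h (vs ~0.7 h single-vendor)
```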
Context Window
Voice AI: The maximum amount of text — measured in tokens — that an LLM can process in a single inference call, including both input and output. In voice AI, context window limits determine how many conversational turns an agent retains before older exchanges are dropped, directly affecting whether the agent remembers what the caller said five minutes ago.
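As a rough back-of-envelope (the window size, prompt size, and tokens-per-turn figures below are illustrative assumptions, not any specific model's limits):

```python
def turns_retained(context_tokens, system_prompt_tokens,
                   reserved_output_tokens, avg_tokens_per_turn):
    """Estimate how many past conversational turns fit in the context window."""
    budget = context_tokens - system_prompt_tokens - reserved_output_tokens
    return max(budget // avg_tokens_per_turn, 0)

# Hypothetical: 8192-token window, 1000-token system prompt, 500 tokens
# reserved for the reply, ~60 tokens per spoken turn (a sentence or two)
print(turns_retained(8192, 1000, 500, 60))  # 111 turns of history
```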
Conversational AI
Voice AI: Technology that enables machines to understand, process, and respond to human language in a natural, context-aware manner across voice and text channels.
CPaaS (Communications Platform as a Service)
Infrastructure: A cloud platform that provides programmable voice, messaging, and video capabilities through APIs, enabling developers to embed real-time communications into applications without building telecom infrastructure. Traditional CPaaS provides the connectivity layer; real-time AI infrastructure extends this by co-locating inference, orchestration, and carrier services in a single system.
D
Diarization
Speech Processing: The process of partitioning an audio stream into segments according to speaker identity, answering "who spoke when" in a conversation.
E
Edge Computing
Infrastructure: Computing infrastructure positioned close to data sources to minimize latency and bandwidth usage by processing data near where it's generated.
Endpointing (End-of-Turn Detection)
Voice AI: Detecting when a speaker has finished their turn in a conversation, enabling the AI to respond at the right moment without cutting off the user or waiting too long.
F
Function Calling (Tool Use)
Voice AI: The ability for an LLM to invoke external functions or APIs during a conversation to retrieve data or perform actions, enabling AI agents to interact with real-world systems.
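A minimal sketch of the dispatch loop on the application side. The tool name, payload shape, and `lookup_order` helper are illustrative, not any specific provider's API:

```python
import json

# Hypothetical tool: the model emits a tool name plus JSON arguments,
# and the application executes the matching function.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

TOOLS = {"lookup_order": lookup_order}

def dispatch(tool_call_json: str) -> dict:
    """Execute a tool call emitted by the model, e.g. during a live phone call."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

result = dispatch('{"name": "lookup_order", "arguments": {"order_id": "A123"}}')
print(result)  # {'order_id': 'A123', 'status': 'shipped'}
```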
Frankenstack
Architecture: A multi-vendor voice AI architecture that chains together separate providers for telephony, speech-to-text, LLM routing, text-to-speech, and orchestration. Each vendor boundary adds 30–80ms of network overhead, creates an independent failure domain, and takes its own margin. A typical five-vendor Frankenstack accumulates 150–400ms of latency before any model even begins processing.
Full-stack Voice AI
Architecture: A voice AI architecture where speech-to-text, LLM routing, text-to-speech, voice cloning, orchestration, and carrier-grade telephony all operate within a single platform and network. This eliminates the integration tax, compound vendor margins, and cross-boundary debugging overhead of multi-vendor stacks. The structural cost advantage is permanent — it comes from vertical integration, not promotional pricing.
H
Hallucination
Voice AI: When an LLM generates plausible-sounding but factually incorrect information. In voice AI, hallucinations are especially consequential: the caller cannot scroll back to verify, there is no visual context to signal uncertainty, and a confidently spoken falsehood damages trust immediately. RAG and constrained prompting are the primary mitigations.
I
Intent Recognition
Voice AI: The process of identifying the user's goal or purpose from their spoken or written input in a conversational system.
Integration Tax
Architecture: The ongoing engineering cost of assembling and maintaining a multi-vendor voice AI stack. Includes custom integration work ($200K–$500K/year for enterprise), re-integration every time a vendor pushes API changes (1–2 sprints per quarter), and cross-vendor debugging (4–8 hours per incident across multiple support teams and dashboards). Often exceeds the combined component costs of the individual services.
Inter-provider Hops
Networking: Network round-trips between separate vendor systems in a multi-vendor voice AI pipeline. Each hop crosses the public internet, adding 30–80ms of latency per boundary. In a five-vendor stack, inter-provider hops alone consume 150–400ms — often exceeding the latency budget for natural-sounding conversation (under 300ms total).
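The budget arithmetic is easy to check. The per-hop figures below are hypothetical values drawn from the 30–80ms range above:

```python
# Illustrative hop latencies (ms) for a five-vendor pipeline
hops = {
    "telephony -> STT": 40,
    "STT -> orchestrator": 35,
    "orchestrator -> LLM": 60,
    "LLM -> TTS": 45,
    "TTS -> telephony": 50,
}

BUDGET_MS = 300  # rough ceiling for natural-sounding conversation

network_overhead = sum(hops.values())
print(f"inter-provider overhead: {network_overhead} ms")        # 230 ms
print(f"left for actual inference: {BUDGET_MS - network_overhead} ms")  # 70 ms
```

With mid-range hops alone, most of the 300ms budget is spent before any model runs.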
IVR (Interactive Voice Response)
Voice AI: A telephony system that interacts with callers through pre-recorded prompts and keypad input (DTMF) to route calls or provide self-service. IVR systems follow rigid decision trees — "Press 1 for billing, Press 2 for support." Voice AI agents replace these trees with natural conversation that understands spoken intent and adapts in real time.
J
Jitter
Networking: Variation in packet arrival times over a network connection, causing inconsistent delays that can degrade voice quality.
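A simplified way to quantify this: measure how far each packet's inter-arrival gap deviates from the expected gap. This is a sketch for intuition; RTP stacks use the smoothed estimator defined in RFC 3550 instead:

```python
def interarrival_jitter(arrival_times_ms):
    """Mean absolute deviation of inter-arrival gaps from the average gap."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return sum(abs(g - mean_gap) for g in gaps) / len(gaps)

# 20ms audio frames arriving with uneven network delays
arrivals = [0, 20, 45, 60, 85, 100]
print(f"{interarrival_jitter(arrivals):.1f} ms")  # 4.0 ms
```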
L
LLM (Large Language Model)
Voice AI: A neural network trained on massive text datasets that can understand and generate human language. The "brain" of conversational AI systems that processes user intent and generates responses.
Latency
Networking: The time delay between a request and its response in a system. In voice AI, this includes network transmission, processing, and response generation time. Lower latency creates more natural, human-like conversations.
Load Balancing
Infrastructure: Distributing incoming network traffic across multiple servers to ensure no single server becomes overwhelmed, improving reliability and performance.
M
MOS (Mean Opinion Score)
Networking: A numerical measure of voice quality on a scale of 1 to 5, derived from listener ratings. Used to benchmark call quality in VoIP and voice AI systems.
N
Natural Language Processing (NLP)
Voice AI: A field of AI focused on enabling computers to understand, interpret, and generate human language in both written and spoken forms.
O
Observability
Infrastructure: The ability to understand the internal state of a production system from its external outputs — logs, traces, and metrics. In voice AI, observability means instrumenting every stage of the STT → LLM → TTS pipeline to diagnose latency spikes, transcription errors, and agent failures. Without end-to-end tracing, a multi-vendor pipeline produces five dashboards and no root cause.
P
Packet Loss
Networking: The failure of network data packets to reach their destination, causing gaps or degradation in voice calls and data transmission.
P99 Latency
Networking: The 99th-percentile response time: the latency that 99% of requests complete within. In voice AI, P99 is the metric that matters — not P50 (the median). A voice agent with 200ms median latency but 900ms P99 will produce noticeably broken conversations for 1 in 100 turns. Multi-vendor pipelines compound P99 because each inter-provider hop adds its own tail latency.
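The median-vs-tail distinction is worth seeing numerically. A sketch using a deliberately coarse index-based percentile (production systems typically interpolate between order statistics):

```python
def percentile(samples_ms, p):
    """Simple index-method percentile: coarse, but fine for illustration."""
    ordered = sorted(samples_ms)
    idx = min(int(len(ordered) * p / 100), len(ordered) - 1)
    return ordered[idx]

# 100 turns: 99 fast responses and a single 900ms outlier
latencies = [200] * 99 + [900]
print(percentile(latencies, 50))  # 200 — the median looks healthy
print(percentile(latencies, 99))  # 900 — the tail tells the real story
```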
Prompt Engineering
Voice AI: The practice of crafting input instructions to an LLM to reliably produce desired outputs. In voice AI, prompt engineering carries additional constraints: responses must be short enough for natural TTS playback, must avoid formatting (markdown, bullet points, URLs) that does not survive speech synthesis, and must handle conversational register — the difference between reading text and speaking to a human.
PSTN (Public Switched Telephone Network)
Networking: The traditional global telephone network infrastructure that connects landlines and mobile phones. Voice AI agents connect to PSTN to make and receive real phone calls.
R
RAG (Retrieval Augmented Generation)
Voice AI: A technique that enhances LLM responses by retrieving relevant information from external knowledge bases before generating an answer, reducing hallucinations and providing up-to-date information.
Real-time Factor (RTF)
Speech Processing: A measure of processing speed relative to audio duration. An RTF of 0.5 means 1 second of audio is processed in 0.5 seconds. Lower is faster; RTF below 1.0 is required for real-time applications.
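The definition reduces to a single ratio; a quick sketch:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 keeps up with live audio."""
    return processing_seconds / audio_seconds

# A transcriber that handles 10s of audio in 4s of compute
rtf = real_time_factor(4.0, 10.0)
print(rtf)        # 0.4
print(rtf < 1.0)  # True: fast enough for real-time streaming
```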
RTP (Real-time Transport Protocol)
Networking: The network protocol that carries audio and video media over IP networks, providing sequencing, timing, and payload-type identification. While SIP handles call signaling (setup, teardown, routing), RTP carries the actual voice data. Every inter-provider hop in a voice AI pipeline adds RTP forwarding latency.
S
Scalability
Infrastructure: The ability of a system to handle increasing workloads by adding resources, while maintaining performance and reliability.
Streaming
Infrastructure: Processing data incrementally as it arrives rather than waiting for complete input. In voice AI, streaming enables responses to begin before the full audio or text is received.
SIP (Session Initiation Protocol)
Networking: A signaling protocol used to initiate, maintain, and terminate real-time voice, video, and messaging sessions over IP networks.
SIP Trunking
Networking: A VoIP service that connects a PBX or communications platform to the PSTN over IP using SIP, replacing traditional physical phone lines. SIP trunks are the connectivity layer that voice AI platforms use to originate and terminate calls at scale — the bridge between the carrier network and the application.
STIR/SHAKEN
Networking: A framework of telecom standards (Secure Telephone Identity Revisited / Signature-based Handling of Asserted information using toKENs) mandated by the FCC to verify caller identity and combat robocall fraud. Only the originating carrier can provide full A-level attestation — the highest trust rating. AI platforms that build on top of carriers inherit lower B or C-level attestation, making their calls more likely to be flagged or blocked.
Speech-to-Text (STT)
Speech Processing: Technology that converts spoken audio into written text, enabling voice AI systems to process and understand human speech.
T
TTFB (Time to First Byte)
Networking: The time between making a request and receiving the first byte of the response. In voice AI, TTFB for TTS measures how quickly audio playback can begin.
Turn-taking
Voice AI: The conversational protocol governing when each party speaks. Effective turn-taking enables natural back-and-forth dialogue without awkward overlaps or long silences.
Text-to-Speech (TTS)
Speech Processing: Technology that synthesizes natural-sounding speech from written text, allowing AI systems to communicate verbally with users.
U
Uptime
Infrastructure: The percentage of time a system or service is operational and available, typically measured as a percentage (e.g., 99.9% uptime).
V
VAD (Voice Activity Detection)
Speech Processing: Technology that detects the presence or absence of human speech in an audio signal, distinguishing speech from silence, background noise, or music.
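The simplest form of VAD is an energy gate over short audio frames. This is a toy sketch for intuition; production VADs add spectral features, trained models, and hangover logic to avoid clipping quiet speech:

```python
import math

def frame_energy(samples):
    """Root-mean-square energy of one audio frame (normalized samples)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(samples, threshold=0.05):
    """Crude energy gate: frames above the threshold count as speech."""
    return frame_energy(samples) > threshold

# One 10ms frame at 16kHz: near-silence vs a 440Hz tone at 0.3 amplitude
silence = [0.001] * 160
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]

print(is_speech(silence))  # False
print(is_speech(tone))     # True
```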
Voice API
Voice AI: A programmatic interface that enables developers to integrate voice communication capabilities into applications, including making calls, receiving calls, and processing audio.
Voice AI Infrastructure
Infrastructure: The integrated system of edge compute, voice AI platform, and global communications required to power AI agents that interact with humans over the telephone network. Unlike application-layer platforms that chain together multiple vendors, voice AI infrastructure owns the full call path — from inference to carrier delivery — in a single operational domain.
Voice Biometrics
Speech Processing: Technology that identifies or verifies individuals based on unique characteristics in their voice, such as pitch, tone, and speech patterns.
Voice Cloning
Speech Processing: Creating a synthetic voice model that replicates the timbre, accent, and speaking style of a specific person from audio samples. Modern systems can produce a usable clone from as little as 3 seconds of audio. Voice cloning enables TTS to generate speech in a custom brand voice rather than a generic synthesized one — and creates the fraud risk that STIR/SHAKEN and AI voice detection are designed to counter.
W
Wake Word Detection
Voice AI: Technology that continuously listens for a specific trigger phrase (like "Hey Siri" or "Alexa") to activate a voice assistant or AI agent.
Webhook
Infrastructure: An HTTP callback where a server sends real-time event notifications to a specified URL when a trigger occurs. Voice APIs use webhooks to notify applications of call lifecycle events — call answered, transcription ready, recording complete, call ended — enabling event-driven architectures without polling. For example, a call.answered event can trigger the application to start streaming audio to the STT → LLM → TTS pipeline.
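A sketch of the receiving side. The event names, payload shape, and handler functions here are illustrative; each provider defines its own schema, and a real endpoint would also verify the provider's signature before dispatching:

```python
import json

def on_answered(payload):
    return f"start streaming call {payload['call_id']} to the STT pipeline"

def on_ended(payload):
    return f"archive transcript for call {payload['call_id']}"

# Handlers keyed by event type, registered on the webhook URL
HANDLERS = {"call.answered": on_answered, "call.ended": on_ended}

def handle_webhook(body: str) -> str:
    """Parse an incoming webhook body and route it to the matching handler."""
    event = json.loads(body)
    handler = HANDLERS.get(event["type"])
    return handler(event["payload"]) if handler else "ignored"

print(handle_webhook('{"type": "call.answered", "payload": {"call_id": "c42"}}'))
```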
WebSocket
Networking: A persistent, full-duplex communication protocol over a single TCP connection that enables simultaneous bidirectional data flow. In voice AI, WebSocket connections carry continuous audio streams between the telephony platform and processing services — the real-time transport layer for streaming STT input and TTS output without the overhead of repeated HTTP requests.
WebRTC
Networking: Web Real-Time Communication, an open standard and set of browser APIs (with an open-source reference implementation) enabling peer-to-peer audio, video, and data sharing directly in web browsers without plugins.