Voice AI & Networking Glossary
Essential terminology for building real-time conversational AI systems
A
AI Agent
Voice AI: An autonomous software system that processes voice input, understands intent, and generates natural spoken responses in real-time conversations.
Agentic AI
Voice AI: An AI system that autonomously pursues multi-step goals, making decisions and invoking tools without a human prompt at each step. Unlike single-turn agents, agentic systems plan, retry, and self-correct across extended interactions — handling an entire workflow within one call rather than routing between departments.
B
Barge-in
Voice AI: The ability for a user to interrupt an AI agent while it is speaking, causing the agent to stop and listen to the new input. Essential for natural conversation flow.
Branded Calling
Networking: A carrier-level service that displays a verified business name and logo on the recipient's phone screen before they answer. Unlike application-layer caller ID, branded calling relies on carrier infrastructure and STIR/SHAKEN attestation to authenticate identity — making it resistant to spoofing and critical for AI agents that place outbound calls.
C
CDN (Content Delivery Network)
Infrastructure: A geographically distributed network of servers that delivers content to users from the nearest location, reducing latency and improving performance.
Codec
Speech Processing: Short for coder-decoder, a codec compresses and decompresses audio data to reduce bandwidth while maintaining quality.
Co-located Inference
Infrastructure: Running AI model inference (STT, LLM, TTS) in the same physical facility where voice calls terminate, eliminating inter-provider network hops. When inference and telephony share a facility, audio never traverses the public internet between processing stages — reducing round-trip overhead to near zero.
Compound Availability
Architecture: The effective uptime of a multi-vendor system, calculated by multiplying each vendor's individual availability. Five vendors each at 99.9% uptime yield a compound availability of ~99.5% — roughly 3.6 hours of downtime per month, compared to about 43 minutes for a single vendor at 99.9%. Each additional vendor boundary is an independent failure domain that degrades overall system reliability.
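The figures follow directly from multiplying per-vendor availabilities; a quick sketch to verify, using the five-vendor, 99.9% scenario from the definition:

```python
def compound_availability(availabilities):
    """Multiply per-vendor availabilities to get effective system uptime."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

HOURS_PER_MONTH = 720  # 30-day month

# Five vendors, each at 99.9% uptime
compound = compound_availability([0.999] * 5)
downtime_hours = (1 - compound) * HOURS_PER_MONTH

print(f"compound availability: {compound:.4%}")    # ~99.50%
print(f"monthly downtime: {downtime_hours:.1f} h") # ~3.6 h (vs ~0.7 h single-vendor)
```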
Context Window
Voice AI: The maximum amount of text — measured in tokens — that an LLM can process in a single inference call, including both input and output. In voice AI, context window limits determine how many conversational turns an agent retains before older exchanges are dropped, directly affecting whether the agent remembers what the caller said five minutes ago.
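As a rough back-of-envelope (the window size, prompt size, and tokens-per-turn figures below are illustrative assumptions, not any specific model's limits):

```python
def turns_retained(context_tokens, system_prompt_tokens,
                   reserved_output_tokens, avg_tokens_per_turn):
    """Estimate how many past conversational turns fit in the context window."""
    budget = context_tokens - system_prompt_tokens - reserved_output_tokens
    return max(budget // avg_tokens_per_turn, 0)

# Hypothetical: 8192-token window, 1000-token system prompt, 500 tokens
# reserved for the reply, ~60 tokens per spoken turn (a sentence or two)
print(turns_retained(8192, 1000, 500, 60))  # 111 turns of history
```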
Conversational AI
Voice AI: Technology that enables machines to understand, process, and respond to human language in a natural, context-aware manner across voice and text channels.
CPaaS (Communications Platform as a Service)
Infrastructure: A cloud platform that provides programmable voice, messaging, and video capabilities through APIs, enabling developers to embed real-time communications into applications without building telecom infrastructure. Traditional CPaaS provides the connectivity layer; real-time AI infrastructure extends this by co-locating inference, orchestration, and carrier services in a single system.
D
Diarization
Speech Processing: The process of partitioning an audio stream into segments according to speaker identity, answering "who spoke when" in a conversation.
E
Edge Computing
Infrastructure: Computing infrastructure positioned close to data sources to minimize latency and bandwidth usage by processing data near where it's generated.
Endpointing (End-of-Turn Detection)
Voice AI: Detecting when a speaker has finished their turn in a conversation, enabling the AI to respond at the right moment without cutting off the user or waiting too long.
F
Function Calling (Tool Use)
Voice AI: The ability for an LLM to invoke external functions or APIs during a conversation to retrieve data or perform actions, enabling AI agents to interact with real-world systems.
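A minimal sketch of the dispatch loop on the application side. The tool name, payload shape, and `lookup_order` helper are illustrative, not any specific provider's API:

```python
import json

# Hypothetical tool: the model emits a tool name plus JSON arguments,
# and the application executes the matching function.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

TOOLS = {"lookup_order": lookup_order}

def dispatch(tool_call_json: str) -> dict:
    """Execute a tool call emitted by the model, e.g. during a live phone call."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

result = dispatch('{"name": "lookup_order", "arguments": {"order_id": "A123"}}')
print(result)  # {'order_id': 'A123', 'status': 'shipped'}
```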
Frankenstack
Architecture: A multi-vendor voice AI architecture that chains together separate providers for telephony, speech-to-text, LLM routing, text-to-speech, and orchestration. Each vendor boundary adds 30–80ms of network overhead, creates an independent failure domain, and takes its own margin. A typical five-vendor Frankenstack accumulates 150–400ms of latency before any model even begins processing.
Full-stack Voice AI
Architecture: A voice AI architecture where speech-to-text, LLM routing, text-to-speech, voice cloning, orchestration, and carrier-grade telephony all operate within a single platform and network. This eliminates the integration tax, compound vendor margins, and cross-boundary debugging overhead of multi-vendor stacks. The structural cost advantage is permanent — it comes from vertical integration, not promotional pricing.
H
Hallucination
Voice AI: When an LLM generates plausible-sounding but factually incorrect information. In voice AI, hallucinations are especially consequential: the caller cannot scroll back to verify, there is no visual context to signal uncertainty, and a confidently spoken falsehood damages trust immediately. RAG and constrained prompting are the primary mitigations.
I
Intent Recognition
Voice AI: The process of identifying the user's goal or purpose from their spoken or written input in a conversational system.
Integration Tax
Architecture: The ongoing engineering cost of assembling and maintaining a multi-vendor voice AI stack. Includes custom integration work ($200K–$500K/year for enterprise), re-integration every time a vendor pushes API changes (1–2 sprints per quarter), and cross-vendor debugging (4–8 hours per incident across multiple support teams and dashboards). Often exceeds the combined component costs of the individual services.
Inter-provider Hops
Networking: Network round-trips between separate vendor systems in a multi-vendor voice AI pipeline. Each hop crosses the public internet, adding 30–80ms of latency per boundary. In a five-vendor stack, inter-provider hops alone consume 150–400ms — often exceeding the latency budget for natural-sounding conversation (under 300ms total).
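The budget arithmetic is easy to check. The per-hop figures below are hypothetical values drawn from the 30–80ms range above:

```python
# Illustrative hop latencies (ms) for a five-vendor pipeline
hops = {
    "telephony -> STT": 40,
    "STT -> orchestrator": 35,
    "orchestrator -> LLM": 60,
    "LLM -> TTS": 45,
    "TTS -> telephony": 50,
}

BUDGET_MS = 300  # rough ceiling for natural-sounding conversation

network_overhead = sum(hops.values())
print(f"inter-provider overhead: {network_overhead} ms")        # 230 ms
print(f"left for actual inference: {BUDGET_MS - network_overhead} ms")  # 70 ms
```

With mid-range hops alone, most of the 300ms budget is spent before any model runs.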
IVR (Interactive Voice Response)
Voice AI: A telephony system that interacts with callers through pre-recorded prompts and keypad input (DTMF) to route calls or provide self-service. IVR systems follow rigid decision trees — "Press 1 for billing, Press 2 for support." Voice AI agents replace these trees with natural conversation that understands spoken intent and adapts in real time.
J
Jitter
Networking: Variation in packet arrival times over a network connection, causing inconsistent delays that can degrade voice quality.
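A simplified way to quantify this: measure how far each packet's inter-arrival gap deviates from the expected gap. This is a sketch for intuition; RTP stacks use the smoothed estimator defined in RFC 3550 instead:

```python
def interarrival_jitter(arrival_times_ms):
    """Mean absolute deviation of inter-arrival gaps from the average gap."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return sum(abs(g - mean_gap) for g in gaps) / len(gaps)

# 20ms audio frames arriving with uneven network delays
arrivals = [0, 20, 45, 60, 85, 100]
print(f"{interarrival_jitter(arrivals):.1f} ms")  # 4.0 ms
```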
L
LLM (Large Language Model)
Voice AI: A neural network trained on massive text datasets that can understand and generate human language. The "brain" of conversational AI systems that processes user intent and generates responses.
Latency
Networking: The time delay between a request and its response in a system. In voice AI, this includes network transmission, processing, and response generation time. Lower latency creates more natural, human-like conversations.
Load Balancing
Infrastructure: Distributing incoming network traffic across multiple servers to ensure no single server becomes overwhelmed, improving reliability and performance.
M
MOS (Mean Opinion Score)
Networking: A numerical measure of voice quality on a scale of 1 to 5, derived from listener ratings. Used to benchmark call quality in VoIP and voice AI systems.
N
Natural Language Processing (NLP)
Voice AI: A field of AI focused on enabling computers to understand, interpret, and generate human language in both written and spoken forms.
O
Observability
Infrastructure: The ability to understand the internal state of a production system from its external outputs — logs, traces, and metrics. In voice AI, observability means instrumenting every stage of the STT → LLM → TTS pipeline to diagnose latency spikes, transcription errors, and agent failures. Without end-to-end tracing, a multi-vendor pipeline produces five dashboards and no root cause.
P
Packet Loss
Networking: The failure of network data packets to reach their destination, causing gaps or degradation in voice calls and data transmission.
P99 Latency
Networking: The 99th-percentile response time: the latency that 99% of requests complete within. In voice AI, P99 is the metric that matters — not P50 (the median). A voice agent with 200ms median latency but 900ms P99 will produce noticeably broken conversations for 1 in 100 turns. Multi-vendor pipelines compound P99 because each inter-provider hop adds its own tail latency.
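The median-vs-tail distinction is worth seeing numerically. A sketch using a deliberately coarse index-based percentile (production systems typically interpolate between order statistics):

```python
def percentile(samples_ms, p):
    """Simple index-method percentile: coarse, but fine for illustration."""
    ordered = sorted(samples_ms)
    idx = min(int(len(ordered) * p / 100), len(ordered) - 1)
    return ordered[idx]

# 100 turns: 99 fast responses and a single 900ms outlier
latencies = [200] * 99 + [900]
print(percentile(latencies, 50))  # 200 — the median looks healthy
print(percentile(latencies, 99))  # 900 — the tail tells the real story
```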
Prompt Engineering
Voice AI: The practice of crafting input instructions to an LLM to reliably produce desired outputs. In voice AI, prompt engineering carries additional constraints: responses must be short enough for natural TTS playback, must avoid formatting (markdown, bullet points, URLs) that does not survive speech synthesis, and must handle conversational register — the difference between reading text and speaking to a human.
PSTN (Public Switched Telephone Network)
Networking: The traditional global telephone network infrastructure that connects landlines and mobile phones. Voice AI agents connect to PSTN to make and receive real phone calls.
R
RAG (Retrieval Augmented Generation)
Voice AI: A technique that enhances LLM responses by retrieving relevant information from external knowledge bases before generating an answer, reducing hallucinations and providing up-to-date information.
Real-time Factor (RTF)
Speech Processing: A measure of processing speed relative to audio duration. An RTF of 0.5 means 1 second of audio is processed in 0.5 seconds. Lower is faster; RTF below 1.0 is required for real-time applications.
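The definition reduces to a single ratio; a quick sketch:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 keeps up with live audio."""
    return processing_seconds / audio_seconds

# A transcriber that handles 10s of audio in 4s of compute
rtf = real_time_factor(4.0, 10.0)
print(rtf)        # 0.4
print(rtf < 1.0)  # True: fast enough for real-time streaming
```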
RTP (Real-time Transport Protocol)
Networking: The network protocol that carries audio and video media over IP networks, providing sequencing, timing, and payload-type identification. While SIP handles call signaling (setup, teardown, routing), RTP carries the actual voice data. Every inter-provider hop in a voice AI pipeline adds RTP forwarding latency.
S
Scalability
Infrastructure: The ability of a system to handle increasing workloads by adding resources, while maintaining performance and reliability.
Streaming
Infrastructure: Processing data incrementally as it arrives rather than waiting for complete input. In voice AI, streaming enables responses to begin before the full audio or text is received.
SIP (Session Initiation Protocol)
Networking: A signaling protocol used to initiate, maintain, and terminate real-time voice, video, and messaging sessions over IP networks.
SIP Trunking
Networking: A VoIP service that connects a PBX or communications platform to the PSTN over IP using SIP, replacing traditional physical phone lines. SIP trunks are the connectivity layer that voice AI platforms use to originate and terminate calls at scale — the bridge between the carrier network and the application.
STIR/SHAKEN
Networking: A framework of telecom standards (Secure Telephone Identity Revisited / Signature-based Handling of Asserted information using toKENs) mandated by the FCC to verify caller identity and combat robocall fraud. Only the originating carrier can provide full A-level attestation — the highest trust rating. AI platforms that build on top of carriers inherit lower B or C-level attestation, making their calls more likely to be flagged or blocked.
Speech-to-Text (STT)
Speech Processing: Technology that converts spoken audio into written text, enabling voice AI systems to process and understand human speech.
T
TTFB (Time to First Byte)
Networking: The time between making a request and receiving the first byte of the response. In voice AI, TTFB for TTS measures how quickly audio playback can begin.
Turn-taking
Voice AI: The conversational protocol governing when each party speaks. Effective turn-taking enables natural back-and-forth dialogue without awkward overlaps or long silences.
Text-to-Speech (TTS)
Speech Processing: Technology that synthesizes natural-sounding speech from written text, allowing AI systems to communicate verbally with users.
U
Uptime
Infrastructure: The percentage of time a system or service is operational and available, typically measured as a percentage (e.g., 99.9% uptime).
V
VAD (Voice Activity Detection)
Speech Processing: Technology that detects the presence or absence of human speech in an audio signal, distinguishing speech from silence, background noise, or music.
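The simplest form of VAD is an energy gate over short audio frames. This is a toy sketch for intuition; production VADs add spectral features, trained models, and hangover logic to avoid clipping quiet speech:

```python
import math

def frame_energy(samples):
    """Root-mean-square energy of one audio frame (normalized samples)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(samples, threshold=0.05):
    """Crude energy gate: frames above the threshold count as speech."""
    return frame_energy(samples) > threshold

# One 10ms frame at 16kHz: near-silence vs a 440Hz tone at 0.3 amplitude
silence = [0.001] * 160
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]

print(is_speech(silence))  # False
print(is_speech(tone))     # True
```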
Voice API
Voice AI: A programmatic interface that enables developers to integrate voice communication capabilities into applications, including making calls, receiving calls, and processing audio.
Voice AI Infrastructure
Infrastructure: The integrated system of edge compute, voice AI platform, and global communications required to power AI agents that interact with humans over the telephone network. Unlike application-layer platforms that chain together multiple vendors, voice AI infrastructure owns the full call path — from inference to carrier delivery — in a single operational domain.
Voice Biometrics
Speech Processing: Technology that identifies or verifies individuals based on unique characteristics in their voice, such as pitch, tone, and speech patterns.
Voice Cloning
Speech Processing: Creating a synthetic voice model that replicates the timbre, accent, and speaking style of a specific person from audio samples. Modern systems can produce a usable clone from as little as 3 seconds of audio. Voice cloning enables TTS to generate speech in a custom brand voice rather than a generic synthesized one — and creates the fraud risk that STIR/SHAKEN and AI voice detection are designed to counter.
W
Wake Word Detection
Voice AI: Technology that continuously listens for a specific trigger phrase (like "Hey Siri" or "Alexa") to activate a voice assistant or AI agent.
Webhook
Infrastructure: An HTTP callback where a server sends real-time event notifications to a specified URL when a trigger occurs. Voice APIs use webhooks to notify applications of call lifecycle events — call answered, transcription ready, recording complete, call ended — enabling event-driven architectures without polling. For example, a call.answered event can trigger the application to start streaming audio to the STT → LLM → TTS pipeline.
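A sketch of the receiving side. The event names, payload shape, and handler functions here are illustrative; each provider defines its own schema, and a real endpoint would also verify the provider's signature before dispatching:

```python
import json

def on_answered(payload):
    return f"start streaming call {payload['call_id']} to the STT pipeline"

def on_ended(payload):
    return f"archive transcript for call {payload['call_id']}"

# Handlers keyed by event type, registered on the webhook URL
HANDLERS = {"call.answered": on_answered, "call.ended": on_ended}

def handle_webhook(body: str) -> str:
    """Parse an incoming webhook body and route it to the matching handler."""
    event = json.loads(body)
    handler = HANDLERS.get(event["type"])
    return handler(event["payload"]) if handler else "ignored"

print(handle_webhook('{"type": "call.answered", "payload": {"call_id": "c42"}}'))
```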
WebSocket
Networking: A persistent, full-duplex communication protocol over a single TCP connection that enables simultaneous bidirectional data flow. In voice AI, WebSocket connections carry continuous audio streams between the telephony platform and processing services — the real-time transport layer for streaming STT input and TTS output without the overhead of repeated HTTP requests.
WebRTC
Networking: Web Real-Time Communication, an open standard and set of browser APIs (with an open-source reference implementation) enabling peer-to-peer audio, video, and data sharing directly in web browsers without plugins.