Voice AI & Networking Glossary
Essential terminology for building real-time conversational AI systems
A
AI Agent
[Voice AI] An autonomous software system that processes voice input, understands intent, and generates natural spoken responses in real-time conversations.
B
Barge-in
[Voice AI] The ability for a user to interrupt an AI agent while it is speaking, causing the agent to stop and listen to the new input. Essential for natural conversation flow.
C
CDN (Content Delivery Network)
[Infrastructure] A geographically distributed network of servers that delivers content to users from the nearest location, reducing latency and improving performance.
Codec
[Speech Processing] Short for coder-decoder. A codec compresses and decompresses audio data to reduce bandwidth while maintaining quality.
Conversational AI
[Voice AI] Technology that enables machines to understand, process, and respond to human language in a natural, context-aware manner across voice and text channels.
D
Diarization
[Speech Processing] The process of partitioning an audio stream into segments according to speaker identity, answering "who spoke when" in a conversation.
E
Edge Computing
[Infrastructure] Computing infrastructure positioned close to data sources to minimize latency and bandwidth usage by processing data near where it's generated.
Endpointing (End-of-Turn Detection)
[Voice AI] Detecting when a speaker has finished their turn in a conversation, enabling the AI to respond at the right moment without cutting off the user or waiting too long.
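One simple way to approximate endpointing is a trailing-silence timeout over per-frame speech flags (for example, the output of a VAD). The sketch below is illustrative only: the function name, the 20 ms frame size, and the 600 ms timeout are assumptions, and production systems often combine silence with linguistic or model-based cues.

```python
def find_end_of_turn(speech_flags, frame_ms=20, silence_ms=600):
    """Return the frame index at which the turn is declared over, or None.

    speech_flags: iterable of booleans, one per audio frame (True = speech).
    frame_ms and silence_ms are assumed values, not recommendations.
    """
    needed = silence_ms // frame_ms  # consecutive silent frames required
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i            # enough trailing silence: end of turn
    return None                     # speaker has not finished (or never spoke)
```

A shorter timeout responds faster but risks cutting the user off mid-pause; a longer one is safer but adds perceived delay.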
F
Function Calling (Tool Use)
[Voice AI] The ability for an LLM to invoke external functions or APIs during a conversation to retrieve data or perform actions, enabling AI agents to interact with real-world systems.
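On the application side, function calling usually comes down to parsing a structured request from the model and dispatching it to real code. The sketch below assumes a hypothetical JSON shape with "name" and "arguments" keys and a hypothetical get_weather tool; it does not follow any particular vendor's format.

```python
import json

def get_weather(city: str) -> str:
    # Hypothetical tool; a real one would call an external weather API.
    return f"Sunny in {city}"

# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(raw_call: str) -> str:
    """Parse a JSON tool call (assumed shape) and run the matching function."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

result = dispatch_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Paris"}}'
)
```

The returned string would typically be passed back to the LLM so it can phrase the final spoken answer.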
I
Intent Recognition
[Voice AI] The process of identifying the user's goal or purpose from their spoken or written input in a conversational system.
J
Jitter
[Networking] Variation in packet arrival times over a network connection, causing inconsistent delays that can degrade voice quality.
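Jitter can be estimated from packet send and receive timestamps. The sketch below follows the running interarrival-jitter estimate described in RFC 3550 (RTP), where each new transit-time difference moves the estimate by one sixteenth; the function name and the list-based inputs are assumptions made for illustration.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550.

    send_times and recv_times are per-packet timestamps in the same unit
    (e.g. milliseconds); the result is in that unit too.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        # Change in transit time between consecutive packets.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        # Exponential smoothing with gain 1/16.
        jitter += (abs(d) - jitter) / 16
    return jitter
```

Packets that arrive with perfectly even spacing give a jitter of zero, regardless of how large the fixed delay is.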
L
LLM (Large Language Model)
[Voice AI] A neural network trained on massive text datasets that can understand and generate human language. It acts as the "brain" of a conversational AI system, interpreting user intent and generating responses.
Latency
[Networking] The time delay between a request and its response in a system. In voice AI, this includes network transmission, processing, and response generation time. Lower latency creates more natural, human-like conversations.
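Because voice AI latency is the sum of several stages, it can help to write out a budget. The stage names and millisecond values below are purely hypothetical placeholders, not measurements from any system; the point is only to show how the components add up and how to find the dominant one.

```python
# Hypothetical per-stage latencies in milliseconds; real values vary widely.
stages_ms = {
    "network_uplink": 40,
    "speech_to_text": 150,
    "llm_first_token": 300,
    "text_to_speech_first_audio": 120,
    "network_downlink": 40,
}

total_ms = sum(stages_ms.values())          # end-to-end response latency
slowest = max(stages_ms, key=stages_ms.get) # stage contributing the most delay
```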
Load Balancing
[Infrastructure] Distributing incoming network traffic across multiple servers to ensure no single server becomes overwhelmed, improving reliability and performance.
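Round-robin is one of the simplest load-balancing strategies: requests are handed to servers in a fixed rotation. The class below is a minimal sketch with hypothetical names; real load balancers also account for health checks, weights, and current connection counts.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out servers in a repeating, fixed order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._pool = cycle(servers)  # endless iterator over the server list

    def pick(self):
        # Return the next server in the rotation.
        return next(self._pool)

balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
assignments = [balancer.pick() for _ in range(4)]
```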
M
MOS (Mean Opinion Score)
[Networking] A numerical measure of voice quality on a scale of 1 to 5 (5 is best), derived from listener ratings. Used to benchmark call quality in VoIP and voice AI systems.
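In its basic form, a MOS is the arithmetic mean of individual listener ratings. The helper below is a small sketch with an assumed function name; it only validates the 1 to 5 range and averages, and does not cover listening-test methodology.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average listener ratings, each expected to be between 1 and 5."""
    if not ratings:
        raise ValueError("no ratings supplied")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)
```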
N
Natural Language Processing (NLP)
[Voice AI] A field of AI focused on enabling computers to understand, interpret, and generate human language in both written and spoken forms.
P
PSTN (Public Switched Telephone Network)
[Networking] The traditional global telephone network infrastructure that connects landlines and mobile phones. Voice AI agents connect to the PSTN to make and receive real phone calls.
Packet Loss
[Networking] The failure of network data packets to reach their destination, causing gaps or degradation in voice calls and data transmission.
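When packets carry sequence numbers, a receiver can estimate loss by counting the gaps. The function below is a simplified sketch under stated assumptions: the name is hypothetical, sequence numbers do not wrap around, and packets lost before the first or after the last one received are not counted.

```python
def packet_loss_rate(received_seq):
    """Fraction of packets missing between the lowest and highest
    sequence number seen. Duplicates are ignored."""
    if not received_seq:
        raise ValueError("no packets received")
    unique = set(received_seq)
    expected = max(unique) - min(unique) + 1  # packets that should have arrived
    return (expected - len(unique)) / expected
```

For example, receiving packets 1, 2, 4, 5, 7, 8, 9, 10 means 2 of 10 expected packets are missing, a loss rate of 0.2.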
R
RAG (Retrieval Augmented Generation)
[Voice AI] A technique that enhances LLM responses by retrieving relevant information from external knowledge bases before generating an answer, reducing hallucinations and providing up-to-date information.
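The retrieve-then-generate flow can be sketched with a toy retriever. The example below ranks documents by simple word overlap, which is a stand-in chosen for brevity; real RAG systems typically use embedding-based similarity search. All function names and the prompt wording are assumptions.

```python
def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, documents):
    """Place retrieved context ahead of the question for the LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our office opens at 9am on weekdays",
    "Refunds are processed within 5 days",
]
prompt = build_prompt("when are refunds processed", docs)
```

The resulting prompt would then be sent to the LLM, which answers from the retrieved text rather than from its training data alone.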
Real-time Factor (RTF)
[Speech Processing] A measure of processing speed relative to audio duration. An RTF of 0.5 means 1 second of audio is processed in 0.5 seconds. Lower is faster; RTF below 1.0 is required for real-time applications.
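The RTF definition above is a single ratio, processing time divided by audio duration. A minimal helper, with an assumed function name, might look like this:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

rtf = real_time_factor(0.5, 1.0)   # 1 s of audio processed in 0.5 s
keeps_up = rtf < 1.0               # True when processing outpaces playback
```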
S
Scalability
[Infrastructure] The ability of a system to handle increasing workloads by adding resources, while maintaining performance and reliability.
Streaming
[Infrastructure] Processing data incrementally as it arrives rather than waiting for complete input. In voice AI, streaming enables responses to begin before the full audio or text is received.
SIP (Session Initiation Protocol)
[Networking] A signaling protocol used to initiate, maintain, and terminate real-time voice, video, and messaging sessions over IP networks.
Speech-to-Text (STT)
[Speech Processing] Technology that converts spoken audio into written text, enabling voice AI systems to process and understand human speech.
T
TTFB (Time to First Byte)
[Networking] The time between making a request and receiving the first byte of the response. In voice AI, TTFB for TTS measures how quickly audio playback can begin.
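For a streamed response, TTFB can be approximated by timing how long the first chunk takes to arrive. The sketch below uses a hypothetical generator as a stand-in for a streaming TTS response; with a real service the timer would start when the request is sent.

```python
import time

def time_to_first_chunk(chunks):
    """Return (seconds until the first chunk arrived, that chunk)."""
    start = time.perf_counter()
    first = next(iter(chunks))  # blocks until the stream yields something
    return time.perf_counter() - start, first

def fake_tts_stream():
    # Hypothetical stand-in for a streaming TTS response.
    time.sleep(0.05)            # simulated synthesis delay
    yield b"audio-1"
    yield b"audio-2"

ttfb_seconds, first_chunk = time_to_first_chunk(fake_tts_stream())
```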
Turn-taking
[Voice AI] The conversational protocol governing when each party speaks. Effective turn-taking enables natural back-and-forth dialogue without awkward overlaps or long silences.
Text-to-Speech (TTS)
[Speech Processing] Technology that synthesizes natural-sounding speech from written text, allowing AI systems to communicate verbally with users.
U
Uptime
[Infrastructure] The percentage of time a system or service is operational and available, typically quoted as a figure such as 99.9%.
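An uptime percentage is easier to reason about when converted into allowed downtime. The helper below is a small sketch with an assumed function name, using a 365-day year by default.

```python
def allowed_downtime_hours(uptime_percent, period_hours=365 * 24):
    """Hours of downtime permitted over the period at a given uptime level."""
    return (1 - uptime_percent / 100) * period_hours

three_nines = allowed_downtime_hours(99.9)   # roughly 8.76 hours per year
four_nines = allowed_downtime_hours(99.99)   # roughly 0.876 hours per year
```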
V
VAD (Voice Activity Detection)
[Speech Processing] Technology that detects the presence or absence of human speech in an audio signal, distinguishing speech from silence, background noise, or music.
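The simplest form of VAD thresholds the energy of each audio frame. The sketch below is an energy-only approximation: the function names, the 160-sample frame (20 ms at an assumed 8 kHz sample rate), and the threshold of 500 are illustrative assumptions. Energy alone cannot tell speech from loud noise or music, which is why production VADs generally rely on trained models.

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_vad(samples, frame_len=160, threshold=500.0):
    """Label each full frame True (speech-like) or False by RMS energy.

    frame_len and threshold are assumed values for 16-bit PCM samples.
    A trailing partial frame is ignored.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms(frame) >= threshold)
    return flags
```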
Voice API
[Voice AI] A programmatic interface that enables developers to integrate voice communication capabilities into applications, including making calls, receiving calls, and processing audio.
Voice Biometrics
[Speech Processing] Technology that identifies or verifies individuals based on unique characteristics in their voice, such as pitch, tone, and speech patterns.
W
Wake Word Detection
[Voice AI] Technology that continuously listens for a specific trigger phrase (like "Hey Siri" or "Alexa") to activate a voice assistant or AI agent.
WebRTC
[Networking] Short for Web Real-Time Communication, an open-source technology enabling peer-to-peer audio, video, and data sharing directly in web browsers without plugins.