Voice AI & Networking Glossary

Essential terminology for building real-time conversational AI systems

A

AI Agent

Voice AI

An autonomous software system that processes voice input, understands intent, and generates natural spoken responses in real-time conversations.

Use Case: Customer service AI that handles appointment scheduling, answers FAQs, and routes complex queries to human agents.

B

Barge-in

Voice AI

The ability for a user to interrupt an AI agent while it is speaking, causing the agent to stop and listen to the new input. Essential for natural conversation flow.

Use Case: A caller says "wait, I meant Thursday" while the agent is confirming a Wednesday appointment, and the agent immediately stops to process the correction.
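
Example: a minimal sketch of barge-in handling, assuming some upstream component (such as a VAD) reports when the user starts talking. The class name AgentPlayback and its methods are hypothetical and only illustrate the control flow, not any particular platform's API.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPlayback:
    """Tracks whether the agent is speaking and reacts to interruptions."""
    speaking: bool = False
    events: list = field(default_factory=list)

    def start_speaking(self, text: str) -> None:
        # Begin (simulated) TTS playback.
        self.speaking = True
        self.events.append(f"speak:{text}")

    def on_user_speech(self) -> bool:
        """Call when user speech is detected. Returns True on a barge-in."""
        if self.speaking:
            # The user interrupted: stop playback and hand the turn back.
            self.speaking = False
            self.events.append("stop_playback")
            return True
        return False
```

In a real system the stop would also need to flush any audio already queued for playback, which this sketch leaves out.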

C

CDN (Content Delivery Network)

Infrastructure

A geographically distributed network of servers that delivers content to users from the nearest location, reducing latency and improving performance.

Use Case: Distributing voice AI model files and audio assets globally to ensure fast load times for users worldwide.

Codec

Speech Processing

Short for coder-decoder, a codec compresses and decompresses audio data to reduce bandwidth while maintaining quality.

Use Case: Opus codec used in WebRTC for high-quality, low-latency voice transmission in real-time calls.

Conversational AI

Voice AI

Technology that enables machines to understand, process, and respond to human language in a natural, context-aware manner across voice and text channels.

Use Case: Multi-turn dialogue systems that maintain context across a conversation, like virtual assistants or support chatbots.

D

Diarization

Speech Processing

The process of partitioning an audio stream into segments according to speaker identity, answering "who spoke when" in a conversation.

Use Case: Transcribing a customer support call and automatically labeling which parts were spoken by the agent versus the customer.

E

Edge Computing

Infrastructure

Computing infrastructure positioned close to users and data sources, so that data is processed near where it is generated. This minimizes latency and bandwidth usage.

Use Case: Running voice recognition models on edge servers near users to achieve sub-100ms transcription latency.

Endpointing (End-of-Turn Detection)

Voice AI

Detecting when a speaker has finished their turn in a conversation, enabling the AI to respond at the right moment without cutting off the user or waiting too long.

Use Case: Distinguishing between a pause mid-sentence ("I want to order... pizza") and the end of a complete thought, avoiding premature responses.
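
Example: one simple way to approximate endpointing is to wait for a fixed run of silence after speech. The sketch below assumes per-frame speech/silence flags are already available (for instance from a VAD); the 20 ms frame size and 700 ms silence threshold are illustrative assumptions, and production systems often add semantic cues on top of silence timing.

```python
def find_end_of_turn(frames, frame_ms=20, silence_ms=700):
    """Return the index of the frame where the turn is declared over.

    frames: iterable of bools, True where the frame contains speech.
    Returns None if no end of turn is detected.
    """
    needed = silence_ms // frame_ms  # consecutive silent frames required
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0  # a mid-sentence pause ended; keep listening
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None
```

A short pause (fewer silent frames than the threshold) resets the counter as soon as speech resumes, so the turn is not cut off mid-sentence.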

F

Function Calling (Tool Use)

Voice AI

The ability for an LLM to invoke external functions or APIs during a conversation to retrieve data or perform actions, enabling AI agents to interact with real-world systems.

Use Case: A voice AI agent calling a calendar API to check availability and book an appointment while still on the phone with the customer.
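
Example: a minimal sketch of the dispatch step, assuming the LLM emits a tool call as a JSON object with "name" and "arguments" fields. That shape, the check_availability function, and its hard-coded calendar data are all hypothetical stand-ins; real LLM providers each define their own tool-call format.

```python
import json


def check_availability(date: str) -> dict:
    """Hypothetical stand-in for a real calendar API call."""
    booked = {"2025-01-15"}  # made-up data for illustration
    return {"date": date, "available": date not in booked}


# Registry of functions the model is allowed to invoke.
TOOLS = {"check_availability": check_availability}


def dispatch_tool_call(raw_call: str) -> str:
    """Run the requested tool and return its result as JSON for the LLM."""
    call = json.loads(raw_call)
    func = TOOLS.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps(func(**call["arguments"]))
```

The result string would typically be fed back to the model so it can phrase a spoken answer.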

I

Intent Recognition

Voice AI

The process of identifying the user's goal or purpose from their spoken or written input in a conversational system.

Use Case: Determining whether "I need help with my account" means billing support, password reset, or account closure.

J

Jitter

Networking

Variation in packet arrival times over a network connection, causing inconsistent delays that can degrade voice quality.

Use Case: High jitter in VoIP calls causes choppy audio; jitter buffers smooth out variations to maintain quality.
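
Example: jitter can be estimated from packet send and arrival timestamps. The sketch below follows the running interarrival jitter estimate described in RFC 3550 (the RTP specification), which smooths the absolute change in transit time with a gain of 1/16. The function name and millisecond inputs are assumptions made for illustration.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Running jitter estimate in the style of RFC 3550, in milliseconds."""
    jitter = 0.0
    for i in range(1, len(send_times_ms)):
        # Difference in spacing between receiver and sender for this pair.
        d = (recv_times_ms[i] - recv_times_ms[i - 1]) - (
            send_times_ms[i] - send_times_ms[i - 1]
        )
        # Exponential smoothing with gain 1/16.
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

Packets that arrive with exactly the spacing they were sent with contribute zero, regardless of the absolute network delay.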

L

LLM (Large Language Model)

Voice AI

A neural network trained on massive text datasets that can understand and generate human language. It acts as the "brain" of a conversational AI system, interpreting user intent and generating responses.

Use Case: GPT-4, Claude, or Gemini powering a voice agent's understanding and response generation between the STT and TTS stages.

Latency

Networking

The time delay between a request and its response in a system. In voice AI, this includes network transmission, processing, and response generation time. Lower latency creates more natural, human-like conversations.

Use Case: A voice AI agent with 200ms latency feels responsive, while 1000ms+ latency creates awkward pauses that break conversation flow.
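
Example: end-to-end latency is the sum of its stages, so a per-stage budget is a useful way to reason about it. The stage names and millisecond values below are made-up illustrative numbers, not measurements.

```python
def total_latency_ms(stages):
    """Sum per-stage latencies (in milliseconds) into an end-to-end figure."""
    return sum(stages.values())


def slowest_stage(stages):
    """Name of the stage contributing the most latency."""
    return max(stages, key=stages.get)


# Hypothetical budget for a single conversational turn.
budget = {
    "network": 40,
    "stt": 60,
    "llm_first_token": 70,
    "tts_first_audio": 30,
}
```

With these assumed values the total comes to 200 ms, and the LLM's first token is the largest single contributor.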

Load Balancing

Infrastructure

Distributing incoming network traffic across multiple servers to ensure no single server becomes overwhelmed, improving reliability and performance.

Use Case: Routing voice AI requests across a cluster of servers to maintain low latency during traffic spikes.
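
Example: round-robin is one of the simplest load-balancing strategies. The sketch below is an assumption-laden illustration (the class name and server labels are hypothetical); real balancers usually also account for server health and current load.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(list(servers))

    def next_server(self):
        # Each call returns the next server, wrapping around at the end.
        return next(self._cycle)
```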

M

MOS (Mean Opinion Score)

Networking

A numerical measure of voice quality on a scale of 1-5, derived from listener ratings. Used to benchmark call quality in VoIP and voice AI systems.

Use Case: Monitoring voice AI call quality where a MOS above 4.0 indicates excellent quality, while below 3.5 suggests noticeable degradation.
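
Example: at its simplest, a MOS is the arithmetic mean of individual listener ratings. The sketch below computes it and applies the thresholds mentioned above; the "acceptable" label for the middle band and the function names are assumptions added for illustration.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings, each on the 1-5 scale."""
    if not ratings:
        raise ValueError("no ratings supplied")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


def quality_label(mos):
    """Map a MOS to a coarse label using the thresholds from the text."""
    if mos > 4.0:
        return "excellent"
    if mos < 3.5:
        return "degraded"
    return "acceptable"  # assumed label for the band in between
```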

N

Natural Language Processing (NLP)

Voice AI

A field of AI focused on enabling computers to understand, interpret, and generate human language in both written and spoken forms.

Use Case: Extracting meaning from "I'd like to reschedule my appointment for next Tuesday" to trigger calendar actions.

P

PSTN (Public Switched Telephone Network)

Networking

The traditional global telephone network infrastructure that connects landlines and mobile phones. Voice AI agents connect to PSTN to make and receive real phone calls.

Use Case: An AI agent with a real phone number that customers can dial from any phone, not just through an app or website.

Packet Loss

Networking

The failure of network data packets to reach their destination, causing gaps or degradation in voice calls and data transmission.

Use Case: Even 1% packet loss can cause noticeable audio dropouts in voice calls; WebRTC implementations mitigate this with techniques such as forward error correction and packet loss concealment.
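
Example: packet loss can be estimated from the sequence numbers that actually arrived. This is a simplified sketch: it assumes sequence numbers increase by one per packet and never wrap around, which real RTP sequence numbers (16-bit) eventually do.

```python
def packet_loss_percent(received_seq_numbers):
    """Percentage of packets missing between the lowest and highest seen."""
    if not received_seq_numbers:
        return 0.0
    unique = set(received_seq_numbers)  # ignore duplicate deliveries
    expected = max(unique) - min(unique) + 1
    lost = expected - len(unique)
    return 100.0 * lost / expected
```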

R

RAG (Retrieval Augmented Generation)

Voice AI

A technique that enhances LLM responses by retrieving relevant information from external knowledge bases before generating an answer, reducing hallucinations and providing up-to-date information.

Use Case: A voice agent retrieving current product pricing from a database before answering "how much does the Pro plan cost?"
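
Example: a toy sketch of the retrieve-then-generate flow. Retrieval here is plain word overlap, chosen only to keep the example self-contained; real systems typically use embedding-based search. The function names, prompt wording, and sample documents are all made up for illustration.

```python
def retrieve(query, documents, top_k=1):
    """Return the top_k documents sharing the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents):
    """Prepend retrieved context to the question before calling an LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Made-up knowledge base entries.
docs = [
    "The Pro plan costs 49 dollars per month.",
    "Support is available on weekdays.",
]
```

The resulting prompt would then be sent to the LLM, which this sketch does not call.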

Real-time Factor (RTF)

Speech Processing

A measure of processing speed relative to audio duration. An RTF of 0.5 means 1 second of audio is processed in 0.5 seconds. Lower is faster; RTF below 1.0 is required for real-time applications.

Use Case: Evaluating STT models where RTF of 0.1 enables responsive voice AI, while RTF of 2.0 would cause unacceptable delays.
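
Example: RTF is processing time divided by audio duration. The function name below is an assumption; the arithmetic matches the definition above.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

For instance, 0.5 seconds of processing for 1 second of audio gives an RTF of 0.5, while 20 seconds of processing for 10 seconds of audio gives 2.0, which is too slow for real-time use.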

S

Scalability

Infrastructure

The ability of a system to handle increasing workloads by adding resources, while maintaining performance and reliability.

Use Case: A voice AI platform that can grow from handling 100 concurrent calls to 10,000 without degrading response times.

Streaming

Infrastructure

Processing data incrementally as it arrives rather than waiting for complete input. In voice AI, streaming enables responses to begin before the full audio or text is received.

Use Case: Starting TTS playback while the LLM is still generating tokens, so the user hears the first words without waiting for the full response to be generated.
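
Example: one common streaming pattern is to group LLM tokens into sentences and pass each sentence to TTS as soon as it is complete. The sketch below uses a Python generator and a naive punctuation rule; the function name and the sentence-boundary heuristic are assumptions for illustration.

```python
def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as they can be formed from tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary check: flush when the buffer ends a sentence.
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush any trailing text that lacked final punctuation.
        yield buffer.strip()
```

Because this is a generator, the first sentence is available to the TTS stage before the remaining tokens have been produced.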

SIP (Session Initiation Protocol)

Networking

A signaling protocol used to initiate, maintain, and terminate real-time voice, video, and messaging sessions over IP networks.

Use Case: Connecting a voice AI agent to traditional phone systems (PSTN) for inbound and outbound calling.

Speech-to-Text (STT)

Speech Processing

Technology that converts spoken audio into written text, enabling voice AI systems to process and understand human speech.

Use Case: Transcribing customer support calls in real-time to feed into AI agents for intent recognition and response generation.

T

TTFB (Time to First Byte)

Networking

The time between making a request and receiving the first byte of the response. In voice AI, TTFB for TTS measures how quickly audio playback can begin.

Use Case: A TTS system with 100ms TTFB starts speaking almost instantly, while 500ms TTFB creates a noticeable pause before the agent responds.
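
Example: for a streaming TTS response, the equivalent measurement is the time until the first audio chunk arrives. The sketch below times that against any iterable of chunks; fake_tts_stream is a hypothetical stand-in that simulates a 50 ms synthesis delay rather than calling a real service.

```python
import time


def time_to_first_chunk(stream):
    """Return (first_chunk, elapsed_ms) for an iterable of audio chunks."""
    start = time.perf_counter()
    first = next(iter(stream))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return first, elapsed_ms


def fake_tts_stream():
    """Hypothetical TTS stream: waits 50 ms, then yields audio chunks."""
    time.sleep(0.05)
    yield b"audio-chunk-1"
    yield b"audio-chunk-2"
```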

Turn-taking

Voice AI

The conversational protocol governing when each party speaks. Effective turn-taking enables natural back-and-forth dialogue without awkward overlaps or long silences.

Use Case: An AI agent that knows to pause for customer responses after asking a question, and resumes if silence extends beyond a natural threshold.

Text-to-Speech (TTS)

Speech Processing

Technology that synthesizes natural-sounding speech from written text, allowing AI systems to communicate verbally with users.

Use Case: Voice AI agents using TTS to deliver responses with human-like intonation, emotion, and pacing.

U

Uptime

Infrastructure

The proportion of time a system or service is operational and available, typically expressed as a percentage (e.g., 99.9% uptime).

Use Case: A voice AI platform with 99.99% uptime can be down for at most about 53 minutes per year.
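
Example: the downtime allowed by an uptime percentage is simple arithmetic. The sketch below assumes a 365-day year; the function name is illustrative.

```python
def max_downtime_minutes_per_year(uptime_percent):
    """Minutes of downtime per 365-day year permitted by an uptime target."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - uptime_percent / 100)
```

At 99.99% this works out to roughly 52.6 minutes per year, and at 99.9% to roughly 8.8 hours.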

V

VAD (Voice Activity Detection)

Speech Processing

Technology that detects the presence or absence of human speech in an audio signal, distinguishing speech from silence, background noise, or music.

Use Case: Only sending audio to STT when speech is detected, reducing processing costs and improving accuracy by filtering out background noise.
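
Example: the simplest form of VAD thresholds the energy of each audio frame. The sketch below assumes samples are floats in the range -1.0 to 1.0, and the 0.02 threshold is an arbitrary illustrative value. Energy alone cannot tell speech from loud noise or music, so production VADs generally rely on trained models instead.

```python
import math


def frame_rms(samples):
    """Root-mean-square amplitude of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples, threshold=0.02):
    """Crude energy gate: True when the frame is louder than the threshold."""
    return frame_rms(samples) > threshold
```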

Voice API

Voice AI

A programmatic interface that enables developers to integrate voice communication capabilities into applications, including making calls, receiving calls, and processing audio.

Use Case: Building a click-to-call feature in a web app or creating automated outbound calling systems for notifications.

Voice Biometrics

Speech Processing

Technology that identifies or verifies individuals based on unique characteristics in their voice, such as pitch, tone, and speech patterns.

Use Case: Banking applications using voice authentication for secure account access instead of passwords.

W

Wake Word Detection

Voice AI

Technology that continuously listens for a specific trigger phrase (like "Hey Siri" or "Alexa") to activate a voice assistant or AI agent.

Use Case: Smart speakers that remain in low-power mode until they hear the wake word, then activate full processing.

WebRTC

Networking

Web Real-Time Communication - an open-source technology enabling peer-to-peer audio, video, and data sharing directly in web browsers without plugins.

Use Case: Browser-based voice AI applications that connect users to AI agents with low-latency audio streaming.
