In 1996, four researchers — Mark Handley, Henning Schulzrinne, Eve Schooler, and Jonathan Rosenberg — designed a protocol they called SIP. It was originally intended to help run multicast video conferences on something called the MBone, a special network that let researchers share video feeds across the early internet.
Nobody working on it imagined that SIP would eventually carry a large share of the world’s telephone calls. But that is exactly what happened.
This is how it works.
Voice as Packets: The Fundamental Shift
The old telephone network treated voice as a continuous electrical signal — a dedicated, unbroken stream of electricity flowing between two phones. The VoIP revolution began with a simple observation: a human voice can be digitized, divided into small chunks, and reassembled at the other end without anyone noticing the difference.
That digitization process works like this: a codec (coder-decoder) samples the analog voice signal thousands of times per second (8,000 for classic telephony) and encodes each sample as a number. Common codecs include G.711 (the original PCM standard, 64 kbps), G.729 (compressed to 8 kbps), and Opus (a modern open-source codec that adapts its bitrate dynamically between 6 and 510 kbps depending on network conditions).
Each chunk is packaged as a data packet. Each packet has a header, containing information about where it came from, where it’s going, and its position in the conversation stream, and a payload, which is the actual voice data. A typical VoIP packet carries 20 milliseconds of voice: 160 bytes of payload with G.711, or 20 bytes with G.729. The IP, UDP, and RTP headers add another 40 bytes, so the whole packet runs from roughly 60 to 200 bytes, and with a compressed codec the headers are larger than the voice they carry.
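The arithmetic behind those packet sizes can be sketched in a few lines. This is a rough calculation, assuming RTP over UDP over IPv4 with no header extensions or IP options, 20 ms of audio per packet, and no link-layer (Ethernet) overhead; the function name `packet_sizes` is made up for this illustration.

```python
# Per-packet header overhead, assuming IPv4 + UDP + RTP with no options or extensions.
IPV4_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12
HEADER_BYTES = IPV4_HEADER_BYTES + UDP_HEADER_BYTES + RTP_HEADER_BYTES  # 40


def packet_sizes(codec_bitrate_bps: int, frame_ms: int = 20) -> dict:
    """Return payload size, IP packet size, and IP-layer bitrate for one direction."""
    payload_bytes = codec_bitrate_bps * frame_ms // 1000 // 8
    packet_bytes = payload_bytes + HEADER_BYTES
    packets_per_second = 1000 // frame_ms
    return {
        "payload_bytes": payload_bytes,
        "packet_bytes": packet_bytes,
        "ip_bitrate_bps": packet_bytes * 8 * packets_per_second,
    }


for name, bitrate in [("G.711", 64_000), ("G.729", 8_000)]:
    print(name, packet_sizes(bitrate))
```

Under these assumptions a 64 kbps G.711 stream costs 80 kbps on the wire, and an 8 kbps G.729 stream costs 24 kbps, because the fixed 40-byte header is paid 50 times per second regardless of codec.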
These packets travel through the internet like any other data: through routers, across fiber links, through carrier networks. Each packet is routed independently. A circuit-switched call occupies a fixed route for its entire duration; the packets of a VoIP call usually follow the same path but are free to diverge, and they can arrive late or out of order. The receiver puts them back in sequence using the sequence number and timestamp carried in each header.
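To make the reordering concrete, here is a minimal sketch that builds a few voice packets, shuffles them, and restores the order from the header. It uses the 12-byte fixed header layout of RTP, the media protocol covered in the next section. The helper names, the SSRC value, and the dummy payloads are assumptions for illustration, and the sketch ignores header extensions, CSRC lists, and 16-bit sequence number wraparound.

```python
import struct

# Fixed RTP header: flags byte, marker/payload-type byte, sequence number,
# timestamp, SSRC. Network byte order, 12 bytes in total.
RTP_FIXED_HEADER = struct.Struct("!BBHII")


def build_rtp(seq: int, timestamp: int, ssrc: int, payload: bytes, payload_type: int = 0) -> bytes:
    first_byte = 2 << 6                  # version 2, no padding, no extension, no CSRCs
    second_byte = payload_type & 0x7F    # marker bit clear; type 0 is G.711 mu-law
    return RTP_FIXED_HEADER.pack(first_byte, second_byte, seq, timestamp, ssrc) + payload


def parse_rtp(packet: bytes) -> dict:
    first, second, seq, timestamp, ssrc = RTP_FIXED_HEADER.unpack_from(packet)
    return {
        "version": first >> 6,
        "payload_type": second & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        # Valid only because this sketch never sets the extension or CSRC fields.
        "payload": packet[RTP_FIXED_HEADER.size:],
    }


# Three packets of 20 ms each; at 8,000 samples per second the timestamp
# advances by 160 per packet.
sent = [build_rtp(seq, seq * 160, 0x1234, bytes([seq]) * 4) for seq in (1, 2, 3)]
arrived = [sent[2], sent[0], sent[1]]    # simulate out-of-order arrival

ordered = sorted((parse_rtp(p) for p in arrived), key=lambda p: p["seq"])
print([p["seq"] for p in ordered])
```

A real receiver does this inside a jitter buffer, which also decides how long to wait for a late packet before giving up on it.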
The Protocol Stack: SIP, SDP, and RTP
SIP (Session Initiation Protocol) is the signaling layer — it handles the setup, management, and termination of sessions. Think of it as the air traffic controller that directs planes. It does not carry the voice; it sets up the conditions under which voice can flow.
When a SIP INVITE message is sent to start a call, it carries within it a payload encoded using SDP (Session Description Protocol). SDP tells the receiving endpoint: here is the IP address to send voice to, here is the codec I support, here is the port number, here is the bandwidth available. The called party responds with its own SDP payload listing its capabilities, and the two endpoints negotiate — typically agreeing to use a common codec and opening the appropriate ports.
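A small sketch of that negotiation, assuming a hypothetical SDP offer (the addresses, session IDs, and port are invented) and a deliberately naive parser. Real SDP has many more fields and attributes; this only pulls out the connection address, the audio port, and the codec list, then picks the first offered codec the callee also supports.

```python
# Hypothetical SDP offer. PCMU is G.711 mu-law; 192.0.2.10 is a documentation address.
OFFER = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0 18 111
a=rtpmap:0 PCMU/8000
a=rtpmap:18 G729/8000
a=rtpmap:111 opus/48000/2
"""


def parse_audio(sdp: str) -> dict:
    """Extract the media address, port, and codec names in order of preference."""
    info = {"address": None, "port": None, "codecs": []}
    names = {}            # payload type number -> codec name
    preference = []       # payload type numbers in the order offered
    for line in sdp.splitlines():
        if line.startswith("c="):
            info["address"] = line.split()[-1]
        elif line.startswith("m=audio"):
            fields = line.split()
            info["port"] = int(fields[1])
            preference = fields[3:]
        elif line.startswith("a=rtpmap:"):
            payload_type, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            names[payload_type] = encoding.split("/")[0]
    info["codecs"] = [names.get(pt, pt) for pt in preference]
    return info


def choose_codec(offered: list, supported: set):
    """Pick the first codec in the caller's preference order that the callee supports."""
    for codec in offered:
        if codec in supported:
            return codec
    return None


offer = parse_audio(OFFER)
print(offer)
print(choose_codec(offer["codecs"], {"G729", "opus"}))
```

Here the callee does not support PCMU, so the call settles on G729, the next codec in the caller's list.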
Once the session is established via SIP, the actual voice travels through a completely different protocol: RTP (Real-time Transport Protocol). RTP runs on top of UDP (User Datagram Protocol) rather than TCP, meaning it prioritizes speed over guaranteed delivery. If a packet is lost in transit, nothing asks for a resend, because a retransmitted packet would arrive too late to be played. Instead, the receiving endpoint conceals the gap, either with a brief silence or by extrapolating from the surrounding packets. With 20 ms of audio per packet, a stream runs at 50 packets per second, so a single lost packet is a gap the human ear rarely notices.
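The "keep playing, never wait" behaviour can be sketched as follows. This is a toy model, not how any particular phone does it: frames are stand-in byte strings, the function name `play_out` is invented, and the concealment strategy (repeat the last good frame, otherwise play silence) is only one simple option among many.

```python
def play_out(received: dict, first_seq: int, last_seq: int, silence: bytes = b"\x00") -> list:
    """Build the playout sequence, concealing gaps instead of waiting for a resend."""
    frames = []
    last_good = None
    for seq in range(first_seq, last_seq + 1):
        frame = received.get(seq)
        if frame is None:
            # Lost packet: repeat the previous frame if there is one, else play silence.
            frame = last_good if last_good is not None else silence
        else:
            last_good = frame
        frames.append(frame)
    return frames


received = {1: b"A", 2: b"B", 4: b"D"}   # packet 3 never arrived
print(play_out(received, 1, 4))           # [b'A', b'B', b'B', b'D']
```

Each frame stands for 20 ms of audio, so the concealed gap in this example lasts one fiftieth of a second.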
RTCP (RTP Control Protocol) runs alongside RTP, sending periodic reports between endpoints about packet counts, jitter, and round-trip time — allowing the system to monitor call quality and, in sophisticated implementations, dynamically adjust quality to match available bandwidth.
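As an illustration of one of those statistics, the sketch below computes a running interarrival jitter estimate in the style of the RTP specification (RFC 3550): for each packet, compare how far apart two packets were sent with how far apart they arrived, and fold the absolute difference into a smoothed average with a gain of 1/16. The function name and the sample numbers are assumptions; both times must be in the same clock units.

```python
def interarrival_jitter(packets: list) -> float:
    """packets: list of (rtp_timestamp, arrival_time) pairs in the same clock units."""
    jitter = 0.0
    previous = None
    for sent, arrived in packets:
        if previous is not None:
            prev_sent, prev_arrived = previous
            # Difference between arrival spacing and sending spacing.
            d = (arrived - prev_arrived) - (sent - prev_sent)
            jitter += (abs(d) - jitter) / 16
        previous = (sent, arrived)
    return jitter


steady = [(0, 0), (160, 160), (320, 320)]
one_late = [(0, 0), (160, 160), (320, 340)]   # third packet arrives 20 units late
print(interarrival_jitter(steady))     # 0.0
print(interarrival_jitter(one_late))   # 1.25
```

The 1/16 gain means a single late packet nudges the estimate rather than dominating it, which is why the reported figure reflects sustained network conditions.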
The SIP Call Flow: INVITE, 200 OK, ACK, BYE
The elegance of SIP is in its simplicity. A basic call between two endpoints needs only these messages:
INVITE: The caller sends an INVITE message to the SIP server (or directly to the recipient), announcing a desire to establish a session. This message contains SDP describing what the caller wants to send and can receive.
100 Trying: The server acknowledges receipt and indicates it is working on routing the call. The caller sees “connecting…” on their screen.
180 Ringing: The called party’s phone is ringing, and the caller hears the familiar ringback tone.
200 OK: The called party answers. The message includes their SDP with their chosen media parameters.
ACK: The caller confirms receipt of the 200 OK. At this point, the call is established and RTP media begins flowing directly between the two endpoints.
BYE: Either party sends a BYE message when they want to end the call.
200 OK: Confirmation that the session has been terminated.
This is a remarkably compact set of messages for something as complex as establishing a global voice call. And because SIP is a text-based protocol — modeled directly on HTTP — every message is human-readable, which makes debugging significantly easier than it would be with a binary protocol.
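To show what "human-readable" means in practice, here is a sketch that assembles a bare-bones INVITE and parses it back. The addresses, tag, branch, and Call-ID values are invented, and the parser is intentionally simplistic: it assumes one header per name, a colon followed by a single space, and no line folding or compact header forms, all of which a real SIP stack has to handle.

```python
def build_request(method: str, uri: str, headers: dict, body: str = "") -> str:
    """Assemble a SIP request: start line, headers, blank line, optional body."""
    lines = [f"{method} {uri} SIP/2.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    lines.append(f"Content-Length: {len(body.encode())}")
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_message(text: str):
    """Split a SIP message into its start line, a header dict, and the body."""
    head, _, body = text.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return start_line, headers, body


# All identifiers below are made-up example values.
invite = build_request(
    "INVITE",
    "sip:bob@example.com",
    {
        "Via": "SIP/2.0/UDP client.example.org;branch=z9hG4bK776asdhds",
        "Max-Forwards": "70",
        "To": "<sip:bob@example.com>",
        "From": "<sip:alice@example.org>;tag=1928301774",
        "Call-ID": "a84b4c76e66710",
        "CSeq": "314159 INVITE",
        "Contact": "<sip:alice@client.example.org>",
    },
)

print(invite)
start_line, headers, body = parse_message(invite)
print(start_line)
print(headers["CSeq"])
```

In a real call the body would carry the SDP offer and a Content-Type header of application/sdp; it is left empty here to keep the message short.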
The Infrastructure: UAs, Registrars, Proxies, and PBXs
A complete VoIP deployment involves several distinct components working together.
SIP UA (User Agent): The endpoint. This is your IP phone, your softphone application, or your mobile app. Each UA has a SIP URI, a unique identifier that looks like an email address with a sip: prefix, such as sip:alice@company.com. The UA’s job is to initiate calls (User Agent Client) and receive them (User Agent Server).
SIP Registrar: When a SIP phone starts up or moves to a new network location, it sends a REGISTER message to a registrar server, telling it: “I am sip:alice@company.com, and I am currently reachable at this IP address.” The registrar maintains this mapping of SIP URIs to current IP addresses. Without registrars, SIP would have no way to know where to deliver an incoming call: the network would not know whether you are at a coffee shop or at your office.
SIP Proxy Server: The router of SIP messages. When a call comes in for alice@company.com, the proxy looks up where alice is currently registered and forwards the INVITE accordingly. Proxies can be stateful (keeping track of each transaction so they can retransmit requests, match responses to them, and ring several devices at once) or stateless (simply forwarding each message and forgetting it). Stateful proxies can do more; stateless proxies are simpler and faster.
Redirect Server: Rather than forwarding a request, a redirect server answers it with a 3xx response such as 302 Moved Temporarily, telling the caller: “The person you are trying to reach is at this address instead.” The caller then sends a new INVITE to that address. This is one way to handle a user who is traveling: the redirect points incoming calls at a different number or SIP URI.
IP PBX (Private Branch Exchange): The business phone system. In a VoIP context, the PBX is typically software running on a server — Asterisk, FreePBX, 3CX, or cloud platforms like RingCentral and Microsoft Teams Phone. The PBX manages internal extensions, IVRs (auto-attendants), call queuing, call recording, and routing policies. It connects to the outside world through a SIP Trunk.
SIP Trunk: The virtual equivalent of a physical telephone line. A SIP trunk is a logical connection between your PBX and a VoIP service provider (or directly to the public telephone network). It carries inbound and outbound calls. Unlike a physical PRI line with a fixed number of channels (typically 23 for a T1), a SIP trunk can typically carry hundreds of simultaneous calls, limited only by your internet bandwidth and the provider’s capacity.
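The registrar and proxy roles described above boil down to a lookup table with expiry. The sketch below is a toy in-memory model under several assumptions: the class and function names are invented, each user has a single contact (real registrars allow several), there is no authentication, and an unregistered user gets a 480 Temporarily Unavailable response.

```python
import time


class Registrar:
    """Toy location service: maps a SIP URI to its current contact address."""

    def __init__(self):
        self._bindings = {}   # SIP URI -> (contact address, expiry time)

    def register(self, uri: str, contact: str, expires: int = 3600, now: float = None):
        now = time.time() if now is None else now
        self._bindings[uri] = (contact, now + expires)

    def lookup(self, uri: str, now: float = None):
        now = time.time() if now is None else now
        binding = self._bindings.get(uri)
        if binding is None or binding[1] <= now:
            return None       # never registered, or the registration has expired
        return binding[0]


def proxy_route(registrar: Registrar, request_uri: str, now: float = None):
    """Decide what a proxy does with an INVITE for request_uri."""
    contact = registrar.lookup(request_uri, now)
    if contact is None:
        return ("reject", "480 Temporarily Unavailable")
    return ("forward", contact)


registrar = Registrar()
# Alice's softphone registers from a coffee shop (addresses are made up).
registrar.register("sip:alice@company.com", "sip:alice@203.0.113.7:5060", expires=3600, now=0)

print(proxy_route(registrar, "sip:alice@company.com", now=10))
print(proxy_route(registrar, "sip:alice@company.com", now=4000))   # registration expired
print(proxy_route(registrar, "sip:bob@company.com", now=10))
```

The expiry is what keeps the table honest: a phone that drops off the network without saying goodbye simply stops refreshing its registration, and calls stop being forwarded to a dead address.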
SIP Trunking: The Business Transformation
For businesses, SIP trunking is the most significant change in enterprise telephone infrastructure since the introduction of digital PBX systems in the 1980s.
The practical implications are immediate and measurable:
Cost: A traditional PSTN line or PRI circuit costs a fixed monthly fee regardless of usage, plus per-minute charges for long-distance. SIP trunks typically come in flat-rate packages: unlimited calls for a fixed monthly cost per channel. Savings of 50–70% on telephone costs are commonly cited after migrating to SIP trunking, though the actual figure depends on call volume and destinations.
Scalability: Adding a physical PSTN line requires a technician visit and carrier coordination. Adding a SIP trunk channel takes minutes. During a product launch or seasonal spike, a business can provision 100 additional SIP trunk channels in the morning and release them in the evening.
Geographic flexibility: A SIP trunk is not tied to a physical location. A company with offices in Hong Kong, Singapore, and London can have local SIP trunk numbers in each city — so customers in each market call a local number — while all three offices share the same underlying IP network and call routing system.
Resilience: PSTN lines are physically vulnerable — a cut cable or power failure at the exchange can knock out every phone in the building. SIP traffic routes over the internet, which has built-in redundancy through multiple paths. If one router fails, packets automatically reroute through another path. A well-designed SIP deployment with diverse ISP connections can achieve uptime significantly better than a traditional PBX.
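The scalability point has a bandwidth side that is easy to estimate. The sketch below is a back-of-the-envelope calculation, not a sizing tool: it assumes G.711 at about 80 kbps per call per direction at the IP layer (160 bytes of voice plus 40 bytes of headers, 50 times per second), a symmetric link, and an arbitrary 25% of capacity held back for other traffic. The function names and the headroom figure are illustrative choices.

```python
G711_IP_BPS = 80_000   # assumed per-call, per-direction rate at the IP layer


def bandwidth_for_channels(channels: int, per_call_bps: int = G711_IP_BPS) -> int:
    """Bandwidth needed in each direction for a number of simultaneous calls."""
    return channels * per_call_bps


def max_concurrent_calls(link_bps: int, per_call_bps: int = G711_IP_BPS, headroom: float = 0.25) -> int:
    """Calls that fit on a link after reserving a fraction of it for other traffic."""
    usable_bps = link_bps * (1 - headroom)
    return int(usable_bps // per_call_bps)


print(bandwidth_for_channels(100))            # 8000000, i.e. 8 Mbps each way
print(max_concurrent_calls(100_000_000))      # 937 calls on a 100 Mbps link
```

Under these assumptions, the 100 extra channels from the example above need roughly 8 Mbps in each direction, which is why adding them is a provisioning step rather than a construction project.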
Security: The Real Concern
IP networks are inherently more exposed than closed circuit-switched networks. A physical telephone line cannot be intercepted without physical access to the wire. An IP packet traveling across the internet can, in principle, be captured by any router along the path.
The VoIP industry has addressed this with layered encryption:
TLS (Transport Layer Security): Encrypts SIP signaling messages — the INVITEs, BYEs, and 200 OKs — so that call metadata (who is calling whom, when, for how long) cannot be intercepted and analyzed.
SRTP (Secure RTP): Encrypts the actual voice media packets themselves. Even if an attacker captures the RTP stream, they cannot listen to the conversation without the encryption keys.
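ZRTP: A key-agreement protocol for SRTP rather than a replacement for it. The two endpoints run a Diffie-Hellman exchange inside the media path to establish SRTP session keys without a pre-shared secret and without trusting the signaling servers, which gives end-to-end encryption of the media stream.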
In practice, enterprise VoIP deployments should always use TLS for SIP signaling and SRTP for media. Most modern IP phones and softphone clients support both by default.
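The Diffie-Hellman exchange mentioned under ZRTP is what lets two phones agree on a key without ever sending it. The sketch below shows only the core idea with toy parameters: the 32-bit prime and the generator are chosen for readability and offer no security, and the function names are invented. Real deployments use much larger standardized groups or elliptic curves, and ZRTP adds further steps (such as a short authentication string the two callers compare) to detect a man in the middle.

```python
import secrets

# Toy parameters for illustration only. 2**32 - 5 is prime, but far too small to be secure.
P = 0xFFFFFFFB
G = 5


def keypair():
    """Pick a random private value and derive the public value to send to the peer."""
    private = secrets.randbelow(P - 3) + 2   # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_secret(their_public: int, my_private: int) -> int:
    """Combine the peer's public value with our own private value."""
    return pow(their_public, my_private, P)


alice_private, alice_public = keypair()
bob_private, bob_public = keypair()

# Only the public values cross the network; both sides derive the same secret.
alice_secret = shared_secret(bob_public, alice_private)
bob_secret = shared_secret(alice_public, bob_private)
print(alice_secret == bob_secret)   # True
```

In a real system the shared value would be fed through a key derivation function to produce the SRTP keys, rather than being used directly.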
The Next Wave: WebRTC and AI-Native Telephony
SIP is the foundation of much of modern voice communication, but two developments are pushing the technology further.
WebRTC (Web Real-Time Communication): A set of browser APIs that allows voice and video calls to be established directly within a web browser, without installing any software. When you click “Join meeting” on a Google Calendar invite and your camera opens in Chrome, that is WebRTC at work. WebRTC does not mandate a signaling protocol; each application chooses its own, and SIP over WebSocket is one common choice. It does reuse the rest of the stack described here: SDP to describe sessions and SRTP for media transport. It represents the continuing trend of moving voice from hardware to software.
AI-native call intelligence: The combination of SIP-based call infrastructure with AI is producing systems that can automatically transcribe, summarize, and analyze every phone conversation in real time. Call centers are deploying AI assistants that listen to calls, suggest responses to human agents in real time, and automatically log CRM notes after each call. This layer of intelligence sits on top of the SIP infrastructure: because the voice already travels as a stream of packets, the platform can copy that stream to an AI system without touching the phones at either end.
The PSTN is being switched off, country by country. What is replacing it is not just a new telephone network: it is a programmable, AI-ready, globally connected voice platform built on a protocol designed in 1996 for multicast video conferences. The people who designed SIP could not have predicted how much of the world’s phone traffic their invention would eventually carry. But here we are: SIP trunks, cloud PBX extensions, and AI-powered call center agents are all routed through a protocol that started as a way to stream academic seminars across the early internet. Even apps with their own signaling, such as Zoom and WhatsApp, rest on the same idea of voice as packets.
