TelcomIQ

Overview

RTP — the Real-time Transport Protocol — is the IETF protocol that carries the actual audio and video content of a call. When SIP sets up a VoLTE call between two subscribers, it negotiates the parameters of the media streams they will exchange: which codecs are acceptable, what IP addresses and ports will be used, whether encryption is required. RTP is the protocol that then flows between those endpoints carrying the encoded voice samples for the duration of the call.

RTP is defined in RFC 3550 and runs over UDP. The choice of UDP over TCP is deliberate: real-time media tolerates occasional packet loss far better than it tolerates the retransmission delays that TCP would introduce. A short silence or momentary audio glitch is less disruptive to a voice call than the jitter caused by TCP retransmission during a burst of congestion.

Alongside RTP, the RTP Control Protocol (RTCP) operates on the adjacent UDP port. RTCP carries statistics — packet counts, loss rates, jitter measurements — that allow both endpoints and intermediate elements to assess call quality, and carries participant identity information (CNAME). RTCP does not carry media; it provides the monitoring and control plane for RTP sessions.

Secure RTP (SRTP), defined in RFC 3711, adds encryption and authentication to RTP payloads. In 3GPP IMS networks, TS 33.328 mandates SRTP for all IMS media sessions. Despite this, plain RTP remains common in enterprise VoIP deployments, legacy operator infrastructure, and anywhere that SRTP key negotiation was not properly implemented.

How it works

An RTP packet has a fixed 12-byte header followed by the media payload. The header carries enough information to allow the receiver to reconstruct the timing, ordering, and source identity of each packet in the stream.

RTP header fields:

Version (2 bits) — Always 2.
Padding, Extension flags (1 bit each) — Indicate the presence of padding bytes and a header extension.
CC (4 bits) — CSRC count; number of Contributing Source identifiers following the header.
Marker (1 bit) — Payload-specific; in voice, set on the first packet of a talk spurt.
Payload type (7 bits) — Identifies the codec. Statically assigned types include 0 (PCMU/G.711 μ-law), 8 (PCMA/G.711 A-law), 9 (G.722), 18 (G.729). Dynamic types 96–127 are assigned per-session in SDP; AMR-NB, AMR-WB, and EVS use dynamic types in VoLTE.
Sequence number (16 bits) — Incremented by one per packet. Used by the receiver to detect loss, reordering, and duplicates.
Timestamp (32 bits) — Reflects the sampling instant of the first byte in the payload. Drives the receiver's playout buffer and jitter calculation. Not a wall-clock time.
SSRC (32 bits) — Synchronisation Source identifier. A randomly chosen 32-bit value that uniquely identifies the RTP source within the session.
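
The field layout above maps directly onto a bit-level decode. The following is a minimal Python sketch of that decode, not a production parser: the names RtpHeader and parse_rtp_header and the sample packet are illustrative assumptions, and the CSRC list and header extension are deliberately ignored.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed 12-byte RTP header described above."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    # Two flag bytes, then sequence number (16 bits), timestamp and SSRC
    # (32 bits each), all in network byte order.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# Hypothetical sample: version 2, marker set, payload type 0 (PCMU).
sample = bytes([0x80, 0x80]) + struct.pack("!HII", 1000, 160, 0x1234ABCD)
header = parse_rtp_header(sample)
print(header)
```

Because none of these fields is authenticated in plain RTP, any node that sees the packet can read them the same way.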

SDP negotiation and codec selection

Before any RTP flows, SIP carries Session Description Protocol (SDP) bodies in INVITE and 200 OK messages. The INVITE SDP offer lists the caller's supported codecs (as RTP payload types with associated rtpmap and fmtp attributes), the IP address and port for the RTP stream, and any SRTP parameters. The 200 OK SDP answer selects one codec from the offered set and confirms the media address.
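
The offer/answer step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a full SDP parser: the offer text, address, port, and payload type numbers are made up, the helper names are hypothetical, and static payload types that carry no rtpmap line are simply skipped.

```python
import re
from typing import Optional

# Hypothetical SDP offer fragment; all values are made up for illustration.
OFFER = """\
c=IN IP4 192.0.2.10
m=audio 49170 RTP/AVP 97 98 8
a=rtpmap:97 AMR-WB/16000
a=rtpmap:98 AMR/8000
a=rtpmap:8 PCMA/8000
"""


def offered_codecs(sdp: str) -> list[tuple[int, str]]:
    """Return (payload type, codec name) pairs in the offerer's preference order."""
    m_line = re.search(r"^m=audio \d+ \S+ (.+)$", sdp, re.MULTILINE)
    if m_line is None:
        return []
    order = [int(pt) for pt in m_line.group(1).split()]
    names = {
        int(pt): name
        for pt, name in re.findall(r"^a=rtpmap:(\d+) ([^/\s]+)/", sdp, re.MULTILINE)
    }
    # Payload types without an rtpmap line are dropped in this sketch.
    return [(pt, names[pt]) for pt in order if pt in names]


def select_codec(sdp: str, supported: set[str]) -> Optional[tuple[int, str]]:
    """Pick the first offered codec that the answerer also supports."""
    for pt, name in offered_codecs(sdp):
        if name in supported:
            return pt, name
    return None


choice = select_codec(OFFER, {"AMR", "PCMA"})
print(choice)
```

Here the answerer does not support AMR-WB, so the first mutually supported entry in the offer's order wins.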

In VoLTE, 3GPP TS 26.114 mandates support for AMR-NB and AMR-WB (Adaptive Multi-Rate Narrowband and Wideband) as the primary voice codecs. EVS (Enhanced Voice Services) is the newer codec: it is the preferred codec for 5G VoNR, is also used in VoLTE where both ends support it, and offers improved quality at lower bit rates.

RTCP

RTCP packets flow on the port one above the RTP port (by convention). Two principal RTCP packet types are relevant to operator networks:

SR (Sender Report) — Sent by the active media sender. Carries NTP and RTP timestamps (for synchronisation with other media streams), packet count, and octet count. The NTP timestamp allows an intermediate system to correlate RTP timing with wall-clock time — relevant for lawful intercept and call recording.
RR (Receiver Report) — Sent by receivers that are not sending. Carries fraction of packets lost, cumulative packet loss count, highest sequence number received, interarrival jitter, and last SR/delay since last SR. This is the primary source of in-call quality metrics.
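
The quality metrics in a Receiver Report can be pulled out with a short decoder. The sketch below assumes a standalone RR packet (not a compound RTCP packet) laid out as in RFC 3550, where each report block is 24 bytes, jitter is expressed in RTP timestamp units, and the delay since last SR is in units of 1/65536 second. The names ReportBlock and parse_receiver_report and the sample values are illustrative.

```python
import struct
from typing import NamedTuple


class ReportBlock(NamedTuple):
    source_ssrc: int
    fraction_lost: float   # 0.0 to 1.0, since the previous report
    cumulative_lost: int
    highest_seq: int       # extended highest sequence number received
    jitter_ms: float
    last_sr: int
    delay_since_last_sr: float  # seconds


def parse_receiver_report(packet: bytes, clock_rate: int) -> list[ReportBlock]:
    """Decode the report blocks of a single RTCP RR packet (packet type 201)."""
    first, pkt_type, _length, _reporter_ssrc = struct.unpack("!BBHI", packet[:8])
    if first >> 6 != 2 or pkt_type != 201:
        raise ValueError("not an RTCP Receiver Report")
    blocks = []
    for i in range(first & 0x1F):  # low 5 bits: number of report blocks
        offset = 8 + 24 * i
        ssrc, loss_word, highest, jitter, lsr, dlsr = struct.unpack(
            "!IIIIII", packet[offset:offset + 24])
        cumulative = loss_word & 0xFFFFFF
        if cumulative & 0x800000:  # cumulative loss is a signed 24-bit value
            cumulative -= 1 << 24
        blocks.append(ReportBlock(
            ssrc, (loss_word >> 24) / 256, cumulative, highest,
            1000 * jitter / clock_rate, lsr, dlsr / 65536))
    return blocks


# Hypothetical report: 26/256 of packets lost, jitter of 160 timestamp units.
sample_rr = struct.pack("!BBHI", 0x81, 201, 7, 0x11111111) + struct.pack(
    "!IIIIII", 0x1234ABCD, (26 << 24) | 5, 1000, 160, 0, 32768)
report = parse_receiver_report(sample_rr, clock_rate=8000)[0]
print(report)
```

The clock rate must match the codec negotiated in SDP (8000 Hz in this example), otherwise the jitter conversion to milliseconds is wrong.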

Architecture role

RTP is purely a bearer protocol — it carries no signalling and imposes no topology requirements. The media path for a VoLTE call can flow directly between UEs (if ICE/STUN negotiation succeeds and no NAT traversal is required) or via a Media Resource Function (MRF) in the IMS core. The IMS core's signalling path (SIP) and media path (RTP) are logically separate, which is a fundamental architectural difference from circuit-switched telephony where signalling and bearer shared the same infrastructure.

In practice, operator IMS deployments almost always route media through a Telephony Application Server (TAS) or Media Border Gateway (MBG) that acts as a Back-to-Back User Agent (B2BUA). This centralises media termination for lawful intercept, call recording, and quality monitoring, at the cost of introducing an additional network element in the media path.

In 4G VoLTE, RTP packets travel over the radio bearer between the UE and the eNodeB and are then encapsulated in GTP-U tunnels from the eNodeB through the S-GW to the P-GW, exactly like any other user-plane IP traffic. The P-GW applies QoS enforcement: the VoLTE media bearer (an EPS bearer with QCI 1) has a guaranteed bit rate and priority treatment, so the RTP stream takes precedence over best-effort data traffic.

In 5G VoNR, the media path is essentially identical. EVS replaces AMR-WB as the preferred codec for higher quality, and the 5G QoS framework (QoS flows characterised by a 5QI, rather than EPS bearers characterised by a QCI) provides the equivalent priority treatment. SRTP remains the encryption mechanism.

Key interfaces

Interface	Between	Direction	Purpose
Gm (media)	UE ↔ P-CSCF/MBG	Bidirectional	Media path for VoLTE; SRTP streams
Mr	S-CSCF ↔ MRFC	Bidirectional	SIP control of media resource for conference/tone
Mp	MRFC ↔ MRFP	Bidirectional	H.248 control of the MRFP, which terminates the RTP/RTCP streams

Security posture

RTP itself provides no authentication, integrity, or confidentiality. A plain RTP stream is a UDP flow that any node on the network path can intercept and read. In circuit-switched telephony, physical access to the transmission network was the implicit security boundary. In IP-based VoLTE, the equivalent assumption is that the operator's packet core is not accessible to attackers — an assumption that is increasingly strained in shared data centre environments, virtualised core deployments, and, especially, at roaming interconnects.

SRTP addresses confidentiality and authentication. An SRTP payload is indistinguishable from random data to an observer who does not hold the session key; the RTP header itself stays in the clear. SRTP authentication prevents injection: each packet carries a MAC that the receiver validates with the session key. However, SRTP only protects the stream between the two SRTP endpoints, and its key negotiation must itself be protected. With SDES (SDP Security Descriptions), typical of older IMS deployments, the key travels inside the SDP, so the SIP signalling that carries it must be encrypted. With DTLS-SRTP, used in more modern implementations, the SDP carries only a certificate fingerprint, which still needs integrity protection so that an attacker cannot substitute their own.
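
The per-packet MAC check can be illustrated with Python's standard hmac module. This is a simplified sketch of the default SRTP authentication transform in RFC 3711 (HMAC-SHA1 over the packet followed by the 32-bit rollover counter, truncated to 80 bits). It assumes the session authentication key has already been derived, omits payload encryption and the optional MKI, and uses made-up key and packet values; the function names are hypothetical.

```python
import hashlib
import hmac
import struct

TAG_LENGTH = 10  # 80-bit tag, as in the AES_CM_128_HMAC_SHA1_80 suite


def srtp_auth_tag(auth_key: bytes, packet: bytes, roc: int) -> bytes:
    """HMAC-SHA1 over the packet plus the rollover counter, truncated."""
    mac = hmac.new(auth_key, packet + struct.pack("!I", roc), hashlib.sha1)
    return mac.digest()[:TAG_LENGTH]


def verify(auth_key: bytes, protected: bytes, roc: int) -> bool:
    """Split off the trailing tag and compare it in constant time."""
    packet, tag = protected[:-TAG_LENGTH], protected[-TAG_LENGTH:]
    return hmac.compare_digest(tag, srtp_auth_tag(auth_key, packet, roc))


# Hypothetical session authentication key and packet (header plus payload).
key = bytes(range(20))
packet = bytes([0x80, 0x00]) + struct.pack("!HII", 1, 160, 0x1234ABCD) + b"\x00" * 20
protected = packet + srtp_auth_tag(key, packet, roc=0)

# An injected or modified packet fails the check without the key.
forged = bytes([protected[0], protected[1] ^ 0x01]) + protected[2:]
print(verify(key, protected, 0), verify(key, forged, 0))
```

A receiver that drops every packet failing this check cannot be fed injected audio by someone who lacks the session key.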

The combined SIP-over-TLS/IPsec plus SRTP model protects signalling and media on the access segment between the UE and the P-CSCF, which TS 33.328 calls end-to-access-edge protection. Protection beyond the P-CSCF depends on the operator's internal network controls.

Attack surface

Media interception of unencrypted RTP

Any network element on the path between two RTP endpoints that carries plain RTP can capture and reconstruct a call. In operator networks, the most accessible position for such interception is at the roaming interconnect — the boundary where media paths traverse shared IPX infrastructure. Plain RTP across an IPX is trivially intercepted by any operator or party with access to that infrastructure.

Impact: Full interception of call audio content.
Difficulty: Low where RTP is unencrypted. Requires network access at an intermediate point in the media path.

RTP packet injection for call disruption

An attacker on the network path who can inject UDP packets matching the RTP source IP and SSRC of an active stream can inject audio into the call. High-volume injection — injecting packets with higher sequence numbers than the legitimate stream — causes the receiver's playout buffer to treat the injected packets as the current stream, overriding the legitimate audio.

Impact: Audio disruption or insertion of arbitrary content into an active call.
Difficulty: Medium without SRTP; requires network path access and knowledge of the active SSRC and sequence number.
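
A monitoring probe can flag the sequence-number jump that this kind of injection produces. The sketch below borrows the tolerance values used in the sample code of RFC 3550 Appendix A.1 (3000 packets forward, 100 packets backward). The classify function and its labels are illustrative assumptions, not part of any specification, and a real probe would also track the SSRC and source address.

```python
MAX_DROPOUT = 3000   # largest forward jump treated as ordinary loss
MAX_MISORDER = 100   # largest backward step treated as reordering
SEQ_MOD = 1 << 16    # sequence numbers are 16 bits and wrap around


def classify(last_seq: int, seq: int) -> str:
    """Classify a packet's sequence number against the last accepted one."""
    delta = (seq - last_seq) % SEQ_MOD
    if delta == 0:
        return "duplicate"
    if delta < MAX_DROPOUT:
        return "in-order"
    if delta > SEQ_MOD - MAX_MISORDER:
        return "reordered"
    # Neither a plausible loss burst nor late delivery: possible injection
    # or a restarted sender.
    return "suspicious"


for last, current in [(100, 101), (65535, 0), (100, 98), (100, 20100)]:
    print(last, current, classify(last, current))
```

A jump flagged as suspicious is not proof of injection, since a sender restart looks the same, but it is a cheap trigger for closer inspection.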

SRTP key negotiation interception via SDP manipulation

If the SIP signalling that carries SDP is not encrypted end-to-end, an attacker in a man-in-the-middle position can modify the SRTP key exchange in the SDP. The attacker substitutes their own keying material, causing both endpoints to encrypt to the attacker rather than to each other, effectively establishing a decryption relay.

Impact: Full access to SRTP-protected media stream despite encryption being nominally enabled.
Difficulty: High. Requires both a position in the SIP signalling path and active manipulation capability — not passive interception alone.
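
With SDES, the exposure is easy to see because the master key and salt sit base64-encoded in an a=crypto attribute of the SDP. The sketch below assumes the AES_CM_128_HMAC_SHA1_80 suite (16-byte key followed by a 14-byte salt) and a made-up key. The helper names are hypothetical, and the substitution helper discards any lifetime or MKI parameters for brevity.

```python
import base64


def parse_crypto(line: str) -> tuple[int, str, bytes, bytes]:
    """Split an SDES crypto attribute into tag, suite, master key and salt."""
    attr, _, value = line.partition(":")
    if attr != "a=crypto":
        raise ValueError("not a crypto attribute")
    tag, suite, key_params = value.split()[:3]
    method, _, key_info = key_params.partition(":")
    if method != "inline":
        raise ValueError("unsupported key method")
    # Optional lifetime and MKI fields follow the key after a '|'.
    material = base64.b64decode(key_info.split("|")[0])
    return int(tag), suite, material[:16], material[16:]


def substitute_key(line: str, material: bytes) -> str:
    """What an on-path attacker does: swap in keying material of their own."""
    prefix, _, _ = line.partition("inline:")
    return prefix + "inline:" + base64.b64encode(material).decode()


# Hypothetical attribute built from a made-up key and salt.
master_key, master_salt = bytes(range(16)), bytes(range(16, 30))
crypto_line = ("a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:"
               + base64.b64encode(master_key + master_salt).decode())

tag, suite, key, salt = parse_crypto(crypto_line)
tampered = substitute_key(crypto_line, bytes(30))
print(suite, key.hex(), parse_crypto(tampered)[2].hex())
```

Nothing in the attribute itself reveals the substitution, which is why the SIP message carrying it needs confidentiality and integrity protection.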

RTCP flood

RTCP reports are sent periodically, and a well-behaved endpoint limits them to a small fraction of the session bandwidth (RFC 3550 recommends 5%). An attacker is not bound by that limit: by generating RTCP reports at a high rate for a legitimate RTP session, they can overwhelm the receiver's RTCP processing, consume CPU on media handling nodes, and degrade call quality.

Impact: Call quality degradation; media processing resource exhaustion on MRF nodes.
Difficulty: Low. Requires only knowledge of a target SSRC and the media address.
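
One way a media node can blunt this is to cap how many RTCP packets it processes per source. The following is a minimal sliding-window sketch: the class name, the (address, SSRC) key, and the budget of two packets per five seconds are illustrative assumptions rather than values taken from any specification. Timestamps are passed in explicitly so the behaviour is easy to test.

```python
from collections import defaultdict


class RtcpRateLimiter:
    """Allow at most max_packets RTCP packets per source in a sliding window."""

    def __init__(self, max_packets: int, window_seconds: float):
        self.max_packets = max_packets
        self.window = window_seconds
        self.arrivals = defaultdict(list)  # source -> accepted arrival times

    def allow(self, source, now: float) -> bool:
        # Forget arrivals that have aged out of the window.
        recent = [t for t in self.arrivals[source] if now - t < self.window]
        allowed = len(recent) < self.max_packets
        if allowed:
            recent.append(now)
        self.arrivals[source] = recent
        return allowed


limiter = RtcpRateLimiter(max_packets=2, window_seconds=5.0)
source = ("192.0.2.10", 0x1234ABCD)  # hypothetical (address, SSRC) key
decisions = [limiter.allow(source, t) for t in (0.0, 0.1, 0.2, 0.3, 5.5)]
print(decisions)
```

Packets that exceed the budget are dropped before any report parsing, so the cost of a flood is bounded by the cheap lookup rather than by full RTCP processing.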

Mitigations

SRTP enforcement per TS 33.328: The P-CSCF must enforce SRTP for all IMS media sessions. The S-CSCF and TAS should reject SDP offers that do not include SRTP key parameters. Plain RTP fallback must be disabled in IMS configurations.
SIP signalling encryption: SRTP key material in SDP is protected only if the SIP messages carrying it are encrypted. Mandate TLS between the UE and P-CSCF (via the Gm security setup in TS 33.203) and IPsec between P-CSCF and the IMS core.
SDP validation at the P-CSCF: The P-CSCF must validate that the media address in SDP corresponds to the subscriber's registered UE address. An SDP offer pointing media to an IP address outside the subscriber's assigned prefix is a red flag for call hijacking.
Media path monitoring: An MRF or probe in the media path can detect anomalous SSRC collisions (multiple sources claiming the same SSRC in a single session), packet loss anomalies, and codec negotiation changes mid-session that are inconsistent with normal call flows.
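
The SDP validation item above reduces to a prefix check on the connection address. This sketch uses Python's ipaddress module and assumes unicast c= lines without a TTL suffix. The function names, the subscriber prefix, and the sample SDP bodies are hypothetical, and a real P-CSCF would take the prefix from the subscriber's registration state.

```python
import ipaddress
import re


def media_addresses(sdp: str) -> list:
    """Collect every connection address (c= line) in the SDP body."""
    found = re.findall(r"^c=IN IP[46] (\S+)", sdp, re.MULTILINE)
    return [ipaddress.ip_address(addr) for addr in found]


def media_address_allowed(sdp: str, subscriber_prefix: str) -> bool:
    """True only if the SDP names at least one address and all fall in the prefix."""
    network = ipaddress.ip_network(subscriber_prefix)
    addresses = media_addresses(sdp)
    return bool(addresses) and all(addr in network for addr in addresses)


# Hypothetical offers: one from the registered UE, one redirecting media.
good_offer = "c=IN IP4 10.20.30.40\nm=audio 49170 RTP/AVP 97\n"
bad_offer = "c=IN IP4 203.0.113.5\nm=audio 49170 RTP/AVP 97\n"

print(media_address_allowed(good_offer, "10.20.30.0/24"))
print(media_address_allowed(bad_offer, "10.20.30.0/24"))
```

An offer that fails the check is a candidate for rejection or at least logging, in line with the red-flag criterion described above.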

Spec references

RFC 3550 — The foundational RTP/RTCP specification. Section 5 defines the RTP packet format; Section 6 defines the RTCP packet types. The reference for all RTP implementations.
RFC 3711 — The SRTP specification. Section 3 defines the SRTP packet format and its relation to the RTP header; Section 4 defines the pre-defined cryptographic transforms, including the key derivation function (Section 4.3). Essential for IMS media security configuration.
3GPP TS 26.114 — IMS multimedia telephony media handling. Clause 5 defines the mandatory codec requirements (AMR, AMR-WB, EVS) for VoLTE and VoNR; Clause 6 defines the media configuration and SDP negotiation procedures.
3GPP TS 33.328 — IMS media plane security. Defines the SRTP cipher suite requirements, key management options, and interoperability requirements for IMS deployments.

RTP is paired with SIP in every IMS voice session: SIP is the signalling protocol that negotiates the session, while RTP carries the media once the session is established. The two protocols are complementary and always deployed together in IMS.

RTP is the media bearer for VoLTE in 4G and VoNR in 5G. Both services are built on the IMS architecture, with the IMS core providing the SIP proxy and service layer. For the security dimension, see SIP/VoIP attacks, which covers both the signalling and media attack surfaces in detail.