Head-Related Transfer Functions (HRTF)
Head-Related Transfer Functions model how sounds arriving from different directions are filtered by the listener's head, pinnae (outer ears), and torso before reaching the eardrums. This filtering enables humans to localize sounds in 3D space using only two ears.
What is an HRTF?
When a sound wave travels from a source to a listener, it arrives at each ear with different characteristics depending on the source's location:
- Interaural Time Difference (ITD): Sound arrives at the near ear before the far ear
- Interaural Level Difference (ILD): The head shadows high frequencies, making the far ear quieter
- Spectral Cues: The pinnae create frequency-dependent reflections and diffractions that encode elevation and front/back information
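To make the first cue concrete: for a rigid spherical head, the ITD is often approximated with Woodworth's formula, $\Delta t \approx \frac{r}{c}(\theta + \sin\theta)$. Below is a minimal sketch of that approximation; the head radius and speed of sound are typical textbook values, not parameters taken from bbx_audio.

```rust
/// Approximate ITD (in seconds) for a source at azimuth `theta` radians,
/// using Woodworth's spherical-head model: dt = (r / c) * (theta + sin(theta)).
/// Valid for azimuths in [0, pi/2].
fn woodworth_itd(theta: f32) -> f32 {
    const HEAD_RADIUS_M: f32 = 0.0875; // average adult head radius (assumed value)
    const SPEED_OF_SOUND_M_S: f32 = 343.0; // speed of sound in air at ~20 °C
    (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + theta.sin())
}

fn main() {
    // A source directly to one side (90°) yields the maximum ITD,
    // roughly 0.66 ms for an average head.
    let itd = woodworth_itd(std::f32::consts::FRAC_PI_2);
    println!("ITD at 90°: {:.3} ms", itd * 1000.0);
}
```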
Spatial Coordinate System
HRTF measurements use a spherical coordinate system centered on the listener:
- Azimuth (θ): The horizontal angle around the listener, measured in degrees. 0° is directly in front, 90° is to the right, 180° (or -180°) is directly behind, and -90° is to the left.
- Elevation (φ): The vertical angle above or below the horizontal plane. 0° is ear-level, +90° is directly above, and -90° is directly below.
- Frequency (ω): The angular frequency of the sound wave in radians per second (ω = 2πf where f is frequency in Hz). HRTFs are frequency-dependent because the head and ears affect different frequencies differently—low frequencies diffract around the head while high frequencies are shadowed by it, and the small structures of the pinnae only interact with wavelengths comparable to their size (roughly 1.5-17 kHz).
An HRTF captures all these cues as a frequency-domain transfer function $H(\omega, \theta, \phi)$.
HRIR: Time-Domain Representation
The Head-Related Impulse Response (HRIR) is the time-domain equivalent of an HRTF. It represents what happens to an impulse (click) traveling from a specific direction:
$$ \text{HRIR}(t, \theta, \phi) = \mathcal{F}^{-1}\left\{ \text{HRTF}(\omega, \theta, \phi) \right\} $$
HRIRs are typically 128-512 samples long (2.7-10.7 ms at 48 kHz) and encode the full binaural transformation for a single direction.
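That inverse-FFT relationship can be sketched with the rustfft crate (used here purely for illustration; it is not implied to be a bbx_audio dependency). rustfft leaves its output unnormalized, so the result is scaled by $1/N$:

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// Convert a sampled HRTF (complex frequency bins for one direction and one
/// ear) into its time-domain HRIR via an inverse FFT.
fn hrtf_to_hrir(hrtf_bins: &[Complex<f32>]) -> Vec<f32> {
    let n = hrtf_bins.len();
    let mut buf = hrtf_bins.to_vec();

    let mut planner = FftPlanner::<f32>::new();
    let ifft = planner.plan_fft_inverse(n);
    ifft.process(&mut buf);

    // rustfft does not normalize, so scale by 1/N. The imaginary parts are
    // numerically negligible when the input bins have conjugate symmetry.
    buf.iter().map(|c| c.re / n as f32).collect()
}
```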
Mathematical Foundation
Binaural Rendering via Convolution
To render a mono source $x(t)$ at position $(\theta, \phi)$, convolve it with the appropriate HRIRs:
$$ \begin{aligned} y_L(t) &= x(t) * h_L(t, \theta, \phi) \\ y_R(t) &= x(t) * h_R(t, \theta, \phi) \end{aligned} $$
where $*$ denotes convolution and $h_L$, $h_R$ are the left and right ear HRIRs.
Time-Domain Convolution
For an HRIR of length $N$ and input signal $x[n]$, the output $y[n]$ at sample $n$ is:
$$ y[n] = \sum_{k=0}^{N-1} x[n-k] \cdot h[k] $$
This is an FIR filter operation with the HRIR as coefficients.
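Written out in Rust, the sum becomes a nested loop over output samples and filter taps. This is an illustrative free function, not the bbx_audio API:

```rust
/// Direct-form FIR convolution: y[n] = sum_{k=0}^{N-1} x[n-k] * h[k].
/// `x` is the input signal, `h` the HRIR; samples before x[0] are taken as zero.
fn convolve(x: &[f32], h: &[f32]) -> Vec<f32> {
    let mut y = vec![0.0f32; x.len() + h.len() - 1];
    for n in 0..y.len() {
        for k in 0..h.len() {
            // x[n - k] contributes only when the index lands inside `x`.
            if n >= k && n - k < x.len() {
                y[n] += x[n - k] * h[k];
            }
        }
    }
    y
}
```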
Complexity Analysis
For each sample:
- Multiplications: $N$ (HRIR length)
- Additions: $N-1$
Total per audio frame of $B$ samples: $$ \text{Operations} = B \times N \times 2 \quad \text{(left + right ears)} $$
Spherical Harmonics Decomposition
For ambisonic signals, we decode to virtual speaker positions and then apply HRTFs. Each virtual speaker's signal is computed by weighting ambisonic channels with spherical harmonic coefficients:
$$ s_i = \sum_{l=0}^{L} \sum_{m=-l}^{l} Y_l^m(\theta_i, \phi_i) \cdot a_{l,m} $$
where:
- $Y_l^m(\theta, \phi)$ are real spherical harmonics (SN3D normalized)
- $a_{l,m}$ is the ambisonic channel for order $l$, degree $m$
- $(\theta_i, \phi_i)$ is the virtual speaker's position
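For first-order ambisonics this decode reduces to a four-term dot product per speaker. A minimal sketch, assuming ACN channel order (W, Y, Z, X), SN3D normalization, and a plain sampling decode with no max-rE or other decoder weighting:

```rust
/// Decode one virtual speaker feed from a first-order ambisonic sample.
/// `wyzx` holds the four channels in ACN order [W, Y, Z, X];
/// `(theta, phi)` is the speaker's azimuth/elevation in radians.
fn decode_foa_speaker(wyzx: [f32; 4], theta: f32, phi: f32) -> f32 {
    let [w, y, z, x] = wyzx;
    w                                   // Y_0^0  = 1
        + y * (phi.cos() * theta.sin()) // Y_1^-1 (Y channel)
        + z * phi.sin()                 // Y_1^0  (Z channel)
        + x * (phi.cos() * theta.cos()) // Y_1^1  (X channel)
}
```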
Implementation in bbx_audio
Virtual Speaker Approach
BinauralDecoderBlock uses a virtual speaker array for HRTF rendering:
- Decode ambisonic input to $N$ virtual speaker signals using SH coefficients
- Convolve each speaker signal with position-specific HRIRs
- Sum all convolved outputs for left and right ears
$$ \begin{aligned} y_L &= \sum_{i=0}^{N-1} s_i * h_{L,i} \\ y_R &= \sum_{i=0}^{N-1} s_i * h_{R,i} \end{aligned} $$
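Combined, the three steps look roughly like the sketch below. The signatures are illustrative, not the actual BinauralDecoderBlock interface, and for brevity the convolution is frame-local rather than carrying history across frames (the circular buffer later in this page handles that):

```rust
/// One frame of virtual-speaker binaural rendering (illustrative sketch).
/// `speaker_feeds[i]` is the decoded signal for speaker i over the frame;
/// `hrirs_l[i]` / `hrirs_r[i]` are that speaker's left/right HRIRs
/// (assumed to be the same length).
fn render_binaural(
    speaker_feeds: &[Vec<f32>],
    hrirs_l: &[Vec<f32>],
    hrirs_r: &[Vec<f32>],
    out_l: &mut [f32],
    out_r: &mut [f32],
) {
    out_l.iter_mut().for_each(|s| *s = 0.0);
    out_r.iter_mut().for_each(|s| *s = 0.0);
    for ((feed, hl), hr) in speaker_feeds.iter().zip(hrirs_l).zip(hrirs_r) {
        for n in 0..out_l.len() {
            // Accumulate each speaker's convolved contribution into both ears.
            for k in 0..hl.len().min(n + 1) {
                out_l[n] += feed[n - k] * hl[k];
                out_r[n] += feed[n - k] * hr[k];
            }
        }
    }
}
```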
HRIR Data
The implementation uses HRIR measurements from the MIT KEMAR database:
- Source: MIT Media Lab KEMAR HRTF Database (Gardner & Martin, 1994)
- Mannequin: KEMAR (Knowles Electronics Manikin for Acoustic Research)
- Length: 256 samples per HRIR
- Positions: Quantized to cardinal directions (front, back, left, right, and 45° diagonals)
Spherical Harmonic Coefficients
For a virtual speaker at azimuth $\theta$ and elevation $\phi$, the real SH coefficients (ACN/SN3D) are:
Order 0: $$ Y_0^0 = 1 $$
Order 1: $$ \begin{aligned} Y_1^{-1} &= \cos\phi \cdot \sin\theta \quad \text{(Y channel)} \\ Y_1^0 &= \sin\phi \quad \text{(Z channel)} \\ Y_1^1 &= \cos\phi \cdot \cos\theta \quad \text{(X channel)} \end{aligned} $$
Order 2: $$ \begin{aligned} Y_2^{-2} &= \sqrt{\frac{3}{4}} \cos^2\phi \cdot \sin(2\theta) \quad \text{(V channel)} \\ Y_2^{-1} &= \sqrt{\frac{3}{4}} \sin(2\phi) \cdot \sin\theta \quad \text{(T channel)} \\ Y_2^0 &= \frac{3\sin^2\phi - 1}{2} \quad \text{(R channel)} \\ Y_2^1 &= \sqrt{\frac{3}{4}} \sin(2\phi) \cdot \cos\theta \quad \text{(S channel)} \\ Y_2^2 &= \sqrt{\frac{3}{4}} \cos^2\phi \cdot \cos(2\theta) \quad \text{(U channel)} \end{aligned} $$
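These nine formulas translate directly into code. A sketch that returns the coefficients in ACN order (W, Y, Z, X, V, T, R, S, U); the function name is illustrative:

```rust
/// Real SN3D spherical-harmonic coefficients up to order 2 for one direction,
/// in ACN order: [W, Y, Z, X, V, T, R, S, U].
/// `theta` is azimuth and `phi` is elevation, both in radians.
fn sh_coefficients_order2(theta: f32, phi: f32) -> [f32; 9] {
    let (sin_t, cos_t) = theta.sin_cos();
    let (sin_p, cos_p) = phi.sin_cos();
    let k = (3.0f32 / 4.0).sqrt(); // sqrt(3/4) = sqrt(3)/2
    [
        1.0,                                     // Y_0^0  (W)
        cos_p * sin_t,                           // Y_1^-1 (Y)
        sin_p,                                   // Y_1^0  (Z)
        cos_p * cos_t,                           // Y_1^1  (X)
        k * cos_p * cos_p * (2.0 * theta).sin(), // Y_2^-2 (V)
        k * (2.0 * phi).sin() * sin_t,           // Y_2^-1 (T)
        (3.0 * sin_p * sin_p - 1.0) / 2.0,       // Y_2^0  (R)
        k * (2.0 * phi).sin() * cos_t,           // Y_2^1  (S)
        k * cos_p * cos_p * (2.0 * theta).cos(), // Y_2^2  (U)
    ]
}
```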
Circular Buffer Convolution
For efficient realtime processing, convolution uses a circular buffer:
```rust
// Store incoming sample
buffer[write_pos] = input_sample;

// Convolve with HRIR
for k in 0..hrir_length {
    let buf_idx = (write_pos + hrir_length - k) % hrir_length;
    output += buffer[buf_idx] * hrir[k];
}

// Advance write position
write_pos = (write_pos + 1) % hrir_length;
```
This achieves $O(N)$ convolution per sample where $N$ is HRIR length.
Decoding Strategies
BinauralDecoderBlock offers two strategies:
Matrix Strategy (Lightweight)
Uses pre-computed ILD-based coefficients without HRTF convolution (sketched after this list):
- Lower CPU usage
- Basic left/right separation
- No ITD or spectral cues
- Sounds may appear "inside the head"
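The exact coefficients are internal to BinauralDecoderBlock, but the idea can be sketched as a level-only first-order mix: W carries the overall level and Y steers it between the ears, with no delays or filtering. This is a conceptual sketch, not the block's actual math:

```rust
/// Level-only (ILD-style) FOA-to-stereo mix for a single sample.
/// Under this page's convention (azimuth positive to the right), the Y
/// channel is positive for sources on the right.
fn matrix_decode_sample(w: f32, y: f32) -> (f32, f32) {
    let left = 0.5 * (w - y);
    let right = 0.5 * (w + y);
    (left, right)
}
```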
HRTF Strategy (Default)
Full HRTF convolution with virtual speakers:
- Higher CPU usage (proportional to HRIR length × speaker count)
- Accurate ITD, ILD, and spectral cues
- Better externalization (sounds appear outside the head)
- More convincing 3D positioning
Virtual Speaker Layouts
Ambisonic Decoding (FOA)
4 virtual speakers at $\pm 45°$ and $\pm 135°$ azimuth:
```text
     Front
       |
FL ----+---- FR   (±45°)
       |
       |
RL ----+---- RR   (±135°)
       |
     Rear
```
Surround Sound (5.1/7.1)
Standard ITU-R speaker positions:
5.1 (ITU-R BS.775-1):
| Channel | Azimuth |
|---|---|
| L/R | $\pm 30°$ |
| C | $0°$ |
| LFE | $0°$ (non-directional) |
| Ls/Rs | $\pm 110°$ |
7.1 (ITU-R BS.2051):
| Channel | Azimuth |
|---|---|
| L/R | $\pm 30°$ |
| C | $0°$ |
| LFE | $0°$ |
| Ls/Rs | $\pm 90°$ (side) |
| Lrs/Rrs | $\pm 150°$ (rear) |
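For reference, the two layouts can be expressed as constant tables. A sketch with illustrative names; azimuths follow this page's convention (negative to the left), and elevation is 0° for every channel:

```rust
/// (channel name, azimuth in degrees) for the ITU layouts above.
/// The LFE channel is non-directional and would bypass HRTF processing.
const ITU_5_1: [(&str, f32); 6] = [
    ("L", -30.0), ("R", 30.0), ("C", 0.0),
    ("LFE", 0.0), ("Ls", -110.0), ("Rs", 110.0),
];

const ITU_7_1: [(&str, f32); 8] = [
    ("L", -30.0), ("R", 30.0), ("C", 0.0), ("LFE", 0.0),
    ("Ls", -90.0), ("Rs", 90.0), ("Lrs", -150.0), ("Rrs", 150.0),
];
```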
Performance Considerations
CPU Cost
HRTF convolution complexity per audio frame:
$$ \text{Operations} = B \times N_{speakers} \times L_{HRIR} \times 2 $$
For a 512-sample buffer with 4-speaker FOA and 256-sample HRIRs: $$ 512 \times 4 \times 256 \times 2 = 1,048,576 \text{ multiply-adds} $$
Memory Usage
- HRIR storage: $N_{speakers} \times 2 \times L_{HRIR} \times \text{sizeof}(f32)$
- Signal buffers: $N_{speakers} \times L_{HRIR} \times \text{sizeof}(f32)$
For 4-speaker FOA with 256-sample HRIRs:
- HRIRs: $4 \times 2 \times 256 \times 4 = 8192$ bytes (8 KB)
- Buffers: $4 \times 256 \times 4 = 4096$ bytes (4 KB)
Realtime Safety
The implementation is fully realtime-safe:
- All buffers pre-allocated at construction
- No allocations during process()
- No locks or system calls
- Circular buffer avoids memory copies
Limitations
HRIR Resolution
The current implementation uses a limited set of HRIR positions. Sounds between measured positions may exhibit less precise localization compared to interpolated or individualized HRTFs.
Head Tracking
Without head tracking, the virtual sound stage rotates with the listener's head. For immersive applications, consider integrating gyroscope data to counter-rotate the soundfield.
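For first-order material, a yaw-only counter-rotation is a 2×2 rotation of the X and Y channels; W and Z are invariant under yaw. A minimal sketch, assuming this page's azimuth convention (positive to the right) and ACN channel order:

```rust
/// Counter-rotate one FOA sample (ACN order [W, Y, Z, X]) for a head yaw of
/// `yaw` radians (positive = head turned right). A world-fixed source at
/// azimuth t appears at relative azimuth t - yaw, which mixes X and Y.
fn counter_rotate_yaw(wyzx: &mut [f32; 4], yaw: f32) {
    let (sin_a, cos_a) = yaw.sin_cos();
    let (y, x) = (wyzx[1], wyzx[3]);
    wyzx[3] = x * cos_a + y * sin_a; // X' = X cos(yaw) + Y sin(yaw)
    wyzx[1] = y * cos_a - x * sin_a; // Y' = Y cos(yaw) - X sin(yaw)
}
```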
Individualization
Generic HRTFs (like KEMAR) work reasonably well for most listeners, but optimal spatial accuracy requires individually measured HRTFs that account for each person's unique ear geometry.
Further Reading
- Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization. MIT Press.
- Zotter, F. & Frank, M. (2019). Ambisonics: A Practical 3D Audio Theory. Springer.
- Wightman, F.L. & Kistler, D.J. (1989). "Headphone simulation of free-field listening." JASA, 85(2), 858-867.
- MIT KEMAR Database: https://sound.media.mit.edu/resources/KEMAR.html
- AES69-2015: AES standard for file exchange - Spatial acoustic data file format (SOFA)