james king demo · audio super-resolution

Audio super-resolution.

A problem I worked on in my final year at Cambridge. Voice codecs throw away the top half of the spectrum. Given only the low half, can a model plausibly reconstruct the rest?

Seven approaches were tested on the same low-rate input. Classical interpolation (linear, cubic spline) stays honest and limp. Flat CNNs learn a crude extrapolation. EDSR, an image super-resolution backbone adapted with 1D convolutions, does better. U-Nets do best on squared error. GANs invent more confident, more plausible highs, but sometimes they invent the wrong ones.
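To make the classical baseline concrete, here is a minimal sketch of why linear interpolation stays "honest and limp". It is not the dissertation code: the test signal (a 440 Hz tone plus a 6 kHz tone), the 16 kHz sample rate, the brick-wall low-pass, and the helper names `make_low_rate`, `upsample_linear`, and `high_band_energy` are all assumptions chosen for illustration.

```python
import numpy as np

def make_low_rate(x, factor):
    """Simulate the low-rate input (assumed recipe): brick-wall low-pass
    in the FFT domain, then keep every `factor`-th sample."""
    spectrum = np.fft.rfft(x)
    cutoff = len(spectrum) // factor   # bins at or above this are discarded
    spectrum[cutoff:] = 0.0
    return np.fft.irfft(spectrum, n=len(x))[::factor]

def upsample_linear(low, factor):
    """Classical baseline: linear interpolation back to the original rate."""
    t_low = np.arange(len(low)) * factor
    t_out = np.arange(len(low) * factor)
    return np.interp(t_out, t_low, low)

def high_band_energy(x):
    """Fraction of total spectral energy in the upper half of the spectrum."""
    power = np.abs(np.fft.rfft(x)) ** 2
    return float(power[len(power) // 2:].sum() / power.sum())

# Hypothetical one-second test signal at 16 kHz: one low tone, one high tone.
sr = 16000
t = np.arange(sr) / sr
truth = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)

low = make_low_rate(truth, factor=2)    # the 6 kHz tone is gone
pred = upsample_linear(low, factor=2)   # back at 16 kHz, highs still missing

print("high-band energy, ground truth:", high_band_energy(truth))
print("high-band energy, linear interp:", high_band_energy(pred))
```

In this toy case the interpolated signal has the right length and tracks the low tone closely, but almost none of its energy sits in the upper half of the spectrum. Interpolation does not invent anything that was not in the input, which is exactly what the learned models are trying to improve on.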

Which model you prefer depends on whether you want accuracy or plausibility. The full write-up is on arXiv, and the original thesis PDF is here too.

Figure: ground truth spectrogram.

Pick an upsampling factor, then a method.

Figure: model output (prediction) spectrogram.

These are the same samples I used in the dissertation. The spectrograms show what each method reconstructed in the upper frequencies. If it sounds sharper than the low-rate input but has the wrong detail, that is usually a GAN. If it sounds soft but correct, that is usually a U-Net.
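The accuracy-versus-plausibility trade-off can be sketched with a toy example. Everything here is an assumption for illustration: `soft` and `confident` are hand-built stand-ins rather than real model outputs, and the log-spectral distance (computed over non-overlapping rectangular frames) is one common way to score spectral plausibility, not necessarily the metric used in the dissertation.

```python
import numpy as np

def mse(a, b):
    """Mean squared error in the time domain."""
    return float(np.mean((a - b) ** 2))

def log_spectral_distance(a, b, frame=800, eps=1e-8):
    """Simplified log-spectral distance: per-frame RMS difference of
    log10 power spectra, averaged over non-overlapping frames."""
    def log_power(x):
        frames = x[: len(x) // frame * frame].reshape(-1, frame)
        return np.log10(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + eps)
    diff = log_power(a) - log_power(b)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

# Hypothetical one-second signal at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 440 * t)
high_tone = 0.5 * np.sin(2 * np.pi * 6000 * t)

truth = low_tone + high_tone
soft = low_tone                   # stand-in for "soft but correct": no highs at all
confident = low_tone - high_tone  # stand-in for "sharp but wrong": highs in anti-phase

for name, pred in [("soft", soft), ("confident", confident)]:
    print(name, "MSE:", mse(truth, pred),
          "LSD:", log_spectral_distance(truth, pred))
```

Under these assumptions the soft prediction wins on squared error, because leaving the highs out costs less than getting their phase wrong. The confident prediction wins on spectral distance, because its magnitude spectrum matches the ground truth. Neither score is wrong; they reward different things.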