Audio Super-resolution with Robust Speech Representation Learning of Masked Autoencoder

 

Seung-Bin Kim, Sang-Hoon Lee, Ha-Yeong Choi, and Seong-Whan Lee

Abstract

This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Masked autoencoders have recently been shown to learn robust audio representations for speech classification tasks. Following these studies, we leverage such representations and investigate several masking strategies for neural audio super-resolution. We propose an upper-band masking strategy with initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust to input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system that synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models.
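To make the two masking ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a spectrogram stored as a list of frames with bins ordered from low to high frequency, uses a scalar `mask_token` as a stand-in for whatever token the model actually uses, and treats mix-ratio masking as drawing one masking ratio per example from a range. The function names and the range `[0.5, 0.9]` are hypothetical.

```python
import random


def upper_band_mask(spec, mask_ratio, mask_token=0.0):
    """Replace the top `mask_ratio` fraction of frequency bins with `mask_token`.

    `spec` is a list of frames; each frame lists bin values from low to
    high frequency. A new spectrogram is returned and `spec` is unchanged.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    n_bins = len(spec[0])
    n_keep = n_bins - round(n_bins * mask_ratio)  # low-band bins left visible
    return [frame[:n_keep] + [mask_token] * (n_bins - n_keep) for frame in spec]


def sample_mix_ratio(rng, low=0.5, high=0.9):
    """Assumed form of mix-ratio masking: one uniform ratio per example."""
    return rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    spec = [[float(b) for b in range(1, 9)] for _ in range(2)]  # 2 frames, 8 bins
    ratio = sample_mix_ratio(rng)
    print(ratio, upper_band_mask(spec, ratio))
```

Varying the ratio per example exposes the model to different amounts of missing upper band, which is one plausible reading of how a single model could handle inputs with different sampling rates.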

 

 

Official implementation of Fre-Painter: https://github.com/FrePainter/code

We compared Fre-Painter with the following audio super-resolution models:

1. VoiceFixer [Code][Demo page]

2. NU-Wave 2 [Code][Demo page]

3. UDM+ [Code][Demo page]

Audible Frequency

For more accurate listening, we recommend running a simple audible frequency test first.

WARNING: HIGH-FREQUENCY SAMPLES PLAYED AT LOUD VOLUME MAY BE PAINFUL.

2000 Hz       22000 Hz
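If the embedded players are unavailable, comparable test tones can be generated locally. The sketch below uses only the Python standard library. The 48 kHz sampling rate, one-second duration, and low amplitude of 0.1 are assumptions chosen for safe listening, not settings taken from this page.

```python
import math
import struct
import wave


def sine_tone(freq_hz, sample_rate=48000, duration_s=1.0, amplitude=0.1):
    """Return a sine tone as a list of 16-bit PCM sample values."""
    if freq_hz >= sample_rate / 2:
        raise ValueError("frequency must be below the Nyquist frequency")
    n_samples = int(sample_rate * duration_s)
    peak = int(amplitude * 32767)  # keep the level low to protect hearing
    return [
        round(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_samples)
    ]


def write_wav(path, samples, sample_rate=48000):
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

For example, `write_wav("tone_2000hz.wav", sine_tone(2000))` and `write_wav("tone_22000hz.wav", sine_tone(22000))` would produce the two tones named above. Start playback at a low volume; if the 22000 Hz tone is inaudible, differences in the uppermost band of the demo samples may be hard to judge on that setup.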


                                                                                                                                                                                                                                                                                                                                          

Fre-Painter

For a fair comparison, we used the official implementations and provided checkpoints of the comparison models. The demo samples were randomly selected from the test set. Fre-Painter (fixed) is a variant trained with fixed masking ratios for input sampling rates of 16 kHz and 24 kHz. Note that VoiceFixer was trained with a target sampling rate of 44.1 kHz, while NU-Wave 2 and UDM+ were trained with input sampling rates ranging from 6 kHz to 48 kHz.
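As a rough illustration of how an input sampling rate relates to the portion of the spectrum that must be reconstructed, the sketch below computes the fraction of the target bandwidth lying above the input's Nyquist frequency. The 48 kHz target rate is an assumption, and the page does not state whether the fixed masking ratios were derived this way. The function name is hypothetical.

```python
def missing_band_ratio(input_sr, target_sr=48000):
    """Fraction of the target bandwidth above the input's Nyquist frequency.

    The input covers frequencies up to input_sr / 2 and the target up to
    target_sr / 2, so the missing share is 1 - input_sr / target_sr.
    """
    if not 0 < input_sr <= target_sr:
        raise ValueError("input_sr must be in (0, target_sr]")
    return 1.0 - input_sr / target_sr


if __name__ == "__main__":
    for sr in (16000, 24000):
        # Print the input rate, its Nyquist frequency, and the missing share.
        print(sr, sr // 2, round(missing_band_ratio(sr), 3))
```

Under this assumption, a 16 kHz input leaves about two thirds of the target band to be generated and a 24 kHz input leaves half.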

Audio: p360_418

Sampling rate | GT | Input | VoiceFixer | NU-Wave 2 | UDM+ | Fre-Painter | Fre-Painter (fixed)

Audio: p361_340

Sampling rate | GT | Input | VoiceFixer | NU-Wave 2 | UDM+ | Fre-Painter | Fre-Painter (fixed)

Audio: p374_137

Sampling rate | GT | Input | VoiceFixer | NU-Wave 2 | UDM+ | Fre-Painter | Fre-Painter (fixed)

Audio: s5_156

Sampling rate | GT | Input | VoiceFixer | NU-Wave 2 | UDM+ | Fre-Painter | Fre-Painter (fixed)

Ablation Study

Audio: p241_071

Sampling rate | GT | Input | Fre-Painter | w/o pre-training | w/o initialization | w/o masking

Audio: p253_301

Sampling rate | GT | Input | Fre-Painter | w/o pre-training | w/o initialization | w/o masking

Audio: p256_133

Sampling rate | GT | Input | Fre-Painter | w/o pre-training | w/o initialization | w/o masking

Audio: p266_151

Sampling rate | GT | Input | Fre-Painter | w/o pre-training | w/o initialization | w/o masking

Audio: p336_087

Sampling rate | GT | Input | Fre-Painter | w/o pre-training | w/o initialization | w/o masking

Speech Synthesis

Audio: p362_029

GT | Input | HiFi-GAN | HiFi-GAN + NU-Wave 2 | HiFi-GAN + UDM+ | Fre-Painter

Audio: p376_020

GT | Input | HiFi-GAN | HiFi-GAN + NU-Wave 2 | HiFi-GAN + UDM+ | Fre-Painter

Audio: p361_013

GT | Input | HiFi-GAN | HiFi-GAN + NU-Wave 2 | HiFi-GAN + UDM+ | Fre-Painter