Seung-Bin Kim, Sang-Hoon Lee, Ha-Yeong Choi, and Seong-Whan Lee
This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Recently, masked autoencoders have been found to be beneficial in learning robust representations of audio for speech classification tasks. Following these studies, we leverage these representations and investigate several masking strategies for neural audio super-resolution. In this paper, we propose an upper-band masking strategy with the initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust for input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system, which synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models.
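The upper-band masking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `upper_band_mask`, the linearly spaced frequency bins, the constant `mask_value` standing in for the mask token, and the 16 kHz and 48 kHz rates in the usage lines are all assumptions made for illustration.

```python
import numpy as np


def upper_band_mask(spec, input_sr, target_sr, mask_value=0.0):
    """Overwrite the bins above input_sr / 2 with mask_value.

    spec is assumed to be (n_bins, n_frames) with linearly spaced
    frequency bins covering 0 .. target_sr / 2 (an assumption of this sketch).
    mask_value is a hypothetical stand-in for the mask token.
    """
    if not 0 < input_sr <= target_sr:
        raise ValueError("input_sr must be in (0, target_sr]")
    n_bins = spec.shape[0]
    # Fraction of the band that the low-resolution input can actually contain.
    n_keep = int(round(n_bins * input_sr / target_sr))
    masked = np.array(spec, dtype=float, copy=True)
    masked[n_keep:, :] = mask_value  # hide the upper band from the model
    return masked, n_keep


if __name__ == "__main__":
    # Illustrative rates only; the source text does not fix these values here.
    spec = np.ones((96, 4))
    masked, n_keep = upper_band_mask(spec, input_sr=16000, target_sr=48000, mask_value=-1.0)
    print(n_keep, masked[n_keep - 1, 0], masked[n_keep, 0])
```

Calling the function with a different `input_sr` for each training example is one plausible way to expose a model to several masking ratios, though whether this matches the mix-ratio strategy in detail is not stated in the text above.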
Official implementation of Fre-Painter: https://github.com/FrePainter/code
We compared Fre-Painter with the following audio super-resolution models:
1. VoiceFixer [Code][Demo page]
For more accurate listening, it is recommended to conduct a simple audible frequency test.
WARNING: HIGH-FREQUENCY SAMPLES AT LOUD VOLUME MAY BE PAINFUL TO HEAR.
Test tones: 2000 Hz, 22000 Hz
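For readers who want to generate comparable test tones locally, the following is a minimal sketch using NumPy and the standard-library `wave` module. The helper names `sine_tone` and `write_wav`, the 48 kHz sample rate, the low default amplitude, and the output file names are assumptions for illustration, not part of the original page.

```python
import wave

import numpy as np


def sine_tone(freq_hz, duration_s=1.0, sample_rate=48000, amplitude=0.1):
    """Return a mono sine tone as floats in [-amplitude, amplitude]."""
    if freq_hz >= sample_rate / 2:
        raise ValueError("freq_hz must be below the Nyquist frequency")
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq_hz * t)


def write_wav(path, samples, sample_rate=48000):
    """Write float samples in [-1, 1] as 16-bit little-endian PCM."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


if __name__ == "__main__":
    # Keep the amplitude low: loud high-frequency tones can be painful.
    for freq in (2000, 22000):
        write_wav(f"tone_{freq}hz.wav", sine_tone(freq))
```

A 22000 Hz tone requires a sample rate above 44 kHz, which is why the sketch defaults to 48 kHz.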
For a fair comparison, we used the official implementations and provided checkpoints of the comparison models. The demo samples were randomly selected from the test set. Fre-Painter (fixed) is a variant trained with fixed masking ratios for input sampling rates of 16 kHz and 24 kHz. Please note that VoiceFixer was trained with a target sampling rate of 44.1 kHz, while NU-Wave 2 and UDM+ were trained with input sampling rates ranging from 6 kHz to 48 kHz.
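To make concrete what an "input sampling rate" of 16 kHz or 24 kHz implies for the content of a signal, here is a small sketch that removes everything at or above half the input sampling rate from a waveform stored at a higher rate. The function name `band_limit` and the FFT-based brick-wall filter are simplifying assumptions; this is not how the demo inputs were prepared, and a real pipeline would typically use a proper resampling filter.

```python
import numpy as np


def band_limit(wave, sample_rate, input_sr):
    """Zero all spectral content at or above input_sr / 2 (brick-wall low-pass).

    The output keeps the original length and sample_rate, but carries only
    the band that a signal sampled at input_sr could represent.
    """
    spectrum = np.fft.rfft(wave)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)
    spectrum[freqs >= input_sr / 2] = 0.0
    return np.fft.irfft(spectrum, n=len(wave))


if __name__ == "__main__":
    sr = 48000  # illustrative storage rate, an assumption of this sketch
    t = np.arange(sr) / sr
    mixed = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 12000 * t)
    limited = band_limit(mixed, sr, input_sr=16000)
    # The 12 kHz component lies above 8 kHz and is removed; 1 kHz remains.
    print(float(np.sqrt(np.mean(limited ** 2))))
```

A super-resolution model is then asked to restore the band that such an operation removes.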
Audio | Samples |
---|---|
p360_418 | GT, Input, VoiceFixer, NU-Wave 2, UDM+, Fre-Painter, Fre-Painter (fixed) |
p361_340 | GT, Input, VoiceFixer, NU-Wave 2, UDM+, Fre-Painter, Fre-Painter (fixed) |
p374_137 | GT, Input, VoiceFixer, NU-Wave 2, UDM+, Fre-Painter, Fre-Painter (fixed) |
s5_156 | GT, Input, VoiceFixer, NU-Wave 2, UDM+, Fre-Painter, Fre-Painter (fixed) |
Audio | Samples |
---|---|
p241_071 | GT, Input, Fre-Painter, w/o pre-training, w/o initialization, w/o masking |
p253_301 | GT, Input, Fre-Painter, w/o pre-training, w/o initialization, w/o masking |
p256_133 | GT, Input, Fre-Painter, w/o pre-training, w/o initialization, w/o masking |
p266_151 | GT, Input, Fre-Painter, w/o pre-training, w/o initialization, w/o masking |
p336_087 | GT, Input, Fre-Painter, w/o pre-training, w/o initialization, w/o masking |