Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
Abstract
Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the audio effects applied to a reference recording onto a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all parameter configurations as equally likely and relies solely on the embedding space, which can lead to unrealistic or biased results. We address this limitation by introducing a Gaussian prior over the parameter space, derived from the DiffVox vocal preset dataset. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations of vocal effects transfer on the MedleyDB dataset show significant improvements across metrics over baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces parameter mean squared error by up to 33% and matches the reference style more closely. A subjective evaluation with 16 participants confirms our method's superiority, especially in limited-data regimes. This work demonstrates how incorporating prior knowledge at inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
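The abstract describes augmenting ST-ITO's embedding-distance objective with a Gaussian prior over effect parameters, turning the optimisation into maximum-a-posteriori estimation. A minimal sketch of that idea follows, with toy stand-ins: `embed` is a hypothetical placeholder for the composition of the effects chain and the style encoder, and the prior mean and variances would in practice be fitted to a preset dataset such as DiffVox.

```python
import numpy as np

D = 4                            # number of effect parameters (toy size)
mu = np.zeros(D)                 # prior mean (fitted to a preset dataset in practice)
sigma2 = np.ones(D)              # prior variances (diagonal Gaussian assumed here)
W = np.arange(1, D + 1, dtype=float)

def embed(theta):
    # Toy "style embedding" of the rendered audio: a fixed linear map of the
    # parameters. The real system embeds the processed audio signal.
    return W * theta

theta_ref = np.array([0.5, -0.3, 0.2, 0.1])
z_ref = embed(theta_ref)         # style embedding of the reference audio

def neg_log_posterior(theta, lam=1.0):
    # Negative log-posterior = embedding-distance ("style") term
    # + Gaussian prior penalty; lam trades the two off.
    style = np.sum((embed(theta) - z_ref) ** 2)
    prior = np.sum((theta - mu) ** 2 / sigma2)
    return style + lam * prior

# Plain gradient descent on the MAP objective (the real system would
# differentiate through the effects chain and encoder instead).
theta = np.zeros(D)
for _ in range(500):
    grad = 2 * W * (W * theta - z_ref) + 2 * (theta - mu) / sigma2
    theta -= 0.01 * grad
```

Relative to uncalibrated ST-ITO (the style term alone), the prior term shrinks the solution toward statistically plausible presets, which is the calibration effect the abstract reports.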
Community
A calibrated ST-ITO for vocal effects style transfer.
Similar papers recommended by the Semantic Scholar API:
- DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions (2025)
- Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches (2025)
- ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer (2025)
- Towards Generalizability to Tone and Content Variations in the Transcription of Amplifier Rendered Electric Guitar Audio (2025)
- DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers (2025)
- The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis (2025)
- AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation (2025)