DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

Jingqi Tian1, Yiheng Du2, Haoji Zhang1, Yuji Wang1, Isaac Ning Lee1, Xulong Bai1, Tianrui Zhu1, Jingxuan Niu1, Yansong Tang1✉
1Tsinghua University, 2Peking University.
✉ Corresponding author.
[Figure: method comparison]
For audio disentanglement, (a–b) rely on learned queries or K-nearest-class features, while (c) grounds audio queries in a prototype memory bank with contrastive refinement. For audio–visual alignment, (d–e) treat audio as a fixed or gated condition, while (f) uses delayed dual cross-attention for more robust multimodal alignment.
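To make the prototype-grounding idea in (c) concrete, below is a minimal PyTorch-style sketch: learnable queries first pool semantics from the audio features, then cross-attend into a fixed-size prototype memory bank so each query is anchored to a structured semantic space. The class name, query/prototype counts, and the two-step attention layout are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of prototype-grounded audio queries (hypothetical names,
# not the authors' released code).
import torch
import torch.nn as nn

class PrototypeAudioQuery(nn.Module):
    def __init__(self, dim=256, n_queries=4, n_prototypes=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))    # learnable audio queries
        self.memory = nn.Parameter(torch.randn(n_prototypes, dim))  # prototype memory bank
        self.audio_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proto_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, audio_feats):  # audio_feats: (B, T, dim)
        B = audio_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Step 1: queries gather semantics from the audio features.
        q, _ = self.audio_attn(q, audio_feats, audio_feats)
        # Step 2: queries are re-expressed over the prototype bank,
        # anchoring them in a shared, structured semantic space.
        mem = self.memory.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.proto_attn(q, mem, mem)
        return q  # (B, n_queries, dim) disentangled audio queries
```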

TL;DR

We propose DDAVS, an audio-visual segmentation framework that disentangles audio semantics and performs delayed bidirectional modality alignment to robustly localize sounding objects at the pixel level. DDAVS introduces an Audio Query Module with a prototype memory bank, a contrastive optimization module, and a multi-stage Audio-Visual Alignment Module, achieving state-of-the-art performance on AVS-Objects and VPO benchmarks, especially in challenging multi-source, subtle, distant, and off-screen scenarios.

[Figure: teaser]
DDAVS consistently outperforms previous approaches in challenging scenarios.

Abstract

Audio–Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio–visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio–visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio–visual segmentation conditions.
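As one plausible instantiation of the contrastive optimization mentioned above, the sketch below applies a symmetric InfoNCE-style loss between pooled queries from the original and augmented waveforms: each clip's two views are pulled together while other clips in the batch are pushed apart. The helper name and temperature are hypothetical; the paper's exact objective may differ.

```python
# Illustrative InfoNCE-style contrastive objective (not the paper's exact loss).
import torch
import torch.nn.functional as F

def contrastive_loss(q_orig, q_aug, temperature=0.07):
    """q_orig, q_aug: (B, dim) pooled audio queries from two views of each clip."""
    z1 = F.normalize(q_orig, dim=-1)
    z2 = F.normalize(q_aug, dim=-1)
    logits = z1 @ z2.t() / temperature  # (B, B) cosine-similarity logits
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric cross-entropy: each clip should match its own augmented view.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```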

Pipeline

[Figure: framework overview]
Overview of the DDAVS framework. (a) The Audio Query Module (AQM) encodes original and augmented waveforms into disentangled semantic queries anchored to a prototype memory bank. (b) The Contrastive Optimization Module (COM) enhances query robustness through contrastive learning and is used only during training. (c) The Audio-Visual Alignment Module (AVAM) fuses audio queries with visual features via stacked alignment blocks, and a lightweight decoder outputs the sound-conditioned segmentation masks.
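A minimal sketch of how delayed bidirectional interaction could be stacked, assuming PyTorch-style modules: early blocks refine each modality independently, and dual cross-attention (each modality querying the other) is enabled only from a chosen delay depth onward. The AlignmentBlock structure and the delay of two blocks are assumptions inferred from the caption, not the released code.

```python
# Sketch of delayed dual cross-attention alignment (hypothetical structure).
import torch
import torch.nn as nn

class AlignmentBlock(nn.Module):
    def __init__(self, dim=256, n_heads=8, cross=True):
        super().__init__()
        self.cross = cross
        self.v_self = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.a_self = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        if cross:
            self.a2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.v2a = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, vis, aud):  # vis: (B, HW, dim), aud: (B, Q, dim)
        # Each modality first refines itself.
        vis = vis + self.v_self(vis, vis, vis)[0]
        aud = aud + self.a_self(aud, aud, aud)[0]
        if self.cross:
            # Dual (bidirectional) cross-attention from the same pre-update states.
            vis_new = vis + self.a2v(vis, aud, aud)[0]
            aud_new = aud + self.v2a(aud, vis, vis)[0]
            vis, aud = vis_new, aud_new
        return vis, aud

# Delayed interaction: the first blocks run without cross-attention.
blocks = nn.ModuleList(
    [AlignmentBlock(cross=(i >= 2)) for i in range(4)]  # illustrative delay of 2
)

def align(vis, aud):
    for blk in blocks:
        vis, aud = blk(vis, aud)
    return vis, aud
```

Deferring the cross-modal exchange gives each stream a chance to stabilize its own representation before alignment, which is one way to read the "delayed" design in the caption.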

Results

[Figures: quantitative results on the AVS-Objects and VPO benchmarks]

Ablation Studies

[Figures: ablation studies]

Case Study

[Figure: case study]
Case study of DDAVS on multi-class tasks.