SOTAVerified

audio-visual learning

Papers

Showing 1–38 of 38 papers

Title | Status | Hype
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment | Code | 1
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Code | 1
Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration | Code | 1
Enhancing Sound Source Localization via False Negative Elimination | Code | 1
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | Code | 1
Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline | Code | 1
Can CLIP Help Sound Source Localization? | Code | 1
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models | Code | 1
Class-Incremental Grouping Network for Continual Audio-Visual Learning | Code | 1
A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition | Code | 1
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser | Code | 1
AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation | Code | 1
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation | Code | 1
UAVM: Towards Unifying Audio and Visual Models | Code | 1
Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection | Code | 1
Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 1
Cascaded Multilingual Audio-Visual Learning from Videos | Code | 1
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning | Code | 1
Can audio-visual integration strengthen robustness under multimodal attacks? | Code | 1
Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework |  | 0
Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives |  | 0
Unveiling Visual Biases in Audio-Visual Localization Benchmarks |  | 0
Sequential Contrastive Audio-Visual Learning |  | 0
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers | Code | 0
Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization |  | 0
Boosting Audio-visual Zero-shot Learning with Large Language Models | Code | 0
Deep Video Inpainting Guided by Audio-Visual Self-Supervision | Code | 0
Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning |  | 0
RealImpact: A Dataset of Impact Sound Fields for Real Objects |  | 0
Versatile audio-visual learning for emotion recognition |  | 0
Revisiting Pre-training in Audio-Visual Learning | Code | 0
Object Segmentation with Audio Context |  | 0
Learning in Audio-visual Context: A Review, Analysis, and New Perspective |  | 0
Few-Shot Audio-Visual Learning of Environment Acoustics |  | 0
Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching | Code | 0
Telling Left from Right: Learning Spatial Correspondence of Sight and Sound |  | 0
Deep Audio-Visual Learning: A Survey |  | 0
Audio-Visual Embedding for Cross-Modal MusicVideo Retrieval through Supervised Deep CCA |  | 0

No leaderboard results yet.