| SE4Lip: Speech-Lip Encoder for Talking Head Synthesis to Solve Phoneme-Viseme Alignment Ambiguity | Apr 8, 2025 | 3DGS, cross-modal alignment | Unverified | 0 |
| Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification | Apr 8, 2025 | cross-modal alignment, Image Classification | Code Available | 0 |
| Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Apr 3, 2025 | 3D Object Detection, cross-modal alignment | Code Available | 1 |
| FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs | Apr 2, 2025 | cross-modal alignment, Cross-Modal Retrieval | Unverified | 0 |
| Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval | Apr 2, 2025 | cross-modal alignment, Retrieval | Unverified | 0 |
| COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking | Apr 2, 2025 | cross-modal alignment, Object | Unverified | 0 |
| DF-Calib: Targetless LiDAR-Camera Calibration via Depth Flow | Apr 2, 2025 | Autonomous Driving, Camera Calibration | Unverified | 0 |
| SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | Apr 1, 2025 | cross-modal alignment, Question Answering | Unverified | 0 |
| CADFormer: Fine-Grained Cross-modal Alignment and Decoding Transformer for Referring Remote Sensing Image Segmentation | Mar 30, 2025 | cross-modal alignment, Image Segmentation | Unverified | 0 |
| BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation | Mar 30, 2025 | cross-modal alignment, Image Segmentation | Code Available | 1 |