| ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset | Mar 24, 2025 | Activity Recognition, Temporal Localization | Code Available | 0 |
| Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Mar 17, 2025 | Data Interaction, Scene Understanding | Code Available | 2 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Mar 17, 2025 | Grounded Video Question Answering, Question Answering | Code Available | 3 |
| Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds | Mar 17, 2025 | Temporal Localization | Code Available | 0 |
| Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | Mar 14, 2025 | Denoising, Dense Video Captioning | —Unverified | 0 |
| Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization | Mar 12, 2025 | Temporal Localization, Video Understanding | —Unverified | 0 |
| Towards Fine-Grained Video Question Answering | Mar 10, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Mar 9, 2025 | Action Localization, Boundary Detection | Code Available | 1 |
| Weakly Supervised Multiple Instance Learning for Whale Call Detection and Temporal Localization in Long-Duration Passive Acoustic Monitoring | Feb 28, 2025 | Multiple Instance Learning, Temporal Localization | Code Available | 0 |
| Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding | Feb 16, 2025 | Attribute, Object | Code Available | 1 |