| DisTime: Distribution-based Time Representation for Video Large Language Models | May 30, 2025 | Temporal LocalizationVideo Understanding | CodeCode Available | 1 |
| LocVTP: Video-Text Pre-training for Temporal Localization | Jul 21, 2022 | RetrievalTemporal Localization | CodeCode Available | 1 |
| Human-centric Spatio-Temporal Video Grounding With Visual Transformers | Nov 10, 2020 | Referring ExpressionSentence | CodeCode Available | 1 |
| Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time | Jul 1, 2024 | AUDIO-VISUAL QUESTION ANSWERING (MUSIC-AVQA-v2.0)Fact Checking | CodeCode Available | 1 |
| FineAction: A Fine-Grained Video Dataset for Temporal Action Localization | May 24, 2021 | Action DetectionAction Localization | CodeCode Available | 1 |
| Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | May 13, 2020 | Image CaptioningMulti-Label Classification | CodeCode Available | 1 |
| End-to-End Semi-Supervised Learning for Video Action Detection | Mar 8, 2022 | Action DetectionClassification Consistency | CodeCode Available | 1 |
| Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding | Feb 16, 2025 | AttributeObject | CodeCode Available | 1 |
| Audio-Visual Event Localization in Unconstrained Videos | Mar 23, 2018 | audio-visual event localizationTemporal Localization | CodeCode Available | 1 |
| Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos | Jan 25, 2022 | Natural Language QueriesSentence | CodeCode Available | 1 |