| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Mar 17, 2025 | Grounded Video Question Answering, Question Answering | Code Available | 3 |
| VideoMolmo: Spatio-Temporal Grounding Meets Pointing | Jun 5, 2025 | Autonomous Driving, Autonomous Navigation | Code Available | 2 |
| MINERVA: Evaluating Complex Video Reasoning | May 1, 2025 | Benchmarking, Temporal Localization | Code Available | 2 |
| Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Mar 17, 2025 | Data Interaction, Scene Understanding | Code Available | 2 |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Jan 14, 2025 | Feature Compression, Language Modeling | Code Available | 2 |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | Nov 27, 2024 | Temporal Localization, Video Understanding | Code Available | 2 |
| Number it: Temporal Grounding Videos like Flipping Manga | Nov 15, 2024 | Highlight Detection, Moment Retrieval | Code Available | 2 |
| OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding | Jun 11, 2024 | Action Understanding, Diversity | Code Available | 2 |
| LITA: Language Instructed Temporal-Localization Assistant | Mar 27, 2024 | Instruction Following, Temporal Localization | Code Available | 2 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| Egocentric Video-Language Pretraining | Jun 3, 2022 | Action Recognition, Contrastive Learning | Code Available | 2 |
| DisTime: Distribution-based Time Representation for Video Large Language Models | May 30, 2025 | Temporal Localization, Video Understanding | Code Available | 1 |
| TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Mar 9, 2025 | Action Localization, Boundary Detection | Code Available | 1 |
| Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding | Feb 16, 2025 | Attribute, Object | Code Available | 1 |
| Training-free Video Temporal Grounding using Large-scale Pre-trained Models | Aug 29, 2024 | Temporal Localization | Code Available | 1 |
| Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time | Jul 1, 2024 | AUDIO-VISUAL QUESTION ANSWERING (MUSIC-AVQA-v2.0), Fact Checking | Code Available | 1 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | May 11, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| Unsupervised classification to improve the quality of a bird song recording dataset | Feb 15, 2023 | Sound Classification, Temporal Localization | Code Available | 1 |
| Multi-Task Learning of Object State Changes from Uncurated Videos | Nov 24, 2022 | Multi-Task Learning, Object | Code Available | 1 |
| LocVTP: Video-Text Pre-training for Temporal Localization | Jul 21, 2022 | Retrieval, Temporal Localization | Code Available | 1 |
| Stargazer: A transformer-based driver action detection system for intelligent transportation | Jun 1, 2022 | Action Detection, Action Recognition | Code Available | 1 |
| Temporally Precise Action Spotting in Soccer Videos Using Dense Detection Anchors | May 20, 2022 | Action Spotting, Data Augmentation | Code Available | 1 |
| TubeDETR: Spatio-Temporal Video Grounding with Transformers | Mar 30, 2022 | Decoder, Language-Based Temporal Localization | Code Available | 1 |
| Unsupervised Pre-training for Temporal Action Localization Tasks | Mar 25, 2022 | Action Localization, Contrastive Learning | Code Available | 1 |
| OpenTAL: Towards Open Set Temporal Action Localization | Mar 10, 2022 | Action Classification, Action Localization | Code Available | 1 |