InternVideo2: Scaling Foundation Models for Multimodal Video Understanding Mar 22, 2024 Action Classification Action Recognition
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding Jan 14, 2025 Embodied Question Answering Hallucination
SnAG: Scalable and Accurate Video Grounding Apr 2, 2024 Video Grounding Video Understanding
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models Nov 22, 2023 Benchmarking Phrase Grounding
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency Jun 2, 2025 reinforcement-learning Reinforcement Learning
Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval Jul 21, 2024 General Knowledge Highlight Detection
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection Mar 23, 2022 Decoder Highlight Detection
VTimeLLM: Empower LLM to Grasp Video Moments Nov 30, 2023 Dense Video Captioning Temporal Relation Extraction
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection Mar 24, 2023 Highlight Detection Moment Retrieval
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM Mar 17, 2025 Video Grounding
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding Jan 14, 2025 Feature Compression Language Modeling
Context-Guided Spatio-Temporal Video Grounding Jan 3, 2024 Object Spatio-Temporal Video Grounding
TubeDETR: Spatio-Temporal Video Grounding with Transformers Mar 30, 2022 Decoder Language-Based Temporal Localization
HawkEye: Training Video-Text LLMs for Grounding Text in Videos Mar 15, 2024 Video Grounding Video Question Answering
TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos Mar 9, 2025 Action Localization Boundary Detection
Text-Visual Prompting for Efficient 2D Temporal Video Grounding Mar 9, 2023 Sentence Video Grounding
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning Jan 12, 2025 Dense Video Captioning Video Captioning
Detecting Moments and Highlights in Videos via Natural Language Queries Dec 1, 2021 Decoder Moment Retrieval
Human-centric Spatio-Temporal Video Grounding With Visual Transformers Nov 10, 2020 Referring Expression Sentence
Grounded Question-Answering in Long Egocentric Videos Dec 11, 2023 Video Grounding Video Question Answering
Dense Regression Network for Video Grounding Apr 7, 2020 Natural Language Moment Retrieval Natural Language Queries
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences Jan 19, 2020 Form Object
Weakly-Supervised Temporal Article Grounding Oct 22, 2022 All Articles
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding Feb 16, 2025 Attribute Object
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding Sep 27, 2022 Decoder Spatio-Temporal Video Grounding
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding Sep 22, 2022 Contrastive Learning Video Grounding
Knowing Where to Focus: Event-aware Transformer for Video Grounding Aug 14, 2023 Moment Queries Sentence
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer Jul 6, 2021 Image Retrieval Knowledge Distillation
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding Sep 10, 2021 Metric Learning Representation Learning
VLG-Net: Video-Language Graph Matching Network for Video Grounding Nov 19, 2020 Graph Matching Moment Retrieval
Object-Shot Enhanced Grounding Network for Egocentric Video May 7, 2025 Video Grounding
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding Apr 18, 2022 Action Recognition Animal Action Recognition
Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos Jan 25, 2022 Natural Language Queries Sentence
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection Nov 28, 2023 Contrastive Learning Highlight Detection
Can I Trust Your Answer? Visually Grounded Video Question Answering Sep 4, 2023 Grounded Video Question Answering Question Answering
Localizing Moments in Long Video Via Multimodal Guidance Feb 26, 2023 Natural Language Moment Retrieval Natural Language Visual Grounding
Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding Dec 27, 2023 Sentence Temporal Sentence Grounding
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos May 22, 2025 Natural Language Moment Retrieval Natural Language Queries
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format Nov 27, 2024 Dense Video Captioning Grounded Video Question Answering
OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding Mar 13, 2025 Object Video Grounding
Boundary-Denoising for Video Activity Localization Apr 6, 2023 Action Detection Decoder
Consistency of Compositional Generalization across Multiple Levels Dec 18, 2024 Meta-Learning Question Answering
Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding Sep 26, 2022 Benchmarking Natural Language Queries
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding Mar 21, 2024 Video Grounding
Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding Sep 12, 2023 Sentence text similarity
Artemis: Towards Referential Understanding in Complex Videos Jun 1, 2024 Text Summarization Video Grounding
Interventional Video Grounding with Dual Contrastive Learning Jun 21, 2021 Causal Inference Contrastive Learning
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos Jan 21, 2019 Decision Making Multi-Task Learning
Dense Video Object Captioning from Disjoint Supervision Jun 20, 2023 Object Sentence
A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge Nov 16, 2022 Action Localization Natural Language Queries