InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification Action Recognition | Code Available | 7
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | Jan 14, 2025 | Embodied Question Answering Hallucination | Code Available | 4
SnAG: Scalable and Accurate Video Grounding | Apr 2, 2024 | Video Grounding Video Understanding | Code Available | 4
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | Mar 24, 2023 | Highlight Detection Moment Retrieval | Code Available | 2
Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval | Jul 21, 2024 | General Knowledge Highlight Detection | Code Available | 2
VTimeLLM: Empower LLM to Grasp Video Moments | Nov 30, 2023 | Dense Video Captioning Temporal Relation Extraction | Code Available | 2
Context-Guided Spatio-Temporal Video Grounding | Jan 3, 2024 | Object Spatio-Temporal Video Grounding | Code Available | 2
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | Jun 2, 2025 | reinforcement-learning Reinforcement Learning | Code Available | 2
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection | Mar 23, 2022 | Decoder Highlight Detection | Code Available | 2
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Jan 14, 2025 | Feature Compression Language Modeling | Code Available | 2
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Nov 22, 2023 | Benchmarking Phrase Grounding | Code Available | 2
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | Mar 17, 2025 | Video Grounding | Code Available | 2
Human-centric Spatio-Temporal Video Grounding With Visual Transformers | Nov 10, 2020 | Referring Expression Sentence | Code Available | 1
HawkEye: Training Video-Text LLMs for Grounding Text in Videos | Mar 15, 2024 | Video Grounding Video Question Answering | Code Available | 1
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding | Feb 16, 2025 | Attribute Object | Code Available | 1
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | Jan 19, 2020 | Form Object | Code Available | 1
Knowing Where to Focus: Event-aware Transformer for Video Grounding | Aug 14, 2023 | Moment Queries Sentence | Code Available | 1
Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding Video Question Answering | Code Available | 1
TubeDETR: Spatio-Temporal Video Grounding with Transformers | Mar 30, 2022 | Decoder Language-Based Temporal Localization | Code Available | 1
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning | Jan 12, 2025 | Dense Video Captioning Video Captioning | Code Available | 1
Text-Visual Prompting for Efficient 2D Temporal Video Grounding | Mar 9, 2023 | Sentence Video Grounding | Code Available | 1
Detecting Moments and Highlights in Videos via Natural Language Queries | Dec 1, 2021 | Decoder Moment Retrieval | Code Available | 1
VLG-Net: Video-Language Graph Matching Network for Video Grounding | Nov 19, 2020 | Graph Matching Moment Retrieval | Code Available | 1
Weakly-Supervised Temporal Article Grounding | Oct 22, 2022 | All Articles | Code Available | 1
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding | Sep 27, 2022 | Decoder Spatio-Temporal Video Grounding | Code Available | 1
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding | Sep 22, 2022 | Contrastive Learning Video Grounding | Code Available | 1
Dense Regression Network for Video Grounding | Apr 7, 2020 | Natural Language Moment Retrieval Natural Language Queries | Code Available | 1
TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Mar 9, 2025 | Action Localization Boundary Detection | Code Available | 1
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | Nov 27, 2024 | Dense Video Captioning Grounded Video Question Answering | Code Available | 1
Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding | Dec 27, 2023 | Sentence Temporal Sentence Grounding | Code Available | 1
OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding | Mar 13, 2025 | Object Video Grounding | Code Available | 1
Can I Trust Your Answer? Visually Grounded Video Question Answering | Sep 4, 2023 | Grounded Video Question Answering Question Answering | Code Available | 1
Localizing Moments in Long Video Via Multimodal Guidance | Feb 26, 2023 | Natural Language Moment Retrieval Natural Language Visual Grounding | Code Available | 1
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding | Apr 18, 2022 | Action Recognition Animal Action Recognition | Code Available | 1
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding | Sep 10, 2021 | Metric Learning Representation Learning | Code Available | 1
Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos | Jan 25, 2022 | Natural Language Queries Sentence | Code Available | 1
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection | Nov 28, 2023 | Contrastive Learning Highlight Detection | Code Available | 1
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos | May 22, 2025 | Natural Language Moment Retrieval Natural Language Queries | Code Available | 1
Object-Shot Enhanced Grounding Network for Egocentric Video | May 7, 2025 | Video Grounding | Code Available | 1
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer | Jul 6, 2021 | Image Retrieval Knowledge Distillation | Code Available | 1
EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery Generation | Sep 10, 2021 | Translation Video Grounding | Unverified | 0
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model | Dec 5, 2023 | Boundary Detection Language Modeling | Unverified | 0
Enhancing Weakly Supervised Video Grounding via Diverse Inference Strategies for Boundary and Prediction Selection | Mar 29, 2025 | Prediction Video Grounding | Unverified | 0
End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding | Mar 15, 2022 | Descriptive Representation Learning | Unverified | 0
SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses | Aug 3, 2024 | Natural Language Queries Video Grounding | Unverified | 0
Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding | Mar 8, 2022 | Contrastive Learning Sentence | Unverified | 0
End-to-End Dense Video Grounding via Parallel Regression | Sep 23, 2021 | regression Sentence | Unverified | 0
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding | Jan 1, 2023 | Object Spatio-Temporal Video Grounding | Unverified | 0
Iterative Proposal Refinement for Weakly-Supervised Video Grounding | Jan 1, 2023 | Sentence Video Grounding | Unverified | 0
DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection | Aug 29, 2023 | Denoising Highlight Detection | Unverified | 0