| DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering | Mar 20, 2025 | Contrastive Learning, Question Answering | —Unverified | 0 |
| InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | Jan 21, 2025 | Object Tracking, Referring Expression Segmentation | —Unverified | 0 |
| DOAD: Decoupled One Stage Action Detection Network | Apr 1, 2023 | Action Detection, Action Recognition | —Unverified | 0 |
| In-the-Wild Video Question Answering | Oct 1, 2022 | Evidence Selection, Question Answering | —Unverified | 0 |
| BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation | Aug 1, 2022 | Object, Optical Flow Estimation | —Unverified | 0 |
| IPAD: Industrial Process Anomaly Detection Dataset | Apr 23, 2024 | Anomaly Detection, Video Anomaly Detection | —Unverified | 0 |
| ALLVB: All-in-One Long Video Understanding Benchmark | Mar 10, 2025 | All, Video Understanding | —Unverified | 0 |
| IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs | Dec 13, 2024 | Question Answering, Video Question Answering | —Unverified | 0 |
| Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory | Mar 17, 2025 | Form, GPU | —Unverified | 0 |
| LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models | Feb 4, 2025 | GPU, Video Understanding | —Unverified | 0 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Jul 13, 2023 | Action Recognition, Contrastive Learning | —Unverified | 0 |
| Is Temporal Prompting All We Need For Limited Labeled Action Recognition? | Apr 2, 2025 | Action Recognition, All | —Unverified | 0 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Jan 21, 2025 | Instruction Following, Mathematical Reasoning | —Unverified | 0 |
| DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | Jan 11, 2019 | Action Classification, Action Recognition | —Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | —Unverified | 0 |
| Jointly Learning Energy Expenditures and Activities Using Egocentric Multimodal Signals | Jul 1, 2017 | Video Understanding | —Unverified | 0 |
| Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection | Dec 6, 2024 | GPU, Multi-Object Tracking | —Unverified | 0 |
| Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input | Aug 28, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning | Aug 29, 2024 | Multi-Task Learning, Prompt Learning | —Unverified | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection | Jul 3, 2024 | Data Compression, Management | —Unverified | 0 |
| Integrated Object Detection and Tracking with Tracklet-Conditioned Detection | Nov 27, 2018 | Object, object-detection | —Unverified | 0 |
| KnowIT VQA: Answering Knowledge-Based Questions about Videos | Oct 23, 2019 | Question Answering, Video Question Answering | —Unverified | 0 |
| Knowledge-Based Visual Question Answering in Videos | Apr 17, 2020 | Question Answering, Video Question Answering | —Unverified | 0 |
| Koala: Key frame-conditioned long video-LLM | Apr 5, 2024 | Action Recognition, Question Answering | —Unverified | 0 |
| Instrument-tissue Interaction Detection Framework for Surgical Video Understanding | Mar 30, 2024 | Video Understanding | —Unverified | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 |
| Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports Videos Understanding | Oct 31, 2021 | Action Recognition, Text Detection | —Unverified | 0 |
| AVT: Audio-Video Transformer for Multimodal Action Recognition | Sep 22, 2022 | Action Recognition, Audio Classification | —Unverified | 0 |
| Aligned Better, Listen Better for Audio-Visual Large Language Models | Apr 2, 2025 | Video Understanding | —Unverified | 0 |
| Disentangle and denoise: Tackling context misalignment for video moment retrieval | Aug 14, 2024 | Denoising, Disentanglement | —Unverified | 0 |
| InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | Jun 18, 2025 | GPU, Streaming video understanding | —Unverified | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Jun 5, 2025 | Benchmarking, Video Understanding | —Unverified | 0 |
| Large-Scale Video Classification with Feature Space Augmentation coupled with Learned Label Relations and Ensembling | Sep 21, 2018 | General Classification, Video Classification | —Unverified | 0 |
| Large Scale Video Representation Learning via Relational Graph Clustering | Jun 1, 2020 | Clustering, Graph Clustering | —Unverified | 0 |
| Large-Scale YouTube-8M Video Understanding with Deep Neural Networks | Jun 14, 2017 | Classification, General Classification | —Unverified | 0 |
| LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision | Apr 15, 2023 | Language Modeling, Language Modelling | —Unverified | 0 |
| Dynamic Appearance: A Video Representation for Action Recognition with Joint Training | Nov 23, 2022 | Action Recognition, Temporal Action Localization | —Unverified | 0 |
| Beyond the Camera: Neural Networks in World Coordinates | Mar 12, 2020 | Action Recognition, Video Stabilization | —Unverified | 0 |
| Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition | Dec 13, 2018 | 3D Action Recognition, Action Recognition | —Unverified | 0 |
| Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection | Aug 8, 2021 | Action Detection, Knowledge Distillation | —Unverified | 0 |
| Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | Apr 3, 2025 | Information Retrieval, Representation Learning | —Unverified | 0 |
| Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer | Sep 19, 2023 | Anatomy, Computational Efficiency | —Unverified | 0 |
| Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking | Jun 7, 2021 | Graph Neural Network, Multi-Person Pose Estimation | —Unverified | 0 |
| Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment | Jun 8, 2023 | Video Understanding | —Unverified | 0 |
| Learning from Multiple Sources for Video Summarisation | Jan 13, 2015 | Clustering, Video Understanding | —Unverified | 0 |
| Learning Higher-order Object Interactions for Keypoint-based Video Understanding | May 16, 2023 | Action Localization, Action Recognition | —Unverified | 0 |
| Inductive Attention for Video Action Anticipation | Dec 17, 2022 | Action Anticipation, Action Recognition | —Unverified | 0 |
| Discrete neural representations for explainable anomaly detection | Dec 10, 2021 | Anomaly Detection, Object | —Unverified | 0 |
| Improving Video Model Transfer With Dynamic Representation Learning | Jan 1, 2022 | Action Classification, Knowledge Distillation | —Unverified | 0 |
| Improving LLM Video Understanding with 16 Frames Per Second | Mar 18, 2025 | MME, Video MME | —Unverified | 0 |