| Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey | Jun 5, 2022 | 3D Hand Pose Estimation, Domain Adaptation | —Unverified | 0 |
| An Effective Way to Improve YouTube-8M Classification Accuracy in Google Cloud Platform | Jun 26, 2017 | Classification, Deep Learning | —Unverified | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Nov 19, 2024 | GPU, Question Answering | —Unverified | 0 |
| EAGLE: Egocentric AGgregated Language-video Engine | Sep 26, 2024 | Action Recognition, Activity Recognition | —Unverified | 0 |
| DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | Jun 4, 2025 | MME, Video MME | —Unverified | 0 |
| BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes | Apr 4, 2024 | Object, Video Understanding | —Unverified | 0 |
| An Attempt towards Interpretable Audio-Visual Video Captioning | Dec 7, 2018 | Audio captioning, Audio-Visual Video Captioning | —Unverified | 0 |
| DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding | Nov 19, 2024 | Question Answering, Video Understanding | —Unverified | 0 |
| Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering | Jul 1, 2022 | Question Answering, Video Question Answering | —Unverified | 0 |
| Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding | Nov 21, 2024 | Computational Efficiency, Video Understanding | —Unverified | 0 |
| Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition | Dec 13, 2018 | 3D Action Recognition, Action Recognition | —Unverified | 0 |
| Dynamic Appearance: A Video Representation for Action Recognition with Joint Training | Nov 23, 2022 | Action Recognition, Temporal Action Localization | —Unverified | 0 |
| Beyond the Camera: Neural Networks in World Coordinates | Mar 12, 2020 | Action Recognition, Video Stabilization | —Unverified | 0 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | Oct 7, 2023 | Action Recognition, Multiple-choice | —Unverified | 0 |
| DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs | Apr 23, 2025 | Token Reduction, Video Understanding | —Unverified | 0 |
| DualX-VSR: Dual Axial SpatialTemporal Transformer for Real-World Video Super-Resolution without Motion Compensation | Jun 5, 2025 | Motion Compensation, Optical Flow Estimation | —Unverified | 0 |
| Beyond still images: Temporal features and input variance resilience | Nov 1, 2023 | Video Understanding | —Unverified | 0 |
| DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM | Oct 3, 2024 | Object Tracking, Video Understanding | —Unverified | 0 |
| Abductive Ego-View Accident Video Understanding for Safe Driving Perception | Mar 1, 2024 | Object, object-detection | —Unverified | 0 |
| An Analysis of Data Transformation Effects on Segment Anything 2 | Feb 25, 2025 | Semantic Segmentation, Video Object Segmentation | —Unverified | 0 |
| Learning text-to-video retrieval from image captioning | Apr 26, 2024 | Image Captioning, Image Retrieval | —Unverified | 0 |
| Dilated Temporal Relational Adversarial Network for Generic Video Summarization | Apr 30, 2018 | Generative Adversarial Network, Video Summarization | —Unverified | 0 |
| DrVideo: Document Retrieval Based Long Video Understanding | Jun 18, 2024 | document understanding, EgoSchema | —Unverified | 0 |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | Oct 2, 2023 | Autonomous Driving, Language Modeling | —Unverified | 0 |
| A Multimodal Sentiment Dataset for Video Recommendation | Sep 17, 2021 | Multimodal Sentiment Analysis, Sentiment Analysis | —Unverified | 0 |
| A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives | Mar 5, 2024 | Video Understanding | —Unverified | 0 |
| Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection | Dec 6, 2024 | GPU, Multi-Object Tracking | —Unverified | 0 |
| Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning | Jan 1, 2024 | object-detection, Object Detection | —Unverified | 0 |
| Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation | Jul 8, 2025 | Depth Estimation, Depth Prediction | —Unverified | 0 |
| DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation | Jul 31, 2023 | Action Segmentation, Human-Object Interaction Detection | —Unverified | 0 |
| BERT for Large-scale Video Segment Classification with Test-time Augmentation | Dec 2, 2019 | General Classification, Video Understanding | —Unverified | 0 |
| AMEGO: Active Memory from long EGOcentric videos | Sep 17, 2024 | Video Understanding | —Unverified | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | May 20, 2025 | Action Classification, Domain Adaptation | —Unverified | 0 |
| Actor-Action Semantic Segmentation with Grouping Process Models | Dec 30, 2015 | Semantic Segmentation, Video Understanding | —Unverified | 0 |
| BEARCUBS: A benchmark for computer-using web agents | Mar 10, 2025 | Video Understanding | —Unverified | 0 |
| DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering | Mar 20, 2025 | Contrastive Learning, Question Answering | —Unverified | 0 |
| InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | Jan 21, 2025 | Object Tracking, Referring Expression Segmentation | —Unverified | 0 |
| DOAD: Decoupled One Stage Action Detection Network | Apr 1, 2023 | Action Detection, Action Recognition | —Unverified | 0 |
| BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation | Aug 1, 2022 | Object, Optical Flow Estimation | —Unverified | 0 |
| ALLVB: All-in-One Long Video Understanding Benchmark | Mar 10, 2025 | All, Video Understanding | —Unverified | 0 |
| Learning reusable concepts across different egocentric video understanding tasks | May 30, 2025 | Video Understanding | —Unverified | 0 |
| Learning Space-Time Semantic Correspondences | Jun 16, 2023 | Imitation Learning, Semantic correspondence | —Unverified | 0 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Jul 13, 2023 | Action Recognition, Contrastive Learning | —Unverified | 0 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Jan 21, 2025 | Instruction Following, Mathematical Reasoning | —Unverified | 0 |
| DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | Jan 11, 2019 | Action Classification, Action Recognition | —Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | —Unverified | 0 |
| DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning | Aug 29, 2024 | Multi-Task Learning, Prompt Learning | —Unverified | 0 |
| Integrated Object Detection and Tracking with Tracklet-Conditioned Detection | Nov 27, 2018 | Object, object-detection | —Unverified | 0 |
| Instrument-tissue Interaction Detection Framework for Surgical Video Understanding | Mar 30, 2024 | Video Understanding | —Unverified | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 |