| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 | 5 |
| A Coding Framework and Benchmark towards Low-Bitrate Video Understanding | Feb 6, 2022 | Video Compression, Video Understanding | Code Available | 0 | 5 |
| Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos | Feb 16, 2024 | Decision Making, Video Understanding | Code Available | 0 | 5 |
| EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization | Jun 17, 2025 | Multi-Instance Retrieval, Retrieval | Code Available | 0 | 5 |
| Pooled Motion Features for First-Person Videos | Dec 19, 2014 | Activity Recognition, Activity Recognition In Videos | Code Available | 0 | 5 |
| CARPe Posterum: A Convolutional Approach for Real-time Pedestrian Path Prediction | May 26, 2020 | Autonomous Vehicles, Prediction | Code Available | 0 | 5 |
| Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition | Jan 25, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 0 | 5 |
| Are current long-term video understanding datasets long-term? | Aug 22, 2023 | Action Recognition, Video Understanding | Code Available | 0 | 5 |
| Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark | Sep 23, 2021 | Video Understanding | Code Available | 0 | 5 |
| Enhancing Temporal Modeling of Video LLMs via Time Gating | Oct 8, 2024 | MVBench, Question Answering | Code Available | 0 | 5 |
| On the Pitfalls of Batch Normalization for End-to-End Video Learning: A Study on Surgical Workflow Analysis | Mar 15, 2022 | Video Understanding | Code Available | 0 | 5 |
| OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions | Nov 24, 2024 | Action Classification, Action Recognition | Code Available | 0 | 5 |
| End-to-End Learning of Motion Representation for Video Understanding | Apr 2, 2018 | Action Recognition, Optical Flow Estimation | Code Available | 0 | 5 |
| NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels | Oct 13, 2021 | Action Classification, Self-Supervised Learning | Code Available | 0 | 5 |
| NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification | Nov 12, 2018 | Efficient Neural Network, General Classification | Code Available | 0 | 5 |
| Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding | Nov 1, 2019 | Action Detection, Action Recognition | Code Available | 0 | 5 |
| B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens | Dec 13, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Long-Term Feature Banks for Detailed Video Understanding | Dec 12, 2018 | Action Classification, Action Recognition | Code Available | 0 | 5 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Jun 3, 2025 | EgoSchema, Question Answering | Code Available | 0 | 5 |
| Multi-attention Networks for Temporal Localization of Video-level Labels | Nov 15, 2019 | Action Recognition, Temporal Action Localization | Code Available | 0 | 5 |
| MOFO: MOtion FOcused Self-Supervision for Video Understanding | Aug 23, 2023 | Action Classification, Action Recognition | Code Available | 0 | 5 |
| Localizing Moments in Video with Temporal Language | Sep 5, 2018 | Natural Language Queries, Retrieval | Code Available | 0 | 5 |
| LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models | Aug 26, 2024 | Large Language Model, Video Quality Assessment | Code Available | 0 | 5 |
| MOD: A Deep Mixture Model with Online Knowledge Distillation for Large Scale Video Temporal Concept Localization | Oct 27, 2019 | Knowledge Distillation, Video Understanding | Code Available | 0 | 5 |
| Multimodal Dialogue State Tracking | Jun 16, 2022 | Dialogue State Tracking, Video Understanding | Code Available | 0 | 5 |
| Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision | Jun 6, 2025 | Video Understanding | Code Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 | 5 |
| METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding | Jun 3, 2025 | Video Understanding | Code Available | 0 | 5 |
| MINOTAUR: Multi-task Video Grounding From Multimodal Queries | Feb 16, 2023 | Action Detection, Sentence | Code Available | 0 | 5 |
| Masked Autoencoders for Egocentric Video Understanding @ Ego4D Challenge 2022 | Nov 18, 2022 | Object State Change Classification, Temporal Localization | Code Available | 0 | 5 |
| A Challenge to Build Neuro-Symbolic Video Agents | May 20, 2025 | Scene Classification, Video Retrieval | Code Available | 0 | 5 |
| Representation Flow for Action Recognition | Oct 2, 2018 | Action Classification, Action Recognition | Code Available | 0 | 5 |
| Learning to Visually Connect Actions and their Effects | Jan 19, 2024 | Object Tracking, Task Planning | Unverified | 0 | 0 |
| Learning to Focus on the Foreground for Temporal Sentence Grounding | Oct 1, 2022 | Sentence, Temporal Sentence Grounding | Unverified | 0 | 0 |
| Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey | Jun 5, 2022 | 3D Hand Pose Estimation, Domain Adaptation | Unverified | 0 | 0 |
| Learning text-to-video retrieval from image captioning | Apr 26, 2024 | Image Captioning, Image Retrieval | Unverified | 0 | 0 |
| Learning Space-Time Semantic Correspondences | Jun 16, 2023 | Imitation Learning, Semantic correspondence | Unverified | 0 | 0 |
| An Effective Way to Improve YouTube-8M Classification Accuracy in Google Cloud Platform | Jun 26, 2017 | Classification, Deep Learning | Unverified | 0 | 0 |
| Learning reusable concepts across different egocentric video understanding tasks | May 30, 2025 | Video Understanding | Unverified | 0 | 0 |
| EAGLE: Egocentric AGgregated Language-video Engine | Sep 26, 2024 | Action Recognition, Activity Recognition | Unverified | 0 | 0 |
| Learning Object State Changes in Videos: An Open-World Perspective | Dec 19, 2023 | Video Understanding | Unverified | 0 | 0 |
| Learning Higher-order Object Interactions for Keypoint-based Video Understanding | May 16, 2023 | Action Localization, Action Recognition | Unverified | 0 | 0 |
| Learning from Multiple Sources for Video Summarisation | Jan 13, 2015 | Clustering, Video Understanding | Unverified | 0 | 0 |
| DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | Jun 4, 2025 | MME, Video MME | Unverified | 0 | 0 |
| BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes | Apr 4, 2024 | Object, Video Understanding | Unverified | 0 | 0 |
| An Attempt towards Interpretable Audio-Visual Video Captioning | Dec 7, 2018 | Audio captioning, Audio-Visual Video Captioning | Unverified | 0 | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Nov 19, 2024 | GPU, Question Answering | Unverified | 0 | 0 |
| Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment | Jun 8, 2023 | Video Understanding | Unverified | 0 | 0 |
| Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking | Jun 7, 2021 | Graph Neural Network, Multi-Person Pose Estimation | Unverified | 0 | 0 |
| DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding | Nov 19, 2024 | Question Answering, Video Understanding | Unverified | 0 | 0 |