| Multimodal Distillation for Egocentric Action Recognition | Jul 14, 2023 | Action Recognition, Knowledge Distillation | Code Available | 1 |
| Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Jan 1, 2024 | Video Understanding | Code Available | 1 |
| Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding | Nov 25, 2023 | Video Understanding | Code Available | 1 |
| Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties | Nov 28, 2023 | In-Context Learning, Video Understanding | Code Available | 1 |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | Apr 14, 2025 | Video Understanding | Code Available | 1 |
| MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing | Nov 28, 2022 | Activity Recognition, Few Shot Action Recognition | Code Available | 1 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | Jul 20, 2020 | Action Classification, Action Recognition | Code Available | 1 |
| Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness | Jan 14, 2025 | Event Extraction, Instruction Following | Code Available | 1 |
| Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning | Jan 1, 2023 | Contrastive Learning, Representation Learning | Code Available | 1 |
| MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions | May 16, 2021 | Action Detection, Action Localization | Code Available | 1 |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | Oct 30, 2023 | Script Generation, Video Understanding | Code Available | 1 |
| MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos | Jun 12, 2024 | counterfactual, Future prediction | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Apr 18, 2021 | Retrieval, Text Retrieval | Code Available | 1 |
| MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Nov 24, 2021 | audio-visual event localization, Video Understanding | Code Available | 1 |
| A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Jun 7, 2022 | Action Classification, Action Detection | Code Available | 1 |
| Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection | Jun 28, 2021 | Action Recognition, Action Spotting | Code Available | 1 |
| F^3Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos | Apr 11, 2025 | Action Understanding, Event Detection | Code Available | 1 |
| COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Aug 5, 2024 | Dense Video Captioning, Diversity | Code Available | 1 |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | May 22, 2025 | Misinformation, reinforcement-learning | Code Available | 1 |
| MMAD: Multi-label Micro-Action Detection in Videos | Jul 7, 2024 | Action Analysis, Action Detection | Code Available | 1 |
| Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Mar 2, 2025 | Large Language Model, Multi-Instance Retrieval | Code Available | 1 |
| MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | May 27, 2025 | Reinforcement Learning (RL), Video Understanding | Code Available | 1 |
| PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos | Dec 2, 2024 | Question Answering, Video Understanding | Code Available | 1 |
| MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps | Mar 23, 2025 | Scene Segmentation, Video Understanding | Code Available | 1 |
| Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Jan 1, 2025 | Action Recognition, Action Segmentation | Code Available | 1 |
| Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Aug 18, 2023 | point cloud video understanding, Self-Supervised Learning | Code Available | 1 |
| MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Jul 8, 2025 | Autonomous Driving, Video Understanding | Code Available | 1 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchema, Language Modelling | Code Available | 1 |
| CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction | Aug 29, 2023 | Federated Learning, image-classification | Code Available | 1 |
| A Dataset for Medical Instructional Video Classification and Question Answering | Jan 30, 2022 | Classification, Question Answering | Code Available | 1 |
| M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Jun 15, 2025 | Object, Semantic Segmentation | Code Available | 1 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal Discovery, Causal Inference | Code Available | 1 |
| Long Movie Clip Classification with State-Space Video Models | Apr 4, 2022 | Classification, Decoder | Code Available | 1 |
| Localizing Moments in Long Video Via Multimodal Guidance | Feb 26, 2023 | Natural Language Moment Retrieval, Natural Language Visual Grounding | Code Available | 1 |
| CAST: Cross-Attention in Space and Time for Video Action Recognition | Nov 30, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| Towards Visually Explaining Video Understanding Networks with Perturbation | May 1, 2020 | Video Understanding | Code Available | 1 |
| ETAD: Training Action Detection End to End on a Laptop | May 14, 2022 | Action Detection, GPU | Code Available | 1 |
| Leveraging triplet loss for unsupervised action segmentation | Apr 13, 2023 | Action Segmentation, Clustering | Code Available | 1 |
| EPIC Fields: Marrying 3D Geometry and Video Understanding | Jun 14, 2023 | 3D geometry, Neural Rendering | Code Available | 1 |
| Learning the Predictability of the Future | Jun 19, 2021 | Representation Learning, Self-Supervised Action Recognition | Code Available | 1 |
| Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Sep 30, 2022 | Descriptive, Representation Learning | Code Available | 1 |
| Lightweight Network Architecture for Real-Time Action Recognition | May 21, 2019 | Action Recognition, CPU | Code Available | 1 |
| Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation | Mar 18, 2024 | Referring Video Object Segmentation, Semantic Segmentation | Code Available | 1 |
| Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Apr 12, 2024 | Dense Video Captioning, Transfer Learning | Code Available | 1 |
| CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning | Oct 10, 2019 | Diagnostic, Object | Code Available | 1 |
| Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Aug 4, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Mar 25, 2025 | Hallucination, Hallucination Evaluation | Code Available | 1 |
| Learning Temporally Latent Causal Processes from General Temporal Data | Sep 29, 2021 | Causal Discovery, Disentanglement | Code Available | 1 |