| VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding | Dec 3, 2024 | In-Context Learning, Video Understanding | Code Available | 1 |
| PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos | Dec 2, 2024 | Question Answering, Video Understanding | Code Available | 1 |
| T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | Nov 29, 2024 | Data Augmentation, Diversity | Code Available | 1 |
| Teaching VLMs to Localize Specific Objects from In-context Examples | Nov 20, 2024 | Object, Object Tracking | Code Available | 1 |
| TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | Nov 17, 2024 | MVBench, Video-based Generative Performance Benchmarking | Code Available | 1 |
| Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Oct 31, 2024 | Action Segmentation, Action Understanding | Code Available | 1 |
| TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | Oct 30, 2024 | Video Understanding | Code Available | 1 |
| VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks | Oct 24, 2024 | Video Understanding | Code Available | 1 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understanding, Video Understanding | Code Available | 1 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 |
| VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding | Oct 11, 2024 | Hallucination, Moment Retrieval | Code Available | 1 |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | Sep 30, 2024 | EgoSchema, Language Modelling | Code Available | 1 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Sep 27, 2024 | Video Understanding, Visual Reasoning | Code Available | 1 |
| HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Aug 12, 2024 | Action Localization, Temporal Action Localization | Code Available | 1 |
| COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Aug 5, 2024 | Dense Video Captioning, Diversity | Code Available | 1 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Jul 23, 2024 | Re-Ranking, Retrieval | Code Available | 1 |
| VideoMamba: Spatio-Temporal Selective State Space Model | Jul 11, 2024 | Mamba, model | Code Available | 1 |
| Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Jul 11, 2024 | EEG, Language Modeling | Code Available | 1 |
| MMAD: Multi-label Micro-Action Detection in Videos | Jul 7, 2024 | Action Analysis, Action Detection | Code Available | 1 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 |
| Snakes and Ladders: Two Steps Up for VideoMamba | Jun 27, 2024 | Action Recognition, Mamba | Code Available | 1 |
| Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads | Jun 27, 2024 | Diversity, image-classification | Code Available | 1 |
| Towards Event-oriented Long Video Understanding | Jun 20, 2024 | Video Understanding | Code Available | 1 |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Jun 19, 2024 | Question Answering, Spatial Reasoning | Code Available | 1 |
| Slot State Space Models | Jun 18, 2024 | Mamba, State Space Models | Code Available | 1 |
| VideoVista: A Versatile Benchmark for Video Understanding and Reasoning | Jun 17, 2024 | Anomaly Detection, Logical Reasoning | Code Available | 1 |
| MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos | Jun 12, 2024 | counterfactual, Future prediction | Code Available | 1 |
| Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos | Jun 3, 2024 | Mistake Detection, Online Mistake Detection | Code Available | 1 |
| EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos | May 30, 2024 | Action Recognition, Surgical phase recognition | Code Available | 1 |
| TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment | May 22, 2024 | EgoSchema, Video Understanding | Code Available | 1 |
| No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding | May 14, 2024 | Action Detection, GPU | Code Available | 1 |
| SFMViT: SlowFast Meet ViT in Chaotic World | Apr 25, 2024 | Action Localization, Video Understanding | Code Available | 1 |
| Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection | Apr 14, 2024 | Highlight Detection, Moment Retrieval | Code Available | 1 |
| Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Apr 12, 2024 | Dense Video Captioning, Transfer Learning | Code Available | 1 |
| SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos | Apr 6, 2024 | Graph Generation, Relation | Code Available | 1 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema, Question Answering | Code Available | 1 |
| Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation | Mar 18, 2024 | Referring Video Object Segmentation, Semantic Segmentation | Code Available | 1 |
| Towards Neuro-Symbolic Video Understanding | Mar 16, 2024 | Video Understanding | Code Available | 1 |
| Spatio-temporal Prompting Network for Robust Video Feature Extraction | Feb 4, 2024 | Instance Segmentation, object-detection | Code Available | 1 |
| BehAVE: Behaviour Alignment of Video Game Encodings | Feb 2, 2024 | Diversity, FPS Games | Code Available | 1 |
| Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Jan 1, 2024 | Video Understanding | Code Available | 1 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchema, Language Modelling | Code Available | 1 |
| Open-Vocabulary Video Relation Extraction | Dec 25, 2023 | Action Classification, Action Understanding | Code Available | 1 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioning, video narration captioning | Code Available | 1 |
| SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models | Dec 15, 2023 | Video Understanding | Code Available | 1 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Dec 12, 2023 | Anomaly Detection, Autonomous Driving | Code Available | 1 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 |
| Action Scene Graphs for Long-Form Understanding of Egocentric Videos | Dec 6, 2023 | Action Anticipation, Form | Code Available | 1 |
| DEVIAS: Learning Disentangled Video Representations of Action and Scene | Nov 30, 2023 | Action Recognition, Decoder | Code Available | 1 |