| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | Hallucination, Multiple-choice | Unverified | 0 |
| Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | Dec 15, 2024 | Image Captioning, Medical Question Answering | Unverified | 0 |
| IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs | Dec 13, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens | Dec 13, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | Dec 13, 2024 | MME, Video MME | Unverified | 0 |
| PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models | Dec 12, 2024 | Video Understanding | Unverified | 0 |
| ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation | Dec 12, 2024 | Phrase Grounding, Question Answering | Unverified | 0 |
| VCA: Video Curious Agent for Long Video Understanding | Dec 12, 2024 | Video Understanding | Unverified | 0 |
| COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | Dec 11, 2024 | GPU, Language Modeling | Unverified | 0 |
| Multi-Scale Contrastive Learning for Video Temporal Grounding | Dec 10, 2024 | Contrastive Learning, Data Augmentation | Unverified | 0 |
| GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning | Dec 10, 2024 | cross-modal alignment, Video Understanding | Unverified | 0 |
| 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark | Dec 10, 2024 | Autonomous Navigation, Spatial Reasoning | Unverified | 0 |
| Towards Long Video Understanding via Fine-detailed Video Story Generation | Dec 9, 2024 | Story Generation, Video Understanding | Unverified | 0 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Unverified | 0 |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | Dec 6, 2024 | EgoSchema, Language Modeling | Unverified | 0 |
| Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection | Dec 6, 2024 | GPU, Multi-Object Tracking | Unverified | 0 |
| VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | Dec 4, 2024 | Hallucination, Instruction Following | Unverified | 0 |
| Streaming Detection of Queried Event Start | Dec 4, 2024 | Autonomous Driving, parameter-efficient fine-tuning | Code Available | 0 |
| Progress-Aware Video Frame Captioning | Dec 3, 2024 | Image Captioning, Video Captioning | Unverified | 0 |
| SEAL: Semantic Attention Learning for Long Video Representation | Dec 2, 2024 | Diversity, Question Answering | Unverified | 0 |
| VideoSAVi: Self-Aligned Video Language Models without Human Supervision | Dec 1, 2024 | EgoSchema, MVBench | Unverified | 0 |
| VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation | Dec 1, 2024 | Instruction Following, Video Understanding | Unverified | 0 |
| Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing | Nov 29, 2024 | All, Form | Unverified | 0 |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | Nov 29, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context | Nov 25, 2024 | Large Language Model, MME | Unverified | 0 |
| OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions | Nov 24, 2024 | Action Classification, Action Recognition | Code Available | 0 |
| ReWind: Understanding Long Videos with Instructed Learnable Memory | Nov 23, 2024 | Large Language Model, Question Answering | Unverified | 0 |
| Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding | Nov 21, 2024 | Computational Efficiency, Video Understanding | Unverified | 0 |
| Extending Video Masked Autoencoders to 128 frames | Nov 20, 2024 | Decoder, Video Understanding | Unverified | 0 |
| Principles of Visual Tokens for Efficient Video Understanding | Nov 20, 2024 | Video Understanding | Unverified | 0 |
| VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | Nov 20, 2024 | Chatbot, Multiple-choice | Unverified | 0 |
| DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding | Nov 19, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Nov 19, 2024 | GPU, Question Answering | Unverified | 0 |
| ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models | Nov 16, 2024 | Hallucination, Video Generation | Unverified | 0 |
| Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? | Nov 13, 2024 | Action Localization, Temporal Action Localization | Unverified | 0 |
| EVQAScore: Efficient Video Question Answering Data Evaluation | Nov 11, 2024 | Keyword Extraction, Question Answering | Unverified | 0 |
| Video RWKV:Video Action Recognition Based RWKV | Nov 8, 2024 | Action Recognition, Representation Learning | Unverified | 0 |
| Personalized Video Summarization by Multimodal Video Understanding | Nov 5, 2024 | Unsupervised Video Summarization, Video Summarization | Unverified | 0 |
| Video Token Merging for Long-form Video Understanding | Oct 31, 2024 | Form, Video Classification | Unverified | 0 |
| Situational Scene Graph for Structured Human-centric Situation Understanding | Oct 30, 2024 | Graph Generation, Predicate Classification | Code Available | 0 |
| Zero-Shot Action Recognition in Surveillance Videos | Oct 28, 2024 | Action Recognition, Video Understanding | Unverified | 0 |
| Egocentric and Exocentric Methods: A Short Survey | Oct 27, 2024 | Action Recognition, Survey | Unverified | 0 |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | Oct 26, 2024 | Video Understanding | Unverified | 0 |
| EVA: An Embodied World Model for Future Video Anticipation | Oct 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| ContextDet: Temporal Action Detection with Adaptive Context Aggregation | Oct 20, 2024 | Action Detection, Video Understanding | Unverified | 0 |
| FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning | Oct 20, 2024 | Diagnostic, Video Captioning | Unverified | 0 |
| Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling | Oct 19, 2024 | Video Understanding | Unverified | 0 |
| Zero-shot Action Localization via the Confidence of Large Vision-Language Models | Oct 18, 2024 | Action Localization, Language Modelling | Unverified | 0 |
| VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models | Oct 15, 2024 | Video Understanding | Unverified | 0 |