| FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering | Dec 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries | Dec 17, 2024 | Human Detection, image-classification | Unverified | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | Hallucination, Multiple-choice | Unverified | 0 |
| Uni-AdaFocus: Spatial-temporal Dynamic Computation for Video Recognition | Dec 15, 2024 | Computational Efficiency, Video Recognition | Code Available | 2 |
| Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | Dec 15, 2024 | Image Captioning, Medical Question Answering | Unverified | 0 |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | Dec 13, 2024 | MME, Video MME | Unverified | 0 |
| IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs | Dec 13, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens | Dec 13, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| VCA: Video Curious Agent for Long Video Understanding | Dec 12, 2024 | Video Understanding | Unverified | 0 |
| ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation | Dec 12, 2024 | Phrase Grounding, Question Answering | Unverified | 0 |
| PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models | Dec 12, 2024 | Video Understanding | Unverified | 0 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | Benchmarking, Multimodal Reasoning | Code Available | 2 |
| COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | Dec 11, 2024 | GPU, Language Modeling | Unverified | 0 |
| 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark | Dec 10, 2024 | Autonomous Navigation, Spatial Reasoning | Unverified | 0 |
| Multi-Scale Contrastive Learning for Video Temporal Grounding | Dec 10, 2024 | Contrastive Learning, Data Augmentation | Unverified | 0 |
| GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning | Dec 10, 2024 | cross-modal alignment, Video Understanding | Unverified | 0 |
| Towards Long Video Understanding via Fine-detailed Video Story Generation | Dec 9, 2024 | Story Generation, Video Understanding | Unverified | 0 |
| Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection | Dec 6, 2024 | GPU, Multi-Object Tracking | Unverified | 0 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | Dec 6, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Unverified | 0 |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | Dec 6, 2024 | EgoSchema, Language Modeling | Unverified | 0 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | Dec 5, 2024 | Video Understanding, Visual Question Answering | Code Available | 3 |
| AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Dec 4, 2024 | Video Understanding | Code Available | 2 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 |
| Streaming Detection of Queried Event Start | Dec 4, 2024 | Autonomous Driving, parameter-efficient fine-tuning | Code Available | 0 |
| VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | Dec 4, 2024 | Hallucination, Instruction Following | Unverified | 0 |
| Progress-Aware Video Frame Captioning | Dec 3, 2024 | Image Captioning, Video Captioning | Unverified | 0 |
| VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding | Dec 3, 2024 | In-Context Learning, Video Understanding | Code Available | 1 |
| PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos | Dec 2, 2024 | Question Answering, Video Understanding | Code Available | 1 |
| SEAL: Semantic Attention Learning for Long Video Representation | Dec 2, 2024 | Diversity, Question Answering | Unverified | 0 |
| Towards Universal Soccer Video Understanding | Dec 2, 2024 | Action Classification, Sports Understanding | Code Available | 3 |
| VideoSAVi: Self-Aligned Video Language Models without Human Supervision | Dec 1, 2024 | EgoSchema, MVBench | Unverified | 0 |
| VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation | Dec 1, 2024 | Instruction Following, Video Understanding | Unverified | 0 |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | Nov 29, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | Nov 29, 2024 | Data Augmentation, Diversity | Code Available | 1 |
| LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos | Nov 29, 2024 | Boundary Detection, Dense Video Captioning | Code Available | 2 |
| Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing | Nov 29, 2024 | All, Form | Unverified | 0 |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | Nov 27, 2024 | Temporal Localization, Video Understanding | Code Available | 2 |
| SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context | Nov 25, 2024 | Large Language Model, MME | Unverified | 0 |
| OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions | Nov 24, 2024 | Action Classification, Action Recognition | Code Available | 0 |
| ReWind: Understanding Long Videos with Instructed Learnable Memory | Nov 23, 2024 | Large Language Model, Question Answering | Unverified | 0 |
| Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding | Nov 21, 2024 | Computational Efficiency, Video Understanding | Unverified | 0 |
| Principles of Visual Tokens for Efficient Video Understanding | Nov 20, 2024 | Video Understanding | Unverified | 0 |
| Extending Video Masked Autoencoders to 128 frames | Nov 20, 2024 | Decoder, Video Understanding | Unverified | 0 |
| Teaching VLMs to Localize Specific Objects from In-context Examples | Nov 20, 2024 | Object, Object Tracking | Code Available | 1 |
| VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | Nov 20, 2024 | Chatbot, Multiple-choice | Unverified | 0 |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Nov 20, 2024 | GPU, MME | Code Available | 3 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Nov 19, 2024 | GPU, Question Answering | Unverified | 0 |
| DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding | Nov 19, 2024 | Question Answering, Video Understanding | Unverified | 0 |