| TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | Nov 17, 2024 | MVBench, Video-based Generative Performance Benchmarking | Code Available | 1 |
| ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models | Nov 16, 2024 | Hallucination, Video Generation | Unverified | 0 |
| Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? | Nov 13, 2024 | Action Localization, Temporal Action Localization | Unverified | 0 |
| EVQAScore: Efficient Video Question Answering Data Evaluation | Nov 11, 2024 | Keyword Extraction, Question Answering | Unverified | 0 |
| Video RWKV: Video Action Recognition Based RWKV | Nov 8, 2024 | Action Recognition, Representation Learning | Unverified | 0 |
| StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | Nov 6, 2024 | Image Comprehension, Streaming video understanding | Code Available | 2 |
| Personalized Video Summarization by Multimodal Video Understanding | Nov 5, 2024 | Unsupervised Video Summarization, Video Summarization | Unverified | 0 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption Generation, Multiple-choice | Code Available | 2 |
| Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Oct 31, 2024 | Action Segmentation, Action Understanding | Code Available | 1 |
| Video Token Merging for Long-form Video Understanding | Oct 31, 2024 | Form, Video Classification | Unverified | 0 |
| Situational Scene Graph for Structured Human-centric Situation Understanding | Oct 30, 2024 | Graph Generation, Predicate Classification | Code Available | 0 |
| TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | Oct 30, 2024 | Video Understanding | Code Available | 1 |
| Zero-Shot Action Recognition in Surveillance Videos | Oct 28, 2024 | Action Recognition, Video Understanding | Unverified | 0 |
| Egocentric and Exocentric Methods: A Short Survey | Oct 27, 2024 | Action Recognition, Survey | Unverified | 0 |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | Oct 26, 2024 | Video Understanding | Unverified | 0 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Oct 25, 2024 | EgoSchema, Hallucination | Code Available | 2 |
| VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks | Oct 24, 2024 | Video Understanding | Code Available | 1 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understanding, Video Understanding | Code Available | 1 |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | Oct 22, 2024 | Token Reduction, Video Question Answering | Code Available | 3 |
| ContextDet: Temporal Action Detection with Adaptive Context Aggregation | Oct 20, 2024 | Action Detection, Video Understanding | Unverified | 0 |
| EVA: An Embodied World Model for Future Video Anticipation | Oct 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning | Oct 20, 2024 | Diagnostic, Video Captioning | Unverified | 0 |
| Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling | Oct 19, 2024 | Video Understanding | Unverified | 0 |
| Zero-shot Action Localization via the Confidence of Large Vision-Language Models | Oct 18, 2024 | Action Localization, Language Modelling | Unverified | 0 |
| VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models | Oct 15, 2024 | Video Understanding | Unverified | 0 |
| VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Oct 15, 2024 | Question Answering, Video Question Answering | Code Available | 2 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 |
| Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs | Oct 14, 2024 | Computational Efficiency, Question Answering | Code Available | 2 |
| ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification | Oct 13, 2024 | Contrastive Learning, Person Re-Identification | Unverified | 0 |
| Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering | Oct 12, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding | Oct 11, 2024 | Hallucination, Moment Retrieval | Code Available | 1 |
| TVBench: Redesigning Video-Language Evaluation | Oct 10, 2024 | Multiple-choice, Open-Ended Question Answering | Unverified | 0 |
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | Oct 9, 2024 | Audio captioning, Large Language Model | Unverified | 0 |
| MM-Ego: Towards Building Egocentric Multimodal LLMs | Oct 9, 2024 | Video Understanding | Unverified | 0 |
| Enhancing Temporal Modeling of Video LLMs via Time Gating | Oct 8, 2024 | MVBench, Question Answering | Code Available | 0 |
| TRACE: Temporal Grounding Video LLM via Causal Event Modeling | Oct 8, 2024 | Text Generation, Video Understanding | Code Available | 2 |
| SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference | Oct 6, 2024 | Language Modeling, Language Modelling | Code Available | 3 |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | Oct 4, 2024 | Image Captioning, Video Understanding | Unverified | 0 |
| Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Oct 4, 2024 | Dense Video Captioning, Sentence | Code Available | 2 |
| Frame-Voyager: Learning to Query Frames for Video Large Language Models | Oct 4, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| AirLetters: An Open Video Dataset of Characters Drawn in the Air | Oct 3, 2024 | Video Understanding | Unverified | 0 |
| DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM | Oct 3, 2024 | Object Tracking, Video Understanding | Unverified | 0 |
| Deep learning for action spotting in association football videos | Oct 2, 2024 | Action Spotting, Benchmarking | Unverified | 0 |
| UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark | Oct 2, 2024 | Unusual Activity Localization, Video Understanding | Code Available | 0 |
| ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding | Oct 1, 2024 | Contrastive Learning, Hallucination | Code Available | 0 |
| Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | Sep 30, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Sep 30, 2024 | Mixture-of-Experts, Optical Character Recognition (OCR) | Unverified | 0 |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | Sep 30, 2024 | EgoSchema, Language Modelling | Code Available | 1 |
| Visual Context Window Extension: A New Perspective for Long Video Understanding | Sep 30, 2024 | Video Understanding | Unverified | 0 |
| Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks | Sep 27, 2024 | Action Detection, Action Segmentation | Unverified | 0 |