| VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model | Jul 9, 2024 | Video Understanding | Code Available | 0 |
| Rethinking Image-to-Video Adaptation: An Object-centric Perspective | Jul 9, 2024 | Action Recognition, Object | Unverified | 0 |
| Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision | Jul 8, 2024 | Action Quality Assessment, Descriptive | Code Available | 2 |
| MMAD: Multi-label Micro-Action Detection in Videos | Jul 7, 2024 | Action Analysis, Action Detection | Code Available | 1 |
| OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding | Jul 6, 2024 | Video Understanding | Unverified | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection | Jul 3, 2024 | Data Compression, Management | Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | Unverified | 0 |
| Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs | Jul 2, 2024 | Video Understanding | Unverified | 0 |
| https://arxiv.org/abs/2407.00634 | Jul 2, 2024 | Video Captioning, Video Description | Code Available | 0 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | Jun 30, 2024 | Video Captioning, Video Description | Code Available | 4 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 |
| Snakes and Ladders: Two Steps Up for VideoMamba | Jun 27, 2024 | Action Recognition, Mamba | Code Available | 1 |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | Jun 27, 2024 | Decoder, Segmentation | Code Available | 5 |
| Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads | Jun 27, 2024 | Diversity, image-classification | Code Available | 1 |
| Zero-Shot Long-Form Video Understanding through Screenplay | Jun 25, 2024 | Form, Question Answering | Unverified | 0 |
| PVUW 2024 Challenge on Complex Video Understanding: Methods and Results | Jun 24, 2024 | Segmentation, Semantic Segmentation | Code Available | 4 |
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | Jun 24, 2024 | Hallucination, Video Understanding | Unverified | 0 |
| OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer | Jun 24, 2024 | AI Agent, Large Language Model | Code Available | 2 |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Jun 22, 2024 | Diversity, Language Modeling | Code Available | 0 |
| Towards Event-oriented Long Video Understanding | Jun 20, 2024 | Video Understanding | Code Available | 1 |
| MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding | Jun 20, 2024 | Form, Video Understanding | Unverified | 0 |
| Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | Jun 19, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Jun 19, 2024 | Question Answering, Spatial Reasoning | Code Available | 1 |
| GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement | Jun 19, 2024 | Video Understanding | Unverified | 0 |
| DrVideo: Document Retrieval Based Long Video Understanding | Jun 18, 2024 | document understanding, EgoSchema | Unverified | 0 |
| Slot State Space Models | Jun 18, 2024 | Mamba, State Space Models | Code Available | 1 |
| Hallucination Mitigation Prompts Long-term Video Understanding | Jun 17, 2024 | Answer Generation, Hallucination | Code Available | 0 |
| VideoVista: A Versatile Benchmark for Video Understanding and Reasoning | Jun 17, 2024 | Anomaly Detection, Logical Reasoning | Code Available | 1 |
| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action Understanding, Benchmarking | Unverified | 0 |
| Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model | Jun 15, 2024 | Question Answering, Video Understanding | Code Available | 0 |
| Localizing Events in Videos with Multimodal Queries | Jun 14, 2024 | Natural Language Queries, Video Understanding | Unverified | 0 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | Jun 14, 2024 | Activity Recognition, MMR total | Unverified | 0 |
| LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | Jun 13, 2024 | Benchmarking, Human-Object Interaction Detection | Unverified | 0 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video Captioning, MVBench | Code Available | 3 |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Jun 12, 2024 | cross-modal alignment, Language Modelling | Code Available | 3 |
| LVBench: An Extreme Long Video Understanding Benchmark | Jun 12, 2024 | Decision Making, Video Understanding | Code Available | 2 |
| MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos | Jun 12, 2024 | counterfactual, Future prediction | Code Available | 1 |
| Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models | Jun 12, 2024 | Video Understanding | Unverified | 0 |
| MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD | Jun 11, 2024 | Video Recognition, Video Understanding | Unverified | 0 |
| Vript: A Video Is Worth Thousands of Words | Jun 10, 2024 | Video Captioning, Video Understanding | Code Available | 2 |
| 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | Jun 8, 2024 | Benchmarking, Instance Segmentation | Unverified | 0 |
| Semantic Segmentation on VSPW Dataset through Masked Video Consistency | Jun 7, 2024 | Semantic Segmentation, Video Understanding | Unverified | 0 |
| ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Jun 6, 2024 | Video Captioning, Video Generation | Code Available | 5 |
| 3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | Jun 6, 2024 | Panoptic Segmentation, Segmentation | Unverified | 0 |
| MLVU: Benchmarking Multi-task Long Video Understanding | Jun 6, 2024 | Benchmarking, Video Understanding | Code Available | 3 |
| Contrastive Language Video Time Pre-training | Jun 4, 2024 | Action Recognition, Contrastive Learning | Unverified | 0 |
| Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos | Jun 3, 2024 | Mistake Detection, Online Mistake Detection | Code Available | 1 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | Jun 1, 2024 | Autonomous Driving, Panoptic Segmentation | Unverified | 0 |