| Rethinking Image-to-Video Adaptation: An Object-centric Perspective | Jul 9, 2024 | Action Recognition, Object | Unverified | 0 |
| VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model | Jul 9, 2024 | Video Understanding | Code Available | 0 |
| OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding | Jul 6, 2024 | Video Understanding | Unverified | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection | Jul 3, 2024 | Data Compression, Management | Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | Unverified | 0 |
| https://arxiv.org/abs/2407.00634 | Jul 2, 2024 | Video Captioning, Video Description | Code Available | 0 |
| Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs | Jul 2, 2024 | Video Understanding | Unverified | 0 |
| Zero-Shot Long-Form Video Understanding through Screenplay | Jun 25, 2024 | Form, Question Answering | Unverified | 0 |
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | Jun 24, 2024 | Hallucination, Video Understanding | Unverified | 0 |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Jun 22, 2024 | Diversity, Language Modeling | Code Available | 0 |
| MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding | Jun 20, 2024 | Form, Video Understanding | Unverified | 0 |
| Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | Jun 19, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement | Jun 19, 2024 | Video Understanding | Unverified | 0 |
| DrVideo: Document Retrieval Based Long Video Understanding | Jun 18, 2024 | document understanding, EgoSchema | Unverified | 0 |
| Hallucination Mitigation Prompts Long-term Video Understanding | Jun 17, 2024 | Answer Generation, Hallucination | Code Available | 0 |
| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action Understanding, Benchmarking | Unverified | 0 |
| Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model | Jun 15, 2024 | Question Answering, Video Understanding | Code Available | 0 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | Jun 14, 2024 | Activity Recognition, MMR total | Unverified | 0 |
| Localizing Events in Videos with Multimodal Queries | Jun 14, 2024 | Natural Language Queries, Video Understanding | Unverified | 0 |
| LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | Jun 13, 2024 | Benchmarking, Human-Object Interaction Detection | Unverified | 0 |
| Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models | Jun 12, 2024 | Video Understanding | Unverified | 0 |
| MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD | Jun 11, 2024 | Video Recognition, Video Understanding | Unverified | 0 |
| 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | Jun 8, 2024 | Benchmarking, Instance Segmentation | Unverified | 0 |
| Semantic Segmentation on VSPW Dataset through Masked Video Consistency | Jun 7, 2024 | Semantic Segmentation, Video Understanding | Unverified | 0 |
| 3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | Jun 6, 2024 | Panoptic Segmentation, Segmentation | Unverified | 0 |
| Contrastive Language Video Time Pre-training | Jun 4, 2024 | Action Recognition, Contrastive Learning | Unverified | 0 |
| 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | Jun 1, 2024 | Autonomous Driving, Panoptic Segmentation | Unverified | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| Temporal Grounding of Activities using Multimodal Large Language Models | May 30, 2024 | Video Understanding | Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 |
| Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions | May 28, 2024 | Action Recognition, Video Recognition | Unverified | 0 |
| Streaming Long Video Understanding with Large Language Models | May 25, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | May 23, 2024 | Action Recognition, Action Segmentation | Unverified | 0 |
| Anticipating Object State Changes in Long Procedural Videos | May 21, 2024 | Object, Object State Change Classification | Unverified | 0 |
| Open-Vocabulary Spatio-Temporal Action Detection | May 17, 2024 | Action Detection, Fine-Grained Action Detection | Unverified | 0 |
| Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis | May 14, 2024 | 4k, GPU | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | Unverified | 0 |
| Global Motion Understanding in Large-Scale Video Object Segmentation | May 11, 2024 | Instance Segmentation, Optical Flow Estimation | Unverified | 0 |
| RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | May 11, 2024 | Image-text matching, Retrieval | Unverified | 0 |
| A Survey on Backbones for Deep Video Action Recognition | May 9, 2024 | Action Recognition, Diversity | Unverified | 0 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | May 7, 2024 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| Snippet-Aware Transformer With Multiple Action Elements for Skeleton-Based Action Segmentation | May 6, 2024 | Action Segmentation, Skeleton Based Action Segmentation | Code Available | 0 |
| WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | May 6, 2024 | Multiple-choice, Video Understanding | Unverified | 0 |
| How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | May 6, 2024 | Autonomous Vehicles, Video Understanding | Unverified | 0 |
| Learning text-to-video retrieval from image captioning | Apr 26, 2024 | Image Captioning, Image Retrieval | Unverified | 0 |
| Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting | Apr 26, 2024 | Facial Expression Recognition, Multi-Task Learning | Unverified | 0 |
| IPAD: Industrial Process Anomaly Detection Dataset | Apr 23, 2024 | Anomaly Detection, Video Anomaly Detection | Unverified | 0 |
| From Image to Video, what do we need in multimodal LLMs? | Apr 18, 2024 | Video Understanding | Unverified | 0 |
| In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition | Apr 14, 2024 | Action Recognition, Hand Pose Estimation | Code Available | 0 |
| A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos | Apr 10, 2024 | Activity Recognition, Gaze Prediction | Unverified | 0 |