SOTAVerified

Video Understanding

A crucial task in Video Understanding is to recognise and localise, in both space and time, the different actions or events appearing in a video.

Source: Action Detection from a Robot-Car Perspective
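The spatio-temporal localisation described above is commonly represented as an "action tube": a class label, a temporal interval, and a bounding box per frame. A minimal sketch of such a structure (the class name `ActionTube` and all values are illustrative, not taken from any listed paper):

```python
from dataclasses import dataclass, field

@dataclass
class ActionTube:
    """One detected action: what it is, when it occurs, and where in each frame."""
    label: str        # action class, e.g. "crossing road"
    start_frame: int  # temporal extent, inclusive
    end_frame: int
    # frame index -> (x1, y1, x2, y2) bounding box for the actor in that frame
    boxes: dict = field(default_factory=dict)

    def duration(self) -> int:
        """Number of frames the action spans."""
        return self.end_frame - self.start_frame + 1

# Hypothetical example: a pedestrian crossing between frames 10 and 12
tube = ActionTube(
    label="crossing road",
    start_frame=10,
    end_frame=12,
    boxes={10: (40, 60, 80, 180), 11: (44, 60, 84, 180), 12: (48, 60, 88, 180)},
)
```

A detector for this task would emit a list of such tubes per video; recognition fills in `label`, temporal localisation fills in the frame range, and spatial localisation fills in `boxes`.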

Papers

Showing 701-750 of 1149 papers

Title | Status | Hype
Rethinking Image-to-Video Adaptation: An Object-centric Perspective | - | 0
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model | Code | 0
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding | - | 0
KeyVideoLLM: Towards Large-scale Video Keyframe Selection | - | 0
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | - | 0
https://arxiv.org/abs/2407.00634 | Code | 0
Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs | - | 0
Zero-Shot Long-Form Video Understanding through Screenplay | - | 0
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | - | 0
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Code | 0
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding | - | 0
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | - | 0
GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement | - | 0
DrVideo: Document Retrieval Based Long Video Understanding | - | 0
Hallucination Mitigation Prompts Long-term Video Understanding | Code | 0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | - | 0
Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model | Code | 0
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | - | 0
Localizing Events in Videos with Multimodal Queries | - | 0
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | - | 0
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models | - | 0
MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD | - | 0
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | - | 0
Semantic Segmentation on VSPW Dataset through Masked Video Consistency | - | 0
3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | - | 0
Contrastive Language Video Time Pre-training | - | 0
2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | - | 0
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | - | 0
Temporal Grounding of Activities using Multimodal Large Language Models | - | 0
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | - | 0
Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions | - | 0
Streaming Long Video Understanding with Large Language Models | - | 0
MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | - | 0
Anticipating Object State Changes in Long Procedural Videos | - | 0
Open-Vocabulary Spatio-Temporal Action Detection | - | 0
Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis | - | 0
CinePile: A Long Video Question Answering Dataset and Benchmark | - | 0
Global Motion Understanding in Large-Scale Video Object Segmentation | - | 0
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | - | 0
A Survey on Backbones for Deep Video Action Recognition | - | 0
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | - | 0
Snippet-Aware Transformer With Multiple Action Elements for Skeleton-Based Action Segmentation | Code | 0
WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | - | 0
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | - | 0
Learning text-to-video retrieval from image captioning | - | 0
Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting | - | 0
IPAD: Industrial Process Anomaly Detection Dataset | - | 0
From Image to Video, what do we need in multimodal LLMs? | - | 0
In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition | Code | 0
A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos | - | 0
Page 15 of 23

No leaderboard results yet.