SOTAVerified|Agents Browse Leaderboard About Blog

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 176–200 of 1149 papers

Title	Date	Tasks	Status	Hype
ViSpeak: Visual Instruction Feedback in Streaming Videos	Mar 17, 2025	Streaming video understandingVideo Understanding	CodeCode Available	2
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning	Mar 17, 2025	Grounded Video Question AnsweringQuestion Answering	CodeCode Available	3
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory	Mar 17, 2025	FormGPU	—Unverified	0
Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition	Mar 17, 2025	Action RecognitionVideo Recognition	—Unverified	0
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding	Mar 17, 2025	AttributeMME	—Unverified	0
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?	Mar 16, 2025	Language ModelingLanguage Modelling	CodeCode Available	1
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding	Mar 16, 2025	Video Understanding	CodeCode Available	2
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding	Mar 14, 2025	DenoisingDense Video Captioning	—Unverified	0
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs	Mar 14, 2025	Video Understanding	—Unverified	0
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers	Mar 14, 2025	GPUMamba	—Unverified	0
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning	Mar 14, 2025	BenchmarkingRelational Reasoning	—Unverified	0
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs	Mar 13, 2025	BenchmarkingQuestion Answering	—Unverified	0
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing	Mar 13, 2025	EgoSchemaForm	CodeCode Available	0
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents	Mar 13, 2025	Computational EfficiencyOptical Character Recognition (OCR)	—Unverified	0
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation	Mar 12, 2025	Allcounterfactual	—Unverified	0
On the Limitations of Vision-Language Models in Understanding Image Transforms	Mar 12, 2025	Question AnsweringVideo Generation	—Unverified	0
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization	Mar 12, 2025	Temporal LocalizationVideo Understanding	—Unverified	0
FaVChat: Unlocking Fine-Grained Facail Video Understanding with Multimodal Large Language Models	Mar 12, 2025	Mixture-of-ExpertsQuestion Answering	—Unverified	0
Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment	Mar 12, 2025	Automatic Speech RecognitionAutomatic Speech Recognition (ASR)	—Unverified	0
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers	Mar 12, 2025	GPUStreaming video understanding	—Unverified	0
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary	Mar 12, 2025	EgoSchemaRetrieval	CodeCode Available	4
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding	Mar 12, 2025	Instruction FollowingVideo Understanding	—Unverified	0
Generative Frame Sampler for Long Video Understanding	Mar 12, 2025	Video Understanding	—Unverified	0
Memory-enhanced Retrieval Augmentation for Long Video Understanding	Mar 12, 2025	RAGRetrieval	—Unverified	0
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension	Mar 11, 2025	AutoMLDecoder	CodeCode Available	2

Show:10 25 50

← PrevPage 8 of 46Next →

No leaderboard results yet.