SOTAVerified

Video Understanding

A crucial task in Video Understanding is to recognise and localise, in space and time, the different actions or events that appear in a video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 301-350 of 1149 papers

Title | Status | Hype
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Code | 1
ETAD: Training Action Detection End to End on a Laptop | Code | 1
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Code | 1
EPIC Fields: Marrying 3D Geometry and Video Understanding | Code | 1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Code | 1
A Simple LLM Framework for Long-Range Video Question-Answering | Code | 1
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation | Code | 1
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Code | 1
Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning | Code | 1
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Code | 1
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Code | 1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Code | 1
MM-VID: Advancing Video Understanding with GPT-4V(ision) | Code | 1
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions | Code | 1
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness | Code | 1
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | Code | 1
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions | Code | 1
Lightweight Network Architecture for Real-Time Action Recognition | Code | 1
End-to-End Video Instance Segmentation with Transformers | Code | 1
Object-Region Video Transformers | Code | 1
Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection | Code | 1
Leveraging triplet loss for unsupervised action segmentation | Code | 1
Localizing Moments in Long Video Via Multimodal Guidance | Code | 1
Learning Temporally Latent Causal Processes from General Temporal Data | Code | 1
End-to-end Temporal Action Detection with Transformer | Code | 1
Learning the Predictability of the Future | Code | 1
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads | Code | 1
End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning | Code | 1
Open-Vocabulary Video Relation Extraction | Code | 1
End-to-End Referring Video Object Segmentation with Multimodal Transformers | Code | 1
Panoramic Vision Transformer for Saliency Detection in 360° Videos | Code | 1
Learning Temporally Causal Latent Processes from General Temporal Data | Code | 1
PAVE: Patching and Adapting Video Large Language Models | Code | 1
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Code | 1
Learning Optical Flow with Adaptive Graph Reasoning | Code | 1
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Code | 1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Code | 1
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Code | 1
An overview on the evaluated video retrieval tasks at TRECVID 2022 | Code | 1
Procedure-Aware Pretraining for Instructional Video Understanding | Code | 1
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection | Code | 1
Language Repository for Long Video Understanding | Code | 1
Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Code | 1
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Code | 1
A Comprehensive Study of Deep Video Action Recognition | Code | 1
Elaborative Rehearsal for Zero-shot Action Recognition | Code | 1
Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions | Code | 1
FrameExit: Conditional Early Exiting for Efficient Video Recognition | Code | 1
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Code | 1

No leaderboard results yet.