SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 651-700 of 1149 papers

| Title | Status | Hype |
| --- | --- | --- |
| Judging a video by its bitstream cover | Code | 0 |
| SoccerNet 2023 Challenges Results | Code | 1 |
| CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction | Code | 1 |
| Spherical Vision Transformer for 360-degree Video Saliency Prediction | Code | 1 |
| Motion-Guided Masking for Spatiotemporal Representation Learning | | 0 |
| MOFO: MOtion FOcused Self-Supervision for Video Understanding | Code | 0 |
| Are current long-term video understanding datasets long-term? | Code | 0 |
| Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Code | 1 |
| Audio-Visual Glance Network for Efficient Video Recognition | | 0 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Code | 1 |
| Temporally-Adaptive Models for Efficient Video Understanding | | 0 |
| M^3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition | | 0 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Code | 2 |
| DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation | | 0 |
| A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | Code | 2 |
| Multimodal Distillation for Egocentric Action Recognition | Code | 1 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | | 0 |
| HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding | Code | 0 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | Code | 1 |
| VideoGLUE: Video General Understanding Evaluation of Foundation Models | | 0 |
| ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models | Code | 0 |
| Temporal Action Proposal Generation With Action Frequency Adaptive Network | Code | 0 |
| An overview on the evaluated video retrieval tasks at TRECVID 2022 | Code | 1 |
| Multi-Granularity Hand Action Detection | Code | 1 |
| Learning Space-Time Semantic Correspondences | | 0 |
| EPIC Fields: Marrying 3D Geometry and Video Understanding | Code | 1 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Code | 2 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Code | 3 |
| Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment | | 0 |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Code | 2 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Code | 4 |
| MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning | | 0 |
| Teacher Agent: A Knowledge Distillation-Free Framework for Rehearsal-based Video Incremental Learning | Code | 0 |
| Action Sensitivity Learning for Temporal Action Localization | | 0 |
| VideoLLM: Modeling Video Sequence with Large Language Models | Code | 1 |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | Code | 0 |
| Learning Higher-order Object Interactions for Keypoint-based Video Understanding | | 0 |
| Vehicle Detection and Classification without Residual Calculation: Accelerating HEVC Image Decoding with Random Perturbation Injection | | 0 |
| Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach | Code | 1 |
| VideoChat: Chat-Centric Video Understanding | Code | 4 |
| MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Code | 1 |
| Event-Free Moving Object Segmentation from Moving Ego Vehicle | Code | 1 |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | | 0 |
| MRSN: Multi-Relation Support Network for Video Action Detection | | 0 |
| Search-Map-Search: A Frame Selection Paradigm for Action Recognition | | 0 |
| LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision | | 0 |
| Leveraging triplet loss for unsupervised action segmentation | Code | 1 |
| Therbligs in Action: Video Understanding through Motion Primitives | | 0 |
| SVT: Supertoken Video Transformer for Efficient Video Understanding | | 0 |

No leaderboard results yet.