Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 701–750 of 1149 papers

Title	Date	Tasks	Status
Rethinking Image-to-Video Adaptation: An Object-centric Perspective	Jul 9, 2024	Action Recognition, Object	Unverified
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model	Jul 9, 2024	Video Understanding	Code Available
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding	Jul 6, 2024	Video Understanding	Unverified
KeyVideoLLM: Towards Large-scale Video Keyframe Selection	Jul 3, 2024	Data Compression, Management	Unverified
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	Jul 3, 2024	Articles, Image Comprehension	Unverified
https://arxiv.org/abs/2407.00634	Jul 2, 2024	Video Captioning, Video Description	Code Available
Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs	Jul 2, 2024	Video Understanding	Unverified
Zero-Shot Long-Form Video Understanding through Screenplay	Jun 25, 2024	Form, Question Answering	Unverified
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models	Jun 24, 2024	Hallucination, Video Understanding	Unverified
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models	Jun 22, 2024	Diversity, Language Modeling	Code Available
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding	Jun 20, 2024	Form, Video Understanding	Unverified
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset	Jun 19, 2024	Language Modeling, Language Modelling	Unverified
GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement	Jun 19, 2024	Video Understanding	Unverified
DrVideo: Document Retrieval Based Long Video Understanding	Jun 18, 2024	document understanding, EgoSchema	Unverified
Hallucination Mitigation Prompts Long-term Video Understanding	Jun 17, 2024	Answer Generation, Hallucination	Code Available
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment	Jun 16, 2024	Action Understanding, Benchmarking	Unverified
Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model	Jun 15, 2024	Question Answering, Video Understanding	Code Available
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding	Jun 14, 2024	Activity Recognition, MMR total	Unverified
Localizing Events in Videos with Multimodal Queries	Jun 14, 2024	Natural Language Queries, Video Understanding	Unverified
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living	Jun 13, 2024	Benchmarking, Human-Object Interaction Detection	Unverified
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models	Jun 12, 2024	Video Understanding	Unverified
MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD	Jun 11, 2024	Video Recognition, Video Understanding	Unverified
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation	Jun 8, 2024	Benchmarking, Instance Segmentation	Unverified
Semantic Segmentation on VSPW Dataset through Masked Video Consistency	Jun 7, 2024	Semantic Segmentation, Video Understanding	Unverified
3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation	Jun 6, 2024	Panoptic Segmentation, Segmentation	Unverified
Contrastive Language Video Time Pre-training	Jun 4, 2024	Action Recognition, Contrastive Learning	Unverified
2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation	Jun 1, 2024	Autonomous Driving, Panoptic Segmentation	Unverified
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model	Jun 1, 2024	Action Recognition, Activity Recognition	Unverified
Temporal Grounding of Activities using Multimodal Large Language Models	May 30, 2024	Video Understanding	Unverified
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning	May 28, 2024	Decision Making, Video Understanding	Unverified
Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions	May 28, 2024	Action Recognition, Video Recognition	Unverified
Streaming Long Video Understanding with Large Language Models	May 25, 2024	Question Answering, Video Understanding	Unverified
MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models	May 23, 2024	Action Recognition, Action Segmentation	Unverified
Anticipating Object State Changes in Long Procedural Videos	May 21, 2024	Object, Object State Change Classification	Unverified
Open-Vocabulary Spatio-Temporal Action Detection	May 17, 2024	Action Detection, Fine-Grained Action Detection	Unverified
Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis	May 14, 2024	4k, GPU	Unverified
CinePile: A Long Video Question Answering Dataset and Benchmark	May 14, 2024	Form, Human-Object Interaction Detection	Unverified
Global Motion Understanding in Large-Scale Video Object Segmentation	May 11, 2024	Instance Segmentation, Optical Flow Estimation	Unverified
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning	May 11, 2024	Image-text matching, Retrieval	Unverified
A Survey on Backbones for Deep Video Action Recognition	May 9, 2024	Action Recognition, Diversity	Unverified
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition	May 7, 2024	Large Language Model, Multimodal Large Language Model	Unverified
Snippet-Aware Transformer With Multiple Action Elements for Skeleton-Based Action Segmentation	May 6, 2024	Action Segmentation, Skeleton Based Action Segmentation	Code Available
WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning	May 6, 2024	Multiple-choice, Video Understanding	Unverified
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs	May 6, 2024	Autonomous Vehicles, Video Understanding	Unverified
Learning text-to-video retrieval from image captioning	Apr 26, 2024	Image Captioning, Image Retrieval	Unverified
Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting	Apr 26, 2024	Facial Expression Recognition, Multi-Task Learning	Unverified
IPAD: Industrial Process Anomaly Detection Dataset	Apr 23, 2024	Anomaly Detection, Video Anomaly Detection	Unverified
From Image to Video, what do we need in multimodal LLMs?	Apr 18, 2024	Video Understanding	Unverified
In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition	Apr 14, 2024	Action Recognition, Hand Pose Estimation	Code Available
A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos	Apr 10, 2024	Activity Recognition, Gaze Prediction	Unverified

