Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 601–650 of 1149 papers

Title	Date	Tasks	Status
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding	Dec 16, 2024	Hallucination, Multiple-choice	Unverified
Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track	Dec 15, 2024	Image Captioning, Medical Question Answering	Unverified
IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs	Dec 13, 2024	Question Answering, Video Question Answering	Unverified
B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens	Dec 13, 2024	Language Modeling, Language Modelling	Code Available
Apollo: An Exploration of Video Understanding in Large Multimodal Models	Dec 13, 2024	MME, Video MME	Unverified
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models	Dec 12, 2024	Video Understanding	Unverified
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation	Dec 12, 2024	Phrase Grounding, Question Answering	Unverified
VCA: Video Curious Agent for Long Video Understanding	Dec 12, 2024	Video Understanding	Unverified
COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework	Dec 11, 2024	GPU, Language Modeling	Unverified
Multi-Scale Contrastive Learning for Video Temporal Grounding	Dec 10, 2024	Contrastive Learning, Data Augmentation	Unverified
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning	Dec 10, 2024	cross-modal alignment, Video Understanding	Unverified
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark	Dec 10, 2024	Autonomous Navigation, Spatial Reasoning	Unverified
Towards Long Video Understanding via Fine-detailed Video Story Generation	Dec 9, 2024	Story Generation, Video Understanding	Unverified
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	Dec 6, 2024	document understanding, Hallucination	Unverified
Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model	Dec 6, 2024	EgoSchema, Language Modeling	Unverified
Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection	Dec 6, 2024	GPU, Multi-Object Tracking	Unverified
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding	Dec 4, 2024	Hallucination, Instruction Following	Unverified
Streaming Detection of Queried Event Start	Dec 4, 2024	Autonomous Driving, parameter-efficient fine-tuning	Code Available
Progress-Aware Video Frame Captioning	Dec 3, 2024	Image Captioning, Video Captioning	Unverified
SEAL: Semantic Attention Learning for Long Video Representation	Dec 2, 2024	Diversity, Question Answering	Unverified
VideoSAVi: Self-Aligned Video Language Models without Human Supervision	Dec 1, 2024	EgoSchema, MVBench	Unverified
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation	Dec 1, 2024	Instruction Following, Video Understanding	Unverified
Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing	Nov 29, 2024	All, Form	Unverified
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training	Nov 29, 2024	Question Answering, Video Understanding	Unverified
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark	Nov 29, 2024	Benchmarking, Grounded Video Question Answering	Unverified
SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context	Nov 25, 2024	Large Language Model, MME	Unverified
OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions	Nov 24, 2024	Action Classification, Action Recognition	Code Available
ReWind: Understanding Long Videos with Instructed Learnable Memory	Nov 23, 2024	Large Language Model, Question Answering	Unverified
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding	Nov 21, 2024	Computational Efficiency, Video Understanding	Unverified
Extending Video Masked Autoencoders to 128 frames	Nov 20, 2024	Decoder, Video Understanding	Unverified
Principles of Visual Tokens for Efficient Video Understanding	Nov 20, 2024	Video Understanding	Unverified
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation	Nov 20, 2024	Chatbot, Multiple-choice	Unverified
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding	Nov 19, 2024	Question Answering, Video Understanding	Unverified
AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction	Nov 19, 2024	GPU, Question Answering	Unverified
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models	Nov 16, 2024	Hallucination, Video Generation	Unverified
Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks?	Nov 13, 2024	Action Localization, Temporal Action Localization	Unverified
EVQAScore: Efficient Video Question Answering Data Evaluation	Nov 11, 2024	Keyword Extraction, Question Answering	Unverified
Video RWKV: Video Action Recognition Based RWKV	Nov 8, 2024	Action Recognition, Representation Learning	Unverified
Personalized Video Summarization by Multimodal Video Understanding	Nov 5, 2024	Unsupervised Video Summarization, Video Summarization	Unverified
Video Token Merging for Long-form Video Understanding	Oct 31, 2024	Form, Video Classification	Unverified
Situational Scene Graph for Structured Human-centric Situation Understanding	Oct 30, 2024	Graph Generation, Predicate Classification	Code Available
Zero-Shot Action Recognition in Surveillance Videos	Oct 28, 2024	Action Recognition, Video Understanding	Unverified
Egocentric and Exocentric Methods: A Short Survey	Oct 27, 2024	Action Recognition, Survey	Unverified
Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning	Oct 26, 2024	Video Understanding	Unverified
EVA: An Embodied World Model for Future Video Anticipation	Oct 20, 2024	Language Modeling, Language Modelling	Unverified
ContextDet: Temporal Action Detection with Adaptive Context Aggregation	Oct 20, 2024	Action Detection, Video Understanding	Unverified
FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning	Oct 20, 2024	Diagnostic, Video Captioning	Unverified
Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling	Oct 19, 2024	Video Understanding	Unverified
Zero-shot Action Localization via the Confidence of Large Vision-Language Models	Oct 18, 2024	Action Localization, Language Modelling	Unverified
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models	Oct 15, 2024	Video Understanding	Unverified
