Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 351–400 of 1149 papers

Title	Date	Tasks	Status	Hype
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	Nov 17, 2024	MVBench, Video-based Generative Performance Benchmarking	Code Available	1
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models	Nov 16, 2024	Hallucination, Video Generation	Unverified	0
Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks?	Nov 13, 2024	Action Localization, Temporal Action Localization	Unverified	0
EVQAScore: Efficient Video Question Answering Data Evaluation	Nov 11, 2024	Keyword Extraction, Question Answering	Unverified	0
Video RWKV:Video Action Recognition Based RWKV	Nov 8, 2024	Action Recognition, Representation Learning	Unverified	0
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding	Nov 6, 2024	Image Comprehension, Streaming video understanding	Code Available	2
Personalized Video Summarization by Multimodal Video Understanding	Nov 5, 2024	Unsupervised Video Summarization, Video Summarization	Unverified	0
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	Nov 4, 2024	Caption Generation, Multiple-choice	Code Available	2
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation	Oct 31, 2024	Action Segmentation, Action Understanding	Code Available	1
Video Token Merging for Long-form Video Understanding	Oct 31, 2024	Form, Video Classification	Unverified	0
Situational Scene Graph for Structured Human-centric Situation Understanding	Oct 30, 2024	Graph Generation, Predicate Classification	Code Available	0
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models	Oct 30, 2024	Video Understanding	Code Available	1
Zero-Shot Action Recognition in Surveillance Videos	Oct 28, 2024	Action Recognition, Video Understanding	Unverified	0
Egocentric and Exocentric Methods: A Short Survey	Oct 27, 2024	Action Recognition, Survey	Unverified	0
Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning	Oct 26, 2024	Video Understanding	Unverified	0
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	Oct 25, 2024	EgoSchema, Hallucination	Code Available	2
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks	Oct 24, 2024	Video Understanding	Code Available	1
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark	Oct 24, 2024	document understanding, Video Understanding	Code Available	1
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	Oct 22, 2024	Token Reduction, Video Question Answering	Code Available	3
ContextDet: Temporal Action Detection with Adaptive Context Aggregation	Oct 20, 2024	Action Detection, Video Understanding	Unverified	0
EVA: An Embodied World Model for Future Video Anticipation	Oct 20, 2024	Language Modeling, Language Modelling	Unverified	0
FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning	Oct 20, 2024	Diagnostic, Video Captioning	Unverified	0
Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling	Oct 19, 2024	Video Understanding	Unverified	0
Zero-shot Action Localization via the Confidence of Large Vision-Language Models	Oct 18, 2024	Action Localization, Language Modelling	Unverified	0
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models	Oct 15, 2024	Video Understanding	Unverified	0
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI	Oct 15, 2024	Question Answering, Video Question Answering	Code Available	2
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models	Oct 14, 2024	2k, Benchmarking	Code Available	1
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs	Oct 14, 2024	Computational Efficiency, Question Answering	Code Available	2
ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification	Oct 13, 2024	Contrastive Learning, Person Re-Identification	Unverified	0
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering	Oct 12, 2024	Question Answering, Video Question Answering	Unverified	0
VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding	Oct 11, 2024	Hallucination, Moment Retrieval	Code Available	1
TVBench: Redesigning Video-Language Evaluation	Oct 10, 2024	Multiple-choice, Open-Ended Question Answering	Unverified	0
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization	Oct 9, 2024	Audio captioning, Large Language Model	Unverified	0
MM-Ego: Towards Building Egocentric Multimodal LLMs	Oct 9, 2024	Video Understanding	Unverified	0
Enhancing Temporal Modeling of Video LLMs via Time Gating	Oct 8, 2024	MVBench, Question Answering	Code Available	0
TRACE: Temporal Grounding Video LLM via Causal Event Modeling	Oct 8, 2024	Text Generation, Video Understanding	Code Available	2
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference	Oct 6, 2024	Language Modeling, Language Modelling	Code Available	3
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark	Oct 4, 2024	Image Captioning, Video Understanding	Unverified	0
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models	Oct 4, 2024	Dense Video Captioning, Sentence	Code Available	2
Frame-Voyager: Learning to Query Frames for Video Large Language Models	Oct 4, 2024	Question Answering, Video Question Answering	Unverified	0
AirLetters: An Open Video Dataset of Characters Drawn in the Air	Oct 3, 2024	Video Understanding	Unverified	0
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM	Oct 3, 2024	Object Tracking, Video Understanding	Unverified	0
Deep learning for action spotting in association football videos	Oct 2, 2024	Action Spotting, Benchmarking	Unverified	0
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark	Oct 2, 2024	Unusual Activity Localization, Video Understanding	Code Available	0
ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding	Oct 1, 2024	Contrastive Learning, Hallucination	Code Available	0
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs	Sep 30, 2024	Benchmarking, Multiple-choice	Unverified	0
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning	Sep 30, 2024	Mixture-of-Experts, Optical Character Recognition (OCR)	Unverified	0
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs	Sep 30, 2024	EgoSchema, Language Modelling	Code Available	1
Visual Context Window Extension: A New Perspective for Long Video Understanding	Sep 30, 2024	Video Understanding	Unverified	0
Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks	Sep 27, 2024	Action Detection, Action Segmentation	Unverified	0