Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 651–700 of 1149 papers

Title	Date	Tasks	Status
ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification	Oct 13, 2024	Contrastive Learning, Person Re-Identification	Unverified
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering	Oct 12, 2024	Question Answering, Video Question Answering	Unverified
TVBench: Redesigning Video-Language Evaluation	Oct 10, 2024	Multiple-choice, Open-Ended Question Answering	Unverified
MM-Ego: Towards Building Egocentric Multimodal LLMs	Oct 9, 2024	Video Understanding	Unverified
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization	Oct 9, 2024	Audio captioning, Large Language Model	Unverified
Enhancing Temporal Modeling of Video LLMs via Time Gating	Oct 8, 2024	MVBench, Question Answering	Code Available
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark	Oct 4, 2024	Image Captioning, Video Understanding	Unverified
Frame-Voyager: Learning to Query Frames for Video Large Language Models	Oct 4, 2024	Question Answering, Video Question Answering	Unverified
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM	Oct 3, 2024	Object Tracking, Video Understanding	Unverified
AirLetters: An Open Video Dataset of Characters Drawn in the Air	Oct 3, 2024	Video Understanding	Unverified
Deep learning for action spotting in association football videos	Oct 2, 2024	Action Spotting, Benchmarking	Unverified
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark	Oct 2, 2024	Unusual Activity Localization, Video Understanding	Code Available
ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding	Oct 1, 2024	Contrastive Learning, Hallucination	Code Available
Visual Context Window Extension: A New Perspective for Long Video Understanding	Sep 30, 2024	Video Understanding	Unverified
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning	Sep 30, 2024	Mixture-of-Experts, Optical Character Recognition (OCR)	Unverified
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs	Sep 30, 2024	Benchmarking, Multiple-choice	Unverified
Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks	Sep 27, 2024	Action Detection, Action Segmentation	Unverified
EAGLE: Egocentric AGgregated Language-video Engine	Sep 26, 2024	Action Recognition, Activity Recognition	Unverified
LLM4Brain: Training a Large Language Model for Brain Video Understanding	Sep 26, 2024	Domain Adaptation, Language Modeling	Unverified
Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP	Sep 23, 2024	Image Generation, Question Answering	Unverified
First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge	Sep 20, 2024	Multiple-choice, Question Answering	Unverified
Towards Child-Inclusive Clinical Video Understanding for Autism Spectrum Disorder	Sep 20, 2024	Activity Recognition, Diagnostic	Unverified
Interpretable Action Recognition on Hard to Classify Actions	Sep 19, 2024	Action Recognition, Depth Estimation	Unverified
AMEGO: Active Memory from long EGOcentric videos	Sep 17, 2024	Video Understanding	Unverified
HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions	Sep 16, 2024	Dimensionality Reduction, Video Understanding	Unverified
SoccerNet 2024 Challenges Results	Sep 16, 2024	Action Spotting, Dense Video Captioning	Code Available
Enhancing Long Video Understanding via Hierarchical Event-Based Memory	Sep 10, 2024	Video Understanding	Unverified
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery	Sep 7, 2024	Computational Efficiency, Contrastive Learning	Unverified
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations	Sep 5, 2024	Causal Inference, Position	Unverified
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges	Sep 2, 2024	GPU, MVBench	Unverified
StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models	Aug 31, 2024	Video Understanding	Unverified
Streamlining Forest Wildfire Surveillance: AI-Enhanced UAVs Utilizing the FLAME Aerial Video Dataset for Lightweight and Efficient Monitoring	Aug 31, 2024	Disaster Response, Video Understanding	Unverified
DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning	Aug 29, 2024	Multi-Task Learning, Prompt Learning	Unverified
Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input	Aug 28, 2024	Language Modeling, Language Modelling	Unverified
Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification	Aug 26, 2024	Video Classification, Video Understanding	Unverified
LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models	Aug 26, 2024	Large Language Model, Video Quality Assessment	Code Available
LongVILA: Scaling Long-Context Visual Language Models for Long Videos	Aug 19, 2024	Video Captioning, Video Question Answering	Unverified
Flatten: Video Action Recognition is an Image Classification task	Aug 17, 2024	Action Recognition, image-classification	Unverified
Disentangle and denoise: Tackling context misalignment for video moment retrieval	Aug 14, 2024	Denoising, Disentanglement	Unverified
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos	Aug 9, 2024	Active Speaker Localization, Decoder	Unverified
VideoQA in the Era of LLMs: An Empirical Study	Aug 8, 2024	Multimodal Large Language Model, Video Question Answering	Code Available
LLaVA-OneVision: Easy Visual Task Transfer	Aug 6, 2024	3D Question Answering (3D-QA)	Code Available
FE-Adapter: Adapting Image-based Emotion Classifiers to Videos	Aug 5, 2024	Dynamic Facial Expression Recognition, Emotion Recognition	Unverified
Multimodal Fusion and Coherence Modeling for Video Topic Segmentation	Aug 1, 2024	Contrastive Learning, Mixture-of-Experts	Unverified
Adversarial Robustness in RGB-Skeleton Action Recognition: Leveraging Attention Modality Reweighter	Jul 29, 2024	Action Recognition, Adversarial Robustness	Unverified
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation	Jul 28, 2024	Video Understanding	Unverified
Wolf: Captioning Everything with a World Summarization Framework	Jul 26, 2024	Autonomous Driving, Mixture-of-Experts	Unverified
Audio-visual training for improved grounding in video-text LLMs	Jul 21, 2024	Video Understanding	Unverified
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data	Jul 18, 2024	Language Modelling, Large Language Model	Unverified
Open Vocabulary Multi-Label Video Classification	Jul 12, 2024	Action Classification, Classification	Unverified
