| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Sep 27, 2024 | Video Understanding, Visual Reasoning | Code Available | 1 |
| EAGLE: Egocentric AGgregated Language-video Engine | Sep 26, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| LLM4Brain: Training a Large Language Model for Brain Video Understanding | Sep 26, 2024 | Domain Adaptation, Language Modeling | Unverified | 0 |
| E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | Sep 26, 2024 | Question Answering, Video Understanding | Code Available | 2 |
| Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP | Sep 23, 2024 | Image Generation, Question Answering | Unverified | 0 |
| Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding | Sep 22, 2024 | Anomaly Detection, GPU | Code Available | 4 |
| Towards Child-Inclusive Clinical Video Understanding for Autism Spectrum Disorder | Sep 20, 2024 | Activity Recognition, Diagnostic | Unverified | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | Sep 20, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Interpretable Action Recognition on Hard to Classify Actions | Sep 19, 2024 | Action Recognition, Depth Estimation | Unverified | 0 |
| AMEGO: Active Memory from long EGOcentric videos | Sep 17, 2024 | Video Understanding | Unverified | 0 |
| SoccerNet 2024 Challenges Results | Sep 16, 2024 | Action Spotting, Dense Video Captioning | Code Available | 0 |
| HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions | Sep 16, 2024 | Dimensionality Reduction, Video Understanding | Unverified | 0 |
| Enhancing Long Video Understanding via Hierarchical Event-Based Memory | Sep 10, 2024 | Video Understanding | Unverified | 0 |
| VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery | Sep 7, 2024 | Computational Efficiency, Contrastive Learning | Unverified | 0 |
| TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations | Sep 5, 2024 | Causal Inference, Position | Unverified | 0 |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture | Sep 4, 2024 | GPU, Mamba | Code Available | 3 |
| VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges | Sep 2, 2024 | GPU, MVBench | Unverified | 0 |
| StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models | Aug 31, 2024 | Video Understanding | Unverified | 0 |
| Streamlining Forest Wildfire Surveillance: AI-Enhanced UAVs Utilizing the FLAME Aerial Video Dataset for Lightweight and Efficient Monitoring | Aug 31, 2024 | Disaster Response, Video Understanding | Unverified | 0 |
| DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning | Aug 29, 2024 | Multi-Task Learning, Prompt Learning | Unverified | 0 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Aug 29, 2024 | MM-Vet, MVBench | Code Available | 9 |
| Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input | Aug 28, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos | Aug 26, 2024 | Large Language Model, MVBench | Code Available | 2 |
| Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification | Aug 26, 2024 | Video Classification, Video Understanding | Unverified | 0 |
| LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models | Aug 26, 2024 | Large Language Model, Video Quality Assessment | Code Available | 0 |
| LongVILA: Scaling Long-Context Visual Language Models for Long Videos | Aug 19, 2024 | Video Captioning, Video Question Answering | Unverified | 0 |
| Flatten: Video Action Recognition is an Image Classification task | Aug 17, 2024 | Action Recognition, image-classification | Unverified | 0 |
| Disentangle and denoise: Tackling context misalignment for video moment retrieval | Aug 14, 2024 | Denoising, Disentanglement | Unverified | 0 |
| HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Aug 12, 2024 | Action Localization, Temporal Action Localization | Code Available | 1 |
| Spherical World-Locking for Audio-Visual Localization in Egocentric Videos | Aug 9, 2024 | Active Speaker Localization, Decoder | Unverified | 0 |
| VideoQA in the Era of LLMs: An Empirical Study | Aug 8, 2024 | Multimodal Large Language Model, Video Question Answering | Code Available | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 |
| FE-Adapter: Adapting Image-based Emotion Classifiers to Videos | Aug 5, 2024 | Dynamic Facial Expression Recognition, Emotion Recognition | Unverified | 0 |
| COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Aug 5, 2024 | Dense Video Captioning, Diversity | Code Available | 1 |
| Multimodal Fusion and Coherence Modeling for Video Topic Segmentation | Aug 1, 2024 | Contrastive Learning, Mixture-of-Experts | Unverified | 0 |
| Segment Anything for Videos: A Systematic Survey | Jul 31, 2024 | Image Segmentation, Robot Manipulation Generalization | Code Available | 5 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Adversarial Robustness in RGB-Skeleton Action Recognition: Leveraging Attention Modality Reweighter | Jul 29, 2024 | Action Recognition, Adversarial Robustness | Unverified | 0 |
| Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation | Jul 28, 2024 | Video Understanding | Unverified | 0 |
| Wolf: Captioning Everything with a World Summarization Framework | Jul 26, 2024 | Autonomous Driving, Mixture-of-Experts | Unverified | 0 |
| Harnessing Temporal Causality for Advanced Temporal Action Detection | Jul 25, 2024 | Action Detection, Action Recognition | Code Available | 3 |
| EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Jul 23, 2024 | Re-Ranking, Retrieval | Code Available | 1 |
| LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | Jul 22, 2024 | Multiple-choice, Question Answering | Code Available | 2 |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | Jul 22, 2024 | Language Modeling | Code Available | 3 |
| Audio-visual training for improved grounding in video-text LLMs | Jul 21, 2024 | Video Understanding | Unverified | 0 |
| Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data | Jul 18, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Goldfish: Vision-Language Understanding of Arbitrarily Long Videos | Jul 17, 2024 | Retrieval, Video Understanding | Code Available | 4 |
| Open Vocabulary Multi-Label Video Classification | Jul 12, 2024 | Action Classification, Classification | Unverified | 0 |
| Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Jul 11, 2024 | EEG, Language Modeling | Code Available | 1 |
| VideoMamba: Spatio-Temporal Selective State Space Model | Jul 11, 2024 | Mamba, model | Code Available | 1 |