| ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification | Oct 13, 2024 | Contrastive Learning, Person Re-Identification | Unverified | 0 |
| Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering | Oct 12, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| TVBench: Redesigning Video-Language Evaluation | Oct 10, 2024 | Multiple-choice, Open-Ended Question Answering | Unverified | 0 |
| MM-Ego: Towards Building Egocentric Multimodal LLMs | Oct 9, 2024 | Video Understanding | Unverified | 0 |
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | Oct 9, 2024 | Audio captioning, Large Language Model | Unverified | 0 |
| Enhancing Temporal Modeling of Video LLMs via Time Gating | Oct 8, 2024 | MVBench, Question Answering | Code Available | 0 |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | Oct 4, 2024 | Image Captioning, Video Understanding | Unverified | 0 |
| Frame-Voyager: Learning to Query Frames for Video Large Language Models | Oct 4, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM | Oct 3, 2024 | Object Tracking, Video Understanding | Unverified | 0 |
| AirLetters: An Open Video Dataset of Characters Drawn in the Air | Oct 3, 2024 | Video Understanding | Unverified | 0 |
| Deep learning for action spotting in association football videos | Oct 2, 2024 | Action Spotting, Benchmarking | Unverified | 0 |
| UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark | Oct 2, 2024 | Unusual Activity Localization, Video Understanding | Code Available | 0 |
| ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding | Oct 1, 2024 | Contrastive Learning, Hallucination | Code Available | 0 |
| Visual Context Window Extension: A New Perspective for Long Video Understanding | Sep 30, 2024 | Video Understanding | Unverified | 0 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Sep 30, 2024 | Mixture-of-Experts, Optical Character Recognition (OCR) | Unverified | 0 |
| Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | Sep 30, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks | Sep 27, 2024 | Action Detection, Action Segmentation | Unverified | 0 |
| EAGLE: Egocentric AGgregated Language-video Engine | Sep 26, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| LLM4Brain: Training a Large Language Model for Brain Video Understanding | Sep 26, 2024 | Domain Adaptation, Language Modeling | Unverified | 0 |
| Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP | Sep 23, 2024 | Image Generation, Question Answering | Unverified | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | Sep 20, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Towards Child-Inclusive Clinical Video Understanding for Autism Spectrum Disorder | Sep 20, 2024 | Activity Recognition, Diagnostic | Unverified | 0 |
| Interpretable Action Recognition on Hard to Classify Actions | Sep 19, 2024 | Action Recognition, Depth Estimation | Unverified | 0 |
| AMEGO: Active Memory from long EGOcentric videos | Sep 17, 2024 | Video Understanding | Unverified | 0 |
| HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions | Sep 16, 2024 | Dimensionality Reduction, Video Understanding | Unverified | 0 |
| SoccerNet 2024 Challenges Results | Sep 16, 2024 | Action Spotting, Dense Video Captioning | Code Available | 0 |
| Enhancing Long Video Understanding via Hierarchical Event-Based Memory | Sep 10, 2024 | Video Understanding | Unverified | 0 |
| VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery | Sep 7, 2024 | Computational Efficiency, Contrastive Learning | Unverified | 0 |
| TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations | Sep 5, 2024 | Causal Inference, Position | Unverified | 0 |
| VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges | Sep 2, 2024 | GPU, MVBench | Unverified | 0 |
| StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models | Aug 31, 2024 | Video Understanding | Unverified | 0 |
| Streamlining Forest Wildfire Surveillance: AI-Enhanced UAVs Utilizing the FLAME Aerial Video Dataset for Lightweight and Efficient Monitoring | Aug 31, 2024 | Disaster Response, Video Understanding | Unverified | 0 |
| DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning | Aug 29, 2024 | Multi-Task Learning, Prompt Learning | Unverified | 0 |
| Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input | Aug 28, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification | Aug 26, 2024 | Video Classification, Video Understanding | Unverified | 0 |
| LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models | Aug 26, 2024 | Large Language Model, Video Quality Assessment | Code Available | 0 |
| LongVILA: Scaling Long-Context Visual Language Models for Long Videos | Aug 19, 2024 | Video Captioning, Video Question Answering | Unverified | 0 |
| Flatten: Video Action Recognition is an Image Classification task | Aug 17, 2024 | Action Recognition, image-classification | Unverified | 0 |
| Disentangle and denoise: Tackling context misalignment for video moment retrieval | Aug 14, 2024 | Denoising, Disentanglement | Unverified | 0 |
| Spherical World-Locking for Audio-Visual Localization in Egocentric Videos | Aug 9, 2024 | Active Speaker Localization, Decoder | Unverified | 0 |
| VideoQA in the Era of LLMs: An Empirical Study | Aug 8, 2024 | Multimodal Large Language Model, Video Question Answering | Code Available | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 |
| FE-Adapter: Adapting Image-based Emotion Classifiers to Videos | Aug 5, 2024 | Dynamic Facial Expression Recognition, Emotion Recognition | Unverified | 0 |
| Multimodal Fusion and Coherence Modeling for Video Topic Segmentation | Aug 1, 2024 | Contrastive Learning, Mixture-of-Experts | Unverified | 0 |
| Adversarial Robustness in RGB-Skeleton Action Recognition: Leveraging Attention Modality Reweighter | Jul 29, 2024 | Action Recognition, Adversarial Robustness | Unverified | 0 |
| Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation | Jul 28, 2024 | Video Understanding | Unverified | 0 |
| Wolf: Captioning Everything with a World Summarization Framework | Jul 26, 2024 | Autonomous Driving, Mixture-of-Experts | Unverified | 0 |
| Audio-visual training for improved grounding in video-text LLMs | Jul 21, 2024 | Video Understanding | Unverified | 0 |
| Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data | Jul 18, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Open Vocabulary Multi-Label Video Classification | Jul 12, 2024 | Action Classification, Classification | Unverified | 0 |