| Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera | May 30, 2024 | Question Answering; Video Question Answering | Unverified | 0 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | May 30, 2024 | Form; Question Answering | Code Available | 1 |
| Backpropagation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration | May 21, 2024 | Question Answering; Video Question Answering | Unverified | 0 |
| VideoQA-SC: Adaptive Semantic Communication for Video Question Answering | May 17, 2024 | Question Answering; Semantic Communication | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form; Human-Object Interaction Detection | Unverified | 0 |
| FreeVA: Offline MLLM as Training-Free Video Assistant | May 13, 2024 | Fairness; Question Answering | Code Available | 2 |
| Capabilities of Gemini Models in Medicine | Apr 29, 2024 | In-Context Learning; MedQA | Unverified | 0 |
| MovieChat+: Question-aware Sparse Memory for Long Video Question Answering | Apr 26, 2024 | 2k; Question Answering | Code Available | 4 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Apr 25, 2024 | Dense Captioning; MVBench | Code Available | 4 |
| Pegasus-v1 Technical Report | Apr 23, 2024 | Language Modeling; Language Modelling | Unverified | 0 |
| Listen Then See: Video Alignment with Speaker Attention | Apr 21, 2024 | cross-modal alignment; Question Answering | Code Available | 0 |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | Apr 18, 2024 | GSM8K; MMLU | Unverified | 0 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Apr 11, 2024 | Descriptive; Hallucination | Code Available | 0 |
| MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | Apr 9, 2024 | EgoSchema; Multiple-choice | Unverified | 0 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Apr 8, 2024 | GPU; Multiple-choice | Code Available | 3 |
| Koala: Key frame-conditioned long video-LLM | Apr 5, 2024 | Action Recognition; Question Answering | Unverified | 0 |
| Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering | Apr 5, 2024 | Question Answering; Video Question Answering | Unverified | 0 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Apr 4, 2024 | Language Modeling; Language Modelling | Code Available | 4 |
| LongVLM: Efficient Long Video Understanding via Large Language Models | Apr 4, 2024 | Question Answering; Video Question Answering | Code Available | 2 |
| CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Apr 1, 2024 | Causal Discovery; Causal Discovery in Video Reasoning | Code Available | 1 |
| VideoDistill: Language-aware Vision Distillation for Video Question Answering | Apr 1, 2024 | Answer Generation; Question Answering | Unverified | 0 |
| TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | Apr 1, 2024 | Question Answering; Video Question Answering | Code Available | 1 |
| Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | Apr 1, 2024 | Instruction Following; Language Modeling | Code Available | 2 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | Mar 30, 2024 | MVBench; Reading Comprehension | Code Available | 2 |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Mar 27, 2024 | Language Modeling; Language Modelling | Code Available | 2 |
| LITA: Language Instructed Temporal-Localization Assistant | Mar 27, 2024 | Instruction Following; Temporal Localization | Code Available | 2 |
| OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action Recognition; Decoder | Code Available | 2 |
| Elysium: Exploring Object-level Perception in Videos via MLLM | Mar 25, 2024 | Object; Object Tracking | Code Available | 2 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification; Action Recognition | Code Available | 7 |
| Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels | Mar 21, 2024 | Multi-Label Classification; MUlTI-LABEL-ClASSIFICATION | Unverified | 0 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Mar 20, 2024 | Action Recognition; Computational Efficiency | Code Available | 2 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | Mar 15, 2024 | Video Grounding; Video Question Answering | Code Available | 1 |
| DAM: Dynamic Adapter Merging for Continual Video QA Learning | Mar 13, 2024 | Continual Learning; image-classification | Code Available | 1 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Mar 8, 2024 | 1 Image, 2*2 Stitching; Code Generation | Code Available | 3 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Feb 25, 2024 | Computational Efficiency; Language Modelling | Code Available | 1 |
| LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs | Feb 21, 2024 | Question Answering; Video Question Answering | Unverified | 0 |
| Slot-VLM: SlowFast Slots for Video-Language Modeling | Feb 20, 2024 | Language Modeling; Language Modelling | Unverified | 0 |
| VideoPrism: A Foundational Visual Encoder for Video Understanding | Feb 20, 2024 | Question Answering; Video Question Answering | Unverified | 0 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Feb 16, 2024 | Language Modeling; Language Modelling | Code Available | 0 |
| BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind | Feb 12, 2024 | Question Answering; Video Question Answering | Unverified | 0 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Feb 8, 2024 | Benchmarking; Diversity | Code Available | 7 |
| CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | Feb 8, 2024 | Computational Efficiency; Multimodal Reasoning | Code Available | 2 |
| YTCommentQA: Video Question Answerability in Instructional Videos | Jan 30, 2024 | Question Answering; Video Question Answering | Code Available | 0 |
| Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering | Jan 19, 2024 | Question Answering; Video Question Answering | Code Available | 1 |
| STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering | Jan 8, 2024 | Question Answering; Video Question Answering | Code Available | 0 |
| Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering | Jan 3, 2024 | Question Answering; Scheduling | Unverified | 0 |
| Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports | Jan 3, 2024 | Action Understanding; counterfactual | Code Available | 1 |
| Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Jan 3, 2024 | Action Detection; Human-Object Interaction Detection | Code Available | 1 |
| On Scaling Up a Multilingual Vision and Language Model | Jan 1, 2024 | document understanding; In-Context Learning | Unverified | 0 |
| Language-aware Visual Semantic Distillation for Video Question Answering | Jan 1, 2024 | Answer Generation; Question Answering | Unverified | 0 |