| xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs | Oct 21, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering | Oct 12, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | Oct 10, 2024 | Conformal Prediction, Language Modeling | Unverified | 0 |
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | Oct 9, 2024 | Audio captioning, Large Language Model | Unverified | 0 |
| Enhancing Temporal Modeling of Video LLMs via Time Gating | Oct 8, 2024 | MVBench, Question Answering | Code Available | 0 |
| ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition | Oct 8, 2024 | Action Recognition, Multiple-choice | Unverified | 0 |
| Frame-Voyager: Learning to Query Frames for Video Large Language Models | Oct 4, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Video Instruction Tuning With Synthetic Data | Oct 3, 2024 | 3D Question Answering (3D-QA) | Unverified | 0 |
| Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding | Sep 29, 2024 | Diversity, Question Answering | Unverified | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | Sep 20, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment | Sep 17, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems | Sep 14, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Multi-object event graph representation learning for Video Question Answering | Sep 12, 2024 | Contrastive Learning, Graph Representation Learning | Unverified | 0 |
| Top-down Activity Representation Learning for Video Question Answering | Sep 12, 2024 | Question Answering, Representation Learning | Unverified | 0 |
| Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models | Aug 22, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| LongVILA: Scaling Long-Context Visual Language Models for Long Videos | Aug 19, 2024 | Video Captioning, Video Question Answering | Code Available | 0 |
| LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning | Aug 15, 2024 | Answer Generation, Question-Answer-Generation | Unverified | 0 |
| Continuous Perception Benchmark | Aug 15, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VideoQA in the Era of LLMs: An Empirical Study | Aug 8, 2024 | Multimodal Large Language Model, Video Question Answering | Code Available | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 |
| Causal Understanding For Video Question Answering | Jul 23, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling | Jul 21, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VDMA: Video Question Answering with Dynamically Generated Multi-Agents | Jul 4, 2024 | EgoSchema, Question Answering | Unverified | 0 |
| MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning | Jul 4, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection | Jul 3, 2024 | Data Compression, Management | Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | Code Available | 0 |
| Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering | Jul 3, 2024 | Contrastive Learning, Language Modelling | Unverified | 0 |
| The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA | Jul 2, 2024 | Grounded Video Question Answering, Object Tracking | Unverified | 0 |
| Hierarchical Memory for Long Video QA | Jun 30, 2024 | GPU, Question Answering | Unverified | 0 |
| Zero-Shot Long-Form Video Understanding through Screenplay | Jun 25, 2024 | Form, Question Answering | Unverified | 0 |
| Hallucination Mitigation Prompts Long-term Video Understanding | Jun 17, 2024 | Answer Generation, Hallucination | Code Available | 0 |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | Jun 17, 2024 | GPU, Language Modeling | Unverified | 0 |
| Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera | May 30, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Backpropagation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration | May 21, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VideoQA-SC: Adaptive Semantic Communication for Video Question Answering | May 17, 2024 | Question Answering, Semantic Communication | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | Unverified | 0 |
| Capabilities of Gemini Models in Medicine | Apr 29, 2024 | In-Context Learning, MedQA | Unverified | 0 |
| Pegasus-v1 Technical Report | Apr 23, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Listen Then See: Video Alignment with Speaker Attention | Apr 21, 2024 | cross-modal alignment, Question Answering | Code Available | 0 |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | Apr 18, 2024 | GSM8K, MMLU | Unverified | 0 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Apr 11, 2024 | Descriptive, Hallucination | Code Available | 0 |
| MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | Apr 9, 2024 | EgoSchema, Multiple-choice | Unverified | 0 |
| Koala: Key frame-conditioned long video-LLM | Apr 5, 2024 | Action Recognition, Question Answering | Unverified | 0 |
| Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering | Apr 5, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VideoDistill: Language-aware Vision Distillation for Video Question Answering | Apr 1, 2024 | Answer Generation, Question Answering | Unverified | 0 |
| Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels | Mar 21, 2024 | Multi-Label Classification, MUlTI-LABEL-ClASSIFICATION | Unverified | 0 |
| LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs | Feb 21, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Slot-VLM: SlowFast Slots for Video-Language Modeling | Feb 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| VideoPrism: A Foundational Visual Encoder for Video Understanding | Feb 20, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Feb 16, 2024 | Language Modeling, Language Modelling | Code Available | 0 |