| Video-Language Alignment via Spatio-Temporal Graph Transformer | Jul 16, 2024 | Contrastive LearningQuestion Answering | CodeCode Available | 1 |
| LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | Jul 10, 2024 | Video Question AnsweringZero-Shot Video Question Answer | CodeCode Available | 7 |
| VDMA: Video Question Answering with Dynamically Generated Multi-Agents | Jul 4, 2024 | EgoSchemaQuestion Answering | —Unverified | 0 |
| MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning | Jul 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 0 |
| Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering | Jul 3, 2024 | Contrastive LearningLanguage Modelling | —Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | ArticlesImage Comprehension | CodeCode Available | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection | Jul 3, 2024 | Data CompressionManagement | —Unverified | 0 |
| Referring Atomic Video Action Recognition | Jul 2, 2024 | Action LocalizationAction Recognition | CodeCode Available | 1 |
| The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA | Jul 2, 2024 | Grounded Video Question AnsweringObject Tracking | —Unverified | 0 |
| Hierarchical Memory for Long Video QA | Jun 30, 2024 | GPUQuestion Answering | —Unverified | 0 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | Jun 30, 2024 | Video CaptioningVideo Description | CodeCode Available | 4 |
| Zero-Shot Long-Form Video Understanding through Screenplay | Jun 25, 2024 | FormQuestion Answering | —Unverified | 0 |
| Long Context Transfer from Language to Vision | Jun 24, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | CodeCode Available | 1 |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Jun 19, 2024 | Question AnsweringSpatial Reasoning | CodeCode Available | 1 |
| VoCo-LLaMA: Towards Vision Compression with Large Language Models | Jun 18, 2024 | Computational EfficiencyQuestion Answering | CodeCode Available | 3 |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | Jun 17, 2024 | GPULanguage Modeling | —Unverified | 0 |
| ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO | Jun 17, 2024 | Language ModellingQuestion Answering | CodeCode Available | 2 |
| Task Me Anything | Jun 17, 2024 | 2kAttribute | CodeCode Available | 2 |
| Hallucination Mitigation Prompts Long-term Video Understanding | Jun 17, 2024 | Answer GenerationHallucination | CodeCode Available | 0 |
| Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | Jun 13, 2024 | AllEgoSchema | CodeCode Available | 1 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video CaptioningMVBench | CodeCode Available | 3 |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Jun 12, 2024 | cross-modal alignmentLanguage Modelling | CodeCode Available | 3 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Jun 11, 2024 | Multiple-choiceQuestion Answering | CodeCode Available | 5 |
| Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering | Jun 2, 2024 | counterfactualCounterfactual Reasoning | CodeCode Available | 1 |