| VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens | Jan 1, 2024 | Hallucination, Position | Unverified | 0 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchema, Language Modelling | Code Available | 1 |
| LingoQA: Visual Question Answering for Autonomous Driving | Dec 21, 2023 | Autonomous Driving, Decision Making | Code Available | 2 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | Dec 20, 2023 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| Cross-Modal Reasoning with Event Correlation for Video Question Answering | Dec 20, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Text-Conditioned Resampler For Long Form Video Understanding | Dec 19, 2023 | EgoSchema, Form | Unverified | 0 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioning, video narration captioning | Code Available | 1 |
| ViLA: Efficient Video-Language Alignment for Video Question Answering | Dec 13, 2023 | cross-modal alignment, Language Modeling | Code Available | 1 |
| Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | Dec 12, 2023 | Hallucination, Position | Unverified | 0 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 |
| Retrieval-based Video Language Model for Efficient Long Video Question Answering | Dec 8, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Dec 8, 2023 | Form, Question Answering | Unverified | 0 |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | Dec 4, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| Zero-Shot Video Question Answering with Procedural Programs | Dec 1, 2023 | Code Generation, Language Modeling | Unverified | 0 |
| RTQ: Rethinking Video-language Understanding Based on Image-text Model | Dec 1, 2023 | Video Captioning, Video Question Answering | Code Available | 1 |
| VTimeLLM: Empower LLM to Grasp Video Moments | Nov 30, 2023 | Dense Video Captioning, Temporal Relation Extraction | Code Available | 2 |
| E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer | Nov 28, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Nov 28, 2023 | Image Captioning, Question Answering | Code Available | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 |
| Characterizing Video Question Answering with Sparsified Inputs | Nov 27, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | Unverified | 0 |
| AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering | Nov 25, 2023 | Question Answering, Video Question Answering | Code Available | 1 |
| Vamos: Versatile Action Models for Video Understanding | Nov 22, 2023 | EgoSchema, Hard Attention | Code Available | 0 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Nov 16, 2023 | Language Modeling, Language Modelling | Code Available | 4 |