| Answering Diverse Questions via Text Attached with Key Audio-Visual Clues | Mar 11, 2024 | Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models | Mar 10, 2024 | Visual Question Answering | Code Available | 3 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | Mar 8, 2024 | Chatbot; Language Modelling | Code Available | 7 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Mar 8, 2024 | 1 Image, 2*2 Stitching; Code Generation | Code Available | 3 |
| SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM | Mar 7, 2024 | Question Answering; Retrieval | —Unverified | 0 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 7, 2024 | Audio-Visual Question Answering (AVQA) | Code Available | 2 |
| Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Mar 6, 2024 | Multimodal Reasoning; Question Answering | Code Available | 2 |
| Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use | Mar 5, 2024 | Image Classification | —Unverified | 0 |
| MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting | Mar 5, 2024 | In-Context Learning; Object Rearrangement | —Unverified | 0 |
| CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments | Mar 5, 2024 | Language Modelling; Large Language Model | —Unverified | 0 |