| I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Nov 17, 2022 | Image CaptioningQuestion Answering | CodeCode Available | 1 | 5 |
| AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors | Oct 26, 2023 | DeepFake DetectionFace Swapping | CodeCode Available | 1 | 5 |
| A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering | Nov 13, 2023 | Decision MakingExplanation Generation | CodeCode Available | 1 | 5 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MMEVisual Question Answering | CodeCode Available | 1 | 5 |
| Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models | Dec 15, 2023 | Image CaptioningIn-Context Learning | CodeCode Available | 1 | 5 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Jul 13, 2021 | Question AnsweringVision and Language Navigation | CodeCode Available | 1 | 5 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal ReasoningQuestion Answering | CodeCode Available | 1 | 5 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context LearningQuestion Answering | CodeCode Available | 1 | 5 |
| Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering | Jun 29, 2023 | Answer GenerationQuestion Answering | CodeCode Available | 1 | 5 |
| A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports | Sep 3, 2020 | Image-text RetrievalMedical Visual Question Answering | CodeCode Available | 1 | 5 |