| Improved Baselines with Visual Instruction Tuning | Oct 5, 2023 | Factual Inconsistency Detection in Chart Captioning; Image Classification | Code Available | 6 |
| Visual Instruction Tuning | Apr 17, 2023 | 1 Image, 2*2 Stitching; 3D Question Answering (3D-QA) | Code Available | 6 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering; Image Captioning | Code Available | 4 |
| MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning | Jun 13, 2024 | Instruction Following; Math | Code Available | 3 |
| GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding | Nov 16, 2024 | Instruction Following; Language Modeling | Code Available | 2 |
| MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding | Jul 6, 2024 | Articles; Instruction Following | Code Available | 2 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | May 9, 2024 | Image Captioning; Instruction Following | Code Available | 2 |
| Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models | Mar 19, 2024 | Instruction Following; visual instruction following | Code Available | 2 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | May 11, 2023 | 1 Image, 2*2 Stitching; Diversity | Code Available | 2 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model; Video Understanding | Code Available | 1 |
| Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? | Nov 29, 2023 | In-Context Learning; Instruction Following | Code Available | 1 |
| Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models | Feb 17, 2025 | Instruction Following; visual instruction following | —Unverified | 0 |
| MpoxVLM: A Vision-Language Model for Diagnosing Skin Lesions from Mpox Virus Infection | Nov 16, 2024 | Diagnostic; Instruction Following | Code Available | 0 |
| M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation | Aug 29, 2024 | Instruction Following; Medical Report Generation | —Unverified | 0 |
| Space-LLaVA: a Vision-Language Model Adapted to Extraterrestrial Applications | Aug 12, 2024 | Instruction Following; Language Modeling | —Unverified | 0 |
| LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition | Jul 9, 2024 | Instruction Following; Representation Learning | —Unverified | 0 |
| Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification | Jul 2, 2024 | Claim Verification; Hallucination | —Unverified | 0 |
| Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags | Jun 16, 2024 | Image to text; Instruction Following | —Unverified | 0 |
| FaceGPT: Self-supervised Learning to Chat about 3D Human Faces | Jun 11, 2024 | 3D Face Reconstruction; Face Model | —Unverified | 0 |
| Joint Embeddings for Graph Instruction Tuning | May 31, 2024 | Instruction Following; visual instruction following | —Unverified | 0 |
| Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation | May 27, 2024 | Instruction Following; Language Modeling | —Unverified | 0 |
| Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning | May 16, 2024 | Decision Making; Instruction Following | —Unverified | 0 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Nov 21, 2023 | Descriptive; MME | Code Available | 0 |
| Instruction Clarification Requests in Multimodal Collaborative Dialogue Games: Tasks, and an Analysis of the CoDraw Dataset | Feb 28, 2023 | Instruction Following; visual instruction following | Code Available | 0 |