| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | May 24, 2024 | HallucinationImage Comprehension | CodeCode Available | 2 |
| MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Apr 7, 2024 | Image ComprehensionMath | CodeCode Available | 0 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Mar 27, 2024 | Image ClassificationImage Comprehension | CodeCode Available | 7 |
| Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models | Feb 13, 2024 | Image ComprehensionMultimodal Recommendation | —Unverified | 0 |
| EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | Jan 30, 2024 | Image ComprehensionInstruction Following | CodeCode Available | 2 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | BenchmarkingImage Comprehension | —Unverified | 0 |
| SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition | Jan 18, 2024 | Audio-Visual Speech RecognitionAutomatic Speech Recognition | —Unverified | 0 |
| Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine | Jan 16, 2024 | DiagnosticImage Comprehension | —Unverified | 0 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Jan 5, 2024 | Image ComprehensionImage to text | CodeCode Available | 0 |
| GeoLocator: a location-integrated large multimodal model for inferring geo-privacy | Nov 21, 2023 | Image Comprehension | —Unverified | 0 |