| VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information | Dec 1, 2024 | Multiple-choice | Code Available | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | Benchmarking, Earth Observation | Code Available | 1 |
| All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Nov 25, 2024 | All, Long Question Answer | Code Available | 1 |
| VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? | Nov 17, 2024 | Multiple-choice | Code Available | 1 |
| MEG: Medical Knowledge-Augmented Large Language Models for Question Answering | Nov 6, 2024 | Knowledge Graph Embeddings, Multiple-choice | Code Available | 1 |
| MILU: A Multi-task Indic Language Understanding Benchmark | Nov 4, 2024 | Multiple-choice, Question Answering | Code Available | 1 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Oct 24, 2024 | Multiple-choice | Code Available | 1 |
| TimeSeriesExam: A time series understanding exam | Oct 18, 2024 | Anomaly Detection, Multiple-choice | Code Available | 1 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Oct 16, 2024 | Benchmarking, Fairness | Code Available | 1 |
| MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Oct 14, 2024 | Multiple-choice | Code Available | 1 |