| VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information | Dec 1, 2024 | Multiple-choice | Code Available | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | Benchmarking, Earth Observation | Code Available | 1 |
| All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Nov 25, 2024 | All, Long Question Answer | Code Available | 1 |
| VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? | Nov 17, 2024 | Multiple-choice | Code Available | 1 |
| MEG: Medical Knowledge-Augmented Large Language Models for Question Answering | Nov 6, 2024 | Knowledge Graph Embeddings, Multiple-choice | Code Available | 1 |
| MILU: A Multi-task Indic Language Understanding Benchmark | Nov 4, 2024 | Multiple-choice, Question Answering | Code Available | 1 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Oct 24, 2024 | Multiple-choice | Code Available | 1 |
| TimeSeriesExam: A time series understanding exam | Oct 18, 2024 | Anomaly Detection, Multiple-choice | Code Available | 1 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Oct 16, 2024 | Benchmarking, Fairness | Code Available | 1 |
| MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Oct 14, 2024 | Multiple-choice | Code Available | 1 |
| Taming Overconfidence in LLMs: Reward Calibration in RLHF | Oct 13, 2024 | Multiple-choice | Code Available | 1 |
| SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models | Oct 11, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 1 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Oct 2, 2024 | Benchmarking, Instruction Following | Code Available | 1 |
| A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Oct 1, 2024 | Common Sense Reasoning, DeepFake Detection | Code Available | 1 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| Annealed Winner-Takes-All for Motion Forecasting | Sep 17, 2024 | All, Autonomous Driving | Code Available | 1 |
| Training on the Benchmark Is Not All You Need | Sep 3, 2024 | All, Multiple-choice | Code Available | 1 |
| TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering | Aug 27, 2024 | Multiple-choice, Protein Folding | Code Available | 1 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Aug 23, 2024 | Disentanglement, Knowledge Tracing | Code Available | 1 |
| LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs | Aug 16, 2024 | Instruction Following, Multiple-choice | Code Available | 1 |
| Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing | Jul 22, 2024 | All, Diversity | Code Available | 1 |
| Evaluating language models as risk scores | Jul 19, 2024 | Multiple-choice, Question Answering | Code Available | 1 |
| TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Jul 17, 2024 | Math, Multiple-choice | Code Available | 1 |
| Fine-tuning Multimodal Large Language Models for Product Bundling | Jul 16, 2024 | In-Context Learning, Multiple-choice | Code Available | 1 |
| Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models | Jul 15, 2024 | Backdoor Attack, Multiple-choice | Code Available | 1 |
| ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Jul 8, 2024 | Anomaly Detection, Code Generation | Code Available | 1 |
| LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Jul 6, 2024 | Logical Reasoning, Mathematical Reasoning | Code Available | 1 |
| MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | Jun 29, 2024 | Multiple-choice | Code Available | 1 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | Benchmarking, Classification | Code Available | 1 |
| FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Jun 16, 2024 | Diversity, Multiple-choice | Code Available | 1 |
| CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Jun 15, 2024 | Domain Adaptation, Language Modeling | Code Available | 1 |
| IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Jun 14, 2024 | Multiple-choice, Question Answering | Code Available | 1 |
| BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Jun 14, 2024 | Multiple-choice | Code Available | 1 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choice, Visual Reasoning | Code Available | 1 |
| MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | Jun 13, 2024 | Multiple-choice, Scene Understanding | Code Available | 1 |
| A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Jun 8, 2024 | Descriptive, Language Modelling | Code Available | 1 |
| TopViewRS: Vision-Language Models as Top-View Spatial Reasoners | Jun 4, 2024 | Multiple-choice, Spatial Reasoning | Code Available | 1 |
| Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning | May 22, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 1 |
| Multiple-Choice Questions are Efficient and Robust LLM Evaluators | May 20, 2024 | GSM8K, HumanEval | Code Available | 1 |
| SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | May 14, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models | May 8, 2024 | Attribute, Data Augmentation | Code Available | 1 |
| Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Apr 30, 2024 | Implicatures, Multiple-choice | Code Available | 1 |
| Latxa: An Open Language Model and Evaluation Suite for Basque | Mar 29, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Non-Linear Inference Time Intervention: Improving LLM Truthfulness | Mar 27, 2024 | Large Language Model, Multiple-choice | Code Available | 1 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning, In-Context Learning | Code Available | 1 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge Graphs, Multiple-choice | Code Available | 1 |
| Unfamiliar Finetuning Examples Control How Language Models Hallucinate | Mar 8, 2024 | MMLU, Multiple-choice | Code Available | 1 |
| To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering | Mar 4, 2024 | MedQA, MMLU | Code Available | 1 |