| Paper | Date | Tags | Code | # |
| --- | --- | --- | --- | --- |
| Multiple-Choice Questions are Efficient and Robust LLM Evaluators | May 20, 2024 | GSM8K, HumanEval | Code Available | 1 |
| SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | May 14, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models | May 8, 2024 | Attribute, Data Augmentation | Code Available | 1 |
| Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Apr 30, 2024 | Implicatures, Multiple-choice | Code Available | 1 |
| Latxa: An Open Language Model and Evaluation Suite for Basque | Mar 29, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Non-Linear Inference Time Intervention: Improving LLM Truthfulness | Mar 27, 2024 | Large Language Model, Multiple-choice | Code Available | 1 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning, In-Context Learning | Code Available | 1 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge Graphs, Multiple-choice | Code Available | 1 |
| Unfamiliar Finetuning Examples Control How Language Models Hallucinate | Mar 8, 2024 | MMLU, Multiple-choice | Code Available | 1 |
| To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering | Mar 4, 2024 | MedQA, MMLU | Code Available | 1 |