| Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Dec 27, 2020 | counterfactualMultiple-choice | CodeCode Available | 1 | 5 |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Jan 17, 2025 | FairnessMultiple-choice | CodeCode Available | 1 | 5 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 | 5 |
| All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Nov 25, 2024 | AllLong Question Answer | CodeCode Available | 1 | 5 |
| BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Oct 8, 2023 | Distractor GenerationLanguage Modelling | CodeCode Available | 1 | 5 |
| Bridging Video-text Retrieval with Multiple Choice Questions | Jan 13, 2022 | Action RecognitionLinear evaluation | CodeCode Available | 1 | 5 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Oct 5, 2023 | Common Sense ReasoningMultiple-choice | CodeCode Available | 1 | 5 |
| Evaluating the Knowledge Dependency of Questions | Nov 21, 2022 | Multiple-choice | CodeCode Available | 1 | 5 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model EvaluationLanguage Modeling | CodeCode Available | 1 | 5 |
| GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | May 16, 2025 | Instruction FollowingMultiple-choice | CodeCode Available | 1 | 5 |