| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Jan 17, 2025 | Fairness, Multiple-choice | Code Available | 1 | 5 |
| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Dec 12, 2024 | Hallucination, Knowledge Graph Completion | Code Available | 1 | 5 |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Oct 2, 2023 | In-Context Learning, Instruction Following | Code Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 | 5 |
| AdaLoGN: Adaptive Logic Graph Network for Reasoning-Based Machine Reading Comprehension | Mar 16, 2022 | Logical Reasoning, Machine Reading Comprehension | Code Available | 1 | 5 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| Evaluating language models as risk scores | Jul 19, 2024 | Multiple-choice, Question Answering | Code Available | 1 | 5 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical Reasoning, Multiple-choice | Code Available | 1 | 5 |
| GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | May 16, 2025 | Instruction Following, Multiple-choice | Code Available | 1 | 5 |
| BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Jun 14, 2024 | Multiple-choice | Code Available | 1 | 5 |