| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Jan 17, 2025 | FairnessMultiple-choice | CodeCode Available | 1 |
| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Dec 12, 2024 | HallucinationKnowledge Graph Completion | CodeCode Available | 1 |
| Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Mar 29, 2023 | Multiple-choice | CodeCode Available | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical ReasoningMultiple-choice | CodeCode Available | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | BenchmarkingClassification | CodeCode Available | 1 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | BenchmarkingMultiple-choice | CodeCode Available | 1 |
| An MRC Framework for Semantic Role Labeling | Sep 14, 2021 | Computational EfficiencyMachine Reading Comprehension | CodeCode Available | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 |
| Evaluating language models as risk scores | Jul 19, 2024 | Multiple-choiceQuestion Answering | CodeCode Available | 1 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Aug 23, 2024 | DisentanglementKnowledge Tracing | CodeCode Available | 1 |
| A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Jun 8, 2024 | DescriptiveLanguage Modelling | CodeCode Available | 1 |
| BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Oct 8, 2023 | Distractor GenerationLanguage Modelling | CodeCode Available | 1 |
| Annealed Winner-Takes-All for Motion Forecasting | Sep 17, 2024 | AllAutonomous Driving | CodeCode Available | 1 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 |
| An Open Source Data Contamination Report for Large Language Models | Oct 26, 2023 | HellaSwagLanguage Modeling | CodeCode Available | 1 |
| From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap | Apr 13, 2020 | Dialogue State TrackingMachine Reading Comprehension | CodeCode Available | 1 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model EvaluationLanguage Modeling | CodeCode Available | 1 |
| GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities | Jan 11, 2023 | Multiple-choice | CodeCode Available | 1 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Oct 5, 2023 | Common Sense ReasoningMultiple-choice | CodeCode Available | 1 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | CodeCode Available | 1 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choiceVideo Understanding | CodeCode Available | 1 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Dec 7, 2023 | MathMultiple-choice | CodeCode Available | 1 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | EthicsMultiple-choice | CodeCode Available | 1 |
| EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Oct 12, 2022 | Distractor GenerationMultiple-choice | CodeCode Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | DiagnosticEgoSchema | CodeCode Available | 1 |