| Transliteration: A Simple Technique For Improving Multilingual Language Modeling | Sep 29, 2021 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4 | Dec 20, 2022 | Multiple-choice | —Unverified | 0 | 0 |
| GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering | Dec 5, 2024 | Information Retrieval, Multiple-choice | —Unverified | 0 | 0 |
| GraphITE: Estimating Individual Effects of Graph-structured Treatments | Sep 29, 2020 | counterfactual, Decision Making | —Unverified | 0 | 0 |
| Graph-Structured Representations for Visual Question Answering | Sep 19, 2016 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing | Apr 18, 2024 | Hallucination, Multiple-choice | —Unverified | 0 | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | Jun 2, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| HANS, are you clever? Clever Hans Effect Analysis of Neural Systems | Sep 21, 2023 | Decision Making, Multiple-choice | —Unverified | 0 | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | Jan 26, 2025 | MMLU, Multiple-choice | —Unverified | 0 | 0 |
| HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | Dec 13, 2024 | GPU, Multiple-choice | —Unverified | 0 | 0 |