| Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | Jun 6, 2024 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| Analyzing the Performance of ChatGPT in Cardiology and Vascular Pathologies | Apr 15, 2023 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | May 9, 2025 | BenchmarkingForm | —Unverified | 0 |
| HFL-RC System at SemEval-2018 Task 11: Hybrid Multi-Aspects Model for Commonsense Reading Comprehension | Mar 15, 2018 | Multiple-choiceReading Comprehension | —Unverified | 0 |
| Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation | Jan 12, 2025 | AttributeMultiple-choice | —Unverified | 0 |
| HindiLLM: Large Language Model for Hindi | Dec 29, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Analyzing Multiple-Choice Reading and Listening Comprehension Tests | Jul 3, 2023 | Multiple-choiceReading Comprehension | —Unverified | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4kLanguage Modeling | —Unverified | 0 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| How Many Workers to Ask? Adaptive Exploration for Collecting High Quality Labels | Nov 1, 2014 | Multiple-choice | —Unverified | 0 |