| EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta | Dec 31, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | Dec 5, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Towards Conversational AI for Disease Management | Mar 8, 2025 | Clinical Knowledge, Diagnostic | Unverified | 0 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| Towards Decision Support Technology Platform for Modular Systems | Aug 23, 2014 | Clustering, Combinatorial Optimization | Unverified | 0 |
| Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | Jun 8, 2025 | Multiple-choice | Unverified | 0 |
| Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | Jan 28, 2024 | Knowledge Graphs, Medical Diagnosis | Unverified | 0 |
| Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset | Nov 14, 2023 | Answer Selection, Information Retrieval | Unverified | 0 |
| Evaluating Machine Reading Systems through Comprehension Tests | May 1, 2012 | Answer Selection, Multiple-choice | Unverified | 0 |
| Evaluating multiple large language models in pediatric ophthalmology | Nov 7, 2023 | Multiple-choice | Unverified | 0 |