| Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | Jan 28, 2024 | Knowledge Graphs, Medical Diagnosis | Unverified | 0 |
| Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset | Nov 14, 2023 | Answer Selection, Information Retrieval | Unverified | 0 |
| Evaluating Machine Reading Systems through Comprehension Tests | May 1, 2012 | Answer Selection, Multiple-choice | Unverified | 0 |
| First Token Probability Guided RAG for Telecom Question Answering | Jan 11, 2025 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Unverified | 0 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| Evaluating Nuanced Bias in Large Language Model Free Response Answers | Jul 11, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles | Sep 23, 2021 | Multiple-choice, Question Answering | Unverified | 0 |
| Evaluating Question Answering Evaluation | Nov 1, 2019 | Answer Generation, Multiple-choice | Unverified | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | Feb 12, 2025 | Multiple-choice, Survey | Unverified | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | Dec 5, 2024 | Language Modeling, Language Modelling | Unverified | 0 |