| Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | Jan 28, 2024 | Knowledge Graphs, Medical Diagnosis | Unverified | 0 |
| Evaluating Machine Reading Systems through Comprehension Tests | May 1, 2012 | Answer Selection, Multiple-choice | Unverified | 0 |
| Evaluating Nuanced Bias in Large Language Model Free Response Answers | Jul 11, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education | Oct 18, 2023 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Unverified | 0 |
| Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents | Apr 5, 2024 | Multiple-choice, Navigate | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | Unverified | 0 |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | Jun 9, 2025 | Descriptive, Form | Unverified | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | Mar 13, 2025 | Large Language Model, Math | Unverified | 0 |
| Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | Dec 16, 2024 | Multiple-choice | Unverified | 0 |
| AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning | May 16, 2024 | Multiple-choice, Question Answering | Unverified | 0 |