| Multiple Choice Learning for Efficient Speech Separation with Many Speakers | Nov 27, 2024 | Multiple-choice, Speech Separation | Unverified | 0 |
| NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects? | Nov 26, 2024 | Attribute, Multiple-choice | Unverified | 0 |
| SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | Nov 25, 2024 | Language Modeling | Unverified | 0 |
| GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | Nov 25, 2024 | Medical Visual Question Answering, Multiple-choice | Unverified | 0 |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | Nov 23, 2024 | Language Modeling | Unverified | 0 |
| VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | Nov 20, 2024 | Chatbot, Multiple-choice | Unverified | 0 |
| Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | Nov 18, 2024 | Logical Reasoning, Multiple-choice | Unverified | 0 |
| A Benchmark for Long-Form Medical Question Answering | Nov 14, 2024 | Answer Generation, Form | Code Available | 0 |
| DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Nov 14, 2024 | Form, Hallucination | Code Available | 0 |
| TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Nov 13, 2024 | Decision Making, Missing Values | Code Available | 0 |