| My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals | May 26, 2025 | EthicsFairness | —Unverified | 0 |
| Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions | May 26, 2025 | Multiple-choice | —Unverified | 0 |
| CP-Router: An Uncertainty-Aware Router Between LLM and LRM | May 26, 2025 | Conformal PredictionLogical Reasoning | —Unverified | 0 |
| DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response | May 26, 2025 | Multiple-choice | —Unverified | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | May 25, 2025 | General KnowledgeMMLU | CodeCode Available | 0 |
| Enhancing LLMs' Reasoning-Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning | May 24, 2025 | Multiple-choicePrompt Engineering | —Unverified | 0 |
| Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | May 23, 2025 | Automatic Speech RecognitionAutomatic Speech Recognition (ASR) | CodeCode Available | 1 |
| KoBALT: Korean Benchmark For Advanced Linguistic Tasks | May 22, 2025 | Multiple-choice | —Unverified | 0 |
| Collaboration among Multiple Large Language Models for Medical Question Answering | May 22, 2025 | Medical Question AnsweringMultiple-choice | —Unverified | 0 |
| AutoMCQ -- Automatically Generate Code Comprehension Questions using GenAI | May 22, 2025 | Multiple-choice | —Unverified | 0 |