| AutoMCQ -- Automatically Generate Code Comprehension Questions using GenAI | May 22, 2025 | Multiple-choice | Unverified | 0 |
| KoBALT: Korean Benchmark For Advanced Linguistic Tasks | May 22, 2025 | Multiple-choice | Unverified | 0 |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | May 21, 2025 | Dataset Generation, Descriptive | Unverified | 0 |
| Set-LLM: A Permutation-Invariant LLM | May 21, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | May 21, 2025 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Unverified | 0 |
| Uncovering Cultural Representation Disparities in Vision-Language Models | May 20, 2025 | Multiple-choice | Unverified | 0 |
| WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications | May 20, 2025 | Mathematical Reasoning, Multiple-choice | Unverified | 0 |
| MR. Judge: Multimodal Reasoner as a Judge | May 19, 2025 | MM-Vet, Multiple-choice | Unverified | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | May 19, 2025 | Benchmarking, Legal Reasoning | Unverified | 0 |
| Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches | May 18, 2025 | Fairness, Memorization | Code Available | 0 |