| Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | May 22, 2025 | Multiple-choice, Visual Question Answering (VQA) | Code Available | 1 |
| Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | May 21, 2025 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Unverified | 0 |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | May 21, 2025 | Dataset Generation, Descriptive | Unverified | 0 |
| Set-LLM: A Permutation-Invariant LLM | May 21, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Uncovering Cultural Representation Disparities in Vision-Language Models | May 20, 2025 | Multiple-choice | Unverified | 0 |
| WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications | May 20, 2025 | Mathematical Reasoning, Multiple-choice | Unverified | 0 |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | May 20, 2025 | MME, Multiple-choice | Code Available | 4 |
| MR. Judge: Multimodal Reasoner as a Judge | May 19, 2025 | MM-Vet, Multiple-choice | Unverified | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | May 19, 2025 | Benchmarking, Legal Reasoning | Unverified | 0 |
| Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches | May 18, 2025 | Fairness, Memorization | Code Available | 0 |
| LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | May 18, 2025 | Logical Reasoning, Multimodal Reasoning | Code Available | 1 |
| IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | May 16, 2025 | Multiple-choice | Code Available | 1 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | May 16, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training | May 16, 2025 | Multiple-choice, text-classification | Unverified | 0 |
| GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | May 16, 2025 | Instruction Following, Multiple-choice | Code Available | 1 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal Retrieval, Diagnostic | Code Available | 2 |
| Ranked Voting based Self-Consistency of Large Language Models | May 16, 2025 | Multiple-choice, Open-Ended Question Answering | Code Available | 1 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | May 15, 2025 | Informativeness, Multiple-choice | Unverified | 0 |
| The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think | May 15, 2025 | Multiple-choice | Unverified | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | May 14, 2025 | Benchmarking, MMLU | Unverified | 0 |
| SafePath: Conformal Prediction for Safe LLM-Based Autonomous Navigation | May 14, 2025 | Autonomous Driving, Autonomous Navigation | Unverified | 0 |
| Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | May 13, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | Benchmarking, Multiple-choice | Code Available | 1 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | May 13, 2025 | Form, Multiple-choice | Code Available | 0 |
| HealthBench: Evaluating Large Language Models Towards Improved Human Health | May 13, 2025 | Instruction Following, Multiple-choice | Code Available | 7 |
| How well do LLMs reason over tabular data, really? | May 12, 2025 | Missing Values, Multiple-choice | Unverified | 0 |
| Assessing the Chemical Intelligence of Large Language Models | May 12, 2025 | Multiple-choice | Code Available | 1 |
| Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted | May 9, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | May 9, 2025 | Benchmarking, Form | Unverified | 0 |
| EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | May 7, 2025 | Multiple-choice, Question Answering | Code Available | 2 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 |
| ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant | May 6, 2025 | Descriptive, Multiple-choice | Code Available | 0 |
| Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text | May 5, 2025 | Multiple-choice | Unverified | 0 |
| Unlearning vs. Obfuscation: Are We Truly Removing Knowledge? | May 5, 2025 | Multiple-choice | Unverified | 0 |
| LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load | May 4, 2025 | Articles, Multiple-choice | Unverified | 0 |
| LookAlike: Consistent Distractor Generation in Math MCQs | May 3, 2025 | Distractor Generation, Math | Unverified | 0 |
| Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors | May 2, 2025 | High School Physics, Misconceptions | Code Available | 0 |
| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | May 2, 2025 | Multiple-choice | Unverified | 0 |
| SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning | Apr 22, 2025 | Multiple-choice, reinforcement-learning | Unverified | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | Apr 21, 2025 | Math, MMLU | Unverified | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | Apr 20, 2025 | Descriptive, Ethics | Unverified | 0 |
| Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment | Apr 19, 2025 | Classification, Multiple-choice | Unverified | 0 |
| DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain | Apr 18, 2025 | Multiple-choice | Unverified | 0 |
| D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model | Apr 18, 2025 | Distractor Generation, Multiple-choice | Unverified | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | Apr 15, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | Apr 14, 2025 | Management, Multiple-choice | Unverified | 0 |
| Large Language Models Could Be Rote Learners | Apr 11, 2025 | Memorization, MMLU | Unverified | 0 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Apr 9, 2025 | Multiple-choice | Code Available | 0 |
| ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Apr 7, 2025 | Chart Question Answering, Chart Understanding | Code Available | 1 |